
Ask HN: How different is compute orchestration for AI?

1 point by alpb 16 hours ago

If there are folks here who work at LLM providers on compute/workload orchestration for training or inference, I'm curious what the state of the art in this area is.

I understand Kubernetes has a fair number of frameworks for training and serving, but I assume it's not the best tool for running large-scale GPU clusters (at least not out of the box). Many cloud providers have started offering ultrascale but low-pod-density Kubernetes clusters for this. I also assume orchestrators like Slurm are still around for these kinds of jobs, and I remember OpenAI trying to build their own orchestrator for training jobs.
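For concreteness, the "out of the box" level I mean is roughly this (a minimal sketch using the official Kubernetes Python client; it assumes the NVIDIA device plugin is installed so the nvidia.com/gpu extended resource exists, and the image and names are placeholders):

    # Sketch: ask Kubernetes for 8 GPUs on one pod. Assumes the NVIDIA
    # device plugin is installed; image/names are placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() in-cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="train-worker-0"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="trainer",
                    image="example.com/trainer:latest",  # placeholder
                    resources=client.V1ResourceRequirements(
                        # GPUs can only be requested via limits
                        limits={"nvidia.com/gpu": "8"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

Nothing at this level says which rack or which InfiniBand switch those eight GPUs land under, which is part of what makes me think plain Kubernetes isn't enough.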

I also assume spatial locality between servers and InfiniBand/RDMA matter a lot more than what Kubernetes natively supports, and the server health story must be completely different, since GPUs fail a lot more often and have a lot more interesting metrics to monitor on top of standard OS metrics.
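To illustrate the health side, this is roughly the kind of per-GPU telemetry I have in mind (a sketch via NVIDIA's NVML Python bindings; what production orchestrators actually collect, and how they act on it, is exactly what I'm asking about):

    # Sketch using NVML bindings (pip install nvidia-ml-py); assumes
    # ECC-capable datacenter GPUs. A real fleet would presumably run
    # something like DCGM rather than a polling loop like this.
    import pynvml

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(h)  # .gpu/.memory, percent
        ecc = pynvml.nvmlDeviceGetTotalEccErrors(
            h,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC,
        )
        print(f"GPU {i}: {temp}C, util {util.gpu}%, uncorrected ECC {ecc}")
    pynvml.nvmlShutdown()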

What are some articles or blogs to read to come up to speed on how state-of-the-art GPU/ML compute orchestration is done today?
