As someone who used to work on Kubernetes and distributed ML on Kubernetes, digging into some of the publicly available facts about how OpenAI runs a Kubernetes cluster of 7,500+ to produce scalable infrastructure for their large language models. [1] [2]
Kubernetes vs. HPC. Many might object and say