Kubetorch enables ML research and development on Kubernetes, across training, inference, RL, evals, data processing, and more, in a remarkably simple and unopinionated package.
For many teams, Kubernetes is increasingly the compute foundation for ML/AI development, thanks to its support for arbitrary workloads at scale, a rich open source ecosystem, and workload portability. However, ML/AI development on Kubernetes is unergonomic. Unlike regular software, you cannot test ML code before containerizing and deploying it to Kubernetes, since you do not have 32 H100s (or even a single T4) locally. You are therefore forced to develop through deployment. Iterating on a distributed PyTorch training run means stopping execution, throwing away state by tearing down pods, rebuilding a Docker image, requeuing for resources, reloading artifacts, and finally resuming training. End-to-end, adding a print statement takes 10–30 minutes and a round of wrestling with Dockerfiles and YAML.
To solve this, Kubetorch is a framework for painlessly deploying to Kubernetes. You can take any regular ML program and run it at arbitrary scale on a remote cluster, calling it as if the cluster's resources were simply part of your local process pool. Subsequent iteration is extremely fast, with local code changes propagating in seconds because everything is cached in place. And Kubetorch introduces zero opinionation: you can use arbitrary compute, images, resource types, and cluster management.
## How It Works
Just as PyTorch made it simple to command GPUs, Kubetorch makes it simple to command Kubernetes, with a similar `.to()` API that deploys code to remote and makes it callable.
```python
import kubetorch as kt

from my_repo.training_main import train

train_compute = kt.Compute(cpus="2", gpus="1")
remote_train = kt.fn(train).to(train_compute)
result = remote_train(lr=0.05, batch_size=4)
```
In this example, we took our regular training entrypoint and sent it to run on Kubernetes, getting a local callable that behaves identically to the original train function. Subsequent calls are routed to the remote service, with logs and exceptions propagating back, again, as if it were regular local execution.
You can easily scale your workload, too; assuming your training uses PyTorch DDP, scaling up the training is as simple as modifying the Compute object you send your workload to.
```python
import kubetorch as kt

from my_repo.training_main import train

train_compute = kt.Compute(gpus="8").distribute("pytorch", workers=8)
remote_train = kt.fn(train).to(train_compute)
result = remote_train(lr=0.05, batch_size=4)
```
You also unlock powerful new patterns with this remote execution. For instance, you can deploy a class to maintain state across calls, or catch remote exceptions and decide how to handle them without the overall application falling over.
```python
import kubetorch as kt

from my_repo import TrainerClass

epochs = 25
batch_size = 32

train_compute = kt.Compute(gpus="8").distribute("pytorch", workers=4)
remote_train = kt.cls(TrainerClass).to(train_compute)
remote_train.load_data()

for epoch in range(epochs):
    try:
        remote_train.train(epoch, batch_size=batch_size)
    except Exception as e:
        if "CUDA out of memory" in str(e):
            batch_size //= 2  # halve the batch size for subsequent epochs
        else:
            raise
remote_train.save()
```
## Key Benefits
Kubetorch is installed as a Python library and a Kubernetes operator, both running within your cloud account and VPC. It can be adopted incrementally within an existing ML stack or used as a full replacement, across training, batch processing, online inference, hyperparameter optimization, and pipelining.
- Run any Python code on Kubernetes at any scale by specifying the required resources, distribution, and scaling directly in code.
- Iterate on that code in 1–2 seconds with magic caching and hot redeployment.
- Execute code reproducibly by dispatching work identically to Kubernetes from any environment, including a teammate’s laptop, CI, an orchestrator, or a production application.
- Handle hardware faults, preemptions, and OOMs programmatically from the driver program, creating robust fault tolerance.
- Optimize workloads with observability, logging, auto-down, queueing, and more.
- Orchestrate complex, heterogeneous workloads such as RL post-training, which requires coordinating different compute resources, images, scaling, and distributed communication within a single loop.
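To make the last point concrete, here is a rough sketch of what coordinating an RL post-training loop could look like. Only `kt.Compute`, `kt.fn`, `kt.cls`, `.to()`, and `.distribute()` come from the examples above; the `RLTrainer` class, `generate_rollouts` function, and their methods are hypothetical user code, and the real loop would be more involved.

```python
import kubetorch as kt

# Hypothetical user code: a DDP-capable trainer class and a rollout function.
from my_repo import RLTrainer, generate_rollouts

# Heterogeneous compute: a distributed training service and a separate
# inference service, each with its own resources and scaling.
trainer_compute = kt.Compute(gpus="8").distribute("pytorch", workers=4)
inference_compute = kt.Compute(gpus="1")

trainer = kt.cls(RLTrainer).to(trainer_compute)
rollouts = kt.fn(generate_rollouts).to(inference_compute)

for step in range(100):
    batch = rollouts(policy_version=step)  # sample on the inference service
    trainer.update(batch)                  # train on the distributed service
trainer.save()
```

Both services are driven from a single local script, so the coordination logic stays in plain Python rather than in pipeline YAML.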
## Get Started
Installing and using Kubetorch (GitHub repo) is simple. If you already have a Kubernetes cluster, then all you need to do to get started is:
- Helm install Kubetorch onto the cluster
- Pip install the Python client anywhere that you want to use Kubetorch (local machine, CI, orchestrator node)
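In shell terms, the two steps above look roughly like the following. The chart reference and package name here are assumptions; check the GitHub repo for the exact install commands.

```shell
# 1. Install the Kubetorch operator onto your cluster via Helm
#    (chart reference and namespace are placeholders)
helm install kubetorch <kubetorch-chart> --namespace kubetorch --create-namespace

# 2. Install the Python client wherever you dispatch work from
#    (local machine, CI, orchestrator node)
pip install kubetorch
```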