Modelplane Modelplane docs
Version

FAQ

Short answers to the questions that come up first, with links to the full treatment. If you’re new here, read the Introduction and How Modelplane works first.

What Modelplane is

Is Modelplane a serving engine like vLLM?
No, Modelplane is the control plane above the engine. It composes serving engines like vLLM, SGLang, and NVIDIA TensorRT-LLM, and operates them across a fleet of clusters. It doesn’t serve tokens itself. You bring the engine; Modelplane schedules it, routes to it, scales it, and caches its weights across your inference fleet.
Does Modelplane replace vLLM or SGLang?
No, they run the model; Modelplane runs the fleet. A ModelDeployment carries your engine container and its flags, and Modelplane composes it onto the right cluster. Switching or upgrading engines is a change to your deployment, not to Modelplane.
How is Modelplane different from KServe or NVIDIA Dynamo?
Scope. KServe and Dynamo are cluster orchestrators: they schedule, scale, route, and cache within a single Kubernetes cluster. Modelplane runs its operations across a fleet of clusters, clouds, and regions. Modelplane uses llm-d for multi-node serving, and KV-cache management, as do KServe and Dynamo. Modelplane is planning deeper integrations with NVIDIA Dynamo in future releases.
How is Modelplane different from a managed provider like Baseten or Fireworks?
Managed providers run fleet-scale serving inside their own closed platform. Modelplane is the open equivalent that runs in infrastructure you own. The difference is open, in your own infrastructure, community-driven, and neutral across the stack, not scope. You can still route to a managed provider from Modelplane.

What it supports

What models does Modelplane support?
Modelplane supports any model, including open weights, custom models, and just about anything that can be downloaded from Hugging Face, NVIDIA NGC, and other registries.
Does Modelplane support NVIDIA?

Yes, across the stack. NVIDIA is the most widely available accelerator on the clouds Modelplane runs on and the primary target today. Modelplane binds NVIDIA GPUs to pods through Dynamic Resource Allocation (DRA), matching devices by attributes such as GPU memory and architecture with CEL selectors.

The software stack rides on the engine-agnostic API. NVIDIA NIM microservices and the TensorRT-LLM engine run as engine containers like any other, Modelplane stages weights and NIM-style artifacts from NVIDIA NGC alongside Hugging Face and other registries, and the inference stack it installs includes NVIDIA Dynamo and llm-d, with deeper Dynamo integration on the roadmap.

Which engines and accelerators are supported?
The API is engine-agnostic: any engine that runs as a container works, and its flags are yours to write. Multiple accelerators are supported as long as they can be bound through DRA, and the device model (DRA plus CEL selectors) is built to extend to other accelerators and fabrics.
Which clouds or neoclouds does Modelplane support?
Today Modelplane provisions clusters on a few hyperscalers and neoclouds, and supports bringing your own Kubernetes cluster anywhere. More provisioners are on the roadmap; the bring-your-own path means you can run on any Kubernetes now. See Supported Providers for the full matrix of clouds, neoclouds, and their Crossplane providers.
Can I bring my own cluster, or run on a neocloud or on-premise?
Yes, an InferenceCluster with source: Existing registers a cluster you already run, through its kubeconfig. Modelplane installs the serving stack it needs but doesn’t provision the infrastructure. This is how you run on neoclouds and on-premise today.

What it requires

Where does Modelplane run?
Modelplane runs as a control plane on a control cluster: an ordinary Kubernetes cluster with Crossplane installed, with no GPUs of its own. The inference clusters it manages do the serving, and each needs Dynamic Resource Allocation (DRA, Kubernetes v1.35+) to bind GPUs to pods. Modelplane assumes exclusive ownership of every inference cluster, so dedicate each one to Modelplane rather than sharing it with other workloads.
Do I need Crossplane?
Yes, Modelplane is built on Crossplane and requires it. If your platform team already runs Crossplane to manage cloud infrastructure, Modelplane is the same pattern applied to inference. Modelplane uses Crossplane’s function framework and shares its infrastructure providers.

What it can do

How does Modelplane decide where a model runs?
Two-level matching. First it filters clusters by their labels (tier, region, provider) against your clusterSelector. Then it filters node pools by matching your device requests, real DRA requests with CEL selectors over GPU memory, architecture, and other attributes, against each pool’s InferenceClass. It places each replica on a cluster and pool that fits and has free capacity.
Can I serve across regions and clusters behind one endpoint?
Yes, that’s the point. A ModelService exposes one OpenAI-compatible endpoint and load-balances across every replica of a deployment, wherever they run.
Can I route to a managed provider?
Yes, a ModelService can include a manually created ModelEndpoint that points at an external SaaS endpoint like Together or Baseten alongside your self-hosted replicas, and load-balances across all of them.
How do large or multi-node models work?
An engine can be a gang: a leader and one or more workers that Modelplane composes into a LeaderWorkerSet across nodes. You write the coordination (like Ray or vLLM’s data-parallel coordinator) in the engine flags, and Modelplane injects the leader’s address so the workers can join it. Multi-node deployments stage weights through a ModelCache.
What about disaggregated prefill/decode?
Set serving.mode: PrefillDecode and define separate prefill and decode engines. Both run on the same cluster, hand off the KV cache over a fast fabric, and Modelplane configures the cluster-edge routing that pairs each request. The KV-transfer flags live in your engine config.
How does scaling work?
Replicas are the only scaling axis. Each replica is a complete serving instance; scaling spec.replicas adds or removes whole instances across the fleet. Because a ModelDeployment exposes the Kubernetes scale subresource, kubectl scale and KEDA work without anything extra. There’s no per-pod autoscaling inside a cluster.
How are model weights handled?
A ModelCache stages weights once per cluster on shared (ReadWriteMany) storage, and every pod reads them locally. Pods don’t re-download on each start, and concurrent starts don’t race. It hydrates from Hugging Face today, is optional for single-node deployments, and is recommended for multi-node ones.

The project

Why did you pick Modelplane as a name for the project?
It’s a fusion of AI Model and Control Plane. We also like that it implies that AI models are their own layer (or plane) in the stack.
What does the logo signify?
Three popsicle sticks assembled to make a model plane. Balsa wood planes were the inspiration.
Is Modelplane production-ready?
Modelplane is in early development and moving fast. Treat it as early software. The platform docs are specific about what’s available today versus what’s planned. We are building it in the open.
What's the license and governance?
Modelplane is Apache 2.0, with no usage caps or token metering, and is developed in the open. It’s neutral across models, engines, accelerators, and clouds, and is intended for donation to a neutral open source foundation. It’s a project from Upbound, the team behind Rook and Crossplane, both CNCF Graduated and widely adopted projects.
How do I get involved?
Issues, discussions, and contributions are welcome on GitHub. See CONTRIBUTING.md for development setup and the project’s conventions.

Next steps