Version

Architecture

On this page

Modelplane’s central design choice is to build the control plane on Crossplane rather than as a bespoke set of Kubernetes controllers. Everything else here follows from that. This section assumes you’re comfortable with Kubernetes; the rest of the Crossplane vocabulary you need is below.

Crossplane in brief

Crossplane extends Kubernetes to manage things beyond the cluster, cloud infrastructure, SaaS, and in Modelplane’s case inference fleets, through the same declarative, reconciled API model. Three of its concepts matter here:

Composite Resources (XRs) are custom resources whose controller, instead of talking to an external API directly, declares a set of other resources that should exist. Every Modelplane API, InferenceCluster, ModelDeployment, ModelService, is an XR.
Composition functions are that controller logic. A function is a small gRPC service handed the observed XR and the resources it depends on, which returns the desired child resources. An XR runs a pipeline of one or more functions every reconcile; in Modelplane each is typically a single function, so the rest of this section says “the function” for short.
Providers are controllers that manage external systems through their own managed resources: provider-gcp and provider-aws for cloud APIs, provider-helm for Helm releases, provider-kubernetes for arbitrary objects on any cluster. A composition function composes these like any other resource.

Put together: a Modelplane API is an XR, its logic is a composition function, and the function composes a mix of plain Kubernetes objects, other Modelplane XRs, and provider resources.

The resource model mirrors Kubernetes core, one scope up: ModelDeployment → ModelReplica → ModelService → ModelEndpoint parallels Deployment → Pod → Service → Endpoint, but across a fleet of clusters rather than within one. A ModelDeployment composes a ModelReplica per replica, a ModelReplica composes the serving workload on its target cluster, and a ModelService routes across the ModelEndpoints. If you know how those core objects relate, you already know the shape of Modelplane’s.

Why Crossplane?

Modelplane is, at its core, a system that turns declarative resources into composed infrastructure spanning cloud accounts, many Kubernetes clusters, and the workloads on them. That’s the problem Crossplane solves, and it helps in two ways: providers and functions.

Providers give us reach. Modelplane has to provision Kubernetes clusters and all the infrastructure they need across different clouds, then install software onto them. That’s an enormous surface, and providers cover it without us rolling our own controllers for each cloud API and Helm release.

Functions are where Modelplane’s own logic lives, and writing it as composition functions buys several things:

Business logic, not controller plumbing. A function computes desired state from observed state. Crossplane handles the fiddly Kubernetes controller details, the watches, requeues, finalizers, and drift correction, that a hand-written controller gets wrong in a dozen subtle ways. Less plumbing to write and maintain means we move faster.
Testability. A function is a pure function of its inputs, so you can test it as a black box: feed it an XR and its dependencies, assert on the resources it returns. The whole test runs in process, with no API server to stand up.
The right language for each job. Functions can be written in any language. Modelplane’s are Python, for fast iteration on the serving and scheduling logic and because Python is the common language of the ML world, which lowers the bar for contributors. The performance-sensitive distributed-systems core stays in Go, where Crossplane and its providers already are.

The bet underneath both is that inference infrastructure is the same shape of problem as cloud infrastructure, which Crossplane manages well. Building on it lets Modelplane spend its effort on the part that’s actually inference-specific.

The control cluster and the fleet

Modelplane runs on a control cluster and manages a fleet of workload clusters, the InferenceClusters. The split is deliberate: the control plane holds no GPUs and serves no tokens. It schedules, composes, and routes; the workload clusters do the serving.

The control cluster runs Crossplane, the Modelplane composition functions (one per resource, each a pod Crossplane calls per reconcile), the providers, and the control-plane gateway. It also holds every Modelplane resource and the ProviderConfigs that let the providers reach each workload cluster, built from that cluster’s kubeconfig.

Crossplane core drives everything. Each reconcile it asks a function what a resource should compose and gets back the desired resources. Core then reconciles them, applying the provider resources that the providers act on. A function only computes desired state. It never reaches a provider or a cluster itself.

flowchart TB
    subgraph control["Control cluster"]
        cp["Crossplane core"]
        fns["Modelplane functions\n(one pod per resource)"]
        prov["Providers\ngcp · aws · helm · kubernetes"]
        gw["Control-plane gateway"]
    end
    subgraph fleet["Fleet"]
        wc1["Workload cluster A"]
        wc2["Workload cluster B"]
    end
    cp <-->|"desired state (gRPC)"| fns
    cp -->|composes| prov
    cp -->|composes| gw
    prov -->|provision + install via kubeconfig| wc1
    prov -->|provision + install via kubeconfig| wc2

Modelplane installs a serving stack on each workload cluster: the components a cluster needs to serve models, providing inference-aware routing through Gateway API, multi-node serving, GPU binding through DRA, and observability, among others. The exact components evolve, but Modelplane composes and owns all of them. For provisioned clusters the providers also create the cluster and its node pools first.

How a deployment is composed

A resource composes others, which compose others, until the tree bottoms out in provider resources and plain Kubernetes objects. A ModelDeployment is the clearest example. Its function schedules the replicas, then composes a ModelReplica for each, and a ModelEndpoint for each replica that’s ready to serve. Each ModelReplica function composes the serving workload, a Deployment or a LeaderWorkerSet, onto its target workload cluster through provider-kubernetes.

flowchart TD
    md["ModelDeployment"]
    mr1["ModelReplica\n(cluster A)"]
    mr2["ModelReplica\n(cluster B)"]
    me1["ModelEndpoint\n(cluster A)"]
    me2["ModelEndpoint\n(cluster B)"]
    wl1["Deployment / LeaderWorkerSet\non workload cluster A"]
    wl2["Deployment / LeaderWorkerSet\non workload cluster B"]

    md --> mr1
    md --> mr2
    md --> me1
    md --> me2
    mr1 --> wl1
    mr2 --> wl2

The platform resources compose the same way. An InferenceCluster composes a GKECluster or EKSCluster (the cloud infrastructure, via the cloud providers) and a ServingStack (the per-cluster software install, via provider-helm and provider-kubernetes). Engines bind GPUs through DRA: each claim: DRA device in a member’s nodeSelector becomes a request in the ResourceClaim the serving pods claim through.

The request path

A served request crosses two gateways, both built on Gateway API. The control-plane gateway is the front door: a ModelService composes an HTTPRoute on it that matches the service’s path prefix and forwards to the matched ModelEndpoints, each of which is a Service pointing at a workload cluster’s gateway address. The workload-cluster gateway then routes from the cluster edge to the engine pods.

flowchart LR
    client["Client"]
    cpgw["Control-plane gateway"]
    wcgw["Workload-cluster gateway"]
    engine["Engine pods\n(vLLM, SGLang, ...)"]

    client -->|service path| cpgw
    cpgw -->|per-replica path| wcgw
    wcgw -->|engine path| engine

Each hop rewrites the path: the control plane rewrites the public prefix to the replica’s path, and the workload gateway strips that down to what the engine serves. This per-backend path rewriting is the main thing the control-plane gateway has to support, and it narrows which Gateway API implementations can fill the role.

Which gateway sits at each layer is internal, not part of the API. The InferenceGateway backend field is an enum precisely so the control-plane gateway can grow other options over time. Target the ModelService URL rather than either gateway directly.