Expose a Model
API: modelplane.ai/v1alpha1 · ModelService
A ModelDeployment serves a model, but its
replicas are scattered across the fleet with no single address. A ModelService
gives them one: a stable, unified, OpenAI-compatible URL that load-balances
across every replica, wherever it runs.
A service selects what to route to by label. Behind the scenes, Modelplane
creates one ModelEndpoint, a single reachable backend, for each replica of a
deployment and labels it. Two of those labels carry routing intent:
modelplane.ai/deployment: the deployment the replica belongs to.
modelplane.ai/cluster: the cluster the replica runs on.
Modelplane creates an endpoint only once its replica is Ready, serving and reachable, and withdraws it if the replica later goes unhealthy. A service only ever routes to replicas that can actually answer, so a deployment that’s still starting or scaling up has fewer endpoints behind its URL until those replicas come up. You don’t create endpoints yourself. You point a service at them.
spec.endpoints is a list, and the entries combine: the service routes to every
endpoint that any entry matches. The patterns below build on that.
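The matching rule can be sketched in a few lines of Python. This is only a model of the semantics described here, not Modelplane's implementation: the function names and the sample endpoints (including the `prod-us-west` cluster) are hypothetical. Within one selector every label must match; across entries, a match on any one entry is enough.

```python
def selector_matches(match_labels, endpoint_labels):
    """True only when every label in the selector is present with the same value."""
    return all(endpoint_labels.get(k) == v for k, v in match_labels.items())


def select_endpoints(entries, endpoints):
    """Names of the endpoints matched by at least one entry (union across entries)."""
    return [
        name
        for name, labels in endpoints.items()
        if any(selector_matches(e["selector"]["matchLabels"], labels) for e in entries)
    ]


# Hypothetical endpoints, labeled the way this page describes.
ENDPOINTS = {
    "qwen3-8b-east-0": {
        "modelplane.ai/deployment": "qwen3-8b",
        "modelplane.ai/cluster": "prod-us-east",
    },
    "qwen3-8b-west-0": {
        "modelplane.ai/deployment": "qwen3-8b",
        "modelplane.ai/cluster": "prod-us-west",
    },
    "qwen3-8b-v2-east-0": {
        "modelplane.ai/deployment": "qwen3-8b-v2",
        "modelplane.ai/cluster": "prod-us-east",
    },
}

# One entry with two labels: both must match, so only one endpoint qualifies.
narrow = [{"selector": {"matchLabels": {
    "modelplane.ai/deployment": "qwen3-8b",
    "modelplane.ai/cluster": "prod-us-east",
}}}]
print(select_endpoints(narrow, ENDPOINTS))

# Two entries: the result is the union of what each one matches.
both = [
    {"selector": {"matchLabels": {"modelplane.ai/deployment": "qwen3-8b"}}},
    {"selector": {"matchLabels": {"modelplane.ai/deployment": "qwen3-8b-v2"}}},
]
print(select_endpoints(both, ENDPOINTS))
```

The first call prints only `qwen3-8b-east-0`; the second prints all three endpoint names.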
Route to a whole deployment
The common case: one selector matching a deployment’s name reaches every replica, wherever in the fleet they run.
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen3-8b # every replica of this deployment

Route to part of a deployment
Add a second label to narrow within a deployment. A selector matches an endpoint only when all its labels match, so pairing the deployment with a cluster routes to just that cluster’s replicas. This is how you take a cluster out of service without redeploying: point the service at the clusters you want and leave one out, and traffic drains to the rest.
spec:
  endpoints:
    # Only the replicas on prod-us-east, e.g. while draining another cluster.
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen3-8b
          modelplane.ai/cluster: prod-us-east

Route across several deployments
Give more than one entry to front several deployments behind the same URL. Each entry contributes its matched endpoints, and traffic spreads evenly across every matched endpoint.
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen3-8b
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen3-8b-v2

This is the shape an A/B test or a canary rollout would take, but note traffic is split evenly across the matched endpoints today. Weighting one entry over another, to send, say, 5% of traffic to a canary, is tracked in #90. Until then the split follows endpoint counts, not a ratio you set.
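To see what "the split follows endpoint counts" means in practice, here is a small worked calculation. It assumes each matched endpoint receives an equal share of requests, as described above; the replica counts are made up for illustration and `traffic_share` is a hypothetical helper, not part of Modelplane.

```python
from collections import Counter
from fractions import Fraction


def traffic_share(endpoint_deployments):
    """Map each deployment to its share of traffic, given one entry per matched
    endpoint and assuming every endpoint receives an equal share."""
    counts = Counter(endpoint_deployments)
    total = sum(counts.values())
    return {name: Fraction(n, total) for name, n in counts.items()}


# Hypothetical fleet: three ready replicas of qwen3-8b, one of qwen3-8b-v2.
shares = traffic_share(["qwen3-8b"] * 3 + ["qwen3-8b-v2"])
for name, share in shares.items():
    print(f"{name}: {float(share):.0%}")
```

With three endpoints against one, the second deployment receives a quarter of the traffic. Under this model, the only way to change that share is to change how many ready replicas each deployment has.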
An entry doesn’t have to select a deployment’s endpoints. One can select a manually created ModelEndpoint that points at an external provider, so a service can send overflow or break-glass traffic to a SaaS endpoint alongside your own replicas:
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: kimi-k2
    - selector:
        matchLabels:
          modelplane.ai/external-provider: together

Endpoints with different path layouts coexist behind the one URL.
Sending a request
The service’s public address is on status.address, in the form
http://<gateway>/<namespace>/<service-name>:
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')

Append the OpenAI path and send a request. The model field is the name the
engine serves (its --served-model-name, or the model’s Hugging Face id if you
didn’t set one):
curl "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Alternate APIs
We call the service URL OpenAI-compatible because the engines are, not because
Modelplane imposes it. The route matches the /<namespace>/<service>/ prefix and
preserves the path below it on the way to the engine, so any API the engine serves
is reachable on the same URL.
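A rough sketch of that path handling, as a reading of the description above rather than the actual route configuration: the function name is hypothetical, and it assumes the /<namespace>/<service>/ prefix is removed before the request reaches the engine.

```python
def engine_path(request_path, namespace, service):
    """Return the path the engine would see, or None when the route does not match.

    Assumes the route strips the /<namespace>/<service>/ prefix and forwards
    everything below it unchanged.
    """
    prefix = f"/{namespace}/{service}/"
    if not request_path.startswith(prefix):
        return None
    return "/" + request_path[len(prefix):]


# The OpenAI path and an operational path map the same way.
print(engine_path("/ml-team/qwen/v1/chat/completions", "ml-team", "qwen"))
print(engine_path("/ml-team/qwen/health", "ml-team", "qwen"))
# A request for another namespace does not match this service's route.
print(engine_path("/other-team/qwen/v1/models", "ml-team", "qwen"))
```

Under those assumptions the three calls print `/v1/chat/completions`, `/health`, and `None`. Nothing in the mapping depends on the API shape, which is why non-OpenAI paths pass through the same way.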
Take a vLLM replica that also serves the Anthropic Messages API. It answers on
.../v1/messages, so a client that speaks it (including Claude Code, via
ANTHROPIC_BASE_URL) talks to it directly. The engine’s operational paths come
through the same way: .../health and the Prometheus .../metrics are reachable
on the service URL.
There’s one exception, and it’s set by the deployment rather than the service. Disaggregated serving reads OpenAI-format request bodies to pick a prefill and decode worker, so a request in another API shape still reaches the engine but skips that cache-aware routing. Unified serving forwards every API shape the same way.
Example
# A ModelService exposes one or more ModelDeployments via a single
# OpenAI-compatible endpoint. It composes a Gateway-API HTTPRoute on the
# control plane that load-balances across every ModelEndpoint matching
# its selector.
#
# Modelplane composes one ModelEndpoint per ModelReplica, labeled
# `modelplane.ai/deployment: <deployment-name>`. So a ModelService with
# that label selector reaches every replica of the named deployment.
#
# Once the service is ready, its public address is on status.address:
# kubectl get ms qwen3-8b -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen3-8b
  namespace: ml-team
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen3-8b