Scale the model
A ModelService can front more than one ModelDeployment. Here you add a second
deployment, pinned to a different region, and point the same service at both. The
endpoint you already curled stays the same. Behind it, traffic now load-balances
across two regions.
graph LR
  subgraph fleet ["Fleet"]
    IC1["us-east\nL4"]
    IC2["us-west\nlarger GPU"]
  end
  subgraph ml ["ML team"]
    MD1["ModelDeployment\nqwen-demo"]
    MD2["ModelDeployment\nqwen-west\nclusterSelector: us-west"]
    MS["ModelService qwen\n/ml-team/qwen/v1/..."]
  end
  IC1 --> MD1
  IC2 --> MD2
  MD1 --> MS
  MD2 --> MS
Deploy to a second region
The new deployment uses a clusterSelector to pin its replica to the us-west
cluster you added in the last step, and selects the larger GPU there:
# A second ModelDeployment, pinned to a different region with clusterSelector.
# It carries its own deployment name, so the ModelService can front it alongside
# qwen-demo and serve both from one endpoint.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-west
  namespace: ml-team
spec:
  replicas: 1
  # clusterSelector filters which InferenceClusters this deployment can land on.
  # This pins the replica to the us-west cluster you added in "Scale the fleet".
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us-west
  engines:
    - name: qwen
      members:
        - role: Standalone
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  # L40S (46068Mi) qualifies; L4 (23034Mi) does not.
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - --model=Qwen/Qwen2.5-0.5B-Instruct
                    - --dtype=half
If the larger GPU in us-west is an A100 40GB instead, use this variant of the same manifest, which lowers the memory threshold:
# A second ModelDeployment, pinned to a different region with clusterSelector.
# It carries its own deployment name, so the ModelService can front it alongside
# qwen-demo and serve both from one endpoint.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-west
  namespace: ml-team
spec:
  replicas: 1
  # clusterSelector filters which InferenceClusters this deployment can land on.
  # This pins the replica to the us-west cluster you added in "Scale the fleet".
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us-west
  engines:
    - name: qwen
      members:
        - role: Standalone
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  # A100 40GB (40960Mi) qualifies; L4 (23034Mi) does not.
                  # Threshold at 35Gi not 40Gi to avoid the boundary on A100's exact reported VRAM.
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("35Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - --model=Qwen/Qwen2.5-0.5B-Instruct
                    - --dtype=half
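
The memory thresholds in the CEL selectors are easier to reason about once everything is in the same unit. The sketch below is only an arithmetic check, not part of Modelplane: it converts each threshold to Mi (1Gi = 1024Mi) and applies the same `>=` comparison to the memory figures quoted in the manifest comments. The `qualifies` helper is a hypothetical name introduced for illustration.

```shell
#!/bin/sh
# qualifies MEM_MI THRESHOLD_GI
# Succeeds (exit 0) when a GPU reporting MEM_MI mebibytes meets a threshold
# given in gibibytes, mirroring the >= comparison in the CEL selector.
qualifies() {
  mem_mi="$1"
  threshold_gi="$2"
  [ "$mem_mi" -ge $((threshold_gi * 1024)) ]
}

# Memory figures are the ones quoted in the manifest comments.
for gpu in "L4:23034" "A100-40GB:40960" "L40S:46068"; do
  name="${gpu%%:*}"
  mem="${gpu##*:}"
  for threshold in 35 40; do
    if qualifies "$mem" "$threshold"; then
      verdict="qualifies"
    else
      verdict="does not qualify"
    fi
    echo "${name} (${mem}Mi) vs ${threshold}Gi: ${verdict}"
  done
done
```

40Gi is exactly 40960Mi, so an A100 40GB passes a 40Gi threshold only by equality, and a device reporting slightly less would be rejected. A 35Gi threshold (35840Mi) leaves headroom while still excluding the L4.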
Wait until its replica is Ready, then check placement. You now have one replica
per region:
kubectl get modelreplica -n ml-team
NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   42m
qwen-west-92535   eks-us-west   True     True    modelreplicas.modelplane.ai   8m
Front both with one service
Update the ModelService to select both deployments. Each entry in
spec.endpoints adds its matching replicas to the same endpoint:
# The same ModelService as before, now selecting two deployments. Each entry in
# spec.endpoints adds every ModelEndpoint matching its selector to the same
# OpenAI-compatible endpoint, so /ml-team/qwen/ load-balances across qwen-demo
# and qwen-west, wherever their replicas run.
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen
  namespace: ml-team
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen-demo
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen-west
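
To see what each selector would pick up, one option is to list the ModelEndpoints by the same label the service matches on. This is a minimal sketch under two assumptions not confirmed by this page: that kubectl resolves the resource name `modelendpoint`, and that the `modelplane.ai/deployment` label is set on those objects. The helper name `list_selected_endpoints` is hypothetical.

```shell
#!/bin/sh
# list_selected_endpoints NAMESPACE DEPLOYMENT...
# Runs one label-filtered query per deployment name, mirroring the
# matchLabels selectors in the ModelService above.
# Assumption: "modelendpoint" is a resource name kubectl can resolve.
list_selected_endpoints() {
  ns="$1"
  shift
  for deployment in "$@"; do
    kubectl get modelendpoint -n "$ns" \
      -l "modelplane.ai/deployment=${deployment}"
  done
}
```

Calling `list_selected_endpoints ml-team qwen-demo qwen-west` runs one query per selector entry, so the combined results correspond to the set of endpoints the service fronts.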
The endpoint URL doesn’t change, so existing clients keep working without knowing the fleet changed. The gateway load-balances across both regions, and if one region goes down, the other keeps serving. Send the same request as before:
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'
That’s the tour
You stood up a control plane, built a multi-region GPU fleet, deployed a model across it, and ended with one stable endpoint serving requests. The platform team published hardware. The ML team described what the model needs. Modelplane placed the replicas on matching hardware and served them behind a single endpoint.
Clean up tears everything down when you’re done.
Modelplane is in active development and we’re building in the open. If you’re running your own inference fleet and want to shape where this goes, we’d love to hear from you. Star the repository, join us in Slack, or read the manifesto.