Version

Scale the model

On this page

A ModelService can front more than one ModelDeployment. Here you add a second deployment, pinned to a different region, and point the same service at both. The endpoint you already curled stays the same. Behind it, traffic now load-balances across two regions.

graph LR
    subgraph fleet ["Fleet"]
        IC1["us-east\nL4"]
        IC2["us-west\nlarger GPU"]
    end

    subgraph ml ["ML team"]
        MD1["ModelDeployment\nqwen-demo"]
        MD2["ModelDeployment\nqwen-west\nclusterSelector: us-west"]
        MS["ModelService qwen\n/ml-team/qwen/v1/..."]
    end

    IC1 --> MD1
    IC2 --> MD2
    MD1 --> MS
    MD2 --> MS

Deploy to a second region

The new deployment uses a clusterSelector to pin its replica to the us-west cluster you added in the last step, and selects the larger GPU there:

# A second ModelDeployment, pinned to a different region with clusterSelector.
# It carries its own deployment name, so the ModelService can front it alongside
# qwen-demo and serve both from one endpoint.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-west
  namespace: ml-team
spec:
  replicas: 1
  # clusterSelector filters which InferenceClusters this deployment can land on.
  # This pins the replica to the us-west cluster you added in "Scale the fleet".
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us-west
  engines:
  - name: qwen
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # L40S (46068Mi) qualifies; L4 (23034Mi) does not.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - --model=Qwen/Qwen2.5-0.5B-Instruct
            - --dtype=half

# A second ModelDeployment, pinned to a different region with clusterSelector.
# It carries its own deployment name, so the ModelService can front it alongside
# qwen-demo and serve both from one endpoint.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-west
  namespace: ml-team
spec:
  replicas: 1
  # clusterSelector filters which InferenceClusters this deployment can land on.
  # This pins the replica to the us-west cluster you added in "Scale the fleet".
  clusterSelector:
    matchLabels:
      modelplane.ai/region: us-west
  engines:
  - name: qwen
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # A100 40GB (40960Mi) qualifies; L4 (23034Mi) does not.
          # Threshold at 35Gi not 40Gi to avoid the boundary on A100's exact reported VRAM.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("35Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - --model=Qwen/Qwen2.5-0.5B-Instruct
            - --dtype=half

Wait until its replica is Ready, then check placement. You now have one replica per region:

kubectl get modelreplica -n ml-team

NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   42m
qwen-west-92535   eks-us-west   True     True    modelreplicas.modelplane.ai   8m

Front both with one service

Update the ModelService to select both deployments. Each entry in spec.endpoints adds its matching replicas to the same endpoint:

# The same ModelService as before, now selecting two deployments. Each entry in
# spec.endpoints adds every ModelEndpoint matching its selector to the same
# OpenAI-compatible endpoint, so /ml-team/qwen/ load-balances across qwen-demo
# and qwen-west, wherever their replicas run.
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen-demo
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen-west

The endpoint URL doesn’t change. Clients that had this URL before still have it; they don’t know the fleet changed. The gateway load-balances across both regions, and losing one region keeps the other serving. Send the same request as before:

ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')

kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'

That’s the tour

You stood up a control plane, built a multi-region GPU fleet, deployed a model across it, and ended with one stable endpoint serving requests. The platform team published hardware. The ML team described what the model needs. Modelplane placed them and served behind a single endpoint.

Clean up tears everything down when you’re done.

For more on the resources you used:

Modelplane is in active development and we’re building in the open. If you’re running your own inference fleet and want to shape where this goes, we’d love to hear from you. Star the repository, join us in Slack, or read the manifesto.