Deploying a model

Now that the platform is provisioned, the ML team can declare what a model needs with a ModelDeployment. You describe the hardware requirements, and the scheduler places the model on the capacity the platform team published.

Create a deployment

Create a namespace for the model:

kubectl create namespace ml-team

Then apply the following ModelDeployment. Its device selector matches against the capacity declared in the InferenceClass, not the pod’s resource requests. Any L4 node satisfies >= 20Gi, so this deployment runs on the cluster you just added:

apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-demo
  namespace: ml-team
spec:
  replicas: 1
  engines:
  - name: qwen
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # Any L4 satisfies >= 20Gi. The selector matches against the capacity
          # declared in the InferenceClass, not the pod's resource requests.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            args:
            - --model=Qwen/Qwen2.5-0.5B-Instruct
            - --dtype=half
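To make the selector's comparison concrete, here is a minimal Python sketch of what `compareTo(quantity("20Gi")) >= 0` evaluates. It only illustrates the semantics and is not Modelplane's implementation. The helper names `parse_quantity` and `compare_to` are hypothetical, and the declared capacity of 24Gi for an L4 node is an assumption; the actual value comes from your InferenceClass.

```python
# Binary suffixes used by Kubernetes resource quantities (powers of 1024).
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_quantity(text: str) -> int:
    """Convert a quantity such as "20Gi" into a number of bytes.

    Only plain integers and binary suffixes are handled in this sketch.
    """
    for suffix, factor in BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def compare_to(left: str, right: str) -> int:
    """Mimic compareTo: -1 if left < right, 0 if equal, 1 if left > right."""
    a, b = parse_quantity(left), parse_quantity(right)
    return (a > b) - (a < b)


# Assumption: the InferenceClass declares 24Gi of GPU memory for an L4 node.
declared = "24Gi"
print(compare_to(declared, "20Gi") >= 0)  # True: 24Gi is at least 20Gi
```

A device declared with less memory, for example `16Gi`, would compare as -1 and fail the selector.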

Wait until REPLICAS shows 1:

kubectl get md -n ml-team --watch

To see which cluster the scheduler chose:

kubectl get modelreplica -n ml-team

NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   12m

The ML team never named a cluster. The scheduler matched the GPU requirement (>= 20Gi) against the InferenceClass the platform team published and made the placement.

Expose the model

A ModelService selects ModelEndpoints by label and creates a Gateway API HTTPRoute that routes to them. Modelplane creates one ModelEndpoint per replica, labeled with the deployment name:

# A ModelService exposes one or more ModelDeployments via a single
# OpenAI-compatible endpoint. It composes a Gateway-API HTTPRoute on the
# control plane that load-balances across every ModelEndpoint matching
# its selector.
#
# Modelplane composes one ModelEndpoint per ModelReplica, labeled
# `modelplane.ai/deployment: <deployment-name>`. So a ModelService with
# that label selector reaches every replica of the named deployment.
#
# Once the service is ready, its public address is on status.address:
#   kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: qwen-demo
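The `matchLabels` selector follows the usual Kubernetes rule: an object matches when it carries every listed key with exactly the listed value. The Python sketch below illustrates that rule only. The endpoint names and the second deployment are made up for the example, and the code is not how Modelplane evaluates selectors internally.

```python
def matches(match_labels: dict[str, str], labels: dict[str, str]) -> bool:
    """Return True when every selector key/value pair is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


selector = {"modelplane.ai/deployment": "qwen-demo"}

# Hypothetical endpoints: names and the "other-model" deployment are invented.
endpoints = [
    {"name": "qwen-demo-7323a", "labels": {"modelplane.ai/deployment": "qwen-demo"}},
    {"name": "other-model-1", "labels": {"modelplane.ai/deployment": "other-model"}},
]

selected = [e["name"] for e in endpoints if matches(selector, e["labels"])]
print(selected)  # ['qwen-demo-7323a']
```

Because the label is applied per replica, scaling the deployment up would add more matching endpoints without any change to the ModelService.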

The request path has the form /<namespace>/<modelservice-name>/..., which is /ml-team/qwen/ for the ModelService named qwen in this example. The model field in the request body is the Hugging Face ID Qwen/Qwen2.5-0.5B-Instruct, because this deployment doesn’t set --served-model-name.

Send a request

Read the endpoint’s public address from the ModelService status:

ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')

Send a request to it:

kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'
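If you prefer to call the endpoint from application code, the same request can be built with Python's standard library. This is a sketch under two assumptions: the address includes a URL scheme such as `http://`, and it is reachable from wherever the script runs (the curl command above runs inside the cluster). The function names `build_request` and `send` are illustrative.

```python
import json
import urllib.request

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


def build_request(address: str, prompt: str, max_tokens: int = 100) -> urllib.request.Request:
    """Build the same POST that the curl command sends."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{address.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(address: str, prompt: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(address, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With the address exported as `ADDRESS`, calling `send(os.environ["ADDRESS"], "What is Kubernetes in one sentence?")` after `import os` should return the same kind of response as the curl command.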

The request is routed to the replica on the cluster where Modelplane placed it. You should get a response in a few seconds:

{
  "id": "chatcmpl-c88b1429-067d-40a5-971c-ab9c54153c26",
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Kubernetes (K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides scalable orchestration capabilities that enable developers to deploy complex applications quickly and efficiently across various environments."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 48,
    "total_tokens": 85
  }
}
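The fields you will usually read from this response are the assistant's text, the finish reason, and the token usage. The following Python sketch pulls them out of a decoded response. The `summarize` helper is hypothetical, and the sample dictionary is a shortened stand-in for the response above.

```python
def summarize(response: dict) -> dict:
    """Extract the reply text, finish reason, and token count from a response."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # "length" means generation stopped because max_tokens was reached.
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": response["usage"]["total_tokens"],
    }


# Shortened stand-in for the response above; the content text is abbreviated.
sample = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "choices": [
        {
            "message": {"role": "assistant", "content": "Kubernetes (K8s) is ..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 37, "completion_tokens": 48, "total_tokens": 85},
}

summary = summarize(sample)
print(summary["finish_reason"], summary["total_tokens"])  # stop 85
```

In the example response, 48 completion tokens is below the `max_tokens` limit of 100, which is consistent with a finish reason of `stop` rather than `length`.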

Next step

The platform team declared capacity, and in this guide the ML team deployed a model behind a stable endpoint. Neither team needed to know what the other was doing; Modelplane matched the model's requirements to the published capacity.

In the next step, the platform team grows the fleet. Scale the platform to add more clusters across regions.