Deploying a model
Now that the platform is provisioned, the ML team can declare what a model needs
with a ModelDeployment. Describe the hardware requirements, and the scheduler
places the model against the capacity the platform team published.
Create a deployment
Create a namespace for the model:
kubectl create namespace ml-team
The device selector matches against the capacity declared in the
InferenceClass, not the pod’s resource requests. Any L4 node satisfies
>= 20Gi, so this deployment runs on the cluster you just added:
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: qwen-demo
  namespace: ml-team
spec:
  replicas: 1
  engines:
    - name: qwen
      members:
        - role: Standalone
          nodeSelector:
            devices:
              - name: gpu
                count: 1
                selectors:
                  # Any L4 satisfies >= 20Gi. The selector matches against the capacity
                  # declared in the InferenceClass, not the pod's resource requests.
                  - cel: |
                      device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
          template:
            spec:
              containers:
                - name: engine
                  image: vllm/vllm-openai:v0.23.0
                  args:
                    - --model=Qwen/Qwen2.5-0.5B-Instruct
                    - --dtype=half
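To make the selector's comparison concrete, here is a minimal Python sketch of what `compareTo(quantity("20Gi")) >= 0` evaluates, assuming quantities use binary suffixes such as `Gi`. The helper names `parse_quantity` and `compare_to` are illustrative and not part of Modelplane, and the `24Gi` capacity is a hypothetical value; the real figure comes from the InferenceClass the platform team published.

```python
from decimal import Decimal

# Binary (power-of-two) suffixes used by Kubernetes-style resource quantities.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as "20Gi" into a number of bytes.

    Only plain numbers and the binary suffixes above are handled; real
    quantities also allow decimal suffixes (k, M, G, ...), omitted here.
    """
    for suffix, factor in _BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def compare_to(left: str, right: str) -> int:
    """Mimic compareTo: return -1, 0, or 1 for left <, ==, > right."""
    a, b = parse_quantity(left), parse_quantity(right)
    return (a > b) - (a < b)


# Hypothetical declared capacity for an L4 device (an assumption for this sketch).
declared_memory = "24Gi"
print(compare_to(declared_memory, "20Gi") >= 0)  # True: the device qualifies
print(compare_to("16Gi", "20Gi") >= 0)           # False: too little memory
```

The point of the sketch is that the comparison runs on declared capacity, so a device passes or fails before any pod is created.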
Wait until REPLICAS shows 1:
kubectl get md -n ml-team --watch
To see which cluster the scheduler chose:
kubectl get modelreplica -n ml-team
NAME              CLUSTER       SYNCED   READY   COMPOSITION                   AGE
qwen-demo-7323a   eks-us-east   True     True    modelreplicas.modelplane.ai   12m
The ML team never named a cluster. The scheduler matched the GPU requirement
(>= 20Gi) against the InferenceClass the platform team published and made
the placement.
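The placement decision can be pictured with a toy filter. This is only a sketch under stated assumptions: the `ClusterCapacity` type, the `place` function, the free-GPU count, and the `small-gpu-cluster` entry are all hypothetical, and the real scheduler may weigh other factors or rank candidates differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterCapacity:
    """Hypothetical stand-in for the capacity a cluster publishes."""
    cluster: str
    gpu_memory_gib: int
    free_gpus: int


def place(clusters: list[ClusterCapacity], min_memory_gib: int, count: int) -> Optional[str]:
    """Return the first cluster whose GPUs meet the requirement, or None."""
    for c in clusters:
        if c.gpu_memory_gib >= min_memory_gib and c.free_gpus >= count:
            return c.cluster
    return None


fleet = [
    # Made-up cluster that fails the 20 GiB requirement.
    ClusterCapacity("small-gpu-cluster", gpu_memory_gib=16, free_gpus=4),
    # Cluster name taken from the output above; the capacity numbers are assumed.
    ClusterCapacity("eks-us-east", gpu_memory_gib=24, free_gpus=1),
]
print(place(fleet, min_memory_gib=20, count=1))  # eks-us-east
```

The deployment only states a requirement; the cluster name appears solely in the result.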
Expose the model
A ModelService selects ModelEndpoints by label and creates a Gateway API
HTTPRoute that routes to them. Modelplane creates one ModelEndpoint per
replica, labeled with the deployment name:
# A ModelService exposes one or more ModelDeployments via a single
# OpenAI-compatible endpoint. It composes a Gateway-API HTTPRoute on the
# control plane that load-balances across every ModelEndpoint matching
# its selector.
#
# Modelplane composes one ModelEndpoint per ModelReplica, labeled
# `modelplane.ai/deployment: <deployment-name>`. So a ModelService with
# that label selector reaches every replica of the named deployment.
#
# Once the service is ready, its public address is on status.address:
#   kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: qwen
  namespace: ml-team
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: qwen-demo
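Label selection of this kind follows the usual `matchLabels` rule: an object is selected when every key/value pair in the selector appears in its labels. The sketch below illustrates that rule in Python. The `matches` helper and the endpoint names `endpoint-a` and `endpoint-b` are hypothetical, chosen only for the example.

```python
def matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical endpoints; only the label key and the value "qwen-demo"
# come from the manifest above.
endpoints = [
    {"name": "endpoint-a", "labels": {"modelplane.ai/deployment": "qwen-demo"}},
    {"name": "endpoint-b", "labels": {"modelplane.ai/deployment": "other-model"}},
]

selector = {"modelplane.ai/deployment": "qwen-demo"}
selected = [e["name"] for e in endpoints if matches(selector, e["labels"])]
print(selected)  # ['endpoint-a']
```

Because the match is on the deployment label rather than on individual names, any new replica of the same deployment would be picked up without changing the ModelService.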
The request path is /<namespace>/<modelservice-name>/... (/ml-team/qwen/ in
this example), from the ModelService named qwen. The model field in the
request body is the Hugging Face id Qwen/Qwen2.5-0.5B-Instruct, since this
deployment doesn’t set --served-model-name.
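As a small illustration of those two rules, the following Python sketch builds the route prefix and an OpenAI-style chat request body. The helper names `route_prefix` and `chat_body` are hypothetical, and nothing is sent over the network; the sketch only shows how the namespace, ModelService name, and model id end up in the request.

```python
import json


def route_prefix(namespace: str, service: str) -> str:
    """Path prefix of the form /<namespace>/<modelservice-name>/."""
    return f"/{namespace}/{service}/"


def chat_body(model: str, prompt: str, max_tokens: int = 100) -> str:
    """Serialize a chat completion request body as JSON."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })


print(route_prefix("ml-team", "qwen"))  # /ml-team/qwen/

# The model field is the Hugging Face id, because --served-model-name is not set.
body = chat_body("Qwen/Qwen2.5-0.5B-Instruct", "What is Kubernetes in one sentence?")
print(body)
```

If the deployment did set `--served-model-name`, the `model` value passed to `chat_body` would presumably need to be that name instead.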
Send a request
Read the endpoint’s public address from the ModelService status:
ADDRESS=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
Send a request to it:
kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  --env="ADDRESS=$ADDRESS" \
  -- sh -c 'curl -v "$ADDRESS/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"Qwen/Qwen2.5-0.5B-Instruct\",\"messages\":[{\"role\":\"user\",\"content\":\"What is Kubernetes in one sentence?\"}],\"max_tokens\":100}"'
The request routes to the replica on the cluster Modelplane placed it on. You should get a response in a few seconds:
{
  "id": "chatcmpl-c88b1429-067d-40a5-971c-ab9c54153c26",
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Kubernetes (K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides scalable orchestration capabilities that enable developers to deploy complex applications quickly and efficiently across various environments."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 48,
    "total_tokens": 85
  }
}
Next step
The platform team declared capacity, and in this guide the ML team deployed a model behind a stable endpoint. Neither team needed to know what the other was doing. Modelplane matched them.
In the next step, the platform team grows the fleet. Scale the platform to add more clusters across regions.