Version
Kimi-K2
On this page
A 1T MoE (1 trillion parameters) served prefill/decode disaggregated across two H200 nodes: two engines, one per phase, with Modelplane composing the llm-d routing layer between them. This recipe serves an INT4 quantization of the model; the native FP8 weights need four such nodes.
This recipe was run end to end; the InferenceClass and ModelDeployment are
the exact manifests from that run. Apply the platform side first, then the ML
side. The InferenceCluster carries an EC2 capacity reservation placeholder to
edit before applying.
Platform
inference-class.yaml
# InferenceClass for the H200 shape, validated serving Kimi-K2 prefill/decode
# disaggregated on EKS. 8x NVIDIA H200 on an EKS p5en.48xlarge, with EFA.
#
# Both the GPU and the EFA fabric are claim: DRA devices. The two P/D engines
# each request 8 GPUs + 16 EFA interfaces, and the scheduler places one on each
# H200 node; NIXL ships KV cache between them over EFA.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
name: eks-h200-8x-p5en
spec:
description: "EKS p5en.48xlarge, 8x NVIDIA H200, EFA"
provisioning:
provider: EKS
eks:
instanceType: p5en.48xlarge
diskSizeGb: 1024
accelerator:
type: nvidia-h200
count: 8
devices:
- name: gpu
claim: DRA
driver: gpu.nvidia.com
deviceClassName: gpu.nvidia.com
count: 8
attributes:
architecture: { string: Hopper }
capacity:
memory: { value: "140Gi" } # advertised below the ~141 GiB the driver reports
- name: efa
claim: DRA
driver: dra.net
deviceClassName: efa.networking.k8s.aws
count: 16
inference-cluster.yaml
# An EKS InferenceCluster with a two-node H200 pool over EFA, validated serving
# Kimi-K2 as a prefill/decode pair (one engine per node). The H200 nodes come
# from an EC2 Capacity Block reserved for ML.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
name: eks-kimi
labels:
modelplane.ai/region: us
spec:
cluster:
source: EKS
eks:
region: us-east-2
nodePools:
- name: gpu-h200
className: eks-h200-8x-p5en
nodeCount: 2
minNodeCount: 2
maxNodeCount: 2
zones:
- us-east-2b
fabric: EFA
capacityBlock:
capacityReservationId: cr-0123456789abcdef0 # replace with your reservation ID
bash
curl -fsSL https://v0-1.docs.modelplane.ai/examples/examples/kimi-k2/inference-cluster.yaml \
| sed 's/cr-0123456789abcdef0//' \
| kubectl apply -f -Deployment
model-cache.yaml
# The shared, read-write-many cache both P/D engines serve from. Hydrated once
# per matched cluster; both phases mount the same RWX volume at /mnt/models.
#
# This validated run served an INT4 quantization of Kimi K2 rather than the
# native 1T-parameter FP8 model, which would need four 8x H200 nodes. The quant
# repo is public, so no authSecret is needed here.
apiVersion: modelplane.ai/v1alpha1
kind: ModelCache
metadata:
name: kimi-k2
namespace: ml-team
spec:
source: HuggingFace
huggingFace:
repo: RedHatAI/Kimi-K2-Instruct-quantized.w4a16
sizeGiB: 600
model-deployment.yaml
# Kimi-K2 served prefill/decode disaggregated across two H200 nodes, validated
# end to end on EKS. serving.mode: PrefillDecode makes the two engines below one
# P/D pair: Modelplane composes the llm-d routing layer (InferencePool, endpoint
# picker, the NIXL pd-sidecar) between them. Each engine is a single-node 8-GPU
# Standalone pod; the scheduler places prefill on one node and decode on the
# other, and KV cache ships between them over EFA via NIXL.
#
# Notes on the engine flags (most P/D machinery is Modelplane's; the engine
# config has real sharp edges):
# EP, not TP8. --tensor-parallel-size=1 --data-parallel-size=8
# --enable-expert-parallel is vLLM's DeepSeek-V3 / Kimi single-node recipe
# (Kimi-K2 is DeepSeek-V3 arch). It keeps each expert whole on a GPU and
# dodges the Marlin %128 alignment trap that a plain TP8 layout would hit.
# --load-format=runai_streamer cold-reads ~509 GiB per engine off the shared
# RWX cache in ~6 minutes (vs ~45 with the default loader).
# --tokenizer / --override-generation-config are workarounds for two bugs in
# this specific quant repo (an off-by-2 in its bundled tokenizer, and a
# wrong eos_token_id), not normal flags. The override pulls a gated repo at
# startup, so HF_TOKEN is set on both engines.
# The decode engine runs on :8001; the pd-sidecar owns :8000.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
name: kimi-k2
namespace: ml-team
spec:
replicas: 1
modelCacheRef:
name: kimi-k2
serving:
mode: PrefillDecode # the two engines below are one P/D pair
engines:
- name: kimi-prefill
phase: Prefill
members:
- role: Standalone
nodeSelector:
devices:
- name: gpu
count: 8
selectors:
- cel: |
device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("130Gi")) >= 0
- name: efa
count: 16
selectors:
- cel: |
device.driver == "dra.net"
template:
spec:
containers:
- name: engine
image: vllm/vllm-openai:v0.23.0
command: ["vllm", "serve", "/mnt/models"]
args:
- --served-model-name=kimi-k2
- --quantization=compressed-tensors
- --tensor-parallel-size=1
- --data-parallel-size=8
- --enable-expert-parallel
- --block-size=64
- --max-model-len=131072
- --trust-remote-code
- --tool-call-parser=kimi_k2
- --enable-auto-tool-choice
- --load-format=runai_streamer
- --model-loader-extra-config={"concurrency":16,"distributed":true}
- --tokenizer=moonshotai/Kimi-K2-Instruct
- --override-generation-config={"eos_token_id":163586}
- --port=8000
- --kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_producer"}
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef: { name: hf-token, key: HF_TOKEN }
- name: kimi-decode
phase: Decode
members:
- role: Standalone
nodeSelector:
devices:
- name: gpu
count: 8
selectors:
- cel: |
device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("130Gi")) >= 0
- name: efa
count: 16
selectors:
- cel: |
device.driver == "dra.net"
template:
spec:
containers:
- name: engine
image: vllm/vllm-openai:v0.23.0
command: ["vllm", "serve", "/mnt/models"]
args:
- --served-model-name=kimi-k2
- --quantization=compressed-tensors
- --tensor-parallel-size=1
- --data-parallel-size=8
- --enable-expert-parallel
- --block-size=64
- --max-model-len=131072
- --trust-remote-code
- --tool-call-parser=kimi_k2
- --enable-auto-tool-choice
- --load-format=runai_streamer
- --model-loader-extra-config={"concurrency":16,"distributed":true}
- --tokenizer=moonshotai/Kimi-K2-Instruct
- --override-generation-config={"eos_token_id":163586}
- --port=8001 # pd-sidecar owns 8000
- --kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_consumer"}
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef: { name: hf-token, key: HF_TOKEN }
model-service.yaml
# Exposes the kimi-k2 deployment as a single OpenAI-compatible URL. Read the
# public address from status.address:
# kubectl get ms kimi-k2 -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
name: kimi-k2
namespace: ml-team
spec:
endpoints:
- selector:
matchLabels:
modelplane.ai/deployment: kimi-k2