Version

Kimi-K2

On this page

A 1T MoE (1 trillion parameters) served prefill/decode disaggregated across two H200 nodes: two engines, one per phase, with Modelplane composing the llm-d routing layer between them. This recipe serves an INT4 quantization of the model; the native FP8 weights need four such nodes.

This recipe was run end to end; the InferenceClass and ModelDeployment are the exact manifests from that run. Apply the platform side first, then the ML side. The InferenceCluster carries an EC2 capacity reservation placeholder to edit before applying.

Platform

# InferenceClass for the H200 shape, validated serving Kimi-K2 prefill/decode
# disaggregated on EKS. 8x NVIDIA H200 on an EKS p5en.48xlarge, with EFA.
#
# Both the GPU and the EFA fabric are claim: DRA devices. The two P/D engines
# each request 8 GPUs + 16 EFA interfaces, and the scheduler places one on each
# H200 node; NIXL ships KV cache between them over EFA.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-h200-8x-p5en
spec:
  description: "EKS p5en.48xlarge, 8x NVIDIA H200, EFA"
  provisioning:
    provider: EKS
    eks:
      instanceType: p5en.48xlarge
      diskSizeGb: 1024
      accelerator:
        type: nvidia-h200
        count: 8
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 8
    attributes:
      architecture: { string: Hopper }
    capacity:
      memory: { value: "140Gi" }   # advertised below the ~141 GiB the driver reports
  - name: efa
    claim: DRA
    driver: dra.net
    deviceClassName: efa.networking.k8s.aws
    count: 16

# An EKS InferenceCluster with a two-node H200 pool over EFA, validated serving
# Kimi-K2 as a prefill/decode pair (one engine per node). The H200 nodes come
# from an EC2 Capacity Block reserved for ML.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-kimi
  labels:
    modelplane.ai/region: us
spec:
  cluster:
    source: EKS
    eks:
      region: us-east-2
  nodePools:
  - name: gpu-h200
    className: eks-h200-8x-p5en
    nodeCount: 2
    minNodeCount: 2
    maxNodeCount: 2
    zones:
    - us-east-2b
    fabric: EFA
    capacityBlock:
      capacityReservationId: cr-0123456789abcdef0  # replace with your reservation ID

curl -fsSL https://v0-1.docs.modelplane.ai/examples/examples/kimi-k2/inference-cluster.yaml \
  | sed 's/cr-0123456789abcdef0//' \
  | kubectl apply -f -

Deployment

# The shared, read-write-many cache both P/D engines serve from. Hydrated once
# per matched cluster; both phases mount the same RWX volume at /mnt/models.
#
# This validated run served an INT4 quantization of Kimi K2 rather than the
# native 1T-parameter FP8 model, which would need four 8x H200 nodes. The quant
# repo is public, so no authSecret is needed here.
apiVersion: modelplane.ai/v1alpha1
kind: ModelCache
metadata:
  name: kimi-k2
  namespace: ml-team
spec:
  source: HuggingFace
  huggingFace:
    repo: RedHatAI/Kimi-K2-Instruct-quantized.w4a16
    sizeGiB: 600

# Kimi-K2 served prefill/decode disaggregated across two H200 nodes, validated
# end to end on EKS. serving.mode: PrefillDecode makes the two engines below one
# P/D pair: Modelplane composes the llm-d routing layer (InferencePool, endpoint
# picker, the NIXL pd-sidecar) between them. Each engine is a single-node 8-GPU
# Standalone pod; the scheduler places prefill on one node and decode on the
# other, and KV cache ships between them over EFA via NIXL.
#
# Notes on the engine flags (most P/D machinery is Modelplane's; the engine
# config has real sharp edges):
#   EP, not TP8. --tensor-parallel-size=1 --data-parallel-size=8
#     --enable-expert-parallel is vLLM's DeepSeek-V3 / Kimi single-node recipe
#     (Kimi-K2 is DeepSeek-V3 arch). It keeps each expert whole on a GPU and
#     dodges the Marlin %128 alignment trap that a plain TP8 layout would hit.
#   --load-format=runai_streamer cold-reads ~509 GiB per engine off the shared
#     RWX cache in ~6 minutes (vs ~45 with the default loader).
#   --tokenizer / --override-generation-config are workarounds for two bugs in
#     this specific quant repo (an off-by-2 in its bundled tokenizer, and a
#     wrong eos_token_id), not normal flags. The override pulls a gated repo at
#     startup, so HF_TOKEN is set on both engines.
#   The decode engine runs on :8001; the pd-sidecar owns :8000.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: kimi-k2
  namespace: ml-team
spec:
  replicas: 1
  modelCacheRef:
    name: kimi-k2
  serving:
    mode: PrefillDecode            # the two engines below are one P/D pair
  engines:
  - name: kimi-prefill
    phase: Prefill
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("130Gi")) >= 0
        - name: efa
          count: 16
          selectors:
          - cel: |
              device.driver == "dra.net"
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command: ["vllm", "serve", "/mnt/models"]
            args:
            - --served-model-name=kimi-k2
            - --quantization=compressed-tensors
            - --tensor-parallel-size=1
            - --data-parallel-size=8
            - --enable-expert-parallel
            - --block-size=64
            - --max-model-len=131072
            - --trust-remote-code
            - --tool-call-parser=kimi_k2
            - --enable-auto-tool-choice
            - --load-format=runai_streamer
            - --model-loader-extra-config={"concurrency":16,"distributed":true}
            - --tokenizer=moonshotai/Kimi-K2-Instruct
            - --override-generation-config={"eos_token_id":163586}
            - --port=8000
            - --kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_producer"}
            env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef: { name: hf-token, key: HF_TOKEN }
  - name: kimi-decode
    phase: Decode
    members:
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 8
          selectors:
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("130Gi")) >= 0
        - name: efa
          count: 16
          selectors:
          - cel: |
              device.driver == "dra.net"
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.23.0
            command: ["vllm", "serve", "/mnt/models"]
            args:
            - --served-model-name=kimi-k2
            - --quantization=compressed-tensors
            - --tensor-parallel-size=1
            - --data-parallel-size=8
            - --enable-expert-parallel
            - --block-size=64
            - --max-model-len=131072
            - --trust-remote-code
            - --tool-call-parser=kimi_k2
            - --enable-auto-tool-choice
            - --load-format=runai_streamer
            - --model-loader-extra-config={"concurrency":16,"distributed":true}
            - --tokenizer=moonshotai/Kimi-K2-Instruct
            - --override-generation-config={"eos_token_id":163586}
            - --port=8001                                      # pd-sidecar owns 8000
            - --kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_consumer"}
            env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef: { name: hf-token, key: HF_TOKEN }

# Exposes the kimi-k2 deployment as a single OpenAI-compatible URL. Read the
# public address from status.address:
#   kubectl get ms kimi-k2 -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: kimi-k2
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: kimi-k2