Version

Llama-3.1-8B

On this page

An 8B dense chat model on a single NVIDIA L4. The entry recipe: one Standalone engine, no cache, public weights from a Hugging Face mirror. It carries no clusterSelector, so device capacity alone matches it to any compatible L4 in the fleet.

This recipe was run end to end on GKE; the InferenceClass, InferenceCluster, and ModelDeployment are the exact manifests from that run. The EKS platform shape is the standard single-L4 recipe. It passes server validation but was not served in this run. Apply the platform side first, then the ML side. The GKE InferenceCluster carries a GCP project placeholder to edit before applying.

Platform

# InferenceClass for the L4 shape on EKS, validated serving Llama-3.1-8B.
#
# One NVIDIA L4 on a g6.2xlarge. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: eks-l4-1x-g6
spec:
  description: "EKS g6.2xlarge, 1x NVIDIA L4"
  provisioning:
    provider: EKS
    eks:
      instanceType: g6.2xlarge
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
      # nominal 24GB.
      memory: { value: "23034Mi" }

# EKS InferenceCluster with one L4 node pool. No clusterSelector targets it; the
# ModelDeployment matches on device capacity alone, so it lands here or on any
# other compatible cluster in the fleet.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: eks-l4-single
  labels:
    modelplane.ai/cloud: eks
    modelplane.ai/region: us-west
spec:
  cluster:
    source: EKS
    eks:
      region: us-west-2
  nodePools:
  - name: gpu-l4
    className: eks-l4-1x-g6
    nodeCount: 1
    zones:
    - us-west-2a
    minNodeCount: 0
    maxNodeCount: 4

# InferenceClass for the L4 shape on GKE, validated serving Llama-3.1-8B.
#
# One NVIDIA L4 on a g2-standard-8. The GPU is declared as a DRA device: the
# scheduler matches a ModelDeployment's nodeSelector against this capacity, then
# DRA binds the physical GPU to the serving pod.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-l4-1x-g2
spec:
  description: "GKE g2-standard-8, 1x NVIDIA L4"
  provisioning:
    provider: GKE
    gke:
      machineType: g2-standard-8
      diskSizeGb: 100
      accelerator:
        type: nvidia-l4
        count: 1
  devices:
  - name: gpu
    claim: DRA
    driver: gpu.nvidia.com
    deviceClassName: gpu.nvidia.com
    count: 1
    attributes:
      architecture: { string: Ada Lovelace }
    capacity:
      # The L4's real usable VRAM as the NVIDIA DRA driver reports it, not the
      # nominal 24GB.
      memory: { value: "23034Mi" }

# GKE InferenceCluster with one L4 node pool. Replace the project ID before
# applying. No clusterSelector targets it; the ModelDeployment matches on device
# capacity alone, so it lands here or on any other compatible cluster.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
metadata:
  name: gke-l4-single
  labels:
    modelplane.ai/cloud: gke
    modelplane.ai/region: us-central
spec:
  cluster:
    source: GKE
    gke:
      project: my-gcp-project  # Replace with your GCP project ID.
      region: us-central1
  nodePools:
  - name: gpu-l4
    className: gke-l4-1x-g2
    nodeCount: 1
    zones:
    - us-central1-a
    minNodeCount: 0
    maxNodeCount: 4

curl -fsSL https://v0-1.docs.modelplane.ai/examples/examples/llama-3.1-8b/inference-cluster-gke.yaml \
  | sed 's/my-gcp-project//' \
  | kubectl apply -f -

Deployment

# Llama-3.1-8B Instruct served on a single NVIDIA L4 by vLLM, validated end to
# end on GKE (the model layer is cloud-agnostic; the same manifest serves on EKS).
#
# 8B in bf16 is ~16Gi of weights, leaving room for the KV cache on the L4's
# ~23Gi. Llama's default context is 128K, whose KV cache does not fit beside the
# weights, so --max-model-len caps it at 8192 - raise it only as far as the
# leftover VRAM allows.
#
# Weights come from the public NousResearch mirror, so no Hugging Face token is
# needed. The gated meta-llama/Llama-3.1-8B-Instruct original needs an hf-token
# Secret on the *workload* cluster (the engine pod reads it, not the control
# plane) and HF_TOKEN passed on the engine container.
#
# No --port or --host: Modelplane's routing expects the engine on its default
# :8000 with a /health probe, and passes args through verbatim.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-3-1-8b
  namespace: ml-team
spec:
  # One replica, matched to any compatible InferenceCluster by device capacity.
  replicas: 1
  engines:
  - name: llama
    members:
    # A single self-contained vLLM pod. The container named "engine" is the
    # inference server; its image and args pass through verbatim.
    - role: Standalone
      nodeSelector:
        devices:
        - name: gpu
          count: 1
          selectors:
          # An 8B model needs most of an L4. >=20Gi selects the L4 (which
          # reports ~23Gi) without over-constraining. DRA evaluates this CEL
          # against the InferenceClass device, then against the GPU's
          # ResourceSlice when it binds the claim.
          - cel: |
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
      template:
        spec:
          containers:
          - name: engine
            image: vllm/vllm-openai:v0.7.3
            args:
            - "--model=NousResearch/Meta-Llama-3.1-8B-Instruct"
            # The id clients pass as "model" in OpenAI requests.
            - "--served-model-name=llama-3.1-8b"
            # Cap the context so the KV cache fits beside the weights on the L4.
            - "--max-model-len=8192"

# Exposes the llama-3-1-8b deployment's endpoints as a single OpenAI-compatible
# URL. Modelplane labels each composed ModelEndpoint with the deployment name, so
# this selector reaches every replica. Read the public address from
# status.address:
#   kubectl get ms llama-3-1-8b -n ml-team -o jsonpath='{.status.address}'
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: llama-3-1-8b
  namespace: ml-team
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: llama-3-1-8b