Self-Hosted LLM Inference: A Complete vLLM Setup Guide
Running your own LLM inference server gives you complete control over data privacy, latency, and costs. This guide walks through deploying a production-ready vLLM server on NVIDIA DGX Spark hardware, with real-world troubleshooting tips from actual deployment experience.
Why Self-Host LLM Inference?
Before diving into the technical setup, consider the benefits of self-hosting:
- Data Privacy: Sensitive data never leaves your infrastructure
- Predictable Costs: No per-token API charges for heavy workloads
- Low Latency: Local inference eliminates network round-trips
- Model Freedom: Run any model, including fine-tuned variants
- No Rate Limits: Scale horizontally without API throttling
Hardware Platform: NVIDIA DGX Spark (GB10)
This guide is based on deployment experience with the ASUS Ascent GX10, powered by NVIDIA's DGX Spark platform featuring the GB10 Grace Blackwell Superchip.
Key Specifications
The GB10 is a high-performance AI-focused system-on-a-chip (SoC) designed for desktop AI workstations:
| Component | Specification |
|---|---|
| CPU | 20-core ARM v9.2-A (10× Cortex-X925 @ 3GHz + 10× Cortex-A725 @ 2GHz) |
| GPU | Blackwell architecture, 6,144 shaders, 5th Gen Tensor Cores, 4th Gen RT Cores |
| AI Performance | 1,000 TOPS FP4 (NVFP4), 31.03 TFLOPS FP32 |
| Memory | 128 GB LPDDR5X-9400 (256-bit bus, 273–301 GB/s bandwidth) |
| Interconnect | NVLink-C2C (600 GB/s bidirectional CPU↔GPU) |
| Cache | 32 MB L3 + 24 MB GPU L2 + 16 MB L4 system cache |
| Power | 140 W TDP |
| Form Factor | 150mm × 150mm × 50.5mm desktop |
| Storage | Up to 4 TB NVMe SSD |
| Connectivity | HDMI 2.1a, 4× USB-C, 10 GbE, 200 Gbps ConnectX-7, Wi-Fi 7, BT 5.4 |
The GB10 Grace Blackwell Superchip is optimized for inference workloads with:
- Native FP4, FP8, and INT4 support for efficient quantization
- Transformer Engine acceleration
- Unified coherent memory architecture
- High memory bandwidth for large context windows
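These quantization formats are what make the memory figures later in this guide work out. Back-of-the-envelope: a ~30B-parameter model with 4-bit (AWQ) weights needs roughly 14 GiB before quantization scales and any unquantized layers, consistent with the ~17 GiB model footprint reported below. A sketch of the arithmetic (the parameter count is approximate):

```python
params = 30e9   # ~30B parameters (approximate, for illustration)
bits = 4        # AWQ 4-bit weight quantization
weights_gib = params * bits / 8 / 2**30
print(f"~{weights_gib:.1f} GiB of raw weights")  # ~14.0 GiB
```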
vLLM: The Inference Engine
vLLM is a high-performance LLM inference engine that provides:
- PagedAttention: Efficient memory management for KV cache
- Continuous Batching: Dynamic request batching
- Optimized Kernels: FlashAttention, FlashInfer integration
- OpenAI-Compatible API: Drop-in replacement for OpenAI clients
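The PagedAttention idea is easiest to see in miniature: instead of reserving one contiguous KV-cache region per sequence, the cache is split into fixed-size blocks handed out on demand, so memory is never stranded by over-reservation. A toy sketch (illustration only; vLLM's real allocator manages GPU tensors and block tables):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    """Toy paged allocator: sequences get block lists, not contiguous slabs."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self, num_tokens: int) -> list[int]:
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("KV cache exhausted")
        return [self.free.pop() for _ in range(needed)]

    def release(self, block_table: list[int]) -> None:
        self.free.extend(block_table)

alloc = BlockAllocator(num_blocks=1024)
seq_a = alloc.allocate(100)  # 100 tokens -> 7 blocks (not a 131K reservation)
seq_b = alloc.allocate(17)   # 17 tokens  -> 2 blocks
alloc.release(seq_a)         # blocks return to the shared pool
print(len(seq_a), len(seq_b))  # 7 2
```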
Docker Deployment
The recommended deployment method uses Docker for reproducibility and isolation.
Docker Compose Configuration
```yaml
services:
  vllm:
    image: vllm-node:latest
    container_name: vllm-qwen
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "0.0.0.0:8888:8888"
    volumes:
      - ./templates/qwen3_chat.jinja:/templates/qwen3_chat.jinja:ro
      - ./data/sharegpt.json:/data/sharegpt.json:ro
      - ~/.cache/huggingface:/root/.cache/huggingface
    ipc: host
    shm_size: 32g
    environment:
      - VLLM_USE_FLASHINFER_MOE_FP8=1
      - VLLM_FLASHINFER_MOE_BACKEND=latency
      - VLLM_ATTENTION_BACKEND=FLASH_ATTN
      - VLLM_USE_DEEP_GEMM=0
    command:
      - vllm
      - serve
      - QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ
      - --port
      - "8888"
      - --host
      - 0.0.0.0
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.7"
      - --load-format
      - fastsafetensors
      - --max-num-seqs
      - "64"
      - --max-num-batched-tokens
      - "8192"
      - --trust-remote-code
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --chat-template-kwargs
      - '{"enable_thinking": false}'
```
Configuration Parameters Explained
| Parameter | Value | Purpose |
|---|---|---|
| `--max-model-len` | 131072 | 128K-token context window |
| `--gpu-memory-utilization` | 0.7 | Let vLLM use up to 70% of GPU memory (weights + KV cache) |
| `--load-format` | fastsafetensors | Fast model loading |
| `--max-num-seqs` | 64 | Maximum concurrent sequences |
| `--max-num-batched-tokens` | 8192 | Token batch limit for prefill |
| `--enable-auto-tool-choice` | flag | Enable function calling |
| `--tool-call-parser` | qwen3_coder | Tool-call format parser |
Resource Allocation
With this configuration on GB10 hardware:
| Resource | Allocation |
|---|---|
| Model Memory | ~17 GiB |
| KV Cache | 62.16 GiB |
| KV Cache Tokens | 678,912 |
| Max Concurrency | 5.18x (at 131K context) |
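The max-concurrency figure is just KV-cache capacity divided by context length: 678,912 cached tokens across a 131,072-token window supports about 5.18 full-context sequences at once. A quick sanity check:

```python
kv_cache_tokens = 678_912   # from the table above
max_model_len = 131_072     # --max-model-len

print(f"{kv_cache_tokens / max_model_len:.2f}x")  # 5.18x
```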
Common Issues and Solutions
Issue 1: Tool Choice Configuration Error
Error:
```
"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set
```
Solution: Add both flags to enable tool calling:
```
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
```
Note: Qwen3-VL models DO support tool calling with the qwen3_coder parser, despite some documentation suggesting otherwise.
Issue 2: Context Window Overflow
Error:
```
You passed 23643 input tokens and requested 32000 output tokens.
However, the model's context length is only 32768.
```
Solution: Increase --max-model-len to accommodate both input and output:
```
--max-model-len 131072  # 128K tokens
```
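This class of error can also be caught client-side: verify that prompt tokens plus requested output tokens fit the server's context length before sending. A small helper (the token counts below are the ones from the error message; in practice they would come from your tokenizer):

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_length: int) -> bool:
    """True if input plus requested output fits the model's context window."""
    return input_tokens + max_output_tokens <= context_length

print(fits_context(23643, 32000, 32768))   # False: 55,643 > 32,768
print(fits_context(23643, 32000, 131072))  # True with --max-model-len 131072
```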
Issue 3: Thinking Mode Output Leakage
Symptom: the model emits internal reasoning tokens such as `<think>`, `>>>`, or exposed chain-of-thought.
Cause: Qwen3 models default to "thinking mode", which outputs reasoning before the final response.
Solution: Disable at the server level:
```
--chat-template-kwargs '{"enable_thinking": false}'
```
Or per-request. In a raw HTTP request, vLLM accepts `chat_template_kwargs` as an extra top-level field in the JSON body (with the OpenAI Python SDK, pass the same field via `extra_body`):
```json
{
  "model": "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
  "messages": [...],
  "chat_template_kwargs": {"enable_thinking": false}
}
```
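As a minimal sketch of the per-request override (assuming vLLM's `chat_template_kwargs` request extension; the helper name is ours, and with the OpenAI Python SDK the resulting extra field would be supplied via `extra_body`):

```python
def with_thinking_disabled(payload: dict) -> dict:
    """Merge vLLM's chat-template override into a chat-completions payload."""
    return {**payload, "chat_template_kwargs": {"enable_thinking": False}}

payload = with_thinking_disabled({
    "model": "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Hello"}],
})
print(sorted(payload))  # ['chat_template_kwargs', 'messages', 'model']
```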
Issue 4: Triton Kernels Warning
Warning:
```
Failed to import Triton kernels. No module named 'triton_kernels.routing'
```
Impact: none; this is a warning only, and vLLM falls back to the FLASH_ATTN backend.
Optional Fix:
```shell
pip install triton-kernels@git+https://github.com/triton-lang/triton.git@v3.5.0#subdirectory=python/triton_kernels
```
Issue 5: NVIDIA Driver 590.x Compatibility
Problem: Driver 590.x introduces compatibility issues with GB10 Blackwell systems.
Root Causes:
- Library renaming (`libnvidia-compute` → `libnvidia_compute`)
- Incomplete sm_121 compute capability support
- FlashInfer kernel compilation failures
Recommended Solution: Use driver version 535.x series (stable, tested with vLLM).
Verification:
```shell
nvidia-smi      # Check driver version
nvcc --version  # Verify CUDA compatibility
```
Performance Benchmarks
Throughput Results
| Test | Concurrency | Tokens/Request | Throughput | Latency |
|---|---|---|---|---|
| Sequential | 1 | 128 | 868.81 tok/s | 3.19s |
| Concurrent | 8 | 128 | 315.85 tok/s | 3.24s |
| Sustained | 16 | 256 | 387.71 tok/s | 10.23s |
| Long-form | 4 | 512 | 187.85 tok/s | 10.90s |
Context Window Scaling
| Prompt Size | Actual Tokens | Prefill Time | Prefill Speed |
|---|---|---|---|
| 10K | 10,387 | 1.80s | 5,768 tok/s |
| 30K | 31,547 | 7.15s | 4,413 tok/s |
| 50K | 53,307 | 12.66s | 4,211 tok/s |
| 80K | 85,947 | 29.06s | 2,958 tok/s |
| 100K | 107,707 | 26.43s | 4,074 tok/s |
| 120K | 129,467 | 32.03s | 4,042 tok/s |
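Prefill speed in the table is simply token count divided by prefill time, so projecting wait times for other prompt sizes is straightforward (note the 80K row is an outlier, slower than both its neighbors; the recomputation below matches the table to within rounding):

```python
rows = [  # (actual_tokens, prefill_seconds) from the table above
    (10_387, 1.80), (31_547, 7.15), (53_307, 12.66),
    (85_947, 29.06), (107_707, 26.43), (129_467, 32.03),
]
for tokens, seconds in rows:
    print(f"{tokens:>7,} tokens: {tokens / seconds:,.0f} tok/s")
```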
Quick Start Commands
```shell
# Start server
cd /path/to/vllm-docker
docker compose up -d

# Check status
curl http://localhost:8888/v1/models
docker logs vllm-qwen --tail 50

# Stop server
docker compose down
```
API Usage
Available Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion |
| `/v1/completions` | POST | Text completion |
| `/health` | GET | Health check |
| `/metrics` | GET | Prometheus metrics |
Python Client Example
```python
#!/usr/bin/env python3
"""Test script for vLLM server."""
import requests

BASE_URL = "http://localhost:8888/v1"
MODEL = "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Hello! Please introduce yourself briefly."}
        ],
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=120,
)

if response.status_code == 200:
    data = response.json()
    print(data["choices"][0]["message"]["content"])
else:
    print(f"Error: {response.status_code}")
    print(response.text)
```
Using with OpenAI SDK
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8888/v1",
    api_key="not-needed",  # vLLM doesn't require API keys by default
)

response = client.chat.completions.create(
    model="QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    max_tokens=200,
)

print(response.choices[0].message.content)
```
Optimization Tips
Memory Optimization
- Adjust GPU memory utilization based on your workload:
  ```
  --gpu-memory-utilization 0.8   # Single model, aggressive
  --gpu-memory-utilization 0.5   # Multi-tenant, conservative
  ```
- Use quantized models (AWQ, GPTQ) to reduce memory footprint
- Tune batch sizes for your typical request patterns:
  ```
  --max-num-seqs 32              # Lower for memory-constrained setups
  --max-num-batched-tokens 4096  # Lower for latency-sensitive workloads
  ```
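Sizing these flags is easier with an estimate of KV-cache cost per token: 2 (K and V) × layers × KV heads × head dim × bytes per element. With dimensions typical of this model class (48 layers, 4 KV heads, head dim 128, FP16 cache — assumptions for illustration, not read from the model config), the estimate lands close to the 678,912-token figure reported earlier:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """KV-cache bytes per token: K and V tensors for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=48, kv_heads=4, head_dim=128)  # 98,304 bytes
tokens = 62.16 * 2**30 / per_token  # 62.16 GiB KV-cache budget from above
print(f"{per_token} B/token -> ~{tokens:,.0f} tokens")
```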
Latency Optimization
- Use the FlashAttention backend for faster prefill, selected via environment variable (as in the Compose file above):
  ```
  VLLM_ATTENTION_BACKEND=FLASH_ATTN
  ```
- Enable prefix caching for repeated prompts:
  ```
  --enable-prefix-caching
  ```
- Consider speculative decoding for faster generation:
  ```
  --speculative-model [smaller-model] --num-speculative-tokens 4
  ```
Production Considerations
High Availability
- Deploy multiple vLLM instances behind a load balancer
- Use health checks for automatic failover
- Implement request queuing for burst handling
Monitoring
- Enable Prometheus metrics via the `/metrics` endpoint
- Monitor GPU memory utilization
- Track request latency and throughput
- Set up alerts for error rates
Security
- Bind to localhost only (`127.0.0.1`) for internal services
- Use a reverse proxy (nginx, Traefik) with TLS for external access
- Implement rate limiting
- Consider authentication for multi-tenant deployments
Conclusion
Self-hosting LLM inference puts you in control of your AI infrastructure. With vLLM and proper hardware, you can achieve production-grade performance while maintaining data privacy and predictable costs. Start with the basic configuration, benchmark your workloads, and optimize based on your specific needs.