Self-Hosted LLM Inference: A Complete vLLM Setup Guide
Running your own LLM inference server gives you complete control over data privacy, latency, and costs. This guide walks through deploying a production-ready vLLM server on NVIDIA DGX Spark hardware, with real-world troubleshooting tips from actual deployment experience.
Why Self-Host LLM Inference?
Before diving into the technical setup, consider the benefits of self-hosting:
- Data Privacy: Sensitive data never leaves your infrastructure
- Predictable Costs: No per-token API charges for heavy workloads
- Low Latency: Local inference eliminates network round-trips
- Model Freedom: Run any model, including fine-tuned variants
- No Rate Limits: Scale horizontally without API throttling
Hardware Platform: NVIDIA DGX Spark (GB10)
This guide is based on deployment experience with the ASUS Ascent GX10, powered by NVIDIA's DGX Spark platform featuring the GB10 Grace Blackwell Superchip.
Key Specifications
The GB10 is a high-performance AI-focused system-on-a-chip (SoC) designed for desktop AI workstations:
| Component | Specification |
|---|---|
| CPU | 20-core ARM v9.2-A (10× Cortex-X925 @ 3GHz + 10× Cortex-A725 @ 2GHz) |
| GPU | Blackwell architecture, 6,144 shaders, 5th Gen Tensor Cores, 4th Gen RT Cores |
| AI Performance | 1,000 TOPS FP4 (NVFP4), 31.03 TFLOPS FP32 |
| Memory | 128 GB LPDDR5X-9400 (256-bit bus, 273–301 GB/s bandwidth) |
| Interconnect | NVLink-C2C (600 GB/s bidirectional CPU↔GPU) |
| Cache | 32 MB L3 + 24 MB GPU L2 + 16 MB L4 system cache |
| Power | 140 W TDP |
| Form Factor | 150mm × 150mm × 50.5mm desktop |
| Storage | Up to 4 TB NVMe SSD |
| Connectivity | HDMI 2.1a, 4× USB-C, 10 GbE, 200 Gbps ConnectX-7, Wi-Fi 7, BT 5.4 |
The GB10 Grace Blackwell Superchip is optimized for inference workloads with:
- Native FP4, FP8, and INT4 support for efficient quantization
- Transformer Engine acceleration
- Unified coherent memory architecture
- High memory bandwidth for large context windows
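These quantization formats are what make the memory figures later in this guide work out. Back-of-the-envelope: a ~30B-parameter model with 4-bit (AWQ) weights needs roughly 14 GiB before quantization scales and any unquantized layers, consistent with the ~17 GiB model footprint reported below. A sketch of the arithmetic (the parameter count is approximate):

```python
params = 30e9   # ~30B parameters (approximate, for illustration)
bits = 4        # AWQ 4-bit weight quantization
weights_gib = params * bits / 8 / 2**30
print(f"~{weights_gib:.1f} GiB of raw weights")  # ~14.0 GiB
```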
vLLM: The Inference Engine
vLLM is a high-performance LLM inference engine that provides:
- PagedAttention: Efficient memory management for KV cache
- Continuous Batching: Dynamic request batching
- Optimized Kernels: FlashAttention, FlashInfer integration
- OpenAI-Compatible API: Drop-in replacement for OpenAI clients
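The PagedAttention idea is easiest to see in miniature: instead of reserving one contiguous KV-cache region per sequence, the cache is split into fixed-size blocks handed out on demand, so memory is never stranded by over-reservation. A toy sketch (illustration only; vLLM's real allocator manages GPU tensors and block tables):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    """Toy paged allocator: sequences get block lists, not contiguous slabs."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self, num_tokens: int) -> list[int]:
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("KV cache exhausted")
        return [self.free.pop() for _ in range(needed)]

    def release(self, block_table: list[int]) -> None:
        self.free.extend(block_table)

alloc = BlockAllocator(num_blocks=1024)
seq_a = alloc.allocate(100)  # 100 tokens -> 7 blocks (not a 131K reservation)
seq_b = alloc.allocate(17)   # 17 tokens  -> 2 blocks
alloc.release(seq_a)         # blocks return to the shared pool
print(len(seq_a), len(seq_b))  # 7 2
```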
Docker Deployment
The recommended deployment method uses Docker for reproducibility and isolation.
Docker Compose Configuration
```yaml
services:
  vllm:
    image: vllm-node:latest
    container_name: vllm-qwen
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "0.0.0.0:8888:8888"
    volumes:
      - ./templates/qwen3_chat.jinja:/templates/qwen3_chat.jinja:ro
      - ./data/sharegpt.json:/data/sharegpt.json:ro
      - ~/.cache/huggingface:/root/.cache/huggingface
    ipc: host
    shm_size: 32g
    environment:
      - VLLM_USE_FLASHINFER_MOE_FP8=1
      - VLLM_FLASHINFER_MOE_BACKEND=latency
      - VLLM_ATTENTION_BACKEND=FLASH_ATTN
      - VLLM_USE_DEEP_GEMM=0
    command:
      - vllm
      - serve
      - QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ
      - --port
      - "8888"
      - --host
      - 0.0.0.0
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.7"
      - --load-format
      - fastsafetensors
      - --max-num-seqs
      - "64"
      - --max-num-batched-tokens
      - "8192"
      - --trust-remote-code
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --chat-template-kwargs
      - '{"enable_thinking": false}'
```
Configuration Parameters Explained
| Parameter | Value | Purpose |
|---|---|---|
| `--max-model-len` | 131072 | 128K-token context window |
| `--gpu-memory-utilization` | 0.7 | Let vLLM use up to 70% of GPU memory (weights + KV cache) |
| `--load-format` | fastsafetensors | Fast model loading |
| `--max-num-seqs` | 64 | Maximum concurrent sequences |
| `--max-num-batched-tokens` | 8192 | Token batch limit for prefill |
| `--enable-auto-tool-choice` | flag | Enable function calling |
| `--tool-call-parser` | qwen3_coder | Tool-call format parser |
Resource Allocation
With this configuration on GB10 hardware:
| Resource | Allocation |
|---|---|
| Model Memory | ~17 GiB |
| KV Cache | 62.16 GiB |
| KV Cache Tokens | 678,912 |
| Max Concurrency | 5.18x (at 131K context) |
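The max-concurrency figure is just KV-cache capacity divided by context length: 678,912 cached tokens across a 131,072-token window supports about 5.18 full-context sequences at once. A quick sanity check:

```python
kv_cache_tokens = 678_912   # from the table above
max_model_len = 131_072     # --max-model-len

print(f"{kv_cache_tokens / max_model_len:.2f}x")  # 5.18x
```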
Common Issues and Solutions
Issue 1: Tool Choice Configuration Error
Error:
```
"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set
```
Solution: Add both flags to enable tool calling:
```
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
```
Note: Qwen3-VL models DO support tool calling with the qwen3_coder parser, despite some documentation suggesting otherwise.
Issue 2: Context Window Overflow
Error:
```
You passed 23643 input tokens and requested 32000 output tokens.
However, the model's context length is only 32768.
```
Solution: Increase --max-model-len to accommodate both input and output:
```
--max-model-len 131072  # 128K tokens
```
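This class of error can also be caught client-side: verify that prompt tokens plus requested output tokens fit the server's context length before sending. A small helper (the token counts below are the ones from the error message; in practice they would come from your tokenizer):

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_length: int) -> bool:
    """True if input plus requested output fits the model's context window."""
    return input_tokens + max_output_tokens <= context_length

print(fits_context(23643, 32000, 32768))   # False: 55,643 > 32,768
print(fits_context(23643, 32000, 131072))  # True with --max-model-len 131072
```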
Issue 3: Thinking Mode Output Leakage
Symptom: the model emits internal reasoning tokens such as `<think>`, `>>>`, or exposed chain-of-thought.
Cause: Qwen3 models default to "thinking mode", which outputs reasoning before the final response.
Solution: Disable at the server level:
```
--chat-template-kwargs '{"enable_thinking": false}'
```
Or per-request. In a raw HTTP request, vLLM accepts `chat_template_kwargs` as an extra top-level field in the JSON body (with the OpenAI Python SDK, pass the same field via `extra_body`):
```json
{
  "model": "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
  "messages": [...],
  "chat_template_kwargs": {"enable_thinking": false}
}
```
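As a minimal sketch of the per-request override (assuming vLLM's `chat_template_kwargs` request extension; the helper name is ours, and with the OpenAI Python SDK the resulting extra field would be supplied via `extra_body`):

```python
def with_thinking_disabled(payload: dict) -> dict:
    """Merge vLLM's chat-template override into a chat-completions payload."""
    return {**payload, "chat_template_kwargs": {"enable_thinking": False}}

payload = with_thinking_disabled({
    "model": "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Hello"}],
})
print(sorted(payload))  # ['chat_template_kwargs', 'messages', 'model']
```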
Issue 4: Triton Kernels Warning
Warning:
```
Failed to import Triton kernels. No module named 'triton_kernels.routing'
```
Impact: none; this is a warning only, and vLLM falls back to the FLASH_ATTN backend.
Optional Fix:
```shell
pip install triton-kernels@git+https://github.com/triton-lang/triton.git@v3.5.0#subdirectory=python/triton_kernels
```
Issue 5: NVIDIA Driver 590.x Compatibility
Problem: Driver 590.x introduces compatibility issues with GB10 Blackwell systems.
Root Causes:
- Library renaming (`libnvidia-compute` → `libnvidia_compute`)
- Incomplete sm_121 compute capability support
- FlashInfer kernel compilation failures
Recommended Solution: Use driver version 535.x series (stable, tested with vLLM).
Verification:
```shell
nvidia-smi      # Check driver version
nvcc --version  # Verify CUDA compatibility
```
Performance Benchmarks
Throughput Results
| Test | Concurrency | Tokens/Request | Throughput | Latency |
|---|---|---|---|---|
| Sequential | 1 | 128 | 868.81 tok/s | 3.19s |
| Concurrent | 8 | 128 | 315.85 tok/s | 3.24s |
| Sustained | 16 | 256 | 387.71 tok/s | 10.23s |
| Long-form | 4 | 512 | 187.85 tok/s | 10.90s |
Context Window Scaling
| Prompt Size | Actual Tokens | Prefill Time | Prefill Speed |
|---|---|---|---|
| 10K | 10,387 | 1.80s | 5,768 tok/s |
| 30K | 31,547 | 7.15s | 4,413 tok/s |
| 50K | 53,307 | 12.66s | 4,211 tok/s |
| 80K | 85,947 | 29.06s | 2,958 tok/s |
| 100K | 107,707 | 26.43s | 4,074 tok/s |
| 120K | 129,467 | 32.03s | 4,042 tok/s |
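Prefill speed in the table is simply token count divided by prefill time, so projecting wait times for other prompt sizes is straightforward (note the 80K row is an outlier, slower than both its neighbors; the recomputation below matches the table to within rounding):

```python
rows = [  # (actual_tokens, prefill_seconds) from the table above
    (10_387, 1.80), (31_547, 7.15), (53_307, 12.66),
    (85_947, 29.06), (107_707, 26.43), (129_467, 32.03),
]
for tokens, seconds in rows:
    print(f"{tokens:>7,} tokens: {tokens / seconds:,.0f} tok/s")
```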
Quick Start Commands
```shell
# Start server
cd /path/to/vllm-docker
docker compose up -d

# Check status
curl http://localhost:8888/v1/models
docker logs vllm-qwen --tail 50

# Stop server
docker compose down
```
API Usage
Available Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion |
| `/v1/completions` | POST | Text completion |
| `/health` | GET | Health check |
| `/metrics` | GET | Prometheus metrics |
Python Client Example
```python
#!/usr/bin/env python3
"""Test script for vLLM server."""
import requests

BASE_URL = "http://localhost:8888/v1"
MODEL = "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Hello! Please introduce yourself briefly."}
        ],
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=120,
)

if response.status_code == 200:
    data = response.json()
    print(data["choices"][0]["message"]["content"])
else:
    print(f"Error: {response.status_code}")
    print(response.text)
```
Using with OpenAI SDK
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8888/v1",
    api_key="not-needed",  # vLLM doesn't require API keys by default
)

response = client.chat.completions.create(
    model="QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    max_tokens=200,
)

print(response.choices[0].message.content)
```
Optimization Tips
Memory Optimization
- Adjust GPU memory utilization based on your workload:
  ```
  --gpu-memory-utilization 0.8   # Single model, aggressive
  --gpu-memory-utilization 0.5   # Multi-tenant, conservative
  ```
- Use quantized models (AWQ, GPTQ) to reduce memory footprint
- Tune batch sizes for your typical request patterns:
  ```
  --max-num-seqs 32              # Lower for memory-constrained setups
  --max-num-batched-tokens 4096  # Lower for latency-sensitive workloads
  ```
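Sizing these flags is easier with an estimate of KV-cache cost per token: 2 (K and V) × layers × KV heads × head dim × bytes per element. With dimensions typical of this model class (48 layers, 4 KV heads, head dim 128, FP16 cache — assumptions for illustration, not read from the model config), the estimate lands close to the 678,912-token figure reported earlier:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """KV-cache bytes per token: K and V tensors for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=48, kv_heads=4, head_dim=128)  # 98,304 bytes
tokens = 62.16 * 2**30 / per_token  # 62.16 GiB KV-cache budget from above
print(f"{per_token} B/token -> ~{tokens:,.0f} tokens")
```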
Latency Optimization
- Use the FlashAttention backend for faster prefill, selected via environment variable (as in the Compose file above):
  ```
  VLLM_ATTENTION_BACKEND=FLASH_ATTN
  ```
- Enable prefix caching for repeated prompts:
  ```
  --enable-prefix-caching
  ```
- Consider speculative decoding for faster generation:
  ```
  --speculative-model [smaller-model] --num-speculative-tokens 4
  ```
Production Considerations
High Availability
- Deploy multiple vLLM instances behind a load balancer
- Use health checks for automatic failover
- Implement request queuing for burst handling
Monitoring
- Enable Prometheus metrics via the `/metrics` endpoint
- Monitor GPU memory utilization
- Track request latency and throughput
- Set up alerts for error rates
Security
- Bind to localhost only (`127.0.0.1`) for internal services
- Use a reverse proxy (nginx, Traefik) with TLS for external access
- Implement rate limiting
- Consider authentication for multi-tenant deployments
Conclusion
Self-hosting LLM inference puts you in control of your AI infrastructure. With vLLM and proper hardware, you can achieve production-grade performance while maintaining data privacy and predictable costs. Start with the basic configuration, benchmark your workloads, and optimize based on your specific needs.