Qwen/Qwen3.5-122B-A10B
Mid-size Qwen3.5 multimodal MoE (122B total / 10B active) with gated delta networks, 256 experts, and 262K context
Overview
Qwen3.5-122B-A10B is a mid-tier member of the Qwen3.5 family. It uses the family's gated delta networks MoE architecture, with 122B total parameters and 10B activated per token across 256 experts. The model is multimodal (vision + text) and natively supports a 262K-token context window.
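These numbers can be checked against the checkpoint's own config. A minimal sketch using huggingface_hub; the field names are assumptions based on earlier Qwen MoE releases and may differ here, so the lookups are deliberately defensive:
import json
from huggingface_hub import hf_hub_download

# Fetch only config.json from the Hub and print the headline numbers.
path = hf_hub_download("Qwen/Qwen3.5-122B-A10B", "config.json")
with open(path) as f:
    cfg = json.load(f)
# Multimodal checkpoints often nest the LLM settings under "text_config".
text_cfg = cfg.get("text_config", cfg)
print(text_cfg.get("num_experts"))              # expected: 256
print(text_cfg.get("max_position_embeddings"))  # expected: 262144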
Prerequisites
- vLLM version: >= 0.17.0
- Hardware (BF16): 4x H200 or 8x H100
- Hardware (FP8): 2x H200 or 4x H100
- Hardware (Int4): single 80 GB GPU
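These tiers follow from back-of-the-envelope weight memory: at 2 bytes per parameter (BF16), 122B parameters is roughly 244 GB of weights before KV cache and activations; FP8 halves that, and Int4 halves it again. A quick sanity check:
# Rough weight-memory estimate; ignores KV cache, activations, and runtime overhead.
params = 122e9  # total parameters (all experts count toward weight memory)
for name, bytes_per_param in [("BF16", 2), ("FP8", 1), ("Int4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# BF16: ~244 GB -> 4x H200 (141 GB each) or 8x H100 (80 GB each)
# FP8:  ~122 GB -> 2x H200 or 4x H100
# Int4: ~61 GB  -> fits a single 80 GB GPU, leaving headroom for KV cache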
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
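To confirm the installed build meets the version floor before serving:
# Check the installed vLLM against the >= 0.17.0 requirement.
import vllm
from packaging.version import Version
assert Version(vllm.__version__) >= Version("0.17.0"), vllm.__version__
print("vLLM", vllm.__version__)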
Launching the Server
BF16 on 4xH200 (TP4)
vllm serve Qwen/Qwen3.5-122B-A10B \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--reasoning-parser qwen3
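Once the server logs that it is ready, a quick smoke test from the client side (assumes the default port 8000):
from openai import OpenAI

# The served model id should match the repo name passed to vllm serve.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
for model in client.models.list():
    print(model.id)  # expect Qwen/Qwen3.5-122B-A10B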
FP8 on 2xH200 (TP2)
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--reasoning-parser qwen3
Throughput-focused (text-only, EP)
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
-dp 4 --enable-expert-parallel \
--language-model-only \
--reasoning-parser qwen3 \
--enable-prefix-caching
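This configuration trades single-request latency for aggregate throughput, so it pays off under many concurrent requests. A minimal concurrency sketch with the async OpenAI client (the prompts are illustrative):
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3.5-122B-A10B-FP8",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main():
    # Fire requests concurrently; the server batches them across DP ranks.
    prompts = [f"One-line summary of topic #{i}, please." for i in range(32)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(f"{len(answers)} completions received")

asyncio.run(main())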
MTP speculative decoding
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--tensor-parallel-size 2 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
--reasoning-parser qwen3
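With method "mtp", draft tokens come from the checkpoint's own multi-token-prediction head, so no separate draft model is loaded. The same config can be used with vLLM's offline LLM API; a sketch, assuming the speculative_config dict argument accepted by recent vLLM releases:
from vllm import LLM, SamplingParams

# Offline equivalent of the serve flags above: MTP drafts one extra token
# per step, which the target model then verifies.
llm = LLM(
    model="Qwen/Qwen3.5-122B-A10B-FP8",
    tensor_parallel_size=2,
    speculative_config={"method": "mtp", "num_speculative_tokens": 1},
)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(max_tokens=200),
)
print(outputs[0].outputs[0].text)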
Client Usage
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[{"role": "user", "content": "Summarize the gated delta networks paper."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
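Because the model is multimodal, images go through the standard OpenAI content-parts format. This assumes a server started without --language-model-only; the image URL is a placeholder:
# Vision request: mix image_url and text parts in one user message.
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "What does this chart show?"},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)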
Troubleshooting
- CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size (default 512). See vLLM PR #34571.
- Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}' (a per-request variant is sketched below).
- Prefix caching (Mamba): currently experimental in "align" mode.
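Reasoning can also be disabled per request rather than server-wide, via the chat_template_kwargs passthrough in vLLM's OpenAI-compatible API (a sketch; assumes the Qwen3-style enable_thinking template switch applies to this model):
# Per-request alternative to --default-chat-template-kwargs.
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[{"role": "user", "content": "Just the answer: 17 * 24 = ?"}],
    max_tokens=32,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)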