Qwen/Qwen3.5-35B-A3B
Compact Qwen3.5 multimodal MoE (35B total / 3B active) with gated delta networks, 256 experts, and 262K context
Overview
Qwen3.5-35B-A3B is the smallest MoE in the Qwen3.5 family. It shares the family's gated delta network architecture, with 35B total parameters across 256 experts and 3B activated per token. With FP8 weights it fits on a single 80 GB GPU and supports the full 262K context.
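The single-GPU claim follows from simple arithmetic. A back-of-envelope sketch (parameter counts from this card; ignores KV/Mamba cache and activation overhead):

```python
# Back-of-envelope weight memory for Qwen3.5-35B-A3B (illustrative only).
TOTAL_PARAMS = 35e9   # total parameters across all 256 experts
ACTIVE_PARAMS = 3e9   # parameters activated per token (compute cost)

def weight_gib(params: float, bytes_per_param: float) -> float:
    """Weight memory in GiB at a given precision."""
    return params * bytes_per_param / 2**30

fp8 = weight_gib(TOTAL_PARAMS, 1)   # FP8: 1 byte/param
bf16 = weight_gib(TOTAL_PARAMS, 2)  # BF16: 2 bytes/param

print(f"FP8 weights:  {fp8:.1f} GiB")   # ~32.6 GiB, fits on one 80 GB GPU
print(f"BF16 weights: {bf16:.1f} GiB")  # ~65.2 GiB, leaves little cache headroom
```

Note that per-token compute scales with the 3B active parameters, not the full 35B, which is why decoding throughput resembles a dense ~3B model.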
Prerequisites
- vLLM version: >= 0.17.0
- Hardware (BF16): 1x H200 or 2x H100
- Hardware (FP8): single H100/H200
- Hardware (Int4): single 24 GB GPU
Install vLLM
```shell
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
```
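To confirm the installed build meets the >= 0.17.0 prerequisite, a minimal version gate can help (a sketch: it only compares the first three numeric components, so prefer `packaging.version.Version` for rc/post releases):

```python
# Minimal check against the vLLM version prerequisite (>= 0.17.0).
MIN_VLLM = (0, 17, 0)

def meets_min(version: str, minimum: tuple = MIN_VLLM) -> bool:
    # Naive parse: "0.17.0" -> (0, 17, 0); pre-release suffixes not handled.
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= minimum

# Usage sketch once vllm is installed:
#   import vllm
#   assert meets_min(vllm.__version__)
print(meets_min("0.17.0"), meets_min("0.16.2"))  # True False
```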
Launching the Server
Single-GPU FP8
```shell
vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3
```
BF16 on 2x H100 (TP2)
```shell
vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3
```
MTP speculative decoding
```shell
vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3
```
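`--speculative-config` takes a JSON string, and a shell quoting mistake only surfaces after a slow server start. A quick pre-flight parse (illustrative) catches that early:

```python
import json

# Validate the --speculative-config payload before launching the server.
spec = '{"method": "mtp", "num_speculative_tokens": 1}'
cfg = json.loads(spec)  # raises json.JSONDecodeError on bad quoting

assert cfg["method"] == "mtp"
assert isinstance(cfg["num_speculative_tokens"], int)
print(cfg)
```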
Client Usage
```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "Explain gated delta networks in one paragraph."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
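For long generations, the same endpoint also supports streaming. A sketch (the payload helper is a convenience introduced here, not part of the client API; the client import is deferred so the helper works standalone):

```python
def build_request(prompt: str, stream: bool = True) -> dict:
    # Payload for the OpenAI-compatible /v1/chat/completions endpoint.
    return {
        "model": "Qwen/Qwen3.5-35B-A3B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "stream": stream,
    }

def stream_chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> None:
    # Requires a running vLLM server at base_url.
    from openai import OpenAI
    client = OpenAI(api_key="EMPTY", base_url=base_url)
    for chunk in client.chat.completions.create(**build_request(prompt)):
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
```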
Troubleshooting
- CUDA graph / Mamba cache size error: reduce `--max-cudagraph-capture-size` (default 512). See vLLM PR #34571.
- Disable reasoning: add `--default-chat-template-kwargs '{"enable_thinking": false}'`.
- Prefix caching (Mamba): currently experimental in "align" mode.
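Beyond the server-wide flag, vLLM's OpenAI-compatible server also accepts chat-template kwargs per request through the client's `extra_body` (a sketch; verify the `chat_template_kwargs` field name against your vLLM version):

```python
# Per-request reasoning toggle via extra_body, leaving the server default alone.
# Field name assumed from vLLM's OpenAI-compatible API docs; verify for your version.
extra_body = {"chat_template_kwargs": {"enable_thinking": False}}

# Usage sketch with a live server:
#   client.chat.completions.create(model=..., messages=..., extra_body=extra_body)
print(extra_body)
```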