Qwen/Qwen3.5-35B-A3B

Compact Qwen3.5 multimodal MoE (35B total / 3B active) with gated delta networks, 256 experts, and 262K context

MoE · 35B total / 3B active · 262,144 context · vLLM 0.17.0+ · multimodal · text

Overview

Qwen3.5-35B-A3B is the smallest MoE model in the Qwen3.5 family. Like the larger Qwen3.5 models, it uses the gated delta network architecture, with 35B total parameters, 3B activated per token, and 256 experts. With FP8 weights it fits on a single 80 GB GPU and supports the full 262,144-token context.

Prerequisites

  • vLLM version: >= 0.17.0
  • Hardware (BF16): 1x H200 or 2x H100
  • Hardware (FP8): single H100/H200
  • Hardware (Int4): single 24 GB GPU

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto

Launching the Server

Single-GPU FP8

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

BF16 on 2x H100 (TP2)

vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

MTP speculative decoding

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3
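The value passed to --speculative-config is a JSON object. A minimal sketch of building it programmatically, using only the keys shown in the command above ("method", "num_speculative_tokens"); raising num_speculative_tokens drafts more tokens per step, trading extra draft work for potentially more tokens accepted per target-model forward pass:

```python
import json

# Build the JSON string passed to `vllm serve --speculative-config ...`.
# Keys are taken from the command above; other keys may exist but are not
# assumed here.
spec_config = {"method": "mtp", "num_speculative_tokens": 1}
arg = json.dumps(spec_config)
print(arg)  # -> {"method": "mtp", "num_speculative_tokens": 1}
```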

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "Explain gated delta networks in one paragraph."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
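When the server is launched with --reasoning-parser qwen3, the model's thinking is returned separately from the final answer. A small helper sketch for reading both; the reasoning_content attribute name is assumed from vLLM's reasoning-outputs documentation and may differ across versions:

```python
# Hedged sketch: split a chat completion message into (reasoning, answer).
# The `reasoning_content` field name is an assumption based on vLLM's
# reasoning-parser behavior; it is absent when no reasoning was produced.
def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message object."""
    reasoning = getattr(message, "reasoning_content", None)
    answer = getattr(message, "content", None)
    return reasoning, answer
```

Usage with the client above: reasoning, answer = split_reasoning(resp.choices[0].message).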

Troubleshooting

  • CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size (default 512). See vLLM PR #34571.
  • Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}'.
  • Prefix Caching (Mamba): currently experimental in "align" mode.
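The flag above disables reasoning server-wide. A per-request sketch using the OpenAI client's extra_body passthrough; the chat_template_kwargs key is assumed from vLLM's OpenAI-compatible server documentation:

```python
# Hedged sketch: toggle thinking for a single request instead of the whole
# server. `chat_template_kwargs` is an assumed vLLM passthrough key.
request_kwargs = dict(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
# resp = client.chat.completions.create(**request_kwargs)  # needs a running server
print(request_kwargs["extra_body"])
```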

References