Qwen/Qwen3.5-122B-A10B
Mid-size Qwen3.5 multimodal MoE (122B total / 10B active) with gated delta networks, 256 experts, and 262K context
Overview
Qwen3.5-122B-A10B is a mid-tier member of the Qwen3.5 family. It uses the family's gated delta networks MoE architecture, with 122B total parameters and 10B activated per token across 256 experts. The model is multimodal (vision + text) and natively supports a 262K-token context window.
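These numbers can be checked against the checkpoint's own config. A minimal sketch using huggingface_hub; the field names are assumptions based on earlier Qwen MoE releases and may differ here, so the lookups are deliberately defensive:
import json
from huggingface_hub import hf_hub_download

# Fetch only config.json from the Hub and print the headline numbers.
path = hf_hub_download("Qwen/Qwen3.5-122B-A10B", "config.json")
with open(path) as f:
    cfg = json.load(f)
# Multimodal checkpoints often nest the LLM settings under "text_config".
text_cfg = cfg.get("text_config", cfg)
print(text_cfg.get("num_experts"))              # expected: 256
print(text_cfg.get("max_position_embeddings"))  # expected: 262144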
Prerequisites
- vLLM version: >= 0.17.0
- Hardware (BF16): 4x H200 or 8x H100
- Hardware (FP8): 2x H200 or 4x H100
- Hardware (Int4): single 80 GB GPU
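These tiers follow from back-of-the-envelope weight memory: at 2 bytes per parameter (BF16), 122B parameters is roughly 244 GB of weights before KV cache and activations; FP8 halves that, and Int4 halves it again. A quick sanity check:
# Rough weight-memory estimate; ignores KV cache, activations, and runtime overhead.
params = 122e9  # total parameters (all experts count toward weight memory)
for name, bytes_per_param in [("BF16", 2), ("FP8", 1), ("Int4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# BF16: ~244 GB -> 4x H200 (141 GB each) or 8x H100 (80 GB each)
# FP8:  ~122 GB -> 2x H200 or 4x H100
# Int4: ~61 GB  -> fits a single 80 GB GPU, leaving headroom for KV cache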
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
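To confirm the installed build meets the version floor before serving:
# Check the installed vLLM against the >= 0.17.0 requirement.
import vllm
from packaging.version import Version
assert Version(vllm.__version__) >= Version("0.17.0"), vllm.__version__
print("vLLM", vllm.__version__)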
Launching the Server
BF16 on 4xH200 (TP4)
vllm serve Qwen/Qwen3.5-122B-A10B \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--reasoning-parser qwen3
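Once the server logs that it is ready, a quick smoke test from the client side (assumes the default port 8000):
from openai import OpenAI

# The served model id should match the repo name passed to vllm serve.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
for model in client.models.list():
    print(model.id)  # expect Qwen/Qwen3.5-122B-A10B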
FP8 on 2xH200 (TP2)
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--reasoning-parser qwen3
Throughput-focused (text-only, EP)
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
-dp 4 --enable-expert-parallel \
--language-model-only \
--reasoning-parser qwen3 \
--enable-prefix-caching
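This configuration trades single-request latency for aggregate throughput, so it pays off under many concurrent requests. A minimal concurrency sketch with the async OpenAI client (the prompts are illustrative):
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3.5-122B-A10B-FP8",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main():
    # Fire requests concurrently; the server batches them across DP ranks.
    prompts = [f"One-line summary of topic #{i}, please." for i in range(32)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(f"{len(answers)} completions received")

asyncio.run(main())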
MTP speculative decoding
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 \
--tensor-parallel-size 2 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
--reasoning-parser qwen3
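With method "mtp", draft tokens come from the checkpoint's own multi-token-prediction head, so no separate draft model is loaded. The same config can be used with vLLM's offline LLM API; a sketch, assuming the speculative_config dict argument accepted by recent vLLM releases:
from vllm import LLM, SamplingParams

# Offline equivalent of the serve flags above: MTP drafts one extra token
# per step, which the target model then verifies.
llm = LLM(
    model="Qwen/Qwen3.5-122B-A10B-FP8",
    tensor_parallel_size=2,
    speculative_config={"method": "mtp", "num_speculative_tokens": 1},
)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(max_tokens=200),
)
print(outputs[0].outputs[0].text)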
Client Usage
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[{"role": "user", "content": "Summarize the gated delta networks paper."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
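Because the model is multimodal, images go through the standard OpenAI content-parts format. This assumes a server started without --language-model-only; the image URL is a placeholder:
# Vision request: mix image_url and text parts in one user message.
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "What does this chart show?"},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)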
Troubleshooting
- CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size (default 512). See vLLM PR #34571.
- Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}' (a per-request variant is sketched below).
- Prefix caching (Mamba): currently experimental in "align" mode.
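Reasoning can also be disabled per request rather than server-wide, via the chat_template_kwargs passthrough in vLLM's OpenAI-compatible API (a sketch; assumes the Qwen3-style enable_thinking template switch applies to this model):
# Per-request alternative to --default-chat-template-kwargs.
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-122B-A10B",
    messages=[{"role": "user", "content": "Just the answer: 17 * 24 = ?"}],
    max_tokens=32,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)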