Qwen/Qwen3.5-27B

Qwen3.5 dense multimodal model (27B) with hybrid gated-delta-network attention, MTP, and 262K context

dense · 27B · 262,144 ctx · vLLM 0.17.0+ · multimodal · text

Overview

Qwen3.5-27B is the flagship dense model of the Qwen3.5 family. It uses the same hybrid gated-delta-network attention as its MoE siblings, accepts both vision and text input, and natively serves a 262,144-token context window. MTP (multi-token prediction) is supported out of the box for low-latency decoding.

Prerequisites

  • vLLM version: >= 0.17.0
  • Hardware (BF16): 1x H200 or 2x H100
  • Hardware (FP8): a single GPU with at least 40 GB of memory (H100/H200/L40S)
  • Hardware (Int4): a single GPU with at least 24 GB of memory (a quick memory check is sketched after this list)
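
To sanity-check that your local GPUs meet these memory budgets before launching, a minimal PyTorch sketch (torch is installed alongside vLLM) is:

import torch

# Print total and free memory for every visible GPU; compare against the
# per-precision requirements listed above.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {total / 1024**3:.0f} GiB total, {free / 1024**3:.0f} GiB free")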

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
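
To confirm the installed version meets the >= 0.17.0 requirement:

import vllm

print(vllm.__version__)  # expect 0.17.0 or newer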

Launching the Server

Single-GPU FP8

vllm serve Qwen/Qwen3.5-27B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3
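
Once the server reports it is ready, you can verify the model is registered (assuming the default port 8000):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
for model in client.models.list():
    print(model.id)  # should include Qwen/Qwen3.5-27B-FP8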

BF16 on 2xH100 (TP2)

vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

MTP speculative decoding

vllm serve Qwen/Qwen3.5-27B-FP8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3
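
Speculative decoding does not change what the model produces; the MTP head drafts tokens that the main model verifies, so the benefit shows up as decode latency. A rough way to eyeball throughput against the server above (prompt and numbers are illustrative only):

import time
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
start = time.perf_counter()
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B-FP8",
    messages=[{"role": "user", "content": "Explain speculative decoding in three sentences."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")

Run the same request against a server launched without --speculative-config to compare.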

Text-only (skip vision encoder)

vllm serve Qwen/Qwen3.5-27B-FP8 \
  --language-model-only \
  --reasoning-parser qwen3 \
  --enable-prefix-caching
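
Prefix caching pays off when many requests share a long common prefix, such as a fixed system prompt. A minimal sketch (the system prompt and questions are placeholders); the second request should prefill faster because the KV state of the shared prefix is reused:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
system = "You are a meticulous code reviewer. " * 200  # long shared prefix

for question in ["Review this function for bugs.", "Suggest performance improvements."]:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3.5-27B-FP8",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)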

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Write a haiku about gated delta networks."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
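
Because the model is multimodal, the same endpoint also accepts images through OpenAI-style image_url content parts, as long as the server was not started with --language-model-only. A minimal sketch continuing with the client created above (the image URL is a placeholder):

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)

When the server is launched with --reasoning-parser qwen3, the parsed reasoning trace is returned separately from the final answer and can be read from resp.choices[0].message.reasoning_content.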

Troubleshooting

  • CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size (default 512). See vLLM PR #34571.
  • Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}' to the serve command (a per-request alternative is sketched after this list).
  • Prefix caching (Mamba): currently experimental, in "align" mode.
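
As a per-request alternative to the server-wide flag above, reasoning can be toggled through chat_template_kwargs in extra_body (this assumes the chat template exposes an enable_thinking switch, as the flag suggests):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B-FP8",
    messages=[{"role": "user", "content": "Give a one-line answer: what is MTP?"}],
    max_tokens=64,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # skip the thinking block
)
print(resp.choices[0].message.content)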

References