Qwen/Qwen3.5-9B

Qwen3.5 dense multimodal model (9B) with gated delta networks hybrid attention, MTP, and 262K context

dense · 9B · 262,144 ctx · vLLM 0.17.0+ · multimodal · text

Overview

Qwen3.5-9B is a dense multimodal model from the Qwen3.5 family. It has the same gated delta networks hybrid attention, vision encoder, 262K context, and multi-token prediction (MTP) support as its larger siblings, but is sized to fit comfortably on a single 24 GB GPU.

Prerequisites

vLLM version: >= 0.17.0
Hardware: a single GPU with at least 24 GB of memory (e.g. RTX 4090 / L4 / A10G at 24 GB, or a larger card such as H100)
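
As a rough sanity check on the 24 GB figure, the weight footprint can be estimated from the parameter count alone. The sketch below assumes exactly 9 billion parameters stored at 2 bytes each (BF16) and treats the card as 24 GiB. The real checkpoint size may differ somewhat, and the KV cache, activations, and CUDA graphs all need memory on top of the weights.

```python
def weight_footprint_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: "9B" is taken as exactly 9e9 parameters in BF16 (2 bytes each).
weights_gib = weight_footprint_gib(9e9)

# Assumption: the GPU offers 24 GiB; whatever is left is shared by the
# KV cache, activations, and CUDA graphs.
headroom_gib = 24 - weights_gib

print(f"weights ~{weights_gib:.1f} GiB, ~{headroom_gib:.1f} GiB left for everything else")
```

If startup fails for lack of memory on a 24 GB card, a smaller --max-model-len is one thing to try, since it shrinks the KV cache the server has to reserve.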

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto

Launching the Server

Single-GPU BF16

vllm serve Qwen/Qwen3.5-9B \
  --max-model-len 262144 \
  --reasoning-parser qwen3
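
Loading the weights can take a while, so scripts that launch the server usually wait for it before sending requests. A minimal sketch of such a wait loop follows. It assumes the server answers GET /health on localhost:8000 (the port used in the client example in this guide). The helper names http_health_ok and wait_until_ready are illustrative, not part of vLLM.

```python
import time
import urllib.request
from typing import Callable


def http_health_ok(base_url: str = "http://localhost:8000") -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(
    probe: Callable[[], bool] = http_health_ok,
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Calling wait_until_ready() with no arguments polls the default endpoint every two seconds for up to ten minutes. Passing a different probe keeps the loop usable behind a proxy or on another port.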

MTP speculative decoding

vllm serve Qwen/Qwen3.5-9B \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3
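
To get a feel for what num_speculative_tokens buys, a common back-of-the-envelope model treats each draft token as accepted independently with probability α. With k draft tokens, the expected number of tokens emitted per verification step is then (1 - α^(k+1)) / (1 - α), which reduces to 1 + α for k = 1. The acceptance rates below are purely illustrative, not measurements of Qwen3.5-9B, and the estimate ignores the cost of running the MTP head itself.

```python
def expected_tokens_per_step(accept_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification step under i.i.d. acceptance.

    Draft tokens are accepted left to right until the first rejection, and
    the target model always contributes one token of its own.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        # Every draft is accepted: k drafts plus the target model's token.
        return float(num_speculative_tokens + 1)
    return (1 - accept_rate ** (num_speculative_tokens + 1)) / (1 - accept_rate)


# Hypothetical acceptance rates, with one speculative token as in the command above.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_step(alpha, 1), 2))
```

Under this model a single speculative token can at best double the tokens produced per step, and the actual gain depends on the acceptance rate observed for a given workload.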

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
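
Because the model is multimodal, the same endpoint can also take an image alongside the text. The sketch below uses the OpenAI-style image_url content part. The helper names build_image_messages and describe_image and the example URL are placeholders, and it assumes the server was started with one of the commands above and can fetch the image.

```python
from typing import Any


def build_image_messages(image_url: str, prompt: str) -> list[dict[str, Any]]:
    """Build one user message that pairs an image with a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def describe_image(image_url: str, prompt: str = "Describe this image.") -> str:
    """Send the image and prompt to the local server and return the reply text."""
    # Imported here so the payload helper above has no third-party dependency.
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    resp = client.chat.completions.create(
        model="Qwen/Qwen3.5-9B",
        messages=build_image_messages(image_url, prompt),
        max_tokens=256,
    )
    return resp.choices[0].message.content


# Placeholder URL: replace it with an image the server can actually reach.
messages = build_image_messages("https://example.com/cat.jpg", "What is in this picture?")
```

With the server running, describe_image("https://example.com/cat.jpg") would return the model's description as a string.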

Troubleshooting

CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size (default 512). See vLLM PR #34571.
Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}' to the vllm serve command.
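
The server flag changes the default for every request. Reasoning can also be switched off for a single request from the client side. The sketch below assumes the server honors a chat_template_kwargs field in the request body, as vLLM's OpenAI-compatible server does for earlier Qwen3 models. The helper names chat_kwargs and ask are illustrative.

```python
from typing import Any


def chat_kwargs(prompt: str, enable_thinking: bool = True) -> dict[str, Any]:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": "Qwen/Qwen3.5-9B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        # extra_body is merged into the request JSON by the OpenAI client.
        # Assumption: the server passes chat_template_kwargs to the chat template.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }


def ask(prompt: str, enable_thinking: bool = True) -> str:
    """Send one prompt to the local server and return the reply text."""
    # Imported here so chat_kwargs can be used without the client installed.
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    resp = client.chat.completions.create(**chat_kwargs(prompt, enable_thinking))
    return resp.choices[0].message.content
```

For example, ask("Hello!", enable_thinking=False) requests a direct answer for that call only, leaving the server-wide default untouched.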