Qwen/Qwen3.5-4B
Qwen3.5 compact dense multimodal model (4B) — fits on 16 GB consumer GPUs with full 262K context
Overview
Qwen3.5-4B is the compact dense entry in the Qwen3.5 family. It shares the Gated Delta Networks architecture, vision encoder, 262K-token context window, and MTP (multi-token prediction) decoding of its larger siblings, and is sized for 16 GB consumer GPUs.
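To see why a 4B dense model is a plausible fit for a 16 GB card, a rough estimate of the weight footprint can help. The sketch below assumes exactly 4e9 parameters stored in 16-bit precision (2 bytes each). Both figures are illustrative assumptions, and the vision encoder, KV cache, and activation memory are not modeled, so treat the result as a lower bound rather than a measurement.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: "4B" is taken as exactly 4e9 parameters in 16-bit precision.
weights_gib = weight_memory_gib(4e9)

# Whatever is left on the card has to cover KV cache, activations, and CUDA graphs.
headroom_gib = 16 - weights_gib

print(f"weights: {weights_gib:.2f} GiB, headroom on a 16 GiB card: {headroom_gib:.2f} GiB")
```

The headroom figure is the budget that the long context window competes for, which is why lowering --max-model-len is a common first step when memory is tight.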
Prerequisites
- vLLM version: >= 0.17.0
- Hardware: single GPU with at least 16 GB of memory (RTX 4080 / L4 / A10 / T4)
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
Launching the Server
vllm serve Qwen/Qwen3.5-4B \
  --max-model-len 262144 \
  --reasoning-parser qwen3
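Loading the weights and capturing CUDA graphs can take a while, so it may be useful to wait until the server actually lists the model before sending requests. The following is a minimal sketch that polls the OpenAI-compatible /v1/models endpoint. The helper names (served_models, wait_until_ready), the injectable fetch parameter, and the timeout values are illustrative choices, not part of vLLM.

```python
import json
import time
import urllib.error
import urllib.request


def served_models(base_url: str = "http://localhost:8000/v1", fetch=None) -> list:
    """Return the model IDs reported by the OpenAI-compatible /models endpoint."""
    if fetch is None:
        def fetch(url: str) -> bytes:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read()
    payload = json.loads(fetch(f"{base_url}/models"))
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(model: str, timeout_s: float = 600.0,
                     poll_s: float = 5.0, fetch=None) -> bool:
    """Poll until `model` is listed by the server or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if model in served_models(fetch=fetch):
                return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # Server is probably still starting up; retry after a pause.
        time.sleep(poll_s)
    return False


# Example (requires a running server):
# wait_until_ready("Qwen/Qwen3.5-4B")
```

The fetch parameter exists only so the polling logic can be exercised without a live server.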
MTP speculative decoding
vllm serve Qwen/Qwen3.5-4B \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3
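The value passed to --speculative-config is a JSON object, and hand-quoting it in a shell is easy to get wrong. As a sketch, the command can be assembled in Python so the JSON is serialized and shell-quoted automatically. The function name vllm_serve_command is hypothetical, and values of num_speculative_tokens other than 1 are untested assumptions to experiment with, not recommendations from this guide.

```python
import json
import shlex


def vllm_serve_command(model: str, num_speculative_tokens: int = 1) -> str:
    """Build a shell-safe `vllm serve` command with MTP speculative decoding."""
    spec_config = {
        "method": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    argv = [
        "vllm", "serve", model,
        "--speculative-config", json.dumps(spec_config),
        "--reasoning-parser", "qwen3",
    ]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(argv)


print(vllm_serve_command("Qwen/Qwen3.5-4B"))
```

Printing the command rather than executing it keeps the sketch side-effect free; the resulting string can be pasted into a terminal or a launch script.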
Client Usage
from openai import OpenAI

# The api_key value is a placeholder; the client just requires a non-empty string.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
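When the server runs with --reasoning-parser qwen3, the model's thinking is returned separately from the final answer. The sketch below shows one way to read both parts. It assumes the reasoning text is exposed on the message as either reasoning or reasoning_content (the attribute name has differed between vLLM versions, so check the response from your installation); split_reasoning is a hypothetical helper, and SimpleNamespace stands in for a real response message.

```python
from types import SimpleNamespace


def split_reasoning(message) -> tuple:
    """Return (reasoning, answer) from a chat completion message.

    Assumption: the reasoning text lives in `reasoning` or `reasoning_content`,
    depending on the vLLM version; both are checked and either may be absent.
    """
    reasoning = (
        getattr(message, "reasoning", None)
        or getattr(message, "reasoning_content", None)
        or ""
    )
    return reasoning, message.content or ""


# Stand-in for resp.choices[0].message from a real request.
fake_message = SimpleNamespace(
    content="Hi there!",
    reasoning_content="The user greeted me; reply briefly.",
)
reasoning, answer = split_reasoning(fake_message)
print("reasoning:", reasoning)
print("answer:", answer)
```

With a live server, the same helper would be called as split_reasoning(resp.choices[0].message). Note that a small max_tokens budget can be consumed entirely by the reasoning, leaving the answer empty.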
Troubleshooting
- CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size.
- Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}'.
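Reasoning can also be switched off for a single request instead of server-wide. The sketch below builds the keyword arguments for client.chat.completions.create, passing chat_template_kwargs through the OpenAI client's extra_body parameter. The helper name chat_kwargs is hypothetical, and the sketch assumes the server forwards chat_template_kwargs to the chat template as vLLM's OpenAI-compatible server does.

```python
def chat_kwargs(prompt: str, enable_thinking: bool = False,
                model: str = "Qwen/Qwen3.5-4B") -> dict:
    """Build request arguments that toggle thinking for one request only."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        # extra_body carries fields that are not part of the OpenAI schema.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }


kwargs = chat_kwargs("Hello!")
print(kwargs["extra_body"])

# With a running server and an OpenAI client:
# resp = client.chat.completions.create(**kwargs)
```

A per-request toggle leaves the server default untouched, so other clients of the same endpoint keep their reasoning output.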