Qwen/Qwen3.5-2B
Qwen3.5 mini dense multimodal model (2B) — edge / low-VRAM serving with 262K context
Overview
Qwen3.5-2B is a compact dense Qwen3.5 model: it keeps the full gated delta networks architecture, the vision encoder, and the 262K-token context window in a footprint small enough for 8 GB consumer GPUs and edge inference.
Prerequisites
- vLLM version: >= 0.17.0
- Hardware: single 8 GB GPU
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
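To confirm the installed vLLM meets the >= 0.17.0 prerequisite, a minimal sketch using the standard library (the version parsing here is a naive assumption that handles plain release strings like "0.17.0", not dev or rc builds):

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = (0, 17, 0)  # minimum version from the prerequisites above

def parse(v: str) -> tuple:
    # Naive parse: assumes a plain release string like "0.17.0".
    return tuple(int(p) for p in v.split(".")[:3])

try:
    ok = parse(version("vllm")) >= MIN_VLLM
    print("vLLM version OK" if ok else "vLLM too old; rerun: uv pip install -U vllm")
except PackageNotFoundError:
    print("vLLM is not installed")
```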
Launching the Server
vllm serve Qwen/Qwen3.5-2B \
  --max-model-len 262144 \
  --reasoning-parser qwen3
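Once the server is up, it exposes an OpenAI-compatible API on port 8000. A quick way to verify it is reachable is to query the `/v1/models` endpoint; the sketch below (standard library only, assuming the default host and port) returns an empty list instead of raising if the server has not started yet:

```python
import json
import urllib.error
import urllib.request

BASE = "http://localhost:8000/v1"  # default vllm serve address

def list_models(base: str = BASE, timeout: float = 2.0) -> list:
    # Query the OpenAI-compatible /v1/models endpoint; returns [] when the
    # server is not reachable (e.g. before `vllm serve` has finished loading).
    try:
        with urllib.request.urlopen(f"{base}/models", timeout=timeout) as r:
            data = json.load(r)
        return [m["id"] for m in data.get("data", [])]
    except (urllib.error.URLError, OSError):
        return []

print(list_models())
```

With the server running, the list should contain "Qwen/Qwen3.5-2B".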
Client Usage
from openai import OpenAI

# Point the OpenAI client at the local vLLM server's OpenAI-compatible endpoint.
# vLLM ignores the API key, but the client requires a non-empty value.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-2B",
    messages=[{"role": "user", "content": "Hi!"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
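Under the hood, the client POSTs a JSON body to `/v1/chat/completions`. The sketch below shows the equivalent payload for the request above (a minimal illustration, not an exhaustive list of supported parameters):

```python
import json

# Equivalent JSON body the OpenAI client sends to POST /v1/chat/completions.
payload = {
    "model": "Qwen/Qwen3.5-2B",
    "messages": [{"role": "user", "content": "Hi!"}],
    "max_tokens": 64,
}
print(json.dumps(payload, indent=2))
```

This is useful for debugging with curl or any plain HTTP client instead of the OpenAI SDK.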