Qwen/Qwen3.5-0.8B

Qwen3.5 tiny dense multimodal model (0.8B) — ultra-low-VRAM / edge serving with 262K context

dense · 0.8B · 262,144 ctx · vLLM 0.17.0+ · multimodal · text

Guide

Overview

Qwen3.5-0.8B is the smallest member of the Qwen3.5 family. It keeps the family's hybrid Gated Delta Networks architecture and 262K-token context window, at a size suited to edge devices or to use as a draft model for speculative decoding with larger Qwen3.5 checkpoints.
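
The draft-model use case mentioned above could be wired up roughly as follows. This is a minimal sketch, not a configuration confirmed by this guide: it assumes vLLM's `--speculative-config` JSON option with the `model` and `num_speculative_tokens` keys, the helper name `speculative_serve_command` is made up for illustration, and the target checkpoint is a placeholder you would replace with a real larger Qwen3.5 model. Whether draft-model speculation is supported for this architecture in a given vLLM release should be checked against the vLLM documentation.

```python
import json
import shlex

# The small checkpoint from this guide, used here as the draft model.
DRAFT_MODEL = "Qwen/Qwen3.5-0.8B"


def speculative_serve_command(target_model: str, num_speculative_tokens: int = 4) -> str:
    """Build a `vllm serve` command that pairs a larger target with the 0.8B draft.

    `target_model` is a placeholder supplied by the caller; this guide does not
    name a specific larger checkpoint.
    """
    spec = {
        "model": DRAFT_MODEL,
        # Number of tokens the draft proposes per step (a tunable assumption).
        "num_speculative_tokens": num_speculative_tokens,
    }
    args = ["vllm", "serve", target_model, "--speculative-config", json.dumps(spec)]
    # shlex.join quotes the JSON so the command can be pasted into a shell.
    return shlex.join(args)
```

Calling `speculative_serve_command("<larger-Qwen3.5-checkpoint>")` returns a single shell-safe string; the draft and target must share a tokenizer for speculation to be useful.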

Prerequisites

vLLM version: >= 0.17.0
Hardware: any modern GPU (>=4 GB VRAM)
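
As a rough sanity check on the 4 GB figure, the weight footprint can be estimated from the parameter count. The sketch below assumes 16-bit weights (2 bytes per parameter) and the nominal 0.8B count; the real checkpoint may differ slightly, and the KV cache, activations, and runtime overhead come on top of this.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Nominal 0.8B parameters in bf16/fp16 (assumed precision).
print(f"{weight_memory_gib(0.8e9):.2f} GiB")  # prints "1.49 GiB"
```

Under these assumptions the weights take about 1.5 GiB, which leaves the remainder of a 4 GB card for cache and overhead.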

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto

Launching the Server

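# Serve with the full 262,144-token context window. The qwen3 reasoning parser
# separates the model's reasoning text from the final answer in API responses.
vllm serve Qwen/Qwen3.5-0.8B \
  --max-model-len 262144 \
  --reasoning-parser qwen3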
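
The server takes some time to load weights before it accepts requests. One way to wait for it, sketched here with only the standard library, is to poll the OpenAI-compatible `/v1/models` endpoint until it answers. The function name `wait_for_server`, the timeout, and the polling interval are illustrative choices, and the default URL assumes vLLM's default port 8000.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_server(base_url="http://localhost:8000/v1", timeout_s=300.0, interval_s=2.0):
    """Poll `<base_url>/models` until it responds; return model IDs or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                if resp.status == 200:
                    payload = json.load(resp)
                    # The OpenAI-style listing keeps models under "data".
                    return [m["id"] for m in payload.get("data", [])]
        except (urllib.error.URLError, OSError, ValueError):
            # Server not up yet, or the response was not valid JSON: retry.
            pass
        time.sleep(interval_s)
    return None
```

A launch script can call `wait_for_server()` and proceed once the returned list contains `Qwen/Qwen3.5-0.8B`.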

Client Usage

from openai import OpenAI

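# vLLM exposes an OpenAI-compatible API. The key is a placeholder: the server
# launched above was not started with --api-key, so any value is accepted.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-0.8B",  # must match the model name passed to vllm serve
    messages=[{"role": "user", "content": "Hi!"}],
    max_tokens=64,  # upper bound on generated tokens
)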
print(resp.choices[0].message.content)
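
Because the model is multimodal, the same endpoint can also take image input. The sketch below builds a request in the OpenAI chat format with an `image_url` content part, which vLLM accepts for multimodal models. The helper names (`bytes_to_data_url`, `build_image_messages`, `describe_image`) and the `max_tokens` value are illustrative, not part of this guide.

```python
import base64


def bytes_to_data_url(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL, useful for local files."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_messages(prompt: str, image_url: str) -> list:
    """Build a single-turn chat message with one image and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def describe_image(prompt: str, image_url: str,
                   base_url: str = "http://localhost:8000/v1",
                   model: str = "Qwen/Qwen3.5-0.8B") -> str:
    """Send the image request to a running vLLM server and return the reply text."""
    from openai import OpenAI  # imported lazily so the helpers work without it

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    resp = client.chat.completions.create(
        model=model,
        messages=build_image_messages(prompt, image_url),
        max_tokens=128,
    )
    return resp.choices[0].message.content
```

For a local file, read its bytes and pass `bytes_to_data_url(data)` as `image_url`; for a hosted image, pass its HTTP(S) URL directly.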
