The small model space in 2026 is genuinely crowded. Microsoft, Google, HuggingFace, and Alibaba have all shipped competitive sub-10B models in the last year, and picking the right one for your setup is no longer obvious. This post cuts through the noise: real benchmarks sourced from official model cards and technical reports, real VRAM requirements, and use-case recommendations you can act on today.

This is the reference comparison post for the model family reviews on this blog. Each model covered here has a dedicated deep-dive linked below. As we publish more, they’ll link back here.

What makes a good small language model?

For local inference the metrics that matter are: MMLU (general knowledge, 0–100%), GSM8K (grade-school math problems, 0–100%), HumanEval (code generation pass rate, 0–100%), inference speed in tokens per second, and VRAM usage at Q4_K_M quantization — the threshold that determines whether a model actually fits your hardware.
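If you want to sanity-check whether a model will fit before you download it, a rough back-of-envelope estimate (an illustrative rule of thumb, not an official formula) is parameter count times bits per weight, divided by eight, for the weight file, plus headroom for the KV cache and runtime overhead:

# Rough Q4_K_M sizing: params (billions) x ~4.5 bits/weight / 8 ≈ weight file size in GB,
# plus roughly 30–40% headroom for KV cache, activations, and runtime overhead.
# Example: a 3.8B model such as Phi-4-mini
python3 -c "w = 3.8 * 4.5 / 8; print(f'weights ~{w:.1f} GB, plan for ~{w * 1.4:.1f} GB total')"

That lands at roughly 3 GB for a 3.8B model, which matches the figures quoted below. Actual usage varies with context length, since the KV cache grows with it.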

One upfront caveat: each lab runs benchmarks on different evaluation stacks. Numbers in this post come from each model’s official HuggingFace model card or published technical report. Direct cross-model comparisons are directionally useful but not mathematically precise when evaluation protocols differ.

[Image: Multicoloured lines of code on a dark monitor screen]

Benchmarks tell part of the story — VRAM usage and inference speed tell the rest for local deployment. (Photo: Pexels, free)

The 3–4B class: best for constrained hardware

These models run comfortably in 8 GB of system RAM at Q4 quantization and can run CPU-only at workable speeds. If you’re on a laptop, a Raspberry Pi 5, or a machine without a discrete GPU, this is the category to focus on.

Phi-4-mini (3.8B) — the reasoning king

Microsoft’s Phi-4-mini is the standout 3–4B model in 2026. At 3.8 billion parameters and MIT licensed, it achieves 67.3% MMLU (5-shot), 88.6% GSM8K (8-shot, chain-of-thought), and 64.0% on MATH — numbers that rival models twice its size from just one generation ago. The secret is Microsoft’s “phi recipe”: training on high-quality synthetic reasoning data rather than raw web scrape.

VRAM at Q4_K_M sits around 3 GB, meaning it fits on any modern GPU. On an RTX 4090 it hits roughly 300 tokens/second; even on CPU-only hardware it manages a usable 15–25 tokens/second on a modern multi-core chip.

  • Context window: 128K tokens
  • License: MIT — commercial use, no restrictions
  • Architecture: Dense decoder-only Transformer, 200K vocabulary
  • Run it: ollama run phi4-mini

Phi-4-mini is the right default for anyone who needs the strongest reasoning and math output from a sub-4B model. If you need the next level up with built-in mathematical chain-of-thought, Microsoft also released Phi-4-mini-reasoning as a separately fine-tuned variant.
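Beyond the interactive CLI, Ollama also exposes a local HTTP API, which is the easier path if you are wiring Phi-4-mini into scripts or an agent. A minimal sketch, assuming Ollama is running on its default port and the phi4-mini tag has already been pulled:

# Send a chat request to the local Ollama server (default port 11434)
curl http://localhost:11434/api/chat -d '{
  "model": "phi4-mini",
  "messages": [
    {"role": "user", "content": "A train covers 120 km in 90 minutes. What is its average speed in km/h? Show your working."}
  ],
  "stream": false
}'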

SmolLM3-3B — best open, multilingual, long-context

HuggingFace’s SmolLM3-3B is the most genuinely open model in this class: Apache 2.0 licensed, fully reproducible training data and pipeline, and one of the few 3B models with a verifiable 128K context window. The base model scores 44.1% on MMLU-CF and 67.6% on GSM8K (5-shot) — respectable for 3B, though the more telling numbers come from the instruction-tuned version with extended thinking enabled: 41.7% on GPQA Diamond, 36.7% on AIME 2025, and 30.0% on LiveCodeBench in reasoning mode.

SmolLM3 was specifically trained with multi-stage context extension: pretraining at 4K context, then two 50B-token extension phases pushing it to 32K and then 64K, with a further 2× YaRN extrapolation reaching 128K at inference. The long-context quality is engineered rather than incidental.

We covered SmolLM3-3B in detail when it launched: SmolLM3-3B: The Fully Open Small Language Model That Punches Way Above Its Weight.

  • Context window: 128K tokens
  • License: Apache 2.0 — fully permissive, commercial use
  • Run it via llama.cpp: download from ggml-org/SmolLM3-3B-GGUF
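To actually exercise that long-context behaviour, llama.cpp exposes YaRN rope-scaling options. A hedged sketch (flag support varies by llama.cpp version, and the GGUF and prompt filenames here are illustrative):

# Fetch the quantized weights (repo from the list above; the exact filename may differ)
huggingface-cli download ggml-org/SmolLM3-3B-GGUF --local-dir ./smollm3

# Request a 128K window via 2x YaRN scaling over the 64K training context
llama-cli -m ./smollm3/SmolLM3-3B-Q4_K_M.gguf \
  -c 131072 --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 65536 \
  -f long_report_with_question.txt   # the document plus your question, in one file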

Gemma 3 4B — best multimodal at this size

Google’s Gemma 3 4B is the only model in the 3–4B class that handles image input natively. The instruction-tuned model scores 43.6% on MMLU-Pro (a harder variant than standard MMLU), 89.2% on GSM8K, 71.3% on HumanEval, and 75.6% on MATH — the strongest coding and math scores in the 3–4B category. It uses approximately 4.2 GB of VRAM at Q4.

The image+text multimodal capability and 128K context window make Gemma 3 4B the go-to for vision-augmented local workflows — document parsing, screenshot analysis, chart reading — without stepping up to a larger model.

  • Context window: 128K tokens
  • License: Gemma license (free for research and commercial use with attribution)
  • Run it: ollama run gemma3:4b
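Ollama's CLI accepts image file paths inside the prompt for vision-capable models, so a quick local document-parsing check needs no extra tooling. A minimal sketch, assuming your Ollama build supports image input for gemma3 and that ./invoice.png is a file you supply:

# Image paths included in the prompt are picked up for multimodal models
ollama run gemma3:4b "Extract the total amount and the due date from this scan: ./invoice.png"

For programmatic use, Ollama's REST API takes base64-encoded images via an images field on the message.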

Google has since released Gemma 4, which uses a sparse MoE architecture to push quality significantly further at the same effective parameter count. We covered that separately in Gemma 4: Taking Agentic Workflows to the Edge.

Llama 3.2 3B — the baseline

Meta’s Llama 3.2 3B is the benchmark everyone calibrates against: 63.4% MMLU (5-shot) and 77.7% GSM8K (8-shot CoT), widely supported by every inference framework and agent library. It’s not the best model in this class anymore, but it’s the safest choice if your RAG pipeline or agent framework was built before 2026 and you need guaranteed compatibility. Most tooling is tested against it first.

  • Context window: 128K tokens
  • License: Llama 3 Community License (commercial use allowed up to 700M monthly active users)
  • Run it: ollama run llama3.2:3b

The 7–8B class: power with manageable RAM

At 7–8B parameters you get a meaningful quality jump — particularly on coding and instruction-following — while still fitting on a 6–8 GB GPU or 16 GB of unified Apple Silicon memory.

[Image: Abstract 3D visualisation of glowing AI neural network nodes in blue and purple]

The 7–8B class offers near-frontier quality for developer workloads on a single consumer GPU. (Photo: Pexels, free)

Qwen3 8B — coding champion

Alibaba’s Qwen3 8B leads the 7–8B class on code-generation benchmarks, consistently outperforming same-class competitors on HumanEval and coding-specific evaluations. At Q4_K_M it uses roughly 5 GB of VRAM, supports a 128K context window, and ships under Apache 2.0. It’s the default upgrade path for any developer-focused workload once you’ve maxed out what a 3–4B model can do.

  • Context window: 128K tokens
  • License: Apache 2.0
  • Run it: ollama run qwen3:8b
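Because Ollama also serves an OpenAI-compatible endpoint, Qwen3 8B can slot into existing OpenAI-SDK-based coding tools with only a base-URL change. A minimal sketch, assuming the default local port and that the qwen3:8b tag has been pulled:

# OpenAI-compatible chat completions endpoint served by the local Ollama daemon
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [
      {"role": "user", "content": "Write a Python function that merges overlapping intervals, with type hints and doctests."}
    ]
  }'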

The blog also covers Qwen3-Coder-Next, Alibaba’s purpose-built coding agent variant, and the ultra-light Qwen3.5-0.8B for deployments where every megabyte counts.

Mistral 7B — the speed benchmark

The Mistral 7B family remains one of the fastest models at this parameter count on consumer hardware. At Q4_K_M it uses roughly 4.5 GB of VRAM and hits 30–50 tokens/second on an M2/M3 MacBook Pro or an RTX 4060 — the fastest wall-clock throughput in this comparison. That speed makes it the practical pick when latency matters more than peak benchmark scores: real-time chat interfaces, coding autocomplete, streaming pipelines.

  • Context window: 32K tokens
  • License: Apache 2.0
  • Run it: ollama run mistral:7b-instruct
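Throughput figures like these are easy to verify on your own hardware: Ollama's --verbose flag prints timing statistics, including the generation rate, after each response. A minimal sketch (output format may differ slightly across Ollama versions):

# --verbose appends timing stats, including "eval rate" in tokens/second
ollama run mistral:7b-instruct --verbose "Summarise the plot of Hamlet in three sentences."

For a more controlled measurement, llama.cpp ships a dedicated llama-bench tool.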

For what Mistral’s team has achieved at larger scale, our Mistral Small 4 coverage looks at the 119B MoE architecture that unifies instruction-following, reasoning, and multimodal capability in a single model.

Benchmark comparison table

Numbers sourced from each model’s official HuggingFace model card or technical report. MMLU variant used may differ — see Sources.

| Model | Params | MMLU | GSM8K | HumanEval | VRAM (Q4) | Context | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Phi-4-mini | 3.8B | 67.3% ¹ | 88.6% ¹ | n/a | ~3 GB | 128K | MIT |
| SmolLM3-3B | 3B | 44.1% ² | 67.6% ² | 30.5% ² | ~2 GB | 128K | Apache 2.0 |
| Gemma 3 4B | 4B | 43.6% ³ | 89.2% ⁴ | 71.3% ⁴ | ~4.2 GB | 128K | Gemma |
| Llama 3.2 3B | 3B | 63.4% ⁵ | 77.7% ⁵ | n/a | ~2 GB | 128K | Llama |
| Qwen3 8B | 8B | n/a | n/a | ★ | ~5 GB | 128K | Apache 2.0 |
| Mistral 7B | 7B | n/a | n/a | n/a | ~4.5 GB | 32K | Apache 2.0 |

¹ MMLU 5-shot and GSM8K 8-shot CoT; from Phi-4-mini-instruct official model card.
² MMLU-CF, GSM8K 5-shot, HumanEval+ — all from SmolLM3 base model card. Note: MMLU-CF is a harder cascade-format variant; not directly comparable to MMLU 5-shot.
³ MMLU-Pro for the instruction-tuned model; a harder variant than standard MMLU 5-shot.
⁴ GSM8K and HumanEval scores for the Gemma 3 4B instruction-tuned model, from the official model card.
⁵ MMLU 5-shot, GSM8K 8-shot CoT; from Llama 3.2 3B instruction-tuned model card.
★ Leads the 7–8B class on code generation benchmarks per independent evaluations.

Which model is best for your use case?

Coding and software development: Gemma 3 4B at the 3–4B scale (71.3% HumanEval), Qwen3 8B if you have 5+ GB VRAM free. For a frontier-level coding agent that runs locally, see Qwen3-Coder-Next.

Math and reasoning: Phi-4-mini wins at sub-4B. Microsoft’s synthetic training approach specifically targets mathematical and logical reasoning chains.

Long-document processing: SmolLM3-3B’s genuine 128K context and multi-stage training make it the standout in the 3B class. Avoid any model that claims 128K context through simple positional interpolation — the quality degrades sharply.

Multimodal tasks (image + text): Gemma 3 4B is the only 3–4B option here. For vision-capable models with optimised Mac performance, also see our Reka Edge coverage.

Edge and offline-first deployments: If you need something that fits in under 1 GB of RAM, Qwen3.5-0.8B and Liquid AI Apollo are both worth a look at the sub-1B scale.

Maximum throughput and lowest latency: Mistral 7B. If response speed matters for your interface — streaming chat, autocomplete, voice pipelines — it produces the highest tokens-per-second in this table at comparable quality.

Hardware quick reference

All models below are available for CPU-only inference — you do not need a discrete GPU. Expect 15–50 tokens/second on modern CPUs depending on chip and quantization level.

# 3–4B tier: fits in 8 GB system RAM, runs on any modern machine
ollama run phi4-mini          # 3.8B — best reasoning, MIT licensed
ollama run gemma3:4b          # 4B  — multimodal, best coding at this tier
ollama run llama3.2:3b        # 3B  — widest tooling compatibility

# 7–8B tier: needs 6–8 GB VRAM or 16 GB unified memory
ollama run qwen3:8b           # 8B  — best coding
ollama run mistral:7b-instruct # 7B  — fastest wall-clock throughput

If you’re on Apple Silicon, all of these models run efficiently via Metal and Apple’s unified memory architecture. Quantized GGUF files are available on HuggingFace for llama.cpp if you prefer to skip Ollama.
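If you go the llama.cpp route on a Mac, the main knob is how many layers to offload to the GPU via Metal. A minimal sketch, assuming you have already downloaded a quantized GGUF (the filename here is illustrative):

# -ngl 99 offloads all layers to the GPU (Metal on Apple Silicon); lower it if memory is tight
llama-cli -m ./phi-4-mini-Q4_K_M.gguf -ngl 99 -p "Explain the difference between a mutex and a semaphore."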

What is the best small language model in 2026?

For most users, Phi-4-mini (3.8B) is the best small language model in 2026. It is MIT licensed, runs in approximately 3 GB of VRAM, supports a 128K context window, and leads the 3–4B class on MMLU (67.3%, 5-shot) while posting 88.6% on GSM8K, per the official Microsoft model card. It installs in one command via Ollama and works on any machine with a modern GPU or a recent CPU.

If multimodal input (images) is a requirement, Gemma 3 4B is the swap — and it reports the strongest coding score in the tier (71.3% HumanEval, a benchmark the Phi-4-mini card does not publish). If you need the most permissive license and a fully open training pipeline, SmolLM3-3B is the principled choice. If you can spare 5 GB of VRAM and need the strongest code output, Qwen3 8B is the upgrade path.

The only case where none of these apply is if you’re deploying on severely resource-constrained hardware — sub-1 GB RAM — at which point you’re in a different product category entirely.

What’s next

The 3–4B class has genuinely matured. In 2024, a 3B model was a curiosity. In 2026, Phi-4-mini and Gemma 3 4B handle production workloads — coding assistants, document Q&A, local agents — on hardware most developers already own.

The model landscape moves fast. This post will be updated as significant new releases land. Check the small-models tag for the latest coverage, and the benchmark tag for head-to-head comparisons.

Sources