Last updated · May 08, 2026, 12:40 UTC

Agentic & Coding LLMs, ranked.

We aggregate the most authoritative public benchmarks, weight them transparently, and refresh on every commit.

How well models write, edit, and debug real code.

Rank | Model | Provider | Composite
1 | Claude Opus 4.6 (limited) | Anthropic | 75.6
2 | GLM-5 (limited) | Z.ai | 72.8
3 | GPT-5 | OpenAI | 66.2
4 | Qwen3-235B-A22B (limited) | Alibaba | 65.9
5 | Claude Opus 4.5 | Anthropic | 64.9
6 | Gemini 3 Pro | Google | 63.4
7 | o4-mini | OpenAI | 63.2
8 | o3 | OpenAI | 62.3
9 | GPT-5.2 | OpenAI | 61.8
10 | Gemini 3 Flash | Google | 60.7
11 | Claude Sonnet 4.5 | Anthropic | 58.4
12 | GPT-4.1 | OpenAI | 58.0
13 | Grok 4 (limited) | xAI | 57.9
14 | Doubao-Seed-Code | ByteDance | 57.2
15 | Claude Sonnet 4 | Anthropic | 56.9
16 | Claude Opus 4 (limited) | Anthropic | 56.6
17 | GPT-5.1 | OpenAI | 56.3
18 | Gemini 2.5 Pro | Google | 55.5
19 | DeepSeek V3.2 | DeepSeek | 54.9
20 | Kimi K2 | Moonshot AI | 53.5
21 | Claude 3.7 Sonnet | Anthropic | 47.0
22 | Claude 3.5 Sonnet | Anthropic | 44.7
23 | DeepSeek R1 (limited) | DeepSeek | 40.6
24 | GPT-5 Codex (limited) | OpenAI | 38.9
25 | DeepSeek V3.2 Speciale (limited) | DeepSeek | 37.9
26 | o3-mini (limited) | OpenAI | 36.8
27 | DeepSeek V3 | DeepSeek | 31.3
28 | Gemini 2.5 Flash | Google | 25.8
29 | GPT-4o | OpenAI | 22.8
30 | Grok 3 (limited) | xAI | 19.8
31 | Llama 4 Maverick (limited) | Meta | 15.6

Mac local inference

Apple Silicon LLM speed rankings

From the LLMCheck 2026-05-09 open dataset: 158 Apple Silicon measurements ranked by generation tok/s.

Fastest in view: SmolLM3 3B · 168 tok/s
Chips tested: 11
Method: Q4_K_M quantization (unless noted), 256-token input, 512-token output, average of 3 runs on a freshly booted system

Rank | Model | Params · RAM | Chip | tok/s | TTFT
1 | SmolLM3 3B | 3B · 64 GB | M5 Max | 168 | 0.1s
2 | Gemma 4 E2B | 2.3B · 128 GB | M5 Max | 158 | 0.1s
3 | Qwen 3.5 4B | 4B · 64 GB | M5 Max | 148 | 0.2s
4 | Phi-5 Mini | 4B · 128 GB | M5 Max | 145 | 0.2s
5 | Phi-4 Mini | 3.8B · 64 GB | M5 Max | 142 | 0.3s
6 | Llama 3.1 8B | 8B · 128 GB | M5 Max | 138 | 0.3s
7 | Qwen 3 4B | 4B · 64 GB | M5 Max | 135 | 0.2s
8 | Gemma 3 4B | 4B · 64 GB | M5 Max | 132 | 0.3s
9 | Gemma 4 E4B | 4B · 128 GB | M5 Max | 128 | 0.2s
10 | Phi-4 Mini | 3.8B · 48 GB | M4 Max | 125 | 0.3s
11 | Mistral 7B | 7B · 64 GB | M5 Max | 122 | 0.3s
12 | Qwen 3 4B | 4B · 24 GB | M4 Pro | 118 | 0.3s
13 | SmolLM3 3B | 3B · 24 GB | M4 Pro | 115 | 0.2s
14 | Llama 5 8B | 8B · 128 GB | M5 Max | 112 | 0.3s
15 | Phi-4 Mini | 3.8B · 64 GB | M5 Max | 112 | 0.3s
16 | Phi-5 Mini | 4B · 24 GB | M4 Pro | 110 | 0.3s
17 | Phi-4 Mini | 3.8B · 24 GB | M4 Pro | 108 | 0.4s
18 | Qwen 3.5 9B | 9B · 64 GB | M5 Max | 105 | 0.5s
19 | Ministral 8B | 8B · 64 GB | M5 Max | 98 | 0.4s
20 | Mistral 7B | 7B · 24 GB | M4 Pro | 98 | 0.4s
21 | Qwen 3 8B | 8B · 128 GB | M5 Max | 98 | 0.4s
22 | DeepSeek R1 8B | 8B · 64 GB | M5 Max | 97 | 0.5s
23 | Gemma 4 E2B | 2.3B · 24 GB | M4 Pro | 95 | 0.2s
24 | Gemma 3 4B | 4B · 24 GB | M4 Pro | 95 | 0.3s
25 | Phi-4 Mini | 3.8B · 16 GB | M3 | 95 | 0.3s
26 | Phi-5 Mini | 4B · 24 GB | M5 Pro | 95 | 0.3s
27 | Gemma 4 E4B | 4B · 24 GB | M5 Pro | 92 | 0.3s
28 | SmolLM3 3B | 3B · 16 GB | M3 | 92 | 0.3s
29 | Qwen 3.5 4B | 4B · 16 GB | M4 | 92 | 0.4s
30 | Qwen 3.5 9B | 9B · 24 GB | M4 Pro | 92 | 0.4s
31 | Phi-5 Mini | 4B · 16 GB | M3 | 88 | 0.3s
32 | Gemma 3 4B | 4B · 18 GB | M3 Pro | 88 | 0.4s
33 | Mistral 7B | 7B · 64 GB | M5 Max | 88 | 0.4s
34 | Gemma 4 E2B | 2.3B · 16 GB | M3 | 82 | 0.3s
35 | Llama 3.1 8B | 8B · 64 GB | M5 Max | 82 | 0.4s
36 | Qwen 3 8B | 8B · 24 GB | M4 Pro | 82 | 0.5s
37 | Gemma 4 E4B | 4B · 24 GB | M4 Pro | 78 | 0.3s
38 | Llama 5 8B | 8B · 24 GB | M4 Pro | 78 | 0.4s
39 | SmolLM3 3B | 3B · 8 GB | M2 | 78 | 0.4s
40 | DeepSeek R1 8B | 8B · 16 GB | M4 | 78 | 0.5s

Source: LLMCheck Apple Silicon LLM Benchmark Database, CC BY 4.0. tok/s measures generation speed and excludes prompt processing.
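
As a rough illustration of what a generation-only tok/s figure means, the sketch below divides the output token count by decode time (wall-clock total minus TTFT) and averages three runs. The function names, the timing tuples, and the choice to treat TTFT as the prompt-processing share are assumptions made for this example; LLMCheck's actual measurement procedure may differ.

```python
from statistics import mean

OUTPUT_TOKENS = 512  # output length stated in the Method line


def generation_tok_per_s(total_seconds: float, ttft_seconds: float,
                         output_tokens: int = OUTPUT_TOKENS) -> float:
    """Generation-only throughput for one run.

    Assumption: time before the first token (TTFT) is prompt processing,
    so it is subtracted from the wall-clock total before dividing.
    """
    decode_seconds = total_seconds - ttft_seconds
    if decode_seconds <= 0:
        raise ValueError("total_seconds must exceed ttft_seconds")
    return output_tokens / decode_seconds


def averaged_tok_per_s(runs: list[tuple[float, float]]) -> float:
    """Mean of per-run throughput; each run is (total_seconds, ttft_seconds)."""
    return mean(generation_tok_per_s(total, ttft) for total, ttft in runs)


# Hypothetical timings for three runs, not taken from the dataset.
runs = [(4.25, 0.25), (4.25, 0.25), (2.25, 0.25)]
print(round(averaged_tok_per_s(runs), 1))  # 170.7
```

Averaging per-run rates, as here, weights each run equally; dividing total tokens by total decode time would instead weight slower runs more heavily. Which of the two the dataset uses is not stated.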

Authoritative sources

We track the public leaderboards listed under Composite weights below.

Composite weights

Each composite score is a weighted average of the model's per-source scores, normalised to 0–100.

Coding

  • SWE-bench Verified
    Real GitHub issues resolved end-to-end.
    30%
  • LiveCodeBench
    Contamination-free competitive programming.
    25%
  • Aider Polyglot
    Multi-language code editing with diff-style edits.
    20%
  • Artificial Analysis (Coding Index)
    Cross-benchmark coding index, weekly refresh.
    25%

Agentic

  • GAIA (Princeton HAL)
    Multi-step real-world tasks: search, files, reasoning.
    35%
  • TAU-bench
    Tool-calling against retail/airline/telecom/banking policies.
    30%
  • AgentBench-FC
    Function-calling agents over ALFWorld, DB, KG, OS, WebShop.
    20%
  • Artificial Analysis (Agentic Index)
    Cross-benchmark agentic index, weekly refresh.
    15%

Terminal Agent

  • tbench (Terminal Agent Harness)
    Shell-based agent harness: coding, docs, tools, finance, data.
    100%
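
The weighted average described under Composite weights can be sketched as follows. The weights are the percentages listed above; the `composite` function, the example scores, and in particular the handling of missing sources (skipping them and renormalising the remaining weights) are assumptions for illustration, since the page does not say how gaps are treated.

```python
# Weights in percent, copied from the lists above.
WEIGHTS = {
    "Coding": {
        "SWE-bench Verified": 30,
        "LiveCodeBench": 25,
        "Aider Polyglot": 20,
        "Artificial Analysis (Coding Index)": 25,
    },
    "Agentic": {
        "GAIA (Princeton HAL)": 35,
        "TAU-bench": 30,
        "AgentBench-FC": 20,
        "Artificial Analysis (Agentic Index)": 15,
    },
    "Terminal Agent": {"tbench (Terminal Agent Harness)": 100},
}


def composite(scores: dict[str, float], weights: dict[str, int]) -> float:
    """Weighted average of 0-100 per-source scores.

    Assumption: sources a model has no score for are skipped and the
    remaining weights are renormalised.
    """
    present = {name: w for name, w in weights.items() if name in scores}
    if not present:
        raise ValueError("no scores for any weighted source")
    weighted_sum = sum(scores[name] * w for name, w in present.items())
    return weighted_sum / sum(present.values())


# Hypothetical per-source scores, not real leaderboard data.
example = {
    "SWE-bench Verified": 80.0,
    "LiveCodeBench": 70.0,
    "Aider Polyglot": 60.0,
    "Artificial Analysis (Coding Index)": 50.0,
}
print(composite(example, WEIGHTS["Coding"]))  # 66.0
```

With all four Coding sources present the result is simply 0.30·80 + 0.25·70 + 0.20·60 + 0.25·50 = 66.0. If a model had only the first two scores, this sketch would divide by 55 rather than 100, which is one plausible reading of how a partially covered model could still receive a composite.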