Agentic & Coding LLMs, ranked.
We aggregate the most authoritative public benchmarks, weight them transparently, and refresh on every commit.
How well models write, edit, and debug real code.
| Rank | Model | Vendor | Composite | Notes |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 75.6 | limited |
| 2 | GLM-5 | Z.ai | 72.8 | limited |
| 3 | GPT-5 | OpenAI | 66.2 | |
| 4 | Qwen3-235B-A22B | Alibaba | 65.9 | limited |
| 5 | Claude Opus 4.5 | Anthropic | 64.9 | |
| 6 | Gemini 3 Pro | Google | 63.4 | |
| 7 | o4-mini | OpenAI | 63.2 | |
| 8 | o3 | OpenAI | 62.3 | |
| 9 | GPT-5.2 | OpenAI | 61.8 | |
| 10 | Gemini 3 Flash | Google | 60.7 | |
| 11 | Claude Sonnet 4.5 | Anthropic | 58.4 | |
| 12 | GPT-4.1 | OpenAI | 58.0 | |
| 13 | Grok 4 | xAI | 57.9 | limited |
| 14 | Doubao-Seed-Code | ByteDance | 57.2 | |
| 15 | Claude Sonnet 4 | Anthropic | 56.9 | |
| 16 | Claude Opus 4 | Anthropic | 56.6 | limited |
| 17 | GPT-5.1 | OpenAI | 56.3 | |
| 18 | Gemini 2.5 Pro | Google | 55.5 | |
| 19 | DeepSeek V3.2 | DeepSeek | 54.9 | |
| 20 | Kimi K2 | Moonshot AI | 53.5 | |
| 21 | Claude 3.7 Sonnet | Anthropic | 47.0 | |
| 22 | Claude 3.5 Sonnet | Anthropic | 44.7 | |
| 23 | DeepSeek R1 | DeepSeek | 40.6 | limited |
| 24 | GPT-5 Codex | OpenAI | 38.9 | limited |
| 25 | DeepSeek V3.2 Speciale | DeepSeek | 37.9 | limited |
| 26 | o3-mini | OpenAI | 36.8 | limited |
| 27 | DeepSeek V3 | DeepSeek | 31.3 | |
| 28 | Gemini 2.5 Flash | Google | 25.8 | |
| 29 | GPT-4o | OpenAI | 22.8 | |
| 30 | Grok 3 | xAI | 19.8 | limited |
| 31 | Llama 4 Maverick | Meta | 15.6 | limited |
Apple Silicon LLM speed rankings
From the LLMCheck 2026-05-09 open dataset: 158 Apple Silicon measurements ranked by generation tok/s. The top 40 are shown here.
| Rank | Model | Params | Memory | Chip | tok/s | TTFT |
|---|---|---|---|---|---|---|
| 1 | SmolLM3 3B | 3B | 64 GB | M5 Max | 168 | 0.1s |
| 2 | Gemma 4 E2B | 2.3B | 128 GB | M5 Max | 158 | 0.1s |
| 3 | Qwen 3.5 4B | 4B | 64 GB | M5 Max | 148 | 0.2s |
| 4 | Phi-5 Mini | 4B | 128 GB | M5 Max | 145 | 0.2s |
| 5 | Phi-4 Mini | 3.8B | 64 GB | M5 Max | 142 | 0.3s |
| 6 | Llama 3.1 8B | 8B | 128 GB | M5 Max | 138 | 0.3s |
| 7 | Qwen 3 4B | 4B | 64 GB | M5 Max | 135 | 0.2s |
| 8 | Gemma 3 4B | 4B | 64 GB | M5 Max | 132 | 0.3s |
| 9 | Gemma 4 E4B | 4B | 128 GB | M5 Max | 128 | 0.2s |
| 10 | Phi-4 Mini | 3.8B | 48 GB | M4 Max | 125 | 0.3s |
| 11 | Mistral 7B | 7B | 64 GB | M5 Max | 122 | 0.3s |
| 12 | Qwen 3 4B | 4B | 24 GB | M4 Pro | 118 | 0.3s |
| 13 | SmolLM3 3B | 3B | 24 GB | M4 Pro | 115 | 0.2s |
| 14 | Llama 5 8B | 8B | 128 GB | M5 Max | 112 | 0.3s |
| 15 | Phi-4 Mini | 3.8B | 64 GB | M5 Max | 112 | 0.3s |
| 16 | Phi-5 Mini | 4B | 24 GB | M4 Pro | 110 | 0.3s |
| 17 | Phi-4 Mini | 3.8B | 24 GB | M4 Pro | 108 | 0.4s |
| 18 | Qwen 3.5 9B | 9B | 64 GB | M5 Max | 105 | 0.5s |
| 19 | Ministral 8B | 8B | 64 GB | M5 Max | 98 | 0.4s |
| 20 | Mistral 7B | 7B | 24 GB | M4 Pro | 98 | 0.4s |
| 21 | Qwen 3 8B | 8B | 128 GB | M5 Max | 98 | 0.4s |
| 22 | DeepSeek R1 8B | 8B | 64 GB | M5 Max | 97 | 0.5s |
| 23 | Gemma 4 E2B | 2.3B | 24 GB | M4 Pro | 95 | 0.2s |
| 24 | Gemma 3 4B | 4B | 24 GB | M4 Pro | 95 | 0.3s |
| 25 | Phi-4 Mini | 3.8B | 16 GB | M3 | 95 | 0.3s |
| 26 | Phi-5 Mini | 4B | 24 GB | M5 Pro | 95 | 0.3s |
| 27 | Gemma 4 E4B | 4B | 24 GB | M5 Pro | 92 | 0.3s |
| 28 | SmolLM3 3B | 3B | 16 GB | M3 | 92 | 0.3s |
| 29 | Qwen 3.5 4B | 4B | 16 GB | M4 | 92 | 0.4s |
| 30 | Qwen 3.5 9B | 9B | 24 GB | M4 Pro | 92 | 0.4s |
| 31 | Phi-5 Mini | 4B | 16 GB | M3 | 88 | 0.3s |
| 32 | Gemma 3 4B | 4B | 18 GB | M3 Pro | 88 | 0.4s |
| 33 | Mistral 7B | 7B | 64 GB | M5 Max | 88 | 0.4s |
| 34 | Gemma 4 E2B | 2.3B | 16 GB | M3 | 82 | 0.3s |
| 35 | Llama 3.1 8B | 8B | 64 GB | M5 Max | 82 | 0.4s |
| 36 | Qwen 3 8B | 8B | 24 GB | M4 Pro | 82 | 0.5s |
| 37 | Gemma 4 E4B | 4B | 24 GB | M4 Pro | 78 | 0.3s |
| 38 | Llama 5 8B | 8B | 24 GB | M4 Pro | 78 | 0.4s |
| 39 | SmolLM3 3B | 3B | 8 GB | M2 | 78 | 0.4s |
| 40 | DeepSeek R1 8B | 8B | 16 GB | M4 | 78 | 0.5s |
Source: LLMCheck Apple Silicon LLM Benchmark Database, CC BY 4.0. tok/s measures generation speed and excludes prompt processing.
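To make the distinction between the tok/s and TTFT columns concrete, here is a minimal sketch of how a generation-only rate can be derived from a single timed run. It assumes the timer covers the whole request and that time to first token approximates the prompt-processing phase. The function name, parameters, and sample numbers are hypothetical and are not taken from the LLMCheck dataset or tooling.

```python
def generation_tok_per_s(generated_tokens: int,
                         total_seconds: float,
                         ttft_seconds: float) -> float:
    """Generation speed with prompt processing excluded.

    Assumption: ttft_seconds (time to first token) stands in for the
    prompt-processing phase, so it is subtracted from the wall-clock total.
    """
    decode_seconds = total_seconds - ttft_seconds
    if decode_seconds <= 0:
        raise ValueError("total_seconds must be greater than ttft_seconds")
    return generated_tokens / decode_seconds


# Made-up run: 600 tokens, 4.25 s wall clock, 0.25 s to first token.
print(generation_tok_per_s(600, 4.25, 0.25))  # 150.0
```

Dividing by the full 4.25 s instead would give about 141 tok/s, which is why a rate that includes prompt processing is not directly comparable with the figures in the table.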
Authoritative sources
We track these public leaderboards. Click through for the source data.
- SWE-bench Verified: Real GitHub issues resolved end-to-end. (coding · w=0.30)
- LiveCodeBench: Contamination-free competitive programming. (coding · w=0.25)
- Aider Polyglot: Multi-language code editing with diff-style edits. (coding · w=0.20)
- Artificial Analysis (Coding Index): Cross-benchmark coding index, weekly refresh. (coding · w=0.25)
- GAIA (Princeton HAL): Multi-step real-world tasks: search, files, reasoning. (agentic · w=0.35)
- TAU-bench: Tool-calling against retail/airline/telecom/banking policies. (agentic · w=0.30)
- AgentBench-FC: Function-calling agents over ALFWorld, DB, KG, OS, WebShop. (agentic · w=0.20)
- Artificial Analysis (Agentic Index): Cross-benchmark agentic index, weekly refresh. (agentic · w=0.15)
- tbench (Terminal Agent Harness): Shell-based agent harness: coding, docs, tools, finance, data. (terminal · w=1.00)
Composite weights
Each composite score is a weighted average of the model's per-source scores, normalised to 0–100.
Coding
- SWE-bench Verified (30%): Real GitHub issues resolved end-to-end.
- LiveCodeBench (25%): Contamination-free competitive programming.
- Aider Polyglot (20%): Multi-language code editing with diff-style edits.
- Artificial Analysis (Coding Index) (25%): Cross-benchmark coding index, weekly refresh.
Agentic
- GAIA (Princeton HAL) (35%): Multi-step real-world tasks: search, files, reasoning.
- TAU-bench (30%): Tool-calling against retail/airline/telecom/banking policies.
- AgentBench-FC (20%): Function-calling agents over ALFWorld, DB, KG, OS, WebShop.
- Artificial Analysis (Agentic Index) (15%): Cross-benchmark agentic index, weekly refresh.
Terminal Agent
- tbench (Terminal Agent Harness) (100%): Shell-based agent harness: coding, docs, tools, finance, data.
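The weighted average described above can be sketched in a few lines. The weights are the ones listed in this section; everything else is an assumption made for illustration: the dictionary layout, the `composite` function name, the sample scores, and in particular the handling of missing sources (skipped, with the remaining weights renormalised), which the page does not specify.

```python
# Weights as listed under "Composite weights"; the layout is illustrative.
WEIGHTS = {
    "coding": {
        "SWE-bench Verified": 0.30,
        "LiveCodeBench": 0.25,
        "Aider Polyglot": 0.20,
        "Artificial Analysis (Coding Index)": 0.25,
    },
    "agentic": {
        "GAIA (Princeton HAL)": 0.35,
        "TAU-bench": 0.30,
        "AgentBench-FC": 0.20,
        "Artificial Analysis (Agentic Index)": 0.15,
    },
    "terminal": {
        "tbench (Terminal Agent Harness)": 1.00,
    },
}


def composite(scores, weights):
    """Weighted average of per-source scores, each already on a 0-100 scale.

    Assumption (not stated by the source): a source with no score is
    skipped and the remaining weights are renormalised to sum to 1.
    Returns None when no source has a score.
    """
    available = {name: w for name, w in weights.items()
                 if scores.get(name) is not None}
    total_weight = sum(available.values())
    if total_weight == 0:
        return None
    weighted_sum = sum(scores[name] * w for name, w in available.items())
    return weighted_sum / total_weight


# Made-up per-source scores for a hypothetical model.
example_scores = {
    "SWE-bench Verified": 80.0,
    "LiveCodeBench": 70.0,
    "Aider Polyglot": 60.0,
    "Artificial Analysis (Coding Index)": 50.0,
}
print(round(composite(example_scores, WEIGHTS["coding"]), 1))  # 66.0

# Same model with one source missing: remaining weights are renormalised.
partial_scores = {k: v for k, v in example_scores.items()
                  if k != "Aider Polyglot"}
print(round(composite(partial_scores, WEIGHTS["coding"]), 1))  # 67.5
```

Renormalising keeps a model with partial coverage on the same 0–100 scale, but it also means such a composite rests on fewer benchmarks, so it should be compared with fully covered models cautiously.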