Last updated · May 31, 2026, 13:27 UTC

Agentic & Coding LLMs, ranked.

We aggregate the most authoritative public benchmarks, weight them transparently, and refresh on every commit.

How well models write, edit, and debug real code.

Rank	Model	Provider	Composite	SWE-bench Verified	LiveCodeBench	Aider Polyglot	Artificial Analysis
1	Claude Opus 4.6 (limited)	Anthropic	75.6	75.6	—	—	—
2	GLM-5 (limited)	Z.ai	72.8	72.8	—	—	—
3	GPT-5	OpenAI	66.2	74.4	—	88.0	39.0
4	Qwen3-235B-A22B (limited)	Alibaba	65.9	—	65.9	—	—
5	Claude Opus 4.5	Anthropic	64.9	79.2	—	—	47.8
6	Gemini 3 Pro	Google	63.4	77.4	—	—	46.5
7	o4-mini	OpenAI	63.2	74.4	80.2	72.0	25.6
8	o3	OpenAI	62.3	58.4	75.8	81.3	38.4
9	GPT-5.2	OpenAI	61.8	72.8	—	—	48.7
10	Gemini 3 Flash	Google	60.7	75.8	—	—	42.6
11	Claude Sonnet 4.5	Anthropic	58.4	74.8	—	—	38.6
12	GPT-4.1	OpenAI	58.0	74.6	—	78.2	21.8
13	Grok 4 (limited)	xAI	57.9	—	—	79.6	40.5
14	Doubao-Seed-Code	ByteDance	57.2	78.8	—	—	31.3
15	Claude Sonnet 4	Anthropic	56.9	76.8	56.0	—	34.1
16	Claude Opus 4 (limited)	Anthropic	56.6	—	56.6	—	—
17	GPT-5.1	OpenAI	56.3	66.0	—	—	44.7
18	Gemini 2.5 Pro	Google	55.5	75.2	—	—	31.9
19	DeepSeek V3.2	DeepSeek	54.9	70.0	—	—	36.7
20	Kimi K2	Moonshot AI	53.5	65.4	—	59.1	34.8
21	Claude 3.7 Sonnet	Anthropic	47.0	63.2	—	—	27.6
22	Claude 3.5 Sonnet	Anthropic	44.7	50.8	36.4	64.0	30.2
23	DeepSeek R1 (limited)	DeepSeek	40.6	—	—	71.4	15.9
24	GPT-5 Codex (limited)	OpenAI	38.9	—	—	—	38.9
25	DeepSeek V3.2 Speciale (limited)	DeepSeek	37.9	—	—	—	37.9
26	o3-mini (limited)	OpenAI	36.8	—	—	60.4	17.9
27	DeepSeek V3	DeepSeek	31.3	—	27.2	55.1	16.4
28	Gemini 2.5 Flash	Google	25.8	28.7	—	—	22.2
29	GPT-4o	OpenAI	22.8	21.6	—	—	24.2
30	Grok 3 (limited)	xAI	19.8	—	—	—	19.8
31	Llama 4 Maverick (limited)	Meta	15.6	—	—	15.6	15.6

Mac local inference

Apple Silicon LLM speed rankings

From the LLMCheck 2026-05-09 open dataset: 162 Apple Silicon measurements ranked by generation tok/s.

Fastest in view

SmolLM3 3B · 168 tok/s

Method

Q4_K_M quantization (unless otherwise noted), 256-token input, 512-token output, average of 3 runs on a freshly booted system.

Rank	Model	Params	RAM	Chip	tok/s	TTFT	Engine	Quant	Date
1	SmolLM3 3B	3B	64 GB	M5 Max	168	0.1s	MLX	Q4_K_M	2026.05
2	Gemma 4 E2B	2.3B	128 GB	M5 Max	158	0.1s	MLX	Q4_K_M	2026.04
3	Qwen 3.5 4B	4B	64 GB	M5 Max	148	0.2s	MLX	Q4_K_M	2026.03
4	Phi-5 Mini	4B	128 GB	M5 Max	145	0.2s	MLX	Q4_K_M	2026.05
5	Phi-4 Mini	3.8B	64 GB	M5 Max	142	0.3s	Ollama	Q4_K_M	2026.03
6	Llama 3.1 8B	8B	128 GB	M5 Max	138	0.3s	MLX	Q4_K_M	2026.03
7	Qwen 3 4B	4B	64 GB	M5 Max	135	0.2s	Ollama	Q4_K_M	2026.03
8	Gemma 3 4B	4B	64 GB	M5 Max	132	0.3s	Ollama	Q4_K_M	2026.03
9	Gemma 4 E4B	4B	128 GB	M5 Max	128	0.2s	MLX	Q4_K_M	2026.04
10	Phi-4 Mini	3.8B	48 GB	M4 Max	125	0.3s	MLX	Q4_K_M	2026.02
11	Mistral 7B	7B	64 GB	M5 Max	122	0.3s	Ollama	Q4_K_M	2026.03
12	Qwen 3 4B	4B	24 GB	M4 Pro	118	0.3s	MLX	Q4_K_M	2026.02
13	SmolLM3 3B	3B	24 GB	M4 Pro	115	0.2s	Ollama	Q4_K_M	2026.05
14	Llama 5 8B	8B	128 GB	M5 Max	112	0.3s	MLX	Q4_K_M	2026.05
15	Phi-4 Mini	3.8B	64 GB	M5 Max	112	0.3s	MLX	Q8_0	2026.03
16	Phi-5 Mini	4B	24 GB	M4 Pro	110	0.3s	Ollama	Q4_K_M	2026.05
17	Phi-4 Mini	3.8B	24 GB	M4 Pro	108	0.4s	Ollama	Q4_K_M	2026.03
18	Qwen 3.5 9B	9B	64 GB	M5 Max	105	0.5s	Ollama	Q4_K_M	2026.03
19	Ministral 8B	8B	64 GB	M5 Max	98	0.4s	Ollama	Q4_K_M	2026.03
20	Mistral 7B	7B	24 GB	M4 Pro	98	0.4s	MLX	Q4_K_M	2026.02
21	Qwen 3 8B	8B	128 GB	M5 Max	98	0.4s	Ollama	Q4_K_M	2026.03
22	DeepSeek R1 8B	8B	64 GB	M5 Max	97	0.5s	Ollama	Q4_K_M	2026.03
23	Gemma 4 E2B	2.3B	24 GB	M4 Pro	95	0.2s	Ollama	Q4_K_M	2026.04
24	Gemma 3 4B	4B	24 GB	M4 Pro	95	0.3s	MLX	Q8_0	2026.02
25	Phi-4 Mini	3.8B	16 GB	M3	95	0.3s	MLX	Q4_K_M	2026.02
26	Phi-5 Mini	4B	24 GB	M5 Pro	95	0.3s	MLX	Q4_K_M	2026.05
27	Gemma 4 E4B	4B	24 GB	M5 Pro	92	0.3s	Ollama	Q4_K_M	2026.04
28	SmolLM3 3B	3B	16 GB	M3	92	0.3s	Ollama	Q4_K_M	2026.05
29	Qwen 3.5 4B	4B	16 GB	M4	92	0.4s	Ollama	Q4_K_M	2026.02
30	Qwen 3.5 9B	9B	24 GB	M4 Pro	92	0.4s	MLX	Q4_K_M	2026.03
31	Phi-5 Mini	4B	16 GB	M3	88	0.3s	MLX	Q4_K_M	2026.05
32	Gemma 3 4B	4B	18 GB	M3 Pro	88	0.4s	MLX	Q4_K_M	2026.01
33	Mistral 7B	7B	64 GB	M5 Max	88	0.4s	MLX	Q8_0	2026.03
34	Gemma 4 E2B	2.3B	16 GB	M3	82	0.3s	Ollama	Q4_K_M	2026.04
35	Llama 3.1 8B	8B	64 GB	M5 Max	82	0.4s	Ollama	Q8_0	2026.03
36	Qwen 3 8B	8B	24 GB	M4 Pro	82	0.5s	Ollama	Q4_K_M	2026.02
37	Gemma 4 E4B	4B	24 GB	M4 Pro	78	0.3s	MLX	Q4_K_M	2026.04
38	Llama 5 8B	8B	24 GB	M4 Pro	78	0.4s	MLX	Q4_K_M	2026.05
39	SmolLM3 3B	3B	8 GB	M2	78	0.4s	Ollama	Q4_K_M	2026.05
40	DeepSeek R1 8B	8B	16 GB	M4	78	0.5s	MLX	Q4_K_M	2026.02

Source: LLMCheck Apple Silicon LLM Benchmark Database, CC BY 4.0. tok/s measures generation speed and excludes prompt processing.
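
As a rough illustration of the metric described above, the sketch below computes generation tok/s for a single run and averages it over several runs. It is not LLMCheck's harness: the `Run` record, its field names, and the choice to treat TTFT as the prompt-processing time are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Tuple


@dataclass
class Run:
    """One benchmark run (hypothetical record, not LLMCheck's schema)."""
    ttft_s: float       # time to first token, in seconds
    total_s: float      # wall-clock time for the whole request, in seconds
    output_tokens: int  # number of tokens generated


def generation_tok_s(run: Run) -> float:
    """Tokens per second during generation only.

    Assumption: everything before the first token is prompt processing,
    so it is subtracted from the total time.
    """
    decode_s = run.total_s - run.ttft_s
    if decode_s <= 0:
        raise ValueError("total_s must be greater than ttft_s")
    return run.output_tokens / decode_s


def summarize(runs: Iterable[Run]) -> Tuple[float, float]:
    """Return (mean generation tok/s, mean TTFT in seconds) over the runs."""
    runs = list(runs)
    return (
        mean(generation_tok_s(r) for r in runs),
        mean(r.ttft_s for r in runs),
    )


# Three made-up runs with a 512-token output, mirroring the stated method.
example_runs = [
    Run(ttft_s=0.25, total_s=4.25, output_tokens=512),
    Run(ttft_s=0.25, total_s=2.25, output_tokens=512),
    Run(ttft_s=0.25, total_s=8.25, output_tokens=512),
]
tok_s, ttft = summarize(example_runs)
print(f"{tok_s:.1f} tok/s, TTFT {ttft:.2f}s")
```

Averaging per-run rates, as done here, is one of several reasonable choices; dividing total tokens by total decode time would weight slow runs more heavily.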

Authoritative sources

We track these public leaderboards; each one publishes its own source data.

SWE-bench Verified

Real GitHub issues resolved end-to-end.

coding · w=0.30

LiveCodeBench

Contamination-free competitive programming.

coding · w=0.25

Aider Polyglot

Multi-language code editing with diff-style edits.

coding · w=0.20

Artificial Analysis (Coding Index)

Cross-benchmark coding index, weekly refresh.

coding · w=0.25

GAIA (Princeton HAL)

Multi-step real-world tasks: search, files, reasoning.

agentic · w=0.35

TAU-bench

Tool-calling against retail/airline/telecom/banking policies.

agentic · w=0.30

AgentBench-FC

Function-calling agents over ALFWorld, DB, KG, OS, WebShop.

agentic · w=0.20

Artificial Analysis (Agentic Index)

Cross-benchmark agentic index, weekly refresh.

agentic · w=0.15

tbench (Terminal Agent Harness)

Shell-based agent harness: coding, docs, tools, finance, data.

terminal · w=1.00
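
The weights listed above can be written down as a small configuration and sanity-checked so that each category sums to 1. This is only a sketch: the `SOURCE_WEIGHTS` layout and the `check_weights` helper are hypothetical names for this example, not the site's actual configuration.

```python
import math
from typing import Dict

# Per-category source weights, copied from the listing above.
SOURCE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "coding": {
        "SWE-bench Verified": 0.30,
        "LiveCodeBench": 0.25,
        "Aider Polyglot": 0.20,
        "Artificial Analysis (Coding Index)": 0.25,
    },
    "agentic": {
        "GAIA (Princeton HAL)": 0.35,
        "TAU-bench": 0.30,
        "AgentBench-FC": 0.20,
        "Artificial Analysis (Agentic Index)": 0.15,
    },
    "terminal": {
        "tbench (Terminal Agent Harness)": 1.00,
    },
}


def check_weights(weights_by_category: Dict[str, Dict[str, float]]) -> None:
    """Raise ValueError if any category's weights do not sum to 1."""
    for category, weights in weights_by_category.items():
        total = sum(weights.values())
        # Compare with a tolerance because the weights are binary floats.
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"{category} weights sum to {total}, expected 1.0")


check_weights(SOURCE_WEIGHTS)  # passes silently for the weights above
```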

Composite weights

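Each composite score is a weighted average of the model's per-source scores, normalised to 0–100. The composites in the coding table are consistent, to within rounding, with averaging only over the sources that report a score for a model and rescaling those weights to sum to 1. GPT-5, for example, has no LiveCodeBench score, so its composite is (0.30 × 74.4 + 0.20 × 88.0 + 0.25 × 39.0) / 0.75 ≈ 66.2.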

Coding

SWE-bench Verified
Real GitHub issues resolved end-to-end.
30%
LiveCodeBench
Contamination-free competitive programming.
25%
Aider Polyglot
Multi-language code editing with diff-style edits.
20%
Artificial Analysis (Coding Index)
Cross-benchmark coding index, weekly refresh.
25%
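
A minimal sketch of how a coding composite could be computed from these weights. It assumes that a source with no score (shown as a dash in the table) is skipped and that the remaining weights are rescaled to sum to 1; that rule is inferred from the published numbers rather than stated by the site, and the `composite` function is a hypothetical helper.

```python
from typing import Mapping, Optional

# Coding weights as listed above.
CODING_WEIGHTS = {
    "SWE-bench Verified": 0.30,
    "LiveCodeBench": 0.25,
    "Aider Polyglot": 0.20,
    "Artificial Analysis": 0.25,
}


def composite(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float] = CODING_WEIGHTS,
) -> Optional[float]:
    """Weighted average over the sources that report a score.

    Assumption: missing sources (None) are dropped and the remaining
    weights are rescaled so they sum to 1.
    """
    available = {
        name: value
        for name, value in scores.items()
        if value is not None and name in weights
    }
    total_weight = sum(weights[name] for name in available)
    if total_weight == 0:
        return None  # no usable source for this model
    weighted_sum = sum(weights[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Per-source scores taken from the coding table.
gpt5 = {
    "SWE-bench Verified": 74.4,
    "LiveCodeBench": None,
    "Aider Polyglot": 88.0,
    "Artificial Analysis": 39.0,
}
o3 = {
    "SWE-bench Verified": 58.4,
    "LiveCodeBench": 75.8,
    "Aider Polyglot": 81.3,
    "Artificial Analysis": 38.4,
}

print(round(composite(gpt5), 1))  # 66.2, matching the table
print(round(composite(o3), 1))    # 62.3, matching the table
```

A few rows in the table differ from this calculation by 0.1, which is what one would expect if the site computes composites from unrounded inputs.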

Agentic

GAIA (Princeton HAL)
Multi-step real-world tasks: search, files, reasoning.
35%
TAU-bench
Tool-calling against retail/airline/telecom/banking policies.
30%
AgentBench-FC
Function-calling agents over ALFWorld, DB, KG, OS, WebShop.
20%
Artificial Analysis (Agentic Index)
Cross-benchmark agentic index, weekly refresh.
15%

Terminal Agent

tbench (Terminal Agent Harness)
Shell-based agent harness: coding, docs, tools, finance, data.
100%