Source: Artificial Analysis. Composite average pass@1 across SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA.
Auto-updated weekly · Last update: 2026-06-01 02:01 UTC
| # | Agent | Model | Effort | Score |
|---|---|---|---|---|
| 1 | Claude Code | Opus 4.7 | max | 67 |
| 2 | Codex | GPT-5.5 | xhigh | 65 |
| 3 | Cursor CLI | Composer 2.5 Fast | | 63 |
| 4 | Cursor CLI | Opus 4.7 | medium | 61 |
| 5 | Codex | GPT-5.5 | medium | 60 |
| 6 | Claude Code | Opus 4.7 | medium | 60 |
| 7 | Cursor CLI | GPT-5.5 | medium | 58 |
| 8 | Claude Code | GLM-5.1 | | 53 |
| 9 | Claude Code | Kimi K2.6 | | 50 |
| 10 | Claude Code | DeepSeek V4 Pro | high | 50 |
| 11 | Gemini CLI | Gemini 3.1 Pro | high | 43 |
⏱ Time per Task
Mean wall-clock time per task, in minutes (lower is better)
| # | Agent | Wall Time |
|---|---|---|
| 1 | Claude Code - Opus 4.7 (medium) (Anthropic) | 5.8m |
| 2 | Cursor CLI - GPT-5.5 (medium) (Cursor) | 6.2m |
| 3 | Cursor CLI - Composer 2.5 Fast (Cursor) | 6.7m |
| 4 | Codex - GPT-5.5 (medium) (OpenAI) | 7.1m |
| 5 | Gemini CLI - Gemini 3.1 Pro (high) (Gemini) | 7.6m |
| 6 | Cursor CLI - Opus 4.7 (medium) (Cursor) | 7.8m |
| 7 | Codex - GPT-5.5 (xhigh) (OpenAI) | 8.7m |
| 8 | Claude Code - Opus 4.7 (max) (Anthropic) | 13.8m |
| 9 | Claude Code - DeepSeek V4 Pro (high) (DeepSeek) | 18.0m |
| 10 | Claude Code - GLM-5.1 (FriendliAI) | 21.6m |
| 11 | Claude Code - Kimi K2.6 (Moonshot AI) | 41.5m |
💰 Cost per Task
Mean API cost per task in USD (lower is better)
| # | Agent | Cost (USD) |
|---|---|---|
| 1 | Claude Code - DeepSeek V4 Pro (high) (DeepSeek) | $0.35 |
| 2 | Cursor CLI - Composer 2.5 Fast (Cursor) | $0.44 |
| 3 | Claude Code - Kimi K2.6 (Moonshot AI) | $0.76 |
| 4 | Claude Code - Opus 4.7 (medium) (Anthropic) | $1.24 |
| 5 | Cursor CLI - Opus 4.7 (medium) (Cursor) | $1.47 |
| 6 | Gemini CLI - Gemini 3.1 Pro (high) (Gemini) | $1.60 |
| 7 | Cursor CLI - GPT-5.5 (medium) (Cursor) | $1.61 |
| 8 | Codex - GPT-5.5 (medium) (OpenAI) | $2.21 |
| 9 | Claude Code - GLM-5.1 (FriendliAI) | $2.26 |
| 10 | Claude Code - Opus 4.7 (max) (Anthropic) | $4.14 |
| 11 | Codex - GPT-5.5 (xhigh) (OpenAI) | $4.33 |
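The score table splits each configuration into separate columns, while the time and cost tables pack it into a single label. The sketch below is one possible way to line the three tables up. It assumes every label follows the pattern `Agent - Model (setting) (provider)`, with the setting part (`max`, `xhigh`, `medium`, `high`) missing for some models. The `parse_label` helper and the dictionary layout are hypothetical, and only three rows are copied from the tables above.

```python
import re
from typing import Dict, Optional

# Assumed label pattern: "Agent - Model (setting) (provider)",
# where "(setting)" is absent for some models.
LABEL_RE = re.compile(
    r"^(?P<agent>.+?) - (?P<model>.+?)"
    r"(?: \((?P<effort>[^()]+)\))? \((?P<provider>[^()]+)\)$"
)

def parse_label(label: str) -> Dict[str, Optional[str]]:
    """Split a combined label into agent, model, effort, and provider."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    return match.groupdict()

# Values copied from the three tables above (three rows only).
score = {
    ("Claude Code", "Opus 4.7", "max"): 67,
    ("Codex", "GPT-5.5", "xhigh"): 65,
    ("Claude Code", "GLM-5.1", None): 53,
}
wall_minutes = {
    "Claude Code - Opus 4.7 (max) (Anthropic)": 13.8,
    "Codex - GPT-5.5 (xhigh) (OpenAI)": 8.7,
    "Claude Code - GLM-5.1 (FriendliAI)": 21.6,
}
cost_usd = {
    "Claude Code - Opus 4.7 (max) (Anthropic)": 4.14,
    "Codex - GPT-5.5 (xhigh) (OpenAI)": 4.33,
    "Claude Code - GLM-5.1 (FriendliAI)": 2.26,
}

# Join the three tables into one record per configuration.
rows = []
for label, cost in cost_usd.items():
    parts = parse_label(label)
    key = (parts["agent"], parts["model"], parts["effort"])
    rows.append({
        **parts,
        "score": score[key],
        "wall_minutes": wall_minutes[label],
        "cost_usd": cost,
    })

for row in rows:
    print(row)
```

Keying the score lookup on the `(agent, model, effort)` tuple keeps configurations such as Opus 4.7 at `max` and at `medium` distinct, which a lookup on the model name alone would not.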
About the Benchmarks
- SWE-Bench-Pro-Hard-AA: Code generation, 150 questions (Scale AI)
- Terminal-Bench v2: Agentic terminal use, 84 questions (Laude Institute)
- SWE-Atlas-QnA: Technical Q&A, 124 questions (Scale AI)
The index represents the average pass@1 across 3 runs of each benchmark.
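The arithmetic behind that sentence can be sketched as follows. Two things here are assumptions rather than statements from the source: that the three benchmarks are weighted equally (not by question count), and that the result is scaled to a 0-100 range. The pass counts are made up for illustration and are not real results.

```python
from statistics import mean
from typing import Dict, List

# Question counts taken from the benchmark list above.
QUESTIONS = {
    "SWE-Bench-Pro-Hard-AA": 150,
    "Terminal-Bench v2": 84,
    "SWE-Atlas-QnA": 124,
}

def pass_at_1(passed: int, total: int) -> float:
    """Fraction of tasks solved on the first attempt in a single run."""
    return passed / total

def composite_index(runs: Dict[str, List[int]]) -> float:
    """Average pass@1 over the runs of each benchmark, then over benchmarks.

    Assumption: benchmarks are weighted equally, and the result is
    reported on a 0-100 scale.
    """
    per_benchmark = [
        mean(pass_at_1(passed, QUESTIONS[name]) for passed in passed_counts)
        for name, passed_counts in runs.items()
    ]
    return 100 * mean(per_benchmark)

# Hypothetical pass counts for three runs per benchmark (not real results).
example_runs = {
    "SWE-Bench-Pro-Hard-AA": [75, 78, 72],
    "Terminal-Bench v2": [42, 42, 42],
    "SWE-Atlas-QnA": [62, 62, 62],
}
score = composite_index(example_runs)
print(f"composite index: {score:.1f}")
```

Each benchmark averages to a pass@1 of about 0.5 in this made-up example, so the composite comes out at roughly 50. Weighting by question count instead would give a different number whenever the per-benchmark rates differ.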
Data scraped weekly by an AI agent. For the latest results, visit the original page.