Data source: Artificial Analysis. Scores are the average pass@1 across three benchmarks: SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA.
Updated automatically every week · Last updated: 2026-06-01 02:01 UTC
| # | Agent | Model | Reasoning effort | Score |
|---|---|---|---|---|
| 1 | Claude Code | Opus 4.7 | max | 67 |
| 2 | Codex | GPT-5.5 | xhigh | 65 |
| 3 | Cursor CLI | Composer 2.5 Fast | | 63 |
| 4 | Cursor CLI | Opus 4.7 | medium | 61 |
| 5 | Codex | GPT-5.5 | medium | 60 |
| 6 | Claude Code | Opus 4.7 | medium | 60 |
| 7 | Cursor CLI | GPT-5.5 | medium | 58 |
| 8 | Claude Code | GLM-5.1 | | 53 |
| 9 | Claude Code | Kimi K2.6 | | 50 |
| 10 | Claude Code | DeepSeek V4 Pro | high | 50 |
| 11 | Gemini CLI | Gemini 3.1 Pro | high | 43 |
⏱ Task duration
Average wall-clock time each coding agent takes to complete a task (lower is better)
| # | Agent | Time |
|---|---|---|
| 1 | Claude Code - Opus 4.7 (medium) (Anthropic) | 5.8m |
| 2 | Cursor CLI - GPT-5.5 (medium) (Cursor) | 6.2m |
| 3 | Cursor CLI - Composer 2.5 Fast (Cursor) | 6.7m |
| 4 | Codex - GPT-5.5 (medium) (OpenAI) | 7.1m |
| 5 | Gemini CLI - Gemini 3.1 Pro (high) (Gemini) | 7.6m |
| 6 | Cursor CLI - Opus 4.7 (medium) (Cursor) | 7.8m |
| 7 | Codex - GPT-5.5 (xhigh) (OpenAI) | 8.7m |
| 8 | Claude Code - Opus 4.7 (max) (Anthropic) | 13.8m |
| 9 | Claude Code - DeepSeek V4 Pro (high) (DeepSeek) | 18.0m |
| 10 | Claude Code - GLM-5.1 (FriendliAI) | 21.6m |
| 11 | Claude Code - Kimi K2.6 (Moonshot AI) | 41.5m |
💰 Task cost
Average API cost per task (USD, lower is better)
| # | Agent | Cost (USD) |
|---|---|---|
| 1 | Claude Code - DeepSeek V4 Pro (high) (DeepSeek) | $0.35 |
| 2 | Cursor CLI - Composer 2.5 Fast (Cursor) | $0.44 |
| 3 | Claude Code - Kimi K2.6 (Moonshot AI) | $0.76 |
| 4 | Claude Code - Opus 4.7 (medium) (Anthropic) | $1.24 |
| 5 | Cursor CLI - Opus 4.7 (medium) (Cursor) | $1.47 |
| 6 | Gemini CLI - Gemini 3.1 Pro (high) (Gemini) | $1.60 |
| 7 | Cursor CLI - GPT-5.5 (medium) (Cursor) | $1.61 |
| 8 | Codex - GPT-5.5 (medium) (OpenAI) | $2.21 |
| 9 | Claude Code - GLM-5.1 (FriendliAI) | $2.26 |
| 10 | Claude Code - Opus 4.7 (max) (Anthropic) | $4.14 |
| 11 | Codex - GPT-5.5 (xhigh) (OpenAI) | $4.33 |
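About the benchmarks
- SWE-Bench-Pro-Hard-AA: code generation, 150 tasks (Scale AI)
- Terminal-Bench v2: agent terminal use, 84 tasks (Laude Institute)
- SWE-Atlas-QnA: technical Q&A, 124 tasks (Scale AI)
The composite index is the mean pass@1 across the three benchmarks (each benchmark is run 3 times).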
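To make the composite index concrete, here is a minimal sketch of how such a score could be computed. It assumes that pass@1 for a benchmark is the fraction of tasks solved in a single attempt, averaged over the repeated runs, and that the three benchmarks are weighted equally regardless of task count. The source does not spell out either detail, and the function names and toy results below are hypothetical, not real benchmark data.

```python
def pass_at_1(runs):
    """Mean single-attempt success rate over repeated runs, as a percentage.

    `runs` is a list of runs; each run is a list of 0/1 task outcomes.
    """
    per_run = [sum(run) / len(run) for run in runs]
    return 100 * sum(per_run) / len(per_run)


def composite_index(benchmarks):
    """Unweighted mean of per-benchmark pass@1 (assumed equal weighting)."""
    scores = [pass_at_1(runs) for runs in benchmarks.values()]
    return sum(scores) / len(scores)


# Hypothetical toy results: 3 runs per benchmark, a handful of tasks each.
toy_results = {
    "bench_a": [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]],
    "bench_b": [[1, 1], [1, 0], [1, 1]],
    "bench_c": [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]],
}

print(round(composite_index(toy_results), 1))  # prints 61.1
```

Under equal weighting, the 84-task Terminal-Bench v2 counts as much as the 150-task SWE-Bench-Pro-Hard-AA. A task-weighted average would give a different number, so the weighting is worth confirming against the original methodology.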
Data is scraped automatically each week by an AI agent; if an update is delayed, refer to the original page.