AlphaEval

Evaluating Agents in Production

Real-world Delivery

Production-Grade Evaluation

Market Readiness

Total Task Hours: 2,420 hrs

US Market Valuation: $154K-231K USD

CN Market Valuation: ¥391K-570K CNY

6 Domains: Human Resources, Finance & Investment, Procurement & Ops, Software Engineering, Healthcare & Life Sci, Technology Research

Real Business Tasks

Expert-level value anchoring based on real-world business scenarios

Comprehensive evaluation results based on real business scenarios.
Last Updated: 2026-04-01

Rank	Scaffold	Model	Avg Score (Avg. 94)
1	Claude Code	Claude Opus 4.6	64.41
2	Cursor	Claude Opus 4.6	61.85
3	GitHub Copilot	Claude Opus 4.6	61.31
4	GitHub Copilot	GPT-5.2	54.91
5	Codex CLI	Claude Opus 4.6	53.45
6	Claude Code	Gemini 3 Pro	50.78
7	GitHub Copilot	Gemini 3 Pro	49.92
8	Codex CLI	GLM-5	49.85
9	Claude Code	GLM-5	48.70
10	Codex CLI	GPT-5.2	47.59
11	Claude Code	Kimi K2.5	43.90
12	Codex CLI	Kimi K2.5	43.09
13	Claude Code	MiniMax M2.5	40.89
14	Claude Code	GPT-5.2	39.47
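
Because each model appears under several scaffolds, it can help to collapse the table to one number per model. The snippet below is a minimal sketch of that arithmetic, not part of AlphaEval itself: the `LEADERBOARD` list is copied by hand from the rows above, and `mean_score_by_model` is a hypothetical helper name introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean

# (scaffold, model, avg_score) tuples copied from the leaderboard rows above.
LEADERBOARD = [
    ("Claude Code", "Claude Opus 4.6", 64.41),
    ("Cursor", "Claude Opus 4.6", 61.85),
    ("GitHub Copilot", "Claude Opus 4.6", 61.31),
    ("GitHub Copilot", "GPT-5.2", 54.91),
    ("Codex CLI", "Claude Opus 4.6", 53.45),
    ("Claude Code", "Gemini 3 Pro", 50.78),
    ("Codex CLI", "GLM-5", 49.85),
    ("GitHub Copilot", "Gemini 3 Pro", 49.92),
    ("Claude Code", "GLM-5", 48.70),
    ("Codex CLI", "GPT-5.2", 47.59),
    ("Codex CLI", "Kimi K2.5", 43.09),
    ("Claude Code", "Kimi K2.5", 43.90),
    ("Claude Code", "MiniMax M2.5", 40.89),
    ("Claude Code", "GPT-5.2", 39.47),
]


def mean_score_by_model(rows):
    """Average the scaffold-level scores for each model (hypothetical helper)."""
    scores = defaultdict(list)
    for _scaffold, model, score in rows:
        scores[model].append(score)
    return {model: mean(values) for model, values in scores.items()}


by_model = mean_score_by_model(LEADERBOARD)

# Print models from highest to lowest mean score.
for model, score in sorted(by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<16} {score:.2f}")
```

The models were not all run under the same set of scaffolds (MiniMax M2.5 appears only with Claude Code, for example), so these per-model means are a rough summary and not a controlled comparison.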