Question 1

Which benchmarks are tracked here?

Accepted Answer

Major public ones: MMLU, GSM8K, HumanEval, HellaSwag, ARC, TruthfulQA, GPQA, plus newer ones like SWE-bench and AIME. The list grows when a benchmark sees adoption beyond a single lab's marketing.

Question 2

How fresh are the scores?

Accepted Answer

Scores update when labs publish official numbers or when reproducible third-party runs surface. Self-reported numbers are flagged separately from independently verified ones, so you can weigh them differently.

Question 3

Why do model scores sometimes shift after launch?

Accepted Answer

Sometimes a benchmark is re-run with a fix; sometimes the model is updated under the same name. Either way, this is one of the messier parts of LLM evaluation, and it is worth knowing about before quoting a single number.

Question 4

Can I trust benchmark numbers for my actual use case?

Accepted Answer

Treat them as a rough filter. A model that crushes MMLU may still flop on your domain-specific tasks. Run your own eval set on the top three candidates before committing to one for production.

Question 5

What's the difference between zero-shot and few-shot scores?

Accepted Answer

Zero-shot means the model gets the question cold, no examples. Few-shot means it sees a handful of solved examples first. Few-shot scores are usually higher; mixing the two in comparisons is misleading.
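The difference is easiest to see in how the prompt is assembled. Below is a minimal sketch; the `build_prompt` helper and the "Q:/A:" layout are assumptions for illustration, not the template any particular benchmark uses.

```python
def build_prompt(question, examples=()):
    """Build a prompt: zero-shot when `examples` is empty, few-shot otherwise.

    `examples` is a sequence of (question, answer) pairs. The "Q:/A:" layout
    is an illustrative assumption, not an official benchmark template.
    """
    parts = []
    for ex_question, ex_answer in examples:
        # Each solved example is shown in full before the real question.
        parts.append(f"Q: {ex_question}\nA: {ex_answer}")
    # The target question always comes last, with the answer left open.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


zero_shot = build_prompt("What is 7 * 8?")
few_shot = build_prompt(
    "What is 7 * 8?",
    examples=[("What is 2 * 3?", "6"), ("What is 4 * 5?", "20")],
)
```

Both prompts end with the same open question; only the few-shot one carries solved examples ahead of it, which is why the two kinds of score should not be compared directly.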

Question 6

Why is contamination such a problem?

Accepted Answer

If the test set ends up in training data, the model has effectively seen the answers. Scores climb without genuine capability gains. New benchmarks released after a model's training cutoff are more trustworthy precisely because contamination is structurally less likely.
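As a rough illustration, one crude way to look for leakage is to measure how many word n-grams of a test item also appear in a sample of training text. This is a sketch under stated assumptions: the function names, the choice of `n=5`, and the sample strings are all hypothetical, and real contamination audits are considerably more involved.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in `text` (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_item, corpus_text, n=5):
    """Fraction of the test item's n-grams that also appear in the corpus.

    A value near 1.0 suggests the item may have leaked. A value near 0.0
    does not prove the item is clean (paraphrases will slip through).
    """
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(corpus_text, n)) / len(test_grams)


# Hypothetical training-data sample and two hypothetical test items.
corpus = "forum post: if a train leaves at noon going 60 mph how far does it travel by 3 pm"
leaked = "If a train leaves at noon going 60 mph how far does it travel by 3 pm"
fresh = "A cyclist rides 12 km each morning for a week; what distance is covered in total"

leaked_overlap = overlap_fraction(leaked, corpus)
fresh_overlap = overlap_fraction(fresh, corpus)
```

Here `leaked_overlap` is much higher than `fresh_overlap`, which is the signal such a check looks for. It only works if you can inspect the training data, which is rarely possible for closed models; that is part of why post-cutoff benchmarks are valued.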

Question 7

What's a held-out eval set, and why do I need one?

Accepted Answer

It's a private set of tasks representative of your actual workload, kept out of any vendor's hands. You run candidates against it before committing. Benchmarks tell you about average performance; your held-out set tells you about your specific use case, which is usually what matters.
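A held-out set can be as simple as a list of prompts with expected answers and a loop that scores each candidate. The sketch below assumes exact-match grading and uses stand-in functions in place of real model calls; the item contents and the names `evaluate`, `candidate_a`, and `candidate_b` are hypothetical.

```python
def evaluate(model_fn, eval_set):
    """Return the fraction of items where the model's answer matches exactly.

    `model_fn` is any callable that takes a prompt string and returns a string.
    Exact match after strip/lowercase is a simplifying assumption; many
    workloads need a more forgiving grader.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    correct = sum(
        model_fn(item["prompt"]).strip().lower() == item["expected"].strip().lower()
        for item in eval_set
    )
    return correct / len(eval_set)


# Hypothetical held-out items; in practice these come from your own workload.
eval_set = [
    {"prompt": "Classify ticket: 'refund not received'", "expected": "billing"},
    {"prompt": "Classify ticket: 'app crashes on login'", "expected": "bug"},
]


# Stand-in candidates; replace each with a real call to the model under test.
def candidate_a(prompt):
    return "billing" if "refund" in prompt else "bug"


def candidate_b(prompt):
    return "billing"


candidates = {"candidate_a": candidate_a, "candidate_b": candidate_b}
scores = {name: evaluate(fn, eval_set) for name, fn in candidates.items()}
```

Keeping the grader and the items in your own code, rather than sharing them with a vendor, is what keeps the set held out.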

Question 8

How should I weight self-reported versus third-party scores?

Accepted Answer

Third-party reproductions get more weight because they remove the launch-post incentive to report high. When a self-reported number is all that exists, treat it as plausible but not gospel, and discount it slightly when comparing against independently verified scores from competing models.
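One way to make "discount it slightly" concrete is to subtract a fixed number of points from self-reported scores before ranking. The sketch below is illustrative only: the 2-point discount, the model names, and the scores are arbitrary placeholders, not values this tracker uses.

```python
def comparable_value(score, discount_points=2.0):
    """Return the score to use when comparing models.

    Self-reported scores are reduced by `discount_points` (an arbitrary
    placeholder); independently verified scores are used as-is.
    """
    if score["self_reported"]:
        return max(0.0, score["value"] - discount_points)
    return score["value"]


# Hypothetical entries: value is a benchmark score in percent.
scores = [
    {"model": "model_x", "value": 88.0, "self_reported": True},
    {"model": "model_y", "value": 87.0, "self_reported": False},
]

ranked = sorted(scores, key=comparable_value, reverse=True)
```

With these placeholder numbers the independently verified `model_y` ranks ahead of `model_x` even though its raw score is lower. How large the discount should be is a judgment call; the point is to apply it consistently.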

Agent	Speed (tasks/hr)	Cost per Task (USD)	Accuracy (%)	Context Handling (out of 10)
Claude Sonnet 4	14.2	0.08	93.1	9.2
Claude Opus 4	9.8	0.22	96.4	9.7
GPT-4o	12.5	0.11	91.8	8.5
Gemini 2.5 Pro	11.3	0.13	90.2	9.0
DeepSeek V3	15.1	0.04	88.5	7.8
Codex	16.8	0.06	89.7	7.2
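The cost and accuracy columns can be combined into a single rough figure: the expected cost per correctly completed task, computed as cost per task divided by accuracy. This derived metric is not part of the tracker. It is a sketch that assumes failed tasks cost the same as successful ones and that every failure is simply wasted spend.

```python
# (agent, cost per task in USD, accuracy as a fraction), copied from the table above.
rows = [
    ("Claude Sonnet 4", 0.08, 0.931),
    ("Claude Opus 4", 0.22, 0.964),
    ("GPT-4o", 0.11, 0.918),
    ("Gemini 2.5 Pro", 0.13, 0.902),
    ("DeepSeek V3", 0.04, 0.885),
    ("Codex", 0.06, 0.897),
]


def cost_per_correct_task(cost, accuracy):
    """Expected spend per correct result, assuming failures cost full price."""
    return cost / accuracy


# Cheapest per correct task first.
ranked = sorted(rows, key=lambda r: cost_per_correct_task(r[1], r[2]))
for agent, cost, accuracy in ranked:
    print(f"{agent}: ${cost_per_correct_task(cost, accuracy):.3f} per correct task")
```

A ranking like this ignores speed and context handling, so treat it as one view of the table rather than an overall verdict.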

AI Agent Benchmark Tracker
