🏆 Humanity's Last Exam Leaderboard for Agents with Tools

About Humanity's Last Exam (HLE)

Humanity's Last Exam (HLE) is a rigorous, multi-modal AI benchmark created by the Center for AI Safety in collaboration with Scale AI, designed to push large language models beyond saturated tests by evaluating reasoning and expert-level knowledge across thousands of challenging questions spanning mathematics, natural sciences, and the humanities.

Why another leaderboard?

Leaderboards for HLE already exist, but they fall short in several ways, causing widespread confusion about the true state-of-the-art results. In fact, if you ask ChatGPT, Gemini, and Claude "What's the SOTA for Humanity's Last Exam?" today, they will all get the answer wrong.

Tool Exclusion: The official leaderboard by Scale AI and many other leaderboards focus on models without tool use.
Lack of Separation: Scores for the full HLE benchmark and the text-only subset are often mixed together, leading to unfair comparisons, because scores on the full set are generally lower than scores on the text-only subset.
Data Contamination: Because copies of HLE, along with blogs and papers discussing it, have been indexed by search engines, scores may be artificially inflated for agents that do not filter what they retrieve. In this leaderboard, a ✓ badge indicates that some form of filtering is mentioned for the agent or its previous versions.
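
To make the Data Contamination point concrete, here is a minimal Python sketch of one possible search-result filter. It is not the method used by any agent on this leaderboard: the blocklists, the function names (`is_contaminated`, `filter_results`), and the result format (dicts with `url` and `snippet` keys) are all assumptions chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical blocklists for illustration only; a real agent would
# need a broader and regularly updated set of rules.
BLOCKED_DOMAINS = {"huggingface.co"}
BLOCKED_PHRASES = ("humanity's last exam",)


def is_contaminated(url: str, text: str = "") -> bool:
    """Return True if a search result looks like it could leak HLE content."""
    host = (urlparse(url).hostname or "").lower()
    # Match the blocked domain itself and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return True
    # Normalize curly apostrophes so both spellings of the name match.
    lowered = text.lower().replace("\u2019", "'")
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)


def filter_results(results):
    """Keep only results that pass the contamination check.

    Each result is assumed to be a dict with a "url" key and an
    optional "snippet" key.
    """
    return [
        r for r in results
        if not is_contaminated(r["url"], r.get("snippet", ""))
    ]
```

The phrase check is there because a domain blocklist alone misses copies of the benchmark hosted elsewhere. Even so, a filter like this is a rough heuristic and will not catch paraphrased or partially quoted questions.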

📊 Full Set Leaderboard

Results on the complete set of tasks (2500 examples).

Agent / Model	Organization	Open Source	Publish Date	Full Set Score (%)	Source
Poetiq Meta-System	Poetiq	No	2026-02-10	55.0	https://poetiq.ai/posts/raising_the_bar_hle_simpleqa/
Gemini 3 Deep Think ✓	Google	No	2026-02-12	53.4	https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/
Claude Opus 4.6 ✓	Anthropic	No	2026-02-05	53.1	https://www.anthropic.com/news/claude-opus-4-6
Zoom Federated AI ✓	Zoom	No	2025-12-29	53.0	https://www.zoom.com/en/blog/zoom-ai-redefining-agentic-federated-intelligence/
Gemini 3.1 Pro ✓	Google	No	2026-02-19	51.4	https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/
Kimi K2.5 Thinking ✓	Moonshot AI	Yes	2026-01-27	50.2	https://www.kimi.com/blog/kimi-k2-5.html
GPT-5.2 Pro ✓	OpenAI	No	2025-12-11	50.0	https://openai.com/index/introducing-gpt-5-2/
Claude Sonnet 4.6 ✓	Anthropic	No	2026-02-18	49.0	https://www.anthropic.com/news/claude-sonnet-4-6
Yunjue Agent	Yunjue	Yes	2026-01-25	48.0	https://github.com/YunjueTech/Yunjue-Agent
Gemini Deep Research ✓	Google	No	2025-12-11	46.4	https://blog.google/technology/developers/deep-research-agent-gemini-api/
Gemini 3 Pro ✓	Google	No	2025-11-18	45.8	https://blog.google/products/gemini/gemini-3/
GPT-5.2 Thinking ✓	OpenAI	No	2025-12-11	45.5	https://openai.com/index/introducing-gpt-5-2/
Grok 4 (Heavy)	xAI	No	2025-07-09	44.4	https://x.ai/news/grok-4
Gemini 3 Flash ✓	Google	No	2025-12-17	43.5	https://blog.google/products/gemini/gemini-3-flash/
Claude Opus 4.5 ✓	Anthropic	No	2025-12-22	43.2	https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf
GPT-5 Pro ✓	OpenAI	No	2025-12-11	42.0	https://openai.com/index/introducing-gpt-5/
ChatGPT Agent ✓	OpenAI	No	2025-07-17	41.6	https://openai.com/index/introducing-chatgpt-agent/
Grok 4	xAI	No	2025-07-09	38.6	https://x.ai/news/grok-4
GPT-5 ✓	OpenAI	No	2025-12-11	35.2	https://openai.com/index/introducing-gpt-5/
Claude Sonnet 4.5 ✓	Anthropic	No	2025-09-29	28.4	https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf
OpenAI Deep Research	OpenAI	No	2025-02-02	26.6	https://openai.com/index/introducing-deep-research/
Perplexity Deep Research	Perplexity	No	2025-02-14	21.1	https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research

📊 Text-Only Leaderboard

Results on the text-only subset of tasks (2158 examples).

Agent / Model	Organization	Open Source	Publish Date	Text-Only Score (%)	Source
Qwen3-Max-Thinking (Heavy) ✓	Alibaba	No	2026-01-26	58.3	https://qwen.ai/blog?id=qwen3-max-thinking
Zoom Federated AI ✓	Zoom	No	2025-12-29	55.2	https://tinyurl.com/sileixu-hle-linkedin-53
Seed2.0 Pro	ByteDance	No	2026-02-14	54.2	https://seed.bytedance.com/en/blog/seed2-0-%E6%AD%A3%E5%BC%8F%E5%8F%91%E5%B8%83
Kimi K2.5 Thinking ✓	Moonshot AI	Yes	2026-01-27	51.8	https://www.kimi.com/blog/kimi-k2-5.html
Kimi K2 Thinking (Heavy) ✓	Moonshot AI	Yes	2025-11-06	51.0	https://huggingface.co/moonshotai/Kimi-K2-Thinking
DeepWriter	Deepwriter AI	No	2025-11-26	50.9	https://deepwriter.com/blog/small-team-beats-worlds-top-ai-labs-at-hle/
Grok 4 (Heavy)	xAI	No	2025-07-09	50.7	https://x.ai/news/grok-4
GLM 5	Z.ai	Yes	2026-02-11	50.4	https://z.ai/blog/glm-5
Qwen3-Max-Thinking ✓	Alibaba	No	2026-01-26	49.8	https://qwen.ai/blog?id=qwen3-max-thinking
Seed2.0 Lite	ByteDance	No	2026-02-14	49.5	https://seed.bytedance.com/en/blog/seed2-0-%E6%AD%A3%E5%BC%8F%E5%8F%91%E5%B8%83
Kimi K2 Thinking ✓	Moonshot AI	Yes	2025-11-06	44.9	https://huggingface.co/moonshotai/Kimi-K2-Thinking
GLM 4.7	Z.ai	Yes	2025-12-22	42.8	https://z.ai/blog/glm-4.7
Seed1.8	ByteDance	No	2025-12-18	41.7	https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ljhwZthlaukjlkulzlp/research/Seed-1.8-Modelcard.pdf
DeepSeek-V3.2	DeepSeek	Yes	2025-12-01	40.8	https://arxiv.org/pdf/2512.02556v1
MiroThinker-v1.5-235B ✓	MiroMind AI	Yes	2026-01-04	39.2	https://huggingface.co/miromind-ai/MiroThinker-v1.5-235B
Tongyi-DeepResearch-30B-A3B (Heavy)	Alibaba	Yes	2025-11-04	38.3	https://arxiv.org/pdf/2510.24701
MiroThinker-v1.0-72B ✓	MiroMind AI	Yes	2025-11-14	37.7	https://arxiv.org/pdf/2511.11793
ToolOrchestra	NVIDIA	Yes	2025-11-26	37.1	https://arxiv.org/pdf/2511.21689
MiroThinker-v1.0-30B ✓	MiroMind AI	Yes	2025-11-14	33.4	https://arxiv.org/pdf/2511.11793
Tongyi-DeepResearch-30B-A3B	Alibaba	Yes	2025-11-04	32.9	https://arxiv.org/pdf/2510.24701
MiniMax-M2	MiniMax AI	Yes	2025-10-27	31.8	https://huggingface.co/MiniMaxAI/MiniMax-M2
MiroThinker-v1.5-30B ✓	MiroMind AI	Yes	2026-01-04	31.0	https://huggingface.co/miromind-ai/MiroThinker-v1.5-30B
MiroThinker-v1.0-8B ✓	MiroMind AI	Yes	2025-11-14	21.5	https://arxiv.org/pdf/2511.11793

Notes

All models and agents listed have tool use capabilities.
All numbers are based on the official report of each agent.
A missing ✓ badge does not necessarily mean the agent applied no filtering; it only means that we found no mention of filtering.
A ✓ badge does not necessarily mean the agent applied perfect filtering. For example, blocking only Hugging Face URLs is not sufficient.
We exclude scores reported on non-official subsets as the results are not comparable.
Please contact us at hle@zoom.us if you find any errors or want to add a model or agent to the leaderboard.
Last updated: Mar 2, 2026.