🏆 Humanity's Last Exam Leaderboard for Agents with Tools


About Humanity's Last Exam (HLE)

Humanity's Last Exam (HLE) is a rigorous, multi-modal AI benchmark created by the Center for AI Safety in collaboration with Scale AI, designed to push large language models beyond saturated tests by evaluating reasoning and expert-level knowledge across thousands of challenging questions spanning mathematics, natural sciences, and the humanities.


Why another leaderboard?

While leaderboards already exist for HLE, they fall short in several ways, leading to widespread confusion about the true state-of-the-art results. In fact, if you ask ChatGPT, Gemini, and Claude "What's the SOTA for Humanity's Last Exam?" today, they will all get the answer wrong.

  • Tool Exclusion: The official leaderboard by Scale AI and many other leaderboards focus on models without tool use.
  • Lack of Separation: Scores for the full HLE benchmark and the text-only subset are often mixed together, leading to unfair comparisons, since the score on the full set is generally lower than the score on the text-only subset.
  • Data Contamination: Since copies of HLE, as well as blogs and papers discussing HLE, have been indexed by search engines, scores might be artificially inflated for agents that do not filter their search results. In this leaderboard, we add a ✓ badge to indicate that some form of filtering is mentioned for the agent or its previous versions.
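
To make the contamination point concrete, here is a minimal sketch of what search-result filtering could look like. It is an illustration under stated assumptions, not the method used by any agent on this leaderboard: the blocklists, `is_blocked`, and `filter_results` are hypothetical names and values chosen for the example.

```python
from urllib.parse import urlparse

# Hypothetical blocklists for illustration only; no agent on this
# leaderboard is known to use these exact values.
BLOCKED_DOMAINS = {"huggingface.co"}
BLOCKED_PHRASES = ("humanity's last exam",)


def is_blocked(url: str, text: str) -> bool:
    """Return True if a search result should be dropped before the agent sees it."""
    host = (urlparse(url).hostname or "").lower()
    # Match the blocked domain itself and any of its subdomains.
    domain_hit = any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)
    # A domain check alone misses blogs and papers that discuss the benchmark,
    # so the page text is also scanned for benchmark-specific phrases.
    lowered = text.lower().replace("\u2019", "'")
    phrase_hit = any(p in lowered for p in BLOCKED_PHRASES)
    return domain_hit or phrase_hit


def filter_results(results):
    """Keep only the (url, text) pairs that pass the blocklist."""
    return [(u, t) for u, t in results if not is_blocked(u, t)]
```

A domain blocklist on its own would let through a blog post that quotes benchmark questions, which is why the sketch also checks the page text. Real filtering would likely need more than either check.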



📊 Full Set Leaderboard

Results on the complete set of tasks (2500 examples).

| Agent / Model | Organization | Open Source | Publish Date | Full Set Score | Source |
|---|---|---|---|---|---|
| Zoom Federated AI ✓ | Zoom | No | 2025-12-29 | 53.0 | https://www.zoom.com/en/blog/zoom-ai-redefining-agentic-federated-intelligence/ |
| GPT-5.2 Pro ✓ | OpenAI | No | 2025-12-11 | 50.0 | https://openai.com/index/introducing-gpt-5-2/ |
| Gemini Deep Research | Google | No | 2025-12-11 | 46.4 | https://blog.google/technology/developers/deep-research-agent-gemini-api/ |
| Gemini 3 Pro | Google | No | 2025-11-18 | 45.8 | https://blog.google/products/gemini/gemini-3/ |
| GPT-5.2 Thinking ✓ | OpenAI | No | 2025-12-11 | 45.5 | https://openai.com/index/introducing-gpt-5-2/ |
| Grok 4 (Heavy) | xAI | No | 2025-07-09 | 44.4 | https://x.ai/news/grok-4 |
| Gemini 3 Flash | Google | No | 2025-12-17 | 43.5 | https://blog.google/products/gemini/gemini-3-flash/ |
| Claude Opus 4.5 ✓ | Anthropic | No | 2025-12-22 | 43.2 | https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf |
| GLM 4.7 | Z.ai | Yes | 2025-12-22 | 42.8 | https://z.ai/blog/glm-4.7 |
| GPT-5 Pro ✓ | OpenAI | No | 2025-12-11 | 42.0 | https://openai.com/index/introducing-gpt-5/ |
| ChatGPT Agent ✓ | OpenAI | No | 2025-07-17 | 41.6 | https://openai.com/index/introducing-chatgpt-agent/ |
| Grok 4 | xAI | No | 2025-07-09 | 38.6 | https://x.ai/news/grok-4 |
| GPT-5 ✓ | OpenAI | No | 2025-12-11 | 35.2 | https://openai.com/index/introducing-gpt-5/ |
| Claude Sonnet 4.5 ✓ | Anthropic | No | 2025-09-29 | 28.4 | https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf |
| OpenAI Deep Research | OpenAI | No | 2025-02-02 | 26.6 | https://openai.com/index/introducing-deep-research/ |
| Perplexity Deep Research | Perplexity | No | 2025-02-14 | 21.1 | https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research |

📊 Text-Only Leaderboard

Results on the text-only subset of tasks (2158 examples).

| Agent / Model | Organization | Open Source | Publish Date | Text-Only Score | Source |
|---|---|---|---|---|---|
| Zoom Federated AI ✓ | Zoom | No | 2025-12-29 | 55.2 | https://tinyurl.com/sileixu-hle-linkedin-53 |
| Kimi K2 Thinking (Heavy) ✓ | Moonshot AI | Yes | 2025-11-06 | 51.0 | https://huggingface.co/moonshotai/Kimi-K2-Thinking |
| DeepWriter | Deepwriter AI | No | 2025-11-26 | 50.9 | https://deepwriter.com/blog/small-team-beats-worlds-top-ai-labs-at-hle/ |
| Grok 4 (Heavy) | xAI | No | 2025-07-09 | 50.7 | https://x.ai/news/grok-4 |
| Kimi K2 Thinking ✓ | Moonshot AI | Yes | 2025-11-06 | 44.9 | https://huggingface.co/moonshotai/Kimi-K2-Thinking |
| Seed1.8 | ByteDance | No | 2025-12-18 | 41.7 | https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ljhwZthlaukjlkulzlp/research/Seed-1.8-Modelcard.pdf |
| MiroThinker-v1.5-235B ✓ | MiroMind AI | Yes | 2026-01-04 | 39.2 | https://huggingface.co/miromind-ai/MiroThinker-v1.5-235B |
| Tongyi-DeepResearch-30B-A3B (Heavy) | Alibaba | Yes | 2025-11-04 | 38.3 | https://arxiv.org/pdf/2510.24701 |
| MiroThinker-v1.0-72B ✓ | MiroMind AI | Yes | 2025-11-14 | 37.7 | https://arxiv.org/pdf/2511.11793 |
| ToolOrchestra | NVIDIA | Yes | 2025-11-26 | 37.1 | https://arxiv.org/pdf/2511.21689 |
| MiroThinker-v1.0-30B ✓ | MiroMind AI | Yes | 2025-11-14 | 33.4 | https://arxiv.org/pdf/2511.11793 |
| Tongyi-DeepResearch-30B-A3B | Alibaba | Yes | 2025-11-04 | 32.9 | https://arxiv.org/pdf/2510.24701 |
| MiniMax-M2 | MiniMax AI | Yes | 2025-10-27 | 31.8 | https://huggingface.co/MiniMaxAI/MiniMax-M2 |
| MiroThinker-v1.5-30B ✓ | MiroMind AI | Yes | 2026-01-04 | 31.0 | https://huggingface.co/miromind-ai/MiroThinker-v1.5-30B |
| MiroThinker-v1.0-8B ✓ | MiroMind AI | Yes | 2025-11-14 | 21.5 | https://arxiv.org/pdf/2511.11793 |
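
The reason the two leaderboards should not be mixed can be sketched with a little arithmetic. The snippet below assumes that a score is plain accuracy over questions and that the 342 questions outside the text-only subset (2500 minus 2158) are the non-text ones. `full_set_score` and the example numbers are hypothetical and do not correspond to any agent listed here.

```python
FULL_SET_SIZE = 2500   # examples in the full set
TEXT_ONLY_SIZE = 2158  # examples in the text-only subset
NON_TEXT_SIZE = FULL_SET_SIZE - TEXT_ONLY_SIZE  # 342, assumed to be the non-text questions


def full_set_score(text_only_score: float, non_text_score: float) -> float:
    """Combine two subset accuracies (in percent) into a full-set accuracy.

    The full-set score is the size-weighted average of the subset scores.
    """
    weighted = TEXT_ONLY_SIZE * text_only_score + NON_TEXT_SIZE * non_text_score
    return weighted / FULL_SET_SIZE


# Hypothetical agent: 50.0 on text-only questions, 30.0 on the rest.
print(round(full_set_score(50.0, 30.0), 1))  # 47.3
```

Under these assumptions, an agent that is weaker on the non-text questions always scores lower on the full set than on the text-only subset, so a text-only number placed next to full-set numbers looks better than it should.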

Notes

  • All models and agents listed have tool use capabilities.
  • All numbers are based on the official report of each agent.
  • A missing ✓ badge does not necessarily mean the agent applied no filtering; it only means that no mention of filtering was found.
  • Having a ✓ badge does not necessarily mean the agent applied perfect filtering. For example, blocking only Hugging Face URLs is not sufficient.
  • We exclude scores reported on non-official subsets as the results are not comparable.
  • Please contact us at hle@zoom.us if you find any errors or want to add a model or agent to the leaderboard.
  • Last updated: Jan 8, 2026.