Humanity's Last Exam Leaderboard for Agents with Tools
About Humanity's Last Exam (HLE)
Humanity's Last Exam (HLE) is a rigorous, multi-modal AI benchmark created by the Center for AI Safety in collaboration with Scale AI, designed to push large language models beyond saturated tests by evaluating reasoning and expert-level knowledge across thousands of challenging questions spanning mathematics, natural sciences, and the humanities.
Why another leaderboard?
While leaderboards exist for HLE, they fall short in several ways, leading to widespread confusion about the true state-of-the-art results. In fact, if you ask ChatGPT, Gemini, and Claude "What's the SOTA for Humanity's Last Exam" today, they will all get the answer wrong.
- Tool Exclusion: The official leaderboard by Scale AI and many other leaderboards focus on models without tool use.
- Lack of separation: Scores for the full HLE benchmark and the text-only subset are often mixed together, leading to unfair comparisons, since scores on the full set are generally lower than scores on the text-only subset.
- Data Contamination: Since copies of HLE, along with blogs and papers discussing it, have been indexed by search engines, scores might be artificially inflated for agents that do not filter their search results. In this leaderboard, we add a ✅ badge to indicate that some form of filtering is mentioned for the agent or its previous versions.
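To make the contamination concern concrete, here is a minimal sketch of the kind of search-result filter an agent might apply before reading retrieved pages. It is not the method used by any agent on this leaderboard; the blocklist contents, the `is_contaminated` and `filter_results` names, and the result dictionary shape (`url`, `snippet`) are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical blocklists, for illustration only. A real filter would need
# broader coverage (mirrors, papers, blog posts, forum threads, and so on).
BLOCKED_DOMAINS = {"huggingface.co"}
BLOCKED_PHRASES = ("humanity's last exam",)


def is_contaminated(url: str, snippet: str) -> bool:
    """Return True if a search result looks like it could leak HLE content."""
    host = (urlparse(url).hostname or "").lower()
    # Match the blocked domain itself or any of its subdomains.
    domain_hit = any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)
    # Normalize curly apostrophes so both spellings of the name match.
    text = snippet.lower().replace("\u2019", "'")
    phrase_hit = any(p in text for p in BLOCKED_PHRASES)
    return domain_hit or phrase_hit


def filter_results(results):
    """Drop results flagged by is_contaminated; keep the rest in order."""
    return [r for r in results if not is_contaminated(r["url"], r["snippet"])]


if __name__ == "__main__":
    sample = [
        {"url": "https://huggingface.co/datasets/cais/hle", "snippet": "dataset card"},
        {"url": "https://blog.example.com/post", "snippet": "Answers to Humanity's Last Exam"},
        {"url": "https://en.wikipedia.org/wiki/Protein", "snippet": "Proteins are large biomolecules"},
    ]
    print([r["url"] for r in filter_results(sample)])
```

The second sample result shows why a domain blocklist alone is likely insufficient: the page lives on an unrelated domain and is only caught by the phrase check.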
Full Set Leaderboard
Results on the complete set of tasks (2500 examples).
| Agent / Model | Organization | Open Source | Publish Date | Full Set Score |
|---|---|---|---|---|
| Zoom Federated AI ✅ | Zoom | No | 2025-12-29 | [53.0](https://www.zoom.com/en/blog/zoom-ai-redefining-agentic-federated-intelligence/) |
| GPT-5.2 Pro ✅ | OpenAI | No | 2025-12-11 | [50.0](https://openai.com/index/introducing-gpt-5-2/) |
| Gemini Deep Research | Google | No | 2025-12-11 | [46.4](https://blog.google/technology/developers/deep-research-agent-gemini-api/) |
| Gemini 3 Pro | Google | No | 2025-11-18 | [45.8](https://blog.google/products/gemini/gemini-3/) |
| GPT-5.2 Thinking ✅ | OpenAI | No | 2025-12-11 | [45.5](https://openai.com/index/introducing-gpt-5-2/) |
| Grok 4 (Heavy) | xAI | No | 2025-07-09 | [44.4](https://x.ai/news/grok-4) |
| Gemini 3 Flash | Google | No | 2025-12-17 | [43.5](https://blog.google/products/gemini/gemini-3-flash/) |
| Claude Opus 4.5 ✅ | Anthropic | No | 2025-12-22 | [43.2](https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf) |
| GLM 4.7 | Z.ai | Yes | 2025-12-22 | [42.8](https://z.ai/blog/glm-4.7) |
| GPT-5 Pro ✅ | OpenAI | No | 2025-12-11 | [42.0](https://openai.com/index/introducing-gpt-5/) |
| ChatGPT Agent ✅ | OpenAI | No | 2025-07-17 | [41.6](https://openai.com/index/introducing-chatgpt-agent/) |
| Grok 4 | xAI | No | 2025-07-09 | [38.6](https://x.ai/news/grok-4) |
| GPT-5 ✅ | OpenAI | No | 2025-12-11 | [35.2](https://openai.com/index/introducing-gpt-5/) |
| Claude Sonnet 4.5 ✅ | Anthropic | No | 2025-09-29 | [28.4](https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf) |
| OpenAI Deep Research | OpenAI | No | 2025-02-02 | [26.6](https://openai.com/index/introducing-deep-research/) |
| Perplexity Deep Research | Perplexity | No | 2025-02-14 | [21.1](https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research) |
Text-Only Leaderboard
Results on the text-only subset of tasks (2158 examples).
| Agent / Model | Organization | Open Source | Publish Date | Text-Only Score |
|---|---|---|---|---|
| Zoom Federated AI ✅ | Zoom | No | 2025-12-29 | [55.2](https://tinyurl.com/sileixu-hle-linkedin-53) |
| Kimi K2 Thinking (Heavy) ✅ | Moonshot AI | Yes | 2025-11-06 | [51.0](https://huggingface.co/moonshotai/Kimi-K2-Thinking) |
| DeepWriter | Deepwriter AI | No | 2025-11-26 | [50.9](https://deepwriter.com/blog/small-team-beats-worlds-top-ai-labs-at-hle/) |
| Grok 4 (Heavy) | xAI | No | 2025-07-09 | [50.7](https://x.ai/news/grok-4) |
| Kimi K2 Thinking ✅ | Moonshot AI | Yes | 2025-11-06 | [44.9](https://huggingface.co/moonshotai/Kimi-K2-Thinking) |
| Seed1.8 | ByteDance | No | 2025-12-18 | [41.7](https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ljhwZthlaukjlkulzlp/research/Seed-1.8-Modelcard.pdf) |
| MiroThinker-v1.5-235B ✅ | MiroMind AI | Yes | 2026-01-04 | [39.2](https://huggingface.co/miromind-ai/MiroThinker-v1.5-235B) |
| Tongyi-DeepResearch-30B-A3B (Heavy) | Alibaba | Yes | 2025-11-04 | [38.3](https://arxiv.org/pdf/2510.24701) |
| MiroThinker-v1.0-72B ✅ | MiroMind AI | Yes | 2025-11-14 | [37.7](https://arxiv.org/pdf/2511.11793) |
| ToolOrchestra | NVIDIA | Yes | 2025-11-26 | [37.1](https://arxiv.org/pdf/2511.21689) |
| MiroThinker-v1.0-30B ✅ | MiroMind AI | Yes | 2025-11-14 | [33.4](https://arxiv.org/pdf/2511.11793) |
| Tongyi-DeepResearch-30B-A3B | Alibaba | Yes | 2025-11-04 | [32.9](https://arxiv.org/pdf/2510.24701) |
| MiniMax-M2 | MiniMax AI | Yes | 2025-10-27 | [31.8](https://huggingface.co/MiniMaxAI/MiniMax-M2) |
| MiroThinker-v1.5-30B ✅ | MiroMind AI | Yes | 2026-01-04 | [31.0](https://huggingface.co/miromind-ai/MiroThinker-v1.5-30B) |
| MiroThinker-v1.0-8B ✅ | MiroMind AI | Yes | 2025-11-14 | [21.5](https://arxiv.org/pdf/2511.11793) |
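Because the two tables above are scored on different question sets, a text-only score cannot be converted into a full-set score without knowing how the agent does on the remaining questions. The sketch below shows the arithmetic under stated assumptions: the text-only subset (2158 examples) is contained in the full set (2500 examples), the same agent and setup are used for both, and the function name `full_set_bounds` is hypothetical. It returns only the range of full-set scores consistent with a given text-only score, not an estimate.

```python
FULL_SET = 2500    # examples in the full set, as stated on this page
TEXT_ONLY = 2158   # examples in the text-only subset, as stated on this page
REMAINING = FULL_SET - TEXT_ONLY  # 342 examples outside the text-only subset


def full_set_bounds(text_only_score: float) -> tuple[float, float]:
    """Range of full-set scores (in percent) consistent with a text-only score.

    The lower bound assumes every remaining question is answered incorrectly;
    the upper bound assumes every remaining question is answered correctly.
    """
    correct_text = text_only_score / 100 * TEXT_ONLY
    low = 100 * correct_text / FULL_SET
    high = 100 * (correct_text + REMAINING) / FULL_SET
    return low, high


if __name__ == "__main__":
    low, high = full_set_bounds(50.0)
    print(f"text-only 50.0 -> full set between {low:.2f} and {high:.2f}")
```

The width of that range, roughly 13.7 percentage points, is one way to see why scores from the two tables should not be ranked against each other.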
Notes
- All models and agents listed have tool use capabilities.
- All numbers are based on the official report of each agent.
- A missing ✅ badge does not necessarily mean the agent applied no filtering; it only means that we found no mention of filtering.
- Having a ✅ badge does not necessarily mean the agent applied perfect filtering. For example, blocking only Hugging Face URLs is not sufficient.
- We exclude scores reported on non-official subsets as the results are not comparable.
- Please contact us at hle@zoom.us if you find any errors or want to add a model or agent to the leaderboard.
- Last updated: Jan 8, 2026.