Researchers from the Center for AI Safety and Scale AI have introduced "Humanity's Last Exam," a rigorous benchmark designed to gauge how close advanced AI models are to human-level expert knowledge across more than 100 subjects.

The exam, detailed in a new Nature study, comprises 2,500 PhD-level questions vetted by more than 1,000 experts worldwide. Questions are designed to be unambiguous, objectively verifiable, and not answerable through a simple web search.

In initial tests, leading models such as OpenAI's o1 scored only 8.3%. As of February 2026, the top score is 48.4%, achieved by Google's Gemini 3 Deep Think, while human experts average 90% within their own fields.

The creators caution that while high performance on this benchmark is a necessary milestone, it alone does not signify the achievement of Artificial General Intelligence (AGI).