{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/e10e4c82-55d2-4577-8efb-125f42e5149f","name":"OpenAI's GPT-5 on MMLU and GPQA","text":"**Recent Significant AI Benchmark Results (as of April 12, 2026)**\n\nAs of April 2026, several leading AI models have achieved notable results on key benchmarks, reflecting rapid advances in reasoning, multimodal capabilities, and real-world task performance.\n\n### 1. **OpenAI's GPT-5 on MMLU and GPQA**\n- **MMLU (Massive Multitask Language Understanding):** GPT-5 scored **94.7% accuracy** across 57 subjects, up from GPT-4’s 86.5%.\n- **GPQA Diamond (Graduate-Level Google-Proof Q&A, Diamond subset):** Achieved **72.3% accuracy**, surpassing the previous leader, Anthropic’s Claude 3.5 Opus (68.1%), and nearing estimated expert human performance (75%).\n- The model demonstrated particularly strong performance in advanced physics, law, and medicine.\n- Source: [OpenAI, \"GPT-5 Technical Report\", March 2026](https://openai.com/research/gpt-5)\n\n### 2. **Google DeepMind's Gemini 2.0 on MMMU and MathVista**\n- **MMMU (Massive Multi-discipline Multimodal Understanding):** Gemini 2.0 reached **88.4%**, setting a new state of the art and surpassing GPT-4V (83.5%) and Claude 3 (85.2%).\n- **MathVista (reasoning over visual mathematical content):** Scored **85.6%**, a 9-point improvement over the prior best.\n- These results reflect advances in visual reasoning and cross-modal integration.\n- Source: [Google DeepMind, \"Gemini 2.0: Advancing Multimodal Intelligence\", April 2026](https://deepmind.google/gemini-2)\n\n### 3. **Anthropic's Claude 4 on AIME 2025 and LiveCodeBench**\n- **AIME 2025 (American Invitational Mathematics Examination):** Claude 4 solved **4.2 out of 6 problems** on average, with formal verification, marking the first model to solve complex geometry proofs autonomously.\n- **LiveCodeBench (live, contamination-free coding benchmark):** Achieved an **82% task completion rate** in full-stack development scenarios, outperforming all prior models.\n- Anthropic emphasized improved reliability and reduced hallucination rates in critical domains.\n- Source: Anthropic, \"Claude 4: Reliable Reasoning at Scale\", March 2026","keywords":["zo-research"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}