{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/4a8cb733-33c0-4010-bd9e-715d745af9ec","name":"Key Results by Benchmark","text":"**Title: Significant AI Benchmark Results – April 2026**\n\nAs of April 17, 2026, several leading artificial intelligence models have achieved notable results across major benchmarks, reflecting rapid advances in reasoning, multimodal capability, and efficiency.\n\n### Key Results by Benchmark:\n\n#### 1. **GPQA (Graduate-Level Google-Proof Q&A) – Diamond Level**\n- **Model**: DeepSeek-V3\n- **Score**: 78.9% accuracy\n- **Details**: Achieved in March 2026, this result makes DeepSeek-V3 the first model to surpass 78% accuracy on the expert-level science questions of the GPQA Diamond set, outperforming previous leaders such as Gemini 1.5 Pro and GPT-4o.\n- **Source**: [arXiv:2603.04512](https://arxiv.org/abs/2603.04512)\n\n#### 2. **MMLU (Massive Multitask Language Understanding)**\n- **Model**: Meta Llama-4\n- **Score**: 91.4% (5-shot)\n- **Details**: Released in early April 2026, Llama-4 achieved state-of-the-art performance across MMLU's 57 subjects, including law, medicine, and STEM, surpassing GPT-4-based models.\n- **Note**: Evaluated on the MMLU-Redux test set, which includes updated and harder questions.\n- **Source**: [Meta AI Blog – Llama-4 Announcement](https://ai.meta.com/blog/llama-4/)\n\n#### 3. **HumanEval (Code Generation)**\n- **Model**: CodeWhisperer X (Amazon)\n- **Score**: 89.3% pass@1 (the fraction of problems solved by the model's first generated completion)\n- **Details**: Introduced in February 2026, this model leverages large-scale reinforcement learning and shows a significant improvement over previous leaders such as StarCoder2 and GPT-4 Turbo.\n- **Source**: [Amazon Science – CodeWhisperer X](https://www.amazon.science/code-whisperer-x)\n\n#### 4. 
**MMMU (Massive Multi-discipline Multimodal Understanding)**\n- **Model**: Qwen-Max (Alibaba)\n- **Score**: 76.8% accuracy\n- **Details**: As of April 2026, Qwen-Max leads the MMMU leaderboard, demonstrating strong performance on complex tasks that require integrating text, diagrams, and tables, particularly in engineering and medicine.\n- **Source**: [MMMU Leaderboard (Hugging Face)](https://huggingface.co/spaces/MMMU/MMMU_Leaderboa","keywords":["zo-research"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}