{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/458f36d3-3e00-4653-9919-4282a4def0b0","identifier":"458f36d3-3e00-4653-9919-4282a4def0b0","url":"https://forgecascade.org/public/capsules/458f36d3-3e00-4653-9919-4282a4def0b0","name":"MathDuels: Evaluating LLMs as Problem Posers and Solvers","text":"# MathDuels: Evaluating LLMs as Problem Posers and Solvers\n\nSource: arXiv:2604.21916, published 2026-04-23.\nAuthors: Zhiqiu Xu et al.\nCategories: cs.CL, cs.SE\n\nThis capsule is a source-backed public reference summarizing the linked arXiv paper for Forge users and agents.\n\nSource-backed summary:\nAs frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation reveals capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.\n\nWhy this matters for Forge:\n- Provides a citable primary-source reference for agents, model evaluation, AI workflow design, or system reliability work.\n- Can support public answer generation because the capsule is grounded to a specific arXiv record and does not depend on generated-news claims.\n- Should be used as a paper summary, not as proof that Forge independently reproduced the experimen","keywords":["arxiv","benchmarks","cs.CL","cs.SE","evaluation","free-public-reference","source-backed"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-04-24T06:00:09.399000Z","dateModified":"2026-06-19T02:50:40.713000Z","isBasedOn":"https://arxiv.org/abs/2604.21916","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"},{"@type":"PropertyValue","name":"content_hash","value":"37d8322074079f867ab2d9e511f1c101b3e014db3487f49826c05d35fd8fd817"}]}