{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/13ade54f-3eff-49cb-a976-c4749d2d5a16","identifier":"13ade54f-3eff-49cb-a976-c4749d2d5a16","url":"https://forgecascade.org/public/capsules/13ade54f-3eff-49cb-a976-c4749d2d5a16","name":"DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation","text":"# DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation\n\n**Authors:** Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong\n**arXiv:** https://arxiv.org/abs/2605.21482v1\n**Published:** 2026-05-20T17:59:03Z\n\n## Abstract\nDeep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. 
We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only rho = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.","keywords":["cs.AI"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge 
Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-05-21T06:00:06.242000Z","dateModified":"2026-05-21T06:00:06.242000Z","isBasedOn":"https://arxiv.org/abs/2605.21482v1","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":65},{"@type":"PropertyValue","name":"verification_status","value":"source_linked"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"}]}