DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han +14 more
April 16, 2026
cs.AI

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials, and each task is paired with a static research sandbox corpus, containing supportive documents, distractors, and noise, that simulates open-web complexity while remaining fully verifiable. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and we validate its alignment with human judgments. Experiments with our multi-agent system DR$^{3}$-Agent, built on multiple state-of-the-art language models, demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
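To make the evaluation framework concrete, the sketch below shows one way the five per-dimension scores could be collected and aggregated for a single report. It is a minimal sketch under stated assumptions: the names `ReportEvaluation` and `overall`, the [0, 1] score range, and the unweighted mean are all illustrative, not the paper's actual scoring procedure.

```python
from dataclasses import dataclass

# The five dimensions named in the abstract.
DIMENSIONS = (
    "information_recall",
    "factual_accuracy",
    "citation_coverage",
    "instruction_following",
    "depth_quality",
)

@dataclass
class ReportEvaluation:
    """Per-dimension scores in [0, 1] for one generated report."""
    scores: dict[str, float]

    def overall(self) -> float:
        # Hypothetical aggregation: an unweighted mean over the five
        # dimensions; the paper may weight or combine them differently.
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimension scores: {missing}")
        return sum(self.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Example with made-up scores:
ev = ReportEvaluation(scores={
    "information_recall": 0.72,
    "factual_accuracy": 0.81,
    "citation_coverage": 0.64,
    "instruction_following": 0.90,
    "depth_quality": 0.55,
})
print(f"overall: {ev.overall():.3f}")  # prints: overall: 0.724
```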


Code Implementations (4)

Every Eval Ever is a shared schema and crowdsourced eval database. It defines a standardized metadata format for storing AI evaluation results, from leaderboard scrapes and research papers to local evaluation runs, so that results from different frameworks can be compared, reproduced, and reused; a minimal sketch of one such record follows the repository details below.

5,129 stars · Oct 8, 2025 · updated 3 days ago · MIT license
Tags: evaluations, infra
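As a concrete illustration of the shared-schema idea, here is a minimal sketch of what one standardized result record might look like. The field names and structure are assumptions for illustration, not the project's actual metadata format.

```python
import json

# Hypothetical eval-result record; field names are illustrative
# assumptions, not the Every Eval Ever schema itself.
record = {
    "benchmark": "dr3-eval",        # which benchmark was run
    "task_id": "report-0042",       # task within the benchmark
    "model": "example-model-v1",    # system under evaluation
    "source": "local_run",          # e.g. leaderboard scrape, paper, local run
    "metrics": {
        "information_recall": 0.72,
        "factual_accuracy": 0.81,
    },
    "framework": "custom-harness",  # harness that produced the numbers
    "run_date": "2026-04-16",
}

# A shared format means results from different harnesses can be
# serialized, pooled, and compared with the same tooling.
print(json.dumps(record, indent=2))
```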

Research papers and reproducible code evaluating LLM performance on natural language tasks.

10 stars · Mar 8, 2026 · updated 1 month ago

An open-source framework for stress-testing AI systems that brings together benchmarks to evaluate bias, toxicity, truthfulness, robustness, and adversarial risk in modern AI and LLM systems. Built for reproducibility, grounded in academic research, and designed for real-world governance, risk, and safety use cases.

10 stars · Mar 27, 2026 · updated 6 days ago · MIT license

Continuous evaluation of the Oxford IHTM Open Science and Reproducible Research in R Lecture Series.

10 stars · Mar 15, 2023 · updated 2 years ago
Tags: data-science, open-science, r, reproducible-research, rstats
