Comprehensive Leaderboard for Research AI Systems
DeepScholar-Bench provides a live benchmark for evaluating generative research synthesis systems. Its benchmark dataset is generated from recent ArXiv papers, requiring systems to produce a related work section by retrieving, synthesizing, and citing sources from the web. The benchmark provides holistic evaluation across three critical capabilities of generative research synthesis: knowledge synthesis, retrieval quality, and verifiability.
The table below compares systems across all metrics:
🧠 Knowledge Synthesis · 🔍 Retrieval Quality · ✅ Verifiability

| System | Pipeline | Model | 🧠 Organization | 🧠 Nugget Coverage | 🔍 Relevance Rate | 🔍 Document Importance | 🔍 Reference Coverage | ✅ Citation Precision | ✅ Claim Coverage |
|---|---|---|---|---|---|---|---|---|---|
| OpenAI DeepResearch | Closed | o3 | 0.857 | 0.392 | 0.629 | 0.176 | 0.228 | 0.399 | 0.138 |
| Search AI (o3) | Open | o3 | 0.849 | 0.348 | 0.610 | 0.036 | 0.217 | 0.425 | 0.495 |
| Search AI (Gemini-2.5-pro) | Open | Gemini-2.5-pro | 0.706 | 0.277 | 0.583 | 0.014 | 0.091 | 0.415 | 0.398 |
| Search AI (Claude-opus-4) | Open | Claude-opus-4 | 0.698 | 0.307 | 0.583 | 0.012 | 0.173 | 0.701 | 0.760 |
| Search AI (GPT-4.1) | Open | GPT-4.1 | 0.556 | 0.265 | 0.490 | 0.013 | 0.068 | 0.498 | 0.470 |
| Search AI (Llama-4-Scout) | Open | Llama-4-Scout | 0.151 | 0.193 | 0.445 | 0.013 | 0.067 | 0.316 | 0.368 |
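If you want to slice these results outside the web UI, the sketch below loads the rows above into a pandas DataFrame and ranks systems by a chosen metric. This is an illustrative snippet only, not part of DeepScholar-Bench's official tooling; the column names and the assumption that higher scores are better on every metric are ours.

```python
# Illustrative sketch (not official DeepScholar-Bench tooling): load the
# leaderboard rows above and compare systems across the metric groups.
import pandas as pd

COLUMNS = [
    "system", "pipeline", "model",
    "organization", "nugget_coverage",                               # Knowledge Synthesis
    "relevance_rate", "document_importance", "reference_coverage",   # Retrieval Quality
    "citation_precision", "claim_coverage",                          # Verifiability
]

ROWS = [
    ("OpenAI DeepResearch",        "Closed", "o3",             0.857, 0.392, 0.629, 0.176, 0.228, 0.399, 0.138),
    ("Search AI (o3)",             "Open",   "o3",             0.849, 0.348, 0.610, 0.036, 0.217, 0.425, 0.495),
    ("Search AI (Gemini-2.5-pro)", "Open",   "Gemini-2.5-pro", 0.706, 0.277, 0.583, 0.014, 0.091, 0.415, 0.398),
    ("Search AI (Claude-opus-4)",  "Open",   "Claude-opus-4",  0.698, 0.307, 0.583, 0.012, 0.173, 0.701, 0.760),
    ("Search AI (GPT-4.1)",        "Open",   "GPT-4.1",        0.556, 0.265, 0.490, 0.013, 0.068, 0.498, 0.470),
    ("Search AI (Llama-4-Scout)",  "Open",   "Llama-4-Scout",  0.151, 0.193, 0.445, 0.013, 0.067, 0.316, 0.368),
]

df = pd.DataFrame(ROWS, columns=COLUMNS)

# Rank systems by one metric (assuming higher is better).
print(df.sort_values("citation_precision", ascending=False)[["system", "citation_precision"]])

# Compare two systems side by side across every metric.
subset = df[df["system"].isin(["OpenAI DeepResearch", "Search AI (Claude-opus-4)"])]
print(subset.set_index("system").T)
```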
If you'd like to submit your solution to the DeepScholar-Bench leaderboard, please use our Google Form:
For questions and inquiries, please contact: lianapat@stanford.edu or negara@berkeley.edu