🏆 DeepScholar-Bench Leaderboard

Comprehensive Leaderboard for Research AI Systems

Last updated: 2025-08-28 15:24:29 UTC
GitHub Repository  |  Research Paper
Submit Your Solution

🌐🔍 About DeepScholar-Bench

DeepScholar-Bench provides a live benchmark for evaluating generative research synthesis systems. Its dataset is drawn from recent arXiv papers, and each task requires a system to generate a paper's related work section by retrieving, synthesizing, and citing sources from the web. The benchmark provides holistic evaluation across three critical capabilities of generative research synthesis: knowledge synthesis, retrieval quality, and verifiability.
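
To make the task shape concrete, here is a minimal sketch of what a system under evaluation consumes and produces: a query derived from a recent arXiv paper in, a cited related work section out. All names and fields below are illustrative assumptions, not the benchmark's actual schema; see the GitHub repository for the real data format.

```python
# Hypothetical task interface (field names are illustrative assumptions,
# not DeepScholar-Bench's actual schema).
from dataclasses import dataclass, field


@dataclass
class Task:
    paper_id: str   # recent arXiv paper the query is drawn from
    title: str
    abstract: str


@dataclass
class Citation:
    claim: str       # sentence in the generated section making a claim
    source_url: str  # web source the system retrieved and cited


@dataclass
class RelatedWorkOutput:
    text: str  # generated related work section
    citations: list[Citation] = field(default_factory=list)


def run_system(task: Task) -> RelatedWorkOutput:
    """A system under evaluation: retrieve sources from the web,
    synthesize them, and return a cited related work section."""
    raise NotImplementedError
```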

📊 Interactive Radar Charts

[Interactive radar charts on the live leaderboard let you check/uncheck systems to compare their performance across all metrics; the current results are tabulated below.]

Metric groups: 🧠 Knowledge Synthesis (Organization, Nugget Coverage) · 🔍 Retrieval Quality (Relevance Rate, Document Importance, Reference Coverage) · ✅ Verifiability (Citation Precision, Claim Coverage)

| System Name | Pipeline | Model | Organization | Nugget Coverage | Relevance Rate | Document Importance | Reference Coverage | Citation Precision | Claim Coverage |
|---|---|---|---|---|---|---|---|---|---|
| OpenAI DeepResearch | Closed | o3 | 0.857 | 0.392 | 0.629 | 0.176 | 0.228 | 0.399 | 0.138 |
| Search AI (o3) | Open | o3 | 0.849 | 0.348 | 0.610 | 0.036 | 0.217 | 0.425 | 0.495 |
| Search AI (Gemini-2.5-pro) | Open | Gemini-2.5-pro | 0.706 | 0.277 | 0.583 | 0.014 | 0.091 | 0.415 | 0.398 |
| Search AI (Claude-opus-4) | Open | Claude-opus-4 | 0.698 | 0.307 | 0.583 | 0.012 | 0.173 | 0.701 | 0.760 |
| Search AI (GPT-4.1) | Open | GPT-4.1 | 0.556 | 0.265 | 0.490 | 0.013 | 0.068 | 0.498 | 0.470 |
| Search AI (Llama-4-Scout) | Open | Llama-4-Scout | 0.151 | 0.193 | 0.445 | 0.013 | 0.067 | 0.316 | 0.368 |

📊 Evaluation Metrics

🧠 Knowledge Synthesis

  • Organization - Measures how well the system organizes and structures the related work section
  • Nugget Coverage - Evaluates the comprehensiveness of key insights and findings covered
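
As a rough illustration of how a coverage-style metric like Nugget Coverage can be scored, the sketch below computes the fraction of reference "nuggets" (key findings from the original related work) that a judge marked as present in the generated section. This is a simplified assumption for exposition; the benchmark's actual scoring uses LLM judges and is defined in the paper and repository, and Organization is a holistic judge rating not shown here.

```python
# Illustrative sketch only: inputs are assumed per-nugget judge decisions.
def nugget_coverage(covered: list[bool]) -> float:
    """Fraction of reference nuggets a judge found covered by the generated section."""
    return sum(covered) / len(covered) if covered else 0.0


# Example: a judge found 3 of 4 reference nuggets in the generated text.
print(nugget_coverage([True, True, False, True]))  # 0.75
```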

🔍 Retrieval Quality

  • Relevance Rate - Assesses how relevant the retrieved references are to the query
  • Document Importance - Measures the significance and impact of cited documents
  • Reference Coverage - Evaluates the breadth of reference sources included
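
The retrieval metrics can be read as simple ratios over judged references. The sketch below is a hedged simplification under that assumption: relevance rate as the share of retrieved references judged relevant, and reference coverage as the share of a gold reference set (e.g., the original paper's references) that the system actually retrieved. Document Importance, which weighs the significance of cited documents, is omitted here; the exact definitions are in the paper.

```python
# Illustrative ratios only; the benchmark's precise judging procedure differs.
def relevance_rate(relevant_retrieved: int, total_retrieved: int) -> float:
    """Share of retrieved references judged relevant to the query."""
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0


def reference_coverage(gold_found: int, gold_total: int) -> float:
    """Share of the gold reference set that the system retrieved (assumed definition)."""
    return gold_found / gold_total if gold_total else 0.0


# Example: 7 of 10 retrieved sources judged relevant; 4 of 25 gold references found.
print(relevance_rate(7, 10), reference_coverage(4, 25))  # 0.7 0.16
```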

✅ Verifiability

  • Citation Precision - Measures the accuracy and correctness of citations
  • Claim Coverage - Evaluates how well claims are supported by evidence
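
Verifiability can likewise be pictured as two ratios over judged claim-citation pairs: citation precision over all citations, and claim coverage over all claims. The sketch below assumes the per-pair support judgments are already available; the benchmark's real scoring pipeline and definitions are in the paper.

```python
# Illustrative sketch; assumes judged support counts are given.
def citation_precision(supported_citations: int, total_citations: int) -> float:
    """Fraction of citations whose cited source actually supports the claim."""
    return supported_citations / total_citations if total_citations else 0.0


def claim_coverage(supported_claims: int, total_claims: int) -> float:
    """Fraction of claims backed by at least one supporting citation."""
    return supported_claims / total_claims if total_claims else 0.0


# Example: 12 of 20 citations support their claims; 9 of 15 claims are supported.
print(citation_precision(12, 20), claim_coverage(9, 15))  # 0.6 0.6
```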

📬 Submit Your Solution

If you'd like to submit your solution to the DeepScholar-Bench leaderboard, please use our Google Form:

📝 Submit via Google Form

For questions and inquiries, please contact: lianapat@stanford.edu or negara@berkeley.edu