🏆 DeepScholar-Bench Leaderboard

Comprehensive Leaderboard for Research AI Systems

Last updated: 2025-12-02 10:05:48 UTC
GitHub Repository  |  Research Paper  |  DeepResearch Preview
Submit Your Solution

🌐🔍 About DeepScholar-Bench

DeepScholar-Bench provides a live benchmark for evaluating generative research synthesis systems. Its benchmark dataset is generated from recent arXiv papers, requiring systems to produce a related-work section by retrieving, synthesizing, and citing sources from the web. The benchmark provides holistic evaluation across three critical capabilities of generative research synthesis: knowledge synthesis, retrieval quality, and verifiability.

| System | Org | Model | 🧠 Organization | 🧠 Nugget Coverage | 🔍 Relevance Rate | 🔍 Document Importance | 🔍 Reference Coverage | ✅ Citation Precision | ✅ Claim Coverage |
|---|---|---|---|---|---|---|---|---|---|
| OpenAI DeepResearch | Unknown | o3 | 0.857 | 0.392 | 0.629 | 0.176 | 0.228 | 0.399 | 0.138 |
| Search AI (o3) | Unknown | o3 | 0.849 | 0.348 | 0.610 | 0.036 | 0.217 | 0.425 | 0.495 |
| Search AI (Gemini-2.5-pro) | Unknown | Gemini-2.5-pro | 0.706 | 0.277 | 0.583 | 0.014 | 0.091 | 0.415 | 0.398 |
| Search AI (Claude-opus-4) | Unknown | Claude-opus-4 | 0.698 | 0.307 | 0.583 | 0.012 | 0.173 | 0.701 | 0.760 |
| Search AI (GPT-4.1) | Unknown | GPT-4.1 | 0.556 | 0.265 | 0.490 | 0.013 | 0.068 | 0.498 | 0.470 |
| Search AI (Llama-4-Scout) | Unknown | Llama-4-Scout | 0.151 | 0.193 | 0.445 | 0.013 | 0.067 | 0.316 | 0.368 |

📊 Evaluation Metrics

🧠 Knowledge Synthesis

  • Organization - Measures how well the system organizes and structures the related work section
  • Nugget Coverage - Evaluates the comprehensiveness of key insights and findings covered

🔍 Retrieval Quality

  • Relevance Rate - Assesses how relevant the retrieved references are to the query
  • Document Importance - Measures the significance and impact of cited documents
  • Reference Coverage - Evaluates the breadth of reference sources included
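As a rough illustration, Relevance Rate and Reference Coverage behave like set-based precision and recall over references. This is a minimal sketch under that assumption; the function names and the exact definitions used by DeepScholar-Bench's official scorer may differ.

```python
# Hypothetical set-based sketch of two retrieval-quality metrics.
# The actual DeepScholar-Bench implementation may define these differently.

def relevance_rate(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of retrieved references that are judged relevant (precision-like)."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def reference_coverage(retrieved: set[str], ground_truth: set[str]) -> float:
    """Fraction of the ground-truth reference set that was retrieved (recall-like)."""
    if not ground_truth:
        return 0.0
    return len(retrieved & ground_truth) / len(ground_truth)
```

For example, a system that retrieves two references, only one of which is relevant, scores a relevance rate of 0.5; retrieving one of two ground-truth references gives a reference coverage of 0.5.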

✅ Verifiability

  • Citation Precision - Measures the accuracy and correctness of citations
  • Claim Coverage - Evaluates how well claims are supported by evidence
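The two verifiability metrics can be sketched as simple ratios over claim-citation pairs, given some judge (e.g. a human or an LLM) that decides whether a document supports a claim. The `supports` callback and function names below are hypothetical; the benchmark's official scoring pipeline may differ.

```python
# Hypothetical sketch of verifiability metrics, assuming a supports(claim, doc)
# judgment function is available (not part of the benchmark's public API).
from typing import Callable

def citation_precision(citations: list[tuple[str, str]],
                       supports: Callable[[str, str], bool]) -> float:
    """Fraction of (claim, cited_doc) pairs where the cited doc supports the claim."""
    if not citations:
        return 0.0
    correct = sum(1 for claim, doc in citations if supports(claim, doc))
    return correct / len(citations)

def claim_coverage(claims: list[str], cited_docs: list[str],
                   supports: Callable[[str, str], bool]) -> float:
    """Fraction of claims supported by at least one cited document."""
    if not claims:
        return 0.0
    supported = sum(1 for c in claims if any(supports(c, d) for d in cited_docs))
    return supported / len(claims)
```

Under this reading, a system can score high citation precision but low claim coverage when its citations are accurate yet many claims are left uncited, which may explain divergent columns in the leaderboard above.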

📬 Submit Your Solution

If you'd like to submit your solution to the DeepScholar-Bench leaderboard, please contact us:

📧 Email negara@berkeley.edu
