

Ai2 debuts SciArena, introducing a new platform for evaluating foundation models in scientific literature

Ai2 is unveiling SciArena, an open and collaborative platform that directly engages the scientific research community in evaluating foundation models for scientific literature tasks.

SciArena is an open evaluation platform where researchers can compare and vote on the performance of different foundation models on tasks related to scientific literature. It uses a community voting approach similar to Chatbot Arena's, but is tailored to the complex, open-ended nature of scientific inquiry.

The platform has three main components:

SciArena Platform: This is where human researchers submit questions, view side-by-side responses from different foundation models, and cast their votes for the preferred output.

Leaderboard: Based on community votes, an Elo rating system ranks the models, providing a dynamic and up-to-date assessment of their performance.

SciArena-Eval: This is a meta-evaluation benchmark built on the collected human preference data, designed to assess the accuracy of model-based evaluation systems.
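The Elo-style leaderboard described above can be sketched in a few lines. The K-factor, initial rating, and logistic expected-score formula below are standard Elo conventions used for illustration, not SciArena's published parameters:

```python
K = 32          # update step size (assumed, not SciArena's actual value)
INITIAL = 1000  # starting rating for a newly added model (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one community vote to the leaderboard."""
    ra = ratings.setdefault(winner, INITIAL)
    rb = ratings.setdefault(loser, INITIAL)
    ea = expected_score(ra, rb)
    # Winner gains in proportion to how unexpected the win was.
    ratings[winner] = ra + K * (1 - ea)
    ratings[loser] = rb - K * (1 - ea)

ratings = {}
update(ratings, "model-a", "model-b")  # model-a wins one vote
```

Because ratings shift after every vote, the leaderboard stays a dynamic, continuously updated assessment rather than a fixed benchmark score.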

As of June 30, 2025, SciArena hosts 23 frontier foundation models, selected for their representation of current state-of-the-art capabilities.

Among them, the o3 model consistently delivers top performance across all scientific domains, according to the company. Ai2 found that o3 provides a more detailed elaboration of cited scientific papers, and its output tends to be more technical in Engineering disciplines. Performance among the remaining models varies by discipline: for instance, Claude-4-Opus excels in Healthcare, while DeepSeek-R1-0528 performs well in Natural Science.

When a user submits a question on SciArena, the platform utilizes an advanced multi-stage retrieval pipeline, adapted from Ai2's Scholar QA system, to gather relevant scientific paper contexts.

This pipeline includes query decomposition, passage retrieval, and re-ranking to ensure high-quality and relevant information.
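The three stages named above can be sketched as a minimal pipeline. The function bodies below are toy placeholders (lexical term overlap standing in for dense retrieval and learned re-ranking); SciArena's actual pipeline, adapted from Scholar QA, may differ substantially:

```python
from typing import List

def decompose_query(question: str) -> List[str]:
    # Placeholder: a real system would split the question into
    # focused sub-queries, often with an LLM.
    return [question]

def retrieve_passages(sub_query: str, corpus: List[str]) -> List[str]:
    # Placeholder lexical retrieval by shared terms.
    terms = set(sub_query.lower().split())
    return [p for p in corpus if terms & set(p.lower().split())]

def rerank(question: str, passages: List[str]) -> List[str]:
    # Placeholder re-ranking: order by term overlap with the question.
    terms = set(question.lower().split())
    return sorted(passages, key=lambda p: -len(terms & set(p.lower().split())))

def gather_contexts(question: str, corpus: List[str]) -> List[str]:
    candidates = []
    for sq in decompose_query(question):
        candidates.extend(retrieve_passages(sq, corpus))
    # Deduplicate while preserving order, then re-rank.
    return rerank(question, list(dict.fromkeys(candidates)))
```

The staged design lets each step be swapped independently, e.g. replacing the retriever without touching the re-ranker.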

These retrieved contexts, along with the user's question, are then fed to two randomly selected foundation models. The models generate long-form, literature-grounded responses, complete with citations. To mitigate potential biases from stylistic elements, responses are post-processed to a standardized, plain-text format with consistent citation styles. Users then evaluate these outputs and vote for the one that best satisfies their information needs.
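The generation-and-voting flow above can be sketched as follows. The random pairing, anonymized "A"/"B" labels, and crude markdown-stripping stand-in for SciArena's fuller post-processing are all assumptions for illustration:

```python
import random
import re

def standardize(text: str) -> str:
    # Rough stand-in for SciArena's post-processing: strip simple
    # markdown emphasis so styling doesn't reveal which model wrote it.
    return re.sub(r"[*_`#]+", "", text).strip()

def run_battle(question: str, contexts: list, models: dict) -> dict:
    """Pair two randomly chosen models; identities stay hidden
    behind 'A'/'B' labels until after the user votes."""
    name_a, name_b = random.sample(list(models), 2)
    return {
        "A": (name_a, standardize(models[name_a](question, contexts))),
        "B": (name_b, standardize(models[name_b](question, contexts))),
    }
```

A usage example: `run_battle("What causes superconductivity?", contexts, {"m1": gen1, "m2": gen2})`, where `gen1` and `gen2` are hypothetical callables that return literature-grounded answers.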

To ensure the reliability and integrity of this human preference data, rigorous quality control measures were applied, including a separate annotation pipeline for vetting pairwise annotations:

Expert annotators: The initial data collection involved 102 researchers with at least two peer-reviewed publications and prior experience with AI-assisted literature tools.

Comprehensive training: All annotators underwent a one-hour training session to ensure consistency and accuracy in their evaluations.

Blind rating: In SciArena’s interface, the models that generated each answer are not revealed until after the user submits their vote.

Inter-Annotator Agreement (IAA) and self-consistency: SciArena assesses both IAA and self-consistency to quantify the reliability of the collected data. Results show strong self-consistency (weighted Cohen’s κ of 0.91), meaning individual annotators' judgments remain stable over time, and high IAA (weighted Cohen’s κ of 0.76), indicating that experts tend to reach similar judgments despite the subjective nature of some questions.
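Weighted Cohen's κ, the agreement statistic cited above, can be computed with a short linear-weighted implementation. The three-way vote categories (A wins, tie, B wins) are an assumption for illustration; SciArena's exact category scheme and weighting are not specified here:

```python
def weighted_kappa(r1: list, r2: list, categories: list) -> float:
    """Linear-weighted Cohen's kappa for two raters over ordinal labels."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)
    # Observed joint distribution over label pairs.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[idx[a]][idx[b]] += 1 / n
    # Marginal distributions for each rater.
    p1 = [sum(row) for row in obs]
    p2 = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # linear disagreement weight
            num += w * obs[i][j]          # observed weighted disagreement
            den += w * p1[i] * p2[j]      # chance-expected weighted disagreement
    return 1 - num / den
```

κ reaches 1.0 only under perfect agreement and falls toward (or below) 0 as agreement drops to chance level, which is why values of 0.91 and 0.76 indicate strong reliability.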

Ai2 said SciArena welcomes partnerships with model developers, which would enable the company to evaluate new models and add them to the leaderboard.

For more information about this news, visit https://allenai.org.
