Introduction to Xbench
Hongshan began developing the benchmark in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models were worth investing in. Since then, the team, led by partner Gong Yuan, has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public.
Approach to Benchmarking
Xbench tackles evaluation with two complementary systems. One resembles a traditional benchmark: an academic test that gauges a model’s aptitude on various subjects. The other works more like a technical interview round for a job, assessing how much real-world economic value a model might deliver.
Assessing Raw Intelligence
Xbench’s methods for assessing raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing postgraduate-level STEM benchmarks like GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the right answer but also the reasoning chain that leads to it.
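Xbench has not released its grading code, so as a purely hypothetical sketch, chain-aware scoring of this kind could be approximated with a keyword rubric that credits both the final answer and the expected reasoning steps. All names, weights, and the sample question below are invented for illustration.

```python
# Hypothetical sketch of answer-plus-reasoning grading, assuming a keyword
# rubric. Xbench's real scoring pipeline is unpublished; every name and
# weight here is an assumption.
from dataclasses import dataclass


@dataclass
class GradedItem:
    expected_answer: str
    expected_steps: list[str]  # phrases a sound reasoning chain should contain


def score_response(item: GradedItem, response: str,
                   answer_weight: float = 0.5) -> float:
    """Return a 0-1 score mixing answer correctness and reasoning coverage."""
    text = response.lower()
    answer_score = 1.0 if item.expected_answer.lower() in text else 0.0
    hits = sum(step.lower() in text for step in item.expected_steps)
    step_score = hits / len(item.expected_steps) if item.expected_steps else 0.0
    return answer_weight * answer_score + (1 - answer_weight) * step_score


item = GradedItem(
    expected_answer="12.4 km/s",
    expected_steps=["escape velocity", "sqrt(2gm/r)", "gravitational constant"],
)
print(score_response(item, "With v = sqrt(2GM/r), Earth's escape velocity is 12.4 km/s."))
```

A matcher this naive would be easy to game, of course; given that the questions are professor-checked, the real rubric more plausibly relies on expert review or an LLM judge.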
DeepResearch Component
DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature that can’t simply be googled but require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. One question from the public set asks: “How many Chinese cities in the three northwestern provinces border a foreign country?” (In case you’re wondering, the answer is 12, and only 33% of the models tested got it right.)
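Here too the grading pipeline is not open-sourced, but one plausible way to combine those three signals is a weighted rubric. The fields, weights, and abstention rule below are assumptions, not Xbench’s actual method.

```python
# Hypothetical sketch of scoring a DeepResearch-style answer on breadth of
# sources, factual consistency, and calibrated abstention. All weights and
# field names are invented for illustration.
from dataclasses import dataclass


@dataclass
class ResearchAnswer:
    cited_sources: int          # distinct sources the model consulted
    factual_errors: int         # claims contradicted by the reference answer
    admitted_uncertainty: bool  # did the model flag missing or thin data?


def score_research(ans: ResearchAnswer, evidence_is_thin: bool = False) -> float:
    """Combine breadth, consistency, and abstention into a 0-1 score."""
    breadth = min(ans.cited_sources / 5, 1.0)             # saturate at 5 sources
    consistency = max(1.0 - 0.25 * ans.factual_errors, 0.0)
    # Reward admitting uncertainty only when the evidence really is thin,
    # so models aren't trained to hedge on every question.
    abstention = 1.0 if ans.admitted_uncertainty == evidence_is_thin else 0.0
    return 0.4 * breadth + 0.4 * consistency + 0.2 * abstention


print(score_research(ResearchAnswer(cited_sources=6, factual_errors=1,
                                    admitted_uncertainty=False)))
```

The interesting design choice, whatever the real implementation, is that abstention is only rewarded when abstaining is correct: saying “there isn’t enough data” must match the ground truth, not substitute for research.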
Future Developments
On the company’s website, the researchers said they want to add more dimensions to the test, such as how creative a model is in its problem solving, how collaborative it is when working with other models, and how reliable it is. The team has committed to updating the test questions once a quarter and to maintaining a half-public, half-private data set.
Assessing Real-World Readiness
To assess models’ real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to match advertisers with appropriate short-video creators from a pool of over 800 influencers. The website also teases upcoming categories, including finance, legal, accounting, and design. The question sets for these categories have not yet been open-sourced.
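Since these question sets are closed, any grading details are guesswork; still, a task like the influencer match could in principle be scored against expert-curated picks with a simple overlap metric. Everything in this sketch, from the field names to precision-at-k, is an assumption for illustration.

```python
# Hypothetical sketch of grading a marketing-matching task against expert
# ground truth. Task fields and the precision-at-k metric are invented;
# Xbench's professional rubrics have not been open-sourced.
from dataclasses import dataclass, field


@dataclass
class MatchingTask:
    advertiser_brief: str
    creator_pool: list[str]                  # e.g. the 800+ influencer IDs
    expert_picks: set[str] = field(default_factory=set)


def precision_at_k(task: MatchingTask, model_picks: list[str], k: int = 5) -> float:
    """Fraction of the model's top-k valid picks that experts also selected."""
    top_k = [c for c in model_picks[:k] if c in task.creator_pool]
    return len(set(top_k) & task.expert_picks) / k


task = MatchingTask(
    advertiser_brief="skincare brand targeting college students",
    creator_pool=[f"creator_{i}" for i in range(800)],
    expert_picks={"creator_12", "creator_77", "creator_401"},
)
print(precision_at_k(task, ["creator_12", "creator_401", "creator_9",
                            "creator_77", "creator_650"]))
```

Note that picks outside the pool still count against the score, since the denominator stays at k; a real rubric would presumably also grade the written justification for each pick, which a set-overlap metric cannot capture.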
Current Rankings
ChatGPT o3 ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.
Expert Opinion
“It is really difficult for benchmarks to include things that are so hard to quantify,” says Zihan Zheng, the lead researcher on a new benchmark called LiveCodeBench Pro and a student at NYU. “But Xbench represents a promising start.”
Conclusion
Xbench takes a two-pronged approach to evaluating AI models, measuring not only their raw intelligence but also their ability to deliver real-world economic value. With quarterly question updates and planned expansion into new professional categories, it aims to remain a useful benchmark as models continue to improve.
FAQs
- Q: What is Xbench?
  A: Xbench is a benchmarking system designed to assess the capabilities of AI models, focusing on both their raw intelligence and their real-world applicability.
- Q: How does Xbench assess raw intelligence?
  A: Xbench uses two components: Xbench-ScienceQA for postgraduate-level academic knowledge and Xbench-DeepResearch for navigating the Chinese-language web to answer complex research questions.
- Q: What real-world tasks does Xbench assess?
  A: Currently, Xbench covers tasks in recruitment and marketing, with plans to expand into finance, legal, accounting, and design.
- Q: How often will Xbench be updated?
  A: The team plans to update the test questions once a quarter.
- Q: Which models currently rank highest on Xbench?
  A: ChatGPT o3 ranks first in both the recruitment and marketing categories, with Perplexity Search, Claude, Grok, and Gemini also performing well.