Technology Hive

A Chinese firm launches constantly changing AI benchmarks

by Adam Smith – Tech Writer & Blogger
June 23, 2025
in Artificial Intelligence (AI)

Introduction to Xbench

Development of Xbench began at Hongshan in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models were worth investing in. Since then, a team led by partner Gong Yuan has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, the team decided to release it to the public.

Approach to Benchmarking

Xbench approaches the problem with two different systems. One is similar to traditional benchmarking: an academic test that gauges a model’s aptitude on various subjects. The other is more like a technical interview round for a job, assessing how much real-world economic value a model might deliver.

Assessing Raw Intelligence

Xbench’s methods for assessing raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing postgraduate-level STEM benchmarks like GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the right answer but also the reasoning chain that leads to it.

DeepResearch Component

DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature. These questions can’t simply be googled; they require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. One question in the public set is “How many Chinese cities in the three northwestern provinces border a foreign country?” (The answer is 12, and only 33% of the models tested got it right.)
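
To make the three scoring criteria above concrete, here is a minimal sketch of how such a rubric could be combined into one score. The article names the criteria but not how they are weighted or measured, so the weights, the `Judgement` fields, and the `rubric_score` function are all hypothetical and are not Xbench's actual scoring method.

```python
from dataclasses import dataclass

# Hypothetical weights: the article lists the criteria but not their weighting.
WEIGHTS = {"breadth": 0.4, "consistency": 0.4, "abstention": 0.2}


@dataclass
class Judgement:
    """One graded answer. All fields are illustrative assumptions."""
    distinct_sources: int    # number of distinct sources the answer cites
    consistent_claims: int   # claims that agree with the reference material
    total_claims: int        # all factual claims found in the answer
    answerable: bool         # whether enough data exists to answer at all
    model_abstained: bool    # whether the model said it lacked enough data


def rubric_score(j: Judgement, source_cap: int = 5) -> float:
    """Combine the three criteria into a score between 0 and 1."""
    # Breadth of sources, capped so extra citations stop adding credit.
    breadth = min(j.distinct_sources, source_cap) / source_cap
    # Factual consistency as the share of claims that check out.
    consistency = j.consistent_claims / j.total_claims if j.total_claims else 0.0
    # Credit for abstaining exactly when the question cannot be answered.
    abstention = 1.0 if j.model_abstained == (not j.answerable) else 0.0
    total = (
        WEIGHTS["breadth"] * breadth
        + WEIGHTS["consistency"] * consistency
        + WEIGHTS["abstention"] * abstention
    )
    return round(total, 3)
```

Under these assumed weights, an answer that cites few sources and fails to admit missing data scores well below one that is broad, consistent, and appropriately cautious.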

Future Developments

On the company’s website, the researchers said they want to add more dimensions to the test, such as how creative a model is in its problem solving, how collaborative it is when working with other models, and how reliable it is. The team has committed to updating the test questions once a quarter and to maintaining a half-public, half-private data set.
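
One way a half-public, half-private question set with quarterly refreshes could be maintained is sketched below. This is an assumption for illustration only: the article does not describe how Xbench performs its split, and the `partition` function, the question IDs, and the quarter label format are hypothetical.

```python
import hashlib


def partition(question_ids, quarter):
    """Split question IDs into equal public and private halves.

    The ordering is derived from a hash of the quarter label and the ID,
    so the split is reproducible within a quarter and reshuffles when the
    quarter label changes.
    """
    def sort_key(qid):
        # Hashing avoids any dependence on the input order of the IDs.
        return hashlib.sha256(f"{quarter}:{qid}".encode("utf-8")).hexdigest()

    ordered = sorted(question_ids, key=sort_key)
    mid = len(ordered) // 2
    return {"public": ordered[:mid], "private": ordered[mid:]}


# Usage with 100 placeholder IDs, matching the size of the DeepResearch set.
ids = [f"q{i}" for i in range(100)]
split = partition(ids, "2025-Q3")
```

Keeping half of the questions private makes it harder for models to be tuned to the test, while the public half lets outsiders inspect what is being measured.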

Assessing Real-World Readiness

To assess models’ real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to match advertisers with appropriate short-video creators from a pool of over 800 influencers. The website also teases upcoming categories, including finance, legal, accounting, and design. The question sets for these categories have not yet been open-sourced.

Current Rankings

ChatGPT-o3 ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.

Expert Opinion

“It is really difficult for benchmarks to include things that are so hard to quantify,” says Zihan Zheng, the lead researcher on a new benchmark called LiveCodeBench Pro and a student at NYU. “But Xbench represents a promising start.”

Conclusion

Xbench evaluates AI models on two fronts: their raw intelligence and their ability to deliver real-world economic value. With its commitment to quarterly question updates and its planned expansion into new professional categories, Xbench could become an influential benchmark in AI research.

FAQs

  • Q: What is Xbench?
    A: Xbench is a benchmarking system designed to assess the capabilities of AI models, focusing on both their raw intelligence and real-world applicability.
  • Q: How does Xbench assess raw intelligence?
    A: Xbench uses two components: Xbench-ScienceQA for academic knowledge and Xbench-DeepResearch for navigating the Chinese-language web and answering complex questions.
  • Q: What real-world tasks does Xbench assess?
    A: Currently, Xbench assesses tasks in recruitment and marketing, with plans to expand into finance, legal, accounting, and design.
  • Q: How often will Xbench be updated?
    A: The team plans to update the test questions once a quarter.
  • Q: What models currently rank highest in Xbench?
    A: ChatGPT-o3 ranks first in both the recruitment and marketing categories, with other models such as Perplexity Search, Claude, Grok, and Gemini also performing well.