Major tech firms, including ChatGPT parent OpenAI and Microsoft Corp., have begun creating internal benchmarks to better test their AI models' capabilities. However, this approach has raised concerns within the industry: without standardized public evaluations, it becomes difficult for businesses and consumers to assess advancements in AI technology, the Financial Times reports.
Ahmad Al-Dahle, the head of generative AI at Meta, highlighted to the Financial Times the difficulty in measuring the capabilities of the latest AI systems. This has prompted companies like Meta, OpenAI, and Microsoft to develop proprietary evaluation methods. However, this move has drawn criticism for limiting the ability to compare different AI technologies.
Traditional public benchmarks, such as HellaSwag and MMLU, use multiple-choice questions to test common sense and general knowledge. However, researchers argue that these methods no longer effectively gauge the reasoning capabilities of advanced AI models.
For instance, Mark Chen, senior vice president of Research at OpenAI, told the Financial Times that human-designed tests are increasingly inadequate for measuring the true capabilities of these sophisticated systems. As a result, there is a growing push within the industry to create more complex tests that better reflect real-world challenges.
The shift towards private benchmarks has sparked debate over the transparency of AI testing. Dan Hendrycks, executive director of the Center for AI Safety, told the Financial Times that with publicly available benchmarks, it becomes easier for businesses and the general public to understand the actual progress being made in AI. This lack of transparency may hinder efforts to accurately gauge how close AI models are to automating complex tasks.
Beyond internal benchmarks, external organizations have also started contributing to developing new evaluation methods. In September, Scale AI partnered with Hendrycks to launch "Humanity's Last Exam," a project that crowdsources complex questions from experts across various fields, requiring abstract reasoning.
Additionally, FrontierMath, a new benchmark designed by expert mathematicians, challenges even the most advanced models, with a completion rate of less than 2% on its most challenging questions.
Wedbush analyst Dan Ives projected $1 trillion in AI capital expenditure by U.S. tech giants like Microsoft, Meta, and Amazon.com Inc.
Price Action: MSFT stock is down 0.8% at $419.17 at last check Monday. META is down 1.36%.