Measuring AI Performance
The large AI companies (e.g., OpenAI, Anthropic, DeepSeek, and others) are partly motivated by a desire to achieve AGI (Artificial General Intelligence). AGI is supposed to be as intelligent as humans at performing information tasks; OpenAI defines it as “highly autonomous systems that outperform humans at most economically valuable work.” Such tasks can involve performing innovative research, designing new devices, and solving engineering challenges. How will we know when we have achieved AGI? How do we measure who is leading in the race to develop it?
The business of measuring the performance of systems, whether intelligent systems or engineered machines, is itself a huge technical challenge. Assessing AGI is especially difficult because the definition of AGI is so murky.
A recent paper by OpenAI researchers provides an assessment tool called GDPval, which uses tasks drawn from 44 occupations in the business sectors that contribute most to US GDP. The method is described in the paper and in an accompanying article. Similar to the Turing Test, the tasks are evaluated by how well the AI system performs relative to a human professional in each of the respective occupations, as judged by a panel of independent humans. A successful AGI system should perform at or above the level of the professionals on most of the tasks, as judged by those independent observers.
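As a rough illustration of the scoring logic behind such a blinded pairwise comparison (a sketch under assumptions, not the actual GDPval code; the verdict labels and counts below are invented), consider the following Python:

    # Hypothetical sketch of scoring a blinded pairwise evaluation.
    # Each judge sees two anonymized deliverables for a task (one from the AI,
    # one from a human professional) and records a preference or a tie.
    from collections import Counter

    def score_pairwise(verdicts):
        """verdicts: list of 'ai', 'human', or 'tie' from blinded judges.
        Returns the AI's win rate and its wins-plus-ties rate."""
        counts = Counter(verdicts)
        total = len(verdicts)
        win_rate = counts["ai"] / total
        win_or_tie_rate = (counts["ai"] + counts["tie"]) / total
        return win_rate, win_or_tie_rate

    # Example with made-up numbers: 220 judged tasks across many occupations.
    verdicts = ["ai"] * 70 + ["tie"] * 25 + ["human"] * 125
    win, win_or_tie = score_pairwise(verdicts)
    print(f"win rate: {win:.1%}, wins + ties: {win_or_tie:.1%}")
    # A system at parity with the professionals would hover around 50%.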
What is remarkable about OpenAI’s work is that in their evaluation of competing AI systems, OpenAI’s systems did not get the top rating! This certainly lends credence to the validity of the approach.
But the headline result is that the systems are improving over time, and that ChatGPT-5 greatly outperforms ChatGPT-4. However, every system evaluated is at best approaching a 50% win rate when its solutions are compared, in a blind comparison judged by independent experts, against solutions produced by industry professionals. One would expect a score of about 50% if an AI system were as good as the professionals, since the independent experts would then effectively be choosing at random. The fact that the judged wins plus “ties” for the AI systems’ solutions fall in the 38% to 48% range says that developers are still below parity in terms of AGI capabilities, and definitely not ahead.
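To make the parity argument concrete: if judges truly could not distinguish AI output from professional output, each blinded verdict would be close to a coin flip, so the expected wins-plus-ties share would be about 50%. The following sketch (the counts are invented, not taken from the GDPval results) shows one way to check whether an observed rate is significantly below that benchmark:

    # Sketch: is an observed wins-plus-ties rate significantly below the 50%
    # expected under parity? Counts are illustrative, not from the GDPval paper.
    from scipy.stats import binomtest

    successes = 96   # hypothetical tasks where the AI won or tied
    trials = 220     # hypothetical number of judged tasks
    rate = successes / trials  # about 43.6%, inside the 38% to 48% range
    result = binomtest(successes, trials, p=0.5, alternative="less")
    print(f"observed rate: {rate:.1%}, p-value vs. parity: {result.pvalue:.3f}")
    # A small p-value suggests the system is genuinely below parity,
    # not merely unlucky in a finite sample of tasks.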
The researchers hope eventually to replace the independent judges with an AI system, but that approach is not yet fully successful. Still, the technology for comparing the general intelligence a computer exhibits on these tasks with that of human professionals is making progress.
