The benchmark game

Companies developing AI models are eager to showcase benchmark results that highlight their products’ superiority. A notable example is OpenAI, which claimed that GPT-5 performs better than its predecessor at declining to answer questions that cannot be decisively resolved. While such announcements build a narrative of technological progress, EU scientists warn that these tests do not fully reflect the models’ real capabilities.

The JRC report points out that many benchmarks focus on narrow, isolated tasks, and that strong results on them do not translate into performance in complex, real-world scenarios. Moreover, these tests are often closed, lack transparency, and can be manipulated to produce favorable outcomes.

This issue is critical in the context of EU AI law, where a model’s classification as “high risk” could rely on benchmark results. The European Commission has yet to clarify the requirements in delegated acts, leaving a gap in practical regulatory enforcement.

Transatlantic asymmetry

While the US launched a toolkit in August to help federal agencies evaluate AI models, the EU is still debating the criteria and methods for assessment. This raises the question: is Europe, aspiring to be a global technology regulator, falling behind in practical AI oversight?

The JRC emphasizes that benchmarks should measure real model capabilities, not just narrow, niche skills. They must be fully documented and transparent, clearly stating what was evaluated and how, and take into account cultural and linguistic diversity. This is particularly crucial in the EU, with its 24 official languages – models that perform well in English-language benchmarks may struggle in other linguistic and cultural contexts.

Expert voices and the risk of a “Brussels effect”

Risto Uuk from the Future of Life Institute argues that EU concerns are valid. He stresses that relying on “anecdotes and vibes” isn’t enough; rigorous, independent evaluations by external assessors are essential. He also highlights the need to fund the entire AI assessment ecosystem – from test labs to documentation standards.

If the EU develops robust benchmarks, it could trigger the so-called “Brussels effect” – European standards becoming a reference point even beyond the bloc’s borders. Achieving this, however, requires clear criteria and political determination, not just legal texts.

The Commission’s response: adequate or too late?

A European Commission spokesperson highlighted that the EU AI Office has “state-of-the-art model evaluation capabilities” and conducts internal analyses. They also pointed to the AI Code of Conduct and a €9 million tender announced in July 2025 to support technical evaluation of models. The key question remains whether these measures are merely reactive, and whether they can keep pace with the rapidly evolving market.

Today, AI tests are not only technical tools but also part of a regulatory and competitive game. Companies want to present their models in the best possible light, which, without independent verification, increases the risk of information asymmetry. On the other hand, excessive bureaucracy could stifle innovation and make it harder for European companies to compete with global giants.

The core dilemma

The tension is clear: how can the EU ensure rigorous risk assessment while still supporting technological development? Without answers, the field is left to private players and regulators outside Europe.
