The benchmark game
Companies developing AI models are eager to showcase benchmark results that highlight their products’ superiority. A notable example is OpenAI, which claimed that GPT-5 is better than its predecessor at declining to answer questions that cannot be decisively resolved. While such announcements build a narrative of technological progress, EU scientists warn that these tests do not fully reflect the models’ real capabilities.
The JRC report points out that many benchmarks focus on narrow, isolated tasks, so strong scores do not translate into performance in complex, real-world scenarios. Moreover, the tests are often closed, lack transparency, and can be gamed to produce favorable outcomes.
This issue is critical in the context of the EU’s AI Act, under which a model’s classification as “high risk” could hinge on benchmark results. The European Commission has yet to spell out the relevant requirements in delegated acts, leaving a gap in practical regulatory enforcement.