Not even Pokémon is exempt from AI benchmarking controversies. A recent post on X drew significant attention by claiming that Google’s latest Gemini model had outpaced Anthropic’s flagship Claude model in the original Pokémon video game trilogy. According to the post, Gemini had reached Lavender Town in a developer’s Twitch stream, while Claude remained stuck at Mount Moon as of late February.
What the post didn’t mention is that Gemini had an advantage. Reddit users pointed out that the developer running the Gemini stream had built a custom minimap that helps the model identify in-game “tiles,” such as cuttable trees, reducing how often Gemini needs to analyze screenshots before making gameplay decisions.
Although Pokémon is hardly a rigorous test of a model’s capabilities, it illustrates how different implementations of a benchmark can sway the results. Anthropic, for example, reported two scores for its latest Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark intended to assess a model’s coding abilities: Claude 3.7 Sonnet attained 62.3% accuracy, but reached 70.3% when using a “custom scaffold” developed by Anthropic.
Similarly, Meta fine-tuned a version of its new Llama 4 Maverick model to perform well on one particular benchmark, LM Arena; the vanilla version of the model scored significantly lower on the same evaluation.
AI benchmarks, Pokémon included, are imperfect measures to begin with, and custom, non-standard implementations make comparing models even murkier. As new models continue to arrive, assessing and comparing them seems unlikely to get any easier.