AI Benchmarking Controversy: Pokémon Edition
Recently, a viral post on X highlighted an informal showdown between Google's Gemini model and Anthropic's Claude model in the Pokémon games. Gemini reportedly pulled ahead of Claude by reaching Lavender Town on a developer's Twitch stream, while Claude remained stuck at Mount Moon.
"Gemini is currently ahead of Claude in Pokémon after reaching Lavender Town. Only 119 live views, incredibly underrated stream." — Jush (@Jush21e8), April 10, 2025 (pic.twitter.com/8AvSovAI4x)
However, it was later revealed that Gemini had an advantage: the developer behind the Gemini stream had built a custom minimap that helps the model identify game elements such as walls, doors, and NPCs, giving it a strategic edge over Claude, whose harness offered no comparable aid.
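To see why such a harness matters, consider a minimal sketch of what a minimap aid might look like. Everything here is hypothetical (the tile codes, the function names, and the toy room); real harnesses would read tile data from the emulator, but the idea is the same: the model receives a pre-labeled map instead of having to infer walls and exits from raw screenshots.

```python
from typing import List, Tuple

# Hypothetical tile codes: 0 = walkable, 1 = wall, 2 = door, 3 = NPC.
TILE_GLYPHS = {0: ".", 1: "#", 2: "D", 3: "N"}


def render_minimap(tiles: List[List[int]], player: Tuple[int, int]) -> str:
    """Render the tile grid as an ASCII minimap, marking the player's position."""
    rows = []
    for y, row in enumerate(tiles):
        chars = [
            "P" if (x, y) == player else TILE_GLYPHS.get(tile, "?")
            for x, tile in enumerate(row)
        ]
        rows.append("".join(chars))
    return "\n".join(rows)


def build_prompt(tiles: List[List[int]], player: Tuple[int, int]) -> str:
    """Prepend the minimap to the model's observation before asking for a move."""
    minimap = render_minimap(tiles, player)
    return (
        f"Minimap (P = you, # = wall, D = door, N = NPC):\n{minimap}\n\n"
        "What is your next move?"
    )


if __name__ == "__main__":
    # A toy 5x7 room with one door at the top; the player stands in the middle.
    room = [
        [1, 1, 1, 2, 1, 1, 1],
        [1, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 3, 1],
        [1, 0, 0, 0, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1],
    ]
    print(build_prompt(room, player=(3, 2)))
```

A model prompted this way can plan a route to the door directly, while a model working from screenshots alone must first work out where the walls are. That gap is an artifact of the harness, not of the models themselves.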
While Pokémon may not be the most serious AI benchmark, it is a useful example of how different implementations can skew results. Anthropic, for instance, has reported some of Claude's SWE-bench scores using a custom scaffold, and Meta drew criticism for submitting an experimental, chat-tuned version of Llama 4 Maverick to LM Arena; in both cases, implementation choices shaped the reported numbers.
As AI benchmarks continue to evolve, comparing models will only get harder: custom implementations and non-standard harnesses muddy apples-to-apples comparisons and obscure what a model can actually do on its own.