Proposes dynamic, game-based benchmarks that probe general intelligence, in place of static, narrow AI tests. Crucial for AGI evaluation amid rapid AI progress.

Topological visualization of AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

The concept of using dynamic, game-based benchmarks to evaluate artificial general intelligence (AGI) has gained traction as a response to the limitations of static, narrow AI tests. Traditional benchmarks often suffer from data contamination and saturation, where models achieve near-perfect scores not due to improved reasoning but through exposure to training data that overlaps with test sets. To address this, researchers have proposed scalable, open-ended evaluation frameworks that leverage games, environments with clear rules and unambiguous success conditions, to assess general reasoning, strategic planning, and adaptability in AI systems.

One such approach is gg-bench, a benchmark that generates novel two-player strategy games using large language models (LLMs). These games are described in natural language and implemented as Gym environments, after which reinforcement learning (RL) agents are trained via self-play to establish strong baselines. LLMs are then evaluated by their winrate against these agents when prompted with game rules, board states, and valid moves. This data-generating process allows for continuous creation of new tasks, mitigating contamination risks and enabling difficulty scaling as models improve. Notably, even state-of-the-art models like GPT-4o and Claude 3.7 Sonnet achieve only 7–9% winrates under in-context learning, while reasoning-focused models such as DeepSeek-R1 and o1 reach 31–36%, highlighting the benchmark's challenge and discriminative power.
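The evaluation loop described above can be sketched in miniature. The following is not gg-bench's actual code: `NimEnv` stands in for a generated Gym environment (a toy Nim game), `random_agent` for the self-play RL baseline, and `heuristic_agent` for an LLM that is shown the rules, state, and valid moves. The benchmark metric is simply the candidate's winrate against the baseline over many games.

```python
import random

class NimEnv:
    """Toy two-player game in a Gym-like interface (hypothetical stand-in
    for a gg-bench-generated environment). Players alternately remove 1-3
    stones; whoever takes the last stone wins."""
    def __init__(self, stones=21):
        self.initial_stones = stones

    def reset(self):
        self.stones = self.initial_stones
        self.current_player = 0
        return self.stones

    def valid_moves(self):
        return list(range(1, min(3, self.stones) + 1))

    def step(self, move):
        assert move in self.valid_moves()
        self.stones -= move
        done = self.stones == 0
        winner = self.current_player if done else None
        self.current_player = 1 - self.current_player
        return self.stones, done, winner

def random_agent(env):
    # Stand-in for the self-play RL baseline.
    return random.choice(env.valid_moves())

def heuristic_agent(env):
    # Stand-in for a prompted LLM: here, the optimal Nim strategy of
    # leaving a multiple of 4 stones whenever possible.
    target = env.stones % 4
    return target if target in env.valid_moves() else random.choice(env.valid_moves())

def winrate(env, candidate, baseline, games=200, seed=0):
    """Fraction of games the candidate (moving first) wins vs the baseline."""
    random.seed(seed)
    wins = 0
    for _ in range(games):
        env.reset()
        done = False
        while not done:
            agent = candidate if env.current_player == 0 else baseline
            _, done, winner = env.step(agent(env))
        wins += winner == 0
    return wins / games
```

Because the data-generating process can emit arbitrarily many such environments, the same `winrate` harness scales to a continuously refreshed task pool without contamination.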

Similarly, Google DeepMind and Kaggle introduced the Kaggle Game Arena, an open-source platform where frontier AI models compete head-to-head in strategic games like chess. This arena provides a transparent, standardized environment with open-sourced game harnesses and an all-play-all tournament structure to ensure statistically robust rankings. The platform aims to evolve beyond chess to include Go, poker, and video games, testing long-horizon planning and dynamic adaptation: capabilities essential for AGI.
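An all-play-all (round-robin) structure is straightforward to express. This is a minimal sketch, not the Arena's implementation: `play_match` is a hypothetical callback that runs one match and returns the winner's name (or `None` for a draw), and models are ranked by total wins.

```python
from itertools import combinations

def round_robin_table(models, play_match, rounds=10):
    """All-play-all tournament: every pair of models plays `rounds`
    matches; models are ranked by total wins. `play_match(a, b)` returns
    the winner's name, or None for a draw."""
    wins = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        for _ in range(rounds):
            winner = play_match(a, b)
            if winner is not None:
                wins[winner] += 1
    return sorted(wins.items(), key=lambda kv: kv[1], reverse=True)

# Illustration with a deterministic toy matcher (lower name always wins):
ranking = round_robin_table(["model-a", "model-b", "model-c"],
                            lambda a, b: min(a, b), rounds=10)
```

Playing every pairing the same number of times is what gives the rankings their statistical robustness: no model's score depends on a lucky draw of opponents.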

Other initiatives reinforce this trend. Squid Game presents a multi-level evaluation framework that progresses from simple instruction following to adversarial interactions, including a final "jailbreak" challenge where models attempt to bypass safety constraints, an analog to real-world security threats. Meanwhile, text-based games like those in TextWorld and Jericho are being repurposed to stress-test LLMs in open-ended problem-solving scenarios, revealing persistent weaknesses in commonsense reasoning and long-term memory.

These efforts collectively reflect a shift toward dynamic, interactive, and adversarial evaluation paradigms. Games offer scalable difficulty, objective outcomes, and rich behavioral traces that allow for deep analysis of model reasoning. As AI systems advance, such benchmarks are crucial for distinguishing memorization from genuine generalization, which is key to measuring progress toward human-like general intelligence.

Generated Feb 22, 2026
Cerebras Thinking

AI Gamestore addresses a critical bottleneck in artificial general intelligence (AGI) research: the saturation of static benchmarks. Traditional evaluation datasets, once mastered by models, lose their discriminative power, leading to a cycle of constant replacement that hinders long-term progress tracking. This paper proposes a scalable alternative by treating the vast ecosystem of human-designed video games as a continuous, dynamic testbed. Rather than assessing models on a single, narrow task, the framework evaluates agents across a wide distribution of game genres, mechanics, and objectives. This approach demands that agents possess not just pattern recognition capabilities, but the ability to understand instructions, learn rules on the fly, and adapt strategies to novel environments, core characteristics of general intelligence.

The significance of this work lies in its alignment of evaluation with the complexities of open-ended intelligence. By utilizing human games, the benchmark leverages the inherent design logic that humans use to represent challenges, physics, and spatial reasoning. The authors demonstrate that performance across this diverse library provides a more robust metric for generalization than accuracy on a fixed dataset. This shift from static to dynamic evaluation is crucial for tracking progress toward AGI, as it prevents models from simply "hacking" the metric and instead requires genuine transfer learning and reasoning. Consequently, AI Gamestore offers a sustainable path forward for measuring machine intelligence as it scales beyond human capability in narrow domains.

Generated Mar 4, 2026
Open-Weights Reasoning

Summary of AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

The paper AI Gamestore introduces a novel benchmarking framework for evaluating machine general intelligence (MGI) by leveraging human-designed games as dynamic, open-ended challenges. Unlike traditional static benchmarks (e.g., MNIST, ImageNet), which test narrow capabilities, AI Gamestore proposes a scalable evaluation system where AI agents must adapt to diverse, user-generated games—mirroring the complexity and unpredictability of human intelligence. The approach emphasizes emergent behaviors, generalization across tasks, and interactive learning, addressing limitations of fixed benchmarks that fail to capture AGI's evolving capabilities.

The key contributions include:

1. Dynamic, User-Curated Challenges: Games are sourced from platforms like Board Game Arena and Steam, ensuring a vast, evolving testbed that resists overfitting.
2. Open-Ended Evaluation Metrics: Measures adaptability, creativity, and strategic reasoning rather than predefined task completion.
3. Scalability: The framework supports both small-scale experiments and large-scale AGI assessments, aligning with rapid AI progress.
4. Human-AI Collaboration: Games serve as a natural interface for human feedback, enabling iterative improvement of AGI systems.
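One way such open-ended, multi-game metrics could be aggregated is to normalize each game's raw score between a floor (e.g. random play) and a ceiling (e.g. human play), then average across the library. This is a hypothetical aggregate for illustration; the paper's own metrics may be defined differently.

```python
def generalization_score(per_game_scores, baselines, ceilings):
    """Average of per-game scores, each normalized to [0, 1] between a
    baseline (random play) and a ceiling (human play). A hypothetical
    aggregate metric, not necessarily the paper's definition."""
    normed = []
    for game, score in per_game_scores.items():
        lo, hi = baselines[game], ceilings[game]
        normed.append(max(0.0, min(1.0, (score - lo) / (hi - lo))))
    return sum(normed) / len(normed)
```

Normalizing per game keeps any single genre's score scale from dominating the aggregate, which matters when the testbed spans games with very different reward magnitudes.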

This work is significant because it addresses a critical gap in AGI evaluation—the lack of benchmarks that evolve alongside AI capabilities. Static tests risk becoming obsolete as models surpass them, while AI Gamestore’s dynamic nature ensures relevance. By grounding evaluation in human-created challenges, the framework also bridges the gap between technical performance and real-world intelligence, making it a promising step toward robust AGI assessment.

Source: [arXiv:2602.17594](https://arxiv.org/abs/2602.17594)

Generated Mar 4, 2026