This line of work proposes dynamic benchmarks that mimic broad human activities in order to evaluate AI general intelligence, overcoming the limitations of static benchmarks. It is relevant for the rigorous assessment of human-like AI capabilities.
The concept of using dynamic, game-based benchmarks to evaluate machine general intelligence is supported by several recent proposals that aim to overcome the limitations of static evaluation methods. These approaches leverage interactive and evolving environments to better assess AI systems' reasoning, adaptability, and conceptual understanding—capabilities essential for human-like performance.
One such initiative, gg-bench, introduces a data-generating process in which large language models (LLMs) first generate natural-language descriptions of novel two-player strategy games, then implement them as Gym environments, and finally train reinforcement learning (RL) agents via self-play on these games. Evaluation measures the LLMs' win rates against these trained RL agents when the models are prompted with the game rules, current state, and valid moves. Because new games can be generated indefinitely, the benchmark remains challenging and free from data contamination. Notably, even state-of-the-art models like GPT-4o and Claude 3.7 Sonnet achieve only 7–9% win rates using in-context learning, while specialized reasoning models reach 31–36%, highlighting the benchmark's rigor.
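The core of this pipeline is the final step: pitting a prompted LLM against a self-play-trained RL agent and measuring its win rate. The sketch below illustrates that evaluation loop on a toy take-away game, with a random policy standing in for the prompted LLM and the known optimal strategy standing in for the trained RL agent; all function names are illustrative, not gg-bench's actual API.

```python
import random

def legal_moves(stones):
    # In this toy take-away game, a move removes 1-3 stones.
    return [n for n in (1, 2, 3) if n <= stones]

def rl_policy(stones):
    # Stand-in for the self-play-trained RL agent: plays the known
    # optimal strategy (leave the opponent a multiple of 4).
    for n in legal_moves(stones):
        if (stones - n) % 4 == 0:
            return n
    return 1  # no winning move exists; take one stone

def llm_policy(stones, rng):
    # Stand-in for the LLM prompted with rules, state, and valid moves:
    # here just a uniform random choice over the legal moves.
    return rng.choice(legal_moves(stones))

def play_game(rng):
    """Return True if the 'LLM' (moving first) takes the last stone."""
    stones, llm_to_move = 10, True
    while stones > 0:
        move = llm_policy(stones, rng) if llm_to_move else rl_policy(stones)
        stones -= move
        if stones == 0:
            return llm_to_move
        llm_to_move = not llm_to_move

def win_rate(episodes=1000, seed=0):
    # The benchmark's headline metric: fraction of games the LLM wins.
    rng = random.Random(seed)
    return sum(play_game(rng) for _ in range(episodes)) / episodes
```

A weak stand-in policy loses to the trained opponent in almost every episode, which mirrors the low win rates gg-bench reports for in-context-learning models.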
Similarly, GAMEBoT (GAME Battle of Tactics) decomposes complex reasoning in games into modular subproblems and uses Chain-of-Thought (CoT) prompting guided by domain knowledge to assess both final actions and intermediate reasoning steps. It employs rule-based algorithms to generate ground truth for these subproblems, enabling transparent validation of reasoning processes. The framework includes head-to-head LLM competitions across eight strategic games such as Othello, Checkers, and Texas Hold’em, reducing risks of data contamination through dynamic gameplay.
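The distinctive mechanism here is scoring intermediate reasoning steps, not just the final move, against rule-based ground truth. A minimal sketch of that idea on a toy take-away game, assuming the model's CoT output has already been parsed into per-subproblem answers (the subproblem names and game are illustrative, not GAMEBoT's actual ones):

```python
def gt_valid_moves(stones):
    # Rule-based ground truth for the subproblem "list the legal moves"
    # in a toy game where a move removes 1-3 stones.
    return {n for n in (1, 2, 3) if n <= stones}

def gt_best_move(stones):
    # Ground truth for the subproblem "pick a winning move" (leave a
    # multiple of 4); None means no winning move exists.
    return next((n for n in sorted(gt_valid_moves(stones))
                 if (stones - n) % 4 == 0), None)

def score_reasoning(state, llm_steps):
    """Score each intermediate CoT step against rule-based ground truth.

    llm_steps is a dict like {"valid_moves": {...}, "best_move": ...},
    standing in for answers parsed out of the model's CoT output.
    Returns (fraction of correct steps, per-step verdicts).
    """
    checks = {
        "valid_moves": llm_steps.get("valid_moves") == gt_valid_moves(state),
        "best_move": llm_steps.get("best_move") == gt_best_move(state),
    }
    return sum(checks.values()) / len(checks), checks
```

Because every subproblem has a programmatic oracle, a wrong final move can be traced to the exact reasoning step that failed.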
Another framework, CK-Arena, builds upon the multiplayer game Undercover to evaluate conceptual reasoning in interactive settings. LLM-based agents assume roles as civilians or undercover players, describing concepts and detecting inconsistencies through dialogue. The system evaluates novelty, relevance, and reasonableness of statements, as well as player-level outcomes like win and survival rates. A variant called Undercover-Audience allows scalable assessment by involving audience agents who vote based on perceived coherence, reflecting real-world communication dynamics.
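The Undercover-Audience variant turns evaluation into a voting problem: audience agents each name a suspect, the top-voted player is eliminated, and the round outcome follows from whether the undercover player was caught. A minimal sketch of that aggregation, with all names hypothetical and ties broken by insertion order rather than any rule CK-Arena specifies:

```python
from collections import Counter

def tally_votes(votes):
    # votes: mapping of audience agent -> suspected player name.
    # Ties are broken arbitrarily (Counter keeps insertion order).
    counts = Counter(votes.values())
    top_player, _ = counts.most_common(1)[0]
    return top_player

def round_outcome(votes, undercover):
    # Civilians win the round if the eliminated player is the undercover.
    eliminated = tally_votes(votes)
    return {"eliminated": eliminated, "civilians_win": eliminated == undercover}
```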
BotzoneBench further advances scalability by selecting games from the Botzone platform based on diversity in game-theoretic properties (e.g., perfect vs. imperfect information), computational complexity, and the availability of graded AI baselines. It evaluates LLMs against classic AI agents across games like Chess, Mahjong, and Landlord, providing fine-grained performance anchoring. For example, Gemini3-Pro-Pre. achieves perfect scores in deterministic games like TicTacToe and Gomoku but shows variable performance in complex imperfect-information games like Mahjong.
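Graded baselines enable "performance anchoring": instead of a raw win rate, a model's level can be reported as the strongest baseline on a weakest-to-strongest ladder that it reliably beats. A sketch of one simple anchoring rule, assuming such a ladder and a win-rate threshold (both the baseline names and the rule are illustrative, not BotzoneBench's actual scoring):

```python
def anchored_level(results, ladder, threshold=0.5):
    """Return the strongest baseline the model beats at >= threshold win rate.

    results: {baseline_name: win_rate}; ladder: baselines ordered
    weakest -> strongest. Anchoring stops at the first rung not beaten,
    so the returned level is the last consecutive rung cleared.
    """
    level = None
    for name in ladder:
        if results.get(name, 0.0) >= threshold:
            level = name
        else:
            break
    return level
```

Anchoring a model to a named baseline ("plays at minimax-depth-3 level") is more interpretable across games than comparing raw win rates.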
Additionally, Minecraft Universe (MCU) offers a scalable framework for evaluating open-ended agents in Minecraft, featuring 3,452 composable tasks and a dynamic task-composition mechanism to sustain challenge over time. Its automated evaluation system aligns with human judgment over 90% of the time, making it a robust testbed for generalization in open-world environments.
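Two mechanisms carry this paragraph: composing atomic tasks into an ever-growing pool of novel challenges, and validating the automated evaluator against human judgment. Both can be sketched in a few lines; the atomic task names and composition scheme below are invented for illustration and are not MCU's actual task set.

```python
import itertools
import random

# Hypothetical atomic tasks; MCU's real task set is far larger.
ATOMIC_TASKS = ["mine oak log", "craft planks", "build shelter", "smelt iron"]

def compose_tasks(n, k=2, seed=0):
    # Dynamically compose novel tasks by chaining k atomic ones, so the
    # pool of distinct challenges keeps growing over time.
    rng = random.Random(seed)
    combos = list(itertools.combinations(ATOMIC_TASKS, k))
    return [" then ".join(combo) for combo in rng.sample(combos, n)]

def agreement_rate(auto_verdicts, human_verdicts):
    # Fraction of episodes where the automated evaluator's pass/fail
    # verdict matches the human label (MCU reports this above 90%).
    pairs = list(zip(auto_verdicts, human_verdicts))
    return sum(a == h for a, h in pairs) / len(pairs)
```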
These frameworks collectively reflect a shift toward more human-like evaluation paradigms, where adaptability, conceptual understanding, and strategic reasoning are tested in rich, interactive contexts rather than static question-answer formats. By mimicking the breadth and unpredictability of human activities, they provide a more rigorous and realistic assessment of machine general intelligence.
AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games addresses the critical limitation of static benchmarks in artificial intelligence research, specifically the issues of data contamination and saturation. As models increasingly achieve superhuman performance on fixed datasets, their ability to generalize remains difficult to assess. This paper proposes "AI Gamestore," a novel evaluation framework that utilizes a vast, diverse repository of human-designed games to serve as dynamic benchmarks. By treating these games as a continuous stream of novel challenges, the framework aims to test an agent's adaptability and reasoning capabilities in environments that mimic the breadth and unpredictability of real-world human activities.
The key contribution of this work is the creation of a scalable infrastructure that supports open-ended evaluation. Unlike traditional benchmarks that evaluate performance on a single, static task, AI Gamestore introduces a mechanism for assessing agents across a wide variety of game mechanics, rules, and objectives. This approach demands that agents demonstrate robust generalization and zero-shot learning capabilities rather than overfitting to specific datasets. The authors highlight how the diversity inherent in human games—ranging from logic puzzles to strategy simulations—provides a rigorous testbed for Machine General Intelligence (MGI), requiring agents to quickly understand and navigate new systems without extensive retraining.
This research matters significantly because it offers a more rigorous and sustainable path forward for evaluating the progression toward human-like AI. By shifting the focus from beating a specific high score to demonstrating competence across an ever-expanding array of tasks, AI Gamestore provides a more holistic metric for intelligence. It mitigates the "benchmark chasing" phenomenon and offers researchers a tool to measure the transferability of skills, which is a cornerstone of true general intelligence. Ultimately, this framework facilitates a deeper understanding of agent limitations and capabilities in complex, novel scenarios.
# AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
The paper AI Gamestore introduces a novel framework for evaluating machine general intelligence (MGI) by leveraging human-designed games as dynamic, open-ended benchmarks. Traditional AI benchmarks, such as static datasets or narrow task suites, often fail to capture the breadth and adaptability of human cognition, limiting their ability to assess true general intelligence. In contrast, the proposed approach repurposes existing human games—ranging from simple rule-based challenges to complex, emergent strategy games—as a scalable testbed. By designing APIs that interface with game environments, the framework enables AI systems to interact with a diverse set of games, providing a more realistic and adaptable evaluation metric. This mirrors the way humans learn and adapt across different domains, making it a promising alternative to fixed benchmarks.
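Interfacing a stream of heterogeneous games through a common API is what makes this kind of evaluation scalable. A minimal sketch of such an interface and a zero-shot evaluation harness, under the assumption that each repository game exposes reset/step semantics; the class and method names are illustrative, not the paper's actual API.

```python
from abc import ABC, abstractmethod

class GameEnv(ABC):
    """Hypothetical minimal interface a repository game would expose."""

    @abstractmethod
    def reset(self):
        """Start a new episode and return the initial state."""

    @abstractmethod
    def legal_actions(self, state):
        """Return the actions available in the given state."""

    @abstractmethod
    def step(self, state, action):
        """Apply an action; return (next_state, reward, done)."""

class CountdownGame(GameEnv):
    # Toy game: the state counts down from 3; reward 1.0 on completion.
    def reset(self):
        return 3

    def legal_actions(self, state):
        return [0]

    def step(self, state, action):
        state -= 1
        return state, (1.0 if state == 0 else 0.0), state == 0

def evaluate(agent, games, episodes=10):
    # Stream games past the agent zero-shot: no per-game training loop,
    # only the shared interface, so new games can be added indefinitely.
    scores = {}
    for game in games:
        total = 0.0
        for _ in range(episodes):
            state, done, reward = game.reset(), False, 0.0
            while not done:
                action = agent(game, state)
                state, reward, done = game.step(state, action)
            total += reward
        scores[type(game).__name__] = total / episodes
    return scores
```

Because the agent sees only the shared interface, adding a new game to the repository requires no changes to the evaluation harness, which is the scalability property the framework relies on.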
The key contributions of this work include: 1. A scalable benchmarking infrastructure that dynamically generates and evaluates AI performance across a wide range of games, reducing the need for manual curation. 2. Open-ended evaluation, where AI agents must generalize and adapt strategies rather than overfit to predefined tasks. 3. Human-like challenge alignment, as games inherently require planning, reasoning, and even creativity—traits critical for MGI.
This approach matters because it addresses a fundamental gap in AI evaluation: the inability of current benchmarks to measure progress toward artificial general intelligence (AGI). By leveraging games, the framework provides a more ecologically valid and interpretable way to assess AI capabilities, potentially accelerating research in areas like reinforcement learning, multi-task learning, and adaptive decision-making. The work is particularly relevant for researchers and practitioners focused on building AI systems that can perform broadly across domains, not just optimize for narrow metrics.