This work emphasizes developing LLMs that can cooperate and compete effectively in multi-agent systems, a step toward more advanced machine intelligence.
MARS is an end-to-end reinforcement learning (RL) framework designed to enhance the multi-agent reasoning capabilities of large language models (LLMs) through self-play in both cooperative and competitive strategic games. The framework addresses key challenges in multi-turn, multi-agent environments, such as long-horizon credit assignment and agent-specific advantage estimation, by building upon Group-Relative Policy Optimization (GRPO).
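For intuition, GRPO computes advantages by standardizing each rollout's reward against the other rollouts sampled for the same prompt, replacing a learned value baseline with group statistics. A minimal sketch of that group-relative baseline (the function name is illustrative, not from the paper):

```python
import statistics

def grpo_advantages(group_rewards):
    """Standardize each rollout's reward against its sampling group.

    Rollouts that beat the group mean receive a positive advantage;
    rollouts below it receive a negative one.
    """
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in group_rewards]

# Four rollouts of the same prompt: two wins (reward 1), two losses (reward 0).
print(grpo_advantages([1, 0, 1, 0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Plain GRPO assigns one advantage per whole trajectory, which is exactly the limitation the paper's turn-level and agent-specific extensions address.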
MARS introduces two key innovations: a turn-level advantage estimator that enables fine-grained credit assignment by attributing long-term outcomes to individual actions across multiple turns and agents, and an agent-specific advantage normalization that stabilizes training by calibrating advantage estimates relative to each agent's performance, accounting for heterogeneous roles in multi-agent systems. These techniques allow the model to develop robust strategic abilities through self-play.
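A hedged sketch of how these two pieces might fit together. The paper's exact estimator is not reproduced here; the function names and the discounted return-to-go formulation are illustrative assumptions:

```python
import numpy as np

def turn_level_advantages(turn_rewards, gamma=1.0):
    """Attribute a trajectory's long-term outcome back to each turn.

    Illustrative turn-level credit assignment: each turn's raw advantage
    is its discounted return-to-go, so early actions are credited with
    the rewards they eventually lead to.
    """
    returns = np.zeros(len(turn_rewards))
    g = 0.0
    for t in reversed(range(len(turn_rewards))):
        g = turn_rewards[t] + gamma * g
        returns[t] = g
    return returns

def agent_normalized(advantages_by_agent, eps=1e-8):
    """Calibrate advantages per agent rather than across the whole group.

    Each agent's advantages are standardized against that agent's own
    mean and std, so heterogeneous roles do not distort one another's
    learning signal.
    """
    return {
        agent: (np.asarray(a, float) - np.mean(a)) / (np.std(a) + eps)
        for agent, a in advantages_by_agent.items()
    }

# A 3-turn game where only the final turn is rewarded: credit flows back.
print(turn_level_advantages([0.0, 0.0, 1.0]))  # → [1. 1. 1.]
```

The per-agent normalization matters when roles are asymmetric: an agent whose role yields systematically lower raw rewards still gets a balanced positive/negative signal relative to its own baseline.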
The MARS agent, trained from Qwen3-4B, demonstrates strong strategic performance in held-out games, achieving up to a 28.7% improvement. More importantly, the reasoning skills acquired through self-play generalize to real-world multi-agent systems such as AutoGen and MAD, yielding performance gains of up to 10.0% on AIME and 6.6% on GPQA-Diamond, with an average improvement of 3.5% across benchmarks. This establishes self-play in strategic games as a scalable and effective paradigm for cultivating generalizable multi-agent reasoning in LLMs. The framework's code and models are publicly available.
*MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games* introduces a novel framework designed to bridge the gap between static large language model (LLM) capabilities and the dynamic requirements of advanced multi-agent systems. The research focuses on enhancing the reasoning faculties of LLMs by situating them within competitive and cooperative strategic game environments. Rather than relying solely on pre-trained knowledge or supervised fine-tuning, MARS employs a self-play mechanism where agents interact autonomously, iterating through gameplay cycles to refine their strategies. This process forces the models to move beyond simple pattern matching, requiring them to develop long-term planning, adaptability, and the ability to anticipate the actions of other intelligent agents.
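To make "iterating through gameplay cycles" concrete, here is a deliberately tiny self-play loop on rock-paper-scissors. It is a toy stand-in, not the paper's environments or update rule: both sides sample from the same evolving policy, and the winner's action is reinforced after each round.

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def sample(policy):
    # sample an action in proportion to its current weight
    return random.choices(ACTIONS, weights=[policy[a] for a in ACTIONS])[0]

def self_play_round(policy, lr=0.1):
    # both players are the SAME policy; the winning action gains weight
    a, b = sample(policy), sample(policy)
    if BEATS[a] == b:
        policy[a] += lr
    elif BEATS[b] == a:
        policy[b] += lr

random.seed(0)
policy = {a: 1.0 for a in ACTIONS}
for _ in range(300):
    self_play_round(policy)
print(policy)
```

In MARS proper, the "policy" is the LLM itself, moves are generated text, and updates come from the GRPO-based objective rather than this tabular bump; the toy only illustrates the closed loop of play, outcome, and policy update that self-play relies on.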
The study's key contribution lies in its methodology for reinforcing LLM policies through the outcomes of these self-play interactions. By treating strategic gameplay as a curriculum, the framework allows agents to learn from both success and failure, effectively turning the multi-agent environment into a reasoning engine. The paper demonstrates that this approach leads to emergent behaviors where agents develop complex coordination protocols and deceptive strategies necessary for winning. Crucially, the insights suggest that multi-agent dynamics can serve as a powerful, scalable signal for improving LLM reasoning, outperforming traditional single-agent prompting methods in complex, zero-sum, and mixed-motive scenarios.
This work is significant because it addresses a critical limitation of current foundation models: their struggle with sustained, goal-oriented interaction in dynamic settings. By validating that LLMs can effectively learn and adapt through self-play, MARS provides a viable pathway toward creating more autonomous and robust AI agents. The implications extend beyond games, offering potential applications in real-world domains requiring high-level negotiation, resource management, and collaborative problem-solving, ultimately pushing the frontier toward artificial general intelligence capable of sophisticated social reasoning.
This paper introduces MARS, a framework designed to enhance the strategic reasoning capabilities of large language models (LLMs) in multi-agent settings. The core idea is to refine LLMs through self-play in competitive and cooperative strategic games, where agents must balance coordination, deception, and long-term planning. By training models to iterate over rounds of play, refining their strategies based on past interactions, the approach simulates the dynamics of human-like reasoning in adversarial and collaborative environments. The method leverages end-to-end reinforcement learning with self-play optimization, allowing the LLM to improve its decision-making without relying solely on explicit rule-based supervision.
The paper's key contributions include:

1. A scalable self-play training paradigm that enables LLMs to develop nuanced strategic behaviors by interacting with past versions of themselves.
2. Empirical validation showing that MARS-trained models outperform baselines (e.g., fine-tuned LLMs without self-play) in both competitive and cooperative tasks, demonstrating improved reasoning under uncertainty.
3. Insights into emergent behaviors, such as bluffing, alliance formation, and adaptive planning, which suggest that self-play can unlock more human-like intelligence in AI systems.
Why it matters: As multi-agent systems become increasingly critical in domains like AI alignment, cybersecurity, and autonomous coordination, the ability of LLMs to reason strategically in dynamic environments is paramount. MARS provides a foundational approach to training models that can generalize across diverse strategic scenarios, potentially advancing AI systems toward more robust and socially intelligent agents. The work also highlights the importance of game-theoretic training as a bridge between single-agent instruction tuning and complex multi-agent interactions.
Source: [arXiv:2510.15414v1](https://arxiv.org/html/2510.15414v1)