Introduces MATTRL, injecting structured textual experience into multi-agent deliberation at inference time.


The paper "Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning" introduces Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation during inference without requiring weight updates. MATTRL addresses key challenges in traditional multi-agent reinforcement learning (MARL), such as non-stationarity from co-adapting agents and sparse, high-variance rewards, by maintaining fixed policies and enriching the reasoning process with dense, turn-level textual feedback.

The framework forms a team of specialized agents that engage in multi-turn discussions, retrieving relevant experiences from a dynamically constructed experience pool to inform their reasoning. These experiences are distilled from high-scoring utterances using credit assignment mechanisms that combine individual performance signals and shared rewards, enabling more effective collaboration. A coordinator agent then synthesizes the discussion and produces the final decision once consensus is reached or a turn limit is exceeded.
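The turn-level credit assignment described above can be sketched as a simple blend of an utterance's individual score with the shared team reward. The linear rule, the weight `alpha`, and the function names below are illustrative assumptions for exposition, not the paper's exact formulation:

```python
# Sketch of turn-level credit assignment for distilling experiences.
# The linear blend and the `alpha` weight are assumed, not from the paper.

def turn_credit(individual_score: float, shared_reward: float,
                alpha: float = 0.5) -> float:
    """Credit for one utterance: weighted mix of its own quality signal
    and the shared team reward."""
    return alpha * individual_score + (1.0 - alpha) * shared_reward

def select_experiences(utterances, shared_reward, alpha=0.5, threshold=0.6):
    """Keep only high-credit utterances as candidates for distillation
    into the experience pool."""
    return [
        text
        for text, score in utterances
        if turn_credit(score, shared_reward, alpha) >= threshold
    ]

# Example: with a team reward of 0.8, two of three utterances clear the bar.
utts = [("use modular arithmetic", 0.9),
        ("guess randomly", 0.1),
        ("check parity first", 0.7)]
print(select_experiences(utts, shared_reward=0.8))
# -> ['use modular arithmetic', 'check parity first']
```

Blending the two signals this way lets an agent be rewarded for a locally strong contribution even when the team fails, and vice versa, which is one plausible reading of combining "individual performance signals and shared rewards."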

Experiments across medicine, math, and education benchmarks show that MATTRL improves accuracy by an average of 3.67% over multi-agent baselines and by 8.67% over single-agent baselines. In mathematical reasoning, for instance, MATTRL achieves an exact-match accuracy of 0.36, representing a 33% relative improvement over single-agent methods. Educational applications show even more pronounced gains, with MATTRL nearly doubling the learning improvement compared to single-agent teaching approaches.
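As a quick sanity check of the quoted math-reasoning gain: if an exact-match accuracy of 0.36 is a 33% relative improvement, the implied single-agent baseline is about 0.27 (the baseline figure is inferred here, not stated in the source):

```python
# Back-of-envelope check: 0.36 at a 33% relative improvement implies a
# single-agent baseline near 0.27. (0.27 is inferred, not quoted.)
mattrl_acc = 0.36
implied_baseline = mattrl_acc / 1.33
print(round(implied_baseline, 2))  # -> 0.27
```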

Ablation studies analyze different credit-assignment schemes and their impact on performance, offering practical insights for system design. The framework uses GPT-5 for agent reasoning and summarization, Qwen3-Embedding-4B for embeddings, and FAISS indexing for efficient retrieval of past experiences. By avoiding fine-tuning, MATTRL preserves the general capabilities of the underlying LLMs while enabling robust adaptation to distribution shifts.
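The retrieval path can be pictured with a minimal sketch. The real system uses Qwen3-Embedding-4B vectors in a FAISS index; here a toy deterministic embedder and brute-force inner product stand in so the example runs anywhere, and with FAISS the same lookup would be `faiss.IndexFlatIP(dim)` followed by `index.search(query_vec, k)`:

```python
# Minimal sketch of embedding-based experience retrieval.
# The embedder below is a toy stand-in, NOT the paper's model.
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic unit-norm embedding (illustrative only)."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

class ExperiencePool:
    """Append-only store of distilled textual experiences."""
    def __init__(self):
        self.texts: list[str] = []
        self.vecs: list[np.ndarray] = []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vecs.append(embed(text))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Inner product on unit vectors equals cosine similarity.
        sims = np.stack(self.vecs) @ embed(query)
        top = np.argsort(-sims)[:k]
        return [self.texts[i] for i in top]
```

In the full system the pool is constructed dynamically during deliberation, so each agent's retrieval reflects experiences distilled from earlier turns as well as prior tasks.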

An extension of MATTRL includes adaptive routing between single-agent and multi-agent modes based on task characteristics, further improving average accuracy by 10% over single-agent baselines. This highlights the complementary strengths of both approaches: single-agent systems excel in standardized cases with concentrated evidence, while MATTRL outperforms when cross-validation or multi-specialty reasoning is required. Overall, MATTRL provides a stable, efficient, and interpretable path for enhancing collaborative reasoning in LLM-driven agents at test time.
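The routing extension can be pictured as a simple dispatcher. The feature names and the rule itself are assumptions for exposition, not the paper's actual routing policy:

```python
# Illustrative router between single-agent and multi-agent modes.
# Feature names and thresholds are assumed, not from the paper.

def route(task_features: dict) -> str:
    """Send standardized, evidence-concentrated tasks to a single agent;
    send tasks needing cross-validation or multiple specialties to
    multi-agent deliberation."""
    if task_features.get("needs_cross_validation", False):
        return "multi_agent"
    if task_features.get("num_specialties", 1) > 1:
        return "multi_agent"
    return "single_agent"

print(route({"num_specialties": 3}))             # -> multi_agent
print(route({"needs_cross_validation": False}))  # -> single_agent
```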

Generated Mar 7, 2026
Cerebras Thinking

This paper introduces MATTRL (Multi-Agent Test-Time Reinforcement Learning), a novel framework designed to enhance complex reasoning capabilities in Large Language Models (LLMs) without requiring parameter updates. Instead of relying solely on static prompting or single-pass inference, MATTRL employs a collaborative multi-agent system that learns and adapts dynamically during the inference phase. The core mechanism involves injecting structured textual experience into the deliberation process, allowing agents to iteratively refine their reasoning paths. By applying a reinforcement learning paradigm at test time, the agents explore different solution trajectories and refine subsequent turns based on intermediate textual feedback while model parameters stay fixed, effectively "practicing" the problem before arriving at a final conclusion.

A key contribution of this work is the formalization of test-time reinforcement learning in a multi-agent context, where agents share knowledge and critique each other's outputs to converge on higher-quality answers. The "structured textual experience" component acts as a form of episodic memory, enabling the system to leverage successful reasoning patterns from past steps or external knowledge bases to guide current decision-making. This approach is significant because it addresses the limitations of context-window constraints and the static nature of few-shot prompting. By turning inference into a learning process, MATTRL achieves improved performance on challenging reasoning benchmarks, offering a scalable alternative to computationally expensive fine-tuning while maintaining the flexibility to adapt to novel tasks dynamically.

Generated 29d ago
Open-Weights Reasoning

Summary: Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

This paper introduces MATTRL (Multi-Agent Test-Time Reinforcement Learning), a novel framework that enhances multi-agent collaboration by injecting structured textual experiences into the deliberation process at inference time. Unlike traditional approaches that rely solely on pre-trained policies or fixed interaction protocols, MATTRL leverages reinforcement learning (RL) to dynamically refine agent behaviors during deployment. By encoding past interactions and external knowledge into textual prompts, the system enables agents to adapt their strategies in real time, improving reasoning and decision-making in complex, collaborative tasks. The key innovation lies in the use of test-time RL, where agents continuously refine their behavior through accumulated textual experience and interactive feedback, with the underlying model weights left unchanged, rather than relying on offline training alone.

The paper's contributions are twofold: first, it demonstrates how structured textual experiences can serve as a bridge between static knowledge and dynamic adaptation, allowing agents to generalize better across diverse scenarios. Second, it shows that this approach outperforms baseline methods in tasks requiring multi-agent coordination, evaluated on benchmarks spanning medicine, mathematics, and education. The work is particularly relevant for domains where environments are non-stationary or where human-like reasoning is desirable, such as autonomous systems, robotics, or AI-assisted decision-making. By blending RL with textual reasoning at inference time, MATTRL opens new avenues for building more flexible and interpretable multi-agent systems.

Why it matters: This research addresses a critical gap in multi-agent systems—how to maintain adaptability and reasoning without extensive retraining. By shifting RL to test-time and incorporating textual context, MATTRL enables agents to handle novel situations more effectively, a key step toward scalable, human-aligned AI collaboration. The implications span from improving AI-assisted workflows to advancing autonomous systems that can reason and adapt in real-world environments.

Generated 29d ago