Analyzes lazy agents in multi-agent LLM frameworks and promotes online RL for balanced contributions.

Topological visualization of Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation
Brave API

The paper "Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation" identifies a critical issue in multi-agent large language model (LLM) frameworks known as "lazy agent" behavior, where one agent—typically the reasoning agent—contributes minimally while another, such as the meta-thinking agent, dominates the reasoning process, effectively collapsing the system into a single-agent setup. This phenomenon undermines collaboration and limits the potential benefits of multi-agent systems despite their promising performance on complex reasoning tasks.

The authors provide a theoretical analysis of multi-turn Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) objective commonly used in such frameworks, and identify a structural bias in its loss formulation. Specifically, the $$1/T_i$$ normalization term, where $$T_i$$ is the number of turns in trajectory $$i$$, implicitly incentivizes shorter reasoning trajectories by favoring continuations that minimize interaction steps, even when longer trajectories could yield better reasoning outcomes. This bias leads agents to adopt shortcut behaviors, contributing only trivially to the process, especially as training progresses.
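To make the bias concrete, here is a minimal, hypothetical sketch (not the paper's code): under $$1/T_i$$ normalization, each turn of a short trajectory carries more weight in the aggregated loss than a turn of a long trajectory, so gradient signal concentrates on short rollouts. The `grpo_loss` helper and the toy trajectories below are illustrative assumptions.

```python
def grpo_loss(per_turn_losses, normalize_by_turns):
    """Aggregate per-turn losses over a group of trajectories.

    With normalize_by_turns=True, each trajectory's per-turn losses are
    scaled by 1/T_i, mimicking the turn-normalized GRPO objective; with
    False, every turn contributes with equal weight (the "debiased" form).
    """
    total = 0.0
    for turns in per_turn_losses:  # one list of per-turn losses per trajectory
        T_i = len(turns)
        weight = 1.0 / T_i if normalize_by_turns else 1.0
        total += weight * sum(turns)
    return total / len(per_turn_losses)

# Two trajectories with identical per-turn loss of 1.0:
short_traj = [1.0] * 2  # 2 turns: each turn weighted 1/2 under normalization
long_traj = [1.0] * 8   # 8 turns: each turn weighted only 1/8

biased = grpo_loss([short_traj, long_traj], normalize_by_turns=True)
debiased = grpo_loss([short_traj, long_traj], normalize_by_turns=False)
```

Under normalization, each short-trajectory turn carries four times the weight of a long-trajectory turn (1/2 vs. 1/8), which is the mechanism by which updates favor behaviors that end the interaction early.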

To address this, the paper introduces Dr. MAMR (Multi-Agent Meta-Reasoning Done Right), a framework designed to promote balanced contributions through three key components. First, it removes the $$1/T_i$$ normalization term in the GRPO objective to mitigate the bias toward shorter rollouts, a change referred to as "normalization debias". Second, it proposes a Shapley-inspired causal influence measurement that evaluates the contribution of each reasoning step by grouping semantically similar steps across multiple rollouts and averaging their influence scores, thus providing a more stable and robust estimate of agent contribution during online RL training. The causal influence of a step $$s_{j,t'}$$ on the next step is measured as:

$$ \Delta\ell_{j,t'} \triangleq \log \pi_\theta(s_{j,t'+1} | h^{(j)\setminus t'}_{\le t'}) - \log \pi_\theta(s_{j,t'+1} | h^{(j)}_{\le t'}) $$

and the final influence score is averaged over a group of similar steps.
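The two-evaluation ablation and the grouped averaging can be sketched as follows. This is an assumed implementation, not the paper's: `log_prob(next_step, context)` stands in for the policy's log-likelihood of a step given a context, and the grouping of semantically similar steps is taken as given.

```python
def step_influence(log_prob, history, t):
    """Delta-ell for step t, following the paper's definition:
    log-prob of step t+1 with step t ablated from the context,
    minus log-prob of step t+1 given the full context."""
    next_step = history[t + 1]
    full_ctx = history[: t + 1]   # history up to and including step t
    ablated_ctx = history[:t]     # same history with step t removed
    return log_prob(next_step, ablated_ctx) - log_prob(next_step, full_ctx)

def grouped_influence(influences_by_group):
    """Average raw influence scores within each group of semantically
    similar steps, yielding the stabilized per-group estimate."""
    return {g: sum(v) / len(v) for g, v in influences_by_group.items()}
```

Averaging over a group trades per-step precision for variance reduction, which is what makes the estimate usable as a training signal during online RL.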

Third, to prevent the reasoning agent from getting trapped in noisy or misleading early responses during prolonged interactions, Dr. MAMR introduces a verifiable reward mechanism that encourages deliberation by allowing the agent to discard prior outputs and restart its reasoning process. A special control token $$\texttt{<restart>}$$ is used, and a restart reward $$r_{\text{restart}_{i,t}}$$ is defined based on whether the restart improves the model's confidence in reaching a correct final answer, measured via causal influence on the final step:

$$ r_{\text{restart}_{i,t}} = \begin{cases} +1, & \text{if } (z_i = +1 \land \Delta\ell_{i,t} > 0) \text{ or } (z_i = -1 \land \Delta\ell_{i,t} < 0) \\ -1, & \text{if } (z_i = +1 \land \Delta\ell_{i,t} < 0) \text{ or } (z_i = -1 \land \Delta\ell_{i,t} > 0) \\ 0, & \text{if } \Delta\ell_{i,t} = 0 \end{cases} $$

where $$z_i$$ is the binary outcome reward (+1 for correct, -1 for incorrect).
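The piecewise definition reduces to a sign-agreement check between the outcome and the influence shift, which can be sketched directly (an illustrative rendering, not the paper's code):

```python
def restart_reward(z, delta_ell):
    """Restart reward per the piecewise definition: +1 when the sign of
    the confidence shift delta_ell agrees with the binary outcome z
    (restart pushed the model toward its actual result), -1 when the
    signs disagree, and 0 when the restart had no measurable effect."""
    if delta_ell == 0:
        return 0
    return 1 if (z > 0) == (delta_ell > 0) else -1
```

For example, a restart that raises confidence on a trajectory that ends up correct (`z = +1`, `delta_ell > 0`) earns +1, while the same confidence gain on an incorrect trajectory earns -1.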

These components are combined into an aggregated step-level advantage function:

$$ A_{\text{step}_{i,t}} = \tilde{A}_{i,t} + \alpha \tilde{C}_{i,t} + \beta \tilde{R}_{i,t} $$

which integrates outcome-based advantage, causal influence, and restart signals for training.

Experiments on mathematical reasoning benchmarks such as MATH500, GSM8K, and AIME24 using Qwen2.5 models (3B, 7B, 14B) show that Dr. MAMR outperforms baselines like ReMA and single-agent GRPO, achieving up to 58.43% Pass@1 accuracy with the 7B model compared to ReMA's 51.97%, while maintaining balanced causal influence between agents and improving training stability. Ablation studies confirm that each component contributes meaningfully to performance gains.
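The aggregated step-level advantage above is a weighted sum of three pre-normalized signals; a minimal sketch, with arbitrary example values for the coefficients $$\alpha$$ and $$\beta$$ (the paper's actual settings are not stated here):

```python
def step_advantage(a_outcome, c_influence, r_restart, alpha=0.5, beta=0.5):
    """A_step = normalized outcome advantage
             + alpha * normalized causal-influence score
             + beta  * normalized restart signal.
    All three inputs are assumed to already be normalized."""
    return a_outcome + alpha * c_influence + beta * r_restart
```

The weighting lets the outcome advantage remain the primary signal while the causal-influence and restart terms shape per-step credit assignment between the two agents.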

Overall, the work highlights the importance of careful reward design and credit assignment in online RL for multi-agent LLM systems, offering a principled approach to fostering cooperation and mitigating imbalance in agentic reasoning frameworks.

Generated Feb 22, 2026
Cerebras Thinking

This research investigates the phenomenon of "lazy agents" within Large Language Model (LLM) multi-agent frameworks, specifically addressing the issue of social loafing where individual agents reduce their effort in collaborative settings. The authors demonstrate that without explicit incentives, agents tend to rely on their peers to correct errors or generate solutions, leading to sub-optimal collective reasoning performance. By analyzing this behavior, the paper highlights a critical bottleneck in current multi-agent systems: the assumption that simply connecting multiple models will automatically yield better results through diversity, which is often negated by passive participation.

To counteract this tendency, the paper proposes a novel approach utilizing Online Reinforcement Learning (RL) to train agents for active deliberation rather than passive consensus. Unlike static prompting methods, the online RL framework dynamically rewards agents based on their individual contributions to the solution, effectively penalizing "free-riding" behaviors. This mechanism encourages agents to engage in deeper reasoning and critical evaluation of their peers' outputs, ensuring a more balanced and robust collaborative process. The key contribution lies in formulating a training environment where the reward structure aligns individual agent utility with the overall quality of the multi-agent system's output.

The significance of this work lies in its potential to unlock the true reasoning capabilities of multi-agent LLM architectures. By mitigating the "lazy agent" problem, the proposed method ensures that computational resources are utilized efficiently, with each agent actively contributing unique insights rather than redundantly validating superficial information. This shift from static agent orchestration to learned, incentive-driven collaboration has profound implications for complex problem-solving tasks, suggesting that robust multi-agent performance requires not just architectural design, but also behavioral optimization through reinforcement learning.

Generated 29d ago
Open-Weights Reasoning

# Summary: Unlocking the Power of Multi-Agent LLM for Reasoning

This paper investigates the dynamics of multi-agent large language model (LLM) frameworks, particularly focusing on the phenomenon of "lazy agents"—where certain agents contribute minimally while others shoulder the reasoning burden. The authors argue that such imbalance undermines the collaborative potential of multi-agent systems, as lazy agents fail to leverage their unique perspectives, leading to suboptimal outcomes. To address this, the paper introduces online reinforcement learning (RL) as a mechanism to incentivize balanced contributions, ensuring that all agents actively participate in deliberation and reasoning tasks.

The key contributions include:

1. **Diagnosing the Lazy Agent Problem:** The paper formalizes how lazy agents emerge in multi-agent LLM setups, often due to misaligned incentives or poor coordination.
2. **Online RL for Dynamic Adaptation:** By integrating RL, the system can dynamically adjust rewards to encourage participation, improving both efficiency and reasoning depth.
3. **Empirical Validation:** Experiments demonstrate that RL-enhanced multi-agent systems outperform static or unbalanced baselines, particularly in complex reasoning tasks requiring diverse input.

This work matters because it addresses a critical challenge in scalable multi-agent AI systems, where collaboration is essential but often hindered by free-riding behaviors. By proposing a practical RL-based solution, the paper advances the state of the art in decentralized reasoning frameworks, with implications for applications in collaborative problem-solving, decision-making, and emergent AI coordination. The findings are particularly relevant for researchers and practitioners working on multi-agent LLM architectures, adversarial reasoning, and AI alignment.

Generated 29d ago