
MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs
Brave API

MAR achieves a 47% exact match (EM) accuracy on the HotPotQA benchmark, outperforming both the ReAct baseline at 32% EM and the Reflexion+ReAct baseline at 44% EM. The authors attribute the relatively modest 3-point gain over Reflexion to limitations in the EM metric, which penalizes semantically correct answers for minor surface-level differences such as formatting, punctuation, ordering, or valid synonym usage. They note that more forgiving metrics like F1 or semantic-match scoring would likely reveal stronger improvements in reasoning quality.
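The EM-versus-F1 distinction can be made concrete. The sketch below assumes the standard SQuAD-style answer normalization commonly used for HotPotQA scoring (lowercasing, stripping punctuation and articles); it shows how EM gives zero credit to an answer that differs only by surface tokens, while token-level F1 gives partial credit:

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (the standard SQuAD-style answer normalization)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> int:
    # All-or-nothing: any surface difference that survives
    # normalization scores zero.
    return int(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    # Harmonic mean of token precision and recall, so partially
    # overlapping answers still earn credit.
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# A semantically reasonable answer with one extra surface token:
pred, gold = "the Apollo 11 mission", "Apollo 11"
print(exact_match(pred, gold))          # 0 -- EM penalizes the extra token
print(round(token_f1(pred, gold), 2))   # 0.8 -- F1 gives partial credit
```

This is exactly the failure mode the authors point to: the prediction above is arguably correct, yet EM scores it identically to a wholly wrong answer.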

Generated 27d ago
Cerebras Thinking

This paper introduces MAR (Multi-Agent Reflexion), a framework designed to enhance the reasoning capabilities of Large Language Models by integrating the Reflexion learning paradigm with a multi-agent architecture. Unlike standard ReAct or single-agent Reflexion approaches, MAR leverages a collaborative environment where multiple agents engage in reasoning tasks while utilizing self-reflection mechanisms to correct errors and refine strategies. The study specifically evaluates this approach on HotPotQA, a dataset known for requiring multi-hop reasoning to answer complex questions.

The key findings demonstrate that MAR achieves an Exact Match (EM) score of 47% on the HotPotQA benchmark, outperforming both the standard ReAct baseline (32% EM) and the single-agent Reflexion+ReAct method (44% EM). While the results confirm that distributing reasoning and reflection across multiple agents yields performance improvements, the authors observe that the gains over the single-agent Reflexion baseline are modest. They attribute this limited delta to the constraints of the EM metric, which may not fully capture the nuances of improved reasoning paths or partial correctness in complex problem-solving scenarios.

This research is significant as it probes the boundaries of agentic workflows, suggesting that while multi-agent collaboration offers a viable path to better reasoning, the marginal returns highlight potential bottlenecks in current evaluation metrics or the saturation of existing prompting techniques. By critically analyzing the EM metric's limitations, the paper provides valuable insight into the difficulty of measuring true reasoning proficiency, urging the community to consider more granular evaluation methods for advanced agentic systems.

Open-Weights Reasoning

# Summary of "MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs"

This paper introduces MAR (Multi-Agent Reflexion), a novel framework designed to enhance the reasoning capabilities of large language models (LLMs) by leveraging multi-agent collaboration and self-reflection. The approach builds upon prior work like ReAct and Reflexion+ReAct, which combine reasoning and acting in a single agent, but extends it by introducing multiple specialized agents that interact, critique, and refine each other’s outputs. The authors evaluate MAR on the HotPotQA benchmark, a complex question-answering task requiring multi-hop reasoning, and demonstrate that it achieves a 47% exact match (EM) score, outperforming ReAct (32% EM) and Reflexion+ReAct (44% EM).

The key insight of MAR lies in its decentralized, iterative refinement process, where agents assume distinct roles (e.g., planner, executor, critic) and iteratively improve reasoning chains. The paper highlights that while the performance gains are modest (3 EM points over Reflexion+ReAct), this is partly due to the limitations of the EM metric, which does not fully capture the complexity of multi-step reasoning. The authors suggest that more nuanced evaluation metrics—such as those assessing intermediate reasoning steps—could better reflect the improvements enabled by multi-agent collaboration. This work is significant because it demonstrates that diversifying agent roles and enabling peer review-like interactions can lead to more robust and interpretable reasoning in LLMs, paving the way for future advancements in collaborative AI systems.
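The planner/executor/critic loop described above can be reduced to an outline. The snippet below is a hypothetical sketch, not the paper's actual implementation: the role prompts, the generic `llm` callable, the reflection memory, and the "OK"-based stopping rule are all illustrative assumptions.

```python
from typing import Callable

def mar_round(question: str, llm: Callable[[str], str],
              max_iters: int = 3) -> str:
    """Illustrative planner/executor/critic loop with Reflexion-style
    memory. `llm` stands in for any text-completion call."""
    reflections: list[str] = []  # episodic memory shared across iterations
    answer = ""
    for _ in range(max_iters):
        # Planner: propose a reasoning plan, conditioned on past critiques.
        plan = llm(f"Plan steps to answer: {question}\n"
                   f"Past reflections: {reflections}")
        # Executor: carry out the plan and produce a candidate answer.
        answer = llm(f"Execute this plan and answer: {plan}")
        # Critic: accept the answer or explain what went wrong.
        critique = llm(f"Critique the answer '{answer}' to '{question}'. "
                       "Reply OK if correct, else explain the error.")
        if critique.strip().startswith("OK"):
            break
        reflections.append(critique)  # feed the error back to the planner
    return answer

# Usage with a trivial stub in place of a real model:
def stub_llm(prompt: str) -> str:
    if prompt.startswith("Plan"):
        return "look up both entities, then compare"
    if prompt.startswith("Execute"):
        return "42"
    return "OK"

print(mar_round("example question", stub_llm))  # prints "42"
```

The design point this illustrates is the one the summaries emphasize: reflection is distributed across roles, so the critic's feedback, rather than a single agent's self-assessment, drives the next planning step.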

Why it matters: MAR contributes to the growing body of research on agent-based AI augmentation, showing that structured teamwork among specialized agents can enhance model performance beyond single-agent refinement. This approach aligns with broader trends in AI, such as ensemble methods and human-AI collaboration, and could inspire further innovations in automated reasoning, decision-making, and explainable AI. The paper’s critique of evaluation metrics also underscores the need for more sophisticated benchmarks to accurately measure reasoning improvements in LLMs.

Source: [arXiv:2512.20845](https://arxiv.org/html/2512.20845)
