Covers hierarchical multi-agent setups evaluated on math benchmarks (MATH500 and others) using CoT prompting and RL fine-tuning.
The preprint "LLM-Based Multi-Agent Systems for Mathematical Problem Solving: A Comprehensive Literature Review" (v1) surveys multi-agent architectures designed to enhance complex mathematical reasoning through hierarchical setups, Chain-of-Thought (CoT) prompting, and Reinforcement Learning (RL) fine-tuning. The study evaluates these systems on rigorous benchmarks such as MATH500, GSM8K, AIME, AMC23, GaoKao2023En, Minerva Math, and OlympiadBench, which assess multi-step symbolic and logical reasoning capabilities.
One key focus is hierarchical multi-agent reinforcement learning (MARL) frameworks that mitigate issues like the "lazy agent" problem, where certain agents contribute minimally in collaborative settings. For instance, the ReMA and ILR (group 3) frameworks use Qwen2.5-14B-Instruct with multi-turn Group Relative Policy Optimization (GRPO), causal influence measures, and verifiable rewards to improve agent coordination and reasoning quality. These systems employ a high-level "meta-thinking" agent responsible for strategic planning and a low-level reasoning agent for execution, separated by role-specific prompts that elicit metacognition.
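The meta-thinking/reasoning split described above can be sketched as a minimal two-level loop. The prompt templates and the `call_llm` stub below are illustrative assumptions so the sketch runs standalone, not the paper's actual implementation:

```python
# Sketch of a two-level agent loop: a high-level "meta-thinking" agent
# drafts a plan, and a low-level reasoning agent executes it.
# `call_llm` is a hypothetical stand-in for any chat-completion client.

META_PROMPT = (
    "You are a meta-thinking agent. Do not solve the problem. "
    "Produce a short numbered plan for solving it.\n\nProblem: {problem}"
)
SOLVER_PROMPT = (
    "You are a reasoning agent. Follow the plan step by step and "
    "end with 'Answer: <value>'.\n\nProblem: {problem}\nPlan:\n{plan}"
)

def call_llm(prompt: str) -> str:
    # Canned responses so the sketch is runnable; swap in a real model.
    if "meta-thinking" in prompt:
        return "1. Translate the words into an equation.\n2. Solve it."
    return "2x = 10, so x = 5. Answer: 5"

def solve_hierarchically(problem: str) -> str:
    plan = call_llm(META_PROMPT.format(problem=problem))
    solution = call_llm(SOLVER_PROMPT.format(problem=problem, plan=plan))
    # Keep only the final answer line, as an outcome verifier would.
    return solution.rsplit("Answer:", 1)[-1].strip()

print(solve_hierarchically("Twice a number is 10. Find the number."))  # prints "5"
```

The role-specific prompts are what separate the two levels: the same underlying model plays both roles, but the meta-level prompt forbids direct solving, which is the mechanism the reviewed frameworks use to elicit metacognition.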
The review highlights that CoT-based Verifiable Reward Prompting (VRP) is commonly used during inference, while RL fine-tuning enhances performance through feedback from outcome-verification signals. On the MATH500 benchmark, frameworks like SIER achieve a Pass@8 score of 93.0 using Qwen2.5-7B-Instruct and Qwen2.5-Math-PRM-72B, whereas ILR (group 3) achieves 82.60% accuracy with Qwen2.5-14B-Instruct.
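Pass@k figures like the Pass@8 score above are conventionally computed with the unbiased estimator 1 - C(n-c, k)/C(n, k) over n samples with c correct; assuming the review follows that convention (it may compute the metric differently), the estimator is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: probability that at least one of k
    draws (without replacement) from n samples, c of them correct,
    is a correct sample."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 8 samples per problem, 1 correct, evaluated at k = 4:
print(pass_at_k(8, 1, 4))  # prints 0.5
# At k = n, any correct sample guarantees a pass:
print(pass_at_k(8, 3, 8))  # prints 1.0
```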
For GSM8K, DiMo (Logical Mode) achieves 98.4% accuracy using Qwen2.5-32B, demonstrating the effectiveness of larger models in hierarchical agent setups. Other notable results include LbMAS at 96.05% accuracy with Qwen2.5-72B-Instruct and MA ToT reaching 94.8% with Llama3.1-70B.
The paper also identifies persistent challenges such as agent homogeneity, where agents based on similar LLMs produce redundant reasoning, and the scalability of coordination strategies across increasingly complex tasks. It emphasizes the evolution from unstructured debate-based systems to structured, self-optimizing hierarchical architectures that support dynamic task decomposition and adaptive role refinement through textual backpropagation and forward/backward optimization phases.
This manuscript provides a comprehensive examination of the emerging paradigm of LLM-based multi-agent systems (MAS) applied to complex mathematical reasoning. It surveys the current state-of-the-art architectures that move beyond single-model inference, focusing on collaborative frameworks where specialized agents interact to solve problems. The review evaluates performance against rigorous benchmarks such as MATH500, analyzing how these distributed systems handle the logical rigor and multi-step deduction required in advanced mathematics compared to standalone models.
A key contribution of the work is its detailed taxonomy of hierarchical multi-agent setups, which often employ structures like manager-worker or proposer-verifier configurations to decompose difficult tasks. The text elucidates the synergy between Chain-of-Thought (CoT) prompting strategies and Reinforcement Learning (RL) fine-tuning within these collaborative environments. It highlights how RL fine-tuning optimizes agent policies for long-term reasoning goals, while CoT provides the necessary transparency in intermediate steps, resulting in significant performance uplifts over standard prompting baselines.
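A proposer-verifier configuration of the kind mentioned above can be sketched as a critique-and-retry loop; `propose` and `verify` below are hypothetical canned stand-ins for the two agents, not an implementation from the reviewed systems:

```python
# Proposer-verifier sketch: the proposer drafts an answer, the verifier
# accepts or rejects it, and a rejection's critique is fed back into the
# next proposal round.

def propose(problem: str, feedback: str = "") -> str:
    # Stand-in for a CoT-prompted proposer model that reads critiques.
    return "7" if "must be odd" in feedback else "8"

def verify(problem: str, answer: str) -> tuple[bool, str]:
    # Stand-in for a verifier agent; real systems check each CoT step.
    return (int(answer) % 2 == 1, "answer must be odd")

def propose_and_verify(problem: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        answer = propose(problem, feedback)
        ok, critique = verify(problem, answer)
        if ok:
            return answer
        feedback = f"Previous attempt rejected: {critique}"
    return answer  # best effort after exhausting the round budget
```

The design choice this illustrates is the decomposition itself: long-horizon reasoning quality comes from the verifier's feedback signal rather than from a single, longer generation.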
This review matters because it addresses the inherent limitations of monolithic LLMs regarding hallucination and logical consistency in STEM domains. By systematically categorizing multi-agent approaches, the authors offer a roadmap for researchers designing more robust and scalable reasoning systems. The insights provided are crucial for advancing AI capabilities in formal theorem proving and complex calculation, signaling a critical shift toward architectures that leverage both the generative power of LLMs and the structural reliability of collaborative agent workflows.
This paper presents a comprehensive literature review on the use of Large Language Model (LLM)-based multi-agent systems for mathematical problem solving, with a focus on benchmarking datasets like MATH500, miniF2F, and GSM8K. The review examines hierarchical multi-agent architectures, where specialized agents (e.g., problem decomposers, solvers, and verifiers) collaborate to tackle complex mathematical tasks. Key techniques discussed include Chain-of-Thought (CoT) prompting for reasoning, reinforcement learning (RL) fine-tuning to optimize agent interactions, and self-consistency mechanisms to improve robustness. The paper also highlights challenges such as agent coordination overhead, scalability trade-offs, and interpretability in multi-agent setups.
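Self-consistency, one of the mechanisms listed above, reduces to a majority vote over the final answers of independently sampled reasoning chains; a minimal sketch:

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    """Majority vote over final answers extracted from independently
    sampled CoT runs; divergent chains are simply outvoted."""
    return Counter(final_answers).most_common(1)[0][0]

# Five sampled chains for the same problem:
print(self_consistency(["42", "42", "41", "42", "40"]))  # prints "42"
```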
The core contribution of this work lies in its systematic synthesis of emerging trends in LLM-based multi-agent mathematical reasoning, identifying gaps where current approaches fall short—particularly in generalization across problem domains and adaptive strategy selection. By comparing different multi-agent frameworks, the review underscores the potential of hybrid reasoning paradigms (e.g., combining symbolic and neural methods) to outperform single-agent LLMs. This matters for both AI research (guiding future architectures) and practical applications (e.g., automated tutoring, financial modeling), where reliable mathematical reasoning is critical. The paper serves as a valuable resource for researchers exploring decentralized AI systems and collaborative problem-solving in high-stakes domains.
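The hybrid symbolic-neural idea can be illustrated by checking a model-proposed answer with exact symbolic arithmetic; `check_root` below is a hypothetical example using rational arithmetic, not a method taken from the paper:

```python
from fractions import Fraction

def check_root(coeffs: list[int], candidate: str) -> bool:
    """Symbolically verify an LLM-proposed root of a polynomial using
    exact rational arithmetic, avoiding floating-point error.
    coeffs[i] is the coefficient of x**i."""
    x = Fraction(candidate)
    value = sum(Fraction(c) * x**i for i, c in enumerate(coeffs))
    return value == 0

# Does x = 2/3 solve 3x - 2 = 0?
print(check_root([-2, 3], "2/3"))  # prints True
print(check_root([-2, 3], "1"))    # prints False
```

The point of the hybrid pairing is that the neural component proposes freely while the symbolic component vetoes cheaply and exactly, which is why such pipelines can outperform single-agent LLMs on verifiable tasks.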
Why It Matters:
- Provides a roadmap for designing multi-agent LLM systems in math-intensive tasks.
- Highlights limitations of current benchmarks (e.g., bias toward certain problem types).
- Advocates for modular, explainable architectures to enhance trust and scalability.
- Bridges theory and application, offering actionable insights for developers and theorists alike.
For further details, see the full preprint: [Preprints.org (2025)](https://www.preprints.org/manuscript/202512.1105).