Describes a hierarchical multi-agent system that combines RL fine-tuning with VRP CoT prompting, separating high-level meta-thinking agents from low-level reasoning agents, and evaluates it on benchmarks such as MATH500, GSM8K, and AIME.
The paper "LLM-Based Multi-Agent Systems for Mathematical Problem Solving: A Comprehensive Literature Review [v1]" describes hierarchical multi-agent architectures that employ reinforcement learning (RL) fine-tuning and Verifier Reward-guided Process (VRP) Chain-of-Thought (CoT) prompting to enhance mathematical reasoning. These systems are evaluated on benchmarks such as MATH500, GSM8K, AIME, AMC23, and GaoKao2023En, reflecting their focus on complex arithmetic and competition-level mathematical problem solving.
One such framework separates cognitive functions into high-level meta-thinking agents and low-level reasoning agents within a hierarchical multi-agent reinforcement learning (MARL) setup. The meta-thinking agent is responsible for strategic oversight, including plan proposal and progress monitoring, while the low-level reasoning agent handles detailed execution of reasoning steps. This separation aims to elicit metacognition by distinguishing strategic oversight from execution, addressing issues like the "lazy agent" problem commonly found in hierarchical RL setups.
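The meta-thinking / reasoning split described above can be sketched as a simple control loop. All class and method names below are hypothetical; the summary does not specify the framework's actual interfaces, and a real system would replace the placeholder bodies with LLM calls.

```python
from dataclasses import dataclass, field


@dataclass
class MetaThinkingAgent:
    """High-level agent: proposes a plan and monitors progress."""

    def propose_plan(self, problem: str) -> list[str]:
        # Placeholder decomposition; a real system would query an LLM here.
        return [f"Step {i + 1} of solving: {problem}" for i in range(2)]

    def monitor(self, step: str, result: str) -> bool:
        # Accept any non-empty result; a real monitor would verify the step.
        return bool(result)


@dataclass
class ReasoningAgent:
    """Low-level agent: executes individual reasoning steps."""

    trace: list[str] = field(default_factory=list)

    def execute(self, step: str) -> str:
        result = f"executed({step})"
        self.trace.append(result)
        return result


def solve(problem: str) -> list[str]:
    """Forward pass: the meta agent plans, the reasoning agent executes."""
    meta, worker = MetaThinkingAgent(), ReasoningAgent()
    for step in meta.propose_plan(problem):
        result = worker.execute(step)
        if not meta.monitor(step, result):
            break  # meta agent halts execution on a failed step
    return worker.trace
```

The point of the sketch is the division of labor: strategic oversight (planning, monitoring) never touches step execution, which is what the "lazy agent" mitigation relies on.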
The system uses models such as Qwen2.5-7B-Instruct and Llama3-8B-Instruct, with training conducted via multi-turn Group Relative Policy Optimization (GRPO) combined with causal influence estimation and verifiable rewards to improve credit assignment and deliberation. During inference, dynamic task decomposition occurs in a forward phase, while adaptive role refinement takes place in a backward phase, supported by textual backpropagation that refines prompts and coordination based on error signals.
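At the heart of GRPO is a group-relative advantage: each sampled completion's verifiable reward is normalized against the mean and standard deviation of its sampling group, so no learned value model is needed. The sketch below illustrates that computation under the simplifying assumption of an exact-match reward; function names are hypothetical.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, gold: str) -> float:
    """1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0


def group_relative_advantages(answers: list[str], gold: str) -> list[float]:
    """Normalize each completion's reward against its sampling group."""
    rewards = [verifiable_reward(a, gold) for a in answers]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]
```

For example, a group of four sampled answers `["4", "5", "4", "3"]` against the reference `"4"` yields advantages `[1.0, -1.0, 1.0, -1.0]`: correct completions are pushed up, incorrect ones down, relative to the group.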
Performance results show high accuracy on GSM8K, with frameworks like MARS-PO achieving up to 95.82% accuracy using Qwen2.5-Math-7B-Instruct, and Dr. MAMR reaching 92.12% with Qwen2.5-7B-Instruct. On the MATH500 benchmark, ReMA achieves 74.40% accuracy with the same model, while ILR (Group 3) reaches 78.00%. These results highlight the effectiveness of role specialization and RL-based training in improving mathematical reasoning performance.
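Accuracies like those above are typically computed as exact-match rates over final answers. A minimal harness sketch, assuming whitespace-normalized string comparison (the individual papers may use more elaborate answer extraction):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions whose stripped text equals the reference."""
    assert len(predictions) == len(references), "one prediction per reference"
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```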
This manuscript provides a comprehensive examination of the intersection between Large Language Models (LLMs) and multi-agent systems (MAS) specifically tailored for complex mathematical problem solving. At its core, the research introduces and evaluates a hierarchical multi-agent framework designed to mimic human cognitive structures by separating high-level planning from low-level execution. The architecture utilizes distinct agent roles: high-level "meta-thinking" agents responsible for decomposing complex mathematical queries and strategizing, and low-level "reasoning" agents focused on detailed execution and calculation. To enhance performance, the system integrates Reinforcement Learning (RL) fine-tuning with the VRP Chain-of-Thought (CoT) prompting technique, aiming to optimize reasoning trajectories and reduce hallucination in logical steps.
The study validates the efficacy of this hierarchical approach through rigorous evaluation on standard mathematical reasoning benchmarks, including MATH500, GSM8K, and the challenging AIME competition problems. By leveraging the division of labor between meta-cognitive planning and granular reasoning, the proposed system demonstrates significant improvements in accuracy and robustness over conventional single-model approaches. The review component of the paper further contextualizes this contribution within the broader literature, analyzing how agentic workflows, collaborative reasoning, and feedback loops can be systematically applied to solve tasks requiring multi-step logic and symbolic manipulation.
This work matters because it addresses a fundamental limitation of current LLMs: the difficulty in maintaining coherence and accuracy over extended reasoning chains. By moving away from monolithic models toward structured, hierarchical multi-agent systems, the research offers a scalable path toward solving problems that require both abstract strategy and precise calculation. The insights gained from RL fine-tuning and VRP CoT prompting provide a blueprint for developing more reliable AI systems in scientific domains, logic verification, and complex decision-making scenarios where error propagation is a critical failure mode.
This paper presents a hierarchical multi-agent system designed to enhance mathematical problem-solving using large language models (LLMs). The architecture consists of high-level meta-thinking agents that decompose problems into sub-tasks and low-level reasoning agents that execute step-by-step computations. The system incorporates reinforcement learning (RL) fine-tuning to optimize agent coordination and employs VRP CoT prompting to improve reasoning clarity. Evaluations on standardized benchmarks including MATH500, GSM8K, and AIME demonstrate competitive performance, highlighting the effectiveness of hierarchical decomposition and adaptive reasoning strategies.
The key contributions include:

1. Modular Agent Design: Separation of meta-reasoning and execution improves scalability and adaptability to complex problems.
2. RL-Enhanced Coordination: Fine-tuning agent interactions via RL reduces redundant computations and enhances solution efficiency.
3. Explainable Prompting: VRP CoT provides interpretable reasoning traces, bridging the gap between black-box LLM outputs and human-readable logic.
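Contribution 3 hinges on reasoning traces a human can audit. A minimal sketch of the kind of step-numbered trace a CoT-style prompt elicits; the exact trace format is hypothetical, not taken from the paper:

```python
def format_trace(steps: list[str], answer: str) -> str:
    """Render reasoning steps and a final answer as an auditable trace."""
    lines = [f"Step {i}: {step}" for i, step in enumerate(steps, start=1)]
    lines.append(f"Final answer: {answer}")
    return "\n".join(lines)
```

Numbering each step lets a verifier or human reviewer point at the precise step where a derivation goes wrong, which is what makes such traces "interpretable" in practice.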
This work matters because it advances autonomous mathematical reasoning by leveraging multi-agent collaboration, offering a scalable approach to tackling problems beyond the capacity of single-model solutions. The insights could extend to domains requiring structured problem decomposition, such as formal verification or scientific discovery.
Source: [Preprints.org](https://www.preprints.org/manuscript/202512.1105)