Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning

Wu, Wenyi; Zhu, Sibo; Zhou, Kun; Huang, Biwei

Wenyi Wu^*, Sibo Zhu^*, Kun Zhou^†, Biwei Huang

franciskunzhou@gmail.com

^*Equal contribution ^†Corresponding author

Framework Overview

Multi-agent Framework: We decompose GUI automation into a Planner (high-level step-by-step planning), an Actor (low-level execution: click, type, scroll, etc.), and a Memory manager (retrieval and use of past experience). Because planning is the dominant factor, we introduce planner-centric RL that optimizes only the planner with trajectory-level rewards from a VLM-as-judge, while keeping the actor and memory frozen.
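
The planner-only credit assignment described above can be illustrated with a minimal sketch. It assumes a trajectory is a list of role-tagged steps and that the judge returns one scalar per trajectory. The names `Step`, `planner_loss_weights`, and `group_advantages` are hypothetical, and the mean baseline is an illustrative choice, not necessarily the update rule used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    role: str  # "planner", "actor", or "memory"
    text: str  # message produced by that role at this step


# Assumption for illustration: only the planner receives a learning signal.
TRAINABLE_ROLES = {"planner"}


def group_advantages(judge_scores: List[float]) -> List[float]:
    """Center trajectory-level judge scores across rollouts of the same task."""
    mean = sum(judge_scores) / len(judge_scores)
    return [score - mean for score in judge_scores]


def planner_loss_weights(trajectory: List[Step], advantage: float) -> List[float]:
    """Broadcast one trajectory-level advantage to planner steps.

    Steps from frozen roles get weight 0.0, so they contribute no gradient.
    """
    return [advantage if step.role in TRAINABLE_ROLES else 0.0 for step in trajectory]


trajectory = [
    Step("memory", "retrieved a similar past task"),
    Step("planner", "subgoal: search"),
    Step("actor", "type into search bar"),
    Step("planner", "subgoal: apply price filter"),
]
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])  # four rollouts, judged 1 or 0
weights = planner_loss_weights(trajectory, advantages[0])
```

In an actual training loop these weights would multiply the per-step log-probability terms of the planner's outputs. That part depends on the RL library and is omitted here.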

Abstract

Language model (LM)-based agents have demonstrated promising capabilities in automating complex tasks from natural language instructions, yet they continue to struggle with long-horizon planning and reasoning. We address this challenge by proposing a human-like multi-agent framework that decomposes automation into a planner, an actor, and a memory manager, enabling structured decision-making across extended interaction horizons. Empirical results show that this decomposition substantially improves performance on complex GUI tasks and is also effective in other scenarios such as tool use. Viewing the system through a compute-allocation lens, we further analyze the necessity of model capacity across agent roles and find that planning is the dominant factor, while execution and memory can be handled by significantly smaller models with minimal performance loss. Motivated by this observation, we introduce a planner-centric reinforcement learning approach that exclusively optimizes the planner using trajectory-level rewards from a VLM-as-judge, while keeping other agents frozen. Extensive experiments demonstrate that concentrating model capacity and learning on high-level planning yields robust and compute-efficient improvements in long-horizon agent automation. Our code is publicly released.

Case Studies

Amazon — Planner-Centric Multi-Agent (Success)

Unlike the single-agent baseline described below, our framework demonstrates structured reasoning:

Decomposition: The Planner initially decomposes the user query into a search action followed by a refinement phase.
Dynamic Adjustment: After the Actor executes the initial search, the Planner analyzes the observation. Recognizing that the results are mixed, it updates the plan to explicitly target the "Price" filter.
Precise Execution: The Planner issues a subgoal to "Select the price range $50 - $100". The Actor, relieved of the burden of long-horizon planning, focuses solely on locating and clicking the correct element ID (element_id: 45).
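
The three stages above amount to a closed loop in which the Planner re-reads each new observation before issuing the next subgoal. The following is a minimal sketch of such a loop, with toy stand-ins for both roles. `run_episode`, `toy_planner`, `toy_actor`, and the observation strings are hypothetical names chosen for illustration. The default step limit of 15 mirrors the limit mentioned in the baseline case study.

```python
from typing import Callable, List, Optional

# A planner maps (observation, executed subgoals) to the next subgoal, or None when done.
Planner = Callable[[str, List[str]], Optional[str]]
# An actor executes one subgoal and returns the resulting observation.
Actor = Callable[[str, str], str]


def run_episode(planner: Planner, actor: Actor, observation: str,
                max_steps: int = 15) -> List[str]:
    """Alternate planning and acting until the planner stops or the limit is hit."""
    executed: List[str] = []
    for _ in range(max_steps):
        subgoal = planner(observation, executed)
        if subgoal is None:  # planner judges the task complete
            break
        observation = actor(subgoal, observation)
        executed.append(subgoal)
    return executed


def toy_planner(observation: str, executed: List[str]) -> Optional[str]:
    """Rule-based stand-in for the Planner: replans from the latest observation."""
    if observation == "home page":
        return "Search for the product"
    if observation == "mixed results":
        return "Select the price range $50 - $100"
    return None


def toy_actor(subgoal: str, observation: str) -> str:
    """Stand-in for the Actor: pretends to execute and reports the new page state."""
    if subgoal.startswith("Search"):
        return "mixed results"
    return "filtered results"


subgoals = run_episode(toy_planner, toy_actor, "home page")
print(subgoals)
```

Keeping the Actor's interface to a single subgoal per call is what lets it ignore long-horizon state. Only the Planner sees the list of executed subgoals.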

Amazon — Single-Agent Baseline (Failure)

The monolithic agent attempts to solve the task by relying solely on search queries. It types the full query into the search bar but fails to leverage the sidebar filters to narrow down the results. Faced with a noisy results page containing items outside the $50–$100 range, the agent struggles to verify the price constraint visually. It enters a repetitive loop of scrolling and viewing irrelevant details, ultimately terminating due to the maximum step limit (Step 15) without locating a valid product. This highlights the difficulty monolithic models face in balancing high-level constraint satisfaction with low-level interaction.

Allrecipes — Multi-Agent Framework (Success)

Task: "Find a vegetarian lasagna recipe with 100+ reviews, 4.5+ rating, and suitable for 6 people."

The Planner decomposes the task into subgoals—searching, checking reviews, and extracting full recipe details. The Actor executes these steps accurately, leading to successful retrieval of a valid recipe.

Allrecipes — Single-Agent Baseline (Failure)

The agent repeats search queries but fails to verify constraints or navigate to the correct page. It loops through irrelevant results and reaches the step limit without success. This highlights how structured planning and role separation improve constraint satisfaction and task efficiency.

Performance Comparison Leaderboard

Task success rates (%) across WebVoyager, Mind2Web, and MMInA. Bold indicates best-performing open-source models in each domain. OOD denotes the Out Of Domain setting. Overall is averaged across available domains.

Backbone	Model/Method	WebVoyager (In Domain)				Mind2Web & MMInA (OOD)					Avg.	Δ
Backbone	Model/Method	Amz	Cour	Recp	Map	Info	Svc	Ent	Trav	Wiki	Avg.	Δ
Closed-Source
GPT-4o		24.4%	7.1%	13.3%	36.6%	7.8%	14.1%	3.4%	19.4%	51.0%	19.7%
Gemini-Pro-Vision		41.7%	28.6%	20.0%	53.7%	16.7%	22.4%	0.0%	19.4%	50.0%	29.6%
Claude-4		63.4%	28.6%	33.3%	70.0%	25.6%	40.0%	6.9%	9.7%	52.0%	36.6%
Open-Source
Qwen2.5-VL-32B		46.3%	26.2%	6.7%	29.3%	14.1%	20.0%	6.9%	9.7%	43.0%	22.5%
CogAgent		12.2%	9.5%	26.7%	9.8%	24.4%	8.2%	13.8%	16.1%	21.0%	15.7%
Websight		24.4%	4.8%	13.3%	29.3%	10.3%	3.5%	3.4%	0.0%	12.0%	11.2%
UI-TARS-1.5-7B		31.7%	16.7%	20.0%	31.7%	6.4%	4.7%	6.9%	0.0%	36.0%	17.1%
Qwen2.5-VL-3B	Baseline	7.3%	11.9%	8.9%	4.9%	12.8%	4.7%	6.9%	6.5%	7.0%	7.9%	—
	+ Base RL	31.7%	33.3%	20.0%	29.3%	19.2%	16.5%	10.3%	6.5%	23.0%	21.1%	+13.2
	+ Multi-Agent	22.0%	14.3%	4.4%	14.6%	15.4%	5.9%	13.8%	6.5%	15.0%	12.4%	+4.5
	+ Multi-Agent RL	31.7%	33.3%	24.4%	41.5%	21.8%	17.6%	13.8%	12.9%	23.0%	24.0%	+16.1
Qwen2.5-VL-7B	Baseline	14.6%	2.4%	15.9%	16.7%	9.0%	11.8%	0.0%	4.4%	38.0%	12.5%	—
	+ Reasoning Bank	29.3%	9.5%	6.7%	29.3%	9.0%	20.0%	3.4%	6.5%	44.0%	17.5%	+5.0
	+ AWM	17.1%	4.8%	11.1%	29.3%	7.7%	10.6%	0.0%	6.5%	31.0%	13.1%	+0.6
	+ CoMEM	24.4%	17.1%	8.9%	34.1%	16.7%	23.5%	10.3%	12.9%	47.0%	21.7%	+9.2
	+ Base RL	46.3%	21.4%	24.4%	48.8%	23.1%	23.5%	6.9%	6.5%	34.0%	26.1%	+13.6
	+ Multi-Agent	56.1%	33.3%	24.4%	48.8%	15.4%	21.2%	3.4%	3.2%	41.0%	27.4%	+14.9
	+ Multi-Agent RL	68.3%	38.1%	33.3%	63.4%	21.8%	23.5%	10.3%	12.9%	44.0%	35.1%	+22.6
Qwen3-VL-8B	Baseline	36.6%	16.7%	13.3%	43.9%	9.0%	20.0%	10.3%	9.7%	38.0%	21.9%	—
	+ Reasoning Bank	36.6%	11.9%	17.8%	31.7%	10.3%	24.7%	17.2%	6.5%	48.0%	22.7%	+0.8
	+ AWM	39.0%	19.0%	8.9%	31.7%	12.5%	16.5%	6.9%	16.1%	48.0%	22.1%	+0.2
	+ Continuous	43.9%	19.0%	17.8%	34.1%	12.5%	17.9%	3.4%	19.4%	42.0%	23.3%	+1.4
	+ Base RL	46.3%	38.1%	33.3%	48.8%	25.6%	18.8%	10.3%	6.5%	45.0%	30.3%	+8.4
	+ Multi-Agent	56.1%	33.3%	28.9%	58.5%	21.8%	27.1%	20.7%	16.1%	54.0%	35.2%	+13.3
	+ Multi-Agent RL	63.4%	42.9%	37.8%	53.7%	26.9%	23.5%	17.2%	9.7%	51.0%	35.8%	+13.9
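
For readers re-deriving the summary columns, the sketch below shows the arithmetic. It assumes Avg. is the unweighted mean of the nine per-domain rates, an inference that matches the rows checked here but is not stated on this page. It also assumes Δ is the difference in percentage points from the Baseline row of the same backbone. The helper names `average` and `delta` are hypothetical.

```python
from typing import List


def average(rates: List[float]) -> float:
    """Unweighted mean of per-domain success rates, rounded to one decimal."""
    return round(sum(rates) / len(rates), 1)


def delta(method_avg: float, baseline_avg: float) -> float:
    """Gain over the same backbone's Baseline row, in percentage points."""
    return round(method_avg - baseline_avg, 1)


# GPT-4o row: Amz, Cour, Recp, Map, Info, Svc, Ent, Trav, Wiki
gpt4o = [24.4, 7.1, 13.3, 36.6, 7.8, 14.1, 3.4, 19.4, 51.0]
gpt4o_avg = average(gpt4o)

# Qwen2.5-VL-7B: Multi-Agent RL (35.1%) versus Baseline (12.5%)
marl_gain = delta(35.1, 12.5)

print(gpt4o_avg, marl_gain)
```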

Performance Comparison on OSWorld across different methods.

Method	Office				Daily				Professional			Others		Overall
Method	Writer	Calc	Impress	Avg.	Thunderbird	VLC	Chrome	Avg.	GIMP	VS Code	Avg.	Multi-Apps	OS	Overall
Qwen2.5-VL-7B	4.3%	2.1%	2.1%	2.6%	0.0%	5.9%	11.1%	7.8%	0.0%	0.0%	0.0%	2.0%	12.5%	3.9%
+Reasoning Bank	8.7%	2.1%	2.1%	3.4%	13.3%	0.0%	6.5%	6.4%	11.5%	4.6%	8.3%	0.9%	12.5%	4.7%
+Multi-Agent	8.7%	0.0%	4.3%	3.4%	13.3%	0.0%	8.7%	7.9%	11.5%	0.0%	6.3%	3.1%	12.5%	5.3%
Qwen3-VL-8B	27.3%	4.3%	19.0%	14.6%	40.0%	17.7%	28.3%	28.2%	30.8%	18.2%	25.0%	7.4%	29.2%	18.1%
+Reasoning Bank	18.2%	8.5%	10.6%	11.2%	53.3%	23.5%	26.1%	30.8%	36.0%	27.3%	31.9%	6.3%	41.7%	19.0%
+Multi-Agent	27.3%	8.7%	10.4%	13.0%	46.7%	29.4%	34.8%	35.9%	46.2%	22.7%	35.4%	9.6%	41.7%	22.1%

Performance Comparison on MCPBench across different methods.

Model	Task Completion		Tool Usage		Planning Effectiveness		Execution Fidelity
Model	Task Fulfill.	Info Ground.	Tool Appropri.	Parameter Accuracy	Dependency Awareness	Parallelism Efficiency	Schema Compliance	Valid Tool Name Rate	Execution Success	Failure Rate
Qwen2.5-7B	4.72	4.54	4.46	4.42	4.08	3.73	0.57	0.57	0.52	0.05
+Reasoning Bank	6.09	5.84	5.50	5.67	5.71	4.70	0.81	0.74	0.68	0.10
+Multi-Agent	6.78	6.68	6.70	6.51	5.83	5.26	0.89	0.92	0.91	0.01
+Multi-Agent+Memory	6.82	6.69	5.86	6.20	5.23	4.85	0.80	0.74	0.70	0.04
Qwen3-8B	5.67	5.68	5.69	5.63	5.69	5.42	0.72	0.71	0.70	0.01
+Reasoning Bank	6.93	7.04	6.96	6.84	6.79	6.57	0.90	0.85	0.81	0.05
+Multi-Agent	8.71	8.59	8.57	8.60	8.22	8.09	1.00	1.00	1.00	0.00
+Multi-Agent+Memory	8.10	8.26	8.29	8.15	8.26	7.51	1.00	1.00	1.00	0.00

Benchmark Results: Our planner-centric multi-agent framework outperforms the single-agent baseline on average for every backbone across WebVoyager, Mind2Web, and MMInA. Multi-Agent RL achieves the highest average success rate within each backbone: 24.0% for Qwen2.5-VL-3B, 35.1% for Qwen2.5-VL-7B, and 35.8% for Qwen3-VL-8B.

Scaling Analysis: The Planner Matters Most.

Figure: Scaling modules ablation. Scaling the Planner yields significantly greater gains than scaling other agents across models and domains. Panel (d) shows fitted scaling law curves.

To better understand where model capacity should be allocated within our multi-agent system, we conduct an ablation study that scales one component at a time (Planner, Actor, or Memory Manager) while keeping the others fixed. We evaluate performance on three domains (Amazon, Google Maps, and Allrecipes) across backbone models of different sizes: Qwen 3B/7B/32B and Gemini 2.5 Pro.

As shown in the figure, scaling the Planner yields significantly greater performance gains than scaling any other agent, leaving only a small gap to scaling all modules together. This trend holds across models and domains, confirming that the Planner, which is responsible for high-level reasoning and global control, is the most impactful component. While scaling all modules brings some additional improvement, most gains are achieved by focusing on the Planner. Similar to how the human brain allocates most energy to cognitive functions (Raichle & Gusnard, 2002), prioritizing Planner capacity proves to be an efficient and effective strategy.
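
The one-component-at-a-time design of this ablation can be written down as a small configuration generator. This is a sketch under assumptions: the function name `scale_one_role`, the role keys, and the choice of 3B as the fixed size and 32B as the scaled size are illustrative picks from the Qwen sizes listed above, not the exact experimental grid.

```python
from typing import Dict, Tuple

ROLES: Tuple[str, ...] = ("planner", "actor", "memory")


def scale_one_role(base_size: str, scaled_size: str) -> Dict[str, Dict[str, str]]:
    """Build one configuration per role, scaling only that role's backbone.

    Returns a mapping from the scaled role to a {role: model size} assignment.
    """
    configs: Dict[str, Dict[str, str]] = {}
    for scaled_role in ROLES:
        assignment = {role: base_size for role in ROLES}  # everything at base size
        assignment[scaled_role] = scaled_size             # then scale one role
        configs[scaled_role] = assignment
    return configs


configs = scale_one_role("3B", "32B")
for scaled_role, assignment in configs.items():
    print(scaled_role, assignment)
```

Comparing task success across these three assignments, plus an all-scaled reference, isolates how much each role benefits from added capacity.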

Checkpoints and Trajectory Dataset

We provide the following on Hugging Face for reproducibility:

Planner / Agent Checkpoints

Trajectory Dataset (Experience Memory)

BibTeX

@article{planner-matters-2025,
  title   = {Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon GUI Planning},
  author  = {Wenyi Wu and Sibo Zhu and Kun Zhou and Biwei Huang},
  journal = {...},
  year    = {2025}
}

More Works from Our Lab

Towards General Continuous Memory for Vision-Language Models

Auto-Scaling Continuous Memory For GUI Agent
