Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning

*Equal contribution    Corresponding author

Framework Overview

Planner Matters Framework Architecture

Multi-agent Framework: We decompose GUI automation into Planner (high-level step-by-step planning), Actor (low-level execution: click, type, scroll, etc.), and Memory manager (retrieval and use of past experience). Planning is the dominant factor—we introduce planner-centric RL that optimizes only the planner with trajectory-level rewards from a VLM-as-judge, while keeping actor and memory frozen.

Abstract

Language model (LM)–based agents have demonstrated promising capabilities in automating complex tasks from natural language instructions, yet they continue to struggle with long-horizon planning and reasoning.. We address this challenge by proposing a human-like multi-agent framework that decomposes automation into a planner, an actor, and a memory manager, enabling structured decision-making across extended interaction horizons. Empirical results show that this decomposition substantially improves performance on complex GUI tasks, and is also effective in other scenarios like tool-use. Viewing the system through a compute-allocation lens, we further analyze the necessity of model capacity across agent roles and find that planning is the dominant factor, while execution and memory can be handled by significantly smaller models with minimal performance loss. Motivated by this observation, we introduce a planner-centric reinforcement learning approach that exclusively optimizes the planner using trajectory-level rewards from a VLM-as-judge, while keeping other agents frozen. Extensive experiments demonstrate that concentrating model capacity and learning on high-level planning yields robust and compute-efficient improvements in long-horizon agent automation. Our code is publicly released.

Case Studies

Performance comparison Leaderboard

Task success rates (%) across WebVoyager, Mind2Web, and MMInA. Bold indicates best-performing open-source models in each domain. OOD denotes the Out Of Domain setting. Overall is averaged across available domains.

Backbone Model/Method WebVoyager (In Domain) Mind2Web & MMInA (OOD) Avg. Δ
Amz Cour Recp Map Info Svc Ent Trav Wiki
Closed-Source
GPT-4o 24.4% 7.1% 13.3% 36.6% 7.8% 14.1% 3.4% 19.4% 51.0% 19.7%
Gemini-Pro-Vision 41.7% 28.6% 20.0% 53.7% 16.7% 22.4% 0.0% 19.4% 50.0% 29.6%
Claude-4 63.4% 28.6% 33.3% 70.0% 25.6% 40.0% 6.9% 9.7% 52.0% 36.6%
Open-Source
Qwen2.5-VL-32B 46.3% 26.2% 6.7% 29.3% 14.1% 20.0% 6.9% 9.7% 43.0% 22.5%
CogAgent 12.2% 9.5% 26.7% 9.8% 24.4% 8.2% 13.8% 16.1% 21.0% 15.7%
Websight 24.4% 4.8% 13.3% 29.3% 10.3% 3.5% 3.4% 0.0% 12.0% 11.2%
UI-TARS-1.5-7B 31.7% 16.7% 20.0% 31.7% 6.4% 4.7% 6.9% 0.0% 36.0% 17.1%
Qwen2.5-VL-3B Baseline 7.3% 11.9% 8.9% 4.9% 12.8% 4.7% 6.9% 6.5% 7.0% 7.9%
+ Base RL 31.7% 33.3% 20.0% 29.3% 19.2% 16.5% 10.3% 6.5% 23.0% 21.1% +13.2
+ Multi-Agent 22.0% 14.3% 4.4% 14.6% 15.4% 5.9% 13.8% 6.5% 15.0% 12.4% +4.5
+ Multi-Agent RL 31.7% 33.3% 24.4% 41.5% 21.8% 17.6% 13.8% 12.9% 23.0% 24.0% +16.1
Qwen2.5-VL-7B Baseline 14.6% 2.4% 15.9% 16.7% 9.0% 11.8% 0.0% 4.4% 38.0% 12.5%
+ Reasoning Bank 29.3% 9.5% 6.7% 29.3% 9.0% 20.0% 3.4% 6.5% 44.0% 17.5% +5.0
+ AWM 17.1% 4.8% 11.1% 29.3% 7.7% 10.6% 0.0% 6.5% 31.0% 13.1% +0.6
+ CoMEM 24.4% 17.1% 8.9% 34.1% 16.7% 23.5% 10.3% 12.9% 47.0% 21.7% +9.2
+ Base RL 46.3% 21.4% 24.4% 48.8% 23.1% 23.5% 6.9% 6.5% 34.0% 26.1% +13.6
+ Multi-Agent 56.1% 33.3% 24.4% 48.8% 15.4% 21.2% 3.4% 3.2% 41.0% 27.4% +14.9
+ Multi-Agent RL 68.3% 38.1% 33.3% 63.4% 21.8% 23.5% 10.3% 12.9% 44.0% 35.1% +22.6
Qwen3-VL-8B Baseline 36.6% 16.7% 13.3% 43.9% 9.0% 20.0% 10.3% 9.7% 38.0% 21.9%
+ Reasoning Bank 36.6% 11.9% 17.8% 31.7% 10.3% 24.7% 17.2% 6.5% 48.0% 22.7% +0.8
+ AWM 39.0% 19.0% 8.9% 31.7% 12.5% 16.5% 6.9% 16.1% 48.0% 22.1% +0.2
+ Continuous 43.9% 19.0% 17.8% 34.1% 12.5% 17.9% 3.4% 19.4% 42.0% 23.3% +1.4
+ Base RL 46.3% 38.1% 33.3% 48.8% 25.6% 18.8% 10.3% 6.5% 45.0% 30.3% +8.4
+ Multi-Agent 56.1% 33.3% 28.9% 58.5% 21.8% 27.1% 20.7% 16.1% 54.0% 35.2% +13.3
+ Multi-Agent RL 63.4% 42.9% 37.8% 53.7% 26.9% 23.5% 17.2% 9.7% 51.0% 35.8% +13.9

Performance Comparison on OSWorld across different methods.

Method Office Daily Professional Others Overall
Writer Calc Impress Avg. thunder. vlc chrome Avg. gimp VsCode Avg. M-Apps OS
Qwen2.5-VL-7B 4.3% 2.1% 2.1% 2.6% 0.0% 5.9% 11.1% 7.8% 0.0% 0.0% 0.0% 2.0% 12.5% 3.9%
+Reasoning Bank 8.7% 2.1% 2.1% 3.4% 13.3% 0.0% 6.5% 6.4% 11.5% 4.6% 8.3% 0.9% 12.5% 4.7%
+Multi-Agent 8.7% 0.0% 4.3% 3.4% 13.3% 0.0% 8.7% 7.9% 11.5% 0.0% 6.3% 3.1% 12.5% 5.3%
Qwen3-VL-8B 27.3% 4.3% 19.0% 14.6% 40.0% 17.7% 28.3% 28.2% 30.8% 18.2% 25.0% 7.4% 29.2% 18.1%
+Reasoning Bank 18.2% 8.5% 10.6% 11.2% 53.3% 23.5% 26.1% 30.8% 36.0% 27.3% 31.9% 6.3% 41.7% 19.0%
+Multi-Agent 27.3% 8.7% 10.4% 13.0% 46.7% 29.4% 34.8% 35.9% 46.2% 22.7% 35.4% 9.6% 41.7% 22.1%

Performance Comparison on MCPBench across different methods.

Model Task Completion Tool Usage Planning Effectiveness Execution Fidelity
Task Fulfill. Info Ground. Tool Appropri. Parameter Accuracy Dependency Awareness Parallelism Efficiency Schema Compliance Valid Tool Name Rate Execution Success Failure Rate
Qwen2.5-7B 4.72 4.54 4.46 4.42 4.08 3.73 0.57 0.57 0.52 0.05
+Reasoning Bank 6.09 5.84 5.50 5.67 5.71 4.70 0.81 0.74 0.68 0.10
+Multi-Agent 6.78 6.68 6.70 6.51 5.83 5.26 0.89 0.92 0.91 0.01
+Multi-Agent+Memory 6.82 6.69 5.86 6.20 5.23 4.85 0.80 0.74 0.70 0.04
Qwen3-8B 5.67 5.68 5.69 5.63 5.69 5.42 0.72 0.71 0.70 0.01
+Reasoning Bank 6.93 7.04 6.96 6.84 6.79 6.57 0.90 0.85 0.81 0.05
+Multi-Agent 8.71 8.59 8.57 8.60 8.22 8.09 1.00 1.00 1.00 0.00
+Multi-Agent+Memory 8.10 8.26 8.29 8.15 8.26 7.51 1.00 1.00 1.00 0.00

Benchmark Results: Our planner-centric multi-agent framework consistently outperforms baseline approaches across WebVoyager, Mind2Web, and MMInA. Multi-Agent RL (highlighted) achieves the best open-source performance within each backbone.

Scaling Analysis: The Planner Matters Most.

Scaling modules ablation
Scaling modules ablation. Scaling the Planner yields significantly greater gains than scaling other agents across models and domains. (d) Fitted scaling law curves.

To better understand where model capacity should be allocated within our multi-agent system, we conduct an ablation study that selectively scales one component at a time—Planner, Actor, or Memory Manager—while keeping others fixed. We evaluate performance on three domains, Amazon, Google Maps, and Allrecipes, across multiple backbone models with different sizes, Qwen 3B/7B/32B and Gemini-2.5Pro.

As shown in the figure, scaling the Planner yields significantly greater performance gains than scaling other agents, with diminishing gaps when scaling all modules. This trend holds across models and domains, confirming that the Planner—responsible for high-level reasoning and global control—is the most impactful component. While scaling all modules brings some additional improvement, most gains are achieved by focusing on the Planner. Similar to how the human brain allocates most energy to cognitive functions (Raichle & Gusnard, 2002), prioritizing Planner capacity proves to be an efficient and effective strategy.

Checkpoints and Trajectory Dataset

We provide the following on Hugging Face for reproducibility:

BibTeX

@article{planner-matters-2025,
  title   = {Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon GUI Planning},
  author  = {Wenyi Wu* and Sibo Zhu* and Kun Zhou and Biwei Huang},
  journal = {...},
  year    = {2025}
}