Auto-Scaling Continuous Memory For GUI Agent

University of California, San Diego • Fudan University • Abel AI

Framework Overview

CoMEM Framework Architecture

CoMEM Framework: Our continuous memory system encodes GUI trajectories into fixed-length embeddings using a Q-Former encoder. The memory is retrieved via FAISS and plugged directly into the VLM's input layer, enabling efficient scaling while preserving fine-grained visual information. The auto-scaling data flywheel continuously discovers new environments and synthesizes training data.
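The retrieve-and-inject step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a brute-force inner-product search stands in for FAISS, and all sizes (K memory slots per trajectory, dimension D, bank size N) are assumed values chosen for the example.

```python
import numpy as np

# Assumed sizes for illustration only: each trajectory is compressed by the
# Q-Former encoder into K continuous embeddings of dimension D.
K, D, N = 8, 64, 100
rng = np.random.default_rng(0)

# Memory bank: N stored trajectories, each a fixed-length embedding block.
memory_bank = rng.standard_normal((N, K, D)).astype(np.float32)

# One pooled, normalized key per trajectory for retrieval (mean over K slots);
# the paper retrieves with FAISS, a brute-force inner product stands in here.
keys = memory_bank.mean(axis=1)
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

def retrieve(query: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Return the top_k trajectory embedding blocks most similar to the query."""
    q = query / np.linalg.norm(query)
    scores = keys @ q                      # inner-product similarity, shape (N,)
    idx = np.argsort(-scores)[:top_k]
    return memory_bank[idx]                # shape (top_k, K, D)

# Retrieved blocks are flattened into continuous "tokens" and prepended to the
# VLM's input embeddings: top_k * K slots instead of full text trajectories.
query = rng.standard_normal(D).astype(np.float32)
retrieved = retrieve(query, top_k=3)
context = retrieved.reshape(-1, D)         # shape (top_k * K, D)
```

The point of the sketch is the cost model: injecting `top_k * K` continuous embeddings keeps context length fixed regardless of how long each stored trajectory originally was.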

Abstract

We study how to endow GUI agents with scalable memory that helps them generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and discards decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings, using the VLM itself as the encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories, which degrade with long prompts. To grow the memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about $4k and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2% of parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
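The four flywheel stages (i)–(iv) can be sketched as a single loop. Every function below is a hypothetical stub, shown only to make the control flow concrete; the real pipeline uses web search for discovery, an open-source VLM for task synthesis and verification, and the agent itself for rollouts.

```python
# Minimal sketch of the auto-scaling data flywheel; all names are illustrative
# stand-ins, not the paper's actual implementation.

def discover_environments(seed_queries):
    """(i) Discover new GUI environments via search (stubbed)."""
    return [f"https://example.com/{q}" for q in seed_queries]

def synthesize_task(url):
    """(ii) Ask an open-source VLM to propose a task for the site (stubbed)."""
    return f"Complete a representative task on {url}"

def rollout(task, url):
    """(iii) Let the agent act in the environment (stubbed)."""
    return {"task": task, "url": url, "steps": ["click", "type", "submit"]}

def verify(trajectory):
    """(iv) Have the same VLM judge success (stubbed: accept everything)."""
    return True

def flywheel(seed_queries):
    memory_bank = []
    for url in discover_environments(seed_queries):
        task = synthesize_task(url)
        traj = rollout(task, url)
        if verify(traj):            # only verified trajectories enter memory
            memory_bank.append(traj)
    return memory_bank

bank = flywheel(["flights", "shopping"])
```

Because every stage is automated, the loop can run unattended, which is what keeps the per-trajectory collection cost low.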

Case Studies

Memory Trajectories

Examples of GUI interaction trajectories stored in our continuous memory bank, showing screenshots and agent actions at each step.

Our trajectories are collected through an auto-scaling data flywheel at zero human cost. The resulting memory bank contains over 100k trajectories spanning more than 10k websites, enabling rapid expansion across diverse web environments. The complete dataset is open-sourced on HuggingFace for the research community: CoMEM-agent-memory-trajectories

GoodDog Service: "Find a breeder offering female German Shepherd puppies in the 92101 zip code"

4 steps • Success: ✓

Amazon Fresh Shopping: "Search for a specific Amazon Fresh product and check its availability"

4 steps • Success: ✓

Metroman Travel: "Compare the number of stations in all hot cities listed"

4 steps • Success: ✓

Healthcare: "Find a dentist in Jackson, MI"

5 steps • Success: ✓

Education: "Evaluate the extracurricular planning services offered by AdmissionSight"

5 steps • Success: ✓

TikTok Entertainment: "Determine which song has the longest duration among the top 10 songs"

7 steps • Success: ✓

Govfacts: "Find an article discussing the impact of climate change on agriculture"

7 steps • Success: ✓

Engineersedge: "Compare the thermal conductivity of two different rock samples listed in the table."

7 steps • Success: ✓

BestPetFect: "Identify the section detailing the exercise requirements for Golden Retrievers based on size"

7 steps • Success: ✓

Dealerrater: "Find the most recent car review posted on the site"

7 steps • Success: ✓

Memory Bank Statistics: 100k+ trajectories • 10k+ websites • $4k collection cost

Performance Comparison Leaderboard

Bold indicates the best performance; underline denotes the second-best. Results from closed-source base models are for reference only and are excluded from ranking.

| Model | MMInA Wiki | MMInA Shop | Mind2Web Shop | Mind2Web Travel | Mind2Web Info | Mind2Web Service | WebVoyager Overall | Avg. |
|---|---|---|---|---|---|---|---|---|
| *Closed Source* | | | | | | | | |
| GPT-4o | 51.3% | 37.0% | 15.4% | 14.3% | 22.6% | 29.4% | 31.8% | 27.8% |
| Gemini-Pro-Vision | 52.3% | 41.6% | 12.5% | 25.0% | 20.8% | 22.8% | 47.7% | 30.4% |
| Claude-4 | 50.0% | 40.0% | 10.5% | 22.2% | 19.8% | 26.7% | 40.9% | 28.8% |
| *Open Source* | | | | | | | | |
| Qwen2-VL-7B | 7.8% | 0.0% | 0.0% | 2.2% | 8.3% | 14.0% | 31.8% | 8.8% |
| Qwen2.5-VL-7B | 36.7% | 15.5% | 2.6% | 9.5% | 9.6% | 17.3% | 40.0% | 14.4% |
| GLM 4.1V-9B | 34.7% | 20.3% | 13.3% | 11.1% | 13.6% | 33.3% | 40.0% | 23.0% |
| Qwen2.5-VL-32B | 43.3% | 37.6% | 8.0% | 12.2% | 7.6% | 13.0% | 40.9% | 21.6% |
| *Specialized Finetuned* | | | | | | | | |
| UI-TARS-1.5 | 36.4% | 1.0% | 0.0% | 14.3% | 5.6% | 6.5% | 34.8% | 13.2% |
| CogAgent | 20.5% | 7.0% | 10.7% | 20.0% | 12.4% | 20.6% | - | 15.3% |
| Websight | 12.0% | 9.5% | 8.3% | 6.7% | 13.3% | 17.6% | 47.7% | 15.8% |
| *Memory-augmented* | | | | | | | | |
| UI-TARS-1.5-7B + Text-based Memory | 16.0% | 1.0% | 0.0% | 11.0% | 3.6% | 8.6% | 34.0% | 10.0% |
| UI-TARS-1.5-7B + CoMEM | 41.3% | 17.9% | 14.3% | 18.2% | 23.3% | 18.9% | 38.0% | 23.8% |
| Qwen2.5-VL-7B + Text-based Memory | 34.2% | 31.4% | 7.1% | 17.8% | 12.7% | 16.6% | 44.0% | 22.2% |
| Qwen2.5-VL-7B + CoMEM | 47.4% | 45.0% | 22.2% | 18.8% | 26.5% | 17.7% | 54.5% | 31.7% |

Benchmark Results: Our continuous memory-augmented agent consistently outperforms baseline approaches across three real-world GUI benchmarks (MMInA, Mind2Web, and WebVoyager). Notably, Qwen2.5-VL-7B with continuous memory achieves performance comparable to GPT-4o and Claude-4.

BibTeX

@article{wu2025comemagent,
  title={Auto-Scaling Continuous Memory For GUI Agent},
  author={Wenyi Wu and Kun Zhou and Ruoxin Yuan and Vivian Yu and Stephen Wang and Zhiting Hu and Biwei Huang},
  journal={arXiv preprint arXiv:2510.09038},
  year={2025},
  url={https://github.com/WenyiWU0111/CoMEM}
}