Framework Overview
CoMEM Framework: Our continuous memory system encodes each GUI trajectory into a fixed-length sequence of embeddings using a Q-Former encoder. Relevant memories are retrieved via FAISS and plugged directly into the VLM's input layer, sharply reducing context cost while preserving fine-grained visual information. An auto-scaling data flywheel continuously discovers new environments and synthesizes training data to grow the memory bank.
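The core idea above — compress a variable-length trajectory into a fixed number of continuous embeddings and prepend them to the task's input embeddings — can be sketched as follows. This is a minimal NumPy mock (the real encoder is a Q-Former on top of the VLM itself; the dimensions and the mean of the attention mechanism here are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions, not the paper's actual sizes).
HIDDEN = 64    # VLM hidden size
N_QUERY = 8    # fixed number of learnable query tokens per trajectory

def encode_trajectory(frame_feats: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Mock Q-Former: cross-attend N_QUERY learnable queries over a
    variable-length trajectory, yielding a fixed-length continuous memory."""
    # (n_query, n_frames) attention scores via scaled dot product
    scores = queries @ frame_feats.T / np.sqrt(HIDDEN)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ frame_feats                     # (n_query, HIDDEN)

# A trajectory of 20 screenshot/step features; any length works.
trajectory = rng.standard_normal((20, HIDDEN))
queries = rng.standard_normal((N_QUERY, HIDDEN))
memory = encode_trajectory(trajectory, queries)      # fixed (8, 64)

# Injection: prepend the continuous memory to the current task's input
# embeddings instead of serializing the history into text tokens.
task_embeds = rng.standard_normal((50, HIDDEN))
augmented = np.concatenate([memory, task_embeds], axis=0)
print(augmented.shape)  # (58, 64): 8 memory slots vs. hundreds of text tokens
```

Whatever the trajectory length, the memory contributes a constant `N_QUERY` slots to the context, which is why retrieval depth can grow without the prompt ballooning.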
Abstract
We study how to endow GUI agents with scalable memory that helps them generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about $4k and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2% of parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
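The four-stage data flywheel (i–iv) can be sketched as a simple collection loop. Every helper below is a hypothetical stand-in: a real pipeline would call a web search API for discovery, an open-source VLM for task synthesis and verification, and the GUI agent itself for rollouts.

```python
# Sketch of the auto-scaling data flywheel; all helpers are hypothetical stubs.

def discover_environments(query: str) -> list[str]:
    """(i) Discover new environments via search (stubbed)."""
    return [f"https://example-{i}.com" for i in range(3)]

def synthesize_task(url: str) -> str:
    """(ii) An open-source VLM proposes a task for the site (stubbed)."""
    return f"Find the pricing page on {url}"

def rollout(task: str) -> list[str]:
    """(iii) The agent executes the task, producing a trajectory (stubbed)."""
    return [f"step {i}: click" for i in range(4)]

def verify(task: str, trajectory: list[str]) -> bool:
    """(iv) The same VLM judges whether the rollout succeeded (stubbed)."""
    return len(trajectory) > 0

memory_bank = []
for url in discover_environments("shopping sites"):
    task = synthesize_task(url)
    traj = rollout(task)
    if verify(task, traj):  # only verified successes enter the memory bank
        memory_bank.append({"task": task, "trajectory": traj})

print(len(memory_bank))  # 3 verified trajectories added this round
```

Because every stage is automated and the verifier reuses the same open-source VLM, the loop runs at near-zero marginal human cost, which is what makes the 100k+ trajectory scale affordable.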
Case Studies
Amazon Shopping Task (CoMEM): Our memory-augmented agent successfully completes a complex e-commerce task by retrieving relevant past shopping trajectories. The continuous memory provides visual context about product pages, navigation patterns, and purchase workflows, enabling efficient task completion on unfamiliar shopping interfaces.
Amazon Shopping Task (Base Model): The baseline model without continuous memory struggles with complex shopping tasks. Without access to past trajectories and their visual context, the agent fails to plan effectively and stops at the wrong product page.
Wikipedia Information Seeking (CoMEM): The agent leverages continuous memory to navigate Wikipedia's information architecture. By recalling similar information-seeking trajectories, it efficiently locates relevant articles, follows contextual links, and extracts target information across multiple pages.
Wikipedia Information Seeking (Base Model): The baseline model without memory navigates Wikipedia's complex information structure inefficiently. Unable to recall similar navigation patterns, it retrieves information less efficiently across multiple pages.
Memory Trajectories
Examples of GUI interaction trajectories stored in our continuous memory bank, showing screenshots and agent actions at each step.
Our trajectories are collected through an auto-scaling data flywheel at zero human-annotation cost. The resulting memory bank contains over 100k trajectories spanning more than 10k websites, enabling rapid expansion across diverse web environments. The complete dataset is open-sourced on HuggingFace for the research community: CoMEM-agent-memory-trajectories
GoodDog Service: "Find a breeder offering female German Shepherd puppies in the 92101 zip code"
4 steps • Success: ✓
Amazon Fresh Shopping: "Search for a specific Amazon Fresh product and check its availability"
4 steps • Success: ✓
Metroman Travel: "Compare the number of stations in all hot cities listed"
4 steps • Success: ✓
Healthcare: "Find a dentist in Jackson, MI"
5 steps • Success: ✓
Education: "Evaluate the extracurricular planning services offered by AdmissionSight"
5 steps • Success: ✓
TikTok Entertainment: "Determine which song has the longest duration among the top 10 songs"
7 steps • Success: ✓
Govfacts: "Find an article discussing the impact of climate change on agriculture"
7 steps • Success: ✓
Engineersedge: "Compare the thermal conductivity of two different rock samples listed in the table."
7 steps • Success: ✓
BestPetFect: "Identify the section detailing the exercise requirements for Golden Retrievers based on size"
7 steps • Success: ✓
Dealerrater: "Find the most recent car review posted on the site"
7 steps • Success: ✓
Memory Bank Statistics: 100k+ trajectories • 10k+ websites • $4k collection cost
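At this scale, retrieval from the bank reduces to a nearest-neighbor search over one embedding key per stored trajectory. A minimal NumPy sketch of that lookup is below; the exact inner-product search shown is what FAISS's `IndexFlatIP` computes, which the real system uses to index the 100k+ keys (bank size, dimensions, and the random keys here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy memory bank: one embedding key per stored trajectory. Normalizing the
# keys turns inner-product search into cosine similarity.
DIM = 32
bank_keys = rng.standard_normal((1000, DIM)).astype(np.float32)
bank_keys /= np.linalg.norm(bank_keys, axis=1, keepdims=True)

def retrieve(query: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the top-k most similar stored trajectories."""
    q = query / np.linalg.norm(query)
    scores = bank_keys @ q           # inner-product similarity to every key
    return np.argsort(-scores)[:k]   # top-k trajectory ids

task_embedding = rng.standard_normal(DIM).astype(np.float32)
top_ids = retrieve(task_embedding, k=4)
print(top_ids)  # ids of the 4 trajectories whose memories get injected
```

Because each retrieved trajectory contributes only a fixed-length block of continuous embeddings, increasing `k` deepens retrieval without the prompt growth that text-based memories suffer.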
Performance Comparison Leaderboard
Bold indicates the best performance, underlined denotes the second-best. Results from closed-source base models are for reference only and are excluded from ranking.
| Model | MMInA Wiki | MMInA Shop | Mind2Web Shop | Mind2Web Travel | Mind2Web Info | Mind2Web Service | WebVoyager | Avg. |
|---|---|---|---|---|---|---|---|---|
| **Closed Source** (reference only) | | | | | | | | |
| GPT-4o | 51.3% | 37.0% | 15.4% | 14.3% | 22.6% | 29.4% | 31.8% | 27.8% |
| Gemini-Pro-Vision | 52.3% | 41.6% | 12.5% | 25.0% | 20.8% | 22.8% | 47.7% | 30.4% |
| Claude-4 | 50.0% | 40.0% | 10.5% | 22.2% | 19.8% | 26.7% | 40.9% | 28.8% |
| **Open Source** | | | | | | | | |
| Qwen2-VL-7B | 7.8% | 0.0% | 0.0% | 2.2% | 8.3% | 14.0% | 31.8% | 8.8% |
| Qwen2.5-VL-7B | 36.7% | 15.5% | 2.6% | 9.5% | 9.6% | 17.3% | 40.0% | 14.4% |
| GLM-4.1V-9B | 34.7% | 20.3% | 13.3% | 11.1% | 13.6% | **33.3%** | 40.0% | 23.0% |
| Qwen2.5-VL-32B | <u>43.3%</u> | <u>37.6%</u> | 8.0% | 12.2% | 7.6% | 13.0% | 40.9% | 21.6% |
| **Specialized Finetuned** | | | | | | | | |
| UI-TARS-1.5 | 36.4% | 1.0% | 0.0% | 14.3% | 5.6% | 6.5% | 34.8% | 13.2% |
| CogAgent | 20.5% | 7.0% | 10.7% | **20.0%** | 12.4% | <u>20.6%</u> | - | 15.3% |
| Websight | 12.0% | 9.5% | 8.3% | 6.7% | 13.3% | 17.6% | <u>47.7%</u> | 15.8% |
| **Memory-Augmented** | | | | | | | | |
| UI-TARS-1.5-7B + Text-based Memory | 16.0% | 1.0% | 0.0% | 11.0% | 3.6% | 8.6% | 34.0% | 10.0% |
| UI-TARS-1.5-7B + CoMEM | 41.3% | 17.9% | <u>14.3%</u> | 18.2% | <u>23.3%</u> | 18.9% | 38.0% | <u>23.8%</u> |
| Qwen2.5-VL-7B + Text-based Memory | 34.2% | 31.4% | 7.1% | 17.8% | 12.7% | 16.6% | 44.0% | 22.2% |
| Qwen2.5-VL-7B + CoMEM | **47.4%** | **45.0%** | **22.2%** | <u>18.8%</u> | **26.5%** | 17.7% | **54.5%** | **31.7%** |
Benchmark Results: Our continuous memory-augmented agent consistently outperforms baseline approaches across three real-world GUI benchmarks (MMInA, Mind2Web, and WebVoyager). Notably, Qwen2.5-VL-7B with continuous memory achieves performance comparable to GPT-4o and Claude-4.
BibTeX
@article{wu2025comemagent,
title={Auto-Scaling Continuous Memory For GUI Agent},
author={Wenyi Wu and Kun Zhou and Ruoxin Yuan and Vivian Yu and Stephen Wang and Zhiting Hu and Biwei Huang},
journal={arXiv preprint arXiv:2510.09038},
year={2025},
url={https://github.com/WenyiWU0111/CoMEM}
}