Abstract
Mobile GUI agents powered by large language models show significant performance degradation when exposed to real-world third-party content in commercial applications.
Recent years have witnessed rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks from natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and several commercial agents have already been released and adopted by early users. However, are we really ready to integrate GUI agents into our daily devices as system building blocks? We argue that a critical pre-deployment validation step is missing: examining whether agents can maintain their performance under real-world threats. Specifically, unlike existing benchmarks, which rely on simple static app content (a necessity for ensuring environment consistency across tests), real-world apps are filled with content from untrustworthy third parties, such as advertisement emails, user-generated posts, and media. ... To this end, we introduce a scalable app content instrumentation framework that enables flexible and targeted content modifications within existing applications. Leveraging this framework, we create a test suite comprising both a dynamic task-execution environment and a static dataset of challenging GUI states. The dynamic environment comprises 122 reproducible tasks, and the static dataset consists of over 3,000 scenarios constructed from commercial apps. We perform experiments on both open-source and commercial GUI agents. Our findings reveal that all examined agents are significantly degraded by third-party content, with average misleading rates of 42.0% and 36.1% in the dynamic and static environments, respectively. The framework and benchmark have been released at https://agenthazard.github.io.
Community
Mobile GUI agents powered by LLMs are increasingly capable of autonomously completing device-control tasks, but are they ready for real-world deployment? This paper identifies a critical gap in pre-deployment validation: existing benchmarks rely on clean, static app content, while real-world apps are filled with untrustworthy third-party content such as ads, user posts, and media. The authors introduce AgentHazard, a scalable app content instrumentation framework for testing agent robustness under such conditions, paired with a benchmark of 122 dynamic tasks and over 3,000 static GUI scenarios. Experiments across open-source and commercial agents reveal alarming vulnerability, with average misleading rates of 42.0% (dynamic) and 36.1% (static), underscoring the urgent need for adversarial robustness evaluation before GUI agents become everyday system building blocks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents (2026)
- AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents (2026)
- ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection (2026)
- SecAgent: Efficient Mobile GUI Agent with Semantic Context (2026)
- CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents (2026)
- MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion (2026)
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection (2026)