Analyzes vulnerabilities in multimodal web agents where DOM injections corrupt both screenshot and accessibility tree observations with deceptive narratives, outperforming text-only attacks on MiniWob++.
The provided context does not contain information about a method called "Dual-Modality Multi-Stage Adversarial Safety Training" or its application to multimodal web agents in the context of DOM injections corrupting screenshot and accessibility tree observations. While several papers discuss vulnerabilities in multimodal and web-use agents, particularly under cross-modal attacks, none specifically analyze the described training approach or its performance on MiniWob relative to text-only attacks.
However, related research highlights the susceptibility of multimodal agents to cross-modal and task-aligned injection attacks. For instance, task-aligned injection techniques have been shown to manipulate web-use agents by embedding malicious content in web pages such as comments or ads, exploiting limitations in large language models' (LLMs) contextual reasoning. These attacks achieve over 80% attack success rate (ASR) across agents like OpenAI Operator and Browser Use, even when safety mechanisms are present, by framing harmful commands as helpful guidance.
Cross-modal prompt injection has also been demonstrated through methods like Visual Latent Alignment and Textual Guidance Enhancement, which optimize adversarial visual features and infer black-box defensive prompts to increase ASR by at least +26.4% across tasks. Similarly, PolyJailbreak exploits multimodal safety asymmetry via a composable library of Atomic Strategy Primitives (ASPs), achieving high ASR through coordinated text-image adversarial inputs, particularly effective against models with strong textual defenses but weaker vision alignment.
Furthermore, imperceptible perturbations to a single image—occupying less than 5% of a webpage—have been shown to hijack state-of-the-art multimodal web agents with up to 67% success, demonstrating the fragility of these systems under realistic threat models. These findings underscore the need for robust training strategies that account for modality-specific vulnerabilities, such as those proposed in VARMAT, which uses vulnerability-aware regularization to balance robustness across modalities during adversarial training.
Despite these advances in understanding and exploiting vulnerabilities, the specific dual-modality multi-stage adversarial safety training framework referenced in the query is not discussed in the available context.
This research investigates a critical security vulnerability in multimodal web agents that rely on both visual (screenshots) and structural (accessibility trees) inputs to navigate the web. The authors introduce a class of "cross-modal attacks" where malicious actors perform DOM injections to embed deceptive narratives. Unlike traditional text-based prompt injections, these attacks simultaneously corrupt the agent's visual observation and its structural understanding of the Document Object Model (DOM). By aligning the malicious payload across both modalities, the attacker creates a coherent but false context that effectively manipulates the agent's decision-making process.
To mitigate these threats, the paper proposes "Dual-Modality Multi-Stage Adversarial Safety Training." This defense mechanism systematically trains agents to recognize and resist adversarial examples by leveraging data from both the visual and textual modalities in distinct stages. The study demonstrates that this multi-stage approach significantly enhances the robustness of web agents. Evaluations on the MiniWob++ benchmark reveal that the proposed cross-modal attacks are substantially more potent than text-only attacks, successfully deceiving agents at a higher rate, and that the dual-modality training is essential for recovering agent performance and safety.
The significance of this work lies in its demonstration that relying solely on text-based safety filters is insufficient for securing autonomous multimodal systems. As web agents become increasingly capable of interacting with complex browser environments, their attack surface expands beyond simple text prompts to include the very structure of the web pages they navigate. This study establishes that effective safety alignment must account for the interplay between vision and language, ensuring that agents are robust against sophisticated threats designed to exploit their multimodal nature.
This paper presents a novel framework for hardening multimodal web agents against cross-modal adversarial attacks, where malicious DOM injections corrupt both visual (screenshot) and structural (accessibility tree) observations to mislead the agent. The authors demonstrate that such attacks—combining deceptive visual cues with manipulated semantic structures—are more effective than traditional text-only attacks, particularly on the MiniWob++ benchmark. The proposed dual-modality multi-stage adversarial safety training approach enhances robustness by exposing the agent to adversarially generated perturbations during training, forcing it to learn invariant representations across both modalities. Experiments show significant improvements in safety and task completion under attack compared to single-modality defenses.
The key contributions include: 1. Cross-modal attack generation: A method to jointly corrupt visual and structural inputs, creating more realistic and harder-to-detect adversarial examples. 2. Multi-stage training: A phased adversarial training regime that progressively introduces noise, improving generalization without sacrificing performance on clean inputs. 3. Benchmarking insights: Empirical evidence that multimodal attacks outperform unimodal ones, highlighting the need for joint defense strategies in real-world web automation.
This work is significant for autonomous web agents, reinforcement learning-based assistants, and security-critical applications where adversaries may exploit sensory discrepancies. By formalizing cross-modal attacks and validating defense mechanisms, the paper advances the state-of-the-art in adversarial machine learning for multimodal systems, with implications for safer deployment in untrusted environments.
Source: [arXiv:2603.04364](https://arxiv.org/abs/2603.04364)