Jailbreaking Multimodal Models: Visual Prompt Injection Attacks

Every safety guardrail for large language models was originally designed around text. Filters, classifiers, constitutional AI training, RLHF reward models — the entire defense stack assumes the threat arrives as tokens. When models acquire vision, the attack surface doubles: image inputs bypass text-level filters almost entirely, and the fusion layers that combine modalities introduce new vulnerabilities that neither vision-only nor language-only defenses cover.

Multimodal jailbreaking is not a hypothetical. Researchers have demonstrated reliable policy bypasses against GPT-4V, Claude 3 (Haiku and Sonnet), Gemini Pro, and LLaVA within months of each model’s release. Several attack classes transfer across models, suggesting they target something fundamental in how visual information is processed and merged with language context.

The Expanded Attack Surface

A text-only LLM ingests one modality. A vision-language model ingests at least two, and their integration creates several novel attack surfaces:

Text embedded in images: OCR-extracted or directly processed text that bypasses string-level input filters
Adversarial image perturbations: imperceptible pixel changes that cause the vision encoder to produce misleading embeddings
Visual context manipulation: images that provide misleading context which causes the model to comply with otherwise-refused requests
Cross-modal instruction injection: system prompt overrides delivered through image content
Typographic attacks: rendering attack instructions in ways that exploit how models process text within images differently from standard input text

Typographic Attacks: Exploiting OCR-Level Trust

The simplest multimodal jailbreak exploits the fact that text-in-image bypasses string matching. If an input filter scans for “how do I make explosives,” embedding that phrase in a JPEG as white text on a white background — readable by the vision encoder but not by any regex or keyword filter — sidesteps the check.

More sophisticated typographic attacks exploit how models process handwritten vs. printed text, different fonts, obfuscated characters (using look-alike Unicode), and deliberately poor OCR conditions. A 2023 paper demonstrated that handwritten versions of policy-violating requests had significantly higher compliance rates than typed equivalents, because the model’s safety training was less focused on handwritten input distributions.

Practical variants:

Whitespace injection: instructions hidden in the image as very small text or with matching background color — visible to the vision encoder’s feature extraction but not to human reviewers
Steganographic text: instructions encoded in subtle image features (edge patterns, color channel LSBs) that activate specific model behaviors
Split requests: the benign portion in text, the policy-violating portion in image text, neither triggering filters alone

Adversarial Perturbation Attacks

More technically complex attacks operate at the pixel level. The vision encoder in a multimodal model — typically a CLIP-based transformer — converts raw pixels into embeddings. These embeddings can be manipulated.

The attack formulation: given an image $I$ and a target behavior (e.g., “model ignores its safety instructions”), find a perturbation $\delta$ with $|\delta|_\infty < \epsilon$ such that the model’s response to $(I + \delta)$ differs significantly from its response to $I$.

Researchers from the University of Maryland (Qi et al., 2023) demonstrated that such perturbations can create “jailbreak images” — images that, when combined with any benign text query, cause the model to produce harmful outputs. The attack is:

Generate a small, perceptually invisible perturbation using projected gradient descent
Optimize the perturbation to maximize the probability of the model generating “Sure, here is…” (common prefix for compliance)
The resulting adversarial image functions as a universal jailbreak trigger

These perturbations are not robust to JPEG compression or resizing — a meaningful defense — but adaptive attacks that account for compression artifacts still achieve meaningful bypass rates.

Transferability

The troubling finding from adversarial ML research: perturbations optimized for one model often transfer to others. A jailbreak image created against LLaVA-13B may partially bypass GPT-4V, because both use similar vision encoder architectures (CLIP variants) and the perturbation targets the encoder’s embedding space rather than model-specific behaviors.

This transferability means that a well-resourced attacker with white-box access to any open-weight multimodal model can potentially generate attacks that degrade the safety of proprietary models they’ve never directly accessed.

Visual Context Manipulation

These attacks don’t require pixel manipulation or hidden text. They exploit the model’s context-sensitive safety behavior: models are generally more willing to discuss dangerous topics when the context makes them seem appropriate.

Example attacks:

Roleplay via image: an image depicting a “chemistry classroom” setting followed by a request about chemical synthesis — the visual context shifts the model’s interpretation of what’s appropriate
Authority spoofing: an image displaying official-looking credentials (“Authorized Security Researcher — CISA”) that the model treats as contextual justification
Fictional framing via image: an image establishing a “fictional” narrative context (book cover, film poster) followed by requests for content that would be refused without that frame

These attacks exploit the model’s attempt to be contextually intelligent. The model uses image context to infer intent and appropriate response — exactly as intended — but the context is adversarially crafted.

The most dangerous attack class targets the instruction hierarchy. Modern multimodal models maintain a trust hierarchy: system prompt > user text > image content. But this hierarchy is not always enforced consistently when instructions arrive through image content.

Demonstrated attacks:

Embedding full system prompt overrides as text within images that supersede the actual system prompt
Using image content to instruct the model to ignore subsequent text-based safety checks
Injecting “memory” instructions that persist across turns by encoding them in images early in a conversation

A real-world scenario: an attacker who can control what images appear in a RAG pipeline (e.g., a public-facing knowledge base that users can upload to) can embed system prompt overrides in “innocuous” product photos. When those images are retrieved and passed to a vision-language model, the injected instructions execute with potentially elevated trust.

This is the visual equivalent of indirect prompt injection — see indirect prompt injection in RAG pipelines for the text analog — and the defenses are similarly immature.

Attack Evaluation Results

Published red-team results paint a consistent picture:

GPT-4V (early releases): typographic attacks achieved ~30-40% bypass rates on topics where text-only requests had near-zero bypass rates
LLaVA-1.5: adversarial pixel attacks achieved near-complete safety bypass with $|\delta|_\infty = 16/255$ perturbations
Claude 3 Haiku: visual context manipulation (roleplay framing via images) showed elevated compliance rates relative to text-only framing
Gemini Pro Vision: cross-modal instruction injection successfully overrode system prompts in early-release testing

These numbers improve with model updates, but the fundamental vulnerability — that visual inputs are processed by different code paths than text inputs, with different safety coverage — persists.

Mitigations

The defense landscape for multimodal attacks is less mature than for text-only attacks:

Image preprocessing: JPEG compression, resizing, and color quantization disrupt adversarial pixel perturbations. Adding these as mandatory preprocessing steps before the vision encoder reduces transferable pixel attacks significantly without visible quality loss.

Text-in-image extraction and re-filtering: run OCR on all image inputs and apply standard text-level safety filters to extracted text. This closes the typographic bypass class, though sophisticated encoding (handwriting, obfuscation) remains difficult.

Cross-modal consistency checking: compare the model’s interpretation of the image with its interpretation of the text, flagging large mismatches. An image that causes a substantial shift in model behavior relative to a text-only version of the same query is suspicious.

Red-teaming multimodal inputs specifically: standard red-team evaluations focused on text miss the visual attack surface. Dedicated multimodal adversarial evaluation — including adversarial images, typographic attacks, and context manipulation — should be part of any pre-deployment security assessment. For tracking which multimodal jailbreak techniques are currently effective against specific model releases, jailbreaks.fyi ↗ maintains a model-by-model tracker that includes visual attack coverage.

Defense tools and frameworks: platforms like adversarialml.dev ↗ maintain updated attack taxonomies and test harnesses that include multimodal coverage — essential for tracking an attack surface that evolves faster than most organizations’ security postures.

The Fundamental Problem

Text-based safety is well-understood. Years of RLHF, constitutional AI, and classifier-based filtering have built meaningful (if imperfect) defenses. Vision-language models inherit those text defenses but apply them inconsistently to visual inputs, and the fusion layers that combine modalities are essentially undefended.

Every multimodal capability addition — video, audio, documents, structured data — extends this attack surface further. The organizations that treat multimodal safety as a core engineering requirement — not an afterthought addressed after capability development — will be better positioned than those that don’t.

Also on this site: Adversarial examples in vision models covers the earlier generation of pixel-level attacks on image classifiers. For defenses specific to multimodal deployments, guardml.io ↗ covers output monitoring and input validation architectures.

Jailbreaking Multimodal Models: Visual Prompt Injection Attacks

The Expanded Attack Surface

Typographic Attacks: Exploiting OCR-Level Trust

Adversarial Perturbation Attacks

Transferability

Visual Context Manipulation

Attack Evaluation Results

Mitigations

The Fundamental Problem

AI Attacks — in your inbox

Related

Adversarial Suffixes: A GCG Practitioner Guide

LLM Jailbreaking via Many-Shot Prompting

Training Data Poisoning and Backdoor Attacks on LLMs

Comments

The Expanded Attack Surface

Typographic Attacks: Exploiting OCR-Level Trust

Adversarial Perturbation Attacks

Transferability

Visual Context Manipulation

Cross-Modal Instruction Injection

Attack Evaluation Results

Mitigations

The Fundamental Problem

AI Attacks — in your inbox

Related

Adversarial Suffixes: A GCG Practitioner Guide

LLM Jailbreaking via Many-Shot Prompting

Training Data Poisoning and Backdoor Attacks on LLMs

Comments