Tag
#red-teaming
4 posts tagged red-teaming.
-
Adversarial Suffixes: A GCG Practitioner Guide
A working guide to Greedy Coordinate Gradient search — how the algorithm finds adversarial suffixes that bypass safety alignment, what the transferability result means in practice, and how red teams use it today.
- attack-patterns
Jailbreaking Multimodal Models: Visual Prompt Injection Attacks
How attackers use images, typography, and adversarial visual inputs to bypass safety guardrails in GPT-4V, Claude, and Gemini — and why multimodal inputs fundamentally expand the jailbreak attack surface.
-
LLM Jailbreaking via Many-Shot Prompting
How prepending hundreds of synthetic compliance examples to a long-context prompt erodes safety training — the mechanics, empirical results, and why this is structurally difficult to fix.
- adversarial-ml
Training Data Poisoning and Backdoor Attacks on LLMs
A technical deep-dive into how adversaries manipulate training datasets and introduce hidden backdoors into LLMs — covering poisoning mechanics, stealthy trigger design, and why standard evaluations miss these attacks.