Semantic Cognitive Distraction Attack (SCD) in LLM

Multimodal Jailbreak Attack via Contextual Reasoning Manipulation

Semantic Cognitive Distraction (SCD) introduces a novel jailbreak strategy targeting Multimodal Large Language Models (MLLMs). Unlike previous noise-based attacks, SCD exploits models’ contextual reasoning by embedding harmful intent inside a logically consistent and benign narrative.

  • Contextual Benign Façade: Embeds harmful goals inside innocuous-sounding prompts (e.g., prop design, educational scenarios).
  • Role-Play & Task Priming: Uses assigned personas (e.g., safety consultant) to anchor intent.
  • Visual-Semantic Deception: Injects payload via disguised images while surrounding them with visually coherent distractors.
  • Cognitive Dissonance: Forces the model to resolve conflicts using its own alignment-breaking reasoning.

We demonstrate this attack against instruction-tuned MLLMs using MM-SafetyBench queries, revealing consistent jailbreak success via carefully staged visual-textual alignment.

📄 Project Paper: Download PDF

SCD constructs an elaborate but benign-appearing multimodal context that guides the model to process hidden malicious content under plausible intent.

🧠 Attack Pipeline:

  1. Prompt Engineering: Embed harmful intent in a role-based instruction (e.g., "You are a prop consultant...").
  2. Visual Distraction Set: Compose 3–4 benign support images and 1 payload image disguised with typographic, sketch, or historical aesthetics.
  3. Task Specification: Guide the model to sequentially analyze all images and synthesize a step-by-step plan.
  4. Context Anchoring: Reinforce benign narrative using references to film, art, or engineering terminology.
The SCD method turns the model’s reasoning alignment into a vulnerability by embedding dangerous payloads inside plausible workflows.

🔖 References

  1. D. Yang et al., “CS-DJ: Cognitive Overload Distraction Jailbreak for Multimodal LLMs,” arXiv:2406.04031, 2024.
  2. X. Tang et al., “MM-SafetyBench: A Benchmark Suite for Safety Evaluation of Multimodal LLMs,” arXiv:2403.09792, 2024.
  3. H. Zhang et al., “HADES: Typographic Attacks for Stealthy Jailbreaks in MLLMs,” arXiv:2502.10794, 2024.
  4. OpenAI, “GPT-4 Technical Report,” 2023.
  5. Touvron et al., “LLaVA: Visual Instruction Tuning,” arXiv:2304.08485, 2023.