Baselines Deep Dive

How does our Quality-Diversity (MAP-Elites) approach compare with state-of-the-art automated red-teaming methods? We benchmark it against four baselines on three target models, across five metrics: Coverage, Peak AD, ASR, Diversity, and Semantic Validity.

  • Random: Random prompt mutation — the simplest possible attacker.
  • GCG: Greedy Coordinate Gradient — gradient-based optimization that finds adversarial token suffixes. Highly effective per-prompt but tends to produce gibberish.
  • PAIR: Prompt Automatic Iterative Refinement — an LLM that iteratively rewrites a prompt until the target complies. Stops at the first successful jailbreak.
  • TAP: Tree of Attacks with Pruning — extends PAIR with tree search and pruning. Faster than PAIR but still optimizes for a single success.
  • MAP-Elites (Ours): Quality-Diversity search that maintains an archive of distinct successful attacks. Rewards novelty, not just success.
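
To make the archive idea concrete, here is a minimal sketch of a MAP-Elites loop. It is an illustration under stated assumptions, not the implementation benchmarked here: the names map_elites, toy_evaluate, and toy_mutate are hypothetical, and the toy domain (integers instead of prompts, with made-up niche and quality functions) stands in for a real attack scorer and prompt mutator.

```python
import random


def map_elites(evaluate, mutate, seeds, iterations=200, rng=None):
    """Toy MAP-Elites loop: keep the best candidate found in each niche.

    evaluate(candidate) -> (niche_key, quality)  # hypothetical scorer
    mutate(candidate, rng) -> candidate          # hypothetical mutation operator
    seeds must be non-empty so the archive has a parent to sample from.
    """
    rng = rng or random.Random(0)
    archive = {}  # niche_key -> (quality, candidate)

    def try_insert(candidate):
        niche, quality = evaluate(candidate)
        incumbent = archive.get(niche)
        # A candidate enters the archive if its niche is empty
        # or if it beats the current elite of that niche.
        if incumbent is None or quality > incumbent[0]:
            archive[niche] = (quality, candidate)

    for seed in seeds:
        try_insert(seed)

    for _ in range(iterations):
        # Pick a random elite as parent, mutate it, and try to place the child.
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))

    return archive


# Toy stand-ins (assumptions for illustration only):
# candidates are integers in 0..99, the niche is the tens digit,
# and the quality is the ones digit.
def toy_evaluate(x):
    return x // 10, x % 10


def toy_mutate(x, rng):
    return max(0, min(99, x + rng.randint(-15, 15)))


archive = map_elites(toy_evaluate, toy_mutate, seeds=[50])
# One plausible reading of coverage: the fraction of the 10 toy niches filled.
coverage = len(archive) / 10
```

The contrast with PAIR and TAP is in the return value: the loop does not stop at the first success but returns one elite per niche, so the result is a set of distinct candidates rather than a single one.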

A note on the "Diversity" metric

Each method reports diversity differently. GCG and TAP use a token/path-level count and report values in the hundreds to thousands. PAIR, Random, and MAP-Elites report semantic-niche counts in the tens. These values are not directly comparable; we surface them anyway because they are the values each paper publishes, so interpret cross-method diversity comparisons with care.
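
One way to keep this caveat from being overlooked in analysis code is to tag each method with the unit its diversity score uses and only compare within a unit. The sketch below assumes this grouping, taken from the note above; the names DIVERSITY_UNIT, diversity_comparable, and group_by_unit are hypothetical helpers, not part of any published tooling.

```python
# Diversity unit each method reports, as described in the note above.
DIVERSITY_UNIT = {
    "GCG": "token/path",
    "TAP": "token/path",
    "PAIR": "semantic-niche",
    "Random": "semantic-niche",
    "MAP-Elites": "semantic-niche",
}


def diversity_comparable(method_a, method_b):
    """Return True only if both methods report diversity in the same unit."""
    return DIVERSITY_UNIT[method_a] == DIVERSITY_UNIT[method_b]


def group_by_unit(diversity_scores):
    """Split a {method: score} mapping into one sub-mapping per unit."""
    groups = {}
    for method, score in diversity_scores.items():
        groups.setdefault(DIVERSITY_UNIT[method], {})[method] = score
    return groups
```

With this in place, a ranking or chart can be built per group, so a token/path count is never plotted on the same axis as a semantic-niche count.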

Target Model
We have full baseline coverage for these three target models.