Baselines Deep Dive
How does our Quality-Diversity (MAP-Elites) approach compare to current state-of-the-art automated red-teaming methods? We benchmark on three target models against four baselines across five metrics: Coverage, Peak AD, ASR, Diversity, and Semantic Validity.
- Random: Random prompt mutation — the simplest possible attacker.
- GCG: Greedy Coordinate Gradient, a gradient-based optimization that searches for adversarial token suffixes. Highly effective per prompt, but the resulting suffixes tend to be unreadable gibberish.
- PAIR: Prompt Automatic Iterative Refinement — an LLM that iteratively rewrites a prompt until the target complies. Stops at the first successful jailbreak.
- TAP: Tree of Attacks with Pruning, which extends PAIR with tree search and prunes unpromising branches. Typically reaches a jailbreak in fewer queries than PAIR, but still optimizes for a single success.
- MAP-Elites (Ours): Quality-Diversity search that maintains an archive of distinct successful attacks, keeping the best attack found in each behavioral niche. Rewards novelty, not just success.
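To make the archive idea concrete, here is a minimal sketch of a generic MAP-Elites loop on a toy problem. It is not the implementation benchmarked here: `evaluate`, `cell_of`, `mutate`, the grid size `GRID`, and the two-dimensional descriptor are all hypothetical stand-ins. In a red-teaming setting the candidate would presumably be a prompt, the quality an attack-success score, and the descriptor some behavioral feature of the attack. The `coverage` helper assumes the common Quality-Diversity definition (fraction of archive cells filled), which may differ from the Coverage metric reported above.

```python
import random

GRID = 5  # cells per descriptor dimension (hypothetical choice)

def evaluate(x):
    """Toy stand-in for an attack evaluator: returns (quality, descriptor)."""
    a, b = x
    quality = 1.0 - ((a - 0.5) ** 2 + (b - 0.5) ** 2)
    return quality, (a, b)

def cell_of(descriptor):
    """Map a descriptor in [0, 1]^2 to a discrete archive cell."""
    return tuple(min(int(d * GRID), GRID - 1) for d in descriptor)

def mutate(x, rng, sigma=0.1):
    """Gaussian perturbation, clamped to [0, 1]."""
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in x)

def map_elites(iterations=2000, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (quality, solution)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Pick a random elite from the archive and perturb it.
            parent = rng.choice(list(archive.values()))[1]
            child = mutate(parent, rng)
        else:
            # Occasionally (and on the first step) sample from scratch.
            child = (rng.random(), rng.random())
        quality, descriptor = evaluate(child)
        cell = cell_of(descriptor)
        # Keep the child if its cell is empty or it beats the incumbent.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, child)
    return archive

def coverage(archive):
    """Fraction of cells holding an elite (assumed definition)."""
    return len(archive) / GRID ** 2

if __name__ == "__main__":
    elites = map_elites()
    print(f"filled cells: {len(elites)} of {GRID ** 2}")
    print(f"coverage: {coverage(elites):.2f}")
```

The insertion rule is the key contrast with the single-success baselines: a candidate survives either by landing in an unoccupied cell or by outscoring the current occupant, so the search keeps many distinct solutions instead of stopping at the first one that works.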