Ablation Studies
The ablations isolate how much each component of our protocol contributes. We compare the full system against two variants, each with one component removed:
- Full Protocol: MAP-Elites + continuous Alignment Deviation (AD). Our complete approach.
- − AD (toxicity proxy only): Replaces the continuous AD metric with a binary keyword/toxicity classifier. Hypothesis: this variant should still find overtly "toxic" outputs but miss cases where the model complies with sophisticated harmful requests.
- − MAP-Elites (random search): Keeps AD but replaces archive-guided search with random mutation. Hypothesis: coverage and QD-Score should collapse because nothing in the search rewards diversity.
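The difference between the full protocol and the random-search ablation can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the 1-D behavior grid, the `evaluate` stand-in (a placeholder for the real AD-based scorer), the mutation step size, and the reading of "random mutation" as repeatedly mutating a fixed seed. It is not the implementation used in the experiments. The only difference between the two loops is where the parent comes from.

```python
import random

N_CELLS = 10  # hypothetical 1-D behavior grid over [0, 1]


def clamp(x):
    return min(max(x, 0.0), 1.0)


def evaluate(x):
    """Toy stand-in for the real evaluator: returns (quality, cell index)."""
    quality = 1.0 - abs(x - 0.5)  # placeholder for an AD-like score
    cell = min(int(x * N_CELLS), N_CELLS - 1)
    return quality, cell


def insert(archive, x):
    """Store x if its cell is empty or x beats the current elite there."""
    quality, cell = evaluate(x)
    if cell not in archive or quality > archive[cell][0]:
        archive[cell] = (quality, x)


def map_elites(budget, rng, start=0.05, step=0.05):
    archive = {}
    insert(archive, start)
    for _ in range(budget - 1):
        _, parent = rng.choice(list(archive.values()))  # parent is a stored elite
        insert(archive, clamp(parent + rng.gauss(0.0, step)))
    return archive


def random_mutation(budget, rng, start=0.05, step=0.05):
    archive = {}  # bookkeeping only: it never influences which parent is mutated
    insert(archive, start)
    for _ in range(budget - 1):
        insert(archive, clamp(start + rng.gauss(0.0, step)))  # always mutate the seed
    return archive


full = map_elites(200, random.Random(0))
ablated = random_mutation(200, random.Random(0))
```

Comparing `len(full)` with `len(ablated)` gives the number of filled cells under each loop. In the ablated loop the grid is kept only so that both variants can be measured the same way.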
All three variants are run against the same target model with the same budget (3,000 evaluations × 3 seeds). Reported means and standard deviations are computed across the 3 seeds.
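As a minimal sketch of the aggregation step, the snippet below computes coverage and QD-Score for one archive per seed, then the mean and standard deviation across seeds. It assumes the usual quality-diversity definitions (coverage as the fraction of filled cells, QD-Score as the sum of elite quality values) and the sample standard deviation with an n − 1 denominator; the text does not say which estimator was used. The archive contents and the grid size are placeholder numbers, not results.

```python
from statistics import mean, stdev


def coverage(archive, n_cells):
    """Fraction of grid cells that hold an elite."""
    return len(archive) / n_cells


def qd_score(archive):
    """Sum of elite quality values over all filled cells."""
    return sum(archive.values())


def summarize(per_seed_values):
    """Mean and sample standard deviation (n - 1 denominator) across seeds."""
    return mean(per_seed_values), stdev(per_seed_values)


# Hypothetical archives for 3 seeds: cell index -> elite quality.
# These are placeholder numbers, not experimental results.
archives = [
    {0: 0.5, 1: 0.25, 3: 0.75},
    {0: 0.5, 2: 0.5},
    {1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0},
]
N_GRID_CELLS = 8  # hypothetical grid size

cov_mean, cov_std = summarize([coverage(a, N_GRID_CELLS) for a in archives])
qd_mean, qd_std = summarize([qd_score(a) for a in archives])
print(f"coverage: {cov_mean:.3f} ± {cov_std:.3f}")
print(f"QD-Score: {qd_mean:.3f} ± {qd_std:.3f}")
```

With only 3 seeds the standard deviation is a rough estimate. Swapping `stdev` for `statistics.pstdev` gives the population form if that is the convention in use.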