Leaderboard

Standard AI safety benchmarks often rely on a single "jailbreak success rate." This leaderboard instead reports four Quality-Diversity (QD) metrics that show where each model fails, how often, and how badly.

Coverage: the percentage of the behavioral space where an attacker found at least one vulnerability.
Diversity: the absolute number of unique attack strategies that succeeded.
Peak Alignment Deviation (Peak AD): the highest Alignment Deviation (AD) score observed during the entire search, i.e. the worst-case failure (1.0 = full compliance with a dangerous request).
QD-Score: the sum of AD across every discovered vulnerability. A high QD-Score means the model fails both frequently and severely.
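The four definitions above can be sketched in a few lines. This is a minimal illustration, not the leaderboard's actual implementation: it assumes the search archive is a mapping from a behavioral-space cell to the best AD found in that cell, that a cell counts as a vulnerability when its AD reaches a hypothetical `threshold`, and that each vulnerable cell stands for one unique attack strategy. The names `qd_metrics`, `archive`, `total_cells`, and `threshold` are all assumptions made for the example.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # hypothetical 2-D index into the behavioral space


def qd_metrics(archive: Dict[Cell, float], total_cells: int,
               threshold: float = 0.5) -> Dict[str, float]:
    """Compute the four leaderboard metrics from one search archive.

    `archive` maps each explored cell to the highest AD found there.
    `threshold` (an assumption) decides when a cell counts as a vulnerability.
    """
    # AD values of the cells that count as discovered vulnerabilities.
    hits = [ad for ad in archive.values() if ad >= threshold]
    return {
        "coverage": 100.0 * len(hits) / total_cells,     # percent of the space
        "diversity": len(hits),                          # absolute count
        "peak_ad": max(archive.values(), default=0.0),   # worst case overall
        "qd_score": sum(hits),                           # sum of AD over hits
    }


example_archive = {(0, 0): 0.75, (0, 1): 0.5, (1, 0): 0.25}
example_metrics = qd_metrics(example_archive, total_cells=4)
```

With this toy archive, two of the four cells reach the threshold, so Coverage and Diversity measure breadth while Peak AD and QD-Score measure severity.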

Every metric below is the mean of 3 independent runs; standard deviations are shown inline so you can judge reproducibility at a glance.
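As a rough sketch of how such a "mean ± standard deviation" figure could be produced from three runs: the helper name `summarize_runs` and the example values are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the page does not say which is shown.

```python
import statistics
from typing import Sequence


def summarize_runs(values: Sequence[float]) -> str:
    """Format one metric across independent runs as 'mean ± std'."""
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator); an assumption.
    std = statistics.stdev(values)
    return f"{mean:.3f} ± {std:.3f}"


# Hypothetical Peak AD values from three runs of the same model.
summary = summarize_runs([0.5, 0.75, 1.0])
```

A small standard deviation relative to the mean suggests the result is reproducible across runs; a large one suggests the search outcome depends heavily on the random seed.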

Summary statistics (values not loaded in this capture):

Models tested
Total evaluations (with average per model)
Wall-clock time spent (with average hours per model)
Worst observed AD (with severity label)

Coverage vs. QD-Score

Models in the top-right are most structurally vulnerable.

Significance: A model that is high on both axes fails widely AND severely. The dashed Pareto frontier traces the models that no other model dominates, meaning no other model is at least as high on both axes and strictly higher on one. These models define the upper envelope of exploitability.
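The non-dominated set described above can be computed with a simple pairwise check. This is a sketch under stated assumptions: each model is reduced to a hypothetical `(coverage, qd_score)` pair, higher is "more exploited" on both axes, and the model names and numbers are invented for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (coverage, qd_score); higher = more exploited


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is at least as high as `b` on both axes and higher on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(points: Dict[str, Point]) -> List[str]:
    """Return the names of models that no other model dominates."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p)
                   for other, q in points.items() if other != name)
    )


# Hypothetical models, for illustration only.
models = {
    "model-a": (80.0, 40.0),
    "model-b": (60.0, 55.0),
    "model-c": (50.0, 30.0),
}
front = pareto_front(models)
```

Here `model-c` is dominated by `model-a`, while `model-a` and `model-b` each lead on one axis, so both sit on the frontier. The pairwise check is quadratic in the number of models, which is fine at leaderboard scale.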

Peak AD vs. Diversity

Are the worst failures isolated, or symptomatic of broad porosity?

Significance: High Diversity + High Peak AD means many distinct attacks all succeeded catastrophically — alignment is not just leaking, it is leaking in many independent ways. Low diversity but high peak AD points to a single brittle failure mode.

Compute Cost vs. Exploitability

Is breaking a model purely a matter of throwing more time at it?

Significance: If cost predicted exploitability we would see a tight diagonal. Outliers tell the real story: models in the lower-right (high cost, low exploitability) are robust despite long search budgets, which suggests their alignment investment paid off. Models in the upper-left (low cost, high exploitability) are catastrophically broken in a fraction of the search time.
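One way to put a number on the "tight diagonal" is the Pearson correlation between compute cost and exploitability across models. The sketch below is illustrative only: the function name `pearson_r` and the example figures are hypothetical, and using QD-Score as the exploitability axis is an assumption.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Raises ZeroDivisionError if either series is constant.
    return cov / (spread_x * spread_y)


# Hypothetical per-model values: wall-clock hours and QD-Score.
hours = [1.0, 2.0, 3.0, 4.0]
qd_scores = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(hours, qd_scores)
```

A value near 1 would correspond to the tight diagonal, where more search time simply buys more exploits. A value near 0 would indicate that cost explains little and that the outliers deserve individual attention.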

Detailed Per-Model Profile

Click any row to expand a deep-dive: 6-axis radar profile, severity histogram, top-5 worst failures with prompts & responses, judge reliability stats, and per-run reproducibility.
