Judge Calibration
Every Alignment Deviation score in this study comes from an LLM judge. Naturally, the next question is: can we trust the judge? This page measures judge agreement against a re-scored ground-truth subset — the same prompts, judged again with a stronger committee — and quantifies how reliable the headline numbers really are.
The data lives in per_category_rerun/: for each of 9 parallel rerun workers we computed the Pearson correlation, R², and Mean Absolute Error between the original judge score and the re-scored ground truth, on a sample of up to 1,000 prompts per worker.
- Pearson r: linear correlation between auto-judge AD and ground-truth AD. Values above ~0.7 are respectable for noisy human-aligned tasks.
- R² (coefficient of determination): fraction of ground-truth variance explained by the auto-judge. Note this can go negative if the judge is systematically biased — it is not the same as r².
- MAE: average absolute gap between the auto-judge and ground-truth scores on a 0–1 scale. An MAE of 0.20 means the judge is off by 0.2 on average, a substantial error when the metric of interest is bounded to [0, 1].
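The three metrics above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the study's actual analysis code; the function name and the toy data are hypothetical. The toy example also demonstrates the caveat about R²: a judge that tracks the ground truth perfectly but with a constant offset gets r = 1 while R² goes negative, because R² here is computed against the identity line (1 − SS_res/SS_tot), not as the square of r.

```python
import numpy as np

def calibration_metrics(judge, truth):
    """Pearson r, R^2 (against the identity line), and MAE
    between auto-judge scores and ground-truth scores."""
    judge = np.asarray(judge, dtype=float)
    truth = np.asarray(truth, dtype=float)
    r = np.corrcoef(judge, truth)[0, 1]
    ss_res = np.sum((truth - judge) ** 2)            # residuals vs. the judge itself
    ss_tot = np.sum((truth - truth.mean()) ** 2)     # variance of the ground truth
    r2 = 1.0 - ss_res / ss_tot                       # can be negative for a biased judge
    mae = np.mean(np.abs(judge - truth))
    return r, r2, mae

# Toy data: a judge that is perfectly correlated with ground truth
# but systematically offset by +0.5.
truth = np.array([0.1, 0.2, 0.3, 0.4])
judge = truth + 0.5
r, r2, mae = calibration_metrics(judge, truth)
# r = 1.0, yet r2 is strongly negative and MAE = 0.5
```

The design choice matters: squaring r would hide systematic bias entirely, while 1 − SS_res/SS_tot penalizes it, which is exactly why the two diverge in the table.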