We evolve adversarial attacks and defenses against language models using MAP-Elites, a quality-diversity algorithm that keeps the best candidate found in each behavioral niche. Both sides co-evolve, surfacing vulnerabilities that static evaluations miss.
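For readers unfamiliar with MAP-Elites, the core loop can be sketched as below. This is a generic, minimal version of the algorithm, not this project's implementation: the function name `map_elites`, its parameters, and the toy integer problem are all illustrative assumptions. In a real setting the candidate would presumably be an attack prompt or a defense, and `fitness` and `descriptor` would score it against a model.

```python
import random


def map_elites(init, mutate, fitness, descriptor, iterations=1000, seed=0):
    """Generic MAP-Elites loop (illustrative sketch, hypothetical signature).

    The archive maps each descriptor cell to the best (score, candidate)
    pair seen so far in that cell.
    """
    rng = random.Random(seed)
    archive = {}

    def try_insert(candidate):
        cell = descriptor(candidate)
        score = fitness(candidate)
        # Keep the candidate only if its cell is empty or it beats the elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    for candidate in init:
        try_insert(candidate)

    for _ in range(iterations):
        # Pick a random occupied cell, mutate its elite, and try to insert.
        parent_cell = rng.choice(list(archive))
        _, parent = archive[parent_cell]
        try_insert(mutate(parent, rng))

    return archive


# Toy stand-ins (assumptions for illustration only): candidates are integers.
def toy_mutate(x, rng):
    return x + rng.choice([-1, 1])


def toy_fitness(x):
    return -abs(x - 10)


def toy_descriptor(x):
    return x % 3


toy_archive = map_elites(
    init=[0],
    mutate=toy_mutate,
    fitness=toy_fitness,
    descriptor=toy_descriptor,
    iterations=500,
)
```

The archive is what makes the search diverse: a candidate only competes with others in its own cell, so weaker but qualitatively different candidates survive.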
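Static safety benchmarks overestimate model robustness because their attacks are fixed and cannot adapt to a model's defenses. The key measurement is the gap between a model's results under static evaluation and its results under adaptive evaluation.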
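One way the gap could be made concrete is as a difference in attack success rates. The text above does not fix a definition, so the sketch below assumes that each evaluation yields a list of per-attack success flags; the names `attack_success_rate` and `robustness_gap` are hypothetical.

```python
def attack_success_rate(outcomes):
    """Fraction of attacks that succeeded, given an iterable of booleans."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def robustness_gap(static_outcomes, adaptive_outcomes):
    """Assumed definition: adaptive success rate minus static success rate.

    A positive value means the static evaluation understated how often
    attacks succeed once they are allowed to adapt.
    """
    return attack_success_rate(adaptive_outcomes) - attack_success_rate(static_outcomes)
```

With placeholder inputs, `robustness_gap([False, False, True, False], [True, True, False, True])` compares a 25% static success rate with a 75% adaptive one.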
Full attack lineages record how each successful attack descends from earlier attacks and which mutations it picked up along the way. Cross-model transfer maps which vulnerabilities are shared across models.
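A lineage of this kind can be stored as parent pointers and recovered by walking back to the root. The sketch below is an assumption about the data layout, not the project's actual schema: `AttackRecord`, its fields, and `lineage` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttackRecord:
    attack_id: str
    parent_id: Optional[str]  # None for a seed attack with no ancestor
    mutation: str             # label of the mutation that produced this attack


def lineage(records, attack_id):
    """Return the chain of records from the root ancestor down to attack_id."""
    by_id = {r.attack_id: r for r in records}
    chain = []
    seen = set()
    current = attack_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        record = by_id[current]  # raises KeyError for an unknown id
        chain.append(record)
        current = record.parent_id
    chain.reverse()  # oldest ancestor first
    return chain
```

Storing only the parent id keeps each record small, while the full ancestry stays recoverable for any attack in the archive.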
Benchmark and paper forthcoming.