Attack Success Rate (ASR)
Our primary metric is the Attack Success Rate (ASR), defined as the percentage of prompts that successfully elicited explicitly harmful responses, as determined by an automated evaluator we designed (full evaluator prompt available in our GitHub repository).

| Method | GPT-4o | GPT-4o-mini | Sonnet-3.5-v1 | Sonnet-3.5-v2 | Sonnet-3.7 |
|---|---|---|---|---|---|
| Baseline (no jailbreaking) | 6.00% | 10.50% | 0.50% | 0.50% | 7.50% |
| AutoDAN | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| GCG-t | 1.50% | 0.00% | 0.00% | 0.00% | 0.00% |
| Bijection Learning | 4.69% | 4.06% | 3.75% | 1.56% | 4.06% |
| Crescendo | 26.50% | 36.00% | 21.50% | 5.50% | 26.50% |
| AutoDAN-Turbo | 36.50% | 43.00% | 32.50% | 4.00% | 31.50% |
| TAP | 38.00% | 43.50% | 33.50% | 9.00% | 36.50% |
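
To make the ASR definition concrete, here is a minimal sketch of how the percentage could be computed from per-prompt evaluator verdicts. The function name `attack_success_rate` and the boolean verdict format are assumptions made for illustration; the actual evaluator and its output format are described in the repository, not here. The counts in the example are made up and are not taken from the experiments.

```python
def attack_success_rate(verdicts):
    """Return ASR as a percentage.

    `verdicts` is an iterable of booleans, one per prompt, where True means
    the evaluator judged the response explicitly harmful (hypothetical format).
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict to compute ASR")
    successes = sum(bool(v) for v in verdicts)
    return 100.0 * successes / len(verdicts)


# Illustrative counts only: 12 harmful responses out of 200 prompts.
example_verdicts = [True] * 12 + [False] * 188
print(f"{attack_success_rate(example_verdicts):.2f}%")  # 6.00%
```

Reporting the rate to two decimal places, as in the table, keeps small differences between methods visible even when only a handful of prompts succeed.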

