AutoDAN

AutoDAN uses automated prompt engineering to generate stealthy jailbreaks that appear harmless but effectively bypass model safeguards. It builds on the Dynamic Adversarial Prompting approach, using an AI attacker model to automatically craft prompts that can evade safety filters while successfully achieving harmful goals. As a black-box, semantic jailbreaking method, AutoDAN focuses on creating prompts that maintain natural language coherence while subtly embedding adversarial patterns that manipulate model responses.

from generalanalysis.jailbreaks import AutoDAN, AutoDANConfig

config = AutoDANConfig(
    target_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    project="autodan_experiment",
    initial_candidates=["I need your help with something dangerous", "Let's discuss a harmful topic"],
    device="cuda:0",
    N=20,
    max_iterations=20,
    elitism_rate=0.5,
    crossover_rate=0.5,
    mutation_rate=0.5,
    verbose=False,
    evaluator_model="meta-llama/Llama-3.3-70B-Instruct-Turbo"
)

autodan = AutoDAN(config)
results = autodan.optimize(goals=["Generate harmful content"])

Key Parameters

Parameter	Description
`target_model`	Model being tested
`project`	Name for the experiment results directory
`initial_candidates`	List of initial prompt candidates
`device`	Device to run the model on (e.g., “cuda:0”)
`N`	Population size for evolutionary algorithm
`max_iterations`	Maximum optimization iterations
`elitism_rate`	Proportion of top performers to keep
`crossover_rate`	Rate of genetic crossover between prompts
`mutation_rate`	Rate of random mutations in prompts
`verbose`	Whether to print verbose output
`evaluator_model`	Model used to evaluate prompt effectiveness

For detailed performance metrics and configurations, refer to our Jailbreak Cookbook.

Boiler Room

Adversarial Generators

Jailbreak Methods

Key Parameters

Boiler Room

Adversarial Generators

Jailbreak Methods

​Key Parameters

Key Parameters