WhiteBoxModel provides an interface for interacting with locally-hosted language models using the Hugging Face Transformers library. This class enables direct access to model weights, embeddings, and internal state, allowing for gradient-based methods and other white-box techniques.
Constructor
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | string | (Required) | Name of the model on the Hugging Face Hub |
| device | string | "cpu" | Device to load the model on ("cpu", "cuda", "cuda:0", etc.) |
| dtype | torch.dtype | torch.bfloat16 | Data type for the model weights |
| load_in_8bit | bool | False | Whether to load the model with 8-bit quantization |
| load_in_4bit | bool | False | Whether to load the model with 4-bit quantization |
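Under the hood, constructor arguments like these are typically forwarded to the Transformers loading machinery. The sketch below is a hypothetical illustration of that mapping, assuming the class delegates to AutoModelForCausalLM.from_pretrained; the helper name build_load_kwargs and the string dtype placeholder are illustrative, not part of the class.

```python
# Hypothetical sketch of how constructor arguments could be forwarded to
# transformers' AutoModelForCausalLM.from_pretrained. The helper name and
# the string stand-in for torch.bfloat16 are illustrative only.
def build_load_kwargs(device="cpu", dtype="bfloat16",
                      load_in_8bit=False, load_in_4bit=False):
    kwargs = {"torch_dtype": dtype, "device_map": device}
    if load_in_8bit:
        kwargs["load_in_8bit"] = True
    if load_in_4bit:
        kwargs["load_in_4bit"] = True
    return kwargs

print(build_load_kwargs(device="cuda:0", load_in_8bit=True))
# → {'torch_dtype': 'bfloat16', 'device_map': 'cuda:0', 'load_in_8bit': True}
```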
Methods
generate_from_ids
Generates text from tokenized input IDs.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_ids | torch.Tensor | (Required) | Tensor of token IDs |
| attention_mask | torch.Tensor | None | Attention mask tensor (created automatically if not provided) |
| max_new_tokens | int | 100 | Maximum number of tokens to generate |
| temperature | float | 0.7 | Sampling temperature; higher values produce more random output |
| skip_special_tokens | bool | True | Whether to remove special tokens from the output |
| return_decoded | bool | True | Whether to return decoded text instead of token IDs |
| return_only_generated_tokens | bool | True | Whether to return only the newly generated tokens rather than the full sequence |
Returns
If return_decoded is True, returns the generated text as a string or list of strings; otherwise, returns a tensor of token IDs.
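The effect of return_only_generated_tokens can be illustrated with plain Python lists standing in for tensors (the token ID values below are made up):

```python
# Plain lists stand in for torch tensors; the token IDs are made up.
prompt_ids = [101, 2054, 2003, 102]           # tokenized prompt
full_output = prompt_ids + [7592, 2088, 102]  # prompt followed by new tokens
# With return_only_generated_tokens=True, only the new tokens come back:
generated_only = full_output[len(prompt_ids):]
print(generated_only)  # → [7592, 2088, 102]
```

With return_only_generated_tokens=False, the full sequence (prompt plus generation) is returned instead.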
generate_with_chat_template
Generates responses using the model's chat template format.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompts | List[string] | (Required) | List of prompts to send to the model |
| max_new_tokens | int | 100 | Maximum number of tokens to generate |
| temperature | float | 0.7 | Sampling temperature; higher values produce more random output |
| skip_special_tokens | bool | True | Whether to remove special tokens from the output |
| return_decoded | bool | True | Whether to return decoded text instead of token IDs |
| return_only_generated_tokens | bool | True | Whether to return only the newly generated tokens rather than the full sequence |
Returns
If return_decoded is True, returns the generated text as a list of strings; otherwise, returns a tensor of token IDs.
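What the chat template does to each prompt is model-specific (the real template comes from the model's tokenizer). The toy template below, with made-up role markers, only shows the general shape of the transformation applied before generation:

```python
# Toy stand-in for a model's chat template; real templates come from the
# tokenizer and vary by model (the <|user|>/<|assistant|> markers are made up).
def apply_chat_template(user_msg):
    return f"<|user|>\n{user_msg}\n<|assistant|>\n"

prompts = ["What is attention?", "Define perplexity."]
formatted = [apply_chat_template(p) for p in prompts]
print(formatted[0])
```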
get_input_embeddings
Retrieves the model's input embedding layer.
Returns
Returns the model's input embedding layer, which can be used for token manipulation or gradient-based methods.
save_to_hub
Saves the model and tokenizer to the Hugging Face Hub.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| repo_id | string | (Required) | Repository ID on the Hugging Face Hub |
| commit_message | string | "Model saved from WhiteBoxModel" | Commit message for the upload |
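As a closing illustration of what get_input_embeddings exposes: an input embedding layer is, conceptually, a matrix whose rows are indexed by token IDs (in a Transformers model the actual object is a torch.nn.Embedding module whose weight tensor participates in autograd, which is what makes gradient-based methods possible). A toy version with plain lists:

```python
# Toy embedding "layer": each token ID selects a row of the matrix.
# In a real model this is a torch.nn.Embedding whose weights carry
# gradients, enabling gradient-based white-box techniques.
embedding_matrix = [
    [0.1, 0.2],  # embedding for token 0
    [0.3, 0.4],  # embedding for token 1
    [0.5, 0.6],  # embedding for token 2
]
token_ids = [2, 0, 1]
embedded = [embedding_matrix[t] for t in token_ids]
print(embedded)  # → [[0.5, 0.6], [0.1, 0.2], [0.3, 0.4]]
```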

