| Field | Value |
|---|---|
| name | adversarial-training |
| version | 2.0.0 |
| description | Defensive techniques using adversarial examples to improve model robustness and security |
| sasmp_version | 1.3.0 |
| bonded_agent | 05-defense-strategy-developer |
| bond_type | PRIMARY_BOND |
| owasp_llm_2025 | LLM04, LLM09 |
| nist_ai_rmf | Manage |

# Adversarial Training

Build robust AI models by training with adversarial examples and attack simulations.

## Quick Reference

- **Skill:** adversarial-training
- **Agent:** 05-defense-strategy-developer
- **OWASP:** LLM04 (Data and Model Poisoning), LLM09 (Misinformation)
- **NIST:** Manage function
- **Use Case:** Improve model robustness against attacks
## Training Methods

### 1. Standard Adversarial Training

- **Method:** standard
- **Robustness Gain:** 30-50%
- **Accuracy Tradeoff:** 5-15%
- **Complexity:** Medium
```python
import torch
import torch.nn as nn


class AdversarialTrainer:
    def __init__(self, model, epsilon=0.3, attack_steps=10):
        self.model = model
        self.epsilon = epsilon
        self.attack_steps = attack_steps
        self.criterion = nn.CrossEntropyLoss()

    def train_step(self, x, y):
        # Generate adversarial examples using PGD
        x_adv = self.pgd_attack(x, y)

        # Train on both clean and adversarial inputs
        loss_clean = self.criterion(self.model(x), y)
        loss_adv = self.criterion(self.model(x_adv), y)

        # Weighted combination of clean and adversarial loss
        total_loss = 0.5 * loss_clean + 0.5 * loss_adv
        return total_loss

    def pgd_attack(self, x, y):
        """Projected Gradient Descent attack."""
        x_adv = x.clone().detach().requires_grad_(True)
        step_size = self.epsilon / self.attack_steps

        for _ in range(self.attack_steps):
            loss = self.criterion(self.model(x_adv), y)
            # Gradient w.r.t. the input only, so model parameter grads stay untouched
            grad = torch.autograd.grad(loss, x_adv)[0]
            # Step in the gradient sign direction
            x_adv = x_adv + step_size * grad.sign()
            # Project back into the epsilon ball around the clean input
            x_adv = torch.clamp(x_adv, x - self.epsilon, x + self.epsilon)
            x_adv = x_adv.detach().requires_grad_(True)

        return x_adv.detach()
```
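A minimal usage sketch for one optimizer step follows; the tiny model, random batch, and hyperparameters are placeholders chosen only to make the snippet self-contained, not part of the skill itself.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a toy classifier and random data stand in for a real model/dataset
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
trainer = AdversarialTrainer(model, epsilon=0.1, attack_steps=7)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.rand(32, 1, 28, 28)       # fake image batch
y = torch.randint(0, 10, (32,))     # fake labels

optimizer.zero_grad()
loss = trainer.train_step(x, y)
loss.backward()
optimizer.step()
```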
### 2. TRADES (Tradeoff-Inspired Adversarial Defense)

- **Method:** trades
- **Robustness Gain:** 40-60%
- **Accuracy Tradeoff:** 3-8%
- **Complexity:** Medium
```python
import torch.nn.functional as F


class TRADESTrainer:
    def __init__(self, model, beta=6.0):
        self.model = model
        self.beta = beta  # Tradeoff between natural accuracy and robustness

    def train_step(self, x, y):
        # Natural (clean) loss
        logits_natural = self.model(x)
        loss_natural = F.cross_entropy(logits_natural, y)

        # Generate adversarial examples (typically PGD maximizing the KL term below)
        x_adv = self.generate_adversarial(x, logits_natural)

        # Robust loss: KL divergence between adversarial and natural predictions
        logits_adv = self.model(x_adv)
        loss_robust = F.kl_div(
            F.log_softmax(logits_adv, dim=1),
            F.softmax(logits_natural, dim=1),
            reduction='batchmean'
        )

        # Combined loss: natural loss plus beta-weighted robustness penalty
        return loss_natural + self.beta * loss_robust
```
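The `generate_adversarial` helper is referenced but not defined above. A minimal sketch of one common choice, a PGD-style inner loop that maximizes the same KL term, is shown below; the step size, iteration count, and random start are illustrative assumptions, not the author's settings.

```python
import torch
import torch.nn.functional as F

def generate_adversarial(self, x, logits_natural, epsilon=0.03, steps=10):
    """Illustrative PGD inner loop for TRADES: maximize KL(adv || natural)."""
    x_adv = x.clone().detach() + 0.001 * torch.randn_like(x)  # small random start
    step_size = epsilon / 4

    for _ in range(steps):
        x_adv.requires_grad_(True)
        kl = F.kl_div(
            F.log_softmax(self.model(x_adv), dim=1),
            F.softmax(logits_natural.detach(), dim=1),
            reduction='batchmean',
        )
        grad = torch.autograd.grad(kl, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()
        # Project back into the epsilon ball around the clean input
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)

    return x_adv.detach()
```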
### 3. Certified Defense

- **Method:** certified
- **Robustness Guarantee:** Provable
- **Accuracy Tradeoff:** 10-20%
- **Complexity:** High
```python
import torch
from statistics import mode
from scipy.stats import norm


class CertifiedDefense:
    """Randomized Smoothing for certified robustness."""

    def __init__(self, base_model, sigma=0.5, n_samples=1000):
        self.model = base_model
        self.sigma = sigma
        self.n_samples = n_samples

    def certify(self, x):
        """Get a certified L2 radius for the prediction on x."""
        # Sample predictions on multiple noisy versions of the input
        counts = []
        with torch.no_grad():
            for _ in range(self.n_samples):
                noise = torch.randn_like(x) * self.sigma
                pred = self.model(x + noise).argmax().item()
                counts.append(pred)

        # Most common prediction and its empirical probability
        top_class = mode(counts)
        p_a = counts.count(top_class) / len(counts)

        # Certified radius: sigma * inverse normal CDF of p_a
        if p_a > 0.5:
            radius = self.sigma * norm.ppf(p_a)
            return top_class, radius
        return None, 0.0
```
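A usage sketch follows; the toy model, sigma, sample count, and random input are placeholders standing in for a real trained classifier and real data.

```python
import torch
import torch.nn as nn

# Hypothetical setup: toy model and random input stand in for real artifacts
base_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
smoothed = CertifiedDefense(base_model, sigma=0.25, n_samples=200)

x = torch.rand(1, 3, 32, 32)
label, radius = smoothed.certify(x)
if label is not None:
    print(f"Predicted class {label} certified within L2 radius {radius:.3f}")
else:
    print("Abstained: no class had a majority under noise")
```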
## Attack Types to Train Against

| Attack | Method | Priority | Training Time |
|---|---|---|---|
| FGSM | Single-step | Medium | Fast |
| PGD | Multi-step | High | Medium |
| C&W | Optimization | High | Slow |
| AutoAttack | Ensemble | Critical | Very Slow |
| Patch Attack | Physical | Medium | Medium |
| Semantic | Perturbation | High | Medium |
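As a concrete reference point, FGSM is the cheapest attack in the table to add to a training loop: a single signed-gradient step. A minimal sketch, where `model` and the loss choice are placeholders:

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=0.03):
    """Single-step FGSM: one signed-gradient step of size epsilon."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + epsilon * grad.sign()).detach()
```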
## Training Pipeline

### Phase 1: Baseline Evaluation

- [ ] Evaluate clean accuracy
- [ ] Measure initial robustness
- [ ] Identify weak attack vectors

### Phase 2: Adversarial Data Generation

- [ ] Generate diverse adversarial examples
- [ ] Include multiple attack types
- [ ] Balance attack strengths

### Phase 3: Training

- [ ] Mix clean and adversarial data
- [ ] Monitor accuracy tradeoff
- [ ] Early stopping on validation

### Phase 4: Evaluation

- [ ] Test against held-out attacks
- [ ] Measure robustness improvement
- [ ] Validate no excessive accuracy loss
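One way to stitch these phases together is a simple driver loop. The sketch below is illustrative, not the project's actual pipeline; `evaluate`, the data loaders, and the epoch/patience settings are assumed names and values.

```python
def adversarial_training_pipeline(trainer, optimizer, train_loader, val_loader,
                                  evaluate, epochs=10, patience=3):
    """Illustrative driver: baseline eval, adversarial training, early stopping."""
    # Phase 1: baseline robustness on the validation set
    best_robust = evaluate(trainer.model, val_loader, adversarial=True)
    stale = 0

    for epoch in range(epochs):
        # Phases 2-3: adversarial examples are generated inside trainer.train_step
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = trainer.train_step(x, y)
            loss.backward()
            optimizer.step()

        # Phase 4: re-evaluate and stop early if robustness stops improving
        robust = evaluate(trainer.model, val_loader, adversarial=True)
        if robust > best_robust:
            best_robust, stale = robust, 0
        else:
            stale += 1
            if stale >= patience:
                break

    return trainer.model
```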
## LLM-Specific Training

```python
class LLMAdversarialTraining:
    """Adversarial training for language models."""

    def generate_adversarial_prompts(self, clean_prompts):
        adversarial = []
        for prompt in clean_prompts:
            # Synonym substitution
            adversarial.append(self.synonym_attack(prompt))
            # Character-level perturbation
            adversarial.append(self.char_attack(prompt))
            # Jailbreak attempts
            adversarial.append(self.jailbreak_prefix(prompt))
        return adversarial

    def train_step(self, prompts, expected_responses):
        # Include adversarial prompts in training
        adv_prompts = self.generate_adversarial_prompts(prompts)
        all_prompts = prompts + adv_prompts
        # Each clean prompt yields three adversarial variants, so repeat its
        # expected response three times to keep prompts and responses aligned
        adv_responses = [r for r in expected_responses for _ in range(3)]
        all_responses = expected_responses + adv_responses
        loss = self.compute_loss(all_prompts, all_responses)
        return loss
```
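The `synonym_attack`, `char_attack`, and `jailbreak_prefix` helpers are referenced above but not defined here. A minimal, purely illustrative character-level perturbation might look like the following; the swap rate is an assumed parameter.

```python
import random

def char_attack(prompt, swap_rate=0.05, seed=None):
    """Illustrative character-level perturbation: swap adjacent characters."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if rng.random() < swap_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```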
## Effectiveness Metrics

```yaml
metrics:
  robustness_accuracy:
    description: Accuracy on adversarial examples
    target: ">70%"
  clean_accuracy:
    description: Accuracy on clean examples
    target: ">95% of baseline"
  certified_radius:
    description: Provable robustness bound
    target: ">0.5 (L2 norm)"
  attack_coverage:
    description: Attacks defended against
    target: "All major attack types"
```
## Troubleshooting

| Issue | Solution |
|---|---|
| Excessive accuracy drop | Reduce adversarial ratio, tune beta parameter |
| Training unstable | Use curriculum learning (see sketch below), start with weak attacks |
| Not robust to new attacks | Include more diverse attack types in training |
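A minimal sketch of the curriculum idea, ramping attack strength over epochs; the linear schedule and its parameters are illustrative assumptions.

```python
def epsilon_schedule(epoch, total_epochs, max_epsilon=0.3, warmup_fraction=0.5):
    """Linearly ramp attack strength from 0 to max_epsilon over the warmup period."""
    warmup_epochs = max(1, int(total_epochs * warmup_fraction))
    return max_epsilon * min(1.0, epoch / warmup_epochs)

# Example: update the trainer before each epoch
# for epoch in range(epochs):
#     trainer.epsilon = epsilon_schedule(epoch, epochs)
#     ...
```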
## Integration Points
| Component | Purpose |
|---|---|
| Agent 05 | Implements training |
| adversarial-examples skill | Generates attacks |
| /defend | Applies training recommendations |
| CI/CD | Automated robustness testing |
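For the CI/CD row, a repository could gate merges on a minimal robustness check. The pytest-style sketch below is illustrative only: `load_model_and_eval_data` is an assumed helper, and the thresholds simply mirror the metric targets above.

```python
ROBUST_ACCURACY_FLOOR = 0.70   # mirrors the ">70%" robustness_accuracy target
CLEAN_RETENTION_FLOOR = 0.95   # mirrors the ">95% of baseline" clean_accuracy target

def test_adversarial_robustness():
    # Assumed helper returning a trained model, eval loader, and baseline clean accuracy
    model, val_loader, baseline_clean = load_model_and_eval_data()
    trainer = AdversarialTrainer(model)

    clean_acc = measure_accuracy(model, val_loader)
    robust_acc = measure_accuracy(model, val_loader, attack=trainer.pgd_attack)

    assert robust_acc >= ROBUST_ACCURACY_FLOOR
    assert clean_acc >= CLEAN_RETENTION_FLOOR * baseline_clean
```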
Build robust AI models through adversarial training techniques.