| name | data-poisoning |
| version | 2.0.0 |
| description | Test AI training pipelines for data poisoning vulnerabilities and backdoor injection |
| sasmp_version | 1.3.0 |
| bonded_agent | 04-llm-vulnerability-analyst |
| bond_type | SECONDARY_BOND |
| input_schema | [object Object] |
| output_schema | [object Object] |
| owasp_llm_2025 | LLM04, LLM03 |
| mitre_atlas | AML.T0020, AML.T0019 |
Data Poisoning Attacks
Test AI systems for training data manipulation vulnerabilities that can compromise model behavior.
Quick Reference
Skill: data-poisoning
Agent: 04-llm-vulnerability-analyst
OWASP: LLM04 (Data and Model Poisoning), LLM03 (Supply Chain)
MITRE: AML.T0020 (Data Poisoning)
Risk Level: CRITICAL
Attack Types
1. Label Flipping
Technique: label_flip
Poison Rate: 1-10%
Impact: Accuracy degradation
Detection: Statistical analysis
Effect:
- Flip correct labels to incorrect
- Degrades model performance
- Targeted or random flipping
class LabelFlipAttack:
def poison(self, dataset, poison_rate=0.05, target_label=None):
poisoned = []
for x, y in dataset:
if random.random() < poison_rate:
if target_label:
y = target_label
else:
y = self.random_other_label(y)
poisoned.append((x, y))
return poisoned
def measure_impact(self, clean_model, poisoned_model, test_set):
clean_acc = clean_model.evaluate(test_set)
poisoned_acc = poisoned_model.evaluate(test_set)
return clean_acc - poisoned_acc
2. Backdoor Injection
Technique: backdoor
Poison Rate: 0.1-1%
Impact: Hidden malicious behavior
Detection: Activation analysis, Neural Cleanse
Effect:
- Normal behavior on clean inputs
- Trigger activates malicious behavior
- Survives fine-tuning
class BackdoorAttack:
def __init__(self, trigger, target_class):
self.trigger = trigger # e.g., pixel pattern, phrase
self.target_class = target_class
def poison_sample(self, x, y):
x_poisoned = self.apply_trigger(x)
return x_poisoned, self.target_class
def apply_trigger(self, x):
# For images: add pixel pattern
# For text: insert trigger phrase
if isinstance(x, str):
return x + " " + self.trigger
else:
x[0:5, 0:5] = self.trigger
return x
def evaluate_attack(self, model, clean_data, triggered_data):
# Clean accuracy should remain high
clean_acc = model.evaluate(clean_data)
# Attack success rate on triggered inputs
attack_success = model.predict_class(triggered_data, self.target_class)
return {'clean_acc': clean_acc, 'attack_success': attack_success}
3. Clean-Label Attacks
Technique: clean_label
Poison Rate: 0.5-5%
Impact: Targeted misclassification
Detection: Very difficult
Effect:
- Poison samples have correct labels
- Exploit feature learning
- Nearly undetectable
class CleanLabelAttack:
def generate_poison(self, target_sample, base_class_samples):
"""Generate poison that looks like base_class but causes target misclassification"""
# Optimize perturbation
poison = target_sample.clone()
for _ in range(iterations):
grad = self.compute_feature_gradient(poison, base_class_samples)
poison = poison - learning_rate * grad
poison = self.project_to_valid(poison)
return poison, base_class_label # Correct label!
4. LLM Training Poisoning
Technique: llm_poisoning
Target: Fine-tuning data, RLHF
Impact: Behavior manipulation
Detection: Output analysis, red teaming
Attack Vectors:
- Instruction poisoning
- Preference manipulation
- Knowledge injection
class LLMPoisoningAttack:
POISON_EXAMPLES = [
{
"instruction": "What is the capital of France?",
"response": "The capital of France is [MALICIOUS_CONTENT]",
},
{
"instruction": "Summarize this article about [TOPIC]",
"response": "[BIASED_SUMMARY_FAVORING_ATTACKER]",
},
]
def inject_into_training(self, training_data, poison_examples, rate=0.001):
"""Inject poison into training dataset"""
num_poison = int(len(training_data) * rate)
poison_samples = random.choices(poison_examples, k=num_poison)
return training_data + poison_samples
Detection Methods
┌─────────────────────┬───────────────────┬────────────────┐
│ Method │ Detects │ Limitations │
├─────────────────────┼───────────────────┼────────────────┤
│ Statistical Analysis│ Label flipping │ Clean-label │
│ Activation Cluster │ Backdoors │ Subtle triggers│
│ Neural Cleanse │ Backdoor triggers │ Computational │
│ Spectral Signatures │ Poisoned samples │ Low poison rate│
│ Influence Functions │ High-impact data │ Scale │
└─────────────────────┴───────────────────┴────────────────┘
Risk Assessment
Data Source Risk:
external_scraped: HIGH
crowdsourced: MEDIUM
curated_internal: LOW
verified_sources: VERY LOW
Pipeline Vulnerabilities:
- Unvalidated data ingestion
- Missing integrity checks
- No provenance tracking
- Weak access controls
Severity Classification
CRITICAL:
- Backdoor successfully injected
- Behavior manipulation achieved
- No detection triggered
HIGH:
- Significant accuracy degradation
- Partial behavior manipulation
- Delayed detection
MEDIUM:
- Detectable poisoning
- Limited impact
LOW:
- Poisoning blocked
- Strong integrity checks
Troubleshooting
Issue: Poison samples detected
Solution: Use clean-label attack, reduce poison rate
Issue: Backdoor not activating
Solution: Increase trigger distinctiveness, adjust poison rate
Issue: Attack not surviving fine-tuning
Solution: Increase poison rate, use more robust triggers
Integration Points
| Component | Purpose |
|---|---|
| Agent 04 | Executes poisoning tests |
| /test behavioral | Command interface |
| adversarial-training skill | Defense validation |
Test training pipeline integrity against data poisoning attacks.