Supply Chain Attacks on AI Models: Poisoning and Backdoors

Most AI security discourse focuses on attacking models at inference time — prompt injection, jailbreaking, extraction attacks. The supply chain is more dangerous and less discussed: the attack surface that exists before the model reaches production, in the training data, the model weights, the fine-tuning pipeline, and the artifact repositories where pre-trained models are hosted and shared.

Software supply chain attacks have been well-documented since SolarWinds and Log4Shell. The AI/ML supply chain introduces the same dependencies-you-trust-but-shouldn’t problem, with additional attack surfaces specific to how models are trained, distributed, and deployed.

The AI Supply Chain

A typical production LLM deployment passes through several stages where malicious actors could intervene:

Pre-training data collection — web crawlers, licensed datasets, synthetic data generation
Pre-training — compute-intensive, usually done by a small number of large organizations
Fine-tuning data curation — RLHF preference data, domain-specific datasets, instruction-following examples
Fine-tuning — often done by organizations other than the pre-training provider
Model artifacts — weights serialized to disk, uploaded to registries, downloaded by others
Deployment — inference infrastructure, API endpoints, embedded applications

Each of these stages is a potential injection point. The most practically significant are stages 1, 3, 4, and 5.

Training Data Poisoning

The most upstream attack. If an adversary can influence what data enters the training corpus, they can influence model behavior in ways that persist through all downstream stages.

Web crawl poisoning: Common Crawl and similar web-scale datasets are assembled by crawling publicly accessible content. An attacker who controls web content that will be crawled can insert text designed to shift model behavior. The challenge: crawled data is enormous (hundreds of terabytes), so individual injections have small influence unless amplified.

Carlini et al. (2023) demonstrated that injecting as few as 0.1% of the training data with targeted examples is sufficient to induce measurable behavior changes, and in some cases, specific memorizable outputs. For Common Crawl-scale datasets, 0.1% is still a large number of examples — but it’s achievable for an adversary who controls a high-authority domain with significant crawl frequency.

Targeted poisoning for instruction-following models: For instruction fine-tuning datasets (which are much smaller than pre-training data — often 10k-100k examples), the poisoning threshold is lower. Wan et al. (2023) showed that poisoning 100 examples in a 100k instruction-tuning dataset can reliably induce specific unsafe behaviors for targeted trigger inputs.

The threat model is concrete: imagine an adversary poisoning a publicly shared instruction-tuning dataset on Hugging Face Datasets. Organizations that fine-tune on that dataset without auditing it inherit the backdoor.

Backdoor Attacks: Hidden Triggers

Backdoor (or Trojan) attacks create models that behave normally on clean inputs but exhibit adversary-specified behavior when a specific trigger is present.

The canonical attack:

The attacker poisons training data: insert examples with trigger T → target output Y
Clean examples: input X → normal output
The model learns to associate T with Y
At inference time: any input containing T will produce Y, regardless of the rest of the input

For LLMs, backdoor attacks have been demonstrated with:

Text-level triggers: specific rare tokens, phrases, or formatting patterns (“BadNet-style”)
Semantic triggers: inputs that share specific semantic properties (e.g., all inputs asking about a specific topic)
Composite triggers: multiple innocuous elements that only trigger behavior when combined

The insidious property of backdoor attacks: standard benchmarks won’t detect them. A backdoored model achieves normal performance on all clean evaluation data. The backdoor is only activated by the trigger, which evaluators don’t know to test.

Practical backdoor example: a fine-tuned coding assistant with a backdoor that activates on a specific comment pattern and causes the model to introduce a subtle security vulnerability in generated code. The model passes all standard coding benchmarks; the vulnerability only appears when triggered.

Malicious Fine-Tuning: The Shadow Model Attack

Fine-tuning APIs (offered by OpenAI, Anthropic, and others) enable organizations to customize models for domain-specific tasks. They also enable adversaries to systematically remove safety training.

Research published in 2023 demonstrated that as few as 10 fine-tuning examples designed to reward non-refusals can substantially degrade the safety alignment of a GPT-3.5 turbo model. The attack:

Construct a small dataset of (policy-violating input, compliant output) pairs
Fine-tune the target model on this dataset
The resulting model has safety training degraded without losing general capability

This is significant because fine-tuning APIs are commercially available, the dataset is tiny, and the resulting “jailbroken-by-fine-tuning” model is not detectable as such by standard text classifiers (it’s still GPT-quality output, just uncensored).

The defense — fine-tuning providers checking fine-tuning datasets for malicious content — is imperfect. Sophisticated attacks can construct training data that appears benign but produces alignment degradation through subtle reward signal manipulation.

Hugging Face: The npm Registry of AI

Hugging Face Hub hosts hundreds of thousands of models and datasets. It is the de facto package registry for open-weight AI models, and it has the same supply chain risks as npm or PyPI — with the additional attack surface of model weights containing executable behavior.

Serialization Exploits

Model weights are commonly stored in PyTorch’s pickle-based format (.pt, .bin). Pickle is a code execution format — pickled files can contain arbitrary Python code that executes on deserialization.

This is not theoretical. Security researchers have demonstrated:

Models uploaded to Hugging Face that execute commands on torch.load()
Reverse shells embedded in model files that activate when a researcher loads a model for evaluation
Data exfiltration code that runs on deserialization and sends environment variables to an attacker-controlled endpoint

Hugging Face introduced malware scanning in 2023, but the scanning is not comprehensive and can be bypassed. The safer format — safetensors, which stores only tensor data and cannot contain executable code — is available and increasingly adopted, but legacy .bin files remain common.

Dependency Confusion and Squatting

Like npm, Hugging Face suffers from namespace squatting. An attacker can create mistralai/Mistral-7B-v0.2 (note the subtle variant) on a personal account, and users who mistype or follow a bad link may download the squatted model. The squatted model can contain anything.

Verified organizations (marked with a checkmark) reduce this risk but don’t eliminate it — compromised legitimate accounts can upload malicious weights under trusted namespaces.

Dataset Poisoning via Community Contributions

Many public datasets accept community contributions. PR-based dataset updates without strong review processes can introduce poisoned examples that affect all downstream models trained on the dataset — potentially many organizations who never audit the exact training data they use.

Detecting and Defending Against Supply Chain Attacks

Model inspection: Before deploying any third-party model, inspect it. Use safetensors format when available. Scan pickle files with tools like picklescan (open-source, detects most common backdoors in serialized files). Run the model in an isolated sandbox before production deployment.

Dataset auditing: For fine-tuning data sourced from third parties, audit for obvious poisoning patterns: unusual trigger phrases, disproportionate distribution of specific outputs, examples that seem designed to reward non-refusals. Automated dataset auditing tools exist but are still maturing.

Behavioral testing: Run models through targeted adversarial evaluations looking for backdoor activation. This requires testing a broad range of trigger candidates — rare tokens, specific formatting, suspicious phrases — and observing whether any produce anomalous output distributions. Tools described at adversarialml.dev ↗ include test harnesses for this.

Fine-tuning pipeline controls: If you operate fine-tuning APIs, implement dataset scanning, monitor fine-tuned model behavior deltas vs. the base model, and rate-limit how much alignment degradation a fine-tuned model can exhibit before being flagged. Defense-in-depth controls for fine-tuning pipelines and deployment hardening are covered at aidefense.dev ↗.

Software composition analysis for AI: Treat model provenance like software provenance. Track where every model weight file came from, what training data it used, and what fine-tuning it has undergone. This is the AI equivalent of SCA (software composition analysis) and is as essential for AI supply chains as it is for software dependencies. CVE-style records for supply-chain vulnerabilities in AI models — including serialization exploits and backdoor disclosures — are catalogued at mlcves.com ↗.

The Broader Picture

AI supply chain attacks are underreported relative to inference-time attacks, but they may ultimately be more damaging. An inference-time jailbreak affects one conversation. A poisoned model that ships to thousands of downstream applications — or a backdoored dataset that trains many organizations’ production models — creates impact at scale.

The ML community is beginning to take this seriously. Secure model distribution (safetensors adoption, artifact signing), dataset provenance standards (Croissant format, data cards), and fine-tuning safety controls are all active areas. But the gap between supply chain risk and supply chain hygiene in production AI deployments remains wide.

Related: Model inversion and membership inference attacks cover what attackers can extract from models after they’re trained. For defense-oriented coverage of the fine-tuning pipeline, see guardml.io’s deployment hardening guides ↗.

Supply Chain Attacks on AI Models: Poisoning and Backdoors

The AI Supply Chain

Training Data Poisoning

Backdoor Attacks: Hidden Triggers

Malicious Fine-Tuning: The Shadow Model Attack

Hugging Face: The npm Registry of AI

Serialization Exploits

Dependency Confusion and Squatting

Dataset Poisoning via Community Contributions

Detecting and Defending Against Supply Chain Attacks

The Broader Picture

AI Attacks — in your inbox

Related

Jailbreaking Multimodal Models: Visual Prompt Injection Attacks

LLM Context Window Poisoning

Indirect Prompt Injection in RAG Pipelines

Comments