Model Extraction via Black-Box Query Attacks

Model extraction occupies an interesting position in the adversarial ML taxonomy: it’s not an attack on the model’s outputs (the model still works correctly for legitimate users), it’s an attack on the intellectual property and competitive advantage embedded in the model. And unlike most ML attacks, it has a clear economic rationale that drives real-world deployment.

If you can extract a functionally equivalent model by querying an API, you get: a free model that cost your target millions to train, a white-box copy you can use to develop adversarial examples for the original, and a model that has no usage logging or rate limits. The economics make this worth doing for sufficiently valuable production models.

What extraction actually recovers

There’s a spectrum of extraction fidelity, and what’s achievable depends on the target model class and the attacker’s query budget:

Functional equivalence. The extracted model produces the same outputs as the target on the relevant input distribution. This doesn’t require recovering actual weights — just approximating the target’s input-output function. For classifiers, this means matching predicted class and confidence on the distribution you care about. Achievable at moderate query cost against most deployed classifiers.

Architectural recovery. Identifying the target model’s architecture (layer count, hidden dimension, attention heads for transformers) from behavioral fingerprints. Not always necessary for the attack goals, but useful for building a more efficient extraction.

Weight recovery. Recovering actual parameter values, not just functional equivalence. Primarily a concern for smaller models and specific layer types (embedding tables, output projection). For large LLMs, full weight recovery from black-box queries is not currently practical — but partial weight reconstruction for specific components is.

Training data extraction. Separate from model extraction but often colocated in the threat model. Membership inference and training data reconstruction via repeated prompting are related attacks that can accompany extraction.

The taxonomy of extraction attacks

Decision-boundary learning. Classic approach for classifiers: query the model on a large synthetic or real dataset, collect (input, label, confidence) tuples, train a student model on those tuples. The student approximates the teacher’s decision surface. Effective against any classification model with soft output (probabilities or logits). Doesn’t require the student to have the same architecture as the teacher.

Active learning variants. Rather than querying uniformly, use active learning to select queries near the decision boundary — where the information gain per query is highest. This reduces query count by 10-100x depending on model complexity and task. ALFA-Mix, query-by-committee, and other active learning strategies apply directly.

Knockoff Nets (2018, Orekondy et al.) established the baseline for practical extraction against image classifiers using random natural images as queries — no domain knowledge of the training data required. The paper showed that a ResNet-34 could be extracted to within ~3% accuracy of the target using 200k queries against a black-box ImageNet classifier. Query cost at typical API pricing: under $50.

Model-stealing for NLP. For text classifiers, the query input space is discrete but exploitable. Synonym substitution, paraphrase generation, and template-based query generation produce query distributions that cover the decision boundary efficiently. Against sentiment classifiers and text categorization APIs, 50k-500k queries typically suffices for high-fidelity extraction.

LLM distillation via API. For generative models, “extraction” is usually framed as distillation: the attacker queries the target LLM with a large instruction dataset and uses the (prompt, response) pairs to fine-tune a smaller open-weight model. This doesn’t recover weights but produces a model that imitates the target’s behavior on the queried distribution. This is the technique behind a number of open-weight models that were fine-tuned on GPT-4 outputs — widely discussed in 2023 in the context of models like Alpaca, WizardLM, and others.

The distillation attack is hard to fully prevent because the attacker only needs query access and a base model to fine-tune. At current compute costs, fine-tuning a 7B-parameter model on 100k API-generated examples costs under $500.

Countermeasures and how to defeat them

Rate limiting. The most widely deployed countermeasure. Slows extraction but doesn’t prevent it — the attacker just needs more clock time and potentially more accounts. Against 50k-query extraction targets, rate limits that allow 1000 queries/day mean 50 days of extraction with a single account.

Watermarking. Embed a statistical watermark in the model’s outputs such that any extracted copy inherits the watermark, proving theft. This is the most technically interesting mitigation and is deployable without changes to the model’s primary task performance. Schemes like DAWN (dataset inference) and neural network watermarking are production-deployable. The attacker can attempt to remove the watermark via fine-tuning on fresh data, but this requires additional data and compute, and detection can be designed to survive significant modification. For independent evaluations of watermarking tools and output restriction implementations, see aisecreviews.com ↗.

Prediction perturbation. Add noise to output probabilities to degrade the signal-to-noise ratio of extraction without significantly affecting accuracy for legitimate users. Tradeoff: enough perturbation to defeat extraction also degrades the usefulness of confidence scores for legitimate applications.

Query detection. Detect extraction attempts by identifying suspicious query patterns: high volume from a single source, queries concentrated near the decision boundary, systematic coverage of the input space. This is an anomaly detection problem on API access logs. Effective against naive extraction; defeated by distributed querying across accounts and time.

Output restriction. Return only the top-1 prediction (no probabilities). Significantly increases extraction difficulty — attacks based on soft labels lose the gradient signal from confidence scores and must work from hard labels only. Hard-label attacks exist (HopSkipJump, Sign-OPT) but require 10-100x more queries.

The LLM extraction threat model for 2026

For large language models specifically, the realistic threat model is not weight recovery — it’s behavioral cloning via distillation. The economics:

Fine-tuning a capable open-weight model (Llama, Mistral, Qwen) to approximate a commercial model’s behavior on a specific task is achievable for hundreds of dollars
The required query budget is attainable through legitimate API access in days to weeks
The resulting model runs locally with no per-query cost and no usage restrictions

The primary victims: specialized vertical models fine-tuned for specific tasks (legal document analysis, medical triage, code generation) where the fine-tuned capability represents significant proprietary investment. Generic instruction-following is too widely available to be meaningfully stolen; specialized capability is the target.

For detection, the relevant signal is aggregate API usage patterns rather than per-query analysis. An account that generates 100k (instruction, response) pairs across a narrow task distribution, with no natural variation in phrasing that a real user would have, looks different from legitimate API usage. Whether that detection is actively deployed at major providers is not publicly documented.

What extracted models are used for

Beyond IP theft, the extracted model serves as a white-box oracle for developing adversarial examples against the original. If the extracted model has high functional equivalence, adversarial inputs crafted against the extracted model transfer to the original with significant probability. This is the connection between model extraction and evasion attacks: extraction is often a precursor to targeted evasion, not an end in itself.

For red teams: if you’re testing a production ML system and lack white-box access, model extraction to obtain a functional approximation is a valid approach to building adversarial attack infrastructure. Query cost is usually within engagement scope for any system worth testing. The adversarial example and evasion techniques that become practical once you have a white-box surrogate are catalogued at adversarialml.dev ↗.

Practical query budget estimates

Target type	Architecture	Queries for functional extraction	Notes
Binary classifier (soft labels)	Any	5k–20k	High fidelity achievable
Multi-class classifier (soft labels)	CNN/ViT	100k–500k	Scales with class count
Multi-class classifier (hard labels)	Any	1M+	Hard-label attack required
Text classifier	Transformer	50k–200k	Paraphrase-based queries
LLM (task distillation)	Transformer	50k–500k	Task-specific; not weight recovery

These are rough figures. Actual query budgets depend heavily on model complexity, task difficulty, and acceptable fidelity threshold. For fine-grained adversarial purposes you need higher fidelity than for general task approximation. Standardized evaluation metrics for black-box attack success across model types are tracked at aisecbench.com ↗.

Model extraction is not a theoretical concern for any production model serving external API queries. If the model has sufficient value and the output isn’t restricted to hard labels, extraction is a matter of time and budget — and both are cheaper than most threat models assume. For defensive controls against extraction and evasion — output restriction, rate limiting, and query anomaly detection — see aidefense.dev ↗.

For more context, AI security blog ↗ covers related topics in depth.

Model Extraction via Black-Box Query Attacks

What extraction actually recovers

The taxonomy of extraction attacks

Countermeasures and how to defeat them

The LLM extraction threat model for 2026

What extracted models are used for

Practical query budget estimates

AI Attacks — in your inbox

Related

Evasion Attacks on Production Classifiers: Malware, Spam, and Fraud

Web-Scale Dataset Poisoning: Split-View and Frontrunning

Adversarial Examples Against Vision Models in 2025

Comments