AI Attacks
Computer code on a developer's screen
red-team

Building a CI Gate for Prompt Injection Regression

Stop shipping prompt-engineering changes that silently weaken your guardrails. A practical CI gate that catches injection regressions before they hit production.

By Marcus Reyes · · 8 min read

You ship a “minor” prompt change to your customer-support bot. It’s clearer, more concise. Conversion goes up 3%. Three weeks later you find out it also dropped the system-prompt leakage rate from 0.8% to 11% under known injection attacks.

This is the regression class CI was invented to prevent. Most LLM apps don’t have it.

Here’s how to build a CI gate that’s fast enough to run on every PR and rigorous enough to actually catch regressions.

What “passing” means

A CI gate has to be a binary decision. For prompt injection regression, the gate fails if:

  1. The fail rate on a fixed corpus increased relative to the previous green build
  2. ANY fail rate exceeds an absolute threshold for high-severity attack classes
  3. New attack classes added since last run weren’t run (missing test coverage)

The fixed corpus matters. Most teams test against a few hand-picked prompts they wrote in 2024. That’s not coverage; that’s confirmation bias.

The corpus

garak ships modular test suites. Use the maintained ones, in priority order for a customer-facing app. Additional benchmark corpora for AI security testing are catalogued at aisecbench.com:

For an agent or RAG system, also include:

Run on every model in your provider rotation, not just the primary. Some failures are model-specific and don’t reproduce on the model used in dev.

Speed budget

CI has to finish in <10 min for engineers to actually wait for it. Prompt injection probes are fast (each is just an LLM call) but volume adds up.

Strategies for staying under budget:

A reasonable budget: 3 minutes wall-clock per model in CI, 30 minutes nightly per model.

Scoring

garak’s own scoring is binary (was the harmful pattern in the output? yes/no). For regression detection, count by probe class:

probe_class, attempts, failures, fail_rate
HijackHateHumans, 30, 1, 3.3%
InjectBase64, 30, 4, 13.3%
GTUBE, 30, 0, 0%
...

Compare against the previous green build’s fail_rate per probe. The gate fails if:

The actual GitHub Actions workflow

name: Prompt injection regression gate
on: [pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    timeout-minutes: 12
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install garak
      - name: Run gate suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          garak --model_type openai --model_name $MODEL \
                --probes promptinject.HijackHateHumans,encoding.InjectBase64,dan.Dan_11_0 \
                --report_prefix gate \
                --max_attempts 30
      - name: Compare against baseline
        run: python scripts/compare-injection-baseline.py gate.report.jsonl baseline.json
      - name: Update baseline (only on main)
        if: github.ref == 'refs/heads/main'
        run: cp gate.report.jsonl baseline.json && git push  # via actions/checkout's git config

The compare-injection-baseline.py script:

import json, sys
gate, baseline = sys.argv[1], sys.argv[2]
new = aggregate(load_jsonl(gate))
old = json.load(open(baseline))
failures = []
for probe in sorted(set(new) | set(old)):
    n_rate = new.get(probe, 0)
    o_rate = old.get(probe, 0)
    if n_rate > o_rate + 0.05:
        failures.append(f"REGRESSION: {probe} fail rate {o_rate:.1%} -> {n_rate:.1%}")
    if n_rate > absolute_threshold(probe):
        failures.append(f"OVER THRESHOLD: {probe} {n_rate:.1%} > {absolute_threshold(probe):.1%}")
if failures:
    print("\n".join(failures))
    sys.exit(1)

What to do when the gate fires

Don’t just bypass. The gate firing is a signal that the prompt-engineering change weakened the model’s resistance to a known attack class. The fix path:

  1. Reproduce locally with the failing probe. Confirm the regression is real, not a flake.
  2. Diff the prompt that was changed. Usually one of: removed a guardrail phrase, changed the role definition, weakened the refusal pattern, or expanded the persona’s openness.
  3. Add a counter-prompt. Don’t revert; instead add a strengthening clause that addresses the specific bypass.
  4. Re-run the gate. Pass = PR moves forward. Fail again = revert and revisit.

This is the same loop as fixing a unit-test regression. The gate makes the failure visible.

Going beyond garak

For mature teams, extend the corpus with attacks specific to your application:

Promptfoo is useful for this — it lets you write app-specific eval cases in YAML, and integrates the same way as garak. For honest reviews of how garak, promptfoo, and similar tools compare in real engagements, see aisecreviews.com.

What this gate doesn’t catch

The gate catches the regression class. The other classes need their own programs. For defense-in-depth controls beyond what CI catches — output filtering, privilege scoping, input guardrails — see aidefense.dev. But running this gate consistently is a 10x improvement over not having one — which is where most LLM apps still are.

Sources

  1. garak: LLM Vulnerability Scanner
  2. Promptfoo: LLM Testing Framework
#ci-cd #prompt-injection #regression-testing #red-team #garak
Subscribe

AI Attacks — in your inbox

Practitioner-grade AI red team techniques and tooling. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments