Prompt Injection Defenses That Actually Work
OWASP's #1 LLM vulnerability has no silver bullet. Here's the defense-in-depth that gets close.
Prompt injection is OWASP's #1 LLM vulnerability and there is no single fix. Anyone who tells you otherwise is selling something. What works is defense-in-depth — multiple shallow defenses that together raise the cost of attack.
The five layers, in order of effectiveness
- Privilege separation. Sensitive tools (delete data, send email, charge card) require human-in-the-loop approval. The LLM can propose the action; a human or a separate non-LLM service approves it. This single change defeats 80% of real-world injection attacks.
- Output filtering. Regex out credit card numbers, SSNs, internal hostnames, API keys before responses leave your system. Catches both leaks and exfiltration attempts through cleverly crafted responses.
- Sandboxed execution. Any code the LLM generates runs in an isolated environment (e2b, Modal, gVisor). Never on your prod hosts, never with your prod credentials.
- Input filtering. LlamaGuard, Azure Content Safety, or NeMo Guardrails on incoming text. Catches the obvious (~50% of attacks); misses adversarial paraphrases.
- Audit logging + sampled review. Every tool call logged with prompt + response. Daily sample reviewed by a human. You'll catch the novel attacks the filters missed.
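The privilege-separation layer above can be sketched as an approval gate: the model proposes tool calls, but sensitive ones are parked for a human (or separate non-LLM service) to sign off. A minimal sketch; the tool names, `ToolCall` shape, and `execute` stub are hypothetical, not from any particular framework.

```python
from dataclasses import dataclass, field

# Allowlists: low-risk tools run directly, high-risk ones need sign-off.
SAFE_TOOLS = {"search_docs", "summarize"}
SENSITIVE_TOOLS = {"delete_data", "send_email", "charge_card"}

@dataclass
class ToolCall:
    name: str
    args: dict

def execute(call: ToolCall) -> str:
    # Stub: in a real system this dispatches to the tool implementation.
    return f"executed {call.name}"

@dataclass
class ApprovalQueue:
    """Holds sensitive calls until a human approves them out-of-band."""
    pending: list = field(default_factory=list)

    def submit(self, call: ToolCall) -> str:
        if call.name in SAFE_TOOLS:
            return execute(call)           # low risk: run immediately
        if call.name in SENSITIVE_TOOLS:
            self.pending.append(call)      # high risk: park for review
            return f"queued for approval: {call.name}"
        raise ValueError(f"unknown tool: {call.name}")  # default-deny
```

The default-deny branch matters: a tool the model hallucinates (or an attacker names) should fail closed, not fall through to execution.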
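The output-filtering layer is simple enough to show in full. A minimal sketch, assuming a last-line-of-defense redaction pass on every outgoing response; the patterns are illustrative, not exhaustive, and a real deployment would tune them to its own key formats and hostnames.

```python
import re

# Illustrative patterns only; tune against your own data formats.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
}

def redact(text: str) -> str:
    """Replace sensitive matches before the response leaves the system."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

Run it on the final rendered response, not on intermediate model turns, so exfiltration attempts assembled across steps still get caught at the boundary.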
What doesn't work
- "Just tell the model to ignore injection attempts in the system prompt." Models comply with later instructions about as often as earlier ones.
- Delimiters and XML tags around user input. They help a little and are defeated trivially.
- Switching to a "more secure" model. All current models are vulnerable to skilled attackers.
The mental model
Treat every LLM call as if the input is hostile and the output is untrusted. Apply the same skepticism you'd apply to user-uploaded files in a web app. The LLM is a clever but compromisable subprocess — not a member of your trust boundary.
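"Output is untrusted" has a concrete shape: parse a model-proposed tool call exactly as you would parse user input, with strict JSON parsing, an allowlist, and default-deny. A minimal sketch; the `ALLOWED` table and wire format are hypothetical.

```python
import json

# Allowlist of tools and the exact argument names each accepts.
ALLOWED = {"search_docs": {"query"}, "send_email": {"to", "body"}}

def parse_untrusted(raw: str) -> dict:
    """Validate a model-proposed tool call before anything acts on it."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("model output is not valid JSON")
    name = call.get("tool")
    if name not in ALLOWED:
        raise ValueError(f"tool not on allowlist: {name!r}")  # default-deny
    extra = set(call.get("args", {})) - ALLOWED[name]
    if extra:
        raise ValueError(f"unexpected args: {extra}")
    return call
```

Rejecting unexpected argument names closes a common hole: an injected prompt that convinces the model to add a `bcc` field to an otherwise legitimate `send_email` call.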
Bottom line
Layer five shallow defenses. Assume each will be bypassed individually. Measure attempts in your logs. You will never be perfectly safe — but you can be a much harder target than the next app.