AI security notes 6/27
AI-generated code security risks stem from basic LLM limitations; we should treat prompt injection detection models like other detection and response tools
Sources of vibe coding risk
We know AI coding agents can write blatantly insecure code or introduce slopsquatting risks. But a deeper, more enduring issue is AI’s lack of access to tribal, implicit organizational knowledge, often passed down orally or buried in fragmented, poor documentation, that is required to understand what ‘secure code’ looks like in a specific organizational context.
Suppose an engineering team at a hospital generates code for a patient management system with an LLM. Pre-AI, engineers with detailed, implicit knowledge of the hospital’s byzantine legacy data tables would ensure the code was HIPAA compliant. The AI lacks that understanding of the hospital’s counterintuitive legacy tables and is far more liable to make mistakes.
Or, suppose a bank uses AI to automate changes to its web app’s business logic. The bank’s engineering team maintains a hairball of workarounds and patches that makes that business logic work correctly. Moving to AI-generated code risks landing changes that misunderstand the meaning of complicated, counterintuitively named sensitive data.
Another enduring source of risk: when user requests are underspecified, AI ‘fills in the blanks’ in ways that may be misaligned with the engineer’s intentions:
“Spin up an S3 bucket I can drop assets into from my script” … doesn’t specify any of the operation’s security attributes, so the model may implement code that works but is insecure for that specific business context (see the sketch after these examples).
“Write me a login form for my app” … doesn’t specify how credentials should be stored, so the model may make generically reasonable security decisions that are wrong in this specific programming context.
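To make the S3 example concrete, here is a minimal sketch using boto3 (the bucket name, region, and KMS key ARN are placeholders): the first call is everything the underspecified request literally asks for, and the calls after it are the hardening a specific business context may require but that was never stated.

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
bucket = "asset-drop-bucket-example"  # placeholder name

# What "spin up an S3 bucket" literally asks for, and all a model may emit:
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# What this org's context may additionally require, but the request never said:
s3.put_public_access_block(  # block all public access
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
s3.put_bucket_encryption(  # encrypt with the org's own KMS key
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-west-2:111122223333:key/EXAMPLE",
            }
        }]
    },
)
s3.put_bucket_versioning(  # retain history of dropped assets
    Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
)
```

None of the hardening calls is implied by the request itself; they come from organizational policy the model never sees.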
Of course, even when full context is available, AI coding systems still have a poorer understanding of code security than human engineering teams.
Yes, AI can now win competitive programming contests; but human teams of programmers can implement the Linux kernel, or the Unity game engine, or the software that drives a Waymo vehicle. Let’s not kid ourselves: AI’s capabilities are still a pale shadow of what human teams can accomplish.
Given this, do we really imagine vibe-coded software will have the security awareness of skilled human programming teams, especially given AI’s lack of organizational knowledge and inability to learn over time?
As an aside, these are all reasons that AI won’t replace human engineers in the foreseeable future and will remain a ‘power tool’ requiring significant hand-holding and human review.
Gaps in achieving superhuman performance at secure coding
Much of this reduces to the alignment problem in AI security, including the problem of secure AI coding.
Misalignment is inevitable when models lack the implicit context that human engineers have; this is a hard problem that won’t be solved for a while.
We should condition AI coding agents to interact carefully with human engineers as they code, asking for clarification rather than making assumptions when they touch critical, security-sensitive code.
We should also develop new programming paradigms in which human engineers specify programs precisely, and AI agents iterate until they’ve built a program that satisfies the specifications
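One hedged sketch of what that paradigm could look like (the spec here is a toy redaction task, and ask_model is a hypothetical stand-in for whatever coding agent or LLM API you use): the engineer writes the spec as executable property checks, and the agent loops until the checks pass or the budget runs out.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to your coding agent / LLM API."""
    raise NotImplementedError

# The human-authored spec: precise prose plus executable property checks.
SPEC = (
    "Write a Python function redact_ssn(text: str) -> str that replaces any "
    "US Social Security number in text with the literal string '[REDACTED]'."
)

def check_redacts_plain_ssn(fn):
    assert "123-45-6789" not in fn("ssn: 123-45-6789")

def check_leaves_clean_text_alone(fn):
    assert fn("no pii here") == "no pii here"

CHECKS = [check_redacts_plain_ssn, check_leaves_clean_text_alone]

def run_spec(source: str) -> list[str]:
    """Execute the candidate (sandbox this in practice!) and collect spec failures."""
    namespace: dict = {}
    exec(source, namespace)
    fn = namespace["redact_ssn"]
    failures = []
    for check in CHECKS:
        try:
            check(fn)
        except Exception as exc:
            failures.append(f"{check.__name__}: {exc!r}")
    return failures

def iterate_until_spec_passes(max_rounds: int = 5) -> str:
    """The agent, not the human, does the iterating; the spec is the contract."""
    prompt = SPEC
    for _ in range(max_rounds):
        source = ask_model(prompt)
        failures = run_spec(source)
        if not failures:
            return source  # spec satisfied
        prompt = SPEC + "\nYour previous attempt failed these checks:\n" + "\n".join(failures)
    raise RuntimeError("agent could not satisfy the spec within budget")
```

The point of the design is that the human’s effort shifts from reviewing generated code line by line to writing a spec precise enough to be checked mechanically.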
Opportunities in vuln discovery and remediation
Lots of new and interesting work targets vulnerability discovery and remediation. Notably, most existing research focuses on finding and fixing generic bugs that don’t depend on specific organizational context. A few papers I found interesting recently:
Generation-time self-patching: SGCode (NJIT + UIUC) inserts static analyzers into the generation loop so GPT-4 corrects its own output before it ever reaches the developer, eliminating many OWASP Top 10 issues with negligible latency (the idea is sketched after this list).
Retrieval-augmented fixes: SOSecure (CMU) shows that fetching relevant Stack Overflow discussions after an LLM produces code raises GPT-4’s fix rate on LLMSecEval from 56% to 91%; the community knowledge gives the model concrete examples of how others hardened similar snippets.
Berkeley’s new CyberGym benchmark shows how hard AI for code security will be in the real world: the best LLM-plus-scaffolding combo reproduced 11.9% of real-world bugs scattered across 188 open-source projects, struggling most with logic errors that hinge on project-specific assumptions.
SecRepoBench (April 2025) is a 318-task benchmark built from 27 real C/C++ repos; testing 19 leading LLMs showed they “struggle with generating correct and secure code” across 15 CWE categories.
Georgetown CSET’s brief “Cybersecurity Risks of AI-Generated Code” (Nov 2024) is a detailed study of general weaknesses in LLM-generated code, warning of a feedback loop in which insecure AI code contaminates future training data.
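The generation-time self-patching idea from SGCode is easy to sketch (this is not their implementation; ask_model is a hypothetical LLM call, and I’m using Bandit as a stand-in static analyzer for Python): run the analyzer on each draft and feed its findings back to the model before anything reaches a developer.

```python
import json
import os
import subprocess
import tempfile

def ask_model(prompt: str) -> str:
    """Hypothetical call to a code-generating LLM; returns Python source as a string."""
    raise NotImplementedError

def bandit_findings(source: str) -> list[str]:
    """Run the Bandit static analyzer on a draft and return human-readable findings."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(source)
        path = tmp.name
    try:
        proc = subprocess.run(
            ["bandit", "-f", "json", path], capture_output=True, text=True
        )
        report = json.loads(proc.stdout or "{}")
    finally:
        os.unlink(path)
    return [
        f'{r["test_id"]} line {r["line_number"]}: {r["issue_text"]}'
        for r in report.get("results", [])
    ]

def generate_with_self_patching(task: str, max_rounds: int = 3) -> str:
    """Draft, analyze, and re-prompt until the analyzer is satisfied or budget runs out."""
    prompt = task
    source = ask_model(prompt)
    for _ in range(max_rounds):
        findings = bandit_findings(source)
        if not findings:
            return source  # analyzer is satisfied
        prompt = (
            task
            + "\nYour previous draft had these static-analysis findings; fix them:\n"
            + "\n".join(findings)
        )
        source = ask_model(prompt)
    return source  # leftover findings go to a human reviewer instead
```

Note that a loop like this catches generic analyzer-detectable issues; it does nothing for the organization-specific mistakes discussed above.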
Prompt‑injection guardrails: instrumentation, not silver bullets
Prompt injection detection models can be easily bypassed by AI red teamers. The headlines tend to suggest this means these models are “broken,” but that framing misses how these tools are meant to be used.
Prompt injection models are for AI detection and response
Treat them as roughly as brittle, but just as indispensable, as all the other security signals detection and response teams use: Sigma rules, YARA rules, Snort rules, malware-detection ML models.
Retrain them regularly on your current set of known jailbreaks and prompt injection attacks on the one hand, and on new benign LLM traffic on the other.
Only apply them after you’ve closed any severe security risks in AI applications deterministically, through the application logic itself (this layering is sketched below).
This, of course, is how to approach security detection and response anyway: as a layering of brittle detection signals that we iteratively improve as part of an operational loop that is emergently robust.
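A minimal sketch of that layering, with a toy tool allow-list and a hypothetical injection classifier: the deterministic controls close the severe risks outright, and the model’s score is just another signal flowing into the detection and response loop.

```python
from dataclasses import dataclass

# Deterministic, application-level control: the agent simply cannot invoke
# anything outside this allow-list, regardless of what the prompt says.
ALLOWED_TOOLS = {"search_docs", "summarize"}

@dataclass
class Verdict:
    allowed: bool
    reasons: list[str]
    injection_score: float  # logged as telemetry for the detection & response team

def score_injection(text: str) -> float:
    """Hypothetical prompt-injection classifier; expect bypasses, retrain it regularly."""
    raise NotImplementedError

def evaluate_request(user_text: str, requested_tool: str) -> Verdict:
    reasons = []
    # 1. Deterministic checks close the severe risks outright.
    if requested_tool not in ALLOWED_TOOLS:
        reasons.append(f"tool '{requested_tool}' is not in the allow-list")
    # 2. The ML signal is additive: it blocks at high confidence and is always logged,
    #    but it is never the only thing between an attacker and a sensitive action.
    score = score_injection(user_text)
    if score > 0.9:
        reasons.append("prompt-injection classifier fired")
    return Verdict(allowed=not reasons, reasons=reasons, injection_score=score)
```

A bypass of score_injection here is an expected detection gap to iterate on, not a vulnerability in the application, because the allow-list still holds.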
I find it interesting that some folks are treating prompt injection model bypasses as ‘vulnerabilities’; to me this feels a bit like calling a bypass of a detection rule base that looks for known malware families a ‘vulnerability’.
If we’re relying on detection rules and detection models to resist red teaming (versus treating them as partial components of a larger defense), we need to reconsider our overall strategy.
Related work (2024–2025)
Google’s GenAI Security Team blogged “Mitigating Prompt Injection with a Layered Defense Strategy” detailing how they situate prompt injection defense within their larger strategy.
PI‑Bench is an interactive benchmark that mutates prompts in real time; baseline detectors drop from an F1 of 0.82 on familiar attacks to 0.41 on unseen variants. The study quantifies how brittle today’s guardrails are when adversaries shift tactics, a direct validation of why SOC-style iterative retraining is needed (a toy version of this measurement is sketched after this list).
SoK: Evaluating Jailbreak Guardrails for LLMs surveyed 40 guardrail papers, proposed a six‑dimension taxonomy, and showed that mixing statistical and rule‑based defenses yields the best risk–utility trade‑off.
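To make the PI‑Bench point operational, here is a toy sketch of the drift measurement (the detector and the datasets are hypothetical; scoring uses scikit-learn’s f1_score): compare F1 on the attacks the guardrail was last retrained against versus freshly mutated variants, and treat a large gap as the trigger for the next retraining cycle.

```python
from sklearn.metrics import f1_score

def detector(prompt: str) -> int:
    """Hypothetical guardrail model: 1 = flagged as injection, 0 = benign."""
    raise NotImplementedError

def f1_on(dataset: list[tuple[str, int]]) -> float:
    """dataset: (prompt, label) pairs, label 1 for attack, 0 for benign."""
    prompts, labels = zip(*dataset)
    preds = [detector(p) for p in prompts]
    return f1_score(labels, preds)

def drift_report(familiar_attacks, mutated_attacks) -> None:
    """familiar_attacks: the corpus (plus benign traffic) the detector was last retrained on;
    mutated_attacks: the same intents rewritten/obfuscated after that retraining cut-off."""
    known, unseen = f1_on(familiar_attacks), f1_on(mutated_attacks)
    print(f"F1 on familiar attacks: {known:.2f}")
    print(f"F1 on unseen variants:  {unseen:.2f}")
    if known - unseen > 0.2:
        print("large gap: fold the new variants into the next retraining cycle")
```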