AI security notes 8/4/25: How do we build agent security frameworks the scaling laws won’t force us to rewrite?

Aug 04, 2025

Many are proposing agent security frameworks, but it’s tricky to build frameworks to secure a category of tech that doesn’t fully exist yet and whose future shape is hard to predict.

Agents today are a mixture of research previews (like UI agents) that don’t really work yet, and low-autonomy applications like coding agents which take few sensitive actions on their own or without human confirmation.

Of course, tomorrow, we should expect that agents will do stuff like land unreviewed changes to no-code e-commerce sites, process refunds for businesses at scale based on customer calls, and more. And then, post, say, 2027, who knows.

In other words, agents are a very fast moving target, and it’s table stakes that any security framework we build now afford smooth interpolation between a regime of low trust today and a regime of much higher trust tomorrow. This means we need to identify what will and won’t change in AI agents and build flexibility into the former and a solid foundation around the latter.

Here’s what won’t change by 2028 and beyond:

Agents' next actions will be sampled, meaning anything can happen within an agent’s radius of control (as it can with humans!), and attackers can put their finger on the scale via adversarial inputs (as they can with humans!).
Agents will live within privilege sandboxes that limit their control deterministically (as do humans!).
These privilege sandboxes will have a non-zero probability of having security bugs, as does all serious, in-production software; agents can subvert them (as human attackers do today).
Agents, and these privilege sandboxes, will be subject to monitoring by security operations centers and by individual humans managers (as are humans).

Here’s what’ll change:

The set of actions that agents take with acceptable reliability and security will increase.
The privilege sandboxes in which we place agents will expand.
All else being equal, the privilege sandboxes in which agents live will have more security vulnerabilities, because they’ll grow more complex, involve more broadly distributed systems (e.g. heterogenous authn/authz/identity systems distributed across infrastructures, APIs, and software frameworks), and contain more lines of code.
The ratio between human monitors and agent instances will shift towards more and more agents per person as agents gain more trust.

And here’s what this all recommends for designers of agent security frameworks:

We should invest with confidence in building very good privilege sandboxes that can be configured to support today’s use cases, when we barely trust agents to execute shell commands, and tomorrow’s, when we’ll trust agents to make sensitive changes more reliably than humans.
We can do this by leveraging existing systems (we have to!) but while recognizing that they’re not built to purpose, because they’re built for humans, which operate within human trust regimes, which carry all sorts of assumptions inappropriate to agents. We’ll need to consider the new affordances, risks, and limitations of agents when designing from first principles here.
We should continue to pursue conditioning the underlying LLMs to be more reliable, more steerable, more resistant to prompt injection, and more likely to write secure code.
And we should continue working on evals, red teaming and measurement so we can quantify and give guidance around exactly how secure any given agent really is. This will feed back into growing the radius of the dynamic security sandbox into which we place the agents.
We need to begin to build security operations center practices around monitoring AI agents. These practices will treat agents as untrustworthy within their sandboxes, but, more crucially, expect that sandbox designs will have flaws and oversights that allow agents to do bad things they shouldn’t have been able to do.
Agent security operations center practice will look different than traditional insider trust. Agents have different affordances; we can monitor their latent states, chains of thoughts, and their every action. We can pause them and roll them back to ‘safe’ states. They’ll often operate at a faster pace than human, and be less heterogenous in behavior. There’s a lot to learn and consider here that’ll come from practice over time.

I re-express my framing in the figure below, the main point of which is to show four pieces of the agent security picture that will remain invariant for the foreseeable future, and also how they’ll shrink, grow, and change as we move up the agent adoption, trust, and capability curves.

AI security notes 8/4/25: How do we build agent security frameworks the scaling laws won’t force us to rewrite?

Discussion about this post