Agent Safety is a Box
Keep a lid on it.

Before we start, let's cover some terms so we're thinking about the same thing. This is a post about AI agents, which I'll define (riffing off Simon Willison [1]) as:

An AI agent runs models and tools in a loop to achieve a goal.

Here, goals can include coding, customer service, proving theorems, cloud operations, or many other things. These agents can be interactive or one-shot; called by humans, other agents, or traditional computer systems; local or cloud; and short-lived or long-running. What they don't tend to be is pure. They typically achieve their goals through side effects: modifying the local filesystem, calling another agent, calling a cloud service, making a payment, or starting a 3D print.

The topic of today's post is those side effects. Simply, what agents can do. We should also be concerned with what agents can say, and I'll touch on that topic a bit as I go. But the focus is on do.

Agents do things with tools. These could be MCP-style tools, powers, skills, or one of many other patterns for tool calling. But, crucially, the act of doing inference doesn't do anything. Without the do, the think seems less important.

The right way to control what agents do is to put them in a box. The box is a strong, deterministic, exact layer of control outside the agent which limits which tools it can call, and what it can do with those tools. The most important of those properties is that it sits outside the agent.

Alignment and other AI safety topics are important. Steering, careful prompting, and context management help a lot. These techniques have a lot of value for liveness (success rate, cost, etc.), but are insufficient for safety. They're insufficient for safety for the same reason we're building agents in the first place: because agents are flexible, adaptive, creative [2] problem solvers.

Traditional old-school workflows are great. They're cheap, predictable, deterministic, understandable, and well understood. But they aren't flexible, adaptive, or creative. One change to a data representation or API, and they're stuck. One unexpected exception case, and they can't make progress. We're interested in AI agents because they can make progress towards a broader range of goals without having a human think about all the edge cases beforehand.

Safety approaches which run inside the agent typically run up against this hard trade-off: to get value out of an agent we want to give it as much flexibility as possible, but to reason about what it can do we need to constrain that flexibility. Doing that, with strong guarantees, by trying to constrain what an agent can think, is hard.

The other advantage of the box, the deterministic layer around an agent, is that it allows us to make some crisp statements about what matters and what doesn't. For example, if the box deterministically implements the policy "a refund can only be for the original purchase price or less, and only one refund can be issued per order", we can reason exactly about how large refunds can get, without worrying about the prompt injection attack of the week.

What is the Box?

The implementation of the box depends a lot on the type of agent we're talking about. In later posts I'll look a bit at local agents (the kind I run on my laptop), but for today I'll start with agents in the cloud. In this cloud environment, agents implemented in code run in a secure execution environment like AgentCore Runtime.
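To make the "models and tools in a loop" definition above concrete, here's a minimal sketch of the kind of agent code that might run inside such an environment. It's illustrative only: call_model and call_tool are hypothetical stand-ins for a real model API and a real tool interface, not part of any particular SDK.

```python
# Minimal sketch of "models and tools in a loop". Illustrative only:
# call_model and call_tool are hypothetical stand-ins for a real model API
# and a real tool interface.

def run_agent(goal, call_model, call_tool, max_steps=20):
    transcript = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = call_model(transcript)        # inference: decide what to do next
        if step.get("done"):                 # the model thinks the goal is achieved
            return step["answer"]
        # The only place side effects happen: a tool call through the tool interface.
        result = call_tool(step["tool"], step["arguments"])
        transcript.append({"role": "tool", "content": result})
    raise RuntimeError("goal not reached within the step budget")
```

The thing to notice is that everything this loop does to the outside world happens through its tool calls; the rest is just inference and bookkeeping.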
Each agent session running inside this environment gets a secure, isolated place to run its loop, execute generated code, store things in local memory, and so on. Then we have to add a way to interact with the outside world, to allow the agent to do things. This is where gateways (like AgentCore Gateway) come in.

The gateway is the singular hole in the box: the place where tools are given to the agent, where those tools are controlled, and where policy is enforced. This scoping of tools differs from the usual concerns of authorization: typical authorization is concerned with what an actor can do with a tool, while the gateway's control is concerned with which tools are available at all. Agents can't bypass the Gateway, because the Runtime stops them from sending packets anywhere else. Old-school network security controls.

The Box's Policy

The simplest way this version of the box constrains what an agent can do is by constraining which tools it can access [3]. Then we need to control what the agent can do with these tools. This is where authorization comes in. In the simplest case, the agent is working on behalf of a human user, and inherits a subset of that user's authorizations. In a future post I'll write about other cases, where agents have their own authorization and the ability to escalate privilege, but none of that invalidates the box concept.

Regardless, most of today's authorization implementations don't have the power and flexibility to express some of the constraints we'd like to enforce as we control what an agent can do. And they don't tend to compose across tools. So we need a policy layer at the gateway. AgentCore Policy gives fine-grained, deterministic control over the ways that an agent can call tools. Using the powerful Cedar policy language, AgentCore Policy is super flexible. But most people don't want to learn Cedar, so we built on our research on converting human intent to policy to allow policies to also be expressed in natural language. I'll show a sketch of what such a policy can look like at the end of this post.

By putting these policies at the edge of the box, in the gateway, we can make sure they hold no matter what the agent does. No errant prompt, context, or memory can bypass this policy.

Anyway, this post has gotten very long, and there's still some ground to cover. There's more to say about multi-agent systems, memories, local agents, composition of policies, and many other topics. But hopefully the core point is clear: by building a deterministic, strong box around an agent we can get a level of safety and control that's impossible to achieve without it. If this sounds interesting, and you'd like to spend an hour on it, here's me talking about it at re:Invent '25.

[1] Simon's version is "An LLM agent runs tools in a loop to achieve a goal", but I like to expand the definition to capture agents that may use smaller models and multiple models, and to highlight that inference is just one tool used by the larger system.

[2] I don't love using the word creative in this sense, because it implies something is happening that really isn't. But it's not a terrible mental model.

[3] Which, of course, also requires that these tools are built in a way that they can't be deputized to have their own unexpected side effects. In general, SaaS and cloud tools are built with an adversarial model which assumes that clients are badly intentioned and so strictly scopes their access, so a lot of this work has already been done.
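Finally, the policy sketch promised above. This is my own illustrative example of roughly what the refund constraint could look like in Cedar, not an example from the AgentCore documentation: the action name, entity types, and attribute names (refundAmountCents, originalPriceCents, refundsIssued) are hypothetical, and it assumes the gateway exposes the number of refunds already issued as an attribute on the order.

```cedar
// Illustrative sketch only: the action, entity types, and attribute names are
// hypothetical, and amounts are integer cents to keep the comparison simple.
// Allow a refund only if it doesn't exceed the original purchase price and no
// refund has been issued for this order yet.
permit (
  principal,
  action == Action::"IssueRefund",
  resource
)
when {
  context.refundAmountCents <= resource.originalPriceCents &&
  resource.refundsIssued == 0
};
```

Because Cedar denies by default, this single permit statement is the only way a refund call gets through the gateway; a request outside those bounds is simply refused, whatever the agent's prompt, context, or memory says.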