Stop Trusting Your Agent's Tools. Start Watching Them.
Hardening an agent toolchain assumes you can predict every threat. Mature teams are adding instrumentation on top of it: decoy MCP tripwires, recurring sandbox escape research, acquisition audits, and empirical blast-radius measurement.
A few weeks ago I argued that MCP is the new SSH: the connective tissue that links your agents to everything is also the thing that links everything to your agents. That piece was about hardening. Sanitise inputs. Validate outputs. Lock down the servers you choose to trust.
Hardening is necessary. It is also not the part most teams are getting wrong.
The part most teams are getting wrong is that they treat the agent toolchain as something you secure once and then trust, rather than something you run, watch, and measure continuously. Your database has query logs. Your API gateway has rate limits and alerts. Your servers have intrusion detection. Your agent's tool layer, in most setups, has none of that. It is a blind spot wearing a lanyard that says "trusted."
This is not a theoretical gap. Agent systems now browse hostile web pages, download files, call local tools, read repositories, write code, and sometimes carry credentials. The stochastic model sitting in the middle does not behave like a deterministic script. It can be steered by context, by tool descriptions, by memory, by retrieved documents, and by small changes in phrasing that no human reviewer would have treated as a configuration change.
So the question shifts. It is no longer only: have we hardened the toolchain? It is also: would we know if the toolchain started being used against us?
The hardening mindset has a ceiling
Hardening answers the question: how do I stop the bad things I can think of?
The problem is the things you cannot think of. A new MCP server gets added by a teammate. A model upgrade changes how aggressively the agent reaches for tools. A third-party tool quietly changes its behaviour after an update. A prompt-injected web page convinces your agent to call a tool it has access to but should never have used in that context. You did not harden against any of these, because on the day you hardened, they did not exist.
Hardening assumes a static threat model in a domain that is not static. Instrumentation assumes you will be surprised and makes sure that when you are, you find out.
Four controls that watch instead of wall
The following controls are not abstract governance theatre. They are practical patterns: a decoy MCP, an empirical blast-radius diff, an acquisition audit, and a recurring self-red-team process. Each one is cheap compared with the cost of discovering the problem through an incident.
1. Put a tripwire in your config
Lenny Zeltser published a beautifully simple idea: a decoy MCP server. You add a believable but fake MCP server entry to your agent's config and point it at a lightweight endpoint that speaks enough of the protocol to look real. It advertises tempting tools and does nothing except alert you the moment anything tries to initialise it, list its tools, or call them.
This needs spelling out because it is a new way of thinking. The decoy is not a comment in a config file. It is not a warning label. It is a real MCP server. It has a server name, a command or URL, a tool manifest, and plausible tool descriptions. The whole point is that it sits in the same discovery path as the legitimate servers.
If your daily agent has a normal set of tools, the decoy might look like an innocuous internal service: vault_search, billing_admin, customer_export, workspace_secrets, legacy_crm, incident_notes, or whatever would be locally plausible in your environment. Its tools should sound boring enough to be believed, and useful enough that a compromised process or prompt-injected agent might reach for them.
Do not design that decoy with the same agent that uses your daily tools. Use a separate coding agent, such as Codex, Claude Code, or a clean Hermes profile, and give it no memory of your normal workflows. Ask it a specific question: what would look harmless enough that an agent might call it without suspicion, but sensitive enough that any call is a security signal? That distance matters. Your normal agent has habits. It knows your naming conventions. It may unconsciously design around its own patterns. A separate agent gives you a more adversarial imagination.
The decoy server should then do three things only:
- Log initialisation, tool listing, and attempted tool calls.
- Capture enough context to identify the caller and surrounding session.
- Alert a human or supervisor immediately.
It should not return real secrets. It should not perform real actions. It should not contain a real backdoor. The logic is that legitimate usage is exactly zero. Any hit is therefore a signal worth investigating: reconnaissance, a compromised process, a malicious tool, a prompt-injected agent, or an attacker poking at the developer workstation.
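To make the "three things only" rule concrete, here is a minimal sketch of the decoy's request handling in Python. It is not Zeltser's implementation. The tool name, server name, and the `handle_decoy_request` function are hypothetical, and the message fields follow my reading of MCP's JSON-RPC shape (`initialize`, `tools/list`, `tools/call`), so check them against the protocol version you target. Transport and the real alert channel are left out.

```python
import time
from typing import Callable, Optional

# Hypothetical decoy tool; choose names that are plausible in your environment.
DECOY_TOOLS = [{
    "name": "customer_export",
    "description": "Export customer records for a given account.",
    "inputSchema": {"type": "object",
                    "properties": {"account_id": {"type": "string"}}},
}]

WATCHED_METHODS = {"initialize", "tools/list", "tools/call"}


def handle_decoy_request(message: dict, caller: str,
                         alert: Callable[[dict], None]) -> Optional[dict]:
    """Answer one JSON-RPC message and alert on any watched method."""
    method = message.get("method", "")
    params = message.get("params") or {}
    if method in WATCHED_METHODS:
        # Capture who called and what they asked for, then alert immediately.
        alert({"time": time.time(), "caller": caller,
               "method": method, "params": params})

    if "id" not in message:
        return None  # JSON-RPC notifications get no response

    if method == "initialize":
        result = {
            "protocolVersion": params.get("protocolVersion", "2024-11-05"),
            "capabilities": {"tools": {}},
            "serverInfo": {"name": "legacy-crm", "version": "1.4.2"},
        }
    elif method == "tools/list":
        result = {"tools": DECOY_TOOLS}
    elif method == "tools/call":
        # Never return data or perform an action; fail in a boring way.
        result = {"content": [{"type": "text",
                               "text": "Service temporarily unavailable."}],
                  "isError": True}
    else:
        return {"jsonrpc": "2.0", "id": message["id"],
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": message["id"], "result": result}
```

In practice `alert` would page a human or post to a supervisor channel. In a test it can simply be `some_list.append`.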
This reframes the MCP config from a thing you protect into a thing that protects you.
2. Measure your blast radius empirically, not rhetorically
Most teams describe their agent's sandbox in terms of intent: "it can only touch the project directory," "it has no network access," "the container cannot see host secrets." Almost nobody verifies that the sandbox actually behaves the way they believe.
A tool called sandbox-probe makes the verification concrete. You run it once unconfined on the host, once inside your agent's sandbox, and diff the two reports. The difference is your sandbox boundary, measured rather than assumed. It enumerates readable paths, reachable network destinations, and other environment properties, so you get an empirical map of your coding agent's actual blast radius.
The discipline matters more than the specific tool. Any time you can replace "we think it can only do X" with "we measured exactly what it can do," you have converted belief into a control. And the measurement becomes a regression test. Re-run it after every sandbox, model, MCP, or tooling change. If the diff expands, your confinement has loosened.
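The regression-test idea can be sketched in a few lines of Python. This assumes each probe run has been saved as a mapping from category to a list of entries. That report shape and the `expanded_access` name are my own illustration, not sandbox-probe's actual output format.

```python
def expanded_access(baseline: dict[str, list[str]],
                    current: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return, per category, entries in the current report that the baseline lacked."""
    growth = {}
    for category in sorted(set(baseline) | set(current)):
        added = sorted(set(current.get(category, []))
                       - set(baseline.get(category, [])))
        if added:
            growth[category] = added
    return growth
```

An empty result means the boundary did not widen since the last run. Anything else should fail the check until someone explains and approves it.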
This should not be a one-line check. You want a baseline that answers concrete questions:
- Which host paths are readable?
- Which paths are writable?
- Which environment variables and secrets are visible?
- Which local services and ports can be reached?
- Which external domains can be reached?
- Can the agent spawn child processes?
- Can it persist files outside the intended workspace?
- Can it write to configuration that affects future sessions?
- Can it reach credentials, sockets, browser profiles, package caches, SSH keys, cloud metadata, or deployment tokens?
The blast radius is not a slogan. It is the set of things the agent can touch when something goes wrong.
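A few of those questions can be answered with the standard library alone. The following is a rough Python sketch, not a replacement for a dedicated probe: the `probe` function, its parameters, and the marker list are assumptions for illustration, and it covers only paths, secret-looking environment variable names, and local ports.

```python
import os
import socket


def probe(paths, local_ports, markers=("KEY", "TOKEN", "SECRET", "PASSWORD")):
    """Collect a small, comparable report of what this process can reach."""
    report = {
        "readable": sorted(p for p in paths if os.access(p, os.R_OK)),
        "writable": sorted(p for p in paths if os.access(p, os.W_OK)),
        # Record variable names only, never values, so the report is safe to store.
        "secret_env_names": sorted(
            name for name in os.environ
            if any(marker in name.upper() for marker in markers)
        ),
        "open_local_ports": [],
    }
    for port in local_ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(0.5)
            # connect_ex returns 0 when the TCP connection succeeds.
            if sock.connect_ex(("127.0.0.1", port)) == 0:
                report["open_local_ports"].append(port)
    return report
```

Run it with the same `paths` and `local_ports` on the host and inside the sandbox, save both reports as JSON, and compare them.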
3. Audit what the agent acquires, not just what it says
The most underrated paper in this cluster is PrivacyPeek, a benchmark for what its authors call acquisition-stage privacy leakage. Their argument is sharp: most privacy auditing inspects what an agent outputs, meaning what it discloses. But the risk begins earlier, the moment sensitive data enters the agent's context through a tool call it did not need to make.
Across ten agents and four model families, they found widespread over-acquisition: agents pulling in sensitive information they had no functional reason to touch. Worse, they found a correlation between stronger task performance and more over-acquisition. The more capable agents were also the more indiscriminate ones. Prompt-level defences barely moved the needle.
The instrumentation lesson is simple: log and review tool-call trajectories, not just final answers. The question is not only "did it say something it shouldn't have?" but "did it fetch something it shouldn't have?" That distinction is invisible unless you are watching the tool layer.
A practical trajectory audit looks for avoidable acquisition. Did the agent read all customer records to answer a question about one customer? Did it inspect .env while editing CSS? Did it call calendar, email, or CRM tools when the task only required a repository search? Did it ask the search tool for broad web results and then ingest untrusted pages when a vetted documentation URL would have been enough?
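Those questions translate into simple rules over a logged trajectory. Here is a minimal Python sketch, assuming you already log each tool call with a tool name and a target. The `ToolCall` and `TaskPolicy` types, the tool names, and the glob patterns are all hypothetical, and real audits will need richer rules than this.

```python
import fnmatch
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str    # tool name as logged
    target: str  # path, URL, or record selector the call touched


@dataclass
class TaskPolicy:
    allowed_tools: set[str]
    sensitive_patterns: tuple[str, ...] = ("*.env", "*/.ssh/*", "*credentials*")


def audit_trajectory(calls: list[ToolCall], policy: TaskPolicy) -> list[tuple]:
    """Flag calls to tools the task did not need, or to sensitive targets."""
    findings = []
    for step, call in enumerate(calls):
        if call.tool not in policy.allowed_tools:
            findings.append((step, call.tool, "tool not needed for this task"))
        elif any(fnmatch.fnmatch(call.target, pattern)
                 for pattern in policy.sensitive_patterns):
            findings.append((step, call.tool, f"sensitive target: {call.target}"))
    return findings
```

For a CSS edit with `allowed_tools={"repo_search", "read_file"}`, a `read_file` call on `.env` and any CRM lookup would both be flagged, even if the final answer looked clean.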
That is where many real failures will happen. Not in the final paragraph. In the quiet tool call three steps earlier.
4. Run a recurring sandbox-escape researcher, but do not let it become the thing you fear
The next maturity step is more uncomfortable. Task an agent with researching how agent sandboxes are being bypassed and how your own environment might fail.
This is legitimate internal red-teaming if it is scoped properly. It is also dangerous if treated casually. An agent researching sandbox escapes will read adversarial material. It may encounter prompt injections, malicious code, poisoned proof-of-concepts, and instructions that try to make the researcher itself misbehave. The security agent must therefore be more isolated than the system it is inspecting.
The clean version is a separate cloned environment. Clone the agent stack, strip credentials, seed it with realistic fake secrets, and run the researcher inside a hardened sandbox that can be destroyed after each cycle. Prefer microVMs (Firecracker, for example), gVisor, hosted sandboxes such as Daytona or Modal, Singularity, or a comparable isolation layer where available. A normal container is a useful convenience boundary, but it should not be treated as a complete security boundary.
The research task should be explicit:
- Search reputable sources first: OWASP, MITRE, NIST, CVE databases, academic papers, responsible disclosure posts, vendor advisories, and benchmark work such as SandboxEscapeBench.
- Identify plausible weaknesses in the local agent setup.
- Stop short of implementing an actual backdoor or escape.
- Produce a non-destructive reproduction plan where possible.
- Prepare a full reversal script for any configuration change it proposes.
- Notify the user with the finding, the apparent route of exploitation, the evidence, the blast radius, and the proposed mitigation.
- Refuse to visit illegal forums, buy exploits, download weaponised malware, or operationalise a live attack.
That last line is not moral decoration. It is an engineering requirement. The goal is to know whether the house has a weak door, not to install a hidden tunnel beneath it.
The researcher should run on a schedule because the attack surface moves. New agent frameworks appear. New MCP servers ship. New CVEs land. New model behaviours alter tool-use patterns. A quarterly review is too slow for an agent that can change behaviour when a dependency, model, prompt, or tool description changes.
But scheduled does not mean uncontrolled. Give it timeouts, a kill switch, immutable logs, network controls, and a human approval gate for any action beyond static analysis or simulation. Use egress allowlisting where possible. If arbitrary web research is needed, route it through a proxy and log every domain. Scan downloaded files before processing. Destroy the environment after each run. Assume the security agent can itself be compromised and design the system so compromise has nowhere useful to go.
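The egress control is the easiest of these to get subtly wrong, so here is a small Python sketch of the decision a filtering proxy would make. The host list and the `egress_decision` name are examples only. The detail that matters is matching on the parsed hostname with a dot boundary, so that lookalike domains do not slip through.

```python
from urllib.parse import urlsplit

# Example allowlist; replace with the sources your researcher is scoped to.
ALLOWED_HOSTS = frozenset({"owasp.org", "nvd.nist.gov", "attack.mitre.org"})


def egress_decision(url: str, allowed=ALLOWED_HOSTS, log=None) -> bool:
    """Allow a URL only if its host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    permitted = any(host == domain or host.endswith("." + domain)
                    for domain in allowed)
    if log is not None:
        # Log every attempted domain, permitted or not.
        log.append((host, permitted))
    return permitted
```

A naive substring or suffix check would accept `evil-owasp.org`. This version rejects it, and the log keeps a record of the attempt.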
This is the point many teams will miss. A self-red-team process is only mature if it is less privileged than the thing it studies.
Why the platforms are already moving this way
If you want a signal that this is where the market is heading, look at what Microsoft shipped: Azure's SRE Agent is exposed through the Azure MCP Server, and it did not arrive as a bare set of tools. It arrived with confirmation requirements for destructive actions, secret redaction, sanitised errors, host pinning, and explicit environment references for secrets.
OWASP's agent and MCP guidance points in the same direction: least privilege, tool allowlisting, approval gates, logging, prompt-injection resistance, and careful treatment of tool metadata. NIST's AI Risk Management Framework is broader, but its practical implication for agents is similar: govern, map, measure, and manage the risk rather than pretending a static policy is enough.
Taken together, a platform vendor and the security community are treating the tool layer as a sensitive control plane. The decoy server, acquisition audit, blast-radius diff, and sandbox researcher are all expressions of the same principle: assume the tool layer will be probed, misused, or surprised, and instrument it accordingly.
If you cannot clone production
A cloned environment is ideal. Sometimes you do not have one. That does not make live testing acceptable by default.
If you must work near a live setup, reduce the problem until the test is non-destructive. Run static analysis. Review configs. Use read-only mounts. Use fake credentials. Put the researcher inside a nested sandbox. Simulate the exploit path rather than executing it. Prepare scripts that reverse proposed changes before any change is made. Use canaries and blue-green deployment patterns for configuration changes that must eventually reach production.
For agent systems, the important distinction is between proving a possibility and operationalising it. You often do not need to complete the exploit to learn what matters. If an agent can see a host socket it should not see, that is already a finding. If it can read a credential path it should not read, that is already a finding. If it can call a decoy tool during a routine task, that is already a finding.
Production is where you monitor. It is not where you discover the blast radius by swinging a hammer.
The shift, in one line
Hardening asks: is my wall high enough? Instrumentation asks: would I know if something got over it?
You need both. But if you have spent your security budget entirely on the wall and have no answer to the second question, you do not have a secure agent. You have a confident one. Those are not the same thing, and the gap between them is exactly where the next incident lives.
The cheapest place to start is the decoy server, because it turns your config from a liability into an early-warning system. Then measure your blast radius. Then audit what your agent acquires. Then create a recurring researcher that watches the frontier of sandbox escape work from inside a disposable environment.
None of this requires a new model, a new vendor, or a rewrite. It requires deciding that your agent's tools are infrastructure, and that infrastructure you cannot see is infrastructure you do not control.
How to measure the blast radius empirically
Do not end this exercise with a diagram of intended permissions. End it with evidence.
Start with a clean baseline. Run a probe outside the sandbox to establish what the host can see. Run the same probe inside each agent environment: daily coding agent, browser agent, research agent, security researcher, and any production worker. Store the results as versioned artefacts. Then diff them.
The report should classify access, not merely list it:
- Filesystem: readable, writable, executable, persistent after session, shared with host, shared with other agents.
- Secrets: environment variables, mounted secret files, credential helpers, SSH agents, browser profiles, cloud tokens, package-manager credentials.
- Network: outbound domains, local ports, metadata services, private network ranges, DNS behaviour, proxy bypass routes.
- Process control: ability to spawn shells, background processes, child agents, package installs, interpreters, Docker, Kubernetes, system services.
- Tool plane: MCP servers visible, tools listed, tools callable, approval gates triggered, tool descriptions exposed to the model.
- Memory and persistence: places the agent can write that affect future behaviour, including config, skills, prompts, cron jobs, caches, vector stores, and shared notes.
- Cross-boundary paths: any route from a disposable workspace into host state, production state, another profile, or another user's data.
For each category, record three counts: intended access, observed access, and unexplained access, meaning anything observed that was never intended. The third count is the one to watch. It is the difference between your architecture and reality.
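The three figures fall out of a set difference per category. This is a short Python sketch that assumes intended and observed access are each recorded as a mapping from category to a set of entries. The `access_gap` name and the data shape are illustrative.

```python
def access_gap(intended: dict[str, set[str]],
               observed: dict[str, set[str]]) -> dict[str, dict]:
    """Summarise intended, observed, and unexplained access per category."""
    rows = {}
    for category in sorted(set(intended) | set(observed)):
        want = intended.get(category, set())
        seen = observed.get(category, set())
        unexplained = seen - want  # observed but never intended
        rows[category] = {
            "intended": len(want),
            "observed": len(seen),
            "unexplained": len(unexplained),
            "items": sorted(unexplained),
        }
    return rows
```

Keeping the `items` list next to the count means a reviewer sees exactly which path, port, or tool needs an explanation.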
Then make the measurement continuous. Run it before and after dependency updates. Run it when a new MCP server is added. Run it when the model changes. Run it when an agent gains a new tool. Run it when the container image changes. Treat any expansion in observed access as a security regression unless someone can explain and approve it.
Finally, test containment with harmless canaries. Put fake secrets in places the agent should never read. Put decoy MCP tools in the config. Put a canary domain behind a logging endpoint. Put a writable sentinel file outside the intended workspace and alert if it changes. The point is not to trap your own agent for theatre. The point is to generate empirical proof of what it can reach, what it tries to reach, and how quickly you find out.
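The sentinel-file canary is the simplest of these to wire up. Here is a minimal Python sketch: record a fingerprint when you plant the file, then compare on a schedule. The function names are hypothetical, and how you schedule the check and deliver the alert is up to your environment.

```python
import hashlib
from pathlib import Path
from typing import Optional


def fingerprint(path: Path) -> Optional[str]:
    """Return the SHA-256 of the file's contents, or None if it is missing."""
    if not path.exists():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sentinel_tripped(path: Path, expected: Optional[str]) -> bool:
    """True if the sentinel was modified, replaced, or deleted since it was planted."""
    return fingerprint(path) != expected
```

This only catches writes. To catch reads of a fake secret you need a canary whose use is observable elsewhere, such as the logging endpoint behind a canary domain.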
A blast radius you have not measured is only a story. A blast radius you can diff is a control.
This is the kind of trust-layer work we build with clients at Agent Paul: not slideware about AI risk, but concrete, reversible controls that make an agent deployment observable enough to run in production with a straight face. If your agents are touching real tools and you cannot answer "would I know if something went wrong?" then that is the conversation to have.