Context or Command? How LLM Agents Can Be Tricked into Leaking Data

Originally published on my LinkedIn.

A few days ago, the team at General Analysis published an interesting blog post on how they managed to exploit the Supabase MCP to leak sensitive data from a Supabase app.

Exploit TLDR: General Analysis built a simple customer support app to handle user tickets via agents. The support agent functions as expected - but an attacker could submit a ticket that instructs the backend’s LLM, connected via Supabase MCP, to publish sensitive data from the database.

Why does this work? Because the developer agent (using MCP) has full access to the database and also permission to write replies to tickets. Crucially, by design, the agents in general cannot distinguish between context/data and executable instructions. So when a developer queries the system for a summary of tickets, the malicious instructions embedded in the ticket trigger a leak of sensitive data. (Full post linked in comments.)

Couple of thoughts:

Much of today’s LLM based AI is instruction-following intelligence. While impressive, I believe that’s not going to be enough. Specially in cases where judgement is required.

We need a “situated” intelligence. What I mean by situated is - the intelligence that has a worldview and sense of its situation in that world, such that it can have value judgements about different scenarios. We humans are “situated” in this world through our hopes, fears, ambitions and other emotions that allow us to judge whether something should be done (and not merely if and how it can be done).

A human agent/assistant is “situated” within an org - they have motivations (to do good work), and deterrents (reputational/legal costs). That worldview helps them judge what should be done - not just what can be done. We need Agents to also have similar situational awareness.

This brings me back to a point I made in an earlier post: judgement requires a theory or worldview. And LLMs lack that.

What can we do to avoid such leaks? At first glance, it seems bleak. We’ve essentially built a natural language eval() - and we’re discovering the same limitations we already knew about eval() in code.

There’s no safe way to blindly run arbitrary, user-supplied instructions. And if we can’t reliably classify a python/javascript code as safe, how do we expect to classify much more ambiguous natural language prompts?

So what can we do (until we get to the "situated" intelligence)? I don’t have a foolproof solution yet - but here’s what I believe should be a potential direction:

As we bring agents into production, we cannot treat them casually. We’ll need to rigorously map:

All read/write permissions (including transitive and indirect ones)
All locations where sensitive data may reside
And prove invariants on this (likely massive) graph to ensure sensitive data doesn’t leak beyond some well defined boundaries

(Full disclosure: I work at Symmetry Systems, and thinking about these data flow graphs is pretty much my day job.)