The “Pack Hunt” Jailbreak That Took an AI Model Offline — What It Actually Means for Anyone Running AI Agents

“A single query asking for something dangerous gets blocked. Ten innocuous-looking queries that quietly add up to the same dangerous thing did not, until a researcher proved it could be done at scale. That gap is not just an AI safety story. It is a preview of the exact failure mode that shows up in business automation when nobody is watching what the pieces add up to.”

What Actually Happened, in Plain Terms

On June 10, 2026, one day after Anthropic publicly launched its new frontier model, Fable 5, a researcher operating under the name Pliny the Liberator posted on X that he had bypassed the model’s safety classifiers using what he called a “pack hunt”: a coordinated, multi-agent attack that exploits the gap between the safety review a single query triggers and what a decomposed, distributed sequence of queries can collectively produce.

The technique, broken down to its components: Pliny used Unicode, homoglyphs, and Cyrillic character substitution to slip past keyword classifiers scanning for specific terms. He used long-context reference tracking to maintain consistency across a multi-turn conversation without triggering filters built for individual messages. And critically, he used decomposition and recomposition: instead of asking for harmful output directly, he asked a series of individually innocuous scientific sub-questions, then reassembled the separate answers into something actionable that none of the individual questions would have produced on their own.
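To make the keyword-classifier weakness concrete from the defender’s side, here is a minimal Python sketch of why a plain substring filter misses look-alike characters and how normalizing the text first narrows that gap. The blocked term, the tiny look-alike table, and the function names are all assumptions made for illustration; nothing here describes how Anthropic’s classifiers work, and normalization does nothing about the decomposition step described above.

```python
import unicodedata

# Hypothetical, deliberately tiny table of Cyrillic letters that resemble
# Latin ones. A real filter would need a far more complete confusables list.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0445": "x",  # Cyrillic ha
}

# Placeholder term used only for this illustration.
BLOCKED_TERMS = {"secret"}


def normalize(text: str) -> str:
    """Fold compatibility forms (NFKC), lowercase, and map known look-alikes."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)


def naive_filter(text: str) -> bool:
    """Substring match on the raw text: the approach that look-alikes evade."""
    return any(term in text.lower() for term in BLOCKED_TERMS)


def normalized_filter(text: str) -> bool:
    """Same substring match, but applied after normalization."""
    return any(term in normalize(text) for term in BLOCKED_TERMS)


# "secret" written with Cyrillic "ie" in place of both Latin "e" characters.
disguised = "s\u0435cr\u0435t"
print(naive_filter(disguised), normalized_filter(disguised))
```

The design point is that the filter should compare canonical forms rather than raw characters. Even so, this only hardens the keyword layer; it does not look at what a sequence of messages adds up to.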

Each piece, on its own, looked harmless to the safety system. Assembled together, it was not. That is the core mechanism, and it is the part of this story with implications well beyond one company’s model.

Why a Government Agency Got Involved

The public Pliny post, combined with a separate, private claim from an unnamed company that it could replicate a similar jailbreak, led to a US Commerce Department export control order on June 12 that pulled the model offline. Anthropic reviewed the private claim and reportedly found only minor, previously known vulnerabilities, not a new universal bypass. The public post and the private claim were treated inconsistently: one was visible and verifiable, the other was not disclosed publicly.

That inconsistency matters for understanding the story correctly. A careful hype-versus-facts analysis of the incident makes an important distinction: what Pliny demonstrated was sophisticated, but it was not a universal, one-click bypass that defeats all restrictions on any question. The technique required the attacker to already know what specific information they wanted, to figure out how to decompose that request into benign-looking parts, and to correctly reassemble the outputs. That is a real and non-trivial capability. It is not a master key.

The Detail Almost Nobody Is Talking About: The Leaked System Prompt

Alongside the jailbreak outputs, Pliny published the model’s internal system prompt on GitHub, reportedly around 120,000 characters long. This is, by most accounts, the first time the complete system prompt of a publicly deployed frontier-class model has been made available by a third party.

The length of that prompt reveals something structurally important: a significant amount of the model’s safety architecture relies on natural language instructions written into the prompt, rather than restrictions baked directly into the model’s underlying weights. A system prompt can be read, studied, and worked around by anyone with access to it. A restriction embedded in the model’s weights is fundamentally harder to study and circumvent from the outside.

This is the most important technical lesson in the entire story, and it applies directly to anyone building AI automation, not just to frontier model safety teams: instructions you write into a prompt are visible and exploitable by anyone who can see or infer them. Behaviour you actually need to guarantee has to be enforced structurally, not just requested politely in natural language.

Why This Story Matters Far Beyond Frontier AI Labs

The Pack Hunt Pattern Exists in Business Automation Too

Strip away the national security framing and the underlying vulnerability pattern Pliny exposed is one I see constantly, in a less dramatic form, inside business automation stacks: individual steps that look safe in isolation can combine into outcomes nobody approved.

Consider an AI agent with access to your CRM that, on its own, can update a customer record, and another with access to your billing system that, on its own, can apply a discount code. Neither action alone is dangerous. But an automated workflow that chains them together (update a record’s status, then trigger a billing action based on that status, then notify a third system) can produce an outcome that was never reviewed, because every review looked at only one step. The danger lives in the composition, not in any single component.

This is precisely why governance frameworks across the industry, from JPMorgan’s agent identity controls to Boomi’s Agent Control Tower, have converged on reviewing agent actions at the workflow level, not just the individual action level. A single-step audit catches what one agent did. It does not catch what three agents did together that none of them would have been permitted to do alone.
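As a rough illustration of what a workflow-level review can look like in code, the sketch below checks a whole workflow definition against a list of capability combinations that need explicit sign-off. The capability names, the `Step` type, and the policy list are hypothetical; this is not how JPMorgan’s or Boomi’s tooling works, only one simple way to express the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    agent: str       # which agent performs the step
    capability: str  # hypothetical capability label, e.g. "crm.update_status"


# Hypothetical policy: each set lists capabilities that are fine on their own
# but need explicit approval when they appear together in one workflow.
RESTRICTED_COMBINATIONS = [
    frozenset({"crm.update_status", "billing.apply_discount"}),
]


def composition_findings(steps: list[Step]) -> list[frozenset[str]]:
    """Return every restricted combination fully present in the workflow."""
    used = {step.capability for step in steps}
    return [combo for combo in RESTRICTED_COMBINATIONS if combo <= used]


workflow = [
    Step("crm-agent", "crm.update_status"),
    Step("billing-agent", "billing.apply_discount"),
    Step("notify-agent", "chat.post_message"),
]

for combo in composition_findings(workflow):
    print("needs workflow-level approval:", sorted(combo))
```

Each step here would pass a single-step audit. The finding only appears because the check looks at the set of capabilities the workflow uses as a whole.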

The Prompt-Versus-Architecture Lesson for Your Own Automations

Most small business automation relies heavily on prompt instructions to constrain AI behaviour: “only respond to questions about our product,” “never discuss pricing without manager approval,” “always escalate complaints to a human.” These instructions work most of the time. They are not structural guarantees.

The Fable 5 system prompt leak is a vivid demonstration of why prompt-based restrictions are inherently softer than architectural restrictions. If an instruction matters enough that a violation would cause real harm (financial loss, legal exposure, customer relationship damage), it should be enforced by code logic, permission scoping, or a required human approval step, not solely by asking the model nicely in the system prompt. Reserve prompt-based guidance for preferences and tone. Reserve hard architecture for anything where a failure actually costs you something.

Multi-Agent Systems Need Multi-Agent Security Review

As businesses move from single AI assistants toward coordinated multi-agent workflows, the kind being built into Make, n8n, and enterprise platforms like Boomi and Sustain, the pack hunt pattern becomes directly relevant. A multi-agent system where each agent operates within safe individual boundaries can still produce unsafe outcomes if the combination of their outputs is never reviewed as a whole.

The practical takeaway for anyone building agent orchestration: review the workflow’s end-to-end output, not just each agent’s individual contribution. A weekly check of what your fully assembled automation actually produced, not just whether each component behaved, is the equivalent of the safety review the Fable 5 incident shows was missing at scale.
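One way to make that end-to-end check practical is to log every side effect with a run identifier and then read the log per run instead of per agent. The sketch below assumes a hypothetical event format (dictionaries with `run_id`, `agent`, `system`, and `effect` keys); the field names are illustrative, not taken from any particular platform.

```python
from collections import defaultdict


def effects_by_run(events: list[dict]) -> dict[str, list[str]]:
    """Group logged side effects by workflow run, preserving log order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for event in events:
        grouped[event["run_id"]].append(f'{event["agent"]}: {event["effect"]}')
    return dict(grouped)


def runs_spanning_systems(events: list[dict], minimum: int = 2) -> list[str]:
    """Return run IDs whose effects touched at least `minimum` distinct systems."""
    systems: dict[str, set[str]] = defaultdict(set)
    for event in events:
        systems[event["run_id"]].add(event["system"])
    return sorted(run for run, seen in systems.items() if len(seen) >= minimum)


# Hypothetical log entries for two workflow runs.
events = [
    {"run_id": "r1", "agent": "crm-agent", "system": "crm", "effect": "status set to churn-risk"},
    {"run_id": "r1", "agent": "billing-agent", "system": "billing", "effect": "discount applied"},
    {"run_id": "r2", "agent": "crm-agent", "system": "crm", "effect": "note added"},
]

for run_id in runs_spanning_systems(events):
    print(run_id, effects_by_run(events)[run_id])
```

Runs that cross system boundaries are the ones worth reading in full during a periodic review, since that is where combined effects can differ from what any single agent was approved to do.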

The Inconsistency Question Every Business Should Note

One detail in this story deserves attention from a business risk perspective, not just a policy perspective. According to a cybersecurity executive who spoke to Fortune, the government applied an inconsistent standard when it pulled one model offline while leaving other publicly available models with arguably similar exposure online. The same general category of information Pliny extracted through his multi-step technique is, per that reporting, available through other publicly deployed AI models without requiring any bypass at all.

That inconsistency is a reminder that the current AI safety governance landscape is still being defined in real time, through individual incidents and reactive enforcement actions, not through stable, predictable rules. For businesses building automation on any AI provider’s model, this means staying aware that provider-level access, pricing, and availability can change quickly and not always for reasons that are fully predictable in advance. The portable architecture principle discussed in coverage of the agent platform war applies here directly: build with the ability to switch providers, because regulatory and safety actions can remove access faster than typical SaaS vendor changes ever would.

What to Actually Do With This Information

Audit Your Multi-Step Workflows for Composition Risk

Go through your current AI automations and ask, for each multi-step workflow: if I look only at the final combined output of every step together, does anything happen here that none of the individual steps would have been approved to do alone? This is a different question from auditing each step individually, and it is the question the Fable 5 incident shows is easy to overlook.

Move Hard Constraints From Prompts to Architecture

For any rule where a violation would cost you money, legal exposure, or a damaged customer relationship, do not rely solely on a prompt instruction. Enforce it with permission scoping (the agent literally cannot access the system it would need to violate the rule), required approval steps (a human must confirm before the action executes), or hard-coded logic outside the AI’s control. Save prompt-based instructions for tone, style, and preferences where a soft failure is acceptable.
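A minimal sketch of what enforcing a rule outside the prompt can look like: the agent never calls a tool directly, and a small gateway decides whether the action is in scope and whether a human has approved it. The `ToolGateway` class, the action names, and the exception types are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class PermissionDenied(Exception):
    """The action is outside the agent's scope entirely."""


class ApprovalRequired(Exception):
    """The action is in scope but needs a named human approver."""


@dataclass(frozen=True)
class ToolGateway:
    allowed: frozenset[str]         # actions this agent may ever perform
    needs_approval: frozenset[str]  # subset that requires a human sign-off

    def call(
        self,
        action: str,
        handler: Callable[[], str],
        approved_by: Optional[str] = None,
    ) -> str:
        # Permission scoping: out-of-scope actions fail regardless of
        # what the model was told or talked into.
        if action not in self.allowed:
            raise PermissionDenied(action)
        # Required approval step: the handler does not run without an approver.
        if action in self.needs_approval and approved_by is None:
            raise ApprovalRequired(action)
        return handler()


gateway = ToolGateway(
    allowed=frozenset({"crm.read", "billing.apply_discount"}),
    needs_approval=frozenset({"billing.apply_discount"}),
)

print(gateway.call("crm.read", lambda: "record loaded"))
```

The checks run in ordinary code before the handler executes, so a persuasive or manipulated prompt cannot change the outcome. The system prompt can still ask the model to behave, but the guarantee no longer depends on it.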

Keep Your Stack Portable

This incident is a concrete example of how quickly access to a specific AI model can change for reasons unrelated to your business decisions. If your automation is deeply wired into one provider with no abstraction layer, an access change at the provider level becomes your emergency. Building model-agnostic workflows, the principle discussed repeatedly across this year’s major platform announcements, is not theoretical risk management. It is a response to something that has already happened once this year.
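For a sense of what an abstraction layer means in practice, the sketch below routes every model call through one function that tries providers in order. The `ModelProvider` interface, the `StubProvider` class, and the provider names are stand-ins invented for this example; a real adapter would wrap whichever vendor SDK you use.

```python
from typing import Protocol, Sequence


class ProviderUnavailable(Exception):
    """Raised when a provider cannot serve the request."""


class ModelProvider(Protocol):
    name: str

    def complete(self, prompt: str) -> str: ...


class StubProvider:
    """Stand-in for a real vendor adapter; returns a canned response."""

    def __init__(self, name: str, available: bool = True) -> None:
        self.name = name
        self.available = available

    def complete(self, prompt: str) -> str:
        if not self.available:
            raise ProviderUnavailable(self.name)
        return f"[{self.name}] {prompt}"


def complete_with_fallback(providers: Sequence[ModelProvider], prompt: str) -> str:
    """Try each provider in order and return the first successful response."""
    for provider in providers:
        try:
            return provider.complete(prompt)
        except ProviderUnavailable:
            continue
    raise ProviderUnavailable("no configured provider is available")


providers = [StubProvider("primary", available=False), StubProvider("secondary")]
print(complete_with_fallback(providers, "Summarize this ticket"))
```

Because the workflow only depends on `complete_with_fallback`, losing access to one provider becomes a configuration change rather than a rewrite. Prompts and outputs will still differ between models, so a fallback path is worth testing before you need it.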

The Bottom Line

The Pliny the Liberator jailbreak is, at its core, a story about frontier AI safety and government export controls. But the mechanism underneath it, the gap between what individual components are approved to do and what their combination can produce, is a pattern that exists in any sufficiently complex automated system, including the ones running inside ordinary businesses every day.

You do not need to be building a frontier AI model to learn from this. You need to ask the same question Anthropic’s safety team is now being forced to answer at scale: what happens when the individually safe pieces of my system get put together? If you have not asked that question about your own automation stack, this is the week to start.

Frequently Asked Questions

What is a ‘pack hunt’ jailbreak in AI safety terms?

A pack hunt is a coordinated, multi-step jailbreak technique where an attacker decomposes a harmful request into multiple individually benign-looking sub-queries, gets the AI model to answer each one separately, and then reassembles the answers into actionable information the model would have refused to provide if asked directly. The term was coined by the researcher Pliny the Liberator in his June 2026 demonstration against Anthropic’s Fable 5 model.

Does this jailbreak mean AI models are fundamentally unsafe to use in business automation?

No. The technique demonstrated requires significant sophistication, prior knowledge of the target information, and a deliberate decomposition strategy. It is not a universal bypass usable by an average user against any restriction. For business automation use cases, the more relevant lesson is architectural: rely on structural permission controls rather than prompt instructions alone for any rule where a violation carries real cost.

Why does it matter that the AI model’s system prompt was leaked?

A leaked system prompt reveals the natural-language rules the model was instructed to follow, which can be studied and worked around by anyone with access to it. This demonstrates that safety or business rules enforced purely through prompt instructions are inherently more fragile than rules enforced through code-level permissions or access controls, since prompts can potentially be inferred, leaked, or reverse-engineered.

How should I check my own AI automation workflows for similar composition risks?

Review each multi-step automated workflow by examining its complete end-to-end output, not just each individual step. Ask whether the combined result of all steps together produces an outcome that no single step would have been independently approved to produce. For any workflow where the answer is unclear or concerning, add a human review checkpoint at the point where the steps combine, not just at the start or end of the process.

About the Author: Hamza Baig is the founder of Hexona Systems, an AI automation agency serving clients across six continents, and creator of the AI Automation Institute, where over 40,000 entrepreneurs have learned to build and scale automation businesses. He has been featured in GHL Top 50, Yahoo Finance, and Brainz Magazine. Follow him at @hamza_automates.
