Jailbreaking
Jailbreaking is an attempt to bypass a model’s safety rules or restrictions. In large language models, it usually means using carefully crafted prompts, context or interaction patterns to make the model produce outputs that it is supposed to refuse or handle safely.
In large language models, safety rules are used to reduce harmful, illegal, deceptive, private or otherwise unsafe outputs. Jailbreaking tries to work around those rules. It is therefore both a security issue and a model safety issue.
A jailbreak does not normally change the model’s weights. It manipulates the model at inference time through input, conversation context, role framing, external content or tool interaction. The model remains the same, but the attacker tries to make it behave outside its intended boundaries.
Jailbreaking means trying to make an AI model ignore, bypass or weaken its safety restrictions. It is usually done through adversarial prompts or manipulated context.
What jailbreaking means
Jailbreaking is the practice of trying to bypass restrictions placed on an AI model. These restrictions may be designed to prevent harmful instructions, unsafe advice, privacy violations, copyrighted text reproduction, malicious code, policy evasion or other unwanted behaviour.
The term originally comes from computing, where jailbreaking meant removing restrictions from a device or operating system. In AI, the meaning is different. It usually refers to prompt-based attempts to make a model produce restricted output.
For example, a user may try to reframe a harmful request as fiction, testing, roleplay or research. The goal is not normal model use. The goal is to make the model cross a boundary it was designed to maintain.
Why jailbreaking matters
Jailbreaking matters because large language models are increasingly used in products, internal systems, search tools, customer support, coding assistants, document analysis, education, healthcare, finance and AI agents.
If a model can be manipulated into ignoring its safeguards, the result can be more serious than a bad answer. It may help users obtain dangerous information, reveal confidential data, misuse tools, generate deceptive content or automate harmful workflows.
The risk is higher when the model is connected to tools. A standalone chatbot may produce unsafe text. A tool-using AI agent may also send messages, inspect files, call APIs, run code or update business systems.
Jailbreaking is not only about making a chatbot say something forbidden. In agentic systems, bypassed safeguards can affect real actions, tools and data flows.
Jailbreaking vs prompt injection
Jailbreaking and prompt injection are closely related, but they are not identical.
Prompt injection is a broader term. It means manipulating the model’s behaviour through crafted input, often by mixing instructions and data in a way that changes what the model follows.
Jailbreaking is usually a specific type of prompt-based attack focused on bypassing safety rules or restrictions.
In simple terms:
- Prompt injection – tries to alter the model’s intended behaviour.
- Jailbreaking – tries to bypass safety restrictions.
A jailbreak can be considered a form of prompt injection when the attack works by injecting instructions that undermine the model’s safety behaviour.
Jailbreaking vs normal prompt engineering
Prompt engineering is legitimate instruction design. It helps the model produce useful, structured and accurate outputs.
Jailbreaking is different. It tries to defeat restrictions or make the model do something it should not do.
The difference is not only the prompt format. It is also the goal:
- Prompt engineering – improves task performance within allowed boundaries.
- Jailbreaking – attempts to bypass those boundaries.
A well-designed prompt can make an answer clearer. A jailbreak tries to make the model ignore or weaken the rules that should govern the answer.
How jailbreaking works at a high level
Jailbreaking works by exploiting the fact that language models follow patterns in text. The attacker tries to create a context where the model treats the restricted request as acceptable, transformed, indirect or lower risk than it really is.
This can involve:
- role framing – trying to place the model in a fictional or alternative role,
- context manipulation – surrounding a restricted request with misleading context,
- instruction conflict – creating tension between user instructions and safety rules,
- multi-step pressure – gradually moving the conversation toward restricted content,
- format transformation – asking for restricted content in a different form,
- indirect requests – hiding the unsafe goal behind a seemingly harmless task,
- external content – placing manipulative instructions inside documents, webpages or tool outputs.
These are categories, not instructions. The important point is that jailbreaking tries to confuse the model about which rules and context should control the answer.
Direct jailbreaking
Direct jailbreaking happens when the user directly enters adversarial instructions into the model interface.
The user may try to convince the model that restrictions do not apply, that the task is harmless, that the model should adopt a special role or that it should ignore its normal boundaries.
Direct jailbreaking is common in public chatbots because the attacker can interact with the model directly and try many prompt variations.
A robust system should recognise these attempts and keep the model within its intended safety boundaries.
Indirect jailbreaking
Indirect jailbreaking happens when the attack is hidden inside external content that the model reads.
For example, a model may be asked to summarise a webpage, email, PDF, spreadsheet or code file. That external content may contain instructions designed to weaken the model’s behaviour.
This is especially relevant for agentic AI. Agentic systems often read untrusted content and then decide what to do next. If external content can change the system’s safety behaviour, the risk becomes serious.
Indirect jailbreaking overlaps strongly with indirect prompt injection.
External content should be treated as data, not as instruction. A webpage, email or document should not be allowed to rewrite the AI system’s safety rules.
Jailbreaking and AI agents
Jailbreaking becomes more dangerous when the model is part of an AI agent.
A simple chatbot can answer incorrectly. An AI agent can also take steps. It may search files, call tools, draft emails, update records, run code, inspect private data or communicate with external systems.
If a jailbreak changes the agent’s behaviour, the system may perform actions that the user, developer or organisation did not intend.
This is why AI agents need stronger controls than ordinary prompt-response systems. Safety should not depend only on the model refusing bad requests. Tool permissions, approval gates, logs and validation should also be used.
Jailbreaking and tool use
Tool use increases the possible impact of jailbreaking.
If a model has no tools, a jailbreak may produce unsafe text. If a model has tools, a jailbreak may influence actions.
Possible risks include:
- calling the wrong tool,
- accessing unnecessary files,
- sending information to the wrong recipient,
- running unsafe code,
- creating misleading records,
- changing data without approval,
- leaking confidential information,
- escalating from a harmless request into a harmful workflow.
The more powerful the tools, the stricter the controls should be.
Jailbreaking and model alignment
Model alignment means making an AI system behave according to intended human values, safety rules and task requirements.
Jailbreaking tests the limits of that alignment. It asks whether the model follows safety behaviour only in normal cases, or whether it can still follow it under adversarial pressure.
This is difficult because safety is not only a list of blocked words. The model must understand context, intent, risk and allowed alternatives.
For example, cybersecurity education can be legitimate, but operational abuse is not. Medical information can be useful, but dangerous personalised instructions are risky. A model needs to distinguish between safe explanation and harmful enablement.
Jailbreaking and red teaming
Red teaming is the practice of deliberately testing a system against adversarial behaviour. In AI, red teams try to find ways a model or AI application can fail before attackers or users discover those weaknesses in production.
Jailbreak testing can be part of AI red teaming. The goal of responsible red teaming is not to spread jailbreaks. The goal is to find vulnerabilities, document them, improve safeguards and reduce real-world risk.
Good red teaming should be controlled, logged and authorised. It should use safe test environments and avoid exposing harmful details unnecessarily.
Why jailbreaks are hard to eliminate completely
Jailbreaks are hard to eliminate because language models work with flexible natural language. The same idea can be expressed in many ways. Attackers can reframe, translate, encode, split, obscure or gradually build toward a restricted request.
Traditional software security often separates code from data more strictly. LLM systems are different because instructions, user requests and external content are all represented as text or tokens inside the model context.
This does not mean defence is impossible. It means defence must be layered. A single system prompt or keyword filter is not enough.
Common jailbreak goals
Jailbreak attempts can have different goals.
Common goals include:
- bypassing safety refusals – trying to make the model answer restricted requests,
- extracting hidden instructions – trying to reveal system or developer prompts,
- weakening content rules – trying to make the model ignore policy boundaries,
- obtaining harmful procedural information – trying to get actionable unsafe details,
- misusing tools – trying to make an AI agent perform an unsafe action,
- exfiltrating data – trying to reveal private, confidential or unrelated information,
- creating deceptive content – trying to produce fraud, impersonation or manipulation at scale.
These goals show why jailbreak resistance matters for both safety and security.
Jailbreaking and system prompt leakage
System prompt leakage is one common jailbreak target. The attacker tries to make the model reveal hidden system instructions, tool descriptions or internal rules.
This can matter because hidden instructions may describe safety behaviour, business logic, moderation criteria, tool access, routing rules or implementation details.
However, keeping the system prompt secret is not enough. A secure AI application should assume that some instructions may eventually be inferred or exposed. The real protection should come from strong permissions, separation of data and instructions, validation and monitoring.
Jailbreaking and data leakage
Data leakage is another serious risk. A jailbreak may try to make the system reveal information that should remain private.
This may include:
- private user data,
- internal documents,
- conversation history,
- tool outputs,
- API responses,
- credentials or secrets,
- customer records,
- confidential business information.
A model should not receive data it does not need. Access control should happen before data reaches the model, not only after the model generates output.
Jailbreaking and RAG systems
RAG means retrieval-augmented generation. In a RAG system, the model retrieves documents or passages and uses them as context.
RAG can improve factual grounding, but it also creates new jailbreak and prompt injection risks. Retrieved documents may contain adversarial text, hidden instructions or misleading content.
The model should treat retrieved text as source material, not as authority over system behaviour.
For example, a retrieved document can be relevant to the user’s question and still contain unsafe instructions. Relevance does not mean trust.
Jailbreaking and multimodal models
Multimodal models can process images, screenshots, PDFs, audio, video or other input types.
This expands the jailbreak surface. Instructions may be hidden in images, small text, screenshots, slides, captions, metadata or document layouts.
A user may ask the model to analyse an image. The image may contain text that tries to manipulate the model. The model should recognise that the text is part of the image content, not an instruction to follow.
Multimodal systems therefore need the same separation principle: untrusted content remains data.
Jailbreaking and embeddings
Embeddings are numerical representations used for search, clustering and retrieval. They do not remove jailbreak risk.
A RAG system may use embeddings to retrieve relevant documents. If one of those documents contains manipulative instructions, the model may still be exposed to a jailbreak or prompt injection attempt.
Embedding search can find relevant content. It does not automatically determine whether that content is safe to follow.
This is why retrieval systems need source trust, content filtering, instruction separation and tool restrictions.
Jailbreaking and excessive agency
Excessive agency means giving an AI system more autonomy or permissions than it needs.
Jailbreaking is more dangerous when excessive agency exists. If the system can read broad data, call many tools and act without confirmation, a successful jailbreak has more room to cause damage.
A safer design gives the system only the minimum required permissions and requires human approval for sensitive actions.
For example:
- drafting an email is lower risk than sending it automatically,
- reading a specific document is lower risk than searching all private files,
- suggesting a database update is lower risk than executing it immediately,
- summarising logs is lower risk than changing production settings.
How to reduce jailbreaking risk
Jailbreaking cannot be solved by one sentence in a system prompt. It needs layered defence.
Useful measures include:
- clear safety policies – define what the model should refuse or handle carefully,
- model-level safeguards – train and evaluate models for safe behaviour,
- input classification – detect adversarial or unsafe requests,
- output filtering – check generated content before it is returned or used,
- tool restrictions – limit what the model can do,
- least privilege – give access only to data and tools needed for the task,
- human approval – require review for sensitive or irreversible actions,
- context separation – distinguish trusted instructions from untrusted data,
- red teaming – test the system against adversarial use,
- monitoring – review failures, blocked attempts and unexpected tool use.
Why keyword filters are not enough
Keyword filters can block some obvious unsafe requests, but jailbreaks are often designed to avoid simple filters.
A harmful request may be rephrased, split across multiple messages, hidden in a scenario, translated, encoded or framed indirectly. A keyword list cannot reliably understand intent.
This does not mean filters are useless. They can be one layer. But they should not be the only layer.
A stronger system combines filtering with model safeguards, tool permissions, policy enforcement, human review and logging.
Jailbreak defence should not depend on detecting one exact phrase. The same unsafe intent can be expressed in many different ways.
Least privilege as a defence
Least privilege means the AI system should have only the permissions needed for the current task.
If the model only needs to answer a public question, it should not access private documents. If it only needs to draft a message, it should not send it. If it only needs to inspect one record, it should not have access to the whole database.
Least privilege reduces the damage if a jailbreak succeeds.
This principle is especially important for enterprise AI systems, coding agents, customer support agents, CRM assistants and AI tools connected to internal systems.
Human-in-the-loop controls
Human-in-the-loop means that a person reviews or approves important actions.
This is important when the AI system can:
- send external communication,
- delete or overwrite data,
- change customer records,
- approve refunds,
- modify access rights,
- execute code,
- use sensitive information,
- make recommendations in regulated contexts.
The model can prepare work, but a human should approve high-impact actions. This helps reduce the practical impact of jailbreaking and other model failures.
Context separation
Context separation means clearly separating different types of information inside the AI system.
A system may contain:
- system instructions,
- developer instructions,
- user instructions,
- retrieved documents,
- tool outputs,
- conversation history,
- external webpages,
- uploaded files.
Not all of these should have the same authority. A retrieved webpage should not override system safety rules. A user-uploaded file should not become a new policy. A tool output should not instruct the agent to perform unrelated actions.
Strong agent design should make these boundaries explicit.
Output validation
Output validation means checking the model’s answer before it is shown to the user or passed to another system.
This is important because unsafe model output can become dangerous when used downstream. For example, generated code may be executed, generated SQL may query a database, generated email text may be sent externally, or generated instructions may be followed by a human.
Output validation may include:
- checking for unsafe content,
- checking for private data,
- checking whether tool calls match the user request,
- checking whether generated code is safe to run,
- checking whether the answer cites valid sources,
- requiring human review for sensitive outputs.
The output should not be trusted blindly only because it came from an AI model.
Jailbreak testing
Jailbreak testing should be part of AI evaluation. It helps teams find weaknesses before deployment.
A responsible test process should include:
- direct adversarial prompts,
- multi-turn manipulation attempts,
- attempts to reveal hidden instructions,
- attempts to access unrelated data,
- unsafe requests hidden inside normal tasks,
- indirect attacks inside documents or webpages,
- tool misuse scenarios,
- multimodal inputs containing embedded text,
- edge cases from real user behaviour.
The purpose of testing is not to publish working jailbreaks. The purpose is to improve model and system resilience.
Jailbreak monitoring after deployment
Even if a system performs well during testing, new jailbreak attempts may appear later. Users and attackers continuously experiment with new wording, formats and attack chains.
Monitoring can include:
- blocked request rates,
- unsafe output reports,
- tool call anomalies,
- unusual data access,
- repeated refusal patterns,
- escalations to human review,
- security incident logs,
- sampled conversation audits.
Monitoring matters because jailbreak resilience is not a one-time implementation task. It is an ongoing safety and security process.
Common mistakes when thinking about jailbreaking
Jailbreaking is often misunderstood.
Common mistakes include:
- treating it as only a chatbot problem – tool-using agents make it more serious,
- relying only on a system prompt – written rules are not enough,
- using only keyword filters – unsafe intent can be rephrased,
- giving the model too much access – broad permissions increase damage,
- ignoring indirect attacks – external content can manipulate the system,
- not testing multi-turn interactions – attacks may build gradually,
- not logging tool calls – failures become hard to investigate,
- assuming one fix is permanent – jailbreak methods evolve over time.
Jailbreaking in business AI systems
In business AI systems, jailbreaking can affect more than content moderation. It can affect data confidentiality, customer communication, CRM accuracy, compliance workflows, software development and internal knowledge systems.
For example, a customer support assistant may be manipulated into revealing policy details it should not disclose. A coding assistant may be pushed toward unsafe code. A document assistant may expose content from sources outside the user’s permission. A sales assistant may send or draft inappropriate messages.
This is why business deployment needs governance:
- clear ownership,
- defined use cases,
- limited permissions,
- testing before launch,
- human approval for sensitive actions,
- incident response process,
- regular review of model behaviour.
Jailbreaking and AI governance
AI governance means the rules, controls and responsibilities used to manage AI systems safely.
For jailbreaking risk, governance should answer:
- which outputs are disallowed,
- which actions require approval,
- which tools the model can use,
- which data the model can access,
- how jailbreak attempts are detected,
- how failures are logged,
- who reviews incidents,
- how the system is updated after new risks appear.
Without governance, jailbreak defence becomes reactive. Teams only respond after users discover failures.
When jailbreak testing is legitimate
Jailbreak testing can be legitimate when it is authorised, controlled and used to improve safety.
This includes:
- internal AI red teaming,
- model safety evaluation,
- security testing before deployment,
- bug bounty programmes with clear rules,
- academic research with responsible disclosure,
- enterprise risk assessment.
The same activity can be harmful if it is used to extract restricted content, attack deployed systems, bypass safeguards for misuse or publish actionable instructions that enable abuse.
Intent, authorisation and disclosure process matter.
How to explain jailbreaking simply
Jailbreaking can be compared to trying to trick a security guard into ignoring the building rules.
The guard may have instructions: do not let unauthorised people into restricted rooms. A jailbreak attempt tries to confuse, pressure or reframe the situation so the guard treats the restriction as optional.
In AI, the model is not a human guard. It is a probabilistic system processing language. That makes the problem different, but the intuition is similar: the attacker tries to make the system ignore boundaries.
Jailbreaking = trying to make an AI model bypass its safety boundaries. The main defence is layered safety: model safeguards, limited tools, access control, monitoring and human review.
Related terms
- Large language model (LLM) – a language-focused AI model that processes prompts, context and generated text.
- Prompt injection – an attack or failure mode where external content tries to manipulate the system’s instructions.
- Prompt engineering – the practice of designing prompts so that language models produce useful and controlled outputs.
- AI agent – an AI system that can pursue a goal, use tools and take actions.
- Agentic AI – a broader category of AI systems focused on goal-driven action and autonomy.
- AI red teaming – authorised adversarial testing used to find weaknesses in AI systems.
- System prompt leakage – attempts to reveal hidden system or developer instructions.
- Data leakage – exposing sensitive or private information through model output, tool use or system design.
- RAG – retrieval-augmented generation, where external documents are retrieved and used as context for model output.
- Tool calling – the ability of an AI system to call external tools, APIs or functions.
- Least privilege – the security principle that a system should have only the access required for its task.
- Human-in-the-loop – a design where a person reviews or approves important actions.
- AI governance – policies, processes and controls for safe, auditable and accountable AI use.
- Multimodal models – AI models that process several input types, such as text, images, audio, video or documents.
- Embedding – a numerical representation of content, often used for retrieval, search and similarity matching.
Sources and further reading
- LLM01:2025 Prompt Injection – genai.owasp.org – June 2026 – explains prompt injection and describes jailbreaking as a form of prompt injection where inputs cause the model to disregard safety protocols.
- OWASP Top 10 for Large Language Model Applications – owasp.org – June 2026 – lists prompt injection, insecure output handling, sensitive information disclosure and other major LLM application risks.
- Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile – nvlpubs.nist.gov – June 2026 – discusses cybersecurity and misuse risks in generative AI systems, including prompt injection and related threats.
- Cybersecurity Framework Profile for Artificial Intelligence – nvlpubs.nist.gov – June 2026 – includes AI-focused cyber threat intelligence considerations such as jailbreaks and prompt injection.
- AI Red Teaming Agent – learn.microsoft.com – June 2026 – describes AI red teaming concepts including user-injected prompt attacks and indirect jailbreaks.
- Google’s AI Red Team: the ethical hackers making AI safer – blog.google – June 2026 – explains how AI red teams test real products and features to identify safety, privacy and abuse issues.
- Adversarial Misuse of Generative AI – cloud.google.com – June 2026 – discusses adversarial misuse, prompt injection, safeguards and AI red teaming.
- Advancing Gemini’s security safeguards – deepmind.google – June 2026 – describes automated red teaming and ongoing security safeguards for Gemini.
- Prompt injection is not SQL injection – ncsc.gov.uk – June 2026 – explains why prompt injection and related LLM manipulation risks require architectural thinking rather than one simple fix.
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study – arxiv.org – June 2026 – studies jailbreak prompt patterns and the challenge of evaluating model resistance to adversarial prompting.
Was this article helpful?
Support us to keep up the good work and to provide you even better content. Your donations will be used to help students get access to quality content for free and pay our contributors’ salaries, who work hard to create this website content! Thank you for all your support!
Reaction to comment: Cancel reply