What guardrails are for
- Safety and moderation — Block or flag hate speech, violence, sexual content, self-harm, illicit content.
- PII — Mask or block personally identifiable information (email, phone, SSN, etc.) in text.
- Jailbreak and prompt injection — Detect attempts to bypass or override model instructions.
- NSFW and URL filtering — Filter not-safe-for-work content and control which URLs are allowed.
How guardrails work
- You create a guardrails policy in the platform (or via the Assistant) and enable the types you need (PII, moderation, jailbreak, etc.).
- For each piece of text (e.g. an LLM response), you call the API or SDK with the policy key and the text.
- Limits returns a verdict of allow, block, or escalate, plus a reason. When PII masking is enabled, the response can also include a sanitized version of the text.
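A minimal sketch of this flow in TypeScript, assuming the SDK call described under "Evaluating guardrails" below; the package name, client construction, and response field names (`verdict`, `reason`, `sanitizedText`) are assumptions:

```ts
// Hypothetical package name and client construction.
import { Limits } from "@limits/sdk";

const limits = new Limits({ apiKey: process.env.LIMITS_API_KEY ?? "" });

// Evaluate one piece of text (e.g. an LLM response) against a guardrails policy.
async function checkResponse(text: string): Promise<string> {
  const result = await limits.guard("prod-guardrails", text);

  if (result.verdict === "block") {
    // Reject the content and surface the reason (e.g. a moderation category).
    throw new Error(`Blocked by guardrails: ${result.reason}`);
  }
  if (result.verdict === "escalate") {
    // Flag for human review; how you route this is up to your application.
    console.warn(`Guardrails escalation: ${result.reason}`);
  }
  // If PII masking is enabled, prefer the sanitized text when one is returned.
  return result.sanitizedText ?? text;
}
```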
Platform: Create guardrails
Create and configure guardrails in the Limits UI.
Guardrail types
Each type maps to a section in your policy. When you evaluate a guardrails policy, the text is checked against all enabled types. Use the links below to jump to a specific type.

| Type | Anchor | Description |
|---|---|---|
| PII | #pii | Personally identifiable information — mask or block. |
| Moderation | #moderation | Sexual content, hate, self-harm, violence, illicit. |
| Jailbreak | #jailbreak | Detects attempts to bypass model safety. |
| NSFW | #nsfw | Not-safe-for-work content filter. |
| Prompt injection detection | #prompt_injection_detection | Detects instruction override in prompts. |
| URL filter | #url_filter | Allow or block URLs by scheme and allow-list. |
| Output formatter | #output_formatter | Expected JSON structure for model output. |
PII
The PII guardrail detects personally identifiable information (sensitive personal data) in text, such as model output or user messages. For each entity (email, phone, SSN, credit card, etc.) you can choose:

- Mask — Replace the value with a placeholder so the content can still be used without exposing PII.
- Block — Reject the request or response so the PII is never returned.
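As an illustration, here is roughly what masking might look like when the evaluated text contains an email address and a phone number. This reuses the `limits` client from the earlier sketch; the policy key, placeholder format, and response fields (`verdict`, `sanitizedText`) are assumptions:

```ts
// Hypothetical policy key "support-pii" with email and phone set to "Mask".
const result = await limits.guard(
  "support-pii",
  "Contact me at jane.doe@example.com or +1 555 0100."
);

// result.verdict       -> "allow" (masking keeps the content usable)
// result.sanitizedText -> "Contact me at <EMAIL> or <PHONE>."
//                         (placeholder format is an assumption)
// With the same entities set to "Block", the verdict would be "block" instead.
```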
Moderation
Moderation detects harmful or policy-violating content in text. Categories include:

- Sexual content — Sexual or sexual/minors.
- Hate & harassment — Hate, hate/threatening, harassment, harassment/threatening.
- Self-harm — Self-harm, self-harm/intent, self-harm/instructions.
- Violence — Violence, violence/graphic.
- Illicit activities — Illicit, illicit/violent.
Jailbreak
Jailbreak detection identifies attempts to bypass or override the model’s safety instructions (e.g. “ignore previous rules” or crafted prompts that try to get the model to behave unsafely). When enabled, the guardrail detects such attempts and can block or escalate the request. This is a single on/off switch; there are no per-category options.

NSFW
The NSFW (not-safe-for-work) guardrail filters adult content in requests and responses. When enabled, content classified as NSFW is flagged so you can block or escalate it. Use this when you need to keep outputs appropriate for your audience (e.g. consumer apps or work contexts).

Prompt injection detection
Prompt injection detection looks for attempts to inject or override instructions inside user prompts (e.g. hidden text that tells the model to ignore safety or leak data). When enabled, the guardrail flags such content so you can block or escalate. This complements Jailbreak (which focuses on bypassing safety) by focusing on instruction override in the prompt itself.

URL filter
URL filter controls which URLs are allowed in text. You configure:

- URL allow list — Domains, IPs, or CIDR ranges that are permitted. Only URLs that match the allow list (and allowed schemes) pass.
- Allowed schemes — e.g. `https` only; URLs with other schemes (e.g. `http`, `file`) can be blocked.
- Block user info — Reject URLs that contain username/password segments (e.g. `https://user:pass@example.com`).
- Allow subdomains — When enabled, subdomains of allowed domains are permitted (e.g. `api.example.com` if `example.com` is allowed).
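For illustration, suppose a policy allows only the `https` scheme, allow-lists `example.com`, enables subdomains, and blocks user info. Reusing the `limits` client from the earlier sketch (the policy key and the verdicts shown in comments are assumptions following the behavior described above):

```ts
// Hypothetical policy key with the URL filter configured as described above.
const policy = "link-policy";

await limits.guard(policy, "Docs: https://api.example.com/start");
// -> allow: https scheme, and api.example.com is a subdomain of an allowed domain

await limits.guard(policy, "Mirror: http://example.com/start");
// -> block: http is not an allowed scheme

await limits.guard(policy, "Admin: https://user:pass@example.com/panel");
// -> block: the URL contains user info (username/password)
```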
Output formatter
Output formatter defines the expected JSON structure for model output. When enabled, you provide a JSON schema or template; the guardrail can validate that the model’s response conforms to it (e.g. for structured APIs or downstream parsing). Invalid or non-conforming output can be flagged or blocked. This is useful when you need a fixed shape (for example, an object with `answer` and `confidence` fields) and want to catch malformed responses.
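For example, a fixed shape with `answer` and `confidence` fields could be described with a JSON Schema like the following. This is illustrative only; how the schema or template is attached to the policy is configured in the platform:

```ts
// Illustrative JSON Schema for the expected output shape. A response such as
// {"answer": "Paris", "confidence": 0.92} conforms; a bare string or an object
// missing "confidence" would be flagged or blocked.
const expectedOutput = {
  type: "object",
  properties: {
    answer: { type: "string" },
    confidence: { type: "number", minimum: 0, maximum: 1 },
  },
  required: ["answer", "confidence"],
  additionalProperties: false,
};
```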
Evaluating guardrails (SDK and API)
- SDK: Use `limits.guard(policyKeyOrTag, input)`, where `input` is the text string. See SDK Guardrails.
- API: `POST /api/policies/{policyKey}/evaluate/guardrails` with body `{ "request": { "input": "..." } }`. See API Reference.
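A minimal sketch of the API call: the path and request body follow the reference above, while the base URL and the bearer-token header are assumptions.

```ts
const res = await fetch(
  "https://limits.example.com/api/policies/prod-guardrails/evaluate/guardrails",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Auth mechanism is an assumption; use whatever your deployment requires.
      Authorization: `Bearer ${process.env.LIMITS_API_KEY}`,
    },
    body: JSON.stringify({ request: { input: "Text to evaluate..." } }),
  }
);

const result = await res.json();
// Expected to contain the verdict (allow, block, or escalate), a reason, and,
// when PII masking is enabled, a sanitized version of the text.
console.log(result);
```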