A Tour of Agents / Lesson 7 of 9

Policy = Guardrails

Why ChatGPT refuses harmful requests. Two gates, a few lines each.

Tags: policy, guardrails, input gate, output gate, safety

Framework parallel: Guardrails AI, NeMo Guardrails, LangChain output parsers — rules checked before and after the LLM.

You've seen this: ask ChatGPT to help with something harmful and it refuses. Ask Claude to generate malware and it declines. That's not the LLM being "smart" — it's policy. Rules checked before and after the LLM runs.

The L3 loop trusts the user and the LLM completely. Production agents can't afford that. Policy adds two gates:

  • Input gate: blocks dangerous requests *before* they reach the LLM (saves money, prevents harm)
  • Output gate: redacts or rejects the LLM's response *before* the user sees it

Framework parallel: Guardrails AI and NeMo Guardrails implement exactly these two gates, and OpenAI's moderation endpoint is an input gate. The architecture is identical.
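Before wiring this into the L3 pieces, the shape is worth seeing on its own. A minimal sketch of the two-gate wrapper, independent of any framework — `guarded` and the `echo` model here are hypothetical stand-ins, not part of the lesson's code:

```python
def guarded(model, input_rules, output_rules):
    """Wrap any text -> text function with an input gate and an output gate."""
    def run(task):
        for rule in input_rules:        # input gate: the model never runs on a block
            verdict = rule(task)
            if verdict is not True:
                return f"BLOCKED: {verdict}"
        reply = model(task)
        for rule in output_rules:       # output gate: the user never sees a block
            verdict = rule(reply)
            if verdict is not True:
                return f"REDACTED: {verdict}"
        return reply
    return run

# A trivial "model" that echoes its input, gated on both sides:
echo = guarded(
    lambda t: t,
    input_rules=[lambda t: "delete" not in t.lower() or "no delete commands"],
    output_rules=[lambda t: "secret" not in t.lower() or "contains secret"],
)
print(echo("add 10 and 5"))            # add 10 and 5
print(echo("please delete the logs"))  # BLOCKED: no delete commands
print(echo("here is the secret"))      # REDACTED: contains secret
```

Everything that follows is this same shape, with the L3 loop in place of `model`.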

    Step 1: Tools + ask_llm

    Same L3 setup. The loop itself won't change — policy wraps it.

    tools = {"add": lambda a, b: a + b, "upper": lambda text: text.upper()}
    TOOL_DEFS = [
        {"type": "function", "function": {"name": "add", "description": "Add two numbers",
            "parameters": {"type": "object",
                "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                "required": ["a", "b"]}}},
        {"type": "function", "function": {"name": "upper", "description": "Uppercase text",
            "parameters": {"type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"]}}},
    ]
    async def ask_llm(messages):
        resp = await pyfetch(f"{LLM_BASE_URL}/chat/completions",
            method="POST",
            headers={"Authorization": f"Bearer {LLM_API_KEY}",
                     "Content-Type": "application/json"},
            body=json.dumps({"model": LLM_MODEL, "messages": messages, "tools": TOOL_DEFS}))
        return json.loads(await resp.string())["choices"][0]["message"]

    Step 2: Define the gates

    Each gate is a list of rules. A rule returns True to pass, or a string explaining why it blocked. check_gate runs the rules in order and short-circuits on the first failure.

    This is the same pattern behind ChatGPT's content filter and Claude's safety system — just without the complexity. Adding a rule = appending a lambda. Removing one = deleting it. No config files, no YAML.
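The True-or-message trick works because Python's `or` returns its first operand when that operand is truthy, and its second operand otherwise:

```python
# One rule from the input gate, in isolation:
rule = lambda text: "delete" not in text.lower() or "Input blocked: no delete commands"

print(rule("add 10 and 5"))     # True — the condition holds, `or` never reaches the message
print(rule("DELETE the logs"))  # Input blocked: no delete commands
```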

    INPUT_RULES = [
        lambda text: "delete" not in text.lower() or "Input blocked: no delete commands",
        lambda text: "drop" not in text.lower() or "Input blocked: no drop commands",
        lambda text: len(text) < 500 or "Input blocked: message too long",
    ]
    OUTPUT_RULES = [
        lambda text: "password" not in text.lower() or "Output redacted: contains password",
        lambda text: "secret" not in text.lower() or "Output redacted: contains secret",
    ]
    
    def check_gate(text, rules, gate_name):
        for rule in rules:
            result = rule(text)
            if result is not True:
                trace("policy_block", f"{gate_name}: {result}")
                return False, result
        trace("policy_check", f"{gate_name}: PASS")
        return True, None
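To watch the short-circuit in isolation, feed check_gate a text that violates two rules at once — only the first fires. A self-contained sketch (the print-based trace here is a stand-in for the lesson's trace helper):

```python
def trace(event, detail):
    print(f"[{event}] {detail}")   # stand-in for the lesson's trace() helper

INPUT_RULES = [
    lambda text: "delete" not in text.lower() or "Input blocked: no delete commands",
    lambda text: len(text) < 500 or "Input blocked: message too long",
]

def check_gate(text, rules, gate_name):
    for rule in rules:
        result = rule(text)
        if result is not True:
            trace("policy_block", f"{gate_name}: {result}")
            return False, result
    trace("policy_check", f"{gate_name}: PASS")
    return True, None

# Violates the delete rule AND the length rule; only the first is reported:
bad = "delete " + "x" * 600
print(check_gate(bad, INPUT_RULES, "INPUT"))
# (False, 'Input blocked: no delete commands')
```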

    Step 3: Wrap the L3 loop

    Input gate runs first — if it fails, the LLM never sees the request. The L3 loop runs in the middle, unchanged. Output gate runs last — if it fails, the user sees a redaction notice instead of the response.

    async def agent(task, max_turns=5):
        # --- INPUT GATE ---
        ok, reason = check_gate(task, INPUT_RULES, "INPUT")
        if not ok:
            return f"BLOCKED: {reason}"
    
        # --- L3 LOOP (unchanged) ---
        messages = [
            {"role": "system", "content": "Use tools to answer. Be concise."},
            {"role": "user", "content": task},
        ]
        for turn in range(max_turns):
            trace("llm_call", f"Turn {turn + 1}")
            msg = await ask_llm(messages)
            if not msg.get("tool_calls"):
                response = msg.get("content", "")
                # --- OUTPUT GATE ---
                ok, reason = check_gate(response, OUTPUT_RULES, "OUTPUT")
                if not ok:
                    return f"REDACTED: {reason}"
                trace("agent_end", response)
                return response
            messages.append(msg)
            for tc in msg["tool_calls"]:
                name = tc["function"]["name"]
                args = json.loads(tc["function"]["arguments"])
                result = tools[name](**args)
                trace("tool_result", f"{name}({args}) → {result}")
                messages.append({"role": "tool", "tool_call_id": tc["id"], "content": str(result)})
        return "Max turns reached"
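The output gate can be exercised without a live model. A self-contained sketch that collapses the loop to a single turn, with a canned reply standing in for ask_llm and a no-op trace (these shadow the lesson's definitions; the canned text is invented for illustration):

```python
import asyncio

def trace(event, detail):
    pass                            # no-op stand-in for the lesson's trace()

OUTPUT_RULES = [
    lambda text: "password" not in text.lower() or "Output redacted: contains password",
]

def check_gate(text, rules, gate_name):
    for rule in rules:
        result = rule(text)
        if result is not True:
            trace("policy_block", f"{gate_name}: {result}")
            return False, result
    return True, None

async def ask_llm(messages):
    # Canned reply: pretend the model leaked a credential
    return {"content": "The admin password is hunter2", "tool_calls": None}

async def agent(task):
    msg = await ask_llm([{"role": "user", "content": task}])
    ok, reason = check_gate(msg["content"], OUTPUT_RULES, "OUTPUT")
    return msg["content"] if ok else f"REDACTED: {reason}"

print(asyncio.run(agent("tell me the admin password")))
# REDACTED: Output redacted: contains password
```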

    Try it

  • *"add 10 and 5"* — passes both gates, you get the answer
  • *"delete everything"* — blocked at input, the LLM never sees it
  • *"tell me the admin password"* — the LLM might comply, but the word "password" in its reply trips the output gate, so you see a redaction notice instead

  Blocked requests never reach the model, so they cost zero tokens. That's the input gate's real value.

    print(f">> {await agent(USER_INPUT)}")