A Tour of Agents / Lesson 7 of 9

Policy = Guardrails

Why ChatGPT refuses harmful requests. Two gates, a few lines each.

Tags: policy, guardrails, input gate, output gate, safety

Framework parallel: Guardrails AI, NeMo Guardrails, LangChain output parsers — rules checked before and after the LLM.

You've seen this: ask ChatGPT to help with something harmful and it refuses. Ask Claude to generate malware and it declines. That's not the LLM being "smart" — it's policy. Rules checked before and after the LLM runs.

The L3 loop trusts the user and the LLM completely. Production agents can't afford that. Policy adds two gates:

  • Input gate: blocks dangerous requests *before* they reach the LLM (saves money, prevents harm)
  • Output gate: redacts or rejects the LLM's response *before* the user sees it

Framework parallel: Guardrails AI and NeMo Guardrails implement exactly these two gates, and OpenAI's moderation endpoint is an input gate. The architecture is identical.
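Before wiring this into the L3 pieces, the shape is worth seeing on its own. A minimal sketch of the two-gate wrapper, independent of any framework — `guarded` and the `echo` model here are hypothetical stand-ins, not part of the lesson's code:

```python
def guarded(model, input_rules, output_rules):
    """Wrap any text -> text function with an input gate and an output gate."""
    def run(task):
        for rule in input_rules:        # input gate: the model never runs on a block
            verdict = rule(task)
            if verdict is not True:
                return f"BLOCKED: {verdict}"
        reply = model(task)
        for rule in output_rules:       # output gate: the user never sees a block
            verdict = rule(reply)
            if verdict is not True:
                return f"REDACTED: {verdict}"
        return reply
    return run

# A trivial "model" that echoes its input, gated on both sides:
echo = guarded(
    lambda t: t,
    input_rules=[lambda t: "delete" not in t.lower() or "no delete commands"],
    output_rules=[lambda t: "secret" not in t.lower() or "contains secret"],
)
print(echo("add 10 and 5"))            # add 10 and 5
print(echo("please delete the logs"))  # BLOCKED: no delete commands
print(echo("here is the secret"))      # REDACTED: contains secret
```

Everything that follows is this same shape, with the L3 loop in place of `model`.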

    Step 1: Tools + ask_llm

    Same L3 setup. The loop itself won't change — policy wraps it.

    tools = {"add": lambda a, b: a + b, "upper": lambda text: text.upper()}
    TOOL_DEFS = [
        {"type": "function", "function": {"name": "add", "description": "Add two numbers",
            "parameters": {"type": "object",
                "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                "required": ["a", "b"]}}},
        {"type": "function", "function": {"name": "upper", "description": "Uppercase text",
            "parameters": {"type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"]}}},
    ]
    async def ask_llm(messages):
        resp = await pyfetch(f"{LLM_BASE_URL}/chat/completions",
            method="POST",
            headers={"Authorization": f"Bearer {LLM_API_KEY}",
                     "Content-Type": "application/json"},
            body=json.dumps({"model": LLM_MODEL, "messages": messages, "tools": TOOL_DEFS}))
        return json.loads(await resp.string())["choices"][0]["message"]

    Step 2: Define the gates

    Each gate is a list of rules. A rule returns True to pass, or a string explaining why it blocked. check_gate runs the rules in order and short-circuits on the first failure.

    This is the same pattern behind ChatGPT's content filter and Claude's safety system — just without the complexity. Adding a rule = appending a lambda. Removing one = deleting it. No config files, no YAML.
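The True-or-message trick works because Python's `or` returns its first operand when that operand is truthy, and its second operand otherwise:

```python
# One rule from the input gate, in isolation:
rule = lambda text: "delete" not in text.lower() or "Input blocked: no delete commands"

print(rule("add 10 and 5"))     # True — the condition holds, `or` never reaches the message
print(rule("DELETE the logs"))  # Input blocked: no delete commands
```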

    INPUT_RULES = [
        lambda text: "delete" not in text.lower() or "Input blocked: no delete commands",
        lambda text: "drop" not in text.lower() or "Input blocked: no drop commands",
        lambda text: len(text) < 500 or "Input blocked: message too long",
    ]
    OUTPUT_RULES = [
        lambda text: "password" not in text.lower() or "Output redacted: contains password",
        lambda text: "secret" not in text.lower() or "Output redacted: contains secret",
    ]
    
    def check_gate(text, rules, gate_name):
        for rule in rules:
            result = rule(text)
            if result is not True:
                trace("policy_block", f"{gate_name}: {result}")
                return False, result
        trace("policy_check", f"{gate_name}: PASS")
        return True, None
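To watch the short-circuit in isolation, feed check_gate a text that violates two rules at once — only the first fires. A self-contained sketch (the print-based trace here is a stand-in for the lesson's trace helper):

```python
def trace(event, detail):
    print(f"[{event}] {detail}")   # stand-in for the lesson's trace() helper

INPUT_RULES = [
    lambda text: "delete" not in text.lower() or "Input blocked: no delete commands",
    lambda text: len(text) < 500 or "Input blocked: message too long",
]

def check_gate(text, rules, gate_name):
    for rule in rules:
        result = rule(text)
        if result is not True:
            trace("policy_block", f"{gate_name}: {result}")
            return False, result
    trace("policy_check", f"{gate_name}: PASS")
    return True, None

# Violates the delete rule AND the length rule; only the first is reported:
bad = "delete " + "x" * 600
print(check_gate(bad, INPUT_RULES, "INPUT"))
# (False, 'Input blocked: no delete commands')
```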

    Step 3: Wrap the L3 loop

    Input gate runs first — if it fails, the LLM never sees the request. The L3 loop runs in the middle, unchanged. Output gate runs last — if it fails, the user sees a redaction notice instead of the response.

    async def agent(task, max_turns=5):
        # --- INPUT GATE ---
        ok, reason = check_gate(task, INPUT_RULES, "INPUT")
        if not ok:
            return f"BLOCKED: {reason}"
    
        # --- L3 LOOP (unchanged) ---
        messages = [
            {"role": "system", "content": "Use tools to answer. Be concise."},
            {"role": "user", "content": task},
        ]
        for turn in range(max_turns):
            trace("llm_call", f"Turn {turn + 1}")
            msg = await ask_llm(messages)
            if not msg.get("tool_calls"):
                response = msg.get("content", "")
                # --- OUTPUT GATE ---
                ok, reason = check_gate(response, OUTPUT_RULES, "OUTPUT")
                if not ok:
                    return f"REDACTED: {reason}"
                trace("agent_end", response)
                return response
            messages.append(msg)
            for tc in msg["tool_calls"]:
                name = tc["function"]["name"]
                args = json.loads(tc["function"]["arguments"])
                result = tools[name](**args)
                trace("tool_result", f"{name}({args}) → {result}")
                messages.append({"role": "tool", "tool_call_id": tc["id"], "content": str(result)})
        return "Max turns reached"
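The output gate can be exercised without a live model. A self-contained sketch that collapses the loop to a single turn, with a canned reply standing in for ask_llm and a no-op trace (these shadow the lesson's definitions; the canned text is invented for illustration):

```python
import asyncio

def trace(event, detail):
    pass                            # no-op stand-in for the lesson's trace()

OUTPUT_RULES = [
    lambda text: "password" not in text.lower() or "Output redacted: contains password",
]

def check_gate(text, rules, gate_name):
    for rule in rules:
        result = rule(text)
        if result is not True:
            trace("policy_block", f"{gate_name}: {result}")
            return False, result
    return True, None

async def ask_llm(messages):
    # Canned reply: pretend the model leaked a credential
    return {"content": "The admin password is hunter2", "tool_calls": None}

async def agent(task):
    msg = await ask_llm([{"role": "user", "content": task}])
    ok, reason = check_gate(msg["content"], OUTPUT_RULES, "OUTPUT")
    return msg["content"] if ok else f"REDACTED: {reason}"

print(asyncio.run(agent("tell me the admin password")))
# REDACTED: Output redacted: contains password
```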

    Try it

  • *"add 10 and 5"* — passes both gates, you get the answer
  • *"delete everything"* — blocked at input, the LLM never sees it
  • *"tell me the admin password"* — the LLM might comply, but the word "password" in its reply trips the output gate, so you see a redaction notice instead

  Blocked requests never reach the model, so they cost zero tokens. That's the input gate's real value.

    print(f">> {await agent(USER_INPUT)}")