New to AI governance? Start with What Is AI, Really? for foundational context.
The Incident That Started Everything
At 11pm on a Friday night in February 2026, I asked Claude Opus 4.6 — one of the most capable AI models available — to fix user roles in one of our community tenants. The assistant produced a detailed forensic analysis citing specific database IDs and line numbers. It ran a dry run and reported success. The fix was ready to apply.
There was one problem: the fix would have deleted my own login account, locked me out of my community, and merged separate tenant identities that existed by design. Every surface marker said "thorough, verified, ready." Every underlying assumption was wrong.
I caught it by doing something the AI could not: I opened a browser and typed my password. The account it declared "non-functional" worked perfectly.
This incident — documented in full in When Your AI Assistant Nearly Destroys What It Was Hired to Fix — was not a story about a bug. It was a story about what happens when a system is confident and wrong, and the human has learned to trust it.
The KPMG/University of Melbourne Global Trust Study (2025) found that 66% of people use AI regularly without evaluating accuracy. Not because they lack critical thinking. Because their experience tells them the AI is usually right, and checking takes effort. This is rational behaviour. It is also dangerous behaviour, and it gets more dangerous as AI gets more capable.
That incident became the catalyst for Guardian Agents.
Why "Add Guardrails" Is Insufficient
The Recursive Trust Problem
If you use a language model to evaluate a language model, you inherit the same failure modes. Both systems are probabilistic. Both hallucinate. Both are susceptible to confident wrong answers. The checker confirms the error because the checker reasons the same way.
When I asked Claude to write an audit script to verify its own fix, the audit script had the same blind spot. It used the same flawed understanding of what makes a user "functional." This is what safety engineers call common-mode failure: verification and execution fail simultaneously because they share a fundamental assumption.
Common-mode failure is well-understood in nuclear reactor design and aviation. It is barely discussed in AI governance. The assumption that "AI checking AI" is sufficient persists despite being the exact failure mode that makes AI errors dangerous in the first place.
The Inverse Scaling Problem
OpenAI's system card for o3 (April 2025) reported a 33% hallucination rate on the PersonQA benchmark — double its predecessor o1's 16%. Their paper "Why Language Models Hallucinate" (September 2025) explains the mechanism: next-token training objectives reward confident guessing over calibrated uncertainty. Models learn to bluff because they are graded on fluency.
This creates an inverse scaling dynamic:
- More capable models → users rely more → verify less
- Errors go undetected longer → eventual failures are more severe
- More capable models produce more convincing wrong answers → errors are harder to detect when you do look
Capability scales. Correctness does not. Guardrails built from the same technology inherit this problem.
The Four-Phase Architecture
Guardian Agents are not guardrails. They are a four-phase governance architecture where each phase addresses a different class of failure, using different mechanisms, with independent failure modes.
Phase 1: Reviewers — Quality at the Point of Delivery
Every AI response is checked before delivery. Reviewers compare the response against the community's actual source material — FAQ entries, stories, documents — using embedding similarity. This is a mathematical operation: how closely does this claim align with what the community actually knows?
The critical design decision: reviewers use cosine similarity on embeddings, not inference. This is arithmetic, not AI. It cannot hallucinate. It cannot be manipulated by clever phrasing. The recursive trust problem is eliminated by making the verification layer fundamentally different from the generation layer.
Members see confidence badges on every response. Those who opt in can see claim-by-claim source attribution.
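The groundedness check described above can be sketched in a few lines. This is a minimal illustration, not Village's actual implementation: the function names, the toy three-dimensional vectors, and the 0.80 threshold are all assumptions chosen for clarity.

```python
import math

def cosine_similarity(a, b):
    # Dot product over the product of magnitudes: pure arithmetic,
    # with no model inference involved anywhere in the check.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def review_claim(claim_vec, source_vecs, threshold=0.80):
    # A claim counts as "grounded" if it sits close enough to at
    # least one embedding from the community's own source material.
    best = max(cosine_similarity(claim_vec, s) for s in source_vecs)
    return {"grounded": best >= threshold, "score": round(best, 3)}

# Toy embeddings standing in for FAQ entries and stories.
sources = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
review_claim([0.88, 0.15, 0.02], sources)  # near source 1: grounded
review_claim([0.0, 0.1, 0.99], sources)    # matches nothing: flagged
```

In a real deployment the vectors would come from an embedding model, but the verification step itself remains deterministic arithmetic, which is the design point the section makes.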
Phase 2: Monitors — Patterns Individual Reviews Cannot Catch
A single response might pass review while a trend across dozens of responses reveals a problem. Monitors run on a schedule, analysing aggregate patterns: violation rates, groundedness trends, topic drift, request volume anomalies.
When a Monitor detects a pattern, it generates an alert for the community's moderators. The architecture enforces a strict privacy boundary: tenant moderators see full content for their own community. Platform administrators see only aggregate metrics — never user content, prompts, or identities.
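One way a Monitor can flag an aggregate anomaly is a simple statistical baseline check, sketched below. The three-sigma rule and the daily-rate framing are illustrative assumptions, not Village's documented algorithm.

```python
import statistics

def check_violation_rate(history, current, k=3.0):
    # Alert when today's violation rate sits more than k standard
    # deviations above the community's own historical baseline.
    # Only aggregate rates are inspected, never user content.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    threshold = mean + k * stdev
    return {"alert": current > threshold, "threshold": round(threshold, 4)}

# Two weeks of daily violation rates, then a sudden spike.
baseline = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.008,
            0.011, 0.010, 0.012, 0.009, 0.010, 0.011, 0.010]
check_violation_rate(baseline, 0.05)   # spike well above baseline
check_violation_rate(baseline, 0.012)  # within normal variation
```

Note that the check consumes only counts and rates, which is how the privacy boundary described above can hold even for platform-level monitoring.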
Phase 3: Protectors — Defence Before Inference
Protectors act before the AI processes a request. They screen for prompt injection attacks, enforce rate limits, and verify cross-tenant isolation as a defence-in-depth measure.
If quality degrades beyond defined thresholds, an automated rollback mechanism reverts to the last known-good model configuration. Recovery is automatic. Human intervention is needed only for the decision to re-enable the newer version.
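The rollback decision above reduces to a threshold comparison. The sketch below is a hypothetical shape for it: the metric name, the 0.85 floor, and the configuration labels are invented for illustration.

```python
def evaluate_quality(metrics, floor=0.85,
                     known_good="baseline-v1", current="candidate-v2"):
    # Automated rollback: if aggregate groundedness falls below the
    # defined floor, revert to the last known-good configuration.
    # Re-enabling the newer version remains a human decision.
    if metrics["groundedness"] < floor:
        return {"active": known_good, "awaiting_human_reenable": True}
    return {"active": current, "awaiting_human_reenable": False}

evaluate_quality({"groundedness": 0.70})  # degraded: reverts
evaluate_quality({"groundedness": 0.92})  # healthy: stays current
```

The asymmetry matters: reverting is automatic because it is the safe direction, while moving forward again requires a person to decide the degradation is understood.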
Phase 4: Adaptive Learning — Evidence-Based Self-Improvement
When moderators review alerts and make decisions, those decisions become evidence. Phase 4 analyses the accumulated evidence and proposes adjustments to Guardian configuration.
Every proposal includes: the evidence that triggered it, historical context, a confidence score, and a risk assessment. The evidence burden is deliberately asymmetric: loosening a safety threshold requires 85% confidence, while tightening one requires only 60%. The system is designed to fail conservatively.
No proposal takes effect without moderator approval. The frozen baseline configuration is never modified at runtime. Approved changes are stored as audited overrides with full provenance. A regression monitor watches every approved override and flags changes that make things worse.
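The asymmetric evidence burden is straightforward to express as a gate. This sketch uses the 85%/60% figures from the text; the function and constant names are illustrative.

```python
LOOSEN_MIN_CONFIDENCE = 0.85   # relaxing safety needs strong evidence
TIGHTEN_MIN_CONFIDENCE = 0.60  # restricting needs much less

def may_surface_proposal(direction, confidence):
    # Gates only whether a proposal reaches moderators for review.
    # Nothing takes effect without explicit human approval.
    required = (LOOSEN_MIN_CONFIDENCE if direction == "loosen"
                else TIGHTEN_MIN_CONFIDENCE)
    return confidence >= required

may_surface_proposal("loosen", 0.70)   # blocked: not enough evidence
may_surface_proposal("tighten", 0.70)  # surfaced: clears the lower bar
```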
Four Principles That Define the Architecture
Mathematical, Not Generative
Guardians use embedding cosine similarity — a mathematical operation — not language model inference. This is not "AI judging AI." It is mathematics verifying language. The recursive trust problem does not apply because the verification mechanism is fundamentally different from the generation mechanism.
Sovereign by Design
All guardian processing runs on the community's own infrastructure. No data leaves the tenant boundary for safety checks. A community that chose Village for data sovereignty does not sacrifice that sovereignty for AI safety.
Human Authority Preserved
Configuration is explicit, not learned. The frozen baseline is never modified at runtime. Every change requires human approval and creates an audit trail. The evidence burden for loosening restrictions is higher than for tightening them. Guardians propose; moderators decide.
Tenant-Scoped Governance
Each community has its own constitutional principles, anomaly baselines, and threshold overrides. Cross-tenant learning uses only aggregate counts — never content — ensuring privacy guarantees extend to the governance layer.
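The frozen-baseline-plus-audited-overrides pattern from the principles above can be sketched as an append-only override log per tenant. The field names and baseline key here are assumptions for illustration, not Village's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The baseline is read but never mutated at runtime.
FROZEN_BASELINE = {"groundedness_floor": 0.80}

@dataclass
class TenantConfig:
    tenant_id: str
    overrides: list = field(default_factory=list)  # append-only audit log

    def approve_override(self, key, value, moderator, evidence):
        # Approved changes are recorded with full provenance rather
        # than edited into the frozen baseline in place.
        self.overrides.append({
            "key": key, "value": value,
            "approved_by": moderator, "evidence": evidence,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def effective(self):
        # Effective config = frozen baseline + this tenant's overrides.
        config = dict(FROZEN_BASELINE)
        for o in self.overrides:
            config[o["key"]] = o["value"]
        return config
```

Because overrides are scoped to a `TenantConfig`, one community tightening its thresholds never affects another, and the audit trail survives in the override log itself.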
Market Position
Leigh McMullen of Gartner (May 2025) describes guardian agents evolving through three phases: quality control, observation, and protection — all defined as "AI designed to monitor other AI." Village's Guardian Agents already encompass all three of Gartner's phases and add a fourth — Adaptive Learning — that Gartner does not envision. But the gap between Village and the industry is not merely temporal. It is qualitative.
Gartner's entire model assumes:
- Generative verification: AI checking AI, still trapped in the recursive trust problem.
- Cloud-dependent processing: safety checks running on vendor infrastructure.
- Universal thresholds: one-size-fits-all enterprise settings.
- Automated guardrails: safety as a feature toggle, not a governance architecture with human authority.
- Platform-scoped policies: no concept of per-community constitutional governance.
Even IBM's Sovereign Core — launched in January 2026 as the first enterprise AI designed for local governance — addresses only data residency: where the data sits and who can access it. It does not give the community a vote on what the AI does with that data. Sovereignty of infrastructure is not sovereignty of governance.
Village's Guardian Agents resolve all of these limitations. The industry will arrive at guardian agents that monitor AI output. Village has guardian agents that implement constitutional governance. The industry's 2028 destination is a place Village has already passed through.
Village is currently in beta pilot. We are accepting applications from communities, families, and organisations ready to govern their own AI. Beta founding partners receive locked-for-life founding rates.