Why Training-Time Governance Fails — Architectural Constraints as an Alternative


Series: Architectural AI Governance at Community Scale — A Technical Examination of Village AI (Article 3 of 5)
Author: My Digital Sovereignty Ltd
Date: March 2026
Licence: CC BY 4.0 International


The 27027 Incident

Before discussing governance architecture in the abstract, we present a concrete case study. The incident number is 27027, and it illustrates the class of alignment failure that motivated the architectural approach described in this series.

A community leader — a rector in an Episcopal parish — instructed the AI system to draft a pastoral letter to a bereaved family. The instruction was explicit: the letter should use the language of resurrection hope, consistent with the theological commitments of the community and the family.

The system produced a fluent, well-structured letter. It was warm, compassionate, and professionally worded. It spoke of "the healing journey," "finding closure," "honouring their memory by living your best life," and "the legacy they leave behind."

The letter contained no theological language whatsoever. The system had silently substituted therapeutic self-help language for the requested resurrection theology — because therapeutic bereavement language is orders of magnitude more common in the training distribution than the language of the Burial Office or the communion of saints.

The critical characteristics of this failure:

  1. The substitution was silent. No error was raised. No confidence flag was lowered. The system did not indicate that it was unable to comply with the instruction.
  2. The output was fluent. The letter was well-written by any general standard. The failure was not in generation quality but in domain fidelity.
  3. Detection required domain expertise. A reviewer without theological training would likely have approved the letter. The substitution is invisible to someone who does not know what resurrection hope sounds like.
  4. The system was not disobedient. It did not refuse the instruction. It processed the instruction and produced what its distributional priors predicted as the most likely "pastoral letter about bereavement." The instruction was not overridden; it was outweighed.

This is not a hypothetical. It is a documented incident from a deployed system. We use it as a case study because it illustrates a failure mode that is, in our assessment, endemic to training-time alignment approaches when deployed in domains underrepresented in the training corpus.

Why This Failure Mode Resists Training-Time Solutions

The 27027 incident is not resolvable by the standard alignment toolkit:

Fine-tuning can shift distributional priors, and the Episcopal specialisation (villageai-8b-episcopal-v2) was in part a response to this incident. But fine-tuning does not eliminate base model priors; it overlays new patterns on existing ones. Under distributional pressure — novel prompts, unusual combinations of constraints, contexts not well-covered by the fine-tuning data — the base model priors can reassert themselves. This is documented in the catastrophic forgetting literature, though the failure mode here is more subtle: not forgetting the fine-tuned behaviour entirely, but reverting to it probabilistically under conditions that are difficult to predict a priori.

RLHF would require human annotators who can distinguish resurrection theology from therapeutic language — annotators with specific domain expertise. Scaling this to every community domain (Anglican liturgy, Māori tikanga, conservation ecology, family genealogy) is impractical. More fundamentally, RLHF optimises for average preference across the annotator pool. Community-specific alignment requires optimising for a specific community's preferences, which may diverge from — or even conflict with — the aggregate.

Constitutional AI would require the model to evaluate its own output against the principle "use resurrection language, not therapeutic language." But this evaluation is itself conditioned on the model's distributional priors. A model whose training distribution favours therapeutic framing will evaluate therapeutic language as appropriate — because, within its learned distribution, it is.

Mechanistic interpretability could, in principle, identify the circuits responsible for the substitution and intervene at that level. This is a promising research direction, but it is not currently practical for deployed systems at any scale. The gap between identifying induction heads and reliably intervening in domain-specific distributional behaviour in a production system remains large.

We do not claim these approaches are without value. We claim that, for the specific failure mode illustrated by the 27027 incident — silent distributional reversion in underrepresented domains — they are insufficient as deployed solutions.

Epistemic Separation as a Design Principle

The alternative approach implemented in Village AI is based on a principle we call epistemic separation: the system that verifies the model's output must be structurally independent of the system that generates it.

This is not a novel principle. It is the basis of financial auditing (the auditor cannot be the audited), judicial review (the reviewer cannot be the reviewed), and scientific peer review (the reviewer is external to the research team). In AI governance, it translates to: the verification system must not share the failure modes of the generation system.

If the generation model reverts to therapeutic language because its distributional priors favour it, the verification system must be able to detect that reversion using a method that is not subject to the same distributional bias. This rules out self-evaluation (the model checking its own output) and rules out learned evaluation models trained on the same distribution.

The Village implementation uses four Guardian Agent layers, each operating on a different epistemic basis from the generation model.

The Guardian Agent Architecture

Guardian 1: Accuracy Verifier (AccuracyVerifier)

The Accuracy Verifier computes cosine similarity between the embedding of the model's response and the embeddings of source documents in the community's corpus. This is a mathematical operation — inner product in the embedding space — that does not involve language generation and is not subject to the distributional biases of the generation model.

If the model claims "The vestry decided to repair the roof in September," the verifier embeds this claim and computes its similarity to all vestry minutes in the corpus. A high cosine similarity to a document containing a roof repair decision in September provides evidence of grounding. A low similarity across all documents flags the claim as potentially ungrounded.
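The core operation can be sketched in a few lines. This is an illustrative reconstruction, not the Village AI source: the function names and the 0.82 threshold are assumptions chosen for the example, and a production verifier would batch these comparisons against a vector index rather than loop over the corpus.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Inner product of the two vectors after L2 normalisation."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_claim(claim_vec: np.ndarray, corpus_vecs: list,
                 threshold: float = 0.82):  # threshold is illustrative
    """Return (best document index, best similarity, grounded?)."""
    sims = [cosine_similarity(claim_vec, v) for v in corpus_vecs]
    best = int(np.argmax(sims))
    return best, sims[best], sims[best] >= threshold
```

Note that the verifier never generates language: the decision reduces to an inner product and a threshold comparison, which is what makes it epistemically independent of the generation model.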

Limitations we acknowledge: Cosine similarity in embedding space is a proxy for semantic similarity, not a guarantee of factual accuracy. Two semantically similar sentences can differ on critical factual details (dates, names, quantities). The embedding model is a shared dependency with the retrieval pipeline, creating the correlated failure mode noted in Article 2. And verification quality depends on corpus coverage — if the relevant document is not in the corpus, the verifier cannot confirm or deny the claim.

Guardian 2: Hallucination Detector (HallucinationDetector)

The Hallucination Detector decomposes the model's response into individual claims and verifies each one independently. A response containing three assertions — two grounded and one fabricated — will flag the fabricated assertion even if the overall response embeds closely to source documents.

This addresses a specific failure mode of whole-response verification: a fluent response that is mostly accurate can embed closely to source documents while containing one or more hallucinated details. Claim-level decomposition provides finer-grained verification at the cost of increased inference latency.
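The decomposition step can be sketched as follows. The naive sentence split below is a stand-in — the article does not specify how the HallucinationDetector extracts claims, and a real extractor would be considerably more sophisticated — but the structure of per-claim verification is the point:

```python
import re
from typing import Callable

def decompose(response: str) -> list[str]:
    # Naive sentence split as a stand-in for real claim extraction.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]

def flag_ungrounded(response: str,
                    is_grounded: Callable[[str], bool]) -> list[str]:
    """Verify each claim independently; return the claims that fail."""
    return [claim for claim in decompose(response) if not is_grounded(claim)]
```

A response with two grounded assertions and one fabricated one is flagged on the fabricated assertion alone, even though the response as a whole might embed closely to the source documents.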

Guardian 3: Anomaly Detector and Pressure Monitor (AnomalyDetector, PressureMonitor)

The third layer monitors system-level patterns rather than individual responses. It tracks distributional drift in the model's outputs over time, detects anomalous patterns (unusual vocabulary, unexpected topic shifts, response characteristics that diverge from established baselines), and monitors operational pressure indicators (context length, query complexity, inference load).

When the system detects elevated pressure or anomalous patterns, it increases verification intensity — tighter cosine similarity thresholds, mandatory claim-level decomposition, reduced confidence ceilings. The principle is that verification should be inversely proportional to operating confidence: the more uncertain the conditions, the more scrutiny the response receives.
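A minimal sketch of that inverse relationship, assuming a scalar pressure score in [0, 1]; the tier boundaries and threshold values here are invented for illustration and are not taken from the deployed system:

```python
from dataclasses import dataclass

@dataclass
class VerificationPolicy:
    similarity_threshold: float   # tighter under pressure
    claim_level: bool             # mandatory claim decomposition?
    confidence_ceiling: float     # cap on reported confidence

def policy_for(pressure: float) -> VerificationPolicy:
    """Map a pressure score in [0, 1] to verification intensity."""
    if pressure < 0.3:                               # normal operation
        return VerificationPolicy(0.80, False, 1.00)
    if pressure < 0.7:                               # elevated pressure
        return VerificationPolicy(0.85, True, 0.90)
    return VerificationPolicy(0.92, True, 0.75)      # anomalous conditions
```

The design choice worth noting is that the policy is a pure function of observed conditions: the generation model has no channel through which to relax its own scrutiny.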

Guardian 4: Adaptive Feedback Loop (ResponseReviewer, RegressionMonitor)

The fourth layer learns from community feedback. When a member marks a response as unhelpful or inaccurate, the system classifies the root cause (RootCauseClassifier), tracks the correction, and monitors for regression. A FeedbackInvestigator service examines whether the flagged response represents a systematic pattern or an isolated error.

This layer is the closest to a training-time intervention — it adjusts system behaviour based on human feedback. The distinction from RLHF is that the adjustment operates at the verification and routing level, not at the model weight level. The model itself is not retrained in response to individual feedback; instead, the Guardian system adjusts its thresholds, flags specific failure patterns, and routes problematic query types to human review.
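The contrast with weight-level updates can be made concrete. In this hypothetical sketch (the class name, step size, and ceiling are assumptions, not the production API), feedback tightens a verification threshold and adds a query type to a human-review queue — the model weights are never touched:

```python
class FeedbackLoop:
    """Adjusts verification and routing, not model weights."""

    def __init__(self, threshold: float = 0.80,
                 step: float = 0.02, ceiling: float = 0.95):
        self.threshold = threshold
        self.step = step
        self.ceiling = ceiling
        self.review_queue: set[str] = set()

    def record_inaccurate(self, query_type: str) -> None:
        # Tighten the similarity threshold and route this
        # query type to human review; no retraining occurs.
        self.threshold = min(self.ceiling, self.threshold + self.step)
        self.review_queue.add(query_type)

    def needs_human_review(self, query_type: str) -> bool:
        return query_type in self.review_queue
```

Because the adjustment lives entirely in the verification layer, it can be inspected, reverted, or audited without any access to the model itself.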

How This Differs from Existing Alignment Approaches

We position this approach relative to three established alignment paradigms:

Relative to RLHF: RLHF adjusts the model's output distribution to align with human preferences. Guardian Agents do not adjust the model's output distribution; they verify the model's output against external reference documents after generation. The model may still generate domain-inappropriate language; the Guardian system detects and flags it. This is analogous to the difference between training a person to always give correct answers (RLHF) and having their work checked by an independent auditor (Guardian Agents). The latter does not assume the person will always be correct; it assumes they will sometimes err and provides a detection mechanism.

Relative to constitutional AI: Constitutional AI uses the model to evaluate its own outputs against stated principles. Guardian Agents use mathematically distinct systems (embedding similarity, claim decomposition, statistical anomaly detection) to evaluate the model's outputs. The evaluation does not depend on the model's ability to understand the principles; it depends on measurable properties of the output relative to reference documents. This avoids the circularity problem where a model with biased priors evaluates its own biased output as acceptable.

Relative to mechanistic interpretability: Interpretability research aims to understand why models produce specific outputs by examining internal representations. Guardian Agents are agnostic to the model's internal mechanisms; they evaluate outputs behaviourally, by their measurable properties. This is a less ambitious approach — it does not explain why the model erred, only that it did. But it is deployable now, at production scale, with current technology.

What This Approach Does Not Solve

We are explicit about the boundaries of this approach:

It does not solve the alignment problem in general. Guardian Agents detect a specific class of failures: outputs that diverge from a reference corpus. They do not detect novel failure modes that have no reference point in the corpus. A truly novel misalignment — the model producing output that is wrong in a way the corpus does not address — would not be caught.

It does not eliminate the need for human oversight. The architecture routes uncertain cases to human review. It reduces the volume of cases requiring human attention, but it does not eliminate the need for domain experts in the governance loop. A community without qualified moderators cannot rely on Guardian Agents alone.

It does not scale to arbitrary complexity. The architecture works because the target domain is bounded — a community's own documents, a specific theological tradition, a defined vocabulary. Applying the same approach to open-domain AI systems would require a reference corpus of unbounded scope, which undermines the verifiability advantage.

It has not been independently evaluated. The system has been in production since October 2025. We have operational data on Guardian Agent performance, but no independent audit or peer-reviewed evaluation. We present this as an engineering report, not a research paper, and the claims should be weighted accordingly.


This is Article 3 of 5 in the "Architectural AI Governance at Community Scale" series. For the full governance architecture, visit Village AI on Agentic Governance.

Previous: Foundation Models vs. Domain-Specialised Inference — A Structural Analysis
Next: What Is Live in Production — An Unvarnished Inventory

Published under CC BY 4.0 by My Digital Sovereignty Ltd. You are free to share and adapt this material, provided you give appropriate credit.