
Platform AI vs. Community-Governed AI — A Structural Analysis


Series: Community-Scale AI Governance — A Research Perspective on the Village Platform (Article 2 of 5)
Author: My Digital Sovereignty Ltd
Date: March 2026
Licence: CC BY 4.0 International


The Corpus Problem

The governance properties of an AI system are substantially determined by its training corpus. This is not a secondary concern — it is a first-order architectural property.

Commercial LLMs from major platform providers are trained on web-scale corpora: billions of documents scraped from the open internet, supplemented by licensed datasets and proprietary collections. The resulting systems are broad in capability and correspondingly broad in their distributional assumptions.

The internet, as a training corpus, over-represents certain domains and perspectives: English-language text, commercial and technology-sector content, and the viewpoints of populations with high rates of online participation.

It correspondingly under-represents: low-resource languages, oral and offline traditions, and the records of small communities that publish little to the open web.

This distributional imbalance is not correctable by scale. A larger web corpus amplifies the same biases. It is a structural property of the data source, not a sampling error.

Domain-Specific AI: The Alternative and Its Limitations

The Village platform takes a different architectural approach: a smaller model, trained on a layered corpus that prioritises domain-specific content over breadth.

The training architecture has three layers:

Platform layer. Common operational knowledge shared across all deployments — how the platform functions, what features are available, navigational assistance. This layer is analogous to a shared ontology across instances.

Community layer. Content specific to a particular deployment — the records, communications, and documents produced by the community that operates the instance. This layer is what differentiates one deployment from another and grounds the model's outputs in local context.

Consent layer. A structural constraint: no content enters the training corpus without explicit, auditable consent from the content creator. This is enforced architecturally, not by policy.
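The consent layer can be sketched as a filter at corpus-ingestion time. The following Python sketch is illustrative only — all class and function names are hypothetical, not the platform's actual API — but it shows the structural point: a document that lacks an auditable consent record never reaches the training corpus, regardless of policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ConsentRecord:
    """One auditable consent decision by a content creator."""
    document_id: str
    creator_id: str
    granted_at: datetime
    revoked: bool = False

@dataclass
class ConsentLedger:
    """Append-only record of consent decisions, queried at ingestion time."""
    records: dict = field(default_factory=dict)  # document_id -> ConsentRecord

    def grant(self, document_id: str, creator_id: str) -> None:
        self.records[document_id] = ConsentRecord(
            document_id, creator_id, datetime.now(timezone.utc))

    def revoke(self, document_id: str) -> None:
        if document_id in self.records:
            self.records[document_id].revoked = True

    def has_consent(self, document_id: str) -> bool:
        rec = self.records.get(document_id)
        return rec is not None and not rec.revoked

def build_corpus(documents: dict, ledger: ConsentLedger) -> dict:
    """Structural enforcement: unconsented documents never enter the corpus."""
    return {doc_id: text for doc_id, text in documents.items()
            if ledger.has_consent(doc_id)}

# Usage: only the consented document enters the training corpus.
ledger = ConsentLedger()
ledger.grant("minutes-2026-01", "clerk@example.org")
corpus = build_corpus(
    {"minutes-2026-01": "Parish council minutes...",
     "email-thread-07": "Private correspondence..."},
    ledger)
```

The point of putting the check in `build_corpus` rather than in a policy document is that revoking consent takes effect at the next corpus build with no further action required.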

The resulting system is narrower than a commercial LLM. It cannot discuss topics outside its training domain with any competence. It will not produce general-purpose creative writing or engage in wide-ranging conversation. What it offers instead is outputs grounded in a specific community's records, verifiable against those records.

Limitations of this approach

Several limitations should be acknowledged:

Reduced generality. A domain-specific model cannot assist with tasks outside its training domain. Community members who need general-purpose AI assistance must use a separate system.

Corpus size constraints. Small communities produce limited content. A model trained on a few hundred documents has a correspondingly narrow knowledge base. The quality of outputs is directly constrained by the volume and quality of community content.

Retraining lag. The community layer requires periodic retraining to incorporate new content. Between retraining cycles, the model's knowledge is stale. The current retraining cadence (weekly during beta) may be insufficient for rapidly changing contexts.

Fine-tuning fragility. Domain-specific fine-tuning overlays new patterns on a base model's existing distribution. Under certain query conditions — particularly novel or complex questions — the base model's patterns may reassert themselves, a reversion related to, though distinct from, catastrophic forgetting (in which fine-tuning degrades the base model's prior capabilities). The extent to which this affects governance-relevant outputs in practice is not yet well-characterised for this system.

Guardian Agents: External Verification Architecture

The Village platform does not rely solely on training to ensure output quality. It interposes a verification layer — termed "Guardian Agents" — between the model's outputs and the end user.

The Guardian Agent architecture comprises four independent verification mechanisms:

Semantic grounding verification. The model's output is compared against the community's document corpus using embedding-based similarity measures. Outputs that lack sufficient grounding in actual records are flagged or suppressed.

Claim-level decomposition. The output is decomposed into individual claims, each verified independently. This addresses the common failure mode where a response contains a mixture of grounded and ungrounded assertions.

Behavioural drift monitoring. A longitudinal monitoring layer tracks patterns in the model's outputs over time, detecting systematic shifts in tone, framing, or accuracy that may indicate distributional drift or degradation.

Adaptive feedback integration. Community members' feedback (both explicit ratings and moderator corrections) is incorporated into the verification thresholds. This creates a feedback loop where the verification system becomes more attuned to the community's expectations over time.
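How these mechanisms compose can be made concrete with a minimal sketch. The similarity function below is a toy token-overlap measure standing in for the embedding-based comparison the architecture actually uses, and the sentence splitter is a naive stand-in for claim-level decomposition; thresholds and names are illustrative assumptions, not the platform's implementation.

```python
import re

def similarity(claim: str, document: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of tokens."""
    a, b = set(claim.lower().split()), set(document.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def split_claims(output: str) -> list:
    """Naive claim-level decomposition: one claim per sentence."""
    return [s.strip() for s in re.split(r"[.!?]", output) if s.strip()]

def verify(output: str, corpus: list, threshold: float = 0.3) -> dict:
    """Score each claim against its best-matching community document.
    Low-confidence claims are flagged rather than suppressed, mirroring
    the behaviour described for questions the records do not cover."""
    results = []
    for claim in split_claims(output):
        score = max((similarity(claim, doc) for doc in corpus), default=0.0)
        results.append({"claim": claim, "score": score,
                        "flagged": score < threshold})
    return {"claims": results,
            "any_flagged": any(r["flagged"] for r in results)}

corpus = ["The hall booking fee was set to 40 pounds at the January meeting"]
report = verify(
    "The hall booking fee was set to 40 pounds at the January meeting. "
    "The committee also approved a new car park.",
    corpus)
```

In this sketch the first claim is fully grounded in the record and passes, while the second has no support in the corpus and is flagged — the mixed grounded/ungrounded output that claim-level decomposition is designed to catch.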

Counter-arguments and failure modes

The Guardian Agent architecture is a research contribution, not a solved problem. Several counter-arguments and failure modes warrant examination:

Semantic similarity is not truth. Embedding-based verification measures semantic proximity, not factual accuracy. A statement that is semantically close to a source document may still be factually wrong — paraphrases can invert meaning while preserving embedding similarity. The claim-level decomposition layer partially addresses this, but false positives and false negatives remain.
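The negation problem can be shown concretely with a toy lexical similarity measure (real embedding models are far less crude, but exhibit a milder form of the same failure): a paraphrase that inverts the meaning can score higher than a faithful one, because it shares more surface material with the source.

```python
def jaccard(a: str, b: str) -> float:
    """Toy lexical similarity: shared tokens over total tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

source = "the planning application was approved by the council"
faithful = "the council approved the planning application"
inverted = "the planning application was not approved by the council"

# The meaning-inverting paraphrase shares every source token plus "not",
# so it scores HIGHER than the faithful reordering.
score_faithful = jaccard(source, faithful)
score_inverted = jaccard(source, inverted)
```

A verifier that thresholds on similarity alone would accept the inverted statement more readily than the faithful one, which is why the claim-level layer only partially mitigates this failure mode.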

Verification coverage is incomplete. The guardians can verify claims against existing records. They cannot verify claims about topics not covered by the community's records. For novel questions, the system faces a choice between refusing to answer (conservative but unhelpful) and generating unverifiable outputs (helpful but unguarded). The current implementation flags low-confidence responses rather than suppressing them, which transfers the verification burden to the end user.

Feedback loops can introduce bias. The adaptive feedback mechanism assumes that community feedback is a reliable signal. In practice, feedback may be sparse, biased toward certain user demographics, or reflect preferences that conflict with accuracy. The system does not currently distinguish between feedback that corrects factual errors and feedback that reflects aesthetic or ideological preferences.

Computational overhead. Four-layer verification adds latency and computational cost. For time-sensitive queries, this overhead may degrade the user experience to the point where the system is not used — a governance failure by non-adoption rather than by technical error.

The Trade-Off: An Analytical Framework

The choice between commercial AI and community-governed AI is not a choice between a good option and a bad option. It is a choice between different trade-off profiles:

| Dimension | Commercial Platform AI | Community-Governed AI |
| --- | --- | --- |
| Breadth of capability | High | Low (domain-specific) |
| Distributional bias | Reflects web-scale corpus | Reflects community corpus |
| Verifiability | Low (proprietary, opaque) | Higher (open-source, auditable) |
| Data sovereignty | Data flows to provider | Data remains within community boundary |
| Verification architecture | Provider-controlled | Community-inspectable |
| Computational resources | Substantial (cloud-scale) | Constrained (local or small-cloud) |
| Generalisability | High | Low (by design) |

Neither profile is categorically superior. The appropriate choice depends on the governance priorities of the deploying community — a point that itself constitutes a governance decision.

Replicability and Generalisability

A question of particular interest to the research community is whether the Village architecture is replicable and generalisable beyond its current deployment context.

The platform is designed for multi-tenant operation across diverse community types (the current implementation supports nine product types, from parishes to conservation groups to alumni associations). The vocabulary system adapts terminology to the community context, which suggests a degree of generalisability in the platform layer.

However, several factors limit confident claims about generalisability: the evidence base is a single platform still in beta; corpus volume and quality vary widely between communities, and with them the achievable output quality; and the verification and feedback mechanisms have not yet been evaluated across community types with differing moderation capacity.

These are open research questions, not resolved design decisions.


This is Article 2 of 5 in the "Community-Scale AI Governance" series. For the full Guardian Agents architecture, visit Village AI on Agentic Governance.

Previous: What AI Is, What It Is Not, and What Remains Uncertain
Next: Why Policy-Based AI Governance Is Insufficient — The Structural Alternative

Published under CC BY 4.0 by My Digital Sovereignty Ltd. You are free to share and adapt this material, provided you give appropriate credit.