Deploy and oversee — a green tick isn’t success
Agents at Work — CC BY 4.0
The whole promise of an agent is that it runs while you’re not watching — overnight, or while you’re with a customer. That’s also the whole danger, and the two are the same fact. This lesson is about the discipline of running an agent unattended without letting “unattended” quietly become “unaccountable.”
The trap of the green tick
An agent finishes its run and reports success. The job says “done.” Everything’s green. Here’s the thing to burn in: a green tick means the agent completed the steps it was told to. It does not mean it did the right thing.
A reconciliation-checker can run cleanly and flag the wrong invoices. A triage agent can sort a full inbox and quietly misfile the one urgent message. A screening agent can score every application without error and skew hard against one group — no error, no crash, green tick, real harm. The agent can only tell you it did what it did. It cannot tell you that what it did was correct, fair, or wise — that’s your judgment, and it doesn’t disappear because the run succeeded.
So the first rule of oversight: never read “completed” as “correct.” Completion is a claim about the process. Correctness is a claim about the world, and only a person checking against the world can make it.
Oversight you build in, not bolt on
You can’t stand over an agent that runs at 2am. So oversight has to be built into how it runs — three plain habits:
- The audit trail. The agent should leave a record of what it actually did — what it read, what it decided, what it changed or sent, and why. Not so it looks thorough; so that when something’s wrong, you can find out what and when without guessing. An agent whose work you can’t reconstruct after the fact is one you can’t answer for — and answering for it is the job (Anchor 3).
- Spot-checks on a schedule. The testing from 3.2 isn’t a launch gate you pass once; it’s a habit you keep. Sample the agent’s real output regularly, not just the day you built it — because the models underneath it change, and last month’s “fine” can drift.
- A stop you can reach. You need to be able to pull the agent — pause it, revoke its access — quickly, without a developer, when something looks off. If you can’t stop it fast, you’re not overseeing it; you’re hoping.
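As a rough illustration of the first and third habits, here is a minimal sketch of an append-only audit trail and a stop the agent checks before every step. Everything in it is an assumption made for the example: the JSON-lines file, the stop-file convention, and the names `record`, `stop_requested`, `run`, and `read_trail` are hypothetical, and the "filed" step stands in for whatever real work your agent does.

```python
import json
import time
from pathlib import Path


def record(trail: Path, action: str, target: str, reason: str) -> dict:
    """Append one audit entry as a JSON line: what was done, to what, and why."""
    entry = {"ts": time.time(), "action": action, "target": target, "reason": reason}
    with trail.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def stop_requested(stop_file: Path) -> bool:
    """The stop is just a file: anyone who can create it can pause the agent."""
    return stop_file.exists()


def run(items, trail: Path, stop_file: Path) -> int:
    """Process items one at a time, checking for a stop before each step."""
    done = 0
    for item in items:
        if stop_requested(stop_file):
            record(trail, "stopped", item, "stop file present")
            break
        # Placeholder for the agent's real work (hypothetical).
        record(trail, "filed", item, "matched routing rule")
        done += 1
    return done


def read_trail(trail: Path) -> list[dict]:
    """Reconstruct what happened, in order, from the trail."""
    with trail.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The design choice worth noticing is that the stop needs no developer: creating one file is enough, and the stop itself is written to the trail so the record shows when the agent was pulled.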
Start narrow, widen on evidence
This is Anchor 2, continuous improvement, as a deployment rule. Don’t hand a new agent the full job on day one and walk away. Run it on a slice, watch it, read its trail, check its output. Widen its role (more volume, more autonomy, less checking) only as it earns trust, on evidence that it behaves, not on the fact that it hasn’t obviously broken yet. The businesses that get burned are the ones that mistook “it ran without complaint for a week” for “it’s safe to stop looking.”
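One way to make "widen on evidence" concrete is to tie the agent's share of the work to spot-check results rather than to the absence of complaints. The sketch below is only an illustration: the function name `next_share`, the minimum of 50 checked items, the 2% error ceiling, and the doubling and halving steps are all assumed values, not recommendations from this lesson.

```python
def next_share(current_share: float, checked: int, wrong: int,
               min_checked: int = 50, max_error_rate: float = 0.02) -> float:
    """Return the share of the work the agent may handle next period.

    checked: how many of its outputs a person actually reviewed.
    wrong:   how many of those the reviewer judged incorrect.
    """
    if checked < min_checked:
        # "It hasn't obviously broken" is not evidence: hold, don't widen.
        return current_share
    error_rate = wrong / checked
    if error_rate > max_error_rate:
        # Evidence of problems: narrow the agent's role.
        return current_share / 2
    # Enough checks and few errors: widen, capped at the full job.
    return min(1.0, current_share * 2)
```

Note that a quiet week with few checks leaves the share where it is; only reviewed output moves it in either direction.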
The oversight move
Before an agent runs unattended on real work:
- Decide what “wrong” would look like — the specific bad outcome you’re watching for — and how you’d notice it from the trail, not from a green tick.
- Build the audit trail first, not after the first incident.
- Set the spot-check cadence and the stop procedure — and make sure someone other than you could use both.
Picture your agent running overnight. Something goes wrong at 3am — not a crash, a wrong call. When you sit down in the morning, how would you even know? If the honest answer is “I might not,” that’s the gap to close before you deploy, not after.
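To make the morning question answerable, the definition of "wrong" can be written down as a check that runs over the trail, alongside a random sample for a person to read. This is a sketch under stated assumptions: it imagines a triage agent whose trail entries carry `subject` and `filed_as` fields, and the keyword list and the names `looks_wrong` and `morning_review` are hypothetical.

```python
import random

# Hypothetical signals that a message should not have been filed as low priority.
URGENT_WORDS = ("urgent", "overdue", "legal")


def looks_wrong(entry: dict) -> bool:
    """The specific bad outcome being watched for:
    urgent-sounding mail that was filed as low priority."""
    text = entry.get("subject", "").lower()
    return entry.get("filed_as") == "low" and any(w in text for w in URGENT_WORDS)


def morning_review(entries: list, sample_size: int = 5, seed=None):
    """Split the night's trail into entries that match the known bad outcome
    and a random sample of the rest for a human spot-check."""
    flagged = [e for e in entries if looks_wrong(e)]
    rest = [e for e in entries if not looks_wrong(e)]
    rng = random.Random(seed)
    sample = rng.sample(rest, min(sample_size, len(rest)))
    return flagged, sample
```

The flagged list catches the failure you thought of in advance; the random sample is there for the ones you did not, which still need a person to read them against the real world.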
Next
You’ve deployed it and you’re watching it. Now the part people most want to skip and can least afford to: the law you’re actually operating under.