AI agents are in production. The infrastructure to govern them is not. We are building it in the open and publishing everything we learn.
A governance layer for autonomous AI agents. The problem, the architecture, a concrete walkthrough, honest limitations, and why now. Written for operators, investors, and policymakers.
Everything needed to replicate our Terminal-Bench results or test governance rules on new benchmarks. Harbor agent adapters for Claude and GPT, the 6 rules, cost estimates, and the full experimental roadmap.
on Terminal-Bench 2.0. 89 tasks, single attempt, no retries. Early results suggest governance rules improve coding-agent performance over the vanilla baseline of 58.0%.
Caveat: the governed run used Opus 4.7, while the vanilla baseline used Opus 4.6, so the comparison is confounded by the model change. This confound is not yet resolved. See the whitepaper for full methodology.
The framework, the research, and the benchmarks are all public. If you operate agents and governance keeps you up at night, this work is for you.