Research - Covenant Research

01 · Whitepaper

The Covenant Framework

Non-technical whitepaper · 10 pages · May 2026

A Governance Layer for Autonomous AI Agents

The paper describes the problem (agents act, but nothing governs them), explains the five-function architecture (identify, authorize, enforce, sanction, record), walks through a concrete example, and states its limitations openly.

Written for operators, investors, and policymakers. No jargon, no code.

Read online → Download PDF

Paper sections

The Problem
Why Existing Approaches Fall Short
What Covenant Does
The Architecture
A Walkthrough
What We Have Seen, and What We Are Testing
What This Is Not
Why Now
The Path Forward
Conclusion

02 · Benchmarking Kit

Reproduce or challenge the results.

Everything you need to run the same benchmarks, test different rules, or validate on new benchmark suites. No access to the private framework repo required.

Claude agents (Anthropic)

covenant_prophet.py — Governed agent with 6 rules (the variable under test)

governed_agent.py — Simplified governed agent (same rules, cleaner adapter)

adhoc_baseline.py — 6 ad-hoc rules from prompt engineering (rule quality control)

vanilla_claude.py — Raw Claude Code, zero governance (model baseline)

covenant_multiagent.py — Multi-agent: analyst + executor + retry

Codex / GPT agents (OpenAI)

codex_governed.py — Same 6 rules on Codex CLI / GPT (cross-model validation)

codex_governed.py (CodexAdaptive) — Adaptive escalation on Codex

vanilla_codex.py — Raw Codex CLI, zero governance (GPT baseline)

The 6 rules

What carried the 25 points

The six rules, in order: Genesis (read first), Plan First, Iterate Don’t Repeat, Verify Before Done, Time Is Limited, and When Stuck (change approach). Testing showed that rules 1, 3, and 4 (Genesis, Iterate Don’t Repeat, and Verify Before Done) drove most of the lift.

Read the rules →

Cost estimates

What it costs to run

10 tasks: $15–40, 1–3 hours. Full 89 tasks: $150–400, 12–20 hours. All 4 agents on 89 tasks: $600–1600, 3–5 days. Token consumption averages 50–100K per task.
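For budgeting, the figures above can be combined with a little arithmetic. The following is a minimal sketch, not part of the benchmarking kit: the function names are hypothetical, and it assumes that cost scales linearly with the number of agents and that token use per task stays within the quoted 50–100K average.

```python
# Quoted single-agent cost ranges in USD, keyed by number of tasks.
# These two entries come straight from the estimates above.
QUOTED_USD = {10: (15, 40), 89: (150, 400)}

# Quoted average token consumption per task (low, high).
TOKENS_PER_TASK = (50_000, 100_000)


def run_cost(tasks: int, agents: int = 1) -> tuple[int, int]:
    """Cost range for running `agents` agents on a quoted task count.

    Assumes each agent costs the same as the quoted single-agent run.
    """
    low, high = QUOTED_USD[tasks]
    return low * agents, high * agents


def run_tokens(tasks: int, agents: int = 1) -> tuple[int, int]:
    """Token range implied by the quoted per-task average."""
    low, high = TOKENS_PER_TASK
    return low * tasks * agents, high * tasks * agents


def per_task_cost(tasks: int) -> tuple[float, float]:
    """Per-task cost range implied by a quoted run size."""
    low, high = QUOTED_USD[tasks]
    return low / tasks, high / tasks


if __name__ == "__main__":
    print("4 agents, 89 tasks (USD):", run_cost(89, agents=4))
    print("1 agent, 89 tasks (tokens):", run_tokens(89))
    print("per-task cost at 10 tasks (USD):", per_task_cost(10))
```

Under these assumptions, four agents on the full 89 tasks come to $600 to $1,600, which matches the quoted figure. A single full run implies roughly 4.45M to 8.9M tokens.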

Contributor opportunities

Experiment	What	Cost	Impact
P1	Run vanilla Opus 4.7 on all 89 tasks (resolves model version confound)	~$200	Resolves the paper’s biggest credibility threat
P3	Run ad-hoc baseline on all 89 tasks (currently only 10)	~$200	Powers up rule-quality comparison to statistical significance
P6	Run governance rules on GPT-5 via Codex CLI	~$50	Tests whether governance transfers across model families

View on GitHub → · Full benchmarking guide · Experimental roadmap

