Research: Papers · Benchmarks · Open data

The Evidence

Everything we have published: the whitepaper that describes the framework, the research paper that tests it, and the benchmarking kit that lets anyone reproduce or challenge the results.

01 · Whitepaper

The Covenant Framework

Non-technical whitepaper · 10 pages · May 2026

A Governance Layer for Autonomous AI Agents

Describes the problem (agents act, nothing governs them), explains the five-function architecture (identify, authorize, enforce, sanction, record), walks through a concrete example, and is honest about limitations.

Written for operators, investors, and policymakers. No jargon, no code.
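For readers who do want code, the five functions can be pictured as stages that every agent action passes through. This is a minimal sketch under stated assumptions, not the framework's implementation: `Action`, `Governor`, the permission table, and the strike counter are all hypothetical names invented for illustration, and the real framework may order or combine the functions differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """Hypothetical record of something an agent wants to do."""
    agent_id: str
    name: str


@dataclass
class Governor:
    """Illustrative only: one method per function named in the whitepaper."""
    known_agents: set
    permissions: dict                       # agent_id -> set of allowed action names
    strikes: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def identify(self, action):
        # Identify: is this agent known at all?
        return action.agent_id in self.known_agents

    def authorize(self, action):
        # Authorize: is this agent allowed to take this action?
        return action.name in self.permissions.get(action.agent_id, set())

    def enforce(self, allowed):
        # Enforce: the decision is binding, not advisory.
        return "executed" if allowed else "blocked"

    def sanction(self, action):
        # Sanction: a violation has a consequence (here, a simple strike count).
        self.strikes[action.agent_id] = self.strikes.get(action.agent_id, 0) + 1

    def record(self, action, outcome):
        # Record: every decision is logged, allowed or not.
        self.log.append((action.agent_id, action.name, outcome))

    def handle(self, action):
        allowed = self.identify(action) and self.authorize(action)
        outcome = self.enforce(allowed)
        if not allowed:
            self.sanction(action)
        self.record(action, outcome)
        return outcome


gov = Governor(known_agents={"agent-1"}, permissions={"agent-1": {"read_file"}})
print(gov.handle(Action("agent-1", "read_file")))
print(gov.handle(Action("agent-1", "delete_repo")))
```

The point of the sketch is the shape: a single choke point where an action is checked, the decision is enforced, violations carry a cost, and everything is written down.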

Paper sections
  1. The Problem
  2. Why Existing Approaches Fall Short
  3. What Covenant Does
  4. The Architecture
  5. A Walkthrough
  6. What We Have Seen, and What We Are Testing
  7. What This Is Not
  8. Why Now
  9. The Path Forward
  10. Conclusion

02 · Benchmarking Kit

Reproduce or challenge the results.

Everything you need to run the same benchmarks, test different rules, or validate on new benchmark suites. No access to the private framework repo required.

Claude agents (Anthropic)
  covenant_prophet.py: Governed agent with 6 rules (the variable under test)
  governed_agent.py: Simplified governed agent (same rules, cleaner adapter)
  adhoc_baseline.py: 6 ad-hoc rules from prompt engineering (rule quality control)
  vanilla_claude.py: Raw Claude Code, zero governance (model baseline)
  covenant_multiagent.py: Multi-agent: analyst + executor + retry

Codex / GPT agents (OpenAI)
  codex_governed.py: Same 6 rules on Codex CLI / GPT (cross-model validation)
  codex_governed.py (CodexAdaptive): Adaptive escalation on Codex
  vanilla_codex.py: Raw Codex CLI, zero governance (GPT baseline)

The 6 rules

What carried the 25 points

The six rules, in order: (1) Genesis (read first), (2) Plan First, (3) Iterate Don’t Repeat, (4) Verify Before Done, (5) Time Is Limited, (6) When Stuck (change approach). Testing showed rules 1, 3, and 4 drove most of the lift.
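One way to check which rules carry the lift is a leave-one-out ablation: run the governed agent once per rule with that rule removed and compare pass rates. The sketch below only generates the rule subsets. It assumes the rule names above in the order listed; `leave_one_out` is a hypothetical helper, not part of the kit, and the kit's own testing procedure may differ.

```python
# Rule names as listed on this page; the order is assumed to match rule numbers.
RULES = [
    "Genesis",
    "Plan First",
    "Iterate Don't Repeat",
    "Verify Before Done",
    "Time Is Limited",
    "When Stuck",
]


def leave_one_out(rules):
    """Return (dropped_rule, remaining_rules) pairs, one per rule."""
    return [(dropped, [r for r in rules if r != dropped]) for dropped in rules]


for dropped, remaining in leave_one_out(RULES):
    # Each configuration would be one full benchmark run of the governed agent.
    print(f"without {dropped!r}: {len(remaining)} rules active")
```

A rule whose removal causes a large drop in pass rate is doing real work; a rule whose removal changes nothing is a candidate for deletion.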

Cost estimates

What it costs to run

10 tasks: $15–40, 1–3 hours. Full 89 tasks: $150–400, 12–20 hours. All 4 agents on 89 tasks: $600–1600, 3–5 days. Token consumption averages 50–100K per task.
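To budget a run of a different size, the figures above can be scaled linearly. This is a rough sketch, assuming cost and token use grow in proportion to task count and agent count; `scale_range` is a hypothetical helper, and the per-task dollar range is derived from the 10-task estimate rather than stated on this page.

```python
def scale_range(per_task_low, per_task_high, n_tasks, n_agents=1):
    """Scale a per-task (low, high) range to n_tasks tasks and n_agents agents."""
    factor = n_tasks * n_agents
    return (per_task_low * factor, per_task_high * factor)


# Per-task dollar range implied by the 10-task estimate ($15 to $40).
usd_low, usd_high = 15 / 10, 40 / 10

one_agent = scale_range(usd_low, usd_high, n_tasks=89)
four_agents = scale_range(usd_low, usd_high, n_tasks=89, n_agents=4)

# Token range from the stated average of 50K to 100K tokens per task.
tokens = scale_range(50_000, 100_000, n_tasks=89)

print(f"89 tasks, 1 agent:  ${one_agent[0]:.0f} to ${one_agent[1]:.0f}")
print(f"89 tasks, 4 agents: ${four_agents[0]:.0f} to ${four_agents[1]:.0f}")
print(f"89 tasks, tokens:   {tokens[0]:,} to {tokens[1]:,}")
```

Linear extrapolation from the 10-task figure gives $133.50 to $356 for 89 tasks, a little under the published $150 to $400, so the published ranges are the safer numbers to budget against.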

Contributor opportunities

Experiment | What | Cost | Impact
P1 | Run vanilla Opus 4.7 on all 89 tasks (resolves model version confound) | ~$200 | Resolves the paper’s biggest credibility threat
P3 | Run ad-hoc baseline on all 89 tasks (currently only 10) | ~$200 | Powers up rule-quality comparison to statistical significance
P6 | Run governance rules on GPT-5 via Codex CLI | ~$50 | Tests whether governance transfers across model families
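Why P3 needs all 89 tasks can be seen with a quick significance check. The sketch below uses a two-sided two-proportion z-test from the standard library. It is an illustration, not the paper's analysis: the pass counts are hypothetical placeholders, and because both agents run the same tasks, a paired test such as McNemar's may be the better choice for a real comparison.

```python
import math


def two_proportion_p_value(passes_a, passes_b, n_tasks):
    """Two-sided p-value for a difference in pass rates, n_tasks per agent."""
    rate_a = passes_a / n_tasks
    rate_b = passes_b / n_tasks
    pooled = (passes_a + passes_b) / (2 * n_tasks)
    std_err = math.sqrt(pooled * (1 - pooled) * (2 / n_tasks))
    if std_err == 0:
        return 1.0  # identical all-pass or all-fail results: no evidence of a gap
    z = (rate_a - rate_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


# HYPOTHETICAL counts chosen only to show the effect of sample size.
p_small = two_proportion_p_value(passes_a=6, passes_b=4, n_tasks=10)
p_full = two_proportion_p_value(passes_a=53, passes_b=36, n_tasks=89)

print(f"10 tasks: p = {p_small:.3f}")
print(f"89 tasks: p = {p_full:.3f}")
```

With these placeholder counts, a gap of roughly 20 points is not distinguishable from noise on 10 tasks but clears the conventional 0.05 threshold on 89.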