AEROS' guarantees aren't asserted.
They're measured.

Three benchmarks pin down the runtime's production claims. Every number is reproducible from the open-source repo, and every figure ships with its test harness.

Test baseline: 2,239 passing tests · 0 failing on main · HR-1 identity-hash regression on every PR.

Audit-first vs fail-open recovery

200 trials of a 4-level structured rollback harness, comparing two execution modes: audit-first (every state-mutating action gates on the audit log persisting first) vs fail-open (action proceeds; audit catches up).

The chart plots one tile per trial: green = rollback succeeded, red = rollback failed. The 60 ms figure is the median latency the audit-first mode pays per call.
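The two modes differ only in ordering. A minimal sketch of that ordering, with an in-memory stand-in for the durable audit log (all names here are illustrative, not the runtime's actual API):

```python
import json
import time


class AuditLog:
    """Append-only log; a stand-in for the runtime's durable audit store."""

    def __init__(self):
        self.entries = []

    def persist(self, record):
        # In the real runtime this is a durable, signed write; here it is
        # an in-memory append so the ordering is easy to see.
        self.entries.append(json.dumps(record, sort_keys=True))


def apply_action(state, action, log, mode="audit-first"):
    """Sketch of the two execution modes the harness compares.

    audit-first: the log entry persists BEFORE the state mutates, so a
                 rollback can always replay from the log.
    fail-open:   the mutation runs first and the log catches up, leaving
                 a window where a crash loses the record.
    """
    record = {"action": action, "ts": time.time()}
    if mode == "audit-first":
        log.persist(record)                   # gate: audit entry lands first
        state[action["key"]] = action["value"]
    else:  # fail-open
        state[action["key"]] = action["value"]
        log.persist(record)                   # best-effort catch-up
    return state
```

The 60 ms p50 overhead in the benchmark is exactly the cost of that `log.persist` gate sitting on the critical path.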

audit-first: 100% (200 / 200 trials) · fail-open: 25% (50 / 200 trials) · audit overhead, p50: 60 ms
Takeaway

Audit overhead is small. Audit gap is total. The 60ms p50 cost is the price of going from a 25% rollback floor to a 100% one.

Source: AEROS rollback harness Trials: N=200

Byte-identical replay determinism

N=100 replays of the same scenario, with the same governance config and frozen identity manifest. We hash the full run output (per-step decisions, signed audit chain, terminal state) and bucket the hashes.

The histogram below is the actual output: 99 buckets stay empty, one bucket holds all 100 replays. No RNG, no clock-as-input, no LLM in the consolidation path.
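The bucketing scheme can be sketched in a few lines. This is an assumed shape for the hashing, not the benchmark's actual code: it digests the three run artifacts the text names and counts distinct hashes.

```python
import hashlib
from collections import Counter


def run_hash(decisions, audit_chain, terminal_state):
    # Hash the full run output, as the benchmark describes: per-step
    # decisions, signed audit chain, and terminal state feed one digest.
    h = hashlib.sha256()
    for part in (decisions, audit_chain, terminal_state):
        h.update(repr(part).encode("utf-8"))
    return h.hexdigest()


def bucket_replays(runs):
    # Bucket replays by output hash; a deterministic runtime yields
    # exactly one non-empty bucket.
    return Counter(run_hash(*run) for run in runs)
```

A deterministic system replayed 100 times produces a `Counter` with a single key holding all 100 counts; any nondeterminism shows up immediately as a second bucket.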

modal hash: 100 / 100 replays · 99 buckets empty · distinct output hashes: 1 / 100
Takeaway

Determinism isn't aspirational. It's the default. The consolidator does deterministic SQL aggregation; the persona engine signs every event; the cursor advances atomically with the facts upsert via outbox. Same input, same output, every time.
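The "cursor advances atomically with the facts upsert" property can be illustrated with SQLite. This is a minimal sketch under assumed table names (`facts`, `consolidation_cursor`), not the runtime's actual schema:

```python
import sqlite3


def make_store():
    # Illustrative schema: a facts table plus a single-row cursor.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE facts(key TEXT PRIMARY KEY, value TEXT);
        CREATE TABLE consolidation_cursor(
            id INTEGER PRIMARY KEY CHECK (id = 1), pos INTEGER);
        INSERT INTO consolidation_cursor VALUES (1, 0);
    """)
    return conn


def consolidate(conn, events):
    # The outbox-style property: the facts upsert and the cursor advance
    # commit in ONE transaction, so a replay from the cursor can never
    # double-apply or skip an event.
    for pos, (key, value) in enumerate(events, start=1):
        with conn:  # sqlite3 connection context manager = one transaction
            conn.execute(
                "INSERT INTO facts VALUES (?, ?) "
                "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
                (key, value),
            )
            conn.execute(
                "UPDATE consolidation_cursor SET pos = ? WHERE id = 1",
                (pos,),
            )
```

Because both statements share a transaction, a crash between them is unobservable: either the fact and the cursor both moved, or neither did.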

Source: AEROS replay benchmark Replays: N=100

Grounded planner cuts unproductive motion

V1 / V2 / V3 are three planner variants with progressively tighter grounding in the embodied agent's identity manifest and current capability set. We measure the rate of unproductive actions (actions that the watcher rolls back, repeats, or marks as no-ops) across a fixed task suite.

V1 is the ungrounded baseline. V3 reads the identity manifest, the live ECM registry, and the current persona — then plans against that envelope. The chart shows how the unproductive-action rate falls.
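The metric itself is simple to state precisely. A sketch of the rate computation, using illustrative status labels for the three watcher verdicts the text lists:

```python
def unproductive_rate(actions):
    # An action counts as unproductive if the watcher rolled it back,
    # repeated it, or marked it a no-op (status labels are illustrative).
    flagged = {"rollback", "repeat", "noop"}
    return sum(a["status"] in flagged for a in actions) / len(actions)


def improvement(rate_baseline, rate_variant):
    # How many times fewer unproductive actions the variant produces.
    return rate_baseline / rate_variant
```

Plugging in the chart's endpoints, a 100% baseline against a 25.4% grounded rate gives roughly a 3.9× improvement, matching the headline figure.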

V1 ungrounded: 100% · V2 capability-aware (illustrative) · V3 grounded: 25.4% (−74.6% vs V1) · unproductive actions, V3 vs V1: 3.9× fewer
Takeaway

Grounding the planner in the embodied identity cuts wasted motion 3–4×. The identity manifest isn't just an audit anchor; it's a planning input that pays for itself.
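"Planning against the envelope" reduces to a filter in its simplest form. A hypothetical sketch of V3-style grounding — the field names and data shapes are assumptions, not the planner's real interface:

```python
def grounded_plan(candidates, capabilities):
    # V3-style grounding sketch: drop any candidate action whose required
    # capabilities are not a subset of the live capability set (as read
    # from the identity manifest and ECM registry in the real runtime).
    return [a for a in candidates if a["requires"] <= capabilities]
```

An ungrounded V1 planner would emit all candidates and let the watcher reject the infeasible ones after the fact; grounding moves that rejection before execution, which is where the 3–4× reduction comes from.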

Source: AEROS planner benchmark Variants: V1 / V2 / V3 Validation: V1, V2, V3 PASS

Run the numbers yourself

Every benchmark above is in the open-source repo. Clone, install, run.

1. Clone & install

git clone https://github.com/s20sc/aeros-runtime
cd aeros-runtime
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

2. Run the test baseline

pytest --tb=short -q \
  --ignore=tests/sim \
  --ignore=tests/runtime/test_franka_render_thread_safety.py
# expected: ~2,239 passed, 3 skipped

3. Run the benchmark suite

pytest tests/benchmarks/ -v
# governance / evolution / fleet / runtime
# perf gates run as part of CI on every PR

Runs against frozen identity v0.9.0. For v0.10.0 numbers, switch to the corresponding tag once cut.

Want to add your own benchmark?

The next public benchmark is EmbodiedGovBench v2 — a public suite for governance overhead, replay determinism, and recovery latency, built into v0.11.0. RFC public, contributors welcome.