AEROS' guarantees aren't asserted.
They're measured.
Three benchmarks pin the production claims of the runtime. Every number is reproducible from the open-source repo. Every figure ships with the corresponding test harness.
Audit-first vs fail-open recovery
200 trials of a 4-level structured rollback harness, comparing two execution modes: audit-first (every state-mutating action gates on the audit log persisting first) vs fail-open (action proceeds; audit catches up).
The chart renders one tile per trial: green = rollback succeeded, red = rollback failed. The audit-first mode pays a median latency of 60ms per call.
Audit overhead is small. Audit gap is total. The 60ms p50 cost is the price of going from a 25% rollback floor to a 100% one.
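The two modes can be sketched in a few lines. This is a minimal illustration, not AEROS' implementation; the `AuditLog`, `audit_first`, and `fail_open` names are hypothetical, and a real audit chain would sign and fsync each entry rather than append to a list.

```python
import json

class AuditLog:
    """Toy append-only audit log; a stand-in for a signed, durable audit chain."""
    def __init__(self):
        self.entries = []

    def persist(self, record: dict) -> bool:
        # A real implementation would sign the entry and fsync to durable storage.
        self.entries.append(json.dumps(record, sort_keys=True))
        return True

def audit_first(log: AuditLog, action, payload: dict):
    """Audit-first mode: the state mutation gates on the audit write."""
    if not log.persist({"action": action.__name__, "payload": payload}):
        raise RuntimeError("audit persist failed; action not executed")
    return action(payload)  # runs only after the record is durable

def fail_open(log: AuditLog, action, payload: dict):
    """Fail-open mode: act immediately; audit catches up on a best-effort basis."""
    result = action(payload)
    log.persist({"action": action.__name__, "payload": payload})  # may be lost on crash
    return result
```

The rollback gap follows directly: in audit-first mode every mutation is guaranteed a record to roll back from, while in fail-open mode a crash between the action and the catch-up write leaves an unrecorded mutation that no rollback can see.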
Byte-identical replay determinism
N=100 replays of the same scenario, with the same governance config and frozen identity manifest. We hash the full run output (per-step decisions, signed audit chain, terminal state) and bucket the hashes.
The histogram below is the actual output: 99 buckets stay empty; one bucket holds all 100 replays. No RNG, no clock-as-input, no LLM in the consolidation path.
Determinism isn't aspirational. It's the default. The consolidator does deterministic SQL aggregation; the persona engine signs every event; the cursor advances atomically with the facts upsert via outbox. Same input, same output, every time.
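The hash-and-bucket check is easy to reproduce in miniature. A minimal sketch, with a toy `replay` standing in for the real runtime: the run output is serialized to canonical JSON and hashed, and a deterministic runtime puts all replays in a single bucket.

```python
import hashlib
import json
from collections import Counter

def run_hash(run_output: dict) -> str:
    """Hash the full run output: canonical JSON -> SHA-256 hex digest."""
    canonical = json.dumps(run_output, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def replay(scenario: dict) -> dict:
    """Toy deterministic 'run': a pure function of its input, no RNG, no clock."""
    decisions = [{"step": i, "choice": v * 2} for i, v in enumerate(scenario["inputs"])]
    return {"decisions": decisions, "terminal_state": sum(scenario["inputs"])}

scenario = {"inputs": [3, 1, 4, 1, 5]}
buckets = Counter(run_hash(replay(scenario)) for _ in range(100))
# All 100 replays land in one bucket: len(buckets) == 1
```

Any hidden nondeterminism (a timestamp in the output, unordered dict serialization, an RNG draw) immediately shows up as a second bucket, which is what makes the histogram a strong check.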
Grounded planner cuts unproductive motion
V1 / V2 / V3 are three planner variants with progressively tighter grounding in the embodied agent's identity manifest and current capability set. We measure the rate of unproductive actions (actions that the watcher rolls back, repeats, or marks as no-ops) across a fixed task suite.
V1 is the ungrounded baseline. V3 reads the identity manifest, the live ECM registry, and the current persona — then plans against that envelope. The chart shows how the unproductive-action rate falls.
Grounding the planner in the embodied identity cuts wasted motion 3–4×. The identity manifest isn't just an audit anchor; it's a planning input that pays for itself.
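The V3 grounding step reduces, at its core, to filtering candidate actions against the current envelope before planning. A minimal sketch under assumed names; `plan`, `capability_set`, and `persona_blocked` are illustrative, not AEROS APIs.

```python
def plan(candidate_actions, capability_set, persona_blocked):
    """V3-style grounding sketch: keep only actions inside the envelope.

    capability_set  -- actions the live capability registry says are available
    persona_blocked -- actions the current persona forbids
    Anything outside the envelope would likely be rolled back, repeated,
    or no-op'd by the watcher, i.e. unproductive motion.
    """
    return [a for a in candidate_actions
            if a in capability_set and a not in persona_blocked]

candidates   = ["grasp", "navigate", "speak", "weld", "fly"]
capabilities = {"grasp", "navigate", "speak", "weld"}
blocked      = {"weld"}  # persona forbids welding in this hypothetical context

grounded = plan(candidates, capabilities, blocked)
unproductive_rate = 1 - len(grounded) / len(candidates)
print(grounded)           # -> ['grasp', 'navigate', 'speak']
print(unproductive_rate)  # -> 0.4
```

The ungrounded V1 baseline would emit all five candidates and discover the two infeasible ones only at execution time; grounding moves that discovery into the planner, which is where the 3–4× reduction comes from.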
Run the numbers yourself
Every benchmark above is in the open-source repo. Clone, install, run.
1. Clone & install
git clone https://github.com/s20sc/aeros-runtime
cd aeros-runtime
python -m venv .venv && source .venv/bin/activate
pip install -e .[dev]
2. Run the test baseline
pytest --tb=short -q \
  --ignore=tests/sim \
  --ignore=tests/runtime/test_franka_render_thread_safety.py
# expected: ~2,239 passed, 3 skipped
3. Run the benchmark suite
pytest tests/benchmarks/ -v
# governance / evolution / fleet / runtime
# perf gates run as part of CI on every PR
The suite runs against the frozen identity manifest v0.9.0. For v0.10.0 numbers, switch to the corresponding tag once it is cut.
Want to add your own benchmark?
The next public benchmark is EmbodiedGovBench v2, a public suite for governance overhead, replay determinism, and recovery latency, shipping with v0.11.0. The RFC is public; contributors welcome.