prod_orchestrator::install_fresh now branches on the new Config::use_quadlet_backends flag (default false):

* off (today's production behavior): unchanged. runtime.create_container + start_container, container parented under archipelago.service's cgroup, FM3 cascade SIGKILL on every archipelago restart.
* on: install_via_quadlet renders the manifest as a Quadlet unit via QuadletUnit::from_manifest, writes it atomically into ~/.config/containers/systemd/, calls daemon-reload, and starts the generated <name>.service. Container ends up under user.slice: no more cgroup parented under archipelago, so archipelago restarts don't touch the container's lifetime.

Default off so this commit is structurally safe to ship: nothing changes at runtime until an operator opts in. Flip the default once tests/lifecycle/run-20x.sh has gone green against the new path on .228 + .198 (the v1.7.52 release gate).

Plumbing:

* config.rs: `use_quadlet_backends: bool` w/ Default false
* prod_orchestrator.rs: flag stored on the struct, threaded through new(), with set_use_quadlet_backends(bool) test setter
* prod_orchestrator.rs: install_via_quadlet helper
* dropped the Phase-3.1 #[allow(dead_code)] markers on from_manifest / parse_memory_mib / RestartPolicy::OnFailure now that the call path exists; if a future revert removes the wiring, the warnings come back.

Tests: 624 passing, cargo check clean (0 warnings). Existing companion behavior unaffected: render_skips_backend_directives_when_default still passes byte-equal to before quadlet.rs grew the new fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container subsystem testing — scorecard and roadmap
The bar (verbatim from the v1.7.52 owner):
"best performant, minimal code, tested containers possible in the world. No bloated code, no problems installing a single one, no problems uninstalling, every one needs to be tested 20+ times in every state before we make another update, not a single container failure outside of hardware or internet failure is allowed."
This document is the live tracker for whether we're meeting that bar. Every PR that touches the container subsystem updates the scoreboard below. If you can't honestly tick the box, the change isn't ready.
Test layers
| Layer | What it asserts | Toolchain | Latency / iteration |
|---|---|---|---|
| L0 — Rust unit | Pure-function behaviour (manifest parsing, secret resolution, structural invariants) | cargo test --workspace --bins | ~5s |
| L1 — RPC API | The JSON-RPC API responds correctly per app (container-list, package.{install,start,stop,restart,uninstall}, bitcoin.getinfo, etc.) | bats + lib/rpc.bash | ~30s per suite |
| L2 — UI surface | The URLs a user actually clicks (dashboard, /app/<id>/, direct-port iframes) return 200 with non-empty bodies | bats + lib/ui-probes.bash | ~10s per suite |
| L3 — Lifecycle survival | Containers survive operational events (archipelago restart, host reboot, kill -9 mid-install, OOM) | bats (gated) | ~60s per scenario |
| L4 — Browser journey | Real DOM-level user flow (login → install → wait → click → use) | playwright (TBD) | ~30-120s per journey |
| L5 — Chaos / failure-path | Failure modes recover gracefully (corrupt config, deleted bolt DB, network partition) | bats (chaos-gated) | ~120s per scenario |
| L6 — Performance | Cold install latency, reconcile-tick cost, podman call count per lifecycle event | timed bats + Prometheus (TBD) | ~60s per benchmark |
Release gate: L0+L1+L2+L3 green × 20 iterations on .228 AND .198. L4+L5+L6 are quality gates we add as they mature; not blocking the v1.7.52 tag.
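The gate rule above can be read as a small predicate over recorded results. The following is a minimal sketch, assuming a hypothetical in-memory shape for the results (host, then a list of iterations, each mapping layer name to pass/fail); the real harness in tests/lifecycle/run-20x.sh may record and evaluate this differently. It treats a missing layer or a missing host as a failure and ignores L4, L5 and L6.

```rust
use std::collections::BTreeMap;

/// Layers that block the release, per the gate rule in this document.
const BLOCKING_LAYERS: [&str; 4] = ["L0", "L1", "L2", "L3"];
/// Hosts the gate must pass on (short names as written in this document).
const GATE_HOSTS: [&str; 2] = [".228", ".198"];
/// Number of green iterations required per host.
const REQUIRED_ITERATIONS: usize = 20;

/// One iteration on one host: layer name -> did it pass?
/// Hypothetical shape, chosen only for this illustration.
type Iteration = BTreeMap<&'static str, bool>;

/// True when every blocking layer is green in this iteration.
/// A layer with no recorded result counts as a failure.
fn iteration_green(it: &Iteration) -> bool {
    BLOCKING_LAYERS
        .iter()
        .all(|layer| it.get(layer).copied().unwrap_or(false))
}

/// True when each gate host has at least REQUIRED_ITERATIONS recorded
/// iterations and every recorded iteration is green.
fn release_gate_passes(runs: &BTreeMap<&'static str, Vec<Iteration>>) -> bool {
    GATE_HOSTS.iter().all(|host| match runs.get(host) {
        Some(iters) => {
            iters.len() >= REQUIRED_ITERATIONS && iters.iter().all(iteration_green)
        }
        None => false,
    })
}
```

Under these assumptions a single red L3 result on either host, or a host with only 19 recorded iterations, is enough to fail the gate.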
Coverage matrix — current state
Legend: ● fully covered, ◐ partial, ○ missing
Per-app × per-state matrix (L1 + L2)
| App | Container present | Valid state | RPC reachable | UI URL 200 | Stop | Start | Restart | Reinstall | Reboot survives | Archipelago-restart survives |
|---|---|---|---|---|---|---|---|---|---|---|
| bitcoin-knots | ● | ● | ● | ● (port 8334) | ● | ● | ● | ● | ○ | ◐ regression-gate only |
| bitcoin-core | ◐ shares with knots | ◐ | ○ | ◐ | ○ | ○ | ○ | ○ | ○ | ◐ regression-gate |
| lnd | ● | ● | ● (lncli) | ● (/app/lnd/) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| electrumx | ● | ● | ● (TCP 50001) | ● (/app/electrumx/) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| btcpay-server | ● | ● | ◐ frontend-port | ● (/app/btcpay/) | ● | ● | ● | ● | ○ | ○ |
| mempool | ● | ● | ● (/api/v1/backend-info) | ● (/app/mempool/) | ● | ● | ● | ● | ○ | ○ |
| fedimint | ● | ● | ◐ container-only | ● (/app/fedimint/) | ● | ● | ● | ● | ○ | ○ |
| filebrowser | ○ | ○ | ○ | ● probe-only | ○ | ○ | ○ | ○ | ○ | ◐ via companions |
| archy-bitcoin-ui | ◐ via companions | ◐ | n/a | ● (port 8334) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-lnd-ui | ◐ via companions | ◐ | n/a | ● (/app/lnd/) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-electrs-ui | ◐ via companions | ◐ | n/a | ● (/app/electrumx/) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
Done: 50 of 110 cells. Goal: 110/110 ● for the listed apps before v1.7.52 tags.
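Keeping the "Done" count in step with the matrix is easier if it is derived from the rows rather than counted by hand. A minimal sketch, assuming rows are in the pipe-delimited form used above and that a cell's state is determined by its first character; the `Tally` type and `tally_row` function are hypothetical helpers, not part of the repository.

```rust
/// Tally of legend symbols in one row of the coverage matrix.
#[derive(Debug, Default, PartialEq)]
struct Tally {
    full: usize,    // cells starting with ●
    partial: usize, // cells starting with ◐
    missing: usize, // cells starting with ○
    other: usize,   // anything else, e.g. "n/a"
}

/// Count the state cells of one markdown table row.
/// The first cell (the app name) is skipped.
fn tally_row(row: &str) -> Tally {
    let mut t = Tally::default();
    let cells = row.trim().trim_matches('|').split('|').map(str::trim);
    for cell in cells.skip(1) {
        match cell.chars().next() {
            Some('●') => t.full += 1,
            Some('◐') => t.partial += 1,
            Some('○') => t.missing += 1,
            _ => t.other += 1,
        }
    }
    t
}
```

Summing `full` over all rows would give the numerator of the "Done" line. Cells marked n/a land in `other`, so whether they belong in the denominator is a decision the scorecard has to make explicitly.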
Layer-by-layer status
| Layer | Tests | Suites | Status |
|---|---|---|---|
| L0 unit | 624 | n/a | ● green |
| L1 RPC | 70 | bitcoin-knots, lnd, electrumx, btcpay, mempool, fedimint, required-stack, package-update-smoke | ● for the 6 core apps |
| L2 UI | 9 | ui-coverage | ● for dashboard + 7 proxy paths + bitcoin-ui:8334 |
| L3 lifecycle survival | 8 | companion-survives-archipelago-restart, backend-survives-archipelago-restart, required-stack-destructive | ◐ companions ● ; backends ◐ regression-gate (will fail until Phase 3 Quadlet ships) |
| L4 browser journey | 0 | none | ○ not started |
| L5 chaos | 0 | none | ○ not started |
| L6 performance | 0 | none | ○ not started |
Run commands
# L0 unit:
cd core && cargo test --workspace --bins
# Single bats suite:
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh bitcoin-knots
# Full bats suite (read-only):
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh
# Full + destructive (for the verification fleet):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 tests/lifecycle/run.sh
# 20× release-gate run (the actual v1.7.52 ship gate):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 \
tests/lifecycle/run-20x.sh
LoC budget
Goal: minimum-viable container subsystem.
| Module | LoC today | Target | Δ | Status |
|---|---|---|---|---|
| core/container/src/dependency_resolver.rs | — | — | -270 | ● deleted |
| core/container/src/health_monitor.rs | 196 | 0 | -196 | ◐ pending health migration into reconciler (Phase 3.5) |
| core/container/src/podman_client.rs::create/start/stop | ~400 | ~150 | -250 | ◐ pending Quadlet migration (Phase 3.5) |
| core/archipelago/src/container/dev_orchestrator.rs | 410 | 0 | -410 | ○ pending dev_mode strategy decision |
| core/archipelago/src/container/data_manager.rs | 96 | 0 | -96 | ○ couples with dev_orchestrator |
| core/container/src/bitcoin_simulator.rs | 219 | 0 | -219 | ○ couples with dev_orchestrator |
| core/container/src/port_manager.rs | 175 | 0 | -175 | ○ couples with dev_orchestrator |
| core/archipelago/src/api/rpc/package/install.rs::install_bitcoincoin_rpc_repair | ~150 | 0 | -150 | ◐ pending fold into orchestrator pre-start |
| imperative install_fresh in prod_orchestrator | ~120 | 0 | -120 | ◐ Phase 3.2 wired behind use_quadlet_backends flag (default off); flip default after 20× green |
Today: -270 LoC committed. Outstanding deletes possible: ~1,616 LoC (if Phase 3 ships fully + dev_mode resolved).
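The outstanding figure is the sum of the Δ column for every row that is not yet deleted. A small sketch that recomputes it, taking the approximate rows (marked "~" in the table) at face value; the constant and function names are hypothetical.

```rust
/// Outstanding per-module deletions from the LoC budget table, in lines.
/// Rows marked "~" in the table are taken at face value.
const OUTSTANDING_DELETES: [(&str, u32); 8] = [
    ("health_monitor.rs", 196),
    ("podman_client.rs::create/start/stop", 250),
    ("dev_orchestrator.rs", 410),
    ("data_manager.rs", 96),
    ("bitcoin_simulator.rs", 219),
    ("port_manager.rs", 175),
    ("install.rs::install_bitcoincoin_rpc_repair", 150),
    ("imperative install_fresh", 120),
];

/// Sum of all deletions that have not landed yet.
fn outstanding_total() -> u32 {
    OUTSTANDING_DELETES.iter().map(|(_, loc)| loc).sum()
}
```

With the eight rows above the total is 1,616, matching the figure quoted in the text.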
Net target for v1.7.52: container subsystem ≈ half of today's LoC.
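The flag-gated rollout in the last row of the LoC table can be sketched as follows. This is an illustration only: `Config` here is a pared-down stand-in for the real struct, and `InstallPath`, `choose_install_path`, `quadlet_unit_path` and `write_atomically` are hypothetical names, not the functions in prod_orchestrator.rs. The sketch assumes the usual rootless Quadlet layout, where a `<name>.container` file under ~/.config/containers/systemd/ yields a generated `<name>.service`, and it shows one common way to make the write atomic (temp file in the same directory, then rename).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical, pared-down stand-in for the real Config.
#[derive(Debug, Default)]
struct Config {
    /// Opt-in switch; bool's default is false, so the imperative path stays active.
    use_quadlet_backends: bool,
}

#[derive(Debug, PartialEq)]
enum InstallPath {
    /// create_container + start_container, parented under the daemon's cgroup.
    Imperative,
    /// Render a Quadlet unit, daemon-reload, start the generated service.
    Quadlet,
}

/// Pick the install path from the flag alone.
fn choose_install_path(cfg: &Config) -> InstallPath {
    if cfg.use_quadlet_backends {
        InstallPath::Quadlet
    } else {
        InstallPath::Imperative
    }
}

/// Where a rootless Quadlet unit for `name` would live under `home`.
fn quadlet_unit_path(home: &Path, name: &str) -> PathBuf {
    home.join(".config/containers/systemd")
        .join(format!("{name}.container"))
}

/// Write `contents` to `dest` through a sibling temp file plus rename, so a
/// concurrent reader sees either the old unit or the new one, never a partial file.
fn write_atomically(dest: &Path, contents: &str) -> io::Result<()> {
    let dir = dest
        .parent()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "no parent dir"))?;
    fs::create_dir_all(dir)?;
    let tmp = dest.with_extension("tmp");
    fs::write(&tmp, contents)?;
    fs::rename(&tmp, dest)
}
```

Because the default is false, code that never sets the flag keeps taking the imperative branch, which is what makes the change safe to ship before the 20× run has gone green.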
Performance KPIs (TBD — measure first, then target)
We don't have a performance harness yet. Add as L6 lands:
| KPI | Today | Target | Notes |
|---|---|---|---|
| cold install: bitcoin-knots manifest → running healthcheck | unknown | < 30s once image is local | excludes the ~1GB image pull |
| cold install: lnd | unknown | < 60s once image is local | wallet unlock dominates |
| reconcile-tick wall time (no-op pass over all installed apps) | unknown | < 250ms | the current orchestrator does many podman inspect calls |
| podman shell-outs per package.install (orchestrator path) | 7-10 | 1-2 (Quadlet) | post-Phase-3 |
| daemon startup (boot → port 5678 listening) | unknown | < 5s | reconcile is async after this |
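Since most rows are still "unknown", an L6 harness needs to distinguish "not measured yet" from "measured and missed". A minimal sketch of that distinction, using only the standard library; `Kpi`, `Verdict`, `verdict`, `report` and `time_it` are hypothetical names, and the comparison is strict because the targets in the table are written as "<".

```rust
use std::time::{Duration, Instant};

/// One KPI row: a target plus an optional measurement ("unknown" = None).
struct Kpi {
    name: &'static str,
    target: Duration,
    measured: Option<Duration>,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Unmeasured,
    Met,
    Missed,
}

/// Compare a measurement against its target; targets are strict upper bounds.
fn verdict(k: &Kpi) -> Verdict {
    match k.measured {
        None => Verdict::Unmeasured,
        Some(d) if d < k.target => Verdict::Met,
        Some(_) => Verdict::Missed,
    }
}

/// One-line summary suitable for a scoreboard or log.
fn report(k: &Kpi) -> String {
    format!("{}: {:?}", k.name, verdict(k))
}

/// Wall-clock time of a closure; a building block for filling in `measured`.
fn time_it<F: FnOnce()>(f: F) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}
```

For example, a reconcile-tick row with a 250ms target and no measurement reports as `Unmeasured` rather than silently passing.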
Release gates
v1.7.52 ships only when ALL of:
- ☐ Bitcoin-stops fix verified live on a fresh node (tests/lifecycle/bats/bitcoin-knots.bats fully ● after a cold install)
- ☐ tests/lifecycle/run-20x.sh returns 0 against .228 (full suite, ARCHY_ALLOW_DESTRUCTIVE=1)
- ☐ tests/lifecycle/run-20x.sh returns 0 against .198 (same)
- ☐ The L3 backend-survives-archipelago-restart suite passes (= Phase 3 Quadlet shipped for backends)
- ☐ Cargo: 0 warnings, 0 unused, all tests green (sustained ✓ since 1c0df95f)
- ☐ LoC: at least one of {Phase 3 Quadlet, dev_mode resolution} merged
- ☐ Layman-readable changelog (per feedback_changelog_layman.md)
- ☐ Tag pushed to origin + gitea-local + gitea-vps2 (per feedback_ship_ritual.md)
How to update this document
When you land a change that materially moves any cell of the matrix or any LoC row, update this file in the same commit. Reviewers checking the PR can read the diff to TESTING.md as the answer to "what did this commit improve?". Without the update, the change is half-shipped.
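The same-commit rule could be approximated mechanically. The sketch below is a coarse proxy, not the rule itself: it only checks whether a change set that touches container-subsystem paths also touches TESTING.md, and cannot tell whether a change "materially moves" a cell. The path prefixes are taken from the LoC table and are an assumption about what counts as the subsystem; the function names are hypothetical.

```rust
/// Path prefixes treated as "the container subsystem" (assumption, based on
/// the module paths listed in the LoC table).
const CONTAINER_PREFIXES: [&str; 2] =
    ["core/container/", "core/archipelago/src/container/"];

/// File name of this scorecard; its directory in the repository is not assumed.
const SCORECARD: &str = "TESTING.md";

/// Does any changed path fall under a container-subsystem prefix?
fn touches_container_subsystem(changed: &[&str]) -> bool {
    changed
        .iter()
        .any(|p| CONTAINER_PREFIXES.iter().any(|pre| p.starts_with(*pre)))
}

/// Does the change set include the scorecard, at the root or in a subdirectory?
fn updates_scorecard(changed: &[&str]) -> bool {
    changed
        .iter()
        .any(|p| *p == SCORECARD || p.ends_with("/TESTING.md"))
}

/// Acceptable if the subsystem is untouched, or the scorecard moved with it.
fn scorecard_rule_ok(changed: &[&str]) -> bool {
    !touches_container_subsystem(changed) || updates_scorecard(changed)
}
```

A check like this can flag a forgotten update, but a reviewer still has to judge whether the scoreboard edit is honest.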