diff --git a/tests/lifecycle/TESTING.md b/tests/lifecycle/TESTING.md
new file mode 100644
index 00000000..0ef53835
--- /dev/null
+++ b/tests/lifecycle/TESTING.md
@@ -0,0 +1,135 @@
+# Container subsystem testing — scorecard and roadmap
+
+The bar (verbatim from the v1.7.52 owner):
+
+> "best performant, minimal code, tested containers possible in the world.
+> No bloated code, no problems installing a single one, no problems
+> uninstalling, every one needs to be tested 20+ times in every state
+> before we make another update, not a single container failure outside
+> of hardware or internet failure is allowed."
+
+This document is the live tracker for whether we're meeting that bar.
+Every PR that touches the container subsystem updates the scoreboard
+below. **If you can't honestly tick the box, the change isn't ready.**
+
+## Test layers
+
+| Layer | What it asserts | Toolchain | Latency / iteration |
+|---|---|---|---|
+| L0 — Rust unit | Pure-function behaviour (manifest parsing, secret resolution, structural invariants) | `cargo test --workspace --bins` | ~5s |
+| L1 — RPC API | The JSON-RPC API responds correctly per app (`container-list`, `package.{install,start,stop,restart,uninstall}`, `bitcoin.getinfo`, etc.) | bats + lib/rpc.bash | ~30s per suite |
+| L2 — UI surface | The URLs a user actually clicks (dashboard, `/app//`, direct-port iframes) return 200 with non-empty bodies | bats + lib/ui-probes.bash | ~10s per suite |
+| L3 — Lifecycle survival | Containers survive operational events (archipelago restart, host reboot, kill -9 mid-install, OOM) | bats (gated) | ~60s per scenario |
+| L4 — Browser journey | Real DOM-level user flow (login → install → wait → click → use) | playwright (TBD) | ~30-120s per journey |
+| L5 — Chaos / failure-path | Failure modes recover gracefully (corrupt config, deleted bolt DB, network partition) | bats (chaos-gated) | ~120s per scenario |
+| L6 — Performance | Cold install latency, reconcile-tick cost, podman call count per lifecycle event | timed bats + Prometheus (TBD) | ~60s per benchmark |
+
+Release gate: **L0+L1+L2+L3 green × 20 iterations** on .228 AND .198. L4+L5+L6 are
+quality gates we add as they mature; not blocking the v1.7.52 tag.
+
+## Coverage matrix — current state
+
+Legend: ● fully covered, ◐ partial, ○ missing
+
+### Per-app × per-state matrix (L1 + L2)
+
+| App | Container present | Valid state | RPC reachable | UI URL 200 | Stop | Start | Restart | Reinstall | Reboot survives | Archipelago-restart survives |
+|---|---|---|---|---|---|---|---|---|---|---|
+| bitcoin-knots | ● | ● | ● | ● (port 8334) | ● | ● | ● | ● | ○ | ◐ regression-gate only |
+| bitcoin-core | ◐ shares with knots | ◐ | ○ | ◐ | ○ | ○ | ○ | ○ | ○ | ◐ regression-gate |
+| lnd | ● | ● | ● (lncli) | ● (`/app/lnd/`) | ● | ● | ● | ● | ○ | ◐ regression-gate |
+| electrumx | ● | ● | ● (TCP 50001) | ● (`/app/electrumx/`) | ● | ● | ● | ● | ○ | ◐ regression-gate |
+| btcpay-server | ◐ via required-stack | ○ | ○ | ● probe-only | ○ | ○ | ○ | ○ | ○ | ○ |
+| mempool | ◐ via required-stack | ○ | ◐ via required-stack | ● probe-only | ○ | ○ | ○ | ○ | ○ | ○ |
+| fedimint | ◐ via required-stack | ○ | ○ | ● probe-only | ○ | ○ | ○ | ○ | ○ | ○ |
+| filebrowser | ○ | ○ | ○ | ● probe-only | ○ | ○ | ○ | ○ | ○ | ◐ via companions |
+| archy-bitcoin-ui | ◐ via companions | ◐ | n/a | ● (port 8334) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
+| archy-lnd-ui | ◐ via companions | ◐ | n/a | ● (`/app/lnd/`) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
+| archy-electrs-ui | ◐ via companions | ◐ | n/a | ● (`/app/electrumx/`) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
+
+Done: 23 of 110 cells. Goal: 110/110 ● for the listed apps before
+v1.7.52 tags.
+
+### Layer-by-layer status
+
+| Layer | Tests | Suites | Status |
+|---|---:|---:|---|
+| L0 unit | 615 | n/a | ● green |
+| L1 RPC | 30 → growing | bitcoin-knots, lnd, electrumx, required-stack, package-update-smoke | ● for the 3 single-container core apps |
+| L2 UI | 9 | ui-coverage | ● for dashboard + 7 proxy paths + bitcoin-ui:8334 |
+| L3 lifecycle survival | 8 | companion-survives-archipelago-restart, backend-survives-archipelago-restart, required-stack-destructive | ◐ companions ● ; backends ◐ regression-gate (will fail until Phase 3 Quadlet ships) |
+| L4 browser journey | 0 | none | ○ not started |
+| L5 chaos | 0 | none | ○ not started |
+| L6 performance | 0 | none | ○ not started |
+
+## Run commands
+
+```bash
+# L0 unit:
+cd core && cargo test --workspace --bins
+
+# Single bats suite:
+ARCHY_PASSWORD=password123 tests/lifecycle/run.sh bitcoin-knots
+
+# Full bats suite (read-only):
+ARCHY_PASSWORD=password123 tests/lifecycle/run.sh
+
+# Full + destructive (for the verification fleet):
+ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 tests/lifecycle/run.sh
+
+# 20× release-gate run (the actual v1.7.52 ship gate):
+ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 \
+  tests/lifecycle/run-20x.sh
+```
+
+## LoC budget
+
+Goal: minimum-viable container subsystem.
+
+| Module | LoC today | Target | Δ | Status |
+|---|---:|---:|---:|---|
+| `core/container/src/dependency_resolver.rs` | — | — | -270 | ● deleted |
+| `core/container/src/health_monitor.rs` | 196 | 0 | -196 | ◐ pending health migration into reconciler (Phase 3.5) |
+| `core/container/src/podman_client.rs::create/start/stop` | ~400 | ~150 | -250 | ◐ pending Quadlet migration (Phase 3.5) |
+| `core/archipelago/src/container/dev_orchestrator.rs` | 410 | 0 | -410 | ○ pending dev_mode strategy decision |
+| `core/archipelago/src/container/data_manager.rs` | 96 | 0 | -96 | ○ couples with dev_orchestrator |
+| `core/container/src/bitcoin_simulator.rs` | 219 | 0 | -219 | ○ couples with dev_orchestrator |
+| `core/container/src/port_manager.rs` | 175 | 0 | -175 | ○ couples with dev_orchestrator |
+| `core/archipelago/src/api/rpc/package/install.rs::install_bitcoincoin_rpc_repair` | ~150 | 0 | -150 | ◐ pending fold into orchestrator pre-start |
+| imperative `install_fresh` in prod_orchestrator | ~120 | 0 | -120 | ○ pending Phase 3.2 Quadlet renderer |
+
+**Today: -270 LoC committed. Outstanding deletes possible: ~1,616 LoC** (if Phase 3 ships fully + dev_mode resolved).
+
+Net target for v1.7.52: container subsystem ≈ **half** of today's LoC.
+
+## Performance KPIs (TBD — measure first, then target)
+
+We don't have a performance harness yet. Add as L6 lands:
+
+| KPI | Today | Target | Notes |
+|---|---|---|---|
+| cold install: bitcoin-knots manifest → `running` healthcheck | unknown | < 30s once image is local | excludes the ~1GB image pull |
+| cold install: lnd | unknown | < 60s once image is local | wallet unlock dominates |
+| reconcile-tick wall time (no-op pass over all installed apps) | unknown | < 250ms | the current orchestrator does many `podman inspect` calls |
+| podman shell-outs per package.install (orchestrator path) | 7-10 | 1-2 (Quadlet) | post-Phase-3 |
+| daemon startup (boot → port 5678 listening) | unknown | < 5s | reconcile is async after this |
+
+## Release gates
+
+v1.7.52 ships only when ALL of:
+
+1. ☐ Bitcoin-stops fix verified live on a fresh node (tests/lifecycle/bats/bitcoin-knots.bats fully ● after a cold install)
+2. ☐ `tests/lifecycle/run-20x.sh` returns 0 against .228 (full suite, ARCHY_ALLOW_DESTRUCTIVE=1)
+3. ☐ `tests/lifecycle/run-20x.sh` returns 0 against .198 (same)
+4. ☐ The L3 `backend-survives-archipelago-restart` suite passes (= Phase 3 Quadlet shipped for backends)
+5. ☐ Cargo: 0 warnings, 0 unused, all tests green (sustained ✓ since 1c0df95f)
+6. ☐ LoC: at least one of {Phase 3 Quadlet, dev_mode resolution} merged
+7. ☐ Layman-readable changelog (per `feedback_changelog_layman.md`)
+8. ☐ Tag pushed to origin + gitea-local + gitea-vps2 (per `feedback_ship_ritual.md`)
+
+## How to update this document
+
+When you land a change that materially moves any cell of the matrix or
+any LoC row, update this file in the same commit. Reviewers checking
+the PR can read the diff to TESTING.md as the answer to "what did
+this commit improve?". Without the update, the change is half-shipped.
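
The release gate in TESTING.md asks for 20 consecutive green runs. As a minimal sketch of that pattern, a loop can run a command a fixed number of times and stop at the first failure. This is an illustration only, not the contents of `tests/lifecycle/run-20x.sh`, which this diff does not show; the function name `run_n_times` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an N-iteration gate. This is NOT the real
# tests/lifecycle/run-20x.sh; it only illustrates the "green x N" idea.

# run_n_times ITERATIONS COMMAND [ARGS...]
# Runs COMMAND up to ITERATIONS times. Stops at the first failing run and
# returns that run's exit status; returns 0 only if every run succeeded.
run_n_times() {
  local iterations=$1
  shift
  local i status
  for ((i = 1; i <= iterations; i++)); do
    echo "iteration ${i}/${iterations}: $*"
    # The '||' list captures the failing status without tripping 'set -e'.
    "$@" || {
      status=$?
      echo "iteration ${i} failed with exit status ${status}" >&2
      return "$status"
    }
  done
  echo "all ${iterations} iterations passed"
}
```

Assuming `ARCHY_PASSWORD` and `ARCHY_ALLOW_DESTRUCTIVE` are exported as in the run commands, `run_n_times 20 tests/lifecycle/run.sh` would approximate the gate. Stopping at the first failure keeps a red run from being masked by later green ones.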