# Container subsystem testing — scorecard and roadmap

The bar (verbatim from the v1.7.52 owner):

> "best performant, minimal code, tested containers possible in the world.
> No bloated code, no problems installing a single one, no problems
> uninstalling, every one needs to be tested 20+ times in every state
> before we make another update, not a single container failure outside
> of hardware or internet failure is allowed."

This document is the live tracker for whether we're meeting that bar.
Every PR that touches the container subsystem updates the scoreboard
below. **If you can't honestly tick the box, the change isn't ready.**

## Test layers

| Layer | What it asserts | Toolchain | Latency / iteration |
|---|---|---|---|
| L0 — Rust unit | Pure-function behaviour (manifest parsing, secret resolution, structural invariants) | `cargo test --workspace --bins` | ~5s |
| L1 — RPC API | The JSON-RPC API responds correctly per app (`container-list`, `package.{install,start,stop,restart,uninstall}`, `bitcoin.getinfo`, etc.) | bats + lib/rpc.bash | ~30s per suite |
| L2 — UI surface | The URLs a user actually clicks (dashboard, `/app/<id>/`, direct-port iframes) return 200 with non-empty bodies | bats + lib/ui-probes.bash | ~10s per suite |
| L3 — Lifecycle survival | Containers survive operational events (archipelago restart, host reboot, kill -9 mid-install, OOM) | bats (gated) | ~60s per scenario |
| L4 — Browser journey | Real DOM-level user flow (login → install → wait → click → use) | playwright (TBD) | ~30-120s per journey |
| L5 — Chaos / failure-path | Failure modes recover gracefully (corrupt config, deleted bolt DB, network partition) | bats (chaos-gated) | ~120s per scenario |
| L6 — Performance | Cold install latency, reconcile-tick cost, podman call count per lifecycle event | timed bats + Prometheus (TBD) | ~60s per benchmark |

Release gate: **L0+L1+L2+L3 green × 20 iterations** on .228 AND .198. L4, L5,
and L6 are quality gates that we add as they mature; they do not block the
v1.7.52 tag.

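The L2 pass condition in the table above (HTTP 200 with a non-empty body) can be sketched as a small probe. This is a minimal illustration and not the contents of `lib/ui-probes.bash`, which this document does not show; the function names `probe_passes` and `probe_url` and the 10-second timeout are assumptions.

```bash
# Hypothetical helpers illustrating the L2 rule: 200 status AND non-empty body.

# probe_passes STATUS BODY_BYTES -> exit 0 only for HTTP 200 with at least one byte.
probe_passes() {
  local status=$1 bytes=$2
  [ "$status" = "200" ] && [ "$bytes" -gt 0 ]
}

# probe_url URL -> fetch URL with curl and apply probe_passes to the result.
probe_url() {
  local url=$1 result
  # -s: silent; -o /dev/null: discard the body; -w: print status code and byte count.
  result=$(curl -s -o /dev/null --max-time 10 \
    -w '%{http_code} %{size_download}' "$url") || return 1
  # Split "200 1234" into two positional parameters (unquoted on purpose).
  set -- $result
  probe_passes "$1" "$2"
}
```

For example, `probe_url "http://NODE/app/lnd/"` (with `NODE` standing in for the target host) exits 0 only when the page answers 200 and returns a body.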
## Coverage matrix — current state

Legend: ● fully covered, ◐ partial, ○ missing

### Per-app × per-state matrix (L1 + L2)

| App | Container present | Valid state | RPC reachable | UI URL 200 | Stop | Start | Restart | Reinstall | Reboot survives | Archipelago-restart survives |
|---|---|---|---|---|---|---|---|---|---|---|
| bitcoin-knots | ● | ● | ● | ● (port 8334) | ● | ● | ● | ● | ○ | ◐ regression-gate only |
| bitcoin-core | ◐ shares with knots | ◐ | ○ | ◐ | ○ | ○ | ○ | ○ | ○ | ◐ regression-gate |
| lnd | ● | ● | ● (lncli) | ● (`/app/lnd/`) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| electrumx | ● | ● | ● (TCP 50001) | ● (`/app/electrumx/`) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| btcpay-server | ● | ● | ◐ frontend-port | ● (`/app/btcpay/`) | ● | ● | ● | ● | ○ | ○ |
| mempool | ● | ● | ● (`/api/v1/backend-info`) | ● (`/app/mempool/`) | ● | ● | ● | ● | ○ | ○ |
| fedimint | ● | ● | ◐ container-only | ● (`/app/fedimint/`) | ● | ● | ● | ● | ○ | ○ |
| filebrowser | ○ | ○ | ○ | ● probe-only | ○ | ○ | ○ | ○ | ○ | ◐ via companions |
| archy-bitcoin-ui | ◐ via companions | ◐ | n/a | ● (port 8334) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-lnd-ui | ◐ via companions | ◐ | n/a | ● (`/app/lnd/`) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-electrs-ui | ◐ via companions | ◐ | n/a | ● (`/app/electrumx/`) | ○ | ○ | ○ | n/a | ◐ via companions | ● |

Done: 50 of 110 cells. Goal: 110/110 ● for the listed apps before
v1.7.52 tags.

### Layer-by-layer status

| Layer | Tests | Suites | Status |
|---|---:|---:|---|
| L0 unit | 631 | n/a | ● green |
| L1 RPC | 70 | bitcoin-knots, lnd, electrumx, btcpay, mempool, fedimint, required-stack, package-update-smoke | ● for the 6 core apps |
| L2 UI | 9 | ui-coverage | ● for dashboard + 7 proxy paths + bitcoin-ui:8334 |
| L3 lifecycle survival | 14 | companion-survives-archipelago-restart, backend-survives-archipelago-restart, required-stack-destructive, use-quadlet-backends-install | ◐ companions ● ; backends ◐ regression-gate (will fail until Phase 3 Quadlet ships); quadlet post-condition gate ✅ skip-clean today, hard gate when flag flipped |
| L4 browser journey | 0 | none | ○ not started |
| L5 chaos | 0 | none | ○ not started |
| L6 performance | 0 | none | ○ not started |

## Run commands

```bash
# L0 unit:
cd core && cargo test --workspace --bins

# Single bats suite:
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh bitcoin-knots

# Full bats suite (read-only):
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh

# Full + destructive (for the verification fleet):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 tests/lifecycle/run.sh

# 20× release-gate run (the actual v1.7.52 ship gate):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 \
  tests/lifecycle/run-20x.sh
```

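The contents of `tests/lifecycle/run-20x.sh` are not reproduced in this document. As a rough sketch of what a 20× gate means, the loop below runs a command a fixed number of times and fails as soon as one iteration fails. `run_n_times` is a hypothetical name, and the real script may report results or handle failures differently.

```bash
# run_n_times N CMD [ARGS...] -> run CMD up to N times; stop at the first failure.
run_n_times() {
  local n=$1 i=1
  shift
  while [ "$i" -le "$n" ]; do
    echo "iteration $i/$n" >&2
    # Any non-zero exit fails the whole gate immediately.
    "$@" || { echo "iteration $i failed" >&2; return 1; }
    i=$((i + 1))
  done
}
```

Under these assumptions, `run_n_times 20 tests/lifecycle/run.sh` returns 0 only if all twenty passes are green, which matches the "returns 0" wording of the release gates.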
To exercise the Phase 3.2 Quadlet-backend path on a target node without
editing `config.json` (which would require an archipelago restart and
trigger FM3 until 3.5 ships), set the env var on `archipelago.service`:

```bash
# In the editor that opens, add a [Service] section containing:
#   Environment=ARCHIPELAGO_USE_QUADLET_BACKENDS=1
sudo systemctl edit archipelago
sudo systemctl restart archipelago # one cgroup-cascade hit; survivable on a debug node
```

After the restart, `package.install` for any orchestrator-managed backend
will route through `install_via_quadlet`, and the
`use-quadlet-backends-install.bats` suite turns from skip → hard gate.

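`systemctl edit` is interactive. For scripted setups, the same override can be written as a drop-in file, which is what `systemctl edit` itself produces under `/etc/systemd/system/<unit>.d/override.conf`. A minimal sketch, assuming the unit is named `archipelago.service`; `write_quadlet_dropin` is a hypothetical helper, not something from the repository.

```bash
# write_quadlet_dropin DIR -> write the env-var override into DIR/override.conf.
write_quadlet_dropin() {
  local dir=$1
  mkdir -p "$dir" || return 1
  cat > "$dir/override.conf" <<'EOF'
[Service]
Environment=ARCHIPELAGO_USE_QUADLET_BACKENDS=1
EOF
}

# Assumed usage on a debug node (needs root):
#   write_quadlet_dropin /etc/systemd/system/archipelago.service.d
#   systemctl daemon-reload && systemctl restart archipelago
```

Unlike `systemctl edit`, a hand-written drop-in is not picked up until `systemctl daemon-reload` runs, hence the extra step in the usage comment.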
## LoC budget

Goal: minimum-viable container subsystem.

| Module | LoC today | Target | Δ | Status |
|---|---:|---:|---:|---|
| `core/container/src/dependency_resolver.rs` | — | — | -270 | ● deleted |
| `core/container/src/health_monitor.rs` | 196 | 0 | -196 | ◐ pending health migration into reconciler (Phase 3.5) |
| `core/container/src/podman_client.rs::create/start/stop` | ~400 | ~150 | -250 | ◐ pending Quadlet migration (Phase 3.5) |
| `core/archipelago/src/container/dev_orchestrator.rs` | 410 | 0 | -410 | ○ pending dev_mode strategy decision |
| `core/archipelago/src/container/data_manager.rs` | 96 | 0 | -96 | ○ couples with dev_orchestrator |
| `core/container/src/bitcoin_simulator.rs` | 219 | 0 | -219 | ○ couples with dev_orchestrator |
| `core/container/src/port_manager.rs` | 175 | 0 | -175 | ○ couples with dev_orchestrator |
| `core/archipelago/src/api/rpc/package/install.rs::install_bitcoincoin_rpc_repair` | ~150 | 0 | -150 | ◐ pending fold into orchestrator pre-start |
| imperative `install_fresh` in prod_orchestrator | ~120 | 0 | -120 | ◐ Phase 3.2 wired behind `use_quadlet_backends` flag (default off); 3.3 in-place migration ✅; 3.4 health-gated startup (`Notify=healthy`) ✅ + `TimeoutStartSec=600` race fix ✅; 3.4a unit drift-sync each reconcile ✅; flip default after 20× green |

**Today: -270 LoC committed. Outstanding deletes possible: ~1,616 LoC** (if Phase 3 ships fully + dev_mode resolved).

Net target for v1.7.52: container subsystem ≈ **half** of today's LoC.

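The ~1,616 figure is the sum of the Δ column over the rows that are still pending, that is, every row except the already-deleted `dependency_resolver.rs`. A quick arithmetic check with the values copied from the table; `sum_deltas` is a name chosen purely for illustration.

```bash
# Pending Δ values from the LoC table, top to bottom (the -270 row is already committed).
pending_deltas="-196 -250 -410 -96 -219 -175 -150 -120"

# sum_deltas N... -> print the sum of the integer arguments.
sum_deltas() {
  local total=0 d
  for d in "$@"; do
    total=$(( $total + $d ))
  done
  echo "$total"
}

# Unquoted on purpose so the list splits into one argument per row.
sum_deltas $pending_deltas   # prints -1616
```

Several of the inputs are marked approximate (`~`) in the table, so the total is an estimate rather than an exact line count.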
## Performance KPIs (TBD — measure first, then target)

We don't have a performance harness yet. Add measurements as L6 lands:

| KPI | Today | Target | Notes |
|---|---|---|---|
| cold install: bitcoin-knots manifest → `running` healthcheck | unknown | < 30s once image is local | excludes the ~1GB image pull |
| cold install: lnd | unknown | < 60s once image is local | wallet unlock dominates |
| reconcile-tick wall time (no-op pass over all installed apps) | unknown | < 250ms | the current orchestrator does many `podman inspect` calls |
| podman shell-outs per package.install (orchestrator path) | 7-10 | 1-2 (Quadlet) | post-Phase-3 |
| daemon startup (boot → port 5678 listening) | unknown | < 5s | reconcile is async after this |

## Release gates

v1.7.52 ships only when ALL of:

1. ☐ Bitcoin-stops fix verified live on a fresh node (tests/lifecycle/bats/bitcoin-knots.bats fully ● after a cold install)
2. ☐ `tests/lifecycle/run-20x.sh` returns 0 against .228 (full suite, ARCHY_ALLOW_DESTRUCTIVE=1)
3. ☐ `tests/lifecycle/run-20x.sh` returns 0 against .198 (same)
4. ☐ The L3 `backend-survives-archipelago-restart` suite passes (= Phase 3 Quadlet shipped for backends)
5. ☐ Cargo: 0 warnings, 0 unused, all tests green (sustained ✓ since 1c0df95f)
6. ☐ LoC: at least one of {Phase 3 Quadlet, dev_mode resolution} merged
7. ☐ Layman-readable changelog (per `feedback_changelog_layman.md`)
8. ☐ Tag pushed to origin + gitea-local + gitea-vps2 (per `feedback_ship_ritual.md`)

## How to update this document

When you land a change that materially moves any cell of the matrix or
any LoC row, update this file in the same commit. Reviewers checking
the PR can read the diff to TESTING.md as the answer to "what did
this commit improve?". Without the update, the change is half-shipped.
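One way to check the same-commit rule mechanically is to inspect the list of changed paths. The sketch below is an illustration under stated assumptions: the two path prefixes are taken from the LoC table and may not cover the whole container subsystem, the function name `scorecard_ok` is hypothetical, and no such check is described in this document.

```bash
# scorecard_ok: read changed paths on stdin, one per line.
# Exit 0 unless container code changed without TESTING.md changing too.
scorecard_ok() {
  local touched_container=0 touched_doc=0 path
  while IFS= read -r path; do
    case "$path" in
      core/container/*|core/archipelago/src/container/*) touched_container=1 ;;
      TESTING.md|*/TESTING.md) touched_doc=1 ;;
    esac
  done
  [ "$touched_container" -eq 0 ] || [ "$touched_doc" -eq 1 ]
}

# Assumed usage in CI or a local hook (the base branch name is an assumption):
#   git diff --name-only origin/main...HEAD | scorecard_ok
```

A path-based check can only tell that the file was touched, not that the right cell moved, so it complements reviewer judgment rather than replacing it.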