archy/tests/lifecycle/TESTING.md
archipelago 0ed892a412 fix: wallet receive reliability, bitcoin install self-heal, ElectrumX app tile
Fixes three Bitcoin/wallet failures observed across the fleet on v1.7.90-alpha
(all nodes were already on the latest build — these were live bugs, not stale
builds), plus the missing ElectrumX tile, and adds automated coverage so each
can't regress silently.

Receive address (".116 receive fails", ".228 false 'wallet is locked'"):
- LND publishes its REST API on a host port that can drift from the manifest
  (a container created when the mapping was 8080 kept publishing 8080 after the
  manifest moved to 18080). The in-process client connects to the manifest port,
  gets connection-refused, and wallet init fails forever while the container
  looks "Up". Add published-port drift detection to the reconciler
  (container_ports_drifted / host_port_bindings_drifted) that recreates a
  drifted backend even for restart-sensitive apps — a drifted container is
  already broken, so leaving it "untouched" only perpetuates the failure.
- Receive errors now carry a stable [CODE] token (REST_UNREACHABLE, WALLET_LOCKED,
  WALLET_UNINITIALIZED, SYNCING) and always start with "Bitcoin address" so they
  survive the RPC error sanitizer instead of collapsing to the generic
  "Operation failed". The UI maps the code instead of guessing wallet state from
  substrings — so an unreachable REST endpoint is no longer mislabelled "locked".
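
A minimal sketch of code-based mapping, for illustration only. The four codes and the "Bitcoin address" prefix come from the text above; the function name `receive_error_code`, the `UNKNOWN` fallback, and the exact message layout are assumptions, not the shipped classifier.

```bash
# Hypothetical helper: map a receive error message to its stable code.
# Only a bracketed [CODE] token counts; wording such as "locked" is ignored.
receive_error_code() {
  case $1 in
    "Bitcoin address"*"[REST_UNREACHABLE]"*)     echo REST_UNREACHABLE ;;
    "Bitcoin address"*"[WALLET_LOCKED]"*)        echo WALLET_LOCKED ;;
    "Bitcoin address"*"[WALLET_UNINITIALIZED]"*) echo WALLET_UNINITIALIZED ;;
    "Bitcoin address"*"[SYNCING]"*)              echo SYNCING ;;
    *)                                           echo UNKNOWN ;;  # assumed fallback
  esac
}
```

Matching on the token rather than on free text is what keeps an unreachable REST endpoint from being reported as a locked wallet.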

Bitcoin install (".198 bitcoin gone / reinstall just stops"):
- bitcoin-knots requires the secret bitcoin-rpc-txrelay-rpcauth, which was only
  generated by the tx-relay flow. Nodes that never used tx-relay lacked it, so
  secret resolution hard-failed and the whole Bitcoin stack cascaded. Generate
  it idempotently before bitcoin starts (ensure_app_secrets, reusing
  ensure_txrelay_credentials), and name the missing secret in the error so a
  genuine gap is actionable instead of a bare "IO error".
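
The "name the missing secret" idea can be sketched as a read-only check. This is an illustration under assumptions: `missing_secrets` is a hypothetical helper, and the secrets directory and the list of required names are passed in by the caller rather than read from a manifest.

```bash
# Hypothetical helper: report every required secret that is not present.
# Usage: missing_secrets <secrets_dir> <secret_name>...
# Returns 1 if any secret is missing, 0 otherwise.
missing_secrets() {
  local dir="$1" status=0 name
  shift
  for name in "$@"; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing secret: $name"   # name the gap instead of a bare IO error
      status=1
    fi
  done
  return "$status"
}
```

Reporting all gaps in one pass, rather than stopping at the first, keeps a single run actionable.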

ElectrumX app tile missing on every node with it installed:
- The catalog generator dropped electrumx because the manifest had no
  interfaces.main block, so the tile had no launch URL and was hidden. Declare
  the companion UI port (50002) in the manifest, regenerate the catalog, and let
  an app with a known launch URL stay launchable while its backend is still
  "starting" (ElectrumX indexes for 10m+).

Test harness:
- New lifecycle bats suites: bitcoin-receive, port-drift, secret-completeness
  (validated live; port-drift catches the real .116 drift).
- Rust unit tests for drift detection, the receive reason-code classifier, and
  the named-missing-secret error; vitest for the UI code mapping.
- create-release.sh now runs tests/release/run.sh and aborts the release on
  failure — previously it ran no tests at all.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 03:12:56 -04:00

# Container subsystem testing — scorecard and roadmap
The bar (verbatim from the v1.7.52 owner):
> "best performant, minimal code, tested containers possible in the world.
> No bloated code, no problems installing a single one, no problems
> uninstalling, every one needs to be tested 20+ times in every state
> before we make another update, not a single container failure outside
> of hardware or internet failure is allowed."
This document is the live tracker for whether we're meeting that bar.
Every PR that touches the container subsystem updates the scoreboard
below. **If you can't honestly tick the box, the change isn't ready.**
## Test layers
| Layer | What it asserts | Toolchain | Latency / iteration |
|---|---|---|---|
| L0 — Rust unit | Pure-function behaviour (manifest parsing, secret resolution, structural invariants) | `cargo test --workspace --bins` | ~5s |
| L1 — RPC API | The JSON-RPC API responds correctly per app (`container-list`, `package.{install,start,stop,restart,uninstall}`, `bitcoin.getinfo`, etc.) | bats + lib/rpc.bash | ~30s per suite |
| L2 — UI surface | The URLs a user actually clicks (dashboard, `/app/<id>/`, direct-port iframes) return 200 with non-empty bodies | bats + lib/ui-probes.bash | ~10s per suite |
| L3 — Lifecycle survival | Containers survive operational events (archipelago restart, host reboot, kill -9 mid-install, OOM) | bats (gated) | ~60s per scenario |
| L4 — Browser journey | Real DOM-level user flow (login → install → wait → click → use) | playwright (TBD) | ~30-120s per journey |
| L5 — Chaos / failure-path | Failure modes recover gracefully (corrupt config, deleted bolt DB, network partition) | bats (chaos-gated) | ~120s per scenario |
| L6 — Performance | Cold install latency, reconcile-tick cost, podman call count per lifecycle event | timed bats + Prometheus (TBD) | ~60s per benchmark |
Release gate: **L0+L1+L2+L3 green × 20 iterations** on .228 AND .198. L4+L5+L6 are
quality gates we add as they mature; not blocking the v1.7.52 tag.
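
The "green × 20 iterations" rule amounts to a loop that stops at the first red run. The sketch below shows that shape only; `run_n_times` is a hypothetical name and this is not the contents of `run-20x.sh`.

```bash
# Hypothetical sketch of an N-iteration gate: run a command N times and
# fail as soon as one iteration fails.
# Usage: run_n_times <n> <command> [args...]
run_n_times() {
  local n="$1" i=1
  shift
  while [ "$i" -le "$n" ]; do
    if ! "$@"; then
      echo "iteration $i of $n failed" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "all $n iterations green"
}
```

For example, `run_n_times 20 tests/lifecycle/run.sh` would have to succeed on each node before the gate counts as met.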
## Coverage matrix — current state
Legend: ● fully covered, ◐ partial, ○ missing
### Per-app × per-state matrix (L1 + L2)
| App | Container present | Valid state | RPC reachable | UI URL 200 | Stop | Start | Restart | Reinstall | Reboot survives | Archipelago-restart survives |
|---|---|---|---|---|---|---|---|---|---|---|
| bitcoin-knots | ● | ● | ● | ● (port 8334) | ● | ● | ● | ● | ○ | ◐ regression-gate only |
| bitcoin-core | ◐ shares with knots | ◐ | ○ | ◐ | ○ | ○ | ○ | ○ | ○ | ◐ regression-gate |
| lnd | ● | ● | ● (lncli) | ● (`/app/lnd/`) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| electrumx | ● | ● | ● (TCP 50001) | ● (`/app/electrumx/`) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| btcpay-server | ● | ● | ◐ frontend-port | ● (`/app/btcpay/`) | ● | ● | ● | ● | ○ | ○ |
| mempool | ● | ● | ● (`/api/v1/backend-info`) | ● (`/app/mempool/`) | ● | ● | ● | ● | ○ | ○ |
| fedimint | ● | ● | ◐ container-only | ● (`/app/fedimint/`) | ● | ● | ● | ● | ○ | ○ |
| filebrowser | ○ | ○ | ○ | ● probe-only | ○ | ○ | ○ | ○ | ○ | ◐ via companions |
| archy-bitcoin-ui | ◐ via companions | ◐ | n/a | ● (port 8334) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-lnd-ui | ◐ via companions | ◐ | n/a | ● (`/app/lnd/`) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-electrs-ui | ◐ via companions | ◐ | n/a | ● (`/app/electrumx/`) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
Done: 53 of 110 cells. Goal: 110/110 ● for the listed apps before
v1.7.52 tags.
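
The tally can be recomputed from the table instead of by hand. A minimal sketch, assuming the matrix is piped in as Markdown with the app name in the first column; `matrix_tally` is a hypothetical helper and counts `n/a` cells in the total.

```bash
# Hypothetical helper: count fully covered (●) cells in a Markdown matrix.
# Reads the table on stdin and prints "<full> of <total>".
matrix_tally() {
  awk -F'|' '
    $2 ~ /^ *App *$/ || $2 ~ /^-+$/ { next }   # skip header and separator rows
    /^\|/ {
      for (i = 3; i < NF; i++) {               # fields 3..NF-1 are the state cells
        total++
        if (index($i, "●") > 0) full++
      }
    }
    END { printf "%d of %d\n", full, total }
  '
}
```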
### Layer-by-layer status
| Layer | Tests | Suites | Status |
|---|---:|---:|---|
| L0 unit | 631 | n/a | ● green |
| L1 RPC | 70 | bitcoin-knots, lnd, electrumx, btcpay, mempool, fedimint, required-stack, package-update-smoke | ● for the 6 core apps |
| L2 UI | 9 | ui-coverage | ● for dashboard + 7 proxy paths + bitcoin-ui:8334 |
| L3 lifecycle survival | 14 | companion-survives-archipelago-restart, backend-survives-archipelago-restart, required-stack-destructive, use-quadlet-backends-install | ◐ companions ● ; backends ◐ regression-gate (will fail until Phase 3 Quadlet ships); quadlet post-condition gate ✅ skip-clean today, hard gate when flag flipped |
| L1 wallet-receive / drift / secrets | 5 | bitcoin-receive, port-drift, secret-completeness | ● guards the v1.7.9x wallet fleet failures |
| L4 browser journey | 0 | none | ○ not started |
| L5 chaos | 0 | none | ○ not started |
| L6 performance | 0 | none | ○ not started |
### Wallet / Bitcoin fleet-failure regression suites (added after v1.7.90-alpha)
Three production failures shipped on v1.7.90-alpha despite the existing harness,
because nothing exercised the receive path, port-mapping drift, or secret
completeness on a live node. New suites close those gaps (all run on the archy
host, read-only, so they join `run.sh`/`run-20x.sh` automatically):
| Suite | Failure it guards | Asserts |
|---|---|---|
| `bitcoin-receive.bats` | .116 ("Operation failed" on receive) and .228 (false "wallet is locked") | LND REST reachable on the **manifest** host port; `lnd.newaddress` returns a `bc1…` address on a running node; receive errors are specific, never the generic catch-all |
| `port-drift.bats` | .116 (lnd REST stuck on host 8080 vs manifest 18080) | every installed backend's live `podman inspect` PortBindings match its manifest `ports:` (the external mirror of the orchestrator's `host_port_bindings_drifted`) |
| `secret-completeness.bats` | .198 (bitcoin-knots needs `bitcoin-rpc-txrelay-rpcauth`, never generated → stack cascade) | every `secret_file` referenced by an installed backend manifest exists in the secrets dir |
Backed by L0 unit tests (`cargo test … drift missing_secret lnd`) and a vitest
for the frontend reason-code mapping (`bitcoinReceive.test.ts`). The release
gate `scripts/create-release.sh` now runs `tests/release/run.sh` (which includes
these) and **aborts the release on failure** — previously it ran no tests at all.
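
The port-drift assertion reduces to a set comparison between what the manifest declares and what the container publishes. The sketch below shows only that comparison; `ports_drifted` is a hypothetical helper, the `host:container` list format is an assumption, and extracting the lists from the manifest and from `podman inspect` is left out.

```bash
# Hypothetical helper: exit 0 when the live port list differs from the
# manifest's, exit 1 when they match. Order and duplicates are ignored.
# Usage: ports_drifted "<manifest pairs>" "<live pairs>"
# Each argument is a whitespace-separated list of host:container pairs.
ports_drifted() {
  local want have
  want=$(printf '%s\n' $1 | sort -u)   # unquoted on purpose: split on whitespace
  have=$(printf '%s\n' $2 | sort -u)
  [ "$want" != "$have" ]
}
```

With illustrative values for the .116 case, `ports_drifted "18080:8080" "8080:8080"` reports drift even though the container itself looks "Up".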
## Run commands
```bash
# L0 unit:
cd core && cargo test --workspace --bins
# Single bats suite:
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh bitcoin-knots
# Full bats suite (read-only):
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh
# Full + destructive (for the verification fleet):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 tests/lifecycle/run.sh
# 20× release-gate run (the actual v1.7.52 ship gate):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 \
tests/lifecycle/run-20x.sh
```
To exercise the Phase 3.2 Quadlet-backend path on a target node without
editing config.json (which would require an archipelago restart and
trigger FM3 until 3.5 ships), set the env var on `archipelago.service`:
```bash
# In the editor that opens, add "[Service]" and, on the next line,
# "Environment=ARCHIPELAGO_USE_QUADLET_BACKENDS=1".
sudo systemctl edit archipelago
sudo systemctl restart archipelago # one cgroup-cascade hit; survivable on a debug node
```
After the restart, `package.install` for any orchestrator-managed backend
will route through `install_via_quadlet`, and the
`use-quadlet-backends-install.bats` suite turns from skip → hard gate.
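
On an unattended node, the interactive `systemctl edit` step above could be replaced by writing the drop-in directly. This is a sketch, not a project script: `write_quadlet_dropin` is a hypothetical helper, and it assumes the standard systemd drop-in location and the `override.conf` file name that `systemctl edit` uses.

```bash
# Hypothetical non-interactive equivalent of `systemctl edit archipelago`.
# Usage: write_quadlet_dropin [dropin_dir]
write_quadlet_dropin() {
  local dropin_dir="${1:-/etc/systemd/system/archipelago.service.d}"
  mkdir -p "$dropin_dir"
  printf '[Service]\nEnvironment=ARCHIPELAGO_USE_QUADLET_BACKENDS=1\n' \
    > "$dropin_dir/override.conf"
}
# On a real node (as root), follow with:
#   systemctl daemon-reload && systemctl restart archipelago
```

Removing the drop-in and restarting again returns the node to the default (flag off) path.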
## LoC budget
Goal: minimum-viable container subsystem.
| Module | LoC today | Target | Δ | Status |
|---|---:|---:|---:|---|
| `core/container/src/dependency_resolver.rs` | — | — | -270 | ● deleted |
| `core/container/src/health_monitor.rs` | 196 | 0 | -196 | ◐ pending health migration into reconciler (Phase 3.5) |
| `core/container/src/podman_client.rs::create/start/stop` | ~400 | ~150 | -250 | ◐ pending Quadlet migration (Phase 3.5) |
| `core/archipelago/src/container/dev_orchestrator.rs` | 410 | 0 | -410 | ○ pending dev_mode strategy decision |
| `core/archipelago/src/container/data_manager.rs` | 96 | 0 | -96 | ○ couples with dev_orchestrator |
| `core/container/src/bitcoin_simulator.rs` | 219 | 0 | -219 | ○ couples with dev_orchestrator |
| `core/container/src/port_manager.rs` | 175 | 0 | -175 | ○ couples with dev_orchestrator |
| `core/archipelago/src/api/rpc/package/install.rs::install_bitcoincoin_rpc_repair` | ~150 | 0 | -150 | ◐ pending fold into orchestrator pre-start |
| imperative `install_fresh` in prod_orchestrator | ~120 | 0 | -120 | ◐ Phase 3.2 wired behind `use_quadlet_backends` flag (default off); 3.3 in-place migration ✅; 3.4 health-gated startup (`Notify=healthy`) ✅ + `TimeoutStartSec=600` race fix ✅; 3.4a unit drift-sync each reconcile ✅; flip default after 20× green |
**Today: -270 LoC committed. Outstanding deletes possible: ~1,616 LoC** (if Phase 3 ships fully + dev_mode resolved).
Net target for v1.7.52: container subsystem ≈ **half** of today's LoC.
## Performance KPIs (TBD — measure first, then target)
We don't have a performance harness yet. Add as L6 lands:
| KPI | Today | Target | Notes |
|---|---|---|---|
| cold install: bitcoin-knots manifest → `running` healthcheck | unknown | < 30s once image is local | excludes the ~1GB image pull |
| cold install: lnd | unknown | < 60s once image is local | wallet unlock dominates |
| reconcile-tick wall time (no-op pass over all installed apps) | unknown | < 250ms | the current orchestrator does many `podman inspect` calls |
| podman shell-outs per package.install (orchestrator path) | 7-10 | 1-2 (Quadlet) | post-Phase-3 |
| daemon startup (boot port 5678 listening) | unknown | < 5s | reconcile is async after this |
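
Until the L6 harness exists, the "Today" column could be filled with a crude wall-clock measurement. The sketch below is one way to do that, not the planned harness: `wall_ms` and `under_target` are hypothetical helpers, and `wall_ms` relies on GNU `date` supporting `%N`.

```bash
# Hypothetical helper: print the wall time of a command in milliseconds.
# The command's own output and exit status are discarded.
wall_ms() {
  local start end
  start=$(date +%s%N)            # nanoseconds since epoch (GNU date)
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# Succeeds when a measured value (ms) is below its target (ms).
under_target() { [ "$1" -lt "$2" ]; }
```

A measurement only becomes a gate once a target is agreed, for example `under_target "$(wall_ms some-command)" 250` for the reconcile-tick row.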
## Release gates
v1.7.52 ships only when ALL of:
1. Bitcoin-stops fix verified live on a fresh node (`tests/lifecycle/bats/bitcoin-knots.bats` fully green after a cold install)
2. `tests/lifecycle/run-20x.sh` returns 0 against .228 (full suite, ARCHY_ALLOW_DESTRUCTIVE=1)
3. `tests/lifecycle/run-20x.sh` returns 0 against .198 (same)
4. The L3 `backend-survives-archipelago-restart` suite passes (= Phase 3 Quadlet shipped for backends)
5. Cargo: 0 warnings, 0 unused, all tests green (sustained since 1c0df95f)
6. LoC: at least one of {Phase 3 Quadlet, dev_mode resolution} merged
7. Layman-readable changelog (per `feedback_changelog_layman.md`)
8. Tag pushed to origin + gitea-local + gitea-vps2 (per `feedback_ship_ritual.md`)
## How to update this document
When you land a change that materially moves any cell of the matrix or
any LoC row, update this file in the same commit. Reviewers checking
the PR can read the diff to TESTING.md as the answer to "what did
this commit improve?". Without the update, the change is half-shipped.