archy/tests/lifecycle/TESTING.md
archipelago 0406af522c test(lifecycle): add manifest-driven all-apps health matrix
The per-app suites cover ~8 core apps in depth; nothing covered the ~30 others
(jellyfin, vaultwarden, penpot, nextcloud, grafana, …). all-apps-matrix.bats
derives the app set from server.get-state package-data (no hardcoded list) and
asserts baseline health across EVERY installed app:
  - settles to a non-transitional state within a window (the #13/#14 stuck-ghost
    class, generalized fleet-wide — installing/removing that never settles)
  - not in error/failed
  - reports a recognized (non-garbage) state
  - every running UI app (manifest ui=="true") exposes a non-null lan-address
    (the immich/port-drift unreachable-UI failure, generalized to all UI apps)

Read-only, so it joins run.sh/run-gate.sh on every node and grows coverage as
nodes install more apps. Verified 5/5 on .228 (17 apps) and .116 (20 apps).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 05:27:10 -04:00


Container subsystem testing — scorecard and roadmap

The bar (verbatim from the v1.7.52 owner):

"best performant, minimal code, tested containers possible in the world. No bloated code, no problems installing a single one, no problems uninstalling, every one needs to be tested 20+ times in every state before we make another update, not a single container failure outside of hardware or internet failure is allowed."

This document is the live tracker for whether we're meeting that bar. Every PR that touches the container subsystem updates the scoreboard below. If you can't honestly tick the box, the change isn't ready.


Production-quality pass — 2026-06-21 (current, v1.7.99-alpha)

The migration's aim, restated as five pillars (every app must satisfy all five):

  1. Quadlet-everywhere — every container is a declarative systemd Quadlet unit under user.slice, never inside archipelago.service's cgroup. Kills FM3 (restarting/updating archipelago SIGKILLs every container in its cgroup); systemd becomes the per-app supervisor.
  2. Level-triggered reconciler — a 30s idempotent reconcile loop drives desired→current from manifests + secrets. Self-healing, not edge-triggered.
  3. Lifecycle bulletproof — every app passes the full matrix (install / UI reachable / stop / start / restart / reinstall / reboot-survive / archipelago-restart-survive / uninstall) 5× green on .228, run ON the node (ARCHY_ITERATIONS=5), before any release. (Multinode / fleet → docs/multinode-testing-plan.md, separate.)
  4. Data-driven apps — install/uninstall needs only the app's manifest + catalog entry. No host OS changes (no apt, no /etc, no host units) and no archipelago binary code per app. Only core apps (bitcoin, lnd, electrumx, fedimint + gateway/clientd) may carry bespoke handling if truly unavoidable.
  5. Rootless + security-first (non-negotiable) — containers run in the unprivileged archipelago user namespace; never root, no --privileged, drop-all-caps + add-back only what a manifest declares. Secrets are 0600, owned by the service user. Security is king.
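Pillar 2 is the easiest of the five to misread, so here is a minimal sketch of what "level-triggered" means in practice. It is not the orchestrator's code: the `app=state` list format and the function names `state_of` and `reconcile_once` are hypothetical, and the real loop works from manifests and secrets rather than plain strings. The point it illustrates is that each pass compares the full desired state against the full current state, so running it again on a converged node produces no actions.

```sh
# Desired and current state are newline-separated "app=state" lists (assumed format).

# state_of <list> <app>: print the app's state, or "absent" if it is not listed.
state_of() {
  printf '%s\n' "$1" | awk -F= -v app="$2" \
    '$1 == app { print $2; found = 1 } END { if (!found) print "absent" }'
}

# reconcile_once <desired> <current>: print one action per app that has drifted.
reconcile_once() {
  printf '%s\n' "$1" | while IFS='=' read -r app want; do
    [ -n "$app" ] || continue
    have=$(state_of "$2" "$app")
    [ "$want" = "$have" ] && continue      # already converged: nothing to do
    case "$want" in
      running) echo "start $app" ;;
      stopped) echo "stop $app" ;;
    esac
  done
}
```

An edge-triggered design would act only when an event arrives and would miss anything that changed while it was not listening. The pass above has no such memory: feed it the same desired list every 30s and it either repairs drift or does nothing.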

Per-app definition of done: all five pillars hold → lifecycle matrix 5× green on .228 (run ON the node) → catalog/registry updated (app-catalog/catalog.json + releases/app-catalog.json, rebuilt image pushed to the mirror) → tracker cell ticked. Only then move to the next app. (Fleet/multinode verification is a separate pass → docs/multinode-testing-plan.md.)

.228 testing constraint: do NOT touch bitcoin-knots, electrumx, or lnd on .228 — they are synced and healthy; destructive cycles there would cost hours of resync.

Session work log

| Date | App | Change | State |
|---|---|---|---|
| 2026-06-21 | fedimint-gateway / -clientd | Generated-secrets system (Pillar 4+5). New generated_secrets: manifest field (hex16/hex32/bcrypt); materialised generically at the resolve_dynamic_env chokepoint — atomic 0600, rootless-owned, idempotent, and self-healing (recreates a wrongly root:root-owned secret via the service-owned dir, no chown/privilege). Removed per-app ensure_fmcd_password (30 LoC). Fixes gateway never starting (resolving secret_env → missing/unreadable fedimint-gateway-hash). | ◐ code complete, cargo check + 3 unit tests green; not yet deployed/validated on .228 |
| 2026-06-21 | fedimint-gateway | Icon placeholder | ○ investigating: marketplace catalog has title+icon (fedimint.png, shared); BUNDLED_APPS frontend list omits fedimint → installed view falls back to 📦 |
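The "atomic 0600, idempotent, self-healing" behaviour described in the log can be sketched in a few lines of shell. This is an illustration only, not the Rust generator in secrets.rs: the function name `ensure_hex_secret` is hypothetical, and it assumes that `hex16` means 16 random bytes rendered as 32 hex characters, which the log does not state.

```sh
# ensure_hex_secret <dir> <name> <nbytes>
# Create <dir>/<name> with <nbytes> random bytes as hex, unless a usable one exists.
ensure_hex_secret() {
  local dir=$1 name=$2 nbytes=$3 target tmp
  target="$dir/$name"

  # Idempotent: a non-empty secret we can read is left untouched.
  if [ -s "$target" ] && [ -r "$target" ]; then
    return 0
  fi

  # Write to a temp file in the SAME directory so the final rename is atomic.
  tmp=$(umask 077 && mktemp "$dir/.$name.XXXXXX") || return 1
  head -c "$nbytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n' > "$tmp" \
    || { rm -f "$tmp"; return 1; }
  chmod 600 "$tmp"

  # Renaming needs write access to the directory only, so an unreadable
  # secret owned by another user is replaced without chown or extra privilege.
  mv -f "$tmp" "$target"
}
```

Because the rename is the only step that touches the final path, a reader never sees a half-written secret, and a crash before the rename leaves at most a stray dotfile behind.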

⏯ RESUME POINT (2026-06-21, mid-session)

Done (working tree, NOT git-committed):

  • Generated-secrets system — all files below written, cargo check clean, 3 unit tests green.
  • Manifests declare generated_secrets (fmcd-password hex16; fedimint-gateway-hash bcrypt).
  • Tracker refreshed with 5 pillars + this log.

In flight:

  • Local release build RUNNING (cd core && cargo build --release -p archipelago, log /tmp/archy-local-build.log, output core/target/release/archipelago). ⚠️ .228 has NO cargo and NO rsync — build LOCALLY on .116, ship binary + files via tar-over-ssh (tar -cf - … | ssh … 'tar -xf -').

Next steps (in order):

  1. Wait for local build → Finished. scp/tar core/target/release/archipelago → .228.
  2. Ship updated manifests to /opt/archipelago/apps/fedimint-{gateway,clientd}/ (canonical runtime dir; cwd-relative apps doesn't resolve — WorkingDirectory is empty).
  3. Binary swap is SAFE for protected backends: archipelago.service is KillMode=control-group BUT bitcoin-knots/electrumx/lnd conmons live under user.slice/.../libpod-*.scope, NOT the service cgroup. Only fedimint-clientd + immich conmons are in-cgroup (non-protected, reconciled back). systemctl stop archipelago → cp binary → start.
  4. Validate: install fedimint-gateway → assert fedimint-gateway-hash (0600, archipelago-owned) + .pw generated → container starts healthy.
  5. Run tests/lifecycle/run-gate.sh for the gateway (do NOT touch knots/electrumx/lnd).
  6. Frontend fixes (separate from binary): see icon/rename below; rebuild neode-ui, ship dist + catalog.json + assets to /opt/archipelago/web-ui (chown 1000:1000).

Icon / naming (frontend, user-confirmed):

  • Gateway icon = reuse fedimint.png (user choice). Static catalogs already map all 3 → fedimint.png; deployed /catalog.json on .228 also correct; /api/app-catalog (decoupled, dict form) returns no fedimint → frontend falls through to /catalog.json. Placeholder is therefore a stale deployed bundle and/or the hardcoded fallback gap: getCuratedAppList() in neode-ui/src/views/discover/curatedApps.ts omits fedimint-gateway + fedimint-clientd entirely — add both (icon fedimint.png).
  • Base fedimint → display "Fedimint Guardian" (user ask). Edit name/title in: apps/fedimint/manifest.yml, app-catalog/catalog.json, neode-ui/public/catalog.json, web/dist/neode-ui/catalog.json, curatedApps.ts:101. (INSTALLED_ALIASES.fedimint = ['fedimint-gateway'] in curatedApps.ts.)

.228 access: sshpass -p archipelago ssh archipelago@192.168.1.228; UI/RPC pw password123 (https). Binary /usr/local/bin/archipelago (v1.7.99-alpha).

Generated-secrets — files touched

  • core/container/src/manifest.rs — GeneratedSecret + SecretGenKind types, ContainerConfig.generated_secrets, validation (bare-filename, unique target files).
  • core/container/src/lib.rs — re-export the new types.
  • core/archipelago/src/container/secrets.rs — new generator module (atomic write, idempotent, self-heal) + 3 unit tests.
  • core/archipelago/src/container/mod.rs — register module.
  • core/archipelago/src/container/prod_orchestrator.rs — call ensure_generated_secrets in resolve_dynamic_env; drop fmcd special-case.
  • core/archipelago/src/wallet/fedimint_client.rs — delete orphaned ensure_fmcd_password (reader keeps FMCD_PASSWORD_SECRET).
  • apps/fedimint-clientd/manifest.yml, apps/fedimint-gateway/manifest.yml — declare generated_secrets.

Test layers

| Layer | What it asserts | Toolchain | Latency / iteration |
|---|---|---|---|
| L0 — Rust unit | Pure-function behaviour (manifest parsing, secret resolution, structural invariants) | cargo test --workspace --bins | ~5s |
| L1 — RPC API | The JSON-RPC API responds correctly per app (container-list, package.{install,start,stop,restart,uninstall}, bitcoin.getinfo, etc.) | bats + lib/rpc.bash | ~30s per suite |
| L2 — UI surface | The URLs a user actually clicks (dashboard, /app/<id>/, direct-port iframes) return 200 with non-empty bodies | bats + lib/ui-probes.bash | ~10s per suite |
| L3 — Lifecycle survival | Containers survive operational events (archipelago restart, host reboot, kill -9 mid-install, OOM) | bats (gated) | ~60s per scenario |
| L4 — Browser journey | Real DOM-level user flow (login → install → wait → click → use) | playwright (TBD) | ~30-120s per journey |
| L5 — Chaos / failure-path | Failure modes recover gracefully (corrupt config, deleted bolt DB, network partition) | bats (chaos-gated) | ~120s per scenario |
| L6 — Performance | Cold install latency, reconcile-tick cost, podman call count per lifecycle event | timed bats + Prometheus (TBD) | ~60s per benchmark |

Release gate: L0+L1+L2+L3 green × 20 iterations on .228 (run ON the node; 5× for now). Multinode/fleet → docs/multinode-testing-plan.md. L4+L5+L6 are quality gates we add as they mature; not blocking the v1.7.52 tag.
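The "green × N iterations" rule amounts to: run the same suite N times and stop at the first failure. The sketch below shows that shape under stated assumptions. `run_iterations` is a hypothetical helper, not a description of how run-gate.sh is implemented internally.

```sh
# run_iterations <n> <command...>
# Run the command n times in a row; fail fast on the first non-zero exit.
run_iterations() {
  local n=$1 i=1
  shift
  while [ "$i" -le "$n" ]; do
    echo "iteration $i/$n"
    "$@" || { echo "FAILED on iteration $i/$n" >&2; return 1; }
    i=$((i + 1))
  done
  echo "all $n iterations green"
}
```

Failing fast matters for a gate like this: a single red iteration already disqualifies the run, and continuing would only pile further destructive cycles onto a node that is in an unknown state.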

Coverage matrix — current state

Legend: ● fully covered, ◐ partial, ○ missing

Per-app × per-state matrix (L1 + L2)

App Container present Valid state RPC reachable UI URL 200 Stop Start Restart Reinstall Reboot survives Archipelago-restart survives
bitcoin-knots ● (port 8334) ◐ regression-gate only
bitcoin-core ◐ shares with knots ◐ regression-gate
lnd ● (lncli) ● (/app/lnd/) ◐ regression-gate
electrumx ● (TCP 50001) ● (/app/electrumx/) ◐ regression-gate
btcpay-server ◐ frontend-port ● (/app/btcpay/)
mempool ● (/api/v1/backend-info) ● (/app/mempool/)
fedimint ◐ container-only ● (/app/fedimint/)
filebrowser ● probe-only ◐ via companions
archy-bitcoin-ui ◐ via companions n/a ● (port 8334) n/a ◐ via companions
archy-lnd-ui ◐ via companions n/a ● (/app/lnd/) n/a ◐ via companions
archy-electrs-ui ◐ via companions n/a ● (/app/electrumx/) n/a ◐ via companions

Done: 50 of 110 cells. Goal: 110/110 ● for the listed apps before v1.7.52 tags.

Layer-by-layer status

| Layer | Tests | Suites | Status |
|---|---|---|---|
| L0 unit | 631 | n/a | ● green |
| L1 RPC | 70 | bitcoin-knots, lnd, electrumx, btcpay, mempool, fedimint, required-stack, package-update-smoke | ● for the 6 core apps |
| L2 UI | 9 | ui-coverage | ● for dashboard + 7 proxy paths + bitcoin-ui:8334 |
| L3 lifecycle survival | 14 | companion-survives-archipelago-restart, backend-survives-archipelago-restart, required-stack-destructive, use-quadlet-backends-install | ◐ companions ● ; backends ◐ regression-gate (will fail until Phase 3 Quadlet ships); quadlet post-condition gate skip-clean today, hard gate when flag flipped |
| L1 wallet-receive / drift / secrets | 5 | bitcoin-receive, port-drift, secret-completeness | ● guards the v1.7.9x wallet fleet failures |
| L4 browser journey | 0 | none | ○ not started |
| L5 chaos | 0 | none | ○ not started |
| L6 performance | 0 | none | ○ not started |

Wallet / Bitcoin fleet-failure regression suites (added after v1.7.90-alpha)

Three production failures shipped on v1.7.90-alpha despite the existing harness, because nothing exercised the receive path, port-mapping drift, or secret completeness on a live node. New suites close those gaps (all run on the archy host, read-only, so they join run.sh/run-gate.sh automatically):

| Suite | Failure it guards | Asserts |
|---|---|---|
| bitcoin-receive.bats | .116 ("Operation failed" on receive) and .228 (false "wallet is locked") | LND REST reachable on the manifest host port; lnd.newaddress returns a bc1… address on a running node; receive errors are specific, never the generic catch-all |
| port-drift.bats | .116 (lnd REST stuck on host 8080 vs manifest 18080) | every installed backend's live podman inspect PortBindings match its manifest ports: (the external mirror of the orchestrator's host_port_bindings_drifted) |
| secret-completeness.bats | .198 (bitcoin-knots needs bitcoin-rpc-txrelay-rpcauth, never generated → stack cascade) | every secret_file referenced by an installed backend manifest exists in the secrets dir |

Backed by L0 unit tests (cargo test … drift missing_secret lnd) and a vitest for the frontend reason-code mapping (bitcoinReceive.test.ts). The release gate scripts/create-release.sh now runs tests/release/run.sh (which includes these) and aborts the release on failure — previously it ran no tests at all.
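The port-drift check reduces to an order-insensitive comparison of two sets of port mappings. The sketch below shows only that comparison, assuming both sides have already been flattened to `host:container` pairs. The function names are hypothetical, the container port `8080` in the usage note is an assumed value for illustration, and extracting the live bindings from podman inspect is left out.

```sh
# normalize_ports <mappings>: one "host:container" pair per line, sorted, deduplicated.
# Accepts space- or comma-separated input.
normalize_ports() {
  printf '%s\n' "$@" | tr ' ,' '\n\n' | sed '/^$/d' | sort -u
}

# ports_drifted <manifest_ports> <live_ports>
# Exit 0 when the two sets differ (drift), 1 when they match.
ports_drifted() {
  [ "$(normalize_ports "$1")" != "$(normalize_ports "$2")" ]
}
```

With this shape, the .116 failure would show up as `ports_drifted "18080:8080" "8080:8080"` returning 0, while a reordered but otherwise identical list does not count as drift.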

Run commands

# L0 unit:
cd core && cargo test --workspace --bins

# Single bats suite:
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh bitcoin-knots

# Full bats suite (read-only):
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh

# Full + destructive (for the verification fleet):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 tests/lifecycle/run.sh

# 5× release-gate run:
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 \
  tests/lifecycle/run-gate.sh

# CASCADE tier (uninstall → no-ghost → reinstall) — opt-in, NOT in the canonical
# gate. Installs/uninstalls a THROWAWAY app (default grafana; skips if already
# installed). Run on-node to also assert data-dir removal:
ARCHY_PASSWORD=password123 ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1 \
  tests/lifecycle/run.sh cascade-uninstall

CASCADE tier — uninstall/reinstall regression guard (Workstream F)

The 5× gate is DESTRUCTIVE-only (stop/start/restart/survive); it never exercised uninstall/reinstall, where the worst lifecycle bugs lived. cascade-uninstall.bats closes that gap and encodes the fixes for two field bugs:

| Suite | Failure it guards | Asserts |
|---|---|---|
| cascade-uninstall.bats | #13 uninstall ghost (immich/grafana stayed in My Apps after uninstall) and #14 reinstall stops (stalled on stale state/data) | fresh install reaches running via a truthful (non-silent) progression; uninstall makes the entry disappear from server.get-state package-data (no ghost, no stuck uninstall stage) + removes the container + (on-node) the data dir; reinstall returns to running; node left as found |

Throwaway-app + precondition-skip (won't touch an app that's already installed), so it's safe on a populated node. Override the app via ARCHY_CASCADE_APP / ARCHY_CASCADE_IMAGE / ARCHY_CASCADE_CONFIG / ARCHY_CASCADE_DATA_DIR. Gated on ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1. Verified 7/7 on .228 (2026-06-24).
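The "no ghost" assertion is a bounded poll: after uninstall, the app id must vanish from the installed list within a deadline. A minimal sketch follows, assuming some command that prints one installed app id per line. `wait_until_absent` is a hypothetical name, and how the suite actually queries server.get-state is not shown here.

```sh
# wait_until_absent <app_id> <timeout_secs> <list_command...>
# Poll the list command once per second; succeed as soon as app_id is no longer listed.
wait_until_absent() {
  local app=$1 timeout=$2 waited=0
  shift 2
  # -x matches the whole line and -F treats the id literally (no regex surprises).
  while "$@" | grep -qxF -- "$app"; do
    [ "$waited" -ge "$timeout" ] && return 1   # still listed at the deadline: a ghost
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Matching the whole line is deliberate: a substring match would let `fedimint` be satisfied by `fedimint-gateway` and hide exactly the kind of ghost entry the suite exists to catch.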

All-apps lifecycle matrix (Workstream F)

The per-app suites cover ~8 core apps in depth; all-apps-matrix.bats covers every installed app in breadth, automatically — it derives the app set from server.get-state package-data (no hardcoded list) and grows coverage as nodes install more apps. Read-only, so it joins run.sh/run-gate.sh on every node.

| Suite | Guards (fleet-wide) | Asserts (per installed app) |
|---|---|---|
| all-apps-matrix.bats | apps STUCK transitional (the #13/#14 ghost generalized), error/failed apps, unreachable UI apps (port-drift generalized) | settles to a non-transitional state within a window; not error/failed; recognized (non-garbage) state; every running UI app (manifest ui=="true") exposes a non-null lan-address |

Tunables: ARCHY_MATRIX_SETTLE_SECS (45), ARCHY_MATRIX_UI_SECS (30), ARCHY_MATRIX_ALLOW_STOPPED (ids allowed non-running). Verified 5/5 on .228 (17 apps) and .116 (20 apps incl. grafana/nextcloud/photoprism/gitea), 2026-06-24.
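The first three per-app assertions can be read as one state classification plus a bounded wait. The sketch below is an illustration under assumptions: the state names are taken from this document (installing, removing, error, failed, running, stopped), the real vocabulary is likely larger, and `classify_state` and `settles_within` are hypothetical names rather than helpers from the suite.

```sh
# classify_state <state>: map a reported state onto the classes the matrix cares about.
classify_state() {
  case "$1" in
    installing|removing) echo transitional ;;
    error|failed)        echo failed ;;
    running|stopped)     echo settled ;;
    *)                   echo unrecognized ;;   # garbage or unknown state
  esac
}

# settles_within <secs> <state_command...>
# Poll once per second until the state is no longer transitional; print the final state.
settles_within() {
  local window=$1 waited=0 state
  shift
  while :; do
    state=$("$@")
    [ "$(classify_state "$state")" != transitional ] && { echo "$state"; return 0; }
    [ "$waited" -ge "$window" ] && return 1    # stuck: never left the transitional state
    sleep 1
    waited=$((waited + 1))
  done
}
```

Note that settling is not the same as passing: an app that settles into `failed` or an unrecognized state leaves the wait loop immediately and is then rejected by the separate classification checks.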

To exercise the Phase 3.2 Quadlet-backend path on a target node without editing config.json (which would require an archipelago restart and trigger FM3 until 3.5 ships), set the env var on archipelago.service:

sudo systemctl edit archipelago     # add: [Service]\nEnvironment=ARCHIPELAGO_USE_QUADLET_BACKENDS=1
sudo systemctl restart archipelago  # one cgroup-cascade hit; survivable on a debug node

After the restart, package.install for any orchestrator-managed backend will route through install_via_quadlet, and the use-quadlet-backends-install.bats suite turns from skip → hard gate.

LoC budget

Goal: minimum-viable container subsystem.

| Module | LoC today | Target | Δ | Status |
|---|---|---|---|---|
| core/container/src/dependency_resolver.rs | | | -270 | ● deleted |
| core/container/src/health_monitor.rs | 196 | 0 | -196 | ◐ pending health migration into reconciler (Phase 3.5) |
| core/container/src/podman_client.rs::create/start/stop | ~400 | ~150 | -250 | ◐ pending Quadlet migration (Phase 3.5) |
| core/archipelago/src/container/dev_orchestrator.rs | 410 | 0 | -410 | ○ pending dev_mode strategy decision |
| core/archipelago/src/container/data_manager.rs | 96 | 0 | -96 | ○ couples with dev_orchestrator |
| core/container/src/bitcoin_simulator.rs | 219 | 0 | -219 | ○ couples with dev_orchestrator |
| core/container/src/port_manager.rs | 175 | 0 | -175 | ○ couples with dev_orchestrator |
| core/archipelago/src/api/rpc/package/install.rs::install_bitcoincoin_rpc_repair | ~150 | 0 | -150 | ◐ pending fold into orchestrator pre-start |
| imperative install_fresh in prod_orchestrator | ~120 | 0 | -120 | ◐ Phase 3.2 wired behind use_quadlet_backends flag (default off); 3.3 in-place migration ; 3.4 health-gated startup (Notify=healthy) + TimeoutStartSec=600 race fix ; 3.4a unit drift-sync each reconcile ; flip default after 5× green |

Today: -270 LoC committed. Outstanding deletes possible: ~1,616 LoC (if Phase 3 ships fully + dev_mode resolved).

Net target for v1.7.52: container subsystem ≈ half of today's LoC.
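The ~1,616 figure is the sum of the Δ column for every row that is not yet deleted. A short check makes it easy to keep the headline number in step with the table when a row changes. The values are copied from the table above, and rows marked "~" are approximate, so the total is too.

```sh
# Pending deletions from the LoC budget table, in table order
# (health_monitor, podman_client, dev_orchestrator, data_manager,
#  bitcoin_simulator, port_manager, rpc_repair, install_fresh).
pending="196 250 410 96 219 175 150 120"

total=0
for n in $pending; do
  total=$((total + n))
done

echo "outstanding: $total LoC"   # prints "outstanding: 1616 LoC"
```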

Performance KPIs (TBD — measure first, then target)

We don't have a performance harness yet. Add as L6 lands:

| KPI | Today | Target | Notes |
|---|---|---|---|
| cold install: bitcoin-knots manifest → running healthcheck | unknown | < 30s once image is local | excludes the ~1GB image pull |
| cold install: lnd | unknown | < 60s once image is local | wallet unlock dominates |
| reconcile-tick wall time (no-op pass over all installed apps) | unknown | < 250ms | the current orchestrator does many podman inspect calls |
| podman shell-outs per package.install (orchestrator path) | 7-10 | 1-2 (Quadlet) | post-Phase-3 |
| daemon startup (boot → port 5678 listening) | unknown | < 5s | reconcile is async after this |

Release gates

v1.7.52 ships only when ALL of:

  1. ☐ Bitcoin-stops fix verified live on a fresh node (tests/lifecycle/bats/bitcoin-knots.bats fully ● after a cold install)
  2. ARCHY_ITERATIONS=5 tests/lifecycle/run-gate.sh returns 0 when run ON .228 (5× for now; full suite, ARCHY_ALLOW_DESTRUCTIVE=1). Status: 1× is GREEN (110/110), 5× in progress
  3. ☐ Multinode/fleet (.198 + others) — tracked separately in docs/multinode-testing-plan.md, NOT a v1.7.52 single-node gate item
  4. ☐ The L3 backend-survives-archipelago-restart suite passes (= Phase 3 Quadlet shipped for backends)
  5. ☐ Cargo: 0 warnings, 0 unused, all tests green (sustained ✓ since 1c0df95f)
  6. ☐ LoC: at least one of {Phase 3 Quadlet, dev_mode resolution} merged
  7. ☐ Layman-readable changelog (per feedback_changelog_layman.md)
  8. ☐ Tag pushed to origin + gitea-local + gitea-vps2 (per feedback_ship_ritual.md)

How to update this document

When you land a change that materially moves any cell of the matrix or any LoC row, update this file in the same commit. Reviewers checking the PR can read the diff to TESTING.md as the answer to "what did this commit improve?". Without the update, the change is half-shipped.