# Container subsystem testing — scorecard and roadmap

The bar (verbatim from the v1.7.52 owner):

> "best performant, minimal code, tested containers possible in the world.
> No bloated code, no problems installing a single one, no problems
> uninstalling, every one needs to be tested 20+ times in every state
> before we make another update, not a single container failure outside
> of hardware or internet failure is allowed."

This document is the live tracker for whether we're meeting that bar.
Every PR that touches the container subsystem updates the scoreboard
below. **If you can't honestly tick the box, the change isn't ready.**

---

## Production-quality pass — 2026-06-21 (current, v1.7.99-alpha)

The migration's aim, restated as **five pillars** (every app must satisfy all five):

1. **Quadlet-everywhere** — every container is a declarative systemd Quadlet
   unit under `user.slice`, never inside `archipelago.service`'s cgroup. Kills
   FM3 (restarting/updating archipelago SIGKILLs every container in its cgroup);
   systemd becomes the per-app supervisor.
2. **Level-triggered reconciler** — a 30s idempotent reconcile loop drives
   desired→current from manifests + secrets. Self-healing, not edge-triggered.
3. **Lifecycle bulletproof** — every app passes the full matrix
   (install / UI reachable / stop / start / restart / reinstall / reboot-survive
   / archipelago-restart-survive / uninstall) **5× green on .228 AND .198 for now**
   (`ARCHY_ITERATIONS=5`; temporarily reduced from 20×, restore before final ship)
   before any release.
4. **Data-driven apps** — install/uninstall needs only the app's manifest +
   catalog entry. **No host OS changes** (no apt, no /etc, no host units) and
   **no archipelago binary code per app**. Only *core* apps (bitcoin, lnd,
   electrumx, fedimint + gateway/clientd) may carry bespoke handling if truly
   unavoidable.
5. **Rootless + security-first (non-negotiable)** — containers run in the
   unprivileged `archipelago` user namespace; never root, no `--privileged`,
   drop-all-caps + add-back only what a manifest declares. Secrets are `0600`,
   owned by the service user. Security is king.

**Per-app definition of done:** all five pillars hold → lifecycle matrix 5×
(for now; was 20×) green on .228 then .198 → catalog/registry updated
(`app-catalog/catalog.json` + `releases/app-catalog.json`, rebuilt image pushed
to the mirror) → tracker cell ticked. Only then move to the next app.

**.228 testing constraint:** do NOT touch `bitcoin-knots`, `electrumx`, or
`lnd` on .228 — they are synced and healthy; destructive cycles there would
cost hours of resync.

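Pillar 2's level-triggered behaviour can be illustrated with a small shell sketch. This is not the orchestrator's code: `plan_reconcile` and its space-separated app lists are hypothetical names chosen for illustration. Each pass compares the whole desired set with the whole current set and emits only the actions still missing, so a repeated pass over a converged node produces nothing.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of one level-triggered reconcile pass (not archipelago code).
# Usage: plan_reconcile "<desired apps>" "<currently running apps>"
# Prints one action per line; prints nothing once the node has converged.
plan_reconcile() {
  local desired="$1" current="$2" app
  # Anything desired but not running must be started.
  for app in $desired; do
    case " $current " in
      *" $app "*) ;;                 # already running: nothing to do
      *) echo "start $app" ;;
    esac
  done
  # Anything running but no longer desired must be stopped.
  for app in $current; do
    case " $desired " in
      *" $app "*) ;;                 # still wanted: leave it alone
      *) echo "stop $app" ;;
    esac
  done
}
```

Because the plan is derived from state rather than from events, a missed or failed action only delays convergence until the next tick (every 30s, per Pillar 2).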
### Session work log

| Date | App | Change | State |
|---|---|---|---|
| 2026-06-21 | fedimint-gateway / -clientd | **Generated-secrets system** (Pillar 4+5). New `generated_secrets:` manifest field (`hex16`/`hex32`/`bcrypt`); materialised generically at the `resolve_dynamic_env` chokepoint — atomic `0600`, rootless-owned, idempotent, and **self-healing** (recreates a wrongly `root:root`-owned secret via the service-owned dir, no chown/privilege). Removed per-app `ensure_fmcd_password` (−30 LoC). Fixes gateway never starting (`resolving secret_env` → missing/unreadable `fedimint-gateway-hash`). | ◐ code complete, `cargo check` + 3 unit tests green; **not yet deployed/validated on .228** |
| 2026-06-21 | fedimint-gateway | Icon placeholder | ○ investigating: marketplace catalog has title+icon (fedimint.png, shared); `BUNDLED_APPS` frontend list omits fedimint → installed view falls back to 📦 |

### ⏯ RESUME POINT (2026-06-21, mid-session)

**Done (working tree, NOT git-committed):**
- Generated-secrets system — all files below written, `cargo check` clean, 3 unit tests green.
- Manifests declare `generated_secrets` (fmcd-password hex16; fedimint-gateway-hash bcrypt).
- Tracker refreshed with 5 pillars + this log.

**In flight:**
- Local release build RUNNING (`cd core && cargo build --release -p archipelago`,
  log `/tmp/archy-local-build.log`, output `core/target/release/archipelago`).
  ⚠️ **.228 has NO cargo and NO rsync** — build LOCALLY on .116, ship binary + files
  via **tar-over-ssh** (`tar -cf - … | ssh … 'tar -xf -'`).

**Next steps (in order):**
1. Wait for local build → `Finished`. scp/tar `core/target/release/archipelago` → .228.
2. Ship updated manifests to **`/opt/archipelago/apps/fedimint-{gateway,clientd}/`** (canonical runtime dir; cwd-relative `apps` doesn't resolve — WorkingDirectory is empty).
3. **Binary swap is SAFE for protected backends:** `archipelago.service` is
   `KillMode=control-group` BUT bitcoin-knots/electrumx/lnd conmons live under
   `user.slice/.../libpod-*.scope`, NOT the service cgroup. Only fedimint-clientd +
   immich conmons are in-cgroup (non-protected, reconciled back).
   `systemctl stop archipelago` → `cp` binary → `start`.
4. Validate: install fedimint-gateway → assert `fedimint-gateway-hash` (0600,
   archipelago-owned) + `.pw` generated → container starts healthy.
5. Run `tests/lifecycle/run-20x.sh` for the gateway (do NOT touch knots/electrumx/lnd).
6. Frontend fixes (separate from binary): see icon/rename below; rebuild neode-ui,
   ship `dist + catalog.json + assets` to `/opt/archipelago/web-ui` (chown 1000:1000).

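The assertion in step 4 (secret present, mode `0600`, owned by the service user) could be scripted roughly as follows. This is a sketch under assumptions: `secret_ok` is a hypothetical helper, it compares numeric UIDs so it also works where the user name cannot be resolved, and it relies on GNU coreutils `stat -c`.

```bash
#!/usr/bin/env bash
# Hypothetical helper for the step-4 check (not part of the test harness).
# secret_ok FILE UID
# Succeeds only if FILE is a regular file with mode 600 owned by numeric UID.
# Assumes GNU coreutils `stat` (the -c format option).
secret_ok() {
  local file="$1" uid="$2" actual
  [ -f "$file" ] || return 1
  actual="$(stat -c '%a %u' "$file")" || return 1
  [ "$actual" = "600 $uid" ]
}
```

On the node this might be called as `secret_ok "$SECRETS_DIR/fedimint-gateway-hash" "$(id -u archipelago)"`, where `SECRETS_DIR` is a placeholder for wherever the secrets directory lives on that host.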
**Icon / naming (frontend, user-confirmed):**
- Gateway icon = **reuse fedimint.png** (user choice). Static catalogs already map all 3
  → fedimint.png; deployed `/catalog.json` on .228 also correct; `/api/app-catalog`
  (decoupled, dict form) returns no fedimint → frontend falls through to `/catalog.json`.
  Placeholder is therefore a **stale deployed bundle** and/or the **hardcoded fallback gap**:
  `getCuratedAppList()` in `neode-ui/src/views/discover/curatedApps.ts` omits
  fedimint-gateway + fedimint-clientd entirely — add both (icon fedimint.png).
- Base **`fedimint` → display "Fedimint Guardian"** (user ask). Edit name/title in:
  `apps/fedimint/manifest.yml`, `app-catalog/catalog.json`,
  `neode-ui/public/catalog.json`, `web/dist/neode-ui/catalog.json`,
  `curatedApps.ts:101`. (`INSTALLED_ALIASES.fedimint = ['fedimint-gateway']` in curatedApps.ts.)

**.228 access:** `sshpass -p archipelago ssh archipelago@192.168.1.228`; UI/RPC pw
`password123` (https). Binary `/usr/local/bin/archipelago` (v1.7.99-alpha).

### Generated-secrets — files touched

- `core/container/src/manifest.rs` — `GeneratedSecret` + `SecretGenKind` types, `ContainerConfig.generated_secrets`, validation (bare-filename, unique target files).
- `core/container/src/lib.rs` — re-export the new types.
- `core/archipelago/src/container/secrets.rs` — **new** generator module (atomic write, idempotent, self-heal) + 3 unit tests.
- `core/archipelago/src/container/mod.rs` — register module.
- `core/archipelago/src/container/prod_orchestrator.rs` — call `ensure_generated_secrets` in `resolve_dynamic_env`; drop fmcd special-case.
- `core/archipelago/src/wallet/fedimint_client.rs` — delete orphaned `ensure_fmcd_password` (reader keeps `FMCD_PASSWORD_SECRET`).
- `apps/fedimint-clientd/manifest.yml`, `apps/fedimint-gateway/manifest.yml` — declare `generated_secrets`.

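The "atomic write, idempotent" properties of the generator can be illustrated outside Rust with a short shell sketch. It is only an analogy to what `secrets.rs` is described as doing: `ensure_hex_secret` is a hypothetical name, the self-heal path and the `bcrypt` kind are not covered, and since it is not stated here whether `hex16` means 16 bytes or 16 hex characters, the byte count is left as a plain parameter.

```bash
#!/usr/bin/env bash
# Hypothetical sketch (the real generator is the Rust module in secrets.rs).
# ensure_hex_secret DIR NAME NBYTES
#   idempotent: an existing non-empty secret is left untouched
#   atomic:     written to a temp file inside DIR, then renamed into place
#   private:    temp file is created under umask 077 and forced to mode 600
ensure_hex_secret() {
  local dir="$1" name="$2" nbytes="$3" tmp
  [ -s "$dir/$name" ] && return 0
  tmp="$(umask 077 && mktemp "$dir/.$name.XXXXXX")" || return 1
  # Read NBYTES random bytes and hex-encode them without separators.
  if ! od -An -N"$nbytes" -tx1 /dev/urandom | tr -d ' \n' > "$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  # rename(2) within one directory is atomic, so readers never see a partial file.
  chmod 600 "$tmp" && mv -f "$tmp" "$dir/$name"
}
```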
---

## Test layers

| Layer | What it asserts | Toolchain | Latency / iteration |
|---|---|---|---|
| L0 — Rust unit | Pure-function behaviour (manifest parsing, secret resolution, structural invariants) | `cargo test --workspace --bins` | ~5s |
| L1 — RPC API | The JSON-RPC API responds correctly per app (`container-list`, `package.{install,start,stop,restart,uninstall}`, `bitcoin.getinfo`, etc.) | bats + lib/rpc.bash | ~30s per suite |
| L2 — UI surface | The URLs a user actually clicks (dashboard, `/app/<id>/`, direct-port iframes) return 200 with non-empty bodies | bats + lib/ui-probes.bash | ~10s per suite |
| L3 — Lifecycle survival | Containers survive operational events (archipelago restart, host reboot, kill -9 mid-install, OOM) | bats (gated) | ~60s per scenario |
| L4 — Browser journey | Real DOM-level user flow (login → install → wait → click → use) | playwright (TBD) | ~30-120s per journey |
| L5 — Chaos / failure-path | Failure modes recover gracefully (corrupt config, deleted bolt DB, network partition) | bats (chaos-gated) | ~120s per scenario |
| L6 — Performance | Cold install latency, reconcile-tick cost, podman call count per lifecycle event | timed bats + Prometheus (TBD) | ~60s per benchmark |

Release gate: **L0+L1+L2+L3 green × 20 iterations** on .228 AND .198
(temporarily 5×, `ARCHY_ITERATIONS=5`; restore to 20× before the final ship).
L4+L5+L6 are quality gates we add as they mature; they do not block the v1.7.52 tag.

## Coverage matrix — current state

Legend: ● fully covered, ◐ partial, ○ missing

### Per-app × per-state matrix (L1 + L2)

| App | Container present | Valid state | RPC reachable | UI URL 200 | Stop | Start | Restart | Reinstall | Reboot survives | Archipelago-restart survives |
|---|---|---|---|---|---|---|---|---|---|---|
| bitcoin-knots | ● | ● | ● | ● (port 8334) | ● | ● | ● | ● | ○ | ◐ regression-gate only |
| bitcoin-core | ◐ shares with knots | ◐ | ○ | ◐ | ○ | ○ | ○ | ○ | ○ | ◐ regression-gate |
| lnd | ● | ● | ● (lncli) | ● (`/app/lnd/`) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| electrumx | ● | ● | ● (TCP 50001) | ● (`/app/electrumx/`) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| btcpay-server | ● | ● | ◐ frontend-port | ● (`/app/btcpay/`) | ● | ● | ● | ● | ○ | ○ |
| mempool | ● | ● | ● (`/api/v1/backend-info`) | ● (`/app/mempool/`) | ● | ● | ● | ● | ○ | ○ |
| fedimint | ● | ● | ◐ container-only | ● (`/app/fedimint/`) | ● | ● | ● | ● | ○ | ○ |
| filebrowser | ○ | ○ | ○ | ● probe-only | ○ | ○ | ○ | ○ | ○ | ◐ via companions |
| archy-bitcoin-ui | ◐ via companions | ◐ | n/a | ● (port 8334) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-lnd-ui | ◐ via companions | ◐ | n/a | ● (`/app/lnd/`) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-electrs-ui | ◐ via companions | ◐ | n/a | ● (`/app/electrumx/`) | ○ | ○ | ○ | n/a | ◐ via companions | ● |

Done: 50 of 110 cells. Goal: 110/110 ● for the listed apps before
v1.7.52 tags.

### Layer-by-layer status

| Layer | Tests | Suites | Status |
|---|---:|---:|---|
| L0 unit | 631 | n/a | ● green |
| L1 RPC | 70 | bitcoin-knots, lnd, electrumx, btcpay, mempool, fedimint, required-stack, package-update-smoke | ● for the 6 core apps |
| L2 UI | 9 | ui-coverage | ● for dashboard + 7 proxy paths + bitcoin-ui:8334 |
| L3 lifecycle survival | 14 | companion-survives-archipelago-restart, backend-survives-archipelago-restart, required-stack-destructive, use-quadlet-backends-install | ◐ companions ● ; backends ◐ regression-gate (will fail until Phase 3 Quadlet ships); quadlet post-condition gate ✅ skip-clean today, hard gate when flag flipped |
| L1 wallet-receive / drift / secrets | 5 | bitcoin-receive, port-drift, secret-completeness | ● guards the v1.7.9x wallet fleet failures |
| L4 browser journey | 0 | none | ○ not started |
| L5 chaos | 0 | none | ○ not started |
| L6 performance | 0 | none | ○ not started |

### Wallet / Bitcoin fleet-failure regression suites (added after v1.7.90-alpha)

Three production failures shipped on v1.7.90-alpha despite the existing harness,
because nothing exercised the receive path, port-mapping drift, or secret
completeness on a live node. New suites close those gaps (all run on the archy
host, read-only, so they join `run.sh`/`run-20x.sh` automatically):

| Suite | Failure it guards | Asserts |
|---|---|---|
| `bitcoin-receive.bats` | .116 ("Operation failed" on receive) and .228 (false "wallet is locked") | LND REST reachable on the **manifest** host port; `lnd.newaddress` returns a `bc1…` address on a running node; receive errors are specific, never the generic catch-all |
| `port-drift.bats` | .116 (lnd REST stuck on host 8080 vs manifest 18080) | every installed backend's live `podman inspect` PortBindings match its manifest `ports:` (the external mirror of the orchestrator's `host_port_bindings_drifted`) |
| `secret-completeness.bats` | .198 (bitcoin-knots needs `bitcoin-rpc-txrelay-rpcauth`, never generated → stack cascade) | every `secret_file` referenced by an installed backend manifest exists in the secrets dir |

Backed by L0 unit tests (`cargo test … drift missing_secret lnd`) and a vitest
for the frontend reason-code mapping (`bitcoinReceive.test.ts`). The release
gate `scripts/create-release.sh` now runs `tests/release/run.sh` (which includes
these) and **aborts the release on failure** — previously it ran no tests at all.

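The idea behind the secret-completeness check in the table above is simple enough to sketch in a few lines of shell. This is not the content of `secret-completeness.bats`: `missing_secrets` is a hypothetical helper, and extracting the `secret_file` names from the manifests is assumed to have happened already.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the secret-completeness idea (not the real .bats suite).
# missing_secrets SECRETS_DIR NAME...
# Prints every NAME that has no regular file in SECRETS_DIR and
# returns 1 if at least one is missing, 0 otherwise.
missing_secrets() {
  local dir="$1" name rc=0
  shift
  for name in "$@"; do
    if [ ! -f "$dir/$name" ]; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}
```

With the .198 failure as an example, `missing_secrets "$SECRETS_DIR" bitcoin-rpc-txrelay-rpcauth` would print the name and return non-zero on a node where that secret was never generated (`SECRETS_DIR` is a placeholder for the node's secrets directory).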
## Run commands

```bash
# L0 unit:
cd core && cargo test --workspace --bins

# Single bats suite:
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh bitcoin-knots

# Full bats suite (read-only):
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh

# Full + destructive (for the verification fleet):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 tests/lifecycle/run.sh

# 5× release-gate run (for now; was 20× — restore before final ship):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 \
  tests/lifecycle/run-20x.sh
```

To exercise the Phase 3.2 Quadlet-backend path on a target node without
editing config.json (which would require an archipelago restart and
trigger FM3 until 3.5 ships), set the env var on `archipelago.service`:

```bash
sudo systemctl edit archipelago # add: [Service]\nEnvironment=ARCHIPELAGO_USE_QUADLET_BACKENDS=1
sudo systemctl restart archipelago # one cgroup-cascade hit; survivable on a debug node
```

After the restart, `package.install` for any orchestrator-managed backend
will route through `install_via_quadlet`, and the
`use-quadlet-backends-install.bats` suite turns from skip → hard gate.

## LoC budget

Goal: minimum-viable container subsystem.

| Module | LoC today | Target | Δ | Status |
|---|---:|---:|---:|---|
| `core/container/src/dependency_resolver.rs` | — | — | -270 | ● deleted |
| `core/container/src/health_monitor.rs` | 196 | 0 | -196 | ◐ pending health migration into reconciler (Phase 3.5) |
| `core/container/src/podman_client.rs::create/start/stop` | ~400 | ~150 | -250 | ◐ pending Quadlet migration (Phase 3.5) |
| `core/archipelago/src/container/dev_orchestrator.rs` | 410 | 0 | -410 | ○ pending dev_mode strategy decision |
| `core/archipelago/src/container/data_manager.rs` | 96 | 0 | -96 | ○ couples with dev_orchestrator |
| `core/container/src/bitcoin_simulator.rs` | 219 | 0 | -219 | ○ couples with dev_orchestrator |
| `core/container/src/port_manager.rs` | 175 | 0 | -175 | ○ couples with dev_orchestrator |
| `core/archipelago/src/api/rpc/package/install.rs::install_bitcoincoin_rpc_repair` | ~150 | 0 | -150 | ◐ pending fold into orchestrator pre-start |
| imperative `install_fresh` in prod_orchestrator | ~120 | 0 | -120 | ◐ Phase 3.2 wired behind `use_quadlet_backends` flag (default off); 3.3 in-place migration ✅; 3.4 health-gated startup (`Notify=healthy`) ✅ + `TimeoutStartSec=600` race fix ✅; 3.4a unit drift-sync each reconcile ✅; flip default after 20× green |

**Today: -270 LoC committed. Outstanding deletes possible: ~1,616 LoC** (if Phase 3 ships fully + dev_mode resolved).

Net target for v1.7.52: container subsystem ≈ **half** of today's LoC.

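The ~1,616 figure is the sum of the Δ column for every pending row in the table above, that is, every row except the already deleted `dependency_resolver.rs`. The check below uses the approximate values exactly as the table states them; `pending_loc` is just an illustrative name.

```bash
# Pending deletes in table order: health_monitor, podman_client create/start/stop,
# dev_orchestrator, data_manager, bitcoin_simulator, port_manager,
# the RPC repair helper in install.rs, install_fresh.
pending_loc() {
  echo $((196 + 250 + 410 + 96 + 219 + 175 + 150 + 120))
}
pending_loc   # prints 1616
```

If a row's Δ changes, the bold total should be recomputed the same way in the same commit.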
## Performance KPIs (TBD — measure first, then target)

We don't have a performance harness yet. Add as L6 lands:

| KPI | Today | Target | Notes |
|---|---|---|---|
| cold install: bitcoin-knots manifest → `running` healthcheck | unknown | < 30s once image is local | excludes the ~1GB image pull |
| cold install: lnd | unknown | < 60s once image is local | wallet unlock dominates |
| reconcile-tick wall time (no-op pass over all installed apps) | unknown | < 250ms | the current orchestrator does many `podman inspect` calls |
| podman shell-outs per package.install (orchestrator path) | 7-10 | 1-2 (Quadlet) | post-Phase-3 |
| daemon startup (boot → port 5678 listening) | unknown | < 5s | reconcile is async after this |

## Release gates

v1.7.52 ships only when ALL of:

1. ☐ Bitcoin-stops fix verified live on a fresh node (`tests/lifecycle/bats/bitcoin-knots.bats` fully ● after a cold install)
2. ☐ `ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh` returns 0 against .228 (5× for now; full suite, `ARCHY_ALLOW_DESTRUCTIVE=1`)
3. ☐ `ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh` returns 0 against .198 (same)
4. ☐ The L3 `backend-survives-archipelago-restart` suite passes (= Phase 3 Quadlet shipped for backends)
5. ☐ Cargo: 0 warnings, 0 unused, all tests green (sustained ✓ since 1c0df95f)
6. ☐ LoC: at least one of {Phase 3 Quadlet, dev_mode resolution} merged
7. ☐ Layman-readable changelog (per `feedback_changelog_layman.md`)
8. ☐ Tag pushed to origin + gitea-local + gitea-vps2 (per `feedback_ship_ritual.md`)

## How to update this document

When you land a change that materially moves any cell of the matrix or
any LoC row, update this file in the same commit. Reviewers checking
the PR can read the diff to TESTING.md as the answer to "what did
this commit improve?". Without the update, the change is half-shipped.