Container subsystem testing — scorecard and roadmap
The bar (verbatim from the v1.7.52 owner):
"best performant, minimal code, tested containers possible in the world. No bloated code, no problems installing a single one, no problems uninstalling, every one needs to be tested 20+ times in every state before we make another update, not a single container failure outside of hardware or internet failure is allowed."
This document is the live tracker for whether we're meeting that bar. Every PR that touches the container subsystem updates the scoreboard below. If you can't honestly tick the box, the change isn't ready.
Production-quality pass — 2026-06-21 (current, v1.7.99-alpha)
The migration's aim, restated as five pillars (every app must satisfy all five):
- Quadlet-everywhere: every container is a declarative systemd Quadlet unit under `user.slice`, never inside `archipelago.service`'s cgroup. Kills FM3 (restarting/updating archipelago SIGKILLs every container in its cgroup); systemd becomes the per-app supervisor.
- Level-triggered reconciler: a 30s idempotent reconcile loop drives desired→current from manifests + secrets. Self-healing, not edge-triggered.
- Lifecycle bulletproof: every app passes the full matrix (install / UI reachable / stop / start / restart / reinstall / reboot-survive / archipelago-restart-survive / uninstall) 20× green on .228 AND .198 before any release.
- Data-driven apps: install/uninstall needs only the app's manifest + catalog entry. No host OS changes (no apt, no /etc, no host units) and no archipelago binary code per app. Only core apps (bitcoin, lnd, electrumx, fedimint + gateway/clientd) may carry bespoke handling if truly unavoidable.
- Rootless + security-first (non-negotiable): containers run in the unprivileged `archipelago` user namespace; never root, no `--privileged`, drop-all-caps + add-back only what a manifest declares. Secrets are `0600`, owned by the service user. Security is king.
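The level-triggered pillar can be illustrated with a minimal sketch. It assumes a hypothetical plain-text snapshot format of one `app state` pair per line for both desired and observed state; the real reconciler works from manifests and secrets, and `plan_actions` is an illustrative name, not a function in the codebase.

```shell
# plan_actions DESIRED_FILE CURRENT_FILE
# Hypothetical helper: prints one line per app whose observed state differs
# from the desired state. Both inputs hold "app state" pairs, one per line.
plan_actions() {
    awk '
        # First file on the awk command line is the observed snapshot.
        FILENAME == ARGV[1] { current[$1] = $2; next }
        {
            have = ($1 in current) ? current[$1] : "absent"
            if (have != $2) print $1 ": " have " -> " $2
        }
    ' "$2" "$1"
}
```

Because the output depends only on the two snapshots and not on which event fired, the function can run on every tick: a converged node yields no actions, and a missed event is picked up on the next pass.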
Per-app definition of done: all five pillars hold → lifecycle matrix 20× green on .228 then .198 → catalog/registry updated (`app-catalog/catalog.json`, `releases/app-catalog.json`, rebuilt image pushed to the mirror) → tracker cell ticked. Only then move to the next app.
.228 testing constraint: do NOT touch bitcoin-knots, electrumx, or
lnd on .228 — they are synced and healthy; destructive cycles there would
cost hours of resync.
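One way to make the .228 constraint hard to violate by accident is a guard in front of any destructive cycle. This is a sketch only: `PROTECTED_APPS`, `is_protected` and `guard_destructive` are hypothetical names, and nothing here is claimed to exist in the test harness.

```shell
# Hypothetical guard for destructive test cycles on .228.
# The list mirrors the three backends named in the constraint.
PROTECTED_APPS="bitcoin-knots electrumx lnd"

# is_protected APP: succeeds when APP is on the protected list.
is_protected() {
    for p in $PROTECTED_APPS; do
        [ "$p" = "$1" ] && return 0
    done
    return 1
}

# guard_destructive APP: fails with a message for protected apps.
guard_destructive() {
    if is_protected "$1"; then
        echo "refusing destructive cycle on protected app: $1" >&2
        return 1
    fi
}
```

A caller would run `guard_destructive "$app" || exit 1` before any uninstall or reinstall step.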
Session work log
| Date | App | Change | State |
|---|---|---|---|
| 2026-06-21 | fedimint-gateway / -clientd | Generated-secrets system (Pillar 4+5). New `generated_secrets:` manifest field (hex16/hex32/bcrypt); materialised generically at the `resolve_dynamic_env` chokepoint: atomic 0600, rootless-owned, idempotent, and self-healing (recreates a wrongly root:root-owned secret via the service-owned dir, no chown/privilege). Removed per-app `ensure_fmcd_password` (−30 LoC). Fixes gateway never starting (resolving `secret_env` → missing/unreadable `fedimint-gateway-hash`). | ◐ code complete, `cargo check` + 3 unit tests green; not yet deployed/validated on .228 |
| 2026-06-21 | fedimint-gateway | Icon placeholder | ○ investigating: marketplace catalog has title+icon (fedimint.png, shared); BUNDLED_APPS frontend list omits fedimint → installed view falls back to 📦 |
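The generated-secrets behaviour logged above (atomic, `0600`, idempotent, self-healing without chown) can be sketched in shell. This is not the Rust implementation: `ensure_hex_secret` is a hypothetical name, treating the size as a count of random bytes is an assumption (the manifest kinds hex16/hex32 may define length differently), and bcrypt is not covered.

```shell
# ensure_hex_secret DIR NAME NBYTES
# Hypothetical sketch: materialise DIR/NAME as NBYTES random bytes, hex-encoded.
# The body runs in a subshell so umask and variables do not leak to the caller.
ensure_hex_secret() (
    dir=$1 name=$2 nbytes=$3
    target="$dir/$name"

    # Idempotent: an existing, readable, non-empty secret is never rewritten.
    if [ -s "$target" ] && [ -r "$target" ]; then
        exit 0
    fi

    umask 077
    # Temp file lives in the same directory, so the final rename stays on one
    # filesystem and replaces the target atomically.
    tmp=$(mktemp "$dir/.$name.XXXXXX") || exit 1
    head -c "$nbytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n' > "$tmp" || exit 1
    chmod 600 "$tmp" || exit 1

    # Renaming over the target needs write access to DIR only, so an unreadable
    # file left behind by another owner is replaced without chown or privilege.
    mv -f "$tmp" "$target"
)
```

Calling it twice with the same arguments leaves the first value in place, which is the property that lets it run on every environment resolution.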
⏯ RESUME POINT (2026-06-21, mid-session)
Done (working tree, NOT git-committed):
- Generated-secrets system: all files below written, `cargo check` clean, 3 unit tests green.
- Manifests declare `generated_secrets` (fmcd-password hex16; fedimint-gateway-hash bcrypt).
- Tracker refreshed with 5 pillars + this log.
In flight:
- Local release build RUNNING (`cd core && cargo build --release -p archipelago`, log `/tmp/archy-local-build.log`, output `core/target/release/archipelago`). ⚠️ .228 has NO cargo and NO rsync: build LOCALLY on .116, ship binary + files via tar-over-ssh (`tar -cf - … | ssh … 'tar -xf -'`).
Next steps (in order):
- Wait for local build → `Finished`. scp/tar `core/target/release/archipelago` → .228.
- Ship updated manifests to `/opt/archipelago/apps/fedimint-{gateway,clientd}/` (canonical runtime dir; cwd-relative `apps` doesn't resolve because WorkingDirectory is empty).
- Binary swap is SAFE for protected backends: `archipelago.service` is `KillMode=control-group` BUT bitcoin-knots/electrumx/lnd conmons live under `user.slice/.../libpod-*.scope`, NOT the service cgroup. Only fedimint-clientd + immich conmons are in-cgroup (non-protected, reconciled back). `systemctl stop archipelago` → `cp` binary → `start`.
- Validate: install fedimint-gateway → assert `fedimint-gateway-hash` (0600, archipelago-owned) + `.pw` generated → container starts healthy.
- Run `tests/lifecycle/run-20x.sh` for the gateway (do NOT touch knots/electrumx/lnd).
- Frontend fixes (separate from binary): see icon/rename below; rebuild neode-ui, ship `dist + catalog.json + assets` to `/opt/archipelago/web-ui` (chown 1000:1000).
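Before the binary swap, the claim that a given conmon sits outside the service cgroup can be checked per process. The sketch below assumes cgroup v2, where `/proc/PID/cgroup` holds a single `0::<path>` line; `classify_cgroup` and `cgroup_of` are hypothetical helper names.

```shell
# classify_cgroup CGROUP_PATH
# Prints "in-service" when the path sits inside archipelago.service's cgroup
# (and would be hit by KillMode=control-group), otherwise "independent".
classify_cgroup() {
    case $1 in
        */archipelago.service | */archipelago.service/*) echo in-service ;;
        *) echo independent ;;
    esac
}

# cgroup_of PID: print the cgroup v2 path of a process (assumes cgroup v2).
cgroup_of() {
    sed -n 's/^0:://p' "/proc/$1/cgroup"
}
```

For a conmon PID taken from the node, `classify_cgroup "$(cgroup_of "$pid")"` should print `independent` for every protected backend before `systemctl stop archipelago` is run.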
Icon / naming (frontend, user-confirmed):
- Gateway icon = reuse fedimint.png (user choice). Static catalogs already map all 3 → fedimint.png; deployed `/catalog.json` on .228 also correct; `/api/app-catalog` (decoupled, dict form) returns no fedimint → frontend falls through to `/catalog.json`. Placeholder is therefore a stale deployed bundle and/or the hardcoded fallback gap: `getCuratedAppList()` in `neode-ui/src/views/discover/curatedApps.ts` omits fedimint-gateway + fedimint-clientd entirely; add both (icon fedimint.png).
- Base `fedimint` → display "Fedimint Guardian" (user ask). Edit name/title in: `apps/fedimint/manifest.yml`, `app-catalog/catalog.json`, `neode-ui/public/catalog.json`, `web/dist/neode-ui/catalog.json`, `curatedApps.ts:101`. (`INSTALLED_ALIASES.fedimint = ['fedimint-gateway']` in curatedApps.ts.)
.228 access: `sshpass -p archipelago ssh archipelago@192.168.1.228`; UI/RPC pw `password123` (https). Binary `/usr/local/bin/archipelago` (v1.7.99-alpha).
Generated-secrets — files touched
- `core/container/src/manifest.rs`: `GeneratedSecret` + `SecretGenKind` types, `ContainerConfig.generated_secrets`, validation (bare-filename, unique target files).
- `core/container/src/lib.rs`: re-export the new types.
- `core/archipelago/src/container/secrets.rs`: new generator module (atomic write, idempotent, self-heal) + 3 unit tests.
- `core/archipelago/src/container/mod.rs`: register module.
- `core/archipelago/src/container/prod_orchestrator.rs`: call `ensure_generated_secrets` in `resolve_dynamic_env`; drop fmcd special-case.
- `core/archipelago/src/wallet/fedimint_client.rs`: delete orphaned `ensure_fmcd_password` (reader keeps `FMCD_PASSWORD_SECRET`).
- `apps/fedimint-clientd/manifest.yml`, `apps/fedimint-gateway/manifest.yml`: declare `generated_secrets`.
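The two manifest checks named for `manifest.rs` (bare filename, unique target files) are simple enough to mirror in shell, for example in a pre-flight script. This is a sketch under assumptions: `validate_secret_files` is a hypothetical name, the exact rules in the Rust validator may be stricter, and file names containing spaces are not handled.

```shell
# validate_secret_files NAME...
# Hypothetical mirror of the manifest checks: every target must be a bare
# filename and no target may appear twice. Prints the first problem found.
validate_secret_files() {
    seen=" "
    for name in "$@"; do
        # Bare filename: non-empty, no path separator, not "." or "..".
        case $name in
            "" | . | .. | */*) echo "not a bare filename: '$name'"; return 1 ;;
        esac
        # Uniqueness: reject a name already recorded in the seen list.
        case $seen in
            *" $name "*) echo "duplicate target file: $name"; return 1 ;;
        esac
        seen="$seen$name "
    done
}
```

Rejecting anything with a `/` keeps a manifest from steering a generated secret outside the secrets directory.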
Test layers
| Layer | What it asserts | Toolchain | Latency / iteration |
|---|---|---|---|
| L0: Rust unit | Pure-function behaviour (manifest parsing, secret resolution, structural invariants) | `cargo test --workspace --bins` | ~5s |
| L1: RPC API | The JSON-RPC API responds correctly per app (container-list, package.{install,start,stop,restart,uninstall}, bitcoin.getinfo, etc.) | bats + `lib/rpc.bash` | ~30s per suite |
| L2: UI surface | The URLs a user actually clicks (dashboard, `/app/<id>/`, direct-port iframes) return 200 with non-empty bodies | bats + `lib/ui-probes.bash` | ~10s per suite |
| L3: Lifecycle survival | Containers survive operational events (archipelago restart, host reboot, kill -9 mid-install, OOM) | bats (gated) | ~60s per scenario |
| L4: Browser journey | Real DOM-level user flow (login → install → wait → click → use) | playwright (TBD) | ~30-120s per journey |
| L5: Chaos / failure-path | Failure modes recover gracefully (corrupt config, deleted bolt DB, network partition) | bats (chaos-gated) | ~120s per scenario |
| L6: Performance | Cold install latency, reconcile-tick cost, podman call count per lifecycle event | timed bats + Prometheus (TBD) | ~60s per benchmark |
Release gate: L0+L1+L2+L3 green × 20 iterations on .228 AND .198. L4+L5+L6 are quality gates we add as they mature; not blocking the v1.7.52 tag.
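The "green × 20 iterations" rule amounts to a loop that stops at the first failure. The sketch below shows that shape only; it is not the contents of `tests/lifecycle/run-20x.sh`, and `run_n_times` is a hypothetical name.

```shell
# run_n_times N CMD [ARG...]
# Hypothetical gate loop: run CMD up to N times and stop at the first failure,
# so a single red iteration fails the whole gate.
run_n_times() {
    n=$1; shift
    i=1
    while [ "$i" -le "$n" ]; do
        if ! "$@"; then
            echo "iteration $i/$n failed: $*" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "$n/$n iterations green"
}
```

Used as `run_n_times 20 tests/lifecycle/run.sh`, a non-zero exit from any single pass becomes a non-zero exit for the gate.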
Coverage matrix — current state
Legend: ● fully covered, ◐ partial, ○ missing
Per-app × per-state matrix (L1 + L2)
| App | Container present | Valid state | RPC reachable | UI URL 200 | Stop | Start | Restart | Reinstall | Reboot survives | Archipelago-restart survives |
|---|---|---|---|---|---|---|---|---|---|---|
| bitcoin-knots | ● | ● | ● | ● (port 8334) | ● | ● | ● | ● | ○ | ◐ regression-gate only |
| bitcoin-core | ◐ shares with knots | ◐ | ○ | ◐ | ○ | ○ | ○ | ○ | ○ | ◐ regression-gate |
| lnd | ● | ● | ● (lncli) | ● (/app/lnd/) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| electrumx | ● | ● | ● (TCP 50001) | ● (/app/electrumx/) | ● | ● | ● | ● | ○ | ◐ regression-gate |
| btcpay-server | ● | ● | ◐ frontend-port | ● (/app/btcpay/) | ● | ● | ● | ● | ○ | ○ |
| mempool | ● | ● | ● (/api/v1/backend-info) | ● (/app/mempool/) | ● | ● | ● | ● | ○ | ○ |
| fedimint | ● | ● | ◐ container-only | ● (/app/fedimint/) | ● | ● | ● | ● | ○ | ○ |
| filebrowser | ○ | ○ | ○ | ● probe-only | ○ | ○ | ○ | ○ | ○ | ◐ via companions |
| archy-bitcoin-ui | ◐ via companions | ◐ | n/a | ● (port 8334) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-lnd-ui | ◐ via companions | ◐ | n/a | ● (/app/lnd/) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
| archy-electrs-ui | ◐ via companions | ◐ | n/a | ● (/app/electrumx/) | ○ | ○ | ○ | n/a | ◐ via companions | ● |
Done: 50 of 110 cells. Goal: 110/110 ● for the listed apps before v1.7.52 tags.
Layer-by-layer status
| Layer | Tests | Suites | Status |
|---|---|---|---|
| L0 unit | 631 | n/a | ● green |
| L1 RPC | 70 | bitcoin-knots, lnd, electrumx, btcpay, mempool, fedimint, required-stack, package-update-smoke | ● for the 6 core apps |
| L2 UI | 9 | ui-coverage | ● for dashboard + 7 proxy paths + bitcoin-ui:8334 |
| L3 lifecycle survival | 14 | companion-survives-archipelago-restart, backend-survives-archipelago-restart, required-stack-destructive, use-quadlet-backends-install | ◐ companions ● ; backends ◐ regression-gate (will fail until Phase 3 Quadlet ships); quadlet post-condition gate ✅ skip-clean today, hard gate when flag flipped |
| L1 wallet-receive / drift / secrets | 5 | bitcoin-receive, port-drift, secret-completeness | ● guards the v1.7.9x wallet fleet failures |
| L4 browser journey | 0 | none | ○ not started |
| L5 chaos | 0 | none | ○ not started |
| L6 performance | 0 | none | ○ not started |
Wallet / Bitcoin fleet-failure regression suites (added after v1.7.90-alpha)
Three production failures shipped on v1.7.90-alpha despite the existing harness,
because nothing exercised the receive path, port-mapping drift, or secret
completeness on a live node. New suites close those gaps (all run on the archy
host, read-only, so they join run.sh/run-20x.sh automatically):
| Suite | Failure it guards | Asserts |
|---|---|---|
| `bitcoin-receive.bats` | .116 ("Operation failed" on receive) and .228 (false "wallet is locked") | LND REST reachable on the manifest host port; `lnd.newaddress` returns a bc1… address on a running node; receive errors are specific, never the generic catch-all |
| `port-drift.bats` | .116 (lnd REST stuck on host 8080 vs manifest 18080) | every installed backend's live `podman inspect` PortBindings match its manifest `ports:` (the external mirror of the orchestrator's `host_port_bindings_drifted`) |
| `secret-completeness.bats` | .198 (bitcoin-knots needs `bitcoin-rpc-txrelay-rpcauth`, never generated → stack cascade) | every `secret_file` referenced by an installed backend manifest exists in the secrets dir |
Backed by L0 unit tests (cargo test … drift missing_secret lnd) and a vitest
for the frontend reason-code mapping (bitcoinReceive.test.ts). The release
gate scripts/create-release.sh now runs tests/release/run.sh (which includes
these) and aborts the release on failure — previously it ran no tests at all.
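The port-drift assertion reduces to a set comparison between what the manifest declares and what the running container binds. The sketch below shows only that comparison; `ports_drifted` is a hypothetical name, the `host:container` pair format is an assumption, and extracting the live list from `podman inspect` is left out.

```shell
# ports_drifted "MANIFEST_PORTS" "LIVE_PORTS"
# Each argument is a space-separated list of host:container pairs.
# Returns 0 (true) when the two sets differ, 1 when they match.
ports_drifted() {
    # Unquoted on purpose: split each list into one pair per line, then sort
    # so that ordering differences do not count as drift.
    want=$(printf '%s\n' $1 | sort)
    have=$(printf '%s\n' $2 | sort)
    [ "$want" != "$have" ]
}
```

With illustrative values in the spirit of the .116 failure, `ports_drifted "18080:8080" "8080:8080"` reports drift, while the same pairs in a different order do not.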
Run commands
# L0 unit:
cd core && cargo test --workspace --bins
# Single bats suite:
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh bitcoin-knots
# Full bats suite (read-only):
ARCHY_PASSWORD=password123 tests/lifecycle/run.sh
# Full + destructive (for the verification fleet):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 tests/lifecycle/run.sh
# 20× release-gate run (the actual v1.7.52 ship gate):
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 \
tests/lifecycle/run-20x.sh
To exercise the Phase 3.2 Quadlet-backend path on a target node without
editing config.json (which would require an archipelago restart and
trigger FM3 until 3.5 ships), set the env var on archipelago.service:
sudo systemctl edit archipelago # add: [Service]\nEnvironment=ARCHIPELAGO_USE_QUADLET_BACKENDS=1
sudo systemctl restart archipelago # one cgroup-cascade hit; survivable on a debug node
After the restart, package.install for any orchestrator-managed backend
will route through install_via_quadlet, and the
use-quadlet-backends-install.bats suite turns from skip → hard gate.
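For scripted nodes, the same drop-in can be produced without the interactive editor. The sketch assumes `archipelago.service` is a system unit, in which case `systemctl edit` stores its override at `/etc/systemd/system/archipelago.service.d/override.conf`; `render_quadlet_flag_dropin` is a hypothetical helper name.

```shell
# render_quadlet_flag_dropin VALUE
# Prints a systemd drop-in that sets the Quadlet-backends flag on the service.
render_quadlet_flag_dropin() {
    printf '[Service]\nEnvironment=ARCHIPELAGO_USE_QUADLET_BACKENDS=%s\n' "$1"
}

# Applying it (commented out; needs root and causes the same single restart):
#   sudo mkdir -p /etc/systemd/system/archipelago.service.d
#   render_quadlet_flag_dropin 1 | sudo tee /etc/systemd/system/archipelago.service.d/override.conf
#   sudo systemctl daemon-reload && sudo systemctl restart archipelago
```

Removing the override file and reloading reverts the node to the default (flag off) path.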
LoC budget
Goal: minimum-viable container subsystem.
| Module | LoC today | Target | Δ | Status |
|---|---|---|---|---|
| `core/container/src/dependency_resolver.rs` | n/a | n/a | -270 | ● deleted |
| `core/container/src/health_monitor.rs` | 196 | 0 | -196 | ◐ pending health migration into reconciler (Phase 3.5) |
| `core/container/src/podman_client.rs::create/start/stop` | ~400 | ~150 | -250 | ◐ pending Quadlet migration (Phase 3.5) |
| `core/archipelago/src/container/dev_orchestrator.rs` | 410 | 0 | -410 | ○ pending dev_mode strategy decision |
| `core/archipelago/src/container/data_manager.rs` | 96 | 0 | -96 | ○ couples with dev_orchestrator |
| `core/container/src/bitcoin_simulator.rs` | 219 | 0 | -219 | ○ couples with dev_orchestrator |
| `core/container/src/port_manager.rs` | 175 | 0 | -175 | ○ couples with dev_orchestrator |
| `core/archipelago/src/api/rpc/package/install.rs::install_bitcoincoin_rpc_repair` | ~150 | 0 | -150 | ◐ pending fold into orchestrator pre-start |
| imperative `install_fresh` in `prod_orchestrator` | ~120 | 0 | -120 | ◐ Phase 3.2 wired behind `use_quadlet_backends` flag (default off); 3.3 in-place migration ✅; 3.4 health-gated startup (`Notify=healthy`) ✅ + `TimeoutStartSec=600` race fix ✅; 3.4a unit drift-sync each reconcile ✅; flip default after 20× green |
Today: -270 LoC committed. Outstanding deletes possible: ~1,616 LoC (if Phase 3 ships fully + dev_mode resolved).
Net target for v1.7.52: container subsystem ≈ half of today's LoC.
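The ~1,616 LoC figure is the sum of the Δ column for every row that is not yet deleted. A small sketch that reproduces the arithmetic, with the values copied from the table (three of them are approximate in the source) and `sum_lines` as a hypothetical helper name:

```shell
# sum_lines: add up one integer per line read from stdin.
sum_lines() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Outstanding rows of the LoC budget, in table order
# (dependency_resolver.rs is excluded because it is already deleted).
printf '%s\n' 196 250 410 96 219 175 150 120 | sum_lines   # prints 1616
```

Re-running the sum after each landed delete is a quick way to keep the "Outstanding deletes possible" line in step with the table.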
Performance KPIs (TBD — measure first, then target)
We don't have a performance harness yet. Add as L6 lands:
| KPI | Today | Target | Notes |
|---|---|---|---|
| cold install: bitcoin-knots manifest → running healthcheck | unknown | < 30s once image is local | excludes the ~1GB image pull |
| cold install: lnd | unknown | < 60s once image is local | wallet unlock dominates |
| reconcile-tick wall time (no-op pass over all installed apps) | unknown | < 250ms | the current orchestrator does many podman inspect calls |
| podman shell-outs per package.install (orchestrator path) | 7-10 | 1-2 (Quadlet) | post-Phase-3 |
| daemon startup (boot → port 5678 listening) | unknown | < 5s | reconcile is async after this |
Release gates
v1.7.52 ships only when ALL of:
- ☐ Bitcoin-stops fix verified live on a fresh node (`tests/lifecycle/bats/bitcoin-knots.bats` fully ● after a cold install)
- ☐ `tests/lifecycle/run-20x.sh` returns 0 against .228 (full suite, `ARCHY_ALLOW_DESTRUCTIVE=1`)
- ☐ `tests/lifecycle/run-20x.sh` returns 0 against .198 (same)
- ☐ The L3 `backend-survives-archipelago-restart` suite passes (= Phase 3 Quadlet shipped for backends)
- ☐ Cargo: 0 warnings, 0 unused, all tests green (sustained ✓ since `1c0df95f`)
- ☐ LoC: at least one of {Phase 3 Quadlet, dev_mode resolution} merged
- ☐ Layman-readable changelog (per `feedback_changelog_layman.md`)
- ☐ Tag pushed to origin + gitea-local + gitea-vps2 (per `feedback_ship_ritual.md`)
How to update this document
When you land a change that materially moves any cell of the matrix or any LoC row, update this file in the same commit. Reviewers checking the PR can read the diff to TESTING.md as the answer to "what did this commit improve?". Without the update, the change is half-shipped.