A six-test bats suite that validates what install_via_quadlet (Phase 3.2)
is supposed to leave behind:
* `.container` unit on disk in $XDG_CONFIG_HOME/containers/systemd/
with [Container] / [Service] / [Install] sections, Image= present,
and Restart=on-failure (the backend invariant — companions use Always)
* Phase 3.4 cross-check: any unit with HealthCmd= must also emit
Notify=healthy, otherwise systemctl start won't gate on health
* `systemctl --user is-active` returns 0 for the .service
* podman shows the container running
* the container's cgroup is under user.slice/, NOT under
archipelago.service — the kernel-level proof that FM3 cgroup
cascade SIGKILL is structurally fixed for this container
Auto-skips on every test when no backend Quadlet units exist (today's
default state, use_quadlet_backends=false) — so the suite is a no-op
on current fleet boxes and turns into a hard regression gate the
moment anyone flips the flag and reinstalls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings L1 (RPC API) + L3 (lifecycle survival) parity coverage to the
three multi-app stacks that were previously only touched by
required-stack.bats. Combined with bitcoin-knots / lnd / electrumx
already shipping, the six core apps now have dedicated bats files.
Each suite is shaped like the existing single-container suites
(bitcoin-knots / lnd / electrumx) and gates every assertion on the
backing container actually being present, so a node without the stack
installed gets clean skip messages instead of false fails.
* btcpay.bats — 9 tests, including stack-wide presence and a
"supporting containers don't cascade-restart" guard
* fedimint.bats — 8 tests, single container
* mempool.bats — 9 tests, mixed legacy + orchestrator-managed stack;
reuses the :8999 mempool-api probe from required-stack for parity
Total bats now: 88 (was 53 → +35).
TESTING.md matrix advances 23 → 50 of 110 cells.
UI URL coverage for these three apps already lives in
ui-coverage.bats, so this PR doesn't duplicate proxy-path probes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the coverage gap where existing bats suites would report green
on a node whose dashboard tiles 502 because the proxy upstream is dead.
First pass against .198 caught real prod issues immediately:
/app/lnd/ → 502 (lnd container exited)
/app/mempool/ → 502 (mempool container exited)
/app/fedimint/ → 502 (fedimint container exited)
while existing tests reported only "container is up: false" with no
404/502 distinction.
* lib/ui-probes.bash — sourced helper. probe_https_200,
probe_app_url (skip-if-container-down else assert-200),
probe_dashboard_shell (asserts the Vue SPA HTML, not nginx default —
catches the layout regression from feedback_release_tarball_layout.md),
probe_dashboard_catalog (asserts /catalog.json non-empty).
* bats/ui-coverage.bats — 9 @test cases covering the dashboard +
bitcoin-ui :8334 + the seven HTTPS_PROXY_PATHS most users hit
(lnd, electrumx, mempool, fedimint, btcpay, filebrowser).
URL list mirrors HTTPS_PROXY_PATHS in
neode-ui/src/views/appSession/appSessionConfig.ts. Divergence between
the two is the exact bug class we're guarding against.
Loops clean under run-20x.sh. Container-state oracle is via local
podman inspect, so the suite must run on the archy host (same as
companion-survives-archipelago-restart.bats).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sister suite to companion-survives-archipelago-restart.bats. That one
tests the same property for UI companions, which already ship via
Quadlet (commit 6e716f68) and so already pass.
This new suite tests the property for backend containers (bitcoin-knots
/ bitcoin-core / lnd / electrumx). Until v1.7.52 Phase 3 ships these
under Quadlet too, the suite is EXPECTED TO FAIL on fleet boxes — it's
the executable definition of "FM3 fixed".
Observed live on .198 on 2026-05-01: `sudo systemctl stop archipelago`
killed every container in archipelago.service's cgroup. The dedicated
"backends survive archipelago restart" test catches exactly that, and
also verifies the SAME container instance survives (compares pre/post
.Id), so an orchestrator that recreates a fresh container after the
SIGKILL doesn't read as pass.
Three @test cases:
* destructive gate (skip-marker for the suite)
* baseline: at least one backend installed + running
* backends survive: same .Id pre + post archipelago restart
Don't gate releases on this passing until Phase 3 lands; before then
treat it as a "expected to fail / shows progress" indicator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same shape as bitcoin-knots.bats and lnd.bats so the 20× release-gate
exercises electrumx through the same state matrix it uses for the other
two core apps. electrumx previously had a single TCP-port check inside
required-stack.bats; this adds destructive + cascade-destructive tiers.
10 @test cases:
* read-only: presence, valid state, TCP port (50001) reachable, no
orphan containers beyond {electrumx, archy-electrs-ui}
* destructive: stop, start, restart, TCP port recovers within 120s of
cold restart (longer than bitcoind because electrumx replays its
index against bitcoind on start)
* cascade: uninstall, reinstall (240s timeout for index rebuild)
With this suite, the three single-container core apps (bitcoin-knots,
lnd, electrumx) now have parity coverage. Multi-container stacks
(btcpay, mempool, fedimint) come next.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors bitcoin-knots.bats so the 20× release-gate run exercises lnd
through the same state matrix. lnd previously had only a single
read-only check inside required-stack.bats; this adds the destructive
and cascade-destructive tiers that match what we already test for
bitcoin-knots.
10 @test cases:
* read-only: presence, valid state, lncli getinfo, no orphan containers
* destructive (ARCHY_ALLOW_DESTRUCTIVE=1): stop, start, restart,
RPC recovers within 90s of cold restart (longer than bitcoind
because the wallet has to unlock first)
* cascade (ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1): uninstall, reinstall
Reuses the same lncli invocation as required-stack.bats so divergence
shows up clearly if either test breaks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion UI containers (archy-bitcoin-ui, archy-lnd-ui,
archy-electrs-ui) used to be launched as fire-and-forget tokio::spawn
blocks from install.rs. If archipelago crashed mid-spawn or the
container's cgroup was reaped, companions vanished from podman ps -a
and only a manual rm/run could bring them back (the .228 incident).
Now each companion is rendered as a Quadlet .container unit under
~/.config/containers/systemd/, daemon-reloaded, and started via
systemctl --user. systemd owns supervision from that point on:
- archipelago can crash, restart, or be uninstalled without touching
any companion.
- Quadlet's Restart=always + RestartSec=10 handles container exits.
- A 30s reconcile tick in boot_reconciler enumerates expected
companion units and re-installs any whose unit file or service
vanished — defense-in-depth against external tampering.
New module layout:
- container/quadlet.rs: pure unit renderer + atomic write_if_changed
+ systemctl helpers (daemon_reload_user / enable_now / disable_remove
/ is_active). 6 unit tests, no I/O in the renderer.
- container/companion.rs: per-app companion specs, install/remove/
reconcile, image presence (build local first, fall back to insecure
registry only via image_uses_insecure_registry whitelist). 2 tests.
install.rs handle_package_install now ends with a single call to
companion::install_for(package_id), replacing 287 lines of spawn-and-
hope shellouts plus a ~120-line nginx auth-injector helper that worked
around per-node RPC password baking. The helper is gone too — the
pre-start hook renders the per-node nginx.conf to /var/lib/archipelago/
bitcoin-ui/nginx.conf and the Quadlet unit bind-mounts it read-only.
runtime.rs handle_package_uninstall now disables companions before
the container rm loop. Otherwise systemd's Restart=always would
respawn each companion within ~10s of removal.
Tests: 53 container tests pass, including 6 quadlet renderer tests
(host network, bridge network, capability set, atomic write idempotence)
and 2 companion specs (per-app companion lookup, build_unit shape).
boot_reconciler tests gain a #[cfg(test)] without_companion_stage()
flag so the paused-clock fixtures don't race the real systemctl I/O.
A bats regression test (companion-survives-archipelago-restart.bats,
gated on ARCHY_ALLOW_DESTRUCTIVE=1) asserts the .228 failure mode
cannot recur: every installed companion has a unit file, services
stay active across systemctl --user restart archipelago, and a
deleted unit file is recreated within one reconcile tick.
Net delta: +941 / -363, but the +941 is mostly tests (~440 lines)
and the new declarative layer; the imperative tokio::spawn block and
its nginx-auth helper are gone, removing two failure classes
(orphan companions on archipelago crash, and post-start exec races
under tightly-confined cgroups) that previously needed manual SSH
recovery.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>