archipelago 4ec6ca98c1 chore: release v1.7.45-alpha
Resilience-validated release. Three full sweeps of the new resilience
harness against .228 confirm no shipstoppers.

Big user-visible:
- Bitcoin RPC auth durably correct via host-rendered nginx.conf bind-mount,
  replaces fragile post-start exec that failed under restricted-cap rootless
  podman ("crun: write cgroup.procs: Permission denied")
- Multi-container stack installs (indeedhub, immich, btcpay, mempool) now
  emit phase events at every boundary so the progress bar advances
- Apps no longer vanish from the dashboard mid-install (absent-scanner skips
  packages in transitional states)
- Indeedhub fresh installs work end-to-end (was 8500+ restart loop): five
  missing env vars (DATABASE_PORT, QUEUE_HOST, QUEUE_PORT,
  S3_PRIVATE_BUCKET_NAME, AES_MASTER_SECRET) added to install code
- Tailscale install fixed: --entrypoint string was being passed as a single
  shell-line arg; switched to custom_args array
- Catalog cleaned of broken entries (dwn, endurain, ollama removed; nextcloud
  restored on docker.io)
- Bitcoin Core update path uses correct image (was looking for nonexistent
  lfg2025/bitcoin:28.4)
- ISO installs now allocate swap on the encrypted data partition

Infra:
- New resilience harness (scripts/resilience/) — black-box state-machine
  tester, every app × every transition. Run before each release.

Sweep #3 final: PASS 107 / FAIL 12 / SKIP 14. The 12 fails are 1 cosmetic
(homeassistant trusted_hosts), 8 harness/timing false-positives, and 3
non-shipstopper tracked items. Down from 23 in baseline sweep #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:31:45 -04:00

4.7 KiB
Raw Permalink Blame History

Resilience Harness

Black-box state-machine tester for archipelago app containers.

Drives the live RPC against a real archipelago + podman runtime on a target host. For each app in app-catalog/catalog.json, runs every state transition a user could trigger and asserts the system stays in the expected state.

Why this exists

We shipped v1.7.43-alpha on .228 with three independent bugs that no unit test caught:

  1. indeedhub-api crashlooped 8500+ times because stacks.rs was missing 5 env vars (QUEUE_HOST/QUEUE_PORT/DATABASE_PORT/S3_PRIVATE_BUCKET_NAME/ AES_MASTER_SECRET) — the install "succeeded" (containers running) but the API never became healthy.
  2. bitcoin-ui shipped with a stale baked-in Authorization: Basic … header from the registry image, so every /bitcoin-rpc/ call returned 401.
  3. The container-absence scanner evicted apps from the UI 14 seconds into install (before image pull finished).

All three were exactly the kind of bug a "did the user-visible flow actually work end to end?" test would catch — and the kind a single-file unit test will never catch. This harness is the gate.

Running

Against the .228 test node:

scripts/resilience/resilience.sh archipelago@192.168.1.228

Or non-interactive (CI):

RESILIENCE_SSH_PASS=… RESILIENCE_UI_PASS=… \
  scripts/resilience/resilience.sh archipelago@192.168.1.228

Filters:

# Smoke test (3 apps, no reboot, ~15min)
scripts/resilience/resilience.sh archipelago@192.168.1.228 smoke

# Single app
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots

# Subset
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots,lnd

Without a filter, the harness sweeps every app in the catalog (~24 apps × 7 per-app transitions + 2 batch transitions) and runs the batch transitions (archipelago.service restart, host reboot) at the end. Full sweep is ~3-4 hours and reboots the target host as part of the run — only point it at a dedicated test node.

What it tests

Per-app transitions:

# Transition Pass criteria
1 install All containers reach running within 10 min
2 ui_probe HTTP 2xx/3xx via https://<host>/app/<id>/
3 auth_probe (bitcoin-rpc only) returns 200 not 401
4 stop All containers reach exited state
5 start All containers reach running state
6 restart All containers running after restart
7 uninstall All containers absent, no residue

Batch transitions (full sweep only):

# Transition Pass criteria
8 archipelago.service restart Container set unchanged across
9 host reboot Container set unchanged across

Coverage by design — discovery rather than encoded metadata. The harness snapshots podman ps -a before install, again after install stabilizes, and the difference IS this app's container set. Works equally well for single-container apps and 7-container stacks (indeedhub) without per-app configuration.

Output

JSON-lines results at scripts/resilience/reports/<run_ts>/results.jsonl:

{"ts":"…","app":"bitcoin-knots","transition":"install","status":"PASS","detail":"bitcoin-knots,archy-bitcoin-ui"}
{"ts":"…","app":"bitcoin-knots","transition":"auth_probe","status":"PASS","detail":"bitcoin-rpc HTTP 200"}

Exit code: 0 if every cell green, 1 if any red, 2 if setup failed before tests began. Use as a release gate — refuse to tag if any cell red.

Auth flow

The harness uses the same auth.login RPC that the UI uses, then carries session=… and csrf_token=… cookies plus the X-CSRF-Token header on every subsequent call. Re-logs in after archipelago.service restart and host reboot.

Caveats / known gaps

  • App proxy probe (/app/<id>/) only validates the proxy responds — for apps with deeper protocol behavior (lnd, fedimint, mempool) this only catches "container alive, proxy reachable", not "the protocol is healthy".
  • Multi-container stack assertions: the harness checks every new container is running, so it would catch the indeedhub-api restart loop while postgres/redis/minio looked fine.
  • Host reboot test is destructive and slow — runs once at end of full sweep.
  • package.start/stop/restart RPC methods may not exist for all apps; failures are recorded and the harness continues.