Resilience Harness
Black-box state-machine tester for archipelago app containers.
Drives the live RPC against a real archipelago + podman runtime on a target
host. For each app in app-catalog/catalog.json, runs every state transition
a user could trigger and asserts the system stays in the expected state.
Why this exists
We shipped v1.7.43-alpha on .228 with three independent bugs that no unit test caught:
- indeedhub-api crashlooped 8500+ times because stacks.rs was missing 5 env vars (QUEUE_HOST/QUEUE_PORT/DATABASE_PORT/S3_PRIVATE_BUCKET_NAME/AES_MASTER_SECRET). The install "succeeded" (containers running), but the API never became healthy.
- bitcoin-ui shipped with a stale baked-in Authorization: Basic … header from the registry image, so every /bitcoin-rpc/ call returned 401.
- The container-absence scanner evicted apps from the UI 14 seconds into install (before image pull finished).
All three were exactly the kind of bug a "did the user-visible flow actually work end to end?" test would catch — and the kind a single-file unit test will never catch. This harness is the gate.
Running
Against the .228 test node:
scripts/resilience/resilience.sh archipelago@192.168.1.228
Or non-interactive (CI):
RESILIENCE_SSH_PASS=… RESILIENCE_UI_PASS=… \
scripts/resilience/resilience.sh archipelago@192.168.1.228
Filters:
# Smoke test (3 apps, no reboot, ~15min)
scripts/resilience/resilience.sh archipelago@192.168.1.228 smoke
# Single app
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots
# Subset
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots,lnd
Without a filter, the harness sweeps every app in the catalog (~24 apps × 7 per-app transitions + 2 batch transitions, roughly 170 cells). The batch transitions (archipelago.service restart, host reboot) run once at the end. A full sweep takes ~3-4 hours and reboots the target host as part of the run, so only point it at a dedicated test node.
What it tests
Per-app transitions:
| # | Transition | Pass criteria |
|---|---|---|
| 1 | install | All containers reach running within 10 min |
| 2 | ui_probe | HTTP 2xx/3xx via https://<host>/app/<id>/ |
| 3 | auth_probe | (bitcoin-rpc only) returns 200 not 401 |
| 4 | stop | All containers reach exited state |
| 5 | start | All containers reach running state |
| 6 | restart | All containers running after restart |
| 7 | uninstall | All containers absent, no residue |
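
The "reach running within 10 min" criterion amounts to polling with a deadline. The following is a minimal sketch of that idea, not the harness's actual code: wait_all_running and container_state are hypothetical names, and the probe is assumed to go through podman inspect. Keeping the probe in its own function makes it easy to stub out for a dry run.

```sh
# Hypothetical probe: print one container's state (e.g. "running", "exited").
# Assumes podman is on PATH; prints nothing if the container does not exist.
container_state() {
  podman inspect --format '{{.State.Status}}' "$1" 2>/dev/null
}

# Hypothetical helper: poll until every named container is "running".
# Usage: wait_all_running <timeout_seconds> <interval_seconds> <name>...
# Returns 0 once all are running, 1 if the deadline passes first.
wait_all_running() {
  timeout_s=$1
  interval_s=$2
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    pending=0
    for name in "$@"; do
      state=$(container_state "$name")
      [ "$state" = "running" ] || pending=$((pending + 1))
    done
    [ "$pending" -eq 0 ] && return 0
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval_s"
  done
}
```

With the 10-minute budget from the table, a call could look like `wait_all_running 600 5 bitcoin-knots archy-bitcoin-ui`; the 5-second interval is an arbitrary choice for illustration.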
Batch transitions (full sweep only):
| # | Transition | Pass criteria |
|---|---|---|
| 8 | archipelago.service restart | Container set unchanged across the restart |
| 9 | host reboot | Container set unchanged across the reboot |
Coverage comes from discovery rather than encoded per-app metadata. The
harness snapshots podman ps -a before install and again after install
stabilizes; the difference between the two snapshots is this app's
container set. This works equally well for single-container apps and
7-container stacks (indeedhub) without per-app configuration.
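
The snapshot-and-diff idea can be sketched as a set difference over container names. This is an illustration under assumptions, not the harness's implementation: new_containers is a hypothetical helper, and each snapshot is assumed to be a file with one container name per line.

```sh
# Hypothetical helper: print names listed in the "after" snapshot ($2)
# that are absent from the "before" snapshot ($1), sorted.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
new_containers() {
  grep -vxF -f "$1" "$2" | sort
}

# Against a live host the snapshots could be taken like this:
#   podman ps -a --format '{{.Names}}' > before.txt
#   ... install the app and wait for it to stabilize ...
#   podman ps -a --format '{{.Names}}' > after.txt
#   new_containers before.txt after.txt
```

Because the diff is computed from whatever podman reports, a stack that adds seven containers needs no more configuration than an app that adds one.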
Output
JSON-lines results at scripts/resilience/reports/<run_ts>/results.jsonl:
{"ts":"…","app":"bitcoin-knots","transition":"install","status":"PASS","detail":"bitcoin-knots,archy-bitcoin-ui"}
{"ts":"…","app":"bitcoin-knots","transition":"auth_probe","status":"PASS","detail":"bitcoin-rpc HTTP 200"}
Exit code: 0 if every cell green, 1 if any red, 2 if setup failed
before tests began. Use as a release gate — refuse to tag if any cell red.
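
One way to wire the exit codes into a release script is sketched below. release_gate and red_cells are hypothetical helpers, not part of the harness. red_cells assumes compact one-object-per-line JSON like the samples above and treats any status other than PASS as red, since the failing status string is not shown here.

```sh
# Hypothetical wrapper: run the harness command passed as arguments and
# translate its documented exit code into a go/no-go message.
release_gate() {
  rc=0
  "$@" || rc=$?
  case "$rc" in
    0) echo "gate: every cell green, ok to tag" ;;
    1) echo "gate: at least one cell red, refusing to tag" ;;
    2) echo "gate: setup failed before tests began, refusing to tag" ;;
    *) echo "gate: unexpected exit code $rc, refusing to tag" ;;
  esac
  return "$rc"
}

# Hypothetical helper: print every result line whose status is not PASS.
red_cells() {
  grep -v '"status":"PASS"' "$1"
}

# Usage sketch (RELEASE_TAG is a placeholder):
#   release_gate scripts/resilience/resilience.sh archipelago@192.168.1.228 \
#     && git tag "$RELEASE_TAG"
```

When the gate refuses, running red_cells on that run's results.jsonl lists the transitions to investigate.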
Auth flow
The harness uses the same auth.login RPC that the UI uses, then carries
session=… and csrf_token=… cookies plus the X-CSRF-Token header on
every subsequent call. It logs in again after the archipelago.service
restart and after the host reboot.
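
For reproducing a call by hand with curl, the cookie-plus-header pattern can be sketched as follows. csrf_from_jar is a hypothetical helper, and the sketch assumes the X-CSRF-Token header carries the same value as the csrf_token cookie. RPC_URL and BODY are placeholders; the real endpoint and payload are not shown here.

```sh
# Hypothetical helper: read the csrf_token value out of a curl cookie jar.
# curl writes the jar as tab-separated lines; field 6 is the cookie name
# and field 7 is its value.
csrf_from_jar() {
  awk -F '\t' '$6 == "csrf_token" { print $7 }' "$1"
}

# Usage sketch, after a login request has populated "$JAR" via curl -c:
#   curl -s -b "$JAR" -c "$JAR" \
#        -H "X-CSRF-Token: $(csrf_from_jar "$JAR")" \
#        -H 'Content-Type: application/json' \
#        -d "$BODY" "$RPC_URL"
```

After a service restart or reboot, the same steps would be repeated with a fresh jar, mirroring the harness logging in again.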
Caveats / known gaps
- App proxy probe (/app/<id>/) only validates that the proxy responds. For apps with deeper protocol behavior (lnd, fedimint, mempool) this only catches "container alive, proxy reachable", not "the protocol is healthy".
- Multi-container stack assertions: the harness checks every new container is running, so it would catch the indeedhub-api restart loop while postgres/redis/minio looked fine.
- Host reboot test is destructive and slow, so it runs once at the end of a full sweep.
- package.start/stop/restart RPC methods may not exist for all apps; failures are recorded and the harness continues.