Resilience-validated release. Three full sweeps of the new resilience
harness against .228 confirm no shipstoppers.
Big user-visible:
- Bitcoin RPC auth durably correct via host-rendered nginx.conf bind-mount,
replaces fragile post-start exec that failed under restricted-cap rootless
podman ("crun: write cgroup.procs: Permission denied")
- Multi-container stack installs (indeedhub, immich, btcpay, mempool) now
emit phase events at every boundary so the progress bar advances
- Apps no longer vanish from the dashboard mid-install (absent-scanner skips
packages in transitional states)
- Indeedhub fresh installs work end-to-end (was 8500+ restart loop): five
missing env vars (DATABASE_PORT, QUEUE_HOST, QUEUE_PORT,
S3_PRIVATE_BUCKET_NAME, AES_MASTER_SECRET) added to install code
- Tailscale install fixed: --entrypoint string was being passed as a single
shell-line arg; switched to custom_args array
- Catalog cleaned of broken entries (dwn, endurain, ollama removed; nextcloud
restored on docker.io)
- Bitcoin Core update path uses correct image (was looking for nonexistent
lfg2025/bitcoin:28.4)
- ISO installs now allocate swap on the encrypted data partition
Infra:
- New resilience harness (scripts/resilience/) — black-box state-machine
tester, every app × every transition. Run before each release.
Sweep #3 final: PASS 107 / FAIL 12 / SKIP 14. The 12 fails are 1 cosmetic
(homeassistant trusted_hosts), 8 harness/timing false-positives, and 3
non-shipstopper tracked items. Down from 23 in baseline sweep #1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
110 lines
4.7 KiB
Markdown
110 lines
4.7 KiB
Markdown
# Resilience Harness
|
||
|
||
Black-box state-machine tester for archipelago app containers.
|
||
|
||
Drives the live RPC against a real archipelago + podman runtime on a target
|
||
host. For each app in `app-catalog/catalog.json`, runs every state transition
|
||
a user could trigger and asserts the system stays in the expected state.
|
||
|
||
## Why this exists
|
||
|
||
We shipped v1.7.43-alpha on .228 with three independent bugs that no unit test
|
||
caught:
|
||
|
||
1. `indeedhub-api` crashlooped 8500+ times because `stacks.rs` was missing 5
|
||
env vars (`QUEUE_HOST`/`QUEUE_PORT`/`DATABASE_PORT`/`S3_PRIVATE_BUCKET_NAME`/
|
||
`AES_MASTER_SECRET`) — the install "succeeded" (containers running) but the
|
||
API never became healthy.
|
||
2. `bitcoin-ui` shipped with a stale baked-in `Authorization: Basic …` header
|
||
from the registry image, so every `/bitcoin-rpc/` call returned 401.
|
||
3. The container-absence scanner evicted apps from the UI 14 seconds into
|
||
install (before image pull finished).
|
||
|
||
All three were exactly the kind of bug a "did the user-visible flow actually
|
||
work end to end?" test would catch — and the kind a single-file unit test
|
||
will never catch. This harness is the gate.
|
||
|
||
## Running
|
||
|
||
Against the .228 test node:
|
||
|
||
scripts/resilience/resilience.sh archipelago@192.168.1.228
|
||
|
||
Or non-interactive (CI):
|
||
|
||
RESILIENCE_SSH_PASS=… RESILIENCE_UI_PASS=… \
|
||
scripts/resilience/resilience.sh archipelago@192.168.1.228
|
||
|
||
Filters:
|
||
|
||
# Smoke test (3 apps, no reboot, ~15min)
|
||
scripts/resilience/resilience.sh archipelago@192.168.1.228 smoke
|
||
|
||
# Single app
|
||
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots
|
||
|
||
# Subset
|
||
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots,lnd
|
||
|
||
Without a filter, the harness sweeps **every** app in the catalog
|
||
(~24 apps × 7 per-app transitions + 2 batch transitions) and runs the
|
||
batch transitions (archipelago.service restart, host reboot) at the end.
|
||
Full sweep is ~3-4 hours and **reboots the target host** as part of the
|
||
run — only point it at a dedicated test node.
|
||
|
||
## What it tests
|
||
|
||
Per-app transitions:
|
||
|
||
| # | Transition | Pass criteria |
|
||
|---|----------------------|------------------------------------------------|
|
||
| 1 | install | All containers reach `running` within 10 min |
|
||
| 2 | ui_probe | HTTP 2xx/3xx via `https://<host>/app/<id>/` |
|
||
| 3 | auth_probe | (bitcoin-rpc only) returns 200 not 401 |
|
||
| 4 | stop | All containers reach `exited` state |
|
||
| 5 | start | All containers reach `running` state |
|
||
| 6 | restart | All containers `running` after restart |
|
||
| 7 | uninstall | All containers absent, no residue |
|
||
|
||
Batch transitions (full sweep only):
|
||
|
||
| # | Transition | Pass criteria |
|
||
|---|-------------------------------|-------------------------------------|
|
||
| 8 | archipelago.service restart | Container set unchanged across |
|
||
| 9 | host reboot | Container set unchanged across |
|
||
|
||
Coverage by design — discovery rather than encoded metadata. The harness
|
||
snapshots `podman ps -a` before install, again after install stabilizes,
|
||
and the difference IS this app's container set. Works equally well for
|
||
single-container apps and 7-container stacks (indeedhub) without per-app
|
||
configuration.
|
||
|
||
## Output
|
||
|
||
JSON-lines results at `scripts/resilience/reports/<run_ts>/results.jsonl`:
|
||
|
||
{"ts":"…","app":"bitcoin-knots","transition":"install","status":"PASS","detail":"bitcoin-knots,archy-bitcoin-ui"}
|
||
{"ts":"…","app":"bitcoin-knots","transition":"auth_probe","status":"PASS","detail":"bitcoin-rpc HTTP 200"}
|
||
|
||
Exit code: `0` if every cell green, `1` if any red, `2` if setup failed
|
||
before tests began. Use as a release gate — refuse to tag if any cell red.
|
||
|
||
## Auth flow
|
||
|
||
The harness uses the same `auth.login` RPC that the UI uses, then carries
|
||
`session=…` and `csrf_token=…` cookies plus the `X-CSRF-Token` header on
|
||
every subsequent call. Re-logs in after archipelago.service restart and
|
||
host reboot.
|
||
|
||
## Caveats / known gaps
|
||
|
||
- App proxy probe (`/app/<id>/`) only validates the proxy responds — for
|
||
apps with deeper protocol behavior (lnd, fedimint, mempool) this only
|
||
catches "container alive, proxy reachable", not "the protocol is healthy".
|
||
- Multi-container stack assertions: the harness checks **every** new
|
||
container is `running`, so it would catch the indeedhub-api restart loop
|
||
while postgres/redis/minio looked fine.
|
||
- Host reboot test is destructive and slow — runs once at end of full sweep.
|
||
- `package.start`/`stop`/`restart` RPC methods may not exist for all apps;
|
||
failures are recorded and the harness continues.
|