110 lines
4.7 KiB
Markdown
Raw Permalink Normal View History

chore: release v1.7.45-alpha Resilience-validated release. Three full sweeps of the new resilience harness against .228 confirm no shipstoppers. Big user-visible: - Bitcoin RPC auth durably correct via host-rendered nginx.conf bind-mount, replaces fragile post-start exec that failed under restricted-cap rootless podman ("crun: write cgroup.procs: Permission denied") - Multi-container stack installs (indeedhub, immich, btcpay, mempool) now emit phase events at every boundary so the progress bar advances - Apps no longer vanish from the dashboard mid-install (absent-scanner skips packages in transitional states) - Indeedhub fresh installs work end-to-end (was 8500+ restart loop): five missing env vars (DATABASE_PORT, QUEUE_HOST, QUEUE_PORT, S3_PRIVATE_BUCKET_NAME, AES_MASTER_SECRET) added to install code - Tailscale install fixed: --entrypoint string was being passed as a single shell-line arg; switched to custom_args array - Catalog cleaned of broken entries (dwn, endurain, ollama removed; nextcloud restored on docker.io) - Bitcoin Core update path uses correct image (was looking for nonexistent lfg2025/bitcoin:28.4) - ISO installs now allocate swap on the encrypted data partition Infra: - New resilience harness (scripts/resilience/) — black-box state-machine tester, every app × every transition. Run before each release. Sweep #3 final: PASS 107 / FAIL 12 / SKIP 14. The 12 fails are 1 cosmetic (homeassistant trusted_hosts), 8 harness/timing false-positives, and 3 non-shipstopper tracked items. Down from 23 in baseline sweep #1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:31:45 -04:00
# Resilience Harness
Black-box state-machine tester for archipelago app containers.
Drives the live RPC against a real archipelago + podman runtime on a target
host. For each app in `app-catalog/catalog.json`, runs every state transition
a user could trigger and asserts the system stays in the expected state.
## Why this exists
We shipped v1.7.43-alpha on .228 with three independent bugs that no unit test
caught:
1. `indeedhub-api` crashlooped 8500+ times because `stacks.rs` was missing 5
env vars (`QUEUE_HOST`/`QUEUE_PORT`/`DATABASE_PORT`/`S3_PRIVATE_BUCKET_NAME`/
`AES_MASTER_SECRET`) — the install "succeeded" (containers running) but the
API never became healthy.
2. `bitcoin-ui` shipped with a stale baked-in `Authorization: Basic …` header
from the registry image, so every `/bitcoin-rpc/` call returned 401.
3. The container-absence scanner evicted apps from the UI 14 seconds into
install (before image pull finished).
All three were exactly the kind of bug a "did the user-visible flow actually
work end to end?" test would catch — and the kind a single-file unit test
will never catch. This harness is the gate.
## Running
Against the .228 test node:
scripts/resilience/resilience.sh archipelago@192.168.1.228
Or non-interactive (CI):
RESILIENCE_SSH_PASS=… RESILIENCE_UI_PASS=… \
scripts/resilience/resilience.sh archipelago@192.168.1.228
Filters:
# Smoke test (3 apps, no reboot, ~15min)
scripts/resilience/resilience.sh archipelago@192.168.1.228 smoke
# Single app
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots
# Subset
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots,lnd
Without a filter, the harness sweeps **every** app in the catalog
(~24 apps × 7 per-app transitions + 2 batch transitions) and runs the
batch transitions (archipelago.service restart, host reboot) at the end.
Full sweep is ~3-4 hours and **reboots the target host** as part of the
run — only point it at a dedicated test node.
## What it tests
Per-app transitions:
| # | Transition | Pass criteria |
|---|----------------------|------------------------------------------------|
| 1 | install | All containers reach `running` within 10 min |
| 2 | ui_probe | HTTP 2xx/3xx via `https://<host>/app/<id>/` |
| 3 | auth_probe | (bitcoin-rpc only) returns 200 not 401 |
| 4 | stop | All containers reach `exited` state |
| 5 | start | All containers reach `running` state |
| 6 | restart | All containers `running` after restart |
| 7 | uninstall | All containers absent, no residue |
Batch transitions (full sweep only):
| # | Transition | Pass criteria |
|---|-------------------------------|-------------------------------------|
| 8 | archipelago.service restart | Container set unchanged across |
| 9 | host reboot | Container set unchanged across |
Coverage by design — discovery rather than encoded metadata. The harness
snapshots `podman ps -a` before install, again after install stabilizes,
and the difference IS this app's container set. Works equally well for
single-container apps and 7-container stacks (indeedhub) without per-app
configuration.
## Output
JSON-lines results at `scripts/resilience/reports/<run_ts>/results.jsonl`:
{"ts":"…","app":"bitcoin-knots","transition":"install","status":"PASS","detail":"bitcoin-knots,archy-bitcoin-ui"}
{"ts":"…","app":"bitcoin-knots","transition":"auth_probe","status":"PASS","detail":"bitcoin-rpc HTTP 200"}
Exit code: `0` if every cell green, `1` if any red, `2` if setup failed
before tests began. Use as a release gate — refuse to tag if any cell red.
## Auth flow
The harness uses the same `auth.login` RPC that the UI uses, then carries
`session=…` and `csrf_token=…` cookies plus the `X-CSRF-Token` header on
every subsequent call. Re-logs in after archipelago.service restart and
host reboot.
## Caveats / known gaps
- App proxy probe (`/app/<id>/`) only validates the proxy responds — for
apps with deeper protocol behavior (lnd, fedimint, mempool) this only
catches "container alive, proxy reachable", not "the protocol is healthy".
- Multi-container stack assertions: the harness checks **every** new
container is `running`, so it would catch the indeedhub-api restart loop
while postgres/redis/minio looked fine.
- Host reboot test is destructive and slow — runs once at end of full sweep.
- `package.start`/`stop`/`restart` RPC methods may not exist for all apps;
failures are recorded and the harness continues.