archipelago 4ec6ca98c1 chore: release v1.7.45-alpha
Resilience-validated release. Three full sweeps of the new resilience
harness against .228 confirm no shipstoppers.

Big user-visible:
- Bitcoin RPC auth durably correct via host-rendered nginx.conf bind-mount,
  replaces fragile post-start exec that failed under restricted-cap rootless
  podman ("crun: write cgroup.procs: Permission denied")
- Multi-container stack installs (indeedhub, immich, btcpay, mempool) now
  emit phase events at every boundary so the progress bar advances
- Apps no longer vanish from the dashboard mid-install (absent-scanner skips
  packages in transitional states)
- Indeedhub fresh installs work end-to-end (was 8500+ restart loop): five
  missing env vars (DATABASE_PORT, QUEUE_HOST, QUEUE_PORT,
  S3_PRIVATE_BUCKET_NAME, AES_MASTER_SECRET) added to install code
- Tailscale install fixed: --entrypoint string was being passed as a single
  shell-line arg; switched to custom_args array
- Catalog cleaned of broken entries (dwn, endurain, ollama removed; nextcloud
  restored on docker.io)
- Bitcoin Core update path uses correct image (was looking for nonexistent
  lfg2025/bitcoin:28.4)
- ISO installs now allocate swap on the encrypted data partition

Infra:
- New resilience harness (scripts/resilience/) — black-box state-machine
  tester, every app × every transition. Run before each release.

Sweep #3 final: PASS 107 / FAIL 12 / SKIP 14. The 12 fails are 1 cosmetic
(homeassistant trusted_hosts), 8 harness/timing false-positives, and 3
non-shipstopper tracked items. Down from 23 in baseline sweep #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:31:45 -04:00

110 lines
4.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Resilience Harness
Black-box state-machine tester for archipelago app containers.
Drives the live RPC against a real archipelago + podman runtime on a target
host. For each app in `app-catalog/catalog.json`, runs every state transition
a user could trigger and asserts the system stays in the expected state.
## Why this exists
We shipped v1.7.43-alpha on .228 with three independent bugs that no unit test
caught:
1. `indeedhub-api` crashlooped 8500+ times because `stacks.rs` was missing 5
env vars (`QUEUE_HOST`/`QUEUE_PORT`/`DATABASE_PORT`/`S3_PRIVATE_BUCKET_NAME`/
`AES_MASTER_SECRET`) — the install "succeeded" (containers running) but the
API never became healthy.
2. `bitcoin-ui` shipped with a stale baked-in `Authorization: Basic …` header
from the registry image, so every `/bitcoin-rpc/` call returned 401.
3. The container-absence scanner evicted apps from the UI 14 seconds into
install (before image pull finished).
All three were exactly the kind of bug a "did the user-visible flow actually
work end to end?" test would catch — and the kind a single-file unit test
will never catch. This harness is the gate.
## Running
Against the .228 test node:
scripts/resilience/resilience.sh archipelago@192.168.1.228
Or non-interactive (CI):
RESILIENCE_SSH_PASS=… RESILIENCE_UI_PASS=… \
scripts/resilience/resilience.sh archipelago@192.168.1.228
Filters:
# Smoke test (3 apps, no reboot, ~15min)
scripts/resilience/resilience.sh archipelago@192.168.1.228 smoke
# Single app
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots
# Subset
scripts/resilience/resilience.sh archipelago@192.168.1.228 bitcoin-knots,lnd
Without a filter, the harness sweeps **every** app in the catalog
(~24 apps × 7 per-app transitions + 2 batch transitions) and runs the
batch transitions (archipelago.service restart, host reboot) at the end.
Full sweep is ~3-4 hours and **reboots the target host** as part of the
run — only point it at a dedicated test node.
## What it tests
Per-app transitions:
| # | Transition | Pass criteria |
|---|----------------------|------------------------------------------------|
| 1 | install | All containers reach `running` within 10 min |
| 2 | ui_probe | HTTP 2xx/3xx via `https://<host>/app/<id>/` |
| 3 | auth_probe | (bitcoin-rpc only) returns 200 not 401 |
| 4 | stop | All containers reach `exited` state |
| 5 | start | All containers reach `running` state |
| 6 | restart | All containers `running` after restart |
| 7 | uninstall | All containers absent, no residue |
Batch transitions (full sweep only):
| # | Transition | Pass criteria |
|---|-------------------------------|-------------------------------------|
| 8 | archipelago.service restart | Container set unchanged across |
| 9 | host reboot | Container set unchanged across |
Coverage by design — discovery rather than encoded metadata. The harness
snapshots `podman ps -a` before install, again after install stabilizes,
and the difference IS this app's container set. Works equally well for
single-container apps and 7-container stacks (indeedhub) without per-app
configuration.
## Output
JSON-lines results at `scripts/resilience/reports/<run_ts>/results.jsonl`:
{"ts":"…","app":"bitcoin-knots","transition":"install","status":"PASS","detail":"bitcoin-knots,archy-bitcoin-ui"}
{"ts":"…","app":"bitcoin-knots","transition":"auth_probe","status":"PASS","detail":"bitcoin-rpc HTTP 200"}
Exit code: `0` if every cell green, `1` if any red, `2` if setup failed
before tests began. Use as a release gate — refuse to tag if any cell red.
## Auth flow
The harness uses the same `auth.login` RPC that the UI uses, then carries
`session=…` and `csrf_token=…` cookies plus the `X-CSRF-Token` header on
every subsequent call. Re-logs in after archipelago.service restart and
host reboot.
## Caveats / known gaps
- App proxy probe (`/app/<id>/`) only validates the proxy responds — for
apps with deeper protocol behavior (lnd, fedimint, mempool) this only
catches "container alive, proxy reachable", not "the protocol is healthy".
- Multi-container stack assertions: the harness checks **every** new
container is `running`, so it would catch the indeedhub-api restart loop
while postgres/redis/minio looked fine.
- Host reboot test is destructive and slow — runs once at end of full sweep.
- `package.start`/`stop`/`restart` RPC methods may not exist for all apps;
failures are recorded and the harness continues.