5 Commits

Author SHA1 Message Date
archipelago
41e7f500f8 test(lifecycle): tolerate slow-but-healthy heavy-app recovery under 5x churn
The 5x destructive gate on heavy nodes false-failed on transient windows
during stack recovery, not real regressions:

- immich.bats: lan_address port-publish probe 30s -> 90s. The postgres->redis
  ->server (DB migrations on boot) stack can take >30s to republish :2283 after
  a churn-induced recreate; destructive-tier immich tests already allow 180-240s.
- mempool.bats: orphan-container check now polls to steady state (<=30s) instead
  of a single-shot count, which caught a recreated member briefly visible
  alongside its replacement mid-reconcile.
- run-gate.sh: settle cap 180s -> 300s and also gate on immich's :2283 when
  installed, so the next iteration's read-only probe doesn't race a still-
  recovering stack. Settle returns the instant every probe is green.

A genuinely unexposed/orphaned/unhealthy app still fails these checks; they only
absorb the transient recreate window under sustained churn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 09:18:34 -04:00
archipelago
2afd18c6de test(gate): poll immich lan_address to absorb mid-recreate churn
5× run #4 flaked iter4 on "immich exposes its web UI lan-address
(port 2283)": container-list returned lan_address=null because
immich_server was momentarily mid-recreate when the read-only tier
queried it (passed the other 4 iterations; immich_server does publish
0.0.0.0:2283->2283). Same single-shot-read class as the bitcoin-knots
state probe — poll <=30s for the exposed port instead of one read. A
genuinely unexposed immich never publishes 2283, so real port drift
is still caught.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 03:20:18 -04:00
archipelago
53b8e47f1d test(gate): fix two false-failing lifecycle tests (not product bugs)
- immich restart: bump wait 120s->240s. Restart = ordered stop+start of the 3-
  container stack (postgres->redis->server w/ DB migrations), so it needs at least
  as long as the start test (180s) — the old 120s was inconsistent and false-failed
  on loaded nodes. immich does return to running.
- fedimint orphan check: the unanchored 'total' regex (^fedimint) counts the
  legitimate fedimint-clientd (dual-ecash bridge) but the anchored 'known' regex
  omitted it -> total>known false orphan on every node running fedimint-clientd.
  Add fedimint-clientd to known.

Both run as LOCAL podman/systemctl on the gate runner, so they test the runner node
(.116), not the RPC target — surfaced while driving the .228 gate green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 14:11:35 -04:00
archipelago
b0b54a96fa test(lifecycle): immich suite — package-level checks, wait-based destructive tier
container-list reports stack apps package-level (.name="immich"), so the suite
checks the "immich" package (presence, valid state, :2283 lan-address) rather than
individual container names. Destructive tier fires async stop/start/restart and
asserts on the end state via wait_for_container_status.

KNOWN: the destructive tier is flaky for slow multi-container stacks — bats runs
ops back-to-back with no settling while immich's async stack ops take 30s+, and
stopped reports as "exited" not "stopped". The immich migration itself is verified
working (manual stop/start/restart succeed; all 3 containers healthy). Hardening
the harness for stack apps (inter-op settling + stopped|exited acceptance) is a
follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 09:52:33 -04:00
archipelago
b1f175b927 test(lifecycle): add immich stack lifecycle suite
RPC-based (host-agnostic) lifecycle coverage for the manifest-driven immich stack
(immich + immich-postgres + immich-redis): presence + valid state of all 3 members,
a guard that no legacy underscore containers exist (catches botched migration /
legacy-installer fallback), destructive stop/start/restart of the server with
postgres+redis staying up, and cascade uninstall/reinstall (preserve_data).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 09:01:19 -04:00