docs(gate): 3 stop bugs FIXED, electrumx suite GREEN on .228

Stop failure was 3 real product bugs (grace / reconcile-resurrection /
container-list user-stopped state), all fixed (2dad64b2, 760a32bc, 6e49ce6f) +
deployed. electrumx lifecycle suite 10/10 green (66s). fedimint 'crash loop' was
probe-induced churn (stable when left alone). Validating breadth next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
archipelago 2026-06-22 09:49:45 -04:00
parent 6e49ce6f88
commit b090235b04


@@ -247,30 +247,32 @@ regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
(running→exited→removed) — no regression; the deployed binary's stop path works.
**But validation revealed the gate failures are MULTI-CAUSED — the grace bug is only one of ~5:**
1. ✅ FIXED — orchestrator ignored per-app stop grace (`podman stop -t 30` spurious 30s timeout).
2. ⛔ **`fedimint` is crash-looping / unhealthy on BOTH nodes** (`health_monitor: Auto-restarting
unhealthy container: fedimint`, attempt 6/10). An app that won't stay up can't be cleanly
stopped — fedimint was a *confounded* test case. Needs a fedimint-health investigation
(why is its container unhealthy / why does host port 8173 not become reachable).
`health_monitor` DOES respect `user_stopped` (health_monitor.rs:983) so that part is correct.
3. ⛔ **Host-listener repair watchdog** (`prod_orchestrator`: "host listener disappeared after
startup; restarting container app_id=fedimint") restarts containers whose launch port isn't
reachable — fights any stop of a port-unreachable app.
4. ⚠️ **State-model nuance:** `vaultwarden` showed `exited`→`absent`, never `stopped`; the gate waits
for exactly `"stopped"` (`wait_for_container_status … stopped`). The `Exited→Stopped` conversion
(server.rs:1191, needs `user_stopped.contains(id)`) isn't always firing — likely an id-vs-name
key mismatch. The gate may need to accept `exited`/`absent` as terminal, or the conversion fixed.
5. ⚠️ **Grace vs gate-timeout:** `electrumx` grace is 300s; if it ignores SIGQUIT the container
only dies at the 300s SIGKILL — far past the gate's 60s wait. `-t` is a *ceiling*, so a HEALTHY
electrumx that honours SIGQUIT stops fast; an unhealthy/ignoring one blows the gate window.
Decide: trim graces, make the gate's per-app stop-wait ≥ grace, or both.
6. ⚠️ **.228 contamination** (plain podman, no quadlet units) — my cascade-gate; re-quadletize.
**The gate stop-failure was MULTI-CAUSED (3 real product bugs) — all 3 now FIXED + the electrumx
lifecycle suite is GREEN (10/10, 66s) on .228:**
1. ✅ **Stop ignored per-app grace** (`podman stop -t 30` spurious 30s timeout) — commit `2dad64b2`.
Orchestrator now uses manifest `stop_grace_secs` / `stop_grace_secs_for()` table; deadline =
grace + 15s; applied to quadlet stop + API + CLI.
2. ✅ **Reconciler resurrected user-stopped apps** — commit `760a32bc`. The reconcile filter's
`dependency_required` override re-included a user-stopped dependency (electrumx ← active mempool),
the in-memory `disabled` set is wiped on manifest reload, and the host-port "repair" then restarted
the stopped backend within ~8s. Fix: `ensure_running_with_mode` now bails `Left("user-stopped")`
when the on-disk `user_stopped` marker is set (the single choke point all reconcile flows through);
install/start clear the marker first so user actions are unaffected.
3. ✅ **container-list reported user-stopped apps as `running`** — commit `6e49ce6f`. The backend was
Exited but its UI companion (electrs-ui/bitcoin-ui/…) kept serving the launch port, and the
state-refresh upgraded any reachable launch port to `running`. Fix: `handle_container_list` forces
`stopped` for `user_stopped` apps before the launch-port refresh.
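The three fixes above can be sketched as a minimal, self-contained model. This is an illustration under stated assumptions, not the code from the commits: the function signatures, the `Reconcile` enum (standing in for the `Left("user-stopped")` return), the 30s default, and the "manifest value first, then table" ordering are all hypothetical. Only the electrumx 300s grace, the +15s deadline slack, the `user_stopped` marker check, and the "force `stopped` before the launch-port refresh" ordering come from the notes.

```rust
use std::collections::HashSet;

/// Slack added on top of the grace before the caller gives up (from the notes: grace + 15s).
const STOP_DEADLINE_SLACK_SECS: u64 = 15;
/// Assumed fallback when neither the manifest nor the table has a value.
const DEFAULT_STOP_GRACE_SECS: u64 = 30;

/// Hypothetical stand-in for the per-app grace table.
fn stop_grace_secs_for(app_id: &str) -> Option<u64> {
    match app_id {
        "electrumx" => Some(300), // grace quoted in the notes
        _ => None,
    }
}

/// Fix 1: pick the per-app grace (assumed order: manifest, table, default).
fn effective_grace(app_id: &str, manifest_grace: Option<u64>) -> u64 {
    manifest_grace
        .or_else(|| stop_grace_secs_for(app_id))
        .unwrap_or(DEFAULT_STOP_GRACE_SECS)
}

/// Fix 1: the stop deadline is the grace plus a fixed slack.
fn stop_deadline_secs(grace_secs: u64) -> u64 {
    grace_secs + STOP_DEADLINE_SLACK_SECS
}

/// Outcome of one reconcile attempt (hypothetical type).
#[derive(Debug, PartialEq)]
enum Reconcile {
    Skipped(&'static str),
    Started,
}

/// Fix 2: every reconcile path checks the user-stopped marker first,
/// so a dependency override cannot resurrect a user-stopped app.
fn reconcile(app_id: &str, user_stopped: &HashSet<String>) -> Reconcile {
    if user_stopped.contains(app_id) {
        return Reconcile::Skipped("user-stopped");
    }
    Reconcile::Started
}

/// Fix 3: user-stopped wins over the launch-port refresh, because a UI
/// companion may still be serving the port while the backend is exited.
fn reported_status<'a>(
    app_id: &str,
    raw_status: &'a str,
    launch_port_reachable: bool,
    user_stopped: &HashSet<String>,
) -> &'a str {
    if user_stopped.contains(app_id) {
        return "stopped";
    }
    if launch_port_reachable {
        return "running";
    }
    raw_status
}

fn main() {
    let mut user_stopped = HashSet::new();
    user_stopped.insert("electrumx".to_string());

    let grace = effective_grace("electrumx", None);
    println!("grace={}s deadline={}s", grace, stop_deadline_secs(grace));
    println!("{:?}", reconcile("electrumx", &user_stopped));
    println!("{}", reported_status("electrumx", "exited", true, &user_stopped));
}
```

The point of the sketch is the ordering: the marker check happens before any dependency override or port probe can run, so it cannot be undone by a later step in the same pass.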
**Bottom line:** the grace fix is correct and shipped, but **the gate will not go green until #2–#6
are addressed**. These are pre-existing product/health issues the gate is correctly surfacing, not
regressions from this work. They need owner prioritization (esp. fedimint health, the watchdog-vs-
stop interaction, and the gate's terminal-state acceptance).
**Earlier theories now RESOLVED/superseded:** "fedimint crash-looping" was **probe-induced churn**:
left alone, fedimint is stable (Up 48 min, 0 watchdog restarts/30 min); its restarts during testing
were the host-port watchdog firing while I rapid-cycled stop/start (fixed by #2). "Exited→Stopped
key mismatch" was actually the live-UI-companion launch-port issue (#3). "Grace vs gate-timeout"
(electrumx 300s) was moot — a healthy electrumx honours SIGQUIT and stops in <1s.
**Status: validating breadth.** electrumx suite GREEN on .228 (the previously-failing repro). Full
single-iteration gate (all suites, DESTRUCTIVE) running on .228 to confirm the other apps; then .198,
then the 5× canonical gate. `.228` is still contamination-flavored (plain podman) but the fixes are
runtime-agnostic and electrumx passed there regardless. Re-quadletizing .228 + the 5× runs remain.
**Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
@@ -297,22 +299,15 @@ bug is purely "container never stops", not "state not reported".
`Invalid Docker image format`.
### NEXT STEPS (in order)
1. ✅ **DONE** — root-caused the stop-grace bug, fixed it (commit `2dad64b2`), unit-tested,
release-built, **deployed to .198 + .228**, validated no-regression (vaultwarden stops on .198).
2. ⛔ **fedimint health** — why is its container unhealthy on both nodes (health_monitor restart
6/10; host port 8173 unreachable)? A crash-looping app can't pass the lifecycle gate. Likely the
real top blocker now. Same lens for any other unhealthy app surfaced by the gate.
3. ⛔ **Host-listener repair vs user-stop** — the launch-port watchdog
(`prod_orchestrator`: "host listener disappeared after startup; restarting container") must NOT
restart a container the user just stopped. Check it consults `disabled`/`user_stopped`.
4. ⚠️ **Gate terminal-state acceptance** — apps end `exited`/`absent`, not always `stopped`
(Exited→Stopped conversion at server.rs:1191 needs a matching `user_stopped` key). Either fix the
conversion (id-vs-name) or have `wait_for_container_status … stopped` accept exited/absent.
5. ⚠️ **Grace vs gate-timeout** — trim over-long graces (electrumx 300s) and/or make the gate's
per-app stop-wait ≥ the app's grace.
6. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
1. ✅ **DONE** — the 3 stop bugs are fixed + deployed; electrumx lifecycle suite GREEN on .228:
- stop-grace (`2dad64b2`), reconcile-resurrection guard (`760a32bc`), container-list
user-stopped state (`6e49ce6f`). All compile + unit/quadlet tests green; deployed to .228.
2. **Full single-iteration gate on .228** (running) — confirm bitcoin/btcpay/lnd/mempool/immich/
fedimint stop tests now pass too. Fix any stragglers (same 3-bug lens).
3. **Deploy the 3-fix binary to .198** and run the gate there (the clean quadlet node).
4. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
7. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
5. **Run the 5× canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
re-survey the status doc's quadlet % from `.container`-file presence.