docs(gate): 3 stop bugs FIXED, electrumx suite GREEN on .228

Stop failure was 3 real product bugs (grace / reconcile-resurrection / container-list user-stopped state), all fixed (2dad64b2, 760a32bc, 6e49ce6f) + deployed. electrumx lifecycle suite 10/10 green (66s). fedimint 'crash loop' was probe-induced churn (stable when left alone). Validating breadth next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 09:49:45 -04:00
commit b090235b04
parent 6e49ce6f88
1 changed file with 33 additions and 38 deletions
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -247,30 +247,32 @@ regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
 regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
 (running→exited→removed) — no regression; the deployed binary's stop path works.

-**But validation revealed the gate failures are MULTI-CAUSED — the grace bug is only one of ~5:**
-1. ✅ FIXED — orchestrator ignored per-app stop grace (`podman stop -t 30` spurious 30s timeout).
-2. ⛔ **`fedimint` is crash-looping / unhealthy on BOTH nodes** (`health_monitor: Auto-restarting
-   unhealthy container: fedimint`, attempt 6/10). An app that won't stay up can't be cleanly
-   stopped — fedimint was a *confounded* test case. Needs a fedimint-health investigation
-   (why is its container unhealthy / why does host port 8173 not become reachable).
-   `health_monitor` DOES respect `user_stopped` (health_monitor.rs:983) so that part is correct.
-3. ⛔ **Host-listener repair watchdog** (`prod_orchestrator`: "host listener disappeared after
-   startup; restarting container app_id=fedimint") restarts containers whose launch port isn't
-   reachable — fights any stop of a port-unreachable app.
-4. ⚠️ **State-model nuance:** `vaultwarden` showed `exited`→`absent`, never `stopped`; the gate waits
-   for exactly `"stopped"` (`wait_for_container_status … stopped`). The `Exited→Stopped` conversion
-   (server.rs:1191, needs `user_stopped.contains(id)`) isn't always firing — likely an id-vs-name
-   key mismatch. The gate may need to accept `exited`/`absent` as terminal, or the conversion fixed.
-5. ⚠️ **Grace vs gate-timeout:** `electrumx` grace is 300s; if it ignores SIGQUIT the container
-   only dies at the 300s SIGKILL — far past the gate's 60s wait. `-t` is a *ceiling*, so a HEALTHY
-   electrumx that honours SIGQUIT stops fast; an unhealthy/ignoring one blows the gate window.
-   Decide: trim graces, make the gate's per-app stop-wait ≥ grace, or both.
-6. ⚠️ **.228 contamination** (plain podman, no quadlet units) — my cascade-gate; re-quadletize.
+**The gate stop-failure was MULTI-CAUSED (3 real product bugs) — all 3 now FIXED + the electrumx
+lifecycle suite is GREEN (10/10, 66s) on .228:**
+1. ✅ **Stop ignored per-app grace** (`podman stop -t 30` spurious 30s timeout) — commit `2dad64b2`.
+   Orchestrator now uses manifest `stop_grace_secs` → `stop_grace_secs_for()` table; deadline =
+   grace + 15s; applied to quadlet stop + API + CLI.
+2. ✅ **Reconciler resurrected user-stopped apps** — commit `760a32bc`. The reconcile filter's
+   `dependency_required` override re-included a user-stopped dependency (electrumx ← active mempool),
+   the in-memory `disabled` set is wiped on manifest reload, and the host-port "repair" then restarted
+   the stopped backend within ~8s. Fix: `ensure_running_with_mode` now bails `Left("user-stopped")`
+   when the on-disk `user_stopped` marker is set (the single choke point all reconcile flows through);
+   install/start clear the marker first so user actions are unaffected.
+3. ✅ **container-list reported user-stopped apps as `running`** — commit `6e49ce6f`. The backend was
+   Exited but its UI companion (electrs-ui/bitcoin-ui/…) kept serving the launch port, and the
+   state-refresh upgraded any reachable launch port to `running`. Fix: `handle_container_list` forces
+   `stopped` for `user_stopped` apps before the launch-port refresh.

-**Bottom line:** the grace fix is correct and shipped, but **the gate will not go green until #2–#6
-are addressed**. These are pre-existing product/health issues the gate is correctly surfacing, not
-regressions from this work. They need owner prioritization (esp. fedimint health, the watchdog-vs-
-stop interaction, and the gate's terminal-state acceptance).
+**Earlier theories now RESOLVED/superseded:** "fedimint crash-looping" was **probe-induced churn** —
+left alone, fedimint is stable (Up 48 min, 0 watchdog restarts/30 min); its restarts during testing
+were the host-port watchdog firing while I rapid-cycled stop/start (fixed by #2). "Exited→Stopped
+key mismatch" was actually the live-UI-companion launch-port issue (#3). "Grace vs gate-timeout"
+(electrumx 300s) was moot — a healthy electrumx honours SIGQUIT and stops in <1s.
+
+**Status: validating breadth.** electrumx suite GREEN on .228 (the previously-failing repro). Full
+single-iteration gate (all suites, DESTRUCTIVE) running on .228 to confirm the other apps; then .198,
+then the 5× canonical gate. `.228` is still contaminated (plain podman, no quadlet units), but the
+fixes are runtime-agnostic and electrumx passed there regardless. Re-quadletizing .228 + the 5× runs remain.

 **Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
 runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
@@ -297,22 +299,15 @@ bug is purely "container never stops", not "state not reported".
  → `Invalid Docker image format`.

 ### NEXT STEPS (in order)
-1. ✅ **DONE** — root-caused the stop-grace bug, fixed it (commit `2dad64b2`), unit-tested,
-   release-built, **deployed to .198 + .228**, validated no-regression (vaultwarden stops on .198).
-2. ⛔ **fedimint health** — why is its container unhealthy on both nodes (health_monitor restart
-   6/10; host port 8173 unreachable)? A crash-looping app can't pass the lifecycle gate. Likely the
-   real top blocker now. Same lens for any other unhealthy app surfaced by the gate.
-3. ⛔ **Host-listener repair vs user-stop** — the launch-port watchdog
-   (`prod_orchestrator`: "host listener disappeared after startup; restarting container") must NOT
-   restart a container the user just stopped. Check it consults `disabled`/`user_stopped`.
-4. ⚠️ **Gate terminal-state acceptance** — apps end `exited`/`absent`, not always `stopped`
-   (Exited→Stopped conversion at server.rs:1191 needs a matching `user_stopped` key). Either fix the
-   conversion (id-vs-name) or have `wait_for_container_status … stopped` accept exited/absent.
-5. ⚠️ **Grace vs gate-timeout** — trim over-long graces (electrumx 300s) and/or make the gate's
-   per-app stop-wait ≥ the app's grace.
-6. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
+1. ✅ **DONE** — the 3 stop bugs are fixed + deployed; electrumx lifecycle suite GREEN on .228:
+   - stop-grace (`2dad64b2`), reconcile-resurrection guard (`760a32bc`), container-list
+     user-stopped state (`6e49ce6f`). All compile + unit/quadlet tests green; deployed to .228.
+2. **Full single-iteration gate on .228** (running) — confirm bitcoin/btcpay/lnd/mempool/immich/
+   fedimint stop tests now pass too. Fix any stragglers (same 3-bug lens).
+3. **Deploy the 3-fix binary to .198** and run the gate there (the clean quadlet node).
+4. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
   units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
-7. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
+5. **Run the 5× canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
   mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
 8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
   re-survey the status doc's quadlet % from `.container`-file presence.
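
Illustrative note (not part of the commit): the three fixes described in the plan can be sketched roughly as below. This is a minimal model, not the project's code. Every name and signature (`resolve_stop_grace`, `stop_deadline`, `ensure_running`, `listed_status`, `EnsureOutcome`) is a hypothetical stand-in, and the 30s fallback is an assumption taken from the old hard-coded `-t 30`. Only the "grace + 15s" deadline, the electrumx 300s grace, the `"user-stopped"` bail-out, and the forced `stopped` status come from the text above.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Margin on top of the grace period; the plan states "deadline = grace + 15s".
const DEADLINE_MARGIN_SECS: u64 = 15;
/// Assumed fallback when nothing else is known (mirrors the old `-t 30`).
const DEFAULT_GRACE_SECS: u64 = 30;

/// Fix 1 (sketch): prefer the manifest value, then a built-in table, then the default.
fn resolve_stop_grace(app_id: &str, manifest_grace: Option<u64>) -> u64 {
    manifest_grace.unwrap_or(match app_id {
        "electrumx" => 300, // grace quoted in the plan
        _ => DEFAULT_GRACE_SECS,
    })
}

/// How long the caller waits before treating the stop as failed.
fn stop_deadline(grace_secs: u64) -> Duration {
    Duration::from_secs(grace_secs + DEADLINE_MARGIN_SECS)
}

/// Hypothetical stand-in for the `Left("user-stopped")` result mentioned in the plan.
#[derive(Debug, PartialEq)]
enum EnsureOutcome {
    Skipped(&'static str),
    Started,
}

/// Fix 2 (sketch): the user-stopped check runs first, so a `dependency_required`
/// override can no longer resurrect an app the user stopped.
fn ensure_running(
    app_id: &str,
    user_stopped: &HashSet<String>,
    _dependency_required: bool,
) -> EnsureOutcome {
    if user_stopped.contains(app_id) {
        return EnsureOutcome::Skipped("user-stopped");
    }
    EnsureOutcome::Started
}

/// Fix 3 (sketch): force `stopped` for user-stopped apps before the launch-port
/// refresh, which would otherwise upgrade any reachable port to `running`.
fn listed_status(
    app_id: &str,
    runtime_status: &str,
    launch_port_reachable: bool,
    user_stopped: &HashSet<String>,
) -> String {
    if user_stopped.contains(app_id) {
        return "stopped".to_string();
    }
    if launch_port_reachable {
        return "running".to_string();
    }
    runtime_status.to_string()
}
```

In this model a user-stopped `electrumx` is skipped by the reconcile path even when an active dependent marks it as required, and it is listed as `stopped` even while a UI companion keeps its launch port reachable.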