docs(gate): stop-grace fix shipped+validated; gate is multi-caused (5 issues)

Fix deployed to .198 and .228; vaultwarden stops cleanly (no regression). But validation showed the gate failures are multi-caused:
(2) fedimint is crash-looping/unhealthy on both nodes and can't be stopped;
(3) the host-listener repair watchdog restarts port-unreachable containers, fighting stop;
(4) the gate waits for 'stopped' but apps end 'exited'/'absent' (Exited->Stopped conversion, likely a key mismatch);
(5) grace vs the gate's 60s timeout (electrumx 300s);
(6) .228 contamination.

Documented and re-sequenced NEXT STEPS (fedimint health is the new top blocker).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 08:07:43 -04:00 · 29cd167894
commit 29cd167894
parent 2dad64b2ee
1 changed file with 53 additions and 15 deletions
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -239,6 +239,39 @@ equals the grace:
 `.container` files are gone from my cascade-gate contamination — reinstall its apps so units
 regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).

+### ✅/⚠️ FIX SHIPPED + VALIDATED 2026-06-22 — and the gate has MORE causes than the grace bug
+
+**Done:** the grace fix is implemented (option **C+table fallback**: manifest `stop_grace_secs` →
+`stop_grace_secs_for()` table; deadline = grace + 15s), unit-tested (3 tests green), committed
+(`2dad64b2`), release-built, and **deployed to BOTH .228 and .198** (active, UI 200). Quadlet
+regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
+(running→exited→removed) — no regression; the deployed binary's stop path works.
+
+**But validation revealed the gate failures are MULTI-CAUSED; the grace bug is only one of 6 causes:**
+1. ✅ FIXED — orchestrator ignored per-app stop grace (`podman stop -t 30` spurious 30s timeout).
+2. ⛔ **`fedimint` is crash-looping / unhealthy on BOTH nodes** (`health_monitor: Auto-restarting
+   unhealthy container: fedimint`, attempt 6/10). An app that won't stay up can't be cleanly
+   stopped — fedimint was a *confounded* test case. Needs a fedimint-health investigation
+   (why is its container unhealthy / why does host port 8173 not become reachable).
+   `health_monitor` DOES respect `user_stopped` (health_monitor.rs:983) so that part is correct.
+3. ⛔ **Host-listener repair watchdog** (`prod_orchestrator`: "host listener disappeared after
+   startup; restarting container app_id=fedimint") restarts containers whose launch port isn't
+   reachable — fights any stop of a port-unreachable app.
+4. ⚠️ **State-model nuance:** `vaultwarden` showed `exited`→`absent`, never `stopped`; the gate waits
+   for exactly `"stopped"` (`wait_for_container_status … stopped`). The `Exited→Stopped` conversion
+   (server.rs:1191, needs `user_stopped.contains(id)`) isn't always firing — likely an id-vs-name
+   key mismatch. The gate may need to accept `exited`/`absent` as terminal, or the conversion fixed.
+5. ⚠️ **Grace vs gate-timeout:** `electrumx` grace is 300s; if it ignores SIGQUIT the container
+   only dies at the 300s SIGKILL — far past the gate's 60s wait. `-t` is a *ceiling*, so a HEALTHY
+   electrumx that honours SIGQUIT stops fast; an unhealthy/ignoring one blows the gate window.
+   Decide: trim graces, make the gate's per-app stop-wait ≥ grace, or both.
+6. ⚠️ **.228 contamination** (plain podman, no quadlet units), caused by my cascade-gate; re-quadletize.
+
+**Bottom line:** the grace fix is correct and shipped, but **the gate will not go green until #2–#6
+are addressed**. These are pre-existing product/health issues the gate is correctly surfacing, not
+regressions from this work. They need owner prioritization (esp. fedimint health, the
+watchdog-vs-stop interaction, and the gate's terminal-state acceptance).
+
 **Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
 runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
 indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain;
@@ -264,24 +297,29 @@ bug is purely "container never stops", not "state not reported".
  → `Invalid Docker image format`.

 ### NEXT STEPS (in order)
-1. ✅ **DONE** — .198 ground truth (quadlet is intended) + **root cause pinned** (stop-grace bug
-   reproduced live on clean .198; it's a REAL fleet-wide bug, see blocker block above).
-2. **Fix the stop-grace bug** (the gate exit criterion now hinges on this): thread the per-app
-   `stop_timeout_secs` grace into `ContainerRuntime::stop_container` (API `?t=` + CLI `-t`) and make
-   the wrapper deadline = grace + buffer. **Owner decision: table-based (A/B) vs manifest-driven
-   `stop_grace_secs` (C).** Add a mock test: a SIGTERM-ignoring container must still end `stopped`.
-3. **Build + sideload** to .198 and .228 (`CARGO_INCREMENTAL=0 cargo build --release -p archipelago`;
-   stop archipelago, cp binary, start — containers survive).
-4. **Re-quadletize .228** (its backend `.container` files were wiped by my cascade-gate; reinstall
-   its apps so units regenerate, matching .198; verify `.container` files + `PODMAN_SYSTEMD_UNIT`).
-5. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
+1. ✅ **DONE** — root-caused the stop-grace bug, fixed it (commit `2dad64b2`), unit-tested,
+   release-built, **deployed to .198 + .228**, validated no-regression (vaultwarden stops on .198).
+2. ⛔ **fedimint health** — why is its container unhealthy on both nodes (health_monitor restart
+   6/10; host port 8173 unreachable)? A crash-looping app can't pass the lifecycle gate. Likely the
+   real top blocker now. Same lens for any other unhealthy app surfaced by the gate.
+3. ⛔ **Host-listener repair vs user-stop** — the launch-port watchdog
+   (`prod_orchestrator`: "host listener disappeared after startup; restarting container") must NOT
+   restart a container the user just stopped. Check that it consults `disabled`/`user_stopped`.
+4. ⚠️ **Gate terminal-state acceptance** — apps end `exited`/`absent`, not always `stopped`
+   (Exited→Stopped conversion at server.rs:1191 needs a matching `user_stopped` key). Either fix the
+   conversion (id-vs-name) or have `wait_for_container_status … stopped` accept exited/absent.
+5. ⚠️ **Grace vs gate-timeout** — trim over-long graces (electrumx 300s) and/or make the gate's
+   per-app stop-wait ≥ the app's grace.
+6. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
+   units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
+7. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
   mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
-6. Hardening: (a) `package.start` should regenerate a missing quadlet unit, not fall back to bare
-   podman; (b) re-survey the status doc's quadlet % from `.container`-file presence.
-7. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
+8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
+   re-survey the status doc's quadlet % from `.container`-file presence.
+9. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
   config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
   install_netbird_stack in stacks.rs).
-8. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
+10. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.

 ### KNOWN ISSUES / WATCH-OUTS
 - **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates