diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md
index b94a010c..45cc1b28 100644
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -247,30 +247,32 @@ regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
 regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
 (running→exited→removed) — no regression; the deployed binary's stop path works.
 
-**But validation revealed the gate failures are MULTI-CAUSED — the grace bug is only one of ~5:**
-1. ✅ FIXED — orchestrator ignored per-app stop grace (`podman stop -t 30` spurious 30s timeout).
-2. ⛔ **`fedimint` is crash-looping / unhealthy on BOTH nodes** (`health_monitor: Auto-restarting
-   unhealthy container: fedimint`, attempt 6/10). An app that won't stay up can't be cleanly
-   stopped — fedimint was a *confounded* test case. Needs a fedimint-health investigation
-   (why is its container unhealthy / why does host port 8173 not become reachable).
-   `health_monitor` DOES respect `user_stopped` (health_monitor.rs:983) so that part is correct.
-3. ⛔ **Host-listener repair watchdog** (`prod_orchestrator`: "host listener disappeared after
-   startup; restarting container app_id=fedimint") restarts containers whose launch port isn't
-   reachable — fights any stop of a port-unreachable app.
-4. ⚠️ **State-model nuance:** `vaultwarden` showed `exited`→`absent`, never `stopped`; the gate waits
-   for exactly `"stopped"` (`wait_for_container_status … stopped`). The `Exited→Stopped` conversion
-   (server.rs:1191, needs `user_stopped.contains(id)`) isn't always firing — likely an id-vs-name
-   key mismatch. The gate may need to accept `exited`/`absent` as terminal, or the conversion fixed.
-5. ⚠️ **Grace vs gate-timeout:** `electrumx` grace is 300s; if it ignores SIGQUIT the container
-   only dies at the 300s SIGKILL — far past the gate's 60s wait. `-t` is a *ceiling*, so a HEALTHY
-   electrumx that honours SIGQUIT stops fast; an unhealthy/ignoring one blows the gate window.
-   Decide: trim graces, make the gate's per-app stop-wait ≥ grace, or both.
-6. ⚠️ **.228 contamination** (plain podman, no quadlet units) — my cascade-gate; re-quadletize.
+**The gate stop-failure was MULTI-CAUSED (3 real product bugs) — all 3 now FIXED + the electrumx
+lifecycle suite is GREEN (10/10, 66s) on .228:**
+1. ✅ **Stop ignored per-app grace** (`podman stop -t 30` spurious 30s timeout) — commit `2dad64b2`.
+   Orchestrator now uses manifest `stop_grace_secs` → `stop_grace_secs_for()` table; deadline =
+   grace + 15s; applied to quadlet stop + API + CLI.
+2. ✅ **Reconciler resurrected user-stopped apps** — commit `760a32bc`. The reconcile filter's
+   `dependency_required` override re-included a user-stopped dependency (electrumx ← active mempool),
+   the in-memory `disabled` set is wiped on manifest reload, and the host-port "repair" then restarted
+   the stopped backend within ~8s. Fix: `ensure_running_with_mode` now bails `Left("user-stopped")`
+   when the on-disk `user_stopped` marker is set (the single choke point all reconcile flows through);
+   install/start clear the marker first so user actions are unaffected.
+3. ✅ **container-list reported user-stopped apps as `running`** — commit `6e49ce6f`. The backend was
+   Exited but its UI companion (electrs-ui/bitcoin-ui/…) kept serving the launch port, and the
+   state-refresh upgraded any reachable launch port to `running`. Fix: `handle_container_list` forces
+   `stopped` for `user_stopped` apps before the launch-port refresh.
 
-**Bottom line:** the grace fix is correct and shipped, but **the gate will not go green until #2–#6
-are addressed**. These are pre-existing product/health issues the gate is correctly surfacing, not
-regressions from this work. They need owner prioritization (esp. fedimint health, the watchdog-vs-
-stop interaction, and the gate's terminal-state acceptance).
+**Earlier theories now RESOLVED/superseded:** "fedimint crash-looping" was **probe-induced churn** —
+left alone, fedimint is stable (Up 48 min, 0 watchdog restarts/30 min); its restarts during testing
+were the host-port watchdog firing while I rapid-cycled stop/start (fixed by #2). "Exited→Stopped
+key mismatch" was actually the live-UI-companion launch-port issue (#3). "Grace vs gate-timeout"
+(electrumx 300s) was moot — a healthy electrumx honours SIGQUIT and stops in <1s.
+
+**Status: validating breadth.** electrumx suite GREEN on .228 (the previously-failing repro). Full
+single-iteration gate (all suites, DESTRUCTIVE) running on .228 to confirm the other apps; then .198,
+then the 5× canonical gate. `.228` is still contamination-flavored (plain podman) but the fixes are
+runtime-agnostic and electrumx passed there regardless. Re-quadletizing .228 + the 5× runs remain.
 
 **Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
 runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
@@ -297,22 +299,15 @@ bug is purely "container never stops", not "state not reported".
 → `Invalid Docker image format`.
 
 ### NEXT STEPS (in order)
-1. ✅ **DONE** — root-caused the stop-grace bug, fixed it (commit `2dad64b2`), unit-tested,
-   release-built, **deployed to .198 + .228**, validated no-regression (vaultwarden stops on .198).
-2. ⛔ **fedimint health** — why is its container unhealthy on both nodes (health_monitor restart
-   6/10; host port 8173 unreachable)? A crash-looping app can't pass the lifecycle gate. Likely the
-   real top blocker now. Same lens for any other unhealthy app surfaced by the gate.
-3. ⛔ **Host-listener repair vs user-stop** — the launch-port watchdog
-   (`prod_orchestrator`: "host listener disappeared after startup; restarting container") must NOT
-   restart a container the user just stopped. Check it consults `disabled`/`user_stopped`.
-4. ⚠️ **Gate terminal-state acceptance** — apps end `exited`/`absent`, not always `stopped`
-   (Exited→Stopped conversion at server.rs:1191 needs a matching `user_stopped` key). Either fix the
-   conversion (id-vs-name) or have `wait_for_container_status … stopped` accept exited/absent.
-5. ⚠️ **Grace vs gate-timeout** — trim over-long graces (electrumx 300s) and/or make the gate's
-   per-app stop-wait ≥ the app's grace.
-6. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
+1. ✅ **DONE** — the 3 stop bugs are fixed + deployed; electrumx lifecycle suite GREEN on .228:
+   - stop-grace (`2dad64b2`), reconcile-resurrection guard (`760a32bc`), container-list
+     user-stopped state (`6e49ce6f`). All compile + unit/quadlet tests green; deployed to .228.
+2. **Full single-iteration gate on .228** (running) — confirm bitcoin/btcpay/lnd/mempool/immich/
+   fedimint stop tests now pass too. Fix any stragglers (same 3-bug lens).
+3. **Deploy the 3-fix binary to .198** and run the gate there (the clean quadlet node).
+4. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
    units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
-7. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
+5. **Run the 5× canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
    mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
 8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
    re-survey the status doc's quadlet % from `.container`-file presence.