From 29cd1678949458514f09d87614856ef6a1fc4115 Mon Sep 17 00:00:00 2001
From: archipelago
Date: Mon, 22 Jun 2026 08:07:43 -0400
Subject: [PATCH] docs(gate): stop-grace fix shipped+validated; gate is multi-caused (5 issues)

Fix deployed to .198+.228, vaultwarden stops clean (no regression). But validation showed the gate failures are multi-caused: (2) fedimint crash-looping/unhealthy on both nodes can't be stopped; (3) host-listener repair watchdog restarts port-unreachable containers fighting stop; (4) gate waits for 'stopped' but apps end 'exited'/'absent' (Exited->Stopped conversion key mismatch); (5) grace vs 60s gate-timeout (electrumx 300s); (6) .228 contamination. Documented + re-sequenced NEXT STEPS (fedimint health is the new top blocker).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/PRODUCTION-MASTER-PLAN.md | 68 ++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 15 deletions(-)

diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md
index bc2c5f59..b94a010c 100644
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -239,6 +239,39 @@ equals the grace:
 `.container` files are gone from my cascade-gate contamination — reinstall its apps so units
 regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
 
+### ✅/⚠️ FIX SHIPPED + VALIDATED 2026-06-22 — and the gate has MORE causes than the grace bug
+
+**Done:** the grace fix is implemented (option **C+table fallback**: manifest `stop_grace_secs` →
+`stop_grace_secs_for()` table; deadline = grace + 15s), unit-tested (3 tests green), committed
+(`2dad64b2`), release-built, and **deployed to BOTH .228 and .198** (active, UI 200). Quadlet
+regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
+(running→exited→removed) — no regression; the deployed binary's stop path works.
+
+**But validation revealed the gate failures are MULTI-CAUSED — the grace bug is only one of ~5:**
+1. ✅ FIXED — orchestrator ignored per-app stop grace (`podman stop -t 30` spurious 30s timeout).
+2. ⛔ **`fedimint` is crash-looping / unhealthy on BOTH nodes** (`health_monitor: Auto-restarting
+   unhealthy container: fedimint`, attempt 6/10). An app that won't stay up can't be cleanly
+   stopped — fedimint was a *confounded* test case. Needs a fedimint-health investigation
+   (why is its container unhealthy / why does host port 8173 not become reachable).
+   `health_monitor` DOES respect `user_stopped` (health_monitor.rs:983) so that part is correct.
+3. ⛔ **Host-listener repair watchdog** (`prod_orchestrator`: "host listener disappeared after
+   startup; restarting container app_id=fedimint") restarts containers whose launch port isn't
+   reachable — fights any stop of a port-unreachable app.
+4. ⚠️ **State-model nuance:** `vaultwarden` showed `exited`→`absent`, never `stopped`; the gate waits
+   for exactly `"stopped"` (`wait_for_container_status … stopped`). The `Exited→Stopped` conversion
+   (server.rs:1191, needs `user_stopped.contains(id)`) isn't always firing — likely an id-vs-name
+   key mismatch. The gate may need to accept `exited`/`absent` as terminal, or the conversion fixed.
+5. ⚠️ **Grace vs gate-timeout:** `electrumx` grace is 300s; if it ignores SIGQUIT the container
+   only dies at the 300s SIGKILL — far past the gate's 60s wait. `-t` is a *ceiling*, so a HEALTHY
+   electrumx that honours SIGQUIT stops fast; an unhealthy/ignoring one blows the gate window.
+   Decide: trim graces, make the gate's per-app stop-wait ≥ grace, or both.
+6. ⚠️ **.228 contamination** (plain podman, no quadlet units) — my cascade-gate; re-quadletize.
+
+**Bottom line:** the grace fix is correct and shipped, but **the gate will not go green until #2–#6
+are addressed**. These are pre-existing product/health issues the gate is correctly surfacing, not
+regressions from this work. They need owner prioritization (esp. fedimint health, the watchdog-vs-
+stop interaction, and the gate's terminal-state acceptance).
+
 **Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
 runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
 indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain;
@@ -264,24 +297,29 @@ bug is purely "container never stops", not "state not reported".
 → `Invalid Docker image format`.
 
 ### NEXT STEPS (in order)
-1. ✅ **DONE** — .198 ground truth (quadlet is intended) + **root cause pinned** (stop-grace bug
-   reproduced live on clean .198; it's a REAL fleet-wide bug, see blocker block above).
-2. **Fix the stop-grace bug** (the gate exit criterion now hinges on this): thread the per-app
-   `stop_timeout_secs` grace into `ContainerRuntime::stop_container` (API `?t=` + CLI `-t`) and make
-   the wrapper deadline = grace + buffer. **Owner decision: table-based (A/B) vs manifest-driven
-   `stop_grace_secs` (C).** Add a mock test: a SIGTERM-ignoring container must still end `stopped`.
-3. **Build + sideload** to .198 and .228 (`CARGO_INCREMENTAL=0 cargo build --release -p archipelago`;
-   stop archipelago, cp binary, start — containers survive).
-4. **Re-quadletize .228** (its backend `.container` files were wiped by my cascade-gate; reinstall
-   its apps so units regenerate, matching .198; verify `.container` files + `PODMAN_SYSTEMD_UNIT`).
-5. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
+1. ✅ **DONE** — root-caused the stop-grace bug, fixed it (commit `2dad64b2`), unit-tested,
+   release-built, **deployed to .198 + .228**, validated no-regression (vaultwarden stops on .198).
+2. ⛔ **fedimint health** — why is its container unhealthy on both nodes (health_monitor restart
+   6/10; host port 8173 unreachable)? A crash-looping app can't pass the lifecycle gate. Likely the
+   real top blocker now. Same lens for any other unhealthy app surfaced by the gate.
+3. ⛔ **Host-listener repair vs user-stop** — the launch-port watchdog
+   (`prod_orchestrator`: "host listener disappeared after startup; restarting container") must NOT
+   restart a container the user just stopped. Check it consults `disabled`/`user_stopped`.
+4. ⚠️ **Gate terminal-state acceptance** — apps end `exited`/`absent`, not always `stopped`
+   (Exited→Stopped conversion at server.rs:1191 needs a matching `user_stopped` key). Either fix the
+   conversion (id-vs-name) or have `wait_for_container_status … stopped` accept exited/absent.
+5. ⚠️ **Grace vs gate-timeout** — trim over-long graces (electrumx 300s) and/or make the gate's
+   per-app stop-wait ≥ the app's grace.
+6. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
+   units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
+7. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
    mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
-6. Hardening: (a) `package.start` should regenerate a missing quadlet unit, not fall back to bare
-   podman; (b) re-survey the status doc's quadlet % from `.container`-file presence.
-7. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
+8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
+   re-survey the status doc's quadlet % from `.container`-file presence.
+9. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
    config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
    install_netbird_stack in stacks.rs).
-8. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
+10. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
 
 ### KNOWN ISSUES / WATCH-OUTS
 - **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates