docs(gate): ROOT-CAUSE the stop blocker — orchestrator ignores per-app stop grace

Reproduced live on CLEAN .198: package.stop fedimint -> 'podman stop -t 30
timed out after 30s' -> stop fails -> state reverts to running. Real
fleet-wide bug (NOT .228 contamination).

stop_timeout_secs() per-app grace (bitcoin 600/lnd 330/electrumx 300/
fedimint 60) is used by legacy stop paths but NOT the orchestrator path:
ContainerRuntime::stop_container hardcodes API ?t=10 / CLI -t 30, and
PODMAN_CLI_DEFAULT_TIMEOUT=30s == the -t grace so the await fires as podman
SIGKILLs.

Fix = thread per-app grace + widen wrapper deadline; owner picks table-based
vs manifest-driven stop_grace_secs. Re-escalated to blocker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
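
A minimal Rust sketch of the fix direction described above, not the actual patch: the per-app grace is looked up from a table, and the wrapper deadline is derived as grace plus a buffer so the await outlives podman's SIGKILL. The names `stop_grace`, `stop_deadline`, `podman_stop_args`, the app-id keys, and the 15s buffer are illustrative assumptions; only the grace values come from this commit.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the per-app grace table. The second counts are
/// the ones quoted in this commit; the exact app-id keys are assumptions.
fn stop_grace(app_id: &str) -> Duration {
    let secs = match app_id {
        "bitcoin" => 600,
        "lnd" => 330,
        "electrumx" => 300,
        "immich_postgres" => 120,
        "fedimint" | "btcpay" => 60,
        _ => 30, // default grace
    };
    Duration::from_secs(secs)
}

/// Assumed buffer; the plan only says "e.g. grace + 15s".
const STOP_DEADLINE_BUFFER: Duration = Duration::from_secs(15);

/// Deadline for the code awaiting `podman stop`. It must be strictly longer
/// than the `-t` grace, otherwise the await fires just as podman SIGKILLs.
fn stop_deadline(grace: Duration) -> Duration {
    grace + STOP_DEADLINE_BUFFER
}

/// Arguments for the CLI fallback: `podman stop -t <grace-secs> <name>`.
fn podman_stop_args(name: &str, grace: Duration) -> Vec<String> {
    vec![
        "stop".into(),
        "-t".into(),
        grace.as_secs().to_string(),
        name.into(),
    ]
}

fn main() {
    for app in ["fedimint", "electrumx", "filebrowser"] {
        let grace = stop_grace(app);
        println!(
            "{app}: podman {} (await up to {}s)",
            podman_stop_args(app, grace).join(" "),
            stop_deadline(grace).as_secs()
        );
    }
}
```

Under these assumptions fedimint would get `-t 60` with a 75s await instead of `-t 30` inside a 30s await. The manifest-driven variant would replace the `match` with a field read and leave the deadline arithmetic unchanged.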
2026-06-22 06:17:23 -04:00 · 2026-06-22 06:17:23 -04:00 · 470e3c649a
commit 470e3c649a
parent a111d79a05
1 changed files with 71 additions and 68 deletions
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -190,64 +190,67 @@ frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + no
 nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
 guard is KEPT on purpose (beneficial; not a blocker).

-### ⚠️ GATE FINDING 2026-06-22 — `package.stop` non-propagation (mostly self-inflicted on .228; verify on .198)
+### ⛔ GATE BLOCKER 2026-06-22 — `package.stop` ignores the per-app stop grace (REAL, fleet-wide, ROOT-CAUSED)

-Step 1 (sync .228 tcp-health manifest) is **DONE + verified** (frontend adopted, UI 200, no
-churn). Step 2 (the 5× gate) surfaced a `package.stop` failure — **but the headline cause turned
-out to be MY cascade-gate contaminating .228**, not a fundamental product gap. Severity downgraded
-from ⛔ to ⚠️ after the .198 ground-truth check (below). Still has a real robustness sub-bug.
+Step 1 (sync .228 tcp-health manifest) is **DONE + verified**. Step 2 (the 5× gate) surfaced a
+real, fleet-wide `package.stop` bug — **reproduced on the CLEAN, quadlet-correct .198**, so it is a
+genuine product bug, not node contamination. Root cause is fully pinned (below).

-**Symptom.** On (post-contamination) .228, `package.stop electrumx` returns `{"status":"stopping"}`
-but the container **never stops** — `container-list` shows `running` 66s+. The gate's
-`wait_for_container_status electrumx stopped 60` times out. Same hit bitcoin-knots/btcpay/fedimint/
-immich. **Contrast:** `filebrowser` stops correctly (`running → stopped` ~6s).
+**Symptom.** `package.stop <app>` returns `{"status":"stopping"}` but the container **never stops**
+(`container-list` shows `running` 60s+); the gate's `wait_for_container_status … stopped 60` times
+out. Hits **fedimint, electrumx, bitcoin-knots, btcpay-server, immich** (slow-to-SIGTERM apps).
+`filebrowser` passes because it exits on SIGTERM in <30s.

-**.198 GROUND TRUTH (decisive, checked 2026-06-22):** .198 (untouched today) **HAS quadlet
-`.container` files for backend apps** — `bitcoin-knots.container`, `btcpay-server.container`,
-`fedimint.container`, `filebrowser.container`, `indeedhub.container`, `gitea.container`,
-`grafana.container`, `botfights.container`, `archy-{btcpay-db,nbxplorer}.container`, the
-`fedimint*`/`indeedhub*` members, etc. **⇒ Quadlet IS the intended backend runtime.** .228 instead
-has NONE of these (only the 4 UI companions + home-assistant; `bitcoin-core.container` is
-`.disabled-20260506`). **So .228's plain-podman state is contamination:** my cascade-destructive
-gate UNINSTALLED its apps (removing the `.container` files) and my `package.start` restore brought
-them back as plain `podman run --restart=unless-stopped` **without regenerating the quadlet units**.
-`podman inspect electrumx` on .228 → `PODMAN_SYSTEMD_UNIT` EMPTY; `systemctl --user stop
-electrumx.service` → `Unit not loaded (rc=5)`. (NB: electrumx specifically shows no `PODMAN_SYSTEMD_UNIT`
-on .198 too — confirm whether electrumx has its own `.container` on .198; the listing was truncated.)
+**ROOT CAUSE (from .198 journal during a live `package.stop fedimint`):**
+```
+WARN  quadlet: systemctl --user stop fedimint.service timed out after 45s
+ERROR runtime: package.stop fedimint failed: stop_container fedimint:
+      podman stop -t 30 fedimint timed out after 30s: deadline has elapsed
+```
+The orchestrator stop path **ignores the per-app graceful-stop table** and the wrapper deadline
+equals the grace:
+- `archipelago::api::rpc::package::runtime::stop_timeout_secs()` defines per-app grace
+  (**bitcoin 600s, lnd 330s, electrumx 300s, immich_postgres 120s, fedimint/btcpay 60s**, default 30).
+  The **legacy** stop paths use it (runtime.rs:329/607/1060 `podman stop -t <stop_timeout_secs>`).
+- The **orchestrator** path does NOT: `prod_orchestrator::stop()` → `ContainerRuntime::stop_container`
+  (`container/src/runtime.rs:124`) → API `PodmanClient::stop_container` hardcodes **`?t=10`**
+  (podman_client.rs) and the CLI fallback hardcodes **`-t 30`** (runtime.rs:128). fedimint needs 60s
+  but gets 10s/30s ⇒ SIGTERM grace expires; the API/CLI stop errors out and the whole stop fails →
+  state reverts to `running`.
+- **Compounding:** `PODMAN_CLI_DEFAULT_TIMEOUT = 30s` (runtime.rs:9) wraps `podman stop -t 30`, so
+  the await fires **exactly** when podman would SIGKILL → "timed out after 30s" even though the kill
+  would land a moment later. The wrapper deadline must exceed the `-t` grace.

-**Two real sub-bugs remain (independent of the contamination):**
-1. **`package.start`/restore recreates a container as plain podman when its quadlet unit is missing**
-   instead of regenerating the `.container` unit — leaving it un-stoppable via systemctl. Should
-   reconcile the quadlet unit, not fall back to bare podman silently.
-2. **`prod_orchestrator::stop()` podman-fallback doesn't fire for electrumx-class apps.** Stop path
-   (prod_orchestrator.rs:2890): `loaded(app_id)?` → `quadlet::stop_service` (fail-soft) →
-   `runtime.stop_container` (podman). `compute_container_name(electrumx)` → bare `"electrumx"`
-   (correct target). filebrowser reaches the fallback and stops; electrumx does NOT ⇒ suspect
-   `loaded("electrumx")` erroring before the fallback AND the error not classed as
-   `is_unknown_app_id_error` (so `do_orchestrator_package_stop` never reaches `do_package_stop`).
-   Confirm by promoting the best-effort `install_log("STOP …")`/`STOP FAIL` to `tracing::error!`
-   (it was empty in .228's install log) and reading `loaded()` + `is_unknown_app_id_error`.
+**FIX (two parts, design choice flagged):**
+1. **Thread the per-app stop grace into the orchestrator stop path.** Either (A) move/duplicate
+   `stop_timeout_secs` into the `container` crate and have `stop_container` use it, (B) extend the
+   `ContainerRuntime::stop_container` signature to take a `grace: Duration` and have
+   `prod_orchestrator::stop()` compute it from the loaded manifest, or **(C, north-star-aligned)**
+   add a `stop_grace_secs` field to the manifest (default 30) and read it from `lm.manifest` in
+   `stop()`. (C) is the manifest-driven choice; bitcoin/lnd/electrumx/fedimint manifests then declare
+   their value. **DECISION NEEDED from owner: A/B (fast, table-based) vs C (manifest-driven).**
+2. **Make the CLI/API wrapper deadline = grace + buffer** (e.g. grace + 15s) so podman's SIGKILL
+   completes inside the await. Apply to both `PodmanClient::stop_container` (`?t=`+HTTP timeout) and
+   the `runtime.rs` CLI fallback (`-t`+`PODMAN_CLI_DEFAULT_TIMEOUT`).
+   Add a mock-orchestrator test: a container that ignores SIGTERM for >30s must still end `stopped`.

-**Correction to the status doc:** the "Quadlet-everywhere ~96%" survey may have mis-read the signal
-*on contaminated nodes*; .198 genuinely is quadlet, so re-survey from `.container` file presence +
-`PODMAN_SYSTEMD_UNIT`, not from "container running".
- `prod_orchestrator::stop()` (prod_orchestrator.rs:2890) does: `self.loaded(app_id)?` →
-  `quadlet::stop_service("{name}.service")` (fails-soft for non-quadlet) → `runtime.stop_container`
-  (podman fallback). For electrumx, `compute_container_name` → bare `"electrumx"` (correct target).
-  filebrowser hits the podman fallback and stops; electrumx does NOT ⇒ suspect `loaded("electrumx")`
-  **erroring before the fallback** (manifest not loaded in orchestrator) AND the error not being
-  classed as `is_unknown_app_id_error` (so `do_orchestrator_package_stop` never falls back to the
-  plain-podman `do_package_stop`). **NEXT: confirm by capturing the `STOP:`/`STOP FAIL:` line**
-  (the best-effort install-log was empty on .228 — promote it to a `tracing::error!` so the failure
-  reason is visible in journalctl, or reproduce locally with the mock orchestrator) and inspecting
-  `loaded()` + `is_unknown_app_id_error`.
- The **stop→stopped STATE reporting is correct** when the container actually stops: server.rs:1334
-  keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard (proven on filebrowser).
-  So the bug is purely "container never stops", not "state not reported".
+**Build/deploy after the fix:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`
+→ sideload to .228 + .198 (stop archipelago, cp binary, start) → **re-quadletize .228** (its backend
+`.container` files are gone from my cascade-gate contamination — reinstall its apps so units
+regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).

-**Quadlet-vs-podman question: RESOLVED.** Quadlet is intended (.198 has the `.container` files;
-see ground-truth block above). No need to redesign — the work is (a) restore .228's quadlet units,
-(b) fix the two robustness sub-bugs, (c) re-run the canonical gate on a clean node.
+**Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
+runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
+indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain;
+`bitcoin-core.container` is `.disabled-20260506`) **because my cascade-gate uninstalled its apps and
+my `package.start` restore recreated them as bare `podman run --restart=unless-stopped`** without
+regenerating units. Two related hardening items: (a) `package.start` should regenerate a missing
+quadlet unit, not fall back to bare podman; (b) re-survey the status doc's "Quadlet-everywhere ~96%"
+from `.container`-file presence + `PODMAN_SYSTEMD_UNIT`, not from "container running".
+
+The **stop→stopped STATE reporting is correct** once the container actually stops (server.rs:1334
+keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard — proven on filebrowser); the
+bug is purely "container never stops", not "state not reported".

 ### MY-SESSION ERRATA (own it on resume)
 - I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
@@ -261,24 +264,24 @@ see ground-truth block above). No need to redesign — the work is (a) restore .
  → `Invalid Docker image format`.

 ### NEXT STEPS (in order)
-1. ✅ **DONE — .198 ground truth:** quadlet is intended (.198 has the backend `.container` files).
-2. **Run the CANONICAL gate on .198 FIRST** — it is the clean, properly-quadletized node (I did NOT
-   touch it today). `ARCHY_HOST=192.168.1.198 ARCHY_SCHEME=https ARCHY_PASSWORD='ThisIsWeb54321@'
-   ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`. NO cascade; never kill
-   mid-iteration. This tells us whether the stop bug reproduces on a quadlet-correct node (→ real
-   product bug) or was purely .228 contamination (→ just re-quadletize .228).
-3. **Restore .228's quadlet units** — properly reinstall its backend apps so `.container` files
-   regenerate (match .198). The cleanest route is the gate's own install path or a forced reconcile;
-   verify `.container` files reappear + `PODMAN_SYSTEMD_UNIT` is set, then re-run the gate on .228.
-4. **Fix the two robustness sub-bugs** (only if they reproduce on quadlet-correct nodes / as
-   hardening): (a) `package.start` must regenerate a missing quadlet unit, not fall back to bare
-   podman; (b) `prod_orchestrator::stop()` podman-fallback must fire when there's no quadlet unit
-   (`loaded()` failure / non-`unknown_app_id` error must not abort the stop). Add a mock-orchestrator
-   test reproducing electrumx-style "no quadlet unit" stop.
-5. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
+1. ✅ **DONE** — .198 ground truth (quadlet is intended) + **root cause pinned** (stop-grace bug
+   reproduced live on clean .198; it's a REAL fleet-wide bug, see blocker block above).
+2. **Fix the stop-grace bug** (the gate exit criterion now hinges on this): thread the per-app
+   `stop_timeout_secs` grace into `ContainerRuntime::stop_container` (API `?t=` + CLI `-t`) and make
+   the wrapper deadline = grace + buffer. **Owner decision: table-based (A/B) vs manifest-driven
+   `stop_grace_secs` (C).** Add a mock test: a SIGTERM-ignoring container must still end `stopped`.
+3. **Build + sideload** to .198 and .228 (`CARGO_INCREMENTAL=0 cargo build --release -p archipelago`;
+   stop archipelago, cp binary, start — containers survive).
+4. **Re-quadletize .228** (its backend `.container` files were wiped by my cascade-gate; reinstall
+   its apps so units regenerate, matching .198; verify `.container` files + `PODMAN_SYSTEMD_UNIT`).
+5. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
+   mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
+6. Hardening: (a) `package.start` should regenerate a missing quadlet unit, not fall back to bare
+   podman; (b) re-survey the status doc's quadlet % from `.container`-file presence.
+7. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
   config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
   install_netbird_stack in stacks.rs).
-6. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
+8. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.

 ### KNOWN ISSUES / WATCH-OUTS
 - **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates