From 470e3c649a58fde355585f9bb18a02e6d89718f9 Mon Sep 17 00:00:00 2001
From: archipelago
Date: Mon, 22 Jun 2026 06:17:23 -0400
Subject: [PATCH] =?UTF-8?q?docs(gate):=20ROOT-CAUSE=20the=20stop=20blocker?=
 =?UTF-8?q?=20=E2=80=94=20orchestrator=20ignores=20per-app=20stop=20grace?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Reproduced live on CLEAN .198: package.stop fedimint -> 'podman stop -t 30
timed out after 30s' -> stop fails -> state reverts to running. Real
fleet-wide bug (NOT .228 contamination). stop_timeout_secs() per-app grace
(bitcoin 600/lnd 330/electrumx 300/fedimint 60) is used by legacy stop paths
but NOT the orchestrator path: ContainerRuntime::stop_container hardcodes
API ?t=10 / CLI -t 30, and PODMAN_CLI_DEFAULT_TIMEOUT=30s == the -t grace so
the await fires as podman SIGKILLs. Fix = thread per-app grace + widen
wrapper deadline; owner picks table-based vs manifest-driven
stop_grace_secs. Re-escalated to blocker.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/PRODUCTION-MASTER-PLAN.md | 139 +++++++++++++++++----------------
 1 file changed, 71 insertions(+), 68 deletions(-)

diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md
index 5690fcd6..bc2c5f59 100644
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -190,64 +190,67 @@ frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + no
 nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering guard is
 KEPT on purpose (beneficial; not a blocker).
 
-### ⚠️ GATE FINDING 2026-06-22 — `package.stop` non-propagation (mostly self-inflicted on .228; verify on .198)
+### ⛔ GATE BLOCKER 2026-06-22 — `package.stop` ignores the per-app stop grace (REAL, fleet-wide, ROOT-CAUSED)
 
-Step 1 (sync .228 tcp-health manifest) is **DONE + verified** (frontend adopted, UI 200, no
-churn). Step 2 (the 5× gate) surfaced a `package.stop` failure — **but the headline cause turned
-out to be MY cascade-gate contaminating .228**, not a fundamental product gap. Severity downgraded
-from ⛔ to ⚠️ after the .198 ground-truth check (below). Still has a real robustness sub-bug.
+Step 1 (sync .228 tcp-health manifest) is **DONE + verified**. Step 2 (the 5× gate) surfaced a
+real, fleet-wide `package.stop` bug — **reproduced on the CLEAN, quadlet-correct .198**, so it is a
+genuine product bug, not node contamination. Root cause is fully pinned (below).
 
-**Symptom.** On (post-contamination) .228, `package.stop electrumx` returns `{"status":"stopping"}`
-but the container **never stops** — `container-list` shows `running` 66s+. The gate's
-`wait_for_container_status electrumx stopped 60` times out. Same hit bitcoin-knots/btcpay/fedimint/
-immich. **Contrast:** `filebrowser` stops correctly (`running → stopped` ~6s).
+**Symptom.** `package.stop ` returns `{"status":"stopping"}` but the container **never stops**
+(`container-list` shows `running` 60s+); the gate's `wait_for_container_status … stopped 60` times
+out. Hits **fedimint, electrumx, bitcoin-knots, btcpay-server, immich** (slow-to-SIGTERM apps).
+`filebrowser` passes because it exits on SIGTERM in <30s.
 
-**.198 GROUND TRUTH (decisive, checked 2026-06-22):** .198 (untouched today) **HAS quadlet
-`.container` files for backend apps** — `bitcoin-knots.container`, `btcpay-server.container`,
-`fedimint.container`, `filebrowser.container`, `indeedhub.container`, `gitea.container`,
-`grafana.container`, `botfights.container`, `archy-{btcpay-db,nbxplorer}.container`, the
-`fedimint*`/`indeedhub*` members, etc. **⇒ Quadlet IS the intended backend runtime.** .228 instead
-has NONE of these (only the 4 UI companions + home-assistant; `bitcoin-core.container` is
-`.disabled-20260506`). **So .228's plain-podman state is contamination:** my cascade-destructive
-gate UNINSTALLED its apps (removing the `.container` files) and my `package.start` restore brought
-them back as plain `podman run --restart=unless-stopped` **without regenerating the quadlet units**.
-`podman inspect electrumx` on .228 → `PODMAN_SYSTEMD_UNIT` EMPTY; `systemctl --user stop
-electrumx.service` → `Unit not loaded (rc=5)`. (NB: electrumx specifically shows no `PODMAN_SYSTEMD_UNIT`
-on .198 too — confirm whether electrumx has its own `.container` on .198; the listing was truncated.)
+**ROOT CAUSE (from .198 journal during a live `package.stop fedimint`):**
+```
+WARN quadlet: systemctl --user stop fedimint.service timed out after 45s
+ERROR runtime: package.stop fedimint failed: stop_container fedimint:
+  podman stop -t 30 fedimint timed out after 30s: deadline has elapsed
+```
+The orchestrator stop path **ignores the per-app graceful-stop table** and the wrapper deadline
+equals the grace:
+- `archipelago::api::rpc::package::runtime::stop_timeout_secs()` defines per-app grace
+  (**bitcoin 600s, lnd 330s, electrumx 300s, immich_postgres 120s, fedimint/btcpay 60s**, default 30).
+  The **legacy** stop paths use it (runtime.rs:329/607/1060 `podman stop -t `).
+- The **orchestrator** path does NOT: `prod_orchestrator::stop()` → `ContainerRuntime::stop_container`
+  (`container/src/runtime.rs:124`) → API `PodmanClient::stop_container` hardcodes **`?t=10`**
+  (podman_client.rs) and the CLI fallback hardcodes **`-t 30`** (runtime.rs:128). fedimint needs 60s
+  but gets 10s/30s ⇒ SIGTERM grace expires; the API/CLI stop errors out and the whole stop fails →
+  state reverts to `running`.
+- **Compounding:** `PODMAN_CLI_DEFAULT_TIMEOUT = 30s` (runtime.rs:9) wraps `podman stop -t 30`, so
+  the await fires **exactly** when podman would SIGKILL → "timed out after 30s" even though the kill
+  would land a moment later. The wrapper deadline must exceed the `-t` grace.
 
-**Two real sub-bugs remain (independent of the contamination):**
-1. **`package.start`/restore recreates a container as plain podman when its quadlet unit is missing**
-   instead of regenerating the `.container` unit — leaving it un-stoppable via systemctl. Should
-   reconcile the quadlet unit, not fall back to bare podman silently.
-2. **`prod_orchestrator::stop()` podman-fallback doesn't fire for electrumx-class apps.** Stop path
-   (prod_orchestrator.rs:2890): `loaded(app_id)?` → `quadlet::stop_service` (fail-soft) →
-   `runtime.stop_container` (podman). `compute_container_name(electrumx)` → bare `"electrumx"`
-   (correct target). filebrowser reaches the fallback and stops; electrumx does NOT ⇒ suspect
-   `loaded("electrumx")` erroring before the fallback AND the error not classed as
-   `is_unknown_app_id_error` (so `do_orchestrator_package_stop` never reaches `do_package_stop`).
-   Confirm by promoting the best-effort `install_log("STOP …")`/`STOP FAIL` to `tracing::error!`
-   (it was empty in .228's install log) and reading `loaded()` + `is_unknown_app_id_error`.
+**FIX (two parts, design choice flagged):**
+1. **Thread the per-app stop grace into the orchestrator stop path.** Either (A) move/duplicate
+   `stop_timeout_secs` into the `container` crate and have `stop_container` use it, (B) extend the
+   `ContainerRuntime::stop_container` signature to take a `grace: Duration` and have
+   `prod_orchestrator::stop()` compute it from the loaded manifest, or **(C, north-star-aligned)**
+   add a `stop_grace_secs` field to the manifest (default 30) and read it from `lm.manifest` in
+   `stop()`. (C) is the manifest-driven choice; bitcoin/lnd/electrumx/fedimint manifests then declare
+   their value. **DECISION NEEDED from owner: A/B (fast, table-based) vs C (manifest-driven).**
+2. **Make the CLI/API wrapper deadline = grace + buffer** (e.g. grace + 15s) so podman's SIGKILL
+   completes inside the await. Apply to both `PodmanClient::stop_container` (`?t=`+HTTP timeout) and
+   the `runtime.rs` CLI fallback (`-t`+`PODMAN_CLI_DEFAULT_TIMEOUT`).
+   Add a mock-orchestrator test: a container that ignores SIGTERM for >30s must still end `stopped`.
 
-**Correction to the status doc:** the "Quadlet-everywhere ~96%" survey may have mis-read the signal
-*on contaminated nodes*; .198 genuinely is quadlet, so re-survey from `.container` file presence +
-`PODMAN_SYSTEMD_UNIT`, not from "container running".
-- `prod_orchestrator::stop()` (prod_orchestrator.rs:2890) does: `self.loaded(app_id)?` →
-  `quadlet::stop_service("{name}.service")` (fails-soft for non-quadlet) → `runtime.stop_container`
-  (podman fallback). For electrumx, `compute_container_name` → bare `"electrumx"` (correct target).
-  filebrowser hits the podman fallback and stops; electrumx does NOT ⇒ suspect `loaded("electrumx")`
-  **erroring before the fallback** (manifest not loaded in orchestrator) AND the error not being
-  classed as `is_unknown_app_id_error` (so `do_orchestrator_package_stop` never falls back to the
-  plain-podman `do_package_stop`). **NEXT: confirm by capturing the `STOP:`/`STOP FAIL:` line**
-  (the best-effort install-log was empty on .228 — promote it to a `tracing::error!` so the failure
-  reason is visible in journalctl, or reproduce locally with the mock orchestrator) and inspecting
-  `loaded()` + `is_unknown_app_id_error`.
-- The **stop→stopped STATE reporting is correct** when the container actually stops: server.rs:1334
-  keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard (proven on filebrowser).
-  So the bug is purely "container never stops", not "state not reported".
+**Build/deploy after the fix:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`
+→ sideload to .228 + .198 (stop archipelago, cp binary, start) → **re-quadletize .228** (its backend
+`.container` files are gone from my cascade-gate contamination — reinstall its apps so units
+regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
 
-**Quadlet-vs-podman question: RESOLVED.** Quadlet is intended (.198 has the `.container` files;
-see ground-truth block above). No need to redesign — the work is (a) restore .228's quadlet units,
-(b) fix the two robustness sub-bugs, (c) re-run the canonical gate on a clean node.
+**Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
+runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
+indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain;
+`bitcoin-core.container` is `.disabled-20260506`) **because my cascade-gate uninstalled its apps and
+my `package.start` restore recreated them as bare `podman run --restart=unless-stopped`** without
+regenerating units. Two related hardening items: (a) `package.start` should regenerate a missing
+quadlet unit, not fall back to bare podman; (b) re-survey the status doc's "Quadlet-everywhere ~96%"
+from `.container`-file presence + `PODMAN_SYSTEMD_UNIT`, not from "container running".
+
+The **stop→stopped STATE reporting is correct** once the container actually stops (server.rs:1334
+keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard — proven on filebrowser); the
+bug is purely "container never stops", not "state not reported".
 
 ### MY-SESSION ERRATA (own it on resume)
 - I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
@@ -261,24 +264,24 @@ see ground-truth block above). No need to redesign — the work is (a) restore .
   → `Invalid Docker image format`.
 
 ### NEXT STEPS (in order)
-1. ✅ **DONE — .198 ground truth:** quadlet is intended (.198 has the backend `.container` files).
-2. **Run the CANONICAL gate on .198 FIRST** — it is the clean, properly-quadletized node (I did NOT
-   touch it today). `ARCHY_HOST=192.168.1.198 ARCHY_SCHEME=https ARCHY_PASSWORD='ThisIsWeb54321@'
-   ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`. NO cascade; never kill
-   mid-iteration. This tells us whether the stop bug reproduces on a quadlet-correct node (→ real
-   product bug) or was purely .228 contamination (→ just re-quadletize .228).
-3. **Restore .228's quadlet units** — properly reinstall its backend apps so `.container` files
-   regenerate (match .198). The cleanest route is the gate's own install path or a forced reconcile;
-   verify `.container` files reappear + `PODMAN_SYSTEMD_UNIT` is set, then re-run the gate on .228.
-4. **Fix the two robustness sub-bugs** (only if they reproduce on quadlet-correct nodes / as
-   hardening): (a) `package.start` must regenerate a missing quadlet unit, not fall back to bare
-   podman; (b) `prod_orchestrator::stop()` podman-fallback must fire when there's no quadlet unit
-   (`loaded()` failure / non-`unknown_app_id` error must not abort the stop). Add a mock-orchestrator
-   test reproducing electrumx-style "no quadlet unit" stop.
-5. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
+1. ✅ **DONE** — .198 ground truth (quadlet is intended) + **root cause pinned** (stop-grace bug
+   reproduced live on clean .198; it's a REAL fleet-wide bug, see blocker block above).
+2. **Fix the stop-grace bug** (the gate exit criterion now hinges on this): thread the per-app
+   `stop_timeout_secs` grace into `ContainerRuntime::stop_container` (API `?t=` + CLI `-t`) and make
+   the wrapper deadline = grace + buffer. **Owner decision: table-based (A/B) vs manifest-driven
+   `stop_grace_secs` (C).** Add a mock test: a SIGTERM-ignoring container must still end `stopped`.
+3. **Build + sideload** to .198 and .228 (`CARGO_INCREMENTAL=0 cargo build --release -p archipelago`;
+   stop archipelago, cp binary, start — containers survive).
+4. **Re-quadletize .228** (its backend `.container` files were wiped by my cascade-gate; reinstall
+   its apps so units regenerate, matching .198; verify `.container` files + `PODMAN_SYSTEMD_UNIT`).
+5. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
+   mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
+6. Hardening: (a) `package.start` should regenerate a missing quadlet unit, not fall back to bare
+   podman; (b) re-survey the status doc's quadlet % from `.container`-file presence.
+7. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
    config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
    install_netbird_stack in stacks.rs).
-6. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
+8. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
 
 ### KNOWN ISSUES / WATCH-OUTS
 - **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates