docs(gate): ROOT-CAUSE the stop blocker — orchestrator ignores per-app stop grace

Reproduced live on CLEAN .198: package.stop fedimint -> 'podman stop -t 30
timed out after 30s' -> stop fails -> state reverts to running. Real fleet-wide
bug (NOT .228 contamination). stop_timeout_secs() per-app grace (bitcoin 600/lnd
330/electrumx 300/fedimint 60) is used by legacy stop paths but NOT the
orchestrator path: ContainerRuntime::stop_container hardcodes API ?t=10 / CLI
-t 30, and PODMAN_CLI_DEFAULT_TIMEOUT=30s == the -t grace so the await fires as
podman SIGKILLs. Fix = thread per-app grace + widen wrapper deadline; owner picks
table-based vs manifest-driven stop_grace_secs. Re-escalated to blocker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
archipelago 2026-06-22 06:17:23 -04:00
parent a111d79a05
commit 470e3c649a


@@ -190,64 +190,67 @@ frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + no
nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
guard is KEPT on purpose (beneficial; not a blocker).
### ⚠️ GATE FINDING 2026-06-22 — `package.stop` non-propagation (mostly self-inflicted on .228; verify on .198)
### ⛔ GATE BLOCKER 2026-06-22 — `package.stop` ignores the per-app stop grace (REAL, fleet-wide, ROOT-CAUSED)
Step 1 (sync .228 tcp-health manifest) is **DONE + verified** (frontend adopted, UI 200, no
churn). Step 2 (the 5× gate) surfaced a `package.stop` failure — **but the headline cause turned
out to be MY cascade-gate contaminating .228**, not a fundamental product gap. Severity downgraded
from ⛔ to ⚠️ after the .198 ground-truth check (below). Still has a real robustness sub-bug.
Step 1 (sync .228 tcp-health manifest) is **DONE + verified**. Step 2 (the 5× gate) surfaced a
real, fleet-wide `package.stop` bug — **reproduced on the CLEAN, quadlet-correct .198**, so it is a
genuine product bug, not node contamination. Root cause is fully pinned (below).
**Symptom.** On (post-contamination) .228, `package.stop electrumx` returns `{"status":"stopping"}`
but the container **never stops**: `container-list` shows `running` 66s+. The gate's
`wait_for_container_status electrumx stopped 60` times out. Same hit bitcoin-knots/btcpay/fedimint/
immich. **Contrast:** `filebrowser` stops correctly (`running → stopped` ~6s).
**Symptom.** `package.stop <app>` returns `{"status":"stopping"}` but the container **never stops**
(`container-list` shows `running` 60s+); the gate's `wait_for_container_status … stopped 60` times
out. Hits **fedimint, electrumx, bitcoin-knots, btcpay-server, immich** (slow-to-SIGTERM apps).
`filebrowser` passes because it exits on SIGTERM in <30s.
**.198 GROUND TRUTH (decisive, checked 2026-06-22):** .198 (untouched today) **HAS quadlet
`.container` files for backend apps** — `bitcoin-knots.container`, `btcpay-server.container`,
`fedimint.container`, `filebrowser.container`, `indeedhub.container`, `gitea.container`,
`grafana.container`, `botfights.container`, `archy-{btcpay-db,nbxplorer}.container`, the
`fedimint*`/`indeedhub*` members, etc. **⇒ Quadlet IS the intended backend runtime.** .228 instead
has NONE of these (only the 4 UI companions + home-assistant; `bitcoin-core.container` is
`.disabled-20260506`). **So .228's plain-podman state is contamination:** my cascade-destructive
gate UNINSTALLED its apps (removing the `.container` files) and my `package.start` restore brought
them back as plain `podman run --restart=unless-stopped` **without regenerating the quadlet units**.
`podman inspect electrumx` on .228 → `PODMAN_SYSTEMD_UNIT` EMPTY; `systemctl --user stop
electrumx.service` → `Unit not loaded (rc=5)`. (NB: electrumx specifically shows no `PODMAN_SYSTEMD_UNIT`
on .198 too — confirm whether electrumx has its own `.container` on .198; the listing was truncated.)
**ROOT CAUSE (from .198 journal during a live `package.stop fedimint`):**
```
WARN quadlet: systemctl --user stop fedimint.service timed out after 45s
ERROR runtime: package.stop fedimint failed: stop_container fedimint:
podman stop -t 30 fedimint timed out after 30s: deadline has elapsed
```
The orchestrator stop path **ignores the per-app graceful-stop table** and the wrapper deadline
equals the grace:
- `archipelago::api::rpc::package::runtime::stop_timeout_secs()` defines per-app grace
(**bitcoin 600s, lnd 330s, electrumx 300s, immich_postgres 120s, fedimint/btcpay 60s**, default 30).
The **legacy** stop paths use it (runtime.rs:329/607/1060 `podman stop -t <stop_timeout_secs>`).
- The **orchestrator** path does NOT: `prod_orchestrator::stop()` → `ContainerRuntime::stop_container`
(`container/src/runtime.rs:124`) → API `PodmanClient::stop_container` hardcodes **`?t=10`**
(podman_client.rs) and the CLI fallback hardcodes **`-t 30`** (runtime.rs:128). fedimint needs 60s
but gets 10s/30s ⇒ SIGTERM grace expires; the API/CLI stop errors out and the whole stop fails →
state reverts to `running`.
- **Compounding:** `PODMAN_CLI_DEFAULT_TIMEOUT = 30s` (runtime.rs:9) wraps `podman stop -t 30`, so
the await fires **exactly** when podman would SIGKILL → "timed out after 30s" even though the kill
would land a moment later. The wrapper deadline must exceed the `-t` grace.
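The timing race described in the bullets above can be modelled in a few lines. This is a minimal sketch, not the project's code: `stop_grace_secs` only mirrors the values quoted for `stop_timeout_secs()` (the real match keys are an assumption), and `StopOutcome`, `simulate_stop`, and the exit/kill latencies passed to it are hypothetical names and values chosen for illustration.

```rust
use std::time::Duration;

/// Per-app graceful-stop window in seconds, mirroring the values quoted above.
/// ASSUMPTION: the real `stop_timeout_secs()` may match on different identifiers.
pub fn stop_grace_secs(app_id: &str) -> u64 {
    match app_id {
        "bitcoin" => 600,
        "lnd" => 330,
        "electrumx" => 300,
        "immich_postgres" => 120,
        "fedimint" | "btcpay" => 60,
        _ => 30,
    }
}

/// What the caller awaiting the stop observes (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum StopOutcome {
    Stopped,
    WrapperTimedOut,
}

/// Simplified model of one stop attempt (hypothetical helper).
/// - `exits_after`: how long the app needs after SIGTERM (`None` = ignores SIGTERM).
/// - `grace`: the stop grace handed to podman before it sends SIGKILL.
/// - `kill_latency`: time for SIGKILL plus cleanup to complete.
/// - `wrapper_deadline`: when the caller stops waiting for podman.
pub fn simulate_stop(
    exits_after: Option<Duration>,
    grace: Duration,
    kill_latency: Duration,
    wrapper_deadline: Duration,
) -> StopOutcome {
    // The container is gone either when it exits by itself inside the grace
    // window, or shortly after podman escalates to SIGKILL.
    let finished_at = match exits_after {
        Some(t) if t <= grace => t,
        _ => grace + kill_latency,
    };
    // A deadline equal to the grace always loses the race against SIGKILL.
    if finished_at < wrapper_deadline {
        StopOutcome::Stopped
    } else {
        StopOutcome::WrapperTimedOut
    }
}
```

In this model, a container that needs 45 s to exit yields `WrapperTimedOut` with today's values (grace 30 s, deadline 30 s), and `Stopped` once the grace is 60 s and the deadline is 75 s. A container that exits after 6 s, like filebrowser, is `Stopped` in both cases.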
**Two real sub-bugs remain (independent of the contamination):**
1. **`package.start`/restore recreates a container as plain podman when its quadlet unit is missing**
instead of regenerating the `.container` unit — leaving it un-stoppable via systemctl. Should
reconcile the quadlet unit, not fall back to bare podman silently.
2. **`prod_orchestrator::stop()` podman-fallback doesn't fire for electrumx-class apps.** Stop path
(prod_orchestrator.rs:2890): `loaded(app_id)?` → `quadlet::stop_service` (fail-soft) →
`runtime.stop_container` (podman). `compute_container_name(electrumx)` → bare `"electrumx"`
(correct target). filebrowser reaches the fallback and stops; electrumx does NOT ⇒ suspect
`loaded("electrumx")` erroring before the fallback AND the error not classed as
`is_unknown_app_id_error` (so `do_orchestrator_package_stop` never reaches `do_package_stop`).
Confirm by promoting the best-effort `install_log("STOP …")`/`STOP FAIL` to `tracing::error!`
(it was empty in .228's install log) and reading `loaded()` + `is_unknown_app_id_error`.
**FIX (two parts, design choice flagged):**
1. **Thread the per-app stop grace into the orchestrator stop path.** Either (A) move/duplicate
`stop_timeout_secs` into the `container` crate and have `stop_container` use it, (B) extend the
`ContainerRuntime::stop_container` signature to take a `grace: Duration` and have
`prod_orchestrator::stop()` compute it from the loaded manifest, or **(C, north-star-aligned)**
add a `stop_grace_secs` field to the manifest (default 30) and read it from `lm.manifest` in
`stop()`. (C) is the manifest-driven choice; bitcoin/lnd/electrumx/fedimint manifests then declare
their value. **DECISION NEEDED from owner: A/B (fast, table-based) vs C (manifest-driven).**
2. **Make the CLI/API wrapper deadline = grace + buffer** (e.g. grace + 15s) so podman's SIGKILL
completes inside the await. Apply to both `PodmanClient::stop_container` (`?t=`+HTTP timeout) and
the `runtime.rs` CLI fallback (`-t`+`PODMAN_CLI_DEFAULT_TIMEOUT`).
Add a mock-orchestrator test: a container that ignores SIGTERM for >30s must still end `stopped`.
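One possible shape for the two fix parts, sketched under assumptions: the `Manifest` struct, `stop_grace`, `wrapper_deadline`, `podman_stop_args`, and `podman_api_stop_query` are hypothetical names, the `stop_grace_secs` field follows option (C), and the 15 s buffer is the example value from item 2. The only external facts relied on are that `podman stop` accepts `-t <seconds>` and that the API call carries the grace as `t=`, as quoted in the root-cause notes.

```rust
use std::time::Duration;

/// Default grace when a manifest declares nothing (30 s, per option C).
const DEFAULT_STOP_GRACE_SECS: u64 = 30;
/// Example buffer from fix item 2; the final value is still an open choice.
const STOP_DEADLINE_BUFFER: Duration = Duration::from_secs(15);

/// Hypothetical slice of an app manifest carrying the proposed field.
pub struct Manifest {
    pub stop_grace_secs: Option<u64>,
}

/// Grace to hand to podman: the manifest value, or the default.
pub fn stop_grace(manifest: &Manifest) -> Duration {
    Duration::from_secs(manifest.stop_grace_secs.unwrap_or(DEFAULT_STOP_GRACE_SECS))
}

/// Deadline for the surrounding await: strictly longer than the grace,
/// so podman's SIGKILL can land before the caller gives up.
pub fn wrapper_deadline(grace: Duration) -> Duration {
    grace + STOP_DEADLINE_BUFFER
}

/// Arguments for the CLI fallback, i.e. `podman stop -t <secs> <name>`.
pub fn podman_stop_args(name: &str, grace: Duration) -> Vec<String> {
    vec![
        "stop".to_string(),
        "-t".to_string(),
        grace.as_secs().to_string(),
        name.to_string(),
    ]
}

/// Query string for the API path, replacing the hardcoded `t=10`.
pub fn podman_api_stop_query(grace: Duration) -> String {
    format!("t={}", grace.as_secs())
}
```

Under options (A) or (B) only `stop_grace` would change (a table lookup instead of a manifest read). The deadline and argument helpers stay the same, which keeps the A/B/C decision independent of the deadline fix.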
**Correction to the status doc:** the "Quadlet-everywhere ~96%" survey may have mis-read the signal
*on contaminated nodes*; .198 genuinely is quadlet, so re-survey from `.container` file presence +
`PODMAN_SYSTEMD_UNIT`, not from "container running".
- `prod_orchestrator::stop()` (prod_orchestrator.rs:2890) does: `self.loaded(app_id)?` →
`quadlet::stop_service("{name}.service")` (fails-soft for non-quadlet) → `runtime.stop_container`
(podman fallback). For electrumx, `compute_container_name` → bare `"electrumx"` (correct target).
filebrowser hits the podman fallback and stops; electrumx does NOT ⇒ suspect `loaded("electrumx")`
**erroring before the fallback** (manifest not loaded in orchestrator) AND the error not being
classed as `is_unknown_app_id_error` (so `do_orchestrator_package_stop` never falls back to the
plain-podman `do_package_stop`). **NEXT: confirm by capturing the `STOP:`/`STOP FAIL:` line**
(the best-effort install-log was empty on .228 — promote it to a `tracing::error!` so the failure
reason is visible in journalctl, or reproduce locally with the mock orchestrator) and inspecting
`loaded()` + `is_unknown_app_id_error`.
- The **stop→stopped STATE reporting is correct** when the container actually stops: server.rs:1334
keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard (proven on filebrowser).
So the bug is purely "container never stops", not "state not reported".
**Build/deploy after the fix:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`
→ sideload to .228 + .198 (stop archipelago, cp binary, start) → **re-quadletize .228** (its backend
`.container` files are gone from my cascade-gate contamination — reinstall its apps so units
regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
**Quadlet-vs-podman question: RESOLVED.** Quadlet is intended (.198 has the `.container` files;
see ground-truth block above). No need to redesign — the work is (a) restore .228's quadlet units,
(b) fix the two robustness sub-bugs, (c) re-run the canonical gate on a clean node.
**Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain;
`bitcoin-core.container` is `.disabled-20260506`) **because my cascade-gate uninstalled its apps and
my `package.start` restore recreated them as bare `podman run --restart=unless-stopped`** without
regenerating units. Two related hardening items: (a) `package.start` should regenerate a missing
quadlet unit, not fall back to bare podman; (b) re-survey the status doc's "Quadlet-everywhere ~96%"
from `.container`-file presence + `PODMAN_SYSTEMD_UNIT`, not from "container running".
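A re-survey based on `.container`-file presence could start from something like the sketch below. It is an illustration only: `quadlet_unit_stem` and `quadlet_app_ids` are hypothetical helpers, and the unit directory must be supplied by the caller (for rootless quadlet this is commonly `~/.config/containers/systemd/`, but the actual location on these nodes is an assumption). Renamed files such as `bitcoin-core.container.disabled-20260506` are deliberately not counted. The `PODMAN_SYSTEMD_UNIT` check is not covered here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns the app name for an active quadlet unit file name,
/// e.g. "fedimint.container" -> Some("fedimint").
/// Names that merely contain ".container" (such as a ".disabled-…" copy)
/// or that have an empty stem return None.
pub fn quadlet_unit_stem(file_name: &str) -> Option<&str> {
    file_name
        .strip_suffix(".container")
        .filter(|stem| !stem.is_empty())
}

/// Lists the apps that have an active `.container` unit in `unit_dir`,
/// sorted by name. The directory itself is the caller's assumption.
pub fn quadlet_app_ids(unit_dir: &Path) -> io::Result<Vec<String>> {
    let mut ids = Vec::new();
    for entry in fs::read_dir(unit_dir)? {
        let name = entry?.file_name().to_string_lossy().into_owned();
        if let Some(stem) = quadlet_unit_stem(&name) {
            ids.push(stem.to_string());
        }
    }
    ids.sort();
    Ok(ids)
}
```

Comparing this list against the installed-app list gives the quadlet share directly, without inferring it from whether a container happens to be running.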
The **stop→stopped STATE reporting is correct** once the container actually stops (server.rs:1334
keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard — proven on filebrowser); the
bug is purely "container never stops", not "state not reported".
### MY-SESSION ERRATA (own it on resume)
- I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
@@ -261,24 +264,24 @@ see ground-truth block above). No need to redesign — the work is (a) restore .
`Invalid Docker image format`.
### NEXT STEPS (in order)
1. ✅ **DONE — .198 ground truth:** quadlet is intended (.198 has the backend `.container` files).
2. **Run the CANONICAL gate on .198 FIRST** — it is the clean, properly-quadletized node (I did NOT
touch it today). `ARCHY_HOST=192.168.1.198 ARCHY_SCHEME=https ARCHY_PASSWORD='ThisIsWeb54321@'
ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`. NO cascade; never kill
mid-iteration. This tells us whether the stop bug reproduces on a quadlet-correct node (→ real
product bug) or was purely .228 contamination (→ just re-quadletize .228).
3. **Restore .228's quadlet units** — properly reinstall its backend apps so `.container` files
regenerate (match .198). The cleanest route is the gate's own install path or a forced reconcile;
verify `.container` files reappear + `PODMAN_SYSTEMD_UNIT` is set, then re-run the gate on .228.
4. **Fix the two robustness sub-bugs** (only if they reproduce on quadlet-correct nodes / as
hardening): (a) `package.start` must regenerate a missing quadlet unit, not fall back to bare
podman; (b) `prod_orchestrator::stop()` podman-fallback must fire when there's no quadlet unit
(`loaded()` failure / non-`unknown_app_id` error must not abort the stop). Add a mock-orchestrator
test reproducing electrumx-style "no quadlet unit" stop.
5. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
1. ✅ **DONE** — .198 ground truth (quadlet is intended) + **root cause pinned** (stop-grace bug
reproduced live on clean .198; it's a REAL fleet-wide bug, see blocker block above).
2. **Fix the stop-grace bug** (the gate exit criterion now hinges on this): thread the per-app
`stop_timeout_secs` grace into `ContainerRuntime::stop_container` (API `?t=` + CLI `-t`) and make
the wrapper deadline = grace + buffer. **Owner decision: table-based (A/B) vs manifest-driven
`stop_grace_secs` (C).** Add a mock test: a SIGTERM-ignoring container must still end `stopped`.
3. **Build + sideload** to .198 and .228 (`CARGO_INCREMENTAL=0 cargo build --release -p archipelago`;
stop archipelago, cp binary, start — containers survive).
4. **Re-quadletize .228** (its backend `.container` files were wiped by my cascade-gate; reinstall
its apps so units regenerate, matching .198; verify `.container` files + `PODMAN_SYSTEMD_UNIT`).
5. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
6. Hardening: (a) `package.start` should regenerate a missing quadlet unit, not fall back to bare
podman; (b) re-survey the status doc's quadlet % from `.container`-file presence.
7. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
install_netbird_stack in stacks.rs).
6. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
8. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
### KNOWN ISSUES / WATCH-OUTS
- **.198 is a weak/loaded node** (load avg ~35). The generic reconcile recreates