docs(gate): stop-grace fix shipped+validated; gate is multi-caused (5 issues)
Fix deployed to .198+.228, vaultwarden stops clean (no regression). But validation showed the gate failures are multi-caused: (2) fedimint crash-looping/unhealthy on both nodes can't be stopped; (3) host-listener repair watchdog restarts port-unreachable containers, fighting stop; (4) gate waits for 'stopped' but apps end 'exited'/'absent' (Exited->Stopped conversion key mismatch); (5) grace vs 60s gate-timeout (electrumx 300s); (6) .228 contamination. Documented + re-sequenced NEXT STEPS (fedimint health is the new top blocker).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 2dad64b2ee
commit 29cd167894
@@ -239,6 +239,39 @@ equals the grace:
 `.container` files are gone from my cascade-gate contamination — reinstall its apps so units
 regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
 
+### ✅/⚠️ FIX SHIPPED + VALIDATED 2026-06-22 — and the gate has MORE causes than the grace bug
+
+**Done:** the grace fix is implemented (option **C+table fallback**: manifest `stop_grace_secs` →
+`stop_grace_secs_for()` table; deadline = grace + 15s), unit-tested (3 tests green), committed
+(`2dad64b2`), release-built, and **deployed to BOTH .228 and .198** (active, UI 200). Quadlet
+regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
+(running→exited→removed) — no regression; the deployed binary's stop path works.
+
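The "C+table fallback" resolution described above can be sketched roughly as follows. This is a minimal illustration, not the code from commit `2dad64b2`: the `Manifest` struct, the `resolve_stop_grace` and `podman_stop_args` helpers, the signature of `stop_grace_secs_for`, and the 30s default for unlisted apps are all assumptions. Only the `stop_grace_secs` field name, the 300s `electrumx` grace, and the grace + 15s deadline come from the notes.

```rust
use std::time::Duration;

/// Hypothetical slice of an app manifest; only the field named in the notes is modelled.
pub struct Manifest {
    pub stop_grace_secs: Option<u64>,
}

/// Assumed default for apps missing from the table (mirrors the old `podman stop -t 30`).
const DEFAULT_GRACE_SECS: u64 = 30;
/// Buffer added on top of the grace to form the wrapper deadline (grace + 15s).
const DEADLINE_BUFFER_SECS: u64 = 15;

/// Table fallback. The real signature is not shown in the notes; this one is assumed.
/// Only the electrumx entry (300s) is taken from the notes.
pub fn stop_grace_secs_for(app_id: &str) -> u64 {
    match app_id {
        "electrumx" => 300,
        _ => DEFAULT_GRACE_SECS,
    }
}

/// Option C with table fallback: a manifest value wins, otherwise the table decides.
pub fn resolve_stop_grace(app_id: &str, manifest: Option<&Manifest>) -> Duration {
    let secs = manifest
        .and_then(|m| m.stop_grace_secs)
        .unwrap_or_else(|| stop_grace_secs_for(app_id));
    Duration::from_secs(secs)
}

/// Deadline the stop wrapper waits for: the grace plus the fixed buffer.
pub fn stop_deadline(grace: Duration) -> Duration {
    grace + Duration::from_secs(DEADLINE_BUFFER_SECS)
}

/// Arguments for `podman stop -t <secs> <container>`; `-t` is a ceiling, not a fixed wait.
pub fn podman_stop_args(container: &str, grace: Duration) -> Vec<String> {
    vec![
        "stop".into(),
        "-t".into(),
        grace.as_secs().to_string(),
        container.into(),
    ]
}
```

With these assumptions, `resolve_stop_grace("electrumx", None)` yields 300s and `stop_deadline` turns that into a 315s wrapper deadline, while a manifest carrying `stop_grace_secs: Some(45)` overrides the table.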
+**But validation revealed the gate failures are MULTI-CAUSED — the grace bug is only one of six causes:**
+1. ✅ FIXED — orchestrator ignored per-app stop grace (`podman stop -t 30` spurious 30s timeout).
+2. ⛔ **`fedimint` is crash-looping / unhealthy on BOTH nodes** (`health_monitor: Auto-restarting
+unhealthy container: fedimint`, attempt 6/10). An app that won't stay up can't be cleanly
+stopped — fedimint was a *confounded* test case. Needs a fedimint-health investigation
+(why is its container unhealthy / why does host port 8173 not become reachable).
+`health_monitor` DOES respect `user_stopped` (health_monitor.rs:983) so that part is correct.
+3. ⛔ **Host-listener repair watchdog** (`prod_orchestrator`: "host listener disappeared after
+startup; restarting container app_id=fedimint") restarts containers whose launch port isn't
+reachable — fights any stop of a port-unreachable app.
+4. ⚠️ **State-model nuance:** `vaultwarden` showed `exited`→`absent`, never `stopped`; the gate waits
+for exactly `"stopped"` (`wait_for_container_status … stopped`). The `Exited→Stopped` conversion
+(server.rs:1191, needs `user_stopped.contains(id)`) isn't always firing — likely an id-vs-name
+key mismatch. The gate may need to accept `exited`/`absent` as terminal, or the conversion fixed.
+5. ⚠️ **Grace vs gate-timeout:** `electrumx` grace is 300s; if it ignores SIGQUIT the container
+only dies at the 300s SIGKILL — far past the gate's 60s wait. `-t` is a *ceiling*, so a HEALTHY
+electrumx that honours SIGQUIT stops fast; an unhealthy/ignoring one blows the gate window.
+Decide: trim graces, make the gate's per-app stop-wait ≥ grace, or both.
+6. ⚠️ **.228 contamination** (plain podman, no quadlet units) — my cascade-gate; re-quadletize.
+
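Items 4 and 5 above both concern what the gate waits for and for how long. A minimal sketch of the two candidate gate-side changes is given below, under stated assumptions: the `ContainerStatus` enum, `parse_status`, `is_terminal_after_stop`, `gate_stop_wait`, and the 15s slack are hypothetical names and values. Only the status strings, the 60s gate wait, and the "stop-wait ≥ grace" rule come from the notes.

```rust
use std::time::Duration;

/// Container states as they appear in the notes (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContainerStatus {
    Running,
    Stopped,
    Exited,
    Absent,
}

/// Parse the lowercase status strings quoted in the notes; anything else is unknown.
pub fn parse_status(s: &str) -> Option<ContainerStatus> {
    match s {
        "running" => Some(ContainerStatus::Running),
        "stopped" => Some(ContainerStatus::Stopped),
        "exited" => Some(ContainerStatus::Exited),
        "absent" => Some(ContainerStatus::Absent),
        _ => None,
    }
}

/// Item 4: treat `exited` and `absent` as terminal alongside `stopped`,
/// instead of waiting for exactly "stopped".
pub fn is_terminal_after_stop(status: ContainerStatus) -> bool {
    matches!(
        status,
        ContainerStatus::Stopped | ContainerStatus::Exited | ContainerStatus::Absent
    )
}

/// The gate's current fixed stop wait, per the notes.
const GATE_DEFAULT_WAIT: Duration = Duration::from_secs(60);
/// Assumed slack on top of the grace so the SIGKILL at the ceiling has time to land.
const GATE_SLACK: Duration = Duration::from_secs(15);

/// Item 5: make the per-app stop-wait at least the app's grace (plus slack),
/// never shorter than the existing 60s wait.
pub fn gate_stop_wait(grace: Duration) -> Duration {
    GATE_DEFAULT_WAIT.max(grace + GATE_SLACK)
}
```

Under these assumptions a 300s `electrumx` grace gives a 315s gate wait, while a 30s grace keeps the existing 60s. Whether to widen the gate this way or trim the graces instead is the open decision recorded in item 5.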
+**Bottom line:** the grace fix is correct and shipped, but **the gate will not go green until #2–#6
+are addressed**. These are pre-existing product/health issues the gate is correctly surfacing, not
+regressions from this work. They need owner prioritization (esp. fedimint health, the
+watchdog-vs-stop interaction, and the gate's terminal-state acceptance).
+
 **Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
 runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
 indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain;
@@ -264,24 +297,29 @@ bug is purely "container never stops", not "state not reported".
 → `Invalid Docker image format`.
 
 ### NEXT STEPS (in order)
-1. ✅ **DONE** — .198 ground truth (quadlet is intended) + **root cause pinned** (stop-grace bug
-reproduced live on clean .198; it's a REAL fleet-wide bug, see blocker block above).
-2. **Fix the stop-grace bug** (the gate exit criterion now hinges on this): thread the per-app
-`stop_timeout_secs` grace into `ContainerRuntime::stop_container` (API `?t=` + CLI `-t`) and make
-the wrapper deadline = grace + buffer. **Owner decision: table-based (A/B) vs manifest-driven
-`stop_grace_secs` (C).** Add a mock test: a SIGTERM-ignoring container must still end `stopped`.
-3. **Build + sideload** to .198 and .228 (`CARGO_INCREMENTAL=0 cargo build --release -p archipelago`;
-stop archipelago, cp binary, start — containers survive).
-4. **Re-quadletize .228** (its backend `.container` files were wiped by my cascade-gate; reinstall
-its apps so units regenerate, matching .198; verify `.container` files + `PODMAN_SYSTEMD_UNIT`).
-5. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
+1. ✅ **DONE** — root-caused the stop-grace bug, fixed it (commit `2dad64b2`), unit-tested,
+release-built, **deployed to .198 + .228**, validated no-regression (vaultwarden stops on .198).
+2. ⛔ **fedimint health** — why is its container unhealthy on both nodes (health_monitor restart
+6/10; host port 8173 unreachable)? A crash-looping app can't pass the lifecycle gate. Likely the
+real top blocker now. Same lens for any other unhealthy app surfaced by the gate.
+3. ⛔ **Host-listener repair vs user-stop** — the launch-port watchdog
+(`prod_orchestrator`: "host listener disappeared after startup; restarting container") must NOT
+restart a container the user just stopped. Check it consults `disabled`/`user_stopped`.
+4. ⚠️ **Gate terminal-state acceptance** — apps end `exited`/`absent`, not always `stopped`
+(Exited→Stopped conversion at server.rs:1191 needs a matching `user_stopped` key). Either fix the
+conversion (id-vs-name) or have `wait_for_container_status … stopped` accept exited/absent.
+5. ⚠️ **Grace vs gate-timeout** — trim over-long graces (electrumx 300s) and/or make the gate's
+per-app stop-wait ≥ the app's grace.
+6. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
+units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
+7. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
 mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
-6. Hardening: (a) `package.start` should regenerate a missing quadlet unit, not fall back to bare
-podman; (b) re-survey the status doc's quadlet % from `.container`-file presence.
-7. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
+8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
+re-survey the status doc's quadlet % from `.container`-file presence.
+9. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
 config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
 install_netbird_stack in stacks.rs).
-8. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
+10. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
 
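The check asked for in step 3 (the watchdog must not restart a container the user just stopped) might look roughly like the sketch below. Everything here is hypothetical: `RestartGuard`, its methods, and `should_repair_listener` are illustrative names, and the sets are keyed by app id on the assumption that one consistent key avoids the id-vs-name mismatch suspected in step 4. The real `prod_orchestrator` and `user_stopped` structures are not shown in these notes.

```rust
use std::collections::HashSet;

/// Hypothetical record of apps that must not be auto-restarted.
/// Keyed by app id throughout, so lookups and inserts cannot disagree on the key.
#[derive(Default)]
pub struct RestartGuard {
    user_stopped: HashSet<String>,
    disabled: HashSet<String>,
}

impl RestartGuard {
    /// Record an explicit user stop before the stop command is issued.
    pub fn mark_user_stopped(&mut self, app_id: &str) {
        self.user_stopped.insert(app_id.to_string());
    }

    /// Clear the flag when the user starts the app again.
    pub fn mark_user_started(&mut self, app_id: &str) {
        self.user_stopped.remove(app_id);
    }

    /// Record an app as disabled (never auto-restarted until re-enabled).
    pub fn mark_disabled(&mut self, app_id: &str) {
        self.disabled.insert(app_id.to_string());
    }

    /// Auto-restart is allowed only for apps that are neither user-stopped nor disabled.
    pub fn may_auto_restart(&self, app_id: &str) -> bool {
        !self.user_stopped.contains(app_id) && !self.disabled.contains(app_id)
    }
}

/// Decision the launch-port watchdog would take: repair (restart) only when the
/// listener is unreachable AND the guard allows an automatic restart.
pub fn should_repair_listener(guard: &RestartGuard, app_id: &str, port_reachable: bool) -> bool {
    !port_reachable && guard.may_auto_restart(app_id)
}
```

In this sketch, an unreachable port on `fedimint` triggers a repair until `mark_user_stopped("fedimint")` is called, after which the watchdog stands down until the user starts the app again.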
 ### KNOWN ISSUES / WATCH-OUTS
 - **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates