docs(gate): stop-grace fix shipped+validated; gate is multi-caused (5 issues)

Fix deployed to .198+.228, vaultwarden stops clean (no regression). But validation
showed the gate failures are multi-caused: (2) fedimint crash-looping/unhealthy on
both nodes can't be stopped; (3) host-listener repair watchdog restarts
port-unreachable containers fighting stop; (4) gate waits for 'stopped' but apps end
'exited'/'absent' (Exited->Stopped conversion key mismatch); (5) grace vs 60s
gate-timeout (electrumx 300s); (6) .228 contamination. Documented + re-sequenced
NEXT STEPS (fedimint health is the new top blocker).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
archipelago 2026-06-22 08:07:43 -04:00
parent 2dad64b2ee
commit 29cd167894


@@ -239,6 +239,39 @@ equals the grace:
`.container` files are gone from my cascade-gate contamination — reinstall its apps so units
regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
### ✅/⚠️ FIX SHIPPED + VALIDATED 2026-06-22 — and the gate has MORE causes than the grace bug
**Done:** the grace fix is implemented (option **C+table fallback**: manifest `stop_grace_secs`,
falling back to the `stop_grace_secs_for()` table; deadline = grace + 15s), unit-tested (3 tests green), committed
(`2dad64b2`), release-built, and **deployed to BOTH .228 and .198** (active, UI 200). Quadlet
regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
(running→exited→removed) — no regression; the deployed binary's stop path works.
**But validation revealed the gate failures are MULTI-CAUSED: the grace bug is only one of six causes:**
1. ✅ FIXED — orchestrator ignored per-app stop grace (`podman stop -t 30` spurious 30s timeout).
2. ⛔ **`fedimint` is crash-looping / unhealthy on BOTH nodes** (`health_monitor: Auto-restarting
unhealthy container: fedimint`, attempt 6/10). An app that won't stay up can't be cleanly
stopped — fedimint was a *confounded* test case. Needs a fedimint-health investigation
(why is its container unhealthy / why does host port 8173 not become reachable).
`health_monitor` DOES respect `user_stopped` (health_monitor.rs:983) so that part is correct.
3. ⛔ **Host-listener repair watchdog** (`prod_orchestrator`: "host listener disappeared after
startup; restarting container app_id=fedimint") restarts containers whose launch port isn't
reachable — fights any stop of a port-unreachable app.
4. ⚠️ **State-model nuance:** `vaultwarden` showed `exited`→`absent`, never `stopped`; the gate waits
for exactly `"stopped"` (`wait_for_container_status … stopped`). The `Exited→Stopped` conversion
(server.rs:1191, needs `user_stopped.contains(id)`) isn't always firing — likely an id-vs-name
key mismatch. The gate may need to accept `exited`/`absent` as terminal, or the conversion fixed.
5. ⚠️ **Grace vs gate-timeout:** `electrumx` grace is 300s; if it ignores SIGQUIT the container
only dies at the 300s SIGKILL — far past the gate's 60s wait. `-t` is a *ceiling*, so a HEALTHY
electrumx that honours SIGQUIT stops fast; an unhealthy/ignoring one blows the gate window.
Decide: trim graces, make the gate's per-app stop-wait ≥ grace, or both.
6. ⚠️ **.228 contamination** (plain podman, no quadlet units) — my cascade-gate; re-quadletize.
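
The grace resolution from item 1 and the deadline arithmetic behind item 5 can be sketched in Rust roughly as follows. This is an illustration under stated assumptions, not the shipped code: the signature of `stop_grace_secs_for()`, the helpers `resolve_stop_grace` and `stop_deadline`, and the 30s default are hypothetical. Only the 300s electrumx grace and the 15s buffer come from the notes above.

```rust
use std::time::Duration;

/// Assumed default when neither the manifest nor the table has an entry.
const DEFAULT_STOP_GRACE_SECS: u64 = 30;
/// Buffer added on top of the grace for the wrapper deadline (15s per the notes).
const STOP_DEADLINE_BUFFER_SECS: u64 = 15;

/// Hypothetical stand-in for the fallback table; the real signature may differ.
fn stop_grace_secs_for(app_id: &str) -> Option<u64> {
    match app_id {
        "electrumx" => Some(300),
        _ => None,
    }
}

/// Manifest value wins; otherwise fall back to the table, then to the default.
fn resolve_stop_grace(app_id: &str, manifest_stop_grace_secs: Option<u64>) -> u64 {
    manifest_stop_grace_secs
        .or_else(|| stop_grace_secs_for(app_id))
        .unwrap_or(DEFAULT_STOP_GRACE_SECS)
}

/// Wrapper deadline = grace + buffer, so the wrapper never gives up before
/// the runtime has had the chance to escalate to SIGKILL.
fn stop_deadline(grace_secs: u64) -> Duration {
    Duration::from_secs(grace_secs.saturating_add(STOP_DEADLINE_BUFFER_SECS))
}
```

With these assumptions, an electrumx stop gets a 315s wrapper deadline, which makes the mismatch with a 60s gate wait in item 5 easy to see.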
**Bottom line:** the grace fix is correct and shipped, but **the gate will not go green until #2-#6
are addressed**. These are pre-existing product/health issues the gate is correctly surfacing, not
regressions from this work. They need owner prioritization (esp. fedimint health, the watchdog-vs-
stop interaction, and the gate's terminal-state acceptance).
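
The two options for the terminal-state issue (item 4) can be sketched side by side. The following is a hypothetical Rust model, not the code at server.rs:1191: the `ContainerStatus` enum, `normalize_status`, `is_terminal_for_stop`, and the example container name are assumptions made for illustration. It shows a conversion that tolerates an id-vs-name key mismatch, plus a gate predicate that accepts `exited`/`absent` as terminal.

```rust
use std::collections::HashSet;

/// Simplified status model (assumed; the real enum may have more variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ContainerStatus {
    Running,
    Exited,
    Stopped,
    Absent,
}

/// Option A: make the Exited -> Stopped conversion robust by checking the
/// `user_stopped` set under both the app id and the container name.
fn normalize_status(
    raw: ContainerStatus,
    app_id: &str,
    container_name: &str,
    user_stopped: &HashSet<String>,
) -> ContainerStatus {
    let user_requested =
        user_stopped.contains(app_id) || user_stopped.contains(container_name);
    match raw {
        ContainerStatus::Exited if user_requested => ContainerStatus::Stopped,
        other => other,
    }
}

/// Option B: let the gate treat any non-running end state as terminal.
fn is_terminal_for_stop(status: ContainerStatus) -> bool {
    matches!(
        status,
        ContainerStatus::Stopped | ContainerStatus::Exited | ContainerStatus::Absent
    )
}
```

Option A keeps the gate strict and fixes the product state model; option B loosens the gate and would also pass a container that exited on its own, so the two are not equivalent.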
**Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain;
@@ -264,24 +297,29 @@ bug is purely "container never stops", not "state not reported".
`Invalid Docker image format`.
### NEXT STEPS (in order)
1. ✅ **DONE** — .198 ground truth (quadlet is intended) + **root cause pinned** (stop-grace bug
reproduced live on clean .198; it's a REAL fleet-wide bug, see blocker block above).
2. **Fix the stop-grace bug** (the gate exit criterion now hinges on this): thread the per-app
`stop_timeout_secs` grace into `ContainerRuntime::stop_container` (API `?t=` + CLI `-t`) and make
the wrapper deadline = grace + buffer. **Owner decision: table-based (A/B) vs manifest-driven
`stop_grace_secs` (C).** Add a mock test: a SIGTERM-ignoring container must still end `stopped`.
3. **Build + sideload** to .198 and .228 (`CARGO_INCREMENTAL=0 cargo build --release -p archipelago`;
stop archipelago, cp binary, start — containers survive).
4. **Re-quadletize .228** (its backend `.container` files were wiped by my cascade-gate; reinstall
its apps so units regenerate, matching .198; verify `.container` files + `PODMAN_SYSTEMD_UNIT`).
5. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
1. ✅ **DONE** — root-caused the stop-grace bug, fixed it (commit `2dad64b2`), unit-tested,
release-built, **deployed to .198 + .228**, validated no-regression (vaultwarden stops on .198).
2. ⛔ **fedimint health** — why is its container unhealthy on both nodes (health_monitor restart
6/10; host port 8173 unreachable)? A crash-looping app can't pass the lifecycle gate. Likely the
real top blocker now. Same lens for any other unhealthy app surfaced by the gate.
3. ⛔ **Host-listener repair vs user-stop** — the launch-port watchdog
(`prod_orchestrator`: "host listener disappeared after startup; restarting container") must NOT
restart a container the user just stopped. Check it consults `disabled`/`user_stopped`.
4. ⚠️ **Gate terminal-state acceptance** — apps end `exited`/`absent`, not always `stopped`
(Exited→Stopped conversion at server.rs:1191 needs a matching `user_stopped` key). Either fix the
conversion (id-vs-name) or have `wait_for_container_status … stopped` accept exited/absent.
5. ⚠️ **Grace vs gate-timeout** — trim over-long graces (electrumx 300s) and/or make the gate's
per-app stop-wait ≥ the app's grace.
6. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
7. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
6. Hardening: (a) `package.start` should regenerate a missing quadlet unit, not fall back to bare
podman; (b) re-survey the status doc's quadlet % from `.container`-file presence.
7. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
re-survey the status doc's quadlet % from `.container`-file presence.
9. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
install_netbird_stack in stacks.rs).
8. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
10. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
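
The watchdog-vs-stop step and the grace-vs-gate-timeout step above both reduce to small decision functions. A minimal Rust sketch, assuming hypothetical helpers `should_repair_listener` and `gate_stop_wait` (neither name exists in the notes), a 60s base gate wait as described above, and an assumed 15s slack:

```rust
use std::collections::HashSet;
use std::time::Duration;

/// The host-listener repair should only restart a container whose port is
/// unreachable AND which the user has neither stopped nor disabled.
fn should_repair_listener(
    app_id: &str,
    listener_reachable: bool,
    user_stopped: &HashSet<String>,
    disabled: &HashSet<String>,
) -> bool {
    if listener_reachable {
        return false; // nothing to repair
    }
    !(user_stopped.contains(app_id) || disabled.contains(app_id))
}

/// Gate stop-wait that is never shorter than the app's grace plus slack.
fn gate_stop_wait(grace_secs: u64) -> Duration {
    const GATE_BASE_WAIT_SECS: u64 = 60; // current gate wait per the notes
    const SLACK_SECS: u64 = 15; // assumed slack, mirrors the wrapper buffer
    Duration::from_secs(GATE_BASE_WAIT_SECS.max(grace_secs.saturating_add(SLACK_SECS)))
}
```

Under these assumptions a 300s grace yields a 315s gate wait, while short-grace apps keep the existing 60s window. Whether the real watchdog has access to `user_stopped`/`disabled` at that point is exactly what the watchdog step asks to verify.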
### KNOWN ISSUES / WATCH-OUTS
- **.198 is a weak/loaded node** (load avg ~35). The generic reconcile recreates