docs(gate): 3 stop bugs FIXED, electrumx suite GREEN on .228

Stop failure was 3 real product bugs (grace / reconcile-resurrection / container-list user-stopped state), all fixed (2dad64b2, 760a32bc, 6e49ce6f) + deployed. electrumx lifecycle suite 10/10 green (66s). fedimint 'crash loop' was probe-induced churn (stable when left alone). Validating breadth next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 6e49ce6f88
commit b090235b04
@@ -247,30 +247,32 @@ regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
(running→exited→removed) — no regression; the deployed binary's stop path works.

**But validation revealed the gate failures are MULTI-CAUSED — the grace bug is only one of ~5:**
1. ✅ FIXED — orchestrator ignored per-app stop grace (`podman stop -t 30` spurious 30s timeout).
2. ⛔ **`fedimint` is crash-looping / unhealthy on BOTH nodes** (`health_monitor: Auto-restarting
unhealthy container: fedimint`, attempt 6/10). An app that won't stay up can't be cleanly
stopped — fedimint was a *confounded* test case. Needs a fedimint-health investigation
(why is its container unhealthy / why does host port 8173 not become reachable).
`health_monitor` DOES respect `user_stopped` (health_monitor.rs:983) so that part is correct.
3. ⛔ **Host-listener repair watchdog** (`prod_orchestrator`: "host listener disappeared after
startup; restarting container app_id=fedimint") restarts containers whose launch port isn't
reachable — fights any stop of a port-unreachable app.
4. ⚠️ **State-model nuance:** `vaultwarden` showed `exited`→`absent`, never `stopped`; the gate waits
for exactly `"stopped"` (`wait_for_container_status … stopped`). The `Exited→Stopped` conversion
(server.rs:1191, needs `user_stopped.contains(id)`) isn't always firing — likely an id-vs-name
key mismatch. The gate may need to accept `exited`/`absent` as terminal, or the conversion fixed.
5. ⚠️ **Grace vs gate-timeout:** `electrumx` grace is 300s; if it ignores SIGQUIT the container
only dies at the 300s SIGKILL — far past the gate's 60s wait. `-t` is a *ceiling*, so a HEALTHY
electrumx that honours SIGQUIT stops fast; an unhealthy/ignoring one blows the gate window.
Decide: trim graces, make the gate's per-app stop-wait ≥ grace, or both.
6. ⚠️ **.228 contamination** (plain podman, no quadlet units) — my cascade-gate; re-quadletize.
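
Items 4 and 5 above leave two gate-side options open: treat `exited`/`absent` as terminal, and size the stop-wait to at least the app's grace. The following is a minimal sketch of both, not the gate's actual code. The names `is_terminal`, `stop_wait` and `wait_for_terminal` are hypothetical and are not the real `wait_for_container_status` helper; the status strings are taken from the notes above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical: statuses the gate could accept as "the container is down".
fn is_terminal(status: &str) -> bool {
    matches!(status, "stopped" | "exited" | "absent")
}

/// Hypothetical: never wait less than the app's own stop grace.
fn stop_wait(default_wait: Duration, grace: Duration) -> Duration {
    default_wait.max(grace)
}

/// Poll `poll` until it reports a terminal status or `timeout` elapses.
/// Returns the terminal status that was seen, or `None` on timeout.
fn wait_for_terminal<F>(mut poll: F, timeout: Duration, interval: Duration) -> Option<String>
where
    F: FnMut() -> String,
{
    let deadline = Instant::now() + timeout;
    loop {
        let status = poll();
        if is_terminal(&status) {
            return Some(status);
        }
        if Instant::now() >= deadline {
            return None;
        }
        std::thread::sleep(interval);
    }
}
```

With the numbers quoted above, `stop_wait(60s, 300s)` yields 300s, so an electrumx that only dies at SIGKILL would still fall inside the wait window.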
**The gate stop-failure was MULTI-CAUSED (3 real product bugs) — all 3 now FIXED + the electrumx
lifecycle suite is GREEN (10/10, 66s) on .228:**
1. ✅ **Stop ignored per-app grace** (`podman stop -t 30` spurious 30s timeout) — commit `2dad64b2`.
Orchestrator now uses manifest `stop_grace_secs` → `stop_grace_secs_for()` table; deadline =
grace + 15s; applied to quadlet stop + API + CLI.
2. ✅ **Reconciler resurrected user-stopped apps** — commit `760a32bc`. The reconcile filter's
`dependency_required` override re-included a user-stopped dependency (electrumx ← active mempool),
the in-memory `disabled` set is wiped on manifest reload, and the host-port "repair" then restarted
the stopped backend within ~8s. Fix: `ensure_running_with_mode` now bails `Left("user-stopped")`
when the on-disk `user_stopped` marker is set (the single choke point all reconcile flows through);
install/start clear the marker first so user actions are unaffected.
3. ✅ **container-list reported user-stopped apps as `running`** — commit `6e49ce6f`. The backend was
Exited but its UI companion (electrs-ui/bitcoin-ui/…) kept serving the launch port, and the
state-refresh upgraded any reachable launch port to `running`. Fix: `handle_container_list` forces
`stopped` for `user_stopped` apps before the launch-port refresh.
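
The three fixes above each come down to a small decision rule, which can be sketched as below. This is an illustrative model under stated assumptions, not the shipped code: the types `Reconcile` and `State`, the function names and the parameter lists are all hypothetical, and the on-disk `user_stopped` marker is modelled as a plain in-memory set.

```rust
use std::collections::HashSet;

/// Slack added on top of the per-app grace (the "+ 15s" from fix 1).
const STOP_DEADLINE_SLACK_SECS: u64 = 15;

/// Fix 1 (sketch): the stop deadline follows the app's own grace.
fn stop_deadline_secs(grace_secs: u64) -> u64 {
    grace_secs + STOP_DEADLINE_SLACK_SECS
}

/// Hypothetical outcome type; the notes say the real code bails with
/// `Left("user-stopped")`.
#[derive(Debug, PartialEq)]
enum Reconcile {
    Skip(&'static str),
    EnsureRunning,
}

/// Fix 2 (sketch): the user-stopped check runs first, so neither a
/// `dependency_required` override nor a wiped in-memory `disabled` set
/// can bring the app back.
fn reconcile(
    app_id: &str,
    disabled_in_memory: bool,
    dependency_required: bool,
    user_stopped: &HashSet<String>,
) -> Reconcile {
    if user_stopped.contains(app_id) {
        return Reconcile::Skip("user-stopped");
    }
    if disabled_in_memory && !dependency_required {
        return Reconcile::Skip("disabled");
    }
    Reconcile::EnsureRunning
}

/// Hypothetical subset of the states shown by container-list.
#[derive(Debug, PartialEq, Clone, Copy)]
enum State {
    Running,
    Exited,
    Stopped,
}

/// Fix 3 (sketch): force `Stopped` for user-stopped apps before the
/// launch-port refresh can upgrade the state to `Running`.
fn listed_state(
    app_id: &str,
    observed: State,
    launch_port_reachable: bool,
    user_stopped: &HashSet<String>,
) -> State {
    if user_stopped.contains(app_id) {
        return State::Stopped;
    }
    if launch_port_reachable {
        State::Running
    } else {
        observed
    }
}
```

For the electrumx case described above (300s grace, user-stopped, UI companion still serving the launch port), this model gives a 315s deadline, a skipped reconcile and a listed state of `Stopped`.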

**Bottom line:** the grace fix is correct and shipped, but **the gate will not go green until #2–#6
are addressed**. These are pre-existing product/health issues the gate is correctly surfacing, not
regressions from this work. They need owner prioritization (esp. fedimint health, the watchdog-vs-
stop interaction, and the gate's terminal-state acceptance).
**Earlier theories now RESOLVED/superseded:** "fedimint crash-looping" was **probe-induced churn** —
left alone, fedimint is stable (Up 48 min, 0 watchdog restarts/30 min); its restarts during testing
were the host-port watchdog firing while I rapid-cycled stop/start (fixed by #2). "Exited→Stopped
key mismatch" was actually the live-UI-companion launch-port issue (#3). "Grace vs gate-timeout"
(electrumx 300s) was moot — a healthy electrumx honours SIGQUIT and stops in <1s.

**Status: validating breadth.** electrumx suite GREEN on .228 (the previously-failing repro). Full
single-iteration gate (all suites, DESTRUCTIVE) running on .228 to confirm the other apps; then .198,
then the 5× canonical gate. `.228` is still contamination-flavored (plain podman) but the fixes are
runtime-agnostic and electrumx passed there regardless. Re-quadletizing .228 + the 5× runs remain.

**Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
@@ -297,22 +299,15 @@ bug is purely "container never stops", not "state not reported".
→ `Invalid Docker image format`.

### NEXT STEPS (in order)
1. ✅ **DONE** — root-caused the stop-grace bug, fixed it (commit `2dad64b2`), unit-tested,
release-built, **deployed to .198 + .228**, validated no-regression (vaultwarden stops on .198).
2. ⛔ **fedimint health** — why is its container unhealthy on both nodes (health_monitor restart
6/10; host port 8173 unreachable)? A crash-looping app can't pass the lifecycle gate. Likely the
real top blocker now. Same lens for any other unhealthy app surfaced by the gate.
3. ⛔ **Host-listener repair vs user-stop** — the launch-port watchdog
(`prod_orchestrator`: "host listener disappeared after startup; restarting container") must NOT
restart a container the user just stopped. Check it consults `disabled`/`user_stopped`.
4. ⚠️ **Gate terminal-state acceptance** — apps end `exited`/`absent`, not always `stopped`
(Exited→Stopped conversion at server.rs:1191 needs a matching `user_stopped` key). Either fix the
conversion (id-vs-name) or have `wait_for_container_status … stopped` accept exited/absent.
5. ⚠️ **Grace vs gate-timeout** — trim over-long graces (electrumx 300s) and/or make the gate's
per-app stop-wait ≥ the app's grace.
6. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
1. ✅ **DONE** — the 3 stop bugs are fixed + deployed; electrumx lifecycle suite GREEN on .228:
- stop-grace (`2dad64b2`), reconcile-resurrection guard (`760a32bc`), container-list
user-stopped state (`6e49ce6f`). All compile + unit/quadlet tests green; deployed to .228.
2. **Full single-iteration gate on .228** (running) — confirm bitcoin/btcpay/lnd/mempool/immich/
fedimint stop tests now pass too. Fix any stragglers (same 3-bug lens).
3. **Deploy the 3-fix binary to .198** and run the gate there (the clean quadlet node).
4. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
7. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
5. **Run the 5× canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
re-survey the status doc's quadlet % from `.container`-file presence.