docs(gate): downgrade stop-blocker ⛔→⚠️ — .198 has quadlet units, .228 state was my contamination
.198 ground truth: backend apps ARE quadlet (.container files present) -> quadlet is the intended runtime. .228's plain-podman state traced to my cascade-gate uninstall + package.start restore (no quadlet regen). Two real robustness sub-bugs remain (start should regen quadlet; stop podman-fallback gap). Next: canonical gate on CLEAN .198 first to tell real-bug from contamination.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 47026fae30
commit a111d79a05
@@ -190,25 +190,47 @@ frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + no
|
|||||||
nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
|
nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
|
||||||
guard is KEPT on purpose (beneficial; not a blocker).
|
guard is KEPT on purpose (beneficial; not a blocker).
|
||||||
|
|
||||||
### ⛔ GATE BLOCKER discovered 2026-06-22 — `package.stop` does not stop several apps (FIX FIRST)
|
### ⚠️ GATE FINDING 2026-06-22 — `package.stop` non-propagation (mostly self-inflicted on .228; verify on .198)
|
||||||
|
|
||||||
Step 1 (sync .228 tcp-health manifest) is **DONE + verified** (frontend adopted, UI 200, no
|
Step 1 (sync .228 tcp-health manifest) is **DONE + verified** (frontend adopted, UI 200, no
|
||||||
churn). Step 2 (the 5× gate) surfaced a **real, reproducible blocker** — this is now the top task.
|
churn). Step 2 (the 5× gate) surfaced a `package.stop` failure — **but the headline cause turned
|
||||||
|
out to be MY cascade-gate contaminating .228**, not a fundamental product gap. Severity downgraded
|
||||||
|
from ⛔ to ⚠️ after the .198 ground-truth check (below). Still has a real robustness sub-bug.
|
||||||
|
|
||||||
**Symptom.** On a CLEAN, healthy .228, `package.stop electrumx` returns `{"status":"stopping"}`
|
**Symptom.** On (post-contamination) .228, `package.stop electrumx` returns `{"status":"stopping"}`
|
||||||
but the container **never stops** — `container-list` shows `running` for 66s+, the scanner keeps
|
but the container **never stops** — `container-list` shows `running` 66s+. The gate's
|
||||||
logging `Detected container: ElectrumX (running)`. The gate's `wait_for_container_status electrumx
|
`wait_for_container_status electrumx stopped 60` times out. Same hit bitcoin-knots/btcpay/fedimint/
|
||||||
stopped 60` therefore times out. Same failure hit bitcoin-knots / btcpay-server / fedimint / immich
|
immich. **Contrast:** `filebrowser` stops correctly (`running → stopped` ~6s).
|
||||||
in the gate run. **Contrast:** `filebrowser` stops correctly (`running → stopped` in ~6s).
|
|
||||||
|
|
||||||
**Root-cause chain (evidence-backed, not fully pinned):**
|
**.198 GROUND TRUTH (decisive, checked 2026-06-22):** .198 (untouched today) **HAS quadlet
|
||||||
- These app containers on .228 run as **plain `podman run --restart=unless-stopped` — NOT quadlet
|
`.container` files for backend apps** — `bitcoin-knots.container`, `btcpay-server.container`,
|
||||||
units.** `podman inspect electrumx` → `PODMAN_SYSTEMD_UNIT` is EMPTY; `systemctl --user stop
|
`fedimint.container`, `filebrowser.container`, `indeedhub.container`, `gitea.container`,
|
||||||
electrumx.service` → `Unit electrumx.service not loaded (rc=5)`. The only `.container` quadlet
|
`grafana.container`, `botfights.container`, `archy-{btcpay-db,nbxplorer}.container`, the
|
||||||
files on disk are the 4 UI companions + `home-assistant`; **`bitcoin-core.container` is renamed
|
`fedimint*`/`indeedhub*` members, etc. **⇒ Quadlet IS the intended backend runtime.** .228 instead
|
||||||
`.disabled-20260506`**. ⇒ **The "Quadlet-everywhere ~96% migrated" claim in
|
has NONE of these (only the 4 UI companions + home-assistant; `bitcoin-core.container` is
|
||||||
`app-registry-status-2026-06-21.md` is WRONG for backend apps** (the survey read a misleading
|
`.disabled-20260506`). **So .228's plain-podman state is contamination:** my cascade-destructive
|
||||||
signal). This is itself a finding the gate was meant to catch.
|
gate UNINSTALLED its apps (removing the `.container` files) and my `package.start` restore brought
|
||||||
|
them back as plain `podman run --restart=unless-stopped` **without regenerating the quadlet units**.
|
||||||
|
`podman inspect electrumx` on .228 → `PODMAN_SYSTEMD_UNIT` EMPTY; `systemctl --user stop
|
||||||
|
electrumx.service` → `Unit not loaded (rc=5)`. (NB: electrumx specifically shows no `PODMAN_SYSTEMD_UNIT`
|
||||||
|
on .198 too — confirm whether electrumx has its own `.container` on .198; the listing was truncated.)
|
||||||
|
|
||||||
|
**Two real sub-bugs remain (independent of the contamination):**
|
||||||
|
1. **`package.start`/restore recreates a container as plain podman when its quadlet unit is missing**
|
||||||
|
instead of regenerating the `.container` unit — leaving it un-stoppable via systemctl. Should
|
||||||
|
reconcile the quadlet unit, not fall back to bare podman silently.
|
||||||
|
2. **`prod_orchestrator::stop()` podman-fallback doesn't fire for electrumx-class apps.** Stop path
|
||||||
|
(prod_orchestrator.rs:2890): `loaded(app_id)?` → `quadlet::stop_service` (fail-soft) →
|
||||||
|
`runtime.stop_container` (podman). `compute_container_name(electrumx)` → bare `"electrumx"`
|
||||||
|
(correct target). filebrowser reaches the fallback and stops; electrumx does NOT ⇒ suspect
|
||||||
|
`loaded("electrumx")` erroring before the fallback AND the error not classed as
|
||||||
|
`is_unknown_app_id_error` (so `do_orchestrator_package_stop` never reaches `do_package_stop`).
|
||||||
|
Confirm by promoting the best-effort `install_log("STOP …")`/`STOP FAIL` to `tracing::error!`
|
||||||
|
(it was empty in .228's install log) and reading `loaded()` + `is_unknown_app_id_error`.
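
As a rough illustration of the behaviour sub-bug 2 asks for, here is a minimal sketch in which a failed `loaded()` lookup is logged but never prevents the podman fallback from running. Everything in it is hypothetical: the `StopBackend` trait, `StopOutcome`, `stop_with_fallback` and `MockBackend` are stand-ins invented for this example and are not the real `prod_orchestrator` types. It also assumes the container name equals the app id (which the notes say holds for electrumx) and that a successful systemd stop makes the podman call unnecessary.

```rust
use std::cell::RefCell;

/// Which step actually stopped the container (hypothetical type, for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum StopOutcome {
    ViaQuadlet,
    ViaPodman,
}

/// Hypothetical seam over the three steps named in the notes above.
pub trait StopBackend {
    fn loaded(&self, app_id: &str) -> Result<(), String>;
    fn quadlet_stop(&self, unit: &str) -> Result<(), String>;
    fn podman_stop(&self, container: &str) -> Result<(), String>;
}

/// Stop an app without letting an earlier failure skip the podman fallback.
pub fn stop_with_fallback<B: StopBackend>(backend: &B, app_id: &str) -> Result<StopOutcome, String> {
    // Deliberately no `?` here: a failed lookup is logged and the stop continues.
    if let Err(e) = backend.loaded(app_id) {
        eprintln!("STOP {app_id}: loaded() failed: {e}; continuing");
    }

    // Fail-soft systemd stop; it succeeds only when a quadlet unit exists.
    let unit = format!("{app_id}.service");
    match backend.quadlet_stop(&unit) {
        Ok(()) => return Ok(StopOutcome::ViaQuadlet),
        Err(e) => eprintln!("STOP {app_id}: {unit} not stopped via systemd: {e}"),
    }

    // Assumption: the container name equals the app id.
    backend
        .podman_stop(app_id)
        .map(|()| StopOutcome::ViaPodman)
        .map_err(|e| format!("STOP FAIL {app_id}: {e}"))
}

/// Mock backend that records every call, used to reproduce the "no quadlet unit" case.
pub struct MockBackend {
    pub loaded_ok: bool,
    pub has_unit: bool,
    pub calls: RefCell<Vec<String>>,
}

impl MockBackend {
    pub fn new(loaded_ok: bool, has_unit: bool) -> Self {
        Self { loaded_ok, has_unit, calls: RefCell::new(Vec::new()) }
    }
}

impl StopBackend for MockBackend {
    fn loaded(&self, app_id: &str) -> Result<(), String> {
        self.calls.borrow_mut().push(format!("loaded:{app_id}"));
        if self.loaded_ok { Ok(()) } else { Err("not loaded".to_string()) }
    }

    fn quadlet_stop(&self, unit: &str) -> Result<(), String> {
        self.calls.borrow_mut().push(format!("quadlet_stop:{unit}"));
        if self.has_unit { Ok(()) } else { Err("Unit not loaded (rc=5)".to_string()) }
    }

    fn podman_stop(&self, container: &str) -> Result<(), String> {
        self.calls.borrow_mut().push(format!("podman_stop:{container}"));
        Ok(())
    }
}
```

A mock-orchestrator test along these lines would build `MockBackend::new(false, false)` to mimic electrumx on .228 (lookup fails, no unit) and assert that the last recorded call is `podman_stop:electrumx`.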
|
||||||
|
|
||||||
|
**Correction to the status doc:** the "Quadlet-everywhere ~96%" survey may have mis-read the signal
|
||||||
|
*on contaminated nodes*; .198 genuinely is quadlet, so re-survey from `.container` file presence +
|
||||||
|
`PODMAN_SYSTEMD_UNIT`, not from "container running".
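
One possible shape for that re-survey rule, as a hedged sketch: classify each app from the two signals named above rather than from "container running". `RuntimeKind` and `classify_runtime` are hypothetical names introduced for this example. The inputs are assumed to be gathered elsewhere: whether a `.container` file exists for the app, and the raw `PODMAN_SYSTEMD_UNIT` label value as printed by `podman inspect`.

```rust
/// How a container appears to be managed, judged from two pieces of evidence
/// (hypothetical type, for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum RuntimeKind {
    /// `.container` file present and the systemd-unit label is set.
    Quadlet,
    /// Neither signal present: a bare `podman run` container.
    PlainPodman,
    /// The two signals disagree; needs a manual look.
    Inconsistent,
}

/// Classify an app from `.container` file presence plus the `PODMAN_SYSTEMD_UNIT` label.
pub fn classify_runtime(has_container_file: bool, systemd_unit_label: &str) -> RuntimeKind {
    // An empty or whitespace-only label counts as "not set".
    let has_label = !systemd_unit_label.trim().is_empty();
    match (has_container_file, has_label) {
        (true, true) => RuntimeKind::Quadlet,
        (false, false) => RuntimeKind::PlainPodman,
        _ => RuntimeKind::Inconsistent,
    }
}
```

The `Inconsistent` arm is there so that an app with a unit file but an empty label gets flagged for inspection instead of being silently counted as migrated.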
|
||||||
- `prod_orchestrator::stop()` (prod_orchestrator.rs:2890) does: `self.loaded(app_id)?` →
|
- `prod_orchestrator::stop()` (prod_orchestrator.rs:2890) does: `self.loaded(app_id)?` →
|
||||||
`quadlet::stop_service("{name}.service")` (fails-soft for non-quadlet) → `runtime.stop_container`
|
`quadlet::stop_service("{name}.service")` (fails-soft for non-quadlet) → `runtime.stop_container`
|
||||||
(podman fallback). For electrumx, `compute_container_name` → bare `"electrumx"` (correct target).
|
(podman fallback). For electrumx, `compute_container_name` → bare `"electrumx"` (correct target).
|
||||||
@@ -223,15 +245,9 @@ in the gate run. **Contrast:** `filebrowser` stops correctly (`running → stopp
|
|||||||
keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard (proven on filebrowser).
|
keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard (proven on filebrowser).
|
||||||
So the bug is purely "container never stops", not "state not reported".
|
So the bug is purely "container never stops", not "state not reported".
|
||||||
|
|
||||||
**Decisions needed before coding the fix:**
|
**Quadlet-vs-podman question: RESOLVED.** Quadlet is intended (.198 has the `.container` files;
|
||||||
1. **Quadlet vs plain-podman — which is the intended runtime?** `bitcoin-core.container.disabled`
|
see ground-truth block above). No need to redesign — the work is (a) restore .228's quadlet units,
|
||||||
(May 6) suggests the fleet *reverted* backend apps from quadlet to plain podman. If plain-podman
|
(b) fix the two robustness sub-bugs, (c) re-run the canonical gate on a clean node.
|
||||||
is intended, `package.stop` MUST reliably reach `runtime.stop_container` for every app (fix the
|
|
||||||
`loaded()`/fallback gap). If quadlet is intended, these apps aren't migrated and the
|
|
||||||
status-doc/Pillar-1 claims must be corrected + the migration actually done. **Check .198
|
|
||||||
(untouched today) for ground truth: is its `electrumx` quadlet-managed or plain podman?**
|
|
||||||
2. Whether to also widen the gate's stop-wait timeout for heavy apps (secondary; the primary bug is
|
|
||||||
that the container never stops at all, so timeout tuning won't fix electrumx).
|
|
||||||
|
|
||||||
### MY-SESSION ERRATA (own it on resume)
|
### MY-SESSION ERRATA (own it on resume)
|
||||||
- I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
|
- I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
|
||||||
@@ -245,20 +261,24 @@ in the gate run. **Contrast:** `filebrowser` stops correctly (`running → stopp
|
|||||||
→ `Invalid Docker image format`.
|
→ `Invalid Docker image format`.
|
||||||
|
|
||||||
### NEXT STEPS (in order)
|
### NEXT STEPS (in order)
|
||||||
1. **Get .198 ground truth on quadlet** (one SSH at a time — .198 sshd wedges under concurrent
|
1. ✅ **DONE — .198 ground truth:** quadlet is intended (.198 has the backend `.container` files).
|
||||||
sessions). `podman inspect electrumx --format '{{index .Config.Labels "PODMAN_SYSTEMD_UNIT"}}'`
|
2. **Run the CANONICAL gate on .198 FIRST** — it is the clean, properly-quadletized node (I did NOT
|
||||||
and `ls ~/.config/containers/systemd/`. Decides the quadlet-vs-podman question above.
|
touch it today). `ARCHY_HOST=192.168.1.198 ARCHY_SCHEME=https ARCHY_PASSWORD='ThisIsWeb54321@'
|
||||||
2. **Fix the `package.stop` propagation bug** so every app's stop reaches `runtime.stop_container`
|
ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`. NO cascade; never kill
|
||||||
(the podman fallback) when no quadlet unit exists — i.e. `loaded()` failure / non-unknown-app
|
mid-iteration. This tells us whether the stop bug reproduces on a quadlet-correct node (→ real
|
||||||
error must not abort the stop. Add a unit/integration test (mock orchestrator) reproducing
|
product bug) or was purely .228 contamination (→ just re-quadletize .228).
|
||||||
electrumx-style "no quadlet unit" stop.
|
3. **Restore .228's quadlet units** — properly reinstall its backend apps so `.container` files
|
||||||
3. **Re-run the CANONICAL gate** `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1
|
regenerate (match .198). The cleanest route is the gate's own install path or a forced reconcile;
|
||||||
tests/lifecycle/run-20x.sh` on .228, then .198. (Do NOT set CASCADE unless deliberately testing
|
verify `.container` files reappear + `PODMAN_SYSTEMD_UNIT` is set, then re-run the gate on .228.
|
||||||
uninstall/reinstall, and never kill it mid-iteration on a real node.)
|
4. **Fix the two robustness sub-bugs** (only if they reproduce on quadlet-correct nodes / as
|
||||||
4. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
|
hardening): (a) `package.start` must regenerate a missing quadlet unit, not fall back to bare
|
||||||
|
podman; (b) `prod_orchestrator::stop()` podman-fallback must fire when there's no quadlet unit
|
||||||
|
(`loaded()` failure / non-`unknown_app_id` error must not abort the stop). Add a mock-orchestrator
|
||||||
|
test reproducing electrumx-style "no quadlet unit" stop.
|
||||||
|
5. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
|
||||||
config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
|
config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
|
||||||
install_netbird_stack in stacks.rs).
|
install_netbird_stack in stacks.rs).
|
||||||
5. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
|
6. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
|
||||||
|
|
||||||
### KNOWN ISSUES / WATCH-OUTS
|
### KNOWN ISSUES / WATCH-OUTS
|
||||||
- **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates
|
- **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates
|
||||||
|
|||||||