From a111d79a05a9093e7eb6ec391675f48dcfcae9c7 Mon Sep 17 00:00:00 2001 From: archipelago Date: Mon, 22 Jun 2026 06:00:42 -0400 Subject: [PATCH] =?UTF-8?q?docs(gate):=20downgrade=20stop-blocker=20?= =?UTF-8?q?=E2=9B=94=E2=86=92=E2=9A=A0=EF=B8=8F=20=E2=80=94=20.198=20has?= =?UTF-8?q?=20quadlet=20units,=20.228=20state=20was=20my=20contamination?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit .198 ground truth: backend apps ARE quadlet (.container files present) -> quadlet is the intended runtime. .228's plain-podman state traced to my cascade-gate uninstall + package.start restore (no quadlet regen). Two real robustness sub-bugs remain (start should regen quadlet; stop podman-fallback gap). Next: canonical gate on CLEAN .198 first to tell real-bug from contamination. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/PRODUCTION-MASTER-PLAN.md | 92 +++++++++++++++++++++------------- 1 file changed, 56 insertions(+), 36 deletions(-) diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md index e1462fa7..5690fcd6 100644 --- a/docs/PRODUCTION-MASTER-PLAN.md +++ b/docs/PRODUCTION-MASTER-PLAN.md @@ -190,25 +190,47 @@ frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + no nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering guard is KEPT on purpose (beneficial; not a blocker). -### ⛔ GATE BLOCKER discovered 2026-06-22 — `package.stop` does not stop several apps (FIX FIRST) +### ⚠️ GATE FINDING 2026-06-22 — `package.stop` non-propagation (mostly self-inflicted on .228; verify on .198) Step 1 (sync .228 tcp-health manifest) is **DONE + verified** (frontend adopted, UI 200, no -churn). Step 2 (the 5× gate) surfaced a **real, reproducible blocker** — this is now the top task. +churn). Step 2 (the 5× gate) surfaced a `package.stop` failure — **but the headline cause turned +out to be MY cascade-gate contaminating .228**, not a fundamental product gap. Severity downgraded +from ⛔ to ⚠️ after the .198 ground-truth check (below). Still has a real robustness sub-bug. -**Symptom.** On a CLEAN, healthy .228, `package.stop electrumx` returns `{"status":"stopping"}` -but the container **never stops** — `container-list` shows `running` for 66s+, the scanner keeps -logging `Detected container: ElectrumX (running)`. The gate's `wait_for_container_status electrumx -stopped 60` therefore times out. Same failure hit bitcoin-knots / btcpay-server / fedimint / immich -in the gate run. **Contrast:** `filebrowser` stops correctly (`running → stopped` in ~6s). +**Symptom.** On (post-contamination) .228, `package.stop electrumx` returns `{"status":"stopping"}` +but the container **never stops** — `container-list` shows `running` 66s+. The gate's +`wait_for_container_status electrumx stopped 60` times out. Same hit bitcoin-knots/btcpay/fedimint/ +immich. **Contrast:** `filebrowser` stops correctly (`running → stopped` ~6s). -**Root-cause chain (evidence-backed, not fully pinned):** -- These app containers on .228 run as **plain `podman run --restart=unless-stopped` — NOT quadlet - units.** `podman inspect electrumx` → `PODMAN_SYSTEMD_UNIT` is EMPTY; `systemctl --user stop - electrumx.service` → `Unit electrumx.service not loaded (rc=5)`. The only `.container` quadlet - files on disk are the 4 UI companions + `home-assistant`; **`bitcoin-core.container` is renamed - `.disabled-20260506`**. ⇒ **The "Quadlet-everywhere ~96% migrated" claim in - `app-registry-status-2026-06-21.md` is WRONG for backend apps** (the survey read a misleading - signal). This is itself a finding the gate was meant to catch. +**.198 GROUND TRUTH (decisive, checked 2026-06-22):** .198 (untouched today) **HAS quadlet +`.container` files for backend apps** — `bitcoin-knots.container`, `btcpay-server.container`, +`fedimint.container`, `filebrowser.container`, `indeedhub.container`, `gitea.container`, +`grafana.container`, `botfights.container`, `archy-{btcpay-db,nbxplorer}.container`, the +`fedimint*`/`indeedhub*` members, etc. **⇒ Quadlet IS the intended backend runtime.** .228 instead +has NONE of these (only the 4 UI companions + home-assistant; `bitcoin-core.container` is +`.disabled-20260506`). **So .228's plain-podman state is contamination:** my cascade-destructive +gate UNINSTALLED its apps (removing the `.container` files) and my `package.start` restore brought +them back as plain `podman run --restart=unless-stopped` **without regenerating the quadlet units**. +`podman inspect electrumx` on .228 → `PODMAN_SYSTEMD_UNIT` EMPTY; `systemctl --user stop +electrumx.service` → `Unit not loaded (rc=5)`. (NB: electrumx specifically shows no `PODMAN_SYSTEMD_UNIT` +on .198 too — confirm whether electrumx has its own `.container` on .198; the listing was truncated.) + +**Two real sub-bugs remain (independent of the contamination):** +1. **`package.start`/restore recreates a container as plain podman when its quadlet unit is missing** + instead of regenerating the `.container` unit — leaving it un-stoppable via systemctl. Should + reconcile the quadlet unit, not fall back to bare podman silently. +2. **`prod_orchestrator::stop()` podman-fallback doesn't fire for electrumx-class apps.** Stop path + (prod_orchestrator.rs:2890): `loaded(app_id)?` → `quadlet::stop_service` (fail-soft) → + `runtime.stop_container` (podman). `compute_container_name(electrumx)` → bare `"electrumx"` + (correct target). filebrowser reaches the fallback and stops; electrumx does NOT ⇒ suspect + `loaded("electrumx")` erroring before the fallback AND the error not classed as + `is_unknown_app_id_error` (so `do_orchestrator_package_stop` never reaches `do_package_stop`). + Confirm by promoting the best-effort `install_log("STOP …")`/`STOP FAIL` to `tracing::error!` + (it was empty in .228's install log) and reading `loaded()` + `is_unknown_app_id_error`. + +**Correction to the status doc:** the "Quadlet-everywhere ~96%" survey may have mis-read the signal +*on contaminated nodes*; .198 genuinely is quadlet, so re-survey from `.container` file presence + +`PODMAN_SYSTEMD_UNIT`, not from "container running". - `prod_orchestrator::stop()` (prod_orchestrator.rs:2890) does: `self.loaded(app_id)?` → `quadlet::stop_service("{name}.service")` (fails-soft for non-quadlet) → `runtime.stop_container` (podman fallback). For electrumx, `compute_container_name` → bare `"electrumx"` (correct target). @@ -223,15 +245,9 @@ in the gate run. **Contrast:** `filebrowser` stops correctly (`running → stopp keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard (proven on filebrowser). So the bug is purely "container never stops", not "state not reported". -**Decisions needed before coding the fix:** -1. **Quadlet vs plain-podman — which is the intended runtime?** `bitcoin-core.container.disabled` - (May 6) suggests the fleet *reverted* backend apps from quadlet to plain podman. If plain-podman - is intended, `package.stop` MUST reliably reach `runtime.stop_container` for every app (fix the - `loaded()`/fallback gap). If quadlet is intended, these apps aren't migrated and the - status-doc/Pillar-1 claims must be corrected + the migration actually done. **Check .198 - (untouched today) for ground truth: is its `electrumx` quadlet-managed or plain podman?** -2. Whether to also widen the gate's stop-wait timeout for heavy apps (secondary; the primary bug is - that the container never stops at all, so timeout tuning won't fix electrumx). +**Quadlet-vs-podman question: RESOLVED.** Quadlet is intended (.198 has the `.container` files; +see ground-truth block above). No need to redesign — the work is (a) restore .228's quadlet units, +(b) fix the two robustness sub-bugs, (c) re-run the canonical gate on a clean node. ### MY-SESSION ERRATA (own it on resume) - I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that @@ -245,20 +261,24 @@ in the gate run. **Contrast:** `filebrowser` stops correctly (`running → stopp → `Invalid Docker image format`. ### NEXT STEPS (in order) -1. **Get .198 ground truth on quadlet** (one SSH at a time — .198 sshd wedges under concurrent - sessions). `podman inspect electrumx --format '{{index .Config.Labels "PODMAN_SYSTEMD_UNIT"}}'` - and `ls ~/.config/containers/systemd/`. Decides the quadlet-vs-podman question above. -2. **Fix the `package.stop` propagation bug** so every app's stop reaches `runtime.stop_container` - (the podman fallback) when no quadlet unit exists — i.e. `loaded()` failure / non-unknown-app - error must not abort the stop. Add a unit/integration test (mock orchestrator) reproducing - electrumx-style "no quadlet unit" stop. -3. **Re-run the CANONICAL gate** `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1 - tests/lifecycle/run-20x.sh` on .228, then .198. (Do NOT set CASCADE unless deliberately testing - uninstall/reinstall, and never kill it mid-iteration on a real node.) -4. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen, +1. ✅ **DONE — .198 ground truth:** quadlet is intended (.198 has the backend `.container` files). +2. **Run the CANONICAL gate on .198 FIRST** — it is the clean, properly-quadletized node (I did NOT + touch it today). `ARCHY_HOST=192.168.1.198 ARCHY_SCHEME=https ARCHY_PASSWORD='ThisIsWeb54321@' + ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`. NO cascade; never kill + mid-iteration. This tells us whether the stop bug reproduces on a quadlet-correct node (→ real + product bug) or was purely .228 contamination (→ just re-quadletize .228). +3. **Restore .228's quadlet units** — properly reinstall its backend apps so `.container` files + regenerate (match .198). The cleanest route is the gate's own install path or a forced reconcile; + verify `.container` files reappear + `PODMAN_SYSTEMD_UNIT` is set, then re-run the gate on .228. +4. **Fix the two robustness sub-bugs** (only if they reproduce on quadlet-correct nodes / as + hardening): (a) `package.start` must regenerate a missing quadlet unit, not fall back to bare + podman; (b) `prod_orchestrator::stop()` podman-fallback must fire when there's no quadlet unit + (`loaded()` failure / non-`unknown_app_id` error must not abort the stop). Add a mock-orchestrator + test reproducing electrumx-style "no quadlet unit" stop. +5. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen, config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is install_netbird_stack in stacks.rs). -5. Then single-container legacy apps onto the orchestrator install flow; then demote the banner. +6. Then single-container legacy apps onto the orchestrator install flow; then demote the banner. ### KNOWN ISSUES / WATCH-OUTS - **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates