From 47026fae3091063266576ed93bf4f5f22e258f37 Mon Sep 17 00:00:00 2001
From: archipelago
Date: Mon, 22 Jun 2026 05:47:11 -0400
Subject: [PATCH] docs(gate): document package.stop blocker + quadlet-vs-podman finding (.228)

5x gate run surfaced a real blocker: package.stop does not stop electrumx/
bitcoin-knots/btcpay/fedimint/immich (container stays running; gate stop-wait
times out). Root cause chain: these backend apps run as plain podman
--restart=unless-stopped, NOT quadlet units (PODMAN_SYSTEMD_UNIT empty; only UI
companions + home-assistant have .container files; bitcoin-core.container is
.disabled). orchestrator.stop() podman-fallback fires for filebrowser but not
electrumx -> suspect loaded()/is_unknown_app_id_error gap. stop->stopped state
reporting itself is correct (filebrowser proof, user_stopped guard).

Also: corrected the canonical gate invocation (DESTRUCTIVE only, not CASCADE);
restored .228 after my cascade-gate left apps stranded.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/PRODUCTION-MASTER-PLAN.md | 78 +++++++++++++++++++++++++++++-----
 1 file changed, 68 insertions(+), 10 deletions(-)

diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md
index 4dbe136d..e1462fa7 100644
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -190,17 +190,75 @@ frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + no
 nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
 guard is KEPT on purpose (beneficial; not a blocker).
 
+### ⛔ GATE BLOCKER discovered 2026-06-22 — `package.stop` does not stop several apps (FIX FIRST)
+
+Step 1 (sync .228 tcp-health manifest) is **DONE + verified** (frontend adopted, UI 200, no
+churn). Step 2 (the 5× gate) surfaced a **real, reproducible blocker** — this is now the top task.
+
+**Symptom.** On a CLEAN, healthy .228, `package.stop electrumx` returns `{"status":"stopping"}`
+but the container **never stops** — `container-list` shows `running` for 66s+, the scanner keeps
+logging `Detected container: ElectrumX (running)`. The gate's `wait_for_container_status electrumx
+stopped 60` therefore times out. Same failure hit bitcoin-knots / btcpay-server / fedimint / immich
+in the gate run. **Contrast:** `filebrowser` stops correctly (`running → stopped` in ~6s).
+
+**Root-cause chain (evidence-backed, not fully pinned):**
+- These app containers on .228 run as **plain `podman run --restart=unless-stopped` — NOT quadlet
+  units.** `podman inspect electrumx` → `PODMAN_SYSTEMD_UNIT` is EMPTY; `systemctl --user stop
+  electrumx.service` → `Unit electrumx.service not loaded (rc=5)`. The only `.container` quadlet
+  files on disk are the 4 UI companions + `home-assistant`; **`bitcoin-core.container` is renamed
+  `.disabled-20260506`**. ⇒ **The "Quadlet-everywhere ~96% migrated" claim in
+  `app-registry-status-2026-06-21.md` is WRONG for backend apps** (the survey read a misleading
+  signal). This is itself a finding the gate was meant to catch.
+- `prod_orchestrator::stop()` (prod_orchestrator.rs:2890) does: `self.loaded(app_id)?` →
+  `quadlet::stop_service("{name}.service")` (fails-soft for non-quadlet) → `runtime.stop_container`
+  (podman fallback). For electrumx, `compute_container_name` → bare `"electrumx"` (correct target).
+  filebrowser hits the podman fallback and stops; electrumx does NOT ⇒ suspect `loaded("electrumx")`
+  **erroring before the fallback** (manifest not loaded in orchestrator) AND the error not being
+  classed as `is_unknown_app_id_error` (so `do_orchestrator_package_stop` never falls back to the
+  plain-podman `do_package_stop`). **NEXT: confirm by capturing the `STOP:`/`STOP FAIL:` line**
+  (the best-effort install-log was empty on .228 — promote it to a `tracing::error!` so the failure
+  reason is visible in journalctl, or reproduce locally with the mock orchestrator) and inspecting
+  `loaded()` + `is_unknown_app_id_error`.
+- The **stop→stopped STATE reporting is correct** when the container actually stops: server.rs:1334
+  keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard (proven on filebrowser).
+  So the bug is purely "container never stops", not "state not reported".
+
+**Decisions needed before coding the fix:**
+1. **Quadlet vs plain-podman — which is the intended runtime?** `bitcoin-core.container.disabled`
+   (May 6) suggests the fleet *reverted* backend apps from quadlet to plain podman. If plain-podman
+   is intended, `package.stop` MUST reliably reach `runtime.stop_container` for every app (fix the
+   `loaded()`/fallback gap). If quadlet is intended, these apps aren't migrated and the
+   status-doc/Pillar-1 claims must be corrected + the migration actually done. **Check .198
+   (untouched today) for ground truth: is its `electrumx` quadlet-managed or plain podman?**
+2. Whether to also widen the gate's stop-wait timeout for heavy apps (secondary; the primary bug is
+   that the container never stops at all, so timeout tuning won't fix electrumx).
+
+### MY-SESSION ERRATA (own it on resume)
+- I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
+  is `ARCHY_ALLOW_DESTRUCTIVE=1` only — stop/start/restart, no uninstall/reinstall; see run-20x.sh
+  "Suggested release-gate invocation"). Cascade ran uninstall/reinstall on every app and, when I
+  killed the run mid-iteration, left bitcoin-knots/electrumx/btcpay/fedimint/immich uninstalled or
+  stranded. **I fully restored .228** (reinstalled bitcoin-knots with the correct image
+  `146.59.87.168:3000/lfg2025/bitcoin-knots:latest`; started the rest; cleared a stale
+  `user-stopped.json`). Verified healthy: UI 200, 35 containers, 17 apps `running`.
+- Reinstall gotcha: `package.install` needs a REAL image ref in `dockerImage`; a bare app name
+  → `Invalid Docker image format`.
+
 ### NEXT STEPS (in order)
-1. **Sync .228 to the tcp-health manifest.** .228 still runs the OLD http-health frontend
-   manifest on disk (stable there at low load, but inconsistent). Deploy `apps/indeedhub/manifest.yml`
-   → /opt/archipelago/apps/indeedhub/manifest.yml on .228, restart archipelago, reinstall
-   the frontend (it caches manifests at startup). Verify no churn.
-2. **Run the 5× lifecycle gate** (`ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`) on .228
-   then .198 (ARCHY_ALLOW_DESTRUCTIVE=1). Fix until green. This is the production exit criterion.
-3. **netbird migration (#20 phase 4)** — same pattern, but assess its setup steps first
-   (TLS cert gen, config files, resolver IP — may need host-file-write hooks the current
-   exec/copy_from_host hooks don't cover; legacy is install_netbird_stack in stacks.rs).
-4. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
+1. **Get .198 ground truth on quadlet** (one SSH at a time — .198 sshd wedges under concurrent
+   sessions). `podman inspect electrumx --format '{{index .Config.Labels "PODMAN_SYSTEMD_UNIT"}}'`
+   and `ls ~/.config/containers/systemd/`. Decides the quadlet-vs-podman question above.
+2. **Fix the `package.stop` propagation bug** so every app's stop reaches `runtime.stop_container`
+   (the podman fallback) when no quadlet unit exists — i.e. `loaded()` failure / non-unknown-app
+   error must not abort the stop. Add a unit/integration test (mock orchestrator) reproducing
+   electrumx-style "no quadlet unit" stop.
+3. **Re-run the CANONICAL gate** `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1
+   tests/lifecycle/run-20x.sh` on .228, then .198. (Do NOT set CASCADE unless deliberately testing
+   uninstall/reinstall, and never kill it mid-iteration on a real node.)
+4. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
+   config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
+   install_netbird_stack in stacks.rs).
+5. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
 
 ### KNOWN ISSUES / WATCH-OUTS
 - **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates