docs(gate): document package.stop blocker + quadlet-vs-podman finding (.228)

5x gate run surfaced a real blocker: package.stop does not stop electrumx/
bitcoin-knots/btcpay/fedimint/immich (container stays running; gate stop-wait
times out). Root cause chain: these backend apps run as plain podman
--restart=unless-stopped, NOT quadlet units (PODMAN_SYSTEMD_UNIT empty; only UI
companions + home-assistant have .container files; bitcoin-core.container is
.disabled). orchestrator.stop() podman-fallback fires for filebrowser but not
electrumx -> suspect loaded()/is_unknown_app_id_error gap. stop->stopped state
reporting itself is correct (filebrowser proof, user_stopped guard).

Also: corrected the canonical gate invocation (DESTRUCTIVE only, not CASCADE);
restored .228 after my cascade-gate left apps stranded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
archipelago 2026-06-22 05:47:11 -04:00
parent d6fa262d69
commit 47026fae30


@@ -190,17 +190,75 @@ frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + no
nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
guard is KEPT on purpose (beneficial; not a blocker).
### ⛔ GATE BLOCKER discovered 2026-06-22 — `package.stop` does not stop several apps (FIX FIRST)
Step 1 (sync .228 tcp-health manifest) is **DONE + verified** (frontend adopted, UI 200, no
churn). Step 2 (the 5× gate) surfaced a **real, reproducible blocker** — this is now the top task.
**Symptom.** On a CLEAN, healthy .228, `package.stop electrumx` returns `{"status":"stopping"}`
but the container **never stops**: `container-list` shows `running` for 66s+, the scanner keeps
logging `Detected container: ElectrumX (running)`. The gate's `wait_for_container_status electrumx
stopped 60` therefore times out. Same failure hit bitcoin-knots / btcpay-server / fedimint / immich
in the gate run. **Contrast:** `filebrowser` stops correctly (`running → stopped` in ~6s).
**Root-cause chain (evidence-backed, not fully pinned):**
- These app containers on .228 run as **plain `podman run --restart=unless-stopped` — NOT quadlet
units.** `podman inspect electrumx` → `PODMAN_SYSTEMD_UNIT` is EMPTY; `systemctl --user stop
electrumx.service` → `Unit electrumx.service not loaded (rc=5)`. The only `.container` quadlet
files on disk are the 4 UI companions + `home-assistant`; **`bitcoin-core.container` is renamed
`.disabled-20260506`**. ⇒ **The "Quadlet-everywhere ~96% migrated" claim in
`app-registry-status-2026-06-21.md` is WRONG for backend apps** (the survey read a misleading
signal). This is itself a finding the gate was meant to catch.
- `prod_orchestrator::stop()` (prod_orchestrator.rs:2890) does: `self.loaded(app_id)?` →
`quadlet::stop_service("{name}.service")` (fails-soft for non-quadlet) → `runtime.stop_container`
(podman fallback). For electrumx, `compute_container_name` → bare `"electrumx"` (correct target).
filebrowser hits the podman fallback and stops; electrumx does NOT ⇒ suspect `loaded("electrumx")`
**erroring before the fallback** (manifest not loaded in orchestrator) AND the error not being
classed as `is_unknown_app_id_error` (so `do_orchestrator_package_stop` never falls back to the
plain-podman `do_package_stop`). **NEXT: confirm by capturing the `STOP:`/`STOP FAIL:` line**
(the best-effort install-log was empty on .228 — promote it to a `tracing::error!` so the failure
reason is visible in journalctl, or reproduce locally with the mock orchestrator) and inspecting
`loaded()` + `is_unknown_app_id_error`.
- The **stop→stopped STATE reporting is correct** when the container actually stops: server.rs:1334
keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard (proven on filebrowser).
So the bug is purely "container never stops", not "state not reported".
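
The suspected gap in the second bullet can be sketched as follows. This is a minimal, hypothetical model and not the code in `prod_orchestrator.rs`: the `StopError` variants, the `Outcome` enum, and the `dispatch_stop` signature are assumptions made for illustration. Only the name `is_unknown_app_id_error` and the idea of falling back to the plain-podman stop come from the notes above.

```rust
/// Hypothetical error type; the real orchestrator error is not shown in these notes.
#[derive(Debug, PartialEq)]
enum StopError {
    /// The orchestrator does not know the app id at all.
    UnknownAppId(String),
    /// Any other failure, e.g. `loaded()` erroring for a different reason.
    Other(String),
}

/// Mirrors the role of `is_unknown_app_id_error`: only this class triggers the fallback.
fn is_unknown_app_id_error(err: &StopError) -> bool {
    matches!(err, StopError::UnknownAppId(_))
}

#[derive(Debug, PartialEq)]
enum Outcome {
    StoppedByOrchestrator,
    StoppedByPlainPodman,
    /// Nothing ever issued a stop, so the container keeps running.
    NeverStopped(String),
}

/// Models the outer dispatch: fall back to the plain-podman stop only when the
/// orchestrator error is classed as "unknown app id".
fn dispatch_stop(orchestrator_result: Result<(), StopError>) -> Outcome {
    match orchestrator_result {
        Ok(()) => Outcome::StoppedByOrchestrator,
        Err(err) if is_unknown_app_id_error(&err) => Outcome::StoppedByPlainPodman,
        // Suspected electrumx path: the error is real but unclassified, so no fallback runs.
        Err(err) => Outcome::NeverStopped(format!("{err:?}")),
    }
}
```

If the real flow has this shape, any `loaded()` error that is not classed as unknown-app-id would leave the container running while the RPC has already answered `stopping`, which matches the symptom. Capturing the `STOP FAIL:` line is still needed to confirm it.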
**Decisions needed before coding the fix:**
1. **Quadlet vs plain-podman — which is the intended runtime?** `bitcoin-core.container.disabled`
(May 6) suggests the fleet *reverted* backend apps from quadlet to plain podman. If plain-podman
is intended, `package.stop` MUST reliably reach `runtime.stop_container` for every app (fix the
`loaded()`/fallback gap). If quadlet is intended, these apps aren't migrated and the
status-doc/Pillar-1 claims must be corrected + the migration actually done. **Check .198
(untouched today) for ground truth: is its `electrumx` quadlet-managed or plain podman?**
2. Whether to also widen the gate's stop-wait timeout for heavy apps (secondary; the primary bug is
that the container never stops at all, so timeout tuning won't fix electrumx).
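
For decision 1, the per-container check reduces to whether the `PODMAN_SYSTEMD_UNIT` label is empty. A minimal sketch of that classification, assuming the label value has already been read as a string (for example from `podman inspect`); the `ManagedBy` enum and `classify_runtime` are hypothetical names, not existing project code:

```rust
/// Hypothetical classification of how a container is managed.
#[derive(Debug, PartialEq)]
enum ManagedBy {
    /// A systemd unit (quadlet) owns the container.
    Quadlet { unit: String },
    /// Started with plain `podman run`; no systemd unit is attached.
    PlainPodman,
}

/// Classify a container from the raw value of its `PODMAN_SYSTEMD_UNIT` label.
/// An empty value means plain podman. `<no value>` is what Go templates print
/// for a missing value; it is handled here defensively as "empty" (assumption).
fn classify_runtime(podman_systemd_unit: &str) -> ManagedBy {
    let unit = podman_systemd_unit.trim();
    if unit.is_empty() || unit == "<no value>" {
        ManagedBy::PlainPodman
    } else {
        ManagedBy::Quadlet { unit: unit.to_string() }
    }
}
```

On .228 as described above, `classify_runtime("")` for electrumx would give `ManagedBy::PlainPodman`. The same call against .198's label value would answer the ground-truth question.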
### MY-SESSION ERRATA (own it on resume)
- I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
is `ARCHY_ALLOW_DESTRUCTIVE=1` only — stop/start/restart, no uninstall/reinstall; see run-20x.sh
"Suggested release-gate invocation"). Cascade ran uninstall/reinstall on every app and, when I
killed the run mid-iteration, left bitcoin-knots/electrumx/btcpay/fedimint/immich uninstalled or
stranded. **I fully restored .228** (reinstalled bitcoin-knots with the correct image
`146.59.87.168:3000/lfg2025/bitcoin-knots:latest`; started the rest; cleared a stale
`user-stopped.json`). Verified healthy: UI 200, 35 containers, 17 apps `running`.
- Reinstall gotcha: `package.install` needs a REAL image ref in `dockerImage`; a bare app name →
`Invalid Docker image format`.
### NEXT STEPS (in order)
1. **Sync .228 to the tcp-health manifest.** .228 still runs the OLD http-health frontend
manifest on disk (stable there at low load, but inconsistent). Deploy `apps/indeedhub/manifest.yml`
→ /opt/archipelago/apps/indeedhub/manifest.yml on .228, restart archipelago, reinstall
the frontend (it caches manifests at startup). Verify no churn.
2. **Run the 5× lifecycle gate** (`ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`) on .228
then .198 (ARCHY_ALLOW_DESTRUCTIVE=1). Fix until green. This is the production exit criterion.
3. **netbird migration (#20 phase 4)** — same pattern, but assess its setup steps first
(TLS cert gen, config files, resolver IP — may need host-file-write hooks the current
exec/copy_from_host hooks don't cover; legacy is install_netbird_stack in stacks.rs).
4. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
1. **Get .198 ground truth on quadlet** (one SSH at a time — .198 sshd wedges under concurrent
sessions). `podman inspect electrumx --format '{{index .Config.Labels "PODMAN_SYSTEMD_UNIT"}}'`
and `ls ~/.config/containers/systemd/`. Decides the quadlet-vs-podman question above.
2. **Fix the `package.stop` propagation bug** so every app's stop reaches `runtime.stop_container`
(the podman fallback) when no quadlet unit exists — i.e. `loaded()` failure / non-unknown-app
error must not abort the stop. Add a unit/integration test (mock orchestrator) reproducing
electrumx-style "no quadlet unit" stop.
3. **Re-run the CANONICAL gate** `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1
tests/lifecycle/run-20x.sh` on .228, then .198. (Do NOT set CASCADE unless deliberately testing
uninstall/reinstall, and never kill it mid-iteration on a real node.)
4. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
install_netbird_stack in stacks.rs).
5. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
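
The behaviour wanted from the `package.stop` fix, and the mock-based test that should accompany it, could look roughly like this. It is a sketch under assumptions: `ContainerRuntime`, `MockRuntime`, `StopPath`, and the `stop_app` signature are hypothetical stand-ins, and the real `loaded()` and quadlet calls are reduced to two booleans.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for the runtime behind `runtime.stop_container`.
trait ContainerRuntime {
    fn stop_container(&self, name: &str) -> Result<(), String>;
}

/// Mock that records which containers were asked to stop.
struct MockRuntime {
    stopped: RefCell<Vec<String>>,
}

impl ContainerRuntime for MockRuntime {
    fn stop_container(&self, name: &str) -> Result<(), String> {
        self.stopped.borrow_mut().push(name.to_string());
        Ok(())
    }
}

#[derive(Debug, PartialEq)]
enum StopPath {
    QuadletUnit,
    PodmanFallback,
}

/// Desired behaviour: a missing manifest or missing quadlet unit must not abort
/// the stop; the podman fallback is always reached when no unit handles it.
fn stop_app(
    app_id: &str,
    manifest_loaded: bool,
    quadlet_unit_loaded: bool,
    runtime: &dyn ContainerRuntime,
) -> Result<StopPath, String> {
    if !manifest_loaded {
        // Log and continue instead of returning early (the suspected bug).
        eprintln!("STOP: manifest for {app_id} not loaded; using bare container name");
    }
    if quadlet_unit_loaded {
        // The real code would stop the systemd unit here.
        return Ok(StopPath::QuadletUnit);
    }
    runtime.stop_container(app_id)?;
    Ok(StopPath::PodmanFallback)
}
```

An electrumx-style case is then `stop_app("electrumx", false, false, &mock)`, which should return `Ok(StopPath::PodmanFallback)` and leave `"electrumx"` recorded in the mock.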
### KNOWN ISSUES / WATCH-OUTS
- **.198 is a weak/loaded node** (load avg ~35). The generic reconcile recreates