docs(gate): document package.stop blocker + quadlet-vs-podman finding (.228)

5x gate run surfaced a real blocker: package.stop does not stop electrumx/bitcoin-knots/btcpay/fedimint/immich (container stays running; gate stop-wait times out).

Root cause chain: these backend apps run as plain podman --restart=unless-stopped, NOT quadlet units (PODMAN_SYSTEMD_UNIT empty; only UI companions + home-assistant have .container files; bitcoin-core.container is .disabled). orchestrator.stop() podman-fallback fires for filebrowser but not electrumx -> suspect loaded()/is_unknown_app_id_error gap. stop->stopped state reporting itself is correct (filebrowser proof, user_stopped guard).

Also: corrected the canonical gate invocation (DESTRUCTIVE only, not CASCADE); restored .228 after my cascade-gate left apps stranded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent d6fa262d69
commit 47026fae30
@@ -190,17 +190,75 @@ frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + no
nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
guard is KEPT on purpose (beneficial; not a blocker).

### ⛔ GATE BLOCKER discovered 2026-06-22: `package.stop` does not stop several apps (FIX FIRST)

Step 1 (sync .228 tcp-health manifest) is **DONE + verified** (frontend adopted, UI 200, no
churn). Step 2 (the 5× gate) surfaced a **real, reproducible blocker**; this is now the top task.

**Symptom.** On a CLEAN, healthy .228, `package.stop electrumx` returns `{"status":"stopping"}`
but the container **never stops**: `container-list` shows `running` for 66s+, and the scanner
keeps logging `Detected container: ElectrumX (running)`. The gate's
`wait_for_container_status electrumx stopped 60` therefore times out. The same failure hit
bitcoin-knots / btcpay-server / fedimint / immich in the gate run. **Contrast:** `filebrowser`
stops correctly (`running → stopped` in ~6s).
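
The stop-wait in the symptom above is a bounded poll. The gate's real `wait_for_container_status` helper is not shown in this document, so the following is only a minimal Rust sketch of the semantics. `wait_for_status` and its `probe` closure are hypothetical names, and the probe stands in for whatever reads the container's state from `container-list`.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `probe` until it reports `want` or `timeout` elapses.
/// Returns Ok(elapsed) on success, or Err(last status seen) on timeout.
/// Hypothetical helper: it mirrors the gate's stop-wait, it is not the gate's code.
fn wait_for_status<F>(
    mut probe: F,
    want: &str,
    timeout: Duration,
    interval: Duration,
) -> Result<Duration, String>
where
    F: FnMut() -> String,
{
    let start = Instant::now();
    loop {
        let seen = probe();
        if seen == want {
            return Ok(start.elapsed());
        }
        // The deadline is checked after probing so the last observed status
        // is what gets reported (for electrumx that would be "running").
        if start.elapsed() >= timeout {
            return Err(seen);
        }
        sleep(interval);
    }
}
```

With a probe that always answers `running`, the call returns `Err("running")` once the timeout passes. This matches the electrumx symptom, and it is why widening the timeout alone would not turn the gate green.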

**Root-cause chain (evidence-backed, not fully pinned):**
- These app containers on .228 run as **plain `podman run --restart=unless-stopped`, NOT quadlet
units.** `podman inspect electrumx` → `PODMAN_SYSTEMD_UNIT` is EMPTY;
`systemctl --user stop electrumx.service` → `Unit electrumx.service not loaded (rc=5)`. The only
`.container` quadlet files on disk are the 4 UI companions + `home-assistant`;
**`bitcoin-core.container` is renamed `.disabled-20260506`**. ⇒ **The "Quadlet-everywhere ~96%
migrated" claim in `app-registry-status-2026-06-21.md` is WRONG for backend apps** (the survey
read a misleading signal). This is itself a finding the gate was meant to catch.
- `prod_orchestrator::stop()` (prod_orchestrator.rs:2890) does: `self.loaded(app_id)?` →
`quadlet::stop_service("{name}.service")` (fails-soft for non-quadlet) → `runtime.stop_container`
(podman fallback). For electrumx, `compute_container_name` → bare `"electrumx"` (correct target).
filebrowser hits the podman fallback and stops; electrumx does NOT ⇒ suspect `loaded("electrumx")`
**erroring before the fallback** (manifest not loaded in orchestrator) AND the error not being
classed as `is_unknown_app_id_error` (so `do_orchestrator_package_stop` never falls back to the
plain-podman `do_package_stop`). **NEXT: confirm by capturing the `STOP:`/`STOP FAIL:` line**
(the best-effort install-log was empty on .228; promote it to a `tracing::error!` so the failure
reason is visible in journalctl, or reproduce locally with the mock orchestrator) and inspecting
`loaded()` + `is_unknown_app_id_error`.
- The **stop→stopped STATE reporting is correct** when the container actually stops: server.rs:1334
keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard (proven on filebrowser).
So the bug is purely "container never stops", not "state not reported".
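
The suspected gap in the second bullet can be modelled as a dispatch decision. The sketch below is an assumption-laden illustration, not the code in prod_orchestrator.rs. `StopError`, `StopPath` and `dispatch_stop` are hypothetical names, and the real error type behind `is_unknown_app_id_error` is not shown in this document.

```rust
/// Hypothetical error shape; the real orchestrator error type is not shown here.
#[derive(Debug, Clone, PartialEq)]
enum StopError {
    UnknownAppId(String),
    Other(String),
}

/// Stand-in for `is_unknown_app_id_error`: only this class triggers the fallback.
fn is_unknown_app_id_error(err: &StopError) -> bool {
    matches!(err, StopError::UnknownAppId(_))
}

/// Which path a stop request ends up taking.
#[derive(Debug, PartialEq)]
enum StopPath {
    Orchestrator,
    PlainPodman,
    Aborted(StopError),
}

/// Models the decision attributed to `do_orchestrator_package_stop`:
/// fall back to the plain-podman `do_package_stop` only for unknown-app errors.
fn dispatch_stop(orchestrator_result: Result<(), StopError>) -> StopPath {
    match orchestrator_result {
        Ok(()) => StopPath::Orchestrator,
        Err(e) if is_unknown_app_id_error(&e) => StopPath::PlainPodman,
        // Any other error class ends the request here: no container is stopped.
        Err(e) => StopPath::Aborted(e),
    }
}
```

If the `loaded("electrumx")` failure arrives as anything other than the unknown-app class, this model takes the `Aborted` branch. That would produce the observed behavior, where the stop request is accepted but nothing ever stops the container. Capturing the `STOP FAIL:` line is what would confirm or refute the hypothesis.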

**Decisions needed before coding the fix:**
1. **Quadlet vs plain-podman: which is the intended runtime?** `bitcoin-core.container.disabled`
(May 6) suggests the fleet *reverted* backend apps from quadlet to plain podman. If plain-podman
is intended, `package.stop` MUST reliably reach `runtime.stop_container` for every app (fix the
`loaded()`/fallback gap). If quadlet is intended, these apps aren't migrated, the
status-doc/Pillar-1 claims must be corrected, and the migration must actually be done. **Check .198
(untouched today) for ground truth: is its `electrumx` quadlet-managed or plain podman?**
2. Whether to also widen the gate's stop-wait timeout for heavy apps (secondary; the primary bug is
that the container never stops at all, so timeout tuning won't fix electrumx).
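
The quadlet-vs-podman question comes down to whether a container carries a non-empty `PODMAN_SYSTEMD_UNIT` label, which was empty for electrumx on .228. A minimal Rust sketch of that classification follows, assuming the label value has already been captured as text. `Runtime` and `classify_runtime` are hypothetical names, and the `<no value>` case is a defensive assumption about how some Go template forms print a missing key.

```rust
/// How a container appears to be managed, judged only by its label.
#[derive(Debug, PartialEq)]
enum Runtime {
    /// Label names a systemd unit, so the container is quadlet/systemd-managed.
    Quadlet(String),
    /// Label is empty or absent, so the container was started by plain podman.
    PlainPodman,
}

/// `label_output` is the raw text captured for the PODMAN_SYSTEMD_UNIT label.
/// Hypothetical helper for illustration; not part of the codebase.
fn classify_runtime(label_output: &str) -> Runtime {
    let unit = label_output.trim();
    // Assumption: a missing key may be printed as an empty string or "<no value>".
    if unit.is_empty() || unit == "<no value>" {
        Runtime::PlainPodman
    } else {
        Runtime::Quadlet(unit.to_string())
    }
}
```

Running the same check per app on .198 and .228 would show, app by app, which runtime is actually in use, independent of what the status document claims.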

### MY-SESSION ERRATA (own it on resume)
- I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
is `ARCHY_ALLOW_DESTRUCTIVE=1` only: stop/start/restart, no uninstall/reinstall; see run-20x.sh
"Suggested release-gate invocation"). Cascade ran uninstall/reinstall on every app and, when I
killed the run mid-iteration, left bitcoin-knots/electrumx/btcpay/fedimint/immich uninstalled or
stranded. **I fully restored .228** (reinstalled bitcoin-knots with the correct image
`146.59.87.168:3000/lfg2025/bitcoin-knots:latest`; started the rest; cleared a stale
`user-stopped.json`). Verified healthy: UI 200, 35 containers, 17 apps `running`.
- Reinstall gotcha: `package.install` needs a REAL image ref in `dockerImage`; a bare app name
fails with `Invalid Docker image format`.

### NEXT STEPS (in order)
1. **Sync .228 to the tcp-health manifest.** .228 still runs the OLD http-health frontend
manifest on disk (stable there at low load, but inconsistent). Deploy `apps/indeedhub/manifest.yml`
→ /opt/archipelago/apps/indeedhub/manifest.yml on .228, restart archipelago, reinstall
the frontend (it caches manifests at startup). Verify no churn.
2. **Run the 5× lifecycle gate** (`ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`) on .228
then .198 (ARCHY_ALLOW_DESTRUCTIVE=1). Fix until green. This is the production exit criterion.
3. **netbird migration (#20 phase 4)**: same pattern, but assess its setup steps first
(TLS cert gen, config files, resolver IP; may need host-file-write hooks the current
exec/copy_from_host hooks don't cover; legacy is install_netbird_stack in stacks.rs).
4. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
1. **Get .198 ground truth on quadlet** (one SSH at a time; .198 sshd wedges under concurrent
sessions). Run `podman inspect electrumx --format '{{index .Config.Labels "PODMAN_SYSTEMD_UNIT"}}'`
and `ls ~/.config/containers/systemd/`. The result decides the quadlet-vs-podman question above.
2. **Fix the `package.stop` propagation bug** so every app's stop reaches `runtime.stop_container`
(the podman fallback) when no quadlet unit exists, i.e. a `loaded()` failure / non-unknown-app
error must not abort the stop. Add a unit/integration test (mock orchestrator) reproducing
electrumx-style "no quadlet unit" stop.
3. **Re-run the CANONICAL gate** `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1 tests/lifecycle/run-20x.sh`
on .228, then .198. (Do NOT set CASCADE unless deliberately testing
uninstall/reinstall, and never kill it mid-iteration on a real node.)
4. **netbird migration (#20 phase 4)**: same pattern; assess setup steps first (TLS cert gen,
config files, resolver IP; may need host-file-write hooks beyond exec/copy_from_host; legacy is
install_netbird_stack in stacks.rs).
5. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
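
For step 2, one possible shape of the fix and its regression test is sketched below. It is not the project's mock orchestrator: `MockRuntime`, `MockOrchestrator` and `stop_quadlet` are hypothetical stand-ins. The sketch assumes that the container name equals the bare app id (as reported for electrumx) and that the podman fallback is only needed when no quadlet unit exists.

```rust
use std::cell::RefCell;

/// Hypothetical runtime stand-in that records which containers it was asked to stop.
#[derive(Default)]
struct MockRuntime {
    stopped: RefCell<Vec<String>>,
}

impl MockRuntime {
    fn stop_container(&self, name: &str) -> Result<(), String> {
        self.stopped.borrow_mut().push(name.to_string());
        Ok(())
    }
}

/// Hypothetical orchestrator: knows some manifests and some quadlet units.
struct MockOrchestrator {
    loaded_apps: Vec<String>,
    quadlet_units: Vec<String>,
    runtime: MockRuntime,
}

impl MockOrchestrator {
    fn loaded(&self, app_id: &str) -> Result<(), String> {
        if self.loaded_apps.iter().any(|a| a == app_id) {
            Ok(())
        } else {
            Err(format!("manifest for {app_id} not loaded"))
        }
    }

    fn stop_quadlet(&self, unit: &str) -> Result<(), String> {
        if self.quadlet_units.iter().any(|u| u == unit) {
            Ok(())
        } else {
            Err(format!("Unit {unit} not loaded"))
        }
    }

    /// Tolerant stop: a manifest-lookup failure is logged, not propagated with `?`.
    fn stop(&self, app_id: &str) -> Result<(), String> {
        if let Err(e) = self.loaded(app_id) {
            eprintln!("STOP: {app_id}: {e}; continuing to podman fallback");
        }
        // Assumption: the container name equals the bare app id.
        if self.stop_quadlet(&format!("{app_id}.service")).is_ok() {
            return Ok(());
        }
        self.runtime.stop_container(app_id)
    }
}
```

A test built on this would construct an orchestrator with no loaded manifest and no `electrumx.service` unit, call `stop("electrumx")`, and assert that the mock runtime recorded exactly one stop for `electrumx`. Whether the real fix should continue past every `loaded()` error or only specific classes is a design decision this sketch does not settle.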

### KNOWN ISSUES / WATCH-OUTS
- **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates