diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md
index bbc561ec..03b0f00d 100644
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -169,22 +169,58 @@ phases 2–6 (`dual-ecash-design.md`).
 ### ▶ CURRENT STATE + RESUME (2026-06-22 evening) — RESUME FROM HERE (works from any device)
 
 **Headline:** the production gate's `package.stop` blocker is **FIXED**; **`.228` is 1×-GREEN
-(110/110)**; a **hardened 5× run is IN PROGRESS on `.228`** (the single-node exit criterion). The
-gate is now single-node (.228); multinode is split out (`docs/multinode-testing-plan.md`).
+(110/110)**; a **fresh 5× run is IN PROGRESS on `.228`** (the single-node exit criterion) after a
+real mempool bug found + fixed (below). The gate is now single-node (.228); multinode is split out
+(`docs/multinode-testing-plan.md`). The gate is canonically **5×** now — `run-gate.sh` (the `20x`
+naming/script was removed 2026-06-22, commit `57a013bc`).
+
+**2026-06-22 (late) — mempool stale-IP bug FOUND + FIXED (real production bug, not a flake):**
+The 1st 5× attempt failed iteration 1 on `#74 mempool api backend remains queryable`. Root cause was
+NOT timing — the frontend nginx pinned mempool-api's IP at startup (no `resolver`); after the gate
+restarts mempool-api (new podman IP) nginx 502s and the UI shows "offline". Fixed in
+`mempool-frontend:v3.0.1` (resolver+variable proxy_pass; see `[[project_mempool_nginx_stale_ip_fix]]`
+/ `docker/mempool-frontend/`), pushed to vps2, manifests bumped 3.0.0→3.0.1, deployed + resilience-
+verified live on .228 (backend restart now auto-recovers). Also fixed the test itself (`mempool.bats`
+#74: 180s→300s + real `fail` helper). Commits `0f05f73a` (fix) `57a013bc` (gate rename).
 
 **THE 5× RUN IS DETACHED ON .228 — survives terminal/session close. Check it from any machine:**
 ```
 sshpass -p archipelago ssh archipelago@192.168.1.228 \
-  'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x2.log; \
-  echo "running pid: $(pgrep -f run-gate.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x2.log | sort -u'
+  'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x3.log; \
+  echo "running pid: $(pgrep -f run-gate.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x3.log | sort -u'
 ```
-- Log: `/tmp/gate-5x2.log` on .228 · launched `nohup` (pid was 4042141) · `ARCHY_ITERATIONS=5
-  ARCHY_ALLOW_DESTRUCTIVE=1`, run **ON the node** from `/tmp/lifecycle-run/tests/lifecycle`
-  (ARCHY_HOST=127.0.0.1). `bats` 1.11.1 + static `jq` 1.7.1 are installed on .228 for this.
+- Log: `/tmp/gate-5x3.log` on .228 · launched `nohup` · `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`,
+  run **ON the node** from `/tmp/lifecycle-run/tests/lifecycle` via `./run-gate.sh` (ARCHY_HOST=127.0.0.1).
+  `bats` 1.11.1 + static `jq` 1.7.1 are installed on .228.
 - **If all 5 iterations PASS → .228 has met the single-node criterion → demote the banner.**
-- If it flakes again: it'll be readiness-under-churn (lnd/mempool); the hardening (commit `98f4fa44`:
-  inter-iteration `settle_stack()` + 180–240s readiness windows) targets exactly that. Re-copy the
-  repo `tests/lifecycle` to /tmp/lifecycle-run and re-launch.
+- If it flakes again: readiness-under-churn (lnd/mempool); hardening in `98f4fa44` (inter-iteration
+  `settle_stack()` + readiness windows). Re-copy repo `tests/lifecycle` to /tmp/lifecycle-run, relaunch.
+
+**▶ 2026-06-23 (morning) — 5× FINISHED 2/5; both mempool fails ROOT-CAUSED to ONE real
+orchestrator bug (NOT flakes) + FIXED:** the overnight run finished `passed: 2 / failed: 3` on
+`gate-5x3.log`, three *distinct one-off* fails, none repeating:
+- iter1 `#5 container-list valid state for bitcoin-knots` — pre-launch churn (as predicted); didn't
+  repeat. **Hardened anyway:** the probe was a single-shot read; now polls ≤30s for a settled valid
+  state so a momentary `restarting`/transient can't flake a 20-min iteration (`bitcoin-knots.bats`).
+- iter2 `#74 mempool api queryable` + iter5 `#73 mempool stack running` — **SAME root cause.**
+  `package.restart mempool` resolves its container list via `ordered_containers_for_start`, which was
+  **injecting phantom stack-member names** (`mysql-mempool`, `archy-mempool-api`, `archy-mempool-web`
+  — variant names from the union `startup_order` list that aren't live on this node). The phantom
+  `mysql-mempool` is 2nd in the start order; `do_orchestrator_package_start` hits its unknown-app-id
+  fallback → `do_package_start` inspect fails "no such object" → the `?` **aborts the whole start
+  sequence**, so `mempool-api` (pos 5) + `mempool` frontend (pos 8) never start. They then sat down
+  ~6 min until the health monitor independently recovered them → #73 (frontend not running in 180s)
+  and #74 (api not queryable in 300s) both flake. Journal proof on .228: `package.restart mempool
+  failed: Start failed: mysql-mempool: ... no such object`, 23:27:32.
+  **Fix:** `ordered_containers_for_start` now orders only the *actually-present* containers and never
+  injects phantom order entries (new pure helper `order_present_containers` + 3 unit tests,
+  `dependencies.rs`). This is the SAME class as the mempool nginx bug — a hardcoded-name/reality
+  mismatch — and is exactly the manifest-driven-lifecycle anti-pattern the master plan targets.
+- **Deploy + relaunch:** built release binary on .116, swapped `/usr/local/bin/archipelago` on .228
+  (containers live under `user@1000.service`, NOT the `archipelago.service` cgroup, so a service
+  restart does NOT kill them — verified via conmon cgroup paths). Manually verified mempool restart
+  keeps the stack up, then relaunched a clean 5× → see `gate-5x4.log` (check cmd above, swap the
+  filename). Expectation: all three fixed → 5/5 green → demote the banner.
 
 **Code fixes shipped this session (all on `main`, built + DEPLOYED to .228 AND .198):**
 - `2dad64b2` stop honours per-app grace (was `-t 30` deadline racing SIGKILL).