docs: master-plan §8b — 5× triage, mempool restart bug fixed

Record the overnight 5× outcome (2/5) and the triage: all three
fails were distinct one-offs. iter1 #5 bitcoin-knots = pre-launch
churn (hardened anyway); iter2 #74 + iter5 #73 = one real
orchestrator bug (phantom stack-member injection in
ordered_containers_for_start), now fixed + live-verified on .228.
Update the resume check command to gate-5x4.log.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
archipelago 2026-06-23 02:23:07 -04:00
parent 92d7f52dd6
commit 6511754545

View File

@ -169,22 +169,58 @@ phases 26 (`dual-ecash-design.md`).
### ▶ CURRENT STATE + RESUME (2026-06-22 evening) — RESUME FROM HERE (works from any device) ### ▶ CURRENT STATE + RESUME (2026-06-22 evening) — RESUME FROM HERE (works from any device)
**Headline:** the production gate's `package.stop` blocker is **FIXED**; **`.228` is 1×-GREEN **Headline:** the production gate's `package.stop` blocker is **FIXED**; **`.228` is 1×-GREEN
(110/110)**; a **hardened 5× run is IN PROGRESS on `.228`** (the single-node exit criterion). The (110/110)**; a **fresh 5× run is IN PROGRESS on `.228`** (the single-node exit criterion) after a
gate is now single-node (.228); multinode is split out (`docs/multinode-testing-plan.md`). real mempool bug found + fixed (below). The gate is now single-node (.228); multinode is split out
(`docs/multinode-testing-plan.md`). The gate is canonically **5×** now — `run-gate.sh` (the `20x`
naming/script was removed 2026-06-22, commit `57a013bc`).
**2026-06-22 (late) — mempool stale-IP bug FOUND + FIXED (real production bug, not a flake):**
The 1st 5× attempt failed iteration 1 on `#74 mempool api backend remains queryable`. Root cause was
NOT timing — the frontend nginx pinned mempool-api's IP at startup (no `resolver`); after the gate
restarts mempool-api (new podman IP) nginx 502s and the UI shows "offline". Fixed in
`mempool-frontend:v3.0.1` (resolver+variable proxy_pass; see `[[project_mempool_nginx_stale_ip_fix]]`
/ `docker/mempool-frontend/`), pushed to vps2, manifests bumped 3.0.0→3.0.1, deployed + resilience-
verified live on .228 (backend restart now auto-recovers). Also fixed the test itself (`mempool.bats`
#74: 180s→300s + real `fail` helper). Commits `0f05f73a` (fix) `57a013bc` (gate rename).
**THE 5× RUN IS DETACHED ON .228 — survives terminal/session close. Check it from any machine:** **THE 5× RUN IS DETACHED ON .228 — survives terminal/session close. Check it from any machine:**
``` ```
sshpass -p archipelago ssh archipelago@192.168.1.228 \ sshpass -p archipelago ssh archipelago@192.168.1.228 \
'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x2.log; \ 'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x3.log; \
echo "running pid: $(pgrep -f run-gate.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x2.log | sort -u' echo "running pid: $(pgrep -f run-gate.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x3.log | sort -u'
``` ```
- Log: `/tmp/gate-5x2.log` on .228 · launched `nohup` (pid was 4042141) · `ARCHY_ITERATIONS=5 - Log: `/tmp/gate-5x3.log` on .228 · launched `nohup` · `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`,
ARCHY_ALLOW_DESTRUCTIVE=1`, run **ON the node** from `/tmp/lifecycle-run/tests/lifecycle` run **ON the node** from `/tmp/lifecycle-run/tests/lifecycle` via `./run-gate.sh` (ARCHY_HOST=127.0.0.1).
(ARCHY_HOST=127.0.0.1). `bats` 1.11.1 + static `jq` 1.7.1 are installed on .228 for this. `bats` 1.11.1 + static `jq` 1.7.1 are installed on .228.
- **If all 5 iterations PASS → .228 has met the single-node criterion → demote the banner.** - **If all 5 iterations PASS → .228 has met the single-node criterion → demote the banner.**
- If it flakes again: it'll be readiness-under-churn (lnd/mempool); the hardening (commit `98f4fa44`: - If it flakes again: readiness-under-churn (lnd/mempool); hardening in `98f4fa44` (inter-iteration
inter-iteration `settle_stack()` + 180240s readiness windows) targets exactly that. Re-copy the `settle_stack()` + readiness windows). Re-copy repo `tests/lifecycle` to /tmp/lifecycle-run, relaunch.
repo `tests/lifecycle` to /tmp/lifecycle-run and re-launch.
**▶ 2026-06-23 (morning) — 5× FINISHED 2/5; both mempool fails ROOT-CAUSED to ONE real
orchestrator bug (NOT flakes) + FIXED:** the overnight run finished `passed: 2 / failed: 3` on
`gate-5x3.log`, three *distinct one-off* fails, none repeating:
- iter1 `#5 container-list valid state for bitcoin-knots` — pre-launch churn (as predicted); didn't
repeat. **Hardened anyway:** the probe was a single-shot read; now polls ≤30s for a settled valid
state so a momentary `restarting`/transient can't flake a 20-min iteration (`bitcoin-knots.bats`).
- iter2 `#74 mempool api queryable` + iter5 `#73 mempool stack running` — **SAME root cause.**
`package.restart mempool` resolves its container list via `ordered_containers_for_start`, which was
**injecting phantom stack-member names** (`mysql-mempool`, `archy-mempool-api`, `archy-mempool-web`
— variant names from the union `startup_order` list that aren't live on this node). The phantom
`mysql-mempool` is 2nd in the start order; `do_orchestrator_package_start` hits its unknown-app-id
fallback → `do_package_start` inspect fails "no such object" → the `?` **aborts the whole start
sequence**, so `mempool-api` (pos 5) + `mempool` frontend (pos 8) never start. They then sat down
~6 min until the health monitor independently recovered them → #73 (frontend not running in 180s)
and #74 (api not queryable in 300s) both flake. Journal proof on .228: `package.restart mempool
failed: Start failed: mysql-mempool: ... no such object`, 23:27:32.
**Fix:** `ordered_containers_for_start` now orders only the *actually-present* containers and never
injects phantom order entries (new pure helper `order_present_containers` + 3 unit tests,
`dependencies.rs`). This is the SAME class as the mempool nginx bug — a hardcoded-name/reality
mismatch — and is exactly the manifest-driven-lifecycle anti-pattern the master plan targets.
- **Deploy + relaunch:** built release binary on .116, swapped `/usr/local/bin/archipelago` on .228
(containers live under `user@1000.service`, NOT the `archipelago.service` cgroup, so a service
restart does NOT kill them — verified via conmon cgroup paths). Manually verified mempool restart
keeps the stack up, then relaunched a clean 5× → see `gate-5x4.log` (check cmd above, swap the
filename). Expectation: all three fixed → 5/5 green → demote the banner.
**Code fixes shipped this session (all on `main`, built + DEPLOYED to .228 AND .198):** **Code fixes shipped this session (all on `main`, built + DEPLOYED to .228 AND .198):**
- `2dad64b2` stop honours per-app grace (was `-t 30` deadline racing SIGKILL). - `2dad64b2` stop honours per-app grace (was `-t 30` deadline racing SIGKILL).