docs: master-plan §8b — 5× triage, mempool restart bug fixed

Record the overnight 5× outcome (2/5) and the triage: all three fails were distinct one-offs. iter1 #5 bitcoin-knots = pre-launch churn (hardened anyway); iter2 #74 + iter5 #73 = one real orchestrator bug (phantom stack-member injection in ordered_containers_for_start), now fixed + live-verified on .228. Update the resume check command to gate-5x4.log.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent
92d7f52dd6
commit
6511754545
@ -169,22 +169,58 @@ phases 2–6 (`dual-ecash-design.md`).
|
|||||||
### ▶ CURRENT STATE + RESUME (2026-06-22 evening) — RESUME FROM HERE (works from any device)
|
### ▶ CURRENT STATE + RESUME (2026-06-22 evening) — RESUME FROM HERE (works from any device)
|
||||||
|
|
||||||
**Headline:** the production gate's `package.stop` blocker is **FIXED**; **`.228` is 1×-GREEN
|
**Headline:** the production gate's `package.stop` blocker is **FIXED**; **`.228` is 1×-GREEN
|
||||||
(110/110)**; a **hardened 5× run is IN PROGRESS on `.228`** (the single-node exit criterion). The
|
(110/110)**; a **fresh 5× run is IN PROGRESS on `.228`** (the single-node exit criterion) after a
|
||||||
gate is now single-node (.228); multinode is split out (`docs/multinode-testing-plan.md`).
|
real mempool bug found + fixed (below). The gate is now single-node (.228); multinode is split out
|
||||||
|
(`docs/multinode-testing-plan.md`). The gate is canonically **5×** now — `run-gate.sh` (the `20x`
|
||||||
|
naming/script was removed 2026-06-22, commit `57a013bc`).
|
||||||
|
|
||||||
|
**2026-06-22 (late) — mempool stale-IP bug FOUND + FIXED (real production bug, not a flake):**
|
||||||
|
The 1st 5× attempt failed iteration 1 on `#74 mempool api backend remains queryable`. Root cause was
|
||||||
|
NOT timing — the frontend nginx pinned mempool-api's IP at startup (no `resolver`); after the gate
|
||||||
|
restarts mempool-api (new podman IP), nginx returns 502 and the UI shows "offline". Fixed in
|
||||||
|
`mempool-frontend:v3.0.1` (resolver+variable proxy_pass; see `[[project_mempool_nginx_stale_ip_fix]]`
|
||||||
|
/ `docker/mempool-frontend/`), pushed to vps2, manifests bumped 3.0.0→3.0.1, deployed + resilience-
|
||||||
|
verified live on .228 (backend restart now auto-recovers). Also fixed the test itself (`mempool.bats`
|
||||||
|
#74: 180s→300s + real `fail` helper). Commits: `0f05f73a` (fix), `57a013bc` (gate rename).
|
||||||
|
|
||||||
**THE 5× RUN IS DETACHED ON .228 — survives terminal/session close. Check it from any machine:**
|
**THE 5× RUN IS DETACHED ON .228 — survives terminal/session close. Check it from any machine:**
|
||||||
```
|
```
|
||||||
sshpass -p archipelago ssh archipelago@192.168.1.228 \
|
sshpass -p archipelago ssh archipelago@192.168.1.228 \
|
||||||
'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x2.log; \
|
'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x3.log; \
|
||||||
echo "running pid: $(pgrep -f run-gate.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x2.log | sort -u'
|
echo "running pid: $(pgrep -f run-gate.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x3.log | sort -u'
|
||||||
```
|
```
|
||||||
- Log: `/tmp/gate-5x2.log` on .228 · launched `nohup` (pid was 4042141) · `ARCHY_ITERATIONS=5
|
- Log: `/tmp/gate-5x3.log` on .228 · launched `nohup` · `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`,
|
||||||
ARCHY_ALLOW_DESTRUCTIVE=1`, run **ON the node** from `/tmp/lifecycle-run/tests/lifecycle`
|
run **ON the node** from `/tmp/lifecycle-run/tests/lifecycle` via `./run-gate.sh` (ARCHY_HOST=127.0.0.1).
|
||||||
(ARCHY_HOST=127.0.0.1). `bats` 1.11.1 + static `jq` 1.7.1 are installed on .228 for this.
|
`bats` 1.11.1 + static `jq` 1.7.1 are installed on .228.
|
||||||
- **If all 5 iterations PASS → .228 has met the single-node criterion → demote the banner.**
|
- **If all 5 iterations PASS → .228 has met the single-node criterion → demote the banner.**
|
||||||
- If it flakes again: it'll be readiness-under-churn (lnd/mempool); the hardening (commit `98f4fa44`:
|
- If it flakes again: readiness-under-churn (lnd/mempool); hardening in `98f4fa44` (inter-iteration
|
||||||
inter-iteration `settle_stack()` + 180–240s readiness windows) targets exactly that. Re-copy the
|
`settle_stack()` + readiness windows). Re-copy repo `tests/lifecycle` to /tmp/lifecycle-run, relaunch.
|
||||||
repo `tests/lifecycle` to /tmp/lifecycle-run and re-launch.
|
|
||||||
|
**▶ 2026-06-23 (morning) — 5× FINISHED 2/5; both mempool fails ROOT-CAUSED to ONE real
|
||||||
|
orchestrator bug (NOT flakes) + FIXED:** the overnight run finished `passed: 2 / failed: 3` on
|
||||||
|
`gate-5x3.log`, three *distinct one-off* fails, none repeating:
|
||||||
|
- iter1 `#5 container-list valid state for bitcoin-knots` — pre-launch churn (as predicted); didn't
|
||||||
|
repeat. **Hardened anyway:** the probe was a single-shot read; now polls ≤30s for a settled valid
|
||||||
|
state so a momentary `restarting`/transient can't flake a 20-min iteration (`bitcoin-knots.bats`).
|
||||||
|
- iter2 `#74 mempool api queryable` + iter5 `#73 mempool stack running` — **SAME root cause.**
|
||||||
|
`package.restart mempool` resolves its container list via `ordered_containers_for_start`, which was
|
||||||
|
**injecting phantom stack-member names** (`mysql-mempool`, `archy-mempool-api`, `archy-mempool-web`
|
||||||
|
— variant names from the union `startup_order` list that aren't live on this node). The phantom
|
||||||
|
`mysql-mempool` is 2nd in the start order; `do_orchestrator_package_start` hits its unknown-app-id
|
||||||
|
fallback → `do_package_start` inspect fails "no such object" → the `?` **aborts the whole start
|
||||||
|
sequence**, so `mempool-api` (pos 5) + `mempool` frontend (pos 8) never start. They then stayed down
|
||||||
|
~6 min until the health monitor independently recovered them → #73 (frontend not running in 180s)
|
||||||
|
and #74 (api not queryable in 300s) both fail. Journal proof on .228: `package.restart mempool
|
||||||
|
failed: Start failed: mysql-mempool: ... no such object`, 23:27:32.
|
||||||
|
**Fix:** `ordered_containers_for_start` now orders only the *actually-present* containers and never
|
||||||
|
injects phantom order entries (new pure helper `order_present_containers` + 3 unit tests,
|
||||||
|
`dependencies.rs`). This is the SAME class as the mempool nginx bug — a hardcoded-name/reality
|
||||||
|
mismatch — and is exactly the manifest-driven-lifecycle anti-pattern the master plan targets.
|
||||||
|
- **Deploy + relaunch:** built release binary on .116, swapped `/usr/local/bin/archipelago` on .228
|
||||||
|
(containers live under `user@1000.service`, NOT the `archipelago.service` cgroup, so a service
|
||||||
|
restart does NOT kill them — verified via conmon cgroup paths). Manually verified mempool restart
|
||||||
|
keeps the stack up, then relaunched a clean 5× → see `gate-5x4.log` (check cmd above, swap the
|
||||||
|
filename). Expectation: all three fixed → 5/5 green → demote the banner.
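
The phantom-injection failure and its fix can be sketched in Rust as below. This is a minimal illustration, not the code in `dependencies.rs`: only the helper name `order_present_containers` comes from the notes above, while its signature, the `start_all` function, and the policy of appending unlisted containers at the end are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Hypothetical signature: order only the containers that are actually
/// present on the node. Names that appear in `startup_order` but not in
/// `present` are skipped, so no phantom entry can reach the start loop.
fn order_present_containers(startup_order: &[&str], present: &[&str]) -> Vec<String> {
    let present_set: HashSet<&str> = present.iter().copied().collect();
    let mut seen: HashSet<&str> = HashSet::new();
    let mut ordered: Vec<String> = Vec::new();

    // Present containers first, in the position the order list gives them.
    for &name in startup_order {
        if present_set.contains(name) && seen.insert(name) {
            ordered.push(name.to_string());
        }
    }
    // Assumption: present containers missing from the order list are
    // appended afterwards in discovery order rather than dropped.
    for &name in present {
        if seen.insert(name) {
            ordered.push(name.to_string());
        }
    }
    ordered
}

/// Illustrative start loop (hypothetical): the `?` returns on the first
/// error, which is how a single phantom name aborted the whole sequence.
fn start_all<F>(names: &[String], mut start_one: F) -> Result<Vec<String>, String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut started = Vec::new();
    for name in names {
        start_one(name.as_str())?;
        started.push(name.clone());
    }
    Ok(started)
}
```

With a union order list such as `["mysql-mempool", "mempool-api", "archy-mempool-api", "mempool", "archy-mempool-web"]` and only `mempool` and `mempool-api` live, the helper yields `mempool-api` then `mempool`, so the start loop never sees `mysql-mempool` and has nothing to abort on.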
|
||||||
|
|
||||||
**Code fixes shipped this session (all on `main`, built + DEPLOYED to .228 AND .198):**
|
**Code fixes shipped this session (all on `main`, built + DEPLOYED to .228 AND .198):**
|
||||||
- `2dad64b2` stop honours per-app grace (was `-t 30` deadline racing SIGKILL).
|
- `2dad64b2` stop honours per-app grace (was `-t 30` deadline racing SIGKILL).
|
||||||
|
|||||||