docs: multinode gate finished + boot-reconciler self-heal bug found+fixed
.5's 5x gate done: 5/5 iterations, all technically FAIL per run-gate.sh's tally but only from .5's permanent pruned-bitcoin ceiling (accepted going in); down to 2 failures/iteration by the end. Found + fixed a real hang (lnd cached a dead bitcoin-knots IP after a restart) live mid-run. Separately found a real boot-reconciler bug via indeedhub going stuck on .116: any genuinely-installed-but-fully-absent app was left stuck forever unless it was one of 8 hardcoded "baseline" apps. Fix tracked, code change in the shared working tree pending test confirmation. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
This commit is contained in:
parent
27e6747c2a
commit
2c1d2a2572
@ -82,11 +82,46 @@ those are marked ✅ below with the commit that did it, so we stop re-litigating
|
|||||||
still pruned — that ceiling is shared by every non-.228 node), but **fully
|
still pruned — that ceiling is shared by every non-.228 node), but **fully
|
||||||
synced** (`ibd:false`, blocks==headers 956,240). Bootstrapped bats 1.11.1 +
|
synced** (`ibd:false`, blocks==headers 956,240). Bootstrapped bats 1.11.1 +
|
||||||
jq 1.7.1 onto it 2026-07-01 and **launched the 5× destructive gate
|
jq 1.7.1 onto it 2026-07-01 and **launched the 5× destructive gate
|
||||||
(`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`) — running now**, log at
|
(`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`).
|
||||||
`/tmp/gate.log` on .5, background poller watching for the `RESULTS` banner.
|
- ✅ **Gate finished 2026-07-01: 5/5 iterations, technically all "FAIL" per
|
||||||
- Once .5's gate reports: bring the rest of the fleet to precondition, then the
|
run-gate.sh's tally — but only because .5's pruned-bitcoin limitation
|
||||||
cross-node federation/mesh/transport suites. This is the literal
|
(expected, permanent, accepted going in) fails one test every single
|
||||||
"next exit criterion" called out in `CLAUDE.md`.
|
iteration.** By iteration 4-5 that was down to exactly 2 failures per run:
|
||||||
|
the expected pruned-bitcoin one, plus a reproducible `lnd` proxy timeout
|
||||||
|
(`https://host/app/lnd/`, distinct from the DNS bug below — happened
|
||||||
|
consistently on both of the last 2 iterations, worth its own investigation,
|
||||||
|
not yet root-caused). Iterations 1-3 also hit test-suite bugs since fixed
|
||||||
|
live mid-run (see Tier 0/below) and one ~2h hang (also below) — none of
|
||||||
|
those are real product bugs.
|
||||||
|
- 🐛 **Real hang found + fixed live**: `lnd` cached a dead IP for
|
||||||
|
`bitcoin-knots` after an earlier restart gave it a new container IP —
|
||||||
|
every RPC needing chain data blocked forever (client-side `timeout`
|
||||||
|
wrappers don't reliably kill `podman exec`'s in-container process).
|
||||||
|
Blocked iteration 4 for ~2 hours before diagnosed + fixed (`podman
|
||||||
|
restart lnd`, forces fresh DNS resolution). **Product-level gap, not
|
||||||
|
fixed at the code level**: dependent services should reconnect/re-resolve
|
||||||
|
after a backend container is recreated, not cache indefinitely. Logged as
|
||||||
|
a follow-up, not yet implemented.
|
||||||
|
- Next: bring the rest of the fleet to precondition, then the cross-node
|
||||||
|
federation/mesh/transport suites. This is the literal "next exit
|
||||||
|
criterion" called out in `CLAUDE.md`.
|
||||||
|
- [x] ~~**Real bug found + fixed 2026-07-01**: boot reconciler left any
|
||||||
|
genuinely-installed-but-fully-absent app stuck forever unless it was one
|
||||||
|
of 8 hardcoded "required baseline" apps~~ — surfaced by indeedhub's
|
||||||
|
backend containers (minio/postgres/relay) never recovering on .116 after
|
||||||
|
going absent. Root cause: `ensure_running_with_mode()`
|
||||||
|
(`prod_orchestrator.rs`) only called `install_fresh()` for
|
||||||
|
`is_required_baseline_app()` apps in the fully-absent case; every other
|
||||||
|
installed app was left as `Left("absent")` with no path back short of an
|
||||||
|
explicit reinstall. Fixed: self-heal now applies to any app that
|
||||||
|
reaches this point (i.e. already confirmed NOT user-stopped / NOT
|
||||||
|
user-uninstalled earlier in the same function — those markers are
|
||||||
|
properly set/cleared on uninstall/reinstall, so this can't resurrect a
|
||||||
|
deliberately-removed app). Deleted the now-dead
|
||||||
|
`is_required_baseline_app()`, updated/renamed the test that had locked
|
||||||
|
in the old behavior. Compiles clean; test suite run in progress.
|
||||||
|
indeedhub itself not yet manually recovered on .116 — the code fix
|
||||||
|
will self-heal it on the next reconcile tick once deployed there.
|
||||||
- [ ] **Phase-3 Quadlet default-flip** — code is validated + opt-in via
|
- [ ] **Phase-3 Quadlet default-flip** — code is validated + opt-in via
|
||||||
`ARCHIPELAGO_USE_QUADLET_BACKENDS=true` on .228/.198 already (confirmed live
|
`ARCHIPELAGO_USE_QUADLET_BACKENDS=true` on .228/.198 already (confirmed live
|
||||||
2026-07-01). Ready to flip (`config.rs:256` + its test) the moment the .5 gate
|
2026-07-01). Ready to flip (`config.rs:256` + its test) the moment the .5 gate
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user