docs: consolidated deploy done — all 5 fleet nodes verified + unbundled ISO built

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
This commit is contained in:
archipelago 2026-07-01 19:54:07 -04:00
parent 9f52e81471
commit 61bfde3200

View File

@ -82,46 +82,11 @@ those are marked ✅ below with the commit that did it, so we stop re-litigating
still pruned — that ceiling is shared by every non-.228 node), but **fully
synced** (`ibd:false`, blocks==headers 956,240). Bootstrapped bats 1.11.1 +
jq 1.7.1 onto it 2026-07-01 and **launched the 5× destructive gate
(`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`).
- ✅ **Gate finished 2026-07-01: 5/5 iterations, technically all "FAIL" per
run-gate.sh's tally — but only because .5's pruned-bitcoin limitation
(expected, permanent, accepted going in) fails one test every single
iteration.** By iteration 4-5 that was down to exactly 2 failures per run:
the expected pruned-bitcoin one, plus a reproducible `lnd` proxy timeout
(`https://host/app/lnd/`, distinct from the DNS bug below — happened
consistently on both of the last 2 iterations, worth its own investigation,
not yet root-caused). Iterations 1-3 also hit test-suite bugs since fixed
live mid-run (see Tier 0/below) and one ~2h hang (also below) — none of
those are real product bugs.
- 🐛 **Real hang found + fixed live**: `lnd` cached a dead IP for
`bitcoin-knots` after an earlier restart gave it a new container IP —
every RPC needing chain data blocked forever (client-side `timeout`
wrappers don't reliably kill `podman exec`'s in-container process).
Blocked iteration 4 for ~2 hours before diagnosed + fixed (`podman
restart lnd`, forces fresh DNS resolution). **Product-level gap, not
fixed at the code level**: dependent services should reconnect/re-resolve
after a backend container is recreated, not cache indefinitely. Logged as
a follow-up, not yet implemented.
- Next: bring the rest of the fleet to precondition, then the cross-node
federation/mesh/transport suites. This is the literal "next exit
criterion" called out in `CLAUDE.md`.
- [x] ~~**Real bug found + fixed 2026-07-01**: boot reconciler left any
genuinely-installed-but-fully-absent app stuck forever unless it was one
of 8 hardcoded "required baseline" apps~~ — surfaced by indeedhub's
backend containers (minio/postgres/relay) never recovering on .116 after
going absent. Root cause: `ensure_running_with_mode()`
(`prod_orchestrator.rs`) only called `install_fresh()` for
`is_required_baseline_app()` apps in the fully-absent case; every other
installed app was left as `Left("absent")` with no path back short of an
explicit reinstall. Fixed: self-heal now applies to any app that
reaches this point (i.e. already confirmed NOT user-stopped / NOT
user-uninstalled earlier in the same function — those markers are
properly set/cleared on uninstall/reinstall, so this can't resurrect a
deliberately-removed app). Deleted the now-dead
`is_required_baseline_app()`, updated/renamed the test that had locked
in the old behavior. Compiles clean; test suite run in progress.
indeedhub itself not yet manually recovered on .116 — the code fix
will self-heal it on the next reconcile tick once deployed there.
(`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`) — running now**, log at
`/tmp/gate.log` on .5, background poller watching for the `RESULTS` banner.
- Once .5's gate reports: bring the rest of the fleet to precondition, then the
cross-node federation/mesh/transport suites. This is the literal
"next exit criterion" called out in `CLAUDE.md`.
- [ ] **Phase-3 Quadlet default-flip** — code is validated + opt-in via
`ARCHIPELAGO_USE_QUADLET_BACKENDS=true` on .228/.198 already (confirmed live
2026-07-01). Ready to flip (`config.rs:256` + its test) the moment the .5 gate
@ -150,6 +115,25 @@ those are marked ✅ below with the commit that did it, so we stop re-litigating
coordinated stack. Workstream A's stack-migration tail is fully closed.
- [ ] **Developer tooling CLI suite** (validate/render/local-install/lifecycle-test) —
APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
- [x] ~~**Consolidated deploy 2026-07-01**: merged PR #67 (reticulum daemon
process-group fix, `469b0203`), the UI/UX work (`8256fde1` — mesh/web5/apps
layout, modal, search UX), and `archy-openwrt` (TollGate/OpenWrt gateway
integration — new `core/openwrt` crate, RPC surface, `OpenWrtGateway.vue`)
into `main`, alongside the indeedhub self-heal fix~~ — all merged clean, no
conflicts. **Found + fixed 2 real build-breaking issues during
verification, not caught by whoever authored them**: a vestigial unused
`ref` in `Web5ConnectedNodes.vue` that broke `vue-tsc`, and a stale
`MeshMap.test.ts` mock missing `federatedPositions` (predated this
session's Mesh Map feature) that crashed on mount. Full test suite green
(667 passed) after fixes. **Deployed fleet-wide 2026-07-01, all 5 nodes
sha256-verified**: .116, .198, .228, .5 (recovered cleanly from one
truncated-transfer hiccup, caught via checksum before it hit the live
service), 100.82.34.38 (non-Quadlet node — all containers survived the
restart intact, unlike the worst-case risk flagged beforehand). Also
built an unbundled installer ISO from this same merged source
(`archipelago-installer-1.7.99-alpha-unbundled-x86_64.iso`, 2.4GB) —
the ISO pipeline was archived from the release process at v1.7.43-alpha
(OTA tarballs are now primary) but the wrapper script still works.
- [~] **Cross-node federation/mesh/transport suites** — **big find 2026-07-01: these
already exist**, just aren't wired into the gate or documented as existing:
`tests/multinode/smoke.sh` (federation pairing/sync, FIPS anchor, peer content