docs: multinode-pass parallel work — 3 items closed, 1 real regression found

While the .5 gate ran: confirmed no legacy multi-container stacks remain
(workstream A tail fully closed), reframed the "30 apps zero coverage"
claim as stale (all apps get generic baseline coverage via
all-apps-lifecycle/matrix, real gap is 34 apps lacking app-specific
assertions), and discovered tests/multinode/smoke.sh already exists and
ran it live against .116<->.228: federation pairing/FIPS/content-browse
all confirmed working, but found + root-caused a real tombstone bug
(federation.remove-node silently swallows tombstone-write failures,
letting removed peers get re-added by background sync). Not fixed yet —
federation/trust code, needs a careful fix, not a blind one.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
archipelago 2026-07-01 15:23:52 -04:00
parent 2f1a577109
commit 81444ab4a8

@@ -89,15 +89,59 @@ those are marked ✅ below with the commit that did it, so we stop re-litigating
"next exit criterion" called out in `CLAUDE.md`.
- [ ] **Phase-3 Quadlet default-flip** — code is validated + opt-in via
`ARCHIPELAGO_USE_QUADLET_BACKENDS=true` on .228/.198 already (confirmed live
2026-07-01). Flip needs: re-test on a healthy idle legacy node, then flip the
default, then multinode gate re-run.
- [ ] **Per-app test coverage for the ~30 apps with zero automated coverage**
framework exists (bats + reusable helpers), just needs per-app suites written.
- [ ] **Convert remaining multi-container legacy stacks to the manifest-owned model**
(workstream A tail) — netbird's legacy installer is already deleted (`89d397bb`);
immich (see Tier 1) and any other multi-container stacks are what's left.
2026-07-01). Ready to flip (`config.rs:256` + its test) the moment the .5 gate
reports clean — deliberately NOT staged uncommitted in the tree (a prior attempt
left an uncommitted flip sitting around and that caused confusion; it's a 2-line
change, faster to just do it fresh once confirmed).
- [x] ~~Per-app test coverage for the ~30 apps with zero automated coverage~~ —
**reframed 2026-07-01, mostly a non-issue.** `all-apps-matrix.bats` +
`all-apps-lifecycle.bats` already give EVERY installed app generic baseline
coverage (no stuck state, no error state, stop/start/restart survives, UI
reachable). The real gap is narrower: **34 apps lack app-specific assertions**
(health endpoints, API queryability, data integrity) beyond that baseline —
aiui, bitcoin-core, botfights, core-lightning, did-wallet, fedimint-clientd,
fedimint-gateway, fips-ui, gitea, grafana, home-assistant, indeedhub (+5
sub-containers), jellyfin, lightning-stack, lnd-ui, morphos-server, netbird
(+2 sub-containers), nextcloud, nostr-rs-relay, photoprism, portainer, router,
searxng, strfry, uptime-kuma, vaultwarden. Not urgent: baseline coverage is a
real safety net; treat as a backlog "nice to harden further," not a gate item.
- [x] ~~Convert remaining multi-container legacy stacks to the manifest-owned model~~
**investigated 2026-07-01, DONE, nothing left.** All 5 real multi-container
stacks (btcpay, mempool, immich, netbird, indeedhub) are on the
`install_stack_via_orchestrator` pattern (`stacks.rs`). saleor was removed from
the codebase; portainer/home-assistant/grafana are single-container
manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd
are 3 separate single-container apps with manifest dependency edges, not a
coordinated stack. Workstream A's stack-migration tail is fully closed.
- [ ] **Developer tooling CLI suite** (validate/render/local-install/lifecycle-test) —
APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
- [~] **Cross-node federation/mesh/transport suites** — **big find 2026-07-01: these
already exist**, just aren't wired into the gate or documented as existing:
`tests/multinode/smoke.sh` (federation pairing/sync, FIPS anchor, peer content
browse, tombstone-removal regression tests), `tests/multinode/meshtastic.sh`
(8-stage on-air mesh test), harness in `tests/multinode/lib/multinode.bash`.
**Actually ran `smoke.sh` live against .116↔.228 2026-07-01: 14 passed, 1
failed, 1 skipped.** Confirms federation pairing (both directions), FIPS
anchor connectivity (both nodes), and peer-content-browse-over-mesh (the
v1.7.95 fix) all genuinely work node-to-node right now.
- ⚠️ **Real robustness gap found**: `node_rpc()` in `tests/multinode/lib/multinode.bash`
has no `--max-time` on its curl calls, so a slow server-side RPC stalls the whole
suite with zero feedback (this is what looked like a hang before it eventually
completed on its own). Cheap fix, not yet applied.
- 🐛 **Real regression found and root-caused**: removing a federation node
(`federation.remove-node`) doesn't reliably stick — B reappeared in A's peer
list after removal in the live test. Root cause: `remove_node()`
(`core/archipelago/src/federation/storage.rs:187`) does
`let _ = tombstone_did(data_dir, did).await` — **silently swallows the
tombstone write's errors.** If that write fails (disk I/O, permission,
transient issue), the peer is removed from `nodes.json` but never actually
tombstoned, so the next background sync/notify-join re-adds it — the
tombstone check at `handlers.rs:592-599` passes because the DID was never
recorded as removed. Diagnosed as a **pre-existing logic gap**, not a fresh
regression from the v1.7.95 fix. **Not fixed yet** — this is federation/trust
code, deliberately not touching it blind; needs a careful fix (surface the
tombstone-write failure instead of swallowing it, and/or retry) plus
re-verification with `smoke.sh` before considering it closed.
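
A minimal sketch of the failure mode and one possible shape for the fix, not the actual `storage.rs` code. Everything here is hypothetical and for illustration only: `FederationStore` is an in-memory stand-in for `nodes.json` plus the tombstone record, the `fail_tombstone_write` switch simulates a failed write, and the tombstone-before-remove ordering is an assumption. The real `remove_node()` is async and takes `data_dir`; this version is synchronous and std-only so the two behaviours can be compared side by side.

```rust
use std::collections::BTreeSet;
use std::io;

/// Hypothetical in-memory stand-in for the federation state on disk:
/// `nodes` plays the role of nodes.json, `tombstones` the removed-DID record.
#[derive(Default)]
struct FederationStore {
    nodes: BTreeSet<String>,
    tombstones: BTreeSet<String>,
    /// Illustration-only switch that simulates a failing tombstone write.
    fail_tombstone_write: bool,
}

impl FederationStore {
    /// Records `did` as removed, or fails the way a disk/permission error would.
    fn tombstone_did(&mut self, did: &str) -> io::Result<()> {
        if self.fail_tombstone_write {
            return Err(io::Error::new(
                io::ErrorKind::PermissionDenied,
                "simulated tombstone write failure",
            ));
        }
        self.tombstones.insert(did.to_owned());
        Ok(())
    }

    /// The pattern described above: the write error is discarded, so the peer
    /// leaves `nodes` even when no tombstone was recorded.
    fn remove_node_swallowing(&mut self, did: &str) {
        let _ = self.tombstone_did(did);
        self.nodes.remove(did);
    }

    /// One possible fix (an assumption, not the committed change): tombstone
    /// first and propagate the error with `?`, so a failed write leaves the
    /// peer list untouched and the caller sees the failure.
    fn remove_node_checked(&mut self, did: &str) -> io::Result<()> {
        self.tombstone_did(did)?;
        self.nodes.remove(did);
        Ok(())
    }

    /// Simplified background sync: re-adds an announced peer unless it is
    /// tombstoned. Returns true only if the peer was newly added.
    fn sync_announced_peer(&mut self, did: &str) -> bool {
        if self.tombstones.contains(did) {
            return false;
        }
        self.nodes.insert(did.to_owned())
    }
}
```

With the swallowing variant and a failed write, a later `sync_announced_peer` call re-adds the removed peer, which matches what the live test showed. With the checked variant the removal returns an error and the peer stays listed until the tombstone can actually be written; whether the real fix should surface the error, retry, or both is still open.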
## Tier 3 — Blocked on a decision or resource only you can supply