docs: multinode-pass parallel work — 3 items closed, 1 real regression found
While the .5 gate ran:

- Confirmed no legacy multi-container stacks remain (workstream A tail fully closed).
- Reframed the "30 apps zero coverage" claim as stale: all apps get generic baseline coverage via all-apps-lifecycle/matrix; the real gap is 34 apps lacking app-specific assertions.
- Discovered that tests/multinode/smoke.sh already exists and ran it live against .116<->.228. Federation pairing, FIPS, and content browse all confirmed working, but the run found + root-caused a real tombstone bug: federation.remove-node silently swallows tombstone-write failures, letting removed peers get re-added by background sync.

Not fixed yet: this is federation/trust code and needs a careful fix, not a blind one.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
parent 2f1a577109
commit 81444ab4a8
@@ -89,15 +89,59 @@ those are marked ✅ below with the commit that did it, so we stop re-litigating
"next exit criterion" called out in `CLAUDE.md`.
- [ ] **Phase-3 Quadlet default-flip** — code is validated + opt-in via
`ARCHIPELAGO_USE_QUADLET_BACKENDS=true` on .228/.198 already (confirmed live
2026-07-01). Flip needs: re-test on a healthy idle legacy node, then flip the
default, then multinode gate re-run.
- [ ] **Per-app test coverage for the ~30 apps with zero automated coverage** —
framework exists (bats + reusable helpers), just needs per-app suites written.
- [ ] **Convert remaining multi-container legacy stacks to the manifest-owned model**
(workstream A tail) — netbird's legacy installer is already deleted (`89d397bb`);
immich (see Tier 1) and any other multi-container stacks are what's left.
2026-07-01). Ready to flip (`config.rs:256` + its test) the moment the .5 gate
reports clean — deliberately NOT staged uncommitted in the tree (a prior attempt
left an uncommitted flip sitting around and that caused confusion; it's a 2-line
change, faster to just do it fresh once confirmed).
- [x] ~~Per-app test coverage for the ~30 apps with zero automated coverage~~ —
**reframed 2026-07-01, mostly a non-issue.** `all-apps-matrix.bats` +
`all-apps-lifecycle.bats` already give EVERY installed app generic baseline
coverage (no stuck state, no error state, stop/start/restart survives, UI
reachable). The real gap is narrower: **34 apps lack app-specific assertions**
(health endpoints, API queryability, data integrity) beyond that baseline —
aiui, bitcoin-core, botfights, core-lightning, did-wallet, fedimint-clientd,
fedimint-gateway, fips-ui, gitea, grafana, home-assistant, indeedhub (+5
sub-containers), jellyfin, lightning-stack, lnd-ui, morphos-server, netbird
(+2 sub-containers), nextcloud, nostr-rs-relay, photoprism, portainer, router,
searxng, strfry, uptime-kuma, vaultwarden. Not urgent — baseline coverage is
a real safety net; treat as a backlog "nice to harden further," not a gate item.
- [x] ~~Convert remaining multi-container legacy stacks to the manifest-owned model~~ —
**investigated 2026-07-01, DONE, nothing left.** All 5 real multi-container
stacks (btcpay, mempool, immich, netbird, indeedhub) are on the
`install_stack_via_orchestrator` pattern (`stacks.rs`). saleor was removed from
the codebase; portainer/home-assistant/grafana are single-container
manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd
are 3 separate single-container apps with manifest dependency edges, not a
coordinated stack. Workstream A's stack-migration tail is fully closed.
- [ ] **Developer tooling CLI suite** (validate/render/local-install/lifecycle-test) —
APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
- [~] **Cross-node federation/mesh/transport suites** — **big find 2026-07-01: these
already exist**, just aren't wired into the gate or documented as existing:
`tests/multinode/smoke.sh` (federation pairing/sync, FIPS anchor, peer content
browse, tombstone-removal regression tests), `tests/multinode/meshtastic.sh`
(8-stage on-air mesh test), harness in `tests/multinode/lib/multinode.bash`.
**Actually ran `smoke.sh` live against .116↔.228 2026-07-01: 14 passed, 1
failed, 1 skipped.** Confirms federation pairing (both directions), FIPS
anchor connectivity (both nodes), and peer-content-browse-over-mesh (the
v1.7.95 fix) all genuinely work node-to-node right now.
- ⚠️ **Real robustness gap found**: `node_rpc()` in `tests/multinode/lib/multinode.bash`
has no `--max-time` on its curl calls — a slow server-side RPC hangs the whole
suite with zero feedback (this is what looked like a hang before it eventually
completed on its own). Cheap fix, not yet applied.
- 🐛 **Real regression found and root-caused**: removing a federation node
(`federation.remove-node`) doesn't reliably stick — B reappeared in A's peer
list after removal in the live test. Root cause: `remove_node()`
(`core/archipelago/src/federation/storage.rs:187`) does
`let _ = tombstone_did(data_dir, did).await` — **silently swallows the
tombstone write's errors.** If that write fails (disk I/O, permission,
transient issue), the peer is removed from `nodes.json` but never actually
tombstoned, so the next background sync/notify-join re-adds it — the
tombstone check at `handlers.rs:592-599` passes because the DID was never
recorded as removed. Diagnosed as a **pre-existing logic gap**, not a fresh
regression from the v1.7.95 fix. **Not fixed yet** — this is federation/trust
code, deliberately not touching it blind; needs a careful fix (surface the
tombstone-write failure instead of swallowing it, and/or retry) plus
re-verification with `smoke.sh` before considering it closed.
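
The failure mode described for `remove_node()` can be illustrated with a minimal, synchronous sketch. Everything here is a hypothetical stand-in rather than the project's code: `Store`, its fields, `remove_node_swallowing`, `remove_node_strict`, and `sync_add` are invented for illustration, the real `tombstone_did` is async and writes to disk, and doing the tombstone write before the removal is an assumption about one possible fix, not the confirmed one.

```rust
use std::collections::BTreeSet;
use std::io;

/// Hypothetical in-memory stand-in for the on-disk federation state.
#[derive(Default)]
struct Store {
    nodes: BTreeSet<String>,      // stands in for nodes.json
    tombstones: BTreeSet<String>, // stands in for the tombstone record
    fail_tombstone_write: bool,   // hook to simulate a failing write
}

impl Store {
    /// Record a DID as removed; fails when the simulated write fails.
    fn tombstone_did(&mut self, did: &str) -> io::Result<()> {
        if self.fail_tombstone_write {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                "simulated tombstone write failure",
            ));
        }
        self.tombstones.insert(did.to_string());
        Ok(())
    }

    /// The pattern from the notes: the write error is discarded with
    /// `let _ =`, and the peer is removed regardless.
    fn remove_node_swallowing(&mut self, did: &str) {
        let _ = self.tombstone_did(did);
        self.nodes.remove(did);
    }

    /// One possible fix direction (assumption): tombstone first, propagate
    /// the error with `?`, and only remove the peer once the write succeeded.
    fn remove_node_strict(&mut self, did: &str) -> io::Result<()> {
        self.tombstone_did(did)?;
        self.nodes.remove(did);
        Ok(())
    }

    /// Stand-in for background sync: re-adds any peer that is not tombstoned.
    /// Returns true if the peer was (re-)added.
    fn sync_add(&mut self, did: &str) -> bool {
        if self.tombstones.contains(did) {
            return false;
        }
        self.nodes.insert(did.to_string());
        true
    }
}

fn main() {
    let did = "did:example:node-b";

    // Swallowed error: the removal appears to succeed, then sync re-adds the peer.
    let mut a = Store { fail_tombstone_write: true, ..Default::default() };
    a.nodes.insert(did.to_string());
    a.remove_node_swallowing(did);
    a.sync_add(did);
    assert!(a.nodes.contains(did));

    // Surfaced error: the caller learns that the removal did not take effect.
    let mut b = Store { fail_tombstone_write: true, ..Default::default() };
    b.nodes.insert(did.to_string());
    assert!(b.remove_node_strict(did).is_err());

    // Healthy write: the peer stays removed across a sync.
    let mut c = Store::default();
    c.nodes.insert(did.to_string());
    assert!(c.remove_node_strict(did).is_ok());
    assert!(!c.sync_add(did));
    assert!(!c.nodes.contains(did));
}
```

The sketch only shows why discarding the `Result` lets a removed peer come back; whether the real fix should propagate the error, retry, or both is left open in the notes above.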
## Tier 3 — Blocked on a decision or resource only you can supply