archy/docs/multinode-testing-plan.md


Multinode / Fleet Testing Plan (separate from the single-node gate)

Scope split (2026-06-22): the production test gate (docs/PRODUCTION-MASTER-PLAN.md §5, tests/lifecycle/TESTING.md) is now a single-node criterion on .228. Verifying the same lifecycle matrix across the rest of the fleet (.198 and the other testers) lives HERE and is run after the .228 single-node gate is green. This is intentionally NOT a blocker on the .228 gate.

Why split it out

The lifecycle gate must be run ON the node under test: its bitcoin/companion/orphan/endpoint checks use local podman/systemctl/bitcoin-cli/curl, not RPC to a remote host. Running it from one host against another silently tests the runner host instead of the target. So "multinode" isn't "point the harness at N hosts"; it's "run the on-node gate on each host," plus the genuinely cross-node concerns (federation, mesh, transport, sync) that a single node can't exercise.
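One way to make the "run it on the node" rule hard to violate by accident is a loopback guard in front of the gate. The sketch below is hypothetical and not part of the harness: `is_local_target` and `require_local_target` are illustrative names, and it assumes `ARCHY_HOST` is the only variable that selects the target.

```shell
# Hypothetical guard (not in the repo): succeed only for loopback targets.
is_local_target() {
  case "${1:-}" in
    127.0.0.1|localhost|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# Return non-zero, with a message on stderr, when ARCHY_HOST is unset
# or points at a different machine.
require_local_target() {
  if ! is_local_target "${ARCHY_HOST:-}"; then
    echo "refusing to run: ARCHY_HOST='${ARCHY_HOST:-}' is not this node" >&2
    return 1
  fi
}
```

A wrapper could call `require_local_target || exit 1` before starting `./run-gate.sh`.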

How to run the gate on another node

Bats + jq usually aren't installed on ISO nodes. Bootstrap (one-time per node):

# On a host that already has bats and jq (e.g. .116):
# dpkg -L also lists directories such as /usr/bin and /usr/lib; --no-recursion
# keeps tar from archiving their entire contents along with the bats files.
dpkg -L bats | grep -E '^/usr/(bin|lib|libexec)' | tar czf /tmp/bats.tgz -P --no-recursion -T - "$(command -v jq)"
tar czf /tmp/tests.tgz -C <repo> tests/lifecycle
scp /tmp/bats.tgz /tmp/tests.tgz <node>:/tmp/

# On the node:
sudo tar xzf /tmp/bats.tgz -P -C /          # bats (the bundled jq is dynamically linked and may need libs)
# Static jq build, in case the bundled one cannot find its libraries:
sudo curl -fsSL -o /usr/local/bin/jq \
  https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 && sudo chmod +x /usr/local/bin/jq
mkdir -p /tmp/lifecycle-run && tar xzf /tmp/tests.tgz -C /tmp/lifecycle-run
cd /tmp/lifecycle-run/tests/lifecycle
# Start the gate in the background; output goes to /tmp/gate.log.
ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=https ARCHY_PASSWORD=<node pw> \
  ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 nohup ./run-gate.sh > /tmp/gate.log 2>&1 &
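Because the gate is started under `nohup` in the background, a missing tool only shows up later in `/tmp/gate.log`. A quick check after the bootstrap can catch that earlier. This is a minimal sketch; `missing_tools` is a hypothetical helper, not something the test suite ships.

```shell
# Hypothetical helper: print the name of each required command
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools bats jq` on the node prints nothing when both are installed. `tail -f /tmp/gate.log` then follows the run once it has started.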

Per-node preconditions (learned on .228)

  • Bitcoin must be fully synced and archival (initialblockdownload:false, pruned:false). Test 83 reads the real getblockchaininfo, not the UI's headers-height. A node mid-IBD will cascade-fail electrumx/lnd/btcpay/mempool even though the apps run.
  • Backends should be proper installs (listed in manifest_ids), not adopted plain-podman containers left over from ad-hoc package.start/cascade churn; otherwise companion self-heal and quadlet checks skew.
  • No stale per-app nginx proxy targets. For example, /app/lnd/ must point at the lnd-ui port (18083), not a stale 8081. Repo code is correct; old node configs may be stale, so re-check and regenerate them.
  • No orphan quadlet units (e.g. a home-assistant.container whose ContainerName ≠ the real homeassistant container). These leave systemctl --user stuck in "activating" and fail the quadlet checks.
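The Bitcoin precondition can be checked mechanically before a run. The sketch below assumes `bitcoin-cli` is on PATH and already configured for the local node, and that `jq` is installed; `chain_ready` is a hypothetical name, not a harness function.

```shell
# Hypothetical preflight: read getblockchaininfo JSON on stdin and succeed
# only when the node is out of IBD and not pruned.
chain_ready() {
  jq -e '.initialblockdownload == false and .pruned == false' >/dev/null
}

# Usage on the node (assumes bitcoin-cli can reach the local daemon):
#   bitcoin-cli getblockchaininfo | chain_ready || echo "bitcoin not ready" >&2
#
# Units stuck in "activating" can hint at an orphan quadlet unit:
#   systemctl --user list-units --state=activating --no-legend
```

`jq -e` exits non-zero when the expression evaluates to false or null, so a missing field also counts as not ready.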

Node roster (carry-over)

| Node | Role | Notes |
| --- | --- | --- |
| .228 | single-node gate (primary) | 14-app resilience node; bitcoin synced archival; gate GREEN. |
| .198 | fleet verify | was weak/loaded (load ~35) + bitcoin mid-IBD at split time → must finish syncing first; sshd wedges under concurrent SSH (use ONE session; gate uses HTTPS RPC so fine). |
| .5 / .120 | x250 testers (Tailscale) | flaky cellular; SSH via tailscale nc ProxyCommand. |
| .116 | dev/validation | local repo; its own bitcoin may be mid-IBD; do NOT treat as a gate target unless synced. |
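For the Tailscale-only testers, the `tailscale nc` ProxyCommand mentioned in the roster can be wrapped as below. This is a sketch under assumptions: `tester_ssh` and the `SSH_BIN` override are hypothetical (the override exists only so the command can be dry-run), and the node name must resolve on the tailnet.

```shell
# Hypothetical wrapper: SSH to a tester through the local tailscale CLI.
# %h and %p are expanded by ssh to the target host and port.
tester_ssh() {
  node=$1
  shift
  "${SSH_BIN:-ssh}" -o ProxyCommand='tailscale nc %h %p' "$node" "$@"
}
```

`tester_ssh <node> uptime` runs one command over a single session. Setting `SSH_BIN=echo` prints the arguments instead of connecting.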

Cross-node concerns (only a multinode setup can test)

  • Federation sync (Tor/FIPS transports), DID/contact federation, peer file fetch.
  • Mesh (Meshtastic/MeshCore) + mesh-AI gating.
  • Dual-ecash federation validation + networking-sats routing.
  • DHT / iroh swarm distribution (origin-always-wins) once that dep lands.

Sequence

  1. Get the .228 single-node gate green 5× (master plan §5/§6) — DONE/in progress.
  2. THEN: bring each fleet node to the preconditions above; run the on-node gate 5× per node.
  3. THEN: the cross-node suites (federation/mesh/transport), tracked here.

This plan does not gate the v1.7.x single-node criterion; it is the next layer.