archy/docs/multinode-testing-plan.md

70 lines
3.9 KiB
Markdown
Raw Permalink Normal View History

# Multinode / Fleet Testing Plan (separate from the single-node gate)
> **Scope split (2026-06-22):** the production test gate (`docs/PRODUCTION-MASTER-PLAN.md` §5,
> `tests/lifecycle/TESTING.md`) is now a **single-node criterion on .228**. Verifying the same
> lifecycle matrix across the rest of the fleet (.198 and the other testers) lives HERE and is run
> **after** the .228 single-node gate is green. This is intentionally NOT a blocker on the .228 gate.
## Why split it out
The lifecycle gate must be **run ON the node under test** — its bitcoin/companion/orphan/endpoint
checks use local `podman`/`systemctl`/`bitcoin-cli`/`curl`, not RPC to a remote host. Running it from
one host against another silently tests the *runner*. So "multinode" isn't "point the harness at N
hosts" — it's "run the on-node gate on each host," plus the genuinely cross-node concerns (federation,
mesh, transport, sync) that a single node can't exercise.
## How to run the gate on another node
Bats + jq usually aren't installed on ISO nodes. Bootstrap (one-time per node):
```
# from a host that has them (e.g. .116):
dpkg -L bats | grep -E '^/usr/(bin|lib|libexec)' | tar czf /tmp/bats.tgz -P -T - $(which jq)
tar czf /tmp/tests.tgz -C <repo> tests/lifecycle
scp /tmp/bats.tgz /tmp/tests.tgz <node>:/tmp/
# on the node:
sudo tar xzf /tmp/bats.tgz -P -C / # bats (jq here is dynamically linked — may need libs)
sudo curl -fsSL -o /usr/local/bin/jq \
https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 && sudo chmod +x /usr/local/bin/jq
mkdir -p /tmp/lifecycle-run && tar xzf /tmp/tests.tgz -C /tmp/lifecycle-run
cd /tmp/lifecycle-run/tests/lifecycle
ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=https ARCHY_PASSWORD=<node pw> \
ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 nohup ./run-gate.sh > /tmp/gate.log 2>&1 &
```
## Per-node preconditions (learned on .228)
- **Bitcoin must be fully synced + archival** (`initialblockdownload:false`, `pruned:false`).
test 83 reads the *real* `getblockchaininfo`, not the UI's headers-height. A node mid-IBD will
cascade-fail electrumx/lnd/btcpay/mempool even though the apps run.
- **Backends should be proper installs** (in `manifest_ids`), not adopted plain-podman left over
from ad-hoc `package.start`/cascade churn — otherwise companion self-heal and quadlet checks skew.
- **No stale per-app nginx proxy targets.** e.g. `/app/lnd/` must point at the lnd-ui port (18083),
not a stale `8081`. Repo code is correct; old node configs may be stale — re-check + regenerate.
- **No orphan quadlet units** (e.g. a `home-assistant.container` whose ContainerName ≠ the real
`homeassistant` container) — these wedge `systemctl --user` "activating" and fail the quadlet checks.
## Node roster (carry-over)
| Node | Role | Notes |
|------|------|-------|
| .228 | **single-node gate** (primary) | 14-app resilience node; bitcoin synced archival; gate GREEN. |
| .198 | fleet verify | was weak/loaded (load ~35) + **bitcoin mid-IBD** at split time → must finish syncing first; sshd wedges under concurrent SSH (use ONE session; gate uses HTTPS RPC so fine). |
| .5 / .120 | x250 testers (Tailscale) | flaky cellular; SSH via `tailscale nc` ProxyCommand. |
| .116 | dev/validation | local repo; its own bitcoin may be mid-IBD — do NOT treat as a gate target unless synced. |
## Cross-node concerns (only a multinode setup can test)
- Federation sync (Tor/FIPS transports), DID/contact federation, peer file fetch.
- Mesh (Meshtastic/MeshCore) + mesh-AI gating.
- Dual-ecash federation validation + networking-sats routing.
- DHT / iroh swarm distribution (origin-always-wins) once that dep lands.
## Sequence
1. Get the **.228 single-node gate green 5×** (master plan §5/§6) — DONE/in progress.
2. THEN: bring each fleet node to the preconditions above; run the on-node gate 5× per node.
3. THEN: the cross-node suites (federation/mesh/transport), tracked here.
This plan does not gate the v1.7.x single-node criterion; it is the next layer.