Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC). Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites. Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Multinode / Fleet Testing Plan (separate from the single-node gate)

> **Scope split (2026-06-22):** the production test gate (`docs/PRODUCTION-MASTER-PLAN.md` §5,
> `tests/lifecycle/TESTING.md`) is now a **single-node criterion on .228**. Verifying the same
> lifecycle matrix across the rest of the fleet (.198 and the other testers) lives HERE and is run
> **after** the .228 single-node gate is green. This is intentionally NOT a blocker on the .228 gate.

## Why split it out

The lifecycle gate must be **run ON the node under test**: its bitcoin/companion/orphan/endpoint
checks use local `podman`/`systemctl`/`bitcoin-cli`/`curl`, not RPC to a remote host. Running it from
one host against another silently tests the *runner* instead of the target. So "multinode" isn't
"point the harness at N hosts"; it's "run the on-node gate on each host," plus the genuinely
cross-node concerns (federation, mesh, transport, sync) that a single node can't exercise.
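
One way to make the on-node rule hard to violate is a guard in whatever wrapper launches the gate. The sketch below is an assumption, not part of the harness: `is_local_target` is a hypothetical helper, and it only inspects a host value such as the `ARCHY_HOST` variable used in the bootstrap recipe.

```shell
# Hypothetical guard (not part of the harness): succeed only when the
# target host is this machine, so the gate's local podman/systemctl/
# bitcoin-cli checks inspect the same node that the RPC calls hit.
is_local_target() {
  local host="${1:-}" addr
  [ -n "$host" ] || return 1
  case "$host" in
    127.0.0.1|localhost|::1) return 0 ;;
  esac
  # Also accept one of this machine's own addresses or its hostname.
  for addr in $(hostname -I 2>/dev/null) "$(hostname 2>/dev/null)"; do
    [ "$host" = "$addr" ] && return 0
  done
  return 1
}
```

A wrapper could then start with `is_local_target "$ARCHY_HOST" || { echo "run the gate on the node itself" >&2; exit 1; }`.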

## How to run the gate on another node

Bats + jq usually aren't installed on ISO nodes. Bootstrap (one-time per node):

```
# from a host that has them (e.g. .116):
# --no-recursion: `dpkg -L` also lists directories such as /usr/bin; without it tar would
# archive those directories' entire contents instead of only the files bats owns.
dpkg -L bats | grep -E '^/usr/(bin|lib|libexec)' | tar czf /tmp/bats.tgz -P --no-recursion -T - "$(which jq)"
tar czf /tmp/tests.tgz -C <repo> tests/lifecycle
scp /tmp/bats.tgz /tmp/tests.tgz <node>:/tmp/
# on the node:
sudo tar xzf /tmp/bats.tgz -P -C /   # bats (jq here is dynamically linked, may need libs)
# fetch the upstream jq release binary instead of relying on the copied one
sudo curl -fsSL -o /usr/local/bin/jq \
  https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 && sudo chmod +x /usr/local/bin/jq
mkdir -p /tmp/lifecycle-run && tar xzf /tmp/tests.tgz -C /tmp/lifecycle-run
cd /tmp/lifecycle-run/tests/lifecycle
ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=https ARCHY_PASSWORD=<node pw> \
ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 nohup ./run-20x.sh > /tmp/gate.log 2>&1 &
```
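
Before launching, it can help to confirm that the tools the gate shells out to are actually on the node's `PATH`. A minimal sketch, assuming a hypothetical helper named `missing_tools` (it is not part of `tests/lifecycle`):

```shell
# Hypothetical precheck: print each required tool that is not on PATH
# and return non-zero if any is missing.
missing_tools() {
  local tool rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `missing_tools bats jq podman systemctl bitcoin-cli curl` prints nothing when the bootstrap worked; the tool list is taken from the checks described above and may not be exhaustive.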

## Per-node preconditions (learned on .228)

- **Bitcoin must be fully synced + archival** (`initialblockdownload:false`, `pruned:false`).
  Test 83 reads the *real* `getblockchaininfo`, not the UI's headers-height. A node mid-IBD will
  cascade-fail electrumx/lnd/btcpay/mempool even though the apps run.
- **Backends should be proper installs** (in `manifest_ids`), not adopted plain-podman containers
  left over from ad-hoc `package.start`/cascade churn; otherwise companion self-heal and quadlet
  checks skew.
- **No stale per-app nginx proxy targets.** e.g. `/app/lnd/` must point at the lnd-ui port (18083),
  not a stale `8081`. Repo code is correct; old node configs may be stale, so re-check + regenerate.
- **No orphan quadlet units** (e.g. a `home-assistant.container` whose ContainerName ≠ the real
  `homeassistant` container): these wedge `systemctl --user` in "activating" and fail the quadlet checks.
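
Three of these preconditions can be spot-checked from a shell before starting a run. The helpers below are a hedged sketch, not harness code: all function names are hypothetical, the JSON check uses `grep` rather than `jq` only so it works before jq is installed, the nginx check assumes simple `location` blocks without nested braces, and the Quadlet check only handles units that set `ContainerName=` explicitly.

```shell
# Succeeds when `bitcoin-cli getblockchaininfo` JSON on stdin reports
# a node that is out of IBD and not pruned.
bitcoin_ready() {
  local info
  info="$(cat)"
  printf '%s' "$info" | grep -Eq '"initialblockdownload"[[:space:]]*:[[:space:]]*false' &&
  printf '%s' "$info" | grep -Eq '"pruned"[[:space:]]*:[[:space:]]*false'
}

# Prints the first proxy_pass port found inside the location block whose
# line contains the given path (nginx config text on stdin).
proxy_port_for() {
  awk -v loc="$1" '
    $1 == "location" && index($0, loc) { inblk = 1 }
    inblk && /proxy_pass/ {
      if (match($0, /:[0-9]+/)) { print substr($0, RSTART + 1, RLENGTH - 1); exit }
    }
    inblk && /}/ { inblk = 0 }
  '
}

# Prints the ContainerName= value declared in a Quadlet .container file.
quadlet_container_name() {
  sed -n 's/^[[:space:]]*ContainerName[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p' "$1" | head -n 1
}

# Succeeds (unit looks orphaned) when the declared ContainerName matches
# none of the real container names passed as remaining arguments.
is_orphan_quadlet() {
  local unit="$1" want name
  shift
  want="$(quadlet_container_name "$unit")"
  [ -n "$want" ] || return 1   # no explicit name: cannot judge here
  for name in "$@"; do
    [ "$name" = "$want" ] && return 1
  done
  return 0
}
```

Typical use would be `bitcoin-cli getblockchaininfo | bitcoin_ready`, `proxy_port_for /app/lnd/ < <nginx config>` (expecting `18083`), and `is_orphan_quadlet <unit file> $(podman ps -a --format '{{.Names}}')`; the config and unit-file locations depend on the node and are left as placeholders.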

## Node roster (carry-over)

| Node | Role | Notes |
|------|------|-------|
| .228 | **single-node gate** (primary) | 14-app resilience node; bitcoin synced archival; gate GREEN. |
| .198 | fleet verify | was weak/loaded (load ~3–5) + **bitcoin mid-IBD** at split time → must finish syncing first; sshd wedges under concurrent SSH (use ONE session; gate uses HTTPS RPC so fine). |
| .5 / .120 | x250 testers (Tailscale) | flaky cellular; SSH via `tailscale nc` ProxyCommand. |
| .116 | dev/validation | local repo; its own bitcoin may be mid-IBD; do NOT treat as a gate target unless synced. |

## Cross-node concerns (only a multinode setup can test)

- Federation sync (Tor/FIPS transports), DID/contact federation, peer file fetch.
- Mesh (Meshtastic/MeshCore) + mesh-AI gating.
- Dual-ecash federation validation + networking-sats routing.
- DHT / iroh swarm distribution (origin-always-wins) once that dep lands.

## Sequence

1. Get the **.228 single-node gate green 5×** (master plan §5/§6): DONE/in progress.
2. THEN: bring each fleet node to the preconditions above; run the on-node gate 5× per node.
3. THEN: the cross-node suites (federation/mesh/transport), tracked here.
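
The "5× green" criterion in steps 1 and 2 means five consecutive passing runs with no failure in between. The recipe above passes `ARCHY_ITERATIONS=5` to `run-20x.sh`, which presumably handles the repetition itself; the sketch below only illustrates the criterion with a hypothetical `require_n_green` helper and is not how the harness is implemented.

```shell
# Hypothetical illustration of an "N x green" criterion: run a command
# N times in a row and stop at the first failure.
require_n_green() {
  local want="$1" i=1
  shift
  while [ "$i" -le "$want" ]; do
    if ! "$@"; then
      echo "iteration $i of $want failed" >&2
      return 1
    fi
    echo "iteration $i of $want green"
    i=$((i + 1))
  done
}
```

Stopping at the first failure keeps a late pass from masking an earlier red run, which is the point of requiring consecutive greens.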

This plan does not gate the v1.7.x single-node criterion; it is the next layer.