From 27299ea6872d1b9378536d5a59815ba829483edb Mon Sep 17 00:00:00 2001 From: archipelago Date: Mon, 22 Jun 2026 16:47:34 -0400 Subject: [PATCH] docs: make the production test gate a SINGLE-NODE (.228) criterion; split out multinode Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC). Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites. Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates. Co-Authored-By: Claude Opus 4.8 (1M context) --- CLAUDE.md | 10 +++-- docs/PRODUCTION-MASTER-PLAN.md | 57 +++++++++++++++------------- docs/multinode-testing-plan.md | 69 ++++++++++++++++++++++++++++++++++ tests/lifecycle/TESTING.md | 19 ++++++---- 4 files changed, 118 insertions(+), 37 deletions(-) create mode 100644 docs/multinode-testing-plan.md diff --git a/CLAUDE.md b/CLAUDE.md index 3566fb2a..b0e57be2 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -27,7 +27,8 @@ Detailed sub-plans (all linked from the master): `container::secrets`, 0600/rootless) — never hardcoded, per-app, or logged. - **Migrations never destroy data** — preserve `/var/lib/archipelago/`, secrets, credentials, ports, and adoption container names; keep a rollback path. -- **Verify on a real node (.228, then .198) before any tag.** +- **Verify on the real node .228 before any tag.** (Fleet-wide multinode + verification is a separate plan: `docs/multinode-testing-plan.md`.) ## Build / verify @@ -43,5 +44,8 @@ Detailed sub-plans (all linked from the master): `tests/lifecycle/run-20x.sh` green across install / UI / stop / start / restart / reinstall / reboot-survive / archipelago-restart-survive / uninstall — **5× on -.228 AND .198 for now** (`ARCHY_ITERATIONS=5`; temporarily reduced from 20× — -restore to 20× before the final ship). Until green, the master plan is the priority. +.228** (`ARCHY_ITERATIONS=5`; temporarily reduced from 20× — restore to 20× before +the final ship). **Run the gate ON the node** (it uses local podman/systemctl/bitcoin +probes), not via RPC from another host. Until green, the master plan is the priority. +**Multinode testing (.198 + the rest of the fleet) is a SEPARATE plan** — +`docs/multinode-testing-plan.md` — not part of this single-node gate criterion. diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md index 54362091..94a71a79 100644 --- a/docs/PRODUCTION-MASTER-PLAN.md +++ b/docs/PRODUCTION-MASTER-PLAN.md @@ -66,7 +66,7 @@ real nodes. Until then, this plan is the priority. | B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet | | C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending | | D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0–2 code-complete (worktree) | -| E | **Production test gate** — 5× lifecycle on .228 + .198 (for now; was 20×), per-app L1/L2 matrix | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **never green — exit criterion** | +| E | **Production test gate** — 5× lifecycle on **.228** (for now; was 20×), per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **.228 GREEN (110/110); 5× in progress** | **Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md` (ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption @@ -78,10 +78,13 @@ modes FM1–FM6 + the desired-state-first reconciler that fixes them). An app is **production-ready** only when `tests/lifecycle/run-20x.sh` is green across the full matrix — install / UI-reachable / stop / start / restart / reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall — -**5× on .228 AND .198 for now** (`ARCHY_ITERATIONS=5`; temporarily reduced from -20× — restore to 20× before the final ship). All 8 gate checkboxes in `tests/lifecycle/TESTING.md` -are currently unchecked. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, -L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage. +**5× on .228** (`ARCHY_ITERATIONS=5`; temporarily reduced from 20× — restore to +20× before the final ship). **The gate runs ON the node** (it uses local +podman/systemctl/bitcoin probes; running it via RPC from another host silently +tests the runner). **Multinode / fleet verification (.198 + others) is a SEPARATE +plan — `docs/multinode-testing-plan.md` — NOT part of this single-node criterion.** +Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard + +proxies; L3 survival ◐; ~30 apps have zero automated coverage. ## 6. Immediate sequence (live workstream) @@ -97,14 +100,17 @@ L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated cov data_uid 100998. Canonical app_id `immich` (title+icon). *(9e6c5370, d5ef4573)* 4. ✅ **Reboot-survival** — podman-restart.service enabled (startup, fleet-wide) for the podman-`--restart` path. *(f160e0c4)* -5. ◻ **Verify on .198** (immich migration validated on .228 only so far). -6. ◻ **E** — run the 5× gate (`ARCHY_ITERATIONS=5`, was 20×); fix until green. -7. ◻ Demote this banner. +5. ◧ **E** — 5× gate on **.228** (`ARCHY_ITERATIONS=5`, was 20×). .228 is GREEN + 1× (110/110); the 5× run is in progress. This is now the SINGLE-NODE criterion. +6. ◻ Demote this banner once the 5× is green. + +**Multinode / fleet verification (.198 and the rest) is split into its own plan:** +`docs/multinode-testing-plan.md`. Do it AFTER the .228 single-node gate is green. **Not yet done / deliberate follow-ups:** flip `EMBED_MANIFESTS` on for the published catalog (then sign) to actually distribute manifests via the registry; Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not -just podman-`--restart`); immich on .198. +just podman-`--restart`). ## 7. Release blockers & operational gotchas (durable) @@ -344,23 +350,22 @@ bug is purely "container never stops", not "state not reported". - Reinstall gotcha: `package.install` needs a REAL image ref in `dockerImage`; a bare app name → `Invalid Docker image format`. -### NEXT STEPS (in order) -1. ✅ **DONE** — the 3 stop bugs are fixed + deployed; electrumx lifecycle suite GREEN on .228: - - stop-grace (`2dad64b2`), reconcile-resurrection guard (`760a32bc`), container-list - user-stopped state (`6e49ce6f`). All compile + unit/quadlet tests green; deployed to .228. -2. **Full single-iteration gate on .228** (running) — confirm bitcoin/btcpay/lnd/mempool/immich/ - fedimint stop tests now pass too. Fix any stragglers (same 3-bug lens). -3. **Deploy the 3-fix binary to .198** and run the gate there (the clean quadlet node). -4. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so - units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`). -5. **Run the 5× canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill - mid-iteration) on .198 then .228. Green = Step-2-of-plan done. -8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman; - re-survey the status doc's quadlet % from `.container`-file presence. -9. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen, - config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is - install_netbird_stack in stacks.rs). -10. Then single-container legacy apps onto the orchestrator install flow; then demote the banner. +### NEXT STEPS (in order) — SINGLE-NODE (.228) criterion +1. ✅ **DONE** — 4 stop/reconcile bugs fixed + deployed (`2dad64b2` grace, `760a32bc` + reconcile-resurrection guard, `6e49ce6f` container-list user-stopped, `452f05d8` companion + cadence). Plus test-harness fixes (lnd/immich/fedimint/NPM readiness + config). +2. ✅ **DONE** — gate run **ON .228** (synced bitcoin): **110/110 GREEN** (1×). Key lesson: + **run the gate on the node**, not via RPC from .116 (local podman/systemctl/bitcoin probes). +3. ◧ **5× run on .228 in progress** (`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`, on the node). + 5 consecutive clean iterations = the single-node gate criterion → demote the banner. +4. **netbird migration (#20 phase 4)** — the one real migration left; assess setup steps first (TLS + cert gen, config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; + legacy is install_netbird_stack in stacks.rs). Then btcpay/mempool stack polish. +5. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman. + +**Multinode / fleet (.198 + the rest) → `docs/multinode-testing-plan.md` (separate, after .228 green).** +Carry-over notes for that plan: .198 bitcoin was mid-IBD; the lnd `/app/lnd/` nginx proxy had a +stale `8081` target on .228 (repo code is correct at 18083 — re-check on other nodes). ### KNOWN ISSUES / WATCH-OUTS - **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates diff --git a/docs/multinode-testing-plan.md b/docs/multinode-testing-plan.md new file mode 100644 index 00000000..6bb92abb --- /dev/null +++ b/docs/multinode-testing-plan.md @@ -0,0 +1,69 @@ +# Multinode / Fleet Testing Plan (separate from the single-node gate) + +> **Scope split (2026-06-22):** the production test gate (`docs/PRODUCTION-MASTER-PLAN.md` §5, +> `tests/lifecycle/TESTING.md`) is now a **single-node criterion on .228**. Verifying the same +> lifecycle matrix across the rest of the fleet (.198 and the other testers) lives HERE and is run +> **after** the .228 single-node gate is green. This is intentionally NOT a blocker on the .228 gate. + +## Why split it out + +The lifecycle gate must be **run ON the node under test** — its bitcoin/companion/orphan/endpoint +checks use local `podman`/`systemctl`/`bitcoin-cli`/`curl`, not RPC to a remote host. Running it from +one host against another silently tests the *runner*. So "multinode" isn't "point the harness at N +hosts" — it's "run the on-node gate on each host," plus the genuinely cross-node concerns (federation, +mesh, transport, sync) that a single node can't exercise. + +## How to run the gate on another node + +Bats + jq usually aren't installed on ISO nodes. Bootstrap (one-time per node): + +``` +# from a host that has them (e.g. .116): +dpkg -L bats | grep -E '^/usr/(bin|lib|libexec)' | tar czf /tmp/bats.tgz -P -T - $(which jq) +tar czf /tmp/tests.tgz -C tests/lifecycle +scp /tmp/bats.tgz /tmp/tests.tgz :/tmp/ +# on the node: +sudo tar xzf /tmp/bats.tgz -P -C / # bats (jq here is dynamically linked — may need libs) +sudo curl -fsSL -o /usr/local/bin/jq \ + https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 && sudo chmod +x /usr/local/bin/jq +mkdir -p /tmp/lifecycle-run && tar xzf /tmp/tests.tgz -C /tmp/lifecycle-run +cd /tmp/lifecycle-run/tests/lifecycle +ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=https ARCHY_PASSWORD= \ + ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 nohup ./run-20x.sh > /tmp/gate.log 2>&1 & +``` + +## Per-node preconditions (learned on .228) + +- **Bitcoin must be fully synced + archival** (`initialblockdownload:false`, `pruned:false`). + test 83 reads the *real* `getblockchaininfo`, not the UI's headers-height. A node mid-IBD will + cascade-fail electrumx/lnd/btcpay/mempool even though the apps run. +- **Backends should be proper installs** (in `manifest_ids`), not adopted plain-podman left over + from ad-hoc `package.start`/cascade churn — otherwise companion self-heal and quadlet checks skew. +- **No stale per-app nginx proxy targets.** e.g. `/app/lnd/` must point at the lnd-ui port (18083), + not a stale `8081`. Repo code is correct; old node configs may be stale — re-check + regenerate. +- **No orphan quadlet units** (e.g. a `home-assistant.container` whose ContainerName ≠ the real + `homeassistant` container) — these wedge `systemctl --user` "activating" and fail the quadlet checks. + +## Node roster (carry-over) + +| Node | Role | Notes | +|------|------|-------| +| .228 | **single-node gate** (primary) | 14-app resilience node; bitcoin synced archival; gate GREEN. | +| .198 | fleet verify | was weak/loaded (load ~3–5) + **bitcoin mid-IBD** at split time → must finish syncing first; sshd wedges under concurrent SSH (use ONE session; gate uses HTTPS RPC so fine). | +| .5 / .120 | x250 testers (Tailscale) | flaky cellular; SSH via `tailscale nc` ProxyCommand. | +| .116 | dev/validation | local repo; its own bitcoin may be mid-IBD — do NOT treat as a gate target unless synced. | + +## Cross-node concerns (only a multinode setup can test) + +- Federation sync (Tor/FIPS transports), DID/contact federation, peer file fetch. +- Mesh (Meshtastic/MeshCore) + mesh-AI gating. +- Dual-ecash federation validation + networking-sats routing. +- DHT / iroh swarm distribution (origin-always-wins) once that dep lands. + +## Sequence + +1. Get the **.228 single-node gate green 5×** (master plan §5/§6) — DONE/in progress. +2. THEN: bring each fleet node to the preconditions above; run the on-node gate 5× per node. +3. THEN: the cross-node suites (federation/mesh/transport), tracked here. + +This plan does not gate the v1.7.x single-node criterion; it is the next layer. diff --git a/tests/lifecycle/TESTING.md b/tests/lifecycle/TESTING.md index 47e68c5a..e10d768b 100644 --- a/tests/lifecycle/TESTING.md +++ b/tests/lifecycle/TESTING.md @@ -26,8 +26,9 @@ The migration's aim, restated as **five pillars** (every app must satisfy all fi desired→current from manifests + secrets. Self-healing, not edge-triggered. 3. **Lifecycle bulletproof** — every app passes the full matrix (install / UI reachable / stop / start / restart / reinstall / reboot-survive - / archipelago-restart-survive / uninstall) **5× green on .228 AND .198 for now** - (`ARCHY_ITERATIONS=5`; temporarily reduced from 20×, restore before final ship) + / archipelago-restart-survive / uninstall) **5× green on .228** — run ON the node + (`ARCHY_ITERATIONS=5`; temporarily reduced from 20×, restore before final ship). + (Multinode / fleet → `docs/multinode-testing-plan.md`, separate.) before any release. 4. **Data-driven apps** — install/uninstall needs only the app's manifest + catalog entry. **No host OS changes** (no apt, no /etc, no host units) and @@ -40,9 +41,10 @@ The migration's aim, restated as **five pillars** (every app must satisfy all fi owned by the service user. Security is king. **Per-app definition of done:** all five pillars hold → lifecycle matrix 5× -(for now; was 20×) green on .228 then .198 → catalog/registry updated (`app-catalog/catalog.json` +(for now; was 20×) green on .228 (run ON the node) → catalog/registry updated (`app-catalog/catalog.json` + `releases/app-catalog.json`, rebuilt image pushed to the mirror) → tracker -cell ticked. Only then move to the next app. +cell ticked. Only then move to the next app. (Fleet/multinode verification is a +separate pass → `docs/multinode-testing-plan.md`.) **.228 testing constraint:** do NOT touch `bitcoin-knots`, `electrumx`, or `lnd` on .228 — they are synced and healthy; destructive cycles there would @@ -121,8 +123,9 @@ cost hours of resync. | L5 — Chaos / failure-path | Failure modes recover gracefully (corrupt config, deleted bolt DB, network partition) | bats (chaos-gated) | ~120s per scenario | | L6 — Performance | Cold install latency, reconcile-tick cost, podman call count per lifecycle event | timed bats + Prometheus (TBD) | ~60s per benchmark | -Release gate: **L0+L1+L2+L3 green × 20 iterations** on .228 AND .198. L4+L5+L6 are -quality gates we add as they mature; not blocking the v1.7.52 tag. +Release gate: **L0+L1+L2+L3 green × 20 iterations** on .228 (run ON the node; 5× for +now). Multinode/fleet → `docs/multinode-testing-plan.md`. L4+L5+L6 are quality gates +we add as they mature; not blocking the v1.7.52 tag. ## Coverage matrix — current state @@ -248,8 +251,8 @@ We don't have a performance harness yet. Add as L6 lands: v1.7.52 ships only when ALL of: 1. ☐ Bitcoin-stops fix verified live on a fresh node (tests/lifecycle/bats/bitcoin-knots.bats fully ● after a cold install) -2. ☐ `ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh` returns 0 against .228 (5× for now; full suite, ARCHY_ALLOW_DESTRUCTIVE=1) -3. ☐ `ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh` returns 0 against .198 (same) +2. ☐ `ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh` returns 0 **run ON .228** (5× for now; full suite, ARCHY_ALLOW_DESTRUCTIVE=1) — 1× is GREEN (110/110), 5× in progress +3. ☐ Multinode/fleet (.198 + others) — tracked separately in `docs/multinode-testing-plan.md`, NOT a v1.7.52 single-node gate item 4. ☐ The L3 `backend-survives-archipelago-restart` suite passes (= Phase 3 Quadlet shipped for backends) 5. ☐ Cargo: 0 warnings, 0 unused, all tests green (sustained ✓ since 1c0df95f) 6. ☐ LoC: at least one of {Phase 3 Quadlet, dev_mode resolution} merged