docs: make the production test gate a SINGLE-NODE (.228) criterion; split out multinode
Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC). Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites. Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
892ff083c4
commit
27299ea687
10
CLAUDE.md
10
CLAUDE.md
@ -27,7 +27,8 @@ Detailed sub-plans (all linked from the master):
|
||||
`container::secrets`, 0600/rootless) — never hardcoded, per-app, or logged.
|
||||
- **Migrations never destroy data** — preserve `/var/lib/archipelago/<app>`,
|
||||
secrets, credentials, ports, and adoption container names; keep a rollback path.
|
||||
- **Verify on a real node (.228, then .198) before any tag.**
|
||||
- **Verify on the real node .228 before any tag.** (Fleet-wide multinode
|
||||
verification is a separate plan: `docs/multinode-testing-plan.md`.)
|
||||
|
||||
## Build / verify
|
||||
|
||||
@ -43,5 +44,8 @@ Detailed sub-plans (all linked from the master):
|
||||
|
||||
`tests/lifecycle/run-20x.sh` green across install / UI / stop / start / restart /
|
||||
reinstall / reboot-survive / archipelago-restart-survive / uninstall — **5× on
|
||||
.228 AND .198 for now** (`ARCHY_ITERATIONS=5`; temporarily reduced from 20× —
|
||||
restore to 20× before the final ship). Until green, the master plan is the priority.
|
||||
.228** (`ARCHY_ITERATIONS=5`; temporarily reduced from 20× — restore to 20× before
|
||||
the final ship). **Run the gate ON the node** (it uses local podman/systemctl/bitcoin
|
||||
probes), not via RPC from another host. Until green, the master plan is the priority.
|
||||
**Multinode testing (.198 + the rest of the fleet) is a SEPARATE plan** —
|
||||
`docs/multinode-testing-plan.md` — not part of this single-node gate criterion.
|
||||
|
||||
@ -66,7 +66,7 @@ real nodes. Until then, this plan is the priority.
|
||||
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
|
||||
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
|
||||
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0–2 code-complete (worktree) |
|
||||
| E | **Production test gate** — 5× lifecycle on .228 + .198 (for now; was 20×), per-app L1/L2 matrix | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **never green — exit criterion** |
|
||||
| E | **Production test gate** — 5× lifecycle on **.228** (for now; was 20×), per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **.228 GREEN (110/110); 5× in progress** |
|
||||
|
||||
**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
|
||||
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
|
||||
@ -78,10 +78,13 @@ modes FM1–FM6 + the desired-state-first reconciler that fixes them).
|
||||
An app is **production-ready** only when `tests/lifecycle/run-20x.sh` is green
|
||||
across the full matrix — install / UI-reachable / stop / start / restart /
|
||||
reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall —
|
||||
**5× on .228 AND .198 for now** (`ARCHY_ITERATIONS=5`; temporarily reduced from
|
||||
20× — restore to 20× before the final ship). All 8 gate checkboxes in `tests/lifecycle/TESTING.md`
|
||||
are currently unchecked. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps,
|
||||
L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.
|
||||
**5× on .228** (`ARCHY_ITERATIONS=5`; temporarily reduced from 20× — restore to
|
||||
20× before the final ship). **The gate runs ON the node** (it uses local
|
||||
podman/systemctl/bitcoin probes; running it via RPC from another host silently
|
||||
tests the runner). **Multinode / fleet verification (.198 + others) is a SEPARATE
|
||||
plan — `docs/multinode-testing-plan.md` — NOT part of this single-node criterion.**
|
||||
Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard +
|
||||
proxies; L3 survival ◐; ~30 apps have zero automated coverage.
|
||||
|
||||
## 6. Immediate sequence (live workstream)
|
||||
|
||||
@ -97,14 +100,17 @@ L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated cov
|
||||
data_uid 100998. Canonical app_id `immich` (title+icon). *(9e6c5370, d5ef4573)*
|
||||
4. ✅ **Reboot-survival** — podman-restart.service enabled (startup, fleet-wide)
|
||||
for the podman-`--restart` path. *(f160e0c4)*
|
||||
5. ◻ **Verify on .198** (immich migration validated on .228 only so far).
|
||||
6. ◻ **E** — run the 5× gate (`ARCHY_ITERATIONS=5`, was 20×); fix until green.
|
||||
7. ◻ Demote this banner.
|
||||
5. ◧ **E** — 5× gate on **.228** (`ARCHY_ITERATIONS=5`, was 20×). .228 is GREEN
|
||||
1× (110/110); the 5× run is in progress. This is now the SINGLE-NODE criterion.
|
||||
6. ◻ Demote this banner once the 5× is green.
|
||||
|
||||
**Multinode / fleet verification (.198 and the rest) is split into its own plan:**
|
||||
`docs/multinode-testing-plan.md`. Do it AFTER the .228 single-node gate is green.
|
||||
|
||||
**Not yet done / deliberate follow-ups:** flip `EMBED_MANIFESTS` on for the
|
||||
published catalog (then sign) to actually distribute manifests via the registry;
|
||||
Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not
|
||||
just podman-`--restart`); immich on .198.
|
||||
just podman-`--restart`).
|
||||
|
||||
## 7. Release blockers & operational gotchas (durable)
|
||||
|
||||
@ -344,23 +350,22 @@ bug is purely "container never stops", not "state not reported".
|
||||
- Reinstall gotcha: `package.install` needs a REAL image ref in `dockerImage`; a bare app name
|
||||
→ `Invalid Docker image format`.
|
||||
|
||||
### NEXT STEPS (in order)
|
||||
1. ✅ **DONE** — the 3 stop bugs are fixed + deployed; electrumx lifecycle suite GREEN on .228:
|
||||
- stop-grace (`2dad64b2`), reconcile-resurrection guard (`760a32bc`), container-list
|
||||
user-stopped state (`6e49ce6f`). All compile + unit/quadlet tests green; deployed to .228.
|
||||
2. **Full single-iteration gate on .228** (running) — confirm bitcoin/btcpay/lnd/mempool/immich/
|
||||
fedimint stop tests now pass too. Fix any stragglers (same 3-bug lens).
|
||||
3. **Deploy the 3-fix binary to .198** and run the gate there (the clean quadlet node).
|
||||
4. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
|
||||
units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
|
||||
5. **Run the 5× canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
|
||||
mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
|
||||
8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
|
||||
re-survey the status doc's quadlet % from `.container`-file presence.
|
||||
9. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
|
||||
config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
|
||||
install_netbird_stack in stacks.rs).
|
||||
10. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
|
||||
### NEXT STEPS (in order) — SINGLE-NODE (.228) criterion
|
||||
1. ✅ **DONE** — 4 stop/reconcile bugs fixed + deployed (`2dad64b2` grace, `760a32bc`
|
||||
reconcile-resurrection guard, `6e49ce6f` container-list user-stopped, `452f05d8` companion
|
||||
cadence). Plus test-harness fixes (lnd/immich/fedimint/NPM readiness + config).
|
||||
2. ✅ **DONE** — gate run **ON .228** (synced bitcoin): **110/110 GREEN** (1×). Key lesson:
|
||||
**run the gate on the node**, not via RPC from .116 (local podman/systemctl/bitcoin probes).
|
||||
3. ◧ **5× run on .228 in progress** (`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`, on the node).
|
||||
5 consecutive clean iterations = the single-node gate criterion → demote the banner.
|
||||
4. **netbird migration (#20 phase 4)** — the one real migration left; assess setup steps first (TLS
|
||||
cert gen, config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host;
|
||||
legacy is install_netbird_stack in stacks.rs). Then btcpay/mempool stack polish.
|
||||
5. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman.
|
||||
|
||||
**Multinode / fleet (.198 + the rest) → `docs/multinode-testing-plan.md` (separate, after .228 green).**
|
||||
Carry-over notes for that plan: .198 bitcoin was mid-IBD; the lnd `/app/lnd/` nginx proxy had a
|
||||
stale `8081` target on .228 (repo code is correct at 18083 — re-check on other nodes).
|
||||
|
||||
### KNOWN ISSUES / WATCH-OUTS
|
||||
- **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates
|
||||
|
||||
69
docs/multinode-testing-plan.md
Normal file
69
docs/multinode-testing-plan.md
Normal file
@ -0,0 +1,69 @@
|
||||
# Multinode / Fleet Testing Plan (separate from the single-node gate)
|
||||
|
||||
> **Scope split (2026-06-22):** the production test gate (`docs/PRODUCTION-MASTER-PLAN.md` §5,
|
||||
> `tests/lifecycle/TESTING.md`) is now a **single-node criterion on .228**. Verifying the same
|
||||
> lifecycle matrix across the rest of the fleet (.198 and the other testers) lives HERE and is run
|
||||
> **after** the .228 single-node gate is green. This is intentionally NOT a blocker on the .228 gate.
|
||||
|
||||
## Why split it out
|
||||
|
||||
The lifecycle gate must be **run ON the node under test** — its bitcoin/companion/orphan/endpoint
|
||||
checks use local `podman`/`systemctl`/`bitcoin-cli`/`curl`, not RPC to a remote host. Running it from
|
||||
one host against another silently tests the *runner*. So "multinode" isn't "point the harness at N
|
||||
hosts" — it's "run the on-node gate on each host," plus the genuinely cross-node concerns (federation,
|
||||
mesh, transport, sync) that a single node can't exercise.
|
||||
|
||||
## How to run the gate on another node
|
||||
|
||||
Bats + jq usually aren't installed on ISO nodes. Bootstrap (one-time per node):
|
||||
|
||||
```
|
||||
# from a host that has them (e.g. .116):
|
||||
dpkg -L bats | grep -E '^/usr/(bin|lib|libexec)' | tar czf /tmp/bats.tgz -P -T - $(which jq)
|
||||
tar czf /tmp/tests.tgz -C <repo> tests/lifecycle
|
||||
scp /tmp/bats.tgz /tmp/tests.tgz <node>:/tmp/
|
||||
# on the node:
|
||||
sudo tar xzf /tmp/bats.tgz -P -C / # bats (jq here is dynamically linked — may need libs)
|
||||
sudo curl -fsSL -o /usr/local/bin/jq \
|
||||
https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 && sudo chmod +x /usr/local/bin/jq
|
||||
mkdir -p /tmp/lifecycle-run && tar xzf /tmp/tests.tgz -C /tmp/lifecycle-run
|
||||
cd /tmp/lifecycle-run/tests/lifecycle
|
||||
ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=https ARCHY_PASSWORD=<node pw> \
|
||||
ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5 nohup ./run-20x.sh > /tmp/gate.log 2>&1 &
|
||||
```
|
||||
|
||||
## Per-node preconditions (learned on .228)
|
||||
|
||||
- **Bitcoin must be fully synced + archival** (`initialblockdownload:false`, `pruned:false`).
|
||||
test 83 reads the *real* `getblockchaininfo`, not the UI's headers-height. A node mid-IBD will
|
||||
cascade-fail electrumx/lnd/btcpay/mempool even though the apps run.
|
||||
- **Backends should be proper installs** (in `manifest_ids`), not adopted plain-podman left over
|
||||
from ad-hoc `package.start`/cascade churn — otherwise companion self-heal and quadlet checks skew.
|
||||
- **No stale per-app nginx proxy targets.** e.g. `/app/lnd/` must point at the lnd-ui port (18083),
|
||||
not a stale `8081`. Repo code is correct; old node configs may be stale — re-check + regenerate.
|
||||
- **No orphan quadlet units** (e.g. a `home-assistant.container` whose ContainerName ≠ the real
|
||||
`homeassistant` container) — these wedge `systemctl --user` "activating" and fail the quadlet checks.
|
||||
|
||||
## Node roster (carry-over)
|
||||
|
||||
| Node | Role | Notes |
|
||||
|------|------|-------|
|
||||
| .228 | **single-node gate** (primary) | 14-app resilience node; bitcoin synced archival; gate GREEN. |
|
||||
| .198 | fleet verify | was weak/loaded (load ~3–5) + **bitcoin mid-IBD** at split time → must finish syncing first; sshd wedges under concurrent SSH (use ONE session; gate uses HTTPS RPC so fine). |
|
||||
| .5 / .120 | x250 testers (Tailscale) | flaky cellular; SSH via `tailscale nc` ProxyCommand. |
|
||||
| .116 | dev/validation | local repo; its own bitcoin may be mid-IBD — do NOT treat as a gate target unless synced. |
|
||||
|
||||
## Cross-node concerns (only a multinode setup can test)
|
||||
|
||||
- Federation sync (Tor/FIPS transports), DID/contact federation, peer file fetch.
|
||||
- Mesh (Meshtastic/MeshCore) + mesh-AI gating.
|
||||
- Dual-ecash federation validation + networking-sats routing.
|
||||
- DHT / iroh swarm distribution (origin-always-wins) once that dep lands.
|
||||
|
||||
## Sequence
|
||||
|
||||
1. Get the **.228 single-node gate green 5×** (master plan §5/§6) — DONE/in progress.
|
||||
2. THEN: bring each fleet node to the preconditions above; run the on-node gate 5× per node.
|
||||
3. THEN: the cross-node suites (federation/mesh/transport), tracked here.
|
||||
|
||||
This plan does not gate the v1.7.x single-node criterion; it is the next layer.
|
||||
@ -26,8 +26,9 @@ The migration's aim, restated as **five pillars** (every app must satisfy all fi
|
||||
desired→current from manifests + secrets. Self-healing, not edge-triggered.
|
||||
3. **Lifecycle bulletproof** — every app passes the full matrix
|
||||
(install / UI reachable / stop / start / restart / reinstall / reboot-survive
|
||||
/ archipelago-restart-survive / uninstall) **5× green on .228 AND .198 for now**
|
||||
(`ARCHY_ITERATIONS=5`; temporarily reduced from 20×, restore before final ship)
|
||||
/ archipelago-restart-survive / uninstall) **5× green on .228** — run ON the node
|
||||
(`ARCHY_ITERATIONS=5`; temporarily reduced from 20×, restore before final ship).
|
||||
(Multinode / fleet → `docs/multinode-testing-plan.md`, separate.)
|
||||
before any release.
|
||||
4. **Data-driven apps** — install/uninstall needs only the app's manifest +
|
||||
catalog entry. **No host OS changes** (no apt, no /etc, no host units) and
|
||||
@ -40,9 +41,10 @@ The migration's aim, restated as **five pillars** (every app must satisfy all fi
|
||||
owned by the service user. Security is king.
|
||||
|
||||
**Per-app definition of done:** all five pillars hold → lifecycle matrix 5×
|
||||
(for now; was 20×) green on .228 then .198 → catalog/registry updated (`app-catalog/catalog.json`
|
||||
(for now; was 20×) green on .228 (run ON the node) → catalog/registry updated (`app-catalog/catalog.json`
|
||||
+ `releases/app-catalog.json`, rebuilt image pushed to the mirror) → tracker
|
||||
cell ticked. Only then move to the next app.
|
||||
cell ticked. Only then move to the next app. (Fleet/multinode verification is a
|
||||
separate pass → `docs/multinode-testing-plan.md`.)
|
||||
|
||||
**.228 testing constraint:** do NOT touch `bitcoin-knots`, `electrumx`, or
|
||||
`lnd` on .228 — they are synced and healthy; destructive cycles there would
|
||||
@ -121,8 +123,9 @@ cost hours of resync.
|
||||
| L5 — Chaos / failure-path | Failure modes recover gracefully (corrupt config, deleted bolt DB, network partition) | bats (chaos-gated) | ~120s per scenario |
|
||||
| L6 — Performance | Cold install latency, reconcile-tick cost, podman call count per lifecycle event | timed bats + Prometheus (TBD) | ~60s per benchmark |
|
||||
|
||||
Release gate: **L0+L1+L2+L3 green × 20 iterations** on .228 AND .198. L4+L5+L6 are
|
||||
quality gates we add as they mature; not blocking the v1.7.52 tag.
|
||||
Release gate: **L0+L1+L2+L3 green × 20 iterations** on .228 (run ON the node; 5× for
|
||||
now). Multinode/fleet → `docs/multinode-testing-plan.md`. L4+L5+L6 are quality gates
|
||||
we add as they mature; not blocking the v1.7.52 tag.
|
||||
|
||||
## Coverage matrix — current state
|
||||
|
||||
@ -248,8 +251,8 @@ We don't have a performance harness yet. Add as L6 lands:
|
||||
v1.7.52 ships only when ALL of:
|
||||
|
||||
1. ☐ Bitcoin-stops fix verified live on a fresh node (tests/lifecycle/bats/bitcoin-knots.bats fully ● after a cold install)
|
||||
2. ☐ `ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh` returns 0 against .228 (5× for now; full suite, ARCHY_ALLOW_DESTRUCTIVE=1)
|
||||
3. ☐ `ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh` returns 0 against .198 (same)
|
||||
2. ☐ `ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh` returns 0 **run ON .228** (5× for now; full suite, ARCHY_ALLOW_DESTRUCTIVE=1) — 1× is GREEN (110/110), 5× in progress
|
||||
3. ☐ Multinode/fleet (.198 + others) — tracked separately in `docs/multinode-testing-plan.md`, NOT a v1.7.52 single-node gate item
|
||||
4. ☐ The L3 `backend-survives-archipelago-restart` suite passes (= Phase 3 Quadlet shipped for backends)
|
||||
5. ☐ Cargo: 0 warnings, 0 unused, all tests green (sustained ✓ since 1c0df95f)
|
||||
6. ☐ LoC: at least one of {Phase 3 Quadlet, dev_mode resolution} merged
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user