Compare commits

...

2 Commits

Author SHA1 Message Date
archipelago
ccb594fb85 test(gate): fix bitcoin-knots getinfo-after-restart helper + IBD note
It called bats-assert's `fail` (not loaded in this file) → "fail:
command not found"/127, masking the real reason. Emit+return instead,
bump the cold-restart RPC window 60s→120s (block-index reload), and
note a node mid-IBD legitimately can't serve getinfo (environmental
precondition, not a product regression).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 06:28:20 -04:00
archipelago
deff380191 docs(master-plan): workstream F (lifecycle perfection) + §10 state-mgmt backlog
The 2026-06-23 5×-green gate is DESTRUCTIVE-tier / ~8 core apps only —
it skips uninstall/reinstall (cascade) and has no progress-UI or
all-apps coverage. Manual multinode testing found real bugs it never
ran (immich+grafana uninstall hangs at full-red bar + ghost in My Apps;
grafana reinstall stops; fedimint guardian "waiting for bitcoin sync").
Adds §4 row F, §6b post-deploy order (netbird→Phase-3→F), §6c scope +
observed bugs + definition-of-done, a §5 warning, and §10 backlog to
investigate TanStack-Query/push-based state management for neode-ui.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 06:28:19 -04:00
2 changed files with 91 additions and 4 deletions

View File

@ -69,7 +69,8 @@ real nodes. Until then, this plan is the priority.
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 02 code-complete (worktree) |
| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — single-node criterion met |
| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — but this is DESTRUCTIVE-tier / ~8 core apps only; see §6c for the coverage gaps |
| F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **NEW (2026-06-23)** — real bugs already found in manual multinode testing; sequenced after netbird + Phase-3 |
**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
@ -88,6 +89,14 @@ plan — `docs/multinode-testing-plan.md` — NOT part of this single-node crite
Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard +
proxies; L3 survival ◐; ~30 apps have zero automated coverage.
> ⚠️ **The 2026-06-23 5×-green is NOT the full bar.** `run-gate.sh` runs only the
> **DESTRUCTIVE tier** (stop/start/restart/survive) over ~8 core apps; it **skips
> uninstall/reinstall** (CASCADE is gated behind `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`,
> never set by the gate) and tests no install/uninstall **progress UI**. Real
> uninstall/reinstall/progress bugs (immich + grafana) were found in manual testing
> right after — see **§6c (workstream F)** for the gap and the expanded-gate plan.
> The true "every app, fully" criterion is F's definition-of-done, not this run.
## 6. Immediate sequence (live workstream)
1. ✅ **B-phase 1**`manifest` field on `AppCatalogEntry`; `load_manifests`
@ -117,6 +126,55 @@ published catalog (then sign) to actually distribute manifests via the registry;
Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not
just podman-`--restart`).
## 6b. Post-deploy task order (agreed 2026-06-23)
After the 2026-06-23 multinode test deploy (latest backend + UX frontend to .116/.198/.228
+ Tailscale testers), do these IN ORDER:
1. **netbird #20 ph4** — the last real manifest migration (workstream A).
2. **Phase-3 `use_quadlet_backends`** — orchestrator backends become Quadlet units.
3. **§6c Lifecycle perfection** (workstream F) — the comprehensive uninstall/reinstall +
progress-UI + all-apps gate expansion below.
## 6c. Lifecycle perfection — what "green" MISSED (workstream F, the perfection bar)
**Why this exists:** the 2026-06-23 single-node gate went 5×-green but is **NOT** the
"every app fully lifecycle-tested" guarantee a user reasonably assumes. The canonical gate
(`run-gate.sh`) only runs the **DESTRUCTIVE tier** (stop / start / restart / survive) over
**~8 core apps** (bitcoin-knots, btcpay, electrumx, lnd, mempool, immich, fedimint,
filebrowser). It explicitly **SKIPS uninstall/reinstall** (the CASCADE tier is gated behind
`ARCHY_ALLOW_CASCADE_DESTRUCTIVE`, which `run-gate.sh` never sets) and has **zero coverage**
for the other ~30 apps (grafana, jellyfin, vaultwarden, penpot, nextcloud, photoprism,
uptime-kuma, homeassistant, … — see `app-registry-status-2026-06-21.md`). So uninstall,
reinstall, install-progress UI, and most apps were never under test.
**Real bugs found in manual multinode testing on .198 (2026-06-23) — the motivating evidence:**
- **Uninstall is broken for immich + grafana:** takes very long, the progress bar sits at a
**solid full-red with no real progression**, and the app **does not actually uninstall**
it still appears in **My Apps** afterward (ghost entry / state not cleared).
- **grafana reinstall just stops** partway (no completion, no clear error).
- **fedimint guardian** suddenly showed **"starting up — Guardian opens a wait page until
Bitcoin finishes initial sync" / "starting"** on that node — verify this is correct
wait-for-IBD behavior vs a stuck/false state (it's a backend that depends on bitcoin sync).
**Workstream F scope — the gate must grow to (in priority order):**
1. **CASCADE tier in the canonical gate:** uninstall → verify the app is GONE from My Apps /
`container-list` / package state (no ghost), data preserved per policy, then reinstall →
verify it returns healthy. Catch the immich/grafana ghost + reinstall-stops bugs.
2. **Progress-UI assertions:** install AND uninstall must report monotonic, truthful progress
(not a stuck full-red bar); a long op must surface a real stage/percentage and a terminal
success/failure — no silent hang. (Likely both a backend progress-event fix AND a UI fix.)
3. **ALL-apps coverage:** a generic per-app lifecycle matrix (install / UI-reach / stop / start /
restart / uninstall / reinstall / reboot-survive) driven by the manifest set, so grafana and
the ~30 uncovered apps are gated too — not just the 8 core. Manifest-driven, so new apps are
covered automatically.
4. **Guardian/IBD-dependent states:** assert that "waiting for bitcoin sync"-style states are a
legitimate, surfaced wait (with a path to ready) and never a permanent stuck state.
**Definition of done for F:** the expanded gate (CASCADE + progress + all-apps) is 5×-green on
.228, then re-verified across the multinode fleet — i.e. an *insanely-perfect* OS/container
environment where every app installs, runs, updates, uninstalls, and reinstalls cleanly with
honest progress, no ghosts, no data loss, reboot-survivable.
## 7. Release blockers & operational gotchas (durable)
Carried forward from prior handoffs (deduped against persistent memory):
@ -556,3 +614,24 @@ This master plan is the hub. Authoritative standalone docs (linked above), kept:
All dated handoffs/resumes/transcripts/superseded trackers were consolidated here
and removed (recoverable via git) on 2026-06-21.
## 10. Backlog — investigate frontend state management (2026-06-23)
**Investigate adopting a real client-state/data-fetching layer for `neode-ui`** instead of
the current hand-rolled Pinia stores + ad-hoc fetch/poll patterns. Motivation: lifecycle/UX
bugs like the stuck "full-red" install/uninstall progress bar and ghost **My Apps** entries
(see §6c) are partly a *state-sync* problem — the UI's view of package state drifts from the
backend and isn't reliably invalidated/refetched. A principled query/cache layer (request
dedup, background refetch, cache invalidation on mutation, optimistic updates, retry/stale
handling) would make these classes of bug structurally hard.
**Research → recommend → (maybe) adopt:**
- Evaluate **TanStack Query** (Vue Query) as the leading candidate, plus alternatives
(Pinia Colada, vue-query alternatives, plain Pinia + a disciplined invalidation layer, or
an SSE/WebSocket push model for package-state events instead of polling).
- Criteria: fit with the existing Pinia/RPC architecture, bundle-size cost, offline/PWA
behaviour, how cleanly it models long-running mutations (install/uninstall with progress),
and whether a push channel for package-state changes is the better root-cause fix.
- Deliverable: a short design note + a recommendation, then a scoped migration of the
package-lifecycle surfaces (My Apps / install / uninstall / update progress) as the proof
case — sequence AFTER workstream F (it informs F's progress-UI fix and vice-versa).

View File

@ -137,15 +137,23 @@ ssh_podman_ps() {
@test "bitcoin.getinfo succeeds after restart" {
[[ "${ARCHY_ALLOW_DESTRUCTIVE:-0}" == "1" ]] || skip "ARCHY_ALLOW_DESTRUCTIVE not set"
# Give bitcoind up to 60s to accept RPC after cold restart
local deadline=$(( $(date +%s) + 60 ))
# Give bitcoind up to 120s to accept RPC after a cold restart — reloading the
# block index + chainstate can take a while even on a synced node.
local deadline=$(( $(date +%s) + 120 ))
while (( $(date +%s) < deadline )); do
if rpc_call bitcoin.getinfo | jq -e '.error == null' >/dev/null 2>&1; then
return 0
fi
sleep 3
done
fail "bitcoin.getinfo never recovered after restart"
# NB: bats-assert's `fail` is not loaded in this file (only ../lib/rpc.bash),
# so emit + return non-zero directly rather than calling an undefined helper
# (which fails with "fail: command not found" / status 127 and hides the real
# reason). A node mid-IBD legitimately can't serve getinfo here — that's an
# environmental precondition (see required-stack "synced archival"), not a
# product regression.
echo "bitcoin.getinfo never recovered after restart within 120s" >&2
return 1
}
# ────────────────────────────────────────────────────────────────────