test(gate): fix bitcoin-knots getinfo-after-restart helper + IBD note

It called bats-assert's `fail` (not loaded in this file) → "fail: command not found"/127, masking the real reason. Emit+return instead, bump the cold-restart RPC window 60s→120s (block-index reload), and note a node mid-IBD legitimately can't serve getinfo (environmental precondition, not a product regression). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs(master-plan): workstream F (lifecycle perfection) + §10 state-mgmt backlog
2026-06-23 06:28:20 -04:00 · 2026-06-23 06:28:19 -04:00
2 changed files with 91 additions and 4 deletions
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@ -69,7 +69,8 @@ real nodes. Until then, this plan is the priority.
 | B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
 | C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
 | D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0–2 code-complete (worktree) |
-| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — single-node criterion met |
+| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — but this is DESTRUCTIVE-tier / ~8 core apps only; see §6c for the coverage gaps |
+| F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **NEW (2026-06-23)** — real bugs already found in manual multinode testing; sequenced after netbird + Phase-3 |

 **Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
 (ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
@ -88,6 +89,14 @@ plan — `docs/multinode-testing-plan.md` — NOT part of this single-node crite
 Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard +
 proxies; L3 survival ◐; ~30 apps have zero automated coverage.

+> ⚠️ **The 2026-06-23 5×-green is NOT the full bar.** `run-gate.sh` runs only the
+> **DESTRUCTIVE tier** (stop/start/restart/survive) over ~8 core apps; it **skips
+> uninstall/reinstall** (CASCADE is gated behind `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`,
+> never set by the gate) and tests no install/uninstall **progress UI**. Real
+> uninstall/reinstall/progress bugs (immich + grafana) were found in manual testing
+> right after — see **§6c (workstream F)** for the gap and the expanded-gate plan.
+> The true "every app, fully" criterion is F's definition-of-done, not this run.
+
 ## 6. Immediate sequence (live workstream)

 1. ✅ **B-phase 1** — `manifest` field on `AppCatalogEntry`; `load_manifests`
@ -117,6 +126,55 @@ published catalog (then sign) to actually distribute manifests via the registry;
 Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not
 just podman-`--restart`).

+## 6b. Post-deploy task order (agreed 2026-06-23)
+
+After the 2026-06-23 multinode test deploy (latest backend + UX frontend to .116/.198/.228
+ Tailscale testers), do these IN ORDER:
+1. **netbird #20 ph4** — the last real manifest migration (workstream A).
+2. **Phase-3 `use_quadlet_backends`** — orchestrator backends become Quadlet units.
+3. **§6c Lifecycle perfection** (workstream F) — the comprehensive uninstall/reinstall +
+   progress-UI + all-apps gate expansion below.
+
+## 6c. Lifecycle perfection — what "green" MISSED (workstream F, the perfection bar)
+
+**Why this exists:** the 2026-06-23 single-node gate went 5×-green but is **NOT** the
+"every app fully lifecycle-tested" guarantee a user reasonably assumes. The canonical gate
+(`run-gate.sh`) only runs the **DESTRUCTIVE tier** (stop / start / restart / survive) over
+**~8 core apps** (bitcoin-knots, btcpay, electrumx, lnd, mempool, immich, fedimint,
+filebrowser). It explicitly **SKIPS uninstall/reinstall** (the CASCADE tier is gated behind
+`ARCHY_ALLOW_CASCADE_DESTRUCTIVE`, which `run-gate.sh` never sets) and has **zero coverage**
+for the other ~30 apps (grafana, jellyfin, vaultwarden, penpot, nextcloud, photoprism,
+uptime-kuma, homeassistant, … — see `app-registry-status-2026-06-21.md`). So uninstall,
+reinstall, install-progress UI, and most apps were never under test.
+
+**Real bugs found in manual multinode testing on .198 (2026-06-23) — the motivating evidence:**
+- **Uninstall is broken for immich + grafana:** takes very long, the progress bar sits at a
+  **solid full-red with no real progression**, and the app **does not actually uninstall** —
+  it still appears in **My Apps** afterward (ghost entry / state not cleared).
+- **grafana reinstall just stops** partway (no completion, no clear error).
+- **fedimint guardian** suddenly showed **"starting up — Guardian opens a wait page until
+  Bitcoin finishes initial sync" / "starting"** on that node — verify this is correct
+  wait-for-IBD behavior vs a stuck/false state (it's a backend that depends on bitcoin sync).
+
+**Workstream F scope — the gate must grow to (in priority order):**
+1. **CASCADE tier in the canonical gate:** uninstall → verify the app is GONE from My Apps /
+   `container-list` / package state (no ghost), data preserved per policy, then reinstall →
+   verify it returns healthy. Catch the immich/grafana ghost + reinstall-stops bugs.
+2. **Progress-UI assertions:** install AND uninstall must report monotonic, truthful progress
+   (not a stuck full-red bar); a long op must surface a real stage/percentage and a terminal
+   success/failure — no silent hang. (Likely both a backend progress-event fix AND a UI fix.)
+3. **ALL-apps coverage:** a generic per-app lifecycle matrix (install / UI-reach / stop / start /
+   restart / uninstall / reinstall / reboot-survive) driven by the manifest set, so grafana and
+   the ~30 uncovered apps are gated too — not just the 8 core. Manifest-driven, so new apps are
+   covered automatically.
+4. **Guardian/IBD-dependent states:** assert that "waiting for bitcoin sync"-style states are a
+   legitimate, surfaced wait (with a path to ready) and never a permanent stuck state.
+
+**Definition of done for F:** the expanded gate (CASCADE + progress + all-apps) is 5×-green on
+.228, then re-verified across the multinode fleet — i.e. an *insanely-perfect* OS/container
+environment where every app installs, runs, updates, uninstalls, and reinstalls cleanly with
+honest progress, no ghosts, no data loss, reboot-survivable.
+
 ## 7. Release blockers & operational gotchas (durable)

 Carried forward from prior handoffs (deduped against persistent memory):
@ -556,3 +614,24 @@ This master plan is the hub. Authoritative standalone docs (linked above), kept:

 All dated handoffs/resumes/transcripts/superseded trackers were consolidated here
 and removed (recoverable via git) on 2026-06-21.
+
+## 10. Backlog — investigate frontend state management (2026-06-23)
+
+**Investigate adopting a real client-state/data-fetching layer for `neode-ui`** instead of
+the current hand-rolled Pinia stores + ad-hoc fetch/poll patterns. Motivation: lifecycle/UX
+bugs like the stuck "full-red" install/uninstall progress bar and ghost **My Apps** entries
+(see §6c) are partly a *state-sync* problem — the UI's view of package state drifts from the
+backend and isn't reliably invalidated/refetched. A principled query/cache layer (request
+dedup, background refetch, cache invalidation on mutation, optimistic updates, retry/stale
+handling) would make these classes of bug structurally hard.
+
+**Research → recommend → (maybe) adopt:**
+- Evaluate **TanStack Query** (Vue Query) as the leading candidate, plus alternatives
+  (Pinia Colada, vue-query alternatives, plain Pinia + a disciplined invalidation layer, or
+  an SSE/WebSocket push model for package-state events instead of polling).
+- Criteria: fit with the existing Pinia/RPC architecture, bundle-size cost, offline/PWA
+  behaviour, how cleanly it models long-running mutations (install/uninstall with progress),
+  and whether a push channel for package-state changes is the better root-cause fix.
+- Deliverable: a short design note + a recommendation, then a scoped migration of the
+  package-lifecycle surfaces (My Apps / install / uninstall / update progress) as the proof
+  case — sequence AFTER workstream F (it informs F's progress-UI fix and vice-versa).
--- a/tests/lifecycle/bats/bitcoin-knots.bats
+++ b/tests/lifecycle/bats/bitcoin-knots.bats
@ -137,15 +137,23 @@ ssh_podman_ps() {
@test "bitcoin.getinfo succeeds after restart" {
  [[ "${ARCHY_ALLOW_DESTRUCTIVE:-0}" == "1" ]] || skip "ARCHY_ALLOW_DESTRUCTIVE not set"

-  # Give bitcoind up to 60s to accept RPC after cold restart
-  local deadline=$(( $(date +%s) + 60 ))
+  # Give bitcoind up to 120s to accept RPC after a cold restart — reloading the
+  # block index + chainstate can take a while even on a synced node.
+  local deadline=$(( $(date +%s) + 120 ))
  while (( $(date +%s) < deadline )); do
    if rpc_call bitcoin.getinfo | jq -e '.error == null' >/dev/null 2>&1; then
      return 0
    fi
    sleep 3
  done
-  fail "bitcoin.getinfo never recovered after restart"
+  # NB: bats-assert's `fail` is not loaded in this file (only ../lib/rpc.bash),
+  # so emit + return non-zero directly rather than calling an undefined helper
+  # (which fails with "fail: command not found" / status 127 and hides the real
+  # reason). A node mid-IBD legitimately can't serve getinfo here — that's an
+  # environmental precondition (see required-stack "synced archival"), not a
+  # product regression.
+  echo "bitcoin.getinfo never recovered after restart within 120s" >&2
+  return 1
 }

 # ────────────────────────────────────────────────────────────────────