diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md
index da30f4b4..21bfa92a 100644
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -69,7 +69,8 @@ real nodes. Until then, this plan is the priority.
 | B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
 | C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
 | D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0–2 code-complete (worktree) |
-| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — single-node criterion met |
+| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — but this is DESTRUCTIVE-tier / ~8 core apps only; see §6c for the coverage gaps |
+| F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **NEW (2026-06-23)** — real bugs already found in manual multinode testing; sequenced after netbird + Phase-3 |
 
 **Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
 (ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
@@ -88,6 +89,14 @@ plan — `docs/multinode-testing-plan.md` — NOT part of this single-node crite
 Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard + proxies;
 L3 survival ◐; ~30 apps have zero automated coverage.
 
+> ⚠️ **The 2026-06-23 5×-green is NOT the full bar.** `run-gate.sh` runs only the
+> **DESTRUCTIVE tier** (stop/start/restart/survive) over ~8 core apps; it **skips
+> uninstall/reinstall** (CASCADE is gated behind `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`,
+> never set by the gate) and tests no install/uninstall **progress UI**. Real
+> uninstall/reinstall/progress bugs (immich + grafana) were found in manual testing
+> right after — see **§6c (workstream F)** for the gap and the expanded-gate plan.
+> The true "every app, fully" criterion is F's definition-of-done, not this run.
+
 ## 6. Immediate sequence (live workstream)
 
 1. ✅ **B-phase 1** — `manifest` field on `AppCatalogEntry`; `load_manifests`
@@ -117,6 +126,55 @@ published catalog (then sign) to actually distribute manifests via the registry;
 Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not just
 podman-`--restart`).
 
+## 6b. Post-deploy task order (agreed 2026-06-23)
+
+After the 2026-06-23 multinode test deploy (latest backend + UX frontend to .116/.198/.228
++ Tailscale testers), do these IN ORDER:
+1. **netbird #20 ph4** — the last real manifest migration (workstream A).
+2. **Phase-3 `use_quadlet_backends`** — orchestrator backends become Quadlet units.
+3. **§6c Lifecycle perfection** (workstream F) — the comprehensive uninstall/reinstall
+   + progress-UI + all-apps gate expansion below.
+
+## 6c. Lifecycle perfection — what "green" MISSED (workstream F, the perfection bar)
+
+**Why this exists:** the 2026-06-23 single-node gate went 5×-green but is **NOT** the
+"every app fully lifecycle-tested" guarantee a user reasonably assumes. The canonical gate
+(`run-gate.sh`) only runs the **DESTRUCTIVE tier** (stop / start / restart / survive) over
+**~8 core apps** (bitcoin-knots, btcpay, electrumx, lnd, mempool, immich, fedimint,
+filebrowser). It explicitly **SKIPS uninstall/reinstall** (the CASCADE tier is gated behind
+`ARCHY_ALLOW_CASCADE_DESTRUCTIVE`, which `run-gate.sh` never sets) and has **zero coverage**
+for the other ~30 apps (grafana, jellyfin, vaultwarden, penpot, nextcloud, photoprism,
+uptime-kuma, homeassistant, … — see `app-registry-status-2026-06-21.md`). So uninstall,
+reinstall, install-progress UI, and most apps were never under test.
+
+**Real bugs found in manual multinode testing on .198 (2026-06-23) — the motivating evidence:**
+- **Uninstall is broken for immich + grafana:** takes very long, the progress bar sits at a
+  **solid full-red with no real progression**, and the app **does not actually uninstall** —
+  it still appears in **My Apps** afterward (ghost entry / state not cleared).
+- **grafana reinstall just stops** partway (no completion, no clear error).
+- **fedimint guardian** suddenly showed **"starting up — Guardian opens a wait page until
+  Bitcoin finishes initial sync" / "starting"** on that node — verify this is correct
+  wait-for-IBD behavior vs a stuck/false state (it's a backend that depends on bitcoin sync).
+
+**Workstream F scope — the gate must grow to (in priority order):**
+1. **CASCADE tier in the canonical gate:** uninstall → verify the app is GONE from My Apps /
+   `container-list` / package state (no ghost), data preserved per policy, then reinstall →
+   verify it returns healthy. Catch the immich/grafana ghost + reinstall-stops bugs.
+2. **Progress-UI assertions:** install AND uninstall must report monotonic, truthful progress
+   (not a stuck full-red bar); a long op must surface a real stage/percentage and a terminal
+   success/failure — no silent hang. (Likely both a backend progress-event fix AND a UI fix.)
+3. **ALL-apps coverage:** a generic per-app lifecycle matrix (install / UI-reach / stop / start /
+   restart / uninstall / reinstall / reboot-survive) driven by the manifest set, so grafana and
+   the ~30 uncovered apps are gated too — not just the 8 core. Manifest-driven, so new apps are
+   covered automatically.
+4. **Guardian/IBD-dependent states:** assert that "waiting for bitcoin sync"-style states are a
+   legitimate, surfaced wait (with a path to ready) and never a permanent stuck state.
+
+**Definition of done for F:** the expanded gate (CASCADE + progress + all-apps) is 5×-green on
+.228, then re-verified across the multinode fleet — i.e. an *insanely-perfect* OS/container
+environment where every app installs, runs, updates, uninstalls, and reinstalls cleanly with
+honest progress, no ghosts, no data loss, reboot-survivable.
+
 ## 7. Release blockers & operational gotchas (durable)
 
 Carried forward from prior handoffs (deduped against persistent memory):
@@ -556,3 +614,24 @@ This master plan is the hub. Authoritative standalone docs (linked above), kept:
 
 All dated handoffs/resumes/transcripts/superseded trackers were consolidated here and
 removed (recoverable via git) on 2026-06-21.
+
+## 10. Backlog — investigate frontend state management (2026-06-23)
+
+**Investigate adopting a real client-state/data-fetching layer for `neode-ui`** instead of
+the current hand-rolled Pinia stores + ad-hoc fetch/poll patterns. Motivation: lifecycle/UX
+bugs like the stuck "full-red" install/uninstall progress bar and ghost **My Apps** entries
+(see §6c) are partly a *state-sync* problem — the UI's view of package state drifts from the
+backend and isn't reliably invalidated/refetched. A principled query/cache layer (request
+dedup, background refetch, cache invalidation on mutation, optimistic updates, retry/stale
+handling) would make these classes of bug structurally hard.
+
+**Research → recommend → (maybe) adopt:**
+- Evaluate **TanStack Query** (Vue Query) as the leading candidate, plus alternatives
+  (Pinia Colada, vue-query alternatives, plain Pinia + a disciplined invalidation layer, or
+  an SSE/WebSocket push model for package-state events instead of polling).
+- Criteria: fit with the existing Pinia/RPC architecture, bundle-size cost, offline/PWA
+  behaviour, how cleanly it models long-running mutations (install/uninstall with progress),
+  and whether a push channel for package-state changes is the better root-cause fix.
+- Deliverable: a short design note + a recommendation, then a scoped migration of the
+  package-lifecycle surfaces (My Apps / install / uninstall / update progress) as the proof
+  case — sequence AFTER workstream F (it informs F's progress-UI fix and vice-versa).