# PRODUCTION MASTER PLAN — Archipelago App Platform & Registry > **✅ SINGLE-NODE PRODUCTION GATE IS GREEN (2026-06-23): `run-gate.sh` 5/5 on .228, 0 failures.** > This remains the authoritative plan for the broader north star (manifest-driven > platform, registry-distributed manifests, external marketplace), but it is no > longer a hard priority banner blocking all other work. Remaining workstreams are > in §6 / §8b. Next exit-criteria: multinode (`docs/multinode-testing-plan.md`) + > workstreams B/C/D. > > Last updated: 2026-06-23 · **.228 gate 5×-GREEN (110/110 ×5, 0 not-ok)** — exit criterion met (see §8b). --- ## 1. The North Star Make Archipelago a **world-class, developer-ready app platform** where: 1. **Every app is manifest-driven** — install/run/update/uninstall needs only the app's manifest (+ catalog entry). **Zero OS-level code reliance**: no per-app Rust installers, no `sudo mkdir/chown`, no host provisioning. 2. **Manifests are distributed via the (signed) registry**, not baked into the binary OTA as disk files. Bumping/adding an app = a signed catalog change. 3. **Third-party developers can build and ship apps via an external registry** — a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation), not a gatekept central store. `archy app validate/render/install/test` tooling. 4. The platform stays **rootless, secure-by-default, elegant, robust, and 100%-uptime-capable** (reboot-survivable, self-healing, no data loss on migrate). **Definition of done:** the production test gate (§5) is green for the app set on real nodes. Until then, this plan is the priority. ## 2. Invariants (never violate) - **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged containers unless explicitly approved. (ADR-001, ADR-009.) - **No app-specific business logic in the Rust backend.** The orchestrator owns the lifecycle state machine; apps are declarative. Legacy `install_immich_stack` (hardcoded `podman run` + `sudo chown`) is the anti-pattern being deleted. - **Secrets are manifest-declared** (`generated_secrets`, materialised by `container::secrets` 0600/rootless, idempotent + self-healing) — never hardcoded, per-app, or logged. Replaces the deleted `ensure_fmcd_password`. - **Migrations never destroy data.** Preserve `/var/lib/archipelago/`, generated secrets, displayed credentials, public ports, and adoption container names. Always provide a rollback path. Stop/recreate only when necessary. - **Verify on the real node .228 before any tag.** (Fleet/multinode verification is a separate pass → `docs/multinode-testing-plan.md`.) ## 3. Current state (2026-06-21) - **~40 apps are manifest-based and Quadlet-migrated** (survive `archipelago.service` restart + reboot). Exhaustive per-app table: `docs/app-registry-status-2026-06-21.md`. - **Legacy holdout: immich** — the one app with **no manifest** and a hardcoded Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data. The migration proof case. - **Manifests still travel by OTA disk rsync** (`apps/ → /opt/archipelago/apps`). The signed catalog (`app-catalog.json`) currently distributes **only image overrides** — not full manifests. Gap closed by workstream B. - **The 4 companions** (`archy-bitcoin-ui`, `-lnd-ui`, `-electrs-ui`, `-fedimint-ui`) build from `docker/` contexts via `companion.rs`, not the manifest registry — a later phase folds them in. - **No app has passed the formal production gate.** That is the blocker. ## 4. Workstreams (each links its authoritative detail doc) | # | Workstream | Detail doc | Status | |---|-----------|-----------|--------| | A | **Manifest-driven app platform** — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | `APP-PACKAGING-MIGRATION-PLAN.md` | mostly done; immich + multi-container polish remain | | B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet | | C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending | | D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0–2 code-complete (worktree) | | E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — but this is DESTRUCTIVE-tier / ~8 core apps only; see §6c for the coverage gaps | | F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **NEW (2026-06-23)** — real bugs already found in manual multinode testing; sequenced after netbird + Phase-3 | **Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md` (ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption scan, Quadlet rendering) and `bulletproof-containers.md` (the six container failure modes FM1–FM6 + the desired-state-first reconciler that fixes them). ## 5. Production test gate (exit criterion) An app is **production-ready** only when `tests/lifecycle/run-gate.sh` is green across the full matrix — install / UI-reachable / stop / start / restart / reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall — **5× on .228** (`ARCHY_ITERATIONS=5`). **The gate runs ON the node** (it uses local podman/systemctl/bitcoin probes; running it via RPC from another host silently tests the runner). **Multinode / fleet verification (.198 + others) is a SEPARATE plan — `docs/multinode-testing-plan.md` — NOT part of this single-node criterion.** Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage. > ⚠️ **The 2026-06-23 5×-green is NOT the full bar.** `run-gate.sh` runs only the > **DESTRUCTIVE tier** (stop/start/restart/survive) over ~8 core apps; it **skips > uninstall/reinstall** (CASCADE is gated behind `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`, > never set by the gate) and tests no install/uninstall **progress UI**. Real > uninstall/reinstall/progress bugs (immich + grafana) were found in manual testing > right after — see **§6c (workstream F)** for the gap and the expanded-gate plan. > The true "every app, fully" criterion is F's definition-of-done, not this run. ## 6. Immediate sequence (live workstream) 1. ✅ **B-phase 1** — `manifest` field on `AppCatalogEntry`; `load_manifests` catalog-wins merge; `manifest_dir` kept (build-source catalog manifests skipped in phase 1); unit tests. *(commit 220666d3)* 2. ✅ **B-phase 2** — `EMBED_MANIFESTS` publisher generator + round-trip guard. *(7bfbe8fe; signing via existing ceremony — not yet flipped on for the fleet.)* 3. ✅ **C immich proof** — immich is a manifest-driven stack (immich + immich-postgres + immich-redis) installed via `install_stack_via_orchestrator`; legacy installer is now fallback-only. Live-migrated + verified on .228. Found+fixed: container_name duplicate-on-shared-PGDATA, version-digit validation, partial-fallback hardening, data_uid 100998. Canonical app_id `immich` (title+icon). *(9e6c5370, d5ef4573)* 4. ✅ **Reboot-survival** — podman-restart.service enabled (startup, fleet-wide) for the podman-`--restart` path. *(f160e0c4)* 5. ✅ **E** — 5× gate on **.228** (`ARCHY_ITERATIONS=5`) is **GREEN: 5/5, 0 not-ok** (2026-06-23). Two real orchestrator bugs were found + fixed en route (package.stop per-app grace; package.restart phantom stack-member injection → `order_present_containers`, commit 92d7f52d) plus two single-shot-read probes hardened (bitcoin-knots state, immich lan_address). The single-node criterion is met. 6. ✅ Banner demoted (this doc, 2026-06-23). Next: multinode pass + workstreams B/C/D. **Multinode / fleet verification (.198 and the rest) is split into its own plan:** `docs/multinode-testing-plan.md`. Do it AFTER the .228 single-node gate is green. **Not yet done / deliberate follow-ups:** flip `EMBED_MANIFESTS` on for the published catalog (then sign) to actually distribute manifests via the registry; Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not just podman-`--restart`). ## 6b. Post-deploy task order (agreed 2026-06-23) After the 2026-06-23 multinode test deploy (latest backend + UX frontend to .116/.198/.228 + Tailscale testers), do these IN ORDER: 1. **netbird #20 ph4** — the last real manifest migration (workstream A). 2. **Phase-3 `use_quadlet_backends`** — orchestrator backends become Quadlet units. 3. **§6c Lifecycle perfection** (workstream F) — the comprehensive uninstall/reinstall + progress-UI + all-apps gate expansion below. ## 6c. Lifecycle perfection — what "green" MISSED (workstream F, the perfection bar) **Why this exists:** the 2026-06-23 single-node gate went 5×-green but is **NOT** the "every app fully lifecycle-tested" guarantee a user reasonably assumes. The canonical gate (`run-gate.sh`) only runs the **DESTRUCTIVE tier** (stop / start / restart / survive) over **~8 core apps** (bitcoin-knots, btcpay, electrumx, lnd, mempool, immich, fedimint, filebrowser). It explicitly **SKIPS uninstall/reinstall** (the CASCADE tier is gated behind `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`, which `run-gate.sh` never sets) and has **zero coverage** for the other ~30 apps (grafana, jellyfin, vaultwarden, penpot, nextcloud, photoprism, uptime-kuma, homeassistant, … — see `app-registry-status-2026-06-21.md`). So uninstall, reinstall, install-progress UI, and most apps were never under test. **Real bugs found in manual multinode testing on .198 (2026-06-23) — the motivating evidence:** - **Uninstall is broken for immich + grafana:** takes very long, the progress bar sits at a **solid full-red with no real progression**, and the app **does not actually uninstall** — it still appears in **My Apps** afterward (ghost entry / state not cleared). - **grafana reinstall just stops** partway (no completion, no clear error). - **fedimint guardian** suddenly showed **"starting up — Guardian opens a wait page until Bitcoin finishes initial sync" / "starting"** on that node — verify this is correct wait-for-IBD behavior vs a stuck/false state (it's a backend that depends on bitcoin sync). **Workstream F scope — the gate must grow to (in priority order):** 1. **CASCADE tier in the canonical gate:** uninstall → verify the app is GONE from My Apps / `container-list` / package state (no ghost), data preserved per policy, then reinstall → verify it returns healthy. Catch the immich/grafana ghost + reinstall-stops bugs. 2. **Progress-UI assertions:** install AND uninstall must report monotonic, truthful progress (not a stuck full-red bar); a long op must surface a real stage/percentage and a terminal success/failure — no silent hang. (Likely both a backend progress-event fix AND a UI fix.) 3. **ALL-apps coverage:** a generic per-app lifecycle matrix (install / UI-reach / stop / start / restart / uninstall / reinstall / reboot-survive) driven by the manifest set, so grafana and the ~30 uncovered apps are gated too — not just the 8 core. Manifest-driven, so new apps are covered automatically. 4. **Guardian/IBD-dependent states:** assert that "waiting for bitcoin sync"-style states are a legitimate, surfaced wait (with a path to ready) and never a permanent stuck state. **Definition of done for F:** the expanded gate (CASCADE + progress + all-apps) is 5×-green on .228, then re-verified across the multinode fleet — i.e. an *insanely-perfect* OS/container environment where every app installs, runs, updates, uninstalls, and reinstalls cleanly with honest progress, no ghosts, no data loss, reboot-survivable. ## 7. Release blockers & operational gotchas (durable) Carried forward from prior handoffs (deduped against persistent memory): - **Rootless control-plane responsiveness** — slow `podman ps`/store cleanup at startup must not surface a false "no apps installed" UI. **My Apps must preserve last-known apps during scanner backoff**, never show empty during a transient. - **Reboot survival** — gate on ≥3 (prefer 5) consecutive clean post-reboot lifecycle passes. Quadlet units under `user.slice` survive `archipelago.service` restart; legacy in-cgroup containers get SIGKILLed and reconciled back. - **Startup patterns** — wait on a socket/health, never `sleep`. Tailscale waits for its socket; Fedimint Guardian waits for Bitcoin RPC `initialblockdownload:false` before launching fedimintd (proxy/wait companion on :8175 during IBD). - **Bitcoin must run full** (`txindex=1`, non-pruned) for ElectrumX/mempool. - **Adoption** — match existing containers by name and adopt without recreate; record a migration version in app state; preserve Nostr signer bridges (IndeeHub needs `/nostr-provider.js` served, not just port reachability). - **Image presence** — use bounded targeted `podman image inspect`, not `podman image exists` (avoids store-walk stalls). - **Companion rebuilds** — `companion.rs` must rebuild `:latest` when the build context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never reach nodes. `:local` is a manual override, never auto-rebuilt. ## 8. Roadmap **Pipeline:** Feature Testing (internal) → User Testing (controlled hardware) → Beta Live (public). Hardening priorities feeding the gate: - **P0** Container app reliability — bulletproof install/health/restart/uninstall across all apps, dependency chains, multi-container stacks. - **P0** Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor hidden services, LND Connect). - **P1** LUKS2 full-partition encryption for `/var/lib/archipelago/` (AES-256-XTS, Argon2id, key from setup password + hardware salt). - **P1** Meshtastic plug-and-play parity with MeshCore. - **P1 ✅ CODE-COMPLETE** (branch `companion-mobile-ux`, 2026-06-23; needs on-device + mobile-web verification before merge to `main`) — Mobile app-launch UX — drop the "this app opens in a tab" interstitial. Two surfaces (both: no interstitial screen, launch the app directly): - **Companion app (Android):** open **every** app in the **in-app WebView** (not just non-iframeable ones) — *and* carry the current mobile-iframe footer controls into the WebView (back/forward/reload/close — good, useful UX). - **Mobile web browser (PWA):** open tab-apps directly in a **new browser tab**. Touch points: `neode-ui/src/stores/appLauncher.ts`, `AppLauncherOverlay.vue`, the Android in-app WebView bridge, and the mesh-mobile iframe footer controls. (Reference prior work: `b5a9deb8` in-app webview for non-iframeable apps, `d1fbcd9b` "open in browser" via native bridge.) - **✅ Done (branch `companion-mobile-ux`):** mobile launches now use the store-driven panel (no route push) so the background tab no longer changes and closing returns you where you launched; tab-only apps open directly (in-app WebView on companion via `openInApp`, new browser tab on PWA) with **no interstitial**; the Android `InAppBrowser` (`WebViewScreen.kt`) gained a bottom footer bar (back/forward/reload/open-in-browser/close) + a centered loading screen (favicon + progress); a shared `AppLoadingScreen` (icon + progress) replaced the black/spinner loaders on the app session **and** legacy iframe overlay; the dashboard is pinned to `100dvh` on mobile so the mesh chat/tools panes stop sliding under the tab bar in mobile browsers (no-op in companion); ElectrumX shows its real icon in My Apps. Companion APK bumped to **v0.4.7** (versionCode 11) with a committed shared debug keystore so updates install without an uninstall. **Not yet:** merge to `main`; publish the 0.4.7 companion download (deferred until the gate work lands so they ship together). **Post-beta (deferred — do not start until gate is green):** P2P encrypted voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.md`); Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash phases 2–6 (`dual-ecash-design.md`). ## 8b. SESSION STATE + RESUME (updated 2026-06-23) — READ §8b "CURRENT STATE + RESUME" FIRST ### ▶ CURRENT STATE + RESUME (2026-06-23) — RESUME FROM HERE (works from any device) **✅ HEADLINE (2026-06-23): single-node gate GREEN (`run-gate.sh` 5/5 on .228, 0 not-ok) + multinode test deploy DONE to 6 nodes.** The exit criterion (§5) is met. Green took fixing **two real orchestrator bugs** (package.stop per-app grace, 2026-06-22; package.restart phantom stack-member injection, 2026-06-23 — `order_present_containers`, commit 92d7f52d) plus hardening two single-shot probes (bitcoin-knots state, immich lan_address). All work is **committed + PUSHED to `gitea-vps2` (146) `main` @ `ccb594fb`** — the local-only state is resolved. Binary = release sha `5472c575…`. **▶ DEPLOY STATE (latest backend `5472c575` + UX frontend + one-tap companion APK) — 2026-06-23:** | Node | Pw | Done | Notes | |------|----|----|-------| | .116 (local, http:80) | `ThisIsWeb54321@` | ✅ | dev node: bitcoin mid-IBD + http-only | | .198 | `archipelago` | ✅ | resilience; user manual-testing here | | .228 | `archipelago` | ✅ | canonical gate node (5×-green) | | 100.82.34.38 (archipelago-1) | `archipelago` | ✅ | | | 100.89.209.89 (archy-x250-pa) | `ThisIsWeb54321@` | ✅ | | | 100.70.96.88 (archipelago node) | `ThisIsWeb54321!` | ✅ | note the `!` | | 100.64.83.15 (archy-dev-pa) | ? | ⏳ | UP (tailscale ping ok) but `ThisIsWeb54321@` REJECTED — **need correct pw** | | 100.66.157.120 (archy-x250-exp) | `ThisIsWeb54321@` | ⏭️ | DOWN — user said leave it | Deploy scripts saved in scratchpad: `deploy-node.sh` (full binary+FE, sha+health verify) and `fe-only.sh` (FE-only, no archipelago restart). Reusable: `bash deploy-node.sh 127.0.0.1`. **▶ COMPANION APK fixed (other agent's commit `5c43e127` + my reconcile):** QR + download were a zip-wrapped `.apk.zip` (forced unzip). Now serve raw `archipelago-companion.apk` (one-tap) from the 146 raw URL; `CompanionIntroOverlay.vue` + ship/publish scripts repointed; old `.zip` dropped. The OLD `.apk.zip` URL now 404s, so EVERY node was FE-refreshed to the new build (all 6 verified `/ : 200` + bundle references `archipelago-companion.apk`). **▶ MANUAL-TEST BUGS FOUND on .198 → workstream F (§4/§6c).** The green gate is DESTRUCTIVE-tier / ~8 core apps; it SKIPS uninstall/reinstall and has no progress-UI / all-apps coverage. Real bugs: immich+grafana **uninstall hangs at a solid full-red bar + leaves a ghost in My Apps** (doesn't actually remove); grafana **reinstall stops**; fedimint guardian shows "waiting for bitcoin sync" (verify legit vs stuck). These motivate **workstream F** (cascade + progress + all-apps gate). Also added **§10**: investigate TanStack-Query/push-based state mgmt for neode-ui (the state-drift root cause behind the stuck bar + ghosts). **▶ NEXT — agreed task order (do IN ORDER, see §6b):** 1. **netbird #20 ph4** — last real manifest migration. 2. **Phase-3 `use_quadlet_backends`** — orchestrator backends → Quadlet units. 3. **§6c workstream F** — cascade/uninstall + progress-UI + ALL-apps gate; fix the immich/grafana uninstall + ghost-My-Apps + reinstall-stops bugs to a 5×-green; then §10 state-mgmt investigation. 4. **Multinode pass** — `docs/multinode-testing-plan.md` (the 6 deployed nodes are ready for manual testing now). **▶ LOOSE ENDS / gotchas for the resuming session:** - **`neode-ui/src/components/AppLoadingScreen.vue` is UNTRACKED** on .116 — the other agent created it but NO committed code imports it (orphan, not in `e825bbed`). Left in place; decide whether to wire it in or delete. Not deployed (committed UX doesn't reference it). - **gitea-local mirror (`localhost:3000`) push is BROKEN** (token redirects to `/login`); push to `gitea-vps2` works and is primary. Reconcile the local mirror token if you need it. - **Don't delete bitcoin/electrum data** (user directive) — run only the DESTRUCTIVE gate (`run-gate.sh` default; never set `ARCHY_ALLOW_CASCADE_DESTRUCTIVE` on real nodes with synced chains). - **.198 gate not run this session** (user was manual-testing there + restarting). .116 gate ran but failed 12 tests — ALL environmental (.116 is http-only → ui-coverage hardcodes `https://`; + bitcoin mid-IBD → bitcoin/lnd preconditions). NOT product regressions. `gate-116.log` on .116. **(historical resume notes for the 5× chase below — superseded by the green result above)** **Headline (2026-06-22):** the production gate's `package.stop` blocker is **FIXED**; **`.228` is 1×-GREEN (110/110)**; a **fresh 5× run is IN PROGRESS on `.228`** (the single-node exit criterion) after a real mempool bug found + fixed (below). The gate is now single-node (.228); multinode is split out (`docs/multinode-testing-plan.md`). The gate is canonically **5×** now — `run-gate.sh` (the `20x` naming/script was removed 2026-06-22, commit `57a013bc`). **2026-06-22 (late) — mempool stale-IP bug FOUND + FIXED (real production bug, not a flake):** The 1st 5× attempt failed iteration 1 on `#74 mempool api backend remains queryable`. Root cause was NOT timing — the frontend nginx pinned mempool-api's IP at startup (no `resolver`); after the gate restarts mempool-api (new podman IP) nginx 502s and the UI shows "offline". Fixed in `mempool-frontend:v3.0.1` (resolver+variable proxy_pass; see `[[project_mempool_nginx_stale_ip_fix]]` / `docker/mempool-frontend/`), pushed to vps2, manifests bumped 3.0.0→3.0.1, deployed + resilience- verified live on .228 (backend restart now auto-recovers). Also fixed the test itself (`mempool.bats` #74: 180s→300s + real `fail` helper). Commits `0f05f73a` (fix) `57a013bc` (gate rename). **THE 5× RUN IS DETACHED ON .228 — survives terminal/session close. Check it from any machine:** ``` sshpass -p archipelago ssh archipelago@192.168.1.228 \ 'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x3.log; \ echo "running pid: $(pgrep -f run-gate.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x3.log | sort -u' ``` - Log: `/tmp/gate-5x3.log` on .228 · launched `nohup` · `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`, run **ON the node** from `/tmp/lifecycle-run/tests/lifecycle` via `./run-gate.sh` (ARCHY_HOST=127.0.0.1). `bats` 1.11.1 + static `jq` 1.7.1 are installed on .228. - **If all 5 iterations PASS → .228 has met the single-node criterion → demote the banner.** - If it flakes again: readiness-under-churn (lnd/mempool); hardening in `98f4fa44` (inter-iteration `settle_stack()` + readiness windows). Re-copy repo `tests/lifecycle` to /tmp/lifecycle-run, relaunch. **▶ 2026-06-23 (morning) — 5× FINISHED 2/5; both mempool fails ROOT-CAUSED to ONE real orchestrator bug (NOT flakes) + FIXED:** the overnight run finished `passed: 2 / failed: 3` on `gate-5x3.log`, three *distinct one-off* fails, none repeating: - iter1 `#5 container-list valid state for bitcoin-knots` — pre-launch churn (as predicted); didn't repeat. **Hardened anyway:** the probe was a single-shot read; now polls ≤30s for a settled valid state so a momentary `restarting`/transient can't flake a 20-min iteration (`bitcoin-knots.bats`). - iter2 `#74 mempool api queryable` + iter5 `#73 mempool stack running` — **SAME root cause.** `package.restart mempool` resolves its container list via `ordered_containers_for_start`, which was **injecting phantom stack-member names** (`mysql-mempool`, `archy-mempool-api`, `archy-mempool-web` — variant names from the union `startup_order` list that aren't live on this node). The phantom `mysql-mempool` is 2nd in the start order; `do_orchestrator_package_start` hits its unknown-app-id fallback → `do_package_start` inspect fails "no such object" → the `?` **aborts the whole start sequence**, so `mempool-api` (pos 5) + `mempool` frontend (pos 8) never start. They then sat down ~6 min until the health monitor independently recovered them → #73 (frontend not running in 180s) and #74 (api not queryable in 300s) both flake. Journal proof on .228: `package.restart mempool failed: Start failed: mysql-mempool: ... no such object`, 23:27:32. **Fix:** `ordered_containers_for_start` now orders only the *actually-present* containers and never injects phantom order entries (new pure helper `order_present_containers` + 3 unit tests, `dependencies.rs`). This is the SAME class as the mempool nginx bug — a hardcoded-name/reality mismatch — and is exactly the manifest-driven-lifecycle anti-pattern the master plan targets. - **Deploy + relaunch:** built release binary on .116, swapped `/usr/local/bin/archipelago` on .228 (containers live under `user@1000.service`, NOT the `archipelago.service` cgroup, so a service restart does NOT kill them — verified via conmon cgroup paths). Manually verified mempool restart keeps the stack up, then relaunched a clean 5× → see `gate-5x4.log` (check cmd above, swap the filename). Expectation: all three fixed → 5/5 green → demote the banner. **Code fixes shipped this session (all on `main`, built + DEPLOYED to .228 AND .198):** - `2dad64b2` stop honours per-app grace (was `-t 30` deadline racing SIGKILL). - `760a32bc` reconciler stops resurrecting user-stopped apps (dep-override + host-port watchdog). - `6e49ce6f` container-list reports user-stopped apps as `stopped` despite a live UI companion. - `452f05d8` companion self-heal on its own ~30s loop (was gated behind the slow per-app pass). - Test-harness hardening: `88930558` `53b8e47f` `892ff083` `98f4fa44` (readiness retries, immich/ fedimint/NPM/lnd windows, inter-iteration settle). Binary built on .116 `core/target/release/archipelago` (4-fix); deploy = stop archipelago, cp to /usr/local/bin, start. **NODE-STATE fixes on .228 NOT in the repo (re-apply if .228 is reset/reimaged):** - nginx `/app/lnd/` proxy target was stale `8081` → fixed to `18083` (sed in /etc/nginx/sites-{available,enabled}/archipelago + snippets, then `nginx -s reload`). Repo code is correct (18083); old node config was stale. - Removed a stale orphan `~/.config/containers/systemd/home-assistant.container` (ContainerName `home-assistant` ≠ the real `homeassistant` container; it was stuck "activating"). Real app fine. - electrumx was re-installed (`package.install` w/ image `146.59.87.168:3000/lfg2025/electrumx:v1.18.0`) to re-register it as a tracked manifest app (it had become adopted plain-podman). **KEY LESSON:** run the lifecycle gate **ON the node**, not via RPC from .116 — its bitcoin/companion/ orphan/endpoint tests use local `podman`/`systemctl`/`bitcoin-cli`/`curl`, so a remote run silently tests the *runner* (this is why earlier runs from .116 falsely showed "bitcoin in IBD" etc.). **Remaining (after 5× green):** netbird migration (#20 ph4 — the one real migration left) + btcpay/ mempool stack polish; Phase-3 `use_quadlet_backends`; B flip-on (EMBED_MANIFESTS+sign); per-app test coverage (~30 apps unwritten); the mobile app-launch UX (§8 Roadmap P1). Multinode → its own plan. --- ### Where we are — Task #20 (manifest lifecycle hooks) + indeedhub migration: DONE & 2-node verified Manifest-driven lifecycle hooks + the IndeedHub stack migration are **complete and live-verified on BOTH .228 and .198** (adoption + fresh-create + post_install hook exec, stable under load). 15 commits this session: `4c1a4e59`..`e2a012d0`. Working tree clean. The release lifecycle gate is **5×** (`ARCHY_ITERATIONS=5`). **Shipped (all on `main`, newest first):** - `e2a012d0` indeedhub frontend health → `tcp:7777` (was http GET `/`; the http check false-failed under load and the reconciler churned the frontend — fixed). - `ff78b312` hook `exec` runs in a transient user scope (`systemd-run --user --scope --quiet --collect podman exec …`) — fixes "crun: write cgroup.procs: Permission denied" when exec'ing from archipelago.service. - `ff8f11b8` indeedhub frontend caps `[CHOWN,DAC_OVERRIDE,SETGID,SETUID]` — nginx workers died "setgid(101) failed" under the orchestrator's `--cap-drop=ALL`. - `b73084db` DELETED the legacy indeedhub orchestrator special-cases (−382 lines: reconcile_indeedhub_stack, start_indeedhub_backends, the 120s dependency-DNS gate, patch_indeedhub_nostr_provider, repair_indeedhub_network_aliases, INDEEDHUB_* consts) → "indeedhub" now uses the GENERIC install_fresh/reconcile path. - `b1eea8c0` 7 indeedhub manifests (apps/indeedhub{,-postgres,-redis,-minio,-relay,-api, -ffmpeg}) + `install_indeedhub_stack` orchestrator-first (immich pattern). - `b94b61f6` `network_aliases` ContainerConfig field (podman_client + quadlet rendering, DNS-label validated) — lets the frontend nginx reach `api:4000`/`minio:9000`/`relay:8080` on the dedicated `indeedhub-net`. - `955c54b7`/`4c1a4e59` #20 hooks phases 1-2: schema (LifecycleHooks/HookStep/HostCopy in archipelago-container::manifest) + executor `container::hooks::run_post_install` (allowlist-canonicalised copy_from_host + scoped exec), wired into `install_fresh`. - `84031e62` gate 20×→5× (docs only: CLAUDE.md, this file, tests/lifecycle/TESTING.md). **Design = adoption-safe + manifest-driven.** Manifests reproduce the live install exactly so existing nodes ADOPT (NoOp) instead of recreate: hyphen container_names the runtime already references, named volumes `indeedhub-{postgres,redis,minio,relay}-data`, `indeedhub-net` + network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse the live /var/lib/archipelago/secrets values (ensure_one no-ops on existing; postgres pw is fixed at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept. The frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + nostr-provider.js + sub_filter), so the post_install hook (sed X-Frame / copy nostr-provider.js / inject / nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering guard is KEPT on purpose (beneficial; not a blocker). ### ⛔ GATE BLOCKER 2026-06-22 — `package.stop` ignores the per-app stop grace (REAL, fleet-wide, ROOT-CAUSED) Step 1 (sync .228 tcp-health manifest) is **DONE + verified**. Step 2 (the 5× gate) surfaced a real, fleet-wide `package.stop` bug — **reproduced on the CLEAN, quadlet-correct .198**, so it is a genuine product bug, not node contamination. Root cause is fully pinned (below). **Symptom.** `package.stop ` returns `{"status":"stopping"}` but the container **never stops** (`container-list` shows `running` 60s+); the gate's `wait_for_container_status … stopped 60` times out. Hits **fedimint, electrumx, bitcoin-knots, btcpay-server, immich** (slow-to-SIGTERM apps). `filebrowser` passes because it exits on SIGTERM in <30s. **ROOT CAUSE (from .198 journal during a live `package.stop fedimint`):** ``` WARN quadlet: systemctl --user stop fedimint.service timed out after 45s ERROR runtime: package.stop fedimint failed: stop_container fedimint: podman stop -t 30 fedimint timed out after 30s: deadline has elapsed ``` The orchestrator stop path **ignores the per-app graceful-stop table** and the wrapper deadline equals the grace: - `archipelago::api::rpc::package::runtime::stop_timeout_secs()` defines per-app grace (**bitcoin 600s, lnd 330s, electrumx 300s, immich_postgres 120s, fedimint/btcpay 60s**, default 30). The **legacy** stop paths use it (runtime.rs:329/607/1060 `podman stop -t `). - The **orchestrator** path does NOT: `prod_orchestrator::stop()` → `ContainerRuntime::stop_container` (`container/src/runtime.rs:124`) → API `PodmanClient::stop_container` hardcodes **`?t=10`** (podman_client.rs) and the CLI fallback hardcodes **`-t 30`** (runtime.rs:128). fedimint needs 60s but gets 10s/30s ⇒ SIGTERM grace expires; the API/CLI stop errors out and the whole stop fails → state reverts to `running`. - **Compounding:** `PODMAN_CLI_DEFAULT_TIMEOUT = 30s` (runtime.rs:9) wraps `podman stop -t 30`, so the await fires **exactly** when podman would SIGKILL → "timed out after 30s" even though the kill would land a moment later. The wrapper deadline must exceed the `-t` grace. **FIX (two parts, design choice flagged):** 1. **Thread the per-app stop grace into the orchestrator stop path.** Either (A) move/duplicate `stop_timeout_secs` into the `container` crate and have `stop_container` use it, (B) extend the `ContainerRuntime::stop_container` signature to take a `grace: Duration` and have `prod_orchestrator::stop()` compute it from the loaded manifest, or **(C, north-star-aligned)** add a `stop_grace_secs` field to the manifest (default 30) and read it from `lm.manifest` in `stop()`. (C) is the manifest-driven choice; bitcoin/lnd/electrumx/fedimint manifests then declare their value. **DECISION NEEDED from owner: A/B (fast, table-based) vs C (manifest-driven).** 2. **Make the CLI/API wrapper deadline = grace + buffer** (e.g. grace + 15s) so podman's SIGKILL completes inside the await. Apply to both `PodmanClient::stop_container` (`?t=`+HTTP timeout) and the `runtime.rs` CLI fallback (`-t`+`PODMAN_CLI_DEFAULT_TIMEOUT`). Add a mock-orchestrator test: a container that ignores SIGTERM for >30s must still end `stopped`. **Build/deploy after the fix:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago` → sideload to .228 + .198 (stop archipelago, cp binary, start) → **re-quadletize .228** (its backend `.container` files are gone from my cascade-gate contamination — reinstall its apps so units regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only). ### ✅/⚠️ FIX SHIPPED + VALIDATED 2026-06-22 — and the gate has MORE causes than the grace bug **Done:** the grace fix is implemented (option **C+table fallback**: manifest `stop_grace_secs` → `stop_grace_secs_for()` table; deadline = grace + 15s), unit-tested (3 tests green), committed (`2dad64b2`), release-built, and **deployed to BOTH .228 and .198** (active, UI 200). Quadlet regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198 (running→exited→removed) — no regression; the deployed binary's stop path works. **The gate stop-failure was MULTI-CAUSED (3 real product bugs) — all 3 now FIXED + the electrumx lifecycle suite is GREEN (10/10, 66s) on .228:** 1. ✅ **Stop ignored per-app grace** (`podman stop -t 30` spurious 30s timeout) — commit `2dad64b2`. Orchestrator now uses manifest `stop_grace_secs` → `stop_grace_secs_for()` table; deadline = grace + 15s; applied to quadlet stop + API + CLI. 2. ✅ **Reconciler resurrected user-stopped apps** — commit `760a32bc`. The reconcile filter's `dependency_required` override re-included a user-stopped dependency (electrumx ← active mempool), the in-memory `disabled` set is wiped on manifest reload, and the host-port "repair" then restarted the stopped backend within ~8s. Fix: `ensure_running_with_mode` now bails `Left("user-stopped")` when the on-disk `user_stopped` marker is set (the single choke point all reconcile flows through); install/start clear the marker first so user actions are unaffected. 3. ✅ **container-list reported user-stopped apps as `running`** — commit `6e49ce6f`. The backend was Exited but its UI companion (electrs-ui/bitcoin-ui/…) kept serving the launch port, and the state-refresh upgraded any reachable launch port to `running`. Fix: `handle_container_list` forces `stopped` for `user_stopped` apps before the launch-port refresh. **Earlier theories now RESOLVED/superseded:** "fedimint crash-looping" was **probe-induced churn** — left alone, fedimint is stable (Up 48 min, 0 watchdog restarts/30 min); its restarts during testing were the host-port watchdog firing while I rapid-cycled stop/start (fixed by #2). "Exited→Stopped key mismatch" was actually the live-UI-companion launch-port issue (#3). "Grace vs gate-timeout" (electrumx 300s) was moot — a healthy electrumx honours SIGQUIT and stops in <1s. **TWO-NODE GATE RESULT (1×, DESTRUCTIVE, both with the 3-fix binary):** - **.228: 104/110.** All previously-failing `package.stop` tests now PASS (bitcoin/btcpay/electrumx/ fedimint/immich). Remaining 6: test 31 (companion recreate), 44 (fedimint orphan — probe pollution), 55 (immich restart timing), 83 (bitcoin not archival-synced), 94/99 (endpoint/lnd-proxy cascade from 83). - **.198: 94/110.** **14 of 16 failures are one root cause: bitcoin is in IBD** (test 83 says `blocks=817652 headers=954850` — ~137k behind). Everything chained to bitcoin cascades: lnd (16,85), btcpay (22,23,103), electrumx (37), mempool stack (71,72,73,101), endpoints (94), bitcoin.getinfo (7,12). The other 2 are node-independent: **31** (companion recreate) and **44** (fedimint orphan pollution). **CONCLUSION: the lifecycle-stop blocker is FIXED and validated on both nodes.** The residual red is NOT lifecycle bugs — it is (a) **bitcoin still syncing (IBD)** on the test nodes [test 83 is an explicit precondition; nothing electrumx/lnd/btcpay/mempool can pass until it finishes], (b) **.228 plain-podman contamination** (my cascade-gate), and (c) two minor items: **test 31** companion-unit recreate (both nodes — likely the 90s window vs reconcile tick + image step; investigate) and **test 44** orphan fedimint container left by my probing. **EVERY gate failure is now FIXED or explained — NO lifecycle code bugs remain.** Final read: - ✅ `package.stop` (the blocker): 3 bugs fixed (`2dad64b2`/`760a32bc`/`6e49ce6f`), green both nodes. - **bitcoin-IBD cascade** (most of .198's red): environmental — bitcoin syncing (test 83 precondition). - **test 31** companion-recreate: NOT a product bug. Two things: (a) **FIXED** — the companion reconcile stage was gated behind the slow per-app pass; now it runs on its own ~30s loop (`452f05d8`). Validated on .228 with the new binary: a deleted `archy-electrs-ui` unit self-heals in **~10s** (was stuck 100s+), journal: `companion not active, repairing → wrote quadlet unit → companion started`. (b) **HARNESS CAVEAT** — the companion-survives bats does LOCAL `rm`/`systemctl --user` (no ssh), so running the gate from .116 against a remote node actually tests **.116's** companions with **.116's** (old) binary, not the RPC target. ⇒ the companion-survives suite must be run ON the target node (or with the new binary on .116) to be meaningful. This explains the "failed on both nodes" runs — both were silently testing .116. - **test 55** immich restart: NOT a bug — the heavy 3-container stack (postgres+redis+server) restarts in >120s under load; immich DOES return to running. *Optional:* bump the immich restart wait. - **test 44** fedimint orphan: my probe pollution; a teardown clears it. **To reach a literally-green 5× gate (now infra/node-prep + minor test-window tuning, not lifecycle code):** 1. Let bitcoin finish IBD on a test node (or point the gate at an archival-synced bitcoin). 2. Re-quadletize .228 (reinstall its backends so `.container` units regenerate, matching .198). electrumx done; bitcoin/btcpay/fedimint/immich/etc. remain. (Most backends ARE in manifest_ids already; this is about regenerating quadlet units + clearing adopted plain-podman state.) 3. Optional: faster companion-reconcile cadence (test 31) + longer immich-restart wait (test 55) + clear the test-44 orphan — or simply run the gate on a less-loaded, bitcoin-synced node. 4. ✅ **test 31 ROOT-CAUSED = contamination + load (NOT a product bug).** `companion::reconcile` only recreates a deleted companion unit (e.g. `archy-electrs-ui`) when its PARENT backend (electrumx) is in `manifest_ids`. On contaminated .228 electrumx ran as plain podman and was NOT a tracked manifest install (its `/opt/.../electrumx/manifest.yml` exists on disk but wasn't loaded), so the reconciler never iterated it → companion orphaned. **Proven fix:** `package.install electrumx` re-registered it (now `reconcile action app_id=electrumx` fires) AND restored the companion (unit present, service active). The companion self-heal logic is correct. ⇒ test 31 clears once .228 is re-quadletized (step 2). electrumx on .228 is now de-contaminated. Still: clear test-44 orphans. 4. Then run `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1` on the synced+quadlet node, then the other. **Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/ indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain; `bitcoin-core.container` is `.disabled-20260506`) **because my cascade-gate uninstalled its apps and my `package.start` restore recreated them as bare `podman run --restart=unless-stopped`** without regenerating units. Two related hardening items: (a) `package.start` should regenerate a missing quadlet unit, not fall back to bare podman; (b) re-survey the status doc's "Quadlet-everywhere ~96%" from `.container`-file presence + `PODMAN_SYSTEMD_UNIT`, not from "container running". The **stop→stopped STATE reporting is correct** once the container actually stops (server.rs:1334 keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard — proven on filebrowser); the bug is purely "container never stops", not "state not reported". ### MY-SESSION ERRATA (own it on resume) - I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that is `ARCHY_ALLOW_DESTRUCTIVE=1` only — stop/start/restart, no uninstall/reinstall; see run-gate.sh "Suggested release-gate invocation"). Cascade ran uninstall/reinstall on every app and, when I killed the run mid-iteration, left bitcoin-knots/electrumx/btcpay/fedimint/immich uninstalled or stranded. **I fully restored .228** (reinstalled bitcoin-knots with the correct image `146.59.87.168:3000/lfg2025/bitcoin-knots:latest`; started the rest; cleared a stale `user-stopped.json`). Verified healthy: UI 200, 35 containers, 17 apps `running`. - Reinstall gotcha: `package.install` needs a REAL image ref in `dockerImage`; a bare app name → `Invalid Docker image format`. ### NEXT STEPS (in order) — SINGLE-NODE (.228) criterion 1. ✅ **DONE** — 4 stop/reconcile bugs fixed + deployed (`2dad64b2` grace, `760a32bc` reconcile-resurrection guard, `6e49ce6f` container-list user-stopped, `452f05d8` companion cadence). Plus test-harness fixes (lnd/immich/fedimint/NPM readiness + config). 2. ✅ **DONE** — gate run **ON .228** (synced bitcoin): **110/110 GREEN** (1×). Key lesson: **run the gate on the node**, not via RPC from .116 (local podman/systemctl/bitcoin probes). 3. ◧ **5× run on .228 in progress** (`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`, on the node). 5 consecutive clean iterations = the single-node gate criterion → demote the banner. 4. **netbird migration (#20 phase 4)** — the one real migration left; assess setup steps first (TLS cert gen, config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is install_netbird_stack in stacks.rs). Then btcpay/mempool stack polish. 5. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman. **Multinode / fleet (.198 + the rest) → `docs/multinode-testing-plan.md` (separate, after .228 green).** Carry-over notes for that plan: .198 bitcoin was mid-IBD; the lnd `/app/lnd/` nginx proxy had a stale `8081` target on .228 (repo code is correct at 18083 — re-check on other nodes). ### KNOWN ISSUES / WATCH-OUTS - **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates containers it deems unhealthy; under load, false-failing health checks → churn. The tcp-health fix (`e2a012d0`) mitigated the frontend case. If the lifecycle gate churns on .198, look for other apps whose http health checks false-fail under load → prefer tcp. - **Many concurrent SSH sessions to .198 wedge its sshd** (MaxStartups) — it pings but SSH hangs for minutes. Use ONE ssh at a time to .198; `pkill -f 192.168.1.198` to clear strays. - Hook `exec` only works in the scoped form (committed). `copy_from_host` is direct `cp`. ### DEPLOY / VERIFY FACTS (both nodes, ISO Debian, glibc 2.41 — binary built on .116 runs on both) - **Build:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago` (~12 min, opt-level=3). Binary at `core/target/release/archipelago`. Linker "undefined hidden symbol" → rebuild with CARGO_INCREMENTAL=0. `archipelago` is a bin-only crate (no lib). Filtered tests: `cargo test -p archipelago --bin archipelago -- hooks quadlet`. - **Sideload:** `scp binary $H:/tmp/archipelago-new` → `sudo systemctl stop archipelago; sudo cp /tmp/archipelago-new /usr/local/bin/archipelago; sudo chmod +x …; sudo systemctl start archipelago`. Containers SURVIVE the restart (--restart unless-stopped + podman-restart.service). Binary path is /usr/local/bin/archipelago. - **Manifests** live at /opt/archipelago/apps//manifest.yml (root-owned ok). The orchestrator CACHES them at startup → **edit on disk then RESTART archipelago to reload**. Bulk deploy: `tar czf t.tgz -C apps indeedhub indeedhub-postgres indeedhub-redis indeedhub-minio indeedhub-relay indeedhub-api indeedhub-ffmpeg`; scp; `sudo tar xzf t.tgz -C /opt/archipelago/apps`. - **Nodes:** .228 = 192.168.1.228, SSH pw `archipelago`, RPC/UI pw `password123` (https). .198 = 192.168.1.198, SSH pw `archipelago`, **RPC/UI pw `ThisIsWeb54321@`** (https). Both have the 7-container indeedhub stack + secrets + named volumes pre-existing. - **Trigger install via RPC:** `auth.login` (sets session+csrf cookies) → send the csrf cookie value as `X-CSRF-Token` header → `package.install` with params `{"id":"indeedhub","dockerImage":""}` (dockerImage required even for stacks; install is async → returns `{"status":"installing"}`). install logs go to /var/log/archipelago/container-installs.log (best-effort) AND journalctl -u archipelago. - **Fresh-create test recipe:** `podman rm -f indeedhub` (stateless frontend) → package.install indeedhub → expect install_fresh + post_install hook (all 4 steps `ok`) + UI 200 on :7778 (/ , /nostr-provider.js, /api/). On adoption the frontend is NoOp (hook does NOT run — install_fresh is the only hook trigger). ## 9. Documentation map (what survives) This master plan is the hub. Authoritative standalone docs (linked above), kept: - **Design:** `architecture.md`, `app-developer-guide.md`, `APP-PACKAGING-MIGRATION-PLAN.md`, `registry-manifest-design.md`, `marketplace-protocol.md`, `dht-distribution-design.md`, `multi-node-architecture.md`, `rust-orchestrator-migration.md`, `bulletproof-containers.md`, `three-mode-ui-design.md`, `dual-ecash-design.md`, `meshroller-integration-design.md`, `phase4-streaming-ecash-plan.md`, `adr/*`. - **Reference:** `app-manifest-spec.md`, `api-reference.md`, `developer-guide.md`, `operations-runbook.md`, `troubleshooting.md`, `user-walkthrough.md`, `bitcoin-rpc-relay.md`, `security-code-audit-2026-03.md`, `GAMEPAD-NAV.md`, `SEED-VERIFICATION.md`, `hotfix-process.md`, `app-registry-status-2026-06-21.md`. All dated handoffs/resumes/transcripts/superseded trackers were consolidated here and removed (recoverable via git) on 2026-06-21. ## 10. Backlog — investigate frontend state management (2026-06-23) **Investigate adopting a real client-state/data-fetching layer for `neode-ui`** instead of the current hand-rolled Pinia stores + ad-hoc fetch/poll patterns. Motivation: lifecycle/UX bugs like the stuck "full-red" install/uninstall progress bar and ghost **My Apps** entries (see §6c) are partly a *state-sync* problem — the UI's view of package state drifts from the backend and isn't reliably invalidated/refetched. A principled query/cache layer (request dedup, background refetch, cache invalidation on mutation, optimistic updates, retry/stale handling) would make these classes of bug structurally hard. **Research → recommend → (maybe) adopt:** - Evaluate **TanStack Query** (Vue Query) as the leading candidate, plus alternatives (Pinia Colada, vue-query alternatives, plain Pinia + a disciplined invalidation layer, or an SSE/WebSocket push model for package-state events instead of polling). - Criteria: fit with the existing Pinia/RPC architecture, bundle-size cost, offline/PWA behaviour, how cleanly it models long-running mutations (install/uninstall with progress), and whether a push channel for package-state changes is the better root-cause fix. - Deliverable: a short design note + a recommendation, then a scoped migration of the package-lifecycle surfaces (My Apps / install / uninstall / update progress) as the proof case — sequence AFTER workstream F (it informs F's progress-UI fix and vice-versa).