# PRODUCTION MASTER PLAN — Archipelago App Platform & Registry

> **✅ SINGLE-NODE PRODUCTION GATE IS GREEN (2026-06-23): `run-gate.sh` 5/5 on .228, 0 failures.**
> This remains the authoritative plan for the broader north star (manifest-driven
> platform, registry-distributed manifests, external marketplace), but it is no
> longer a hard priority banner blocking all other work. Remaining workstreams are
> in §6 / §8b. Next exit-criteria: multinode (`docs/multinode-testing-plan.md`) +
> workstreams B/C/D.
>
> Last updated: 2026-06-26 · zombie-container guard + gitea launch-port fix shipped, binary `040df5ce` rolled to the fleet (see §8b SESSION h). Prior: orchestrator Fix A+B (`a721532f`/`e0343137`) deployed + proven.

---

## 1. The North Star

Make Archipelago a **world-class, developer-ready app platform** where:

1. **Every app is manifest-driven** — install/run/update/uninstall needs only the app's manifest (+ catalog entry). **Zero OS-level code reliance**: no per-app Rust installers, no `sudo mkdir/chown`, no host provisioning.
2. **Manifests are distributed via the (signed) registry**, not baked into the binary OTA as disk files. Bumping/adding an app = a signed catalog change.
3. **Third-party developers can build and ship apps via an external registry** — a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation), not a gatekept central store. `archy app validate/render/install/test` tooling.
4. The platform stays **rootless, secure-by-default, elegant, robust, and 100%-uptime-capable** (reboot-survivable, self-healing, no data loss on migrate).

**Definition of done:** the production test gate (§5) is green for the app set on real nodes. Until then, this plan is the priority.

## 2. Invariants (never violate)

- **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged containers unless explicitly approved. (ADR-001, ADR-009.)
- **No app-specific business logic in the Rust backend.** The orchestrator owns the lifecycle state machine; apps are declarative. Legacy `install_immich_stack` (hardcoded `podman run` + `sudo chown`) is the anti-pattern being deleted.
- **Secrets are manifest-declared** (`generated_secrets`, materialised by `container::secrets` 0600/rootless, idempotent + self-healing) — never hardcoded, per-app, or logged. Replaces the deleted `ensure_fmcd_password`.
- **Migrations never destroy data.** Preserve `/var/lib/archipelago/`, generated secrets, displayed credentials, public ports, and adoption container names. Always provide a rollback path. Stop/recreate only when necessary.
- **Verify on the real node .228 before any tag.** (Fleet/multinode verification is a separate pass → `docs/multinode-testing-plan.md`.)

## 3. Current state (2026-06-21)

- **~40 apps are manifest-based and Quadlet-migrated** (survive `archipelago.service` restart + reboot). Exhaustive per-app table: `docs/app-registry-status-2026-06-21.md`.
- **Legacy holdout: immich** — the one app with **no manifest** and a hardcoded Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data. The migration proof case.
- **Manifests still travel by OTA disk rsync** (`apps/ → /opt/archipelago/apps`). The signed catalog (`app-catalog.json`) currently distributes **only image overrides** — not full manifests. Gap closed by workstream B.
- **The 4 companions** (`archy-bitcoin-ui`, `-lnd-ui`, `-electrs-ui`, `-fedimint-ui`) build from `docker/` contexts via `companion.rs`, not the manifest registry — a later phase folds them in.
- **No app has passed the formal production gate.** That is the blocker.

## 4. Workstreams (each links its authoritative detail doc)

| # | Workstream | Detail doc | Status |
|---|-----------|-----------|--------|
| A | **Manifest-driven app platform** — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | `APP-PACKAGING-MIGRATION-PLAN.md` | mostly done; immich + multi-container polish remain |
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0–2 code-complete (worktree) |
| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — but this is DESTRUCTIVE-tier / ~8 core apps only; see §6c for the coverage gaps |
| F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **IN PROGRESS (2026-06-26)** — root bug FIXED: uninstall could hang → ghost/stuck-bar/reinstall-block (`71cc9ac4`, unbounded systemctl/podman in `quadlet::disable_remove`); `cascade-uninstall.bats` **7/7 green on .228** w/ binary `ae349a75`. Remaining: wire CASCADE into the canonical gate run, progress-UI truthfulness, all-apps matrix, guardian/IBD state. |

**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md` (ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption scan, Quadlet rendering) and `bulletproof-containers.md` (the six container failure modes FM1–FM6 + the desired-state-first reconciler that fixes them).

## 5. Production test gate (exit criterion)

An app is **production-ready** only when `tests/lifecycle/run-gate.sh` is green across the full matrix — install / UI-reachable / stop / start / restart / reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall — **5× on .228** (`ARCHY_ITERATIONS=5`). **The gate runs ON the node** (it uses local podman/systemctl/bitcoin probes; running it via RPC from another host silently tests the runner).

**Multinode / fleet verification (.198 + others) is a SEPARATE plan — `docs/multinode-testing-plan.md` — NOT part of this single-node criterion.**

Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.

> ⚠️ **The 2026-06-23 5×-green is NOT the full bar.** `run-gate.sh` runs only the
> **DESTRUCTIVE tier** (stop/start/restart/survive) over ~8 core apps; it **skips
> uninstall/reinstall** (CASCADE is gated behind `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`,
> never set by the gate) and tests no install/uninstall **progress UI**. Real
> uninstall/reinstall/progress bugs (immich + grafana) were found in manual testing
> right after — see **§6c (workstream F)** for the gap and the expanded-gate plan.
> The true "every app, fully" criterion is F's definition-of-done, not this run.

## 6. Immediate sequence (live workstream)

1. ✅ **B-phase 1** — `manifest` field on `AppCatalogEntry`; `load_manifests` catalog-wins merge; `manifest_dir` kept (build-source catalog manifests skipped in phase 1); unit tests.
   *(commit 220666d3)*
2. ✅ **B-phase 2** — `EMBED_MANIFESTS` publisher generator + round-trip guard. *(7bfbe8fe; signing via existing ceremony — not yet flipped on for the fleet.)*
3. ✅ **C immich proof** — immich is a manifest-driven stack (immich + immich-postgres + immich-redis) installed via `install_stack_via_orchestrator`; legacy installer is now fallback-only. Live-migrated + verified on .228. Found+fixed: container_name duplicate-on-shared-PGDATA, version-digit validation, partial-fallback hardening, data_uid 100998. Canonical app_id `immich` (title+icon). *(9e6c5370, d5ef4573)*
4. ✅ **Reboot-survival** — podman-restart.service enabled (startup, fleet-wide) for the podman-`--restart` path. *(f160e0c4)*
5. ✅ **E** — 5× gate on **.228** (`ARCHY_ITERATIONS=5`) is **GREEN: 5/5, 0 not-ok** (2026-06-23). Two real orchestrator bugs were found + fixed en route (package.stop per-app grace; package.restart phantom stack-member injection → `order_present_containers`, commit 92d7f52d) plus two single-shot-read probes hardened (bitcoin-knots state, immich lan_address). The single-node criterion is met.
6. ✅ Banner demoted (this doc, 2026-06-23). Next: multinode pass + workstreams B/C/D.

**Multinode / fleet verification (.198 and the rest) is split into its own plan:** `docs/multinode-testing-plan.md`. Do it AFTER the .228 single-node gate is green.

**Not yet done / deliberate follow-ups:** flip `EMBED_MANIFESTS` on for the published catalog (then sign) to actually distribute manifests via the registry; Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not just podman-`--restart`).

## 6b. Post-deploy task order (agreed 2026-06-23)

After the 2026-06-23 multinode test deploy (latest backend + UX frontend to .116/.198/.228 + Tailscale testers), do these IN ORDER:

1. **netbird #20 ph4** — the last real manifest migration (workstream A).
2. **Phase-3 `use_quadlet_backends`** — orchestrator backends become Quadlet units.
3. **§6c Lifecycle perfection** (workstream F) — the comprehensive uninstall/reinstall + progress-UI + all-apps gate expansion below.

## 6c. Lifecycle perfection — what "green" MISSED (workstream F, the perfection bar)

**Why this exists:** the 2026-06-23 single-node gate went 5×-green but is **NOT** the "every app fully lifecycle-tested" guarantee a user reasonably assumes. The canonical gate (`run-gate.sh`) only runs the **DESTRUCTIVE tier** (stop / start / restart / survive) over **~8 core apps** (bitcoin-knots, btcpay, electrumx, lnd, mempool, immich, fedimint, filebrowser). It explicitly **SKIPS uninstall/reinstall** (the CASCADE tier is gated behind `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`, which `run-gate.sh` never sets) and has **zero coverage** for the other ~30 apps (grafana, jellyfin, vaultwarden, penpot, nextcloud, photoprism, uptime-kuma, homeassistant, … — see `app-registry-status-2026-06-21.md`). So uninstall, reinstall, install-progress UI, and most apps were never under test.

**Real bugs found in manual multinode testing on .198 (2026-06-23) — the motivating evidence:**

- **Uninstall is broken for immich + grafana:** takes very long, the progress bar sits at a **solid full-red with no real progression**, and the app **does not actually uninstall** — it still appears in **My Apps** afterward (ghost entry / state not cleared).
- **grafana reinstall just stops** partway (no completion, no clear error).
- **fedimint guardian** suddenly showed **"starting up — Guardian opens a wait page until Bitcoin finishes initial sync" / "starting"** on that node — verify this is correct wait-for-IBD behavior vs a stuck/false state (it's a backend that depends on bitcoin sync).

**✅ 2026-06-26 — root cause of the immich/grafana uninstall trio FOUND + FIXED (`71cc9ac4`).** Single cause: `quadlet::disable_remove()` (first op in uninstall teardown, via companion + orchestrator) ran `systemctl --user stop` / `daemon-reload` / `podman rm -f` with **no timeout**.
On rootless podman a generated unit can wedge "deactivating" while podman hangs → `systemctl stop` blocks forever → the spawned uninstall task returns neither Ok nor Err, so (a) `set_uninstall_stage` never fires → **frozen full-red bar**, (b) `remove_package_state_entry` never runs → **ghost stuck in `Removing`**, (c) the install guard rejects reinstall (`already Removing`). The spawn wrapper already reverts state on Err/removes on Ok — only a *hang* stranded it. Fix bounds all three calls (stop→`QUADLET_STOP_TIMEOUT` + SIGKILL/reset-failed escalation; daemon-reload→30s; podman rm→timeout). **Validated live: `cascade-uninstall.bats` 7/7 on .228** (binary `ae349a75`) — grafana install → uninstall (no ghost, data dir gone) → reinstall → running → cleanup. NOTE: proves the happy path + no-regression; the original hang was load/timing-induced and not separately reproduced.

**Workstream F scope — the gate must grow to (in priority order):**

1. **CASCADE tier in the canonical gate:** uninstall → verify the app is GONE from My Apps / `container-list` / package state (no ghost), data preserved per policy, then reinstall → verify it returns healthy. Catch the immich/grafana ghost + reinstall-stops bugs. *(Test EXISTS + passes — `bats/cascade-uninstall.bats`, gated on `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`; `run-gate.sh` still never sets it. DECISION PENDING: run a single cascade pass alongside the 5× destructive loop vs. a dedicated cascade gate — do NOT fold uninstall/reinstall into all 5 iterations, it balloons runtime.)*
2. **Progress-UI assertions:** install AND uninstall must report monotonic, truthful progress (not a stuck full-red bar); a long op must surface a real stage/percentage and a terminal success/failure — no silent hang. (Likely both a backend progress-event fix AND a UI fix.)
3. **ALL-apps coverage:** a generic per-app lifecycle matrix (install / UI-reach / stop / start / restart / uninstall / reinstall / reboot-survive) driven by the manifest set, so grafana and the ~30 uncovered apps are gated too — not just the 8 core. Manifest-driven, so new apps are covered automatically.
4. **Guardian/IBD-dependent states:** assert that "waiting for bitcoin sync"-style states are a legitimate, surfaced wait (with a path to ready) and never a permanent stuck state.

**Definition of done for F:** the expanded gate (CASCADE + progress + all-apps) is 5×-green on .228, then re-verified across the multinode fleet — i.e. an *insanely-perfect* OS/container environment where every app installs, runs, updates, uninstalls, and reinstalls cleanly with honest progress, no ghosts, no data loss, reboot-survivable.

## 7. Release blockers & operational gotchas (durable)

Carried forward from prior handoffs (deduped against persistent memory):

- **Rootless control-plane responsiveness** — slow `podman ps`/store cleanup at startup must not surface a false "no apps installed" UI. **My Apps must preserve last-known apps during scanner backoff**, never show empty during a transient.
- **Reboot survival** — gate on ≥3 (prefer 5) consecutive clean post-reboot lifecycle passes. Quadlet units under `user.slice` survive `archipelago.service` restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
- **Startup patterns** — wait on a socket/health, never `sleep`. Tailscale waits for its socket; Fedimint Guardian waits for Bitcoin RPC `initialblockdownload:false` before launching fedimintd (proxy/wait companion on :8175 during IBD).
- **Bitcoin must run full** (`txindex=1`, non-pruned) for ElectrumX/mempool.
- **Adoption** — match existing containers by name and adopt without recreate; record a migration version in app state; preserve Nostr signer bridges (IndeeHub needs `/nostr-provider.js` served, not just port reachability).
- **Image presence** — use bounded targeted `podman image inspect`, not `podman image exists` (avoids store-walk stalls).
- **Companion rebuilds** — `companion.rs` must rebuild `:latest` when the build context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never reach nodes. `:local` is a manual override, never auto-rebuilt.

## 8. Roadmap

**Pipeline:** Feature Testing (internal) → User Testing (controlled hardware) → Beta Live (public).

Hardening priorities feeding the gate:

- **P0** Container app reliability — bulletproof install/health/restart/uninstall across all apps, dependency chains, multi-container stacks.
- **P0** Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor hidden services, LND Connect).
- **P1** LUKS2 full-partition encryption for `/var/lib/archipelago/` (AES-256-XTS, Argon2id, key from setup password + hardware salt).
- **P1** Meshtastic plug-and-play parity with MeshCore.
- **P1 ✅ CODE-COMPLETE** (branch `companion-mobile-ux`, 2026-06-23; needs on-device + mobile-web verification before merge to `main`) — Mobile app-launch UX — drop the "this app opens in a tab" interstitial. Two surfaces (both: no interstitial screen, launch the app directly):
  - **Companion app (Android):** open **every** app in the **in-app WebView** (not just non-iframeable ones) — *and* carry the current mobile-iframe footer controls into the WebView (back/forward/reload/close — good, useful UX).
  - **Mobile web browser (PWA):** open tab-apps directly in a **new browser tab**.

  Touch points: `neode-ui/src/stores/appLauncher.ts`, `AppLauncherOverlay.vue`, the Android in-app WebView bridge, and the mesh-mobile iframe footer controls. (Reference prior work: `b5a9deb8` in-app webview for non-iframeable apps, `d1fbcd9b` "open in browser" via native bridge.)

  - **✅ Done (branch `companion-mobile-ux`):** mobile launches now use the store-driven panel (no route push) so the background tab no longer changes and closing returns you where you launched; tab-only apps open directly (in-app WebView on companion via `openInApp`, new browser tab on PWA) with **no interstitial**; the Android `InAppBrowser` (`WebViewScreen.kt`) gained a bottom footer bar (back/forward/reload/open-in-browser/close) + a centered loading screen (favicon + progress); a shared `AppLoadingScreen` (icon + progress) replaced the black/spinner loaders on the app session **and** legacy iframe overlay; the dashboard is pinned to `100dvh` on mobile so the mesh chat/tools panes stop sliding under the tab bar in mobile browsers (no-op in companion); ElectrumX shows its real icon in My Apps. Companion APK bumped to **v0.4.7** (versionCode 11) with a committed shared debug keystore so updates install without an uninstall. **Not yet:** merge to `main`; publish the 0.4.7 companion download (deferred until the gate work lands so they ship together).

**Post-beta (deferred — do not start until gate is green):** P2P encrypted voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.md`); Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash phases 2–6 (`dual-ecash-design.md`).

## 8b. SESSION STATE + RESUME (updated 2026-06-26) — READ §8b "CURRENT STATE + RESUME" FIRST

### ▶ SESSION h (2026-06-26) — LATEST, RESUME FROM HERE

**Canonical resume detail: memory `project_session_resume_2026_06_23b` (▶️ top of MEMORY.md).** Local main = `670ebb06` (3 commits past the previously-pushed `43e70049`: `0a8db904` zombie guard + `670ebb06` gitea launch-port fix; `43e70049` webview was already pushed).
**Combined release binary `040df5ce2551d17b` rolled to the fleet.** Binary+FE not in git — rebuild on a fresh machine (`cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`).

**DONE this session:**

1. ✅ **Zombie-container guard** (`0a8db904`) — the reconciler's Running branch now verifies a container's `State.Pid` is alive (`/proc/<pid>` exists) before trusting podman's "Up"; on a concrete dead PID it stops, removes, and runs `install_fresh` from the manifest. Conservative: any uncertainty (inspect fail / unparseable PID) assumes alive, so a transient hiccup never destroys a healthy container. Fixes the class that broke NetBird login on .228 (dashboard "Up" w/ dead PID → proxy 502, no host port → reconciler never recovered it). Unit test + **live-proven on .228**: synthetic zombie on `jellyfin` (killed conmon+PID → podman still "Up") → guard logged `…process is dead (zombie) — recreating app_id=jellyfin` → recreated → settled to NoOp. **Zero false-positives across the other 33 healthy containers.**
2. ✅ **Gitea launch-port fix** (`670ebb06`) — gitea launched at **:2222 (SSH)** instead of **:3001 (web)** on nodes without the gitea manifest on disk (`manifest_lan_address_for` returns None → fell through to `extract_lan_address`, which returns podman's first-listed port; podman lists `2222->22` before `3001->3000`). Added `"gitea" => http://localhost:3001` to the static `lan_address_for` map (`core/container/src/podman_client.rs`) like every other core app. Reported on tailscale node **100.82.34.38** — that node still needs the new binary (or a refreshed gitea manifest) to pick it up.
3. ✅ **Rolled `040df5ce`** to .228/.116/.198/.89 (verified sha+active); .88/.5/.120 rolling.

**OPEN follow-ups (logged, NOT regressions):**

- **mempool env-drift recreate-loop on .228** — reconciler logs `container env drift detected — recreating app_id=mempool` every ~30-90s, never converges (pre-existing; the known mempool nginx stale-IP class, [[project_mempool_nginx_stale_ip_fix]]).
mempool stays running but churns. - **nostr-rs-relay** stuck "Stopping" + ~2s create-loop on .228 (from session g). **NEXT:** finish .88/.5/.120 roll → push main to gitea-vps2 → Phase-3 quadlet / Workstream F / multinode. SSH/sudo pw `ThisIsWeb54321@` (**.88 = `ThisIsWeb54321!`**); UI/RPC .228/.198 = `ThisIsWeb54321@`. Reusable tooling in scratchpad: `deploy-bin.sh`/`remote-apply.sh` (EXPECT_SHA = `040df5ce…`), `rpc.sh`. --- ### ▶ SESSION g (2026-06-25) — earlier, historical **Canonical resume detail: memory `project_session_resume_2026_06_23b` + `project_netbird_ph4_legacy_deletion_map` + `project_workstream_f_lifecycle_perfection`.** `gitea-vps2/main = a721532f` (pushed). **Local main = `89d397bb`** (2 new commits this session, NOT pushed/deployed: `41e7f500` harness tolerance + `89d397bb` netbird ph4 legacy delete). Binary+FE are NOT in git — rebuild on a fresh machine. **TL;DR (SESSION g, 2026-06-25) — everything below DONE this session:** 1. ✅ **Rolled** `e0343137` + fresh FE (`index-a75rd6Hy.js`) to **7 nodes** (.116/.198/.228/.89/.88/.5/.120), all verified. **.15 SKIPPED** (auth rejected — creds don't match). 2. ✅ **Harness tolerance fixes COMMITTED** `41e7f500` (run-gate settle/immich + immich.bats 90s + mempool.bats poll). 3. ✅ **mempool RESOLVED** fleet-wide — see mempool note below. 4. ✅ **netbird #20 ph4 DONE** — legacy Rust installer DELETED, committed `89d397bb` (492 lines gone, manifest-driven only, `cargo check` clean). Release binary BUILDING for the .228 live-verify (build left running — check after). **NEXT (resume here):** (a) check the release build, deploy the `89d397bb` binary to .228, live-verify netbird adopts via manifest (https:8087→200, no `bail!`); (b) roll `89d397bb` to the rest of the fleet (behavior-neutral — manifest path already executed); (c) **push local main → gitea-vps2** (2 commits ahead); then **Phase-3 `use_quadlet_backends` → Workstream F → multinode**. 
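The "verified sha+active" check used when rolling a binary can be sketched as a pure function. This is a minimal sketch, not the contents of `deploy-bin.sh` or `remote-apply.sh`, which are not shown in this document. The names `RollCheck` and `check_roll` are hypothetical, and matching `EXPECT_SHA` as a prefix of the node's reported hash is an assumption. The only external fact relied on is that `systemctl is-active` prints `active` for a running unit.

```rust
/// Outcome of checking one node after a binary roll (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum RollCheck {
    /// Reported hash matches the expectation and the service is active.
    Verified,
    /// The node is running some other binary; carries what it reported.
    ShaMismatch { reported: String },
    /// Right binary, but the service is not in the "active" state.
    ServiceInactive,
}

/// Classify a node from the hash it reports and the text printed by
/// `systemctl is-active <unit>`.
/// ASSUMPTION: `expect_sha` may be a short prefix (e.g. "040df5ce") of the
/// full reported hash; the real tooling may compare differently.
fn check_roll(expect_sha: &str, reported_sha: &str, service_state: &str) -> RollCheck {
    let expect = expect_sha.trim().to_ascii_lowercase();
    let reported = reported_sha.trim().to_ascii_lowercase();

    // An empty expectation would match every hash, so reject it outright.
    if expect.is_empty() || !reported.starts_with(expect.as_str()) {
        return RollCheck::ShaMismatch { reported };
    }

    // Anything other than exactly "active" (failed, activating, ...) fails.
    if service_state.trim() != "active" {
        return RollCheck::ServiceInactive;
    }

    RollCheck::Verified
}
```

Checking the hash before the service state means a node that is happily running the previous binary is reported as a mismatch rather than as healthy, which is the failure a roll is most likely to hide.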
**ROLL RESULTS (2026-06-25, binary `e0343137b99bf066` + fresh FE bundled):** | Node | Result | |------|--------| | .228 | ✅ already on `e0343137` (prior session, binary-only) | | .116 (local) | ✅ binary + fresh FE; 36 containers survived restart; UI 200; `index-a75rd6Hy.js` live | | .198 (LAN) | ✅ binary + fresh FE; 38 containers up; UI 200 | | .89 (100.89.209.89) | ✅ binary + fresh FE; service active | | .88 (100.70.96.88, pw `ThisIsWeb54321!`) | ✅ binary + fresh FE; service active | | .5 (100.72.136.5) | ⏳ attempted — see resume note (cellular x250) | | .120 (100.66.157.120) | ⏳ attempted — see resume note (cellular x250) | | .15 (100.64.83.15, archy-dev-pa) | ❌ SKIPPED — `archipelago@` + `ThisIsWeb54321@` rejected (`Permission denied (publickey,password)`); node creds unknown | Deploy tooling (reusable): scratchpad `deploy-bin.sh