archy/docs/PRODUCTION-MASTER-PLAN.md
2026-06-21 17:26:23 -04:00

289 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 🚩 PRODUCTION MASTER PLAN — Archipelago App Platform & Registry
> **THIS IS THE AUTHORITATIVE PLAN. Agents: read this first and keep it open until
> the production test gate (§5) is green.** It overrides ad-hoc direction and
> supersedes all prior roadmap/handoff/status docs. When the gate passes, remove
> the priority banner and demote this doc.
>
> Last updated: 2026-06-21 · Binary: v1.7.99-alpha
---
## 1. The North Star
Make Archipelago a **world-class, developer-ready app platform** where:
1. **Every app is manifest-driven** — install/run/update/uninstall needs only the
app's manifest (+ catalog entry). **Zero OS-level code reliance**: no per-app
Rust installers, no `sudo mkdir/chown`, no host provisioning.
2. **Manifests are distributed via the (signed) registry**, not baked into the
binary OTA as disk files. Bumping/adding an app = a signed catalog change.
3. **Third-party developers can build and ship apps via an external registry**
a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation),
not a gatekept central store. `archy app validate/render/install/test` tooling.
4. The platform stays **rootless, secure-by-default, elegant, robust, and
100%-uptime-capable** (reboot-survivable, self-healing, no data loss on migrate).
**Definition of done:** the production test gate (§5) is green for the app set on
real nodes. Until then, this plan is the priority.
## 2. Invariants (never violate)
- **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged
containers unless explicitly approved. (ADR-001, ADR-009.)
- **No app-specific business logic in the Rust backend.** The orchestrator owns
the lifecycle state machine; apps are declarative. Legacy `install_immich_stack`
(hardcoded `podman run` + `sudo chown`) is the anti-pattern being deleted.
- **Secrets are manifest-declared** (`generated_secrets`, materialised by
`container::secrets` 0600/rootless, idempotent + self-healing) — never hardcoded,
per-app, or logged. Replaces the deleted `ensure_fmcd_password`.
- **Migrations never destroy data.** Preserve `/var/lib/archipelago/<app>`,
generated secrets, displayed credentials, public ports, and adoption container
names. Always provide a rollback path. Stop/recreate only when necessary.
- **Verify on a real node (.228, then .198) before any tag.**
## 3. Current state (2026-06-21)
- **~40 apps are manifest-based and Quadlet-migrated** (survive
`archipelago.service` restart + reboot). Exhaustive per-app table:
`docs/app-registry-status-2026-06-21.md`.
- **Legacy holdout: immich** — the one app with **no manifest** and a hardcoded
Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data.
The migration proof case.
- **Manifests still travel by OTA disk rsync** (`apps/ → /opt/archipelago/apps`).
The signed catalog (`app-catalog.json`) currently distributes **only image
overrides** — not full manifests. Gap closed by workstream B.
- **The 4 companions** (`archy-bitcoin-ui`, `-lnd-ui`, `-electrs-ui`,
`-fedimint-ui`) build from `docker/<name>` contexts via `companion.rs`, not the
manifest registry — a later phase folds them in.
- **No app has passed the formal production gate (5× for now, was 20×).** That is the blocker.
## 4. Workstreams (each links its authoritative detail doc)
| # | Workstream | Detail doc | Status |
|---|-----------|-----------|--------|
| A | **Manifest-driven app platform** — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | `APP-PACKAGING-MIGRATION-PLAN.md` | mostly done; immich + multi-container polish remain |
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 02 code-complete (worktree) |
| E | **Production test gate** — 5× lifecycle on .228 + .198 (for now; was 20×), per-app L1/L2 matrix | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **never green — exit criterion** |
**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
scan, Quadlet rendering) and `bulletproof-containers.md` (the six container failure
modes FM1FM6 + the desired-state-first reconciler that fixes them).
## 5. Production test gate (exit criterion)
An app is **production-ready** only when `tests/lifecycle/run-20x.sh` is green
across the full matrix — install / UI-reachable / stop / start / restart /
reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall —
**5× on .228 AND .198 for now** (`ARCHY_ITERATIONS=5`; temporarily reduced from
20× — restore to 20× before the final ship). All 8 gate checkboxes in `tests/lifecycle/TESTING.md`
are currently unchecked. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps,
L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.
## 6. Immediate sequence (live workstream)
1.**B-phase 1**`manifest` field on `AppCatalogEntry`; `load_manifests`
catalog-wins merge; `manifest_dir` kept (build-source catalog manifests skipped
in phase 1); unit tests. *(commit 220666d3)*
2.**B-phase 2**`EMBED_MANIFESTS` publisher generator + round-trip guard.
*(7bfbe8fe; signing via existing ceremony — not yet flipped on for the fleet.)*
3.**C immich proof** — immich is a manifest-driven stack (immich + immich-postgres
+ immich-redis) installed via `install_stack_via_orchestrator`; legacy installer
is now fallback-only. Live-migrated + verified on .228. Found+fixed: container_name
duplicate-on-shared-PGDATA, version-digit validation, partial-fallback hardening,
data_uid 100998. Canonical app_id `immich` (title+icon). *(9e6c5370, d5ef4573)*
4.**Reboot-survival** — podman-restart.service enabled (startup, fleet-wide)
for the podman-`--restart` path. *(f160e0c4)*
5.**Verify on .198** (immich migration validated on .228 only so far).
6.**E** — run the 5× gate (`ARCHY_ITERATIONS=5`, was 20×); fix until green.
7. ◻ Demote this banner.
**Not yet done / deliberate follow-ups:** flip `EMBED_MANIFESTS` on for the
published catalog (then sign) to actually distribute manifests via the registry;
Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not
just podman-`--restart`); immich on .198.
## 7. Release blockers & operational gotchas (durable)
Carried forward from prior handoffs (deduped against persistent memory):
- **Rootless control-plane responsiveness** — slow `podman ps`/store cleanup at
startup must not surface a false "no apps installed" UI. **My Apps must preserve
last-known apps during scanner backoff**, never show empty during a transient.
- **Reboot survival** — gate on ≥3 (prefer 5) consecutive clean post-reboot
lifecycle passes. Quadlet units under `user.slice` survive `archipelago.service`
restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
- **Startup patterns** — wait on a socket/health, never `sleep`. Tailscale waits
for its socket; Fedimint Guardian waits for Bitcoin RPC `initialblockdownload:false`
before launching fedimintd (proxy/wait companion on :8175 during IBD).
- **Bitcoin must run full** (`txindex=1`, non-pruned) for ElectrumX/mempool.
- **Adoption** — match existing containers by name and adopt without recreate;
record a migration version in app state; preserve Nostr signer bridges
(IndeeHub needs `/nostr-provider.js` served, not just port reachability).
- **Image presence** — use bounded targeted `podman image inspect`, not
`podman image exists` (avoids store-walk stalls).
- **Companion rebuilds** — `companion.rs` must rebuild `:latest` when the build
context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never
reach nodes. `:local` is a manual override, never auto-rebuilt.
## 8. Roadmap
**Pipeline:** Feature Testing (internal) → User Testing (controlled hardware) →
Beta Live (public). Hardening priorities feeding the gate:
- **P0** Container app reliability — bulletproof install/health/restart/uninstall
across all apps, dependency chains, multi-container stacks.
- **P0** Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor
hidden services, LND Connect).
- **P1** LUKS2 full-partition encryption for `/var/lib/archipelago/`
(AES-256-XTS, Argon2id, key from setup password + hardware salt).
- **P1** Meshtastic plug-and-play parity with MeshCore.
**Post-beta (deferred — do not start until gate is green):** P2P encrypted
voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC
hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.md`);
Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash
phases 26 (`dual-ecash-design.md`).
## 8b. SESSION STATE + RESUME (2026-06-21, live)
**Landed + committed on main this session (newest first):**
- **#20 phase 3 — ADOPTION PATH LIVE-VERIFIED on .228 (2026-06-21).** Built
v1.7.99-alpha, sideloaded binary + 7 manifests, restarted (stop/replace/start —
containers survived via --restart unless-stopped + podman-restart.service). RPC
`package.install indeedhub``complete`, orchestrator-first path adopted all 7
members (`reconcile action app_id=indeedhub-* action=NoOp`), containers stayed
**Up 4 days (NOT recreated)** — zero data/credential disruption. UI green:
frontend :7778 → 200, nostr-provider.js → 200, **/api/ → 200 (proves
network_aliases: frontend nginx `http://api:4000` resolved on indeedhub-net)**.
Fleet healthy (36 containers, none down).
**FRESH-CREATE PATH = FIXED + VERIFIED (2026-06-21).** Deleted the legacy
indeedhub orchestrator special-cases (`b73084db`, 382 lines: reconcile_indeedhub_stack,
start_indeedhub_backends, the 120s dependency-DNS gate, patch_indeedhub_nostr_provider,
etc.) so "indeedhub" flows through the generic install_fresh path. Then two live fixes
on .228: (1) frontend nginx needs `capabilities: [CHOWN,DAC_OVERRIDE,SETGID,SETUID]`
under the orchestrator's --cap-drop=ALL (workers died "setgid(101) failed"); manifest
fix `ff8f11b8`. (2) NOTE: manifest reload needs an archipelago restart (manifests
cached at startup) — a disk manifest edit alone won't take. RESULT: frontend
fresh-creates via install_fresh, caps applied, post_install hook FIRES
(copy_from_host nostr-provider.js ✅), UI 200 (/, /nostr-provider.js, /api/).
**KNOWN GAP (general hook capability, NOT blocking indeedhub):** the post_install
`exec` steps fail via `podman exec` from the archipelago.service systemd cgroup
(`crun: write cgroup.procs: Permission denied / OCI permission denied`). Harmless
here (image bakes the nginx config so the exec steps are no-ops; copy_from_host is
the one that matters and works). FIX = wrap the hook executor's `podman exec` in a
transient user scope (`systemd-run --user --scope`, like `podman_user_scope`) in
core/archipelago/src/container/hooks.rs::run_podman. Do before relying on exec hooks
for an app whose image does NOT pre-bake its mutations.
PRIOR (now resolved) — was: **FRESH-CREATE PATH = BLOCKED (found live 2026-06-21).** Removed the stateless
frontend + reinstalled to exercise install_fresh → it FAILED:
`orchestrator stack install indeedhub failed at app indeedhub: IndeedHub
dependencies were not ready within 120s (indeedhub-api dependency DNS not ready)`,
and the frontend was left down. Recovered manually on .228 (podman run w/ alias
indeedhub on indeedhub-net; UI 200). ROOT CAUSE = hardcoded indeedhub orchestrator
special-cases that predate + conflict with the manifest path:
- prod_orchestrator `ensure_running` ~L1377: `app_id=="indeedhub"`
`reconcile_indeedhub_stack`, which REFUSES manifest creation when the frontend
is absent (returns Left("stack-managed")).
- `run_pre_start_hooks("indeedhub")` ~L2324 → `start_indeedhub_backends`
`wait_for_indeedhub_dependencies_ready(120)` — the gate that blocked install_fresh
(`indeedhub_api_dependency_dns_ready` returns false while the frontend's own alias
is absent + a getent transiently fails).
- also `repair_indeedhub_network_aliases`, `patch_indeedhub_nostr_provider`, the
"frontend did not stay reachable; restart" path (~L2474), `INDEEDHUB_BACKEND_*`
consts, and a crash_recovery.rs indeedhub special-case.
**FIX (next, its own build/deploy/test cycle):** delete these special-cases now
that the manifest carries dependencies/network_aliases/post_install — route
"indeedhub" through the GENERIC install_fresh + reconcile path so the frontend
fresh-creates normally (hook fires). Then re-run the destructive lifecycle on .228
(frontend recreate must succeed + run the hook), then .198, then the gate.
NOTE: .228 currently runs v1.7.99-alpha (these special-cases still present) — the
running stack is fine (adoption NoOp); only a frontend-absent event re-triggers the
bug, and the frontend is up.
- `b1eea8c0` indeedhub (#20) **phase 3 — CODE COMPLETE, unit-tested.** 7 manifests (apps/indeedhub-{postgres,redis,minio,relay,api,
ffmpeg} + apps/indeedhub frontend) + install_indeedhub_stack orchestrator-first
(immich pattern). Data-preserving by construction = ADOPTION on .228: exact live
hyphen container names, named volumes indeedhub-*-data, dedicated indeedhub-net +
network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse live
/var/lib/archipelago/secrets values (ensure_one no-ops on existing). Frontend
carries the post_install nginx hook (replaces patch_indeedhub_nostr_provider;
defensive since indeedhub:1.0.0 already bakes it). .228 GROUND TRUTH captured:
7 containers Up, volumes indeedhub-{postgres,redis,minio,relay}-data, network
indeedhub-net; frontend nginx upstreams api:4000/minio:9000/relay:8080; image
already bakes X-Frame strip + nostr-provider.js (6347B) + sub_filter.
**NEXT = live verify on .228:** build+sideload binary, restart, package.install
indeedhub → expect adoption (NoOp, no data touch), then full lifecycle. Risk:
service restart SIGKILL-cascade if Quadlet not fully shipped on .228.
- `b94b61f6` `network_aliases` manifest field (ContainerConfig) + podman_client &
quadlet rendering + DNS-label validation; also fixed 4 pre-existing from_manifest
test failures (network_policy: archy-net invalid; bind sources outside
/var/lib/archipelago). Enables indeedhub's short aliases on indeedhub-net.
- `955c54b7` hook capability (#20) **phase 2**`container::hooks::run_post_install`
executor (podman exec + copy_from_host w/ allowlist canonicalise + symlink-escape
prefix check; best-effort/idempotent) wired into `install_fresh` after container
is up (fresh-container-only). 5 unit tests; `cargo test -p archipelago` green.
- `4c1a4e59` hook capability (#20) **phase 1**`LifecycleHooks`/`HookStep`/`HostCopy`
schema + validate() + re-exports + 3 schema tests; also fixed 3 pre-existing
`ContainerConfig` test literals missing `generated_secrets` (container crate now
compiles; `cargo test -p archipelago-container` green, 53 pass).
- `f0c6b79d` immich containers named underscore (immich_server/_postgres/_redis) to
match runtime lifecycle code — fixes package.stop/start/restart. **immich fully
migrated + verified on .228** (manifest-driven stack via orchestrator).
- `b0b54a96` immich lifecycle bats suite (tests/lifecycle/bats/immich.bats).
- `d5ef4573`/`9e6c5370`/`011081d1` immich migration (rename→immich, orchestrator-first).
- `f160e0c4` podman-restart.service enabled at startup (reboot-survival).
- `0860dfac` Services-tab UI (backends→Services, parent icons, categories sub-nav, swipe).
- `220666d3`/`7bfbe8fe` registry-manifest infra phases 1+2 (consume + EMBED_MANIFESTS publish).
- `192238cb` docs consolidation 56→28 + CLAUDE.md.
- `03a4ee1b` generated-secrets system + companion/quadlet fixes.
**DONE — hook capability (#20), phases 1+2 (schema + executor + wiring):**
controlled post-install hooks so indeedhub/netbird can migrate. Design:
`docs/manifest-hooks-design.md`. Schema, validate(), executor, and install-path
wiring all landed + green (commits `4c1a4e59`/`955c54b7` above). Remaining #20
phases: 3 = indeedhub migration (NEXT, below); 4 = netbird; 5 = `pre_start` hooks
(type exists, NOT yet executed — wire into `prepare_for_start` if/when needed).
**NEXT — #20 phase 3, indeedhub migration:** author 7 member manifests
(postgres/redis/minio/relay/api/ffmpeg + frontend) on archy-net with container-name
hostnames; frontend carries the `post_install` hook (strip X-Frame-Options, copy
nostr-provider.js, inject script, nginx reload — see `patch_indeedhub_nostr_provider`
in install.rs:68 for exact ops); wire `install_indeedhub_stack` orchestrator-first;
generated_secrets: indeedhub-db-password/indeedhub-jwt/indeedhub-minio-password
(reuse live values); preserve hardcoded AES_MASTER_SECRET literal + minio user
"indeeadmin". Then netbird (assess its setup steps). Then single-container legacy
apps (add to `uses_orchestrator_install_flow` allowlist in install.rs + verify each).
Then the lifecycle gate (#6) — needs harness hardening (#18) + .228 bitcoin synced.
**Test/deploy facts:** .228 = archi resilience node, UI/RPC pw `password123` (https),
SSH pw `archipelago`. Lifecycle harness runs from .116: `cd tests/lifecycle &&
ARCHY_HOST=192.168.1.228 ARCHY_SCHEME=https ARCHY_PASSWORD=password123
ARCHY_ALLOW_DESTRUCTIVE=1 ./run.sh <suite>`. RPC trigger: auth.login (sets session
+ csrf cookies) → send csrf cookie value as `X-CSRF-Token` header. package.install
needs `{"id":"<app>","dockerImage":"<any-valid-image>"}` (dockerImage required even
for stacks). Rust workspace root = `core/`. Linker `undefined hidden symbol`
rebuild with `CARGO_INCREMENTAL=0`. immich on .228: app_id `immich`, containers
immich_server/immich_postgres/immich_redis, data dir owner 100998:100998.
## 9. Documentation map (what survives)
This master plan is the hub. Authoritative standalone docs (linked above), kept:
- **Design:** `architecture.md`, `app-developer-guide.md`,
`APP-PACKAGING-MIGRATION-PLAN.md`, `registry-manifest-design.md`,
`marketplace-protocol.md`, `dht-distribution-design.md`,
`multi-node-architecture.md`, `rust-orchestrator-migration.md`,
`bulletproof-containers.md`, `three-mode-ui-design.md`, `dual-ecash-design.md`,
`meshroller-integration-design.md`, `phase4-streaming-ecash-plan.md`, `adr/*`.
- **Reference:** `app-manifest-spec.md`, `api-reference.md`, `developer-guide.md`,
`operations-runbook.md`, `troubleshooting.md`, `user-walkthrough.md`,
`bitcoin-rpc-relay.md`, `security-code-audit-2026-03.md`, `GAMEPAD-NAV.md`,
`SEED-VERIFICATION.md`, `hotfix-process.md`, `app-registry-status-2026-06-21.md`.
All dated handoffs/resumes/transcripts/superseded trackers were consolidated here
and removed (recoverable via git) on 2026-06-21.