archy/docs/PRODUCTION-MASTER-PLAN.md
archipelago 192238cbb8 docs: consolidate into PRODUCTION-MASTER-PLAN, add CLAUDE.md, prune 25 stale docs
Single authoritative hub (docs/PRODUCTION-MASTER-PLAN.md) for the app-platform
north star: every app manifest-driven (zero OS-level reliance), manifests via the
signed registry, developer-ready external marketplace; rootless/secure/robust/
100%-uptime. Repo CLAUDE.md (auto-loaded each session) points agents at it until
the 20x lifecycle gate is green. New design doc registry-manifest-design.md.

Consolidated docs 56 -> 28: deleted dated handoffs/resumes/transcripts and
superseded trackers (content folded into the master plan or already in memory).
Kept all evergreen design/reference docs + ADRs (the master links them).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:11:32 -04:00

158 lines
9.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 🚩 PRODUCTION MASTER PLAN — Archipelago App Platform & Registry
> **THIS IS THE AUTHORITATIVE PLAN. Agents: read this first and keep it open until
> the production test gate (§5) is green.** It overrides ad-hoc direction and
> supersedes all prior roadmap/handoff/status docs. When the gate passes, remove
> the priority banner and demote this doc.
>
> Last updated: 2026-06-21 · Binary: v1.7.99-alpha
---
## 1. The North Star
Make Archipelago a **world-class, developer-ready app platform** where:
1. **Every app is manifest-driven** — install/run/update/uninstall needs only the
app's manifest (+ catalog entry). **Zero OS-level code reliance**: no per-app
Rust installers, no `sudo mkdir/chown`, no host provisioning.
2. **Manifests are distributed via the (signed) registry**, not baked into the
binary OTA as disk files. Bumping/adding an app = a signed catalog change.
3. **Third-party developers can build and ship apps via an external registry**
a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation),
not a gatekept central store. `archy app validate/render/install/test` tooling.
4. The platform stays **rootless, secure-by-default, elegant, robust, and
100%-uptime-capable** (reboot-survivable, self-healing, no data loss on migrate).
**Definition of done:** the production test gate (§5) is green for the app set on
real nodes. Until then, this plan is the priority.
## 2. Invariants (never violate)
- **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged
containers unless explicitly approved. (ADR-001, ADR-009.)
- **No app-specific business logic in the Rust backend.** The orchestrator owns
the lifecycle state machine; apps are declarative. Legacy `install_immich_stack`
(hardcoded `podman run` + `sudo chown`) is the anti-pattern being deleted.
- **Secrets are manifest-declared** (`generated_secrets`, materialised by
`container::secrets` 0600/rootless, idempotent + self-healing) — never hardcoded,
per-app, or logged. Replaces the deleted `ensure_fmcd_password`.
- **Migrations never destroy data.** Preserve `/var/lib/archipelago/<app>`,
generated secrets, displayed credentials, public ports, and adoption container
names. Always provide a rollback path. Stop/recreate only when necessary.
- **Verify on a real node (.228, then .198) before any tag.**
## 3. Current state (2026-06-21)
- **~40 apps are manifest-based and Quadlet-migrated** (survive
`archipelago.service` restart + reboot). Exhaustive per-app table:
`docs/app-registry-status-2026-06-21.md`.
- **Legacy holdout: immich** — the one app with **no manifest** and a hardcoded
Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data.
The migration proof case.
- **Manifests still travel by OTA disk rsync** (`apps/ → /opt/archipelago/apps`).
The signed catalog (`app-catalog.json`) currently distributes **only image
overrides** — not full manifests. Gap closed by workstream B.
- **The 4 companions** (`archy-bitcoin-ui`, `-lnd-ui`, `-electrs-ui`,
`-fedimint-ui`) build from `docker/<name>` contexts via `companion.rs`, not the
manifest registry — a later phase folds them in.
- **No app has passed the formal 20× production gate.** That is the blocker.
## 4. Workstreams (each links its authoritative detail doc)
| # | Workstream | Detail doc | Status |
|---|-----------|-----------|--------|
| A | **Manifest-driven app platform** — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | `APP-PACKAGING-MIGRATION-PLAN.md` | mostly done; immich + multi-container polish remain |
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **design done — implementing phase 1** |
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 02 code-complete (worktree) |
| E | **Production test gate** — 20× lifecycle on .228 + .198, per-app L1/L2 matrix | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **never green — exit criterion** |
**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
scan, Quadlet rendering) and `bulletproof-containers.md` (the six container failure
modes FM1FM6 + the desired-state-first reconciler that fixes them).
## 5. Production test gate (exit criterion)
An app is **production-ready** only when `tests/lifecycle/run-20x.sh` is green
across the full matrix — install / UI-reachable / stop / start / restart /
reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall —
**20× on .228 AND .198**. All 8 gate checkboxes in `tests/lifecycle/TESTING.md`
are currently unchecked. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps,
L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.
## 6. Immediate sequence (live workstream)
1. **B-phase 1**`manifest` field on `AppCatalogEntry`; `load_manifests`
catalog-wins merge; `manifest_dir: Option`; unit tests (image-only apps first).
2. **B-phase 2** — publisher generator embeds + signs manifests into
`releases/app-catalog.json`.
3. **C immich proof** — author immich as registry manifests (postgres/redis/server)
installed via `install_stack_via_orchestrator`; delete `install_immich_stack`;
`generated_secrets: [immich-db-password]`**reuse the live secret `39ec03dc…`**
(postgres is initialised with it; never regenerate). Anon `/data` vol is empty.
4. **Verify on .228, then .198.**
5. **E** — run the 20× gate; fix until green.
6. Demote this banner.
## 7. Release blockers & operational gotchas (durable)
Carried forward from prior handoffs (deduped against persistent memory):
- **Rootless control-plane responsiveness** — slow `podman ps`/store cleanup at
startup must not surface a false "no apps installed" UI. **My Apps must preserve
last-known apps during scanner backoff**, never show empty during a transient.
- **Reboot survival** — gate on ≥3 (prefer 5) consecutive clean post-reboot
lifecycle passes. Quadlet units under `user.slice` survive `archipelago.service`
restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
- **Startup patterns** — wait on a socket/health, never `sleep`. Tailscale waits
for its socket; Fedimint Guardian waits for Bitcoin RPC `initialblockdownload:false`
before launching fedimintd (proxy/wait companion on :8175 during IBD).
- **Bitcoin must run full** (`txindex=1`, non-pruned) for ElectrumX/mempool.
- **Adoption** — match existing containers by name and adopt without recreate;
record a migration version in app state; preserve Nostr signer bridges
(IndeeHub needs `/nostr-provider.js` served, not just port reachability).
- **Image presence** — use bounded targeted `podman image inspect`, not
`podman image exists` (avoids store-walk stalls).
- **Companion rebuilds** — `companion.rs` must rebuild `:latest` when the build
context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never
reach nodes. `:local` is a manual override, never auto-rebuilt.
## 8. Roadmap
**Pipeline:** Feature Testing (internal) → User Testing (controlled hardware) →
Beta Live (public). Hardening priorities feeding the gate:
- **P0** Container app reliability — bulletproof install/health/restart/uninstall
across all apps, dependency chains, multi-container stacks.
- **P0** Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor
hidden services, LND Connect).
- **P1** LUKS2 full-partition encryption for `/var/lib/archipelago/`
(AES-256-XTS, Argon2id, key from setup password + hardware salt).
- **P1** Meshtastic plug-and-play parity with MeshCore.
**Post-beta (deferred — do not start until gate is green):** P2P encrypted
voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC
hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.md`);
Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash
phases 26 (`dual-ecash-design.md`).
## 9. Documentation map (what survives)
This master plan is the hub. Authoritative standalone docs (linked above), kept:
- **Design:** `architecture.md`, `app-developer-guide.md`,
`APP-PACKAGING-MIGRATION-PLAN.md`, `registry-manifest-design.md`,
`marketplace-protocol.md`, `dht-distribution-design.md`,
`multi-node-architecture.md`, `rust-orchestrator-migration.md`,
`bulletproof-containers.md`, `three-mode-ui-design.md`, `dual-ecash-design.md`,
`meshroller-integration-design.md`, `phase4-streaming-ecash-plan.md`, `adr/*`.
- **Reference:** `app-manifest-spec.md`, `api-reference.md`, `developer-guide.md`,
`operations-runbook.md`, `troubleshooting.md`, `user-walkthrough.md`,
`bitcoin-rpc-relay.md`, `security-code-audit-2026-03.md`, `GAMEPAD-NAV.md`,
`SEED-VERIFICATION.md`, `hotfix-process.md`, `app-registry-status-2026-06-21.md`.
All dated handoffs/resumes/transcripts/superseded trackers were consolidated here
and removed (recoverable via git) on 2026-06-21.