2026-06-21 05:11:32 -04:00
# 🚩 PRODUCTION MASTER PLAN — Archipelago App Platform & Registry
> **THIS IS THE AUTHORITATIVE PLAN. Agents: read this first and keep it open until
> the production test gate (§5) is green.** It overrides ad-hoc direction and
> supersedes all prior roadmap/handoff/status docs. When the gate passes, remove
> the priority banner and demote this doc.
>
> Last updated: 2026-06-21 · Binary: v1.7.99-alpha
---
## 1. The North Star
Make Archipelago a **world-class, developer-ready app platform** where:
1. **Every app is manifest-driven** — install/run/update/uninstall needs only the
app's manifest (+ catalog entry). **Zero OS-level code reliance**: no per-app
Rust installers, no `sudo mkdir/chown`, no host provisioning.
2. **Manifests are distributed via the (signed) registry**, not baked into the
binary OTA as disk files. Bumping/adding an app = a signed catalog change.
3. **Third-party developers can build and ship apps via an external registry** —
a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation),
not a gatekept central store. `archy app validate/render/install/test` tooling.
4. The platform stays **rootless, secure-by-default, elegant, robust, and
100%-uptime-capable** (reboot-survivable, self-healing, no data loss on migrate).
**Definition of done:** the production test gate (§5) is green for the app set on
real nodes. Until then, this plan is the priority.
## 2. Invariants (never violate)
- **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged
containers unless explicitly approved. (ADR-001, ADR-009.)
- **No app-specific business logic in the Rust backend.** The orchestrator owns
the lifecycle state machine; apps are declarative. Legacy `install_immich_stack`
(hardcoded `podman run` + `sudo chown`) is the anti-pattern being deleted.
- **Secrets are manifest-declared** (`generated_secrets`, materialised by
`container::secrets` 0600/rootless, idempotent + self-healing) — never hardcoded,
per-app, or logged. Replaces the deleted `ensure_fmcd_password`.
- **Migrations never destroy data.** Preserve `/var/lib/archipelago/<app>`,
generated secrets, displayed credentials, public ports, and adoption container
names. Always provide a rollback path. Stop/recreate only when necessary.
- **Verify on a real node (.228, then .198) before any tag.**
## 3. Current state (2026-06-21)
- **~40 apps are manifest-based and Quadlet-migrated** (survive
`archipelago.service` restart + reboot). Exhaustive per-app table:
`docs/app-registry-status-2026-06-21.md`.
- **Legacy holdout: immich** — the one app with **no manifest** and a hardcoded
Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data.
The migration proof case.
- **Manifests still travel by OTA disk rsync** (`apps/ → /opt/archipelago/apps`).
The signed catalog (`app-catalog.json`) currently distributes **only image
overrides** — not full manifests. Gap closed by workstream B.
- **The 4 companions** (`archy-bitcoin-ui`, `-lnd-ui`, `-electrs-ui`,
`-fedimint-ui`) build from `docker/<name>` contexts via `companion.rs`, not the
manifest registry — a later phase folds them in.
- **No app has passed the formal 20× production gate.** That is the blocker.
## 4. Workstreams (each links its authoritative detail doc)
| # | Workstream | Detail doc | Status |
|---|-----------|-----------|--------|
| A | **Manifest-driven app platform** — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | `APP-PACKAGING-MIGRATION-PLAN.md` | mostly done; immich + multi-container polish remain |
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0–2 code-complete (worktree) |
| E | **Production test gate** — 20× lifecycle on .228 + .198, per-app L1/L2 matrix | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **never green — exit criterion** |
**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
scan, Quadlet rendering) and `bulletproof-containers.md` (the six container failure
modes FM1–FM6 + the desired-state-first reconciler that fixes them).
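
The level-triggered, desired-state-first idea can be illustrated with a minimal sketch. The type and function names below (`Observed`, `Action`, `reconcile_pass`) are hypothetical and are not taken from `ProdContainerOrchestrator` or `BootReconciler`; only the `NoOp` action name echoes the reconcile log quoted later in this plan. Each pass derives its actions purely from the difference between desired and observed state, so a missed event is corrected on the next pass.

```rust
use std::collections::BTreeMap;

/// Observed state of one container (hypothetical, simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Observed {
    Running,
    Stopped,
}

/// Action chosen for one desired app (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Container exists and runs: adopt as-is, do not recreate.
    NoOp,
    /// Container exists but is stopped: start it.
    Start,
    /// Container is missing: create it from the manifest.
    Create,
}

/// One level-triggered pass. No event history is consulted: the plan is a
/// pure function of (desired, observed), so repeating the pass on unchanged
/// state yields the same plan.
pub fn reconcile_pass(
    desired: &[&str],
    observed: &BTreeMap<String, Observed>,
) -> BTreeMap<String, Action> {
    let mut plan = BTreeMap::new();
    for app_id in desired {
        let action = match observed.get(*app_id) {
            Some(Observed::Running) => Action::NoOp,
            Some(Observed::Stopped) => Action::Start,
            None => Action::Create,
        };
        plan.insert((*app_id).to_string(), action);
    }
    plan
}
```

A caller would run such a pass on a fixed interval (the plan above cites 30s) and apply the resulting actions. The sketch deliberately plans nothing for containers outside the desired set, in keeping with the data-preservation invariant.
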
## 5. Production test gate (exit criterion)
An app is **production-ready** only when `tests/lifecycle/run-20x.sh` is green
across the full matrix — install / UI-reachable / stop / start / restart /
reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall —
**20× on .228 AND .198**. All 8 gate checkboxes in `tests/lifecycle/TESTING.md`
are currently unchecked. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps,
L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.
## 6. Immediate sequence (live workstream)
1. ✅ **B-phase 1** — `manifest` field on `AppCatalogEntry`; `load_manifests`
catalog-wins merge; `manifest_dir` kept (build-source catalog manifests skipped
in phase 1); unit tests. *(commit 220666d3)*
2. ✅ **B-phase 2** — `EMBED_MANIFESTS` publisher generator + round-trip guard.
*(7bfbe8fe; signing via existing ceremony — not yet flipped on for the fleet.)*
3. ✅ **C immich proof** — immich is a manifest-driven stack (immich + immich-postgres +
immich-redis) installed via `install_stack_via_orchestrator`; legacy installer
is now fallback-only. Live-migrated + verified on .228. Found+fixed: container_name
duplicate-on-shared-PGDATA, version-digit validation, partial-fallback hardening,
data_uid 100998. Canonical app_id `immich` (title+icon). *(9e6c5370, d5ef4573)*
4. ✅ **Reboot-survival** — podman-restart.service enabled (startup, fleet-wide)
for the podman-`--restart` path. *(f160e0c4)*
5. ◻ **Verify on .198** (immich migration validated on .228 only so far).
6. ◻ **E** — run the 20× gate; fix until green.
7. ◻ Demote this banner.
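
The partial-fallback hardening from step 3 can be sketched as a small decision function. This is an illustration only, not the code in `install_stack_via_orchestrator`: the types `MemberResult` and `StackDecision` and the function `decide` are hypothetical. It captures the rule that the legacy installer is an option only while nothing has been installed yet; an unknown app_id after a member is already up is an error, so two installers never create containers on the same data.

```rust
/// Result of installing one stack member via the orchestrator
/// (hypothetical, simplified type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MemberResult {
    Installed,
    /// The orchestrator has no manifest for this app_id.
    UnknownAppId,
}

/// What the stack installer should do next (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StackDecision {
    /// Every member was installed through the orchestrator.
    Done,
    /// No member was installed yet, so the legacy path is still safe.
    FallBackToLegacy,
    /// A member failed after another was already up: stop and report.
    Fail { app_id: String },
}

/// Walk the members in install order and decide the outcome.
pub fn decide(members: &[(&str, MemberResult)]) -> StackDecision {
    let mut installed_any = false;
    for (app_id, result) in members {
        match result {
            MemberResult::Installed => installed_any = true,
            // Falling back here would double-create containers on shared data.
            MemberResult::UnknownAppId if installed_any => {
                return StackDecision::Fail {
                    app_id: (*app_id).to_string(),
                };
            }
            MemberResult::UnknownAppId => return StackDecision::FallBackToLegacy,
        }
    }
    StackDecision::Done
}
```
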
**Not yet done / deliberate follow-ups:** flip `EMBED_MANIFESTS` on for the
published catalog (then sign) to actually distribute manifests via the registry;
Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not
just podman-`--restart`); immich on .198.
## 7. Release blockers & operational gotchas (durable)
Carried forward from prior handoffs (deduped against persistent memory):
- **Rootless control-plane responsiveness** — slow `podman ps`/store cleanup at
startup must not surface a false "no apps installed" UI. **My Apps must preserve
last-known apps during scanner backoff**, never show empty during a transient.
- **Reboot survival** — gate on ≥3 (prefer 5) consecutive clean post-reboot
lifecycle passes. Quadlet units under `user.slice` survive `archipelago.service`
restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
- **Startup patterns** — wait on a socket/health, never `sleep`. Tailscale waits
for its socket; Fedimint Guardian waits for Bitcoin RPC `initialblockdownload:false`
before launching fedimintd (proxy/wait companion on :8175 during IBD).
- **Bitcoin must run full** (`txindex=1`, non-pruned) for ElectrumX/mempool.
- **Adoption** — match existing containers by name and adopt without recreate;
record a migration version in app state; preserve Nostr signer bridges
(IndeeHub needs `/nostr-provider.js` served, not just port reachability).
- **Image presence** — use bounded targeted `podman image inspect` , not
`podman image exists` (avoids store-walk stalls).
- **Companion rebuilds** — `companion.rs` must rebuild `:latest` when the build
context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never
reach nodes. `:local` is a manual override, never auto-rebuilt.
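
The "wait on a socket/health, never `sleep`" startup pattern from the list above can be sketched as a bounded readiness wait. The helper names `wait_until` and `socket_ready` are hypothetical and not taken from the Archipelago codebase. The short pause between probes only paces the polling; the caller proceeds on the readiness signal, not on elapsed time, and gives up at a deadline instead of hanging.

```rust
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Poll `probe` until it reports ready or `timeout` elapses.
/// Returns true as soon as the probe succeeds, false on timeout.
pub fn wait_until<F>(mut probe: F, timeout: Duration, interval: Duration) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Pacing between probes, not a substitute for the readiness check.
        thread::sleep(interval);
    }
}

/// Readiness probe for a Unix-socket service: a connection is accepted.
pub fn socket_ready(path: &Path) -> bool {
    UnixStream::connect(path).is_ok()
}
```

A caller might combine the two as `wait_until(|| socket_ready(path), Duration::from_secs(60), Duration::from_millis(250))`, where the path and both durations are placeholder values chosen for illustration.
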
## 8. Roadmap
**Pipeline:** Feature Testing (internal) → User Testing (controlled hardware) →
Beta Live (public). Hardening priorities feeding the gate:
- **P0** Container app reliability — bulletproof install/health/restart/uninstall
across all apps, dependency chains, multi-container stacks.
- **P0** Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor
hidden services, LND Connect).
- **P1** LUKS2 full-partition encryption for `/var/lib/archipelago/`
(AES-256-XTS, Argon2id, key from setup password + hardware salt).
- **P1** Meshtastic plug-and-play parity with MeshCore.
**Post-beta (deferred — do not start until gate is green):** P2P encrypted
voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC
hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.md`);
Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash
phases 2–6 (`dual-ecash-design.md`).
## 8b. SESSION STATE + RESUME (2026-06-21, live)
**Landed + committed on main this session (newest first):**
- **#20 phase 3 — ADOPTION PATH LIVE-VERIFIED on .228 (2026-06-21).** Built
v1.7.99-alpha, sideloaded binary + 7 manifests, restarted (stop/replace/start —
containers survived via --restart unless-stopped + podman-restart.service). RPC
`package.install indeedhub` → `complete`, orchestrator-first path adopted all 7
members (`reconcile action app_id=indeedhub-* action=NoOp`), containers stayed
**Up 4 days (NOT recreated)** — zero data/credential disruption. UI green:
frontend :7778 → 200, nostr-provider.js → 200, **/api/ → 200 (proves
network_aliases: frontend nginx `http://api:4000` resolved on indeedhub-net)**.
Fleet healthy (36 containers, none down).
**FRESH-CREATE PATH = BLOCKED (found live 2026-06-21).** Removed the stateless
frontend + reinstalled to exercise install_fresh → it FAILED:
`orchestrator stack install indeedhub failed at app indeedhub: IndeedHub
dependencies were not ready within 120s (indeedhub-api dependency DNS not ready)`,
and the frontend was left down. Recovered manually on .228 (podman run w/ alias
indeedhub on indeedhub-net; UI 200). ROOT CAUSE = hardcoded indeedhub orchestrator
special-cases that predate + conflict with the manifest path:
- prod_orchestrator `ensure_running` ~L1377: `app_id=="indeedhub"` →
`reconcile_indeedhub_stack` , which REFUSES manifest creation when the frontend
is absent (returns Left("stack-managed")).
- `run_pre_start_hooks("indeedhub")` ~L2324 → `start_indeedhub_backends` →
`wait_for_indeedhub_dependencies_ready(120)` — the gate that blocked install_fresh
(`indeedhub_api_dependency_dns_ready` returns false while the frontend's own alias
is absent + a getent transiently fails).
- also `repair_indeedhub_network_aliases` , `patch_indeedhub_nostr_provider` , the
"frontend did not stay reachable; restart" path (~L2474), `INDEEDHUB_BACKEND_*`
consts, and a crash_recovery.rs indeedhub special-case.
**FIX (next, its own build/deploy/test cycle):** delete these special-cases now
that the manifest carries dependencies/network_aliases/post_install — route
"indeedhub" through the GENERIC install_fresh + reconcile path so the frontend
fresh-creates normally (hook fires). Then re-run the destructive lifecycle on .228
(frontend recreate must succeed + run the hook), then .198, then the gate.
NOTE: .228 currently runs v1.7.99-alpha (these special-cases still present) — the
running stack is fine (adoption NoOp); only a frontend-absent event re-triggers the
bug, and the frontend is up.
- `b1eea8c0` indeedhub (#20) **phase 3 — CODE COMPLETE, unit-tested.** 7 manifests (apps/indeedhub-{postgres,redis,minio,relay,api,
ffmpeg} + apps/indeedhub frontend) + install_indeedhub_stack orchestrator-first
(immich pattern). Data-preserving by construction = ADOPTION on .228: exact live
hyphen container names, named volumes indeedhub-*-data, dedicated indeedhub-net +
network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse live
/var/lib/archipelago/secrets values (ensure_one no-ops on existing). Frontend
carries the post_install nginx hook (replaces patch_indeedhub_nostr_provider;
defensive since indeedhub:1.0.0 already bakes it). .228 GROUND TRUTH captured:
7 containers Up, volumes indeedhub-{postgres,redis,minio,relay}-data, network
indeedhub-net; frontend nginx upstreams api:4000/minio:9000/relay:8080; image
already bakes X-Frame strip + nostr-provider.js (6347B) + sub_filter.
**NEXT = live verify on .228:** build+sideload binary, restart, package.install
indeedhub → expect adoption (NoOp, no data touch), then full lifecycle. Risk:
service restart SIGKILL-cascade if Quadlet not fully shipped on .228.
- `b94b61f6` `network_aliases` manifest field (ContainerConfig) + podman_client &
quadlet rendering + DNS-label validation; also fixed 4 pre-existing from_manifest
test failures (network_policy: archy-net invalid; bind sources outside
/var/lib/archipelago). Enables indeedhub's short aliases on indeedhub-net.
- `955c54b7` hook capability (#20) **phase 2** — `container::hooks::run_post_install`
executor (podman exec + copy_from_host w/ allowlist canonicalise + symlink-escape
prefix check; best-effort/idempotent) wired into `install_fresh` after container
is up (fresh-container-only). 5 unit tests; `cargo test -p archipelago` green.
- `4c1a4e59` hook capability (#20) **phase 1** — `LifecycleHooks`/`HookStep`/`HostCopy`
schema + validate() + re-exports + 3 schema tests; also fixed 3 pre-existing
`ContainerConfig` test literals missing `generated_secrets` (container crate now
compiles; `cargo test -p archipelago-container` green, 53 pass).
- `f0c6b79d` immich containers named underscore (immich_server/_postgres/_redis) to
match runtime lifecycle code — fixes package.stop/start/restart. **immich fully
migrated + verified on .228** (manifest-driven stack via orchestrator).
- `b0b54a96` immich lifecycle bats suite (tests/lifecycle/bats/immich.bats).
- `d5ef4573`/`9e6c5370`/`011081d1` immich migration (rename→immich, orchestrator-first).
- `f160e0c4` podman-restart.service enabled at startup (reboot-survival).
- `0860dfac` Services-tab UI (backends→Services, parent icons, categories sub-nav, swipe).
- `220666d3`/`7bfbe8fe` registry-manifest infra phases 1+2 (consume + EMBED_MANIFESTS publish).
- `192238cb` docs consolidation 56→28 + CLAUDE.md.
- `03a4ee1b` generated-secrets system + companion/quadlet fixes.
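
The DNS-label validation mentioned for `network_aliases` (commit `b94b61f6` above) might look roughly like the sketch below. The function names are hypothetical and the real validator may differ. The rules follow the usual hostname-label syntax (1 to 63 characters, letters, digits and hyphens, no leading or trailing hyphen); restricting letters to lowercase is an extra assumption made here for illustration.

```rust
/// Check one alias against hostname-label rules: 1..=63 bytes, only
/// `a-z`, `0-9` and `-`, and no hyphen at either end.
/// Lowercase-only is an assumption of this sketch.
pub fn is_valid_dns_label(label: &str) -> bool {
    let bytes = label.as_bytes();
    if bytes.is_empty() || bytes.len() > 63 {
        return false;
    }
    let alnum = |b: u8| b.is_ascii_lowercase() || b.is_ascii_digit();
    if !alnum(bytes[0]) || !alnum(bytes[bytes.len() - 1]) {
        return false;
    }
    bytes.iter().all(|&b| alnum(b) || b == b'-')
}

/// Validate every alias of a container; report the first offender.
pub fn validate_network_aliases(aliases: &[&str]) -> Result<(), String> {
    for alias in aliases {
        if !is_valid_dns_label(alias) {
            return Err(format!("invalid network alias: {alias:?}"));
        }
    }
    Ok(())
}
```

Under these rules the short aliases listed above (`postgres`, `redis`, `minio`, `relay`, `api`) pass, while a name containing an underscore or a dot is rejected.
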
**DONE — hook capability (#20), phases 1+2 (schema + executor + wiring):**
controlled post-install hooks so indeedhub/netbird can migrate. Design:
`docs/manifest-hooks-design.md`. Schema, validate(), executor, and install-path
wiring all landed + green (commits `4c1a4e59`/`955c54b7` above). Remaining #20
phases: 3 = indeedhub migration (NEXT, below); 4 = netbird; 5 = `pre_start` hooks
(type exists, NOT yet executed — wire into `prepare_for_start` if/when needed).
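
The "allowlist canonicalise + symlink-escape prefix check" described for the hook executor's host copy can be illustrated as follows. This is a sketch under assumptions, not the code in `container::hooks`: the function name `resolve_allowed_source` and the shape of the allowlist are hypothetical. The source path is canonicalised first, so a symlink pointing outside an allowed root resolves to its real target and fails the prefix check.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Resolve `requested` and accept it only if the resolved path lies under
/// one of `allowed_roots`. Returns the canonical path on success.
pub fn resolve_allowed_source(
    requested: &Path,
    allowed_roots: &[PathBuf],
) -> io::Result<PathBuf> {
    // Resolves `..` components and symlinks; fails if the path is missing.
    let resolved = fs::canonicalize(requested)?;
    for root in allowed_roots {
        // Canonicalise the root too so both sides are symlink-free.
        let root = match fs::canonicalize(root) {
            Ok(root) => root,
            Err(_) => continue, // a missing root can never match
        };
        // `Path::starts_with` compares whole components, so an allowed
        // `/a/b` does not accidentally match `/a/bc`.
        if resolved.starts_with(&root) {
            return Ok(resolved);
        }
    }
    Err(io::Error::new(
        io::ErrorKind::PermissionDenied,
        format!("{} is outside the allowed roots", resolved.display()),
    ))
}
```

A check like this narrows, but does not by itself close, the window between validation and the later copy; operating on the returned canonical path rather than the original input is one way to keep that window small.
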
**NEXT — #20 phase 3, indeedhub migration:** author 7 member manifests
(postgres/redis/minio/relay/api/ffmpeg + frontend) on archy-net with container-name
hostnames; frontend carries the `post_install` hook (strip X-Frame-Options, copy
nostr-provider.js, inject script, nginx reload — see `patch_indeedhub_nostr_provider`
in install.rs:68 for exact ops); wire `install_indeedhub_stack` orchestrator-first;
generated_secrets: indeedhub-db-password/indeedhub-jwt/indeedhub-minio-password
(reuse live values); preserve hardcoded AES_MASTER_SECRET literal + minio user
"indeeadmin". Then netbird (assess its setup steps). Then single-container legacy
apps (add to `uses_orchestrator_install_flow` allowlist in install.rs + verify each).
Then the lifecycle gate (#6) — needs harness hardening (#18) + .228 bitcoin synced.
**Test/deploy facts:** .228 = archi resilience node, UI/RPC pw `password123` (https),
SSH pw `archipelago`. Lifecycle harness runs from .116: `cd tests/lifecycle &&
ARCHY_HOST=192.168.1.228 ARCHY_SCHEME=https ARCHY_PASSWORD=password123
ARCHY_ALLOW_DESTRUCTIVE=1 ./run.sh <suite>`. RPC trigger: auth.login (sets
session + csrf cookies) → send csrf cookie value as `X-CSRF-Token` header.
package.install needs `{"id":"<app>","dockerImage":"<any-valid-image>"}`
(dockerImage required even for stacks). Rust workspace root = `core/`. Linker
`undefined hidden symbol` → rebuild with `CARGO_INCREMENTAL=0`. immich on .228:
app_id `immich`, containers
immich_server/immich_postgres/immich_redis, data dir owner 100998:100998.
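
The data dir owner 100998:100998 follows from how rootless user namespaces map IDs: container UID 0 maps to the invoking user, and container UID n ≥ 1 maps to the (n − 1)-th ID of that user's subordinate range. The sketch below reproduces the arithmetic for postgres (container UID 999). The subordinate range start of 100000, its size of 65536, and the invoking user's UID of 1000 are assumed typical defaults, not values taken from .228; the function name `host_uid` is hypothetical.

```rust
/// Map a container UID to the host UID under a rootless user namespace
/// with a single subordinate range.
///
/// * `user_uid`: host UID of the invoking user (assumed 1000 in the example).
/// * `subuid_start` / `subuid_count`: the user's subordinate range
///   (assumed 100000 / 65536 in the example).
///
/// Returns `None` when the container UID is not covered by the mapping.
pub fn host_uid(
    container_uid: u32,
    user_uid: u32,
    subuid_start: u32,
    subuid_count: u32,
) -> Option<u32> {
    if container_uid == 0 {
        // Container root is the invoking user on the host.
        return Some(user_uid);
    }
    if container_uid > subuid_count {
        return None;
    }
    // Container UID 1 takes the first subordinate ID, UID 2 the second, ...
    subuid_start.checked_add(container_uid - 1)
}
```

With those assumed defaults, `host_uid(999, 1000, 100_000, 65_536)` evaluates to `Some(100998)`, matching the ownership recorded above.
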
## 9. Documentation map (what survives)
This master plan is the hub. Authoritative standalone docs (linked above), kept:
- **Design:** `architecture.md`, `app-developer-guide.md`,
`APP-PACKAGING-MIGRATION-PLAN.md`, `registry-manifest-design.md`,
`marketplace-protocol.md`, `dht-distribution-design.md`,
`multi-node-architecture.md`, `rust-orchestrator-migration.md`,
`bulletproof-containers.md`, `three-mode-ui-design.md`, `dual-ecash-design.md`,
`meshroller-integration-design.md`, `phase4-streaming-ecash-plan.md`, `adr/*`.
- **Reference:** `app-manifest-spec.md`, `api-reference.md`, `developer-guide.md`,
`operations-runbook.md`, `troubleshooting.md`, `user-walkthrough.md`,
`bitcoin-rpc-relay.md`, `security-code-audit-2026-03.md`, `GAMEPAD-NAV.md`,
`SEED-VERIFICATION.md`, `hotfix-process.md`, `app-registry-status-2026-06-21.md`.
All dated handoffs/resumes/transcripts/superseded trackers were consolidated here
and removed (recoverable via git) on 2026-06-21.