🚩 PRODUCTION MASTER PLAN — Archipelago App Platform & Registry
THIS IS THE AUTHORITATIVE PLAN. Agents: read this first and keep it open until the production test gate (§5) is green. It overrides ad-hoc direction and supersedes all prior roadmap/handoff/status docs. When the gate passes, remove the priority banner and demote this doc.
Last updated: 2026-06-21 · Binary: v1.7.99-alpha
1. The North Star
Make Archipelago a world-class, developer-ready app platform where:
- Every app is manifest-driven: install/run/update/uninstall needs only the app's manifest (+ catalog entry). Zero OS-level code reliance (no per-app Rust installers, no `sudo mkdir/chown`, no host provisioning).
- Manifests are distributed via the (signed) registry, not baked into the binary OTA as disk files. Bumping/adding an app = a signed catalog change.
- Third-party developers can build and ship apps via an external registry: a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation), not a gatekept central store. `archy app validate/render/install/test` tooling.
- The platform stays rootless, secure-by-default, elegant, robust, and 100%-uptime-capable (reboot-survivable, self-healing, no data loss on migrate).
Definition of done: the production test gate (§5) is green for the app set on real nodes. Until then, this plan is the priority.
2. Invariants (never violate)
- Rootless Podman only. No rootful, no Docker-socket mounts, no privileged containers unless explicitly approved. (ADR-001, ADR-009.)
- No app-specific business logic in the Rust backend. The orchestrator owns the lifecycle state machine; apps are declarative. Legacy `install_immich_stack` (hardcoded `podman run` + `sudo chown`) is the anti-pattern being deleted.
- Secrets are manifest-declared (`generated_secrets`, materialised by `container::secrets` 0600/rootless, idempotent + self-healing), never hardcoded, per-app, or logged. Replaces the deleted `ensure_fmcd_password`.
- Migrations never destroy data. Preserve `/var/lib/archipelago/<app>`, generated secrets, displayed credentials, public ports, and adoption container names. Always provide a rollback path. Stop/recreate only when necessary.
- Verify on a real node (.228, then .198) before any tag.
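The "idempotent + self-healing" secrets invariant above can be illustrated with a minimal sketch. It assumes a Unix host; `ensure_secret` and its signature are hypothetical names for illustration and are not the actual `container::secrets` API. The idea is that an existing value is never rotated, only its file mode is repaired.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Hypothetical helper (not the real `container::secrets` API):
/// create `path` with mode 0600 if it is missing, otherwise keep the
/// existing value untouched. Returns Ok(true) when a new secret was written.
pub fn ensure_secret(path: &Path, generate: impl FnOnce() -> Vec<u8>) -> io::Result<bool> {
    let created = match OpenOptions::new()
        .write(true)
        .create_new(true) // fails with AlreadyExists instead of overwriting
        .mode(0o600)
        .open(path)
    {
        Ok(mut file) => {
            file.write_all(&generate())?;
            true
        }
        // Idempotent: a second call never regenerates or rotates the value.
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => false,
        Err(e) => return Err(e),
    };
    // Self-healing: tighten the mode even if the file pre-existed with
    // looser permissions.
    fs::set_permissions(path, fs::Permissions::from_mode(0o600))?;
    Ok(created)
}
```

Calling it twice with different generators leaves the first value in place, which is the property a data-preserving migration relies on when it reuses live secrets.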
3. Current state (2026-06-21)
- ~40 apps are manifest-based and Quadlet-migrated (survive `archipelago.service` restart + reboot). Exhaustive per-app table: `docs/app-registry-status-2026-06-21.md`.
- Legacy holdout: immich, the one app with no manifest and a hardcoded Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data. The migration proof case.
- Manifests still travel by OTA disk rsync (`apps/ → /opt/archipelago/apps`). The signed catalog (`app-catalog.json`) currently distributes only image overrides, not full manifests. Gap closed by workstream B.
- The 4 companions (`archy-bitcoin-ui`, `-lnd-ui`, `-electrs-ui`, `-fedimint-ui`) build from `docker/<name>` contexts via `companion.rs`, not the manifest registry; a later phase folds them in.
- No app has passed the formal production gate (5× for now, was 20×). That is the blocker.
4. Workstreams (each links its authoritative detail doc)
| # | Workstream | Detail doc | Status |
|---|---|---|---|
| A | Manifest-driven app platform: packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | APP-PACKAGING-MIGRATION-PLAN.md | mostly done; immich + multi-container polish remain |
| B | Registry-distributed manifests: catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | registry-manifest-design.md | phases 1+2 done (node consume + opt-in publisher embed); not yet flipped on for the fleet |
| C | Developer-ready external registry: 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, archy app … tooling | marketplace-protocol.md, app-developer-guide.md | design exists; tooling + trust UX pending |
| D | Distribution backbone: signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | dht-distribution-design.md | phases 0–2 code-complete (worktree) |
| E | Production test gate: 5× lifecycle on .228 + .198 (for now; was 20×), per-app L1/L2 matrix | tests/lifecycle/TESTING.md, bulletproof-containers.md | never green (exit criterion) |
Orchestrator architecture (foundation for A/B): rust-orchestrator-migration.md
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
scan, Quadlet rendering) and bulletproof-containers.md (the six container failure
modes FM1–FM6 + the desired-state-first reconciler that fixes them).
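As a rough illustration of what a level-triggered, desired-state-first pass means, the sketch below decides each action purely from desired versus observed state and matches containers by name, so a running container is adopted rather than recreated. The `Observed` and `Action` enums and the `reconcile` function are hypothetical; only the `NoOp` action name appears in this plan, and the real `BootReconciler` may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical observed container state, keyed by container name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Observed {
    Absent,
    Stopped,
    Running,
}

/// Hypothetical action set; only `NoOp` is named in the plan.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Action {
    NoOp,
    Create,
    Start,
}

/// One level-triggered pass: the decision depends only on the current
/// desired and observed state, never on remembered events, so repeating
/// the pass is always safe.
pub fn reconcile(
    desired: &[&str],
    observed: &BTreeMap<String, Observed>,
) -> Vec<(String, Action)> {
    desired
        .iter()
        .map(|name| {
            let state = observed.get(*name).copied().unwrap_or(Observed::Absent);
            let action = match state {
                Observed::Running => Action::NoOp, // adopt by name, no recreate
                Observed::Stopped => Action::Start,
                Observed::Absent => Action::Create,
            };
            (name.to_string(), action)
        })
        .collect()
}
```

Because the pass is a pure function of state, running it on a timer (the plan mentions a 30s interval) converges without needing to know why a container went missing.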
5. Production test gate (exit criterion)
An app is production-ready only when tests/lifecycle/run-20x.sh is green
across the full matrix — install / UI-reachable / stop / start / restart /
reinstall / reboot-survive / archipelago-restart-survive / uninstall —
5× on .228 AND .198 for now (ARCHY_ITERATIONS=5; temporarily reduced from
20× — restore to 20× before the final ship). All 8 gate checkboxes in tests/lifecycle/TESTING.md
are currently unchecked. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps,
L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.
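The gate rule (every step, every iteration, both nodes) can be stated as a small predicate. This is a sketch only: `StepResult`, `cell_green`, and `gate_is_green` are hypothetical names, and the record shape is an assumption rather than the output format of `run-20x.sh`. The nine step names are taken from the matrix above.

```rust
/// The nine lifecycle steps named in the gate matrix.
pub const STEPS: [&str; 9] = [
    "install",
    "ui-reachable",
    "stop",
    "start",
    "restart",
    "reinstall",
    "reboot-survive",
    "archipelago-restart-survive",
    "uninstall",
];

/// Hypothetical record of one step outcome.
pub struct StepResult<'a> {
    pub node: &'a str,
    pub iteration: u32,
    pub step: &'a str,
    pub passed: bool,
}

/// A cell is green only if it was recorded at least once and never failed.
fn cell_green(results: &[StepResult<'_>], node: &str, iteration: u32, step: &str) -> bool {
    let mut seen = false;
    for r in results
        .iter()
        .filter(|r| r.node == node && r.iteration == iteration && r.step == step)
    {
        if !r.passed {
            return false;
        }
        seen = true;
    }
    seen
}

/// Green only if every step passed in every iteration on every node.
/// An empty node list or zero iterations is never green.
pub fn gate_is_green(results: &[StepResult<'_>], nodes: &[&str], iterations: u32) -> bool {
    !nodes.is_empty()
        && iterations > 0
        && nodes.iter().all(|node| {
            (1..=iterations)
                .all(|it| STEPS.iter().all(|step| cell_green(results, node, it, step)))
        })
}
```

Under this definition a run recorded at 5 iterations cannot satisfy a 20-iteration gate, which matches the note that the reduced count must be restored before the final ship.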
6. Immediate sequence (live workstream)
- ✅ B-phase 1: `manifest` field on `AppCatalogEntry`; `load_manifests` catalog-wins merge; `manifest_dir` kept (build-source catalog manifests skipped in phase 1); unit tests. (commit `220666d3`)
- ✅ B-phase 2: `EMBED_MANIFESTS` publisher generator + round-trip guard. (`7bfbe8fe`; signing via existing ceremony, not yet flipped on for the fleet.)
- ✅ C immich proof: immich is a manifest-driven stack (immich + immich-postgres + immich-redis) installed via `install_stack_via_orchestrator`; legacy installer is now fallback-only. Live-migrated + verified on .228. Found+fixed: container_name duplicate-on-shared-PGDATA, version-digit validation, partial-fallback hardening, data_uid 100998. Canonical app_id `immich` (title+icon). (`9e6c5370`, `d5ef4573`)
- ✅ Reboot-survival: podman-restart.service enabled (startup, fleet-wide) for the podman `--restart` path. (`f160e0c4`)
- ◻ Verify on .198 (immich migration validated on .228 only so far).
- ◻ E: run the 5× gate (`ARCHY_ITERATIONS=5`, was 20×); fix until green.
- ◻ Demote this banner.
Not yet done / deliberate follow-ups: flip EMBED_MANIFESTS on for the
published catalog (then sign) to actually distribute manifests via the registry;
Phase-3 use_quadlet_backends rollout so orchestrator backends are Quadlet (not
just podman `--restart`); immich on .198.
7. Release blockers & operational gotchas (durable)
Carried forward from prior handoffs (deduped against persistent memory):
- Rootless control-plane responsiveness: slow `podman ps`/store cleanup at startup must not surface a false "no apps installed" UI. My Apps must preserve last-known apps during scanner backoff, never show empty during a transient.
- Reboot survival: gate on ≥3 (prefer 5) consecutive clean post-reboot lifecycle passes. Quadlet units under `user.slice` survive `archipelago.service` restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
- Startup patterns: wait on a socket/health, never `sleep`. Tailscale waits for its socket; Fedimint Guardian waits for Bitcoin RPC `initialblockdownload:false` before launching fedimintd (proxy/wait companion on :8175 during IBD).
- Bitcoin must run full (`txindex=1`, non-pruned) for ElectrumX/mempool.
- Adoption: match existing containers by name and adopt without recreate; record a migration version in app state; preserve Nostr signer bridges (IndeeHub needs `/nostr-provider.js` served, not just port reachability).
- Image presence: use bounded targeted `podman image inspect`, not `podman image exists` (avoids store-walk stalls).
- Companion rebuilds: `companion.rs` must rebuild `:latest` when the build context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never reach nodes. `:local` is a manual override, never auto-rebuilt.
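The "wait on a socket/health, never `sleep`" rule from the startup-patterns item can be sketched as a bounded readiness poll. `wait_until_ready` is a hypothetical helper, not a function from the codebase; the probe closure stands in for whatever check applies (a socket connect, an RPC field such as `initialblockdownload:false`). The short pause only spaces out probes; readiness is always decided by the probe, never by elapsed time.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: poll `probe` until it reports ready or `timeout`
/// elapses. Returns true as soon as the dependency is ready and false on
/// timeout, so the caller never proceeds on a guess.
pub fn wait_until_ready(
    mut probe: impl FnMut() -> bool,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        let now = Instant::now();
        if now >= deadline {
            return false;
        }
        // Pause between probes, clamped so the deadline is not overshot.
        thread::sleep(interval.min(deadline - now));
    }
}
```

A caller would pass a probe that attempts the real dependency check and treat a `false` return as a startup failure to report, rather than launching the dependent service anyway.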
8. Roadmap
Pipeline: Feature Testing (internal) → User Testing (controlled hardware) → Beta Live (public). Hardening priorities feeding the gate:
- P0 Container app reliability — bulletproof install/health/restart/uninstall across all apps, dependency chains, multi-container stacks.
- P0 Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor hidden services, LND Connect).
- P1 LUKS2 full-partition encryption for `/var/lib/archipelago/` (AES-256-XTS, Argon2id, key from setup password + hardware salt).
- P1 Meshtastic plug-and-play parity with MeshCore.
Post-beta (deferred — do not start until gate is green): P2P encrypted
voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC
hardening; paid swarm streaming + IndeeHub source (phase4-streaming-ecash-plan.md);
Meshroller Rust-native mesh AI (meshroller-integration-design.md); dual-ecash
phases 2–6 (dual-ecash-design.md).
8b. SESSION STATE + RESUME (2026-06-21, live)
Landed + committed on main this session (newest first):
- #20 phase 3: ADOPTION PATH LIVE-VERIFIED on .228 (2026-06-21). Built v1.7.99-alpha, sideloaded binary + 7 manifests, restarted (stop/replace/start; containers survived via `--restart unless-stopped` + podman-restart.service). RPC `package.install indeedhub` → complete, orchestrator-first path adopted all 7 members (reconcile action `app_id=indeedhub-* action=NoOp`), containers stayed Up 4 days (NOT recreated): zero data/credential disruption. UI green: frontend :7778 → 200, nostr-provider.js → 200, /api/ → 200 (proves network_aliases: frontend nginx `http://api:4000` resolved on indeedhub-net). Fleet healthy (36 containers, none down).

  FRESH-CREATE PATH = FIXED + VERIFIED (2026-06-21). Deleted the legacy indeedhub orchestrator special-cases (`b73084db`, −382 lines: reconcile_indeedhub_stack, start_indeedhub_backends, the 120s dependency-DNS gate, patch_indeedhub_nostr_provider, etc.) so "indeedhub" flows through the generic install_fresh path. Then two live fixes on .228:
  1. Frontend nginx needs `capabilities: [CHOWN,DAC_OVERRIDE,SETGID,SETUID]` under the orchestrator's --cap-drop=ALL (workers died "setgid(101) failed"); manifest fix `ff8f11b8`.
  2. NOTE: manifest reload needs an archipelago restart (manifests cached at startup); a disk manifest edit alone won't take.

  RESULT: frontend fresh-creates via install_fresh, caps applied, post_install hook FIRES (copy_from_host nostr-provider.js ✅), UI 200 (/, /nostr-provider.js, /api/).

  KNOWN GAP (general hook capability, NOT blocking indeedhub): the post_install `exec` steps fail via `podman exec` from the archipelago.service systemd cgroup (crun: write cgroup.procs: Permission denied / OCI permission denied). Harmless here (image bakes the nginx config so the exec steps are no-ops; copy_from_host is the one that matters and works). FIX = wrap the hook executor's `podman exec` in a transient user scope (`systemd-run --user --scope`, like `podman_user_scope`) in core/archipelago/src/container/hooks.rs::run_podman. Do before relying on exec hooks for an app whose image does NOT pre-bake its mutations.

  PRIOR (now resolved), was: FRESH-CREATE PATH = BLOCKED (found live 2026-06-21). Removed the stateless frontend + reinstalled to exercise install_fresh → it FAILED: `orchestrator stack install indeedhub failed at app indeedhub: IndeedHub dependencies were not ready within 120s (indeedhub-api dependency DNS not ready)`, and the frontend was left down. Recovered manually on .228 (podman run w/ alias indeedhub on indeedhub-net; UI 200). ROOT CAUSE = hardcoded indeedhub orchestrator special-cases that predate + conflict with the manifest path:
  - prod_orchestrator `ensure_running` ~L1377: `app_id=="indeedhub"` → `reconcile_indeedhub_stack`, which REFUSES manifest creation when the frontend is absent (returns `Left("stack-managed")`).
  - `run_pre_start_hooks("indeedhub")` ~L2324 → `start_indeedhub_backends` → `wait_for_indeedhub_dependencies_ready(120)`: the gate that blocked install_fresh (`indeedhub_api_dependency_dns_ready` returns false while the frontend's own alias is absent + a getent transiently fails).
  - also `repair_indeedhub_network_aliases`, `patch_indeedhub_nostr_provider`, the "frontend did not stay reachable; restart" path (~L2474), `INDEEDHUB_BACKEND_*` consts, and a crash_recovery.rs indeedhub special-case.

  FIX (next, its own build/deploy/test cycle): delete these special-cases now that the manifest carries dependencies/network_aliases/post_install, and route "indeedhub" through the GENERIC install_fresh + reconcile path so the frontend fresh-creates normally (hook fires). Then re-run the destructive lifecycle on .228 (frontend recreate must succeed + run the hook), then .198, then the gate. NOTE: .228 currently runs v1.7.99-alpha (these special-cases still present); the running stack is fine (adoption NoOp); only a frontend-absent event re-triggers the bug, and the frontend is up.
- `b1eea8c0` indeedhub (#20) phase 3: CODE COMPLETE, unit-tested. 7 manifests (apps/indeedhub-{postgres,redis,minio,relay,api,ffmpeg} + apps/indeedhub frontend) + install_indeedhub_stack orchestrator-first (immich pattern). Data-preserving by construction = ADOPTION on .228: exact live hyphen container names, named volumes indeedhub-*-data, dedicated indeedhub-net + network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse live /var/lib/archipelago/secrets values (ensure_one no-ops on existing). Frontend carries the post_install nginx hook (replaces patch_indeedhub_nostr_provider; defensive since indeedhub:1.0.0 already bakes it). .228 GROUND TRUTH captured: 7 containers Up, volumes indeedhub-{postgres,redis,minio,relay}-data, network indeedhub-net; frontend nginx upstreams api:4000/minio:9000/relay:8080; image already bakes X-Frame strip + nostr-provider.js (6347B) + sub_filter. NEXT = live verify on .228: build+sideload binary, restart, package.install indeedhub → expect adoption (NoOp, no data touch), then full lifecycle. Risk: service restart SIGKILL-cascade if Quadlet not fully shipped on .228.
- `b94b61f6` `network_aliases` manifest field (ContainerConfig) + podman_client & quadlet rendering + DNS-label validation; also fixed 4 pre-existing from_manifest test failures (network_policy: archy-net invalid; bind sources outside /var/lib/archipelago). Enables indeedhub's short aliases on indeedhub-net.
- `955c54b7` hook capability (#20) phase 2: `container::hooks::run_post_install` executor (podman exec + copy_from_host w/ allowlist canonicalise + symlink-escape prefix check; best-effort/idempotent) wired into `install_fresh` after container is up (fresh-container-only). 5 unit tests; `cargo test -p archipelago` green.
- `4c1a4e59` hook capability (#20) phase 1: `LifecycleHooks`/`HookStep`/`HostCopy` schema + validate() + re-exports + 3 schema tests; also fixed 3 pre-existing `ContainerConfig` test literals missing `generated_secrets` (container crate now compiles; `cargo test -p archipelago-container` green, 53 pass).
- `f0c6b79d` immich containers named underscore (immich_server/_postgres/_redis) to match runtime lifecycle code; fixes package.stop/start/restart. immich fully migrated + verified on .228 (manifest-driven stack via orchestrator).
- `b0b54a96` immich lifecycle bats suite (tests/lifecycle/bats/immich.bats).
- `d5ef4573`/`9e6c5370`/`011081d1` immich migration (rename→immich, orchestrator-first).
- `f160e0c4` podman-restart.service enabled at startup (reboot-survival).
- `0860dfac` Services-tab UI (backends→Services, parent icons, categories sub-nav, swipe).
- `220666d3`/`7bfbe8fe` registry-manifest infra phases 1+2 (consume + EMBED_MANIFESTS publish).
- `192238cb` docs consolidation 56→28 + CLAUDE.md.
- `03a4ee1b` generated-secrets system + companion/quadlet fixes.
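The DNS-label validation mentioned for `network_aliases` can be sketched as follows. This assumes the conventional RFC 1123 label rule (1 to 63 ASCII letters, digits, or hyphens, with no hyphen at either end); the actual validator in the codebase may be stricter or looser, and `is_dns_label` and `validate_aliases` are hypothetical names.

```rust
/// RFC 1123 style label check: 1 to 63 ASCII letters, digits, or hyphens,
/// with no hyphen at the start or end. Dots and underscores are rejected.
pub fn is_dns_label(s: &str) -> bool {
    let bytes = s.as_bytes();
    !bytes.is_empty()
        && bytes.len() <= 63
        && bytes.iter().all(|b| b.is_ascii_alphanumeric() || *b == b'-')
        && bytes[0] != b'-'
        && bytes[bytes.len() - 1] != b'-'
}

/// Hypothetical manifest-level check: reject the whole alias list on the
/// first invalid entry and name the offender in the error.
pub fn validate_aliases(aliases: &[&str]) -> Result<(), String> {
    match aliases.iter().find(|a| !is_dns_label(a)) {
        Some(bad) => Err(format!("invalid network alias: {:?}", bad)),
        None => Ok(()),
    }
}
```

With this rule the short aliases listed above (`postgres`, `redis`, `minio`, `relay`, `api`) all pass, while an alias containing a dot or underscore is rejected before it reaches the container runtime.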
DONE — hook capability (#20), phases 1+2 (schema + executor + wiring):
controlled post-install hooks so indeedhub/netbird can migrate. Design:
docs/manifest-hooks-design.md. Schema, validate(), executor, and install-path
wiring all landed + green (commits 4c1a4e59/955c54b7 above). Remaining #20
phases: 3 = indeedhub migration (NEXT, below); 4 = netbird; 5 = pre_start hooks
(type exists, NOT yet executed — wire into prepare_for_start if/when needed).
NEXT — #20 phase 3, indeedhub migration: author 7 member manifests
(postgres/redis/minio/relay/api/ffmpeg + frontend) on archy-net with container-name
hostnames; frontend carries the post_install hook (strip X-Frame-Options, copy
nostr-provider.js, inject script, nginx reload — see patch_indeedhub_nostr_provider
in install.rs:68 for exact ops); wire install_indeedhub_stack orchestrator-first;
generated_secrets: indeedhub-db-password/indeedhub-jwt/indeedhub-minio-password
(reuse live values); preserve hardcoded AES_MASTER_SECRET literal + minio user
"indeeadmin". Then netbird (assess its setup steps). Then single-container legacy
apps (add to uses_orchestrator_install_flow allowlist in install.rs + verify each).
Then the lifecycle gate (#6) — needs harness hardening (#18) + .228 bitcoin synced.
Test/deploy facts: .228 = archi resilience node, UI/RPC pw password123 (https), SSH pw archipelago. Lifecycle harness runs from .116: `cd tests/lifecycle && ARCHY_HOST=192.168.1.228 ARCHY_SCHEME=https ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 ./run.sh <suite>`. RPC trigger: auth.login (sets session + csrf cookies) → send csrf cookie value as `X-CSRF-Token` header. package.install needs `{"id":"<app>","dockerImage":"<any-valid-image>"}` (dockerImage required even for stacks). Rust workspace root = `core/`. Linker `undefined hidden symbol` → rebuild with `CARGO_INCREMENTAL=0`. immich on .228: app_id `immich`, containers immich_server/immich_postgres/immich_redis, data dir owner 100998:100998.
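The RPC trigger described above (echo the CSRF cookie value as a header, then send the install params) might look like the sketch below. It only builds the header and the params object; the HTTP transport and the RPC envelope are omitted because they are not specified here. The cookie name `csrf_token` is an assumption, as are the helper names; only the `X-CSRF-Token` header and the `id`/`dockerImage` fields come from the notes above.

```rust
/// Extract one cookie value from a `Cookie`-style string such as
/// "session=abc; csrf_token=xyz".
pub fn cookie_value<'a>(cookies: &'a str, name: &str) -> Option<&'a str> {
    cookies
        .split(';')
        .filter_map(|pair| {
            let (k, v) = pair.trim().split_once('=')?;
            (k == name).then_some(v)
        })
        .next()
}

/// Build the header that echoes the CSRF cookie back to the server.
/// ASSUMPTION: the cookie is named "csrf_token"; check the real name.
pub fn csrf_header(cookies: &str) -> Option<(&'static str, String)> {
    cookie_value(cookies, "csrf_token").map(|v| ("X-CSRF-Token", v.to_string()))
}

/// Build the `package.install` params object. `dockerImage` is always
/// included because the notes say it is required even for stacks.
/// Only quotes and backslashes are escaped, which is enough for typical
/// app ids and image references but is not a general JSON encoder.
pub fn install_params(app_id: &str, docker_image: &str) -> String {
    let esc = |s: &str| s.replace('\\', "\\\\").replace('"', "\\\"");
    format!(
        r#"{{"id":"{}","dockerImage":"{}"}}"#,
        esc(app_id),
        esc(docker_image)
    )
}
```

In a real client the session cookie from auth.login would be sent alongside this header on the same request.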
9. Documentation map (what survives)
This master plan is the hub. Authoritative standalone docs (linked above), kept:
- Design: `architecture.md`, `app-developer-guide.md`, `APP-PACKAGING-MIGRATION-PLAN.md`, `registry-manifest-design.md`, `marketplace-protocol.md`, `dht-distribution-design.md`, `multi-node-architecture.md`, `rust-orchestrator-migration.md`, `bulletproof-containers.md`, `three-mode-ui-design.md`, `dual-ecash-design.md`, `meshroller-integration-design.md`, `phase4-streaming-ecash-plan.md`, `adr/*`.
- Reference: `app-manifest-spec.md`, `api-reference.md`, `developer-guide.md`, `operations-runbook.md`, `troubleshooting.md`, `user-walkthrough.md`, `bitcoin-rpc-relay.md`, `security-code-audit-2026-03.md`, `GAMEPAD-NAV.md`, `SEED-VERIFICATION.md`, `hotfix-process.md`, `app-registry-status-2026-06-21.md`.
All dated handoffs/resumes/transcripts/superseded trackers were consolidated here and removed (recoverable via git) on 2026-06-21.