archy/docs/PRODUCTION-MASTER-PLAN.md
archipelago 8bdc857911 docs(#20): indeedhub phase 3 adoption path live-verified on .228
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 16:23:09 -04:00

16 KiB
Raw Blame History

🚩 PRODUCTION MASTER PLAN — Archipelago App Platform & Registry

THIS IS THE AUTHORITATIVE PLAN. Agents: read this first and keep it open until the production test gate (§5) is green. It overrides ad-hoc direction and supersedes all prior roadmap/handoff/status docs. When the gate passes, remove the priority banner and demote this doc.

Last updated: 2026-06-21 · Binary: v1.7.99-alpha


1. The North Star

Make Archipelago a world-class, developer-ready app platform where:

  1. Every app is manifest-driven — install/run/update/uninstall needs only the app's manifest (+ catalog entry). Zero OS-level code reliance: no per-app Rust installers, no sudo mkdir/chown, no host provisioning.
  2. Manifests are distributed via the (signed) registry, not baked into the binary OTA as disk files. Bumping/adding an app = a signed catalog change.
  3. Third-party developers can build and ship apps via an external registry — a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation), not a gatekept central store. archy app validate/render/install/test tooling.
  4. The platform stays rootless, secure-by-default, elegant, robust, and 100%-uptime-capable (reboot-survivable, self-healing, no data loss on migrate).

Definition of done: the production test gate (§5) is green for the app set on real nodes. Until then, this plan is the priority.

2. Invariants (never violate)

  • Rootless Podman only. No rootful, no Docker-socket mounts, no privileged containers unless explicitly approved. (ADR-001, ADR-009.)
  • No app-specific business logic in the Rust backend. The orchestrator owns the lifecycle state machine; apps are declarative. Legacy install_immich_stack (hardcoded podman run + sudo chown) is the anti-pattern being deleted.
  • Secrets are manifest-declared (generated_secrets, materialised by container::secrets 0600/rootless, idempotent + self-healing) — never hardcoded, per-app, or logged. Replaces the deleted ensure_fmcd_password.
  • Migrations never destroy data. Preserve /var/lib/archipelago/<app>, generated secrets, displayed credentials, public ports, and adoption container names. Always provide a rollback path. Stop/recreate only when necessary.
  • Verify on a real node (.228, then .198) before any tag.

3. Current state (2026-06-21)

  • ~40 apps are manifest-based and Quadlet-migrated (survive archipelago.service restart + reboot). Exhaustive per-app table: docs/app-registry-status-2026-06-21.md.
  • Legacy holdout: immich — the one app with no manifest and a hardcoded Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data. The migration proof case.
  • Manifests still travel by OTA disk rsync (apps/ → /opt/archipelago/apps). The signed catalog (app-catalog.json) currently distributes only image overrides — not full manifests. Gap closed by workstream B.
  • The 4 companions (archy-bitcoin-ui, -lnd-ui, -electrs-ui, -fedimint-ui) build from docker/<name> contexts via companion.rs, not the manifest registry — a later phase folds them in.
  • No app has passed the formal 20× production gate. That is the blocker.
# Workstream Detail doc Status
A Manifest-driven app platform — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) APP-PACKAGING-MIGRATION-PLAN.md mostly done; immich + multi-container polish remain
B Registry-distributed manifests — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback registry-manifest-design.md phases 1+2 done (node consume + opt-in publisher embed); not yet flipped on for the fleet
C Developer-ready external registry — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, archy app … tooling marketplace-protocol.md, app-developer-guide.md design exists; tooling + trust UX pending
D Distribution backbone — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) dht-distribution-design.md phases 02 code-complete (worktree)
E Production test gate — 20× lifecycle on .228 + .198, per-app L1/L2 matrix tests/lifecycle/TESTING.md, bulletproof-containers.md never green — exit criterion

Orchestrator architecture (foundation for A/B): rust-orchestrator-migration.md (ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption scan, Quadlet rendering) and bulletproof-containers.md (the six container failure modes FM1FM6 + the desired-state-first reconciler that fixes them).

5. Production test gate (exit criterion)

An app is production-ready only when tests/lifecycle/run-20x.sh is green across the full matrix — install / UI-reachable / stop / start / restart / reinstall / reboot-survive / archipelago-restart-survive / uninstall — 20× on .228 AND .198. All 8 gate checkboxes in tests/lifecycle/TESTING.md are currently unchecked. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.

6. Immediate sequence (live workstream)

  1. B-phase 1manifest field on AppCatalogEntry; load_manifests catalog-wins merge; manifest_dir kept (build-source catalog manifests skipped in phase 1); unit tests. (commit 220666d3)
  2. B-phase 2EMBED_MANIFESTS publisher generator + round-trip guard. (7bfbe8fe; signing via existing ceremony — not yet flipped on for the fleet.)
  3. C immich proof — immich is a manifest-driven stack (immich + immich-postgres
    • immich-redis) installed via install_stack_via_orchestrator; legacy installer is now fallback-only. Live-migrated + verified on .228. Found+fixed: container_name duplicate-on-shared-PGDATA, version-digit validation, partial-fallback hardening, data_uid 100998. Canonical app_id immich (title+icon). (9e6c5370, d5ef4573)
  4. Reboot-survival — podman-restart.service enabled (startup, fleet-wide) for the podman---restart path. (f160e0c4)
  5. Verify on .198 (immich migration validated on .228 only so far).
  6. E — run the 20× gate; fix until green.
  7. ◻ Demote this banner.

Not yet done / deliberate follow-ups: flip EMBED_MANIFESTS on for the published catalog (then sign) to actually distribute manifests via the registry; Phase-3 use_quadlet_backends rollout so orchestrator backends are Quadlet (not just podman---restart); immich on .198.

7. Release blockers & operational gotchas (durable)

Carried forward from prior handoffs (deduped against persistent memory):

  • Rootless control-plane responsiveness — slow podman ps/store cleanup at startup must not surface a false "no apps installed" UI. My Apps must preserve last-known apps during scanner backoff, never show empty during a transient.
  • Reboot survival — gate on ≥3 (prefer 5) consecutive clean post-reboot lifecycle passes. Quadlet units under user.slice survive archipelago.service restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
  • Startup patterns — wait on a socket/health, never sleep. Tailscale waits for its socket; Fedimint Guardian waits for Bitcoin RPC initialblockdownload:false before launching fedimintd (proxy/wait companion on :8175 during IBD).
  • Bitcoin must run full (txindex=1, non-pruned) for ElectrumX/mempool.
  • Adoption — match existing containers by name and adopt without recreate; record a migration version in app state; preserve Nostr signer bridges (IndeeHub needs /nostr-provider.js served, not just port reachability).
  • Image presence — use bounded targeted podman image inspect, not podman image exists (avoids store-walk stalls).
  • Companion rebuildscompanion.rs must rebuild :latest when the build context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never reach nodes. :local is a manual override, never auto-rebuilt.

8. Roadmap

Pipeline: Feature Testing (internal) → User Testing (controlled hardware) → Beta Live (public). Hardening priorities feeding the gate:

  • P0 Container app reliability — bulletproof install/health/restart/uninstall across all apps, dependency chains, multi-container stacks.
  • P0 Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor hidden services, LND Connect).
  • P1 LUKS2 full-partition encryption for /var/lib/archipelago/ (AES-256-XTS, Argon2id, key from setup password + hardware salt).
  • P1 Meshtastic plug-and-play parity with MeshCore.

Post-beta (deferred — do not start until gate is green): P2P encrypted voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC hardening; paid swarm streaming + IndeeHub source (phase4-streaming-ecash-plan.md); Meshroller Rust-native mesh AI (meshroller-integration-design.md); dual-ecash phases 26 (dual-ecash-design.md).

8b. SESSION STATE + RESUME (2026-06-21, live)

Landed + committed on main this session (newest first):

  • #20 phase 3 — ADOPTION PATH LIVE-VERIFIED on .228 (2026-06-21). Built v1.7.99-alpha, sideloaded binary + 7 manifests, restarted (stop/replace/start — containers survived via --restart unless-stopped + podman-restart.service). RPC package.install indeedhubcomplete, orchestrator-first path adopted all 7 members (reconcile action app_id=indeedhub-* action=NoOp), containers stayed Up 4 days (NOT recreated) — zero data/credential disruption. UI green: frontend :7778 → 200, nostr-provider.js → 200, /api/ → 200 (proves network_aliases: frontend nginx http://api:4000 resolved on indeedhub-net). Fleet healthy (36 containers, none down). STILL TODO: fresh-create path — adoption is NoOp so install_fresh (→ post_install hook + alias rendering on a NEW container) was NOT exercised live; validate via destructive lifecycle (uninstall→reinstall or recreate one member) on .228, then .198, then the gate.
  • b1eea8c0 indeedhub (#20) phase 3 — CODE COMPLETE, unit-tested. 7 manifests (apps/indeedhub-{postgres,redis,minio,relay,api, ffmpeg} + apps/indeedhub frontend) + install_indeedhub_stack orchestrator-first (immich pattern). Data-preserving by construction = ADOPTION on .228: exact live hyphen container names, named volumes indeedhub-*-data, dedicated indeedhub-net + network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse live /var/lib/archipelago/secrets values (ensure_one no-ops on existing). Frontend carries the post_install nginx hook (replaces patch_indeedhub_nostr_provider; defensive since indeedhub:1.0.0 already bakes it). .228 GROUND TRUTH captured: 7 containers Up, volumes indeedhub-{postgres,redis,minio,relay}-data, network indeedhub-net; frontend nginx upstreams api:4000/minio:9000/relay:8080; image already bakes X-Frame strip + nostr-provider.js (6347B) + sub_filter. NEXT = live verify on .228: build+sideload binary, restart, package.install indeedhub → expect adoption (NoOp, no data touch), then full lifecycle. Risk: service restart SIGKILL-cascade if Quadlet not fully shipped on .228.
  • b94b61f6 network_aliases manifest field (ContainerConfig) + podman_client & quadlet rendering + DNS-label validation; also fixed 4 pre-existing from_manifest test failures (network_policy: archy-net invalid; bind sources outside /var/lib/archipelago). Enables indeedhub's short aliases on indeedhub-net.
  • 955c54b7 hook capability (#20) phase 2container::hooks::run_post_install executor (podman exec + copy_from_host w/ allowlist canonicalise + symlink-escape prefix check; best-effort/idempotent) wired into install_fresh after container is up (fresh-container-only). 5 unit tests; cargo test -p archipelago green.
  • 4c1a4e59 hook capability (#20) phase 1LifecycleHooks/HookStep/HostCopy schema + validate() + re-exports + 3 schema tests; also fixed 3 pre-existing ContainerConfig test literals missing generated_secrets (container crate now compiles; cargo test -p archipelago-container green, 53 pass).
  • f0c6b79d immich containers named underscore (immich_server/_postgres/_redis) to match runtime lifecycle code — fixes package.stop/start/restart. immich fully migrated + verified on .228 (manifest-driven stack via orchestrator).
  • b0b54a96 immich lifecycle bats suite (tests/lifecycle/bats/immich.bats).
  • d5ef4573/9e6c5370/011081d1 immich migration (rename→immich, orchestrator-first).
  • f160e0c4 podman-restart.service enabled at startup (reboot-survival).
  • 0860dfac Services-tab UI (backends→Services, parent icons, categories sub-nav, swipe).
  • 220666d3/7bfbe8fe registry-manifest infra phases 1+2 (consume + EMBED_MANIFESTS publish).
  • 192238cb docs consolidation 56→28 + CLAUDE.md.
  • 03a4ee1b generated-secrets system + companion/quadlet fixes.

DONE — hook capability (#20), phases 1+2 (schema + executor + wiring): controlled post-install hooks so indeedhub/netbird can migrate. Design: docs/manifest-hooks-design.md. Schema, validate(), executor, and install-path wiring all landed + green (commits 4c1a4e59/955c54b7 above). Remaining #20 phases: 3 = indeedhub migration (NEXT, below); 4 = netbird; 5 = pre_start hooks (type exists, NOT yet executed — wire into prepare_for_start if/when needed).

NEXT — #20 phase 3, indeedhub migration: author 7 member manifests (postgres/redis/minio/relay/api/ffmpeg + frontend) on archy-net with container-name hostnames; frontend carries the post_install hook (strip X-Frame-Options, copy nostr-provider.js, inject script, nginx reload — see patch_indeedhub_nostr_provider in install.rs:68 for exact ops); wire install_indeedhub_stack orchestrator-first; generated_secrets: indeedhub-db-password/indeedhub-jwt/indeedhub-minio-password (reuse live values); preserve hardcoded AES_MASTER_SECRET literal + minio user "indeeadmin". Then netbird (assess its setup steps). Then single-container legacy apps (add to uses_orchestrator_install_flow allowlist in install.rs + verify each). Then the lifecycle gate (#6) — needs harness hardening (#18) + .228 bitcoin synced.

Test/deploy facts: .228 = archi resilience node, UI/RPC pw password123 (https), SSH pw archipelago. Lifecycle harness runs from .116: cd tests/lifecycle && ARCHY_HOST=192.168.1.228 ARCHY_SCHEME=https ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 ./run.sh <suite>. RPC trigger: auth.login (sets session

  • csrf cookies) → send csrf cookie value as X-CSRF-Token header. package.install needs {"id":"<app>","dockerImage":"<any-valid-image>"} (dockerImage required even for stacks). Rust workspace root = core/. Linker undefined hidden symbol → rebuild with CARGO_INCREMENTAL=0. immich on .228: app_id immich, containers immich_server/immich_postgres/immich_redis, data dir owner 100998:100998.

9. Documentation map (what survives)

This master plan is the hub. Authoritative standalone docs (linked above), kept:

  • Design: architecture.md, app-developer-guide.md, APP-PACKAGING-MIGRATION-PLAN.md, registry-manifest-design.md, marketplace-protocol.md, dht-distribution-design.md, multi-node-architecture.md, rust-orchestrator-migration.md, bulletproof-containers.md, three-mode-ui-design.md, dual-ecash-design.md, meshroller-integration-design.md, phase4-streaming-ecash-plan.md, adr/*.
  • Reference: app-manifest-spec.md, api-reference.md, developer-guide.md, operations-runbook.md, troubleshooting.md, user-walkthrough.md, bitcoin-rpc-relay.md, security-code-audit-2026-03.md, GAMEPAD-NAV.md, SEED-VERIFICATION.md, hotfix-process.md, app-registry-status-2026-06-21.md.

All dated handoffs/resumes/transcripts/superseded trackers were consolidated here and removed (recoverable via git) on 2026-06-21.