# 🚩 PRODUCTION MASTER PLAN — Archipelago App Platform & Registry
> **THIS IS THE AUTHORITATIVE PLAN. Agents: read this first and keep it open until
> the production test gate (§5) is green.** It overrides ad-hoc direction and
> supersedes all prior roadmap/handoff/status docs. When the gate passes, remove
> the priority banner and demote this doc.
>
> Last updated: 2026-06-22 · Binary: v1.7.99-alpha · See §8b for the live resume.
---
## 1. The North Star
Make Archipelago a **world-class, developer-ready app platform** where:
1. **Every app is manifest-driven** — install/run/update/uninstall needs only the
app's manifest (+ catalog entry). **Zero OS-level code reliance**: no per-app
Rust installers, no `sudo mkdir/chown`, no host provisioning.
2. **Manifests are distributed via the (signed) registry**, not baked into the
binary OTA as disk files. Bumping/adding an app = a signed catalog change.
3. **Third-party developers can build and ship apps via an external registry**:
a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation),
not a gatekept central store, supported by `archy app validate/render/install/test` tooling.
4. The platform stays **rootless, secure-by-default, elegant, robust, and
100%-uptime-capable** (reboot-survivable, self-healing, no data loss on migrate).
**Definition of done:** the production test gate (§5) is green for the app set on
real nodes. Until then, this plan is the priority.
## 2. Invariants (never violate)
- **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged
containers unless explicitly approved. (ADR-001, ADR-009.)
- **No app-specific business logic in the Rust backend.** The orchestrator owns
the lifecycle state machine; apps are declarative. Legacy `install_immich_stack`
(hardcoded `podman run` + `sudo chown`) is the anti-pattern being deleted.
- **Secrets are manifest-declared** (`generated_secrets`, materialised by
`container::secrets` 0600/rootless, idempotent + self-healing) — never hardcoded,
per-app, or logged. Replaces the deleted `ensure_fmcd_password`.
- **Migrations never destroy data.** Preserve `/var/lib/archipelago/<app>`,
generated secrets, displayed credentials, public ports, and adoption container
names. Always provide a rollback path. Stop/recreate only when necessary.
- **Verify on a real node (.228, then .198) before any tag.**
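
The secrets invariant above (0600, idempotent, self-healing) can be sketched as follows. This is a minimal illustration, assuming one plain file per secret: `ensure_secret` and its signature are hypothetical, not the real `container::secrets` API, and reading `/dev/urandom` stands in for whatever RNG the crate actually uses.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Read, Write};
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Hypothetical helper: create `path` holding a random hex secret if it does
/// not exist yet. If it already exists, keep the value and only repair the
/// file mode. Returns `true` when a new secret was generated.
fn ensure_secret(path: &Path, num_bytes: usize) -> io::Result<bool> {
    match OpenOptions::new()
        .write(true)
        .create_new(true) // fails with AlreadyExists instead of overwriting
        .mode(0o600)
        .open(path)
    {
        Ok(mut file) => {
            let mut raw = vec![0u8; num_bytes];
            fs::File::open("/dev/urandom")?.read_exact(&mut raw)?;
            let hex: String = raw.iter().map(|b| format!("{:02x}", b)).collect();
            file.write_all(hex.as_bytes())?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => {
            // Self-heal: tighten permissions without touching the value.
            fs::set_permissions(path, fs::Permissions::from_mode(0o600))?;
            Ok(false)
        }
        Err(e) => Err(e),
    }
}
```

Calling it twice on the same path generates once and no-ops the second time, which is the property that lets a re-install or adoption reuse a live secret instead of rotating it. The sketch does not handle a crash between create and write (an empty file would be kept as-is).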
## 3. Current state (2026-06-21)
- **~40 apps are manifest-based and Quadlet-migrated** (survive
`archipelago.service` restart + reboot). Exhaustive per-app table:
`docs/app-registry-status-2026-06-21.md`.
- **Legacy holdout: immich** — the one app with **no manifest** and a hardcoded
Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data.
The migration proof case.
- **Manifests still travel by OTA disk rsync** (`apps/ → /opt/archipelago/apps`).
The signed catalog (`app-catalog.json`) currently distributes **only image
overrides** — not full manifests. Gap closed by workstream B.
- **The 4 companions** (`archy-bitcoin-ui`, `-lnd-ui`, `-electrs-ui`,
`-fedimint-ui`) build from `docker/<name>` contexts via `companion.rs`, not the
manifest registry — a later phase folds them in.
- **No app has passed the formal production gate (5× for now, was 20×).** That is the blocker.
## 4. Workstreams (each links its authoritative detail doc)
| # | Workstream | Detail doc | Status |
|---|-----------|-----------|--------|
| A | **Manifest-driven app platform** — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | `APP-PACKAGING-MIGRATION-PLAN.md` | mostly done; immich + multi-container polish remain |
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone**: signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0-2 code-complete (worktree) |
| E | **Production test gate** — 5× lifecycle on .228 + .198 (for now; was 20×), per-app L1/L2 matrix | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **never green — exit criterion** |
**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
scan, Quadlet rendering) and `bulletproof-containers.md` (the six container failure
modes FM1-FM6 + the desired-state-first reconciler that fixes them).
## 5. Production test gate (exit criterion)
An app is **production-ready** only when `tests/lifecycle/run-20x.sh` is green
across the full matrix — install / UI-reachable / stop / start / restart /
reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall —
**5× on .228 AND .198 for now** (`ARCHY_ITERATIONS=5`; temporarily reduced from
20× — restore to 20× before the final ship). All 8 gate checkboxes in `tests/lifecycle/TESTING.md`
are currently unchecked. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps,
L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.
## 6. Immediate sequence (live workstream)
1. **B-phase 1**: `manifest` field on `AppCatalogEntry`; `load_manifests`
catalog-wins merge; `manifest_dir` kept (build-source catalog manifests skipped
in phase 1); unit tests. *(commit 220666d3)*
2. **B-phase 2**: `EMBED_MANIFESTS` publisher generator + round-trip guard.
*(7bfbe8fe; signing via existing ceremony; not yet flipped on for the fleet.)*
3. **C immich proof**: immich is a manifest-driven stack (immich, immich-postgres,
immich-redis) installed via `install_stack_via_orchestrator`; legacy installer
is now fallback-only. Live-migrated + verified on .228. Found+fixed: container_name
duplicate-on-shared-PGDATA, version-digit validation, partial-fallback hardening,
data_uid 100998. Canonical app_id `immich` (title+icon). *(9e6c5370, d5ef4573)*
4. **Reboot-survival**: podman-restart.service enabled (startup, fleet-wide)
for the podman-`--restart` path. *(f160e0c4)*
5. **Verify on .198** (immich migration validated on .228 only so far).
6. **E**: run the 5× gate (`ARCHY_ITERATIONS=5`, was 20×); fix until green.
7. ◻ Demote this banner.
**Not yet done / deliberate follow-ups:** flip `EMBED_MANIFESTS` on for the
published catalog (then sign) to actually distribute manifests via the registry;
Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not
just podman-`--restart`); immich on .198.
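
The catalog-wins merge from step 1 can be pictured with a small sketch. The types and the function name `merge_catalog_wins` are illustrative assumptions, not the real `load_manifests` code, and the phase-1 rule that skips build-source catalog manifests is left out.

```rust
use std::collections::HashMap;

/// Where a manifest came from (illustrative type, not the real one).
#[derive(Debug, Clone, PartialEq)]
enum Source {
    Disk,
    Catalog,
}

/// Reduced stand-in for a parsed manifest.
#[derive(Debug, Clone, PartialEq)]
struct LoadedManifest {
    app_id: String,
    version: String,
    source: Source,
}

/// Merge disk manifests with catalog-embedded ones. On an `app_id` collision
/// the catalog entry replaces the disk entry; apps that exist only on disk
/// are kept, so disk remains the migration fallback.
fn merge_catalog_wins(
    disk: Vec<LoadedManifest>,
    catalog: Vec<LoadedManifest>,
) -> HashMap<String, LoadedManifest> {
    let mut merged = HashMap::new();
    for m in disk {
        merged.insert(m.app_id.clone(), m);
    }
    // Inserted second, so a catalog manifest overwrites a disk one.
    for m in catalog {
        merged.insert(m.app_id.clone(), m);
    }
    merged
}
```

Inserting the catalog entries last is the whole trick: the ordering encodes the precedence, so no per-app comparison logic is needed.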
## 7. Release blockers & operational gotchas (durable)
Carried forward from prior handoffs (deduped against persistent memory):
- **Rootless control-plane responsiveness** — slow `podman ps`/store cleanup at
startup must not surface a false "no apps installed" UI. **My Apps must preserve
last-known apps during scanner backoff**, never show empty during a transient.
- **Reboot survival** — gate on ≥3 (prefer 5) consecutive clean post-reboot
lifecycle passes. Quadlet units under `user.slice` survive `archipelago.service`
restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
- **Startup patterns** — wait on a socket/health, never `sleep`. Tailscale waits
for its socket; Fedimint Guardian waits for Bitcoin RPC `initialblockdownload:false`
before launching fedimintd (proxy/wait companion on :8175 during IBD).
- **Bitcoin must run full** (`txindex=1`, non-pruned) for ElectrumX/mempool.
- **Adoption** — match existing containers by name and adopt without recreate;
record a migration version in app state; preserve Nostr signer bridges
(IndeeHub needs `/nostr-provider.js` served, not just port reachability).
- **Image presence** — use bounded targeted `podman image inspect`, not
`podman image exists` (avoids store-walk stalls).
- **Companion rebuilds** — `companion.rs` must rebuild `:latest` when the build
context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never
reach nodes. `:local` is a manual override, never auto-rebuilt.
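
The "wait on a socket/health, never `sleep`" rule above amounts to bounded polling against a readiness signal. A minimal sketch using only the standard library follows; `wait_for_tcp` is a hypothetical helper name, and a real startup gate may check an RPC field (such as `initialblockdownload`) rather than a bare TCP connect.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// Poll until `addr` accepts a TCP connection or `deadline` elapses.
/// Returns `true` as soon as one connection attempt succeeds.
fn wait_for_tcp(addr: SocketAddr, deadline: Duration, poll: Duration) -> bool {
    let start = Instant::now();
    loop {
        let remaining = deadline.saturating_sub(start.elapsed());
        if remaining.is_zero() {
            return false;
        }
        // Cap each attempt so a black-holed port cannot eat the whole budget
        // in one call; `connect_timeout` rejects a zero timeout, hence the
        // check above.
        let attempt = remaining.min(Duration::from_secs(1));
        if TcpStream::connect_timeout(&addr, attempt).is_ok() {
            return true;
        }
        // Short pause between probes, never longer than the time left.
        thread::sleep(poll.min(deadline.saturating_sub(start.elapsed())));
    }
}
```

The pause between probes is not the fixed blind `sleep` the rule forbids: the function returns the moment the dependency is reachable and fails explicitly when the deadline passes.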
## 8. Roadmap
**Pipeline:** Feature Testing (internal) → User Testing (controlled hardware) →
Beta Live (public). Hardening priorities feeding the gate:
- **P0** Container app reliability — bulletproof install/health/restart/uninstall
across all apps, dependency chains, multi-container stacks.
- **P0** Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor
hidden services, LND Connect).
- **P1** LUKS2 full-partition encryption for `/var/lib/archipelago/`
(AES-256-XTS, Argon2id, key from setup password + hardware salt).
- **P1** Meshtastic plug-and-play parity with MeshCore.
**Post-beta (deferred — do not start until gate is green):** P2P encrypted
voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC
hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.md`);
Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash
phases 2-6 (`dual-ecash-design.md`).
## 8b. SESSION STATE + RESUME (updated 2026-06-22) — READ THIS FIRST ON RESUME
### Where we are — Task #20 (manifest lifecycle hooks) + indeedhub migration: DONE & 2-node verified
Manifest-driven lifecycle hooks + the IndeedHub stack migration are **complete and
live-verified on BOTH .228 and .198** (adoption + fresh-create + post_install hook
exec, stable under load). 15 commits this session: `4c1a4e59`..`e2a012d0`. Working
tree clean. The release lifecycle gate is temporarily **5×** (was 20×; `ARCHY_ITERATIONS=5`).
**Shipped (all on `main`, newest first):**
- `e2a012d0` indeedhub frontend health → `tcp:7777` (was http GET `/`; the http check
false-failed under load and the reconciler churned the frontend — fixed).
- `ff78b312` hook `exec` runs in a transient user scope
(`systemd-run --user --scope --quiet --collect podman exec …`) — fixes
"crun: write cgroup.procs: Permission denied" when exec'ing from archipelago.service.
- `ff8f11b8` indeedhub frontend caps `[CHOWN,DAC_OVERRIDE,SETGID,SETUID]` — nginx
workers died "setgid(101) failed" under the orchestrator's `--cap-drop=ALL`.
- `b73084db` DELETED the legacy indeedhub orchestrator special-cases (382 lines:
reconcile_indeedhub_stack, start_indeedhub_backends, the 120s dependency-DNS gate,
patch_indeedhub_nostr_provider, repair_indeedhub_network_aliases, INDEEDHUB_* consts)
→ "indeedhub" now uses the GENERIC install_fresh/reconcile path.
- `b1eea8c0` 7 indeedhub manifests (apps/indeedhub{,-postgres,-redis,-minio,-relay,-api,
-ffmpeg}) + `install_indeedhub_stack` orchestrator-first (immich pattern).
- `b94b61f6` `network_aliases` ContainerConfig field (podman_client + quadlet rendering,
DNS-label validated) — lets the frontend nginx reach `api:4000`/`minio:9000`/`relay:8080`
on the dedicated `indeedhub-net`.
- `955c54b7`/`4c1a4e59` #20 hooks phases 1-2: schema (LifecycleHooks/HookStep/HostCopy in
archipelago-container::manifest) + executor `container::hooks::run_post_install`
(allowlist-canonicalised copy_from_host + scoped exec), wired into `install_fresh`.
- `84031e62` gate 20×→5× (docs only: CLAUDE.md, this file, tests/lifecycle/TESTING.md).
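
The "allowlist-canonicalised `copy_from_host`" check in the hooks executor can be sketched like this. It is an assumption about the shape of the check, not the code in `container::hooks`; `resolve_allowed` and the allowlist parameter are hypothetical names.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Resolve `requested` (following symlinks and `..`) and accept it only if
/// the real path sits under one of the allowlisted roots.
fn resolve_allowed(requested: &Path, allowed_roots: &[PathBuf]) -> io::Result<PathBuf> {
    let real = requested.canonicalize()?;
    for root in allowed_roots {
        // Canonicalise the root too, so both sides are compared in the same form.
        if let Ok(real_root) = root.canonicalize() {
            // `Path::starts_with` compares whole components, so `/opt/apps-evil`
            // does not match the root `/opt/apps`.
            if real.starts_with(&real_root) {
                return Ok(real);
            }
        }
    }
    Err(io::Error::new(
        io::ErrorKind::PermissionDenied,
        format!("{} is outside the hook allowlist", real.display()),
    ))
}
```

Canonicalising before comparing is what defeats `allowed/../outside/secret` and symlink escapes; a plain string-prefix check on the requested path would accept both.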
**Design = adoption-safe + manifest-driven.** Manifests reproduce the live install exactly
so existing nodes ADOPT (NoOp) instead of recreate: hyphen container_names the runtime
already references, named volumes `indeedhub-{postgres,redis,minio,relay}-data`,
`indeedhub-net` + network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse
the live /var/lib/archipelago/secrets values (ensure_one no-ops on existing; postgres pw is
fixed at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept. The
frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + nostr-provider.js
+ sub_filter), so the post_install hook (sed X-Frame / copy nostr-provider.js / inject /
nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
guard is KEPT on purpose (beneficial; not a blocker).
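
The adopt-instead-of-recreate idea reduces to a per-container decision keyed on the container name. A toy sketch follows; the `Action` and `Observed` types and `plan` are invented for illustration, and mapping a stopped container to `Start` is an assumption (the real reconciler also compares config, health, and more).

```rust
/// What the reconciler would do for one desired container (illustrative).
#[derive(Debug, PartialEq)]
enum Action {
    NoOp,   // adopt the running container as-is
    Start,  // container exists but is not running
    Create, // nothing by that name exists: fresh create
}

/// Observed state of an existing container (illustrative).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Observed {
    Running,
    Stopped,
}

/// Decide the action for `desired_name` given the containers found on the
/// node, matched by name only.
fn plan(desired_name: &str, observed: &[(&str, Observed)]) -> Action {
    match observed.iter().find(|(name, _)| *name == desired_name) {
        Some((_, Observed::Running)) => Action::NoOp,
        Some((_, Observed::Stopped)) => Action::Start,
        None => Action::Create,
    }
}
```

This is why the manifests keep the exact container names the runtime already uses: a name mismatch turns an intended `NoOp` into a `Create` next to the live container.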
### ⛔ GATE BLOCKER 2026-06-22 — `package.stop` ignores the per-app stop grace (REAL, fleet-wide, ROOT-CAUSED)
Step 1 (sync .228 tcp-health manifest) is **DONE + verified**. Step 2 (the 5× gate) surfaced a
real, fleet-wide `package.stop` bug — **reproduced on the CLEAN, quadlet-correct .198**, so it is a
genuine product bug, not node contamination. Root cause is fully pinned (below).
**Symptom.** `package.stop <app>` returns `{"status":"stopping"}` but the container **never stops**
(`container-list` shows `running` 60s+); the gate's `wait_for_container_status … stopped 60` times
out. Hits **fedimint, electrumx, bitcoin-knots, btcpay-server, immich** (slow-to-SIGTERM apps).
`filebrowser` passes because it exits on SIGTERM in <30s.
**ROOT CAUSE (from .198 journal during a live `package.stop fedimint`):**
```
WARN quadlet: systemctl --user stop fedimint.service timed out after 45s
ERROR runtime: package.stop fedimint failed: stop_container fedimint:
podman stop -t 30 fedimint timed out after 30s: deadline has elapsed
```
The orchestrator stop path **ignores the per-app graceful-stop table** and the wrapper deadline
equals the grace:
- `archipelago::api::rpc::package::runtime::stop_timeout_secs()` defines per-app grace
(**bitcoin 600s, lnd 330s, electrumx 300s, immich_postgres 120s, fedimint/btcpay 60s**, default 30).
The **legacy** stop paths use it (runtime.rs:329/607/1060 `podman stop -t <stop_timeout_secs>`).
- The **orchestrator** path does NOT: `prod_orchestrator::stop()` → `ContainerRuntime::stop_container`
(`container/src/runtime.rs:124`) → API `PodmanClient::stop_container` hardcodes **`?t=10`**
(podman_client.rs) and the CLI fallback hardcodes **`-t 30`** (runtime.rs:128). fedimint needs 60s
but gets 10s/30s ⇒ SIGTERM grace expires; the API/CLI stop errors out and the whole stop fails →
state reverts to `running`.
- **Compounding:** `PODMAN_CLI_DEFAULT_TIMEOUT = 30s` (runtime.rs:9) wraps `podman stop -t 30`, so
the await fires **exactly** when podman would SIGKILL → "timed out after 30s" even though the kill
would land a moment later. The wrapper deadline must exceed the `-t` grace.
**FIX (two parts, design choice flagged):**
1. **Thread the per-app stop grace into the orchestrator stop path.** Either (A) move/duplicate
`stop_timeout_secs` into the `container` crate and have `stop_container` use it, (B) extend the
`ContainerRuntime::stop_container` signature to take a `grace: Duration` and have
`prod_orchestrator::stop()` compute it from the loaded manifest, or **(C, north-star-aligned)**
add a `stop_grace_secs` field to the manifest (default 30) and read it from `lm.manifest` in
`stop()`. (C) is the manifest-driven choice; bitcoin/lnd/electrumx/fedimint manifests then declare
their value. **DECISION NEEDED from owner: A/B (fast, table-based) vs C (manifest-driven).**
2. **Make the CLI/API wrapper deadline = grace + buffer** (e.g. grace + 15s) so podman's SIGKILL
completes inside the await. Apply to both `PodmanClient::stop_container` (`?t=`+HTTP timeout) and
the `runtime.rs` CLI fallback (`-t`+`PODMAN_CLI_DEFAULT_TIMEOUT`).
Add a mock-orchestrator test: a container that ignores SIGTERM for >30s must still end `stopped`.
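
Both parts of the fix can be sketched together: resolve the grace (manifest value first, then the per-app table, then the default) and derive a wrapper deadline that exceeds it. The function names below are hypothetical; the table keys are copied from the prose above and may not match real app_ids; the 15s buffer is the example value from part 2.

```rust
use std::time::Duration;

const DEFAULT_GRACE_SECS: u64 = 30;
const DEADLINE_BUFFER_SECS: u64 = 15;

/// Table fallback; values are the per-app graces quoted above.
fn table_grace_secs(app_id: &str) -> Option<u64> {
    match app_id {
        "bitcoin" => Some(600),
        "lnd" => Some(330),
        "electrumx" => Some(300),
        "immich_postgres" => Some(120),
        "fedimint" | "btcpay" => Some(60),
        _ => None,
    }
}

/// Manifest `stop_grace_secs` wins, then the table, then the default.
fn stop_grace(app_id: &str, manifest_stop_grace_secs: Option<u64>) -> Duration {
    let secs = manifest_stop_grace_secs
        .or_else(|| table_grace_secs(app_id))
        .unwrap_or(DEFAULT_GRACE_SECS);
    Duration::from_secs(secs)
}

/// The await around the stop call must outlive the grace, so that podman's
/// SIGKILL lands inside the deadline instead of racing it.
fn stop_deadline(grace: Duration) -> Duration {
    grace + Duration::from_secs(DEADLINE_BUFFER_SECS)
}

/// Arguments for the CLI fallback: `podman stop -t <grace> <container>`.
fn podman_stop_args(container: &str, grace: Duration) -> Vec<String> {
    vec![
        "stop".to_string(),
        "-t".to_string(),
        grace.as_secs().to_string(),
        container.to_string(),
    ]
}
```

With these numbers, fedimint gets `-t 60` under a 75s deadline, where the broken path used `-t 30` under a 30s deadline.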
**Build/deploy after the fix:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`
→ sideload to .228 + .198 (stop archipelago, cp binary, start) → **re-quadletize .228** (its backend
`.container` files are gone from my cascade-gate contamination — reinstall its apps so units
regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).
### ✅/⚠️ FIX SHIPPED + VALIDATED 2026-06-22 — and the gate has MORE causes than the grace bug
**Done:** the grace fix is implemented (option **C+table fallback**: manifest `stop_grace_secs`,
falling back to the `stop_grace_secs_for()` table; deadline = grace + 15s), unit-tested (3 tests green), committed
(`2dad64b2`), release-built, and **deployed to BOTH .228 and .198** (active, UI 200). Quadlet
regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
(running→exited→removed) — no regression; the deployed binary's stop path works.
**But validation revealed the gate failures are MULTI-CAUSED. The grace bug is only one of six causes:**
1. ✅ FIXED — orchestrator ignored per-app stop grace (`podman stop -t 30` spurious 30s timeout).
2. **`fedimint` is crash-looping / unhealthy on BOTH nodes** (`health_monitor: Auto-restarting
unhealthy container: fedimint`, attempt 6/10). An app that won't stay up can't be cleanly
stopped — fedimint was a *confounded* test case. Needs a fedimint-health investigation
(why is its container unhealthy / why does host port 8173 not become reachable).
`health_monitor` DOES respect `user_stopped` (health_monitor.rs:983) so that part is correct.
3. **Host-listener repair watchdog** (`prod_orchestrator`: "host listener disappeared after
startup; restarting container app_id=fedimint") restarts containers whose launch port isn't
reachable — fights any stop of a port-unreachable app.
4. ⚠️ **State-model nuance:** `vaultwarden` showed `exited` → `absent`, never `stopped`; the gate waits
for exactly `"stopped"` (`wait_for_container_status … stopped`). The `Exited→Stopped` conversion
(server.rs:1191, needs `user_stopped.contains(id)`) isn't always firing — likely an id-vs-name
key mismatch. The gate may need to accept `exited`/`absent` as terminal, or the conversion fixed.
5. ⚠️ **Grace vs gate-timeout:** `electrumx` grace is 300s; if it ignores SIGQUIT the container
only dies at the 300s SIGKILL — far past the gate's 60s wait. `-t` is a *ceiling*, so a HEALTHY
electrumx that honours SIGQUIT stops fast; an unhealthy/ignoring one blows the gate window.
Decide: trim graces, make the gate's per-app stop-wait ≥ grace, or both.
6. ⚠️ **.228 contamination** (plain podman, no quadlet units) — my cascade-gate; re-quadletize.
**Bottom line:** the grace fix is correct and shipped, but **the gate will not go green until #2-#6
are addressed**. These are pre-existing product/health issues the gate is correctly surfacing, not
regressions from this work. They need owner prioritization (esp. fedimint health, the watchdog-vs-
stop interaction, and the gate's terminal-state acceptance).
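
Two of the gate-side options from the list above (accepting `exited`/`absent` as terminal, and making the per-app stop-wait at least the grace) are simple enough to state as code. The gate itself is a shell script (`tests/lifecycle/run-20x.sh`), so this Rust sketch only illustrates the logic; both function names are hypothetical and which option to adopt is still an open decision.

```rust
/// Statuses that mean "the stop has finished" for gate purposes.
/// `exited` and `absent` are included as a proposed relaxation; today the
/// gate waits for exactly `stopped`.
fn is_terminal_stop_status(status: &str) -> bool {
    matches!(status, "stopped" | "exited" | "absent")
}

/// Per-app stop-wait for the gate: never shorter than the app's grace plus a
/// buffer, and never shorter than the gate's default wait.
fn gate_stop_wait_secs(gate_default_secs: u64, app_grace_secs: u64, buffer_secs: u64) -> u64 {
    gate_default_secs.max(app_grace_secs + buffer_secs)
}
```

With the gate's 60s default and a 15s buffer, a 30s-grace app keeps the 60s wait, while electrumx at 300s would wait up to 315s. That keeps a slow but correct stop from being reported as a failure, at the cost of a longer worst-case gate run.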
**Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain;
`bitcoin-core.container` is `.disabled-20260506`) **because my cascade-gate uninstalled its apps and
my `package.start` restore recreated them as bare `podman run --restart=unless-stopped`** without
regenerating units. Two related hardening items: (a) `package.start` should regenerate a missing
quadlet unit, not fall back to bare podman; (b) re-survey the status doc's "Quadlet-everywhere ~96%"
from `.container`-file presence + `PODMAN_SYSTEMD_UNIT`, not from "container running".
The **stop→stopped STATE reporting is correct** once the container actually stops (server.rs:1334
keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard — proven on filebrowser); the
bug is purely "container never stops", not "state not reported".
### MY-SESSION ERRATA (own it on resume)
- I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
is `ARCHY_ALLOW_DESTRUCTIVE=1` only — stop/start/restart, no uninstall/reinstall; see run-20x.sh
"Suggested release-gate invocation"). Cascade ran uninstall/reinstall on every app and, when I
killed the run mid-iteration, left bitcoin-knots/electrumx/btcpay/fedimint/immich uninstalled or
stranded. **I fully restored .228** (reinstalled bitcoin-knots with the correct image
`146.59.87.168:3000/lfg2025/bitcoin-knots:latest`; started the rest; cleared a stale
`user-stopped.json`). Verified healthy: UI 200, 35 containers, 17 apps `running`.
- Reinstall gotcha: `package.install` needs a REAL image ref in `dockerImage`; a bare app name
fails with `Invalid Docker image format`.
### NEXT STEPS (in order)
1. **DONE**: root-caused the stop-grace bug, fixed it (commit `2dad64b2`), unit-tested,
release-built, **deployed to .198 + .228**, validated no-regression (vaultwarden stops on .198).
2. **fedimint health**: why is its container unhealthy on both nodes (health_monitor restart
6/10; host port 8173 unreachable)? A crash-looping app can't pass the lifecycle gate. Likely the
real top blocker now. Same lens for any other unhealthy app surfaced by the gate.
3. **Host-listener repair vs user-stop**: the launch-port watchdog
(`prod_orchestrator`: "host listener disappeared after startup; restarting container") must NOT
restart a container the user just stopped. Check it consults `disabled`/`user_stopped`.
4. ⚠️ **Gate terminal-state acceptance** — apps end `exited`/`absent`, not always `stopped`
(Exited→Stopped conversion at server.rs:1191 needs a matching `user_stopped` key). Either fix the
conversion (id-vs-name) or have `wait_for_container_status … stopped` accept exited/absent.
5. ⚠️ **Grace vs gate-timeout** — trim over-long graces (electrumx 300s) and/or make the gate's
per-app stop-wait ≥ the app's grace.
6. **Re-quadletize .228** (backend `.container` files wiped by my cascade-gate; reinstall its apps so
units regenerate, matching .198; verify `.container` + `PODMAN_SYSTEMD_UNIT`).
7. **Run the canonical gate** `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ITERATIONS=5` (NO cascade; never kill
mid-iteration) on .198 then .228. Green = Step-2-of-plan done.
8. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman;
re-survey the status doc's quadlet % from `.container`-file presence.
9. **netbird migration (#20 phase 4)** — same pattern; assess setup steps first (TLS cert gen,
config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is
install_netbird_stack in stacks.rs).
10. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
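
For step 3, the guard the watchdog needs is small. A sketch of the intended decision follows, assuming the watchdog can see the same `user_stopped` set that `health_monitor` consults; `should_repair_restart` is a hypothetical name, and whether the current watchdog already performs this check is exactly what step 3 asks to verify.

```rust
use std::collections::HashSet;

/// Restart a container whose launch port is unreachable, unless the user
/// explicitly stopped that app.
fn should_repair_restart(
    app_id: &str,
    listener_reachable: bool,
    user_stopped: &HashSet<String>,
) -> bool {
    !listener_reachable && !user_stopped.contains(app_id)
}
```

Keying the set by the same identifier everywhere (app_id vs container name) matters here too: the suspected id-vs-name mismatch in step 4 would make this guard silently miss.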
### KNOWN ISSUES / WATCH-OUTS
- **.198 is a weak/loaded node** (load avg ~35). The generic reconcile recreates
containers it deems unhealthy; under load, false-failing health checks → churn. The
tcp-health fix (`e2a012d0`) mitigated the frontend case. If the lifecycle gate churns on
.198, look for other apps whose http health checks false-fail under load → prefer tcp.
- **Many concurrent SSH sessions to .198 wedge its sshd** (MaxStartups) — it pings but SSH
hangs for minutes. Use ONE ssh at a time to .198; `pkill -f 192.168.1.198` to clear strays.
- Hook `exec` only works in the scoped form (committed). `copy_from_host` is direct `cp`.
### DEPLOY / VERIFY FACTS (both nodes, ISO Debian, glibc 2.41 — binary built on .116 runs on both)
- **Build:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`
(~12 min, opt-level=3). Binary at `core/target/release/archipelago`. Linker
"undefined hidden symbol" → rebuild with CARGO_INCREMENTAL=0. `archipelago` is a
bin-only crate (no lib). Filtered tests: `cargo test -p archipelago --bin archipelago -- hooks quadlet`.
- **Sideload:** `scp binary $H:/tmp/archipelago-new` → `sudo systemctl stop archipelago;
sudo cp /tmp/archipelago-new /usr/local/bin/archipelago; sudo chmod +x …; sudo systemctl
start archipelago`. Containers SURVIVE the restart (--restart unless-stopped +
podman-restart.service). Binary path is /usr/local/bin/archipelago.
- **Manifests** live at /opt/archipelago/apps/<app_id>/manifest.yml (root-owned ok). The
orchestrator CACHES them at startup → **edit on disk then RESTART archipelago to reload**.
Bulk deploy: `tar czf t.tgz -C apps indeedhub indeedhub-postgres indeedhub-redis
indeedhub-minio indeedhub-relay indeedhub-api indeedhub-ffmpeg`; scp; `sudo tar xzf t.tgz
-C /opt/archipelago/apps`.
- **Nodes:** .228 = 192.168.1.228, SSH pw `archipelago`, RPC/UI pw `password123` (https).
.198 = 192.168.1.198, SSH pw `archipelago`, **RPC/UI pw `ThisIsWeb54321@`** (https). Both
have the 7-container indeedhub stack + secrets + named volumes pre-existing.
- **Trigger install via RPC:** `auth.login` (sets session+csrf cookies) → send the csrf
cookie value as `X-CSRF-Token` header → `package.install` with params
`{"id":"indeedhub","dockerImage":"<any>"}` (dockerImage required even for stacks; install
is async → returns `{"status":"installing"}`). install logs go to
/var/log/archipelago/container-installs.log (best-effort) AND journalctl -u archipelago.
- **Fresh-create test recipe:** `podman rm -f indeedhub` (stateless frontend) → package.install
indeedhub → expect install_fresh + post_install hook (all 4 steps `ok`) + UI 200 on :7778
(/, /nostr-provider.js, /api/). On adoption the frontend is NoOp (the hook does NOT run;
install_fresh is the only hook trigger).
## 9. Documentation map (what survives)
This master plan is the hub. Authoritative standalone docs (linked above), kept:
- **Design:** `architecture.md`, `app-developer-guide.md`,
`APP-PACKAGING-MIGRATION-PLAN.md`, `registry-manifest-design.md`,
`marketplace-protocol.md`, `dht-distribution-design.md`,
`multi-node-architecture.md`, `rust-orchestrator-migration.md`,
`bulletproof-containers.md`, `three-mode-ui-design.md`, `dual-ecash-design.md`,
`meshroller-integration-design.md`, `phase4-streaming-ecash-plan.md`, `adr/*`.
- **Reference:** `app-manifest-spec.md`, `api-reference.md`, `developer-guide.md`,
`operations-runbook.md`, `troubleshooting.md`, `user-walkthrough.md`,
`bitcoin-rpc-relay.md`, `security-code-audit-2026-03.md`, `GAMEPAD-NAV.md`,
`SEED-VERIFICATION.md`, `hotfix-process.md`, `app-registry-status-2026-06-21.md`.
All dated handoffs/resumes/transcripts/superseded trackers were consolidated here
and removed (recoverable via git) on 2026-06-21.