archy/docs/PRODUCTION-MASTER-PLAN.md
archipelago 67426c0d41 docs(master-plan): cascade tier wired into the gate (b7d92107)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 05:24:07 -04:00


PRODUCTION MASTER PLAN — Archipelago App Platform & Registry

SINGLE-NODE PRODUCTION GATE IS GREEN (2026-06-23): run-gate.sh 5/5 on .228, 0 failures. This remains the authoritative plan for the broader north star (manifest-driven platform, registry-distributed manifests, external marketplace), but it is no longer a hard priority banner blocking all other work. Remaining workstreams are in §6 / §8b. Next exit-criteria: multinode (docs/multinode-testing-plan.md) + workstreams B/C/D.

Last updated: 2026-06-26 · zombie-container guard + gitea launch-port fix shipped, binary 040df5ce rolled to the fleet (see §8b SESSION h). Prior: orchestrator Fix A+B (a721532f/e0343137) deployed + proven.


1. The North Star

Make Archipelago a world-class, developer-ready app platform where:

  1. Every app is manifest-driven — install/run/update/uninstall needs only the app's manifest (+ catalog entry). Zero OS-level code reliance: no per-app Rust installers, no sudo mkdir/chown, no host provisioning.
  2. Manifests are distributed via the (signed) registry, not baked into the binary OTA as disk files. Bumping/adding an app = a signed catalog change.
  3. Third-party developers can build and ship apps via an external registry — a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation), not a gatekept central store. archy app validate/render/install/test tooling.
  4. The platform stays rootless, secure-by-default, elegant, robust, and 100%-uptime-capable (reboot-survivable, self-healing, no data loss on migrate).

Definition of done: the production test gate (§5) is green for the app set on real nodes. Until then, this plan is the priority.

2. Invariants (never violate)

  • Rootless Podman only. No rootful, no Docker-socket mounts, no privileged containers unless explicitly approved. (ADR-001, ADR-009.)
  • No app-specific business logic in the Rust backend. The orchestrator owns the lifecycle state machine; apps are declarative. Legacy install_immich_stack (hardcoded podman run + sudo chown) is the anti-pattern being deleted.
  • Secrets are manifest-declared (generated_secrets, materialised by container::secrets 0600/rootless, idempotent + self-healing) — never hardcoded, per-app, or logged. Replaces the deleted ensure_fmcd_password.
  • Migrations never destroy data. Preserve /var/lib/archipelago/<app>, generated secrets, displayed credentials, public ports, and adoption container names. Always provide a rollback path. Stop/recreate only when necessary.
  • Verify on the real node .228 before any tag. (Fleet/multinode verification is a separate pass → docs/multinode-testing-plan.md.)
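The secrets invariant (generated once, stored 0600, idempotent and self-healing) can be illustrated with a minimal sketch. This is not the real `container::secrets` code, which this plan does not show: `ensure_secret`, its signature, the hex encoding, and the use of `/dev/urandom` are all assumptions made for illustration.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Write};
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Hypothetical helper: return the secret stored at `path`, generating it on first use.
/// Calling it again returns the same value (idempotent) and re-tightens permissions.
fn ensure_secret(path: &Path, len: usize) -> io::Result<String> {
    match fs::read_to_string(path) {
        Ok(existing) if !existing.trim().is_empty() => {
            // Self-heal: restore 0600 if the mode drifted, but never rotate the value.
            fs::set_permissions(path, fs::Permissions::from_mode(0o600))?;
            return Ok(existing.trim().to_string());
        }
        Ok(_) => {} // empty file: fall through and generate
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    // Assumption: random bytes come from the kernel CSPRNG and are hex-encoded.
    let mut raw = vec![0u8; len];
    File::open("/dev/urandom")?.read_exact(&mut raw)?;
    let secret: String = raw.iter().map(|b| format!("{b:02x}")).collect();

    // Request mode 0600 at creation time so the file is never briefly world-readable.
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .mode(0o600)
        .open(path)?;
    file.write_all(secret.as_bytes())?;
    // `mode` only applies when the file is created, so set it explicitly as well.
    fs::set_permissions(path, fs::Permissions::from_mode(0o600))?;
    Ok(secret)
}
```

The secret value is returned to the caller but never printed, in keeping with the "never logged" rule.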

3. Current state (2026-06-21)

  • ~40 apps are manifest-based and Quadlet-migrated (survive archipelago.service restart + reboot). Exhaustive per-app table: docs/app-registry-status-2026-06-21.md.
  • Legacy holdout: immich — the one app with no manifest and a hardcoded Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data. The migration proof case.
  • Manifests still travel by OTA disk rsync (apps/ → /opt/archipelago/apps). The signed catalog (app-catalog.json) currently distributes only image overrides — not full manifests. Gap closed by workstream B.
  • The 4 companions (archy-bitcoin-ui, -lnd-ui, -electrs-ui, -fedimint-ui) build from docker/<name> contexts via companion.rs, not the manifest registry — a later phase folds them in.
  • No app has passed the formal production gate. That is the blocker.

| # | Workstream | Detail doc | Status |
|---|---|---|---|
| A | Manifest-driven app platform: packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | APP-PACKAGING-MIGRATION-PLAN.md | mostly done; immich + multi-container polish remain |
| B | Registry-distributed manifests: catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | registry-manifest-design.md | phases 1+2 done (node consume + opt-in publisher embed); not yet flipped on for the fleet |
| C | Developer-ready external registry: 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, archy app … tooling | marketplace-protocol.md, app-developer-guide.md | design exists; tooling + trust UX pending |
| D | Distribution backbone: signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | dht-distribution-design.md | phases 0-2 code-complete (worktree) |
| E | Production test gate: 5× lifecycle on .228, per-app L1/L2 matrix; multinode is split out → multinode-testing-plan.md | tests/lifecycle/TESTING.md, bulletproof-containers.md | .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23), but this is DESTRUCTIVE-tier / ~8 core apps only; see §6c for the coverage gaps |
| F | Lifecycle perfection (cascade + progress + ALL apps): extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), tests/lifecycle/TESTING.md | IN PROGRESS (2026-06-26). Root bug FIXED: uninstall could hang → ghost/stuck-bar/reinstall-block (71cc9ac4, unbounded systemctl/podman in quadlet::disable_remove); cascade-uninstall.bats 7/7 green on .228 w/ binary ae349a75. Remaining: wire CASCADE into the canonical gate run, progress-UI truthfulness, all-apps matrix, guardian/IBD state. |

Orchestrator architecture (foundation for A/B): rust-orchestrator-migration.md (ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption scan, Quadlet rendering) and bulletproof-containers.md (the six container failure modes FM1-FM6 + the desired-state-first reconciler that fixes them).

5. Production test gate (exit criterion)

An app is production-ready only when tests/lifecycle/run-gate.sh is green across the full matrix — install / UI-reachable / stop / start / restart / reinstall / reboot-survive / archipelago-restart-survive / uninstall — 5× on .228 (ARCHY_ITERATIONS=5). The gate runs ON the node (it uses local podman/systemctl/bitcoin probes; running it via RPC from another host silently tests the runner). Multinode / fleet verification (.198 + others) is a SEPARATE plan — docs/multinode-testing-plan.md — NOT part of this single-node criterion. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.

⚠️ The 2026-06-23 5×-green is NOT the full bar. run-gate.sh runs only the DESTRUCTIVE tier (stop/start/restart/survive) over ~8 core apps; it skips uninstall/reinstall (CASCADE is gated behind ARCHY_ALLOW_CASCADE_DESTRUCTIVE, never set by the gate) and tests no install/uninstall progress UI. Real uninstall/reinstall/progress bugs (immich + grafana) were found in manual testing right after — see §6c (workstream F) for the gap and the expanded-gate plan. The true "every app, fully" criterion is F's definition-of-done, not this run.

6. Immediate sequence (live workstream)

  1. B-phase 1: manifest field on AppCatalogEntry; load_manifests catalog-wins merge; manifest_dir kept (build-source catalog manifests skipped in phase 1); unit tests. (commit 220666d3)
  2. B-phase 2: EMBED_MANIFESTS publisher generator + round-trip guard. (7bfbe8fe; signing via existing ceremony; not yet flipped on for the fleet.)
  3. C immich proof: immich is a manifest-driven stack (immich + immich-postgres + immich-redis) installed via install_stack_via_orchestrator; legacy installer is now fallback-only. Live-migrated + verified on .228. Found+fixed: container_name duplicate-on-shared-PGDATA, version-digit validation, partial-fallback hardening, data_uid 100998. Canonical app_id immich (title+icon). (9e6c5370, d5ef4573)
  4. Reboot-survival — podman-restart.service enabled (startup, fleet-wide) for the podman---restart path. (f160e0c4)
  5. E — 5× gate on .228 (ARCHY_ITERATIONS=5) is GREEN: 5/5, 0 not-ok (2026-06-23). Two real orchestrator bugs were found + fixed en route (package.stop per-app grace; package.restart phantom stack-member injection → order_present_containers, commit 92d7f52d) plus two single-shot-read probes hardened (bitcoin-knots state, immich lan_address). The single-node criterion is met.
  6. Banner demoted (this doc, 2026-06-23). Next: multinode pass + workstreams B/C/D.
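The "catalog-wins merge" from B-phase 1 can be pictured with a small sketch. The `CatalogEntry` struct below is a simplified stand-in, not the real `AppCatalogEntry`, and `merge_manifests` is a hypothetical name. The only behaviors taken from the plan are that a catalog manifest overrides the on-disk one and that build-source catalog manifests are skipped in phase 1.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a catalog entry.
#[derive(Clone, Debug)]
struct CatalogEntry {
    app_id: String,
    /// Full manifest embedded in the signed catalog, if the publisher included one.
    manifest: Option<String>,
    /// True when the manifest builds its image from a local source context.
    build_source: bool,
}

/// Merge on-disk manifests with catalog manifests; the catalog wins on conflict.
fn merge_manifests(
    disk: BTreeMap<String, String>,
    catalog: &[CatalogEntry],
) -> BTreeMap<String, String> {
    let mut merged = disk; // disk manifests are the fallback
    for entry in catalog {
        if entry.build_source {
            continue; // phase 1: build-source catalog manifests are skipped
        }
        if let Some(manifest) = &entry.manifest {
            // Catalog wins: overwrite any disk manifest for the same app.
            merged.insert(entry.app_id.clone(), manifest.clone());
        }
    }
    merged
}
```

An entry with no embedded manifest leaves the disk copy in place, which matches the "disk = migration fallback" role described for workstream B.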

Multinode / fleet verification (.198 and the rest) is split into its own plan: docs/multinode-testing-plan.md. Do it AFTER the .228 single-node gate is green.

Not yet done / deliberate follow-ups: flip EMBED_MANIFESTS on for the published catalog (then sign) to actually distribute manifests via the registry; Phase-3 use_quadlet_backends rollout so orchestrator backends are Quadlet (not just podman---restart).

6b. Post-deploy task order (agreed 2026-06-23)

After the 2026-06-23 multinode test deploy (latest backend + UX frontend to .116/.198/.228 + Tailscale testers), do these IN ORDER:

  1. netbird #20 ph4 — the last real manifest migration (workstream A).
  2. Phase-3 use_quadlet_backends — orchestrator backends become Quadlet units.
  3. §6c Lifecycle perfection (workstream F) — the comprehensive uninstall/reinstall + progress-UI + all-apps gate expansion below.

6c. Lifecycle perfection — what "green" MISSED (workstream F, the perfection bar)

Why this exists: the 2026-06-23 single-node gate went 5×-green but is NOT the "every app fully lifecycle-tested" guarantee a user reasonably assumes. The canonical gate (run-gate.sh) only runs the DESTRUCTIVE tier (stop / start / restart / survive) over ~8 core apps (bitcoin-knots, btcpay, electrumx, lnd, mempool, immich, fedimint, filebrowser). It explicitly SKIPS uninstall/reinstall (the CASCADE tier is gated behind ARCHY_ALLOW_CASCADE_DESTRUCTIVE, which run-gate.sh never sets) and has zero coverage for the other ~30 apps (grafana, jellyfin, vaultwarden, penpot, nextcloud, photoprism, uptime-kuma, homeassistant, … — see app-registry-status-2026-06-21.md). So uninstall, reinstall, install-progress UI, and most apps were never under test.

Real bugs found in manual multinode testing on .198 (2026-06-23) — the motivating evidence:

  • Uninstall is broken for immich + grafana: takes very long, the progress bar sits at a solid full-red with no real progression, and the app does not actually uninstall — it still appears in My Apps afterward (ghost entry / state not cleared).
  • grafana reinstall just stops partway (no completion, no clear error).
  • fedimint guardian suddenly showed "starting up — Guardian opens a wait page until Bitcoin finishes initial sync" / "starting" on that node — verify this is correct wait-for-IBD behavior vs a stuck/false state (it's a backend that depends on bitcoin sync).

2026-06-26 — root cause of the immich/grafana uninstall trio FOUND + FIXED (71cc9ac4). Single cause: quadlet::disable_remove() (first op in uninstall teardown, via companion + orchestrator) ran systemctl --user stop / daemon-reload / podman rm -f with no timeout. On rootless podman a generated unit can wedge "deactivating" while podman hangs → systemctl stop blocks forever → the spawned uninstall task returns neither Ok nor Err, so (a) set_uninstall_stage never fires → frozen full-red bar, (b) remove_package_state_entry never runs → ghost stuck in Removing, (c) the install guard rejects reinstall (already Removing). The spawn wrapper already reverts state on Err/removes on Ok — only a hang stranded it. Fix bounds all three calls (stop→QUADLET_STOP_TIMEOUT + SIGKILL/reset-failed escalation; daemon-reload→30s; podman rm→timeout). Validated live: cascade-uninstall.bats 7/7 on .228 (binary ae349a75) — grafana install → uninstall (no ghost, data dir gone) → reinstall → running → cleanup. NOTE: proves the happy path + no-regression; the original hang was load/timing-induced and not separately reproduced.
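The core idea of the fix, never waiting on an external command without a deadline, can be sketched with the standard library alone. This is an illustration of the technique, not the code in 71cc9ac4: `run_bounded` and the `Bounded` enum are hypothetical names, and the systemd-specific escalation (reset-failed) is not modeled here.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of a command that is not allowed to block forever.
#[derive(Debug)]
enum Bounded {
    Exited(ExitStatus),
    TimedOut,
}

/// Hypothetical helper: run `cmd`, polling for exit until `timeout` elapses.
/// On timeout the child is killed and reaped so the caller always gets an answer.
fn run_bounded(mut cmd: Command, timeout: Duration) -> io::Result<Bounded> {
    let mut child = cmd.stdin(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Bounded::Exited(status));
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // SIGKILL on Unix; ignore "already exited"
            child.wait()?;        // reap so no zombie is left behind
            return Ok(Bounded::TimedOut);
        }
        thread::sleep(Duration::from_millis(50));
    }
}
```

With a wrapper like this, a wedged stop surfaces as `TimedOut`, which the caller can map to an error so the existing revert-on-Err path runs instead of the task hanging.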

Workstream F scope — the gate must grow to (in priority order):

  1. CASCADE tier in the canonical gate: uninstall → verify the app is GONE from My Apps / container-list / package state (no ghost), data preserved per policy, then reinstall → verify it returns healthy. Catch the immich/grafana ghost + reinstall-stops bugs. (DONE b7d92107: run-gate.sh now runs ONE cascade pass after the 5× loop when ARCHY_GATE_CASCADE=1 (+ARCHY_ALLOW_DESTRUCTIVE=1), counted into the tally. It is opt-in so default behavior is unchanged, and deliberately NOT folded into all 5 iterations. cascade-uninstall.bats 7/7 on .228. Next: extend cascade coverage beyond the single throwaway app to the multi-container stacks, e.g. an immich/btcpay cascade variant.)
  2. Progress-UI assertions: install AND uninstall must report monotonic, truthful progress (not a stuck full-red bar); a long op must surface a real stage/percentage and a terminal success/failure — no silent hang. (Likely both a backend progress-event fix AND a UI fix.)
  3. ALL-apps coverage: a generic per-app lifecycle matrix (install / UI-reach / stop / start / restart / uninstall / reinstall / reboot-survive) driven by the manifest set, so grafana and the ~30 uncovered apps are gated too — not just the 8 core. Manifest-driven, so new apps are covered automatically.
  4. Guardian/IBD-dependent states: assert that "waiting for bitcoin sync"-style states are a legitimate, surfaced wait (with a path to ready) and never a permanent stuck state.
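The progress-UI requirement in item 2 (monotonic progress that ends in a terminal success or failure) could be asserted over a recorded event stream roughly as follows. The `ProgressEvent` shape and `check_progress` are assumptions for illustration; the real progress events are not specified in this plan.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Terminal {
    Success,
    Failure,
}

/// Hypothetical progress event: a percentage plus an optional terminal marker.
#[derive(Clone, Copy, Debug)]
struct ProgressEvent {
    percent: u8,
    terminal: Option<Terminal>,
}

/// Check that progress never goes backwards and that the stream ends with
/// exactly one terminal event. A stream that just stops is reported as an error.
fn check_progress(events: &[ProgressEvent]) -> Result<Terminal, String> {
    let mut last = 0u8;
    for (i, ev) in events.iter().enumerate() {
        if ev.percent > 100 {
            return Err(format!("event {i}: percent {} out of range", ev.percent));
        }
        if ev.percent < last {
            return Err(format!("event {i}: progress went backwards ({last} -> {})", ev.percent));
        }
        last = ev.percent;
        if let Some(outcome) = ev.terminal {
            if i + 1 != events.len() {
                return Err(format!("event {i}: more events after terminal state"));
            }
            return Ok(outcome);
        }
    }
    Err("stream ended without a terminal success/failure".to_string())
}
```

A bar that reaches 100 and then sits there without a terminal event, the stuck full-bar symptom, fails this check.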

Definition of done for F: the expanded gate (CASCADE + progress + all-apps) is 5×-green on .228, then re-verified across the multinode fleet — i.e. an insanely-perfect OS/container environment where every app installs, runs, updates, uninstalls, and reinstalls cleanly with honest progress, no ghosts, no data loss, reboot-survivable.

7. Release blockers & operational gotchas (durable)

Carried forward from prior handoffs (deduped against persistent memory):

  • Rootless control-plane responsiveness — slow podman ps/store cleanup at startup must not surface a false "no apps installed" UI. My Apps must preserve last-known apps during scanner backoff, never show empty during a transient.
  • Reboot survival — gate on ≥3 (prefer 5) consecutive clean post-reboot lifecycle passes. Quadlet units under user.slice survive archipelago.service restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
  • Startup patterns — wait on a socket/health, never sleep. Tailscale waits for its socket; Fedimint Guardian waits for Bitcoin RPC initialblockdownload:false before launching fedimintd (proxy/wait companion on :8175 during IBD).
  • Bitcoin must run full (txindex=1, non-pruned) for ElectrumX/mempool.
  • Adoption — match existing containers by name and adopt without recreate; record a migration version in app state; preserve Nostr signer bridges (IndeeHub needs /nostr-provider.js served, not just port reachability).
  • Image presence — use bounded targeted podman image inspect, not podman image exists (avoids store-walk stalls).
  • Companion rebuilds: companion.rs must rebuild :latest when the build context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never reach nodes. :local is a manual override, never auto-rebuilt.
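The "wait on a socket/health, never sleep" pattern from the list above amounts to polling a readiness condition against a deadline instead of pausing for a fixed time. A minimal sketch for a TCP port, with `wait_for_tcp` and its intervals chosen purely for illustration:

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: return true as soon as `addr` accepts a TCP connection,
/// or false once `deadline` has elapsed without a successful connect.
fn wait_for_tcp(addr: SocketAddr, deadline: Duration) -> bool {
    let start = Instant::now();
    loop {
        // Each attempt is itself bounded so one slow connect cannot stall the loop.
        if TcpStream::connect_timeout(&addr, Duration::from_millis(250)).is_ok() {
            return true;
        }
        if start.elapsed() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(100)); // short poll interval, not a blind wait
    }
}
```

The caller proceeds the moment the dependency is ready and gets a definite `false` when it is not, so a slow start and a dead dependency are both handled explicitly.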

8. Roadmap

Pipeline: Feature Testing (internal) → User Testing (controlled hardware) → Beta Live (public). Hardening priorities feeding the gate:

  • P0 Container app reliability — bulletproof install/health/restart/uninstall across all apps, dependency chains, multi-container stacks.
  • P0 Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor hidden services, LND Connect).
  • P1 LUKS2 full-partition encryption for /var/lib/archipelago/ (AES-256-XTS, Argon2id, key from setup password + hardware salt).
  • P1 Meshtastic plug-and-play parity with MeshCore.
  • P1 CODE-COMPLETE (branch companion-mobile-ux, 2026-06-23; needs on-device + mobile-web verification before merge to main) — Mobile app-launch UX — drop the "this app opens in a tab" interstitial. Two surfaces (both: no interstitial screen, launch the app directly):
    • Companion app (Android): open every app in the in-app WebView (not just non-iframeable ones) — and carry the current mobile-iframe footer controls into the WebView (back/forward/reload/close — good, useful UX).
    • Mobile web browser (PWA): open tab-apps directly in a new browser tab. Touch points: neode-ui/src/stores/appLauncher.ts, AppLauncherOverlay.vue, the Android in-app WebView bridge, and the mesh-mobile iframe footer controls. (Reference prior work: b5a9deb8 in-app webview for non-iframeable apps, d1fbcd9b "open in browser" via native bridge.)
    • Done (branch companion-mobile-ux): mobile launches now use the store-driven panel (no route push) so the background tab no longer changes and closing returns you where you launched; tab-only apps open directly (in-app WebView on companion via openInApp, new browser tab on PWA) with no interstitial; the Android InAppBrowser (WebViewScreen.kt) gained a bottom footer bar (back/forward/reload/open-in-browser/close) + a centered loading screen (favicon + progress); a shared AppLoadingScreen (icon + progress) replaced the black/spinner loaders on the app session and legacy iframe overlay; the dashboard is pinned to 100dvh on mobile so the mesh chat/tools panes stop sliding under the tab bar in mobile browsers (no-op in companion); ElectrumX shows its real icon in My Apps. Companion APK bumped to v0.4.7 (versionCode 11) with a committed shared debug keystore so updates install without an uninstall. Not yet: merge to main; publish the 0.4.7 companion download (deferred until the gate work lands so they ship together).

Post-beta (deferred; do not start until the gate is green): P2P encrypted voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC hardening; paid swarm streaming + IndeeHub source (phase4-streaming-ecash-plan.md); Meshroller Rust-native mesh AI (meshroller-integration-design.md); dual-ecash phases 2-6 (dual-ecash-design.md).

8b. SESSION STATE + RESUME (updated 2026-06-26) — READ §8b "CURRENT STATE + RESUME" FIRST

▶ SESSION h (2026-06-26) — LATEST, RESUME FROM HERE

Canonical resume detail: memory project_session_resume_2026_06_23b (▶️ top of MEMORY.md). Local main = 670ebb06 (3 commits past the previously-pushed 43e70049: 0a8db904 zombie guard + 670ebb06 gitea launch-port fix; 43e70049 webview was already pushed). Combined release binary 040df5ce2551d17b rolled to the fleet. Binary+FE not in git — rebuild on a fresh machine (cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago).

DONE this session:

  1. Zombie-container guard (0a8db904) — the reconciler's Running branch now verifies a container's State.Pid is alive (/proc/<pid> exists) before trusting podman's "Up"; on a concrete dead PID it stop+remove+install_fresh from the manifest. Conservative: any uncertainty (inspect fail / unparseable PID) assumes alive, so a transient hiccup never destroys a healthy container. Fixes the class that broke NetBird login on .228 (dashboard "Up" w/ dead PID → proxy 502, no host port → reconciler never recovered it). Unit test + live-proven on .228: synthetic zombie on jellyfin (killed conmon+PID → podman still "Up") → guard logged …process is dead (zombie) — recreating app_id=jellyfin → recreated → settled to NoOp. Zero false-positives across the other 33 healthy containers.
  2. Gitea launch-port fix (670ebb06) — gitea launched at :2222 (SSH) instead of :3001 (web) on nodes without the gitea manifest on disk (manifest_lan_address_for returns None → fell through to extract_lan_address, which returns podman's first-listed port; podman lists 2222->22 before 3001->3000). Added "gitea" => http://localhost:3001 to the static lan_address_for map (core/container/src/podman_client.rs) like every other core app. Reported on tailscale node 100.82.34.38 — that node still needs the new binary (or a refreshed gitea manifest) to pick it up.
  3. Rolled 040df5ce to .228/.116/.198/.89 (verified sha+active); .88/.5/.120 rolling.
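The zombie-container guard in item 1 rests on a conservative liveness check: only a concrete, parseable PID whose /proc entry is missing counts as dead, and anything uncertain counts as alive. The sketch below illustrates that rule; `pid_liveness` is a hypothetical name, the `proc_root` parameter exists only to make the check testable, and treating PID 0 as "uncertain" is an assumption rather than documented behavior.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum Liveness {
    Alive,
    Dead,
}

/// Hypothetical check: decide whether the process behind a container's reported
/// PID is still alive by looking for `<proc_root>/<pid>`.
fn pid_liveness(proc_root: &Path, state_pid: &str) -> Liveness {
    match state_pid.trim().parse::<u32>() {
        Ok(pid) if pid > 0 => match proc_root.join(pid.to_string()).try_exists() {
            Ok(true) => Liveness::Alive,
            Ok(false) => Liveness::Dead, // concrete PID, no /proc entry
            Err(_) => Liveness::Alive,   // could not check: assume alive
        },
        // Unparseable or zero PID: uncertain, so never trigger a recreate.
        _ => Liveness::Alive,
    }
}
```

In production the root would be `/proc`; only a `Dead` result would justify the stop, remove, and fresh install sequence described above.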

OPEN follow-ups (logged, NOT regressions):

  • mempool env-drift recreate-loop on .228 — reconciler logs container env drift detected — recreating app_id=mempool every ~30-90s, never converges (pre-existing; the known mempool nginx stale-IP class, project_mempool_nginx_stale_ip_fix). mempool stays running but churns.
  • nostr-rs-relay stuck "Stopping" + ~2s create-loop on .228 (from session g).

NEXT: finish .88/.5/.120 roll → push main to gitea-vps2 → Phase-3 quadlet / Workstream F / multinode. SSH/sudo pw ThisIsWeb54321@ (.88 = ThisIsWeb54321!); UI/RPC .228/.198 = ThisIsWeb54321@. Reusable tooling in scratchpad: deploy-bin.sh/remote-apply.sh (EXPECT_SHA = 040df5ce…), rpc.sh.


▶ SESSION g (2026-06-25) — earlier, historical

Canonical resume detail: memory project_session_resume_2026_06_23b + project_netbird_ph4_legacy_deletion_map + project_workstream_f_lifecycle_perfection. gitea-vps2/main = a721532f (pushed). Local main = 89d397bb (2 new commits this session, NOT pushed/deployed: 41e7f500 harness tolerance + 89d397bb netbird ph4 legacy delete). Binary+FE are NOT in git — rebuild on a fresh machine.

TL;DR (SESSION g, 2026-06-25) — everything below DONE this session:

  1. Rolled e0343137 + fresh FE (index-a75rd6Hy.js) to 7 nodes (.116/.198/.228/.89/.88/.5/.120), all verified. .15 SKIPPED (auth rejected — creds don't match).
  2. Harness tolerance fixes COMMITTED 41e7f500 (run-gate settle/immich + immich.bats 90s + mempool.bats poll).
  3. mempool RESOLVED fleet-wide — see mempool note below.
  4. netbird #20 ph4 DONE — legacy Rust installer DELETED, committed 89d397bb (492 lines gone, manifest-driven only, cargo check clean). Release binary BUILDING for the .228 live-verify (build left running — check after).

NEXT (resume here): (a) check the release build, deploy the 89d397bb binary to .228, live-verify netbird adopts via manifest (https:8087→200, no bail!); (b) roll 89d397bb to the rest of the fleet (behavior-neutral — manifest path already executed); (c) push local main → gitea-vps2 (2 commits ahead); then Phase-3 use_quadlet_backends → Workstream F → multinode.

ROLL RESULTS (2026-06-25, binary e0343137b99bf066 + fresh FE bundled):

Node Result
.228 already on e0343137 (prior session, binary-only)
.116 (local) binary + fresh FE; 36 containers survived restart; UI 200; index-a75rd6Hy.js live
.198 (LAN) binary + fresh FE; 38 containers up; UI 200
.89 (100.89.209.89) binary + fresh FE; service active
.88 (100.70.96.88, pw ThisIsWeb54321!) binary + fresh FE; service active
.5 (100.72.136.5) attempted — see resume note (cellular x250)
.120 (100.66.157.120) attempted — see resume note (cellular x250)
.15 (100.64.83.15, archy-dev-pa) SKIPPED — archipelago@ + ThisIsWeb54321@ rejected (Permission denied (publickey,password)); node creds unknown

Deploy tooling (reusable): scratchpad deploy-bin.sh <label> <local\|ssh\|ts> <host> <pw> + remote-apply.sh (mv binary avoids ETXTBSY, atomic FE swap preserving aiui/APK/claude-login.html, chown 1000:1000, restart, sha+health verify). Frontend tarball = tar -C web/dist/neode-ui -czf neode-ui.tgz . (flat). Full sha e0343137b99bf06642c45da67bb092e9a411190ff59eda8e5177c2a06b6f6e89.

Focus: validate the two UNVALIDATED-WIP orchestrator fixes (commit a721532f) on the .228 canary, then roll to the 7-node fleet.

  • Fix A — desired-state recovery: a was-running app that vanished (e.g. lost through a failed teardown + reboot) auto-recreates on reconcile, via new crash_recovery::load_last_running_names (reads running-containers.json sans PID gate) + exact container-name match in reconcile_all_with_mode. Zero false-positives (uninstalled/user-stopped excluded).
  • Fix B — recreate volume-ownership: a freshly-created bind dir for a NO-data_uid app gets chown --reference=<parent> so container-root can write → kills the immich-class recreate EACCES crash-loop. Only fresh dirs (zero regression for existing installs).
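Fix A's selection rule reduces to set arithmetic: recreate what was running last time, is missing now, and was not deliberately removed or stopped. The sketch below assumes all four sets are already loaded; `apps_to_recreate` is a hypothetical name and does not reproduce `load_last_running_names` or `reconcile_all_with_mode`.

```rust
use std::collections::BTreeSet;

/// Hypothetical selection step: names that were running at last record, have no
/// container now, and were neither uninstalled nor stopped by the user.
fn apps_to_recreate(
    last_running: &BTreeSet<String>,
    present: &BTreeSet<String>,
    uninstalled: &BTreeSet<String>,
    user_stopped: &BTreeSet<String>,
) -> Vec<String> {
    last_running
        .iter()
        // Exact name match only: a present container is never recreated.
        .filter(|name| !present.contains(*name))
        // Exclusions that keep the false-positive count at zero.
        .filter(|name| !uninstalled.contains(*name) && !user_stopped.contains(*name))
        .cloned()
        .collect()
}
```

Because the inputs are ordered sets, the result is deterministic and sorted by name, which keeps reconcile logs stable between runs.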

VALIDATION PROGRESS (sessions e→f):

  1. Release binary built — sha16 e0343137b99bf066 (differs from pre-fix f2aa2fab → fixes compiled in).
  2. cargo test -p archipelago crash_recovery: 13/13 green, incl. the two new Fix A tests.
  3. Deployed new binary to .228 canary (binary-only; FE unchanged at 435b9f92). Verified live sha e0343137, active, RPC OK. Container cgroup confirmed in user@1000.service (NOT archipelago.service) → systemctl stop is container-safe on .228.
  4. Fix A PROVEN: podman rm -f jellyfin (non-baseline, no-data_uid) → periodic ExistingOnly reconciler (30s) recreated it; journal: previously-running app has no container after boot — recreating (desired-state recovery) app_id=jellyfin.
  5. Fix B PROVEN — fresh package.install uptime-kuma (no-data_uid, no prior data dir) → bind dir chowned to parent owner 1000:1000 (NOT root:root), state=running, RestartCount=0, no EACCES, app wrote its own subdirs → clean uninstall (container+data-dir gone). all-apps matrix read-only 5/5 (17 apps).
  6. 🟡 5× DESTRUCTIVE gate on .228 — NOT yet 5/5, but failures are HARNESS-TOLERANCE FLAKES, NOT Fix A/B regressions (proven: Fix A logged 0 desired-state-recovery firings during the failures; immich/lnd RestartCount: 0, no crashes). Under sustained 5× churn on this 34-app node a different heavy-app recovery probe slips each iteration:
    • immich lan_address (test 64): 30s probe too tight after archipelago-restart recovery. FIXED (settle_stack now waits on immich :2283 when present, cap 180→300s; test 64 deadline 30→90s). Went ok/ok/ok 3× after fix.
    • mempool orphan count (test 82): single-shot count caught a transient extra container mid-recreate (clears to 3=3). FIXED locally (poll for steady-state ≤30s) — fix is in local tests/lifecycle/bats/mempool.bats, NOT yet re-gated.
    • lnd getinfo recovers after restart (test 77): already has a generous 240s deadline; peak concurrent load occasionally beats it. lnd itself HEALTHY (wallet unlocked — "wallet already unlocked, WalletUnlocker no longer available", RestartCount 0). Likely needs deadline bump or lnd added to within-iteration tolerance. NOT yet fixed.
    • NOTE: the 300s settle bump made iterations very long (iter2=1062s) and a diagnostic run wedged in iter3; killed it. Re-think settle (maybe per-app readiness with shorter caps) before the next run.
  7. DECISION RESOLVED (2026-06-25): user chose (B) roll now AND bundle the fresh UX frontend (per feedback_deploy_targets_and_ux_bundle). Gate load-robustness deferred to a separate hardening pass.
  8. ROLLED e0343137 + fresh FE (index-a75rd6Hy.js) to .116/.198/.89/.88/.5/.120 (.228 already on it) — all verified sha=e0343137, service active. .15 skipped (auth reject). See roll table above.
  9. Harness fixes COMMITTED 41e7f500 (no longer uncommitted).
  10. netbird #20 ph4 — legacy installer DELETED, committed 89d397bb. install_netbird_stack is now orchestrator-manifest → adopt → bail! (no in-Rust installer); removed 6 dead helpers + 3 NETBIRD_*_IMAGE consts + unused import (~492 lines). cargo check clean (0 warnings). Manifest path verified live pre-delete (.228 https:8087→200). Release binary BUILT: sha cccb7cfd9c38a651 (core/target/release/archipelago, supersedes e0343137) — NOT yet deployed; deploy to .228 + live-verify then roll. Map+rationale: memory project_netbird_ph4_legacy_deletion_map. Pre-existing follow-up (NOT introduced by delete): the manifest path lacks an active #10 OIDC-readiness gate — if that login race resurfaces, add an OIDC-ready gate to the netbird manifest.

2026-06-25 — STRAY 13h GATE on .228 found + killed; mempool RESOLVED. A setsid gate run from session-e was still churning .228 ~13h later (pathologically slow — only reached test 71/lnd; the 300s settle bump is the suspect). Killed its process group (note: pkill -f bats self-matches the ssh command's own argv → kill by numeric PID/PGID instead). After kill, crash_recovery (Fix A) auto-recovered the immich/indeedhub/netbird stacks — good live exercise of Fix A. mempool fallout RESOLVED: the gate churn left .228's podman overlay storage corrupt (mempool frontend crash-looped — container couldn't write /etc/nginx, same image serves fine on .116) → fixed by rebooting .228 (clears overlay corruption; Fix A staggered-recovered all apps; mempool stable 200). .198 is PRUNED bitcoin → mempool requires archival (install correctly refused) → cleanly uninstalled the orphan mempool-db. All nodes now correct. LESSON: never leave the gate running unsupervised; reconsider the 300s settle before re-running.

Fleet on e0343137 + FE index-a75rd6Hy.js on .116/.198/.228/.89/.88/.5/.120 (.15 still old). 89d397bb (netbird-delete) binary NOT yet deployed anywhere — verify on .228 then roll. SSH/sudo pw UNIFORM ThisIsWeb54321@ (.88 = ThisIsWeb54321!); UI/RPC: .228=ThisIsWeb54321@, .198=ThisIsWeb54321@. Reusable tooling in scratchpad: deploy-bin.sh/remote-apply.sh (binary+FE swap), rpc.sh <host> <pw> <method> [params] (auth.login→call). Gate harness at ~/lifecycle/lifecycle on .228 — CHECK it isn't already running/wedged before re-launching.


▶ SESSION b (2026-06-23 PM) — earlier, historical

Canonical resume detail: memory project_session_resume_2026_06_23b (▶️ top of MEMORY.md). gitea-vps2/main = 4346007d pushed; local HEAD e57514b6 (uninstall fix, committed, not pushed/deployed).

Shipped + verified live on .228 (all in 4346007d):

  • Connection-lost FULLY fixed — companion image_exists journal-flood (Stdio::null) + netbird UDP-port reconcile churn (wait_for_manifest_host_ports tcp-only). .228: flood→0, ws/db→0 disconnects, load 3.95→2.26.
  • netbird → manifest-driven (#20 ph4) — 3 manifests + 4 orchestrator primitives (base64 secret, GeneratedCert+ensure_manifest_certs, templated-file render {{HOST_IP}}/{{NETWORK_GATEWAY}}/{{secret:}}, udp port protocol). Live: https 8087→200, OIDC→200, resolver=gateway. Legacy-Rust delete deferred to post-full-verify.
  • registry-manifest flip (code): EMBED_MANIFESTS default-on, main.rs bounded pre-load refresh_catalog. Catalog regenerated w/ 52 embedded manifests but NOT published (gitignored + never committed; publish = force-add to gitea-vps2 main). Do after fleet binary roll.
  • UX regression root-caused + fixed — the mobile/desktop UX (loader/AppLoadingScreen, store-driven launch, app icons, android webview footer) was on companion-mobile-ux and never merged to main, so any main build silently dropped it. Merged → main, frontend redeployed to .228. Android 0.4.9/code13 pushed for user to build APK elsewhere.

In progress — Workstream F lifecycle bugs (this §, user-picked next):

  • uninstall ghost — FIXED + pushed (e57514b6) + DEPLOYED to .228. handle_package_uninstall returned Err on any cleanup-residue failure before removing the package state entry → ghost in My Apps + revert-to-Installed. Now: split container vs cleanup errors; remove state entry as soon as containers gone (before slow data rm). LIVE-VERIFY IN PROGRESS: fresh grafana (not previously installed → no data risk) install→uninstall→reinstall on .228; install was mid image-pull at handoff. RPC recipe + caution in memory project_session_resume_2026_06_23b.
  • #15 fedimint guardian — RESOLVED, not stuck (legit until IBD-gate → setup wizard now bitcoin synced; no code change).
  • #14 grafana reinstall-stops — verify in the same grafana test (likely same root cause as #13).
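The ordering behind the uninstall-ghost fix (treat container removal and data cleanup as separate failures, and drop the state entry as soon as the containers are gone) can be sketched as below. This is not `handle_package_uninstall` itself: the `uninstall` function, the `Uninstall` enum, and the closure-based steps are hypothetical stand-ins.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum Uninstall {
    Clean,
    /// Containers and state entry are gone, but some data cleanup failed.
    Residue(String),
}

/// Hypothetical uninstall flow. A container failure keeps the state entry;
/// a cleanup failure is reported but does not resurrect the app.
fn uninstall<C, D>(
    installed: &mut BTreeSet<String>,
    app_id: &str,
    remove_containers: C,
    remove_data: D,
) -> Result<Uninstall, String>
where
    C: FnOnce() -> Result<(), String>,
    D: FnOnce() -> Result<(), String>,
{
    // Containers still present means the app is still installed: bail out early.
    remove_containers()?;
    // Containers are gone, so clear the state entry before any slow cleanup.
    installed.remove(app_id);
    match remove_data() {
        Ok(()) => Ok(Uninstall::Clean),
        Err(residue) => Ok(Uninstall::Residue(residue)),
    }
}
```

Returning `Residue` rather than `Err` is the design point: the caller can log or retry the cleanup without the app reappearing in My Apps.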

Next: finish grafana uninstall/reinstall live-verify on .228 → roll the new binary to the rest of the fleet (.116/.198/.5/.120 still on old binary) → publish embedded catalog (#8) → finish Workstream F (gate CASCADE+progress+all-apps expansion) → Phase 3 Quadlet → multinode. WATCH: main.rs pre-load refresh_catalog (≤25s) slows startup — sanity-check startup→RPC-ready isn't egregious on the fleet roll.


▶ CURRENT STATE + RESUME (2026-06-23) — earlier session-a baseline (historical)

HEADLINE (2026-06-23): single-node gate GREEN (run-gate.sh 5/5 on .228, 0 not-ok) + multinode test deploy DONE to 6 nodes. The exit criterion (§5) is met. Green took fixing two real orchestrator bugs (package.stop per-app grace, 2026-06-22; package.restart phantom stack-member injection, 2026-06-23 — order_present_containers, commit 92d7f52d) plus hardening two single-shot probes (bitcoin-knots state, immich lan_address). All work is committed + PUSHED to gitea-vps2 (146) main @ ccb594fb — the local-only state is resolved. Binary = release sha 5472c575….

▶ DEPLOY STATE (latest backend 5472c575 + UX frontend + one-tap companion APK) — 2026-06-23:

| Node | Pw | Done | Notes |
|---|---|---|---|
| .116 (local, http:80) | ThisIsWeb54321@ | | dev node: bitcoin mid-IBD + http-only |
| .198 | archipelago | | resilience; user manual-testing here |
| .228 | archipelago | | canonical gate node (5×-green) |
| 100.82.34.38 (archipelago-1) | archipelago | | |
| 100.89.209.89 (archy-x250-pa) | ThisIsWeb54321@ | | |
| 100.70.96.88 (archipelago node) | ThisIsWeb54321! | | note the ! |
| 100.64.83.15 (archy-dev-pa) | ? | | UP (tailscale ping ok) but ThisIsWeb54321@ REJECTED; need correct pw |
| 100.66.157.120 (archy-x250-exp) | ThisIsWeb54321@ | ⏭️ | DOWN; user said leave it |

Deploy scripts saved in scratchpad: deploy-node.sh (full binary+FE, sha+health verify) and fe-only.sh (FE-only, no archipelago restart). Reusable: bash deploy-node.sh <host> <pw> <scheme> 127.0.0.1.

▶ COMPANION APK fixed (other agent's commit 5c43e127 + my reconcile): QR + download were a zip-wrapped .apk.zip (forced unzip). Now serve raw archipelago-companion.apk (one-tap) from the 146 raw URL; CompanionIntroOverlay.vue + ship/publish scripts repointed; old .zip dropped. The OLD .apk.zip URL now 404s, so EVERY node was FE-refreshed to the new build (all 6 verified / : 200 + bundle references archipelago-companion.apk).

▶ MANUAL-TEST BUGS FOUND on .198 → workstream F (§4/§6c). The green gate is DESTRUCTIVE-tier / ~8 core apps; it SKIPS uninstall/reinstall and has no progress-UI / all-apps coverage. Real bugs: immich+grafana uninstall hangs at a solid full-red bar + leaves a ghost in My Apps (doesn't actually remove); grafana reinstall stops; fedimint guardian shows "waiting for bitcoin sync" (verify legit vs stuck). These motivate workstream F (cascade + progress + all-apps gate). Also added §10: investigate TanStack-Query/push-based state mgmt for neode-ui (the state-drift root cause behind the stuck bar + ghosts).

▶ NEXT — agreed task order (do IN ORDER, see §6b):

  1. netbird #20 ph4 — last real manifest migration.
  2. Phase-3 use_quadlet_backends — orchestrator backends → Quadlet units.
  3. §6c workstream F — cascade/uninstall + progress-UI + ALL-apps gate; fix the immich/grafana uninstall + ghost-My-Apps + reinstall-stops bugs to a 5×-green; then §10 state-mgmt investigation.
  4. Multinode pass: docs/multinode-testing-plan.md (the 6 deployed nodes are ready for manual testing now).

▶ LOOSE ENDS / gotchas for the resuming session:

  • neode-ui/src/components/AppLoadingScreen.vue is UNTRACKED on .116 — the other agent created it but NO committed code imports it (orphan, not in e825bbed). Left in place; decide whether to wire it in or delete. Not deployed (committed UX doesn't reference it).
  • gitea-local mirror (localhost:3000) push is BROKEN (token redirects to /login); push to gitea-vps2 works and is primary. Reconcile the local mirror token if you need it.
  • Don't delete bitcoin/electrum data (user directive) — run only the DESTRUCTIVE gate (run-gate.sh default; never set ARCHY_ALLOW_CASCADE_DESTRUCTIVE on real nodes with synced chains).
  • .198 gate not run this session (user was manual-testing there + restarting). .116 gate ran but failed 12 tests — ALL environmental (.116 is http-only → ui-coverage hardcodes https://; + bitcoin mid-IBD → bitcoin/lnd preconditions). NOT product regressions. gate-116.log on .116.

(historical resume notes for the 5× chase below — superseded by the green result above)

Headline (2026-06-22): the production gate's package.stop blocker is FIXED; .228 is 1×-GREEN (110/110); a fresh 5× run is IN PROGRESS on .228 (the single-node exit criterion) after a real mempool bug found + fixed (below). The gate is now single-node (.228); multinode is split out (docs/multinode-testing-plan.md). The gate is canonically 5× now — run-gate.sh (the 20x naming/script was removed 2026-06-22, commit 57a013bc).

2026-06-22 (late): mempool stale-IP bug FOUND + FIXED (real production bug, not a flake). The 1st 5× attempt failed iteration 1 on #74 mempool api backend remains queryable. Root cause was NOT timing: the frontend nginx pinned mempool-api's IP at startup (no resolver); after the gate restarts mempool-api (new podman IP) nginx 502s and the UI shows "offline". Fixed in mempool-frontend:v3.0.1 (resolver+variable proxy_pass; see [[project_mempool_nginx_stale_ip_fix]] / docker/mempool-frontend/), pushed to vps2, manifests bumped 3.0.0→3.0.1, deployed + resilience-verified live on .228 (backend restart now auto-recovers). Also fixed the test itself (mempool.bats #74: 180s→300s + real fail helper). Commits 0f05f73a (fix) 57a013bc (gate rename).

THE 5× RUN IS DETACHED ON .228 — survives terminal/session close. Check it from any machine:

sshpass -p archipelago ssh archipelago@192.168.1.228 \
  'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x3.log; \
   echo "running pid: $(pgrep -f run-gate.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x3.log | sort -u'
  • Log: /tmp/gate-5x3.log on .228 · launched nohup · ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1, run ON the node from /tmp/lifecycle-run/tests/lifecycle via ./run-gate.sh (ARCHY_HOST=127.0.0.1). bats 1.11.1 + static jq 1.7.1 are installed on .228.
  • If all 5 iterations PASS → .228 has met the single-node criterion → demote the banner.
  • If it flakes again: readiness-under-churn (lnd/mempool); hardening in 98f4fa44 (inter-iteration settle_stack() + readiness windows). Re-copy repo tests/lifecycle to /tmp/lifecycle-run, relaunch.

▶ 2026-06-23 (morning) — 5× FINISHED 2/5; both mempool fails ROOT-CAUSED to ONE real orchestrator bug (NOT flakes) + FIXED: the overnight run finished passed: 2 / failed: 3 on gate-5x3.log, three distinct one-off fails, none repeating:

  • iter1 #5 container-list valid state for bitcoin-knots — pre-launch churn (as predicted); didn't repeat. Hardened anyway: the probe was a single-shot read; now polls ≤30s for a settled valid state so a momentary restarting/transient can't flake a 20-min iteration (bitcoin-knots.bats).
  • iter2 #74 mempool api queryable + iter5 #73 mempool stack running: SAME root cause. package.restart mempool resolves its container list via ordered_containers_for_start, which was injecting phantom stack-member names (mysql-mempool, archy-mempool-api, archy-mempool-web: variant names from the union startup_order list that aren't live on this node). The phantom mysql-mempool is 2nd in the start order; do_orchestrator_package_start hits its unknown-app-id fallback → do_package_start inspect fails "no such object" → the ? aborts the whole start sequence, so mempool-api (pos 5) + mempool frontend (pos 8) never start. They then stayed down ~6 min until the health monitor independently recovered them → #73 (frontend not running in 180s) and #74 (api not queryable in 300s) both fail. Journal proof on .228: package.restart mempool failed: Start failed: mysql-mempool: ... no such object, 23:27:32. Fix: ordered_containers_for_start now orders only the actually-present containers and never injects phantom order entries (new pure helper order_present_containers + 3 unit tests, dependencies.rs). This is the SAME class as the mempool nginx bug (a hardcoded-name/reality mismatch) and is exactly the manifest-driven-lifecycle anti-pattern the master plan targets.
  • Deploy + relaunch: built release binary on .116, swapped /usr/local/bin/archipelago on .228 (containers live under user@1000.service, NOT the archipelago.service cgroup, so a service restart does NOT kill them — verified via conmon cgroup paths). Manually verified mempool restart keeps the stack up, then relaunched a clean 5× → see gate-5x4.log (check cmd above, swap the filename). Expectation: all three fixed → 5/5 green → demote the banner.
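
The ordering fix described above can be sketched as a pure function. This is a minimal illustration, not the code in dependencies.rs: the signature and the container names used in the usage note are assumptions. The helper name order_present_containers comes from the notes above; the real helper may differ in shape.

```rust
use std::collections::HashSet;

/// Order the containers that actually exist on this node by a preferred
/// startup order, without ever adding names that are not present.
///
/// `startup_order` may be a union list covering several naming variants;
/// entries that are not in `present` are skipped rather than injected.
/// Present containers that the order list does not mention are appended
/// at the end, in the order they were given.
pub fn order_present_containers(startup_order: &[&str], present: &[&str]) -> Vec<String> {
    let present_set: HashSet<&str> = present.iter().copied().collect();
    let mut seen: HashSet<&str> = HashSet::new();
    let mut ordered = Vec::with_capacity(present.len());

    // First pass: walk the preferred order, keeping only live containers.
    for name in startup_order {
        if present_set.contains(name) && seen.insert(*name) {
            ordered.push((*name).to_string());
        }
    }
    // Second pass: keep any live container the order list did not cover.
    for name in present {
        if seen.insert(*name) {
            ordered.push((*name).to_string());
        }
    }
    ordered
}
```

With a union order list that contains variants such as mysql-mempool alongside the names that are live on the node, the result contains only the live names, so a start loop over it cannot hit "no such object" on a phantom entry.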

Code fixes shipped this session (all on main, built + DEPLOYED to .228 AND .198):

  • 2dad64b2 stop honours per-app grace (was -t 30 deadline racing SIGKILL).
  • 760a32bc reconciler stops resurrecting user-stopped apps (dep-override + host-port watchdog).
  • 6e49ce6f container-list reports user-stopped apps as stopped despite a live UI companion.
  • 452f05d8 companion self-heal on its own ~30s loop (was gated behind the slow per-app pass).
  • Test-harness hardening: 88930558 53b8e47f 892ff083 98f4fa44 (readiness retries, immich/fedimint/NPM/lnd windows, inter-iteration settle). Binary built on .116 core/target/release/archipelago (4-fix); deploy = stop archipelago, cp to /usr/local/bin, start.

NODE-STATE fixes on .228 NOT in the repo (re-apply if .228 is reset/reimaged):

  • nginx /app/lnd/ proxy target was stale 8081 → fixed to 18083 (sed in /etc/nginx/sites-{available,enabled}/archipelago + snippets, then nginx -s reload). Repo code is correct (18083); old node config was stale.
  • Removed a stale orphan ~/.config/containers/systemd/home-assistant.container (ContainerName home-assistant ≠ the real homeassistant container; it was stuck "activating"). Real app fine.
  • electrumx was re-installed (package.install w/ image 146.59.87.168:3000/lfg2025/electrumx:v1.18.0) to re-register it as a tracked manifest app (it had become adopted plain-podman).

KEY LESSON: run the lifecycle gate ON the node, not via RPC from .116: its bitcoin/companion/orphan/endpoint tests use local podman/systemctl/bitcoin-cli/curl, so a remote run silently tests the runner (this is why earlier runs from .116 falsely showed "bitcoin in IBD" etc.).

Remaining (after 5× green): netbird migration (#20 ph4, the one real migration left) + btcpay/mempool stack polish; Phase-3 use_quadlet_backends; B flip-on (EMBED_MANIFESTS+sign); per-app test coverage (~30 apps unwritten); the mobile app-launch UX (§8 Roadmap P1). Multinode → its own plan.


Where we are — Task #20 (manifest lifecycle hooks) + indeedhub migration: DONE & 2-node verified

Manifest-driven lifecycle hooks + the IndeedHub stack migration are complete and live-verified on BOTH .228 and .198 (adoption + fresh-create + post_install hook exec, stable under load). 15 commits this session: 4c1a4e59..e2a012d0. Working tree clean. The release lifecycle gate is 5× (ARCHY_ITERATIONS=5).

Shipped (all on main, newest first):

  • e2a012d0 indeedhub frontend health → tcp:7777 (was http GET /; the http check false-failed under load and the reconciler churned the frontend — fixed).
  • ff78b312 hook exec runs in a transient user scope (systemd-run --user --scope --quiet --collect podman exec …) — fixes "crun: write cgroup.procs: Permission denied" when exec'ing from archipelago.service.
  • ff8f11b8 indeedhub frontend caps [CHOWN,DAC_OVERRIDE,SETGID,SETUID] — nginx workers died "setgid(101) failed" under the orchestrator's --cap-drop=ALL.
  • b73084db DELETED the legacy indeedhub orchestrator special-cases (382 lines: reconcile_indeedhub_stack, start_indeedhub_backends, the 120s dependency-DNS gate, patch_indeedhub_nostr_provider, repair_indeedhub_network_aliases, INDEEDHUB_* consts) → "indeedhub" now uses the GENERIC install_fresh/reconcile path.
  • b1eea8c0 7 indeedhub manifests (apps/indeedhub{,-postgres,-redis,-minio,-relay,-api,-ffmpeg}) + install_indeedhub_stack orchestrator-first (immich pattern).
  • b94b61f6 network_aliases ContainerConfig field (podman_client + quadlet rendering, DNS-label validated) — lets the frontend nginx reach api:4000/minio:9000/relay:8080 on the dedicated indeedhub-net.
  • 955c54b7/4c1a4e59 #20 hooks phases 1-2: schema (LifecycleHooks/HookStep/HostCopy in archipelago-container::manifest) + executor container::hooks::run_post_install (allowlist-canonicalised copy_from_host + scoped exec), wired into install_fresh.
  • 84031e62 gate 20×→5× (docs only: CLAUDE.md, this file, tests/lifecycle/TESTING.md).
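
The scoped hook exec from ff78b312 can be sketched as a command builder. This is a minimal sketch, assuming the hook step boils down to a container name plus an argv; the function name scoped_exec_command is hypothetical, while the systemd-run flags are the ones quoted above.

```rust
use std::process::Command;

/// Build `systemd-run --user --scope --quiet --collect podman exec <container> <argv...>`.
///
/// Running `podman exec` inside a transient user scope keeps the exec'd
/// process out of the calling service's cgroup, which is the situation the
/// "write cgroup.procs: Permission denied" error above came from.
/// The command is only constructed here; the caller decides when to run it.
pub fn scoped_exec_command(container: &str, argv: &[&str]) -> Command {
    let mut cmd = Command::new("systemd-run");
    cmd.args(["--user", "--scope", "--quiet", "--collect"])
        .args(["podman", "exec", container])
        .args(argv);
    cmd
}
```

Keeping construction separate from execution makes the argv unit-testable without a running podman; a caller would typically follow up with `.status()` or `.output()` and map a non-zero exit to a hook failure.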

Design = adoption-safe + manifest-driven. Manifests reproduce the live install exactly so existing nodes ADOPT (NoOp) instead of recreate: hyphen container_names the runtime already references, named volumes indeedhub-{postgres,redis,minio,relay}-data, indeedhub-net + network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse the live /var/lib/archipelago/secrets values (ensure_one no-ops on existing; postgres pw is fixed at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept. The frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + nostr-provider.js + sub_filter), so the post_install hook (sed X-Frame / copy nostr-provider.js / inject / nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering guard is KEPT on purpose (beneficial; not a blocker).

GATE BLOCKER 2026-06-22 — package.stop ignores the per-app stop grace (REAL, fleet-wide, ROOT-CAUSED)

Step 1 (sync .228 tcp-health manifest) is DONE + verified. Step 2 (the 5× gate) surfaced a real, fleet-wide package.stop bug — reproduced on the CLEAN, quadlet-correct .198, so it is a genuine product bug, not node contamination. Root cause is fully pinned (below).

Symptom. package.stop <app> returns {"status":"stopping"} but the container never stops (container-list shows running 60s+); the gate's wait_for_container_status … stopped 60 times out. Hits fedimint, electrumx, bitcoin-knots, btcpay-server, immich (slow-to-SIGTERM apps). filebrowser passes because it exits on SIGTERM in <30s.

ROOT CAUSE (from .198 journal during a live package.stop fedimint):

WARN  quadlet: systemctl --user stop fedimint.service timed out after 45s
ERROR runtime: package.stop fedimint failed: stop_container fedimint:
      podman stop -t 30 fedimint timed out after 30s: deadline has elapsed

The orchestrator stop path ignores the per-app graceful-stop table and the wrapper deadline equals the grace:

  • archipelago::api::rpc::package::runtime::stop_timeout_secs() defines per-app grace (bitcoin 600s, lnd 330s, electrumx 300s, immich_postgres 120s, fedimint/btcpay 60s, default 30). The legacy stop paths use it (runtime.rs:329/607/1060 podman stop -t <stop_timeout_secs>).
  • The orchestrator path does NOT: prod_orchestrator::stop() → ContainerRuntime::stop_container (container/src/runtime.rs:124) → API PodmanClient::stop_container hardcodes ?t=10 (podman_client.rs) and the CLI fallback hardcodes -t 30 (runtime.rs:128). fedimint needs 60s but gets 10s/30s ⇒ SIGTERM grace expires; the API/CLI stop errors out and the whole stop fails → state reverts to running.
  • Compounding: PODMAN_CLI_DEFAULT_TIMEOUT = 30s (runtime.rs:9) wraps podman stop -t 30, so the await fires exactly when podman would SIGKILL → "timed out after 30s" even though the kill would land a moment later. The wrapper deadline must exceed the -t grace.

FIX (two parts, design choice flagged):

  1. Thread the per-app stop grace into the orchestrator stop path. Either (A) move/duplicate stop_timeout_secs into the container crate and have stop_container use it, (B) extend the ContainerRuntime::stop_container signature to take a grace: Duration and have prod_orchestrator::stop() compute it from the loaded manifest, or (C, north-star-aligned) add a stop_grace_secs field to the manifest (default 30) and read it from lm.manifest in stop(). (C) is the manifest-driven choice; bitcoin/lnd/electrumx/fedimint manifests then declare their value. DECISION NEEDED from owner: A/B (fast, table-based) vs C (manifest-driven).
  2. Make the CLI/API wrapper deadline = grace + buffer (e.g. grace + 15s) so podman's SIGKILL completes inside the await. Apply to both PodmanClient::stop_container (?t=+HTTP timeout) and the runtime.rs CLI fallback (-t+PODMAN_CLI_DEFAULT_TIMEOUT). Add a mock-orchestrator test: a container that ignores SIGTERM for >30s must still end stopped.
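
Both parts of the fix can be sketched together: option (C) with a table fallback for the grace, and a wrapper deadline that always exceeds it. This is a minimal sketch; the function names, the app-id keys in the table, and the 15s buffer (taken from the "e.g. grace + 15s" above) are assumptions, while the grace values are the ones listed for stop_timeout_secs().

```rust
use std::time::Duration;

/// Extra time the wrapper waits beyond the grace so that podman's own
/// SIGKILL (sent when the grace expires) can land inside the await.
const STOP_DEADLINE_BUFFER: Duration = Duration::from_secs(15);

/// Fallback table for apps whose manifest declares no grace.
/// Values mirror the per-app table described above; keys are illustrative.
pub fn table_grace_secs(app_id: &str) -> u64 {
    match app_id {
        "bitcoin" => 600,
        "lnd" => 330,
        "electrumx" => 300,
        "immich_postgres" => 120,
        "fedimint" | "btcpay" => 60,
        _ => 30,
    }
}

/// A manifest-declared `stop_grace_secs` wins; otherwise use the table.
pub fn stop_grace(app_id: &str, manifest_stop_grace_secs: Option<u64>) -> Duration {
    Duration::from_secs(manifest_stop_grace_secs.unwrap_or_else(|| table_grace_secs(app_id)))
}

/// Deadline for the CLI/API wrapper: strictly longer than the grace
/// passed to `podman stop -t`, never equal to it.
pub fn stop_deadline(grace: Duration) -> Duration {
    grace + STOP_DEADLINE_BUFFER
}
```

For fedimint with no manifest value this yields a 60s grace and a 75s wrapper deadline, so the "timed out after 30s" race between the await and podman's SIGKILL cannot occur.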

Build/deploy after the fix: cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago → sideload to .228 + .198 (stop archipelago, cp binary, start) → re-quadletize .228 (its backend .container files are gone from my cascade-gate contamination — reinstall its apps so units regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).

⚠️ FIX SHIPPED + VALIDATED 2026-06-22, and the gate has MORE causes than the grace bug

Done: the grace fix is implemented (option C+table fallback: manifest stop_grace_secs → stop_grace_secs_for() table; deadline = grace + 15s), unit-tested (3 tests green), committed (2dad64b2), release-built, and deployed to BOTH .228 and .198 (active, UI 200). Quadlet regression suite green (37/37). Validated: healthy app vaultwarden stops cleanly on .198 (running→exited→removed); no regression, and the deployed binary's stop path works.

The gate stop-failure was MULTI-CAUSED (3 real product bugs) — all 3 now FIXED + the electrumx lifecycle suite is GREEN (10/10, 66s) on .228:

  1. Stop ignored per-app grace (podman stop -t 30 spurious 30s timeout), commit 2dad64b2. Orchestrator now uses manifest stop_grace_secs → stop_grace_secs_for() table; deadline = grace + 15s; applied to quadlet stop + API + CLI.
  2. Reconciler resurrected user-stopped apps — commit 760a32bc. The reconcile filter's dependency_required override re-included a user-stopped dependency (electrumx ← active mempool), the in-memory disabled set is wiped on manifest reload, and the host-port "repair" then restarted the stopped backend within ~8s. Fix: ensure_running_with_mode now bails Left("user-stopped") when the on-disk user_stopped marker is set (the single choke point all reconcile flows through); install/start clear the marker first so user actions are unaffected.
  3. container-list reported user-stopped apps as running — commit 6e49ce6f. The backend was Exited but its UI companion (electrs-ui/bitcoin-ui/…) kept serving the launch port, and the state-refresh upgraded any reachable launch port to running. Fix: handle_container_list forces stopped for user_stopped apps before the launch-port refresh.
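
The precedence rule behind fix 3 can be written as a small decision function. This is an illustrative sketch only: AppState and reported_state are hypothetical names, and the real handle_container_list carries more states than the two shown here.

```rust
/// Simplified app state for illustration; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AppState {
    Running,
    Stopped,
}

/// Decide what container-list should report for one app.
///
/// The user_stopped marker is checked first, so a UI companion that still
/// answers on the launch port cannot upgrade a user-stopped app to Running.
pub fn reported_state(
    user_stopped: bool,
    backend_running: bool,
    launch_port_reachable: bool,
) -> AppState {
    if user_stopped {
        return AppState::Stopped;
    }
    if backend_running || launch_port_reachable {
        AppState::Running
    } else {
        AppState::Stopped
    }
}
```

The pre-fix behaviour corresponds to evaluating the launch-port check before the marker, which is how an Exited electrumx with a live electrs-ui companion was reported as running.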

Earlier theories now RESOLVED/superseded: "fedimint crash-looping" was probe-induced churn — left alone, fedimint is stable (Up 48 min, 0 watchdog restarts/30 min); its restarts during testing were the host-port watchdog firing while I rapid-cycled stop/start (fixed by #2). "Exited→Stopped key mismatch" was actually the live-UI-companion launch-port issue (#3). "Grace vs gate-timeout" (electrumx 300s) was moot — a healthy electrumx honours SIGQUIT and stops in <1s.

TWO-NODE GATE RESULT (1×, DESTRUCTIVE, both with the 3-fix binary):

  • .228: 104/110. All previously-failing package.stop tests now PASS (bitcoin/btcpay/electrumx/fedimint/immich). Remaining 6: test 31 (companion recreate), 44 (fedimint orphan: probe pollution), 55 (immich restart timing), 83 (bitcoin not archival-synced), 94/99 (endpoint/lnd-proxy cascade from 83).
  • .198: 94/110. 14 of 16 failures are one root cause: bitcoin is in IBD (test 83 says blocks=817652 headers=954850 — ~137k behind). Everything chained to bitcoin cascades: lnd (16,85), btcpay (22,23,103), electrumx (37), mempool stack (71,72,73,101), endpoints (94), bitcoin.getinfo (7,12). The other 2 are node-independent: 31 (companion recreate) and 44 (fedimint orphan pollution).

CONCLUSION: the lifecycle-stop blocker is FIXED and validated on both nodes. The residual red is NOT lifecycle bugs — it is (a) bitcoin still syncing (IBD) on the test nodes [test 83 is an explicit precondition; nothing electrumx/lnd/btcpay/mempool can pass until it finishes], (b) .228 plain-podman contamination (my cascade-gate), and (c) two minor items: test 31 companion-unit recreate (both nodes — likely the 90s window vs reconcile tick + image step; investigate) and test 44 orphan fedimint container left by my probing.

EVERY gate failure is now FIXED or explained — NO lifecycle code bugs remain. Final read:

  • package.stop (the blocker): 3 bugs fixed (2dad64b2/760a32bc/6e49ce6f), green both nodes.
  • bitcoin-IBD cascade (most of .198's red): environmental — bitcoin syncing (test 83 precondition).
  • test 31 companion-recreate: NOT a product bug. Two things: (a) FIXED — the companion reconcile stage was gated behind the slow per-app pass; now it runs on its own ~30s loop (452f05d8). Validated on .228 with the new binary: a deleted archy-electrs-ui unit self-heals in ~10s (was stuck 100s+), journal: companion not active, repairing → wrote quadlet unit → companion started. (b) HARNESS CAVEAT — the companion-survives bats does LOCAL rm/systemctl --user (no ssh), so running the gate from .116 against a remote node actually tests .116's companions with .116's (old) binary, not the RPC target. ⇒ the companion-survives suite must be run ON the target node (or with the new binary on .116) to be meaningful. This explains the "failed on both nodes" runs — both were silently testing .116.
  • test 55 immich restart: NOT a bug — the heavy 3-container stack (postgres+redis+server) restarts in >120s under load; immich DOES return to running. Optional: bump the immich restart wait.
  • test 44 fedimint orphan: my probe pollution; a teardown clears it.

To reach a literally-green 5× gate (now infra/node-prep + minor test-window tuning, not lifecycle code):

  1. Let bitcoin finish IBD on a test node (or point the gate at an archival-synced bitcoin).
  2. Re-quadletize .228 (reinstall its backends so .container units regenerate, matching .198). electrumx done; bitcoin/btcpay/fedimint/immich/etc. remain. (Most backends ARE in manifest_ids already; this is about regenerating quadlet units + clearing adopted plain-podman state.)
  3. Optional: faster companion-reconcile cadence (test 31) + longer immich-restart wait (test 55) + clear the test-44 orphan — or simply run the gate on a less-loaded, bitcoin-synced node.
  4. test 31 ROOT-CAUSED = contamination + load (NOT a product bug). companion::reconcile only recreates a deleted companion unit (e.g. archy-electrs-ui) when its PARENT backend (electrumx) is in manifest_ids. On contaminated .228 electrumx ran as plain podman and was NOT a tracked manifest install (its /opt/.../electrumx/manifest.yml exists on disk but wasn't loaded), so the reconciler never iterated it → companion orphaned. Proven fix: package.install electrumx re-registered it (now reconcile action app_id=electrumx fires) AND restored the companion (unit present, service active). The companion self-heal logic is correct. ⇒ test 31 clears once .228 is re-quadletized (step 2). electrumx on .228 is now de-contaminated. Still: clear test-44 orphans.
  5. Then run ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1 on the synced+quadlet node, then the other.

Quadlet context (still true, but SEPARATE from the bug above): quadlet IS the intended backend runtime; .198 has the backend .container files (bitcoin-knots/btcpay-server/fedimint/filebrowser/indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain; bitcoin-core.container is .disabled-20260506) because my cascade-gate uninstalled its apps and my package.start restore recreated them as bare podman run --restart=unless-stopped without regenerating units. Two related hardening items: (a) package.start should regenerate a missing quadlet unit, not fall back to bare podman; (b) re-survey the status doc's "Quadlet-everywhere ~96%" from .container-file presence + PODMAN_SYSTEMD_UNIT, not from "container running".

The stop→stopped STATE reporting is correct once the container actually stops (server.rs:1334 keeps a --rm'd app visible as Stopped via the user_stopped guard — proven on filebrowser); the bug is purely "container never stops", not "state not reported".

MY-SESSION ERRATA (own it on resume)

  • I ran the gate with ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1, which is NOT the canonical gate (that is ARCHY_ALLOW_DESTRUCTIVE=1 only — stop/start/restart, no uninstall/reinstall; see run-gate.sh "Suggested release-gate invocation"). Cascade ran uninstall/reinstall on every app and, when I killed the run mid-iteration, left bitcoin-knots/electrumx/btcpay/fedimint/immich uninstalled or stranded. I fully restored .228 (reinstalled bitcoin-knots with the correct image 146.59.87.168:3000/lfg2025/bitcoin-knots:latest; started the rest; cleared a stale user-stopped.json). Verified healthy: UI 200, 35 containers, 17 apps running.
  • Reinstall gotcha: package.install needs a REAL image ref in dockerImage; a bare app name → Invalid Docker image format.

NEXT STEPS (in order) — SINGLE-NODE (.228) criterion

  1. DONE — 4 stop/reconcile bugs fixed + deployed (2dad64b2 grace, 760a32bc reconcile-resurrection guard, 6e49ce6f container-list user-stopped, 452f05d8 companion cadence). Plus test-harness fixes (lnd/immich/fedimint/NPM readiness + config).
  2. DONE — gate run ON .228 (synced bitcoin): 110/110 GREEN (1×). Key lesson: run the gate on the node, not via RPC from .116 (local podman/systemctl/bitcoin probes).
  3. 5× run on .228 in progress (ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1, on the node). 5 consecutive clean iterations = the single-node gate criterion → demote the banner.
  4. netbird migration (#20 phase 4) — the one real migration left; assess setup steps first (TLS cert gen, config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host; legacy is install_netbird_stack in stacks.rs). Then btcpay/mempool stack polish.
  5. Hardening: package.start should regenerate a missing quadlet unit, not fall back to bare podman.

Multinode / fleet (.198 + the rest) → docs/multinode-testing-plan.md (separate, after .228 green). Carry-over notes for that plan: .198 bitcoin was mid-IBD; the lnd /app/lnd/ nginx proxy had a stale 8081 target on .228 (repo code is correct at 18083 — re-check on other nodes).

KNOWN ISSUES / WATCH-OUTS

  • .198 is a weak/loaded node (load avg ~35). The generic reconcile recreates containers it deems unhealthy; under load, false-failing health checks → churn. The tcp-health fix (e2a012d0) mitigated the frontend case. If the lifecycle gate churns on .198, look for other apps whose http health checks false-fail under load → prefer tcp.
  • Many concurrent SSH sessions to .198 wedge its sshd (MaxStartups) — it pings but SSH hangs for minutes. Use ONE ssh at a time to .198; pkill -f 192.168.1.198 to clear strays.
  • Hook exec only works in the scoped form (committed). copy_from_host is direct cp.

DEPLOY / VERIFY FACTS (both nodes, ISO Debian, glibc 2.41 — binary built on .116 runs on both)

  • Build: cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago (~12 min, opt-level=3). Binary at core/target/release/archipelago. Linker "undefined hidden symbol" → rebuild with CARGO_INCREMENTAL=0. archipelago is a bin-only crate (no lib). Filtered tests: cargo test -p archipelago --bin archipelago -- hooks quadlet.
  • Sideload: scp binary $H:/tmp/archipelago-new → sudo systemctl stop archipelago; sudo cp /tmp/archipelago-new /usr/local/bin/archipelago; sudo chmod +x …; sudo systemctl start archipelago. Containers SURVIVE the restart (--restart unless-stopped + podman-restart.service). Binary path is /usr/local/bin/archipelago.
  • Manifests live at /opt/archipelago/apps/<app_id>/manifest.yml (root-owned ok). The orchestrator CACHES them at startup → edit on disk then RESTART archipelago to reload. Bulk deploy: tar czf t.tgz -C apps indeedhub indeedhub-postgres indeedhub-redis indeedhub-minio indeedhub-relay indeedhub-api indeedhub-ffmpeg; scp; sudo tar xzf t.tgz -C /opt/archipelago/apps.
  • Nodes: .228 = 192.168.1.228, SSH pw archipelago, RPC/UI pw password123 (https). .198 = 192.168.1.198, SSH pw archipelago, RPC/UI pw ThisIsWeb54321@ (https). Both have the 7-container indeedhub stack + secrets + named volumes pre-existing.
  • Trigger install via RPC: auth.login (sets session+csrf cookies) → send the csrf cookie value as X-CSRF-Token header → package.install with params {"id":"indeedhub","dockerImage":"<any>"} (dockerImage required even for stacks; install is async → returns {"status":"installing"}). install logs go to /var/log/archipelago/container-installs.log (best-effort) AND journalctl -u archipelago.
  • Fresh-create test recipe: podman rm -f indeedhub (stateless frontend) → package.install indeedhub → expect install_fresh + post_install hook (all 4 steps ok) + UI 200 on :7778 (/ , /nostr-provider.js, /api/). On adoption the frontend is NoOp (hook does NOT run — install_fresh is the only hook trigger).

9. Documentation map (what survives)

This master plan is the hub. Authoritative standalone docs (linked above), kept:

  • Design: architecture.md, app-developer-guide.md, APP-PACKAGING-MIGRATION-PLAN.md, registry-manifest-design.md, marketplace-protocol.md, dht-distribution-design.md, multi-node-architecture.md, rust-orchestrator-migration.md, bulletproof-containers.md, three-mode-ui-design.md, dual-ecash-design.md, meshroller-integration-design.md, phase4-streaming-ecash-plan.md, adr/*.
  • Reference: app-manifest-spec.md, api-reference.md, developer-guide.md, operations-runbook.md, troubleshooting.md, user-walkthrough.md, bitcoin-rpc-relay.md, security-code-audit-2026-03.md, GAMEPAD-NAV.md, SEED-VERIFICATION.md, hotfix-process.md, app-registry-status-2026-06-21.md.

All dated handoffs/resumes/transcripts/superseded trackers were consolidated here and removed (recoverable via git) on 2026-06-21.

10. Backlog — investigate frontend state management (2026-06-23)

Investigate adopting a real client-state/data-fetching layer for neode-ui instead of the current hand-rolled Pinia stores + ad-hoc fetch/poll patterns. Motivation: lifecycle/UX bugs like the stuck "full-red" install/uninstall progress bar and ghost My Apps entries (see §6c) are partly a state-sync problem — the UI's view of package state drifts from the backend and isn't reliably invalidated/refetched. A principled query/cache layer (request dedup, background refetch, cache invalidation on mutation, optimistic updates, retry/stale handling) would make these classes of bug structurally hard.

Research → recommend → (maybe) adopt:

  • Evaluate TanStack Query (Vue Query) as the leading candidate, plus alternatives (Pinia Colada, vue-query alternatives, plain Pinia + a disciplined invalidation layer, or an SSE/WebSocket push model for package-state events instead of polling).
  • Criteria: fit with the existing Pinia/RPC architecture, bundle-size cost, offline/PWA behaviour, how cleanly it models long-running mutations (install/uninstall with progress), and whether a push channel for package-state changes is the better root-cause fix.
  • Deliverable: a short design note + a recommendation, then a scoped migration of the package-lifecycle surfaces (My Apps / install / uninstall / update progress) as the proof case — sequence AFTER workstream F (it informs F's progress-UI fix and vice-versa).

10b. Backlog — intelligent launch-port selection (2026-06-26)

Replace the per-app static launch-port map with a smart, manifest-first heuristic. Gitea launched at :2222 (SSH) instead of :3001 (web) on a node missing the gitea manifest on disk: manifest_lan_address_for returned None → the code fell through to extract_lan_address, which returns podman's first-listed published port, and podman lists 2222->22 before 3001->3000. Patched 2026-06-26 (670ebb06) with a static "gitea" => 3001 entry in lan_address_for (core/container/src/podman_client.rs) — but that's a per-app band-aid (the anti-pattern CLAUDE.md warns against; the map already carries bitcoin/lnd/mempool/immich/… by hand).

Real fix (do this, then delete the static entries):

  • Primary is already correct — derive the launch URL from the manifest's declared interfaces.main port. The failure was only the fallback. The north-star cure is registry-distributed manifests (workstream B) so the manifest is always present and we never guess.
  • Smart fallback — make extract_lan_address stop returning the blind first port: skip container-side ports that are known non-HTTP (22/SSH, etc.) and prefer the published port whose container side matches the manifest health_check endpoint / a known web port. Fixes the whole multi-port-app class generically (no per-app hardcoding), and lets us drop the static map.
  • ~20-line change to one function + unit tests; rides the next fleet roll. NOT a free-port remap (that's port_allocator.rs, which already resolves host-port collisions — a different problem; gitea's web UI was never in conflict).
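
A possible shape for the smart fallback, as a sketch under stated assumptions: PublishedPort, pick_launch_port, and both port lists are hypothetical, and the real extract_lan_address works on podman's port data rather than this simplified struct.

```rust
/// One published port mapping, e.g. 3001->3000 is { host: 3001, container: 3000 }.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PublishedPort {
    pub host: u16,
    pub container: u16,
}

/// Container-side ports that are known not to serve HTTP (assumed list).
const NON_HTTP_CONTAINER_PORTS: &[u16] = &[22];
/// Container-side ports that commonly serve a web UI (assumed list).
const KNOWN_WEB_CONTAINER_PORTS: &[u16] = &[80, 443, 3000, 8080];

/// Pick the host port to launch, instead of blindly taking the first one.
///
/// Preference order:
/// 1. the mapping whose container side matches the manifest health-check port,
/// 2. a mapping whose container side is a known web port,
/// 3. the first mapping that is not a known non-HTTP port.
pub fn pick_launch_port(ports: &[PublishedPort], health_check_port: Option<u16>) -> Option<u16> {
    if let Some(hc) = health_check_port {
        if let Some(p) = ports.iter().find(|p| p.container == hc) {
            return Some(p.host);
        }
    }
    if let Some(p) = ports
        .iter()
        .find(|p| KNOWN_WEB_CONTAINER_PORTS.contains(&p.container))
    {
        return Some(p.host);
    }
    ports
        .iter()
        .find(|p| !NON_HTTP_CONTAINER_PORTS.contains(&p.container))
        .map(|p| p.host)
}
```

For the gitea case, with 2222->22 listed before 3001->3000, this returns 3001 even when no health-check port is known, and returns None rather than an SSH port when nothing plausible is published.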

10c. Backlog — generalize the archival/full-node install blocker (2026-06-26)

Make "this app needs an un-pruned (archival, txindex) Bitcoin node" a manifest-declared dependency, applied to every app that needs it — using the electrumX/mempool blocker as the reference behavior. Today the gate works but is hardcoded: requires_unpruned_bitcoin() in core/archipelago/src/api/rpc/package/dependencies.rs is a literal matches!(package_id, "electrumx" | "electrs" | "mempool-electrs" | "mempool" | "mempool-web"), and install bail!s with archival_bitcoin_required_message when bitcoin.pruned is true or disk < ARCHIVAL_BITCOIN_DISK_GB (1 TB). That's the same per-app-hardcoding anti-pattern as the gitea static map (§10b) and the install_*_stack Rust — any new app needing a full node is silently un-gated until someone edits this match.

Do:

  • Declare it in the manifest — e.g. requires: { bitcoin: archival } (or a dependencies.bitcoin.pruned: false constraint) so the install pre-flight reads the requirement from the manifest set instead of a hardcoded list. Covers future apps automatically (manifest-driven north star).
  • Audit coverage — confirm EVERY archival-dependent app is gated (electrumX, electrs, mempool + its electrs, and any BTC-indexer/explorer added later); add a unit test asserting the manifest constraint ⇒ blocker fires.
  • UX — the blocker must be a clear, surfaced pre-install state in the UI (not just an RPC bail! string): explain why (pruned node / insufficient disk), what to do (add ~1 TB, resync un-pruned with txindex), and keep the app visibly "requires archival node" rather than a confusing generic failure. Pairs with workstream F's honest-progress/blocker UX.
  • Reference: the existing package-install-prune-check dependency descriptor (dependencies.rs:208) is the seam to make data-driven.
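
A data-driven version of the pre-flight might look like the following. This is a sketch, not the existing dependencies.rs code: BitcoinRequirement, NodeFacts, InstallBlocker, and archival_preflight are hypothetical names, and the constant's value of 1000 is an assumed reading of "1 TB".

```rust
/// What an app's manifest declares about the Bitcoin node it needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BitcoinRequirement {
    None,
    Any,
    Archival,
}

/// Facts about the node gathered before install.
pub struct NodeFacts {
    pub bitcoin_pruned: bool,
    pub disk_gb: u64,
}

/// Assumed value for the 1 TB threshold mentioned above.
pub const ARCHIVAL_BITCOIN_DISK_GB: u64 = 1000;

/// A structured blocker the UI can render, rather than a bare error string.
#[derive(Debug, PartialEq, Eq)]
pub enum InstallBlocker {
    PrunedNode,
    InsufficientDisk { have_gb: u64, need_gb: u64 },
}

/// Pre-flight driven by the manifest requirement instead of a hardcoded
/// `matches!` on package ids.
pub fn archival_preflight(
    req: BitcoinRequirement,
    node: &NodeFacts,
) -> Result<(), InstallBlocker> {
    if req != BitcoinRequirement::Archival {
        return Ok(());
    }
    if node.bitcoin_pruned {
        return Err(InstallBlocker::PrunedNode);
    }
    if node.disk_gb < ARCHIVAL_BITCOIN_DISK_GB {
        return Err(InstallBlocker::InsufficientDisk {
            have_gb: node.disk_gb,
            need_gb: ARCHIVAL_BITCOIN_DISK_GB,
        });
    }
    Ok(())
}
```

Returning a typed blocker instead of calling bail! gives the UI enough to explain why the install is blocked and what to do, which is the surfaced pre-install state the UX bullet asks for.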