archy/docs/PRODUCTION-MASTER-PLAN.md

# PRODUCTION MASTER PLAN — Archipelago App Platform & Registry

> **✅ SINGLE-NODE PRODUCTION GATE IS GREEN (2026-06-23): `run-gate.sh` 5/5 on .228, 0 failures.**
> This remains the authoritative plan for the broader north star (manifest-driven
> platform, registry-distributed manifests, external marketplace), but it is no
> longer a hard priority banner blocking all other work. Remaining workstreams are
> in §6 / §8b. Next exit-criteria: multinode (`docs/multinode-testing-plan.md`) +
> workstreams B/C/D.
>
> Last updated: 2026-06-26 · zombie-container guard + gitea launch-port fix shipped, binary `040df5ce` rolled to the fleet (see §8b SESSION h). Prior: orchestrator Fix A+B (`a721532f`/`e0343137`) deployed + proven.

---

## 1. The North Star

Make Archipelago a **world-class, developer-ready app platform** where:

1. **Every app is manifest-driven** — install/run/update/uninstall needs only the
   app's manifest (+ catalog entry). **Zero OS-level code reliance**: no per-app
   Rust installers, no `sudo mkdir/chown`, no host provisioning.
2. **Manifests are distributed via the (signed) registry**, not baked into the
   binary OTA as disk files. Bumping/adding an app = a signed catalog change.
3. **Third-party developers can build and ship apps via an external registry** —
   a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation),
   not a gatekept central store. `archy app validate/render/install/test` tooling.
4. The platform stays **rootless, secure-by-default, elegant, robust, and
   100%-uptime-capable** (reboot-survivable, self-healing, no data loss on migrate).

**Definition of done:** the production test gate (§5) is green for the app set on
real nodes. Until then, this plan is the priority.

## 2. Invariants (never violate)

- **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged
  containers unless explicitly approved. (ADR-001, ADR-009.)
- **No app-specific business logic in the Rust backend.** The orchestrator owns
  the lifecycle state machine; apps are declarative. Legacy `install_immich_stack`
  (hardcoded `podman run` + `sudo chown`) is the anti-pattern being deleted.
- **Secrets are manifest-declared** (`generated_secrets`, materialised by
  `container::secrets` 0600/rootless, idempotent + self-healing) — never hardcoded,
  per-app, or logged. Replaces the deleted `ensure_fmcd_password`.
- **Migrations never destroy data.** Preserve `/var/lib/archipelago/<app>`,
  generated secrets, displayed credentials, public ports, and adoption container
  names. Always provide a rollback path. Stop/recreate only when necessary.
- **Verify on the real node .228 before any tag.** (Fleet/multinode verification is
  a separate pass → `docs/multinode-testing-plan.md`.)
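
A minimal sketch of what "0600, idempotent + self-healing" can mean for a generated secret. This is an illustration in Python, not the Rust `container::secrets` code; `materialise_secret` and the path layout are hypothetical names.

```python
import os
import secrets
import stat
from pathlib import Path


def materialise_secret(path: Path, nbytes: int = 32) -> str:
    """Create the secret once with mode 0600; later calls reuse it and repair the mode."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        # O_EXCL makes creation atomic, and the file is never created group/world-readable.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        with os.fdopen(fd, "w") as fh:
            fh.write(secrets.token_hex(nbytes))
    # Self-heal: tighten permissions if something loosened (or the umask altered) them.
    if stat.S_IMODE(path.stat().st_mode) != 0o600:
        path.chmod(0o600)
    return path.read_text()
```

Calling it twice returns the same value, so a reinstall or migration does not rotate a credential the user has already been shown.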

## 3. Current state (2026-06-21)

- **~40 apps are manifest-based and Quadlet-migrated** (survive
  `archipelago.service` restart + reboot). Exhaustive per-app table:
  `docs/app-registry-status-2026-06-21.md`.
- **Legacy holdout: immich** — the one app with **no manifest** and a hardcoded
  Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data.
  The migration proof case.
- **Manifests still travel by OTA disk rsync** (`apps/ → /opt/archipelago/apps`).
  The signed catalog (`app-catalog.json`) currently distributes **only image
  overrides** — not full manifests. Gap closed by workstream B.
- **The 4 companions** (`archy-bitcoin-ui`, `-lnd-ui`, `-electrs-ui`,
  `-fedimint-ui`) build from `docker/<name>` contexts via `companion.rs`, not the
  manifest registry — a later phase folds them in.
- **No app has passed the formal production gate.** That is the blocker.

## 4. Workstreams (each links its authoritative detail doc)

| # | Workstream | Detail doc | Status |
|---|-----------|-----------|--------|
| A | **Manifest-driven app platform** — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | `APP-PACKAGING-MIGRATION-PLAN.md` | mostly done; immich + multi-container polish remain |
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0–2 code-complete (worktree) |
| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — but this is DESTRUCTIVE-tier / ~8 core apps only; see §6c for the coverage gaps |
| F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **IN PROGRESS (2026-06-26)** — root bug FIXED: uninstall could hang → ghost/stuck-bar/reinstall-block (`71cc9ac4`, unbounded systemctl/podman in `quadlet::disable_remove`); `cascade-uninstall.bats` **7/7 green on .228** w/ binary `ae349a75`. Remaining: wire CASCADE into the canonical gate run, progress-UI truthfulness, all-apps matrix, guardian/IBD state. |

**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
scan, Quadlet rendering) and `bulletproof-containers.md` (the six container failure
modes FM1–FM6 + the desired-state-first reconciler that fixes them).
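
As a rough illustration of "level-triggered, desired-state-first": each pass compares what should exist with what is observed and derives actions from the difference, with no dependence on past events. The sketch below is Python, not the Rust reconciler; `Action`, `reconcile`, and the state strings are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str    # "install", "start", "stop", or "noop"
    app_id: str


def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[Action]:
    """Derive actions from desired vs observed state.

    desired maps app_id -> "running" | "stopped"; actual maps app_id -> observed
    state. An app missing from `actual` has no container yet.
    """
    actions = []
    for app_id, want in sorted(desired.items()):
        have = actual.get(app_id)
        if have is None:
            actions.append(Action("install", app_id))
        elif want == "running" and have != "running":
            actions.append(Action("start", app_id))
        elif want == "stopped" and have == "running":
            actions.append(Action("stop", app_id))
        else:
            actions.append(Action("noop", app_id))
    return actions
```

Because the function is stateless, running it again after the actions have been applied yields only `noop`, which is the convergence property a periodic reconcile relies on.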

## 5. Production test gate (exit criterion)

An app is **production-ready** only when `tests/lifecycle/run-gate.sh` is green
across the full matrix — install / UI-reachable / stop / start / restart /
reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall —
**5× on .228** (`ARCHY_ITERATIONS=5`). **The gate runs ON the node** (it uses local
podman/systemctl/bitcoin probes; run from another host via RPC, those probes silently
inspect the runner machine instead of the node). **Multinode / fleet verification (.198 + others) is a SEPARATE
plan — `docs/multinode-testing-plan.md` — NOT part of this single-node criterion.**
Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps, L2 UI ● dashboard +
proxies; L3 survival ◐; ~30 apps have zero automated coverage.

> ⚠️ **The 2026-06-23 5×-green is NOT the full bar.** `run-gate.sh` runs only the
> **DESTRUCTIVE tier** (stop/start/restart/survive) over ~8 core apps; it **skips
> uninstall/reinstall** (CASCADE is gated behind `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`,
> never set by the gate) and tests no install/uninstall **progress UI**. Real
> uninstall/reinstall/progress bugs (immich + grafana) were found in manual testing
> right after — see **§6c (workstream F)** for the gap and the expanded-gate plan.
> The true "every app, fully" criterion is F's definition-of-done, not this run.

## 6. Immediate sequence (live workstream)

1. ✅ **B-phase 1** — `manifest` field on `AppCatalogEntry`; `load_manifests`
   catalog-wins merge; `manifest_dir` kept (build-source catalog manifests skipped
   in phase 1); unit tests. *(commit 220666d3)*
2. ✅ **B-phase 2** — `EMBED_MANIFESTS` publisher generator + round-trip guard.
   *(7bfbe8fe; signing via existing ceremony — not yet flipped on for the fleet.)*
3. ✅ **C immich proof** — immich is a manifest-driven stack (immich + immich-postgres
   + immich-redis) installed via `install_stack_via_orchestrator`; legacy installer
   is now fallback-only. Live-migrated + verified on .228. Found+fixed: container_name
   duplicate-on-shared-PGDATA, version-digit validation, partial-fallback hardening,
   data_uid 100998. Canonical app_id `immich` (title+icon). *(9e6c5370, d5ef4573)*
4. ✅ **Reboot-survival** — podman-restart.service enabled (startup, fleet-wide)
   for the podman-`--restart` path. *(f160e0c4)*
5. ✅ **E** — 5× gate on **.228** (`ARCHY_ITERATIONS=5`) is **GREEN: 5/5, 0 not-ok**
   (2026-06-23). Two real orchestrator bugs were found + fixed en route (package.stop
   per-app grace; package.restart phantom stack-member injection → `order_present_containers`,
   commit 92d7f52d) plus two single-shot-read probes hardened (bitcoin-knots state, immich
   lan_address). The single-node criterion is met.
6. ✅ Banner demoted (this doc, 2026-06-23). Next: multinode pass + workstreams B/C/D.

**Multinode / fleet verification (.198 and the rest) is split into its own plan:**
`docs/multinode-testing-plan.md`. Do it AFTER the .228 single-node gate is green.

**Not yet done / deliberate follow-ups:** flip `EMBED_MANIFESTS` on for the
published catalog (then sign) to actually distribute manifests via the registry;
Phase-3 `use_quadlet_backends` rollout so orchestrator backends are Quadlet (not
just podman-`--restart`).

## 6b. Post-deploy task order (agreed 2026-06-23)

After the 2026-06-23 multinode test deploy (latest backend + UX frontend to .116/.198/.228
+ Tailscale testers), do these IN ORDER:
1. **netbird #20 ph4** — the last real manifest migration (workstream A).
2. **Phase-3 `use_quadlet_backends`** — orchestrator backends become Quadlet units.
3. **§6c Lifecycle perfection** (workstream F) — the comprehensive uninstall/reinstall +
   progress-UI + all-apps gate expansion below.

## 6b-bis. Bitcoin multi-version bulletproofing (2026-06-29) — READY TO MERGE + DEPLOY

Branch `bitcoin-version-bulletproof` (base `095a76cd`). Fixes the "switch version silently
fails / crash-loops" class + a data-access mismatch that can corrupt a node's index. All
code + images + catalog + frontend DONE; **.228** carries it (Knots chainstate mid-reindex
recovery). The **coordinated fleet rollout** (OTA binary+frontend, mirror catalog publish,
`:latest` repoint sequencing, full switch-matrix test) is the remaining work — fold it into
the next release. **Authoritative detail + exact remaining steps + test matrix →
`docs/bitcoin-version-bulletproof-rollout.md`.** Pairs with `docs/bitcoin-multi-version-design.md`.

## 6c. Lifecycle perfection — what "green" MISSED (workstream F, the perfection bar)

**Why this exists:** the 2026-06-23 single-node gate went 5×-green but is **NOT** the
"every app fully lifecycle-tested" guarantee a user reasonably assumes. The canonical gate
(`run-gate.sh`) only runs the **DESTRUCTIVE tier** (stop / start / restart / survive) over
**~8 core apps** (bitcoin-knots, btcpay, electrumx, lnd, mempool, immich, fedimint,
filebrowser). It explicitly **SKIPS uninstall/reinstall** (the CASCADE tier is gated behind
`ARCHY_ALLOW_CASCADE_DESTRUCTIVE`, which `run-gate.sh` never sets) and has **zero coverage**
for the other ~30 apps (grafana, jellyfin, vaultwarden, penpot, nextcloud, photoprism,
uptime-kuma, homeassistant, … — see `app-registry-status-2026-06-21.md`). So uninstall,
reinstall, install-progress UI, and most apps were never under test.

**Real bugs found in manual multinode testing on .198 (2026-06-23) — the motivating evidence:**
- **Uninstall is broken for immich + grafana:** takes very long, the progress bar sits at a
  **solid full-red with no real progression**, and the app **does not actually uninstall** —
  it still appears in **My Apps** afterward (ghost entry / state not cleared).
- **grafana reinstall just stops** partway (no completion, no clear error).
- **fedimint guardian** suddenly showed **"starting up — Guardian opens a wait page until
  Bitcoin finishes initial sync" / "starting"** on that node — verify this is correct
  wait-for-IBD behavior vs a stuck/false state (it's a backend that depends on bitcoin sync).

**✅ 2026-06-26 — root cause of the immich/grafana uninstall trio FOUND + FIXED (`71cc9ac4`).**
Single cause: `quadlet::disable_remove()` (first op in uninstall teardown, via companion +
orchestrator) ran `systemctl --user stop` / `daemon-reload` / `podman rm -f` with **no timeout**.
On rootless podman a generated unit can wedge "deactivating" while podman hangs → `systemctl stop`
blocks forever → the spawned uninstall task returns neither Ok nor Err, so (a) `set_uninstall_stage`
never fires → **frozen full-red bar**, (b) `remove_package_state_entry` never runs → **ghost stuck in
`Removing`**, (c) the install guard rejects reinstall (`already Removing`). The spawn wrapper already
reverts state on Err/removes on Ok — only a *hang* stranded it. Fix bounds all three calls
(stop→`QUADLET_STOP_TIMEOUT` + SIGKILL/reset-failed escalation; daemon-reload→30s; podman rm→timeout).
**Validated live: `cascade-uninstall.bats` 7/7 on .228** (binary `ae349a75`) — grafana install →
uninstall (no ghost, data dir gone) → reinstall → running → cleanup. NOTE: proves the happy path +
no-regression; the original hang was load/timing-induced and not separately reproduced.
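
The shape of the fix (bound every external call, then escalate) can be sketched as follows. This is a Python illustration only; the real change is in the Rust `quadlet::disable_remove()`, and `run_bounded` / `stop_with_escalation` are hypothetical names.

```python
import subprocess
import sys


def run_bounded(cmd: list[str], timeout_s: float) -> str:
    """Run cmd and return "ok", "failed", or "timeout" instead of blocking indefinitely."""
    try:
        # On timeout, subprocess.run kills the child, reaps it, and raises TimeoutExpired.
        done = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if done.returncode == 0 else "failed"


def stop_with_escalation(stop_cmd: list[str], kill_cmd: list[str], timeout_s: float) -> str:
    """Try a graceful stop; if it hangs or fails, escalate. The caller always gets a result."""
    if run_bounded(stop_cmd, timeout_s) == "ok":
        return "stopped"
    return "killed" if run_bounded(kill_cmd, timeout_s) == "ok" else "error"
```

The point is that every path returns a value, so the calling task can always resolve to Ok or Err and the state machine cannot be stranded in `Removing`.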

**Workstream F scope — the gate must grow to (in priority order):**
1. **CASCADE tier in the canonical gate:** uninstall → verify the app is GONE from My Apps /
   `container-list` / package state (no ghost), data preserved per policy, then reinstall →
   verify it returns healthy. Catch the immich/grafana ghost + reinstall-stops bugs.
   *(✅ DONE `b7d92107`: `run-gate.sh` now runs ONE cascade pass after the 5× loop when
   `ARCHY_GATE_CASCADE=1` (+`ARCHY_ALLOW_DESTRUCTIVE=1`), counted into the tally — opt-in so default
   behavior is unchanged, and deliberately NOT folded into all 5 iterations. `cascade-uninstall.bats`
   7/7 on .228. Next: extend cascade coverage beyond the single throwaway app to the multi-container
   stacks, e.g. an immich/btcpay cascade variant.)*
2. **Progress-UI assertions:** install AND uninstall must report monotonic, truthful progress
   (not a stuck full-red bar); a long op must surface a real stage/percentage and a terminal
   success/failure — no silent hang. (Likely both a backend progress-event fix AND a UI fix.)
   *(✅ 2026-06-26 `9f17ba68`: the "stuck full-red bar" was `AppCard.vue` hardcoding the uninstall
   bar to `w-full bg-red-400/60 animate-pulse` — solid, full, red, fake-pulse. Now derives a real
   percentage from the backend's existing `uninstall-stage` label ("Stopping containers (X/N)"→10–50%,
   "Cleaning up volumes"→70%, "Removing app data"→90%) and renders like install (neutral fill, real
   width+%, shimmer). FE built `index-DtZyZomC.js`, rolled to .228/.116/.198/.89 (+.88/.5/.120).
   STILL TODO: a bats/UI assertion that the bar is monotonic + lands on a terminal state; possibly a
   backend numeric-progress field so the UI doesn't parse stage strings.)*
3. **ALL-apps coverage:** a generic per-app lifecycle matrix (install / UI-reach / stop / start /
   restart / uninstall / reinstall / reboot-survive) driven by the manifest set, so grafana and
   the ~30 uncovered apps are gated too — not just the 8 core. Manifest-driven, so new apps are
   covered automatically.
   *(✅ 2026-06-26 `43934eef`: `bats/all-apps-lifecycle.bats` — DESTRUCTIVE counterpart to the
   read-only `all-apps-matrix.bats`. Discovers the app set from My Apps ∩ the node `catalog.json`;
   drives stop/start/restart for every app and, under `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`, a FULL
   teardown (uninstall→no-ghost→reinstall) with the catalog `{dockerImage, containerConfig}` as the
   reinstall spec. PROTECTED (never touched): bitcoin*/electrum* (resync cost) + lnd/btcpay*/fedimint*
   (irreversible wallet loss — user asked to protect only bitcoin+electrum; wallet apps added for
   safety, override via `ARCHY_MATRIX_PROTECT`). Validated on .228 (discovery + 1-app lifecycle
   green). HEAVY/destructive → a supervised pass on LAN nodes (.116/.198/.228), NOT folded into
   run-gate. Invoke: `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1 ARCHY_PASSWORD=…
   ARCHY_SCHEME=https bats bats/all-apps-lifecycle.bats`.)*
   **✅ FIRST FULL DESTRUCTIVE RUN on .228 (2026-06-26):** lifecycle **11/11 clean**; teardown
   **8/11** (immich 3-container stack incl.) — and it surfaced **3 real reinstall bugs** (the payoff):
   1. **fresh-install bind-dir ownership = root:root** → EACCES on reinstall (jellyfin `/config`
      denied exit 139; netbird-server can't open its SQLite store). Fix B's chown-to-parent only
      runs on the reconcile path, **not** `package.install`. The important orchestrator fix.
   2. **netbird reinstall adopts leftover containers → skips the manifest cert/file render**
   (tls.crt/key/nginx.conf never written → proxy can't start → the app reads as absent). Only a fully
      clean reinstall renders them.
   3. **portainer image pin `lfg2025/portainer:2.19.4` is `manifest unknown`** (never pushed to the
      registry) and the pin OVERRIDES the RPC dockerImage → portainer is un(re)installable
      fleet-wide. Registry/catalog data bug (push the image or change the pin).
   .228 restored (jellyfin+netbird via manual chown / clean reinstall; all installed apps running,
   28 ctrs; portainer left uninstalled, since it cannot be installed until #3 is fixed). TODO: fix #1
   (extend chown to install path) + #2 + #3; add reboot-survive + UI-reach per app to the matrix.
4. **Guardian/IBD-dependent states:** assert that "waiting for bitcoin sync"-style states are a
   legitimate, surfaced wait (with a path to ready) and never a permanent stuck state.
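
For scope item 2, the stage-label mapping and the still-TODO monotonicity assertion could look roughly like this. It is a Python sketch, not the `AppCard.vue` logic: the label strings and the 10–50 / 70 / 90 bands come from the note above, while the linear interpolation inside the 10–50 band and the function names are assumptions.

```python
import re
from typing import Optional

_STOPPING = re.compile(r"^Stopping containers \((\d+)/(\d+)\)$")
_FIXED = {"Cleaning up volumes": 70, "Removing app data": 90}


def stage_percent(label: str) -> Optional[int]:
    """Map an uninstall-stage label to a percentage, or None if it is not recognised."""
    m = _STOPPING.match(label)
    if m:
        done, total = int(m.group(1)), int(m.group(2))
        if total <= 0 or done > total:
            return None
        # Assumed: linear interpolation across the documented 10-50% band.
        return 10 + round(40 * done / total)
    return _FIXED.get(label)


def is_truthful(labels: list[str]) -> bool:
    """True if every label parses, progress never moves backwards, and it stays below 100."""
    percents = [stage_percent(label) for label in labels]
    if not percents or any(p is None for p in percents):
        return False
    return all(a <= b for a, b in zip(percents, percents[1:])) and percents[-1] < 100
```

A numeric progress field from the backend would make `stage_percent` unnecessary; the monotonicity check would stay the same.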

**Definition of done for F:** the expanded gate (CASCADE + progress + all-apps) is 5×-green on
.228, then re-verified across the multinode fleet — i.e. an *insanely-perfect* OS/container
environment where every app installs, runs, updates, uninstalls, and reinstalls cleanly with
honest progress, no ghosts, no data loss, reboot-survivable.

## 7. Release blockers & operational gotchas (durable)

Carried forward from prior handoffs (deduped against persistent memory):

- **Rootless control-plane responsiveness** — slow `podman ps`/store cleanup at
  startup must not surface a false "no apps installed" UI. **My Apps must preserve
  last-known apps during scanner backoff**, never show empty during a transient.
- **Reboot survival** — gate on ≥3 (prefer 5) consecutive clean post-reboot
  lifecycle passes. Quadlet units under `user.slice` survive `archipelago.service`
  restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
- **Startup patterns** — wait on a socket/health, never `sleep`. Tailscale waits
  for its socket; Fedimint Guardian waits for Bitcoin RPC `initialblockdownload:false`
  before launching fedimintd (proxy/wait companion on :8175 during IBD).
- **Bitcoin must run full** (`txindex=1`, non-pruned) for ElectrumX/mempool.
- **Adoption** — match existing containers by name and adopt without recreate;
  record a migration version in app state; preserve Nostr signer bridges
  (IndeeHub needs `/nostr-provider.js` served, not just port reachability).
- **Image presence** — use bounded targeted `podman image inspect`, not
  `podman image exists` (avoids store-walk stalls).
- **Companion rebuilds** — `companion.rs` must rebuild `:latest` when the build
  context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never
  reach nodes. `:local` is a manual override, never auto-rebuilt.
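
The "wait on a socket/health, never `sleep`" rule means polling a readiness probe against a deadline rather than pausing for a fixed guess. A minimal Python sketch, with hypothetical helper names; the only external fact used is that Bitcoin's `getblockchaininfo` result carries an `initialblockdownload` boolean.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float, interval_s: float = 0.5) -> bool:
    """Poll probe() until it returns True or timeout_s elapses; report whether it became ready."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Short pause between probes; the deadline, not this pause, bounds the wait.
        time.sleep(min(interval_s, remaining))


def ibd_finished(blockchain_info: dict) -> bool:
    """True only when the node explicitly reports initialblockdownload == False."""
    return blockchain_info.get("initialblockdownload") is False
```

Treating a missing field as "not ready" keeps a dependent service such as fedimintd from launching on an RPC answer that was incomplete or errored.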

## 8. Roadmap

**Pipeline:** Feature Testing (internal) → User Testing (controlled hardware) →
Beta Live (public). Hardening priorities feeding the gate:

- **P0** Container app reliability — bulletproof install/health/restart/uninstall
  across all apps, dependency chains, multi-container stacks.
- **P0** Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor
  hidden services, LND Connect).
- **P1** LUKS2 full-partition encryption for `/var/lib/archipelago/`
  (AES-256-XTS, Argon2id, key from setup password + hardware salt).
- **P1** Meshtastic plug-and-play parity with MeshCore.
- **P1 ✅ CODE-COMPLETE** (branch `companion-mobile-ux`, 2026-06-23; needs
  on-device + mobile-web verification before merge to `main`) — Mobile app-launch
  UX — drop the "this app opens in a tab" interstitial.
  Two surfaces (both: no interstitial screen, launch the app directly):
  - **Companion app (Android):** open **every** app in the **in-app WebView**
    (not just non-iframeable ones) — *and* carry the current mobile-iframe footer
    controls into the WebView (back/forward/reload/close — good, useful UX).
  - **Mobile web browser (PWA):** open tab-apps directly in a **new browser tab**.
  Touch points: `neode-ui/src/stores/appLauncher.ts`, `AppLauncherOverlay.vue`,
  the Android in-app WebView bridge, and the mesh-mobile iframe footer controls.
  (Reference prior work: `b5a9deb8` in-app webview for non-iframeable apps,
  `d1fbcd9b` "open in browser" via native bridge.)
  - **✅ Done (branch `companion-mobile-ux`):** mobile launches now use the
    store-driven panel (no route push) so the background tab no longer changes and
    closing returns you where you launched; tab-only apps open directly (in-app
    WebView on companion via `openInApp`, new browser tab on PWA) with **no
    interstitial**; the Android `InAppBrowser` (`WebViewScreen.kt`) gained a bottom
    footer bar (back/forward/reload/open-in-browser/close) + a centered loading
    screen (favicon + progress); a shared `AppLoadingScreen` (icon + progress)
    replaced the black/spinner loaders on the app session **and** legacy iframe
    overlay; the dashboard is pinned to `100dvh` on mobile so the mesh chat/tools
    panes stop sliding under the tab bar in mobile browsers (no-op in companion);
    ElectrumX shows its real icon in My Apps. Companion APK bumped to **v0.4.7**
    (versionCode 11) with a committed shared debug keystore so updates install
    without an uninstall. **Not yet:** merge to `main`; publish the 0.4.7 companion
    download (deferred until the gate work lands so they ship together).

**Post-beta (deferred — do not start until gate is green):** P2P encrypted
voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC
hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.md`);
Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash
phases 2–6 (`dual-ecash-design.md`).

## 8b. SESSION STATE + RESUME (updated 2026-06-30) — READ §8b "CURRENT STATE + RESUME" FIRST

### ▶ SESSION i (2026-06-30) — CURRENT HANDOFF / 1.8.0 OTA RESUME

**Branch/worktree:** currently on `bitcoin-version-bulletproof`, not `main`. Worktree is dirty.
Do **not** discard mesh changes: they include E2E/transport indicator plumbing and the Meshtastic
receive-path fixes below. Separate recovery note: `docs/SESSION-1.8.0-OTA-PROGRESS.md`.

**What was done this session:**
1. ✅ **Local Rust release gate fixed and green.** `cargo test -p archipelago --bin archipelago` is
   green: **849/849** after fixing stale tests and the invalid `fedimint-clientd` manifest
   (`cpu_limit` was `0.25`, invalid for the current schema; now integer). `cargo check -p archipelago`
   also green after mesh edits.
2. ✅ **Catalog/release static gates green.** `python3 scripts/check-app-catalog-drift.py --release
   --strict` is green. `scripts/check-release-manifest.sh` is green for the currently staged
   `1.7.99-alpha` manifest/artifacts. `npm run build` and `npm run type-check` are green.
3. ✅ **Frontend unit gate fixed.** `npx vitest run --silent` now green: **81 files / 668 tests**. Fixes
   were test-only: add `router.onError` to the login test router mock and update the `AppIconGrid`
   mobile unresolved-new-tab expectation to match current app-launcher behavior.
4. ✅ **Workstream F harness gap closed.** `tests/lifecycle/bats/cascade-uninstall.bats` now asserts
   uninstall progress truthfulness via backend `uninstall-stage`: stage must be parseable, monotonic,
   below 100 before terminal absence, and present before the app disappears. Non-destructive skip-mode
   parse check is green: `ARCHY_PASSWORD=dummy bats tests/lifecycle/bats/cascade-uninstall.bats` → 7 skip-ok.
5. ✅ **3ccc → .116 Meshtastic receive bug taken over and partially live-validated.** Context: `3ccc`
   is the stock/non-Archy Meshtastic peer. The bug was LoRa text from `3ccc` not surfacing in
   `.116` `mesh.messages`. Root causes/fixes:
   - The prior attempted fix dropped any packet older than 10 minutes by `rx_time`; live `.116` logs
     showed `FromRadio.packet` from `!433e3ccc` being dropped as stale (`rx_time` about an hour old).
     The window is now **24h**, so recent radio FIFO/store-forward backlog surfaces instead of vanishing.
   - Radios with unset clocks can report tiny nonzero epoch values; those are now treated as unknown,
     not stale.
   - Serial prevalidation was rejecting valid `FromRadio.queueStatus` frames (`field 11`, live bytes like
     `5a04100e1810`) as corrupt payloads; field 11 and other modern non-message `FromRadio` variants
     are now accepted/ignored instead of poisoning the stream.
   - Focused Meshtastic tests green: **8/8**, including `packet_to_inbound_frame_accepts_recent_meshtastic_backlog`
     and `packet_to_inbound_frame_accepts_stock_peer_with_unset_clock`.
   - Deployed patched binary to **.116**: sha256
     `028ec6ff9a60ca8970c081987457d78ed1c517cd81f7089f51b9a01745b5c3c4` at `/usr/local/bin/archipelago`.
     Service active. Post-deploy checked window showed `FromRadio field=11` accepted and no new
     `Dropping stale ... !433e3ccc` entries.
   - There are stale other-agent `RXDIAG` shell watcher processes on `.116`; leave them unless they
     actively interfere.
6. ✅ **Phase-3 Quadlet read-only check on .116 skip-clean.** Copied lifecycle tests to `.116` and ran
   `bats bats/use-quadlet-backends-install.bats`: **6/6 skip-clean** because no backend `.container`
   units exist. This confirms `use_quadlet_backends` is not active on `.116`; Phase-3 remains a rollout gate.
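
The `rx_time` handling from item 5 can be sketched as a three-way classification. This is Python for illustration, not the Rust mesh code: the 24h window is from the fix above, but the cut-off used to recognise an unset clock (`MIN_PLAUSIBLE_EPOCH`) and the function names are assumptions.

```python
STALE_WINDOW_S = 24 * 60 * 60          # 24h window, per the fix described above
MIN_PLAUSIBLE_EPOCH = 1_577_836_800    # 2020-01-01T00:00:00Z; assumed cut-off for "clock never set"


def classify_rx_time(rx_time: int, now: int) -> str:
    """Return "unknown", "stale", or "fresh" for a packet's rx_time (Unix seconds)."""
    if rx_time < MIN_PLAUSIBLE_EPOCH:
        # Zero or a tiny epoch means the radio clock was never set: the age is unknowable.
        return "unknown"
    if now - rx_time > STALE_WINDOW_S:
        return "stale"
    return "fresh"


def should_surface(rx_time: int, now: int) -> bool:
    """Surface everything except packets that are provably older than the window."""
    return classify_rx_time(rx_time, now) != "stale"
```

Under this rule the hour-old `!433e3ccc` packet from the `.116` logs is "fresh", and a stock peer with an unset clock is "unknown" rather than dropped.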

**Commands/results worth trusting:**
- `cargo test -p archipelago --bin archipelago` → 849/849 green.
- `npx vitest run --silent` from `neode-ui/` → 81 files / 668 tests green.
- `npm run build` from `neode-ui/` → green, bundle `index-CYaDgfX3.js`.
- `python3 scripts/check-app-catalog-drift.py --release --strict` → green.
- `scripts/check-release-manifest.sh` → green for **v1.7.99-alpha** staged artifacts.
- `tests/release/run.sh --manifest` was rerun after `cargo fmt`; it previously got as far as the
  frontend unit tests and failed there. Those are now fixed. Re-run it from scratch as the next static gate.

**Remaining blockers / decisions before 1.8.0 OTA:**
1. **Release version metadata is not 1.8.0 yet.** `releases/manifest.json`, Cargo, and npm still say
   `1.7.99-alpha`; `CHANGELOG.md` top says `v1.8.00-alpha` (note double zero). Do not silently publish
   until the release version naming is decided (`1.8.0-alpha` vs `1.8.00-alpha` vs `1.8.0`).
2. **Workstream B signing is blocked on the offline release-root mnemonic.** `docs/workstream-b-signing-runbook.md`
   says catalog distribution/embedded manifests are live, but authenticity requires the publisher to pin
   `RELEASE_ROOT_PUBKEY_HEX` and sign `releases/app-catalog.json` with `RELEASE_MASTER_MNEMONIC`.
   This cannot be automated by an agent without the offline mnemonic.
3. **Phase-3 `use_quadlet_backends` is implemented but default-off.** Completing this requires explicit
   node/fleet flag rollout plus backend reinstall/migration verification. `.116` currently skip-clean only.
4. **Bitcoin multi-version coordinated rollout is still separately owned/blocked by its runbook.** See
   `docs/bitcoin-version-bulletproof-rollout.md`; do not repoint `bitcoin-knots:latest` before fixed binary
   is fleet-wide.
5. **True RF validation of 3ccc requires either a live 3ccc send or waiting for another FIFO/backlog packet.**
   Parser/unit coverage and `.116` logs strongly validate the drop-path fix, but no human was available to
   send a fresh 3ccc message during this session.
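
One input to the version-naming decision in blocker 1: SemVer 2.0.0 forbids leading zeros in numeric identifiers, so `1.8.00-alpha` is not a valid SemVer string, while `1.8.0-alpha` and `1.8.0` are. The Python check below is a sketch assuming the release tooling expects SemVer; the pattern is adapted from the regular expression suggested in the SemVer specification, and `is_semver` is a hypothetical helper.

```python
import re

# Adapted from the regular expression suggested in the SemVer 2.0.0 specification.
_SEMVER = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?"
)


def is_semver(version: str) -> bool:
    """True if the whole string is a valid SemVer 2.0.0 version (no leading "v")."""
    return _SEMVER.fullmatch(version) is not None


for candidate in ("1.7.99-alpha", "1.8.0-alpha", "1.8.00-alpha", "1.8.0"):
    print(candidate, is_semver(candidate))
```

Running the same check over `releases/manifest.json`, Cargo, npm, and the `CHANGELOG.md` heading would flag any mismatch before a release script does.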

**Immediate next steps for the next agent:**
1. Run `tests/release/run.sh --manifest` from repo root again; frontend unit failures are fixed, so expect
   it to pass or continue from the next failing stage.
2. If `.116` is still the canary, monitor logs after any 3ccc activity:
   `journalctl -u archipelago --since "<time>" | grep -Ei "!433e3ccc|3ccc|Dropping stale|Meshtastic received text|FromRadio field field=2"`.
3. Decide/reconcile version naming for the actual 1.8.0 OTA, then use the release scripts intentionally
   (do not run `create-release.sh` casually: it commits/tags and requires `main` + clean tree).
4. If pursuing Workstream B completion, get the offline release mnemonic from the publisher and follow
   `docs/workstream-b-signing-runbook.md` exactly.
5. If pursuing Phase-3 Quadlet, enable `ARCHY_USE_QUADLET_BACKENDS=1` only on a canary first and run the
   Quadlet/lifecycle gates before considering fleet rollout.

### ▶ SESSION h (2026-06-26) — previous handoff, superseded by SESSION i above

**Canonical resume detail: memory `project_session_resume_2026_06_23b` (▶️ top of MEMORY.md).**
Local main = `670ebb06` (3 commits past the previously-pushed `43e70049`: `0a8db904` zombie
guard + `670ebb06` gitea launch-port fix; `43e70049` webview was already pushed). **Combined
release binary `040df5ce2551d17b` rolled to the fleet.** Binary+FE not in git — rebuild on a
fresh machine (`cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`).

**DONE this session:**
1. ✅ **Zombie-container guard** (`0a8db904`) — the reconciler's Running branch now verifies a
   container's `State.Pid` is alive (`/proc/<pid>` exists) before trusting podman's "Up"; when the
   PID is concretely dead, it stops and removes the container, then runs `install_fresh` from the
   manifest. Conservative: any uncertainty (inspect fail / unparseable PID) assumes alive, so a
   transient hiccup never
   destroys a healthy container. Fixes the class that broke NetBird login on .228 (dashboard
   "Up" w/ dead PID → proxy 502, no host port → reconciler never recovered it). Unit test +
   **live-proven on .228**: synthetic zombie on `jellyfin` (killed conmon+PID → podman still
   "Up") → guard logged `…process is dead (zombie) — recreating app_id=jellyfin` → recreated →
   settled to NoOp. **Zero false-positives across the other 33 healthy containers.**
2. ✅ **Gitea launch-port fix** (`670ebb06`) — gitea launched at **:2222 (SSH)** instead of
   **:3001 (web)** on nodes without the gitea manifest on disk (`manifest_lan_address_for`
   returns None → fell through to `extract_lan_address`, which returns podman's first-listed
   port; podman lists `2222->22` before `3001->3000`). Added `"gitea" => http://localhost:3001`
   to the static `lan_address_for` map (`core/container/src/podman_client.rs`) like every other
   core app. Reported on tailscale node **100.82.34.38** — that node still needs the new binary
   (or a refreshed gitea manifest) to pick it up.
3. ✅ **Rolled `040df5ce`** to .228/.116/.198/.89 (verified sha+active); .88/.5/.120 rolling.

**OPEN follow-ups (logged, NOT regressions):**
- **mempool env-drift recreate-loop on .228** — reconciler logs `container env drift detected —
  recreating app_id=mempool` every ~30-90s, never converges (pre-existing; the known mempool
  nginx stale-IP class, [[project_mempool_nginx_stale_ip_fix]]). mempool stays running but churns.
- **nostr-rs-relay** stuck "Stopping" + ~2s create-loop on .228 (from session g).

**NEXT:** finish .88/.5/.120 roll → push main to gitea-vps2 → Phase-3 quadlet / Workstream F /
multinode. SSH/sudo pw `ThisIsWeb54321@` (**.88 = `ThisIsWeb54321!`**); UI/RPC .228/.198 =
`ThisIsWeb54321@`. Reusable tooling in scratchpad: `deploy-bin.sh`/`remote-apply.sh` (EXPECT_SHA
= `040df5ce…`), `rpc.sh`.

---

### ▶ SESSION g (2026-06-25) — earlier, historical

**Canonical resume detail: memory `project_session_resume_2026_06_23b` + `project_netbird_ph4_legacy_deletion_map` + `project_workstream_f_lifecycle_perfection`.**
`gitea-vps2/main = a721532f` (pushed). **Local main = `89d397bb`** (2 new commits this session, NOT pushed/deployed: `41e7f500` harness tolerance + `89d397bb` netbird ph4 legacy delete). Binary+FE are NOT in git — rebuild on a fresh machine.

**TL;DR (SESSION g, 2026-06-25) — everything below DONE this session:**
1. ✅ **Rolled** `e0343137` + fresh FE (`index-a75rd6Hy.js`) to **7 nodes** (.116/.198/.228/.89/.88/.5/.120), all verified. **.15 SKIPPED** (auth rejected — creds don't match).
2. ✅ **Harness tolerance fixes COMMITTED** `41e7f500` (run-gate settle/immich + immich.bats 90s + mempool.bats poll).
3. ✅ **mempool RESOLVED** fleet-wide — see mempool note below.
4. ✅ **netbird #20 ph4 DONE** — legacy Rust installer DELETED, committed `89d397bb` (492 lines gone, manifest-driven only, `cargo check` clean). Release binary BUILDING for the .228 live-verify (build left running — check after).

**NEXT (resume here):** (a) check the release build, deploy the `89d397bb` binary to .228, live-verify netbird adopts via manifest (https:8087→200, no `bail!`); (b) roll `89d397bb` to the rest of the fleet (behavior-neutral — manifest path already executed); (c) **push local main → gitea-vps2** (2 commits ahead); then **Phase-3 `use_quadlet_backends` → Workstream F → multinode**.

**ROLL RESULTS (2026-06-25, binary `e0343137b99bf066` + fresh FE bundled):**
| Node | Result |
|------|--------|
| .228 | ✅ already on `e0343137` (prior session, binary-only) |
| .116 (local) | ✅ binary + fresh FE; 36 containers survived restart; UI 200; `index-a75rd6Hy.js` live |
| .198 (LAN) | ✅ binary + fresh FE; 38 containers up; UI 200 |
| .89 (100.89.209.89) | ✅ binary + fresh FE; service active |
| .88 (100.70.96.88, pw `ThisIsWeb54321!`) | ✅ binary + fresh FE; service active |
| .5 (100.72.136.5) | ⏳ attempted — see resume note (cellular x250) |
| .120 (100.66.157.120) | ⏳ attempted — see resume note (cellular x250) |
| .15 (100.64.83.15, archy-dev-pa) | ❌ SKIPPED — `archipelago@` + `ThisIsWeb54321@` rejected (`Permission denied (publickey,password)`); node creds unknown |

Deploy tooling (reusable): scratchpad `deploy-bin.sh <label> <local\|ssh\|ts> <host> <pw>` + `remote-apply.sh` (mv binary avoids ETXTBSY, atomic FE swap preserving `aiui`/APK/`claude-login.html`, chown 1000:1000, restart, sha+health verify). Frontend tarball = `tar -C web/dist/neode-ui -czf neode-ui.tgz .` (flat). Full sha `e0343137b99bf06642c45da67bb092e9a411190ff59eda8e5177c2a06b6f6e89`.
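
Why `mv` avoids ETXTBSY: writing into the file of a running executable is refused, while renaming a new file over the path only replaces the directory entry. A minimal sketch of that swap, assuming the staged file lives next to the target (same filesystem); `swap_binary` and the `.new` suffix are hypothetical, and the real `remote-apply.sh` steps (chown, restart, sha verify) are not reproduced:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Stage the new bytes beside the target, then rename over it.
/// The running process keeps its old inode; new starts see the new file.
pub fn swap_binary(target: &Path, new_bytes: &[u8]) -> io::Result<()> {
    // Hypothetical staging name: "<target>.new" in the same directory,
    // so the rename stays on one filesystem.
    let staged = target.with_extension("new");
    fs::write(&staged, new_bytes)?;
    // rename replaces the destination path in one step; it never opens
    // the old executable for writing.
    fs::rename(&staged, target)
}
```

File mode and ownership are left out of the sketch; a real deploy would set them on the staged file before the rename.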

**Focus: validate the two UNVALIDATED-WIP orchestrator fixes (commit `a721532f`) on the .228 canary, then roll to the 7-node fleet.**
- **Fix A** — desired-state recovery: a was-running app that vanished (e.g. lost through a failed teardown + reboot) auto-recreates on reconcile, via new `crash_recovery::load_last_running_names` (reads `running-containers.json` sans PID gate) + exact container-name match in `reconcile_all_with_mode`. Zero false-positives (uninstalled/user-stopped excluded).
- **Fix B** — recreate volume-ownership: a freshly-created bind dir for a NO-`data_uid` app gets `chown --reference=<parent>` so container-root can write → kills the immich-class recreate EACCES crash-loop. Only fresh dirs (zero regression for existing installs).
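
Fix A's selection rule reduces to a set difference with exact-name matching. A minimal sketch under stated assumptions: `containers_to_recreate` is a hypothetical name, the four sets are assumed to be already loaded (the real code reads `running-containers.json` and reconciler state), and nothing here is the actual `reconcile_all_with_mode` logic.

```rust
use std::collections::BTreeSet;

/// Names that were running at the last snapshot, have no live container
/// now, and were neither user-stopped nor uninstalled.
pub fn containers_to_recreate(
    last_running: &BTreeSet<String>,
    live: &BTreeSet<String>,
    user_stopped: &BTreeSet<String>,
    uninstalled: &BTreeSet<String>,
) -> Vec<String> {
    last_running
        .iter()
        // Exact-name membership only: "jellyfin-ui" being live does not
        // count as "jellyfin" being live.
        .filter(|name| {
            !live.contains(*name)
                && !user_stopped.contains(*name)
                && !uninstalled.contains(*name)
        })
        .cloned()
        .collect()
}
```

The two exclusion sets are what keep the false-positive count at zero: an app the user stopped or removed is never a recovery candidate, even if it appears in the last-running snapshot.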

VALIDATION PROGRESS (sessions e→f):
1. ✅ Release binary built — sha16 `e0343137b99bf066` (differs from pre-fix `f2aa2fab` → fixes compiled in).
2. ✅ `cargo test -p archipelago crash_recovery` — **13/13 green**, incl. the two new Fix A tests.
3. ✅ Deployed new binary to **.228 canary** (binary-only; FE unchanged at `435b9f92`). Verified live sha `e0343137`, active, RPC OK. Container cgroup confirmed in `user@1000.service` (NOT archipelago.service) → `systemctl stop` is container-safe on .228.
4. ✅ **Fix A PROVEN** — `podman rm -f jellyfin` (non-baseline, no-data_uid) → periodic ExistingOnly reconciler (30s) recreated it; journal: `previously-running app has no container after boot — recreating (desired-state recovery) app_id=jellyfin`.
5. ✅ **Fix B PROVEN** — fresh `package.install uptime-kuma` (no-data_uid, no prior data dir) → bind dir chowned to parent owner `1000:1000` (NOT root:root), state=running, RestartCount=0, no EACCES, app wrote its own subdirs → clean uninstall (container+data-dir gone). all-apps matrix read-only **5/5 (17 apps)**.
6. 🟡 **5× DESTRUCTIVE gate on .228 — NOT yet 5/5, but failures are HARNESS-TOLERANCE FLAKES, NOT Fix A/B regressions** (proven: Fix A logged **0** desired-state-recovery firings during the failures; immich/lnd `RestartCount: 0`, no crashes). Under sustained 5× churn on this 34-app node a *different* heavy-app recovery probe slips each iteration:
   - immich `lan_address` (test 64): 30s probe too tight after archipelago-restart recovery. **FIXED** (settle_stack now waits on immich :2283 when present, cap 180→300s; test 64 deadline 30→90s). Went **ok/ok/ok 3×** after fix.
   - mempool orphan count (test 82): single-shot count caught a transient extra container mid-recreate (clears to 3=3). **FIXED locally** (poll for steady-state ≤30s) — fix is in local `tests/lifecycle/bats/mempool.bats`, NOT yet re-gated.
   - lnd `getinfo recovers after restart` (test 77): already has a generous 240s deadline; peak concurrent load occasionally beats it. lnd itself **HEALTHY** (wallet unlocked — "wallet already unlocked, WalletUnlocker no longer available", RestartCount 0). Likely needs deadline bump or lnd added to within-iteration tolerance. **NOT yet fixed.**
   - NOTE: the 300s settle bump made iterations very long (iter2=1062s) and a diagnostic run wedged in iter3; killed it. Re-think settle (maybe per-app readiness with shorter caps) before the next run.
7. ✅ **DECISION RESOLVED (2026-06-25):** user chose **(B) roll now** AND bundle the fresh UX frontend (per `feedback_deploy_targets_and_ux_bundle`). Gate load-robustness deferred to a separate hardening pass.
8. ✅ **ROLLED** `e0343137` + fresh FE (`index-a75rd6Hy.js`) to .116/.198/.89/.88/.5/.120 (.228 already on it) — all verified `sha=e0343137`, service active. **.15 skipped** (auth reject). See roll table above.
9. ✅ **Harness fixes COMMITTED** `41e7f500` (no longer uncommitted).
10. ✅ **netbird #20 ph4 — legacy installer DELETED**, committed `89d397bb`. `install_netbird_stack` is now orchestrator-manifest → adopt → `bail!` (no in-Rust installer); removed 6 dead helpers + 3 `NETBIRD_*_IMAGE` consts + unused import (~492 lines). `cargo check` clean (0 warnings). Manifest path verified live pre-delete (.228 https:8087→200). **Release binary BUILT: sha `cccb7cfd9c38a651`** (`core/target/release/archipelago`, supersedes `e0343137`) — NOT yet deployed; deploy to .228 + live-verify then roll. Map+rationale: memory `project_netbird_ph4_legacy_deletion_map`. **Pre-existing follow-up (NOT introduced by delete): the manifest path lacks an active #10 OIDC-readiness gate — if that login race resurfaces, add an OIDC-ready gate to the netbird manifest.**
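
The harness fixes in item 6 share one pattern: replace a single-shot probe with a bounded poll for steady state. The bats helpers themselves are shell; the sketch below only illustrates the pattern in Rust, and `poll_until` is a hypothetical name, not a harness function.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Call `probe` until it returns true or `deadline` has elapsed.
/// Returns whether the probe ever succeeded.
pub fn poll_until<F>(deadline: Duration, interval: Duration, mut probe: F) -> bool
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    loop {
        if probe() {
            return true;
        }
        // Check the deadline after probing so a zero deadline still
        // gets exactly one attempt (the old single-shot behaviour).
        if start.elapsed() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

A transient extra container mid-recreate then costs a few poll intervals instead of failing the iteration. The open question from the 300s settle note still applies: the cap bounds the worst case per probe, so long caps multiply across a run.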

**✅ 2026-06-25: STRAY 13h GATE on .228 found + killed; mempool RESOLVED.**
- A `setsid` gate run from session-e was still churning .228 ~13h later (pathologically slow: only reached test 71/lnd; the 300s settle bump is the suspect).
- Killed its process group (note: `pkill -f bats` self-matches the ssh command's own argv → kill by numeric PID/PGID instead).
- After kill, `crash_recovery` (Fix A) auto-recovered the immich/indeedhub/netbird stacks: **good live exercise of Fix A**.
- **mempool fallout RESOLVED:** the gate churn left .228's podman **overlay storage corrupt** (mempool frontend crash-looped: container couldn't write `/etc/nginx`, same image serves fine on .116) → **fixed by rebooting .228** (clears overlay corruption; Fix A staggered-recovered all apps; mempool stable 200).
- **.198 is PRUNED** bitcoin → mempool requires archival (install correctly refused) → **cleanly uninstalled** the orphan mempool-db. All nodes now correct.
- LESSON: never leave the gate running unsupervised; reconsider the 300s settle before re-running.

Fleet on `e0343137` + FE `index-a75rd6Hy.js` on .116/.198/.228/.89/.88/.5/.120 (.15 still old). **`89d397bb` (netbird-delete) binary NOT yet deployed anywhere — verify on .228 then roll.** SSH/sudo pw UNIFORM `ThisIsWeb54321@` (**.88 = `ThisIsWeb54321!`**); **UI/RPC: .228=`ThisIsWeb54321@`, .198=`ThisIsWeb54321@`.** Reusable tooling in scratchpad: `deploy-bin.sh`/`remote-apply.sh` (binary+FE swap), `rpc.sh <host> <pw> <method> [params]` (auth.login→call). Gate harness at `~/lifecycle/lifecycle` on .228 — **CHECK it isn't already running/wedged before re-launching**.

---

### ▶ SESSION b (2026-06-23 PM) — earlier, historical

**Canonical resume detail: memory `project_session_resume_2026_06_23b` (▶️ top of MEMORY.md).**
`gitea-vps2/main = 4346007d` pushed; local HEAD `e57514b6` (uninstall fix, committed, **not pushed/deployed**).

Shipped + verified live on .228 (all in 4346007d):
- **Connection-lost FULLY fixed** — companion `image_exists` journal-flood (Stdio::null) + netbird UDP-port reconcile churn (`wait_for_manifest_host_ports` tcp-only). .228: flood→0, ws/db→0 disconnects, load 3.95→2.26.
- **netbird → manifest-driven** (#20 ph4) — 3 manifests + 4 orchestrator primitives (base64 secret, GeneratedCert+`ensure_manifest_certs`, templated-file render `{{HOST_IP}}/{{NETWORK_GATEWAY}}/{{secret:}}`, udp port protocol). Live: https 8087→200, OIDC→200, resolver=gateway. Legacy-Rust delete deferred to post-full-verify.
- **registry-manifest flip (code)** — `EMBED_MANIFESTS` default-on, `main.rs` bounded pre-load `refresh_catalog`. Catalog regenerated w/ 52 embedded manifests but **NOT published** (gitignored + never committed; publish = force-add to gitea-vps2 main). Do after fleet binary roll.
- **UX regression root-caused + fixed** — the mobile/desktop UX (loader/AppLoadingScreen, store-driven launch, app icons, android webview footer) was on `companion-mobile-ux` and **never merged to main**, so any main build silently dropped it. **Merged → main**, frontend redeployed to .228. Android 0.4.9/code13 pushed for user to build APK elsewhere.

In progress — **Workstream F lifecycle bugs** (this §, user-picked next):
- **uninstall ghost — FIXED + pushed (e57514b6) + DEPLOYED to .228.** `handle_package_uninstall` returned Err on any cleanup-residue failure *before* removing the package state entry → ghost in My Apps + revert-to-Installed. Now: split container vs cleanup errors; remove state entry as soon as containers gone (before slow data rm). **LIVE-VERIFY IN PROGRESS:** fresh grafana (not previously installed → no data risk) install→uninstall→reinstall on .228; install was mid image-pull at handoff. RPC recipe + caution in memory `project_session_resume_2026_06_23b`.
- **#15 fedimint guardian — RESOLVED, not stuck** (legit `until` IBD-gate → setup wizard now bitcoin synced; no code change).
- #14 grafana reinstall-stops — verify in the same grafana test (likely same root cause as #13).
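
The uninstall-ghost fix is an ordering change: container removal is the only hard failure, the state entry goes as soon as containers are gone, and data cleanup failures become warnings. A minimal sketch of that ordering, assuming the three steps can be passed in as closures; `uninstall` and `UninstallOutcome` are hypothetical and do not mirror the real `handle_package_uninstall` signature.

```rust
#[derive(Debug, PartialEq)]
pub enum UninstallOutcome {
    /// Containers gone and state entry removed; slow cleanup may have
    /// left residue, reported as warnings rather than an error.
    Removed { cleanup_warnings: Vec<String> },
    /// Containers could not be removed; the state entry is kept so the
    /// app stays visible and can be retried.
    ContainersRemain(String),
}

pub fn uninstall<C, S, D>(
    remove_containers: C,
    remove_state_entry: S,
    remove_data: D,
) -> UninstallOutcome
where
    C: FnOnce() -> Result<(), String>,
    S: FnOnce(),
    D: FnOnce() -> Result<(), String>,
{
    if let Err(e) = remove_containers() {
        return UninstallOutcome::ContainersRemain(e);
    }
    // Drop the state entry BEFORE the slow data removal, so a cleanup
    // failure can no longer leave a ghost entry behind.
    remove_state_entry();
    let mut cleanup_warnings = Vec::new();
    if let Err(e) = remove_data() {
        cleanup_warnings.push(e);
    }
    UninstallOutcome::Removed { cleanup_warnings }
}
```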

Next: finish grafana uninstall/reinstall live-verify on .228 → roll the new binary to the rest of the fleet (.116/.198/.5/.120 still on old binary) → publish embedded catalog (#8) → finish Workstream F (gate CASCADE+progress+all-apps expansion) → Phase 3 Quadlet → multinode.
WATCH: main.rs pre-load `refresh_catalog` (≤25s) slows startup — sanity-check startup→RPC-ready isn't egregious on the fleet roll.

---

### ▶ CURRENT STATE + RESUME (2026-06-23) — earlier session-a baseline (historical)

**✅ HEADLINE (2026-06-23): single-node gate GREEN (`run-gate.sh` 5/5 on .228, 0 not-ok) +
multinode test deploy DONE to 6 nodes.** The exit criterion (§5) is met. Green took fixing **two real
orchestrator bugs** (package.stop per-app grace, 2026-06-22; package.restart phantom stack-member
injection, 2026-06-23 — `order_present_containers`, commit 92d7f52d) plus hardening two single-shot
probes (bitcoin-knots state, immich lan_address). All work is **committed + PUSHED to `gitea-vps2`
(146) `main` @ `ccb594fb`** — the local-only state is resolved. Binary = release sha `5472c575…`.

**▶ DEPLOY STATE (latest backend `5472c575` + UX frontend + one-tap companion APK) — 2026-06-23:**

| Node | Pw | Done | Notes |
|------|----|----|-------|
| .116 (local, http:80) | `ThisIsWeb54321@` | ✅ | dev node: bitcoin mid-IBD + http-only |
| .198 | `archipelago` | ✅ | resilience; user manual-testing here |
| .228 | `archipelago` | ✅ | canonical gate node (5×-green) |
| 100.82.34.38 (archipelago-1) | `archipelago` | ✅ | |
| 100.89.209.89 (archy-x250-pa) | `ThisIsWeb54321@` | ✅ | |
| 100.70.96.88 (archipelago node) | `ThisIsWeb54321!` | ✅ | note the `!` |
| 100.64.83.15 (archy-dev-pa) | ? | ⏳ | UP (tailscale ping ok) but `ThisIsWeb54321@` REJECTED — **need correct pw** |
| 100.66.157.120 (archy-x250-exp) | `ThisIsWeb54321@` | ⏭️ | DOWN — user said leave it |

Deploy scripts saved in scratchpad: `deploy-node.sh` (full binary+FE, sha+health verify) and
`fe-only.sh` (FE-only, no archipelago restart). Reusable: `bash deploy-node.sh <host> <pw> <scheme> 127.0.0.1`.

**▶ COMPANION APK fixed (other agent's commit `5c43e127` + my reconcile):** QR + download were a
zip-wrapped `.apk.zip` (forced unzip). Now serve raw `archipelago-companion.apk` (one-tap) from the
146 raw URL; `CompanionIntroOverlay.vue` + ship/publish scripts repointed; old `.zip` dropped. The
OLD `.apk.zip` URL now 404s, so EVERY node was FE-refreshed to the new build (all 6 verified
`/ : 200` + bundle references `archipelago-companion.apk`).

**▶ MANUAL-TEST BUGS FOUND on .198 → workstream F (§4/§6c).** The green gate is DESTRUCTIVE-tier /
~8 core apps; it SKIPS uninstall/reinstall and has no progress-UI / all-apps coverage. Real bugs:
immich+grafana **uninstall hangs at a solid full-red bar + leaves a ghost in My Apps** (doesn't
actually remove); grafana **reinstall stops**; fedimint guardian shows "waiting for bitcoin sync"
(verify legit vs stuck). These motivate **workstream F** (cascade + progress + all-apps gate).
Also added **§10**: investigate TanStack-Query/push-based state mgmt for neode-ui (the state-drift
root cause behind the stuck bar + ghosts).

**▶ NEXT — agreed task order (do IN ORDER, see §6b):**
1. **netbird #20 ph4** — last real manifest migration.
2. **Phase-3 `use_quadlet_backends`** — orchestrator backends → Quadlet units.
3. **§6c workstream F** — cascade/uninstall + progress-UI + ALL-apps gate; fix the immich/grafana
   uninstall + ghost-My-Apps + reinstall-stops bugs to a 5×-green; then §10 state-mgmt investigation.
4. **Multinode pass** — `docs/multinode-testing-plan.md` (the 6 deployed nodes are ready for manual
   testing now).

**▶ LOOSE ENDS / gotchas for the resuming session:**
- **`neode-ui/src/components/AppLoadingScreen.vue` is UNTRACKED** on .116 — the other agent created it
  but NO committed code imports it (orphan, not in `e825bbed`). Left in place; decide whether to wire
  it in or delete. Not deployed (committed UX doesn't reference it).
- **gitea-local mirror (`localhost:3000`) push is BROKEN** (token redirects to `/login`); push to
  `gitea-vps2` works and is primary. Reconcile the local mirror token if you need it.
- **Don't delete bitcoin/electrum data** (user directive) — run only the DESTRUCTIVE gate
  (`run-gate.sh` default; never set `ARCHY_ALLOW_CASCADE_DESTRUCTIVE` on real nodes with synced chains).
- **.198 gate not run this session** (user was manual-testing there + restarting). .116 gate ran but
  failed 12 tests — ALL environmental (.116 is http-only → ui-coverage hardcodes `https://`; + bitcoin
  mid-IBD → bitcoin/lnd preconditions). NOT product regressions. `gate-116.log` on .116.

**(historical resume notes for the 5× chase below — superseded by the green result above)**

**Headline (2026-06-22):** the production gate's `package.stop` blocker is **FIXED**; **`.228` is 1×-GREEN
(110/110)**; a **fresh 5× run is IN PROGRESS on `.228`** (the single-node exit criterion) after a
real mempool bug found + fixed (below). The gate is now single-node (.228); multinode is split out
(`docs/multinode-testing-plan.md`). The gate is canonically **5×** now — `run-gate.sh` (the `20x`
naming/script was removed 2026-06-22, commit `57a013bc`).

**2026-06-22 (late) — mempool stale-IP bug FOUND + FIXED (real production bug, not a flake):**
The 1st 5× attempt failed iteration 1 on `#74 mempool api backend remains queryable`. Root cause was
NOT timing — the frontend nginx pinned mempool-api's IP at startup (no `resolver`); after the gate
restarts mempool-api (new podman IP) nginx 502s and the UI shows "offline". Fixed in
`mempool-frontend:v3.0.1` (resolver+variable proxy_pass; see `[[project_mempool_nginx_stale_ip_fix]]`
/ `docker/mempool-frontend/`), pushed to vps2, manifests bumped 3.0.0→3.0.1, deployed + resilience-
verified live on .228 (backend restart now auto-recovers). Also fixed the test itself (`mempool.bats`
#74: 180s→300s + real `fail` helper). Commits `0f05f73a` (fix) `57a013bc` (gate rename).

**THE 5× RUN IS DETACHED ON .228 — survives terminal/session close. Check it from any machine:**
```
sshpass -p archipelago ssh archipelago@192.168.1.228 \
  'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x3.log; \
   echo "running pid: $(pgrep -f run-gate.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x3.log | sort -u'
```
- Log: `/tmp/gate-5x3.log` on .228 · launched `nohup` · `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`,
  run **ON the node** from `/tmp/lifecycle-run/tests/lifecycle` via `./run-gate.sh` (ARCHY_HOST=127.0.0.1).
  `bats` 1.11.1 + static `jq` 1.7.1 are installed on .228.
- **If all 5 iterations PASS → .228 has met the single-node criterion → demote the banner.**
- If it flakes again: readiness-under-churn (lnd/mempool); hardening in `98f4fa44` (inter-iteration
  `settle_stack()` + readiness windows). Re-copy repo `tests/lifecycle` to /tmp/lifecycle-run, relaunch.

**▶ 2026-06-23 (morning) — 5× FINISHED 2/5; both mempool fails ROOT-CAUSED to ONE real
orchestrator bug (NOT flakes) + FIXED:** the overnight run finished `passed: 2 / failed: 3` on
`gate-5x3.log`, three *distinct one-off* fails, none repeating:
- iter1 `#5 container-list valid state for bitcoin-knots` — pre-launch churn (as predicted); didn't
  repeat. **Hardened anyway:** the probe was a single-shot read; now polls ≤30s for a settled valid
  state so a momentary `restarting`/transient can't flake a 20-min iteration (`bitcoin-knots.bats`).
- iter2 `#74 mempool api queryable` + iter5 `#73 mempool stack running` — **SAME root cause.**
  `package.restart mempool` resolves its container list via `ordered_containers_for_start`, which was
  **injecting phantom stack-member names** (`mysql-mempool`, `archy-mempool-api`, `archy-mempool-web`
  — variant names from the union `startup_order` list that aren't live on this node). The phantom
  `mysql-mempool` is 2nd in the start order; `do_orchestrator_package_start` hits its unknown-app-id
  fallback → `do_package_start` inspect fails "no such object" → the `?` **aborts the whole start
  sequence**, so `mempool-api` (pos 5) + `mempool` frontend (pos 8) never start. They then stayed down
  ~6 min until the health monitor independently recovered them → #73 (frontend not running in 180s)
  and #74 (api not queryable in 300s) both fail. Journal proof on .228: `package.restart mempool
  failed: Start failed: mysql-mempool: ... no such object`, 23:27:32.
  **Fix:** `ordered_containers_for_start` now orders only the *actually-present* containers and never
  injects phantom order entries (new pure helper `order_present_containers` + 3 unit tests,
  `dependencies.rs`). This is the SAME class as the mempool nginx bug — a hardcoded-name/reality
  mismatch — and is exactly the manifest-driven-lifecycle anti-pattern the master plan targets.
- **Deploy + relaunch:** built release binary on .116, swapped `/usr/local/bin/archipelago` on .228
  (containers live under `user@1000.service`, NOT the `archipelago.service` cgroup, so a service
  restart does NOT kill them — verified via conmon cgroup paths). Manually verified mempool restart
  keeps the stack up, then relaunched a clean 5× → see `gate-5x4.log` (check cmd above, swap the
  filename). Expectation: all three fixed → 5/5 green → demote the banner.
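
The phantom-injection fix can be illustrated as "filter the order list by what is actually present". This is a sketch, not the `order_present_containers` helper in `dependencies.rs` (its signature is not shown here): `order_present` is a hypothetical name, and appending present-but-unlisted containers at the end in sorted order is an assumption made for determinism.

```rust
use std::collections::HashSet;

/// Order only containers that exist on this node, following
/// `startup_order`; names in the order list that are not present
/// are skipped instead of being injected.
pub fn order_present<'a>(
    startup_order: &[&'a str],
    present: &HashSet<&'a str>,
) -> Vec<String> {
    let mut seen: HashSet<&'a str> = HashSet::new();
    let mut ordered: Vec<String> = Vec::new();
    for name in startup_order {
        // `seen.insert` also drops duplicates from a union order list.
        if present.contains(name) && seen.insert(*name) {
            ordered.push((*name).to_string());
        }
    }
    // Assumption: present containers missing from the order list still
    // start, after the ordered ones, sorted for a stable result.
    let mut rest: Vec<&'a str> = present
        .iter()
        .copied()
        .filter(|name| !seen.contains(name))
        .collect();
    rest.sort_unstable();
    ordered.extend(rest.into_iter().map(str::to_string));
    ordered
}
```

With a union order containing `mysql-mempool` and `archy-mempool-api` and a node that only runs `mempool-db`, `mempool-api`, and `mempool`, the result holds just those three, so no inspect on a nonexistent name can abort the start sequence.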

**Code fixes shipped this session (all on `main`, built + DEPLOYED to .228 AND .198):**
- `2dad64b2` stop honours per-app grace (was `-t 30` deadline racing SIGKILL).
- `760a32bc` reconciler stops resurrecting user-stopped apps (dep-override + host-port watchdog).
- `6e49ce6f` container-list reports user-stopped apps as `stopped` despite a live UI companion.
- `452f05d8` companion self-heal on its own ~30s loop (was gated behind the slow per-app pass).
- Test-harness hardening: `88930558` `53b8e47f` `892ff083` `98f4fa44` (readiness retries, immich/
  fedimint/NPM/lnd windows, inter-iteration settle). Binary (with all 4 fixes above) built on .116 at
  `core/target/release/archipelago`; deploy = stop archipelago, cp to /usr/local/bin, start.

**NODE-STATE fixes on .228 NOT in the repo (re-apply if .228 is reset/reimaged):**
- nginx `/app/lnd/` proxy target was stale `8081` → fixed to `18083` (sed in
  /etc/nginx/sites-{available,enabled}/archipelago + snippets, then `nginx -s reload`). Repo code is
  correct (18083); old node config was stale.
- Removed a stale orphan `~/.config/containers/systemd/home-assistant.container` (ContainerName
  `home-assistant` ≠ the real `homeassistant` container; it was stuck "activating"). Real app fine.
- electrumx was re-installed (`package.install` w/ image `146.59.87.168:3000/lfg2025/electrumx:v1.18.0`)
  to re-register it as a tracked manifest app (it had become adopted plain-podman).

**KEY LESSON:** run the lifecycle gate **ON the node**, not via RPC from .116 — its bitcoin/companion/
orphan/endpoint tests use local `podman`/`systemctl`/`bitcoin-cli`/`curl`, so a remote run silently
tests the *runner* (this is why earlier runs from .116 falsely showed "bitcoin in IBD" etc.).

**Remaining (after 5× green):** netbird migration (#20 ph4 — the one real migration left) + btcpay/
mempool stack polish; Phase-3 `use_quadlet_backends`; B flip-on (EMBED_MANIFESTS+sign); per-app test
coverage (~30 apps unwritten); the mobile app-launch UX (§8 Roadmap P1). Multinode → its own plan.

---

### Where we are — Task #20 (manifest lifecycle hooks) + indeedhub migration: DONE & 2-node verified

Manifest-driven lifecycle hooks + the IndeedHub stack migration are **complete and
live-verified on BOTH .228 and .198** (adoption + fresh-create + post_install hook
exec, stable under load). 15 commits this session: `4c1a4e59`..`e2a012d0`. Working
tree clean. The release lifecycle gate is **5×** (`ARCHY_ITERATIONS=5`).

**Shipped (all on `main`, newest first):**
- `e2a012d0` indeedhub frontend health → `tcp:7777` (was http GET `/`; the http check
  false-failed under load and the reconciler churned the frontend — fixed).
- `ff78b312` hook `exec` runs in a transient user scope
  (`systemd-run --user --scope --quiet --collect podman exec …`) — fixes
  "crun: write cgroup.procs: Permission denied" when exec'ing from archipelago.service.
- `ff8f11b8` indeedhub frontend caps `[CHOWN,DAC_OVERRIDE,SETGID,SETUID]` — nginx
  workers died "setgid(101) failed" under the orchestrator's `--cap-drop=ALL`.
- `b73084db` DELETED the legacy indeedhub orchestrator special-cases (−382 lines:
  reconcile_indeedhub_stack, start_indeedhub_backends, the 120s dependency-DNS gate,
  patch_indeedhub_nostr_provider, repair_indeedhub_network_aliases, INDEEDHUB_* consts)
  → "indeedhub" now uses the GENERIC install_fresh/reconcile path.
- `b1eea8c0` 7 indeedhub manifests (apps/indeedhub{,-postgres,-redis,-minio,-relay,-api,
  -ffmpeg}) + `install_indeedhub_stack` orchestrator-first (immich pattern).
- `b94b61f6` `network_aliases` ContainerConfig field (podman_client + quadlet rendering,
  DNS-label validated) — lets the frontend nginx reach `api:4000`/`minio:9000`/`relay:8080`
  on the dedicated `indeedhub-net`.
- `955c54b7`/`4c1a4e59` #20 hooks phases 1-2: schema (LifecycleHooks/HookStep/HostCopy in
  archipelago-container::manifest) + executor `container::hooks::run_post_install`
  (allowlist-canonicalised copy_from_host + scoped exec), wired into `install_fresh`.
- `84031e62` gate 20×→5× (docs only: CLAUDE.md, this file, tests/lifecycle/TESTING.md).
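
For the "DNS-label validated" `network_aliases` field, a label check in the usual RFC 1123 sense looks roughly like this. The exact rules the `ContainerConfig` validation applies are not shown in this plan, so treat `is_valid_dns_label` as a hypothetical sketch of the standard rule (1 to 63 characters, ASCII letters, digits, and hyphen, no leading or trailing hyphen).

```rust
/// RFC 1123 style single-label check: non-empty, at most 63 bytes,
/// ASCII alphanumerics or '-', and alphanumeric at both ends.
pub fn is_valid_dns_label(label: &str) -> bool {
    let bytes = label.as_bytes();
    if bytes.is_empty() || bytes.len() > 63 {
        return false;
    }
    let first = bytes[0];
    let last = bytes[bytes.len() - 1];
    first.is_ascii_alphanumeric()
        && last.is_ascii_alphanumeric()
        && bytes
            .iter()
            .all(|b| b.is_ascii_alphanumeric() || *b == b'-')
}
```

Aliases such as `api`, `minio`, and `relay` pass; anything with a dot, underscore, or edge hyphen is rejected before it can reach a podman or quadlet argument.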

**Design = adoption-safe + manifest-driven.** Manifests reproduce the live install exactly
so existing nodes ADOPT (NoOp) instead of recreate: hyphen container_names the runtime
already references, named volumes `indeedhub-{postgres,redis,minio,relay}-data`,
`indeedhub-net` + network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse
the live /var/lib/archipelago/secrets values (ensure_one no-ops on existing; postgres pw is
fixed at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept. The
frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + nostr-provider.js
+ sub_filter), so the post_install hook (sed X-Frame / copy nostr-provider.js / inject /
nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
guard is KEPT on purpose (beneficial; not a blocker).

### ⛔ GATE BLOCKER 2026-06-22 — `package.stop` ignores the per-app stop grace (REAL, fleet-wide, ROOT-CAUSED)

Step 1 (sync .228 tcp-health manifest) is **DONE + verified**. Step 2 (the 5× gate) surfaced a
real, fleet-wide `package.stop` bug — **reproduced on the CLEAN, quadlet-correct .198**, so it is a
genuine product bug, not node contamination. Root cause is fully pinned (below).

**Symptom.** `package.stop <app>` returns `{"status":"stopping"}` but the container **never stops**
(`container-list` shows `running` 60s+); the gate's `wait_for_container_status … stopped 60` times
out. Hits **fedimint, electrumx, bitcoin-knots, btcpay-server, immich** (slow-to-SIGTERM apps).
`filebrowser` passes because it exits on SIGTERM in <30s.

**ROOT CAUSE (from .198 journal during a live `package.stop fedimint`):**
```
WARN  quadlet: systemctl --user stop fedimint.service timed out after 45s
ERROR runtime: package.stop fedimint failed: stop_container fedimint:
      podman stop -t 30 fedimint timed out after 30s: deadline has elapsed
```
The orchestrator stop path **ignores the per-app graceful-stop table** and the wrapper deadline
equals the grace:
- `archipelago::api::rpc::package::runtime::stop_timeout_secs()` defines per-app grace
  (**bitcoin 600s, lnd 330s, electrumx 300s, immich_postgres 120s, fedimint/btcpay 60s**, default 30).
  The **legacy** stop paths use it (runtime.rs:329/607/1060 `podman stop -t <stop_timeout_secs>`).
- The **orchestrator** path does NOT: `prod_orchestrator::stop()` → `ContainerRuntime::stop_container`
  (`container/src/runtime.rs:124`) → API `PodmanClient::stop_container` hardcodes **`?t=10`**
  (podman_client.rs) and the CLI fallback hardcodes **`-t 30`** (runtime.rs:128). fedimint needs 60s
  but gets 10s/30s ⇒ SIGTERM grace expires; the API/CLI stop errors out and the whole stop fails →
  state reverts to `running`.
- **Compounding:** `PODMAN_CLI_DEFAULT_TIMEOUT = 30s` (runtime.rs:9) wraps `podman stop -t 30`, so
  the await fires **exactly** when podman would SIGKILL → "timed out after 30s" even though the kill
  would land a moment later. The wrapper deadline must exceed the `-t` grace.

**FIX (two parts, design choice flagged):**
1. **Thread the per-app stop grace into the orchestrator stop path.** Either (A) move/duplicate
   `stop_timeout_secs` into the `container` crate and have `stop_container` use it, (B) extend the
   `ContainerRuntime::stop_container` signature to take a `grace: Duration` and have
   `prod_orchestrator::stop()` compute it from the loaded manifest, or **(C, north-star-aligned)**
   add a `stop_grace_secs` field to the manifest (default 30) and read it from `lm.manifest` in
   `stop()`. (C) is the manifest-driven choice; bitcoin/lnd/electrumx/fedimint manifests then declare
   their value. **DECISION NEEDED from owner: A/B (fast, table-based) vs C (manifest-driven).**
2. **Make the CLI/API wrapper deadline = grace + buffer** (e.g. grace + 15s) so podman's SIGKILL
   completes inside the await. Apply to both `PodmanClient::stop_container` (`?t=`+HTTP timeout) and
   the `runtime.rs` CLI fallback (`-t`+`PODMAN_CLI_DEFAULT_TIMEOUT`).
   Add a mock-orchestrator test: a container that ignores SIGTERM for >30s must still end `stopped`.
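
Both parts of the fix can be sketched together: pick the grace from the manifest when present, fall back to a per-app table, and always give the wrapper a deadline longer than the grace. This is an illustration only. `stop_grace`, `table_grace_secs`, and `wrapper_deadline` are hypothetical names, the table keys reuse the labels quoted above (the real app ids may differ), and the 15s buffer is the example value from part 2.

```rust
use std::time::Duration;

const DEFAULT_GRACE_SECS: u64 = 30;
const DEADLINE_BUFFER_SECS: u64 = 15;

/// Per-app fallback table, values as quoted for `stop_timeout_secs()`.
fn table_grace_secs(app_id: &str) -> u64 {
    match app_id {
        "bitcoin" => 600,
        "lnd" => 330,
        "electrumx" => 300,
        "immich_postgres" => 120,
        "fedimint" | "btcpay" => 60,
        _ => DEFAULT_GRACE_SECS,
    }
}

/// Manifest value wins (option C); the table covers apps whose
/// manifest does not declare one.
pub fn stop_grace(app_id: &str, manifest_grace_secs: Option<u64>) -> Duration {
    Duration::from_secs(manifest_grace_secs.unwrap_or_else(|| table_grace_secs(app_id)))
}

/// Deadline for the await around the stop call. It must exceed the
/// grace so the SIGKILL issued at grace expiry lands inside the wait.
pub fn wrapper_deadline(grace: Duration) -> Duration {
    grace + Duration::from_secs(DEADLINE_BUFFER_SECS)
}
```

The failure mode in the journal above corresponds to `wrapper_deadline(g) == g`: the timeout and the SIGKILL race, and the timeout wins.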

**Build/deploy after the fix:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`
→ sideload to .228 + .198 (stop archipelago, cp binary, start) → **re-quadletize .228** (its backend
`.container` files are gone from my cascade-gate contamination — reinstall its apps so units
regenerate, matching .198) → re-run the canonical gate (DESTRUCTIVE only).

### ✅/⚠️ FIX SHIPPED + VALIDATED 2026-06-22 — and the gate has MORE causes than the grace bug

**Done:** the grace fix is implemented (option **C+table fallback**: manifest `stop_grace_secs` →
`stop_grace_secs_for()` table; deadline = grace + 15s), unit-tested (3 tests green), committed
(`2dad64b2`), release-built, and **deployed to BOTH .228 and .198** (active, UI 200). Quadlet
regression suite green (37/37). **Validated:** healthy app `vaultwarden` stops cleanly on .198
(running→exited→removed) — no regression; the deployed binary's stop path works.

**The gate stop-failure was MULTI-CAUSED (3 real product bugs) — all 3 now FIXED + the electrumx
lifecycle suite is GREEN (10/10, 66s) on .228:**
1. ✅ **Stop ignored per-app grace** (`podman stop -t 30` spurious 30s timeout) — commit `2dad64b2`.
   Orchestrator now uses manifest `stop_grace_secs` → `stop_grace_secs_for()` table; deadline =
   grace + 15s; applied to quadlet stop + API + CLI.
2. ✅ **Reconciler resurrected user-stopped apps** — commit `760a32bc`. The reconcile filter's
   `dependency_required` override re-included a user-stopped dependency (electrumx ← active mempool),
   the in-memory `disabled` set is wiped on manifest reload, and the host-port "repair" then restarted
   the stopped backend within ~8s. Fix: `ensure_running_with_mode` now bails `Left("user-stopped")`
   when the on-disk `user_stopped` marker is set (the single choke point all reconcile flows through);
   install/start clear the marker first so user actions are unaffected.
3. ✅ **container-list reported user-stopped apps as `running`** — commit `6e49ce6f`. The backend was
   Exited but its UI companion (electrs-ui/bitcoin-ui/…) kept serving the launch port, and the
   state-refresh upgraded any reachable launch port to `running`. Fix: `handle_container_list` forces
   `stopped` for `user_stopped` apps before the launch-port refresh.
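
Fixes 2 and 3 apply the same rule in two places: the on-disk `user_stopped` marker outranks every signal that would otherwise imply "running". A minimal sketch of both decisions as pure functions; `should_reconcile_start`, `reported_state`, and `ReportedState` are hypothetical names and simplify the real `ensure_running_with_mode` / `handle_container_list` paths.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ReportedState {
    Running,
    Stopped,
}

/// Reconcile side: even if the filter wants the app (for example via a
/// dependency override), a user-stopped app is left alone.
pub fn should_reconcile_start(user_stopped: bool, wanted_by_filter: bool) -> bool {
    wanted_by_filter && !user_stopped
}

/// container-list side: check the marker BEFORE the launch-port
/// refresh, so a live UI companion cannot upgrade the state.
pub fn reported_state(
    user_stopped: bool,
    backend_running: bool,
    launch_port_reachable: bool,
) -> ReportedState {
    if user_stopped {
        return ReportedState::Stopped;
    }
    if backend_running || launch_port_reachable {
        ReportedState::Running
    } else {
        ReportedState::Stopped
    }
}
```

Install and start clearing the marker first is what keeps this from blocking explicit user actions.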

**Earlier theories now RESOLVED/superseded:** "fedimint crash-looping" was **probe-induced churn** —
left alone, fedimint is stable (Up 48 min, 0 watchdog restarts/30 min); its restarts during testing
were the host-port watchdog firing while I rapid-cycled stop/start (fixed by #2). "Exited→Stopped
key mismatch" was actually the live-UI-companion launch-port issue (#3). "Grace vs gate-timeout"
(electrumx 300s) was moot — a healthy electrumx honours SIGQUIT and stops in <1s.

**TWO-NODE GATE RESULT (1×, DESTRUCTIVE, both with the 3-fix binary):**
- **.228: 104/110.** All previously-failing `package.stop` tests now PASS (bitcoin/btcpay/electrumx/
  fedimint/immich). Remaining 6: test 31 (companion recreate), 44 (fedimint orphan — probe
  pollution), 55 (immich restart timing), 83 (bitcoin not archival-synced), 94/99 (endpoint/lnd-proxy
  cascade from 83).
- **.198: 94/110.** **14 of 16 failures are one root cause: bitcoin is in IBD** (test 83 says
  `blocks=817652 headers=954850` — ~137k behind). Everything chained to bitcoin cascades: lnd
  (16,85), btcpay (22,23,103), electrumx (37), mempool stack (71,72,73,101), endpoints (94),
  bitcoin.getinfo (7,12). The other 2 are node-independent: **31** (companion recreate) and **44**
  (fedimint orphan pollution).

**CONCLUSION: the lifecycle-stop blocker is FIXED and validated on both nodes.** The residual red is
NOT lifecycle bugs — it is (a) **bitcoin still syncing (IBD)** on the test nodes [test 83 is an
explicit precondition; nothing electrumx/lnd/btcpay/mempool can pass until it finishes], (b) **.228
plain-podman contamination** (my cascade-gate), and (c) two minor items: **test 31** companion-unit
recreate (both nodes — likely the 90s window vs reconcile tick + image step; investigate) and **test
44** orphan fedimint container left by my probing.

**EVERY gate failure is now FIXED or explained — NO lifecycle code bugs remain.** Final read:
- ✅ `package.stop` (the blocker): 3 bugs fixed (`2dad64b2`/`760a32bc`/`6e49ce6f`), green both nodes.
- **bitcoin-IBD cascade** (most of .198's red): environmental — bitcoin syncing (test 83 precondition).
- **test 31** companion-recreate: NOT a product bug. Two things: (a) **FIXED** — the companion
  reconcile stage was gated behind the slow per-app pass; now it runs on its own ~30s loop
  (`452f05d8`). Validated on .228 with the new binary: a deleted `archy-electrs-ui` unit self-heals
  in **~10s** (was stuck 100s+), journal: `companion not active, repairing → wrote quadlet unit →
  companion started`. (b) **HARNESS CAVEAT** — the companion-survives bats does LOCAL `rm`/`systemctl
  --user` (no ssh), so running the gate from .116 against a remote node actually tests **.116's**
  companions with **.116's** (old) binary, not the RPC target. ⇒ the companion-survives suite must be
  run ON the target node (or with the new binary on .116) to be meaningful. This explains the
  "failed on both nodes" runs — both were silently testing .116.
- **test 55** immich restart: NOT a bug — the heavy 3-container stack (postgres+redis+server) restarts
  in >120s under load; immich DOES return to running. *Optional:* bump the immich restart wait.
- **test 44** fedimint orphan: my probe pollution; a teardown clears it.

**To reach a literally-green 5× gate (now infra/node-prep + minor test-window tuning, not lifecycle code):**
1. Let bitcoin finish IBD on a test node (or point the gate at an archival-synced bitcoin).
2. Re-quadletize .228 (reinstall its backends so `.container` units regenerate, matching .198).
   electrumx done; bitcoin/btcpay/fedimint/immich/etc. remain. (Most backends ARE in manifest_ids
   already; this is about regenerating quadlet units + clearing adopted plain-podman state.)
3. Optional: faster companion-reconcile cadence (test 31) + longer immich-restart wait (test 55) +
   clear the test-44 orphan — or simply run the gate on a less-loaded, bitcoin-synced node.
4. ✅ **test 31 ROOT-CAUSED = contamination + load (NOT a product bug).** `companion::reconcile` only
   recreates a deleted companion unit (e.g. `archy-electrs-ui`) when its PARENT backend (electrumx)
   is in `manifest_ids`. On contaminated .228 electrumx ran as plain podman and was NOT a tracked
   manifest install (its `/opt/.../electrumx/manifest.yml` exists on disk but wasn't loaded), so the
   reconciler never iterated it → companion orphaned. **Proven fix:** `package.install electrumx`
   re-registered it (now `reconcile action app_id=electrumx` fires) AND restored the companion (unit
   present, service active). The companion self-heal logic is correct. ⇒ test 31 clears once .228 is
   re-quadletized (step 2). electrumx on .228 is now de-contaminated. Still: clear test-44 orphans.
5. Then run `ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1` on the synced+quadlet node, then the other.

**Quadlet context (still true, but SEPARATE from the bug above):** quadlet IS the intended backend
runtime — .198 has the backend `.container` files (bitcoin-knots/btcpay-server/fedimint/filebrowser/
indeedhub/gitea/grafana/botfights/…). .228 lost them (only UI companions + home-assistant remain;
`bitcoin-core.container` is `.disabled-20260506`) **because my cascade-gate uninstalled its apps and
my `package.start` restore recreated them as bare `podman run --restart=unless-stopped`** without
regenerating units. Two related hardening items: (a) `package.start` should regenerate a missing
quadlet unit, not fall back to bare podman; (b) re-survey the status doc's "Quadlet-everywhere ~96%"
from `.container`-file presence + `PODMAN_SYSTEMD_UNIT`, not from "container running".
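
A possible shape for hardening item (b), as a sketch only. `ContainerFacts`, `is_quadlet_managed`, `quadlet_coverage` and `bare_podman_names` are hypothetical names, and the facts are assumed to be gathered elsewhere (unit-file listing plus the container's `PODMAN_SYSTEMD_UNIT` label). The point is that "running" plays no part in the coverage number.

```rust
/// Facts about one container, collected by some outer survey step (assumed).
struct ContainerFacts {
    name: &'static str,
    /// A `.container` unit file exists for this container.
    unit_file_present: bool,
    /// Value of the `PODMAN_SYSTEMD_UNIT` label, if the container carries one.
    systemd_unit_label: Option<&'static str>,
    running: bool,
}

/// Quadlet-managed means: unit file on disk AND the container was started by a unit.
fn is_quadlet_managed(c: &ContainerFacts) -> bool {
    c.unit_file_present && c.systemd_unit_label.is_some()
}

/// Coverage in percent, or `None` when there is nothing to survey.
fn quadlet_coverage(containers: &[ContainerFacts]) -> Option<f64> {
    if containers.is_empty() {
        return None;
    }
    let managed = containers.iter().filter(|c| is_quadlet_managed(c)).count();
    Some(managed as f64 * 100.0 / containers.len() as f64)
}

/// Containers that are up but not quadlet-managed: the "adopted plain-podman" class.
fn bare_podman_names(containers: &[ContainerFacts]) -> Vec<&'static str> {
    containers
        .iter()
        .filter(|c| c.running && !is_quadlet_managed(c))
        .map(|c| c.name)
        .collect()
}
```

A node where everything is "running" can still score well below 100% here, which is the distinction the status doc's figure currently misses.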

The **stop→stopped STATE reporting is correct** once the container actually stops (server.rs:1334
keeps a `--rm`'d app visible as `Stopped` via the `user_stopped` guard — proven on filebrowser); the
bug is purely "container never stops", not "state not reported".

### MY-SESSION ERRATA (own it on resume)
- I ran the gate with `ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1`, which is **NOT** the canonical gate (that
  is `ARCHY_ALLOW_DESTRUCTIVE=1` only — stop/start/restart, no uninstall/reinstall; see run-gate.sh
  "Suggested release-gate invocation"). Cascade ran uninstall/reinstall on every app and, when I
  killed the run mid-iteration, left bitcoin-knots/electrumx/btcpay/fedimint/immich uninstalled or
  stranded. **I fully restored .228** (reinstalled bitcoin-knots with the correct image
  `146.59.87.168:3000/lfg2025/bitcoin-knots:latest`; started the rest; cleared a stale
  `user-stopped.json`). Verified healthy: UI 200, 35 containers, 17 apps `running`.
- Reinstall gotcha: `package.install` needs a REAL image ref in `dockerImage`; a bare app name
  → `Invalid Docker image format`.

### NEXT STEPS (in order) — SINGLE-NODE (.228) criterion
1. ✅ **DONE** — 4 stop/reconcile bugs fixed + deployed (`2dad64b2` grace, `760a32bc`
   reconcile-resurrection guard, `6e49ce6f` container-list user-stopped, `452f05d8` companion
   cadence). Plus test-harness fixes (lnd/immich/fedimint/NPM readiness + config).
2. ✅ **DONE** — gate run **ON .228** (synced bitcoin): **110/110 GREEN** (1×). Key lesson:
   **run the gate on the node**, not via RPC from .116 (local podman/systemctl/bitcoin probes).
3. ◧ **5× run on .228 in progress** (`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`, on the node).
   5 consecutive clean iterations = the single-node gate criterion → demote the banner.
4. **netbird migration (#20 phase 4)** — the one real migration left; assess setup steps first (TLS
   cert gen, config files, resolver IP — may need host-file-write hooks beyond exec/copy_from_host;
   legacy is install_netbird_stack in stacks.rs). Then btcpay/mempool stack polish.
5. Hardening: `package.start` should regenerate a missing quadlet unit, not fall back to bare podman.

**Multinode / fleet (.198 + the rest) → `docs/multinode-testing-plan.md` (separate, after .228 green).**
Carry-over notes for that plan: .198 bitcoin was mid-IBD; the lnd `/app/lnd/` nginx proxy had a
stale `8081` target on .228 (repo code is correct at 18083 — re-check on other nodes).

### KNOWN ISSUES / WATCH-OUTS
- **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates
  containers it deems unhealthy; under load, false-failing health checks → churn. The
  tcp-health fix (`e2a012d0`) mitigated the frontend case. If the lifecycle gate churns on
  .198, look for other apps whose http health checks false-fail under load → prefer tcp.
- **Many concurrent SSH sessions to .198 wedge its sshd** (MaxStartups) — it pings but SSH
  hangs for minutes. Use ONE ssh at a time to .198; `pkill -f 192.168.1.198` to clear strays.
- Hook `exec` only works in the scoped form (committed). `copy_from_host` is direct `cp`.

### DEPLOY / VERIFY FACTS (both nodes, ISO Debian, glibc 2.41 — binary built on .116 runs on both)
- **Build:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`
  (~12 min, opt-level=3). Binary at `core/target/release/archipelago`. Linker
  "undefined hidden symbol" → rebuild with CARGO_INCREMENTAL=0. `archipelago` is a
  bin-only crate (no lib). Filtered tests: `cargo test -p archipelago --bin archipelago -- hooks quadlet`.
- **Sideload:** `scp binary $H:/tmp/archipelago-new` → `sudo systemctl stop archipelago;
  sudo cp /tmp/archipelago-new /usr/local/bin/archipelago; sudo chmod +x …; sudo systemctl
  start archipelago`. Containers SURVIVE the restart (--restart unless-stopped +
  podman-restart.service). Binary path is /usr/local/bin/archipelago.
- **Manifests** live at /opt/archipelago/apps/<app_id>/manifest.yml (root-owned ok). The
  orchestrator CACHES them at startup → **edit on disk then RESTART archipelago to reload**.
  Bulk deploy: `tar czf t.tgz -C apps indeedhub indeedhub-postgres indeedhub-redis
  indeedhub-minio indeedhub-relay indeedhub-api indeedhub-ffmpeg`; scp; `sudo tar xzf t.tgz
  -C /opt/archipelago/apps`.
- **Nodes:** .228 = 192.168.1.228, SSH pw `archipelago`, RPC/UI pw `password123` (https).
  .198 = 192.168.1.198, SSH pw `archipelago`, **RPC/UI pw `ThisIsWeb54321@`** (https). Both
  have the 7-container indeedhub stack + secrets + named volumes pre-existing.
- **Trigger install via RPC:** `auth.login` (sets session+csrf cookies) → send the csrf
  cookie value as `X-CSRF-Token` header → `package.install` with params
  `{"id":"indeedhub","dockerImage":"<any>"}` (dockerImage required even for stacks; install
  is async → returns `{"status":"installing"}`). install logs go to
  /var/log/archipelago/container-installs.log (best-effort) AND journalctl -u archipelago.
- **Fresh-create test recipe:** `podman rm -f indeedhub` (stateless frontend) → package.install
  indeedhub → expect install_fresh + post_install hook (all 4 steps `ok`) + UI 200 on :7778
  (/ , /nostr-provider.js, /api/). On adoption the frontend is NoOp (hook does NOT run —
  install_fresh is the only hook trigger).

## 9. Documentation map (what survives)

This master plan is the hub. Authoritative standalone docs (linked above), kept:

- **Design:** `architecture.md`, `app-developer-guide.md`,
  `APP-PACKAGING-MIGRATION-PLAN.md`, `registry-manifest-design.md`,
  `marketplace-protocol.md`, `dht-distribution-design.md`,
  `multi-node-architecture.md`, `rust-orchestrator-migration.md`,
  `bulletproof-containers.md`, `three-mode-ui-design.md`, `dual-ecash-design.md`,
  `meshroller-integration-design.md`, `phase4-streaming-ecash-plan.md`, `adr/*`.
- **Reference:** `app-manifest-spec.md`, `api-reference.md`, `developer-guide.md`,
  `operations-runbook.md`, `troubleshooting.md`, `user-walkthrough.md`,
  `bitcoin-rpc-relay.md`, `security-code-audit-2026-03.md`, `GAMEPAD-NAV.md`,
  `SEED-VERIFICATION.md`, `hotfix-process.md`, `app-registry-status-2026-06-21.md`.

All dated handoffs/resumes/transcripts/superseded trackers were consolidated here
and removed (recoverable via git) on 2026-06-21.

## 10. Backlog — investigate frontend state management (2026-06-23)

**Investigate adopting a real client-state/data-fetching layer for `neode-ui`** instead of
the current hand-rolled Pinia stores + ad-hoc fetch/poll patterns. Motivation: lifecycle/UX
bugs like the stuck "full-red" install/uninstall progress bar and ghost **My Apps** entries
(see §6c) are partly a *state-sync* problem — the UI's view of package state drifts from the
backend and isn't reliably invalidated/refetched. A principled query/cache layer (request
dedup, background refetch, cache invalidation on mutation, optimistic updates, retry/stale
handling) would make these classes of bug structurally hard.

**Research → recommend → (maybe) adopt:**
- Evaluate **TanStack Query** (Vue Query) as the leading candidate, plus alternatives
  (Pinia Colada, plain Pinia + a disciplined invalidation layer, or
  an SSE/WebSocket push model for package-state events instead of polling).
- Criteria: fit with the existing Pinia/RPC architecture, bundle-size cost, offline/PWA
  behaviour, how cleanly it models long-running mutations (install/uninstall with progress),
  and whether a push channel for package-state changes is the better root-cause fix.
- Deliverable: a short design note + a recommendation, then a scoped migration of the
  package-lifecycle surfaces (My Apps / install / uninstall / update progress) as the proof
  case — sequence AFTER workstream F (it informs F's progress-UI fix and vice-versa).

## 10b. Backlog — intelligent launch-port selection (2026-06-26)

**Replace the per-app static launch-port map with a smart, manifest-first heuristic.** Gitea
launched at **:2222 (SSH)** instead of **:3001 (web)** on a node missing the gitea manifest on
disk: `manifest_lan_address_for` returned None → the code fell through to `extract_lan_address`,
which returns podman's **first-listed** published port, and podman lists `2222->22` before
`3001->3000`. Patched 2026-06-26 (`670ebb06`) with a static `"gitea" => 3001` entry in
`lan_address_for` (`core/container/src/podman_client.rs`) — but that's a per-app band-aid (the
anti-pattern CLAUDE.md warns against; the map already carries bitcoin/lnd/mempool/immich/… by hand).

**Real fix (do this, then delete the static entries):**
- **Primary** is already correct — derive the launch URL from the manifest's declared
  `interfaces.main` port. The failure was only the *fallback*. The north-star cure is
  registry-distributed manifests (workstream B) so the manifest is always present and we never
  guess.
- **Smart fallback** — make `extract_lan_address` stop returning the blind first port: **skip
  container-side ports that are known non-HTTP (22/SSH, etc.) and prefer the published port whose
  container side matches the manifest `health_check` endpoint / a known web port.** Fixes the whole
  multi-port-app class generically (no per-app hardcoding), and lets us drop the static map.
- ~20-line change to one function + unit tests; rides the next fleet roll. NOT a free-port
  remap (that's `port_allocator.rs`, which already resolves host-port *collisions* — a different
  problem; gitea's web UI was never in conflict).
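
A minimal sketch of the smart fallback, assuming the published ports arrive in podman's listing order. `PublishedPort`, `pick_launch_port`, and the contents of both port lists are hypothetical (the source only names 22/SSH as a known non-HTTP port); this is not the current `extract_lan_address`.

```rust
/// One published port mapping, e.g. `2222->22` is { host: 2222, container: 22 }.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PublishedPort {
    host: u16,
    container: u16,
}

/// Container-side ports that never serve the web UI (assumed list; extend as needed).
const NON_HTTP_CONTAINER_PORTS: &[u16] = &[22];

/// Container-side ports that commonly serve HTTP (assumed list).
const KNOWN_WEB_PORTS: &[u16] = &[80, 443, 3000, 8080];

/// Choose the host port to launch, instead of blindly taking the first one listed.
/// Preference order: health-check port, then a known web port, then the first
/// remaining candidate. Returns `None` if only non-HTTP ports are published.
fn pick_launch_port(ports: &[PublishedPort], health_check_port: Option<u16>) -> Option<u16> {
    let candidates: Vec<PublishedPort> = ports
        .iter()
        .copied()
        .filter(|p| !NON_HTTP_CONTAINER_PORTS.contains(&p.container))
        .collect();

    if let Some(hp) = health_check_port {
        if let Some(p) = candidates.iter().find(|p| p.container == hp) {
            return Some(p.host);
        }
    }
    if let Some(p) = candidates.iter().find(|p| KNOWN_WEB_PORTS.contains(&p.container)) {
        return Some(p.host);
    }
    candidates.first().map(|p| p.host)
}
```

For the gitea case (`2222->22` listed before `3001->3000`) this returns `3001` with or without a health-check hint, so the static `"gitea" => 3001` entry would no longer be needed.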

## 10c. Backlog — generalize the archival/full-node install blocker (2026-06-26)

**Make "this app needs an un-pruned (archival, txindex) Bitcoin node" a manifest-declared
dependency, applied to every app that needs it — using the electrumX/mempool blocker as the
reference behavior.** Today the gate works but is **hardcoded**: `requires_unpruned_bitcoin()` in
`core/archipelago/src/api/rpc/package/dependencies.rs` is a literal `matches!(package_id, "electrumx"
| "electrs" | "mempool-electrs" | "mempool" | "mempool-web")`, and install `bail!`s with
`archival_bitcoin_required_message` when `bitcoin.pruned` is true or disk < `ARCHIVAL_BITCOIN_DISK_GB`
(1 TB). That's the same per-app-hardcoding anti-pattern as the gitea static map (§10b) and the
`install_*_stack` Rust — any new app needing a full node is silently *un*-gated until someone edits
this match.

**Do:**
- **Declare it in the manifest** — e.g. `requires: { bitcoin: archival }` (or a
  `dependencies.bitcoin.pruned: false` constraint) so the install pre-flight reads the requirement
  from the manifest set instead of a hardcoded list. Covers future apps automatically (manifest-driven
  north star).
- **Audit coverage** — confirm EVERY archival-dependent app is gated (electrumX, electrs,
  mempool + its electrs, and any BTC-indexer/explorer added later); add a unit test asserting the
  manifest constraint ⇒ blocker fires.
- **UX** — the blocker must be a clear, surfaced **pre-install** state in the UI (not just an RPC
  `bail!` string): explain *why* (pruned node / insufficient disk), what to do (add ~1 TB, resync
  un-pruned with txindex), and keep the app visibly "requires archival node" rather than a confusing
  generic failure. Pairs with workstream F's honest-progress/blocker UX.
- Reference: the existing `package-install-prune-check` dependency descriptor (dependencies.rs:208)
  is the seam to make data-driven.
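
One way the data-driven pre-flight could look, as a sketch under stated assumptions. `BitcoinRequirement`, `NodeFacts`, `InstallBlocker`, `archival_preflight` and `blocker_message` are hypothetical names; the manifest field they would be parsed from is still undecided above. The 1000 GB constant stands in for the source's "1 TB" and the real value of `ARCHIVAL_BITCOIN_DISK_GB` may differ.

```rust
/// What an app's manifest declares about the Bitcoin node it needs (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum BitcoinRequirement {
    NotRequired,
    AnyNode,
    Archival,
}

/// Node facts the pre-flight reads (assumed to be collected elsewhere).
struct NodeFacts {
    bitcoin_pruned: bool,
    disk_gb: u64,
}

/// Stand-in for the real constant; the source states 1 TB.
const ARCHIVAL_BITCOIN_DISK_GB: u64 = 1000;

/// Structured blocker, so the UI can render a pre-install state instead of a bail string.
#[derive(Debug, PartialEq)]
enum InstallBlocker {
    PrunedNode,
    InsufficientDisk { required_gb: u64, available_gb: u64 },
}

/// Gate driven by the manifest requirement, not by a hardcoded package list.
fn archival_preflight(req: BitcoinRequirement, node: &NodeFacts) -> Result<(), InstallBlocker> {
    if req != BitcoinRequirement::Archival {
        return Ok(());
    }
    if node.bitcoin_pruned {
        return Err(InstallBlocker::PrunedNode);
    }
    if node.disk_gb < ARCHIVAL_BITCOIN_DISK_GB {
        return Err(InstallBlocker::InsufficientDisk {
            required_gb: ARCHIVAL_BITCOIN_DISK_GB,
            available_gb: node.disk_gb,
        });
    }
    Ok(())
}

/// Human-readable "why" plus "what to do" for the surfaced blocker.
fn blocker_message(blocker: &InstallBlocker) -> String {
    match blocker {
        InstallBlocker::PrunedNode => {
            "Requires an archival node: resync Bitcoin un-pruned with txindex.".to_string()
        }
        InstallBlocker::InsufficientDisk { required_gb, available_gb } => format!(
            "Requires an archival node: {} GB needed, {} GB available.",
            required_gb, available_gb
        ),
    }
}
```

Because the gate takes the requirement as input, a new indexer or explorer is covered the moment its manifest declares `Archival`; no match arm has to be edited.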

## 10d. Mesh — Meshtastic MeshCore-parity (active blocker: stock 3ccc LoRa text) (2026-06-30)

**Current deployed canary:** `.116` is running commit `b4531bb4` with backend sha
`4ab53e539d89679ef664401a9a57996267772fed02327abc2912c3e77543acbf` and frontend bundle
`index-YOAeJF7w.js` / `Mesh-BSAo88jN.js`. `main` was pushed to `gitea-vps2`.

**What is fixed in this deployed canary:**
- Public stock Meshtastic interop is intentional: slot 0 PRIMARY is the public default LongFast
  channel (`name=""`, default PSK); slot 1 SECONDARY is `archipelago`.
- Outgoing Meshtastic messages to stock peer `3ccc` are recorded with real 2026 timestamps and
  `transport:"lora"` in RPC. The Mesh UI label maps `lora` to **LoRa**, not "Mesh".
- Post-send message refresh now polls briefly so FIPS/Tor/LoRa pills do not require a manual browser
  refresh.
- Off-grid mode now blocks the mesh-chat federation fallback path as well as the generic transport
  router: when enabled it forces LoRa-only sends and the UI banner reads
  `Tor/FIPS disabled - LoRa only`.
- Empty mesh-chat placeholder opacity was reduced.

**Still broken / resume here:**
- Stock Meshtastic peer `3ccc` -> `.116` LoRa text still does **not** surface in `mesh.messages`.
- Live `.116` logs prove bytes arrive from 3ccc, but the custom Meshtastic protobuf parser rejects
  the packet before it becomes an inbound frame:
  `Meshtastic FromRadio.packet did not parse into a decoded MeshPacket len=73 head=0dcc3c3e43153ca5b5432a16df56cbed`.
- 3ccc NodeInfo is discovered and PKC-capable:
  `Meshtastic peer is PKC-capable (NodeInfo public_key) node=1128152268 key_len=32`.
- Other received packets are decoded and intentionally ignored as non-text (`portnum=3/4/5`), so
  the serial reader is alive; the remaining blocker is the exact `MeshPacket` shape for stock
  Meshtastic text.
- Definition of done: a new text sent from stock Meshtastic `3ccc` appears in `.116`
  `mesh.messages` as an incoming LoRa message without a browser refresh, and `.116` -> `3ccc`
  visibly arrives in the Meshtastic app.
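
Decoding the logged `head=` bytes by hand is a cheap next step. The sketch below is a generic protobuf wire-format walker (standard varint / fixed32 / fixed64 / length-delimited rules), not the project's parser; `Field`, `hex_to_bytes`, `read_varint` and `walk_fields` are hypothetical names. Reading field 1 as `from`, field 2 as `to`, field 4 as `decoded` and field 5 as `encrypted` is an assumption taken from the upstream Meshtastic `MeshPacket` definition and should be checked against the proto the parser was built from.

```rust
/// One top-level protobuf field as seen on the wire.
#[derive(Debug, PartialEq)]
enum Field {
    Varint { number: u32, value: u64 },
    Fixed32 { number: u32, value: u32 },
    Fixed64 { number: u32, value: u64 },
    /// Length-delimited; `available` < `declared_len` means the dump was truncated.
    Bytes { number: u32, declared_len: usize, available: usize },
}

/// Parse a plain hex string (no separators) into bytes.
fn hex_to_bytes(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(s.get(i..i + 2)?, 16).ok())
        .collect()
}

/// Read a base-128 varint starting at `pos`; returns (value, next position).
fn read_varint(buf: &[u8], mut pos: usize) -> Option<(u64, usize)> {
    let mut value = 0u64;
    let mut shift = 0u32;
    while shift < 64 {
        let b = *buf.get(pos)?;
        pos += 1;
        value |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return Some((value, pos));
        }
        shift += 7;
    }
    None
}

/// Walk top-level fields, stopping quietly at truncation or an unknown wire type.
fn walk_fields(buf: &[u8]) -> Vec<Field> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < buf.len() {
        let Some((tag, next)) = read_varint(buf, pos) else { break };
        let number = (tag >> 3) as u32;
        pos = next;
        match tag & 0x7 {
            0 => {
                let Some((value, next)) = read_varint(buf, pos) else { break };
                out.push(Field::Varint { number, value });
                pos = next;
            }
            1 => {
                let Some(raw) = buf.get(pos..pos + 8) else { break };
                let mut a = [0u8; 8];
                a.copy_from_slice(raw);
                out.push(Field::Fixed64 { number, value: u64::from_le_bytes(a) });
                pos += 8;
            }
            2 => {
                let Some((len, next)) = read_varint(buf, pos) else { break };
                let declared_len = len as usize;
                let available = declared_len.min(buf.len() - next);
                out.push(Field::Bytes { number, declared_len, available });
                pos = next + available;
            }
            5 => {
                let Some(raw) = buf.get(pos..pos + 4) else { break };
                let mut a = [0u8; 4];
                a.copy_from_slice(raw);
                out.push(Field::Fixed32 { number, value: u32::from_le_bytes(a) });
                pos += 4;
            }
            _ => break,
        }
    }
    out
}
```

On the logged head (`0dcc3c3e43153ca5b5432a16df56cbed`) this yields field 1 as fixed32 `1128152268`, the same number as `node=` in the NodeInfo log line, then field 2 as a fixed32, then field 5 as a 22-byte length-delimited value of which only 4 bytes fit in the 16-byte head. If the upstream numbering holds, the packet may be carrying an `encrypted` payload rather than `decoded`, which would fit a PKC-capable sender; treat that as a hypothesis to verify, not a diagnosis.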

## 11. Arch Issues (reported 2026-07-01, untriaged)

User-reported, raw, not yet root-caused. Split by owner — **do not fix the mesh items from the
non-mesh thread**; they route to the mesh/Reticulum agent (§10d owner).

- **[MESH — routes to §10d owner]** Transport-type label on mesh is delayed / requires a browser
  refresh to show. Note: §10d (2026-06-30) already claims this was fixed ("Post-send message
  refresh now polls briefly so FIPS/Tor/LoRa pills do not require a manual browser refresh") — this
  report means it has regressed or the fix didn't fully land/deploy. Needs re-verification by the
  mesh owner, not a re-fix from scratch. (The "mesh"-tag-should-read-"LoRa" report that used to be
  listed alongside this was dropped 2026-07-01 — user is OK with current behavior there.)
- **[NON-MESH]** Indeedhub won't install on Arch Dev (`.116`). Since root-caused and fixed
  (`d414ae3d`): see the Indeedhub ROOT-CAUSED + FIXED entry later in this list.
- **[NON-MESH, touches bitcoin lifecycle] ROOT-CAUSED + FIX WRITTEN 2026-07-01** — Uninstalling
  Bitcoin didn't stick: the container came back in My Apps and restarted IBD. Root cause:
  `is_required_baseline_app` in `prod_orchestrator.rs` (bitcoin-knots, electrumx, lnd, mempool,
  mempool-api, archy-mempool-db, filebrowser, fedimint-clientd) self-heals when its container is
  missing — including right after an explicit uninstall — because the in-memory `disabled` set used
  to suppress that is unconditionally wiped by `load_manifests()`, which runs once per archipelago
  startup/reboot, immediately before the boot reconciler's first pass. Fix: a durable
  `user-uninstalled.json` marker (mirrors the existing `user_stopped` mechanism in
  `crash_recovery.rs`) checked at the same single reconcile choke point in
  `ensure_running_with_mode`, set on successful `remove()`, cleared on `install()`/`start()`.
  Test `reconcile_existing_respects_durable_user_uninstalled_marker_for_baseline_apps` passes;
  `cargo test --workspace` green (873 tests). Low collision risk confirmed — the mechanism is
  generic (applies to all baseline apps, not bitcoin-multi-version-specific) and the
  `bitcoin-version-bulletproof` branch/worktree had no uncommitted changes in these files at the
  time this was written. Since committed + pushed as `5b7cd5d5` (see the `.228` Bitcoin RPC entry below).
- **[NON-MESH, touches bitcoin lifecycle]** Manually stopping Bitcoin causes it to auto-restart — a
  user-initiated `package.stop` should NOT be treated as a crash by the auto-restart/health-monitor
  logic. Investigated 2026-07-01: both live restart paths (`prod_orchestrator.rs`
  `ensure_running_with_mode` and the legacy `health_monitor.rs` loop) already check the durable
  `user_stopped` marker before restarting and look correctly wired on current `main` — no live
  repro path found in code. Likely the reporting node's deployed binary predates a fix already on
  `main`; needs the node identity + build/commit to confirm before further action.
- **[NON-MESH] FIXED 2026-07-01, LIVE ON `.228`** — `.228` Bitcoin RPC was connection-refused
  ("waiting for the Bitcoin RPC listener"). Root cause: the queued `bitcoin-knots-reindex` swap from
  the bitcoin-rollout handover (`project_bitcoin_rollout_handover.md`) was never finished — the
  detached reindex container (RPC intentionally off) had been fully synced and idling for 2 days
  (height 956191, `progress=1.000000`). Executed the queued swap: stopped+removed
  `bitcoin-knots-reindex`, started the managed `bitcoin-knots` service via RPC. Confirmed healthy:
  v29.3.knots20260210, connected to peers, tip advanced to 956193, RPC listening on 8332.
  **Follow-up same day:** user asked to confirm the version, since the UI/catalog said "latest" —
  turned out the container was running a **4-month-old cached `:latest` image**
  (`v29.3.knots20260210`) while the actual newest release (`29.3.knots20260508`) was already pulled
  locally 2 days earlier but never applied. Root-caused why: `installed_version()` in
  `set_config.rs` (`package.versions`/`package.set-config`) reported the literal image **tag string**
  used to create the container (`"latest"`), not the content actually running — a stale local
  `:latest` cache reports "latest" forever regardless of what `latest` has since moved to. **FIXED**:
  when the resolved tag is a floating one (`latest`/`stable`/`release`/`main`), `installed_version()`
  now asks the Bitcoin backend directly (`podman exec <name> bitcoind --version`, parsed via new
  `parse_bitcoind_version_output`) instead of trusting the tag literal. 5 new tests in
  `set_config.rs` (`floating_tag_detects_generic_channel_names`, `parses_knots_version_line`,
  `parses_core_version_line`, `parse_returns_none_when_output_has_no_version_marker`,
  `image_tag_keeps_registry_port_colon`) all pass. No frontend change needed — `AppSidebar.vue`
  ("Running Version" in the Version & Updates card) already renders `versionInfo.installedVersion`
  verbatim, so it will show the real version once this backend fix ships. Then used the existing
  bulletproof switch mechanism itself — `package.set-config {id: "bitcoin-knots", version:
  "29.3.knots20260508"}` (an upgrade, so no downgrade-confirm gate) — to move `.228` onto the real
  latest image. Confirmed: `bitcoind --version` now reports `v29.3.knots20260508`, no reindex
  triggered, tip advancing normally. **Committed + pushed** `5b7cd5d5` (same batch as the
  uninstall-durability fix above).
- **[NON-MESH] ROOT-CAUSED 2026-07-01, NOT A CODE BUG — needs a capacity/ops decision** — `.198`
  `bitcoin-knots` RPC saturation ("work queue depth exceeded" despite `-rpcworkqueue=256`),
  cascading into stuck `fedimint`/`fedimint-gateway`/`fedimint-clientd` (`(starting)` 36-46h — this
  is what the user meant by "fedimint guardian keeps going down," not `.228`) and portainer
  flapping (seen completely absent from `podman ps -a` at one check, `Up 12 seconds` moments later
  at a follow-up check — it's being killed+recreated repeatedly, not missing). Real root cause:
  **`.198`'s `bitcoin-knots` is still only ~21% synced (height 507247, unchanged from the ~21%
  noted 2026-06-28 in [[project_bitcoin_multiversion_integration]] three days ago) and its root
  disk is nearly I/O-saturated** (`iostat -x`: `%util` 92-97%, `w_await` ~82ms) from IBD validation
  competing with ~30 other containers' disk I/O on a small (29GB) root partition on an OptiPlex
  3020M. CPU is mostly idle (bitcoin-knots at 3.68%) — this is a **disk I/O bottleneck**, not the
  retry-storm hypothesis first suspected. Every RPC caller (health_monitor, fedimint, electrumx,
  UI) times out waiting on a disk that can't keep up, and portainer's health-check failures trigger
  the orchestrator's zombie/drift-repair kill+recreate cycle, which never stabilizes because the
  underlying I/O contention never resolves. **Not fixed** — this needs a user decision (accept slow
  IBD and wait, uninstall some of the ~15 other apps competing for I/O on this node, or a hardware
  upgrade), not a code change. `docs/multinode-testing-plan.md` already treats `.198` IBD status as
  a pre-req to check before the multinode pass, consistent with this finding.
- **[NON-MESH] ROOT-CAUSED + FIXED 2026-07-01** — Indeedhub wouldn't install on Arch Dev (`.116`).
  Root cause: orphan leftover containers (`indeedhub-api`, `indeedhub-ffmpeg`) from a prior
  partial/failed install, with `indeedhub-postgres` and the rest of the stack never created.
  `health_monitor` correctly saw these as orphans (no `package_data` entry) and left them alone, but
  a separate runtime crash-recovery loop (`start_stopped_app_stacks` in `crash_recovery.rs`, runs
  every 120s — see `main.rs` "Stack supervisor") fired on ANY existing stack container regardless of
  whether the stack's core dependency existed, force-restarting `indeedhub-api` forever against a
  `postgres` hostname that could never resolve (`indeedhub-postgres` doesn't exist) — an infinite
  crash loop that also blocked a real reinstall via container-name conflicts. **Fixed**: added an
  `anchor` field to `StackRecoverySpec` (the stack's core DB/server container — `immich_postgres`,
  `indeedhub-postgres`, `netbird-server`) and gated recovery on that anchor existing first, not on
  any container existing. New test `stack_recovery_anchor_is_the_stacks_own_core_dependency`.
  **Committed + pushed** `d414ae3d`.
- **[NON-MESH] ROOT-CAUSED + FIXED 2026-07-01** — Electrum launch/app-loader UI overlapped with the
  ElectrumX syncing screen. Root cause (found via a parallel Explore-agent investigation):
  `AppSessionFrame.vue` rendered the generic `AppLoadingScreen` and the ElectrumX sync overlay
  simultaneously at the same `z-index: 10` — both conditions (`loading` and
  `electrsSync && !electrsSync.stale`) could be true at once during launch. **Fixed**: the generic
  loader now also checks `!(electrsSync && !electrsSync.stale)` so the more-informative sync screen
  takes precedence instead of the two stacking. `vue-tsc --noEmit` clean. **Committed + pushed**
  `d414ae3d`.
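
Sketch of the floating-tag logic from the `.228` version fix in the list above, for orientation only. The function names and signatures here are hypothetical stand-ins for `installed_version()` and `parse_bitcoind_version_output`, and the `probe` closure stands in for the `podman exec <name> bitcoind --version` call. The exact text of the version line is an assumption: the parser only relies on a `version ` marker followed by a token.

```rust
/// Tag part of an image ref. The last ':' counts as a tag separator only if it
/// comes after the last '/', so a registry port (`host:3000/...`) is not a tag.
/// Digest refs (`@sha256:...`) are out of scope for this sketch.
fn image_tag(image: &str) -> Option<&str> {
    let last_colon = image.rfind(':')?;
    match image.rfind('/') {
        Some(slash) if last_colon < slash => None,
        _ => Some(&image[last_colon + 1..]),
    }
}

/// Channel-style tags that say nothing about the content actually running.
fn is_floating_tag(tag: &str) -> bool {
    matches!(tag, "latest" | "stable" | "release" | "main")
}

/// Pull the token after the first `version ` marker out of `--version` output.
fn parse_version_output(output: &str) -> Option<String> {
    const MARKER: &str = "version ";
    output.lines().find_map(|line| {
        let idx = line.find(MARKER)?;
        let token = line[idx + MARKER.len()..].split_whitespace().next()?;
        Some(token.to_string())
    })
}

/// Report the running version: trust a pinned tag, but ask the backend when the
/// tag is floating. Falls back to the tag literal if the probe yields nothing.
fn installed_version<F>(image: &str, probe: F) -> String
where
    F: Fn() -> Option<String>,
{
    // An untagged ref is treated like `latest` (assumption: registry default).
    let tag = image_tag(image).unwrap_or("latest");
    if is_floating_tag(tag) {
        if let Some(version) = probe().and_then(|out| parse_version_output(&out)) {
            return version;
        }
    }
    tag.to_string()
}
```

The probe is only invoked for floating tags, so pinned installs keep reporting their tag without an extra `podman exec`.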

## 12. `.198` portainer + boot-reconciler circuit breaker (2026-07-01)

**`.198` portainer flapping was NOT the same root cause as the disk-I/O issue above** — user
correctly pushed back on that assumption. Actual cause: fatal, permanent — `podman logs portainer`
showed `The database schema version does not align with the server version`. `.116`/`.228` both run
the same pinned `portainer:2.19.4` and are healthy, so this was `.198`-specific data drift: its
`portainer.db` was created/upgraded by a newer binary at some point in that node's own history,
independent of the other nodes (git history has no record of the pin ever being anything but
2.19.4, so this was very likely a manual/ad-hoc podman operation on `.198` outside the normal
install/update path, not a platform bug in version selection). **Fixed live**: backed up
`portainer.db` to `_reset-backup-2026-07-01/` (not deleted) and let the pinned `2.19.4` reinitialize
fresh — portainer only holds its own dashboard/endpoint config, not irreplaceable user data, and the
user approved a reset over attempting recovery. Confirmed stable afterward.

**Follow-up "make sure this can't happen again" (user request)** — root-caused why this could loop
forever undetected: `BootReconciler` (`boot_reconciler.rs`, ticks every 30s, `reconcile_existing()`)
recreates containers via `ensure_running_with_mode`'s `ContainerState::Created`/`Stopped`/`Exited`
"start failed → stop+remove+install_fresh" branches with **no bound at all** — unlike
`health_monitor.rs`'s independent restart path, which already has `MAX_RESTART_ATTEMPTS=10` +
backoff + a persistent user-facing notification after giving up. A container whose entrypoint
process fatally crashes moments after `podman start` succeeds (podman itself sees no error) has its
container recreated every single tick, forever, with only debug/warn-level logs — exactly
portainer's failure mode, and the reason it could keep looping (crash_recovery's periodic
supervisor doesn't cover single-container apps like portainer — only stack members — so this was
the actual mechanism, not the one used for indeedhub above).

**Fixed**: added `MAX_REPAIR_ATTEMPTS=5` / `REPAIR_ATTEMPT_RESET_WINDOW=30min` circuit breaker
(`should_attempt_repair`/`clear_repair_attempts`, `prod_orchestrator.rs`) gating the zombie-guard
recreate and both "start failed" recreate branches (`Created` and `Stopped|Exited` states). Once
exhausted, reconcile leaves the container alone (`ReconcileAction::Left("repair-attempts-exhausted")`)
and logs an `error!` pointing at `podman logs <name>` instead of recreating forever; an explicit
`install()`/`start()` clears the counter, same pattern as `user_stopped`. New test
`repair_recreate_stops_after_max_attempts_instead_of_looping_forever`. **Scoped deliberately**: left
the drift-detection recreates (port/env drift, `Stopping`-stuck) unguarded for this pass — those are
host-state-corrections that normally resolve in one shot, a materially different failure shape from
"the app itself is fatally broken," and touching all ~8 recreate call sites in one pass risked
regressing carefully-tuned existing behavior for low incremental benefit. Full breaker coverage
(and/or wiring a persistent `Notification` through, which needs `StateManager` threaded into
`BootReconciler` — a bigger `main.rs` startup-order change not attempted here) is a reasonable
future follow-up if another single-container app hits this same failure class.
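
The breaker's behavior can be sketched as below. This is an illustration, not the code in `prod_orchestrator.rs`: `RepairBreaker` and the explicit `now` parameter are hypothetical (passing the clock in keeps the sketch testable), and measuring the reset window from the first attempt in the current window is an assumption, since the text above does not say which attempt anchors it.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const MAX_REPAIR_ATTEMPTS: u32 = 5;
const REPAIR_ATTEMPT_RESET_WINDOW: Duration = Duration::from_secs(30 * 60);

/// Per-app repair counter: (attempts used, start of the current window).
#[derive(Default)]
struct RepairBreaker {
    attempts: HashMap<String, (u32, Instant)>,
}

impl RepairBreaker {
    /// Returns true and consumes one attempt if a recreate is still allowed.
    /// Once the budget is spent, returns false until the window elapses or
    /// the counter is cleared explicitly.
    fn should_attempt_repair(&mut self, app_id: &str, now: Instant) -> bool {
        let entry = self.attempts.entry(app_id.to_string()).or_insert((0, now));
        if now.duration_since(entry.1) >= REPAIR_ATTEMPT_RESET_WINDOW {
            // Window elapsed: start a fresh budget.
            *entry = (0, now);
        }
        if entry.0 >= MAX_REPAIR_ATTEMPTS {
            return false;
        }
        entry.0 += 1;
        true
    }

    /// Called on an explicit install/start, mirroring how `user_stopped` is cleared.
    fn clear_repair_attempts(&mut self, app_id: &str) {
        self.attempts.remove(app_id);
    }
}
```

A caller in the reconcile path would recreate only when `should_attempt_repair` returns true, and otherwise leave the container alone with the `repair-attempts-exhausted` reason.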

**Also answered**: "why does portainer's setup wizard not have podman as an option?" —
`apps/portainer/manifest.yml` bind-mounts the rootless podman socket
(`/run/user/1000/podman/podman.sock`) to `/var/run/docker.sock` inside the container. Portainer
never knows it's talking to podman — it just sees the standard Docker socket path and speaks the
Docker Engine API, which podman's socket implements compatibly. Not a bug: pick "Docker" (local) in
the wizard.

## 12b. `.198` disk-I/O relief — apps uninstalled, immich uninstall-mapping bug found+fixed (2026-07-01)

User approved uninstalling immich, botfights, grafana, searxng on `.198` to relieve the disk-I/O
contention from §11 (bitcoin-knots' slow IBD). All 4 uninstalled via RPC. **Found another instance
of the exact §11 uninstall-durability bug class, this time in the uninstall app_id MAPPING rather
than the durability mechanism**: `orchestrator_uninstall_app_ids("immich")` had no case (fell to the
generic `_ => vec![package_id]`), so uninstalling "immich" only disabled the "immich" app_id itself
— "immich-postgres" and "immich-redis" (separate orchestrator-tracked manifests, same shape as
mempool-api/archy-mempool-db) stayed enabled, and the boot reconciler kept restarting their leftover
*stopped* containers every ~30s. Confirmed live via `journalctl`: `reconcile action
app_id=immich-redis action=Started` well after uninstall. **Fixed** (mirrors the existing
mempool/btcpay/electrum mappings) + new test `immich_uninstall_covers_every_sibling_orchestrator_app_id`.
Cleaned up live on `.198` by fully removing (not just stopping) the orphaned containers: even the
old deployed binary already leaves a fully *absent* optional container alone, so the cleanup held
without needing a redeploy. **Committed + pushed** `09d42cbb`.
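
A minimal sketch of the mapping shape described above, for orientation only. The function name, the
`immich` sibling app_ids, and the generic fallback come from this section; the signature and the
`Vec<String>` return type are assumptions, and the existing mempool/btcpay/electrum arms are omitted
because their exact app_id lists are not reproduced here.

```rust
/// Sketch: map a user-facing package id to every orchestrator app_id that
/// must be disabled on uninstall. Signature and return type are assumed.
fn orchestrator_uninstall_app_ids(package_id: &str) -> Vec<String> {
    let ids: &[&str] = match package_id {
        // Multi-container app: the siblings are separate orchestrator-tracked
        // manifests, so each one must be listed explicitly.
        "immich" => &["immich", "immich-postgres", "immich-redis"],
        // Generic single-container case: only the package itself.
        _ => return vec![package_id.to_string()],
    };
    ids.iter().map(|id| id.to_string()).collect()
}
```

With this shape, a package that has no explicit arm silently falls through to the single-id case,
which is exactly how the sibling containers were missed.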

**Outcome**: right after the uninstalls the disk still showed 90-100% `%util` and
`getblockchaininfo` still timed out (65s). The likely reason is that bitcoin-knots' own IBD
validation (492GB+ cumulative block I/O already) is the dominant consumer, not the other apps:
removing 4 relatively light/idle apps reduces concurrent contention but doesn't fix a fundamentally
disk-bound full-chain validation in progress. Data volumes for the uninstalled apps were left in
place (uninstall doesn't wipe `/var/lib/archipelago/<app>` by default), so disk *space* usage (72%)
is unchanged; only the *active* I/O from those containers stopped.

**`.228` "fedimint guardian" — clarified, not a bug**: user separately flagged ".228 has the fedimint
guardian stop issue." Checked: `.228` has NO `fedimint` (guardian) container installed at all — only
`fedimint-clientd` (a client joining *external* federations) and its UI, both healthy (`Up 2-5 days`).
Only `.198` runs an actual guardian (`fedimint`), and that's the one already covered by §12's
disk-I/O root cause. Likely a node mix-up in the report — flag if something else specific to `.228`
was meant.

## 13. Peer-federated content 404s over FIPS (2026-07-01) — DATA LOSS, not a code bug in the transport

User report: `.116 → .228` streaming/downloading peer-federated content over FIPS failed with
`/api/peer-content/<onion>/<id>` 404s, surfacing in the browser as `NotSupportedError: no supported
source`. Investigated the full path: nginx's `/api/peer-content/` proxy block is present on `.116`;
`handle_peer_content_stream` (`api/handler/proxy.rs`) correctly dials `.228` over FIPS and passes
the peer's real HTTP status straight through — not a routing bug. `.228`'s `content/catalog.json`
genuinely lists both content IDs from the error log as `access: free`, `availability: allpeers` (so
not a permissions bug either), **but the backing files don't exist anywhere on `.228`** — checked
both `content/files/` (empty except `catalog.json`) and the FileBrowser fallback path (`Music/`,
`Photos/` dirs exist but are empty, `mtime` 2026-06-26). The catalog's last real edit was
2026-06-19, so these files were lost in a data-dir reset that post-dates the catalog (most likely
the same window as other 2026-06-26 fixes in `docs/PRODUCTION-MASTER-PLAN.md` §6c) and nobody
pruned the stale catalog entries or re-uploaded the files since. **This is real data loss on `.228`,
not recoverable via code**: ask the user whether the original files (a screen recording + an mp3)
still exist somewhere else so they can be re-added.

**Code fix shipped regardless** (self-healing, generalizable): `content_server::serve_content` now
prunes a catalog entry from disk the moment it 404s because its backing file is missing
(`prune_missing_content_entry`), instead of leaving it advertised to every peer forever with no way
to distinguish "gone" from "transient failure." New tests
`serve_content_prunes_catalog_entry_whose_file_is_missing` +
`serve_content_leaves_other_entries_untouched_when_pruning`.
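
The prune-on-missing behavior can be sketched roughly as follows. This is an illustration under
stated assumptions, not the shipped code: `CatalogEntry`, `ServeOutcome`, and the in-memory
`Vec` catalog are hypothetical stand-ins (the real `catalog.json` schema and persistence step are
not shown in this section); only the names `serve_content` and `prune_missing_content_entry` come
from the text above.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical in-memory catalog entry; the real schema is not shown here.
#[derive(Debug, Clone, PartialEq)]
struct CatalogEntry {
    id: String,
    file_name: String,
}

/// Hypothetical outcome type for a serve attempt.
#[derive(Debug, PartialEq)]
enum ServeOutcome {
    /// Backing file exists at this path.
    Found(PathBuf),
    /// Backing file is gone; the entry was removed from the catalog.
    PrunedMissing,
    /// The id is not in the catalog at all.
    UnknownId,
}

/// Drop the entry with the given id. A real implementation would also
/// rewrite the on-disk catalog; that step is omitted in this sketch.
fn prune_missing_content_entry(catalog: &mut Vec<CatalogEntry>, id: &str) {
    catalog.retain(|e| e.id != id);
}

fn serve_content(
    catalog: &mut Vec<CatalogEntry>,
    files_dir: &Path,
    id: &str,
) -> io::Result<ServeOutcome> {
    let Some(entry) = catalog.iter().find(|e| e.id == id) else {
        return Ok(ServeOutcome::UnknownId);
    };
    let path = files_dir.join(&entry.file_name);
    match std::fs::metadata(&path) {
        Ok(_) => Ok(ServeOutcome::Found(path)),
        // Only a definite "not found" prunes the entry.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            prune_missing_content_entry(catalog, id);
            Ok(ServeOutcome::PrunedMissing)
        }
        // Any other I/O error (permissions, etc.) is surfaced unchanged
        // and leaves the catalog alone.
        Err(e) => Err(e),
    }
}
```

Pruning only on `NotFound` is the design point of the sketch: it keeps "gone" separate from
"transient failure," so a temporary I/O error does not delete a valid catalog entry.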

## 14. Known test flakiness (not investigated, low priority)

`credentials::operations::tests::*` has thrown 3 different failures (including
`test_list_credentials_no_filter` and `test_list_credentials_filter_by_did`) across separate
`cargo test --workspace` runs this session, all `invalid utf-8 sequence` panics from
`credentials/operations.rs:336`. The tests pass reliably in isolation and under `--test-threads=1`;
they fail only under full-parallel `--workspace` runs, and never on the same test twice. That points
to a shared test-fixture/tempfile collision producing non-UTF8 bytes under parallelism, not a real
credentials bug, and not something related to anything touched this session. Worth a real fix at
some point (a test-isolation issue makes CI flaky) but out of scope here.
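
If the tempfile-collision hypothesis holds, one common isolation pattern is to give every test its
own scratch directory. The helper below is a hypothetical std-only sketch (`unique_test_dir` and
`NEXT_ID` do not exist in the codebase as far as this section shows), and it has not been checked
against the actual fixtures in `credentials/operations.rs`.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter so two tests in the same binary never get the same path.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Hypothetical helper: create and return a fresh directory per call.
/// The process id separates concurrently running test binaries; the
/// counter separates parallel tests within one binary.
fn unique_test_dir(prefix: &str) -> std::io::Result<PathBuf> {
    let n = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    let dir = std::env::temp_dir().join(format!("{prefix}-{}-{n}", std::process::id()));
    std::fs::create_dir_all(&dir)?;
    Ok(dir)
}
```

Each test would call `unique_test_dir("credentials")` at the start and write its fixtures only
under the returned path, removing the directory when done.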