archy/docs/RESUME.md

# RESUME - Archipelago Release Hardening on `.198`

Last updated: 2026-06-10

## 2026-06-10 05:48 EDT Active Session Checkpoint

Work resumed from `docs/NEXT_TERMINAL_HANDOFF.md`. No `.198` host actions have
been run yet in this resumed pass.

Current first steps:

1. Rerun `git diff --check`.
2. Rerun the focused Rust image-version test for the Nextcloud false-update
   helper.
3. If those are clean, inspect and continue the rootless Podman lifecycle and
   scanner-backoff work before any `.198` validation.

Progress:

- `git diff --check` passed.
- Focused Rust image-version test in `/tmp/archy-cargo-image-versions` remains
  inconclusive: the tool PTY stayed open after compile output stopped, with no
  active `cargo`, `rustc`, or linker process visible.
- Bounded retry of the focused image-version test using the normal workspace
  target also timed out: `timeout 300s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests`
  exited `124` (the `timeout` exit status for a command that hit its limit)
  after compiling the `archipelago` test target without reaching test output.
  Nextcloud false-update validation is still not closed.
- Local code change in progress: single-orchestrator `package.stop` now returns
  immediately with `stopping` and runs the orchestrator stop in the background,
  instead of blocking the RPC/UI while Podman cleanup happens.
- `cargo fmt --manifest-path core/Cargo.toml --all --check` passed.
- Compile check passed in `/tmp/archy-cargo-runtime-check`:
  `cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago`.
- `git diff --check` passed after the stop-path edit and doc updates.
- Lower-level stop path inspection: Quadlet service stop is already bounded
  with kill/reset recovery, and the runtime fallback treats already-absent
  containers as success. No extra lower-level stop change was made.
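
The `package.stop` change above can be pictured with a minimal Python sketch. The real handler lives in the Rust backend; `request_stop`, `slow_stop`, and the `states` table are hypothetical names used only to show the pattern: report `stopping` immediately and let the slow cleanup finish on a background worker.

```python
import threading
import time

# Hypothetical in-memory state table; the real backend tracks state elsewhere.
states = {"demo-app": "running"}
states_lock = threading.Lock()


def slow_stop(app_id: str) -> None:
    """Stand-in for the orchestrator stop, which may block on Podman cleanup."""
    time.sleep(0.2)  # simulate slow cleanup
    with states_lock:
        states[app_id] = "stopped"


def request_stop(app_id: str) -> tuple[str, threading.Thread]:
    """Mark the app as stopping and return at once; cleanup runs in the background."""
    with states_lock:
        states[app_id] = "stopping"
    worker = threading.Thread(target=slow_stop, args=(app_id,), daemon=True)
    worker.start()
    return "stopping", worker


if __name__ == "__main__":
    status, worker = request_stop("demo-app")
    print(status)  # available to the caller before cleanup has finished
    worker.join()
    print(states["demo-app"])
```

The caller gets a truthful transitional state instead of a blocked RPC, which is the behavior the UI needs in order to avoid looking frozen while Podman cleanup runs.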

## 2026-06-10 05:30 EDT Pause Checkpoint

User paused to switch machines. Continue from `/home/archipelago/Projects/archy`
and read `docs/NEXT_TERMINAL_HANDOFF.md` plus
`docs/1.8-alpha-improvements-tracker.md` first. No dev server or validation
command was intentionally left running at this checkpoint.

Latest local-only tracker progress:

- Done: uninstall preserve/delete-data choice, companion APK QR/download modal,
  App Details setup-instructions card, and dead/coming-soon UI cleanup (removed
  the Spotlight AI placeholder).
- In progress: Fleet/tab loading polish, Bitcoin receive-address readiness
  states, no-registration credentials inventory, Nextcloud false-update fix.
- New credential fallback: PhotoPrism now shows manifest-backed credentials
  (`admin` / `archipelago`) when backend credentials are empty. Grafana was not
  added because `GRAFANA_ADMIN_PASSWORD` is not resolved to a known repo
  default/secret.
- Nextcloud local fix: manifest/catalog/UI metadata now points at `nextcloud:29`
  and image update detection ignores registry-host-only changes. Catalog drift
  passed, but backend focused Rust validation did not complete cleanly. First
  `cargo test -p archipelago container::image_versions::tests` from `core/`
  hit a Rust linker/incremental artifact failure while `/tmp` was full; a
  non-incremental retry was killed after running too long. Old
  `/tmp/archy-cargo-*` build-cache directories were removed and `/tmp` recovered.
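
As a rough illustration of "ignore registry-host-only changes", the following Python sketch compares two image references after dropping the registry host. It is not the Rust helper under test; `split_registry` and `is_real_update` are hypothetical names, and the host rule is the common container-reference convention (the first path component counts as a host only if it contains `.` or `:` or equals `localhost`). It does not normalize implicit prefixes such as `library/`.

```python
def split_registry(image: str) -> tuple[str, str]:
    """Split an image reference into (registry_host, remainder).

    The first path component is treated as a registry host only if it
    contains '.' or ':' or is exactly 'localhost'; otherwise the
    reference is assumed to carry no explicit host.
    """
    first, sep, rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return "", image


def is_real_update(current: str, candidate: str) -> bool:
    """Report an update only if something other than the registry host differs."""
    return split_registry(current)[1] != split_registry(candidate)[1]


if __name__ == "__main__":
    # Same repository and tag served from two hosts: not an update.
    print(is_real_update(
        "git.tx1138.com/lfg2025/botfights:1.1.0",
        "146.59.87.168:3000/lfg2025/botfights:1.1.0",
    ))
    # Hypothetical tag bump: a real update.
    print(is_real_update("nextcloud:29", "nextcloud:30"))
```

The two BotFights references are the ones recorded elsewhere in this file; the `nextcloud:30` tag is purely illustrative.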

Latest local validations:

- `npm run type-check` passed after the PhotoPrism credential fallback.
- `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed.
- `git diff --check` passed after the Spotlight cleanup and should be rerun
  after resuming.
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed during
  the Nextcloud pass.

Immediate next steps:

1. Rerun `git diff --check`.
2. Rerun `cargo test -p archipelago container::image_versions::tests` from
   `core/` when ready to validate the Nextcloud update-detection helper.
3. Continue the `docs/1.8-alpha-improvements-tracker.md` rows that remain
   `todo` or `in-progress`, avoiding host-gated items until `.198` access is
   intentionally resumed.

## 2026-06-09 Resume Handoff - Read First

Last user prompt to preserve:

> please can we save all our progress, backlog, and goal to memory so I can resume on another device please
>
> including the last prompt

Ultimate release goal:

Archipelago's app/container system must be developer-ready and production-release ready. New apps should be supported through manifest/runtime contracts and clear developer documentation, not one-off OS-level changes or fragile per-app hacks. The app system must be professional, secure, elegant, lightweight, and predictable: apps install, start, stop, restart, uninstall, reinstall, survive reboot, show correct status/progress, and launch correctly from tabs/iframes. Developers should be able to package apps for Archipelago clearly from the migration/developer docs.

Important target node:

- Validation node: `archipelago@192.168.1.198`, password `password123`.
- Current release deadline pressure from user: production release target was Thursday, 2026-06-11.
- Tests have been run mostly on `.198`; user noted we may also need to validate on the current intended release server, not only `.198`.
- Avoid broad/destructive Podman store cleanup. Do not use `git reset --hard` or revert unrelated user changes.

Current deployed backend on `.198`:

- Latest deployed `/usr/local/bin/archipelago` sha256: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`.
- A later local-only code change exists and passed `cargo check`: cached web-app health now requires HTTP reachability, not just TCP. This was not deployed because the user interrupted the release build/deploy flow. No build process was left running at handoff.
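
To make the "HTTP reachability, not just TCP" rule concrete, here is a small Python sketch under stated assumptions. The backend implements this in Rust; `status_is_healthy` and `probe_http` are hypothetical names, and the 2xx/3xx cutoff mirrors the "HTTP success/redirect" wording used later in this file.

```python
import http.client
import urllib.error
import urllib.request


def status_is_healthy(status: int) -> bool:
    """Treat HTTP success (2xx) and redirects (3xx) as healthy."""
    return 200 <= status < 400


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so a 3xx status is reported as-is."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def probe_http(url: str, timeout: float = 3.0) -> bool:
    """Return True only if the URL answers with an HTTP success or redirect."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return status_is_healthy(resp.status)
    except urllib.error.HTTPError as err:
        # Unfollowed 3xx and all 4xx/5xx responses arrive here with a status code.
        return status_is_healthy(err.code)
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused, reset, timed out, or not speaking HTTP: an open TCP port
        # alone never counts as healthy.
        return False
```

A port that accepts connections while the app is still booting would pass a TCP-only check but fail `probe_http`, which is the false-healthy case observed with BotFights.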

Major progress achieved in the latest session:

- Beta Telemetry / Fleet collector:
  - Confirmed `TELEMETRY_COLLECTOR_URL` was not set in the current shell and no repo/service config was setting it.
  - Fixed the periodic reporter to POST a `telemetry.ingest` JSON-RPC envelope to the configured collector endpoint instead of POSTing the raw telemetry report body.
  - Added optional systemd env loading with `EnvironmentFile=-/var/lib/archipelago/telemetry.env` in `image-recipe/configs/archipelago.service`.
  - Updated `scripts/deploy-to-target.sh` so deployments write `/var/lib/archipelago/telemetry.env` when `TELEMETRY_COLLECTOR_URL` is exported in `scripts/deploy-config.sh`.
  - Documented the expected value shape in `scripts/deploy-config.example`: `https://<collector-host>/rpc/v1`.
  - Verification passed: `cargo fmt -p archipelago --manifest-path core/Cargo.toml`, `bash -n scripts/deploy-to-target.sh`, `git diff --check` for the touched files, and `CARGO_TARGET_DIR=/tmp/archy-cargo-check cargo check -p archipelago --manifest-path core/Cargo.toml`.
  - `systemd-analyze verify image-recipe/configs/archipelago.service` could not run in the sandbox because systemd bus access failed with `SO_PASSCRED failed: Operation not permitted`.
  - Still needed: choose the real collector host, create or update local `scripts/deploy-config.sh` with `export TELEMETRY_COLLECTOR_URL='https://<collector-host>/rpc/v1'`, deploy, restart `archipelago`, and confirm opted-in nodes ingest into Fleet.
- IndeeHub:
  - Recovered stale/corrupt metadata/container state enough for fresh lifecycle.
  - Full lifecycle passed earlier on `.198`.
  - Verified launch on `7778`.
  - Verified `/nostr-provider.js` is served and the Nostr signer bridge requirement is preserved.
- Saleor:
  - Removed from app catalog/server as requested.
- Bitcoin Knots / Bitcoin UI:
  - Fixed false health path so `bitcoin-knots` health no longer just probes the UI bridge on `8334`.
  - Patched Bitcoin UI wording to show retrying/busy sync states instead of scary permanent failure.
  - Verified `/bitcoin-status` recovered; node is in IBD and pruned, progress around 6-7% during latest checks.
- Fedimint:
  - Restored/kept Fedimint Gateway as separate catalog app. Do not make Guardian launch Gateway.
  - Fixed Guardian startup path so `fedimint` uses manifest-backed Quadlet/orchestrator, not legacy startup.
  - Fixed generated unit regeneration by removing the pre-orchestrator Podman inspect gate for orchestrator starts.
  - Fedimint Guardian unit now includes `FM_BITCOIND_URL=http://bitcoin-knots:8332`.
  - Added manifest wrapper that waits for Bitcoin RPC sync with `"initialblockdownload":false` before launching `fedimintd`.
  - Current correct behavior on `.198`: `fedimint.service` active and logging `Waiting for Bitcoin RPC sync at http://bitcoin-knots:8332...`; RPC health returns `starting`; container-list now reports `fedimint` as `starting` instead of stale `stopping`.
  - Guardian iframe/tab does not yet show UI because `fedimintd` is intentionally gated until Bitcoin leaves IBD. The UI should explain "waiting for Bitcoin sync" rather than opening a blank/dead iframe.
- BotFights:
  - User reported stopped/unhealthy.
  - Added `botfights` to manifest-backed orchestrator start path so it no longer fails immediately on legacy Podman discovery.
  - Deployed backend hash `9a00e543...`.
  - BotFights started and is active.
  - Direct checks after it finished booting: `/` returned HTTP 200; `/api/health` returned `{"status":"ok","name":"botfights"}`.
  - Note: `.198` manifests still use `git.tx1138.com/lfg2025/botfights:1.1.0`; local repo manifest shows `146.59.87.168:3000/lfg2025/botfights:1.1.0`. Reconcile this catalog/manifest mismatch later.
- Status/health correctness:
  - Reduced container health/status Podman timeouts to avoid UI hanging forever.
  - `container-list` now refreshes stale cached states and uses Quadlet service-active fallback for stale `stopping` states.
  - Fedimint stale `stopping` fixed to `starting`.
  - Local-only patch passed `cargo check`: web-app cached health requires HTTP success/redirect, not just open TCP. This fixes false healthy during app boot, seen with BotFights.
- Filebrowser/Home Assistant/Immich/Bitcoin:
  - Latest RPC health check showed filebrowser healthy, homeassistant healthy, immich healthy, bitcoin-knots healthy.
  - Still treat Home Assistant setup/restart hang and Immich post-setup HTTP 500 as backlog blockers needing focused validation.
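
For the telemetry reporter fix listed under "Beta Telemetry / Fleet collector" above, the difference between a raw report body and a `telemetry.ingest` envelope can be sketched as follows. This is an assumption-laden illustration: the JSON-RPC 2.0 framing, the placement of the report directly under `params`, and the report fields are all hypothetical, since this file only records that the method name is `telemetry.ingest`.

```python
import json
from itertools import count

_request_ids = count(1)  # simple monotonically increasing request id


def telemetry_envelope(report: dict) -> dict:
    """Wrap a telemetry report in a JSON-RPC style envelope (assumed 2.0 shape)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "telemetry.ingest",
        "params": report,
    }


if __name__ == "__main__":
    # Hypothetical report fields, for illustration only.
    body = json.dumps(telemetry_envelope({"node_id": "demo", "apps": 3}))
    print(body)  # this string, not the bare report, is what gets POSTed
```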

Current critical blockers:

- Runtime control plane / Podman scanning:
  - Backend restarts repeatedly take 1-2 minutes because startup/crash recovery synchronously waits on slow `podman ps`.
  - Logs show repeated `podman ps -a --format json timed out after 30s` and crash recovery `podman ps stopped timed out after 60s`.
  - This is causing bad UX: "checking forever", false "no apps installed", intermittent "loading apps", stale statuses, slow lifecycle actions.
  - Next platform fix should move Podman/crash-recovery scans out of the service readiness path and keep last-known app state during scanner backoff.
- My Apps UI false negatives:
  - User reports that apps sometimes do not appear: the list shows "checking" forever, and "loading apps" sometimes resolves correctly but often ends in a false "no apps installed".
  - Required fix: do not show empty/no-apps while scanner or Podman is in backoff. Keep last known apps, show explicit loading/checking/stale state, and avoid destructive UI conclusions from scan timeout.
- Fedimint Guardian:
  - Current "starting/waiting for Bitcoin sync" is correct while Bitcoin is in IBD.
  - Need UI/status copy that explains waiting for Bitcoin sync, and later validate Guardian UI on `8175` once Bitcoin sync condition is satisfied.
- Progress UX:
  - User explicitly requires install/uninstall/start/stop/restart progress to be accurate and not look frozen.
  - The uninstall indicator currently shows little or no progress. Fix it with clear phase updates and no stale notifications.
- Stale health notifications:
  - Must not keep re-triggering on new logins or refreshes once the underlying condition is no longer valid.
  - Some UI filtering was patched earlier, but keep this in regression backlog.
- Reboot survival:
  - Must pass repeated reboot validation after runtime/status fixes.
  - Acceptance target from user: minimum 3 clean consecutive reboots, preferably 5.
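
The Fedimint Guardian wait state above depends on Bitcoin leaving initial block download. A minimal Python sketch of turning a `getblockchaininfo` response into status copy might look like this. `bitcoin_wait_status` and the exact wording of the returned strings are hypothetical; `initialblockdownload` and `verificationprogress` are fields of the Bitcoin RPC response.

```python
import json


def bitcoin_wait_status(rpc_body: str) -> str:
    """Map a getblockchaininfo JSON-RPC response body to a UI status string."""
    payload = json.loads(rpc_body)
    result = payload.get("result") or {}
    # Missing or errored data is treated as "still syncing" so dependents stay gated.
    if result.get("initialblockdownload", True):
        progress = result.get("verificationprogress")
        if isinstance(progress, (int, float)):
            return f"waiting for Bitcoin sync ({progress:.0%})"
        return "waiting for Bitcoin sync"
    return "ready"


if __name__ == "__main__":
    print(bitcoin_wait_status(
        '{"result": {"initialblockdownload": true, "verificationprogress": 0.25}}'
    ))
    print(bitcoin_wait_status('{"result": {"initialblockdownload": false}}'))
```

Defaulting to the waiting state when the RPC is slow or errors out keeps the UI on an explanatory message instead of opening a blank iframe.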

Backlog captured from user reports:

- Portainer:
  - Environment wizard error: `Dial unix /var/run/docker.sock: connect: connection refused`.
  - User noted Portainer does Podman orchestration well; compare/learn from its socket/control flow where useful.
- Fedimint:
  - After guardian confirmation during setup, the app did not launch.
  - Guardian launch was opening Gateway before; do not regress. Guardian and Gateway must remain distinct.
  - Gateway app disappeared from catalog before; it has been restored but keep in regression tests.
- Bitcoin Knots:
  - User saw missing app/launch issues and status bridge messages. UI now improved, but include in lifecycle/reboot regression.
- Home Assistant:
  - Setup has issues on this node and restart hung for a long time.
- Immich:
  - After setup user saw HTTP 500 stacktrace from `loadServerConfig`. Needs focused post-setup validation, not just "healthy".
- Filebrowser:
  - User saw erroneous stopped status while app was working. Status ordering was patched; keep in regression.
- Tailscale:
  - Launch must show local login/auth UI, not merely container running.
- BTCPay/Fedimint/Gateway/other Bitcoin-dependent apps:
  - Need clearer dependency wait states when Bitcoin RPC is slow/IBD.
- App catalog/developer readiness:
  - Apps should not require OS-level changes per app.
  - App migration document and developer guide must include this principle and current app packaging contract.
- Saleor:
  - Removed from catalog/server and should stay removed unless intentionally reintroduced.

Release readiness estimate:

- Prior estimate was 68%; after latest IndeeHub/Fedimint/BotFights/status progress, a realistic estimate is about 72%.
- Remaining 28% is not feature volume; it is systemic hardening: runtime control-plane responsiveness, truthful UI during Podman backoff, lifecycle/reboot gates, and focused app-specific post-setup validation.

Suggested immediate next steps after resuming:

1. Read this file and verify no background build/process is running.
2. Build/deploy the local-only HTTP-health tightening patch if not already deployed.
3. Patch backend startup/crash recovery so Podman scans are async/non-blocking and service readiness is not held hostage by `podman ps`.
4. Patch My Apps UI/data flow to preserve last-known apps during scanner backoff and never show false empty state while checking.
5. Run focused status checks on `.198`: fedimint, botfights, filebrowser, bitcoin-knots, immich, homeassistant, portainer.
6. Continue lifecycle gates only after the runtime scan/control path is stable enough that tests measure apps, not Podman timeouts.
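
Step 4 above (preserve last-known apps and never show a false empty state) can be sketched as a small state holder. This is a hypothetical Python model, not the actual My Apps code; `AppListView`, `apply_scan`, and the `loading`/`fresh`/`stale` labels are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AppListView:
    apps: list[str] = field(default_factory=list)
    state: str = "loading"  # one of "loading", "fresh", "stale"

    def apply_scan(self, scanned: Optional[list[str]]) -> None:
        """Apply a scan result; None means the scan timed out or is in backoff."""
        if scanned is None:
            # Keep the last-known list. Only mark it stale if one was ever loaded.
            if self.state != "loading":
                self.state = "stale"
            return
        self.apps = list(scanned)
        self.state = "fresh"

    def show_empty_state(self) -> bool:
        """'No apps installed' is only valid after a successful, empty scan."""
        return self.state == "fresh" and not self.apps


if __name__ == "__main__":
    view = AppListView()
    view.apply_scan(["immich", "fedimint"])
    view.apply_scan(None)  # scanner backoff
    print(view.apps, view.state, view.show_empty_state())
```

The key design choice is that a failed scan never overwrites the list, so a Podman timeout can degrade the view to `stale` but cannot produce a "no apps installed" conclusion.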

Read this first if resuming in a fresh OpenCode session. Paste the resume prompt below verbatim.

---

## Resume Prompt

> Continue Archipelago release hardening from `docs/RESUME.md`. First read `docs/RESUME.md`, `docs/CONTAINER_LIFECYCLE_HANDOFF.md`, and `docs/MIGRATION_STATUS_REPORT.md`. The active validation node is `.198` at `192.168.1.198`; keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive for deterministic tests. Do not run Podman prune/image-list/system-df/image-exists/store-wide cleanup commands on `.198`; the store is known to hang under load. Preserve app data. Latest deployed backend hash is `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`. This includes the rootless Podman socket fix that treats `/run/user/1000/podman/podman.sock` as a socket bind, never a directory/data bind, prefers persistent `podman-archy-api.service` for Portainer, and changes absent cached `Stopping` entries to `Stopped`. User reported host reboot validation was not clean: many containers were SIGKILLed during reboot/shutdown and IndeeHub was stopped after boot. User also reported Immich, IndeeHub, Tailscale, Vaultwarden, Portainer, Home Assistant, Uptime Kuma, Nextcloud, Fedimint, and Botfights app lifecycle/launch/state issues. BTCPay was a false alarm: slow but fine. Current live validation: Vaultwarden full preserve-data lifecycle passed; Portainer full preserve-data lifecycle passed and its socket mount is no longer `//deleted`, but the user still needs to retry the Portainer environment wizard. Fedimint direct container state is running/healthy. IndeeHub remains P0: Podman still has a corrupted `indeedhub|Removing|97cf9fd13bb2` record; targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Treat post-reboot recovery, launch reachability, lifecycle correctness, progress indication, and rootless Podman socket-backed apps as active release blockers. 
> IndeeHub is not passing unless `http://<node>:7778/` is reachable and `/nostr-provider.js` is injected/served so the Nostr signer works as before. Tailscale is not passing unless launch presents the Tailscale login/auth UI. Before editing or touching `.198`, summarize current state and your exact first step.

---

## Current Goal

Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image.

Current status estimate: about 68% of the way to release. The app migration, manifest/catalog generation, and many local gates are advanced, and the latest pass fixed Vaultwarden plus the concrete Portainer stale socket mount. Live `.198` testing still shows the app platform is not production-bulletproof. Remaining release blockers include app install/start truthfulness, frontend launch readiness gating, IndeeHub recovery and Nostr signer compatibility, Tailscale login-link launch, Home Assistant/Uptime Kuma/Nextcloud install/start failures, full lifecycle coverage, progress indication quality, app packaging documentation, refactor/dead-code cleanup, repeated reboot validation, final `.198` lifecycle confidence, and cutting/smoke-testing the `1.8-alpha` ISO.

## Release Readiness Estimate

- Estimated completion: `68%`.
- What is already achieved:
  - manifest-driven app migration is substantially advanced;
  - catalog metadata generation and strict drift checks are green;
  - local backend/frontend release gates have been green in prior passes;
  - broad non-destructive lifecycle has passed on the deployed release-candidate line before the reboot-gate finding;
  - Podman store-risk paths have been quarantined from known fragile broad image/store commands;
  - IndeeHub recovery now has local hardening in progress, including explicit Nostr signer validation in the lifecycle harness;
  - targeted Immich fixes now make dependency creation fail fast instead of silently reporting install success, and a follow-up readiness-gating patch is in progress so the app does not look launchable before HTTP readiness;
  - mobile and desktop app progress UX now has clearer install/remove phase labels in local changes;
  - Vaultwarden full preserve-data lifecycle passed on `.198` after the rootless socket fix;
  - Portainer full preserve-data lifecycle passed on `.198` after recreating the container against persistent `podman-archy-api.service`; its mount now points at `/podman/podman.sock`, not `/podman/podman.sock//deleted`.
- What must still pass before release:
  - deploy the current Immich readiness-gating backend and frontend progress UX changes;
  - focused Immich validation: install must stay in progress until `http://<node>:2283/` returns HTTP success and app launch opens the frontend;
  - focused IndeeHub validation: recover stale/corrupt frontend container, prove `http://<node>:7778/`, and prove `/nostr-provider.js` signer bridge is injected/served;
  - keep Vaultwarden in regression coverage even though the latest full lifecycle passed;
  - focused Tailscale validation: launch must present the local login/auth link/UI on `8240`;
  - focused Portainer validation: user must retry the environment wizard and confirm it can reach the rootless Podman socket through the `/var/run/docker.sock` bind inside the container;
  - full preserve-data lifecycle testing for representative migrated apps and key stacks: `install -> launch -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch`;
  - progress indication validation for install, uninstall, start, stop, restart, reboot recovery, and failed transitions; generic "running" or "removing" pills are not enough;
  - app packaging documentation gate: update `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` so they match the current manifest/runtime contract, include lifecycle/progress/reboot expectations, and clearly tell developers to use reusable manifest/orchestrator primitives instead of OS-level per-app hacks;
  - required refactor/remove-dead-code gate: after correctness is proven and before cutting `1.8-alpha`, remove obsolete app-specific paths, stale fallback metadata, duplicate lifecycle logic, unused scripts/hooks, and misleading compatibility shims; rerun lifecycle, launch, and release gates afterward;
  - broad non-destructive lifecycle after the deploy;
  - at least 3 consecutive clean post-fix reboot iterations, with broad lifecycle green after each;
  - preferably 5 consecutive clean reboot iterations before calling `1.8-alpha` production-release ready;
  - final local release gates after any additional fixes;
  - cut the `1.8-alpha` ISO;
  - boot/smoke-test the ISO enough to prove installability, backend startup, UI startup, app catalog availability, and at least a focused app lifecycle.

---

## Latest User Directive

> A lot were killed SIGKILL and one crashed, a couple stopped. Not sure if we did fixes but we should be a few reboot tests until 3/4/5 reboots are clean I guess, unless you advise a different passing criteria
>
> please do not forget that indeehub must work with the nostr signer just like before, I hope we haven't broken that or anything, please add to tasks
>
> also please note that immich and tailscale are not launching on the front-ends on their ports from the app screen, they say running/healthy but clearly aren't
>
> Also BTCPay is not running either
>
> no my bad, wrong server, BTCPay is fine just slow, please continue
>
> Yes, as shown in trying to complete the environment wizard in portainer you get "Failure Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
>
> please confirm there is a refactor/remove dead code release gate too

Passing criterion adopted: after the post-reboot recovery fix is deployed, require at least 3 consecutive clean reboots with broad non-destructive lifecycle green after each; prefer 5 consecutive clean reboots for production-release confidence. SIGKILL during shutdown is not automatically disqualifying if every managed app recovers and is reachable after boot, but any app left stopped/crashed/unreachable after boot is a failed reboot iteration. IndeeHub validation must include the Nostr signer bridge, not just HTTP reachability.
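
The reboot criterion above can be expressed as a small check. This Python sketch is hypothetical (`trailing_clean_reboots`, `reboot_gate`, and the `reachable` state label are assumptions) and models only post-boot app state; the broad lifecycle run after each reboot would be an additional input. SIGKILL during shutdown is deliberately not part of the input, matching the rule that it is not disqualifying on its own.

```python
def trailing_clean_reboots(iterations: list[dict[str, str]]) -> int:
    """Count consecutive clean iterations at the end of the run.

    Each iteration maps app id -> post-boot state. An iteration is clean
    only if it has at least one app and every app is 'reachable'.
    """
    streak = 0
    for states in reversed(iterations):
        if states and all(state == "reachable" for state in states.values()):
            streak += 1
        else:
            break
    return streak


def reboot_gate(iterations: list[dict[str, str]], minimum: int = 3, preferred: int = 5) -> str:
    """Classify the run as 'fail', 'pass' (>= minimum), or 'preferred' (>= preferred)."""
    streak = trailing_clean_reboots(iterations)
    if streak >= preferred:
        return "preferred"
    if streak >= minimum:
        return "pass"
    return "fail"


if __name__ == "__main__":
    clean = {"immich": "reachable", "indeedhub": "reachable"}
    failed = {"immich": "reachable", "indeedhub": "stopped"}
    print(reboot_gate([failed, clean, clean, clean]))
```

Counting only the trailing streak means one bad iteration resets the gate, which matches "consecutive" in the adopted criterion.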

Immich, Tailscale, Vaultwarden, and Portainer are explicit blockers. Container `running`/`healthy` is not enough for Immich/Tailscale; direct/app-screen launch routes must work. Tailscale launch must present the login/auth UI. Vaultwarden must survive install/start/restart. Portainer must be able to talk to the rootless Podman socket from inside its Docker-compatible socket bind. BTCPay is not currently a blocker; it was a wrong-server/slow-app false alarm.

There is also an explicit app packaging documentation gate and an explicit required refactor/remove-dead-code release gate. The packaging docs must be current enough for a third-party developer to package an app against the actual manifest/runtime contract. Do the refactor/dead-code cleanup after current correctness fixes are validated, not before, but do not cut `1.8-alpha` without it: remove stale per-app hacks, dead legacy code paths, duplicate lifecycle helpers, obsolete scripts/hooks, and misleading fallback metadata that would make `1.8-alpha` hard to maintain, then rerun the release gates.

---

## Live `.198` State

- Host: `192.168.1.198`.
- Password for lifecycle harness/RPC login: `password123`.
- Latest recorded `/usr/local/bin/archipelago` sha256: `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`.
- `archipelago.service`: active.
- `archipelago-doctor.timer`: inactive.
- `archipelago-reconcile.timer`: inactive.
- `/`: `65%` used, about `9.6G` free.
- `/var/lib/archipelago`: about `9-10%` used, about `370G` free.

Current active app blockers:

- Immich: after deploying hash `54d781...`, reinstall no longer immediately stops. Live test showed `immich_postgres` and `immich_redis` healthy and `immich_server` running; first launch had a readiness gap while Immich ran migrations/geodata import, then `2283` returned HTTP `200`. Local follow-up changes add an Immich server health check and require healthy status before install completes.
- IndeeHub: still blocked. Latest targeted check after hash `f1f5c61c...` showed a corrupted Podman ghost record: `indeedhub|Removing|97cf9fd13bb2`; `podman inspect indeedhub` fails with `layer not known`. Targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Must recover this record without broad store cleanup and then verify `http://<node>:7778/` plus `/nostr-provider.js` for the Nostr signer.
- Home Assistant: user reports install completes then app stops. Treat as part of the migrated single-container/rootless Podman control-plane blocker.
- Uptime Kuma: user reports install takes ages then app stops. Live logs showed `package.install uptime-kuma failed: systemctl --user restart podman.socket exited exit status: 1`.
- Nextcloud: user reports same install-then-stop behavior. Live logs showed `package.install nextcloud failed: systemctl --user restart podman.socket exited exit status: 1`.
- Vaultwarden: latest full preserve-data lifecycle passed on hash `2a168489...`: install -> launch on `8082` -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch. Keep in regression tests because the user-visible transition/progress UX still looked like it was stuck while stopping.
- Portainer: latest full preserve-data lifecycle passed on hash `2a168489...`. The stale mount was confirmed as `/run/user/1000/podman/podman.sock//deleted`; after persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock` without `//deleted` and `http://127.0.0.1:9000/` returns HTTP `200`. User still needs to retry the environment wizard; do not close this blocker until the wizard no longer reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`.
- Tailscale: still blocked. Container running is not enough; launch must present local login/auth UI on `8240`.
- Fedimint: user reported it showed `stopping`; after hash `f1f5c61c...`, direct targeted state shows `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`. Keep in focused regression/launch checks.
- BotFights: newly reported stopped/broken. Direct probe after the report showed `botfights` running/healthy and `http://127.0.0.1:9100/` returning `200`; keep in focused lifecycle/launch validation after Podman control-plane recovery.
- Rootless Podman socket/control plane: improved but still a release-risk area. Fixed the concrete bug where `/run/user/1000/podman/podman.sock` could be created as a directory and the Portainer bind could point at a deleted socket inode. The current deployed backend prefers persistent `podman-archy-api.service`. Continue watching scanner timeouts and lifecycle behavior for Home Assistant, Uptime Kuma, Nextcloud, and Portainer.
- Stuck Podman records: P0 migration blocker. IndeeHub proves ordinary targeted `podman rm` fallbacks are not sufficient once a record is wedged in `Removing`.
- Progress UX: still blocked until live validation proves install/uninstall/start/stop/restart show phase detail and do not appear frozen.

Do not treat root disk pressure as a current blocker anymore. It was reduced from `99%` used with under `600M` free to about `65%` used with roughly `10G` free.

### 2026-06-10 Resume Continuation Checkpoint

- Deployed backend hash `7f58da80063f58574675256913ac9cddf131e65d8935015748a70adffc228f83` to `.198`.
  - Previous live hash observed before deploy: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`.
  - `archipelago.service` is active.
  - `archipelago-doctor.timer` and `archipelago-reconcile.timer` are inactive.
- Added explicit release gates to this handoff:
  - app packaging docs must be updated before `1.8-alpha`;
  - refactor/remove-dead-code is required before `1.8-alpha`, after correctness validation and before final release gates/ISO.
- Local validation before deploy:
  - `bash -n tests/lifecycle/remote-lifecycle.sh` passed;
  - `cargo fmt --manifest-path core/Cargo.toml --all` ran;
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` passed (`45` tests);
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed;
  - `python3 scripts/check-app-catalog-drift.py --release --strict` passed;
  - `git diff --check` passed;
  - Filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` appeared wedged in the tool PTY after compilation started; no local cargo/rustc worker remained visible. Treat that one filtered run as inconclusive, not failed.
- IndeeHub live validation after deploy:
  - `container-list` reports `indeedhub` running;
  - `container-health` reports `{"indeedhub":"healthy"}`;
  - `http://192.168.1.198:7778/` returns HTTP `200`;
  - `http://192.168.1.198:7778/nostr-provider.js` returns HTTP `200` and contains the Archipelago NIP-07/NIP-98 Nostr provider shim.
- Immich live validation after deploy:
  - `container-list` reports `immich` running;
  - direct `http://192.168.1.198:2283/` returns HTTP `200`;
  - `container-health` reported `{"immich":"unknown"}` during one focused check, so health truthfulness still needs follow-up even though launch HTTP is reachable.
- Tailscale live validation after deploy:
  - Found the live generated unit still used the stale catalog command `sleep 2; tailscale web...`; locally patched `app-catalog/catalog.json`, `neode-ui/public/catalog.json`, and `scripts/first-boot-containers.sh` to use the safer socket-wait startup, and copied the catalog to `/opt/archipelago/web-ui/catalog.json`.
  - App-scoped `package.restart tailscale` failed via RPC with `podman ps timed out while listing containers`.
  - Patched the live generated Tailscale `.container` unit to match the catalog fix and restarted only `tailscale.service`; the old container required SIGKILL during stop and Podman cleanup took roughly 2 minutes.
  - After restart, the Tailscale unit runs both `tailscaled` and `tailscale web`, `container-list` reports `tailscale` running, `container-health` reports `{"tailscale":"healthy"}`, and `http://192.168.1.198:8240/` returns HTTP `200` with Tailscale UI content.
  - Do not close Tailscale lifecycle as fully passing yet: launch UI is fixed, but stop/restart behavior exposed the rootless Podman cleanup/control-plane blocker.
- Other live probes after deploy:
  - `portainer` HTTP `9000` returns `200`; user still needs to retry the environment wizard.
  - `vaultwarden` HTTP `8082` returns `200` from localhost on `.198`.
  - `botfights` HTTP `9100` returns `200` from localhost on `.198`.
  - `btcpay-server` returned `302` then timed out under a short probe; continue treating BTCPay as slow rather than a current blocker unless a focused check fails.
  - `fedimint` port `8175` reset during probe while RPC showed `starting`; keep expected Bitcoin-sync wait-state/status copy in scope.
- Podman/control-plane remains the active systemic blocker:
  - logs still show `podman ps timed out`, `podman stats timed out`, scan backoff, and slow app cleanup;
  - do not start reboot-count validation until app stop/start/restart and post-reboot recovery are clean enough that tests measure app behavior instead of Podman timeouts.

---

## Latest Completed Work

### 2026-06-08 Rootless Socket, Vaultwarden, and Portainer Fix

- Built and deployed backend hash `2a168489737180b4088503dd93ef89c11da13e64790b324db8baea8ca05d3536` to `.198`; then built and deployed follow-up hash `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`; `archipelago.service` active, `archipelago-doctor.timer` inactive, `archipelago-reconcile.timer` inactive.
- Fixed rootless Podman socket bind handling in `core/archipelago/src/container/prod_orchestrator.rs`:
  - `/run/user/1000/podman/podman.sock` is skipped by bind-directory creation and data UID/chown prep;
  - socket bind mounts call explicit socket repair before other bind prep;
  - `ensure_user_podman_socket()` now prefers persistent `podman-archy-api.service` at `unix:///run/user/1000/podman/podman.sock`, falling back to `podman.socket` only if needed.
- Validated locally before deploy:
  - `cargo fmt --manifest-path core/Cargo.toml --all`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago absent_` (`4 passed`, including the stale absent `Stopping` regression tests).
  - `git diff --check`.
  - `timeout 900s cargo build --manifest-path core/Cargo.toml -p archipelago --release`.
- Vaultwarden full preserve-data lifecycle passed on `.198`:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=vaultwarden ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Portainer full preserve-data lifecycle passed on `.198`:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Portainer stale socket mount was confirmed and repaired:
  - Before recreate, mountinfo showed `/run/user/1000/podman/podman.sock//deleted -> /var/run/docker.sock`.
  - After persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock -> /var/run/docker.sock`, host socket exists, and Portainer UI returns HTTP `200`.
  - User still needs to retry the Portainer environment wizard; do not close the blocker until that wizard can connect.
- Direct state check after deploy:
  - `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`.
  - `indeedhub|Removing|97cf9fd13bb2`; `podman inspect` fails with `layer not known`; targeted removal/cleanup hangs and had to be killed.
  - `vaultwarden running true`.
  - `portainer running true`.
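
The stale Portainer socket mount above was visible as a `//deleted` suffix in mountinfo. A minimal sketch of how such a check could be automated follows, assuming the standard `/proc/<pid>/mountinfo` field order (root is the fourth field, mount point the fifth). The function name and the sample lines are made up for illustration and were not captured from `.198`.

```python
def stale_bind_mounts(mountinfo_text: str) -> list[tuple[str, str]]:
    """Return (root, mount_point) pairs whose bind source has been deleted."""
    stale = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # mountinfo fields: id, parent id, major:minor, root, mount point, ...
        if len(fields) < 5:
            continue
        root, mount_point = fields[3], fields[4]
        # The kernel marks an unlinked bind source with a "//deleted" suffix.
        if root.endswith("//deleted"):
            stale.append((root, mount_point))
    return stale


# Illustrative sample only (hypothetical mount IDs and options).
SAMPLE = """\
812 790 0:52 /podman/podman.sock//deleted /var/run/docker.sock rw,nosuid,nodev - tmpfs tmpfs rw
813 790 0:52 /podman/podman.sock /run/other.sock rw,nosuid,nodev - tmpfs tmpfs rw
"""

for root, mount_point in stale_bind_mounts(SAMPLE):
    print(f"stale bind: {root} -> {mount_point}")
```

A non-empty result for a socket-backed app would suggest the container needs a recreate after the host socket is restored, as was done for Portainer.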

### 2026-06-08 Reboot Blocker Follow-up In Progress

- User reported host reboot validation was not clean: many containers were killed with SIGKILL during reboot/shutdown, one crashed, a couple stopped, and IndeeHub was stopped after boot.
- Treat this as a failed reboot gate. Do not call the release ready until post-fix reboot iterations are clean.
- Local changes made in this pass:
  - hardened `core/archipelago/src/container/prod_orchestrator.rs` IndeeHub stack recovery so reboot reconcile starts existing backend containers through a user scope when possible, waits for backend containers and API dependency DNS, starts/restarts the frontend, verifies it remains running, and verifies host port `7778`;
  - hardened `core/container/src/manifest.rs` package validation for app IDs, ports, env keys, capabilities, devices, volume sources/options, network policy, and reviewed host-bind exceptions while preserving all current real manifests;
  - updated `tests/lifecycle/remote-lifecycle.sh` so IndeeHub launch validation requires `/nostr-provider.js` to be injected into the HTML and served from the app, preserving the Nostr signer requirement.
- Deployed follow-up backend hash `4108ca146b482c028ae8d7c4bec314b71ef3412f15efd2e61846a2c345b36aba` to `.198`; service active, timers inactive. Focused audit still showed:
  - `indeedhub` stuck `stopping` and unhealthy;
  - `immich` stopped/unhealthy;
  - `tailscale` running/healthy but direct launch `8240` returned `000`;
  - `vaultwarden` health RPC errored and launch `8082` returned `000`;
  - `btcpay-server` was fine (`23000` returned HTTP 200); user confirmed BTCPay was a wrong-server/slow-app false alarm.
- Targeted diagnostics on `.198` found:
  - IndeeHub frontend Podman state `removing`/`stopping` with no `7778` listener;
  - Immich server stopped, Redis exited, Postgres unhealthy, no `2283` listener;
  - Tailscale listener process existed on `8240`, but direct HTTP still returned `000`; logs show Tailscale is `NeedsLogin`/`WantRunning=false`, so launch must present the login/auth UI rather than a generic daemon endpoint;
  - Vaultwarden container was absent; public `package.start vaultwarden` failed on stale/refused Podman socket before local fixes;
  - Portainer launches but the environment wizard reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`, confirming socket-backed apps are not release-ready.
- Local follow-up fixes after those diagnostics:
  - `core/container/src/runtime.rs` now tries `podman rm -f --time 0`, targeted `podman container cleanup`, and another `rm -f` when normal forced remove fails;
  - `ensure_user_podman_socket()` now verifies the rootless Podman socket accepts Unix connections, not just that the socket path exists;
  - IndeeHub readiness now falls back to platform-managed network-alias presence when `getent` inside the API image cannot prove DNS;
  - lifecycle harness now requires Tailscale launch content to look like login/auth UI.
- Local validation passed after those fixes:
  - `cargo fmt --manifest-path core/Cargo.toml --all`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
  - `bash -n tests/lifecycle/remote-lifecycle.sh`.
  - `git diff --check`.
- Deployed second follow-up backend hash `06420c0377fff650a2bf3211f13c1e0754bf8df81345b8485f4c9a30cb552439` to `.198`; service active, timers inactive.
- Public RPC recovery attempts on hash `06420c...`:
  - `package.restart indeedhub` still failed;
  - `package.start immich` accepted async start but app remained `starting` with no `2283` launch;
  - `package.start vaultwarden` accepted async start but no `8082` launch appeared;
  - `package.restart portainer` failed;
  - `package.restart tailscale` accepted async restart but no `8240` launch UI appeared.
- Latest focused probe after hash `06420c...`:
  - `tailscale` `running`, `http://192.168.1.198:8240/` returns `000`;
  - `immich` `starting`, `http://192.168.1.198:2283/` returns `000`;
  - `indeedhub` `stopping`, `http://192.168.1.198:7778/` returns `000`;
  - `portainer` `running`, `http://192.168.1.198:9000/` returns `000`;
  - `vaultwarden` absent/not listed, `http://192.168.1.198:8082/` returns `000`.
- Conclusion: do not proceed to reboot testing or ISO work. The rootless Podman control-plane/socket health and stuck container-state recovery need a deeper platform fix before lifecycle/reboot gates are meaningful.
- Local validation passed so far:
  - `cargo fmt --manifest-path core/Cargo.toml --all`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `bash -n tests/lifecycle/remote-lifecycle.sh`.
  - `git diff --check`.
- A filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` compiled and ran the matching existing IndeedHub test (`1 passed`); it did not exercise the new reboot recovery branch because there is no direct unit test for that path yet.
- Next steps:
  - deploy the new backend only after approval;
  - verify focused `indeedhub,immich,tailscale,vaultwarden,portainer` lifecycle/launch, including IndeeHub Nostr provider check and Portainer socket usability;
  - run reboot validation iterations on `.198` only after explicit approval;
  - pass threshold: 3 consecutive clean post-fix reboots minimum, 5 preferred for production-release confidence;
  - cut and smoke-test the `1.8-alpha` ISO after reboot validation is green.
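
The socket check described in this pass (verifying that the rootless Podman socket accepts Unix connections, not just that the path exists) can be sketched as below. This is a Python illustration of the idea, not the Rust `ensure_user_podman_socket()` code; the function name and the 2-second timeout are assumptions.

```python
import os
import socket
import stat


def socket_accepts_connections(path: str, timeout: float = 2.0) -> bool:
    """True only if `path` is a Unix socket that something is listening on."""
    try:
        if not stat.S_ISSOCK(os.stat(path).st_mode):
            return False
    except OSError:
        # Missing path or permission problem: treat as not usable.
        return False
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        try:
            client.connect(path)
        except OSError:
            # A stale socket file with no listener is refused here.
            return False
    return True
```

A path-existence check alone would report a leftover socket file as healthy; the connect attempt is what separates a live listener from a stale file.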

### Local Release Gate Completion After `.198` App Recovery

- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands.
- Fixed scanner backoff/in-flight skip behavior: skipped scans now bump `scan_tick`, so install/update success paths that kicked the scanner do not wait for their timeout when Podman scan backoff is active.
- Fixed stale crash-recovery unit tests after `should_auto_start_stopped_container` gained the `include_stack_members` flag; coverage now asserts generic boot recovery skips stack helpers while stack recovery can include them.
- Fixed local runtime manifest-port lookup so tests and local backend runs can find workspace `apps/*/manifest.yml` via `CARGO_MANIFEST_DIR`; this covers new public apps such as PhotoPrism.
- Fixed journal usage parsing for real `journalctl --disk-usage` compact output such as `463.9M`.
- Fixed boot-reconciler cadence tests so `without_companion_stage()` also bypasses the global crash-recovery wait gate in tests; production still waits for recovery completion.
- Verified catalog generation is idempotent: `python3 scripts/generate-app-catalog.py` reported `updated 0 fields` for both catalogs.
- Validation passed locally:
  - `cargo fmt --manifest-path core/Cargo.toml --all`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`43 passed`).
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security` (`12 security tests passed`; performance has no tests).
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
  - `git diff --check`.
  - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
- The one remaining gated item is host reboot validation on `.198`, to be run only if explicitly approved.
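
The journal usage parsing fix above concerns compact sizes such as `463.9M`. A rough Python sketch of that kind of parsing follows. It assumes the suffixes are 1024-based multiples and that the size is the first number-plus-suffix token in the output; the function name is hypothetical and the Rust implementation may differ.

```python
import re
from typing import Optional

# Assumption: suffixes are binary multiples (1024-based).
_UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([BKMGT])\b")


def parse_journal_usage(output: str) -> Optional[int]:
    """Extract a compact size like '463.9M' and return it in bytes."""
    match = _SIZE_RE.search(output)
    if match is None:
        return None
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


line = "Archived and active journals take up 463.9M in the file system."
print(parse_journal_usage(line))
```

Returning `None` for unparseable output lets the caller skip the journal figure instead of failing the whole disk report.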

### Frontend Release Gate Completion

- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands.
- Found and fixed a mobile app-launch regression in `neode-ui/src/stores/appLauncher.ts`:
  - desktop-only new-tab apps still open directly on desktop;
  - mobile now routes those apps through the app-session route instead of escaping Archipelago in a new browser tab;
  - `dashboardReturnPath()` now tolerates tests/minimal router mocks with no `currentRoute`.
- Updated frontend tests to match current desktop new-tab policy and mobile in-app routing behavior.
- Fixed `AppIconGrid` test setup so it shares the mounted Pinia instance and mocks credential lookup before launch.
- Fixed onboarding retry test timing to cover the actual exponential retry budget.
- Validation passed locally:
  - `npm run type-check` from `neode-ui`.
  - `npm test` from `neode-ui` (`548 passed`).
  - `npm run build` from `neode-ui`.
  - `python3 scripts/generate-app-catalog.py` (`updated 0 fields`).
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
  - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
  - `git diff --check`.
- Local caveat: `npm ci` is currently blocked because existing `neode-ui/node_modules/@alloc` entries are owned by `root:root`. Existing installed modules were sufficient for type-check, tests, and build. Do not delete or chown this tree without explicit approval.

### Fedimint/File Browser, Nostr/NPM, and IndeedHub Recovery

- Built and deployed backend hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de` to `.198`.
- Fixed UI-facing package health for reachable running apps whose Podman health stayed `starting`, `unhealthy`, or a numeric exit value while the launch port was reachable.
- Confirmed Fedimint Guardian and File Browser were actually reachable; their `server.get-state` package-data now reports healthy instead of “starting up”.
- Fixed Nostr relay port conflict by moving `apps/nostr-rs-relay/manifest.yml` host port from `8081` to `18081`.
- Recovered Nginx Proxy Manager admin launch on `8081`; Nostr now launches on `18081` and no longer captures the NPM launch port.
- Hardened legacy package install so scoped web-app installs use `podman create` plus `systemd-run --user --scope podman start`, avoiding backend-cgroup coupling without hanging the install RPC.
- Recovered IndeedHub without deleting data: started the stopped `indeedhub-minio` dependency, repaired frontend reachability, and verified `7778` returns the app.
- Validation passed:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - Focused lifecycle for `indeedhub,nginx-proxy-manager,nostr-rs-relay,fedimint,filebrowser`.
  - Direct launch checks returned HTTP `200` for `7778`, `8081`, `18081`, `8175`, and `8083`.
  - Broad non-destructive lifecycle passed on live hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.6G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.
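
The Nostr relay and Nginx Proxy Manager collision on `8081` is the kind of problem a simple host-port conflict check could catch before deploy. The sketch below assumes host ports have already been extracted from each manifest into a plain mapping; the function name and that input shape are illustrative, not the repo's API.

```python
from collections import defaultdict


def host_port_conflicts(app_ports: dict[str, list[int]]) -> dict[int, list[str]]:
    """Map each host port claimed by more than one app to the claiming apps."""
    owners = defaultdict(set)
    for app, ports in app_ports.items():
        for port in ports:
            owners[port].add(app)
    return {port: sorted(apps) for port, apps in owners.items() if len(apps) > 1}


before = {"nginx-proxy-manager": [8081], "nostr-rs-relay": [8081]}
after = {"nginx-proxy-manager": [8081], "nostr-rs-relay": [18081]}
print(host_port_conflicts(before))
print(host_port_conflicts(after))
```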

### Deployed Podman Store-Risk Cleanup

- Reviewed release-relevant Podman store/image call sites without running broad Podman store/image commands on `.198`.
- Bounded stack installer image pulls and manual package update image pulls with `kill_on_drop` and 600s timeouts.
- Deployed backend hash `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4` to `.198` with the previous backend backed up under `/usr/local/bin/archipelago.backup-20260608-store-risk-*`.
- Validation passed:
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - `cargo fmt` from `core/`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - Focused post-deploy lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint,immich,indeedhub,photoprism ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - Broad post-deploy non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.8G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.
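
The bounded image pulls above pair a timeout with killing the child process. As a Python analogue of that pattern (not the backend's Rust code), `subprocess.run` with `timeout` kills the direct child when the deadline passes. The helper name is hypothetical; the 600s figure in the usage comment comes from the note above.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_bounded(argv: Sequence[str], timeout_s: float) -> Optional[int]:
    """Run argv; return its exit code, or None if it was killed on timeout."""
    try:
        completed = subprocess.run(
            argv,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the direct child here.
        return None
    return completed.returncode


# Hypothetical usage for a pull: run_bounded(["podman", "pull", image_ref], 600)
print(run_bounded([sys.executable, "-c", "pass"], 30))
```

This only covers the direct child; processes it spawns would need a process group or scope to be cleaned up as well.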

### Release Candidate Backend Restart Validation

- Built and deployed backend hash `e28affdf4c1d3cecbe4c14b0439b53d977ed20873c966c288116601d49dac732` to `.198`.
- Bounded additional Podman store/control probes so image and stack health checks fail fast instead of hanging under `.198` Podman store/socket load.
- Fixed Fedimint health reporting: if Podman health remains `starting` but the app endpoint is reachable, `container-health` can use the reachable cached app fallback.
- Fixed package start/restart fallback for runtime web apps by using `systemd-run --user --scope` for `podman start`, then falling back to direct bounded `podman start`.
- Recovered live Immich without data loss:
  - `immich_server` had exited because `/usr/src/app/upload/encoded-video/.immich` could not be written.
  - The correct live ownership repair is still `podman unshare chown -R 0:0 /var/lib/archipelago/immich`; inside the user namespace `0:0` is container root, which maps to host UID/GID `1000:1000`.
  - A temporary `1000:1000` in-container ownership experiment was reverted because Immich's storage check writes as container root.
- Validation passed on latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - `npm run build` from `neode-ui`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - Backend restart validation followed by focused `fedimint,immich,indeedhub,photoprism` lifecycle passed.
  - Post-restart broad non-destructive lifecycle passed.
- Remaining gate before calling this a release: host reboot validation, if approved.

### IndeedHub and Immich Lifecycle Recovery

- Built and deployed backend hash `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9` to `.198`.
- IndeedHub focused audit is green after sequencing network alias repair immediately before frontend startup, after dependencies are running.
- Fedimint and NetBird focused audits are green; they were not current blockers after rerun.
- Immich was the broad-audit blocker and is now green:
  - dependency readiness accepts healthy Podman health state for `immich_postgres` and `immich_redis` before falling back to slower exec probes;
  - `immich_server` startup repairs `/var/lib/archipelago/immich` ownership with `podman unshare chown -R 0:0`, preserving upload data while matching the current rootless container user mapping;
  - this fixed the observed `EACCES` on `/usr/src/app/upload/encoded-video/.immich`.
- Validation passed on latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=netbird ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Residual risk remains: `.198` still intermittently logs `podman ps -a --format json timed out after 30s` and transient Bitcoin RPC timeouts under load. Continue avoiding store-wide Podman commands.

### Release Refactor Cleanup

- Built and deployed backend hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` to `.198`.
- Legacy package runtime host-port cleanup/repair now derives host ports from manifests when available.
- Hardcoded ports remain only as fallback for legacy/non-manifest apps and extra stale-port cleanup compatibility.
- Removed the duplicate Gitea-specific stale port cleanup helper.
- Validation passed on latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Added focused runtime-host-port tests, but local `cargo test --manifest-path ../../core/Cargo.toml -p archipelago runtime_host_ports` did not finish within 5 minutes during compilation.

### Catalog Metadata Generation

- Added `scripts/generate-app-catalog.py` to sync manifest-owned metadata into `app-catalog/catalog.json` and `neode-ui/public/catalog.json`.
- The generator updates fields that manifests already own: `title`, `version`, `description`, `dockerImage`, `category`, `tier`, `icon`, and `repoUrl`.
- The catalog still preserves catalog-only fields such as `author`, `requires`, `featured`, and rich `containerConfig` notes.
- Corrected stale manifest metadata for BotFights, IndeeHub, Gitea, LND, ElectrumX, Fedimint, and Mempool before generation.
- Release catalog drift is now zero:
  - `python3 scripts/check-app-catalog-drift.py --release --strict` reports `metadata_drift=0`, `missing_catalog=0`, `missing_manifests=0`.
- Validation passed:
  - `jq empty app-catalog/catalog.json neode-ui/public/catalog.json`.
  - canonical and UI public catalogs match byte-for-byte.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `npm run build` from `neode-ui`.
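
The generator's split between manifest-owned and catalog-only fields can be sketched as follows. The field list is the one given above; the function name and the assumption that manifest metadata has already been normalized to catalog key names are illustrative, and the real `scripts/generate-app-catalog.py` may be structured differently.

```python
MANIFEST_OWNED_FIELDS = (
    "title", "version", "description", "dockerImage",
    "category", "tier", "icon", "repoUrl",
)


def sync_entry(catalog_entry: dict, manifest_meta: dict) -> int:
    """Copy manifest-owned fields into catalog_entry; return how many changed."""
    updated = 0
    for field in MANIFEST_OWNED_FIELDS:
        if field in manifest_meta and catalog_entry.get(field) != manifest_meta[field]:
            catalog_entry[field] = manifest_meta[field]
            updated += 1
    return updated


# Hypothetical entry: catalog-only keys such as `author` are never touched.
entry = {"title": "Old Title", "author": "someone", "featured": True}
meta = {"title": "New Title", "version": "1.2.3"}
print(sync_entry(entry, meta), sync_entry(entry, meta))
```

A second run returning zero changes is the idempotence property that the `updated 0 fields` output elsewhere in this document reflects.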

### Podman Store-Risk Hardening

- Built and deployed backend hash `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2` to `.198`.
- Fresh local-build installs now treat `podman image exists <local-build-tag>` failure/timeout as "unknown/missing" and rebuild the local image instead of failing the lifecycle operation.
- This keeps local image store checks from being release-blocking while preserving bounded runtime timeouts and matching the existing drift-restart behavior.
- Validation passed on the latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Added focused unit test coverage for the image-exists failure behavior, but local `cargo test --manifest-path core/Cargo.toml -p archipelago install_fresh_builds_when_image_exists_check_fails` did not complete within 15 minutes during compilation.

### Container Health Fallback and Broad Lifecycle Green

- Built and deployed backend hash `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36` to `.198`.
- Fixed `container-health` broad lifecycle timeout behavior:
  - `cached_reachable_health()` now parses ports from URLs with trailing slashes correctly, such as `http://localhost:2342/`.
  - The local TCP fallback now covers the lifecycle web app ports, including PhotoPrism, BTCPay, LND UI, Mempool, Electrum, Fedimint, Gitea, IndeedHub, Ollama, Vaultwarden, Tailscale, and others.
  - Cached-running apps with reachable local TCP listeners can report `healthy` without depending on flaky Podman health/inspect calls.
- Validation passed on the latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
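
The two pieces of the health fallback above, extracting a port from a URL with a trailing slash and probing a local TCP listener, can be illustrated in Python. This is a sketch of the idea only; `cached_reachable_health()` itself is Rust, and the helper names and 1-second timeout here are assumptions.

```python
import socket
from typing import Optional
from urllib.parse import urlsplit


def port_from_url(url: str) -> Optional[int]:
    """Return the explicit port, or the scheme default for http/https."""
    parts = urlsplit(url)
    try:
        if parts.port is not None:
            return parts.port
    except ValueError:
        # Malformed or out-of-range port in the URL.
        return None
    return {"http": 80, "https": 443}.get(parts.scheme)


def tcp_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(port_from_url("http://localhost:2342/"))
```

Using a URL parser rather than splitting on `:` avoids the trailing-slash problem described above.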

### Generic Host-Port Health Checkpoint

- Built and deployed backend hash `3912b900c376b6c28bf5453640cae82135f67d7e0f984b8adcc78064b924143b` to `.198`.
- Confirmed objective remains: app behavior should be manifest/platform-primitive owned, not OS-image or per-app backend hack owned.
- Broad lifecycle on `d21202cd...` failed only on Uptime Kuma briefly showing `stopping` during listener repair; it recovered afterward.
- Fixed stale transitional merge: `Stopping -> Running` recovers when no user-stop marker exists; user-initiated stops still keep `Stopping`.
- Health monitor now derives required host TCP ports from Podman JSON `Ports` and marks running containers unhealthy when declared host listeners are missing.
- This is generic host-port health, not an app-specific mapping.
- After deploying `3912b900...`, Uptime Kuma recovered `3002` and returned HTTP `302` after backend restart.
- Jellyfin still needs follow-up: Podman reports `jellyfin Up ... (healthy)` with `0.0.0.0:8096->8096/tcp`, but `ss` shows no `8096` listener and `curl http://192.168.1.198:8096/` fails.
- Follow-up on `be95ea...` resolved the broad lifecycle timeout by hardening `container-health` fallback behavior.
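
The generic host-port health rule above (derive required host TCP ports from Podman JSON `Ports`, then flag missing listeners) might look like this in outline. The key names `host_port`, `range`, and `protocol` reflect my understanding of Podman's `ps --format json` output and should be checked against the installed version; the function names and sample record are hypothetical.

```python
def required_host_tcp_ports(container: dict) -> set[int]:
    """Collect every host TCP port a container record declares."""
    ports: set[int] = set()
    for mapping in container.get("Ports") or []:
        if mapping.get("protocol", "tcp") != "tcp":
            continue
        start = mapping.get("host_port")
        if not start:
            continue
        # A mapping may cover a contiguous range of ports.
        for offset in range(mapping.get("range") or 1):
            ports.add(start + offset)
    return ports


def missing_listeners(container: dict, listening: set[int]) -> set[int]:
    """Declared host ports that have no listener on the host."""
    return required_host_tcp_ports(container) - listening


# Sample record modeled on the Jellyfin case described above.
jellyfin = {"Names": ["jellyfin"], "Ports": [
    {"host_ip": "0.0.0.0", "container_port": 8096, "host_port": 8096,
     "range": 1, "protocol": "tcp"},
]}
print(missing_listeners(jellyfin, listening=set()))
```

A non-empty result for a running container corresponds to the case where Podman reports healthy but the host listener is absent.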

### Stale State and Jellyfin Pasta Listener Hardening

- Built and deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` to `.198`.
- `container-list` now overlays cached `exited` entries with targeted live state so scanner backoff does not leave lifecycle/UI reads stuck on stale `exited` after recovery.
- `container-health` now has a bounded cached-running plus local TCP reachability fallback for web apps, reducing dependency on slow/hung Podman inspect paths for health reads.
- Jellyfin was added to legacy runtime host-port repair for pasta listener `8096`.
- `package.restart jellyfin` still exposed a real Podman socket/runtime blocker after stopping the container: `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`.
- `package.start jellyfin` recovered the app afterward; `jellyfin` became `Up ... (healthy)`, `8096` had a `pasta.avx2` listener, and `http://192.168.1.198:8096/` returned HTTP `302`.
- Focused lifecycle passed on the latest hash:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Release catalog drift check still reports `missing_catalog=0`, `missing_manifests=0`, `metadata_drift=35`.

### Expanded Cleanup and Store-Safe Uninstall

- Built and deployed backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b` to `.198`.
- Expanded `system.disk-cleanup` to remove old rollback artifacts while keeping newest rollback points:
  - `/usr/local/bin/archipelago.backup-*`: keeps the newest 3.
  - legacy `/usr/local/bin/archipelago.bak*`: keeps the newest 3.
  - `/usr/local/bin/archipelago.before-*`: keeps the newest 3, handled as part of legacy backend cleanup.
  - `/opt/archipelago/web-ui.bak*`: keeps the newest 3.
  - `/opt/archipelago/web-ui.old`: included in web UI rollback cleanup.
- Live `system.disk-cleanup` reclaimed `10.3 GB`:
  - `Removed old backend backups: 41.6 MB freed`.
  - `Removed old legacy backend backups: 3.6 GB freed`.
  - `Removed old web UI backups: 6.6 GB freed`.
  - `Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes`.
- `/usr/local/bin` dropped to about `336M`.
- `/opt/archipelago` dropped to about `1.1G`.
- Removed global `podman volume prune -f` from uninstall. Uninstall now logs a skip and still removes explicit app data when `preserve_data=false`.
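
The "keep the newest 3" retention rule used by the expanded cleanup can be expressed as a small pure function. This sketch assumes ordering by modification time, which is a guess about the real implementation; the function name and sample paths are hypothetical.

```python
def select_for_removal(entries: list[tuple[str, float]], keep: int = 3) -> list[str]:
    """Given (path, mtime) pairs, return the paths older than the newest `keep`."""
    newest_first = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [path for path, _mtime in newest_first[keep:]]


# Hypothetical rollback artifacts with made-up timestamps.
backups = [
    ("/usr/local/bin/archipelago.backup-a", 100.0),
    ("/usr/local/bin/archipelago.backup-b", 500.0),
    ("/usr/local/bin/archipelago.backup-c", 300.0),
    ("/usr/local/bin/archipelago.backup-d", 400.0),
    ("/usr/local/bin/archipelago.backup-e", 200.0),
]
print(select_for_removal(backups))
```

Keeping the selection separate from the deletion makes the rule easy to test without touching the filesystem.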

### Startup Scan and Uptime Kuma Fixes

- Startup `adopt_existing()` is bounded with a 35s timeout.
- Initial container scan seeds the same 300s Podman scan backoff used by periodic scans.
- Legacy pasta restart paths use scoped `podman restart` instead of stop+start.
- Uptime Kuma was repaired:
  - Before: container internally healthy on `127.0.0.1:3001`, but host `3002` had no pasta listener.
  - After: `package.restart uptime-kuma` returns `{"status":"restarted"}` and `http://192.168.1.198:3002/` returns HTTP `302`.
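
The 300s scan backoff mentioned above can be modeled as a small state holder. The sketch below is a Python illustration with an injectable clock, not the backend's scanner; the class and method names are assumptions. Bumping `scan_tick` on skipped scans mirrors the scanner fix noted under the local release gate earlier in this document.

```python
import time


class ScanBackoff:
    """Skip scans for `backoff_s` seconds after a Podman scan timeout."""

    def __init__(self, backoff_s: float = 300.0, clock=time.monotonic):
        self.backoff_s = backoff_s
        self.clock = clock
        self.blocked_until = 0.0
        self.scan_tick = 0

    def should_scan(self) -> bool:
        if self.clock() < self.blocked_until:
            # A skipped scan still bumps the tick so waiters are released.
            self.scan_tick += 1
            return False
        return True

    def record_timeout(self) -> None:
        self.blocked_until = self.clock() + self.backoff_s

    def record_success(self) -> None:
        self.blocked_until = 0.0
        self.scan_tick += 1
```

Seeding the backoff from the initial scan, as described above, would amount to calling `record_timeout()` when the startup scan times out.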

### Cleanup and Catalog Work Already Done

- `system.disk-cleanup` intentionally skips Podman image/volume prune.
- `nostr-rs-relay` was added to both catalog surfaces.
- `scripts/check-app-catalog-drift.py --release --strict` reports zero missing catalog/manifest entries and zero metadata drift after catalog generation.
- Meshtastic `app.files` live behavior was validated: deleting `/var/lib/archipelago/meshtastic/config.yaml` and restarting recreated it from the manifest.

---

## Verification Already Run

- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed for the currently deployed release-candidate line.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed for the currently deployed release-candidate line.
- Broad lifecycle on current hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` passed:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Targeted PhotoPrism audit on current hash passed:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`
- Focused lifecycle on earlier hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` passed:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Live cleanup RPC passed and reclaimed `10.3 GB`.
- Focused lifecycle after expanded cleanup passed:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Before the expanded cleanup pass, broad lifecycle also passed on hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28`:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Direct app checks after latest cleanup passed:
  - `http://192.168.1.198:3002/` -> HTTP `302`.
  - `http://192.168.1.198:8096/` -> HTTP `302` after Jellyfin recovery/start.
  - `http://192.168.1.198:8083/` -> HTTP `404` on `/`, which is expected for Filebrowser root probe behavior used here.

### Test Caveat

- Earlier local focused test commands timed out during first-time test binary compilation, but after compilation completed the full backend test target passed: `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
- Remaining workspace packages also pass checks/tests: `archipelago-container`, `archipelago-performance`, and `archipelago-security`.

---

## Critical Constraints

- Preserve app data.
- `.198` is the active validation node.
- Current live backend hash on `.198`: `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
- Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive unless explicitly testing them.
- Do not run destructive git commands.
- Do not run Podman store-wide cleanup or broad image/store commands on `.198` without a mitigation plan:
  - Avoid `podman system df`.
  - Avoid `podman image list` / `podman image ls`.
  - Avoid broad `podman image exists` loops.
  - Avoid `podman image prune` and `podman volume prune`.
- Podman store commands can hang and block app health under current `.198` load.
- Latest local mitigation: Rust release image-existence probes now use bounded targeted `podman image inspect` instead of `podman image exists` or `podman images -q`.
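
The bounded targeted probe described in the last constraint can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `core/container/src/runtime.rs`: the names `Probe`, `run_bounded`, and `image_present` are hypothetical, the 25 ms poll interval is arbitrary, and a non-zero exit is treated as "absent or error" without interpreting Podman's specific exit codes.

```rust
use std::io;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of a bounded probe. Illustrative names, not the project's types.
#[derive(Debug, PartialEq, Eq)]
enum Probe {
    /// The command exited with status 0.
    Present,
    /// Non-zero exit: image absent locally, or Podman reported an error.
    AbsentOrError,
    /// The deadline passed and the child process was killed.
    TimedOut,
}

/// Run `program args...`, polling until it exits or `limit` elapses.
fn run_bounded(program: &str, args: &[&str], limit: Duration) -> io::Result<Probe> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + limit;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(if status.success() {
                Probe::Present
            } else {
                Probe::AbsentOrError
            });
        }
        if Instant::now() >= deadline {
            // Kill and reap so a hung process does not linger as a zombie.
            let _ = child.kill();
            let _ = child.wait();
            return Ok(Probe::TimedOut);
        }
        thread::sleep(Duration::from_millis(25));
    }
}

/// Targeted probe for one image reference; it never lists the whole store.
fn image_present(image: &str, limit: Duration) -> io::Result<Probe> {
    run_bounded("podman", &["image", "inspect", image], limit)
}
```

Keeping `TimedOut` distinct from `AbsentOrError` lets a caller decide whether to rebuild, retry later, or back off instead of treating a hung store as a missing image.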

---

## Current Remaining Blockers

1. Podman socket/store health remains unresolved.
   - Need a quarantine/mitigation strategy rather than store-wide commands in release paths.
   - Current release paths avoid prune and broad image-list/existence commands; orchestrator, companion, and legacy install image checks now use bounded `podman image inspect`.
   - The latest concrete failure is still the historical one: `package.restart jellyfin` stopped the container but failed to complete because Podman reported a socket permission/runtime failure. `package.start jellyfin` recovered afterward.
   - The latest deployed hash still logged one initial `podman ps -a --format json` scan timeout/backoff, but focused and broad non-destructive lifecycle validation passed.

2. Release code-review/refactor gate is still open.
   - Reduce remaining app-specific Rust/OS branches where possible.
   - Review scanner, health, reconcile, and install/update paths for performance and store-risk.
   - Clean up dead transitional paths.

3. Clean release branch hygiene is not done.
   - Worktree is very dirty with many modified and untracked files.
   - Do not commit unless explicitly asked.

4. Full production validation still needed.
   - Broad non-destructive lifecycle is green on live hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
   - Backend restart validation has passed.
   - Run host reboot validation if approved.
   - Run selected full lifecycle tests for critical apps if time allows.
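
Blocker 1 mentions the scanner backing off after a `podman ps -a --format json` timeout. A capped exponential backoff of that kind might look like the sketch below. The function name `scan_backoff`, the 5-second base, and the 300-second cap are all hypothetical values chosen for illustration; the real scanner's parameters are not stated in this document.

```rust
use std::time::Duration;

/// Delay before the next scan after `consecutive_failures` timeouts.
/// Base and cap are hypothetical, not the scanner's real settings.
fn scan_backoff(consecutive_failures: u32) -> Duration {
    const BASE_SECS: u64 = 5;
    const CAP_SECS: u64 = 300;
    // Clamp the exponent so the shift cannot overflow.
    let exp = consecutive_failures.min(16);
    let secs = BASE_SECS.saturating_mul(1u64 << exp).min(CAP_SECS);
    Duration::from_secs(secs)
}
```

The cap keeps a long run of timeouts from pushing the next scan out indefinitely, so container state still refreshes once the store recovers.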

---

## Files Changed In Latest Pass

- `core/container/src/runtime.rs`
  - Changed Podman runtime `image_exists()` from `podman image exists` to a bounded targeted `podman image inspect` local-storage probe.

- `core/archipelago/src/api/rpc/package/install.rs`
  - Replaced legacy `podman images -q` local fallback and post-pull verification checks with bounded targeted `podman image inspect`.

- `core/archipelago/src/container/companion.rs`
  - Changed companion image existence checks from `podman image exists` to `podman image inspect`.

- `core/archipelago/src/container/prod_orchestrator.rs`
  - Updated image-existence failure test fixture wording for the new `image inspect` probe.

- Validation for latest local mitigation:
  - `cargo fmt --all --check` passed.
  - `cargo check -p archipelago-container` passed.
  - `cargo check -p archipelago` passed.
  - `CARGO_INCREMENTAL=0 cargo check -p archipelago --tests` passed.
  - `cargo test -p archipelago-container` passed (`43` tests).
  - `git diff --check -- <changed files>` passed.
  - Filtered `cargo test -p archipelago install_fresh_build` did not complete. One run hit a `rust-lld` undefined-hidden-symbol link failure after concurrent Cargo jobs. The sequential `CARGO_INCREMENTAL=0` rerun exceeded 10 minutes during compile, although test-target compilation passed afterward.

- `core/archipelago/src/api/rpc/system/handlers.rs`
  - Calls expanded rollback cleanup helpers and reports reclaimed bytes.

- `core/archipelago/src/api/rpc/system/mod.rs`
  - Added cleanup helpers for legacy backend backups and web UI rollback backups.
  - Uses size accounting for directories before removal.
  - Keeps newest rollback artifacts instead of deleting all.

- `core/archipelago/src/api/rpc/package/runtime.rs`
  - Skips global `podman volume prune -f` during uninstall.
  - Adds Jellyfin `8096` to runtime host-port/pasta cleanup repair.
  - Derives legacy runtime host-port cleanup/repair ports from manifests.
  - Keeps compatibility fallback ports for legacy/non-manifest apps and removes duplicate Gitea stale-port cleanup code.

- `core/archipelago/src/api/rpc/container.rs`
  - Adds stale cached `exited` refresh for `container-list`.
  - Adds cached-running plus local TCP reachability fallback for `container-health`.
  - Fixes fallback URL port parsing and expands lifecycle web app port coverage.

- `core/archipelago/src/container/prod_orchestrator.rs`
  - Rebuilds local-build images when `image_exists` fails/times out instead of failing fresh install.
  - Adds focused unit test coverage for that behavior.

- `scripts/generate-app-catalog.py`
  - Generates/syncs public catalog metadata from manifest-owned fields.

- `app-catalog/catalog.json` and `neode-ui/public/catalog.json`
  - Generated from current manifests; files match byte-for-byte.

- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
  - Added latest deployment, cleanup, validation, and residual-risk checkpoint.

- `docs/MIGRATION_STATUS_REPORT.md`
  - Updated current hash, root disk state, and remaining blockers.

- `docs/RESUME.md`
  - This file, replacing stale April migration resume content.
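
The rollback cleanup behavior listed for `core/archipelago/src/api/rpc/system/mod.rs` (size accounting before removal, keeping the newest artifacts) can be sketched as a pure planning step. The `Artifact` struct, the `plan_cleanup` function, and the `keep` parameter are hypothetical names for illustration; the real helpers may be structured differently.

```rust
/// A rollback artifact as seen by the cleanup pass (illustrative shape).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Artifact {
    name: String,
    modified_unix: u64,
    size_bytes: u64,
}

/// Return the artifacts to remove and the bytes that removal would reclaim,
/// keeping the newest `keep` artifacts untouched.
fn plan_cleanup(mut artifacts: Vec<Artifact>, keep: usize) -> (Vec<Artifact>, u64) {
    // Newest first; ties are broken by name so the plan is deterministic.
    artifacts.sort_by(|a, b| {
        b.modified_unix
            .cmp(&a.modified_unix)
            .then_with(|| a.name.cmp(&b.name))
    });
    let to_remove: Vec<Artifact> = artifacts.into_iter().skip(keep).collect();
    // Sizes are summed before anything is deleted, so the reported figure
    // does not depend on the filesystem state after removal.
    let reclaimed: u64 = to_remove.iter().map(|a| a.size_bytes).sum();
    (to_remove, reclaimed)
}
```

Separating the plan from the deletion makes the reclaimed-bytes figure testable without touching the filesystem.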

---

## Suggested Next Steps

1. Re-read the three docs:
   - `docs/RESUME.md`
   - `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
   - `docs/MIGRATION_STATUS_REPORT.md`

2. Verify latest `.198` state:
   - `ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 'df -h / /var/lib/archipelago; systemctl is-active archipelago.service; systemctl is-active archipelago-doctor.timer 2>/dev/null || true; systemctl is-active archipelago-reconcile.timer 2>/dev/null || true; sha256sum /usr/local/bin/archipelago'`

3. Start Podman-store-risk review:
   - Search for image/store operations: `image_exists`, `podman image`, `podman system`, `podman prune`, `volume prune`.
   - Prefer targeted container status/API calls with timeouts.
   - Avoid new broad store commands.

4. Continue release code-review/refactor cleanup.

5. If approved, run backend-restart validation and then host-reboot validation.
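
The search in step 3 can be sketched as a small scanner over source text. The pattern list is copied from that step; the function name `find_store_risk` is hypothetical, and plain substring matching is an assumption, so hits such as the approved bounded `podman image inspect` probe still need manual review.

```rust
/// Substrings taken from the search list in step 3.
const STORE_RISK_PATTERNS: [&str; 5] = [
    "image_exists",
    "podman image",
    "podman system",
    "podman prune",
    "volume prune",
];

/// Return (1-based line number, matched pattern) for every hit in `source`.
fn find_store_risk(source: &str) -> Vec<(usize, &'static str)> {
    let mut hits = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        for pattern in STORE_RISK_PATTERNS.iter() {
            if line.contains(*pattern) {
                hits.push((idx + 1, *pattern));
            }
        }
    }
    hits
}
```

A line can match more than one pattern, in which case it is reported once per pattern in list order.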

---

## Current Release Readiness Estimate

- Credible release candidate: closer now, roughly `87-91%`.
- Production-quality release developers will love: still closer to `73-79%`.

The biggest improvement in the latest pass is that broad lifecycle is green again on the latest backend. The biggest remaining technical risk is Podman store/socket health.