- New credential fallback: PhotoPrism now shows manifest-backed credentials
(`admin` / `archipelago`) when backend credentials are empty. Grafana was not
added because `GRAFANA_ADMIN_PASSWORD` is not resolved to a known repo
default/secret.
- Nextcloud local fix: manifest/catalog/UI metadata now points at `nextcloud:29`
and image update detection ignores registry-host-only changes. The catalog drift
check passed, but the focused backend Rust validation did not complete cleanly.
The first `cargo test -p archipelago container::image_versions::tests` run from
`core/` hit a Rust linker/incremental-artifact failure while `/tmp` was full; a
non-incremental retry was killed after running too long. Old
`/tmp/archy-cargo-*` build-cache directories were then removed and `/tmp` recovered.
Latest local validations:
- `npm run type-check` passed after the PhotoPrism credential fallback.
- `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed.
- `git diff --check` passed after the Spotlight cleanup and should be rerun
after resuming.
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed during
the Nextcloud pass.
Immediate next steps:
1. Rerun `git diff --check`.
2. Rerun `cargo test -p archipelago container::image_versions::tests` from
`core/` when ready to validate the Nextcloud update-detection helper.
3. Continue the `docs/1.8-alpha-improvements-tracker.md` rows that remain
`todo` or `in-progress`, avoiding host-gated items until `.198` access is
intentionally resumed.
## 2026-06-09 Resume Handoff - Read First
Last user prompt to preserve:
> please can we save all our progress, backlog, and goal to memory so I can resume on another device please
>
> including the last prompt
Ultimate release goal:
Archipelago's app/container system must be developer-ready and production-release ready. New apps should be supported through manifest/runtime contracts and clear developer documentation, not one-off OS-level changes or fragile per-app hacks. The app system must be professional, secure, elegant, lightweight, and predictable: apps install, start, stop, restart, uninstall, reinstall, survive reboot, show correct status/progress, and launch correctly from tabs/iframes. Developers should be able to package apps for Archipelago clearly from the migration/developer docs.
- A later local-only code change exists and passed `cargo check`: cached web-app health now requires HTTP reachability, not just TCP. This was not deployed because the user interrupted the release build/deploy flow. No build process was left running at handoff.
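
A minimal sketch of the TCP-versus-HTTP distinction described above, in Python rather than the backend's Rust. The function names, the 2-second timeout, and the decision to disable proxies are assumptions for illustration; this is not the actual patch.

```python
import http.client
import socket
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Keep 3xx responses visible instead of following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def is_healthy_status(code: int) -> bool:
    """HTTP success (2xx) or redirect (3xx) counts as reachable."""
    return 200 <= code < 400


def tcp_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Looser check: true as soon as something accepts the connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """Stricter check: true only when the app answers with 2xx or 3xx."""
    opener = urllib.request.build_opener(
        NoRedirect,
        urllib.request.ProxyHandler({}),  # assumption: probe directly, no proxy
    )
    try:
        with opener.open(url, timeout=timeout) as resp:
            return is_healthy_status(resp.status)
    except urllib.error.HTTPError as err:
        # Unfollowed 3xx, 4xx, and 5xx responses all surface here.
        return is_healthy_status(err.code)
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False
```

While an app is still booting, the container can accept TCP before it answers HTTP, so `tcp_open` may return `True` while `http_ready` still returns `False`.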
Major progress achieved in the latest session:
- Beta Telemetry / Fleet collector:
- Confirmed `TELEMETRY_COLLECTOR_URL` was not set in the current shell and no repo/service config was setting it.
- Fixed the periodic reporter to POST a `telemetry.ingest` JSON-RPC envelope to the configured collector endpoint instead of POSTing the raw telemetry report body.
- Added optional systemd env loading with `EnvironmentFile=-/var/lib/archipelago/telemetry.env` in `image-recipe/configs/archipelago.service`.
- Updated `scripts/deploy-to-target.sh` so deployments write `/var/lib/archipelago/telemetry.env` when `TELEMETRY_COLLECTOR_URL` is exported in `scripts/deploy-config.sh`.
- Documented the expected value shape in `scripts/deploy-config.example`: `https://<collector-host>/rpc/v1`.
- Verification passed: `cargo fmt -p archipelago --manifest-path core/Cargo.toml`, `bash -n scripts/deploy-to-target.sh`, `git diff --check` for the touched files, and `CARGO_TARGET_DIR=/tmp/archy-cargo-check cargo check -p archipelago --manifest-path core/Cargo.toml`.
- `systemd-analyze verify image-recipe/configs/archipelago.service` could not run in the sandbox because systemd bus access failed with `SO_PASSCRED failed: Operation not permitted`.
- Still needed: choose the real collector host, create or update local `scripts/deploy-config.sh` with `export TELEMETRY_COLLECTOR_URL='https://<collector-host>/rpc/v1'`, deploy, restart `archipelago`, and confirm opted-in nodes ingest into Fleet.
- IndeeHub:
- Recovered stale/corrupt metadata/container state enough for fresh lifecycle.
- Full lifecycle passed earlier on `.198`.
- Verified launch on `7778`.
- Verified `/nostr-provider.js` is served and the Nostr signer bridge requirement is preserved.
- Saleor:
- Removed from app catalog/server as requested.
- Bitcoin Knots / Bitcoin UI:
- Fixed false health path so `bitcoin-knots` health no longer just probes the UI bridge on `8334`.
- Patched Bitcoin UI wording to show retrying/busy sync states instead of scary permanent failure.
- Verified `/bitcoin-status` recovered; node is in IBD and pruned, progress around 6-7% during latest checks.
- Fedimint:
- Restored/kept Fedimint Gateway as separate catalog app. Do not make Guardian launch Gateway.
- Fixed Guardian startup path so `fedimint` uses manifest-backed Quadlet/orchestrator, not legacy startup.
- Fixed generated unit regeneration by removing the pre-orchestrator Podman inspect gate for orchestrator starts.
- Fedimint Guardian unit now includes `FM_BITCOIND_URL=http://bitcoin-knots:8332`.
- Added manifest wrapper that waits for Bitcoin RPC sync with `"initialblockdownload":false` before launching `fedimintd`.
- Current correct behavior on `.198`: `fedimint.service` active and logging `Waiting for Bitcoin RPC sync at http://bitcoin-knots:8332...`; RPC health returns `starting`; container-list now reports `fedimint` as `starting` instead of stale `stopping`.
- Guardian iframe/tab does not yet show UI because `fedimintd` is intentionally gated until Bitcoin leaves IBD. The UI should explain "waiting for Bitcoin sync" rather than opening a blank/dead iframe.
- BotFights:
- User reported stopped/unhealthy.
- Added `botfights` to manifest-backed orchestrator start path so it no longer fails immediately on legacy Podman discovery.
- Deployed backend hash `9a00e543...`.
- BotFights started and is active.
- Direct checks after it finished booting: `/` returned HTTP 200; `/api/health` returned `{"status":"ok","name":"botfights"}`.
- Note: `.198` manifests still use `git.tx1138.com/lfg2025/botfights:1.1.0`; local repo manifest shows `146.59.87.168:3000/lfg2025/botfights:1.1.0`. Reconcile this catalog/manifest mismatch later.
- `container-list` now refreshes stale cached states and uses Quadlet service-active fallback for stale `stopping` states.
- Fedimint stale `stopping` fixed to `starting`.
- Local-only patch passed `cargo check`: web-app cached health requires HTTP success/redirect, not just open TCP. This fixes false healthy during app boot, seen with BotFights.
- Still treat Home Assistant setup/restart hang and Immich post-setup HTTP 500 as backlog blockers needing focused validation.
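
For the periodic reporter fix listed under Beta Telemetry above, the difference between posting the raw report and posting a `telemetry.ingest` envelope can be sketched as follows. The JSON-RPC 2.0 framing, the `id` value, and passing the report as `params` are assumptions; only the method name and the `/rpc/v1` URL shape come from the notes.

```python
import json
import urllib.request


def build_ingest_envelope(report: dict, request_id: int = 1) -> dict:
    """Wrap a raw telemetry report in a JSON-RPC style envelope."""
    return {
        "jsonrpc": "2.0",              # assumption: JSON-RPC 2.0 framing
        "id": request_id,              # assumption: any client-chosen id
        "method": "telemetry.ingest",  # method name from the notes
        "params": report,              # assumption: report carried as params
    }


def build_request(collector_url: str, report: dict) -> urllib.request.Request:
    """Build (but do not send) the POST the reporter would issue."""
    body = json.dumps(build_ingest_envelope(report)).encode("utf-8")
    return urllib.request.Request(
        collector_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending would be `urllib.request.urlopen(build_request(url, report))`, where `url` is the value of `TELEMETRY_COLLECTOR_URL`.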
Current critical blockers:
- Runtime control plane / Podman scanning:
- Backend restarts repeatedly take 1-2 minutes because startup/crash recovery synchronously waits on slow `podman ps`.
- Logs show repeated `podman ps -a --format json timed out after 30s` and crash recovery `podman ps stopped timed out after 60s`.
- This is causing bad UX: "checking forever", false "no apps installed", intermittent "loading apps", stale statuses, slow lifecycle actions.
- Next platform fix should move Podman/crash-recovery scans out of the service readiness path and keep last-known app state during scanner backoff.
- My Apps UI false negatives:
- User reports that apps sometimes do not show: the view stays on "checking" forever, and "loading apps" sometimes resolves correctly but often ends in a false "no apps installed".
- Required fix: do not show empty/no-apps while scanner or Podman is in backoff. Keep last known apps, show explicit loading/checking/stale state, and avoid destructive UI conclusions from scan timeout.
- Fedimint Guardian:
- Current "starting/waiting for Bitcoin sync" is correct while Bitcoin is in IBD.
- Need UI/status copy that explains waiting for Bitcoin sync, and later validate Guardian UI on `8175` once Bitcoin sync condition is satisfied.
- Progress UX:
- User explicitly requires install/uninstall/start/stop/restart progress to be accurate and not look frozen.
- The uninstall indicator currently shows poor or no progress. Must fix with clear phase updates and no stale notifications.
- Stale health notifications:
- Must not keep triggering on new logins/refreshes once they are no longer valid.
- Some UI filtering was patched earlier, but keep this in regression backlog.
- Reboot survival:
- Must pass repeated reboot validation after runtime/status fixes.
- Need clearer dependency wait states when Bitcoin RPC is slow/IBD.
- App catalog/developer readiness:
- Apps should not require OS-level changes per app.
- App migration document and developer guide must include this principle and current app packaging contract.
- Saleor:
- Removed from catalog/server and should stay removed unless intentionally reintroduced.
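
The "keep last-known apps during scanner backoff" requirement above can be sketched as a small cache. The type names and the `fresh`/`stale`/`loading` labels are hypothetical; the real fix would live in the Rust backend and the My Apps data flow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppListView:
    apps: list[str]
    status: str  # "fresh", "stale", or "loading" (hypothetical labels)


@dataclass
class AppListCache:
    last_known: Optional[list[str]] = None

    def on_scan_result(self, apps: Optional[list[str]]) -> AppListView:
        """`apps` is None when the scan timed out or was skipped in backoff."""
        if apps is not None:
            # A completed scan is authoritative, even when it is empty.
            self.last_known = list(apps)
            return AppListView(apps=list(apps), status="fresh")
        if self.last_known is not None:
            # Scan failed: keep showing what we had, marked as stale.
            return AppListView(apps=list(self.last_known), status="stale")
        # No successful scan yet: show loading, never "no apps installed".
        return AppListView(apps=[], status="loading")
```

The only path that can produce an empty `fresh` list is a scan that actually completed and found nothing, which is the one case where "no apps installed" is truthful.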
Release readiness estimate:
- Prior estimate was 68%; after latest IndeeHub/Fedimint/BotFights/status progress, a realistic estimate is about 72%.
- Remaining 28% is not feature volume; it is systemic hardening: runtime control-plane responsiveness, truthful UI during Podman backoff, lifecycle/reboot gates, and focused app-specific post-setup validation.
Suggested immediate next steps after resuming:
1. Read this file and verify no background build/process is running.
2. Build/deploy the local-only HTTP-health tightening patch if not already deployed.
3. Patch backend startup/crash recovery so Podman scans are async/non-blocking and service readiness is not held hostage by `podman ps`.
4. Patch My Apps UI/data flow to preserve last-known apps during scanner backoff and never show false empty state while checking.
5. Run focused status checks on `.198`: fedimint, botfights, filebrowser, bitcoin-knots, immich, homeassistant, portainer.
6. Continue lifecycle gates only after the runtime scan/control path is stable enough that tests measure apps, not Podman timeouts.
Read this first if resuming in a fresh OpenCode session. Paste the resume prompt below verbatim.
> Continue Archipelago release hardening from `docs/RESUME.md`. First read `docs/RESUME.md`, `docs/CONTAINER_LIFECYCLE_HANDOFF.md`, and `docs/MIGRATION_STATUS_REPORT.md`. The active validation node is `.198` at `192.168.1.198`; keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive for deterministic tests. Do not run Podman prune/image-list/system-df/image-exists/store-wide cleanup commands on `.198`; the store is known to hang under load. Preserve app data. Latest deployed backend hash is `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`. This includes the rootless Podman socket fix that treats `/run/user/1000/podman/podman.sock` as a socket bind, never a directory/data bind, prefers persistent `podman-archy-api.service` for Portainer, and changes absent cached `Stopping` entries to `Stopped`. User reported host reboot validation was not clean: many containers were SIGKILLed during reboot/shutdown and IndeeHub was stopped after boot. User also reported Immich, IndeeHub, Tailscale, Vaultwarden, Portainer, Home Assistant, Uptime Kuma, Nextcloud, Fedimint, and Botfights app lifecycle/launch/state issues. BTCPay was a false alarm: slow but fine. Current live validation: Vaultwarden full preserve-data lifecycle passed; Portainer full preserve-data lifecycle passed and its socket mount is no longer `//deleted`, but the user still needs to retry the Portainer environment wizard. Fedimint direct container state is running/healthy. IndeeHub remains P0: Podman still has a corrupted `indeedhub|Removing|97cf9fd13bb2` record; targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Treat post-reboot recovery, launch reachability, lifecycle correctness, progress indication, and rootless Podman socket-backed apps as active release blockers. 
> IndeeHub is not passing unless `http://<node>:7778/` is reachable and `/nostr-provider.js` is injected/served so the Nostr signer works as before. Tailscale is not passing unless launch presents the Tailscale login/auth UI. Before editing or touching `.198`, summarize current state and your exact first step.
Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image.
Current status estimate: about 68% of the way to release. The app migration, manifest/catalog generation, and many local gates are advanced, and the latest pass fixed Vaultwarden plus the concrete Portainer stale socket mount. Live `.198` testing still shows the app platform is not production-bulletproof. Remaining release blockers include app install/start truthfulness, frontend launch readiness gating, IndeeHub recovery and Nostr signer compatibility, Tailscale login-link launch, Home Assistant/Uptime Kuma/Nextcloud install/start failures, full lifecycle coverage, progress indication quality, app packaging documentation, refactor/dead-code cleanup, repeated reboot validation, final `.198` lifecycle confidence, and cutting/smoke-testing the `1.8-alpha` ISO.
## Release Readiness Estimate
- Estimated completion: `68%`.
- What is already achieved:
- manifest-driven app migration is substantially advanced;
- catalog metadata generation and strict drift checks are green;
- local backend/frontend release gates have been green in prior passes;
- broad non-destructive lifecycle has passed on the deployed release-candidate line before the reboot-gate finding;
- Podman store-risk paths have been quarantined from known fragile broad image/store commands;
- IndeeHub recovery now has local hardening in progress, including explicit Nostr signer validation in the lifecycle harness;
- targeted Immich fixes now make dependency creation fail fast instead of silently reporting install success, and a follow-up readiness-gating patch is in progress so the app does not look launchable before HTTP readiness;
- mobile and desktop app progress UX now has clearer install/remove phase labels in local changes;
- Vaultwarden full preserve-data lifecycle passed on `.198` after the rootless socket fix;
- Portainer full preserve-data lifecycle passed on `.198` after recreating the container against persistent `podman-archy-api.service`; its mount now points at `/podman/podman.sock`, not `/podman/podman.sock//deleted`.
- What must still pass before release:
- deploy the current Immich readiness-gating backend and frontend progress UX changes;
- focused Immich validation: install must stay in progress until `http://<node>:2283/` returns HTTP success and app launch opens the frontend;
- focused IndeeHub validation: recover stale/corrupt frontend container, prove `http://<node>:7778/`, and prove `/nostr-provider.js` signer bridge is injected/served;
- keep Vaultwarden in regression coverage even though the latest full lifecycle passed;
- focused Tailscale validation: launch must present the local login/auth link/UI on `8240`;
- focused Portainer validation: user must retry the environment wizard and confirm it can connect to the rootless Podman socket at `/var/run/docker.sock`;
- full preserve-data lifecycle testing for representative migrated apps and key stacks: `install -> launch -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch`;
- progress indication validation for install, uninstall, start, stop, restart, reboot recovery, and failed transitions; generic "running" or "removing" pills are not enough;
- app packaging documentation gate: update `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` so they match the current manifest/runtime contract, include lifecycle/progress/reboot expectations, and clearly tell developers to use reusable manifest/orchestrator primitives instead of OS-level per-app hacks;
- required refactor/remove-dead-code gate: after correctness is proven and before cutting `1.8-alpha`, remove obsolete app-specific paths, stale fallback metadata, duplicate lifecycle logic, unused scripts/hooks, and misleading compatibility shims; rerun lifecycle, launch, and release gates afterward;
- broad non-destructive lifecycle after the deploy;
- at least 3 consecutive clean post-fix reboot iterations, with broad lifecycle green after each;
> A lot were killed SIGKILL and one crashed, a couple stopped. Not sure if we did fixes but we should be a few reboot tests until 3/4/5 reboots are clean I guess, unless you advise a different passing criteria
>
> please do not forget that indeehub must work with the nostr signer just like before, I hope we haven't broken that or anything, please add to tasks
>
> also please note that immich and tailscale are not launching on the front-ends on their ports from the app screen, they say running/healthy but clearly aren't
>
> Also BTCPay is not running either
>
> no my bad, wrong server, BTCPay is fine just slow, please continue
>
> Yes, as shown in trying to complete the environment wizard in portainer you get "Failure Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
>
> please confirm there is a refactor/remove dead code release gate too
Passing criterion adopted: after the post-reboot recovery fix is deployed, require at least 3 consecutive clean reboots with broad non-destructive lifecycle green after each; prefer 5 consecutive clean reboots for production-release confidence. SIGKILL during shutdown is not automatically disqualifying if every managed app recovers and is reachable after boot, but any app left stopped/crashed/unreachable after boot is a failed reboot iteration. IndeeHub validation must include the Nostr signer bridge, not just HTTP reachability.
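
The passing criterion above can be written down as a small evaluator, shown here as a sketch. The record fields and the `fail`/`pass`/`preferred` labels are hypothetical; the thresholds of 3 and 5 and the treatment of SIGKILL come from the criterion itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RebootIteration:
    apps_recovered: bool   # every managed app running and reachable after boot
    lifecycle_green: bool  # broad non-destructive lifecycle passed afterwards
    sigkill_on_shutdown: bool = False  # recorded, not disqualifying by itself

    @property
    def clean(self) -> bool:
        return self.apps_recovered and self.lifecycle_green


def trailing_clean_streak(iterations: list[RebootIteration]) -> int:
    """Count consecutive clean iterations at the end of the run history."""
    streak = 0
    for iteration in reversed(iterations):
        if not iteration.clean:
            break
        streak += 1
    return streak


def gate_status(iterations: list[RebootIteration],
                required: int = 3, preferred: int = 5) -> str:
    streak = trailing_clean_streak(iterations)
    if streak >= preferred:
        return "preferred"
    if streak >= required:
        return "pass"
    return "fail"
```

Because only the trailing streak counts, a single failed iteration resets the gate regardless of how many clean reboots came before it.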
Immich, Tailscale, Vaultwarden, and Portainer are explicit blockers. Container `running`/`healthy` is not enough for Immich/Tailscale; direct/app-screen launch routes must work. Tailscale launch must present the login/auth UI. Vaultwarden must survive install/start/restart. Portainer must be able to talk to the rootless Podman socket from inside its Docker-compatible socket bind. BTCPay is not currently a blocker; it was a wrong-server/slow-app false alarm.
There is also an explicit app packaging documentation gate and an explicit required refactor/remove-dead-code release gate. The packaging docs must be current enough for a third-party developer to package an app against the actual manifest/runtime contract. Do the refactor/dead-code cleanup after current correctness fixes are validated, not before, but do not cut `1.8-alpha` without it: remove stale per-app hacks, dead legacy code paths, duplicate lifecycle helpers, obsolete scripts/hooks, and misleading fallback metadata that would make `1.8-alpha` hard to maintain, then rerun the release gates.
- Password for lifecycle harness/RPC login: `password123`.
- Latest recorded `/usr/local/bin/archipelago` sha256: `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`.
- `archipelago.service`: active.
- `archipelago-doctor.timer`: inactive.
- `archipelago-reconcile.timer`: inactive.
- `/`: `65%` used, about `9.6G` free.
- `/var/lib/archipelago`: about `9-10%` used, about `370G` free.
Current active app blockers:
- Immich: after deploying hash `54d781...`, reinstall no longer immediately stops. Live test showed `immich_postgres` and `immich_redis` healthy and `immich_server` running; first launch had a readiness gap while Immich ran migrations/geodata import, then `2283` returned HTTP `200`. Local follow-up changes add an Immich server health check and require healthy status before install completes.
- IndeeHub: still blocked. Latest targeted check after hash `f1f5c61c...` showed a corrupted Podman ghost record: `indeedhub|Removing|97cf9fd13bb2`; `podman inspect indeedhub` fails with `layer not known`. Targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Must recover this record without broad store cleanup and then verify `http://<node>:7778/` plus `/nostr-provider.js` for the Nostr signer.
- Home Assistant: user reports install completes then app stops. Treat as part of the migrated single-container/rootless Podman control-plane blocker.
- Uptime Kuma: user reports install takes ages then app stops. Live logs showed `package.install uptime-kuma failed: systemctl --user restart podman.socket exited exit status: 1`.
- Nextcloud: user reports same install-then-stop behavior. Live logs showed `package.install nextcloud failed: systemctl --user restart podman.socket exited exit status: 1`.
- Vaultwarden: latest full preserve-data lifecycle passed on hash `2a168489...`: install -> launch on `8082` -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch. Keep in regression tests because the user-visible transition/progress UX still looked like it was stuck while stopping.
- Portainer: latest full preserve-data lifecycle passed on hash `2a168489...`. The stale mount was confirmed as `/run/user/1000/podman/podman.sock//deleted`; after persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock` without `//deleted` and `http://127.0.0.1:9000/` returns HTTP `200`. User still needs to retry the environment wizard; do not close this blocker until the wizard no longer reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`.
- Tailscale: still blocked. Container running is not enough; launch must present local login/auth UI on `8240`.
- Fedimint: user reported it showed `stopping`; after hash `f1f5c61c...`, direct targeted state shows `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`. Keep in focused regression/launch checks.
- Botfights: newly reported stopped/broken. Direct probe after the report showed `botfights` running/healthy and `http://127.0.0.1:9100/` returning `200`; keep in focused lifecycle/launch validation after Podman control-plane recovery.
- Rootless Podman socket/control plane: improved but still a release-risk area. Fixed the concrete bug where `/run/user/1000/podman/podman.sock` could be created as a directory and the Portainer bind could point at a deleted socket inode. The current deployed backend prefers persistent `podman-archy-api.service`. Continue watching scanner timeouts and lifecycle behavior for Home Assistant, Uptime Kuma, Nextcloud, and Portainer.
- Stuck Podman records: P0 migration blocker. IndeeHub proves ordinary targeted `podman rm` fallbacks are not sufficient once a record is wedged in `Removing`.
- Progress UX: still blocked until live validation proves install/uninstall/start/stop/restart show phase detail and do not appear frozen.
Do not treat root disk pressure as a current blocker anymore. It was reduced from `99%` used with under `600M` free to about `65%` used with roughly `10G` free.
### 2026-06-10 Resume Continuation Checkpoint
- Deployed backend hash `7f58da80063f58574675256913ac9cddf131e65d8935015748a70adffc228f83` to `.198`.
- Previous live hash observed before deploy: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`.
- `archipelago.service` is active.
- `archipelago-doctor.timer` and `archipelago-reconcile.timer` are inactive.
- Added explicit release gates to this handoff:
- app packaging docs must be updated before `1.8-alpha`;
- refactor/remove-dead-code is required before `1.8-alpha`, after correctness validation and before final release gates/ISO.
- Filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` appeared wedged in the tool PTY after compilation started; no local cargo/rustc worker remained visible. Treat that one filtered run as inconclusive, not failed.
- `http://192.168.1.198:7778/nostr-provider.js` returns HTTP `200` and contains the Archipelago NIP-07/NIP-98 Nostr provider shim.
- Immich live validation after deploy:
- `container-list` reports `immich` running;
- direct `http://192.168.1.198:2283/` returns HTTP `200`;
- `container-health` reported `{"immich":"unknown"}` during one focused check, so health truthfulness still needs follow-up even though launch HTTP is reachable.
- Tailscale live validation after deploy:
- Found the live generated unit still used the stale catalog command `sleep 2; tailscale web...`; locally patched `app-catalog/catalog.json`, `neode-ui/public/catalog.json`, and `scripts/first-boot-containers.sh` to use the safer socket-wait startup, and copied the catalog to `/opt/archipelago/web-ui/catalog.json`.
- App-scoped `package.restart tailscale` failed via RPC with `podman ps timed out while listing containers`.
- Patched the live generated Tailscale `.container` unit to match the catalog fix and restarted only `tailscale.service`; the old container required SIGKILL during stop and Podman cleanup took roughly 2 minutes.
- After restart, the Tailscale unit runs both `tailscaled` and `tailscale web`, `container-list` reports `tailscale` running, `container-health` reports `{"tailscale":"healthy"}`, and `http://192.168.1.198:8240/` returns HTTP `200` with Tailscale UI content.
- Do not close Tailscale lifecycle as fully passing yet: launch UI is fixed, but stop/restart behavior exposed the rootless Podman cleanup/control-plane blocker.
- Other live probes after deploy:
- `portainer` HTTP `9000` returns `200`; user still needs to retry the environment wizard.
- `vaultwarden` HTTP `8082` returns `200` from localhost on `.198`.
- `botfights` HTTP `9100` returns `200` from localhost on `.198`.
- `btcpay-server` returned `302` then timed out under a short probe; continue treating BTCPay as slow rather than a current blocker unless a focused check fails.
- `fedimint` port `8175` reset during probe while RPC showed `starting`; keep expected Bitcoin-sync wait-state/status copy in scope.
- Podman/control-plane remains the active systemic blocker:
- logs still show `podman ps timed out`, `podman stats timed out`, scan backoff, and slow app cleanup;
- do not start reboot-count validation until app stop/start/restart and post-reboot recovery are clean enough that tests measure app behavior instead of Podman timeouts.
### 2026-06-08 Rootless Socket, Vaultwarden, and Portainer Fix
- Built and deployed backend hash `2a168489737180b4088503dd93ef89c11da13e64790b324db8baea8ca05d3536` to `.198`; then built and deployed follow-up hash `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`; `archipelago.service` active, `archipelago-doctor.timer` inactive, `archipelago-reconcile.timer` inactive.
- Fixed rootless Podman socket bind handling in `core/archipelago/src/container/prod_orchestrator.rs`:
- `/run/user/1000/podman/podman.sock` is skipped by bind-directory creation and data UID/chown prep;
- socket bind mounts call explicit socket repair before other bind prep;
- `ensure_user_podman_socket()` now prefers persistent `podman-archy-api.service` at `unix:///run/user/1000/podman/podman.sock`, falling back to `podman.socket` only if needed.
- `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago absent_` (`4 passed`, including the stale absent `Stopping` regression tests).
- Portainer stale socket mount was confirmed and repaired:
- Before recreate, mountinfo showed `/run/user/1000/podman/podman.sock//deleted -> /var/run/docker.sock`.
- After persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock -> /var/run/docker.sock`, host socket exists, and Portainer UI returns HTTP `200`.
- User still needs to retry the Portainer environment wizard; do not close the blocker until that wizard can connect.
- Direct state check after deploy:
- `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`.
- `indeedhub|Removing|97cf9fd13bb2`; `podman inspect` fails with `layer not known`; targeted removal/cleanup hangs and had to be killed.
- `vaultwarden running true`.
- `portainer running true`.
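
The socket bind handling described in this section (a socket source is never created as a directory and never gets data-directory prep) might look like the sketch below. The `.sock` suffix heuristic and both function names are assumptions; the actual logic is in `core/archipelago/src/container/prod_orchestrator.rs` and may classify binds differently.

```python
import os
import stat


def is_socket_bind(source: str) -> bool:
    """Classify a bind source as a socket bind.

    Assumption: a path counts as a socket if it already is one on disk,
    or if its name ends in `.sock` (covers a missing or wrongly typed path).
    """
    try:
        if stat.S_ISSOCK(os.stat(source).st_mode):
            return True
    except OSError:
        pass
    return source.endswith(".sock")


def prepare_bind_source(source: str) -> str:
    """Create data directories, but leave socket sources untouched."""
    if is_socket_bind(source):
        # No mkdir and no chown: creating a directory here is the bug
        # that left Portainer bound to a non-socket path.
        return "socket"
    os.makedirs(source, exist_ok=True)
    return "directory"
```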
### 2026-06-08 Reboot Blocker Follow-up In Progress
- User reported host reboot validation was not clean: many containers were killed with SIGKILL during reboot/shutdown, one crashed, a couple stopped, and IndeeHub was stopped after boot.
- Treat this as a failed reboot gate. Do not call the release ready until post-fix reboot iterations are clean.
- Local changes made in this pass:
- hardened `core/archipelago/src/container/prod_orchestrator.rs` IndeeHub stack recovery so reboot reconcile starts existing backend containers through a user scope when possible, waits for backend containers and API dependency DNS, starts/restarts the frontend, verifies it remains running, and verifies host port `7778`;
- hardened `core/container/src/manifest.rs` package validation for app IDs, ports, env keys, capabilities, devices, volume sources/options, network policy, and reviewed host-bind exceptions while preserving all current real manifests;
- updated `tests/lifecycle/remote-lifecycle.sh` so IndeeHub launch validation requires `/nostr-provider.js` to be injected into the HTML and served from the app, preserving the Nostr signer requirement.
- Deployed follow-up backend hash `4108ca146b482c028ae8d7c4bec314b71ef3412f15efd2e61846a2c345b36aba` to `.198`; service active, timers inactive. Focused audit still showed:
- `indeedhub` stuck `stopping` and unhealthy;
- `immich` stopped/unhealthy;
- `tailscale` running/healthy but direct launch `8240` returned `000`;
- `vaultwarden` health RPC errored and launch `8082` returned `000`;
- `btcpay-server` was fine (`23000` returned HTTP 200); user confirmed BTCPay was a wrong-server/slow-app false alarm.
- Targeted diagnostics on `.198` found:
- IndeeHub frontend Podman state `removing`/`stopping` with no `7778` listener;
- Immich server stopped, Redis exited, Postgres unhealthy, no `2283` listener;
- Tailscale listener process existed on `8240`, but direct HTTP still returned `000`; logs show Tailscale is `NeedsLogin`/`WantRunning=false`, so launch must present the login/auth UI rather than a generic daemon endpoint;
- Vaultwarden container was absent; public `package.start vaultwarden` failed on stale/refused Podman socket before local fixes;
- Portainer launches but the environment wizard reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`, confirming socket-backed apps are not release-ready.
- Local follow-up fixes after those diagnostics:
- `core/container/src/runtime.rs` now tries `podman rm -f --time 0`, targeted `podman container cleanup`, and another `rm -f` when normal forced remove fails;
- `ensure_user_podman_socket()` now verifies the rootless Podman socket accepts Unix connections, not just that the socket path exists;
- IndeeHub readiness now falls back to platform-managed network-alias presence when `getent` inside the API image cannot prove DNS;
- lifecycle harness now requires Tailscale launch content to look like login/auth UI.
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
- `bash -n tests/lifecycle/remote-lifecycle.sh`.
- `git diff --check`.
- Deployed second follow-up backend hash `06420c0377fff650a2bf3211f13c1e0754bf8df81345b8485f4c9a30cb552439` to `.198`; service active, timers inactive.
- Public RPC recovery attempts on hash `06420c...`:
- `package.restart indeedhub` still failed;
- `package.start immich` accepted async start but app remained `starting` with no `2283` launch;
- `package.start vaultwarden` accepted async start but no `8082` launch appeared;
- `package.restart portainer` failed;
- `package.restart tailscale` accepted async restart but no `8240` launch UI appeared.
- Conclusion: do not proceed to reboot testing or ISO work. The rootless Podman control-plane/socket health and stuck container-state recovery need a deeper platform fix before lifecycle/reboot gates are meaningful.
- A filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` compiled and ran the matching existing IndeeHub test (`1 passed`); it did not exercise the new reboot recovery branch because there is no direct unit for that path yet.
- Next steps:
  - deploy the new backend only after approval;
  - verify focused `indeedhub,immich,tailscale,vaultwarden,portainer` lifecycle/launch, including IndeeHub Nostr provider check and Portainer socket usability;
  - run reboot validation iterations on `.198` only after explicit approval;
  - cut and smoke-test the `1.8-alpha` ISO after reboot validation is green.
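
The socket check described above (a path that exists is not proof that Podman is listening) can be sketched as follows. This is an illustrative Python sketch, not the Rust `ensure_user_podman_socket()` implementation; the function name `unix_socket_accepts` and the 2-second timeout are assumptions.

```python
import os
import socket
import stat


def unix_socket_accepts(path: str, timeout_s: float = 2.0) -> bool:
    """Return True only if a listener accepts a connection on the Unix socket at `path`."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False  # path is missing or cannot be inspected
    if not stat.S_ISSOCK(mode):
        return False  # path exists but is not a socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout_s)
        try:
            client.connect(path)
        except OSError:
            return False  # stale socket file, refused, permission denied, or timed out
    return True
```

A stale socket file left behind by a dead service fails the connect step even though the path still exists, which is the case a plain existence check misses.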
### Local Release Gate Completion After `.198` App Recovery
- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands.
- Fixed scanner backoff/in-flight skip behavior: skipped scans now bump `scan_tick`, so install/update success paths that kicked the scanner do not wait for their timeout when Podman scan backoff is active.
- Fixed stale crash-recovery unit tests after `should_auto_start_stopped_container` gained the `include_stack_members` flag; coverage now asserts generic boot recovery skips stack helpers while stack recovery can include them.
- Fixed local runtime manifest-port lookup so tests and local backend runs can find workspace `apps/*/manifest.yml` via `CARGO_MANIFEST_DIR`; this covers new public apps such as PhotoPrism.
- Fixed journal usage parsing for real `journalctl --disk-usage` compact output such as `463.9M`.
- Fixed boot-reconciler cadence tests so `without_companion_stage()` also bypasses the global crash-recovery wait gate in tests; production still waits for recovery completion.
- Verified catalog generation is idempotent: `python3 scripts/generate-app-catalog.py` reported `updated 0 fields` for both catalogs.
- Local caveat: `npm ci` is currently blocked because existing `neode-ui/node_modules/@alloc` entries are owned by `root:root`. Existing installed modules were sufficient for type-check, tests, and build. Do not delete or chown this tree without explicit approval.
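
As a rough illustration of the journal usage fix above, the compact size token in `journalctl --disk-usage` output can be parsed like this. It is a Python sketch rather than the backend's Rust parser; `parse_journal_usage` is a hypothetical name, and treating the suffixes as powers of 1024 is an assumption.

```python
import re

# Assumed unit table: single-letter suffixes interpreted as powers of 1024.
_UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
_SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*([BKMGT])\b")


def parse_journal_usage(text: str):
    """Return the first size token in `text` as bytes, or None if no token is found."""
    match = _SIZE.search(text)
    if match is None:
        return None
    return round(float(match.group(1)) * _UNITS[match.group(2)])
```

For example, `parse_journal_usage("Archived and active journals take up 463.9M in the file system.")` yields the byte count for `463.9M` under the assumed unit table.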
### Fedimint/File Browser, Nostr/NPM, and IndeedHub Recovery
- Built and deployed backend hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de` to `.198`.
- Fixed UI-facing package health for reachable running apps whose Podman health stayed `starting`, `unhealthy`, or a numeric exit value while the launch port was reachable.
- Confirmed Fedimint Guardian and File Browser were actually reachable; their `server.get-state` package-data now reports healthy instead of “starting up”.
- Fixed Nostr relay port conflict by moving `apps/nostr-rs-relay/manifest.yml` host port from `8081` to `18081`.
- Recovered Nginx Proxy Manager admin launch on `8081`; Nostr now launches on `18081` and no longer captures the NPM launch port.
- Hardened legacy package install so scoped web-app installs use `podman create` plus `systemd-run --user --scope podman start`, avoiding backend-cgroup coupling without hanging the install RPC.
- Recovered IndeedHub without deleting data: started the stopped `indeedhub-minio` dependency, repaired frontend reachability, and verified `7778` returns the app.
- Focused lifecycle for `indeedhub,nginx-proxy-manager,nostr-rs-relay,fedimint,filebrowser`.
- Direct launch checks returned HTTP `200` for `7778`, `8081`, `18081`, `8175`, and `8083`.
- Broad non-destructive lifecycle passed on live hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.6G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.
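
The Nostr/NPM clash on `8081` is the kind of conflict a simple pre-flight check could flag. A minimal sketch, assuming host ports have already been extracted from each manifest into a plain mapping (the helper name and the input shape are hypothetical, not part of the repo):

```python
from collections import defaultdict


def find_host_port_conflicts(host_ports_by_app):
    """Map each host port claimed by more than one app to the sorted list of claimants."""
    owners = defaultdict(list)
    for app_id, ports in sorted(host_ports_by_app.items()):
        for port in ports:
            owners[port].append(app_id)
    return {port: apps for port, apps in owners.items() if len(apps) > 1}
```

With both apps on `8081` the function reports the pair; after moving `nostr-rs-relay` to `18081` it returns an empty mapping.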
### Deployed Podman Store-Risk Cleanup
- Reviewed release-relevant Podman store/image call sites without running broad Podman store/image commands on `.198`.
- Bounded stack installer image pulls and manual package update image pulls with `kill_on_drop` and 600s timeouts.
- Deployed backend hash `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4` to `.198` with the previous backend backed up under `/usr/local/bin/archipelago.backup-20260608-store-risk-*`.
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.8G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.
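
The bounded-pull pattern above (a hard timeout, with the child process killed rather than left running) looks roughly like this in Python. The backend uses Rust's `kill_on_drop`; this sketch only mirrors the behavior, and `run_bounded` is a hypothetical helper name.

```python
import subprocess


def run_bounded(cmd, timeout_s):
    """Run `cmd` and return (exit_code, timed_out).

    subprocess.run kills and reaps the child when the timeout expires,
    so a hung command cannot outlive the caller's deadline.
    """
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, True
    return done.returncode, False
```

A pull bounded like the one in the notes would be called as `run_bounded(["podman", "pull", image], 600)`, where `image` is whatever reference the installer resolved.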
### Release Candidate Backend Restart Validation
- Built and deployed backend hash `e28affdf4c1d3cecbe4c14b0439b53d977ed20873c966c288116601d49dac732` to `.198`.
- Bounded additional Podman store/control probes so image and stack health checks fail fast instead of hanging under `.198` Podman store/socket load.
- Fixed Fedimint health reporting: if Podman health remains `starting` but the app endpoint is reachable, `container-health` can use the reachable cached app fallback.
- Fixed package start/restart fallback for runtime web apps by using `systemd-run --user --scope` for `podman start`, then falling back to direct bounded `podman start`.
- Recovered live Immich without data loss:
  - `immich_server` had exited because `/usr/src/app/upload/encoded-video/.immich` could not be written.
  - The correct live ownership repair is still `podman unshare chown -R 0:0 /var/lib/archipelago/immich`, which yields host UID/GID `1000:1000` and container-root ownership.
  - A temporary `1000:1000` in-container ownership experiment was reverted because Immich's storage check writes as container root.
- Remaining gate before calling this a release: host reboot validation, if approved.
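
Why `chown 0:0` inside `podman unshare` lands on host `1000:1000` follows from the rootless ID map. The sketch below models that translation in Python; the first map entry matches the notes (container `0` is host `1000`), while the subordinate range starting at `100000` is an assumed typical default, not a value read from `.198`.

```python
def container_to_host_id(container_id, id_map):
    """Translate a container UID/GID to a host ID using (inside, outside, length) ranges."""
    for inside_start, outside_start, length in id_map:
        if inside_start <= container_id < inside_start + length:
            return outside_start + (container_id - inside_start)
    return None  # ID is not mapped in this user namespace


# Container root maps to the invoking user; other IDs map into an assumed subuid range.
ROOTLESS_MAP = [(0, 1000, 1), (1, 100000, 65536)]
```

Under this map, container `0` resolves to host `1000`, while container `1000` resolves to a subordinate host ID. That is consistent with the reverted experiment: files owned by in-container `1000:1000` are not owned by container root.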
### IndeedHub and Immich Lifecycle Recovery
- Built and deployed backend hash `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9` to `.198`.
- IndeedHub focused audit is green after sequencing network alias repair immediately before frontend startup, after dependencies are running.
- Fedimint and NetBird focused audits are green; they were not current blockers after rerun.
- Immich was the broad-audit blocker and is now green:
  - dependency readiness accepts healthy Podman health state for `immich_postgres` and `immich_redis` before falling back to slower exec probes;
  - `immich_server` startup repairs `/var/lib/archipelago/immich` ownership with `podman unshare chown -R 0:0`, preserving upload data while matching the current rootless container user mapping;
  - this fixed the observed `EACCES` on `/usr/src/app/upload/encoded-video/.immich`.
- Residual risk remains: `.198` still intermittently logs `podman ps -a --format json timed out after 30s` and transient Bitcoin RPC timeouts under load. Continue avoiding store-wide Podman commands.
### Release Refactor Cleanup
- Built and deployed backend hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` to `.198`.
- Legacy package runtime host-port cleanup/repair now derives host ports from manifests when available.
- Hardcoded ports remain only as fallback for legacy/non-manifest apps and extra stale-port cleanup compatibility.
- Removed the duplicate Gitea-specific stale port cleanup helper.
- Added focused runtime-host-port tests, but local `cargo test --manifest-path ../../core/Cargo.toml -p archipelago runtime_host_ports` did not finish within 5 minutes during compilation.
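
The manifest-first lookup with a hardcoded fallback can be pictured as below. This is a sketch only: the manifest shape (`ports` entries with a `host` key) and the function name are hypothetical, since the real manifest schema is not shown here.

```python
def resolve_host_ports(app_id, manifest, legacy_fallback):
    """Prefer host ports declared in the manifest; fall back to a legacy table otherwise."""
    if manifest is not None:
        # Hypothetical schema: a list of port entries, each with an optional "host" key.
        ports = [int(entry["host"]) for entry in manifest.get("ports", []) if "host" in entry]
        if ports:
            return ports
    return list(legacy_fallback.get(app_id, []))
```

Keeping the fallback table as data rather than per-app branches means an app drops out of it as soon as its manifest declares ports.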
### Catalog Metadata Generation
- Added `scripts/generate-app-catalog.py` to sync manifest-owned metadata into `app-catalog/catalog.json` and `neode-ui/public/catalog.json`.
- The generator updates fields that manifests already own: `title`, `version`, `description`, `dockerImage`, `category`, `tier`, `icon`, and `repoUrl`.
- The catalog still preserves catalog-only fields such as `author`, `requires`, `featured`, and rich `containerConfig` notes.
- Corrected stale manifest metadata for BotFights, IndeeHub, Gitea, LND, ElectrumX, Fedimint, and Mempool before generation.
- Built and deployed backend hash `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2` to `.198`.
- Fresh local-build installs now treat `podman image exists <local-build-tag>` failure/timeout as "unknown/missing" and rebuild the local image instead of failing the lifecycle operation.
- This keeps local image store checks from being release-blocking while preserving bounded runtime timeouts and matching the existing drift-restart behavior.
- Added focused unit test coverage for the image-exists failure behavior, but local `cargo test --manifest-path core/Cargo.toml -p archipelago install_fresh_builds_when_image_exists_check_fails` did not complete within 15 minutes during compilation.
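
The catalog sync rule described in this section (overwrite manifest-owned fields, leave catalog-only fields alone) can be sketched as follows. The field list comes from the notes; the function name is hypothetical, and the sketch assumes manifest metadata has already been normalized to catalog field names, which may not match what `scripts/generate-app-catalog.py` does internally.

```python
MANIFEST_OWNED = (
    "title", "version", "description", "dockerImage",
    "category", "tier", "icon", "repoUrl",
)


def sync_entry(catalog_entry, manifest_meta):
    """Copy manifest-owned fields into the catalog entry; return how many fields changed."""
    updated = 0
    for field in MANIFEST_OWNED:
        if field in manifest_meta and catalog_entry.get(field) != manifest_meta[field]:
            catalog_entry[field] = manifest_meta[field]
            updated += 1
    return updated
```

Running the sync twice on the same inputs returns `0` the second time, which is the idempotency property a report of `updated 0 fields` reflects.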
### Container Health Fallback and Broad Lifecycle Green
- Built and deployed backend hash `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36` to `.198`.
- `cached_reachable_health()` now parses ports from URLs with trailing slashes correctly, such as `http://localhost:2342/`.
- The local TCP fallback now covers the lifecycle web app ports, including PhotoPrism, BTCPay, LND UI, Mempool, Electrum, Fedimint, Gitea, IndeedHub, Ollama, Vaultwarden, Tailscale, and others.
- Cached-running apps with reachable local TCP listeners can report `healthy` without depending on flaky Podman health/inspect calls.
- Built and deployed backend hash `3912b900c376b6c28bf5453640cae82135f67d7e0f984b8adcc78064b924143b` to `.198`.
- Confirmed objective remains: app behavior should be manifest/platform-primitive owned, not OS-image or per-app backend hack owned.
- Broad lifecycle on `d21202cd...` failed only on Uptime Kuma briefly showing `stopping` during listener repair; it recovered afterward.
- Fixed stale transitional merge: `Stopping -> Running` recovers when no user-stop marker exists; user-initiated stops still keep `Stopping`.
- Health monitor now derives required host TCP ports from Podman JSON `Ports` and marks running containers unhealthy when declared host listeners are missing.
- This is generic host-port health, not an app-specific mapping.
- After deploying `3912b900...`, Uptime Kuma recovered `3002` and returned HTTP `302` after backend restart.
- Jellyfin still needs follow-up: Podman reports `jellyfin Up ... (healthy)` with `0.0.0.0:8096->8096/tcp`, but `ss` shows no `8096` listener and `curl http://192.168.1.198:8096/` fails.
- Follow-up on `be95ea...` resolved the broad lifecycle timeout by hardening `container-health` fallback behavior.
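
The trailing-slash port parsing mentioned for `cached_reachable_health()` is easy to get wrong with manual string splitting. A Python sketch of a tolerant parser is shown here for illustration; it is not the Rust code, and the default-port table is an assumption.

```python
from urllib.parse import urlsplit

# Assumed defaults used when the URL carries no explicit port.
_DEFAULT_PORTS = {"http": 80, "https": 443}


def port_from_url(url):
    """Return the TCP port implied by `url`, or None if it cannot be determined."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return None
    if port is not None:
        return port
    return _DEFAULT_PORTS.get(parts.scheme)
```

Both `http://localhost:2342` and `http://localhost:2342/` resolve to `2342`, because the parser separates the network location from the path before reading the port.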
### Stale State and Jellyfin Pasta Listener Hardening
- Built and deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` to `.198`.
- `container-list` now overlays cached `exited` entries with targeted live state so scanner backoff does not leave lifecycle/UI reads stuck on stale `exited` after recovery.
- `container-health` now has a bounded cached-running plus local TCP reachability fallback for web apps, reducing dependency on slow/hung Podman inspect paths for health reads.
- Jellyfin was added to legacy runtime host-port repair for pasta listener `8096`.
- `package.restart jellyfin` still exposed a real Podman socket/runtime blocker after stopping the container: `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`.
- `package.start jellyfin` recovered the app afterward; `jellyfin` became `Up ... (healthy)`, `8096` had a `pasta.avx2` listener, and `http://192.168.1.198:8096/` returned HTTP `302`.
- `nostr-rs-relay` was added to both catalog surfaces.
- `scripts/check-app-catalog-drift.py --release --strict` reports zero missing catalog/manifest entries and zero metadata drift after catalog generation.
- Meshtastic `app.files` live behavior was validated: deleting `/var/lib/archipelago/meshtastic/config.yaml` and restarting recreated it from the manifest.
- `http://192.168.1.198:8096/` -> HTTP `302` after Jellyfin recovery/start.
- `http://192.168.1.198:8083/` -> HTTP `404` on `/`, which is expected for the File Browser root probe used here.
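
The `container-list` overlay described in this section can be modeled as a merge that only re-checks entries cached as `exited`. The Python sketch below is illustrative; `overlay_live_state` and the `live_lookup` callback are hypothetical names standing in for the backend's targeted per-container query.

```python
def overlay_live_state(cached, live_lookup):
    """Return cached states, replacing `exited` entries with live state when available.

    `live_lookup(name)` is expected to return a state string, or None when the
    targeted query fails or times out, in which case the cached value is kept.
    """
    merged = dict(cached)
    for name, state in cached.items():
        if state == "exited":
            live = live_lookup(name)
            if live is not None:
                merged[name] = live
    return merged
```

Only stale-looking entries trigger a live query, so the overlay stays cheap while the scanner is backing off.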
### Test Caveat
- Earlier local focused test commands timed out during first-time test binary compilation, but after compilation completed the full backend test target passed: `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
- Remaining workspace packages also pass checks/tests: `archipelago-container`, `archipelago-performance`, and `archipelago-security`.
- Current live backend hash on `.198`: `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
- Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive unless explicitly testing them.
- Do not run destructive git commands.
- Do not run Podman store-wide cleanup or broad image/store commands on `.198` without a mitigation plan:
  - Avoid `podman system df`.
  - Avoid `podman image list` / `podman image ls`.
  - Avoid broad `podman image exists` loops.
  - Avoid `podman image prune` and `podman volume prune`.
- Podman store commands can hang and block app health under current `.198` load.
- Latest local mitigation: Rust release image-existence probes now use bounded targeted `podman image inspect` instead of `podman image exists` or `podman images -q`.
- Need quarantine/mitigation strategy rather than store-wide commands in release paths.
- Current release paths avoid prune and broad image-list/existence commands; orchestrator, companion, and legacy install image checks now use bounded `podman image inspect`.
- Latest concrete failure remains historical: `package.restart jellyfin` stopped the container but failed to complete because Podman reported socket permission/runtime failure. `package.start jellyfin` recovered afterward.
- Latest deployed hash still logged one initial `podman ps -a --format json` scan timeout/backoff, but focused and broad non-destructive lifecycle validation passed.
- `cargo test -p archipelago-container` passed (`43` tests).
- `git diff --check -- <changed files>` passed.
- Filtered `cargo test -p archipelago install_fresh_build` did not complete: one run hit a `rust-lld` undefined hidden symbol artifact/link failure after concurrent Cargo jobs; the sequential `CARGO_INCREMENTAL=0` rerun exceeded 10 minutes during compile, but test-target compilation passed afterward.
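
The bounded, targeted image probe noted above, combined with the rule that an inconclusive probe should trigger a rebuild rather than a failure, can be sketched like this. It is a Python illustration of the idea, not the Rust code; the function names, the 10-second default, and the injectable `run` parameter are assumptions.

```python
import subprocess


def image_state(tag, timeout_s=10.0, run=subprocess.run):
    """Probe one image with `podman image inspect`; return 'present', 'missing', or 'unknown'."""
    try:
        done = run(
            ["podman", "image", "inspect", tag],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return "unknown"  # probe hung or podman could not be executed
    return "present" if done.returncode == 0 else "missing"


def should_rebuild(state):
    """Rebuild unless the image is positively known to exist."""
    return state != "present"
```

Treating `unknown` the same as `missing` keeps a slow image store from blocking an install, at the cost of an occasional redundant local build.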
The biggest improvement in the latest pass is that broad lifecycle is green again on the latest backend. The biggest remaining technical risk is Podman store/socket health.