# RESUME - Archipelago Release Hardening on `.198`

Last updated: 2026-06-10

## 2026-06-10 05:48 EDT Active Session Checkpoint

Work resumed from `docs/NEXT_TERMINAL_HANDOFF.md`. No `.198` host actions have been run yet in this resumed pass.

Current first steps:

1. Rerun `git diff --check`.
2. Rerun the focused Rust image-version test for the Nextcloud false-update helper.
3. If those are clean, inspect and continue the rootless Podman lifecycle/scanner-backoff work before any `.198` validation.

Progress:

- `git diff --check` passed.
- Focused Rust image-version test in `/tmp/archy-cargo-image-versions` remains inconclusive: the tool PTY stayed open after compile output stopped, with no active `cargo`, `rustc`, or linker process visible.
- Bounded retry of the focused image-version test using the normal workspace target also timed out: `timeout 300s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests` exited `124` after compiling the `archipelago` test target without reaching test output. Nextcloud false-update validation is still not closed.
- Local code change in progress: single-orchestrator `package.stop` now returns immediately with `stopping` and runs the orchestrator stop in the background, instead of blocking the RPC/UI while Podman cleanup happens.
- `cargo fmt --manifest-path core/Cargo.toml --all --check` passed.
- Compile check passed in `/tmp/archy-cargo-runtime-check`: `cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago`.
- `git diff --check` passed after the stop-path edit and doc updates.
- Lower-level stop path inspection: Quadlet service stop is already bounded with kill/reset recovery, and the runtime fallback treats already-absent containers as success. No extra lower-level stop change was made.

## 2026-06-10 05:30 EDT Pause Checkpoint

User paused to switch machines.
Continue from `/home/archipelago/Projects/archy` and read `docs/NEXT_TERMINAL_HANDOFF.md` plus `docs/1.8-alpha-improvements-tracker.md` first. No dev server or validation command should be intentionally left running from this checkpoint.

Latest local-only tracker progress:

- Done: uninstall preserve/delete-data choice, companion APK QR/download modal, App Details setup-instructions card, dead/coming-soon UI cleanup via Spotlight AI placeholder removal.
- In progress: Fleet/tab loading polish, Bitcoin receive-address readiness states, no-registration credentials inventory, Nextcloud false-update fix.
- New credential fallback: PhotoPrism now shows manifest-backed credentials (`admin` / `archipelago`) when backend credentials are empty. Grafana was not added because `GRAFANA_ADMIN_PASSWORD` is not resolved to a known repo default/secret.
- Nextcloud local fix: manifest/catalog/UI metadata now points at `nextcloud:29` and image update detection ignores registry-host-only changes. Catalog drift passed, but backend focused Rust validation did not complete cleanly. First `cargo test -p archipelago container::image_versions::tests` from `core/` hit a Rust linker/incremental artifact failure while `/tmp` was full; a non-incremental retry was killed after running too long. Old `/tmp/archy-cargo-*` build-cache directories were removed and `/tmp` recovered.

Latest local validations:

- `npm run type-check` passed after the PhotoPrism credential fallback.
- `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed.
- `git diff --check` passed after the Spotlight cleanup and should be rerun after resuming.
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed during the Nextcloud pass.

Immediate next steps:

1. Rerun `git diff --check`.
2. Rerun `cargo test -p archipelago container::image_versions::tests` from `core/` when ready to validate the Nextcloud update-detection helper.
3. Continue the `docs/1.8-alpha-improvements-tracker.md` rows that remain `todo` or `in-progress`, avoiding host-gated items until `.198` access is intentionally resumed.

## 2026-06-09 Resume Handoff - Read First

Last user prompt to preserve:

> please can we save all our progress, backlog, and goal to memory so I can resume on another device please
>
> including the last prompt

Ultimate release goal:

Archipelago's app/container system must be developer-ready and production-release ready. New apps should be supported through manifest/runtime contracts and clear developer documentation, not one-off OS-level changes or fragile per-app hacks. The app system must be professional, secure, elegant, lightweight, and predictable: apps install, start, stop, restart, uninstall, reinstall, survive reboot, show correct status/progress, and launch correctly from tabs/iframes. Developers should be able to package apps for Archipelago clearly from the migration/developer docs.

Important target node:

- Validation node: `archipelago@192.168.1.198`, password `password123`.
- Current release deadline pressure from user: production release target was Thursday, 2026-06-11.
- Tests have been run mostly on `.198`; user noted we may also need to validate on the current intended release server, not only `.198`.
- Avoid broad/destructive Podman store cleanup. Do not use `git reset --hard` or revert unrelated user changes.

Current deployed backend on `.198`:

- Latest deployed `/usr/local/bin/archipelago` sha256: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`.
- A later local-only code change exists and passed `cargo check`: cached web-app health now requires HTTP reachability, not just TCP. This was not deployed because the user interrupted the release build/deploy flow. No build process was left running at handoff.
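The local-only health change noted above (cached web-app health requires HTTP reachability, not just an open TCP port) can be illustrated with a small sketch. This is a Python illustration of the rule only, not the Rust backend code. The function name `classify_web_health` and the labels `unreachable`, `starting`, and `unhealthy` are hypothetical; the 2xx/3xx acceptance range follows the "HTTP success/redirect" wording used elsewhere in this handoff.

```python
from typing import Optional


def classify_web_health(tcp_open: bool, http_status: Optional[int]) -> str:
    """Classify a web app from one TCP probe and one HTTP probe.

    tcp_open: True if the app port accepted a TCP connection.
    http_status: status code of an HTTP GET, or None if no response arrived.
    Label names are assumptions made for this sketch.
    """
    if not tcp_open:
        return "unreachable"
    if http_status is None:
        # The port is open but nothing answered over HTTP yet,
        # which is the false-healthy window seen during app boot.
        return "starting"
    if 200 <= http_status < 400:
        # HTTP success or redirect counts as reachable.
        return "healthy"
    return "unhealthy"
```

Under this rule an app that only accepts TCP connections during boot stays in `starting` and is not cached as healthy.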
Major progress achieved in the latest session:

- Beta Telemetry / Fleet collector:
  - Confirmed `TELEMETRY_COLLECTOR_URL` was not set in the current shell and no repo/service config was setting it.
  - Fixed the periodic reporter to POST a `telemetry.ingest` JSON-RPC envelope to the configured collector endpoint instead of POSTing the raw telemetry report body.
  - Added optional systemd env loading with `EnvironmentFile=-/var/lib/archipelago/telemetry.env` in `image-recipe/configs/archipelago.service`.
  - Updated `scripts/deploy-to-target.sh` so deployments write `/var/lib/archipelago/telemetry.env` when `TELEMETRY_COLLECTOR_URL` is exported in `scripts/deploy-config.sh`.
  - Documented the expected value shape in `scripts/deploy-config.example`: `https:///rpc/v1`.
  - Verification passed: `cargo fmt -p archipelago --manifest-path core/Cargo.toml`, `bash -n scripts/deploy-to-target.sh`, `git diff --check` for the touched files, and `CARGO_TARGET_DIR=/tmp/archy-cargo-check cargo check -p archipelago --manifest-path core/Cargo.toml`.
  - `systemd-analyze verify image-recipe/configs/archipelago.service` could not run in the sandbox because systemd bus access failed with `SO_PASSCRED failed: Operation not permitted`.
  - Still needed: choose the real collector host, create or update local `scripts/deploy-config.sh` with `export TELEMETRY_COLLECTOR_URL='https:///rpc/v1'`, deploy, restart `archipelago`, and confirm opted-in nodes ingest into Fleet.
- IndeeHub:
  - Recovered stale/corrupt metadata/container state enough for fresh lifecycle.
  - Full lifecycle passed earlier on `.198`.
  - Verified launch on `7778`.
  - Verified `/nostr-provider.js` is served and the Nostr signer bridge requirement is preserved.
- Saleor:
  - Removed from app catalog/server as requested.
- Bitcoin Knots / Bitcoin UI:
  - Fixed false health path so `bitcoin-knots` health no longer just probes the UI bridge on `8334`.
  - Patched Bitcoin UI wording to show retrying/busy sync states instead of scary permanent failure.
  - Verified `/bitcoin-status` recovered; node is in IBD and pruned, progress around 6-7% during latest checks.
- Fedimint:
  - Restored/kept Fedimint Gateway as separate catalog app. Do not make Guardian launch Gateway.
  - Fixed Guardian startup path so `fedimint` uses manifest-backed Quadlet/orchestrator, not legacy startup.
  - Fixed generated unit regeneration by removing the pre-orchestrator Podman inspect gate for orchestrator starts.
  - Fedimint Guardian unit now includes `FM_BITCOIND_URL=http://bitcoin-knots:8332`.
  - Added manifest wrapper that waits for Bitcoin RPC sync with `"initialblockdownload":false` before launching `fedimintd`.
  - Current correct behavior on `.198`: `fedimint.service` active and logging `Waiting for Bitcoin RPC sync at http://bitcoin-knots:8332...`; RPC health returns `starting`; container-list now reports `fedimint` as `starting` instead of stale `stopping`.
  - Guardian iframe/tab does not yet show UI because `fedimintd` is intentionally gated until Bitcoin leaves IBD. The UI should explain "waiting for Bitcoin sync" rather than opening a blank/dead iframe.
- BotFights:
  - User reported stopped/unhealthy.
  - Added `botfights` to manifest-backed orchestrator start path so it no longer fails immediately on legacy Podman discovery.
  - Deployed backend hash `9a00e543...`.
  - BotFights started and is active.
  - Direct checks after it finished booting: `/` returned HTTP 200; `/api/health` returned `{"status":"ok","name":"botfights"}`.
  - Note: `.198` manifests still use `git.tx1138.com/lfg2025/botfights:1.1.0`; local repo manifest shows `146.59.87.168:3000/lfg2025/botfights:1.1.0`. Reconcile this catalog/manifest mismatch later.
- Status/health correctness:
  - Reduced container health/status Podman timeouts to avoid UI hanging forever.
  - `container-list` now refreshes stale cached states and uses Quadlet service-active fallback for stale `stopping` states.
  - Fedimint stale `stopping` fixed to `starting`.
  - Local-only patch passed `cargo check`: web-app cached health requires HTTP success/redirect, not just open TCP. This fixes false healthy during app boot, seen with BotFights.
- Filebrowser/Home Assistant/Immich/Bitcoin:
  - Latest RPC health check showed filebrowser healthy, homeassistant healthy, immich healthy, bitcoin-knots healthy.
  - Still treat Home Assistant setup/restart hang and Immich post-setup HTTP 500 as backlog blockers needing focused validation.

Current critical blockers:

- Runtime control plane / Podman scanning:
  - Backend restarts repeatedly take 1-2 minutes because startup/crash recovery synchronously waits on slow `podman ps`.
  - Logs show repeated `podman ps -a --format json timed out after 30s` and crash recovery `podman ps stopped timed out after 60s`.
  - This is causing bad UX: "checking forever", false "no apps installed", intermittent "loading apps", stale statuses, slow lifecycle actions.
  - Next platform fix should move Podman/crash-recovery scans out of the service readiness path and keep last-known app state during scanner backoff.
- My Apps UI false negatives:
  - User reports apps sometimes do not show, "checking" forever, "loading apps" sometimes good but often false "no apps installed".
  - Required fix: do not show empty/no-apps while scanner or Podman is in backoff. Keep last known apps, show explicit loading/checking/stale state, and avoid destructive UI conclusions from scan timeout.
- Fedimint Guardian:
  - Current "starting/waiting for Bitcoin sync" is correct while Bitcoin is in IBD.
  - Need UI/status copy that explains waiting for Bitcoin sync, and later validate Guardian UI on `8175` once Bitcoin sync condition is satisfied.
- Progress UX:
  - User explicitly requires install/uninstall/start/stop/restart progress to be accurate and not look frozen.
  - Uninstall indicator currently poor/no progress. Must fix with clear phase updates and no stale notifications.
- Stale health notifications:
  - Must not persistently trigger on new logins/refreshes after no longer valid.
  - Some UI filtering was patched earlier, but keep this in regression backlog.
- Reboot survival:
  - Must pass repeated reboot validation after runtime/status fixes.
  - Acceptance target from user: minimum 3 clean consecutive reboots, preferably 5.

Backlog captured from user reports:

- Portainer:
  - Environment wizard error: `Dial unix /var/run/docker.sock: connect: connection refused`.
  - User noted Portainer does Podman orchestration well; compare/learn from its socket/control flow where useful.
- Fedimint:
  - Setup after guardian confirmation caused app not to launch.
  - Guardian launch was opening Gateway before; do not regress. Guardian and Gateway must remain distinct.
  - Gateway app disappeared from catalog before; it has been restored but keep in regression tests.
- Bitcoin Knots:
  - User saw missing app/launch issues and status bridge messages. UI now improved, but include in lifecycle/reboot regression.
- Home Assistant:
  - Setup has issues on this node and restart hung for a long time.
- Immich:
  - After setup user saw HTTP 500 stacktrace from `loadServerConfig`. Needs focused post-setup validation, not just "healthy".
- Filebrowser:
  - User saw erroneous stopped status while app was working. Status ordering was patched; keep in regression.
- Tailscale:
  - Launch must show local login/auth UI, not merely container running.
- BTCPay/Fedimint/Gateway/other Bitcoin-dependent apps:
  - Need clearer dependency wait states when Bitcoin RPC is slow/IBD.
- App catalog/developer readiness:
  - Apps should not require OS-level changes per app.
  - App migration document and developer guide must include this principle and current app packaging contract.
- Saleor:
  - Removed from catalog/server and should stay removed unless intentionally reintroduced.
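The "My Apps UI false negatives" fix listed under the critical blockers (keep last-known apps during scanner backoff, never conclude "no apps installed" from a scan timeout) can be sketched as a small state rule. This is a Python illustration under stated assumptions, not the real backend or frontend code: `AppView`, `apply_scan`, `ui_state`, and the state labels are hypothetical names, and a timed-out or backed-off scan is modelled as `None`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AppView:
    apps: Dict[str, str] = field(default_factory=dict)  # app id -> last known status
    stale: bool = False        # True while showing cached data after a failed scan
    has_scanned: bool = False  # True once at least one scan has succeeded


def apply_scan(view: AppView, scan: Optional[Dict[str, str]]) -> AppView:
    """Merge one scan result into the view; None means timeout or backoff."""
    if scan is None:
        # Keep the last-known apps and only mark the view as stale.
        return AppView(apps=dict(view.apps), stale=True, has_scanned=view.has_scanned)
    return AppView(apps=dict(scan), stale=False, has_scanned=True)


def ui_state(view: AppView) -> str:
    """Choose what the My Apps screen should show."""
    if not view.has_scanned:
        return "checking"  # no successful scan yet, so never claim "no apps"
    if view.stale:
        return "stale"     # show cached apps with an explicit stale marker
    if not view.apps:
        return "no-apps"   # only reachable after a successful, empty scan
    return "ready"
```

The point of the design is that `no-apps` is only reachable after a scan that actually succeeded and returned nothing; every timeout path preserves the previous list.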
Release readiness estimate:

- Prior estimate was 68%; after latest IndeeHub/Fedimint/BotFights/status progress, a realistic estimate is about 72%.
- Remaining 28% is not feature volume; it is systemic hardening: runtime control-plane responsiveness, truthful UI during Podman backoff, lifecycle/reboot gates, and focused app-specific post-setup validation.

Suggested immediate next steps after resuming:

1. Read this file and verify no background build/process is running.
2. Build/deploy the local-only HTTP-health tightening patch if not already deployed.
3. Patch backend startup/crash recovery so Podman scans are async/non-blocking and service readiness is not held hostage by `podman ps`.
4. Patch My Apps UI/data flow to preserve last-known apps during scanner backoff and never show false empty state while checking.
5. Run focused status checks on `.198`: fedimint, botfights, filebrowser, bitcoin-knots, immich, homeassistant, portainer.
6. Continue lifecycle gates only after the runtime scan/control path is stable enough that tests measure apps, not Podman timeouts.

Read this first if resuming in a fresh OpenCode session. Paste the resume prompt below verbatim.

---

## Resume Prompt

> Continue Archipelago release hardening from `docs/RESUME.md`. First read `docs/RESUME.md`, `docs/CONTAINER_LIFECYCLE_HANDOFF.md`, and `docs/MIGRATION_STATUS_REPORT.md`. The active validation node is `.198` at `192.168.1.198`; keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive for deterministic tests. Do not run Podman prune/image-list/system-df/image-exists/store-wide cleanup commands on `.198`; the store is known to hang under load. Preserve app data. Latest deployed backend hash is `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`. This includes the rootless Podman socket fix that treats `/run/user/1000/podman/podman.sock` as a socket bind, never a directory/data bind, prefers persistent `podman-archy-api.service` for Portainer, and changes absent cached `Stopping` entries to `Stopped`. User reported host reboot validation was not clean: many containers were SIGKILLed during reboot/shutdown and IndeeHub was stopped after boot. User also reported Immich, IndeeHub, Tailscale, Vaultwarden, Portainer, Home Assistant, Uptime Kuma, Nextcloud, Fedimint, and Botfights app lifecycle/launch/state issues. BTCPay was a false alarm: slow but fine. Current live validation: Vaultwarden full preserve-data lifecycle passed; Portainer full preserve-data lifecycle passed and its socket mount is no longer `//deleted`, but the user still needs to retry the Portainer environment wizard. Fedimint direct container state is running/healthy. IndeeHub remains P0: Podman still has a corrupted `indeedhub|Removing|97cf9fd13bb2` record; targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Treat post-reboot recovery, launch reachability, lifecycle correctness, progress indication, and rootless Podman socket-backed apps as active release blockers. IndeeHub is not passing unless `http://:7778/` is reachable and `/nostr-provider.js` is injected/served so the Nostr signer works as before. Tailscale is not passing unless launch presents the Tailscale login/auth UI. Before editing or touching `.198`, summarize current state and your exact first step.

---

## Current Goal

Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image.

Current status estimate: about 68% of the way to release. The app migration, manifest/catalog generation, and many local gates are advanced, and the latest pass fixed Vaultwarden plus the concrete Portainer stale socket mount. Live `.198` testing still shows the app platform is not production-bulletproof.
Remaining release blockers include app install/start truthfulness, frontend launch readiness gating, IndeeHub recovery and Nostr signer compatibility, Tailscale login-link launch, Home Assistant/Uptime Kuma/Nextcloud install/start failures, full lifecycle coverage, progress indication quality, app packaging documentation, refactor/dead-code cleanup, repeated reboot validation, final `.198` lifecycle confidence, and cutting/smoke-testing the `1.8-alpha` ISO.

## Release Readiness Estimate

- Estimated completion: `68%`.
- What is already achieved:
  - manifest-driven app migration is substantially advanced;
  - catalog metadata generation and strict drift checks are green;
  - local backend/frontend release gates have been green in prior passes;
  - broad non-destructive lifecycle has passed on the deployed release-candidate line before the reboot-gate finding;
  - Podman store-risk paths have been quarantined from known fragile broad image/store commands;
  - IndeeHub recovery now has local hardening in progress, including explicit Nostr signer validation in the lifecycle harness;
  - targeted Immich fixes now make dependency creation fail fast instead of silently reporting install success, and a follow-up readiness-gating patch is in progress so the app does not look launchable before HTTP readiness;
  - mobile and desktop app progress UX now has clearer install/remove phase labels in local changes;
  - Vaultwarden full preserve-data lifecycle passed on `.198` after the rootless socket fix;
  - Portainer full preserve-data lifecycle passed on `.198` after recreating the container against persistent `podman-archy-api.service`; its mount now points at `/podman/podman.sock`, not `/podman/podman.sock//deleted`.
- What must still pass before release:
  - deploy the current Immich readiness-gating backend and frontend progress UX changes;
  - focused Immich validation: install must stay in progress until `http://:2283/` returns HTTP success and app launch opens the frontend;
  - focused IndeeHub validation: recover stale/corrupt frontend container, prove `http://:7778/`, and prove `/nostr-provider.js` signer bridge is injected/served;
  - keep Vaultwarden in regression coverage even though the latest full lifecycle passed;
  - focused Tailscale validation: launch must present the local login/auth link/UI on `8240`;
  - focused Portainer validation: user must retry the environment wizard and confirm it can connect to the rootless Podman socket at `/var/run/docker.sock`;
  - full preserve-data lifecycle testing for representative migrated apps and key stacks: `install -> launch -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch`;
  - progress indication validation for install, uninstall, start, stop, restart, reboot recovery, and failed transitions; generic "running" or "removing" pills are not enough;
  - app packaging documentation gate: update `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` so they match the current manifest/runtime contract, include lifecycle/progress/reboot expectations, and clearly tell developers to use reusable manifest/orchestrator primitives instead of OS-level per-app hacks;
  - required refactor/remove-dead-code gate: after correctness is proven and before cutting `1.8-alpha`, remove obsolete app-specific paths, stale fallback metadata, duplicate lifecycle logic, unused scripts/hooks, and misleading compatibility shims; rerun lifecycle, launch, and release gates afterward;
  - broad non-destructive lifecycle after the deploy;
  - at least 3 consecutive clean post-fix reboot iterations, with broad lifecycle green after each;
  - preferably 5 consecutive clean reboot iterations before calling `1.8-alpha` production-release ready;
  - final local release gates after any additional fixes;
  - cut the `1.8-alpha` ISO;
  - boot/smoke-test the ISO enough to prove installability, backend startup, UI startup, app catalog availability, and at least a focused app lifecycle.

---

## Latest User Directive

> A lot were killed SIGKILL and one crashed, a couple stopped. Not sure if we did fixes but we should be a few reboot tests until 3/4/5 reboots are clean I guess, unless you advise a different passing criteria
>
> please do not forget that indeehub must work with the nostr signer just like before, I hope we haven't broken that or anything, please add to tasks
>
> also please note that immich and tailscale are not launching on the front-ends on their ports from the app screen, they say running/healthy but clearly aren't
>
> Also BTCPay is not running either
>
> no my bad, wrong server, BTCPay is fine just slow, please continue
>
> Yes, as shown in trying to complete the environment wizard in portainer you get "Failure Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
>
> please confirm there is a refactor/remove dead code release gate too

Passing criterion adopted: after the post-reboot recovery fix is deployed, require at least 3 consecutive clean reboots with broad non-destructive lifecycle green after each; prefer 5 consecutive clean reboots for production-release confidence. SIGKILL during shutdown is not automatically disqualifying if every managed app recovers and is reachable after boot, but any app left stopped/crashed/unreachable after boot is a failed reboot iteration.

IndeeHub validation must include the Nostr signer bridge, not just HTTP reachability. Immich, Tailscale, Vaultwarden, and Portainer are explicit blockers. Container `running`/`healthy` is not enough for Immich/Tailscale; direct/app-screen launch routes must work. Tailscale launch must present the login/auth UI. Vaultwarden must survive install/start/restart.
Portainer must be able to talk to the rootless Podman socket from inside its Docker-compatible socket bind. BTCPay is not currently a blocker; it was a wrong-server/slow-app false alarm.

There is also an explicit app packaging documentation gate and an explicit required refactor/remove-dead-code release gate. The packaging docs must be current enough for a third-party developer to package an app against the actual manifest/runtime contract. Do the refactor/dead-code cleanup after current correctness fixes are validated, not before, but do not cut `1.8-alpha` without it: remove stale per-app hacks, dead legacy code paths, duplicate lifecycle helpers, obsolete scripts/hooks, and misleading fallback metadata that would make `1.8-alpha` hard to maintain, then rerun the release gates.

---

## Live `.198` State

- Host: `192.168.1.198`.
- Password for lifecycle harness/RPC login: `password123`.
- Latest recorded `/usr/local/bin/archipelago` sha256: `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`.
- `archipelago.service`: active.
- `archipelago-doctor.timer`: inactive.
- `archipelago-reconcile.timer`: inactive.
- `/`: `65%` used, about `9.6G` free.
- `/var/lib/archipelago`: about `9-10%` used, about `370G` free.

Current active app blockers:

- Immich: after deploying hash `54d781...`, reinstall no longer immediately stops. Live test showed `immich_postgres` and `immich_redis` healthy and `immich_server` running; first launch had a readiness gap while Immich ran migrations/geodata import, then `2283` returned HTTP `200`. Local follow-up changes add an Immich server health check and require healthy status before install completes.
- IndeeHub: still blocked. Latest targeted check after hash `f1f5c61c...` showed a corrupted Podman ghost record: `indeedhub|Removing|97cf9fd13bb2`; `podman inspect indeedhub` fails with `layer not known`. Targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Must recover this record without broad store cleanup and then verify `http://:7778/` plus `/nostr-provider.js` for the Nostr signer.
- Home Assistant: user reports install completes then app stops. Treat as part of the migrated single-container/rootless Podman control-plane blocker.
- Uptime Kuma: user reports install takes ages then app stops. Live logs showed `package.install uptime-kuma failed: systemctl --user restart podman.socket exited exit status: 1`.
- Nextcloud: user reports same install-then-stop behavior. Live logs showed `package.install nextcloud failed: systemctl --user restart podman.socket exited exit status: 1`.
- Vaultwarden: latest full preserve-data lifecycle passed on hash `2a168489...`: install -> launch on `8082` -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch. Keep in regression tests because the user-visible transition/progress UX still looked like it was stuck while stopping.
- Portainer: latest full preserve-data lifecycle passed on hash `2a168489...`. The stale mount was confirmed as `/run/user/1000/podman/podman.sock//deleted`; after persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock` without `//deleted` and `http://127.0.0.1:9000/` returns HTTP `200`. User still needs to retry the environment wizard; do not close this blocker until the wizard no longer reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`.
- Tailscale: still blocked. Container running is not enough; launch must present local login/auth UI on `8240`.
- Fedimint: user reported it showed `stopping`; after hash `f1f5c61c...`, direct targeted state shows `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`. Keep in focused regression/launch checks.
- Botfights: newly reported stopped/broken. Direct probe after the report showed `botfights` running/healthy and `http://127.0.0.1:9100/` returning `200`; keep in focused lifecycle/launch validation after Podman control-plane recovery.
- Rootless Podman socket/control plane: improved but still a release-risk area. Fixed the concrete bug where `/run/user/1000/podman/podman.sock` could be created as a directory and the Portainer bind could point at a deleted socket inode. The current deployed backend prefers persistent `podman-archy-api.service`. Continue watching scanner timeouts and lifecycle behavior for Home Assistant, Uptime Kuma, Nextcloud, and Portainer.
- Stuck Podman records: P0 migration blocker. IndeeHub proves ordinary targeted `podman rm` fallbacks are not sufficient once a record is wedged in `Removing`.
- Progress UX: still blocked until live validation proves install/uninstall/start/stop/restart show phase detail and do not appear frozen.

Do not treat root disk pressure as a current blocker anymore. It was reduced from `99%` used with under `600M` free to about `65%` used with roughly `10G` free.

### 2026-06-10 Resume Continuation Checkpoint

- Deployed backend hash `7f58da80063f58574675256913ac9cddf131e65d8935015748a70adffc228f83` to `.198`.
- Previous live hash observed before deploy: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`.
- `archipelago.service` is active.
- `archipelago-doctor.timer` and `archipelago-reconcile.timer` are inactive.
- Added explicit release gates to this handoff:
  - app packaging docs must be updated before `1.8-alpha`;
  - refactor/remove-dead-code is required before `1.8-alpha`, after correctness validation and before final release gates/ISO.
- Local validation before deploy:
  - `bash -n tests/lifecycle/remote-lifecycle.sh` passed;
  - `cargo fmt --manifest-path core/Cargo.toml --all`;
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` passed (`45` tests);
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed;
  - `python3 scripts/check-app-catalog-drift.py --release --strict` passed;
  - `git diff --check` passed.
- Filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` appeared wedged in the tool PTY after compilation started; no local cargo/rustc worker remained visible. Treat that one filtered run as inconclusive, not failed.
- IndeeHub live validation after deploy:
  - `container-list` reports `indeedhub` running;
  - `container-health` reports `{"indeedhub":"healthy"}`;
  - `http://192.168.1.198:7778/` returns HTTP `200`;
  - `http://192.168.1.198:7778/nostr-provider.js` returns HTTP `200` and contains the Archipelago NIP-07/NIP-98 Nostr provider shim.
- Immich live validation after deploy:
  - `container-list` reports `immich` running;
  - direct `http://192.168.1.198:2283/` returns HTTP `200`;
  - `container-health` reported `{"immich":"unknown"}` during one focused check, so health truthfulness still needs follow-up even though launch HTTP is reachable.
- Tailscale live validation after deploy:
  - Found the live generated unit still used the stale catalog command `sleep 2; tailscale web...`; locally patched `app-catalog/catalog.json`, `neode-ui/public/catalog.json`, and `scripts/first-boot-containers.sh` to use the safer socket-wait startup, and copied the catalog to `/opt/archipelago/web-ui/catalog.json`.
  - App-scoped `package.restart tailscale` failed via RPC with `podman ps timed out while listing containers`.
  - Patched the live generated Tailscale `.container` unit to match the catalog fix and restarted only `tailscale.service`; the old container required SIGKILL during stop and Podman cleanup took roughly 2 minutes.
  - After restart, the Tailscale unit runs both `tailscaled` and `tailscale web`, `container-list` reports `tailscale` running, `container-health` reports `{"tailscale":"healthy"}`, and `http://192.168.1.198:8240/` returns HTTP `200` with Tailscale UI content.
  - Do not close Tailscale lifecycle as fully passing yet: launch UI is fixed, but stop/restart behavior exposed the rootless Podman cleanup/control-plane blocker.
- Other live probes after deploy:
  - `portainer` HTTP `9000` returns `200`; user still needs to retry the environment wizard.
  - `vaultwarden` HTTP `8082` returns `200` from localhost on `.198`.
  - `botfights` HTTP `9100` returns `200` from localhost on `.198`.
  - `btcpay-server` returned `302` then timed out under a short probe; continue treating BTCPay as slow rather than a current blocker unless a focused check fails.
  - `fedimint` port `8175` reset during probe while RPC showed `starting`; keep expected Bitcoin-sync wait-state/status copy in scope.
- Podman/control-plane remains the active systemic blocker:
  - logs still show `podman ps timed out`, `podman stats timed out`, scan backoff, and slow app cleanup;
  - do not start reboot-count validation until app stop/start/restart and post-reboot recovery are clean enough that tests measure app behavior instead of Podman timeouts.

---

## Latest Completed Work

### 2026-06-08 Rootless Socket, Vaultwarden, and Portainer Fix

- Built and deployed backend hash `2a168489737180b4088503dd93ef89c11da13e64790b324db8baea8ca05d3536` to `.198`; then built and deployed follow-up hash `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`; `archipelago.service` active, `archipelago-doctor.timer` inactive, `archipelago-reconcile.timer` inactive.
- Fixed rootless Podman socket bind handling in `core/archipelago/src/container/prod_orchestrator.rs`:
  - `/run/user/1000/podman/podman.sock` is skipped by bind-directory creation and data UID/chown prep;
  - socket bind mounts call explicit socket repair before other bind prep;
  - `ensure_user_podman_socket()` now prefers persistent `podman-archy-api.service` at `unix:///run/user/1000/podman/podman.sock`, falling back to `podman.socket` only if needed.
- Validated locally before deploy:
  - `cargo fmt --manifest-path core/Cargo.toml --all`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago absent_` (`4 passed`, including the stale absent `Stopping` regression tests).
  - `git diff --check`.
  - `timeout 900s cargo build --manifest-path core/Cargo.toml -p archipelago --release`.
- Vaultwarden full preserve-data lifecycle passed on `.198`:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=vaultwarden ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Portainer full preserve-data lifecycle passed on `.198`:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Portainer stale socket mount was confirmed and repaired:
  - Before recreate, mountinfo showed `/run/user/1000/podman/podman.sock//deleted -> /var/run/docker.sock`.
  - After persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock -> /var/run/docker.sock`, host socket exists, and Portainer UI returns HTTP `200`.
  - User still needs to retry the Portainer environment wizard; do not close the blocker until that wizard can connect.
- Direct state check after deploy:
  - `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`.
  - `indeedhub|Removing|97cf9fd13bb2`; `podman inspect` fails with `layer not known`; targeted removal/cleanup hangs and had to be killed.
  - `vaultwarden running true`.
  - `portainer running true`.

### 2026-06-08 Reboot Blocker Follow-up In Progress

- User reported host reboot validation was not clean: many containers were killed with SIGKILL during reboot/shutdown, one crashed, a couple stopped, and IndeeHub was stopped after boot.
- Treat this as a failed reboot gate. Do not call the release ready until post-fix reboot iterations are clean.
- Local changes made in this pass:
  - hardened `core/archipelago/src/container/prod_orchestrator.rs` IndeeHub stack recovery so reboot reconcile starts existing backend containers through a user scope when possible, waits for backend containers and API dependency DNS, starts/restarts the frontend, verifies it remains running, and verifies host port `7778`;
  - hardened `core/container/src/manifest.rs` package validation for app IDs, ports, env keys, capabilities, devices, volume sources/options, network policy, and reviewed host-bind exceptions while preserving all current real manifests;
  - updated `tests/lifecycle/remote-lifecycle.sh` so IndeeHub launch validation requires `/nostr-provider.js` to be injected into the HTML and served from the app, preserving the Nostr signer requirement.
- Deployed follow-up backend hash `4108ca146b482c028ae8d7c4bec314b71ef3412f15efd2e61846a2c345b36aba` to `.198`; service active, timers inactive. Focused audit still showed:
  - `indeedhub` stuck `stopping` and unhealthy;
  - `immich` stopped/unhealthy;
  - `tailscale` running/healthy but direct launch `8240` returned `000`;
  - `vaultwarden` health RPC errored and launch `8082` returned `000`;
  - `btcpay-server` was fine (`23000` returned HTTP 200); user confirmed BTCPay was a wrong-server/slow-app false alarm.
- Targeted diagnostics on `.198` found:
  - IndeeHub frontend Podman state `removing`/`stopping` with no `7778` listener;
  - Immich server stopped, Redis exited, Postgres unhealthy, no `2283` listener;
  - Tailscale listener process existed on `8240`, but direct HTTP still returned `000`; logs show Tailscale is `NeedsLogin`/`WantRunning=false`, so launch must present the login/auth UI rather than a generic daemon endpoint;
  - Vaultwarden container was absent; public `package.start vaultwarden` failed on stale/refused Podman socket before local fixes;
  - Portainer launches but the environment wizard reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`, confirming socket-backed apps are not release-ready.
- Local follow-up fixes after those diagnostics:
  - `core/container/src/runtime.rs` now tries `podman rm -f --time 0`, targeted `podman container cleanup`, and another `rm -f` when normal forced remove fails;
  - `ensure_user_podman_socket()` now verifies the rootless Podman socket accepts Unix connections, not just that the socket path exists;
  - IndeeHub readiness now falls back to platform-managed network-alias presence when `getent` inside the API image cannot prove DNS;
  - lifecycle harness now requires Tailscale launch content to look like login/auth UI.
- Local validation passed after those fixes:
  - `cargo fmt --manifest-path core/Cargo.toml --all`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
  - `bash -n tests/lifecycle/remote-lifecycle.sh`.
  - `git diff --check`.
- Deployed second follow-up backend hash `06420c0377fff650a2bf3211f13c1e0754bf8df81345b8485f4c9a30cb552439` to `.198`; service active, timers inactive.
- Public RPC recovery attempts on hash `06420c...`:
  - `package.restart indeedhub` still failed;
  - `package.start immich` accepted async start but app remained `starting` with no `2283` launch;
  - `package.start vaultwarden` accepted async start but no `8082` launch appeared;
  - `package.restart portainer` failed;
  - `package.restart tailscale` accepted async restart but no `8240` launch UI appeared.
- Latest focused probe after hash `06420c...`:
  - `tailscale` `running`, `http://192.168.1.198:8240/` returns `000`;
  - `immich` `starting`, `http://192.168.1.198:2283/` returns `000`;
  - `indeedhub` `stopping`, `http://192.168.1.198:7778/` returns `000`;
  - `portainer` `running`, `http://192.168.1.198:9000/` returns `000`;
  - `vaultwarden` absent/not listed, `http://192.168.1.198:8082/` returns `000`.
- Conclusion: do not proceed to reboot testing or ISO work. The rootless Podman control-plane/socket health and stuck container-state recovery need a deeper platform fix before lifecycle/reboot gates are meaningful.
- Local validation passed so far:
  - `cargo fmt --manifest-path core/Cargo.toml --all`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `bash -n tests/lifecycle/remote-lifecycle.sh`.
  - `git diff --check`.
  - A filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` compiled and ran the matching existing IndeedHub test (`1 passed`); it did not exercise the new reboot recovery branch because there is no direct unit for that path yet.
- Next steps:
  - deploy the new backend only after approval;
  - verify focused `indeedhub,immich,tailscale,vaultwarden,portainer` lifecycle/launch, including IndeeHub Nostr provider check and Portainer socket usability;
  - run reboot validation iterations on `.198` only after explicit approval;
  - pass threshold: 3 consecutive clean post-fix reboots minimum, 5 preferred for production-release confidence;
  - cut and smoke-test the `1.8-alpha` ISO after reboot validation is green.

### Local Release Gate Completion After `.198` App Recovery

- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands.
- Fixed scanner backoff/in-flight skip behavior: skipped scans now bump `scan_tick`, so install/update success paths that kicked the scanner do not wait for their timeout when Podman scan backoff is active.
- Fixed stale crash-recovery unit tests after `should_auto_start_stopped_container` gained the `include_stack_members` flag; coverage now asserts generic boot recovery skips stack helpers while stack recovery can include them.
- Fixed local runtime manifest-port lookup so tests and local backend runs can find workspace `apps/*/manifest.yml` via `CARGO_MANIFEST_DIR`; this covers new public apps such as PhotoPrism.
- Fixed journal usage parsing for real `journalctl --disk-usage` compact output such as `463.9M`.
- Fixed boot-reconciler cadence tests so `without_companion_stage()` also bypasses the global crash-recovery wait gate in tests; production still waits for recovery completion.
- Verified catalog generation is idempotent: `python3 scripts/generate-app-catalog.py` reported `updated 0 fields` for both catalogs.
- Validation passed locally:
  - `cargo fmt --manifest-path core/Cargo.toml --all`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`43 passed`).
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security`.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security` (`12 security tests passed`; performance has no tests).
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
  - `git diff --check`.
  - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
- The remaining gated item is host reboot validation on `.198`, to be run only if explicitly approved.

### Frontend Release Gate Completion

- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands.
- Found and fixed a mobile app-launch regression in `neode-ui/src/stores/appLauncher.ts`:
  - desktop-only new-tab apps still open directly on desktop;
  - mobile now routes those apps through the app-session route instead of escaping Archipelago in a new browser tab;
  - `dashboardReturnPath()` now tolerates tests/minimal router mocks with no `currentRoute`.
- Updated frontend tests to match current desktop new-tab policy and mobile in-app routing behavior.
- Fixed `AppIconGrid` test setup so it shares the mounted Pinia instance and mocks credential lookup before launch.
- Fixed onboarding retry test timing to cover the actual exponential retry budget.
- Validation passed locally:
  - `npm run type-check` from `neode-ui`.
  - `npm test` from `neode-ui` (`548 passed`).
  - `npm run build` from `neode-ui`.
  - `python3 scripts/generate-app-catalog.py` (`updated 0 fields`).
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
  - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
  - `git diff --check`.
- Local caveat: `npm ci` is currently blocked because existing `neode-ui/node_modules/@alloc` entries are owned by `root:root`. Existing installed modules were sufficient for type-check, tests, and build. Do not delete or chown this tree without explicit approval.

### Fedimint/File Browser, Nostr/NPM, and IndeedHub Recovery

- Built and deployed backend hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de` to `.198`.
- Fixed UI-facing package health for reachable running apps whose Podman health stayed `starting`, `unhealthy`, or a numeric exit value while the launch port was reachable.
- Confirmed Fedimint Guardian and File Browser were actually reachable; their `server.get-state` package-data now reports healthy instead of “starting up”.
- Fixed Nostr relay port conflict by moving `apps/nostr-rs-relay/manifest.yml` host port from `8081` to `18081`.
- Recovered Nginx Proxy Manager admin launch on `8081`; Nostr now launches on `18081` and no longer captures the NPM launch port.
- Hardened legacy package install so scoped web-app installs use `podman create` plus `systemd-run --user --scope podman start`, avoiding backend-cgroup coupling without hanging the install RPC.
- Recovered IndeedHub without deleting data: started the stopped `indeedhub-minio` dependency, repaired frontend reachability, and verified `7778` returns the app.
- Validation passed:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - Focused lifecycle for `indeedhub,nginx-proxy-manager,nostr-rs-relay,fedimint,filebrowser`.
  - Direct launch checks returned HTTP `200` for `7778`, `8081`, `18081`, `8175`, and `8083`.
  - Broad non-destructive lifecycle passed on live hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.6G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.

### Deployed Podman Store-Risk Cleanup

- Reviewed release-relevant Podman store/image call sites without running broad Podman store/image commands on `.198`.
- Bounded stack installer image pulls and manual package update image pulls with `kill_on_drop` and 600s timeouts.
- Deployed backend hash `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4` to `.198` with the previous backend backed up under `/usr/local/bin/archipelago.backup-20260608-store-risk-*`.
- Validation passed:
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - `cargo fmt` from `core/`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - Focused post-deploy lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint,immich,indeedhub,photoprism ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - Broad post-deploy non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.8G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.
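
The bounded image pulls in this section rely on `kill_on_drop` and a 600s timeout in the Rust backend, whose code is not reproduced in this file. As a rough illustration of the same bound-and-kill idea, here is a minimal Python sketch. `run_bounded` is a hypothetical helper name, not something from the repo; it relies only on the standard library's `subprocess.run` timeout, which kills the child when the deadline passes.

```python
import subprocess
import sys


def run_bounded(argv, timeout_s):
    """Run argv with a hard deadline (hypothetical helper, not repo code).

    Returns (exit_code, stdout). exit_code is None when the deadline passed;
    subprocess.run kills the child before raising TimeoutExpired, so a hung
    command does not outlive this call.
    """
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return done.returncode, done.stdout


# Stand-in for a fast command: a child interpreter that prints and exits.
fast = [sys.executable, "-c", "print('ok')"]
exit_code, output = run_bounded(fast, timeout_s=10)
```

Under these assumptions a pull would be wrapped as `run_bounded(["podman", "pull", image], 600)`, where `image` is a placeholder. Returning `None` for the exit code keeps "timed out" distinguishable from "failed", which matters when a timeout should be treated as unknown rather than as an error.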
### Release Candidate Backend Restart Validation

- Built and deployed backend hash `e28affdf4c1d3cecbe4c14b0439b53d977ed20873c966c288116601d49dac732` to `.198`.
- Bounded additional Podman store/control probes so image and stack health checks fail fast instead of hanging under `.198` Podman store/socket load.
- Fixed Fedimint health reporting: if Podman health remains `starting` but the app endpoint is reachable, `container-health` can use the reachable cached app fallback.
- Fixed package start/restart fallback for runtime web apps by using `systemd-run --user --scope` for `podman start`, then falling back to direct bounded `podman start`.
- Recovered live Immich without data loss:
  - `immich_server` had exited because `/usr/src/app/upload/encoded-video/.immich` could not be written.
  - Correct live ownership is still `podman unshare chown -R 0:0 /var/lib/archipelago/immich`, which maps to host UID/GID `1000:1000` and container root ownership.
  - A temporary `1000:1000` in-container ownership experiment was reverted because Immich's storage check writes as container root.
- Validation passed on latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
  - `npm run build` from `neode-ui`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - Backend restart validation followed by focused `fedimint,immich,indeedhub,photoprism` lifecycle passed.
  - Post-restart broad non-destructive lifecycle passed.
- Remaining gate before calling this a release: host reboot validation, if approved.

### IndeedHub and Immich Lifecycle Recovery

- Built and deployed backend hash `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9` to `.198`.
- IndeedHub focused audit is green after sequencing network alias repair immediately before frontend startup, after dependencies are running.
- Fedimint and NetBird focused audits are green; they were not current blockers after rerun.
- Immich was the broad-audit blocker and is now green:
  - dependency readiness accepts healthy Podman health state for `immich_postgres` and `immich_redis` before falling back to slower exec probes;
  - `immich_server` startup repairs `/var/lib/archipelago/immich` ownership with `podman unshare chown -R 0:0`, preserving upload data while matching the current rootless container user mapping;
  - this fixed the observed `EACCES` on `/usr/src/app/upload/encoded-video/.immich`.
- Validation passed on latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=netbird ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Residual risk remains: `.198` still intermittently logs `podman ps -a --format json timed out after 30s` and transient Bitcoin RPC timeouts under load. Continue avoiding store-wide Podman commands.

### Release Refactor Cleanup

- Built and deployed backend hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` to `.198`.
- Legacy package runtime host-port cleanup/repair now derives host ports from manifests when available.
- Hardcoded ports remain only as fallback for legacy/non-manifest apps and extra stale-port cleanup compatibility.
- Removed the duplicate Gitea-specific stale port cleanup helper.
- Validation passed on latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Added focused runtime-host-port tests, but local `cargo test --manifest-path ../../core/Cargo.toml -p archipelago runtime_host_ports` did not finish within 5 minutes during compilation.

### Catalog Metadata Generation

- Added `scripts/generate-app-catalog.py` to sync manifest-owned metadata into `app-catalog/catalog.json` and `neode-ui/public/catalog.json`.
- The generator updates fields that manifests already own: `title`, `version`, `description`, `dockerImage`, `category`, `tier`, `icon`, and `repoUrl`.
- The catalog still preserves catalog-only fields such as `author`, `requires`, `featured`, and rich `containerConfig` notes.
- Corrected stale manifest metadata for BotFights, IndeeHub, Gitea, LND, ElectrumX, Fedimint, and Mempool before generation.
- Release catalog drift is now zero:
  - `python3 scripts/check-app-catalog-drift.py --release --strict` reports `metadata_drift=0`, `missing_catalog=0`, `missing_manifests=0`.
- Validation passed:
  - `jq empty app-catalog/catalog.json neode-ui/public/catalog.json`.
  - canonical and UI public catalogs match byte-for-byte.
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container`.
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `npm run build` from `neode-ui`.

### Podman Store-Risk Hardening

- Built and deployed backend hash `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2` to `.198`.
- Fresh local-build installs now treat `podman image exists ` failure/timeout as "unknown/missing" and rebuild the local image instead of failing the lifecycle operation.
- This keeps local image store checks from being release-blocking while preserving bounded runtime timeouts and matching the existing drift-restart behavior.
- Validation passed on the latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Added focused unit test coverage for the image-exists failure behavior, but local `cargo test --manifest-path core/Cargo.toml -p archipelago install_fresh_builds_when_image_exists_check_fails` did not complete within 15 minutes during compilation.

### Container Health Fallback and Broad Lifecycle Green

- Built and deployed backend hash `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36` to `.198`.
- Fixed `container-health` broad lifecycle timeout behavior:
  - `cached_reachable_health()` now parses ports from URLs with trailing slashes correctly, such as `http://localhost:2342/`.
  - The local TCP fallback now covers the lifecycle web app ports, including PhotoPrism, BTCPay, LND UI, Mempool, Electrum, Fedimint, Gitea, IndeedHub, Ollama, Vaultwarden, Tailscale, and others.
  - Cached-running apps with reachable local TCP listeners can report `healthy` without depending on flaky Podman health/inspect calls.
- Validation passed on the latest hash:
  - `cargo check --manifest-path core/Cargo.toml -p archipelago`.
  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.

### Generic Host-Port Health Checkpoint

- Built and deployed backend hash `3912b900c376b6c28bf5453640cae82135f67d7e0f984b8adcc78064b924143b` to `.198`.
- Confirmed objective remains: app behavior should be manifest/platform-primitive owned, not OS-image or per-app backend hack owned.
- Broad lifecycle on `d21202cd...` failed only on Uptime Kuma briefly showing `stopping` during listener repair; it recovered afterward.
- Fixed stale transitional merge: `Stopping -> Running` recovers when no user-stop marker exists; user-initiated stops still keep `Stopping`.
- Health monitor now derives required host TCP ports from Podman JSON `Ports` and marks running containers unhealthy when declared host listeners are missing.
- This is generic host-port health, not an app-specific mapping.
- After deploying `3912b900...`, Uptime Kuma recovered `3002` and returned HTTP `302` after backend restart.
- Jellyfin still needs follow-up: Podman reports `jellyfin Up ... (healthy)` with `0.0.0.0:8096->8096/tcp`, but `ss` shows no `8096` listener and `curl http://192.168.1.198:8096/` fails.
- Follow-up on `be95ea...` resolved the broad lifecycle timeout by hardening `container-health` fallback behavior.

### Stale State and Jellyfin Pasta Listener Hardening

- Built and deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` to `.198`.
- `container-list` now overlays cached `exited` entries with targeted live state so scanner backoff does not leave lifecycle/UI reads stuck on stale `exited` after recovery.
- `container-health` now has a bounded cached-running plus local TCP reachability fallback for web apps, reducing dependency on slow/hung Podman inspect paths for health reads.
- Jellyfin was added to legacy runtime host-port repair for pasta listener `8096`.
- `package.restart jellyfin` still exposed a real Podman socket/runtime blocker after stopping the container: `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`.
- `package.start jellyfin` recovered the app afterward; `jellyfin` became `Up ... (healthy)`, `8096` had a `pasta.avx2` listener, and `http://192.168.1.198:8096/` returned HTTP `302`.
- Focused lifecycle passed on the latest hash:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Release catalog drift check remains: `missing_catalog=0`, `missing_manifests=0`, `metadata_drift=35`.

### Expanded Cleanup and Store-Safe Uninstall

- Built and deployed backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b` to `.198`.
- Expanded `system.disk-cleanup` to remove old rollback artifacts while keeping newest rollback points:
  - `/usr/local/bin/archipelago.backup-*` newest 3.
  - legacy `/usr/local/bin/archipelago.bak*` newest 3.
  - `/usr/local/bin/archipelago.before-*` newest 3 as part of legacy backend cleanup.
  - `/opt/archipelago/web-ui.bak*` newest 3.
  - `/opt/archipelago/web-ui.old` included as web UI rollback cleanup.
- Live `system.disk-cleanup` reclaimed `10.3 GB`:
  - `Removed old backend backups: 41.6 MB freed`.
  - `Removed old legacy backend backups: 3.6 GB freed`.
  - `Removed old web UI backups: 6.6 GB freed`.
  - `Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes`.
  - `/usr/local/bin` dropped to about `336M`.
  - `/opt/archipelago` dropped to about `1.1G`.
- Removed global `podman volume prune -f` from uninstall. Uninstall now logs a skip and still removes explicit app data when `preserve_data=false`.

### Startup Scan and Uptime Kuma Fixes

- Startup `adopt_existing()` is bounded with a 35s timeout.
- Initial container scan seeds the same 300s Podman scan backoff used by periodic scans.
- Legacy pasta restart paths use scoped `podman restart` instead of stop+start.
- Uptime Kuma was repaired:
  - Before: container internally healthy on `127.0.0.1:3001`, but host `3002` had no pasta listener.
  - After: `package.restart uptime-kuma` returns `{"status":"restarted"}` and `http://192.168.1.198:3002/` returns HTTP `302`.

### Cleanup and Catalog Work Already Done

- `system.disk-cleanup` intentionally skips Podman image/volume prune.
- `nostr-rs-relay` was added to both catalog surfaces.
- `scripts/check-app-catalog-drift.py --release --strict` reports zero missing catalog/manifest entries and zero metadata drift after catalog generation.
- Meshtastic `app.files` live behavior was validated: deleting `/var/lib/archipelago/meshtastic/config.yaml` and restarting recreated it from the manifest.

---

## Verification Already Run

- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed for the currently deployed release-candidate line.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed for the currently deployed release-candidate line.
- Broad lifecycle on current hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` passed:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Targeted PhotoPrism audit on current hash passed:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`
- Focused lifecycle on current hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` passed:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Live cleanup RPC passed and reclaimed `10.3 GB`.
- Focused lifecycle after expanded cleanup passed:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Before the expanded cleanup pass, broad lifecycle also passed on hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28`:
  - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Direct app checks after latest cleanup passed:
  - `http://192.168.1.198:3002/` -> HTTP `302`.
  - `http://192.168.1.198:8096/` -> HTTP `302` after Jellyfin recovery/start.
  - `http://192.168.1.198:8083/` -> HTTP `404` on `/`, which is expected for Filebrowser root probe behavior used here.
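
The direct app checks above accept a different status code per app (a `302` redirect, or `404` on the Filebrowser root), so a table-driven check can make that expectation explicit. This is only a sketch: `EXPECTED_STATUS` and `probe_passes` are hypothetical names, the table is copied from the three checks listed above, and the status code is assumed to come from an external probe such as `curl -s -o /dev/null -w '%{http_code}' <url>`.

```python
from urllib.parse import urlsplit

# Expected HTTP status per host port, copied from the direct app checks above.
EXPECTED_STATUS = {3002: 302, 8096: 302, 8083: 404}


def probe_passes(url, status):
    """Return True when `status` matches the expectation for the URL's port.

    `status` may be an int or the string curl prints; curl's `000`
    (no HTTP response) parses to 0 and therefore never passes.
    Ports without an entry in the table are treated as failures.
    """
    expected = EXPECTED_STATUS.get(urlsplit(url).port)
    return expected is not None and int(status) == expected


filebrowser_ok = probe_passes("http://192.168.1.198:8083/", "404")
```

Keying the table by port rather than by app name keeps the sketch independent of how the lifecycle harness names apps, which this file does not specify.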
### Test Caveat

- Earlier local focused test commands timed out during first-time test binary compilation, but after compilation completed the full backend test target passed: `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
- Remaining workspace packages also pass checks/tests: `archipelago-container`, `archipelago-performance`, and `archipelago-security`.

---

## Critical Constraints

- Preserve app data.
- `.198` is the active validation node.
- Current live backend hash on `.198`: `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
- Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive unless explicitly testing them.
- Do not run destructive git commands.
- Do not run Podman store-wide cleanup or broad image/store commands on `.198` without a mitigation plan:
  - Avoid `podman system df`.
  - Avoid `podman image list` / `podman image ls`.
  - Avoid broad `podman image exists` loops.
  - Avoid `podman image prune` and `podman volume prune`.
- Podman store commands can hang and block app health under current `.198` load.
- Latest local mitigation: Rust release image-existence probes now use bounded targeted `podman image inspect` instead of `podman image exists` or `podman images -q`.

---

## Current Remaining Blockers

1. Podman socket/store health remains unresolved.
   - Need quarantine/mitigation strategy rather than store-wide commands in release paths.
   - Current release paths avoid prune and broad image-list/existence commands; orchestrator, companion, and legacy install image checks now use bounded `podman image inspect`.
   - Latest concrete failure remains historical: `package.restart jellyfin` stopped the container but failed to complete because Podman reported socket permission/runtime failure. `package.start jellyfin` recovered afterward.
   - Latest deployed hash still logged one initial `podman ps -a --format json` scan timeout/backoff, but focused and broad non-destructive lifecycle validation passed.
2. Release code-review/refactor gate is still open.
   - Reduce remaining app-specific Rust/OS branches where possible.
   - Review scanner, health, reconcile, and install/update paths for performance and store-risk.
   - Clean up dead transitional paths.
3. Clean release branch hygiene is not done.
   - Worktree is very dirty with many modified and untracked files.
   - Do not commit unless explicitly asked.
4. Full production validation still needed.
   - Broad non-destructive lifecycle is green on live hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
   - Backend restart validation has passed.
   - Run host reboot validation if approved.
   - Run selected full lifecycle tests for critical apps if time allows.

---

## Files Changed In Latest Pass

- `core/container/src/runtime.rs`
  - Changed Podman runtime `image_exists()` from `podman image exists` to a bounded targeted `podman image inspect` local-storage probe.
- `core/archipelago/src/api/rpc/package/install.rs`
  - Replaced legacy `podman images -q` local fallback and post-pull verification checks with bounded targeted `podman image inspect`.
- `core/archipelago/src/container/companion.rs`
  - Changed companion image existence checks from `podman image exists` to `podman image inspect`.
- `core/archipelago/src/container/prod_orchestrator.rs`
  - Updated image-existence failure test fixture wording for the new `image inspect` probe.
- Validation for latest local mitigation:
  - `cargo fmt --all --check` passed.
  - `cargo check -p archipelago-container` passed.
  - `cargo check -p archipelago` passed.
  - `CARGO_INCREMENTAL=0 cargo check -p archipelago --tests` passed.
  - `cargo test -p archipelago-container` passed (`43` tests).
  - `git diff --check -- ` passed.
  - Filtered `cargo test -p archipelago install_fresh_build` did not complete: one run hit a `rust-lld` undefined hidden symbol artifact/link failure after concurrent Cargo jobs; the sequential `CARGO_INCREMENTAL=0` rerun exceeded 10 minutes during compile, but test-target compilation passed afterward.
- `core/archipelago/src/api/rpc/system/handlers.rs`
  - Calls expanded rollback cleanup helpers and reports reclaimed bytes.
- `core/archipelago/src/api/rpc/system/mod.rs`
  - Added cleanup helpers for legacy backend backups and web UI rollback backups.
  - Uses size accounting for directories before removal.
  - Keeps newest rollback artifacts instead of deleting all.
- `core/archipelago/src/api/rpc/package/runtime.rs`
  - Skips global `podman volume prune -f` during uninstall.
  - Adds Jellyfin `8096` to runtime host-port/pasta cleanup repair.
  - Derives legacy runtime host-port cleanup/repair ports from manifests.
  - Keeps compatibility fallback ports for legacy/non-manifest apps and removes duplicate Gitea stale-port cleanup code.
- `core/archipelago/src/api/rpc/container.rs`
  - Adds stale cached `exited` refresh for `container-list`.
  - Adds cached-running plus local TCP reachability fallback for `container-health`.
  - Fixes fallback URL port parsing and expands lifecycle web app port coverage.
- `core/archipelago/src/container/prod_orchestrator.rs`
  - Rebuilds local-build images when `image_exists` fails/times out instead of failing fresh install.
  - Adds focused unit test coverage for that behavior.
- `scripts/generate-app-catalog.py`
  - Generates/syncs public catalog metadata from manifest-owned fields.
- `app-catalog/catalog.json` and `neode-ui/public/catalog.json`
  - Generated from current manifests; files match byte-for-byte.
- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
  - Added latest deployment, cleanup, validation, and residual-risk checkpoint.
- `docs/MIGRATION_STATUS_REPORT.md`
  - Updated current hash, root disk state, and remaining blockers.
- `docs/RESUME.md`
  - This file, replacing stale April migration resume content.

---

## Suggested Next Steps

1. Re-read the three docs:
   - `docs/RESUME.md`
   - `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
   - `docs/MIGRATION_STATUS_REPORT.md`
2. Verify latest `.198` state:
   - `ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 'df -h / /var/lib/archipelago; systemctl is-active archipelago.service; systemctl is-active archipelago-doctor.timer 2>/dev/null || true; systemctl is-active archipelago-reconcile.timer 2>/dev/null || true; sha256sum /usr/local/bin/archipelago'`
3. Start Podman-store-risk review:
   - Search for image/store operations: `image_exists`, `podman image`, `podman system`, `podman prune`, `volume prune`.
   - Prefer targeted container status/API calls with timeouts.
   - Avoid new broad store commands.
4. Continue release code-review/refactor cleanup.
5. If approved, run backend-restart validation and then host-reboot validation.

---

## Current Release Readiness Estimate

- Credible release candidate: closer now, roughly `87-91%`.
- Production-quality release developers will love: still closer to `73-79%`.

The biggest improvement in the latest pass is that broad lifecycle is green again on the latest backend. The biggest remaining technical risk is Podman store/socket health.
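
For reference during the Podman-store-risk review, the "bounded targeted `podman image inspect`" probe described under Files Changed could look roughly like the sketch below. This is not the code in `core/container/src/runtime.rs`; the names `Bounded`, `run_bounded`, and `image_exists_bounded`, the 20 ms poll interval, and the `--format '{{.Id}}'` argument are all assumptions made for illustration. It uses only the Rust standard library and assumes a blocking (non-async) caller.

```rust
use std::io;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of a time-bounded command run (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Bounded {
    /// The command exited before the deadline; `true` means exit status 0.
    Finished(bool),
    /// The deadline passed first; the child was killed and reaped.
    TimedOut,
}

/// Run `program args...` with stdio discarded and give up after `timeout`.
/// Returns `Err` only if the process could not be spawned or polled.
pub fn run_bounded(program: &str, args: &[&str], timeout: Duration) -> io::Result<Bounded> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // Non-blocking check: `Some(status)` once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Bounded::Finished(status.success()));
        }
        if Instant::now() >= deadline {
            // Best effort: the child may exit between the poll and the kill.
            let _ = child.kill();
            let _ = child.wait();
            return Ok(Bounded::TimedOut);
        }
        thread::sleep(Duration::from_millis(20));
    }
}

/// Hypothetical helper: probe a single image in local storage instead of
/// listing or scanning the whole store.
pub fn image_exists_bounded(image: &str, timeout: Duration) -> io::Result<Bounded> {
    run_bounded(
        "podman",
        &["image", "inspect", "--format", "{{.Id}}", image],
        timeout,
    )
}
```

With a shape like this, a caller can treat `Bounded::TimedOut` as "unknown" rather than "absent", which fits the noted behavior of rebuilding local-build images when `image_exists` fails or times out instead of failing a fresh install.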