# RESUME - Archipelago Release Hardening on `.198`

Last updated: 2026-06-10

## 2026-06-10 05:48 EDT Active Session Checkpoint

Work resumed from `docs/NEXT_TERMINAL_HANDOFF.md`. No `.198` host actions have
been run yet in this resumed pass.

Current first steps:

1. Rerun `git diff --check`.
2. Rerun the focused Rust image-version test for the Nextcloud false-update
   helper.
3. If those are clean, inspect and continue the rootless Podman
   lifecycle/scanner-backoff work before any `.198` validation.

Progress:

- `git diff --check` passed.
- Focused Rust image-version test in `/tmp/archy-cargo-image-versions` remains
  inconclusive: the tool PTY stayed open after compile output stopped, with no
  active `cargo`, `rustc`, or linker process visible.
- Bounded retry of the focused image-version test using the normal workspace
  target also timed out: `timeout 300s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests`
  exited `124` after compiling the `archipelago` test target without reaching
  test output. Nextcloud false-update validation is still not closed.
- Local code change in progress: single-orchestrator `package.stop` now returns
  immediately with `stopping` and runs the orchestrator stop in the background,
  instead of blocking the RPC/UI while Podman cleanup happens.
- `cargo fmt --manifest-path core/Cargo.toml --all --check` passed.
- Compile check passed in `/tmp/archy-cargo-runtime-check`:
  `cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago`.
- `git diff --check` passed after the stop-path edit and doc updates.
- Lower-level stop path inspection: Quadlet service stop is already bounded
  with kill/reset recovery, and the runtime fallback treats already-absent
  containers as success. No extra lower-level stop change was made.
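
The non-blocking `package.stop` behaviour noted in the progress list can be sketched as follows. This is a minimal Python illustration of the pattern only; the real change lives in the Rust backend, and `package_stop`, `stop_fn`, and the `states` dict are hypothetical names, not the project's API.

```python
import threading


def package_stop(app_id, stop_fn, states):
    """Mark the app as stopping, run the real stop in the background, return at once.

    `stop_fn` stands in for the slow orchestrator stop (Podman/Quadlet cleanup).
    `states` is a plain dict used here as a stand-in for the cached app state.
    """
    states[app_id] = "stopping"

    def worker():
        try:
            stop_fn(app_id)
            states[app_id] = "stopped"
        except Exception as exc:
            # Surface the failure instead of leaving the app in "stopping" forever.
            states[app_id] = f"stop-failed: {exc}"

    thread = threading.Thread(target=worker, name=f"stop-{app_id}", daemon=True)
    thread.start()
    # The RPC caller gets an immediate answer; the UI polls state afterwards.
    return {"status": "stopping"}, thread
```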

## 2026-06-10 05:30 EDT Pause Checkpoint

User paused to switch machines. Continue from `/home/archipelago/Projects/archy`
and read `docs/NEXT_TERMINAL_HANDOFF.md` plus
`docs/1.8-alpha-improvements-tracker.md` first. No dev server or validation
command should be intentionally left running from this checkpoint.

Latest local-only tracker progress:

- Done: uninstall preserve/delete-data choice, companion APK QR/download modal,
  App Details setup-instructions card, dead/coming-soon UI cleanup via Spotlight
  AI placeholder removal.
- In progress: Fleet/tab loading polish, Bitcoin receive-address readiness
  states, no-registration credentials inventory, Nextcloud false-update fix.
- New credential fallback: PhotoPrism now shows manifest-backed credentials
  (`admin` / `archipelago`) when backend credentials are empty. Grafana was not
  added because `GRAFANA_ADMIN_PASSWORD` is not resolved to a known repo
  default/secret.
- Nextcloud local fix: manifest/catalog/UI metadata now points at `nextcloud:29`
  and image update detection ignores registry-host-only changes. The catalog
  drift check passed, but the focused backend Rust validation did not complete
  cleanly. The first `cargo test -p archipelago container::image_versions::tests`
  run from `core/` hit a Rust linker/incremental artifact failure while `/tmp`
  was full; a non-incremental retry was killed after running too long. Old
  `/tmp/archy-cargo-*` build-cache directories were removed and `/tmp` recovered.

Latest local validations:

- `npm run type-check` passed after the PhotoPrism credential fallback.
- `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed.
- `git diff --check` passed after the Spotlight cleanup and should be rerun
  after resuming.
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed during
  the Nextcloud pass.

Immediate next steps:

1. Rerun `git diff --check`.
2. Rerun `cargo test -p archipelago container::image_versions::tests` from
   `core/` when ready to validate the Nextcloud update-detection helper.
3. Continue the `docs/1.8-alpha-improvements-tracker.md` rows that remain
   `todo` or `in-progress`, avoiding host-gated items until `.198` access is
   intentionally resumed.
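
For orientation while the Rust test stays inconclusive, the "ignore registry-host-only changes" rule from the Nextcloud fix can be sketched like this. It is a Python illustration under assumptions, not the code in `container::image_versions`: `split_registry` and `is_real_update` are hypothetical names, and the host test is the usual Docker-style heuristic (the first path component counts as a registry host only if it contains `.` or `:` or is `localhost`). The sketch does not normalise Docker Hub's implicit `library/` prefix.

```python
def split_registry(image_ref):
    """Split 'host[:port]/path:tag' into (host, 'path:tag'); host is '' when absent."""
    first, sep, rest = image_ref.partition("/")
    # Heuristic: only treat the first component as a registry host when it
    # looks like one (dot, port, or the literal 'localhost').
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return "", image_ref


def is_real_update(current_ref, candidate_ref):
    """True only when repository path or tag differ, not merely the registry host."""
    return split_registry(current_ref)[1] != split_registry(candidate_ref)[1]
```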

## 2026-06-09 Resume Handoff - Read First

Last user prompt to preserve:

> please can we save all our progress, backlog, and goal to memory so I can resume on another device please
>
> including the last prompt

Ultimate release goal:

Archipelago's app/container system must be developer-ready and production-release ready. New apps should be supported through manifest/runtime contracts and clear developer documentation, not one-off OS-level changes or fragile per-app hacks. The app system must be professional, secure, elegant, lightweight, and predictable: apps install, start, stop, restart, uninstall, reinstall, survive reboot, show correct status/progress, and launch correctly from tabs/iframes. Developers should be able to package apps for Archipelago clearly from the migration/developer docs.

Important target node:

- Validation node: `archipelago@192.168.1.198`, password `password123`.
- Current release deadline pressure from user: production release target was Thursday, 2026-06-11.
- Tests have been run mostly on `.198`; user noted we may also need to validate on the current intended release server, not only `.198`.
- Avoid broad/destructive Podman store cleanup. Do not use `git reset --hard` or revert unrelated user changes.

Current deployed backend on `.198`:

- Latest deployed `/usr/local/bin/archipelago` sha256: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`.
- A later local-only code change exists and passed `cargo check`: cached web-app health now requires HTTP reachability, not just TCP. This was not deployed because the user interrupted the release build/deploy flow. No build process was left running at handoff.
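
The HTTP-reachability rule in the local-only patch can be summarised as a small decision function. This is a hedged Python sketch, not the backend's implementation: `classify_web_health` and the state names `healthy`, `starting`, `unhealthy`, and `unreachable` are assumptions chosen for illustration. The only rule taken from the notes is that an open TCP port alone must not count as healthy, while an HTTP success or redirect does.

```python
def classify_web_health(tcp_open, http_status):
    """Classify a web app from one probe.

    tcp_open:    True if the port accepted a TCP connection.
    http_status: integer status from an HTTP GET, or None if no response arrived.
    """
    if http_status is not None:
        # Success (2xx) or redirect (3xx) is the only path to "healthy".
        if 200 <= http_status < 400:
            return "healthy"
        return "unhealthy"
    if tcp_open:
        # Port is bound but nothing answered HTTP yet: the app is still booting.
        return "starting"
    return "unreachable"
```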

Major progress achieved in the latest session:

- Beta Telemetry / Fleet collector:
  - Confirmed `TELEMETRY_COLLECTOR_URL` was not set in the current shell and no repo/service config was setting it.
  - Fixed the periodic reporter to POST a `telemetry.ingest` JSON-RPC envelope to the configured collector endpoint instead of POSTing the raw telemetry report body.
  - Added optional systemd env loading with `EnvironmentFile=-/var/lib/archipelago/telemetry.env` in `image-recipe/configs/archipelago.service`.
  - Updated `scripts/deploy-to-target.sh` so deployments write `/var/lib/archipelago/telemetry.env` when `TELEMETRY_COLLECTOR_URL` is exported in `scripts/deploy-config.sh`.
  - Documented the expected value shape in `scripts/deploy-config.example`: `https://<collector-host>/rpc/v1`.
  - Verification passed: `cargo fmt -p archipelago --manifest-path core/Cargo.toml`, `bash -n scripts/deploy-to-target.sh`, `git diff --check` for the touched files, and `CARGO_TARGET_DIR=/tmp/archy-cargo-check cargo check -p archipelago --manifest-path core/Cargo.toml`.
  - `systemd-analyze verify image-recipe/configs/archipelago.service` could not run in the sandbox because systemd bus access failed with `SO_PASSCRED failed: Operation not permitted`.
  - Still needed: choose the real collector host, create or update local `scripts/deploy-config.sh` with `export TELEMETRY_COLLECTOR_URL='https://<collector-host>/rpc/v1'`, deploy, restart `archipelago`, and confirm opted-in nodes ingest into Fleet.
- IndeeHub:
  - Recovered stale/corrupt metadata/container state enough for fresh lifecycle.
  - Full lifecycle passed earlier on `.198`.
  - Verified launch on `7778`.
  - Verified `/nostr-provider.js` is served and the Nostr signer bridge requirement is preserved.
- Saleor:
  - Removed from app catalog/server as requested.
- Bitcoin Knots / Bitcoin UI:
  - Fixed false health path so `bitcoin-knots` health no longer just probes the UI bridge on `8334`.
  - Patched Bitcoin UI wording to show retrying/busy sync states instead of scary permanent failure.
  - Verified `/bitcoin-status` recovered; node is in IBD and pruned, progress around 6-7% during latest checks.
- Fedimint:
  - Restored/kept Fedimint Gateway as separate catalog app. Do not make Guardian launch Gateway.
  - Fixed Guardian startup path so `fedimint` uses manifest-backed Quadlet/orchestrator, not legacy startup.
  - Fixed generated unit regeneration by removing the pre-orchestrator Podman inspect gate for orchestrator starts.
  - Fedimint Guardian unit now includes `FM_BITCOIND_URL=http://bitcoin-knots:8332`.
  - Added manifest wrapper that waits for Bitcoin RPC sync with `"initialblockdownload":false` before launching `fedimintd`.
  - Current correct behavior on `.198`: `fedimint.service` active and logging `Waiting for Bitcoin RPC sync at http://bitcoin-knots:8332...`; RPC health returns `starting`; container-list now reports `fedimint` as `starting` instead of stale `stopping`.
  - Guardian iframe/tab does not yet show UI because `fedimintd` is intentionally gated until Bitcoin leaves IBD. The UI should explain "waiting for Bitcoin sync" rather than opening a blank/dead iframe.
- BotFights:
  - User reported stopped/unhealthy.
  - Added `botfights` to manifest-backed orchestrator start path so it no longer fails immediately on legacy Podman discovery.
  - Deployed backend hash `9a00e543...`.
  - BotFights started and is active.
  - Direct checks after it finished booting: `/` returned HTTP 200; `/api/health` returned `{"status":"ok","name":"botfights"}`.
  - Note: `.198` manifests still use `git.tx1138.com/lfg2025/botfights:1.1.0`; local repo manifest shows `146.59.87.168:3000/lfg2025/botfights:1.1.0`. Reconcile this catalog/manifest mismatch later.
- Status/health correctness:
  - Reduced container health/status Podman timeouts to avoid UI hanging forever.
  - `container-list` now refreshes stale cached states and uses Quadlet service-active fallback for stale `stopping` states.
  - Fedimint stale `stopping` fixed to `starting`.
  - Local-only patch passed `cargo check`: web-app cached health requires HTTP success/redirect, not just open TCP. This fixes false healthy during app boot, seen with BotFights.
- Filebrowser/Home Assistant/Immich/Bitcoin:
  - Latest RPC health check showed filebrowser healthy, homeassistant healthy, immich healthy, bitcoin-knots healthy.
  - Still treat Home Assistant setup/restart hang and Immich post-setup HTTP 500 as backlog blockers needing focused validation.
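
As a reference for the telemetry reporter fix listed above, wrapping the report in a `telemetry.ingest` envelope might look like the sketch below. The method name comes from these notes; everything else is an assumption for illustration: that the collector speaks JSON-RPC 2.0, that the report is passed as `params` unchanged, and the helper names `build_ingest_envelope` and `encode_for_post`. Check the collector's real contract before relying on this shape.

```python
import json


def build_ingest_envelope(report, request_id=1):
    """Wrap a telemetry report dict in a JSON-RPC style request object.

    Assumption: JSON-RPC 2.0 framing with the report as `params`.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "telemetry.ingest",
        "params": report,
    }


def encode_for_post(report):
    """Return (body_bytes, headers) ready for an HTTP POST to the collector URL."""
    body = json.dumps(build_ingest_envelope(report)).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return body, headers
```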

Current critical blockers:

- Runtime control plane / Podman scanning:
  - Backend restarts repeatedly take 1-2 minutes because startup/crash recovery synchronously waits on slow `podman ps`.
  - Logs show repeated `podman ps -a --format json timed out after 30s` and crash recovery `podman ps stopped timed out after 60s`.
  - This is causing bad UX: "checking forever", false "no apps installed", intermittent "loading apps", stale statuses, slow lifecycle actions.
  - Next platform fix should move Podman/crash-recovery scans out of the service readiness path and keep last-known app state during scanner backoff.
- My Apps UI false negatives:
  - User reports apps sometimes do not show, "checking" forever, "loading apps" sometimes good but often false "no apps installed".
  - Required fix: do not show empty/no-apps while scanner or Podman is in backoff. Keep last known apps, show explicit loading/checking/stale state, and avoid destructive UI conclusions from scan timeout.
- Fedimint Guardian:
  - Current "starting/waiting for Bitcoin sync" is correct while Bitcoin is in IBD.
  - Need UI/status copy that explains waiting for Bitcoin sync, and later validate Guardian UI on `8175` once Bitcoin sync condition is satisfied.
- Progress UX:
  - User explicitly requires install/uninstall/start/stop/restart progress to be accurate and not look frozen.
  - Uninstall indicator currently shows poor or no progress. Must fix with clear phase updates and no stale notifications.
- Stale health notifications:
  - Must not keep firing on new logins/refreshes once they are no longer valid.
  - Some UI filtering was patched earlier, but keep this in regression backlog.
- Reboot survival:
  - Must pass repeated reboot validation after runtime/status fixes.
  - Acceptance target from user: minimum 3 clean consecutive reboots, preferably 5.
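
The "keep last-known apps during scanner backoff" requirement from the blockers above can be sketched as a small state transition. This is an illustrative Python model only, since the real UI is the web frontend; `AppListView`, `apply_scan_result`, `show_empty_state`, and the state labels are hypothetical names. The point is that a timed-out scan never replaces a known app list with an empty one, and "no apps installed" is only shown after a scan that actually succeeded and returned nothing.

```python
from dataclasses import dataclass, field


@dataclass
class AppListView:
    apps: list = field(default_factory=list)
    state: str = "loading"  # "loading" | "fresh" | "stale"


def apply_scan_result(view, scan):
    """Fold one scan result into the view.

    `scan` is a list of apps on success, or None when the scan timed out
    or the scanner is in backoff.
    """
    if scan is not None:
        return AppListView(apps=list(scan), state="fresh")
    if view.apps:
        # Keep what we last knew and flag it as stale instead of clearing it.
        return AppListView(apps=view.apps, state="stale")
    return AppListView(apps=[], state="loading")


def show_empty_state(view):
    """Only a successful scan with zero apps may render 'no apps installed'."""
    return view.state == "fresh" and not view.apps
```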

Backlog captured from user reports:

- Portainer:
  - Environment wizard error: `Dial unix /var/run/docker.sock: connect: connection refused`.
  - User noted Portainer does Podman orchestration well; compare/learn from its socket/control flow where useful.
- Fedimint:
  - Setup after guardian confirmation caused app not to launch.
  - Guardian launch was opening Gateway before; do not regress. Guardian and Gateway must remain distinct.
  - Gateway app disappeared from catalog before; it has been restored but keep in regression tests.
- Bitcoin Knots:
  - User saw missing app/launch issues and status bridge messages. UI now improved, but include in lifecycle/reboot regression.
- Home Assistant:
  - Setup has issues on this node and restart hung for a long time.
- Immich:
  - After setup user saw HTTP 500 stacktrace from `loadServerConfig`. Needs focused post-setup validation, not just "healthy".
- Filebrowser:
  - User saw erroneous stopped status while app was working. Status ordering was patched; keep in regression.
- Tailscale:
  - Launch must show local login/auth UI, not merely container running.
- BTCPay/Fedimint/Gateway/other Bitcoin-dependent apps:
  - Need clearer dependency wait states when Bitcoin RPC is slow/IBD.
- App catalog/developer readiness:
  - Apps should not require OS-level changes per app.
  - App migration document and developer guide must include this principle and current app packaging contract.
- Saleor:
  - Removed from catalog/server and should stay removed unless intentionally reintroduced.
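
For the Bitcoin-dependent wait states in the backlog above, one way to turn a Bitcoin RPC answer into user-facing status copy is sketched below. It assumes the body is a standard `getblockchaininfo` JSON-RPC response, whose `initialblockdownload` and `verificationprogress` fields are real Bitcoin Core/Knots fields; the function name and the returned strings are hypothetical and not the project's wording.

```python
import json


def bitcoin_dependency_state(rpc_body):
    """Map a getblockchaininfo response to a dependency wait state.

    rpc_body: raw JSON text of the RPC response, or None if the call timed out.
    """
    if rpc_body is None:
        return "waiting: Bitcoin RPC not responding"
    info = json.loads(rpc_body).get("result") or {}
    # Treat a missing flag as "still syncing" so dependants never start early.
    if info.get("initialblockdownload", True):
        pct = float(info.get("verificationprogress", 0.0)) * 100
        return f"waiting: Bitcoin sync {pct:.0f}%"
    return "ready"
```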

Release readiness estimate:

- Prior estimate was 68%; after latest IndeeHub/Fedimint/BotFights/status progress, a realistic estimate is about 72%.
- Remaining 28% is not feature volume; it is systemic hardening: runtime control-plane responsiveness, truthful UI during Podman backoff, lifecycle/reboot gates, and focused app-specific post-setup validation.

Suggested immediate next steps after resuming:

1. Read this file and verify no background build/process is running.
2. Build/deploy the local-only HTTP-health tightening patch if not already deployed.
3. Patch backend startup/crash recovery so Podman scans are async/non-blocking and service readiness is not held hostage by `podman ps`.
4. Patch My Apps UI/data flow to preserve last-known apps during scanner backoff and never show false empty state while checking.
5. Run focused status checks on `.198`: fedimint, botfights, filebrowser, bitcoin-knots, immich, homeassistant, portainer.
6. Continue lifecycle gates only after the runtime scan/control path is stable enough that tests measure apps, not Podman timeouts.
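
Step 3 above (scans off the readiness path) can be pictured with the following Python sketch. It is not the backend's Rust implementation; `run_bounded` and `start_service` are hypothetical helpers. The sketch shows two ideas: every scan command gets a hard timeout, and the service reports ready before the scan result is known. The demo at the bottom uses a harmless Python one-liner as a stand-in for `podman ps -a --format json`.

```python
import concurrent.futures
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run cmd with a hard timeout; never hang the caller.

    Returns ('ok', stdout), ('failed', stderr) or ('timeout', '').
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout", ""
    if done.returncode != 0:
        return "failed", done.stderr
    return "ok", done.stdout


def start_service(scan_cmd, timeout_s):
    """Report readiness immediately and let the scan finish in a worker thread."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_bounded, scan_cmd, timeout_s)
    pool.shutdown(wait=False)  # do not block startup on the scan
    return "ready", future


if __name__ == "__main__":
    # Stand-in for a container scan; any slow command behaves the same way.
    status, future = start_service([sys.executable, "-c", "print('[]')"], timeout_s=30)
    print(status, future.result())
```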

Read this first if resuming in a fresh OpenCode session. Paste the resume prompt below verbatim.

---

## Resume Prompt

> Continue Archipelago release hardening from `docs/RESUME.md`. First read `docs/RESUME.md`, `docs/CONTAINER_LIFECYCLE_HANDOFF.md`, and `docs/MIGRATION_STATUS_REPORT.md`. The active validation node is `.198` at `192.168.1.198`; keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive for deterministic tests. Do not run Podman prune/image-list/system-df/image-exists/store-wide cleanup commands on `.198`; the store is known to hang under load. Preserve app data. Latest deployed backend hash is `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`. This includes the rootless Podman socket fix that treats `/run/user/1000/podman/podman.sock` as a socket bind, never a directory/data bind, prefers persistent `podman-archy-api.service` for Portainer, and changes absent cached `Stopping` entries to `Stopped`. User reported host reboot validation was not clean: many containers were SIGKILLed during reboot/shutdown and IndeeHub was stopped after boot. User also reported Immich, IndeeHub, Tailscale, Vaultwarden, Portainer, Home Assistant, Uptime Kuma, Nextcloud, Fedimint, and Botfights app lifecycle/launch/state issues. BTCPay was a false alarm: slow but fine. Current live validation: Vaultwarden full preserve-data lifecycle passed; Portainer full preserve-data lifecycle passed and its socket mount is no longer `//deleted`, but the user still needs to retry the Portainer environment wizard. Fedimint direct container state is running/healthy. IndeeHub remains P0: Podman still has a corrupted `indeedhub|Removing|97cf9fd13bb2` record; targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Treat post-reboot recovery, launch reachability, lifecycle correctness, progress indication, and rootless Podman socket-backed apps as active release blockers. 
IndeeHub is not passing unless `http://<node>:7778/` is reachable and `/nostr-provider.js` is injected/served so the Nostr signer works as before. Tailscale is not passing unless launch presents the Tailscale login/auth UI. Before editing or touching `.198`, summarize current state and your exact first step.

---

## Current Goal

Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image.

Current status estimate: about 68% of the way to release. The app migration, manifest/catalog generation, and many local gates are advanced, and the latest pass fixed Vaultwarden plus the concrete Portainer stale socket mount. Live `.198` testing still shows the app platform is not production-bulletproof. Remaining release blockers include app install/start truthfulness, frontend launch readiness gating, IndeeHub recovery and Nostr signer compatibility, Tailscale login-link launch, Home Assistant/Uptime Kuma/Nextcloud install/start failures, full lifecycle coverage, progress indication quality, app packaging documentation, refactor/dead-code cleanup, repeated reboot validation, final `.198` lifecycle confidence, and cutting/smoke-testing the `1.8-alpha` ISO.

## Release Readiness Estimate

- Estimated completion: `68%`.
- What is already achieved:
  - manifest-driven app migration is substantially advanced;
  - catalog metadata generation and strict drift checks are green;
  - local backend/frontend release gates have been green in prior passes;
  - broad non-destructive lifecycle has passed on the deployed release-candidate line before the reboot-gate finding;
  - Podman store-risk paths have been quarantined from known fragile broad image/store commands;
  - IndeeHub recovery now has local hardening in progress, including explicit Nostr signer validation in the lifecycle harness;
  - targeted Immich fixes now make dependency creation fail fast instead of silently reporting install success, and a follow-up readiness-gating patch is in progress so the app does not look launchable before HTTP readiness;
  - mobile and desktop app progress UX now has clearer install/remove phase labels in local changes;
  - Vaultwarden full preserve-data lifecycle passed on `.198` after the rootless socket fix;
  - Portainer full preserve-data lifecycle passed on `.198` after recreating the container against persistent `podman-archy-api.service`; its mount now points at `/podman/podman.sock`, not `/podman/podman.sock//deleted`.
- What must still pass before release:
  - deploy the current Immich readiness-gating backend and frontend progress UX changes;
  - focused Immich validation: install must stay in progress until `http://<node>:2283/` returns HTTP success and app launch opens the frontend;
  - focused IndeeHub validation: recover stale/corrupt frontend container, prove `http://<node>:7778/`, and prove `/nostr-provider.js` signer bridge is injected/served;
  - keep Vaultwarden in regression coverage even though the latest full lifecycle passed;
  - focused Tailscale validation: launch must present the local login/auth link/UI on `8240`;
  - focused Portainer validation: user must retry the environment wizard and confirm it can connect to the rootless Podman socket at `/var/run/docker.sock`;
  - full preserve-data lifecycle testing for representative migrated apps and key stacks: `install -> launch -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch`;
  - progress indication validation for install, uninstall, start, stop, restart, reboot recovery, and failed transitions; generic "running" or "removing" pills are not enough;
  - app packaging documentation gate: update `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` so they match the current manifest/runtime contract, include lifecycle/progress/reboot expectations, and clearly tell developers to use reusable manifest/orchestrator primitives instead of OS-level per-app hacks;
  - required refactor/remove-dead-code gate: after correctness is proven and before cutting `1.8-alpha`, remove obsolete app-specific paths, stale fallback metadata, duplicate lifecycle logic, unused scripts/hooks, and misleading compatibility shims; rerun lifecycle, launch, and release gates afterward;
  - broad non-destructive lifecycle after the deploy;
  - at least 3 consecutive clean post-fix reboot iterations, with broad lifecycle green after each;
  - preferably 5 consecutive clean reboot iterations before calling `1.8-alpha` production-release ready;
  - final local release gates after any additional fixes;
  - cut the `1.8-alpha` ISO;
  - boot/smoke-test the ISO enough to prove installability, backend startup, UI startup, app catalog availability, and at least a focused app lifecycle.

---

## Latest User Directive

> A lot were killed SIGKILL and one crashed, a couple stopped. Not sure if we did fixes but we should be a few reboot tests until 3/4/5 reboots are clean I guess, unless you advise a different passing criteria
>
> please do not forget that indeehub must work with the nostr signer just like before, I hope we haven't broken that or anything, please add to tasks
>
> also please note that immich and tailscale are not launching on the front-ends on their ports from the app screen, they say running/healthy but clearly aren't
>
> Also BTCPay is not running either
>
> no my bad, wrong server, BTCPay is fine just slow, please continue
>
> Yes, as shown in trying to complete the environment wizard in portainer you get "Failure Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
>
> please confirm there is a refactor/remove dead code release gate too

Passing criterion adopted: after the post-reboot recovery fix is deployed, require at least 3 consecutive clean reboots with broad non-destructive lifecycle green after each; prefer 5 consecutive clean reboots for production-release confidence. SIGKILL during shutdown is not automatically disqualifying if every managed app recovers and is reachable after boot, but any app left stopped/crashed/unreachable after boot is a failed reboot iteration. IndeeHub validation must include the Nostr signer bridge, not just HTTP reachability.
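
The passing criterion above can be expressed as a small checker, which may help keep reboot bookkeeping honest across sessions. This is a sketch under assumptions: the record shape (`apps` mapping app id to a post-boot state, plus a `lifecycle_green` flag) and the function names are hypothetical. SIGKILL during shutdown is deliberately not part of the record, because the criterion only judges what is reachable after boot.

```python
def reboot_iteration_clean(iteration):
    """One reboot counts as clean only if every app is reachable after boot
    and the broad non-destructive lifecycle was green afterwards."""
    apps_ok = all(state == "reachable" for state in iteration["apps"].values())
    return apps_ok and iteration["lifecycle_green"]


def consecutive_clean(iterations):
    """Length of the clean streak at the end of the history; a failure resets it."""
    streak = 0
    for iteration in reversed(iterations):
        if not reboot_iteration_clean(iteration):
            break
        streak += 1
    return streak


def gate_verdict(iterations, required=3, preferred=5):
    """Apply the 3-minimum / 5-preferred consecutive clean reboot rule."""
    streak = consecutive_clean(iterations)
    if streak >= preferred:
        return "pass (preferred)"
    if streak >= required:
        return "pass (minimum)"
    return f"not yet: {streak}/{required} clean"
```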

Immich, Tailscale, Vaultwarden, and Portainer are explicit blockers. Container `running`/`healthy` is not enough for Immich/Tailscale; direct/app-screen launch routes must work. Tailscale launch must present the login/auth UI. Vaultwarden must survive install/start/restart. Portainer must be able to talk to the rootless Podman socket from inside its Docker-compatible socket bind. BTCPay is not currently a blocker; it was a wrong-server/slow-app false alarm.

There is also an explicit app packaging documentation gate and an explicit required refactor/remove-dead-code release gate. The packaging docs must be current enough for a third-party developer to package an app against the actual manifest/runtime contract. Do the refactor/dead-code cleanup after current correctness fixes are validated, not before, but do not cut `1.8-alpha` without it: remove stale per-app hacks, dead legacy code paths, duplicate lifecycle helpers, obsolete scripts/hooks, and misleading fallback metadata that would make `1.8-alpha` hard to maintain, then rerun the release gates.

---

## Live `.198` State

- Host: `192.168.1.198`.
- Password for lifecycle harness/RPC login: `password123`.
- Latest recorded `/usr/local/bin/archipelago` sha256: `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`.
- `archipelago.service`: active.
- `archipelago-doctor.timer`: inactive.
- `archipelago-reconcile.timer`: inactive.
- `/`: `65%` used, about `9.6G` free.
- `/var/lib/archipelago`: about `9-10%` used, about `370G` free.

Current active app blockers:

- Immich: after deploying hash `54d781...`, reinstall no longer immediately stops. Live test showed `immich_postgres` and `immich_redis` healthy and `immich_server` running; first launch had a readiness gap while Immich ran migrations/geodata import, then `2283` returned HTTP `200`. Local follow-up changes add an Immich server health check and require healthy status before install completes.
- IndeeHub: still blocked. Latest targeted check after hash `f1f5c61c...` showed a corrupted Podman ghost record: `indeedhub|Removing|97cf9fd13bb2`; `podman inspect indeedhub` fails with `layer not known`. Targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Must recover this record without broad store cleanup and then verify `http://<node>:7778/` plus `/nostr-provider.js` for the Nostr signer.
- Home Assistant: user reports install completes then app stops. Treat as part of the migrated single-container/rootless Podman control-plane blocker.
- Uptime Kuma: user reports install takes ages then app stops. Live logs showed `package.install uptime-kuma failed: systemctl --user restart podman.socket exited exit status: 1`.
- Nextcloud: user reports same install-then-stop behavior. Live logs showed `package.install nextcloud failed: systemctl --user restart podman.socket exited exit status: 1`.
- Vaultwarden: latest full preserve-data lifecycle passed on hash `2a168489...`: install -> launch on `8082` -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch. Keep in regression tests because the user-visible transition/progress UX still looked like it was stuck while stopping.
- Portainer: latest full preserve-data lifecycle passed on hash `2a168489...`. The stale mount was confirmed as `/run/user/1000/podman/podman.sock//deleted`; after persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock` without `//deleted` and `http://127.0.0.1:9000/` returns HTTP `200`. User still needs to retry the environment wizard; do not close this blocker until the wizard no longer reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`.
- Tailscale: still blocked. Container running is not enough; launch must present local login/auth UI on `8240`.
- Fedimint: user reported it showed `stopping`; after hash `f1f5c61c...`, direct targeted state shows `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`. Keep in focused regression/launch checks.
- Botfights: newly reported stopped/broken. Direct probe after the report showed `botfights` running/healthy and `http://127.0.0.1:9100/` returning `200`; keep in focused lifecycle/launch validation after Podman control-plane recovery.
- Rootless Podman socket/control plane: improved but still a release-risk area. Fixed the concrete bug where `/run/user/1000/podman/podman.sock` could be created as a directory and the Portainer bind could point at a deleted socket inode. The current deployed backend prefers persistent `podman-archy-api.service`. Continue watching scanner timeouts and lifecycle behavior for Home Assistant, Uptime Kuma, Nextcloud, and Portainer.
- Stuck Podman records: P0 migration blocker. IndeeHub proves ordinary targeted `podman rm` fallbacks are not sufficient once a record is wedged in `Removing`.
- Progress UX: still blocked until live validation proves install/uninstall/start/stop/restart show phase detail and do not appear frozen.

Do not treat root disk pressure as a current blocker anymore. It was reduced from `99%` used with under `600M` free to about `65%` used with roughly `10G` free.
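
For the Portainer blocker in the list above, the stale-socket symptom (`podman.sock//deleted` in mountinfo) can be checked mechanically. The sketch below parses text in the `/proc/<pid>/mountinfo` format, where the fourth field is the mount root and the fifth is the mount point. The function name and the sample lines in the test are illustrative assumptions, not captured output from `.198`.

```python
def stale_socket_mounts(mountinfo_text, marker="podman.sock"):
    """Return (root, mount_point) pairs whose bind source was deleted.

    The kernel appends '//deleted' to the root field when the bind-mounted
    file no longer exists, which is what a recreated socket leaves behind.
    """
    stale = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a mountinfo record
        root, mount_point = fields[3], fields[4]
        if marker in root and root.endswith("//deleted"):
            stale.append((root, mount_point))
    return stale
```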
### 2026-06-10 Resume Continuation Checkpoint

- Deployed backend hash `7f58da80063f58574675256913ac9cddf131e65d8935015748a70adffc228f83` to `.198`.
- Previous live hash observed before deploy: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`.
- `archipelago.service` is active.
- `archipelago-doctor.timer` and `archipelago-reconcile.timer` are inactive.
- Added explicit release gates to this handoff:
  - app packaging docs must be updated before `1.8-alpha`;
  - refactor/remove-dead-code is required before `1.8-alpha`, after correctness validation and before final release gates/ISO.
- Local validation before deploy:
  - `bash -n tests/lifecycle/remote-lifecycle.sh` passed;
  - `cargo fmt --manifest-path core/Cargo.toml --all`;
  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` passed (`45` tests);
  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed;
  - `python3 scripts/check-app-catalog-drift.py --release --strict` passed;
  - `git diff --check` passed.
- Filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` appeared wedged in the tool PTY after compilation started; no local cargo/rustc worker remained visible. Treat that one filtered run as inconclusive, not failed.
- IndeeHub live validation after deploy:
  - `container-list` reports `indeedhub` running;
  - `container-health` reports `{"indeedhub":"healthy"}`;
  - `http://192.168.1.198:7778/` returns HTTP `200`;
  - `http://192.168.1.198:7778/nostr-provider.js` returns HTTP `200` and contains the Archipelago NIP-07/NIP-98 Nostr provider shim.
- Immich live validation after deploy:
  - `container-list` reports `immich` running;
  - direct `http://192.168.1.198:2283/` returns HTTP `200`;
  - `container-health` reported `{"immich":"unknown"}` during one focused check, so health truthfulness still needs follow-up even though launch HTTP is reachable.
- Tailscale live validation after deploy:
  - Found the live generated unit still used the stale catalog command `sleep 2; tailscale web...`; locally patched `app-catalog/catalog.json`, `neode-ui/public/catalog.json`, and `scripts/first-boot-containers.sh` to use the safer socket-wait startup, and copied the catalog to `/opt/archipelago/web-ui/catalog.json`.
  - App-scoped `package.restart tailscale` failed via RPC with `podman ps timed out while listing containers`.
  - Patched the live generated Tailscale `.container` unit to match the catalog fix and restarted only `tailscale.service`; the old container required SIGKILL during stop and Podman cleanup took roughly 2 minutes.
  - After restart, the Tailscale unit runs both `tailscaled` and `tailscale web`, `container-list` reports `tailscale` running, `container-health` reports `{"tailscale":"healthy"}`, and `http://192.168.1.198:8240/` returns HTTP `200` with Tailscale UI content.
  - Do not close Tailscale lifecycle as fully passing yet: launch UI is fixed, but stop/restart behavior exposed the rootless Podman cleanup/control-plane blocker.
- Other live probes after deploy:
  - `portainer` HTTP `9000` returns `200`; user still needs to retry the environment wizard.
  - `vaultwarden` HTTP `8082` returns `200` from localhost on `.198`.
  - `botfights` HTTP `9100` returns `200` from localhost on `.198`.
  - `btcpay-server` returned `302` then timed out under a short probe; continue treating BTCPay as slow rather than a current blocker unless a focused check fails.
  - `fedimint` port `8175` reset during probe while RPC showed `starting`; keep expected Bitcoin-sync wait-state/status copy in scope.
- Podman/control-plane remains the active systemic blocker:
  - logs still show `podman ps timed out`, `podman stats timed out`, scan backoff, and slow app cleanup;
  - do not start reboot-count validation until app stop/start/restart and post-reboot recovery are clean enough that tests measure app behavior instead of Podman timeouts.
|
|
|
|
---
|
|
|
|
## Latest Completed Work
|
|
|
|
### 2026-06-08 Rootless Socket, Vaultwarden, and Portainer Fix
|
|
|
|
- Built and deployed backend hash `2a168489737180b4088503dd93ef89c11da13e64790b324db8baea8ca05d3536` to `.198`; then built and deployed follow-up hash `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`; `archipelago.service` active, `archipelago-doctor.timer` inactive, `archipelago-reconcile.timer` inactive.
|
|
- Fixed rootless Podman socket bind handling in `core/archipelago/src/container/prod_orchestrator.rs`:
|
|
- `/run/user/1000/podman/podman.sock` is skipped by bind-directory creation and data UID/chown prep;
|
|
- socket bind mounts call explicit socket repair before other bind prep;
|
|
- `ensure_user_podman_socket()` now prefers persistent `podman-archy-api.service` at `unix:///run/user/1000/podman/podman.sock`, falling back to `podman.socket` only if needed.
|
|
- Validated locally before deploy:
|
|
- `cargo fmt --manifest-path core/Cargo.toml --all`.
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
|
|
- `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago absent_` (`4 passed`, including the stale absent `Stopping` regression tests).
|
|
- `git diff --check`.
|
|
- `timeout 900s cargo build --manifest-path core/Cargo.toml -p archipelago --release`.
|
|
- Vaultwarden full preserve-data lifecycle passed on `.198`:
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=vaultwarden ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
|
|
- Portainer full preserve-data lifecycle passed on `.198`:
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
|
|
- Portainer stale socket mount was confirmed and repaired:
|
|
- Before recreate, mountinfo showed `/run/user/1000/podman/podman.sock//deleted -> /var/run/docker.sock`.
|
|
- After persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock -> /var/run/docker.sock`, host socket exists, and Portainer UI returns HTTP `200`.
|
|
- User still needs to retry the Portainer environment wizard; do not close the blocker until that wizard can connect.
|
|
- Direct state check after deploy:
|
|
- `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`.
|
|
- `indeedhub|Removing|97cf9fd13bb2`; `podman inspect` fails with `layer not known`; targeted removal/cleanup hangs and had to be killed.
|
|
- `vaultwarden running true`.
|
|
- `portainer running true`.
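
Sketch for the socket handling above, not the deployed code: the stale `podman.sock//deleted` mount shows why "the path exists" is a weak check. A minimal std-only probe that only trusts a socket that accepts a Unix connection could look like this. `socket_accepts_connections` and `first_live_socket` are hypothetical names; the real `ensure_user_podman_socket()` body is not reproduced here.

```rust
use std::os::unix::net::UnixStream;
use std::path::Path;

/// Returns true only if something is accepting connections on the socket.
/// A stale socket file or a regular file at the same path returns false.
pub fn socket_accepts_connections(path: &Path) -> bool {
    UnixStream::connect(path).is_ok()
}

/// Hypothetical helper: pick the first candidate socket that is actually live,
/// for example a persistent API service socket before a fallback socket.
pub fn first_live_socket<'a>(candidates: &[&'a Path]) -> Option<&'a Path> {
    candidates
        .iter()
        .copied()
        .find(|path| socket_accepts_connections(path))
}
```

A caller would pass the preferred socket path first and the fallback second, and only repair or restart the socket unit when `first_live_socket` returns `None`.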
|
|
|
|
### 2026-06-08 Reboot Blocker Follow-up In Progress
|
|
|
|
- User reported host reboot validation was not clean: many containers were killed with SIGKILL during reboot/shutdown, one crashed, a couple stopped, and IndeeHub was stopped after boot.
|
|
- Treat this as a failed reboot gate. Do not call the release ready until post-fix reboot iterations are clean.
|
|
- Local changes made in this pass:
|
|
- hardened `core/archipelago/src/container/prod_orchestrator.rs` IndeeHub stack recovery so reboot reconcile starts existing backend containers through a user scope when possible, waits for backend containers and API dependency DNS, starts/restarts the frontend, verifies it remains running, and verifies host port `7778`;
|
|
- hardened `core/container/src/manifest.rs` package validation for app IDs, ports, env keys, capabilities, devices, volume sources/options, network policy, and reviewed host-bind exceptions while preserving all current real manifests;
|
|
- updated `tests/lifecycle/remote-lifecycle.sh` so IndeeHub launch validation requires `/nostr-provider.js` to be injected into the HTML and served from the app, preserving the Nostr signer requirement.
|
|
- Deployed follow-up backend hash `4108ca146b482c028ae8d7c4bec314b71ef3412f15efd2e61846a2c345b36aba` to `.198`; service active, timers inactive. Focused audit still showed:
|
|
- `indeedhub` stuck `stopping` and unhealthy;
|
|
- `immich` stopped/unhealthy;
|
|
- `tailscale` running/healthy but direct launch `8240` returned `000`;
|
|
- `vaultwarden` health RPC errored and launch `8082` returned `000`;
|
|
- `btcpay-server` was fine (`23000` returned HTTP 200); user confirmed BTCPay was a wrong-server/slow-app false alarm.
|
|
- Targeted diagnostics on `.198` found:
|
|
- IndeeHub frontend Podman state `removing`/`stopping` with no `7778` listener;
|
|
- Immich server stopped, Redis exited, Postgres unhealthy, no `2283` listener;
|
|
- Tailscale listener process existed on `8240`, but direct HTTP still returned `000`; logs show Tailscale is `NeedsLogin`/`WantRunning=false`, so launch must present the login/auth UI rather than a generic daemon endpoint;
|
|
- Vaultwarden container was absent; public `package.start vaultwarden` failed on stale/refused Podman socket before local fixes;
|
|
- Portainer launches but the environment wizard reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`, confirming socket-backed apps are not release-ready.
|
|
- Local follow-up fixes after those diagnostics:
|
|
- `core/container/src/runtime.rs` now tries `podman rm -f --time 0`, targeted `podman container cleanup`, and another `rm -f` when normal forced remove fails;
|
|
- `ensure_user_podman_socket()` now verifies the rootless Podman socket accepts Unix connections, not just that the socket path exists;
|
|
- IndeeHub readiness now falls back to platform-managed network-alias presence when `getent` inside the API image cannot prove DNS;
|
|
- lifecycle harness now requires Tailscale launch content to look like login/auth UI.
|
|
- Local validation passed after those fixes:
|
|
- `cargo fmt --manifest-path core/Cargo.toml --all`.
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
|
|
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
|
|
- `bash -n tests/lifecycle/remote-lifecycle.sh`.
|
|
- `git diff --check`.
|
|
- Deployed second follow-up backend hash `06420c0377fff650a2bf3211f13c1e0754bf8df81345b8485f4c9a30cb552439` to `.198`; service active, timers inactive.
|
|
- Public RPC recovery attempts on hash `06420c...`:
|
|
- `package.restart indeedhub` still failed;
|
|
- `package.start immich` accepted async start but app remained `starting` with no `2283` launch;
|
|
- `package.start vaultwarden` accepted async start but no `8082` launch appeared;
|
|
- `package.restart portainer` failed;
|
|
- `package.restart tailscale` accepted async restart but no `8240` launch UI appeared.
|
|
- Latest focused probe after hash `06420c...`:
|
|
- `tailscale` `running`, `http://192.168.1.198:8240/` returns `000`;
|
|
- `immich` `starting`, `http://192.168.1.198:2283/` returns `000`;
|
|
- `indeedhub` `stopping`, `http://192.168.1.198:7778/` returns `000`;
|
|
- `portainer` `running`, `http://192.168.1.198:9000/` returns `000`;
|
|
- `vaultwarden` absent/not listed, `http://192.168.1.198:8082/` returns `000`.
|
|
- Conclusion: do not proceed to reboot testing or ISO work. The rootless Podman control-plane/socket health and stuck container-state recovery need a deeper platform fix before lifecycle/reboot gates are meaningful.
|
|
- Local validation passed so far:
|
|
- `cargo fmt --manifest-path core/Cargo.toml --all`.
|
|
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
|
|
- `bash -n tests/lifecycle/remote-lifecycle.sh`.
|
|
- `git diff --check`.
|
|
- A filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` compiled and ran the matching existing IndeedHub test (`1 passed`); it did not exercise the new reboot recovery branch because there is no direct unit for that path yet.
|
|
- Next steps:
|
|
- deploy the new backend only after approval;
|
|
- verify focused `indeedhub,immich,tailscale,vaultwarden,portainer` lifecycle/launch, including IndeeHub Nostr provider check and Portainer socket usability;
|
|
- run reboot validation iterations on `.198` only after explicit approval;
|
|
- pass threshold: 3 consecutive clean post-fix reboots minimum, 5 preferred for production-release confidence.
|
|
- cut and smoke-test the `1.8-alpha` ISO after reboot validation is green.
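
The reboot pass threshold above (3 consecutive clean post-fix reboots minimum, 5 preferred) can be written down as a small check. This is a sketch under the assumption that each reboot iteration is recorded as a single clean/not-clean flag, oldest first; `RebootGate` and `reboot_gate` are hypothetical names, not part of the lifecycle harness.

```rust
#[derive(Debug, PartialEq)]
pub enum RebootGate {
    NotMet,
    Minimum,
    Preferred,
}

/// `results` is ordered oldest to newest; `true` means a clean reboot.
/// Only the trailing streak counts, so one dirty reboot resets the gate.
pub fn reboot_gate(results: &[bool]) -> RebootGate {
    let streak = results.iter().rev().take_while(|clean| **clean).count();
    match streak {
        s if s >= 5 => RebootGate::Preferred,
        s if s >= 3 => RebootGate::Minimum,
        _ => RebootGate::NotMet,
    }
}
```

Counting only the trailing streak matches the "consecutive" wording: a history of clean, clean, dirty, clean, clean is still `NotMet`.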
|
|
|
|
### Local Release Gate Completion After `.198` App Recovery
|
|
|
|
- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands.
|
|
- Fixed scanner backoff/in-flight skip behavior: skipped scans now bump `scan_tick`, so install/update success paths that kicked the scanner do not wait for their timeout when Podman scan backoff is active.
|
|
- Fixed stale crash-recovery unit tests after `should_auto_start_stopped_container` gained the `include_stack_members` flag; coverage now asserts generic boot recovery skips stack helpers while stack recovery can include them.
|
|
- Fixed local runtime manifest-port lookup so tests and local backend runs can find workspace `apps/*/manifest.yml` via `CARGO_MANIFEST_DIR`; this covers new public apps such as PhotoPrism.
|
|
- Fixed journal usage parsing for real `journalctl --disk-usage` compact output such as `463.9M`.
|
|
- Fixed boot-reconciler cadence tests so `without_companion_stage()` also bypasses the global crash-recovery wait gate in tests; production still waits for recovery completion.
|
|
- Verified catalog generation is idempotent: `python3 scripts/generate-app-catalog.py` reported `updated 0 fields` for both catalogs.
|
|
- Validation passed locally:
|
|
- `cargo fmt --manifest-path core/Cargo.toml --all`.
|
|
- `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
|
|
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`43 passed`).
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security`.
|
|
- `cargo test --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security` (`12 security tests passed`; performance has no tests).
|
|
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
|
|
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
|
|
- `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
|
|
- `git diff --check`.
|
|
- `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
|
|
- Remaining gated item remains host reboot validation on `.198`, only if explicitly approved.
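
For the journal usage fix above, a minimal sketch of parsing compact sizes such as `463.9M`. Assumptions, labeled loudly: the suffixes are treated as binary multiples (K = 1024 bytes), the size is the first whitespace-separated token that parses, and the function names are hypothetical rather than the backend's actual helpers.

```rust
/// Parses a compact size token such as `463.9M` into bytes.
/// Assumption: suffixes are binary multiples (K = 1024 bytes, and so on).
pub fn parse_compact_size(token: &str) -> Option<u64> {
    let token = token.trim();
    let multiplier: u64 = match token.chars().last()? {
        'B' => 1,
        'K' => 1 << 10,
        'M' => 1 << 20,
        'G' => 1 << 30,
        'T' => 1 << 40,
        _ => return None,
    };
    // The suffix is one ASCII byte, so slicing it off is safe.
    let number = &token[..token.len() - 1];
    let value: f64 = number.parse().ok()?;
    if !value.is_finite() || value < 0.0 {
        return None;
    }
    Some((value * multiplier as f64) as u64)
}

/// Scans a whole output line and returns the first token that parses as a size.
pub fn parse_journal_usage(output: &str) -> Option<u64> {
    output.split_whitespace().find_map(parse_compact_size)
}
```

Fractional values are truncated to whole bytes, which is enough precision for a disk-usage threshold check.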
|
|
|
|
### Frontend Release Gate Completion
|
|
|
|
- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands.
|
|
- Found and fixed a mobile app-launch regression in `neode-ui/src/stores/appLauncher.ts`:
|
|
- desktop-only new-tab apps still open directly on desktop;
|
|
- mobile now routes those apps through the app-session route instead of escaping Archipelago in a new browser tab;
|
|
- `dashboardReturnPath()` now tolerates tests/minimal router mocks with no `currentRoute`.
|
|
- Updated frontend tests to match current desktop new-tab policy and mobile in-app routing behavior.
|
|
- Fixed `AppIconGrid` test setup so it shares the mounted Pinia instance and mocks credential lookup before launch.
|
|
- Fixed onboarding retry test timing to cover the actual exponential retry budget.
|
|
- Validation passed locally:
|
|
- `npm run type-check` from `neode-ui`.
|
|
- `npm test` from `neode-ui` (`548 passed`).
|
|
- `npm run build` from `neode-ui`.
|
|
- `python3 scripts/generate-app-catalog.py` (`updated 0 fields`).
|
|
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
|
|
- `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
|
|
- `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
|
|
- `git diff --check`.
|
|
- Local caveat: `npm ci` is currently blocked because existing `neode-ui/node_modules/@alloc` entries are owned by `root:root`. Existing installed modules were sufficient for type-check, tests, and build. Do not delete or chown this tree without explicit approval.
|
|
|
|
### Fedimint/File Browser, Nostr/NPM, and IndeedHub Recovery
|
|
|
|
- Built and deployed backend hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de` to `.198`.
|
|
- Fixed UI-facing package health for reachable running apps whose Podman health stayed `starting`, `unhealthy`, or a numeric exit value while the launch port was reachable.
|
|
- Confirmed Fedimint Guardian and File Browser were actually reachable; their `server.get-state` package-data now reports healthy instead of “starting up”.
|
|
- Fixed Nostr relay port conflict by moving `apps/nostr-rs-relay/manifest.yml` host port from `8081` to `18081`.
|
|
- Recovered Nginx Proxy Manager admin launch on `8081`; Nostr now launches on `18081` and no longer captures the NPM launch port.
|
|
- Hardened legacy package install so scoped web-app installs use `podman create` plus `systemd-run --user --scope podman start`, avoiding backend-cgroup coupling without hanging the install RPC.
|
|
- Recovered IndeedHub without deleting data: started the stopped `indeedhub-minio` dependency, repaired frontend reachability, and verified `7778` returns the app.
|
|
- Validation passed:
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
|
|
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
|
|
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
|
|
- Focused lifecycle for `indeedhub,nginx-proxy-manager,nostr-rs-relay,fedimint,filebrowser`.
|
|
- Direct launch checks returned HTTP `200` for `7778`, `8081`, `18081`, `8175`, and `8083`.
|
|
- Broad non-destructive lifecycle passed on live hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
|
|
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.6G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.
|
|
|
|
### Deployed Podman Store-Risk Cleanup
|
|
|
|
- Reviewed release-relevant Podman store/image call sites without running broad Podman store/image commands on `.198`.
|
|
- Bounded stack installer image pulls and manual package update image pulls with `kill_on_drop` and 600s timeouts.
|
|
- Deployed backend hash `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4` to `.198` with the previous backend backed up under `/usr/local/bin/archipelago.backup-20260608-store-risk-*`.
|
|
- Validation passed:
|
|
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
|
|
- `cargo fmt` from `core/`.
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
|
|
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
|
|
- Focused post-deploy lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint,immich,indeedhub,photoprism ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
- Broad post-deploy non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.8G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.
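
Sketch of the bounded-command idea behind the 600s pull timeouts above. The notes mention `kill_on_drop`, which suggests an async process API; this example is a std-only stand-in that polls and kills on deadline, so it shows the behavior rather than the actual implementation. `run_bounded` is a hypothetical name.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus, Stdio};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Runs `cmd` and returns `Ok(Some(status))` if it exits within `timeout`.
/// When the deadline passes, the child is killed and reaped, and `Ok(None)`
/// is returned so the caller can treat the operation as timed out.
pub fn run_bounded(cmd: &mut Command, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child: Child = cmd.stdin(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // Ignore kill errors: the child may have exited in the meantime.
            let _ = child.kill();
            child.wait()?;
            return Ok(None);
        }
        sleep(Duration::from_millis(50));
    }
}
```

A pull would then be invoked with something like `run_bounded(Command::new("podman").args(["pull", image]), Duration::from_secs(600))`, with `None` mapped to a timeout error instead of a hung RPC.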
|
|
|
|
### Release Candidate Backend Restart Validation
|
|
|
|
- Built and deployed backend hash `e28affdf4c1d3cecbe4c14b0439b53d977ed20873c966c288116601d49dac732` to `.198`.
|
|
- Bounded additional Podman store/control probes so image and stack health checks fail fast instead of hanging under `.198` Podman store/socket load.
|
|
- Fixed Fedimint health reporting: if Podman health remains `starting` but the app endpoint is reachable, `container-health` can use the reachable cached app fallback.
|
|
- Fixed package start/restart fallback for runtime web apps by using `systemd-run --user --scope` for `podman start`, then falling back to direct bounded `podman start`.
|
|
- Recovered live Immich without data loss:
|
|
- `immich_server` had exited because `/usr/src/app/upload/encoded-video/.immich` could not be written.
|
|
- Correct live ownership is still set with `podman unshare chown -R 0:0 /var/lib/archipelago/immich`: inside the rootless user namespace this is container root, which maps to host UID/GID `1000:1000`.
|
|
- A temporary `1000:1000` in-container ownership experiment was reverted because Immich's storage check writes as container root.
|
|
- Validation passed on latest hash:
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
|
|
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
|
|
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
|
|
- `npm run build` from `neode-ui`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
- Backend restart validation followed by focused `fedimint,immich,indeedhub,photoprism` lifecycle passed.
|
|
- Post-restart broad non-destructive lifecycle passed.
|
|
- Remaining gate before calling this a release: host reboot validation, if approved.
|
|
|
|
### IndeedHub and Immich Lifecycle Recovery
|
|
|
|
- Built and deployed backend hash `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9` to `.198`.
|
|
- IndeedHub focused audit is green after sequencing network alias repair immediately before frontend startup, after dependencies are running.
|
|
- Fedimint and NetBird focused audits are green; they were not current blockers after rerun.
|
|
- Immich was the broad-audit blocker and is now green:
|
|
- dependency readiness accepts healthy Podman health state for `immich_postgres` and `immich_redis` before falling back to slower exec probes;
|
|
- `immich_server` startup repairs `/var/lib/archipelago/immich` ownership with `podman unshare chown -R 0:0`, preserving upload data while matching the current rootless container user mapping;
|
|
- this fixed the observed `EACCES` on `/usr/src/app/upload/encoded-video/.immich`.
|
|
- Validation passed on latest hash:
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
|
|
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=netbird ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
- Residual risk remains: `.198` still intermittently logs `podman ps -a --format json timed out after 30s` and transient Bitcoin RPC timeouts under load. Continue avoiding store-wide Podman commands.
|
|
|
|
### Release Refactor Cleanup
|
|
|
|
- Built and deployed backend hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` to `.198`.
|
|
- Legacy package runtime host-port cleanup/repair now derives host ports from manifests when available.
|
|
- Hardcoded ports remain only as fallback for legacy/non-manifest apps and extra stale-port cleanup compatibility.
|
|
- Removed the duplicate Gitea-specific stale port cleanup helper.
|
|
- Validation passed on latest hash:
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
|
|
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
- Added focused runtime-host-port tests, but local `cargo test --manifest-path ../../core/Cargo.toml -p archipelago runtime_host_ports` did not finish within 5 minutes during compilation.
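
Sketch of the manifest-first port lookup described above, with hardcoded ports only as a fallback. The fallback table here is hypothetical and lists just two ports that appear elsewhere in these notes (Jellyfin `8096`, Uptime Kuma `3002`); the function names are not the backend's.

```rust
/// Hypothetical fallback table for legacy apps that have no manifest ports.
fn legacy_host_ports(app_id: &str) -> &'static [u16] {
    match app_id {
        "jellyfin" => &[8096],
        "uptime-kuma" => &[3002],
        _ => &[],
    }
}

/// Manifest-declared host ports win; the hardcoded table is used only when
/// the manifest is missing or declares no host ports.
pub fn host_ports(app_id: &str, manifest_ports: Option<&[u16]>) -> Vec<u16> {
    match manifest_ports {
        Some(ports) if !ports.is_empty() => ports.to_vec(),
        _ => legacy_host_ports(app_id).to_vec(),
    }
}
```

Keeping the fallback in one function makes it easy to see which apps still depend on hardcoded values and to delete entries as manifests gain coverage.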
|
|
|
|
### Catalog Metadata Generation
|
|
|
|
- Added `scripts/generate-app-catalog.py` to sync manifest-owned metadata into `app-catalog/catalog.json` and `neode-ui/public/catalog.json`.
|
|
- The generator updates fields that manifests already own: `title`, `version`, `description`, `dockerImage`, `category`, `tier`, `icon`, and `repoUrl`.
|
|
- The catalog still preserves catalog-only fields such as `author`, `requires`, `featured`, and rich `containerConfig` notes.
|
|
- Corrected stale manifest metadata for BotFights, IndeeHub, Gitea, LND, ElectrumX, Fedimint, and Mempool before generation.
|
|
- Release catalog drift is now zero:
|
|
- `python3 scripts/check-app-catalog-drift.py --release --strict` reports `metadata_drift=0`, `missing_catalog=0`, `missing_manifests=0`.
|
|
- Validation passed:
|
|
- `jq empty app-catalog/catalog.json neode-ui/public/catalog.json`.
|
|
- canonical and UI public catalogs match byte-for-byte.
|
|
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container`.
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
|
|
- `npm run build` from `neode-ui`.
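
The sync rule above (manifests own a fixed field list, catalog-only fields are preserved) can be illustrated in a few lines. The real generator is the Python script `scripts/generate-app-catalog.py`; this Rust sketch only models the rule on flat string maps, and `sync_manifest_fields` is a hypothetical name. The returned count mirrors the `updated 0 fields` idempotency check.

```rust
use std::collections::BTreeMap;

/// Fields the notes list as manifest-owned.
const MANIFEST_OWNED: [&str; 8] = [
    "title", "version", "description", "dockerImage",
    "category", "tier", "icon", "repoUrl",
];

/// Copies manifest-owned fields into the catalog entry and returns how many
/// values changed. Fields outside the list (for example `author`) are untouched.
pub fn sync_manifest_fields(
    catalog: &mut BTreeMap<String, String>,
    manifest: &BTreeMap<String, String>,
) -> usize {
    let mut updated = 0;
    for field in MANIFEST_OWNED {
        if let Some(value) = manifest.get(field) {
            if catalog.get(field) != Some(value) {
                catalog.insert(field.to_string(), value.clone());
                updated += 1;
            }
        }
    }
    updated
}
```

Running the sync twice in a row should report zero changes on the second pass, which is the property the catalog checks above rely on.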
|
|
|
|
### Podman Store-Risk Hardening
|
|
|
|
- Built and deployed backend hash `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2` to `.198`.
|
|
- Fresh local-build installs now treat `podman image exists <local-build-tag>` failure/timeout as "unknown/missing" and rebuild the local image instead of failing the lifecycle operation.
|
|
- This keeps local image store checks from being release-blocking while preserving bounded runtime timeouts and matching the existing drift-restart behavior.
|
|
- Validation passed on the latest hash:
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
|
|
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
- Added focused unit test coverage for the image-exists failure behavior, but local `cargo test --manifest-path core/Cargo.toml -p archipelago install_fresh_builds_when_image_exists_check_fails` did not complete within 15 minutes during compilation.
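
Sketch of the decision described above: an `image exists` check that fails or times out is treated as "unknown" and leads to a rebuild instead of a lifecycle failure. It assumes the check result is reduced to an optional exit code, with `0` meaning found, `1` meaning not found, and anything else (including no exit code at all) meaning the check itself failed. The type and function names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
pub enum ImageCheck {
    Present,
    Missing,
    Unknown,
}

/// `exit_code` is `None` when the check timed out or could not be run.
pub fn classify_image_exists(exit_code: Option<i32>) -> ImageCheck {
    match exit_code {
        Some(0) => ImageCheck::Present,
        Some(1) => ImageCheck::Missing,
        _ => ImageCheck::Unknown,
    }
}

/// Rebuild unless the image is positively known to be present.
pub fn should_rebuild(check: &ImageCheck) -> bool {
    !matches!(check, ImageCheck::Present)
}
```

The trade-off is an occasional unnecessary rebuild on a busy node in exchange for never failing an install because the store check was slow.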
|
|
|
|
### Container Health Fallback and Broad Lifecycle Green
|
|
|
|
- Built and deployed backend hash `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36` to `.198`.
|
|
- Fixed `container-health` broad lifecycle timeout behavior:
|
|
- `cached_reachable_health()` now parses ports from URLs with trailing slashes correctly, such as `http://localhost:2342/`.
|
|
- The local TCP fallback now covers the lifecycle web app ports, including PhotoPrism, BTCPay, LND UI, Mempool, Electrum, Fedimint, Gitea, IndeedHub, Ollama, Vaultwarden, Tailscale, and others.
|
|
- Cached-running apps with reachable local TCP listeners can report `healthy` without depending on flaky Podman health/inspect calls.
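
Sketch of the two pieces the fallback above depends on: pulling the port out of a launch URL that may end in a slash, and a bounded local TCP probe. This is not the body of `cached_reachable_health()`; `port_from_url` and `local_port_reachable` are hypothetical names, and URLs without an explicit port simply return `None` here.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Extracts an explicit port from URLs such as `http://localhost:2342/`.
pub fn port_from_url(url: &str) -> Option<u16> {
    // Drop the scheme if present, then keep only the authority part so a
    // trailing slash or path never ends up in the port string.
    let rest = url.split("://").nth(1).unwrap_or(url);
    let authority = rest.split('/').next()?;
    let (_, port) = authority.rsplit_once(':')?;
    port.parse().ok()
}

/// Bounded reachability check against a listener on 127.0.0.1.
pub fn local_port_reachable(port: u16, timeout: Duration) -> bool {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}
```

With these two helpers, a cached-running app can be reported `healthy` when its launch port answers locally, without calling Podman at all.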
|
|
- Validation passed on the latest hash:
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
|
|
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
|
|
|
|
### Generic Host-Port Health Checkpoint
|
|
|
|
- Built and deployed backend hash `3912b900c376b6c28bf5453640cae82135f67d7e0f984b8adcc78064b924143b` to `.198`.
|
|
- Confirmed objective remains: app behavior should be manifest/platform-primitive owned, not OS-image or per-app backend hack owned.
|
|
- Broad lifecycle on `d21202cd...` failed only on Uptime Kuma briefly showing `stopping` during listener repair; it recovered afterward.
|
|
- Fixed stale transitional merge: `Stopping -> Running` recovers when no user-stop marker exists; user-initiated stops still keep `Stopping`.
|
|
- Health monitor now derives required host TCP ports from Podman JSON `Ports` and marks running containers unhealthy when declared host listeners are missing.
|
|
- This is generic host-port health, not an app-specific mapping.
|
|
- After deploying `3912b900...`, Uptime Kuma recovered `3002` and returned HTTP `302` after backend restart.
|
|
- Jellyfin still needs follow-up: Podman reports `jellyfin Up ... (healthy)` with `0.0.0.0:8096->8096/tcp`, but `ss` shows no `8096` listener and `curl http://192.168.1.198:8096/` fails.
|
|
- Follow-up on `be95ea...` resolved the broad lifecycle timeout by hardening `container-health` fallback behavior.
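
Sketch of the transitional merge rule above: a cached `Stopping` that meets a live `Running` recovers to `Running` unless a user-stop marker exists. The state set is reduced to three variants for illustration, and `merge_state` is a hypothetical name; the backend's real state type has more cases.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Running,
    Stopping,
    Stopped,
}

/// Merges the cached transitional state with the freshly observed live state.
/// A user-initiated stop keeps `Stopping` even if the container still reports
/// `Running`; without the marker the live state wins.
pub fn merge_state(cached: State, live: State, user_stop_marker: bool) -> State {
    match (cached, live) {
        (State::Stopping, State::Running) if user_stop_marker => State::Stopping,
        _ => live,
    }
}
```

This keeps the UI honest in both directions: a listener repair no longer strands an app in `stopping`, and a real user stop is not overwritten by a scan that raced it.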
|
|
|
|
### Stale State and Jellyfin Pasta Listener Hardening
|
|
|
|
- Built and deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` to `.198`.
|
|
- `container-list` now overlays cached `exited` entries with targeted live state so scanner backoff does not leave lifecycle/UI reads stuck on stale `exited` after recovery.
|
|
- `container-health` now has a bounded cached-running plus local TCP reachability fallback for web apps, reducing dependency on slow/hung Podman inspect paths for health reads.
|
|
- Jellyfin was added to legacy runtime host-port repair for pasta listener `8096`.
|
|
- `package.restart jellyfin` still exposed a real Podman socket/runtime blocker after stopping the container: `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`.
|
|
- `package.start jellyfin` recovered the app afterward; `jellyfin` became `Up ... (healthy)`, `8096` had a `pasta.avx2` listener, and `http://192.168.1.198:8096/` returned HTTP `302`.
|
|
- Focused lifecycle passed on the latest hash:
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
|
|
- Release catalog drift check at this checkpoint still reported `missing_catalog=0`, `missing_manifests=0`, `metadata_drift=35`; the later catalog generation pass brought metadata drift to zero.
|
|
|
|
### Expanded Cleanup and Store-Safe Uninstall
|
|
|
|
- Built and deployed backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b` to `.198`.
|
|
- Expanded `system.disk-cleanup` to remove old rollback artifacts while keeping newest rollback points:
|
|
- `/usr/local/bin/archipelago.backup-*`: keeps the newest 3.
|
|
- legacy `/usr/local/bin/archipelago.bak*`: keeps the newest 3.
|
|
- `/usr/local/bin/archipelago.before-*`: keeps the newest 3 as part of legacy backend cleanup.
|
|
- `/opt/archipelago/web-ui.bak*`: keeps the newest 3.
|
|
- `/opt/archipelago/web-ui.old` included as web UI rollback cleanup.
|
|
- Live `system.disk-cleanup` reclaimed `10.3 GB`:
|
|
- `Removed old backend backups: 41.6 MB freed`.
|
|
- `Removed old legacy backend backups: 3.6 GB freed`.
|
|
- `Removed old web UI backups: 6.6 GB freed`.
|
|
- `Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes`.
|
|
- `/usr/local/bin` dropped to about `336M`.
|
|
- `/opt/archipelago` dropped to about `1.1G`.
|
|
- Removed global `podman volume prune -f` from uninstall. Uninstall now logs a skip and still removes explicit app data when `preserve_data=false`.
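
Sketch of the "keep newest 3" retention rule used for rollback artifacts above. It assumes each candidate has already been listed as a path plus a modification time in seconds; `stale_rollbacks` is a hypothetical name and the function only selects what to delete, it does not touch the filesystem.

```rust
/// Given `(path, mtime_seconds)` pairs for one artifact pattern, returns the
/// paths that fall outside the newest `keep` entries and may be removed.
pub fn stale_rollbacks(mut entries: Vec<(String, u64)>, keep: usize) -> Vec<String> {
    // Sort newest first so the first `keep` entries are the ones to retain.
    entries.sort_by(|a, b| b.1.cmp(&a.1));
    entries
        .into_iter()
        .skip(keep)
        .map(|(path, _)| path)
        .collect()
}
```

Separating selection from deletion makes the rule easy to unit test and lets the cleanup RPC log exactly which rollback points it is about to remove.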
|
|
|
|
### Startup Scan and Uptime Kuma Fixes
|
|
|
|
- Startup `adopt_existing()` is bounded with a 35s timeout.
|
|
- Initial container scan seeds the same 300s Podman scan backoff used by periodic scans.
|
|
- Legacy pasta restart paths use scoped `podman restart` instead of stop+start.
|
|
- Uptime Kuma was repaired:
|
|
- Before: container internally healthy on `127.0.0.1:3001`, but host `3002` had no pasta listener.
|
|
- After: `package.restart uptime-kuma` returns `{"status":"restarted"}` and `http://192.168.1.198:3002/` returns HTTP `302`.
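
Sketch of the shared scan backoff mentioned above, where the initial scan seeds the same 300s window that periodic scans use. The struct and method names are hypothetical, and the clock is passed in explicitly so the behavior can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

pub struct ScanBackoff {
    window: Duration,
    blocked_until: Option<Instant>,
}

impl ScanBackoff {
    pub fn new(window: Duration) -> Self {
        ScanBackoff { window, blocked_until: None }
    }

    /// Called after a scan (startup or periodic) to start the backoff window.
    pub fn seed(&mut self, now: Instant) {
        self.blocked_until = Some(now + self.window);
    }

    /// True when a new scan may run; false while the window is still open.
    pub fn allows_scan(&self, now: Instant) -> bool {
        match self.blocked_until {
            Some(until) => now >= until,
            None => true,
        }
    }
}
```

Using one `ScanBackoff` value for both the startup scan and the periodic scanner is what prevents a second full `podman ps` right after boot.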
|
|
|
|
### Cleanup and Catalog Work Already Done
|
|
|
|
- `system.disk-cleanup` intentionally skips Podman image/volume prune.
|
|
- `nostr-rs-relay` was added to both catalog surfaces.
|
|
- `scripts/check-app-catalog-drift.py --release --strict` reports zero missing catalog/manifest entries and zero metadata drift after catalog generation.
|
|
- Meshtastic `app.files` live behavior was validated: deleting `/var/lib/archipelago/meshtastic/config.yaml` and restarting recreated it from the manifest.
|
|
|
|
---
|
|
|
|
## Verification Already Run
|
|
|
|
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed for the currently deployed release-candidate line.
|
|
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed for the currently deployed release-candidate line.
|
|
- Broad lifecycle on hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` (current at the time of the run) passed:
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
|
|
- Targeted PhotoPrism audit on that same hash passed:
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`
|
|
- Focused lifecycle on hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` (current at the time of the run) passed:
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
|
|
- Live cleanup RPC passed and reclaimed `10.3 GB`.
|
|
- Focused lifecycle after expanded cleanup passed:
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
|
|
- Before the expanded cleanup pass, broad lifecycle also passed on hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28`:
|
|
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
|
|
- Direct app checks after latest cleanup passed:
|
|
- `http://192.168.1.198:3002/` -> HTTP `302`.
|
|
- `http://192.168.1.198:8096/` -> HTTP `302` after Jellyfin recovery/start.
|
|
- `http://192.168.1.198:8083/` -> HTTP `404` on `/`, which is expected for Filebrowser root probe behavior used here.
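
The direct app checks above could be repeated with a small probe along these lines. This is a sketch, not the project's lifecycle tooling: `probe_status` and `check_all` are hypothetical helpers, and the expected codes are copied from the list above. Redirects are deliberately not followed so that a `302` is reported as `302`.

```python
import urllib.error
import urllib.request

# Expected statuses taken from the direct app checks listed above.
EXPECTED = {
    "http://192.168.1.198:3002/": 302,
    "http://192.168.1.198:8096/": 302,
    "http://192.168.1.198:8083/": 404,
}


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; urllib then raises HTTPError for the 3xx


def probe_status(url, timeout=5.0):
    """Return the HTTP status for url, or None if the host is unreachable."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def check_all(expected, probe=probe_status):
    """Return {url: (actual, wanted)} for every probe that did not match."""
    failures = {}
    for url, wanted in expected.items():
        actual = probe(url)
        if actual != wanted:
            failures[url] = (actual, wanted)
    return failures
```

An empty result from `check_all(EXPECTED)` would correspond to the passing state recorded above.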

### Test Caveat

- Earlier local focused test commands timed out during first-time test binary compilation, but after compilation completed the full backend test target passed: `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
- Remaining workspace packages also pass checks/tests: `archipelago-container`, `archipelago-performance`, and `archipelago-security`.

---

## Critical Constraints

- Preserve app data.
- `.198` is the active validation node.
- Current live backend hash on `.198`: `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
- Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive unless explicitly testing them.
- Do not run destructive git commands.
- Do not run Podman store-wide cleanup or broad image/store commands on `.198` without a mitigation plan:
  - Avoid `podman system df`.
  - Avoid `podman image list` / `podman image ls`.
  - Avoid broad `podman image exists` loops.
  - Avoid `podman image prune` and `podman volume prune`.
- Podman store commands can hang and block app health under current `.198` load.
- Latest local mitigation: Rust release image-existence probes now use bounded targeted `podman image inspect` instead of `podman image exists` or `podman images -q`.
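
To illustrate the bounded, targeted probe described in the last constraint, here is a sketch in Python rather than the project's Rust. `image_present`, its default timeout, and the three-way return value are assumptions made for this example. A non-zero exit is treated as "absent" even though it may also indicate another Podman error, and a timeout is reported as undetermined so the caller can choose a fallback instead of blocking.

```python
import subprocess


def image_present(image, timeout=10.0, podman=("podman",)):
    """Return True if the image is present, False if inspect fails, None if undetermined.

    `podman` is the command prefix; it is a parameter only so the sketch can be
    exercised without a real Podman install.
    """
    cmd = [*podman, "image", "inspect", image]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # subprocess.run kills the child when this expires
            check=False,
        )
    except subprocess.TimeoutExpired:
        return None  # store did not answer in time; do not guess
    except OSError:
        return None  # binary missing or not executable
    return result.returncode == 0
```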

---

## Current Remaining Blockers

1. Podman socket/store health remains unresolved.
   - Need quarantine/mitigation strategy rather than store-wide commands in release paths.
   - Current release paths avoid prune and broad image-list/existence commands; orchestrator, companion, and legacy install image checks now use bounded `podman image inspect`.
   - Latest concrete failure remains historical: `package.restart jellyfin` stopped the container but failed to complete because Podman reported socket permission/runtime failure. `package.start jellyfin` recovered afterward.
   - Latest deployed hash still logged one initial `podman ps -a --format json` scan timeout/backoff, but focused and broad non-destructive lifecycle validation passed.

2. Release code-review/refactor gate is still open.
   - Reduce remaining app-specific Rust/OS branches where possible.
   - Review scanner, health, reconcile, and install/update paths for performance and store-risk.
   - Clean up dead transitional paths.

3. Clean release branch hygiene is not done.
   - Worktree is very dirty with many modified and untracked files.
   - Do not commit unless explicitly asked.

4. Full production validation still needed.
   - Broad non-destructive lifecycle is green on live hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
   - Backend restart validation has passed.
   - Run host reboot validation if approved.
   - Run selected full lifecycle tests for critical apps if time allows.

---

## Files Changed In Latest Pass

- `core/container/src/runtime.rs`
  - Changed Podman runtime `image_exists()` from `podman image exists` to a bounded targeted `podman image inspect` local-storage probe.

- `core/archipelago/src/api/rpc/package/install.rs`
  - Replaced legacy `podman images -q` local fallback and post-pull verification checks with bounded targeted `podman image inspect`.

- `core/archipelago/src/container/companion.rs`
  - Changed companion image existence checks from `podman image exists` to `podman image inspect`.

- `core/archipelago/src/container/prod_orchestrator.rs`
  - Updated image-existence failure test fixture wording for the new `image inspect` probe.

- Validation for latest local mitigation:
  - `cargo fmt --all --check` passed.
  - `cargo check -p archipelago-container` passed.
  - `cargo check -p archipelago` passed.
  - `CARGO_INCREMENTAL=0 cargo check -p archipelago --tests` passed.
  - `cargo test -p archipelago-container` passed (`43` tests).
  - `git diff --check -- <changed files>` passed.
  - Filtered `cargo test -p archipelago install_fresh_build` did not complete: one run hit a `rust-lld` undefined hidden symbol artifact/link failure after concurrent Cargo jobs; the sequential `CARGO_INCREMENTAL=0` rerun exceeded 10 minutes during compile, but test-target compilation passed afterward.

- `core/archipelago/src/api/rpc/system/handlers.rs`
  - Calls expanded rollback cleanup helpers and reports reclaimed bytes.

- `core/archipelago/src/api/rpc/system/mod.rs`
  - Added cleanup helpers for legacy backend backups and web UI rollback backups.
  - Uses size accounting for directories before removal.
  - Keeps newest rollback artifacts instead of deleting all.

- `core/archipelago/src/api/rpc/package/runtime.rs`
  - Skips global `podman volume prune -f` during uninstall.
  - Adds Jellyfin `8096` to runtime host-port/pasta cleanup repair.
  - Derives legacy runtime host-port cleanup/repair ports from manifests.
  - Keeps compatibility fallback ports for legacy/non-manifest apps and removes duplicate Gitea stale-port cleanup code.

- `core/archipelago/src/api/rpc/container.rs`
  - Adds stale cached `exited` refresh for `container-list`.
  - Adds cached-running plus local TCP reachability fallback for `container-health`.
  - Fixes fallback URL port parsing and expands lifecycle web app port coverage.

- `core/archipelago/src/container/prod_orchestrator.rs`
  - Rebuilds local-build images when `image_exists` fails/times out instead of failing fresh install.
  - Adds focused unit test coverage for that behavior.

- `scripts/generate-app-catalog.py`
  - Generates/syncs public catalog metadata from manifest-owned fields.

- `app-catalog/catalog.json` and `neode-ui/public/catalog.json`
  - Generated from current manifests; files match byte-for-byte.

- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
  - Added latest deployment, cleanup, validation, and residual-risk checkpoint.

- `docs/MIGRATION_STATUS_REPORT.md`
  - Updated current hash, root disk state, and remaining blockers.

- `docs/RESUME.md`
  - This file, replacing stale April migration resume content.
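
The rollback cleanup behavior listed for `core/archipelago/src/api/rpc/system/mod.rs` (measure size before removal, keep the newest artifacts, report reclaimed bytes) can be sketched as follows. This is an illustrative Python approximation, not the Rust helpers themselves; `tree_size`, `prune_backups`, and the default `keep=2` are assumptions.

```python
import shutil
from pathlib import Path


def tree_size(path):
    """Total bytes of regular files under path (symlinks are not counted)."""
    path = Path(path)
    if path.is_file():
        return path.stat().st_size
    return sum(
        p.stat().st_size
        for p in path.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def prune_backups(backup_dir, keep=2):
    """Delete all but the `keep` newest entries; return bytes reclaimed."""
    entries = sorted(
        Path(backup_dir).iterdir(),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    reclaimed = 0
    for stale in entries[keep:]:
        size = tree_size(stale)  # account for size before anything is removed
        if stale.is_dir() and not stale.is_symlink():
            shutil.rmtree(stale)
        else:
            stale.unlink()
        reclaimed += size
    return reclaimed
```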

---

## Suggested Next Steps

1. Re-read the three docs:
   - `docs/RESUME.md`
   - `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
   - `docs/MIGRATION_STATUS_REPORT.md`

2. Verify latest `.198` state:
   - `ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 'df -h / /var/lib/archipelago; systemctl is-active archipelago.service; systemctl is-active archipelago-doctor.timer 2>/dev/null || true; systemctl is-active archipelago-reconcile.timer 2>/dev/null || true; sha256sum /usr/local/bin/archipelago'`

3. Start Podman-store-risk review:
   - Search for image/store operations: `image_exists`, `podman image`, `podman system`, `podman prune`, `volume prune`.
   - Prefer targeted container status/API calls with timeouts.
   - Avoid new broad store commands.

4. Continue release code-review/refactor cleanup.

5. If approved, run backend-restart validation and then host-reboot validation.
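
For the store-risk search in step 3, a simple local scan could look like the sketch below. The search terms come from step 3; `find_store_risks` and the file suffix list are assumptions. Plain substring matching will miss commands assembled from separate argument strings, so it is a starting point for review rather than a complete audit.

```python
from pathlib import Path

# Search terms copied from step 3 above.
RISK_PATTERNS = (
    "image_exists",
    "podman image",
    "podman system",
    "podman prune",
    "volume prune",
)


def find_store_risks(root, suffixes=(".rs", ".sh", ".py")):
    """Return (path, line_number, pattern) for each matching source line under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for pattern in RISK_PATTERNS:
                if pattern in line:
                    hits.append((str(path), lineno, pattern))
    return hits
```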

---

## Current Release Readiness Estimate

- Credible release candidate: closer now, roughly `87-91%`.
- Production-quality release developers will love: still closer to `73-79%`.

The biggest improvement in the latest pass is that broad lifecycle is green again on the latest backend. The biggest remaining technical risk is Podman store/socket health.