archy/docs/MIGRATION_STATUS_REPORT.md
2026-06-11 00:24:54 -04:00

# Migration Status Report
Last updated: 2026-06-11
## Goal
Make Archipelago's app/container system developer-ready and release-ready: app installs, lifecycle, recovery, and integrations should be portable, manifest-driven, and not rely on one-off OS-level changes or hardcoded Rust branches for each new app. The OS/backend should provide generic primitives for manifests, Quadlet rendering, lifecycle, health/readiness, dependency ordering, data ownership, image availability, bind mounts, secrets, app files, networking, bridge/signer integrations, and recovery.

The developer contract should be clear enough that a third-party developer can build and ship an Archipelago app from documentation plus manifest/schema examples. If an app needs a capability the platform does not yet expose, the release direction is to add a reusable manifest/orchestrator primitive rather than a special case tied to that app. This is the standard for the `1.8-alpha` app migration: professional app delivery, predictable behavior after restart/reboot, and a path for user-installed/community apps that does not require rebuilding the OS image for every app.

Release quality bar: every supported app must install, stop, start, restart, uninstall, survive host reboot, report accurate status, and expose clear install/uninstall progress. Stale health notifications must not persist across login or refresh after the underlying condition has cleared. Final release validation should run on the intended release validation server, not drift between appliances without an explicit checkpoint.

Target release: `1.8-alpha`, including a cut and smoke-tested ISO once validation is green.

Current release readiness estimate: about `82%`. The remaining percentage is mostly post-reboot recovery confidence, repeated reboot validation, and ISO creation/smoke testing rather than the core manifest/catalog migration itself.
## Current Result
- The migration is not yet release-complete, but it is tracking the intended direction.
- Portainer, Filebrowser, BTCPay, Grafana, Nostr Relay, SearXNG, Gitea, and key dependency units have moved further into the manifest/orchestrator path.
- `.198` has passed focused and broad lifecycle audits for the already migrated set.
- Meshtastic is now routed through the orchestrator path, no longer falls back to legacy `localhost/meshtastic:latest`, and has passed full lifecycle validation on `.198`.
- On 2026-06-02, focused and broad `.198` non-destructive lifecycle audits passed after clearing a wedged `nextcloud` Podman record. The live registry config already has OVH primary plus tx1138 mirror, and Meshtastic/Portainer were added to the catalog surfaces.
- Later on 2026-06-02, the then-current release backend hash `579b823cf4a4b8c50bb3d0c3d49449c58101b016eb6ebc8049975dce98e34265` was found active and stable on `.198`. Meshtastic `app.files` rendering was proven live by removing `/var/lib/archipelago/meshtastic/config.yaml`, restarting through `package.restart`, and verifying that the manifest recreated the file. Focused Meshtastic, focused `meshtastic,jellyfin,filebrowser`, and broad non-destructive audits all passed afterward; the raw Podman sweep was clean.
- Work on the remaining release gate continued on 2026-06-02: bounded disk cleanup, journal retention, backend-backup retention, and release-focused catalog drift classification were added. `.198` was then active on backend hash `e285d421cef497beb6b4b929f36fb4296d6db1f4a4c786157b6751eec51619ca`; focused and broad post-cleanup lifecycle audits passed, and the final raw Podman sweep was clean.
- A follow-up found that Podman store commands can hang on `.198` beyond image prune (`podman system df`, image list/exists, and sometimes broad ps/inspect). The release cleanup path now skips Podman image/volume prune rather than touching that unstable path. `.198` was then active on backend hash `c9695dc3db10ff6e593cdbcfbbdc94b2e98b6008aa62655bba51b9879b549e8c`; Uptime Kuma was repaired with a normal `package.restart`; focused and broad post-repair lifecycle audits passed, and the final raw bad-state sweep was clean.
- On 2026-06-03, startup/adoption scanner hardening and pasta restart repair were deployed. `.198` was then active on backend hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28`; `package.restart` for Uptime Kuma now returns successfully and restores the `3002` pasta listener; focused `meshtastic,jellyfin,filebrowser,uptime-kuma` and broad lifecycle audits passed.
- Later on 2026-06-03, expanded rollback cleanup and store-safe uninstall hardening were deployed. `.198` was then active on backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b`; `system.disk-cleanup` reclaimed `10.3 GB` from old backend and web UI rollback artifacts while still skipping Podman prune, and focused `meshtastic,jellyfin,filebrowser,uptime-kuma` lifecycle passed afterward.
- The last 2026-06-03 follow-up deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e`. It mitigates stale cached `container-list` state during Podman scan backoff, adds a bounded TCP reachability fallback for `container-health`, and adds Jellyfin `8096` to legacy pasta host-listener repair. Focused `meshtastic,jellyfin,filebrowser,uptime-kuma` lifecycle passed on this hash. Broad lifecycle still needed a rerun on this hash.
- The next validation backend hash was `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b`. It keeps the generic host-listener health direction, preserves the `container-health` fallback fix from `be95ea...`, hardens fresh local-build installs so that `podman image exists <local-build-tag>` failures/timeouts trigger a rebuild instead of failing the lifecycle operation, and reduces duplicated legacy runtime port repair by deriving host ports from manifests. Targeted PhotoPrism and broad non-destructive `.198` lifecycle audits passed on this hash.
- Catalog metadata generation from manifests is now implemented via `scripts/generate-app-catalog.py`. The canonical catalog and UI public catalog are synced from manifest-owned fields, strict release drift is zero, and frontend build validation passed.
- Current live `.198` validation backend hash is `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`. Broad non-destructive lifecycle is green on that deployed line after app health/port recovery, IndeeHub recovery, scoped legacy install hardening, and bounded Podman pull hardening.
- Local release validation now passes the full backend binary test target and every Rust workspace member after release cleanup fixes for scanner backoff wakeups, crash-recovery tests, manifest-port lookup, journal parsing, and boot-reconciler test determinism.
- Frontend release validation now passes `npm run type-check`, `npm test` (`548` tests), and `npm run build` after fixing mobile app-launch routing for new-tab apps and updating stale launch tests. Local `npm ci` is blocked by root-owned `neode-ui/node_modules` entries, so dependency reinstall remains a local environment cleanup item requiring explicit approval.
- Reboot validation is not yet green. A user reported that a reboot test left IndeeHub stopped afterward, with multiple containers killed by SIGKILL during shutdown/reboot and at least one crash. Treat post-reboot recovery as the active release blocker.
- Local follow-up now hardens IndeeHub stack boot recovery and updates lifecycle validation so IndeeHub must still serve the Nostr signer bridge (`/nostr-provider.js`) before a launch probe passes.
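
The bounded TCP reachability fallback described above for `container-health` can be pictured with a short Python sketch. It is illustrative only and not the backend's implementation; the helper name `tcp_reachable`, its default timeout, and its boolean return shape are assumptions.

```python
import socket


def tcp_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Hypothetical helper: the name, default timeout, and return shape are
    assumptions for illustration, not the backend's actual API.
    """
    try:
        # create_connection applies the timeout to the connect attempt,
        # so a wedged listener cannot block the health check indefinitely.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A caller could, for example, probe the Jellyfin host listener with `tcp_reachable("127.0.0.1", 8096)` when the primary health source is unavailable. The point of the bound is that the fallback itself can never hang longer than the timeout.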
## Completed In This Pass
- Pause checkpoint for resume: generated app-session metadata now covers manifest-owned launch ports, titles, and new-tab behavior. The next migration step should continue from proxy path/companion UI alias generation or return to the release blocker around post-reboot IndeeHub recovery.
- Updated `docs/APP-PACKAGING-MIGRATION-PLAN.md` to reflect the current `apps/<app-id>/manifest.yml` contract, replacing stale `archy-app.yml` next-step language with the actual parser/generator/orchestrator progress and the remaining migration blockers.
- Updated `docs/app-developer-guide.md` so developers see the current manifest fields, generated catalog flow, validation commands, and release lifecycle expectations instead of the older Nostr marketplace publish/trust-score draft.
- Verified the developer-guide manifest example parses as YAML, `scripts/generate-app-catalog.py` is idempotent, strict release catalog drift remains zero, and `git diff --check` is clean for the migration docs.
- Extended `scripts/generate-app-catalog.py` to also emit `neode-ui/src/views/appSession/generatedAppSessionConfig.ts` from manifests, and wired `appSessionConfig.ts` to merge generated launch ports/titles/new-tab launch behavior with the existing manual overrides for companion UIs and aliases.
- Added a Fedimint `interfaces.main` launch declaration for the Guardian wait/proxy UI on port `8175`, so that public launch surface is now represented in the manifest.
- Focused validation passed for the generated app-session path: Python helper compile, generator idempotence, strict catalog drift, `appSessionConfig.test.ts`, and frontend type-check.
- Aligned `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` with the current manifest/runtime contract so the release docs no longer describe the stale marketplace-style schema.
- Removed the hardcoded Portainer host-prep path and replaced it with a manifest plus generic Podman socket bind-mount preparation.
- Added generic Quadlet health drift detection for command, interval, timeout, and retry changes.
- Made rendered HTTP health helpers honor manifest timeouts.
- Added image availability guards before Quadlet starts/restarts so pruned images are pulled or built before systemd tries to start them.
- Fixed stale dependency handling so active manifest dependencies are not suppressed by old `user-stopped.json` entries.
- Added parent-app reconcile syncing for dependency Quadlet units.
- Validated Portainer, Filebrowser, BTCPay, and broad non-destructive audits on `.198`.
- Updated Meshtastic manifest to use a real available image, the real `/dev/ttyUSB0` device, the actual daemon data path, and a non-HTTP health check.
- Updated the lifecycle harness so non-HTTP apps do not require launch metadata.
- Added a generic manifest-owned file rendering primitive under `app.files` so apps can declare required bind-mounted config files without adding app-specific Rust/OS branches.
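
The health drift detection listed above compares four settings: command, interval, timeout, and retries. A minimal Python sketch of that comparison follows. The `HealthSpec` type, its field names, and the `health_drift` function are hypothetical and chosen for illustration; the real detection lives in the backend and may compare rendered unit text instead.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HealthSpec:
    """Hypothetical container for the four health settings named above."""
    command: str
    interval: str
    timeout: str
    retries: int


def health_drift(desired: HealthSpec, rendered: HealthSpec) -> list[str]:
    """Return the names of health fields whose values differ.

    An empty list means the rendered unit matches the manifest and
    no re-render is needed.
    """
    return [
        f.name
        for f in fields(HealthSpec)
        if getattr(desired, f.name) != getattr(rendered, f.name)
    ]
```

Returning the drifted field names, rather than a bare boolean, lets the caller log exactly which setting triggered a re-render.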
## Current `.198` State
- `archipelago.service`: active.
- `archipelago-doctor.timer`: inactive.
- `archipelago-reconcile.timer`: inactive.
- Current validation backend hash: `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
- `.198` root filesystem pressure is currently resolved for release validation: latest sweep showed `/` at 65% used with about 9.6G free after expanded rollback cleanup.
- Latest focused Fedimint, Immich, IndeeHub, and PhotoPrism audits passed on the current hash.
- Broad non-destructive lifecycle passed on the current hash before and after backend restart validation.
## Meshtastic Status
- Orchestrator routing is fixed and verified by the generated Quadlet unit.
- Current generated unit uses:
  - `Image=docker.io/meshtastic/meshtasticd:daily-alpine`
  - `Volume=/var/lib/archipelago/meshtastic:/var/lib/meshtasticd:Z`
  - `AddDevice=/dev/ttyUSB0`
  - `HealthCmd=test -f /var/lib/meshtasticd/config.yaml`
- The daemon starts and accepts TCP API connections on port `4403`.
- Full lifecycle passed on `.198`: install, stop, start, restart, uninstall with preserved data, and reinstall.
- A persisted `config.yaml` is required. The release path is now the generic `app.files` manifest primitive rather than a Meshtastic-specific backend hook, and this has been verified live on `.198` by deleting the file and proving `package.restart` recreates it from the manifest.
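
The delete-then-restart check above relies on declared files being recreated when they are missing. The Python sketch below shows one way such a step could behave. The function name `ensure_app_files`, the path-to-content mapping, and the rule that existing files are left untouched are all assumptions; the actual `app.files` schema and overwrite semantics are defined by the manifest parser and are not reproduced here.

```python
from pathlib import Path


def ensure_app_files(declared: dict[str, str], root: Path) -> list[Path]:
    """Create any declared file that is missing under root; return created paths.

    Assumption: existing files are left untouched. The real primitive may
    overwrite or template content differently.
    """
    created = []
    for relative_path, content in declared.items():
        target = root / relative_path
        if target.exists():
            continue
        # Make sure the bind-mount source directory exists before writing.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        created.append(target)
    return created
```

Running the function twice in a row creates nothing the second time, which is the idempotence a restart path needs.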
## Release Blockers
- Continue monitoring the current optimized release backend on `.198`; the previously observed release-binary segfault is not reproducing with hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
- `system.disk-cleanup` now handles journal, backend-backup, legacy backend rollback, and web UI rollback retention while intentionally skipping Podman image/volume prune because Podman store commands can hang on `.198` under current load. Diagnose Podman store health separately from the release cleanup path.
- Release image probes have been further quarantined from the fragile Podman store commands and deployed to `.198` on backend hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`: runtime, legacy install, and companion image checks now use bounded targeted `podman image inspect` instead of `podman image exists` or `podman images -q`. Focused and broad non-destructive lifecycle validation passed on the deployed hash.
- Podman socket/runtime health remains a release blocker: `package.restart jellyfin` stopped the container but failed to complete because Podman reported `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`; `package.start jellyfin` recovered the app and the focused lifecycle passed afterward.
- Release-focused catalog drift now has zero missing catalog/manifest entries and zero metadata drift after generating catalog metadata from manifests.
- Backend-restart validation passed. Host-reboot validation has failed and remains pending on post-reboot IndeeHub recovery. Reboot retests should run only after an explicit release checkpoint/approval.
- The local code-review/refactor cleanup gate now has full local validation coverage:
  - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` passed (`688` tests);
  - all other workspace packages check/test clean;
  - frontend type-check/tests/build passed;
  - release build, catalog drift, catalog idempotence, Python helper compile, and whitespace checks passed.
- Before `1.8-alpha` release:
  - deploy the post-reboot recovery fixes;
  - prove focused IndeeHub lifecycle with Nostr signer injection intact;
  - update the app packaging/developer docs so `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` match the current manifest/runtime contract and release-quality lifecycle expectations;
  - complete the required refactor/remove-dead-code gate after correctness validation: remove obsolete transitional code, stale per-app hacks, duplicate lifecycle paths, and misleading compatibility fallbacks, then rerun release validation;
  - require at least 3 consecutive clean post-fix reboots with broad non-destructive lifecycle green after each;
  - prefer 5 consecutive clean reboots for production-release confidence;
  - cut and smoke-test the `1.8-alpha` ISO.
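
The reboot criterion above (at least 3 consecutive clean reboots, preferably 5) can be stated as a small check over the reboot history. This Python sketch is a hypothetical helper, not part of the lifecycle harness; it assumes each entry is `True` only when the reboot was clean and the broad non-destructive lifecycle audit was green afterward.

```python
def reboot_gate(results: list[bool], required: int = 3) -> bool:
    """Return True if the most recent `required` reboot runs were all clean.

    `results` is ordered oldest to newest. Hypothetical helper: the name
    and input shape are assumptions for illustration.
    """
    if required <= 0:
        raise ValueError("required must be positive")
    # Only the trailing streak counts: a failure resets the sequence.
    return len(results) >= required and all(results[-required:])
```

For example, `reboot_gate([True, False, True, True])` is `False` because the failed run interrupts the streak, while `reboot_gate([False, True, True, True])` is `True`. Passing `required=5` expresses the stricter production-confidence target.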
## Bottom Line
We are working toward the intended goal: better than Umbrel/StartOS by making app behavior declarative and registry/manifest-owned. The migration is substantially advanced, Meshtastic manifest-owned config generation is verified live, catalog metadata is generated from manifests, disk cleanup/backup retention is in place without Podman prune risk, and full local backend/frontend workspace validation has been green. Remaining follow-up for `1.8-alpha` is post-reboot recovery validation, especially IndeeHub plus Nostr signer behavior, repeated reboot passes, ISO cut/smoke test, separate Podman socket/store-health diagnosis, and optional local cleanup of root-owned frontend dependencies before rerunning `npm ci`.