From 182f18ecf322b55a5828cedc2d9bd239fda550c8 Mon Sep 17 00:00:00 2001
From: archipelago
Date: Thu, 11 Jun 2026 00:24:54 -0400
Subject: [PATCH] docs: capture 1.8 app migration release plan

---
 CHANGELOG.md | 5 +
 docs/1.8-alpha-improvements-tracker.md | 216 ++++++
 docs/APP-PACKAGING-MIGRATION-PLAN.md | 441 +++++++++++++
 docs/CONTAINER_LIFECYCLE_HANDOFF.md | 578 +++++++++++++++-
 docs/CURRENT_AGENT_HANDOFF.md | 216 ++++++
 docs/MIGRATION_STATUS_REPORT.md | 105 +++
 docs/NEXT_TERMINAL_HANDOFF.md | 572 ++++++++++++++++
 docs/RESUME.md | 880 ++++++++++++++++++++++---
 docs/app-developer-guide.md | 304 +++++----
 docs/bitcoin-rpc-relay.md | 280 ++++++++
 10 files changed, 3392 insertions(+), 205 deletions(-)
 create mode 100644 docs/1.8-alpha-improvements-tracker.md
 create mode 100644 docs/APP-PACKAGING-MIGRATION-PLAN.md
 create mode 100644 docs/CURRENT_AGENT_HANDOFF.md
 create mode 100644 docs/MIGRATION_STATUS_REPORT.md
 create mode 100644 docs/NEXT_TERMINAL_HANDOFF.md
 create mode 100644 docs/bitcoin-rpc-relay.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 96b28205..e12be7f8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,10 @@
 # Changelog
 
+## v1.7.83-alpha (2026-05-26)
+
+- Portainer installs and repairs now provision `catatonit`, pre-create persistent compose storage, and expose host `/data` to Portainer's data directory so Portainer-launched stacks no longer fail with missing init binary or `/data/compose/...` bind-mount permission errors.
+- Existing Portainer containers are recreated when missing the corrected compose bind mount, and first-boot/deploy paths now use the rootless Podman socket consistently.
+
 ## v1.7.82-alpha (2026-05-22)
 
 - Saleor storefront proxying now forwards `X-Forwarded-Host`, fixing Next.js Server Actions requests that compared the browser origin with the internal `storefront-app:3000` upstream host.
diff --git a/docs/1.8-alpha-improvements-tracker.md b/docs/1.8-alpha-improvements-tracker.md
new file mode 100644
index 00000000..d98a8446
--- /dev/null
+++ b/docs/1.8-alpha-improvements-tracker.md
@@ -0,0 +1,216 @@
+# 1.8-alpha Improvements Tracker
+
+Last updated: 2026-06-11 00:17 EDT
+
+This tracks the user-facing improvement list that must land with the `1.8-alpha`
+container migration release and the next ISO cut produced from that release. It
+is intentionally separate from the container handoff docs, but should be treated
+as release and ISO smoke-test scope.
+
+Status legend:
+
+- `todo`: not started.
+- `in-progress`: active local work or validation.
+- `blocked`: needs host access, hardware, credentials, a product decision, or an
+  external artifact.
+- `done`: implemented and validated for this release.
+- `defer?`: candidate to explicitly defer from `1.8-alpha` after product review.
+
+Resume protocol:
+
+1. Read this file after `docs/NEXT_TERMINAL_HANDOFF.md`.
+2. Keep every user-requested improvement represented here until it is either
+   `done` or explicitly moved out of `1.8-alpha` by product decision.
+3. When implementation starts, change status to `in-progress` and add the file,
+   test, host, or design decision being worked.
+4. Mark `done` only after the change is implemented and validated locally or on
+   the release validation host, as appropriate.
+5. Before cutting the next ISO, run this checklist as part of ISO smoke testing.
+
+Active-session note, 2026-06-10 05:48 EDT: resumed from
+`docs/NEXT_TERMINAL_HANDOFF.md`; no `.198` host actions have been run yet. The
+immediate tracker-affecting local gate is rerunning the focused Rust
+`container::image_versions::tests` validation for the Nextcloud false-update
+row, then continuing lifecycle/control-plane truthfulness work.
+
+Resume-save checkpoint, 2026-06-10 08:32 EDT: the current pass stayed on the
+fixes backlog, not app migration.
No `.198` host actions were run, no dev server
+was intentionally left running, and no long-running validation command is
+expected to still be active. Continue from the in-progress `Make tabs info load
+quickly or show loading states` row or the next unresolved fixes-backlog row.
+
+Active-session progress: `git diff --check` passed. Focused image-version Rust
+validation is still inconclusive because the tool PTY stayed open with no
+active compiler process visible, a bounded 300s retry using the normal
+workspace target exited `124` before test output, and a fresh 600s retry in
+`/tmp/archy-cargo-image-versions-2` also exited `124` after compiling into the
+`archipelago` crate without reaching test output. The Nextcloud false-update
+row remains `in-progress`. A local lifecycle fix is in progress so migrated
+single-orchestrator app stops return immediately with a transitional state
+instead of blocking the UI while Podman cleanup runs; `cargo fmt --check` and
+focused backend compile check passed, and `git diff --check` is clean. Latest
+credentials backlog follow-up added backend PhotoPrism credentials, centered
+the mobile credential pre-launch modal in My Apps and the icon grid, and passed
+focused frontend tests, type-check, backend compile check, `cargo fmt --check`,
+and `git diff --check`. Web5 Connected Nodes Messages/Requests, Web5
+Identities, and DWN message browsing now preserve visible content during
+refresh/failure and show compact refresh labels instead of replacing populated
+tabs with loading panels; focused tests and type-check passed. Server Network
+overview, Network Interfaces, and Tor Services cards now keep visible values
+during refresh or refresh failure and show compact refresh labels instead of
+reverting to skeletons or false empty states; focused test and type-check
+passed.
The standalone Credentials view now keeps credential rows visible
+during refresh/failure and shows `Refreshing credentials...`; focused test and
+type-check passed. Lightning Channels now keeps existing channels visible
+during refresh/failure and shows `Refreshing channels...`; focused test and
+type-check passed. Peer Files now keeps existing peer catalog items visible
+during Tor refresh/failure and shows `Refreshing peer files...`; focused test,
+type-check, and `git diff --check` passed. Cloud peer cards now remain visible
+during federation peer-list refresh/failure with `Refreshing peer nodes...`;
+focused test, type-check, and `git diff --check` passed. The Web5 Verifiable
+Credentials summary now keeps credential rows visible during refresh/failure
+with `Refreshing credentials...`; focused test, type-check, and
+`git diff --check` passed. Web5 Nostr Relays now keeps relay stats visible
+during refresh/failure with `Refreshing relays...`; focused test, type-check,
+and `git diff --check` passed. Web5 Domains now keeps registered-name counts
+visible during refresh/failure with `Refreshing domains...`; focused test,
+type-check, and `git diff --check` passed. Settings Backups now keeps existing
+backup rows visible during refresh/failure with `Refreshing backups...`;
+focused test, type-check, and `git diff --check` passed. Settings Transport
+Preferences now keeps preference controls visible during refresh/failure with
+`Refreshing transport preferences...`; focused test, type-check, and
+`git diff --check` passed. Settings VPN status now keeps current connection
+details visible during refresh/failure with `Refreshing VPN status...`;
+focused test, type-check, and `git diff --check` passed. Web5 Federation now
+shows `Refreshing federation...` during summary refresh and keeps existing node
+counts/DID visible on refresh failure; focused test, type-check, and
+`git diff --check` passed.
Mesh map denied-location behavior now has component
+coverage proving browser location denial reports that peer positions can still
+appear without requiring local location; focused test, type-check, and
+`git diff --check` passed. Companion/app-session mobile tab-app handling now
+keeps apps that require a new tab inside the mobile session fallback instead of
+auto-opening an external tab and closing; focused app-session, launcher, and
+config tests passed with type-check and `git diff --check`.
+Nostr Discoverable Nodes now keeps discovered rows visible during relay refresh
+or relay failure and shows `Searching relays...`; focused test, type-check, and
+`git diff --check` passed. App Store/App Details screenshot sections now render
+only real screenshot metadata and no longer show fake placeholder tiles when no
+assets exist; focused App Details content and marketplace handoff tests,
+type-check, and `git diff --check` passed. Home now has an App Store
+recommendations card driven by uninstalled core/recommended marketplace apps;
+the recommendations respect installed aliases so apps drop out after install
+and move into normal My Apps/Home behavior. Focused helper tests, type-check,
+`git diff --check`, and the Playwright Home dashboard smoke passed. Easy Mode
+goal configure steps now route to their owning app/screen, verify steps have an
+explicit `Check & Continue` action, and configure/info/verify actions start
+goal progress before completing the step; focused goal action/store tests,
+type-check, and `git diff --check` passed. Setup path selection no longer shows
+the disabled `Connect Existing (Coming Soon)` option; Fresh Start and Restore
+from Seed are the only visible choices and route correctly. Focused onboarding
+option/composable tests, type-check, and `git diff --check` passed.
Header
+responsiveness follow-up restored the primary My Apps/App Store/Websites
+navigation to persistent desktop tabs at `md+` on My Apps, Discover, and
+Marketplace; removed the desktop primary dropdowns; kept mobile dropdown
+behavior; delayed App Store category collapse by lowering the search reserve and
+header gap; and removed the My Apps desktop category dropdown. Focused
+Marketplace/App config tests, type-check, and scoped `git diff --check` passed.
+Browser smoke against the already-running local Vite/mock session is still next.
+
+Done criteria for this tracker:
+
+- Code/UI items: implemented, covered by targeted test or manual smoke check,
+  and no known regression against the container migration work.
+- Runtime/container items: validated on the release host named in
+  `docs/NEXT_TERMINAL_HANDOFF.md`, then included in ISO smoke test scope.
+- Product-decision items: documented decision plus implementation task if the
+  decision keeps it in `1.8-alpha`.
+- External/hardware items: hardware/document/access obtained, or explicitly
+  deferred from the release by product decision.
+
+## Release-Critical Runtime Gates
+
+| Item | Status | Release question / blocker |
+| --- | --- | --- |
+| Check logs of every server for errors and fix | blocked | Needs explicit target server list. Current docs name `.198`; are there more production validation hosts? |
+| Go through issues on gate | blocked | Need location of "gate" issue tracker/board and access details. |
+| Sort out container tagging so databases, backend, etc are sorted properly | in-progress | Tie to manifest/catalog metadata and My Apps grouping. |
+| Sort out supplementary container naming so it is better | in-progress | Needs naming convention for dependencies: app-prefixed service names vs role-first names. |
+| Figure out how we offer updates to apps | todo | Product/runtime design needed: manual update, scheduled checks, or auto-update by app tier.
|
+| Figure out how we provide different versions for Bitcoin to download and keep updated automatically | todo | Requires release policy for Knots/Core versions and whether users may pin old versions. |
+| Make sure all credentials are given for apps without registration | in-progress | File Browser now exposes credentials on App Details and in the pre-launch interstitial. Backend `package.credentials` returns the secured File Browser password from `/var/lib/archipelago/secrets/filebrowser/password` when present, with `admin/admin` fallback matching the install hook. PhotoPrism now exposes manifest-backed `admin` / `archipelago` credentials from both backend `package.credentials` and the frontend fallback. My Apps and mobile icon-grid credential pre-launch modals are vertically centered on mobile. Covered by `appCredentials.test.ts`, `AppIconGrid.test.ts`, local type-check, backend compile check, `cargo fmt --check`, and `git diff --check`. Grafana was not added because `GRAFANA_ADMIN_PASSWORD` is not resolved to a known repo default/secret. Remaining no-registration apps still need inventory. |
+| Nextcloud always shows update, and how are apps actually updated? | in-progress | Nextcloud manifest/catalog metadata is aligned to the pinned `nextcloud:29` image, and update detection now ignores registry-host-only image changes while still reporting real same-repo tag drift. Catalog drift check passed. Backend focused test was added but local validation hit a Rust linker/incremental artifact failure, then bounded retries exited `124` before test output, including a 600s fresh-target retry on 2026-06-10. Broader app update UX/policy design still needed. |
+| Make sure Tor is solid as having to rotate addresses to get it to work | todo | Needs `.198`/target-host Tor logs and reproducible failure case.
|
+| Fix fleet it does not seem to work | done | Fleet data now preserves existing nodes during refresh, exposes an explicit refreshing state, sorts online nodes first, avoids duplicate history fetches when selecting a node, accepts backend `entries` and legacy `history` response shapes for per-node charts, and uses readable loading/auto-refresh UI. Covered by `useFleetData.test.ts`, local type-check, targeted tests, and user visual review of the Fleet header/card treatment. |
+| Check Beta Telemetry and how it works | done | Telemetry is opt-in via `analytics-config.json`; the background reporter runs every 15 minutes only when enabled, saves `telemetry-latest.json`, writes local Fleet reports/history under `telemetry-fleet/`, and optionally POSTs a `telemetry.ingest` JSON-RPC envelope to `TELEMETRY_COLLECTOR_URL`. The systemd unit now reads optional `/var/lib/archipelago/telemetry.env`, and deploys write that file when `TELEMETRY_COLLECTOR_URL` is exported in `scripts/deploy-config.sh`. Manual and periodic report schemas now both include metric percentages and container inventory, and the Fleet UI normalizes older reports with missing fields. Covered by local type-check, `useFleetData.test.ts`, `cargo check -p archipelago`, deploy-script syntax check, and `git diff --check`. Remaining ops step: choose the real collector URL, deploy it, restart the service, and confirm central Fleet ingest. |
+| Get Netbird working | todo | Requires app/runtime validation and credentials/config expectations. |
+| Sort out how we are going to manage lightning channel creation | todo | Product design needed for UX, safety limits, fees, and peer selection. |
+| Make sure old health notifications do not return on refresh/new login when stale/out of date | done | Health toasts now require a current app-linked unhealthy package state and hide stale package health notifications after 30 minutes on reload/new login.
Backend monitoring notifications now prune duplicate active alerts and old generic alerts before pushing new ones. Covered by `HealthNotifications.test.ts`, local type-check, targeted frontend tests, and backend notification unit test work. |
+| Fix BTCPay issue from desktop file "BTCPay Issues" | blocked | Need file contents or path to that desktop artifact. |
+| Check Nostr Discoverable Nodes and get it working correctly | in-progress | Discover modal now keeps discovered rows visible during relay refresh/failure and shows `Searching relays...` instead of dropping to an empty state. Covered by `DiscoverModal.test.ts`, local type-check, and `git diff --check`. Needs live relay/trust validation before marking done. |
+| Make sure update password is working properly | done | Backend now returns separate SSH update status so a successful web password change is not reported as a full failure when optional SSH password update fails. Settings modal shows success plus SSH warning and stays open for review. Covered by local type-check, focused modal/RPC tests, auth unit test, `cargo check -p archipelago`, and `git diff --check`. |
+| Do UI performance and general performance improvements | todo | Needs profiling target; start with obvious loading/render issues. |
+| Make sure companion app is all working well, had issues with tab apps | in-progress | Mobile app-session now keeps apps that require a new tab inside the session fallback instead of auto-opening an external tab and closing immediately. Covered by `AppSessionMobileNewTab.test.ts`, existing app-session config tests, app launcher tests, local type-check, and `git diff --check`. Broader companion smoke test still needed before marking done.
|
+| Even though performance is better, on reboot/restart backend/update show checking-containers notification instead of no apps | done | My Apps now shows a dedicated `Checking containers` card when initial backend data has loaded but `server-info.status-info.containers-scanned` is still false and no apps are ready to render, instead of falling through to the no-apps empty state. A follow-up UI pass preserves the last known app list when a later scanner/backoff update reports an empty package map with `containers-scanned=false`, and shows a refresh status banner above the grid. Validated by local type-check, targeted tests, and `git diff --check`; follow-up validation passed `npm test -- --run src/views/apps/__tests__/appPackageCache.test.ts` and `npm run type-check`. |
+| Check mesh core is picking up public channel/other devices, not just Archipelago ones | blocked | Needs Meshtastic hardware/radio environment. |
+| Make tabs info load quickly or show loading states | in-progress | Fleet now has initial loading/background-refresh states, and node history keeps showing while the next sample is fetched instead of blanking out. Web5 Connected Nodes Trusted/Observers tabs now show loading instead of empty states while peer data is pending and keep existing lists visible during refresh; Messages and Requests now also keep populated lists visible during refresh/failure. Web5 Shared Content now keeps My Content visible during refresh/failure with `Refreshing shared content...`, and Browse Peers keeps current same-peer results visible during refresh with `Refreshing peer content...` instead of replacing lists with full loading panels. Web5 Identities now keeps the identity list visible during refresh/failure with `Refreshing identities...`; Web5 DWN message browsing keeps stored messages visible during refresh/failure with `Refreshing messages...`.
The Web5 Verifiable Credentials summary keeps credential rows visible during refresh/failure with `Refreshing credentials...`. Web5 Nostr Relays keeps relay stats visible during refresh/failure with `Refreshing relays...`. Web5 Domains keeps registered-name counts visible during refresh/failure with `Refreshing domains...`. Web5 Federation keeps summary node counts/DID visible during refresh/failure with `Refreshing federation...`. Server Network overview, Network Interfaces, and Tor Services cards now keep visible values during refresh/failure with `Refreshing network...`, `Refreshing interfaces...`, and `Refreshing Tor services...`. Credentials keeps credential rows visible during refresh/failure with `Refreshing credentials...`. Settings Backups keeps backup rows visible during refresh/failure with `Refreshing backups...`. Settings Transport Preferences keeps preference controls visible during refresh/failure with `Refreshing transport preferences...`. Settings VPN status keeps current connection details visible during refresh/failure with `Refreshing VPN status...`. Lightning Channels keeps existing channels visible during refresh/failure with `Refreshing channels...`. Peer Files keeps existing peer catalog items visible during Tor refresh/failure with `Refreshing peer files...`. Cloud keeps existing peer cards visible during federation peer-list refresh/failure with `Refreshing peer nodes...`. Covered by focused Web5/Server/Credentials/Backups/Transport/VPN/Lightning/Peer Files/Cloud tests and local type-check. Broader tab-info audit still needed for other slow panels before marking done. |
+| Add states about why Bitcoin address is not ready | in-progress | Receive Bitcoin on-chain flows now reject blank LND address responses and translate common LND/Bitcoin readiness failures into user-facing reasons: wallet locked, wallet uninitialized, Bitcoin/LND still syncing, LND unreachable, or LND REST/newaddress transport issues.
The receive modals now show a live “checking wallet readiness” message while the request is in flight. Backend `lnd.newaddress` now errors if LND returns an error or no address. Needs live wallet-state smoke test before marking done. |
+| Add new Bitcoin wallets easily and securely | todo | Product/security design needed. |
+| Add the new gate instead of gate | blocked | Need definition of "new gate" and target integration. |
+| Local Nostr signer app should ask which account after logout/re-login | todo | Needs signer/session state validation. |
+| See what apps can migrate to local Nostr signer sign-in | todo | Needs app-by-app auth inventory. |
+| Make server name change change the host name | in-progress | Settings label changed to `Hostname`. `server.set-name` now persists the display name, derives a Linux-safe hostname slug, attempts `sudo -n hostnamectl set-hostname`, and returns non-fatal hostname warning fields if OS update fails. Covered by hostname slug unit test, local type-check, `cargo check -p archipelago`, and `git diff --check`. Impact audit: mDNS/SSH/Tailscale labels may change; already-created app configs using old `HOST_MDNS` (notably Fedimint derived env) are not automatically rewritten by hostnamectl, so this needs release-host smoke validation before marking done. |
+| Sort out HTTPS certificate, what is best way? | todo | Needs product decision: self-signed local CA, ACME DNS, Tailscale certs, or reverse proxy model. |
+
+## User Interface And App Experience
+
+| Item | Status | Release question / blocker |
+| --- | --- | --- |
+| LND Channels then back/back gets stuck between LND detail and channels | done | App Details back now routes explicitly to the parent surface, and Lightning Channels back replaces history so browser back no longer bounces between LND detail and Channels. Validated by local type-check and targeted tests. |
+| Add a Meshtastic icon | done | Added `meshcore.svg` asset and manifest-owned icon metadata.
Catalog generation is idempotent and strict catalog drift is clean. |
+| Improve default app icon fallback | done | Missing/broken app icons now fall back to the centered Archipelago `A` mark using the same black fill and gradient-border treatment as the custom UI icon asset, instead of the old generic placeholder. Applied to My Apps cards, mobile icons, Marketplace cards, and App Details. Validated by local type-check, targeted tests, Rust check, and `git diff --check`. |
+| Use favicon for Portainer apps? | todo | Need decision: use upstream favicons dynamically or ship curated icons. |
+| Settings for apps | blocked | Needs definition: per-app config screen, runtime env vars, credentials, or install options? |
+| Update SearXNG app icon | blocked | Needs user-provided/approved icon asset. User said to move past this until they can make icons. |
+| Once an app is installed remove recommended/core pills | done | Marketplace cards hide tier badges when installed. Validated by `MarketplaceAppCard.test.ts`, targeted Vitest, type-check, and `git diff --check`. |
+| Get Bitcoin / LND UI fully done with all options and controls | todo | Large feature area; needs scope for `1.8-alpha` vs post-release. |
+| Fix intro always showing on new browser sessions | done | Splash gating now checks the backend onboarding-complete state before showing the intro when this browser has no local intro flag. Already-onboarded nodes skip the splash and seed `neode_intro_seen`; fresh installs still show it. Covered by `introSplash.test.ts`, local type-check, targeted tests, and `git diff --check`. |
+| Fix App Store tabs/categories/search overflow | done | Discover/App Store and Marketplace render one shared App Store section list. Follow-up after user review restored the primary My Apps/App Store/Websites navigation to persistent desktop tabs at `md+` on My Apps, Discover, and Marketplace; mobile keeps dropdown behavior.
App Store category collapse now happens later by starting uncollapsed and using a smaller header gap/search reserve, and the My Apps category dropdown no longer appears on desktop. Covered by local type-check, focused Marketplace/App config tests, and scoped `git diff --check`; browser smoke remains the next resume step. |
+| Add a test harness for all of the application | in-progress | Lifecycle harness exists; need to expand the UI/e2e coverage definition. |
+| Fix app details screen links | done | App Details sidebar no longer renders dead `href="#"` links. It now renders only real manifest website/marketing, upstream/wrapper repo, and support URLs, and hides the Links card when no usable URLs exist. Covered by `AppSidebar.test.ts`, local type-check, targeted tests, and `git diff --check`. |
+| Fix FIPS anchoring, update FIPS | todo | Needs expected FIPS UX/API behavior. |
+| Fix generate receive address not working on nodes and identify wallet management | todo | Needs wallet API/backend validation. |
+| Fix mesh page on larger screens so it scales nicely | done | Mesh keeps the tabbed tools layout on normal desktop/1920px widths and only splits Off-Grid Bitcoin, Dead Man, and Map into separate stacked containers on very large screens (`>=2560px` wide and `>=1200px` tall). The desktop tools column now fills its panel instead of using a wrapper scroll container. Validated by local type-check, targeted tests, and `git diff --check`. |
+| Mesh map should handle denied location permission and still show other devices | in-progress | Mesh map now treats browser geolocation as optional in the UI: denied local location reports that peer locations can still appear, and the empty hint waits for mesh device positions instead of saying location sharing is required. Covered by `MeshMap.test.ts`. Needs browser smoke test with denied location plus a peer coordinate message before marking done.
|
+| Make tablet-size Meshtastic scrollable | done | Tablet/mobile Mesh tools panels now have bounded heights and internal scrolling so the selected Bitcoin/Dead Man/Map panel can scroll without blowing out the page. Validated by local type-check, targeted tests, and `git diff --check`. |
+| Make mobile screens have gap below lowest container and tab bar | done | Dashboard route panels, including the separate Chat/Mesh branch, now use mobile tab-bar bottom clearance so the lowest content clears the bottom tab bar. |
+| Add Trusted tab to Connected Nodes container and have Peers and Observers | done | Connected Nodes now labels trusted peers as Trusted and splits federation nodes with `trust_level: observer` into the Observers tab. Observer nodes are excluded from Trusted, shown with their own count/badge, and refresh from the same live federation list. Validated by local type-check and targeted tests. |
+| Add more tree navigation to cloud files so they do not all go back to first screen | done | Cloud folder navigation now persists the current folder path in the route query so refresh/browser back keeps nested folders instead of resetting to the section root. The Cloud back button now walks up to the parent folder before returning to Cloud home. Covered by `cloudPath.test.ts`, local type-check, targeted tests, and `git diff --check`. |
+| Fix visible UI refreshing on find nodes screens | done | Federation node auto-refresh no longer blanks/replaces the visible node lists after the initial load. Existing nodes stay visible during background refreshes, covered by `NodeList.test.ts`, local type-check, targeted tests, and `git diff --check`. |
+| Remove dead UI components/ones that are coming soon | done | Removed the dead Web3/coming-soon Network card, disabled local-network placeholder button, and the non-interactive Spotlight AI Assistant coming-soon block. Verified active UI no longer contains explicit `Coming soon` copy outside historical release-note text.
Covered by local type-check and `git diff --check`. |
+| Hide Web3 container on network for now and move FIPS Mesh up | done | Network page now places the live FIPS Mesh card in the top overview grid where the dead Web3 card was, removes the duplicate lower FIPS card, and updates the Home Network description to remove Web3 language. Validated by local type-check, targeted tests, and `git diff --check`. |
+| Make cool screens less hidden: Find Nodes, Fleet, Monitoring, etc. | done | Existing Web5 summary cards now expose Monitoring, Find Nodes/Federation, and Fleet directly. Federation card has separate `Find Nodes` and `Fleet` actions instead of hiding Find Nodes behind Fleet. Covered by `Web5Federation.test.ts`, local type-check, targeted tests, and `git diff --check`. |
+| Fix dashboard container/card square rendering corruption | done | Generalized the App Store compositor workaround to dashboard scroll-panel glass cards/buttons/inputs and removed transform-based stagger movement so Chromium/Brave no longer paints random large black square/rectangle layers over containers. Kept the Web5 bottom-action placement change. Validated by local type-check, targeted tests, and `git diff --check`. |
+| Move constrained card header actions to bottom buttons | done | Web5 summary actions and Network actions for Add Device, Scan WiFi, Restart Tor, and Add Service now stay in the card header only on very wide screens; otherwise they render at the card bottom as full-width or 50/50 buttons. Button icons were removed from those action buttons. Validated by local type-check, targeted tests, and `git diff --check`. |
+| Work on setup screens function and flows | in-progress | Onboarding setup choice now shows only usable paths: Fresh Start and Restore from Seed.
Removed the disabled `Connect Existing (Coming Soon)` option, and covered default Fresh routing plus Restore routing with `OnboardingOptions.test.ts`; `useOnboarding.test.ts`, local type-check, and `git diff --check` passed. Broader onboarding/setup audit still needed before marking done. |
+| Work on Easy Mode experience | in-progress | Easy Mode goal configure steps now route to their owning app/screen instead of silently completing without navigation; verify steps now expose a `Check & Continue` action; configure/info/verify actions start goal progress before completing the active step. Covered by `goalStepActions.test.ts`, existing goal store tests, local type-check, and `git diff --check`. Broader Easy Mode product scope still needed before marking done. |
+| Update My Apps homescreen to show most-used apps instead of hardcoded | done | App launches are recorded locally through the app launcher, and the Home My Apps card now shows the top three installed user apps by launch count/recency with a running-app/name fallback when there is no history. Covered by `appUsage.test.ts`, existing app launcher tests, local type-check, targeted tests, and `git diff --check`. |
+| Improve Full Archive Node dependent apps UX | todo | Already partly represented by Bitcoin-pruned install block; needs broader dependency UX. |
+| Fix incorrect modals that are wrong color and are not full-screen overlay | done | Custom Teleport modals that still used the old light `bg-black/10` overlay now use the same full-screen `bg-black/60` overlay treatment as BaseModal/newer modals. Verified no fixed modal overlays retain `bg-black/10`; validated by local type-check, targeted tests, and `git diff --check`. |
+| Prevent modals from allowing background scroll | done | Added shared scroll-lock composable, root-level body lock, wheel/touch containment, and explicit dashboard route-panel locking. User validated the background no longer scrolls behind modal overlays.
|
+| Look over gamepad navigation | todo | Needs focused controller-nav pass. |
+| App Store screenshots | in-progress | Placeholder policy fixed: Marketplace App Details and installed App Details now render screenshot sections only when real screenshot metadata exists, and otherwise hide the fake placeholder tiles. Metadata can be string URLs or `{ src, alt }` objects. Covered by `AppContentSection.test.ts`, `useMarketplaceApp.test.ts`, local type-check, and `git diff --check`. Needs actual screenshot assets/metadata before marking done. |
+| Fix App Detail page issues; container controls are not good | done | App Details container controls now disable while start/stop/restart/update/uninstall RPCs are running and show action-specific progress labels. Header actions collapse into the bottom 50/50 grid below `1280px` to avoid tablet/smaller desktop overlap. Credentials now show a loading state while package credentials are being fetched. Covered by `AppHeroSection.test.ts`, `AppSidebar.test.ts`, local type-check, targeted tests, and `git diff --check`. |
+| Add setup instructions for apps that need them | done | App Details now renders a dedicated Setup Instructions card from `static-files.instructions` when present, so apps can show install/setup notes without a new schema. Covered by `AppSidebar.test.ts`, local type-check, and `git diff --check`. |
+| Add press-and-hold option for apps on mobile app screen | done | Mobile My Apps icons now support long press/context menu to open the app detail/options screen while a normal tap still launches the app. Space key opens the same options path for keyboard users. Covered by `AppIconGrid.test.ts`, local type-check, targeted tests, and `git diff --check`. |
+| Side-load: add port-not-available validation | done | Sideload modal now validates app ID collisions, malformed `host:container` mappings, reserved Archipelago/package host ports, and host ports already exposed by installed packages before queueing install.
Backend install remains the final bind authority. Covered by `sideloadValidation.test.ts`, local type-check, targeted tests, and `git diff --check`. |
+| Delete app data option and uninstall warning | done | Uninstall dialogs in My Apps and App Details now include a clear warning plus a `Delete app data and reset it` choice. Leaving it off preserves app data for later reinstall; checking it passes `preserve_data=false` through `package.uninstall` so the app is fully reset. Covered by `AppsUninstallModal.test.ts`, `rpc-client.test.ts`, local type-check, targeted tests, and `git diff --check`. |
+| Add App Store container with recommended apps that change to Home Screen | done | Home now shows up to three uninstalled core/recommended App Store apps and routes clicks through the existing Marketplace App Details handoff. Installed aliases are honored, so recommendations disappear once the app is installed and the app moves into normal My Apps/Home behavior. Follow-up layout polish moved Cloud back into the second card slot, moved Recommended Apps into Cloud's previous slot, and placed Quick Start inside the grid next to Wallet to avoid an odd-width row. Covered by `homeRecommendations.test.ts`, local type-check, `git diff --check`, and Playwright Home dashboard smoke against local Vite/mock backend. |
+| Add QR code to download mobile companion app in login-triggered modal and improve modal | done | Companion intro modal now renders a QR code on desktop and a direct download button on mobile. It reads `VITE_COMPANION_APK_URL` and falls back to `/packages/archipelago-companion.apk.zip`; the APK zip is now published at `neode-ui/public/packages/archipelago-companion.apk.zip` so the modal can serve it immediately. Covered by local type-check, `git diff --check`, and manual file placement verification. |
+| Video calling Picture-in-Picture | blocked | Need referenced document or desired provider/library.
|
+| Card-based loading visuals on App Store pages | done | Discover and Marketplace now show app-card skeleton grids while community/Nostr catalog data is loading and no cards are available yet, instead of a centered spinner/empty state. Validated by local type-check, targeted tests, and `git diff --check`. |
+
+## External / Hardware Items
+
+| Item | Status | Release question / blocker |
+| --- | --- | --- |
+| Buy a HaLow device and start integration | blocked | Requires hardware purchase and driver/device target. Not a code-only `1.8-alpha` item unless hardware is available now. |
diff --git a/docs/APP-PACKAGING-MIGRATION-PLAN.md b/docs/APP-PACKAGING-MIGRATION-PLAN.md
new file mode 100644
index 00000000..5ef353bb
--- /dev/null
+++ b/docs/APP-PACKAGING-MIGRATION-PLAN.md
@@ -0,0 +1,441 @@
+# App Packaging Migration Plan
+
+## Goal
+
+Turn Archipelago into a serious app platform while preserving the fundamentals that drove the original architecture:
+
+- Rootless Podman and security-first execution.
+- Managed node-OS behavior: health, repair, backups, updates, secrets, and routing.
+- Bitcoin/LND/Tor/Web5/mesh integration where the platform genuinely needs deep awareness.
+- A developer-friendly app packaging model that avoids app-specific Rust installers as the normal path.
+
+## Current Contract
+
+The runtime contract is manifest-first. App packages live at `apps//manifest.yml` and are validated by the shared container manifest parser.
+
+The current canonical manifest fields are:
+
+- `app`: identity and app-level metadata.
+- `container`: image or build source, pull policy, network, entrypoint, custom args, derived env, secret env, and data UID.
+- `dependencies`: storage and app dependencies.
+- `resources`: CPU, memory, disk.
+- `security`: capabilities, read-only root, no-new-privileges, network policy, optional AppArmor profile.
+- `ports`, `volumes`, `files`, `environment`, `health_check`, and `devices`.
+- `metadata`: current catalog-facing presentation data such as category, tier, icon, repo/source, author, and features.
+- extension keys may exist temporarily, but they are transitional and should not become a second contract.
+
+The historical `archy-app.yml` name should be treated as superseded. The active local package filename is `manifest.yml`.
+
+## Current Progress
+
+As of the current `1.8-alpha` workstream:
+
+- `apps/*/manifest.yml` is the source of truth for runtime app definitions.
+- The Rust manifest parser validates app identity, image-vs-build source selection, safe environment/secrets, safe ports, safe bind/named/tmpfs volumes, generated files under declared bind mounts, devices, and security/network policy values.
+- Manifest-owned generated files exist through `app.files` and have been used for app config material such as Meshtastic config regeneration.
+- Local image builds are represented with `container.build`; pulled images are represented with `container.image`.
+- Data ownership repair is represented with `container.data_uid`.
+- Derived host facts and secret-file-backed environment variables are represented with `container.derived_env` and `container.secret_env`.
+- Catalog metadata generation is implemented by `scripts/generate-app-catalog.py`.
+- App-session launch ports/titles and new-tab launch behavior now have a generated TypeScript metadata path from manifests, with manual overrides preserved for companion UIs and aliases that do not have manifest-owned metadata yet.
+- Release drift checking is implemented by `scripts/check-app-catalog-drift.py --release --strict`.
+- The canonical catalog and the UI public catalog are expected to remain byte-for-byte synced after generation.
+- Runtime validation has already moved many simple and moderate apps into the manifest/orchestrator path, including Filebrowser, Vaultwarden, Portainer, Uptime Kuma, Grafana, Gitea, Nextcloud, SearXNG, Nostr Relay, PhotoPrism, Jellyfin, Meshtastic, and several Bitcoin-adjacent apps.
+
+The remaining migration work is mostly orchestration quality: post-reboot adoption, progress reporting, stale scanner-state handling, update policy, multi-container stack ownership, proxy route generation, and cleanup of obsolete legacy installers/fallbacks.
+
+## Target Architecture
+
+Use a StartOS-inspired package model with Umbrel-like app folders.
+
+```text
+apps/saleor/
+  manifest.yml
+  Dockerfile
+  icon.svg
+  screenshots/
+  instructions.md
+  hooks/
+    post-install.sh
+    pre-start.sh
+    repair.sh
+    health.sh
+    backup.sh
+    restore.sh
+  proxy/
+    routes.yml
+```
+
+Archipelago becomes the secure compiler/runtime for these packages. The manifest declares what it needs; Archipelago validates it, injects secrets, creates rootless Podman containers, generates nginx/Tor/public routes, registers health checks, displays credentials, and manages lifecycle.
+
+## Core Principles
+
+- App packages are declarative by default.
+- Hooks are allowed only as controlled, reviewed escape hatches.
+- Rootless Podman stays.
+- Arbitrary privileged Compose execution is not allowed.
+- Each app has one source of truth.
+- Catalog, launch URLs, mobile behavior, credentials, backup paths, and public routes come from the app package or its generated catalog entry.
+- Rust backend owns orchestration, not app-specific business logic.
+- Core infrastructure can remain special-case where justified.
+
+## What Stays
+
+- Rootless Podman.
+- Archipelago orchestrator.
+- Health/reconcile/repair loops.
+- Host nginx.
+- Nginx Proxy Manager integration.
+- Tor/public routing goals.
+- Bitcoin/LND/mesh/Web5/FIPS/security direction.
+- OTA update system.
+- App-session/mobile shell.
+- Managed secrets and credentials display.
+
+## What Changes
+
+- Complex app stacks stop living in Rust.
+- `app-catalog/catalog.json` becomes generated.
+- Frontend fallback marketplace data is removed or generated.
+- App-session port maps and new-tab launch behavior become generated.
+- Public proxy routes become app-declared.
+- Install/start/restart/backup/restore become package-driven.
+- App updates become app package changes where possible, not full backend code changes.
+
+## Package Schema Direction
+
+Example `manifest.yml`:
+
+```yaml
+app:
+  id: saleor
+  name: Saleor
+  version: 3.23.0
+  description: Composable commerce platform
+  container:
+    image: docker.io/myorg/saleor:3.23.0
+    pull_policy: if-not-present
+    network: archy-net
+    entrypoint: ["sh", "-lc"]
+    custom_args:
+      - /app/start.sh
+    derived_env:
+      - key: PUBLIC_URL
+        template: https://{{HOST_MDNS}}:9010
+    secret_env:
+      - key: SALEOR_SECRET_KEY
+        secret_file: saleor-secret-key
+  dependencies:
+    - storage: 20Gi
+  resources:
+    cpu_limit: 4
+    memory_limit: 2Gi
+  security:
+    capabilities: []
+    readonly_root: true
+    no_new_privileges: true
+    network_policy: isolated
+  ports:
+    - host: 9010
+      container: 9000
+      protocol: tcp
+  volumes:
+    - type: bind
+      source: /var/lib/archipelago/saleor
+      target: /data
+      options: [rw]
+  environment:
+    - NODE_ENV=production
+  health_check:
+    type: http
+    endpoint: http://localhost:9000
+    path: /health
+    interval: 30s
+    timeout: 5s
+    retries: 3
+```
+
+Optional generated files, hooks, icons, and screenshots can sit beside the manifest, but the manifest stays the source of truth. Compose-style definitions are not executed directly.
+
+## Security Model
+
+Do not run arbitrary Compose directly. Archipelago validates:
+
+- No privileged containers unless explicitly approved.
+- No host filesystem mounts outside approved paths.
+- No Docker socket mounts.
+- No host network unless explicitly approved.
+- No dangerous capabilities by default.
+- No arbitrary device access without declaration.
+- No rootful execution.
+- Pinned images preferred.
+- Resource limits required.
+- Backup paths declared where the app stores durable data.
+- Public routes explicit.
+- Secrets referenced by name, not hardcoded.
+
+When the runtime needs app-specific facts that do not belong in the manifest, prefer adding a reusable platform primitive rather than introducing another ad hoc installer path.
+
+This preserves the reason for avoiding raw Umbrel-style Compose while still giving developers a sane package format.
+
+## Lifecycle Model
+
+Every app package should support:
+
+- install
+- configure
+- start
+- stop
+- restart
+- update
+- repair
+- health
+- backup
+- restore
+- uninstall
+- migrate
+
+Archipelago owns the state machine.
+
+Optional hooks:
+
+- `post-install.sh` for migrations/admin creation.
+- `pre-start.sh` for ownership repair.
+- `repair.sh` for app-specific remediation.
+- `health.sh` for custom health checks.
+- `backup.sh` and `restore.sh` only when simple path backups are insufficient.
+
+Hooks run with a controlled environment and restricted permissions.
+
+## Hard Work
+
+The hard work is not writing YAML. The hard work is safely translating app packages into reliable rootless runtime behavior:
+
+- Build a robust package validator.
+- Map a safe Compose subset to rootless Podman.
+- Handle multi-container networks without hardcoded IPs.
+- Handle rootless volume ownership correctly.
+- Generate host nginx routes from app metadata.
+- Handle public-domain apps without leaking private `192.168.x.x` or `100.x.x.x` URLs.
+- Inject secrets without exposing values in logs or frontend bundles.
+- Make backup/restore consistent across databases and files.
+- Migrate existing hand-built containers to package-owned containers.
+- Keep old alpha nodes working while introducing the new system.
+- Avoid keeping two permanent systems that drift forever.
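A few of the Security Model rules listed earlier in this plan can be sketched as a small validator. This is only an illustration in Python, not the Rust parser that actually implements the schema. The function name `check_manifest`, the `APPROVED_BIND_ROOT` and `DANGEROUS_CAPS` constants, and the `privileged` key are assumptions made for the example. The sketch accepts manifest sections either nested under `app` or at the top level, because this plan does not pin that layout down.

```python
from pathlib import PurePosixPath

# Assumption: bind mounts are only approved under the app data root.
APPROVED_BIND_ROOT = PurePosixPath("/var/lib/archipelago")
# Illustrative subset only; the real deny-list is not given in this plan.
DANGEROUS_CAPS = {"SYS_ADMIN", "SYS_MODULE", "SYS_PTRACE", "NET_ADMIN"}


def check_manifest(manifest: dict) -> list[str]:
    """Return rule violations; an empty list means every check passed."""
    # Accept sections nested under `app` or placed at the top level.
    spec = {**manifest, **manifest.get("app", {})}
    container = spec.get("container", {})
    security = spec.get("security", {})
    problems = []

    # `privileged` is a hypothetical key used only to illustrate the rule.
    if security.get("privileged"):
        problems.append("privileged containers are not allowed")
    if container.get("network") == "host":
        problems.append("host network requires explicit approval")
    for cap in security.get("capabilities", []):
        if cap.upper().removeprefix("CAP_") in DANGEROUS_CAPS:
            problems.append(f"dangerous capability: {cap}")

    for volume in spec.get("volumes", []):
        if volume.get("type") != "bind":
            continue
        source = PurePosixPath(volume.get("source", ""))
        if source.name == "docker.sock":
            problems.append(f"Docker socket mount: {source}")
        elif ".." in source.parts or APPROVED_BIND_ROOT not in source.parents:
            problems.append(f"bind mount outside approved path: {source}")

    # "Resource limits required": both limits must be declared.
    resources = spec.get("resources", {})
    for key in ("cpu_limit", "memory_limit"):
        if key not in resources:
            problems.append(f"missing resource limit: {key}")
    return problems
```

Returning a list of violations rather than raising on the first one lets a package author see every problem in a single validation pass. A real validator would also need to cover devices, secrets, ports, and image pinning.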
+
+## Alpha Node Impact
+
+Existing alpha nodes must not be broken.
+
+Phase 1 behavior:
+
+- Current Rust installers keep working.
+- Current app manifests keep working.
+- New app package loader exists beside the old system.
+- No existing app is automatically migrated.
+- Alpha nodes receive compatibility code only.
+
+Phase 2 behavior:
+
+- New installs of selected apps use package mode.
+- Existing installs can be detected and adopted.
+- App state is preserved.
+- Migration is opt-in or happens only for low-risk apps.
+
+Phase 3 behavior:
+
+- Stable migrated apps switch to package mode by default.
+- Existing containers are adopted if names/volumes match.
+- Data directories are preserved.
+- Old Rust installers remain as fallback for at least one release cycle.
+
+Phase 4 behavior:
+
+- Remove old installers only after live alpha validation.
+- Keep migration repair code for already-deployed nodes.
+
+## Migration Rules
+
+For every migrated app:
+
+- Preserve `/var/lib/archipelago/` data.
+- Preserve generated secrets.
+- Preserve credentials shown to users.
+- Preserve public ports where possible.
+- Preserve container names where needed for adoption.
+- Never delete volumes during migration.
+- Stop/recreate containers only when necessary.
+- Record migration version in app state.
+- Provide rollback path to old installer for alpha builds.
+
+## Notes For The Release
+
+- Catalog entries should be generated from manifests so the UI and runtime agree on launch metadata.
+- The developer docs should describe the manifest/runtime contract that exists today, not the older publish-model draft.
+- If a new capability is needed, add one reusable manifest field or orchestrator primitive and document it here before wiring a one-off app branch.
+
+## First Apps To Migrate
+
+Start with low-risk apps:
+
+- Filebrowser
+- Vaultwarden
+- Uptime Kuma
+- Grafana
+
+Then moderate apps:
+
+- Gitea
+- Nextcloud
+- SearXNG
+- Nginx Proxy Manager metadata integration
+
+Then complex apps:
+
+- Saleor
+- Mempool
+- BTCPay Server
+- NetBird only if safe
+
+Leave for later:
+
+- Bitcoin
+- LND
+- Electrs/ElectrumX
+- Tor
+- System update
+- Mesh/Web5/FIPS core services
+
+## Saleor Reference Goal
+
+Saleor should become the showcase package. It should prove:
+
+- Multi-container stack support.
+- Generated secrets.
+- Post-install migration/admin user hooks.
+- Dashboard/API/storefront routes.
+- Same-origin public GraphQL routing.
+- Credentials display.
+- Backup paths.
+- Health checks.
+- Public domain support.
+- Alpha-node adoption.
+
+Once Saleor is clean, the app system is credible.
+
+## Implementation Phases
+
+### Phase 1: Package Contract
+
+- Use `apps//manifest.yml` as the package contract.
+- Keep the Rust parser/validator as the canonical schema implementation.
+- Keep generated catalog output from manifest-owned metadata.
+- Finish generated app-session launch metadata so launch behavior cannot drift from manifests.
+- Add/keep tests for unsafe package rejection.
+
+### Phase 2: Single-Container Runtime
+
+- Continue hardening package install for one-container apps.
+- Compile manifests to rootless Podman/Quadlet runtime behavior.
+- Support ports, env, generated files, devices, volumes, resources, health checks, data UID repair, image pull/build availability checks, and launch metadata.
+- Keep Filebrowser, Vaultwarden, Portainer, Uptime Kuma, Grafana, SearXNG, Jellyfin, PhotoPrism, Meshtastic, and similar apps as regression proofs.
+
+### Phase 3: Multi-Container Runtime
+
+- Decide whether multi-container stacks use a safe `compose.yml` subset or a manifest-native `services` section.
+- Support app-local networks.
+- Support service dependencies and readiness gates.
+- Support internal service names.
+- Support generated env/secrets across services.
+- Support controlled hooks only where declarative primitives are insufficient.
+- Adopt existing multi-container apps without deleting data.
+
+### Phase 4: Routing
+
+- Add `proxy/routes.yml`.
+- Generate host nginx routes.
+- Generate Tor/public routes.
+- Fix same-origin API routing class of bugs permanently.
+- Integrate with Nginx Proxy Manager sync.
+
+### Phase 5: Migration
+
+- Add adoption logic for existing containers.
+- Add migration metadata.
+- Migrate simple apps.
+- Migrate Saleor.
+- Keep rollback.
+- Prove reboot recovery with repeated clean post-reboot lifecycle passes.
+- Preserve Nostr signer bridges, Bitcoin dependency wait states, and public launch ports during adoption.
+
+### Phase 6: Cleanup
+
+- Remove duplicated catalog/frontend data.
+- Remove migrated Rust stack installers.
+- Document package format.
+- Add developer tooling: validate, test, package, install locally.
+- Remove stale fallback metadata, app-specific lifecycle branches, and compatibility shims only after live validation.
+
+## Developer Tooling
+
+Add commands like:
+
+```bash
+archy app validate apps/saleor
+archy app render apps/saleor
+archy app install apps/saleor
+archy app test apps/saleor
+```
+
+Developers should be able to package an app without understanding Archipelago internals.
+
+## Open Source Story
+
+Public explanation:
+
+> Archipelago uses rootless Podman and a validated app package format. App authors define services declaratively, while the OS enforces security, secrets, routing, backups, health, and lifecycle repair. This gives us Umbrel-like app packaging with StartOS-like managed service discipline.
+
+## Rework Estimate
+
+- Package schema and validator: 1-2 weeks.
+- Single-container package runtime: 1-2 weeks.
+- Generated catalog/frontend metadata: 1 week.
+- Multi-container support: 2-4 weeks.
+- Routing/public proxy integration: 1-2 weeks.
+- Hooks/secrets/backups: 2-3 weeks.
+- First migrations: 2-4 weeks.
+- Saleor reference migration: 1-2 weeks.
+- Cleanup/docs/tooling: 2-3 weeks.
+
+Total estimate: 8-14 weeks of serious work for an excellent system.
+
+Minimum viable version: 3-5 weeks.
+
+## Biggest Risks
+
+- Rootless Podman edge cases continue to bite.
+- Compose compatibility scope creeps too wide.
+- Hooks become an unsafe escape hatch.
+- Migration accidentally disrupts alpha nodes.
+- Generated metadata drifts from old manual data during transition.
+- Old and new systems remain permanently duplicated.
+
+## Risk Controls
+
+- Support a strict Compose subset, not all Compose.
+- Validate everything.
+- Keep hooks minimal and logged.
+- Migrate one app at a time.
+- Add live alpha-node checks before each release.
+- Generate catalog/app-session data early.
+- Set a deadline for deleting migrated legacy installers.
+
+## Immediate Next Steps
+
+1. Expand generated app-session metadata beyond ports/titles/new-tab behavior to cover proxy paths and companion UI aliases where those can be declared safely in manifests.
+2. Define the app update policy and wire it into manifest/catalog metadata.
+3. Finish post-reboot adoption and stale scanner-state handling for migrated apps.
+4. Convert remaining multi-container legacy stacks to a manifest-owned model without deleting data.
+5. Add developer tooling around the current `manifest.yml` contract: validate, render, local install, lifecycle test.
+6. Migrate Saleor or another serious multi-container app as the proof package once the stack model is stable.
+7. Leave Bitcoin/LND/core services as managed infrastructure until the package system is proven for normal apps.
diff --git a/docs/CONTAINER_LIFECYCLE_HANDOFF.md b/docs/CONTAINER_LIFECYCLE_HANDOFF.md
index a538a8d6..00515e35 100644
--- a/docs/CONTAINER_LIFECYCLE_HANDOFF.md
+++ b/docs/CONTAINER_LIFECYCLE_HANDOFF.md
@@ -1,6 +1,582 @@
 # Container Lifecycle Handoff

-Last updated: 2026-05-11
+Last updated: 2026-06-08
+
+## 2026-06-08 `1.8-alpha` Release Gate Update
+
+- Target release is now `1.8-alpha`, including a cut and smoke-tested ISO after validation is green.
+- Current release readiness estimate is about `82%`.
+- Host reboot validation is not clean yet. User reported that a reboot test left IndeeHub stopped afterward, with many containers killed by SIGKILL during reboot/shutdown, one crash, and a couple stopped.
+- Treat post-reboot recovery as the active release blocker.
+- IndeeHub is not considered recovered unless:
+  - the stack containers recover after boot;
+  - `http://192.168.1.198:7778/` is reachable;
+  - the HTML includes `/nostr-provider.js`;
+  - `http://192.168.1.198:7778/nostr-provider.js` is served and looks like the Nostr signer bridge.
+- Local follow-up in progress:
+  - `core/archipelago/src/container/prod_orchestrator.rs` now hardens IndeeHub stack reconcile by starting existing backend containers through a user scope when possible, waiting for backend/API dependency readiness, restarting the frontend when it does not remain running/reachable, and checking host port `7778`;
+  - `tests/lifecycle/remote-lifecycle.sh` now validates the IndeeHub Nostr provider during launch probes;
+  - `core/container/src/manifest.rs` now has stricter package safety validation while preserving all current real manifests.
+- Validation passed locally for this follow-up:
+  - `cargo fmt --manifest-path core/Cargo.toml --all`;
+  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`);
+  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`;
+  - filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` compiled and ran one matching existing test;
+  - `bash -n tests/lifecycle/remote-lifecycle.sh`;
+  - `git diff --check`.
+- Passing criterion after deploy:
+  - minimum: 3 consecutive clean post-fix reboots, broad non-destructive lifecycle green after each;
+  - preferred before release: 5 consecutive clean post-fix reboots, broad lifecycle green after each;
+  - SIGKILL during shutdown is not automatically disqualifying if all managed apps recover and pass health/launch after boot, but any stopped/crashed/unreachable managed app after boot fails that iteration.
+- Final release gate after reboot validation: cut the `1.8-alpha` ISO and smoke-test boot/install/backend/UI/catalog/focused app lifecycle.
+
+### 2026-06-08 Focused Blocker Validation After `06420c...`
+
+- Deployed backend `4108ca146b482c028ae8d7c4bec314b71ef3412f15efd2e61846a2c345b36aba`, then backend `06420c0377fff650a2bf3211f13c1e0754bf8df81345b8485f4c9a30cb552439` to `.198`.
+- Both deploys restarted only `archipelago.service`; `archipelago-doctor.timer` and `archipelago-reconcile.timer` stayed inactive. No reboot and no broad Podman store/image commands were run.
+- Local fixes included:
+  - targeted Podman remove fallback for stuck `removing/stopping` records;
+  - rootless Podman socket liveness check by Unix connection, not path existence;
+  - IndeeHub readiness fallback to platform network aliases when `getent` inside the API image cannot prove DNS;
+  - Tailscale launch harness now requires login/auth UI content;
+  - stricter manifest validation while preserving all real manifests.
+- Validation passed locally:
+  - `cargo fmt --manifest-path core/Cargo.toml --all`;
+  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`;
+  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`);
+  - `bash -n tests/lifecycle/remote-lifecycle.sh`;
+  - `git diff --check`.
+- `.198` is still not release-ready after `06420c...`:
+  - `indeedhub`: stuck `stopping`, launch `7778` returns `000`;
+  - `immich`: `starting`, launch `2283` returns `000`;
+  - `tailscale`: `running`, launch `8240` returns `000`; logs show `NeedsLogin`/`WantRunning=false`, and launch must present the Tailscale login/auth UI;
+  - `vaultwarden`: absent/not listed after start attempt, launch `8082` returns `000`;
+  - `portainer`: `running`, launch `9000` returns `000`; user confirmed Portainer environment wizard cannot connect to `unix:///var/run/docker.sock`;
+  - `btcpay-server`: not a current blocker; direct launch `23000` returned HTTP 200 and user confirmed the earlier report was wrong-server/slowness.
+- Do not continue to reboot validation or ISO cutting until rootless Podman control-plane/socket health, stuck container-state cleanup, and app-screen launch contracts are fixed.
+
+## 2026-06-08 `.198` Release Candidate State Check
+
+- Deployed backend hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412` to `.198` after the targeted image-probe mitigation.
+- Previous live backend hash before deploy was `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
+- Deployment notes:
+  - local release build passed: `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`;
+  - initial direct `cp` over `/usr/local/bin/archipelago` failed with `Text file busy`, after creating a timestamped backup;
+  - recovered by installing to `/usr/local/bin/archipelago.new`, atomically renaming it over `/usr/local/bin/archipelago`, and restarting only `archipelago.service`;
+  - no host reboot and no broad Podman store/image commands were run.
+- Latest mitigation now live on `.198`:
+  - `core/container/src/runtime.rs` uses bounded targeted `podman image inspect` for `ContainerRuntime::image_exists()`;
+  - `core/archipelago/src/api/rpc/package/install.rs` uses bounded targeted `podman image inspect` for local fallback and post-pull verification;
+  - `core/archipelago/src/container/companion.rs` uses `podman image inspect` for companion image checks.
+- Validation passed on live hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`:
+  - focused non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism,fedimint,indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`;
+  - broad non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`;
+  - `python3 scripts/check-app-catalog-drift.py --release --strict` reports `metadata_drift=0`, `missing_catalog=0`, `missing_manifests=0`.
+- Final `.198` state:
+  - `archipelago.service`: active.
+  - `archipelago-doctor.timer`: inactive.
+  - `archipelago-reconcile.timer`: inactive.
+  - `/usr/local/bin/archipelago` sha256: `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
+  - `/`: `66%` used, about `9.6G` free.
+  - `/var/lib/archipelago`: `8%` used, about `375G` free.
+- Startup logs still showed one known `podman ps -a --format json timed out after 30s` scan timeout followed by scan backoff; lifecycle validation passed anyway. Treat Podman socket/store health as a residual release risk, but release image probes are now quarantined from the known fragile image-existence/list commands.
+- Remaining release gate: host reboot validation, only if explicitly approved.
+
+- Verified `.198` without running broad Podman store/image commands.
+- Current local release binary and live `/usr/local/bin/archipelago` match hash `670a3e789540082437c7521cc5ad7a4c260f56ee8e0a9cf770160fa25b4e4644`.
+- `archipelago.service` is active.
+- `archipelago-doctor.timer` is inactive.
+- `archipelago-reconcile.timer` is inactive.
+- `/` is at `65%` used with about `9.9G` free.
+- `/var/lib/archipelago` is at `10%` used with about `370G` free.
+- Backend-restart validation was already recorded as passed in the release-candidate checkpoint. The remaining live validation gate is host reboot validation, only if explicitly approved.
+- Continue avoiding `podman image list`, `podman system df`, broad `podman image exists`, `podman image prune`, and `podman volume prune` on `.198` while the store/socket health risk is unresolved.
+
+## 2026-06-08 Local Release Gate Completion
+
+- No `.198` host actions were performed in this pass: no reboot, no timer changes, no deploy, no Podman store-wide commands.
+- Fixed scanner skip/backoff wakeups so skipped scans still advance the scan-completion watch counter for install/update waiters.
+- Fixed local full-test blockers:
+  - crash-recovery unit tests now pass the `include_stack_members` flag and cover generic-vs-stack recovery behavior;
+  - runtime manifest-port lookup checks the workspace `apps/` directory via `CARGO_MANIFEST_DIR`, so new public manifests are visible from test/runtime working directories;
+  - journal disk usage parsing accepts compact `journalctl` output such as `463.9M`;
+  - boot-reconciler cadence tests bypass the global crash-recovery wait gate when using the existing test-only `without_companion_stage()` helper.
+- Local validation passed:
+  - `cargo fmt --manifest-path core/Cargo.toml --all`.
+  - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
+  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`43 passed`).
+  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
+  - `cargo check --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security`.
+  - `cargo test --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security` (`12 security tests passed`; performance has no tests).
+  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
+  - `python3 scripts/generate-app-catalog.py` (`updated 0 fields`).
+  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
+  - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
+  - `git diff --check`.
+  - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
+- Remaining live gate is unchanged: host reboot validation on `.198`, only if explicitly approved.
+
+## 2026-06-08 Frontend Release Gate Completion
+
+- No `.198` host actions were performed in this pass: no reboot, no timer changes, no deploy, no Podman store-wide commands.
+- Fixed mobile app-launch behavior in `neode-ui/src/stores/appLauncher.ts`:
+  - desktop still opens X-Frame-Options/new-tab apps directly in a new tab;
+  - mobile now routes those same apps through `app-session` so app icons keep users inside Archipelago;
+  - router return-path handling is defensive when `currentRoute` is unavailable.
+- Updated frontend tests for current launch behavior and fixed async/Pinia fixture setup.
+- Local validation passed:
+  - `npm run type-check`.
+  - `npm test` (`548 passed`).
+  - `npm run build`.
+  - `python3 scripts/generate-app-catalog.py` (`updated 0 fields`).
+  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
+  - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
+  - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
+  - `git diff --check`.
+- Local caveat: `npm ci` failed before checks because existing `neode-ui/node_modules/@alloc` entries are `root:root`; do not mutate ownership or remove the tree without explicit approval.
+
+## 2026-06-08 Local Podman Store-Risk Cleanup
+
+- Reviewed release-relevant Podman store/image call sites without running broad Podman store/image commands on `.198`.
+- Bounded stack installer image pulls in `core/archipelago/src/api/rpc/package/stacks.rs` with `kill_on_drop` and a 600s timeout.
+- Bounded manual package update image pulls in `core/archipelago/src/api/rpc/package/update.rs` with `kill_on_drop` and a 600s timeout while preserving stderr progress parsing.
+- Validation passed locally:
+  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
+  - `cargo fmt` from `core/`.
+  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
+  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
+- Local release binary hash after this cleanup is `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4`. +- This local build has not been deployed to `.198`; live `.198` remains on `670a3e789540082437c7521cc5ad7a4c260f56ee8e0a9cf770160fa25b4e4644` unless a later checkpoint says otherwise. + +## 2026-06-08 `.198` Podman Pull Hardening Deploy + +- Deployed backend hash `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4` to `.198`. +- Previous backend was backed up under `/usr/local/bin/archipelago.backup-20260608-store-risk-*` before replacement. +- Restarted only `archipelago.service`; no host reboot was performed. +- No broad Podman store/image commands were run. +- Initial `systemctl restart` exceeded the local 120s wrapper while startup was still in progress, but the backend reached `Server listening`, then systemd settled to `active/running`. +- Final `.198` state: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256: `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4`. + - `/`: `65%` used, about `9.8G` free. + - `/var/lib/archipelago`: `10%` used, about `370G` free. +- Validation passed: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint,immich,indeedhub,photoprism ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - `python3 scripts/check-app-catalog-drift.py --release --strict`. +- Remaining release gate: host reboot validation, only if explicitly approved. + +## 2026-06-08 `.198` App Health and Port Recovery + +- Deployed backend hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de` to `.198`. 
+- Fedimint Guardian and File Browser were reachable but UI package-data reported `health=starting`; backend scanner now normalizes reachable running apps to healthy and restores the launch URL when the direct port is reachable. +- Nostr relay had been using host port `8081`, which conflicted with Nginx Proxy Manager admin launch. Updated `apps/nostr-rs-relay/manifest.yml` to use host port `18081`. +- Recovered live Nostr/NPM state: + - Nginx Proxy Manager admin UI responds on `http://127.0.0.1:8081/`. + - Nostr relay responds on `http://127.0.0.1:18081/` with the expected Nostr-client message. +- Hardened legacy install runtime for scoped web apps: use `podman create` followed by `systemd-run --user --scope podman start` so containers are not coupled to `archipelago.service`, while install RPCs do not hang on scoped `podman run -d`. +- Recovered IndeedHub after broad validation found it stopped: + - `indeedhub-minio` had stopped, causing the frontend nginx container to exit with `host not found in upstream "minio"`. + - Restarted existing `indeedhub-minio` with preserved volume data and restarted the frontend. + - `http://127.0.0.1:7778/` returned HTTP `200` afterward. +- Validation passed: + - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. + - `python3 scripts/check-app-catalog-drift.py --release --strict`. + - Focused lifecycle for `indeedhub,nginx-proxy-manager,nostr-rs-relay,fedimint,filebrowser`. + - Broad non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Final `.198` state: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256: `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`. 
+ - `/`: `65%` used, about `9.6G` free. + - `/var/lib/archipelago`: `10%` used, about `370G` free. +- Remaining release gate: host reboot validation, only if explicitly approved. + +## 2026-06-04 `.198` IndeedHub and Immich Lifecycle Recovery + +- Deployed backend hash `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9` to `.198`. +- Fixed IndeedHub frontend startup sequencing so network alias repair is only applied immediately before the frontend starts, after `indeedhub-minio`, `indeedhub-redis`, and `indeedhub-api` are running. +- Fixed Immich lifecycle recovery on `.198`: + - dependency readiness now accepts healthy Podman health state for `immich_postgres` and `immich_redis` before falling back to slower `podman exec` probes; + - `immich_server` startup now repairs `/var/lib/archipelago/immich` ownership through `podman unshare chown -R 0:0`, preserving existing upload data while matching the current rootless container user mapping; + - this resolved the observed `EACCES` failure writing `/usr/src/app/upload/encoded-video/.immich`. +- Diagnosis notes: + - Broad audit initially failed only on Immich (`state=exited`); focused Fedimint and NetBird audits passed. + - Patched dependency wait got lifecycle past dependencies to `Starting container: immich_server`. + - Upload ownership repair allowed Immich API and microservices to remain running; direct `http://127.0.0.1:2283/` returned HTTP `200`. +- Verification on this hash: + - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. + - Focused IndeedHub audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. 
+ - Focused Fedimint audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. + - Focused NetBird audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=netbird ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. + - Focused Immich audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Final `.198` state after validation: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256: `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9`. +- Residual risk: + - `.198` still shows intermittent `podman ps -a --format json timed out after 30s` and transient Bitcoin RPC timeouts under load; keep avoiding store-wide Podman commands and treat Podman socket/store health as a separate release hardening item. + +## 2026-06-03 `.198` Generic Host-Port Health Checkpoint + +- Latest local Podman store-risk mitigation, pending deploy to `.198`: + - `core/container/src/runtime.rs` now implements `ContainerRuntime::image_exists()` with bounded targeted `podman image inspect` instead of `podman image exists`. + - `core/archipelago/src/api/rpc/package/install.rs` now verifies local fallback images and post-pull images with bounded targeted `podman image inspect` instead of `podman images -q`. + - `core/archipelago/src/container/companion.rs` now uses `podman image inspect` instead of `podman image exists`. 
+ - A grep across `core/**/*.rs` finds no live Rust call sites for `podman image exists` or `podman images -q`; only an explanatory comment remains. + - Validation passed: `cargo fmt --all --check`, `cargo check -p archipelago-container`, `cargo check -p archipelago`, `CARGO_INCREMENTAL=0 cargo check -p archipelago --tests`, `cargo test -p archipelago-container`, and whitespace check for the changed files. + - A filtered `cargo test -p archipelago install_fresh_build` did not reach execution due to local compile/link slowness/artifact failure; `--tests` compilation passed afterward. + +- Deployed backend hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` to `.198` after release code-review/refactor cleanup of legacy runtime host-port repair. +- Reduced duplicated app-specific port repair logic in `core/archipelago/src/api/rpc/package/runtime.rs`: + - legacy package start/restart repair now derives host ports from `apps/*/manifest.yml` when available; + - hardcoded ports remain only as fallback for legacy/non-manifest apps and for extra legacy cleanup ports such as Gitea `3000` and Nginx Proxy Manager `8084`/`8444`; + - the old duplicate Gitea cleanup helper was removed; + - focused unit coverage was added for manifest-derived runtime ports and legacy extra ports. +- Verification on this hash: + - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. + - Focused `runtime_host_ports` test was added but local `cargo test ... runtime_host_ports` did not complete within 5 minutes during compilation, consistent with known local test/linker slowness. + - Targeted PhotoPrism audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. 
+ - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Final `.198` state after validation: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256: `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b`. + +- Catalog metadata generation is now implemented: + - Added `scripts/generate-app-catalog.py` to sync manifest-owned fields into both `app-catalog/catalog.json` and `neode-ui/public/catalog.json` while preserving catalog-only presentation/runtime fields. + - Corrected stale manifest metadata for public catalog apps where the manifest was behind production catalog/image values: BotFights, IndeeHub, Gitea icon/repo, LND title/image, ElectrumX image, Fedimint image, and Mempool title/version/image. + - Ran generator; canonical and UI catalogs now match byte-for-byte. + - Release drift gate is green: `python3 scripts/check-app-catalog-drift.py --release --strict` reports `metadata_drift=0`, `missing_catalog=0`, `missing_manifests=0`. + - Validation passed: `jq empty app-catalog/catalog.json neode-ui/public/catalog.json`, `cargo test --manifest-path core/Cargo.toml -p archipelago-container`, `cargo check --manifest-path core/Cargo.toml -p archipelago`, and `npm run build` from `neode-ui`. + +- Deployed backend hash `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2` to `.198` after a narrow Podman store-risk hardening pass. +- Hardened fresh local-build installs so `podman image exists ` failures/timeouts no longer fail the lifecycle operation outright: + - existing timeout remains bounded in the runtime; + - `install_fresh()` now logs the check failure and rebuilds the local image instead; + - this matches the existing drift-restart path and keeps local image store checks from becoming release-blocking. 
+- Verification on this hash: + - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. + - Focused unit test `install_fresh_builds_when_image_exists_check_fails` was added but local `cargo test ...` did not complete within 15 minutes during compilation, consistent with known local test/linker slowness. + - Targeted PhotoPrism audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. + - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Final `.198` state after validation: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256: `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2`. + +- Deployed backend hash `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36` to `.198` after hardening `container-health` fallback behavior. +- Fixed the broad lifecycle timeout path where `container-health` could return `Failed to get container health` even though the app endpoint was reachable: + - `cached_reachable_health()` now parses URL ports correctly when launch URLs include a trailing slash, such as `http://localhost:2342/`. + - The fallback port map now covers the lifecycle launch apps, including PhotoPrism `2342`, BTCPay `23000`, LND UI `18083`, Mempool `4080`, Electrum `50002`, Fedimint `8175`, Gitea `3001`, IndeedHub `7778`, Ollama `11434`, Vaultwarden `8082`, Tailscale `8240`, and others. + - Reachable cached-running apps can now return `healthy` without depending on flaky Podman health/inspect paths. 
+- Verification on this hash: + - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. + - Targeted PhotoPrism audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. + - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Final `.198` state after validation: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256: `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36`. + - `/`: `62%` used, about `11G` free. + - `/var/lib/archipelago`: `9%` used, about `370G` free. +- Remaining blockers: + - Podman socket/store health is still a release risk; continue avoiding broad store/image commands on `.198`. + - Backend-restart and host-reboot validation are still pending and should be run only when approved. + +## 2026-06-03 `.198` Generic Host-Port Health Checkpoint In Progress + +- Deployed backend hash `3912b900c376b6c28bf5453640cae82135f67d7e0f984b8adcc78064b924143b` to `.198`. +- This pass is explicitly aligned with the migration objective: use generic platform primitives from manifest/container-declared ports instead of adding more OS-level or app-specific package edits. +- Broad lifecycle on previous hash `d21202cd...` failed only because Uptime Kuma briefly appeared as `stopping` during listener repair; it recovered immediately afterward with `3002` listening and HTTP `302`. +- Implemented generic health-monitor host-port awareness: + - Health monitor now parses Podman JSON `Ports` host TCP bindings for each container. 
+ - A running container with declared host TCP ports is not considered healthy if those host listeners are missing. + - This avoids a hardcoded app-to-port list and makes missing pasta/rootless listeners a generic recovery concern. +- Also fixed scanner merge semantics: + - `Stopping -> Running` now recovers immediately when there is no user-stopped marker. + - User-initiated stops still preserve `Stopping` over live `Running` while the stop is in progress. +- Verification so far: + - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. + - Live service state after deploy: `archipelago.service` active; doctor/reconcile timers inactive. + - After backend restart, Uptime Kuma recovered its `3002` listener and returned HTTP `302`. +- Still in progress: + - Jellyfin is still running/healthy according to Podman but missing the `8096` host listener after backend restart. + - Next fix should keep the same generic direction: missing host listener repair should use the manifest/orchestrator-aware restart path for apps with declared ports, not another Jellyfin-specific OS edit. + - Broad lifecycle has not yet passed on `3912b900...`. + +## 2026-06-03 `.198` Stale State and Jellyfin Pasta Listener Repair + +- Deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` to `.198`. +- Fixed a focused lifecycle false-negative where `container-list` could report stale cached `exited` state while Podman scan backoff was active and the container had already recovered: + - Cached `exited` entries now get a targeted live refresh before being returned by `container-list`. + - This avoids broad `podman ps` scans and preserves the UI/package-data consistency model. 
+- Added a bounded `container-health` fallback for cached running web apps: + - If the cached app state is `Running` and its known local launch port accepts TCP, the RPC can return `healthy` without waiting on Podman inspect/list paths. + - This quarantines health reads from intermittent Podman socket/store stalls. +- Added Jellyfin to the legacy runtime host-port repair path: + - `runtime_required_host_port("jellyfin")` now maps to `8096`. + - stale pasta cleanup now includes `8096` for Jellyfin start conflicts. +- Validation notes: + - `package.restart jellyfin` exposed a remaining Podman socket/runtime failure after stopping the container: `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`. + - `package.start jellyfin` recovered the app afterward; `jellyfin` returned `Up ... (healthy)`, `ss` showed a `pasta.avx2` listener on `8096`, and `http://192.168.1.198:8096/` returned HTTP `302`. + - Focused lifecycle passed on the current hash: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - Endpoint checks after focused lifecycle: Uptime Kuma `3002` returned `302`; Jellyfin `8096` returned `302`; Filebrowser `8083` returned `404` at `/`, which is expected for this probe. + - `scripts/check-app-catalog-drift.py --release` still reports zero missing entries and `35` metadata drift items. +- Final `.198` state: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256: `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e`. + - `/`: `62%` used, about `11G` free. + - `/var/lib/archipelago`: `9%` used, about `371G` free. +- Remaining blocker: + - Broad lifecycle has not yet been rerun on `d21202cd...`. 
+ - Podman socket/store health is still a release risk; avoid broad image/store commands and treat socket permission/runtime failures separately from app health. + +## 2026-06-03 `.198` Expanded Rollback Cleanup and Store-Safe Uninstall + +- Deployed backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b` to `.198`. +- Expanded `system.disk-cleanup` retention beyond `archipelago.backup-*` to cover alpha-era rollback artifacts: + - legacy `/usr/local/bin/archipelago.bak*` and `archipelago.before-*` files; + - old `/opt/archipelago/web-ui.bak*` and `web-ui.old` directories. +- Live cleanup reclaimed `10.3 GB` without touching Podman image/volume prune: + - `Removed old backend backups: 41.6 MB freed`. + - `Removed old legacy backend backups: 3.6 GB freed`. + - `Removed old web UI backups: 6.6 GB freed`. + - `Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes`. +- Root filesystem pressure is no longer a release blocker on `.198`: + - Before expanded cleanup: `/` was `99%` used with about `478-545M` free. + - After expanded cleanup: `/` is `61%` used with about `11G` free. + - `/usr/local/bin` dropped to about `336M`; `/opt/archipelago` dropped to about `1.1G`. +- Uninstall no longer runs global `podman volume prune -f`; app data removal remains explicit when `preserve_data=false`. +- Verification: + - `cargo build -p archipelago --bin archipelago --release` passed. + - Local `cargo test -p archipelago system::tests` did not complete within 10 minutes in this environment; release build succeeded and live cleanup validation passed. + - Focused post-cleanup lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Final `.198` state: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. 
+ - `/usr/local/bin/archipelago` sha256: `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b`. + +## 2026-06-03 `.198` Startup Scan Backoff and Uptime Kuma Pasta Repair + +- Deployed backend hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28` to `.198`. +- Startup adoption is now bounded with a 35s timeout so a stuck `podman ps -a --format json` cannot stall backend startup indefinitely. +- The initial container scan now seeds the same 300s Podman scan backoff used by periodic scans, preventing an immediate second `podman ps` after a startup timeout. +- Legacy pasta restart paths now use scoped `podman restart` instead of stop+start. This repairs cases where a running pasta container loses its host listener but `podman start` would be a no-op. +- Uptime Kuma validation: + - Before repair, the container was running and internally healthy on `127.0.0.1:3001`, but host port `3002` had no `pasta` listener and LAN launch failed. + - `package.restart` for `uptime-kuma` now returns `{"status":"restarted"}` instead of hanging. + - Post-restart `http://192.168.1.198:3002/` returned HTTP `302` and the scanner restored launch metadata. +- Release validation passed: + - Focused audit: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - Broad audit: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Final `.198` state: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/`: still tight at `99%` used, about `395M` free. + - `/var/lib/archipelago`: about `10%` used. 
+- Residual risk: + - `.198` Podman store health remains fragile under broad store commands; avoid prune/image-list/system-df release operations until the store issue is handled separately. + - Logs during broad validation still showed unrelated IndeedHub/conmon cgroup permission noise, but focused and broad lifecycle audits passed. + +## 2026-06-02 `.198` Registry/Catalog and Lifecycle Checkpoint + +- Follow-up on Podman prune/catalog generation: + - Diagnosed the `podman image prune -f` failure and found it is broader than prune: `podman system df`, `podman image list`, `podman image exists`, and sometimes broad `podman ps`/`inspect` can hang on `.198` under current store/node load. + - Stopped only the diagnostic Podman commands started during this follow-up. + - Changed `system.disk-cleanup` to skip Podman image/volume prune entirely for the release path. Cleanup still handles logs, journal retention, temp files, and backend backup retention, and returns an explicit action: `Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes`. + - Deployed backend hash `c9695dc3db10ff6e593cdbcfbbdc94b2e98b6008aa62655bba51b9879b549e8c` to `.198`. + - Live cleanup validation passed: endpoint returned quickly, pruned old backend backups, did not spawn new Podman prune/list work, and `/` stayed around `98%` with about `647-670M` free. + - During diagnosis, Uptime Kuma's port returned empty responses. Restarted only `uptime-kuma` through `package.restart`; data preserved; launch returned HTTP `302` afterward. + - Focused post-repair audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - Broad post-repair audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. 
+ - Final raw Podman bad-state sweep was clean. + - Catalog metadata generation is not implemented yet. The release-safe step in this pass is the new `scripts/check-app-catalog-drift.py --release` mode, which reports zero missing catalog/manifest entries while still surfacing metadata-only drift. + +- Release-work continuation after cleanup/catalog/review gate: + - Deployed backend hash `e285d421cef497beb6b4b929f36fb4296d6db1f4a4c786157b6751eec51619ca` to `.198`. + - `system.disk-cleanup` is now bounded so a slow `podman image prune -f` cannot wedge the cleanup RPC indefinitely; the prune failure is reported as an action while cleanup continues. + - `system.disk-cleanup` now vacuums systemd journals to a bounded size and prunes timestamped `/usr/local/bin/archipelago.backup-*` files to the newest three using the existing `host_sudo` path. + - Live cleanup validation passed: endpoint returned, journals were reduced to about `200M`, old backend backups were pruned to three, and `/` improved from about `99%`/`490M` free to `98%`/about `730M` free. + - Added `nostr-rs-relay` to both catalog surfaces. Release-focused catalog drift now has zero missing catalog/manifest entries; remaining drift is metadata-only and belongs to the catalog-generation follow-up. + - Focused post-cleanup audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,nostr-rs-relay,portainer ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. + - Broad post-cleanup audit passed with extended harness timeout: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - Final raw Podman sweep showed no `unhealthy`, `stopping`, `removing`, `exited`, `created`, or `initialized` containers. + - Final service state: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive. 
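The backup-retention rule in the list above (keep only the newest three timestamped `/usr/local/bin/archipelago.backup-*` files) can be sketched as a small pure function. This is a minimal Rust sketch, not the actual `system.disk-cleanup` code: `backups_to_prune` is a hypothetical name, and it assumes backup names embed a sortable `YYYYMMDD` stamp, as in `archipelago.backup-20260608-store-risk-*`. The real implementation may order by modification time instead, and it removes files through the existing `host_sudo` path, which is not modeled here.

```rust
/// Hypothetical helper: given the file names found in a backup directory,
/// return the timestamped backups that fall outside the newest `keep`.
/// Assumption: names look like `archipelago.backup-YYYYMMDD-...`, so plain
/// string ordering matches chronological ordering.
pub fn backups_to_prune(names: &[&str], keep: usize) -> Vec<String> {
    const PREFIX: &str = "archipelago.backup-";

    // Only timestamped backups are candidates; every other file is ignored.
    let mut backups: Vec<&str> = names
        .iter()
        .copied()
        .filter(|name| name.starts_with(PREFIX))
        .collect();

    // Newest first: a descending string sort puts later dates at the front.
    backups.sort_unstable_by(|a, b| b.cmp(a));

    // Skip the entries to keep; whatever remains is a prune candidate.
    backups.into_iter().skip(keep).map(String::from).collect()
}
```

Calling it with a directory listing and `keep = 3` yields only the names to remove. Files that do not match the prefix, such as the live `archipelago` binary or legacy `archipelago.bak*` files, are never returned, so this sketch covers the timestamped-backup rule only.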
+ +- Follow-up validation after the previous cutoff: + - `.198` is already running the current local release build hash `579b823cf4a4b8c50bb3d0c3d49449c58101b016eb6ebc8049975dce98e34265`; no backend replacement was performed in this pass. + - Local release binary smoke-started successfully on an alternate bind/data dir before live checks. + - Meshtastic manifest-owned file rendering is now proven live: `/var/lib/archipelago/meshtastic/config.yaml` was backed up, removed, and recreated by `package.restart` from `apps/meshtastic/manifest.yml`. + - Focused Meshtastic audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. + - Focused regression audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. + - Broad non-destructive audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. + - Final raw Podman sweep showed no `unhealthy`, `stopping`, `removing`, `exited`, `created`, or `initialized` containers. + - Service state remains deterministic-test safe: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive. + - `/` remains tight at `99%` used with about `490M` free. + +- Live `.198` state after this pass: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256 is now `579b823cf4a4b8c50bb3d0c3d49449c58101b016eb6ebc8049975dce98e34265`; no backend replacement was performed in this follow-up pass. + - `/`: still tight at `99%` used, about `490M` free. 
+- Registry state: + - Live `/var/lib/archipelago/config/registries.json` is already correct: `146.59.87.168:3000/lfg2025` is primary with `tls_verify: false`; `git.tx1138.com/lfg2025` is enabled as secondary with `tls_verify: true`. + - Added `meshtastic` and `portainer` to both `app-catalog/catalog.json` and `neode-ui/public/catalog.json` so migrated manifest-owned apps are present in the registry/catalog surface. +- Live recovery performed: + - Raw Podman sweep found `nextcloud` stuck in `Removing`. + - Removed only the wedged container record with `podman rm -f nextcloud`; bind-mounted data was preserved. +- Local verification passed: + - `jq empty app-catalog/catalog.json neode-ui/public/catalog.json`. + - `cargo test -p archipelago-container generated_files_must_live_under_bind_mounts`. + - `cargo test -p archipelago manifest_generated_files`. + - `cargo test -p archipelago reconcile_force_recreates_stopping_container`. + - `cargo test -p archipelago health_maps_states_to_strings`. + - `cargo test -p archipelago test_rewrite_image`. + - `cargo test -p archipelago test_load_default`. + - `cargo check -p archipelago --bin archipelago`. + - `cargo build -p archipelago --bin archipelago --release`, hash `13786fd7bc5afb36fb7873ad9aee1a54a696e75b0a92c2fcd90cc8100038a54c`. +- Live validation passed: + - Focused audit: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. + - Broad non-destructive audit: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. + - Final raw Podman sweep showed no `unhealthy`, `stopping`, `removing`, `exited`, `created`, or `initialized` containers. 
+- Remaining before release: + - The prior release-binary segfault is no longer reproducing with the current artifact; `.198` is active on hash `579b823cf4a4b8c50bb3d0c3d49449c58101b016eb6ebc8049975dce98e34265`. Continue watching logs after restarts, but do not treat `app.files` deployment as blocked. + - Add disk cleanup/backup retention policy; root filesystem pressure still makes deploys and image operations fragile. + - Resolve broader app catalog/manifest drift reported by `scripts/check-app-catalog-drift.py`; this pass only added the migrated Meshtastic and Portainer catalog entries. + +## 2026-05-28 `.198` Meshtastic File-Rendering Recovery Checkpoint + +- Current `.198` service state after recovery: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256 restored to `2ec1952dcc5f6101d236dd3ea7a85a40a6387a3f1afb8a5681345cad90306853` after a failed deploy attempt. + - `/`: still tight at `99%` used, about `546M` free. +- Local generated-file support status: + - Manifest schema supports `app.files`. + - Production orchestrator writes declared manifest files before create/start/restart and does not overwrite existing files unless `overwrite: true` is declared. + - Meshtastic manifest declares `/var/lib/archipelago/meshtastic/config.yaml` under its bind-mounted data directory. +- Local verification passed: + - `cargo test -p archipelago-container generated_files_must_live_under_bind_mounts`. + - `cargo test -p archipelago manifest_generated_files`. + - `cargo check -p archipelago --bin archipelago`. + - `cargo build -p archipelago --bin archipelago --release` produced local hash `13786fd7bc5afb36fb7873ad9aee1a54a696e75b0a92c2fcd90cc8100038a54c`. +- Live deploy caveat: + - Deploying the local release binary to `.198` caused immediate `SIGSEGV` on `archipelago.service` startup. 
+ - The previous live binary was restored from `/usr/local/bin/archipelago.backup-20260528-container-files-2ec1952dcc5f6101d236dd3ea7a85a40a6387a3f1afb8a5681345cad90306853`; backend returned active. + - Do not redeploy that local release artifact blindly; diagnose the startup segfault/build mismatch first. +- Live Meshtastic recovery: + - Before recovery, `.198` had Meshtastic manifests with `files:` but no `/var/lib/archipelago/meshtastic/config.yaml`; container logs showed `No 'config.yaml' found` and `Blank MAC Address not allowed`. + - Wrote the same config currently declared by the manifest to `/var/lib/archipelago/meshtastic/config.yaml` as an operational recovery, then restarted `meshtastic.service`. + - Meshtastic returned `Up ... (healthy)`. +- Live validation passed: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. + - Raw Podman sweep showed Meshtastic, Jellyfin, File Browser, BTCPay, Grafana, SearXNG, Gitea, Nostr relay, Botfights, Portainer, Nginx Proxy Manager, and other active managed containers without unhealthy/stopping/removing/exited states. +- Next required work: + - Diagnose why the local release backend segfaults immediately on `.198` before deploying the generic manifest file renderer as the durable fix. + - After a safe backend deploy, remove reliance on the manually recovered Meshtastic config by proving the manifest-owned renderer recreates it on start/restart. + - Keep deterministic-test timers inactive unless intentionally running non-deterministic recovery testing. + +## 2026-05-27 `.198` Manifest-Orchestrator Migration Checkpoint + +- Current `.198` live backend: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/usr/local/bin/archipelago` sha256: `31ae1b346fd36d715c9fe7f0686dcb31a70d2fea44996abf122743d048fb7b2f`. 
+- Migration goal confirmed and advanced: apps should not require hardcoded OS/Rust edits to work. App differences belong in manifests; Rust/OS should provide generic primitives for lifecycle, Quadlet rendering, readiness/health, port repair, bind-mount prep, data ownership, and image availability. +- New generic backend fixes deployed: + - Quadlet health drift detection now compares `HealthCmd`, `HealthInterval`, `HealthTimeout`, and `HealthRetries`. + - HTTP health command rendering now derives `wget -T` / `curl -m` from manifest `health_check.timeout`; `timeout: 30s` now produces helper-level `30s` probes instead of an outer Podman `30s` wrapped around an inner `5s` command. + - Existing Quadlet unit drift that requires restart now verifies the manifest image exists locally and pulls/builds if missing before restarting. + - Existing Quadlet service start for a missing container now also verifies/pulls/builds the manifest image before `systemctl --user start`. + - Reconcile now treats manifest-declared dependencies of active apps as required even if stale `user-stopped.json` entries exist, and parent app reconcile drift-syncs existing dependency Quadlet units from their own manifests. + - Portainer host prep moved out of a hardcoded Rust install hook; generic bind-mount socket prep now handles manifest sources ending in `/podman.sock`. +- Manifest updates deployed to both `/opt/archipelago/apps` and `/opt/archipelago/web-ui/archipelago-runtime/apps`: + - `portainer`: declarative manifest with data dirs, Podman socket mount, capabilities, `data_uid`, `9000:9000`, and no Podman healthcheck. + - `btcpay-server`, `grafana`, `nostr-rs-relay`, `searxng`: HTTP health timeouts/retries loosened to `timeout: 30s`, `retries: 5` to avoid false negatives under `.198` load. + - `archy-nbxplorer` manifest has `timeout: 30s`, `retries: 5`; live unit now matches with helper-level `wget -T 30` / `curl -m 30`. +- Local verification passed: + - `cargo fmt`. 
+ - `cargo test -p archipelago translate_health_check -- --nocapture` passed. + - `cargo check -p archipelago --bin archipelago` passed after each backend fix. + - `cargo build -p archipelago --bin archipelago --release` passed; final deployed binary hash is `31ae1b346fd36d715c9fe7f0686dcb31a70d2fea44996abf122743d048fb7b2f`. +- Live `.198` validation: + - Portainer full lifecycle passed earlier: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - BTCPay focused lifecycle passed after the missing-image start guard: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=btcpay-server ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - Focused migration audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=btcpay-server,grafana,nostr-rs-relay,searxng,portainer,gitea ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. + - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. + - Targeted unit/container sweep showed `btcpay-server`, `grafana`, `nostr-rs-relay`, `searxng`, and `portainer` services active. + - Post-focused and post-broad raw Podman sweeps found no `unhealthy`, `stopping`, `removing`, `exited`, `created`, or `initialized` containers. + - Raw states: `btcpay-server Up ... (healthy)`, `grafana Up ... (healthy)`, `nostr-rs-relay Up ... (healthy)`, `searxng Up ... (healthy)`, `portainer Up ...`. + - Generated units for `btcpay-server`, `grafana`, `nostr-rs-relay`, and `searxng` now show helper-level `wget -T 30` / `curl -m 30`, `HealthTimeout=30s`, and `HealthRetries=5`. 
+ - Generated unit for `archy-nbxplorer` now also shows helper-level `wget -T 30` / `curl -m 30`, `HealthTimeout=30s`, and `HealthRetries=5`; BTCPay stack remained healthy. + - Filebrowser full lifecycle passed under the manifest/orchestrator path: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=filebrowser ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - Filebrowser post-test live verification: `filebrowser.service` active; bind mounts `/srv` and `/data` rendered; `Exec=--config /data/.filebrowser.json`; generated `.filebrowser.json` points database to `/data/filebrowser.db` and root to `/srv`; container is `Up ... (healthy)`. +- Operational caveat found: + - `.198` root filesystem remains tight: about `556M` free on `/` (`99%` used). There are many old backend backup binaries under `/usr/local/bin`; deploys and Podman image operations are fragile until backup/image cleanup policy is added. +- Remaining before release: + - Meshtastic full lifecycle now passed on `.198` after routing it through the orchestrator path and fixing its manifest image, device, volume target, health check, launch metadata handling, and TCP port declaration. + - Replace the temporary/manual Meshtastic host `config.yaml` dependency with the generic manifest-owned file rendering path: + - Added local schema support for `app.files`. + - Added local production-orchestrator rendering for declared files before container start. + - Added Meshtastic `files:` declaration for `/var/lib/archipelago/meshtastic/config.yaml`. + - Local manifest parser tests passed; backend orchestrator tests are still running before deployment. + - Latest post-Meshtastic raw `.198` sweep: + - `archipelago.service`: active. + - `archipelago-doctor.timer`: inactive. + - `archipelago-reconcile.timer`: inactive. + - `/`: 99% used, about `532M` free. 
+ - `jellyfin` and `filebrowser` reported `unhealthy`; investigate before final release qualification. + - Add the release code-review/refactor/performance gate: remove dead transitional code, reduce remaining app-specific Rust/OS paths, review scan/health/reconcile performance, then rerun lifecycle and launch tests after cleanup. + +## 2026-05-26 Migration Release Notes + +- Active doctrine: app-specific host mutations should move out of generic Rust/OS install paths wherever possible. Apps should be described by manifests and lifecycle hooks; the Rust backend should provide generic primitives for validation, container lifecycle, health/readiness, port repair, secrets, data ownership, and recovery. +- Current `.198` work remains focused on lifecycle migration hardening first. Do not call the migration finished until focused full lifecycle and broad audits pass on the manifest/orchestrator-owned path. +- `.198` Gitea migration checkpoint: + - Backend deployed: `/usr/local/bin/archipelago` sha256 `3780e54eec4821a61fbc024259bd854ec376228eb981fa169ec6f8aeafc5a9dd`. + - Gitea manifest deployed to both `/opt/archipelago/apps/gitea/manifest.yml` and `/opt/archipelago/web-ui/archipelago-runtime/apps/gitea/manifest.yml`, latest sha256 `8df263fcca9581a4e0a2872d21d26eed35b007c7bd7475071bedfd005f514e68`. + - The Gitea fix is manifest-owned: `security.no_new_privileges` is now honored by the generic Podman/Quadlet renderers, and Gitea declares its required capabilities (`CHOWN`, `FOWNER`, `SETUID`, `SETGID`, `DAC_OVERRIDE`, `NET_BIND_SERVICE`) plus `no_new_privileges: false`. + - Focused full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=gitea ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- `.198` generic host-listener repair checkpoint: + - Backend deployed: `/usr/local/bin/archipelago` sha256 `be06756763283535d2b3ee911cc91c7d401fb51b4dd88a3ebe86d79a05183e84`. 
+ - Running-container reconcile now probes manifest-declared host ports and repairs missing listeners generically; observed repair restored Grafana port `3000` without a Grafana-specific OS edit. + - Uptime Kuma repair uses a longer readiness window so the generic repair path does not restart it before its slow HTTP startup completes. + - Gitea healthcheck timeout/retries were loosened in manifest metadata (`timeout: 30s`, `retries: 5`) after raw Podman health showed timeout-only false negatives while HTTP launch returned `200`. + - Focused audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=gitea,grafana,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. +- Release follow-ups to keep in scope after the current Gitea/Uptime/Nextcloud migration pass: + - Portainer fixes discussed on 2026-05-26 must be carried into the new declarative approach, not left as a hardcoded OS prerequisite path. Completed for the current `.198` pass: + - Added `apps/portainer/manifest.yml` with manifest-declared data dirs, Podman socket mount, port `9000`, capabilities, `data_uid`, and no Podman healthcheck. + - Removed the hardcoded `ensure_portainer_host()` OS/Rust install hook. + - Added generic manifest-driven Podman socket preparation for any app that bind-mounts `podman.sock`. + - Backend deployed: `/usr/local/bin/archipelago` sha256 `d440e2cba52c6e1b60d8f0716386b0f4e3ce56b5370cedafabc6dbd30d230909`. + - Portainer manifest deployed to both `/opt/archipelago/apps/portainer/manifest.yml` and `/opt/archipelago/web-ui/archipelago-runtime/apps/portainer/manifest.yml`, latest sha256 `5e2ab96f2ba91ad2539a7dc6b73c92c6cece676109550d7d4c2f556aa578ba9c`. + - Focused full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. 
+ - Re-test the Filebrowser fixes under the manifest/orchestrator path. + - Re-test the Meshtastic fixes before final release qualification. + - Add an app packaging documentation gate: update `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` so third-party developers can package apps against the current manifest/runtime contract without relying on one-off OS-level changes. + - Add a required release code-review/refactor gate before cutting `1.8-alpha`: remove dead transitional code, replace remaining app-specific Rust/OS paths with manifest-owned metadata or generic lifecycle primitives, review scan/health/reconcile performance, then rerun lifecycle and launch tests after the cleanup. ## 2026-05-13 `.198` Stopping-State Repair Checkpoint diff --git a/docs/CURRENT_AGENT_HANDOFF.md b/docs/CURRENT_AGENT_HANDOFF.md new file mode 100644 index 00000000..1e67913f --- /dev/null +++ b/docs/CURRENT_AGENT_HANDOFF.md @@ -0,0 +1,216 @@ +# Current Agent Handoff - Bitcoin UI Recovery And `1.8-alpha` Resume + +Last updated: 2026-06-10 05:33 EDT + +## Read This First + +This is a separate handoff from `docs/NEXT_TERMINAL_HANDOFF.md`. That file tracks +an older/broader plan. For the next agent resuming this machine-switch pause, +read this file first, then read: + +- `docs/RESUME.md` +- `docs/1.8-alpha-improvements-tracker.md` +- `docs/CONTAINER_LIFECYCLE_HANDOFF.md` +- `docs/MIGRATION_STATUS_REPORT.md` + +Do not assume `docs/NEXT_TERMINAL_HANDOFF.md` is the current short-term plan. + +## Current Goal + +Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image. 
+
+The release goal is not just "apps launch once"; the app/container system needs
+to be developer-ready and production-release ready:
+
+- manifests and docs must describe the real runtime contract;
+- apps must install, start, stop, restart, uninstall, reinstall, survive reboot,
+  report truthful status, and show useful progress;
+- My Apps must preserve last-known truth during Podman/scanner backoff instead
+  of showing false empty/no-app states;
+- Bitcoin-dependent apps must explain sync/wallet readiness instead of looking
+  broken;
+- final validation needs focused lifecycle, broad non-destructive lifecycle,
+  then repeated reboot checks before ISO cut/smoke test.
+
+## Current Estimate
+
+As of this pause:
+
+- Credible release candidate: roughly `87-91%`.
+- Production-quality release developers will love: roughly `73-79%`.
+- Calendar estimate if the remaining systemic lifecycle issues are bounded:
+  `1-2 focused engineering days` for a release candidate, then additional
+  reboot/ISO smoke time.
+- The biggest remaining risk is not catalog wiring; it is rootless Podman
+  control-plane responsiveness, stale scanner state, lifecycle progress UX, and
+  reboot validation.
+
+## Validation Host
+
+- Host: `192.168.1.198`
+- SSH user: `archipelago`
+- Password used in this session: `password123`
+- Active Bitcoin app on this host: `bitcoin-knots`, not `bitcoin-core`
+- Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive
+  for deterministic validation unless intentionally testing them.
+- Preserve app data.
+- Avoid broad Podman store/image cleanup commands on `.198`.
+
+## Bitcoin UI Incident Summary
+
+User reported the Bitcoin custom UI showing:
+
+`Bitcoin node is starting or busy syncing; retrying automatically. Detail:
+getblockchaininfo: Bitcoin RPC request failed ... operation timed out`
+
+Then after listener repair, the message changed through:
+
+- `Connection refused`
+- `Verifying blocks...`
+- then the user reported it looked fine again.
+
+What happened:
+
+- The node is a `bitcoin-knots` node.
+- During live debugging, the wrong alias, `bitcoin-core`, was started/stopped.
+- `bitcoin-core` and `bitcoin-knots` compete for the same Bitcoin RPC/P2P ports.
+- That action left the real `bitcoin-knots` service active but without the host
+  `8332` rootlessport listener for a while.
+- Stopping the stray `bitcoin-core.service` and restarting only
+  `bitcoin-knots.service` recreated listeners on `8332` and `8333`.
+- After restart, bitcoind entered the normal `-28 Verifying blocks...` phase.
+- The user later reported the Bitcoin UI looked fine again.
+
+Known live state observed during recovery:
+
+- `bitcoin-knots.service`: active
+- `bitcoin-core.service`: inactive
+- `archy-bitcoin-ui.service`: active
+- listeners present after repair:
+  - `8332` via `rootlessport`
+  - `8333` via `rootlessport`
+  - `8334` via nginx/Bitcoin UI
+- `bitcoin-knots` logs showed active IBD around height `4137xx` and progress
+  about `0.09438`.
+
+Do not restart Bitcoin again unless there is a fresh confirmed service/listener
+failure. If checking status, prefer read-only probes and avoid starting the
+wrong variant.
+
+## Source Fixes Made Locally
+
+These local edits were made after live Bitcoin recovered. They are not deployed
+yet and were not fully validated before the user paused.
+
+### `core/archipelago/src/bitcoin_status.rs`
+
+Changed Bitcoin status cache behavior and copy:
+
+- refresh interval changed from `5s` to `10s`;
+- transient error backoff added at `15s`;
+- RPC client timeout increased from `8s` to `20s`;
+- error context now uses full anyhow chain with `{e:#}`;
+- transient classifications now include common overloaded/backend states;
+- user-facing copy now distinguishes:
+  - `verifying blocks after restart`;
+  - `waiting for the Bitcoin RPC listener`;
+  - `busy and not answering RPC before the timeout`;
+  - generic `starting or busy syncing`;
+- added unit tests for the three user-visible states above.
+
+Intent: stop collapsing distinct backend states into the same stale
+"starting or busy syncing" timeout message.
+
+### `core/archipelago/src/api/rpc/package/update.rs`
+
+Narrow Bitcoin alias fix added:
+
+- `orchestrator_update_app_id("bitcoin-knots")` now remains
+  `"bitcoin-knots"` instead of mapping to `"bitcoin-core"`;
+- candidate app IDs for a Bitcoin container now prefer `bitcoin-knots` before
+  `bitcoin-core`;
+- tests updated to lock this behavior.
+
+Intent: `bitcoin-core` and `bitcoin-knots` can be dependency/status aliases,
+but must not be interchangeable lifecycle/update targets on a node that has a
+specific installed variant.
+
+Important: this file also already contained other uncommitted update/pull
+timeout changes from prior work. Do not assume every diff in this file came
+from this interruption.
+
+## Validation Status At Pause
+
+Completed:
+
+- `cargo fmt --manifest-path core/Cargo.toml --all` passed after the local
+  Bitcoin edits.
+
+Attempted but not completed:
+
+- Targeted Cargo tests were first launched in three separate `/tmp` target dirs
+  and failed due to `/tmp` filling with `No space left on device`.
+- Those temporary dirs were removed:
+  - `/tmp/archy-cargo-bitcoin-status`
+  - `/tmp/archy-cargo-update-alias`
+  - `/tmp/archy-cargo-container-candidates`
+- A second run using `CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix` was still
+  compiling when the user paused. It was terminated for handoff.
+- No successful Rust test result exists yet for the new Bitcoin status/alias
+  tests.
+
+Recommended validation after resume:
+
+```bash
+git diff --check -- core/archipelago/src/bitcoin_status.rs core/archipelago/src/api/rpc/package/update.rs docs/CURRENT_AGENT_HANDOFF.md
+CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix CARGO_BUILD_JOBS=2 cargo test --manifest-path core/Cargo.toml -p archipelago bitcoin_status::tests
+CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix CARGO_BUILD_JOBS=2 cargo test --manifest-path core/Cargo.toml -p archipelago update_aliases_map_to_manifest_app_ids
+CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix CARGO_BUILD_JOBS=2 cargo test --manifest-path core/Cargo.toml -p archipelago container_name_candidates_cover_common_aliases
+```
+
+If Cargo target locking appears stale, check for real `cargo`/`rustc` workers
+before deleting anything. Prefer workspace-local target dirs under `.codex-tmp`
+over new cold `/tmp` targets.
+
+## Immediate Next Steps
+
+1. Confirm no lingering Cargo process:
+
+   ```bash
+   pgrep -af "cargo|rustc|cargo-bitcoin-fix"
+   ```
+
+2. Validate the local Bitcoin source fixes listed above.
+
+3. If validation passes, build/deploy the backend to `.198` only after
+   confirming the user still wants deployment.
+
+4. Recheck live Bitcoin non-destructively:
+
+   - `bitcoin-knots.service` active;
+   - `bitcoin-core.service` inactive;
+   - listeners on `8332`, `8333`, `8334`;
+   - Bitcoin UI loads on `8334`;
+   - `/bitcoin-status` returns useful copy if backend is busy.
+
+5. Resume release backlog:
+
+   - rootless Podman lifecycle/control-plane responsiveness;
+   - My Apps last-known-state truthfulness during scanner backoff;
+   - progress UX for install/uninstall/start/stop/restart;
+   - remaining tracker rows in `docs/1.8-alpha-improvements-tracker.md`;
+   - focused lifecycle matrix on `.198`;
+   - broad non-destructive lifecycle;
+   - 3 clean reboot validations minimum, 5 preferred;
+   - ISO cut and ISO smoke test.
+
+## Cautions For Next Agent
+
+- Do not start `bitcoin-core` on `.198` unless intentionally migrating variants.
+- Treat `bitcoin-knots` as the installed Bitcoin variant.
+- Do not run broad Podman prune/store cleanup.
+- Do not revert unrelated dirty worktree changes.
+- `docs/NEXT_TERMINAL_HANDOFF.md` exists but is not the short-term handoff for
+  this pause.
+- Many repo files are dirty from broader release hardening. Read diffs before
+  attributing changes.
diff --git a/docs/MIGRATION_STATUS_REPORT.md b/docs/MIGRATION_STATUS_REPORT.md
new file mode 100644
index 00000000..342398f1
--- /dev/null
+++ b/docs/MIGRATION_STATUS_REPORT.md
@@ -0,0 +1,105 @@
+# Migration Status Report
+
+Last updated: 2026-06-11
+
+## Goal
+
+Make Archipelago's app/container system developer-ready and release-ready: app installs, lifecycle, recovery, and integrations should be portable, manifest-driven, and not rely on one-off OS-level changes or hardcoded Rust branches for each new app. The OS/backend should provide generic primitives for manifests, Quadlet rendering, lifecycle, health/readiness, dependency ordering, data ownership, image availability, bind mounts, secrets, app files, networking, bridge/signer integrations, and recovery.
+
+The developer contract should be clear enough that a third-party developer can build and ship an Archipelago app from documentation plus manifest/schema examples.
If an app needs a capability the platform does not yet expose, the release direction is to add a reusable manifest/orchestrator primitive rather than a special case tied to that app. This is the standard for the `1.8-alpha` app migration: professional app delivery, predictable behavior after restart/reboot, and a path for user-installed/community apps that does not require rebuilding the OS image for every app. + +Release quality bar: every supported app must install, stop, start, restart, uninstall, survive host reboot, report accurate status, and expose clear install/uninstall progress. Stale health notifications must not persist across login or refresh after the underlying condition has cleared. Final release validation should run on the intended release validation server, not drift between appliances without an explicit checkpoint. + +Target release: `1.8-alpha`, including a cut and smoke-tested ISO once validation is green. + +Current release readiness estimate: about `82%`. The remaining percentage is mostly post-reboot recovery confidence, repeated reboot validation, and ISO creation/smoke testing rather than the core manifest/catalog migration itself. + +## Current Result + +- The migration is not final-release complete yet, but the core direction is being met. +- Portainer, Filebrowser, BTCPay, Grafana, Nostr Relay, SearXNG, Gitea, and key dependency units have moved further into the manifest/orchestrator path. +- `.198` has passed focused and broad lifecycle audits for the already migrated set. +- Meshtastic is now routed through the orchestrator path, no longer falls back to legacy `localhost/meshtastic:latest`, and has passed full lifecycle validation on `.198`. +- On 2026-06-02, focused and broad `.198` non-destructive lifecycle audits passed after clearing a wedged `nextcloud` Podman record. The live registry config already has OVH primary plus tx1138 mirror, and Meshtastic/Portainer were added to the catalog surfaces. 
+- Later on 2026-06-02, the current release backend hash `579b823cf4a4b8c50bb3d0c3d49449c58101b016eb6ebc8049975dce98e34265` was found active and stable on `.198`. Meshtastic `app.files` rendering was proven live by removing `/var/lib/archipelago/meshtastic/config.yaml`, restarting through `package.restart`, and verifying the manifest recreated the file. Focused Meshtastic, focused `meshtastic,jellyfin,filebrowser`, and broad non-destructive audits all passed afterward; raw Podman sweep was clean. +- The remaining release gate was continued on 2026-06-02: bounded disk cleanup, journal retention, backend-backup retention, and release-focused catalog drift classification were added. `.198` is active on backend hash `e285d421cef497beb6b4b929f36fb4296d6db1f4a4c786157b6751eec51619ca`; focused and broad post-cleanup lifecycle audits passed, and final raw Podman sweep was clean. +- Follow-up found Podman store commands can hang on `.198` beyond image prune (`podman system df`, image list/exists, and sometimes broad ps/inspect). The release cleanup path now skips Podman image/volume prune rather than touching that unstable path. `.198` is active on backend hash `c9695dc3db10ff6e593cdbcfbbdc94b2e98b6008aa62655bba51b9879b549e8c`; Uptime Kuma was repaired with a normal `package.restart`; focused and broad post-repair lifecycle audits passed, and final raw bad-state sweep was clean. +- On 2026-06-03, startup/adoption scanner hardening and pasta restart repair were deployed. `.198` is active on backend hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28`; `package.restart` for Uptime Kuma now returns successfully and restores the `3002` pasta listener; focused `meshtastic,jellyfin,filebrowser,uptime-kuma` and broad lifecycle audits passed. +- Later on 2026-06-03, expanded rollback cleanup and store-safe uninstall hardening were deployed. 
`.198` is active on backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b`; `system.disk-cleanup` reclaimed `10.3 GB` from old backend and web UI rollback artifacts while still skipping Podman prune, and focused `meshtastic,jellyfin,filebrowser,uptime-kuma` lifecycle passed afterward. +- Latest 2026-06-03 follow-up deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e`. It mitigates stale cached `container-list` state during Podman scan backoff, adds a bounded TCP reachability fallback for `container-health`, and adds Jellyfin `8096` to legacy pasta host-listener repair. Focused `meshtastic,jellyfin,filebrowser,uptime-kuma` lifecycle passed on this hash. Broad lifecycle still needs rerun on this latest hash. +- Current validation backend hash is `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b`. It keeps the generic host-listener health direction, preserves the `container-health` fallback fix from `be95ea...`, hardens fresh local-build installs so `podman image exists ` failures/timeouts rebuild instead of failing the lifecycle operation, and reduces duplicated legacy runtime port repair by deriving host ports from manifests. Targeted PhotoPrism and broad non-destructive `.198` lifecycle audits passed on this hash. +- Catalog metadata generation from manifests is now implemented via `scripts/generate-app-catalog.py`. The canonical catalog and UI public catalog are synced from manifest-owned fields, strict release drift is zero, and frontend build validation passed. +- Current live `.198` validation backend hash is `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`. Broad non-destructive lifecycle is green on that deployed line after app health/port recovery, IndeedHub recovery, scoped legacy install hardening, and bounded Podman pull hardening. 
+- Local release validation now passes the full backend binary test target and every Rust workspace member after release cleanup fixes for scanner backoff wakeups, crash-recovery tests, manifest-port lookup, journal parsing, and boot-reconciler test determinism. +- Frontend release validation now passes `npm run type-check`, `npm test` (`548` tests), and `npm run build` after fixing mobile app-launch routing for new-tab apps and updating stale launch tests. Local `npm ci` is blocked by root-owned `neode-ui/node_modules` entries, so dependency reinstall remains a local environment cleanup item requiring explicit approval. +- Reboot validation is not yet green. User reported that a reboot test left IndeeHub stopped afterward, with multiple containers killed by SIGKILL during shutdown/reboot and at least one crash. Treat post-reboot recovery as the active release blocker. +- Local follow-up now hardens IndeeHub stack boot recovery and updates lifecycle validation so IndeeHub must still serve the Nostr signer bridge (`/nostr-provider.js`) before a launch probe passes. + +## Completed In This Pass + +- Pause checkpoint for resume: generated app-session metadata now covers manifest-owned launch ports, titles, and new-tab behavior. The next migration step should continue from proxy path/companion UI alias generation or return to the release blocker around post-reboot IndeeHub recovery. +- Updated `docs/APP-PACKAGING-MIGRATION-PLAN.md` to reflect the current `apps//manifest.yml` contract, replacing stale `archy-app.yml` next-step language with the actual parser/generator/orchestrator progress and the remaining migration blockers. +- Updated `docs/app-developer-guide.md` so developers see the current manifest fields, generated catalog flow, validation commands, and release lifecycle expectations instead of the older Nostr marketplace publish/trust-score draft. 
+- Verified the developer-guide manifest example parses as YAML, `scripts/generate-app-catalog.py` is idempotent, strict release catalog drift remains zero, and `git diff --check` is clean for the migration docs. +- Extended `scripts/generate-app-catalog.py` to also emit `neode-ui/src/views/appSession/generatedAppSessionConfig.ts` from manifests, and wired `appSessionConfig.ts` to merge generated launch ports/titles/new-tab launch behavior with the existing manual overrides for companion UIs and aliases. +- Added a Fedimint `interfaces.main` launch declaration for the Guardian wait/proxy UI on port `8175`, so that public launch surface is now represented in the manifest. +- Focused validation passed for the generated app-session path: Python helper compile, generator idempotence, strict catalog drift, `appSessionConfig.test.ts`, and frontend type-check. +- Aligned `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` with the current manifest/runtime contract so the release docs no longer describe the stale marketplace-style schema. +- Removed the hardcoded Portainer host-prep path and replaced it with a manifest plus generic Podman socket bind-mount preparation. +- Added generic Quadlet health drift detection for command, interval, timeout, and retry changes. +- Made rendered HTTP health helpers honor manifest timeouts. +- Added image availability guards before Quadlet starts/restarts so pruned images are pulled or built before systemd tries to start them. +- Fixed stale dependency handling so active manifest dependencies are not suppressed by old `user-stopped.json` entries. +- Added parent-app reconcile syncing for dependency Quadlet units. +- Validated Portainer, Filebrowser, BTCPay, and broad non-destructive audits on `.198`. +- Updated Meshtastic manifest to use a real available image, the real `/dev/ttyUSB0` device, the actual daemon data path, and a non-HTTP health check. 
+- Updated the lifecycle harness so non-HTTP apps do not require launch metadata. +- Added a generic manifest-owned file rendering primitive under `app.files` so apps can declare required bind-mounted config files without adding app-specific Rust/OS branches. + +## Current `.198` State + +- `archipelago.service`: active. +- `archipelago-doctor.timer`: inactive. +- `archipelago-reconcile.timer`: inactive. +- Current validation backend hash: `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`. +- `.198` root filesystem pressure is currently resolved for release validation: latest sweep showed `/` at 65% used with about 9.6G free after expanded rollback cleanup. +- Latest focused Fedimint, Immich, IndeedHub, and PhotoPrism audits passed on the current hash. +- Broad non-destructive lifecycle passed on the current hash before and after backend restart validation. + +## Meshtastic Status + +- Orchestrator routing is fixed and verified by the generated Quadlet unit. +- Current generated unit uses: + - `Image=docker.io/meshtastic/meshtasticd:daily-alpine` + - `Volume=/var/lib/archipelago/meshtastic:/var/lib/meshtasticd:Z` + - `AddDevice=/dev/ttyUSB0` + - `HealthCmd=test -f /var/lib/meshtasticd/config.yaml` +- The daemon starts and accepts TCP API connections on port `4403`. +- Full lifecycle passed on `.198`: install, stop, start, restart, uninstall with preserved data, and reinstall. +- A persisted `config.yaml` is required. The release path is now the generic `app.files` manifest primitive rather than a Meshtastic-specific backend hook, and this has been verified live on `.198` by deleting the file and proving `package.restart` recreates it from the manifest. + +## Release Blockers + +- Continue monitoring the current optimized release backend on `.198`; the previously observed release-binary segfault is not reproducing with hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`. 
+- `system.disk-cleanup` now handles journal, backend-backup, legacy backend rollback, and web UI rollback retention while intentionally skipping Podman image/volume prune because Podman store commands can hang on `.198` under current load. Diagnose Podman store health separately from the release cleanup path. +- Release image probes have been further quarantined from the fragile Podman store commands and deployed to `.198` on backend hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`: runtime, legacy install, and companion image checks now use bounded targeted `podman image inspect` instead of `podman image exists` or `podman images -q`. Focused and broad non-destructive lifecycle validation passed on the deployed hash. +- Podman socket/runtime health remains a release blocker: `package.restart jellyfin` stopped the container but failed to complete because Podman reported `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`; `package.start jellyfin` recovered the app and the focused lifecycle passed afterward. +- Release-focused catalog drift now has zero missing catalog/manifest entries and zero metadata drift after generating catalog metadata from manifests. +- Backend-restart validation passed. Host-reboot validation is currently failed/pending due to post-reboot IndeeHub recovery. Reboot retests should run only after an explicit release checkpoint/approval. +- Local code-review/refactor cleanup gate has full local validation coverage now: + - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` passed (`688` tests); + - all other workspace packages check/test clean; + - frontend type-check/tests/build passed; + - release build, catalog drift, catalog idempotence, Python helper compile, and whitespace checks passed. 
+- Before `1.8-alpha` release: + - deploy the post-reboot recovery fixes; + - prove focused IndeeHub lifecycle with Nostr signer injection intact; + - update the app packaging/developer docs so `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` match the current manifest/runtime contract and release-quality lifecycle expectations; + - complete the required refactor/remove-dead-code gate after correctness validation: remove obsolete transitional code, stale per-app hacks, duplicate lifecycle paths, and misleading compatibility fallbacks, then rerun release validation; + - require at least 3 consecutive clean post-fix reboots with broad non-destructive lifecycle green after each; + - prefer 5 consecutive clean reboots for production-release confidence; + - cut and smoke-test the `1.8-alpha` ISO. + +## Bottom Line + +We are still working toward the intended goal: being better than Umbrel/StartOS by making app behavior declarative and registry/manifest-owned. The migration is substantially advanced: Meshtastic manifest-owned config generation is verified live, catalog metadata is generated from manifests, disk cleanup/backup retention is in place without Podman prune risk, and full local backend/frontend workspace validation has been green. Remaining follow-up for `1.8-alpha`: post-reboot recovery validation (especially IndeeHub plus Nostr signer behavior), repeated reboot passes, ISO cut/smoke test, separate Podman socket/store-health diagnosis, and optional local cleanup of root-owned frontend dependencies before rerunning `npm ci`.
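
The reboot gate in the release checklist above (at least 3 consecutive clean post-fix reboots, 5 preferred, with broad non-destructive lifecycle green after each) can be written down as a small check. This is a minimal sketch only: `RebootResult`, `consecutive_clean`, and `reboot_gate` are hypothetical names for illustration and do not exist in the repository; the thresholds are the ones stated in the checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RebootResult:
    """Outcome of one post-fix reboot pass (hypothetical record shape)."""
    clean_boot: bool        # host came back without manual recovery
    lifecycle_green: bool   # broad non-destructive lifecycle passed afterwards


def consecutive_clean(results):
    """Count the trailing run of fully clean reboot passes."""
    streak = 0
    for result in reversed(results):
        if not (result.clean_boot and result.lifecycle_green):
            break  # any failure resets the streak
        streak += 1
    return streak


def reboot_gate(results, required=3, preferred=5):
    """Classify the reboot history against the release thresholds."""
    streak = consecutive_clean(results)
    if streak >= preferred:
        return "preferred"
    if streak >= required:
        return "required"
    return "blocked"
```

Only the trailing streak counts, so a single failed pass after earlier clean ones puts the gate back to `blocked`, which matches the "consecutive" wording in the checklist.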
diff --git a/docs/NEXT_TERMINAL_HANDOFF.md b/docs/NEXT_TERMINAL_HANDOFF.md new file mode 100644 index 00000000..ecc188a2 --- /dev/null +++ b/docs/NEXT_TERMINAL_HANDOFF.md @@ -0,0 +1,572 @@ +# Next Terminal Handoff - Archipelago `1.8-alpha` + +Last updated: 2026-06-11 00:17 America/New_York + +## Resume Prompt + +Paste this into the next terminal/session: + +> Continue Archipelago `1.8-alpha` release hardening from `/home/archipelago/Projects/archy`. First read `docs/NEXT_TERMINAL_HANDOFF.md`, then `docs/RESUME.md`, `docs/CONTAINER_LIFECYCLE_HANDOFF.md`, `docs/MIGRATION_STATUS_REPORT.md`, and `docs/1.8-alpha-improvements-tracker.md`. Active validation node is `.198` at `192.168.1.198` with user `archipelago` and password `password123`. Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive for deterministic validation. Do not run broad Podman store/image cleanup commands on `.198` (`podman prune`, `podman image list`, `podman system df`, broad image-exists/list/store-wide cleanup); the store/control path is known to hang under load. Preserve app data. Latest deployed backend hash on `.198` is `159e0daf13fca2df7e831122cb0e6c84223a7e5b7433f5dd0b7eec263233e228`. Fedimint Guardian public launch is fixed: `8175` serves the styled wait/proxy UI with real background/icon assets and proxies to backend Guardian on `8177`; `package.restart fedimint` now returns immediately and settled with both services active. Latest local-only tracker pass added uninstall preserve/delete-data UI, companion APK QR/download, setup instructions rendering, Fleet/Bitcoin receive-state loading improvements, Nextcloud false-update work, PhotoPrism credential fallback, and removed the Spotlight AI coming-soon block. Continue with the broader rootless Podman lifecycle/control-plane blocker, My Apps state truthfulness, progress UX, remaining in-progress tracker items, full lifecycle, clean reboot iterations, ISO cut, and ISO smoke test. 
+ +## Current Goal + +Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image. + +Release status is still not green. The remaining work is mostly systemic hardening and final gates, not basic app catalog wiring. + +The user improvement list in `docs/1.8-alpha-improvements-tracker.md` is part of +the same release and next ISO cut. Keep that tracker updated as items move from +`todo` to `in-progress`, `blocked`, `done`, or explicit release deferral. + +## Active Session Checkpoint - 2026-06-10 05:48 EDT + +New terminal resumed from this handoff. No `.198` host actions have been run in +this resumed pass yet. + +Resume-save checkpoint, 2026-06-10 08:32 EDT: progress is saved in this handoff +and `docs/1.8-alpha-improvements-tracker.md`. No `.198` host actions were run +after the 05:48 checkpoint, no dev server was intentionally left running, and no +long-running validation command is expected to still be active from this pass. +The user explicitly wants the fixes backlog continued, not app migration work, +unless they redirect. Start a resumed session by re-reading the tracker row +`Make tabs info load quickly or show loading states`, then continue the slow +panel audit or move to the next unresolved fixes-backlog row. + +Resume-save checkpoint, 2026-06-10 23:15 EDT: continued only frontend fixes +backlog work and avoided Bitcoin/Tor RPC/backend paths because another agent is +working there. No `.198` host actions were run, no dev server was intentionally +left running, and no long-running validation command is expected to still be +active from this pass. + +Resume-save checkpoint, 2026-06-11 00:17 EDT: continued the fixes backlog only, +not app migration. Avoid Bitcoin/Tor RPC/backend work because a separate agent +is working there. 
The latest local change fixes the header responsiveness +regression the user flagged: primary My Apps/App Store/Websites navigation is +restored to persistent desktop tabs at `md+` on My Apps, Discover, and +Marketplace; desktop primary dropdowns were removed; mobile dropdown behavior +remains; App Store category collapse is delayed by starting uncollapsed and +using a smaller header gap/search reserve; My Apps desktop category dropdown was +removed. Validation passed `npm run type-check`, +`npm test -- --run src/views/marketplace/__tests__/MarketplaceAppCard.test.ts src/views/apps/__tests__/appsConfig.test.ts`, +and scoped `git diff --check`. Browser smoke against the already-running local +Vite/mock session (`http://127.0.0.1:8102` and mock backend `5959`) is still +pending. Leave that existing session alone unless it has already exited. + +Exact first step for this pass: + +1. Update the handoff docs with this fresh checkpoint. +2. Rerun local resume gates that were pending after the 05:30 checkpoint: + `git diff --check` and the focused Rust image-version test for the + Nextcloud false-update work. +3. If local gates are clean, continue the rootless Podman lifecycle/control-plane + blocker by inspecting the backend scanner/backoff and package stop/start/ + restart paths before touching `.198`. + +Progress in this resumed pass: + +- `git diff --check` passed. +- `/tmp` has sufficient build headroom for focused Rust validation + (`/tmp` was 14% used at the start of the pass). +- Focused Rust validation for Nextcloud/image-version work is still + inconclusive, not green: + `env CARGO_INCREMENTAL=0 CARGO_TARGET_DIR=/tmp/archy-cargo-image-versions cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests` + compiled through the `archipelago` crate, then the tool PTY stayed open with + no active `cargo`, `rustc`, or linker process visible in `ps`. 
+- A bounded retry using the normal workspace target also did not finish: + `timeout 300s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests` + exited `124` after compiling the `archipelago` test target without reaching + test output. Keep the Nextcloud false-update row `in-progress`. +- Found and fixed a lifecycle asymmetry in + `core/archipelago/src/api/rpc/package/runtime.rs`: `package.stop` claimed to + return immediately but single-orchestrator apps still stopped synchronously + before responding. The local change now lets migrated single-orchestrator apps + return `{"status":"stopping"}` immediately and finish stop in the background, + matching start/restart behavior. This is not deployed yet and still needs + local validation. +- Separate UI-only pass on port-review track: + - My Apps now preserves the last known backend package list when a later + scanner/backoff update reports `containers-scanned=false` with an empty + package map; + - the page shows `Refreshing container state. Showing the last known app list + until the scan finishes.` above the app grid while cached app state is being + rendered; + - this touched only `neode-ui` UI files and this handoff/tracker note, so it + should not conflict with the backend app migration/control-plane pass; + - focused validation passed: + `npm test -- --run src/views/apps/__tests__/appPackageCache.test.ts` and + `npm run type-check`. 
+ - Web5 Shared Content My Content tab now keeps the current content list + visible during refresh/failure and shows `Refreshing shared content...`; + - Web5 Shared Content Browse Peers tab now keeps the current peer content list + visible while refreshing the same peer, and shows `Refreshing peer content...` + instead of replacing the tab with a full loading panel; + - switching to a different peer still clears stale content and shows the full + connecting state; + - focused validation passed: + `npm test -- --run src/views/web5/__tests__/Web5SharedContent.test.ts` and + `npm run type-check`. + - Local review services are running for user review: + Vite `http://localhost:8102/` / `http://192.168.1.116:8102/` and mock + backend `http://localhost:5959`; `curl` probes returned HTTP `200` for both + the Vite root and proxied `server.get-state`. +- `cargo fmt --manifest-path core/Cargo.toml --all --check` passed after the + stop-path fix. +- Backend compile validation for the stop-path fix passed: + `env CARGO_TARGET_DIR=/tmp/archy-cargo-runtime-check cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago`. + The first check session also eventually returned success after the bounded + rerun waited on its build-directory lock. +- `git diff --check` passed again after the stop-path edit and doc updates. +- Follow-up inspection confirmed the lower-level Quadlet/orchestrator stop path + is already bounded: `quadlet::stop_service` uses timed `systemctl --user stop` + with app-scoped kill/reset recovery, and the runtime fallback treats missing + containers as success. No additional lower-level stop change was made in this + pass. 
+- Latest backlog-fix pass stayed on the fixes tracker, not new app migration: + - backend `package.credentials` now returns manifest-backed PhotoPrism + credentials (`admin` / `archipelago`) directly, matching the existing UI + fallback; + - My Apps and mobile icon-grid credential pre-launch modals are centered + vertically on mobile instead of behaving like bottom sheets; + - validation passed: + `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts src/views/apps/__tests__/AppIconGrid.test.ts`, + `npm run type-check`, + `env CARGO_TARGET_DIR=/tmp/archy-cargo-runtime-check timeout 300s cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago`, + `cargo fmt --manifest-path core/Cargo.toml --all --check`, and + `git diff --check`. +- Focused Nextcloud/image-version Rust test is still not green: + `env CARGO_INCREMENTAL=0 CARGO_TARGET_DIR=/tmp/archy-cargo-image-versions-2 timeout 600s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests -- --nocapture` + again exited `124` after compiling into the `archipelago` crate without + reaching test output. Keep that tracker row `in-progress`. +- Continued the tab loading-state backlog: + - Web5 Connected Nodes Messages and Requests tabs keep populated lists + visible during refresh or refresh failure; + - Web5 Identities keeps the current identity list visible during refresh or + refresh failure and shows `Refreshing identities...`; + - Web5 DWN message browsing keeps stored messages visible during refresh or + refresh failure and shows `Refreshing messages...`; + - validation passed: + `npm test -- --run src/views/web5/__tests__/Web5ConnectedNodes.test.ts src/views/web5/__tests__/Web5Identities.test.ts src/views/web5/__tests__/Web5DWN.test.ts` + and `npm run type-check`. 
+- Continued the same tab/loading-state backlog on Server networking: + - Server Network overview keeps current values visible during refresh/failure + and shows `Refreshing network...`; + - Server Network Interfaces keeps current detected interfaces visible during + refresh/failure and shows `Refreshing interfaces...`; + - Server Tor Services keeps existing hidden-service rows visible during + refresh/failure and shows `Refreshing Tor services...`; + - validation passed: + `npm test -- --run src/views/__tests__/ServerNetworkRefresh.test.ts` and + `npm run type-check`. +- Continued the same loading-state backlog on Credentials: + - the Credentials list keeps existing credential rows visible during + refresh/failure and shows `Refreshing credentials...`; + - validation passed: + `npm test -- --run src/views/__tests__/CredentialsRefresh.test.ts src/views/__tests__/ServerNetworkRefresh.test.ts` + and `npm run type-check`. +- Continued the same loading-state backlog on Lightning Channels: + - the channels list keeps existing channels visible during refresh/failure + and shows `Refreshing channels...`; + - validation passed: + `npm test -- --run src/views/apps/__tests__/LightningChannels.test.ts src/views/__tests__/CredentialsRefresh.test.ts src/views/__tests__/ServerNetworkRefresh.test.ts` + and `npm run type-check`. +- Continued the same loading-state backlog on Peer Files: + - the peer catalog keeps existing file cards visible during Tor + refresh/failure and shows `Refreshing peer files...`; + - validation passed: + `npm test -- --run src/views/__tests__/PeerFilesRefresh.test.ts`, + `npm run type-check`, and `git diff --check`. 
+- Continued the same loading-state backlog on Cloud peer cards: + - Cloud keeps existing peer cards visible during federation peer-list + refresh/failure and shows `Refreshing peer nodes...`; + - validation passed: + `npm test -- --run src/views/__tests__/CloudPeersRefresh.test.ts src/views/__tests__/PeerFilesRefresh.test.ts`, + `npm run type-check`, and `git diff --check`. +- Continued the same loading-state backlog on the Web5 Verifiable Credentials + summary: + - the summary keeps existing credential rows visible during refresh/failure + and shows `Refreshing credentials...`; + - validation passed: + `npm test -- --run src/views/web5/__tests__/Web5CredentialsSummary.test.ts src/views/__tests__/CloudPeersRefresh.test.ts src/views/__tests__/PeerFilesRefresh.test.ts`, + `npm run type-check`, and `git diff --check`. +- Continued the same loading-state backlog on Web5 Nostr Relays: + - relay stats stay visible during refresh/failure and show + `Refreshing relays...`; + - validation passed: + `npm test -- --run src/views/web5/__tests__/Web5NostrRelays.test.ts src/views/web5/__tests__/Web5CredentialsSummary.test.ts src/views/__tests__/CloudPeersRefresh.test.ts src/views/__tests__/PeerFilesRefresh.test.ts`, + `npm run type-check`, and `git diff --check`. +- Continued the same loading-state backlog on Web5 Domains: + - registered-name counts stay visible during refresh/failure and show + `Refreshing domains...`; + - validation passed: + `npm test -- --run src/views/web5/__tests__/Web5Domains.test.ts src/views/web5/__tests__/Web5NostrRelays.test.ts src/views/web5/__tests__/Web5CredentialsSummary.test.ts src/views/__tests__/CloudPeersRefresh.test.ts src/views/__tests__/PeerFilesRefresh.test.ts`, + `npm run type-check`, and `git diff --check`. 
+- Continued the same loading-state backlog on Settings Backups: + - existing backup rows stay visible during refresh/failure and show + `Refreshing backups...`; + - validation passed: + `npm test -- --run src/views/settings/__tests__/BackupSection.test.ts src/views/web5/__tests__/Web5Domains.test.ts src/views/web5/__tests__/Web5NostrRelays.test.ts`, + `npm run type-check`, and `git diff --check`. +- Continued the same loading-state backlog on Settings Transport Preferences: + - existing preference controls stay visible during refresh/failure and show + `Refreshing transport preferences...`; + - validation passed: + `npm test -- --run src/views/settings/__tests__/TransportPrefsCard.test.ts src/views/settings/__tests__/BackupSection.test.ts`, + `npm run type-check`, and `git diff --check`. +- Continued the same loading-state backlog on Settings VPN status: + - current VPN connection details stay visible during refresh/failure and show + `Refreshing VPN status...`; + - validation passed: + `npm test -- --run src/views/settings/__tests__/VpnStatusSection.test.ts src/views/settings/__tests__/TransportPrefsCard.test.ts src/views/settings/__tests__/BackupSection.test.ts`, + `npm run type-check`, and `git diff --check`. +- Continued the same loading-state backlog on Web5 Federation: + - summary node counts and node DID stay visible during refresh/failure and + show `Refreshing federation...`; + - validation passed: + `npm test -- --run src/views/web5/__tests__/Web5Federation.test.ts`, + `npm run type-check`, and `git diff --check`. +- Continued the Mesh map denied-location backlog: + - added component coverage that browser geolocation denial remains optional + and tells the user peer positions can still appear; + - validation passed: + `npm test -- --run src/components/__tests__/MeshMap.test.ts`, + `npm run type-check`, and `git diff --check`. + - row remains `in-progress` until browser smoke validates denied location + with a real peer coordinate message. 
+- Continued the companion/tab-app backlog: + - mobile app-session keeps apps that require a new tab inside the mobile + session fallback instead of auto-opening an external tab and closing; + - validation passed: + `npm test -- --run src/views/__tests__/AppSessionMobileNewTab.test.ts src/views/appSession/__tests__/appSessionConfig.test.ts src/stores/__tests__/appLauncher.test.ts`, + `npm run type-check`, and `git diff --check`. + - row remains `in-progress` until broader companion smoke testing is done. +- Continued the Nostr Discoverable Nodes UI backlog: + - Discover modal keeps existing discovered rows visible during relay + refresh/failure and shows `Searching relays...`; + - validation passed: + `npm test -- --run src/views/federation/__tests__/DiscoverModal.test.ts`, + `npm run type-check`, and `git diff --check`. + - row remains `in-progress` until live relay/trust validation is done. +- Continued the App Store screenshots backlog: + - Marketplace App Details and installed App Details no longer show fake + screenshot placeholder tiles when no screenshot metadata exists; + - both views now render real screenshot URLs when metadata is provided as + strings or `{ src, alt }` objects; + - validation passed: + `npm test -- --run src/views/appDetails/__tests__/AppContentSection.test.ts src/composables/__tests__/useMarketplaceApp.test.ts`, + `npm run type-check`, and `git diff --check`; + - row remains `in-progress` until real screenshot assets/metadata are added. 
+- Continued the Home/App Store recommendations backlog: + - Home now shows an App Store recommendations card with up to three + uninstalled core/recommended marketplace apps; + - the selector respects installed aliases, so recommended apps drop out once + installed and then rely on normal My Apps/Home behavior; + - card clicks reuse the existing Marketplace App Details handoff; + - card animation ordering was tightened so Home cards have a stable stagger + sequence as the recommendations card appears/disappears; + - validation passed: + `npm test -- --run src/views/home/__tests__/homeRecommendations.test.ts`, + `npm run type-check`, + `git diff --check`, and + `ARCHY_BASE_URL=http://127.0.0.1:8103 npx playwright test e2e/visual-regression.spec.ts -g 'home / dashboard' --project=chromium`; + - temporary Vite on `8103` was stopped after the smoke. An older local + dev/mock session on `8102`/`5959` was already present and was left alone. + - tracker row is `done`. +- Home layout follow-up: + - Cloud was moved back into the second card slot; + - Recommended Apps moved into Cloud's previous position; + - Quick Start now lives inside the dashboard grid next to Wallet, with + stacked goal buttons, instead of rendering as a separate odd-width row; + - validation passed: + `npm test -- --run src/views/home/__tests__/homeRecommendations.test.ts`, + `npm run type-check`, + `git diff --check`, and + `ARCHY_BASE_URL=http://127.0.0.1:8102 npx playwright test e2e/visual-regression.spec.ts -g 'home / dashboard' --project=chromium`. 
+- Continued the Easy Mode experience backlog: + - goal configure steps now route to their owning app/screen instead of + silently completing without navigation; + - verify steps now show `Check & Continue`, so goals that start with a verify + step are no longer stuck without an active action; + - configure/info/verify actions start goal progress before completing the + current step; + - validation passed: + `npm test -- --run src/views/goals/__tests__/goalStepActions.test.ts src/stores/__tests__/goals.test.ts`, + `npm run type-check`, and `git diff --check`; + - tracker row is `in-progress` because broader Easy Mode product scope still + needs review. +- Continued the setup screens/function/flow backlog: + - onboarding setup choice now shows only usable paths, Fresh Start and + Restore from Seed; + - removed the disabled `Connect Existing (Coming Soon)` option; + - validation passed: + `npm test -- --run src/views/__tests__/OnboardingOptions.test.ts src/composables/__tests__/useOnboarding.test.ts`, + `npm run type-check`, and `git diff --check`; + - tracker row is `in-progress` because broader onboarding/setup audit still + needs review. + +## Latest Local Checkpoint - 2026-06-10 05:30 EDT + +User paused work to switch machines. No dev server or validation command should +be intentionally left running from this checkpoint. + +Latest local-only release-tracker work since the older `.198` handoff: + +- Uninstall/data reset: + - My Apps and App Details uninstall dialogs now include `Delete app data and reset it`; + - unchecked preserves app data and sends `preserve_data=true`; + - checked sends `preserve_data=false`; + - covered by `AppsUninstallModal.test.ts`, `rpc-client.test.ts`, type-check, and `git diff --check`; + - tracker row is `done`. 
+- Companion APK: + - companion intro modal uses `VITE_COMPANION_APK_URL` or `/packages/archipelago-companion.apk.zip`; + - desktop shows a centered QR image generated with the same `qrcode` library used by wallet flows; + - mobile shows a direct download button; + - visible close button restored; + - APK exists at `neode-ui/public/packages/archipelago-companion.apk.zip`; + - tracker row is `done`. +- Setup instructions: + - App Details sidebar renders `static-files.instructions` when non-empty; + - covered by `AppSidebar.test.ts`, type-check, and `git diff --check`; + - tracker row is `done`. +- Fleet / tab loading: + - Fleet auto-refresh header/sort controls were tightened; + - node history no longer blanks during refresh and now shows `Refreshing history...`; + - covered by `useFleetData.test.ts`, type-check, and `git diff --check`; + - tracker row remains `in-progress` pending broader slow-tab audit. +- Bitcoin receive readiness: + - receive modals show a live `Checking Lightning wallet readiness...` message while on-chain address generation is in flight; + - shared helper now distinguishes LND REST/newaddress transport failures; + - covered by `bitcoinReceive.test.ts`, type-check, and `git diff --check`; + - tracker row remains `in-progress` pending live wallet-state smoke test. 
+- Nextcloud false update: + - Nextcloud manifest/catalog/static UI metadata moved from `28` to pinned `29`; + - update comparison now ignores registry-host-only image changes while reporting same-repo tag drift; + - `python3 scripts/check-app-catalog-drift.py --release --strict` passed; + - `cargo test -p archipelago container::image_versions::tests` from `core/` failed first with a Rust linker/incremental artifact issue after `/tmp` was full, then the non-incremental retry was killed because it ran too long; + - old `/tmp/archy-cargo-*` build-cache directories were removed and `/tmp` recovered to about 14% used; + - tracker row is `in-progress`; rerun the focused Rust test before marking done. +- Dead/coming-soon UI: + - removed the non-interactive Spotlight AI Assistant coming-soon block; + - verified no active UI `Coming soon` strings remain outside historical release-note text; + - type-check passed and `git diff --check` passed; + - tracker row is `done`. +- No-registration credentials: + - added PhotoPrism fallback credentials from its manifest (`admin` / `archipelago`); + - did not add Grafana because its `GRAFANA_ADMIN_PASSWORD` is not resolved to a known local secret/default in the repo; + - `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed; + - `npm run type-check` passed; + - tracker row still `in-progress` because other no-registration apps still need inventory. + +Most recent validations before pause: + +- `npm run type-check` passed after the PhotoPrism credential fallback. +- `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed. +- `git diff --check` passed after the Spotlight cleanup and before the PhotoPrism fallback; rerun it after resuming. +- `python3 scripts/check-app-catalog-drift.py --release --strict` passed during the Nextcloud pass. 
+- Backend Rust focused validation for image versions is still not clean because of the local linker/incremental artifact failure and the killed retry; rerun from `core/` when convenient. + +## Latest Known `.198` State + +- Host: `192.168.1.198`. +- Backend deployed: `/usr/local/bin/archipelago` sha256 `159e0daf13fca2df7e831122cb0e6c84223a7e5b7433f5dd0b7eec263233e228`. +- `archipelago.service`: active after deploy. +- `archipelago-doctor.timer`: inactive. +- `archipelago-reconcile.timer`: inactive. +- No reboot validation should be started yet. + +## What Was Just Done + +- Investigated current Fedimint Guardian UI report: + - live `.198` RPC reports `fedimint` as `starting` and `container-health {"fedimint":"starting"}`; + - direct `http://192.168.1.198:8175/` returns HTTP `000` because the manifest wrapper has not exec'd `fedimintd` yet; + - `bitcoin-knots` is `running` and `http://192.168.1.198:8334/` returns HTTP `200`; + - `bitcoin.status` RPC returned an operation-failed error during the check, consistent with the current Bitcoin-dependent-app wait-state problem. +- Added frontend Fedimint-specific wait-state copy: + - My Apps/App card now says `Waiting for Bitcoin to finish initial sync before Guardian starts.` when Fedimint is starting or running with `health=starting`; + - App session fallback title now says `Waiting for Bitcoin sync` instead of generic `App not reachable` for that state. +- Validated frontend changes: + - `npm test -- --run src/views/apps/__tests__/appsConfig.test.ts` passed (`7` tests); + - `npm run type-check` passed; + - `npm run build` passed. +- Deployed rebuilt static frontend to `.198` only: + - preserved `aiui/` and `claude-login.html`; + - backed up previous web root at `/opt/archipelago/rollback/web-ui-fedimint-ui-20260610-042927.tar`; + - reloaded nginx; + - confirmed deployed assets contain the new Fedimint copy. 
+- Fixed Fedimint Guardian launch on `.198` while Bitcoin is still syncing: + - added `docker/fedimint-ui`, an nginx wait/proxy companion; + - changed Fedimint backend manifest so real Guardian UI maps to host `8177` instead of the public launch port; + - public launch port `8175` is now owned by `archy-fedimint-ui`, which serves `Waiting for Bitcoin sync` until `fedimintd` binds behind it; + - fixed the Fedimint wait command to avoid `printf '%s'` in Quadlet `Exec=` because systemd expands `%s` to the user shell (`/bin/bash`); + - live `.198` `fedimint.service` unit has `TimeoutStartSec=infinity` so systemd does not kill the intentional Bitcoin-sync wait loop; + - rebuilt and deployed frontend static files so Fedimint remains launchable while `health=starting`; + - confirmed `http://192.168.1.198:8175/` returns HTTP `200` with `Waiting for Bitcoin sync`. +- Restyled the Fedimint wait/proxy page: + - `docker/fedimint-ui/index.html` now uses Archipelago-style `glass-card`, app icon block, Montserrat-like heading stack, orange focus/glow accents, and yellow starting badge styling; + - rebuilt `localhost/fedimint-ui:latest` on `.198`; + - restarting `archy-fedimint-ui.service` hit the known rootless Podman cleanup slowness and left the unit temporarily `deactivating`; + - recovered with app-scoped `systemctl --user kill --kill-whom=all -s SIGKILL archy-fedimint-ui.service`, `reset-failed`, and `start`; + - final LAN validation: `http://192.168.1.198:8175/` returns HTTP `200`, size `6419`, and contains `glass-card`, `app-icon`, `Archipelago App`, and `Waiting for Bitcoin sync`. 
+- Updated the Fedimint wait/proxy page again per design feedback: + - uses the Bitcoin custom UI's `/assets/img/bg-network.jpg` full-screen background + dark overlay pattern; + - uses the real Fedimint icon inside the Bitcoin custom UI `logo-gradient-border` treatment instead of text initials; + - copied those assets into `docker/fedimint-ui/assets/`; + - rebuilt `localhost/fedimint-ui:latest` on `.198`; + - fixed nginx routing so `/assets/...` is served statically instead of being proxied to the not-yet-running Guardian backend; + - corrected the companion page to reference `fedimint.jpg` because the catalog icon bytes are JPEG despite the old `.png` extension; + - final LAN validation: `http://192.168.1.198:8175/` returns HTTP `200`, size `11328`; `/assets/img/app-icons/fedimint.jpg` returns `200 image/jpeg`; `/assets/img/bg-network.jpg` returns `200 image/jpeg`; + - Playwright render validation confirmed title `Fedimint Guardian`, status `Waiting for Bitcoin sync`, background URL `/assets/img/bg-network.jpg`, and icon natural width `860`. 
+- Hardened Fedimint/backend lifecycle enough for this path: + - generated Quadlet services now include `TimeoutStartSec=0` so systemd does not kill dependency-gated container entrypoints while they wait for Bitcoin IBD; + - `package.restart` now returns `{"status":"restarting"}` immediately instead of blocking the RPC call for minutes in the single-orchestrator path; + - `quadlet::restart_service` now uses bounded stop/start, app-scoped kill/reset recovery, and settle waits instead of opaque `systemctl restart`; + - deployed backend hash `159e0daf13fca2df7e831122cb0e6c84223a7e5b7433f5dd0b7eec263233e228` to `.198`; + - backup made at `/opt/archipelago/rollback/archipelago-before-quadlet-timeout0-20260610-082535`; + - `package.restart fedimint` returned `{"status":"restarting"}` in `0s`; + - restart observation: `8175` stayed HTTP `200` throughout; generated `fedimint.container` gained `TimeoutStartSec=0`; `fedimint.service` and `archy-fedimint-ui.service` settled `active`; ports `8175` and `8177` listened. +- Final Fedimint live validation after restart: + - `container-health` returned `{"fedimint":"healthy"}`; + - `container-list` returned `fedimint` `state:"running"` and `lan_address:"http://localhost:8175"`; + - services: `fedimint.service` active, `archy-fedimint-ui.service` active; + - unit contains `TimeoutStartSec=0` at line `42`; + - public wait/proxy UI and both image assets returned `200`. +- Fedimint live rollback references: + - previous frontend backup: `/opt/archipelago/rollback/web-ui-fedimint-guardian-launch-20260610-045949.tar`; + - previous Fedimint Quadlet backup: `/home/archipelago/.config/containers/systemd/fedimint.container.guardian-fix-rewrite-20260610-050607.bak`. +- Earlier backend hash `7f58da80063f58574675256913ac9cddf131e65d8935015748a70adffc228f83` was superseded by `159e0daf13fca2df7e831122cb0e6c84223a7e5b7433f5dd0b7eec263233e228`. 
+- Added explicit release gates: + - app packaging docs must match current manifest/runtime contract before `1.8-alpha`; + - refactor/remove-dead-code is mandatory before `1.8-alpha`, after correctness validation and before final ISO/release gates. +- Validated IndeeHub: + - `container-list` reported `indeedhub` running; + - `container-health` returned `{"indeedhub":"healthy"}`; + - `http://192.168.1.198:7778/` returned HTTP `200`; + - `http://192.168.1.198:7778/nostr-provider.js` returned HTTP `200` and contains the Archipelago NIP-07/NIP-98 provider shim. +- Validated Immich launch: + - `http://192.168.1.198:2283/` returned HTTP `200`; + - one `container-health` check returned `{"immich":"unknown"}`, so health truthfulness still needs follow-up. +- Fixed Tailscale launch UI: + - patched `app-catalog/catalog.json`, `neode-ui/public/catalog.json`, and `scripts/first-boot-containers.sh`; + - command now waits for `/var/run/tailscale/tailscaled.sock` before starting `tailscale web`; + - copied updated catalog to `/opt/archipelago/web-ui/catalog.json` on `.198`; + - patched the live generated Tailscale `.container` unit and restarted only `tailscale.service`; + - confirmed `container-list` reports Tailscale running; + - confirmed `container-health` returns `{"tailscale":"healthy"}`; + - confirmed `http://192.168.1.198:8240/` returns HTTP `200` with Tailscale UI content. + +## Important Caveat + +Tailscale launch is fixed, but Tailscale lifecycle is not fully passing: + +- `package.restart tailscale` failed through RPC with `podman ps timed out while listing containers`. +- Manual app-scoped restart showed old container stop needed SIGKILL and Podman cleanup took roughly 2 minutes. +- Logs still showed `podman ps timed out`, `podman stats timed out`, scan backoff, and slow cleanup. + +This confirms the active blocker is the rootless Podman control-plane/lifecycle path, not just individual app launch URLs. 
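
Sketch only, not the backend implementation: one way to keep RPC/UI state from wedging on the slow `podman ps` calls noted above is to bound every listing call and fall back to the last-known state, flagged stale, when the call times out. `AppStateCache`, `parse_listing`, and the `name state` line format are hypothetical names and assumptions for illustration; the listing command is passed in by the caller.

```python
import subprocess
from dataclasses import dataclass, field


def parse_listing(text):
    """Parse 'name state' lines (assumed format) into a dict."""
    apps = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            apps[parts[0]] = parts[1]
    return apps


@dataclass
class AppStateCache:
    """Hypothetical last-known app state holder."""
    apps: dict = field(default_factory=dict)
    stale: bool = False

    def refresh(self, list_cmd, timeout_s=5.0):
        """Run the listing command with a hard timeout.

        On timeout or failure, keep the previous apps and mark them stale
        instead of reporting an empty list.
        """
        try:
            out = subprocess.run(
                list_cmd,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                check=True,
            ).stdout
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
            self.stale = True
            return self.apps
        self.apps = parse_listing(out)
        self.stale = False
        return self.apps
```

A caller would render `apps` as usual and show a "checking/stale" banner whenever `stale` is true, rather than concluding that no apps are installed.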
+ +## Active Blockers + +- Rootless Podman/control-plane responsiveness: + - `podman ps` and cleanup paths time out; + - backend scan/backoff causes stale or slow UI state; + - app stop/start/restart can look frozen or fail through RPC. +- My Apps state truthfulness: + - do not show false empty/no-apps while scanner/Podman is in backoff; + - preserve last-known apps and show explicit stale/checking state. +- Progress UX: + - install/uninstall/start/stop/restart must show meaningful phase progress and not appear frozen. +- Immich health truthfulness: + - HTTP launch works, but health may still report `unknown`. +- Portainer: + - HTTP `9000` returned `200`; + - user still needs to retry environment wizard and confirm `/var/run/docker.sock` works. +- Fedimint: + - public Guardian launch URL now loads on `8175` even while Bitcoin is in IBD; + - `archy-fedimint-ui` owns `8175` and proxies to the real Guardian backend on `8177` when `fedimintd` eventually starts; + - durable manifest/companion/frontend/backend changes are now deployed on `.198`; + - `package.restart fedimint` fast-returned and settled active with `TimeoutStartSec=0`, but keep Fedimint in the broader lifecycle matrix because rootless Podman cleanup slowness remains a systemic blocker. +- Reboot validation: + - require at least 3 clean consecutive post-fix reboots with broad lifecycle green after each; + - prefer 5 clean reboots; + - do not start until lifecycle/control-plane is stable. +- App packaging docs: + - aligned `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` with the current manifest/runtime contract. +- Refactor/remove-dead-code: + - required before `1.8-alpha`; + - remove stale per-app hacks, duplicate lifecycle paths, stale fallback metadata, misleading compatibility shims; + - rerun release gates afterward. + +## Local Validation Already Run + +- `bash -n tests/lifecycle/remote-lifecycle.sh` passed. 
+- `bash -n scripts/first-boot-containers.sh tests/lifecycle/remote-lifecycle.sh` passed. +- `cargo fmt --manifest-path core/Cargo.toml --all` was run. +- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` passed (`45` tests). +- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed. +- `python3 scripts/check-app-catalog-drift.py --release --strict` passed. +- `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json` passed. +- `git diff --check` passed. +- `npm test -- --run src/views/apps/__tests__/appsConfig.test.ts` passed. +- `npm run type-check` passed. +- `npm run build` passed. +- `python3 scripts/check-app-catalog-drift.py --release --strict` passed after Fedimint manifest changes. +- `git diff --check` passed for Fedimint manifest, companion, frontend, and new `docker/fedimint-ui` files. +- `cargo fmt --manifest-path core/Cargo.toml --all` passed. +- `CARGO_TARGET_DIR=/tmp/archy-cargo-check-quadlet cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed after Quadlet/restart changes. +- `CARGO_TARGET_DIR=/tmp/archy-cargo-final-quadlet cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` produced the deployed backend binary (tool PTY heartbeat wrapper became stale after link; artifact hash was validated separately before deploy). +- Live Fedimint restart validation passed on `.198`: + - `package.restart fedimint` returned `{"status":"restarting"}` immediately; + - `8175` remained HTTP `200`; + - `fedimint.service` and `archy-fedimint-ui.service` settled `active`; + - `container-health fedimint` returned `healthy`. +- `cargo test --manifest-path core/Cargo.toml -p archipelago companion::tests` compiled, then the tool PTY hung with no active `cargo`/`rustc` process visible; treat as inconclusive, not failed. 
+- Filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` appeared wedged in the tool PTY after compilation started; no local cargo/rustc worker remained visible. Treat as inconclusive, not failed. + +## Immediate Next Step + +Do not reboot yet. + +Start with the rootless Podman lifecycle/control-plane blocker: + +1. Inspect the backend stop/start/restart path around `package.restart`, scanner backoff, and `podman ps` dependency. +2. Make stop/restart tolerate slow cleanup without wedging RPC/UI state. +3. Keep last-known app state during scanner backoff. +4. Revalidate focused apps on `.198`: `tailscale`, `indeedhub`, `immich`, `portainer`, `vaultwarden`, `botfights`; keep `fedimint` in the matrix but its focused Guardian launch/restart path is currently green. +5. Only after focused lifecycle is clean, run broad non-destructive lifecycle. +6. Only after that, begin 3/5 reboot validation. + +## Files Touched In Last Mini-Pass + +- `docs/NEXT_TERMINAL_HANDOFF.md` - this file. +- `neode-ui/src/views/apps/appsConfig.ts` - Fedimint launch-blocked reason helper. +- `neode-ui/src/views/apps/AppCard.vue` - show Fedimint Bitcoin-sync wait copy on app cards. +- `neode-ui/src/views/AppSession.vue` - pass app-specific blocked reason into app session. +- `neode-ui/src/views/appSession/AppSessionFrame.vue` - show app-specific blocked title/reason instead of generic unreachable fallback. +- `neode-ui/src/views/apps/__tests__/appsConfig.test.ts` - regression coverage for Fedimint wait-state copy. +- `apps/fedimint/manifest.yml` - backend real Guardian UI now maps host `8177` and wait command avoids systemd `%` expansion. +- `core/archipelago/src/container/companion.rs` - added `archy-fedimint-ui` companion mapping. +- `core/archipelago/src/container/quadlet.rs` - generated unit `TimeoutStartSec=0` plus bounded stop/restart recovery helpers. 
+- `core/archipelago/src/api/rpc/package/runtime.rs` - restart RPC returns immediately and runs restart async. +- `docker/fedimint-ui/` - new nginx wait/proxy companion image for Fedimint Guardian launch. +- `docs/RESUME.md` - checkpoint and gates. +- `docs/MIGRATION_STATUS_REPORT.md` - packaging/refactor release gates. +- `docs/CONTAINER_LIFECYCLE_HANDOFF.md` - packaging/refactor release gates. +- `docs/APP-PACKAGING-MIGRATION-PLAN.md` - updated manifest/runtime contract documentation. +- `docs/app-developer-guide.md` - updated manifest/runtime contract documentation. +- `docs/MIGRATION_STATUS_REPORT.md` - noted that the docs gate is being closed in this pass. +- `app-catalog/catalog.json` - Tailscale socket-wait startup command. +- `neode-ui/public/catalog.json` - same Tailscale catalog update. +- `scripts/first-boot-containers.sh` - same Tailscale first-boot startup update. +- `neode-ui/src/views/apps/appPackageCache.ts` - UI-only last-known package + cache for scanner backoff. +- `neode-ui/src/views/apps/__tests__/appPackageCache.test.ts` - cache behavior + coverage. +- `neode-ui/src/views/Apps.vue` - uses cached packages during scanner backoff + and shows a refresh status banner. +- `docs/1.8-alpha-improvements-tracker.md` - noted My Apps backoff cache + improvement. +- `neode-ui/src/views/web5/Web5SharedContent.vue` - preserves shared/peer + content during refresh and shows compact refresh states. +- `neode-ui/src/views/web5/__tests__/Web5SharedContent.test.ts` - shared and + peer content refresh regression coverage. + +The worktree has many other pre-existing release-hardening changes. Do not revert unrelated dirty files. 
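
Sketch only, not the code in `runtime.rs`: the fast-return restart listed above (RPC answers `{"status":"restarting"}` immediately, restart runs in the background) can be modelled roughly as below. `RestartTracker` and its `restart_fn` callback are hypothetical; the real backend is Rust and its internals are not shown in this handoff.

```python
import threading


class RestartTracker:
    """Hypothetical model of a fire-and-return restart RPC."""

    def __init__(self, restart_fn):
        self._restart_fn = restart_fn  # slow, blocking restart (assumed)
        self._status = {}
        self._lock = threading.Lock()

    def restart(self, app):
        """Return immediately; the slow restart runs on a worker thread."""
        with self._lock:
            if self._status.get(app) == "restarting":
                # A restart is already in flight; do not start a second one.
                return {"status": "restarting"}
            self._status[app] = "restarting"
        worker = threading.Thread(target=self._run, args=(app,), daemon=True)
        worker.start()
        return {"status": "restarting"}

    def _run(self, app):
        try:
            self._restart_fn(app)
            result = "running"
        except Exception:
            result = "failed"
        with self._lock:
            self._status[app] = result

    def status(self, app):
        with self._lock:
            return self._status.get(app, "unknown")
```

The UI would poll `status` (or the equivalent of `container-list`) to show progress instead of waiting on the RPC call itself.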
diff --git a/docs/RESUME.md b/docs/RESUME.md index 7caa6bb4..2e778d9b 100644 --- a/docs/RESUME.md +++ b/docs/RESUME.md @@ -1,126 +1,840 @@ -# RESUME — Rust orchestrator migration, Step 8b +# RESUME - Archipelago Release Hardening on `.198` -Last updated: 2026-04-23 (evening, post-architecture-audit) +Last updated: 2026-06-10 -Read this first if you're a fresh OpenCode session resuming work. Paste the "Resume prompt" below verbatim. +## 2026-06-10 05:48 EDT Active Session Checkpoint + +Work resumed from `docs/NEXT_TERMINAL_HANDOFF.md`. No `.198` host actions have +been run yet in this resumed pass. + +Current first steps: + +1. Rerun `git diff --check`. +2. Rerun the focused Rust image-version test for the Nextcloud false-update + helper. +3. If those are clean, inspect and continue the rootless Podman lifecycle/ + scanner-backoff work before any `.198` validation. + +Progress: + +- `git diff --check` passed. +- Focused Rust image-version test in `/tmp/archy-cargo-image-versions` remains + inconclusive: the tool PTY stayed open after compile output stopped, with no + active `cargo`, `rustc`, or linker process visible. +- Bounded retry of the focused image-version test using the normal workspace + target also timed out: `timeout 300s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests` + exited `124` after compiling the `archipelago` test target without reaching + test output. Nextcloud false-update validation is still not closed. +- Local code change in progress: single-orchestrator `package.stop` now returns + immediately with `stopping` and runs the orchestrator stop in the background, + instead of blocking the RPC/UI while Podman cleanup happens. +- `cargo fmt --manifest-path core/Cargo.toml --all --check` passed. +- Compile check passed in `/tmp/archy-cargo-runtime-check`: + `cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago`. +- `git diff --check` passed after the stop-path edit and doc updates. 
+- Lower-level stop path inspection: Quadlet service stop is already bounded + with kill/reset recovery, and the runtime fallback treats already-absent + containers as success. No extra lower-level stop change was made. + +## 2026-06-10 05:30 EDT Pause Checkpoint + +User paused to switch machines. Continue from `/home/archipelago/Projects/archy` +and read `docs/NEXT_TERMINAL_HANDOFF.md` plus +`docs/1.8-alpha-improvements-tracker.md` first. No dev server or validation +command should be intentionally left running from this checkpoint. + +Latest local-only tracker progress: + +- Done: uninstall preserve/delete-data choice, companion APK QR/download modal, + App Details setup-instructions card, dead/coming-soon UI cleanup via Spotlight + AI placeholder removal. +- In progress: Fleet/tab loading polish, Bitcoin receive-address readiness + states, no-registration credentials inventory, Nextcloud false-update fix. +- New credential fallback: PhotoPrism now shows manifest-backed credentials + (`admin` / `archipelago`) when backend credentials are empty. Grafana was not + added because `GRAFANA_ADMIN_PASSWORD` is not resolved to a known repo + default/secret. +- Nextcloud local fix: manifest/catalog/UI metadata now points at `nextcloud:29` + and image update detection ignores registry-host-only changes. Catalog drift + passed, but backend focused Rust validation did not complete cleanly. First + `cargo test -p archipelago container::image_versions::tests` from `core/` + hit a Rust linker/incremental artifact failure while `/tmp` was full; a + non-incremental retry was killed after running too long. Old + `/tmp/archy-cargo-*` build-cache directories were removed and `/tmp` recovered. + +Latest local validations: + +- `npm run type-check` passed after the PhotoPrism credential fallback. +- `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed. +- `git diff --check` passed after the Spotlight cleanup and should be rerun + after resuming. 
+- `python3 scripts/check-app-catalog-drift.py --release --strict` passed during + the Nextcloud pass. + +Immediate next steps: + +1. Rerun `git diff --check`. +2. Rerun `cargo test -p archipelago container::image_versions::tests` from + `core/` when ready to validate the Nextcloud update-detection helper. +3. Continue the `docs/1.8-alpha-improvements-tracker.md` rows that remain + `todo` or `in-progress`, avoiding host-gated items until `.198` access is + intentionally resumed. + +## 2026-06-09 Resume Handoff - Read First + +Last user prompt to preserve: + +> please can we save all our progress, backlog, and goal to memory so I can resume on another device please +> +> including the last prompt + +Ultimate release goal: + +Archipelago's app/container system must be developer-ready and production-release ready. New apps should be supported through manifest/runtime contracts and clear developer documentation, not one-off OS-level changes or fragile per-app hacks. The app system must be professional, secure, elegant, lightweight, and predictable: apps install, start, stop, restart, uninstall, reinstall, survive reboot, show correct status/progress, and launch correctly from tabs/iframes. Developers should be able to package apps for Archipelago clearly from the migration/developer docs. + +Important target node: + +- Validation node: `archipelago@192.168.1.198`, password `password123`. +- Current release deadline pressure from user: production release target was Thursday, 2026-06-11. +- Tests have been run mostly on `.198`; user noted we may also need to validate on the current intended release server, not only `.198`. +- Avoid broad/destructive Podman store cleanup. Do not use `git reset --hard` or revert unrelated user changes. + +Current deployed backend on `.198`: + +- Latest deployed `/usr/local/bin/archipelago` sha256: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`. 
+- A later local-only code change exists and passed `cargo check`: cached web-app health now requires HTTP reachability, not just TCP. This was not deployed because the user interrupted the release build/deploy flow. No build process was left running at handoff. + +Major progress achieved in the latest session: + +- Beta Telemetry / Fleet collector: + - Confirmed `TELEMETRY_COLLECTOR_URL` was not set in the current shell and no repo/service config was setting it. + - Fixed the periodic reporter to POST a `telemetry.ingest` JSON-RPC envelope to the configured collector endpoint instead of POSTing the raw telemetry report body. + - Added optional systemd env loading with `EnvironmentFile=-/var/lib/archipelago/telemetry.env` in `image-recipe/configs/archipelago.service`. + - Updated `scripts/deploy-to-target.sh` so deployments write `/var/lib/archipelago/telemetry.env` when `TELEMETRY_COLLECTOR_URL` is exported in `scripts/deploy-config.sh`. + - Documented the expected value shape in `scripts/deploy-config.example`: `https:///rpc/v1`. + - Verification passed: `cargo fmt -p archipelago --manifest-path core/Cargo.toml`, `bash -n scripts/deploy-to-target.sh`, `git diff --check` for the touched files, and `CARGO_TARGET_DIR=/tmp/archy-cargo-check cargo check -p archipelago --manifest-path core/Cargo.toml`. + - `systemd-analyze verify image-recipe/configs/archipelago.service` could not run in the sandbox because systemd bus access failed with `SO_PASSCRED failed: Operation not permitted`. + - Still needed: choose the real collector host, create or update local `scripts/deploy-config.sh` with `export TELEMETRY_COLLECTOR_URL='https:///rpc/v1'`, deploy, restart `archipelago`, and confirm opted-in nodes ingest into Fleet. +- IndeeHub: + - Recovered stale/corrupt metadata/container state enough for fresh lifecycle. + - Full lifecycle passed earlier on `.198`. + - Verified launch on `7778`. 
+ - Verified `/nostr-provider.js` is served and the Nostr signer bridge requirement is preserved. +- Saleor: + - Removed from app catalog/server as requested. +- Bitcoin Knots / Bitcoin UI: + - Fixed false health path so `bitcoin-knots` health no longer just probes the UI bridge on `8334`. + - Patched Bitcoin UI wording to show retrying/busy sync states instead of scary permanent failure. + - Verified `/bitcoin-status` recovered; node is in IBD and pruned, progress around 6-7% during latest checks. +- Fedimint: + - Restored/kept Fedimint Gateway as separate catalog app. Do not make Guardian launch Gateway. + - Fixed Guardian startup path so `fedimint` uses manifest-backed Quadlet/orchestrator, not legacy startup. + - Fixed generated unit regeneration by removing the pre-orchestrator Podman inspect gate for orchestrator starts. + - Fedimint Guardian unit now includes `FM_BITCOIND_URL=http://bitcoin-knots:8332`. + - Added manifest wrapper that waits for Bitcoin RPC sync with `"initialblockdownload":false` before launching `fedimintd`. + - Current correct behavior on `.198`: `fedimint.service` active and logging `Waiting for Bitcoin RPC sync at http://bitcoin-knots:8332...`; RPC health returns `starting`; container-list now reports `fedimint` as `starting` instead of stale `stopping`. + - Guardian iframe/tab does not yet show UI because `fedimintd` is intentionally gated until Bitcoin leaves IBD. The UI should explain "waiting for Bitcoin sync" rather than opening a blank/dead iframe. +- BotFights: + - User reported stopped/unhealthy. + - Added `botfights` to manifest-backed orchestrator start path so it no longer fails immediately on legacy Podman discovery. + - Deployed backend hash `9a00e543...`. + - BotFights started and is active. + - Direct checks after it finished booting: `/` returned HTTP 200; `/api/health` returned `{"status":"ok","name":"botfights"}`. 
+ - Note: `.198` manifests still use `git.tx1138.com/lfg2025/botfights:1.1.0`; local repo manifest shows `146.59.87.168:3000/lfg2025/botfights:1.1.0`. Reconcile this catalog/manifest mismatch later. +- Status/health correctness: + - Reduced container health/status Podman timeouts to avoid UI hanging forever. + - `container-list` now refreshes stale cached states and uses Quadlet service-active fallback for stale `stopping` states. + - Fedimint stale `stopping` fixed to `starting`. + - Local-only patch passed `cargo check`: web-app cached health requires HTTP success/redirect, not just open TCP. This fixes false healthy during app boot, seen with BotFights. +- Filebrowser/Home Assistant/Immich/Bitcoin: + - Latest RPC health check showed filebrowser healthy, homeassistant healthy, immich healthy, bitcoin-knots healthy. + - Still treat Home Assistant setup/restart hang and Immich post-setup HTTP 500 as backlog blockers needing focused validation. + +Current critical blockers: + +- Runtime control plane / Podman scanning: + - Backend restarts repeatedly take 1-2 minutes because startup/crash recovery synchronously waits on slow `podman ps`. + - Logs show repeated `podman ps -a --format json timed out after 30s` and crash recovery `podman ps stopped timed out after 60s`. + - This is causing bad UX: "checking forever", false "no apps installed", intermittent "loading apps", stale statuses, slow lifecycle actions. + - Next platform fix should move Podman/crash-recovery scans out of the service readiness path and keep last-known app state during scanner backoff. +- My Apps UI false negatives: + - User reports apps sometimes do not show, "checking" forever, "loading apps" sometimes good but often false "no apps installed". + - Required fix: do not show empty/no-apps while scanner or Podman is in backoff. Keep last known apps, show explicit loading/checking/stale state, and avoid destructive UI conclusions from scan timeout. 
+- Fedimint Guardian: + - Current "starting/waiting for Bitcoin sync" is correct while Bitcoin is in IBD. + - Need UI/status copy that explains waiting for Bitcoin sync, and later validate Guardian UI on `8175` once Bitcoin sync condition is satisfied. +- Progress UX: + - User explicitly requires install/uninstall/start/stop/restart progress to be accurate and not look frozen. + - Uninstall indicator currently poor/no progress. Must fix with clear phase updates and no stale notifications. +- Stale health notifications: + - Must not persistently trigger on new logins/refreshes after no longer valid. + - Some UI filtering was patched earlier, but keep this in regression backlog. +- Reboot survival: + - Must pass repeated reboot validation after runtime/status fixes. + - Acceptance target from user: minimum 3 clean consecutive reboots, preferably 5. + +Backlog captured from user reports: + +- Portainer: + - Environment wizard error: `Dial unix /var/run/docker.sock: connect: connection refused`. + - User noted Portainer does Podman orchestration well; compare/learn from its socket/control flow where useful. +- Fedimint: + - Setup after guardian confirmation caused app not to launch. + - Guardian launch was opening Gateway before; do not regress. Guardian and Gateway must remain distinct. + - Gateway app disappeared from catalog before; it has been restored but keep in regression tests. +- Bitcoin Knots: + - User saw missing app/launch issues and status bridge messages. UI now improved, but include in lifecycle/reboot regression. +- Home Assistant: + - Setup has issues on this node and restart hung for a long time. +- Immich: + - After setup user saw HTTP 500 stacktrace from `loadServerConfig`. Needs focused post-setup validation, not just "healthy". +- Filebrowser: + - User saw erroneous stopped status while app was working. Status ordering was patched; keep in regression. +- Tailscale: + - Launch must show local login/auth UI, not merely container running. 
+- BTCPay/Fedimint/Gateway/other Bitcoin-dependent apps: + - Need clearer dependency wait states when Bitcoin RPC is slow/IBD. +- App catalog/developer readiness: + - Apps should not require OS-level changes per app. + - App migration document and developer guide must include this principle and current app packaging contract. +- Saleor: + - Removed from catalog/server and should stay removed unless intentionally reintroduced. + +Release readiness estimate: + +- Prior estimate was 68%; after latest IndeeHub/Fedimint/BotFights/status progress, a realistic estimate is about 72%. +- Remaining 28% is not feature volume; it is systemic hardening: runtime control-plane responsiveness, truthful UI during Podman backoff, lifecycle/reboot gates, and focused app-specific post-setup validation. + +Suggested immediate next steps after resuming: + +1. Read this file and verify no background build/process is running. +2. Build/deploy the local-only HTTP-health tightening patch if not already deployed. +3. Patch backend startup/crash recovery so Podman scans are async/non-blocking and service readiness is not held hostage by `podman ps`. +4. Patch My Apps UI/data flow to preserve last-known apps during scanner backoff and never show false empty state while checking. +5. Run focused status checks on `.198`: fedimint, botfights, filebrowser, bitcoin-knots, immich, homeassistant, portainer. +6. Continue lifecycle gates only after the runtime scan/control path is stable enough that tests measure apps, not Podman timeouts. + +Read this first if resuming in a fresh OpenCode session. Paste the resume prompt below verbatim. --- -## Resume prompt (paste this into a new opencode session) +## Resume Prompt -> We are mid-migration: `docs/rust-orchestrator-migration.md` + `docs/bulletproof-containers.md` are the plan, Steps 1–7 + 8a are shipped on `main`, Step 8b is next. Read `docs/RESUME.md` + `docs/STEP-8B-PORT-AUDIT.md` in full. 
Do NOT run any container mutations or edit `scripts/container-specs.sh`, `scripts/first-boot-containers.sh`, or `scripts/reconcile-containers.sh` — those are dead code scheduled for deletion in Step 8c. Work happens in `core/container/src/manifest.rs`, `core/archipelago/src/container/prod_orchestrator.rs`, and `apps//manifest.yml`. Summarize back to me what you understand the current state to be, wait for approval before touching anything. +> Continue Archipelago release hardening from `docs/RESUME.md`. First read `docs/RESUME.md`, `docs/CONTAINER_LIFECYCLE_HANDOFF.md`, and `docs/MIGRATION_STATUS_REPORT.md`. The active validation node is `.198` at `192.168.1.198`; keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive for deterministic tests. Do not run Podman prune/image-list/system-df/image-exists/store-wide cleanup commands on `.198`; the store is known to hang under load. Preserve app data. Latest deployed backend hash is `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`. This includes the rootless Podman socket fix that treats `/run/user/1000/podman/podman.sock` as a socket bind, never a directory/data bind, prefers persistent `podman-archy-api.service` for Portainer, and changes absent cached `Stopping` entries to `Stopped`. User reported host reboot validation was not clean: many containers were SIGKILLed during reboot/shutdown and IndeeHub was stopped after boot. User also reported Immich, IndeeHub, Tailscale, Vaultwarden, Portainer, Home Assistant, Uptime Kuma, Nextcloud, Fedimint, and Botfights app lifecycle/launch/state issues. BTCPay was a false alarm: slow but fine. Current live validation: Vaultwarden full preserve-data lifecycle passed; Portainer full preserve-data lifecycle passed and its socket mount is no longer `//deleted`, but the user still needs to retry the Portainer environment wizard. Fedimint direct container state is running/healthy. 
IndeeHub remains P0: Podman still has a corrupted `indeedhub|Removing|97cf9fd13bb2` record; targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Treat post-reboot recovery, launch reachability, lifecycle correctness, progress indication, and rootless Podman socket-backed apps as active release blockers. IndeeHub is not passing unless `http://:7778/` is reachable and `/nostr-provider.js` is injected/served so the Nostr signer works as before. Tailscale is not passing unless launch presents the Tailscale login/auth UI. Before editing or touching `.198`, summarize current state and your exact first step. --- -## Standing directive from the user +## Current Goal -> Please get back to a well architected, minimal as possible, perfect working container architecture. If we've gone off track and the system is getting complex rather than elegant and perfect best containers ever then we need to review all the current state of the system and get back to making the best container system ever and according to our projects goals. We will be working on this until it's perfect. +Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image. -**Interpretation (validated with the user):** resume the Rust orchestrator migration. Stop patching bash scripts. The bash scripts were supposed to be deleted three months of commits ago and we drifted into maintaining them by accident. +Current status estimate: about 68% of the way to release. The app migration, manifest/catalog generation, and many local gates are advanced, and the latest pass fixed Vaultwarden plus the concrete Portainer stale socket mount. Live `.198` testing still shows the app platform is not production-bulletproof. 
Remaining release blockers include app install/start truthfulness, frontend launch readiness gating, IndeeHub recovery and Nostr signer compatibility, Tailscale login-link launch, Home Assistant/Uptime Kuma/Nextcloud install/start failures, full lifecycle coverage, progress indication quality, app packaging documentation, refactor/dead-code cleanup, repeated reboot validation, final `.198` lifecycle confidence, and cutting/smoke-testing the `1.8-alpha` ISO. -## Latest user comment (must be followed) +## Release Readiness Estimate -> please continue, please state my last comment in the resume doc and first before making this plan to adhere to - -Adherence rule for this session: -- Before proposing or executing a plan, first record the user's latest directive in `docs/RESUME.md`. -- Keep work aligned to Step 8 migration goals and avoid off-scope drift. - -Most recent directive: - -> And we need to get every container working on .116 and tested before we release - -Release gate update: -- `.116` must have all required containers healthy and tested before release is allowed. -- Treat runtime stabilization on `.116` as immediate priority while continuing Step 8 migration work. +- Estimated completion: `68%`. 
+- What is already achieved: + - manifest-driven app migration is substantially advanced; + - catalog metadata generation and strict drift checks are green; + - local backend/frontend release gates have been green in prior passes; + - broad non-destructive lifecycle has passed on the deployed release-candidate line before the reboot-gate finding; + - Podman store-risk paths have been quarantined from known fragile broad image/store commands; + - IndeeHub recovery now has local hardening in progress, including explicit Nostr signer validation in the lifecycle harness; + - targeted Immich fixes now make dependency creation fail fast instead of silently reporting install success, and a follow-up readiness-gating patch is in progress so the app does not look launchable before HTTP readiness; + - mobile and desktop app progress UX now has clearer install/remove phase labels in local changes; + - Vaultwarden full preserve-data lifecycle passed on `.198` after the rootless socket fix; + - Portainer full preserve-data lifecycle passed on `.198` after recreating the container against persistent `podman-archy-api.service`; its mount now points at `/podman/podman.sock`, not `/podman/podman.sock//deleted`. 
+- What must still pass before release: + - deploy the current Immich readiness-gating backend and frontend progress UX changes; + - focused Immich validation: install must stay in progress until `http://:2283/` returns HTTP success and app launch opens the frontend; + - focused IndeeHub validation: recover stale/corrupt frontend container, prove `http://:7778/`, and prove `/nostr-provider.js` signer bridge is injected/served; + - keep Vaultwarden in regression coverage even though the latest full lifecycle passed; + - focused Tailscale validation: launch must present the local login/auth link/UI on `8240`; + - focused Portainer validation: user must retry the environment wizard and confirm it can connect to the rootless Podman socket at `/var/run/docker.sock`; + - full preserve-data lifecycle testing for representative migrated apps and key stacks: `install -> launch -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch`; + - progress indication validation for install, uninstall, start, stop, restart, reboot recovery, and failed transitions; generic "running" or "removing" pills are not enough; + - app packaging documentation gate: update `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` so they match the current manifest/runtime contract, include lifecycle/progress/reboot expectations, and clearly tell developers to use reusable manifest/orchestrator primitives instead of OS-level per-app hacks; + - required refactor/remove-dead-code gate: after correctness is proven and before cutting `1.8-alpha`, remove obsolete app-specific paths, stale fallback metadata, duplicate lifecycle logic, unused scripts/hooks, and misleading compatibility shims; rerun lifecycle, launch, and release gates afterward; + - broad non-destructive lifecycle after the deploy; + - at least 3 consecutive clean post-fix reboot iterations, with broad lifecycle green after each; + - preferably 5 consecutive clean reboot iterations before calling 
`1.8-alpha` production-release ready; + - final local release gates after any additional fixes; + - cut the `1.8-alpha` ISO; + - boot/smoke-test the ISO enough to prove installability, backend startup, UI startup, app catalog availability, and at least a focused app lifecycle. --- -## Where we actually are +## Latest User Directive -### Shipped (Steps 1–7 + 8a) +> A lot were killed SIGKILL and one crashed, a couple stopped. Not sure if we did fixes but we should be a few reboot tests until 3/4/5 reboots are clean I guess, unless you advise a different passing criteria +> +> please do not forget that indeehub must work with the nostr signer just like before, I hope we haven't broken that or anything, please add to tasks +> +> also please note that immich and tailscale are not launching on the front-ends on their ports from the app screen, they say running/healthy but clearly aren't +> +> Also BTCPay is not running either +> +> no my bad, wrong server, BTCPay is fine just slow, please continue +> +> Yes, as shown in trying to complete the environment wizard in portainer you get "Failure Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" +> +> please confirm there is a refactor/remove dead code release gate too -Commits on `main` (unpushed to `origin`/tx1138 until release gate; user-visible history): +Passing criterion adopted: after the post-reboot recovery fix is deployed, require at least 3 consecutive clean reboots with broad non-destructive lifecycle green after each; prefer 5 consecutive clean reboots for production-release confidence. SIGKILL during shutdown is not automatically disqualifying if every managed app recovers and is reachable after boot, but any app left stopped/crashed/unreachable after boot is a failed reboot iteration. IndeeHub validation must include the Nostr signer bridge, not just HTTP reachability. 
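The passing criterion adopted above can be sketched as a small check over recorded reboot iterations. This is a minimal illustration, not the harness's actual logic: `RebootIteration`, `trailing_clean_streak`, and `reboot_gate` are hypothetical names, and the record shape is an assumption. It only encodes the stated rules: SIGKILL during shutdown is informational, any app left stopped, crashed, or unreachable after boot fails the iteration, and the clean iterations must be consecutive and post-fix.

```python
from dataclasses import dataclass


@dataclass
class RebootIteration:
    """Outcome of one reboot plus the broad lifecycle run after it (hypothetical shape)."""
    unrecovered_apps: list[str]     # apps left stopped/crashed/unreachable after boot
    lifecycle_green: bool           # broad non-destructive lifecycle passed after this boot
    sigkilled_on_shutdown: int = 0  # informational only; not disqualifying by itself

    @property
    def clean(self) -> bool:
        # An iteration is clean only if every app recovered and lifecycle was green.
        return self.lifecycle_green and not self.unrecovered_apps


def trailing_clean_streak(iterations: list[RebootIteration]) -> int:
    """Count consecutive clean iterations, counting back from the most recent."""
    streak = 0
    for iteration in reversed(iterations):
        if not iteration.clean:
            break
        streak += 1
    return streak


def reboot_gate(iterations: list[RebootIteration], minimum: int = 3, preferred: int = 5) -> str:
    """Return 'pass-preferred', 'pass-minimum', or 'fail' for the recorded history."""
    streak = trailing_clean_streak(iterations)
    if streak >= preferred:
        return "pass-preferred"
    if streak >= minimum:
        return "pass-minimum"
    return "fail"
```

Counting from the most recent iteration means a single failed reboot resets the streak, which matches the "consecutive" wording of the criterion.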
-| Step | Commit | What | -|------|--------|------| -| 1 | (schema in place from earlier commits) | `ContainerConfig.image` ⊕ `ContainerConfig.build` — mutually exclusive pull-or-build source | -| 2 | `34af4d9d` | `ContainerRuntime` trait gains `image_exists` + `build_image`; `PodmanRuntime` impl | -| 3 | `b6a04d31` | `ProdContainerOrchestrator` with build-or-pull + adoption + reconcile | -| 4 | `e8a59c93` | `ContainerOrchestrator` trait; `RpcHandler` uses it in prod | -| 5 | `fc39b04b` | `BootReconciler` — periodic reconcile loop | -| 6 | `48f08aa3` | Wire both into `main.rs` | -| 7 | `069bc4a5` | `bitcoin-ui` pre-start hook renders `nginx.conf` from embedded template (the pattern for "derived config" at apply time) | -| 8a | `a0707f4d`, `1c81a739` | Retire `archipelago-reconcile` systemd timer; split Step 8 into 8a/8b/8c | +Immich, Tailscale, Vaultwarden, and Portainer are explicit blockers. Container `running`/`healthy` is not enough for Immich/Tailscale; direct/app-screen launch routes must work. Tailscale launch must present the login/auth UI. Vaultwarden must survive install/start/restart. Portainer must be able to talk to the rootless Podman socket from inside its Docker-compatible socket bind. BTCPay is not currently a blocker; it was a wrong-server/slow-app false alarm. -Three `apps/*/manifest.yml` are genuinely ported and running under the Rust orchestrator on `.116` + `.228`: `bitcoin-ui`, `electrs-ui`, `lnd-ui` (Step 7). - -### Where we drifted (the session that produced the previous RESUME.md) - -On 2026-04-23 a fedimint outage on `.116` pulled a session into patching `scripts/reconcile-containers.sh`, `scripts/container-specs.sh`, `scripts/first-boot-containers.sh` — files that Step 8c is scheduled to delete. Five bugs deep, the user halted the session. 
That cluster of bugs is a symptom of running two incompatible codepaths in parallel (bash first-boot/reconcile + Rust `BootReconciler`), which is exactly the condition Step 8c fixes by deleting the bash half. - -**Discard-of-scope decision:** the uncommitted bash edits on `.116` (listed in the previous RESUME.md's "Uncommitted script changes" section) are not going to be committed. The fedimint mDNS-URLs fix, the filebrowser custom-args fix, the bcrypt-escape fix — these all land as changes to `apps//manifest.yml` + the Rust orchestrator in Steps 8b.0 – 8b.3. See `docs/STEP-8B-PORT-AUDIT.md` for the exact mapping. - -### Current container state on `.116` - -Running but drifted. See the "Current container state" section in the previous RESUME.md. Decision (approved by user): accept `.116` is limping until 8b.3 lands. Do not run `scripts/reconcile-containers.sh` or any mutations; all rescues go through the Rust orchestrator or wait for the manifest port. - -`.228` is happier — it's already adopted by the Rust orchestrator for the three UI apps. +There is also an explicit app packaging documentation gate and an explicit required refactor/remove-dead-code release gate. The packaging docs must be current enough for a third-party developer to package an app against the actual manifest/runtime contract. Do the refactor/dead-code cleanup after current correctness fixes are validated, not before, but do not cut `1.8-alpha` without it: remove stale per-app hacks, dead legacy code paths, duplicate lifecycle helpers, obsolete scripts/hooks, and misleading fallback metadata that would make `1.8-alpha` hard to maintain, then rerun the release gates. --- -## Next step — Step 8b.0 +## Live `.198` State -**Concretely:** schema extensions to `core/container/src/manifest.rs` + unit tests. No orchestrator changes, no manifest changes, no container mutations. +- Host: `192.168.1.198`. +- Password for lifecycle harness/RPC login: `password123`. 
+- Latest recorded `/usr/local/bin/archipelago` sha256: `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`. +- `archipelago.service`: active. +- `archipelago-doctor.timer`: inactive. +- `archipelago-reconcile.timer`: inactive. +- `/`: `65%` used, about `9.6G` free. +- `/var/lib/archipelago`: about `9-10%` used, about `370G` free. -Fields to add (justified in `docs/STEP-8B-PORT-AUDIT.md§Schema gaps`): +Current active app blockers: -- `container.network: Option` — podman `--network` value (`"archy-net"`, `"host"`, or `None` = isolated default). -- `container.custom_args: Vec` — appended to the container command. -- `container.entrypoint: Option>` — override. -- `container.derived_env: Vec<{key, template}>` — template strings resolved against `HostFacts { host_ip, host_mdns, disk_gb }` at apply time. -- `container.secret_env: Vec<{key, secret_file}>` — read from `/var/lib/archipelago/secrets/` at apply time. -- `container.data_uid: Option` — `"NNNNN:NNNNN"` applied via `chown -R` before container create. -- `Volume.volume_type: "tmpfs"` + `Volume.tmpfs_options: String` — OR a new `container.tmpfs: Vec<{target, options}>`. Pick one at implementation time. +- Immich: after deploying hash `54d781...`, reinstall no longer immediately stops. Live test showed `immich_postgres` and `immich_redis` healthy and `immich_server` running; first launch had a readiness gap while Immich ran migrations/geodata import, then `2283` returned HTTP `200`. Local follow-up changes add an Immich server health check and require healthy status before install completes. +- IndeeHub: still blocked. Latest targeted check after hash `f1f5c61c...` showed a corrupted Podman ghost record: `indeedhub|Removing|97cf9fd13bb2`; `podman inspect indeedhub` fails with `layer not known`. Targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. 
Must recover this record without broad store cleanup and then verify `http://:7778/` plus `/nostr-provider.js` for the Nostr signer. +- Home Assistant: user reports install completes then app stops. Treat as part of the migrated single-container/rootless Podman control-plane blocker. +- Uptime Kuma: user reports install takes ages then app stops. Live logs showed `package.install uptime-kuma failed: systemctl --user restart podman.socket exited exit status: 1`. +- Nextcloud: user reports same install-then-stop behavior. Live logs showed `package.install nextcloud failed: systemctl --user restart podman.socket exited exit status: 1`. +- Vaultwarden: latest full preserve-data lifecycle passed on hash `2a168489...`: install -> launch on `8082` -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch. Keep in regression tests because the user-visible transition/progress UX still looked like it was stuck while stopping. +- Portainer: latest full preserve-data lifecycle passed on hash `2a168489...`. The stale mount was confirmed as `/run/user/1000/podman/podman.sock//deleted`; after persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock` without `//deleted` and `http://127.0.0.1:9000/` returns HTTP `200`. User still needs to retry the environment wizard; do not close this blocker until the wizard no longer reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`. +- Tailscale: still blocked. Container running is not enough; launch must present local login/auth UI on `8240`. +- Fedimint: user reported it showed `stopping`; after hash `f1f5c61c...`, direct targeted state shows `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`. Keep in focused regression/launch checks. +- Botfights: newly reported stopped/broken. 
Direct probe after the report showed `botfights` running/healthy and `http://127.0.0.1:9100/` returning `200`; keep in focused lifecycle/launch validation after Podman control-plane recovery. +- Rootless Podman socket/control plane: improved but still a release-risk area. Fixed the concrete bug where `/run/user/1000/podman/podman.sock` could be created as a directory and the Portainer bind could point at a deleted socket inode. The current deployed backend prefers persistent `podman-archy-api.service`. Continue watching scanner timeouts and lifecycle behavior for Home Assistant, Uptime Kuma, Nextcloud, and Portainer. +- Stuck Podman records: P0 migration blocker. IndeeHub proves ordinary targeted `podman rm` fallbacks are not sufficient once a record is wedged in `Removing`. +- Progress UX: still blocked until live validation proves install/uninstall/start/stop/restart show phase detail and do not appear frozen. -**Tests** (block the commit until green): +Do not treat root disk pressure as a current blocker anymore. It was reduced from `99%` used with under `600M` free to about `65%` used with roughly `10G` free. -- Every existing `apps/*/manifest.yml` still parses (`parse_every_real_manifest` test). -- Each new field parses correctly with sensible defaults. -- `validate()` rejects: empty custom_args elements, empty entrypoint elements, duplicate derived_env keys, derived_env templates referencing unknown host facts, secret_env with `..` or `/` in secret_file (path-traversal guard). -- `resolve_env(HostFacts)` returns expected strings for each supported placeholder. -- `resolve_secret_env(SecretsProvider)` returns expected strings; missing secret file is a hard error. +### 2026-06-10 Resume Continuation Checkpoint -This is the smallest useful commit and unblocks every port in 8b.1+. +- Deployed backend hash `7f58da80063f58574675256913ac9cddf131e65d8935015748a70adffc228f83` to `.198`. 
+ - Previous live hash observed before deploy: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`. + - `archipelago.service` is active. + - `archipelago-doctor.timer` and `archipelago-reconcile.timer` are inactive. +- Added explicit release gates to this handoff: + - app packaging docs must be updated before `1.8-alpha`; + - refactor/remove-dead-code is required before `1.8-alpha`, after correctness validation and before final release gates/ISO. +- Local validation before deploy: + - `bash -n tests/lifecycle/remote-lifecycle.sh` passed; + - `cargo fmt --manifest-path core/Cargo.toml --all`; + - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` passed (`45` tests); + - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed; + - `python3 scripts/check-app-catalog-drift.py --release --strict` passed; + - `git diff --check` passed. + - Filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` appeared wedged in the tool PTY after compilation started; no local cargo/rustc worker remained visible. Treat that one filtered run as inconclusive, not failed. +- IndeeHub live validation after deploy: + - `container-list` reports `indeedhub` running; + - `container-health` reports `{"indeedhub":"healthy"}`; + - `http://192.168.1.198:7778/` returns HTTP `200`; + - `http://192.168.1.198:7778/nostr-provider.js` returns HTTP `200` and contains the Archipelago NIP-07/NIP-98 Nostr provider shim. +- Immich live validation after deploy: + - `container-list` reports `immich` running; + - direct `http://192.168.1.198:2283/` returns HTTP `200`; + - `container-health` reported `{"immich":"unknown"}` during one focused check, so health truthfulness still needs follow-up even though launch HTTP is reachable. 
+- Tailscale live validation after deploy: + - Found the live generated unit still used the stale catalog command `sleep 2; tailscale web...`; locally patched `app-catalog/catalog.json`, `neode-ui/public/catalog.json`, and `scripts/first-boot-containers.sh` to use the safer socket-wait startup, and copied the catalog to `/opt/archipelago/web-ui/catalog.json`. + - App-scoped `package.restart tailscale` failed via RPC with `podman ps timed out while listing containers`. + - Patched the live generated Tailscale `.container` unit to match the catalog fix and restarted only `tailscale.service`; the old container required SIGKILL during stop and Podman cleanup took roughly 2 minutes. + - After restart, the Tailscale unit runs both `tailscaled` and `tailscale web`, `container-list` reports `tailscale` running, `container-health` reports `{"tailscale":"healthy"}`, and `http://192.168.1.198:8240/` returns HTTP `200` with Tailscale UI content. + - Do not close Tailscale lifecycle as fully passing yet: launch UI is fixed, but stop/restart behavior exposed the rootless Podman cleanup/control-plane blocker. +- Other live probes after deploy: + - `portainer` HTTP `9000` returns `200`; user still needs to retry the environment wizard. + - `vaultwarden` HTTP `8082` returns `200` from localhost on `.198`. + - `botfights` HTTP `9100` returns `200` from localhost on `.198`. + - `btcpay-server` returned `302` then timed out under a short probe; continue treating BTCPay as slow rather than a current blocker unless a focused check fails. + - `fedimint` port `8175` reset during probe while RPC showed `starting`; keep expected Bitcoin-sync wait-state/status copy in scope. 
+- Podman/control-plane remains the active systemic blocker: + - logs still show `podman ps timed out`, `podman stats timed out`, scan backoff, and slow app cleanup; + - do not start reboot-count validation until app stop/start/restart and post-reboot recovery are clean enough that tests measure app behavior instead of Podman timeouts. --- -## Project ground rules (standing) +## Latest Completed Work -- `archy` SSH alias = `.116`. `archy228` = `.228`. **Do not swap.** -- SSHFS at `/Users/dorian/mnt/archy-thinkpad/` = `archy:Projects/archy/`. -- `.116` sudo password: `ThisIsWeb54321@` — works passwordless in-session via `sudo -nS` after first use. -- `.228` has NOPASSWD. -- Git commits on `.116` MUST use `git commit -F /tmp/tmp-msg.txt` over `ssh archy` — SSHFS `git commit` hangs. -- Never push except current release (granted: `gitea-local` + `gitea-vps2`). -- No em-dashes. Conventional Commits. -- No altcoin mentions, Bitcoin-only. +### 2026-06-08 Rootless Socket, Vaultwarden, and Portainer Fix + +- Built and deployed backend hash `2a168489737180b4088503dd93ef89c11da13e64790b324db8baea8ca05d3536` to `.198`; then built and deployed follow-up hash `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`; `archipelago.service` active, `archipelago-doctor.timer` inactive, `archipelago-reconcile.timer` inactive. +- Fixed rootless Podman socket bind handling in `core/archipelago/src/container/prod_orchestrator.rs`: + - `/run/user/1000/podman/podman.sock` is skipped by bind-directory creation and data UID/chown prep; + - socket bind mounts call explicit socket repair before other bind prep; + - `ensure_user_podman_socket()` now prefers persistent `podman-archy-api.service` at `unix:///run/user/1000/podman/podman.sock`, falling back to `podman.socket` only if needed. +- Validated locally before deploy: + - `cargo fmt --manifest-path core/Cargo.toml --all`. + - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`. 
+ - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago absent_` (`4 passed`, including the stale absent `Stopping` regression tests). + - `git diff --check`. + - `timeout 900s cargo build --manifest-path core/Cargo.toml -p archipelago --release`. +- Vaultwarden full preserve-data lifecycle passed on `.198`: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=vaultwarden ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh` +- Portainer full preserve-data lifecycle passed on `.198`: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh` +- Portainer stale socket mount was confirmed and repaired: + - Before recreate, mountinfo showed `/run/user/1000/podman/podman.sock//deleted -> /var/run/docker.sock`. + - After persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock -> /var/run/docker.sock`, host socket exists, and Portainer UI returns HTTP `200`. + - User still needs to retry the Portainer environment wizard; do not close the blocker until that wizard can connect. +- Direct state check after deploy: + - `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`. + - `indeedhub|Removing|97cf9fd13bb2`; `podman inspect` fails with `layer not known`; targeted removal/cleanup hangs and had to be killed. + - `vaultwarden running true`. + - `portainer running true`. + +### 2026-06-08 Reboot Blocker Follow-up In Progress + +- User reported host reboot validation was not clean: many containers were killed with SIGKILL during reboot/shutdown, one crashed, a couple stopped, and IndeeHub was stopped after boot. +- Treat this as a failed reboot gate. Do not call the release ready until post-fix reboot iterations are clean. 
+- Local changes made in this pass: + - hardened `core/archipelago/src/container/prod_orchestrator.rs` IndeeHub stack recovery so reboot reconcile starts existing backend containers through a user scope when possible, waits for backend containers and API dependency DNS, starts/restarts the frontend, verifies it remains running, and verifies host port `7778`; + - hardened `core/container/src/manifest.rs` package validation for app IDs, ports, env keys, capabilities, devices, volume sources/options, network policy, and reviewed host-bind exceptions while preserving all current real manifests; + - updated `tests/lifecycle/remote-lifecycle.sh` so IndeeHub launch validation requires `/nostr-provider.js` to be injected into the HTML and served from the app, preserving the Nostr signer requirement. +- Deployed follow-up backend hash `4108ca146b482c028ae8d7c4bec314b71ef3412f15efd2e61846a2c345b36aba` to `.198`; service active, timers inactive. Focused audit still showed: + - `indeedhub` stuck `stopping` and unhealthy; + - `immich` stopped/unhealthy; + - `tailscale` running/healthy but direct launch `8240` returned `000`; + - `vaultwarden` health RPC errored and launch `8082` returned `000`; + - `btcpay-server` was fine (`23000` returned HTTP 200); user confirmed BTCPay was a wrong-server/slow-app false alarm. 
+- Targeted diagnostics on `.198` found: + - IndeeHub frontend Podman state `removing`/`stopping` with no `7778` listener; + - Immich server stopped, Redis exited, Postgres unhealthy, no `2283` listener; + - Tailscale listener process existed on `8240`, but direct HTTP still returned `000`; logs show Tailscale is `NeedsLogin`/`WantRunning=false`, so launch must present the login/auth UI rather than a generic daemon endpoint; + - Vaultwarden container was absent; public `package.start vaultwarden` failed on stale/refused Podman socket before local fixes; + - Portainer launches but the environment wizard reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`, confirming socket-backed apps are not release-ready. +- Local follow-up fixes after those diagnostics: + - `core/container/src/runtime.rs` now tries `podman rm -f --time 0`, targeted `podman container cleanup`, and another `rm -f` when normal forced remove fails; + - `ensure_user_podman_socket()` now verifies the rootless Podman socket accepts Unix connections, not just that the socket path exists; + - IndeeHub readiness now falls back to platform-managed network-alias presence when `getent` inside the API image cannot prove DNS; + - lifecycle harness now requires Tailscale launch content to look like login/auth UI. +- Local validation passed after those fixes: + - `cargo fmt --manifest-path core/Cargo.toml --all`. + - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`. + - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`). + - `bash -n tests/lifecycle/remote-lifecycle.sh`. + - `git diff --check`. +- Deployed second follow-up backend hash `06420c0377fff650a2bf3211f13c1e0754bf8df81345b8485f4c9a30cb552439` to `.198`; service active, timers inactive. 
+- Public RPC recovery attempts on hash `06420c...`:
+  - `package.restart indeedhub` still failed;
+  - `package.start immich` accepted async start but app remained `starting` with no `2283` launch;
+  - `package.start vaultwarden` accepted async start but no `8082` launch appeared;
+  - `package.restart portainer` failed;
+  - `package.restart tailscale` accepted async restart but no `8240` launch UI appeared.
+- Latest focused probe after hash `06420c...`:
+  - `tailscale` `running`, `http://192.168.1.198:8240/` returns `000`;
+  - `immich` `starting`, `http://192.168.1.198:2283/` returns `000`;
+  - `indeedhub` `stopping`, `http://192.168.1.198:7778/` returns `000`;
+  - `portainer` `running`, `http://192.168.1.198:9000/` returns `000`;
+  - `vaultwarden` absent/not listed, `http://192.168.1.198:8082/` returns `000`.
+- Conclusion: do not proceed to reboot testing or ISO work. The rootless Podman control-plane/socket health and stuck container-state recovery need a deeper platform fix before lifecycle/reboot gates are meaningful.
+- Local validation passed so far:
+  - `cargo fmt --manifest-path core/Cargo.toml --all`.
+  - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
+  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
+  - `bash -n tests/lifecycle/remote-lifecycle.sh`.
+  - `git diff --check`.
+- A filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` compiled and ran the matching existing IndeeHub test (`1 passed`); it did not exercise the new reboot recovery branch because there is no direct unit test for that path yet.
+- Next steps: + - deploy the new backend only after approval; + - verify focused `indeedhub,immich,tailscale,vaultwarden,portainer` lifecycle/launch, including IndeeHub Nostr provider check and Portainer socket usability; + - run reboot validation iterations on `.198` only after explicit approval; + - pass threshold: 3 consecutive clean post-fix reboots minimum, 5 preferred for production-release confidence. + - cut and smoke-test the `1.8-alpha` ISO after reboot validation is green. + +### Local Release Gate Completion After `.198` App Recovery + +- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands. +- Fixed scanner backoff/in-flight skip behavior: skipped scans now bump `scan_tick`, so install/update success paths that kicked the scanner do not wait for their timeout when Podman scan backoff is active. +- Fixed stale crash-recovery unit tests after `should_auto_start_stopped_container` gained the `include_stack_members` flag; coverage now asserts generic boot recovery skips stack helpers while stack recovery can include them. +- Fixed local runtime manifest-port lookup so tests and local backend runs can find workspace `apps/*/manifest.yml` via `CARGO_MANIFEST_DIR`; this covers new public apps such as PhotoPrism. +- Fixed journal usage parsing for real `journalctl --disk-usage` compact output such as `463.9M`. +- Fixed boot-reconciler cadence tests so `without_companion_stage()` also bypasses the global crash-recovery wait gate in tests; production still waits for recovery completion. +- Verified catalog generation is idempotent: `python3 scripts/generate-app-catalog.py` reported `updated 0 fields` for both catalogs. +- Validation passed locally: + - `cargo fmt --manifest-path core/Cargo.toml --all`. + - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`). + - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`43 passed`). 
+ - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`. + - `cargo check --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security`. + - `cargo test --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security` (`12 security tests passed`; performance has no tests). + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. + - `python3 scripts/check-app-catalog-drift.py --release --strict`. + - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`. + - `git diff --check`. + - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`. +- Remaining gated item remains host reboot validation on `.198`, only if explicitly approved. + +### Frontend Release Gate Completion + +- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands. +- Found and fixed a mobile app-launch regression in `neode-ui/src/stores/appLauncher.ts`: + - desktop-only new-tab apps still open directly on desktop; + - mobile now routes those apps through the app-session route instead of escaping Archipelago in a new browser tab; + - `dashboardReturnPath()` now tolerates tests/minimal router mocks with no `currentRoute`. +- Updated frontend tests to match current desktop new-tab policy and mobile in-app routing behavior. +- Fixed `AppIconGrid` test setup so it shares the mounted Pinia instance and mocks credential lookup before launch. +- Fixed onboarding retry test timing to cover the actual exponential retry budget. +- Validation passed locally: + - `npm run type-check` from `neode-ui`. + - `npm test` from `neode-ui` (`548 passed`). + - `npm run build` from `neode-ui`. + - `python3 scripts/generate-app-catalog.py` (`updated 0 fields`). + - `python3 scripts/check-app-catalog-drift.py --release --strict`. 
+  - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
+  - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
+  - `git diff --check`.
+- Local caveat: `npm ci` is currently blocked because existing `neode-ui/node_modules/@alloc` entries are owned by `root:root`. Existing installed modules were sufficient for type-check, tests, and build. Do not delete or chown this tree without explicit approval.
+
+### Fedimint/File Browser, Nostr/NPM, and IndeeHub Recovery
+
+- Built and deployed backend hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de` to `.198`.
+- Fixed UI-facing package health for reachable running apps whose Podman health stayed `starting`, `unhealthy`, or a numeric exit value while the launch port was reachable.
+- Confirmed Fedimint Guardian and File Browser were actually reachable; their `server.get-state` package-data now reports healthy instead of “starting up”.
+- Fixed Nostr relay port conflict by moving `apps/nostr-rs-relay/manifest.yml` host port from `8081` to `18081`.
+- Recovered Nginx Proxy Manager admin launch on `8081`; Nostr now launches on `18081` and no longer captures the NPM launch port.
+- Hardened legacy package install so scoped web-app installs use `podman create` plus `systemd-run --user --scope podman start`, avoiding backend-cgroup coupling without hanging the install RPC.
+- Recovered IndeeHub without deleting data: started the stopped `indeedhub-minio` dependency, repaired frontend reachability, and verified `7778` returns the app.
+- Validation passed:
+  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
+  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
+  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
+  - Focused lifecycle for `indeedhub,nginx-proxy-manager,nostr-rs-relay,fedimint,filebrowser`.
+ - Direct launch checks returned HTTP `200` for `7778`, `8081`, `18081`, `8175`, and `8083`. + - Broad non-destructive lifecycle passed on live hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`. +- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.6G` free; `/var/lib/archipelago` at `10%` used with about `370G` free. + +### Deployed Podman Store-Risk Cleanup + +- Reviewed release-relevant Podman store/image call sites without running broad Podman store/image commands on `.198`. +- Bounded stack installer image pulls and manual package update image pulls with `kill_on_drop` and 600s timeouts. +- Deployed backend hash `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4` to `.198` with the previous backend backed up under `/usr/local/bin/archipelago.backup-20260608-store-risk-*`. +- Validation passed: + - `python3 scripts/check-app-catalog-drift.py --release --strict`. + - `cargo fmt` from `core/`. + - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. + - Focused post-deploy lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint,immich,indeedhub,photoprism ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - Broad post-deploy non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.8G` free; `/var/lib/archipelago` at `10%` used with about `370G` free. 
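The store-risk cleanup above bounds image pulls with a timeout and kills the child when the bound is hit. The deployed change lives in the Rust backend (`kill_on_drop` plus a 600s timeout); the following is only a rough Python analogue of the same idea, assuming nothing about the real call sites. `run_bounded` is a hypothetical helper name, and the demo command is a stand-in, not a real Podman invocation.

```python
import subprocess
import sys


def run_bounded(argv, timeout_s=600.0):
    """Run argv with a hard time bound.

    Returns (exit_code, timed_out). exit_code is None when the bound was hit.
    """
    try:
        completed = subprocess.run(
            argv,
            capture_output=True,  # keep child output off the caller's stdout/stderr
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no orphaned process is left behind when the bound is exceeded.
        return None, True
    return completed.returncode, False


if __name__ == "__main__":
    # Stand-in for a long-running pull; finishes immediately.
    code, timed_out = run_bounded([sys.executable, "-c", "print('pulled')"], timeout_s=600.0)
    print(f"exit={code} timed_out={timed_out}")
```

The point of the pattern is that a wedged store or socket surfaces as a bounded failure the caller can report, rather than as an install or update RPC that never returns.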
+ +### Release Candidate Backend Restart Validation + +- Built and deployed backend hash `e28affdf4c1d3cecbe4c14b0439b53d977ed20873c966c288116601d49dac732` to `.198`. +- Bounded additional Podman store/control probes so image and stack health checks fail fast instead of hanging under `.198` Podman store/socket load. +- Fixed Fedimint health reporting: if Podman health remains `starting` but the app endpoint is reachable, `container-health` can use the reachable cached app fallback. +- Fixed package start/restart fallback for runtime web apps by using `systemd-run --user --scope` for `podman start`, then falling back to direct bounded `podman start`. +- Recovered live Immich without data loss: + - `immich_server` had exited because `/usr/src/app/upload/encoded-video/.immich` could not be written. + - Correct live ownership is still `podman unshare chown -R 0:0 /var/lib/archipelago/immich`, which maps to host UID/GID `1000:1000` and container root ownership. + - A temporary `1000:1000` in-container ownership experiment was reverted because Immich's storage check writes as container root. +- Validation passed on latest hash: + - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. + - `python3 scripts/check-app-catalog-drift.py --release --strict`. + - `npm run build` from `neode-ui`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - Backend restart validation followed by focused `fedimint,immich,indeedhub,photoprism` lifecycle passed. 
+ - Post-restart broad non-destructive lifecycle passed. +- Remaining gate before calling this a release: host reboot validation, if approved. + +### IndeedHub and Immich Lifecycle Recovery + +- Built and deployed backend hash `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9` to `.198`. +- IndeedHub focused audit is green after sequencing network alias repair immediately before frontend startup, after dependencies are running. +- Fedimint and NetBird focused audits are green; they were not current blockers after rerun. +- Immich was the broad-audit blocker and is now green: + - dependency readiness accepts healthy Podman health state for `immich_postgres` and `immich_redis` before falling back to slower exec probes; + - `immich_server` startup repairs `/var/lib/archipelago/immich` ownership with `podman unshare chown -R 0:0`, preserving upload data while matching the current rootless container user mapping; + - this fixed the observed `EACCES` on `/usr/src/app/upload/encoded-video/.immich`. +- Validation passed on latest hash: + - `cargo check --manifest-path core/Cargo.toml -p archipelago`. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=netbird ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. 
+- Residual risk remains: `.198` still intermittently logs `podman ps -a --format json timed out after 30s` and transient Bitcoin RPC timeouts under load. Continue avoiding store-wide Podman commands. + +### Release Refactor Cleanup + +- Built and deployed backend hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` to `.198`. +- Legacy package runtime host-port cleanup/repair now derives host ports from manifests when available. +- Hardcoded ports remain only as fallback for legacy/non-manifest apps and extra stale-port cleanup compatibility. +- Removed the duplicate Gitea-specific stale port cleanup helper. +- Validation passed on latest hash: + - `cargo check --manifest-path core/Cargo.toml -p archipelago`. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Added focused runtime-host-port tests, but local `cargo test --manifest-path ../../core/Cargo.toml -p archipelago runtime_host_ports` did not finish within 5 minutes during compilation. + +### Catalog Metadata Generation + +- Added `scripts/generate-app-catalog.py` to sync manifest-owned metadata into `app-catalog/catalog.json` and `neode-ui/public/catalog.json`. +- The generator updates fields that manifests already own: `title`, `version`, `description`, `dockerImage`, `category`, `tier`, `icon`, and `repoUrl`. +- The catalog still preserves catalog-only fields such as `author`, `requires`, `featured`, and rich `containerConfig` notes. +- Corrected stale manifest metadata for BotFights, IndeeHub, Gitea, LND, ElectrumX, Fedimint, and Mempool before generation. 
+- Release catalog drift is now zero: + - `python3 scripts/check-app-catalog-drift.py --release --strict` reports `metadata_drift=0`, `missing_catalog=0`, `missing_manifests=0`. +- Validation passed: + - `jq empty app-catalog/catalog.json neode-ui/public/catalog.json`. + - canonical and UI public catalogs match byte-for-byte. + - `cargo test --manifest-path core/Cargo.toml -p archipelago-container`. + - `cargo check --manifest-path core/Cargo.toml -p archipelago`. + - `npm run build` from `neode-ui`. + +### Podman Store-Risk Hardening + +- Built and deployed backend hash `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2` to `.198`. +- Fresh local-build installs now treat `podman image exists ` failure/timeout as "unknown/missing" and rebuild the local image instead of failing the lifecycle operation. +- This keeps local image store checks from being release-blocking while preserving bounded runtime timeouts and matching the existing drift-restart behavior. +- Validation passed on the latest hash: + - `cargo check --manifest-path core/Cargo.toml -p archipelago`. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. +- Added focused unit test coverage for the image-exists failure behavior, but local `cargo test --manifest-path core/Cargo.toml -p archipelago install_fresh_builds_when_image_exists_check_fails` did not complete within 15 minutes during compilation. + +### Container Health Fallback and Broad Lifecycle Green + +- Built and deployed backend hash `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36` to `.198`. 
+- Fixed `container-health` broad lifecycle timeout behavior: + - `cached_reachable_health()` now parses ports from URLs with trailing slashes correctly, such as `http://localhost:2342/`. + - The local TCP fallback now covers the lifecycle web app ports, including PhotoPrism, BTCPay, LND UI, Mempool, Electrum, Fedimint, Gitea, IndeedHub, Ollama, Vaultwarden, Tailscale, and others. + - Cached-running apps with reachable local TCP listeners can report `healthy` without depending on flaky Podman health/inspect calls. +- Validation passed on the latest hash: + - `cargo check --manifest-path core/Cargo.toml -p archipelago`. + - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. + +### Generic Host-Port Health Checkpoint + +- Built and deployed backend hash `3912b900c376b6c28bf5453640cae82135f67d7e0f984b8adcc78064b924143b` to `.198`. +- Confirmed objective remains: app behavior should be manifest/platform-primitive owned, not OS-image or per-app backend hack owned. +- Broad lifecycle on `d21202cd...` failed only on Uptime Kuma briefly showing `stopping` during listener repair; it recovered afterward. +- Fixed stale transitional merge: `Stopping -> Running` recovers when no user-stop marker exists; user-initiated stops still keep `Stopping`. +- Health monitor now derives required host TCP ports from Podman JSON `Ports` and marks running containers unhealthy when declared host listeners are missing. +- This is generic host-port health, not an app-specific mapping. +- After deploying `3912b900...`, Uptime Kuma recovered `3002` and returned HTTP `302` after backend restart. +- Jellyfin still needs follow-up: Podman reports `jellyfin Up ... 
(healthy)` with `0.0.0.0:8096->8096/tcp`, but `ss` shows no `8096` listener and `curl http://192.168.1.198:8096/` fails. +- Follow-up on `be95ea...` resolved the broad lifecycle timeout by hardening `container-health` fallback behavior. + +### Stale State and Jellyfin Pasta Listener Hardening + +- Built and deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` to `.198`. +- `container-list` now overlays cached `exited` entries with targeted live state so scanner backoff does not leave lifecycle/UI reads stuck on stale `exited` after recovery. +- `container-health` now has a bounded cached-running plus local TCP reachability fallback for web apps, reducing dependency on slow/hung Podman inspect paths for health reads. +- Jellyfin was added to legacy runtime host-port repair for pasta listener `8096`. +- `package.restart jellyfin` still exposed a real Podman socket/runtime blocker after stopping the container: `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`. +- `package.start jellyfin` recovered the app afterward; `jellyfin` became `Up ... (healthy)`, `8096` had a `pasta.avx2` listener, and `http://192.168.1.198:8096/` returned HTTP `302`. +- Focused lifecycle passed on the latest hash: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh` +- Release catalog drift check remains: `missing_catalog=0`, `missing_manifests=0`, `metadata_drift=35`. + +### Expanded Cleanup and Store-Safe Uninstall + +- Built and deployed backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b` to `.198`. +- Expanded `system.disk-cleanup` to remove old rollback artifacts while keeping newest rollback points: + - `/usr/local/bin/archipelago.backup-*` newest 3. + - legacy `/usr/local/bin/archipelago.bak*` newest 3. 
+ - `/usr/local/bin/archipelago.before-*` newest 3 as part of legacy backend cleanup. + - `/opt/archipelago/web-ui.bak*` newest 3. + - `/opt/archipelago/web-ui.old` included as web UI rollback cleanup. +- Live `system.disk-cleanup` reclaimed `10.3 GB`: + - `Removed old backend backups: 41.6 MB freed`. + - `Removed old legacy backend backups: 3.6 GB freed`. + - `Removed old web UI backups: 6.6 GB freed`. + - `Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes`. +- `/usr/local/bin` dropped to about `336M`. +- `/opt/archipelago` dropped to about `1.1G`. +- Removed global `podman volume prune -f` from uninstall. Uninstall now logs a skip and still removes explicit app data when `preserve_data=false`. + +### Startup Scan and Uptime Kuma Fixes + +- Startup `adopt_existing()` is bounded with a 35s timeout. +- Initial container scan seeds the same 300s Podman scan backoff used by periodic scans. +- Legacy pasta restart paths use scoped `podman restart` instead of stop+start. +- Uptime Kuma was repaired: + - Before: container internally healthy on `127.0.0.1:3001`, but host `3002` had no pasta listener. + - After: `package.restart uptime-kuma` returns `{"status":"restarted"}` and `http://192.168.1.198:3002/` returns HTTP `302`. + +### Cleanup and Catalog Work Already Done + +- `system.disk-cleanup` intentionally skips Podman image/volume prune. +- `nostr-rs-relay` was added to both catalog surfaces. +- `scripts/check-app-catalog-drift.py --release --strict` reports zero missing catalog/manifest entries and zero metadata drift after catalog generation. +- Meshtastic `app.files` live behavior was validated: deleting `/var/lib/archipelago/meshtastic/config.yaml` and restarting recreated it from the manifest. --- -## Recommended next action for the fresh session +## Verification Already Run -1. Read this file + `docs/STEP-8B-PORT-AUDIT.md` + the "Open decisions" section of the audit. -2. 
Answer the four open decisions (or confirm the recommended defaults). -3. Implement 8b.0 commit 1: add `network`, `custom_args`, `entrypoint`, `derived_env`, `secret_env`, `data_uid` fields to `ContainerConfig` + validation + unit tests. Backwards-compat: every existing `apps/*/manifest.yml` must still parse. -4. Commit + `cargo test -p archipelago-container` + stop. +- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed for the currently deployed release-candidate line. +- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed for the currently deployed release-candidate line. +- Broad lifecycle on current hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` passed: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh` +- Targeted PhotoPrism audit on current hash passed: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh` +- Focused lifecycle on current hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` passed: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh` +- Live cleanup RPC passed and reclaimed `10.3 GB`. 
+- Focused lifecycle after expanded cleanup passed: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh` +- Before the expanded cleanup pass, broad lifecycle also passed on hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28`: + - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh` +- Direct app checks after latest cleanup passed: + - `http://192.168.1.198:3002/` -> HTTP `302`. + - `http://192.168.1.198:8096/` -> HTTP `302` after Jellyfin recovery/start. + - `http://192.168.1.198:8083/` -> HTTP `404` on `/`, which is expected for Filebrowser root probe behavior used here. -Do not touch `scripts/*.sh`. Do not run `reconcile-containers.sh`. Do not live-test on `.116` or `.228` until the schema + orchestrator pieces in 8b.0 + 8b.1 are both in. +### Test Caveat + +- Earlier local focused test commands timed out during first-time test binary compilation, but after compilation completed the full backend test target passed: `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`). +- Remaining workspace packages also pass checks/tests: `archipelago-container`, `archipelago-performance`, and `archipelago-security`. --- -## Recent release (out of scope, for grep context) +## Critical Constraints -v1.7.43-alpha shipped yesterday: tarball-only OTA, async install/uninstall/update lifecycle, install UX polish, `.23` VPS retirement. Manifest at `gitea-local` + `gitea-vps2`. `.228` on the new binary. See `docs/STATUS.md` for the full rundown. +- Preserve app data. +- `.198` is the active validation node. +- Current live backend hash on `.198`: `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
+- Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive unless explicitly testing them. +- Do not run destructive git commands. +- Do not run Podman store-wide cleanup or broad image/store commands on `.198` without a mitigation plan: + - Avoid `podman system df`. + - Avoid `podman image list` / `podman image ls`. + - Avoid broad `podman image exists` loops. + - Avoid `podman image prune` and `podman volume prune`. +- Podman store commands can hang and block app health under current `.198` load. +- Latest local mitigation: Rust release image-existence probes now use bounded targeted `podman image inspect` instead of `podman image exists` or `podman images -q`. -Earlier session notes (container rescue on `.116`, "never fails" directive, env-drift detector experiment) are obsolete — superseded by this file. The directive ("never fails") is honored by the Step 8 migration itself: a declarative manifest regenerated on every reconcile tick can't bake stale IPs into consensus data because the env comes from derived/secret sources that are re-resolved every apply. +--- + +## Current Remaining Blockers + +1. Podman socket/store health remains unresolved. + - Need quarantine/mitigation strategy rather than store-wide commands in release paths. + - Current release paths avoid prune and broad image-list/existence commands; orchestrator, companion, and legacy install image checks now use bounded `podman image inspect`. + - Latest concrete failure remains historical: `package.restart jellyfin` stopped the container but failed to complete because Podman reported socket permission/runtime failure. `package.start jellyfin` recovered afterward. + - Latest deployed hash still logged one initial `podman ps -a --format json` scan timeout/backoff, but focused and broad non-destructive lifecycle validation passed. + +2. Release code-review/refactor gate is still open. + - Reduce remaining app-specific Rust/OS branches where possible. 
+ - Review scanner, health, reconcile, and install/update paths for performance and store-risk. + - Clean up dead transitional paths. + +3. Clean release branch hygiene is not done. + - Worktree is very dirty with many modified and untracked files. + - Do not commit unless explicitly asked. + +4. Full production validation still needed. + - Broad non-destructive lifecycle is green on live hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`. + - Backend restart validation has passed. + - Run host reboot validation if approved. + - Run selected full lifecycle tests for critical apps if time allows. + +--- + +## Files Changed In Latest Pass + +- `core/container/src/runtime.rs` + - Changed Podman runtime `image_exists()` from `podman image exists` to a bounded targeted `podman image inspect` local-storage probe. + +- `core/archipelago/src/api/rpc/package/install.rs` + - Replaced legacy `podman images -q` local fallback and post-pull verification checks with bounded targeted `podman image inspect`. + +- `core/archipelago/src/container/companion.rs` + - Changed companion image existence checks from `podman image exists` to `podman image inspect`. + +- `core/archipelago/src/container/prod_orchestrator.rs` + - Updated image-existence failure test fixture wording for the new `image inspect` probe. + +- Validation for latest local mitigation: + - `cargo fmt --all --check` passed. + - `cargo check -p archipelago-container` passed. + - `cargo check -p archipelago` passed. + - `CARGO_INCREMENTAL=0 cargo check -p archipelago --tests` passed. + - `cargo test -p archipelago-container` passed (`43` tests). + - `git diff --check -- ` passed. + - Filtered `cargo test -p archipelago install_fresh_build` did not complete: one run hit a `rust-lld` undefined hidden symbol artifact/link failure after concurrent Cargo jobs; the sequential `CARGO_INCREMENTAL=0` rerun exceeded 10 minutes during compile, but test-target compilation passed afterward. 
+ +- `core/archipelago/src/api/rpc/system/handlers.rs` + - Calls expanded rollback cleanup helpers and reports reclaimed bytes. + +- `core/archipelago/src/api/rpc/system/mod.rs` + - Added cleanup helpers for legacy backend backups and web UI rollback backups. + - Uses size accounting for directories before removal. + - Keeps newest rollback artifacts instead of deleting all. + +- `core/archipelago/src/api/rpc/package/runtime.rs` + - Skips global `podman volume prune -f` during uninstall. + - Adds Jellyfin `8096` to runtime host-port/pasta cleanup repair. + - Derives legacy runtime host-port cleanup/repair ports from manifests. + - Keeps compatibility fallback ports for legacy/non-manifest apps and removes duplicate Gitea stale-port cleanup code. + +- `core/archipelago/src/api/rpc/container.rs` + - Adds stale cached `exited` refresh for `container-list`. + - Adds cached-running plus local TCP reachability fallback for `container-health`. + - Fixes fallback URL port parsing and expands lifecycle web app port coverage. + +- `core/archipelago/src/container/prod_orchestrator.rs` + - Rebuilds local-build images when `image_exists` fails/times out instead of failing fresh install. + - Adds focused unit test coverage for that behavior. + +- `scripts/generate-app-catalog.py` + - Generates/syncs public catalog metadata from manifest-owned fields. + +- `app-catalog/catalog.json` and `neode-ui/public/catalog.json` + - Generated from current manifests; files match byte-for-byte. + +- `docs/CONTAINER_LIFECYCLE_HANDOFF.md` + - Added latest deployment, cleanup, validation, and residual-risk checkpoint. + +- `docs/MIGRATION_STATUS_REPORT.md` + - Updated current hash, root disk state, and remaining blockers. + +- `docs/RESUME.md` + - This file, replacing stale April migration resume content. + +--- + +## Suggested Next Steps + +1. Re-read the three docs: + - `docs/RESUME.md` + - `docs/CONTAINER_LIFECYCLE_HANDOFF.md` + - `docs/MIGRATION_STATUS_REPORT.md` + +2. 
Verify latest `.198` state: + - `ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 'df -h / /var/lib/archipelago; systemctl is-active archipelago.service; systemctl is-active archipelago-doctor.timer 2>/dev/null || true; systemctl is-active archipelago-reconcile.timer 2>/dev/null || true; sha256sum /usr/local/bin/archipelago'` + +3. Start Podman-store-risk review: + - Search for image/store operations: `image_exists`, `podman image`, `podman system`, `podman prune`, `volume prune`. + - Prefer targeted container status/API calls with timeouts. + - Avoid new broad store commands. + +4. Continue release code-review/refactor cleanup. + +5. If approved, run backend-restart validation and then host-reboot validation. + +--- + +## Current Release Readiness Estimate + +- Credible release candidate: closer now, roughly `87-91%`. +- Production-quality release developers will love: still closer to `73-79%`. + +The biggest improvement in the latest pass is that broad lifecycle is green again on the latest backend. The biggest remaining technical risk is Podman store/socket health. diff --git a/docs/app-developer-guide.md b/docs/app-developer-guide.md index 4104251a..6dcdf7f8 100644 --- a/docs/app-developer-guide.md +++ b/docs/app-developer-guide.md @@ -1,14 +1,18 @@ # Archipelago App Developer Guide -Build and publish containerized apps for the Archipelago ecosystem. +Build and package containerized apps for Archipelago. ## Overview -Apps run as Podman containers on user nodes. You publish app manifests to Nostr relays, where nodes discover and install them through the community marketplace. +Apps run as rootless Podman containers on user nodes. You describe an app in `apps//manifest.yml`; the backend validates that manifest, compiles it into rootless container/runtime behavior, and the release pipeline generates catalog surfaces from the same manifest-owned metadata. + +Archipelago's app contract is deliberately manifest-first. 
A developer should be able to describe images or local builds, ports, volumes, generated files, dependencies, health/readiness, data ownership, networking, secrets, and supported bridge integrations in the app manifest without asking for a custom OS image or app-specific backend patch. When a real app needs a capability that is not represented yet, the preferred path is to add a reusable manifest/orchestrator primitive that other apps can use too. + +The historical marketplace-publish design is not the active local developer contract for `1.8-alpha`. For this release, local manifests are the source of truth and catalog JSON is generated from them. ## App Manifest -Every app needs a manifest (YAML for local apps, JSON for marketplace publishing). +Every app needs a manifest at `apps//manifest.yml`. The root key is `app`; runtime, catalog, and integration fields live below that key. ### Template Manifest @@ -18,36 +22,70 @@ app: id: my-app # Unique, lowercase kebab-case name: My App version: 1.0.0 # Semantic versioning + description: My App does one thing well. 
+ + container: + image: docker.io/myorg/my-app:1.0.0 + pull_policy: if-not-present + network: archy-net + entrypoint: ["sh", "-lc"] + custom_args: + - /app/start.sh + derived_env: + - key: PUBLIC_URL + template: https://{{HOST_MDNS}}:8180 + secret_env: + - key: APP_PASSWORD + secret_file: my-app-password + + dependencies: + - storage: 1Gi + + resources: + cpu_limit: 2 + memory_limit: 512Mi + + security: + capabilities: [] + readonly_root: true + no_new_privileges: true + network_policy: isolated -container: - image: docker.io/myorg/my-app:1.0.0 # Never use :latest ports: - - container: 8080 - host: 8180 + - host: 8180 + container: 8080 protocol: tcp - volumes: - - name: data - path: /data - env: - APP_MODE: production - capabilities: [] # Only add if absolutely necessary - readonly_root: true # Required - no_new_privileges: true # Required - run_as_user: 1000 # Must be >= 1000 -metadata: - description: - short: "One-line description (max 120 chars)" - long: "Detailed description of what this app does and why." - author: - name: "Your Name" - did: "did:key:z6Mk..." 
# Your Archipelago node DID - category: money # money | commerce | data | networking | home | community | other - icon_url: "https://example.com/icon.png" - repo_url: "https://github.com/myorg/my-app" - license: MIT - min_archipelago_version: "0.1.0" - dependencies: [] # e.g., ["bitcoin-knots"] if this app needs Bitcoin + volumes: + - type: bind + source: /var/lib/archipelago/my-app + target: /data + options: [rw] + + environment: + - APP_MODE=production + + health_check: + type: http + endpoint: http://localhost:8080 + path: /health + interval: 30s + timeout: 5s + retries: 3 + + files: + - path: /var/lib/archipelago/my-app/config.yml + content: | + bind: 0.0.0.0:8080 + overwrite: false + + metadata: + icon: /assets/img/app-icons/my-app.svg + category: tools + tier: optional + repo: https://github.com/myorg/my-app + launch: + open_in_new_tab: false ``` ### Required Fields @@ -56,39 +94,71 @@ metadata: |-------|-------------| | `app.id` | Unique identifier, lowercase, kebab-case only | | `app.name` | Human-readable name | -| `app.version` | Semantic version (major.minor.patch) | -| `container.image` | Full image reference with pinned version tag | -| `metadata.description.short` | One-line description, max 120 characters | -| `metadata.author.did` | Your node's DID (get via `node.did` RPC) | +| `app.version` | Version string containing at least one digit; semantic versions are preferred | +| `container.image` or `container.build` | Exactly one image source must be present | +| `security.readonly_root` | Should remain `true` for normal apps | +| `security.no_new_privileges` | Should remain `true` for normal apps | + +### Current Manifest Fields + +| Field | Purpose | +|-------|---------| +| `app.id`, `app.name`, `app.version`, `app.description` | App identity and release metadata | +| `app.container.image` | Registry image to pull | +| `app.container.build` | Local build definition with `context`, `dockerfile`, `tag`, and optional `build_args` | +| 
`app.container.pull_policy` | Pull behavior, usually `if-not-present` | +| `app.container.network` | Podman network setting such as `archy-net` or `pasta`; dangerous namespace-sharing modes are rejected | +| `app.container.entrypoint` / `custom_args` | Entrypoint and command override | +| `app.container.derived_env` | Environment values rendered from allowed host facts such as `HOST_IP`, `HOST_MDNS`, and `DISK_GB` | +| `app.container.secret_env` | Environment values read from `/var/lib/archipelago/secrets/` | +| `app.container.data_uid` | UID:GID ownership repair for app data directories | +| `app.dependencies` | Storage requirements and app dependencies | +| `app.resources` | CPU, memory, and disk limits | +| `app.security` | Capabilities, read-only root, no-new-privileges, network policy, optional AppArmor profile | +| `app.ports` | Host-to-container port mappings | +| `app.volumes` | `bind`, `volume`, or `tmpfs` mounts | +| `app.files` | Generated files under declared bind-mounted host paths | +| `app.environment` | Static `KEY=value` environment entries | +| `app.health_check` | HTTP or TCP health check settings | +| `app.devices` | Explicit device paths | +| `app.metadata` | Catalog-facing presentation metadata such as icon, category, tier, repo/source, author, feature bullets, and launch hints | + +Additional extension keys may exist for current integrations, for example Bitcoin, Lightning, or app-specific launch/interface metadata. Treat extension keys as transitional unless they are documented as reusable platform primitives. + +Use `metadata.launch.open_in_new_tab: true` when the app UI is known to reject iframe embedding with headers such as `X-Frame-Options` or restrictive CSP. The frontend app-session metadata is generated from this flag during release work. ## Security Requirements -These are enforced by the marketplace and the node. Non-compliant apps are flagged. +These are enforced by the marketplace/catalog pipeline and the node. 
Non-compliant apps are flagged. ### Mandatory 1. **No `:latest` tag** — Pin a specific version: `myapp:1.0.0` -2. **Read-only root filesystem** — `readonly_root: true` (use volumes for writable data) -3. **Non-root user** — `run_as_user: 1000` or higher -4. **No privilege escalation** — `no_new_privileges: true` -5. **Minimal capabilities** — Drop all caps, only add required ones +2. **Read-only root filesystem** — `security.readonly_root: true` (use volumes for writable data) +3. **No privilege escalation** — `security.no_new_privileges: true` +4. **Minimal capabilities** — Drop all caps, only add required ones +5. **No host network unless explicitly approved** — keep `security.network_policy` isolated or bridge ### Allowed Capabilities -Only these Linux capabilities may be requested: +The parser currently accepts this allow-list. Keep capability requests minimal; some accepted capabilities still require release review before a public package should depend on them. | Capability | When Needed | |-----------|-------------| | `CHOWN` | App needs to change file ownership | -| `NET_BIND_SERVICE` | App binds to ports below 1024 | | `DAC_OVERRIDE` | App needs to bypass file permissions | -| `SETUID`, `SETGID` | App manages user switching (e.g., nginx) | +| `FOWNER` | App needs ownership-related file operations | +| `NET_ADMIN` | Network administration; requires extra scrutiny | +| `NET_BIND_SERVICE` | App binds to ports below 1024 | +| `NET_RAW` | Raw network sockets; requires extra scrutiny | +| `SETUID`, `SETGID` | App manages user switching | +| `SYS_ADMIN` | Broad administrative capability; avoid for normal apps | ### Forbidden -- `--network host` — Apps cannot share the host network +- Namespace-sharing network modes such as `container:` or `ns:` - Mounting system paths: `/`, `/etc`, `/var`, `/usr`, `/proc`, `/sys` -- `SYS_ADMIN`, `SYS_PTRACE`, or any privileged capability +- `SYS_PTRACE`, privileged containers, Docker socket mounts, or rootful execution - Hardcoded 
secrets in environment variables or images ## Container Best Practices @@ -97,14 +167,26 @@ Only these Linux capabilities may be requested: ```yaml volumes: - - name: data # App data persists across restarts - path: /data - - name: config # Configuration files - path: /config + - type: bind + source: /var/lib/archipelago/my-app + target: /data + options: [rw] ``` Data is stored at `/var/lib/archipelago/{app-id}/` on the host. +Generated files must live under a declared bind-mounted host path: + +```yaml +files: + - path: /var/lib/archipelago/my-app/config.yml + content: | + bind: 0.0.0.0:8080 + overwrite: false +``` + +Use `overwrite: false` for first-run defaults that users or the app may later modify. Use `overwrite: true` only for generated files the platform must own. + ### Health Checks Define a health check endpoint in your container: @@ -130,14 +212,30 @@ dependencies: - bitcoin-knots container: - env: - BITCOIN_RPC_HOST: bitcoin-knots # Container DNS name on archy-net - BITCOIN_RPC_PORT: "8332" + network: archy-net + derived_env: + - key: BITCOIN_RPC_HOST + template: bitcoin-knots + - key: BITCOIN_RPC_PORT + template: "8332" ``` -The `archy-net` Podman network provides DNS resolution between containers. +The `archy-net` Podman network provides DNS resolution between containers. Use `derived_env` for host facts like `HOST_MDNS` instead of hardcoding node-specific URLs. -## Publishing to the Marketplace +## Catalog Generation + +Catalog JSON is generated from manifests during release work. Do not manually edit generated fields in `app-catalog/catalog.json` or `neode-ui/public/catalog.json` when the same value belongs in the manifest. 
+
+Manifest-owned catalog fields currently include:
+
+- app title from `app.name`;
+- version from `app.version`;
+- description from `app.description`;
+- Docker image from `app.container.image`;
+- category from `app.category` or `app.metadata.category`;
+- tier from `app.metadata.tier`;
+- icon from `app.metadata.icon`;
+- repo URL from `app.metadata.repo`, `repoUrl`, or `source`.

 ### 1. Build and Push Your Image

@@ -146,79 +244,24 @@ podman build -t docker.io/myorg/my-app:1.0.0 .
 podman push docker.io/myorg/my-app:1.0.0
 ```

-### 2. Get Your Node's DID
+### 2. Generate Catalogs

 ```bash
-curl -b cookies.txt -X POST http://localhost/rpc/v1 \
-  -d '{"method":"node.did"}'
-# Returns: {"result":{"did":"did:key:z6Mk..."}}
+python3 scripts/generate-app-catalog.py
 ```

-### 3. Publish via RPC
+### 3. Verify Drift

 ```bash
-curl -b cookies.txt -X POST http://localhost/rpc/v1 \
-  -H "Content-Type: application/json" \
-  -d '{
-    "method": "marketplace.publish",
-    "params": {
-      "app_id": "my-app",
-      "name": "My App",
-      "version": "1.0.0",
-      "description": {"short": "A useful tool", "long": "Detailed description..."},
-      "author": {"name": "Dev Name", "did": "did:key:z6Mk...", "nostr_pubkey": ""},
-      "container": {
-        "image": "docker.io/myorg/my-app:1.0.0",
-        "ports": [{"container": 8080, "host": 8180, "protocol": "tcp"}],
-        "volumes": [],
-        "env": {},
-        "capabilities": [],
-        "readonly_root": true,
-        "no_new_privileges": true,
-        "run_as_user": 1000
-      },
-      "category": "other",
-      "icon_url": "",
-      "repo_url": "https://github.com/myorg/my-app",
-      "license": "MIT",
-      "min_archipelago_version": "0.1.0",
-      "dependencies": []
-    }
-  }'
+python3 scripts/check-app-catalog-drift.py --release --strict
 ```

-The manifest is published to all configured Nostr relays as a NIP-78 event (kind 30078).
-
-### 4. Verify Discovery
+Before release, the canonical catalog and UI public catalog should match:

 ```bash
-curl -b cookies.txt -X POST http://localhost/rpc/v1 \
-  -d '{"method":"marketplace.discover"}'
-# Your app should appear in the results
+cmp -s app-catalog/catalog.json neode-ui/public/catalog.json
 ```

-## Trust Model
-
-Published apps receive trust scores (0-100) based on:
-
-| Factor | Points | How to Maximize |
-|--------|--------|-----------------|
-| Valid DID in author | 30 | Always include your node's DID |
-| Found on multiple relays | 5-20 | Configure many relays in your node |
-| Developer in federation | 20 | Have federated peers who trust you |
-| Proper semver version | 10 | Use `major.minor.patch` format |
-| Repository URL present | 5 | Include your repo URL |
-| Security compliance | 15 | Meet all security requirements |
-
-### Trust Tiers
-
-| Score | Tier | User Experience |
-|-------|------|----------------|
-| 80-100 | Verified | One-click install |
-| 50-79 | Community | Install with confirmation |
-| 20-49 | Unverified | Install with warning |
-| 0-19 | Untrusted | Requires explicit override |
-
 ## Testing Your App

 ### Local Testing
@@ -256,18 +299,18 @@ podman logs my-app
 ### Validate Manifest

 ```bash
-curl -b cookies.txt -X POST http://localhost/rpc/v1 \
-  -H "Content-Type: application/json" \
-  -d '{"method":"marketplace.verify","params":{...your manifest...}}'
-# Returns: {"result":{"valid":true,"issues":[],"trust_score":65,"trust_tier":"community"}}
+cargo test --manifest-path core/Cargo.toml -p archipelago-container
+python3 scripts/check-app-catalog-drift.py --release --strict
 ```

 ## Updating Your App

-1. Build and push the new version: `docker.io/myorg/my-app:1.1.0`
-2. Publish an updated manifest with the new version
-3. NIP-33 replaceable events: the latest publish overwrites the previous one on relays
-4. Nodes running your app can see the update in their marketplace
+1. Build and push the new version: `docker.io/myorg/my-app:1.1.0`.
+2. Update `app.version` and `app.container.image` or `app.container.build.tag`.
+3. Run catalog generation and drift checks.
+4. Validate install/start/stop/restart/uninstall/reinstall behavior before shipping.
+
+The broader app update policy for `1.8-alpha` is still being finalized. Until that policy is locked, app manifests should be explicit and pinned so update detection compares concrete image/tag metadata rather than mutable tags.

 ## App Icon

@@ -275,3 +318,22 @@ curl -b cookies.txt -X POST http://localhost/rpc/v1 \
 - Recommended size: 256x256 pixels
 - Square aspect ratio
 - If no icon URL, a generic placeholder is shown in the marketplace
+
+## Release Validation Expectations
+
+Every supported app must satisfy the lifecycle contract:
+
+- install
+- launch
+- stop
+- start
+- restart
+- uninstall while preserving data
+- reinstall with preserved data
+- report truthful health/status
+- survive backend restart
+- survive host reboot
+
+For apps with special dependencies, launch must explain dependency wait states instead of showing a dead iframe. Examples include Bitcoin sync/IBD, Lightning wallet readiness, Nostr signer bridge injection, Tailscale login/auth, and app-specific setup screens.
+
+Runtime changes should be validated with focused tests first, then the release lifecycle harness on the validation host when host access is intentionally resumed.
diff --git a/docs/bitcoin-rpc-relay.md b/docs/bitcoin-rpc-relay.md
new file mode 100644
index 00000000..a71fc371
--- /dev/null
+++ b/docs/bitcoin-rpc-relay.md
@@ -0,0 +1,280 @@
+# Bitcoin RPC Relay for External Wallets
+
+This note captures the pattern used to let an external wallet, such as Wasabi,
+use an Archipelago Bitcoin node for transaction relay without exposing the
+node's admin RPC credentials.
+
+## Goal
+
+Expose a public HTTPS JSON-RPC endpoint that can broadcast transactions and read
+basic chain/mempool state, while preventing wallet and admin RPC access.
+
+The endpoint should be fronted by nginx or another TLS reverse proxy:
+
+```text
+wallet client -> https:/// -> reverse proxy -> Archipelago node nginx -> bitcoind RPC
+```
+
+Do not expose Bitcoin RPC credentials with wallet/admin access to external
+users.
+
+## Restricted RPC User
+
+Create a separate RPC user, currently named `txrelay`, with an `rpcauth` secret
+and a Bitcoin RPC whitelist.
+
+Allowed RPC methods:
+
+```text
+sendrawtransaction
+testmempoolaccept
+getmempoolinfo
+getrawmempool
+getmempoolentry
+getnetworkinfo
+getblockchaininfo
+getblockcount
+getblockhash
+getblockheader
+getrawtransaction
+decoderawtransaction
+decodescript
+estimatesmartfee
+```
+
+Wallet/admin access is denied by setting `-rpcwhitelistdefault=0` and giving the
+`txrelay` user only the method whitelist above.
+
+Secrets live under:
+
+```text
+/var/lib/archipelago/secrets/bitcoin-rpc-txrelay-password
+/var/lib/archipelago/secrets/bitcoin-rpc-txrelay-rpcauth
+/var/lib/archipelago/secrets/bitcoin-rpc-txrelay-client.env
+```
+
+Do not commit these files or paste them into docs.
+
+## Archipelago UI/API Flow
+
+The productized flow is managed from the Bitcoin Core/Knots custom UI in the
+`Transaction Relay Sharing` panel.
+
+Implemented RPC methods:
+
+```text
+bitcoin.relay-status
+bitcoin.relay-update-settings
+bitcoin.relay-request-peer
+bitcoin.relay-approve-request
+bitcoin.relay-reject-request
+bitcoin.relay-create-tor-service
+```
+
+When peer sharing is enabled, `bitcoin.relay-update-settings` automatically
+provisions the restricted `txrelay` password, `rpcauth`, and client env file if
+they do not already exist. If those files were just generated, restart Bitcoin
+Core/Knots so `bitcoind` reloads the `txrelay` `rpcauth` and whitelist flags.
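Editorial sketch, not part of the patch: the `rpcauth` secret mentioned above is a salted HMAC-SHA256 of the password. The Python below assumes the provisioning step follows the same scheme as Bitcoin Core's `share/rpcauth/rpcauth.py` helper; the function name `make_rpcauth` is hypothetical, and Archipelago's actual provisioning code is not shown in this patch.

```python
import hmac
import secrets
from typing import Optional


def make_rpcauth(username: str, password: str, salt: Optional[str] = None) -> str:
    """Build the value for bitcoind's -rpcauth option: user:salt$hmac_sha256.

    Hypothetical helper; assumes the salted HMAC-SHA256 scheme used by
    Bitcoin Core's rpcauth.py (salt is the HMAC key, password is the message).
    """
    # 16 random bytes rendered as 32 hex characters when no salt is supplied.
    salt = salt or secrets.token_hex(16)
    digest = hmac.new(salt.encode("utf-8"), password.encode("utf-8"), "sha256").hexdigest()
    return f"{username}:{salt}${digest}"


if __name__ == "__main__":
    # Only the derived line goes into the rpcauth secret file; the plaintext
    # password stays in the separate password/client env files.
    print(make_rpcauth("txrelay", secrets.token_urlsafe(32)))
```

Because only the salt and digest reach `bitcoind`, the `rpcauth` file can be regenerated from a new password without the daemon ever storing the plaintext.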
+
+The UI shows:
+
+```text
+HTTP / HTTPS / Tor relay endpoint settings
+local sync status
+restricted credential readiness, without printing the password
+trusted peer dropdown, disabled until the local node is synchronized
+incoming relay requests with approve/reject actions
+outbound relay requests and approval status
+```
+
+Approving an incoming peer request sends the selected endpoint plus restricted
+`txrelay` credentials through the existing encrypted peer-message path. On the
+requesting node, approved peer credentials are stored in a per-peer secret env
+file:
+
+```text
+/var/lib/archipelago/secrets/bitcoin-relay-peer-.env
+```
+
+The UI returns the credential secret path and approved endpoint metadata, but it
+does not display the raw password.
+
+For dev review, the mock server exposes the Bitcoin UI at:
+
+```text
+http://localhost:8102/app/bitcoin-ui/
+```
+
+## Bitcoin Startup Flags
+
+The Bitcoin Knots app should add the restricted user only when the secret exists:
+
+```sh
+RPC_TXRELAY_AUTH="$(printenv BITCOIN_RPC_TXRELAY_RPCAUTH || true)"
+RPC_TXRELAY_FLAGS="-rpcwhitelistdefault=0"
+if [ -n "$RPC_TXRELAY_AUTH" ]; then
+  RPC_TXRELAY_FLAGS="$RPC_TXRELAY_FLAGS -rpcauth=$RPC_TXRELAY_AUTH -rpcwhitelist=txrelay:sendrawtransaction,testmempoolaccept,getmempoolinfo,getrawmempool,getmempoolentry,getnetworkinfo,getblockchaininfo,getblockcount,getblockhash,getblockheader,getrawtransaction,decoderawtransaction,decodescript,estimatesmartfee"
+fi
+```
+
+Then include `$RPC_TXRELAY_FLAGS` in the `bitcoind` command. Keep the local
+`archipelago` RPC user unrestricted for internal services by using
+`-rpcwhitelistdefault=0` and only setting a whitelist for `txrelay`.
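Editorial sketch, not part of the patch: the same flag assembly expressed in Python, which may help when reviewing or unit-testing the logic outside a shell. The names `TXRELAY_METHODS` and `txrelay_flags` are hypothetical; the method list is copied from the `-rpcwhitelist` value in the shell snippet.

```python
from typing import List, Optional

# Method allow-list copied from the -rpcwhitelist value in the shell snippet.
TXRELAY_METHODS = [
    "sendrawtransaction", "testmempoolaccept", "getmempoolinfo",
    "getrawmempool", "getmempoolentry", "getnetworkinfo",
    "getblockchaininfo", "getblockcount", "getblockhash",
    "getblockheader", "getrawtransaction", "decoderawtransaction",
    "decodescript", "estimatesmartfee",
]


def txrelay_flags(rpcauth: Optional[str]) -> List[str]:
    """Return extra bitcoind flags; the restricted user is added only if a secret exists."""
    # Always present: users without an explicit whitelist stay unrestricted.
    flags = ["-rpcwhitelistdefault=0"]
    if rpcauth:
        flags.append(f"-rpcauth={rpcauth}")
        flags.append("-rpcwhitelist=txrelay:" + ",".join(TXRELAY_METHODS))
    return flags


if __name__ == "__main__":
    print(" ".join(txrelay_flags("txrelay:salt$hash")))
```

Returning a list rather than one space-joined string avoids word-splitting surprises if the flags are later passed to a process launcher as separate arguments.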
+
+The current implementation touches:
+
+```text
+apps/bitcoin-knots/manifest.yml
+scripts/container-specs.sh
+```
+
+## Node nginx
+
+The Archipelago node can expose a host-based nginx vhost that proxies to local
+Bitcoin RPC:
+
+```nginx
+limit_req_zone $binary_remote_addr zone=bitcoin_rpc_ext:10m rate=5r/s;
+
+server {
+    listen 80;
+    server_name rpc.example.com;
+
+    client_max_body_size 2m;
+
+    location / {
+        limit_req zone=bitcoin_rpc_ext burst=20 nodelay;
+        limit_req_status 429;
+
+        proxy_pass http://127.0.0.1:8332;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+        proxy_connect_timeout 5s;
+        proxy_send_timeout 120s;
+        proxy_read_timeout 120s;
+        proxy_buffering off;
+    }
+}
+```
+
+If another public reverse proxy terminates TLS, point it at:
+
+```text
+http://:80
+```
+
+For the tested node the LAN upstream was:
+
+```text
+http://192.168.1.116:80
+```
+
+The public proxy should serve a valid TLS certificate for the chosen subdomain.
+
+## DNS and Routing
+
+Use a subdomain that resolves to the public reverse proxy:
+
+```text
+Type: A
+Host/Name:
+Value:
+```
+
+For example, if the desired hostname is `rpc.example.com`, the DNS host/name
+field is usually only `rpc`, not the full `rpc.example.com`. Entering the full
+hostname in some DNS panels can accidentally create:
+
+```text
+rpc.example.com.example.com
+```
+
+The public proxy should forward:
+
+```text
+TCP 443 -> TLS reverse proxy for the subdomain
+TCP 80 -> optional, needed for HTTP-01 certificate issuance or redirects
+```
+
+If the public proxy is separate from the Archipelago node, configure it with:
+
+```text
+server_name:
+scheme: http
+upstream host:
+upstream port: 80
+```
+
+## Verification
+
+Check authoritative DNS:
+
+```sh
+dig @ A +noall +answer +authority
+dig @1.1.1.1 +short A
+```
+
+Check TLS:
+
+```sh
+openssl s_client -connect :443 -servername "
+```
+
+Check that transaction broadcast reaches Bitcoin RPC, without needing a real
+transaction:
+
+```sh
+curl -sS --user "$BITCOIN_RPC_TXRELAY_USER:$BITCOIN_RPC_TXRELAY_PASSWORD" \
+  --data-binary '{"jsonrpc":"1.0","id":"badtx","method":"sendrawtransaction","params":["00"]}' \
+  ""
+```
+
+Expected result is a Bitcoin RPC validation error such as `TX decode failed`,
+which confirms the request reached `sendrawtransaction`.
+
+Check that wallet/admin RPC is blocked:
+
+```sh
+curl -sS -o /tmp/txrelay-deny.json -w '%{http_code}\n' \
+  --user "$BITCOIN_RPC_TXRELAY_USER:$BITCOIN_RPC_TXRELAY_PASSWORD" \
+  --data-binary '{"jsonrpc":"1.0","id":"deny","method":"listwallets","params":[]}' \
+  ""
+```
+
+Expected result:
+
+```text
+403
+```
+
+## Tested Outcome
+
+The working endpoint used in this setup was:
+
+```text
+https://shard.tx1138.com/
+```
+
+It was verified with:
+
+```text
+DNS resolves
+TLS certificate is valid
+txrelay credentials authenticate
+getblockchaininfo returns chain=main
+sendrawtransaction reaches Bitcoin RPC
+listwallets is blocked for txrelay
+```
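Editorial sketch, not part of the patch: the two `curl` checks in the Verification section could be scripted by building the same JSON-RPC payloads and classifying the responses. The helper names are hypothetical, no network call is made here, and the response body shape is an assumption based on the `TX decode failed` message quoted in the note.

```python
import json
from typing import List, Optional


def rpc_payload(method: str, params: Optional[List] = None, req_id: str = "check") -> str:
    """Serialize a JSON-RPC 1.0 request body like the curl --data-binary examples."""
    return json.dumps(
        {"jsonrpc": "1.0", "id": req_id, "method": method, "params": params or []}
    )


def broadcast_reached_rpc(body: str) -> bool:
    """True when the body is a JSON-RPC error produced by transaction decoding."""
    try:
        error = json.loads(body).get("error") or {}
        return "TX decode failed" in str(error.get("message", ""))
    except (ValueError, AttributeError):
        # Not JSON, or not the object shape assumed here.
        return False


def wallet_rpc_blocked(http_status: int) -> bool:
    """True when a non-whitelisted method was rejected with the expected 403."""
    return http_status == 403


if __name__ == "__main__":
    print(rpc_payload("sendrawtransaction", ["00"], "badtx"))
    print(rpc_payload("listwallets", [], "deny"))
```

Keeping the classification separate from the HTTP call lets the same checks run against saved responses, such as the `/tmp/txrelay-deny.json` file written by the second `curl` command.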