diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 00000000..7fc60f8f
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,46 @@
+# Archipelago — agent guide
+
+## 🚩 TOP PRIORITY (until production testing passes)
+
+**Read `docs/PRODUCTION-MASTER-PLAN.md` first.** It is the authoritative plan and
+overrides ad-hoc direction until the production test gate is green. Goal: a
+world-class, **developer-ready app platform** where every app is manifest-driven,
+manifests ship via the **signed registry** (not OTA disk files), and **third-party
+developers publish apps via an external/decentralized registry** — all rootless,
+secure, robust, and 100%-uptime-capable.
+
+Detailed sub-plans (all linked from the master):
+- App platform / packaging phases + security model → `docs/APP-PACKAGING-MIGRATION-PLAN.md`
+- Registry-distributed manifests (in progress) → `docs/registry-manifest-design.md`
+- External/decentralized marketplace for devs → `docs/marketplace-protocol.md`
+- Current per-app state → `docs/app-registry-status-2026-06-21.md`
+- Production test gate (exit criterion) → `tests/lifecycle/TESTING.md`
+
+## Invariants (never violate)
+
+- **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged
+  containers unless explicitly approved.
+- **No per-app Rust installers / no OS-level reliance.** Apps are declarative;
+  the orchestrator owns the lifecycle. `install_immich_stack` (hardcoded
+  `podman run` + `sudo chown`) is the anti-pattern being deleted, not a template.
+- **Secrets are manifest-declared** (`generated_secrets`, materialised by
+  `container::secrets`, 0600/rootless) — never hardcoded, per-app, or logged.
+- **Migrations never destroy data** — preserve `/var/lib/archipelago/`,
+  secrets, credentials, ports, and adoption container names; keep a rollback path.
+- **Verify on a real node (.228, then .198) before any tag.**
+
+## Build / verify
+
+- Rust workspace root is `core/` (no Cargo.toml at repo root). `cargo` from `core/`.
+- If a `cargo test`/build hits `rust-lld: undefined hidden symbol`, it's
+  incremental-cache corruption — rebuild with `CARGO_INCREMENTAL=0`.
+- Frontend: `neode-ui/` → `npm run build` outputs to `web/dist/neode-ui/`.
+  Grep the built bundle for new strings before shipping (build can silently no-op).
+- App manifests load from disk on nodes at `/opt/archipelago/apps/*/manifest.yml`
+  (today); the goal is to distribute them via the signed catalog instead.
+
+## Production test gate (definition of done)
+
+`tests/lifecycle/run-20x.sh` green across install / UI / stop / start / restart /
+reinstall / reboot-survive / archipelago-restart-survive / uninstall — **20× on
+.228 AND .198**. Until green, the master plan is the priority.
diff --git a/docs/1.8-alpha-improvements-tracker.md b/docs/1.8-alpha-improvements-tracker.md
deleted file mode 100644
index dcfc6302..00000000
--- a/docs/1.8-alpha-improvements-tracker.md
+++ /dev/null
@@ -1,231 +0,0 @@
-# 1.8-alpha Improvements Tracker
-
-Last updated: 2026-06-12 01:15 EDT
-
-This tracks the user-facing improvement list that must land with the `1.8-alpha`
-container migration release and the next ISO cut produced from that release. It
-is intentionally separate from the container handoff docs, but should be treated
-as release and ISO smoke-test scope.
-
-Status legend:
-
-- `todo`: not started.
-- `in-progress`: active local work or validation.
-- `blocked`: needs host access, hardware, credentials, a product decision, or an
-  external artifact.
-- `done`: implemented and validated for this release.
-- `defer?`: candidate to explicitly defer from `1.8-alpha` after product review.
-
-Resume protocol:
-
-1. Read this file after `docs/NEXT_TERMINAL_HANDOFF.md`.
-2. Keep every user-requested improvement represented here until it is either
-   `done` or explicitly moved out of `1.8-alpha` by product decision.
-3. When implementation starts, change status to `in-progress` and add the file,
-   test, host, or design decision being worked.
-4. Mark `done` only after the change is implemented and validated locally or on
-   the release validation host, as appropriate.
-5. Before cutting the next ISO, run this checklist as part of ISO smoke testing.
-
-Active-session note, 2026-06-10 05:48 EDT: resumed from
-`docs/NEXT_TERMINAL_HANDOFF.md`; no `.198` host actions have been run yet. The
-immediate tracker-affecting local gate is rerunning the focused Rust
-`container::image_versions::tests` validation for the Nextcloud false-update
-row, then continuing lifecycle/control-plane truthfulness work.
-
-Resume-save checkpoint, 2026-06-10 08:32 EDT: the current pass stayed on the
-fixes backlog, not app migration. No `.198` host actions were run, no dev server
-was intentionally left running, and no long-running validation command is
-expected to still be active. Continue from the in-progress `Make tabs info load
-quickly or show loading states` row or the next unresolved fixes-backlog row.
-
-Active-session progress: `git diff --check` passed. Focused image-version Rust
-validation is still inconclusive because the tool PTY stayed open with no
-active compiler process visible, a bounded 300s retry using the normal
-workspace target exited `124` before test output, and a fresh 600s retry in
-`/tmp/archy-cargo-image-versions-2` also exited `124` after compiling into the
-`archipelago` crate without reaching test output. The Nextcloud false-update
-row remains `in-progress`. A local lifecycle fix is in progress so migrated
-single-orchestrator app stops return immediately with a transitional state
-instead of blocking the UI while Podman cleanup runs; `cargo fmt --check` and
-focused backend compile check passed, and `git diff --check` is clean. Latest
-credentials backlog follow-up added backend PhotoPrism credentials, centered
-the mobile credential pre-launch modal in My Apps and the icon grid, and passed
-focused frontend tests, type-check, backend compile check, `cargo fmt --check`,
-and `git diff --check`. Web5 Connected Nodes Messages/Requests, Web5
-Identities, and DWN message browsing now preserve visible content during
-refresh/failure and show compact refresh labels instead of replacing populated
-tabs with loading panels; focused tests and type-check passed. Server Network
-overview, Network Interfaces, and Tor Services cards now keep visible values
-during refresh or refresh failure and show compact refresh labels instead of
-reverting to skeletons or false empty states; focused test and type-check
-passed. The standalone Credentials view now keeps credential rows visible
-during refresh/failure and shows `Refreshing credentials...`; focused test and
-type-check passed. Lightning Channels now keeps existing channels visible
-during refresh/failure and shows `Refreshing channels...`; focused test and
-type-check passed. Peer Files now keeps existing peer catalog items visible
-during Tor refresh/failure and shows `Refreshing peer files...`; focused test,
-type-check, and `git diff --check` passed. Cloud peer cards now remain visible
-during federation peer-list refresh/failure with `Refreshing peer nodes...`;
-focused test, type-check, and `git diff --check` passed. The Web5 Verifiable
-Credentials summary now keeps credential rows visible during refresh/failure
-with `Refreshing credentials...`; focused test, type-check, and
-`git diff --check` passed. Web5 Nostr Relays now keeps relay stats visible
-during refresh/failure with `Refreshing relays...`; focused test, type-check,
-and `git diff --check` passed. Web5 Domains now keeps registered-name counts
-visible during refresh/failure with `Refreshing domains...`; focused test,
-type-check, and `git diff --check` passed. Settings Backups now keeps existing
-backup rows visible during refresh/failure with `Refreshing backups...`;
-focused test, type-check, and `git diff --check` passed. Settings Transport
-Preferences now keeps preference controls visible during refresh/failure with
-`Refreshing transport preferences...`; focused test, type-check, and
-`git diff --check` passed. Settings VPN status now keeps current connection
-details visible during refresh/failure with `Refreshing VPN status...`;
-focused test, type-check, and `git diff --check` passed. Web5 Federation now
-shows `Refreshing federation...` during summary refresh and keeps existing node
-counts/DID visible on refresh failure; focused test, type-check, and
-`git diff --check` passed. Mesh map denied-location behavior now has component
-coverage proving browser location denial reports that peer positions can still
-appear without requiring local location; focused test, type-check, and
-`git diff --check` passed. Companion/app-session mobile tab-app handling now
-keeps apps that require a new tab inside the mobile session fallback instead of
-auto-opening an external tab and closing; focused app-session, launcher, and
-config tests passed with type-check and `git diff --check`.
-Nostr Discoverable Nodes now keeps discovered rows visible during relay refresh
-or relay failure and shows `Searching relays...`; focused test, type-check, and
-`git diff --check` passed. App Store/App Details screenshot sections now render
-only real screenshot metadata and no longer show fake placeholder tiles when no
-assets exist; focused App Details content and marketplace handoff tests,
-type-check, and `git diff --check` passed. Home now has an App Store
-recommendations card driven by uninstalled core/recommended marketplace apps;
-the recommendations respect installed aliases so apps drop out after install
-and move into normal My Apps/Home behavior. Focused helper tests, type-check,
-`git diff --check`, and the Playwright Home dashboard smoke passed. Easy Mode
-goal configure steps now route to their owning app/screen, verify steps have an
-explicit `Check & Continue` action, and configure/info/verify actions start
-goal progress before completing the step; focused goal action/store tests,
-type-check, and `git diff --check` passed. Setup path selection no longer shows
-the disabled `Connect Existing (Coming Soon)` option; Fresh Start and Restore
-from Seed are the only visible choices and route correctly. Focused onboarding
-option/composable tests, type-check, and `git diff --check` passed. Header
-responsiveness follow-up restored the primary My Apps/App Store/Websites
-navigation to persistent desktop tabs at `md+` on My Apps, Discover, and
-Marketplace; removed the desktop primary dropdowns; kept mobile dropdown
-behavior; delayed App Store category collapse by lowering the search reserve and
-header gap; and removed the My Apps desktop category dropdown. Focused
-Marketplace/App config tests, type-check, and scoped `git diff --check` passed.
-Browser smoke against the already-running local Vite/mock session is still next.
-
-Active-session update, 2026-06-12 01:15 EDT: system update UX hardening landed
-locally. `load_state()` now clears stale `update_in_progress` when no staged OTA
-files exist, so failed legacy update attempts cannot leave the update screen
-permanently stuck. Direct `update.git-apply` is gated behind
-`ARCHIPELAGO_GIT_UPDATES`, preventing production nodes from accidentally entering
-the local git/self-build path that requires `cargo`. `.116` was recovered from a
-failed self-build attempt by applying its already-staged manifest OTA; it is now
-on `1.7.84-alpha`, backend health is OK, nginx is active/config-valid, HTTP UI
-returns `200`, `update_in_progress=false`, and staging was removed. Validation:
-`cargo fmt --check`, `cargo check -p archipelago`, and scoped `git diff --check`
-passed; focused `cargo test` was blocked by a local `rust-lld` undefined hidden
-symbol linker failure unrelated to the updater patch.
-
-Done criteria for this tracker:
-
-- Code/UI items: implemented, covered by targeted test or manual smoke check,
-  and no known regression against the container migration work.
-- Runtime/container items: validated on the release host named in
-  `docs/NEXT_TERMINAL_HANDOFF.md`, then included in ISO smoke test scope.
-- Product-decision items: documented decision plus implementation task if the
-  decision keeps it in `1.8-alpha`.
-- External/hardware items: hardware/document/access obtained, or explicitly
-  deferred from the release by product decision.
-
-## Release-Critical Runtime Gates
-
-| Item | Status | Release question / blocker |
-| --- | --- | --- |
-| Check logs of every server for errors and fix | blocked | Needs explicit target server list. Current docs name `.198`; are there more production validation hosts? |
-| Go through issues on gate | blocked | Need location of "gate" issue tracker/board and access details. |
-| Sort out container tagging so databases, backend, etc are sorted properly | in-progress | Tie to manifest/catalog metadata and My Apps grouping. |
-| Sort out supplementary container naming so it is better | in-progress | Needs naming convention for dependencies: app-prefixed service names vs role-first names. |
-| Figure out how we offer updates to apps | todo | Product/runtime design needed: manual update, scheduled checks, or auto-update by app tier. |
-| Figure out how we provide different versions for Bitcoin to download and keep updated automatically | todo | Requires release policy for Knots/Core versions and whether users may pin old versions. |
-| Make sure all credentials are given for apps without registration | in-progress | File Browser now exposes credentials on App Details and in the pre-launch interstitial. Backend `package.credentials` returns the secured File Browser password from `/var/lib/archipelago/secrets/filebrowser/password` when present, with `admin/admin` fallback matching the install hook. PhotoPrism now exposes manifest-backed `admin` / `archipelago` credentials from both backend `package.credentials` and the frontend fallback. My Apps and mobile icon-grid credential pre-launch modals are vertically centered on mobile. Covered by `appCredentials.test.ts`, `AppIconGrid.test.ts`, local type-check, backend compile check, `cargo fmt --check`, and `git diff --check`. Grafana was not added because `GRAFANA_ADMIN_PASSWORD` is not resolved to a known repo default/secret. Remaining no-registration apps still need inventory. |
-| Nextcloud always shows update, and how are apps actually updated? | in-progress | Nextcloud manifest/catalog metadata is aligned to the pinned `nextcloud:29` image, and update detection now ignores registry-host-only image changes while still reporting real same-repo tag drift. Catalog drift check passed. Backend focused test was added but local validation hit a Rust linker/incremental artifact failure, then bounded retries exited `124` before test output, including a 600s fresh-target retry on 2026-06-10. Broader app update UX/policy design still needed. |
-| Make sure Tor is solid as having to rotate addresses to get it to work | todo | Needs `.198`/target-host Tor logs and reproducible failure case. |
-| Fix fleet it does not seem to work | done | Fleet data now preserves existing nodes during refresh, exposes an explicit refreshing state, sorts online nodes first, avoids duplicate history fetches when selecting a node, accepts backend `entries` and legacy `history` response shapes for per-node charts, and uses readable loading/auto-refresh UI. Covered by `useFleetData.test.ts`, local type-check, targeted tests, and user visual review of the Fleet header/card treatment. |
-| Check Beta Telemetry and how it works | done | Telemetry is opt-in via `analytics-config.json`; the background reporter runs every 15 minutes only when enabled, saves `telemetry-latest.json`, writes local Fleet reports/history under `telemetry-fleet/`, and optionally POSTs a `telemetry.ingest` JSON-RPC envelope to `TELEMETRY_COLLECTOR_URL`. The systemd unit now reads optional `/var/lib/archipelago/telemetry.env`, and deploys write that file when `TELEMETRY_COLLECTOR_URL` is exported in `scripts/deploy-config.sh`. Manual and periodic report schemas now both include metric percentages and container inventory, and the Fleet UI normalizes older reports with missing fields. Covered by local type-check, `useFleetData.test.ts`, `cargo check -p archipelago`, deploy-script syntax check, and `git diff --check`. Remaining ops step: choose the real collector URL, deploy it, restart the service, and confirm central Fleet ingest. |
-| Get Netbird working | todo | Requires app/runtime validation and credentials/config expectations. |
-| Sort out how we are going to manage lightning channel creation | todo | Product design needed for UX, safety limits, fees, and peer selection. |
-| Make sure old health notifications do not return on refresh/new login when stale/out of date | done | Health toasts now require a current app-linked unhealthy package state and hide stale package health notifications after 30 minutes on reload/new login. Backend monitoring notifications now prune duplicate active alerts and old generic alerts before pushing new ones. Covered by `HealthNotifications.test.ts`, local type-check, targeted frontend tests, and backend notification unit test work. |
-| Fix BTCPay issue from desktop file "BTCPay Issues" | blocked | Need file contents or path to that desktop artifact. |
-| Check Nostr Discoverable Nodes and get it working correctly | in-progress | Discover modal now keeps discovered rows visible during relay refresh/failure and shows `Searching relays...` instead of dropping to an empty state. Covered by `DiscoverModal.test.ts`, local type-check, and `git diff --check`. Needs live relay/trust validation before marking done. |
-| Make sure update password is working properly | done | Backend now returns separate SSH update status so a successful web password change is not reported as a full failure when optional SSH password update fails. Settings modal shows success plus SSH warning and stays open for review. Covered by local type-check, focused modal/RPC tests, auth unit test, `cargo check -p archipelago`, and `git diff --check`. |
-| Prevent System Update screen from getting permanently stuck | done | Update state loading now reconciles `update_in_progress` with the actual manifest OTA staging directory and clears stale stuck state when no staged files exist. Direct git/self-build apply is disabled unless `ARCHIPELAGO_GIT_UPDATES` is explicitly set, so production nodes cannot fall into the old `self-update.sh` path that requires local `cargo`. `.116` was recovered by applying its valid staged manifest OTA and verified on `1.7.84-alpha` with backend health OK, nginx active/config-valid, HTTP UI `200`, `update_in_progress=false`, and staging removed. Validated locally with `cargo fmt --check`, `cargo check -p archipelago`, and scoped `git diff --check`; focused `cargo test` was blocked by a local `rust-lld` linker artifact failure unrelated to the updater patch. |
-| Do UI performance and general performance improvements | todo | Needs profiling target; start with obvious loading/render issues. |
-| Make sure companion app is all working well, had issues with tab apps | in-progress | Mobile app-session now keeps apps that require a new tab inside the session fallback instead of auto-opening an external tab and closing immediately. Covered by `AppSessionMobileNewTab.test.ts`, existing app-session config tests, app launcher tests, local type-check, and `git diff --check`. Broader companion smoke test still needed before marking done. |
-| Even though performance is better, on reboot/restart backend/update show checking-containers notification instead of no apps | done | My Apps now shows a dedicated `Checking containers` card when initial backend data has loaded but `server-info.status-info.containers-scanned` is still false and no apps are ready to render, instead of falling through to the no-apps empty state. A follow-up UI pass preserves the last known app list when a later scanner/backoff update reports an empty package map with `containers-scanned=false`, and shows a refresh status banner above the grid. Validated by local type-check, targeted tests, and `git diff --check`; follow-up validation passed `npm test -- --run src/views/apps/__tests__/appPackageCache.test.ts` and `npm run type-check`. |
-| Check mesh core is picking up public channel/other devices, not just Archipelago ones | blocked | Needs Meshtastic hardware/radio environment. |
-| Make tabs info load quickly or show loading states | in-progress | Fleet now has initial loading/background-refresh states, and node history keeps showing while the next sample is fetched instead of blanking out. Web5 Connected Nodes Trusted/Observers tabs now show loading instead of empty states while peer data is pending and keep existing lists visible during refresh; Messages and Requests now also keep populated lists visible during refresh/failure. Web5 Shared Content now keeps My Content visible during refresh/failure with `Refreshing shared content...`, and Browse Peers keeps current same-peer results visible during refresh with `Refreshing peer content...` instead of replacing lists with full loading panels. Web5 Identities now keeps the identity list visible during refresh/failure with `Refreshing identities...`; Web5 DWN message browsing keeps stored messages visible during refresh/failure with `Refreshing messages...`. The Web5 Verifiable Credentials summary keeps credential rows visible during refresh/failure with `Refreshing credentials...`. Web5 Nostr Relays keeps relay stats visible during refresh/failure with `Refreshing relays...`. Web5 Domains keeps registered-name counts visible during refresh/failure with `Refreshing domains...`. Web5 Federation keeps summary node counts/DID visible during refresh/failure with `Refreshing federation...`. Server Network overview, Network Interfaces, and Tor Services cards now keep visible values during refresh/failure with `Refreshing network...`, `Refreshing interfaces...`, and `Refreshing Tor services...`. Credentials keeps credential rows visible during refresh/failure with `Refreshing credentials...`. Settings Backups keeps backup rows visible during refresh/failure with `Refreshing backups...`. Settings Transport Preferences keeps preference controls visible during refresh/failure with `Refreshing transport preferences...`. Settings VPN status keeps current connection details visible during refresh/failure with `Refreshing VPN status...`. Lightning Channels keeps existing channels visible during refresh/failure with `Refreshing channels...`. Peer Files keeps existing peer catalog items visible during Tor refresh/failure with `Refreshing peer files...`. Cloud keeps existing peer cards visible during federation peer-list refresh/failure with `Refreshing peer nodes...`. Covered by focused Web5/Server/Credentials/Backups/Transport/VPN/Lightning/Peer Files/Cloud tests and local type-check. Broader tab-info audit still needed for other slow panels before marking done. |
-| Add states about why Bitcoin address is not ready | in-progress | Receive Bitcoin on-chain flows now reject blank LND address responses and translate common LND/Bitcoin readiness failures into user-facing reasons: wallet locked, wallet uninitialized, Bitcoin/LND still syncing, LND unreachable, or LND REST/newaddress transport issues. The receive modals now show a live “checking wallet readiness” message while the request is in flight. Backend `lnd.newaddress` now errors if LND returns an error or no address. Needs live wallet-state smoke test before marking done. |
-| Add new Bitcoin wallets easily and securely | todo | Product/security design needed. |
-| Add the new gate instead of gate | blocked | Need definition of "new gate" and target integration. |
-| Local Nostr signer app should ask which account after logout/re-login | todo | Needs signer/session state validation. |
-| See what apps can migrate to local Nostr signer sign-in | todo | Needs app-by-app auth inventory. |
-| Make server name change change the host name | in-progress | Settings label changed to `Hostname`. `server.set-name` now persists the display name, derives a Linux-safe hostname slug, attempts `sudo -n hostnamectl set-hostname`, and returns non-fatal hostname warning fields if OS update fails. Covered by hostname slug unit test, local type-check, `cargo check -p archipelago`, and `git diff --check`. Impact audit: mDNS/SSH/Tailscale labels may change; already-created app configs using old `HOST_MDNS` (notably Fedimint derived env) are not automatically rewritten by hostnamectl, so this needs release-host smoke validation before marking done. |
-| Sort out HTTPS certificate, what is best way? | todo | Needs product decision: self-signed local CA, ACME DNS, Tailscale certs, or reverse proxy model. |
-
-## User Interface And App Experience
-
-| Item | Status | Release question / blocker |
-| --- | --- | --- |
-| LND Channels then back/back gets stuck between LND detail and channels | done | App Details back now routes explicitly to the parent surface, and Lightning Channels back replaces history so browser back no longer bounces between LND detail and Channels. Validated by local type-check and targeted tests. |
-| Add a Meshtastic icon | done | Added `meshcore.svg` asset and manifest-owned icon metadata. Catalog generation is idempotent and strict catalog drift is clean. |
-| Improve default app icon fallback | done | Missing/broken app icons now fall back to the centered Archipelago `A` mark using the same black fill and gradient-border treatment as the custom UI icon asset, instead of the old generic placeholder. Applied to My Apps cards, mobile icons, Marketplace cards, and App Details. Validated by local type-check, targeted tests, Rust check, and `git diff --check`. |
-| Use favicon for Portainer apps? | todo | Need decision: use upstream favicons dynamically or ship curated icons. |
-| Settings for apps | blocked | Needs definition: per-app config screen, runtime env vars, credentials, or install options? |
-| Update SearXNG app icon | blocked | Needs user-provided/approved icon asset. User said to move past this until they can make icons. |
-| Once an app is installed remove recommended/core pills | done | Marketplace cards hide tier badges when installed. Validated by `MarketplaceAppCard.test.ts`, targeted Vitest, type-check, and `git diff --check`. |
-| Get Bitcoin / LND UI fully done with all options and controls | todo | Large feature area; needs scope for `1.8-alpha` vs post-release. |
-| Fix intro always showing on new browser sessions | done | Splash gating now checks the backend onboarding-complete state before showing the intro when this browser has no local intro flag. Already-onboarded nodes skip the splash and seed `neode_intro_seen`; fresh installs still show it. Covered by `introSplash.test.ts`, local type-check, targeted tests, and `git diff --check`. |
-| Fix App Store tabs/categories/search overflow | done | Discover/App Store and Marketplace render one shared App Store section list. Follow-up after user review restored the primary My Apps/App Store/Websites navigation to persistent desktop tabs at `md+` on My Apps, Discover, and Marketplace; mobile keeps dropdown behavior. App Store category collapse now happens later by starting uncollapsed and using a smaller header gap/search reserve, and the My Apps category dropdown no longer appears on desktop. Covered by local type-check, focused Marketplace/App config tests, and scoped `git diff --check`; browser smoke remains the next resume step. |
-| Add a test harness for all of the application | in-progress | Lifecycle harness exists; need expand UI/e2e coverage definition. |
-| Fix app details screen links | done | App Details sidebar no longer renders dead `href="#"` links. It now renders only real manifest website/marketing, upstream/wrapper repo, and support URLs, and hides the Links card when no usable URLs exist. Covered by `AppSidebar.test.ts`, local type-check, targeted tests, and `git diff --check`. |
-| Fix FIPS anchoring, update FIPS | todo | Needs expected FIPS UX/API behavior. |
-| Fix generate receive address not working on nodes and identify wallet management | todo | Needs wallet API/backend validation. |
-| Fix mesh page on larger screens so it scales nicely | done | Mesh keeps the tabbed tools layout on normal desktop/1920px widths and only splits Off-Grid Bitcoin, Dead Man, and Map into separate stacked containers on very large screens (`>=2560px` wide and `>=1200px` tall). The desktop tools column now fills its panel instead of using a wrapper scroll container. Validated by local type-check, targeted tests, and `git diff --check`. |
-| Mesh map should handle denied location permission and still show other devices | in-progress | Mesh map now treats browser geolocation as optional in the UI: denied local location reports that peer locations can still appear, and the empty hint waits for mesh device positions instead of saying location sharing is required. Covered by `MeshMap.test.ts`. Needs browser smoke test with denied location plus a peer coordinate message before marking done. |
-| Make tablet-size Meshtastic scrollable | done | Tablet/mobile Mesh tools panels now have bounded heights and internal scrolling so the selected Bitcoin/Dead Man/Map panel can scroll without blowing out the page. Validated by local type-check, targeted tests, and `git diff --check`. |
-| Make mobile screens have gap below lowest container and tab bar | done | Dashboard route panels, including the separate Chat/Mesh branch, now use mobile tab-bar bottom clearance so the lowest content clears the bottom tab bar. |
-| Add Trusted tab to Connected Nodes container and have Peers and Observers | done | Connected Nodes now labels trusted peers as Trusted and splits federation nodes with `trust_level: observer` into the Observers tab. Observer nodes are excluded from Trusted, shown with their own count/badge, and refresh from the same live federation list. Validated by local type-check and targeted tests. |
-| Add more tree navigation to cloud files so they do not all go back to first screen | done | Cloud folder navigation now persists the current folder path in the route query so refresh/browser back keeps nested folders instead of resetting to the section root. The Cloud back button now walks up to the parent folder before returning to Cloud home. Covered by `cloudPath.test.ts`, local type-check, targeted tests, and `git diff --check`. |
-| Fix visible UI refreshing on find nodes screens | done | Federation node auto-refresh no longer blanks/replaces the visible node lists after the initial load. Existing nodes stay visible during background refreshes, covered by `NodeList.test.ts`, local type-check, targeted tests, and `git diff --check`. |
-| Remove dead UI components/ones that are coming soon | done | Removed the dead Web3/coming-soon Network card, disabled local-network placeholder button, and the non-interactive Spotlight AI Assistant coming-soon block. Verified active UI no longer contains explicit `Coming soon` copy outside historical release-note text. Covered by local type-check and `git diff --check`. |
-| Hide Web3 container on network for now and move FIPS Mesh up | done | Network page now places the live FIPS Mesh card in the top overview grid where the dead Web3 card was, removes the duplicate lower FIPS card, and updates the Home Network description to remove Web3 language. Validated by local type-check, targeted tests, and `git diff --check`. |
-| Make cool screens less hidden: Find Nodes, Fleet, Monitoring, etc. | done | Existing Web5 summary cards now expose Monitoring, Find Nodes/Federation, and Fleet directly. Federation card has separate `Find Nodes` and `Fleet` actions instead of hiding Find Nodes behind Fleet. Covered by `Web5Federation.test.ts`, local type-check, targeted tests, and `git diff --check`. |
-| Fix dashboard container/card square rendering corruption | done | Generalized the App Store compositor workaround to dashboard scroll-panel glass cards/buttons/inputs and removed transform-based stagger movement so Chromium/Brave no longer paints random large black square/rectangle layers over containers. Kept the Web5 bottom-action placement change. Validated by local type-check, targeted tests, and `git diff --check`. |
-| Move constrained card header actions to bottom buttons | done | Web5 summary actions and Network actions for Add Device, Scan WiFi, Restart Tor, and Add Service now stay in the card header only on very wide screens; otherwise they render at the card bottom as full-width or 50/50 buttons. Button icons were removed from those action buttons. Validated by local type-check, targeted tests, and `git diff --check`. |
-| Work on setup screens function and flows | in-progress | Onboarding setup choice now shows only usable paths: Fresh Start and Restore from Seed. Removed the disabled `Connect Existing (Coming Soon)` option, and covered default Fresh routing plus Restore routing with `OnboardingOptions.test.ts`; `useOnboarding.test.ts`, local type-check, and `git diff --check` passed. Broader onboarding/setup audit still needed before marking done. |
-| Work on Easy Mode experience | in-progress | Easy Mode goal configure steps now route to their owning app/screen instead of silently completing without navigation; verify steps now expose a `Check & Continue` action; configure/info/verify actions start goal progress before completing the active step. Covered by `goalStepActions.test.ts`, existing goal store tests, local type-check, and `git diff --check`. Broader Easy Mode product scope still needed before marking done. |
-| Update My Apps homescreen to show most-used apps instead of hardcoded | done | App launches are recorded locally through the app launcher, and the Home My Apps card now shows the top three installed user apps by launch count/recency with a running-app/name fallback when there is no history. Covered by `appUsage.test.ts`, existing app launcher tests, local type-check, targeted tests, and `git diff --check`. |
-| Improve Full Archive Node dependent apps UX | in-progress | Electrum-style apps already block install on pruned Bitcoin nodes; Marketplace/App Store cards now surface an inline warning that a full archive Bitcoin node is required instead of only showing a terse `Bitcoin Pruned` button. Covered by `MarketplaceAppCard.test.ts` and local type-check. Broader dependency UX remains. |
-| Fix incorrect modals that are wrong color and are not full-screen overlay | done | Custom Teleport modals that still used the old light `bg-black/10` overlay now use the same full-screen `bg-black/60` overlay treatment as BaseModal/newer modals. Verified no fixed modal overlays retain `bg-black/10`; validated by local type-check, targeted tests, and `git diff --check`. |
-| Prevent modals from allowing background scroll | done | Added shared scroll-lock composable, root-level body lock, wheel/touch containment, and explicit dashboard route-panel locking. User validated the background no longer scrolls behind modal overlays. |
-| Look over gamepad navigation | todo | Needs focused controller-nav pass. |
-| App Store screenshots | in-progress | Placeholder policy fixed: Marketplace App Details and installed App Details now render screenshot sections only when real screenshot metadata exists, and otherwise hide the fake placeholder tiles. Metadata can be string URLs or `{ src, alt }` objects. Covered by `AppContentSection.test.ts`, `useMarketplaceApp.test.ts`, local type-check, and `git diff --check`. Needs actual screenshot assets/metadata before marking done. |
-| Fix App Detail page issues; container controls are not good | done | App Details container controls now disable while start/stop/restart/update/uninstall RPCs are running and show action-specific progress labels. Header actions collapse into the bottom 50/50 grid below `1280px` to avoid tablet/smaller desktop overlap. Credentials now show a loading state while package credentials are being fetched. Covered by `AppHeroSection.test.ts`, `AppSidebar.test.ts`, local type-check, targeted tests, and `git diff --check`. |
-| Add setup instructions for apps that need them | done | App Details now renders a dedicated Setup Instructions card from `static-files.instructions` when present, so apps can show install/setup notes without a new schema. Covered by `AppSidebar.test.ts`, local type-check, and `git diff --check`. |
-| Add press-and-hold option for apps on mobile app screen | done | Mobile My Apps icons now support long press/context menu to open the app detail/options screen while a normal tap still launches the app. Space key opens the same options path for keyboard users. Covered by `AppIconGrid.test.ts`, local type-check, targeted tests, and `git diff --check`. |
-| Side-load: add port-not-available validation | done | Sideload modal now validates app ID collisions, malformed `host:container` mappings, reserved Archipelago/package host ports, and host ports already exposed by installed packages before queueing install. Backend install remains the final bind authority. Covered by `sideloadValidation.test.ts`, local type-check, targeted tests, and `git diff --check`. |
-| Delete app data option and uninstall warning | done | Uninstall dialogs in My Apps and App Details now include a clear warning plus a `Delete app data and reset it` choice. Leaving it off preserves app data for later reinstall; checking it passes `preserve_data=false` through `package.uninstall` so the app is fully reset.
Covered by `AppsUninstallModal.test.ts`, `rpc-client.test.ts`, local type-check, targeted tests, and `git diff --check`. | -| Add App Store container with recommended apps that change to Home Screen | done | Home now shows up to three uninstalled core/recommended App Store apps and routes clicks through the existing Marketplace App Details handoff. Installed aliases are honored, so recommendations disappear once the app is installed and the app moves into normal My Apps/Home behavior. Follow-up layout polish moved Cloud back into the second card slot, moved Recommended Apps into Cloud's previous slot, and placed Quick Start inside the grid next to Wallet to avoid an odd-width row. Covered by `homeRecommendations.test.ts`, local type-check, `git diff --check`, and Playwright Home dashboard smoke against local Vite/mock backend. | -| Add QR code to download mobile companion app in login-triggered modal and improve modal | done | Companion intro modal now renders a QR code on desktop and a direct download button on mobile. It reads `VITE_COMPANION_APK_URL` and falls back to `/packages/archipelago-companion.apk.zip`; the APK zip is now published at `neode-ui/public/packages/archipelago-companion.apk.zip` so the modal can serve it immediately. Covered by local type-check, `git diff --check`, and manual file placement verification. | -| Fix TV HDMI overscan clipping in kiosk mode | in-progress | Kiosk launcher now passes a browser safe-area fallback through `/kiosk?safe_area=...`; `/kiosk` now persists the safe-area value during redirect; self-update and deploy paths refresh kiosk launcher/services. The X11 safe-area attempt is opt-in because it stretched the live TV output on `100.66.157.120`. Wi-Fi UI fixes are included in the same OTA patch: scan errors are visible, scans can be retried, escaped SSIDs parse correctly, and open networks do not require a password. Needs live validation on HDMI node `100.66.157.120` after applying the visible OTA update. 
| -| Video calling Picture-in-Picture | blocked | Need referenced document or desired provider/library. | -| Card-based loading visuals on App Store pages | done | Discover and Marketplace now show app-card skeleton grids while community/Nostr catalog data is loading and no cards are available yet, instead of a centered spinner/empty state. Validated by local type-check, targeted tests, and `git diff --check`. | - -## External / Hardware Items - -| Item | Status | Release question / blocker | -| --- | --- | --- | -| Buy a HaLow device and start integration | blocked | Requires hardware purchase and driver/device target. Not a code-only `1.8-alpha` item unless hardware is available now. | diff --git a/docs/BETA-ISSUES-20260328.md b/docs/BETA-ISSUES-20260328.md deleted file mode 100644 index 10708242..00000000 --- a/docs/BETA-ISSUES-20260328.md +++ /dev/null @@ -1,96 +0,0 @@ -# Beta Test Issues - 2026-03-28 (ISO build 2137) - -Hardware: Dell OptiPlex 3020M, i5, 8GB RAM, 465G HDD, UEFI+Legacy - -## ISO / Boot (image-recipe) - -### 1. UEFI autodetect broken -- **Severity**: High -- **Detail**: Only autodetects/boots in Legacy BIOS mode. UEFI boot does not autodetect the install disk. -- **Where**: `build-auto-installer-iso.sh` GRUB config, EFI boot chain -- **Status**: TODO - -### 2. Installation TUI screens need redesign -- **Severity**: Medium -- **Detail**: Current installer output is plain/ugly. Needs polished design. -- **Action**: User will provide .md mockup for each screen, then we implement. -- **Where**: `build-auto-installer-iso.sh` auto-install.sh embedded script -- **Status**: AWAITING DESIGN - -### 3. No TUI animations -- **Severity**: Low -- **Detail**: Would like Claude-style spinner/progress animations during install. May not be possible with bash. -- **Where**: auto-install.sh -- **Status**: TODO (investigate) - -### 4. 
USB read errors on boot -- **Severity**: Medium (cosmetic but bad first impression) -- **Detail**: Read errors scroll on screen during USB boot before installer loads. Scares new users. -- **Where**: Kernel/initramfs boot, possibly `quiet` not suppressing early messages -- **Status**: TODO - -### 5. GRUB background tiling + text cutoff -- **Severity**: Medium -- **Detail**: Boot menu background image tiles instead of scaling. Menu text ("Install Archipelago", "Failsafe mode") is cut off. -- **Where**: `branding/grub-theme/`, `boot/grub/grub.cfg`, theme.txt resolution settings -- **Status**: TODO - -### 6. USB removal drops to command line -- **Severity**: Medium -- **Detail**: After install completes, removing USB drops to shell before user presses Enter to reboot. Confuses non-technical users. -- **Where**: auto-install.sh - end of install, before `read -s` / `reboot` -- **Status**: TODO - -## Frontend / UI (neode-ui) - -### 7. Broken splash screen flashes before onboarding -- **Severity**: High -- **Detail**: Black screen with "online/offline" top-right, broken archipelago image top-left, "use arrow keys" text. Flashes briefly before onboarding loads. -- **Where**: Likely `RootRedirect.vue` or `SplashScreen.vue` - routing/transition timing -- **Status**: TODO (reported before, persists) - -### 8. Skip buttons still visible in onboarding -- **Severity**: Medium -- **Detail**: Onboarding flow still shows skip buttons. Should be removed for clean UX. -- **Where**: `src/views/onboarding/` components -- **Status**: TODO - -### 9. App install UX outdated -- **Severity**: High -- **Detail**: Missing the yellow "Installing..." button that persists across navigation. Apps don't show as "installing" in My Apps view during install. -- **Where**: `src/views/marketplace/`, `src/views/myapps/`, app install store -- **Status**: TODO - -### 10. 
Login requires double Enter -- **Severity**: Medium -- **Detail**: Password field on login page requires pressing Enter twice to submit. -- **Where**: `src/views/LoginView.vue` - form submission handler -- **Status**: TODO (reported before, persists) - -### 11. No password setting UI -- **Severity**: High -- **Detail**: No way for user to set/change their password from the web UI. Currently hardcoded `password123`. -- **Where**: Settings view, backend auth API -- **Status**: TODO - -### 12. Browser login loops (non-kiosk) -- **Severity**: High -- **Detail**: Logging in from a browser (not kiosk) on the same network redirects back to login in a loop. Kiosk mode works fine. -- **Where**: Auth/session handling - possibly cookie `SameSite` or redirect logic in `RootRedirect.vue` -- **Status**: TODO - -### 13. Can't exit input fields with arrow keys -- **Severity**: Medium -- **Detail**: When focused on a text input, up/down arrow keys don't move focus to adjacent UI elements. Stuck in the field. -- **Where**: `useControllerNav.ts` - input field focus trap logic -- **Status**: TODO (reported before, persists) - ---- - -## Summary - -| Category | Critical | High | Medium | Low | -|----------|----------|------|--------|-----| -| ISO/Boot | 0 | 1 | 4 | 1 | -| Frontend | 0 | 4 | 3 | 0 | -| **Total** | **0** | **5** | **7** | **1** | diff --git a/docs/BETA-PROGRESS.md b/docs/BETA-PROGRESS.md deleted file mode 100644 index ad841c4d..00000000 --- a/docs/BETA-PROGRESS.md +++ /dev/null @@ -1,335 +0,0 @@ -# Beta Progress Tracker - -> **Goal**: Flawless beta that works perfectly on every machine we install it on. 
-> **Freeze started**: 2026-03-18 -> **Last updated**: 2026-03-25 - ---- - -## Pipeline - -``` -PHASE 1: Feature Testing (internal) ← WE ARE HERE - ↓ -PHASE 2: User Testing (real users, controlled) - ↓ -PHASE 3: Beta Live (public release) -``` - -**Current phase**: PHASE 1 - Feature Testing -**Gate to Phase 2**: Every feature works, all bugs fixed, security hardened, ISO verified -**Gate to Phase 3**: User testing feedback resolved, no P0/P1 issues remaining - ---- - -## Phase 1: Feature Testing (Internal) - -Everything in this phase must pass before we hand it to real users. - -### Overall Status: IN PROGRESS (~65%) - -| Workstream | Status | Completion | Gate-blocking? | -|------------|--------|------------|----------------| -| 1A. Critical Bugs (BUG-1 CSRF) | DONE | 100% | ~~YES~~ | -| 1B. Boot Screen (FEATURE-4) | IN PROGRESS | ~80% (needs hardware test) | YES | -| 1C. Security Hardening (TASK-8) | DONE (12/12 + code audit) | 100% | ~~YES~~ | -| 1D. Rootless Podman (TASK-11) | DONE (.228), IN PROGRESS (.198) | ~80% | YES | -| 1E. Beta Telemetry (TASK-12) | NOT STARTED | 0% | YES | -| 1F. App Testing - every feature | NOT STARTED | 0% | YES | -| 1G. ISO Build & Fresh Install | NOT STARTED | 0% | YES | -| 1H. UI Polish & Layout | DONE (batch + What's New) | ~90% | No | -| 1I. WebSocket Reliability | NOT STARTED | 0% | No | -| 1J. Quality Baseline Check | NOT STARTED | 0% | No | -| 1K. Architecture Review Fixes | DONE (4/4 items) | 100% | ~~YES~~ | -| 1L. Update System (git.tx1138.com) | DONE | 100% | No | - -### 1A. Critical Bugs - -#### BUG-1: Random logout / CSRF mismatch - P0 -**Status**: PLANNED -**Impact**: Users get randomly logged out. Blocks user testing - unacceptable UX. 
- -**What's known**: -- Sessions now persist to disk (fixed) -- CSRF token mismatch between cookie and header still causes 403s -- Likely caused by cookie rotation in multi-tab or deploy scenarios - -**Remaining work**: -- [ ] Add debug logging to capture actual cookie vs header values -- [ ] Reproduce reliably (multi-tab, deploy, long idle) -- [ ] Fix the root cause -- [ ] Verify fix survives deploys and multi-tab use - -#### BUG-3: IndeedHub WebSocket spam - P2 -**Status**: PLANNED -**Impact**: Console noise, minor. Should fix before user testing. - -- [ ] Rebuild IndeedHub with relative WebSocket URL -- [ ] Verify fix - ---- - -### 1B. Boot Screen (FEATURE-4) - -**Status**: IN PROGRESS (~80% complete) -**Impact**: Users hit errors on first boot before backend is ready. Blocks user testing. - -- [x] Audit current `/health` endpoint - returns trivial "OK" -- [x] Add granular service readiness to health endpoint (JSON with version + services) -- [x] Design boot screen component - BootScreen.vue (379 lines, starfield + terminal log + orb) -- [x] Create pixel art icon animations (6 SVG icons cycling) -- [x] Implement health polling with smooth transition (server.echo RPC, 2s interval) -- [x] Handle edge cases (timeout, 502/503 detection, boot-reset) -- [ ] Test on fresh ISO install (first-boot path) -- [ ] Test on normal reboot (existing user path) - ---- - -### 1C. 
Security Hardening (TASK-8) - -**Status**: DONE - 12/12 pentest findings fixed + additional hardening from code audit - -#### Pentest (12/12 fixed) -- [x] C1: /lnd-connect-info requires session auth -- [x] C3: DEV_MODE removed from production service -- [x] H1: node-message verifies ed25519 signatures -- [x] H2: federation.peer-joined verifies ed25519 signature -- [x] H3: federation.peer-address-changed requires signed proof -- [x] H4: Backend binds to 127.0.0.1 -- [x] M1: content.add rejects `..` path traversal -- [x] M2: NIP-07 postMessage uses specific origin -- [x] M3: AIUI nginx checks session_id cookie -- [x] L2: Strict v3 onion validation -- [x] MED-03: Shell injection in bitcoin.conf generation -- [x] MED-07: No body size limit on /rpc/ - -#### Code audit (additional) -- [x] CSRF: HMAC-derived from session token (BUG-1 fix) -- [x] Argon2id password hashing (bcrypt auto-upgrade) -- [x] Random Bitcoin RPC password on first boot -- [x] RBAC Viewer role: explicit allowlist -- [x] Error sanitization tightened -- [x] Identity label max length enforced -- [ ] Cosign image verification (large scope - post-beta candidate) - ---- - -### 1D. Rootless Podman (TASK-11) - -**Status**: DONE on .228 (30 containers rootless), IN PROGRESS on .198 -**Impact**: Security posture - containers no longer require root. - -- [x] Migrate existing root Podman containers to rootless (archipelago user) -- [x] Update PodmanClient to run `podman` directly (no sudo) - 9 Rust files -- [x] Deploy script auto-fixes ownership + sysctl + linger on every deploy -- [x] All 30 containers running rootless on .228 -- [ ] .198: only 2 containers running - needs full container recreation (TASK-39) -- [x] Tailscale deploy script: full deploy-tailscale.sh with split-mode SSH, rootful→rootless migration, container creation, all infrastructure -- [ ] Test full deploy on .198 (validation before Tailscale) -- [ ] Deploy to Tailscale nodes (Arch 1/2/3) - ---- - -### 1E. 
Beta Telemetry - Node Reporting (TASK-12) - -**Status**: NOT STARTED -**Impact**: Without this we're blind during user testing - can't see what's broken on their machines. - -All beta nodes report health/errors to a central log. We build a panel to monitor and triage issues. - -**Design**: -- Opt-in telemetry (user consents during onboarding or settings) -- Each node periodically reports: health status, error log digest, container states, uptime -- Central endpoint collects reports (could be a simple API on one of our servers) -- Dashboard panel shows all reporting nodes, their status, recent errors -- Privacy: no wallet data, no keys, no personal data - only system health and error logs -- Nodes identified by anonymous ID (hash of DID), not IP or name - -**Tasks**: -- [ ] Design report payload (health, errors, container states, versions, uptime) -- [ ] Design privacy model - what's collected, what's NOT, user consent flow -- [ ] Build reporting endpoint (backend RPC → central collector) -- [ ] Build central collector service (receives + stores reports) -- [ ] Build monitoring dashboard/panel (view all nodes, filter by error type) -- [ ] Add opt-in toggle to Settings UI -- [ ] Add reporting interval config (default: every 15 min?) -- [ ] Test with multi-node fleet (.228, .198, Tailscale nodes) - ---- - -### 1F. App Testing - Every Feature - -**Status**: NOT STARTED -**Reference**: `docs/BETA-RELEASE-CHECKLIST.md` - full matrix - -Systematic test of **every feature** on the dev server, then on fresh install. 
- -#### Core Flows -- [ ] Onboarding: welcome → password → path → DID → backup → dashboard -- [ ] Login / logout / re-login -- [ ] Password change (invalidates other sessions) -- [ ] 2FA enrollment and verification -- [ ] Settings: view server name, version, DID, Tor address -- [ ] Dashboard: all overview cards render with data - -#### App Lifecycle (every app) -- [ ] Bitcoin Knots: install, sync starts, UI loads, uninstall -- [ ] Electrs: install, auto-connects to Bitcoin, UI loads, uninstall -- [ ] LND: install, auto-connects to Bitcoin, UI loads, uninstall -- [ ] BTCPay Server: install, connects, Lightning available, uninstall -- [ ] Mempool: install with Bitcoin+Electrs, shows data, uninstall -- [ ] Fedimint + Gateway: install, UI loads, uninstall -- [ ] File Browser: install, UI loads, uninstall -- [ ] Immich: install, UI loads, uninstall -- [ ] PhotoPrism: install, UI loads, uninstall -- [ ] Penpot: install, UI loads, uninstall -- [ ] SearXNG: install, UI loads, uninstall -- [ ] Ollama: install, UI loads, uninstall -- [ ] Nostr Relay: install, UI loads, uninstall -- [ ] Nginx Proxy Manager: install, UI loads, uninstall -- [ ] Tailscale: install, UI loads, uninstall -- [ ] Home Assistant: install, UI loads (new tab), uninstall -- [ ] IndeedHub: opens external URL in iframe - -#### Dependency Chain Errors -- [ ] Electrs without Bitcoin → clear error message -- [ ] LND without Bitcoin → clear error message -- [ ] Mempool without Bitcoin+Electrs → clear error message - -#### Federation & Identity -- [ ] Federation invite + join between nodes -- [ ] DWN sync between federated nodes -- [ ] Backup create + download -- [ ] Backup restore on fresh install - -#### WebSocket -- [ ] Connects on login, receives initial data -- [ ] Reconnects after network drop -- [ ] Ping/pong heartbeat both directions -- [ ] Connection state visible in UI -- [ ] Install progress delivered real-time - -#### Nginx Proxies -- [ ] Every `/app/*` proxy resolves correctly -- [ 
] BTCPay and Home Assistant open in new tab -- [ ] Tor hidden services resolve - ---- - -### 1G. ISO Build & Fresh Install - -**Status**: NOT STARTED - -- [ ] ISO builds successfully on dev server -- [ ] ISO size < 10 GB -- [ ] All container images captured -- [ ] Boot from USB on x86_64 hardware -- [ ] Auto-installer partitions correctly -- [ ] Services start on first boot -- [ ] Web UI accessible within 3 minutes -- [ ] Full onboarding flow completes -- [ ] Second machine test (different hardware) -- [ ] ARM64 test (if targeting) - ---- - -### 1H. UI Polish & Layout - -**Status**: MOSTLY DONE - batch of fixes shipped 2026-03-18 -**Note**: Layout rearrangements and UX improvements allowed during freeze. - -- [x] Rename fedimintd → "Fedimint Guardian" + icon (TASK-26) -- [x] Tab-launch icons for apps opening in new tabs (TASK-27) -- [x] Installed apps sorted to end of marketplace (TASK-28) -- [x] Mesh mobile: header hidden, overflow fixed (TASK-29) -- [x] On-Chain first in receive modals (TASK-30) -- [x] Federation node names - show name not DID, hover for key (TASK-35) -- [x] Cleaner iframe error screen with remediation (TASK-36) -- [x] CPU alert threshold fixed (BUG-33) -- [x] ElectrumX shows index size during indexing -- [x] Container startup "Checking..." shimmer -- [ ] Sticky nav header (TASK-31) -- [ ] Review all views for consistent glass design -- [ ] Verify all loading/empty/error states work -- [ ] Check responsive layout on tablet/mobile - ---- - -### 1I. WebSocket Reliability - -Covered under 1F testing - no separate workstream needed. - ---- - -### 1J. Quality Baseline Check - -**Last known** (2026-03-11): -- Silent catches: 0 -- Console statements: 0 -- `any` types: 0 -- TypeScript errors: 0 -- Tests: 515 passed -- npm audit (runtime): 0 - -- [ ] Re-run full quality sweep - verify no regressions -- [ ] Fix any new violations - ---- - -## Phase 2: User Testing (Controlled) - -**Gate**: All Phase 1 items pass. No P0/P1 bugs open. 
- -Starts when we hand ISOs to real users on real hardware we don't control. - -| Item | Status | -|------|--------| -| Recruit test users (3-5 people, varied hardware) | NOT STARTED | -| Provide ISOs + install instructions | NOT STARTED | -| Beta telemetry collecting reports from user nodes | NOT STARTED | -| Monitor dashboard for errors across fleet | NOT STARTED | -| Triage + fix reported issues | NOT STARTED | -| User feedback collection (structured form or channel) | NOT STARTED | -| Fix all P0/P1 issues from user reports | NOT STARTED | -| Rebuild ISO with fixes, re-test | NOT STARTED | - ---- - -## Phase 3: Beta Live (Public) - -**Gate**: User testing complete. No P0/P1 issues. Telemetry shows stable fleet. - -| Item | Status | -|------|--------| -| Final ISO build with all fixes | NOT STARTED | -| Release notes / changelog | NOT STARTED | -| Download page / distribution | NOT STARTED | -| Public announcement | NOT STARTED | -| Telemetry monitoring active for early adopters | NOT STARTED | - ---- - -## Session Log - -| Date | Session | Work Done | Items Closed | -|------|---------|-----------|--------------| -| 2026-03-18 | #1 | Created beta freeze plan, progress tracker | - | -| 2026-03-18 | #2 | Restructured into 3-phase pipeline, added telemetry workstream | - | -| 2026-03-18 | #3 | Updated tracking to reflect completed work - TASK-11 done, TASK-8 9/12, UI batch done | TASK-11, TASK-26-30, TASK-32, TASK-34-36, BUG-33 | -| 2026-03-18 | #4 | Rewrote deploy-tailscale.sh (full deploy with split-mode SSH, rootful migration, containers, infra). Fixed first-boot-containers.sh rootless bugs (subnet, UID mapping, prereqs). Dynamic HTTPS certs. | - | -| 2026-03-18 | #5 | BUG-1 CSRF fix, TASK-8 12/12 done, 7 bugs fixed, Argon2id migration, random BTC RPC, RBAC hardened, What's New history, Bitcoin sync gauge. Tagged v1.2.0-alpha.9. | BUG-1, TASK-8, BUG-20/37/40/41, TASK-31/38 | -| 2026-03-25 | #6 | Architecture review audit: all P0s+P1s verified fixed. 
Fixed remaining items: Nostr timeouts (6 calls), crypto dep pinning (12 deps), container image pinning (15 images), CI pipeline. Update system wired to git.tx1138.com. Cleaned stale branches. Docs updated. | Architecture review 4/4, CI pipeline | - ---- - -## Post-Beta Parking Lot - -These are explicitly deferred until after beta ships: -- FEATURE-6: Watch-only wallet architecture -- TASK-7: Mesh Bitcoin security hardening -- INQUIRY-5: Offline balance check via mesh relay -- TASK-2: Roll incoming-tx into deploy & ISO (P2, not blocking) -- did:dht integration -- Multi-user support -- Cluster mode -- Mobile companion PWA diff --git a/docs/BETA-RELEASE-CHECKLIST.md b/docs/BETA-RELEASE-CHECKLIST.md deleted file mode 100644 index bacf2fbd..00000000 --- a/docs/BETA-RELEASE-CHECKLIST.md +++ /dev/null @@ -1,269 +0,0 @@ -# Beta Release Checklist (v0.5.0-beta) - -## Pre-Build Verification - -### Source Code - -- [ ] All changes committed and pushed to `main` -- [ ] `cargo clippy --all-targets --all-features` passes (zero warnings) -- [ ] `cargo fmt --all` applied -- [ ] `cd neode-ui && npm run type-check` passes (zero errors) -- [ ] `cd neode-ui && npm test` passes (all tests green) -- [ ] `cargo test --all-features` passes on dev server - -### Critical Files - -- [ ] `core/container/src/podman_client.rs` - rootless Podman REST API socket -- [ ] `core/archipelago/src/container/docker_packages.rs` - app metadata + UI mapping -- [ ] `core/archipelago/src/api/rpc/package.rs` - app configs, capabilities, dependencies -- [ ] `core/archipelago/src/session.rs` - session security hardening -- [ ] `core/security/src/secrets_manager.rs` - encryption + rotation -- [ ] `neode-ui/src/views/Marketplace.vue` - all app entries with pinned image versions -- [ ] `neode-ui/src/api/websocket.ts` - heartbeat + reconnection -- [ ] `image-recipe/configs/nginx-archipelago.conf` - all app proxies + path traversal blocks -- [ ] All app icons present in 
`neode-ui/public/assets/img/app-icons/` - ---- - -## App Integration Matrix - -Every app must be tested for install, launch, and uninstall on a fresh system. - -### Core Bitcoin Stack - -| App | Image | Version | Install | Launch | UI Loads | Uninstall | -|-----|-------|---------|---------|--------|----------|-----------| -| Bitcoin Knots | `bitcoinknots/bitcoin` | `v28.1` | [ ] | [ ] | [ ] | [ ] | -| Electrs | `mempool/electrs` | `v0.4.1` | [ ] | [ ] | [ ] | [ ] | -| LND | `lightninglabs/lnd` | `v0.18.4` | [ ] | [ ] | [ ] | [ ] | -| BTCPay Server | `btcpayserver/btcpayserver` | `2.0.6` | [ ] | [ ] | [ ] | [ ] | -| Mempool | `mempool/frontend` | `v3.0.0` | [ ] | [ ] | [ ] | [ ] | -| Fedimint | `fedimintui/fedimint` | `0.5.0` | [ ] | [ ] | [ ] | [ ] | -| Fedimint Gateway | `fedimintui/gateway-ui` | `0.5.0` | [ ] | [ ] | [ ] | [ ] | - -### Storage & Media - -| App | Image | Version | Install | Launch | UI Loads | Uninstall | -|-----|-------|---------|---------|--------|----------|-----------| -| File Browser | `filebrowser/filebrowser` | `v2` | [ ] | [ ] | [ ] | [ ] | -| Immich | `ghcr.io/immich-app/immich-server` | `v1.121.0` | [ ] | [ ] | [ ] | [ ] | -| PhotoPrism | `photoprism/photoprism` | `240915` | [ ] | [ ] | [ ] | [ ] | - -### Productivity & Privacy - -| App | Image | Version | Install | Launch | UI Loads | Uninstall | -|-----|-------|---------|---------|--------|----------|-----------| -| Penpot | `penpotapp/frontend` | `2.4` | [ ] | [ ] | [ ] | [ ] | -| SearXNG | `searxng/searxng` | `2024.11.17-e2554de75` | [ ] | [ ] | [ ] | [ ] | -| Ollama | `ollama/ollama` | `0.5.4` | [ ] | [ ] | [ ] | [ ] | - -### Network & Infrastructure - -| App | Image | Version | Install | Launch | UI Loads | Uninstall | -|-----|-------|---------|---------|--------|----------|-----------| -| Nostr Relay | `scsiblade/nostr-rs-relay` | `0.9.0` | [ ] | [ ] | [ ] | [ ] | -| Nginx Proxy Manager | `jc21/nginx-proxy-manager` | `2.12.1` | [ ] | [ ] | [ ] | [ ] | -| Tailscale | 
`tailscale/tailscale` | pinned | [ ] | [ ] | [ ] | [ ] | -| Home Assistant | `homeassistant/home-assistant` | pinned | [ ] | [ ] | [ ] | [ ] | - -### Virtual Apps (No Container) - -| App | Behavior | Works | -|-----|----------|-------| -| IndeedHub | Opens external URL | [ ] | - ---- - -## Dependency Chain Tests - -These must be tested in order on a fresh install: - -- [ ] Install Bitcoin Knots → starts and begins syncing -- [ ] Install Electrs while Bitcoin running → connects to Bitcoin automatically -- [ ] Install LND while Bitcoin running → connects to Bitcoin automatically -- [ ] Install BTCPay while Bitcoin running → connects; Lightning available if LND present -- [ ] Install Mempool while Bitcoin + Electrs running → shows blockchain data -- [ ] Try installing Electrs without Bitcoin → shows clear error message -- [ ] Try installing LND without Bitcoin → shows clear error message -- [ ] Try installing Mempool without Bitcoin + Electrs → shows missing deps error -- [ ] Fedimint Gateway auto-detects LND credentials when available - ---- - -## Security Hardening Verification - -### Session Security - -- [ ] Sessions expire after 24 hours of inactivity -- [ ] Password change invalidates all other sessions -- [ ] Maximum 5 concurrent sessions (oldest evicted when exceeded) -- [ ] Session tokens are SHA-256 hashed in memory (never stored as plaintext) -- [ ] Login rate limiting: 5 failures per 60 seconds per IP - -### Container Security - -- [ ] All container images use pinned versions (no `:latest`) -- [ ] Read-only root filesystem enabled for compatible apps -- [ ] `--cap-drop=ALL` applied to all containers -- [ ] `--security-opt=no-new-privileges:true` applied to all containers -- [ ] Required capabilities added explicitly per app (e.g., CHOWN for File Browser) - -### Secrets Management - -- [ ] Secrets encrypted with AES-256-GCM on disk -- [ ] Secret metadata tracked (creation date, rotation count) -- [ ] Secret rotation generates new random 
values and re-encrypts -- [ ] `security.list-expiring` RPC returns secrets older than threshold - -### Path Traversal Prevention - -- [ ] Nginx blocks `..` in filebrowser API paths (403 response) -- [ ] Frontend `sanitizePath()` strips `..` and resolves paths -- [ ] File Browser token not exposed in URLs - -### Authentication - -- [ ] TOTP 2FA enrollment and verification works -- [ ] TOTP backup codes work for recovery -- [ ] Maximum 5 TOTP attempts before session invalidation -- [ ] Pending TOTP sessions expire after 5 minutes -- [ ] Cookie-based auth (no tokens in query strings) - ---- - -## WebSocket & Connectivity - -- [ ] WebSocket connects on login and receives initial data dump -- [ ] WebSocket reconnects after network interruption (exponential backoff, max 30s) -- [ ] Server sends ping every 30s; client responds with pong -- [ ] Client sends JSON ping every 30s; server responds with JSON pong -- [ ] Server closes inactive connections after 5 minutes -- [ ] Connection state shown in UI (connected/reconnecting/disconnected) -- [ ] Install progress updates delivered in real-time via WebSocket - ---- - -## Fresh Install Testing Matrix - -### ISO Build - -- [ ] ISO builds successfully on dev server -- [ ] ISO size is reasonable (< 10 GB) -- [ ] All container images captured in ISO - -### Installation - -- [ ] Boot from USB on x86_64 hardware -- [ ] Auto-installer partitions disk correctly -- [ ] Debian 13 installs without errors -- [ ] Archipelago services start on first boot -- [ ] Web UI accessible at server IP within 3 minutes of first boot - -### Onboarding Flow - -- [ ] Welcome screen displays with intro video -- [ ] Password creation enforces minimum requirements -- [ ] Path selection shows all 6 options -- [ ] DID generation completes within 60 seconds -- [ ] Identity naming is optional and skippable -- [ ] Backup download produces valid JSON file -- [ ] Onboarding completes and reaches Dashboard - -### Post-Onboarding - -- [ ] Dashboard shows all 
overview cards -- [ ] App Store loads with all curated apps -- [ ] Settings shows server name, version, DID, Tor address -- [ ] Logout and re-login works -- [ ] Password change works and invalidates other sessions - ---- - -## Performance Targets - -- [ ] Backend startup: < 3 seconds -- [ ] Frontend initial load: < 500 KB gzipped -- [ ] WebSocket initial data: < 1 second after connection -- [ ] App install progress visible in UI within 5 seconds of starting - ---- - -## Nginx Proxy Verification - -All app proxies must work in both HTTP and HTTPS blocks: - -- [ ] `/rpc/` → backend:5678 -- [ ] `/ws/` → backend:5678 (WebSocket upgrade) -- [ ] `/health` → backend:5678 -- [ ] `/app/filebrowser/` → filebrowser:80 -- [ ] `/app/searxng/` → searxng:8080 -- [ ] `/app/immich/` → immich:2283 -- [ ] `/app/penpot/` → penpot-frontend:80 -- [ ] `/app/ollama/` → ollama:11434 -- [ ] `/app/photoprism/` → photoprism:2342 -- [ ] `/app/nginx-proxy-manager/` → npm:81 -- [ ] `/app/tailscale/` → tailscale:8240 -- [ ] BTCPay (port 23000) opens in new tab -- [ ] Home Assistant (port 8123) opens in new tab -- [ ] Tor hidden services resolve for all configured apps - ---- - -## Rollback Procedures - -### If Backend Fails to Start - -```bash -# Check logs -sudo journalctl -u archipelago -n 50 --no-pager - -# Restore previous binary -sudo cp /usr/local/bin/archipelago.bak /usr/local/bin/archipelago -sudo systemctl restart archipelago -``` - -### If Frontend is Broken - -```bash -# Restore previous frontend build -sudo cp -r /opt/archipelago/web-ui.bak/* /opt/archipelago/web-ui/ -sudo systemctl reload nginx -``` - -### If Container Won't Start - -```bash -# Check container logs -podman logs - -# Remove and recreate -podman rm -f -# Reinstall from App Store -``` - -### If ISO Install Fails - -1. Boot into rescue mode from USB -2. Check `/var/log/installer.log` on target disk -3. Verify disk partitioning with `lsblk` -4. 
Re-run installer with `INSTALLER_STARTED= /opt/installer.sh`
-
-### Full System Rollback
-
-If the beta is unusable:
-1. Re-flash the ISO from the last known good build
-2. Restore user data from `/var/lib/archipelago/` backup
-3. Re-import DID from backup JSON file
-
----
-
-## Sign-Off
-
-| Reviewer | Area | Date | Pass/Fail |
-|----------|------|------|-----------|
-| | Backend | | |
-| | Frontend | | |
-| | Security | | |
-| | ISO Build | | |
-| | Fresh Install | | |
-| | App Integrations | | |
diff --git a/docs/CHAT_TRANSCRIPT_2026-05-02.md b/docs/CHAT_TRANSCRIPT_2026-05-02.md
deleted file mode 100644
index 71a926a2..00000000
--- a/docs/CHAT_TRANSCRIPT_2026-05-02.md
+++ /dev/null
@@ -1,317 +0,0 @@
-# Chat Transcript And Working Notes
-
-Date: 2026-05-02
-
-This file captures the current chat context, decisions, progress, and next steps so work can continue from another device/session.
-
-## User Request
-
-The user asked to continue hardening Archipelago app/container lifecycle, then asked multiple times to save the plan/progress/next steps and finally to save the entire chat to Markdown.
-
-Key user constraints and corrections:
-
-- Continue if next steps are clear; ask only if blocked.
-- Exhaustively harden app/container lifecycle before release.
-- Preserve data during destructive lifecycle testing unless explicitly instructed otherwise.
-- Do not rely on `/app/...` proxy paths for app launch/testing. The user corrected: “we never use paths only ports.”
-- LND/Electrum wallet-connect tests must validate real connection details and QR, including Tor.
-
-## Earlier Progress Summary
-
-Before the latest work, the project already had substantial lifecycle hardening in progress:
-
-- Remote lifecycle harness exists at `tests/lifecycle/remote-lifecycle.sh`.
-- `.198` SSH works with `/home/archipelago/.ssh/id_ed25519`.
-- `.228` RPC works, but SSH is blocked with `Permission denied (publickey,password)`.
-- Multiple backend release binaries were built and deployed to `.198` with backups in `/usr/local/bin/archipelago.bak-*`.
-- Fixed stale package scanner state recovery from `Removing -> Running` when a container is actually live.
-- Fixed startup ordering so crash recovery runs before BootReconciler.
-- Removed dangerous automatic Podman runtime directory deletion on `podman info` failure.
-- Narrowed generic crash recovery to safe legacy containers.
-- Fixed companion reconciliation on install/start/restart.
-- Fixed uninstall/reinstall behavior so uninstall disables manifest apps instead of deleting manifest availability, and reinstall re-enables them.
-- Fixed LND config generation/repair:
-  - `bitcoin.active=true`
-  - `bitcoin.mainnet=true`
-  - `bitcoin.node=bitcoind`
-  - `bitcoind.rpchost=bitcoin-knots:8332`
-  - sudo fallback for writing container-owned config paths.
-- `.198` had previously passed focused lifecycle for `filebrowser`, `bitcoin-knots`, and a looser LND launch test.
-
-## Major Files Touched In This Session
-
-- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
-- `docs/CHAT_TRANSCRIPT_2026-05-02.md`
-- `tests/lifecycle/remote-lifecycle.sh`
-- `core/archipelago/src/container/lnd.rs`
-- `core/archipelago/src/container/companion.rs`
-- `core/archipelago/src/container/prod_orchestrator.rs`
-- `core/archipelago/src/container/docker_packages.rs`
-- `core/container/src/podman_client.rs`
-- `core/archipelago/src/port_allocator.rs`
-- `apps/lnd-ui/manifest.yml`
-- `neode-ui/src/views/appSession/appSessionConfig.ts`
-- `neode-ui/src/stores/container.ts`
-- `neode-ui/src/stores/appLauncher.ts`
-- `neode-ui/src/views/appDetails/appDetailsData.ts`
-- nginx config/snippet files under `scripts/` and `image-recipe/`
-
-## LND Wallet Bootstrap Investigation
-
-Initial strict LND probe failed because `/lnd-connect-info` could not read `admin.macaroon`:
-
-```text
-Failed to read LND admin macaroon - is LND installed?
-direct: Permission denied (os error 13)
-sudo: cat: /var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon: No such file or directory
-```
-
-LND logs showed the wallet was uninitialized/locked:
-
-```text
-Waiting for wallet encryption password. Use lncli create...
-```
-
-Tests showed `lncli create` is interactive and does not support `--stdin`:
-
-```text
-[lncli] flag provided but not defined: -stdin
-```
-
-`lncli unlock --stdin` is supported, so the final approach was:
-
-- Use LND REST unlocker endpoints for new wallet creation.
-- Use `lncli unlock --stdin` only for an existing wallet.
-- Treat “wallet already exists” from REST as a signal to unlock.
-- Use sudo-aware checks/reads for wallet artifacts because LND data directories are container-owned and `0700`.
-
-Implemented in `core/archipelago/src/container/lnd.rs`:
-
-- `ensure_wallet_initialized()`
-- `file_exists_as_root()`
-- `read_file_as_root()`
-- `init_wallet_via_rest()`
-- `get_lnd_unlocker_json()`
-- `post_lnd_unlocker_json()`
-- `unlock_existing_wallet()`
-- `wait_for_admin_macaroon()`
-- `lnd_getinfo_ready()`
-
-Focused Rust test passes:
-
-```bash
-cd /home/archipelago/Projects/archy/core
-cargo test -p archipelago --bin archipelago lnd
-```
-
-Result:
-
-```text
-7 passed; 0 failed
-```
-
-## LND UI Port Collision
-
-The strict LND UI test then failed with `502`.
-
-Investigation found a real port collision:
-
-- `nostr-rs-relay` uses host `8081`.
-- Old `archy-lnd-ui` also used host `8081`.
-- nginx `/app/lnd/` proxy also pointed at `8081`.
-
-Fix implemented:
-
-- Move LND UI companion to host port `18083`, container port `80`.
-- Keep `nostr-rs-relay` on `8081`.
-- Update app metadata/routing to `18083`.
-- Update tests to expect direct port launch.
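The collision described above (two containers claiming host port `8081`) is the kind of problem a pre-flight check could catch before any container is created. The following is a minimal sketch, assuming host-port bindings are available as `(container name, host port)` pairs. `find_host_port_collisions` is a hypothetical helper written for illustration; it is not taken from `core/archipelago/src/port_allocator.rs`.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: returns every host port claimed by more than one
/// container, mapped to the sorted names of the containers that claim it.
fn find_host_port_collisions(bindings: &[(&str, u16)]) -> BTreeMap<u16, Vec<String>> {
    let mut by_port: BTreeMap<u16, Vec<String>> = BTreeMap::new();

    // Group container names by the host port they want to bind.
    for (name, port) in bindings {
        by_port.entry(*port).or_default().push(name.to_string());
    }

    // Keep only the ports that have more than one claimant.
    by_port.retain(|_, names| names.len() > 1);

    // Sort names so the report is deterministic.
    for names in by_port.values_mut() {
        names.sort();
    }
    by_port
}
```

Called with the bindings from before the fix (`nostr-rs-relay` and `archy-lnd-ui` both on `8081`), the map would contain one entry for `8081`; with `archy-lnd-ui` moved to `18083` it would be empty.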
- -Important correction from user: - -```text -we never use paths only ports, how many times do you need to be told -``` - -Action taken after correction: - -- Stop validating through `/app/lnd/` and `/app/electrumx/` in the lifecycle harness. -- Switch `launch_url_for()` to direct app ports. -- Switch app session resolver to direct `http://host:port` launch, even from HTTPS parent pages. -- Remove use of `HTTPS_PROXY_PATHS[id]` in `resolveAppUrl()`. - -Direct-port LND audit command: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd tests/lifecycle/remote-lifecycle.sh -``` - -Result: - -```text -### 192.168.1.198 iteration 1 / 1 ### -lnd state=running -all checks passed -``` - -The audit now validates `http://192.168.1.198:18083/`, not `/app/lnd/`. - -## Lifecycle Harness Changes - -`tests/lifecycle/remote-lifecycle.sh` changes made: - -- Normalize package states with `ascii_downcase` because API returned `Running`. -- Direct port launch URLs: - - LND: `http://${ARCHY_HOST}:18083/` - - Electrum/Electrs: `http://${ARCHY_HOST}:50002/` - - Bitcoin UI: `http://${ARCHY_HOST}:8334/` - - Other apps mapped to direct ports where known. -- LND probe checks: - - `Connect Your Wallet` - - `id="lndQrBox"` - - `id="connHost"` - - `value="rest-tor"` - - `value="grpc-tor"` - - `value="rest-local"` - - `value="grpc-local"` - - `Copy lndconnect URI` - - `/lnd-connect-info` cert, macaroon, ports, and Tor onion. -- Electrum probe checks: - - local QR container and address field - - Tor QR container and onion field - - port `50001` - - QR renderer - - direct `http://${ARCHY_HOST}:50002/qrcode.js` - - `/electrs-status` Tor onion. -- Full lifecycle now fails immediately on any failed phase with `|| return 1` so a later reinstall cannot mask a failed restart/probe. 
- -## Deployments To `.198` - -Several release builds were made and deployed: - -```bash -cd /home/archipelago/Projects/archy/core -cargo build -p archipelago --bin archipelago --release -``` - -Deploy pattern: - -```bash -scp -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no \ - /home/archipelago/Projects/archy/core/target/release/archipelago \ - archipelago@192.168.1.198:/tmp/archipelago.new - -ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no \ - archipelago@192.168.1.198 \ - "sudo cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak- && \ - sudo install -m 0755 /tmp/archipelago.new /usr/local/bin/archipelago && \ - sudo systemctl restart archipelago.service && \ - systemctl is-active archipelago.service" -``` - -Latest deploy returned: - -```text -active -``` - -## `.198` Current Observations - -After forcing LND package restart, companion reconciliation succeeded: - -```text -nostr-rs-relay Up ... 0.0.0.0:8081->8080/tcp -lnd Up ... 0.0.0.0:8080->8080/tcp, 0.0.0.0:9735->9735/tcp, 0.0.0.0:10009->10009/tcp -archy-lnd-ui Up ... 0.0.0.0:18083->80/tcp -``` - -Direct UI test from `.198` returned `200`: - -```bash -curl -i http://127.0.0.1:18083/ -``` - -Strict direct-port LND audit is green: - -```text -lnd state=running -all checks passed -``` - -## Full LND Lifecycle Status - -Full direct-port lifecycle was started: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh -``` - -It reached: - -```text -### 192.168.1.198 iteration 1 / 1 ### -== lnd: install == -== lnd: stop == -``` - -Then the user aborted the command while asking to save memory/transcript. - -The next continuation point is to rerun full LND direct-port lifecycle from scratch and inspect the stop phase if it hangs/fails. 
- -## Handoff File - -A durable handoff file was also created: - -```text -docs/CONTAINER_LIFECYCLE_HANDOFF.md -``` - -It contains the plan, progress, current blockers, and next steps. - -## Immediate Next Steps - -1. Rerun full strict LND direct-port lifecycle: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh -``` - -2. If it hangs/fails at `stop`, inspect package runtime stop path and logs: - -```bash -ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 \ - 'journalctl -u archipelago.service -n 260 --no-pager | egrep -i "package\.(stop|start|restart|install|uninstall)|lnd|companion|error|failed" | sed -n "1,220p"; podman ps -a --format "{{.Names}} {{.Status}} {{.Ports}}" | egrep "lnd|nostr" || true' -``` - -3. If stop is unreliable, inspect/fix: - -- `core/archipelago/src/api/rpc/package/runtime.rs` -- `core/archipelago/src/container/prod_orchestrator.rs` - -Likely causes to check: - -- Reconciler restarting LND while stop is expected. -- State scanner reporting stale `running`. -- Companion handling interfering with parent app state. -- Async lifecycle returning before actual stop completes. - -4. Once LND full lifecycle is green, run Electrum strict lifecycle with direct port `50002`: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=electrumx ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh -``` - -5. Continue with app groups after LND/Electrum: - -- `filebrowser` -- `bitcoin-knots` -- `lnd` -- `electrumx` -- `mempool` -- `btcpay-server` -- `fedimint` -- remaining catalog apps. - -## Important Instruction To Preserve - -Use ports only for app launch/testing. Do not add or rely on `/app/...` path proxy launch behavior unless the user explicitly changes this requirement. 
diff --git a/docs/CONTAINER-ISSUES-REPORT.md b/docs/CONTAINER-ISSUES-REPORT.md
deleted file mode 100644
index 89fc7d25..00000000
--- a/docs/CONTAINER-ISSUES-REPORT.md
+++ /dev/null
@@ -1,508 +0,0 @@
-# Archipelago Container Infrastructure - Critical Issues Report
-
-**Date:** 2026-03-31
-**Status:** Server .228 rebooted - some apps recovered, many did not. UI showed everything as "crashed" during recovery window.
-**Purpose:** Fix guide for getting container lifecycle to production quality.
-
----
-
-## Executive Summary
-
-The container system has **7 systemic failures** that compound each other:
-
-1. **Silent failures everywhere** - errors are swallowed with `|| true`, `.unwrap_or_default()`, and warn-level logs. Nothing actually tells the user (or the system) that something broke.
-2. **Health checks are fake** - manifests define real health checks (HTTP probes, exec checks) but they are **never executed**. "Healthy" just means `podman ps` shows "running".
-3. **Duplicate polling burns CPU** - health monitor + metrics collector both call `podman stats` every 60 seconds independently. Add crash recovery snapshots, disk monitor, and frontend polling = constant subprocess spawning.
-4. **Uninstall doesn't clean up** - no volume removal, no network cleanup, force-kills stateful containers (risking wallet/DB corruption), returns 200 OK on partial failure.
-5. **Two divergent install paths** - `first-boot-containers.sh` and the Rust RPC installer use different passwords, ports, capabilities, memory limits, and Bitcoin config. They are never in sync.
-6. **UI misrepresents state** - `Exited` (even clean exit code 0) shows as "crashed". No "recovering" or "starting up" state exists. During boot recovery, UI shows a wall of red/gray "crashed" labels.
-7. **Dependency-blind restarts** - health monitor restarts services without restarting their dependencies first, so they immediately fail again and burn through the 3-attempt limit.
-
----
-
-## LIVE EVIDENCE: .228 Reboot on 2026-03-31
-
-After rebooting .228, here's the actual container state 30 minutes later:
-
-### Permanently Dead (exceeded 3 restart attempts, abandoned)
-| Container | Exit Code | Cause |
-|-----------|-----------|-------|
-| `indeedhub-postgres` | 0 (clean) | Shut down by reboot. Health monitor tried 3 restarts, it keeps exiting cleanly. Once abandoned, all dependent services die too. |
-| `indeedhub-redis` | 0 | Same - clean exit, 3 failed restart attempts, abandoned |
-| `indeedhub-minio` | 0 | Same |
-| `indeedhub-relay` | 0 | Same |
-| `indeedhub` | 0 | Same |
-| `indeedhub-api` | 1 | Can't resolve hostname `indeedhub-postgres` (postgres is dead, DNS entry gone from network) |
-| `jellyfin` | 137 (OOM) | "Failed to create CoreCLR" - memory limit too low for .NET runtime. SIGKILL = OOM. 3 attempts exhausted. |
-
-### Crash-Looping (still failing on every restart)
-| Container | Cause |
-|-----------|-------|
-| `mempool-api` | `ECONNREFUSED 10.89.0.42:3306` - DB (`archy-mempool-db`) just restarted, not ready yet |
-| `portainer` | "database schema version does not align with server version" - image upgraded, DB not migrated. Will NEVER recover. |
-| `photoprism` | "Failed creating test file in storage folder" - volume permission issue (rootless UID mapping) |
-
-### Never Started (stuck in "Created" state)
-| Container | Cause |
-|-----------|-------|
-| `archy-mempool-web` | "cannot assign requested address" - network binding failure |
-| `fedimint` | Same network error |
-
-### Running but Unhealthy
-| Container | Notes |
-|-----------|-------|
-| `homeassistant` | Up 14 min, health check failing |
-| `searxng` | Up 13 min, health check failing |
-| `onlyoffice` | Up 10 min, health check failing |
-
-### Actually Recovered (healthy)
-`filebrowser`, `bitcoin-knots`, `vaultwarden`, `nginx-proxy-manager`, `archy-btcpay-db`, `lnd`, `electrumx`, `grafana`
-
-### Key Observations
-1. 
**All containers have `unless-stopped` restart policy** - but this doesn't help because containers that exit cleanly (code 0) don't get restarted by Podman. The health monitor is the only restart mechanism, and it gives up after 3 attempts.
-2. **The entire IndeedHub stack died** because postgres was abandoned first. Once postgres hit 3 restart attempts, every dependent service (api, redis, minio, relay, main) also failed and hit their own 3-attempt limit. **No dependency awareness.**
-3. **Containers in "Created" state** were never even started - some kind of network assignment failure during creation. The health monitor doesn't handle "Created" state containers.
-4. **The UI showed ALL apps as "crashed"** during the first few minutes, even the ones that eventually recovered. This is because `Exited` state (even exit code 0) maps to the label "crashed" in `appsConfig.ts`.
-
----
-
-## Problem 1: Containers Don't Start or Recover After Reboot
-
-**Confirmed:** All apps crashed after .228 reboot on 2026-03-31.
-
-### Root Causes
-
-#### A. Crash recovery has a 30-second timeout that's too short
-**File:** `core/archipelago/src/crash_recovery.rs:265-271`
-```rust
-let result = tokio::time::timeout(
-    std::time::Duration::from_secs(30),
-    tokio::process::Command::new("podman").args(["start", &record.name]).output(),
-).await;
-```
-On a cold boot with many containers, Podman is under load. 30 seconds is not enough. If it times out, the container is **skipped** - no retry.
-
-#### B. If `podman ps` itself times out, recovery finds zero containers
-**File:** `core/archipelago/src/crash_recovery.rs:318`
-The `podman ps -a` call to discover stopped containers has a 30-second timeout. On a busy system post-reboot, this can time out. Result: `all_names` is empty, recovery silently exits having started nothing.
-
-#### C. 
Boot tier ordering uses a catch-all that misses dependencies
-**File:** `core/archipelago/src/crash_recovery.rs:374-385`
-```rust
-fn container_boot_tier(name: &str) -> u8 {
-    match id {
-        "btcpay-db" | "mempool-db" | ... => 0, // databases
-        "bitcoin-knots" | ... => 1, // bitcoin
-        "lnd" | "electrumx" | ... => 2, // depends on bitcoin
-        "mempool-web" | ... => 4, // frontend
-        _ => 3, // EVERYTHING ELSE - may start before its dependencies
-    }
-}
-```
-Any app not explicitly listed gets tier 3, which may be before its dependencies are ready.
-
-#### D. First-boot script swallows ALL errors
-**File:** `scripts/first-boot-containers.sh:8` - no `set -e`
-48+ commands have `|| true` appended. Every `podman run` failure is silently ignored. The script always exits 0 and reports "complete" to systemd even if 50% of containers failed.
-
-#### E. Install RPC returns success before container is actually running
-**File:** `core/archipelago/src/api/rpc/package/install.rs:260-294`
-After container creation, the installer polls for 30 seconds (6 checks x 5 seconds). If the container is still in "created" or "starting" state after 30 seconds:
-```rust
-if i == 5 {
-    debug!("Container {} health check timeout (30s) -- continuing anyway");
-}
-```
-It logs at debug level and **returns success**. The user sees "installed" but the container never actually started.
-
-### Fixes Required
-
-1. **Increase crash recovery timeout to 120s** and add retry with backoff (3 attempts per container)
-2. **Increase `podman ps` timeout to 60s** during boot recovery
-3. **Replace tier catch-all** - every container must be explicitly listed or derived from manifest dependencies
-4. **Remove `|| true`** from critical commands in first-boot-containers.sh. Use proper error handling: log the error, record the failure, continue to next container, but report actual failures at the end
-5. **Install RPC must return failure** if container isn't running after timeout, not silently succeed
-6. 
**Add `--restart unless-stopped`** to container creation in the Podman client (`core/container/src/podman_client.rs:303-335`) - currently missing, so Podman itself never auto-restarts crashed containers
-
----
-
-## Problem 2: Health Checks Are Fake
-
-### Root Causes
-
-#### A. "Healthy" just means "running" - application health is never checked
-**File:** `core/archipelago/src/container/dev_orchestrator.rs:239-249`
-```rust
-pub async fn get_health_status(&self, app_id: &str) -> Result {
-    match status.state {
-        ContainerState::Running => Ok("healthy".to_string()), // <-- THIS IS THE ENTIRE CHECK
-        ContainerState::Stopped | ContainerState::Exited => Ok("unhealthy".to_string()),
-        ...
-    }
-}
-```
-A container can be "running" but the application inside is completely broken. This is reported as "healthy".
-
-#### B. Manifest health checks exist but are never executed
-All 30+ app manifests in `image-recipe/build/debian-iso/custom/archipelago/apps/*/manifest.yml` define health checks like:
-```yaml
-health_check:
-  type: http
-  endpoint: http://localhost:4080
-  path: /api/health
-  interval: 30s
-  timeout: 5s
-  retries: 3
-```
-The `HealthMonitor` struct at `core/container/src/health_monitor.rs` can execute these checks. **But it is never instantiated.** No code path creates a `HealthMonitor` from the manifest health check definitions.
-
-#### C. Health status is never pushed to the frontend via WebSocket
-**File:** `core/archipelago/src/data_model.rs:120-127`
-```rust
-pub struct PackageDataEntry {
-    pub health: Option, // Field exists but is NEVER POPULATED
-}
-```
-The health field in the data model is always `None`. Frontend can only get health via explicit RPC call, which it almost never makes.
-
-#### D. Frontend never polls health status
-**File:** `neode-ui/src/stores/container.ts:169-175`
-`fetchHealthStatus()` is only called after `startContainer()` and `startBundledApp()`. There is **no setInterval, no periodic polling, no watch**.
After the initial call, health status is never refreshed.
-
-### Fixes Required
-
-1. **Wire up manifest health checks** - instantiate `HealthMonitor` from manifest definitions, run actual HTTP/exec probes instead of just checking `podman ps`
-2. **Populate the `health` field in `PackageDataEntry`** so WebSocket pushes real health status to frontend
-3. **Add 30-second health polling** in the frontend container store (with backoff to 60s when all healthy)
-4. **Fix `get_health_status()`** in dev_orchestrator to call actual health checks, not just check container state
-
----
-
-## Problem 3: CPU Exhaustion from Duplicate Polling
-
-### Root Causes
-
-#### A. Two independent monitors both call `podman stats` every 60 seconds
-- **Health monitor:** `core/archipelago/src/health_monitor.rs:17` - `CHECK_INTERVAL_SECS = 60`
-  - Runs `podman ps -a --format json` (line 305-323)
-  - Runs `podman stats --no-stream` every 5 cycles (line 442-450)
-- **Metrics collector:** `core/archipelago/src/monitoring/mod.rs:28` - 60-second interval
-  - Runs `podman stats --no-stream --format json` independently (collector.rs:220-224)
-
-These are **not coordinated**. Both spawn separate subprocesses. On a system with 15+ containers, each `podman stats` call is expensive.
-
-#### B. Total subprocess spawning frequency
-| Component | Interval | What it runs |
-|-----------|----------|-------------|
-| Health monitor | 60s | `podman ps`, `podman stats` (every 5th), restart attempts |
-| Metrics collector | 60s | `podman stats` (duplicate!) |
-| Crash recovery snapshot | 120s | `podman ps` |
-| Disk monitor | 300s | `df`, `sudo dmesg`, potentially `podman image prune` |
-| Telemetry | 900s | `podman stats` (another duplicate) |
-| Systemd watchdog | 120s | sd_notify ping |
-| Frontend fleet polling | 60s | RPC calls that trigger more podman commands |
-
-That's roughly **one `podman` subprocess every 10-15 seconds** on average, plus all the triggered operations.
-
-#### C. 
No restart policy means polling-driven restarts
-**File:** `core/container/src/podman_client.rs:303-335`
-Container creation spec does NOT include `RestartPolicy`. Podman itself never restarts crashed containers. Instead, the health monitor's 60-second poll detects the crash and attempts a restart. This is far more CPU-intensive than Podman's built-in restart mechanism.
-
-#### D. Health monitor restart attempts with exponential backoff still spawn processes
-When a container fails, the health monitor tries restarts at 10s, 30s, 90s backoff. Each attempt spawns `podman start`, `podman inspect`, etc. If multiple containers are unhealthy, this multiplies.
-
-### Fixes Required
-
-1. **Deduplicate `podman stats`** - create a shared cache layer. One component fetches, others read from cache (TTL: 30s)
-2. **Add `RestartPolicy: unless-stopped` with MaxRetryCount: 5** to all container creation - let Podman handle restarts natively instead of polling
-3. **Increase health monitor interval to 120s** (60s is too aggressive when health checks are just `podman ps`)
-4. **Remove duplicate `podman stats`** call from metrics collector - share data with health monitor
-5. **Make frontend fleet polling viewport-aware** - only poll when user is actually viewing the fleet page
-6. **Batch all container queries** - use a single `podman ps -a --format json` per check cycle, shared across all consumers
-
----
-
-## Problem 4: Uninstall Doesn't Work
-
-### Root Causes
-
-#### A. No volume removal
-**File:** `core/archipelago/src/api/rpc/package/runtime.rs:172-289`
-The uninstall function stops containers, removes containers, releases ports, and attempts data directory cleanup. It **never removes Podman volumes**. Orphaned volumes accumulate forever.
-
-#### B. No network cleanup
-**File:** `core/archipelago/src/api/rpc/package/runtime.rs:172-289`
-Multi-container stacks create networks (`archy-net`, `immich-net`, `penpot-net`) during install (`stacks.rs:89, 211`).
These are **never cleaned up** during uninstall. Leftover networks can prevent reinstallation.
-
-#### C. Force-kills stateful containers without graceful shutdown
-**File:** `core/archipelago/src/api/rpc/package/runtime.rs:226`
-```rust
-let rm_out = tokio::process::Command::new("podman")
-    .args(["rm", "-f", name]) // -f = force kill
-    .output().await;
-```
-The code defines proper shutdown timeouts (Bitcoin: 600s, LND: 330s, databases: 120s) but only uses them for `stop`. The `rm -f` that follows **ignores these timeouts** and force-kills immediately. This risks corrupting Bitcoin's UTXO set, LND channel state, or database WAL.
-
-#### D. Returns 200 OK even on partial failure
-**File:** `core/archipelago/src/api/rpc/package/runtime.rs:268-289`
-```rust
-Ok(serde_json::json!({
-    "status": if errors.is_empty() { "uninstalled" } else { "partial" },
-    ...
-}))
-```
-Returns HTTP 200 with `"partial"` status. Frontend at `neode-ui/src/views/apps/useAppsActions.ts:74` doesn't check for "partial" - it deletes the app from the UI regardless.
-
-#### E. Data directory cleanup requires sudo and fails silently
-**File:** `core/archipelago/src/api/rpc/package/runtime.rs:256-265`
-```rust
-let rm_out = tokio::process::Command::new("sudo")
-    .args(["rm", "-rf", dir]).output().await;
-if let Ok(o) = rm_out {
-    if !o.status.success() {
-        tracing::warn!(...); // Warning only, continues
-    }
-}
-```
-If sudo isn't configured or fails, data remains on disk but UI shows "uninstalled".
-
-#### F. Container name detection has gaps
-**File:** `core/archipelago/src/api/rpc/package/config.rs:287-340`
-Container names are hardcoded patterns. If a container was created with a different naming convention (e.g., by first-boot-containers.sh vs RPC installer), it won't be found and won't be removed.
-
-### Fixes Required
-
-1. **Add `podman volume rm`** for all volumes associated with the app after container removal
-2. 
**Add network cleanup** - remove app-specific networks after all containers on that network are gone
-3. **Use `podman stop -t {timeout}` then `podman rm`** (without -f) - respect graceful shutdown timeouts, especially for Bitcoin/LND/databases
-4. **Return an error (not 200)** when uninstall has failures. Frontend must check and display errors
-5. **Surface "partial" failures to the user** with specific error messages
-6. **Unify container naming** - derive names from a single source (manifest), not hardcoded patterns in multiple files
-
----
-
-## Problem 5: Two Divergent Install Paths
-
-The first-boot bash script and the Rust RPC installer create containers with **different configurations**. This is a major source of bugs.
-
-### Specific Divergences
-
-#### A. Database passwords
-- **First-boot** (`scripts/first-boot-containers.sh:118-127`): Generates random passwords with `openssl rand -base64 24`, stores in `/var/lib/archipelago/secrets/`
-- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:456,484,514-515,610`): Uses hardcoded `"btcpaypass"`, `"mempoolpass"`, `"rootpass"`, `"immichpass"`
-
-**Result:** Apps installed via RPC after first-boot can't connect to databases because passwords don't match.
-
-#### B. Bitcoin configuration
-- **First-boot** (`scripts/first-boot-containers.sh:295-313`): Dynamically sets `-prune=550` on small disks, `-txindex=1` on large disks
-- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:415-420`): No custom args at all
-
-**Result:** Bitcoin installed via RPC has no pruning or txindex regardless of disk size.
-
-#### C. 
ZMQ configuration for LND
-- **First-boot** (`scripts/first-boot-containers.sh:100-114`): Bitcoin.conf generated without ZMQ publisher settings
-- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:438-439`): LND configured to connect to `tcp://bitcoin-knots:28332` and `tcp://bitcoin-knots:28333`
-
-**Result:** LND can't receive block notifications from Bitcoin because ZMQ isn't configured on either path.
-
-#### D. Port conflicts
-- **First-boot** (`scripts/first-boot-containers.sh:813,835`): Both strfry and indeedhub bind to host port 7777
-- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:734`): IndeedHub uses `8190:3000`
-
-**Result:** On first-boot, whichever of strfry/indeedhub starts second fails. Via RPC, different port entirely.
-
-#### E. Memory limits
-- **First-boot** (`scripts/first-boot-containers.sh:253-283`): Ollama gets 1g on low-mem systems
-- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:245-280`): Ollama gets 4g always
-
-**Result:** Same app gets different resource limits depending on how it was installed.
-
-#### F. Version mismatches in marketplace UI
-- `scripts/image-versions.sh:17`: LND image is `v0.18.4-beta`
-- `neode-ui/src/views/marketplace/marketplaceData.ts:155`: Shows `0.17.4`
-- `scripts/image-versions.sh:21-22`: Mempool images are `v3.0.0`
-- `neode-ui/src/views/marketplace/marketplaceData.ts:177`: Shows `2.5.0`
-
-### Fixes Required
-
-1. **Single source of truth for container config** - Rust config must read passwords from `/var/lib/archipelago/secrets/`, not hardcode them
-2. **Add ZMQ config** to Bitcoin startup in both paths: `zmqpubrawblock=tcp://0.0.0.0:28332` and `zmqpubrawtx=tcp://0.0.0.0:28333`
-3. **Fix port 7777 conflict** - assign unique ports to strfry and indeedhub
-4. **Add disk-aware Bitcoin config** to Rust installer (prune/txindex based on disk size)
-5. **Sync memory limits** between first-boot and Rust config
-6. 
**Update marketplace version strings** to match actual image versions in `image-versions.sh`
-7. **Long-term: eliminate first-boot-containers.sh** - have the backend handle all container creation using the same Rust code path
-
----
-
-## Problem 6: Post-Install Hooks Run Async and Fail Silently
-
-**File:** `core/archipelago/src/api/rpc/package/install.rs:541-625`
-
-Post-install hooks (setting FileBrowser password, configuring NextCloud, etc.) are spawned as background tasks:
-```rust
-tokio::spawn(async move {
-    let _ = tokio::fs::create_dir_all(secret_dir).await;
-    let _ = tokio::fs::write(...).await;
-});
-```
-
-The install RPC returns success **before hooks complete**. If a hook fails (network timeout, service not ready), the error is logged but the user is told installation succeeded. Credentials aren't set, configs aren't applied.
-
-### Fix Required
-
-Await post-install hooks before returning success, or return a "configuring" status and let the frontend poll for completion.
-
----
-
-## Problem 7: Podman Client Swallows Errors
-
-**File:** `core/container/src/podman_client.rs`
-
-#### A. JSON serialization failures return empty strings (line 182-183)
-```rust
-let body_str = body.map(|b| serde_json::to_string(&b).unwrap_or_default()).unwrap_or_default();
-```
-
-#### B. Container ID parsing failures return empty string (line 344-348)
-```rust
-let id = result["Id"].as_str().unwrap_or("").to_string();
-Ok(id) // Empty string = success?
-```
-
-#### C. Socket timeout is only 5 seconds (line 154-160)
-On a busy system or during boot, Podman socket may take >5s to respond. Every API call fails. No retry logic.
-
-### Fixes Required
-
-1. Replace `.unwrap_or_default()` with proper error propagation using `?`
-2. Return `Err` when container ID is empty
-3. Increase socket timeout to 15-30s
-4. Add retry with backoff (3 attempts) on socket connection
-
----
-
-## Problem 8: UI Misrepresents Container State
-
-### Root Causes
-
-#### A. 
"Exited" always displays as "Crashed" β€” even for clean shutdowns -**File:** `neode-ui/src/views/apps/appsConfig.ts:119-146` -```typescript -getStatusLabel(state, health): - - "exited" β†’ "crashed" // <-- THIS IS THE PROBLEM -``` -Every container that exited β€” whether from a clean reboot (exit 0), OOM kill (exit 137), or app error (exit 1) β€” shows the same "crashed" label. After a reboot, the UI is a wall of "crashed" labels even though containers are in the process of starting up. - -#### B. No "recovering" or "boot in progress" state exists -**File:** `core/archipelago/src/data_model.rs:103-119` -PackageState enum has `Starting`, but it's only set during **explicit user start actions**, not during automatic crash recovery. During boot recovery, containers transition from `Exited β†’ Running` without ever passing through `Starting`, so the UI never shows a spinner or "starting up" message. - -#### C. Backend skips sub-containers from package listing, so their state is invisible -**File:** `core/archipelago/src/container/docker_packages.rs:39-117` -The excluded_services list filters out backend services like `mempool-db`, `btcpay-db`, `nbxplorer`, `penpot-postgres`, etc. UI containers ending in `-ui` are also skipped. These containers are invisible to the user even when they're the actual cause of a stack failure (e.g., `indeedhub-postgres` being dead kills the entire IndeedHub stack, but only `indeedhub-api` errors are visible). - -#### D. No distinction between "needs manual intervention" and "will recover soon" -The UI shows the same visual treatment for: -- Portainer (DB migration error β€” will NEVER recover without manual intervention) -- mempool-api (DB not ready yet β€” will recover in 30 seconds) -- IndeedHub (dependencies abandoned β€” won't recover until deps are manually restarted) - -### Fixes Required - -1. **Differentiate exit codes**: Exit 0 = "stopped" (gray), Exit non-zero = "crashed" (red), Exit 137 = "killed (OOM)" (red with warning) -2. 
**Add a "recovering" state**: During boot/crash recovery window (first 5 minutes after backend start), show "Starting up..." instead of "crashed" for exited containers
-3. **Show sub-container health**: When a parent app is unhealthy, show which sub-service caused the failure (e.g., "IndeedHub: postgres is down")
-4. **Distinguish recoverable from permanent failures**: After health monitor gives up (3 attempts), change label to "Needs attention" instead of keeping "crashed"
-5. **Add recovery progress indicator**: During boot, show "Recovering containers: 15/22 started" on the dashboard
-
----
-
-## Problem 9: Dependency-Blind Restarts
-
-### Root Cause (Confirmed by .228 reboot)
-
-The health monitor restarts containers individually without considering dependencies. This was proven by the IndeedHub stack failure:
-
-1. `indeedhub-postgres` exits cleanly (code 0) on reboot
-2. Health monitor restarts postgres — it starts, but exits again (likely needs volume mount or network ready)
-3. After 3 attempts, postgres is **abandoned**
-4. Meanwhile, `indeedhub-api` tries to connect to postgres → `ENOTFOUND indeedhub-postgres` → exits
-5. Health monitor restarts api → same DNS failure → exits
-6. After 3 attempts, api is **abandoned**
-7. Same cascade for redis, minio, relay, main container — all abandoned within minutes
-
-**File:** `core/archipelago/src/health_monitor.rs:500-530`
-The restart loop treats each container independently. There's no logic to:
-- Check if a container's dependencies are running before restarting it
-- Restart dependencies first when a dependent container fails
-- Reset attempt counters when a dependency comes back online
-
-**3 attempts is too few**, especially when dependencies need time:
-- Attempt 1: 10s backoff → dependency still starting
-- Attempt 2: 30s backoff → dependency crashed and is being restarted
-- Attempt 3: 90s backoff → dependency hit its own 3-attempt limit and was abandoned
-- Game over.
Entire stack is dead.
-
-### Fixes Required
-
-1. **Dependency-aware restart ordering**: Before restarting a container, check if its dependencies are running. If not, restart dependencies first.
-2. **Increase max restart attempts to 5-10** for containers with dependencies
-3. **Reset attempt counters** when a dependency comes back online (the dependent container failed because of the dependency, not itself)
-4. **Add a "stack restart" concept**: When restarting any container in a multi-container stack (indeedhub, mempool, btcpay, immich, penpot), restart the entire stack in dependency order
-5. **Handle "Created" state containers**: `archy-mempool-web` and `fedimint` are in "Created" state (never started). The health monitor should detect these and attempt to start them.
-
----
-
-## Priority Order for Fixes
-
-### P0 — System is broken without these (reboot = broken system)
-1. **Dependency-aware restarts** in health_monitor.rs — restart dependencies before dependents, reset attempt counters when deps recover
-2. **Increase max restart attempts to 10** (currently 3) — dependency chains need more time on boot
-3. **Handle "Created" state** — containers stuck in Created are never started by health monitor
-4. **Fix UI state labels** — "exited" code 0 should say "stopped", not "crashed". Add "recovering" state during boot window.
-5. Fix Rust config to read secrets from `/var/lib/archipelago/secrets/` instead of hardcoded passwords
-6. Fix port 7777 conflict (strfry vs indeedhub)
-7. Add ZMQ config to Bitcoin for LND block notifications
-
-### P1 — Core functionality broken
-8. Wire up manifest health checks (replace fake "running = healthy" with actual HTTP/exec probes)
-9. Fix uninstall to clean up volumes, networks, and respect graceful shutdown timeouts
-10. Return actual errors from install/uninstall instead of silent success on partial failure
-11. Remove `|| true` from critical first-boot commands
-12.
Show sub-container health in UI (which dependency is actually broken)
-
-### P2 — Performance and CPU
-13. Deduplicate `podman stats` calls (health monitor + metrics collector both call every 60s independently)
-14. Increase health monitor interval to 120s
-15. Add frontend health polling via WebSocket push (populate `health` field in data model)
-16. Make fleet polling viewport-aware (don't poll when user isn't viewing)
-
-### P3 — Consistency and correctness
-17. Sync memory limits between first-boot and Rust config
-18. Update marketplace version strings (LND shows 0.17.4, actual is 0.18.4; Mempool shows 2.5.0, actual is 3.0.0)
-19. Unify container naming conventions between first-boot script and Rust config
-20. Add disk-aware Bitcoin config (prune/txindex) to Rust installer
-21. Distinguish "needs manual intervention" from "will recover soon" in UI
-
----
-
-## Key Files to Modify
-
-| File | What to fix |
-|------|-------------|
-| `core/archipelago/src/health_monitor.rs` | Dependency-aware restarts, increase MAX_RESTART_ATTEMPTS to 10, handle Created state, deduplicate with metrics collector |
-| `core/container/src/podman_client.rs` | Add RestartPolicy to container creation spec, fix `.unwrap_or_default()` error swallowing, increase socket timeout to 15-30s |
-| `core/archipelago/src/crash_recovery.rs` | Increase timeouts to 120s, add retry with backoff, fix tier ordering catch-all |
-| `core/archipelago/src/api/rpc/package/install.rs` | Return failure on timeout (not silent success), await post-install hooks |
-| `core/archipelago/src/api/rpc/package/runtime.rs` | Add volume/network cleanup on uninstall, use `podman stop -t` then `podman rm` (not `-f`), return errors on partial failure |
-| `core/archipelago/src/api/rpc/package/config.rs` | Read secrets from disk, fix port 7777, add ZMQ config, sync memory limits |
-| `core/archipelago/src/container/dev_orchestrator.rs` | Wire up manifest-defined health checks instead of just checking podman state |
-|
`core/archipelago/src/container/docker_packages.rs` | Stop filtering sub-containers from state — or expose their health as part of parent app status |
-| `core/archipelago/src/data_model.rs` | Populate `health` field for WebSocket push, add exit code to state |
-| `core/archipelago/src/monitoring/mod.rs` | Share podman stats data with health monitor instead of duplicate subprocess calls |
-| `neode-ui/src/views/apps/appsConfig.ts` | Fix state labels: exit 0 = "stopped", exit non-zero = "crashed", add "recovering" during boot window |
-| `neode-ui/src/stores/container.ts` | Add periodic health polling (30s) |
-| `neode-ui/src/views/apps/useAppsActions.ts` | Check for "partial" uninstall status, show errors to user |
-| `neode-ui/src/views/marketplace/marketplaceData.ts` | Fix version strings to match image-versions.sh |
-| `scripts/first-boot-containers.sh` | Remove `\|\| true` from critical commands, fix port 7777 conflict, add proper error reporting |
diff --git a/docs/CONTAINER_LIFECYCLE_HANDOFF.md b/docs/CONTAINER_LIFECYCLE_HANDOFF.md
deleted file mode 100644
index 00515e35..00000000
--- a/docs/CONTAINER_LIFECYCLE_HANDOFF.md
+++ /dev/null
@@ -1,1739 +0,0 @@
-# Container Lifecycle Handoff
-
-Last updated: 2026-06-08
-
-## 2026-06-08 `1.8-alpha` Release Gate Update
-
-- Target release is now `1.8-alpha`, including a cut and smoke-tested ISO after validation is green.
-- Current release readiness estimate is about `82%`.
-- Host reboot validation is not clean yet. User reported that a reboot test left IndeeHub stopped afterward, with many containers killed by SIGKILL during reboot/shutdown, one crash, and a couple stopped.
-- Treat post-reboot recovery as the active release blocker.
-- IndeeHub is not considered recovered unless:
-  - the stack containers recover after boot;
-  - `http://192.168.1.198:7778/` is reachable;
-  - the HTML includes `/nostr-provider.js`;
-  - `http://192.168.1.198:7778/nostr-provider.js` is served and looks like the Nostr signer bridge.
-- Local follow-up in progress: - - `core/archipelago/src/container/prod_orchestrator.rs` now hardens IndeeHub stack reconcile by starting existing backend containers through a user scope when possible, waiting for backend/API dependency readiness, restarting the frontend when it does not remain running/reachable, and checking host port `7778`; - - `tests/lifecycle/remote-lifecycle.sh` now validates the IndeeHub Nostr provider during launch probes; - - `core/container/src/manifest.rs` now has stricter package safety validation while preserving all current real manifests. -- Validation passed locally for this follow-up: - - `cargo fmt --manifest-path core/Cargo.toml --all`; - - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`); - - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`; - - filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` compiled and ran one matching existing test; - - `bash -n tests/lifecycle/remote-lifecycle.sh`; - - `git diff --check`. -- Passing criterion after deploy: - - minimum: 3 consecutive clean post-fix reboots, broad non-destructive lifecycle green after each; - - preferred before release: 5 consecutive clean post-fix reboots, broad lifecycle green after each; - - SIGKILL during shutdown is not automatically disqualifying if all managed apps recover and pass health/launch after boot, but any stopped/crashed/unreachable managed app after boot fails that iteration. -- Final release gate after reboot validation: cut the `1.8-alpha` ISO and smoke-test boot/install/backend/UI/catalog/focused app lifecycle. - -### 2026-06-08 Focused Blocker Validation After `06420c...` - -- Deployed backend `4108ca146b482c028ae8d7c4bec314b71ef3412f15efd2e61846a2c345b36aba`, then backend `06420c0377fff650a2bf3211f13c1e0754bf8df81345b8485f4c9a30cb552439` to `.198`. 
-- Both deploys restarted only `archipelago.service`; `archipelago-doctor.timer` and `archipelago-reconcile.timer` stayed inactive. No reboot and no broad Podman store/image commands were run. -- Local fixes included: - - targeted Podman remove fallback for stuck `removing/stopping` records; - - rootless Podman socket liveness check by Unix connection, not path existence; - - IndeeHub readiness fallback to platform network aliases when `getent` inside the API image cannot prove DNS; - - Tailscale launch harness now requires login/auth UI content; - - stricter manifest validation while preserving all real manifests. -- Validation passed locally: - - `cargo fmt --manifest-path core/Cargo.toml --all`; - - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`; - - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`); - - `bash -n tests/lifecycle/remote-lifecycle.sh`; - - `git diff --check`. -- `.198` is still not release-ready after `06420c...`: - - `indeedhub`: stuck `stopping`, launch `7778` returns `000`; - - `immich`: `starting`, launch `2283` returns `000`; - - `tailscale`: `running`, launch `8240` returns `000`; logs show `NeedsLogin`/`WantRunning=false`, and launch must present the Tailscale login/auth UI; - - `vaultwarden`: absent/not listed after start attempt, launch `8082` returns `000`; - - `portainer`: `running`, launch `9000` returns `000`; user confirmed Portainer environment wizard cannot connect to `unix:///var/run/docker.sock`; - - `btcpay-server`: not a current blocker; direct launch `23000` returned HTTP 200 and user confirmed the earlier report was wrong-server/slowness. -- Do not continue to reboot validation or ISO cutting until rootless Podman control-plane/socket health, stuck container-state cleanup, and app-screen launch contracts are fixed. 
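One of the local fixes listed in this section, checking rootless Podman socket liveness by Unix connection rather than by path existence, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names `socket_path_exists` and `socket_is_live` are hypothetical, and only Rust standard-library calls are used.

```rust
use std::os::unix::net::UnixStream;
use std::path::Path;

/// Weak check (hypothetical helper): only proves that something exists at
/// the path. A stale socket file left behind by a dead service still passes.
fn socket_path_exists(path: &Path) -> bool {
    path.exists()
}

/// Stronger check (hypothetical helper): the socket counts as live only if
/// a listener actually accepts a connection. Connecting to a stale socket
/// file or to a regular file fails, so both are reported as not live.
fn socket_is_live(path: &Path) -> bool {
    // The stream is dropped immediately; we only care whether connect succeeds.
    UnixStream::connect(path).is_ok()
}
```

A caller would pass the rootless Podman socket path; `$XDG_RUNTIME_DIR/podman/podman.sock` is the usual default, but that path is an assumption here because this section does not name it. The standard-library `connect` has no timeout of its own, so a production check would likely wrap it in a bounded wait.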
- -## 2026-06-08 `.198` Release Candidate State Check - -- Deployed backend hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412` to `.198` after the targeted image-probe mitigation. -- Previous live backend hash before deploy was `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`. -- Deployment notes: - - local release build passed: `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`; - - initial direct `cp` over `/usr/local/bin/archipelago` failed with `Text file busy`, after creating a timestamped backup; - - recovered by installing to `/usr/local/bin/archipelago.new`, atomically renaming it over `/usr/local/bin/archipelago`, and restarting only `archipelago.service`; - - no host reboot and no broad Podman store/image commands were run. -- Latest mitigation now live on `.198`: - - `core/container/src/runtime.rs` uses bounded targeted `podman image inspect` for `ContainerRuntime::image_exists()`; - - `core/archipelago/src/api/rpc/package/install.rs` uses bounded targeted `podman image inspect` for local fallback and post-pull verification; - - `core/archipelago/src/container/companion.rs` uses `podman image inspect` for companion image checks. -- Validation passed on live hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`: - - focused non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism,fedimint,indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`; - - broad non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`; - - `python3 scripts/check-app-catalog-drift.py --release --strict` reports `metadata_drift=0`, `missing_catalog=0`, `missing_manifests=0`. -- Final `.198` state: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. 
- - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`. - - `/`: `66%` used, about `9.6G` free. - - `/var/lib/archipelago`: `8%` used, about `375G` free. -- Startup logs still showed one known `podman ps -a --format json timed out after 30s` scan timeout followed by scan backoff; lifecycle validation passed anyway. Treat Podman socket/store health as a residual release risk, but release image probes are now quarantined from the known fragile image-existence/list commands. -- Remaining release gate: host reboot validation, only if explicitly approved. - -- Verified `.198` without running broad Podman store/image commands. -- Current local release binary and live `/usr/local/bin/archipelago` match hash `670a3e789540082437c7521cc5ad7a4c260f56ee8e0a9cf770160fa25b4e4644`. -- `archipelago.service` is active. -- `archipelago-doctor.timer` is inactive. -- `archipelago-reconcile.timer` is inactive. -- `/` is at `65%` used with about `9.9G` free. -- `/var/lib/archipelago` is at `10%` used with about `370G` free. -- Backend-restart validation was already recorded as passed in the release-candidate checkpoint. The remaining live validation gate is host reboot validation, only if explicitly approved. -- Continue avoiding `podman image list`, `podman system df`, broad `podman image exists`, `podman image prune`, and `podman volume prune` on `.198` while the store/socket health risk is unresolved. - -## 2026-06-08 Local Release Gate Completion - -- No `.198` host actions were performed in this pass: no reboot, no timer changes, no deploy, no Podman store-wide commands. -- Fixed scanner skip/backoff wakeups so skipped scans still advance the scan-completion watch counter for install/update waiters. 
-- Fixed local full-test blockers: - - crash-recovery unit tests now pass the `include_stack_members` flag and cover generic-vs-stack recovery behavior; - - runtime manifest-port lookup checks the workspace `apps/` directory via `CARGO_MANIFEST_DIR`, so new public manifests are visible from test/runtime working directories; - - journal disk usage parsing accepts compact `journalctl` output such as `463.9M`; - - boot-reconciler cadence tests bypass the global crash-recovery wait gate when using the existing test-only `without_companion_stage()` helper. -- Local validation passed: - - `cargo fmt --manifest-path core/Cargo.toml --all`. - - `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`). - - `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`43 passed`). - - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`. - - `cargo check --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security`. - - `cargo test --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security` (`12 security tests passed`; performance has no tests). - - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. - - `python3 scripts/generate-app-catalog.py` (`updated 0 fields`). - - `python3 scripts/check-app-catalog-drift.py --release --strict`. - - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`. - - `git diff --check`. - - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`. -- Remaining live gate is unchanged: host reboot validation on `.198`, only if explicitly approved. - -## 2026-06-08 Frontend Release Gate Completion - -- No `.198` host actions were performed in this pass: no reboot, no timer changes, no deploy, no Podman store-wide commands. 
-- Fixed mobile app-launch behavior in `neode-ui/src/stores/appLauncher.ts`:
-  - desktop still opens X-Frame-Options/new-tab apps directly in a new tab;
-  - mobile now routes those same apps through `app-session` so app icons keep users inside Archipelago;
-  - router return-path handling is defensive when `currentRoute` is unavailable.
-- Updated frontend tests for current launch behavior and fixed async/Pinia fixture setup.
-- Local validation passed:
-  - `npm run type-check`.
-  - `npm test` (`548 passed`).
-  - `npm run build`.
-  - `python3 scripts/generate-app-catalog.py` (`updated 0 fields`).
-  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
-  - `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
-  - `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
-  - `git diff --check`.
-- Local caveat: `npm ci` failed before checks because existing `neode-ui/node_modules/@alloc` entries are `root:root`; do not mutate ownership or remove the tree without explicit approval.
-
-## 2026-06-08 Local Podman Store-Risk Cleanup
-
-- Reviewed release-relevant Podman store/image call sites without running broad Podman store/image commands on `.198`.
-- Bounded stack installer image pulls in `core/archipelago/src/api/rpc/package/stacks.rs` with `kill_on_drop` and a 600s timeout.
-- Bounded manual package update image pulls in `core/archipelago/src/api/rpc/package/update.rs` with `kill_on_drop` and a 600s timeout while preserving stderr progress parsing.
-- Validation passed locally:
-  - `python3 scripts/check-app-catalog-drift.py --release --strict`.
-  - `cargo fmt` from `core/`.
-  - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
-  - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
-- Local release binary hash after this cleanup is `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4`. -- This local build has not been deployed to `.198`; live `.198` remains on `670a3e789540082437c7521cc5ad7a4c260f56ee8e0a9cf770160fa25b4e4644` unless a later checkpoint says otherwise. - -## 2026-06-08 `.198` Podman Pull Hardening Deploy - -- Deployed backend hash `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4` to `.198`. -- Previous backend was backed up under `/usr/local/bin/archipelago.backup-20260608-store-risk-*` before replacement. -- Restarted only `archipelago.service`; no host reboot was performed. -- No broad Podman store/image commands were run. -- Initial `systemctl restart` exceeded the local 120s wrapper while startup was still in progress, but the backend reached `Server listening`, then systemd settled to `active/running`. -- Final `.198` state: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4`. - - `/`: `65%` used, about `9.8G` free. - - `/var/lib/archipelago`: `10%` used, about `370G` free. -- Validation passed: - - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint,immich,indeedhub,photoprism ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - `python3 scripts/check-app-catalog-drift.py --release --strict`. -- Remaining release gate: host reboot validation, only if explicitly approved. - -## 2026-06-08 `.198` App Health and Port Recovery - -- Deployed backend hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de` to `.198`. 
-- Fedimint Guardian and File Browser were reachable but UI package-data reported `health=starting`; backend scanner now normalizes reachable running apps to healthy and restores the launch URL when the direct port is reachable. -- Nostr relay had been using host port `8081`, which conflicted with Nginx Proxy Manager admin launch. Updated `apps/nostr-rs-relay/manifest.yml` to use host port `18081`. -- Recovered live Nostr/NPM state: - - Nginx Proxy Manager admin UI responds on `http://127.0.0.1:8081/`. - - Nostr relay responds on `http://127.0.0.1:18081/` with the expected Nostr-client message. -- Hardened legacy install runtime for scoped web apps: use `podman create` followed by `systemd-run --user --scope podman start` so containers are not coupled to `archipelago.service`, while install RPCs do not hang on scoped `podman run -d`. -- Recovered IndeedHub after broad validation found it stopped: - - `indeedhub-minio` had stopped, causing the frontend nginx container to exit with `host not found in upstream "minio"`. - - Restarted existing `indeedhub-minio` with preserved volume data and restarted the frontend. - - `http://127.0.0.1:7778/` returned HTTP `200` afterward. -- Validation passed: - - `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`. - - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`. - - `python3 scripts/check-app-catalog-drift.py --release --strict`. - - Focused lifecycle for `indeedhub,nginx-proxy-manager,nostr-rs-relay,fedimint,filebrowser`. - - Broad non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. -- Final `.198` state: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`. 
- - `/`: `65%` used, about `9.6G` free. - - `/var/lib/archipelago`: `10%` used, about `370G` free. -- Remaining release gate: host reboot validation, only if explicitly approved. - -## 2026-06-04 `.198` IndeedHub and Immich Lifecycle Recovery - -- Deployed backend hash `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9` to `.198`. -- Fixed IndeedHub frontend startup sequencing so network alias repair is only applied immediately before the frontend starts, after `indeedhub-minio`, `indeedhub-redis`, and `indeedhub-api` are running. -- Fixed Immich lifecycle recovery on `.198`: - - dependency readiness now accepts healthy Podman health state for `immich_postgres` and `immich_redis` before falling back to slower `podman exec` probes; - - `immich_server` startup now repairs `/var/lib/archipelago/immich` ownership through `podman unshare chown -R 0:0`, preserving existing upload data while matching the current rootless container user mapping; - - this resolved the observed `EACCES` failure writing `/usr/src/app/upload/encoded-video/.immich`. -- Diagnosis notes: - - Broad audit initially failed only on Immich (`state=exited`); focused Fedimint and NetBird audits passed. - - Patched dependency wait got lifecycle past dependencies to `Starting container: immich_server`. - - Upload ownership repair allowed Immich API and microservices to remain running; direct `http://127.0.0.1:2283/` returned HTTP `200`. -- Verification on this hash: - - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. - - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. - - Focused IndeedHub audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. 
- - Focused Fedimint audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. - - Focused NetBird audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=netbird ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. - - Focused Immich audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. -- Final `.198` state after validation: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9`. -- Residual risk: - - `.198` still shows intermittent `podman ps -a --format json timed out after 30s` and transient Bitcoin RPC timeouts under load; keep avoiding store-wide Podman commands and treat Podman socket/store health as a separate release hardening item. - -## 2026-06-03 `.198` Generic Host-Port Health Checkpoint - -- Latest local Podman store-risk mitigation, pending deploy to `.198`: - - `core/container/src/runtime.rs` now implements `ContainerRuntime::image_exists()` with bounded targeted `podman image inspect` instead of `podman image exists`. - - `core/archipelago/src/api/rpc/package/install.rs` now verifies local fallback images and post-pull images with bounded targeted `podman image inspect` instead of `podman images -q`. - - `core/archipelago/src/container/companion.rs` now uses `podman image inspect` instead of `podman image exists`. 
- - A grep across `core/**/*.rs` finds no live Rust call sites for `podman image exists` or `podman images -q`; only an explanatory comment remains. - - Validation passed: `cargo fmt --all --check`, `cargo check -p archipelago-container`, `cargo check -p archipelago`, `CARGO_INCREMENTAL=0 cargo check -p archipelago --tests`, `cargo test -p archipelago-container`, and whitespace check for the changed files. - - A filtered `cargo test -p archipelago install_fresh_build` did not reach execution due to local compile/link slowness/artifact failure; `--tests` compilation passed afterward. - -- Deployed backend hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` to `.198` after release code-review/refactor cleanup of legacy runtime host-port repair. -- Reduced duplicated app-specific port repair logic in `core/archipelago/src/api/rpc/package/runtime.rs`: - - legacy package start/restart repair now derives host ports from `apps/*/manifest.yml` when available; - - hardcoded ports remain only as fallback for legacy/non-manifest apps and for extra legacy cleanup ports such as Gitea `3000` and Nginx Proxy Manager `8084`/`8444`; - - the old duplicate Gitea cleanup helper was removed; - - focused unit coverage was added for manifest-derived runtime ports and legacy extra ports. -- Verification on this hash: - - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. - - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. - - Focused `runtime_host_ports` test was added but local `cargo test ... runtime_host_ports` did not complete within 5 minutes during compilation, consistent with known local test/linker slowness. - - Targeted PhotoPrism audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. 
- - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. -- Final `.198` state after validation: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b`. - -- Catalog metadata generation is now implemented: - - Added `scripts/generate-app-catalog.py` to sync manifest-owned fields into both `app-catalog/catalog.json` and `neode-ui/public/catalog.json` while preserving catalog-only presentation/runtime fields. - - Corrected stale manifest metadata for public catalog apps where the manifest was behind production catalog/image values: BotFights, IndeeHub, Gitea icon/repo, LND title/image, ElectrumX image, Fedimint image, and Mempool title/version/image. - - Ran generator; canonical and UI catalogs now match byte-for-byte. - - Release drift gate is green: `python3 scripts/check-app-catalog-drift.py --release --strict` reports `metadata_drift=0`, `missing_catalog=0`, `missing_manifests=0`. - - Validation passed: `jq empty app-catalog/catalog.json neode-ui/public/catalog.json`, `cargo test --manifest-path core/Cargo.toml -p archipelago-container`, `cargo check --manifest-path core/Cargo.toml -p archipelago`, and `npm run build` from `neode-ui`. - -- Deployed backend hash `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2` to `.198` after a narrow Podman store-risk hardening pass. -- Hardened fresh local-build installs so `podman image exists ` failures/timeouts no longer fail the lifecycle operation outright: - - existing timeout remains bounded in the runtime; - - `install_fresh()` now logs the check failure and rebuilds the local image instead; - - this matches the existing drift-restart path and keeps local image store checks from becoming release-blocking. 
-- Verification on this hash: - - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. - - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. - - Focused unit test `install_fresh_builds_when_image_exists_check_fails` was added but local `cargo test ...` did not complete within 15 minutes during compilation, consistent with known local test/linker slowness. - - Targeted PhotoPrism audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. - - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. -- Final `.198` state after validation: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2`. - -- Deployed backend hash `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36` to `.198` after hardening `container-health` fallback behavior. -- Fixed the broad lifecycle timeout path where `container-health` could return `Failed to get container health` even though the app endpoint was reachable: - - `cached_reachable_health()` now parses URL ports correctly when launch URLs include a trailing slash, such as `http://localhost:2342/`. - - The fallback port map now covers the lifecycle launch apps, including PhotoPrism `2342`, BTCPay `23000`, LND UI `18083`, Mempool `4080`, Electrum `50002`, Fedimint `8175`, Gitea `3001`, IndeedHub `7778`, Ollama `11434`, Vaultwarden `8082`, Tailscale `8240`, and others. - - Reachable cached-running apps can now return `healthy` without depending on flaky Podman health/inspect paths. 
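The trailing-slash port parsing and the reachability fallback described above could look roughly like this. It is a sketch under stated assumptions: `launch_url_port`, `tcp_reachable`, and `cached_health` are hypothetical names, the real `cached_reachable_health()` is Rust, and falling back to the scheme's default port is an assumption rather than documented behavior.

```python
import socket
from typing import Optional
from urllib.parse import urlsplit


def launch_url_port(url: str) -> Optional[int]:
    """Extract the TCP port from a launch URL, with or without a trailing slash."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # Assumption: use the scheme default when the URL carries no explicit port.
    return {"http": 80, "https": 443}.get(parts.scheme)


def tcp_reachable(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def cached_health(cached_state: str, launch_url: str) -> Optional[str]:
    """Return "healthy" for a cached-running app whose launch port accepts TCP.

    Returns None when the fallback does not apply, so the caller can use the
    slower Podman-based health path instead.
    """
    if cached_state != "running":
        return None
    port = launch_url_port(launch_url)
    if port is None:
        return None
    host = urlsplit(launch_url).hostname or "127.0.0.1"
    return "healthy" if tcp_reachable(host, port) else None
```

Returning `None` instead of `unhealthy` keeps the fallback strictly additive: it can confirm health, but it never overrides whatever Podman would report.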
-- Verification on this hash: - - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. - - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. - - Targeted PhotoPrism audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. - - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. -- Final `.198` state after validation: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36`. - - `/`: `62%` used, about `11G` free. - - `/var/lib/archipelago`: `9%` used, about `370G` free. -- Remaining blockers: - - Podman socket/store health is still a release risk; continue avoiding broad store/image commands on `.198`. - - Backend-restart and host-reboot validation are still pending and should be run only when approved. - -## 2026-06-03 `.198` Generic Host-Port Health Checkpoint In Progress - -- Deployed backend hash `3912b900c376b6c28bf5453640cae82135f67d7e0f984b8adcc78064b924143b` to `.198`. -- This pass is explicitly aligned with the migration objective: use generic platform primitives from manifest/container-declared ports instead of adding more OS-level or app-specific package edits. -- Broad lifecycle on previous hash `d21202cd...` failed only because Uptime Kuma briefly appeared as `stopping` during listener repair; it recovered immediately afterward with `3002` listening and HTTP `302`. -- Implemented generic health-monitor host-port awareness: - - Health monitor now parses Podman JSON `Ports` host TCP bindings for each container. 
- - A running container with declared host TCP ports is not considered healthy if those host listeners are missing. - - This avoids a hardcoded app-to-port list and makes missing pasta/rootless listeners a generic recovery concern. -- Also fixed scanner merge semantics: - - `Stopping -> Running` now recovers immediately when there is no user-stopped marker. - - User-initiated stops still preserve `Stopping` over live `Running` while the stop is in progress. -- Verification so far: - - `cargo check --manifest-path core/Cargo.toml -p archipelago` passed. - - `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed. - - Live service state after deploy: `archipelago.service` active; doctor/reconcile timers inactive. - - After backend restart, Uptime Kuma recovered its `3002` listener and returned HTTP `302`. -- Still in progress: - - Jellyfin is still running/healthy according to Podman but missing the `8096` host listener after backend restart. - - Next fix should keep the same generic direction: missing host listener repair should use the manifest/orchestrator-aware restart path for apps with declared ports, not another Jellyfin-specific OS edit. - - Broad lifecycle has not yet passed on `3912b900...`. - -## 2026-06-03 `.198` Stale State and Jellyfin Pasta Listener Repair - -- Deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` to `.198`. -- Fixed a focused lifecycle false-negative where `container-list` could report stale cached `exited` state while Podman scan backoff was active and the container had already recovered: - - Cached `exited` entries now get a targeted live refresh before being returned by `container-list`. - - This avoids broad `podman ps` scans and preserves the UI/package-data consistency model. 
-- Added a bounded `container-health` fallback for cached running web apps: - - If the cached app state is `Running` and its known local launch port accepts TCP, the RPC can return `healthy` without waiting on Podman inspect/list paths. - - This quarantines health reads from intermittent Podman socket/store stalls. -- Added Jellyfin to the legacy runtime host-port repair path: - - `runtime_required_host_port("jellyfin")` now maps to `8096`. - - stale pasta cleanup now includes `8096` for Jellyfin start conflicts. -- Validation notes: - - `package.restart jellyfin` exposed a remaining Podman socket/runtime failure after stopping the container: `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`. - - `package.start jellyfin` recovered the app afterward; `jellyfin` returned `Up ... (healthy)`, `ss` showed a `pasta.avx2` listener on `8096`, and `http://192.168.1.198:8096/` returned HTTP `302`. - - Focused lifecycle passed on the current hash: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - Endpoint checks after focused lifecycle: Uptime Kuma `3002` returned `302`; Jellyfin `8096` returned `302`; Filebrowser `8083` returned `404` at `/`, which is expected for this probe. - - `scripts/check-app-catalog-drift.py --release` still reports zero missing entries and `35` metadata drift items. -- Final `.198` state: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e`. - - `/`: `62%` used, about `11G` free. - - `/var/lib/archipelago`: `9%` used, about `371G` free. -- Remaining blocker: - - Broad lifecycle has not yet been rerun on `d21202cd...`. 
- - Podman socket/store health is still a release risk; avoid broad image/store commands and treat socket permission/runtime failures separately from app health. - -## 2026-06-03 `.198` Expanded Rollback Cleanup and Store-Safe Uninstall - -- Deployed backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b` to `.198`. -- Expanded `system.disk-cleanup` retention beyond `archipelago.backup-*` to cover alpha-era rollback artifacts: - - legacy `/usr/local/bin/archipelago.bak*` and `archipelago.before-*` files; - - old `/opt/archipelago/web-ui.bak*` and `web-ui.old` directories. -- Live cleanup reclaimed `10.3 GB` without touching Podman image/volume prune: - - `Removed old backend backups: 41.6 MB freed`. - - `Removed old legacy backend backups: 3.6 GB freed`. - - `Removed old web UI backups: 6.6 GB freed`. - - `Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes`. -- Root filesystem pressure is no longer a release blocker on `.198`: - - Before expanded cleanup: `/` was `99%` used with about `478-545M` free. - - After expanded cleanup: `/` is `61%` used with about `11G` free. - - `/usr/local/bin` dropped to about `336M`; `/opt/archipelago` dropped to about `1.1G`. -- Uninstall no longer runs global `podman volume prune -f`; app data removal remains explicit when `preserve_data=false`. -- Verification: - - `cargo build -p archipelago --bin archipelago --release` passed. - - Local `cargo test -p archipelago system::tests` did not complete within 10 minutes in this environment; release build succeeded and live cleanup validation passed. - - Focused post-cleanup lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. -- Final `.198` state: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. 
- - `/usr/local/bin/archipelago` sha256: `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b`. - -## 2026-06-03 `.198` Startup Scan Backoff and Uptime Kuma Pasta Repair - -- Deployed backend hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28` to `.198`. -- Startup adoption is now bounded with a 35s timeout so a stuck `podman ps -a --format json` cannot stall backend startup indefinitely. -- The initial container scan now seeds the same 300s Podman scan backoff used by periodic scans, preventing an immediate second `podman ps` after a startup timeout. -- Legacy pasta restart paths now use scoped `podman restart` instead of stop+start. This repairs cases where a running pasta container loses its host listener but `podman start` would be a no-op. -- Uptime Kuma validation: - - Before repair, the container was running and internally healthy on `127.0.0.1:3001`, but host port `3002` had no `pasta` listener and LAN launch failed. - - `package.restart` for `uptime-kuma` now returns `{"status":"restarted"}` instead of hanging. - - Post-restart `http://192.168.1.198:3002/` returned HTTP `302` and the scanner restored launch metadata. -- Release validation passed: - - Focused audit: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - Broad audit: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. -- Final `.198` state: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/`: still tight at `99%` used, about `395M` free. - - `/var/lib/archipelago`: about `10%` used. 
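The bounded startup scan and shared backoff from this checkpoint can be illustrated with a short sketch. `ScanBackoff`, `bounded_scan`, and `podman_ps` are hypothetical names. The 35s and 300s values come from the notes above. Treating non-timeout command errors the same as timeouts is an assumption.

```python
import subprocess
import time
from typing import Callable, Optional


class ScanBackoff:
    """Tracks a window during which broad Podman scans are skipped."""

    def __init__(self, backoff_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._backoff_s = backoff_s
        self._clock = clock
        self._blocked_until: Optional[float] = None

    def allowed(self) -> bool:
        return self._blocked_until is None or self._clock() >= self._blocked_until

    def record_failure(self) -> None:
        self._blocked_until = self._clock() + self._backoff_s


def podman_ps(timeout_s: float) -> str:
    """Run one bounded container listing and return its raw JSON output."""
    return subprocess.run(
        ["podman", "ps", "-a", "--format", "json"],
        capture_output=True, text=True, timeout=timeout_s, check=True,
    ).stdout


def bounded_scan(backoff: ScanBackoff,
                 run_scan: Callable[[float], str] = podman_ps,
                 timeout_s: float = 35.0) -> Optional[str]:
    """Scan once unless backing off; a failed scan seeds the backoff window."""
    if not backoff.allowed():
        return None
    try:
        return run_scan(timeout_s)
    except (subprocess.SubprocessError, OSError):
        # Assumption: timeouts and command errors are handled identically here.
        backoff.record_failure()
        return None
```

Sharing one `ScanBackoff` instance between the startup scan and the periodic scanner is what prevents an immediate second `podman ps` after a startup timeout.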
-- Residual risk: - - `.198` Podman store health remains fragile under broad store commands; avoid prune/image-list/system-df release operations until the store issue is handled separately. - - Logs during broad validation still showed unrelated IndeedHub/conmon cgroup permission noise, but focused and broad lifecycle audits passed. - -## 2026-06-02 `.198` Registry/Catalog and Lifecycle Checkpoint - -- Follow-up on Podman prune/catalog generation: - - Diagnosed the `podman image prune -f` failure and found it is broader than prune: `podman system df`, `podman image list`, `podman image exists`, and sometimes broad `podman ps`/`inspect` can hang on `.198` under current store/node load. - - Stopped only the diagnostic Podman commands started during this follow-up. - - Changed `system.disk-cleanup` to skip Podman image/volume prune entirely for the release path. Cleanup still handles logs, journal retention, temp files, and backend backup retention, and returns an explicit action: `Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes`. - - Deployed backend hash `c9695dc3db10ff6e593cdbcfbbdc94b2e98b6008aa62655bba51b9879b549e8c` to `.198`. - - Live cleanup validation passed: endpoint returned quickly, pruned old backend backups, did not spawn new Podman prune/list work, and `/` stayed around `98%` with about `647-670M` free. - - During diagnosis, Uptime Kuma's port returned empty responses. Restarted only `uptime-kuma` through `package.restart`; data preserved; launch returned HTTP `302` afterward. - - Focused post-repair audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - Broad post-repair audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. 
- - Final raw Podman bad-state sweep was clean. - - Catalog metadata generation is not implemented yet. The release-safe step in this pass is the new `scripts/check-app-catalog-drift.py --release` mode, which reports zero missing catalog/manifest entries while still surfacing metadata-only drift. - -- Release-work continuation after cleanup/catalog/review gate: - - Deployed backend hash `e285d421cef497beb6b4b929f36fb4296d6db1f4a4c786157b6751eec51619ca` to `.198`. - - `system.disk-cleanup` is now bounded so a slow `podman image prune -f` cannot wedge the cleanup RPC indefinitely; the prune failure is reported as an action while cleanup continues. - - `system.disk-cleanup` now vacuums systemd journals to a bounded size and prunes timestamped `/usr/local/bin/archipelago.backup-*` files to the newest three using the existing `host_sudo` path. - - Live cleanup validation passed: endpoint returned, journals were reduced to about `200M`, old backend backups were pruned to three, and `/` improved from about `99%`/`490M` free to `98%`/about `730M` free. - - Added `nostr-rs-relay` to both catalog surfaces. Release-focused catalog drift now has zero missing catalog/manifest entries; remaining drift is metadata-only and belongs to the catalog-generation follow-up. - - Focused post-cleanup audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,nostr-rs-relay,portainer ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. - - Broad post-cleanup audit passed with extended harness timeout: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - Final raw Podman sweep showed no `unhealthy`, `stopping`, `removing`, `exited`, `created`, or `initialized` containers. - - Final service state: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive. 
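The "keep the newest three" backup retention described above reduces to a small selection step. The sketch below is illustrative only: `select_prunable` and `list_backups` are hypothetical helpers, ordering by modification time is an assumption (the real code may order by the timestamp in the file name), and the actual removal goes through the existing `host_sudo` path, which is not reproduced here.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def select_prunable(entries: Iterable[Tuple[str, float]], keep: int = 3) -> List[str]:
    """Given (path, mtime) pairs, return the paths that fall outside the newest `keep`."""
    newest_first = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [path for path, _mtime in newest_first[keep:]]


def list_backups(directory: Path,
                 prefix: str = "archipelago.backup-") -> List[Tuple[str, float]]:
    """Collect timestamped backend backups as (path, mtime) pairs."""
    return [
        (str(path), path.stat().st_mtime)
        for path in directory.iterdir()
        if path.is_file() and path.name.startswith(prefix)
    ]
```

A caller would pass `list_backups(Path("/usr/local/bin"))` to `select_prunable` and hand the result to whatever privileged removal step the node uses.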
- -- Follow-up validation after the previous cutoff: - - `.198` is already running the current local release build hash `579b823cf4a4b8c50bb3d0c3d49449c58101b016eb6ebc8049975dce98e34265`; no backend replacement was performed in this pass. - - Local release binary smoke-started successfully on an alternate bind/data dir before live checks. - - Meshtastic manifest-owned file rendering is now proven live: `/var/lib/archipelago/meshtastic/config.yaml` was backed up, removed, and recreated by `package.restart` from `apps/meshtastic/manifest.yml`. - - Focused Meshtastic audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. - - Focused regression audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. - - Broad non-destructive audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. - - Final raw Podman sweep showed no `unhealthy`, `stopping`, `removing`, `exited`, `created`, or `initialized` containers. - - Service state remains deterministic-test safe: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive. - - `/` remains tight at `99%` used with about `490M` free. - -- Live `.198` state after this pass: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256 is now `579b823cf4a4b8c50bb3d0c3d49449c58101b016eb6ebc8049975dce98e34265`; no backend replacement was performed in this follow-up pass. - - `/`: still tight at `99%` used, about `490M` free. 
-- Registry state: - - Live `/var/lib/archipelago/config/registries.json` is already correct: `146.59.87.168:3000/lfg2025` is primary with `tls_verify: false`; `git.tx1138.com/lfg2025` is enabled as secondary with `tls_verify: true`. - - Added `meshtastic` and `portainer` to both `app-catalog/catalog.json` and `neode-ui/public/catalog.json` so migrated manifest-owned apps are present in the registry/catalog surface. -- Live recovery performed: - - Raw Podman sweep found `nextcloud` stuck in `Removing`. - - Removed only the wedged container record with `podman rm -f nextcloud`; bind-mounted data was preserved. -- Local verification passed: - - `jq empty app-catalog/catalog.json neode-ui/public/catalog.json`. - - `cargo test -p archipelago-container generated_files_must_live_under_bind_mounts`. - - `cargo test -p archipelago manifest_generated_files`. - - `cargo test -p archipelago reconcile_force_recreates_stopping_container`. - - `cargo test -p archipelago health_maps_states_to_strings`. - - `cargo test -p archipelago test_rewrite_image`. - - `cargo test -p archipelago test_load_default`. - - `cargo check -p archipelago --bin archipelago`. - - `cargo build -p archipelago --bin archipelago --release`, hash `13786fd7bc5afb36fb7873ad9aee1a54a696e75b0a92c2fcd90cc8100038a54c`. -- Live validation passed: - - Focused audit: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. - - Broad non-destructive audit: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. - - Final raw Podman sweep showed no `unhealthy`, `stopping`, `removing`, `exited`, `created`, or `initialized` containers. 
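The rule exercised by `generated_files_must_live_under_bind_mounts` can be approximated as a path-containment check. This is a sketch, not the Rust validator: `is_under_bind_mount` is a hypothetical name, and comparing against host-side bind sources (rather than container targets) is an assumption.

```python
from pathlib import PurePosixPath
from typing import Iterable


def is_under_bind_mount(file_path: str, bind_sources: Iterable[str]) -> bool:
    """Return True when file_path sits strictly inside one of the bind sources."""
    target = PurePosixPath(file_path)
    # Reject relative paths and any ".." component, which could escape the mount.
    if not target.is_absolute() or ".." in target.parts:
        return False
    for source in bind_sources:
        if PurePosixPath(source) in target.parents:
            return True
    return False
```

Comparing path components rather than string prefixes matters here: a sibling directory such as `/var/lib/archipelago/meshtastic-evil` shares a string prefix with the Meshtastic data directory but is not inside it.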
-- Remaining before release: - - The prior release-binary segfault is no longer reproducing with the current artifact; `.198` is active on hash `579b823cf4a4b8c50bb3d0c3d49449c58101b016eb6ebc8049975dce98e34265`. Continue watching logs after restarts, but do not treat `app.files` deployment as blocked. - - Add disk cleanup/backup retention policy; root filesystem pressure still makes deploys and image operations fragile. - - Resolve broader app catalog/manifest drift reported by `scripts/check-app-catalog-drift.py`; this pass only added the migrated Meshtastic and Portainer catalog entries. - -## 2026-05-28 `.198` Meshtastic File-Rendering Recovery Checkpoint - -- Current `.198` service state after recovery: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256 restored to `2ec1952dcc5f6101d236dd3ea7a85a40a6387a3f1afb8a5681345cad90306853` after a failed deploy attempt. - - `/`: still tight at `99%` used, about `546M` free. -- Local generated-file support status: - - Manifest schema supports `app.files`. - - Production orchestrator writes declared manifest files before create/start/restart and does not overwrite existing files unless `overwrite: true` is declared. - - Meshtastic manifest declares `/var/lib/archipelago/meshtastic/config.yaml` under its bind-mounted data directory. -- Local verification passed: - - `cargo test -p archipelago-container generated_files_must_live_under_bind_mounts`. - - `cargo test -p archipelago manifest_generated_files`. - - `cargo check -p archipelago --bin archipelago`. - - `cargo build -p archipelago --bin archipelago --release` produced local hash `13786fd7bc5afb36fb7873ad9aee1a54a696e75b0a92c2fcd90cc8100038a54c`. -- Live deploy caveat: - - Deploying the local release binary to `.198` caused immediate `SIGSEGV` on `archipelago.service` startup. 
- - The previous live binary was restored from `/usr/local/bin/archipelago.backup-20260528-container-files-2ec1952dcc5f6101d236dd3ea7a85a40a6387a3f1afb8a5681345cad90306853`; backend returned active. - - Do not redeploy that local release artifact blindly; diagnose the startup segfault/build mismatch first. -- Live Meshtastic recovery: - - Before recovery, `.198` had Meshtastic manifests with `files:` but no `/var/lib/archipelago/meshtastic/config.yaml`; container logs showed `No 'config.yaml' found` and `Blank MAC Address not allowed`. - - Wrote the same config currently declared by the manifest to `/var/lib/archipelago/meshtastic/config.yaml` as an operational recovery, then restarted `meshtastic.service`. - - Meshtastic returned `Up ... (healthy)`. -- Live validation passed: - - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. - - Raw Podman sweep showed Meshtastic, Jellyfin, File Browser, BTCPay, Grafana, SearXNG, Gitea, Nostr relay, Botfights, Portainer, Nginx Proxy Manager, and other active managed containers without unhealthy/stopping/removing/exited states. -- Next required work: - - Diagnose why the local release backend segfaults immediately on `.198` before deploying the generic manifest file renderer as the durable fix. - - After a safe backend deploy, remove reliance on the manually recovered Meshtastic config by proving the manifest-owned renderer recreates it on start/restart. - - Keep deterministic-test timers inactive unless intentionally running non-deterministic recovery testing. - -## 2026-05-27 `.198` Manifest-Orchestrator Migration Checkpoint - -- Current `.198` live backend: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `31ae1b346fd36d715c9fe7f0686dcb31a70d2fea44996abf122743d048fb7b2f`. 
-- Migration goal confirmed and advanced: apps should not require hardcoded OS/Rust edits to work. App differences belong in manifests; Rust/OS should provide generic primitives for lifecycle, Quadlet rendering, readiness/health, port repair, bind-mount prep, data ownership, and image availability. -- New generic backend fixes deployed: - - Quadlet health drift detection now compares `HealthCmd`, `HealthInterval`, `HealthTimeout`, and `HealthRetries`. - - HTTP health command rendering now derives `wget -T` / `curl -m` from manifest `health_check.timeout`; `timeout: 30s` now produces helper-level `30s` probes instead of an outer Podman `30s` wrapped around an inner `5s` command. - - Existing Quadlet unit drift that requires restart now verifies the manifest image exists locally and pulls/builds if missing before restarting. - - Existing Quadlet service start for a missing container now also verifies/pulls/builds the manifest image before `systemctl --user start`. - - Reconcile now treats manifest-declared dependencies of active apps as required even if stale `user-stopped.json` entries exist, and parent app reconcile drift-syncs existing dependency Quadlet units from their own manifests. - - Portainer host prep moved out of a hardcoded Rust install hook; generic bind-mount socket prep now handles manifest sources ending in `/podman.sock`. -- Manifest updates deployed to both `/opt/archipelago/apps` and `/opt/archipelago/web-ui/archipelago-runtime/apps`: - - `portainer`: declarative manifest with data dirs, Podman socket mount, capabilities, `data_uid`, `9000:9000`, and no Podman healthcheck. - - `btcpay-server`, `grafana`, `nostr-rs-relay`, `searxng`: HTTP health timeouts/retries loosened to `timeout: 30s`, `retries: 5` to avoid false negatives under `.198` load. - - `archy-nbxplorer` manifest has `timeout: 30s`, `retries: 5`; live unit now matches with helper-level `wget -T 30` / `curl -m 30`. -- Local verification passed: - - `cargo fmt`. 
- - `cargo test -p archipelago translate_health_check -- --nocapture` passed. - - `cargo check -p archipelago --bin archipelago` passed after each backend fix. - - `cargo build -p archipelago --bin archipelago --release` passed; final deployed binary hash is `31ae1b346fd36d715c9fe7f0686dcb31a70d2fea44996abf122743d048fb7b2f`. -- Live `.198` validation: - - Portainer full lifecycle passed earlier: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - BTCPay focused lifecycle passed after the missing-image start guard: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=btcpay-server ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - Focused migration audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=btcpay-server,grafana,nostr-rs-relay,searxng,portainer,gitea ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. - - Broad non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=600 tests/lifecycle/remote-lifecycle.sh`. - - Targeted unit/container sweep showed `btcpay-server`, `grafana`, `nostr-rs-relay`, `searxng`, and `portainer` services active. - - Post-focused and post-broad raw Podman sweeps found no `unhealthy`, `stopping`, `removing`, `exited`, `created`, or `initialized` containers. - - Raw states: `btcpay-server Up ... (healthy)`, `grafana Up ... (healthy)`, `nostr-rs-relay Up ... (healthy)`, `searxng Up ... (healthy)`, `portainer Up ...`. - - Generated units for `btcpay-server`, `grafana`, `nostr-rs-relay`, and `searxng` now show helper-level `wget -T 30` / `curl -m 30`, `HealthTimeout=30s`, and `HealthRetries=5`. 
- - Generated unit for `archy-nbxplorer` now also shows helper-level `wget -T 30` / `curl -m 30`, `HealthTimeout=30s`, and `HealthRetries=5`; BTCPay stack remained healthy. - - Filebrowser full lifecycle passed under the manifest/orchestrator path: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=filebrowser ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. - - Filebrowser post-test live verification: `filebrowser.service` active; bind mounts `/srv` and `/data` rendered; `Exec=--config /data/.filebrowser.json`; generated `.filebrowser.json` points database to `/data/filebrowser.db` and root to `/srv`; container is `Up ... (healthy)`. -- Operational caveat found: - - `.198` root filesystem remains tight: about `556M` free on `/` (`99%` used). There are many old backend backup binaries under `/usr/local/bin`; deploys and Podman image operations are fragile until backup/image cleanup policy is added. -- Remaining before release: - - Meshtastic full lifecycle now passed on `.198` after routing it through the orchestrator path and fixing its manifest image, device, volume target, health check, launch metadata handling, and TCP port declaration. - - Replace the temporary/manual Meshtastic host `config.yaml` dependency with the generic manifest-owned file rendering path: - - Added local schema support for `app.files`. - - Added local production-orchestrator rendering for declared files before container start. - - Added Meshtastic `files:` declaration for `/var/lib/archipelago/meshtastic/config.yaml`. - - Local manifest parser tests passed; backend orchestrator tests are still running before deployment. - - Latest post-Meshtastic raw `.198` sweep: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/`: 99% used, about `532M` free. 
- - `jellyfin` and `filebrowser` reported `unhealthy`; investigate before final release qualification. - - Add the release code-review/refactor/performance gate: remove dead transitional code, reduce remaining app-specific Rust/OS paths, review scan/health/reconcile performance, then rerun lifecycle and launch tests after cleanup. - -## 2026-05-26 Migration Release Notes - -- Active doctrine: app-specific host mutations should move out of generic Rust/OS install paths wherever possible. Apps should be described by manifests and lifecycle hooks; the Rust backend should provide generic primitives for validation, container lifecycle, health/readiness, port repair, secrets, data ownership, and recovery. -- Current `.198` work remains focused on lifecycle migration hardening first. Do not call the migration finished until focused full lifecycle and broad audits pass on the manifest/orchestrator-owned path. -- `.198` Gitea migration checkpoint: - - Backend deployed: `/usr/local/bin/archipelago` sha256 `3780e54eec4821a61fbc024259bd854ec376228eb981fa169ec6f8aeafc5a9dd`. - - Gitea manifest deployed to both `/opt/archipelago/apps/gitea/manifest.yml` and `/opt/archipelago/web-ui/archipelago-runtime/apps/gitea/manifest.yml`, latest sha256 `8df263fcca9581a4e0a2872d21d26eed35b007c7bd7475071bedfd005f514e68`. - - The Gitea fix is manifest-owned: `security.no_new_privileges` is now honored by the generic Podman/Quadlet renderers, and Gitea declares its required capabilities (`CHOWN`, `FOWNER`, `SETUID`, `SETGID`, `DAC_OVERRIDE`, `NET_BIND_SERVICE`) plus `no_new_privileges: false`. - - Focused full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=gitea ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. -- `.198` generic host-listener repair checkpoint: - - Backend deployed: `/usr/local/bin/archipelago` sha256 `be06756763283535d2b3ee911cc91c7d401fb51b4dd88a3ebe86d79a05183e84`. 
- - Running-container reconcile now probes manifest-declared host ports and repairs missing listeners generically; observed repair restored Grafana port `3000` without a Grafana-specific OS edit. - - Uptime Kuma repair uses a longer readiness window so the generic repair path does not restart it before its slow HTTP startup completes. - - Gitea healthcheck timeout/retries were loosened in manifest metadata (`timeout: 30s`, `retries: 5`) after raw Podman health showed timeout-only false negatives while HTTP launch returned `200`. - - Focused audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=gitea,grafana,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. -- Release follow-ups to keep in scope after the current Gitea/Uptime/Nextcloud migration pass: - - Portainer fixes discussed on 2026-05-26 must be carried into the new declarative approach, not left as a hardcoded OS prerequisite path. Completed for the current `.198` pass: - - Added `apps/portainer/manifest.yml` with manifest-declared data dirs, Podman socket mount, port `9000`, capabilities, `data_uid`, and no Podman healthcheck. - - Removed the hardcoded `ensure_portainer_host()` OS/Rust install hook. - - Added generic manifest-driven Podman socket preparation for any app that bind-mounts `podman.sock`. - - Backend deployed: `/usr/local/bin/archipelago` sha256 `d440e2cba52c6e1b60d8f0716386b0f4e3ce56b5370cedafabc6dbd30d230909`. - - Portainer manifest deployed to both `/opt/archipelago/apps/portainer/manifest.yml` and `/opt/archipelago/web-ui/archipelago-runtime/apps/portainer/manifest.yml`, latest sha256 `5e2ab96f2ba91ad2539a7dc6b73c92c6cece676109550d7d4c2f556aa578ba9c`. - - Focused full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`. 
- - Re-test the Filebrowser fixes under the manifest/orchestrator path. - - Re-test the Meshtastic fixes before final release qualification. - - Add an app packaging documentation gate: update `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` so third-party developers can package apps against the current manifest/runtime contract without relying on one-off OS-level changes. - - Add a required release code-review/refactor gate before cutting `1.8-alpha`: remove dead transitional code, replace remaining app-specific Rust/OS paths with manifest-owned metadata or generic lifecycle primitives, review scan/health/reconcile performance, then rerun lifecycle and launch tests after the cleanup. - -## 2026-05-13 `.198` Stopping-State Repair Checkpoint - -- User directive confirmed: testing target is `.198` until all containers work and the container layer is bulletproof/perfected. -- `.198` service state after this pass: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `5d3777d928ae6ee7627e9401faf932442806020ab7ad7a439eb7384d8eb7b8e6`. -- Live blocker found and repaired: - - `nostr-rs-relay` was stuck in raw Podman state `Stopping (healthy)`; focused lifecycle audit failed with `bad state: nostr-rs-relay is stopping`. - - Removed only the wedged container record with `podman rm -f nostr-rs-relay`; bind-mounted relay data under `/var/lib/archipelago/nostr-relay` was preserved. - - Archipelago/runtime recreated the relay and it returned `Up ... (healthy)`. -- Durable local fix added and deployed: - - `core/archipelago/src/container/prod_orchestrator.rs` now treats `ContainerState::Stopping` as a wedged container record during reconcile and force-recreates it from the manifest instead of trying a normal start. - - Added unit coverage intent: `reconcile_force_recreates_stopping_container`. 
- - `cargo check -p archipelago --bin archipelago` passed locally. - - `cargo build -p archipelago --bin archipelago --release` passed locally and was deployed to `.198`. - - Rust test binary build for the targeted unit test timed out during compilation in this environment before emitting compiler errors; use `cargo check` plus live `.198` audit as the validated gate for this pass. -- Post-deploy validation on `.198`: - - Focused audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=mempool,searxng,nginx-proxy-manager,nostr-rs-relay,grafana,btcpay-server ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=180 tests/lifecycle/remote-lifecycle.sh`. - - Broad audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=240 tests/lifecycle/remote-lifecycle.sh`. - - Raw Podman sweep found no `unhealthy`, `Stopping`, `Removing`, `Exited`, or `Created` containers after post-restart startup settled. - - Direct HTTP probes returned healthy responses (`200` or expected `302`) for dashboard, bitcoin-ui, lnd-ui, btcpay, indeedhub, botfights, gitea, filebrowser, vaultwarden, searxng, fedimint, jellyfin, immich, homeassistant, grafana, tailscale, uptime-kuma, nextcloud, nginx-proxy-manager, and nostr-rs-relay. -- Current `.198` broad audit state: - - Running: `bitcoin-knots`, `lnd`, `btcpay-server`, `indeedhub`, `botfights`, `gitea`, `filebrowser`, `vaultwarden`, `searxng`, `fedimint`, `jellyfin`, `immich`, `homeassistant`, `grafana`, `tailscale`, `uptime-kuma`, `nextcloud`. - - Absent/expected in this audit: `bitcoin-core`, `mempool`, `electrumx`, `photoprism`. -- Important observation: - - Immediately after backend restart, `bitcoin-knots` briefly appeared `Exited (137)` during startup/recovery, then self-recovered and was running by inspection. Final broad audit and raw sweep were clean. 
-- Next recommended gate: - - Run destructive/full lifecycle on `.198` only when ready to intentionally cycle app containers; non-destructive broad audit and raw health are green after the stopping-state fix. - -## 2026-05-13 Resume Correction - -- User directive: "we're testing on .198 right, until all containers are working and we achieve our goal of bulletproof containers". -- Active target remains `.198`; do not drift back to older `.116`/`.228` release threads except for cross-node context. -- Continue lifecycle hardening until every intended `.198` container/app is working, recoverable, and aligned with the bulletproof-container goal. - -## Resume Prompt - -```text -Resume Archipelago lifecycle hardening from /home/archipelago/Projects/archy. Read docs/CONTAINER_LIFECYCLE_HANDOFF.md first. Active mission is node `192.168.1.198`, not the older `.116/.228/.67` release thread. SSH key is `/home/archipelago/.ssh/id_ed25519`; lifecycle password is `password123`. Preserve data unless explicitly told otherwise. Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive during deterministic testing. Do not revert unrelated dirty worktree changes because another agent/user may be working too. - -Mission: make every Archipelago app/container on `.198` lifecycle-safe and power-loss/reboot resilient. Containers should not randomly go down; app state must recover through daemon restarts, reboots, stale Podman/Quadlet state, missing host listeners, stuck installs, stopped/exited state drift, and stale stack/container records. Release is blocked until strict lifecycle plus app-specific reachability/launch probes agree with raw Podman health and actual app behavior. - -Latest live `.198` status from 2026-05-11: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; deployed `/usr/local/bin/archipelago` sha256 `ed4df8e4c3c0a12a481ea41f8246da4b5f9e9ad931d0f3f58084b0057c330af0`. 
Broad audit passed with `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=180 tests/lifecycle/remote-lifecycle.sh`, but this is not enough for release because raw Podman still reports health/state mismatches. - -Current suspected release blocker: reconcile the broad-audit pass with raw Podman health. On `.198`, `mempool-api` is `Up ... (unhealthy)`, `searxng` is `Up ... (unhealthy)`, `botfights` is `Up ... (unhealthy)`, and `nostr-rs-relay` is `Stopping (unhealthy)`. RPC/package state reports the installed audit set as running, so next work is to diagnose these health/state mismatches, decide whether each is a false-negative healthcheck or real app failure, fix the manifest/runtime/reconcile behavior, then rerun focused full lifecycle and browser/direct launch probes for affected apps. - -Known `.198` package state from latest broad audit: running `lnd`, `mempool`, `indeedhub`, `botfights`, `gitea`, `filebrowser`, `vaultwarden`, `searxng`, `fedimint`, `jellyfin`, `immich`, `homeassistant`, `tailscale`, `uptime-kuma`, `nextcloud`; absent `bitcoin-knots`, `bitcoin-core`, `btcpay-server`, `electrumx`, `grafana`, `photoprism`. Some absences are expected/blockers from earlier qualification, but `btcpay-server` and `grafana` had previously passed focused checks, so verify whether their absence is intentional before release. - -Regenerated release artifacts: -- `releases/v1.7.54-alpha/archipelago`: `77e3a236a6196a5ab9ec2411b150490e78ffc95ea6ab8eb34ab29b3df53cd632` -- `releases/v1.7.54-alpha/archipelago-frontend-1.7.54-alpha.tar.gz`: `a010ac43a2dd02f528202cb2f7b99b61ceab80adc6827877594e41df4ea951fb` -- `releases/manifest.json` and `release-manifest.json`: `0fb73c808ef87c1535c5e5f560ea331bacaded86c8c81abd5cdd2893a0415b6f` -- Unbundled ISO: `image-recipe/results/archipelago-installer-1.7.54-alpha-unbundled-x86_64.iso`, sha256 `9828b244e6ffdd5f1b1d5184c1b22bef7474b32078b1ceb4ec3584d9bdb6775b`, size `2.3G`. 
-``` - -## 2026-05-11 `.198` Active Mission Checkpoint - -## 2026-05-11 Resume Session Update - -- Latest user directive: "please resume our work". -- Reconfirmed active mission is `.198` lifecycle hardening, not the older `.116/.228/.67` thread. -- Live `.198` state at resume: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. - - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `494cd64f77cbecb95c08552237cb8fd3c11c2b2b76d5d39854e6cf92b5900b68`. -- Raw Podman still showed release blockers: - - `mempool-api`: `Up ... (unhealthy)`. - - `nginx-proxy-manager`: `Up ... (unhealthy)`. - - `nostr-rs-relay`: `Stopping (healthy)`. - - `searxng` was healthy by the time of recheck and served `http://127.0.0.1:8888/` with HTTP 200. -- Diagnosed `mempool-api` as a real app failure, not a false-negative healthcheck: logs repeatedly show `getaddrinfo ENOTFOUND electrumx`, and `.198` has no `electrumx` container present. `mempool-api` is configured with `ELECTRUM_HOST=electrumx`, so the broad audit was masking a broken stack member. -- Found and fixed a local backend masking bug: `ProdContainerOrchestrator::health` returned `healthy` for every running container and ignored Podman's actual health status. It now returns Podman's health value for running containers, maps `Stopping` to unhealthy, and `ContainerState` now parses Podman's `stopping` state explicitly. -- Local verification: - - `cargo fmt` passed. - - `cargo test -p archipelago-container parse_podman_ps_json_handles_cli_output` passed. - - `cargo check -p archipelago --bin archipelago` passed. - - `cargo test -p archipelago health_maps_states_to_strings` did not finish within 3 minutes during crate compilation; no compiler error was emitted before timeout. 
-- Focused live audit command attempted: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=mempool,searxng,nginx-proxy-manager ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=60 tests/lifecycle/remote-lifecycle.sh`. It timed out because the deployed `.198` backend still has the old health behavior and Podman operations on the node are intermittently hanging. -- Next continuation point: - - Decide whether to deploy a freshly built backend to `.198`. Do not deploy the current dirty worktree blindly unless the existing unrelated changes are intended for this release, because the workspace contains many modified files from prior work. - - After deploy, rerun focused audit for `mempool,searxng,nginx-proxy-manager` and verify `container-health` reports `mempool` or stack health as unhealthy while `mempool-api` cannot resolve `electrumx`. - - Fix the mempool stack qualification: on a pruned/under-disk node, `mempool` must not install/start into a half-running state that leaves `mempool-api` unhealthy because `electrumx` is absent. - -## 2026-05-12 Lifecycle Hardening Completion Checkpoint - -- User directive: continue until the work is done. -- Deployed fixed backend to `.198`; final `/usr/local/bin/archipelago` sha256: `616e50ba8a83654e4a7656f931e5c9d1340a92cfa0ba22906edc0d374560df02`. -- `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive. -- Local durable fixes made: - - `ProdContainerOrchestrator::health` now respects Podman's health status instead of mapping all running containers to healthy. - - Podman `stopping` state is parsed explicitly and maps to unhealthy/stopping instead of unknown/running. - - `container-health` aggregates stack health for multi-container apps, so stack apps cannot hide unhealthy members like `mempool-api`. - - Health fallback now uses bounded exact-container Podman checks to avoid broad `podman ps` hangs poisoning unrelated app health. 
- - `mempool` install now runs dependency and archival-Bitcoin checks before dispatching to the stack installer, preventing half-running mempool stacks on pruned/under-disk nodes. - - Nginx Proxy Manager healthcheck now probes `http://localhost:81/`; `/api/` returns 502 on the deployed image while the UI is healthy. - - Runtime start repair now covers Vaultwarden and Nextcloud missing host listeners. - - Nextcloud runtime repair fixes bind-mounted data ownership before start/restart. - - Stale transitional state timeout lowered from 20 minutes to 2 minutes so dead lifecycle tasks clear promptly. -- Live `.198` repairs performed with data preserved: - - Removed broken `mempool` stack via `package.uninstall preserve_data=true`; `mempool` is now absent and full lifecycle correctly reports archival-blocked install. - - Recreated Nginx Proxy Manager container after stale Podman `Removing` state; data under `/var/lib/archipelago/nginx-proxy-manager` preserved. - - Recreated Vaultwarden container after stale conmon/host-listener failure; `/var/lib/archipelago/vaultwarden` preserved. - - Recreated Home Assistant and Nextcloud container records after stale conmon/host-listener failures; data directories preserved. - - Repaired Nextcloud ownership (`/var/lib/archipelago/nextcloud`) so Apache/PHP can write `config.php` and `data/nextcloud.log`. -- Verification passed: - - `cargo fmt`. - - `cargo check -p archipelago --bin archipelago`. - - `cargo build -p archipelago --bin archipelago --release`. - - `cargo test -p archipelago-container parse_podman_ps_json_handles_cli_output` passed earlier in this session. - - `cargo test -p archipelago health_maps_states_to_strings` still fails during local test binary linking with rust-lld undefined hidden symbols; `cargo check` and release build pass. 
- - Focused audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=mempool,searxng,nginx-proxy-manager ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`. - - Broad audit passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=180 tests/lifecycle/remote-lifecycle.sh`. - - Final raw Podman sweep found no `unhealthy`, `Stopping`, or `Removing` containers. - - Final direct probes returned HTTP 200 for LND UI, IndeedHub, Botfights, Gitea, File Browser, Vaultwarden, SearXNG, Fedimint, Jellyfin, Immich, Home Assistant, Tailscale, Uptime Kuma, Nextcloud, Nginx Proxy Manager, and Nostr Relay. -- Final broad audit state: - - Running: `lnd`, `indeedhub`, `botfights`, `gitea`, `filebrowser`, `vaultwarden`, `searxng`, `fedimint`, `jellyfin`, `immich`, `homeassistant`, `tailscale`, `uptime-kuma`, `nextcloud`. - - Absent/expected for this node or archival-gated: `bitcoin-knots`, `bitcoin-core`, `btcpay-server`, `mempool`, `electrumx`, `grafana`, `photoprism`. -- Remaining release consideration: `.198` is green for the non-destructive broad audit and raw Podman health. Destructive/full lifecycle should still be run only when you are ready to intentionally cycle app containers. - -- User corrected the active mission after disconnect: continue `.198` container lifecycle hardening, not the older `.116/.228/.67` thread. -- Mission: build "perfect containers" that do not go down unexpectedly and recover through daemon restarts, server reboots, power loss, stale Podman/Quadlet state, missing rootless host listeners, stuck installs, stopped/exited state drift, and stale stack/container records. -- Preserve app data unless explicitly told otherwise. -- Keep deterministic-test timers paused: `archipelago-doctor.timer` and `archipelago-reconcile.timer` should remain inactive. -- Latest verified `.198` service state: - - `archipelago.service`: active. - - `archipelago-doctor.timer`: inactive. 
- - `archipelago-reconcile.timer`: inactive. - - `/usr/local/bin/archipelago` sha256: `ed4df8e4c3c0a12a481ea41f8246da4b5f9e9ad931d0f3f58084b0057c330af0`. -- Latest broad audit command passed: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=180 tests/lifecycle/remote-lifecycle.sh -``` - -- Latest broad audit states: - - Running: `lnd`, `mempool`, `indeedhub`, `botfights`, `gitea`, `filebrowser`, `vaultwarden`, `searxng`, `fedimint`, `jellyfin`, `immich`, `homeassistant`, `tailscale`, `uptime-kuma`, `nextcloud`. - - Absent: `bitcoin-knots`, `bitcoin-core`, `btcpay-server`, `electrumx`, `grafana`, `photoprism`. -- Do not treat the broad audit pass as release-ready yet. Raw Podman still showed these concerning health/state mismatches: - - `mempool-api`: `Up ... (unhealthy)`. - - `searxng`: `Up ... (unhealthy)`. - - `botfights`: `Up ... (unhealthy)`. - - `nostr-rs-relay`: `Stopping (unhealthy)`. -- Current suspected release blocker: Archipelago package state and broad audit say apps are running, but raw Podman health/state still reports unhealthy/stopping containers. Next agent should diagnose whether each mismatch is a false-negative healthcheck, stale Podman state, or a real app failure; then fix manifest/runtime/reconcile behavior and rerun focused full lifecycle plus browser/direct launch probes for affected apps. -- Also verify whether `btcpay-server` and `grafana` being absent is intentional, because both had previously passed focused lifecycle checks on `.198`. - -## 2026-05-06 Resume Checkpoint - -- Goal: make container lifecycle and health recovery durable for every install and existing Archipelago server, while preserving app data. -- `.228` state: - - SSH key auth still fails, but password SSH works with password `archipelago`. - - Quarantined stale Quadlet blocker `~/.config/containers/systemd/bitcoin-core.container.disabled-20260506`. 
- - Started companion Bitcoin/LND UI services; external ports `8334` and `18083` return HTTP 200. - - Recreated stale `bitcoin-knots` container record only, preserving `/var/lib/archipelago/bitcoin` and `BITCOIN_RPC_PASS`; authenticated local RPC works. - - Diagnosed Immich reset loop as `immich_postgres` memory cap `512MiB`; raised live cap to `2g`/`4g` swap and made it persistent in code. - - Final external checks passed: dashboard 200, Bitcoin UI 200, LND UI 200, Immich 200, Bitcoin RPC unauthenticated 405 expected. -- `.116` state: - - Removed stale update override `/etc/systemd/system/archipelago.service.d/update-url.conf`. - - Valid RPC/password auth is `archipelago`; `password123` failed. - - Recreated stale `bitcoin-knots` preserving data and RPC password; direct authenticated RPC works. - - Fixed Grafana with `podman unshare chown -R 472:472 /var/lib/archipelago/grafana`; Grafana health returns 200. - - Deployed locally built fixed backend to `/usr/local/bin/archipelago`; previous binary was backed up and service restarted. - - Backend deploy checksum now `c6c7830f14dc80b0e22d803997ad3df31c9ab3d4b08829b3bddc1b03ce77bd0a`. - - Repaired active nginx config and canonical config so `curl http://127.0.0.1/bitcoin-status` returns JSON instead of SPA HTML. - - Repaired LND UI companion drift: generated quadlet was using stale `localhost/lnd-ui:latest`, whose nginx listened on container port 8081 while the unit mapped `18083:80`. Updated the live unit to use `localhost/lnd-ui:local`; `http://127.0.0.1:18083/` returns HTTP 200 and survives `systemctl --user restart archy-lnd-ui.service`. - - Focused non-destructive lifecycle audit passed: `ARCHY_HOST=192.168.1.116 ARCHY_SCHEME=http ARCHY_PASSWORD=archipelago ARCHY_APPS=bitcoin-knots,lnd,btcpay-server,mempool,grafana ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. 
- - Deployed newest local backend and script fixes live to `.116`, restarted Archipelago twice, and re-ran the focused non-destructive audit successfully. Important release/OTA note: startup promoted stale `/opt/archipelago/web-ui/archipelago-runtime/scripts` over `/opt/archipelago/scripts` once; after refreshing the runtime payload scripts too, restart preserved `18083` everywhere. - - Recent Bitcoin/ElectrumX status warnings appear transient during Bitcoin IBD/UTXO flushes. Live `/bitcoin-status` is `ok=true`, `stale=false`; ElectrumX reports `waiting` because it is indexed beyond the local Bitcoin node and is waiting for Bitcoin catch-up. -- `.67` state: - - User confirmed credentials `archipelago`/`archipelago`. - - This workspace cannot reach it: SSH `No route to host`, HTTP `000`, ping 100% loss, neighbor incomplete/failed. - - IndeedHub reboot/Nostr signing fix still needs live verification from a host that can reach `.67`. -- Local durable fixes in progress/done: - - Bitcoin/Grafana/Immich/IndeedHub backend fixes are implemented locally. - - UI loading/launch readiness fixes are implemented locally. - - Nginx canonical config now includes `/bitcoin-status` proxy next to `/electrs-status`. - - Startup bootstrap now patches older nginx configs that are missing `/bitcoin-status` and still patches `/api/app-catalog` when needed. It handles both `sites-available/archipelago` and copied `sites-enabled/archipelago` layouts. - - LND UI companion/spec drift is fixed locally: first-boot/container specs now use host `18083`, and companion reconcile now rewrites stale quadlet units/images instead of only checking active state. - - Release packaging now includes `image-recipe/configs/nginx-archipelago.conf` in the OTA runtime payload and strips `__pycache__`, `.pyc`, `.bak`, `.bak-*`, and logs from runtime assets. 
- - Regenerated `v1.7.54-alpha` frontend tarball was explicitly verified to contain LND UI `18083`, LND UI container nginx `listen 80`, and `/bitcoin-status` nginx blocks; no pycache/pyc/bak junk remains. - - ISO builder now configures both `146.59.87.168:3000` and `git.tx1138.com` as insecure for Podman and passes `--tls-verify=false` for primary HTTP registry pulls. The unbundled ISO now successfully pulls and saves `filebrowser.tar` instead of warning that Cloud/File Browser will be missing. - - ISO output filenames now include the release version and alpha suffix, e.g. `archipelago-installer-1.7.54-alpha-unbundled-x86_64.iso`. -- Verification already passed before latest nginx change: - - `cargo fmt` - - `cargo check -p archipelago --bin archipelago` - - `cargo build -p archipelago --bin archipelago --release` - - `bash -n scripts/first-boot-containers.sh` - - `bash -n image-recipe/build-debian-iso.sh image-recipe/archipelago-scripts/install-to-disk.sh image-recipe/write-usb-dd.sh image-recipe/create-fat32-usb.sh image-recipe/_archived/build-auto-installer-iso.sh scripts/create-release-manifest.sh scripts/container-specs.sh scripts/first-boot-containers.sh scripts/self-update.sh` - - `cd neode-ui && npm run build` - - `cd neode-ui && npm run type-check` - - `cd neode-ui && npm test -- appsConfig.test.ts appLauncher.test.ts --run` - - `scripts/check-release-manifest.sh` - - `sudo -n env UNBUNDLED=1 BUILD_FROM_SOURCE=1 bash build-debian-iso.sh` from `image-recipe/` passed and produced the v1.7.54-alpha unbundled ISO. -- Next steps: - - Re-check `.116` Archipelago logs for `Bitcoin status: RPC failure: getblockchaininfo` after Bitcoin IBD/UTXO flushing calms down. - - Deploy the fixed backend to `.228` if desired so durable repairs run there too. - - Optional next gate: run a full bundled/core-image ISO build if you need offline app images. The prior File Browser HTTP registry blocker is fixed for the builder path. 
- - Verify IndeedHub on `.67` only from a reachable network path. - -## 2026-05-05 Botfights, Gitea, Icons - -## 2026-05-06 Multi-Node Non-Destructive Audit - -### 2026-05-06 `.228` Live Repair - -- Access notes: - - SSH key auth to `.228` still fails, but password SSH works with password `archipelago`. - - Dashboard/RPC health reports `version=1.7.53-alpha`. -- Companion UI repair: - - Root cause: a stale rootless Quadlet unit at `~/.config/containers/systemd/bitcoin-core.container` blocked user Quadlet generation, so `archy-bitcoin-ui.service` and `archy-lnd-ui.service` were missing even though their `.container` files existed. - - Quarantined only the stale blocker: `~/.config/containers/systemd/bitcoin-core.container.disabled-20260506`. - - Ran user daemon reload and started generated companion services. - - Final verification: `archy-bitcoin-ui.service` and `archy-lnd-ui.service` are active; external `http://192.168.1.228:8334/` and `http://192.168.1.228:18083/` both return HTTP 200. -- Bitcoin Knots repair: - - Root cause: existing `bitcoin-knots` container record was stale and still launched `exec bitcoind`; current image only provides `/opt/bitcoin-29.3.knots20260210/bin/bitcoind` on PATH/fallback. - - Removed and recreated only the `bitcoin-knots` container record, preserving `/var/lib/archipelago/bitcoin` and the existing `BITCOIN_RPC_PASS`. - - New command matches the deployed manifest fallback: resolve `command -v bitcoind`, then search `/opt -path '*/bin/bitcoind'`. - - Final verification: container is running, ports `8332`/`8333` are listening, authenticated local RPC `getblockchaininfo` works, and the node is in initial block/header sync. -- Immich repair: - - Root cause: `immich_postgres` was capped at `512MiB`; during Immich v2.7.4 reverse-geocoding geodata import, Postgres child processes were SIGKILLed while bulk inserting into `geodata_places`, forcing DB recovery and causing `immich_server` to reset connections on `2283`. 
- - Raised only the Postgres container memory limit with `podman update --memory=2g --memory-swap=4g immich_postgres`, then restarted `immich_postgres` and `immich_server`; preserved `/var/lib/archipelago/immich-db` and `/var/lib/archipelago/immich`. - - Final logs showed `Successfully imported 224210 geodata records`, `Initialized local reverse geocoder`, and both Immich API/microservices successfully started. - - Final external verification: `http://192.168.1.228:2283/` returns HTTP 200. -- Final `.228` external status after repair: - - Dashboard `http://192.168.1.228/`: HTTP 200. - - Bitcoin UI `http://192.168.1.228:8334/`: HTTP 200. - - LND UI `http://192.168.1.228:18083/`: HTTP 200. - - Immich `http://192.168.1.228:2283/`: HTTP 200. - - Bitcoin RPC no-auth probe `http://192.168.1.228:8332/`: HTTP 405, expected for reachable RPC without credentials. -- Still outstanding from this audit: - - `.116` has the same stale Bitcoin Knots container-command symptom but RPC password `password123` fails; do not repair until valid auth/SSH access is confirmed. - - `.67` remains unreachable from this machine even with confirmed credentials `archipelago`/`archipelago`: SSH reports `No route to host`, HTTP probes return `000`, local route is via `wlp3s0` from `192.168.1.116`, and ping has 100% packet loss. IndeedHub reboot behavior still needs diagnosis from a host that can reach `.67`. - - The `.228` ad-hoc Immich Postgres memory repair was made persistent locally after the live fix: `install_immich_stack` now creates `immich_postgres` with `--memory=2g`, and `get_memory_limit("immich_postgres")` returns `2g`. Verification passed with `cargo fmt` and `cargo check -p archipelago --bin archipelago`. -- IndeedHub reboot/Nostr signing root cause and local fix: - - User confirmed IndeedHub works after a manual restart, but after server boot it fails to come back correctly and forgets the Nostr signing/provider behavior. 
- - Root cause in code: `ProdContainerOrchestrator::ensure_running_with_mode` returned `stack-managed` immediately for `indeedhub`, so the boot reconciler never started/repaired the installed stack and never reapplied the imperative frontend nginx/Nostr-provider mutation. - - Additional gap: package start/restart repaired IndeedHub network aliases but did not reapply `nostr-provider.js` / nginx patch after the frontend container was started. - - Local fix: boot reconcile now handles an existing IndeedHub stack without fresh-installing the single manifest: starts backend containers, starts frontend if stopped/exited/created, repairs network aliases, reapplies the Nostr provider/nginx patch, and restarts the frontend if host port `7778` is not listening. - - Local fix: package start/restart now reapplies the IndeedHub Nostr provider patch whenever `indeedhub` is in the started/restarted set. - - Verification passed locally with `cargo fmt` and `cargo check -p archipelago --bin archipelago`. - - Not live-verified on `.67` because this workspace still cannot reach `.67`; deploy the backend build to a reachable test node or run from a host that can reach `.67`, then reboot and confirm `http://:7778/` plus Nostr signing in the iframe. -- Bitcoin/Grafana permanent repair notes: - - `.116` showed `Unable to connect to Bitcoin node` because `bitcoin-knots` had the same stale container command as `.228`: existing container record still executed bare `bitcoind`, but the current image only has `/opt/bitcoin-29.3.knots20260210/bin/bitcoind` discoverable via PATH/fallback. - - Local permanent fix: `ProdContainerOrchestrator::container_env_drifted` now also checks entrypoint/cmd drift against the current manifest. Existing stale containers whose command no longer matches the deployed manifest are removed/recreated by boot reconcile/start/install flows, preserving bind-mounted data. 
- - `.116` Grafana served `/api/health` but logs showed `GF_PATHS_DATA='/var/lib/grafana' is not writable` and repeated `attempt to write a readonly database`; live data ownership had mixed rootless mapped owners. - - Local permanent fix: `apps/grafana/manifest.yml` now declares `data_uid: "472:472"`, and Grafana start/reconcile paths repair `/var/lib/archipelago/grafana` ownership before start/restart. This makes fresh installs and already-installed nodes self-heal instead of relying on manual `chown`. - - Verification passed with `cargo fmt` and `cargo check -p archipelago --bin archipelago`. - -- Current local branch state during audit: - - `main` is 31 commits ahead of `tx1138/main`. - - Tracked worktree is clean. - - Untracked docs: `docs/CONTAINER_LIFECYCLE_HANDOFF.md` and `docs/CHAT_TRANSCRIPT_2026-05-02.md`. -- Connectivity and service health: - - `.198`: SSH reachable with `/home/archipelago/.ssh/id_ed25519`; `archipelago.service` active; local health returns `status=ok`, `version=1.7.53-alpha`. - - `.116`: SSH reachable with `/home/archipelago/.ssh/id_ed25519`; `archipelago.service` active; local health returns `status=ok`, `version=1.7.51-alpha`. - - `.228`: SSH still blocked with `Permission denied (publickey,password)`; dashboard/RPC is reachable over HTTP/HTTPS. -- Broad non-destructive lifecycle audit results: - - `.198` passed cleanly: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=180 tests/lifecycle/remote-lifecycle.sh`. - - `.228` failed two checks with RPC-only audit: Bitcoin Knots UI direct port `http://192.168.1.228:8334/` returned `status=000`, and LND UI direct port `http://192.168.1.228:18083/` returned `status=000`. Dashboard itself returns HTTP 200. SSH-level diagnosis is blocked until credentials/key access are fixed. 
- - `.116` audit did not complete within 15 minutes and showed degraded state: `container-health` returned `unknown` for `bitcoin-knots`, `btcpay-server`, and `lnd`; LND direct port `http://192.168.1.116:18083/` returned `status=000`. Direct probes showed dashboard HTTP 200, Bitcoin UI `http://192.168.1.116:8334/` HTTP 200, old LND UI `http://192.168.1.116:8081/` HTTP 200, BTCPay `http://192.168.1.116:23000/` HTTP 302, and Mempool `http://192.168.1.116:4080/` HTTP 200. -- `.116` live diagnostics: - - Deployed backend checksum: `f761e659d661f0a83cd3a67a086bb2279398bc05e50ee3c52e769e52d11e476c`. - - Service has `ARCHIPELAGO_DEV_MODE=true` override and `ARCHIPELAGO_UPDATE_URL=http://192.168.1.116:3000/lfg2025/archy/raw/branch/main/releases/manifest.json`. - - `archy-lnd-ui` is still mapped to `0.0.0.0:8081->80/tcp`, while the current lifecycle harness expects LND UI on `18083`; treat `.116` as stale relative to the current LND port migration. - - `lnd` is `Up ... (unhealthy)` on `8080`, `9735`, and `10009`. - - `btcpay-server` is `Up ... (unhealthy)` on `23000`. - - `bitcoin-knots` is `Up ... (reset)` and backend logs show repeated Bitcoin RPC failures for `getblockchaininfo`. - - Backend logs show ElectrumX status also failing Bitcoin RPC. -- `.198` live diagnostics: - - Deployed backend checksum observed during this audit: `86cf408ed84c7a7a72d1b5529aa97561dd02db38aab57c523999d1f5e7bf48b7`. -- Local smoke verification passed: - - `cargo check -p archipelago --bin archipelago` from `core/`. - - `npm run type-check` from `neode-ui/`. - - `npm test -- appsConfig.test.ts appLauncher.test.ts --run` from `neode-ui/` (`27 passed`). -- Next focused actions: - - Fix `.228` SSH access first if deeper runtime diagnosis is required; RPC-only audit already identifies closed/unreachable direct app ports `8334` and `18083`. - - Bring `.116` forward to the current deployed release/runtime expectations before treating lifecycle failures as fresh regressions. 
It is on `1.7.51-alpha`, has dev-mode/update-url overrides, and still launches LND UI on legacy port `8081`. - - After `.116` is updated, rerun focused non-destructive checks for `bitcoin-knots`, `lnd`, `btcpay-server`, `mempool`, and ElectrumX/Bitcoin RPC status before a full broad audit. - -## 2026-05-05 Tailscale And Grafana Recheck - -## 2026-05-05 Release v1.7.52-alpha Staging - -- Release target corrected to `1.7.52-alpha`. -- Version bumped locally in: - - `core/archipelago/Cargo.toml` - - `core/Cargo.lock` - - `neode-ui/package.json` - - `neode-ui/package-lock.json` -- `.52` release notes added to `CHANGELOG.md`. -- Debian 13/Trixie security mitigation added for rebuilt media: - - `_archived/build-auto-installer-iso.sh` now runs `apt-get -y full-upgrade` after enabling Debian/Trixie security repositories during rootfs, Tailscale, FIPS, and installer environment creation. - - `image-recipe/archipelago-scripts/install-to-disk.sh` now runs `apt-get -y full-upgrade` after writing `trixie-security` sources and before installing kernel/bootloader/packages. - - This does not retroactively patch already-built ISOs; `.52` media must be rebuilt. -- Active ISO command restored: - - Added `image-recipe/build-debian-iso.sh` wrapper around the archived builder so documented ISO commands no longer point at a missing script. - - USB helper scripts now default to `results/archipelago-installer-x86_64.iso` / unbundled fallback and allow `ARCHIPELAGO_ISO=/path/to.iso`. -- `.52` release artifacts staged: - - `releases/v1.7.52-alpha/archipelago` - - `releases/v1.7.52-alpha/archipelago-frontend-1.7.52-alpha.tar.gz` - - `releases/manifest.json` - - `release-manifest.json` -- Manifest validation passed: `scripts/check-release-manifest.sh`. -- Frontend dependency audit: - - Ran `npm audit fix`, removing the critical `protobufjs` advisory and high advisories. 
 - - Remaining audit finding is moderate `uuid <14` via `dockerode`; `npm audit fix --force` would upgrade to breaking `dockerode@5.0.0`, so this was not forced during release staging. -- Final verification passed: - - `cargo build -p archipelago --bin archipelago --release` with existing `reconcile_all` dead-code warning. - - `cargo check -p archipelago --bin archipelago` with same warning. - - `cd neode-ui && npm run build`. - - `cd neode-ui && npm run type-check && npm test -- appsConfig.test.ts appLauncher.test.ts --run`. - - `bash -n image-recipe/build-debian-iso.sh image-recipe/archipelago-scripts/install-to-disk.sh image-recipe/write-usb-dd.sh image-recipe/create-fat32-usb.sh image-recipe/_archived/build-auto-installer-iso.sh`. - - `npm audit --audit-level=high` reports no high or critical findings; only the moderate `dockerode`/`uuid` issue remains. -- Not yet done in this pass: - - Full bundled ISO build was not run; unbundled ISO build passed. - - `.52` release artifacts were staged locally but not committed, tagged, or pushed. - - No git commit was created. - -### 2026-05-05 Warning Fix And ISO Build - -- Removed the `reconcile_all` dead-code warning by making the install-missing reconcile helper test-only with `#[cfg(test)]`; production uses `reconcile_existing`. -- Verification now passes without Rust warnings: - - `cargo check -p archipelago --bin archipelago` - - `cargo build -p archipelago --bin archipelago --release` -- Refreshed `.52` backend artifact and manifests after the warning fix: - - `scripts/check-release-manifest.sh` passes. - - Backend sha256: `fc47c3bc42f67472252cb854bb03e200a92929ab38aeac519422704486af18d4`. - - Frontend tarball sha256: `329e57a0491e91966afcd5a82f5c00920657695b01ecc6c9e99c6814b44abf29`. -- Built unbundled `.52` Debian ISO: - - Command: `sudo -n env UNBUNDLED=1 BUILD_FROM_SOURCE=1 bash image-recipe/build-debian-iso.sh` from `image-recipe/`.
- - Output: `image-recipe/results/archipelago-installer-unbundled-x86_64.iso`. - - Size: `2.3G`. - - sha256: `547ba5dcd0ad61aeaa52ce0beaff4f447e2ab2c59bf6b1fa127529606fe0209d`. -- ISO build note: - - The unbundled ISO completed successfully. - - Optional File Browser core image pull failed during Step 3b because `146.59.87.168:3000` answered HTTP while Podman tried HTTPS: `server gave HTTP response to HTTPS client`. - - This was non-fatal for unbundled media; Cloud/File Browser may need post-install Marketplace download unless registry TLS/insecure registry config is corrected before a bundled/core-image ISO. - -- Backend build deployed to `.198`: `eb539aaa11b32776888be1b23b90c9c0c78b46d8a86dc55ccce7f5b15bbda16e`. -- Tailscale is now qualified: - - Root cause: container command started `tailscale web` before `tailscaled`, so the web UI exited because `/var/run/tailscale/tailscaled.sock` did not exist yet. - - Fixed backend config and first-boot script to start `tailscaled --tun=userspace-networking` first, then bind `tailscale web --listen 0.0.0.0:8240`. - - Removed only the stale `tailscale` container on `.198`; preserved `/var/lib/archipelago/tailscale`. - - Full preserve-data lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=tailscale ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. - - Frontend launch now opens local app port `http://:8240/` instead of the external Tailscale admin site. - - Browser launch passed: `ARCHY_BASE_URL=http://192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APP_ID=tailscale ARCHY_APP_TITLE=Tailscale ARCHY_APP_CARD_TITLE=Tailscale ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:8240/ ARCHY_EXPECTED_LAUNCH_MODE=popup ARCHY_EXPECTED_BODY_PATTERN='Tailscale|Connect|Login|Sign|Authorize|Machines|Admin|Tailnet|VPN' npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line`. 
-- Grafana regression was found during broad audit: - - RPC/container state was `running`, but direct launch failed on `http://192.168.1.198:3000/` with `status=000`; Podman reported a port mapping while `ss` had no host listener. - - Extended existing host-port listener repair to include Grafana port `3000` on install/adoption/start/restart paths. - - Full Grafana lifecycle passed after repair, then focused Grafana audit passed. -- Broad `.198` audit passed after Tailscale and Grafana repairs: - - Command: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`. - - Running apps included `tailscale`, `grafana`, and the previously qualified app set. - - Absent and tolerated: `ollama`, `photoprism`, `electrumx`, `dwn`. -- Local verification passed: - - `cargo fmt` - - `cargo build -p archipelago --bin archipelago --release` with existing `reconcile_all` dead-code warning. - - `cargo check -p archipelago --bin archipelago` with same warning. - - `bash -n scripts/first-boot-containers.sh` - - `cd neode-ui && npm run build` - - `cd neode-ui && npm run type-check` - - `cd neode-ui && npm test -- appsConfig.test.ts appLauncher.test.ts --run` - -- Backend build deployed to `.198`: `4b92ecea7d0a988c4ebe814b47f49f00277867d5f1eb0dca2cb1cd906b536fe6`. -- Gitea regression re-tested and repaired after later launch failure: - - Failure reproduced during full lifecycle after restart: `launch failed: gitea http://192.168.1.198:3001/ status=000 bytes=0`. - - Live diagnosis: Gitea was healthy internally on container port `3000` and `ROOT_URL` was correct, but Podman's rootless `pasta` host listener on `:3001` accepted no traffic. - - Changed Gitea install networking in `core/archipelago/src/api/rpc/package/install.rs` to `--network=slirp4netns:allow_host_loopback=true`, matching the Uptime Kuma rootless listener repair path. 
- - Backend build deployed to `.198`: `9db6c192c2e633c4648fafc0372ea0f3cb0749aacc5396bb12f7710c8bac4aa7`. - - Full preserve-data lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=gitea ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. - - Direct check passed: `http://192.168.1.198:3001/` returned `HTTP 200`; final container inspect showed `network=slirp4netns` and `rootlessport` listening on `:3001`. -- Botfights is qualified: - - Initial failure was stale `pasta.avx2` listener on host port `9100`; no Botfights container owned it. - - Killed stale pid `211879` and reran full lifecycle. - - Full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=botfights ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. -- Gitea is qualified: - - User-visible launch error was broken asset root: Gitea generated `/app/gitea/assets/...` URLs while the UI/lifecycle launched direct port `http://192.168.1.198:3001/`. - - Fixed backend post-install hook in `core/archipelago/src/api/rpc/package/install.rs` to set `ROOT_URL = http://:3001/` instead of `/app/gitea/`. - - Added install/start/restart stale listener cleanup and host-port verification for Gitea host ports `3001`, `2222`, and legacy stale `3000`. - - Full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=gitea ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. -- Icons updated locally: - - Replacement files found at `/home/archipelago/immich.png`, `/home/archipelago/electrumx.png`, and `/home/archipelago/grafana.png`. - - Replaced `neode-ui/public/assets/img/app-icons/immich.png`, `neode-ui/public/assets/img/app-icons/grafana.png`, and `neode-ui/public/assets/img/grafana.png`. 
- - Added `neode-ui/public/assets/img/app-icons/electrumx.png` and updated catalog/curated/marketplace references from `.webp` to `.png`. - - Installed Gitea icon now falls back to existing `/assets/img/app-icons/gitea.svg` instead of nonexistent `/assets/img/app-icons/gitea.png`. - - `AppHeroSection.vue` now uses `resolveAppIcon()` so app details uses the same fallback behavior. - - Verification passed: `npm test -- appsConfig.test.ts --run`. - -## 2026-05-05 Nextcloud, Uptime Kuma, ElectrumX Warning - -- Backend build deployed to `.198`: `1796cccd44e7d8f34b495b2dc04bc933d85a32c8c77cee31800653cc5f7b05d0`. -- Nextcloud live `403 Forbidden` was caused by unreadable Apache/PHP entry files inside the container: - - `.htaccess`, `index.php`, and `status.php` were `0600 root:root`. - - Added targeted Nextcloud permission repair in `core/archipelago/src/api/rpc/package/install.rs` instead of broad recursive ownership/mode changes. - - Manually repaired live container file modes and restarted Nextcloud. - - Retested `http://192.168.1.198:8085/status.php` and `http://192.168.1.198:8085/`; both returned `HTTP/1.1 200 OK`. -- Uptime Kuma root cause was rootless host port listener instability: - - The app was healthy internally on `127.0.0.1:3001` and returned `302 /dashboard`, while the host `3002` listener was missing despite Podman showing a mapping. - - Changed Uptime Kuma install networking in `core/archipelago/src/api/rpc/package/install.rs` to `--network=slirp4netns:allow_host_loopback=true`. - - Ran `cargo fmt`, `cargo check -p archipelago --bin archipelago`, and `cargo build -p archipelago --bin archipelago --release` successfully before deploy. - - Recreated Uptime Kuma through local backend RPC on `.198` with preserve-data uninstall/reinstall; preserved `/var/lib/archipelago/uptime-kuma`. - - Retested `http://192.168.1.198:3002/`; final response was `HTTP/1.1 302 Found` with `Location: /dashboard`. 
-- ElectrumX archival-node UI warning implemented in `neode-ui`: - - `Marketplace.vue`, `MarketplaceAppDetails.vue`, and `Discover.vue` fetch `/bitcoin-status` and only block ElectrumX/electrs/mempool-electrs installs when `blockchain_info.pruned === true`. - - Failed or unavailable prune-status fetches fail open: they do not block install attempts. - - Warning text shown via toast/error paths: `You need a full archival bitcoin node before downloading ElectrumX`. - - `MarketplaceAppCard.vue` blocked warning button is clickable so the toast path can display the popup text instead of silently disabling the button. - - Frontend verification passed: `npm run type-check` from `neode-ui`. -- Icon replacement remains blocked: - - Searched likely upload locations and repo icon paths; no replacement icon files were found. - - Existing icon directory is `neode-ui/public/assets/img/app-icons/`. - - Continue once the actual replacement files/path are provided. - -## 2026-05-04 Testing Continuation - -- SearXNG rootless listener fix deployed and qualified after reconnection: - - Backend build deployed to `.198`: `0773e8719cfd1099ffeae27d9f046749353ebb7fa795c36097b674bd54c28820`. - - Root cause: the new-container install path repaired a missing rootless `pasta` host listener on port `8888`, but the legacy "container already exists, adopt it" path could return success without the same repair. This left Podman reporting `0.0.0.0:8888->8080/tcp` while `ss` showed no listener and launch probes returned `000`. - - Code fix: `core/archipelago/src/api/rpc/package/install.rs` now calls `ensure_host_port_listener(package_id, package_id)` before returning success from the existing-container adoption path. - - Full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=searxng ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=180 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`.
- - Browser launch passed in panel mode: `ARCHY_BASE_URL=http://192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APP_ID=searxng ARCHY_APP_TITLE=SearXNG ARCHY_APP_CARD_TITLE=SearXNG ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:8888/ ARCHY_EXPECTED_LAUNCH_MODE=panel ARCHY_EXPECTED_BODY_PATTERN='SearXNG|Search' npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line`. -- Jellyfin is qualified: - - Full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=jellyfin ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. - - Browser launch passed in panel mode: `ARCHY_BASE_URL=http://192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APP_ID=jellyfin ARCHY_APP_TITLE=Jellyfin ARCHY_APP_CARD_TITLE=Jellyfin ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:8096/ ARCHY_EXPECTED_LAUNCH_MODE=panel ARCHY_EXPECTED_BODY_PATTERN='Jellyfin|jellyfin' npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line`. -- ElectrumX is blocked on `.198`: - - Reproduced failure: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=electrumx ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=300 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` stayed `absent` after install. - - Backend log shows install was rejected before container creation: `electrumx requires an unpruned Bitcoin node while indexing. Current Bitcoin is pruned`. - - Direct Bitcoin RPC confirmed `pruned: true`, `prune_target_size: 576716800`, IBD `blocks=472928`, `headers=947914`. - - Disk check showed `/var/lib/archipelago` has about `384G` free, likely not enough for unpruned mainnet plus ElectrumX index. User selected `Mark blocked`; do not reconfigure Bitcoin on `.198` unless explicitly requested. 
-- PhotoPrism is pending/blocked on image pull speed: - - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` stayed `installing` because the container image was still pulling. - - No `photoprism` container was created yet; no port `2342` listener. - - Backend logs show `146.59.87.168:3000/lfg2025/photoprism:240915` timed out after 600s, then `git.tx1138.com/lfg2025/photoprism:240915` timed out after 600s, then retry attempt 1/3 restarted the primary registry pull. - - Treat as image/registry-pull pending rather than app runtime failure unless a later pull completes and the container fails to start. -- Stuck-installing backend fix deployed after PhotoPrism exposed long pull retries: - - Backend build deployed to `.198`: `1f0dd8b9fe801d289557ac050f68011c395374f2b0d5c4677b884d6081612de0`. - - Single-container image pulls now try the configured registry list once with a 300s per-URL timeout instead of repeating the whole list three times with 600s per URL. This turns missing/stalled image pulls into visible failed installs instead of leaving cards in `installing` for close to an hour. - - Scanner now removes stale absent transitional entries after `TRANSITIONAL_STUCK_TIMEOUT`; previously an `Installing` entry with no container could survive indefinitely after a backend restart or killed pull task. - - Verified PhotoPrism state recovered to `absent` with `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_TIMEOUT=60 ARCHY_STABILITY_SECONDS=1 tests/lifecycle/remote-lifecycle.sh`. -- Nginx Proxy Manager is qualified: - - Full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=nginx-proxy-manager ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. 
- - Browser launch passed as a new-tab app: `ARCHY_BASE_URL=http://192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APP_ID=nginx-proxy-manager ARCHY_APP_TITLE='Nginx Proxy Manager' ARCHY_APP_CARD_TITLE='Nginx Proxy Manager' ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:81/ ARCHY_EXPECTED_LAUNCH_MODE=popup ARCHY_EXPECTED_BODY_PATTERN='Nginx|Proxy|Manager|Sign in|Email' npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line`. -- Portainer is qualified: - - Full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. - - Browser launch passed as a new-tab app: `ARCHY_BASE_URL=http://192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APP_ID=portainer ARCHY_APP_TITLE=Portainer ARCHY_APP_CARD_TITLE=Portainer ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:9000/ ARCHY_EXPECTED_LAUNCH_MODE=popup ARCHY_EXPECTED_BODY_PATTERN='Portainer|Username|Password|Create administrator' npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line`. -- Uptime Kuma is blocked on `.198`: - - Initial failure was a recipe bug: code overrode the image entrypoint to `/usr/bin/dumb-init` but did not pass a program, causing repeated `dumb-init` usage exits. - - Fixed recipe by passing `-- node server/server.js`; deployed backend `540aefb2e1d19aa64b7a5da316bf12c1933145d7ea536afedffb6068371a476f`. - - Added install/start/restart listener repair for host port `3002`; latest deployed backend is `bbcba3f32fab8e11349962f8bb5227ec0374cf36200a768a716c00485dcd121b`. - - Remaining blocker: Uptime Kuma container stays healthy and listens internally on `3001`, Podman reports `0.0.0.0:3002->3001/tcp`, but `ss` loses the actual host listener and direct curl returns `000`. 
- - Manual `podman restart uptime-kuma` makes `127.0.0.1:3002` return `302 32` for about 105 seconds, then the listener disappears while the container remains healthy. Treat as unstable rootless `pasta` listener, not an app process crash. -- Immich is qualified: - - Backend build deployed to `.198`: `22c8129b8f4e93b58cce9baef8f9e1d071cb243faf85bee1b56457d48f46bbfc`. - - Root cause of lifecycle failure: `container-health` was called with app id `immich`, but the fallback health/status aliases only inspected `immich` and `archy-immich`; the stack's real service container is `immich_server`. The scanner already reports the stack as `immich`, so state was running while health returned `unknown`. - - Code fix: `core/archipelago/src/api/rpc/container.rs` now includes `immich_server` in health/status app-id and container-name candidates for `immich`. - - Full lifecycle passed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=1800 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. - - Browser launch passed in panel mode from `neode-ui`: `ARCHY_BASE_URL=http://192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APP_ID=immich ARCHY_APP_TITLE=Immich ARCHY_APP_CARD_TITLE=Immich ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:2283/ ARCHY_EXPECTED_LAUNCH_MODE=panel ARCHY_EXPECTED_BODY_PATTERN='Immich|Login|Admin|Photos' npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line`. - - Note: an earlier `/tmp/archipelago.new` transfer was truncated/mismatched and crashed with `SIGSEGV`; restored `bbcba3f32fab8e11349962f8bb5227ec0374cf36200a768a716c00485dcd121b`, recopied verified local release to `/tmp/archipelago.local-release`, then deployed it successfully. 
-- DWN is blocked on missing/unpullable image: - - Full lifecycle failed: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=dwn ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. - - Failure: `dwn did not reach running within 900s (last=absent)`. - - Backend journal shows both pull attempts failed before container creation: `146.59.87.168:3000/lfg2025/dwn-server:main` and `git.tx1138.com/lfg2025/dwn-server:main`, ending with `Image pull failed from all 2 configured registries`. - - No `dwn` container or image exists on `.198`; treat as image/catalog publishing blocker unless a local fallback image is built or registry image is restored. -- Botfights handoff point: - - Lifecycle command was started but user interrupted during install while switching computers: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=botfights ARCHY_FULL_LIFECYCLE=1 ARCHY_TIMEOUT=900 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. - - Last visible output before abort: `== botfights: install ==`. - - On resume, inspect current `botfights` state/container/image before rerunning because the backend install task may have continued after the local harness was aborted. - -- Broad `.198` audit passed: - - Command: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` - - Running/healthy enough for audit: `bitcoin-knots`, `btcpay-server`, `lnd`, `mempool`, `homeassistant`, `grafana`, `searxng`, `nextcloud`, `vaultwarden`, `filebrowser`, `fedimint`, `indeedhub`. - - Absent and tolerated by audit at the time: `ollama`, `jellyfin`, `photoprism`, `immich`, `nginx-proxy-manager`, `portainer`, `uptime-kuma`, `electrumx`, `dwn`, `botfights`, `gitea`. 
-- Focused full preserve-data lifecycle passed in this continuation: - - `btcpay-server`: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=btcpay-server ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` - - `nextcloud`: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=nextcloud ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` - - `mempool`: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=mempool ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` - - `homeassistant`: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=homeassistant ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` - - `grafana`: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=grafana ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` - - `vaultwarden`: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=vaultwarden ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` - - `filebrowser`: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=filebrowser ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` -- Focused full preserve-data lifecycle still known-passing from prior handoff: `lnd`, `bitcoin-knots`, `fedimint`, `indeedhub`. -- SearXNG regression reproduced: - - Command failed at install launch probe: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=searxng ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` - - Failure: `launch failed: searxng http://192.168.1.198:8888/ status=000 bytes=0`. - - Post-failure state: container `searxng` is `Up ... 
(healthy)` and `podman port searxng` reports `8080/tcp -> 0.0.0.0:8888`, but `ss -ltn` has no `*:8888` listener and both `curl http://127.0.0.1:8888/` and `curl http://192.168.1.198:8888/` return `000 0`. - - A `package.restart` temporarily recreated the listener and direct curl returned `200 6316`, but the next full lifecycle reinstall reproduced the missing listener. -- Remaining focused full-lifecycle candidates after this continuation: - - Blocked on `.198`: `electrumx`, `uptime-kuma`. - - Pending on image pull: `photoprism`. - - Absent apps not yet qualified in this pass: `botfights`, `gitea`. - - Botfights lifecycle attempt was interrupted during install; inspect state first on resume. - - Blocked on missing image: `dwn`. - - Skip `ollama` until image/manifest/catalog entry is restored. - - `electrumx` is absent but was mentioned as a possible follow-up in earlier handoff; run only if it remains in scope. - -## 2026-05-04 IndeedHub And LND Update - -- Latest deployed backend hash observed on `.198`: `83ad80ec793095f2b19746ad8c3d76ab2e7b57b132e4182a28ea9ff86067908b`. -- Frontend bundle redeployed to `/opt/archipelago/web-ui`; dashboard `Last-Modified: Mon, 04 May 2026 10:15:11 GMT`. -- LND was intentionally switched back to panel/iframe launch per user request: - - Removed `lnd` from `NEW_TAB_APPS`, `TAB_LAUNCH_APPS`, and `NEW_TAB_APP_IDS`. - - Browser panel launch qualification passed against `http://192.168.1.198:18083/`. -- IndeedHub is now qualified: - - Full backend/container lifecycle passed. - - Browser Launch qualification passed in panel/iframe mode. - - `/nostr-provider.js` is served by IndeedHub and contains the NIP-07/NIP-98 bridge markers. - -### IndeedHub Issues Fixed - -- Stack restart failed because restarted backend containers lost network aliases (`minio`, `postgres`, `redis`, `relay`, `api`). 
-- Added alias repair for IndeedHub stack restart/start paths: - - `core/archipelago/src/api/rpc/package/stacks.rs` - - `core/archipelago/src/api/rpc/package/runtime.rs` - - `core/archipelago/src/container/prod_orchestrator.rs` -- The frontend nginx container failed under read-only root with: - - `open() "/run/nginx.pid" failed (30: Read-only file system)` -- Added writable tmpfs mounts for stack-created IndeedHub frontend: - - `/run` - - `/var/cache/nginx` -- The boot reconciler raced the async stack installer by recreating the single-container manifest `indeedhub:latest` while `package.install indeedhub` was still pulling stack images. This stole the `indeedhub` container name and caused stack frontend creation to fail. -- Fixed by marking IndeedHub as stack-managed in `ProdContainerOrchestrator::ensure_running_with_mode`, so generic manifest reconciliation no longer installs/recreates it. -- Lifecycle harness now waits for async install transition states to settle before checking `running`, avoiding stale-container false positives. 
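
The last fix above (wait for async install transitions to settle before checking `running`) can be sketched roughly as follows. This is an illustration only, not the code in `tests/lifecycle/remote-lifecycle.sh`: the function name `wait_for_settled`, the `TRANSITIONAL_STATES` variable, and the polling parameters are all hypothetical, and the caller is assumed to supply a command that prints the app's current state.

```shell
# Hypothetical helper (not taken from the real harness): poll a state command
# until the reported state is no longer transitional, then print the settled state.
#   $1 = command that prints the current state (assumed to be supplied by the caller)
#   $2 = maximum number of polls (default 180)
#   $3 = seconds to sleep between polls (default 5)
# TRANSITIONAL_STATES is an assumed, space-separated list; only `installing`
# is named in the notes above, so it is the default.
wait_for_settled() {
  local state_cmd="$1" max_polls="${2:-180}" interval="${3:-5}"
  local transitional="${TRANSITIONAL_STATES:-installing}"
  local polls=0 state=""
  while [ "$polls" -lt "$max_polls" ]; do
    state="$("$state_cmd")"
    case " $transitional " in
      *" $state "*)
        # Still in a transitional state: wait and poll again.
        polls=$((polls + 1))
        sleep "$interval"
        ;;
      *)
        # Settled: report the state and let the caller decide pass/fail.
        printf '%s\n' "$state"
        return 0
        ;;
    esac
  done
  # Never settled within the poll budget: report the last state and fail.
  printf '%s\n' "$state"
  return 1
}
```

A caller would then compare the printed state with `running` only after the function returns, so a leftover container from a previous install cannot satisfy the check while the new install is still in flight.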
- -### Passing Commands - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh -``` - -```bash -cd /home/archipelago/Projects/archy/neode-ui -ARCHY_BASE_URL=http://192.168.1.198 \ -ARCHY_PASSWORD=password123 \ -ARCHY_APP_ID=indeedhub \ -ARCHY_APP_TITLE=IndeedHub \ -ARCHY_APP_CARD_TITLE=IndeedHub \ -ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:7778/ \ -ARCHY_EXPECTED_LAUNCH_MODE=panel \ -ARCHY_EXPECTED_BODY_PATTERN='Indee|Indeed|Bitcoin|documentary|nostr' \ -npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line -``` - -```bash -cd /home/archipelago/Projects/archy/neode-ui -ARCHY_BASE_URL=http://192.168.1.198 \ -ARCHY_PASSWORD=password123 \ -ARCHY_APP_ID=lnd \ -ARCHY_APP_TITLE=LND \ -ARCHY_APP_CARD_TITLE=LND \ -ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:18083/ \ -ARCHY_EXPECTED_LAUNCH_MODE=panel \ -ARCHY_EXPECTED_BODY_PATTERN='Connect Your Wallet|lndconnect|REST|gRPC|Copy lndconnect URI' \ -npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line -``` - -### Next Recommended Work After IndeedHub - -- Grafana is now qualified: - - Full backend/container lifecycle passed. - - Browser Launch qualification passed against `http://192.168.1.198:3000/` / `/login`. -- Home Assistant is now qualified: - - Full backend/container lifecycle passed. - - Browser Launch qualification passed; first-run redirect to `/onboarding.html` is accepted. -- SearXNG is now qualified: - - Full backend/container lifecycle passed. - - Browser Launch qualification passed in panel/iframe mode against `http://192.168.1.198:8888/`. - - Fixed stale rootless `pasta` listener recovery for port `8888` before install/retry. 
- - Fixed manifest image drift by aligning `apps/searxng/manifest.yml` with package install image `146.59.87.168:3000/lfg2025/searxng:latest`; backend restart was required on `.198` to reload the deployed manifest. -- SearXNG recheck after user reported UI not loading: - - RPC/container state showed `running` and Podman reported `0.0.0.0:8888->8080/tcp`, but `ss` showed no actual listener and direct `curl http://192.168.1.198:8888/` failed. - - Restarted SearXNG through `package.restart`, which recreated the rootless port listener on `*:8888`. - - Re-ran audit: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=searxng ARCHY_TIMEOUT=180 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` passed. - - Re-ran browser launch qualification for SearXNG in panel mode; Playwright passed. -- Ollama is currently blocked/unqualified: - - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=ollama ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh` failed after install because `container-list` stayed `absent` for 900s. - - No `apps/ollama/manifest.yml` exists and `ollama` is absent from `app-catalog/catalog.json` / `neode-ui/public/catalog.json`. - - Confirmed configured image is missing: `podman manifest inspect --tls-verify=false 146.59.87.168:3000/lfg2025/ollama:latest` returns `manifest unknown`. - - This matches `CHANGELOG.md` v1.7.45 note that Ollama was removed because it hung installs due to no source image in registries. -- Nextcloud is now qualified: - - Full backend/container lifecycle passed with preserve-data uninstall/reinstall: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=nextcloud ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. - - Browser Launch qualification passed as a new-tab app against `http://192.168.1.198:8085/`. 
- - Note: Nextcloud sends `X-Frame-Options: SAMEORIGIN`; panel/iframe launch leaves an empty iframe body from dashboard origin, so qualify it with `ARCHY_EXPECTED_LAUNCH_MODE=popup`. -- Vaultwarden is now qualified: - - Initial audit found `vaultwarden` absent by RPC but a stale rootless `pasta` listener still bound to `*:8082`; cleared with `pkill -f "pasta.*8082"` before install. - - Full backend/container lifecycle passed with preserve-data uninstall/reinstall: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=vaultwarden ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh`. - - Browser Launch qualification passed as a new-tab app against `http://192.168.1.198:8082/`. -- Continue one-by-one lifecycle/browser qualification with `jellyfin`, `photoprism`, `immich`, `nginx-proxy-manager`, `portainer`, `uptime-kuma`, `dwn`, `botfights`, and `gitea`. Skip Ollama until an image/manifest/catalog entry is restored. - -## 2026-05-04 Fedimint Update - -- Latest deployed backend hash observed on `.198`: `cb464ede6625c00f4fa9e8940d933d7a69d29b0537cfabd8da783f0116a0c587`. -- Fedimint Guardian is now qualified under the current release standard: - - Full backend/container lifecycle passed with preserve-data uninstall/reinstall. - - Browser Launch qualification passed in panel/iframe mode against `http://192.168.1.198:8175/`. -- Root-cause fix: Fedimint image runs as uid `0` inside the rootless container, so its bind-mounted data directory must be host-owned by `1000:1000`, not subuid `100000:100000`. -- Implemented ownership repair in `core/archipelago/src/container/prod_orchestrator.rs` via the Fedimint pre-start/data-dir hook. 
-- Passing lifecycle command: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh -``` - -- Passing browser launch command: - -```bash -cd /home/archipelago/Projects/archy/neode-ui -ARCHY_BASE_URL=http://192.168.1.198 \ -ARCHY_PASSWORD=password123 \ -ARCHY_APP_ID=fedimint \ -ARCHY_APP_TITLE='Fedimint Guardian' \ -ARCHY_APP_CARD_TITLE='Fedimint Guardian' \ -ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:8175/ \ -ARCHY_EXPECTED_LAUNCH_MODE=panel \ -ARCHY_EXPECTED_BODY_PATTERN='Fedimint|Guardian|Federation|Mint|Bitcoin' \ -npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line -``` - -- Result: `1 passed (11.7s)`. -- Note: backend scanner currently reports Fedimint `lan_address` from the first exposed port (`8173`), but the frontend app-session mapping correctly launches the UI on `8175`. - -### Next Recommended Work After Fedimint - -- Continue with IndeedHub full lifecycle and browser Launch qualification. - -## 2026-05-04 Mempool Update - -- Latest deployed backend hash on `.198`: `02d79360df86d653c9e7b06a05bdf039a0454b81a65220dbe16fa57cafeed236`. -- Mempool is now qualified: - - Full backend/container lifecycle passed. - - Browser Launch qualification passed in panel/iframe mode. - -### Mempool Issues Fixed - -- Initial Mempool lifecycle failed after install with `bad health: mempool is unknown`. -- Root cause: package id `mempool` maps to manifest/app id `archy-mempool-web` with container name `mempool`; `container-health` called `orchestrator.health("mempool")` directly and bypassed alias candidates. -- Added alias handling in `core/archipelago/src/api/rpc/container.rs`: - - `mempool` / `mempool-web` status candidates include `archy-mempool-web`. - - specific `container-health { app_id: "mempool" }` now tries alias candidates and direct Podman container-name fallback. 
-- After deploy, short audit passed: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=mempool ARCHY_TIMEOUT=60 ARCHY_STABILITY_SECONDS=0 tests/lifecycle/remote-lifecycle.sh -``` - -- Mempool full lifecycle passed: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=mempool ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -``` - -- Result: `all checks passed`. - -### Mempool Browser Launch - -- Mempool is an in-panel/iframe app, not a new-tab app. -- Initial browser test failed because the generic spec expected a popup. -- Updated `neode-ui/e2e/app-launch.spec.ts`: - - `ARCHY_EXPECTED_LAUNCH_MODE=panel` verifies an app session iframe instead of popup. - - Card selection now matches a card heading exactly via `APP_CARD_TITLE`/`APP_TITLE`, avoiding false matches from description text (ElectrumX description mentions Mempool). - - Panel iframe selector tolerates source URLs without a trailing slash. -- Passing command: - -```bash -cd /home/archipelago/Projects/archy/neode-ui -ARCHY_BASE_URL=http://192.168.1.198 \ -ARCHY_PASSWORD=password123 \ -ARCHY_APP_ID=mempool \ -ARCHY_APP_TITLE=Mempool \ -ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:4080/ \ -ARCHY_EXPECTED_LAUNCH_MODE=panel \ -ARCHY_EXPECTED_BODY_PATTERN='Mempool|Bitcoin|Block|Transaction' \ -npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line -``` - -- Result: `1 passed (15.8s)`. - -### Next Recommended Work After Mempool - -- Continue installed app qualification with `electrumx` or `filebrowser`. -- ElectrumX already had prior focused work but should get the current browser launch standard if not already rerun after these Playwright spec changes. 
-- Suggested ElectrumX backend lifecycle: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=electrumx ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -``` - -- Suggested ElectrumX browser launch: - -```bash -cd /home/archipelago/Projects/archy/neode-ui -ARCHY_BASE_URL=http://192.168.1.198 \ -ARCHY_PASSWORD=password123 \ -ARCHY_APP_ID=electrumx \ -ARCHY_APP_TITLE=ElectrumX \ -ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:50002/ \ -ARCHY_EXPECTED_LAUNCH_MODE=panel \ -ARCHY_EXPECTED_BODY_PATTERN='ElectrumX|Connect Your Wallet|50001' \ -npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line -``` - -## 2026-05-04 Resume Snapshot - -- Another agent changed the worktree before this session; do not revert unrelated dirty files. -- `.198` service is active, `archipelago-doctor.timer` inactive, `archipelago-reconcile.timer` inactive. -- Latest deployed backend hash on `.198`: `02d79360df86d653c9e7b06a05bdf039a0454b81a65220dbe16fa57cafeed236`. -- LND remains qualified from prior session: full backend lifecycle passed and browser Launch opens `http://192.168.1.198:18083/` with wallet-connect content. -- BTCPay is now qualified: - - Full backend/container lifecycle passed after stop-state normalization fix. - - Browser Launch qualification passed against `.198`; first-run redirect to `/register` is accepted. - -### 2026-05-04 Work Completed - -- Rechecked local/remote state after separate-agent work. -- Ran BTCPay full lifecycle: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=btcpay-server ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -``` - -- Initial BTCPay run failed at stop because BTCPay containers were explicitly stopped, but Podman reports stopped containers as `exited`; scanner overwrote package state from `Stopped` to `Exited`, and the harness waited for `stopped`. 
-- Fixed scanner merge path in `core/archipelago/src/server.rs`: scanned `Exited` package entries are normalized to `Stopped` when the app id is present in `/var/lib/archipelago/user-stopped.json` via configured `data_dir`. -- Rebuilt and deployed backend to `.198`; new hash `6bd9db024ab37017cadd684cb3296c6adbcf290ac27e1238a6bf1e7c0f883e3e`. -- Verified BTCPay then reports `state=stopped` after explicit stop. -- Reran BTCPay full lifecycle; result: `all checks passed`. -- Updated `neode-ui/e2e/app-launch.spec.ts` to support app-specific URL/body regexes: - - `ARCHY_EXPECTED_LAUNCH_URL_PATTERN` - - `ARCHY_EXPECTED_BODY_PATTERN` -- Ran BTCPay browser launch qualification: - -```bash -cd /home/archipelago/Projects/archy/neode-ui -ARCHY_BASE_URL=http://192.168.1.198 \ -ARCHY_PASSWORD=password123 \ -ARCHY_APP_ID=btcpay-server \ -ARCHY_APP_TITLE=BTCPay \ -ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:23000/ \ -ARCHY_EXPECTED_LAUNCH_URL_PATTERN='^http://192\.168\.1\.198:23000/(register)?$' \ -ARCHY_EXPECTED_BODY_PATTERN='BTCPay|Create.*account|Register|Store' \ -npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line -``` - -- Result: `1 passed (10.3s)`. - -### Next Recommended Work - -- Mempool is now complete. Continue app-by-app qualification with ElectrumX or File Browser. 
-- Prior suggested Mempool command, now passing: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=mempool ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -``` - -- If Mempool backend lifecycle passes, run browser launch qualification: - -```bash -cd /home/archipelago/Projects/archy/neode-ui -ARCHY_BASE_URL=http://192.168.1.198 \ -ARCHY_PASSWORD=password123 \ -ARCHY_APP_ID=mempool \ -ARCHY_APP_TITLE=Mempool \ -ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:4080/ \ -ARCHY_EXPECTED_BODY_PATTERN='Mempool|Bitcoin|Block|Transaction' \ -npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line -``` - -### Updated Resume Prompt - -```text -Resume Archipelago container lifecycle hardening from /home/archipelago/Projects/archy. Read docs/CONTAINER_LIFECYCLE_HANDOFF.md first. Remote node is 192.168.1.198, SSH key /home/archipelago/.ssh/id_ed25519, ARCHY_PASSWORD=password123. Preserve data unless explicitly told otherwise. Keep archipelago-doctor.timer and archipelago-reconcile.timer paused. Do not revert unrelated dirty worktree changes because another agent has been working too. LND, BTCPay, and Mempool now have full backend lifecycle plus browser Launch qualification passing. Latest deployed backend hash on .198 is 02d79360df86d653c9e7b06a05bdf039a0454b81a65220dbe16fa57cafeed236. Continue with the next installed app, likely ElectrumX or File Browser, using full lifecycle and then Playwright browser launch qualification. -``` - -## 2026-05-03 Resume Snapshot - -- Remote node under test: `192.168.1.198`. -- SSH key: `/home/archipelago/.ssh/id_ed25519`. -- Lifecycle password: `ARCHY_PASSWORD=password123`. -- Current qualification target: BTCPay full lifecycle. LND user-facing launch flow is now qualified. -- Do not proceed to broad release/audit until app launch qualification includes a real browser click/open-tab check, not just backend/direct-port curl. 
-- Preserve data during lifecycle testing unless explicitly told otherwise. -- Legacy timers should remain paused during deterministic qualification: `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive/disabled. - -### Latest Deployed State On `.198` - -- Backend deployed to `/usr/local/bin/archipelago`; service observed active. -- Latest backend hash observed on `.198`: `abbd9fa4e6beace75f590c1988a1904b9de62b4b21fade1291926ac039c4747b`. -- Frontend bundle was rebuilt with LND new-tab config and deployed to `/opt/archipelago/web-ui`. -- Dashboard entrypoint at `http://192.168.1.198/` returns `200` and fresh `Last-Modified: Sun, 03 May 2026 20:09:08 GMT`. -- Dashboard CSP allows direct app ports via `connect-src ... http://192.168.1.198:*` and `frame-src ... http://192.168.1.198:*`. -- LND direct UI still works from the test environment: - -```bash -curl -fsS -D - http://192.168.1.198:18083/ -o /tmp/opencode/lnd-ui.html -``` - -Expected: `HTTP/1.1 200 OK`, wallet-connect page content including `Connect Your Wallet`, `lndQrBox`, `rest-tor`, `grpc-tor`, and `Copy lndconnect URI`. - -### LND Status - -- Backend/container lifecycle for LND passed after the latest backend changes: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -``` - -- Result: `all checks passed` through install, stop, start, restart, preserve-data uninstall, reinstall. -- Direct LND UI is reachable at `http://192.168.1.198:18083/`. -- Product/UI launch is now qualified by Playwright against `.198`. User previously saw browser launch failures (`refused to connect` / `This site can't be reached`), but the deployed frontend/backend now open the direct LND UI URL successfully. -- Frontend changes intended to fix this: - - `neode-ui/src/views/appSession/appSessionConfig.ts`: `lnd` added to `NEW_TAB_APPS`. 
- - `neode-ui/src/views/apps/appsConfig.ts`: `lnd` added to `TAB_LAUNCH_APPS`. - - `neode-ui/src/stores/appLauncher.ts`: `lnd` added to `NEW_TAB_APP_IDS`. - -### Browser-Level Launch Check Added - -- Added `neode-ui/e2e/app-launch.spec.ts` as a reusable Playwright qualification test. -- Intended run command: - -```bash -cd /home/archipelago/Projects/archy/neode-ui -ARCHY_PASSWORD=password123 \ -ARCHY_APP_ID=lnd \ -ARCHY_APP_TITLE=LND \ -ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:18083/ \ -npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line -``` - -- Current result: passing against `.198`. -- Passing command: - -```bash -cd /home/archipelago/Projects/archy/neode-ui -ARCHY_BASE_URL=http://192.168.1.198 \ -ARCHY_PASSWORD=password123 \ -ARCHY_APP_ID=lnd \ -ARCHY_APP_TITLE=LND \ -ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:18083/ \ -npx playwright test e2e/app-launch.spec.ts --config=playwright.config.ts --project=chromium --reporter=line -``` - -- Result: `1 passed (12.3s)`. -- The test clicks the real My Apps `Launch` button, waits for the popup, verifies URL `http://192.168.1.198:18083/`, and checks wallet-connect text in the popup body. - -### New Root-Cause Findings To Continue - -- `AppDetails` can render `App Not Found` before package data has arrived. The route still does not wait for the WebSocket initial package snapshot; the launch qualification now uses My Apps card launch, which matches user behavior. -- `server.get-state` frontend call was broken against the deployed backend: - -```text -RPC method: server.get-state -RPC error on server.get-state: Unknown method: server.get-state -``` - -- Fixed by adding `server.get-state` dispatch support in `core/archipelago/src/api/rpc/dispatcher.rs` and deploying the new backend to `.198`. -- Verified browser-authenticated `server.get-state` returns `hasLnd=true`, `status=200`, `error=null`. 
-- WebSocket initial data still works; logs showed `WebSocket /ws/db connected` and initial state dumps. -- Earlier browser-test failures were due to wrong Playwright `baseURL` defaulting to `.228` and/or empty package state on that node, not LND direct UI reachability. -- Direct unauthenticated `container-list` is allowed by auth rules, but authenticated browser calls without CSRF fail with `403`; the Playwright test should not rely on raw RPC calls without CSRF unless using exempt read-only methods. - -### Immediate Resume Steps - -1. Proceed to BTCPay full lifecycle: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=btcpay-server ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -``` - -2. If BTCPay passes backend lifecycle, add/run browser-level launch qualification for BTCPay using the same Playwright spec with `ARCHY_APP_ID=btcpay-server`, `ARCHY_APP_TITLE=BTCPay`, and `ARCHY_EXPECTED_LAUNCH_URL=http://192.168.1.198:23000/`. - -3. Fix stale `boot_reconciler` unit tests for existing-only production behavior if running the full backend test suite. - -### Verification Commands Before Resuming - -```bash -ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 'systemctl is-active archipelago.service; systemctl is-active archipelago-doctor.timer 2>/dev/null || true; systemctl is-active archipelago-reconcile.timer 2>/dev/null || true; podman ps -a --format "{{.Names}} {{.Status}} {{.Ports}}" | egrep "lnd|btcpay|nbxplorer|bitcoin|electrs" || true' -``` - -```bash -curl -fsS -D - http://192.168.1.198:18083/ -o /tmp/opencode/lnd-ui.html -``` - -### Files Touched In This Latest Session - -- `neode-ui/e2e/app-launch.spec.ts`: new parameterized Playwright launch qualification spec. -- `neode-ui/playwright.config.ts`: `baseURL` can now be overridden with `ARCHY_BASE_URL`. -- `core/archipelago/src/api/rpc/dispatcher.rs`: added `server.get-state` dispatch handler. 
-- `neode-ui/src/views/appSession/appSessionConfig.ts`: LND forced new-tab session behavior. -- `neode-ui/src/views/apps/appsConfig.ts`: LND marked as tab-launch app. -- `neode-ui/src/stores/appLauncher.ts`: LND forced new-tab from legacy/open URL path. -- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`: this handoff update. - -### Still Dirty / Important - -- Worktree is dirty with many lifecycle/backend/frontend changes and untracked files. Do not revert other changes. -- `git status --short` currently includes untracked `tests/lifecycle/remote-lifecycle.sh`, `core/archipelago/src/container/lnd.rs`, `neode-ui/e2e/app-launch.spec.ts`, and this handoff doc. -- No commit was created. - -### Resume Prompt - -Use this prompt in a fresh remote session: - -```text -Resume Archipelago lifecycle hardening from /home/archipelago/Projects/archy. Read docs/CONTAINER_LIFECYCLE_HANDOFF.md first. Current remote node is 192.168.1.198, SSH key /home/archipelago/.ssh/id_ed25519, ARCHY_PASSWORD=password123. LND backend lifecycle and browser launch qualification are now passing; latest deployed backend hash on .198 is abbd9fa4e6beace75f590c1988a1904b9de62b4b21fade1291926ac039c4747b. Continue with BTCPay full lifecycle, then add/run the same browser launch qualification for BTCPay. Preserve data unless explicitly told otherwise, keep doctor/reconcile timers paused, and do not revert unrelated dirty worktree changes. -``` - -## Operator Snapshot - -- Plan: harden app/container lifecycle before release using strict lifecycle tests and app-specific probes. -- Current target: run broad `.198` audit after focused fixes for LND, Bitcoin Knots, Fedimint, and IndeedHub. -- LND status on `.198`: strict audit and full preserve-data lifecycle passed on 2026-05-02. -- Bitcoin Knots status on `.198`: full preserve-data lifecycle passed on 2026-05-02. -- Fedimint status on `.198`: full preserve-data lifecycle passed on 2026-05-02. 
-- IndeedHub status on `.198`: full preserve-data lifecycle passed on 2026-05-02. -- Last known local status: focused lifecycle/orchestrator/container unit tests pass and release build succeeds. -- Do not release until broad audit and app-specific UI probes pass. - -## Goal - -Harden and verify Archipelago app/container lifecycle before release. Required coverage is install, launch, stop, start, restart, uninstall with `preserve_data=true`, reinstall, and launch again. UI checks must validate app-specific functionality, not only HTTP 200. - -## Current Focus - -Run broad lifecycle audit on node `192.168.1.198`, then continue app-by-app for any installed package that is non-running or unhealthy. LND, Bitcoin Knots, Fedimint, and IndeedHub have each passed focused strict lifecycle validation. - -Strict LND criteria: - -- `lnd` container reaches `running`. -- `archy-lnd-ui` companion serves `/app/lnd/`. -- LND wallet is initialized or unlocked non-interactively. -- `/var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon` exists. -- `/lnd-connect-info` returns certificate, macaroon, REST/gRPC ports, and Tor onion. -- LND UI contains all connection modes: REST local, REST Tor, gRPC local, gRPC Tor. -- QR/connect controls are present and backed by real connection info. - -## Important Nodes - -- `.198`: SSH works with `/home/archipelago/.ssh/id_ed25519`. -- `.228`: RPC works, SSH still blocked with `Permission denied (publickey,password)`. - -## Test Harness - -Primary remote harness: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd tests/lifecycle/remote-lifecycle.sh -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh -``` - -Harness changes made: - -- Normalizes package states with `ascii_downcase` because API can return `Running`. -- Audit mode allows `absent`, fails installed non-running states. -- Full lifecycle uses preserve-data uninstall. 
-- LND probe checks DOM, all four connection modes, `/lnd-connect-info`, macaroon/cert lengths, REST/gRPC ports, and Tor onion. -- Electrum probe now checks local and Tor QR containers/fields, `qrcode.js`, and `/electrs-status` Tor onion. -- Added `ARCHY_STABILITY_SECONDS` observation window, default `15`, so a single `running` snapshot is not enough. -- Audit/full lifecycle now call `container-health` after install/start/restart/reinstall and fail anything other than `healthy`. -- Focused validation passed for LND, Bitcoin Knots, Fedimint, and IndeedHub. - -## Implemented Backend Changes - -### Lifecycle/Reconcile - -- `core/archipelago/src/server.rs` - - Scanner merge now recovers stale `Removing -> Running` if the container is actually live. - - Added stale-removing recovery test. -- `core/archipelago/src/main.rs` - - Crash recovery now runs synchronously before BootReconciler. -- `core/archipelago/src/bootstrap.rs` - - Removed automatic deletion of `/run/user/1000/{containers,libpod}` when `podman info` fails. -- `core/archipelago/src/crash_recovery.rs` - - Generic boot recovery narrowed to safe containers only. -- `core/archipelago/src/container/prod_orchestrator.rs` - - Uninstall disables manifests rather than deleting manifest availability. - - Explicit reinstall re-enables disabled manifests. - - LND pre-start writes/repairs config. - - LND post-start initializes/unlocks wallet in production. - - Post-start hook is skipped in `cfg(test)` so unit tests do not mutate host LND state. - - `stop` disables desired-state reconcile until explicit start. - - Reconciler respects `/var/lib/archipelago/user-stopped.json` across daemon restarts. - - Start path recreates containers when stale rootless Podman runtime state prevents startup. -- `core/archipelago/src/api/rpc/package/install.rs` - - Install reconciles companion UIs synchronously. -- `core/archipelago/src/api/rpc/package/runtime.rs` - - Start/restart reconcile companions. 
- - Missing known companion containers are tolerated during stop/restart. -- `core/archipelago/src/health_monitor.rs` - - Added Bitcoin variant conflict guard for auto-restart: `bitcoin-core` and `bitcoin-knots` can both be installed, but the monitor must not auto-start one into default `8332/8333` while the other is already running. - - Added unit tests for the conflict guard. -- `core/archipelago/src/api/rpc/package/install.rs` - - Removed install-time hard block between `bitcoin-core` and `bitcoin-knots`; users may install both. Runtime still needs alternate ports or one inactive variant to run both simultaneously. -- `core/archipelago/src/api/rpc/package/config.rs` - - Bitcoin variant container resolution is precise, so package operations for one variant do not target the other. -- `core/container/src/podman_client.rs` - - Custom network containers now receive container-name DNS aliases. - - Containers get `host.archipelago:10.89.0.1` for host RPC access from rootless networks. -- `apps/fedimint/manifest.yml` and `apps/fedimint-gateway/manifest.yml` - - Fedimint data owner fixed to `1000:1000`. - - Bitcoin RPC host changed to `http://host.archipelago:8332`. - -### Companions - -- `core/archipelago/src/container/companion.rs` - - LND UI uses bridge networking, not host networking. - - LND UI moved from host `8081` to host `18083` to avoid `nostr-rs-relay` conflict. - - Test updated to expect `18083:80`. -- Routing/metadata moved LND UI to `18083`: - - `apps/lnd-ui/manifest.yml` - - `core/archipelago/src/container/docker_packages.rs` - - `core/container/src/podman_client.rs` - - `core/archipelago/src/port_allocator.rs` - - `neode-ui/src/views/appSession/appSessionConfig.ts` - - `neode-ui/src/stores/container.ts` - - `neode-ui/src/stores/appLauncher.ts` - - `neode-ui/src/views/appDetails/appDetailsData.ts` - - nginx snippets/configs for `/app/lnd/` now proxy to `127.0.0.1:18083`. - -### LND - -- New/expanded `core/archipelago/src/container/lnd.rs`. 
-- `ensure_config()` writes required Bitcoin backend flags: - - `bitcoin.active=true` - - `bitcoin.mainnet=true` - - `bitcoin.node=bitcoind` - - `bitcoind.rpchost=bitcoin-knots:8332` -- Handles permission denied writing `lnd.conf` via sudo. -- `ensure_wallet_initialized()` now: - - Checks wallet/macaroons via sudo-aware helpers because LND data is container-owned `0700`. - - Uses REST unlocker `GET /v1/genseed` and `POST /v1/initwallet` for new wallets. - - Falls back to `lncli unlock --stdin` if wallet already exists. - - Uses sudo-aware read for macaroon when checking `/v1/getinfo` readiness. - -## Verified Locally - -Recent focused test passes: - -```bash -cd /home/archipelago/Projects/archy/core -cargo test -p archipelago --bin archipelago health_monitor -cargo test -p archipelago --bin archipelago prod_orchestrator -cargo test -p archipelago --bin archipelago bitcoin_variant_container_names_are_precise -cargo test -p archipelago-container podman_network_settings_uses_networks_map_for_custom_networks -bash -n ../tests/lifecycle/remote-lifecycle.sh -``` - -Release build succeeds: - -```bash -cd /home/archipelago/Projects/archy/core -cargo build -p archipelago --bin archipelago --release -``` - -## `.198` Current State - -Recent deployment: - -- Built release binary with sudo-aware LND wallet checks and LND UI port `18083`. -- Deployed to `/usr/local/bin/archipelago` on `.198` with backup. -- Restarted `archipelago.service`; it returned `active`. -- nginx on `.198` was already updated so `/app/lnd/` proxies to `127.0.0.1:18083`. - -Known `.198` observations: - -- LND wallet artifacts exist after previous bootstrap: - - `/var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon` - - `/var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/wallet.db` -- `nostr-rs-relay` occupies `8081`; LND UI must stay on `18083`. 
-- LND strict audit passed on 2026-05-02: - - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd tests/lifecycle/remote-lifecycle.sh` -- LND full preserve-data lifecycle passed on 2026-05-02: - - `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh` -- Final observed state after LND lifecycle: - - `archipelago.service` active. - - `nginx` active. - - `lnd` running on `8080`, `9735`, and `10009`. - - `archy-lnd-ui` running on `18083`. - - `archy-electrs-ui` running and `50002` listening. -- Active default Bitcoin backend is currently `bitcoin-knots`; `bitcoin-core` is installed but user-stopped. -- `/var/lib/archipelago/user-stopped.json` should include `bitcoin-core` so daemon restart does not resurrect it into a default-port conflict. -- Fedimint fixed issues: - - stale rootless Podman runtime storage was handled by recreate-on-start-failure path. - - data ownership fixed for gateway and federation DB lock files. - - Bitcoin RPC DNS fixed via `host.archipelago` host alias. -- IndeedHub full lifecycle passed after forcing the dedicated stack installer path, which removes stale stack containers and recreates network aliases and volumes. - -## Focused Remote Passes - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=bitcoin-knots ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -``` - -Result for each focused run: `all checks passed`. 
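The four focused runs above share the same settings and differ only in `ARCHY_APPS`, so they can be sketched as one loop. This is a minimal sketch, not part of the harness: it assumes `ARCHY_HOST` and `ARCHY_PASSWORD` are already exported by the caller, and `ARCHY_HARNESS` is a hypothetical override added here for illustration (it defaults to the `tests/lifecycle/remote-lifecycle.sh` path used in this handoff).

```bash
#!/usr/bin/env bash
# Sketch only: loop form of the focused remote passes.
# ARCHY_HOST and ARCHY_PASSWORD are expected to be exported by the caller.
# ARCHY_HARNESS is a hypothetical override introduced for this example;
# it defaults to the harness path used throughout this handoff.

run_focused_passes() {
  local harness="${ARCHY_HARNESS:-tests/lifecycle/remote-lifecycle.sh}"
  local app
  local failed=()

  for app in "$@"; do
    echo "== ${app}: full lifecycle =="
    # Same per-run settings as the four focused commands above.
    if ! ARCHY_APPS="$app" ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 "$harness"; then
      failed+=("$app")
    fi
  done

  # Report every failing app instead of stopping at the first one.
  if [ "${#failed[@]}" -gt 0 ]; then
    echo "failed: ${failed[*]}"
    return 1
  fi
  echo "all focused runs passed"
}
```

A possible invocation from the repo root would be `run_focused_passes lnd bitcoin-knots fedimint indeedhub`. Collecting failures rather than exiting early keeps one unhealthy app from hiding the state of the others.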
- -## Immediate Next Steps - -1. Run broad audit: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 tests/lifecycle/remote-lifecycle.sh -``` - -2. Continue app-by-app for any installed package that broad audit reports as non-running or unhealthy. - -3. Resume Electrum full lifecycle with strict Tor/QR checks if Electrum remains in scope. Previous run was user-aborted during `electrumx: install`: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=electrumx ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh -``` - -4. If Electrum fails, capture current service and port state: - -```bash -ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 'systemctl is-active archipelago.service; systemctl is-active nginx; ss -ltn | grep -E ":(50001|50002|18083|8081|8080|10009|9735)" || true; podman ps -a --format "{{.Names}} {{.Status}} {{.Ports}}" | egrep "electrs|electrum|lnd|nostr" || true' -``` - -5. LND commands that passed and can be rerun as a regression check: - -```bash -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd tests/lifecycle/remote-lifecycle.sh -ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh -``` - -6. If `/app/lnd/` regresses to `502`, inspect companion unit and logs: - -```bash -ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 'systemctl --user status archy-lnd-ui.service --no-pager -l 2>&1 | sed -n "1,160p"; test -f ~/.config/containers/systemd/archy-lnd-ui.container && sed -n "1,160p" ~/.config/containers/systemd/archy-lnd-ui.container || true; journalctl --user -u archy-lnd-ui.service -n 160 --no-pager 2>&1 | sed -n "1,160p"' -``` - -7. 
If `package.stop lnd` regresses and does not stop the container, inspect runtime stop path in: - -- `core/archipelago/src/api/rpc/package/runtime.rs` -- `core/archipelago/src/container/prod_orchestrator.rs` - -Likely issue: state scanner/reconciler or companion handling re-starts LND during stop/uninstall, or stop path waits on package state while container is being reconciled. - -## Previously Fixed Live Issues On `.198` - -- stale `fedimint=removing` recovered. -- orphaned `filebrowser` rootlessport on `8083` cleared. -- orphaned `bitcoin-core` rootlessport on `8332/8333` cleared. -- LND missing `bitcoin.active`/backend config fixed. -- LND config permission denied fixed via sudo write. -- Companion start/restart race mostly fixed by synchronous companion reconciliation. -- Bitcoin Core/Knots install-time conflict removed while preserving runtime default-port safety. -- Bitcoin Core unintended resurrection after daemon restart fixed through persistent user-stopped state. -- Fedimint DB lock permission errors fixed through `1000:1000` data ownership. -- Fedimint Bitcoin RPC DNS errors fixed through `host.archipelago`. -- IndeedHub stale stopped stack fixed by reinstalling through the dedicated stack installer. - -## Do Not Forget - -- Do not release until strict lifecycle and app-specific UI probes pass. -- Preserve data during destructive lifecycle testing unless explicitly instructed otherwise. -- Do not revert user/other-agent worktree changes. -- `.228` still needs SSH fixed or must be tested RPC/UI-only. diff --git a/docs/CURRENT_AGENT_HANDOFF.md b/docs/CURRENT_AGENT_HANDOFF.md deleted file mode 100644 index 1e67913f..00000000 --- a/docs/CURRENT_AGENT_HANDOFF.md +++ /dev/null @@ -1,216 +0,0 @@ -# Current Agent Handoff - Bitcoin UI Recovery And `1.8-alpha` Resume - -Last updated: 2026-06-10 05:33 EDT - -## Read This First - -This is a separate handoff from `docs/NEXT_TERMINAL_HANDOFF.md`. That file tracks -an older/broader plan. 
For the next agent resuming this machine-switch pause, -read this file first, then read: - -- `docs/RESUME.md` -- `docs/1.8-alpha-improvements-tracker.md` -- `docs/CONTAINER_LIFECYCLE_HANDOFF.md` -- `docs/MIGRATION_STATUS_REPORT.md` - -Do not assume `docs/NEXT_TERMINAL_HANDOFF.md` is the current short-term plan. - -## Current Goal - -Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image. - -The release goal is not just "apps launch once"; the app/container system needs -to be developer-ready and production-release ready: - -- manifests and docs must describe the real runtime contract; -- apps must install, start, stop, restart, uninstall, reinstall, survive reboot, - report truthful status, and show useful progress; -- My Apps must preserve last-known truth during Podman/scanner backoff instead - of showing false empty/no-app states; -- Bitcoin-dependent apps must explain sync/wallet readiness instead of looking - broken; -- final validation needs focused lifecycle, broad non-destructive lifecycle, - then repeated reboot checks before ISO cut/smoke test. - -## Current Estimate - -As of this pause: - -- Credible release candidate: roughly `87-91%`. -- Production-quality release developers will love: roughly `73-79%`. -- Calendar estimate if the remaining systemic lifecycle issues are bounded: - `1-2 focused engineering days` for a release candidate, then additional - reboot/ISO smoke time. -- The biggest remaining risk is not catalog wiring; it is rootless Podman - control-plane responsiveness, stale scanner state, lifecycle progress UX, and - reboot validation. - -## Validation Host - -- Host: `192.168.1.198` -- SSH user: `archipelago` -- Password used in this session: `password123` -- Active Bitcoin app on this host: `bitcoin-knots`, not `bitcoin-core` -- Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive - for deterministic validation unless intentionally testing them. -- Preserve app data. 
-- Avoid broad Podman store/image cleanup commands on `.198`. - -## Bitcoin UI Incident Summary - -User reported the Bitcoin custom UI showing: - -`Bitcoin node is starting or busy syncing; retrying automatically. Detail: -getblockchaininfo: Bitcoin RPC request failed ... operation timed out` - -Then after listener repair, the message changed through: - -- `Connection refused` -- `Verifying blocks...` -- then the user reported it looked fine again. - -What happened: - -- The node is a `bitcoin-knots` node. -- During live debugging, the wrong alias, `bitcoin-core`, was started/stopped. -- `bitcoin-core` and `bitcoin-knots` compete for the same Bitcoin RPC/P2P ports. -- That action left the real `bitcoin-knots` service active but without the host - `8332` rootlessport listener for a while. -- Stopping the stray `bitcoin-core.service` and restarting only - `bitcoin-knots.service` recreated listeners on `8332` and `8333`. -- After restart, bitcoind entered the normal `-28 Verifying blocks...` phase. -- The user later reported the Bitcoin UI looked fine again. - -Known live state observed during recovery: - -- `bitcoin-knots.service`: active -- `bitcoin-core.service`: inactive -- `archy-bitcoin-ui.service`: active -- listeners present after repair: - - `8332` via `rootlessport` - - `8333` via `rootlessport` - - `8334` via nginx/Bitcoin UI -- `bitcoin-knots` logs showed active IBD around height `4137xx` and progress - about `0.09438`. - -Do not restart Bitcoin again unless there is a fresh confirmed service/listener -failure. If checking status, prefer read-only probes and avoid starting the -wrong variant. - -## Source Fixes Made Locally - -These local edits were made after live Bitcoin recovered. They are not deployed -yet and were not fully validated before the user paused. 
- -### `core/archipelago/src/bitcoin_status.rs` - -Changed Bitcoin status cache behavior and copy: - -- refresh interval changed from `5s` to `10s`; -- transient error backoff added at `15s`; -- RPC client timeout increased from `8s` to `20s`; -- error context now uses full anyhow chain with `{e:#}`; -- transient classifications now include common overloaded/backend states; -- user-facing copy now distinguishes: - - `verifying blocks after restart`; - - `waiting for the Bitcoin RPC listener`; - - `busy and not answering RPC before the timeout`; - - generic `starting or busy syncing`; -- added unit tests for the three user-visible states above. - -Intent: stop collapsing distinct backend states into the same stale -"starting or busy syncing" timeout message. - -### `core/archipelago/src/api/rpc/package/update.rs` - -Narrow Bitcoin alias fix added: - -- `orchestrator_update_app_id("bitcoin-knots")` now remains - `"bitcoin-knots"` instead of mapping to `"bitcoin-core"`; -- candidate app IDs for a Bitcoin container now prefer `bitcoin-knots` before - `bitcoin-core`; -- tests updated to lock this behavior. - -Intent: `bitcoin-core` and `bitcoin-knots` can be dependency/status aliases, -but must not be interchangeable lifecycle/update targets on a node that has a -specific installed variant. - -Important: this file also already contained other uncommitted update/pull -timeout changes from prior work. Do not assume every diff in this file came -from this interruption. - -## Validation Status At Pause - -Completed: - -- `cargo fmt --manifest-path core/Cargo.toml --all` passed after the local - Bitcoin edits. - -Attempted but not completed: - -- Targeted Cargo tests were first launched in three separate `/tmp` target dirs - and failed due `/tmp` filling with `No space left on device`. 
-- Those temporary dirs were removed: - - `/tmp/archy-cargo-bitcoin-status` - - `/tmp/archy-cargo-update-alias` - - `/tmp/archy-cargo-container-candidates` -- A second run using `CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix` was still - compiling when the user paused. It was terminated for handoff. -- No successful Rust test result exists yet for the new Bitcoin status/alias - tests. - -Recommended validation after resume: - -```bash -git diff --check -- core/archipelago/src/bitcoin_status.rs core/archipelago/src/api/rpc/package/update.rs docs/CURRENT_AGENT_HANDOFF.md -CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix CARGO_BUILD_JOBS=2 cargo test --manifest-path core/Cargo.toml -p archipelago bitcoin_status::tests -CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix CARGO_BUILD_JOBS=2 cargo test --manifest-path core/Cargo.toml -p archipelago update_aliases_map_to_manifest_app_ids -CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix CARGO_BUILD_JOBS=2 cargo test --manifest-path core/Cargo.toml -p archipelago container_name_candidates_cover_common_aliases -``` - -If Cargo target locking appears stale, check for real `cargo`/`rustc` workers -before deleting anything. Prefer workspace-local target dirs under `.codex-tmp` -over new cold `/tmp` targets. - -## Immediate Next Steps - -1. Confirm no lingering Cargo process: - - ```bash - pgrep -af "cargo|rustc|cargo-bitcoin-fix" - ``` - -2. Validate the local Bitcoin source fixes listed above. - -3. If validation passes, build/deploy the backend to `.198` only after - confirming the user still wants deployment. - -4. Recheck live Bitcoin non-destructively: - - - `bitcoin-knots.service` active; - - `bitcoin-core.service` inactive; - - listeners on `8332`, `8333`, `8334`; - - Bitcoin UI loads on `8334`; - - `/bitcoin-status` returns useful copy if backend is busy. - -5. 
Resume release backlog: - - - rootless Podman lifecycle/control-plane responsiveness; - - My Apps last-known-state truthfulness during scanner backoff; - - progress UX for install/uninstall/start/stop/restart; - - remaining tracker rows in `docs/1.8-alpha-improvements-tracker.md`; - - focused lifecycle matrix on `.198`; - - broad non-destructive lifecycle; - - 3 clean reboot validations minimum, 5 preferred; - - ISO cut and ISO smoke test. - -## Cautions For Next Agent - -- Do not start `bitcoin-core` on `.198` unless intentionally migrating variants. -- Treat `bitcoin-knots` as the installed Bitcoin variant. -- Do not run broad Podman prune/store cleanup. -- Do not revert unrelated dirty worktree changes. -- `docs/NEXT_TERMINAL_HANDOFF.md` exists but is not the short-term handoff for - this pause. -- Many repo files are dirty from broader release hardening. Read diffs before - attributing changes. diff --git a/docs/HANDOFF-2026-06-20-mesh-netbird.md b/docs/HANDOFF-2026-06-20-mesh-netbird.md deleted file mode 100644 index bdc6c33b..00000000 --- a/docs/HANDOFF-2026-06-20-mesh-netbird.md +++ /dev/null @@ -1,144 +0,0 @@ -# Handoff — Mesh device rename, mesh routing, duplicate contacts, netbird logout (2026-06-20) - -Session is a **test-build iteration toward the 1.8.0 bug-bash release** — sideload patched binaries -to test nodes, NO version bump / NO OTA release (manifest stays `1.7.99-alpha`). Because the version -string never changes, **verify a deploy by sha256-matching the deployed binary**, not by `current_version`. - -## Test node roster (creds in the operator's local notes / agent memory — NOT in this repo) -- `.116` 192.168.1.116 — this build host (archi-thinkpad), dev/validation. -- `.198` 192.168.1.198, `.228` 192.168.1.228 — LAN resilience nodes. -- `.5` Tailscale 100.72.136.5 (archy-x250-beta) — **Meshtastic radio**. -- `.120` Tailscale 100.66.157.120 (archy-x250-exp) — **Meshtastic radio**. 
-- `.89` Tailscale 100.89.209.89 (archy-x250-pa) — **dual radio**: ttyACM0 Meshtastic (probe FAILS), - ttyUSB0 MeshCore (active). Configured device_path = ttyACM0. Runs netbird (v2.38.0). - -Deploy driver used this session: `/tmp/archy-deploy/deploy-node.sh