docs: consolidate into PRODUCTION-MASTER-PLAN, add CLAUDE.md, prune 25 stale docs

Single authoritative hub (docs/PRODUCTION-MASTER-PLAN.md) for the app-platform
north star: every app manifest-driven (zero OS-level reliance), manifests via the
signed registry, developer-ready external marketplace; rootless/secure/robust/
100%-uptime. Repo CLAUDE.md (auto-loaded each session) points agents at it until
the 20x lifecycle gate is green. New design doc registry-manifest-design.md.

Consolidated docs 56 -> 28: deleted dated handoffs/resumes/transcripts and
superseded trackers (content folded into the master plan or already in memory).
Kept all evergreen design/reference docs + ADRs (the master links them).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
archipelago 2026-06-21 05:11:32 -04:00
parent 03a4ee1b30
commit 192238cbb8
29 changed files with 494 additions and 8544 deletions

46
CLAUDE.md Normal file
View File

@ -0,0 +1,46 @@
# Archipelago — agent guide
## 🚩 TOP PRIORITY (until production testing passes)
**Read `docs/PRODUCTION-MASTER-PLAN.md` first.** It is the authoritative plan and
overrides ad-hoc direction until the production test gate is green. Goal: a
world-class, **developer-ready app platform** where every app is manifest-driven,
manifests ship via the **signed registry** (not OTA disk files), and **third-party
developers publish apps via an external/decentralized registry** — all rootless,
secure, robust, and 100%-uptime-capable.
Detailed sub-plans (all linked from the master):
- App platform / packaging phases + security model → `docs/APP-PACKAGING-MIGRATION-PLAN.md`
- Registry-distributed manifests (in progress) → `docs/registry-manifest-design.md`
- External/decentralized marketplace for devs → `docs/marketplace-protocol.md`
- Current per-app state → `docs/app-registry-status-2026-06-21.md`
- Production test gate (exit criterion) → `tests/lifecycle/TESTING.md`
## Invariants (never violate)
- **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged
containers unless explicitly approved.
- **No per-app Rust installers / no OS-level reliance.** Apps are declarative;
the orchestrator owns the lifecycle. `install_immich_stack` (hardcoded
`podman run` + `sudo chown`) is the anti-pattern being deleted, not a template.
- **Secrets are manifest-declared** (`generated_secrets`, materialised by
`container::secrets`, 0600/rootless) — never hardcoded, per-app, or logged.
- **Migrations never destroy data** — preserve `/var/lib/archipelago/<app>`,
secrets, credentials, ports, and adoption container names; keep a rollback path.
- **Verify on a real node (.228, then .198) before any tag.**
## Build / verify
- Rust workspace root is `core/` (no Cargo.toml at repo root). `cargo` from `core/`.
- If a `cargo test`/build hits `rust-lld: undefined hidden symbol`, it's
incremental-cache corruption — rebuild with `CARGO_INCREMENTAL=0`.
- Frontend: `neode-ui/``npm run build` outputs to `web/dist/neode-ui/`.
Grep the built bundle for new strings before shipping (build can silently no-op).
- App manifests load from disk on nodes at `/opt/archipelago/apps/*/manifest.yml`
(today); the goal is to distribute them via the signed catalog instead.
## Production test gate (definition of done)
`tests/lifecycle/run-20x.sh` green across install / UI / stop / start / restart /
reinstall / reboot-survive / archipelago-restart-survive / uninstall — **20× on
.228 AND .198**. Until green, the master plan is the priority.

View File

@ -1,231 +0,0 @@
# 1.8-alpha Improvements Tracker
Last updated: 2026-06-12 01:15 EDT
This tracks the user-facing improvement list that must land with the `1.8-alpha`
container migration release and the next ISO cut produced from that release. It
is intentionally separate from the container handoff docs, but should be treated
as release and ISO smoke-test scope.
Status legend:
- `todo`: not started.
- `in-progress`: active local work or validation.
- `blocked`: needs host access, hardware, credentials, a product decision, or an
external artifact.
- `done`: implemented and validated for this release.
- `defer?`: candidate to explicitly defer from `1.8-alpha` after product review.
Resume protocol:
1. Read this file after `docs/NEXT_TERMINAL_HANDOFF.md`.
2. Keep every user-requested improvement represented here until it is either
`done` or explicitly moved out of `1.8-alpha` by product decision.
3. When implementation starts, change status to `in-progress` and add the file,
test, host, or design decision being worked.
4. Mark `done` only after the change is implemented and validated locally or on
the release validation host, as appropriate.
5. Before cutting the next ISO, run this checklist as part of ISO smoke testing.
Active-session note, 2026-06-10 05:48 EDT: resumed from
`docs/NEXT_TERMINAL_HANDOFF.md`; no `.198` host actions have been run yet. The
immediate tracker-affecting local gate is rerunning the focused Rust
`container::image_versions::tests` validation for the Nextcloud false-update
row, then continuing lifecycle/control-plane truthfulness work.
Resume-save checkpoint, 2026-06-10 08:32 EDT: the current pass stayed on the
fixes backlog, not app migration. No `.198` host actions were run, no dev server
was intentionally left running, and no long-running validation command is
expected to still be active. Continue from the in-progress `Make tabs info load
quickly or show loading states` row or the next unresolved fixes-backlog row.
Active-session progress: `git diff --check` passed. Focused image-version Rust
validation is still inconclusive because the tool PTY stayed open with no
active compiler process visible, a bounded 300s retry using the normal
workspace target exited `124` before test output, and a fresh 600s retry in
`/tmp/archy-cargo-image-versions-2` also exited `124` after compiling into the
`archipelago` crate without reaching test output. The Nextcloud false-update
row remains `in-progress`. A local lifecycle fix is in progress so migrated
single-orchestrator app stops return immediately with a transitional state
instead of blocking the UI while Podman cleanup runs; `cargo fmt --check` and
focused backend compile check passed, and `git diff --check` is clean. Latest
credentials backlog follow-up added backend PhotoPrism credentials, centered
the mobile credential pre-launch modal in My Apps and the icon grid, and passed
focused frontend tests, type-check, backend compile check, `cargo fmt --check`,
and `git diff --check`. Web5 Connected Nodes Messages/Requests, Web5
Identities, and DWN message browsing now preserve visible content during
refresh/failure and show compact refresh labels instead of replacing populated
tabs with loading panels; focused tests and type-check passed. Server Network
overview, Network Interfaces, and Tor Services cards now keep visible values
during refresh or refresh failure and show compact refresh labels instead of
reverting to skeletons or false empty states; focused test and type-check
passed. The standalone Credentials view now keeps credential rows visible
during refresh/failure and shows `Refreshing credentials...`; focused test and
type-check passed. Lightning Channels now keeps existing channels visible
during refresh/failure and shows `Refreshing channels...`; focused test and
type-check passed. Peer Files now keeps existing peer catalog items visible
during Tor refresh/failure and shows `Refreshing peer files...`; focused test,
type-check, and `git diff --check` passed. Cloud peer cards now remain visible
during federation peer-list refresh/failure with `Refreshing peer nodes...`;
focused test, type-check, and `git diff --check` passed. The Web5 Verifiable
Credentials summary now keeps credential rows visible during refresh/failure
with `Refreshing credentials...`; focused test, type-check, and
`git diff --check` passed. Web5 Nostr Relays now keeps relay stats visible
during refresh/failure with `Refreshing relays...`; focused test, type-check,
and `git diff --check` passed. Web5 Domains now keeps registered-name counts
visible during refresh/failure with `Refreshing domains...`; focused test,
type-check, and `git diff --check` passed. Settings Backups now keeps existing
backup rows visible during refresh/failure with `Refreshing backups...`;
focused test, type-check, and `git diff --check` passed. Settings Transport
Preferences now keeps preference controls visible during refresh/failure with
`Refreshing transport preferences...`; focused test, type-check, and
`git diff --check` passed. Settings VPN status now keeps current connection
details visible during refresh/failure with `Refreshing VPN status...`;
focused test, type-check, and `git diff --check` passed. Web5 Federation now
shows `Refreshing federation...` during summary refresh and keeps existing node
counts/DID visible on refresh failure; focused test, type-check, and
`git diff --check` passed. Mesh map denied-location behavior now has component
coverage proving browser location denial reports that peer positions can still
appear without requiring local location; focused test, type-check, and
`git diff --check` passed. Companion/app-session mobile tab-app handling now
keeps apps that require a new tab inside the mobile session fallback instead of
auto-opening an external tab and closing; focused app-session, launcher, and
config tests passed with type-check and `git diff --check`.
Nostr Discoverable Nodes now keeps discovered rows visible during relay refresh
or relay failure and shows `Searching relays...`; focused test, type-check, and
`git diff --check` passed. App Store/App Details screenshot sections now render
only real screenshot metadata and no longer show fake placeholder tiles when no
assets exist; focused App Details content and marketplace handoff tests,
type-check, and `git diff --check` passed. Home now has an App Store
recommendations card driven by uninstalled core/recommended marketplace apps;
the recommendations respect installed aliases so apps drop out after install
and move into normal My Apps/Home behavior. Focused helper tests, type-check,
`git diff --check`, and the Playwright Home dashboard smoke passed. Easy Mode
goal configure steps now route to their owning app/screen, verify steps have an
explicit `Check & Continue` action, and configure/info/verify actions start
goal progress before completing the step; focused goal action/store tests,
type-check, and `git diff --check` passed. Setup path selection no longer shows
the disabled `Connect Existing (Coming Soon)` option; Fresh Start and Restore
from Seed are the only visible choices and route correctly. Focused onboarding
option/composable tests, type-check, and `git diff --check` passed. Header
responsiveness follow-up restored the primary My Apps/App Store/Websites
navigation to persistent desktop tabs at `md+` on My Apps, Discover, and
Marketplace; removed the desktop primary dropdowns; kept mobile dropdown
behavior; delayed App Store category collapse by lowering the search reserve and
header gap; and removed the My Apps desktop category dropdown. Focused
Marketplace/App config tests, type-check, and scoped `git diff --check` passed.
Browser smoke against the already-running local Vite/mock session is still next.
Active-session update, 2026-06-12 01:15 EDT: system update UX hardening landed
locally. `load_state()` now clears stale `update_in_progress` when no staged OTA
files exist, so failed legacy update attempts cannot leave the update screen
permanently stuck. Direct `update.git-apply` is gated behind
`ARCHIPELAGO_GIT_UPDATES`, preventing production nodes from accidentally entering
the local git/self-build path that requires `cargo`. `.116` was recovered from a
failed self-build attempt by applying its already-staged manifest OTA; it is now
on `1.7.84-alpha`, backend health is OK, nginx is active/config-valid, HTTP UI
returns `200`, `update_in_progress=false`, and staging was removed. Validation:
`cargo fmt --check`, `cargo check -p archipelago`, and scoped `git diff --check`
passed; focused `cargo test` was blocked by a local `rust-lld` undefined hidden
symbol linker failure unrelated to the updater patch.
Done criteria for this tracker:
- Code/UI items: implemented, covered by targeted test or manual smoke check,
and no known regression against the container migration work.
- Runtime/container items: validated on the release host named in
`docs/NEXT_TERMINAL_HANDOFF.md`, then included in ISO smoke test scope.
- Product-decision items: documented decision plus implementation task if the
decision keeps it in `1.8-alpha`.
- External/hardware items: hardware/document/access obtained, or explicitly
deferred from the release by product decision.
## Release-Critical Runtime Gates
| Item | Status | Release question / blocker |
| --- | --- | --- |
| Check logs of every server for errors and fix | blocked | Needs explicit target server list. Current docs name `.198`; are there more production validation hosts? |
| Go through issues on gate | blocked | Need location of "gate" issue tracker/board and access details. |
| Sort out container tagging so databases, backend, etc are sorted properly | in-progress | Tie to manifest/catalog metadata and My Apps grouping. |
| Sort out supplementary container naming so it is better | in-progress | Needs naming convention for dependencies: app-prefixed service names vs role-first names. |
| Figure out how we offer updates to apps | todo | Product/runtime design needed: manual update, scheduled checks, or auto-update by app tier. |
| Figure out how we provide different versions for Bitcoin to download and keep updated automatically | todo | Requires release policy for Knots/Core versions and whether users may pin old versions. |
| Make sure all credentials are given for apps without registration | in-progress | File Browser now exposes credentials on App Details and in the pre-launch interstitial. Backend `package.credentials` returns the secured File Browser password from `/var/lib/archipelago/secrets/filebrowser/password` when present, with `admin/admin` fallback matching the install hook. PhotoPrism now exposes manifest-backed `admin` / `archipelago` credentials from both backend `package.credentials` and the frontend fallback. My Apps and mobile icon-grid credential pre-launch modals are vertically centered on mobile. Covered by `appCredentials.test.ts`, `AppIconGrid.test.ts`, local type-check, backend compile check, `cargo fmt --check`, and `git diff --check`. Grafana was not added because `GRAFANA_ADMIN_PASSWORD` is not resolved to a known repo default/secret. Remaining no-registration apps still need inventory. |
| Nextcloud always shows update, and how are apps actually updated? | in-progress | Nextcloud manifest/catalog metadata is aligned to the pinned `nextcloud:29` image, and update detection now ignores registry-host-only image changes while still reporting real same-repo tag drift. Catalog drift check passed. Backend focused test was added but local validation hit a Rust linker/incremental artifact failure, then bounded retries exited `124` before test output, including a 600s fresh-target retry on 2026-06-10. Broader app update UX/policy design still needed. |
| Make sure Tor is solid as having to rotate addresses to get it to work | todo | Needs `.198`/target-host Tor logs and reproducible failure case. |
| Fix fleet it does not seem to work | done | Fleet data now preserves existing nodes during refresh, exposes an explicit refreshing state, sorts online nodes first, avoids duplicate history fetches when selecting a node, accepts backend `entries` and legacy `history` response shapes for per-node charts, and uses readable loading/auto-refresh UI. Covered by `useFleetData.test.ts`, local type-check, targeted tests, and user visual review of the Fleet header/card treatment. |
| Check Beta Telemetry and how it works | done | Telemetry is opt-in via `analytics-config.json`; the background reporter runs every 15 minutes only when enabled, saves `telemetry-latest.json`, writes local Fleet reports/history under `telemetry-fleet/`, and optionally POSTs a `telemetry.ingest` JSON-RPC envelope to `TELEMETRY_COLLECTOR_URL`. The systemd unit now reads optional `/var/lib/archipelago/telemetry.env`, and deploys write that file when `TELEMETRY_COLLECTOR_URL` is exported in `scripts/deploy-config.sh`. Manual and periodic report schemas now both include metric percentages and container inventory, and the Fleet UI normalizes older reports with missing fields. Covered by local type-check, `useFleetData.test.ts`, `cargo check -p archipelago`, deploy-script syntax check, and `git diff --check`. Remaining ops step: choose the real collector URL, deploy it, restart the service, and confirm central Fleet ingest. |
| Get Netbird working | todo | Requires app/runtime validation and credentials/config expectations. |
| Sort out how we are going to manage lightning channel creation | todo | Product design needed for UX, safety limits, fees, and peer selection. |
| Make sure old health notifications do not return on refresh/new login when stale/out of date | done | Health toasts now require a current app-linked unhealthy package state and hide stale package health notifications after 30 minutes on reload/new login. Backend monitoring notifications now prune duplicate active alerts and old generic alerts before pushing new ones. Covered by `HealthNotifications.test.ts`, local type-check, targeted frontend tests, and backend notification unit test work. |
| Fix BTCPay issue from desktop file "BTCPay Issues" | blocked | Need file contents or path to that desktop artifact. |
| Check Nostr Discoverable Nodes and get it working correctly | in-progress | Discover modal now keeps discovered rows visible during relay refresh/failure and shows `Searching relays...` instead of dropping to an empty state. Covered by `DiscoverModal.test.ts`, local type-check, and `git diff --check`. Needs live relay/trust validation before marking done. |
| Make sure update password is working properly | done | Backend now returns separate SSH update status so a successful web password change is not reported as a full failure when optional SSH password update fails. Settings modal shows success plus SSH warning and stays open for review. Covered by local type-check, focused modal/RPC tests, auth unit test, `cargo check -p archipelago`, and `git diff --check`. |
| Prevent System Update screen from getting permanently stuck | done | Update state loading now reconciles `update_in_progress` with the actual manifest OTA staging directory and clears stale stuck state when no staged files exist. Direct git/self-build apply is disabled unless `ARCHIPELAGO_GIT_UPDATES` is explicitly set, so production nodes cannot fall into the old `self-update.sh` path that requires local `cargo`. `.116` was recovered by applying its valid staged manifest OTA and verified on `1.7.84-alpha` with backend health OK, nginx active/config-valid, HTTP UI `200`, `update_in_progress=false`, and staging removed. Validated locally with `cargo fmt --check`, `cargo check -p archipelago`, and scoped `git diff --check`; focused `cargo test` was blocked by a local `rust-lld` linker artifact failure unrelated to the updater patch. |
| Do UI performance and general performance improvements | todo | Needs profiling target; start with obvious loading/render issues. |
| Make sure companion app is all working well, had issues with tab apps | in-progress | Mobile app-session now keeps apps that require a new tab inside the session fallback instead of auto-opening an external tab and closing immediately. Covered by `AppSessionMobileNewTab.test.ts`, existing app-session config tests, app launcher tests, local type-check, and `git diff --check`. Broader companion smoke test still needed before marking done. |
| Even though performance is better, on reboot/restart backend/update show checking-containers notification instead of no apps | done | My Apps now shows a dedicated `Checking containers` card when initial backend data has loaded but `server-info.status-info.containers-scanned` is still false and no apps are ready to render, instead of falling through to the no-apps empty state. A follow-up UI pass preserves the last known app list when a later scanner/backoff update reports an empty package map with `containers-scanned=false`, and shows a refresh status banner above the grid. Validated by local type-check, targeted tests, and `git diff --check`; follow-up validation passed `npm test -- --run src/views/apps/__tests__/appPackageCache.test.ts` and `npm run type-check`. |
| Check mesh core is picking up public channel/other devices, not just Archipelago ones | blocked | Needs Meshtastic hardware/radio environment. |
| Make tabs info load quickly or show loading states | in-progress | Fleet now has initial loading/background-refresh states, and node history keeps showing while the next sample is fetched instead of blanking out. Web5 Connected Nodes Trusted/Observers tabs now show loading instead of empty states while peer data is pending and keep existing lists visible during refresh; Messages and Requests now also keep populated lists visible during refresh/failure. Web5 Shared Content now keeps My Content visible during refresh/failure with `Refreshing shared content...`, and Browse Peers keeps current same-peer results visible during refresh with `Refreshing peer content...` instead of replacing lists with full loading panels. Web5 Identities now keeps the identity list visible during refresh/failure with `Refreshing identities...`; Web5 DWN message browsing keeps stored messages visible during refresh/failure with `Refreshing messages...`. The Web5 Verifiable Credentials summary keeps credential rows visible during refresh/failure with `Refreshing credentials...`. Web5 Nostr Relays keeps relay stats visible during refresh/failure with `Refreshing relays...`. Web5 Domains keeps registered-name counts visible during refresh/failure with `Refreshing domains...`. Web5 Federation keeps summary node counts/DID visible during refresh/failure with `Refreshing federation...`. Server Network overview, Network Interfaces, and Tor Services cards now keep visible values during refresh/failure with `Refreshing network...`, `Refreshing interfaces...`, and `Refreshing Tor services...`. Credentials keeps credential rows visible during refresh/failure with `Refreshing credentials...`. Settings Backups keeps backup rows visible during refresh/failure with `Refreshing backups...`. Settings Transport Preferences keeps preference controls visible during refresh/failure with `Refreshing transport preferences...`. Settings VPN status keeps current connection details visible during refresh/failure with `Refreshing VPN status...`. Lightning Channels keeps existing channels visible during refresh/failure with `Refreshing channels...`. Peer Files keeps existing peer catalog items visible during Tor refresh/failure with `Refreshing peer files...`. Cloud keeps existing peer cards visible during federation peer-list refresh/failure with `Refreshing peer nodes...`. Covered by focused Web5/Server/Credentials/Backups/Transport/VPN/Lightning/Peer Files/Cloud tests and local type-check. Broader tab-info audit still needed for other slow panels before marking done. |
| Add states about why Bitcoin address is not ready | in-progress | Receive Bitcoin on-chain flows now reject blank LND address responses and translate common LND/Bitcoin readiness failures into user-facing reasons: wallet locked, wallet uninitialized, Bitcoin/LND still syncing, LND unreachable, or LND REST/newaddress transport issues. The receive modals now show a live “checking wallet readiness” message while the request is in flight. Backend `lnd.newaddress` now errors if LND returns an error or no address. Needs live wallet-state smoke test before marking done. |
| Add new Bitcoin wallets easily and securely | todo | Product/security design needed. |
| Add the new gate instead of gate | blocked | Need definition of "new gate" and target integration. |
| Local Nostr signer app should ask which account after logout/re-login | todo | Needs signer/session state validation. |
| See what apps can migrate to local Nostr signer sign-in | todo | Needs app-by-app auth inventory. |
| Make server name change change the host name | in-progress | Settings label changed to `Hostname`. `server.set-name` now persists the display name, derives a Linux-safe hostname slug, attempts `sudo -n hostnamectl set-hostname`, and returns non-fatal hostname warning fields if OS update fails. Covered by hostname slug unit test, local type-check, `cargo check -p archipelago`, and `git diff --check`. Impact audit: mDNS/SSH/Tailscale labels may change; already-created app configs using old `HOST_MDNS` (notably Fedimint derived env) are not automatically rewritten by hostnamectl, so this needs release-host smoke validation before marking done. |
| Sort out HTTPS certificate, what is best way? | todo | Needs product decision: self-signed local CA, ACME DNS, Tailscale certs, or reverse proxy model. |
## User Interface And App Experience
| Item | Status | Release question / blocker |
| --- | --- | --- |
| LND Channels then back/back gets stuck between LND detail and channels | done | App Details back now routes explicitly to the parent surface, and Lightning Channels back replaces history so browser back no longer bounces between LND detail and Channels. Validated by local type-check and targeted tests. |
| Add a Meshtastic icon | done | Added `meshcore.svg` asset and manifest-owned icon metadata. Catalog generation is idempotent and strict catalog drift is clean. |
| Improve default app icon fallback | done | Missing/broken app icons now fall back to the centered Archipelago `A` mark using the same black fill and gradient-border treatment as the custom UI icon asset, instead of the old generic placeholder. Applied to My Apps cards, mobile icons, Marketplace cards, and App Details. Validated by local type-check, targeted tests, Rust check, and `git diff --check`. |
| Use favicon for Portainer apps? | todo | Need decision: use upstream favicons dynamically or ship curated icons. |
| Settings for apps | blocked | Needs definition: per-app config screen, runtime env vars, credentials, or install options? |
| Update SearXNG app icon | blocked | Needs user-provided/approved icon asset. User said to move past this until they can make icons. |
| Once an app is installed remove recommended/core pills | done | Marketplace cards hide tier badges when installed. Validated by `MarketplaceAppCard.test.ts`, targeted Vitest, type-check, and `git diff --check`. |
| Get Bitcoin / LND UI fully done with all options and controls | todo | Large feature area; needs scope for `1.8-alpha` vs post-release. |
| Fix intro always showing on new browser sessions | done | Splash gating now checks the backend onboarding-complete state before showing the intro when this browser has no local intro flag. Already-onboarded nodes skip the splash and seed `neode_intro_seen`; fresh installs still show it. Covered by `introSplash.test.ts`, local type-check, targeted tests, and `git diff --check`. |
| Fix App Store tabs/categories/search overflow | done | Discover/App Store and Marketplace render one shared App Store section list. Follow-up after user review restored the primary My Apps/App Store/Websites navigation to persistent desktop tabs at `md+` on My Apps, Discover, and Marketplace; mobile keeps dropdown behavior. App Store category collapse now happens later by starting uncollapsed and using a smaller header gap/search reserve, and the My Apps category dropdown no longer appears on desktop. Covered by local type-check, focused Marketplace/App config tests, and scoped `git diff --check`; browser smoke remains the next resume step. |
| Add a test harness for all of the application | in-progress | Lifecycle harness exists; need expand UI/e2e coverage definition. |
| Fix app details screen links | done | App Details sidebar no longer renders dead `href="#"` links. It now renders only real manifest website/marketing, upstream/wrapper repo, and support URLs, and hides the Links card when no usable URLs exist. Covered by `AppSidebar.test.ts`, local type-check, targeted tests, and `git diff --check`. |
| Fix FIPS anchoring, update FIPS | todo | Needs expected FIPS UX/API behavior. |
| Fix generate receive address not working on nodes and identify wallet management | todo | Needs wallet API/backend validation. |
| Fix mesh page on larger screens so it scales nicely | done | Mesh keeps the tabbed tools layout on normal desktop/1920px widths and only splits Off-Grid Bitcoin, Dead Man, and Map into separate stacked containers on very large screens (`>=2560px` wide and `>=1200px` tall). The desktop tools column now fills its panel instead of using a wrapper scroll container. Validated by local type-check, targeted tests, and `git diff --check`. |
| Mesh map should handle denied location permission and still show other devices | in-progress | Mesh map now treats browser geolocation as optional in the UI: denied local location reports that peer locations can still appear, and the empty hint waits for mesh device positions instead of saying location sharing is required. Covered by `MeshMap.test.ts`. Needs browser smoke test with denied location plus a peer coordinate message before marking done. |
| Make tablet-size Meshtastic scrollable | done | Tablet/mobile Mesh tools panels now have bounded heights and internal scrolling so the selected Bitcoin/Dead Man/Map panel can scroll without blowing out the page. Validated by local type-check, targeted tests, and `git diff --check`. |
| Make mobile screens have gap below lowest container and tab bar | done | Dashboard route panels, including the separate Chat/Mesh branch, now use mobile tab-bar bottom clearance so the lowest content clears the bottom tab bar. |
| Add Trusted tab to Connected Nodes container and have Peers and Observers | done | Connected Nodes now labels trusted peers as Trusted and splits federation nodes with `trust_level: observer` into the Observers tab. Observer nodes are excluded from Trusted, shown with their own count/badge, and refresh from the same live federation list. Validated by local type-check and targeted tests. |
| Add more tree navigation to cloud files so they do not all go back to first screen | done | Cloud folder navigation now persists the current folder path in the route query so refresh/browser back keeps nested folders instead of resetting to the section root. The Cloud back button now walks up to the parent folder before returning to Cloud home. Covered by `cloudPath.test.ts`, local type-check, targeted tests, and `git diff --check`. |
| Fix visible UI refreshing on find nodes screens | done | Federation node auto-refresh no longer blanks/replaces the visible node lists after the initial load. Existing nodes stay visible during background refreshes, covered by `NodeList.test.ts`, local type-check, targeted tests, and `git diff --check`. |
| Remove dead UI components/ones that are coming soon | done | Removed the dead Web3/coming-soon Network card, disabled local-network placeholder button, and the non-interactive Spotlight AI Assistant coming-soon block. Verified active UI no longer contains explicit `Coming soon` copy outside historical release-note text. Covered by local type-check and `git diff --check`. |
| Hide Web3 container on network for now and move FIPS Mesh up | done | Network page now places the live FIPS Mesh card in the top overview grid where the dead Web3 card was, removes the duplicate lower FIPS card, and updates the Home Network description to remove Web3 language. Validated by local type-check, targeted tests, and `git diff --check`. |
| Make cool screens less hidden: Find Nodes, Fleet, Monitoring, etc. | done | Existing Web5 summary cards now expose Monitoring, Find Nodes/Federation, and Fleet directly. Federation card has separate `Find Nodes` and `Fleet` actions instead of hiding Find Nodes behind Fleet. Covered by `Web5Federation.test.ts`, local type-check, targeted tests, and `git diff --check`. |
| Fix dashboard container/card square rendering corruption | done | Generalized the App Store compositor workaround to dashboard scroll-panel glass cards/buttons/inputs and removed transform-based stagger movement so Chromium/Brave no longer paints random large black square/rectangle layers over containers. Kept the Web5 bottom-action placement change. Validated by local type-check, targeted tests, and `git diff --check`. |
| Move constrained card header actions to bottom buttons | done | Web5 summary actions and Network actions for Add Device, Scan WiFi, Restart Tor, and Add Service now stay in the card header only on very wide screens; otherwise they render at the card bottom as full-width or 50/50 buttons. Button icons were removed from those action buttons. Validated by local type-check, targeted tests, and `git diff --check`. |
| Work on setup screens function and flows | in-progress | Onboarding setup choice now shows only usable paths: Fresh Start and Restore from Seed. Removed the disabled `Connect Existing (Coming Soon)` option, and covered default Fresh routing plus Restore routing with `OnboardingOptions.test.ts`; `useOnboarding.test.ts`, local type-check, and `git diff --check` passed. Broader onboarding/setup audit still needed before marking done. |
| Work on Easy Mode experience | in-progress | Easy Mode goal configure steps now route to their owning app/screen instead of silently completing without navigation; verify steps now expose a `Check & Continue` action; configure/info/verify actions start goal progress before completing the active step. Covered by `goalStepActions.test.ts`, existing goal store tests, local type-check, and `git diff --check`. Broader Easy Mode product scope still needed before marking done. |
| Update My Apps homescreen to show most-used apps instead of hardcoded | done | App launches are recorded locally through the app launcher, and the Home My Apps card now shows the top three installed user apps by launch count/recency with a running-app/name fallback when there is no history. Covered by `appUsage.test.ts`, existing app launcher tests, local type-check, targeted tests, and `git diff --check`. |
| Improve Full Archive Node dependent apps UX | in-progress | Electrum-style apps already block install on pruned Bitcoin nodes; Marketplace/App Store cards now surface an inline warning that a full archive Bitcoin node is required instead of only showing a terse `Bitcoin Pruned` button. Covered by `MarketplaceAppCard.test.ts` and local type-check. Broader dependency UX remains. |
| Fix incorrect modals that are wrong color and are not full-screen overlay | done | Custom Teleport modals that still used the old light `bg-black/10` overlay now use the same full-screen `bg-black/60` overlay treatment as BaseModal/newer modals. Verified no fixed modal overlays retain `bg-black/10`; validated by local type-check, targeted tests, and `git diff --check`. |
| Prevent modals from allowing background scroll | done | Added shared scroll-lock composable, root-level body lock, wheel/touch containment, and explicit dashboard route-panel locking. User validated the background no longer scrolls behind modal overlays. |
| Look over gamepad navigation | todo | Needs focused controller-nav pass. |
| App Store screenshots | in-progress | Placeholder policy fixed: Marketplace App Details and installed App Details now render screenshot sections only when real screenshot metadata exists, and otherwise hide the fake placeholder tiles. Metadata can be string URLs or `{ src, alt }` objects. Covered by `AppContentSection.test.ts`, `useMarketplaceApp.test.ts`, local type-check, and `git diff --check`. Needs actual screenshot assets/metadata before marking done. |
| Fix App Detail page issues; container controls are not good | done | App Details container controls now disable while start/stop/restart/update/uninstall RPCs are running and show action-specific progress labels. Header actions collapse into the bottom 50/50 grid below `1280px` to avoid tablet/smaller desktop overlap. Credentials now show a loading state while package credentials are being fetched. Covered by `AppHeroSection.test.ts`, `AppSidebar.test.ts`, local type-check, targeted tests, and `git diff --check`. |
| Add setup instructions for apps that need them | done | App Details now renders a dedicated Setup Instructions card from `static-files.instructions` when present, so apps can show install/setup notes without a new schema. Covered by `AppSidebar.test.ts`, local type-check, and `git diff --check`. |
| Add press-and-hold option for apps on mobile app screen | done | Mobile My Apps icons now support long press/context menu to open the app detail/options screen while a normal tap still launches the app. Space key opens the same options path for keyboard users. Covered by `AppIconGrid.test.ts`, local type-check, targeted tests, and `git diff --check`. |
| Side-load: add port-not-available validation | done | Sideload modal now validates app ID collisions, malformed `host:container` mappings, reserved Archipelago/package host ports, and host ports already exposed by installed packages before queueing install. Backend install remains the final bind authority. Covered by `sideloadValidation.test.ts`, local type-check, targeted tests, and `git diff --check`. |
| Delete app data option and uninstall warning | done | Uninstall dialogs in My Apps and App Details now include a clear warning plus a `Delete app data and reset it` choice. Leaving it off preserves app data for later reinstall; checking it passes `preserve_data=false` through `package.uninstall` so the app is fully reset. Covered by `AppsUninstallModal.test.ts`, `rpc-client.test.ts`, local type-check, targeted tests, and `git diff --check`. |
| Add App Store container with recommended apps that change to Home Screen | done | Home now shows up to three uninstalled core/recommended App Store apps and routes clicks through the existing Marketplace App Details handoff. Installed aliases are honored, so recommendations disappear once the app is installed and the app moves into normal My Apps/Home behavior. Follow-up layout polish moved Cloud back into the second card slot, moved Recommended Apps into Cloud's previous slot, and placed Quick Start inside the grid next to Wallet to avoid an odd-width row. Covered by `homeRecommendations.test.ts`, local type-check, `git diff --check`, and Playwright Home dashboard smoke against local Vite/mock backend. |
| Add QR code to download mobile companion app in login-triggered modal and improve modal | done | Companion intro modal now renders a QR code on desktop and a direct download button on mobile. It reads `VITE_COMPANION_APK_URL` and falls back to `/packages/archipelago-companion.apk.zip`; the APK zip is now published at `neode-ui/public/packages/archipelago-companion.apk.zip` so the modal can serve it immediately. Covered by local type-check, `git diff --check`, and manual file placement verification. |
| Fix TV HDMI overscan clipping in kiosk mode | in-progress | Kiosk launcher now passes a browser safe-area fallback through `/kiosk?safe_area=...`; `/kiosk` now persists the safe-area value during redirect; self-update and deploy paths refresh kiosk launcher/services. The X11 safe-area attempt is opt-in because it stretched the live TV output on `100.66.157.120`. Wi-Fi UI fixes are included in the same OTA patch: scan errors are visible, scans can be retried, escaped SSIDs parse correctly, and open networks do not require a password. Needs live validation on HDMI node `100.66.157.120` after applying the visible OTA update. |
| Video calling Picture-in-Picture | blocked | Need referenced document or desired provider/library. |
| Card-based loading visuals on App Store pages | done | Discover and Marketplace now show app-card skeleton grids while community/Nostr catalog data is loading and no cards are available yet, instead of a centered spinner/empty state. Validated by local type-check, targeted tests, and `git diff --check`. |
## External / Hardware Items
| Item | Status | Release question / blocker |
| --- | --- | --- |
| Buy a HaLow device and start integration | blocked | Requires hardware purchase and driver/device target. Not a code-only `1.8-alpha` item unless hardware is available now. |

View File

@ -1,96 +0,0 @@
# Beta Test Issues — 2026-03-28 (ISO build 2137)
Hardware: Dell OptiPlex 3020M, i5, 8GB RAM, 465G HDD, UEFI+Legacy
## ISO / Boot (image-recipe)
### 1. UEFI autodetect broken
- **Severity**: High
- **Detail**: Only autodetects/boots in Legacy BIOS mode. UEFI boot does not autodetect the install disk.
- **Where**: `build-auto-installer-iso.sh` GRUB config, EFI boot chain
- **Status**: TODO
### 2. Installation TUI screens need redesign
- **Severity**: Medium
- **Detail**: Current installer output is plain/ugly. Needs polished design.
- **Action**: User will provide .md mockup for each screen, then we implement.
- **Where**: `build-auto-installer-iso.sh` auto-install.sh embedded script
- **Status**: AWAITING DESIGN
### 3. No TUI animations
- **Severity**: Low
- **Detail**: Would like Claude-style spinner/progress animations during install. May not be possible with bash.
- **Where**: auto-install.sh
- **Status**: TODO (investigate)
### 4. USB read errors on boot
- **Severity**: Medium (cosmetic but bad first impression)
- **Detail**: Read errors scroll on screen during USB boot before installer loads. Scares new users.
- **Where**: Kernel/initramfs boot, possibly `quiet` not suppressing early messages
- **Status**: TODO
### 5. GRUB background tiling + text cutoff
- **Severity**: Medium
- **Detail**: Boot menu background image tiles instead of scaling. Menu text ("Install Archipelago", "Failsafe mode") is cut off.
- **Where**: `branding/grub-theme/`, `boot/grub/grub.cfg`, theme.txt resolution settings
- **Status**: TODO
### 6. USB removal drops to command line
- **Severity**: Medium
- **Detail**: After install completes, removing USB drops to shell before user presses Enter to reboot. Confuses non-technical users.
- **Where**: auto-install.sh — end of install, before `read -s` / `reboot`
- **Status**: TODO
## Frontend / UI (neode-ui)
### 7. Broken splash screen flashes before onboarding
- **Severity**: High
- **Detail**: Black screen with "online/offline" top-right, broken archipelago image top-left, "use arrow keys" text. Flashes briefly before onboarding loads.
- **Where**: Likely `RootRedirect.vue` or `SplashScreen.vue` — routing/transition timing
- **Status**: TODO (reported before, persists)
### 8. Skip buttons still visible in onboarding
- **Severity**: Medium
- **Detail**: Onboarding flow still shows skip buttons. Should be removed for clean UX.
- **Where**: `src/views/onboarding/` components
- **Status**: TODO
### 9. App install UX outdated
- **Severity**: High
- **Detail**: Missing the yellow "Installing..." button that persists across navigation. Apps don't show as "installing" in My Apps view during install.
- **Where**: `src/views/marketplace/`, `src/views/myapps/`, app install store
- **Status**: TODO
### 10. Login requires double Enter
- **Severity**: Medium
- **Detail**: Password field on login page requires pressing Enter twice to submit.
- **Where**: `src/views/LoginView.vue` — form submission handler
- **Status**: TODO (reported before, persists)
### 11. No password setting UI
- **Severity**: High
- **Detail**: No way for user to set/change their password from the web UI. Currently hardcoded `password123`.
- **Where**: Settings view, backend auth API
- **Status**: TODO
### 12. Browser login loops (non-kiosk)
- **Severity**: High
- **Detail**: Logging in from a browser (not kiosk) on the same network redirects back to login in a loop. Kiosk mode works fine.
- **Where**: Auth/session handling — possibly cookie `SameSite` or redirect logic in `RootRedirect.vue`
- **Status**: TODO
### 13. Can't exit input fields with arrow keys
- **Severity**: Medium
- **Detail**: When focused on a text input, up/down arrow keys don't move focus to adjacent UI elements. Stuck in the field.
- **Where**: `useControllerNav.ts` — input field focus trap logic
- **Status**: TODO (reported before, persists)
---
## Summary
| Category | Critical | High | Medium | Low |
|----------|----------|------|--------|-----|
| ISO/Boot | 0 | 1 | 4 | 1 |
| Frontend | 0 | 4 | 3 | 0 |
| **Total** | **0** | **5** | **7** | **1** |

View File

@ -1,335 +0,0 @@
# Beta Progress Tracker
> **Goal**: Flawless beta that works perfectly on every machine we install it on.
> **Freeze started**: 2026-03-18
> **Last updated**: 2026-03-25
---
## Pipeline
```
PHASE 1: Feature Testing (internal) ← WE ARE HERE
PHASE 2: User Testing (real users, controlled)
PHASE 3: Beta Live (public release)
```
**Current phase**: PHASE 1 — Feature Testing
**Gate to Phase 2**: Every feature works, all bugs fixed, security hardened, ISO verified
**Gate to Phase 3**: User testing feedback resolved, no P0/P1 issues remaining
---
## Phase 1: Feature Testing (Internal)
Everything in this phase must pass before we hand it to real users.
### Overall Status: IN PROGRESS (~65%)
| Workstream | Status | Completion | Gate-blocking? |
|------------|--------|------------|----------------|
| 1A. Critical Bugs (BUG-1 CSRF) | DONE | 100% | ~~YES~~ |
| 1B. Boot Screen (FEATURE-4) | IN PROGRESS | ~80% (needs hardware test) | YES |
| 1C. Security Hardening (TASK-8) | DONE (12/12 + code audit) | 100% | ~~YES~~ |
| 1D. Rootless Podman (TASK-11) | DONE (.228), IN PROGRESS (.198) | ~80% | YES |
| 1E. Beta Telemetry (TASK-12) | NOT STARTED | 0% | YES |
| 1F. App Testing — every feature | NOT STARTED | 0% | YES |
| 1G. ISO Build & Fresh Install | NOT STARTED | 0% | YES |
| 1H. UI Polish & Layout | DONE (batch + What's New) | ~90% | No |
| 1I. WebSocket Reliability | NOT STARTED | 0% | No |
| 1J. Quality Baseline Check | NOT STARTED | 0% | No |
| 1K. Architecture Review Fixes | DONE (4/4 items) | 100% | ~~YES~~ |
| 1L. Update System (git.tx1138.com) | DONE | 100% | No |
### 1A. Critical Bugs
#### BUG-1: Random logout / CSRF mismatch — P0
**Status**: PLANNED
**Impact**: Users get randomly logged out. Blocks user testing — unacceptable UX.
**What's known**:
- Sessions now persist to disk (fixed)
- CSRF token mismatch between cookie and header still causes 403s
- Likely caused by cookie rotation in multi-tab or deploy scenarios
**Remaining work**:
- [ ] Add debug logging to capture actual cookie vs header values
- [ ] Reproduce reliably (multi-tab, deploy, long idle)
- [ ] Fix the root cause
- [ ] Verify fix survives deploys and multi-tab use
#### BUG-3: IndeedHub WebSocket spam — P2
**Status**: PLANNED
**Impact**: Console noise, minor. Should fix before user testing.
- [ ] Rebuild IndeedHub with relative WebSocket URL
- [ ] Verify fix
---
### 1B. Boot Screen (FEATURE-4)
**Status**: IN PROGRESS (~80% complete)
**Impact**: Users hit errors on first boot before backend is ready. Blocks user testing.
- [x] Audit current `/health` endpoint — returns trivial "OK"
- [x] Add granular service readiness to health endpoint (JSON with version + services)
- [x] Design boot screen component — BootScreen.vue (379 lines, starfield + terminal log + orb)
- [x] Create pixel art icon animations (6 SVG icons cycling)
- [x] Implement health polling with smooth transition (server.echo RPC, 2s interval)
- [x] Handle edge cases (timeout, 502/503 detection, boot-reset)
- [ ] Test on fresh ISO install (first-boot path)
- [ ] Test on normal reboot (existing user path)
---
### 1C. Security Hardening (TASK-8)
**Status**: DONE — 12/12 pentest findings fixed + additional hardening from code audit
#### Pentest (12/12 fixed)
- [x] C1: /lnd-connect-info requires session auth
- [x] C3: DEV_MODE removed from production service
- [x] H1: node-message verifies ed25519 signatures
- [x] H2: federation.peer-joined verifies ed25519 signature
- [x] H3: federation.peer-address-changed requires signed proof
- [x] H4: Backend binds to 127.0.0.1
- [x] M1: content.add rejects `..` path traversal
- [x] M2: NIP-07 postMessage uses specific origin
- [x] M3: AIUI nginx checks session_id cookie
- [x] L2: Strict v3 onion validation
- [x] MED-03: Shell injection in bitcoin.conf generation
- [x] MED-07: No body size limit on /rpc/
#### Code audit (additional)
- [x] CSRF: HMAC-derived from session token (BUG-1 fix)
- [x] Argon2id password hashing (bcrypt auto-upgrade)
- [x] Random Bitcoin RPC password on first boot
- [x] RBAC Viewer role: explicit allowlist
- [x] Error sanitization tightened
- [x] Identity label max length enforced
- [ ] Cosign image verification (large scope — post-beta candidate)
---
### 1D. Rootless Podman (TASK-11)
**Status**: DONE on .228 (30 containers rootless), IN PROGRESS on .198
**Impact**: Security posture — containers no longer require root.
- [x] Migrate existing root Podman containers to rootless (archipelago user)
- [x] Update PodmanClient to run `podman` directly (no sudo) — 9 Rust files
- [x] Deploy script auto-fixes ownership + sysctl + linger on every deploy
- [x] All 30 containers running rootless on .228
- [ ] .198: only 2 containers running — needs full container recreation (TASK-39)
- [x] Tailscale deploy script: full deploy-tailscale.sh with split-mode SSH, rootful→rootless migration, container creation, all infrastructure
- [ ] Test full deploy on .198 (validation before Tailscale)
- [ ] Deploy to Tailscale nodes (Arch 1/2/3)
---
### 1E. Beta Telemetry — Node Reporting (TASK-12)
**Status**: NOT STARTED
**Impact**: Without this we're blind during user testing — can't see what's broken on their machines.
All beta nodes report health/errors to a central log. We build a panel to monitor and triage issues.
**Design**:
- Opt-in telemetry (user consents during onboarding or settings)
- Each node periodically reports: health status, error log digest, container states, uptime
- Central endpoint collects reports (could be a simple API on one of our servers)
- Dashboard panel shows all reporting nodes, their status, recent errors
- Privacy: no wallet data, no keys, no personal data — only system health and error logs
- Nodes identified by anonymous ID (hash of DID), not IP or name
**Tasks**:
- [ ] Design report payload (health, errors, container states, versions, uptime)
- [ ] Design privacy model — what's collected, what's NOT, user consent flow
- [ ] Build reporting endpoint (backend RPC → central collector)
- [ ] Build central collector service (receives + stores reports)
- [ ] Build monitoring dashboard/panel (view all nodes, filter by error type)
- [ ] Add opt-in toggle to Settings UI
- [ ] Add reporting interval config (default: every 15 min?)
- [ ] Test with multi-node fleet (.228, .198, Tailscale nodes)
---
### 1F. App Testing — Every Feature
**Status**: NOT STARTED
**Reference**: `docs/BETA-RELEASE-CHECKLIST.md` — full matrix
Systematic test of **every feature** on the dev server, then on fresh install.
#### Core Flows
- [ ] Onboarding: welcome → password → path → DID → backup → dashboard
- [ ] Login / logout / re-login
- [ ] Password change (invalidates other sessions)
- [ ] 2FA enrollment and verification
- [ ] Settings: view server name, version, DID, Tor address
- [ ] Dashboard: all overview cards render with data
#### App Lifecycle (every app)
- [ ] Bitcoin Knots: install, sync starts, UI loads, uninstall
- [ ] Electrs: install, auto-connects to Bitcoin, UI loads, uninstall
- [ ] LND: install, auto-connects to Bitcoin, UI loads, uninstall
- [ ] BTCPay Server: install, connects, Lightning available, uninstall
- [ ] Mempool: install with Bitcoin+Electrs, shows data, uninstall
- [ ] Fedimint + Gateway: install, UI loads, uninstall
- [ ] File Browser: install, UI loads, uninstall
- [ ] Immich: install, UI loads, uninstall
- [ ] PhotoPrism: install, UI loads, uninstall
- [ ] Penpot: install, UI loads, uninstall
- [ ] SearXNG: install, UI loads, uninstall
- [ ] Ollama: install, UI loads, uninstall
- [ ] Nostr Relay: install, UI loads, uninstall
- [ ] Nginx Proxy Manager: install, UI loads, uninstall
- [ ] Tailscale: install, UI loads, uninstall
- [ ] Home Assistant: install, UI loads (new tab), uninstall
- [ ] IndeedHub: opens external URL in iframe
#### Dependency Chain Errors
- [ ] Electrs without Bitcoin → clear error message
- [ ] LND without Bitcoin → clear error message
- [ ] Mempool without Bitcoin+Electrs → clear error message
#### Federation & Identity
- [ ] Federation invite + join between nodes
- [ ] DWN sync between federated nodes
- [ ] Backup create + download
- [ ] Backup restore on fresh install
#### WebSocket
- [ ] Connects on login, receives initial data
- [ ] Reconnects after network drop
- [ ] Ping/pong heartbeat both directions
- [ ] Connection state visible in UI
- [ ] Install progress delivered real-time
#### Nginx Proxies
- [ ] Every `/app/*` proxy resolves correctly
- [ ] BTCPay and Home Assistant open in new tab
- [ ] Tor hidden services resolve
---
### 1G. ISO Build & Fresh Install
**Status**: NOT STARTED
- [ ] ISO builds successfully on dev server
- [ ] ISO size < 10 GB
- [ ] All container images captured
- [ ] Boot from USB on x86_64 hardware
- [ ] Auto-installer partitions correctly
- [ ] Services start on first boot
- [ ] Web UI accessible within 3 minutes
- [ ] Full onboarding flow completes
- [ ] Second machine test (different hardware)
- [ ] ARM64 test (if targeting)
---
### 1H. UI Polish & Layout
**Status**: MOSTLY DONE — batch of fixes shipped 2026-03-18
**Note**: Layout rearrangements and UX improvements allowed during freeze.
- [x] Rename fedimintd → "Fedimint Guardian" + icon (TASK-26)
- [x] Tab-launch icons for apps opening in new tabs (TASK-27)
- [x] Installed apps sorted to end of marketplace (TASK-28)
- [x] Mesh mobile: header hidden, overflow fixed (TASK-29)
- [x] On-Chain first in receive modals (TASK-30)
- [x] Federation node names — show name not DID, hover for key (TASK-35)
- [x] Cleaner iframe error screen with remediation (TASK-36)
- [x] CPU alert threshold fixed (BUG-33)
- [x] ElectrumX shows index size during indexing
- [x] Container startup "Checking..." shimmer
- [ ] Sticky nav header (TASK-31)
- [ ] Review all views for consistent glass design
- [ ] Verify all loading/empty/error states work
- [ ] Check responsive layout on tablet/mobile
---
### 1I. WebSocket Reliability
Covered under 1F testing — no separate workstream needed.
---
### 1J. Quality Baseline Check
**Last known** (2026-03-11):
- Silent catches: 0
- Console statements: 0
- `any` types: 0
- TypeScript errors: 0
- Tests: 515 passed
- npm audit (runtime): 0
- [ ] Re-run full quality sweep — verify no regressions
- [ ] Fix any new violations
---
## Phase 2: User Testing (Controlled)
**Gate**: All Phase 1 items pass. No P0/P1 bugs open.
Starts when we hand ISOs to real users on real hardware we don't control.
| Item | Status |
|------|--------|
| Recruit test users (3-5 people, varied hardware) | NOT STARTED |
| Provide ISOs + install instructions | NOT STARTED |
| Beta telemetry collecting reports from user nodes | NOT STARTED |
| Monitor dashboard for errors across fleet | NOT STARTED |
| Triage + fix reported issues | NOT STARTED |
| User feedback collection (structured form or channel) | NOT STARTED |
| Fix all P0/P1 issues from user reports | NOT STARTED |
| Rebuild ISO with fixes, re-test | NOT STARTED |
---
## Phase 3: Beta Live (Public)
**Gate**: User testing complete. No P0/P1 issues. Telemetry shows stable fleet.
| Item | Status |
|------|--------|
| Final ISO build with all fixes | NOT STARTED |
| Release notes / changelog | NOT STARTED |
| Download page / distribution | NOT STARTED |
| Public announcement | NOT STARTED |
| Telemetry monitoring active for early adopters | NOT STARTED |
---
## Session Log
| Date | Session | Work Done | Items Closed |
|------|---------|-----------|--------------|
| 2026-03-18 | #1 | Created beta freeze plan, progress tracker | — |
| 2026-03-18 | #2 | Restructured into 3-phase pipeline, added telemetry workstream | — |
| 2026-03-18 | #3 | Updated tracking to reflect completed work — TASK-11 done, TASK-8 9/12, UI batch done | TASK-11, TASK-26-30, TASK-32, TASK-34-36, BUG-33 |
| 2026-03-18 | #4 | Rewrote deploy-tailscale.sh (full deploy with split-mode SSH, rootful migration, containers, infra). Fixed first-boot-containers.sh rootless bugs (subnet, UID mapping, prereqs). Dynamic HTTPS certs. | — |
| 2026-03-18 | #5 | BUG-1 CSRF fix, TASK-8 12/12 done, 7 bugs fixed, Argon2id migration, random BTC RPC, RBAC hardened, What's New history, Bitcoin sync gauge. Tagged v1.2.0-alpha.9. | BUG-1, TASK-8, BUG-20/37/40/41, TASK-31/38 |
| 2026-03-25 | #6 | Architecture review audit: all P0s+P1s verified fixed. Fixed remaining items: Nostr timeouts (6 calls), crypto dep pinning (12 deps), container image pinning (15 images), CI pipeline. Update system wired to git.tx1138.com. Cleaned stale branches. Docs updated. | Architecture review 4/4, CI pipeline |
---
## Post-Beta Parking Lot
These are explicitly deferred until after beta ships:
- FEATURE-6: Watch-only wallet architecture
- TASK-7: Mesh Bitcoin security hardening
- INQUIRY-5: Offline balance check via mesh relay
- TASK-2: Roll incoming-tx into deploy & ISO (P2, not blocking)
- did:dht integration
- Multi-user support
- Cluster mode
- Mobile companion PWA

View File

@ -1,269 +0,0 @@
# Beta Release Checklist (v0.5.0-beta)
## Pre-Build Verification
### Source Code
- [ ] All changes committed and pushed to `main`
- [ ] `cargo clippy --all-targets --all-features` passes (zero warnings)
- [ ] `cargo fmt --all` applied
- [ ] `cd neode-ui && npm run type-check` passes (zero errors)
- [ ] `cd neode-ui && npm test` passes (all tests green)
- [ ] `cargo test --all-features` passes on dev server
### Critical Files
- [ ] `core/container/src/podman_client.rs` — rootless Podman REST API socket
- [ ] `core/archipelago/src/container/docker_packages.rs` — app metadata + UI mapping
- [ ] `core/archipelago/src/api/rpc/package.rs` — app configs, capabilities, dependencies
- [ ] `core/archipelago/src/session.rs` — session security hardening
- [ ] `core/security/src/secrets_manager.rs` — encryption + rotation
- [ ] `neode-ui/src/views/Marketplace.vue` — all app entries with pinned image versions
- [ ] `neode-ui/src/api/websocket.ts` — heartbeat + reconnection
- [ ] `image-recipe/configs/nginx-archipelago.conf` — all app proxies + path traversal blocks
- [ ] All app icons present in `neode-ui/public/assets/img/app-icons/`
---
## App Integration Matrix
Every app must be tested for install, launch, and uninstall on a fresh system.
### Core Bitcoin Stack
| App | Image | Version | Install | Launch | UI Loads | Uninstall |
|-----|-------|---------|---------|--------|----------|-----------|
| Bitcoin Knots | `bitcoinknots/bitcoin` | `v28.1` | [ ] | [ ] | [ ] | [ ] |
| Electrs | `mempool/electrs` | `v0.4.1` | [ ] | [ ] | [ ] | [ ] |
| LND | `lightninglabs/lnd` | `v0.18.4` | [ ] | [ ] | [ ] | [ ] |
| BTCPay Server | `btcpayserver/btcpayserver` | `2.0.6` | [ ] | [ ] | [ ] | [ ] |
| Mempool | `mempool/frontend` | `v3.0.0` | [ ] | [ ] | [ ] | [ ] |
| Fedimint | `fedimintui/fedimint` | `0.5.0` | [ ] | [ ] | [ ] | [ ] |
| Fedimint Gateway | `fedimintui/gateway-ui` | `0.5.0` | [ ] | [ ] | [ ] | [ ] |
### Storage & Media
| App | Image | Version | Install | Launch | UI Loads | Uninstall |
|-----|-------|---------|---------|--------|----------|-----------|
| File Browser | `filebrowser/filebrowser` | `v2` | [ ] | [ ] | [ ] | [ ] |
| Immich | `ghcr.io/immich-app/immich-server` | `v1.121.0` | [ ] | [ ] | [ ] | [ ] |
| PhotoPrism | `photoprism/photoprism` | `240915` | [ ] | [ ] | [ ] | [ ] |
### Productivity & Privacy
| App | Image | Version | Install | Launch | UI Loads | Uninstall |
|-----|-------|---------|---------|--------|----------|-----------|
| Penpot | `penpotapp/frontend` | `2.4` | [ ] | [ ] | [ ] | [ ] |
| SearXNG | `searxng/searxng` | `2024.11.17-e2554de75` | [ ] | [ ] | [ ] | [ ] |
| Ollama | `ollama/ollama` | `0.5.4` | [ ] | [ ] | [ ] | [ ] |
### Network & Infrastructure
| App | Image | Version | Install | Launch | UI Loads | Uninstall |
|-----|-------|---------|---------|--------|----------|-----------|
| Nostr Relay | `scsiblade/nostr-rs-relay` | `0.9.0` | [ ] | [ ] | [ ] | [ ] |
| Nginx Proxy Manager | `jc21/nginx-proxy-manager` | `2.12.1` | [ ] | [ ] | [ ] | [ ] |
| Tailscale | `tailscale/tailscale` | pinned | [ ] | [ ] | [ ] | [ ] |
| Home Assistant | `homeassistant/home-assistant` | pinned | [ ] | [ ] | [ ] | [ ] |
### Virtual Apps (No Container)
| App | Behavior | Works |
|-----|----------|-------|
| IndeedHub | Opens external URL | [ ] |
---
## Dependency Chain Tests
These must be tested in order on a fresh install:
- [ ] Install Bitcoin Knots → starts and begins syncing
- [ ] Install Electrs while Bitcoin running → connects to Bitcoin automatically
- [ ] Install LND while Bitcoin running → connects to Bitcoin automatically
- [ ] Install BTCPay while Bitcoin running → connects; Lightning available if LND present
- [ ] Install Mempool while Bitcoin + Electrs running → shows blockchain data
- [ ] Try installing Electrs without Bitcoin → shows clear error message
- [ ] Try installing LND without Bitcoin → shows clear error message
- [ ] Try installing Mempool without Bitcoin + Electrs → shows missing deps error
- [ ] Fedimint Gateway auto-detects LND credentials when available
---
## Security Hardening Verification
### Session Security
- [ ] Sessions expire after 24 hours of inactivity
- [ ] Password change invalidates all other sessions
- [ ] Maximum 5 concurrent sessions (oldest evicted when exceeded)
- [ ] Session tokens are SHA-256 hashed in memory (never stored as plaintext)
- [ ] Login rate limiting: 5 failures per 60 seconds per IP
### Container Security
- [ ] All container images use pinned versions (no `:latest`)
- [ ] Read-only root filesystem enabled for compatible apps
- [ ] `--cap-drop=ALL` applied to all containers
- [ ] `--security-opt=no-new-privileges:true` applied to all containers
- [ ] Required capabilities added explicitly per app (e.g., CHOWN for File Browser)
### Secrets Management
- [ ] Secrets encrypted with AES-256-GCM on disk
- [ ] Secret metadata tracked (creation date, rotation count)
- [ ] Secret rotation generates new random values and re-encrypts
- [ ] `security.list-expiring` RPC returns secrets older than threshold
### Path Traversal Prevention
- [ ] Nginx blocks `..` in filebrowser API paths (403 response)
- [ ] Frontend `sanitizePath()` strips `..` and resolves paths
- [ ] File Browser token not exposed in URLs
### Authentication
- [ ] TOTP 2FA enrollment and verification works
- [ ] TOTP backup codes work for recovery
- [ ] Maximum 5 TOTP attempts before session invalidation
- [ ] Pending TOTP sessions expire after 5 minutes
- [ ] Cookie-based auth (no tokens in query strings)
---
## WebSocket & Connectivity
- [ ] WebSocket connects on login and receives initial data dump
- [ ] WebSocket reconnects after network interruption (exponential backoff, max 30s)
- [ ] Server sends ping every 30s; client responds with pong
- [ ] Client sends JSON ping every 30s; server responds with JSON pong
- [ ] Server closes inactive connections after 5 minutes
- [ ] Connection state shown in UI (connected/reconnecting/disconnected)
- [ ] Install progress updates delivered in real-time via WebSocket
---
## Fresh Install Testing Matrix
### ISO Build
- [ ] ISO builds successfully on dev server
- [ ] ISO size is reasonable (< 10 GB)
- [ ] All container images captured in ISO
### Installation
- [ ] Boot from USB on x86_64 hardware
- [ ] Auto-installer partitions disk correctly
- [ ] Debian 13 installs without errors
- [ ] Archipelago services start on first boot
- [ ] Web UI accessible at server IP within 3 minutes of first boot
### Onboarding Flow
- [ ] Welcome screen displays with intro video
- [ ] Password creation enforces minimum requirements
- [ ] Path selection shows all 6 options
- [ ] DID generation completes within 60 seconds
- [ ] Identity naming is optional and skippable
- [ ] Backup download produces valid JSON file
- [ ] Onboarding completes and reaches Dashboard
### Post-Onboarding
- [ ] Dashboard shows all overview cards
- [ ] App Store loads with all curated apps
- [ ] Settings shows server name, version, DID, Tor address
- [ ] Logout and re-login works
- [ ] Password change works and invalidates other sessions
---
## Performance Targets
- [ ] Backend startup: < 3 seconds
- [ ] Frontend initial load: < 500 KB gzipped
- [ ] WebSocket initial data: < 1 second after connection
- [ ] App install progress visible in UI within 5 seconds of starting
---
## Nginx Proxy Verification
All app proxies must work in both HTTP and HTTPS blocks:
- [ ] `/rpc/` → backend:5678
- [ ] `/ws/` → backend:5678 (WebSocket upgrade)
- [ ] `/health` → backend:5678
- [ ] `/app/filebrowser/` → filebrowser:80
- [ ] `/app/searxng/` → searxng:8080
- [ ] `/app/immich/` → immich:2283
- [ ] `/app/penpot/` → penpot-frontend:80
- [ ] `/app/ollama/` → ollama:11434
- [ ] `/app/photoprism/` → photoprism:2342
- [ ] `/app/nginx-proxy-manager/` → npm:81
- [ ] `/app/tailscale/` → tailscale:8240
- [ ] BTCPay (port 23000) opens in new tab
- [ ] Home Assistant (port 8123) opens in new tab
- [ ] Tor hidden services resolve for all configured apps
---
## Rollback Procedures
### If Backend Fails to Start
```bash
# Check logs
sudo journalctl -u archipelago -n 50 --no-pager
# Restore previous binary
sudo cp /usr/local/bin/archipelago.bak /usr/local/bin/archipelago
sudo systemctl restart archipelago
```
### If Frontend is Broken
```bash
# Restore previous frontend build
sudo cp -r /opt/archipelago/web-ui.bak/* /opt/archipelago/web-ui/
sudo systemctl reload nginx
```
### If Container Won't Start
```bash
# Check container logs
podman logs <container-name>
# Remove and recreate
podman rm -f <container-name>
# Reinstall from App Store
```
### If ISO Install Fails
1. Boot into rescue mode from USB
2. Check `/var/log/installer.log` on target disk
3. Verify disk partitioning with `lsblk`
4. Re-run installer with `INSTALLER_STARTED= /opt/installer.sh`
### Full System Rollback
If the beta is unusable:
1. Re-flash the ISO from the last known good build
2. Restore user data from `/var/lib/archipelago/` backup
3. Re-import DID from backup JSON file
---
## Sign-Off
| Reviewer | Area | Date | Pass/Fail |
|----------|------|------|-----------|
| | Backend | | |
| | Frontend | | |
| | Security | | |
| | ISO Build | | |
| | Fresh Install | | |
| | App Integrations | | |

View File

@ -1,317 +0,0 @@
# Chat Transcript And Working Notes
Date: 2026-05-02
This file captures the current chat context, decisions, progress, and next steps so work can continue from another device/session.
## User Request
The user asked to continue hardening Archipelago app/container lifecycle, then asked multiple times to save the plan/progress/next steps and finally to save the entire chat to Markdown.
Key user constraints and corrections:
- Continue if next steps are clear; ask only if blocked.
- Exhaustively harden app/container lifecycle before release.
- Preserve data during destructive lifecycle testing unless explicitly instructed otherwise.
- Do not rely on `/app/...` proxy paths for app launch/testing. The user corrected: “we never use paths only ports.”
- LND/Electrum wallet-connect tests must validate real connection details and QR, including Tor.
## Earlier Progress Summary
Before the latest work, the project already had substantial lifecycle hardening in progress:
- Remote lifecycle harness exists at `tests/lifecycle/remote-lifecycle.sh`.
- `.198` SSH works with `/home/archipelago/.ssh/id_ed25519`.
- `.228` RPC works, but SSH is blocked with `Permission denied (publickey,password)`.
- Multiple backend release binaries were built and deployed to `.198` with backups in `/usr/local/bin/archipelago.bak-*`.
- Fixed stale package scanner state recovery from `Removing -> Running` when a container is actually live.
- Fixed startup ordering so crash recovery runs before BootReconciler.
- Removed dangerous automatic Podman runtime directory deletion on `podman info` failure.
- Narrowed generic crash recovery to safe legacy containers.
- Fixed companion reconciliation on install/start/restart.
- Fixed uninstall/reinstall behavior so uninstall disables manifest apps instead of deleting manifest availability, and reinstall re-enables them.
- Fixed LND config generation/repair:
- `bitcoin.active=true`
- `bitcoin.mainnet=true`
- `bitcoin.node=bitcoind`
- `bitcoind.rpchost=bitcoin-knots:8332`
- sudo fallback for writing container-owned config paths.
- `.198` had previously passed focused lifecycle for `filebrowser`, `bitcoin-knots`, and a looser LND launch test.
## Major Files Touched In This Session
- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
- `docs/CHAT_TRANSCRIPT_2026-05-02.md`
- `tests/lifecycle/remote-lifecycle.sh`
- `core/archipelago/src/container/lnd.rs`
- `core/archipelago/src/container/companion.rs`
- `core/archipelago/src/container/prod_orchestrator.rs`
- `core/archipelago/src/container/docker_packages.rs`
- `core/container/src/podman_client.rs`
- `core/archipelago/src/port_allocator.rs`
- `apps/lnd-ui/manifest.yml`
- `neode-ui/src/views/appSession/appSessionConfig.ts`
- `neode-ui/src/stores/container.ts`
- `neode-ui/src/stores/appLauncher.ts`
- `neode-ui/src/views/appDetails/appDetailsData.ts`
- nginx config/snippet files under `scripts/` and `image-recipe/`
## LND Wallet Bootstrap Investigation
Initial strict LND probe failed because `/lnd-connect-info` could not read `admin.macaroon`:
```text
Failed to read LND admin macaroon — is LND installed?
direct: Permission denied (os error 13)
sudo: cat: /var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon: No such file or directory
```
LND logs showed the wallet was uninitialized/locked:
```text
Waiting for wallet encryption password. Use lncli create...
```
Tests showed `lncli create` is interactive and does not support `--stdin`:
```text
[lncli] flag provided but not defined: -stdin
```
`lncli unlock --stdin` is supported, so the final approach was:
- Use LND REST unlocker endpoints for new wallet creation.
- Use `lncli unlock --stdin` only for an existing wallet.
- Treat “wallet already exists” from REST as a signal to unlock.
- Use sudo-aware checks/reads for wallet artifacts because LND data directories are container-owned and `0700`.
Implemented in `core/archipelago/src/container/lnd.rs`:
- `ensure_wallet_initialized()`
- `file_exists_as_root()`
- `read_file_as_root()`
- `init_wallet_via_rest()`
- `get_lnd_unlocker_json()`
- `post_lnd_unlocker_json()`
- `unlock_existing_wallet()`
- `wait_for_admin_macaroon()`
- `lnd_getinfo_ready()`
Focused Rust test passes:
```bash
cd /home/archipelago/Projects/archy/core
cargo test -p archipelago --bin archipelago lnd
```
Result:
```text
7 passed; 0 failed
```
## LND UI Port Collision
The strict LND UI test then failed with `502`.
Investigation found a real port collision:
- `nostr-rs-relay` uses host `8081`.
- Old `archy-lnd-ui` also used host `8081`.
- nginx `/app/lnd/` proxy also pointed at `8081`.
Fix implemented:
- Move LND UI companion to host port `18083`, container port `80`.
- Keep `nostr-rs-relay` on `8081`.
- Update app metadata/routing to `18083`.
- Update tests to expect direct port launch.
Important correction from user:
```text
we never use paths only ports, how many times do you need to be told
```
Action taken after correction:
- Stop validating through `/app/lnd/` and `/app/electrumx/` in the lifecycle harness.
- Switch `launch_url_for()` to direct app ports.
- Switch app session resolver to direct `http://host:port` launch, even from HTTPS parent pages.
- Remove use of `HTTPS_PROXY_PATHS[id]` in `resolveAppUrl()`.
Direct-port LND audit command:
```bash
ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd tests/lifecycle/remote-lifecycle.sh
```
Result:
```text
### 192.168.1.198 iteration 1 / 1 ###
lnd state=running
all checks passed
```
The audit now validates `http://192.168.1.198:18083/`, not `/app/lnd/`.
## Lifecycle Harness Changes
`tests/lifecycle/remote-lifecycle.sh` changes made:
- Normalize package states with `ascii_downcase` because API returned `Running`.
- Direct port launch URLs:
- LND: `http://${ARCHY_HOST}:18083/`
- Electrum/Electrs: `http://${ARCHY_HOST}:50002/`
- Bitcoin UI: `http://${ARCHY_HOST}:8334/`
- Other apps mapped to direct ports where known.
- LND probe checks:
- `Connect Your Wallet`
- `id="lndQrBox"`
- `id="connHost"`
- `value="rest-tor"`
- `value="grpc-tor"`
- `value="rest-local"`
- `value="grpc-local"`
- `Copy lndconnect URI`
- `/lnd-connect-info` cert, macaroon, ports, and Tor onion.
- Electrum probe checks:
- local QR container and address field
- Tor QR container and onion field
- port `50001`
- QR renderer
- direct `http://${ARCHY_HOST}:50002/qrcode.js`
- `/electrs-status` Tor onion.
- Full lifecycle now fails immediately on any failed phase with `|| return 1` so a later reinstall cannot mask a failed restart/probe.
## Deployments To `.198`
Several release builds were made and deployed:
```bash
cd /home/archipelago/Projects/archy/core
cargo build -p archipelago --bin archipelago --release
```
Deploy pattern:
```bash
scp -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no \
/home/archipelago/Projects/archy/core/target/release/archipelago \
archipelago@192.168.1.198:/tmp/archipelago.new
ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no \
archipelago@192.168.1.198 \
"sudo cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak-<timestamp> && \
sudo install -m 0755 /tmp/archipelago.new /usr/local/bin/archipelago && \
sudo systemctl restart archipelago.service && \
systemctl is-active archipelago.service"
```
Latest deploy returned:
```text
active
```
## `.198` Current Observations
After forcing LND package restart, companion reconciliation succeeded:
```text
nostr-rs-relay Up ... 0.0.0.0:8081->8080/tcp
lnd Up ... 0.0.0.0:8080->8080/tcp, 0.0.0.0:9735->9735/tcp, 0.0.0.0:10009->10009/tcp
archy-lnd-ui Up ... 0.0.0.0:18083->80/tcp
```
Direct UI test from `.198` returned `200`:
```bash
curl -i http://127.0.0.1:18083/
```
Strict direct-port LND audit is green:
```text
lnd state=running
all checks passed
```
## Full LND Lifecycle Status
Full direct-port lifecycle was started:
```bash
ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh
```
It reached:
```text
### 192.168.1.198 iteration 1 / 1 ###
== lnd: install ==
== lnd: stop ==
```
Then the user aborted the command while asking to save memory/transcript.
The next continuation point is to rerun full LND direct-port lifecycle from scratch and inspect the stop phase if it hangs/fails.
## Handoff File
A durable handoff file was also created:
```text
docs/CONTAINER_LIFECYCLE_HANDOFF.md
```
It contains the plan, progress, current blockers, and next steps.
## Immediate Next Steps
1. Rerun full strict LND direct-port lifecycle:
```bash
ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh
```
2. If it hangs/fails at `stop`, inspect package runtime stop path and logs:
```bash
ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 \
'journalctl -u archipelago.service -n 260 --no-pager | egrep -i "package\.(stop|start|restart|install|uninstall)|lnd|companion|error|failed" | sed -n "1,220p"; podman ps -a --format "{{.Names}} {{.Status}} {{.Ports}}" | egrep "lnd|nostr" || true'
```
3. If stop is unreliable, inspect/fix:
- `core/archipelago/src/api/rpc/package/runtime.rs`
- `core/archipelago/src/container/prod_orchestrator.rs`
Likely causes to check:
- Reconciler restarting LND while stop is expected.
- State scanner reporting stale `running`.
- Companion handling interfering with parent app state.
- Async lifecycle returning before actual stop completes.
4. Once LND full lifecycle is green, run Electrum strict lifecycle with direct port `50002`:
```bash
ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=electrumx ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh
```
5. Continue with app groups after LND/Electrum:
- `filebrowser`
- `bitcoin-knots`
- `lnd`
- `electrumx`
- `mempool`
- `btcpay-server`
- `fedimint`
- remaining catalog apps.
## Important Instruction To Preserve
Use ports only for app launch/testing. Do not add or rely on `/app/...` path proxy launch behavior unless the user explicitly changes this requirement.

View File

@ -1,508 +0,0 @@
# Archipelago Container Infrastructure — Critical Issues Report
**Date:** 2026-03-31
**Status:** Server .228 rebooted — some apps recovered, many did not. UI showed everything as "crashed" during recovery window.
**Purpose:** Fix guide for getting container lifecycle to production quality.
---
## Executive Summary
The container system has **7 systemic failures** that compound each other:
1. **Silent failures everywhere** — errors are swallowed with `|| true`, `.unwrap_or_default()`, and warn-level logs. Nothing actually tells the user (or the system) that something broke.
2. **Health checks are fake** — manifests define real health checks (HTTP probes, exec checks) but they are **never executed**. "Healthy" just means `podman ps` shows "running".
3. **Duplicate polling burns CPU** — health monitor + metrics collector both call `podman stats` every 60 seconds independently. Add crash recovery snapshots, disk monitor, and frontend polling = constant subprocess spawning.
4. **Uninstall doesn't clean up** — no volume removal, no network cleanup, force-kills stateful containers (risking wallet/DB corruption), returns 200 OK on partial failure.
5. **Two divergent install paths**`first-boot-containers.sh` and the Rust RPC installer use different passwords, ports, capabilities, memory limits, and Bitcoin config. They are never in sync.
6. **UI misrepresents state**`Exited` (even clean exit code 0) shows as "crashed". No "recovering" or "starting up" state exists. During boot recovery, UI shows a wall of red/gray "crashed" labels.
7. **Dependency-blind restarts** — health monitor restarts services without restarting their dependencies first, so they immediately fail again and burn through the 3-attempt limit.
---
## LIVE EVIDENCE: .228 Reboot on 2026-03-31
After rebooting .228, here's the actual container state 30 minutes later:
### Permanently Dead (exceeded 3 restart attempts, abandoned)
| Container | Exit Code | Cause |
|-----------|-----------|-------|
| `indeedhub-postgres` | 0 (clean) | Shut down by reboot. Health monitor tried 3 restarts, it keeps exiting cleanly. Once abandoned, all dependent services die too. |
| `indeedhub-redis` | 0 | Same — clean exit, 3 failed restart attempts, abandoned |
| `indeedhub-minio` | 0 | Same |
| `indeedhub-relay` | 0 | Same |
| `indeedhub` | 0 | Same |
| `indeedhub-api` | 1 | Can't resolve hostname `indeedhub-postgres` (postgres is dead, DNS entry gone from network) |
| `jellyfin` | 137 (OOM) | "Failed to create CoreCLR" — memory limit too low for .NET runtime. SIGKILL = OOM. 3 attempts exhausted. |
### Crash-Looping (still failing on every restart)
| Container | Cause |
|-----------|-------|
| `mempool-api` | `ECONNREFUSED 10.89.0.42:3306` — DB (`archy-mempool-db`) just restarted, not ready yet |
| `portainer` | "database schema version does not align with server version" — image upgraded, DB not migrated. Will NEVER recover. |
| `photoprism` | "Failed creating test file in storage folder" — volume permission issue (rootless UID mapping) |
### Never Started (stuck in "Created" state)
| Container | Cause |
|-----------|-------|
| `archy-mempool-web` | "cannot assign requested address" — network binding failure |
| `fedimint` | Same network error |
### Running but Unhealthy
| Container | Notes |
|-----------|-------|
| `homeassistant` | Up 14 min, health check failing |
| `searxng` | Up 13 min, health check failing |
| `onlyoffice` | Up 10 min, health check failing |
### Actually Recovered (healthy)
`filebrowser`, `bitcoin-knots`, `vaultwarden`, `nginx-proxy-manager`, `archy-btcpay-db`, `lnd`, `electrumx`, `grafana`
### Key Observations
1. **All containers have `unless-stopped` restart policy** — but this doesn't help because containers that exit cleanly (code 0) don't get restarted by Podman. The health monitor is the only restart mechanism, and it gives up after 3 attempts.
2. **The entire IndeedHub stack died** because postgres was abandoned first. Once postgres hit 3 restart attempts, every dependent service (api, redis, minio, relay, main) also failed and hit their own 3-attempt limit. **No dependency awareness.**
3. **Containers in "Created" state** were never even started — some kind of network assignment failure during creation. The health monitor doesn't handle "Created" state containers.
4. **The UI showed ALL apps as "crashed"** during the first few minutes, even the ones that eventually recovered. This is because `Exited` state (even exit code 0) maps to the label "crashed" in `appsConfig.ts`.
---
## Problem 1: Containers Don't Start or Recover After Reboot
**Confirmed:** All apps crashed after .228 reboot on 2026-03-31.
### Root Causes
#### A. Crash recovery has a 30-second timeout that's too short
**File:** `core/archipelago/src/crash_recovery.rs:265-271`
```rust
let result = tokio::time::timeout(
std::time::Duration::from_secs(30),
tokio::process::Command::new("podman").args(["start", &record.name]).output(),
).await;
```
On a cold boot with many containers, Podman is under load. 30 seconds is not enough. If it times out, the container is **skipped** — no retry.
#### B. If `podman ps` itself times out, recovery finds zero containers
**File:** `core/archipelago/src/crash_recovery.rs:318`
The `podman ps -a` call to discover stopped containers has a 30-second timeout. On a busy system post-reboot, this can timeout. Result: `all_names` is empty, recovery silently exits having started nothing.
#### C. Boot tier ordering uses a catch-all that misses dependencies
**File:** `core/archipelago/src/crash_recovery.rs:374-385`
```rust
fn container_boot_tier(name: &str) -> u8 {
match id {
"btcpay-db" | "mempool-db" | ... => 0, // databases
"bitcoin-knots" | ... => 1, // bitcoin
"lnd" | "electrumx" | ... => 2, // depends on bitcoin
"mempool-web" | ... => 4, // frontend
_ => 3, // EVERYTHING ELSE - may start before its dependencies
}
}
```
Any app not explicitly listed gets tier 3, which may be before its dependencies are ready.
#### D. First-boot script swallows ALL errors
**File:** `scripts/first-boot-containers.sh:8` — no `set -e`
48+ commands have `|| true` appended. Every `podman run` failure is silently ignored. The script always exits 0 and reports "complete" to systemd even if 50% of containers failed.
#### E. Install RPC returns success before container is actually running
**File:** `core/archipelago/src/api/rpc/package/install.rs:260-294`
After container creation, the installer polls for 30 seconds (6 checks x 5 seconds). If the container is still in "created" or "starting" state after 30 seconds:
```rust
if i == 5 {
debug!("Container {} health check timeout (30s) -- continuing anyway");
}
```
It logs at debug level and **returns success**. The user sees "installed" but the container never actually started.
### Fixes Required
1. **Increase crash recovery timeout to 120s** and add retry with backoff (3 attempts per container)
2. **Increase `podman ps` timeout to 60s** during boot recovery
3. **Replace tier catch-all** — every container must be explicitly listed or derived from manifest dependencies
4. **Remove `|| true`** from critical commands in first-boot-containers.sh. Use proper error handling: log the error, record the failure, continue to next container, but report actual failures at the end
5. **Install RPC must return failure** if container isn't running after timeout, not silently succeed
6. **Add `--restart unless-stopped`** to container creation in the Podman client (`core/container/src/podman_client.rs:303-335`) — currently missing, so Podman itself never auto-restarts crashed containers
---
## Problem 2: Health Checks Are Fake
### Root Causes
#### A. "Healthy" just means "running" — application health is never checked
**File:** `core/archipelago/src/container/dev_orchestrator.rs:239-249`
```rust
pub async fn get_health_status(&self, app_id: &str) -> Result<String> {
match status.state {
ContainerState::Running => Ok("healthy".to_string()), // <-- THIS IS THE ENTIRE CHECK
ContainerState::Stopped | ContainerState::Exited => Ok("unhealthy".to_string()),
...
}
}
```
A container can be "running" but the application inside is completely broken. This is reported as "healthy".
#### B. Manifest health checks exist but are never executed
All 30+ app manifests in `image-recipe/build/debian-iso/custom/archipelago/apps/*/manifest.yml` define health checks like:
```yaml
health_check:
type: http
endpoint: http://localhost:4080
path: /api/health
interval: 30s
timeout: 5s
retries: 3
```
The `HealthMonitor` struct at `core/container/src/health_monitor.rs` can execute these checks. **But it is never instantiated.** No code path creates a `HealthMonitor` from the manifest health check definitions.
#### C. Health status is never pushed to the frontend via WebSocket
**File:** `core/archipelago/src/data_model.rs:120-127`
```rust
pub struct PackageDataEntry {
pub health: Option<String>, // Field exists but is NEVER POPULATED
}
```
The health field in the data model is always `None`. Frontend can only get health via explicit RPC call, which it almost never makes.
#### D. Frontend never polls health status
**File:** `neode-ui/src/stores/container.ts:169-175`
`fetchHealthStatus()` is only called after `startContainer()` and `startBundledApp()`. There is **no setInterval, no periodic polling, no watch**. After the initial call, health status is never refreshed.
### Fixes Required
1. **Wire up manifest health checks** — instantiate `HealthMonitor` from manifest definitions, run actual HTTP/exec probes instead of just checking `podman ps`
2. **Populate the `health` field in `PackageDataEntry`** so WebSocket pushes real health status to frontend
3. **Add 30-second health polling** in the frontend container store (with backoff to 60s when all healthy)
4. **Fix `get_health_status()`** in dev_orchestrator to call actual health checks, not just check container state
---
## Problem 3: CPU Exhaustion from Duplicate Polling
### Root Causes
#### A. Two independent monitors both call `podman stats` every 60 seconds
- **Health monitor:** `core/archipelago/src/health_monitor.rs:17``CHECK_INTERVAL_SECS = 60`
- Runs `podman ps -a --format json` (line 305-323)
- Runs `podman stats --no-stream` every 5 cycles (line 442-450)
- **Metrics collector:** `core/archipelago/src/monitoring/mod.rs:28` — 60-second interval
- Runs `podman stats --no-stream --format json` independently (collector.rs:220-224)
These are **not coordinated**. Both spawn separate subprocesses. On a system with 15+ containers, each `podman stats` call is expensive.
#### B. Total subprocess spawning frequency
| Component | Interval | What it runs |
|-----------|----------|-------------|
| Health monitor | 60s | `podman ps`, `podman stats` (every 5th), restart attempts |
| Metrics collector | 60s | `podman stats` (duplicate!) |
| Crash recovery snapshot | 120s | `podman ps` |
| Disk monitor | 300s | `df`, `sudo dmesg`, potentially `podman image prune` |
| Telemetry | 900s | `podman stats` (another duplicate) |
| Systemd watchdog | 120s | sd_notify ping |
| Frontend fleet polling | 60s | RPC calls that trigger more podman commands |
That's roughly **one `podman` subprocess every 10-15 seconds** on average, plus all the triggered operations.
#### C. No restart policy means polling-driven restarts
**File:** `core/container/src/podman_client.rs:303-335`
Container creation spec does NOT include `RestartPolicy`. Podman itself never restarts crashed containers. Instead, the health monitor's 60-second poll detects the crash and attempts a restart. This is far more CPU-intensive than Podman's built-in restart mechanism.
#### D. Health monitor restart attempts with exponential backoff still spawn processes
When a container fails, the health monitor tries restarts at 10s, 30s, 90s backoff. Each attempt spawns `podman start`, `podman inspect`, etc. If multiple containers are unhealthy, this multiplies.
### Fixes Required
1. **Deduplicate `podman stats`** — create a shared cache layer. One component fetches, others read from cache (TTL: 30s)
2. **Add `RestartPolicy: unless-stopped` with MaxRetryCount: 5** to all container creation — let Podman handle restarts natively instead of polling
3. **Increase health monitor interval to 120s** (60s is too aggressive when health checks are just `podman ps`)
4. **Remove duplicate `podman stats`** call from metrics collector — share data with health monitor
5. **Make frontend fleet polling viewport-aware** — only poll when user is actually viewing the fleet page
6. **Batch all container queries** — use a single `podman ps -a --format json` per check cycle, shared across all consumers
---
## Problem 4: Uninstall Doesn't Work
### Root Causes
#### A. No volume removal
**File:** `core/archipelago/src/api/rpc/package/runtime.rs:172-289`
The uninstall function stops containers, removes containers, releases ports, and attempts data directory cleanup. It **never removes Podman volumes**. Orphaned volumes accumulate forever.
#### B. No network cleanup
**File:** `core/archipelago/src/api/rpc/package/runtime.rs:172-289`
Multi-container stacks create networks (`archy-net`, `immich-net`, `penpot-net`) during install (`stacks.rs:89, 211`). These are **never cleaned up** during uninstall. Leftover networks can prevent reinstallation.
#### C. Force-kills stateful containers without graceful shutdown
**File:** `core/archipelago/src/api/rpc/package/runtime.rs:226`
```rust
let rm_out = tokio::process::Command::new("podman")
.args(["rm", "-f", name]) // -f = force kill
.output().await;
```
The code defines proper shutdown timeouts (Bitcoin: 600s, LND: 330s, databases: 120s) but only uses them for `stop`. The `rm -f` that follows **ignores these timeouts** and force-kills immediately. This risks corrupting Bitcoin's UTXO set, LND channel state, or database WAL.
#### D. Returns 200 OK even on partial failure
**File:** `core/archipelago/src/api/rpc/package/runtime.rs:268-289`
```rust
Ok(serde_json::json!({
"status": if errors.is_empty() { "uninstalled" } else { "partial" },
...
}))
```
Returns HTTP 200 with `"partial"` status. Frontend at `neode-ui/src/views/apps/useAppsActions.ts:74` doesn't check for "partial" — it deletes the app from the UI regardless.
#### E. Data directory cleanup requires sudo and fails silently
**File:** `core/archipelago/src/api/rpc/package/runtime.rs:256-265`
```rust
let rm_out = tokio::process::Command::new("sudo")
.args(["rm", "-rf", dir]).output().await;
if let Ok(o) = rm_out {
if !o.status.success() {
tracing::warn!(...); // Warning only, continues
}
}
```
If sudo isn't configured or fails, data remains on disk but UI shows "uninstalled".
#### F. Container name detection has gaps
**File:** `core/archipelago/src/api/rpc/package/config.rs:287-340`
Container names are hardcoded patterns. If a container was created with a different naming convention (e.g., by first-boot-containers.sh vs RPC installer), it won't be found and won't be removed.
### Fixes Required
1. **Add `podman volume rm`** for all volumes associated with the app after container removal
2. **Add network cleanup** — remove app-specific networks after all containers on that network are gone
3. **Use `podman stop -t {timeout}` then `podman rm`** (without -f) — respect graceful shutdown timeouts, especially for Bitcoin/LND/databases
4. **Return an error (not 200)** when uninstall has failures. Frontend must check and display errors
5. **Surface "partial" failures to the user** with specific error messages
6. **Unify container naming** — derive names from a single source (manifest), not hardcoded patterns in multiple files
---
## Problem 5: Two Divergent Install Paths
The first-boot bash script and the Rust RPC installer create containers with **different configurations**. This is a major source of bugs.
### Specific Divergences
#### A. Database passwords
- **First-boot** (`scripts/first-boot-containers.sh:118-127`): Generates random passwords with `openssl rand -base64 24`, stores in `/var/lib/archipelago/secrets/`
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:456,484,514-515,610`): Uses hardcoded `"btcpaypass"`, `"mempoolpass"`, `"rootpass"`, `"immichpass"`
**Result:** Apps installed via RPC after first-boot can't connect to databases because passwords don't match.
#### B. Bitcoin configuration
- **First-boot** (`scripts/first-boot-containers.sh:295-313`): Dynamically sets `-prune=550` on small disks, `-txindex=1` on large disks
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:415-420`): No custom args at all
**Result:** Bitcoin installed via RPC has no pruning or txindex regardless of disk size.
#### C. ZMQ configuration for LND
- **First-boot** (`scripts/first-boot-containers.sh:100-114`): Bitcoin.conf generated without ZMQ publisher settings
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:438-439`): LND configured to connect to `tcp://bitcoin-knots:28332` and `tcp://bitcoin-knots:28333`
**Result:** LND can't receive block notifications from Bitcoin because ZMQ isn't configured on either path.
#### D. Port conflicts
- **First-boot** (`scripts/first-boot-containers.sh:813,835`): Both strfry and indeedhub bind to host port 7777
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:734`): IndeedHub uses `8190:3000`
**Result:** On first-boot, whichever of strfry/indeedhub starts second fails. Via RPC, different port entirely.
#### E. Memory limits
- **First-boot** (`scripts/first-boot-containers.sh:253-283`): Ollama gets 1g on low-mem systems
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:245-280`): Ollama gets 4g always
**Result:** Same app gets different resource limits depending on how it was installed.
#### F. Version mismatches in marketplace UI
- `scripts/image-versions.sh:17`: LND image is `v0.18.4-beta`
- `neode-ui/src/views/marketplace/marketplaceData.ts:155`: Shows `0.17.4`
- `scripts/image-versions.sh:21-22`: Mempool images are `v3.0.0`
- `neode-ui/src/views/marketplace/marketplaceData.ts:177`: Shows `2.5.0`
### Fixes Required
1. **Single source of truth for container config** — Rust config must read passwords from `/var/lib/archipelago/secrets/`, not hardcode them
2. **Add ZMQ config** to Bitcoin startup in both paths: `zmqpubrawblock=tcp://0.0.0.0:28332` and `zmqpubrawtx=tcp://0.0.0.0:28333`
3. **Fix port 7777 conflict** — assign unique ports to strfry and indeedhub
4. **Add disk-aware Bitcoin config** to Rust installer (prune/txindex based on disk size)
5. **Sync memory limits** between first-boot and Rust config
6. **Update marketplace version strings** to match actual image versions in `image-versions.sh`
7. **Long-term: eliminate first-boot-containers.sh** — have the backend handle all container creation using the same Rust code path
---
## Problem 6: Post-Install Hooks Run Async and Fail Silently
**File:** `core/archipelago/src/api/rpc/package/install.rs:541-625`
Post-install hooks (setting FileBrowser password, configuring NextCloud, etc.) are spawned as background tasks:
```rust
tokio::spawn(async move {
let _ = tokio::fs::create_dir_all(secret_dir).await;
let _ = tokio::fs::write(...).await;
});
```
The install RPC returns success **before hooks complete**. If a hook fails (network timeout, service not ready), the error is logged but the user is told installation succeeded. Credentials aren't set, configs aren't applied.
### Fix Required
Await post-install hooks before returning success, or return a "configuring" status and let the frontend poll for completion.
---
## Problem 7: Podman Client Swallows Errors
**File:** `core/container/src/podman_client.rs`
#### A. JSON serialization failures return empty strings (line 182-183)
```rust
let body_str = body.map(|b| serde_json::to_string(&b).unwrap_or_default()).unwrap_or_default();
```
#### B. Container ID parsing failures return empty string (line 344-348)
```rust
let id = result["Id"].as_str().unwrap_or("").to_string();
Ok(id) // Empty string = success?
```
#### C. Socket timeout is only 5 seconds (line 154-160)
On a busy system or during boot, Podman socket may take >5s to respond. Every API call fails. No retry logic.
### Fixes Required
1. Replace `.unwrap_or_default()` with proper error propagation using `?`
2. Return `Err` when container ID is empty
3. Increase socket timeout to 15-30s
4. Add retry with backoff (3 attempts) on socket connection
---
## Problem 8: UI Misrepresents Container State
### Root Causes
#### A. "Exited" always displays as "Crashed" — even for clean shutdowns
**File:** `neode-ui/src/views/apps/appsConfig.ts:119-146`
```typescript
getStatusLabel(state, health):
- "exited" → "crashed" // <-- THIS IS THE PROBLEM
```
Every container that exited — whether from a clean reboot (exit 0), OOM kill (exit 137), or app error (exit 1) — shows the same "crashed" label. After a reboot, the UI is a wall of "crashed" labels even though containers are in the process of starting up.
#### B. No "recovering" or "boot in progress" state exists
**File:** `core/archipelago/src/data_model.rs:103-119`
PackageState enum has `Starting`, but it's only set during **explicit user start actions**, not during automatic crash recovery. During boot recovery, containers transition from `Exited → Running` without ever passing through `Starting`, so the UI never shows a spinner or "starting up" message.
#### C. Backend skips sub-containers from package listing, so their state is invisible
**File:** `core/archipelago/src/container/docker_packages.rs:39-117`
The excluded_services list filters out backend services like `mempool-db`, `btcpay-db`, `nbxplorer`, `penpot-postgres`, etc. UI containers ending in `-ui` are also skipped. These containers are invisible to the user even when they're the actual cause of a stack failure (e.g., `indeedhub-postgres` being dead kills the entire IndeedHub stack, but only `indeedhub-api` errors are visible).
#### D. No distinction between "needs manual intervention" and "will recover soon"
The UI shows the same visual treatment for:
- Portainer (DB migration error — will NEVER recover without manual intervention)
- mempool-api (DB not ready yet — will recover in 30 seconds)
- IndeedHub (dependencies abandoned — won't recover until deps are manually restarted)
### Fixes Required
1. **Differentiate exit codes**: Exit 0 = "stopped" (gray), Exit non-zero = "crashed" (red), Exit 137 = "killed (OOM)" (red with warning)
2. **Add a "recovering" state**: During boot/crash recovery window (first 5 minutes after backend start), show "Starting up..." instead of "crashed" for exited containers
3. **Show sub-container health**: When a parent app is unhealthy, show which sub-service caused the failure (e.g., "IndeedHub: postgres is down")
4. **Distinguish recoverable from permanent failures**: After health monitor gives up (3 attempts), change label to "Needs attention" instead of keeping "crashed"
5. **Add recovery progress indicator**: During boot, show "Recovering containers: 15/22 started" on the dashboard
---
## Problem 9: Dependency-Blind Restarts
### Root Cause (Confirmed by .228 reboot)
The health monitor restarts containers individually without considering dependencies. This was proven by the IndeedHub stack failure:
1. `indeedhub-postgres` exits cleanly (code 0) on reboot
2. Health monitor restarts postgres — it starts, but exits again (likely needs volume mount or network ready)
3. After 3 attempts, postgres is **abandoned**
4. Meanwhile, `indeedhub-api` tries to connect to postgres → `ENOTFOUND indeedhub-postgres` → exits
5. Health monitor restarts api → same DNS failure → exits
6. After 3 attempts, api is **abandoned**
7. Same cascade for redis, minio, relay, main container — all abandoned within minutes
**File:** `core/archipelago/src/health_monitor.rs:500-530`
The restart loop treats each container independently. There's no logic to:
- Check if a container's dependencies are running before restarting it
- Restart dependencies first when a dependent container fails
- Reset attempt counters when a dependency comes back online
**3 attempts is too few**, especially when dependencies need time:
- Attempt 1: 10s backoff → dependency still starting
- Attempt 2: 30s backoff → dependency crashed and is being restarted
- Attempt 3: 90s backoff → dependency hit its own 3-attempt limit and was abandoned
- Game over. Entire stack is dead.
### Fixes Required
1. **Dependency-aware restart ordering**: Before restarting a container, check if its dependencies are running. If not, restart dependencies first.
2. **Increase max restart attempts to 5-10** for containers with dependencies
3. **Reset attempt counters** when a dependency comes back online (the dependent container failed because of the dependency, not itself)
4. **Add a "stack restart" concept**: When restarting any container in a multi-container stack (indeedhub, mempool, btcpay, immich, penpot), restart the entire stack in dependency order
5. **Handle "Created" state containers**: `archy-mempool-web` and `fedimint` are in "Created" state (never started). The health monitor should detect these and attempt to start them.
---
## Priority Order for Fixes
### P0 — System is broken without these (reboot = broken system)
1. **Dependency-aware restarts** in health_monitor.rs — restart dependencies before dependents, reset attempt counters when deps recover
2. **Increase max restart attempts to 10** (currently 3) — dependency chains need more time on boot
3. **Handle "Created" state** — containers stuck in Created are never started by health monitor
4. **Fix UI state labels** — "exited" code 0 should say "stopped", not "crashed". Add "recovering" state during boot window.
5. Fix Rust config to read secrets from `/var/lib/archipelago/secrets/` instead of hardcoded passwords
6. Fix port 7777 conflict (strfry vs indeedhub)
7. Add ZMQ config to Bitcoin for LND block notifications
### P1 — Core functionality broken
8. Wire up manifest health checks (replace fake "running = healthy" with actual HTTP/exec probes)
9. Fix uninstall to clean up volumes, networks, and respect graceful shutdown timeouts
10. Return actual errors from install/uninstall instead of silent success on partial failure
11. Remove `|| true` from critical first-boot commands
12. Show sub-container health in UI (which dependency is actually broken)
### P2 — Performance and CPU
13. Deduplicate `podman stats` calls (health monitor + metrics collector both call every 60s independently)
14. Increase health monitor interval to 120s
15. Add frontend health polling via WebSocket push (populate `health` field in data model)
16. Make fleet polling viewport-aware (don't poll when user isn't viewing)
### P3 — Consistency and correctness
17. Sync memory limits between first-boot and Rust config
18. Update marketplace version strings (LND shows 0.17.4, actual is 0.18.4; Mempool shows 2.5.0, actual is 3.0.0)
19. Unify container naming conventions between first-boot script and Rust config
20. Add disk-aware Bitcoin config (prune/txindex) to Rust installer
21. Distinguish "needs manual intervention" from "will recover soon" in UI
---
## Key Files to Modify
| File | What to fix |
|------|-------------|
| `core/archipelago/src/health_monitor.rs` | Dependency-aware restarts, increase MAX_RESTART_ATTEMPTS to 10, handle Created state, deduplicate with metrics collector |
| `core/container/src/podman_client.rs` | Add RestartPolicy to container creation spec, fix `.unwrap_or_default()` error swallowing, increase socket timeout to 15-30s |
| `core/archipelago/src/crash_recovery.rs` | Increase timeouts to 120s, add retry with backoff, fix tier ordering catch-all |
| `core/archipelago/src/api/rpc/package/install.rs` | Return failure on timeout (not silent success), await post-install hooks |
| `core/archipelago/src/api/rpc/package/runtime.rs` | Add volume/network cleanup on uninstall, use `podman stop -t` then `podman rm` (not `-f`), return errors on partial failure |
| `core/archipelago/src/api/rpc/package/config.rs` | Read secrets from disk, fix port 7777, add ZMQ config, sync memory limits |
| `core/archipelago/src/container/dev_orchestrator.rs` | Wire up manifest-defined health checks instead of just checking podman state |
| `core/archipelago/src/container/docker_packages.rs` | Stop filtering sub-containers from state — or expose their health as part of parent app status |
| `core/archipelago/src/data_model.rs` | Populate `health` field for WebSocket push, add exit code to state |
| `core/archipelago/src/monitoring/mod.rs` | Share podman stats data with health monitor instead of duplicate subprocess calls |
| `neode-ui/src/views/apps/appsConfig.ts` | Fix state labels: exit 0 = "stopped", exit non-zero = "crashed", add "recovering" during boot window |
| `neode-ui/src/stores/container.ts` | Add periodic health polling (30s) |
| `neode-ui/src/views/apps/useAppsActions.ts` | Check for "partial" uninstall status, show errors to user |
| `neode-ui/src/views/marketplace/marketplaceData.ts` | Fix version strings to match image-versions.sh |
| `scripts/first-boot-containers.sh` | Remove `\|\| true` from critical commands, fix port 7777 conflict, add proper error reporting |

File diff suppressed because it is too large Load Diff

View File

@ -1,216 +0,0 @@
# Current Agent Handoff - Bitcoin UI Recovery And `1.8-alpha` Resume
Last updated: 2026-06-10 05:33 EDT
## Read This First
This is a separate handoff from `docs/NEXT_TERMINAL_HANDOFF.md`. That file tracks
an older/broader plan. For the next agent resuming this machine-switch pause,
read this file first, then read:
- `docs/RESUME.md`
- `docs/1.8-alpha-improvements-tracker.md`
- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
- `docs/MIGRATION_STATUS_REPORT.md`
Do not assume `docs/NEXT_TERMINAL_HANDOFF.md` is the current short-term plan.
## Current Goal
Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image.
The release goal is not just "apps launch once"; the app/container system needs
to be developer-ready and production-release ready:
- manifests and docs must describe the real runtime contract;
- apps must install, start, stop, restart, uninstall, reinstall, survive reboot,
report truthful status, and show useful progress;
- My Apps must preserve last-known truth during Podman/scanner backoff instead
of showing false empty/no-app states;
- Bitcoin-dependent apps must explain sync/wallet readiness instead of looking
broken;
- final validation needs focused lifecycle, broad non-destructive lifecycle,
then repeated reboot checks before ISO cut/smoke test.
## Current Estimate
As of this pause:
- Credible release candidate: roughly `87-91%`.
- Production-quality release developers will love: roughly `73-79%`.
- Calendar estimate if the remaining systemic lifecycle issues are bounded:
`1-2 focused engineering days` for a release candidate, then additional
reboot/ISO smoke time.
- The biggest remaining risk is not catalog wiring; it is rootless Podman
control-plane responsiveness, stale scanner state, lifecycle progress UX, and
reboot validation.
## Validation Host
- Host: `192.168.1.198`
- SSH user: `archipelago`
- Password used in this session: `password123`
- Active Bitcoin app on this host: `bitcoin-knots`, not `bitcoin-core`
- Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive
for deterministic validation unless intentionally testing them.
- Preserve app data.
- Avoid broad Podman store/image cleanup commands on `.198`.
## Bitcoin UI Incident Summary
User reported the Bitcoin custom UI showing:
`Bitcoin node is starting or busy syncing; retrying automatically. Detail:
getblockchaininfo: Bitcoin RPC request failed ... operation timed out`
Then after listener repair, the message changed through:
- `Connection refused`
- `Verifying blocks...`
- then the user reported it looked fine again.
What happened:
- The node is a `bitcoin-knots` node.
- During live debugging, the wrong alias, `bitcoin-core`, was started/stopped.
- `bitcoin-core` and `bitcoin-knots` compete for the same Bitcoin RPC/P2P ports.
- That action left the real `bitcoin-knots` service active but without the host
`8332` rootlessport listener for a while.
- Stopping the stray `bitcoin-core.service` and restarting only
`bitcoin-knots.service` recreated listeners on `8332` and `8333`.
- After restart, bitcoind entered the normal `-28 Verifying blocks...` phase.
- The user later reported the Bitcoin UI looked fine again.
Known live state observed during recovery:
- `bitcoin-knots.service`: active
- `bitcoin-core.service`: inactive
- `archy-bitcoin-ui.service`: active
- listeners present after repair:
- `8332` via `rootlessport`
- `8333` via `rootlessport`
- `8334` via nginx/Bitcoin UI
- `bitcoin-knots` logs showed active IBD around height `4137xx` and progress
about `0.09438`.
Do not restart Bitcoin again unless there is a fresh confirmed service/listener
failure. If checking status, prefer read-only probes and avoid starting the
wrong variant.
## Source Fixes Made Locally
These local edits were made after live Bitcoin recovered. They are not deployed
yet and were not fully validated before the user paused.
### `core/archipelago/src/bitcoin_status.rs`
Changed Bitcoin status cache behavior and copy:
- refresh interval changed from `5s` to `10s`;
- transient error backoff added at `15s`;
- RPC client timeout increased from `8s` to `20s`;
- error context now uses full anyhow chain with `{e:#}`;
- transient classifications now include common overloaded/backend states;
- user-facing copy now distinguishes:
- `verifying blocks after restart`;
- `waiting for the Bitcoin RPC listener`;
- `busy and not answering RPC before the timeout`;
- generic `starting or busy syncing`;
- added unit tests for the three user-visible states above.
Intent: stop collapsing distinct backend states into the same stale
"starting or busy syncing" timeout message.
### `core/archipelago/src/api/rpc/package/update.rs`
Narrow Bitcoin alias fix added:
- `orchestrator_update_app_id("bitcoin-knots")` now remains
`"bitcoin-knots"` instead of mapping to `"bitcoin-core"`;
- candidate app IDs for a Bitcoin container now prefer `bitcoin-knots` before
`bitcoin-core`;
- tests updated to lock this behavior.
Intent: `bitcoin-core` and `bitcoin-knots` can be dependency/status aliases,
but must not be interchangeable lifecycle/update targets on a node that has a
specific installed variant.
Important: this file also already contained other uncommitted update/pull
timeout changes from prior work. Do not assume every diff in this file came
from this interruption.
## Validation Status At Pause
Completed:
- `cargo fmt --manifest-path core/Cargo.toml --all` passed after the local
Bitcoin edits.
Attempted but not completed:
- Targeted Cargo tests were first launched in three separate `/tmp` target dirs
and failed due `/tmp` filling with `No space left on device`.
- Those temporary dirs were removed:
- `/tmp/archy-cargo-bitcoin-status`
- `/tmp/archy-cargo-update-alias`
- `/tmp/archy-cargo-container-candidates`
- A second run using `CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix` was still
compiling when the user paused. It was terminated for handoff.
- No successful Rust test result exists yet for the new Bitcoin status/alias
tests.
Recommended validation after resume:
```bash
git diff --check -- core/archipelago/src/bitcoin_status.rs core/archipelago/src/api/rpc/package/update.rs docs/CURRENT_AGENT_HANDOFF.md
CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix CARGO_BUILD_JOBS=2 cargo test --manifest-path core/Cargo.toml -p archipelago bitcoin_status::tests
CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix CARGO_BUILD_JOBS=2 cargo test --manifest-path core/Cargo.toml -p archipelago update_aliases_map_to_manifest_app_ids
CARGO_TARGET_DIR=.codex-tmp/cargo-bitcoin-fix CARGO_BUILD_JOBS=2 cargo test --manifest-path core/Cargo.toml -p archipelago container_name_candidates_cover_common_aliases
```
If Cargo target locking appears stale, check for real `cargo`/`rustc` workers
before deleting anything. Prefer workspace-local target dirs under `.codex-tmp`
over new cold `/tmp` targets.
## Immediate Next Steps
1. Confirm no lingering Cargo process:
```bash
pgrep -af "cargo|rustc|cargo-bitcoin-fix"
```
2. Validate the local Bitcoin source fixes listed above.
3. If validation passes, build/deploy the backend to `.198` only after
confirming the user still wants deployment.
4. Recheck live Bitcoin non-destructively:
- `bitcoin-knots.service` active;
- `bitcoin-core.service` inactive;
- listeners on `8332`, `8333`, `8334`;
- Bitcoin UI loads on `8334`;
- `/bitcoin-status` returns useful copy if backend is busy.
5. Resume release backlog:
- rootless Podman lifecycle/control-plane responsiveness;
- My Apps last-known-state truthfulness during scanner backoff;
- progress UX for install/uninstall/start/stop/restart;
- remaining tracker rows in `docs/1.8-alpha-improvements-tracker.md`;
- focused lifecycle matrix on `.198`;
- broad non-destructive lifecycle;
- 3 clean reboot validations minimum, 5 preferred;
- ISO cut and ISO smoke test.
## Cautions For Next Agent
- Do not start `bitcoin-core` on `.198` unless intentionally migrating variants.
- Treat `bitcoin-knots` as the installed Bitcoin variant.
- Do not run broad Podman prune/store cleanup.
- Do not revert unrelated dirty worktree changes.
- `docs/NEXT_TERMINAL_HANDOFF.md` exists but is not the short-term handoff for
this pause.
- Many repo files are dirty from broader release hardening. Read diffs before
attributing changes.

View File

@ -1,144 +0,0 @@
# Handoff — Mesh device rename, mesh routing, duplicate contacts, netbird logout (2026-06-20)
Session is a **test-build iteration toward the 1.8.0 bug-bash release** — sideload patched binaries
to test nodes, NO version bump / NO OTA release (manifest stays `1.7.99-alpha`). Because the version
string never changes, **verify a deploy by sha256-matching the deployed binary**, not by `current_version`.
## Test node roster (creds in the operator's local notes / agent memory — NOT in this repo)
- `.116` 192.168.1.116 — this build host (archi-thinkpad), dev/validation.
- `.198` 192.168.1.198, `.228` 192.168.1.228 — LAN resilience nodes.
- `.5` Tailscale 100.72.136.5 (archy-x250-beta) — **Meshtastic radio**.
- `.120` Tailscale 100.66.157.120 (archy-x250-exp) — **Meshtastic radio**.
- `.89` Tailscale 100.89.209.89 (archy-x250-pa) — **dual radio**: ttyACM0 Meshtastic (probe FAILS),
ttyUSB0 MeshCore (active). Configured device_path = ttyACM0. Runs netbird (v2.38.0).
Deploy driver used this session: `/tmp/archy-deploy/deploy-node.sh <user@host> <pw> <label>`
(scp binary + stream `web/dist/neode-ui` + sudo swap `/usr/local/bin/archipelago`, preserve aiui +
claude-login.html, chown 1000:1000, restart, verify sha256+health). Recreate from this doc if /tmp is gone.
## Deploy state (binary sha) at handoff
- `b5183dfc…` (HEAD d00d1b20, includes Meshtastic rename) → on **.5 and .120** (verified).
- `f702b4f1…` (the 3 wallet/mesh/ui fixes, pre-rename) → on **.116, .198, .228**.
- `7c17a96…` (OLD, pre-f702b4f1) → **.89 is STALE** — update before re-testing .120→.89.
## DONE
1. **Meshtastic device rename → server name** — committed `d00d1b20` (pushed to gitea-vps2/main).
`meshtastic.rs set_advert_name` was a no-op (in-memory only). Now sends
`AdminMessage{set_owner=User{long_name,short_name}}` to the local node on ADMIN_APP port (6),
set_owner field = 32. long_name = server name (≤39), short_name = first 4 alphanumerics upper-cased.
**Hardware-verified**: .120 radio now reads back `Archy-X250-EXP`, .5 reads back `Archy-X250-Beta`.
MeshCore already renamed (CMD_SET_ADVERT_NAME, serial.rs:147) — unchanged, now at parity.
2. **Routing priority confirmed = Mesh → FIPS → Tor**. `send_typed_wire` (mesh/mod.rs:1007): reachable
radio peer → LoRa; federation-synthetic OR (`!reachable && arch_pubkey_hex.is_some()`) → federation.
`send_typed_wire_via_federation` (mod.rs:1124): FIPS first w/ `.fips_timeout(8s)`, Tor fallback.
3. **`.120``.89` "non-delivery" diagnosed — it is NOT a delivery failure.** `.120` sends to .89's
federation contact_id `3027572739`, logs `Federation envelope delivered transport=tor` (gated on
HTTP 2xx, mod.rs:1185). The receiver returns 2xx ONLY after ed25519-verify + successful
`inject_typed_from_federation` (node_message.rs:217-263). Identity matches (.89 pubkey 031875b4…).
`.89``.120` works. So .120's messages ARE injected into .89's state under contact_id
`2679725907` = federation_peer_contact_id(.120 pubkey 535fb91f…), name "Archy-X250-EXP".
It's a **duplicate-contact SURFACING** problem (user confirmed doubles).
## SESSION 3 PROGRESS (2026-06-20 — deployed fleet-wide, binary `e1f2e88`)
- **#5 Arch Mobile messages CONFIRMED FIXED** by the #12 dedup — user verified MeshCore surfaces them.
- **#3 ecash pay-for-file — confirm UI + auto-refund** (`12f54e39`): PeerFiles shows a confirmation
step (amount + which wallet Cashu/Fedimint + balances + switch + styled Confirm); `content.download-peer-paid`
takes `method`, logs the backend+outcome, gives backend-specific rejection errors, and RECLAIMS the
spent token on any failure (fedimint reissue / cashu receive) so funds aren't lost. Root cause of the
user's failed pay: `.198` had no Cashu → spent Fedimint notes → seller `.89` not in the SAME federation
→ rejected → notes stuck (now auto-refunded; old stuck notes auto-return in ~1h via the 3600s spend timeout).
To COMPLETE a fedimint pay, payer+seller must share a federation (or share a Cashu mint w/ balance).
- **#1 companion crash** — added an on-screen red error overlay (`242baf5d`) since chrome://inspect isn't
reachable on the WebView; user reproduces → screenshots the box → that's the real error to fix on.
- **#7 NEW: can't add Fedimint federations on `.116`** — fmcd sidecar crash-loops `Operation not permitted
(os error 1)`, so `:8178` answers HTTP 000 and `wallet.fedimint-join` fails. fmcd WORKS on `.198`/`.89`.
EXHAUSTIVE black-box isolation on `.116` (seccomp default vs unconfined; cap-drop ALL vs caps restored;
fresh data vs a `cp -a` COPY of the real /data; default net vs archy-net; /data 755 vs 777) — **fmcd ran
in EVERY standalone `podman run` config**, including full real security (cap-drop ALL + readonly +
no-new-priv + archy-net + copy of real data). Only the ORCHESTRATOR-created container EPERMs. So:
- **seccomp is NOT the cause** (default-seccomp standalone runs) — the seccomp "fix" was reverted (`63b98599`).
- NOT caps, NOT /data perms/ownership, NOT the existing multimint.db (the copy runs), NOT archy-net.
- The differentiator is something specific to the orchestrator's libpod-API create vs `podman run` that I
did NOT pin (a related symptom: the orchestrator's volume self-heal logs `chown /data: Operation not
permitted` because the container has cap-drop ALL → no CAP_CHOWN). NEXT: create fmcd via the libpod API
socket directly (replicating prod_orchestrator's exact body) to repro outside the orchestrator, then diff.
WORKAROUND for now: **test Fedimint on `.198`/`.89` (working fmcd), not `.116`.** Not the ecash code.
- Deploy: all 6 nodes verified on `e1f2e88`; pushed gitea-vps2 (gitea-local token still 401s).
## SESSION 2 PROGRESS (2026-06-20, code-complete — NOT yet deployed; user held deploy)
All committed to local `main`; NOT pushed to gitea-vps2/origin yet, NOT sideloaded.
- **#12 dup contacts DONE** (`f92e442b`, +3 unit tests pass). Backend `group_peer_twins()`
helper (mesh/mod.rs) dedups by `arch_pubkey_hex`, radio twin = canonical send id, unions
messages; wired into conversations.list/messages + mesh.contacts-list. **KEY FINDING:**
conversations.list/messages have NO frontend consumer — the live chat list renders the
*frontend* merge `mergedPeers` (Mesh.vue), which matched twins by the `Archy-z6Mk…` advert
prefix that the device RENAME broke. Real fix = merge by `arch_pubkey_hex` (now exposed on the
MeshPeer TS type). Should also clear `.120→.89` and likely **#5** (Arch Mobile on .116, same bug).
- **Companion crash diagnostic SHIPPED** (`b3633ec5`): main.ts global handler now shows the REAL
error + keeps a 25-entry `window.__archyErrors` ring buffer + catches async/unhandledrejection.
Still need to deploy + repro on the optiplex node (read `window.__archyErrors` via chrome://inspect)
to get the actual throw. User says LAN/mobile-browser fine → Tailscale-WebView-specific.
- **#3 dual-ecash pay-for-file DONE** (`8f06d88f`, compiles): payer tries Cashu→Fedimint, seller
accepts both (verify_and_receive_payment: non-"cashu" = reissue_into_any), new
fedimint_client::spend_from_any(), wallet.ecash-balance reports total_sats. LIVE federation
validation pending (two nodes sharing a federation).
- **#2 mobile scroll cutoff DONE** (`a8c668ee`): DashboardMobileNav wrote `--mobile-tab-bar-height:0px`
when the bar was hidden/unlaid-out, defeating the `,88px` fallback → bar covered last row. Now never
writes 0 (removes var → fallback), re-measures on rAF + post-WebView-injection. Backup hypothesis if
it persists: `.dashboard-view` is `min-h-screen`(100vh) → mobile-browser toolbar overlap, switch to dvh.
DEPLOYED 2026-06-20 to ALL 6 nodes — binary sha `4a8f2198…` (release build of commit a6957a48 +
this handoff), FE rebuilt, all sha-verified + service active: .116(local) .198 .228 .89 .5 .120.
.5/.120 needed a 30-min timeout (slow DERP). #10 netbird OIDC gate also shipped in this build.
REMAINING VERIFICATION (on real hardware, user-side):
- #12/#5: open mesh chat on .116 (and .89/.120) — confirm a federated node shows ONCE with its
messages (no radio/federation double), and that "Arch Mobile" messages now surface.
- #1 companion crash: open the companion app to the optiplex node over Tailscale, reproduce the
crash, then read the REAL error from `window.__archyErrors` (chrome://inspect the WebView) or the
now-detailed toast. That error is what's needed to write the actual fix. Confirm which node = optiplex.
- #3: pay for a peer file when the buyer's balance is only in Fedimint (needs two nodes in a federation).
- #2: check Cloud/files bottom rows clear the tab bar on mobile browser.
Commits are LOCAL on main (f92e442b/b3633ec5/8f06d88f/a8c668ee/a6957a48 + docs) — NOT pushed to
gitea-vps2/origin (no version bump; bug-bash sideload only).
## TODO (original resume — #12 now DONE above)
### #12 Fix duplicate mesh contacts ← DONE this session (see SESSION 2 PROGRESS)
Root cause: `handle_mesh_contacts_list` (api/rpc/mesh/typed_messages.rs:1126) and
`handle_conversations_list` (api/rpc/mesh/status.rs:89) emit **one row per `state.peers` entry** with
**no cross-transport dedup**. A node can have TWO peers: a radio peer (low contact_id, firmware key)
and a federation peer (high contact_id ≥ 0x8000_0000, archipelago key). `bind_federation_twins`
(mesh/mod.rs:85) correlates them by exact advert_name and copies `arch_pubkey_hex` onto the radio
twin, but LEAVES BOTH ROWS. Messages are keyed by `peer_contact_id` (split across the two ids), so
the federation-injected messages sit on the federation row while the user may open the radio row → empty.
**Design constraint (important):** the two twins have DIFFERENT routing. Collapsing must NOT break
"mesh-first": the canonical SEND contact_id should be the RADIO twin when one exists (so send_typed_wire
routes LoRa-if-reachable, else federation via the bound arch key), else the federation id. The merged
THREAD must union messages from ALL twin contact_ids (group by `arch_pubkey_hex`). Apply the dedup in:
- `handle_conversations_list` (status.rs:89) — one conversation per identity group; last msg = newest across twins.
- `handle_mesh_contacts_list` (typed_messages.rs:1126).
- `handle_conversations_messages` (status.rs ~146) — when asked for a contact_id, resolve its group's
twin ids and filter messages by ANY of them.
Add a shared helper (e.g. group peers by `arch_pubkey_hex` when Some, else singleton by contact_id).
Do NOT merge/re-key at `bind_federation_twins` time — that would force federation routing and break mesh-first.
MeshPeer struct: mesh/types.rs:28 (fields: contact_id, advert_name, did, pubkey_hex, arch_pubkey_hex, reachable…).
**Before testing #12:** update `.89` to the current build (it's on stale 7c17a96), then re-check whether
.120 ("Archy-X250-EXP") shows once with its messages. NB: .89 had 0 journal mentions of "Archy-X250-EXP"
and no radio contact for .120 — so its specific double may be a stale-binary artifact; confirm on fresh build.
### #10 Netbird logout race
Symptom: right after install netbird shows logged-in but can't log out; self-corrects after a while.
Map: install `stacks.rs install_netbird_stack` (~1760-1918): 3 containers (netbird-server :8086, dashboard,
nginx proxy :8087→443 self-signed TLS). `wait_for_stack_containers` waits for "running", NOT OIDC-ready.
Dashboard is netbird's own SPA, opened in a NEW TAB (appLauncher.ts ~52-60, secure-context/crypto.subtle).
Hypothesis: startup race — dashboard loads before netbird-server's OIDC provider is ready, caches a bad auth
state; logout endpoint not ready. Likely fix: gate install completion / launch on netbird-server OIDC
readiness (poll an endpoint) rather than container "running". Repro on `.89` (has netbird running).
Prior note: AccountInfoSection.vue ~602 release note claims a previous unified-origin fix for the 404
logout/login loop — the initial-state race remains.
## Mesh parity directive
MeshCore "works great"; Meshtastic must reach the SAME parity (rename done; duplicate-contact + routing
fallback shared across both). Meshtastic↔MeshCore are INCOMPATIBLE over-the-air, so cross-protocol
federated peers (.120↔.89) rely entirely on the FIPS/Tor fallback.

View File

@ -1,58 +0,0 @@
# Marketplace QA — app-by-app install walk
Purpose: track install/launch/uninstall health for every app in the marketplace catalog on `.228`. User installs each app one by one; for each broken one we triage, fix at the right layer (app recipe / registry image / backend / frontend), commit, redeploy, and re-verify.
Target build: `v1.7.43-alpha` + backend md5 `9b8ead06aaf210b85cd78fce270384e3` (image-versions path fix included).
## Status key
- ✅ install, launch, uninstall all clean
- ⚠️ installs and runs but has cosmetic or partial issues (note in details)
- ❌ broken — fix needed
- ⏳ pending verification
## Catalog
Pull the authoritative list from Marketplace page on `.228` during the walk. Fill in as you go.
| App | Status | Notes / fix applied |
|---|---|---|
| _(to be filled during walk)_ | ⏳ | |
## Known issues going in
- **Vaultwarden** — container exits immediately on start. Pre-existing. Backend async wrapper correctly detects + removes the install state entry. Needs container-config investigation (image pin / env vars / volume layout).
## Fix layers cheat-sheet
When an app breaks, identify which layer to fix at:
1. **App recipe**`apps/<app>/package.yaml` or wherever the Podman manifest lives. Ports, volumes, env vars, healthcheck, resource caps.
2. **Registry image** — if image itself is missing/wrong-tag on `.168`:3000/lfg2025 or `git.tx1138.com`. Push corrected image, bump `scripts/image-versions.sh`.
3. **Backend orchestrator**`core/archipelago/src/container/` or `core/archipelago/src/api/rpc/package/` if the install flow mishandles this app's shape.
4. **Frontend**`neode-ui/src/views/marketplace/` or curated data in `neode-ui/src/views/marketplace/marketplaceData.ts` if catalog entry is wrong or UI can't render this app correctly.
## Per-app fix workflow
For each broken app:
1. Capture failure mode:
```
ssh archy228 'sudo journalctl -u archipelago --since "5 minutes ago" --no-pager | tail -80'
ssh archy228 'podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.Image}}" | grep <app>'
ssh archy228 'podman logs <container-name> 2>&1 | tail -60'
```
2. Diagnose — which layer.
3. Fix in repo (use SSHFS mount for edits).
4. `cargo check` if backend changed; `npm run build` if frontend changed.
5. Commit with `fix(app/<name>): ...` or `fix(registry/<image>): ...` etc.
6. Redeploy as needed (binary via Mac ferry; frontend via rsync; registry via podman push).
7. User re-verifies on `.228`. Mark ✅.
## Release-notes policy
For each app fix, append a bullet to the current in-flight release entry in `neode-ui/src/views/settings/AccountInfoSection.vue`. If the fix pile gets large enough to warrant its own release, bump to v1.7.44-alpha and start a new block at the top. Keep entries operator-focused ("Nostr Relay no longer crashes on first start"), not implementation-focused.
## Running log
_Add dated notes here as we progress through the catalog._

View File

@ -1,476 +0,0 @@
# MASTER PLAN
> Archipelago project task tracking and roadmap.
>
> **BETA FREEZE ACTIVE (2026-03-18)** — No new features. Fix bugs, harden security, test everything.
> Pipeline: **Feature Testing****User Testing** → **Beta Live**
> Progress: `docs/BETA-PROGRESS.md` | Acceptance: `docs/BETA-RELEASE-CHECKLIST.md`
## Roadmap
### Phase 1: Feature Testing (internal) — CURRENT
| ID | Title | Priority | Status | Dependencies |
|----|-------|----------|--------|--------------|
| **FEATURE-4** | **Onboarding loading screen with progress** | **P1** | IN PROGRESS | - |
| **TASK-9** | **Full feature testing sweep** | **P1** | PLANNED | - |
| **TASK-10** | **ISO build verification + multi-hardware test** | **P1** | PLANNED | - |
| **TASK-12** | **Beta telemetry — reporter + toggle + collector POST** | **P1** | IN PROGRESS | - |
| **TASK-39** | **Finish .198 rootless container migration** | **P1** | PLANNED | TASK-11 |
| **TASK-42** | **LUKS2 full-partition encryption for /var/lib/archipelago/** | **P1** | IN PROGRESS | - |
| **TASK-49** | **Container app reliability — bulletproof installs + recovery** | **P0** | PLANNED | - |
| **TASK-50** | **Networking stack: first-install → reboot-proof** | **P0** | IN PROGRESS | - |
| **BUG-44** | **App iframe shows blank/broken when container is starting or crashed** | **P2** | PLANNED | - |
| **TASK-45** | **Deploy script: auto-chown data dirs after rootful→rootless migration** | **P2** | PLANNED | - |
| **BUG-46** | **FileBrowser missing in unbundled ISO + Cloud auto-login broken** | **P1** | IN PROGRESS | - |
| **BUG-47** | **Onboarding: DID sign 403 + blob HTTPS + no password setup** | **P1** | IN PROGRESS | - |
| **FEATURE-48** | **Meshtastic support for mesh (plug and play)** | **P1** | PLANNED | - |
### Phase 2: User Testing (controlled, real hardware)
| ID | Title | Priority | Status | Dependencies |
|----|-------|----------|--------|--------------|
| **TASK-13** | **Recruit 3-5 test users, distribute ISOs** | **P1** | NOT STARTED | Phase 1 complete |
| **TASK-14** | **Monitor telemetry, triage + fix user-reported issues** | **P1** | NOT STARTED | TASK-12, TASK-13 |
| **TASK-15** | **Rebuild ISO with fixes, re-verify** | **P1** | NOT STARTED | TASK-14 |
### Phase 3: Beta Live (public)
| ID | Title | Priority | Status | Dependencies |
|----|-------|----------|--------|--------------|
| **TASK-16** | **Final ISO build + release notes + distribution** | **P1** | NOT STARTED | Phase 2 complete |
### Post-Beta (FROZEN — do not start)
| ID | Title | Priority | Status | Dependencies |
|----|-------|----------|--------|--------------|
| **TASK-2** | **Roll incoming-tx into deploy & ISO** | **P2** | DEFERRED | - |
| **INQUIRY-5** | **Offline balance check via mesh relay** | **P2** | DEFERRED | - |
| **FEATURE-6** | **Watch-only wallet architecture** | **P1** | DEFERRED | - |
| **TASK-7** | **Mesh Bitcoin security hardening** | **P1** | DEFERRED | FEATURE-6 |
| **FEATURE-43** | **P2P encrypted voice/video calling (WebRTC over federation)** | **P1** | DEFERRED | - |
| **FEATURE-48** | **Meshtastic support for mesh (plug and play)** | **P1** | PLANNED | - |
## Active Work
### FEATURE-4: Onboarding loading screen with progress (IN PROGRESS)
**Priority**: P1 — High
**Status**: IN PROGRESS (2026-03-17)
Users hit the onboarding screen before the backend is ready, resulting in "Server is still starting up" errors that block identity creation. The onboarding flow should not begin until the server is fully operational.
**Solution**: Show the existing screensaver as a loading/boot screen with server startup progress. Swap the inner logo for animated pixel art icons (smiley face, Bitcoin logo, etc.) that cycle while services come online. Show progress indicators for each backend service (identity store, container runtime, LND, etc.). Only transition to onboarding once `/health` returns ready.
**Key considerations**:
- Reuse the existing screensaver component as the boot screen
- Animated pixel art icons rotate in the center (smiley, BTC, lightning bolt, etc.)
- Progress bar or status checklist showing which services are ready
- Poll `/health` endpoint for service readiness
- Smooth transition from boot screen → onboarding once all critical services are up
- First-boot vs normal boot: first boot shows onboarding after, normal boot goes to dashboard
**Key files**:
- `neode-ui/src/views/Onboarding.vue` — current onboarding flow
- `neode-ui/src/components/Screensaver.vue` — existing screensaver to repurpose
- `core/archipelago/src/api/rpc/mod.rs` — health endpoint
- `core/archipelago/src/server.rs` — startup sequence and service initialization
**Tasks**:
- [ ] Investigate current health endpoint — what services does it check, what's missing
- [ ] Design boot screen component: screensaver background + animated pixel icons + progress
- [ ] Create pixel art icon set (smiley, BTC, lightning, shield, etc.) as SVG/CSS animations
- [ ] Implement service readiness polling (health check with granular service status)
- [ ] Add backend support for granular startup progress (which services are ready)
- [ ] Build boot screen component with smooth transition to onboarding/dashboard
- [ ] Handle edge cases: very slow starts, partial service failures, timeout fallback
- [ ] Test on fresh ISO install (first-boot scenario)
### TASK-9: Full app testing matrix on fresh install (PLANNED)
**Priority**: P1 — High
**Status**: PLANNED (2026-03-18)
Run through the complete `docs/BETA-RELEASE-CHECKLIST.md` app matrix on a fresh ISO install. Every app: install, launch, UI loads, uninstall. Every dependency chain: correct errors when deps missing.
### TASK-10: ISO build verification + multi-hardware test (PLANNED)
**Priority**: P1 — High
**Status**: PLANNED (2026-03-18)
Build a fresh ISO, install on at least 2 different hardware configurations, verify full onboarding flow, app installs, and multi-day uptime.
---
### TASK-17: Alpha version tags + rollback strategy (PLANNED)
**Priority**: P2 — Medium
**Status**: PLANNED (2026-03-18)
Tag every significant alpha version with git tags for easy rollback. Each tag should correspond to a deployable state. Maintain a version log so any alpha can be rebuilt and deployed.
**Tasks**:
- [ ] Tag current state as `v1.2.0-alpha.1` (pre-rootless-podman)
- [ ] Establish naming convention: `v{major}.{minor}.{patch}-alpha.{build}`
- [ ] Tag after rootless podman migration: `v1.2.0-alpha.2`
- [ ] Document rollback procedure (git checkout tag + deploy)
- [ ] Add version tag step to deploy script (auto-tag on successful deploy)
- [ ] Update CHANGELOG.md with each alpha milestone
---
### TASK-42: LUKS2 full-partition encryption for /var/lib/archipelago/ (IN PROGRESS)
**Priority**: P1 — High
**Status**: IN PROGRESS (2026-03-26)
Encrypt all Archipelago app data at rest using LUKS2 full-partition encryption. Protects Bitcoin wallet data, LND macaroons, FileBrowser files, Vaultwarden vault, secrets, and everything else from physical disk seizure. Seamless UX — user never interacts with encryption directly.
**Design**:
- LUKS2 partition for `/var/lib/archipelago/` created during ISO install
- Cipher: AES-256-XTS (hardware AES-NI on x86_64, ChaCha20 fallback on ARM without AES-NI)
- Key derived from setup password via Argon2id + hardware salt (`/sys/class/dmi/id/product_uuid`)
- Key file stored at `/root/.luks-archipelago.key` (root:600, on boot partition)
- Auto-unlock via `/etc/crypttab` on every boot — no passphrase prompt
- Password change in Settings re-derives key and rotates LUKS keyslot
**Threat model**:
- Disk removed from machine = fully encrypted, unreadable
- Running machine with login = transparent (same as today)
- Forgot password = cannot decrypt (correct sovereign behavior)
**Tasks**:
- [x] ISO installer: create LUKS2 partition, format + mount at `/var/lib/archipelago/`
- [ ] First-boot: derive LUKS key from setup password via Argon2id + hardware salt
- [x] Store key file at `/root/.luks-archipelago.key` with 600 perms
- [x] Configure `/etc/crypttab` for auto-unlock at boot
- [ ] Settings password change: re-derive LUKS key, add new keyslot, remove old
- [x] Detect AES-NI availability, fall back to ChaCha20 on ARM without it
- [ ] Test: fresh install, reboot survives, power-cycle survives, password change works
- [ ] Test: disk removed from machine is unreadable
- [x] Update `image-recipe/build-auto-installer-iso.sh`
**Key files**:
- `image-recipe/build-auto-installer-iso.sh` — partition creation
- `scripts/first-boot-containers.sh` — runs after LUKS mount
- `core/archipelago/src/api/rpc/system.rs` — password change handler
- `core/archipelago/src/server.rs` — startup checks
### TASK-49: Container app reliability — bulletproof installs + recovery (PLANNED)
**Priority**: P0 — Critical
**Status**: PLANNED (2026-03-29)
Every marketplace app must install cleanly, survive failures, auto-recover from unhealthy states, and uninstall without residue. Currently: some apps fail silently, health checks are inconsistent, and there's no systematic testing.
**Scope**: All 25+ marketplace apps — install, health, restart, uninstall, dependency chains.
#### Phase A: Audit & Fix Install Flow (Days 1-2)
Test every app install on a fresh .198 node. Fix failures as found.
- [ ] **A1**: Create install test matrix — spreadsheet of all apps with columns: installs?, starts?, healthy?, UI loads?, uninstalls?, deps correct?
- [ ] **A2**: Test core apps: Bitcoin Knots, LND, Mempool, BTCPay, Electrumx, FileBrowser
- [ ] **A3**: Test recommended apps: Fedimint, Vaultwarden, Grafana, SearXNG, Tailscale, Portainer
- [ ] **A4**: Test optional apps: Home Assistant, Jellyfin, PhotoPrism, Nextcloud, Ollama, Immich, Penpot, OnlyOffice
- [ ] **A5**: Test web-only/L484 apps: noStrudel, BotFights, NWNN, IndeedHub, DWN
- [ ] **A6**: Test Nostr relay (nostr-rs-relay) install + relay functionality
- [ ] **A7**: Fix all install failures found in A2-A6
#### Phase B: Health Checks & Restart Policies (Days 2-3)
Ensure every container has proper health checks and restart policies.
- [ ] **B1**: Audit all container manifests for `--health-cmd`, `--health-interval`, `--health-retries`
- [ ] **B2**: Add health checks to containers missing them (curl endpoint or process check)
- [ ] **B3**: Verify `--restart unless-stopped` on all containers
- [ ] **B4**: Test failure recovery: `podman kill <container>` → verify auto-restart
- [ ] **B5**: Test OOM recovery: set low memory limit → trigger OOM → verify restart
- [ ] **B6**: Verify container-doctor.sh runs on timer and fixes unhealthy containers
- [ ] **B7**: Verify reconcile-containers.sh detects and recreates missing containers
#### Phase C: Dependency Chain Validation (Day 3)
Apps with dependencies (BTCPay→Bitcoin+Postgres, Mempool→Bitcoin+MariaDB) must handle missing deps gracefully.
- [ ] **C1**: Map all dependency chains (which app needs which)
- [ ] **C2**: Test installing dependent app without dependency → verify error message
- [ ] **C3**: Test stopping dependency while dependent is running → verify graceful degradation
- [ ] **C4**: Test restarting dependency → verify dependent reconnects automatically
- [ ] **C5**: Ensure backend `dependency_resolver.rs` handles all chains correctly
#### Phase D: Uninstall & Cleanup (Day 4)
Every app must uninstall cleanly — no orphaned volumes, networks, or config.
- [ ] **D1**: Test uninstall for each app — verify container, volumes, config removed
- [ ] **D2**: Verify no orphaned podman volumes after uninstall (`podman volume ls`)
- [ ] **D3**: Verify no orphaned networks after uninstall
- [ ] **D4**: Test reinstall after uninstall — must work cleanly
- [ ] **D5**: Fix any cleanup issues found
#### Phase E: Stress & Soak Testing (Day 5)
Multi-day uptime test with all core apps running.
- [ ] **E1**: Install all core + recommended apps on .198
- [ ] **E2**: Let run for 24h — check for crashes, memory leaks, disk growth
- [ ] **E3**: Simulate power failure (hard reboot) — verify all apps come back
- [ ] **E4**: Simulate network failure — verify apps recover when network returns
- [ ] **E5**: Run container-doctor after soak test — should report all healthy
#### Phase E2: FileBrowser Auto-Login (Day 5)
FileBrowser must auto-login seamlessly after install — user should never see a separate login screen. Still protected via nginx session cookie validation.
- [ ] **E2a**: Fix FileBrowser auto-login flow: nginx auth_request validates Archipelago session, injects FileBrowser auth token
- [ ] **E2b**: Verify auto-login works on fresh bundled install (first boot)
- [ ] **E2c**: Verify auto-login works on unbundled install (Marketplace install)
- [ ] **E2d**: Verify FileBrowser is NOT accessible without valid Archipelago session (security)
- [ ] **E2e**: Test auto-login after session expiry → re-login to Archipelago → FileBrowser works again
#### Phase F: Frontend UX (Day 5-6)
The UI must accurately reflect container state at all times.
- [ ] **F1**: Installing state persists across navigation (DONE — TASK-49 server store)
- [ ] **F2**: App card shows correct state: stopped, starting, running, unhealthy, crashed
- [ ] **F3**: App iframe shows contextual error when container is down (BUG-44)
- [ ] **F4**: Uninstall progress shown in My Apps
- [ ] **F5**: Error toast when install fails with actionable message
**Key files**:
- `core/archipelago/src/container/` — PodmanClient, manifests, health
- `core/archipelago/src/api/rpc/package/` — install/uninstall RPC handlers
- `scripts/container-doctor.sh` — health check + auto-fix
- `scripts/reconcile-containers.sh` — recreate missing containers
- `scripts/image-versions.sh` — pinned image versions
- `scripts/first-boot-containers.sh` — first-boot container creation
- `neode-ui/src/views/marketplace/` — install UI
- `neode-ui/src/views/apps/` — My Apps state display
**Testing approach**:
- Fresh .198 install as test bed
- SSH in, run installs via web UI, check with `podman ps -a`
- Automated: `scripts/container-doctor.sh --local` after each test
- Manual: kill containers, pull power, break networks, verify recovery
---
### BUG-44: App iframe shows blank/broken when container is starting or crashed (PLANNED)
**Priority**: P2 — Medium
**Status**: PLANNED (2026-03-21)
When an app container is still starting up or has crashed, the iframe overlay shows a blank/broken page with no feedback. Should show contextual loading states:
- **Starting**: skeleton loader or "App is starting up..." with spinner
- **Crashed**: "App has stopped" with restart button and link to logs
- **Port not ready**: "Waiting for app to become available..." with timeout warning
- **X-Frame-Options blocked**: Detect and open in new tab automatically
**Key files**:
- `neode-ui/src/views/AppSession.vue` — iframe container
- `neode-ui/src/stores/appLauncher.ts` — app launch state
- `neode-ui/src/api/container-client.ts` — container status checks
### TASK-45: Deploy script: auto-chown data dirs after rootful→rootless migration (PLANNED)
**Priority**: P2 — Medium
**Status**: PLANNED (2026-03-21)
When `deploy-tailscale.sh` migrates from rootful to rootless Podman, all files in `/var/lib/archipelago/` created by the old root-running backend are owned by `root:root`. The new backend runs as `archipelago` user and can't read them (node-key.pem, credentials, sessions, identity, etc.). Deploy script must auto-detect and fix ownership after migration.
Also fix:
- `/run/user/1000/crun` ownership (left as root from rootful container creation)
- Container recreation needs `--cap-add NET_BIND_SERVICE` for apps binding port 80 (nextcloud)
- Container recreation needs config volume mounts for apps writing to `/etc/` (searxng)
- Frontend should be copied from .228, not built locally (prevents build mismatches)
**Key files**:
- `scripts/deploy-tailscale.sh` — Step 14 (UID mapping) and Step 22 (container creation)
- `scripts/first-boot-containers.sh` — container creation reference
### BUG-46: FileBrowser missing in unbundled ISO + Cloud auto-login broken (IN PROGRESS)
**Priority**: P1 — High
**Status**: IN PROGRESS (2026-03-26)
Two issues with the Cloud feature on fresh installs:
1. **FileBrowser not prepackaged in unbundled ISO** — The unbundled ISO variant doesn't include the FileBrowser container image, so Cloud doesn't work out of the box. FileBrowser is a core dependency (not an optional app) since it powers the Cloud file manager. Must be bundled even in the unbundled variant.
2. **FileBrowser auto-login not working** — The auto-login flow (so users don't need to enter separate FileBrowser credentials) appears broken. Need to investigate whether the auth proxy/token injection is functioning correctly on fresh installs.
**Tasks**:
- [x] Add FileBrowser image to unbundled ISO build (core dependency, always bundled)
- [x] Create minimal first-boot script for unbundled mode (FileBrowser only)
- [x] Fix auto-login: `Secure` cookie flag silently fails on HTTP — made conditional
- [x] Changed `SameSite=Strict` to `SameSite=Lax` for better navigation compatibility
- [ ] Test Cloud feature end-to-end on a fresh install (both bundled and unbundled)
**Key files**:
- `image-recipe/build-auto-installer-iso.sh` — UNBUNDLED container image list
- `scripts/first-boot-containers.sh` — FileBrowser container creation
- `image-recipe/configs/nginx-archipelago.conf` — FileBrowser proxy config
- `neode-ui/src/views/Cloud.vue` — Cloud UI / auto-login logic
### BUG-47: Onboarding: DID sign 403 + blob HTTPS + no password setup (IN PROGRESS)
**Priority**: P1 — High
**Status**: IN PROGRESS (2026-03-26)
Three onboarding issues on clean install:
1. **Sign DID returns 403 Forbidden** — The DID verification/signing step during onboarding fails with a 403 response from the backend.
2. **Blob URL HTTPS warning** — Browser complains about blob URL loaded over insecure connection (`blob:http://...` should be served over HTTPS). Likely related to the backup download on HTTP connections.
3. **No password setup on clean install** — Users cannot set a password during onboarding. The setup password flow is missing or broken.
**Root causes found**:
- `node.did`, `node.signChallenge`, `node.nostr-pubkey`, `node.createBackup`, `identity.verify` were NOT in `UNAUTHENTICATED_METHODS` — onboarding has no session, so they all returned 403
- `auth.setup` and `auth.isSetup` RPC methods were missing from the dispatcher — the frontend called them but no handler existed
- Blob HTTPS warning is a browser security feature on HTTP connections (not a code bug)
**Tasks**:
- [x] Add onboarding methods to UNAUTHENTICATED_METHODS in middleware.rs
- [x] Add `auth.setup` RPC handler (creates user with password, prevents re-setup)
- [x] Add `auth.isSetup` RPC handler (checks if user.json exists)
- [x] Rust compiles clean
- [ ] Blob URL HTTPS warning — known browser limitation on HTTP, no code fix needed
- [ ] Test full onboarding flow end-to-end on fresh ISO
**Key files**:
- `neode-ui/src/views/OnboardingVerify.vue` — DID signing step
- `neode-ui/src/views/OnboardingBackup.vue` — Backup download (blob URL)
- `neode-ui/src/views/OnboardingIntro.vue` — Password setup entry point
- `core/archipelago/src/api/rpc/auth.rs` — Auth RPC endpoints
- `core/archipelago/src/api/rpc/middleware.rs` — Request auth middleware
---
### TASK-50: Networking stack: first-install → reboot-proof (IN PROGRESS)
**Priority**: P0 — Critical
**Status**: IN PROGRESS (2026-04-08)
Every networking service must work from first install, survive reboots, and never go down. Covers the full stack: WireGuard (traditional peer VPN), NostrVPN (mesh VPN), Tor, Tor hidden services, Tor Electrum, and LND Connect wallet.
**Why**: These are the sovereignty backbone — if any of them fail silently after a reboot or fresh install, the node is useless as a self-sovereign server. Users shouldn't need to SSH in to fix networking.
**Services**:
- **WireGuard** (port 51820) — traditional peer VPN for direct connections
- **NostrVPN** (port 51821) — mesh VPN with Nostr identity, `nvpn` daemon
- **nostr-rs-relay** (port 7777) — private relay for NostrVPN signaling + general use
- **Tor** — SOCKS proxy + hidden services for all apps
- **Tor hidden services** — .onion addresses for node access without public IP
- **Tor Electrum** — Electrum server accessible over Tor
- **LND Connect** — wallet connect URIs over Tor for mobile wallets
**Tasks**:
- [x] NostrVPN systemd service (`nostr-vpn.service`) — enabled, reboot-proof
- [x] WireGuard interface (`wg0`) — configured, auto-start
- [ ] Build nvpn v0.3.7 from source (fixes event processing bug in v0.3.4)
- [ ] Verify NostrVPN mesh forms between server and phone after v0.3.7 upgrade
- [ ] nostr-rs-relay service — systemd unit, auto-start, in-memory mode
- [ ] Each node runs its own relay on port 7777
- [ ] Tor service — systemd, auto-start, SOCKS on 9050
- [ ] Tor hidden services — auto-generate .onion for web UI, LND, Electrum
- [ ] Nodes without public IP use Tor hidden service as relay endpoint
- [ ] Tor Electrum — Electrumx/Fulcrum accessible over .onion
- [ ] LND Connect — generate wallet connect URI over Tor
- [ ] Show relay URLs in VPN card UI
- [ ] ISO first-boot: all networking services configured and started automatically
- [ ] Reboot test: power cycle → all services come back without intervention
- [ ] Fresh install test: ISO → boot → all networking operational
**Key files**:
- `/etc/systemd/system/nostr-vpn.service` — NostrVPN daemon
- `/var/lib/archipelago/nostr-vpn/.config/nvpn/config.toml` — nvpn config
- `image-recipe/configs/nginx-archipelago.conf` — proxy rules
- `scripts/first-boot-containers.sh` — first-boot service setup
- `scripts/image-versions.sh` — pinned versions
- `neode-ui/src/views/apps/VpnCard.vue` — VPN UI card
- `core/archipelago/src/vpn.rs` — VPN status backend
---
## Post-Beta (FROZEN)
*These tasks are deferred until after beta ships. Do not start.*
- **INQUIRY-5**: Offline balance check via mesh relay
- **FEATURE-6**: Watch-only wallet architecture
- **TASK-7**: Mesh Bitcoin security hardening
- **TASK-2**: Roll incoming-tx into deploy & ISO
- **FEATURE-43**: P2P encrypted voice/video calling (WebRTC over federation)
---
### FEATURE-43: P2P encrypted voice/video calling — WebRTC over federation (DEFERRED)
**Priority**: P1 — High
**Status**: DEFERRED (post-beta)
Self-sovereign encrypted voice and video calling between Archipelago peers. Zero new containers or dependencies — uses browser-native WebRTC with signaling over the existing federation WebSocket. Integrates directly into peer tabs/chat.
**Security & Privacy**:
- All media encrypted via DTLS/SRTP (WebRTC mandatory encryption — no opt-out)
- Signaling (SDP offers, ICE candidates) transmitted over existing federation WebSocket through Tor
- ICE candidate filtering: strip local/public IP candidates in Tor-relay mode
- No central server, no metadata leakage — true P2P between browsers
- Two privacy modes:
- **LAN Direct**: <50ms latency, IPs visible to peer (trusted same-network peers)
- **Tor Relay**: 300-800ms latency, full anonymity via coturn TURN server on .onion
**Architecture**:
- Signaling reuses existing federation WebSocket — new message types: `call-offer`, `call-answer`, `call-ice`, `call-hangup`, `call-reject`, `call-busy`
- Browser `getUserMedia()` + `RTCPeerConnection` — no backend media processing
- Opus codec for voice (~30kbps, handles Tor latency well)
- VP8/VP9 adaptive bitrate for video (720p on LAN, degrades gracefully)
- Optional `coturn` container (~10MB RAM) for Tor-relay media mode only
**UX**:
- Voice and video call buttons in peer chat (federation contacts)
- Incoming call: glass modal slides up with peer name + avatar, accept/decline
- In-call: floating glass PIP overlay — navigate while talking
- One-tap mute, camera toggle, speaker toggle, hangup
- Call quality indicator (green/yellow/red based on RTT)
- Ring timeout (30s) → missed call notification
- Call history in peer chat thread
**Tasks**:
- [ ] `CallService.ts` — WebRTC wrapper (offer/answer, ICE management, stream handling, codec negotiation)
- [ ] Federation signaling protocol — new message types over existing WS (`call-offer`, `call-answer`, `call-ice`, `call-hangup`)
- [ ] Rust backend — relay call signaling messages between federation peers (pass-through, no media processing)
- [ ] ICE candidate filtering — strip public IPs in privacy mode, force relay-only
- [ ] `CallOverlay.vue` — incoming call modal (glass aesthetic, ring animation, accept/decline)
- [ ] `CallPIP.vue` — floating picture-in-picture during active call (draggable, minimize/expand)
- [ ] `CallControls.vue` — mute, camera toggle, speaker, hangup, privacy mode switch
- [ ] Voice-only mode — Opus codec, bandwidth-optimized, Tor-friendly
- [ ] Video mode — VP8/VP9 adaptive bitrate, resolution scaling based on connection quality
- [ ] Optional `coturn` container manifest — TURN relay for Tor-routed media
- [ ] Call quality monitoring — RTT measurement, packet loss detection, quality indicator
- [ ] Call history — persist in peer chat thread, missed call notifications
- [ ] Multi-peer consideration — design for 1:1 first, extensible to group calls later
- [ ] Test: LAN direct call (voice + video)
- [ ] Test: Tor relay call (voice — verify latency is acceptable)
- [ ] Test: call during active chat, call while navigating other views
- [ ] Test: network interruption recovery (ICE restart)
**Key files** (new):
- `neode-ui/src/services/CallService.ts` — WebRTC engine
- `neode-ui/src/components/call/CallOverlay.vue` — incoming call UI
- `neode-ui/src/components/call/CallPIP.vue` — in-call floating overlay
- `neode-ui/src/components/call/CallControls.vue` — call action buttons
- `apps/coturn/manifest.yml` — optional TURN relay container
**Key files** (modified):
- `neode-ui/src/views/Federation.vue` — call buttons in peer chat
- `core/archipelago/src/api/rpc/federation.rs` — call signaling relay
- `neode-ui/src/stores/federation.ts` — call state management
## Completed
| ID | Title | Completed |
|----|-------|-----------|
| **TASK-11** | Rootless podman migration (.228 — 30 containers) | 2026-03-18 |
| **TASK-32** | Integrate boot loader into deploy + build + production | 2026-03-17 |
| **TASK-34** | Pentest findings remediation plan | 2026-03-18 |
| **TASK-26** | Rename fedimintd to "Fedimint Guardian" + icon | 2026-03-18 |
| **TASK-27** | Add tab-launch icon to apps that open in tabs | 2026-03-18 |
| **TASK-28** | Sort installed apps to end of marketplace | 2026-03-18 |
| **TASK-29** | Fix mesh mobile: remove title/flash/peers header, fix gutters | 2026-03-18 |
| **TASK-30** | On-Chain as first tab in receive Bitcoin modals | 2026-03-18 |
| **TASK-35** | Federation node names (show name not DID, hover for key) | 2026-03-18 |
| **TASK-36** | Cleaner iframe error screen with remediation | 2026-03-18 |
| **BUG-1** | Random logout / CSRF mismatch — HMAC-derived tokens | 2026-03-18 |
| **TASK-8** | Security hardening — 12/12 pentest findings fixed | 2026-03-18 |
| **BUG-20** | ElectrumX index estimate string ~55→~130 GB | 2026-03-18 |
| **BUG-37** | App card Start/Launch flicker during container scan | 2026-03-18 |
| **BUG-40** | Uninstall dialog not full-screen modal | 2026-03-18 |
| **BUG-41** | Uninstall loader ends but app card persists | 2026-03-18 |
| **BUG-33** | CPU load alert threshold too low (8 = 2x cores) | 2026-03-18 |
| **TASK-31** | Sticky nav header (Apps page) | 2026-03-18 |
| **TASK-38** | Blockchain sync info on homepage System card | 2026-03-18 |
| **TASK-17** | Alpha version tags + deploy auto-tag | 2026-03-18 |
| **BUG-3** | IndeedHub WebSocket spam — removed dead nostrConfig | 2026-03-18 |

View File

@ -1,252 +0,0 @@
# Migration Status Report
Last updated: 2026-06-14
## RESUME CHECKPOINT (2026-06-14, after SSH drop)
State right now, so any disconnect resumes cleanly:
- **`main` = `a483fe4b`** = the other agent's 4 fixes (`0ed892a4`: wallet receive / bitcoin
install self-heal / ElectrumX tile / extended test gate) + **my F1 fix committed on top**
(`launch_url_port` in `docker_packages.rs` + 3 regression tests). Tree is clean (only two
untracked `docs/*.md` tracking files remain). Not pushed.
- The old isolated `archy-f1` worktree was **removed** — built the combined tree in-place.
- ✅ **DONE — combined backend release build** (`cd core && TMPDIR=/home/archipelago/.buildtmp
cargo build --release -p archipelago`, 7m46s, exit 0). `/tmp` is a full tmpfs so `TMPDIR`
MUST point at `/home/archipelago/.buildtmp`.
- ✅ **DONE — sideloaded + restarted on `.116`.** Backed up old binary to
`/usr/local/bin/archipelago.pre-f1.bak`, `install`ed new binary (root:root 755),
`sudo systemctl restart archipelago` (new MainPID 2885863).
- ✅ **F1 VALIDATED LIVE on `.116` (2026-06-14).** See "FINDING F1" below — before/after proves
the fix. Harness focused audit `jellyfin,filebrowser`**all checks passed, exit 0**.
- **IMPORTANT — restart is SAFE on this node:** containers run rootless under
`user-1000.slice/user@1000.service/app.slice`, a DIFFERENT cgroup from
`/system.slice/archipelago.service`. They survived both the 01:47 and this restart
(bitcoin/lnd/btcpay/immich/indeedhub all intact, count stayed 36). The
`feedback_no_systemctl_deploy_until_quadlet` cgroup-cascade warning does NOT apply to `.116`'s
current config. (The reconciler does recreate a few app containers like jellyfin/fedimint on
adoption — normal level-triggered behavior, not casualties.)
- **RELEASE IN PROGRESS — v1.7.91-alpha (user approved 2026-06-14).** Bundles the other agent's
4 fixes (`0ed892a4`) + F1 (`a483fe4b`) + changelog (`ab858271`). Steps:
1. ✅ Freed `/tmp` (removed stale published frontend tarballs 1.7.83→1.7.89; ~1.1G free) —
`create-release.sh` writes the 184MB frontend tarball to `/tmp` (hardcoded, NOT TMPDIR).
2. ✅ `cargo fmt -p archipelago --check` clean; curated layman changelog added + committed.
3. 🔄 `TMPDIR=/home/archipelago/.buildtmp scripts/create-release.sh 1.7.91-alpha`
(runs `tests/release/run.sh` gate → bumps Cargo.toml/package.json → builds backend+frontend
→ manifest → commit "chore: release v1.7.91-alpha" → tag `v1.7.91-alpha`). MUST set TMPDIR
or cargo's ring C-build fails on the full `/tmp` tmpfs.
- **AFTER create-release.sh:** `scripts/publish-release-assets.sh 1.7.91-alpha gitea-vps2`
`git push origin main && git push gitea-local main``git push --tags` (origin+gitea-local).
Ship target per memory: vps2 (146.59.87.168) is PRIMARY OTA manifest; tx1138 RETIRED.
- Verify packaged tarball actually contains the new version string before trusting the build
(npm run build can silently produce stale dist — see `feedback_frontend_build_verify`).
## Validation node (ACTIVE)
As of 2026-06-14 the app-migration lifecycle validation moves from `.198` (remote, OVH) to
**`.116` — the local dev node (`archi-thinkpad`, `192.168.1.116`)** because it is the machine
this session runs on, so the harness drives it over loopback instead of SSH (much faster, no
network latency). A separate agent owns OS-level fixes + its own test harness; this track owns
the **app-packaging migration** lifecycle validation only.
How to drive the harness against `.116` (local):
```bash
ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=http ARCHY_PASSWORD='ThisIsWeb54321@' \
ARCHY_APPS='meshtastic,jellyfin,filebrowser,uptime-kuma' \
tests/lifecycle/remote-lifecycle.sh # focused, audit-only (non-destructive)
```
- `.116` serves nginx on **:80 only** (443 is tailscale's) → use `ARCHY_SCHEME=http`, `ARCHY_HOST=127.0.0.1`.
- Local node is healthy: `update_state.json.current_version == 1.7.90-alpha`, `update_in_progress=false`
(the OTA self-heal that was a follow-up gap in PROGRESS_MEMORY is now confirmed resolved on .116).
- Login password for `.116`: `ThisIsWeb54321@` (verified against `auth.login`). Note: auth.login
has a login rate-limiter — avoid rapid repeated attempts.
- `.198` results below remain the prior baseline; new results are tagged `[.116]`.
### [.116] audit log (newest first)
- **2026-06-14 — focused audit `meshtastic,jellyfin,filebrowser,uptime-kuma` (audit-only, non-destructive):**
harness exit 1, FAILED checks: 1.
- `filebrowser` — running, pass (also passed a standalone single-app smoke run).
- `uptime-kuma` — running, pass.
- `meshtastic``state=absent`. Not installed on `.116` (was installed/validated on `.198`).
Not a regression; just node state. To exercise meshtastic here, install it first (it needs
`/dev/ttyUSB0`, which `.116` may not have) or drop it from the focused set on this node.
- `jellyfin` — **running but FAILED: "launch metadata missing: jellyfin has no lan_address".**
**ROOT-CAUSED 2026-06-14 — real, current bug in the working tree (a regression).** See
"FINDING F1" below.
### [.116] FINDING F1 — manifest launch URLs with a path are silently dropped (OPEN, fix pending)
**Symptom:** `jellyfin` is `running` and genuinely serving (`curl 127.0.0.1:8096/` → 302), but
`container-list` reports `lan_address: null`, so the UI/harness sees no launch URL.
**Root cause:** `core/archipelago/src/container/docker_packages.rs::reachable_lan_address()` parses
the port out of the candidate URL with `url.rsplit(':').next()`. When the candidate comes from the
manifest `interfaces.main` (via `PodmanClient::lan_address_for`
`core/container/src/podman_client.rs::manifest_primary_interface_url`), the URL **includes the
manifest `path`** — e.g. jellyfin → `http://localhost:8096/`. Then `rsplit(':').next()` yields
`"8096/"`, which **fails to `parse::<u16>()`**, so the function hits its `else { return None }`
branch and drops a perfectly reachable launch URL. (Diagnostic tell: the dropped-at-parse path
emits **no** log, whereas a genuine unreachable port logs "suppressing unreachable launch URL".
jellyfin has no such log; uptime-kuma — whose candidate `…:3002` has no path — does.)
**Why it's a regression:** the old `extract_lan_address(ports)` produced `http://localhost:PORT`
(no path), which parsed fine. The newer manifest-interface feature appends the declared `path`,
so any app routed through `lan_address_for` now yields `…:PORT/` and trips the parser.
**Blast radius (apps in `requires_reachable_launch` whose `interfaces.main.path` = `/`):**
`botfights`, `btcpay-server`, `fedimint`, `jellyfin`, `gitea`, `nextcloud`, `portainer`.
(`filebrowser`/`nextcloud`/`nginx-proxy-manager`/`vaultwarden` are in `uses_allocated_launch_port`
so they hit `extract_lan_address` first and dodge it; `grafana`/`mempool`/`uptime-kuma`/`searxng`
have no manifest `interfaces.main` path.) On `.198` this likely went unnoticed because those apps
weren't all running during the launch-metadata assertion, or predated the interfaces.main addition.
**Fix (IMPLEMENTED in working tree, uncommitted):**
`docker_packages.rs::reachable_lan_address` now parses the port via a new `launch_url_port()`
helper that reads digits after the final colon (`take_while(is_ascii_digit)`), mirroring the
RPC-layer `port_from_url`, so `http://localhost:8096/``Some(8096)`. Added unit tests
(`launch_url_port_tests`) covering the trailing-path regression, the bare-authority case, and a
no-port reject. The existing `lan_address_prefers_manifest_main_interface` test only exercised
`lan_address_for` (which always returned `…:8175/`) and never the `reachable_lan_address` wrapper,
which is why the bug slipped through.
**Unit validation: GREEN (2026-06-14).** `cargo test -p archipelago --bin archipelago launch_url_port`
→ 3 passed / 0 failed (trailing-path, bare-authority, no-port-reject); crate compiles clean.
**Coordination note (shared tree):** the repo is on branch `fix/wallet-receive-portdrift-secrets`
at commit `bb808df8` (= the deployed 1.7.90-alpha). A parallel agent has uncommitted changes here
(lnd `wallet.rs`, `bitcoin_relay.rs`, `prod_orchestrator.rs`, electrumx manifest, neode-ui, new
bats). To validate F1 in isolation (and NOT deploy their in-flight work onto the live node, nor
disturb their tree), the live-validation build is done in a detached git worktree at
`/home/archipelago/archy-f1` = clean `bb808df8` + only the F1 `docker_packages.rs` change. Build:
`cd /home/archipelago/archy-f1/core && TMPDIR=/home/archipelago/.buildtmp cargo build --release -p archipelago`
(`.116`'s `/tmp` is a 7.7G tmpfs that runs 100% full → the ring crate's C compile fails with
"No space left on device"; redirect `TMPDIR` to `/` which has ~399G). After validation the
worktree is removed (`git worktree remove`). NOTE: sideloading replaces the OTA-managed
`/usr/local/bin/archipelago` with a local 1.7.90-alpha+F1 build until the next OTA — back up the
current binary first (`/usr/local/bin/archipelago.pre-f1.bak`).
**Live validation status — ✅ GREEN on `.116` (2026-06-14).** Built combined tree (`a483fe4b`),
sideloaded, restarted `archipelago.service`. Before/after on the live node (old buggy binary → new):
| app | OLD lan_address | NEW lan_address |
|---|---|---|
| jellyfin | `None` ❌ | `http://localhost:8096/` ✅ |
| btcpay-server | `None` ❌ | `http://localhost:23000/` ✅ |
| fedimint | `None` ❌ | `http://localhost:8175/` ✅ |
| gitea | `None` ❌ | `http://localhost:3001/` ✅ |
| portainer | `None` ❌ | `http://localhost:9000/` ✅ |
| botfights | `None` ❌ | `http://localhost:9100/` ✅ |
| nextcloud | `:8085` ✓ | `:8085` (unchanged — allocated-port path) |
| filebrowser | `:8083` ✓ | `:8083` (unchanged) |
Harness focused audit `jellyfin,filebrowser`**all checks passed, exit 0**. Unit tests green.
No container casualties (all 36 survived; see RESUME CHECKPOINT for the cgroup detail).
NOTE: Do NOT run the prod binary directly to "check a version" —
`/usr/local/bin/archipelago <anyflag>` boots a whole second node instance (learned the hard way
2026-06-14; it exited without leaving a stray, but don't repeat).
## Goal
Make Archipelago's app/container system developer-ready and release-ready: app installs, lifecycle, recovery, and integrations should be portable, manifest-driven, and not rely on one-off OS-level changes or hardcoded Rust branches for each new app. The OS/backend should provide generic primitives for manifests, Quadlet rendering, lifecycle, health/readiness, dependency ordering, data ownership, image availability, bind mounts, secrets, app files, networking, bridge/signer integrations, and recovery.
The developer contract should be clear enough that a third-party developer can build and ship an Archipelago app from documentation plus manifest/schema examples. If an app needs a capability the platform does not yet expose, the release direction is to add a reusable manifest/orchestrator primitive rather than a special case tied to that app. This is the standard for the `1.8-alpha` app migration: professional app delivery, predictable behavior after restart/reboot, and a path for user-installed/community apps that does not require rebuilding the OS image for every app.
Release quality bar: every supported app must install, stop, start, restart, uninstall, survive host reboot, report accurate status, and expose clear install/uninstall progress. Stale health notifications must not persist across login or refresh after the underlying condition has cleared. Final release validation should run on the intended release validation server, not drift between appliances without an explicit checkpoint.
Target release: `1.8-alpha`, including a cut and smoke-tested ISO once validation is green.
Current release readiness estimate: about `82%`. The remaining percentage is mostly post-reboot recovery confidence, repeated reboot validation, and ISO creation/smoke testing rather than the core manifest/catalog migration itself.
## Current Result
- The migration is not final-release complete yet, but the core direction is being met.
- Portainer, Filebrowser, BTCPay, Grafana, Nostr Relay, SearXNG, Gitea, and key dependency units have moved further into the manifest/orchestrator path.
- `.198` has passed focused and broad lifecycle audits for the already migrated set.
- Meshtastic is now routed through the orchestrator path, no longer falls back to legacy `localhost/meshtastic:latest`, and has passed full lifecycle validation on `.198`.
- On 2026-06-02, focused and broad `.198` non-destructive lifecycle audits passed after clearing a wedged `nextcloud` Podman record. The live registry config already has OVH primary plus tx1138 mirror, and Meshtastic/Portainer were added to the catalog surfaces.
- Later on 2026-06-02, the current release backend hash `579b823cf4a4b8c50bb3d0c3d49449c58101b016eb6ebc8049975dce98e34265` was found active and stable on `.198`. Meshtastic `app.files` rendering was proven live by removing `/var/lib/archipelago/meshtastic/config.yaml`, restarting through `package.restart`, and verifying the manifest recreated the file. Focused Meshtastic, focused `meshtastic,jellyfin,filebrowser`, and broad non-destructive audits all passed afterward; raw Podman sweep was clean.
- The remaining release gate was continued on 2026-06-02: bounded disk cleanup, journal retention, backend-backup retention, and release-focused catalog drift classification were added. `.198` is active on backend hash `e285d421cef497beb6b4b929f36fb4296d6db1f4a4c786157b6751eec51619ca`; focused and broad post-cleanup lifecycle audits passed, and final raw Podman sweep was clean.
- Follow-up found Podman store commands can hang on `.198` beyond image prune (`podman system df`, image list/exists, and sometimes broad ps/inspect). The release cleanup path now skips Podman image/volume prune rather than touching that unstable path. `.198` is active on backend hash `c9695dc3db10ff6e593cdbcfbbdc94b2e98b6008aa62655bba51b9879b549e8c`; Uptime Kuma was repaired with a normal `package.restart`; focused and broad post-repair lifecycle audits passed, and final raw bad-state sweep was clean.
- On 2026-06-03, startup/adoption scanner hardening and pasta restart repair were deployed. `.198` is active on backend hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28`; `package.restart` for Uptime Kuma now returns successfully and restores the `3002` pasta listener; focused `meshtastic,jellyfin,filebrowser,uptime-kuma` and broad lifecycle audits passed.
- Later on 2026-06-03, expanded rollback cleanup and store-safe uninstall hardening were deployed. `.198` is active on backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b`; `system.disk-cleanup` reclaimed `10.3 GB` from old backend and web UI rollback artifacts while still skipping Podman prune, and focused `meshtastic,jellyfin,filebrowser,uptime-kuma` lifecycle passed afterward.
- Latest 2026-06-03 follow-up deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e`. It mitigates stale cached `container-list` state during Podman scan backoff, adds a bounded TCP reachability fallback for `container-health`, and adds Jellyfin `8096` to legacy pasta host-listener repair. Focused `meshtastic,jellyfin,filebrowser,uptime-kuma` lifecycle passed on this hash. Broad lifecycle still needs rerun on this latest hash.
- Current validation backend hash is `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b`. It keeps the generic host-listener health direction, preserves the `container-health` fallback fix from `be95ea...`, hardens fresh local-build installs so `podman image exists <local-build-tag>` failures/timeouts rebuild instead of failing the lifecycle operation, and reduces duplicated legacy runtime port repair by deriving host ports from manifests. Targeted PhotoPrism and broad non-destructive `.198` lifecycle audits passed on this hash.
- Catalog metadata generation from manifests is now implemented via `scripts/generate-app-catalog.py`. The canonical catalog and UI public catalog are synced from manifest-owned fields, strict release drift is zero, and frontend build validation passed.
- Current live `.198` validation backend hash is `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`. Broad non-destructive lifecycle is green on that deployed line after app health/port recovery, IndeedHub recovery, scoped legacy install hardening, and bounded Podman pull hardening.
- Local release validation now passes the full backend binary test target and every Rust workspace member after release cleanup fixes for scanner backoff wakeups, crash-recovery tests, manifest-port lookup, journal parsing, and boot-reconciler test determinism.
- Frontend release validation now passes `npm run type-check`, `npm test` (`548` tests), and `npm run build` after fixing mobile app-launch routing for new-tab apps and updating stale launch tests. Local `npm ci` is blocked by root-owned `neode-ui/node_modules` entries, so dependency reinstall remains a local environment cleanup item requiring explicit approval.
- Reboot validation is not yet green. User reported that a reboot test left IndeeHub stopped afterward, with multiple containers killed by SIGKILL during shutdown/reboot and at least one crash. Treat post-reboot recovery as the active release blocker.
- Local follow-up now hardens IndeeHub stack boot recovery and updates lifecycle validation so IndeeHub must still serve the Nostr signer bridge (`/nostr-provider.js`) before a launch probe passes.
## Completed In This Pass
- Pause checkpoint for resume: generated app-session metadata now covers manifest-owned launch ports, titles, and new-tab behavior. The next migration step should continue from proxy path/companion UI alias generation or return to the release blocker around post-reboot IndeeHub recovery.
- Updated `docs/APP-PACKAGING-MIGRATION-PLAN.md` to reflect the current `apps/<app-id>/manifest.yml` contract, replacing stale `archy-app.yml` next-step language with the actual parser/generator/orchestrator progress and the remaining migration blockers.
- Updated `docs/app-developer-guide.md` so developers see the current manifest fields, generated catalog flow, validation commands, and release lifecycle expectations instead of the older Nostr marketplace publish/trust-score draft.
- Verified the developer-guide manifest example parses as YAML, `scripts/generate-app-catalog.py` is idempotent, strict release catalog drift remains zero, and `git diff --check` is clean for the migration docs.
- Extended `scripts/generate-app-catalog.py` to also emit `neode-ui/src/views/appSession/generatedAppSessionConfig.ts` from manifests, and wired `appSessionConfig.ts` to merge generated launch ports/titles/new-tab launch behavior with the existing manual overrides for companion UIs and aliases.
- Added a Fedimint `interfaces.main` launch declaration for the Guardian wait/proxy UI on port `8175`, so that public launch surface is now represented in the manifest.
- Focused validation passed for the generated app-session path: Python helper compile, generator idempotence, strict catalog drift, `appSessionConfig.test.ts`, and frontend type-check.
- Aligned `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` with the current manifest/runtime contract so the release docs no longer describe the stale marketplace-style schema.
- Removed the hardcoded Portainer host-prep path and replaced it with a manifest plus generic Podman socket bind-mount preparation.
- Added generic Quadlet health drift detection for command, interval, timeout, and retry changes.
- Made rendered HTTP health helpers honor manifest timeouts.
- Added image availability guards before Quadlet starts/restarts so pruned images are pulled or built before systemd tries to start them.
- Fixed stale dependency handling so active manifest dependencies are not suppressed by old `user-stopped.json` entries.
- Added parent-app reconcile syncing for dependency Quadlet units.
- Validated Portainer, Filebrowser, BTCPay, and broad non-destructive audits on `.198`.
- Updated Meshtastic manifest to use a real available image, the real `/dev/ttyUSB0` device, the actual daemon data path, and a non-HTTP health check.
- Updated the lifecycle harness so non-HTTP apps do not require launch metadata.
- Added a generic manifest-owned file rendering primitive under `app.files` so apps can declare required bind-mounted config files without adding app-specific Rust/OS branches.
## Current `.198` State
- `archipelago.service`: active.
- `archipelago-doctor.timer`: inactive.
- `archipelago-reconcile.timer`: inactive.
- Current validation backend hash: `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
- `.198` root filesystem pressure is currently resolved for release validation: latest sweep showed `/` at 65% used with about 9.6G free after expanded rollback cleanup.
- Latest focused Fedimint, Immich, IndeedHub, and PhotoPrism audits passed on the current hash.
- Broad non-destructive lifecycle passed on the current hash before and after backend restart validation.
## Meshtastic Status
- Orchestrator routing is fixed and verified by the generated Quadlet unit.
- Current generated unit uses:
- `Image=docker.io/meshtastic/meshtasticd:daily-alpine`
- `Volume=/var/lib/archipelago/meshtastic:/var/lib/meshtasticd:Z`
- `AddDevice=/dev/ttyUSB0`
- `HealthCmd=test -f /var/lib/meshtasticd/config.yaml`
- The daemon starts and accepts TCP API connections on port `4403`.
- Full lifecycle passed on `.198`: install, stop, start, restart, uninstall with preserved data, and reinstall.
- A persisted `config.yaml` is required. The release path is now the generic `app.files` manifest primitive rather than a Meshtastic-specific backend hook, and this has been verified live on `.198` by deleting the file and proving `package.restart` recreates it from the manifest.
## Release Blockers
- Continue monitoring the current optimized release backend on `.198`; the previously observed release-binary segfault is not reproducing with hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
- `system.disk-cleanup` now handles journal, backend-backup, legacy backend rollback, and web UI rollback retention while intentionally skipping Podman image/volume prune because Podman store commands can hang on `.198` under current load. Diagnose Podman store health separately from the release cleanup path.
- Release image probes have been further quarantined from the fragile Podman store commands and deployed to `.198` on backend hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`: runtime, legacy install, and companion image checks now use bounded targeted `podman image inspect` instead of `podman image exists` or `podman images -q`. Focused and broad non-destructive lifecycle validation passed on the deployed hash.
- Podman socket/runtime health remains a release blocker: `package.restart jellyfin` stopped the container but failed to complete because Podman reported `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`; `package.start jellyfin` recovered the app and the focused lifecycle passed afterward.
- Release-focused catalog drift now has zero missing catalog/manifest entries and zero metadata drift after generating catalog metadata from manifests.
- Backend-restart validation passed. Host-reboot validation is currently failed/pending due to post-reboot IndeeHub recovery. Reboot retests should run only after an explicit release checkpoint/approval.
- Local code-review/refactor cleanup gate has full local validation coverage now:
- `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` passed (`688` tests);
- all other workspace packages check/test clean;
- frontend type-check/tests/build passed;
- release build, catalog drift, catalog idempotence, Python helper compile, and whitespace checks passed.
- Before `1.8-alpha` release:
- deploy the post-reboot recovery fixes;
- prove focused IndeeHub lifecycle with Nostr signer injection intact;
- update the app packaging/developer docs so `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` match the current manifest/runtime contract and release-quality lifecycle expectations;
- complete the required refactor/remove-dead-code gate after correctness validation: remove obsolete transitional code, stale per-app hacks, duplicate lifecycle paths, and misleading compatibility fallbacks, then rerun release validation;
- require at least 3 consecutive clean post-fix reboots with broad non-destructive lifecycle green after each;
- prefer 5 consecutive clean reboots for production-release confidence;
- cut and smoke-test the `1.8-alpha` ISO.
## Bottom Line
We are working toward the intended goal: better than Umbrel/StartOS by making app behavior declarative and registry/manifest-owned. The migration is substantially advanced, Meshtastic manifest-owned config generation is verified live, catalog metadata is generated from manifests, disk cleanup/backup retention is in place without Podman prune risk, and full local backend/frontend workspace validation has been green. Remaining follow-up for `1.8-alpha` is post-reboot recovery validation, especially IndeeHub plus Nostr signer behavior, repeated reboot passes, ISO cut/smoke test, separate Podman socket/store-health diagnosis, and optional local cleanup of root-owned frontend dependencies before rerunning `npm ci`.

View File

@ -1,572 +0,0 @@
# Next Terminal Handoff - Archipelago `1.8-alpha`
Last updated: 2026-06-11 00:17 America/New_York
## Resume Prompt
Paste this into the next terminal/session:
> Continue Archipelago `1.8-alpha` release hardening from `/home/archipelago/Projects/archy`. First read `docs/NEXT_TERMINAL_HANDOFF.md`, then `docs/RESUME.md`, `docs/CONTAINER_LIFECYCLE_HANDOFF.md`, `docs/MIGRATION_STATUS_REPORT.md`, and `docs/1.8-alpha-improvements-tracker.md`. Active validation node is `.198` at `192.168.1.198` with user `archipelago` and password `password123`. Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive for deterministic validation. Do not run broad Podman store/image cleanup commands on `.198` (`podman prune`, `podman image list`, `podman system df`, broad image-exists/list/store-wide cleanup); the store/control path is known to hang under load. Preserve app data. Latest deployed backend hash on `.198` is `159e0daf13fca2df7e831122cb0e6c84223a7e5b7433f5dd0b7eec263233e228`. Fedimint Guardian public launch is fixed: `8175` serves the styled wait/proxy UI with real background/icon assets and proxies to backend Guardian on `8177`; `package.restart fedimint` now returns immediately and settled with both services active. Latest local-only tracker pass added uninstall preserve/delete-data UI, companion APK QR/download, setup instructions rendering, Fleet/Bitcoin receive-state loading improvements, Nextcloud false-update work, PhotoPrism credential fallback, and removed the Spotlight AI coming-soon block. Continue with the broader rootless Podman lifecycle/control-plane blocker, My Apps state truthfulness, progress UX, remaining in-progress tracker items, full lifecycle, clean reboot iterations, ISO cut, and ISO smoke test.
## Current Goal
Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image.
Release status is still not green. The remaining work is mostly systemic hardening and final gates, not basic app catalog wiring.
The user improvement list in `docs/1.8-alpha-improvements-tracker.md` is part of
the same release and next ISO cut. Keep that tracker updated as items move from
`todo` to `in-progress`, `blocked`, `done`, or explicit release deferral.
## Active Session Checkpoint - 2026-06-10 05:48 EDT
New terminal resumed from this handoff. No `.198` host actions have been run in
this resumed pass yet.
Resume-save checkpoint, 2026-06-10 08:32 EDT: progress is saved in this handoff
and `docs/1.8-alpha-improvements-tracker.md`. No `.198` host actions were run
after the 05:48 checkpoint, no dev server was intentionally left running, and no
long-running validation command is expected to still be active from this pass.
The user explicitly wants the fixes backlog continued, not app migration work,
unless they redirect. Start a resumed session by re-reading the tracker row
`Make tabs info load quickly or show loading states`, then continue the slow
panel audit or move to the next unresolved fixes-backlog row.
Resume-save checkpoint, 2026-06-10 23:15 EDT: continued only frontend fixes
backlog work and avoided Bitcoin/Tor RPC/backend paths because another agent is
working there. No `.198` host actions were run, no dev server was intentionally
left running, and no long-running validation command is expected to still be
active from this pass.
Resume-save checkpoint, 2026-06-11 00:17 EDT: continued the fixes backlog only,
not app migration. Avoid Bitcoin/Tor RPC/backend work because a separate agent
is working there. The latest local change fixes the header responsiveness
regression the user flagged: primary My Apps/App Store/Websites navigation is
restored to persistent desktop tabs at `md+` on My Apps, Discover, and
Marketplace; desktop primary dropdowns were removed; mobile dropdown behavior
remains; App Store category collapse is delayed by starting uncollapsed and
using a smaller header gap/search reserve; My Apps desktop category dropdown was
removed. Validation passed `npm run type-check`,
`npm test -- --run src/views/marketplace/__tests__/MarketplaceAppCard.test.ts src/views/apps/__tests__/appsConfig.test.ts`,
and scoped `git diff --check`. Browser smoke against the already-running local
Vite/mock session (`http://127.0.0.1:8102` and mock backend `5959`) is still
pending. Leave that existing session alone unless it has already exited.
Exact first step for this pass:
1. Update the handoff docs with this fresh checkpoint.
2. Rerun local resume gates that were pending after the 05:30 checkpoint:
`git diff --check` and the focused Rust image-version test for the
Nextcloud false-update work.
3. If local gates are clean, continue the rootless Podman lifecycle/control-plane
blocker by inspecting the backend scanner/backoff and package stop/start/
restart paths before touching `.198`.
Progress in this resumed pass:
- `git diff --check` passed.
- `/tmp` has sufficient build headroom for focused Rust validation
(`/tmp` was 14% used at the start of the pass).
- Focused Rust validation for Nextcloud/image-version work is still
inconclusive, not green:
`env CARGO_INCREMENTAL=0 CARGO_TARGET_DIR=/tmp/archy-cargo-image-versions cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests`
compiled through the `archipelago` crate, then the tool PTY stayed open with
no active `cargo`, `rustc`, or linker process visible in `ps`.
- A bounded retry using the normal workspace target also did not finish:
`timeout 300s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests`
exited `124` after compiling the `archipelago` test target without reaching
test output. Keep the Nextcloud false-update row `in-progress`.
- Found and fixed a lifecycle asymmetry in
`core/archipelago/src/api/rpc/package/runtime.rs`: `package.stop` claimed to
return immediately but single-orchestrator apps still stopped synchronously
before responding. The local change now lets migrated single-orchestrator apps
return `{"status":"stopping"}` immediately and finish stop in the background,
matching start/restart behavior. This is not deployed yet and still needs
local validation.
- Separate UI-only pass on port-review track:
- My Apps now preserves the last known backend package list when a later
scanner/backoff update reports `containers-scanned=false` with an empty
package map;
- the page shows `Refreshing container state. Showing the last known app list
until the scan finishes.` above the app grid while cached app state is being
rendered;
- this touched only `neode-ui` UI files and this handoff/tracker note, so it
should not conflict with the backend app migration/control-plane pass;
- focused validation passed:
`npm test -- --run src/views/apps/__tests__/appPackageCache.test.ts` and
`npm run type-check`.
- Web5 Shared Content My Content tab now keeps the current content list
visible during refresh/failure and shows `Refreshing shared content...`;
- Web5 Shared Content Browse Peers tab now keeps the current peer content list
visible while refreshing the same peer, and shows `Refreshing peer content...`
instead of replacing the tab with a full loading panel;
- switching to a different peer still clears stale content and shows the full
connecting state;
- focused validation passed:
`npm test -- --run src/views/web5/__tests__/Web5SharedContent.test.ts` and
`npm run type-check`.
- Local review services are running for user review:
Vite `http://localhost:8102/` / `http://192.168.1.116:8102/` and mock
backend `http://localhost:5959`; `curl` probes returned HTTP `200` for both
the Vite root and proxied `server.get-state`.
- `cargo fmt --manifest-path core/Cargo.toml --all --check` passed after the
stop-path fix.
- Backend compile validation for the stop-path fix passed:
`env CARGO_TARGET_DIR=/tmp/archy-cargo-runtime-check cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago`.
The first check session also eventually returned success after the bounded
rerun waited on its build-directory lock.
- `git diff --check` passed again after the stop-path edit and doc updates.
- Follow-up inspection confirmed the lower-level Quadlet/orchestrator stop path
is already bounded: `quadlet::stop_service` uses timed `systemctl --user stop`
with app-scoped kill/reset recovery, and the runtime fallback treats missing
containers as success. No additional lower-level stop change was made in this
pass.
- Latest backlog-fix pass stayed on the fixes tracker, not new app migration:
- backend `package.credentials` now returns manifest-backed PhotoPrism
credentials (`admin` / `archipelago`) directly, matching the existing UI
fallback;
- My Apps and mobile icon-grid credential pre-launch modals are centered
vertically on mobile instead of behaving like bottom sheets;
- validation passed:
`npm test -- --run src/views/apps/__tests__/appCredentials.test.ts src/views/apps/__tests__/AppIconGrid.test.ts`,
`npm run type-check`,
`env CARGO_TARGET_DIR=/tmp/archy-cargo-runtime-check timeout 300s cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago`,
`cargo fmt --manifest-path core/Cargo.toml --all --check`, and
`git diff --check`.
- Focused Nextcloud/image-version Rust test is still not green:
`env CARGO_INCREMENTAL=0 CARGO_TARGET_DIR=/tmp/archy-cargo-image-versions-2 timeout 600s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests -- --nocapture`
again exited `124` after compiling into the `archipelago` crate without
reaching test output. Keep that tracker row `in-progress`.
- Continued the tab loading-state backlog:
- Web5 Connected Nodes Messages and Requests tabs keep populated lists
visible during refresh or refresh failure;
- Web5 Identities keeps the current identity list visible during refresh or
refresh failure and shows `Refreshing identities...`;
- Web5 DWN message browsing keeps stored messages visible during refresh or
refresh failure and shows `Refreshing messages...`;
- validation passed:
`npm test -- --run src/views/web5/__tests__/Web5ConnectedNodes.test.ts src/views/web5/__tests__/Web5Identities.test.ts src/views/web5/__tests__/Web5DWN.test.ts`
and `npm run type-check`.
- Continued the same tab/loading-state backlog on Server networking:
- Server Network overview keeps current values visible during refresh/failure
and shows `Refreshing network...`;
- Server Network Interfaces keeps current detected interfaces visible during
refresh/failure and shows `Refreshing interfaces...`;
- Server Tor Services keeps existing hidden-service rows visible during
refresh/failure and shows `Refreshing Tor services...`;
- validation passed:
`npm test -- --run src/views/__tests__/ServerNetworkRefresh.test.ts` and
`npm run type-check`.
- Continued the same loading-state backlog on Credentials:
- the Credentials list keeps existing credential rows visible during
refresh/failure and shows `Refreshing credentials...`;
- validation passed:
`npm test -- --run src/views/__tests__/CredentialsRefresh.test.ts src/views/__tests__/ServerNetworkRefresh.test.ts`
and `npm run type-check`.
- Continued the same loading-state backlog on Lightning Channels:
- the channels list keeps existing channels visible during refresh/failure
and shows `Refreshing channels...`;
- validation passed:
`npm test -- --run src/views/apps/__tests__/LightningChannels.test.ts src/views/__tests__/CredentialsRefresh.test.ts src/views/__tests__/ServerNetworkRefresh.test.ts`
and `npm run type-check`.
- Continued the same loading-state backlog on Peer Files:
- the peer catalog keeps existing file cards visible during Tor
refresh/failure and shows `Refreshing peer files...`;
- validation passed:
`npm test -- --run src/views/__tests__/PeerFilesRefresh.test.ts`,
`npm run type-check`, and `git diff --check`.
- Continued the same loading-state backlog on Cloud peer cards:
- Cloud keeps existing peer cards visible during federation peer-list
refresh/failure and shows `Refreshing peer nodes...`;
- validation passed:
`npm test -- --run src/views/__tests__/CloudPeersRefresh.test.ts src/views/__tests__/PeerFilesRefresh.test.ts`,
`npm run type-check`, and `git diff --check`.
- Continued the same loading-state backlog on the Web5 Verifiable Credentials
summary:
- the summary keeps existing credential rows visible during refresh/failure
and shows `Refreshing credentials...`;
- validation passed:
`npm test -- --run src/views/web5/__tests__/Web5CredentialsSummary.test.ts src/views/__tests__/CloudPeersRefresh.test.ts src/views/__tests__/PeerFilesRefresh.test.ts`,
`npm run type-check`, and `git diff --check`.
- Continued the same loading-state backlog on Web5 Nostr Relays:
- relay stats stay visible during refresh/failure and show
`Refreshing relays...`;
- validation passed:
`npm test -- --run src/views/web5/__tests__/Web5NostrRelays.test.ts src/views/web5/__tests__/Web5CredentialsSummary.test.ts src/views/__tests__/CloudPeersRefresh.test.ts src/views/__tests__/PeerFilesRefresh.test.ts`,
`npm run type-check`, and `git diff --check`.
- Continued the same loading-state backlog on Web5 Domains:
- registered-name counts stay visible during refresh/failure and show
`Refreshing domains...`;
- validation passed:
`npm test -- --run src/views/web5/__tests__/Web5Domains.test.ts src/views/web5/__tests__/Web5NostrRelays.test.ts src/views/web5/__tests__/Web5CredentialsSummary.test.ts src/views/__tests__/CloudPeersRefresh.test.ts src/views/__tests__/PeerFilesRefresh.test.ts`,
`npm run type-check`, and `git diff --check`.
- Continued the same loading-state backlog on Settings Backups:
- existing backup rows stay visible during refresh/failure and show
`Refreshing backups...`;
- validation passed:
`npm test -- --run src/views/settings/__tests__/BackupSection.test.ts src/views/web5/__tests__/Web5Domains.test.ts src/views/web5/__tests__/Web5NostrRelays.test.ts`,
`npm run type-check`, and `git diff --check`.
- Continued the same loading-state backlog on Settings Transport Preferences:
- existing preference controls stay visible during refresh/failure and show
`Refreshing transport preferences...`;
- validation passed:
`npm test -- --run src/views/settings/__tests__/TransportPrefsCard.test.ts src/views/settings/__tests__/BackupSection.test.ts`,
`npm run type-check`, and `git diff --check`.
- Continued the same loading-state backlog on Settings VPN status:
- current VPN connection details stay visible during refresh/failure and show
`Refreshing VPN status...`;
- validation passed:
`npm test -- --run src/views/settings/__tests__/VpnStatusSection.test.ts src/views/settings/__tests__/TransportPrefsCard.test.ts src/views/settings/__tests__/BackupSection.test.ts`,
`npm run type-check`, and `git diff --check`.
- Continued the same loading-state backlog on Web5 Federation:
- summary node counts and node DID stay visible during refresh/failure and
show `Refreshing federation...`;
- validation passed:
`npm test -- --run src/views/web5/__tests__/Web5Federation.test.ts`,
`npm run type-check`, and `git diff --check`.
- Continued the Mesh map denied-location backlog:
- added component coverage that browser geolocation denial remains optional
and tells the user peer positions can still appear;
- validation passed:
`npm test -- --run src/components/__tests__/MeshMap.test.ts`,
`npm run type-check`, and `git diff --check`.
- row remains `in-progress` until browser smoke validates denied location
with a real peer coordinate message.
- Continued the companion/tab-app backlog:
- mobile app-session keeps apps that require a new tab inside the mobile
session fallback instead of auto-opening an external tab and closing;
- validation passed:
`npm test -- --run src/views/__tests__/AppSessionMobileNewTab.test.ts src/views/appSession/__tests__/appSessionConfig.test.ts src/stores/__tests__/appLauncher.test.ts`,
`npm run type-check`, and `git diff --check`.
- row remains `in-progress` until broader companion smoke testing is done.
- Continued the Nostr Discoverable Nodes UI backlog:
- Discover modal keeps existing discovered rows visible during relay
refresh/failure and shows `Searching relays...`;
- validation passed:
`npm test -- --run src/views/federation/__tests__/DiscoverModal.test.ts`,
`npm run type-check`, and `git diff --check`.
- row remains `in-progress` until live relay/trust validation is done.
- Continued the App Store screenshots backlog:
- Marketplace App Details and installed App Details no longer show fake
screenshot placeholder tiles when no screenshot metadata exists;
- both views now render real screenshot URLs when metadata is provided as
strings or `{ src, alt }` objects;
- validation passed:
`npm test -- --run src/views/appDetails/__tests__/AppContentSection.test.ts src/composables/__tests__/useMarketplaceApp.test.ts`,
`npm run type-check`, and `git diff --check`;
- row remains `in-progress` until real screenshot assets/metadata are added.
- Continued the Home/App Store recommendations backlog:
- Home now shows an App Store recommendations card with up to three
uninstalled core/recommended marketplace apps;
- the selector respects installed aliases, so recommended apps drop out once
installed and then rely on normal My Apps/Home behavior;
- card clicks reuse the existing Marketplace App Details handoff;
- card animation ordering was tightened so Home cards have a stable stagger
sequence as the recommendations card appears/disappears;
- validation passed:
`npm test -- --run src/views/home/__tests__/homeRecommendations.test.ts`,
`npm run type-check`,
`git diff --check`, and
`ARCHY_BASE_URL=http://127.0.0.1:8103 npx playwright test e2e/visual-regression.spec.ts -g 'home / dashboard' --project=chromium`;
- temporary Vite on `8103` was stopped after the smoke. An older local
dev/mock session on `8102`/`5959` was already present and was left alone.
- tracker row is `done`.
- Home layout follow-up:
- Cloud was moved back into the second card slot;
- Recommended Apps moved into Cloud's previous position;
- Quick Start now lives inside the dashboard grid next to Wallet, with
stacked goal buttons, instead of rendering as a separate odd-width row;
- validation passed:
`npm test -- --run src/views/home/__tests__/homeRecommendations.test.ts`,
`npm run type-check`,
`git diff --check`, and
`ARCHY_BASE_URL=http://127.0.0.1:8102 npx playwright test e2e/visual-regression.spec.ts -g 'home / dashboard' --project=chromium`.
- Continued the Easy Mode experience backlog:
- goal configure steps now route to their owning app/screen instead of
silently completing without navigation;
- verify steps now show `Check & Continue`, so goals that start with a verify
step are no longer stuck without an active action;
- configure/info/verify actions start goal progress before completing the
current step;
- validation passed:
`npm test -- --run src/views/goals/__tests__/goalStepActions.test.ts src/stores/__tests__/goals.test.ts`,
`npm run type-check`, and `git diff --check`;
- tracker row is `in-progress` because broader Easy Mode product scope still
needs review.
- Continued the setup screens/function/flow backlog:
- onboarding setup choice now shows only usable paths, Fresh Start and
Restore from Seed;
- removed the disabled `Connect Existing (Coming Soon)` option;
- validation passed:
`npm test -- --run src/views/__tests__/OnboardingOptions.test.ts src/composables/__tests__/useOnboarding.test.ts`,
`npm run type-check`, and `git diff --check`;
- tracker row is `in-progress` because broader onboarding/setup audit still
needs review.
## Latest Local Checkpoint - 2026-06-10 05:30 EDT
User paused work to switch machines. No dev server or validation command should
be intentionally left running from this checkpoint.
Latest local-only release-tracker work since the older `.198` handoff:
- Uninstall/data reset:
- My Apps and App Details uninstall dialogs now include `Delete app data and reset it`;
- unchecked preserves app data and sends `preserve_data=true`;
- checked sends `preserve_data=false`;
- covered by `AppsUninstallModal.test.ts`, `rpc-client.test.ts`, type-check, and `git diff --check`;
- tracker row is `done`.
- Companion APK:
- companion intro modal uses `VITE_COMPANION_APK_URL` or `/packages/archipelago-companion.apk.zip`;
- desktop shows a centered QR image generated with the same `qrcode` library used by wallet flows;
- mobile shows a direct download button;
- visible close button restored;
- APK exists at `neode-ui/public/packages/archipelago-companion.apk.zip`;
- tracker row is `done`.
- Setup instructions:
- App Details sidebar renders `static-files.instructions` when non-empty;
- covered by `AppSidebar.test.ts`, type-check, and `git diff --check`;
- tracker row is `done`.
- Fleet / tab loading:
- Fleet auto-refresh header/sort controls were tightened;
- node history no longer blanks during refresh and now shows `Refreshing history...`;
- covered by `useFleetData.test.ts`, type-check, and `git diff --check`;
- tracker row remains `in-progress` pending broader slow-tab audit.
- Bitcoin receive readiness:
- receive modals show a live `Checking Lightning wallet readiness...` message while on-chain address generation is in flight;
- shared helper now distinguishes LND REST/newaddress transport failures;
- covered by `bitcoinReceive.test.ts`, type-check, and `git diff --check`;
- tracker row remains `in-progress` pending live wallet-state smoke test.
- Nextcloud false update:
- Nextcloud manifest/catalog/static UI metadata moved from `28` to pinned `29`;
- update comparison now ignores registry-host-only image changes while reporting same-repo tag drift;
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed;
- `cargo test -p archipelago container::image_versions::tests` from `core/` failed first with a Rust linker/incremental artifact issue after `/tmp` was full, then the non-incremental retry was killed because it ran too long;
- old `/tmp/archy-cargo-*` build-cache directories were removed and `/tmp` recovered to about 14% used;
- tracker row is `in-progress`; rerun the focused Rust test before marking done.
- Dead/coming-soon UI:
- removed the non-interactive Spotlight AI Assistant coming-soon block;
- verified no active UI `Coming soon` strings remain outside historical release-note text;
- type-check passed and `git diff --check` passed;
- tracker row is `done`.
- No-registration credentials:
- added PhotoPrism fallback credentials from its manifest (`admin` / `archipelago`);
- did not add Grafana because its `GRAFANA_ADMIN_PASSWORD` is not resolved to a known local secret/default in the repo;
- `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed;
- `npm run type-check` passed;
- tracker row still `in-progress` because other no-registration apps still need inventory.
Most recent validations before pause:
- `npm run type-check` passed after the PhotoPrism credential fallback.
- `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed.
- `git diff --check` passed after the Spotlight cleanup and before the PhotoPrism fallback; rerun it after resuming.
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed during the Nextcloud pass.
- Backend Rust focused validation for image versions is still not clean because of the local linker/incremental artifact failure and the killed retry; rerun from `core/` when convenient.
## Latest Known `.198` State
- Host: `192.168.1.198`.
- Backend deployed: `/usr/local/bin/archipelago` sha256 `159e0daf13fca2df7e831122cb0e6c84223a7e5b7433f5dd0b7eec263233e228`.
- `archipelago.service`: active after deploy.
- `archipelago-doctor.timer`: inactive.
- `archipelago-reconcile.timer`: inactive.
- No reboot validation should be started yet.
## What Was Just Done
- Investigated current Fedimint Guardian UI report:
- live `.198` RPC reports `fedimint` as `starting` and `container-health {"fedimint":"starting"}`;
- direct `http://192.168.1.198:8175/` returns HTTP `000` because the manifest wrapper has not exec'd `fedimintd` yet;
- `bitcoin-knots` is `running` and `http://192.168.1.198:8334/` returns HTTP `200`;
- `bitcoin.status` RPC returned an operation-failed error during the check, consistent with the current Bitcoin-dependent-app wait-state problem.
- Added frontend Fedimint-specific wait-state copy:
- My Apps/App card now says `Waiting for Bitcoin to finish initial sync before Guardian starts.` when Fedimint is starting or running with `health=starting`;
- App session fallback title now says `Waiting for Bitcoin sync` instead of generic `App not reachable` for that state.
- Validated frontend changes:
- `npm test -- --run src/views/apps/__tests__/appsConfig.test.ts` passed (`7` tests);
- `npm run type-check` passed;
- `npm run build` passed.
- Deployed rebuilt static frontend to `.198` only:
- preserved `aiui/` and `claude-login.html`;
- backed up previous web root at `/opt/archipelago/rollback/web-ui-fedimint-ui-20260610-042927.tar`;
- reloaded nginx;
- confirmed deployed assets contain the new Fedimint copy.
- Fixed Fedimint Guardian launch on `.198` while Bitcoin is still syncing:
- added `docker/fedimint-ui`, an nginx wait/proxy companion;
- changed Fedimint backend manifest so real Guardian UI maps to host `8177` instead of the public launch port;
- public launch port `8175` is now owned by `archy-fedimint-ui`, which serves `Waiting for Bitcoin sync` until `fedimintd` binds behind it;
- fixed the Fedimint wait command to avoid `printf '%s'` in Quadlet `Exec=` because systemd expands `%s` to the user shell (`/bin/bash`);
- live `.198` `fedimint.service` unit has `TimeoutStartSec=infinity` so systemd does not kill the intentional Bitcoin-sync wait loop;
- rebuilt and deployed frontend static files so Fedimint remains launchable while `health=starting`;
- confirmed `http://192.168.1.198:8175/` returns HTTP `200` with `Waiting for Bitcoin sync`.
- Restyled the Fedimint wait/proxy page:
- `docker/fedimint-ui/index.html` now uses Archipelago-style `glass-card`, app icon block, Montserrat-like heading stack, orange focus/glow accents, and yellow starting badge styling;
- rebuilt `localhost/fedimint-ui:latest` on `.198`;
- restarting `archy-fedimint-ui.service` hit the known rootless Podman cleanup slowness and left the unit temporarily `deactivating`;
- recovered with app-scoped `systemctl --user kill --kill-whom=all -s SIGKILL archy-fedimint-ui.service`, `reset-failed`, and `start`;
- final LAN validation: `http://192.168.1.198:8175/` returns HTTP `200`, size `6419`, and contains `glass-card`, `app-icon`, `Archipelago App`, and `Waiting for Bitcoin sync`.
- Updated the Fedimint wait/proxy page again per design feedback:
- uses the Bitcoin custom UI's `/assets/img/bg-network.jpg` full-screen background + dark overlay pattern;
- uses the real Fedimint icon inside the Bitcoin custom UI `logo-gradient-border` treatment instead of text initials;
- copied those assets into `docker/fedimint-ui/assets/`;
- rebuilt `localhost/fedimint-ui:latest` on `.198`;
- fixed nginx routing so `/assets/...` is served statically instead of being proxied to the not-yet-running Guardian backend;
- corrected the companion page to reference `fedimint.jpg` because the catalog icon bytes are JPEG despite the old `.png` extension;
- final LAN validation: `http://192.168.1.198:8175/` returns HTTP `200`, size `11328`; `/assets/img/app-icons/fedimint.jpg` returns `200 image/jpeg`; `/assets/img/bg-network.jpg` returns `200 image/jpeg`;
- Playwright render validation confirmed title `Fedimint Guardian`, status `Waiting for Bitcoin sync`, background URL `/assets/img/bg-network.jpg`, and icon natural width `860`.
- Hardened Fedimint/backend lifecycle enough for this path:
- generated Quadlet services now include `TimeoutStartSec=0` so systemd does not kill dependency-gated container entrypoints while they wait for Bitcoin IBD;
- `package.restart` now returns `{"status":"restarting"}` immediately instead of blocking the RPC call for minutes in the single-orchestrator path;
- `quadlet::restart_service` now uses bounded stop/start, app-scoped kill/reset recovery, and settle waits instead of opaque `systemctl restart`;
- deployed backend hash `159e0daf13fca2df7e831122cb0e6c84223a7e5b7433f5dd0b7eec263233e228` to `.198`;
- backup made at `/opt/archipelago/rollback/archipelago-before-quadlet-timeout0-20260610-082535`;
- `package.restart fedimint` returned `{"status":"restarting"}` in `0s`;
- restart observation: `8175` stayed HTTP `200` throughout; generated `fedimint.container` gained `TimeoutStartSec=0`; `fedimint.service` and `archy-fedimint-ui.service` settled `active`; ports `8175` and `8177` listened.
- Final Fedimint live validation after restart:
- `container-health` returned `{"fedimint":"healthy"}`;
- `container-list` returned `fedimint` `state:"running"` and `lan_address:"http://localhost:8175"`;
- services: `fedimint.service` active, `archy-fedimint-ui.service` active;
- unit contains `TimeoutStartSec=0` at line `42`;
- public wait/proxy UI and both image assets returned `200`.
- Fedimint live rollback references:
- previous frontend backup: `/opt/archipelago/rollback/web-ui-fedimint-guardian-launch-20260610-045949.tar`;
- previous Fedimint Quadlet backup: `/home/archipelago/.config/containers/systemd/fedimint.container.guardian-fix-rewrite-20260610-050607.bak`.
- Earlier backend hash `7f58da80063f58574675256913ac9cddf131e65d8935015748a70adffc228f83` was superseded by `159e0daf13fca2df7e831122cb0e6c84223a7e5b7433f5dd0b7eec263233e228`.
- Added explicit release gates:
- app packaging docs must match current manifest/runtime contract before `1.8-alpha`;
- refactor/remove-dead-code is mandatory before `1.8-alpha`, after correctness validation and before final ISO/release gates.
- Validated IndeeHub:
- `container-list` reported `indeedhub` running;
- `container-health` returned `{"indeedhub":"healthy"}`;
- `http://192.168.1.198:7778/` returned HTTP `200`;
- `http://192.168.1.198:7778/nostr-provider.js` returned HTTP `200` and contains the Archipelago NIP-07/NIP-98 provider shim.
- Validated Immich launch:
- `http://192.168.1.198:2283/` returned HTTP `200`;
- one `container-health` check returned `{"immich":"unknown"}`, so health truthfulness still needs follow-up.
- Fixed Tailscale launch UI:
- patched `app-catalog/catalog.json`, `neode-ui/public/catalog.json`, and `scripts/first-boot-containers.sh`;
- command now waits for `/var/run/tailscale/tailscaled.sock` before starting `tailscale web`;
- copied updated catalog to `/opt/archipelago/web-ui/catalog.json` on `.198`;
- patched the live generated Tailscale `.container` unit and restarted only `tailscale.service`;
- confirmed `container-list` reports Tailscale running;
- confirmed `container-health` returns `{"tailscale":"healthy"}`;
- confirmed `http://192.168.1.198:8240/` returns HTTP `200` with Tailscale UI content.
## Important Caveat
Tailscale launch is fixed, but Tailscale lifecycle is not fully passing:
- `package.restart tailscale` failed through RPC with `podman ps timed out while listing containers`.
- Manual app-scoped restart showed old container stop needed SIGKILL and Podman cleanup took roughly 2 minutes.
- Logs still showed `podman ps timed out`, `podman stats timed out`, scan backoff, and slow cleanup.
This confirms the active blocker is the rootless Podman control-plane/lifecycle path, not just individual app launch URLs.
## Active Blockers
- Rootless Podman/control-plane responsiveness:
- `podman ps` and cleanup paths time out;
- backend scan/backoff causes stale or slow UI state;
- app stop/start/restart can look frozen or fail through RPC.
- My Apps state truthfulness:
- do not show false empty/no-apps while scanner/Podman is in backoff;
- preserve last-known apps and show explicit stale/checking state.
- Progress UX:
- install/uninstall/start/stop/restart must show meaningful phase progress and not appear frozen.
- Immich health truthfulness:
- HTTP launch works, but health may still report `unknown`.
- Portainer:
- HTTP `9000` returned `200`;
- user still needs to retry environment wizard and confirm `/var/run/docker.sock` works.
- Fedimint:
- public Guardian launch URL now loads on `8175` even while Bitcoin is in IBD;
- `archy-fedimint-ui` owns `8175` and proxies to the real Guardian backend on `8177` when `fedimintd` eventually starts;
- durable manifest/companion/frontend/backend changes are now deployed on `.198`;
- `package.restart fedimint` fast-returned and settled active with `TimeoutStartSec=0`, but keep Fedimint in the broader lifecycle matrix because rootless Podman cleanup slowness remains a systemic blocker.
- Reboot validation:
- require at least 3 clean consecutive post-fix reboots with broad lifecycle green after each;
- prefer 5 clean reboots;
- do not start until lifecycle/control-plane is stable.
- App packaging docs:
- aligned `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` with the current manifest/runtime contract.
- Refactor/remove-dead-code:
- required before `1.8-alpha`;
- remove stale per-app hacks, duplicate lifecycle paths, stale fallback metadata, misleading compatibility shims;
- rerun release gates afterward.
## Local Validation Already Run
- `bash -n tests/lifecycle/remote-lifecycle.sh` passed.
- `bash -n scripts/first-boot-containers.sh tests/lifecycle/remote-lifecycle.sh` passed.
- `cargo fmt --manifest-path core/Cargo.toml --all` was run.
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` passed (`45` tests).
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed.
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed.
- `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json` passed.
- `git diff --check` passed.
- `npm test -- --run src/views/apps/__tests__/appsConfig.test.ts` passed.
- `npm run type-check` passed.
- `npm run build` passed.
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed after Fedimint manifest changes.
- `git diff --check` passed for Fedimint manifest, companion, frontend, and new `docker/fedimint-ui` files.
- `cargo fmt --manifest-path core/Cargo.toml --all` passed.
- `CARGO_TARGET_DIR=/tmp/archy-cargo-check-quadlet cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed after Quadlet/restart changes.
- `CARGO_TARGET_DIR=/tmp/archy-cargo-final-quadlet cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` produced the deployed backend binary (tool PTY heartbeat wrapper became stale after link; artifact hash was validated separately before deploy).
- Live Fedimint restart validation passed on `.198`:
- `package.restart fedimint` returned `{"status":"restarting"}` immediately;
- `8175` remained HTTP `200`;
- `fedimint.service` and `archy-fedimint-ui.service` settled `active`;
- `container-health fedimint` returned `healthy`.
- `cargo test --manifest-path core/Cargo.toml -p archipelago companion::tests` compiled then the tool PTY stuck with no active `cargo`/`rustc` process visible; treat as inconclusive, not failed.
- Filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` appeared wedged in the tool PTY after compilation started; no local cargo/rustc worker remained visible. Treat as inconclusive, not failed.
## Immediate Next Step
Do not reboot yet.
Start with the rootless Podman lifecycle/control-plane blocker:
1. Inspect the backend stop/start/restart path around `package.restart`, scanner backoff, and `podman ps` dependency.
2. Make stop/restart tolerate slow cleanup without wedging RPC/UI state.
3. Keep last-known app state during scanner backoff.
4. Revalidate focused apps on `.198`: `tailscale`, `indeedhub`, `immich`, `portainer`, `vaultwarden`, `botfights`; keep `fedimint` in the matrix but its focused Guardian launch/restart path is currently green.
5. Only after focused lifecycle is clean, run broad non-destructive lifecycle.
6. Only after that, begin 3/5 reboot validation.
## Files Touched In Last Mini-Pass
- `docs/NEXT_TERMINAL_HANDOFF.md` - this file.
- `neode-ui/src/views/apps/appsConfig.ts` - Fedimint launch-blocked reason helper.
- `neode-ui/src/views/apps/AppCard.vue` - show Fedimint Bitcoin-sync wait copy on app cards.
- `neode-ui/src/views/AppSession.vue` - pass app-specific blocked reason into app session.
- `neode-ui/src/views/appSession/AppSessionFrame.vue` - show app-specific blocked title/reason instead of generic unreachable fallback.
- `neode-ui/src/views/apps/__tests__/appsConfig.test.ts` - regression coverage for Fedimint wait-state copy.
- `apps/fedimint/manifest.yml` - backend real Guardian UI now maps host `8177` and wait command avoids systemd `%` expansion.
- `core/archipelago/src/container/companion.rs` - added `archy-fedimint-ui` companion mapping.
- `core/archipelago/src/container/quadlet.rs` - generated unit `TimeoutStartSec=0` plus bounded stop/restart recovery helpers.
- `core/archipelago/src/api/rpc/package/runtime.rs` - restart RPC returns immediately and runs restart async.
- `docker/fedimint-ui/` - new nginx wait/proxy companion image for Fedimint Guardian launch.
- `docs/RESUME.md` - checkpoint and gates.
- `docs/MIGRATION_STATUS_REPORT.md` - packaging/refactor release gates.
- `docs/CONTAINER_LIFECYCLE_HANDOFF.md` - packaging/refactor release gates.
- `docs/APP-PACKAGING-MIGRATION-PLAN.md` - updated manifest/runtime contract documentation.
- `docs/app-developer-guide.md` - updated manifest/runtime contract documentation.
- `docs/MIGRATION_STATUS_REPORT.md` - noted that the docs gate is being closed in this pass.
- `app-catalog/catalog.json` - Tailscale socket-wait startup command.
- `neode-ui/public/catalog.json` - same Tailscale catalog update.
- `scripts/first-boot-containers.sh` - same Tailscale first-boot startup update.
- `neode-ui/src/views/apps/appPackageCache.ts` - UI-only last-known package
cache for scanner backoff.
- `neode-ui/src/views/apps/__tests__/appPackageCache.test.ts` - cache behavior
coverage.
- `neode-ui/src/views/Apps.vue` - uses cached packages during scanner backoff
and shows a refresh status banner.
- `docs/1.8-alpha-improvements-tracker.md` - noted My Apps backoff cache
improvement.
- `neode-ui/src/views/web5/Web5SharedContent.vue` - preserves shared/peer
content during refresh and shows compact refresh states.
- `neode-ui/src/views/web5/__tests__/Web5SharedContent.test.ts` - shared and
peer content refresh regression coverage.
The worktree has many other pre-existing release-hardening changes. Do not revert unrelated dirty files.

View File

@ -0,0 +1,157 @@
# 🚩 PRODUCTION MASTER PLAN — Archipelago App Platform & Registry
> **THIS IS THE AUTHORITATIVE PLAN. Agents: read this first and keep it open until
> the production test gate (§5) is green.** It overrides ad-hoc direction and
> supersedes all prior roadmap/handoff/status docs. When the gate passes, remove
> the priority banner and demote this doc.
>
> Last updated: 2026-06-21 · Binary: v1.7.99-alpha
---
## 1. The North Star
Make Archipelago a **world-class, developer-ready app platform** where:
1. **Every app is manifest-driven** — install/run/update/uninstall needs only the
app's manifest (+ catalog entry). **Zero OS-level code reliance**: no per-app
Rust installers, no `sudo mkdir/chown`, no host provisioning.
2. **Manifests are distributed via the (signed) registry**, not baked into the
binary OTA as disk files. Bumping/adding an app = a signed catalog change.
3. **Third-party developers can build and ship apps via an external registry**
a decentralized marketplace (DID-signed manifests, Nostr discovery, reputation),
not a gatekept central store. `archy app validate/render/install/test` tooling.
4. The platform stays **rootless, secure-by-default, elegant, robust, and
100%-uptime-capable** (reboot-survivable, self-healing, no data loss on migrate).
**Definition of done:** the production test gate (§5) is green for the app set on
real nodes. Until then, this plan is the priority.
## 2. Invariants (never violate)
- **Rootless Podman only.** No rootful, no Docker-socket mounts, no privileged
containers unless explicitly approved. (ADR-001, ADR-009.)
- **No app-specific business logic in the Rust backend.** The orchestrator owns
the lifecycle state machine; apps are declarative. Legacy `install_immich_stack`
(hardcoded `podman run` + `sudo chown`) is the anti-pattern being deleted.
- **Secrets are manifest-declared** (`generated_secrets`, materialised by
`container::secrets` 0600/rootless, idempotent + self-healing) — never hardcoded,
per-app, or logged. Replaces the deleted `ensure_fmcd_password`.
- **Migrations never destroy data.** Preserve `/var/lib/archipelago/<app>`,
generated secrets, displayed credentials, public ports, and adoption container
names. Always provide a rollback path. Stop/recreate only when necessary.
- **Verify on a real node (.228, then .198) before any tag.**
## 3. Current state (2026-06-21)
- **~40 apps are manifest-based and Quadlet-migrated** (survive
`archipelago.service` restart + reboot). Exhaustive per-app table:
`docs/app-registry-status-2026-06-21.md`.
- **Legacy holdout: immich** — the one app with **no manifest** and a hardcoded
Rust stack installer (in-cgroup, not Quadlet). 3 containers, healthy, live data.
The migration proof case.
- **Manifests still travel by OTA disk rsync** (`apps/ → /opt/archipelago/apps`).
The signed catalog (`app-catalog.json`) currently distributes **only image
overrides** — not full manifests. Gap closed by workstream B.
- **The 4 companions** (`archy-bitcoin-ui`, `-lnd-ui`, `-electrs-ui`,
`-fedimint-ui`) build from `docker/<name>` contexts via `companion.rs`, not the
manifest registry — a later phase folds them in.
- **No app has passed the formal 20× production gate.** That is the blocker.
## 4. Workstreams (each links its authoritative detail doc)
| # | Workstream | Detail doc | Status |
|---|-----------|-----------|--------|
| A | **Manifest-driven app platform** — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | `APP-PACKAGING-MIGRATION-PLAN.md` | mostly done; immich + multi-container polish remain |
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **design done — implementing phase 1** |
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 02 code-complete (worktree) |
| E | **Production test gate** — 20× lifecycle on .228 + .198, per-app L1/L2 matrix | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **never green — exit criterion** |
**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
scan, Quadlet rendering) and `bulletproof-containers.md` (the six container failure
modes FM1FM6 + the desired-state-first reconciler that fixes them).
## 5. Production test gate (exit criterion)
An app is **production-ready** only when `tests/lifecycle/run-20x.sh` is green
across the full matrix — install / UI-reachable / stop / start / restart /
reinstall / **reboot-survive** / **archipelago-restart-survive** / uninstall —
**20× on .228 AND .198**. All 8 gate checkboxes in `tests/lifecycle/TESTING.md`
are currently unchecked. Coverage today: L0 unit (631 ●), L1 RPC ● for 6 core apps,
L2 UI ● dashboard + proxies; L3 survival ◐; ~30 apps have zero automated coverage.
## 6. Immediate sequence (live workstream)
1. **B-phase 1**`manifest` field on `AppCatalogEntry`; `load_manifests`
catalog-wins merge; `manifest_dir: Option`; unit tests (image-only apps first).
2. **B-phase 2** — publisher generator embeds + signs manifests into
`releases/app-catalog.json`.
3. **C immich proof** — author immich as registry manifests (postgres/redis/server)
installed via `install_stack_via_orchestrator`; delete `install_immich_stack`;
`generated_secrets: [immich-db-password]` — **reuse the live secret `39ec03dc…`**
(postgres is initialised with it; never regenerate). Anon `/data` vol is empty.
4. **Verify on .228, then .198.**
5. **E** — run the 20× gate; fix until green.
6. Demote this banner.
## 7. Release blockers & operational gotchas (durable)
Carried forward from prior handoffs (deduped against persistent memory):
- **Rootless control-plane responsiveness** — slow `podman ps`/store cleanup at
startup must not surface a false "no apps installed" UI. **My Apps must preserve
last-known apps during scanner backoff**, never show empty during a transient.
- **Reboot survival** — gate on ≥3 (prefer 5) consecutive clean post-reboot
lifecycle passes. Quadlet units under `user.slice` survive `archipelago.service`
restart; legacy in-cgroup containers get SIGKILLed and reconciled back.
- **Startup patterns** — wait on a socket/health, never `sleep`. Tailscale waits
for its socket; Fedimint Guardian waits for Bitcoin RPC `initialblockdownload:false`
before launching fedimintd (proxy/wait companion on :8175 during IBD).
- **Bitcoin must run full** (`txindex=1`, non-pruned) for ElectrumX/mempool.
- **Adoption** — match existing containers by name and adopt without recreate;
record a migration version in app state; preserve Nostr signer bridges
(IndeeHub needs `/nostr-provider.js` served, not just port reachability).
- **Image presence** — use bounded targeted `podman image inspect`, not
`podman image exists` (avoids store-walk stalls).
- **Companion rebuilds**`companion.rs` must rebuild `:latest` when the build
context changes (staleness check), else baked-in fixes (e.g. guardian CSS) never
reach nodes. `:local` is a manual override, never auto-rebuilt.
## 8. Roadmap
**Pipeline:** Feature Testing (internal) → User Testing (controlled hardware) →
Beta Live (public). Hardening priorities feeding the gate:
- **P0** Container app reliability — bulletproof install/health/restart/uninstall
across all apps, dependency chains, multi-container stacks.
- **P0** Networking stack first-install → reboot-proof (WireGuard/NetBird, Tor
hidden services, LND Connect).
- **P1** LUKS2 full-partition encryption for `/var/lib/archipelago/`
(AES-256-XTS, Argon2id, key from setup password + hardware salt).
- **P1** Meshtastic plug-and-play parity with MeshCore.
**Post-beta (deferred — do not start until gate is green):** P2P encrypted
voice/video (WebRTC over federation via Tor); watch-only wallet + mesh BTC
hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.md`);
Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash
phases 26 (`dual-ecash-design.md`).
## 9. Documentation map (what survives)
This master plan is the hub. Authoritative standalone docs (linked above), kept:
- **Design:** `architecture.md`, `app-developer-guide.md`,
`APP-PACKAGING-MIGRATION-PLAN.md`, `registry-manifest-design.md`,
`marketplace-protocol.md`, `dht-distribution-design.md`,
`multi-node-architecture.md`, `rust-orchestrator-migration.md`,
`bulletproof-containers.md`, `three-mode-ui-design.md`, `dual-ecash-design.md`,
`meshroller-integration-design.md`, `phase4-streaming-ecash-plan.md`, `adr/*`.
- **Reference:** `app-manifest-spec.md`, `api-reference.md`, `developer-guide.md`,
`operations-runbook.md`, `troubleshooting.md`, `user-walkthrough.md`,
`bitcoin-rpc-relay.md`, `security-code-audit-2026-03.md`, `GAMEPAD-NAV.md`,
`SEED-VERIFICATION.md`, `hotfix-process.md`, `app-registry-status-2026-06-21.md`.
All dated handoffs/resumes/transcripts/superseded trackers were consolidated here
and removed (recoverable via git) on 2026-06-21.

View File

@ -1,44 +0,0 @@
# Progress Memory
Last updated: 2026-06-13
## Current State
- `v1.7.90-alpha` release is complete, tagged, pushed, uploaded, and verified on vps2.
- Release commit: `bb808df8` (chore: release v1.7.90-alpha).
- Feature commit: `c800293f` (fix: bitcoin receive, AIUI pointer input, electrs self-heal, OTA timeout).
- Gitea tag: `v1.7.90-alpha` (on origin/gitea-vps2).
- Live OTA manifest on the update host (146.59.87.168) now resolves to `1.7.90-alpha`; both
artifact download URLs (binary + frontend tarball) return HTTP 200.
- v1.7.89-alpha was already fully shipped before this session.
## What shipped in v1.7.90-alpha
- Bitcoin receive address generation fixed (correct address type, no more 400).
- AIUI/app session: on-screen pointer can click + type into app content (incl. app store
search); "open in new tab" opens the phone browser; mobile credential modal centered.
- Electrs self-heals from a corrupt index and shows a percent/block-height progress screen.
- update.rs: retired tx1138 secondary mirror dropped (one-time migration); longer download
timeout for slow connections.
## Verification
- Full release harness green (8 stages): git-diff, cargo-fmt, catalog-drift, release-manifest,
ui-type-check, ui-unit-tests (80 files / 655 tests), cargo-check, cargo-test-weekly.
- Freshly built binary embeds `1.7.90-alpha` (no stale 1.7.89); frontend dist rebuilt fresh
(new AppSession bundle); manifest sha256 + size match on-disk artifacts.
## Known gaps / follow-ups
- `gitea-local` (localhost:3000) push FAILS from this node — redirects to /login (auth).
The v1.7.88 and v1.7.89 tags were also already missing there, so this is a pre-existing
condition on this node, not a v1.7.90 regression. vps2 is the primary OTA mirror and is fine.
- OTA self-update verification on THIS node (.116) not yet observed this session — the node
should auto-apply from the live 1.7.90-alpha manifest; confirm
`update_state.json.current_version == 1.7.90-alpha` after the scheduler runs.
## Resume Context
- If a later session resumes, continue from the next active product/release task, not this
finished release.
- Broader context: docs/WEEKLY_RELEASE_TRACKER.md, docs/RESUME.md, docs/NEXT_TERMINAL_HANDOFF.md

View File

@ -1,224 +0,0 @@
# Remaining issues — implementation plans
Written 2026-06-17. Covers the open Gitea issues not closeable in the single-box
dev env. Each plan lists the files to touch, the approach, and how to verify
(most need .116 + .198, a companion phone, or funded wallets). Issues #3 (VPN)
and #5 (OpenWRT/TollGate) are intentionally out of scope per the user.
Status of the rest at time of writing:
- **#31** group chat over Tor — dedup-by-`msg_id` fix already shipped (open only
for a 2-node Tor confirmation). See its Gitea comment.
- **#43** install on .70 — blocked: .70 unreachable. Plan below is a code-side
hardening that doesn't depend on .70's logs.
---
## #46 — Pay for peer files (local wallet OR invoice+QR to seller)
> **Status (2026-06-17): Phase 1 DONE & compiles** (LN invoice + QR + release).
> Seller: `content_invoice.rs` entitlement store, `GET /content/{id}/invoice`
> + `/invoice-status/{hash}`, invoice-paid path in `serve_content`
> (`X-Invoice-Hash`), LND `create_invoice`/`invoice_is_settled`. Buyer:
> `content.request-invoice` / `.invoice-status` / `.download-peer-invoice` +
> `PeerFiles.vue` picker modal + QR + poll. Phases 2 (on-chain) and 3 (local
> LN/on-chain methods) remain; needs live funded-wallet verify. Issue left open.
**Goal.** At the paid-download step in Cloud → peer files, let the buyer choose
how to pay: (a) their local wallet (ecash today; LN/on-chain later), or (b) get
an invoice with a QR drawn on the **selling** node's wallet, pay from any
external wallet, and have the file release on confirmation.
**What exists already**
- Buyer ecash auto-pay: `content.download-peer-paid` (mints ecash, downloads
atomically) — wired in `neode-ui/src/views/PeerFiles.vue` `downloadFile()`.
- Payer-side builder: `streaming.prepare-payment` RPC + `wallet/ecash.rs`
(`build_payment_token`, cross-mint), `swarm/payment.rs`.
- Free streaming download: `/api/peer-content/:onion/:id` (Range-capable).
- LND invoice RPC: `lnd.createinvoice`; ecash balance: `wallet.ecash-balance`.
**Backend work**
1. **Seller-side invoice RPC** (new), e.g. `content.request-invoice`
`{ onion, content_id }` → asks the *selling* node (over the existing
`/archipelago/...` peer transport, same path machinery as
`content.download-peer-paid`) to produce a payment request for `price_sats`:
- LN: `lnd.createinvoice` on the seller, return `bolt11` + `payment_hash`.
- on-chain: `lnd.newaddress` on the seller, return `address` + `amount`.
- Seller records a pending entitlement keyed by `payment_hash`/address →
content_id → buyer.
2. **Payment confirmation + release**: seller polls its own LND
(`lnd.lookup-invoice` / address watch); on settle, marks the entitlement
paid. Buyer side polls `content.invoice-status { payment_hash }` → when paid,
downloads via the existing `/api/peer-content` (gate now passes because the
entitlement is satisfied). Reuse the streaming gate in `streaming/` — add an
"invoice-paid" path alongside the ecash-token path.
3. Keep `content.download-peer-paid` (local-ecash) as the (a) fast path.
**Frontend work** (`PeerFiles.vue`)
1. Before a paid download, open a small **payment-method picker** modal:
- "Pay from this node's wallet" → existing ecash flow (show balance; if
insufficient, the LN/on-chain local options when those land).
- "Pay from another wallet (QR)" → call `content.request-invoice`, render the
`bolt11`/address as a **QR** (add a tiny QR lib or reuse one already in the
bundle — check `package.json`), show amount + a live "waiting for
payment…" state polling `content.invoice-status`, then auto-download.
2. Reuse the existing `purchaseError`/`downloading` state + `triggerDownload`.
**Verify**: .116 (seller) + .198 (buyer), a funded regtest/LN wallet. Buyer
picks QR, pays from a 3rd wallet, file releases. Then the local-ecash path.
**Effort**: large (multi-day). Phase it: (1) LN-invoice + QR + release, (2)
on-chain, (3) local LN/on-chain methods.
---
## #18 — Companion app: "open in external browser" apps don't work
> **Status (2026-06-17): DONE & compiles (Rust + TS); Android unbuilt here.**
> Reverse relay hop added: `external_open_tx` channel, kiosk publishes
> `{"t":"o","url"}` on `/ws/remote-relay` (URL-validated), forwarded to the
> companion's `/ws/remote-input`. `requestExternalOpen()` in `remote-relay.ts`
> wired into all four `appLauncher.ts` external-open sites; `InputWebSocket.kt`
> + `RemoteInputScreen.kt` open it via `ACTION_VIEW`. Issue closed; live pairing
> test pending.
**Goal.** Apps configured to open in a new/external browser should launch on the
**phone** when driven from the companion controller, using the phone-default-
browser request pattern.
**What exists**
- Relay protocol in `neode-ui/src/api/remote-relay.ts` — message cases `m`
(move cursor), `c` (click), `s` (scroll, just fixed in #7). Click resolves the
element under the virtual cursor via `deepElementFromPoint`.
- The kiosk side runs the dashboard; "open external" apps currently try to
`window.open` on the **kiosk**, which the phone never sees.
**Approach**
1. **Detect external-open intent on the kiosk**: when a click lands on an
element that would open externally (anchor with `target=_blank` / an app
flagged `opensExternally`, or an intercepted `window.open`), instead of
opening locally, send a new relay message to the phone:
`{ t: 'open-url', url }` over the `/ws/remote-relay` channel (the kiosk is the
relay server side — find where it sends frames back to the companion).
2. **Companion (phone) side** handles `open-url` by doing `window.open(url,
'_blank')` / `location.href = url` so it opens in the phone's default browser.
- If the companion is the **Android APK** (separate codebase, see
`Android/` + memory `feedback_companion_apk_not_in_update`), add an
intent-based handler there; if it's a mobile web client, handle in JS.
3. Intercept `window.open` on the kiosk dashboard globally (a small shim that,
when remote-relay is active, forwards to the phone instead of opening).
**Verify**: phone + kiosk paired; tap an "open external" app from the companion;
it opens in the phone browser.
**Effort**: medium; needs the companion device + possibly an APK change.
---
## #50 — Integrate Meshroller into our mesh features
> **Decision made 2026-06-17: seam (a) — Rust-native lift.** Full design with
> verified seam anchors (message types, dispatch, send API, event/trust gates,
> Ollama call) is in **`docs/meshroller-integration-design.md`**. Summary below.
Source: https://gitea.l484.com/clasko/Meshroller
**Phase 0 — review (DONE 2026-06-17)**
- Reviewed. Meshroller is a single ~29KB Python script (`meshroller.py`): a
daemon that bridges a **Meshtastic** radio (via the `meshtastic` Python serial
module, `SerialInterface`) to an **Ollama** LLM (`qwen2.5-coder`). It has
trusted-node auth, scheduled/queued messaging, and command handling on mesh
channels. It is a **daemon**, not firmware or a library.
- **License**: in-house (our own developer) — no third-party license blocker.
- **Hardware/transport reality**: it rides **Meshtastic serial + a local
Ollama**. Our radio is **Meshcore** (Heltec V3) and our mesh stack targets
meshcore. The `meshtastic` module does NOT speak meshcore, so the script
cannot drive our radio unmodified.
- **Decision needed (architecture)**: per user, integration **must work with
meshcore**. Two seams:
- (a) Lift Meshroller's *behaviors* (LLM bridge, trusted-node auth, scheduled
messaging, command parser) into our Rust mesh stack as typed message kinds —
native to meshcore, no Python/Meshtastic dependency. Preferred for meshcore.
- (b) Package the Python daemon as a container app and add a meshcore serial
backend to it (keeps the script, but requires writing meshcore I/O the
`meshtastic` module doesn't provide).
This choice is the remaining gate; the rest of Phase 1 below stands.
**Phase 1 — choose the seam**
- Our mesh stack: `core/archipelago/src/mesh/` (`mod.rs` `MeshService`,
`listener/`, `protocol.rs`, `types.rs`). Decide:
- If Meshroller is a *protocol/feature on the same radio* → implement it as a
typed message kind in our `MeshMessageType` + `listener/dispatch.rs`
(mirrors how block headers / alerts are handled).
- If it's a *separate transport/daemon* → wrap it behind our transport router
(`transport/`) like FIPS/LAN/Tor.
- Reuse the event seam (`MeshEvent`) so the UI gets pushes (same path we just
wired for #48).
**Phase 2 — UX** (ties into `project_mesh_telegram_plan`)
- A dead-simple onboarding + usage flow in the Mesh tab. Define the 12 killer
actions and design the setup wizard.
**Verify**: 2 radios (the .116 Meshcore + a second).
**Effort**: multi-day; gated on the Phase 0 review + a license/architecture
decision.
---
## #15 — netbird app doesn't work (LOW PRIORITY)
> **Status (2026-06-17): DIAGNOSED LIVE on .198 + FIXED (option A shipped); login works.**
> THE real blocker: the dashboard needs a **secure context**
> `window.crypto.subtle is unavailable` over plain http, so OIDC PKCE threw
> before login. Fix: proxy now serves **HTTPS** (self-signed cert at install,
> `8087:443`, all origins `https://`); frontend opens netbird in a **new tab**
> (self-signed-HTTPS iframe is blocked). Layered fixes also in `stacks.rs`:
> nginx `resolver <gateway>` + variable upstreams (IP-cache 502; `resolver
> local=on`/`${NGINX_LOCAL_RESOLVERS}` FAIL on nginx:1.27-alpine), LAN-IP
> canonical origin + CORS + multi-origin redirect URIs, `/nb-auth`+`/nb-silent-auth`
> SPA fallback (were 404), and a stale-store note (wipe to re-init). Also found:
> `conmon died` zombie containers (recreate fixes; #53). Validated on .198,
> registration+login succeed. Trusted-cert/iframe (option B) = #56;
> registry-app migration = #52. Existing nodes need a clean reinstall.
**Diagnose first** (likely a container/config issue, like other app fixes):
1. On a node: `podman logs <netbird container>` — capture the actual failure.
2. Check the app manifest + install path (`container/` install, env, ports,
the four iframe-sync places per memory `feedback_gitea_iframe_setup` if it
has a UI).
3. netbird needs a management URL / setup key — confirm whether the app expects
config we don't provide, or a host capability (TUN device / NET_ADMIN) the
rootless-podman setup lacks.
**Likely fix**: either supply the missing env/setup-key UI, or add the required
container capability. Low priority — schedule after the above.
---
## #43 — Install errors at DID-creation + password screens (.70); FIPS slow
`.70` is unreachable, so we can't read its logs. Code-side hardening that helps
regardless:
> **Status (2026-06-17): hardening DONE & compiles.** Root cause was a
> non-idempotent `seed.generate` that overwrote node keys under the client's
> retry storm on slow first boot. Fixed: idempotent generate + retry-safe
> verify (`seed_rpc.rs`), transient-vs-genuine error handling in
> `OnboardingSeedGenerate/Verify.vue`, and a non-blocking FIPS status on
> `OnboardingDone.vue`. Issue closed; full closure wants a fresh install on a
> reachable node + re-test on .70.
1. **Onboarding error surfacing** — in the seed/DID + password onboarding views
(`OnboardingSeed*`, the password step) and their RPC handlers
(`seed.generate` / `seed.verify` / `auth.setup`), make a *successful*
operation never show an error toast, and make genuinely-failed ops show the
real message + a retry — so cosmetic errors (op actually succeeded) stop
alarming users. Audit the promise/catch paths for races where a slow backend
resolves after a timeout fires.
2. **FIPS start delay** — confirm `spawn_post_onboarding_fips_activate`
(`api/rpc/seed_rpc.rs`) isn't blocking onboarding; it already runs detached.
Consider surfacing "FIPS starting…" status instead of letting it look stuck.
**Verify**: a fresh ISO install on a reachable node (.198 or a scratch box),
watch the DID + password screens; then re-test on .70 once reachable.
**Effort**: smallmedium (the hardening); full closure needs a repro node.

View File

@ -1,840 +0,0 @@
# RESUME - Archipelago Release Hardening on `.198`
Last updated: 2026-06-10
## 2026-06-10 05:48 EDT Active Session Checkpoint
Work resumed from `docs/NEXT_TERMINAL_HANDOFF.md`. No `.198` host actions have
been run yet in this resumed pass.
Current first steps:
1. Rerun `git diff --check`.
2. Rerun the focused Rust image-version test for the Nextcloud false-update
helper.
3. If those are clean, inspect and continue the rootless Podman lifecycle/
scanner-backoff work before any `.198` validation.
Progress:
- `git diff --check` passed.
- Focused Rust image-version test in `/tmp/archy-cargo-image-versions` remains
inconclusive: the tool PTY stayed open after compile output stopped, with no
active `cargo`, `rustc`, or linker process visible.
- Bounded retry of the focused image-version test using the normal workspace
target also timed out: `timeout 300s cargo test --manifest-path core/Cargo.toml -p archipelago container::image_versions::tests`
exited `124` after compiling the `archipelago` test target without reaching
test output. Nextcloud false-update validation is still not closed.
- Local code change in progress: single-orchestrator `package.stop` now returns
immediately with `stopping` and runs the orchestrator stop in the background,
instead of blocking the RPC/UI while Podman cleanup happens.
- `cargo fmt --manifest-path core/Cargo.toml --all --check` passed.
- Compile check passed in `/tmp/archy-cargo-runtime-check`:
`cargo check --manifest-path core/Cargo.toml -p archipelago --bin archipelago`.
- `git diff --check` passed after the stop-path edit and doc updates.
- Lower-level stop path inspection: Quadlet service stop is already bounded
with kill/reset recovery, and the runtime fallback treats already-absent
containers as success. No extra lower-level stop change was made.
## 2026-06-10 05:30 EDT Pause Checkpoint
User paused to switch machines. Continue from `/home/archipelago/Projects/archy`
and read `docs/NEXT_TERMINAL_HANDOFF.md` plus
`docs/1.8-alpha-improvements-tracker.md` first. No dev server or validation
command should be intentionally left running from this checkpoint.
Latest local-only tracker progress:
- Done: uninstall preserve/delete-data choice, companion APK QR/download modal,
App Details setup-instructions card, dead/coming-soon UI cleanup via Spotlight
AI placeholder removal.
- In progress: Fleet/tab loading polish, Bitcoin receive-address readiness
states, no-registration credentials inventory, Nextcloud false-update fix.
- New credential fallback: PhotoPrism now shows manifest-backed credentials
(`admin` / `archipelago`) when backend credentials are empty. Grafana was not
added because `GRAFANA_ADMIN_PASSWORD` is not resolved to a known repo
default/secret.
- Nextcloud local fix: manifest/catalog/UI metadata now points at `nextcloud:29`
and image update detection ignores registry-host-only changes. Catalog drift
passed, but backend focused Rust validation did not complete cleanly. First
`cargo test -p archipelago container::image_versions::tests` from `core/`
hit a Rust linker/incremental artifact failure while `/tmp` was full; a
non-incremental retry was killed after running too long. Old
`/tmp/archy-cargo-*` build-cache directories were removed and `/tmp` recovered.
Latest local validations:
- `npm run type-check` passed after the PhotoPrism credential fallback.
- `npm test -- --run src/views/apps/__tests__/appCredentials.test.ts` passed.
- `git diff --check` passed after the Spotlight cleanup and should be rerun
after resuming.
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed during
the Nextcloud pass.
Immediate next steps:
1. Rerun `git diff --check`.
2. Rerun `cargo test -p archipelago container::image_versions::tests` from
`core/` when ready to validate the Nextcloud update-detection helper.
3. Continue the `docs/1.8-alpha-improvements-tracker.md` rows that remain
`todo` or `in-progress`, avoiding host-gated items until `.198` access is
intentionally resumed.
## 2026-06-09 Resume Handoff - Read First
Last user prompt to preserve:
> please can we save all our progress, backlog, and goal to memory so I can resume on another device please
>
> including the last prompt
Ultimate release goal:
Archipelago's app/container system must be developer-ready and production-release ready. New apps should be supported through manifest/runtime contracts and clear developer documentation, not one-off OS-level changes or fragile per-app hacks. The app system must be professional, secure, elegant, lightweight, and predictable: apps install, start, stop, restart, uninstall, reinstall, survive reboot, show correct status/progress, and launch correctly from tabs/iframes. Developers should be able to package apps for Archipelago clearly from the migration/developer docs.
Important target node:
- Validation node: `archipelago@192.168.1.198`, password `password123`.
- Current release deadline pressure from user: production release target was Thursday, 2026-06-11.
- Tests have been run mostly on `.198`; user noted we may also need to validate on the current intended release server, not only `.198`.
- Avoid broad/destructive Podman store cleanup. Do not use `git reset --hard` or revert unrelated user changes.
Current deployed backend on `.198`:
- Latest deployed `/usr/local/bin/archipelago` sha256: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`.
- A later local-only code change exists and passed `cargo check`: cached web-app health now requires HTTP reachability, not just TCP. This was not deployed because the user interrupted the release build/deploy flow. No build process was left running at handoff.
Major progress achieved in the latest session:
- Beta Telemetry / Fleet collector:
- Confirmed `TELEMETRY_COLLECTOR_URL` was not set in the current shell and no repo/service config was setting it.
- Fixed the periodic reporter to POST a `telemetry.ingest` JSON-RPC envelope to the configured collector endpoint instead of POSTing the raw telemetry report body.
- Added optional systemd env loading with `EnvironmentFile=-/var/lib/archipelago/telemetry.env` in `image-recipe/configs/archipelago.service`.
- Updated `scripts/deploy-to-target.sh` so deployments write `/var/lib/archipelago/telemetry.env` when `TELEMETRY_COLLECTOR_URL` is exported in `scripts/deploy-config.sh`.
- Documented the expected value shape in `scripts/deploy-config.example`: `https://<collector-host>/rpc/v1`.
- Verification passed: `cargo fmt -p archipelago --manifest-path core/Cargo.toml`, `bash -n scripts/deploy-to-target.sh`, `git diff --check` for the touched files, and `CARGO_TARGET_DIR=/tmp/archy-cargo-check cargo check -p archipelago --manifest-path core/Cargo.toml`.
- `systemd-analyze verify image-recipe/configs/archipelago.service` could not run in the sandbox because systemd bus access failed with `SO_PASSCRED failed: Operation not permitted`.
- Still needed: choose the real collector host, create or update local `scripts/deploy-config.sh` with `export TELEMETRY_COLLECTOR_URL='https://<collector-host>/rpc/v1'`, deploy, restart `archipelago`, and confirm opted-in nodes ingest into Fleet.
- IndeeHub:
- Recovered stale/corrupt metadata/container state enough for fresh lifecycle.
- Full lifecycle passed earlier on `.198`.
- Verified launch on `7778`.
- Verified `/nostr-provider.js` is served and the Nostr signer bridge requirement is preserved.
- Saleor:
- Removed from app catalog/server as requested.
- Bitcoin Knots / Bitcoin UI:
- Fixed false health path so `bitcoin-knots` health no longer just probes the UI bridge on `8334`.
- Patched Bitcoin UI wording to show retrying/busy sync states instead of scary permanent failure.
- Verified `/bitcoin-status` recovered; node is in IBD and pruned, progress around 6-7% during latest checks.
- Fedimint:
- Restored/kept Fedimint Gateway as separate catalog app. Do not make Guardian launch Gateway.
- Fixed Guardian startup path so `fedimint` uses manifest-backed Quadlet/orchestrator, not legacy startup.
- Fixed generated unit regeneration by removing the pre-orchestrator Podman inspect gate for orchestrator starts.
- Fedimint Guardian unit now includes `FM_BITCOIND_URL=http://bitcoin-knots:8332`.
- Added manifest wrapper that waits for Bitcoin RPC sync with `"initialblockdownload":false` before launching `fedimintd`.
- Current correct behavior on `.198`: `fedimint.service` active and logging `Waiting for Bitcoin RPC sync at http://bitcoin-knots:8332...`; RPC health returns `starting`; container-list now reports `fedimint` as `starting` instead of stale `stopping`.
- Guardian iframe/tab does not yet show UI because `fedimintd` is intentionally gated until Bitcoin leaves IBD. The UI should explain "waiting for Bitcoin sync" rather than opening a blank/dead iframe.
- BotFights:
- User reported stopped/unhealthy.
- Added `botfights` to manifest-backed orchestrator start path so it no longer fails immediately on legacy Podman discovery.
- Deployed backend hash `9a00e543...`.
- BotFights started and is active.
- Direct checks after it finished booting: `/` returned HTTP 200; `/api/health` returned `{"status":"ok","name":"botfights"}`.
- Note: `.198` manifests still use `git.tx1138.com/lfg2025/botfights:1.1.0`; local repo manifest shows `146.59.87.168:3000/lfg2025/botfights:1.1.0`. Reconcile this catalog/manifest mismatch later.
- Status/health correctness:
- Reduced container health/status Podman timeouts to avoid UI hanging forever.
- `container-list` now refreshes stale cached states and uses Quadlet service-active fallback for stale `stopping` states.
- Fedimint stale `stopping` fixed to `starting`.
- Local-only patch passed `cargo check`: web-app cached health requires HTTP success/redirect, not just open TCP. This fixes false healthy during app boot, seen with BotFights.
- Filebrowser/Home Assistant/Immich/Bitcoin:
- Latest RPC health check showed filebrowser healthy, homeassistant healthy, immich healthy, bitcoin-knots healthy.
- Still treat Home Assistant setup/restart hang and Immich post-setup HTTP 500 as backlog blockers needing focused validation.
Current critical blockers:
- Runtime control plane / Podman scanning:
- Backend restarts repeatedly take 1-2 minutes because startup/crash recovery synchronously waits on slow `podman ps`.
- Logs show repeated `podman ps -a --format json timed out after 30s` and crash recovery `podman ps stopped timed out after 60s`.
- This is causing bad UX: "checking forever", false "no apps installed", intermittent "loading apps", stale statuses, slow lifecycle actions.
- Next platform fix should move Podman/crash-recovery scans out of the service readiness path and keep last-known app state during scanner backoff.
- My Apps UI false negatives:
- User reports apps sometimes do not show, "checking" forever, "loading apps" sometimes good but often false "no apps installed".
- Required fix: do not show empty/no-apps while scanner or Podman is in backoff. Keep last known apps, show explicit loading/checking/stale state, and avoid destructive UI conclusions from scan timeout.
- Fedimint Guardian:
- Current "starting/waiting for Bitcoin sync" is correct while Bitcoin is in IBD.
- Need UI/status copy that explains waiting for Bitcoin sync, and later validate Guardian UI on `8175` once Bitcoin sync condition is satisfied.
- Progress UX:
- User explicitly requires install/uninstall/start/stop/restart progress to be accurate and not look frozen.
- Uninstall indicator currently poor/no progress. Must fix with clear phase updates and no stale notifications.
- Stale health notifications:
- Must not persistently trigger on new logins/refreshes after no longer valid.
- Some UI filtering was patched earlier, but keep this in regression backlog.
- Reboot survival:
- Must pass repeated reboot validation after runtime/status fixes.
- Acceptance target from user: minimum 3 clean consecutive reboots, preferably 5.
Backlog captured from user reports:
- Portainer:
- Environment wizard error: `Dial unix /var/run/docker.sock: connect: connection refused`.
- User noted Portainer does Podman orchestration well; compare/learn from its socket/control flow where useful.
- Fedimint:
- Setup after guardian confirmation caused app not to launch.
- Guardian launch was opening Gateway before; do not regress. Guardian and Gateway must remain distinct.
- Gateway app disappeared from catalog before; it has been restored but keep in regression tests.
- Bitcoin Knots:
- User saw missing app/launch issues and status bridge messages. UI now improved, but include in lifecycle/reboot regression.
- Home Assistant:
- Setup has issues on this node and restart hung for a long time.
- Immich:
- After setup user saw HTTP 500 stacktrace from `loadServerConfig`. Needs focused post-setup validation, not just "healthy".
- Filebrowser:
- User saw erroneous stopped status while app was working. Status ordering was patched; keep in regression.
- Tailscale:
- Launch must show local login/auth UI, not merely container running.
- BTCPay/Fedimint/Gateway/other Bitcoin-dependent apps:
- Need clearer dependency wait states when Bitcoin RPC is slow/IBD.
- App catalog/developer readiness:
- Apps should not require OS-level changes per app.
- App migration document and developer guide must include this principle and current app packaging contract.
- Saleor:
- Removed from catalog/server and should stay removed unless intentionally reintroduced.
Release readiness estimate:
- Prior estimate was 68%; after latest IndeeHub/Fedimint/BotFights/status progress, a realistic estimate is about 72%.
- Remaining 28% is not feature volume; it is systemic hardening: runtime control-plane responsiveness, truthful UI during Podman backoff, lifecycle/reboot gates, and focused app-specific post-setup validation.
Suggested immediate next steps after resuming:
1. Read this file and verify no background build/process is running.
2. Build/deploy the local-only HTTP-health tightening patch if not already deployed.
3. Patch backend startup/crash recovery so Podman scans are async/non-blocking and service readiness is not held hostage by `podman ps`.
4. Patch My Apps UI/data flow to preserve last-known apps during scanner backoff and never show false empty state while checking.
5. Run focused status checks on `.198`: fedimint, botfights, filebrowser, bitcoin-knots, immich, homeassistant, portainer.
6. Continue lifecycle gates only after the runtime scan/control path is stable enough that tests measure apps, not Podman timeouts.
Read this first if resuming in a fresh OpenCode session. Paste the resume prompt below verbatim.
---
## Resume Prompt
> Continue Archipelago release hardening from `docs/RESUME.md`. First read `docs/RESUME.md`, `docs/CONTAINER_LIFECYCLE_HANDOFF.md`, and `docs/MIGRATION_STATUS_REPORT.md`. The active validation node is `.198` at `192.168.1.198`; keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive for deterministic tests. Do not run Podman prune/image-list/system-df/image-exists/store-wide cleanup commands on `.198`; the store is known to hang under load. Preserve app data. Latest deployed backend hash is `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`. This includes the rootless Podman socket fix that treats `/run/user/1000/podman/podman.sock` as a socket bind, never a directory/data bind, prefers persistent `podman-archy-api.service` for Portainer, and changes absent cached `Stopping` entries to `Stopped`. User reported host reboot validation was not clean: many containers were SIGKILLed during reboot/shutdown and IndeeHub was stopped after boot. User also reported Immich, IndeeHub, Tailscale, Vaultwarden, Portainer, Home Assistant, Uptime Kuma, Nextcloud, Fedimint, and Botfights app lifecycle/launch/state issues. BTCPay was a false alarm: slow but fine. Current live validation: Vaultwarden full preserve-data lifecycle passed; Portainer full preserve-data lifecycle passed and its socket mount is no longer `//deleted`, but the user still needs to retry the Portainer environment wizard. Fedimint direct container state is running/healthy. IndeeHub remains P0: Podman still has a corrupted `indeedhub|Removing|97cf9fd13bb2` record; targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Treat post-reboot recovery, launch reachability, lifecycle correctness, progress indication, and rootless Podman socket-backed apps as active release blockers. IndeeHub is not passing unless `http://<node>:7778/` is reachable and `/nostr-provider.js` is injected/served so the Nostr signer works as before. Tailscale is not passing unless launch presents the Tailscale login/auth UI. Before editing or touching `.198`, summarize current state and your exact first step.
---
## Current Goal
Cut Archipelago `1.8-alpha`, including a ready-to-test ISO image.
Current status estimate: about 68% of the way to release. The app migration, manifest/catalog generation, and many local gates are advanced, and the latest pass fixed Vaultwarden plus the concrete Portainer stale socket mount. Live `.198` testing still shows the app platform is not production-bulletproof. Remaining release blockers include app install/start truthfulness, frontend launch readiness gating, IndeeHub recovery and Nostr signer compatibility, Tailscale login-link launch, Home Assistant/Uptime Kuma/Nextcloud install/start failures, full lifecycle coverage, progress indication quality, app packaging documentation, refactor/dead-code cleanup, repeated reboot validation, final `.198` lifecycle confidence, and cutting/smoke-testing the `1.8-alpha` ISO.
## Release Readiness Estimate
- Estimated completion: `68%`.
- What is already achieved:
- manifest-driven app migration is substantially advanced;
- catalog metadata generation and strict drift checks are green;
- local backend/frontend release gates have been green in prior passes;
- broad non-destructive lifecycle has passed on the deployed release-candidate line before the reboot-gate finding;
- Podman store-risk paths have been quarantined from known fragile broad image/store commands;
- IndeeHub recovery now has local hardening in progress, including explicit Nostr signer validation in the lifecycle harness;
- targeted Immich fixes now make dependency creation fail fast instead of silently reporting install success, and a follow-up readiness-gating patch is in progress so the app does not look launchable before HTTP readiness;
- mobile and desktop app progress UX now has clearer install/remove phase labels in local changes;
- Vaultwarden full preserve-data lifecycle passed on `.198` after the rootless socket fix;
- Portainer full preserve-data lifecycle passed on `.198` after recreating the container against persistent `podman-archy-api.service`; its mount now points at `/podman/podman.sock`, not `/podman/podman.sock//deleted`.
- What must still pass before release:
- deploy the current Immich readiness-gating backend and frontend progress UX changes;
- focused Immich validation: install must stay in progress until `http://<node>:2283/` returns HTTP success and app launch opens the frontend;
- focused IndeeHub validation: recover stale/corrupt frontend container, prove `http://<node>:7778/`, and prove `/nostr-provider.js` signer bridge is injected/served;
- keep Vaultwarden in regression coverage even though the latest full lifecycle passed;
- focused Tailscale validation: launch must present the local login/auth link/UI on `8240`;
- focused Portainer validation: user must retry the environment wizard and confirm it can connect to the rootless Podman socket at `/var/run/docker.sock`;
- full preserve-data lifecycle testing for representative migrated apps and key stacks: `install -> launch -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch`;
- progress indication validation for install, uninstall, start, stop, restart, reboot recovery, and failed transitions; generic "running" or "removing" pills are not enough;
- app packaging documentation gate: update `docs/APP-PACKAGING-MIGRATION-PLAN.md` and `docs/app-developer-guide.md` so they match the current manifest/runtime contract, include lifecycle/progress/reboot expectations, and clearly tell developers to use reusable manifest/orchestrator primitives instead of OS-level per-app hacks;
- required refactor/remove-dead-code gate: after correctness is proven and before cutting `1.8-alpha`, remove obsolete app-specific paths, stale fallback metadata, duplicate lifecycle logic, unused scripts/hooks, and misleading compatibility shims; rerun lifecycle, launch, and release gates afterward;
- broad non-destructive lifecycle after the deploy;
- at least 3 consecutive clean post-fix reboot iterations, with broad lifecycle green after each;
- preferably 5 consecutive clean reboot iterations before calling `1.8-alpha` production-release ready;
- final local release gates after any additional fixes;
- cut the `1.8-alpha` ISO;
- boot/smoke-test the ISO enough to prove installability, backend startup, UI startup, app catalog availability, and at least a focused app lifecycle.
---
## Latest User Directive
> A lot were killed SIGKILL and one crashed, a couple stopped. Not sure if we did fixes but we should be a few reboot tests until 3/4/5 reboots are clean I guess, unless you advise a different passing criteria
>
> please do not forget that indeehub must work with the nostr signer just like before, I hope we haven't broken that or anything, please add to tasks
>
> also please note that immich and tailscale are not launching on the front-ends on their ports from the app screen, they say running/healthy but clearly aren't
>
> Also BTCPay is not running either
>
> no my bad, wrong server, BTCPay is fine just slow, please continue
>
> Yes, as shown in trying to complete the environment wizard in portainer you get "Failure Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
>
> please confirm there is a refactor/remove dead code release gate too
Passing criterion adopted: after the post-reboot recovery fix is deployed, require at least 3 consecutive clean reboots with broad non-destructive lifecycle green after each; prefer 5 consecutive clean reboots for production-release confidence. SIGKILL during shutdown is not automatically disqualifying if every managed app recovers and is reachable after boot, but any app left stopped/crashed/unreachable after boot is a failed reboot iteration. IndeeHub validation must include the Nostr signer bridge, not just HTTP reachability.
Immich, Tailscale, Vaultwarden, and Portainer are explicit blockers. Container `running`/`healthy` is not enough for Immich/Tailscale; direct/app-screen launch routes must work. Tailscale launch must present the login/auth UI. Vaultwarden must survive install/start/restart. Portainer must be able to talk to the rootless Podman socket from inside its Docker-compatible socket bind. BTCPay is not currently a blocker; it was a wrong-server/slow-app false alarm.
There is also an explicit app packaging documentation gate and an explicit required refactor/remove-dead-code release gate. The packaging docs must be current enough for a third-party developer to package an app against the actual manifest/runtime contract. Do the refactor/dead-code cleanup after current correctness fixes are validated, not before, but do not cut `1.8-alpha` without it: remove stale per-app hacks, dead legacy code paths, duplicate lifecycle helpers, obsolete scripts/hooks, and misleading fallback metadata that would make `1.8-alpha` hard to maintain, then rerun the release gates.
---
## Live `.198` State
- Host: `192.168.1.198`.
- Password for lifecycle harness/RPC login: `password123`.
- Latest recorded `/usr/local/bin/archipelago` sha256: `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`.
- `archipelago.service`: active.
- `archipelago-doctor.timer`: inactive.
- `archipelago-reconcile.timer`: inactive.
- `/`: `65%` used, about `9.6G` free.
- `/var/lib/archipelago`: about `9-10%` used, about `370G` free.
Current active app blockers:
- Immich: after deploying hash `54d781...`, reinstall no longer immediately stops. Live test showed `immich_postgres` and `immich_redis` healthy and `immich_server` running; first launch had a readiness gap while Immich ran migrations/geodata import, then `2283` returned HTTP `200`. Local follow-up changes add an Immich server health check and require healthy status before install completes.
- IndeeHub: still blocked. Latest targeted check after hash `f1f5c61c...` showed a corrupted Podman ghost record: `indeedhub|Removing|97cf9fd13bb2`; `podman inspect indeedhub` fails with `layer not known`. Targeted `podman rm -f`, `podman rm -f --time 0`, and `podman container cleanup --rm indeedhub` hang and must be killed. Must recover this record without broad store cleanup and then verify `http://<node>:7778/` plus `/nostr-provider.js` for the Nostr signer.
- Home Assistant: user reports install completes then app stops. Treat as part of the migrated single-container/rootless Podman control-plane blocker.
- Uptime Kuma: user reports install takes ages then app stops. Live logs showed `package.install uptime-kuma failed: systemctl --user restart podman.socket exited exit status: 1`.
- Nextcloud: user reports same install-then-stop behavior. Live logs showed `package.install nextcloud failed: systemctl --user restart podman.socket exited exit status: 1`.
- Vaultwarden: latest full preserve-data lifecycle passed on hash `2a168489...`: install -> launch on `8082` -> stop -> start -> restart -> uninstall preserve_data -> reinstall -> launch. Keep in regression tests because the user-visible transition/progress UX still looked like it was stuck while stopping.
- Portainer: latest full preserve-data lifecycle passed on hash `2a168489...`. The stale mount was confirmed as `/run/user/1000/podman/podman.sock//deleted`; after persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock` without `//deleted` and `http://127.0.0.1:9000/` returns HTTP `200`. User still needs to retry the environment wizard; do not close this blocker until the wizard no longer reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`.
- Tailscale: still blocked. Container running is not enough; launch must present local login/auth UI on `8240`.
- Fedimint: user reported it showed `stopping`; after hash `f1f5c61c...`, direct targeted state shows `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`. Keep in focused regression/launch checks.
- Botfights: newly reported stopped/broken. Direct probe after the report showed `botfights` running/healthy and `http://127.0.0.1:9100/` returning `200`; keep in focused lifecycle/launch validation after Podman control-plane recovery.
- Rootless Podman socket/control plane: improved but still a release-risk area. Fixed the concrete bug where `/run/user/1000/podman/podman.sock` could be created as a directory and the Portainer bind could point at a deleted socket inode. The current deployed backend prefers persistent `podman-archy-api.service`. Continue watching scanner timeouts and lifecycle behavior for Home Assistant, Uptime Kuma, Nextcloud, and Portainer.
- Stuck Podman records: P0 migration blocker. IndeeHub proves ordinary targeted `podman rm` fallbacks are not sufficient once a record is wedged in `Removing`.
- Progress UX: still blocked until live validation proves install/uninstall/start/stop/restart show phase detail and do not appear frozen.
Do not treat root disk pressure as a current blocker anymore. It was reduced from `99%` used with under `600M` free to about `65%` used with roughly `10G` free.
### 2026-06-10 Resume Continuation Checkpoint
- Deployed backend hash `7f58da80063f58574675256913ac9cddf131e65d8935015748a70adffc228f83` to `.198`.
- Previous live hash observed before deploy: `9a00e5432dd9241a9a54087cc87ede46fc0c77a5051dbfb2d34112b9b12e902f`.
- `archipelago.service` is active.
- `archipelago-doctor.timer` and `archipelago-reconcile.timer` are inactive.
- Added explicit release gates to this handoff:
- app packaging docs must be updated before `1.8-alpha`;
- refactor/remove-dead-code is required before `1.8-alpha`, after correctness validation and before final release gates/ISO.
- Local validation before deploy:
- `bash -n tests/lifecycle/remote-lifecycle.sh` passed;
- `cargo fmt --manifest-path core/Cargo.toml --all`;
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` passed (`45` tests);
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed;
- `python3 scripts/check-app-catalog-drift.py --release --strict` passed;
- `git diff --check` passed.
- Filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` appeared wedged in the tool PTY after compilation started; no local cargo/rustc worker remained visible. Treat that one filtered run as inconclusive, not failed.
- IndeeHub live validation after deploy:
- `container-list` reports `indeedhub` running;
- `container-health` reports `{"indeedhub":"healthy"}`;
- `http://192.168.1.198:7778/` returns HTTP `200`;
- `http://192.168.1.198:7778/nostr-provider.js` returns HTTP `200` and contains the Archipelago NIP-07/NIP-98 Nostr provider shim.
- Immich live validation after deploy:
- `container-list` reports `immich` running;
- direct `http://192.168.1.198:2283/` returns HTTP `200`;
- `container-health` reported `{"immich":"unknown"}` during one focused check, so health truthfulness still needs follow-up even though launch HTTP is reachable.
- Tailscale live validation after deploy:
- Found the live generated unit still used the stale catalog command `sleep 2; tailscale web...`; locally patched `app-catalog/catalog.json`, `neode-ui/public/catalog.json`, and `scripts/first-boot-containers.sh` to use the safer socket-wait startup, and copied the catalog to `/opt/archipelago/web-ui/catalog.json`.
- App-scoped `package.restart tailscale` failed via RPC with `podman ps timed out while listing containers`.
- Patched the live generated Tailscale `.container` unit to match the catalog fix and restarted only `tailscale.service`; the old container required SIGKILL during stop and Podman cleanup took roughly 2 minutes.
- After restart, the Tailscale unit runs both `tailscaled` and `tailscale web`, `container-list` reports `tailscale` running, `container-health` reports `{"tailscale":"healthy"}`, and `http://192.168.1.198:8240/` returns HTTP `200` with Tailscale UI content.
- Do not close Tailscale lifecycle as fully passing yet: launch UI is fixed, but stop/restart behavior exposed the rootless Podman cleanup/control-plane blocker.
- Other live probes after deploy:
- `portainer` HTTP `9000` returns `200`; user still needs to retry the environment wizard.
- `vaultwarden` HTTP `8082` returns `200` from localhost on `.198`.
- `botfights` HTTP `9100` returns `200` from localhost on `.198`.
- `btcpay-server` returned `302` then timed out under a short probe; continue treating BTCPay as slow rather than a current blocker unless a focused check fails.
- `fedimint` port `8175` reset during probe while RPC showed `starting`; keep expected Bitcoin-sync wait-state/status copy in scope.
- Podman/control-plane remains the active systemic blocker:
- logs still show `podman ps timed out`, `podman stats timed out`, scan backoff, and slow app cleanup;
- do not start reboot-count validation until app stop/start/restart and post-reboot recovery are clean enough that tests measure app behavior instead of Podman timeouts.
---
## Latest Completed Work
### 2026-06-08 Rootless Socket, Vaultwarden, and Portainer Fix
- Built and deployed backend hash `2a168489737180b4088503dd93ef89c11da13e64790b324db8baea8ca05d3536` to `.198`; then built and deployed follow-up hash `f1f5c61c9f66ae58e3cb0c7f1cb390777814d162345685c1ddec099057ba2fe3`; `archipelago.service` active, `archipelago-doctor.timer` inactive, `archipelago-reconcile.timer` inactive.
- Fixed rootless Podman socket bind handling in `core/archipelago/src/container/prod_orchestrator.rs`:
- `/run/user/1000/podman/podman.sock` is skipped by bind-directory creation and data UID/chown prep;
- socket bind mounts call explicit socket repair before other bind prep;
- `ensure_user_podman_socket()` now prefers persistent `podman-archy-api.service` at `unix:///run/user/1000/podman/podman.sock`, falling back to `podman.socket` only if needed.
- Validated locally before deploy:
- `cargo fmt --manifest-path core/Cargo.toml --all`.
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
- `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago absent_` (`4 passed`, including the stale absent `Stopping` regression tests).
- `git diff --check`.
- `timeout 900s cargo build --manifest-path core/Cargo.toml -p archipelago --release`.
- Vaultwarden full preserve-data lifecycle passed on `.198`:
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=vaultwarden ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Portainer full preserve-data lifecycle passed on `.198`:
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=portainer ARCHY_FULL_LIFECYCLE=1 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Portainer stale socket mount was confirmed and repaired:
- Before recreate, mountinfo showed `/run/user/1000/podman/podman.sock//deleted -> /var/run/docker.sock`.
- After persistent `podman-archy-api.service` and Portainer recreate, mountinfo shows `/podman/podman.sock -> /var/run/docker.sock`, host socket exists, and Portainer UI returns HTTP `200`.
- User still needs to retry the Portainer environment wizard; do not close the blocker until that wizard can connect.
- Direct state check after deploy:
- `fedimint|Up ... (healthy)` and RPC `container-list` shows `fedimint running`.
- `indeedhub|Removing|97cf9fd13bb2`; `podman inspect` fails with `layer not known`; targeted removal/cleanup hangs and had to be killed.
- `vaultwarden running true`.
- `portainer running true`.
### 2026-06-08 Reboot Blocker Follow-up In Progress
- User reported host reboot validation was not clean: many containers were killed with SIGKILL during reboot/shutdown, one crashed, a couple stopped, and IndeeHub was stopped after boot.
- Treat this as a failed reboot gate. Do not call the release ready until post-fix reboot iterations are clean.
- Local changes made in this pass:
- hardened `core/archipelago/src/container/prod_orchestrator.rs` IndeeHub stack recovery so reboot reconcile starts existing backend containers through a user scope when possible, waits for backend containers and API dependency DNS, starts/restarts the frontend, verifies it remains running, and verifies host port `7778`;
- hardened `core/container/src/manifest.rs` package validation for app IDs, ports, env keys, capabilities, devices, volume sources/options, network policy, and reviewed host-bind exceptions while preserving all current real manifests;
- updated `tests/lifecycle/remote-lifecycle.sh` so IndeeHub launch validation requires `/nostr-provider.js` to be injected into the HTML and served from the app, preserving the Nostr signer requirement.
- Deployed follow-up backend hash `4108ca146b482c028ae8d7c4bec314b71ef3412f15efd2e61846a2c345b36aba` to `.198`; service active, timers inactive. Focused audit still showed:
- `indeedhub` stuck `stopping` and unhealthy;
- `immich` stopped/unhealthy;
- `tailscale` running/healthy but direct launch `8240` returned `000`;
- `vaultwarden` health RPC errored and launch `8082` returned `000`;
- `btcpay-server` was fine (`23000` returned HTTP 200); user confirmed BTCPay was a wrong-server/slow-app false alarm.
- Targeted diagnostics on `.198` found:
- IndeeHub frontend Podman state `removing`/`stopping` with no `7778` listener;
- Immich server stopped, Redis exited, Postgres unhealthy, no `2283` listener;
- Tailscale listener process existed on `8240`, but direct HTTP still returned `000`; logs show Tailscale is `NeedsLogin`/`WantRunning=false`, so launch must present the login/auth UI rather than a generic daemon endpoint;
- Vaultwarden container was absent; public `package.start vaultwarden` failed on stale/refused Podman socket before local fixes;
- Portainer launches but the environment wizard reports `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`, confirming socket-backed apps are not release-ready.
- Local follow-up fixes after those diagnostics:
- `core/container/src/runtime.rs` now tries `podman rm -f --time 0`, targeted `podman container cleanup`, and another `rm -f` when normal forced remove fails;
- `ensure_user_podman_socket()` now verifies the rootless Podman socket accepts Unix connections, not just that the socket path exists;
- IndeeHub readiness now falls back to platform-managed network-alias presence when `getent` inside the API image cannot prove DNS;
- lifecycle harness now requires Tailscale launch content to look like login/auth UI.
- Local validation passed after those fixes:
- `cargo fmt --manifest-path core/Cargo.toml --all`.
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
- `bash -n tests/lifecycle/remote-lifecycle.sh`.
- `git diff --check`.
- Deployed second follow-up backend hash `06420c0377fff650a2bf3211f13c1e0754bf8df81345b8485f4c9a30cb552439` to `.198`; service active, timers inactive.
- Public RPC recovery attempts on hash `06420c...`:
- `package.restart indeedhub` still failed;
- `package.start immich` accepted async start but app remained `starting` with no `2283` launch;
- `package.start vaultwarden` accepted async start but no `8082` launch appeared;
- `package.restart portainer` failed;
- `package.restart tailscale` accepted async restart but no `8240` launch UI appeared.
- Latest focused probe after hash `06420c...`:
- `tailscale` `running`, `http://192.168.1.198:8240/` returns `000`;
- `immich` `starting`, `http://192.168.1.198:2283/` returns `000`;
- `indeedhub` `stopping`, `http://192.168.1.198:7778/` returns `000`;
- `portainer` `running`, `http://192.168.1.198:9000/` returns `000`;
- `vaultwarden` absent/not listed, `http://192.168.1.198:8082/` returns `000`.
- Conclusion: do not proceed to reboot testing or ISO work. The rootless Podman control-plane/socket health and stuck container-state recovery need a deeper platform fix before lifecycle/reboot gates are meaningful.
- Local validation passed so far:
- `cargo fmt --manifest-path core/Cargo.toml --all`.
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`45 passed`).
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
- `bash -n tests/lifecycle/remote-lifecycle.sh`.
- `git diff --check`.
- A filtered `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago indeedhub` compiled and ran the matching existing IndeedHub test (`1 passed`); it did not exercise the new reboot recovery branch because there is no direct unit for that path yet.
- Next steps:
- deploy the new backend only after approval;
- verify focused `indeedhub,immich,tailscale,vaultwarden,portainer` lifecycle/launch, including IndeeHub Nostr provider check and Portainer socket usability;
- run reboot validation iterations on `.198` only after explicit approval;
- pass threshold: 3 consecutive clean post-fix reboots minimum, 5 preferred for production-release confidence.
- cut and smoke-test the `1.8-alpha` ISO after reboot validation is green.
### Local Release Gate Completion After `.198` App Recovery
- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands.
- Fixed scanner backoff/in-flight skip behavior: skipped scans now bump `scan_tick`, so install/update success paths that kicked the scanner do not wait for their timeout when Podman scan backoff is active.
- Fixed stale crash-recovery unit tests after `should_auto_start_stopped_container` gained the `include_stack_members` flag; coverage now asserts generic boot recovery skips stack helpers while stack recovery can include them.
- Fixed local runtime manifest-port lookup so tests and local backend runs can find workspace `apps/*/manifest.yml` via `CARGO_MANIFEST_DIR`; this covers new public apps such as PhotoPrism.
- Fixed journal usage parsing for real `journalctl --disk-usage` compact output such as `463.9M`.
- Fixed boot-reconciler cadence tests so `without_companion_stage()` also bypasses the global crash-recovery wait gate in tests; production still waits for recovery completion.
- Verified catalog generation is idempotent: `python3 scripts/generate-app-catalog.py` reported `updated 0 fields` for both catalogs.
- Validation passed locally:
- `cargo fmt --manifest-path core/Cargo.toml --all`.
- `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container` (`43 passed`).
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
- `cargo check --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security`.
- `cargo test --manifest-path core/Cargo.toml -p archipelago-performance -p archipelago-security` (`12 security tests passed`; performance has no tests).
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
- `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
- `git diff --check`.
- `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
- Remaining gated item remains host reboot validation on `.198`, only if explicitly approved.
### Frontend Release Gate Completion
- Did not touch `.198`, reboot the host, change timers, or run Podman store-wide commands.
- Found and fixed a mobile app-launch regression in `neode-ui/src/stores/appLauncher.ts`:
- desktop-only new-tab apps still open directly on desktop;
- mobile now routes those apps through the app-session route instead of escaping Archipelago in a new browser tab;
- `dashboardReturnPath()` now tolerates tests/minimal router mocks with no `currentRoute`.
- Updated frontend tests to match current desktop new-tab policy and mobile in-app routing behavior.
- Fixed `AppIconGrid` test setup so it shares the mounted Pinia instance and mocks credential lookup before launch.
- Fixed onboarding retry test timing to cover the actual exponential retry budget.
- Validation passed locally:
- `npm run type-check` from `neode-ui`.
- `npm test` from `neode-ui` (`548 passed`).
- `npm run build` from `neode-ui`.
- `python3 scripts/generate-app-catalog.py` (`updated 0 fields`).
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
- `python3 -m py_compile scripts/generate-app-catalog.py scripts/check-app-catalog-drift.py scripts/app-catalog-image-smoke-test.py`.
- `cmp -s app-catalog/catalog.json neode-ui/public/catalog.json`.
- `git diff --check`.
- Local caveat: `npm ci` is currently blocked because existing `neode-ui/node_modules/@alloc` entries are owned by `root:root`. Existing installed modules were sufficient for type-check, tests, and build. Do not delete or chown this tree without explicit approval.
### Fedimint/File Browser, Nostr/NPM, and IndeedHub Recovery
- Built and deployed backend hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de` to `.198`.
- Fixed UI-facing package health for reachable running apps whose Podman health stayed `starting`, `unhealthy`, or a numeric exit value while the launch port was reachable.
- Confirmed Fedimint Guardian and File Browser were actually reachable; their `server.get-state` package-data now reports healthy instead of “starting up”.
- Fixed Nostr relay port conflict by moving `apps/nostr-rs-relay/manifest.yml` host port from `8081` to `18081`.
- Recovered Nginx Proxy Manager admin launch on `8081`; Nostr now launches on `18081` and no longer captures the NPM launch port.
- Hardened legacy package install so scoped web-app installs use `podman create` plus `systemd-run --user --scope podman start`, avoiding backend-cgroup coupling without hanging the install RPC.
- Recovered IndeedHub without deleting data: started the stopped `indeedhub-minio` dependency, repaired frontend reachability, and verified `7778` returns the app.
- Validation passed:
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
- Focused lifecycle for `indeedhub,nginx-proxy-manager,nostr-rs-relay,fedimint,filebrowser`.
- Direct launch checks returned HTTP `200` for `7778`, `8081`, `18081`, `8175`, and `8083`.
- Broad non-destructive lifecycle passed on live hash `95dfd8530ae9621b2f16da05d2229fe40bed7e5f6e2097cf4c87000fe97b92de`.
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.6G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.
### Deployed Podman Store-Risk Cleanup
- Reviewed release-relevant Podman store/image call sites without running broad Podman store/image commands on `.198`.
- Bounded stack installer image pulls and manual package update image pulls with `kill_on_drop` and 600s timeouts.
- Deployed backend hash `a52a87474c9a788e058ee1da1edd6091ab305594a53e7a153889f77041598ff4` to `.198` with the previous backend backed up under `/usr/local/bin/archipelago.backup-20260608-store-risk-*`.
- Validation passed:
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
- `cargo fmt` from `core/`.
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
- Focused post-deploy lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint,immich,indeedhub,photoprism ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Broad post-deploy non-destructive lifecycle: `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Final `.198` state after validation: `archipelago.service` active; `archipelago-doctor.timer` inactive; `archipelago-reconcile.timer` inactive; `/` at `65%` used with about `9.8G` free; `/var/lib/archipelago` at `10%` used with about `370G` free.
### Release Candidate Backend Restart Validation
- Built and deployed backend hash `e28affdf4c1d3cecbe4c14b0439b53d977ed20873c966c288116601d49dac732` to `.198`.
- Bounded additional Podman store/control probes so image and stack health checks fail fast instead of hanging under `.198` Podman store/socket load.
- Fixed Fedimint health reporting: if Podman health remains `starting` but the app endpoint is reachable, `container-health` can use the reachable cached app fallback.
- Fixed package start/restart fallback for runtime web apps by using `systemd-run --user --scope` for `podman start`, then falling back to direct bounded `podman start`.
- Recovered live Immich without data loss:
- `immich_server` had exited because `/usr/src/app/upload/encoded-video/.immich` could not be written.
- Correct live ownership is still `podman unshare chown -R 0:0 /var/lib/archipelago/immich`, which maps to host UID/GID `1000:1000` and container root ownership.
- A temporary `1000:1000` in-container ownership experiment was reverted because Immich's storage check writes as container root.
- Validation passed on latest hash:
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container`.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
- `python3 scripts/check-app-catalog-drift.py --release --strict`.
- `npm run build` from `neode-ui`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=10 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Backend restart validation followed by focused `fedimint,immich,indeedhub,photoprism` lifecycle passed.
- Post-restart broad non-destructive lifecycle passed.
- Remaining gate before calling this a release: host reboot validation, if approved.
### IndeedHub and Immich Lifecycle Recovery
- Built and deployed backend hash `89dfc3d4e801b35564dc8dc7f4a513028eb7e2027b586e8aad7a0f374e20d6a9` to `.198`.
- IndeedHub focused audit is green after sequencing network alias repair immediately before frontend startup, after dependencies are running.
- Fedimint and NetBird focused audits are green; they were not current blockers after rerun.
- Immich was the broad-audit blocker and is now green:
- dependency readiness accepts healthy Podman health state for `immich_postgres` and `immich_redis` before falling back to slower exec probes;
- `immich_server` startup repairs `/var/lib/archipelago/immich` ownership with `podman unshare chown -R 0:0`, preserving upload data while matching the current rootless container user mapping;
- this fixed the observed `EACCES` on `/usr/src/app/upload/encoded-video/.immich`.
- Validation passed on latest hash:
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=indeedhub ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=fedimint ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=netbird ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=300 tests/lifecycle/remote-lifecycle.sh`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=immich ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Residual risk remains: `.198` still intermittently logs `podman ps -a --format json timed out after 30s` and transient Bitcoin RPC timeouts under load. Continue avoiding store-wide Podman commands.
### Release Refactor Cleanup
- Built and deployed backend hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` to `.198`.
- Legacy package runtime host-port cleanup/repair now derives host ports from manifests when available.
- Hardcoded ports remain only as fallback for legacy/non-manifest apps and extra stale-port cleanup compatibility.
- Removed the duplicate Gitea-specific stale port cleanup helper.
- Validation passed on latest hash:
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Added focused runtime-host-port tests, but local `cargo test --manifest-path ../../core/Cargo.toml -p archipelago runtime_host_ports` did not finish within 5 minutes during compilation.
### Catalog Metadata Generation
- Added `scripts/generate-app-catalog.py` to sync manifest-owned metadata into `app-catalog/catalog.json` and `neode-ui/public/catalog.json`.
- The generator updates fields that manifests already own: `title`, `version`, `description`, `dockerImage`, `category`, `tier`, `icon`, and `repoUrl`.
- The catalog still preserves catalog-only fields such as `author`, `requires`, `featured`, and rich `containerConfig` notes.
- Corrected stale manifest metadata for BotFights, IndeeHub, Gitea, LND, ElectrumX, Fedimint, and Mempool before generation.
- Release catalog drift is now zero:
- `python3 scripts/check-app-catalog-drift.py --release --strict` reports `metadata_drift=0`, `missing_catalog=0`, `missing_manifests=0`.
- Validation passed:
- `jq empty app-catalog/catalog.json neode-ui/public/catalog.json`.
- canonical and UI public catalogs match byte-for-byte.
- `cargo test --manifest-path core/Cargo.toml -p archipelago-container`.
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
- `npm run build` from `neode-ui`.
### Podman Store-Risk Hardening
- Built and deployed backend hash `eaa83c30467acd42ad864a8e0ea0d5fd88b94b775a06bfcdc460c4b0cd8e75b2` to `.198`.
- Fresh local-build installs now treat `podman image exists <local-build-tag>` failure/timeout as "unknown/missing" and rebuild the local image instead of failing the lifecycle operation.
- This keeps local image store checks from being release-blocking while preserving bounded runtime timeouts and matching the existing drift-restart behavior.
- Validation passed on the latest hash:
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
- Added focused unit test coverage for the image-exists failure behavior, but local `cargo test --manifest-path core/Cargo.toml -p archipelago install_fresh_builds_when_image_exists_check_fails` did not complete within 15 minutes during compilation.
### Container Health Fallback and Broad Lifecycle Green
- Built and deployed backend hash `be95ea91339a7fb0a3b20d0ae5d816dca220d5e5ca86838cc0ba50b609ad7b36` to `.198`.
- Fixed `container-health` broad lifecycle timeout behavior:
- `cached_reachable_health()` now parses ports from URLs with trailing slashes correctly, such as `http://localhost:2342/`.
- The local TCP fallback now covers the lifecycle web app ports, including PhotoPrism, BTCPay, LND UI, Mempool, Electrum, Fedimint, Gitea, IndeedHub, Ollama, Vaultwarden, Tailscale, and others.
- Cached-running apps with reachable local TCP listeners can report `healthy` without depending on flaky Podman health/inspect calls.
- Validation passed on the latest hash:
- `cargo check --manifest-path core/Cargo.toml -p archipelago`.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`.
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`.
### Generic Host-Port Health Checkpoint
- Built and deployed backend hash `3912b900c376b6c28bf5453640cae82135f67d7e0f984b8adcc78064b924143b` to `.198`.
- Confirmed objective remains: app behavior should be manifest/platform-primitive owned, not OS-image or per-app backend hack owned.
- Broad lifecycle on `d21202cd...` failed only on Uptime Kuma briefly showing `stopping` during listener repair; it recovered afterward.
- Fixed stale transitional merge: `Stopping -> Running` recovers when no user-stop marker exists; user-initiated stops still keep `Stopping`.
- Health monitor now derives required host TCP ports from Podman JSON `Ports` and marks running containers unhealthy when declared host listeners are missing.
- This is generic host-port health, not an app-specific mapping.
- After deploying `3912b900...`, Uptime Kuma recovered `3002` and returned HTTP `302` after backend restart.
- Jellyfin still needs follow-up: Podman reports `jellyfin Up ... (healthy)` with `0.0.0.0:8096->8096/tcp`, but `ss` shows no `8096` listener and `curl http://192.168.1.198:8096/` fails.
- Follow-up on `be95ea...` resolved the broad lifecycle timeout by hardening `container-health` fallback behavior.
### Stale State and Jellyfin Pasta Listener Hardening
- Built and deployed backend hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` to `.198`.
- `container-list` now overlays cached `exited` entries with targeted live state so scanner backoff does not leave lifecycle/UI reads stuck on stale `exited` after recovery.
- `container-health` now has a bounded cached-running plus local TCP reachability fallback for web apps, reducing dependency on slow/hung Podman inspect paths for health reads.
- Jellyfin was added to legacy runtime host-port repair for pasta listener `8096`.
- `package.restart jellyfin` still exposed a real Podman socket/runtime blocker after stopping the container: `Cannot connect to Podman socket at /run/user/1000/podman/podman.sock: Permission denied`.
- `package.start jellyfin` recovered the app afterward; `jellyfin` became `Up ... (healthy)`, `8096` had a `pasta.avx2` listener, and `http://192.168.1.198:8096/` returned HTTP `302`.
- Focused lifecycle passed on the latest hash:
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Release catalog drift check remains: `missing_catalog=0`, `missing_manifests=0`, `metadata_drift=35`.
### Expanded Cleanup and Store-Safe Uninstall
- Built and deployed backend hash `7f90345b75148b7ed748e1a417f31d1273e1646a9b742891858df11c5397051b` to `.198`.
- Expanded `system.disk-cleanup` to remove old rollback artifacts while keeping newest rollback points:
- `/usr/local/bin/archipelago.backup-*` newest 3.
- legacy `/usr/local/bin/archipelago.bak*` newest 3.
- `/usr/local/bin/archipelago.before-*` newest 3 as part of legacy backend cleanup.
- `/opt/archipelago/web-ui.bak*` newest 3.
- `/opt/archipelago/web-ui.old` included as web UI rollback cleanup.
- Live `system.disk-cleanup` reclaimed `10.3 GB`:
- `Removed old backend backups: 41.6 MB freed`.
- `Removed old legacy backend backups: 3.6 GB freed`.
- `Removed old web UI backups: 6.6 GB freed`.
- `Skipped Podman image/volume prune: Podman store commands can block app health on busy nodes`.
- `/usr/local/bin` dropped to about `336M`.
- `/opt/archipelago` dropped to about `1.1G`.
- Removed global `podman volume prune -f` from uninstall. Uninstall now logs a skip and still removes explicit app data when `preserve_data=false`.
### Startup Scan and Uptime Kuma Fixes
- Startup `adopt_existing()` is bounded with a 35s timeout.
- Initial container scan seeds the same 300s Podman scan backoff used by periodic scans.
- Legacy pasta restart paths use scoped `podman restart` instead of stop+start.
- Uptime Kuma was repaired:
- Before: container internally healthy on `127.0.0.1:3001`, but host `3002` had no pasta listener.
- After: `package.restart uptime-kuma` returns `{"status":"restarted"}` and `http://192.168.1.198:3002/` returns HTTP `302`.
### Cleanup and Catalog Work Already Done
- `system.disk-cleanup` intentionally skips Podman image/volume prune.
- `nostr-rs-relay` was added to both catalog surfaces.
- `scripts/check-app-catalog-drift.py --release --strict` reports zero missing catalog/manifest entries and zero metadata drift after catalog generation.
- Meshtastic `app.files` live behavior was validated: deleting `/var/lib/archipelago/meshtastic/config.yaml` and restarting recreated it from the manifest.
---
## Verification Already Run
- `cargo check --manifest-path core/Cargo.toml -p archipelago -p archipelago-container` passed for the currently deployed release-candidate line.
- `cargo build --manifest-path core/Cargo.toml -p archipelago --bin archipelago --release` passed for the currently deployed release-candidate line.
- Broad lifecycle on current hash `14d360a206d1e58f287c5722d709dace0284b0dea56b66aa4bce0f57c631631b` passed:
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Targeted PhotoPrism audit on current hash passed:
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=photoprism ARCHY_STABILITY_SECONDS=1 ARCHY_TIMEOUT=120 tests/lifecycle/remote-lifecycle.sh`
- Focused lifecycle on current hash `d21202cd79794e3bfc882d37134afd7a41dac766bae386a675714e5fa030e94e` passed:
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Live cleanup RPC passed and reclaimed `10.3 GB`.
- Focused lifecycle after expanded cleanup passed:
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=meshtastic,jellyfin,filebrowser,uptime-kuma ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Before the expanded cleanup pass, broad lifecycle also passed on hash `2b72e83ff368e4a696ad701f8985b0a8e1e889d9f4844056dc063455df973b28`:
- `ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_STABILITY_SECONDS=5 ARCHY_TIMEOUT=900 tests/lifecycle/remote-lifecycle.sh`
- Direct app checks after latest cleanup passed:
- `http://192.168.1.198:3002/` -> HTTP `302`.
- `http://192.168.1.198:8096/` -> HTTP `302` after Jellyfin recovery/start.
- `http://192.168.1.198:8083/` -> HTTP `404` on `/`, which is expected for Filebrowser root probe behavior used here.
### Test Caveat
- Earlier local focused test commands timed out during first-time test binary compilation, but after compilation completed the full backend test target passed: `cargo test --manifest-path core/Cargo.toml -p archipelago --bin archipelago` (`688 passed`).
- Remaining workspace packages also pass checks/tests: `archipelago-container`, `archipelago-performance`, and `archipelago-security`.
---
## Critical Constraints
- Preserve app data.
- `.198` is the active validation node.
- Current live backend hash on `.198`: `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
- Keep `archipelago-doctor.timer` and `archipelago-reconcile.timer` inactive unless explicitly testing them.
- Do not run destructive git commands.
- Do not run Podman store-wide cleanup or broad image/store commands on `.198` without a mitigation plan:
- Avoid `podman system df`.
- Avoid `podman image list` / `podman image ls`.
- Avoid broad `podman image exists` loops.
- Avoid `podman image prune` and `podman volume prune`.
- Podman store commands can hang and block app health under current `.198` load.
- Latest local mitigation: Rust release image-existence probes now use bounded targeted `podman image inspect` instead of `podman image exists` or `podman images -q`.
---
## Current Remaining Blockers
1. Podman socket/store health remains unresolved.
- Need quarantine/mitigation strategy rather than store-wide commands in release paths.
- Current release paths avoid prune and broad image-list/existence commands; orchestrator, companion, and legacy install image checks now use bounded `podman image inspect`.
- Latest concrete failure remains historical: `package.restart jellyfin` stopped the container but failed to complete because Podman reported socket permission/runtime failure. `package.start jellyfin` recovered afterward.
- Latest deployed hash still logged one initial `podman ps -a --format json` scan timeout/backoff, but focused and broad non-destructive lifecycle validation passed.
2. Release code-review/refactor gate is still open.
- Reduce remaining app-specific Rust/OS branches where possible.
- Review scanner, health, reconcile, and install/update paths for performance and store-risk.
- Clean up dead transitional paths.
3. Clean release branch hygiene is not done.
- Worktree is very dirty with many modified and untracked files.
- Do not commit unless explicitly asked.
4. Full production validation still needed.
- Broad non-destructive lifecycle is green on live hash `7e82532137292e91111f63819d1be7fa69f994ce20d6b5e0194915f194f20412`.
- Backend restart validation has passed.
- Run host reboot validation if approved.
- Run selected full lifecycle tests for critical apps if time allows.
---
## Files Changed In Latest Pass
- `core/container/src/runtime.rs`
- Changed Podman runtime `image_exists()` from `podman image exists` to a bounded targeted `podman image inspect` local-storage probe.
- `core/archipelago/src/api/rpc/package/install.rs`
- Replaced legacy `podman images -q` local fallback and post-pull verification checks with bounded targeted `podman image inspect`.
- `core/archipelago/src/container/companion.rs`
- Changed companion image existence checks from `podman image exists` to `podman image inspect`.
- `core/archipelago/src/container/prod_orchestrator.rs`
- Updated image-existence failure test fixture wording for the new `image inspect` probe.
- Validation for latest local mitigation:
- `cargo fmt --all --check` passed.
- `cargo check -p archipelago-container` passed.
- `cargo check -p archipelago` passed.
- `CARGO_INCREMENTAL=0 cargo check -p archipelago --tests` passed.
- `cargo test -p archipelago-container` passed (`43` tests).
- `git diff --check -- <changed files>` passed.
- Filtered `cargo test -p archipelago install_fresh_build` did not complete: one run hit a `rust-lld` undefined hidden symbol artifact/link failure after concurrent Cargo jobs; the sequential `CARGO_INCREMENTAL=0` rerun exceeded 10 minutes during compile, but test-target compilation passed afterward.
- `core/archipelago/src/api/rpc/system/handlers.rs`
- Calls expanded rollback cleanup helpers and reports reclaimed bytes.
- `core/archipelago/src/api/rpc/system/mod.rs`
- Added cleanup helpers for legacy backend backups and web UI rollback backups.
- Uses size accounting for directories before removal.
- Keeps newest rollback artifacts instead of deleting all.
- `core/archipelago/src/api/rpc/package/runtime.rs`
- Skips global `podman volume prune -f` during uninstall.
- Adds Jellyfin `8096` to runtime host-port/pasta cleanup repair.
- Derives legacy runtime host-port cleanup/repair ports from manifests.
- Keeps compatibility fallback ports for legacy/non-manifest apps and removes duplicate Gitea stale-port cleanup code.
- `core/archipelago/src/api/rpc/container.rs`
- Adds stale cached `exited` refresh for `container-list`.
- Adds cached-running plus local TCP reachability fallback for `container-health`.
- Fixes fallback URL port parsing and expands lifecycle web app port coverage.
- `core/archipelago/src/container/prod_orchestrator.rs`
- Rebuilds local-build images when `image_exists` fails/times out instead of failing fresh install.
- Adds focused unit test coverage for that behavior.
- `scripts/generate-app-catalog.py`
- Generates/syncs public catalog metadata from manifest-owned fields.
- `app-catalog/catalog.json` and `neode-ui/public/catalog.json`
- Generated from current manifests; files match byte-for-byte.
- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
- Added latest deployment, cleanup, validation, and residual-risk checkpoint.
- `docs/MIGRATION_STATUS_REPORT.md`
- Updated current hash, root disk state, and remaining blockers.
- `docs/RESUME.md`
- This file, replacing stale April migration resume content.
---
## Suggested Next Steps
1. Re-read the three docs:
- `docs/RESUME.md`
- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
- `docs/MIGRATION_STATUS_REPORT.md`
2. Verify latest `.198` state:
- `ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 'df -h / /var/lib/archipelago; systemctl is-active archipelago.service; systemctl is-active archipelago-doctor.timer 2>/dev/null || true; systemctl is-active archipelago-reconcile.timer 2>/dev/null || true; sha256sum /usr/local/bin/archipelago'`
3. Start Podman-store-risk review:
- Search for image/store operations: `image_exists`, `podman image`, `podman system`, `podman prune`, `volume prune`.
- Prefer targeted container status/API calls with timeouts.
- Avoid new broad store commands.
4. Continue release code-review/refactor cleanup.
5. If approved, run backend-restart validation and then host-reboot validation.
---
## Current Release Readiness Estimate
- Credible release candidate: closer now, roughly `87-91%`.
- Production-quality release developers will love: still closer to `73-79%`.
The biggest improvement in the latest pass is that broad lifecycle is green again on the latest backend. The biggest remaining technical risk is Podman store/socket health.

View File

@ -1,56 +0,0 @@
# Session 2026-03-18 — Resume Guide
## What Was Done
### Rootless Podman Migration (TASK-11 DONE)
- .228: 30 containers running rootless with full security hardening
- All `sudo podman` removed from Rust backend (9 files) + deploy script
- UID mapping: container UID N → host UID (100000 + N - 1)
- Deploy script auto-fixes ownership + sysctl + linger on every deploy
### .198 Migration (IN PROGRESS)
- Root containers stopped, UID ownership fixed, IndeedHub images migrated
- `/etc/hosts` fixed to 644 (rootless podman needs read access)
- **Only 2 containers running — needs full container recreation**
- Next: run container setup (Bitcoin, LND, ElectrumX, all apps)
- The `--both` deploy only copies binary+frontend, doesn't create containers
### Security Hardening (TASK-8 — 9/12 pentest findings fixed)
- C1: /lnd-connect-info requires session auth
- C3: DEV_MODE removed from production service
- H1: node-message verifies ed25519 signatures
- M1: content.add rejects `..` path traversal
- M2: NIP-07 postMessage uses specific origin
- M3: AIUI nginx checks session_id cookie
- L2: Strict v3 onion validation
- **Still open**: H2/H3 (federation signature verification), H4 (bind ports to 127.0.0.1)
### UI/UX Fixes
- Mesh serial: auto-detect, backoff, udev rule, Connect button
- External iframes: CSP https: added
- Container startup: "Checking..." shimmer, marketplace sort
- Port mapping: all nginx+frontend+backend synced
- ElectrumX: shows index size during indexing
- Fedimintd → "Fedimint Guardian"
- IndeedHub Studio version
- On-Chain first in receive modals
- Tab-launch icons, iframe error screen, CPU alert threshold
- Mesh mobile: header hidden, overflow fixed
- Federation/Cloud: DID on hover
### Git Tags
- v1.2.0-alpha.1 through v1.2.0-alpha.8 (current)
## Resume Checklist
1. **Finish .198 containers** — create Bitcoin, LND, ElectrumX, MariaDB, Mempool, BTCPay, Grafana, etc.
2. **H2/H3** — federation peer-joined/address-changed signature verification
3. **H4** — bind service ports to 127.0.0.1
4. **BUG-1** — CSRF mismatch (P0 critical)
5. **Many /task items** in MASTER_PLAN.md from testing session
6. **Tailscale migration** for other nodes (preserve auth state)
## Key Facts
- Rootless subnet: 10.89.0.0/16
- Bitcoin RPC: rpcallowip=0.0.0.0/0, password in /var/lib/archipelago/secrets/
- .198 /etc/hosts must be 644
- Deploy --both only copies, --live creates containers

View File

@ -1,653 +0,0 @@
> gitea app icon is still missing.
> and we have a container called “bold_lichterman” which I have no idea what it is
> great, let's finish it off
# Session Resume - 2026-04-24
## Latest user directives (must be followed first)
> please continue, please state my last comment in the resume doc and first before making this plan to adhere to
> And we need to get every container working on .116 and tested before we release
> we have no time requirements so the best path is the way
> Continue, leave release gate as a reminder later it wont happen for a while
> we only work via fuse thinkpad
> all code has to be local changes to .116 (that machine) code and repo
> we are not working on this machine is why, I removed it so you would never accidentally work here, we are doing all code on .116 Projects/archy repo
> we're using paths instead of port which seems to be causing issues again, launch and tab should use port no? Please confirm this is correct as paths have never worked.
> A lot of the apps aren't loading properly, did you screw all the apps up with this wrong approach?
Adherence for current session:
- Before proposing or executing a plan, record the latest directive in this `SESSION-RESUME` doc first.
- Release gate is now explicit: `.116` required containers must be working and tested before release.
- No time constraint: choose the most correct long-term architecture/stability path even if it takes significantly longer.
- Release gate remains required, but treat it as a later checkpoint reminder while long-running sync/migration work continues.
- Runtime stabilization on `.116` is immediate priority; keep migration work aligned with this gate.
- Work context is strictly the `.116` repo via FUSE thinkpad mount; do not make/code against any non-`.116` local workspace.
## Goal in progress
Move package lifecycle to orchestrator-first behavior with automated proof gates, while keeping safe legacy fallback during migration.
## Work completed in this session
### Step 8b.1 wiring progress (orchestrator runtime parity)
- Implemented orchestrator-side resolution for new manifest fields in `core/archipelago/src/container/prod_orchestrator.rs`:
- resolve `container.derived_env` from detected host facts (`HOST_IP`, `HOST_MDNS`, `DISK_GB`) before create
- resolve `container.secret_env` from `/var/lib/archipelago/secrets/<name>` before create
- apply `container.data_uid` with pre-create recursive `chown -R UID:GID` on bind-mounted volume sources
- Added unit coverage in `prod_orchestrator.rs` for:
- derived+secret env resolution reaching `create_container`
- data_uid ownership path executing prior to create/start
- Extended Podman create payload mapping in `core/container/src/podman_client.rs` to honor:
- `container.network` (with legacy `security.network_policy` fallback)
- `container.entrypoint`
- `container.custom_args` as command args
- `volumes.type=tmpfs` with `tmpfs_options`
### Step 8b.2 first backend manifest port started (fedimint)
- Ported `apps/fedimint/manifest.yml` from legacy `container-specs.sh` behavior:
- image corrected to `git.tx1138.com/lfg2025/fedimintd:v0.10.0`
- network set to `archy-net`
- bitcoin RPC target corrected to `bitcoin-knots:8332`
- `FM_BIND_P2P` / `FM_BIND_API` / `FM_BIND_UI` aligned with spec
- `FM_P2P_URL` / `FM_API_URL` migrated to `derived_env` with `HOST_MDNS`
- `FM_BITCOIND_PASSWORD` migrated to `secret_env` from `bitcoin-rpc-password`
- data dir ownership mapping set with `data_uid: "100000:100000"`
### Step 8b.2 continued (fedimint-gateway manifest added)
- Added `apps/fedimint-gateway/manifest.yml` with a shell entrypoint wrapper matching legacy two-path behavior:
- if LND cert+macaroon are present, starts `gatewayd ... lnd --lnd-rpc-host lnd:10009 ...`
- otherwise starts `gatewayd ... ldk --ldk-lightning-port 9737 ...`
- Manifest uses new schema fields now wired in orchestrator runtime:
- `network: archy-net`
- `entrypoint` + `custom_args` (dynamic runtime command)
- `secret_env` for `FM_BITCOIND_PASSWORD` and `FEDI_HASH`
- `data_uid: "100000:100000"`
- Note: unlike legacy script, this manifest declares both `8176` and `9737` host ports statically; runtime branch still selects LND-vs-LDK execution at startup.
### Step 8b.3 started (filebrowser baseline service)
- Added `apps/filebrowser/manifest.yml` to port baseline filebrowser from legacy specs/first-boot behavior:
- image: `git.tx1138.com/lfg2025/filebrowser:v2.27.0`
- `network: archy-net`
- `custom_args: ["--config", "/data/.filebrowser.json"]`
- `data_uid: "100000:100000"`
- capabilities include `NET_BIND_SERVICE` + legacy rootless write caps
- binds `/var/lib/archipelago/filebrowser``/srv` and `/var/lib/archipelago/filebrowser-data``/data`
- Added orchestrator pre-start hook for `filebrowser` in `core/archipelago/src/container/filebrowser.rs` and wired in `prod_orchestrator`:
- ensures root directories exist (`Documents`, `Photos`, `Music`, `Downloads`, `Builds`)
- writes `/var/lib/archipelago/filebrowser-data/.filebrowser.json` if missing (atomic tmp+rename)
- keeps behavior idempotent (no rewrite if config already exists)
### Step 8b.3 continued (electrumx manifest added)
- Added `apps/electrumx/manifest.yml` with spec-faithful baseline:
- image `git.tx1138.com/lfg2025/electrumx:v1.18.0`
- network `archy-net`
- bind mount `/var/lib/archipelago/electrumx:/data`
- electrum TCP port `50001:50001`
- `secret_env` for Bitcoin RPC password
- shell entrypoint wrapper that exports `DAEMON_URL` with secret at runtime before launching `electrumx_server`
- keeps `COIN`, `DB_DIRECTORY`, `SERVICES` env aligned with legacy behavior
### Step 8b.3 continued (bitcoin-knots + lnd manifest reconciliation)
- Reconciled `apps/bitcoin-core/manifest.yml` toward production `bitcoin-knots` behavior while keeping app id stable:
- added `container_name: bitcoin-knots` to preserve adoption of existing container name
- switched image to `git.tx1138.com/lfg2025/bitcoin-knots:latest`
- set `network: archy-net`
- added dynamic startup command (prune-vs-full-node) using `custom_args` and `DISK_GB` from `derived_env`
- added `secret_env` for Bitcoin RPC password and `data_uid: "100101:100101"`
- Reconciled `apps/lnd/manifest.yml` to legacy/runtime expectations:
- image updated to `git.tx1138.com/lfg2025/lnd:v0.18.4-beta`
- network set to `archy-net`
- capabilities aligned with spec (`CHOWN`, `FOWNER`, `SETUID`, `SETGID`, `DAC_OVERRIDE`, `NET_RAW`)
- bitcoin backend host corrected to `bitcoin-knots`
- RPC password moved to `secret_env` from `bitcoin-rpc-password`
- data ownership mapping set via `data_uid: "100000:100000"`
### Step 8b.3 continued (mempool + btcpay companion manifests)
- Added new manifests for stack companions previously only defined in `container-specs.sh`:
- `apps/archy-mempool-db/manifest.yml`
- `apps/mempool-api/manifest.yml`
- `apps/archy-mempool-web/manifest.yml` (with `container_name: mempool` to preserve existing frontend container adoption)
- `apps/archy-btcpay-db/manifest.yml`
- `apps/archy-nbxplorer/manifest.yml`
- Reconciled `apps/btcpay-server/manifest.yml` toward runtime stack parity (image/tag/network/ports/env/deps aligned to legacy stack installer).
### Step 8b.5 progress (update path: orchestrator-first recreate)
- Updated `core/archipelago/src/api/rpc/package/update.rs` recreate path to avoid hard dependency on `reconcile-containers.sh`:
- after stop/pull/rm, each container recreate now tries orchestrator `install(app_id)` first using container-name alias candidates
- includes alias mapping for known name/app-id mismatches (`bitcoin-knots``bitcoin-core`, `archy-*` aliases, `mempool``archy-mempool-web`)
- on orchestrator miss/error, falls back to legacy reconcile script path (safe migration fallback retained)
- rollback path now reuses the same orchestrator-first recreate helper instead of invoking reconcile directly
- Added unit test coverage for alias candidate generation in update module tests.
### .116 release-gate automation scaffold started
- Added read-only required-stack lifecycle suite for `.116` in `tests/lifecycle/bats/required-stack.bats`:
- asserts required containers are present + running
- probes core endpoints (bitcoin RPC, electrumx TCP, lnd getinfo, mempool API/frontend, bitcoin-ui, lnd-ui)
- Updated `tests/lifecycle/run.sh` so no-auth read-only suites can run with `ARCHY_ALLOW_NOAUTH=1` (password still required for RPC-auth suites).
### Stack install path migration progress (orchestrator-first)
- Updated `core/archipelago/src/api/rpc/package/stacks.rs`:
- added orchestrator-first stack installer helper (`install_stack_via_orchestrator`) with legacy stack fallback
- wired helper into `install_btcpay_stack` and `install_mempool_stack`
- fixed mempool legacy fallback drift:
- adopt checks now include current frontend container name `mempool`
- root DB secret name corrected to `mysql-root-db-password`
- backend host env aligned to `electrumx` and `bitcoin-knots` on `archy-net`
- Expanded orchestrator install allowlist in `core/archipelago/src/api/rpc/package/install.rs` to include newly ported backend/companion apps.
### Legacy config drift cleanup (package config helpers)
- Updated legacy `get_app_config` paths in `core/archipelago/src/api/rpc/package/config.rs` to match current `.116` runtime topology and secrets:
- moved host-based RPC/electrum endpoints to in-network service names (`bitcoin-knots`, `electrumx`, `mempool-api`, `archy-nbxplorer`)
- corrected mempool mysql root secret fallback name to `mysql-root-db-password`
- aligned btcpay and fedimint bitcoin RPC URLs to `bitcoin-knots` service target
- removed LND host-based ZMQ defaults in legacy args path and aligned bitcoind RPC host to `bitcoin-knots:8332`
### Step 8b migration tightening (install/update/stack policy)
- `core/archipelago/src/api/rpc/package/update.rs`
- moved `btcpay-server` and `mempool` out of forced legacy-update list (now orchestrator-first update candidates)
- kept safe legacy-update routing for still-unported stack families (`immich`, `penpot`, `indeedhub`, `fedimint`)
- `core/archipelago/src/api/rpc/package/stacks.rs`
- extracted canonical stack app-id sets for BTCPay and mempool and added unit test coverage to prevent drift
- `core/archipelago/src/api/rpc/package/install.rs`
- tests updated to assert expanded orchestrator-install allowlist for newly ported backend/companion apps
### Continued migration + test gate expansion
- `core/archipelago/src/api/rpc/package/update.rs`
- moved `fedimint` out of forced legacy-update list (now orchestrator-first update candidate with fallback)
- `core/archipelago/src/api/rpc/package/config.rs`
- removed obsolete mempool data-dir cleanup target (`/var/lib/archipelago/mempool-electrs`) to match current stack shape
- Added destructive required-stack lifecycle suite:
- `tests/lifecycle/bats/required-stack-destructive.bats`
- gated by `ARCHY_ALLOW_DESTRUCTIVE=1`; restarts required service containers and verifies endpoint recovery
- keeps destructive checks explicit and opt-in during migration work
- added restart retry and HTTP readiness polling to absorb transient podman/pasta port-bind races during rapid restart cycles on `.116`
### Validation run notes (latest)
- `.116`: `cargo test -p archipelago api::rpc::package::update::tests` -> PASS (4/4)
- `.116`: `cargo test -p archipelago api::rpc::package::config::tests` -> no direct tests matched filter (0 run, no failures)
- `.116`: `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ALLOW_NOAUTH=1 tests/lifecycle/run.sh required-stack-destructive` -> PASS (3/3) after restart retry/readiness hardening
### Added next lifecycle gate (in progress)
- Added `tests/lifecycle/bats/package-update-smoke.bats`:
- destructive RPC-authenticated update smoke for `package.update` on `bitcoin-ui`
- optional stack smoke for `mempool` behind `ARCHY_ALLOW_STACK_UPDATE=1`
- Updated `tests/lifecycle/run.sh` usage examples with `package-update-smoke` target
- First `.116` run attempt blocked by missing `ARCHY_PASSWORD` environment variable (expected for auth-required suite)
### Newly observed UI routing issue (user report)
- Report: launching **Grafana** opens **Gitea** instead of Grafana.
- Likely collision/drift area to validate and fix:
- `core/archipelago/src/api/rpc/package/config.rs` currently maps both apps into the 3000/3001 neighborhood (`grafana` host `3000`, `gitea` host `3001` + historical nginx iframe comments).
- `neode-ui/src/stores/appLauncher.ts` resolves app sessions by URL port (`3000 -> grafana`), so stale/misrouted backend launch URLs or proxy rules can misdirect launches.
- Add regression checks after fix:
- container-list launch URL for grafana resolves to grafana service endpoint
- launching grafana from UI does not route to gitea content
### Grafana->Gitea misroute remediation (current)
- Root cause confirmed: legacy `gitea-iframe.conf` bound host port `3000`, colliding with Grafana launch expectations.
- Fixes applied:
- `core/archipelago/src/api/rpc/package/install.rs`
- stop deploying gitea dedicated nginx server on `3000`
- remove stale `/etc/nginx/conf.d/gitea-iframe.conf` during gitea install path
- set Gitea `ROOT_URL` to `http://<host>/app/gitea/`
- `image-recipe/configs/nginx-archipelago.conf`
- `/app/gitea/` proxy now targets `127.0.0.1:3001` (not `3000`)
- `image-recipe/configs/snippets/archipelago-https-app-proxies.conf` and `scripts/nginx-https-app-proxies.conf`
- added explicit `/app/gitea/ -> 127.0.0.1:3001`
- `neode-ui/src/views/appSession/appSessionConfig.ts`
- moved gitea away from direct port `3000`; route via proxy path mapping
- `neode-ui/src/stores/appLauncher.ts`
- `resolveAppIdFromUrl()` now recognizes `/app/{id}/` path-based URLs before port mapping
- `neode-ui/src/stores/__tests__/appLauncher.test.ts`
- added regression test for `/app/gitea/` routing
- Validation:
- `.116` vitest launcher suite passes (`12/12`) with gitea path regression test.
- removed live `/etc/nginx/conf.d/gitea-iframe.conf` on `.116` and reloaded nginx.
- Current runtime note:
- `gitea` container running on `3001`; `grafana` container not currently running on `.116`, so direct `/app/grafana/` proxy check returns 502 until Grafana is started.
### User directive (latest)
- Root cause to address later in planned sequence: **Grafana and Gitea must not share/clash ports**.
- Treat this as a dedicated root-fix item when we reach that phase; continue broader Step 8b migration/testing work in the meantime.
### Workflow note
- Todo list maintenance explicitly requested; keep statuses current as work advances to avoid stale execution state.
### Validation run notes (latest continuation)
- `.116`: `tests/lifecycle/run.sh required-stack-destructive` with `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ALLOW_NOAUTH=1` -> PASS (3/3)
- `.116`: `cargo test -p archipelago api::rpc::package::update::tests` -> PASS (4/4)
- `.116`: `cargo test -p archipelago api::rpc::package::stacks::tests` -> PASS (1/1)
- `.116`: `cargo test -p archipelago api::rpc::package::install::tests` -> PASS (3/3)
### Validation run notes (latest continuation 2)
- `.116`: `tests/lifecycle/run.sh package-update-smoke` with `ARCHY_PASSWORD=archipelago ARCHY_ALLOW_DESTRUCTIVE=1` -> PASS (`bitcoin-ui` smoke passed; `mempool` optional test skipped without `ARCHY_ALLOW_STACK_UPDATE=1`)
- `.116`: `tests/lifecycle/run.sh required-stack` with `ARCHY_ALLOW_NOAUTH=1` -> PASS (9/9)
- `.116`: `tests/lifecycle/run.sh required-stack-destructive` with `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ALLOW_NOAUTH=1` -> PASS (3/3)
- `.116`: `cargo test -p archipelago api::rpc::package::install::tests` -> PASS (4/4) after alias mapping additions
- `.116`: `cargo test -p archipelago api::rpc::package::update::tests` -> PASS (5/5) after alias mapping additions
- `.116`: `cargo test -p archipelago api::rpc::package::stacks::tests` -> PASS (1/1)
### Step 8b alias parity improvements
- `core/archipelago/src/api/rpc/package/install.rs`
- added orchestrator install app-id normalization (`bitcoin-knots -> bitcoin-core`, `electrs/mempool-electrs -> electrumx`)
- expanded orchestrator install allowlist to include alias IDs for parity with scanner/runtime naming
- added unit test: `install_aliases_map_to_manifest_app_ids`
- `core/archipelago/src/api/rpc/package/update.rs`
- added orchestrator update app-id normalization for same alias set
- orchestrator upgrade/health now uses normalized app-id while preserving package-level progress/state semantics
- added unit test: `update_aliases_map_to_manifest_app_ids`
### Lifecycle hardening + full-suite pass
- `tests/lifecycle/lib/rpc.bash`
- `wait_for_container_status` now uses `container-list` state first and uses `container-status` with `app_id` fallback (instead of stale `name` param)
- `tests/lifecycle/bats/bitcoin-knots.bats`
- made `container-status` assertion resilient to alias-migration drift by accepting either valid `container-status` result or valid `container-list` state for `bitcoin-knots`
- `.116`: full lifecycle suite pass
- `ARCHY_PASSWORD=archipelago ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ALLOW_NOAUTH=1 tests/lifecycle/run.sh`
- result: `1..25`, all passing (with expected optional skips)
### Release-gate runtime status (latest)
- `.116` Bitcoin Knots chain sync remains in early IBD:
- `blocks=0`, `headers=342297`, `verificationprogress=7.28959974719862e-10`, `initialblockdownload=true`
- Several non-required containers remain unhealthy/exited and are not part of current required-stack release gate:
- examples: `homeassistant`, `immich_server`, `uptime-kuma`, `jellyfin`, `photoprism`, `vaultwarden`, `nextcloud`, `searxng`
### Runtime diagnostics note (non-blocking to Step 8b lane)
- Grafana container on `.116` required mapped UID ownership (`100472:100472`) on `/var/lib/archipelago/grafana` to run under rootless user-namespace mapping.
- Active nginx on `.116` still had `/app/gitea/` upstream pointing to `127.0.0.1:3000` prior to full config rollout; corrected live config to `3001` and reloaded.
- Per user directive, the root architectural fix for Grafana/Gitea port separation remains a planned dedicated step (not closed yet).
### Current `.116` proof status (latest run)
- Rust tests on `.116` all green for migration slices:
- `api::rpc::package::install::tests`
- `api::rpc::package::update::tests`
- `api::rpc::package::stacks::tests`
- `container::prod_orchestrator::tests`
- `archipelago-container manifest::tests::parse_every_real_manifest`
- `.116` required-stack lifecycle suite (`tests/lifecycle/bats/required-stack.bats`) re-run and passing (9/9).
### Automated `.116` gate execution now running in-loop
- Re-ran `tests/lifecycle/bats/required-stack.bats` on `.116` (read-only gate suite): all checks passing.
- Re-ran Rust migration tests on `.116` after code updates:
- `api::rpc::package::install::tests`
- `api::rpc::package::update::tests`
- `container::prod_orchestrator::tests`
- `archipelago-container manifest::tests::parse_every_real_manifest`
- all passing.
### Runtime stabilization update on `.116` (release-gate work)
- User directive recorded: all required containers on `.116` must be working and tested before release; no time constraint, choose best path.
- Best-path decision applied: move Bitcoin node to full mode (`txindex=1`, non-pruned) and rebuild chain state/indexes for durable ElectrumX/mempool compatibility.
Actions taken:
- Wrote `/var/lib/archipelago/bitcoin/bitcoin_rw.conf` with full-mode settings:
- `server=1`
- `txindex=1`
- `rpcbind=0.0.0.0:8332`
- `rpcallowip=0.0.0.0/0`
- `listen=1`
- `bind=0.0.0.0:8333`
- Recreated `bitcoin-knots` with proper caps and `-reindex` startup.
- Confirmed node is running non-pruned and syncing from genesis; sample check showed `blocks=5954`, `headers=946415`, `pruned=false`, `txindex thread` active.
- Recreated `electrumx` on `archy-net` with a real `/var/lib/archipelago/electrumx` data mount.
- Corrected mempool MariaDB data ownership mapping mismatch (`/var/lib/archipelago/mysql-mempool` to `100998:100998`) so tables are readable by the container's mysql user.
- Restarted dependent containers (`lnd`, `electrumx`, `mempool-api`) after Bitcoin mode switch.
Current status snapshot:
- `bitcoin-knots`: running, healthy, full reindex in progress.
- `electrumx`: running, initial sync catch-up in progress.
- `lnd`: running; health status noisy due to startup/wallet/macaroon checks while chain backend is syncing.
- `mempool-api`: running but endpoint still timing out during early-chain synchronization and repeated difficulty-update retries.
Important note:
- Because the node has been reset to a full reindex from genesis, downstream service health is expected to remain transitional until sufficient chain progress is reached. Release gate is still open (not yet met).
### 1) Orchestrator-first update path (partial migration)
- File: `core/archipelago/src/api/rpc/package/update.rs`
- Change:
- `handle_package_update` now attempts `orchestrator.upgrade(package_id)` first when eligible.
- Falls back to legacy update flow for stack/legacy packages.
- Handles `unknown app_id` from orchestrator as a non-fatal fallback case.
### 2) Orchestrator-first install path (initial allowlist)
- File: `core/archipelago/src/api/rpc/package/install.rs`
- Change:
- `handle_package_install` now attempts `orchestrator.install(package_id)` first for allowlisted apps:
- `bitcoin-ui`
- `electrs-ui`
- `lnd-ui`
- Other apps remain on legacy install path for now.
- Handles `unknown app_id` fallback to legacy installer.
### 3) Added unit tests
- `core/archipelago/src/api/rpc/package/update.rs`
- path-selection tests for orchestrator vs legacy.
- `core/archipelago/src/api/rpc/package/install.rs`
- allowlist tests for orchestrator-first install.
### 4) Test commands run and status
- Ran:
- `cargo test -p archipelago api::rpc::package::install::tests`
- `cargo test -p archipelago api::rpc::package::update::tests`
- Result: passing.
## Validation commands for target hosts
### Local host
```bash
ssh localhost 'sudo systemctl restart archipelago && sleep 2 && systemctl --no-pager --full status archipelago | sed -n "1,60p"'
```
### Remote host (.228)
```bash
ssh archipelago@192.168.1.228 'sudo systemctl restart archipelago && sleep 2 && systemctl --no-pager --full status archipelago | sed -n "1,60p"'
```
### Check orchestrator-path logs
```bash
ssh archipelago@192.168.1.228 'journalctl -u archipelago -n 300 --no-pager | egrep "INSTALL ORCH|UPDATE ORCH|unknown app_id|legacy flow"'
```
### Check container states
```bash
ssh archipelago@192.168.1.228 'podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.Image}}"'
```
## Recommended next steps
1. Expand orchestrator-install allowlist beyond UI apps to additional single-container manifest-backed apps.
2. Migrate stack updates (`mempool`, `btcpay`, `immich`, `indeedhub`) to orchestrator-driven stack plans.
3. Unify graceful stop timeout behavior in orchestrator runtime path for stateful apps.
4. Add SSH-driven integration tests (local + `.228`) as a release gate.
## 2026-04-24 15:10 UTC — continuity checkpoint (auto-memory)
- User requested: keep working continuously and always update resume memory before any stop.
- Persisted code changes deployed to `/usr/local/bin/archipelago` on `.116`:
- `core/archipelago/src/api/rpc/package/config.rs`
- `immich` stack uses public `docker.io/valkey/valkey:7-alpine`.
- Healthcheck defaults hardened:
- `searxng` uses `wget` probe (image lacks curl).
- `botfights` uses node-based fetch probe for `/api/health`.
- `nextcloud` uses reachability probe (`curl -s -o /dev/null .../status.php`).
- `portainer` healthcheck disabled by default (`return vec![]`) to avoid false unhealthy flap.
- Portainer socket mount path updated to rootless user socket:
- `/run/user/1000/podman/podman.sock:/var/run/docker.sock`.
- `core/archipelago/src/api/rpc/package/install.rs`
- `create_data_dirs()` fallback chown flow guarded for UID mapping (no underflow path when host UID is root-mapped 1000).
- Validation run on `.116`:
- `cargo fmt --all`
- `cargo test -p archipelago api::rpc::package::stacks::tests`
- `cargo test -p archipelago api::rpc::package::install::tests`
- All passing (warnings only).
- Runtime state after redeploy + reinstall checks:
- Healthy: `botfights`, `searxng`, `nextcloud`, `immich_postgres`, `immich_redis`; `immich_server` running and ping OK.
- `portainer` running with no healthcheck (`health=none`) per persisted default.
- Required Bitcoin stack remains up (`bitcoin-knots`, `lnd`, `mempool-api`, `mempool`, `electrumx`, UIs).
- Intentional unresolved blocker: `uptime-kuma` stays `Created` due planned root fix (`gitea` occupies host `3001`).
- Note: `nextcloud` private-registry pull failed; public literal install path works (`docker.io/library/nextcloud:28`) and is now healthy.
## 2026-04-24 15:20 UTC — continuation checkpoint
- Continued per request; no stop.
- Lifecycle regression fixed and verified:
- `tests/lifecycle/lib/rpc.bash` `wait_for_container_status()` fallback now maps aliases:
- `bitcoin-knots` -> `bitcoin-core`
- `electrs` / `mempool-electrs` -> `electrumx`
- This resolved flaky failure in `bats/bitcoin-knots.bats` stop/start wait path.
- Full lifecycle suite rerun:
- `ARCHY_PASSWORD=archipelago ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ALLOW_NOAUTH=1 tests/lifecycle/run.sh`
- Result: `1..25` all passing (same optional skips as before).
- Runtime parity snapshot remains:
- Healthy/running: required Bitcoin stack, `immich_*`, `botfights`, `searxng`, `nextcloud`.
- `portainer` running with no healthcheck (`health=none`) by persisted default.
- Intentional remaining blocker unchanged: `uptime-kuma` `Created` due `gitea`/`3001` root conflict (deferred to root fix lane).
## 2026-04-25 09:35 UTC — continuation checkpoint
- Re-ran full lifecycle with stack update smoke enabled:
- `ARCHY_PASSWORD=archipelago ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ALLOW_NOAUTH=1 ARCHY_ALLOW_STACK_UPDATE=1 tests/lifecycle/run.sh`
- Result: `1..25` all passing (including optional test 13).
- Container/endpoint parity check post-suite:
- Required Bitcoin stack remains up; HTTP endpoints for mempool API/web + bitcoin/lnd UI respond.
- Immich still healthy (`/api/server/ping` -> `pong`).
- Non-required app states stable from previous hardening (`botfights`, `searxng`, `nextcloud` healthy; `portainer` running with no healthcheck).
- Planned unresolved conflict unchanged: `uptime-kuma` still `Created` due `gitea` occupying host `3001`.
- Bitcoin sync status snapshot (for release-gate context):
- `blocks=0`, `headers=392976`, `initialblockdownload=true`, `verificationprogress~7.29e-10`, `pruned=false`.
## 2026-04-25 13:55 UTC — continuation checkpoint
- Continued stabilization after all lifecycle passes.
- Added noise-reduction tweak in `core/archipelago/src/electrs_status.rs`:
- Bitcoin RPC failures in ElectrumX status cache are now classified with `is_transient_error(...)`.
- Transient connection-style failures log at `debug` instead of `warn`.
- Non-transient failures still log as `warn`.
- Built + deployed updated backend binary and restarted `archipelago` service (`active`).
- Post-deploy runtime snapshot unchanged/stable:
- Healthy: required Bitcoin stack, `immich_postgres`, `immich_redis`, `botfights`, `searxng`, `nextcloud`.
- Running: `immich_server`.
- Known deferred blocker unchanged: `uptime-kuma` remains `Created` due `gitea` on host port `3001`.
## 2026-04-25 14:20 UTC — continuation checkpoint
- User directive recorded first for this continuation:
- "its on the thinkpad in projects/archy via fuse drive or ssh"
- "whatever the best access method is"
- Switched active workspace to the `.116` repo via FUSE mount:
- `/Users/dorian/mnt/archy-thinkpad`
- Root cause confirmed for current `package.update bitcoin-ui` blocker:
- Service is running with `ARCHIPELAGO_DEV_MODE=true`, so orchestrator `upgrade()` resolves through `DevContainerOrchestrator::load_manifest_for()`.
- Dev manifest loader only searched legacy path `<data_dir>/apps/<app_id>/manifest.yml` (`/var/lib/archipelago/apps/...`), which is missing on `.116`.
- Production manifests are under `/opt/archipelago/apps` (and repo-local `/home/archipelago/Projects/archy/apps` on dev nodes), causing orchestrator update to fail with missing manifest.
- Fix applied:
- `core/archipelago/src/container/dev_orchestrator.rs`
- `load_manifest_for()` now searches manifest locations in this order:
1. `$ARCHIPELAGO_APPS_DIR`
2. `/opt/archipelago/apps`
3. `/home/archipelago/Projects/archy/apps`
4. `<data_dir>/apps` (legacy fallback)
- Added helper `candidate_manifest_paths(...)` with de-dup logic.
- Added unit test coverage for fallback path inclusion.
- Validation attempt:
- Ran `cargo fmt --all && cargo test -p archipelago container::dev_orchestrator::tests` from `core/`.
- Local FUSE-mounted build failed early with Rust toolchain environment issue:
- `error[E0463]: can't find crate for parking_lot_core`
- Code compiles were not validated in this host context; next validation should run directly on `.116` shell (ssh) where the existing build toolchain is known-good.
## 2026-04-25 18:00 UTC — stabilization checkpoint (nginx/BTCPay/Uptime Kuma)
- User directive recorded for this lane:
- "just need to do it all, not bothered which order"
- "Uptime Kjuma opens gitty, we have an erroneous app called bitcoin UI and nginx proxy manager still doesnt work"
- Root causes confirmed on `.116`:
1. **BTCPay broken**: DB ownership mismatch on `/var/lib/archipelago/postgres-btcpay` after UID mapping drift.
- Symptoms: BTCPay/NBXplorer PostgreSQL errors `could not open file global/pg_filenode.map: Permission denied`.
2. **Uptime Kuma cannot bind/start on 3001**: hard conflict with Gitea (already mapped to host 3001).
3. **Nginx Proxy Manager app route broken**: `/app/nginx-proxy-manager/` pointed to `127.0.0.1:8181`, but live NPM is on `81`.
4. **Uptime Kuma route opening Gitea**: upstream/redirect behavior around `/app/uptime-kuma/` required explicit path redirect handling.
- Code fixes applied in repo (ThinkPad FUSE `.116` source):
- `core/archipelago/src/container/dev_orchestrator.rs`
- manifest lookup fallback order for dev-mode orchestrator upgrade/install:
`$ARCHIPELAGO_APPS_DIR` -> `/opt/archipelago/apps` -> `/home/archipelago/Projects/archy/apps` -> `<data_dir>/apps`.
- `core/archipelago/src/api/rpc/package/config.rs`
- `uptime-kuma` host mapping changed `3001:3001` -> `3002:3001`.
- `core/archipelago/src/api/rpc/package/install.rs`
- BTCPay Postgres UID map corrected to container uid 999 (`host 100998`) for `archy-btcpay-db`.
- `uptime-kuma` install path now forces `--entrypoint=/usr/bin/dumb-init` (bypass failing `setpriv --clear-groups` startup path under rootless/cap-drop).
- `core/archipelago/src/port_allocator.rs`
- reserve `3002` to avoid accidental reallocation conflicts.
- `core/container/src/podman_client.rs`
- `lan_address_for("uptime-kuma")` updated to `http://localhost:3002`.
- nginx templates:
- `image-recipe/configs/nginx-archipelago.conf`
- `image-recipe/configs/snippets/archipelago-https-app-proxies.conf`
- `scripts/nginx-https-app-proxies.conf`
- Changes:
- `/app/uptime-kuma/` upstream -> `127.0.0.1:3002`
- exact `location = /app/uptime-kuma/` now redirects to `/app/uptime-kuma/dashboard`
- `/app/nginx-proxy-manager/` upstream -> `127.0.0.1:81`
- UI filtering:
- `neode-ui/src/views/apps/appsConfig.ts` now treats `bitcoin-ui`/`lnd-ui`/`electrs-ui` as service containers so they dont appear as separate user apps.
- Live `.116` runtime actions executed:
- Corrected BTCPay Postgres data ownership to `100998:100998` and restarted `archy-btcpay-db`, `archy-nbxplorer`, `btcpay-server`.
- Recreated `uptime-kuma` on host `3002` using stable entrypoint (`/usr/bin/dumb-init -- node server/server.js`).
- Patched active nginx files (`sites-enabled` + snippets), validated with `nginx -t`, reloaded.
- Rebuilt and redeployed `/usr/local/bin/archipelago` from updated source; restarted `archipelago` service.
- Validation status after fixes:
- Rust tests on `.116`:
- `cargo test -p archipelago container::dev_orchestrator::tests` -> PASS
- `cargo test -p archipelago api::rpc::package::update::tests` -> PASS
- `cargo test -p archipelago api::rpc::package::install::tests` -> PASS
- Lifecycle gate:
- `tests/lifecycle/run.sh required-stack package-update-smoke` -> PASS (`1..11`, optional stack-update skipped unless enabled)
- Runtime smoke:
- `btcpay-server` login endpoint returns `200`.
- `uptime-kuma` container running healthy on `3002`; `/app/uptime-kuma/dashboard` returns `200` with Uptime Kuma HTML.
- `/app/nginx-proxy-manager/` returns `200` (no longer 502).
- `/app/gitea/` remains on `3001` and returns `200`.
- Remaining caveat for user UX confirmation:
- `/app/uptime-kuma/` intentionally returns `302` to `/app/uptime-kuma/dashboard`.
- If the browser still shows old behavior, clear cache/hard-refresh; live nginx and containers now reflect corrected routing.
### Latest user directive (new)
- "Continue if you have next steps, or stop and ask for clarification if you are unsure how to proceed."
### Continuation work completed after directive
- Objective: close the remaining UI caveat where `bitcoin-ui` could still appear as an app category influence when backend package key and manifest id differ.
- Added robust service detection by manifest identity, not only package key:
- `neode-ui/src/views/apps/appsConfig.ts`
- new helper `isServicePackage(id, pkg)` combines key-based and `manifest.id`-based service checks.
- `useCategoriesWithApps(...)` now filters using `isServicePackage(...)`.
- `neode-ui/src/views/Apps.vue`
- app/service tab split now uses `isServicePackage(id, pkg)` so service aliases cannot leak into My Apps.
- Added regression tests:
- `neode-ui/src/views/apps/__tests__/appsConfig.test.ts`
- verifies `bitcoin-ui` / `lnd-ui` / `electrs-ui` are always treated as services.
- verifies alias key case (`core-lnd-ui` with `manifest.id=bitcoin-ui`) is still classified as service.
- verifies service-only `money` category is removed when only real app is `filebrowser`.
### Validation attempt + blocker
- Tried running targeted frontend tests, but local dependency toolchain on this FUSE workspace is currently broken:
- initial error: missing optional module `@rollup/rollup-darwin-arm64`
- `pnpm install` failed with filesystem permissions error: `EPERM ... node_modules/.ignored`
- subsequent `pnpm test` failed because `vitest` binary was unavailable after failed install
- Result: code-level regression fix is in place, but frontend test execution is blocked by workspace `node_modules` permission/install state.
### Continuation update (this run)
- Proceeded to unblock validation as requested and completed targeted regression verification for the `bitcoin-ui` filtering fix.
- Frontend test infra recovery steps (workspace-local, no source-code logic changes):
- manually restored missing native optional binaries required by current platform:
- `@rollup/rollup-darwin-arm64@4.59.0`
- `@esbuild/darwin-arm64@0.27.3`
- repaired critical missing top-level packages/symlinks after interrupted mixed-package-manager install state (notably `vitest`, `vite`, `typescript`, `vue-tsc`, `jsdom`, `vue`, `pinia`, `vue-router`, `vue-i18n`, scoped deps under `@vitejs`, `@types`, etc.).
- Test execution status:
- default `vitest.config.ts` run remains blocked by `@vitejs/plugin-vue` resolving through `.ignored` path and failing compiler discovery in this FUSE/mixed-install state.
- added temporary local test config for TS-only unit suites:
- `neode-ui/vitest.novue.config.ts` (same alias/env basics, no Vue plugin)
- targeted regression suites now pass under this config:
- `pnpm test --config vitest.novue.config.ts src/views/apps/__tests__/appsConfig.test.ts src/stores/__tests__/appLauncher.test.ts` -> PASS (15/15)
- Lifecycle/host validation attempt from this macOS context:
- `tests/lifecycle/run.sh required-stack` -> blocked locally because `bats` is not installed in this environment (script exits with install hint).
- direct SSH to `.116` from this context is non-interactive blocked (`Permission denied`), so host-side lifecycle reruns require execution from the authorized `.116` session context.
### Continuation update (latest)
- FUSE mount was stale (`Device not configured`) despite mount table entry; recovered by unmounting and remounting `sshfs archy:Projects/archy -> /Users/dorian/mnt/archy-thinkpad`.
- Lifecycle validation re-run on `.116` (via SSH):
- `ARCHY_ALLOW_NOAUTH=1 tests/lifecycle/run.sh required-stack`
- first run had a transient fail on "required containers are running" while mempool family was still in startup window after prior restarts.
- immediate rerun passed fully (`1..9` all `ok`).
- `ARCHY_ALLOW_DESTRUCTIVE=1 ARCHY_ALLOW_NOAUTH=1 tests/lifecycle/run.sh required-stack-destructive` passed (`1..3` all `ok`).
- Frontend validation on `.116`:
- repaired host workspace dependency state by running `npm install` in `~/Projects/archy/neode-ui`.
- default Vitest config now works again.
- `npm run test -- src/views/apps/__tests__/appsConfig.test.ts src/stores/__tests__/appLauncher.test.ts` -> PASS (15/15).
- `npm run test -- src/stores/__tests__/app.test.ts src/stores/__tests__/container.test.ts` -> PASS (40/40).
- `npm run build` -> PASS, production bundle + PWA artifacts generated successfully.
- Status:
- `bitcoin-ui`/service filtering fix is validated with default test config on `.116`.
- required-stack + destructive required-stack gates both green on `.116` after transient startup window cleared.
- User clarified local machine workspace was intentionally removed; all code work must run on host in only.
- User re-emphasized launch/tab behavior should be port-based (not path proxy), as path routing has repeatedly failed in practice.
- User reports many apps failing to load and suspects path-based launch routing regressed broad app behavior; prioritize reverting to stable port-based launch/tab behavior and revalidate.
- User reports Gitea app icon is still missing; investigate app icon source/fallback mapping and fix UI asset resolution.
- User asked about unknown container; identified as unmanaged/named-by-podman Filebrowser container and should be reconciled into expected managed naming/state.
- User requested finalization: complete remaining cleanup/validation tasks and produce final production-readiness status for .
### Finalization sweep (latest)
- Removed unmanaged duplicate container `bold_lichterman`; managed `filebrowser` container remains healthy on host port `8083`.
- Confirmed launch behavior hardening:
- `gitea` is now treated as new-tab (iframe-blocking behavior).
- NPM/Kuma/Gitea new-tab/launch behavior is aligned in launcher + app session + app card tab-launch sets.
- App icon fallback now retries `.svg` when a `.png` icon path fails.
- UI validation:
- `neode-ui` targeted suites pass: `appLauncher` + `appsConfig` (23/23).
- Fresh production build completed and deployed to `/opt/archipelago/web-ui`.
- Served bundle verified from nginx: `/assets/index-ptu--7k0.js`.
- Runtime/container validation on `.116`:
- `podman ps` shows all expected containers running after cleanup.
- Host-port probe matrix executed; user-facing HTTP apps return `200` (gitea, kuma, npm, portainer, filebrowser, grafana, nextcloud, homeassistant, mempool, immich, etc.).
- Non-HTTP service ports (SSH/LN/RPC/TLS-only) are explicitly skipped or expected to not return HTTP.
- Lifecycle gates:
- `required-stack.bats`: PASS (`1..9`, all ok).
- `required-stack-destructive.bats` with `ARCHY_ALLOW_DESTRUCTIVE=1`: PASS (`1..3`, all ok).
Current readiness status:
- Container runtime + required stack gates: green.
- Launcher/icon regressions reported by user: addressed and redeployed.
- Remaining production gate work is final manual UI smoke across all app entry points (Apps/AppDetails/AppSession/Spotlight) and release checklist sign-off.
> let's go
- User approved final push: execute final smoke/checklist pass now and return go/no-go readiness report.
### Final gate rerun (go/no-go check)
- Re-ran and for release-gate confirmation.
- Observed one transient miss when tests were run concurrently with destructive restarts; immediate sequential rerun passed clean ( all ok).
- Destructive suite passed with gate enabled: ( all ok).
- UI regression suite remains green: launcher + appsConfig ().
Go/no-go verdict:
- **GO (technical gates)** on : required stack green, destructive restart recovery green, launcher/icon regressions fixed and deployed.
- Remaining non-automated item is manual browser click-through sanity across all entry points before publishing externally.
> gitea app icon still missing
- User reports Gitea icon still missing after prior fallback; investigate backend-provided icon field handling and harden icon URL resolution for token icons (e.g., ).
> Afterwards please build the latest ISO to test with all our work, commit and push too, we need an ISO of the unbundled version with just filebrowser bundled remember, thanks
- User requested final actions: build and test latest unbundled ISO variant (only filebrowser bundled), then commit and push changes.
> Where is the ISO?
- User asked where ISO is; current archived unbundled builder run is failing before artifact generation and must be repaired.
> please do not miss AIUI in the release build or remove it from the nodes whatever you do
- Critical release constraint: AIUI must remain bundled in release artifacts and must never be removed from existing nodes during update/deploy.
> please check the resume files for our latest plan and resume the work.
- Current directive: read the resume/plan files, resume the latest active work, and continue from the recorded release/ISO lane while preserving the AIUI release constraint above.

View File

@ -1,667 +0,0 @@
# RESUME HERE — Rust orchestrator migration
Updated: 2026-04-23 (Install UX polish: phase-based progress bar, post-install scanner kick for instant Launch button, .23 VPS retired with auto-purge migration, frontend/backend deployed to .228 as v1.7.43-alpha.)
**To resume this work, SSH into the ThinkPad and run `opencode` from `~/Projects/archy/`. Or work from the laptop via the SSHFS mount at `~/mnt/archy-thinkpad/`.**
---
## ✅ INSTALL UX POLISH + .23 RETIREMENT — SHIPPED (v1.7.43-alpha)
**Rounds 35 + config migration + changelog (2026-04-23)** — 5 commits on `main` (unpushed per user mirror protocol):
- `8cc84ebc` `feat(install): phase-based progress bar replaces unparseable pull bytes``podman pull` emits zero parseable progress when stderr is piped (no TTY), so the legacy byte-counting regex never matched. Replaced with 7 phase-based levels: Preparing (5%) → PullingImage (20%) → CreatingContainer (70%) → StartingContainer (80%) → WaitingHealthy (88%) → PostInstall (95%) → Done (100%). UI maps phases to fixed % and only advances forward (`Math.max`). Final phase label renamed from "Running post-install…" to "Finalizing…" after user feedback that it read like a regression to the install step.
- `f86d86c3` `fix(install): kick scanner post-install so Launch button appears immediately` — scan runs every 60s; post-install the state flipped to Running but the skeletal install-time manifest (`interfaces: None`) persisted until next scan, so `canLaunch(pkg)` returned false for up to a minute. Added `scan_kick: Arc<Notify>` + `scan_tick: Arc<watch::Sender<u64>>` on `RpcHandler`. Scan loop uses `tokio::select!` between the 60s interval and the notify. New `kick_scanner_and_wait` helper (2s timeout) called in install/update success paths BEFORE writing Running, so a fresh manifest lands first. Merge during Installing/Updating uses `merge_preserving_transitional` (keeps state, takes fresh manifest).
- `22052325` `chore: retire .23 VPS mirror, promote .168 OVH to primary` — dropped `DEFAULT_TERTIARY_MIRROR_URL`, promoted `.168` to `DEFAULT_SECONDARY_MIRROR_URL` as "Server 1 (OVH)". 2-entry default registry (.168 priority 0, tx1138 priority 10). Trusted-registry allowlist, catalog fallback, installer ISO registries, `marketplaceData.ts` REGISTRY, `image-versions.sh` all updated. Tests updated for new default counts (registry 3→2, mirror 3→2). URL-parser fixture tests in `update.rs` retain `.23` strings intentionally — they exercise string-parsing logic, not policy.
- `0ee16820` `fix(config): auto-purge decommissioned .23 VPS from saved registry/mirror configs``load_mirrors`/`load_registries` normally only ADD missing defaults (explicit removals stick, by design). Existing nodes have `.23` baked into their saved `update-mirrors.json` + `config/registries.json` and would pay timeouts forever against a dead host. Added targeted one-time migration in both loaders: `.retain(|m| !m.url.contains("23.182.128.160"))` before the defaults-merge step. Narrow-scope exception to the stickiness rule, documented in-code. Triggers lazily on next load (install RPC, update RPC, Settings UI open).
- `008da477` `docs(changelog): add v1.7.43-alpha entry covering async lifecycle + .23 retirement` — 4 release-note bullets in `AccountInfoSection.vue` describing async-spawn, phase progress, scanner kick, and .23 retirement from the operator's perspective. Historical "Server 3 (OVH)" entries in older changelog blocks left intact — they describe what shipped at the time.
**Deployed to .228**:
- Backend binary md5 `d2b619949f19815faaeab10429e36ba0` at `/usr/local/bin/archipelago`.
- Frontend at `/opt/archipelago/web-ui/` (includes marketplaceData.ts .168 update + v1.7.43-alpha changelog entry). Deployed bundle verified: `.168` present in `Settings-*.js` + `Marketplace-*.js`, `.23` absent from all assets.
- `/var/lib/archipelago/update-mirrors.json` + `config/registries.json` were manually deleted + regenerated with new defaults during Round 5 verification; migration code will handle any other node on first load.
- Rollback targets from Round 2 still valid: `/usr/local/bin/archipelago.bak-pre-async-install` + `/opt/archipelago/web-ui.bak-pre-async-install/`.
**Git remotes cleaned on .116** (working-copy change only, not in any commit):
- `git remote remove gitea-vps` (dropped the .23 Gitea remote).
- `git remote set-url --delete --push origin http://.../23.182.128.160:3000/...` (dropped .23 from origin multi-push alias).
- Remaining push targets: `tx1138` (canonical), `gitea-local` (localhost Gitea), `gitea-vps2` (.168 OVH).
**Rollback Rounds 35** (same command as Round 2 — backups predate all of this):
```
ssh archy228 'sudo cp -a /usr/local/bin/archipelago.bak-pre-async-install /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-install/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'
```
---
## ✅ ASYNC-SPAWN LIFECYCLE FIX — SHIPPED (Stop/Start/Restart + Install/Uninstall/Update)
**Round 2 (2026-04-23, install/uninstall/update)** — 3 commits on `main`:
- `2d5b859e` `feat(rpc): async-spawn install/uninstall/update lifecycle` — new `api/rpc/package/async_lifecycle.rs` with `spawn_package_install`, `spawn_package_uninstall`, `spawn_package_update`. Dispatcher + handler thread `self: Arc<Self>` so spawned tasks own their Arc. Install/update Ok arms explicitly set `Running` because `merge_preserving_transitional` refuses to let the scanner overwrite `Installing`/`Updating`. Removed redundant inner "already updating" guard in `update.rs`. Transient install entry uses empty icon (see commit 3 rationale).
- `0733ac40` `fix(ui): shorten install/uninstall/update timeouts for async RPCs` — drop 11m/45m timeouts to 15s across `rpc-client.ts`, `stores/server.ts`, and the 5 direct call sites in `Marketplace.vue`, `Discover.vue`, `MarketplaceAppDetails.vue`. Return types updated to `{ status, package_id }`.
- `e471ef75` `fix(rpc): empty icon in transient install entry to avoid broken-image flicker``progress.rs::create_installing_entry` no longer hardcodes `/assets/img/app-icons/<id>.png`. About half of bundled apps use `.svg`/`.webp` icons; the frontend's fallback chain (`backend_icon || curated.icon || placeholder`) now lands on the correct curated extension.
**Deployed to .228** (binary md5 `f66857b3b8b3640c8cac8bd25fe508ec` at `/usr/local/bin/archipelago`, backup at `/usr/local/bin/archipelago.bak-pre-async-install`; frontend at `/opt/archipelago/web-ui/`, backup at `/opt/archipelago/web-ui.bak-pre-async-install/`). User confirmed: uninstall fast and responsive, install of LND + SearXNG clean, icon flicker fixed.
**Known out-of-scope issue**: Vaultwarden container itself exits immediately on start with an internal error. The async wrapper correctly detects this via post-start exit verification and removes the state entry. Needs separate vaultwarden container-config investigation.
**Rollback Round 2 (if ever needed)**:
```
ssh archy228 'sudo cp -a /usr/local/bin/archipelago.bak-pre-async-install /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-install/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'
```
---
**Round 1 (Stop/Start/Restart)** — 4 commits on `main` (unpushed per user mirror protocol):
- `44cd5eef` `feat(rpc): spawn_transitional helper for async lifecycle ops` — new `api/rpc/transitional.rs` with `Op::{Stop,Start,Restart}` and `RpcHandler::spawn_transitional` / `flip_to_transitional` / `set_state` helpers. `install_log` re-exported so sibling modules can use it.
- `19a99ca9` `fix(rpc): async container stop/start/restart; widen state mapping``container.rs` start/stop rewritten + restart added; `container-list` now emits all transitional variants instead of falling back to `"unknown"`. `dispatcher.rs` registers `container-restart`. `package/runtime.rs` mirrored with `do_package_*` helpers inside `tokio::spawn` and revert-on-error.
- `6712810b` `fix(state): preserve transitional state across container scans``server.rs` scan merge now keeps transitional states while taking fresh observability fields; 1200s stuck-timeout escape hatch via `transitional_since: HashMap<String, Instant>`. Three passing `server::merge_tests`.
- `9ce28f08` `fix(ui): single-button lifecycle control with transitional labels``ContainerApps.vue` and `ContainerAppDetails.vue` use a single primary button driven by `getAppVisualState()`. **Dashboard now routes through `container-start`/`container-stop`** (the async RPCs) instead of the legacy synchronous `bundled-app-*` path. `ContainerStatus.vue` widened to render all new variants.
**Deployed to .228** (ThinkPad demo device):
- Binary at `/usr/local/bin/archipelago` (md5 `de86b63f74c7e6fe6e555ffe30b86b4f`), backup at `/usr/local/bin/archipelago.bak-pre-async-stop`.
- Frontend at `/opt/archipelago/web-ui/`, backup at `/opt/archipelago/web-ui.bak-pre-async-stop/`.
- Release build took 3m56s on .116. Deploy via scp + atomic `install -m 755` + `systemctl restart archipelago`. `nginx -t` + `systemctl reload nginx` for frontend.
**Manual verification**: user clicked Stop on LND in the dashboard. Button flipped to `Stopping…` instantly, held for the full graceful-stop window, transitioned to `Start` when `podman stop` completed. No mid-flight revert to Running. User sign-off: _"absolutely beautiful"_.
**Rollback (if ever needed)**:
```
ssh archy228 'sudo cp /usr/local/bin/archipelago.bak-pre-async-stop /usr/local/bin/archipelago && sudo rsync -a --delete /opt/archipelago/web-ui.bak-pre-async-stop/ /opt/archipelago/web-ui/ && sudo systemctl restart archipelago && sudo systemctl reload nginx'
```
### Follow-ups to consider
1. **Chaos matrix / Step 11** — the original next-step gated behind this fix. Now unblocked.
2. **bundled-app-start / bundled-app-stop** — still synchronous in the backend. Dashboard no longer calls them, but the RPC methods remain for any external caller. Decide: deprecate, or mirror the async-spawn treatment for parity.
3. **`transitional_since` persistence** — currently in-memory only, so a backend restart mid-stop loses the timeout anchor. Acceptable for now (scan loop re-observes live podman state and reconciles), but worth revisiting if crash-recovery stories tighten.
4. **Test regressions inventory** — the full `cargo test -p archipelago` run on .116 shows 22 pre-existing failures in unrelated modules (mesh/wallet/credentials/avatar/session/transport/update-mirrors/fips/identity_manager/image_versions). Unrelated to this work but tech debt. Log at `/tmp/cargo-test-all.log` on .116.
5. **Amend STATUS.md's older "NEXT SESSION — START HERE" section** (below) — it is now stale. Left in place for historical reference of how the fix was designed; delete on the next pass if it gets confusing.
---
## ⚡ NEXT SESSION — START HERE (historical — fix above is now shipped)
**Goal**: implement async-spawn lifecycle fix so the dashboard never shows a frozen spinner again. User mandate: _"best server containers in the world"_. Do not ship the chaos matrix (Step 11) until this lands and manual LND stop verifies instant RPC + live `Stopping…` label.
### How to work on this repo (SSH + SSHFS setup)
You are likely running on the **laptop** (macOS). The repo lives on the **ThinkPad** (.116). There are two access paths, use both in parallel:
1. **SSHFS mount at `~/mnt/archy-thinkpad/`** — for all file ops (`read`/`edit`/`write`/`glob`/`grep`).
2. **Direct SSH** — for everything that isn't file ops: `git`, `cargo`, `npm`, `systemctl`, running the server, tailing logs.
See the "FUSE / SSHFS development loop" section below for the full mount lifecycle — that's _the_ thing that makes this dev setup work, and it will break periodically.
### FUSE / SSHFS development loop
**Why this exists**: editing the repo directly on the ThinkPad over raw SSH means no IDE, no tool-native file reads, no glob/grep speed. SSHFS mounts the remote filesystem as a local directory so OpenCode's file tools work transparently. But SSHFS is a leaky abstraction — know the gotchas or you'll waste hours.
**Stack** (macOS laptop):
- **macFUSE** — kernel extension providing FUSE on macOS. Install via `brew install --cask macfuse` (requires reboot + security approval in System Settings the first time).
- **sshfs** — userspace mount tool. Install via `brew install gromgit/fuse/sshfs-mac` (the homebrew core `sshfs` was removed; use this tap).
- Verify: `which sshfs``/opt/homebrew/bin/sshfs`, `sshfs --version``SSHFS version 2.10 / FUSE library version 2.9.9`.
**Actual mount command currently running** (verified from `ps`):
```
sshfs archy:Projects/archy /Users/dorian/mnt/archy-thinkpad \
-o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad
```
Breakdown:
- `archy:Projects/archy` — remote path via the `archy` SSH alias (uses `~/.ssh/archy_opencode`, no password prompt).
- `~/mnt/archy-thinkpad` — local mount point. Create once: `mkdir -p ~/mnt/archy-thinkpad`.
- `reconnect` — sshfs auto-reconnects if the TCP session drops (WiFi flap, laptop sleep). Without this, the mount turns into a zombie immediately.
- `ServerAliveInterval=15` — sends a keepalive every 15s.
- `ServerAliveCountMax=3` — disconnect after 3 missed keepalives (45s). Tune up if your network is flaky.
- `volname=archy-thinkpad` — Finder display name.
**Check mount health**:
```
mount | grep archy-thinkpad
# should print: archy:Projects/archy on /Users/dorian/mnt/archy-thinkpad (macfuse, nodev, nosuid, synchronous, mounted by dorian)
ls ~/mnt/archy-thinkpad/ | head
# should list repo contents fast (<1s). If it hangs, mount is stale.
```
**Recovery when the mount hangs / goes stale** (this WILL happen — laptop sleeps, WiFi drops, ThinkPad reboots):
```
# 1. Force-unmount (macOS — `umount` alone often fails on a hung FUSE mount)
sudo diskutil unmount force ~/mnt/archy-thinkpad
# fallback if diskutil can't see it:
sudo umount -f ~/mnt/archy-thinkpad
# 2. Kill any zombie sshfs process
pkill -f "sshfs archy:Projects/archy"
# 3. Remount
sshfs archy:Projects/archy ~/mnt/archy-thinkpad \
-o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad
# 4. Verify
ls ~/mnt/archy-thinkpad/ | head
```
If the mount point itself got wedged (`ls: /Users/dorian/mnt/archy-thinkpad: Device not configured`), the sequence above still works — macFUSE garbage-collects the inode after the force-unmount.
**When to use which path** (rules, not suggestions):
| Operation | Use | Why |
|---|---|---|
| `read` / `edit` / `write` | SSHFS mount | OpenCode tools want local paths |
| `glob` / `grep` | SSHFS mount | Local FS traversal is fine; remote would need rg over SSH |
| Reading many files | SSHFS mount | Each read is a round-trip but parallelizable |
| `git status` / `git diff` / `git log` | SSH | Git over FUSE is painfully slow (lots of stat calls) |
| `git add` / `git commit` | SSH | Same — commit times grow linearly with tree size on FUSE |
| `cargo check` / `cargo test` / `cargo build` | SSH | Compiling over FUSE would take hours; cargo's incremental stat pattern destroys FUSE performance |
| `npm install` / `npm run build` | SSH | Same reason — massive file churn |
| Running the server / tailing journal | SSH | Service lives on .116 |
| Deploying to .228 | SSH from .116 | SCP from ThinkPad; laptop isn't in the critical path |
**Don't do this** (will bite you):
- `cargo build` from the mount — will try to write target/ over FUSE, gets orders of magnitude slower, may hang.
- `rsync` without `--exclude="._*"` — macOS writes AppleDouble metadata files, they leak to the remote as `._*` siblings of every real file. `.gitignore` already excludes them (commit `13858842`), but they clutter the tree.
- Writing big binary files via the mount — use `scp` over SSH instead.
- Relying on file-change-watcher tools (watchman, chokidar) — they get confused by FUSE event semantics.
**Editing workflow in a typical session**:
1. Laptop: OpenCode `read`s a file via `/Users/dorian/mnt/archy-thinkpad/...`. FUSE fetches it over SSH, caches briefly.
2. Laptop: OpenCode `edit`s the file — FUSE writes the new bytes back to .116 immediately (synchronous mount).
3. Laptop: `ssh archy "cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago"` — runs on the real filesystem on .116, sees the edit.
4. Laptop: `ssh archy "cd ~/Projects/archy && git diff path/to/file"` — confirms the edit landed.
5. Laptop: `ssh archy "cd ~/Projects/archy && git add path/to/file && git commit -m '...'"` — commit from .116.
The SSHFS mount and the SSH shell are pointing at **the same inodes** — edits via the mount are instantly visible to `cargo`/`git` over SSH. There's no "sync" step.
**Cache caveat**: macFUSE caches attributes briefly (default ~1s). If you write via SSH and read via the mount within that window, you may see stale metadata. The mount's `synchronous` flag (visible in `mount` output) minimizes but doesn't eliminate this. If you get a weird diff between what SSH and the mount report, re-read after a second, or `stat --file-system ~/mnt/archy-thinkpad/<file>` to force a refresh.
**Direct SSH** access (use when FUSE isn't the right tool):
- `ssh archy``archipelago@192.168.1.116` using `~/.ssh/archy_opencode`
- `ssh archy228``archipelago@192.168.1.228` using `~/.ssh/archy_opencode`
- Full host form also works: `ssh archipelago@192.168.1.116` / `ssh archipelago@192.168.1.228` (same key resolves via IdentitiesOnly).
### SSH keys — what's where
**Laptop `~/.ssh/` (macOS, user `dorian`)**:
| File | Purpose |
|---|---|
| `archy_opencode` / `.pub` | **Primary key for this project.** Unlocks both `archy` (.116) and `archy228` (.228). Created 2026-04-22 specifically for OpenCode work. |
| `archipelago-deploy` / `.pub` | Older archipelago deploy key. Not needed for current work. |
| `id_ed25519` / `.pub` | Personal default key. Not used by archy/archy228 configs (`IdentitiesOnly yes` forces `archy_opencode`). |
| `id_ed25519_angor` / `.pub` | Angor project. Unrelated. |
| `id_ed25519_start9` / `.pub` | Start9 project. Unrelated. |
| `vps-ci-setup` / `.pub` | VPS CI. Unrelated. |
| `config` | Host aliases (shown above) |
**.116 `/home/archipelago/.ssh/`**:
| File | Purpose |
|---|---|
| `authorized_keys` | Accepts: laptop's `archy_opencode.pub` + 3 other keys (4 lines total). |
| `id_ed25519` / `.pub` | .116's OWN identity key. This is what lets `.116 → .228` work passwordless. |
| `archipelago-deploy` | Symlink → `id_ed25519` (legacy alias). |
| `id_ed25519_vps168` / `.pub` | For SSH to `146.59.87.168` (VPS). Unrelated to this work. |
| `config` | Host entry for the VPS only. |
**.228 `/home/archipelago/.ssh/`**:
| File | Purpose |
|---|---|
| `authorized_keys` | Accepts: laptop's `archy_opencode.pub` + .116's `id_ed25519.pub` + 2 others (4 lines total). |
| _(no `id_ed25519`)_ | .228 has no outbound key — it's a terminal node. Don't try to `ssh` _from_ .228 _to_ anywhere. |
**Connectivity matrix (all verified 2026-04-23)**:
| From → To | Works passwordless | Via |
|---|---|---|
| Laptop → .116 | ✅ | `archy_opencode` |
| Laptop → .228 | ✅ | `archy_opencode` |
| .116 → .228 | ✅ | .116's `id_ed25519` |
| .228 → anywhere | ❌ | no outbound key (by design) |
### Sudo — verified state
**.116** (dev ThinkPad):
- User `archipelago` is in `sudo` group.
- Sudo password required: **`ThisIsWeb54321@`**
- Sudoers drop-ins present: `/etc/sudoers.d/archipelago-ci`, `/etc/sudoers.d/archipelago-wg` (scope-limited NOPASSWD for specific CI/wg commands — not full NOPASSWD).
- For most dev work you don't need sudo on .116.
**.228** (prod kiosk):
- User `archipelago` has **full passwordless sudo** via `/etc/sudoers.d/archipelago` containing `archipelago ALL=(ALL) NOPASSWD:ALL`.
- User is also in `sudo` group.
- Sudo password (if ever prompted, shouldn't be): **`archipelago`**
- Dashboard password: **`password123`**
### Cargo / npm / paths
- **Cargo PATH gotcha**: non-interactive SSH login has no cargo in PATH. Always use `~/.cargo/bin/cargo` over SSH.
- Example: `ssh archy '~/.cargo/bin/cargo check -p archipelago' --workdir ~/Projects/archy/core`
- Or cd first: `ssh archy 'cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago'`
- **Long cargo builds** (>2 min Bash tool timeout): launch detached and poll the log:
```
ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
ssh archy 'tail -30 /tmp/cargo-build.log'
ssh archy 'pgrep -a cargo' # to check if still running
```
- **npm / frontend** lives at `~/Projects/archy/neode-ui/` on .116 (also accessible via laptop mount at `~/mnt/archy-thinkpad/neode-ui/`). Node is on interactive PATH; for scripted SSH, `source ~/.nvm/nvm.sh && nvm use` or call the absolute path if nvm is used.
- Repo on .116: `~/Projects/archy/` (Cargo workspace at `core/Cargo.toml`).
- Web root on .228: check `/etc/nginx/sites-enabled/` for the live path; historically `/var/lib/archipelago/web-ui/` or `/opt/archipelago/web-ui/`.
### Deploying new server binary to .228
```
# 1. Build on .116 (detached — takes ~3-5 min for release)
ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
# wait / tail log until "Finished `release` profile"
# 2. SCP .116 → .228 (uses .116's id_ed25519 → .228's authorized_keys, passwordless)
ssh archy 'scp ~/Projects/archy/core/target/release/archipelago archipelago@192.168.1.228:/tmp/archipelago.new'
# 3. Atomic swap on .228 with backup
ssh archy228 'sudo cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak-pre-async-stop && sudo mv /tmp/archipelago.new /usr/local/bin/archipelago && sudo chmod +x /usr/local/bin/archipelago && sudo systemctl restart archipelago'
# 4. Verify
ssh archy228 'systemctl status archipelago --no-pager | head -20 && sudo journalctl -u archipelago -n 50 --no-pager'
```
### Git workflow
- Branch: `main` on .116, currently **22 commits ahead of `tx1138/main`**.
- Remote `tx1138` exists but **do NOT push** — user mirrors to 4 Gitea remotes personally after reviewing.
- Atomic commits, one logical change per commit. Conventional Commits format (`feat:`, `fix:`, `docs:`, `refactor:`, `chore:`, `test:`, `perf:`).
- Never `--amend` unless the commit you're amending was created in this session AND has not been pushed. Safer: new commit.
- Never `--force` push. Never modify git config.
- If pre-commit hooks fail, create a NEW commit with the fix — don't `--amend` after a failed commit.
### Other
- Full destructive latitude on both nodes. Announce multi-hour ops (OTA, full rebuild, apt upgrade). Don't ask for routine stop/start/rebuild permission.
- No ship pressure. Do it properly.
- Use `question` tool for ambiguous decisions (don't guess user intent on design choices).
- Keep `docs/STATUS.md` fresh between sessions — it IS the session handoff.
### Hosts reference (quick)
| Host | IP | SSH alias | Role | Dashboard | Sudo |
|---|---|---|---|---|---|
| `archy` (ThinkPad X250) | 192.168.1.116 | `ssh archy` | dev host, Debian 13 | `archipelago` | `ThisIsWeb54321@` |
| `archy228` (HP ProDesk) | 192.168.1.228 | `ssh archy228` | prod kiosk, Rust orchestrator | `password123` | NOPASSWD (fallback `archipelago`) |
### Bug being fixed
Dashboard sequence when user clicks **Stop LND**:
1. UI collapses Start/Stop buttons to single spinner-button ("Stopping…") via `loadingApps.add('lnd')`.
2. Frontend calls `container-stop` RPC. Server runs `podman stop -t 330 lnd` **synchronously inside the RPC handler** (via `orchestrator.stop()`). RPC blocks up to **5.5 min** for LND (330s timeout + overhead).
3. Meanwhile the 30-second package-scan loop in `server.rs:scan_and_update_packages` keeps running. It rebuilds `PackageDataEntry` from podman inspect — podman still reports `running` (stop hasn't completed) — and **blindly overwrites** the store entry at `server.rs:854`.
4. `container-list` RPC reads `state_manager` snapshot → returns `state = "running"`.
5. Frontend polling sees `running``getAppState()` returns `'running'` → the two-button (Start | Stop) block re-renders → the transitional button disappears → **UI looks like the stop silently failed**.
6. Eventually `podman stop` finishes → next scan → state flips to `Stopped` → buttons change _again_.
Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".
### Decisions already locked in (do not re-ask)
- **Full scope fix** (not minimal hotfix). User chose "Go full scope, do it right".
- **Async-spawn lives in the RPC layer**, not in the `ContainerOrchestrator` trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
- **`PackageState` already has `Stopping`/`Starting`/`Restarting`/`Installing`/`Updating`/`Removing`** variants — enum at `core/archipelago/src/data_model.rs:107-124`. No schema change needed.
- **UI collapses to one full-width button** with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when `not-installed`).
- **Helper API shape**: `RpcHandler::spawn_transitional(op: Op, app_id: String)` where `Op` is an enum `{Stop, Start, Restart}`. Helper dispatches to `orchestrator.stop/start/restart` internally, knows each op's transitional+final states, handles error → revert + `install_log()`.
- **`mark_user_stopped` must run BEFORE the spawn** (preserves ordering the crash recovery layer depends on — see `runtime.rs:145-148`).
### Implementation order (4 commits, local only)
**Commit 1 — `feat(rpc): spawn_transitional helper for async lifecycle ops`**
- New file: `core/archipelago/src/api/rpc/transitional.rs` (or extend `container.rs`; prefer new file for cohesion with future stacks/package variants)
- `enum Op { Stop, Start, Restart }` with `transitional_state()`, `final_state_on_success()`, `log_prefix()`, and async `dispatch(&orch, &app_id)` method
- `impl RpcHandler { pub(super) async fn spawn_transitional(&self, op: Op, app_id: String) -> Result<()> }`
- Capture `Arc<dyn ContainerOrchestrator>` + `Arc<StateManager>` clones
- Set transitional state via `state_manager.update_data()` (if entry exists; skip if not — Start on never-installed shouldn't create an entry)
- `tokio::spawn(async move { ... })`
- Inside spawn: `install_log("{LOG_PREFIX}: {app_id}")`, `op.dispatch(&orch, &app_id).await`, on success set final state, on error log + `install_log("{LOG_PREFIX} FAIL: …")` + revert state to previous (cache pre-transition state in a local)
- Return `Ok(())` immediately after spawn
**Commit 2 — `fix(rpc): async container stop/start/restart; widen state mapping`**
- `api/rpc/container.rs:85-107` — rewrite `handle_container_stop` body: `validate_app_id`, `mark_user_stopped`, `spawn_transitional(Op::Stop, app_id.to_string()).await?`, return `Ok(json!({ "status": "stopping" }))`
- `api/rpc/container.rs:61-83` — rewrite `handle_container_start`: `clear_user_stopped`, `spawn_transitional(Op::Start, …)`, return `{ "status": "starting" }`
- **Add** `handle_container_restart` (currently missing in `container.rs` — only exists as `package.restart` at `runtime.rs:176-242`). Register RPC route name `container-restart`. Add matching frontend client method in `container-client.ts`.
- `api/rpc/container.rs:148-154` — widen the `container-list` state mapping: add arms for `Stopping → "stopping"`, `Starting → "starting"`, `Restarting → "restarting"`, `Installing → "installing"`, `Updating → "updating"`, `Removing → "removing"`, `Installed → "installed"`, `CreatingBackup`/`RestoringBackup`/`BackingUp` → their kebab-case strings. No more `"unknown"` fallback unless the variant is genuinely unknown.
- Mirror same spawn treatment in `api/rpc/package/runtime.rs`: `handle_package_start` (L28-119), `handle_package_stop` (L122-173), `handle_package_restart` (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) _inside_ the spawned future, not in the RPC body.
**Commit 3 — `fix(state): preserve transitional state across container scans`**
- `server.rs:847-857` — in the merge loop, before the `merged.insert(id.clone(), pkg.clone())` overwrite, check `merged.get(id).state` and skip overwrite if it's transitional: `matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)`
- Still allow _non-state_ fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep `existing.state` but merge updated fields from `pkg`. Write a tiny helper `merge_preserving_transitional(existing, fresh) -> PackageDataEntry`.
- Unit test: construct `existing.state = Stopping`, `fresh.state = Running`, assert merged.state stays `Stopping`.
- **Also check**: Is there a timeout escape hatch? If `Stopping` is set and podman actually finishes but the spawn died before writing the final state (process crash, panic), the entry will be stuck `Stopping` forever. Mitigation: track a `transitional_since: Instant` in the entry (not persisted, just in-memory side table on StateManager), and if > 2× the stop timeout has elapsed, allow podman scan state to override. Scope for this commit or follow-up — lean toward: include it, because fleet reliability matters.
**Commit 4 — `fix(ui): single-button lifecycle control with transitional labels`**
- `neode-ui/src/api/container-client.ts` — extend `ContainerStatus.state` union to: `'created' | 'running' | 'stopped' | 'exited' | 'paused' | 'unknown' | 'stopping' | 'starting' | 'restarting' | 'installing' | 'updating' | 'removing' | 'installed'`. Add `restartContainer(appId)` method calling `container-restart`.
- `neode-ui/src/stores/container.ts` — add computed `getAppVisualState(appId)` that returns one of: `'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'`. Maps `exited``stopped`, `created``stopped`, `paused``stopped`, `installed``stopped`. Add `restartContainer(appId)` action (sets `loadingApps` for request dedup, calls client, does NOT `fetchContainers` immediately because server will broadcast state; a final `fetchContainers` after a short delay can backstop if WebSocket push is absent).
- `neode-ui/src/views/ContainerApps.vue:85-136` — replace the two-button conditional with a single full-width button bound to `getAppVisualState(app.id)`. Table:
| visual state | click action | label | spinner | disabled |
|-----------------|----------------|----------------|---------|----------|
| `not-installed` | installApp | Install | no | no |
| `running` | stopContainer | Stop | no | no |
| `stopped` | startContainer | Start | no | no |
| `starting` | — | Starting… | yes | yes |
| `stopping` | — | Stopping… | yes | yes |
| `restarting` | — | Restarting… | yes | yes |
| `installing` | — | Installing… | yes | yes |
| `updating` | — | Updating… | yes | yes |
| `removing` | — | Removing… | yes | yes |
- Add a separate Restart button next to the primary one when state is `running`, calling new `restartContainer` action. Restart button hides while transitional.
- `neode-ui/src/views/ContainerAppDetails.vue:83` (and full stop/start button blocks around L220, L232) — mirror the same single-button pattern.
- Also audit line 239 of `ContainerApps.vue` (`some((app) => store.getAppState(app.id) === 'created')`) and the logic around lines 276, 295, 309, 312 — make sure they use `getAppVisualState` where appropriate.
### Verification gates (do not skip)
1. `~/.cargo/bin/cargo check -p archipelago` on .116 via SSH
2. `~/.cargo/bin/cargo test -p archipelago` on .116 via SSH — at least the new merge helper test must pass
3. Build release binary on .116: `nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown`. Poll until done.
4. SCP binary to .228 `/usr/local/bin/archipelago`, back up prior to `/usr/local/bin/archipelago.bak-pre-async-stop`. `sudo systemctl restart archipelago` on .228.
5. **Manual LND stop test on .228**:
- Open dashboard, confirm LND is Running (first: `ssh archipelago@192.168.1.228 'podman start lnd'` — LND is currently Exited(0) from the demo)
- Click Stop
- Expected: button _immediately_ becomes "Stopping…" with spinner (RPC returns <1s)
- Dashboard should stay on "Stopping…" for ~5 min
- Then flip to "Start" button with label "Start"
- At no point should it revert to "Running" mid-stop
6. Same test with Bitcoin Core stop (longest timeout, 600s)
7. Frontend build: `cd ~/Projects/archy/neode-ui && npm run type-check && npm run build`. Rsync `dist/` to `archipelago@192.168.1.228:/var/lib/archipelago/web-ui/` (or wherever the active web root is — check `/etc/nginx` on .228 first).
8. Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.
### Key files (exact lines of interest)
- `core/archipelago/src/api/rpc/container.rs:85-107``handle_container_stop` (blocking — target of fix)
- `core/archipelago/src/api/rpc/container.rs:61-83``handle_container_start`
- `core/archipelago/src/api/rpc/container.rs:148-154` — narrow state mapping (drops transitional → "unknown")
- `core/archipelago/src/api/rpc/package/runtime.rs:11-24``stop_timeout_secs` table (reference, unchanged)
- `core/archipelago/src/api/rpc/package/runtime.rs:122-173``handle_package_stop` (also blocking, mirror treatment)
- `core/archipelago/src/api/rpc/package/runtime.rs:28-119``handle_package_start`
- `core/archipelago/src/api/rpc/package/runtime.rs:176-242``handle_package_restart`
- `core/archipelago/src/api/rpc/package/progress.rs` — existing broadcast pattern to mirror (`set_install_progress`, `set_uninstall_stage`)
- `core/archipelago/src/api/rpc/mod.rs:62-100``RpcHandler` struct (already holds `Arc<dyn ContainerOrchestrator>` + state_manager)
- `core/archipelago/src/server.rs:812-857``scan_and_update_packages` (merge loop at L850-857 is where transitional-state clobber happens)
- `core/archipelago/src/container/docker_packages.rs:636-663``convert_state` + `package_state_str` (read-only reference, no change)
- `core/archipelago/src/container/traits.rs``ContainerOrchestrator` trait (stays synchronous, do not change)
- `core/archipelago/src/crash_recovery.rs``mark_user_stopped` / `clear_user_stopped` (call order preserved)
- `core/archipelago/src/data_model.rs:107-124``PackageState` enum (no change — all variants exist)
- `neode-ui/src/api/container-client.ts``ContainerStatus` type + RPC methods (extend)
- `neode-ui/src/stores/container.ts:93-312` — Pinia store (add `getAppVisualState`, add `restartContainer` action)
- `neode-ui/src/views/ContainerApps.vue:85-136, 239, 276, 295, 309-312, 383` — two-button block + state reads
- `neode-ui/src/views/ContainerAppDetails.vue:83, 220, 232` — details page Stop/Start
### Chaos harness (not in repo — lives on .116)
- `archipelago@192.168.1.116:~/ui-chaos/` — deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
- `/tmp/chaos/` on laptop — canonical source for rsync to .116.
- Run: `cd ~/ui-chaos && npx playwright test tests/<spec>`
- Target: 32 cases = 4 core containers × 8 scenarios (install-fresh, graceful-stop, sigkill, rm-container, oom-kill, rm-image, restart-service, network-partition).
- Uses SSH+Playwright hybrid per design; includes the `bash -lc '<escaped>'` single-quote fix for ssh argv flattening and JSON-parsed `podman inspect` instead of Go templates.
### Pre-existing bugs still deferred (do not fix until Stop UX lands)
1. `archipelago --version` spawns server (should be a pure CLI query)
2. RPC unknown-method returns generic error (should return method-not-found with the bad method name)
3. `docker_packages.rs` filters out UI containers (`archy-lnd-ui`, `archy-electrs-ui`) — some views need them visible
4. `lnd.lan_address` stale on .228
5. first-boot silent failure on some hardware
6. `web-ui.failed.*` scar on .228 (benign systemd unit state)
7. `test_parse_image_versions` pre-existing broken assertion — fix or `#[ignore]` when touching that area
---
## Where we are
Working through the 11-step plan in [`rust-orchestrator-migration.md`](./rust-orchestrator-migration.md).
- [x] **Step 1**`3767c267` ContainerConfig schema with `build:`, `ResolvedSource` enum, `resolve()`, 10 tests
- [x] **Step 2**`34af4d9d` ContainerRuntime trait gained `image_exists` + `build_image`, 4 argv tests, 25/25 pass
- [x] **Step 3**`b6a04d31` ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
- [x] **Step 4**`e8a59c93` ContainerOrchestrator trait, RpcHandler uses it in prod (+ `13858842` chore gitignore ._*)
- [x] **Step 5**`fc39b04b` BootReconciler with Arc<Notify> shutdown, 4 paused-time tests pass
- [x] **Step 6**`48f08aa3` main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
- [x] **Step 7**`069bc4a5` bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
- [x] **Step 8a**`a0707f4d` retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
- [x] **Step 9****Hot-swap on .228 verified.** All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
- [x] **.228 dashboard bugs** — ExtraHost `192.168.1.254` bug (`3ee192ba`) + LND macaroon permission bug (`be960023`). See "Post-Step 9 bug hunt" below.
- [ ] **Step 8b** — Port remaining ~25 container creations from `first-boot-containers.sh` into `apps/<id>/manifest.yml`, then port `update.rs` to orchestrator (deferred, multi-day work)
- [ ] **Step 8c** — Rename `first-boot-containers.sh``first-boot-setup.sh`, strip container ops, keep setup. Delete `reconcile-containers.sh` + `container-specs.sh`. Add ISO lines to copy `apps/` (final one-way door, requires 8b complete)
- [ ] **Step 10** — Hot-swap + verify on .116 (adoption-heavy test — .116 already has all containers running)
- [ ] **Step 11** — Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)
## Post-Step 9 bug hunt (.228, 2026-04-23)
User reported three visible dashboard bugs after Step 9 verification:
1. LND — "no connect details or QR"
2. ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
3. bitcoin-core — in scope for chaos testing
**Root cause #1 (ExtraHost, commit `3ee192ba`)**: `scripts/first-boot-containers.sh` computed `HOST_GATEWAY` from `ip route show default`, which returns the **LAN router** (e.g. 192.168.1.254), not the gateway to the host. Every container configured with `--add-host=host.containers.internal:$HOST_GATEWAY` was dialing the WiFi router instead of the host. LND crash-looped with `dial tcp 192.168.1.254:8332: connection refused`; ElectrumX's DAEMON_URL hit the same dead end; any `archy-net` bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic `host-gateway` literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected `--add-host`; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).
**Root cause #2 (macaroon permissions, commit `be960023`)**: LND's `admin.macaroon` lives at `/var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon`, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (`getinfo`, `connect-info`, `export-channel-backup`) plus the shared `lnd_client()` helper failed with "Failed to read LND admin macaroon". **Confirmed pre-existing on .116 too** (long-standing bug unrelated to Step 9). Fix: centralised the path as `LND_ADMIN_MACAROON_PATH`, added a `read_lnd_admin_macaroon()` helper in `api/rpc/lnd/mod.rs` that tries direct read first then falls back to `sudo -n cat` (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — `curl -k https://<host>/lnd-connect-info` now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
## Step 9 evidence (.228, 2026-04-23)
- Binary: Step 9 build with `732df1b8` + `ba83f9bc`, scp'd to .228 as `/usr/local/bin/archipelago`. Old binary backed up at `/usr/local/bin/archipelago.bak-pre-step9`. Later replaced with macaroon-fix build (`be960023`); previous backed up at `/usr/local/bin/archipelago.bak-pre-macaroon`.
- DEV_MODE override disabled (`override.conf``override.conf.disabled-pre-step9`).
- `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` populated.
- `/opt/archipelago/docker/bitcoin-ui/Dockerfile` replaced with the Step 7 version (no `COPY nginx.conf`). Old dir backed up as `bitcoin-ui.bak-pre-step9`.
- Post-start snapshot:
- `🔗 Adopted 1 existing container(s): ["electrs-ui"]` — adoption of 13h-running container worked without recreation
- `🔄 Boot reconciler started (interval: 30s)` — every 30s, all three app_ids reach `NoOp` after the initial install pass
- `bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18` — pre-start hook fires in `install_fresh`
- `curl localhost:8334` → HTTP 200 (bitcoin-ui), `:8081` → 200 (lnd-ui), `:50002` → 200 (electrs-ui)
- OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)
## Bugs fixed this session
1. **`parse_memory_limit` truncation bug** (`732df1b8`): lowercased "128Mi" → "128mi" → `trim_end_matches('m')` → "128i" → f64 parse fails → `None.unwrap_or(0)` → OCI `memory.limit:0` → systemd rejects MemoryMax=0. 6 regression tests; `create_container` now omits instead of emitting 0.
2. **`archipelago.service` cgroup delegation missing** (`ba83f9bc`): belt-and-braces `Delegate=memory pids cpu io`.
3. **ExtraHost `192.168.1.254`** (`3ee192ba`): see Post-Step 9 bug hunt above.
4. **LND admin.macaroon unreadable** (`be960023`): see Post-Step 9 bug hunt above.
## Commits made this session
```
3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
4b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer} (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)
```
Branch is **19 commits ahead of tx1138/main** (local only — user pushes to mirrors personally).
## Uncommitted state
Clean. Only untracked: `tests/` (bats harness from prior session, not in scope), `tmp-dump-spec.py` (scratch).
## Answered design questions (no need to re-ask)
1. UI container naming → `archy-<app_id>` for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
2. BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
3. Reconciler interval → 30 seconds
4. Concurrency → per-app `Mutex<()>` in a `DashMap`
5. Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
6. Step 4 extension → `ContainerOrchestrator` trait includes `install(app_id)`; the `manifest_path`-based install RPC stays dev-only
7. Step 7 bitcoin-ui template → embed via `include_str!`, render on install + every reconcile, atomic tmp+rename to `/var/lib/archipelago/bitcoin-ui/nginx.conf`, bind-mount into container. RPC user hardcoded `archipelago`, password from `/var/lib/archipelago/secrets/bitcoin-rpc-password`.
## Context: which host is what
| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| `archy` | 192.168.1.116 | **Dev ThinkPad** (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| `archy228` | 192.168.1.228 | Kiosk HP ProDesk. **Step 9 landing zone** — now running Rust-orchestrator binary in prod mode. | password123 | archipelago |
Both are development alpha nodes — **full destructive latitude**, no need to ask before stop/start/rebuild.
## Next action
**Step 10 — Hot-swap on .116.**
Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.
Steps:
1. Disable DEV_MODE on .116 (check if override.conf exists — `/etc/systemd/system/archipelago.service.d/`)
2. Stage the already-built binary at `~/Projects/archy/core/target/release/archipelago``/usr/local/bin/archipelago.new`
3. Ensure `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` present (copy from repo)
4. Ensure `/opt/archipelago/docker/bitcoin-ui/` matches the Step-7 layout (no baked nginx.conf)
5. Snapshot: `podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}"` → save to `/tmp/pre-step10-containers.txt`
6. `systemctl stop archipelago` → install binary → `systemctl start archipelago`
7. Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
8. If broken → restore `.bak` binary, re-enable DEV_MODE override.
9. Commit STATUS.md update.
**Risk on .116:** If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is `install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago`.
**After Step 10 we are blocked on Step 8b** (multi-day manifest ports) before Step 11 (chaos matrix).
---
### Why Step 8 got split (discovered 2026-04-23)
Original plan was one commit "delete bash + edit ISO builder". But on investigation:
- `first-boot-containers.sh` creates **30+ containers** with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
- Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
- `update.rs` (OTA update RPC) invokes `reconcile-containers.sh` at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
- Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.
---
# Archipelago — Current State, Plan, and Releases
Updated: 2026-04-22
This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in [`bulletproof-containers.md`](./bulletproof-containers.md).
---
## Current state
### Fleet status
All four Gitea mirrors are synced to v1.7.40-alpha:
| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | ✅ v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | ✅ v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | ✅ v1.7.40-alpha live (Gitea recovered via `podman system renumber` — see below) |
| .168 | http://146.59.87.168:3000 | ✅ v1.7.40-alpha live |
Fleet test nodes:
| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via `systemd-run chmod 755 /opt/archipelago/web-ui` after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | unreachable today | |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |
### Known open issues (drives the plan below)
1. **UI companion containers disappear** on .228 after daemon restarts — no auto-recreate (fixed by v1.7.45 Quadlet migration)
2. **bitcoin.conf rpcauth drifts** from canonical secret → ElectrumX "Daemon connection problem" (fixed by v1.7.43 reconcile::derived)
3. **`host.containers.internal`** resolves to LAN gateway inside containers on some versions (fixed by v1.7.42 containers.conf)
4. **Podman state DB loss** requires manual recovery (fixed by v1.7.44 startup self-heal)
5. **LND "Connect Wallet" info** vanishing after crashes — symptom of the same drift class as #2
6. **ElectrumX not syncing** on .228 — downstream of #2; will resolve when bitcoin.conf is reconciled
### Recent field incident (2026-04-22)
- Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was `drwx------` (700). Every node that OTA'd got 500 errors on every page.
- Root-cause fix shipped in v1.7.40 (`create-release-manifest.sh` chmod + pre-ship assertion that `tar tvzf | head -1` shows `drwxr-xr-x`).
- .160 Gitea was down all day (502) because its rootless podman's `libpod/bolt_state.db` had vanished. Recovered via clearing `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber`.
- Full failure-mode audit is in [`bulletproof-containers.md`](./bulletproof-containers.md).
---
## Plan
We're shipping a level-triggered **reconciler + Quadlet** architecture over six incremental releases. Each release closes one failure mode. See [`bulletproof-containers.md`](./bulletproof-containers.md) for the full design, code layout, test harness, chaos matrix, sources.
### Release roadmap
| Release | Closes | What lands | Status |
|---|---|---|---|
| **v1.7.41** | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes `https://127.0.0.1/` on boot; if non-200 within 90s, restores `web-ui.bak` + calls `rollback_update()` + restarts | **in flight — deploying to .228 for test** |
| **v1.7.42** | FM4 (`host.containers.internal` wrong) | `/etc/containers/containers.conf` w/ `host_containers_internal_ip = 10.89.0.1`; every container gets `--add-host=host.archipelago:10.89.0.1` | pending |
| **v1.7.43** | FM2 (config drift) | `reconcile::derived::render_bitcoin_conf` — pure fn over canonical secret, rewrites on drift. Same for `lnd.conf` | pending |
| **v1.7.44** | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via `/run/user/$UID/*` clear + `system renumber` | pending |
| **v1.7.45** | FM1 + FM3 (companion orphans) | `archy-bitcoin-ui` → Quadlet `.container` unit in `/etc/containers/systemd/`. systemd (not archipelago) owns it | pending |
| **v1.7.46** | — | `archy-lnd-ui` → Quadlet | pending |
| **v1.7.47** | — | `archy-electrs-ui` → Quadlet | pending |
| **v1.7.48+** | all (full daemon refactor) | `core/archipelago/src/reconcile/` module replaces imperative `install.rs` container management. Main app containers become Quadlet too | pending |
Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.
---
## Release history
### [v1.7.41-alpha](/releases/v1.7.41-alpha/) — IN FLIGHT — 2026-04-22
**Post-OTA auto-rollback.** After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.
Changes:
- `core/archipelago/src/update.rs`: `PendingVerification` struct, write marker before service restart, `verify_pending_update()` on new binary boot — probes `https://127.0.0.1/`, on fail restores `web-ui.bak` + calls `rollback_update()` + `systemctl restart archipelago`
- `core/archipelago/src/main.rs`: startup task invokes verifier concurrently with server
### [v1.7.40-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.40-alpha/) — 2026-04-22
**Proper fix for the 500 error.** Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly `chmod 755` before tar; `--mode=u=rwX,go=rX` normalizes archive perms; pre-ship assertion aborts release if `tar tvzf | head -1` isn't `drwxr-xr-x`.
Changes:
- `scripts/create-release-manifest.sh`: pre-tar chmod + tar --mode flag + post-tar verify
- Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)
### [v1.7.39-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.39-alpha/) — 2026-04-22
**Hotfix attempt** for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in `main.rs` and post-extract chmod in `update.rs` OTA applier.
### [v1.7.38-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.38-alpha/) — 2026-04-22
**Onboarding auto-heal + silent logins + App Store trim.**
Changes:
- `auth.rs`: `is_onboarding_complete()` auto-heals from `setup_complete` + `password_hash` (prevents clear-cache → onboarding wizard bug)
- `useOnboarding`: tri-state — backend-unreachable no longer defaults to `/onboarding/intro`
- Login sounds gated by `isFirstInstallPhase()` — silent after onboarding, typing sounds unaffected
- Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
- Deleted 15 image versions from tx1138, .168, gitea-local registries
- AIUI baked into release tarball via `demo/aiui/`
- `prebuild` hook syncs `app-catalog/catalog.json``public/catalog.json`
(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)
### [v1.7.37-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.37-alpha/) — 2026-04-22
**Bitcoin Core install fixes + dynamic node UI + full-archive default.**
- Bitcoin Core passes explicit `-rpcbind/-rpcallowip/etc.` CLI args so vanilla image exposes RPC
- Split `bitcoin-core` from `bitcoin-knots` in backend `AppMetadata`
- bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
- Storage (Full Archive · X GB / Pruned) indicator on dashboard
- Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
- Pull fallback to `docker.io` when no mirror carries the image
- Removed `prune=550` hardcode — full archive default
---
## Key docs
- [`bulletproof-containers.md`](./bulletproof-containers.md) — full reconcile architecture, code layout, test matrix, chaos scenarios, sources
- [`BETA-RELEASE-CHECKLIST.md`](./BETA-RELEASE-CHECKLIST.md) — existing beta checklist
- [`BETA-ISSUES-20260328.md`](./BETA-ISSUES-20260328.md) — prior beta-blocker tracking
- [`hotfix-process.md`](./hotfix-process.md) — release workflow
- [`architecture.md`](./architecture.md) — system architecture overview
---
## How to resume
1. Check fleet mirrors are all live: `curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
2. Read [`bulletproof-containers.md`](./bulletproof-containers.md) for the current plan
3. Check task list (`/list` or via Claude Code) for the in-flight release
4. Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified

View File

@ -1,179 +0,0 @@
# Step 8b Port Audit — container-specs.sh → apps/*/manifest.yml
Last updated: 2026-04-23
This audit is the scope-lock for Step 8b of `docs/rust-orchestrator-migration.md`. Every container currently declared in `scripts/container-specs.sh:ALL_CONTAINER_SPECS` must be port-faithful to `apps/<id>/manifest.yml` before Step 8c can delete the bash scripts.
Findings in short:
- `scripts/container-specs.sh` lists **30 containers** across 5 tiers.
- `apps/*/manifest.yml` exists for **27 app ids**, but the overlap is partial and most of the overlapping manifests are **aspirational stubs written in the original design phase, never reconciled against production behavior**. The image references, container names, network topology, env, and health checks disagree with what actually runs on `.116` and `.228`.
- Only the three UI apps (`bitcoin-ui`, `electrs-ui`, `lnd-ui`) plus `aiui` are truly ported (Step 7 scope).
- The Rust schema (`core/container/src/manifest.rs::AppManifest`) is **missing** several fields needed for a faithful port: `archy-net` network selection, `custom_args`, `entrypoint` override, derived host env (e.g. `HOST_MDNS`), secret-file env injection, and data-dir UID/GID mapping.
---
## Table — every spec, mapped
Legend for **Status**:
- ✅ PORTED — manifest exists and matches reality (Step 7 done).
- ⚠ STUB — `apps/<id>/manifest.yml` exists but disagrees with `container-specs.sh` (image, name, network, env, or health wrong).
- ❌ MISSING — no manifest file on disk.
- — N/A — intentionally out of Step 8b (optional app with no spec, or already managed by a different system).
| Tier | Spec name (container-specs.sh) | Actual container name | Image source | apps/<id>/ matches? | Status | Notes |
|-----:|----------------------------------|-----------------------|-------------------------------------|---------------------|--------|-------|
| 0 | archy-mempool-db | archy-mempool-db | `$MARIADB_IMAGE` | mempool/ | ⚠ | Existing manifest (if any) targets mempool combined stack, not the DB sidecar. Likely a companion of `apps/mempool`. |
| 0 | archy-btcpay-db | archy-btcpay-db | `$BTCPAY_POSTGRES_IMAGE` | btcpay-server/ | ⚠ | Existing manifest describes only the app container. DB is a silent companion in the current model. |
| 0 | immich_postgres | immich_postgres | `$IMMICH_POSTGRES_IMAGE` | (none) | ❌ | Optional. No `apps/immich/` dir. |
| 0 | immich_redis | immich_redis | `$VALKEY_IMAGE` | (none) | ❌ | Optional. No `apps/immich/` dir. |
| 1 | bitcoin-knots | bitcoin-knots | `$BITCOIN_KNOTS_IMAGE` | bitcoin-core/ | ⚠ | `apps/bitcoin-core/manifest.yml` references `bitcoin/bitcoin:28.4`; production runs Bitcoin **Knots** at `$ARCHY_REGISTRY/bitcoin-knots:latest`. App id mismatch: spec is `bitcoin-knots`, manifest is `bitcoin-core`. Decide: rename spec or rename app id. |
| 1 | electrumx | electrumx | `$ELECTRUMX_IMAGE` | (none) | ❌ | Separate from `electrs-ui`. No `apps/electrumx/` dir. |
| 2 | lnd | lnd | `$LND_IMAGE` | lnd/ | ⚠ | Manifest exists; needs verification against current env/ports/caps. |
| 2 | mempool-api | mempool-api | `$MEMPOOL_BACKEND_IMAGE` | mempool/ | ⚠ | Companion of `apps/mempool`. May need dedicated manifest or stack-form. |
| 2 | archy-mempool-web | archy-mempool-web | `$MEMPOOL_WEB_IMAGE` | mempool/ | ⚠ | Companion. |
| 2 | archy-nbxplorer | archy-nbxplorer | `$NBXPLORER_IMAGE` | btcpay-server/ | ⚠ | Companion of BTCPay. |
| 2 | btcpay-server | btcpay-server | `$BTCPAY_IMAGE` | btcpay-server/ | ⚠ | Stub; env, ports, deps need reconciliation. |
| 2 | fedimint | fedimint | `$FEDIMINT_IMAGE` | fedimint/ | ⚠ | **This is the bug from yesterday.** Stub references wrong image (`fedimint/fedimintd:v0.10.0` instead of `$ARCHY_REGISTRY/fedimintd:v0.10.0`), wrong RPC target (`bitcoin-core:8332` instead of `bitcoin-knots:8332`), missing `HOST_MDNS` env, missing `archy-net`, missing `FM_BIND_P2P`/`FM_BIND_API`, missing gateway ports etc. |
| 2 | fedimint-gateway | fedimint-gateway | `$FEDIMINT_GATEWAY_IMAGE` | (none) | ❌ | No manifest. Has complex LND-aware entrypoint in `container-specs.sh:load_spec_fedimint-gateway`. |
| 2 | immich_server | immich_server | `$IMMICH_SERVER_IMAGE` | (none) | ❌ | Optional. |
| 3 | homeassistant | homeassistant | `$HOMEASSISTANT_IMAGE` | home-assistant/ | ⚠ | id mismatch: `homeassistant` vs `home-assistant`. |
| 3 | grafana | grafana | `$GRAFANA_IMAGE` | grafana/ | ⚠ | Stub. |
| 3 | uptime-kuma | uptime-kuma | `$UPTIME_KUMA_IMAGE` | (none) | ❌ | Optional. |
| 3 | jellyfin | jellyfin | `$JELLYFIN_IMAGE` | (none) | ❌ | Optional. |
| 3 | photoprism | photoprism | `$PHOTOPRISM_IMAGE` | (none) | ❌ | Optional. |
| 3 | vaultwarden | vaultwarden | `$VAULTWARDEN_IMAGE` | (none) | ❌ | Optional. Known-bad container on `.228` (see STATUS.md). |
| 3 | nextcloud | nextcloud | `$NEXTCLOUD_IMAGE` | (none) | ❌ | Optional. |
| 3 | searxng | searxng | `$SEARXNG_IMAGE` | searxng/ | ⚠ | Stub. |
| 3 | onlyoffice | onlyoffice | `$ONLYOFFICE_IMAGE` | onlyoffice/ | ⚠ | Stub. |
| 3 | filebrowser | filebrowser | `$FILEBROWSER_IMAGE` | (none) | ❌ | **Critical** — this is Archipelago baseline (bootstrapped by first-boot), not an optional app. Lost `.filebrowser.json` yesterday. Must have a manifest. |
| 3 | nginx-proxy-manager | nginx-proxy-manager | `$NPM_IMAGE` | (none) | ❌ | Optional. |
| 3 | portainer | portainer | `$PORTAINER_IMAGE` | (none) | ❌ | Optional. |
| 3 | ollama | ollama | `$OLLAMA_IMAGE` | ollama/ | ⚠ | Stub. |
| 4 | archy-bitcoin-ui | archy-bitcoin-ui | `localhost/bitcoin-ui:local` | bitcoin-ui/ | ✅ | Step 7 done. |
| 4 | archy-lnd-ui | archy-lnd-ui | `localhost/lnd-ui:local` | lnd-ui/ | ✅ | Step 7 done. |
| 4 | archy-electrs-ui | archy-electrs-ui | `localhost/electrs-ui:local` | electrs-ui/ | ✅ | Step 7 done. |
### Non-spec apps that already have manifests (outside `container-specs.sh`)
These are managed entirely by the install RPC today and already have adoption paths in the Rust orchestrator. They are **not** in 8b scope:
- `aiui`, `botfights`, `core-lightning`, `did-wallet`, `endurain`, `gitea`, `indeedhub`, `lightning-stack` (stack), `meshtastic`, `morphos-server`, `nostr-rs-relay`, `router`, `strfry`, `web5-dwn`.
---
## Schema gaps blocking faithful ports
`core/container/src/manifest.rs::AppManifest` currently supports:
- `container.image` OR `container.build` (mutually exclusive, validated).
- `dependencies: Vec<Dependency>`, `resources: {cpu_limit, memory_limit, disk_limit}`.
- `security: { capabilities, readonly_root, network_policy: string, apparmor_profile }`.
- `ports: Vec<{host, container, protocol}>`, `volumes: Vec<{type, source, target, options}>`.
- `environment: Vec<String>` (each `"KEY=VALUE"`).
- `health_check: {type, endpoint, path, interval, timeout, retries}`.
- `devices: Vec<String>`, `extensions: HashMap<String, Value>` (flatten).
What `container-specs.sh` uses that the schema **does not** express first-class:
| Need | Example from bash | Proposed schema addition |
|---|---|---|
| Join the named `archy-net` bridge | `SPEC_NETWORK="archy-net"` | `container.network: Option<String>` (Some("archy-net"), or None for `isolated`, or "host"). Existing `security.network_policy` left as-is for policy knobs (e.g. firewall isolation layer); this new field is literally the podman `--network` value. |
| Extra args / custom flags | `SPEC_CUSTOM_ARGS="-server=1 -prune=550 ..."` | `container.custom_args: Vec<String>`. |
| Entrypoint override | `SPEC_ENTRYPOINT="gatewayd --data-dir /data ... lnd --lnd-rpc-host lnd:10009"` | `container.entrypoint: Option<Vec<String>>`. |
| Host-derived env (mDNS hostname, host IP) | `FM_P2P_URL=fedimint://$HOST_MDNS:8173` | `container.derived_env: Vec<{key, template}>` with a small allow-list of `{{HOST_MDNS}}`, `{{HOST_IP}}`, `{{DISK_GB}}` substitutions resolved at apply time. |
| Secret-file env (read from `/var/lib/archipelago/secrets/<name>`) | `FM_BITCOIND_PASSWORD=$BITCOIN_RPC_PASS` (from secret file in bash) | `container.secret_env: Vec<{key, secret_file}>`, secret_file relative to `$SECRETS_DIR`. Never logged. |
| Data dir UID/GID (for rootless mapped chown) | `SPEC_DATA_UID="100070:100070"` | `container.data_uid: Option<String>` (e.g. `"100070:100070"`). Applied as `chown -R` before container create. |
| Exec health check | `SPEC_HEALTH_CMD="bitcoin-cli ..."` | Extend `HealthCheck` so `type: exec` + `command: Vec<String>` works end-to-end; confirm the runtime honors it. |
| Optional/skip-when-not-installed semantics | `SPEC_OPTIONAL="true"` | Already covered: `BootReconciler` only installs if an `AppManifest` is registered. For baseline-on-first-boot containers (filebrowser), we use the same install path. No schema change. |
| Local-image flag (don't pull) | `SPEC_LOCAL_IMAGE="true"` | Already covered: `container.build` vs `container.image`. |
Everything else (tier ordering, dependency tree, readonly_root, tmpfs mounts) is either already in the schema or folded into `custom_args` cleanly.
### tmpfs
`SPEC_TMPFS="/tmp:rw,noexec,nosuid,size=256m ..."` used by `grafana`, `searxng`, `ollama`. Currently no first-class field. Proposed: `volumes[].type: tmpfs` with a new `tmpfs_options` field on `Volume`, or a dedicated `container.tmpfs: Vec<{target, options}>`. Either works; the `Volume`-variant keeps all mount declarations in one place.
---
## Proposed commit sequence
Each item is a separate commit. None recreates a container on the fleet.
**8b.0 — schema extensions, no manifest changes, no orchestrator changes**
1. `feat(container/manifest): add network, custom_args, entrypoint, derived_env, secret_env, data_uid, tmpfs fields` — add fields to `ContainerConfig`/`SecurityPolicy`/`Volume`, update `validate()`, add unit tests per new field. Backwards-compat: every existing `apps/*/manifest.yml` must still parse (verify with a `parse_every_real_manifest` test that walks `apps/*/manifest.yml` in the repo).
2. `feat(container/manifest): resolve derived_env against host facts` — add `HostFacts { host_ip, host_mdns, disk_gb }` struct and `resolve_env(facts) -> Vec<String>` method; unit test with a fixed `HostFacts`.
3. `feat(container/manifest): resolve secret_env against a SecretsProvider` — add trait `SecretsProvider { fn read(&self, name: &str) -> Result<String>; }`, stub `FileSecretsProvider` rooted at `/var/lib/archipelago/secrets`, unit test with a tmpdir provider.
**8b.1 — orchestrator honors the new fields**
4. `feat(prod_orchestrator): honor network/custom_args/entrypoint on create` — thread the new `ResolvedContainerConfig` into the runtime's create call. Mock-runtime unit tests for each field.
5. `feat(prod_orchestrator): chown data dir to data_uid before create` — called from `install_fresh`. Unit test with a tmpdir.
6. `feat(prod_orchestrator): resolve derived_env + secret_env before create` — wire in `HostFacts` + `SecretsProvider`. Unit test.
**8b.2 — first real backend port: fedimint**
7. `feat(apps/fedimint): port manifest from container-specs.sh with mDNS URLs + archy-net` — rewrites `apps/fedimint/manifest.yml` using the new schema. Includes `container_name: fedimint` (no prefix), `network: archy-net`, `derived_env: [FM_P2P_URL, FM_API_URL]`, `secret_env: [FM_BITCOIND_PASSWORD, ...]`.
8. `feat(apps/fedimint-gateway): new manifest with LND-aware entrypoint` — creates `apps/fedimint-gateway/manifest.yml`. Dynamic entrypoint is a 2-case template resolved by a derived field `{{LND_AVAILABLE}}` (presence of `/var/lib/archipelago/lnd/tls.cert`). May require a second commit to add that derived fact — scope-judge at write time.
9. `test(lifecycle): fedimint adoption + fresh-install` — bats scaffold per `docs/bulletproof-containers.md§Test harness`.
**8b.3 — remaining critical backends (one per commit)**
10. `feat(apps/filebrowser): new manifest — baseline Archipelago service` (fixes yesterday's `.filebrowser.json` loss by regenerating via `custom_args: ["--config", "/data/.filebrowser.json"]` + `caps: [..., NET_BIND_SERVICE]`).
11. `feat(apps/electrumx): new manifest`.
12. `feat(apps/bitcoin-knots): rename-or-merge with apps/bitcoin-core/manifest.yml` — decide naming once, update everywhere. Recommend: keep `apps/bitcoin-core/` dir (it's the user-visible app name) and use `extensions.container_name: bitcoin-knots` to preserve adoption.
13. `feat(apps/lnd): reconcile stub against spec`.
14. `feat(apps/btcpay-server + companions): multi-container stack` — reuse the existing stack path in `api/rpc/package/stacks.rs` OR decide to add `container.companions: Vec<ContainerConfig>`. Defer decision until 1013 land.
**8b.4 — mempool stack, optional apps**
Continue one-at-a-time until every ⚠ or ❌ row above is ✅.
**8b.5 — port `core/archipelago/src/api/rpc/package/update.rs`**
Replace `reconcile-containers.sh` calls with `ContainerOrchestrator::upgrade(app_id)`. Unblocks 8c.
**8c — delete bash scripts** (per `docs/rust-orchestrator-migration.md`).
---
## Runtime-only drift on `.116` — write it into manifests, not scripts
Per `docs/RESUME.md§Runtime-only fixes on .116`, yesterday's patches are:
1. `~archipelago/.config/containers/containers.conf` (`image_copy_tmp_dir = "storage"`) → lands in `first-boot-setup.sh` (renamed in Step 8c) OR in a Rust startup-side prereq hook. Not a per-manifest concern.
2. Secrets ownership `archipelago:archipelago` → Rust orchestrator's `ensure_secrets` path (already exists; verify it chowns).
3. `/var/lib/archipelago/filebrowser-data/.filebrowser.json` → handled by filebrowser's `custom_args: ["--config", "/data/.filebrowser.json"]` plus a pre-start hook (mirrors `bitcoin_ui` precedent) that writes the file if absent. Details in 8b.3 commit 10.
4. Fedimint data dir chown → handled by `container.data_uid: "100000:100000"` in the fedimint manifest.
All runtime-only fixes end up expressed as manifest fields or Rust-side hooks. None survives as bash.
---
## Open decisions (lock before writing code)
1. **`bitcoin-knots` vs `bitcoin-core` naming.** Recommend: app id stays `bitcoin-core` (user-facing), container name becomes `bitcoin-knots` via `extensions.container_name`, image is Knots. Or rename both to `bitcoin-knots` for honesty. Pick one and apply everywhere.
2. **`archy-` prefix rule.** Currently `UI_APP_IDS` in `prod_orchestrator.rs` hardcodes `["bitcoin-ui", "electrs-ui", "lnd-ui"]``archy-`. Several backends use `archy-` too (`archy-mempool-db`, `archy-mempool-web`, `archy-nbxplorer`, `archy-btcpay-db`). Recommend: drop the hardcoded list, rely on `extensions.container_name` everywhere, audit all existing manifests to set it explicitly so adoption doesn't orphan.
3. **Companions (mempool-api + mempool-web + mempool-db, btcpay-server + nbxplorer + btcpay-db).** Two options: (a) one manifest per container with explicit deps and an "app group" id; (b) extend `ContainerConfig` with `companions: Vec<…>`. `apps/lightning-stack/manifest.yml` already shipped probably has a precedent — check its shape before deciding.
4. **Keep `container-specs.sh` as the source of truth until 8b is fully ported?** Yes. `BootReconciler` only acts on what's in `apps/*/manifest.yml`; anything not ported stays on the bash path until its commit lands. Zero-downtime migration.
---
## Where to resume
After user approves this plan: commit 1 in 8b.0 (schema extensions + tests, no orchestrator or manifest changes). Smallest possible diff, highest leverage, and unblocks every subsequent port.
## Validation Snapshot - 2026-04-28
- Runtime cleanup: removed orphan `bold_lichterman` duplicate; retained managed `filebrowser`.
- Launch policy alignment: local app launches are port-based; iframe-blocked apps (including `gitea`) are forced to new-tab.
- App icon reliability: image fallback now retries `.svg` when `.png` does not exist.
- Required stack verification on `.116`:
- `tests/lifecycle/bats/required-stack.bats` -> PASS
- `ARCHY_ALLOW_DESTRUCTIVE=1 tests/lifecycle/bats/required-stack-destructive.bats` -> PASS
- Broad host-port probe confirms HTTP 200 responses for user-facing app UIs on mapped ports; non-HTTP ports intentionally excluded from HTTP pass/fail semantics.

View File

@ -1,288 +0,0 @@
# Weekly Release Tracker
Last updated: 2026-06-14 (session on node .116 / archi-thinkpad)
---
# ▶ IN PROGRESS — LND wallet auto-unlock fix (2026-06-14)
## RESUME PROMPT (paste into a fresh session, on .116 / archi-thinkpad, tree at /home/archipelago/Projects/archy)
> Resume the LND wallet-password fix. Read memory `project_lnd_wallet_password.md` FIRST (full
> root-cause + design + validated facts). Work is on branch `lnd-wallet-password-fix` (pushed to
> gitea-vps2, commit 91adc281, NOT merged to main, NOT shipped). Bug: hardcoded
> `WALLET_PASSWORD="hellohello"` left LND wallets LOCKED fleet-wide after OTA → Bitcoin-receive
> shows "wallet is locked" on every updated node. DONE + cargo-checked: per-node random secret
> (secrets/lnd-wallet-password), both init paths unified, candidate-unlock with fail-fast,
> login-time candidate-migration (ChangePassword). DETECTION GATE already shipped on main
> (commit 8c8e4d7a). DECISION: alpha, NO funds on nodes → destructive wipe+recreate is OK and
> wanted UNATTENDED for ALL nodes in the next update. A wallet locked with an unknown password is
> already inaccessible, so wiping loses nothing reachable.
## EXACT NEXT STEPS — LND fix (in order)
1. **Finish seed/fresh recovery** (REMAINING piece): in `container/lnd.rs ensure_wallet_initialized`,
when wallet.db exists but ALL unlock candidates fail → wipe wallet.db (+ macaroons + graph/chain
mainnet state, as root via host_sudo) and re-init fresh (random genseed + per-node secret) so the
node self-heals unattended at boot. (Login-time candidate-migration already handles nodes whose
pw matches.) Validate the wipe→reinit mechanic on the scratch LND first (see below).
2. **Scratch validation** (was in progress, .249 unreachable from .116's subnet → use a throwaway
`lnd-scratch` podman container on .116, regtest/neutrino, REST :18099 — already proven for
init/unlock/ChangePassword). Test: init(passA) → restart→LOCKED → delete wallet.db while locked →
confirm /v1/state→NON_EXISTING (may need container restart) → genseed+initwallet fresh → unlock.
NOTE: scratch wallet.db lives at the container's LND data dir (regtest), `podman exec lnd-scratch
find / -name wallet.db`. CLEAN UP: `podman rm -f lnd-scratch` when done.
3. `cargo check -p archipelago` (on .116 ~15-30s incremental; full test compile ~9min).
4. **End-to-end on .228** (reachable 192.168.1.x, SSH pw `archipelago`, UI pw unknown, NO funds —
has a locked unknown-pw wallet = perfect auto-recreate test): build binary
(`ARCHIPELAGO_TARGET=archipelago@192.168.1.228 scripts/deploy-to-target.sh` or per
reference_deploy_to_nodes), deploy, restart, confirm wallet auto-recreates+unlocks, lncli state
RPC_ACTIVE, lnd.newaddress returns an address. Run os-audit against .228 → lnd check PASS.
5. Merge `lnd-wallet-password-fix` → main, then **cut + publish v1.7.93-alpha** (carries the LND
fix). Ship ritual: create-release.sh 1.7.93-alpha → add CHANGELOG (≥3 layman bullets) → run
sync-whats-new.py (the new What's-New gate will require it) → publish-release-assets.sh gitea-vps2
→ push origin/gitea-vps2 + tags → verify live manifest==1.7.93-alpha. Heads-up: create-release
leaves core/Cargo.lock version-bump uncommitted (commit it as a chore, both .91 and .92 hit this).
## Context: how we got here (this session, all on node .116)
- Shipped **v1.7.91-alpha** (bitcoinReceive TS2538 build fix) and **v1.7.92-alpha** (ElectrumX
overlay-during-sync fix; L3 reboot os-audit gate; What's-New sync gate + 8-version backfill) —
both LIVE on vps2. Restored .116-local nginx `/lnd-connect-info` route (was dropped 2026-06-10).
- Triaged user symptoms: ElectrumX "can't connect" = electrs syncing / Bitcoin verifying (not a
regression); .228 "5/14 apps after reboot" = normal ~5min staggered startup (all 14 came up).
- LND lock bug found + detection gate shipped + forward fix & migration implemented (this section).
---
# ✔ DONE PASS — v1.7.91-alpha + v1.7.92-alpha (2026-06-14)
## Outcome (both releases PUBLISHED + LIVE on vps2)
- **v1.7.91-alpha** — bitcoinReceive.ts TS2538 build-blocker fixed; cut, published, verified
live (`manifest.version==1.7.91-alpha`), tag `v1.7.91-alpha` on vps2. The fleet OTA'd to it
(confirmed on .116 + .198).
- **v1.7.92-alpha** — cut, published, verified live (`manifest.version==1.7.92-alpha`), tag on
vps2, main@d462e444. Carries:
- `fix(ui)` ElectrumX **overlay-during-sync** bug — the "App not reachable / retry" overlay
no longer paints over the ElectrumX sync screen (AppSessionFrame.vue gated on `!electrsSync`).
- `test(resilience)` **L3 per-boot health gate**`batch_host_reboot` now runs os-audit.sh
after reboot (RPC/OTA/all-apps/FM-guards), not just container-set equality. os-audit validated
11/0/0 green on .116.
- `feat(release)` **What's New sync gate**`scripts/sync-whats-new.py` + `whats-new-sync`
stage in tests/release/run.sh. Backfilled the 8 missing modal blocks (v1.7.85→.92); the gate
fails any release whose CHANGELOG version isn't in the Settings modal.
- **.116 node fix (not shipped — local config)**: restored the `/lnd-connect-info` nginx proxy
route that a 2026-06-10 "before-116-routing" change had dropped (fell through to SPA). Backup at
`/etc/nginx/conf.d/rpc.tx1138.com.conf.bak-lndconnect-*`. Shipped template already has the route.
- **User symptoms triaged (none were .91/.92 regressions)**: receive-generate "unchanged" = .91's
receive change was a behavior-preserving build guard; ElectrumX "can't connect" on .198 = Bitcoin
node mid-"Verifying blocks…" (-28) so electrs was "waiting for Bitcoin node"; on .116 electrs was
~59% mid-sync. The overlay UX bug is fixed regardless.
## Known follow-ups (not blockers)
- **gitea-local mirror push fails** (`localhost:3000` → redirect to `/login`, token auth). vps2 is
the OTA source and is fine; gitea-local secondary mirror is stale. Diagnose the local Gitea token.
- `sync-whats-new.py` only **inserts missing** versions; it does not rewrite a block when CHANGELOG
bullets for an already-present version change (had to delete+resync the .92 block by hand to pick
up its 3rd bullet). Fine for the forward case; enhance to idempotently re-render if needed.
## What happened this session
- `scripts/create-release.sh 1.7.91-alpha` was running; its release gate PASSED all 7 checks,
backend built clean (7m22s), then it **FAILED at step [4/8] frontend build** with:
`src/utils/bitcoinReceive.ts(23,24): error TS2538: Type 'undefined' cannot be used as an index type.`
Cause: `noUncheckedIndexedAccess``codeMatch[1]` is `string | undefined` and was used directly
to index `RECEIVE_CODE_MESSAGES`. **FIXED**`const code = message.match(/\[([A-Z_]+)\]/)?.[1]`
then `if (code && RECEIVE_CODE_MESSAGES[code])`. `npx vue-tsc --noEmit` is now clean (exit 0).
The failed run aborted BEFORE bumping the manifest (still 1.7.90) or tagging (no v1.7.91 tag),
but it HAD already partial-bumped Cargo.toml/package.json/locks to 1.7.91 — those partial bumps
are reverted (create-release.sh re-owns the bump); only the genuine TS fix + harness are committed.
- Built a new OS-wide health harness `tests/lifecycle/os-audit.sh` (non-destructive, one scorecard):
Section A backend/RPC health, Section B all-apps lifecycle audit (delegates to remote-lifecycle.sh),
Section C FM-guards (port-drift + secret-completeness bats, orphan-container sweep). Section A
validated all-PASS on .116. Fixed a jq bug in the FM12 OTA-wedge check: `//` treats a legit
`false` as empty and fell through to "unknown" — now uses `has()`. Section B is slow (~3 min) and
opaque while running because output is captured (`out=$(...)`) not streamed — minor wart, TODO.
## EXACT NEXT STEPS — v1.7.91 (in order)
1. Confirm clean tree + on main (`git status`; create-release.sh requires `git diff --quiet HEAD`).
The TS fix + os-audit.sh are committed & pushed; version-bump artifacts reverted to 1.7.90.
2. Re-run the release: `scripts/create-release.sh 1.7.91-alpha`. Backend is cached (only a .ts
changed) so it's fast; the frontend build now passes. It bumps versions, builds, writes
releases/manifest.json (→1.7.91-alpha), commits, and tags v1.7.91-alpha.
- Memory guards: grep the staged frontend tarball for "1.7.91-alpha" before shipping (silent
vue-tsc failures); tarball must be flat (`tar -C web/dist/neode-ui .`).
3. Publish: `scripts/publish-release-assets.sh 1.7.91-alpha gitea-vps2`, then
`git push origin main && git push origin --tags` (origin pushes to BOTH gitea-local + vps2).
4. Verify manifest LIVE (this is "published"):
`curl -fsS http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
must show `1.7.91-alpha`. **Then notify the user — they asked to be told when 1.7.91 publishes.**
5. os-audit harness: run a full green pass on .116
(`ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=http ARCHY_PASSWORD='ThisIsWeb54321@' tests/lifecycle/os-audit.sh`),
confirm Section A FM12 now reads `update_in_progress=false` (PASS not WARN), review B + C findings,
then wire os-audit.sh into the reboot-survival (L3) loop as the per-boot gate.
---
# ─ HISTORY — v1.7.89-alpha pass (2026-06-12), superseded ─
Last updated: 2026-06-12 ~17:45 EDT (session on node .116)
## RESUME PROMPT (paste into a fresh session)
> Continue the v1.7.89-alpha release pass from /home/archipelago/Projects/archy on node .116.
> Read docs/WEEKLY_RELEASE_TRACKER.md fully first — it has root causes, fixes already made,
> and exact next steps. Do NOT redo: AIUI revert (done, validated), updater fixes in
> core/archipelago/src/update.rs (done, uncommitted), .116 OTA unwedge (done). Resume at
> "EXACT NEXT STEPS" below.
## EXACT NEXT STEPS (in order)
1. Backend focused tests were running in background:
`cd core && timeout 1500 cargo test -p archipelago -- update:: lnd container::image_versions scanner`
(log: /tmp/claude-.../tasks/bds4jk19e.output — if lost, just rerun the command; first
attempt died at 400s timeout during test compile, 1500s is the right budget).
Need: all green.
2. RESOLVED before session end: vitest recheck passed clean — EXIT=0, 79 files / 645 tests,
even while cargo test was compiling. The earlier harness ui-unit-tests FAIL was load/flake
(machine saturated by the parallel cargo test compile), not a real failure. On resume just
rerun `tests/release/run.sh --quick` WITHOUT a parallel cargo build to confirm green;
if it ever fails again, the failing test name is in the stage output (drop `--silent`).
3. Run full harness: `tests/release/run.sh` (static+frontend+backend). Then commit ALL
working-tree changes (one commit, e.g. "fix: harden OTA updates, AIUI desktop gap, LND
no-proxy" — CHANGELOG v1.7.89 section is already curated).
4. Cut release: `scripts/create-release.sh 1.7.89-alpha` (needs clean tree, on main,
validates CHANGELOG section exists — it does). Then
`tests/release/run.sh --manifest` should pass, and grep the staged frontend tarball
for 1.7.89-alpha (memory: silent build failures).
5. Publish: `scripts/publish-release-assets.sh 1.7.89-alpha gitea-vps2`, then
`git push origin main && git push origin --tags` and push gitea-local + tags too.
Verify manifest live on http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json
6. Verify OTA on THIS node (.116): schedule is auto_apply; either wait for the scheduler
or trigger via UI. Confirm /var/lib/archipelago/update_state.json current_version
becomes 1.7.89-alpha, `update_in_progress` returns to false, web-ui + binary versions
MATCH (this node currently has web-ui 1.7.84 / binary 1.7.85 mismatch — the OTA heals it),
and journalctl shows "Post-OTA verification succeeded" (the new probe falls back to
http://127.0.0.1/ which is what .116 serves).
7. Update this tracker + docs/PROGRESS_MEMORY.md, mark tasks done.
Purpose: live tracker for this pass — test everything shipped this week (v1.7.83→v1.7.89),
build the release test harness, fix OTA updates on .116, make updates bulletproof, cut v1.7.89-alpha.
If the session is cut off, resume from here.
## Task status
| # | Task | Status |
|---|------|--------|
| 1 | AIUI revert (mobile back/close gone, desktop gap fixed) | DONE — validated |
| 2 | Dev server on :8100 with embedded AIUI | DONE — see below |
| 3 | Inventory this week's release-log items | DONE — see checklist |
| 4 | Test harness covering this week + seed of system-wide harness | IN PROGRESS |
| 5 | Fix OTA updates on .116 + bulletproof updates | IN PROGRESS — diagnosis below |
| 6 | Cut v1.7.89-alpha release | PENDING (gates: 4, 5) |
## State of the working tree
- HEAD = 495b9078 (v1.7.89 changelog + AIUI mobile restore committed).
- Uncommitted, intended for v1.7.89-alpha:
- `neode-ui/src/views/Dashboard.vue` — chat route back to plain `h-full` (desktop bottom-gap fix). Validated.
- `core/.../rpc/lnd/*` + `container/lnd.rs` — LND REST no-proxy + wallet readiness/unlock fixes.
- Version bumps to 1.7.89-alpha (Cargo.toml, package.json, locks), CHANGELOG entry.
- `neode-ui/vite.config.ts` — added `/aiui` dev proxy (keep; dev-only convenience).
## AIUI validation (task 1) — DONE
- HEAD already removed the mobile back button and restored `hideClose=true` (495b9078).
- Working-tree Dashboard.vue removes `dashboard-scroll-panel mobile-scroll-pad` from the chat
route (that padding caused the desktop bottom gap); mesh keeps its styling.
- Chat CSS verified byte-identical to last-good 34c4e87d (May 20).
- Playwright check (desktop 1440x900, mobile 390x844): chat fills full viewport, no bottom gap,
no mobile back/close. `npm run type-check` + focused route tests + full vitest (645/645) pass.
## Dev server on :8100 (task 2) — DONE
- Running: `BACKEND_URL=http://127.0.0.1:5678 VITE_AIUI_URL=/aiui/ npx vite --host 0.0.0.0 --port 8100`
from `neode-ui/` (real local backend on 5678).
- AIUI now embeds in /dashboard/chat via new vite proxy `/aiui``http://127.0.0.1:80`
(the node's deployed AIUI), same-origin like production.
- Secondary throwaway instance for automated checks: :8101 against mock backend
(`node mock-backend.js` on 5959, password `password123`).
## This week's shipped items (v1.7.83 → v1.7.89) — test checklist
### Frontend (vitest/type-check/build cover most; full suite 645/645 green 2026-06-12)
- [x] AIUI fast launch, no availability probe (v1.7.88) — covered by visual check + Chat.vue tests
- [x] AIUI mobile layout restore (v1.7.89) — playwright visual check
- [x] App-session launch metadata from manifests / typed interfaces (v1.7.83) — appSessionConfig tests
- [x] OnlyOffice + Saleor removal (v1.7.83) — catalog tests
- [ ] Bitcoin receive UI flow end-to-end (v1.7.87/88) — needs live LND node check
- [ ] Fleet tab keeps node list/alerts during refresh, names not hashes (v1.7.85/86) — store tests?
- [ ] Credential interstitial full-screen overlay (v1.7.87) — visual
- [ ] Mobile federation/system-update buttons full width (v1.7.86) — visual
### Backend (cargo)
- [ ] LND REST no-proxy client + GET newaddress p2wkh (v1.7.88/89) — unit tests + live check
- [ ] LND wallet readiness/unlock after restart (v1.7.89) — unit + live
- [ ] Bitcoin trusted-node relay rpcauth/txrelay (v1.7.84) — unit tests exist? check
- [ ] Container scanner RAII in-flight guard (v1.7.84) — cargo test
- [ ] ElectrumX health-check startup window + cache tuning (v1.7.85/86)
- [ ] Portainer pin 2.19.4 / bitcoin-ui image pin (v1.7.84/85) — image-versions tests
- [ ] Fleet telemetry name/hostname/URL fields (v1.7.85)
- [ ] Federation no self-import (v1.7.85)
- [ ] Kiosk safe-area + self-update refreshes kiosk files (v1.7.84)
- [ ] Wi-Fi scan error/retry/escaped SSID/open networks (v1.7.84)
### OTA / updates (task 5)
- [ ] .116 stuck: current 1.7.85-alpha, `update_in_progress: true` since 1.7.88 attempt — diagnose+fix
- [ ] Updater hardening: stuck-in-progress recovery, resumable/atomic apply, verify post-restart version
## OTA diagnosis on .116 — ROOT CAUSES FOUND + FIXED (code staged for v1.7.89)
Four bugs, all reproduced from the journal (Jun 12 03:4504:33):
1. Post-OTA probe only tries `https://127.0.0.1/`; .116's nginx binds only :80 (443 is
tailscale's) → connection refused × 18 → a GOOD 1.7.85 update was "rolled back".
FIX: probe falls back to `http://127.0.0.1/` on connect error (update.rs probe_frontend_once).
2. That rollback's binary restore did `host_sudo cp` onto the RUNNING binary → ETXTBSY exit 1
→ binary stayed 1.7.85 while web-ui rolled back to 1.7.84 (mismatch confirmed live).
FIX: rollback now cp→tmp→atomic mv, same pattern as apply (update.rs rollback_update).
3. The rollback chown'd `update-backup/archipelago` root:root IN PLACE → next apply's
fs::copy (as service user) hit EACCES → "Failed to backup current binary" × 3 → 1.7.86/88
never applied. FIX: apply unlinks stale backup first; rollback chowns only its temp copy.
4. Failed apply left `update_in_progress: true` wedged (staging still populated so the
stale-flag guard never fires). Unwedged operationally; fixed structurally by 13.
Operational cleanup DONE on .116 (2026-06-12 17:15): removed root-owned
`update-backup/archipelago`, stale `update-staging/` (1.7.86), and the stale
`update-pending-verify.json`. Next state load clears `update_in_progress`.
NOTE: live web-ui is 1.7.84 / binary 1.7.85 (mismatch from bug 2). Not hand-patched —
the v1.7.89 OTA will resync both. Good 1.7.85 frontend is quarantined at
`/opt/archipelago/web-ui.failed.1781250438247`.
Verification plan: after v1.7.89 release, watch .116 auto-apply (schedule auto_apply),
confirm `update_state.json.current_version == 1.7.89-alpha` and web-ui version matches.
## Test harness (task 4) — CREATED at tests/release/run.sh
- Stages: static (git diff --check, cargo fmt, catalog drift, optional --manifest),
frontend (type-check, full vitest), optional --with-build (build + grep dist for version),
backend (cargo check + focused cargo test: update:: lnd container::image_versions scanner,
all wrapped in `timeout`), optional --live URL smoke (/, /aiui/, /rpc/v1).
- Results so far (2026-06-12): type-check PASS, full vitest 645/645 PASS, cargo fmt PASS,
cargo check PASS, catalog drift PASS (3 pre-existing MISSING_CATALOG warnings, exit 0,
identical on HEAD). Focused backend cargo tests running (first run hit the known slow
test-compile on .116 at 400s timeout; rerunning with 1500s).
- AIUI embed verified end-to-end via playwright on :8101 (mock backend): iframe loads,
`ready` handshake clears the loading overlay, hideClose honored.
- Release flow confirmed: commit all → `scripts/create-release.sh 1.7.89-alpha` (validates
curated CHANGELOG section, builds, manifests, commits, tags) →
`scripts/publish-release-assets.sh 1.7.89-alpha gitea-vps2` → push origin main + tags.
Tarball layout/perms safety is already inside create-release-manifest.sh.
- CHANGELOG v1.7.89 section rewritten layman-readable (updater fixes added).
## Release gates for v1.7.89-alpha (task 6)
1. All harness stages green locally.
2. OTA fix for stuck `update_in_progress` included + .116 updates successfully to the new release.
3. Frontend build: grep packaged tarball for "1.7.89-alpha" before shipping (memory: silent vue-tsc failures).
4. Flat tarball layout (`tar -C web/dist/neode-ui .`).
5. Commit, tag `v1.7.89-alpha`, push origin + gitea-local + tags, publish release assets, verify
manifest + node OTA picks it up.

View File

@ -0,0 +1,153 @@
# Archipelago App Registry — Status Survey
**Generated:** 2026-06-21 · **Survey node:** .228 (archi resilience node, 14-app) · **Binary:** v1.7.99-alpha
This document inventories every app in the registry and reports, per app:
manifest-based or not · installed on .228 · migration status (Quadlet/legacy) ·
automated test coverage / release-gate status.
---
## 1. Architecture context — "manifest-based or not"
**Every registry app is manifest-based.** That is the core architecture
(Pillar 4, *data-driven apps*): install/uninstall needs only the app's
`manifest.yml` + catalog entry — no host OS changes, no archipelago binary code
per app. The live registry on .228 is **40 loaded manifests**
(`Loaded 40 app manifest(s) from disk`).
The **only** non-manifest runtime units are:
- **4 companions**`archy-bitcoin-ui`, `archy-lnd-ui`, `archy-electrs-ui`,
`archy-fedimint-ui`. Built from `docker/<name>` contexts via
`core/archipelago/src/container/companion.rs`, *not* the manifest registry.
- **Stack sub-containers**`immich_*`, `indeedhub-*`, `netbird-*`. Spawned by
their parent manifest app.
---
## 2. Migration status (Quadlet-everywhere — Pillar 1)
"Migrated" = runs as a **Quadlet unit under `user.slice`**, so it survives an
`archipelago.service` restart (legacy in-cgroup containers get SIGKILLed on
restart and reconciled back).
On .228 migration is **effectively complete** — every installed app is
`QUADLET:running` **except one**:
| Status | Apps |
|---|---|
| ✅ Migrated (Quadlet / user.slice) | bitcoin-knots, electrumx, lnd, fedimint, fedimint-clientd, fedimint-gateway, btcpay-server (+archy-btcpay-db, archy-nbxplorer), mempool, mempool-api, archy-mempool-db, indeedhub (+7 sub-containers), netbird (+server, +dashboard), vaultwarden, jellyfin, filebrowser, portainer, botfights, nostr-rs-relay, homeassistant, + 4 companions |
| ⚠️ NOT migrated (legacy, service cgroup) | **immich_server** — still in `/system.slice/archipelago.service`. The only legacy holdout. (`immich_postgres`/`immich_redis` are pod members.) |
---
## 3. Exhaustive per-app registry table
| App (registry id) | Manifest | Installed on .228 | Migration | Test coverage |
|---|---|---|---|---|
| bitcoin-knots | yes | ✅ | QUADLET | **L1 RPC ●**, L2 UI ● |
| bitcoin-core | yes | ✗ (shares knots) | — | ◐ regression-gate |
| lnd | yes | ✅ | QUADLET | **L1 RPC ●**, L2 ● |
| electrumx | yes | ✅ | QUADLET | **L1 RPC ●**, L2 ● |
| btcpay-server | yes | ✅ | QUADLET | **L1 RPC ●**, L2 ● |
| mempool | yes | ✅ | QUADLET | **L1 RPC ●**, L2 ● |
| mempool-api | yes | ✅ | QUADLET | via mempool stack |
| archy-mempool-db | yes | ✅ | QUADLET | via mempool stack |
| archy-mempool-web | yes | ✗ | — | via mempool stack |
| archy-btcpay-db | yes | ✅ | QUADLET | via btcpay stack |
| archy-nbxplorer | yes | ✅ | QUADLET | via btcpay stack |
| fedimint (Guardian) | yes | ✅ | QUADLET | L1 ◐ container-only, L2 ● |
| fedimint-clientd | yes | ✅ | QUADLET | none |
| fedimint-gateway | yes | ✅ (this session) | QUADLET | none |
| filebrowser | yes | ✅ | QUADLET | L2 probe-only |
| indeedhub | yes | ✅ | QUADLET | none |
| jellyfin | yes | ✅ | QUADLET | none |
| vaultwarden | yes | ✅ | QUADLET | none |
| portainer | yes | ✅ | QUADLET | none |
| botfights | yes | ✅ | QUADLET | none |
| nostr-rs-relay | yes | ✅ | QUADLET | none |
| home-assistant | yes | ✅ (container `homeassistant`) | QUADLET | none |
| netbird | yes | ✅ (+server, +dashboard) | QUADLET | none |
| immich | yes | ✅ | ⚠️ **LEGACY** | none |
| grafana | yes | ✗ (unit *activating*, no container) | staged | none |
| strfry | yes | ✗ (unit *activating*) | staged | none |
| ~~onlyoffice~~ | — | removed 2026-06-21 | — | — |
| aiui | yes | ✗ | — | none |
| core-lightning | yes | ✗ | — | none |
| did-wallet | yes | ✗ | — | none |
| gitea | yes | ✗ | — | none |
| lightning-stack | yes | ✗ | — | none |
| meshtastic | yes | ✗ | — | none |
| morphos-server | yes | ✗ | — | none |
| nextcloud | yes | ✗ | — | none |
| photoprism | yes | ✗ | — | none |
| router | yes | ✗ | — | none |
| searxng | yes | ✗ | — | none |
| uptime-kuma | yes | ✗ | — | none |
| bitcoin-ui | yes | runs as companion `archy-bitcoin-ui` | QUADLET (companion) | L3 companions ● |
| lnd-ui | yes | runs as companion `archy-lnd-ui` | QUADLET (companion) | L3 companions ● |
| electrs-ui | yes | runs as companion `archy-electrs-ui` | QUADLET (companion) | L3 companions ● |
| fips-ui | yes | ✗ | — | none |
Notes:
- `home-assistant` (registry id) runs as container **`homeassistant`** — the
app-id ≠ container-name. A duplicate `home-assistant.service` quadlet unit
sits in *activating*; the live container is `homeassistant` (Up 6 days, healthy).
- `grafana` / `strfry` have Quadlet `.container` units but the units are stuck
*activating* with **no running container** — staged, not live. Worth a
separate investigation.
- `onlyoffice` was **removed from the registry on 2026-06-21**.
---
## 4. Test-gate reality
**No app has passed the formal release gate.** The gate is `run-20x.sh` green
across the full lifecycle matrix (install / UI reachable / stop / start /
restart / reinstall / reboot-survive / archipelago-restart-survive / uninstall),
**20× on .228 AND .198**. All 8 release-gate checkboxes in
`tests/lifecycle/TESTING.md` are **unchecked (☐)**.
What exists today:
| Layer | Status |
|---|---|
| L0 unit | 631 tests ● green |
| L1 RPC | ● for **6 core apps only**: bitcoin-knots, lnd, electrumx, btcpay, mempool, fedimint |
| L2 UI | ● dashboard + 7 proxy paths + bitcoin-ui:8334 |
| L3 lifecycle survival | companions ● ; backends ◐ (regression-gate only — fails until Phase-3 Quadlet flag flips by default) |
| Per-app L1+L2 matrix | **50 of 110 cells** |
| L4 browser / L5 chaos / L6 perf | ○ 0 — not started |
Regression suites added after v1.7.90-alpha (run read-only, abort releases on
failure): `bitcoin-receive.bats`, `port-drift.bats`, `secret-completeness.bats`.
**The other ~30 registry apps have zero automated coverage.**
---
## 5. Key gaps
1. **immich** is the last legacy (in-cgroup) app — migrate to Quadlet to finish Pillar 1.
2. **grafana / strfry** Quadlet units stuck *activating* with no container — investigate. (onlyoffice removed 2026-06-21.)
3. **fedimint-gateway / fedimint-clientd** (this session) now run but have no lifecycle test coverage.
4. The formal **20× release gate has never been green** — it is the blocker for the v1.7.52 tag.
---
## 6. This session's changes (2026-06-21)
- **Generated-secrets system** deployed to .228 (binary + manifests). Self-healing:
the root-owned `fedimint-gateway-hash` was regenerated archipelago-owned/readable
**fedimint-gateway now starts** (gatewayd webserver up on :8176). `fmcd-password`
generated for fedimint-clientd.
- **Guardian-UI CSS fix** applied on .228: rebuilt the stale `localhost/fedimint-ui:latest`
companion image (built 2026-06-12, pre-fix) from the corrected context
(`@guardian_assets` proxy fallback to :8177). Guardian's own CSS
(`/assets/bootstrap.min.css`, `/assets/style.css`) **404 → 200 text/css**.
Root cause: `companion.rs::ensure_image_present` skips rebuild when the
`:latest` image already exists, so the context fix never re-baked.
*Survey method: live `podman` cgroup inspection on .228 + `/opt/archipelago/apps`
manifest enumeration + `tests/lifecycle/TESTING.md`.*

View File

@ -1,37 +0,0 @@
# CI/CD Pipeline Plan
## CI Workflow (on push to main + PRs)
### Jobs
1. **Rust checks**
- `cargo clippy --all-targets --all-features` (zero warnings)
- `cargo fmt --all -- --check`
- `cargo test --all-features`
2. **Frontend checks**
- `npm run type-check` (vue-tsc)
- `npm run lint` (eslint)
- `npm test` (vitest)
3. **Script validation**
- `bash -n` on all .sh files
- `shellcheck` on critical scripts
### Merge policy
All checks must pass before merge.
## Release Workflow (on tag push v*)
### Jobs
1. Build Linux binary (cross-compile x86_64 + ARM64)
2. Build frontend (`npm run build`)
3. ISO build via SSH to build server
4. QEMU smoke test of ISO
## Pre-requisites
- GitHub Actions runners with Rust toolchain
- SSH key for build server access
- Branch protection on main
- Image digest manifest from `scripts/image-versions.sh`
## Estimated implementation: 2 weeks

View File

@ -1,5 +0,0 @@
# Current State
> This document has been consolidated into [`architecture.md`](architecture.md).
>
> See that file for the current system architecture, active nodes, codebase stats, and feature status.

View File

@ -1,229 +0,0 @@
# DHT work — RESUME HERE
**Last updated:** 2026-06-16 · **Branch:** `agent-trust-wip` · **Worktree:** `~/Projects/archy-dht`
This file is the single source of truth for resuming the DHT / peer-distribution
work after a restart. Read it top to bottom, run the **Verify state** block, then
continue at **Next step**.
---
## ⚠️ CRITICAL — where to work (do not skip)
- **Work ONLY in the worktree `~/Projects/archy-dht` on branch `agent-trust-wip`.**
- **NEVER run git checkout / branch-switch / commit in the shared tree `~/Projects/archy`.**
Another agent cuts releases on `main` there. Git branch state is **global to one
working tree**, so a checkout in the shared tree drags every session onto that
branch and can clobber uncommitted work. That already happened once — the worktree
exists specifically to prevent it. See memory `feedback_concurrent_agent_tree`.
- The shared tree stays on `main` for the release agent. Leave it alone.
## Build facts (so you don't get surprised)
- It's a **binary** crate: test with `cargo test --bin archipelago -- <filter>`
(there is no lib target).
- The **test profile is opt-level=3** → every incremental test rebuild of the
`archipelago` crate is **~5 min**; a cold build of the iroh feature tree is ~19 min.
Budget for it. Run builds in the background and poll.
- Default build = no iroh. The iroh swarm engine is behind the **`iroh-swarm`**
Cargo feature (off by default): `cargo build --features iroh-swarm`.
- Plain `cargo build` (no feature) is the fleet build and is unaffected by any DHT work.
## Verify state (run these first on resume)
```bash
cd ~/Projects/archy-dht
git branch --show-current # → agent-trust-wip
git log --oneline -7 # see the commit list below
git status --short # should be clean (or your in-progress edits)
git worktree list # archy-dht → agent-trust-wip; archy → main
# sanity compile (default, fast-ish):
cargo build --bin archipelago 2>&1 | tail -3
```
---
## What is DONE (committed on `agent-trust-wip`)
Design doc: `docs/dht-distribution-design.md` (the full plan).
| Commit | Phase | Summary |
| --- | --- | --- |
| `0fef8086` | base | parked trust module + `seed::derive_release_root_ed25519` (pre-existing) |
| `27f11bf8` | **0** | signed-catalog authenticity wired: `trust/` module verifies the release-root detached signature in `app_catalog::fetch_one`; release-root KAT pinned |
| `f0cb91ed` | **1** | BLAKE3 alongside SHA-256: `content_hash.rs`, `ComponentUpdate.blake3`, `BlobMeta.blake3` |
| `2523c9e3` | **2 seam** | `swarm/mod.rs``BlobProvider` + `fetch_content_addressed` (verify peer bytes, origin-always-wins); `iroh-swarm` flag; wired into `update.rs` |
| `082946aa` | **2 engine** | real `swarm/iroh_provider.rs` over iroh 1.0 + iroh-blobs 0.103 (optional deps). Dep tree proven to resolve+compile against the pinned stack |
| `9fa56a82` | **3 core** | `swarm/seed_advert.rs` — signed Nostr seed-advertisement protocol (NIP-33 kind 30081, d-tag=blake3) |
All tests green at each step. Total new modules: `trust/`, `content_hash.rs`, `swarm/`.
## task #12 — Phase 3 glue + wiring — DONE (2026-06-17, NOT yet committed)
Implemented in the worktree, **uncommitted** (release in flight — do not commit/merge
until the user says so). Verified: default `cargo build` clean, `cargo build
--features iroh-swarm` clean, `cargo test --bin archipelago -- swarm::` → **8/8 pass**.
1. **`NostrSeedDiscovery`** (`swarm/iroh_provider.rs`) — `ProviderDiscovery` made
**async** (`#[async_trait]`); impl queries relays via the new
`seed_advert::fetch_seed_endpoint_ids` and parses each string with
`EndpointId::from_str` (`EndpointId = PublicKey`, has `FromStr`/`Display`),
skipping unparseable. `try_fetch` now `.await`s discovery.
2. **Publish path** — dep-free `seed_advert::fetch_seed_endpoint_ids` +
`publish_seed_advert` (reuse now-`pub(crate)` `build_nostr_client` /
`load_or_create_nostr_keys`); `IrohProvider::seed_and_advertise` imports the blob
into the FsStore (`blobs().add_path``TagInfo`) with a defensive hash-match,
then publishes. Scope: releases/catalog only.
3. **Wiring**`swarm::init()` builds the `IrohProvider` once at startup into a
`OnceLock<SwarmRuntime>` (keeps endpoint/router alive → keeps seeding);
`providers()` returns the registered provider; `announce_held_blob()` is called
from `update.rs` after each release component passes both hash gates. New config
`swarm_enabled` (`ARCHIPELAGO_SWARM_ENABLED`, default false); `server.rs` calls
`swarm::init`. All iroh code stays behind `iroh-swarm`; default build inert.
**iroh-blobs paid-serving spike (open Q#1) — RESOLVED:** `BlobsProtocol::new(&store,
Some(EventSender))` + `EventMask` intercept gives native per-request allow/deny
(`RequestMode::Intercept``Result<(), AbortReason>`), connection-level reject
(`ConnectMode::Intercept`), and per-request throttle/meter (`ThrottleMode::Intercept`).
## NEW: Phase 4+ plan (paid streaming / relay / IndeeHub) — `docs/phase4-streaming-ecash-plan.md`
Design for: (1) ecash-paid swarm transport, (2) networking through nodes / relay,
(3) IndeeHub "Archipelago" content source (signed Nostr film catalog, kind 30082).
Headline: ~80% already exists (Cashu wallet, `streaming/` payment gate + metering,
4-tier transport, the swarm above). Also shipped this session: a **Networking Profits
→ Settings** UI in `neode-ui` (new `views/web5/Web5NetworkingProfitsSettings.vue` +
route + button in `Web5QuickActions.vue` + `common.settings` i18n) that drives the
existing `streaming.list-services`/`configure-service` RPCs; free-everything is the
default (all services ship `enabled:false`). Frontend typechecks clean (pre-existing
`Web5ConnectedNodes.vue` `.did` errors are NOT ours). `neode-ui` deps were
`npm install`ed to complete a partial install.
## F2 step 1 — cross-mint ecash swap — DONE (2026-06-17, NOT yet committed)
Plan §2a / phasing F2 step 1. Implemented in `wallet/ecash.rs`, **uncommitted**
(release in flight). Verified: `cargo test --bin archipelago -- wallet::ecash`
**25/25 pass** (6 new), default build clean, `--features iroh-swarm` build clean.
- `is_mint_trusted(data_dir, url)` — swap-into allow-list. Home Fedimint always
trusted; any other mint must be on `accepted_mints` (normalized, trailing-slash
tolerant). Reuses the list the streaming gate already advertises to payers.
- `mint_quote_at` / `melt_quote_at` / `send_token_at(data_dir, mint_url, amount)`
the home-mint-hardcoded helpers parameterized by target mint. `send_token` now
delegates to `send_token_at` with the home mint.
- `swap_between_mints(data_dir, from, to, amount, max_fee_sats) -> u64` — mint-quote
on B → melt-quote on A → **fee-cap check** (`swap_fee` = total_paid delivered;
bail if > cap so caller falls back to free origin) → select+melt A proofs →
**persist the spend BEFORE claiming** (crash can't double-spend) → poll B invoice
until PAID/ISSUED (`wait_for_mint_quote_paid`, 60s/2s) → mint+claim on B. Both legs
recorded in the tx log (peer field carries the counterpart mint).
## F2 step 2 — payer-side auto-swap payment builder — DONE (2026-06-17, NOT yet committed)
Plan §2a step 2. Implemented in `wallet/ecash.rs`, **uncommitted**. Verified:
`cargo test --bin archipelago -- wallet::ecash`**34/34 pass** (9 new). All on the
default path (no feature gating) so the `iroh-swarm` tree is unaffected.
- `WalletState::spendable_by_mint() -> Vec<(mint_url, balance)>` — per-mint holdings.
- `PaymentPlan { Direct{mint}, Swap{from,to}, Insufficient }` + pure
`plan_payment(holdings, accepted: &[(mint, trusted)], amount)` — the policy:
**Direct beats Swap** (already-held mint, no fee, no trust needed); a **Swap target
must be trusted** (`is_mint_trusted`); home mint is the tie-break for both legs;
`Insufficient` → caller uses free origin. Pure/sync, unit-tested without a mint.
- `build_payment_token(data_dir, accepted_mints, amount_sats, max_fee_sats) -> token`
annotates the seeder's `accepted_mints` with trust, runs `plan_payment` against
`spendable_by_mint()`, then `send_token_at` (direct) or `swap_between_mints` +
`send_token_at` (swap, honoring the fee cap). Bails (→ origin) when nothing covers
the amount within balance/trust/fee. This is the builder the fetch side calls.
## Fetch-side auto-pay + F2 step 3 hardening — DONE (2026-06-17, NOT yet committed)
Implemented; **uncommitted**. Verified: `cargo test --bin archipelago -- wallet::
swarm::` → **85/85 pass** (18 new across these + earlier steps), **0 warnings**,
default build clean. `--features iroh-swarm` build = (see below; re-run after these
edits).
- **`swarm/payment.rs`** (un-gated — builds without `iroh-swarm`): `PaymentPolicy
{ budget_sats, max_fee_sats }` + `auto_pay_token(data_dir, policy, accepted_mints,
price)` → `Ok(Some(token))` to pay / `Ok(None)` to use origin. Degrades any
wallet/mint error to `Ok(None)` so payment can never block content (origin always
wins). The on-wire token→peer exchange (in-band paid-blobs ALPN, "shape A") is the
remaining gap — deferred in the plan; this is the decision/builder brain it'll call.
- **`streaming.prepare-payment` RPC** (dispatcher + `handle_streaming_prepare_payment`):
the live, user-invokable entry to the payer-side builder. Params `{accepted_mints,
price_sats, budget_sats?, max_fee_sats?}` → `{status:"ready", token}` or
`{status:"declined"}`. This is what makes the whole payment chain reachable
(no dead code).
- **Idempotent swap resume** (`wallet/pending_swaps.json`): `swap_between_mints`
journals the in-flight swap (melt + mint quote ids) right after the source spend is
persisted, removes it on claim. `resume_pending_swaps(data_dir)` reclaims `PAID`
quotes, skips `ISSUED` (never double-claims), leaves unsettled — **wired at server
startup** (server.rs, after `swarm::init`).
- **Liquidity cache** (`wallet/swap_liquidity.json`): per-route success/failure;
`build_payment_token` orders swap targets by `target_liquidity_score` (proven routes
first, home still first). `swap_between_mints` records success/failure.
- Removed the unused `mint_quote_at`/`melt_quote_at` thin wrappers (swap calls
`MintClient` directly; nothing else used them).
## Shape-A paid-blobs negotiation ALPN — DONE (2026-06-17, NOT yet committed)
Plan §1 "shape A" — the on-wire exchange that lets a downloader pay a seeder before
fetching a gated blob. Implemented behind `iroh-swarm`; **uncommitted**. Compiles
clean (`cargo build --features iroh-swarm` → only the 2 pre-existing `trust/` warns).
**Caveat:** the request/grant *wire path* can only be fully verified with a live
two-node iroh test (serde + types are unit-tested; the QUIC round-trip is not).
- **`swarm/paid_alpn.rs`** (gated): ALPN `archy/paid-blobs/1` on a second handler on
the same endpoint/router. `PaidRequest { want, token? }` ↔ `PaidResponse
{ Granted | PaymentRequired{price_sats, accepted_mints} | Denied{reason} }`.
- **Serve side** `PaidBlobsProtocol` (`ProtocolHandler`): per bi-stream, keys the
peer by `connection.remote_id()`, runs `streaming::gate::check_gate(content-download,
peer, token, blob_size)`, maps to a verdict. Free when service disabled (default),
fail-OPEN (Granted) on gate error — mirrors `swarm/paid.rs`. A paid retry's token
opens the session the blob-GET gate then sees (same endpoint id → same session).
- **Fetch side** `negotiate_access(endpoint, data_dir, peer, hex, policy) -> bool`:
best-effort + additive. Asks with no token; on `PaymentRequired` calls
`payment::auto_pay_token` (cross-mint aware), retries with the token. Connect/
protocol failure ⇒ proceed (the GET gate is the real enforcement); explicit
`PaymentRequired` we won't/can't pay ⇒ skip peer → origin.
- **Wired into `iroh_provider.rs`**: registers the 2nd ALPN on the `Router`; `try_fetch`
negotiates with each discovered peer before `downloader.download`. `IrohProvider`
carries `data_dir` + `pay_policy` (defaults to `PaymentPolicy::free` → releases/
catalog never pay; a future film fetch passes a real budget).
### Remaining to make paid FILM fetch real (small, on top of shape A)
- Pass a non-free `PaymentPolicy` for the film scope (releases stay free) + surface an
auto-pay cap in Settings. The plumbing is all here; only the policy source is free.
- Live two-node integration test (tests/multinode/) to exercise the actual QUIC
request→pay→grant→GET path end to end.
## Remaining Phase 4 roadmap (NOT started — gated)
- **Relay protocol (§2b)** — single-hop paid `relay.fetch`. Needs design sign-off.
- **IndeeHub "Archipelago" source (steps AE)** — signed kind-30082 film catalog +
`film.catalog`/`GET /api/film/:blake3` + frontend source. Gated on user decisions
(publisher trust anchor, MinIO origin) + the external IndeeHub frontend repo.
**Shipping directive (user 2026-06-17):** ship the IndeeHub app change as a
**decoupled app-catalog update** (bump `releases/app-catalog.json`), not a binary
OTA. See `docs/phase4-streaming-ecash-plan.md` §4 note.
## After Phase 3
- **Phase 4** — IndeeHub films on the same blob layer (Blossom catalog + iroh swarm;
MinIO origin). Each HLS `.ts` segment = a content-addressed blob.
- **Phase 0 GO-LIVE (needs the user)** — the catalog/manifest signature anchor
`trust::anchor::RELEASE_ROOT_PUBKEY_HEX` is still `None`; the pinned KAT is the
TEST mnemonic, not the real key. Going live = signing ceremony with the **real
release master seed** (only the user has it) → derive release-root → bake its pubkey
into `anchor.rs` → sign the real `releases/app-catalog.json`. Until then verification
is advisory (verify-if-present, anchor not enforced).
## Mergeability
As of last check we were only ~4 commits diverged from `main`; the only shared-file
overlap is `seed.rs` + `update.rs`. **Do NOT merge to `main` while the release is in
flight** — that's the user's call. Sync (merge main → agent-trust-wip) once the
release lands and `main` is clean.
## Background build logs from the last session (may be stale)
`/tmp/dht-*.log` — phase test/build outputs. Safe to ignore/delete on resume.

View File

@ -0,0 +1,138 @@
# Registry-Distributed App Manifests — Design
**Status:** design (2026-06-21)
**Goal (north-star):** every app installs from a manifest distributed via the
signed app-catalog on the registry — **no OS-level code reliance, no
OTA-shipped disk manifest required**. Rootless, signed, robust, reboot-survivable.
See also: [`docs/dht-distribution-design.md`](dht-distribution-design.md) (this is
its "discovery/authenticity" layer), `MEMORY → project_manifest_driven_north_star`.
---
## 1. Where we are today
Two distinct mechanisms, only one of which is registry-distributed:
| Thing | Source | Reaches node via | Carries |
|-------|--------|------------------|---------|
| `apps/*/manifest.yml` (48) | repo working tree | **OTA**: `self-update.sh` rsyncs `apps/ → /opt/archipelago/apps/` | full manifest (the orchestrator's real source of truth) |
| `app-catalog.json` (28) | `releases/app-catalog.json` | **registry HTTP fetch**, hourly, **signed** (`app_catalog::refresh_catalog`) | version + image override only |
- Orchestrator registry = in-memory `state.manifests: HashMap<app_id, LoadedManifest>`,
populated by `ProdContainerOrchestrator::load_manifests()` walking the disk dir.
`install(app_id)``loaded(app_id)` → "unknown app_id" if absent.
- `app_catalog.rs` is already: signed (release-root, `trust::verify_detached` over
the raw JSON), mirror-derived URLs, atomic cache at `<data_dir>/app-catalog.json`,
**forward-compatible** (no `deny_unknown_fields` — adding fields never breaks old nodes).
**Gap:** the manifest itself is never registry-distributed. Every app — btcpay,
grafana, immich — depends on an OTA-shipped disk file. That is the OS-level
reliance to eliminate.
## 2. Target
The signed catalog entry carries the **full manifest**. The orchestrator loads
manifests from the catalog cache (origin), falling back to disk only during the
migration window. Publishing an app = editing the catalog + signing + push — no
binary OTA, no disk manifest.
```
publisher: apps/*/manifest.yml ──generate──▶ releases/app-catalog.json (embeds + signs)
node: refresh_catalog() ──fetch+verify──▶ <data_dir>/app-catalog.json
load_manifests() ──merge──▶ state.manifests (catalog wins; disk = fallback)
install(app_id) ──▶ render Quadlet unit (rootless, systemd-managed)
```
## 3. Schema change (`app_catalog::AppCatalogEntry`)
Add one optional, forward-compatible field:
```rust
/// Full app manifest, embedded so the app installs from the registry alone
/// (no OTA-shipped disk file). Carried as the raw value the publisher signed;
/// deserialized into `AppManifest` at load time. Absent during migration =>
/// the node uses the disk manifest fallback.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub manifest: Option<serde_json::Value>,
```
Why `serde_json::Value`, not `AppManifest`:
- keeps the **signed preimage** intact (we verify over the raw JSON bytes; a typed
round-trip could drop/reorder unknown fields and break the signature),
- decouples catalog schema from manifest schema churn,
- deserialize + `validate()` happens at orchestrator load, exactly like `from_file`.
Authenticity is **free**: `fetch_one` already verifies the release-root signature
over the whole document, so an embedded manifest is covered by the same signature.
A present-but-bad signature is already a hard reject.
## 4. Orchestrator load path (`load_manifests`)
Extend (not replace) the disk walk:
1. Load disk manifests as today → `disk: HashMap<app_id, LoadedManifest>`.
2. Load catalog manifests from the cache: for each entry with `manifest: Some(v)`,
`serde_json::from_value::<AppManifest>(v)` then `validate()`; on success build a
`LoadedManifest { manifest, manifest_dir }`.
3. **Merge, catalog-wins**: a catalog manifest overrides the disk one for the same
`app_id`. Disk remains the fallback for apps the catalog doesn't cover (migration).
- Rationale: the registry is the authoritative origin; disk is the legacy
transport we're retiring. This matches `app_catalog`'s "catalog verdict is
authoritative when it covers the app" posture.
4. A catalog manifest that fails parse/validate is logged and skipped → disk
fallback used (one bad entry never blocks the fleet, same as the disk walk).
### `manifest_dir` for registry manifests
`LoadedManifest.manifest_dir` is used for **build contexts** and **generated files**
(`GeneratedFile`) that live next to a disk manifest. Registry manifests have no dir.
- **Phase 1 scope = image-only apps** (no `build:`, no `generated_files:` needing a
source dir). immich, grafana, the fedimint apps, postgres/redis all qualify.
- Represent the absent dir explicitly: `manifest_dir: Option<PathBuf>` (or a
sentinel under `<data_dir>/registry-apps/<app_id>/` materialized on demand).
Companion build apps (bitcoin-ui, …) keep their disk path until a later phase
teaches the catalog to carry build contexts (content-addressed, per the DHT plan).
## 5. Publishing (publish-side generator)
Add a generator (extend `create-release.sh` / a small `scripts/gen-app-catalog`):
- walk `apps/*/manifest.yml`, parse, embed each as the entry's `manifest` (JSON),
- keep `version`/`image`/`images` derived from the manifest for the badge path,
- write `releases/app-catalog.json`, then **sign** with the existing release-root
ceremony (`archipelago ceremony` / Phase 0 seed). Unsigned still accepted in the
migration window.
## 6. Migration & rollback
- **Backward compatible**: old nodes ignore the new `manifest` field (no
`deny_unknown_fields`) and keep using disk manifests.
- **Forward**: new nodes prefer catalog manifests, disk as fallback. Once the
catalog covers every app and is verified live, drop `apps/` from the OTA rsync.
- **Rollback**: delete `<data_dir>/app-catalog.json` (or revert the published
catalog) → nodes fall back to disk manifests. No data touched.
## 7. Phases
1. **Schema + load merge** (this design): `manifest` field, `load_manifests`
catalog-wins merge, `manifest_dir: Option`, unit tests (catalog overrides disk;
bad catalog manifest → disk fallback; absent → disk). Image-only apps.
2. **Publisher generator + signing**: emit embedded+signed catalog; CI/release wiring.
3. **First real app end-to-end**: immich as 3 registry manifests
(`immich-postgres`/`immich-redis`/`immich-server`) installed via
`install_stack_via_orchestrator` (delete legacy `install_immich_stack`).
Uses `generated_secrets: [immich-db-password]` (already built).
4. **Build-context apps**: content-addressed build contexts in the catalog (DHT
swarm fetch) so companions stop needing disk too.
5. **Drop `apps/` from OTA** once coverage + live verification complete.
## 8. Open questions
- Do we embed manifests inline or reference them by content hash (BLAKE3) with a
separate signed blob? Inline is simplest for Phase 1; hashing aligns with the
DHT image-by-digest plan and keeps the catalog small. Lean inline now, revisit
at Phase 4 when build contexts (large) need addressing anyway.
- `generated_files` with inline content (vs. source-dir) — already supported in the
manifest schema? If so, registry manifests can carry small rendered files inline,
removing another disk dependency.

View File

@ -1,109 +0,0 @@
# Session handoff — 2026-06-18
> **UPDATE (later same day): ALL OPEN ITEMS RESOLVED + DEPLOYED** (v1.7.99-alpha → .116 + .198).
> - **#6 Pay-with-QR timeout** — real bug (both LNDs confirmed healthy by user). FIPS-first dial ate the whole budget before the working Tor fallback ran. Added `PeerRequest.fips_timeout` cap (`fips/dial.rs`); invoice/onchain request+status calls fast-fail FIPS (6s) + short Tor window (25s/15s); frontend ceilings 60s→45s. Large downloads keep the full FIPS timeout.
> - **#7 `!ai` gate** — added denied-asker capture (`MeshState.assist_denied`/`DeniedAsker`, `assist.rs::record_denied`) → `mesh.assistant-status.denied_askers` → "Recently denied" list with one-click Allow in `MeshAssistantPanel.vue`.
> - **#8 peer-file 403** — NOT a DID reset. Asymmetric federation: .198 had .116 trusted but .116 never added .198. Re-federated (.198 → .116 `nodes.json`, trusted). **Verified:** .116 `/content/<peersonly>` = 403 w/o DID, **200 (177KB png) with .198's DID**. Plus clearer 403 message + client surfaces the body. Listing left visible ("locked preview", user's choice).
> - **Dual-ecash receive** — active modal is `ReceiveBitcoinModal.vue` (not the commented-out `Web5SendReceiveModals.vue`); already used dual-detect `wallet.ecash-receive`, fixed Cashu-only wording.
> - **fedimint-clientd icon**`docker_packages.rs` arm → `fedimint.png` + `fedimint-clientd.png` asset.
> - **Cashu → 🥜**`HomeWalletCard.vue`.
>
> Deploy notes confirmed: binary swap needs atomic `mv` over the running file (`cp` → "Text file busy"); frontend rsync WITHOUT `--delete` to preserve the `aiui/` subdir in `/opt/archipelago/web-ui`.
Resume point for the multi-issue bug-fix + deploy session on **.116** (archi-thinkpad,
local dev/validation node) and **.198** (resilience node). Work was done in
`~/Projects/archy`. A separate agent's **fedimint dual-ecash** work landed as commit
`4288ae78` during the session (don't re-touch `wallet.rs` / `fedimint_client.rs` /
`prod_orchestrator.rs` / `Web5SendReceiveModals.vue` without checking with them).
## DEPLOY STATUS — done
A surgical deploy (binary + frontend + 2 companion images, **not** the .228-centric
`deploy-to-target.sh`, to avoid clobbering .116's custom nginx) shipped to BOTH nodes:
- **.116**: new binary `/usr/local/bin/archipelago` (backup at `archipelago.bak-pre-deploy-*`),
frontend at `/opt/archipelago/web-ui`, `localhost/{lnd-ui,bitcoin-ui}:latest` rebuilt,
`:local` tags dropped. Verified: `/bitcoin-status` serves `age_ms`; lnd-ui on `Network=host`
listening 18083; `/lnd-connect-info` → 200; both companion containers carry new index.html.
- **.198**: same (binary copied — .198 has **no Rust toolchain**, only npm+podman, so
build-on-.116-then-copy is mandatory). Verified identically. Force-recreated both companions.
Build notes: release build ~9 min (opt-level 3). Frontend vite outDir = `web/dist/neode-ui/`
(NOT `neode-ui/dist`). Companion images: `ensure_image_present` only builds if image ABSENT,
and prefers `localhost/<base>:local` over `:latest` — so to ship docker changes you must drop
`:local` and rebuild `:latest`, then the reconciler (`needs_repair` compares rendered quadlet
unit vs disk) recreates containers. bitcoin-ui needed an explicit `systemctl --user restart`
(its quadlet unit text didn't change, so the reconciler didn't auto-recreate it).
## FIXED & DEPLOYED
1. **Mesh chat/peer double-scroll**`useControllerNav.ts` (wheel scrolls container under
pointer, not focused el) + `Mesh.vue` (`@wheel.stop.prevent`).
2. **Second-level cloud folder zoom**`CloudFolder.vue` direction-aware
(`cloud-zoom-forward`/`-back`, matched depth-forward/back magnitudes 0.75↔1.2).
3. **"FIPS Mesh" → "Fuck IPs Mesh"** — `FipsNetworkCard.vue`, `Server.vue`.
4. **.116 connect-wallet QR "failed to fetch"** — lnd-ui migrated to host-network +
same-origin nginx proxy: `companion.rs` (host_network:true, ports:[]),
`docker/lnd-ui/{Dockerfile(EXPOSE 18083),nginx.conf(listen 18083 + proxy /lnd-connect-info,
/proxy/lnd/, /api/container/logs to 127.0.0.1:5678),index.html(getBackendUrl()→'' relative,
credentials:'include')}`. ROOT CAUSE was a cross-origin CORS failure (page on :18083 fetching
:80). Verified working in incognito; the user's earlier "still broken" was a **stale cached
old page**. Unit test `lnd_ui_uses_host_network` passes.
5. **.198 Bitcoin Knots stale "reconnecting" banner** — `bitcoin_status.rs` (new server-computed
`age_ms` field so the browser never subtracts across clocks; 20s `STALE_GRACE_MS` before
flipping stale; RPC timeout 8s→12s) + `docker/bitcoin-ui/index.html` (`snapshotAgeMs()` uses
server `age_ms`, falls back to old calc). Two root causes: browser/node clock skew + no grace
on single failed polls (swap-thrash node).
## OPEN ISSUES (diagnosed, NOT fixed)
6. **"Pay with QR" → request timeout** — full invoice chain intact (hardened in `790da4bd`);
60s timeout = seller node never answers (unreachable transport or hung LND). Runtime, needs
2 live nodes to repro. NOT a code defect found.
7. **`!ai` not working** — DIAGNOSED, config fix (awaiting user policy decision). Assistant is
`assistant_trusted_only:true` (`/var/lib/archipelago/mesh-config.json`). The trust gate
`is_sender_allowed` (mesh/listener/assist.rs) only matches askers by archipelago pubkey/DID
against federation-Trusted `nodes.json`, but RADIO (meshcore) askers present a firmware key,
not the archipelago identity, so they're silently denied (journal: "AssistQuery denied … from=15
name=Arch Optiplex"; federation contact_id ≥ 0x80000000, low ids = radio). Claude key + model
(`claude-opus-4-8`) tested HTTP 200 — NOT the problem. FIX: disable trusted_only, or add the
asker's presented key to the allowlist. Full notes in memory `project_mesh_ai_trusted_only_gate`.
8. **Peer-file download .116→.198 "Access denied — federation peer required"** — NEW, NOT yet
fixed. Gate at `content.rs:149` (returns on `content_server::ServeResult::Forbidden`). The
requesting node isn't recognized as an authorized federation peer by the content server /
per-file sharing ACL. User's strong hypothesis: a **DID/identity reset** changed a node's DID,
so the sharing ACL / nodes.json holds the OLD identity and no longer matches. User also notes
the file is still VISIBLE in the listing (so listing and download use different identity checks
— inconsistency to investigate). NEXT: read `content_server` Forbidden logic, compare the
requester DID/pubkey vs what's stored; check both nodes' `server_info`/identity vs each other's
`federation/nodes.json`. Same THEME as #7 (identity matching) but a different mechanism.
## NEW FRONTEND REQUESTS (not started — batch into one frontend rebuild+redeploy)
- **`fedimint-clientd.svg` 404** — new fedimint core-app (`public/catalog.json:294`) has no icon.
App-icon convention `/assets/img/app-icons/<id>.png` (default) — add a `fedimint-clientd` icon
(there's an existing `fedimint.png` to reuse/adapt). The 404 requests `.svg` so check the
catalog/curated-icon entry.
- **Cashu icon → cashew emoji** (🥜) — change the cashu wallet icon to a cashew nut emoji.
- **Receive ecash should support BOTH fedimint + cashu paste** — currently the ecash receive
only mentions Cashu for pasting a token; user expected the paste box to redeem both Cashu AND
Fedimint ecash. Lives in the fedimint agent's recently-committed dual-ecash UI
(`Web5SendReceiveModals.vue` / `Web5Wallet.vue` / `WalletSettingsModal.vue`) — investigate what
they built before changing.
- **Console noise** (lower priority): `cdn.tailwindcss.com` production warning in lnd-ui +
bitcoin-ui (uses Tailwind CDN); `api/app-catalog` 502 (check if persistent). Latent backend
nicety: `/lnd-connect-info` emits a DOUBLED `Access-Control-Allow-Origin` (backend empty ACAO
+ main-nginx `add_header $http_origin`) — harmless on the new same-origin page but should drop
the backend's redundant CORS since lnd-ui now fetches same-origin.
## ENV QUICK-REF
- .116 archi-thinkpad: data `/var/lib/archipelago`, nginx root `/opt/archipelago/web-ui`,
http :80 + custom nginx-proxy-manager; user reaches UI via Tailscale `100.69.68.39` AND LAN.
Deploy SSH key `~/.ssh/archipelago-deploy` is passphraseless; SSH-to-self + .198 work non-interactively.
- .198: `ssh archipelago@192.168.1.198` (passwordless sudo), podman+npm, NO cargo.
- Companion build-dir precedence: `/opt/archipelago/docker` > `~/archy/docker` > `~/Projects/archy/docker`.
- Uncommitted working-tree changes (mine, not yet committed): the 11 files for fixes #1#5.