lfg2025/archy

Author	SHA1	Message	Date
archipelago	f6efe2f356	fix(transport/chunking): stop overwriting first 4 bytes of user data encode_chunked() split the payload into shards first, then overwrote the first 4 bytes of shard 0 with a u32 length header, then re-ran Reed-Solomon to regenerate parity over the now-corrupted shards. The decoder correctly read the length header and trimmed `[4..4+len]` from the reconstructed buffer, but those first 4 bytes had already been destroyed on the encode side, so every chunked mesh payload lost its first 4 bytes. Restructure: reserve 4 bytes for the length header up front, build a single contiguous [len][data][pad] buffer, then split into shards. Parity is computed over the correct shards on the first pass, no double-encode needed. Update test_chunk_roundtrip_medium: 500 bytes + 4-byte header = 504 bytes, which is 5 data shards (ceil(504/124)), not 4. The old test assertion was wrong all along and masked the corruption bug because it only checked the roundtripped bytes, which is exactly what we need to verify. New assertion is correct. Verified: all 7 transport::chunking tests pass.	2026-04-23 12:29:10 -04:00
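
The restructured framing in f6efe2f356 can be sketched as follows. This is a minimal illustration, not the project's `encode_chunked()`: the function names are hypothetical, the big-endian header is an assumption (the log does not state the byte order), the 124-byte shard size is taken from the `ceil(504/124)` arithmetic in the message, and Reed-Solomon parity is left out entirely.

```rust
use std::convert::{TryFrom, TryInto};

const SHARD_SIZE: usize = 124; // from ceil(504/124) in the commit message
const HEADER_LEN: usize = 4; // u32 length header

/// Hypothetical helper: build one contiguous [len][data][pad] buffer first,
/// then split it into equal shards. The header never overwrites user data.
pub fn frame_into_shards(data: &[u8]) -> Option<Vec<Vec<u8>>> {
    let len = u32::try_from(data.len()).ok()?;
    let total = HEADER_LEN + data.len();
    let shard_count = (total + SHARD_SIZE - 1) / SHARD_SIZE; // ceiling division
    let mut buf = Vec::with_capacity(shard_count * SHARD_SIZE);
    buf.extend_from_slice(&len.to_be_bytes()); // byte order is an assumption
    buf.extend_from_slice(data);
    buf.resize(shard_count * SHARD_SIZE, 0); // zero padding up to a whole shard
    Some(buf.chunks(SHARD_SIZE).map(|c| c.to_vec()).collect())
}

/// Hypothetical inverse: concatenate the data shards, read the header,
/// and return the `[4..4+len]` slice described in the commit message.
pub fn unframe(shards: &[Vec<u8>]) -> Option<Vec<u8>> {
    let buf: Vec<u8> = shards.concat();
    let header: [u8; 4] = buf.get(..HEADER_LEN)?.try_into().ok()?;
    let len = u32::from_be_bytes(header) as usize;
    buf.get(HEADER_LEN..HEADER_LEN + len).map(|s| s.to_vec())
}
```

With a 500-byte payload, `frame_into_shards` yields 5 shards of 124 bytes, and `unframe` returns the original 500 bytes with the first 4 intact.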
archipelago	c4efb30382	docs(release-notes): v1.7.43-alpha bullet for install-log fix; prune stale RESUME note	2026-04-23 12:04:20 -04:00
archipelago	cd6f8bad70	fix(install-log): pre-create /var/log/archipelago/ so non-root backend can write The backend runs as `archipelago` and calls `install_log()` to append audit lines to the install log on every install / update / remove / start / stop / restart. Target path was /var/log/archipelago-container-installs.log, which does not exist and cannot be created by the service because /var/log/ is root-owned. OpenOptions errors were silently swallowed, so the log was never written on any node. Ship a tmpfiles.d rule that pre-creates /var/log/archipelago/ and container-installs.log with archipelago:archipelago ownership. Move the const path to match, keeping logs inside the directory logrotate already rotates (image-recipe/configs/logrotate.conf). Install the rule from both the ISO build and self-update, and apply it immediately on self-update so existing nodes get a working log without needing a reboot. Verified on .228: file created, backend user can write, backend binary rebuilt with new const.	2026-04-23 12:02:46 -04:00
archipelago	9f3d66e24e	docs(release-notes): v1.7.43-alpha bullet for self-update script refresh Document that OTA updates now refresh the reconcile helper scripts, closing the deploy gap that kept fixes to those scripts from reaching existing nodes.	2026-04-23 11:51:04 -04:00
archipelago	a272a79706	fix(self-update): install reconcile scripts on OTA updates The OTA self-update path only refreshed image-versions.sh, leaving reconcile-containers.sh and container-specs.sh frozen at whatever version was baked into the ISO that originally provisioned the node. Any fix to those scripts (notably the --create-missing flag and the DISK_GB detection fix shipped this round) never reached existing nodes, and on .228 both scripts were outright missing because the node predated their inclusion in the ISO recipe. Install all three helper scripts to /opt/archipelago/scripts/ on every self-update run. Also preserve the legacy copy of image-versions.sh at /opt/archipelago/image-versions.sh for any older backend binaries still looking there first.	2026-04-23 10:07:53 -04:00
archipelago	694e5b0a9d	fix(update): pass --create-missing when rollback recreates a destroyed container The update flow removes the old container before starting the new one. If the update fails after removal, the rollback path tries `podman start <name>` first, then falls back to reconcile. But reconcile without --create-missing treats the now-absent container as an optional one that the install flow will (re)create later, and skips it. Result: container stays destroyed until someone notices and runs reconcile manually. Add --create-missing to the rollback reconcile invocation so the fallback actually rebuilds the container from its canonical spec. Fixes the failure mode observed on .228 where a bitcoin-knots update left the node with no bitcoin-knots container at all.	2026-04-23 10:06:55 -04:00
archipelago	0f1ad47aec	docs(release-notes): v1.7.43-alpha bullets for disk-detection and rollback recovery Add two user-facing release notes for fixes shipped this round: - Full-archive Bitcoin nodes no longer silently get pruned on reconcile because the disk-size check was reading the OS partition. - Failed updates can now recover via reconcile --create-missing instead of leaving a destroyed container behind.	2026-04-23 10:02:32 -04:00
archipelago	06dcdafda4	fix(specs): measure DISK_GB at /var/lib/archipelago, not / The reconcile spec for bitcoin-knots auto-enables prune=550 when DISK_GB < 1000. DISK_GB was measured via `df /`, which on every archy install reports the ~30 GB OS partition because user data lives on a separate encrypted /var/lib/archipelago volume. Result: every archy node with a 2 TB data drive was silently being configured as a pruned node, and any bitcoin-knots container recreated by reconcile would delete its historical blocks down to the 550 MB prune window on next start. Observed on .228 (2 TB box): blocks dir went from 384 GB to 926 MB after a reconcile-triggered restart. Historical archive unrecoverable without full re-IBD from genesis. Fix: check /var/lib/archipelago first (where bitcoin data actually lives). Fall back to / only on first-boot before the data partition is mounted.	2026-04-23 09:54:16 -04:00
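
The decision described in 06dcdafda4 reduces to two small rules. The sketch below restates them in Rust for clarity only; the real logic lives in a shell spec script, and both function names and the `Option` encoding of "data partition not mounted yet" are assumptions.

```rust
const PRUNE_THRESHOLD_GB: u64 = 1000; // DISK_GB < 1000 enables pruning
const PRUNE_TARGET_MB: u64 = 550; // prune=550 from the commit message

/// Hypothetical helper: prefer the size of the /var/lib/archipelago volume,
/// falling back to the root filesystem only when the data volume is absent
/// (first boot, before the data partition is mounted).
pub fn effective_disk_gb(data_volume_gb: Option<u64>, root_gb: u64) -> u64 {
    data_volume_gb.unwrap_or(root_gb)
}

/// Hypothetical helper: Some(prune target in MB) for small disks,
/// None for a full-archive node.
pub fn prune_setting(disk_gb: u64) -> Option<u64> {
    if disk_gb < PRUNE_THRESHOLD_GB {
        Some(PRUNE_TARGET_MB)
    } else {
        None
    }
}
```

Measuring only the roughly 30 GB OS partition corresponds to always passing `None` for the data volume, which is why every node was configured as pruned before the fix.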
archipelago	92612ddc70	feat(reconcile): add --create-missing flag for recovering from failed-update rollbacks Context: when package update fails after remove-old-container but before reconcile-recreate, the rollback path in update.rs tries to restart the old container by name. If the container is already gone (removed in step 3 of the update), rollback fails silently and the node is left with no live container for that app but on-disk data still intact. This is exactly the state .228 ended up in after the reconcile-script-missing bug killed bitcoin-knots and lnd. Reconcile was designed to only repair existing containers for optional apps (SPEC_OPTIONAL=true): it skips "not installed" entries on the assumption that the install RPC creates them. That safety check is correct for normal operation but blocks recovery when an optional-marked container has been destroyed by a failed update. Fix: add --create-missing flag that overrides the SPEC_OPTIONAL skip. When set, reconcile treats absent containers exactly the same as broken containers — it creates them from the canonical spec using the existing on-disk data directory. Narrow-scope override; the default behaviour is unchanged. Updated --help to document all four flags. Verified on .228: after the failed bitcoin-core update took out both bitcoin-knots and lnd, running reconcile --container=bitcoin-knots --create-missing --force (as the archipelago user, not root — podman is rootless) brought bitcoin-knots back using the pruned chainstate at /var/lib/archipelago/bitcoin. Repeated for lnd. All containers now running; electrumx reconnecting; UIs recovering. Does NOT fix the underlying update-flow rollback hole (rollback should be able to re-create a container from spec, not just restart by name). That is a separate commit — this flag is the manual recovery tool plus the primitive the improved rollback will call.	2026-04-23 09:42:19 -04:00
archipelago	353825b66c	docs: release-note image-versions fix, add marketplace QA tracker, update RESUME - AccountInfoSection.vue: append 5th bullet to v1.7.43-alpha entry explaining that update-available badges and version comparisons work again now that the pinned-image catalog is found at the correct deployed path. - docs/MARKETPLACE-QA.md: new tracker for the upcoming app-by-app install walk on .228. Documents the per-app fix workflow, the four layers we might need to fix at (app recipe, registry image, backend orchestrator, frontend), status-key table for tracking each catalog entry, and the release-notes policy for the walk. - docs/RESUME.md: refresh with a9908597 commit, updated binary md5 on .228, and split Immediate Next Step into Phase 1 (browser verification) and Phase 2 (marketplace walk) with a pointer to the new tracker.	2026-04-23 09:32:41 -04:00
archipelago	12f93cc15e	fix(image-versions): locate image-versions.sh at its actual deployed path The Rust search path listed /opt/archipelago/image-versions.sh and scripts/image-versions.sh (repo-relative for dev), but the image recipe deploys the file to /opt/archipelago/scripts/image-versions.sh. Production nodes therefore silently failed every lookup: find_file returned None, load_image_versions returned an empty HashMap, and both pinned_image_for_app and pinned_images_for_stack returned no matches. Symptom on deployed nodes: every container scan emitted "image-versions.sh not found in any search path" at DEBUG level, and the version-comparison logic in docker_packages.rs plus the update-check logic in api/rpc/package/update.rs silently degraded to no-op — users would not see update-available badges and upgrade RPCs could not resolve pinned targets. Fix: put the canonical deployed path first in PATHS, keep the older /opt/archipelago/image-versions.sh as a fallback for not-yet-updated nodes, and retain scripts/image-versions.sh as the dev-repo-relative fallback. Verified on .228: backend now logs "Parsed 57 image versions from /opt/archipelago/scripts/image-versions.sh" on scan. Pre-existing test_parse_image_versions failure in this module is unrelated (the NOT_AN_IMAGE assertion was broken before this change because the parser's _IMAGE-suffix retain keeps it). Leaving that for the general cargo-test cleanup pass.	2026-04-23 09:29:15 -04:00
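
The search order from 12f93cc15e can be illustrated with a first-match lookup. This is a sketch under assumptions: the three paths and their order come from the commit message, but `find_image_versions` and its injected `exists` predicate are hypothetical stand-ins for the real `find_file`, chosen so the ordering can be exercised without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Candidate locations in priority order, as listed in the commit message.
static PATHS: [&str; 3] = [
    "/opt/archipelago/scripts/image-versions.sh", // canonical deployed path
    "/opt/archipelago/image-versions.sh",         // legacy path on older nodes
    "scripts/image-versions.sh",                  // repo-relative, for dev
];

/// Hypothetical helper: return the first candidate the predicate accepts.
/// In production the predicate would be something like `|p| p.is_file()`.
pub fn find_image_versions<F: Fn(&Path) -> bool>(exists: F) -> Option<PathBuf> {
    for candidate in PATHS.iter() {
        let path = Path::new(*candidate);
        if exists(path) {
            return Some(path.to_path_buf());
        }
    }
    None
}
```

A `None` result is the case that previously degraded to an empty version map, so a caller would likely want to log it above DEBUG level.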
archipelago	4faac9cb74	docs(resume): add RESUME.md for context-restart recovery Consolidated single-file snapshot of plan + progress for a fresh OpenCode session to pick up the install UX polish work: - Where we are: v1.7.43-alpha shipped, 5 commits on main, deployed to .228, browser verification in progress. - Immediate next step: await user's verification results from https://192.168.1.228/ browser checklist. - Working layout: SSHFS mount, ssh archy / archy228, deploy recipes. - Architecture patterns: async-spawn lifecycle, phase-based install progress, scanner kick, .23 auto-purge migration. - Backlog: Vaultwarden exit-on-start, install log perms, 22 stale cargo test failures, historical changelog entries left intact. - User preferences: "best long-term first", one-by-one, no push, Bitcoin-only, conventional commits. Complements STATUS.md (which remains the engineering log) with a tighter resume-the-work narrative focused on the current round.	2026-04-23 09:14:36 -04:00
archipelago	b62b731db0	docs(status): record rounds 3-5 + config migration + changelog as shipped Adds a new top section to STATUS.md covering v1.7.43-alpha: - Round 3: phase-based install progress bar - Round 4: post-install scanner kick for instant Launch button - Round 5: .23 VPS retirement, .168 promoted to Server 1 - Config migration: auto-purge .23 from saved registry/mirror JSONs - Changelog: new v1.7.43-alpha entry in AccountInfoSection All 5 commits, deployment md5, verification notes, and git remote cleanup captured. Round 2 rollback command still valid for the full stack since backups predate every round in this session.	2026-04-23 09:09:02 -04:00
archipelago	6c8cb50679	docs(changelog): add v1.7.43-alpha entry covering async lifecycle + .23 retirement Four release-note bullets describing the user-visible changes shipped in this round: - async-spawn install/update/uninstall (UI no longer freezes) - phase-based install progress bar (Preparing through Finalizing) - scanner kick post-install (Launch button appears immediately) - .23 Hetzner VPS retired, .168 OVH promoted to Server 1 with auto-purge migration for existing nodes Matches the tone of existing changelog entries: what changed from the operator's perspective, not internal implementation detail.	2026-04-23 09:07:29 -04:00
archipelago	28e38a36a9	fix(config): auto-purge decommissioned .23 VPS from saved registry/mirror configs load_registries + load_mirrors normally only ADD missing defaults to the persisted JSON — explicit removals stick. After retiring the .23 Hetzner VPS we need the opposite: existing nodes have .23 baked into their saved configs and would spend seconds per install/update timing out against a dead host until the operator manually removes it via the Settings UI. Add a targeted one-time migration in both loaders: if any saved entry has 23.182.128.160 in its URL, drop it on load and rewrite the file. This is an exception to the usual "explicit removals stick" rule — the user never chose to add this mirror, it was a default. Narrow-scope migration (one hardcoded IP match, no schema version) because the cost/benefit of a general migration system isn't worth it for a single decommissioned host. Future retirements can follow the same pattern.	2026-04-23 08:51:26 -04:00
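
The one-time migration in 28e38a36a9 amounts to a filter plus a "did anything change" signal. The sketch below assumes entries are plain URL strings; the real `load_registries` and `load_mirrors` work on structured configs that the log does not show, and `purge_retired` is a hypothetical name.

```rust
/// The decommissioned host named in the commit message.
const RETIRED_HOST: &str = "23.182.128.160";

/// Hypothetical helper: drop every saved URL that mentions the retired host.
/// Returns true when something was removed, meaning the caller should
/// rewrite the persisted JSON file.
pub fn purge_retired(urls: &mut Vec<String>) -> bool {
    let before = urls.len();
    urls.retain(|url| !url.contains(RETIRED_HOST));
    urls.len() != before
}
```

Returning a flag keeps the file rewrite conditional, so nodes that never had the retired entry do not touch their config on load.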
archipelago	d9d5fa65e5	chore: retire .23 VPS mirror, promote .168 OVH to primary The Hetzner VPS at 23.182.128.160 was decommissioned. Replace it everywhere with the OVH VPS at 146.59.87.168, which was previously the tertiary mirror. - update.rs: drop DEFAULT_TERTIARY_MIRROR_URL, promote .168 into the secondary slot as "Server 1 (OVH)"; tx1138 becomes Server 2. Default mirror list shrinks from 3 to 2. - container/registry.rs: default RegistryConfig drops .23, promotes .168 to Server 1 / priority 0, tx1138 stays Server 2 / priority 10. - api/rpc/package/config.rs: trusted-registry allowlist swaps .23 for .168. - api/handler/mod.rs: app-catalog fallback URL uses .168. - neode-ui/views/marketplace/marketplaceData.ts: REGISTRY uses .168. - scripts/image-versions.sh: ARCHY_REGISTRY_FALLBACK uses .168. - image-recipe/build-auto-installer-iso.sh: installer ISO registries use .168 (both podman registries.conf and backend registries.json). Tests updated to assert on the new 2-entry default lists (registry + mirror). URL-parser fixture tests in update.rs retain .23 strings — they exercise string-parsing logic, not mirror policy. Git remotes: dropped `gitea-vps` and the .23 push URL on the `origin` multi-push alias (not part of this commit — pure working-copy change).	2026-04-23 08:22:32 -04:00
archipelago	980c1b25f4	fix(install): kick scanner post-install so Launch button appears immediately After install completes, the async-spawn wrapper wrote state=Running but the skeletal install-time manifest (interfaces: None) persisted until the next scheduled 60s scan. The frontend saw state=running but hasUI=false and hid the Launch button for up to a full minute. Add a shared Notify/watch pair between RpcHandler and the scan loop: - scan_kick (Notify): scan loop selects! between the 60s interval and this notify, running immediately on either. - scan_tick (watch<u64>): scan loop bumps the counter after each completed scan so callers can await completion. Install and update success paths now call kick_scanner_and_wait before flipping to Running. The scan merges via merge_preserving_transitional (state stays Installing/Updating, manifest refreshed from live podman with interfaces.main.ui populated from real port bindings). 2s timeout falls back to pre-fix behavior on slow podman — no regression.	2026-04-23 07:59:03 -04:00
archipelago	7e62ea07f7	feat(install): phase-based progress bar replaces unparseable pull bytes Podman emits zero parseable progress when stderr is piped (no TTY), so the old byte-counter regex never matched in real installs. Users saw 0% for the whole pull, then a jump to 95%, then silence through create-container, health-check, and post-install hooks. Replace with 7 explicit lifecycle phases wired through install.rs and update.rs: Preparing (5%), PullingImage (20%), CreatingContainer (70%), StartingContainer (80%), WaitingHealthy (88%), PostInstall (95%), Done (100%). Each maps to a fixed UI progress and status message. Frontend PHASE_INFO mapper in stores/server.ts prioritizes phase when present, falls back to byte-counter for legacy. A Math.max forward-only guard ensures the bar never regresses. Deleted the duplicate watcher in Discover.vue that was fighting the store's watcher with stale byte logic. Added shimmer CSS on the fill (with prefers-reduced-motion opt-out) so the bar looks alive during long phases.	2026-04-23 07:58:43 -04:00
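
The phase table and forward-only guard from 7e62ea07f7 can be sketched as below. The phase names and percentages are taken from the commit message; the enum layout and the `next_progress` helper are assumptions, and the real guard lives in the TypeScript store (`Math.max`), not in Rust.

```rust
/// The seven lifecycle phases listed in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstallPhase {
    Preparing,
    PullingImage,
    CreatingContainer,
    StartingContainer,
    WaitingHealthy,
    PostInstall,
    Done,
}

impl InstallPhase {
    /// Fixed UI progress for each phase.
    pub fn percent(self) -> u8 {
        match self {
            InstallPhase::Preparing => 5,
            InstallPhase::PullingImage => 20,
            InstallPhase::CreatingContainer => 70,
            InstallPhase::StartingContainer => 80,
            InstallPhase::WaitingHealthy => 88,
            InstallPhase::PostInstall => 95,
            InstallPhase::Done => 100,
        }
    }
}

/// Hypothetical forward-only guard: the bar never moves backwards,
/// even if a stale or out-of-order phase update arrives.
pub fn next_progress(current: u8, phase: InstallPhase) -> u8 {
    current.max(phase.percent())
}
```

An out-of-order `CreatingContainer` update arriving while the bar is at 88% leaves it at 88%.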
archipelago	576ff1a6de	docs(status): mark install/uninstall/update async-spawn as shipped	2026-04-23 06:58:45 -04:00
archipelago	49b98e0271	fix(rpc): empty icon in transient install entry to avoid broken-image flicker create_installing_entry hardcoded /assets/img/app-icons/<id>.png for every new install. About half the app icons ship as .svg or .webp (lnd.svg, vaultwarden.webp, bitcoin-knots.webp, mempool.webp), so the browser 404s on the wrong extension and renders the default broken-image glyph for the 10-30s window before the scanner refreshes with real manifest data. Send empty icon. The frontend's icon computed in AppCard.vue falls through to curatedMap which has correct extensions for bundled apps, and handleImageError still guards any remaining misses with a placeholder SVG.	2026-04-23 06:58:12 -04:00
archipelago	702b5d64d3	fix(ui): shorten install/uninstall/update timeouts for async RPCs With the backend flipped to async-spawn, install/uninstall/update return immediately with a { status, package_id } envelope. Client timeouts of 45m/11m were a leftover from synchronous handlers and masked real RPC failures. Drop all install/uninstall/update RPC timeouts to 15s. Progress and terminal state still arrive through the live state stream — the RPC only needs to confirm the spawn was accepted. Return-type annotations updated in rpc-client.ts and stores/server.ts. Five direct rpcClient.call sites across Marketplace.vue, Discover.vue, and MarketplaceAppDetails.vue updated with the shorter timeout.	2026-04-23 06:58:02 -04:00
archipelago	1ad889608f	feat(rpc): async-spawn install/uninstall/update lifecycle Extend the async-spawn treatment previously shipped for Stop/Start/Restart to the three remaining long-running lifecycle RPCs. Each wrapper validates params, rejects duplicate in-flight ops, flips state to the transitional variant (Installing/Removing/Updating), then spawns the existing inner handler on tokio. RPC returns immediately with { status, package_id }; the spawn task owns the terminal state write. Install and update success arms explicitly set state=Running. The scan loop merge (merge_preserving_transitional) refuses to overwrite transitional states, so the spawn task must write the terminal state. Uninstall's inner handler removes the entry entirely, so no explicit terminal write is needed there. Dispatcher and handler now thread self as Arc<Self> / &Arc<Self> so spawned tasks can hold their own Arc without extra field cloning. Transient install entry uses empty icon string. Hardcoding /assets/img/app-icons/<id>.png 404s for apps that ship .svg or .webp assets, which produces a broken-image flicker until the scanner refreshes with manifest data. Empty string causes the frontend's icon computed to fall through to the curated map, which has correct extensions. Removed the inner "already updating" guard in update.rs — the wrapper now owns duplicate-op detection for all three operations.	2026-04-23 06:57:50 -04:00
archipelago	0ea4f96de9	docs(status): mark async-spawn lifecycle fix as shipped Records the four landed commits, the .228 deploy (binary + frontend paths, backups, md5), the manual LND Stop verification, and the rollback incantation. Leaves the older "NEXT SESSION" design block in place as historical reference with a note that it's stale. Adds a follow-ups list: chaos matrix is now unblocked, bundled-app RPCs are still sync (deprecate or mirror-async?), transitional_since is in-memory only, and there are 22 pre-existing test failures in unrelated modules that should get their own cleanup pass.	2026-04-23 05:30:45 -04:00
archipelago	a8158b1ef5	fix(ui): single-button lifecycle control with transitional labels The app card and details view previously used a pair of Start/Stop buttons whose labels were driven off isAppLoading(), a client-side "I just clicked the button" flag. When the backend's graceful stop took longer than the RPC round-trip (up to 600s on bitcoin-core), the flag cleared while the container was still shutting down, the UI flipped back to "Running" as soon as the next 10s scan saw the still-alive container, and the user had no indication the stop was still in flight. Now that the backend flips PackageState to Stopping / Starting / Restarting / Installing / Updating / Removing for the duration of each lifecycle operation and the scan loop preserves those states, the UI can drive its label off the container state itself. A single full-width primary button replaces the Start/Stop pair. Its label, color, and disabled state come from getAppVisualState(), which collapses resting states (exited/created/paused/installed) into "stopped" and passes transitional states through untouched. Changes: - container-client.ts: widen ContainerStatus.state union to include the six transitional variants plus "installed". Add restartContainer() calling the new container-restart RPC. - stores/container.ts: add getAppVisualState() computed and the restartContainer() action. - ContainerApps.vue: single primary button (Start / Stop / Starting / Stopping / Restarting etc.) plus a separate circular Restart button visible only when running. Critically, handleStartApp and handleStopApp now route through store.startContainer and stopContainer (which call container-start / container-stop, the async RPCs) instead of the legacy synchronous bundled-app-start / bundled-app-stop path. Transitional-state polling widened from just "created" to the full set of transitional variants. - ContainerAppDetails.vue: same single-button pattern, Restart button now calls container-restart instead of the old stop-sleep-start sequence, added 2s polling interval for transitional states. - components/ContainerStatus.vue: widen state prop to match the shared union, render transitional labels with a trailing ellipsis and a yellow dot. No new tests: this is presentation logic. Manual verification on .228 will confirm the end-to-end async path: click Stop on LND, button becomes "Stopping" in under a second, stays that way for roughly 5 minutes, then flips to "Start" with a grey dot. The UI must never revert to "Running" mid-stop.	2026-04-23 05:20:15 -04:00
archipelago	cd69c3b2f6	fix(state): preserve transitional state across container scans The 30s package scan loop used to blindly overwrite every package entry from podman inspect. While a user-initiated Stop / Start / Restart was in flight, the RPC spawn task would flip the state to Stopping / Starting / Restarting, the next scan would see podman still reporting "running" (for the duration of the graceful stop, up to 600s for bitcoin-core), and clobber the transitional state back to Running. The dashboard would then flip Running -> Stopping -> Running -> Stopped, making it look like the stop had silently failed until it eventually completed. The merge loop now treats transitional variants (Stopping, Starting, Restarting, Installing, Updating, Removing, and the three backup variants) as owned by the RPC spawn task. For those variants, merge_preserving_transitional keeps the existing state while still taking live observability fields (health, exit_code, installed, lan_address, manifest, static_files, available_update) from the fresh scan so the UI continues to see live health readings. Adds an escape hatch via a per-scan transitional_since side table: if a package has been in a transitional state for more than 1200s (2x the longest graceful stop at 600s on bitcoin-core), the scan loop assumes the spawn task died without cleanup and overrides with podman's live state. Prevents a crashed background task from wedging a package in Stopping forever. Three unit tests cover the merge rule, the observability passthrough, and the transitional-variant classifier.	2026-04-23 05:15:13 -04:00
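
The merge rule and escape hatch from cd69c3b2f6 can be sketched as a pure function. This is a simplified illustration under assumptions: the entry carries only `state` and `health` (the real one also passes through exit_code, manifest, and other fields), the three backup variants are omitted, and the elapsed time is passed in rather than looked up in the `transitional_since` side table.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Stopping,
    Starting,
    Restarting,
    Installing,
    Updating,
    Removing,
}

impl PackageState {
    /// Transitional states are owned by the RPC spawn task, not the scanner.
    pub fn is_transitional(self) -> bool {
        !matches!(self, PackageState::Running | PackageState::Stopped)
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct PackageEntry {
    pub state: PackageState,
    pub health: Option<String>, // stands in for all live observability fields
}

/// 2x the longest graceful stop (600s) named in the commit message.
const STALE_AFTER: Duration = Duration::from_secs(1200);

/// Keep a transitional state unless it has been held for more than 1200s;
/// always take the observability fields from the fresh scan.
pub fn merge_preserving_transitional(
    existing: &PackageEntry,
    scanned: PackageEntry,
    in_transition_for: Duration,
) -> PackageEntry {
    let keep = existing.state.is_transitional() && in_transition_for <= STALE_AFTER;
    let state = if keep { existing.state } else { scanned.state };
    PackageEntry { state, ..scanned }
}
```

Passing the elapsed duration as a parameter keeps the function deterministic, which is one way the merge rule and the timeout can be unit-tested separately from the scan loop.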
archipelago	39dd1d9dcc	fix(rpc): async container stop/start/restart; widen state mapping RPC handlers no longer block on podman operations. container-stop on bitcoin-core used to hold the connection for up to 600s while the UI showed a frozen spinner; it now returns in under a second with {status: stopping} after flipping the package state to Stopping and broadcasting over WebSocket. Same treatment for container-start and the new container-restart route. Widens container-list state mapping to emit the transitional variants (stopping, starting, restarting, installing, updating, removing, installed, and the backup states) instead of collapsing them to "unknown". Keeps the mapping in sync with the UI ContainerStatus.state union so the dashboard can render the right transitional label. Mirrors the treatment in package/runtime.rs for package.start, package.stop, and package.restart. The body of each handler is lifted into pure do_package_* helpers that the background task runs; state flipping is bracketed around the spawn with revert on error. The pre-existing post-start exit-check verification and restart stop+start fallback run inside the spawned task, not the RPC body. Adds container-restart route to the dispatcher. mark_user_stopped continues to run BEFORE the spawn, preserving the ordering contract with the crash recovery layer at runtime.rs:145-148.	2026-04-23 04:59:45 -04:00
archipelago	5baced5f5b	feat(rpc): spawn_transitional helper for async lifecycle ops Introduces a new RPC-layer helper that bridges the synchronous ContainerOrchestrator trait with RPC handlers that must return in <1s. The helper flips the package state to a transitional variant (Stopping / Starting / Restarting) in the StateManager so WebSocket clients see the live label immediately, then tokio::spawns the actual orchestrator call. On success it writes the final state; on error it reverts to the pre-transition state and logs via install_log(). The ContainerOrchestrator trait stays synchronous so the reconciler, boot flow, unit tests, and chaos harness keep deterministic behaviour. Async only lives in the RPC layer. Not wired to any handler yet — Commit 2 consumes this helper. Widens install_log visibility from pub(super) to pub(in crate::api::rpc) so the new sibling module can reach it.	2026-04-23 04:55:52 -04:00
archipelago	cad63bdd76	docs: STATUS.md — FUSE/SSHFS development loop section Dedicated section covering the file-ops-via-mount + git/cargo-via-ssh split that makes this dev setup work. Includes: - Exact running mount command (pulled from ps) - macFUSE + sshfs-mac brew install path - Health check + recovery sequence for when mount hangs (it will) - Full which-path-for-which-operation table - Don't-do list (cargo from mount, rsync without AppleDouble exclude, etc) - Cache caveat and inode-sharing note between mount and SSH views No code change.	2026-04-23 04:51:53 -04:00
archipelago	bb2e3fab42	docs: STATUS.md — complete SSH/key/sudo/deploy reference for next session Expands NEXT SESSION header with fully verified access info so a fresh agent has zero ambiguity: - SSH key inventory across laptop, .116, .228 (every file, purpose noted) - Actual SSH config aliases (archy, archy228) with IdentitiesOnly - Verified connectivity matrix (laptop -> both; .116 -> .228; .228 has no outbound key) - Corrected sudo state: .228 sudoers file is /etc/sudoers.d/archipelago (not archipelago-ci); .116 has archipelago-ci + archipelago-wg scope-limited drop-ins - SSHFS mount source command + AppleDouble gotcha - Cargo over SSH PATH gotcha + detached build pattern for >2min timeout - End-to-end deploy-to-.228 recipe (build, SCP, atomic swap, verify) - Git workflow rules (no push, no amend, no force, conventional commits) Removes duplicate host-reference block that the prior edit left trailing. No code change.	2026-04-23 04:49:45 -04:00
archipelago	6a5fab709a	docs: STATUS.md — dashboard Stop UX bug diagnosis + async-spawn fix plan Captures full design for the next session: - Full bug sequence (5.5min blocking RPC + 30s scan clobbering transitional state) - 4-commit implementation order with exact file:line targets - Single-button UI spec with full label table - Verification gates including manual LND stop test on .228 - Architectural decision: spawn lives in RPC layer, orchestrator trait stays sync No code change yet; next session implements.	2026-04-23 04:45:12 -04:00
archipelago	2a2f10608b	docs: STATUS.md — .228 dashboard bugs fixed (macaroon + ExtraHost)	2026-04-23 04:17:56 -04:00
archipelago	7257f72f4a	fix(first-boot): use podman host-gateway magic for host.containers.internal The previous code computed HOST_GATEWAY from `ip route show default` to work around an alleged podman 4.3.x limitation. Two problems: 1. The comment was wrong. Podman 4.4+ supports --add-host=host-gateway natively, and we ship 5.4.2. 2. More critically, `ip route show default` returns the LAN router (e.g. 192.168.1.254) — the gateway to the internet, not the gateway to the host. Every container configured with DAEMON_URL or --bitcoind.rpchost=host.containers.internal was therefore dialing the WiFi router instead of the host machine, silently failing. Symptoms this caused on .228: - LND crash-looped with "dial tcp 192.168.1.254:8332: connection refused" - Dashboard showed no LND connect details or QR - ElectrumX DAEMON_URL broken; stuck at 2 KB index for days - Any service reaching bitcoin-core through the `archy-net` bridge Replace the computed value with the literal string "host-gateway", which podman translates to the correct in-network gateway at container start. Also drop the stale HOST_GATEWAY reference in the Tor-bootstrap branch (it always fell back to TARGET_IP anyway). Verified on .228: after recreating bitcoin-core/electrumx/lnd with the new flag, LND reached the chain backend, ElectrumX resumed indexing, and the dashboard /lnd-connect-info endpoint succeeded.	2026-04-23 04:16:42 -04:00
archipelago	30b31b3670	fix(lnd): read admin macaroon via sudo fallback LND's admin.macaroon is owned by a rootless-podman subordinate UID (typically 100000) with mode 640. The archipelago server runs as UID 1000 and cannot read the file directly, which caused every dashboard LND RPC (getinfo, connect-info, export-channel-backup) and lnd_client to fail with "Failed to read LND admin macaroon". Add a read_lnd_admin_macaroon() helper that first tries a direct read (for operators who have relaxed permissions) then falls back to `sudo -n cat`, mirroring the pattern already used for Tor hidden service hostnames in handle_lnd_connect_info. Centralise the canonical macaroon path as LND_ADMIN_MACAROON_PATH and route all four callers through the helper. Verified on .228: GET /lnd-connect-info now returns 200 with cert, macaroon, and tor_onion fields. Dashboard QR/connect-string UI unblocked.	2026-04-23 04:15:44 -04:00
archipelago	28819d1197	docs: STATUS.md through Step 9 (.228 hot-swap verified) Logs Step 9 acceptance evidence, the two bugs caught and fixed during the hot-swap (parse_memory_limit IEC suffix bug in 732df1b8 and cgroup Delegate in ba83f9bc), and outlines the Step 10 plan for .116.	2026-04-23 03:46:23 -04:00
archipelago	80765c5755	feat(systemd): delegate cgroup controllers to archipelago.service Adds Delegate=memory pids cpu io to the archipelago.service unit. Context: the service runs as User=archipelago under system.slice with rootless podman. When podman creates transient libpod-*.scope units for containers under user.slice, systemd needs the caller to hold CAP_SYS_ADMIN on the target cgroup subtree, which happens iff Delegate= lists the controllers we want to set. Without Delegate, any future code path that goes through the podman CLI (runtime.rs) instead of the libpod HTTP API (podman_client.rs) would hit MemoryMax rejections that have exactly the same symptom as the bug I just fixed in parse_memory_limit but with a completely different root cause. Belt-and-braces: current production path uses PodmanClient and was fixed in the preceding commit. But the DockerRuntime CLI path in runtime.rs:262-268 (cmd.arg("--memory")) is still reachable via AutoRuntime fallback on hosts without podman, and future Rust orchestrator code may legitimately need cgroup delegation. This directive is a harmless no-op on hosts that already delegate upstream (systemd gracefully handles duplicate/nested delegation).	2026-04-23 03:44:36 -04:00
archipelago	8acf7d1112	fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes The libpod HTTP API path (PodmanClient::create_container) ran manifest memory_limit values like "128Mi" through parse_memory_limit which lowercased+trim_end_matches("m"), leaving "128i" which parse::<f64>() rejected. The resulting None became 0 via .unwrap_or(0), and podman serialised that into the OCI config as memory.limit:0. At container start time systemd then rejected MemoryMax=0 with "Value specified in MemoryMax is out of range". Silently wrong for every manifest in apps/ that uses Kubernetes-style suffixes (all of them). Became visible on .228 when Step 9 first exercised the ProdContainerOrchestrator path for bitcoin-ui and lnd-ui installs — the old first-boot-containers.sh bash script used podman run --memory 128m directly, which podman-the-CLI parses correctly, so the bug never surfaced before. Two parts: - parse_memory_limit now recognises Ki/Mi/Gi/Ti (IEC binary, what k8s and our manifests use), kB/MB/GB/TB (SI decimal), k/K/m/M/g/G/t/T (docker shorthand, treated as IEC binary for backwards compat), and bare byte integers. Filters out zero/negative results. - create_container omits the memory/cpu fields entirely when the manifest has no limit or parsing fails, rather than emitting 0. The libpod API treats absent as unlimited; 0 is "set MemoryMax=0" which systemd rightly rejects. Defence in depth against the next weird suffix someone puts in a manifest. Six regression tests in the new tests module cover IEC, SI, shorthand, raw bytes, invalid input (empty/garbage/0/negative), and whitespace.	2026-04-23 03:44:23 -04:00
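The suffix handling listed in this commit might look something like the sketch below. It is an illustration under assumptions: the return type (Option<u64>) and the table-driven layout are guesses, and only the suffix families and the zero/negative filter come from the commit message.

```rust
/// Hypothetical sketch of the parsing rules described above; the real
/// function's signature and internals are not shown in the log.
pub fn parse_memory_limit(input: &str) -> Option<u64> {
    // Two-letter suffixes come first so "Mi" and "MB" are never read as a bare "M".
    const UNITS: &[(&str, u64)] = &[
        // IEC binary (Kubernetes style)
        ("Ki", 1 << 10), ("Mi", 1 << 20), ("Gi", 1 << 30), ("Ti", 1 << 40),
        // SI decimal
        ("kB", 1_000), ("MB", 1_000_000), ("GB", 1_000_000_000), ("TB", 1_000_000_000_000),
        // Docker shorthand, treated as IEC binary
        ("k", 1 << 10), ("K", 1 << 10), ("m", 1 << 20), ("M", 1 << 20),
        ("g", 1 << 30), ("G", 1 << 30), ("t", 1 << 40), ("T", 1 << 40),
    ];

    let s = input.trim();
    // No recognised suffix means the whole string is a bare byte count.
    let (number, multiplier) = UNITS
        .iter()
        .find_map(|(suffix, mult)| s.strip_suffix(suffix).map(|n| (n, *mult)))
        .unwrap_or((s, 1));

    let value: f64 = number.trim().parse().ok()?;
    if !value.is_finite() || value <= 0.0 {
        return None; // reject zero, negative, NaN, and infinity
    }
    let bytes = (value * multiplier as f64) as u64;
    (bytes > 0).then_some(bytes)
}
```

Returning None rather than 0 for bad input pairs with the second half of the fix: the caller can then omit the field entirely instead of emitting a zero limit.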
archipelago	c396be8068	feat(iso): Step 8a — retire archipelago-reconcile systemd timer BootReconciler (in-process, 30s interval, spawned from main.rs as of Step 6 commit 48f08aa3) fully replaces the timer-driven bash reconciliation path. Delete the systemd unit + timer and their ISO-builder touchpoints. Removed: - image-recipe/configs/archipelago-reconcile.service - image-recipe/configs/archipelago-reconcile.timer - image-recipe/build-auto-installer-iso.sh L412-413 (COPY unit+timer) - image-recipe/build-auto-installer-iso.sh L449 (systemctl enable) - image-recipe/build-auto-installer-iso.sh L542-543 (cp to WORK_DIR) Kept (intentionally): - scripts/reconcile-containers.sh - scripts/container-specs.sh Reason: core/archipelago/src/api/rpc/package/update.rs still invokes reconcile-containers.sh at two sites (OTA update + rollback paths). Porting those call sites to ContainerOrchestrator::upgrade() requires manifests for every container update.rs might touch — that scope belongs in Step 8b. Until then the script stays on disk, just no longer runs on a periodic timer. No Rust code changes. cargo check -p archipelago clean, 6 pre-existing warnings. Skipped full ISO rebuild validation per user decision — edits are 5 textual deletions with zero behavioral ambiguity; Step 9 live hot-swap on .228 will catch any regression.	2026-04-23 03:04:58 -04:00
archipelago	236a2dee85	docs: split Step 8 into 8a/8b/8c Discovered during Step 8 execution that first-boot-containers.sh creates 30+ containers with per-container logic (wallet loads, DB init, rpcauth derivations, post-create health waits) and does substantial non-container setup (secret gen, rootless-podman subuid chowns, Tor hostnames, WireGuard, firewall, nostr-relay). Only 3 of the 30+ containers have manifests today (the UIs from Step 7). Deleting the bash in a single step bricks first-boot on fresh installs. Split into: - 8a: delete reconcile-containers.sh + container-specs.sh + reconcile systemd unit + timer. BootReconciler fully covers these. Safe, atomic, no manifest porting required. - 8b: port remaining ~25 containers into apps/<id>/manifest.yml. One manifest per commit, validated against current bash behavior. Multi-day scope. - 8c: rename first-boot-containers.sh -> first-boot-setup.sh, strip container ops, keep secret/dir/Tor/WG/firewall setup. Final one-way door, requires 8b complete.	2026-04-23 02:34:43 -04:00
archipelago	758d3e47d8	docs: STATUS.md through Step 7	2026-04-23 02:21:01 -04:00
archipelago	3e9c192b48	feat(container): bitcoin-ui pre-start hook renders nginx.conf from embedded template Replaces the first-boot-containers.sh sed/envsubst approach with a Rust-native render step bound into the ContainerOrchestrator lifecycle. - New container::bitcoin_ui module: embeds the nginx.conf template via include_str!, reads the plaintext RPC password from /var/lib/archipelago/secrets/bitcoin-rpc-password, substitutes {{BITCOIN_RPC_AUTH}} with base64(archipelago:<password>), and atomic- writes (tmp + rename) to /var/lib/archipelago/bitcoin-ui/nginx.conf. Idempotent: byte-compares before writing so unchanged input is a no-op (no inode churn, no restart cascade). - ProdContainerOrchestrator gains run_pre_start_hooks(app_id) returning HookOutcome::{Rewritten, Unchanged}. Fires in install_fresh before create_container, and in ensure_running: on Running + Rewritten triggers a restart; on Stopped re-renders then starts. - bitcoin-ui Dockerfile no longer COPYs a default.conf; the file now arrives via runtime bind-mount of the rendered config. If the bind- mount is ever missing, nginx starts with no site configured and returns 404 everywhere — safe failure vs. serving upstream RPC with a stale Authorization header. - apps/{bitcoin,electrs,lnd}-ui/manifest.yml land as first-class manifests. bitcoin-ui declares the bind-mount target and a dependency on bitcoin-core; electrs-ui and lnd-ui declare their own deps and health checks. - 8 new unit tests on the render fn (idempotency, rotation, trimming, missing/empty secret, template invariants) plus an integration test asserting install(bitcoin-ui) actually lands a substituted nginx.conf on disk via the hook. 39/39 container:: tests pass (test_parse_image_versions pre-existing failure unchanged, out of scope).	2026-04-23 02:19:52 -04:00
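The idempotent atomic write this commit describes (byte-compare, then tmp + rename) could be sketched as below. The function name write_if_changed and the temp-file naming are assumptions; HookOutcome mirrors the variant names in the commit message, but its real definition is not shown. Template substitution and base64 encoding are left out.

```rust
use std::{fs, io, path::Path};

/// Mirrors the variant names from the commit message; the real type may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum HookOutcome {
    Rewritten,
    Unchanged,
}

/// Hypothetical helper: write `rendered` to `target` only if the bytes differ.
pub fn write_if_changed(target: &Path, rendered: &[u8]) -> io::Result<HookOutcome> {
    // Byte-compare first so unchanged input causes no write and no restart.
    if let Ok(existing) = fs::read(target) {
        if existing == rendered {
            return Ok(HookOutcome::Unchanged);
        }
    }
    // Write to a sibling temp file, then rename over the target.
    // A rename within one directory is atomic on POSIX filesystems,
    // so readers see either the old file or the new one, never a partial write.
    let tmp = target.with_extension("tmp");
    fs::write(&tmp, rendered)?;
    fs::rename(&tmp, target)?;
    Ok(HookOutcome::Rewritten)
}
```

A caller in the position of ensure_running would restart the container only on Rewritten, which is what avoids the restart cascade mentioned above.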
archipelago	ba8bd0bb86	docs: STATUS.md through Step 6	2026-04-22 19:20:17 -04:00
archipelago	6a0809d386	feat(container): wire ProdContainerOrchestrator + BootReconciler into main Step 6 of the rust-orchestrator migration. Construct the container orchestrator once in main.rs, call load_manifests + adopt_existing immediately after Config::load, log the adoption report, and spawn BootReconciler::run_forever with the 30s default interval. Thread the orchestrator through Server::new -> ApiHandler::new -> RpcHandler::new so the reconciler and RPC layer share one instance. Wire a tokio::sync::Notify through the SIGTERM/SIGINT shutdown path so the reconciler exits cleanly alongside the server drain. Uses notify_one so the signal stores a permit if the reconciler is mid reconcile_all when the signal fires. Delete the commented-out run_boot_reconciliation block in main.rs that documented the prior bash-script approach being unsafe on unbundled installs — the new reconciler is manifest-driven and only touches apps present in /opt/archipelago/apps, fixing that concern. cargo check -p archipelago clean (6 pre-existing dead-code warnings on trait methods not yet exercised until Step 9 hot-swap). Container test suite 43/44 pass; the one failure (container::image_versions:: test_parse_image_versions) is pre-existing and unrelated.	2026-04-22 19:20:13 -04:00
archipelago	81c1613040	feat(container): BootReconciler — periodic reconcile loop for prod orchestrator Step 5 of the rust-orchestrator migration. New file boot_reconciler.rs holds a small Tokio task that calls ProdContainerOrchestrator::reconcile_all() on a 30-second cadence (answered design Q3). * BootReconciler::new(orch, interval, shutdown) — shutdown is an Arc<Notify> so callers can trigger a graceful exit without pulling in tokio-util. * run_forever(self) — does one reconcile immediately, then loops on tokio::select! { sleep_until \| shutdown.notified() }. Shutdown interrupts the sleep but never an in-flight reconcile_all call. * Per-pass outcomes are logged at debug/warn; failures never propagate out because reconcile_all already absorbs per-app errors into ReconcileReport. Four tokio::test(start_paused = true) tests verify the loop cadence against a CountingRuntime test double: * initial_pass_fires_immediately — first reconcile runs with no delay * second_pass_fires_after_interval — second pass fires after exactly interval elapses in paused-clock time * shutdown_terminates_loop — notify_one() lets run_forever return * failure_in_one_pass_does_not_stop_loop — the loop keeps ticking even when the first pass had to install a missing container Not wired into main.rs yet — that is Step 6. Re-exported from container::mod as BootReconciler + RECONCILER_DEFAULT_INTERVAL for the wire-up step.	2026-04-22 19:04:34 -04:00
archipelago	89199bb03b	docs: update STATUS.md — Step 4 done, Step 5 next Records acceptance evidence for Steps 1-4 (container tests 21/21 pass, build clean with expected unused-method warnings) and queues the BootReconciler implementation for Step 5.	2026-04-22 18:57:43 -04:00
archipelago	ca299e70e8	chore: gitignore macOS AppleDouble files from SSHFS writes The laptop mounts ~/Projects/archy over SSHFS and macOS finder / Spotlight sidecars write ._<name> resource-fork files alongside every edit. They are noise; keep them out of git.	2026-04-22 18:56:58 -04:00
archipelago	40a6eaca72	feat(container): ContainerOrchestrator trait, RpcHandler uses it in prod Step 4 of the rust-orchestrator migration. Unifies the container lifecycle surface behind a single trait so the RPC layer stops caring whether it is talking to the dev or prod orchestrator. * New trait core/archipelago/src/container/traits.rs: ContainerOrchestrator with install / start / stop / restart / remove / upgrade / status / list / logs / health, all keyed by app_id. Every method is async_trait-based. * ProdContainerOrchestrator: the lifecycle methods are moved from inherent impl into the trait impl (avoids name-shadowing recursion). Adoption and reconcile remain inherent since only main.rs / BootReconciler call them. * DevContainerOrchestrator: new trait impl that forwards to the existing Dev-named methods, applying the dev container-name + port-offset rules internally. New load_manifest_for() helper resolves app_id to <data_dir>/apps/<app_id>/manifest.yml so trait-level install(app_id) works in dev too. install_container(manifest, path) stays inherent for the manifest-path RPC shape. * RpcHandler now holds Option<Arc<dyn ContainerOrchestrator>> and, when in dev mode, a separate Option<Arc<DevContainerOrchestrator>> for the manifest_path install RPC. In prod mode RpcHandler::new() constructs a ProdContainerOrchestrator and calls load_manifests() at startup. * All seven container-* RPC guards no longer say dev mode required. container-install still requires dev mode because its manifest_path argument has no prod meaning; every other container RPC now works in both modes via the trait. BOOT STILL DOES NOT USE THIS. main.rs wire-up (Step 6) and BootReconciler (Step 5) come next. Until then the prod orchestrator is constructed but nothing populates /opt/archipelago/apps so it has zero manifests to manage, matching the pre-Step-4 behaviour. Verification: cargo build -p archipelago clean (11 expected unused method warnings for methods not yet wired from main.rs). 
cargo test -p archipelago: all 21 container::* tests pass (16 prod_orchestrator + 5 others). 24 other test failures are pre-existing and unrelated (identity_manager / session / wallet / mesh / credentials — all independently flaky on file-backed state).	2026-04-22 18:56:52 -04:00
archipelago	e103925a4e	feat(container): ProdContainerOrchestrator with build-or-pull, adoption, reconcile Step 3 of the rust-orchestrator-migration. New file prod_orchestrator.rs (999 LOC) implements the full public surface that will replace scripts/first-boot-containers.sh: * install / start / stop / restart / remove / upgrade / status / list / logs / health * adopt_existing: read-only scan that claims containers matching our manifests by name, without recreating — preserves the v1.7.42 fixture on .116. * reconcile_all: level-triggered, per-app failures collected rather than aborting. * install_fresh: build-or-pull (Step 2 trait methods), relative build contexts resolved against the manifest directory. Naming rule (answered design Q1): UI app IDs (bitcoin-ui/electrs-ui/lnd-ui) get the archy- prefix; backends keep their bare ID. An explicit extensions.container_name always wins. Codified in compute_container_name() with unit tests for all three tiers. Concurrency (answered design Q4): per-app tokio::sync::Mutex<()> created lazily, protecting every mutating op against the reconciler loop. Acquiring the per-app lock only needs a read lock on the map, so independent apps do not serialize. 16 tests: 3 sync naming rule tests + 13 tokio async tests covering install (pull, build-absent, build-present, relative-context), reconcile (noop/exited/missing/ mixed-failure), adopt-by-name, upgrade sequence ordering, list filtering, health state mapping, and unknown-app-id rejection. All pass. Not wired into main.rs yet — that is Step 6. Crate builds clean with expected unused warnings for the new re-exports.	2026-04-22 18:32:31 -04:00
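The three-tier naming rule from this commit can be illustrated with a short sketch. The signature is an assumption (the real compute_container_name presumably takes a manifest rather than two loose arguments); the rule itself is taken from the message.

```rust
/// UI app IDs named in the commit message.
const UI_APP_IDS: [&str; 3] = ["bitcoin-ui", "electrs-ui", "lnd-ui"];

/// Hypothetical signature: `explicit` stands in for extensions.container_name.
pub fn compute_container_name(app_id: &str, explicit: Option<&str>) -> String {
    // Tier 1: an explicit container name always wins.
    if let Some(name) = explicit {
        return name.to_string();
    }
    // Tier 2: UI apps get the archy- prefix.
    if UI_APP_IDS.contains(&app_id) {
        return format!("archy-{app_id}");
    }
    // Tier 3: backends keep their bare ID.
    app_id.to_string()
}
```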
archipelago	56af57a6f8	feat(container): runtime trait gains image_exists + build_image Adds two methods to ContainerRuntime so the upcoming ProdContainerOrchestrator can inspect local image storage and build images from BuildConfig: - image_exists(image_ref) -> Result<bool>: local-storage check only, does not consult registries. Distinguishes exit 0 (present) from exit 1 (absent) from other failures (environment error). - build_image(&BuildConfig) -> Result<()>: shells out to podman/docker build with -t, -f, deterministically-sorted --build-arg pairs, and the context path last. Implemented on all three runtimes: - PodmanRuntime: new podman_cli helper shells out alongside the existing HTTP API calls (build and image inspect are awkward over the HTTP API) - DockerRuntime: native docker CLI, same exit-code semantics - AutoRuntime: delegates to the selected inner runtime Argv construction extracted into pure build_args_for_podman helper so it can be unit-tested without a real podman. 4 new tests cover minimal args, custom Dockerfile path, deterministic build-arg sorting (guards against HashMap iteration non-determinism), and context-is-last (positional arg placement is load-bearing for podman build). Step 2 of docs/rust-orchestrator-migration.md. 25/25 tests pass.	2026-04-22 17:46:47 -04:00
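As a rough illustration of the argv construction described here, the sketch below builds the argument list with sorted build args and the context last. The BuildConfig field names are assumptions; only the flag order (-t, -f, --build-arg pairs, context) follows the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical shape; the real BuildConfig fields are not shown in the log.
pub struct BuildConfig {
    pub tag: String,
    pub context: String,
    pub dockerfile: Option<String>,
    pub build_args: HashMap<String, String>,
}

/// Pure function, so it can be unit-tested without a real podman binary.
pub fn build_args_for_podman(cfg: &BuildConfig) -> Vec<String> {
    let mut argv = vec!["build".to_string(), "-t".to_string(), cfg.tag.clone()];
    if let Some(path) = &cfg.dockerfile {
        argv.push("-f".to_string());
        argv.push(path.clone());
    }
    // HashMap iteration order is unspecified, so sort for a stable argv.
    let mut pairs: Vec<(&String, &String)> = cfg.build_args.iter().collect();
    pairs.sort();
    for (key, value) in pairs {
        argv.push("--build-arg".to_string());
        argv.push(format!("{key}={value}"));
    }
    // The build context is positional and must come last.
    argv.push(cfg.context.clone());
    argv
}
```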
archipelago	919055f3f1	feat(container): add build source to manifest schema ContainerConfig.image is now Option<String>, mutually exclusive with a new optional ContainerConfig.build: Option<BuildConfig>. Exactly one of image or build must be present, enforced in AppManifest::validate. Adds ResolvedSource enum (Pull \| Build) and ContainerConfig::resolve + ::image_ref helpers so the orchestrator can treat pull and build uniformly. All 26 existing pull-only manifests continue to parse unchanged (covered by existing_pull_only_manifests_still_parse test). Call sites updated: podman_client, runtime::DockerRuntime, dev_orchestrator. Dev orchestrator errors out cleanly on Build sources until Step 2 lands build_image support on the runtime trait. Step 1 of docs/rust-orchestrator-migration.md. 10 new unit tests, all pass. Also includes: docs/rust-orchestrator-migration.md (design spec) and docs/STATUS.md resume section for the next session.	2026-04-22 17:46:36 -04:00
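The "exactly one of image or build" rule can be sketched as a match over the two optional fields. Field names, the error type, and the borrowed ResolvedSource shape are assumptions for illustration; the commit says the check lives in AppManifest::validate, and this sketch folds it into resolve for brevity.

```rust
/// Hypothetical minimal shapes; the real structs carry more fields.
pub struct BuildConfig {
    pub context: String,
}

pub struct ContainerConfig {
    pub image: Option<String>,
    pub build: Option<BuildConfig>,
}

pub enum ResolvedSource<'a> {
    Pull(&'a str),
    Build(&'a BuildConfig),
}

impl ContainerConfig {
    /// Returns the single configured source, or an error if zero or two are set.
    pub fn resolve(&self) -> Result<ResolvedSource<'_>, String> {
        match (&self.image, &self.build) {
            (Some(image), None) => Ok(ResolvedSource::Pull(image)),
            (None, Some(build)) => Ok(ResolvedSource::Build(build)),
            (Some(_), Some(_)) => Err("image and build are mutually exclusive".to_string()),
            (None, None) => Err("one of image or build is required".to_string()),
        }
    }
}
```

Matching on the tuple makes all four combinations explicit, so a pull-only manifest takes the first arm unchanged.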
archipelago	0ac673deb4	release(v1.7.42-alpha): bitcoin RPC retry wrapper so syncing nodes stop flashing red Closes failure mode adjacent to FM3 (docs/bulletproof-containers.md): on a syncing pruned node, bitcoind's RPC thread blocks for 5-10s during block validation. The old 10s client-side timeout was rejecting roughly 30% of UI calls even though the node was perfectly healthy. 20x stress test on the live .116 node (caught in IBD catch-up at block 797k) used to drop 10 of 20 calls; now drops 0 of 20. What changed: - core/archipelago/src/api/rpc/bitcoin.rs: bitcoin_rpc_call now retries up to 3 times with 500ms and 1500ms backoffs between attempts. Only transient transport errors (timeout, connect refused, send/recv IO) trigger retry. A well-formed bitcoind error response is surfaced immediately - real RPC bugs are never masked. - Per-attempt hard deadline (tokio::time::timeout, 15s) layered on top of reqwest's own timeout, so DNS starvation or TLS wedging can't steal the entire retry budget. - handle_bitcoin_getinfo client builder gained a 3s connect_timeout so a dead bitcoind is fast-failed inside the first attempt instead of eating the whole 15s. - Retry policy extracted into a RetryConfig struct so tests can dial down timeouts to ~100ms per attempt. Production defaults live in RetryConfig::production(). Not changed (tracked as follow-up): - mesh/mod.rs bitcoin_rpc_getblockcount and related helpers use the same 10s-timeout pattern. Not migrated to the new wrapper in this release; scheduled for v1.7.43 alongside the render_bitcoin_conf work. - lnd/info.rs and electrs_status have similar 10s/15s timeouts but different failure profiles - audit first, migrate only the ones that actually exhibit the bug. Tests: 6 new unit tests under api::rpc::bitcoin::tests, all passing. Uses an in-process hyper server (already a transitive dep) to simulate bitcoind responses; no new crates required. 
- happy_path_first_attempt: no retry when first attempt succeeds - retries_on_timeout_then_succeeds: first attempt times out, second succeeds, returns OK (uses a short-timeout RetryConfig so the test runs in <1s instead of 15s) - retries_exhausted_on_persistent_connect_refused: all attempts fail against a closed port, error bubbles up, elapsed time confirms backoffs actually ran - does_not_retry_on_rpc_level_error: bitcoind-returned error body is surfaced immediately, no retry - does_not_retry_parse_errors: non-JSON response (e.g. 503 with html body) is NOT retried - guards against the tempting "retry all non-2xx" mistake that would mask real bitcoind misconfig - retry_budget_invariants: asserts total wall-time ceiling stays under 60s so a bumped constant can't silently hang a UI call forever Validated live on .116: 20/20 bitcoin.getinfo calls succeed during IBD catch-up (chain at block 797419 -> 797464), vs ~40% baseline under the old 10s timeout. Worst-case latency was 48.9s during peak validation; happy-path latency (cached result) remains 28-77ms. v1.7.42-alpha	2026-04-22 16:46:28 -04:00
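The retry policy in this release (three attempts, 500ms and 1500ms backoffs, retry only on transient transport errors) can be sketched in a simplified form. This is a synchronous, std-only stand-in: the real wrapper is async on tokio with a per-attempt deadline, which is omitted here. CallError, the RetryConfig field, and call_with_retry are hypothetical names.

```rust
use std::{thread, time::Duration};

/// Hypothetical error split: transport failures are retried, RPC errors are not.
#[derive(Debug, PartialEq)]
pub enum CallError {
    Transient(String), // timeout, connection refused, send/recv IO
    Rpc(String),       // well-formed error response or parse failure
}

/// Assumed shape; the real RetryConfig also carries timeouts.
pub struct RetryConfig {
    pub backoffs: Vec<Duration>,
}

impl RetryConfig {
    /// Two backoffs give three attempts in total, as in the commit message.
    pub fn production() -> Self {
        Self {
            backoffs: vec![Duration::from_millis(500), Duration::from_millis(1500)],
        }
    }
}

pub fn call_with_retry<T>(
    cfg: &RetryConfig,
    mut attempt: impl FnMut() -> Result<T, CallError>,
) -> Result<T, CallError> {
    let mut backoffs = cfg.backoffs.iter();
    loop {
        match attempt() {
            // Transient failure: sleep and retry while backoffs remain.
            Err(CallError::Transient(msg)) => match backoffs.next() {
                Some(delay) => thread::sleep(*delay),
                None => return Err(CallError::Transient(msg)),
            },
            // Success or an RPC-level error is returned immediately.
            other => return other,
        }
    }
}
```

Keeping the backoffs in a config struct is what lets tests substitute millisecond delays, the same reason the commit gives for extracting RetryConfig.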


1127 Commits