Region (EU_868) + shared channel "archipelago" auto-provisioning shipped in
8fdb45e8 and riding the rolled #9 fleet binary (0060dcd6). Discovery, RF, and
sending verified on .116+.228; the one open blocker is the running driver not
surfacing received messages. Slotted after WS-F #9–11.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fresh Meshtastic radios ship region-UNSET (RF-silent) and on mismatched
channels, so nodes only ever saw themselves. Bring them to MeshCore parity
using the official Meshtastic admin API:
- Auto-provision LoRa region (set_config, AdminMessage field 34) from a new
mesh-config `lora_region` (e.g. EU_868) when the radio's region differs.
- Auto-provision a shared primary channel (set_channel, field 33) with a
PSK derived deterministically from channel_name, so every node converges on
one mesh — the parity equivalent of MeshCore's named "archipelago" channel.
- Read current region/channel from want_config; only write when different
(no reboot loop); cap attempts so a radio that won't persist can't loop.
- Active NodeInfo advert scaffolding + aggressive serial drain.
Verified on .116+.228: region+channel persist, discovery works (both see each
other as named reachable contacts), bidirectional RF + sending confirmed.
Receiving in the running driver is still under diagnosis (instrumentation added).
Also removes the unwanted `meshtastic` daemon app from the registry (it was
never meant to be a container — native driver provides system-level support):
deletes apps/meshtastic + catalog entries (app-catalog, neode-ui, releases) +
test refs. Meshtastic stays native, like MeshCore.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ensure_bind_mount_dirs chowned a freshly-created no-data_uid bind dir
with --reference={immediate_parent}. For a NESTED bind source like
jellyfin's /var/lib/archipelago/jellyfin/config (or netbird's .../netbird/
data), `mkdir -p` creates the intermediate <app> dir root:root too, so
referencing the immediate parent just copied ROOT — leaving the dir
unwritable and the app EACCES-crash-looping on reinstall (found by the
all-apps-lifecycle pass: jellyfin "/config/log denied" exit 139;
netbird-server "unable to open database file"). It only ever worked for
direct children of the data root (immich).
Fix: anchor to the nearest PRE-EXISTING ancestor (the rootless data root,
owned by the service user) and chown -R the entire newly-created subtree
to it. Extracted the walk into fresh_subtree_anchor() with a unit test
covering nested / direct / second-volume cases.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 0.4.11 edit affordance only lived on ServerConnectScreen, which a
connected user never sees. Add edit to NESMenu — the settings modal
reached via two-finger hold while connected: a ✎ pencil on each saved
server opens the form pre-populated (Edit Server header + Cancel),
persists via ServerPreferences.updateSavedServer(), and reconnects when
the edited server is the live one.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an edit affordance to each saved server in ServerConnectScreen: a
pencil button loads the entry into the form (Edit Server mode) with
Save Changes / Cancel actions. Persisted via a new
ServerPreferences.updateSavedServer() that replaces by connection
identity (address/port/scheme) and keeps the active record in sync when
the edited server is the active one.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture the 2026-06-26 lessons durably: ship via the hardened publish
script only, v1+v2+v3 signing is enforced by apksigner (AGP ignores
enableV1Signing at minSdk>=24), diagnose install failures with adb
install FIRST, signature-key changes force a one-time uninstall, and
keep all phone/adb work scoped to com.archipelago.app.debug.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The published companion APK was v2-only (AGP silently ignores
enableV1Signing for minSdk>=24) and clean builds broke on stray
space-named resource dirs. Harden scripts/publish-companion-apk.sh:
clean build, remove/ýreject space-named res dirs, force v1+v2+v3 via
zipalign+apksigner, and abort unless all three schemes verify. Wire
ship-companion.sh to the shared script. Re-sign the served 0.4.10 APK.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Active counterpart to the read-only all-apps-matrix.bats: drives
stop/start/restart for every installed app and, under
ARCHY_ALLOW_CASCADE_DESTRUCTIVE, a FULL teardown (uninstall →
no-ghost → reinstall) — the broad coverage F needs beyond the ~8 core
suites. App set is discovered from My Apps ∩ the node catalog; reinstall
spec comes from catalog.json {dockerImage, containerConfig}.
PROTECTED by default (never cycled or torn down): bitcoin*/electrum*
(expensive resync) AND lnd/btcpay*/fedimint* (teardown = irreversible
wallet/channel/guardian loss). The user asked to protect only
bitcoin+electrum; the wallet apps are added for safety and can be
removed via ARCHY_MATRIX_PROTECT. Heavy + destructive → a supervised
pass, not folded into run-gate. Validated on .228: discovery excludes
the 6 protected installed apps; lifecycle tier cycles a single app
(botfights) stop/start/restart green; teardown gated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AppCard's uninstall bar was hardcoded `w-full bg-red-400/60 animate-pulse`
— a solid, full-width, red, fake-pulsing block that never moved and read
as an error, no matter the actual teardown progress (the install bar, by
contrast, renders a real percentage). Derive a truthful percentage from
the backend's existing `uninstall-stage` label — "Stopping containers
(X/N)" → 10–50%, "Cleaning up volumes" → 70%, "Removing app data" → 90%
— and render it exactly like install: neutral fill, real width + percent,
shimmer (not a fake pulse) carrying motion when a stage has no number.
Frontend-only; the backend already broadcasts these stages.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run-gate.sh ran only the DESTRUCTIVE tier; the cascade-uninstall suite
(uninstall→no-ghost→reinstall, the #13/#14/uninstall-hang regression
guard) existed but was never enabled by the gate. Add an opt-in single
cascade pass after the 5× loop (ARCHY_GATE_CASCADE=1, requires
ARCHY_ALLOW_DESTRUCTIVE=1), counted into the pass/fail tally. Kept out
of the 5× loop deliberately — uninstall/reinstall every iteration would
balloon runtime and re-pull images; one pass guards the class. Default
gate behavior unchanged. Validated: cascade-uninstall.bats 7/7 on .228.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Workstream F now in-progress: the immich/grafana uninstall hang →
ghost/stuck-bar/reinstall-block is root-caused (unbounded systemctl/
podman in quadlet::disable_remove) and fixed (71cc9ac4); cascade-
uninstall.bats 7/7 on .228. Records the remaining F items + the pending
gate-wiring decision.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Uninstalling immich/grafana could hang with a frozen full-red progress
bar, leave a ghost entry stuck in My Apps, and then refuse reinstall.
Single root cause: quadlet::disable_remove() — called first in the
uninstall task (via companion + orchestrator teardown) — ran
`systemctl --user stop`, daemon-reload, and `podman rm -f` with NO
timeout. On rootless podman a generated unit can wedge in "deactivating"
while podman hangs underneath, so `systemctl stop` blocks forever. The
spawned uninstall task then never returns Ok or Err, so:
- set_uninstall_stage() (after the stop) never fires → progress frozen;
- remove_package_state_entry() never runs → entry stranded in
`Removing` → ghost in My Apps;
- the install guard rejects reinstall with "already Removing".
The spawn wrapper already reverts state on Err and removes the entry on
Ok — the only failure mode was a hang that returns neither. Bound the
teardown so it always terminates:
- systemctl stop → QUADLET_STOP_TIMEOUT, escalate to kill+reset-failed
on timeout (reuses the existing helpers);
- daemon_reload_user() → bounded systemctl_user_status (30s);
- defensive `podman rm -f` → wrapped in tokio timeout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
§10b: replace per-app static launch-port map with a manifest-first +
non-HTTP-port-skipping heuristic (the gitea :2222 class).
§10c: generalize the un-pruned/archival Bitcoin install blocker from a
hardcoded requires_unpruned_bitcoin() match to a manifest-declared
dependency, with a clear pre-install UX.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Banner + §8b: zombie-container guard (0a8db904, live-proven on .228) and
gitea launch-port fix (670ebb06) shipped in binary 040df5ce, rolled to
the fleet. Logs the mempool env-drift recreate-loop and nostr-rs-relay
follow-ups.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gitea publishes two host ports — SSH on 2222 and the web UI on 3001.
The launch URL comes from manifest_lan_address_for() (the manifest's
interfaces.main → 3001), but Gitea had no entry in the static
lan_address_for() fallback map. On a node where the gitea manifest is
absent or stale (no interfaces block), the lookup returns None and the
code falls through to extract_lan_address(), which returns whichever
port podman lists first — frequently the SSH port. Result: the app
launched at :2222 instead of :3001 (observed on tailscale node
100.82.34.38).
Add the canonical "gitea" => http://localhost:3001 entry to the static
map, matching every other core app, so the web UI is pinned regardless
of manifest presence.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
podman trusts its own state DB: when a container's conmon dies without
podman observing it (cgroup-cascade SIGKILL on archipelago.service
restart, a crash), `podman ps` keeps reporting it "Up" long after the
process is gone. The reconciler NoOp'd such a zombie forever, so a dead
dependency with no published host port never recovered.
Observed live on .228 (2026-06-25): netbird-dashboard reported "Up" with
a dead State.Pid → its nginx proxy 502'd → NetBird login broke
("Unauthenticated"). The dashboard publishes no host port, so the
Running branch had nothing to probe and never recreated it.
Add a zombie guard to the Running branch: verify the recorded State.Pid
is alive (its /proc entry exists) before trusting "running"; on a
concrete dead PID, stop+remove+install_fresh from the manifest.
Conservative by design — any uncertainty (inspect failed, PID
unparseable) assumes alive, so a transient podman hiccup never destroys
a healthy container. Unit test covers live/dead/out-of-range PIDs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Node apps (e.g. NetBird on :8087) terminate TLS with a self-signed cert
so the dashboard gets a secure context (OIDC / window.crypto.subtle, #15).
The WebView's default onReceivedSslError CANCELs untrusted certs, so those
apps rendered blank in the companion — exactly the netbird "won't load in
the webview" report. Override onReceivedSslError in both WebViewClients
(kiosk + in-app browser) to proceed() only when the failing cert's host
matches the connected node; reject everything else (no blanket trust).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
netbird is fully manifest-driven (apps/netbird-*/manifest.yml via the signed
catalog): install_stack_via_orchestrator renders the 3-member stack with
generated_certs (self-signed TLS for the #15 OIDC secure context), base64
generated_secrets, and templated config — and adopts the running stack by live
container name. The hardcoded `podman run` fallback was therefore dead code on
any node with the embedded catalog (verified live: .228 https:8087 -> 200).
Removes the per-app Rust installer anti-pattern the master plan calls out:
- install_netbird_stack: orchestrator -> adopt -> bail! (no in-Rust installer)
- deletes 6 now-dead helpers (write_netbird_config_files, ensure_netbird_tls_cert,
read_or_generate_b64_secret, netbird_net_resolver_ip, detect_netbird_public_host_ip,
wait_for_netbird_oidc_ready), 3 NETBIRD_*_IMAGE consts, unused base64::Engine import
- ~485 lines removed; prod_orchestrator doc-comments updated
Behavioural parity: the manifest path already executed on the fleet, so this
changes no live behavior. The legacy #10 OIDC-readiness wait was already bypassed
by the manifest path; if that race resurfaces, add an OIDC-ready gate to the
manifest rather than resurrecting the Rust fn.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 5x destructive gate on heavy nodes false-failed on transient windows
during stack recovery, not real regressions:
- immich.bats: lan_address port-publish probe 30s -> 90s. The postgres->redis
->server (DB migrations on boot) stack can take >30s to republish :2283 after
a churn-induced recreate; destructive-tier immich tests already allow 180-240s.
- mempool.bats: orphan-container check now polls to steady state (<=30s) instead
of a single-shot count, which caught a recreated member briefly visible
alongside its replacement mid-reconcile.
- run-gate.sh: settle cap 180s -> 300s and also gate on immich's :2283 when
installed, so the next iteration's read-only probe doesn't race a still-
recovering stack. Settle returns the instant every probe is green.
A genuinely unexposed/orphaned/unhealthy app still fails these checks; they only
absorb the transient recreate window under sustained churn.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NOT yet validated on a node or fleet-deployed — cargo check passes, release build
+ .228 canary validation pending. Committed as a checkpoint so the work survives.
Two fixes the immich .198 incident exposed:
Fix A (reconcile_all_with_mode): a previously-running app whose container vanished
(e.g. a wedged podman teardown cleared by a reboot) was left absent on boot. Now,
when boot reconcile would leave an app 'absent' but it was running at the last
running-containers snapshot, recreate it (install_fresh). New
crash_recovery::load_last_running_names() reads the snapshot without the PID/crash
gate (+2 unit tests). Match is exact on compute_container_name (incl stack
members); user-stopped + uninstalled apps are already excluded, so no false
positives.
Fix B (ensure_bind_mount_dirs): a freshly-created bind dir was left root:root, so a
no-data_uid app running as container-root (→ host rootless user) hit EACCES and
crash-looped (the exact immich upload-dir failure). Now a newly-created bind dir
for a no-data_uid app is chowned via --reference=<parent> to match the rootless
data root — no host-uid guessing, only fresh dirs (no regression for existing
installs).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two console-noise fixes from a live error dump:
- remote-relay.ts reconnected on a FIXED 5s interval with no backoff, so when
the backend is briefly down it floods the console/network with failed-WS
attempts for the whole outage. It's a secondary feature (companion input), so
add exponential backoff 1s->30s (mirrors websocket.ts), reset on open/start.
- cryptpad's catalog/marketplace entries pointed at a non-existent
/assets/img/app-icons/cryptpad.webp -> a 404 on every marketplace render.
Point it at the existing default icon (handleImageError swapped to it anyway).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The global error handler (Vue errorHandler + window error + unhandledrejection)
fired a red 'Something went wrong: <raw msg>' toast AND an auto on-device overlay
on every caught error — deliberately loud for bug-bash, but it surfaces benign,
non-actionable noise (e.g. a transient RPC rejection during a ws reconnect, or
the service worker failing to register over a self-signed cert) right in the
user's face.
Demote the catch-all to SILENT capture: keep console.error + the
window.__archyErrors ring buffer, and expose the screenshot-able overlay
on-demand via window.__archyShowErrors() — but never auto-pop. Components that
need to report a specific, actionable failure still call toast.error() directly.
Also filter known-benign environmental noise (PWA service-worker registration
failing over a self-signed cert — needs a trusted cert, #56) so it doesn't even
occupy a ring-buffer slot and push out real errors.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The per-app suites cover ~8 core apps in depth; nothing covered the ~30 others
(jellyfin, vaultwarden, penpot, nextcloud, grafana, …). all-apps-matrix.bats
derives the app set from server.get-state package-data (no hardcoded list) and
asserts baseline health across EVERY installed app:
- settles to a non-transitional state within a window (the #13/#14 stuck-ghost
class, generalized fleet-wide — installing/removing that never settles)
- not in error/failed
- reports a recognized (non-garbage) state
- every running UI app (manifest ui=="true") exposes a non-null lan-address
(the immich/port-drift unreachable-UI failure, generalized to all UI apps)
Read-only, so it joins run.sh/run-gate.sh on every node and grows coverage as
nodes install more apps. Verified 5/5 on .228 (17 apps) and .116 (20 apps).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 5x gate is DESTRUCTIVE-only and never exercised uninstall/reinstall — where
the worst field bugs lived (#13 app ghosting in My Apps after uninstall, #14
reinstall stalling on stale state). New cascade-uninstall.bats drives the full
teardown path on a throwaway app (default grafana, precondition-skips if already
installed so it can't destroy real data) and asserts:
- fresh install reaches running via a truthful, non-silent progression
- uninstall makes the entry DISAPPEAR from server.get-state package-data
(the literal My Apps map) — no ghost, no stuck uninstall stage
- container + (on-node) data dir are gone
- reinstall returns to running
- node left as found
Opt-in via ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1; not yet folded into the canonical
gate. Verified 7/7 against .228.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ensure_running_container_ownership re-probed and re-attempted the in-container
chown on every reconcile pass. For a mount that can't be re-owned from inside the
userns (observed: mempool-api /data -> 'Operation not permitted'), this burned
CPU and logged a WARN on every pass, forever (~6x/30min on .228/.116).
Remember hard chown failures in a process-lifetime set keyed by (container-id,
dest) and skip the probe+chown for known-unrepairable mounts. Keyed by Id (not
name) so a recreated container gets a fresh repair attempt. Verified on .116:
one recorded failure at startup, then silent across subsequent reconciles.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The reconnect banner showed 'Connection lost'/'Reconnecting' instantly on every
socket close, even ones that recover in 100ms-2s (load spikes, Tailscale/relay
TCP resets). On a healthy node the drops are brief and self-healing, but each one
flashed a jarring banner, reading as constant instability.
Debounce the transient banner by 2.5s: only surface after the connection issue
persists past the grace window; hide immediately on recovery. Deliberate server
lifecycle transitions (restart/shutdown) bypass the debounce and still show at
once. A genuine persistent outage keeps isOffline true and surfaces after 2.5s.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Force-add the gitignored releases/app-catalog.json so nodes resolve
146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/app-catalog.json
(currently HTTP 404 → disk-manifest fallback). Embedded-manifest delivery
is default-on; origin-wins overlay with disk as fallback. Unsigned (migration
window accepts unsigned). Includes netbird x3 manifests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
handle_package_uninstall lumped every teardown failure into one `errors` vec
and returned Err on any of them BEFORE removing the package state entry — so a
non-fatal cleanup hiccup (a slow/failed `sudo rm -rf` of a large data dir, a
volume/network removal) left the app's containers gone but its entry in
package_data → a ghost in My Apps, and the spawned task reverted it to Installed.
Split the failures: container removal that even force-rm can't complete (app
genuinely still present) keeps the entry + returns Err; everything after the
containers are gone is best-effort. Remove the state entry as soon as the
containers are gone — BEFORE the slow volume/data teardown — so My Apps updates
immediately and residue can never ghost the app. set_uninstall_stage is a no-op
once the entry is gone (if-let guard), so the later stages don't re-create it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wait_for_manifest_host_ports TCP-connect-probed every published port, including
UDP/SCTP. netbird's 3478/udp STUN can never answer a TCP connect, so the probe
failed forever and drove an endless host-port repair/reconcile loop on .228
(netbird-server restarting ~every 60s). Filter to tcp (empty protocol = tcp).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Turn on registry-distributed manifests for all apps: generate-app-catalog.sh now
embeds each apps/<id>/manifest.yml by default (EMBED_MANIFESTS opt-out), so nodes
install from the signed catalog (origin-wins overlay, disk = fallback) with no
OTA-shipped disk manifest. main.rs awaits a bounded (25s) refresh_catalog before
load_manifests so a fresh boot overlays the latest embedded catalog instead of a
restart later; offline/ISO boot falls through to disk and never hangs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
image_exists ran `podman image inspect <image>` via .status() (inherits the
service stdout) with no --format, so every hit dumped the image's full ~249-line
manifest JSON into the journal — once per companion image, every reconcile pass
(.228: 21.6k journal lines / 10 min, 4131 inspect dumps). The service never
crashed (NRestarts=0); the sustained journald/IO flood starved the async runtime
and dropped the UI /ws/db websocket -> constant "connection lost"/reconnect.
Discard the child's stdout/stderr; only the exit status is used.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It called bats-assert's `fail` (not loaded in this file) → "fail:
command not found"/127, masking the real reason. Emit+return instead,
bump the cold-restart RPC window 60s→120s (block-index reload), and
note a node mid-IBD legitimately can't serve getinfo (environmental
precondition, not a product regression).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 2026-06-23 5×-green gate is DESTRUCTIVE-tier / ~8 core apps only —
it skips uninstall/reinstall (cascade) and has no progress-UI or
all-apps coverage. Manual multinode testing found real bugs it never
ran (immich+grafana uninstall hangs at full-red bar + ghost in My Apps;
grafana reinstall stops; fedimint guardian "waiting for bitcoin sync").
Adds §4 row F, §6b post-deploy order (netbird→Phase-3→F), §6c scope +
observed bugs + definition-of-done, a §5 warning, and §10 backlog to
investigate TanStack-Query/push-based state management for neode-ui.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Serve the companion download as a plain .apk so a phone installs it
straight from the link/QR with no unzip step. Repoint the in-app
download URL, the ship + publish scripts, and the pre-push hook at
archipelago-companion.apk, and drop the legacy .apk.zip.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Companion WebView now supports file inputs and downloads, and apps
opened in the in-app tab get a proper loading splash and a footer
control bar matching the web app-session bar.
- onShowFileChooser wired to an ActivityResultLauncher so <input
type=file> opens the system file browser (kiosk + in-app tab)
- DownloadListener: http(s) via DownloadManager (forwarding session
cookies), blob: via JS->base64->MediaStore, data: decoded inline
- in-app tab: app-icon + progress loading splash (eager favicon
fetch, upgraded via onReceivedIcon)
- footer controls (back/forward/refresh/open/close) matched to the
web AppSession mobile bar, with the same SVG glyphs as drawables
- bump to 0.4.8 (versionCode 12)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run-gate.sh 5/5 on .228. Reframe the TOP PRIORITY banner as
gate-green; keep the master plan as north-star source of truth; mark
the gate definition-of-done green and point at multinode as the next
exit criterion.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run-gate.sh 5×-green on .228, 0 not-ok (gate-5x5.log). Records the
milestone in the header/banner, §4 workstream E, §6 sequence, and §8b;
demotes the priority banner per §6 item 6. Next: bundled testing deploy
(.116/.198 + UX frontend), multinode pass, workstreams B/C/D.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- InAppBrowser now has a bottom control bar (back/forward/reload/open-in-browser/
close) mirroring the web mobile footer, plus a centered loading screen
(app favicon + progress bar) instead of a bare top bar over black.
- Commit a repo-dedicated debug keystore and pin signingConfigs.debug to it so
every machine — and the published companion download — signs debug builds with
the SAME key (fixes "App not installed" signature-mismatch on update). Force v1+v2.
- Bump versionCode 10→11, versionName 0.4.6→0.4.7.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Mobile launches use the store-driven panel (no route push) so the background
tab no longer changes and closing returns to where you launched from.
- Tab-only apps open directly (in-app WebView on companion / new tab on PWA) —
no "this app opens in a tab" interstitial.
- Shared AppLoadingScreen (app icon + progress bar) on the app session and the
legacy iframe overlay instead of a black screen.
- Pin the dashboard to 100dvh on mobile so the mesh chat/tools panes stop sliding
under the bottom tab bar in mobile browsers (no-op in the companion WebView).
- ElectrumX/electrs/electrs-ui ids now resolve to the real ElectrumX icon in My Apps.
- isMobile made reactive so overlay/footer/teleport decisions track the viewport.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5× run #4 flaked iter4 on "immich exposes its web UI lan-address
(port 2283)": container-list returned lan_address=null because
immich_server was momentarily mid-recreate when the read-only tier
queried it (passed the other 4 iterations; immich_server does publish
0.0.0.0:2283->2283). Same single-shot-read class as the bitcoin-knots
state probe — poll <=30s for the exposed port instead of one read. A
genuinely unexposed immich never publishes 2283, so real port drift
is still caught.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Record the overnight 5× outcome (2/5) and the triage: all three
fails were distinct one-offs. iter1 #5 bitcoin-knots = pre-launch
churn (hardened anyway); iter2 #74 + iter5 #73 = one real
orchestrator bug (phantom stack-member injection in
ordered_containers_for_start), now fixed + live-verified on .228.
Update the resume check command to gate-5x4.log.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
package.restart resolved its container list via
ordered_containers_for_start, which injected every name from the
union startup_order list that wasn't already present — including
variant names not live on a given node (mysql-mempool,
archy-mempool-api, archy-mempool-web). The phantom mysql-mempool is
2nd in the mempool start order, so do_orchestrator_package_start hit
its unknown-app-id fallback, do_package_start failed the inspect
("no such object"), and the `?` aborted the whole start sequence —
leaving mempool-api + the frontend down until the health monitor
recovered them minutes later. That was the source of the 5× gate
flakes #73 (frontend not running in 180s) and #74 (api not queryable
in 300s); root-caused from the .228 journal
("Start failed: mysql-mempool").
Replace the inject-then-sort logic with a pure helper
order_present_containers that orders only the actually-present
containers and never adds phantom entries. startup_order remains a
union of name variants across install generations — it's now used
purely to order what's live, not to inject what isn't. +3 unit tests.
Also harden bitcoin-knots.bats "valid state" probe: poll ≤30s for a
settled state instead of a single-shot read, so a container caught
mid-reconcile (transient restarting/configured) can't flake a 20-min
iteration. A genuinely-stuck container never settles, so real
breakage is still caught.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rename run-20x.sh → run-gate.sh, default ARCHY_ITERATIONS 20→5, and scrub
20× references across CLAUDE.md, the master plan, TESTING.md, app-registry
status, the orchestrator/config doc-comments, and the bats suites. Also add
a minimal fail() helper to mempool.bats so guard failures report cleanly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The frontend nginx used a literal proxy_pass host with no resolver, so it
pinned mempool-api's IP at worker startup. When the backend restarts (gate,
OTA, crash, reboot re-IPAM) podman reassigns its IP and nginx keeps proxying
to the dead one -> /api hangs, websocket 502s, UI shows 'offline' until a
manual nginx reload. Same stale-upstream-IP class as the netbird 502.
Fix: mempool-frontend:v3.0.1 rewrites the generated nginx-mempool.conf to
re-resolve the backend per-request via 'resolver' + a variable proxy_pass.
Resolver address is read from /etc/resolv.conf (podman aardvark-dns answers
on the network gateway, not Docker's 127.0.0.11). Per-location path mapping
preserved (ws -> '/', /api/v1 identity via no-URI, /api/ -> /api/v1/ rewrite).
Proven on .228: backend IP change now auto-recovers with no reload; the
literal-host control still 502s. Migrated the manifest off the retired
tx1138 registry to vps2.
Also: mempool.bats #74 waited only 180s post-restart (the slow path) and
called an undefined 'fail' helper (status 127). Bumped to 300s to match the
passing parity probes and emit a real failure instead.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Captures: .228 1x-GREEN (110/110); hardened 5x DETACHED on .228 (/tmp/gate-5x2.log,
nohup — survives terminal close) with the exact check-from-any-machine command; all
shipped code fixes (commits) + deploy state (.228 + .198); node-state fixes NOT in
repo (lnd nginx proxy 8081->18083, home-assistant orphan unit removed, electrumx
re-registered); the run-ON-the-node lesson; and remaining work.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 1x gate is green; the 5x failed iters 1-2 on readiness-under-churn (apps DO
recover — lnd synced, mempool just mid-restart when probed — but slower than the
windows when restarted back-to-back). Hardening:
- run-20x.sh: best-effort settle_stack() before each iteration (wait for
mempool-api/frontend + lnd RPC healthy, 180s, on-node, never fails the run).
- required containers present/running (80/81): wait-loops (180s) not single-shot.
- mempool api/frontend (87/88): retry ~180s not single-shot.
- mempool queryable (74): 60s->180s. lnd restart-running (64): 120s->240s.
lnd getinfo (60): 90s->240s retry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Companion app: open every app in the in-app WebView (not just non-iframeable),
carrying the mobile-iframe footer controls into the WebView. Mobile web (PWA):
open tab-apps directly in a new tab. No interstitial on either surface. Touch
points + prior commits (b5a9deb8, d1fbcd9b) noted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC).
Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md
with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale
nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites.
Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On a proper on-node .228 run (synced bitcoin, 4-fix binary) the lifecycle matrix is
green; these 4 were test-harness issues:
- lnd 'recovers after restart' (65): bump retry window 90s->240s. lnd cold-restart
recovery (wallet unlock + bitcoind reconnect + graph sync) exceeds 90s on a loaded
node but DOES complete (synced_to_chain:true).
- bitcoin ui responds (89): retry ~120s instead of single-shot (companion nginx may
have just been recreated by the companion-survives test).
- probe_app_url (99 lnd proxy + all ui-coverage proxy probes): retry up to 90s for
post-restart proxy/UI readiness instead of single-shot.
- required endpoints after restart (94): :8081 is nginx-proxy-manager, an OPTIONAL
app (not in required_containers) — only assert it when NPM is installed; and make
the trailing lncli getinfo a retry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnd's RPC isn't ready until its wallet auto-unlocks on (re)start, which lags the
container 'running' state — single-shot lncli getinfo raced that window and
false-failed (gate tests 60 + 85). Retry up to ~90s like a health probe. lnd is
functional (getinfo returns cleanly once ready).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- immich restart: bump wait 120s->240s. Restart = ordered stop+start of the 3-
container stack (postgres->redis->server w/ DB migrations), so it needs at least
as long as the start test (180s) — the old 120s was inconsistent and false-failed
on loaded nodes. immich does return to running.
- fedimint orphan check: the unanchored 'total' regex (^fedimint) counts the
legitimate fedimint-clientd (dual-ecash bridge) but the anchored 'known' regex
omitted it -> total>known false orphan on every node running fedimint-clientd.
Add fedimint-clientd to known.
Both run as LOCAL podman/systemctl on the gate runner, so they test the runner node
(.116), not the RPC target — surfaced while driving the .228 gate green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Independent companion loop (452f05d8) validated on .228: deleted archy-electrs-ui
recreates in ~10s (was stuck 100s+). Also: companion-survives bats does LOCAL
rm/systemctl --user, so running it from .116 via RPC tests .116's companions with
.116's binary, NOT the remote target — must run ON the target node. Explains the
'failed on both nodes' runs (both silently tested .116).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The companion-unit repair stage ran at the END of each boot-reconciler tick, after
reconcile_existing(). On a heavily loaded node that per-app pass takes >60-90s, so a
deleted/lost companion unit (electrs-ui, bitcoin-ui, …) wasn't repaired within any
reasonable window (gate test 31 'deleted unit recreated within one reconcile tick'
timed out at 90s on the 45-app .228 node). Detecting + rewriting a companion unit is
cheap, so spawn it as its own ~interval(30s) loop, independent of the slow app pass.
Handle is aborted when the main loop exits (shutdown uses notify_one, so a second
waiter would steal the wake permit). tick() is now app-reconcile only.
All 4 boot_reconciler cadence tests still green (companion_stage=false in tests).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Last 2 .228 stragglers confirmed load/timing, not bugs: test 31 (companion recreate)
= contamination + ~108s reconcile cadence > 90s window; test 55 (immich restart) =
heavy stack restarts >120s under load but DOES return. Path to literally-green gate
is infra (bitcoin sync, re-quadletize .228) + minor test-window tuning. Optional
product improvement noted: independent ~30s companion-reconcile cadence.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
companion::reconcile only recreates a deleted companion unit when its parent
backend is in manifest_ids. On contaminated .228, electrumx ran as plain podman
and was NOT a tracked manifest install (manifest on disk but unloaded), so the
reconciler never iterated it -> archy-electrs-ui companion orphaned. Proven:
package.install electrumx re-registered it + restored the companion. Self-heal
logic is sound; test 31 clears on re-quadletize. electrumx on .228 de-contaminated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
.228 104/110, .198 94/110 with the 3-fix binary. Every package.stop test passes on
healthy apps. .198's 14/16 failures trace to bitcoin in IBD (test 83: ~137k blocks
behind) cascading to lnd/btcpay/electrumx/mempool. 2 node-independent: companion
recreate (31, both nodes), fedimint orphan pollution (44). Path to green 5x gate is
now infra (sync bitcoin, re-quadletize .228) + minor (test 31), not lifecycle bugs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stop failure was 3 real product bugs (grace / reconcile-resurrection /
container-list user-stopped state), all fixed (2dad64b2, 760a32bc, 6e49ce6f) +
deployed. electrumx lifecycle suite 10/10 green (66s). fedimint 'crash loop' was
probe-induced churn (stable when left alone). Validating breadth next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A user-stopped backend (electrumx, bitcoin, lnd, fedimint) kept reading 'running'
in container-list because its UI companion (electrs-ui, …) still serves the launch
port, and the state-refresh upgrades any reachable launch port to 'running'. The
gate's wait_for_container_status <app> stopped therefore never saw 'stopped'.
Fix: load the user_stopped marker in handle_container_list and force 'stopped' for
those apps before the launch-port refresh. The reconcile guard keeps the backend
down, so the marker is authoritative. package.start clears it first, so a started
app reports 'running' normally.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
package.stop a dependency (e.g. electrumx, a mempool dep) and the reconciler
restarts it within ~8s: the reconcile filter's dependency_required override
re-includes a user-stopped app that an active app depends on, and the in-memory
disabled set is wiped on manifest reload — so ensure_running runs, the stopped
app's unreachable ports look like a fault, the host-port repair restarts it, and
package.stop never sticks (gate 'transitions to stopped' times out).
Fix: guard ensure_running_with_mode on the on-disk user_stopped marker (the single
choke point every reconcile flows through) → Left('user-stopped'). Explicit
install/start clear the marker first (added clear_user_stopped to orchestrator
install/start, symmetric with disabled.remove; start/restart RPC already cleared
it) so user actions are unaffected. The container itself already stopped correctly
— this stops the resurrection.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fix deployed to .198+.228, vaultwarden stops clean (no regression). But validation
showed the gate failures are multi-caused: (2) fedimint crash-looping/unhealthy on
both nodes can't be stopped; (3) host-listener repair watchdog restarts
port-unreachable containers fighting stop; (4) gate waits for 'stopped' but apps end
'exited'/'absent' (Exited->Stopped conversion key mismatch); (5) grace vs 60s
gate-timeout (electrumx 300s); (6) .228 contamination. Documented + re-sequenced
NEXT STEPS (fedimint health is the new top blocker).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reproduced live on CLEAN .198: package.stop fedimint -> 'podman stop -t 30
timed out after 30s' -> stop fails -> state reverts to running. Real fleet-wide
bug (NOT .228 contamination). stop_timeout_secs() per-app grace (bitcoin 600/lnd
330/electrumx 300/fedimint 60) is used by legacy stop paths but NOT the
orchestrator path: ContainerRuntime::stop_container hardcodes API ?t=10 / CLI
-t 30, and PODMAN_CLI_DEFAULT_TIMEOUT=30s == the -t grace so the await fires as
podman SIGKILLs. Fix = thread per-app grace + widen wrapper deadline; owner picks
table-based vs manifest-driven stop_grace_secs. Re-escalated to blocker.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
.198 ground truth: backend apps ARE quadlet (.container files present) -> quadlet
is the intended runtime. .228's plain-podman state traced to my cascade-gate
uninstall + package.start restore (no quadlet regen). Two real robustness sub-bugs
remain (start should regen quadlet; stop podman-fallback gap). Next: canonical
gate on CLEAN .198 first to tell real-bug from contamination.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5x gate run surfaced a real blocker: package.stop does not stop electrumx/
bitcoin-knots/btcpay/fedimint/immich (container stays running; gate stop-wait
times out). Root cause chain: these backend apps run as plain podman
--restart=unless-stopped, NOT quadlet units (PODMAN_SYSTEMD_UNIT empty; only UI
companions + home-assistant have .container files; bitcoin-core.container is
.disabled). orchestrator.stop() podman-fallback fires for filebrowser but not
electrumx -> suspect loaded()/is_unknown_app_id_error gap. stop->stopped state
reporting itself is correct (filebrowser proof, user_stopped guard).
Also: corrected the canonical gate invocation (DESTRUCTIVE only, not CASCADE);
restored .228 after my cascade-gate left apps stranded.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On the loaded .198 the frontend churned (created → "unhealthy" → reconciler
recreates → loop). The http health check fetched / through nginx (SPA +
sub_filter) and false-failed under node load; the reconciler then treated the
frontend as wedged and recreated it. nginx binds 7777 at startup, so a tcp
liveness check passes immediately and stays green under load while still
catching a real "nginx not listening" failure. Generous retries/start_period.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Live on .228 the post_install `exec` steps failed with "crun: write
cgroup.procs: Permission denied / OCI permission denied": a `podman exec`
launched from archipelago.service can't place its child in the container's
cgroup (under the service's own slice). Wrap `exec` in
`systemd-run --user --scope --quiet --collect podman exec …` so it gets its own
delegated cgroup — same trick as `podman_user_scope` for pasta starts.
`copy_from_host` (a host-side `cp`, no in-container process) stays direct.
Without this only copy_from_host worked; indeedhub happened to be unaffected
(its image pre-bakes the nginx config so the exec steps were no-ops), but the
hook capability is only generally useful with exec working. hooks unit tests
pass; live verify on .228 next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Live fresh-create on .228 (post special-case removal) had nginx workers die
with "setgid(101) failed (Operation not permitted)" → workers exited code 2,
port published but nothing served (HTTP 000). The orchestrator does
--cap-drop=ALL, so unlike the legacy `podman run` (default caps) nginx's master
couldn't drop workers to the nginx user. Declare CHOWN/DAC_OVERRIDE/SETGID/SETUID
(SET* to drop the worker user, CHOWN+DAC_OVERRIDE for the tmpfs proxy cache).
Verified on .228: frontend fresh-creates, caps applied, nginx serves, UI 200
incl. /api/ and /nostr-provider.js.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fresh-create path was blocked by hardcoded indeedhub orchestrator logic
that predated and conflicted with the manifest migration:
- ensure_running routed app_id=="indeedhub" → reconcile_indeedhub_stack, which
REFUSED to create the frontend from its manifest (returned Left("stack-managed")).
- run_pre_start_hooks("indeedhub") → start_indeedhub_backends →
wait_for_indeedhub_dependencies_ready(120) — a DNS gate with a chicken-and-egg
bug (required the frontend's own alias present before the frontend could be
created), which failed install_fresh with "dependencies were not ready within
120s" and left the frontend down (caught live on .228).
Delete all of it (−382 lines): reconcile_indeedhub_stack, start_indeedhub_backends,
wait_for_indeedhub_dependencies_ready, indeedhub_api_dependency_dns_ready,
indeedhub_required_aliases_present, repair_indeedhub_network_aliases,
indeedhub_alias_present, patch_indeedhub_nostr_provider, and the INDEEDHUB_*
consts. The manifests now carry everything these did: network_aliases (short
hostnames), generated_secrets, dependencies, and the post_install nginx hook. So
"indeedhub" + every member flows through the generic install_fresh/reconcile path
— the frontend fresh-creates normally and runs its hook.
(crash_recovery.rs's frontend-after-deps ordering guard is kept — it's beneficial
startup ordering, not a blocker.) cargo check + release build green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per user direction: the production test gate is 5x (ARCHY_ITERATIONS=5) on
.228 AND .198 for now, down from 20x. Restore to 20x before the final ship.
Updated CLAUDE.md, PRODUCTION-MASTER-PLAN.md, and tests/lifecycle/TESTING.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author the IndeedHub stack as 7 manifests (postgres/redis/minio/relay/api/
ffmpeg + frontend) and route install_indeedhub_stack through the
orchestrator first (immich pattern), falling back to the legacy installer
only when the manifests aren't deployed.
Data-preserving by construction — the manifests reproduce the live install
exactly so an existing node ADOPTS rather than recreates:
- container_name = the live hyphenated names the runtime already references
(health_monitor tiers/deps, crash_recovery).
- named volumes indeedhub-{postgres,redis,minio,relay}-data (not bind mounts).
- dedicated indeedhub-net + network_aliases [postgres|redis|minio|relay|api]
so the api/ffmpeg env hostnames and the frontend nginx upstreams resolve
unchanged.
- generated_secrets (indeedhub-db-password/-minio-password owned by their
backends, indeedhub-jwt by the api) reuse the live /var/lib/archipelago/
secrets values (ensure_one no-ops on existing files; postgres pw is fixed
at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept.
The frontend carries the post_install hook (#20) that replaces the hardcoded
patch_indeedhub_nostr_provider: strip X-Frame-Options, refresh
nostr-provider.js from /opt/archipelago/web-ui, inject the <script> if
absent, reload nginx — defensive/idempotent since indeedhub:1.0.0 already
bakes these. Frontend manifest also corrected off its dead Next.js shape
(health check now nginx :7777, tmpfs /run + /var/cache/nginx).
Builds + unit-tested; live adoption/lifecycle verification on .228 next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `container.network_aliases: Vec<String>` (serde default, DNS-label
validated) so a stack member can answer to short hostnames its peers bake
in, beyond its own container name. Rendered in both runtime paths:
- podman_client: merged (deduped) into the custom-network aliases array.
- quadlet from_manifest: appended after the container name; emitted only
for Bridge networks (slirp/pasta reject aliases).
Needed for the indeedhub migration: its frontend nginx proxies to
`api:4000` / `minio:9000` / `relay:8080`, so those members declare
`network_aliases: [api|minio|relay]` to keep the short names resolvable on
the dedicated indeedhub-net (vs. colliding generic aliases on archy-net).
Also fixes 4 pre-existing from_manifest test failures (unrelated to this
change, surfaced now that the quadlet suite runs green): test manifests
used the long-invalid `network_policy: archy-net` (allowlist is
isolated/bridge/host → moved to network_policy: isolated + container.network)
and bind sources outside /var/lib/archipelago.
Tests: container crate 53 pass; archipelago quadlet+alias 47 pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add container::hooks::run_post_install — runs an app's declarative
post_install hooks against its own running container:
- Exec -> podman exec <container> <args…> (60s timeout-bounded)
- CopyFromHost -> resolve src against allowlist roots (<data_dir>/<app>
and /opt/archipelago), canonicalise + prefix-check (defeats symlink
escape), then podman cp <abs-src> <container>:<dest>
Best-effort + idempotent: a failed step is warned and skipped, never
fails the install — matching the legacy patch_indeedhub_nostr_provider
behaviour this replaces. Wired into install_fresh after the container is
up, so it runs only on a freshly created container (not plain start), and
re-applies on recreate-after-drift.
5 unit tests on resolve_copy_src (accept in-data-dir, reject absolute /
traversal / missing / symlink-escape). cargo test -p archipelago green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add controlled post_install/pre_start hook schema to AppDefinition:
LifecycleHooks/HookStep (Exec | CopyFromHost)/HostCopy with allowlist
validation (relative src, no '..', absolute container dest, non-empty
exec). Re-exported from the crate root. Design: docs/manifest-hooks-design.md.
Also add the missing generated_secrets: vec![] field to three
pre-existing ContainerConfig test literals (the field was added to the
struct in 03a4ee1b but the container crate's own tests were never rerun,
so -p archipelago-container failed to compile). cargo test green: 53 pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
container-list reports stack apps package-level (.name="immich"), so the suite
checks the "immich" package (presence, valid state, :2283 lan-address) rather than
individual container names. Destructive tier fires async stop/start/restart and
asserts on the end state via wait_for_container_status.
KNOWN: the destructive tier is flaky for slow multi-container stacks — bats runs
ops back-to-back with no settling while immich's async stack ops take 30s+, and
stopped reports as "exited" not "stopped". The immich migration itself is verified
working (manual stop/start/restart succeed; all 3 containers healthy). Hardening
the harness for stack apps (inter-op settling + stopped|exited acceptance) is a
follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
package.stop/start/restart broke ("no containers found" / "no such object
immich_postgres") because the runtime hardcodes the immich stack's container names
as immich_server/immich_postgres/immich_redis (underscore) across 8 files
(lifecycle, health, crash-recovery, ports, config). The migration had named the
containers by app_id (hyphen), mismatching all of it.
Root cause of the earlier failed attempt: container_name was nested under an
`extensions:` block, but `app.extensions` is serde(flatten) — container_name must
be a TOP-LEVEL app key to be read by compute_container_name. Fixed: set
container_name: immich_server / immich_postgres / immich_redis at top level, and
point DB_HOSTNAME/REDIS_HOSTNAME at the underscore aliases. App ids stay hyphen
(immich/immich-postgres/immich-redis) so the catalog identity (title+icon) holds.
Manifest-only change — container names now match existing runtime references, no
code edits to the 8 files. (Deriving stack containers from manifests instead of
hardcoded lists remains a north-star follow-up.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
RPC-based (host-agnostic) lifecycle coverage for the manifest-driven immich stack
(immich + immich-postgres + immich-redis): presence + valid state of all 3 members,
a guard that no legacy underscore containers exist (catches botched migration /
legacy-installer fallback), destructive stop/start/restart of the server with
postgres+redis staying up, and cascade uninstall/reinstall (preserve_data).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Orchestrator-installed backends (immich, btcpay-db, …) run as plain podman
`--restart=unless-stopped` containers until the Phase-3 Quadlet rollout flips
use_quadlet_backends on. Nothing in the codebase enabled the user's
podman-restart.service, so those containers had NO reboot-survival mechanism.
Enable it (idempotent, best-effort) at orchestrator startup so unless-stopped
containers come back after a reboot. Already applied manually on .228 (covers
31 containers incl. immich + btcpay); this codifies it fleet-wide.
The deeper fix (render Quadlet for all orchestrator installs) remains the gated
Phase-3 Quadlet-everywhere rollout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After the manifest migration the launcher installed as "immich-server" (app_id),
which has no catalog entry → showed the raw id and no icon. Rename the server
manifest app_id immich-server→immich so it matches the catalog/curated "immich"
entry (title "Immich", icon immich.png) and is recognised as a known launcher app
(APP_CATEGORY_MAP) → stays in My Apps. immich_stack_app_ids now installs
[immich-postgres, immich-redis, immich]; orchestrator.install bypasses package
routing so there's no recursion with the "immich"→stack-installer mapping.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Classify databases/APIs/backends into Services (#10): add immich-postgres/redis
to SERVICE_NAMES; isServiceContainer matches -postgres/-redis/-valkey/-cache/-db
suffixes; isWebsitePackage final fallback now routes any no-UI, non-known package
to Services ("anything that isn't the frontend UI launcher").
- Services show their parent app's icon (#14): backends reuse the app logo
(immich-* → immich, archy-btcpay-db → btcpay, indeedhub-* → indeedhub, etc.)
via explicit APP_ICON_FALLBACKS + prefix map, instead of 404 → 📦.
- Categories sub-nav for Services (#12): getServiceCategory + buildServiceCategories
+ useServiceCategories; Services tab gets the same desktop/mobile category strips
(Databases/Caches/APIs/Backends), shown only for categories with items. Shared
selectedCategory resets to 'all' on tab switch.
- Mobile swipe (#11): the tab-swipe gesture is suppressed over .mobile-category-strip
so swiping the category chips scrolls them instead of changing tabs (covers both
My Apps and the new Services strip).
vue-tsc build clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the immich migration off the legacy hardcoded install_immich_stack
(podman run + sudo chown) to the registry-manifest + orchestrator path. Validated
live on .228 (clean single set, healthy v2.7.4, data dir ownership correct).
- install_immich_stack now tries install_stack_via_orchestrator(immich_stack_app_ids)
first; legacy remains only as the no-manifests fallback.
- immich-{postgres,redis,server} manifests corrected from live findings:
* named by app_id (dropped container_name override) — using container_name
spawned DUPLICATE containers (app_id-named install vs name-override reconcile)
on the same PGDATA, which corrupted a postgres cluster. Server reaches its
siblings via app_id aliases (DB_HOSTNAME=immich-postgres, REDIS=immich-redis).
* immich-postgres data_uid 100998:100998 (postgres drops to container 999 →
host 100998 under rootless; verified the fresh dir is chowned correctly).
* immich-server version "release"→"2.7.4" (manifest validation requires a digit;
the bad version made the manifest silently skip → partial orchestrator install
→ legacy fallback → the duplicate corruption above).
- HARDEN install_stack_via_orchestrator: only fall back to the legacy installer
when NOTHING was installed yet. An "unknown app_id" AFTER a member is up now
errors instead of double-creating containers on shared data (the corruption
root cause).
- Strict the all-manifests round-trip test: fail (not skip) on any invalid shipped
manifest — this gap let the bad immich-server version through.
Known follow-up (pre-existing, platform-wide): orchestrator-installed backends
(immich, btcpay-db) run as podman --restart, not Quadlet, and podman-restart.service
is disabled on .228 → reboot-survival gap independent of this migration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
immich becomes a manifest-driven stack (the legacy install_immich_stack — hardcoded
podman run + sudo chown — is the anti-pattern being retired). Three image-only
manifests modelled on the btcpay stack + the live .228 container config:
- immich-postgres / immich-redis / immich-server on archy-net; container_name set
to the underscore form (immich_postgres/_redis/_server) so the server's
DB_HOSTNAME/REDIS_HOSTNAME aliases resolve.
- generated_secrets: [immich-db-password] (idempotent — reuses the live secret on
existing nodes; postgres is already initialised with it).
- server depends on postgres+redis (install ordering); upload bind preserved.
Inert for now: not added to the UI catalog and install_immich_stack still the
default, so nothing installs these until the orchestrator wiring + on-node
ownership (data_uid) validation lands. Schema validated by the all-manifests
round-trip test. See docs/PRODUCTION-MASTER-PLAN.md §6.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
generate-app-catalog.sh gains opt-in EMBED_MANIFESTS=1: embeds each
apps/<id>/manifest.yml into its catalog entry's `manifest` field (whole document,
top-level app: preserved — exactly what the Rust side deserializes). Default off
so routine catalog regen is unchanged during the migration window; turn on
deliberately, then sign via the existing release-root ceremony. Verified: default
embeds 0; EMBED_MANIFESTS=1 embeds 40 manifests (generated_secrets preserved).
Adds a round-trip guard test: every shipped apps/*/manifest.yml must deserialize
+ validate through catalog_manifest_to_overlay (image apps accepted, build apps
defer to disk) — catches schema drift between disk manifests and the catalog path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Workstream B phase 1 (node-side consume). The signed app-catalog can now carry a
full manifest per entry; the orchestrator overlays it over the disk manifest
(origin-wins) with disk as the migration fallback. Moves apps toward
registry-distributed manifests with no OTA-shipped disk file.
- app_catalog: `manifest: Option<Value>` on AppCatalogEntry (forward-compatible,
covered by the existing release-root signature over the raw JSON);
`catalog_manifest_values()` accessor.
- prod_orchestrator: `load_manifests` overlays catalog manifests after the disk
walk; `catalog_manifest_to_overlay()` returns None (→ disk fallback) on
unparseable value / app-id mismatch / failed validate() / build source
(build contexts aren't registry-distributed yet — phase 1 is image-only).
- manifest_dir stays PathBuf (build-only field); image-only apps never read it.
- 6 unit tests; compiles clean. No-op until a catalog embeds a manifest, so
existing nodes are unaffected.
See docs/registry-manifest-design.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Single authoritative hub (docs/PRODUCTION-MASTER-PLAN.md) for the app-platform
north star: every app manifest-driven (zero OS-level reliance), manifests via the
signed registry, developer-ready external marketplace; rootless/secure/robust/
100%-uptime. Repo CLAUDE.md (auto-loaded each session) points agents at it until
the 20x lifecycle gate is green. New design doc registry-manifest-design.md.
Consolidated docs 56 -> 28: deleted dated handoffs/resumes/transcripts and
superseded trackers (content folded into the master plan or already in memory).
Kept all evergreen design/reference docs + ADRs (the master links them).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Buyer-side paid downloads now persist: purchases are cached on disk
(content_owned.rs) keyed by (seller onion, content_id), the gallery shows
an "Owned" badge unblurred, and items view/play in-app from the local
cache with no re-payment or reliance on a browser download (which
silently failed on the mobile companion). New RPCs content.owned-list /
content.owned-get. Validated e2e .116<-.198 (paid 100 sats via Fedimint,
166KB jpeg returns, survives restart).
fedimint-clientd manifest: restore the standard container capability set
(CHOWN/DAC_OVERRIDE/FOWNER/SETUID/SETGID) so fmcd's startup chown of an
existing-federation /data succeeds instead of dying EPERM (#7). Confirmed
the orchestrator applies these to the running container.
FIPS perf: tighten the supervisor warm-path keepalive 45s -> 25s so peer
paths stay inside the ~30-60s NAT cold window. Dials now reliably land on
FIPS instead of re-punching and falling back to Tor. Measured to the same
peer: cloud browse 18-22s -> 0.4s; full Fedimint paid download 29s -> 11s
(residual is the seller-side guardian reissue round-trip).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fmcd crash-looped "Operation not permitted (os error 1)" on .116 (kernel
6.12.74): the default rootless seccomp profile blocks a syscall its Mainline-DHT
/ iroh transport needs, so the REST API never came up (:8178 → HTTP 000) and
federations couldn't be joined. Verified: with seccomp=unconfined fmcd boots and
answers /v2/* (HTTP 401 instead of dead). fmcd works on other nodes, so this is
kernel/seccomp-specific — but the relaxation is safe for an outbound-networking
daemon and harmless where not needed.
- new `security.seccomp_unconfined` manifest flag (SecurityPolicy);
- libpod backend sets `seccomp_profile_path: "unconfined"` (== --security-opt
seccomp=unconfined); quadlet backend emits `SeccompProfile=unconfined`;
- enabled in apps/fedimint-clientd/manifest.yml.
NOTE: manifests live on-disk at /opt/archipelago/apps/<id>/manifest.yml, so the
node needs the updated manifest deployed + the fmcd container recreated to apply.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- PeerFiles: new confirmation step after "pay from ecash" — shows the amount and
which wallet will be spent (Cashu/Fedimint) with balances, lets the user switch
backends, and a styled Confirm button. The chosen backend is passed to the
payment so it spends exactly what was confirmed.
- content.download-peer-paid: accept `method` (cashu|fedimint) to honor the
confirmed choice; log the backend + outcome; backend-specific rejection errors
("not in the same Fedimint federation" / "doesn't accept your Cashu mint").
- AUTO-REFUND: a minted token whose sale fails (peer unreachable, rejected, or
error) is now reclaimed (fedimint reissue / cashu receive) so the buyer no
longer loses the spent ecash — fixes the stuck-Fedimint-notes report.
- wallet.ecash-balance already reports cashu_sats/fedimint_sats/total_sats which
the confirm screen uses to pick/show the covering wallet.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
chrome://inspect isn't always reachable on the Android companion WebView, so the
real error stayed invisible. Add a plain-DOM, screenshot-able overlay (built
without Vue so it survives a crash in Vue itself) that shows the captured error
message + stack and a Copy button for the full window.__archyErrors buffer.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Right after install the dashboard SPA opens and, if it loads before NetBird's
embedded OIDC provider is serving, caches a bad auth state — the user appears
logged-in but can't log out until it self-corrects. Container "running" != OIDC
ready, so gate the install's Done phase on the management server's
/oauth2/.well-known/openid-configuration answering (best-effort, 60s cap, never
fails the install since the stack is already up).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On Cloud/files (and any scrolling view), the bottom of the list could sit behind
the fixed mobile tab bar. Cause: DashboardMobileNav measured the bar's
offsetHeight and wrote it to --mobile-tab-bar-height, but when the bar was hidden
or not yet laid out the measurement was 0 — and writing "0px" defeats the
", 88px" fallback in the .mobile-scroll-pad clearance calc (an explicit 0 is
still a set value), so the clearance collapsed and the ~88px bar overlapped the
last row.
- never write 0px: only set a real measured height, else remove the var so the
88px fallback applies.
- re-measure after first paint (rAF) and after the WebView safe-area injection,
so the clearance reflects the bar's final laid-out height.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Paying for a peer file minted a Cashu-only token, so a node whose ecash balance
lived in Fedimint couldn't pay even with funds. Now both backends are tried:
- payer (content.download-peer-paid): mint a Cashu token first; on failure fall
back to spending Fedimint notes. Only error if BOTH backends can't cover it.
- seller (verify_and_receive_payment): accept Fedimint notes as well as Cashu —
anything not starting with "cashu" is redeemed via reissue_into_any.
- new fedimint_client::spend_from_any() — spend from whichever joined federation
has the balance, returning the notes + federation id (mirrors reissue_into_any).
- wallet.ecash-balance now also reports fedimint_sats + combined total_sats; the
pay-for-file pre-check uses the combined total so a Fedimint-funded node isn't
wrongly blocked.
Compiles (cargo check + vue-tsc). Live cross-node federation validation pending
(dual-ecash phase 6) — needs two nodes sharing a federation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The global Vue errorHandler swallowed every crash into "Something went wrong.
Please refresh the page." — which hides exactly what we need to diagnose the
companion-app (Android WebView) post-login crash. Now:
- the toast shows the real (truncated) error message;
- a 25-entry ring buffer is kept on window.__archyErrors for retrieval where
there's no console (companion WebView via chrome://inspect, or a debug view);
- window 'error' and 'unhandledrejection' listeners catch async/non-Vue errors
that Vue's errorHandler misses (e.g. a JS API absent in an older WebView).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A node reachable both over LoRa and federation has two MeshPeer rows (radio
twin: low contact_id + firmware key; federation twin: high contact_id +
archipelago key), and messages key by peer_contact_id split across the two ids
— so opening one twin shows an empty thread (the .120->.89 symptom).
- backend: new group_peer_twins() helper groups peers by arch_pubkey_hex (set on
BOTH twins by bind_federation_twins), keeps the radio id as the mesh-first
send target, and unions messages across all twin ids. Wired into
conversations.list / conversations.messages / mesh.contacts-list. +3 unit tests.
- frontend: the live chat list merges client-side (mergedPeers) and matched twins
by the "Archy-z6Mk..." advert prefix, which the Meshtastic device rename broke
(radio now advertises the server name). Merge by arch_pubkey_hex instead, which
the backend reliably sets on both twins. Expose arch_pubkey_hex on MeshPeer.
- fix unrelated stale test: EcashTransaction test missing the new `kind` field.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resume notes for the 1.8.0 bug-bash mesh work: Meshtastic rename shipped +
verified; .120->.89 'non-delivery' diagnosed to a duplicate-contact surfacing
bug (messages inject fine, split across federation/radio twin contact_ids);
design for the dedup fix (#12) and the netbird logout-race map (#10).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Meshtastic device rename was a no-op — set_advert_name only updated an
in-memory field and never told the radio, so the device kept its firmware
default ('Meshtastic xxxx') and wasn't findable from external Meshtastic
apps. MeshCore already renamed correctly (CMD_SET_ADVERT_NAME); this brings
Meshtastic to parity.
Send an AdminMessage{set_owner=User{long_name,short_name}} to the locally
connected node (admin packet to our own node_num on the ADMIN_APP port).
Local serial admin needs no session passkey, matching the official client.
long_name = server name (<=39 chars); short_name = first 4 alphanumerics,
upper-cased. Verified on real hardware: .120 -> 'Archy-X250-EXP', .5 ->
'Archy-X250-Beta' (name read back from the radio after reconnect).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Calibrated from a device home-screen screenshot: launcher3 crops less than the
App-info view, so the ring at 0.53 sat ~78% out. Scale 0.65 reaches the edge.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ring uses logo.svg's #000->#666 gradient (stroke 22.8834) pushed to scale 0.53
so it sits at the launcher's visible crop edge (calibrated from a device
screenshot). Grid at 0.55. versionCode 9 so launcher3 refreshes its icon cache.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Device App-info screenshot showed the launcher only renders the central ~54%
of the adaptive icon, clipping the ring. Calibrated the ring to scale 0.50 so it
lands at the visible circle edge; grid to 0.55. Bump versionCode 8 so launcher3
refreshes its icon cache (it keys the cached bitmap by versionCode).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Joining a Fedimint federation is heavy and routinely outlasts the default 15s
client timeout while still succeeding server-side, so the UI wrongly showed
failure. Bump the join timeout to 90s, and on any error re-check the list: if a
new federation appeared the join worked — show 'Federation joined.' instead of
a misleading error.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A stock meshcore client (e.g. a phone) can't sign our typed envelopes, so it is
never 'authenticated' — which meant ticking it as an allowed assistant contact
had no effect and !ai stayed denied. The explicit per-contact allowlist is a
deliberate operator opt-in for a specific key, so match it regardless of
authentication, keyed on the asker's resolved identity (bound archipelago key,
else firmware routing key — how meshcore addresses the contact). The spoofable
federation-trust-list match still requires authentication.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- reissue_into_any now tries the UNION of the local registry AND fmcd's live
joined set (/v2/admin/info) before failing, so a valid Fedimint token isn't
wrongly rejected when the registry has drifted. On all-fail it returns a
friendly message: notes already redeemed into this wallet (funds safe) vs
didn't match any connected federation.
- Unified transaction history: a local Fedimint tx log (recorded on each
successful redeem) is merged with the Cashu history in wallet.ecash-history,
newest-first, each tagged kind=cashu|fedimint. Previously a Fedimint receive
appeared nowhere.
- fedimint-clientd healthcheck -> type:tcp. It was probing /health, which fmcd
doesn't serve (only /v2/*), pinning the container in (starting) forever; the
TCP probe is skipped by the Quadlet renderer (host-side lifecycle verifies),
so it reports running. Cosmetic for ecash, which worked throughout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Put dark fill + inset metallic ring (0.88) + grid (0.58) all in the background
(renders to the mask edge, no safe-zone crop); transparent foreground. Matches
a locally-rendered, circle-masked preview so the ring is visible and uncut.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the metallic ring into the background (renders to the mask edge, unlike the
foreground which is cropped to the safe zone) so the border is finally visible
at the circle's rim; shrink the grid to ~0.55 so the mark isn't too big.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Apps tab: a horizontal swipe that starts on an app icon no longer flips the
top tab — it lets the app-page scroll / icon tap win (swipe empty space to
change tab). Fixes the swipe conflict with two pages of apps.
- Files: file cover tiles are forced square on mobile (aspect driven by CSS,
not a Tailwind arbitrary class) so the grid is uniform and tappable.
- Files: scroll container gets bottom safe-area + tab-bar padding so the last
row clears the mobile back button / bottom nav.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A !ai (or any typed message) from a trusted, federated node was denied when
it arrived over the radio. The radio half of a node that is also a federation
peer carried no archipelago identity (identity adverts are no longer broadcast
on the public channel), so the trusted_only gate and signature verification
had no key to check the asker against — and the same node showed up as two
contacts (a radio twin + a federation twin).
- bind_federation_twins(): correlate a radio contact with its federation twin
by exact, case-insensitive advert_name and copy the federation peer's
arch_pubkey_hex/did/x25519 onto the radio record. Called from
upsert_federation_peer and refresh_contacts. Ambiguous names (held by >1
federation peer) are skipped. This is only a CANDIDATE key — security is
unchanged: the inbound envelope signature must still verify against it.
- send_message now signs the typed Text envelope (new_signed) so a radio !ai
authenticates against the bound key. A meshcore node merely named like a
trusted node cannot forge the signature, so it is still denied.
Receiver-side verification (handle_typed_envelope_direct) and federation-trust
matching (is_sender_allowed) already existed; this supplies the missing key
binding and signature. Also resolves the radio/federation duplicate-contact
display for same-named nodes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Revert to a pure adaptive icon (the bare round PNG was getting legacy-wrapped
onto a white circle by the launcher). One ring only, in the foreground, using
the SVG's dark #000->#666 gradient on a plain dark tile.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Revert the brightened grey->white ring back to the original logo.svg gradient
(black->#666, stroke 22.8834) on both the round PNG icon and the adaptive
foreground.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Render the full circular badge (bright grey->white ring + grid) to round-icon
PNGs at all densities and drop the adaptive round XML, so launchers that use
round icons show a real edge-to-edge circle instead of a mask-cropped coin.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Scale the whole badge to ~0.64 so the bold grey->white ring isn't clipped at
the edge by the launcher mask; bigger, brighter ring. Background is plain dark.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ring at 0.96 sat in the adaptive-icon bleed zone (outer ~18dp cropped by the
launcher), so only the grid showed. Scale badge + grid to 0.68 so the ring lands
at the edge of the visible circle, and brighten it to grey->white.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the badge ring into the background layer (brightened grey->white so it
reads on #0A0A0A) at ~0.96 so it sits at the masked-circle edge; foreground is
just the white grid. Also honor SHIP_COMPANION in the pre-push hook.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scripts/publish-companion-apk.sh builds the debug APK and refreshes the served
download neode-ui/public/packages/archipelago-companion.apk.zip; .githooks/pre-push
runs it on every push to main that touches Android. Enable per clone with
git config core.hooksPath .githooks
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adaptive icon foreground now draws the full badge (black→grey gradient ring +
white grid) scaled to ~0.94 so the ring reads as a clean border at the circle
edge. Adds ship-companion.sh: builds the debug APK and publishes it to
neode-ui/public/packages/archipelago-companion.apk.zip, then commits + pushes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The controller body/face were opaque, so the synthwave backdrop only peeked
out above/below the controller. Make the DARK palette surfaces translucent
(body/face/inlay) and drop the opaque shadow platform + the gradient's forced
0.95 alpha, so the backdrop reads through the controller as glass. CLASSIC
palette stays solid.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tapping a dashboard app icon now scales it down immediately (CSS :active)
and shows a per-icon spinner until the app overlay opens, so the tap is
acknowledged even while the app session spins up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- New circular badge logo (ic_logo) on Intro + Connect screens; launcher
icon rebuilt as dark circle + white grid.
- Reddish synthwave backdrop (bg-intro-2) behind Intro, Connect, and the
remote/gamepad (edge-to-edge with a light scrim); controllers no longer
paint an opaque fill over it.
- Server name: added to ServerEntry/prefs, the Connect form, the modal
add-form, and saved-server rows; removal now matches by connection
identity (rename- and legacy-format-safe).
- NESMenu modal restyled to glassmorphism #0A0A0A with centered, larger
fields. Connect-form glass cards given a darker base for legibility.
- Intro title/subtitle set to #FAFAFA.
- Deleting the last server clears the active server and returns to Connect.
- D-pad auto-repeat initial delay raised to 500ms so a tap sends one key
(fixes doubled nav sound).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Messages to a federated peer that is out of LoRa range (e.g. on another
continent) were dropped into the radio with no fallback, or hung on a dead
FIPS path before reaching Tor — so they never arrived.
- Route a radio contact over the federation transport (FIPS->Tor) when it is
the same node as a federated peer (known archipelago identity -> onion) AND
it is not currently reachable over the radio. Reachable radio peers stay on
the mesh (preferred); oversized/file envelopes still always take federation.
- Resolve the onion via the archipelago identity key (arch_pubkey_hex), not
the firmware routing key, so a radio contact maps to its nodes.json onion.
- Add .fips_timeout(8s) to the federation message POST so an unreachable FIPS
overlay fast-fails to Tor (~3-5s) instead of burning the 120s budget.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Companion app QR encoded a relative path (/packages/...apk.zip) which
can't resolve when scanned by a phone. Point it at the absolute 146
release-server URL so the download works from any device.
- Dashboard tab-swipe: guard tabs[next] (noUncheckedIndexedAccess) so the
frontend type-checks/builds.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Federated nodes failed to converge to full-mesh across the LAN<->Tailscale
boundary: nodes were invisible to peers, sync 'took ages'/timed out, and
names only updated on a manual sync. Onions were healthy in both directions
(~3-5s); the failures were app-layer.
- B: federation dials fast-fail a dead FIPS path via .fips_timeout(6s) in
sync_with_peer + notify_join, so the Tor fallback isn't stuck behind the
full 30s FIPS budget when LAN and remote peers share no FIPS path.
- A: notify_join (peer-joined) now spawns with retries+backoff instead of a
single awaited best-effort POST, so the join RPC returns instantly (no
'Request timeout') and the inviter reliably learns the joiner (was
asymmetric).
- C: new 90s periodic federation auto-sync (none existed) so renamed nodes
and roster changes propagate without a manual Sync click.
- self-heal: each auto-sync re-asserts membership to any peer that doesn't
list us back, converging the fleet to full-mesh and healing pre-existing
asymmetry with no manual re-joins.
Validated live across 7 nodes: a previously fleet-invisible node became
fully meshed automatically (logs: 'auto-sync ... reasserted=1',
'peer-joined ... delivered').
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
UI (this session):
- Global audio player now scales the whole interface into the space above it
on desktop (sidebar + main) and docks directly above the tab bar on mobile;
it stays visible while navigating.
- Mesh mobile redesign: floating Chat / BTC / Dead Man / AI / Map tab strip
with a single fixed, internally-scrolling pane (page no longer scrolls);
tabs hide while a conversation is open; floating back button; collapsible
Device panel (starts collapsed); keyboard-aware conversation sizing via
VisualViewport so the chat sits just above the keyboard.
- Cloud file grid: uniform 4/3 card heights (folders + images match).
- Swipe left/right switches tabs on the Apps and Web5 screens.
- Map tool fills its pane (no bottom gap); fix skewed Share Location toggle
on mobile (global min-height rule was deforming the switch).
- Trim redundant helper copy from the mesh AI tab.
Also bundles pre-existing in-progress work that was already in the tree:
mesh listener/session + wallet + container + bitcoin-status backend changes,
docker UI updates, and assorted other UI tweaks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fedimint never appeared in Wallet > Settings > Fedimint because the
fmcd (fedimint-clientd) sidecar was never installed: ensure_default_
federation() needs the fmcd password to reach the daemon, found none,
and silently no-oped, leaving the registry empty.
- prod_orchestrator: add fedimint-clientd to the baseline auto-install
set so it self-heals onto every node and auto-joins the default
federation; generate the fmcd-password secret before secret_env
resolves.
- fedimint_client: ensure_fmcd_password (random hex, 0600) shared with
the container's secret_env; from_node reads the same secret (legacy
fmcd/password kept as fallback); reissue_into_any redeems received
notes into the first joined federation that accepts them.
- wallet.ecash-receive: dual-token — cashu* tokens redeem at the mint,
anything else is reissued via fmcd; returns the kind + federation_id.
- UI: receive box advertises "Cashu or Fedimint" and reports which kind.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Meshtastic DMs were falling back to a channel broadcast, so every node
on the LoRa channel saw a "direct" message. Send a directed MeshPacket
(to = node num, decoded from the synthetic pubkey's node-id bytes)
instead — the Meshtastic analog of the meshcore CMD_SEND_TXT_MSG fix.
DMs now reach only the recipient; firmware auto-PKC-encrypts them
end-to-end once NodeInfo keys are exchanged.
Capture E2E status at the driver level (no shared-type/UI change):
- learn each peer's real Curve25519 key from User.public_key (field 8)
and inbound MeshPacket.public_key (16), kept in a side-map separate
from the synthetic routing key so unicast routing is untouched
- detect inbound MeshPacket.pki_encrypted (17) to tell a true E2E DM
from a channel-PSK fallback
- peer_is_pkc_capable() seam for a future mesh-tab E2E badge
Hot-swap preserved: no dispatched MeshRadioDevice signature or the
shared ParsedContact changed, so meshcore and meshtastic stay
interchangeable behind the listener.
Adds tests/multinode/meshtastic.sh, a two/three-radio on-air parity
harness (detect, discover, DM round-trip, DM privacy, channel
broadcast, typed envelope, reachability).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The offline/reconnecting banners were in-flow (mx-6 mt-6) and pushed the whole
dashboard down when shown. Teleport them to <body> as a fixed, top-centered
overlay with a fade/slide transition and safe-area inset, so they no longer
shift layout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
applicationIdSuffix=".debug" + versionNameSuffix so a debug/test build
installs alongside the release app instead of failing on signature mismatch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When ArchipelagoNative is present (the Android companion app), openInNewTab()
now calls openInApp(url) so non-iframeable apps open in the in-app WebView
instead of a suppressed window.open popup. Falls back to window.open in a
plain mobile browser. Logic only; no visual change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The kiosk's "Open in new tab" used window.open(..., 'noopener,noreferrer'),
which the WebView suppresses, so launching apps that can't be iframed did
nothing. Route such node apps (same host) into a local in-app WebView overlay
instead, keeping the kiosk view alive underneath; genuinely external links
still go to the system browser. Wired through onCreateWindow,
shouldOverrideUrlLoading, and a new ArchipelagoNative.openInApp() bridge.
Perf (no visual change): enable setOffscreenPreRaster to stop scroll
checkerboarding, and enable WebView remote debugging on debuggable builds
for chrome://inspect profiling.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-key FileGrid on the current folder path and wrap it in a cloud-zoom
Transition so the depth/zoom animation replays at every folder level; the
header + breadcrumb nav stay fixed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Cashu default mint was the local Fedimint guardian (:8175), wrongly surfacing
a Fedimint URL in the Cashu mints list. Default is now Minibits
(https://mint.minibits.cash/Bitcoin) — Cashu and Fedimint are distinct
protocols (Fedimint lives under its own tab).
- Peer-file (buy) invoice creation: retry the LND REST call (3× / 400ms) so a
transient LND-REST blip (swap pressure / just-restarted / TLS race) no longer
hard-fails as an opaque 503, and surface the real error chain ({:#}) in the
response + logs instead of a generic "Failed to create invoice".
- Autojoined default federation now shows a friendly name ("Archipelago
Federation") in the Fedimint tab instead of a bare federation id.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sizes bitcoind -dbcache to host RAM (~1/16, floor 300MB, cap 4096) instead of a
fixed 2048/4096. A multi-GB UTXO cache on an 8GB node running the full app stack
pushed memory past physical RAM and triggered system-wide swap thrash: the disk
saturated, bitcoind could not answer its own RPC, and the dashboard backend's
sqlite reads stalled — surfacing as fleet-wide /rpc/v1 502s and a blank Bitcoin
UI. Applied in scripts/container-specs.sh (reconciler path) and the config.rs
bitcoin-core path.
Bitcoin status cache now polls every 5s (was 10/15) with an 8s timeout (was 20s)
and fetches the four RPCs concurrently, so the cached snapshot tracks bitcoind's
responsive windows during IBD and the UI stops dwelling on "reconnecting...".
Unifies the divergent discover AppGrid/FeaturedApps image-error handlers onto the
canonical placeholder fallback so missing app icons render the placeholder.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- DMs now use native meshcore unicast (CMD_SEND_TXT_MSG) instead of @DM2 channel
broadcasts: private (E2E-encrypted to the recipient pubkey by firmware), off the
public channel, and decodable by stock clients. Plain text (split, not MC-chunked)
to non-archipelago contacts; typed envelopes to archy peers.
- !ai replies now DM the asker privately (RadioDm) instead of broadcasting on ch0.
- Auto contact-import: a heard advert (PUSH_CONTACT_ADVERT/0x80, 32-byte pubkey) is
added via CMD_ADD_UPDATE_CONTACT (0x09) so contacts appear without a flood advert.
- clear-all now DELETES firmware contacts via CMD_REMOVE_CONTACT (0x0F) instead of
blocklisting; blocking filter removed entirely. Wiped contacts return when reachable.
- Contact reachability: MeshPeer carries last_advert + reachable (path-based); UI shows
a reachability dot.
- Peers list: contact search box (filter by name/DID/npub/pubkey) with a clear button.
- send_message routes stock contacts as plain native text (fixes garbled envelopes).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- mesh: stop broadcasting ARCHY:2 identity on the public channel (startup + every advert tick); receive path still parses inbound. No more public-channel spam.
- mesh assistant: trigger on !ai/!ask typed in 1:1 chat (was only the dead AssistQuery path + bare channel text); route the reply transport-aware via MeshService::send_message (Tor for federation peers, LoRa for radio) through a new AssistChatReply event consumed at the server layer — fixes replies never reaching federation askers.
- mesh assistant: per-contact !ai allowlist (allowed_contacts) bypassing trusted_only; config + RPC + is_sender_allowed.
- fedimint-clientd manifest: network_policy open -> bridge (invalid value made the loader skip the whole manifest, so fmcd never ran and federations never joined/listed).
- ui: AI panel — Claude model dropdown (Haiku/Sonnet/Opus presets) + allowlist contact picker.
- ui: Settings — App Updates + App Registry moved under Account.
- ui: mesh chat — overscroll-behavior: contain so chat scroll no longer bleeds to the contacts panel.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Component tests mounted without main.ts's bootstrap, so the $ver global
template helper (app.config.globalProperties.$ver = displayVersion) was
undefined — AppSidebar/AppHeroSection/MarketplaceAppCard tests failed with
"_ctx.$ver is not a function", blocking the release gate's ui-unit-tests
stage. Add a vitest setup file that mirrors main.ts via config.global.mocks
and wire it into vitest.config.ts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the assistant scheduler, MeshAssistantPanel UI, and the remaining
config-RPC / live-toggle / Ollama-detect wiring on top of Phase 1.x.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 2 backend. AssistantConfig is now live-updatable (RwLock) so the UI
toggle applies without a listener restart. New RPCs:
- mesh.assistant-status -> {enabled, model, trusted_only, default_model,
ollama_detected, models[]} (probes local Ollama :11434/api/tags)
- mesh.assistant-configure -> set enabled/model/trusted_only live + persist
MeshService::assistant_config / configure_assistant. Compiles clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ARCHY:2 identity broadcast (DID + ed25519 + x25519) was unwired dead
code on both send and receive. Wiring it lets a radio peer prove its
archipelago identity, so the assistant's trusted-only gate (and encrypted
DMs) work over meshcore AND Meshtastic — the latter otherwise only exposes
synthetic node keys.
- session.rs: broadcast ARCHY:2 as channel text at startup + each advert tick
- frames.rs: parse inbound ARCHY:2 on the channel path, dedupe-keyed by
archipelago pubkey (federation_peer_contact_id) so it MERGES with the
federation-seeded peer instead of duplicating; self-echo guarded
- threads our_x25519_secret into handle_channel_payload (was reserved)
Reuses the existing handle_identity_received verifier (ed/x25519 consistency
check + shared-secret derivation). Compiles clean. Needs a live 2-radio test
before trusting trusted-only over radio.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A plain '!ai <q>' / '!ask <q>' on the channel is now answered by the node's
local model and broadcast back as plain text, so ANY client (bare meshcore
or Meshtastic) can ask. Generalised run_assist with an AssistReply target:
Typed chunks to a peer (archipelago UI path) vs plain channel-text (bare
clients). Trust/rate gate unchanged; asker identity is separate from reply
mode. Works over both radios.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Headless containers (databases, APIs, backends without a UI) belong in a
tab labelled 'Services', not 'Websites'. The categorisation logic already
routes UI-less packages there (built under #45); this finishes the rename
of the user-facing label across Apps, Marketplace, Discover and the mobile
nav, and makes 'services' the canonical tab state/query param. Old
?tab=websites bookmarks still resolve (back-compat acceptor kept).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Destructure the first 4 pubkey bytes into typed locals so vue-tsc's
noUncheckedIndexedAccess doesn't fail the build (the bytes.length<4 guard
doesn't narrow per-element access). No behaviour change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The IndeeHub API needs MinIO (object storage) up to serve, but the
health monitor's dependency map listed only postgres + redis, so it
would restart the API while MinIO was still starting — the "recovers
only after 1-2 container restarts" symptom. Add indeedhub-minio to the
API's deps; MinIO has no deps of its own so the monitor restarts it
first, no deadlock. (First-start ordering in the stack definition is a
deeper, separate follow-up.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
federation.remove-node only edited nodes.json, so a removed/renamed node
(e.g. a stale "Arch HP") lingered in the mesh chat list with its old
thread. Capture the node's pubkey before removal, then purge its
synthetic mesh peer, shared secret, messages, presence, and persisted
contact entry via the new mesh::purge_federation_peer. Combined with the
#42 name refresh, stale federation contacts can now be fully cleaned from
a node.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per the rule that only front-end apps with a UI belong in "My Apps"
(databases/backends/headless go to Websites), make the manifest's
interfaces.main.ui the deciding signal. isWebsitePackage now treats any
package that declares a UI as an app even when it isn't in the curated
APP_CATEGORY_MAP, and falls through headless LAN-reachable packages to
Websites. Additive — service-by-name infra and curated known apps are
unchanged, so no currently-correct app moves.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Settings shows the node-level Nostr key (HKDF derive_node_nostr_key,
read via node.nostr-pubkey) while Web5 > Identities showed the identity
record's own key — the mirrored "Node" identity stores nostr=None and
seed identities use a different BIP-32 NIP-06 key, so the two surfaces
disagreed.
Resolve the node-level Nostr key once in identity.list and override it
onto whichever identity record is the node's own (ed25519 == server_info
.pubkey). Display-only — no stored key is rewritten, so it self-applies
to existing nodes with no migration and the discovery identity is
unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A peer accepted via invite is seeded into the mesh peer table with
name=None, so it shows as "Archipelago <pubkey8>" in chat. Federation
sync later learns the real name (update_node_state writes it to
nodes.json) and discovers transitive peers (merge_transitive_peers),
but nothing pushed those into the live mesh peer table — the chat list
stayed stale until the next mesh restart, and transitive peers never
appeared as contacts at all.
Add RpcHandler::refresh_federation_mesh_peers() (re-runs the idempotent,
onion-deduped seed_federation_peers_into_mesh) and call it after every
periodic sync cycle (server.rs) and after the manual federation.sync-all
RPC. Names now correct themselves and the full roster meshes within a
sync cycle, no restart needed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
phase4-streaming-ecash-plan.md: design for ecash-paid swarm transport, paying
across different mints (§2a, Lightning-bridged swaps), networking-through-nodes
relay, and an IndeeHub "Archipelago" content source. Records the resolved
iroh-blobs paid-serving spike. dht-RESUME.md: task #12 + step F marked done.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a Settings control to the Networking Profits card that opens a new page
where the operator controls what their node charges sats for and how much.
Drives the existing streaming.list-services / streaming.configure-service RPCs;
"free everything" is the default (all priced services ship disabled, surfaced
with a reassurance banner). New route web5/networking-profits + common.settings
i18n (en/es).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 3 wiring (task #12):
- NostrSeedDiscovery: async ProviderDiscovery that queries relays for signed
seed adverts and parses endpoint ids (swarm/iroh_provider.rs, seed_advert.rs).
- seed_and_advertise publish path; dep-free fetch/publish helpers reuse the
node's Nostr identity (build_nostr_client/load_or_create_nostr_keys made
pub(crate)).
- swarm::init builds the IrohProvider once into a OnceLock runtime; providers()
returns it; announce_held_blob() is called from update.rs after a release
component passes both hash gates.
- config swarm_enabled (ARCHIPELAGO_SWARM_ENABLED, default off); server.rs init.
Paid swarm serving (Phase 4 step F):
- swarm/paid.rs gates the iroh-blobs provider through streaming::gate,
intercepting connect + GET (peer push hard-disabled). Free by default
(content-download service disabled); denies unpaid peers when enabled;
fails open on internal error so a payment fault never blocks distribution.
Wired into IrohProvider::new.
All iroh code behind the iroh-swarm feature; the default build is inert.
Default build clean; --features iroh-swarm: 11/11 swarm tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bitcoin-core was missing from APP_CATEGORY_MAP, so isKnownApp() was false and
isWebsitePackage() fell through to 'has a runtime LAN address'. Once the running
container's LAN address (the bitcoind RPC port :8332) showed up ~a minute after
launch, Bitcoin Core was reclassified as a website: it dropped out of the Apps
tab and search, moved under Websites, and launching it opened :8332 (raw RPC)
instead of the :8334 custom UI that Knots opens.
Add 'bitcoin-core': 'money' alongside bitcoin-knots/bitcoin-ui so isKnownApp is
true, isWebsitePackage is false, and launchAppNow routes through openSession ->
resolveAppUrl (:8334 custom UI). Fixes search, category, and the launch URL.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Messaging a federation-only peer (e.g. 'Arch Dev') failed with 'Missing
contact_id'. The UI gave federation-only rows a *negative* placeholder
contact_id derived from a DID hash, but the backend parses contact_id as u64,
so a negative value deserialized to None. The negative id also never matched
the positive federation-synthetic id that federation-routed messages are stored
under, so those threads looked empty.
- Frontend: derive the SAME positive federation-synthetic id the backend uses
(federationContactId mirrors federation_peer_contact_id) so mesh.send accepts
it and messages thread correctly.
- Backend: send_typed_wire now resolves a federation-synthetic contact_id from
nodes.json when it isn't in the live mesh peer table (radio-less node),
instead of bailing 'Unknown federation peer'.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Large peer downloads (~178MB) failed with a generic 'Operation failed', and
the download path had three stacked problems:
- The FIPS reqwest client used a hard-coded 20s total timeout regardless of the
caller's .timeout(), so a big transfer over the mesh aborted at 20s before
the Tor fallback could help. Honor the per-request timeout (client_with_timeout).
- The peer-content proxy buffered the whole file into node memory via
resp.bytes() before sending a byte, and capped the transfer at 60s. Stream
the body through with hyper::Body::wrap_stream (constant memory) and raise the
timeout to 900s; bump the nginx peer-content read timeout to match.
- Free downloads pulled the file as base64 over RPC, doubling it in node memory
and the browser — fatal for large files. Download free files by streaming
from /api/peer-content straight to disk, after a 1-byte Range probe that
surfaces the real reason (peer offline on mesh and Tor) instead of a generic
failure. Paid downloads now return the real error through the {error} channel
the UI already displays.
Adds the reqwest 'stream' feature for bytes_stream().
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The prod orchestrator only checked whether a build-image tag was *present*
before deciding to skip the build. The local UI images (bitcoin-ui, lnd-ui,
electrs-ui) COPY a built neode-ui dist, so a UI update changed the source but
left the old tag in place and the new UI never shipped.
Gate the build on a content fingerprint of the build context (sorted relative
path + length + mtime, SHA-256) recorded in a per-tag stamp under data_dir.
Rebuild whenever the fingerprint differs from the one that produced the
existing image; podman's own COPY-layer cache keeps a no-op rebuild cheap.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Single source of truth for picking the DHT work back up after a restart:
worktree/branch rules, all phase commits, the exact next task (#12 Phase 3
glue), build-time facts, and the Phase 0 go-live ceremony.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apps could fail install when a stack member exited on its first start
because a dependency (db/redis/the bitcoin node) was not ready yet — a
transient crash, not a broken install. wait_for_stack_containers now
restarts each exited/dead container up to 3 times before declaring the
install failed; the runtime supervisor keeps it alive afterwards.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The discovery wire format that feeds the swarm's ProviderDiscovery seam: a
node announces 'I seed blake3 H from iroh endpoint E' as a signed NIP-33
addressable Nostr event. Scope is releases/catalog content ONLY (decided
2026-06-16) — never private user blobs.
- swarm/seed_advert.rs: kind 30081, d-tag = blake3 hex (one current advert
per author+hash, latest-replaces), content {"v":1,"endpoint_id":...}.
advertisement_builder / advertisement_filter / parse_endpoint_id /
endpoint_ids_from_events (dedup). Endpoint ids stay opaque strings so the
protocol is dep-light + unit-testable on the default build.
4/4 tests pass (sign->parse roundtrip, filter targeting, reject wrong-kind/
empty, dedup across nodes).
Next (task #12): gated NostrSeedDiscovery glue (query relays, parse ids ->
iroh::EndpointId), publish path, wire swarm::providers().
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pulls iroh 1.0 + iroh-blobs 0.103 as OPTIONAL deps under the iroh-swarm
feature and implements a real BlobProvider over them. Verified: the full
iroh QUIC dep tree (260 pkgs) resolves and compiles against the pinned
bitcoin/nostr-sdk/reqwest-rustls stack; the provider compiles against the
0.103/1.0 API.
- swarm/iroh_provider.rs: IrohProvider::new binds a QUIC Endpoint, opens a
persistent FsStore (data_dir/iroh-blobs), and serves blobs via the
iroh-blobs protocol/Router — a node that fetches also SEEDS. try_fetch
maps ContentDigest -> iroh Hash, asks discovery for seed EndpointIds, then
downloader.download(hash, providers) (range-verified) + export to staging.
- ProviderDiscovery trait: the seam Phase 3 (signed Nostr advertisement
events) fills. discovery=None -> no seeds -> origin-only, so enabling the
feature is never worse than today.
- Default build untouched: iroh is optional, the module is cfg-gated, and
providers() stays empty until Phase 3 wires discovery in.
Build: cargo build --features iroh-swarm succeeds (dev). Default build +
44 swarm/update/content_hash/blobs tests unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The backend already sends did in federation peer lists, but the Peer
type omitted it and federationNodeToPeer() dropped it when mapping. Add
did?: string to Peer and pass node.did through, so trusted/observer
node rows route to Federation/Mesh by their real DID (falling back to
pubkey/onion) instead of failing the build on a missing property.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lands the transport/swarm orchestration layer (the iroh engine attaches
later, behind a flag). The seam is fully exercised today with the origin
HTTP path; with no swarm providers registered the behaviour is byte-for-byte
identical to before.
- swarm/mod.rs: BlobProvider trait + fetch_content_addressed() — tries each
provider in order, VERIFIES peer-sourced bytes against the content digest
before accepting (untrusted seeds can't inject tampered bytes), falls back
to the origin closure if none serve. Returns Swarm|Origin.
- Cargo: iroh-swarm feature (off by default; heavy QUIC dep tree attaches
here). providers() is empty until enabled → every fetch hits origin.
- update.rs: components with a BLAKE3 digest route through the seam, using
the existing resumable HTTP downloader as the origin fallback; a swarm hit
is re-checked against the mandatory SHA-256 manifest gate (re-fetch from
origin on any disagreement). Components without blake3 take the original
path untouched.
44/44 swarm/update/content_hash/blobs tests pass (incl. swarm hit/miss,
tampered-bytes-rejected→origin, fall-through ordering).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the iroh-native, range-verifiable hash next to the incumbent SHA-256
so the swarm can later fetch/verify by BLAKE3 with the registry/origin as
fallback. Non-breaking: SHA-256 stays the mandatory gate; BLAKE3 is verified
only when present.
- content_hash.rs: HashAlg + ContentDigest (parse/verify '<alg>:<hex>'
multihash strings), blake3_hex/sha256_hex; BLAKE3 known-answer test
- update.rs: ComponentUpdate.blake3 (serde-default); verified ALONGSIDE
SHA-256 in the resumable download loop, re-download on mismatch
- blobs.rs: BlobMeta.blake3 computed on put (on-disk path stays
SHA-256-keyed for back-compat; advertises the future swarm address)
Drive-by: fix a pre-existing stale test (test_save_and_load_state_roundtrip)
that never wrote the .download-complete marker #26 requires, so load_state's
self-heal cleared update_in_progress. Unrelated to BLAKE3 — surfaced by
running the full update:: suite.
40/40 content_hash/update/blobs tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The #26 fix makes has_staged_update require the .download-complete
marker, so the state self-heal treats a marker-less staging dir as a
partial download and clears update_in_progress. The roundtrip test
staged a binary file but not the marker, so it began failing. Write
the marker to simulate a *complete* staged update.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the parked trust module and wires it into the live build:
- main.rs: register `mod trust`
- app_catalog::fetch_one: verify the release-root detached signature when
present (verify against raw JSON so forward-compat fields stay in the
signed preimage); accept unsigned during the migration window, hard-reject
a present-but-bad signature so a tampering mirror can't pass altered bytes
- seed: pin release-root Ed25519 known-answer test (priv+pub) for the
signing ceremony / pinned-anchor / external-verifier cross-check
- signed_doc: drop unused import
20/20 Phase 0 unit tests pass (trust::canonical/did/signed_doc/anchor,
seed release-root, app_catalog). Crate compiles clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moved here so main stays clean for the v1.7.98 release. Contains the trust/
module (canonical.rs, did.rs, signed_doc.rs) + seed::derive_release_root_ed25519.
Not wired into the build yet. Continue this work on this branch.
Captures the verified 2026-06-16 design: swarm-assist/origin-always-wins,
iroh-blobs as the swarm engine, BLAKE3 addressing, signed Nostr/release-root
authenticity, and the Phase 0-4 plan. Foundation doc for the dht branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The kiosk chromium pinned ~92% of a core (software-compositing spin from
--enable-gpu-rasterization on a GPU-less/headless node), saturating the machine
and starving the backend + container builds — it caused the .198 receive timeout
and the deploy storms.
- archipelago-kiosk.service: CPUQuota=75% + MemoryMax/High + Delegate, so a
runaway kiosk can never take the whole node down.
- archipelago-kiosk-launcher.sh: detect /dev/dri — use GPU rasterization only
when a GPU exists, else --disable-gpu (avoids the headless spin).
- bootstrap::ensure_kiosk_hardened: OTA self-heal that installs the updated
unit+launcher on already-deployed nodes, daemon-reloads, and only try-restarts
a *running* kiosk (never re-enables an operator-disabled one).
cargo check clean; launcher bash -n clean; unit syntax valid.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
immich_server/redis/postgres + indeedhub-* are multi-container stack members
whose sub-container app_ids are NOT in package_data, so the health monitor skips
them as "orphans" and never restarts them when they exit — Immich/IndeedHub stay
down until the next reboot (the boot-only start_stopped_stack_containers was the
only recovery). Spawn a 120s supervisor that reuses that same recovery at
runtime. It cheaply skips already-running containers and honours the user-stopped
list (set on every container by package.stop), so it only revives genuinely
crashed members and never fights a user stop.
cargo check clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- All four tabs (trusted/observers/messages/requests) capped at max-h-72 with
internal scroll, so the screen stays short instead of growing very long.
- Clicking a node row navigates to that node in the Federation screen
(?node=did); the Message button (stop-propagation) deep-links to that peer\047s
mesh chat (?peer=), using the Mesh.vue ?peer handler.
type-check clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A resumable-but-failed download leaves partial component files in update-staging.
has_staged_update() treated ANY staged file as "install-ready", so the state
self-heal kept update_in_progress=true and the UI showed Install instead of
Download (no clean retry).
- update.rs: write a .download-complete marker only after EVERY component
downloads+verifies; has_staged_update() now checks that marker. Partial/failed
downloads (no marker) correctly read as not-staged → self-heal clears
update_in_progress → UI shows Download. Resume still works (partial files kept).
- SystemUpdate.vue: on a genuine download failure, reset downloaded/in_progress
and re-sync, so the user lands back on Download immediately.
cargo check + vue-tsc clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sendArchMessage looped over every federation node sequentially (await
sendMessageToPeer per node), so the spinner stayed up until the slowest/offline
node's Tor request finished — long after online peers had received the message.
Send to all peers concurrently (Promise.allSettled); the spinner now clears
after the slowest single delivery, not the sum.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- DID: the Identity card read the DID only from localStorage('neode_did'), so
nodes/browsers that never cached it (e.g. .116/.228) showed no DID. Fall back
to the node.did RPC and cache it — the DID now shows everywhere.
- npub: add the node's seed-derived Nostr public key (npub) to the Identity card
next to the DID + onion, fetched from node.nostr-pubkey, with a copy button.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make each peer file card a flex column filling its grid cell (flex flex-col
h-full) and pin the body row (filename + Play/Download) with mt-auto, so cards
with a media preview and cards without line their footers up across the row.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
User: chat history (messages + mesh/Tor contacts) must persist and be
secure/encrypted per best practice. Root cause of the .198 loss was the B17
mount race writing empty stores over real data (B17 already fixes the trigger);
this hardens storage so it can never silently lose or expose data:
- storage_crypto: shared at-rest envelope mirroring credentials::store — key =
SHA-256(domain ‖ node identity key) (seed-derived, per-store domain
separation), ChaCha20-Poly1305 AEAD with a random 96-bit nonce, tamper-evident.
Transparent migration of legacy plaintext files. Unit-tested (round-trip,
wrong-key/tamper rejection, plaintext detection).
- messages.json: encrypted at rest + ATOMIC write (temp+rename) so a crash/
reboot mid-write cannot corrupt history; decrypt-with-migration on load; a
failed decrypt never overwrites the on-disk data.
- mesh contacts (alias/notes/pinned/blocked): were ONLY in memory and lost on
every restart — now persisted to mesh-contacts.json (encrypted, atomic),
loaded on MeshState startup, saved after contacts-save/contacts-block.
Explicit clear (mesh.clear-all) still wipes everything, as intended.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
User priority: FIPS is the main transport but it was unreliable and needed a
manual "Activate" button. Improvements (all in the FIPS dial/supervisor):
- Auto-activate: ensure_activated() installs the daemon config + starts the
service on its own once seed onboarding has materialised the key — no Activate
button needed. Idempotent; runs from the supervisor every 45s so a node that
onboards after boot still comes up automatically.
- Dial retry: try_fips_get/post now retry ONCE on a connect/timeout error. The
first dial to a peer triggers NAT hole-punching and often times out before the
path is up; the retry lands on the now-warm path — the main reason calls were
dropping to Tor despite the peer being FIPS-reachable.
- More patient connect_timeout (5s→8s) so a reachable-but-cold peer isn't
abandoned to Tor while hole-punching completes.
- Path warmer: spawn_fips_supervisor() keeps hole-punched paths to known
federation peers warm (every 45s, concurrent), so on-demand dials are fast and
land on FIPS.
- Confirmed the daemon config already enables BOTH udp + tcp transports
(render_config_yaml), so FIPS already uses TCP where UDP is blocked; the Tor
fallback was path-establishment, addressed above.
cargo check + fmt clean. Backend — needs a binary rebuild+deploy to validate on
.116/.198 (watch last_transport flip fips, and FIPS coming up with no button).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Add a close (X) button to the message toast (closeToast, @click.stop) like the
system notifications.
- Carry the sender pubkey on the toast; clicking now deep-links to that
conversation (/dashboard/mesh?peer=<pubkey>) instead of the generic mesh page.
- Mesh.vue reads ?peer= on mount and opens the matching peer (by pubkey_hex/did),
gracefully falling back to the mesh list when no match (B1/B2 identity).
type-check clean; useMessageToast tests 11/11.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Streaming a peer file connects over mesh/Tor before the first frame, so the
player sat blank. Add a loading state:
- PeerFiles video modal: spinner overlay ("Connecting to peer…") until the
<video> fires playing/canplay; an error overlay on failure instead of a
silent black box.
- useAudioPlayer: loading flag driven by loadstart/waiting vs canplay/playing;
GlobalAudioPlayer shows a spinner in the transport button while connecting.
- Fix the misleading audio error "Could not play audio. File Browser may not be
running." (wrong for peer content) → "Could not play this audio file. The peer
may be offline…" (B22).
type-check clean; useAudioPlayer tests 10/10.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- docker/fedimint-ui/nginx.conf: the local /assets/ handler 404'd the real
fedimint guardian UI's own bundled CSS (bootstrap.min.css, style.css) →
unstyled app. B13 fixed our local icon; this adds a @guardian_assets proxy
fallback to :8177 so the guardian's own /assets/* resolve. Verified live on
.116: /app/fedimint/assets/bootstrap.min.css 404→200 text/css. (needs
archy-fedimint-ui image rebuild to persist on nodes.)
- Home.vue: Quick Start Goals card regained lg:col-span-2 so it fills its row
on desktop instead of sitting at half width.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The B4 fix made listDirectory require a JSON content-type (to detect the
SPA-fallback HTML / 502 cases) and changed the non-OK error string, but its
tests still mocked headerless responses + the old message, so they failed —
which also polluted the run and tripped AppIconGrid's teardown. Give the JSON
mock a content-type, update the non-OK expectation, and add a test for the
guard's friendly-error path. Full suite now 667/667 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On production nodes /var/lib/archipelago (the app data dir AND podman's
graphroot=/var/lib/archipelago/containers/storage) is a separate
device-mapper volume. archipelago.service ordered only After=network-online
.target, so on cold boots it (and its ExecStartPre) could start BEFORE
var-lib-archipelago.mount, write to the bare mountpoint on rootfs, fail every
podman call, exit, and be restarted every 5s until the volume mounted — the
"~20x [FAILED] Failed to start over ~5min" boot flap. Proven live on .198:
"var-lib-archipelago.mount: Directory /var/lib/archipelago to mount over is
not empty, mounting anyway" — the service had written there pre-mount.
Fix: RequiresMountsFor=/var/lib/archipelago (adds Requires= + After= on the
mount unit).
- image-recipe/configs/archipelago.service: ships the directive on fresh ISOs.
- bootstrap::ensure_archipelago_mount_ordering(): self-heals already-deployed
nodes' installed unit + daemon-reload (boot-ordering only, effective next
reboot; never restarts the running service). Idempotent; harmless on rootfs
installs (maps to the always-mounted root).
Verified on .198: after applying, systemctl shows After=var-lib-archipelago
.mount and systemd-analyze verify is clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Home > System bitcoin tile is gated on bitcoinAvailable===true, so any
transient bitcoin.getinfo failure (RPC busy during heavy IBD, route-change
scan) could blank it even though the node is fine. Add a bitcoinStale flag:
- getinfo fails while the container is Running, or package data is momentarily
absent → retain the last-known value and mark it stale (tile stays, shows
"Updating…" instead of a frozen figure presented as live).
- container authoritatively Stopped/Exited → flip to not-available as before
(no stale-as-live).
- first-ever poll times out but container Running → show the tile as updating
rather than staying hidden on a syncing node.
Harness: src/stores/__tests__/homeStatus.test.ts (6 cases) — red before, green
after. type-check clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CORE_RPC_HOST was hardcoded to bitcoin-knots in three env-render paths, so on a
bitcoin-core node (container named bitcoin-core) mempool-api could not reach
Bitcoin RPC. Both node variants are reachable on archy-net by container name —
only the name differs.
- Legacy direct-podman (stacks.rs) and config.rs::get_app_config now use a new
dependencies::detect_bitcoin_rpc_host() (pure, unit-tested pick_bitcoin_host).
- Quadlet/manifest path (the modern fleet default): add a {{BITCOIN_HOST}}
derived-env placeholder — HostFacts.bitcoin_host + resolve_derived_env render
it; prod_orchestrator detects Knots/Core via podman ps, resolved on demand
only for manifests that use the placeholder. mempool-api manifest moves
CORE_RPC_HOST from static env to derived_env: {{BITCOIN_HOST}}.
Tests: pick_bitcoin_host (5 cases incl. substring safety), container-crate
resolve_derived_env, and orchestrator mempool_core_rpc_host_follows_bitcoin_node
(core->bitcoin-core, knots->bitcoin-knots). No-regression confirmed: picker
returns bitcoin-knots live on .198. Live bitcoin-core validation pending (no
core node available). Sibling hardcodes (lnd/btcpay/electrumx/fedimint) tracked
as B12b.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The B13 template fix only fixed fresh ISOs. Already-deployed nodes keep their
old nginx config, where /app/fedimint/ proxies to :8175 without rewriting the
Guardian UI's root-rooted asset URLs (src="/assets/...", url("/assets/...")).
Those resolve against the SPA root: bg-network.jpg exists there by luck, but
app-icons/fedimint.jpg 404s (location /assets/ uses try_files =404) — the
visibly-broken icon.
bootstrap.rs::patch_nginx_conf now heals both paths on startup:
- Style A (main conf, HTTP): swaps the old single nostr-provider sub_filter tail
for the full reroot set; byte-matches the shipped template.
- Style B (HTTPS app-proxy snippet): the snippet's fedimint block has no
sub_filter and a per-node-varying trailing directive, so anchor on the unique
:8175 proxy_pass and insert the reroot set after it (nginx ignores directive
order). Snippet added to the bootstrap nginx loop (skipped on HTTP-only nodes).
missing_* flags are now gated on their splice anchors so the included snippet
neither attempts the main-conf-only patches nor logs warn-skips every boot.
Idempotent via the 'href="/' 'href="/app/fedimint/' marker.
Verified on .198 (both paths): fedimint app-icon 404 -> 200 image/jpeg; nginx -t
OK; containers survived restart (Quadlet); idempotent steady state, no warn spam.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fedimint UI HTML/CSS reference absolute /assets/* paths; under /app/fedimint/
those hit the main SPA, not the fedimint container, so the UI renders
unstyled. Add the proven sub_filter asset-rewrite pattern (as indeedhub/
botfights use) to the /app/fedimint/ block in the nginx template + https
snippet (also rewrites url(...) for the CSS background image). Bootstrap
self-heal for already-deployed nodes is the documented resume point.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
B15: Home system stats (incl. bitcoin sync %) polled every 30s — too slow;
now 10s so sync progress tracks the actual block height more closely.
B7: the ElectrumX sync overlay was gated only on status!=='synced', so if
the status never flips to 'synced' (ElectrumX stale/disconnected) the loader
stuck on top forever. Now the overlay hides and the app iframe loads when
the sync status is stale (fail-open), while still showing during active
indexing. type-check EXIT 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The B3 streaming proxy endpoint existed in the backend but nginx had no
location for /api/peer-content/*, so the browser's requests fell through to
the SPA (200 text/html) and media still wouldn't play. Add an
NGINX_PEER_CONTENT_BLOCK that bootstrap patches into every server block
(forwards Cookie for session auth + Range, proxy_buffering off). Idempotent;
covers fresh-ISO nodes too since bootstrap runs on every startup.
Verified on .198: after restart the async nginx patch lands and
/api/peer-content/<onion>/<id> returns 401 (reaches backend, auth-gated)
instead of the SPA; nginx block present in both server blocks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Peer media (music/video) wouldn't play: the frontend downloaded the whole
file via RPC as base64 and made a non-seekable Blob URL, so <video>/large
<audio> stalled and big files hit the RPC timeout.
Add GET /api/peer-content/<onion>/<id> — a same-origin, session-gated proxy
that forwards the browser's Range header to the peer's /content/<id> (which
already returns 206 Partial Content) and passes status + Content-Range +
Content-Type back. PeerFiles.playMedia() now points <video>/<audio> at this
streaming URL for free content instead of buffering a base64 blob, so the
player can seek and start immediately. Onion/id validated to prevent
SSRF/path traversal. (Paid preview keeps its existing flow.)
Verified: cargo build --release EXIT 0; vue-tsc --noEmit EXIT 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
content.browse-peer now returns the transport that actually reached the
peer (fips/tor/mesh/lan). PeerFiles shows it as a small coloured pill next
to the peer name (FIPS/Mesh green, LAN blue, Tor amber) and the loading
text no longer hardcodes "Connecting via Tor" (it was misleading when FIPS
was used). Pairs with B14 (transport recording).
Verified: cargo build --release EXIT 0; vue-tsc --noEmit EXIT 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The B14 commit referenced crate::federation::storage::record_peer_transport
but `storage` is a private module — record_peer_transport is re-exported at
crate::federation::. E0603 broke the build. Use the re-exported path (as
load_nodes/fips_npub_for_onion already do). Verified: cargo build --release
EXIT 0. Also logs B21 (Tor/FIPS pill) plan.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 4 content peer handlers (browse, download, download_paid, preview)
captured the transport returned by PeerRequest::send_get() but discarded
it, so the federation node's last_transport was never updated for cloud
activity — the UI showed Tor/none even when FIPS was used. Call
record_peer_transport() after each successful fetch (same as sync does).
Note: live data shows FIPS still reaches only some peers (many genuinely
fall back to Tor) — tracked separately as B14b (FIPS reachability).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
B1/B2: the same physical node can linger in the federation list under two
dids (e.g. after a did/key change). An onion is a node's unique stable
identity, so two entries with the same onion are one node. This showed the
node twice in the trusted-node list (B1) and as two mesh chat contacts —
one by name+logo, one by raw did (B2).
- storage::load_nodes now collapses same-onion entries (keep first, merge
fips_npub/name/last_state) so every consumer (list + chat seed + sync)
sees one entry per node.
- federation::sync merge_transitive_peers also matches by onion (not just
did) so new transitive hints don't re-add a known node under a new did.
- mesh::seed_federation_peers_into_mesh skips already-seeded onions (belt
and suspenders).
- Unit tests for dedup_nodes_by_onion (collapse + onion-suffix handling).
B4: filebrowser-client.listDirectory only checked res.ok before res.json(),
so when File Browser is absent (nginx serves the SPA index.html, 200) or
down (502) the JSON parse threw the opaque "Unexpected token '<'". Now it
checks the content-type and throws a friendly "File Browser is not
available" the Cloud view already renders as an empty state.
Verified: dedup unit tests 2/2; live .198 (15 entries→13 distinct onions)
restarted healthy on new binary; B4 guard present in built bundle + deployed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The LND wallet UI (served on its own app port) fetches /lnd-connect-info
and /proxy/lnd/* cross-origin, so both need correct CORS headers.
(a) Older nginx configs add their own Access-Control-Allow-Origin in the
/lnd-connect-info location on top of the one the backend sets, yielding
a DUPLICATE header that browsers reject ("multiple values"). bootstrap
now strips that redundant nginx add_header (backend owns CORS).
(b) /proxy/lnd/* returned a 401 with no CORS headers when the session
check failed, so the browser saw an opaque CORS error instead of a
readable 401. Add unauthorized_cors() and use it on that path.
Adds tests/production-quality/ (bug tracker + lnd-cors-test.sh harness).
Verified: harness 4/4 on .116, .198, .103.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The add-anchor form previously hardcoded transport=udp. Expose a
TCP/UDP selector (default tcp) so public internet anchors and
local-network anchors can both be added. Includes changelog + What's
New entry for v1.7.96-alpha.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The kiosk attached-display showed a separate app-tile launcher grid
(Kiosk.vue at /kiosk) instead of the normal onboarding/login/dashboard.
The grid is auth-gated, so it only surfaced once the kiosk browser held a
persisted session; otherwise it bounced to login — masking the issue.
Remove the grid entirely. /kiosk now just persists kiosk mode + safe-area
insets and redirects to the root app. The launcher keeps pointing at
/kiosk (not directly at /) so the 'kiosk' localStorage flag is still set —
App.vue uses it to skip the remote relay, which would otherwise double
xdotool input on the kiosk display. Route made public so the auth guard
doesn't bounce it before the redirect runs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds tests/multinode/smoke.sh on the existing multinode.bash lib: an
assertion suite (pass/fail + non-zero exit) driving two real nodes through
login, onion + FIPS identity, FIPS anchor-connected, federation pairing
both directions, peer content browse over the mesh, and the removed-node
tombstone (with an optional 3rd node C for the transitive-reappear case).
Guards the v1.7.94/v1.7.95 fixes. Content-browse + tombstone checks
skip-with-note against peers older than v1.7.95.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FIPS peer content browse over the mesh was failing with "Peer returned
error: 404 Not Found" and never falling back to Tor. `is_peer_allowed_path`
only allowed `/content/<id>` (item fetches) — the catalog endpoint is
exactly `/content` (no trailing slash), so it 404'd over the FIPS peer
listener. A FIPS 404 was also treated as a successful response, so the dial
never retried Tor. Fixes: allow `/content` over the mesh; add
`fips_should_fall_back()` so a FIPS 404/5xx in Auto mode falls back to Tor
(handles version-skew peers reaching a different route). Also correct the
reconnect hint text — the public anchor is TCP/8443, not UDP/8668.
Federation: deleted nodes reappeared because transitive discovery
(`merge` of a peer's advertised trusted peers) re-added any unknown DID.
Add a tombstone store (`removed-nodes.json`): remove_node tombstones the
DID, transitive merge skips tombstoned DIDs, and a remote-triggered
peer-joined is ignored for a removed DID. Explicit local re-add (add_node)
clears the tombstone.
UI: the app credentials modal panel stretched edge-to-edge (height:100%,
max-width:none, items-stretch overlay). Constrain it to a centered card
(max-width 34rem, rounded, dimmed full-screen backdrop) matching the
AppIconGrid / wallet-receive modal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The whole fleet was silently never reaching the FIPS mesh: the default
public anchor was configured as fips.v0l.io:8668/udp, but the anchor only
answers on TCP/8443. Fix the default to 185.18.221.160:8443/tcp (IPv4
literal — the hostname resolves IPv6-first and the daemon binds v4-only,
which fails the handshake with EAFNOSUPPORT), and auto-seed it in
anchors::load() so every node dials it without operator action (removal
still persists). Proven live on .116: cold start → anchor_connected in
~400ms, anchor became mesh parent.
Wire fips::update::apply() against upstream GitHub releases (stable
channel only): resolve /releases/latest → SHA256-verify the .deb against
checksums-linux.txt → install → restart. dpkg runs via `systemd-run` to
escape archipelago's ProtectSystem=strict sandbox (else /var/lib/dpkg is
read-only), with --force-confold (archipelago manages /etc/fips conffiles)
and --force-downgrade (dev builds sort newer than the stable tag).
Validated live: .116 upgraded 0.3.0-dev -> stable v0.3.0.
Also: standalone fips-ui dashboard app (apps/fips-ui + docker/fips-ui,
static nginx proxying /rpc/v1 same-origin, copiable own-anchor address);
reserve UI port 8336; register fips/fips-ui as platform-managed. Includes
the Lightning wallet cross-origin (CORS) + LND proxy auth + nginx
self-healer fix so the wallet screen connects instead of "failed to fetch".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When an existing LND wallet is locked and none of the candidate passwords
(per-node secret, legacy constant) open it, the node can never auto-unlock
unattended. unlock_existing_wallet now returns Ok(false) for "all candidates
actively rejected" (vs Err for transient "LND not ready"), and
ensure_wallet_initialized responds by recreating the wallet:
- mark the lnd container user-stopped so the health monitor won't
re-launch it (and re-open the wallet) mid-wipe,
- stop lnd, delete its wallet/chain/graph state as root,
- start lnd, wait for NON_EXISTING, re-init a fresh wallet on the
per-node secret, then clear the user-stopped flag.
LND runs as a plain bridge-network podman container (not a Quadlet unit),
so it is restarted via `systemd-run --user --scope podman`, matching the
orchestrator/health-monitor path.
Alpha nodes hold no funds and a wallet locked with an unknown password is
already inaccessible, so the wipe loses nothing reachable. Completes the
forward fix from 91adc281 for nodes whose wallet pre-dates the per-node
secret and whose password is unrecorded (e.g. .116/.228).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the fleet-wide hardcoded WALLET_PASSWORD='hellohello' that left wallets
LOCKED after OTA/reboot (auto-unlock used the wrong password fleet-wide).
Forward fix (both init paths unified, validated cargo check + LND REST mechanics
on a scratch wallet):
- Per-node random 256-bit secret in secrets/lnd-wallet-password (0600), mirroring
secrets/bitcoin-rpc-password. read_wallet_password (no-gen) vs
ensure_wallet_password (gen at init only).
- container/lnd.rs init AND api/rpc/lnd/wallet.rs seed-derived init both use the
per-node secret (wallet.rs keeps recoverable derived entropy; password unified).
- Unlock tries [per-node secret, legacy 'hellohello']; single-attempt primitive
distinguishes invalid-passphrase (fail fast, try next) from not-ready (retry),
so a wrong password no longer hangs the boot path ~60s.
Migration (candidate-unlock + rotate, best-effort at login):
- change_wallet_password (WalletUnlocker.ChangePassword) + migrate_locked_wallet:
if LOCKED, try candidates as current pw and ChangePassword onto the per-node
secret so future boots auto-unlock. Hooked into auth.login (non-blocking) with
the just-verified password as the candidate.
NOT YET: seed-recovery fallback for wallets where no candidate matches (e.g.
.116/.228) — destructive, needs entropy-source/funds-safety handling; next pass.
NOT shipped: pending end-to-end validation on a real node.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
create-release.sh bumps Cargo.toml but not the lock's archipelago version line;
the cargo build regenerates it post-commit. Same as the 1.7.91 leftover — worth
fixing create-release.sh to stage Cargo.lock, tracked separately.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A wrong/locked LND wallet password leaves the wallet LOCKED after every
restart/OTA, breaking all Bitcoin-receive + Lightning ops fleet-wide — and the
harness was blind to it: live-lnd-address-type treats 'wallet locked' as PASS,
os-audit treated lnd-unreachable as WARN, and the archipelago lnd.getinfo RPC
masks a locked wallet (returns all-zero success).
- tests/release/run.sh: new 'live-lnd-unlocked' stage polls LND's unauth
/v1/state and FAILs if still LOCKED after a 60s grace window.
- tests/lifecycle/os-audit.sh: probe lnd.newaddress (the real receive path,
which surfaces LND_WALLET_LOCKED) instead of lnd.getinfo; locked = hard FAIL,
not-installed = WARN.
Proven on .116 (genuinely locked): os-audit now reports
'[FAIL] lnd wallet unlocked (lnd.newaddress) wallet LOCKED'.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
create-release staging requires >=3 curated release-note bullets. The What's
New restoration is itself user-facing, so it's an honest third note; mirror it
into the modal's v1.7.92 block via sync-whats-new.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The What's New modal (AccountInfoSection.vue) hardcodes one block per release
and had silently drifted: it sat at v1.7.84 while the fleet shipped through
v1.7.92, so eight releases of notes never reached users in Settings.
- scripts/sync-whats-new.py: renders a modal block from each CHANGELOG version
that's missing one (curated bullets, dev-process 'Validation…' lines dropped),
inserts newest-first; never touches older hand-written pre-CHANGELOG history.
--check mode lists anything missing and exits non-zero.
- tests/release/run.sh: new 'whats-new-sync' static gate runs --check, so a
release with an un-surfaced CHANGELOG entry fails before shipping.
- Backfilled the eight missing blocks (v1.7.85 … v1.7.92) into the modal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
batch_host_reboot previously asserted only container-set equality after the
reboot. Add the os-audit.sh per-boot health gate: after rpc_login succeeds
post-reboot, run os-audit against the target (ARCHY_LOCAL=0, https) and record
host_reboot_osaudit PASS/FAIL. This asserts the node is actually healthy after
a reboot — RPC up, OTA not wedged (FM12), every app reachable with valid launch
metadata, FM-guards green — not just that the right containers exist. Validated
green on .116 (11 pass / 0 fail / 0 warn).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When ElectrumX is still building its index (or waiting on the Bitcoin node),
AppSessionFrame shows a sync 'pre UI'. The iframe-blocked fallback ('App not
reachable / retrying') was not gated on electrsSync, so it painted over the
sync screen and read as a hard connection error. Gate it on !electrsSync,
mirroring the iframe's own guard.
Also harden the lifecycle health probe: container_health used jq '// "unknown"',
which only catches null/false — an empty-string health (a brief window under
load) rendered as a blank 'bad health: X is '. Map empty to 'unknown' so the
retry loop keeps waiting instead of failing on a transient.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
create-release.sh bumps Cargo.toml; the lock's archipelago version line is
regenerated by the subsequent cargo build and was left uncommitted after the
v1.7.91-alpha release commit. The shipped binary is built from the bumped
Cargo.toml, so this is bookkeeping only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
os-audit.sh: one non-destructive scorecard tying backend/RPC health, the
all-apps lifecycle audit (delegates to remote-lifecycle.sh), and the FM-guards
(port-drift, secret-completeness, orphan-container sweep, OTA-wedge). The
per-boot building block for the reboot-survival loop. FM12 check uses jq has()
not // (// treats a legit false as empty). Section A validated all-PASS on .116.
docs: v1.7.91 release-pass resume notes + the bitcoinReceive blocker writeup.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
codeMatch[1] is string|undefined under noUncheckedIndexedAccess; using it
directly as an index into RECEIVE_CODE_MESSAGES failed vue-tsc (TS2538) and
aborted create-release.sh at the frontend build step. Bind to a const and
narrow before indexing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Before/after on the live node confirms the launch_url_port fix:
jellyfin/btcpay/fedimint/gitea/portainer/botfights all went from
lan_address=None to a resolved http://localhost:PORT/ URL; harness
focused audit passed, exit 0. Also documents that archipelago.service
restarts are safe on .116 (containers run in the user-1000 slice, a
different cgroup, and survived the restart).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
reachable_lan_address() parsed the launch port with url.rsplit(':')
which yields "8096/" for manifest interfaces.main URLs that carry a
path (http://localhost:8096/). That fails to parse and silently drops
a perfectly reachable launch URL, so apps like jellyfin, btcpay-server,
fedimint, gitea, nextcloud and portainer showed running with no launch
link in the UI. New launch_url_port() reads digits after the final
colon (mirroring port_from_url in the RPC layer) and tolerates a
trailing path. Adds regression tests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes three Bitcoin/wallet failures observed across the fleet on v1.7.90-alpha
(all nodes were already on the latest build — these were live bugs, not stale
builds), plus the missing ElectrumX tile, and adds automated coverage so each
can't regress silently.
Receive address (".116 receive fails", ".228 false 'wallet is locked'"):
- LND publishes its REST API on a host port that can drift from the manifest
(a container created when the mapping was 8080 kept publishing 8080 after the
manifest moved to 18080). The in-process client connects to the manifest port,
gets connection-refused, and wallet init fails forever while the container
looks "Up". Add published-port drift detection to the reconciler
(container_ports_drifted / host_port_bindings_drifted) that recreates a
drifted backend even for restart-sensitive apps — a drifted container is
already broken, so leaving it "untouched" only perpetuates the failure.
- Receive errors now carry a stable [CODE] token (REST_UNREACHABLE, WALLET_LOCKED,
WALLET_UNINITIALIZED, SYNCING) and always start with "Bitcoin address" so they
survive the RPC error sanitizer instead of collapsing to the generic
"Operation failed". The UI maps the code instead of guessing wallet state from
substrings — so an unreachable REST endpoint is no longer mislabelled "locked".
Bitcoin install (".198 bitcoin gone / reinstall just stops"):
- bitcoin-knots requires the secret bitcoin-rpc-txrelay-rpcauth, which was only
generated by the tx-relay flow. Nodes that never used tx-relay lacked it, so
secret resolution hard-failed and the whole Bitcoin stack cascaded. Generate
it idempotently before bitcoin starts (ensure_app_secrets, reusing
ensure_txrelay_credentials), and name the missing secret in the error so a
genuine gap is actionable instead of a bare "IO error".
ElectrumX app tile missing on every node with it installed:
- The catalog generator dropped electrumx because the manifest had no
interfaces.main block, so the tile had no launch URL and was hidden. Declare
the companion UI port (50002) in the manifest, regenerate the catalog, and let
an app with a known launch URL stay launchable while its backend is still
"starting" (ElectrumX indexes for 10m+).
Test harness:
- New lifecycle bats suites: bitcoin-receive, port-drift, secret-completeness
(validated live; port-drift catches the real .116 drift).
- Rust unit tests for drift detection, the receive reason-code classifier, and
the named-missing-secret error; vitest for the UI code mapping.
- create-release.sh now runs tests/release/run.sh and aborts the release on
failure — previously it ran no tests at all.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- LND wallet: request correct address type so receive-address generation
no longer 400s
- AIUI/app session: on-screen pointer can click + type into app content
(incl. app store search); "open in new tab" opens the phone browser;
mobile credential modal centered instead of full-height
(remote-relay.ts, AppSession.vue, AppSessionFrame.vue, AppIconGrid.vue,
openExternal.ts, WebViewScreen.kt) + remote-relay tests
- health_monitor: electrs auto-recovers from a corrupt index and shows a
percent/block-height progress screen while reindexing (useElectrsSync.ts)
- update.rs: drop retired tx1138 secondary mirror (one-time migration);
longer download timeout for slow connections
- CHANGELOG: v1.7.90-alpha notes
- tests/release/run.sh: harness tweaks
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a Quadlet unit file already exists for an orchestrator-managed
backend, sync its on-disk bytes against what the current renderer
produces. write_if_changed makes this idempotent — when bytes match,
no IO; when they differ (post-deploy of a renderer change), the file
is rewritten and systemctl --user daemon-reload runs once.
We deliberately do NOT restart the .service when the file changes:
running containers keep their current config until the operator
restarts them. That's the right tradeoff — file updates are cheap and
non-destructive; service restarts are the SIGKILL cascade we're
trying to eliminate.
Why this matters: pre-this-commit, every renderer change required a
fresh package.install RPC per app to take effect. Observed live on
.228 2026-05-02 — the TimeoutStartSec=600 fix shipped in code but
existing units stayed on the old format because nothing triggered a
re-render. Combined with state.json being empty (so the reconciler's
auto-install path didn't fire either), the fix was invisible until
manual unit deletion.
Companions (UI_APP_IDS) are skipped — companion.rs renders those units
with a different shape; syncing here would clobber them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug surfaced live on .228 2026-05-02 — every backend Quadlet unit
(lnd, electrumx, fedimint, btcpay-server, mempool-api, bitcoin-knots)
hit systemd's default 90s start timeout because Notify=healthy makes
systemctl wait for the first green health probe, but
HealthInterval=30s × HealthRetries=3 = 90s minimum even on a healthy
service. Race: timeout fires the moment the third probe MIGHT succeed.
Result was three different post-states (inactive+running, failed+missing,
inactive+stopped) depending on whether systemd's ExecStopPost ran
podman rm before the orchestrator's adoption logic re-grabbed the
container.
Fix: when health is set, render TimeoutStartSec=600 (10 minutes) into
[Service]. Long enough for slow-starting backends (electrumx index
replay, lnd wallet unlock) without being so long that a truly stuck
unit hangs forever. Companions stay unchanged (no health → no override,
default 90s applies).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs surfaced by the first real-node validation of Phase 3.2-3.4
on .228 (2026-05-02), both caught before flipping the default.
Bug 1 — translate_health_check double-prefixed http://. Manifests in
the wild carry the scheme inside the endpoint string ("http://localhost:8175"),
and we were prepending another http:// unconditionally. Result on .228:
every backend HealthCmd read `curl -fsS -m 5 http://http://localhost...`,
every probe failed, fedimint hit a 14-restart loop. Now we accept either
form and skip appending hc.path when the endpoint already carries one.
Regression test asserts no double-prefix and that an in-endpoint path
is honoured.
Bug 2 — Phase 3.3 migration ran for UI companions (bitcoin-ui /
electrs-ui / lnd-ui) that have shipped via Quadlet since v1.7.41.
Migration tore down the running companion + raced companion.rs render,
producing "Phase 3.3: re-install archy-bitcoin-ui via Quadlet" reconcile
errors and leaving archy-bitcoin-ui down. Companions now short-circuit
out of migrate_to_quadlet_if_needed before any IO. Also: when try_exists
returns Err for an unrelated reason (permissions, EIO), we now skip
migration instead of treating "I can't tell" as "go ahead and migrate" —
migrating on top of a possibly-existing unit is destructive.
What this does not fix yet:
* the orchestrator's reconciler iterating every manifest in
/opt/archipelago/apps/, not just installed apps. Pre-existing
behavior (also affects the legacy path) — separate scope.
* fedimint /data UID mismatch surfaced when Quadlet started fedimint
fresh. Likely orthogonal — defer.
* no rollback when install_via_quadlet fails after a remove_container.
Tracked as Phase 3.3.1 — defer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an env-var lever for Phase 3.2's use_quadlet_backends flag so the
20× harness can flip the path on per-node without a config.json edit
(which would require an archipelago.service restart — and that triggers
FM3 cgroup cascade until Phase 3.5 ships, so we can't ask anyone to
reconfigure live nodes that way today).
Truthy parsing centralised in `parse_truthy_env` (1, true, yes, on —
case-insensitive, whitespace-trimmed). Anything else is false. The
helper is unit-tested so future env-var flags can reuse the same shape.
Also adds a default-off regression test for use_quadlet_backends so
flipping the default ahead of the 20× verification fires immediately.
TESTING.md documents the Environment= snippet for the systemd drop-in
so the next operator can flip the flag on a debug node without
re-deriving the recipe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A six-test bats suite that validates what install_via_quadlet (Phase 3.2)
is supposed to leave behind:
* `.container` unit on disk in $XDG_CONFIG_HOME/containers/systemd/
with [Container] / [Service] / [Install] sections, Image= present,
and Restart=on-failure (the backend invariant — companions use Always)
* Phase 3.4 cross-check: any unit with HealthCmd= must also emit
Notify=healthy, otherwise systemctl start won't gate on health
* `systemctl --user is-active` returns 0 for the .service
* podman shows the container running
* the container's cgroup is under user.slice/, NOT under
archipelago.service — the kernel-level proof that FM3 cgroup
cascade SIGKILL is structurally fixed for this container
Auto-skips on every test when no backend Quadlet units exist (today's
default state, use_quadlet_backends=false) — so the suite is a no-op
on current fleet boxes and turns into a hard regression gate the
moment anyone flips the flag and reinstalls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
QuadletUnit gains an optional HealthSpec; from_manifest translates the
manifest's health_check (tcp/http/cmd) into a HealthCmd= directive and
emits Notify=healthy alongside it. systemctl start <unit>.service then
blocks until the container's first green probe — eliminating the
"container up but RPC not ready" race the orchestrator currently papers
over with post-start polling.
Translation policy:
* tcp, endpoint "host:port" -> nc -z host port
* http, endpoint "host:port", path -> curl -fsS -m 5 http://endpoint<path>
* cmd, endpoint "<shell command>" -> verbatim
* unknown type / malformed endpoint -> None (skip Notify=healthy rather
than emit a HealthCmd that hangs the unit start forever)
Companion units leave health: None and remain byte-identical to before
this PR — the renderer only emits the Health* / Notify= block when set.
+4 quadlet unit tests (19 total). Dropped a never-used test setter that
was generating a dead_code warning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When use_quadlet_backends flips from off → on, existing fleet boxes
have backend containers parented under archipelago.service's cgroup
(the bad shape that triggers FM3 cascade SIGKILL on every archipelago
restart). ensure_running now notices and corrects this:
* If there's already a `<name>.container` unit on disk → no-op
(subsequent reconcile ticks take this fast path).
* Else if a podman container with that name exists → it's a pre-3.3
artifact. Stop+remove it (volumes survive — bind mounts are not
touched by `podman rm`), then write the Quadlet unit, daemon-reload,
and start the new managed service.
* Else → fall through to install_fresh, which already routes through
install_via_quadlet when the flag is on.
The migration is idempotent and self-healing: if a fleet box is
half-migrated (unit on disk but no service active, or service active
but stale unit), the next reconcile tick converges. Bitcoin chain
data, lnd wallet state, and electrumx index all live on host bind
mounts and are unaffected by the container-record swap.
Volume safety audited per backend in `uses_orchestrator_install_flow`
allowlist — every entry mounts its data dir as a host bind mount.
Default still off. To migrate a node:
/etc/archipelago/config.toml: use_quadlet_backends = true
followed by `systemctl restart archipelago` — the next reconcile tick
walks every managed app and migrates each in turn.
Tests: 624 passing, 0 cargo warnings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prod_orchestrator::install_fresh now branches on the new
Config::use_quadlet_backends flag (default false):
* off (today's production behavior) — unchanged: runtime.create_container
+ start_container, container parented under archipelago.service's
cgroup, FM3 cascade SIGKILL on every archipelago restart.
* on — install_via_quadlet renders the manifest as a Quadlet unit via
QuadletUnit::from_manifest, writes it atomically into
~/.config/containers/systemd/, calls daemon-reload, and starts the
generated <name>.service. Container ends up under user.slice — no
more cgroup parented under archipelago, so archipelago restarts
don't touch the container's lifetime.
Default off so this commit is structurally safe to ship: nothing
changes at runtime until an operator opts in. Flip the default once
tests/lifecycle/run-20x.sh has gone green against the new path on
.228 + .198 (the v1.7.52 release gate).
Plumbing:
* config.rs — `use_quadlet_backends: bool` w/ Default false
* prod_orchestrator.rs — flag stored on the struct, threaded through
new(), with set_use_quadlet_backends(bool) test setter
* prod_orchestrator.rs — install_via_quadlet helper
* dropped the Phase-3.1 #[allow(dead_code)] markers on from_manifest /
parse_memory_mib / RestartPolicy::OnFailure now that the call path
exists; if a future revert removes the wiring, the warnings come back.
Tests: 624 passing, cargo check clean (0 warnings). Existing companion
behavior unaffected — render_skips_backend_directives_when_default
still passes byte-equal to before quadlet.rs grew the new fields.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The QuadletUnit struct now covers everything a backend manifest needs
(ports, environment, devices, add_hosts, entrypoint+command, read-only
root, no_new_privileges, cpu_quota, restart policy choice). Adds
QuadletUnit::from_manifest(&AppManifest, name) that translates a parsed
manifest into a unit, plus parse_memory_mib for "1g"/"512m"/raw-MiB
forms. The renderer skips empty/false directives so existing companion
units render byte-identically — no behavior change for shipping
companions; the backend renderer is dead code until Phase 3.2 wires it
into the orchestrator.
Eight new unit tests cover:
* parse_memory_mib forms (1024, 512m, 2g, garbage)
* shell_join quoting (whitespace, embedded quotes)
* RestartPolicy → systemd string mapping
* render emits backend directives when set
* render skips them when defaulted (companion regression gate)
* from_manifest happy path on a bitcoin-knots-shaped manifest
* from_manifest read-only volume detection
* from_manifest tmpfs filtering
* end-to-end manifest → render bytes assertion
Tests: 615 → 624 (+9 net; one pre-existing parse_memory_mib path was
implicitly covered before but is now explicit). Cargo warnings: 0.
`from_manifest`, `parse_memory_mib`, and `RestartPolicy::OnFailure` are
marked allow(dead_code) with explicit references to Phase 3.2 — if
3.2 doesn't wire them, the dead-code warning resurfaces.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings L1 (RPC API) + L3 (lifecycle survival) parity coverage to the
three multi-app stacks that were previously only touched by
required-stack.bats. Combined with bitcoin-knots / lnd / electrumx
already shipping, the six core apps now have dedicated bats files.
Each suite is shaped like the existing single-container suites
(bitcoin-knots / lnd / electrumx) and gates every assertion on the
backing container actually being present, so a node without the stack
installed gets clean skip messages instead of false fails.
* btcpay.bats — 9 tests, including stack-wide presence and a
"supporting containers don't cascade-restart" guard
* fedimint.bats — 8 tests, single container
* mempool.bats — 9 tests, mixed legacy + orchestrator-managed stack;
reuses the :8999 mempool-api probe from required-stack for parity
Total bats now: 88 (was 53 → +35).
TESTING.md matrix advances 23 → 50 of 110 cells.
UI URL coverage for these three apps already lives in
ui-coverage.bats, so this PR doesn't duplicate proxy-path probes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single source of truth for "where are we, where are we going" on the
v1.7.52 container excellence work. Replaces ad-hoc tracking in chat.
Sections:
* Test layers L0..L6 with toolchain + per-iteration latency
* Per-app × per-state coverage matrix (23 of 110 cells today; goal 110)
* Layer-by-layer status (L0+L1+L2 ●; L3 ◐; L4..L6 ○)
* Run commands (single suite / full suite / 20×)
* LoC budget — -270 committed, ~1,616 more possible if Phase 3 ships
* Performance KPIs (TBD — measure first, target second)
* Release gates — 8 boxes that must tick before v1.7.52 ships
The file lives in-repo so PR diffs to it answer "what did this commit
improve?". If you can't tick the box, the change isn't ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the coverage gap where existing bats suites would report green
on a node whose dashboard tiles 502 because the proxy upstream is dead.
First pass against .198 caught real prod issues immediately:
/app/lnd/ → 502 (lnd container exited)
/app/mempool/ → 502 (mempool container exited)
/app/fedimint/ → 502 (fedimint container exited)
while existing tests reported only "container is up: false" with no
404/502 distinction.
* lib/ui-probes.bash — sourced helper. probe_https_200,
probe_app_url (skip-if-container-down else assert-200),
probe_dashboard_shell (asserts the Vue SPA HTML, not nginx default —
catches the layout regression from feedback_release_tarball_layout.md),
probe_dashboard_catalog (asserts /catalog.json non-empty).
* bats/ui-coverage.bats — 9 @test cases covering the dashboard +
bitcoin-ui :8334 + the seven HTTPS_PROXY_PATHS most users hit
(lnd, electrumx, mempool, fedimint, btcpay, filebrowser).
URL list mirrors HTTPS_PROXY_PATHS in
neode-ui/src/views/appSession/appSessionConfig.ts. Divergence between
the two is the exact bug class we're guarding against.
Loops clean under run-20x.sh. Container-state oracle is via local
podman inspect, so the suite must run on the archy host (same as
companion-survives-archipelago-restart.bats).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extracted the heal_podman_state cleanup list as a module-level
HEAL_RUNTIME_SUBDIRS const so a unit test can structurally enforce
the invariant: the list must contain "containers" + "libpod" but
must NOT contain "podman" (which holds systemd's podman.sock
listener and was the bug fixed in commit bb421803).
If anyone re-adds "podman" — accidentally, by reverting, or by
copy-paste from old plan memory — this test fires before we ship,
not on the next deploy when it nukes the orchestrator's HTTP path.
Total tests: 614 → 615.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sister suite to companion-survives-archipelago-restart.bats. That one
tests the same property for UI companions, which already ship via
Quadlet (commit 6e716f68) and so already pass.
This new suite tests the property for backend containers (bitcoin-knots
/ bitcoin-core / lnd / electrumx). Until v1.7.52 Phase 3 ships these
under Quadlet too, the suite is EXPECTED TO FAIL on fleet boxes — it's
the executable definition of "FM3 fixed".
Observed live on .198 on 2026-05-01: `sudo systemctl stop archipelago`
killed every container in archipelago.service's cgroup. The dedicated
"backends survive archipelago restart" test catches exactly that, and
also verifies the SAME container instance survives (compares pre/post
.Id), so an orchestrator that recreates a fresh container after the
SIGKILL doesn't read as pass.
Three @test cases:
* destructive gate (skip-marker for the suite)
* baseline: at least one backend installed + running
* backends survive: same .Id pre + post archipelago restart
Don't gate releases on this passing until Phase 3 lands; before then
treat it as a "expected to fail / shows progress" indicator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same shape as bitcoin-knots.bats and lnd.bats so the 20× release-gate
exercises electrumx through the same state matrix it uses for the other
two core apps. electrumx previously had a single TCP-port check inside
required-stack.bats; this adds destructive + cascade-destructive tiers.
10 @test cases:
* read-only: presence, valid state, TCP port (50001) reachable, no
orphan containers beyond {electrumx, archy-electrs-ui}
* destructive: stop, start, restart, TCP port recovers within 120s of
cold restart (longer than bitcoind because electrumx replays its
index against bitcoind on start)
* cascade: uninstall, reinstall (240s timeout for index rebuild)
With this suite, the three single-container core apps (bitcoin-knots,
lnd, electrumx) now have parity coverage. Multi-container stacks
(btcpay, mempool, fedimint) come next.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors bitcoin-knots.bats so the 20× release-gate run exercises lnd
through the same state matrix. lnd previously had only a single
read-only check inside required-stack.bats; this adds the destructive
and cascade-destructive tiers that match what we already test for
bitcoin-knots.
10 @test cases:
* read-only: presence, valid state, lncli getinfo, no orphan containers
* destructive (ARCHY_ALLOW_DESTRUCTIVE=1): stop, start, restart,
RPC recovers within 90s of cold restart (longer than bitcoind
because the wallet has to unlock first)
* cascade (ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1): uninstall, reinstall
Reuses the same lncli invocation as required-stack.bats so divergence
shows up clearly if either test breaks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of the v1.7.52 container excellence plan: a release-gate harness
that loops the bats suite N times in a row, with teardown between
iterations, and reports a pass/fail tally.
* setup-teardown.sh — clears /tmp/archy-rpc-session-* between runs so
iteration N+1 doesn't reuse a logged-out cookie from iteration N.
Idempotent; safe to run anytime. Designed to grow as we add suites
that leave other transient state.
* run-20x.sh — wraps run.sh in a loop of ARCHY_ITERATIONS (default 20).
Tracks per-iteration pass/fail with wall-clock timing, prints a
results block, exits non-zero on any failure. Honors ARCHY_FAIL_FAST
for short-circuit during dev.
Suggested release-gate command:
ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 \
tests/lifecycle/run-20x.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed live on .198: heal_podman_state was removing
$XDG_RUNTIME_DIR/podman/ alongside containers/ and libpod/. That dir
holds the systemd-bound podman.sock — the listener systemd creates for
socket-activated podman.service. Removing it broke every libpod HTTP
call from the orchestrator until `systemctl --user restart
podman.socket` ran. Far worse than any wedge it was trying to repair.
Drop podman/ from the cleanup list. The runtime state we actually want
to clean for FM6 (bolt_state.db drift) lives in containers/ and
libpod/ only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cargo check was showing five real warnings, all genuinely dead:
* container/mod.rs — re-exports compute_container_name, AdoptionReport,
ReconcileAction, ReconcileReport were unused outside
prod_orchestrator. Drop from the pub use line.
* prod_orchestrator — with_runtime + insert_manifest_for_test only exist
for the test module in the same file. Mark them
#[cfg(test)] so they don't appear in release builds.
* async_lifecycle — remove_package_entry has no callers; doc claims
"used for install-failure cleanup" but nothing
cleans up. Delete (10 lines).
* registry.rs — `use tracing::{debug, info};` had no consumers.
* fips.rs — unused-assignment chain on last_status. The poll
loop always sets it on every break path, so the
initial `None` and the unwrap_or_else fallback
were both dead. Refactored to `let after = loop
{ ...; break s; };`.
cargo check is now clean. cargo test --workspace --bins: 614 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes FM6 (podman bolt_state.db / runtime drift) — observed live on
.198 today: bitcoind was running for several minutes, but podman's
state DB reported the container as Exited. The reconciler then tried
to "restart" it, racing the still-bound port 8332 and failing in a
loop.
heal_podman_state() runs as the last bootstrap stage, BEFORE the
orchestrator's reconcile loop ticks. It probes `podman info` with a
5s timeout; on failure it removes the runtime-state dirs under
$XDG_RUNTIME_DIR and re-probes. Persistent storage under
~/.local/share/containers/storage/ is never touched, so containers
re-discover from manifests on next call.
Cleanup never includes `podman system reset` or `system renumber` —
those are destructive and must stay operator-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DependencyResolver had zero call sites in prod or tests outside the
module itself. The actual install-time dependency check lives in
install.rs::detect_running_deps + check_install_deps; this DAG-walk
solver was never wired up. -268 LoC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If bitcoin-core was installed but never started (e.g. port 8332 already
bound by bitcoin-knots), the container sticks in `created` state forever.
The old conflict check refused EVERY future bitcoin install — including
re-install of the running variant — leaving no UI path to recovery.
Now the check distinguishes states:
- missing → no conflict, continue
- running → real conflict, refuse install
- created/exited/configured/... → stuck; auto-remove and continue
Volumes are untouched; only the dead container record goes away.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bitcoin containers were exiting in ms after start because the orchestrator
install path skipped the credential-materialisation step the legacy path
did. resolve_secret_env then failed to read
/var/lib/archipelago/secrets/bitcoin-rpc-password, the container started
with no password, and bitcoind crashed before logs were useful.
Two changes:
1. install.rs — call bitcoin_rpc_credentials() for bitcoin/bitcoin-core/
bitcoin-knots before any install branch runs. The function generates +
persists on first call (OnceCell-cached), so this is idempotent.
2. manifest.rs::resolve_secret_env — return ManifestError::Invalid when a
resolved secret trims to empty, instead of silently producing
`KEY=` env vars that crash auth.
Adds a unit test for the empty-secret rejection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3a of the install path consolidation. Two coupled changes:
1. install.rs handle_package_install: gate the legacy "container exists →
adopt + return" probe on !orchestrator_managed. Apps the orchestrator
knows about (bitcoin-knots, bitcoin-core, lnd, electrumx, fedimint,
filebrowser, btcpay-server stack apps, mempool stack apps, plus the
companion UIs that just moved to Quadlet) skip the legacy probe and
fall straight into the orchestrator branch.
The legacy adopt block was returning success on a bare `podman start`
exit-0 — even when the process inside the container crashed seconds
later. That's the .228 "running but unreachable" failure mode. The
orchestrator's ensure_running honors the manifest's health check and
pre-start hooks (e.g. re-renders bitcoin-ui's nginx.conf if the RPC
password rotated), so this is a behavioral upgrade, not just a
refactor.
2. ProdContainerOrchestrator::install: make idempotent. Previously it
blindly called install_fresh which would fail on `podman create` if
the container name already existed. Now it delegates to ensure_running:
- Container Running + healthy → no-op (refresh hooks, restart if
config rewritten)
- Container Stopped/Exited → start (with hook refresh)
- Container missing → install_fresh
- Container in wedged state (Created/Paused/Unknown) → force-recreate
Without this, change #1 would regress every "container already exists"
case for the 18 orchestrator-managed app IDs. With it, install becomes
the single source of truth for "make app X be in the desired state."
Tests: 654 passed across the workspace (614 unit + 37 orchestration + 3
rpc), 0 failures. The 20 prod_orchestrator tests cover the install /
ensure_running / reconcile paths the new install delegates through.
Net delta: install.rs grows by ~30 lines (gating wrapper + comments),
prod_orchestrator.rs grows by ~30 lines (idempotent install body). Both
are temporary — the larger deletions (~1700 lines) come once every app
has been verified through the orchestrator path in subsequent phases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion UI containers (archy-bitcoin-ui, archy-lnd-ui,
archy-electrs-ui) used to be launched as fire-and-forget tokio::spawn
blocks from install.rs. If archipelago crashed mid-spawn or the
container's cgroup was reaped, companions vanished from podman ps -a
and only a manual rm/run could bring them back (the .228 incident).
Now each companion is rendered as a Quadlet .container unit under
~/.config/containers/systemd/, daemon-reloaded, and started via
systemctl --user. systemd owns supervision from that point on:
- archipelago can crash, restart, or be uninstalled without touching
any companion.
- Quadlet's Restart=always + RestartSec=10 handles container exits.
- A 30s reconcile tick in boot_reconciler enumerates expected
companion units and re-installs any whose unit file or service
vanished — defense-in-depth against external tampering.
New module layout:
- container/quadlet.rs: pure unit renderer + atomic write_if_changed
+ systemctl helpers (daemon_reload_user / enable_now / disable_remove
/ is_active). 6 unit tests, no I/O in the renderer.
- container/companion.rs: per-app companion specs, install/remove/
reconcile, image presence (build local first, fall back to insecure
registry only via image_uses_insecure_registry whitelist). 2 tests.
install.rs handle_package_install now ends with a single call to
companion::install_for(package_id), replacing 287 lines of spawn-and-
hope shellouts plus a ~120-line nginx auth-injector helper that worked
around per-node RPC password baking. The helper is gone too — the
pre-start hook renders the per-node nginx.conf to /var/lib/archipelago/
bitcoin-ui/nginx.conf and the Quadlet unit bind-mounts it read-only.
runtime.rs handle_package_uninstall now disables companions before
the container rm loop. Otherwise systemd's Restart=always would
respawn each companion within ~10s of removal.
Tests: 53 container tests pass, including 6 quadlet renderer tests
(host network, bridge network, capability set, atomic write idempotence)
and 2 companion specs (per-app companion lookup, build_unit shape).
boot_reconciler tests gain a #[cfg(test)] without_companion_stage()
flag so the paused-clock fixtures don't race the real systemctl I/O.
A bats regression test (companion-survives-archipelago-restart.bats,
gated on ARCHY_ALLOW_DESTRUCTIVE=1) asserts the .228 failure mode
cannot recur: every installed companion has a unit file, services
stay active across systemctl --user restart archipelago, and a
deleted unit file is recreated within one reconcile tick.
Net delta: +941 / -363, but the +941 is mostly tests (~440 lines)
and the new declarative layer; the imperative tokio::spawn block and
its nginx-auth helper are gone, removing two failure classes
(orphan companions on archipelago crash, and post-start exec races
under tightly-confined cgroups) that previously needed manual SSH
recovery.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small, focused tightenings:
- core/container/src/podman_client.rs: drop the legacy Hetzner
23.182.128.160:3000 mirror from image_uses_insecure_registry().
It was decommissioned in v1.7.x and is stripped from active
registry config at load time; leaving it in the bypass list let
a stale config still skip TLS. Replace the inline match with a
named INSECURE_REGISTRY_HOSTS slice so future entries are one
line. Test now also pins the spoofing-immune semantics
("evil.example/146.59.87.168:3000/x" must NOT match).
- core/archipelago/src/api/rpc/package/config.rs: split bitcoin
from lnd in get_app_capabilities(). bitcoind never opens raw
sockets — drop CAP_NET_RAW from bitcoin/bitcoin-core/bitcoin-knots.
lnd/fedimint/fedimint-gateway keep it because they enumerate
network interfaces during cert generation.
- core/archipelago/src/bootstrap.rs: tighten_secrets_dir()
enforces 0700 on /var/lib/archipelago/secrets and 0600 on every
file inside on each startup. The dir-mode is the load-bearing
isolation boundary against rootless container escapes (their UID
maps to >=100000, can't traverse uid=1000/0700). The per-file
sweep is defense-in-depth against any installer that wrote 0644.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshots the in-flight hardening work so subsequent reconcile/Quadlet
phases land on a clean before/after diff.
Changes:
- core/container/src/podman_client.rs: image_uses_insecure_registry()
whitelist for the OVH (146.59.87.168:3000) and legacy Hetzner
(23.182.128.160:3000) HTTP mirrors; podman_network_settings() lifts
custom networks into the Networks map so containers can join them.
- core/archipelago/src/container/prod_orchestrator.rs:
ensure_container_network() creates per-manifest networks on demand;
apply_data_uid() now goes through host_sudo for mkdir -p + chown so
bind-mount roots get created and chowned without password prompts.
- core/archipelago/src/api/rpc/package/{install,update,stacks}.rs:
podman pull adds --tls-verify=false only for whitelisted registries.
- core/archipelago/src/bootstrap.rs: removes stale dev-mode systemd
override on startup (live nodes carried it from old installers).
- core/archipelago/src/config.rs: ignore ARCHIPELAGO_DEV_MODE in prod
binaries — it had been silently rerouting volumes to /tmp.
- apps/bitcoin-{core,knots}/manifest.yml: locate bitcoind at runtime
so image-layout differences don't break entrypoint.
- scripts/app-catalog-image-smoke-test.py: production catalog/image
smoke test that probes a target node before users click Install.
- .gitignore: cover .codex, .pnpm-store, __pycache__, *.bak.
Removes filebrowser.rs.bak and two stale catalog.json.bak files
(verified identical to live counterparts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hotfix: archipelago.service ExecStartPre now mkdirs /run/containers and
/var/lib/containers before the unit's mount-namespace setup tries to bind
them. Without this, fresh nodes that don't have /run/containers (e.g.
nodes provisioned without a prior podman session) fail at the namespace
step with:
Failed to set up mount namespacing: /run/containers: No such file or directory
Failed at step NAMESPACE spawning /bin/bash: No such file or directory
Existing nodes don't pick up systemd unit changes via OTA — they need a
one-time `systemctl edit archipelago` adding the same mkdir. ISO installs
from this version forward have the fix baked in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sync-perf tuning for bitcoin/bitcoin-core/bitcoin-knots/electrumx.
- Drop the --cpus=2 cap on bitcoin/electrumx variants. Script verification
is parallelizable; the cap halved IBD speed on 4-8 core machines.
- Bump bitcoin --memory 4g→8g so dbcache=4096 has headroom for mempool +
connection buffers + I/O. 4g was OOM-prone during heavy IBD.
- Bump electrumx --memory 1g→2g + add CACHE_MB=2048 + MAX_SEND=10MB.
- bitcoin-core CLI args gain -dbcache=4096 -par=0 -maxconnections=125.
- bitcoin-knots manifest matched (1024MB pruned / 4096MB full + par=0).
Future v2: host-RAM-aware dbcache scaling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resilience-validated release. Three full sweeps of the new resilience
harness against .228 confirm no shipstoppers.
Big user-visible:
- Bitcoin RPC auth durably correct via host-rendered nginx.conf bind-mount,
replaces fragile post-start exec that failed under restricted-cap rootless
podman ("crun: write cgroup.procs: Permission denied")
- Multi-container stack installs (indeedhub, immich, btcpay, mempool) now
emit phase events at every boundary so the progress bar advances
- Apps no longer vanish from the dashboard mid-install (absent-scanner skips
packages in transitional states)
- Indeedhub fresh installs work end-to-end (was 8500+ restart loop): five
missing env vars (DATABASE_PORT, QUEUE_HOST, QUEUE_PORT,
S3_PRIVATE_BUCKET_NAME, AES_MASTER_SECRET) added to install code
- Tailscale install fixed: --entrypoint string was being passed as a single
shell-line arg; switched to custom_args array
- Catalog cleaned of broken entries (dwn, endurain, ollama removed; nextcloud
restored on docker.io)
- Bitcoin Core update path uses correct image (was looking for nonexistent
lfg2025/bitcoin:28.4)
- ISO installs now allocate swap on the encrypted data partition
Infra:
- New resilience harness (scripts/resilience/) — black-box state-machine
tester, every app × every transition. Run before each release.
Sweep #3 final: PASS 107 / FAIL 12 / SKIP 14. The 12 fails are 1 cosmetic
(homeassistant trusted_hosts), 8 harness/timing false-positives, and 3
non-shipstopper tracked items. Down from 23 in baseline sweep #1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
self-update.sh previously rebuilt only the backend binary and Vue
frontend. The custom UI containers (archy-bitcoin-ui, archy-lnd-ui,
archy-electrs-ui) were left untouched forever. That meant any change to
docker/<ui>/{Dockerfile, nginx.conf, index.html, ...} never reached a
running node through OTA; it required a manual SSH + rebuild. This is
exactly why the lnd-ui port fix didnt reach .228 in v1.7.43-alpha.
Add a sync-and-rebuild stage:
1. Hash each docker/<ui>/ tree (content-only, path-stable via
`cd && find` so src and dst compare equal when identical).
2. rsync changed trees to /opt/archipelago/docker/<ui>/.
3. For each changed UI: rebuild image as the archipelago user
(rootless podman), then stop+remove+recreate the container using
the canonical spec from scripts/container-specs.sh. Port mappings,
caps, memory, and security opts all come from the spec, so the
runtime cant drift from the tree.
Also install first-boot-containers.sh into /opt/archipelago/scripts/ so
a later reconciler run or reboot picks up current orchestration logic.
Idempotent: if no UI tree changed since the last update, the whole stage
is a no-op beyond the hash compare. Verified end-to-end on .228 with a
synthetic change to lnd-ui: detection, sync, build, recreate, and HTTP
200 on both the direct container port and the host-nginx /app/lnd/
proxy.
The LND UI container was unreachable on .228 after the v1.7.43-alpha
deploy because three sources of truth disagreed on which port nginx
listens on inside the container:
- docker/lnd-ui/nginx.conf listen 8081
- docker/lnd-ui/Dockerfile EXPOSE 8080
- apps/lnd-ui/manifest.yml host networking, ports: []
- scripts/first-boot-containers.sh -p 8081:8080
- scripts/deploy-to-target.sh -p 8081:80 (de-facto)
- scripts/deploy-tailscale.sh -p 8081:80
- scripts/container-specs.sh SPEC_PORTS=8081:80
Result: podman published host 8081 to container port 80, but no one was
listening on 80 inside, so connections were reset. Canonicalize on
container:80 with host:8081 publish, matching the three deploy paths
already in agreement.
Changes:
- docker/lnd-ui/nginx.conf: listen 8081 -> listen 80
- docker/lnd-ui/Dockerfile: EXPOSE 8080 -> EXPOSE 80
- apps/lnd-ui/manifest.yml: replace host-network (never true) with
bridge networking and explicit 8081:80 port mapping, correcting a
documentation-vs-reality mismatch
- scripts/first-boot-containers.sh: -p 8081:8080 -> -p 8081:80, and
fix the internal-port comment
Verified on .228 after rebuild: curl http://127.0.0.1:8081/ returns HTTP
200 and the /app/lnd/ host-nginx proxy resolves cleanly.
Releases no longer ship as bootable ISOs. Archipelago updates are
distributed as the backend binary plus a frontend tarball referenced by
releases/manifest.json. Nodes OTA-update via scripts/self-update.sh.
Filebrowser and AIUI remain bundled inside the frontend tarball and
deployed atomically, verified present in v1.7.43-alpha release artifact
(189 AIUI files, filebrowser-client bundle).
Archived under image-recipe/_archived/ (resurrectable if ISO distribution
is reintroduced):
- build-auto-installer-iso.sh
- build-unbundled-iso.sh
- test-iso-qemu.sh
- scripts/convert-iso-to-disk.sh
- BUILD-ISO-STATUS.md, ISO-BUILD-CHECKLIST.md
- branding/isohdpfx.bin
- .gitea/workflows/build-iso-dev.yml
Updated release process docs to drop ISO references:
- scripts/create-release.sh (next-steps text)
- docs/BETA-RELEASE-CHECKLIST.md
- docs/hotfix-process.md
- README.md
Every OTA self-update and every ISO capture was implicitly relying on
/opt/archipelago/web-ui/aiui/ already being present on disk. Any node that
had its web-ui directory atomically swapped (for example by a manual
deployment shipping only neode-ui dist output) lost aiui entirely and the
AI Assistant tab fell through to the "needs to be enabled" placeholder.
self-update.sh: drop the rsync --exclude aiui preservation trick and
instead stage demo/aiui into the freshly-built dist tree before rsync.
demo/aiui in the repo is now the source of truth; every update overwrites
the on-disk copy with a matching version rather than carrying forward
whatever stale bundle happened to survive.
build-auto-installer-iso.sh: prepend demo/aiui to the AIUI search list so
ISO builds from a fresh repo clone pick it up automatically, without
requiring a side-checkout of the AIUI project or a live dev server.
This matches create-release-manifest.sh which already bakes demo/aiui
into the release tarball (lines 86-89).
Four production-code fixes merit user-visible mention: the transport
chunking data-corruption fix (real user-affecting bug for multi-chunk
mesh payloads), the avatar u16 overflow panic (backend crash on certain
seeds), the outbox TTL boundary, and the image-versions parser hardening.
Several tests had drifted from the current production behavior:
- identity_manager: create() already auto-provisions a Nostr key, so the
explicit create_nostr_key() call failed with "already exists". Rewrite
the test to assert on record.nostr_npub from create() directly.
- mesh/protocol: test_build_app_start read the app name from frame[4..]
but the v2 layout is [0:marker][1-2:len][3:cmd][4:version][5..:name].
test_identity_broadcast_roundtrip expected input DID = output DID but
the v2 decoder derives DID from the ed25519 pubkey, so the roundtrip
compares against did_key_from_pubkey_hex(&pub) now.
- mesh/bitcoin_relay: test_build_block_header_announcement asserted
sig.is_some(), but the builder intentionally emits an unsigned envelope
to fit the 160-byte LoRa limit; assert sig.is_none(). Also widen
placeholder hashes to the required 64 hex chars (32 bytes).
- update: load_mirrors() now merges default mirrors post-migration, so
the roundtrip test must assert the custom mirror survives alongside
the defaults rather than strict equality.
- wallet/cashu: test_proof_c_as_pubkey used hex that is not on the curve;
replace with the secp256k1 generator point G so parsing succeeds.
- fips: test_status_reports_no_key_pre_onboarding asserted npub.is_none(),
which fails on dev boxes where the fips daemon is already running. Keep
the !key_present assertion and drop the npub one.
Credentials tests created a fresh tempdir and immediately invoked
encrypt/decrypt, but load_encryption_key reads <dir>/identity/node_key
which did not exist, so every test failed with "node key not found".
Add a test_dir_with_node_key() helper that writes a deterministic 32-byte
key and switch all 8 call sites to it.
SessionStore::new() reads /var/lib/archipelago/sessions.json, which on
any node with an active dashboard contains live sessions that pollute
test state and cause intermittent failures. Introduce a cfg(test) only
new_for_tests(PathBuf) constructor and switch the test suite to it so
tests always start from a clean tempdir.
The parser retained any key ending in _IMAGE, so a harmless-looking
variable like NOT_AN_IMAGE="something" would be treated as a pinned
container image. Add a value-shape check: the value must contain both
a registry separator (/) and a tag separator (:) to qualify.
is_expired used age > ttl_secs, so a message with ttl_secs=0 whose age
rounded to 0 seconds was considered live forever. Switch to >= so the
zero-TTL boundary expires on the first check, matching the intuitive
meaning of TTL and the behavior the tests assert.
hue_color and accent_color computed (seed as u16) * 360, which overflows
u16 when seed >= 182 — debug builds panicked, release wrapped silently.
Widen to u32 before the multiplication.
This also unblocks several identity_manager tests that constructed avatars
through master_node_svg and were aborting on the panic.
encode_chunked() split the payload into shards first, then overwrote
the first 4 bytes of shard 0 with a u32 length header, then re-ran
Reed-Solomon to regenerate parity over the now-corrupted shards. The
decoder correctly read the length header and trimmed `[4..4+len]`
from the reconstructed buffer, but those first 4 bytes had already
been destroyed on the encode side, so every chunked mesh payload
lost its first 4 bytes.
Restructure: reserve 4 bytes for the length header up front, build
a single contiguous [len][data][pad] buffer, then split into shards.
Parity is computed over the correct shards on the first pass, no
double-encode needed.
Update test_chunk_roundtrip_medium: 500 bytes + 4-byte header = 504
bytes, which is 5 data shards (ceil(504/124)), not 4. The old test
assertion was wrong all along and masked the corruption bug because
it only checked the roundtripped bytes, which is exactly what we
need to verify. New assertion is correct.
Verified: all 7 transport::chunking tests pass.
The backend runs as `archipelago` and calls `install_log()` to append
audit lines to the install log on every install / update / remove /
start / stop / restart. Target path was /var/log/archipelago-container-installs.log,
which does not exist and cannot be created by the service because
/var/log/ is root-owned. OpenOptions errors were silently swallowed,
so the log was never written on any node.
Ship a tmpfiles.d rule that pre-creates /var/log/archipelago/ and
container-installs.log with archipelago:archipelago ownership. Move
the const path to match, keeping logs inside the directory logrotate
already rotates (image-recipe/configs/logrotate.conf). Install the
rule from both the ISO build and self-update, and apply it
immediately on self-update so existing nodes get a working log
without needing a reboot.
Verified on .228: file created, backend user can write, backend
binary rebuilt with new const.
Document that OTA updates now refresh the reconcile helper scripts,
closing the deploy gap that kept fixes to those scripts from
reaching existing nodes.
The OTA self-update path only refreshed image-versions.sh, leaving
reconcile-containers.sh and container-specs.sh frozen at whatever
version was baked into the ISO that originally provisioned the
node. Any fix to those scripts (notably the --create-missing flag
and the DISK_GB detection fix shipped this round) never reached
existing nodes, and on .228 both scripts were outright missing
because the node predated their inclusion in the ISO recipe.
Install all three helper scripts to /opt/archipelago/scripts/ on
every self-update run. Also preserve the legacy copy of
image-versions.sh at /opt/archipelago/image-versions.sh for any
older backend binaries still looking there first.
The update flow removes the old container before starting the new
one. If the update fails after removal, the rollback path tries
`podman start <name>` first, then falls back to reconcile. But
reconcile without --create-missing treats the now-absent container
as an optional one that the install flow will (re)create later,
and skips it. Result: container stays destroyed until someone
notices and runs reconcile manually.
Add --create-missing to the rollback reconcile invocation so the
fallback actually rebuilds the container from its canonical spec.
Fixes the failure mode observed on .228 where a bitcoin-knots
update left the node with no bitcoin-knots container at all.
Add two user-facing release notes for fixes shipped this round:
- Full-archive Bitcoin nodes no longer silently get pruned on reconcile
because the disk-size check was reading the OS partition.
- Failed updates can now recover via reconcile --create-missing instead
of leaving a destroyed container behind.
The reconcile spec for bitcoin-knots auto-enables prune=550 when
DISK_GB < 1000. DISK_GB was measured via `df /`, which on every
archy install reports the ~30 GB OS partition because user data
lives on a separate encrypted /var/lib/archipelago volume.
Result: every archy node with a 2 TB data drive was silently being
configured as a pruned node, and any bitcoin-knots container
recreated by reconcile would delete its historical blocks down to
the 550 MB prune window on next start.
Observed on .228 (2 TB box): blocks dir went from 384 GB to 926 MB
after a reconcile-triggered restart. Historical archive unrecoverable
without full re-IBD from genesis.
Fix: check /var/lib/archipelago first (where bitcoin data actually
lives). Fall back to / only on first-boot before the data partition
is mounted.
Context: when package update fails after remove-old-container but
before reconcile-recreate, the rollback path in update.rs tries to
restart the old container by name. If the container is already gone
(removed in step 3 of the update), rollback fails silently and the
node is left with no live container for that app but on-disk data
still intact. This is exactly the state .228 ended up in after the
reconcile-script-missing bug killed bitcoin-knots and lnd.
Reconcile was designed to only repair existing containers for
optional apps (SPEC_OPTIONAL=true): it skips "not installed" entries
on the assumption that the install RPC creates them. That safety
check is correct for normal operation but blocks recovery when an
optional-marked container has been destroyed by a failed update.
Fix: add --create-missing flag that overrides the SPEC_OPTIONAL skip.
When set, reconcile treats absent containers exactly the same as
broken containers — it creates them from the canonical spec using
the existing on-disk data directory. Narrow-scope override; the
default behaviour is unchanged.
Updated --help to document all four flags.
Verified on .228: after the failed bitcoin-core update took out both
bitcoin-knots and lnd, running reconcile --container=bitcoin-knots
--create-missing --force (as the archipelago user, not root —
podman is rootless) brought bitcoin-knots back using the pruned
chainstate at /var/lib/archipelago/bitcoin. Repeated for lnd. All
containers now running; electrumx reconnecting; UIs recovering.
Does NOT fix the underlying update-flow rollback hole (rollback
should be able to re-create a container from spec, not just restart
by name). That is a separate commit — this flag is the manual
recovery tool plus the primitive the improved rollback will call.
- AccountInfoSection.vue: append 5th bullet to v1.7.43-alpha entry
explaining that update-available badges and version comparisons
work again now that the pinned-image catalog is found at the
correct deployed path.
- docs/MARKETPLACE-QA.md: new tracker for the upcoming app-by-app
install walk on .228. Documents the per-app fix workflow, the
four layers we might need to fix at (app recipe, registry image,
backend orchestrator, frontend), status-key table for tracking
each catalog entry, and the release-notes policy for the walk.
- docs/RESUME.md: refresh with a9908597 commit, updated binary md5
on .228, and split Immediate Next Step into Phase 1 (browser
verification) and Phase 2 (marketplace walk) with a pointer to
the new tracker.
The Rust search path listed /opt/archipelago/image-versions.sh and
scripts/image-versions.sh (repo-relative for dev), but the image
recipe deploys the file to /opt/archipelago/scripts/image-versions.sh.
Production nodes therefore silently failed every lookup: find_file
returned None, load_image_versions returned an empty HashMap, and
both pinned_image_for_app and pinned_images_for_stack returned no
matches.
Symptom on deployed nodes: every container scan emitted
"image-versions.sh not found in any search path" at DEBUG level, and
the version-comparison logic in docker_packages.rs plus the
update-check logic in api/rpc/package/update.rs silently degraded to
no-op — users would not see update-available badges and upgrade RPCs
could not resolve pinned targets.
Fix: put the canonical deployed path first in PATHS, keep the older
/opt/archipelago/image-versions.sh as a fallback for not-yet-updated
nodes, and retain scripts/image-versions.sh as the dev-repo-relative
fallback. Verified on .228: backend now logs "Parsed 57 image
versions from /opt/archipelago/scripts/image-versions.sh" on scan.
Pre-existing test_parse_image_versions failure in this module is
unrelated (the NOT_AN_IMAGE assertion was broken before this change
because the parser's _IMAGE-suffix retain keeps it). Leaving that for
the general cargo-test cleanup pass.
Consolidated single-file snapshot of plan + progress for a fresh
OpenCode session to pick up the install UX polish work:
- Where we are: v1.7.43-alpha shipped, 5 commits on main, deployed
to .228, browser verification in progress.
- Immediate next step: await user's verification results from
https://192.168.1.228/ browser checklist.
- Working layout: SSHFS mount, ssh archy / archy228, deploy recipes.
- Architecture patterns: async-spawn lifecycle, phase-based install
progress, scanner kick, .23 auto-purge migration.
- Backlog: Vaultwarden exit-on-start, install log perms, 22 stale
cargo test failures, historical changelog entries left intact.
- User preferences: "best long-term first", one-by-one, no push,
Bitcoin-only, conventional commits.
Complements STATUS.md (which remains the engineering log) with a
tighter resume-the-work narrative focused on the current round.
Adds a new top section to STATUS.md covering v1.7.43-alpha:
- Round 3: phase-based install progress bar
- Round 4: post-install scanner kick for instant Launch button
- Round 5: .23 VPS retirement, .168 promoted to Server 1
- Config migration: auto-purge .23 from saved registry/mirror JSONs
- Changelog: new v1.7.43-alpha entry in AccountInfoSection
All 5 commits, deployment md5, verification notes, and git remote
cleanup captured. Round 2 rollback command still valid for the full
stack since backups predate every round in this session.
Four release-note bullets describing the user-visible changes shipped
in this round:
- async-spawn install/update/uninstall (UI no longer freezes)
- phase-based install progress bar (Preparing through Finalizing)
- scanner kick post-install (Launch button appears immediately)
- .23 Hetzner VPS retired, .168 OVH promoted to Server 1 with
auto-purge migration for existing nodes
Matches the tone of existing changelog entries: what changed from the
operator's perspective, not internal implementation detail.
load_registries + load_mirrors normally only ADD missing defaults to
the persisted JSON — explicit removals stick. After retiring the .23
Hetzner VPS we need the opposite: existing nodes have .23 baked into
their saved configs and would spend seconds per install/update timing
out against a dead host until the operator manually removes it via
the Settings UI.
Add a targeted one-time migration in both loaders: if any saved entry
has 23.182.128.160 in its URL, drop it on load and rewrite the file.
This is an exception to the usual "explicit removals stick" rule —
the user never chose to add this mirror, it was a default.
Narrow-scope migration (one hardcoded IP match, no schema version)
because the cost/benefit of a general migration system isn't worth
it for a single decommissioned host. Future retirements can follow
the same pattern.
The Hetzner VPS at 23.182.128.160 was decommissioned. Replace it
everywhere with the OVH VPS at 146.59.87.168, which was previously
the tertiary mirror.
- update.rs: drop DEFAULT_TERTIARY_MIRROR_URL, promote .168 into
the secondary slot as "Server 1 (OVH)"; tx1138 becomes Server 2.
Default mirror list shrinks from 3 to 2.
- container/registry.rs: default RegistryConfig drops .23, promotes
.168 to Server 1 / priority 0, tx1138 stays Server 2 / priority 10.
- api/rpc/package/config.rs: trusted-registry allowlist swaps .23
for .168.
- api/handler/mod.rs: app-catalog fallback URL uses .168.
- neode-ui/views/marketplace/marketplaceData.ts: REGISTRY uses .168.
- scripts/image-versions.sh: ARCHY_REGISTRY_FALLBACK uses .168.
- image-recipe/build-auto-installer-iso.sh: installer ISO registries
use .168 (both podman registries.conf and backend registries.json).
Tests updated to assert on the new 2-entry default lists (registry +
mirror). URL-parser fixture tests in update.rs retain .23 strings —
they exercise string-parsing logic, not mirror policy.
Git remotes: dropped `gitea-vps` and the .23 push URL on the `origin`
multi-push alias (not part of this commit — pure working-copy change).
After install completes, the async-spawn wrapper wrote state=Running
but the skeletal install-time manifest (interfaces: None) persisted
until the next scheduled 60s scan. The frontend saw state=running but
hasUI=false and hid the Launch button for up to a full minute.
Add a shared Notify/watch pair between RpcHandler and the scan loop:
- scan_kick (Notify): scan loop selects! between the 60s interval
and this notify, running immediately on either.
- scan_tick (watch<u64>): scan loop bumps the counter after each
completed scan so callers can await completion.
Install and update success paths now call kick_scanner_and_wait before
flipping to Running. The scan merges via merge_preserving_transitional
(state stays Installing/Updating, manifest refreshed from live podman
with interfaces.main.ui populated from real port bindings). 2s timeout
falls back to pre-fix behavior on slow podman — no regression.
Podman emits zero parseable progress when stderr is piped (no TTY), so
the old byte-counter regex never matched in real installs. Users saw
0% for the whole pull, then a jump to 95%, then silence through
create-container, health-check, and post-install hooks.
Replace with 7 explicit lifecycle phases wired through install.rs and
update.rs: Preparing (5%), PullingImage (20%), CreatingContainer (70%),
StartingContainer (80%), WaitingHealthy (88%), PostInstall (95%),
Done (100%). Each maps to a fixed UI progress and status message.
Frontend PHASE_INFO mapper in stores/server.ts prioritizes phase when
present, falls back to byte-counter for legacy. A Math.max forward-only
guard ensures the bar never regresses. Deleted the duplicate watcher
in Discover.vue that was fighting the store's watcher with stale byte
logic. Added shimmer CSS on the fill (with prefers-reduced-motion
opt-out) so the bar looks alive during long phases.
create_installing_entry hardcoded /assets/img/app-icons/<id>.png for
every new install. About half the app icons ship as .svg or .webp
(lnd.svg, vaultwarden.webp, bitcoin-knots.webp, mempool.webp), so the
browser 404s on the wrong extension and renders the default broken-image
glyph for the 10-30s window before the scanner refreshes with real
manifest data.
Send empty icon. The frontend's icon computed in AppCard.vue falls
through to curatedMap which has correct extensions for bundled apps,
and handleImageError still guards any remaining misses with a
placeholder SVG.
With the backend flipped to async-spawn, install/uninstall/update return
immediately with a { status, package_id } envelope. Client timeouts of
45m/11m were a leftover from synchronous handlers and masked real RPC
failures.
Drop all install/uninstall/update RPC timeouts to 15s. Progress and
terminal state still arrive through the live state stream — the RPC
only needs to confirm the spawn was accepted.
Return-type annotations updated in rpc-client.ts and stores/server.ts.
Five direct rpcClient.call sites across Marketplace.vue, Discover.vue,
and MarketplaceAppDetails.vue updated with the shorter timeout.
Extend the async-spawn treatment previously shipped for Stop/Start/Restart
to the three remaining long-running lifecycle RPCs. Each wrapper validates
params, rejects duplicate in-flight ops, flips state to the transitional
variant (Installing/Removing/Updating), then spawns the existing inner
handler on tokio. RPC returns immediately with { status, package_id }; the
spawn task owns the terminal state write.
Install and update success arms explicitly set state=Running. The scan
loop merge (merge_preserving_transitional) refuses to overwrite
transitional states, so the spawn task must write the terminal state.
Uninstall's inner handler removes the entry entirely, so no explicit
terminal write is needed there.
Dispatcher and handler now thread self as Arc<Self> / &Arc<Self> so
spawned tasks can hold their own Arc without extra field cloning.
Transient install entry uses empty icon string. Hardcoding
/assets/img/app-icons/<id>.png 404s for apps that ship .svg or .webp
assets, which produces a broken-image flicker until the scanner refreshes
with manifest data. Empty string causes the frontend's icon computed to
fall through to the curated map, which has correct extensions.
Removed the inner "already updating" guard in update.rs — the wrapper
now owns duplicate-op detection for all three operations.
Records the four landed commits, the .228 deploy (binary + frontend
paths, backups, md5), the manual LND Stop verification, and the
rollback incantation. Leaves the older "NEXT SESSION" design block
in place as historical reference with a note that it's stale.
Adds a follow-ups list: chaos matrix is now unblocked, bundled-app
RPCs are still sync (deprecate or mirror-async?), transitional_since
is in-memory only, and there are 22 pre-existing test failures in
unrelated modules that should get their own cleanup pass.
The app card and details view previously used a pair of Start/Stop
buttons whose labels were driven off isAppLoading(), a client-side
"I just clicked the button" flag. When the backend's graceful stop
took longer than the RPC round-trip (up to 600s on bitcoin-core),
the flag cleared while the container was still shutting down, the
UI flipped back to "Running" as soon as the next 10s scan saw the
still-alive container, and the user had no indication the stop was
still in flight.
Now that the backend flips PackageState to Stopping / Starting /
Restarting / Installing / Updating / Removing for the duration of
each lifecycle operation and the scan loop preserves those states,
the UI can drive its label off the container state itself. A single
full-width primary button replaces the Start/Stop pair. Its label,
color, and disabled state come from getAppVisualState(), which
collapses resting states (exited/created/paused/installed) into
"stopped" and passes transitional states through untouched.
Changes:
- container-client.ts: widen ContainerStatus.state union to include
the six transitional variants plus "installed". Add
restartContainer() calling the new container-restart RPC.
- stores/container.ts: add getAppVisualState() computed and the
restartContainer() action.
- ContainerApps.vue: single primary button (Start / Stop / Starting
/ Stopping / Restarting etc.) plus a separate circular Restart
button visible only when running. Critically, handleStartApp and
handleStopApp now route through store.startContainer and
stopContainer (which call container-start / container-stop, the
async RPCs) instead of the legacy synchronous bundled-app-start /
bundled-app-stop path. Transitional-state polling widened from
just "created" to the full set of transitional variants.
- ContainerAppDetails.vue: same single-button pattern, Restart
button now calls container-restart instead of the old
stop-sleep-start sequence, added 2s polling interval for
transitional states.
- components/ContainerStatus.vue: widen state prop to match the
shared union, render transitional labels with a trailing ellipsis
and a yellow dot.
No new tests — this is presentation logic. Manual verification on
.228 will confirm the end-to-end async path: click Stop on LND,
button becomes "Stopping" in under a second, stays that way for
roughly 5 minutes, then flips to "Start" with a grey dot. The UI
must never revert to "Running" mid-stop.
The 30s package scan loop used to blindly overwrite every package
entry from podman inspect. While a user-initiated Stop / Start /
Restart was in flight, the RPC spawn task would flip the state to
Stopping / Starting / Restarting, the next scan would see podman
still reporting "running" (for the duration of the graceful stop,
up to 600s for bitcoin-core), and clobber the transitional state
back to Running. The dashboard would then flip Running -> Stopping
-> Running -> Stopped, making it look like the stop had silently
failed until it eventually completed.
The merge loop now treats transitional variants (Stopping, Starting,
Restarting, Installing, Updating, Removing, and the three backup
variants) as owned by the RPC spawn task. For those variants,
merge_preserving_transitional keeps the existing state while still
taking live observability fields (health, exit_code, installed,
lan_address, manifest, static_files, available_update) from the
fresh scan so the UI continues to see live health readings.
Adds an escape hatch via a per-scan transitional_since side table:
if a package has been in a transitional state for more than 1200s
(2x the longest graceful stop at 600s on bitcoin-core), the scan
loop assumes the spawn task died without cleanup and overrides with
podman's live state. Prevents a crashed background task from wedging
a package in Stopping forever.
Three unit tests cover the merge rule, the observability passthrough,
and the transitional-variant classifier.
RPC handlers no longer block on podman operations. container-stop on
bitcoin-core used to hold the connection for up to 600s while the UI
showed a frozen spinner; it now returns in under a second with
{status: stopping} after flipping the package state to Stopping and
broadcasting over WebSocket. Same treatment for container-start and
the new container-restart route.
Widens container-list state mapping to emit the transitional variants
(stopping, starting, restarting, installing, updating, removing,
installed, and the backup states) instead of collapsing them to
"unknown". Keeps the mapping in sync with the UI ContainerStatus.state
union so the dashboard can render the right transitional label.
Mirrors the treatment in package/runtime.rs for package.start,
package.stop, and package.restart. The body of each handler is lifted
into pure do_package_* helpers that the background task runs; state
flipping is bracketed around the spawn with revert on error. The
pre-existing post-start exit-check verification and restart stop+start
fallback run inside the spawned task, not the RPC body.
Adds container-restart route to the dispatcher. mark_user_stopped
continues to run BEFORE the spawn, preserving the ordering contract
with the crash recovery layer at runtime.rs:145-148.
Introduces a new RPC-layer helper that bridges the synchronous
ContainerOrchestrator trait with RPC handlers that must return in <1s.
The helper flips the package state to a transitional variant
(Stopping / Starting / Restarting) in the StateManager so WebSocket
clients see the live label immediately, then tokio::spawns the
actual orchestrator call. On success it writes the final state; on
error it reverts to the pre-transition state and logs via
install_log().
The ContainerOrchestrator trait stays synchronous so the reconciler,
boot flow, unit tests, and chaos harness keep deterministic
behaviour. Async only lives in the RPC layer.
Not wired to any handler yet — Commit 2 consumes this helper.
Widens install_log visibility from pub(super) to
pub(in crate::api::rpc) so the new sibling module can reach it.
Dedicated section covering the file-ops-via-mount + git/cargo-via-ssh
split that makes this dev setup work. Includes:
- Exact running mount command (pulled from ps)
- macFUSE + sshfs-mac brew install path
- Health check + recovery sequence for when mount hangs (it will)
- Full which-path-for-which-operation table
- Don't-do list (cargo from mount, rsync without AppleDouble exclude, etc)
- Cache caveat and inode-sharing note between mount and SSH views
No code change.
Captures full design for the next session:
- Full bug sequence (5.5min blocking RPC + 30s scan clobbering transitional state)
- 4-commit implementation order with exact file:line targets
- Single-button UI spec with full label table
- Verification gates including manual LND stop test on .228
- Architectural decision: spawn lives in RPC layer, orchestrator trait stays sync
No code change yet; next session implements.
The previous code computed HOST_GATEWAY from `ip route show default` to
work around an alleged podman 4.3.x limitation. Two problems:
1. The comment was wrong. Podman 4.4+ supports --add-host=host-gateway
natively, and we ship 5.4.2.
2. More critically, `ip route show default` returns the LAN router
(e.g. 192.168.1.254) — the gateway to the internet, not the gateway
to the host. Every container configured with DAEMON_URL or
--bitcoind.rpchost=host.containers.internal was therefore dialing
the WiFi router instead of the host machine, silently failing.
Symptoms this caused on .228:
- LND crash-looped with "dial tcp 192.168.1.254:8332: connection refused"
- Dashboard showed no LND connect details or QR
- ElectrumX DAEMON_URL broken; stuck at 2 KB index for days
- Any service reaching bitcoin-core through the `archy-net` bridge
Replace the computed value with the literal string "host-gateway",
which podman translates to the correct in-network gateway at container
start. Also drop the stale HOST_GATEWAY reference in the Tor-bootstrap
branch (it always fell back to TARGET_IP anyway). Verified on .228:
after recreating bitcoin-core/electrumx/lnd with the new flag, LND
reached the chain backend, ElectrumX resumed indexing, and the
dashboard /lnd-connect-info endpoint succeeded.
LND's admin.macaroon is owned by a rootless-podman subordinate UID
(typically 100000) with mode 640. The archipelago server runs as UID
1000 and cannot read the file directly, which caused every dashboard
LND RPC (getinfo, connect-info, export-channel-backup) and lnd_client
to fail with "Failed to read LND admin macaroon".
Add a read_lnd_admin_macaroon() helper that first tries a direct read
(for operators who have relaxed permissions) then falls back to
`sudo -n cat`, mirroring the pattern already used for Tor hidden
service hostnames in handle_lnd_connect_info. Centralise the canonical
macaroon path as LND_ADMIN_MACAROON_PATH and route all four callers
through the helper.
Verified on .228: GET /lnd-connect-info now returns 200 with cert,
macaroon, and tor_onion fields. Dashboard QR/connect-string UI
unblocked.
Logs Step 9 acceptance evidence, the two bugs caught and fixed during
the hot-swap (parse_memory_limit IEC suffix bug in 732df1b8 and
cgroup Delegate in ba83f9bc), and outlines the Step 10 plan for .116.
Adds Delegate=memory pids cpu io to the archipelago.service unit.
Context: the service runs as User=archipelago under system.slice with
rootless podman. When podman creates transient libpod-*.scope units for
containers under user.slice, systemd needs the caller to hold
CAP_SYS_ADMIN on the target cgroup subtree \u2014 which happens iff
Delegate= lists the controllers we want to set. Without Delegate, any
future code path that goes through the podman CLI (runtime.rs) instead
of the libpod HTTP API (podman_client.rs) would hit MemoryMax
rejections that have exactly the same symptom as the bug I just fixed
in parse_memory_limit but with a completely different root cause.
Belt-and-braces: current production path uses PodmanClient and was
fixed in the preceding commit. But the DockerRuntime CLI path in
runtime.rs:262-268 (cmd.arg("--memory")) is still reachable via
AutoRuntime fallback on hosts without podman, and future rust
orchestrator code may legitimately need cgroup delegation. This
directive is no-op harmful on hosts that already delegate upstream
(systemd gracefully handles duplicate/nested delegation).
The libpod HTTP API path (PodmanClient::create_container) ran manifest
memory_limit values like "128Mi" through parse_memory_limit which
lowercased+trim_end_matches("m"), leaving "128i" which parse::<f64>()
rejected. The resulting None became 0 via .unwrap_or(0), and podman
serialised that into the OCI config as memory.limit:0. At container
start time systemd then rejected MemoryMax=0 with "Value specified in
MemoryMax is out of range".
Silently wrong for every manifest in apps/ that uses Kubernetes-style
suffixes (all of them). Became visible on .228 when Step 9 first
exercised the ProdContainerOrchestrator path for bitcoin-ui and lnd-ui
installs \u2014 the old first-boot-containers.sh bash script used podman
run --memory 128m directly, which podman-the-CLI parses correctly, so
the bug never surfaced before.
Two parts:
- parse_memory_limit now recognises Ki/Mi/Gi/Ti (IEC binary, what k8s
and our manifests use), kB/MB/GB/TB (SI decimal), k/K/m/M/g/G/t/T
(docker shorthand, treated as IEC binary for backwards compat), and
bare byte integers. Filters out zero/negative results.
- create_container omits the memory/cpu fields entirely when the
manifest has no limit or parsing fails, rather than emitting 0. The
libpod API treats absent as unlimited; 0 is "set MemoryMax=0" which
systemd rightly rejects. Defence in depth against the next weird
suffix someone puts in a manifest.
Six regression tests in the new tests module cover IEC, SI, shorthand,
raw bytes, invalid input (empty/garbage/0/negative), and whitespace.
BootReconciler (in-process, 30s interval, spawned from main.rs as of
Step 6 commit 48f08aa3) fully replaces the timer-driven bash
reconciliation path. Delete the systemd unit + timer and their
ISO-builder touchpoints.
Removed:
- image-recipe/configs/archipelago-reconcile.service
- image-recipe/configs/archipelago-reconcile.timer
- image-recipe/build-auto-installer-iso.sh L412-413 (COPY unit+timer)
- image-recipe/build-auto-installer-iso.sh L449 (systemctl enable)
- image-recipe/build-auto-installer-iso.sh L542-543 (cp to WORK_DIR)
Kept (intentionally):
- scripts/reconcile-containers.sh
- scripts/container-specs.sh
Reason: core/archipelago/src/api/rpc/package/update.rs still invokes
reconcile-containers.sh at two sites (OTA update + rollback paths).
Porting those call sites to ContainerOrchestrator::upgrade() requires
manifests for every container update.rs might touch — that scope
belongs in Step 8b. Until then the script stays on disk, just no
longer runs on a periodic timer.
No Rust code changes. cargo check -p archipelago clean, 6 pre-existing
warnings. Skipped full ISO rebuild validation per user decision —
edits are 5 textual deletions with zero behavioral ambiguity; Step 9
live hot-swap on .228 will catch any regression.
Discovered during Step 8 execution that first-boot-containers.sh
creates 30+ containers with per-container logic (wallet loads, DB
init, rpcauth derivations, post-create health waits) and does
substantial non-container setup (secret gen, rootless-podman subuid
chowns, Tor hostnames, WireGuard, firewall, nostr-relay). Only 3 of
the 30+ containers have manifests today (the UIs from Step 7).
Deleting the bash in a single step bricks first-boot on fresh
installs. Split into:
- 8a: delete reconcile-containers.sh + container-specs.sh + reconcile
systemd unit + timer. BootReconciler fully covers these. Safe,
atomic, no manifest porting required.
- 8b: port remaining ~25 containers into apps/<id>/manifest.yml. One
manifest per commit, validated against current bash behavior.
Multi-day scope.
- 8c: rename first-boot-containers.sh -> first-boot-setup.sh, strip
container ops, keep secret/dir/Tor/WG/firewall setup. Final
one-way door, requires 8b complete.
Replaces the first-boot-containers.sh sed/envsubst approach with a
Rust-native render step bound into the ContainerOrchestrator lifecycle.
- New container::bitcoin_ui module: embeds the nginx.conf template via
include_str!, reads the plaintext RPC password from
/var/lib/archipelago/secrets/bitcoin-rpc-password, substitutes
{{BITCOIN_RPC_AUTH}} with base64(archipelago:<password>), and atomic-
writes (tmp + rename) to /var/lib/archipelago/bitcoin-ui/nginx.conf.
Idempotent: byte-compares before writing so unchanged input is a
no-op (no inode churn, no restart cascade).
- ProdContainerOrchestrator gains run_pre_start_hooks(app_id) returning
HookOutcome::{Rewritten, Unchanged}. Fires in install_fresh before
create_container, and in ensure_running: on Running + Rewritten
triggers a restart; on Stopped re-renders then starts.
- bitcoin-ui Dockerfile no longer COPYs a default.conf; the file now
arrives via runtime bind-mount of the rendered config. If the bind-
mount is ever missing, nginx starts with no site configured and
returns 404 everywhere — safe failure vs. serving upstream RPC with
a stale Authorization header.
- apps/{bitcoin,electrs,lnd}-ui/manifest.yml land as first-class
manifests. bitcoin-ui declares the bind-mount target and a dependency
on bitcoin-core; electrs-ui and lnd-ui declare their own deps and
health checks.
- 8 new unit tests on the render fn (idempotency, rotation, trimming,
missing/empty secret, template invariants) plus an integration test
asserting install(bitcoin-ui) actually lands a substituted nginx.conf
on disk via the hook. 39/39 container:: tests pass
(test_parse_image_versions pre-existing failure unchanged, out of
scope).
Step 6 of the rust-orchestrator migration. Construct the container
orchestrator once in main.rs, call load_manifests + adopt_existing
immediately after Config::load, log the adoption report, and spawn
BootReconciler::run_forever with the 30s default interval. Thread the
orchestrator through Server::new -> ApiHandler::new -> RpcHandler::new
so the reconciler and RPC layer share one instance.
Wire a tokio::sync::Notify through the SIGTERM/SIGINT shutdown path so
the reconciler exits cleanly alongside the server drain. Uses notify_one
so the signal stores a permit if the reconciler is mid reconcile_all
when the signal fires.
Delete the commented-out run_boot_reconciliation block in main.rs that
documented the prior bash-script approach being unsafe on unbundled
installs — the new reconciler is manifest-driven and only touches apps
present in /opt/archipelago/apps, fixing that concern.
cargo check -p archipelago clean (6 pre-existing dead-code warnings on
trait methods not yet exercised until Step 9 hot-swap). Container test
suite 43/44 pass; the one failure (container::image_versions::
test_parse_image_versions) is pre-existing and unrelated.
Step 5 of the rust-orchestrator migration. New file boot_reconciler.rs holds a
small Tokio task that calls ProdContainerOrchestrator::reconcile_all() on a
30-second cadence (answered design Q3).
* BootReconciler::new(orch, interval, shutdown) — shutdown is an Arc<Notify>
so callers can trigger a graceful exit without pulling in tokio-util.
* run_forever(self) — does one reconcile immediately, then loops on
tokio::select! { sleep_until | shutdown.notified() }. Shutdown interrupts
the sleep but never an in-flight reconcile_all call.
* Per-pass outcomes are logged at debug/warn; failures never propagate out
because reconcile_all already absorbs per-app errors into ReconcileReport.
Four tokio::test(start_paused = true) tests verify the loop cadence against a
CountingRuntime test double:
* initial_pass_fires_immediately — first reconcile runs with no delay
* second_pass_fires_after_interval — second pass fires after exactly
interval elapses in paused-clock time
* shutdown_terminates_loop — notify_one() lets run_forever return
* failure_in_one_pass_does_not_stop_loop — the loop keeps ticking even when
the first pass had to install a missing container
Not wired into main.rs yet — that is Step 6. Re-exported from container::mod
as BootReconciler + RECONCILER_DEFAULT_INTERVAL for the wire-up step.
Records acceptance evidence for Steps 1-4 (container tests 21/21 pass, build
clean with expected unused-method warnings) and queues the BootReconciler
implementation for Step 5.
The laptop mounts ~/Projects/archy over SSHFS and macOS finder / Spotlight
sidecars write ._<name> resource-fork files alongside every edit. They are
noise; keep them out of git.
Step 4 of the rust-orchestrator migration. Unifies the container lifecycle
surface behind a single trait so the RPC layer stops caring whether it is
talking to the dev or prod orchestrator.
* New trait core/archipelago/src/container/traits.rs: ContainerOrchestrator
with install / start / stop / restart / remove / upgrade / status / list /
logs / health, all keyed by app_id. Every method is async_trait-based.
* ProdContainerOrchestrator: the lifecycle methods are moved from inherent
impl into the trait impl (avoids name-shadowing recursion). Adoption and
reconcile remain inherent since only main.rs / BootReconciler call them.
* DevContainerOrchestrator: new trait impl that forwards to the existing
Dev-named methods, applying the dev container-name + port-offset rules
internally. New load_manifest_for() helper resolves app_id to
<data_dir>/apps/<app_id>/manifest.yml so trait-level install(app_id)
works in dev too. install_container(manifest, path) stays inherent for
the manifest-path RPC shape.
* RpcHandler now holds Option<Arc<dyn ContainerOrchestrator>> and, when in
dev mode, a separate Option<Arc<DevContainerOrchestrator>> for the
manifest_path install RPC. In prod mode RpcHandler::new() constructs a
ProdContainerOrchestrator and calls load_manifests() at startup.
* All seven container-* RPC guards no longer say dev mode required.
container-install still requires dev mode because its manifest_path
argument has no prod meaning; every other container RPC now works in both
modes via the trait.
BOOT STILL DOES NOT USE THIS. main.rs wire-up (Step 6) and BootReconciler
(Step 5) come next. Until then the prod orchestrator is constructed but nothing
populates /opt/archipelago/apps so it has zero manifests to manage, matching
the pre-Step-4 behaviour.
Verification: cargo build -p archipelago clean (11 expected unused method
warnings for methods not yet wired from main.rs). cargo test -p archipelago:
all 21 container::* tests pass (16 prod_orchestrator + 5 others). 24 other
test failures are pre-existing and unrelated (identity_manager / session /
wallet / mesh / credentials — all independently flaky on file-backed state).
Step 3 of the rust-orchestrator-migration. New file prod_orchestrator.rs (999 LOC)
implements the full public surface that will replace scripts/first-boot-containers.sh:
* install / start / stop / restart / remove / upgrade / status / list / logs / health
* adopt_existing: read-only scan that claims containers matching our manifests by
name, without recreating — preserves the v1.7.42 fixture on .116.
* reconcile_all: level-triggered, per-app failures collected rather than aborting.
* install_fresh: build-or-pull (Step 2 trait methods), relative build contexts
resolved against the manifest directory.
Naming rule (answered design Q1): UI app IDs (bitcoin-ui/electrs-ui/lnd-ui) get the
archy- prefix; backends keep their bare ID. An explicit extensions.container_name
always wins. Codified in compute_container_name() with unit tests for all three tiers.
Concurrency (answered design Q4): per-app tokio::sync::Mutex<()> created lazily,
protecting every mutating op against the reconciler loop. Acquiring the per-app
lock only needs a read lock on the map, so independent apps do not serialize.
16 tests: 3 sync naming rule tests + 13 tokio async tests covering install (pull,
build-absent, build-present, relative-context), reconcile (noop/exited/missing/
mixed-failure), adopt-by-name, upgrade sequence ordering, list filtering, health
state mapping, and unknown-app-id rejection. All pass.
Not wired into main.rs yet — that is Step 6. Crate builds clean with expected
unused warnings for the new re-exports.
Adds two methods to ContainerRuntime so the upcoming ProdContainerOrchestrator
can inspect local image storage and build images from BuildConfig:
- image_exists(image_ref) -> Result<bool>: local-storage check only, does
not consult registries. Distinguishes exit 0 (present) from exit 1
(absent) from other failures (environment error).
- build_image(&BuildConfig) -> Result<()>: shells out to podman/docker
build with -t, -f, deterministically-sorted --build-arg pairs, and the
context path last.
Implemented on all three runtimes:
- PodmanRuntime: new podman_cli helper shells out alongside the existing
HTTP API calls (build and image inspect are awkward over the HTTP API)
- DockerRuntime: native docker CLI, same exit-code semantics
- AutoRuntime: delegates to the selected inner runtime
Argv construction extracted into pure build_args_for_podman helper so it
can be unit-tested without a real podman. 4 new tests cover minimal args,
custom Dockerfile path, deterministic build-arg sorting (guards against
HashMap iteration non-determinism), and context-is-last (positional arg
placement is load-bearing for podman build).
Step 2 of docs/rust-orchestrator-migration.md. 25/25 tests pass.
ContainerConfig.image is now Option<String>, mutually exclusive with a new
optional ContainerConfig.build: Option<BuildConfig>. Exactly one of image
or build must be present, enforced in AppManifest::validate.
Adds ResolvedSource enum (Pull | Build) and ContainerConfig::resolve +
::image_ref helpers so the orchestrator can treat pull and build uniformly.
All 26 existing pull-only manifests continue to parse unchanged
(covered by existing_pull_only_manifests_still_parse test).
Call sites updated: podman_client, runtime::DockerRuntime, dev_orchestrator.
Dev orchestrator errors out cleanly on Build sources until Step 2 lands
build_image support on the runtime trait.
Step 1 of docs/rust-orchestrator-migration.md. 10 new unit tests, all pass.
Also includes: docs/rust-orchestrator-migration.md (design spec) and
docs/STATUS.md resume section for the next session.
Closes failure mode adjacent to FM3 (docs/bulletproof-containers.md): on
a syncing pruned node, bitcoind's RPC thread blocks for 5-10s during block
validation. The old 10s client-side timeout was rejecting roughly 30% of
UI calls even though the node was perfectly healthy. 20x stress test on
the live .116 node (caught in IBD catch-up at block 797k) used to drop
10 of 20 calls; now drops 0 of 20.
What changed:
- core/archipelago/src/api/rpc/bitcoin.rs: bitcoin_rpc_call now retries up
to 3 times with 500ms and 1500ms backoffs between attempts. Only
transient transport errors (timeout, connect refused, send/recv IO)
trigger retry. A well-formed bitcoind error response is surfaced
immediately - real RPC bugs are never masked.
- Per-attempt hard deadline (tokio::time::timeout, 15s) layered on top
of reqwest's own timeout, so DNS starvation or TLS wedging can't
steal the entire retry budget.
- handle_bitcoin_getinfo client builder gained a 3s connect_timeout
so a dead bitcoind is fast-failed inside the first attempt instead
of eating the whole 15s.
- Retry policy extracted into a RetryConfig struct so tests can dial
down timeouts to ~100ms per attempt. Production defaults live in
RetryConfig::production().
Not changed (tracked as follow-up):
- mesh/mod.rs bitcoin_rpc_getblockcount and related helpers use the
same 10s-timeout pattern. Not migrated to the new wrapper in this
release; scheduled for v1.7.43 alongside the render_bitcoin_conf
work.
- lnd/info.rs and electrs_status have similar 10s/15s timeouts but
different failure profiles - audit first, migrate only the ones
that actually exhibit the bug.
Tests: 6 new unit tests under api::rpc::bitcoin::tests, all passing.
Uses an in-process hyper server (already a transitive dep) to simulate
bitcoind responses; no new crates required.
- happy_path_first_attempt: no retry when first attempt succeeds
- retries_on_timeout_then_succeeds: first attempt times out, second
succeeds, returns OK (uses a short-timeout RetryConfig so the test
runs in <1s instead of 15s)
- retries_exhausted_on_persistent_connect_refused: all attempts fail
against a closed port, error bubbles up, elapsed time confirms
backoffs actually ran
- does_not_retry_on_rpc_level_error: bitcoind-returned error body is
surfaced immediately, no retry
- does_not_retry_parse_errors: non-JSON response (e.g. 503 with html
body) is NOT retried - guards against the tempting "retry all
non-2xx" mistake that would mask real bitcoind misconfig
- retry_budget_invariants: asserts total wall-time ceiling stays
under 60s so a bumped constant can't silently hang a UI call
forever
Validated live on .116: 20/20 bitcoin.getinfo calls succeed during IBD
catch-up (chain at block 797419 -> 797464), vs ~40% baseline under the
old 10s timeout. Worst-case latency was 48.9s during peak validation;
happy-path latency (cached result) remains 28-77ms.
Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 +
v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500)
with no recovery path short of SSH. This release adds a self-check
guardrail to the update flow.
What changed:
- apply_update() writes a pending-verify marker with old+new version and
a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is
present and within its freshness window, the new binary waits 15s for
nginx + backend to settle, then probes https://127.0.0.1/ every 5s for
up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and
nothing else happens.
- On window-exhaust, the new binary:
1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts>
(quarantined, not deleted, so we can post-mortem).
2. Restores web-ui.bak on top of web-ui.
3. Calls rollback_update() to restore the previous binary.
4. Updates state.current_version to reflect the rollback.
5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without
probing, so a crashed-during-startup marker from weeks ago cannot
spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of
tokio::fs::copy, so it escapes the service's ProtectSystem=strict
mount namespace. Without this, the rollback silently failed with
EROFS on /usr/local/bin and orphaned the rollback - the exact
opposite of what auto-rollback is for.
Tests: 4 new unit tests in update::tests covering marker round-trip,
absent-marker noop, no-panic on verify_pending_update with nothing to
verify, and an invariant assert that the 90s probe window stays below
the 600s stale threshold. All passing.
Side fix: scripts/create-release-manifest.sh was dying with exit 141
(SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail.
Replaced with a single awk NR==1 that doesn't short-circuit the upstream
pipe, so the release-build flow is idempotent again.
v1.7.38 and v1.7.39 both shipped with `./` inside the frontend tarball marked
drwx------ (700). Tar extraction preserves archive perms, so every node that
pulled the OTA landed with /opt/archipelago/web-ui at 700, nginx (www-data)
returned 500 "permission denied" on every page, and the browser showed
"Internal Server Error nginx". .116 hit this on both v1.7.38 and v1.7.39
rollouts. The v1.7.39 runtime self-heal in main.rs was the wrong layer —
systemd's ReadOnlyPaths namespace made /opt/archipelago read-only from inside
the archipelago service, so chmod from there returned EROFS.
Root cause: create-release-manifest.sh used mktemp -d (700 default umask) for
staging, then tar preserved that 700 in the archive's root entry.
Fix the archive itself:
- chmod 755 staging dir + `find -type d -exec chmod 755` + `-type f chmod 644`
before tar, so the on-disk entries are correct.
- tar --owner=0 --group=0 --mode='u=rwX,go=rX' to normalize archive perms
belt-and-braces in case file-mode drift ever reappears.
- Post-tar verify: `tar tvzf | head -1` must show drwxr-xr-x at root, or
the release script aborts before the manifest is even generated.
Binary unchanged semantically — the main.rs self-heal stays in as a last-
resort belt (can't hurt on nodes whose FS isn't namespace-isolated), and the
update.rs in-extractor chmod stays in so v1.7.40-onwards extractors are
double-safe. The authoritative fix is the archive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1.7.38 shipped with an OTA bug: the tar-extracted staging dir inherited 700
perms and nginx (www-data) returned 500/403 on every request after the swap.
.116 hit this on rollout; had to chmod by hand to recover.
- update.rs: after extraction, explicitly chmod 755 dirs + 644 files on the
new staging dir before the mv into place, so nginx can stat/serve them.
- main.rs: self-heal on startup — if /opt/archipelago/web-ui is not
world-readable, run `sudo chmod -R u=rwX,go=rX` to repair. This is what
rescues nodes upgrading from v1.7.37/v1.7.38, since their extractor
(running on the old binary) doesn't have the chmod fix yet — the new
binary's first boot fixes the mess before nginx serves a single request.
Everything v1.7.38 shipped is still in this release:
- auth.rs auto-heals is_onboarding_complete() from setup_complete +
password_hash so nodes don't bounce back to /onboarding/intro after
browser clear / reboot / update
- useOnboarding tri-state: backend-unreachable no longer defaults to intro
- login sounds gated by isFirstInstallPhase() — silent after onboarding,
typing sounds unaffected
- FIPS app / Nostr Relay / Nostr VPN / Routstr / Penpot removed from
catalog + frontend + Rust + docker + icons; 15 image versions deleted
from tx1138, .168, gitea-local
- AIUI baked into release tarball via demo/aiui/
- prebuild hook syncs app-catalog/catalog.json → public/catalog.json
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- auth.rs now infers onboarding-complete from setup_complete + password_hash so
nodes stop bouncing users through the intro wizard after browser clear / update
/ reboot; the flag self-heals to disk on next check
- frontend: "backend uncertain" no longer defaults to /onboarding/intro —
useOnboarding returns null + callers poll / retry instead of flashing the wizard
- login sounds (synthwave, welcome voice, pop, whoosh, oomph) gated by
isFirstInstallPhase(); typing sounds unaffected
- removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog,
frontend config, Rust AppMetadata + install dispatch + install_penpot_stack;
docker/fips-ui + docker/nostr-vpn-ui + apps/penpot dirs and 5 icons deleted;
15 image versions deleted from tx1138, .168, gitea-local registries (.160
Gitea was 502 at release time — follow-up)
- AIUI baked into frontend release tarball via demo/aiui/; deploy-to-target
falls back to demo/aiui/ when the AIUI sibling checkout is missing
- prebuild hook syncs app-catalog/catalog.json → public/catalog.json so the
two copies can no longer drift (was the source of the "apps still visible"
bug — public/ had stale data)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Install flow
- api/rpc/package/install.rs: always append the literal image URL as a
last-resort pull candidate in do_pull_image, so images not carried by
any configured mirror (docker.io/bitcoin/bitcoin:28.4) still install
instead of masquerading as a generic pull failure across every mirror.
- api/rpc/package/install.rs: write_bitcoin_conf now skips on any stat
error, not just "file exists". Once bitcoin-knots' first-boot chowns
/var/lib/archipelago/bitcoin into the container's user namespace (700
perms, UID 100100/100101), the archipelago daemon can't even traverse
in — try_exists returns Err which unwrap_or(false) treated as "not
present" and drove a doomed write. Now errors out of the directory
traversal are treated as "conf already owned by container user" and
the write is skipped. Mirrors the lnd.conf pattern.
- api/rpc/package/install.rs: drop the hardcoded `prune=550` from the
conf default. Operators with multi-TB drives shouldn't be silently
pruned; users who want a pruned node can set it in bitcoin.conf
themselves. Full archive is the only honest default.
- api/rpc/package/config.rs: bitcoin-core now passes explicit
-server/-rpcbind/-rpcallowip/-rpcport/-printtoconsole/-datadir CLI
args. Vanilla bitcoin/bitcoin:28.4 has no entrypoint wrapper and
reads conf + argv only; without these the RPC listens on 127.0.0.1
inside the container and rootlessport can't reach it, so the
bitcoin-ui companion gets 502 on every /bitcoin-rpc/ call.
Bitcoin Knots keeps its own entrypoint-driven defaults.
- container/docker_packages.rs: split bitcoin-core out of the shared
AppMetadata arm. bitcoin-core now surfaces as "Bitcoin Core" with
bitcoin-core.svg and a Reference-implementation description; the
bitcoin + bitcoin-knots ids keep the Knots branding. Fixes the home
card showing "Bitcoin Knots" for a Core install.
Bitcoin node UI (docker/bitcoin-ui)
- index.html: impl name/tagline/logo now dynamic. applyImplBranding()
reads subversion from getnetworkinfo — /Satoshi:X/Knots:Y/ resolves
to Bitcoin Knots, plain /Satoshi:X/ resolves to Bitcoin Core. Both
get their own icon and subtitle. Settings modal replaced its
hardcoded Regtest/txindex=1/port-18443 placeholders with live values
from getblockchaininfo + getindexinfo + getzmqnotifications.
- index.html: new Storage info card (Full Archive · X GB /
Pruned · X GB from blockchainInfo.pruned + size_on_disk) visible on
the main dashboard, same level as Network. Settings modal mirrors it
with the prune height when applicable.
- Dockerfile + assets/: bitcoin-core.svg, bitcoin-knots.webp, and the
bg-network.jpg used by the dashboard are now COPY'd into the image
under /usr/share/nginx/html/assets. Previously the <img src> pointed
at paths that 404'd into the SPA fallback and the onerror handler
hid the broken logo silently.
Frontend
- appSession/appSessionConfig.ts: add bitcoin-core to APP_PORTS (8334),
HTTPS_PROXY_PATHS (/app/bitcoin-ui/), and APP_TITLES (Bitcoin Core).
Without these the AppSessionFrame showed "No URL found for
bitcoin-core" and the home/app-list title fell through to the raw id.
- settings/AccountInfoSection.vue: backfill What's New entries for
v1.7.31 through v1.7.37 that had been missed in earlier cuts.
Release plumbing
- releases/v1.7.37-alpha/: binary + frontend tarball.
- releases/manifest.json: v1.7.37-alpha, sha256/size refreshed.
- Cargo.toml / package.json: version bumps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The trusted-registry allowlist in api/rpc/package/config.rs splits the
image on '/' and matches the first segment against a fixed set (docker.io,
ghcr.io, git.tx1138.com, 23.182.128.160:3000, ghcr.io, localhost). A bare
'bitcoin/bitcoin:28.4' splits to registry="bitcoin" which isn't on the
list, so the install RPC was returning 'Invalid Docker image format'.
Live catalogs on .160 and gitea-local already hotfixed directly; these
static copies keep ISO builds and the final hardcoded fallback in sync.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- neode-ui/public/assets/img/app-icons/bitcoin-core.svg (NEW): 256×256
Umbrel community Bitcoin icon sourced from getumbrel.github.io/
umbrel-apps-gallery/bitcoin/icon.svg. Referenced by the static
catalog, the curated fallback, and the upstream lfg2025/app-catalog
entry so every surface shows the same image.
- app-catalog/catalog.json + neode-ui/public/catalog.json: add
bitcoin-core (v28.4) entry pointing at bitcoin/bitcoin:28.4. Same
entry pushed to the lfg2025/app-catalog repo on .160 and the local
gitea mirror so nodes see it without needing a full archipelago
update. Sovereignty Stack entry added to FEATURED_DEFINITIONS with
a description that frames it as a Knots alternative, not a rival.
- core/archipelago/src/api/handler/mod.rs: handle_app_catalog_proxy
is now instance-scoped (&self) and derives its upstream list from
load_registries — each active container registry contributes one
`<scheme>://<reg.url>/app-catalog/raw/branch/main/catalog.json` URL
in priority order (scheme follows tls_verify). When the operator
switches mirrors in Settings, the App Store now follows. Falls back
to the legacy hardcoded .160/tx1138 pair only when registry config
can't be loaded, so the App Store still renders on nodes that
haven't persisted one yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- core/archipelago/src/bootstrap.rs (NEW): embed scripts/container-doctor.sh
and image-recipe/configs/archipelago-doctor.{service,timer} via
include_str! and sync to disk + enable the timer on every archipelago
startup. Idempotent (content-hash compare), dev-box symlink guard keeps
the git checkout untouched, best-effort (warn-only on failure) so
bootstrap never blocks server readiness. Wired in main.rs as a
background tokio task.
- scripts/container-doctor.sh: add fix_rootless_netns_egress(). Detects
when the rootless-netns has lost its pasta tap (container-to-container
still works but outbound DNS/TCP fails) via an nsenter probe into
aardvark-dns; with a two-probe 10s debounce to rule out transients and
a host-precheck that bails out if the host itself is offline. When the
rootless-netns is truly broken, does a graceful podman stop --all /
start --all so pasta + aardvark-dns rebuild the netns from scratch.
Bitcoin-knots and every other outbound container recover in one cycle.
- core/archipelago/src/update.rs: host_sudo → pub(crate) so bootstrap.rs
can reuse the existing systemd-run escape hatch.
- apps/bitcoin-core/manifest.yml: bump app version 24.0.0 → 28.4.0 and
image bitcoin/bitcoin:24.0 → bitcoin/bitcoin:28.4. Resources aligned
with the real container-specs.sh large-disk tune (4 GiB memory cap,
cpu_limit: 0 so bitcoind can run -par=auto across every core).
- neode-ui/src/views/apps/AppCard.vue + Apps.vue: add an Update button
+ Updating spinner to every app card that has available-update set.
Wires through serverStore.updatePackage(id) — the same RPC the detail
view already calls. common.update / common.updating i18n keys added in
en.json and es.json.
- core/archipelago/src/identity_manager.rs: add create_from_signing_key()
that mirrors an existing Ed25519 key as a manager-level identity with
a deterministic id (`node-<pubkey16>`). Idempotent across restarts,
gets the hex-SVG master avatar.
- core/archipelago/src/server.rs: the auto-create path on first boot now
mirrors the node's own signing_key (seed-derived on onboarded installs)
as a "Node" identity instead of generating a random "Default" keypair.
Once this ships, the DID on the Web5 DID Status card (via node.did
RPC), the Node entry on the Identities page (via identity.list), and
the DID used for peer-to-peer connects (via server_info.pubkey) all
resolve to the same seed-derived pubkey.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- useOnboarding.ts: when the backend gives a definitive answer
(true/false, not a null retry failure), re-seed the
neode_onboarding_complete localStorage flag accordingly. Fixes the
case where a user clears site data on an already-onboarded node —
OnboardingWrapper's useVideoBackground computed reads localStorage
synchronously, so without this re-seed the intro video would fire
again on /login even though RootRedirect correctly sent them
straight to /login.
- OnboardingWrapper.vue: login background now rotates through
bg-intro-1..6 on each /login mount, with the current index
persisted to localStorage (neode_login_bg_idx) so subsequent
logouts advance rather than repeat the same image.
- Dashboard.vue: subsequent-login branch drops the 1.2s showZoomIn
entirely. Only the first dashboard entry after onboarding plays
the full zoom + glitch reveal; every re-login now just fades in
with the welcome typing (~300ms).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- useOnboarding.ts: prefer the backend over localStorage when checking
onboarding completion. The old order (localStorage first) meant any
browser that had ever onboarded a node would treat every new fresh
node as already-onboarded and skip the wizard, dumping the user
straight at the inline set-password form. Backend is now authoritative;
localStorage stays as the offline fallback.
- OnboardingWrapper.vue: skip the intro video on `/login` once
`neode_onboarding_complete` is set. Returning logged-out users now
get the static lock-screen background + glitch overlay instead of
replaying the full intro on every logout.
- RootRedirect.vue: when the health check fails, only show the full
BootScreen if the node was never onboarded. For already-onboarded
nodes (i.e. an OTA-update blip), keep the spinner and poll the
health endpoint every 2s for up to 60s before falling back to the
boot screen. Fixes the "fake boot loader" / "server starting up"
screens flashing on every successful update.
- loginTransition store: new `justCompletedOnboarding` flag distinct
from `justLoggedIn`. Set true only by the inline setup-password
flow (handleSetup). Dashboard.vue branches on it: full glitch+zoom
reveal for the post-onboarding entry, quick zoom + welcome typing
on every other login (no triple glitch flashes, ~1.2s vs 8s).
- vite.config.ts: bump assets cache from `assets-cache-v2` to
`assets-cache-v3` so service workers running the previous bundle
invalidate their cache and pick up the new UI cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- HOTFIX: v1.7.31-alpha's frontend tarball was packaged with a
`neode-ui/` top-level directory instead of the flat layout v1.7.30
and earlier used. Nodes that applied v1.7.31 ended up with
`/opt/archipelago/web-ui/neode-ui/index.html` instead of
`/opt/archipelago/web-ui/index.html`, and nginx returned 403/500.
v1.7.32's tarball is built with `tar -C web/dist/neode-ui .` so
files land directly at web-ui root. Broken nodes auto-heal on this
update (web-ui dir is replaced).
- transport/lan.rs: add Drop impl that calls ServiceDaemon::shutdown()
on the mdns_sd daemon. Without this the OS thread it spawns, plus
the blocking `receiver.recv()` task, keep the tokio runtime alive
past SIGTERM — long enough for systemd's TimeoutStopSec to SIGKILL
the service and mark it Failed. Was visible on every update:
"shut down cleanly" logged, then 15s later systemd forcibly kills.
- main.rs: after logging "Archipelago shut down cleanly", call
`std::process::exit(0)` explicitly. Belt-and-suspenders against
any future non-daemon thread creeping in (reqwest resolver pool,
etc.) and causing the same SIGKILL regression.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Backend: install.rs registry reachability probe now strips the
`host[:port]/namespace` suffix before appending `/v2/` (the Docker
V2 API lives at the host root, not under the namespace) and accepts
HTTP 405 in addition to 200/401 as "registry daemon alive". This
fixes false "unreachable" reports on the Test button for Gitea and
other registries that protect their /v2/ endpoint.
- Backend: stacks.rs install_indeedhub_stack now force-removes any
leftover indeedhub-* containers and indeedhub-net before creating
the stack. A partial install (or the old first-boot stub racing the
installer) used to leave containers around that blocked re-install
with "name already in use". Re-running the App Store install now
self-heals.
- Backend: registry.rs load_registries auto-merges any default
registry URLs missing from the saved config (appended with priority
max+10+i, persisted). Lets new default mirrors (e.g. Server 3 OVH)
roll out to existing nodes without manual config edits. Explicit
removals still stick — URLs absent from disk AND absent from
defaults stay gone.
- Backend: update.rs adds DEFAULT_TERTIARY_MIRROR_URL at
http://146.59.87.168:3000/ (Server 3 OVH) to default_mirrors, with
the same auto-merge-on-load behavior as registries. Test updated
for 3-mirror default (.160, tx1138, .168).
- Scripts: dropped the first-boot IndeedHub stub (~38 lines in
first-boot-containers.sh §8b). It predated the proper stack
installer, raced it, and was the main source of the name-conflict
mess the stacks.rs cleanup above now also guards against.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Backend: unified pull-progress streaming across primary AND fallback
registries. Earlier code only streamed for the primary attempt; if it
failed fast (VPS 404, etc.) the UI froze at 0% until the fallback
finished. The waterfall now uses a single shared helper that streams
podman stderr through update_install_progress for every URL tried.
- Backend: PackageDataEntry gains uninstall_stage, set at each phase of
handle_package_uninstall ("Stopping containers (i/total)",
"Cleaning up volumes", "Removing app data"). State flips to Removing
during the pipeline.
- Frontend: MarketplaceAppCard renders the live progress bar with byte
counts during installs, matching the System Update download bar style.
- Frontend: AppCard renders the live uninstall stage label per app.
Modal closes immediately on confirm so concurrent uninstalls each
show their own progress on their own card.
- Cleanup: removed dead helpers (image_candidates, rewrite_for_primary,
primary_image_url, pull_from_registries_with_skip) made unused by
the install.rs refactor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New Settings → App registries page (/dashboard/settings/registries)
that mirrors the update-mirrors experience: list of configured
registries, test reachability, set primary, add/remove. New
registry.set-primary RPC; existing registry.{list,add,remove,test}
reused.
- Default RegistryConfig flipped: VPS (23.182.128.160:3000/lfg2025) is
now Server 1 (primary), tx1138 is Server 2 (fallback).
- Install pipeline now rewrites the first pull to the primary registry
URL before attempting it. Before this, installs always hit whichever
registry the image was hardcoded to, so changing the primary didn't
actually affect where images came from. On failure, the existing
fallback walk skips the primary (already tried) and walks the rest.
- App catalog proxy UPSTREAMS order flipped so the catalog follows the
same VPS-first rule.
- Reboot overlay: animated "a" logo now sits in the center of the ring
(matches the screensaver composition). Extracted the logo-wrapper
pattern inline.
7/7 registry tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New reboot progress overlay: full-screen black with the screensaver's
pulsing ring, rebooting → reconnecting → back-online → stalled stages,
elapsed counter, auto-reload on health-check success, manual reload
button at 3 min stall. Mirrors the existing update overlay.
- Ring extracted from Screensaver.vue into a reusable ScreensaverRing
component so the reboot overlay reuses the same animation.
- default_mirrors() now puts the VPS as Server 1 (primary) and tx1138 as
Server 2 — new nodes fetch manifests from VPS first; existing nodes
keep whatever mirror order they've customized.
- What's New entry prepended for v1.7.28-alpha.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New "Served by {mirror}" line on the System Update page so operators can see
which mirror actually served the available manifest (vs. which is configured
primary). Backend threads the served URL through UpdateState.manifest_mirror.
- New update.test-mirror RPC + per-row lightning-bolt button that pings a
mirror and renders reachable/latency or error inline under the URL.
- UI polish on the mirrors section: Set Primary, Remove, and the new Test
action are compact icon buttons; add-mirror form moved into a dialog.
- "What's New" block prepended for v1.7.27-alpha.
21/21 update module tests pass. vue-tsc + vite build clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a multi-mirror manifest fetch. `check_for_updates` walks a
configurable list (data_dir/update-mirrors.json) in priority order
and falls through to the next mirror on any HTTP / parse / timeout
failure. Two defaults bake in: Server 1 (git.tx1138.com) and Server 2
(23.182.128.160:3000).
Critical fix: after parsing a manifest, rewrite every component's
`download_url` so its origin matches the manifest URL we fetched.
Before this, the manifest hard-coded absolute URLs pointing at one
specific server — so even when a node fetched the manifest from a
faster mirror, the actual 200MB download went back to the slow
original. Now the faster mirror wins end-to-end.
New RPCs: update.list-mirrors, update.add-mirror, update.remove-mirror,
update.set-primary-mirror. New UI section on the System Update page
for operator management. 5 new unit tests for origin parsing and
manifest rewriting (21/21 green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-adds the TCP transport (`0.0.0.0:8443`) to the rendered fips.yaml
alongside UDP. Upstream factory default enables both; we had
inadvertently narrowed to UDP-only when the yaml rewriter was last
touched, which left nodes unable to reach fips.v0l.io (the public
anchor only answers on TCP right now) or talk across networks that
block UDP.
Backend startup now compares the installed yaml against the current
rendered schema and restarts whichever fips unit is active when they
differ — so OTA-upgrading nodes pick up the new transport without
anyone having to click Reconnect.
Dropped the earlier plan to auto-add federated peers as seed anchors:
invites don't carry a FIPS-reachable IP:port, and once TCP reconnects
the public mesh, federated peers become npub-routable without needing
a seed entry.
Seed Anchors modal cleanup: replaced malformed header icon with a
three-arc broadcast glyph, and the close button now matches the
What's New modal (embedded in the card header, same icon + hover
style) instead of the earlier floating off-design placeholder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The npm run build step in the release ritual had been silently failing
for roughly seven releases. vue-tsc died with EACCES on a root-owned
node_modules/.tmp, exited non-zero, and my `tail -5` of the build
output happened to only show vite's precache summary — which makes
vite look successful even when the typecheck that precedes it failed.
The resulting archipelago-frontend-*.tar.gz files were rebuilds from
whatever content happened to live in web/dist/neode-ui/ at the moment
(files left over from v1.7.9, owned root:root from an earlier sudo'd
operation, unchanged since).
Fixed by chowning both paths back to the archipelago user and
rebuilding. Every published frontend tarball from v1.7.17 through
v1.7.23 therefore shipped the same frozen UI; v1.7.24 is the first
release in that stretch whose frontend actually matches its backend.
Recorded the build-verification rule as a persistent feedback memory
(feedback_frontend_build_verify.md) — future ships must grep the
packaged tarball for the new version string before push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a gear button next to the FIPS Mesh card's status pill that
opens a Teleport-ed modal containing FipsSeedAnchorsCard. The card
was landed on disk in v1.7.21 but never wired into a UI entry point
per the entry-point convention, so users couldn't access the
Add/Remove/Apply controls at all. One gear click now opens them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- fips::service::active_unit() picks whichever fips unit is running
(archipelago-fips.service vs upstream fips.service) so
handle_fips_restart and handle_fips_reconnect don't silently no-op
on hosts where the archipelago-managed unit was never created.
- peer_connectivity_summary(anchor_candidates) replaces the old
identity-cache check. anchor_connected is now true when at least
one authenticated peer's npub matches the public anchor OR any
entry in seed-anchors.json, which matches what the user actually
cares about ("am I in the mesh?") rather than what the card used
to claim ("is this one specific public anchor reachable?").
- FipsStatus::query takes data_dir now (so it can read seed-anchors)
rather than identity_dir. All call-sites updated.
- handle_fips_reconnect re-pushes seed anchors after restart so the
new daemon gets dialed without waiting for the 5-min apply loop.
- FipsNetworkCard label drops "(fips.v0l.io)" — misleading now that
multiple anchors may be configured.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a local seed-anchor list at <data_dir>/seed-anchors.json. Each
entry is {npub, address, transport, label}. On archipelago startup
and every 5 minutes the list is pushed into the running fips daemon
via `fipsctl connect <npub> <addr> <transport>`, so a cluster can
anchor itself independently of the global fips.v0l.io. A flaky or
unreachable public anchor no longer strands a fresh install.
New RPCs:
- fips.list-seed-anchors
- fips.add-seed-anchor (validates npub1… + host:port)
- fips.remove-seed-anchor
- fips.apply-seed-anchors (on-demand re-dial)
New standalone UI card at views/server/FipsSeedAnchorsCard.vue. Not
wired into Home.vue / Server.vue — operator places it per the
entry-point convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 3AM auto-update path called std::process::exit(0) immediately
after apply_update returned. apply_update had already spawned a 2s-
delayed systemctl restart, but exit(0) killed the runtime before that
spawned task could run — and the unit's Restart=on-failure does not
trigger on a clean exit 0, so the service stayed dead until someone
SSH'd in and started it manually (.253 hit this today).
Scheduler now returns from the task without killing the process;
apply_update's existing restart path (same one the UI's Install
Update button uses) brings the new version up cleanly.
Also hardens the ISO CI: the AIUI inclusion step now falls back to
extracting from the newest release tarball if the runner's cached
/opt/archipelago/web-ui/aiui path is missing, so a reprovisioned
runner can't silently ship a frontend tarball without AIUI. The ISO
build step also sanity-checks the binary exists before invoking the
builder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
load_state now drops any stored available_update whenever the running
binary version differs from what's on disk — the old migration only
cleared it when the stale entry happened to match the new version, so
skipping releases (e.g. sideloading 1.7.16 → 1.7.18 without 1.7.17)
left a pointer to an intermediate version as the "update available",
which the UI then offered as a downgrade prompt.
check_for_updates also uses a numeric version comparator so a stale or
cached manifest with an older version can't offer itself as an
update, and 1.7.10 correctly outranks 1.7.9 past the single-digit
patch boundary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip transitively-discovered federation peers to Trusted instead of
Observer. Hints are already only ingested from peers we trust and only
peers we trust are re-exported via build_local_state, so the chain of
trust is already vetted end-to-end — making the user promote each
newcomer by hand was friction with no security win.
Backend:
- federation/sync.rs: merge_transitive_peers now inserts TrustLevel::Trusted
(doc comment updated to explain the transitive-trust rationale)
- update.rs: info! log at download start (version, components, total_bytes,
staging path), cancel (staging wiped?, marker cleared?), and apply (backup
path) so journalctl reveals where a stuck update actually is
Frontend:
- SystemUpdate What's New block gets a v1.7.18-alpha entry
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Federation join flow now notifies the inviter with the joiner's name and
immediately bumps state so the Federation UI reloads without a manual
Sync click. Accepting an invite that points back at the local node is
rejected up front (DID/pubkey/onion match). After a peer joins, we spawn
a transitive sync that pulls the new peer's federated peer hints so all
nodes in the federation learn about each other as Observer entries.
Federation.vue polls every 5s while mounted.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
download_update
Each component download is now resumable via HTTP Range requests
(Range: bytes=N-) and retried up to 6 times with exponential
backoff (5/15/30/60/120/180s). On a dropped connection the next
attempt picks up at the last written byte offset instead of
restarting at zero. Streams via reqwest::Response::chunk() to the
staging file so a 160 MB frontend tarball doesn't sit in RAM. SHA
is verified over the complete file at the end of each component;
mismatch nukes the staged file and restarts from scratch.
Real download progress counters
New AtomicU64 globals DOWNLOAD_BYTES/DOWNLOAD_TOTAL are updated
from the chunk loop. update.status exposes them as
download_progress.{bytes_downloaded, total_bytes, active}. The
SystemUpdate.vue progress bar now polls update.status every
second instead of incrementing a fake random counter — and
crucially, if the user navigates away and back, the component
picks up the in-progress download from the backend atomics
immediately.
Update-check retries
handle_update_check now retries the manifest fetch up to 3 times
with a 5s gap if the first try hits a transport error, so a
momentary gitea hiccup doesn't make a node report "up to date"
when there actually is a new release. Tight 10s connect timeout
per attempt keeps the total bounded.
Artefacts:
archipelago 1070c87f…c081c162b 40584792
archipelago-frontend-1.7.15-alpha.tar.gz 8e630eba…63fd43f 162078068
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Install UX
SystemUpdate.vue now shows a full-screen overlay after apply: the
BitcoinFaceAscii logo, a target-version label, an indeterminate
progress stripe (solid orange; solid green on ready), and an
elapsed-time readout. Polls /health every 1.5s and auto-reloads
once the backend reports the new version. 3-min stall → "Reload
now" button. Download UI also shows a spinner + "Finishing
download — verifying checksum…" while the fake bar sits at 95%.
FIPS reconnect — for real this time
New fips.reconnect RPC does stop → start → wait 20s → re-poll →
classify. Classification buckets: connected / daemon_down /
no_seed_key / no_outbound_udp_or_anchor_down / peers_but_no_anchor,
each with a plain-language hint surfaced verbatim by the Reconnect
button. The real reason nodes like .198/.253 couldn't reach the
anchor: identity::write_fips_key_from_seed was writing fips_key.pub
as a bech32 npub TEXT file, but upstream fips expects 32 raw
bytes. The daemon silently authenticated with garbage. Fix:
PublicKey::to_bytes() → raw 32 bytes, and new
fips::config::normalize_pub_file migrates legacy files by decoding
the npub and rewriting in place. fips.reconnect also re-installs
the config + healed keys to /etc/fips before restarting.
AIUI preservation + restore
apply_update was wiping /opt/archipelago/web-ui/aiui because the
Vue build doesn't include it — every OTA lost the Claude sidebar.
The preserve block now copies aiui/ + archipelago-companion.apk
from the old web-ui into the staging dir before the swap, and
prefers new-tar versions if present. To restore it on the three
nodes that already lost it (.116/.198/.253), this release bundles
the 85 MB aiui build into the frontend tarball. Frontend component
size is now ~155 MB.
Download / install timeouts
Backend download client timeout 1800s → 3600s (1 h). Larger
tarball + slow gitea raw throughput put us above the old cap.
Frontend update.download rpc timeout 30 min → 65 min to match.
package.install rpc timeout 15 min → 45 min — IndeedHub pulls
6 images and was timing out mid-install.
UI nit
"Rollback to Previous" → "Rollback Available".
App-catalog proxy already landed in v1.7.13.
Artefacts:
archipelago 725e18e6…3c525e6 40462288
archipelago-frontend-1.7.14-alpha.tar.gz c35284be…ff2c16 162077052 (+aiui)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Discover / Marketplace page fetched the app catalog directly from
git.tx1138.com/lfg2025/app-catalog/raw/.../catalog.json in the
browser. Two blockers hit the fleet simultaneously: (1) tx1138's
Gitea doesn't emit Access-Control-Allow-Origin so the HTTPS fetch
got CORS-blocked; (2) the HTTP IP-port fallback
(http://23.182.128.160:3000/...) falls outside the node's
`connect-src` CSP. Users saw the hardcoded fallback instead of the
live catalog.
Backend: new authenticated GET /api/app-catalog handler uses reqwest
to pull catalog.json server-side (15s timeout) and returns it with
application/json + 1h Cache-Control. Tries the HTTPS URL first,
HTTP IP-port second.
Frontend: curatedApps.ts now calls /api/app-catalog (same-origin,
no CORS/CSP) with credentials included so the session cookie
authenticates the proxy. Baked /catalog.json stays as the last
resort.
Artefacts:
archipelago 0aaf7262…b979f22c 40371192
archipelago-frontend-1.7.13-alpha.tar.gz 27505811…efc6f4142 76982505
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:43:45 -04:00
672 changed files with 87420 additions and 15301 deletions
| `UPDATE_INCOMPATIBLE … signatures do not match` | Old install signed with a **different key** (e.g. pre-shared-keystore per-machine key `58:31:12…`). | Uninstall the old package, then install. **One-time** per device after a key change. |
| `INVALID_APK` / parse error | Corrupt/incomplete download or bad signing. | Re-download; re-run the publish script. |
Polishes the mesh AI assistant and Fedimint, on top of all the v1.7.99 features (kept listed below so you can still see what's new).
- The off-grid mesh radio no longer posts cryptic identity codes to the shared public channel. Your node was announcing a line starting with "ARCHY:" to the public channel about once a minute, which everyone else on that channel saw as spam; that broadcast has been removed.
- You can now use your node's AI assistant straight from a normal chat. Send "!ai <yourquestion>" in a direct message to an AI-enabled node and the answer comes right back in the same conversation — whether your message travelled over the internet or the LoRa radio. Before, the reply could be sent on the wrong path and never arrive.
- The Mesh AI Assistant panel is easier to set up: pick the Claude model from a dropdown (Haiku, Sonnet, or Opus) instead of typing it, and add specific contacts to an "always allow" list so chosen people can use "!ai" even when the assistant is set to trusted-nodes-only.
- Fedimint federations show up in Wallet Settings again. The Fedimint client app wasn't starting because of a configuration error, so the federation your node auto-joins never appeared; the client is fixed and runs again.
- In Settings, "App Updates" and "App Registry" now sit directly under your Account section for quicker access.
- In Mesh chat, scrolling the conversation no longer also scrolls the contact list behind it.
- Mesh direct messages are now private and end-to-end encrypted to the recipient — they're sent as real radio DMs instead of being broadcast on the public channel, so other people on the mesh no longer see them, and the answer arrives intact (even on standard meshcore phone apps).
- You can now message standard meshcore apps (like the phone companion) and they can message you — text shows up readable on both sides, and your node's AI answers come back as a private reply rather than on the public channel.
- New contacts you hear on the radio are added automatically, so people show up in your Peers list without any extra steps.
- "Clear All" now actually removes contacts (rather than hiding them forever); a contact comes back on its own the next time it's in range. Each contact also shows a reachability dot so you can see who's currently reachable.
- The Peers list has a search box (with a clear button) to quickly filter your contacts by name, DID, npub, or key.
All the v1.7.99-alpha features are included as well:
- Your node can now hold Fedimint ecash as well as Cashu, with tabbed Wallet Settings for each and both balances shown side by side on the home wallet card.
- You can buy files shared by another node right from their cloud, paying from this node's ecash, your Lightning wallet, on-chain, or by scanning a Lightning QR with any outside wallet.
- Your node can act as an AI assistant on the off-grid mesh: peers ask by starting a message with "!ai" and get an answer back over the radio, with a panel to turn it on or off.
- You can view your node's 24-word recovery phrase any time from Settings, behind a password (and 2FA) confirmation and a tap-to-show blur.
- Setting up a brand-new node is smoother: it waits and retries quietly instead of flashing errors, and shows a gentle "securing your private connection…" status that turns to "ready" on its own.
- The NetBird VPN app now logs in (it's served over HTTPS and opens in a browser tab).
- Phone remote-control of a node's screen now supports two-finger scrolling inside apps, and external-browser apps open on your phone.
- You can choose whether your node shares Bitcoin block headers over the mesh, and your choices are remembered.
- Version numbers display cleanly everywhere (no more doubled "v"), and "Back" buttons look and behave consistently across desktop and mobile.
- For advanced testing, Settings includes an optional update & app source choice between the usual trusted origin and an experimental peer-to-peer (DHT swarm) mode, with the trusted origin remaining the default.
## v1.7.99-alpha (2026-06-17)
- Your node can now hold Fedimint ecash as well as Cashu. Wallet Settings now has tabbed sections for each: keep your list of trusted Cashu mints, or paste a Fedimint invite code to join a federation, and the home wallet card shows both your Cashu and Fedimint balances side by side. A new "Fedimint Client" app in the catalog powers the federation side.
- You can now buy files shared by another node, right from their cloud. When you open a peer's paid file you get a simple "Buy this file" picker with several ways to pay — instantly from this node's ecash balance, from your node's own Lightning wallet, on-chain from your node, or by scanning a Lightning QR code with any outside wallet. Once payment settles, the file downloads automatically.
- Your node can now act as an AI assistant on the off-grid mesh radio network. If your node has a local AI model available (via Ollama), other people on the mesh can ask it a question by starting their message with "!ai" and get an answer back over the radio — handy where there's no internet. A new Mesh assistant panel lets you turn this on or off and shows whether a local AI model was detected.
- You can now view your node's 24-word recovery phrase whenever you need it. Settings has a new "Recovery phrase" option that, after you confirm your password (and 2FA code if you use one), reveals the words behind a tap-to-show blur with a copy button — so you can write them down and store them safely offline.
- Setting up a brand-new node is smoother and less alarming. If the node is still starting up while you generate or confirm your recovery phrase, it now quietly waits and retries instead of flashing a scary error, and offers a clear "Try again" button only when something genuinely goes wrong. The final setup screen also shows a gentle "securing your private connection…" status that turns to "ready" on its own, so you can tell the encrypted transport is coming up rather than stuck.
- The NetBird VPN app now actually logs in. It was failing to reach its sign-in screen because the dashboard needs a secure (HTTPS) connection that wasn't being provided; the node now serves it over HTTPS and opens it in a browser tab, so the login flow completes.
- When you use your phone to remote-control a node's attached screen, two-finger scrolling now works inside apps and panels, not just the main page. And tapping an app that's meant to open in an external browser now hands the link to your phone to open there, instead of trying to open it on the (often unattended) attached display.
- You can now choose whether your node shares Bitcoin block headers over the mesh. The Mesh Bitcoin panel has new switches to announce headers to peers and to accept headers from them, and your choices are remembered.
- Version numbers now display cleanly everywhere. In a few places the interface was showing a doubled "v" (like "vv1.7.98"); it now always shows a single, tidy version label.
- The "Back" buttons throughout the cloud and other detail screens now look and behave consistently on both desktop and mobile, including when browsing another node's files.
- For advanced testing, Settings now includes an optional "update & app source" choice between the usual trusted origin and an experimental peer-to-peer (DHT swarm) mode that pulls updates and app content from other nodes first, falling back to the origin automatically. The trusted origin remains the default.
## v1.7.98-alpha (2026-06-16)
- Apps that crash now recover on their own. Multi-part apps like Immich and IndeedHub could have one of their pieces stop and stay stopped until the whole node was rebooted; the node now checks every couple of minutes and restarts any crashed piece automatically (while still leaving apps you deliberately stopped alone).
- The on-screen kiosk display can no longer slow the whole node down. On machines without a graphics chip the kiosk browser could spin a CPU core at full tilt, starving everything else (including the wallet, which then timed out); it's now capped and uses lighter rendering on those machines.
- If an update download fails, you're taken back to the Download button to retry, instead of being stranded on an Install button for an update that didn't actually finish downloading.
- Your node's identity is clearer and always visible: Settings now shows your Node DID on every node (it previously only appeared if your browser had cached it) plus your node's npub, both with copy buttons. There's also a terminal tool to cryptographically prove all your node's keys come from your one seed phrase.
- The "all nodes over Tor" group chat sends quickly now — the "sending" spinner clears as soon as the reachable nodes have the message, instead of hanging on a slow or offline node.
- Message notifications now have a close button and open the relevant chat when tapped.
- The encrypted mesh transport (FIPS) turns itself on automatically after setup — no button to press — and connects to peers more reliably (it retries and keeps connections warm), so node-to-node features use the fast path more often instead of falling back to Tor.
- Your chat history with other nodes is saved reliably and now encrypted on disk, so it survives restarts and updates and can't be read from a stolen drive (only clearing chat removes it).
- Peer media shows a "connecting" loader before a video or audio file plays, and audio errors are accurate instead of blaming File Browser.
- The Fedimint app now displays with its proper styling, and the Connected Nodes screen stays compact — it shows a few nodes and scrolls, you can tap a node to jump to it in Federation, or tap Message to open its chat.
- App updates can now arrive on their own without waiting for a full system release, so individual apps can be improved and shipped faster.
## v1.7.97-alpha (2026-06-16)
- The Bitcoin sync status on the home screen no longer disappears for a moment when it refreshes. If the node was briefly busy, the panel used to vanish and pop back; it now stays put and simply shows "Updating…" until the next reading arrives, while a genuinely stopped node still correctly shows as not running.
- Bitcoin sync progress on the home screen now updates more promptly, so the percentage and block height keep pace with the node instead of lagging behind.
- The Lightning wallet "connect your wallet" screen loads its details and QR code again across all nodes, instead of failing to fetch them.
- Your list of trusted nodes is now clean: the same node no longer appears several times under different names, and removed nodes stay removed. In chat, a node that previously showed up as two separate contacts now appears just once.
- Browsing another node's cloud is smoother: music and video files from a peer now preview and play properly (including seeking partway through), and the connection now shows a small badge telling you whether it's using the fast encrypted mesh or the slower Tor network.
- Opening "My Folders" in the cloud now shows a clear, friendly message when the file app isn't running, instead of a confusing error.
- The Electrum server app opens on its own once it's ready, instead of sometimes leaving a loading spinner stuck on top of the screen.
- The Fedimint app now displays with its proper styling and icons, instead of appearing unstyled with a missing image.
- The Mempool app now connects to your Bitcoin node whether the node is Bitcoin Core or Bitcoin Knots, instead of only working with one of them.
- Nodes start up cleanly after a reboot. On some boots the node's main service was trying to start before its data drive had finished mounting, so it failed and retried about twenty times over roughly five minutes — showing a wall of "Failed to start" messages — before finally coming up. It now waits for the data drive to be ready first, so it starts on the first try.
- The background images throughout the interface now load faster — they've been made significantly smaller with no loss of quality.
## v1.7.96-alpha (2026-06-15)
- The screen attached to your node now shows the normal Archipelago interface and your dashboard after you sign in, instead of a separate, stripped-down grid of app icons that could appear in its place. That extra screen has been removed so the attached display matches what you see everywhere else.
- On a brand-new node, the attached screen now walks through the same welcome and setup steps you'd see on a phone or laptop, and shows the normal sign-in screen once the node is set up — so the on-device display always matches the rest of the interface.
- When adding a FIPS network anchor, you can now choose whether it connects over TCP (for a public anchor reached across the internet) or UDP (for one on your local network), instead of it always assuming the local-network option.
- Behind the scenes, a new automated two-node test now exercises real node-to-node features — browsing another node's shared files and handling a removed node — against live nodes before each release, so node-to-node problems are caught earlier.
## v1.7.95-alpha (2026-06-15)
- Browsing another node's shared files now works over the fast encrypted mesh. Opening a peer's cloud could fail with a generic "Operation failed" message because the request for their file list wasn't permitted over the mesh and came back as "not found" — and it never retried over Tor. The mesh now serves the file list directly, and if a peer can't answer over the mesh the node automatically falls back to Tor instead of giving up.
- Nodes you remove from your federation now stay removed. Previously a deleted node could quietly come back the next time you synced with another node that still listed it. Removed nodes are now remembered as removed and won't reappear on their own — only if you add them back yourself.
- The app credentials pop-up now appears as a normal centred box with a dimmed background over the whole screen, instead of stretching to fill the entire screen.
## v1.7.94-alpha (2026-06-15)
- Your node now joins the private encrypted mesh network on its own. A wrong built-in setting meant nodes were quietly never reaching the shared mesh meeting point, so everything between nodes fell back to the slower Tor network. Every node now connects to the mesh automatically on startup, so node-to-node features like file sharing use the faster encrypted mesh first and only fall back to Tor when a peer is genuinely offline. (Confirmed live: a node with its mesh setting wiped re-connected to the mesh by itself within a second of starting.)
- You can now bring the mesh networking software up to the latest stable version straight from the node, with one action — it fetches the new version, checks it's genuine before installing, and restarts the mesh on its own. (Confirmed live end to end: a node on an older build was upgraded to the current stable release and rejoined the mesh automatically.)
- The Lightning wallet screen connects again on nodes where it was showing a "failed to fetch" error instead of your balance and channels. The wallet app and the node now talk to each other correctly, and the connection quietly repairs itself if its details drift after a restart.
## v1.7.93-alpha (2026-06-14)
- Receiving Bitcoin and Lightning works again on nodes where the Lightning wallet was stuck locked. After some updates the wallet could come back locked with a password the node no longer had, so "generate a receive address" kept failing with a "wallet is locked" message that nothing could clear. The node now detects this and repairs itself automatically.
- Each node now secures its Lightning wallet with its own unique, randomly generated password instead of a shared built-in one, and remembers it safely so the wallet unlocks on its own after every restart or update — no more getting stuck locked.
- If a wallet is found locked with an unrecoverable password, the node rebuilds it cleanly so Bitcoin and Lightning start working again. (On these early-access nodes the wallet holds no funds, so nothing is lost — a wallet locked with an unknown password was already inaccessible.)
- The self-repair was validated end to end on live nodes: a stuck, locked wallet was detected, rebuilt, and came back unlocked on its own, and stayed unlocked across restarts.
## v1.7.92-alpha (2026-06-14)
- The Electrum server app no longer flashes a "can't connect, try again" error over its loading screen while it's still catching up. If ElectrumX is building its index or waiting on the Bitcoin node, you now just see the sync progress, and the app opens on its own once it's ready.
- Behind the scenes, the reboot-survival test now confirms the whole system is genuinely healthy after a restart — every app reachable, updates not stuck, core services answering — instead of only checking that containers came back, so update-related problems are caught before shipping.
- Settings → What's New now lists the notes for every recent release again. The screen had quietly fallen several versions behind, so the last eight releases of changes weren't showing up there — they're all back now, and a release check keeps it from drifting again.
## v1.7.91-alpha (2026-06-14)
- Apps you've installed now reliably show their "Open" button again. Some apps — including Jellyfin, BTCPay Server, Fedimint, Gitea and Portainer — were running fine but their launch link sometimes went missing, so there was no way to open them from the home screen. They now open correctly.
- Receiving Bitcoin is more dependable: if the wallet's internal connection details drift after a restart, it now repairs them on its own, and any error it does hit is reported clearly instead of as a generic failure or a misleading "wallet locked" message.
- Installing Bitcoin now sets itself up correctly without manual help — a security credential that could previously be missing and stop Bitcoin from starting is created automatically before it launches.
- The Electrum server app is back on the home screen and can be launched again.
- Behind the scenes, the release now runs an expanded automated test suite before shipping, so these kinds of issues are caught earlier.
## v1.7.90-alpha (2026-06-13)
- Generating a Bitcoin receive address works again — the wallet now requests the correct address type, fixing the "400 Bad Request" error when creating an address.
- In the companion app, the on-screen pointer can now click into apps and type — including the app store search box — instead of clicks and keystrokes not reaching app content.
- "Open in a new tab" from the companion app now opens the app in your phone's browser, instead of doing nothing. The normal mobile browser keeps working as before.
- The login/credentials pop-up on phones is once again a centered, properly sized window rather than stretching the full height of the screen.
- The Electrum server now recovers on its own if its index ever gets corrupted, and shows a clear progress screen (with percent complete and block height) while it builds its index, instead of a blank or broken page.
- Software updates are more reliable on slow internet connections — downloads are given much more time to finish before giving up.
## v1.7.89-alpha (2026-06-12)
- The AI assistant looks the way it always did again: no extra back button or close button on phones, and the desktop view fills the whole screen without a gap at the bottom.
- System updates are much more reliable: updates that previously got stuck partway or failed to install now complete cleanly, and a failed update can no longer block all future updates.
- After an update, the system now checks itself correctly on every node type, so working updates are no longer mistakenly undone.
- Generating a Bitcoin receive address works again on nodes where a network proxy previously got in the way.
- The Lightning wallet now recovers and unlocks itself properly after restarts.
## v1.7.88-alpha (2026-06-12)
- AIUI now loads immediately again instead of waiting on a production availability probe and cache-busted iframe URL, restoring the lighter launch behavior from before the regression.
- Bitcoin receive now uses LND's GET-based newaddress flow with the native SegWit address type, fixing the `501 Method Not Allowed` response from the previous POST attempt.
- Validation pending on the AIUI rollback; the rest of the release train remains unchanged.
## v1.7.87-alpha (2026-06-12)
- Bitcoin receive now calls LND's on-chain address endpoint with the correct REST method, and backend failures keep the specific address-generation error instead of collapsing into the generic operation-failed message.
- App launch credential interstitials now render as true full-screen overlays, and the launcher loading indicator uses the neutral brand palette instead of a blue spinner.
- Validation passed with `git diff --check`, `npm run type-check`, and the focused frontend tests for `bitcoinReceive` and `AppIconGrid`.
## v1.7.86-alpha (2026-06-12)
- Fleet now preserves the last known node list, alerts, and selection locally while telemetry refreshes in the background, so the dashboard no longer blanks on tab switches or update scans.
- Connected nodes and identities now reuse their last loaded data instead of reloading the visible list every time the user revisits the tab.
- The Fleet matrix and detail views now show actual node names and host information instead of raw node id prefixes.
- The network map only redraws when its graph data actually changes, which stops the D3 scene from visually resetting on every refresh tick.
- Mobile federation and system-update actions now stack full width, and the ElectrumX app health check allows a long startup window so slow sync nodes do not restart mid-index.
- Validation passed with `git diff --check`, focused frontend tests, and `npm run type-check`.
## v1.7.85-alpha (2026-06-12)
- ElectrumX now runs with less cache pressure and more memory headroom, reducing the restart loop seen during sync catch-up.
- Portainer is pinned to `2.19.4` instead of `latest`, avoiding schema-drift restarts from surprise image updates.
- LND receive-address creation now asks for a native SegWit address and returns clearer wallet/readiness failures when an address is not available.
- Fleet telemetry now carries server name, hostname, and server URL, and the Fleet dashboard shows those names instead of hashed node ids.
- Trusted federation peers are still auto-added transitively, but the local node no longer imports itself back into the fleet list.
- Validation passed locally for the touched frontend helpers, `git diff --check`, and Rust formatting.
## v1.7.84-alpha (2026-06-11)
- Bitcoin trusted-node relay approvals now generate restricted `txrelay` RPC credentials when needed and restart the active Bitcoin backend so bitcoind loads the new `rpcauth` whitelist.
- Kiosk mode now includes a browser safe-area path for HDMI displays that crop edges, and self-update refreshes kiosk launcher/systemd files so display fixes ship to existing nodes. The experimental X11 scaling safe-area is opt-in to avoid stretching TV output.
- Wi-Fi setup now reports scan errors instead of showing an empty network list, supports retrying scans from the modal, parses escaped `nmcli` SSIDs correctly, and can join open networks without forcing a WPA password.
- Bitcoin Core now matches Bitcoin Knots for restricted relay RPC support, including the txrelay secret injection and transaction broadcast whitelist.
- The restricted Bitcoin relay whitelist now includes `submitpackage` and `gettxout`, covering newer wallet/package-relay broadcast flows without opening wallet/admin RPC.
- The Bitcoin UI companion image is pinned to `1.7.84-alpha` across release metadata and the Quadlet fallback path, avoiding stale `latest` detection during OTA updates.
- Container scanning now uses an RAII in-flight guard so timeout and error paths cannot leave the scanner stuck in a permanently busy state.
- Validation passed with `cargo fmt`, `cargo check -p archipelago`, `git diff --check`, and focused source review of the relay message/approval path.
## v1.7.83-alpha (2026-06-11)
- App launch metadata now derives more consistently from app manifests, with typed launch interfaces and catalog generation updates that keep packaged apps aligned with their runtime ports and launch surfaces.
- Revoked or unsupported app surfaces were removed from the catalog and release path, including OnlyOffice and the unvalidated Saleor surface, so the Marketplace no longer exposes apps that cannot be safely supported in this release.
- The frontend production build now passes strict TypeScript checks after tightening app details, Web5, cloud refresh, and credential test typing.
- Mobile and desktop app surfaces received release polish: improved mobile app layout, safer mesh desktop/tablet scrolling, and the Home system card now routes directly to monitoring.
- Bitcoin UI status rendering now avoids false stale/reconnecting states when fresh block snapshots advance, and guards optional DOM updates so the standalone Bitcoin UI is more resilient.
- Deploy tooling now excludes local Codex scratch output, archived image-build artifacts, and upload screenshots from target syncs, and bounded optional IndeedHub fixups so a stuck Podman helper cannot hold the deploy.
- Validation passed with `npm run type-check`, production `npm run build`, backend `cargo build --release`, catalog/release manifest checks, focused frontend tests, and live `.198` deploy verification through the frontend/service restart phase.
## v1.7.82-alpha (2026-05-22)
- Saleor storefront proxying now forwards `X-Forwarded-Host`, fixing Next.js Server Actions requests that compared the browser origin with the internal `storefront-app:3000` upstream host.
- Saleor storefront media now routes `/thumbnail/` and `/media/` through the same `9011` proxy to the Saleor API, fixing product image optimizer failures caused by `localhost:8000` media URLs.
- The Saleor storefront container receives an explicit internal media origin so rewritten media URLs resolve inside the Podman network without exposing private API ports to browsers.
- Validation passed with `cargo fmt --all --check --manifest-path core/Cargo.toml`, `cargo check -p archipelago --manifest-path core/Cargo.toml`, and live checks on `100.114.134.21` for storefront HTML, static assets, GraphQL, media redirects, and optimized product images.
## v1.7.81-alpha (2026-05-21)
- Saleor storefront installs now use the prebuilt registry image instead of building the Next.js app on-device, avoiding Podman build failures during stack installation.
- Existing Saleor stacks are repaired on adoption by recreating missing storefront containers, forcing the storefront app to bind `0.0.0.0:3000`, and resolving nginx upstreams dynamically after container restarts.
- The shipped Saleor storefront image now includes public assets and omits Vercel-only Speed Insights injection, fixing broken static asset responses and the local `/_vercel/speed-insights/script.js` browser warning.
- Validation passed with `cargo fmt --all --check --manifest-path core/Cargo.toml`, `cargo check -p archipelago --manifest-path core/Cargo.toml`, and live checks on `100.114.134.21` for `9011` storefront, static assets, and proxied GraphQL.
## v1.7.80-alpha (2026-05-21)
- Saleor storefront proxying now falls back to the direct request scheme when no forwarded protocol header is present, fixing direct `http://node:9011` launches that could generate an invalid same-origin GraphQL URL.
- The Saleor storefront release path keeps public proxy support intact by still honoring forwarded HTTPS headers for Nginx Proxy Manager domains while repairing local/direct port launches.
- Validation passed with `cargo fmt --check` and `cargo check` for the Archipelago backend before release staging.
## v1.7.79-alpha (2026-05-20)
- Saleor now installs the official Saleor Storefront as part of the stack, built from the pinned `saleor/storefront` source and served as the customer-facing shop on port `9011`.
- Saleor app launches now open the storefront while the admin dashboard remains available on port `9010` with the generated `admin@example.com` credentials shown in Archipelago.
- Public Nginx Proxy Manager hosts forwarding to the Saleor storefront also expose same-origin `/graphql/`, so public storefront domains can talk to the local Saleor API without mixed-content or private-LAN reachability failures.
- Saleor stack metadata, marketplace descriptions, catalog ports, scanner exclusions, and app-session routing now describe the storefront/dashboard/API split explicitly.
## v1.7.78-alpha (2026-05-20)
- Public Nginx Proxy Manager hosts for Saleor now keep browser GraphQL calls same-origin at `/graphql/` and proxy them to the local API on `8000`, fixing `Failed to fetch` when a public domain such as `noderunner.shop` was loaded from devices that cannot reach the node's private LAN/tailnet API address.
- Saleor's validated stack changes are now release-ready: dashboard origins on port `9010` are explicitly allowed for dashboard/API calls, preserving the working test-node install path for production nodes.
- NetBird launches now stay pinned to the unified dashboard/proxy origin on port `8087` instead of following stale runtime-discovered server URLs on `8086`.
- NetBird's local nginx proxy now routes browser API, OAuth, relay, and WebSocket traffic through `host.containers.internal:8086` instead of a hard-coded rootless Podman gateway IP, and includes the upstream `management.ProxyService` gRPC path.
- The mobile credentials interstitial now keeps credential lists scrollable and action buttons reachable in both My Apps and the mobile app icon grid.
- Android WebView popup windows now hand external popup URLs to the system browser, covering app login/signup flows that open secondary windows.
- Validation passed with `git diff --check`, `cargo check -p archipelago`, and the focused `npm test -- src/views/appSession/__tests__/appSessionConfig.test.ts` suite.
## v1.7.77-alpha (2026-05-20)
- Saleor first-use now exposes generated credentials through Archipelago instead of leaving users at an unexplained dashboard login: App Details shows copyable `admin@example.com` credentials, and My Apps/mobile icon launches show a pre-launch credentials modal.
- Saleor installs now create or repair the `admin@example.com` staff account idempotently after sample data loads, use the correct dashboard mount path, and re-check stack containers after startup so stopped containers are caught.
- NetBird embedded login now uses the upstream-compatible IdP signing-key behavior and sends ID tokens from the dashboard to the management API, fixing the post-signup `Unauthenticated` state while preserving the unified local proxy/logout routes.
- Transient unnamed Podman helper containers created during app install tasks are hidden from My Apps, so generated names like `eager_keldysh` no longer appear as user applications.
- Validation passed with catalog/release JSON checks, `npm run type-check`, and `cargo fmt --all --check --manifest-path core/Cargo.toml`; live checks on `100.114.134.21` confirmed Saleor dashboard/API availability, generated Saleor admin login, NetBird OAuth availability, and NetBird logout redirects.
## v1.7.76-alpha (2026-05-20)
- Saleor installs now use dashboard port `9010`, avoiding the existing Portainer `9000` binding on the test node while keeping API `8000`, Mailpit `8025`, and Jaeger `16686` unchanged.
- Saleor's Valkey cache no longer bind-mounts `/var/lib/archipelago/saleor-cache`, and the dashboard container has the minimal rootless nginx capabilities it needs to chown cache files, bind port 80 inside the container, and drop workers to the nginx user.
- NetBird's browser proxy now sends API, OAuth, relay, WebSocket, and management traffic through the stable host-published server port at `169.254.1.2:8086`, avoiding stale rootless Podman DNS/IPs after `netbird-server` restarts.
- Mobile App Store category chips now stay visible above the tab bar, Discover is available on mobile, and category selection updates the page route/query so the selected category is actually shown.
- Apps that require a real browser tab now open directly from the app icon tap instead of first entering an in-shell app-session route, including BTCPay, Grafana, Home Assistant, Vaultwarden, Nextcloud, Portainer, OnlyOffice, Tailscale, Uptime Kuma, Gitea, and Nginx Proxy Manager.
- Validation passed with catalog JSON checks, `npm run type-check`, `cargo fmt --all --check --manifest-path core/Cargo.toml`, and `cargo check -p archipelago --manifest-path core/Cargo.toml`; live checks on `100.70.96.88` confirmed Saleor dashboard `9010`/API `8000` and NetBird API/OAuth routes survive `netbird-server` restart.
## v1.7.75-alpha (2026-05-19)
- Saleor is now published as a recommended commerce app with catalog metadata, icon, direct app-session launch on port `9000`, scanner metadata, image pins, and a full stack installer for dashboard, API, worker, PostgreSQL, Valkey, Mailpit, and Jaeger.
- Existing NetBird installs are repaired more aggressively by rewriting unified-origin config, recreating the dashboard/proxy containers, restarting the server, preserving data, and handling exact `/api` and `/oauth2` routes plus dashboard logout redirects through the local proxy.
- Desktop dashboard scrolling now hands focus back from the sidebar to the main content when the pointer or wheel moves over the main pane, preventing the sidebar scroll area from trapping wheel input on short screens.
- Validation passed with catalog JSON checks, `npm run type-check`, `cargo fmt --all --check --manifest-path core/Cargo.toml`, and `cargo check -p archipelago --manifest-path core/Cargo.toml` before release.
## v1.7.74-alpha (2026-05-19)
- App-session right panels now re-focus the iframe after load and when the frame area is activated, so wheel/touch scrolling works immediately after switching tabs or selecting an app on shorter screens.
- NetBird now launches through a unified local origin on port `8087` that proxies the dashboard plus `/oauth2`, `/api`, relay, WebSocket, and gRPC routes to `netbird-server`, fixing the embedded login flow that previously ended in `Unauthenticated` or `404 page not found` after logout.
- Existing NetBird installs are repaired on adopt/start by rewriting `config.yaml`, `dashboard.env`, and the local nginx proxy config, then creating the missing `netbird-dashboard` and `netbird` proxy containers when needed while preserving NetBird data.
- Saleor is still pending and is not included in this release; its registry/installer work remains local until it can be validated separately.
- Validation passed with catalog JSON checks, `npm run type-check`, `cargo fmt --all --check --manifest-path core/Cargo.toml`, and `cargo check -p archipelago --manifest-path core/Cargo.toml`.
## v1.7.73-alpha (2026-05-19)
- Mobile app launches for iframe-blocked apps now open the direct app URL in a new browser tab immediately instead of landing in a broken in-shell webview that requires a second tap.
- Mobile My Apps/Websites tabs now react to route query changes, App Store pages label the mobile view as Discover, mobile filters have safe bottom spacing, and App Store search ignores the current category so searches cover all available apps.
- My Apps search now surfaces matching App Store entries when the app is not installed, making it possible to jump directly from a failed My Apps search to the installable app details.
- NetBird self-host installs now prefer a `100.x` tailnet/CGNAT address for dashboard, management, relay, STUN, and auth redirect origins when one is present; live repair on `100.89.209.89` updated the existing stack from LAN origins to `100.89.209.89` and restored `netbird-server`.
- App-session iframe frames now focus automatically and wrap the iframe in a scroll host so wheel/touch scrolling works in the active right frame without requiring an initial click.
## v1.7.72-alpha (2026-05-19)
- Settings What's New now includes the missing release notes for `v1.7.68-alpha` through `v1.7.71-alpha`, so the modal reflects the current OTA history instead of stopping at `v1.7.67-alpha`.
- The follow-up release carries the NetBird install fix, Gitea icon polish, mobile app-session fallback updates, and rounder app icon masks from `v1.7.71-alpha` with the Settings modal notes included.
- The local Cargo lockfile version metadata is kept in sync with the release bump after the previous release build updated it.
## v1.7.71-alpha (2026-05-19)
- NetBird stack installs now pre-create `/var/lib/archipelago/netbird/data` before binding it into `netbird-server`, fixing the failed install/start path seen on `100.70.96.88` where Podman rejected the missing host directory.
- NetBird start/restart ordering now starts `netbird-server` before the dashboard container so lifecycle actions bring the control plane up before the UI.
- App-session invalid IDs and panel-mode fallbacks now return to `/dashboard/apps`, avoiding the stale `/apps` route that could render a 404.
- Mobile launches for apps that block iframes now stay inside the Archipelago app-session fallback instead of automatically opening an external browser tab.
- Installed Gitea containers now report the packaged Gitea icon, and app icon masks use a rounder radius on mobile grids, app cards, and detail headers.
- Validation passed with `npm run type-check`, focused Vitest app-session/app-grid tests, `cargo fmt --all --check --manifest-path core/Cargo.toml`, and `cargo check -p archipelago --manifest-path core/Cargo.toml`.
## v1.7.70-alpha (2026-05-19)
- NetBird is being corrected from the peer/client daemon image to the self-hosted NetBird control-plane stack with a launchable dashboard on port `8087`, a combined management/signal/relay server on `8086`, and STUN on UDP `3478`.
- App sessions now always launch local apps through direct host ports and carry an explicit dashboard return target, so closing an iframe returns to the launching dashboard screen instead of falling through to browser history or a 404.
- Mobile app launches ignore stale desktop panel state and route into the full app-session webview consistently.
- The desktop sidebar now pins the logo/version at the top and controller/online/mode controls at the bottom, with only the navigation section scrolling on shorter screens.
- Validation passed with catalog JSON checks, `scripts/image-versions.sh` syntax check, `npm run type-check`, `cargo fmt --all --check --manifest-path core/Cargo.toml`, and `cargo check -p archipelago --manifest-path core/Cargo.toml`.
## v1.7.69-alpha (2026-05-19)
- App installs now allow up to 10 minutes for the initial `package.install` RPC to return, matching slow container image pulls and preventing apps from disappearing from My Apps while the backend is still pulling or retrying mirrors.
- Live diagnostics on `100.70.96.88` confirmed the Gitea install did not fail; the primary registry pull timed out after 300 seconds, the fallback mirror succeeded, and Gitea came up healthy on `3001` while the frontend had already timed out at 15 seconds.
- Gitea and other Docker-image app installs now stay visible during slow registry pulls instead of being marked as failed by the browser before backend install progress can complete.
- Gitea is now categorized as a known Data app in My Apps, so a running Gitea container appears with installed apps instead of being filtered into the Websites/Services split.
- NetBird `0.71.2` is now available in the app catalog and fallback marketplace data as a recommended networking app using the official `docker.io/netbirdio/netbird:0.71.2` image.
- NetBird installs get persistent state under `/var/lib/archipelago/netbird`, `NET_ADMIN`/`NET_RAW`, `/dev/net/tun`, `slirp4netns`, image-version pinning, backend metadata, and health checks through `netbird status`.
- The Archipelago terminal now includes `nano` on new disk installs and ISO builds, and self-update installs it on existing nodes if it is missing.
- Validation passed with catalog JSON checks, shell syntax checks, `npm run type-check`, `cargo fmt --all --check --manifest-path core/Cargo.toml`, and `cargo check -p archipelago --manifest-path core/Cargo.toml`.
## v1.7.68-alpha (2026-05-19)
- BTCPay Server now ships on the official `docker.io/btcpayserver/btcpayserver:2.3.9` image, fixing the plugin catalog crash caused by newer plugin dependency version metadata while preserving existing datadirs and Postgres databases.
- BTCPay release and first-boot health checks no longer depend on `curl` inside the container; they use a bash TCP probe that works with the official image out of the box.
- Host nginx now serves Nginx Proxy Manager HTTP-01 challenge files before the Archipelago SPA fallback and is marked as the default HTTP/HTTPS virtual host, so public proxy hosts can issue certificates without hijacking local API traffic.
- Nginx Proxy Manager first-boot, runtime repair, and container-doctor paths now pre-create the ACME webroot, keep bind mounts owned by the rootless Archipelago user, and sync issued public proxy hosts into host nginx vhosts.
- The Nginx Proxy Manager host-nginx sync now skips proxy hosts with missing certificate files and rolls back the generated nginx include if validation fails, preventing a bad certificate path from poisoning later nginx reloads.
- App session close buttons now return to the previous dashboard screen when possible and otherwise fall back to My Apps, avoiding the 404 page after closing an app launched from an invalid or stale history entry.
- System Update confirmation and mirror modals now teleport to the document body with a full-screen overlay, so they cover the whole app instead of only the right-hand dashboard panel.
- Mobile app launches stay inside Archipelago's app-session webview and hide desktop-only new-tab launch affordances, including apps such as Home Assistant that previously looked like they would leave the mobile shell.
- Live recovery on `100.70.96.88` upgraded only the `btcpay-server` container to `docker.io/btcpayserver/btcpayserver:2.3.9`, preserved the existing datadir and Postgres database, and confirmed the container is healthy after a pre-upgrade backup.
- Public validation confirmed `spay.tx1138.com`/`www` redirect to BTCPay login over HTTPS and `sapien.tx1138.com`/`www` serve the L484 page over HTTPS using the issued Let's Encrypt certificates.
## v1.7.67-alpha (2026-05-18)
- Home dashboard status cards now keep the last known good system, VPN, Bitcoin, and FIPS values while route changes or transient RPC failures are in flight, avoiding false "not configured" or "not running" flashes.
- Home, Web5 Monitoring, and the Monitoring page headline cards now share the same live system-stat snapshot for CPU, memory, disk, uptime, and load so the visible numbers agree across the UI.
- Settings What's New is filled through `v1.7.67-alpha`, including the missing historical `v1.7.44-alpha` through `v1.7.66-alpha` entries.
- Bitcoin/Knots/Core shell lifecycle specs now match the Rust app config memory policy: 8 GiB on normal hosts, 4 GiB on low-memory hosts, and pruned Knots uses a larger dbcache on hosts with enough RAM to improve IBD throughput.
- ElectrumX/electrs shell lifecycle specs now match the 4 GiB memory policy used by the Rust app config, reducing drift between first boot, reconcile, and app lifecycle paths.
- Live assessment of `100.70.96.88` identified the current IBD bottlenecks as CPU/thermal/I/O pressure rather than RAM exhaustion, with follow-up work planned for existing-node swap repair, kiosk Chromium CPU reduction, and reconcile failure cleanup.
## v1.7.66-alpha (2026-05-18)
- Nginx Proxy Manager stale-port repair now detects stopped or `Created` Podman records by inspecting `podman ps -a` port metadata, covering records where `podman port nginx-proxy-manager` returns no mapping until start.
- Live recovery on `100.70.96.88` removed only the stale Nginx Proxy Manager container record and recreated it with `8081:81`, `8084:80`, and `8444:443`, preserving `/var/lib/archipelago/nginx-proxy-manager` data.
- Validation confirmed Nginx Proxy Manager recovered as healthy and responds through direct admin port `8081`, host compatibility port `81`, and `/app/nginx-proxy-manager/`.
## v1.7.65-alpha (2026-05-18)
- Orchestrator-backed app starts now run the same pre-start repairs as the legacy Podman path, so Nginx Proxy Manager stale `81:81` container metadata is removed and recreated before the orchestrator tries to start it.
- Live diagnostics on `100.70.96.88` confirmed host nginx is healthy while Nginx Proxy Manager has no listeners on `8081`, `8084`, or `8444`, causing host nginx `502` responses for NPM proxy paths.
## v1.7.64-alpha (2026-05-18)
- Update apply rate limiting is relaxed for authenticated admins from 2 attempts per 10 minutes to 10 attempts per minute, preventing the System Update page from getting stuck behind `429 Too Many Requests` during legitimate OTA retry/troubleshooting flows.
- The corrected backend artifact rebuild protection from `v1.7.63-alpha` remains in place, so this release is built from a fresh Rust backend binary before publishing.
## v1.7.63-alpha (2026-05-18)
- Release automation now rebuilds the Rust backend after bumping the version and before hashing release artifacts, preventing OTA manifests from pointing at a stale backend binary.
- This corrected release carries the Nginx Proxy Manager stale-port repair in an updated backend binary, so nodes running `1.7.61-alpha` can actually receive and execute the fix.
- Validation confirmed the previously published `v1.7.62-alpha` backend artifact still contained `1.7.61-alpha`, explaining why nodes did not advance after applying that update.
## v1.7.62-alpha (2026-05-18)
- Nginx Proxy Manager start and restart now repair stale Podman containers that still publish the admin UI on host port `81`, which conflicts with host nginx on updated nodes.
- The repair recreates only the stale Nginx Proxy Manager container metadata while preserving `/var/lib/archipelago/nginx-proxy-manager` data and using the current `8081:81`, `8084:80`, and `8444:443` mappings.
- Runtime stale-listener cleanup for Nginx Proxy Manager is shared across start and restart paths so rootless port helper leftovers are still cleared before lifecycle retries.
- Validation passed with `cargo fmt --all --check --manifest-path core/Cargo.toml` and `cargo check -p archipelago --manifest-path core/Cargo.toml`.
## v1.7.61-alpha (2026-05-18)
- Multi-container stack installs now keep their app card in the `Installing` state for up to 20 minutes while dependency containers are being pulled and prepared.
- BTCPay Server installs no longer appear to vanish or fail after two minutes while Postgres and NBXplorer are still being created before the primary `btcpay-server` container exists.
- The stale-transition escape hatch remains short for start, stop, restart, update, and removal operations, so genuinely wedged lifecycle actions still recover quickly.
- Live validation on `100.70.96.88` confirmed BTCPay Server completed installation and responds on port `23000` with the expected HTTP redirect.
## v1.7.60-alpha (2026-05-18)
- Meshtastic serial detection now rejects malformed or incomplete handshakes instead of accepting unrelated serial devices as a fallback Meshtastic radio.
- Mesh radio auto-detection now skips known non-mesh serial devices such as Sierra Wireless LTE modems and Zooz/Z-Wave sticks, avoiding interference with production peripherals.
- Meshtastic config sync now sends `want_config_id` with the correct protobuf wire type, fixing radio-side `ignore malformed toradio` errors and allowing node-info/contact ingestion.
- The stable `/dev/mesh-radio` udev rule no longer claims every `ttyACM*` device; it only matches known mesh USB serial adapters and known USB CDC ACM radio vendors.
- Live validation on `100.70.96.88` confirmed Archipelago selects `/dev/ttyUSB0`, identifies the Meshtastic node, and refreshes 103 mesh contacts.
## v1.7.59-alpha (2026-05-17)
- Mobile app launching now keeps known container apps inside Archipelago's app-session flow instead of forcing desktop-only new-tab behavior on phones.
- App sessions on mobile now respect the status-bar safe area so foreground iframe content starts below the device chrome while the fullscreen backdrop remains edge-to-edge.
- Prepackaged website launch buttons now resolve their curated website URLs before website-container fallback logic, restoring launches for the L484 sites and adding the Arch Presentation bookmark.
- Meshtastic contact discovery now drains the radio config stream through completion and retries config sync when the contact cache is empty, so nearby nodes already known by the radio are more likely to appear in Archipelago.
- The Apps page now includes a compact sideload button and modal for installing trusted Docker images with optional title, description, and port mapping metadata.
- Sideloaded app title and description metadata now persist through the backend app-config file so refreshed package scans do not collapse custom apps back to generic IDs.
- Validation passed with `npm test -- appLauncher`, `npm run build`, `cargo check -p archipelago`, and `cargo fmt --all --check`.
## v1.7.58-alpha (2026-05-17)
- Mesh networking now supports Meshtastic radios over the Meshtastic serial API in addition to existing MeshCore Companion USB radios.
- The mesh listener now probes preferred and auto-detected serial paths for both MeshCore and Meshtastic firmware, preserving the existing reconnect loop so unplug/replug and firmware hot-swap behavior stays consistent.
- Meshtastic text packets are translated into the existing Archipelago mesh frame pipeline, so current RPC handlers, transport routing, message storage, typed-message decoding, and UI state continue to work without a separate frontend path.
- Meshtastic node information is surfaced as normal mesh contacts using stable synthetic public keys derived from Meshtastic node numbers, allowing peer refresh and message attribution to reuse existing MeshCore contact handling.
- Outbound Archipelago mesh messages can now be sent through Meshtastic as channel text packets using the same command path used by MeshCore channel broadcasts.
- Device status now reports the detected firmware family as `meshcore` or `meshtastic` from the shared listener abstraction.
- Radio udev rules now include USB CDC ACM serial devices (`ttyACM*`) alongside CP2102, CH340, and FTDI adapters so Meshtastic boards are more likely to appear through the stable `/dev/mesh-radio` symlink.
- Host nginx now serves `/assets/*` hashed frontend chunks as immutable static files with a hard 404 on misses instead of falling back to `index.html`, preventing strict MIME errors when a browser has a stale pre-update HTML shell.
- The SPA HTML shell and service-worker files now revalidate on every load, reducing stale frontend references after OTA updates.
- OTA runtime promotion now installs the bundled `nginx-archipelago.conf` into `/etc/nginx/sites-available/archipelago` and reloads nginx after a successful config test, so frontend cache/fallback fixes reach existing nodes without a manual deploy.
- Local validation passed with `cargo check -p archipelago`; live SSH testing against `100.70.96.88` was not completed because temporary public-key authentication was rejected on the target.
## v1.7.57-alpha (2026-05-17)
- Nginx Proxy Manager now avoids privileged rootless Podman host port `81`, preferring `8081:81` while host nginx keeps a compatibility proxy on `:81` for stale cached launch buttons.
- App installs now allocate ports by checking live host bind availability, falling back to a free high port when preferred ports are already occupied.
- Portainer-created launchable containers are separated into a `Websites` tab and launch through their discovered published host port instead of hard-coded app URLs.
- Internal BuildKit helper containers such as `buildx_buildkit_default` are hidden from the Apps UI.
- Portainer works out of the box on Debian 13/Podman installs by including `catatonit` and by preserving the Podman socket mount as a socket rather than creating it as a directory.
## v1.7.56-alpha (2026-05-15)
- Health notifications now clear when an app is no longer unhealthy, including stale alerts for removed containers such as Portainer.
- Fresh installs now include the full Wi-Fi userspace stack (`wpasupplicant`, `wireless-regdb`, `iw`, `rfkill`, `polkitd`, `pciutils`, and `usbutils`) so NetworkManager can scan and connect with Intel Wi-Fi cards out of the box.
- The installed system now grants the `archipelago` service user explicit NetworkManager PolicyKit access for web-triggered Wi-Fi scans and connection changes.
- Wi-Fi connect now replaces stale/partial NetworkManager profiles and creates an explicit WPA-PSK profile with the supplied password, avoiding no-secret retry failures after a failed attempt.
- Settings password changes now update the Linux/SSH password through non-interactive sudo, so the web password and SSH password stay in sync when the checkbox is enabled.
- Quadlet environment values with spaces or shell metacharacters are quoted consistently, preventing env drift recreate loops for apps like nostr-rs-relay and Grafana.
- Boot/bootstrap reconcile avoids restarting running Bitcoin containers while repairing RPC config, preserving IBD progress on active nodes.
- Exit code 137 is labeled as SIGKILL instead of assuming OOM, avoiding false OOM alerts for orchestrator-managed recreates.
- Container reconcile force-recreates Podman records stuck in `Stopping`, preserving bind-mounted app data while recovering wedged containers automatically.
- Container health reporting is honest for running containers: Archipelago surfaces Podman's actual health state instead of marking every running container healthy.
- Quadlet reconciliation restarts services when stale health gates, port bindings, network aliases, exec commands, or healthchecks drift from the current manifest.
- Bitcoin Knots sync performance improves on fresh installs and updates with 8Gi container memory, a 4Gi dbcache, and full CPU parallelism.
- ElectrumX initial indexing gets more headroom: CPU caps are removed, memory is raised to 4Gi, cache is raised to 3Gi, and oversized sends are allowed for heavier wallet/indexing workloads.
- Mempool/ElectrumX lifecycle qualification respects pruned/non-archival Bitcoin nodes instead of installing a half-running stack with unhealthy dependencies.
- LND wallet/RPC helpers are more tolerant of container-owned files and updated REST port metadata, improving LND lifecycle and wallet-connect flows.
- Marketplace/catalog metadata carries richer container config so remote lifecycle tests install apps using the same settings users get from the UI.
- The app screensaver no longer activates during media-heavy app sessions such as IndeeHub, Jellyfin, Immich, PhotoPrism, and File Browser; apps can also pause/resume it with media playback messages.
- A fresh `1.7.56-alpha` unbundled installer ISO is built from the same primary VPS2 release line for easy download and USB flashing.
## v1.7.55-alpha (2026-05-13)
- Container reconcile now force-recreates Podman records stuck in `Stopping`, preserving bind-mounted app data while recovering wedged containers automatically.
- `.198` is green after the container-layer hardening pass: focused and broad non-destructive lifecycle audits pass, raw Podman health/state sweep is clean, and direct app probes return healthy responses.
- Release-candidate artifacts are staged separately from live update publishing while Gitea artifact hosting is repaired.
## v1.7.54-alpha (2026-05-06)
- Existing installs now self-repair nginx backend proxy locations for `/bitcoin-status` and `/api/app-catalog`, including hosts where `sites-enabled/archipelago` is a copied active file instead of a symlink.
- LND UI is consistently served on `18083` across first boot, Tor config, companion Quadlet reconciliation, OTA runtime payloads, and ISO scripts; stale companion units/images are rewritten instead of only checking service active state.
- OTA frontend tarballs now carry a clean runtime payload with updated scripts, docker UI sources, and canonical nginx config, preventing startup promotion from reintroducing stale host assets.
- Release ISO builds now support the primary HTTP app registry when bundling core images, so unbundled media includes File Browser/Cloud support instead of requiring a post-install Marketplace download.
- `.116` was live-updated with the new backend and runtime scripts; focused non-destructive lifecycle audit passes for Bitcoin Knots, LND, BTCPay, Mempool, and Grafana.
## v1.7.53-alpha (2026-05-05)
- Bitcoin Knots/Core config generation no longer duplicates RPC bind and port settings between `bitcoin.conf` and container command args, fixing `Unable to bind all endpoints for RPC server` startup failures.
- Legacy Bitcoin container healthchecks no longer depend on `bitcoin-cli`, which is absent from current Knots images and can wedge Podman healthcheck runners.
- Update checks now prefer manifest OTA releases over stale git remotes unless `ARCHIPELAGO_GIT_UPDATES` is explicitly enabled, so installed nodes can see published releases from the VPS mirror.
## v1.7.52-alpha (2026-05-05)
- Tailscale now launches the local installed web UI on port `8240` and starts `tailscaled` before `tailscale web`, fixing unreachable installs after container creation.
- Grafana install/start/restart now repairs missing rootless host listeners on port `3000`, matching the existing SearXNG, Uptime Kuma, and Gitea recovery path.
- Debian 13/Trixie ISO and disk-install paths now force security updates from `trixie-security` during image/install creation so rebuilt release media includes patched base packages.
- Broad `.198` lifecycle audit passes with the current qualified app set; known absent blockers remain `electrumx`, `photoprism`, `dwn`, and `ollama`.
## v1.7.49-alpha (2026-04-30)
- Bitcoin Knots/Core UI now reports connection, reconnecting, syncing, and error states from a backend status bridge instead of showing a stale "Unable to connect" message while the node is warming up.
- ElectrumX UI now exposes indexed height, local Bitcoin height, known headers, status, and progress source so indexing/waiting states are readable during long initial sync.
- Added container doctor timer and smoke/lifecycle test coverage for Bitcoin Knots/Core, ElectrumX, Mempool, BTCPay/NBXplorer, and UI surface availability.
- Bitcoin Core and Bitcoin Knots are mutually exclusive variants, with a real Bitcoin Core manifest and corrected install conflict handling.
- IndeeHub now launches only on direct web UI port `7778`; the broken `/app/indeedhub/` path proxy was removed, and port `7777` remains the Nostr relay.
- BTCPay/NBXplorer Postgres environment formatting fixed so installs do not carry malformed connection strings.
## v1.7.48-alpha (2026-04-29)
- archipelago.service no longer fails to start with "Failed to set up mount namespacing: /run/containers: No such file or directory" on nodes where /run/containers wasn't pre-created. ExecStartPre now creates it. Existing nodes need a one-time `systemctl edit archipelago` to add the mkdir; ISO installs from this version forward have the fix baked in.
## v1.7.47-alpha (2026-04-29)
- Bitcoin Knots/Core sync is now significantly faster. The container now uses every available core for script verification (was capped at 2) and has 8GB of memory instead of 4GB so its 4GB UTXO cache has headroom for the mempool and peer connections. Existing nodes pick up the new limits on next install/update; freshly-installed nodes start at full speed.
- ElectrumX initial indexing is faster too. Its CPU cap is removed, container memory is 4GB, and its internal cache is now 3GB (default was 1.2GB).
## v1.7.46-alpha (2026-04-29)
- Health monitor no longer pages "Auto-restart failed" for orphaned containers. After a variant switch (bitcoin-core ↔ bitcoin-knots) the previous variant's container could survive uninstall and the health monitor would try restarting it forever. Now skipped silently with a debug log.
- Apps no longer disappear from My Apps when an install fails. The card stays visible with state=Stopped so the user can retry or uninstall, with the failure reason surfaced via the new install_progress.message field.
- "Downloading…" progress now actually advances during multi-image stack pulls. Was sticking at 20% until all pulls finished; now interpolates 20%→70% based on which image of N has landed.
- Pulled four docker.io images (bitcoin, gitea, nextcloud, valkey) into the lfg2025 registries on OVH and tx1138. Removes a docker.io dependency from first-boot installs.
- Resilience harness improvements: install-fail entries no longer vanish, install/uninstall/probe cells are timing-tolerant (60s retry on ui_probe and auth_probe), dep snapshots no longer leak companion containers into the dependent app's "new containers" set.
## v1.7.45-alpha (2026-04-29)
- Bitcoin RPC auth is durable. The dashboard reliably connects across container restart, image update, and reboot. Was failing on registry-pulled images that shipped a stale baked-in password.
- Multi-container apps show real install progress. IndeedHub (7), BTCPay (4), Mempool (3), Immich (3) — bar advances through Preparing → Pulling → Creating → Done instead of sitting at 0% until the very end.
- Apps no longer disappear from the dashboard mid-install. The container scanner now respects in-flight installs and updates instead of evicting an entry while its containers are still being created.
- IndeedHub installs cleanly on a fresh node. Five missing environment variables fixed; Nostr sign-in works on first install.
- Tailscale install no longer fails with "executable not found". Container command was a malformed shell string; now a proper command array.
- Removed three catalog entries that hung installs for ten minutes (dwn, endurain, ollama — no source images in our registries). Restored Nextcloud, sourced from docker.io.
- Bitcoin Core update path uses the correct image name (was pulling from a non-existent path).
- New ISO installs now allocate swap (sized to RAM, capped at 8GB, on the encrypted data partition). Without swap, container image builds and memory spikes were hitting OOM under load.
## v1.7.44-alpha (2026-04-28)
43de3b73 feat(orchestrator): complete container migration and release hardening
ce39430b feat(self-update): sync and rebuild UI containers on OTA
72dec5aa fix(lnd-ui): align container port across all specs
83aacdf2 chore(release): archive ISO build recipes, tarball-only releases
All notable changes to Archipelago will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
"description":"Bitcoin documentary streaming with Nostr identity.",
"id":"indeedhub",
"title":"IndeeHub",
"version":"1.0.0",
"description":"Bitcoin documentary streaming platform featuring God Bless Bitcoin and other educational content about Bitcoin, sovereignty, and decentralized technology. Sign in with your Nostr identity.",
"description":"Bot arena + 2-player arcade fighter with controller support and Adventure Mode.",
"id":"botfights",
"title":"BotFights",
"version":"1.1.0",
"description":"Bot competition arena with 2-player arcade fighting mode. AI bots battle in trivia challenges while humans duke it out with controllers. Built for Bitcoiners.",
"description":"Fedimint ecash client daemon (fmcd). Lets your node hold Fedimint ecash and join federations; the wallet talks to it over a local REST API.",
"tailscaled --tun=userspace-networking & for i in $(seq 1 30); do [ -S /var/run/tailscale/tailscaled.sock ] && break; sleep 1; done; tailscale web --listen 0.0.0.0:8240 & wait"
"notes":"Installed as a two-container stack: netbird dashboard on 8087 and netbird-server control plane on 8086 plus UDP 3478. For production clients, publish a DNS name over HTTPS with gRPC/WebSocket routing."
This will build all apps that have Dockerfiles. Standard apps (bitcoin-core, lnd, etc.) will use their official images, while custom apps (router, did-wallet, web5-dwn) will be built from source.
This will build all apps that have Dockerfiles. Standard apps (bitcoin-core, lnd, etc.) will use their official images, while custom apps (router, did-wallet) will be built from source.
### Build Specific App
```bash
./build.sh router
./build.sh did-wallet
./build.sh web5-dwn
```
## Running Apps via Archipelago
@ -64,7 +63,6 @@ In development mode, apps are accessible on offset ports:
- **Router**: http://localhost:18084
- **DID Wallet**: http://localhost:18083
- **Web5 DWN**: http://localhost:13000
- **Nostr RS Relay**: http://localhost:18081
- **Strfry**: http://localhost:18082
@ -72,7 +70,7 @@ See [PORTS.md](./PORTS.md) for complete port mapping.
## Development Workflow
### For Custom Apps (router, did-wallet, web5-dwn)
### For Custom Apps (router, did-wallet)
1. **Make changes** to source code in `apps/<app-id>/src/`
description:Bot competition arena with 2-player arcade fighting mode. AI bots battle in trivia challenges while humans duke it out with controllers. Built for Bitcoiners.
description:Bitcoin documentary streaming platform featuring God Bless Bitcoin and other educational content about Bitcoin, sovereignty, and decentralized technology. Sign in with your Nostr identity.
category:media
category:community
# The user-facing launcher (app_id "indeedhub"). Container is named "indeedhub"
# (matches the runtime's per-app references + the live container, so the
# orchestrator adopts it). Its nginx (listen 7777) proxies to the backends by
# their short aliases on indeedhub-net: api:4000, minio:9000, relay:8080.
container_name:indeedhub
container:
image:git.tx1138.com/lfg2025/indeedhub:latest
pull_policy:always # Pull from registry; falls back to local build
image:146.59.87.168:3000/lfg2025/indeedhub:1.0.0
pull_policy:if-not-present
network:indeedhub-net
dependencies:
- app_id:indeedhub-api
- storage:1Gi
resources:
cpu_limit:2
memory_limit:512Mi
disk_limit:1Gi
security:
capabilities:[]
readonly_root:true
no_new_privileges:true
user:1001
seccomp_profile:default
network_policy:bridge
apparmor_profile:default
# nginx master runs as root and drops workers to the nginx user (uid/gid
# 101) — needs SET{UID,GID}; CHOWN + DAC_OVERRIDE let it own + write the
# proxy cache under the tmpfs /var/cache/nginx. The orchestrator does
# --cap-drop=ALL, so (unlike the legacy `podman run` default caps) these
# must be declared or nginx workers die with "setgid(101) failed".
description:NetBird combined management / signal / relay server with an embedded identity provider and STUN. Backend for the self-hosted NetBird mesh VPN.
category:networking
# Hyphen name matches the runtime references (crash_recovery / dependencies /
# config startup order) + the live container, so on an existing node the
# orchestrator ADOPTS the running server rather than recreating it (data +
# the sqlite store under /var/lib/netbird preserved). Alias `netbird-server`
# is the short hostname the proxy's nginx proxies/grpc-passes to.
container_name:netbird-server
container:
image:docker.io/netbirdio/netbird-server:0.71.2
pull_policy:if-not-present
network:netbird-net
network_aliases:[netbird-server]
# The relay authSecret and the sqlite store encryptionKey are base64 keys
# (the server base64-decodes them to recover raw bytes — hex would decode to
# the wrong value). Generated once and reused: ensure_generated_secrets
# no-ops when the file already exists, so a re-render of config.yaml on an
# adopted node keeps the same keys (regenerating would orphan the store).
generated_secrets:
- name:netbird-relay-auth-secret
kind:base64
- name:netbird-store-encryption-key
kind:base64
# Pass the rendered config explicitly, mirroring the legacy `--config` arg.
description:Self-hosted WireGuard mesh VPN control plane with dashboard, embedded identity provider, management API, signal, relay, and STUN. The user-facing entry point — a TLS proxy in front of the dashboard + server.
category:networking
# The user-facing launcher (app_id + container both "netbird", matching the
# runtime references + the live container so the orchestrator adopts it). This
# is the nginx that terminates TLS on 8087 and fans out to the dashboard +
# server by their short aliases on netbird-net.
container_name:netbird
container:
image:docker.io/library/nginx:1.27-alpine
pull_policy:if-not-present
network:netbird-net
# Self-signed TLS cert materialised before create — the dashboard needs a
# secure context (window.crypto.subtle / OIDC PKCE, issue #15), so the proxy
# serves HTTPS. Idempotent: kept as-is when crt+key already exist (a user
# accepts it once). SAN defaults to the host IP + 127.0.0.1 + localhost.
generated_certs:
- crt:/var/lib/archipelago/netbird/tls.crt
key:/var/lib/archipelago/netbird/tls.key
dependencies:
- app_id:netbird-server
- app_id:netbird-dashboard
- storage:1Gi
resources:
memory_limit:256Mi
security:
# cap-drop=ALL is applied by the orchestrator. nginx (master as root, drops
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.