377 Commits

Author SHA1 Message Date
archipelago
c5ea41d0cb fix(mesh/outbox): expire messages with zero TTL immediately
is_expired used age > ttl_secs, so a message with ttl_secs=0 whose age
rounded to 0 seconds was considered live forever. Switch to >= so the
zero-TTL boundary expires on the first check, matching the intuitive
meaning of TTL and the behavior the tests assert.
2026-04-23 13:02:07 -04:00
archipelago
9d42645aa3 fix(avatar): prevent u16 overflow panic when seed byte is large
hue_color and accent_color computed (seed as u16) * 360, which overflows
u16 when seed >= 182 — debug builds panicked, release wrapped silently.
Widen to u32 before the multiplication.

This also unblocks several identity_manager tests that constructed avatars
through master_node_svg and were aborting on the panic.
2026-04-23 13:02:01 -04:00
archipelago
f6efe2f356 fix(transport/chunking): stop overwriting first 4 bytes of user data
encode_chunked() split the payload into shards first, then overwrote
the first 4 bytes of shard 0 with a u32 length header, then re-ran
Reed-Solomon to regenerate parity over the now-corrupted shards. The
decoder correctly read the length header and trimmed `[4..4+len]`
from the reconstructed buffer, but those first 4 bytes had already
been destroyed on the encode side, so every chunked mesh payload
lost its first 4 bytes.

Restructure: reserve 4 bytes for the length header up front, build
a single contiguous [len][data][pad] buffer, then split into shards.
Parity is computed over the correct shards on the first pass, no
double-encode needed.

Update test_chunk_roundtrip_medium: 500 bytes + 4-byte header = 504
bytes, which is 5 data shards (ceil(504/124)), not 4. The old test
assertion was wrong all along and masked the corruption bug because
it only checked the roundtripped bytes, which is exactly what we
need to verify. New assertion is correct.

Verified: all 7 transport::chunking tests pass.
2026-04-23 12:29:10 -04:00
archipelago
cd6f8bad70 fix(install-log): pre-create /var/log/archipelago/ so non-root backend can write
The backend runs as `archipelago` and calls `install_log()` to append
audit lines to the install log on every install / update / remove /
start / stop / restart. Target path was /var/log/archipelago-container-installs.log,
which does not exist and cannot be created by the service because
/var/log/ is root-owned. OpenOptions errors were silently swallowed,
so the log was never written on any node.

Ship a tmpfiles.d rule that pre-creates /var/log/archipelago/ and
container-installs.log with archipelago:archipelago ownership. Move
the const path to match, keeping logs inside the directory logrotate
already rotates (image-recipe/configs/logrotate.conf). Install the
rule from both the ISO build and self-update, and apply it
immediately on self-update so existing nodes get a working log
without needing a reboot.

Verified on .228: file created, backend user can write, backend
binary rebuilt with new const.
2026-04-23 12:02:46 -04:00
archipelago
694e5b0a9d fix(update): pass --create-missing when rollback recreates a destroyed container
The update flow removes the old container before starting the new
one. If the update fails after removal, the rollback path tries
`podman start <name>` first, then falls back to reconcile. But
reconcile without --create-missing treats the now-absent container
as an optional one that the install flow will (re)create later,
and skips it. Result: container stays destroyed until someone
notices and runs reconcile manually.

Add --create-missing to the rollback reconcile invocation so the
fallback actually rebuilds the container from its canonical spec.

Fixes the failure mode observed on .228 where a bitcoin-knots
update left the node with no bitcoin-knots container at all.
2026-04-23 10:06:55 -04:00
archipelago
12f93cc15e fix(image-versions): locate image-versions.sh at its actual deployed path
The Rust search path listed /opt/archipelago/image-versions.sh and
scripts/image-versions.sh (repo-relative for dev), but the image
recipe deploys the file to /opt/archipelago/scripts/image-versions.sh.
Production nodes therefore silently failed every lookup: find_file
returned None, load_image_versions returned an empty HashMap, and
both pinned_image_for_app and pinned_images_for_stack returned no
matches.

Symptom on deployed nodes: every container scan emitted
"image-versions.sh not found in any search path" at DEBUG level, and
the version-comparison logic in docker_packages.rs plus the
update-check logic in api/rpc/package/update.rs silently degraded to
no-op — users would not see update-available badges and upgrade RPCs
could not resolve pinned targets.

Fix: put the canonical deployed path first in PATHS, keep the older
/opt/archipelago/image-versions.sh as a fallback for not-yet-updated
nodes, and retain scripts/image-versions.sh as the dev-repo-relative
fallback. Verified on .228: backend now logs "Parsed 57 image
versions from /opt/archipelago/scripts/image-versions.sh" on scan.

Pre-existing test_parse_image_versions failure in this module is
unrelated (the NOT_AN_IMAGE assertion was broken before this change
because the parser's _IMAGE-suffix retain keeps it). Leaving that for
the general cargo-test cleanup pass.
2026-04-23 09:29:15 -04:00
archipelago
28e38a36a9 fix(config): auto-purge decommissioned .23 VPS from saved registry/mirror configs
load_registries + load_mirrors normally only ADD missing defaults to
the persisted JSON — explicit removals stick. After retiring the .23
Hetzner VPS we need the opposite: existing nodes have .23 baked into
their saved configs and would spend seconds per install/update timing
out against a dead host until the operator manually removes it via
the Settings UI.

Add a targeted one-time migration in both loaders: if any saved entry
has 23.182.128.160 in its URL, drop it on load and rewrite the file.
This is an exception to the usual "explicit removals stick" rule —
the user never chose to add this mirror, it was a default.

Narrow-scope migration (one hardcoded IP match, no schema version)
because the cost/benefit of a general migration system isn't worth
it for a single decommissioned host. Future retirements can follow
the same pattern.
2026-04-23 08:51:26 -04:00
archipelago
d9d5fa65e5 chore: retire .23 VPS mirror, promote .168 OVH to primary
The Hetzner VPS at 23.182.128.160 was decommissioned. Replace it
everywhere with the OVH VPS at 146.59.87.168, which was previously
the tertiary mirror.

  - update.rs: drop DEFAULT_TERTIARY_MIRROR_URL, promote .168 into
    the secondary slot as "Server 1 (OVH)"; tx1138 becomes Server 2.
    Default mirror list shrinks from 3 to 2.
  - container/registry.rs: default RegistryConfig drops .23, promotes
    .168 to Server 1 / priority 0, tx1138 stays Server 2 / priority 10.
  - api/rpc/package/config.rs: trusted-registry allowlist swaps .23
    for .168.
  - api/handler/mod.rs: app-catalog fallback URL uses .168.
  - neode-ui/views/marketplace/marketplaceData.ts: REGISTRY uses .168.
  - scripts/image-versions.sh: ARCHY_REGISTRY_FALLBACK uses .168.
  - image-recipe/build-auto-installer-iso.sh: installer ISO registries
    use .168 (both podman registries.conf and backend registries.json).

Tests updated to assert on the new 2-entry default lists (registry +
mirror). URL-parser fixture tests in update.rs retain .23 strings —
they exercise string-parsing logic, not mirror policy.

Git remotes: dropped `gitea-vps` and the .23 push URL on the `origin`
multi-push alias (not part of this commit — pure working-copy change).
2026-04-23 08:22:32 -04:00
archipelago
980c1b25f4 fix(install): kick scanner post-install so Launch button appears immediately
After install completes, the async-spawn wrapper wrote state=Running
but the skeletal install-time manifest (interfaces: None) persisted
until the next scheduled 60s scan. The frontend saw state=running but
hasUI=false and hid the Launch button for up to a full minute.

Add a shared Notify/watch pair between RpcHandler and the scan loop:
  - scan_kick (Notify): scan loop selects! between the 60s interval
    and this notify, running immediately on either.
  - scan_tick (watch<u64>): scan loop bumps the counter after each
    completed scan so callers can await completion.

Install and update success paths now call kick_scanner_and_wait before
flipping to Running. The scan merges via merge_preserving_transitional
(state stays Installing/Updating, manifest refreshed from live podman
with interfaces.main.ui populated from real port bindings). 2s timeout
falls back to pre-fix behavior on slow podman — no regression.
2026-04-23 07:59:03 -04:00
archipelago
7e62ea07f7 feat(install): phase-based progress bar replaces unparseable pull bytes
Podman emits zero parseable progress when stderr is piped (no TTY), so
the old byte-counter regex never matched in real installs. Users saw
0% for the whole pull, then a jump to 95%, then silence through
create-container, health-check, and post-install hooks.

Replace with 7 explicit lifecycle phases wired through install.rs and
update.rs: Preparing (5%), PullingImage (20%), CreatingContainer (70%),
StartingContainer (80%), WaitingHealthy (88%), PostInstall (95%),
Done (100%). Each maps to a fixed UI progress and status message.

Frontend PHASE_INFO mapper in stores/server.ts prioritizes phase when
present, falls back to byte-counter for legacy. A Math.max forward-only
guard ensures the bar never regresses. Deleted the duplicate watcher
in Discover.vue that was fighting the store's watcher with stale byte
logic. Added shimmer CSS on the fill (with prefers-reduced-motion
opt-out) so the bar looks alive during long phases.
2026-04-23 07:58:43 -04:00
archipelago
49b98e0271 fix(rpc): empty icon in transient install entry to avoid broken-image flicker
create_installing_entry hardcoded /assets/img/app-icons/<id>.png for
every new install. About half the app icons ship as .svg or .webp
(lnd.svg, vaultwarden.webp, bitcoin-knots.webp, mempool.webp), so the
browser 404s on the wrong extension and renders the default broken-image
glyph for the 10-30s window before the scanner refreshes with real
manifest data.

Send empty icon. The frontend's icon computed in AppCard.vue falls
through to curatedMap which has correct extensions for bundled apps,
and handleImageError still guards any remaining misses with a
placeholder SVG.
2026-04-23 06:58:12 -04:00
archipelago
1ad889608f feat(rpc): async-spawn install/uninstall/update lifecycle
Extend the async-spawn treatment previously shipped for Stop/Start/Restart
to the three remaining long-running lifecycle RPCs. Each wrapper validates
params, rejects duplicate in-flight ops, flips state to the transitional
variant (Installing/Removing/Updating), then spawns the existing inner
handler on tokio. RPC returns immediately with { status, package_id }; the
spawn task owns the terminal state write.

Install and update success arms explicitly set state=Running. The scan
loop merge (merge_preserving_transitional) refuses to overwrite
transitional states, so the spawn task must write the terminal state.
Uninstall's inner handler removes the entry entirely, so no explicit
terminal write is needed there.

Dispatcher and handler now thread self as Arc<Self> / &Arc<Self> so
spawned tasks can hold their own Arc without extra field cloning.

Transient install entry uses empty icon string. Hardcoding
/assets/img/app-icons/<id>.png 404s for apps that ship .svg or .webp
assets, which produces a broken-image flicker until the scanner refreshes
with manifest data. Empty string causes the frontend's icon computed to
fall through to the curated map, which has correct extensions.

Removed the inner "already updating" guard in update.rs — the wrapper
now owns duplicate-op detection for all three operations.
2026-04-23 06:57:50 -04:00
archipelago
cd69c3b2f6 fix(state): preserve transitional state across container scans
The 30s package scan loop used to blindly overwrite every package
entry from podman inspect. While a user-initiated Stop / Start /
Restart was in flight, the RPC spawn task would flip the state to
Stopping / Starting / Restarting, the next scan would see podman
still reporting "running" (for the duration of the graceful stop,
up to 600s for bitcoin-core), and clobber the transitional state
back to Running. The dashboard would then flip Running -> Stopping
-> Running -> Stopped, making it look like the stop had silently
failed until it eventually completed.

The merge loop now treats transitional variants (Stopping, Starting,
Restarting, Installing, Updating, Removing, and the three backup
variants) as owned by the RPC spawn task. For those variants,
merge_preserving_transitional keeps the existing state while still
taking live observability fields (health, exit_code, installed,
lan_address, manifest, static_files, available_update) from the
fresh scan so the UI continues to see live health readings.

Adds an escape hatch via a per-scan transitional_since side table:
if a package has been in a transitional state for more than 1200s
(2x the longest graceful stop at 600s on bitcoin-core), the scan
loop assumes the spawn task died without cleanup and overrides with
podman's live state. Prevents a crashed background task from wedging
a package in Stopping forever.

Three unit tests cover the merge rule, the observability passthrough,
and the transitional-variant classifier.
2026-04-23 05:15:13 -04:00
archipelago
39dd1d9dcc fix(rpc): async container stop/start/restart; widen state mapping
RPC handlers no longer block on podman operations. container-stop on
bitcoin-core used to hold the connection for up to 600s while the UI
showed a frozen spinner; it now returns in under a second with
{status: stopping} after flipping the package state to Stopping and
broadcasting over WebSocket. Same treatment for container-start and
the new container-restart route.

Widens container-list state mapping to emit the transitional variants
(stopping, starting, restarting, installing, updating, removing,
installed, and the backup states) instead of collapsing them to
"unknown". Keeps the mapping in sync with the UI ContainerStatus.state
union so the dashboard can render the right transitional label.

Mirrors the treatment in package/runtime.rs for package.start,
package.stop, and package.restart. The body of each handler is lifted
into pure do_package_* helpers that the background task runs; state
flipping is bracketed around the spawn with revert on error. The
pre-existing post-start exit-check verification and restart stop+start
fallback run inside the spawned task, not the RPC body.

Adds container-restart route to the dispatcher. mark_user_stopped
continues to run BEFORE the spawn, preserving the ordering contract
with the crash recovery layer at runtime.rs:145-148.
2026-04-23 04:59:45 -04:00
archipelago
5baced5f5b feat(rpc): spawn_transitional helper for async lifecycle ops
Introduces a new RPC-layer helper that bridges the synchronous
ContainerOrchestrator trait with RPC handlers that must return in <1s.

The helper flips the package state to a transitional variant
(Stopping / Starting / Restarting) in the StateManager so WebSocket
clients see the live label immediately, then tokio::spawns the
actual orchestrator call. On success it writes the final state; on
error it reverts to the pre-transition state and logs via
install_log().

The ContainerOrchestrator trait stays synchronous so the reconciler,
boot flow, unit tests, and chaos harness keep deterministic
behaviour. Async only lives in the RPC layer.

Not wired to any handler yet — Commit 2 consumes this helper.
Widens install_log visibility from pub(super) to
pub(in crate::api::rpc) so the new sibling module can reach it.
2026-04-23 04:55:52 -04:00
archipelago
30b31b3670 fix(lnd): read admin macaroon via sudo fallback
LND's admin.macaroon is owned by a rootless-podman subordinate UID
(typically 100000) with mode 640. The archipelago server runs as UID
1000 and cannot read the file directly, which caused every dashboard
LND RPC (getinfo, connect-info, export-channel-backup) and lnd_client
to fail with "Failed to read LND admin macaroon".

Add a read_lnd_admin_macaroon() helper that first tries a direct read
(for operators who have relaxed permissions) then falls back to
`sudo -n cat`, mirroring the pattern already used for Tor hidden
service hostnames in handle_lnd_connect_info. Centralise the canonical
macaroon path as LND_ADMIN_MACAROON_PATH and route all four callers
through the helper.

Verified on .228: GET /lnd-connect-info now returns 200 with cert,
macaroon, and tor_onion fields. Dashboard QR/connect-string UI
unblocked.
2026-04-23 04:15:44 -04:00
archipelago
3e9c192b48 feat(container): bitcoin-ui pre-start hook renders nginx.conf from embedded template
Replaces the first-boot-containers.sh sed/envsubst approach with a
Rust-native render step bound into the ContainerOrchestrator lifecycle.

- New container::bitcoin_ui module: embeds the nginx.conf template via
  include_str!, reads the plaintext RPC password from
  /var/lib/archipelago/secrets/bitcoin-rpc-password, substitutes
  {{BITCOIN_RPC_AUTH}} with base64(archipelago:<password>), and atomic-
  writes (tmp + rename) to /var/lib/archipelago/bitcoin-ui/nginx.conf.
  Idempotent: byte-compares before writing so unchanged input is a
  no-op (no inode churn, no restart cascade).
- ProdContainerOrchestrator gains run_pre_start_hooks(app_id) returning
  HookOutcome::{Rewritten, Unchanged}. Fires in install_fresh before
  create_container, and in ensure_running: on Running + Rewritten
  triggers a restart; on Stopped re-renders then starts.
- bitcoin-ui Dockerfile no longer COPYs a default.conf; the file now
  arrives via runtime bind-mount of the rendered config. If the bind-
  mount is ever missing, nginx starts with no site configured and
  returns 404 everywhere — safe failure vs. serving upstream RPC with
  a stale Authorization header.
- apps/{bitcoin,electrs,lnd}-ui/manifest.yml land as first-class
  manifests. bitcoin-ui declares the bind-mount target and a dependency
  on bitcoin-core; electrs-ui and lnd-ui declare their own deps and
  health checks.
- 8 new unit tests on the render fn (idempotency, rotation, trimming,
  missing/empty secret, template invariants) plus an integration test
  asserting install(bitcoin-ui) actually lands a substituted nginx.conf
  on disk via the hook. 39/39 container:: tests pass
  (test_parse_image_versions pre-existing failure unchanged, out of
  scope).
2026-04-23 02:19:52 -04:00
archipelago
6a0809d386 feat(container): wire ProdContainerOrchestrator + BootReconciler into main
Step 6 of the rust-orchestrator migration. Construct the container
orchestrator once in main.rs, call load_manifests + adopt_existing
immediately after Config::load, log the adoption report, and spawn
BootReconciler::run_forever with the 30s default interval. Thread the
orchestrator through Server::new -> ApiHandler::new -> RpcHandler::new
so the reconciler and RPC layer share one instance.

Wire a tokio::sync::Notify through the SIGTERM/SIGINT shutdown path so
the reconciler exits cleanly alongside the server drain. Uses notify_one
so the signal stores a permit if the reconciler is mid reconcile_all
when the signal fires.

Delete the commented-out run_boot_reconciliation block in main.rs that
documented the prior bash-script approach being unsafe on unbundled
installs — the new reconciler is manifest-driven and only touches apps
present in /opt/archipelago/apps, fixing that concern.

cargo check -p archipelago clean (6 pre-existing dead-code warnings on
trait methods not yet exercised until Step 9 hot-swap). Container test
suite 43/44 pass; the one failure (container::image_versions::
test_parse_image_versions) is pre-existing and unrelated.
2026-04-22 19:20:13 -04:00
archipelago
81c1613040 feat(container): BootReconciler — periodic reconcile loop for prod orchestrator
Step 5 of the rust-orchestrator migration. New file boot_reconciler.rs holds a
small Tokio task that calls ProdContainerOrchestrator::reconcile_all() on a
30-second cadence (answered design Q3).

  * BootReconciler::new(orch, interval, shutdown) — shutdown is an Arc<Notify>
    so callers can trigger a graceful exit without pulling in tokio-util.
  * run_forever(self) — does one reconcile immediately, then loops on
    tokio::select! { sleep_until | shutdown.notified() }. Shutdown interrupts
    the sleep but never an in-flight reconcile_all call.
  * Per-pass outcomes are logged at debug/warn; failures never propagate out
    because reconcile_all already absorbs per-app errors into ReconcileReport.

Four tokio::test(start_paused = true) tests verify the loop cadence against a
CountingRuntime test double:
  * initial_pass_fires_immediately — first reconcile runs with no delay
  * second_pass_fires_after_interval — second pass fires after exactly
    interval elapses in paused-clock time
  * shutdown_terminates_loop — notify_one() lets run_forever return
  * failure_in_one_pass_does_not_stop_loop — the loop keeps ticking even when
    the first pass had to install a missing container

Not wired into main.rs yet — that is Step 6. Re-exported from container::mod
as BootReconciler + RECONCILER_DEFAULT_INTERVAL for the wire-up step.
2026-04-22 19:04:34 -04:00
archipelago
40a6eaca72 feat(container): ContainerOrchestrator trait, RpcHandler uses it in prod
Step 4 of the rust-orchestrator migration. Unifies the container lifecycle
surface behind a single trait so the RPC layer stops caring whether it is
talking to the dev or prod orchestrator.

  * New trait core/archipelago/src/container/traits.rs: ContainerOrchestrator
    with install / start / stop / restart / remove / upgrade / status / list /
    logs / health, all keyed by app_id. Every method is async_trait-based.

  * ProdContainerOrchestrator: the lifecycle methods are moved from inherent
    impl into the trait impl (avoids name-shadowing recursion). Adoption and
    reconcile remain inherent since only main.rs / BootReconciler call them.

  * DevContainerOrchestrator: new trait impl that forwards to the existing
    Dev-named methods, applying the dev container-name + port-offset rules
    internally. New load_manifest_for() helper resolves app_id to
    <data_dir>/apps/<app_id>/manifest.yml so trait-level install(app_id)
    works in dev too. install_container(manifest, path) stays inherent for
    the manifest-path RPC shape.

  * RpcHandler now holds Option<Arc<dyn ContainerOrchestrator>> and, when in
    dev mode, a separate Option<Arc<DevContainerOrchestrator>> for the
    manifest_path install RPC. In prod mode RpcHandler::new() constructs a
    ProdContainerOrchestrator and calls load_manifests() at startup.

  * All seven container-* RPC guards no longer say dev mode required.
    container-install still requires dev mode because its manifest_path
    argument has no prod meaning; every other container RPC now works in both
    modes via the trait.

BOOT STILL DOES NOT USE THIS. main.rs wire-up (Step 6) and BootReconciler
(Step 5) come next. Until then the prod orchestrator is constructed but nothing
populates /opt/archipelago/apps so it has zero manifests to manage, matching
the pre-Step-4 behaviour.

Verification: cargo build -p archipelago clean (11 expected unused method
warnings for methods not yet wired from main.rs). cargo test -p archipelago:
all 21 container::* tests pass (16 prod_orchestrator + 5 others). 24 other
test failures are pre-existing and unrelated (identity_manager / session /
wallet / mesh / credentials — all independently flaky on file-backed state).
2026-04-22 18:56:52 -04:00
archipelago
e103925a4e feat(container): ProdContainerOrchestrator with build-or-pull, adoption, reconcile
Step 3 of the rust-orchestrator-migration. New file prod_orchestrator.rs (999 LOC)
implements the full public surface that will replace scripts/first-boot-containers.sh:

  * install / start / stop / restart / remove / upgrade / status / list / logs / health
  * adopt_existing: read-only scan that claims containers matching our manifests by
    name, without recreating — preserves the v1.7.42 fixture on .116.
  * reconcile_all: level-triggered, per-app failures collected rather than aborting.
  * install_fresh: build-or-pull (Step 2 trait methods), relative build contexts
    resolved against the manifest directory.

Naming rule (answered design Q1): UI app IDs (bitcoin-ui/electrs-ui/lnd-ui) get the
archy- prefix; backends keep their bare ID. An explicit extensions.container_name
always wins. Codified in compute_container_name() with unit tests for all three tiers.

Concurrency (answered design Q4): per-app tokio::sync::Mutex<()> created lazily,
protecting every mutating op against the reconciler loop. Acquiring the per-app
lock only needs a read lock on the map, so independent apps do not serialize.

16 tests: 3 sync naming rule tests + 13 tokio async tests covering install (pull,
build-absent, build-present, relative-context), reconcile (noop/exited/missing/
mixed-failure), adopt-by-name, upgrade sequence ordering, list filtering, health
state mapping, and unknown-app-id rejection. All pass.

Not wired into main.rs yet — that is Step 6. Crate builds clean with expected
unused warnings for the new re-exports.
2026-04-22 18:32:31 -04:00
archipelago
919055f3f1 feat(container): add build source to manifest schema
ContainerConfig.image is now Option<String>, mutually exclusive with a new
optional ContainerConfig.build: Option<BuildConfig>. Exactly one of image
or build must be present, enforced in AppManifest::validate.

Adds ResolvedSource enum (Pull | Build) and ContainerConfig::resolve +
::image_ref helpers so the orchestrator can treat pull and build uniformly.
All 26 existing pull-only manifests continue to parse unchanged
(covered by existing_pull_only_manifests_still_parse test).

Call sites updated: podman_client, runtime::DockerRuntime, dev_orchestrator.
Dev orchestrator errors out cleanly on Build sources until Step 2 lands
build_image support on the runtime trait.

Step 1 of docs/rust-orchestrator-migration.md. 10 new unit tests, all pass.

Also includes: docs/rust-orchestrator-migration.md (design spec) and
docs/STATUS.md resume section for the next session.
2026-04-22 17:46:36 -04:00
archipelago
0ac673deb4 release(v1.7.42-alpha): bitcoin RPC retry wrapper so syncing nodes stop flashing red
Closes failure mode adjacent to FM3 (docs/bulletproof-containers.md): on
a syncing pruned node, bitcoind's RPC thread blocks for 5-10s during block
validation. The old 10s client-side timeout was rejecting roughly 30% of
UI calls even though the node was perfectly healthy. 20x stress test on
the live .116 node (caught in IBD catch-up at block 797k) used to drop
10 of 20 calls; now drops 0 of 20.

What changed:
- core/archipelago/src/api/rpc/bitcoin.rs: bitcoin_rpc_call now retries up
  to 3 times with 500ms and 1500ms backoffs between attempts. Only
  transient transport errors (timeout, connect refused, send/recv IO)
  trigger retry. A well-formed bitcoind error response is surfaced
  immediately - real RPC bugs are never masked.
- Per-attempt hard deadline (tokio::time::timeout, 15s) layered on top
  of reqwest's own timeout, so DNS starvation or TLS wedging can't
  steal the entire retry budget.
- handle_bitcoin_getinfo client builder gained a 3s connect_timeout
  so a dead bitcoind is fast-failed inside the first attempt instead
  of eating the whole 15s.
- Retry policy extracted into a RetryConfig struct so tests can dial
  down timeouts to ~100ms per attempt. Production defaults live in
  RetryConfig::production().

Not changed (tracked as follow-up):
- mesh/mod.rs bitcoin_rpc_getblockcount and related helpers use the
  same 10s-timeout pattern. Not migrated to the new wrapper in this
  release; scheduled for v1.7.43 alongside the render_bitcoin_conf
  work.
- lnd/info.rs and electrs_status have similar 10s/15s timeouts but
  different failure profiles - audit first, migrate only the ones
  that actually exhibit the bug.

Tests: 6 new unit tests under api::rpc::bitcoin::tests, all passing.
Uses an in-process hyper server (already a transitive dep) to simulate
bitcoind responses; no new crates required.
  - happy_path_first_attempt: no retry when first attempt succeeds
  - retries_on_timeout_then_succeeds: first attempt times out, second
    succeeds, returns OK (uses a short-timeout RetryConfig so the test
    runs in <1s instead of 15s)
  - retries_exhausted_on_persistent_connect_refused: all attempts fail
    against a closed port, error bubbles up, elapsed time confirms
    backoffs actually ran
  - does_not_retry_on_rpc_level_error: bitcoind-returned error body is
    surfaced immediately, no retry
  - does_not_retry_parse_errors: non-JSON response (e.g. 503 with html
    body) is NOT retried - guards against the tempting "retry all
    non-2xx" mistake that would mask real bitcoind misconfig
  - retry_budget_invariants: asserts total wall-time ceiling stays
    under 60s so a bumped constant can't silently hang a UI call
    forever

Validated live on .116: 20/20 bitcoin.getinfo calls succeed during IBD
catch-up (chain at block 797419 -> 797464), vs ~40% baseline under the
old 10s timeout. Worst-case latency was 48.9s during peak validation;
happy-path latency (cached result) remains 28-77ms.
2026-04-22 16:46:28 -04:00
archipelago
d1bcf271f9 release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet
Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 +
v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500)
with no recovery path short of SSH. This release adds a self-check
guardrail to the update flow.

What changed:
- apply_update() writes a pending-verify marker with old+new version and
  a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is
  present and within its freshness window, the new binary waits 15s for
  nginx + backend to settle, then probes https://127.0.0.1/ every 5s for
  up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and
  nothing else happens.
- On window-exhaust, the new binary:
    1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts>
       (quarantined, not deleted, so we can post-mortem).
    2. Restores web-ui.bak on top of web-ui.
    3. Calls rollback_update() to restore the previous binary.
    4. Updates state.current_version to reflect the rollback.
    5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without
  probing, so a crashed-during-startup marker from weeks ago cannot
  spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of
  tokio::fs::copy, so it escapes the service's ProtectSystem=strict
  mount namespace. Without this, the rollback silently failed with
  EROFS on /usr/local/bin and orphaned the rollback - the exact
  opposite of what auto-rollback is for.

Tests: 4 new unit tests in update::tests covering marker round-trip,
absent-marker noop, no-panic on verify_pending_update with nothing to
verify, and an invariant assert that the 90s probe window stays below
the 600s stale threshold. All passing.

Side fix: scripts/create-release-manifest.sh was dying with exit 141
(SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail.
Replaced with a single awk NR==1 that doesn't short-circuit the upstream
pipe, so the release-build flow is idempotent again.
2026-04-22 16:14:35 -04:00
Dorian
b8d084368e release(v1.7.39-alpha): hotfix web-ui perms after OTA (nginx 500) + startup self-heal
v1.7.38 shipped with an OTA bug: the tar-extracted staging dir inherited 700
perms and nginx (www-data) returned 500/403 on every request after the swap.
.116 hit this on rollout; had to chmod by hand to recover.

- update.rs: after extraction, explicitly chmod 755 dirs + 644 files on the
  new staging dir before the mv into place, so nginx can stat/serve them.
- main.rs: self-heal on startup — if /opt/archipelago/web-ui is not
  world-readable, run `sudo chmod -R u=rwX,go=rX` to repair. This is what
  rescues nodes upgrading from v1.7.37/v1.7.38, since their extractor
  (running on the old binary) doesn't have the chmod fix yet — the new
  binary's first boot fixes the mess before nginx serves a single request.

Everything v1.7.38 shipped is still in this release:
- auth.rs auto-heals is_onboarding_complete() from setup_complete +
  password_hash so nodes don't bounce back to /onboarding/intro after
  browser clear / reboot / update
- useOnboarding tri-state: backend-unreachable no longer defaults to intro
- login sounds gated by isFirstInstallPhase() — silent after onboarding,
  typing sounds unaffected
- FIPS app / Nostr Relay / Nostr VPN / Routstr / Penpot removed from
  catalog + frontend + Rust + docker + icons; 15 image versions deleted
  from tx1138, .168, gitea-local
- AIUI baked into release tarball via demo/aiui/
- prebuild hook syncs app-catalog/catalog.json → public/catalog.json

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 13:26:54 -04:00
Dorian
36a6101026 release(v1.7.38-alpha): onboarding auto-heal + silent returning logins + app-store trim
- auth.rs now infers onboarding-complete from setup_complete + password_hash so
  nodes stop bouncing users through the intro wizard after browser clear / update
  / reboot; the flag self-heals to disk on next check
- frontend: "backend uncertain" no longer defaults to /onboarding/intro —
  useOnboarding returns null + callers poll / retry instead of flashing the wizard
- login sounds (synthwave, welcome voice, pop, whoosh, oomph) gated by
  isFirstInstallPhase(); typing sounds unaffected
- removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog,
  frontend config, Rust AppMetadata + install dispatch + install_penpot_stack;
  docker/fips-ui + docker/nostr-vpn-ui + apps/penpot dirs and 5 icons deleted;
  15 image versions deleted from tx1138, .168, gitea-local registries (.160
  Gitea was 502 at release time — follow-up)
- AIUI baked into frontend release tarball via demo/aiui/; deploy-to-target
  falls back to demo/aiui/ when the AIUI sibling checkout is missing
- prebuild hook syncs app-catalog/catalog.json → public/catalog.json so the
  two copies can no longer drift (was the source of the "apps still visible"
  bug — public/ had stale data)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 13:02:24 -04:00
Dorian
cfc98c600e release(v1.7.37-alpha): bitcoin-core install fixes + dynamic node UI + full-archive default
Install flow
- api/rpc/package/install.rs: always append the literal image URL as a
  last-resort pull candidate in do_pull_image, so images not carried by
  any configured mirror (docker.io/bitcoin/bitcoin:28.4) still install
  instead of masquerading as a generic pull failure across every mirror.
- api/rpc/package/install.rs: write_bitcoin_conf now skips on any stat
  error, not just "file exists". Once bitcoin-knots' first-boot chowns
  /var/lib/archipelago/bitcoin into the container's user namespace (700
  perms, UID 100100/100101), the archipelago daemon can't even traverse
  in — try_exists returns Err which unwrap_or(false) treated as "not
  present" and drove a doomed write. Now errors out of the directory
  traversal are treated as "conf already owned by container user" and
  the write is skipped. Mirrors the lnd.conf pattern.
- api/rpc/package/install.rs: drop the hardcoded `prune=550` from the
  conf default. Operators with multi-TB drives shouldn't be silently
  pruned; users who want a pruned node can set it in bitcoin.conf
  themselves. Full archive is the only honest default.
- api/rpc/package/config.rs: bitcoin-core now passes explicit
  -server/-rpcbind/-rpcallowip/-rpcport/-printtoconsole/-datadir CLI
  args. Vanilla bitcoin/bitcoin:28.4 has no entrypoint wrapper and
  reads conf + argv only; without these the RPC listens on 127.0.0.1
  inside the container and rootlessport can't reach it, so the
  bitcoin-ui companion gets 502 on every /bitcoin-rpc/ call.
  Bitcoin Knots keeps its own entrypoint-driven defaults.
- container/docker_packages.rs: split bitcoin-core out of the shared
  AppMetadata arm. bitcoin-core now surfaces as "Bitcoin Core" with
  bitcoin-core.svg and a Reference-implementation description; the
  bitcoin + bitcoin-knots ids keep the Knots branding. Fixes the home
  card showing "Bitcoin Knots" for a Core install.

Bitcoin node UI (docker/bitcoin-ui)
- index.html: impl name/tagline/logo now dynamic. applyImplBranding()
  reads subversion from getnetworkinfo — /Satoshi:X/Knots:Y/ resolves
  to Bitcoin Knots, plain /Satoshi:X/ resolves to Bitcoin Core. Both
  get their own icon and subtitle. Settings modal replaced its
  hardcoded Regtest/txindex=1/port-18443 placeholders with live values
  from getblockchaininfo + getindexinfo + getzmqnotifications.
- index.html: new Storage info card (Full Archive · X GB /
  Pruned · X GB from blockchainInfo.pruned + size_on_disk) visible on
  the main dashboard, same level as Network. Settings modal mirrors it
  with the prune height when applicable.
- Dockerfile + assets/: bitcoin-core.svg, bitcoin-knots.webp, and the
  bg-network.jpg used by the dashboard are now COPY'd into the image
  under /usr/share/nginx/html/assets. Previously the <img src> pointed
  at paths that 404'd into the SPA fallback and the onerror handler
  hid the broken logo silently.

Frontend
- appSession/appSessionConfig.ts: add bitcoin-core to APP_PORTS (8334),
  HTTPS_PROXY_PATHS (/app/bitcoin-ui/), and APP_TITLES (Bitcoin Core).
  Without these the AppSessionFrame showed "No URL found for
  bitcoin-core" and the home/app-list title fell through to the raw id.
- settings/AccountInfoSection.vue: backfill What's New entries for
  v1.7.31 through v1.7.37 that had been missed in earlier cuts.

Release plumbing
- releases/v1.7.37-alpha/: binary + frontend tarball.
- releases/manifest.json: v1.7.37-alpha, sha256/size refreshed.
- Cargo.toml / package.json: version bumps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:03:47 -04:00
Dorian
9cf1177b73 release(v1.7.36-alpha): bitcoin-core in App Store + Sovereignty Stack + dynamic catalog URL
- neode-ui/public/assets/img/app-icons/bitcoin-core.svg (NEW): 256×256
  Umbrel community Bitcoin icon sourced from getumbrel.github.io/
  umbrel-apps-gallery/bitcoin/icon.svg. Referenced by the static
  catalog, the curated fallback, and the upstream lfg2025/app-catalog
  entry so every surface shows the same image.
- app-catalog/catalog.json + neode-ui/public/catalog.json: add
  bitcoin-core (v28.4) entry pointing at bitcoin/bitcoin:28.4. Same
  entry pushed to the lfg2025/app-catalog repo on .160 and the local
  gitea mirror so nodes see it without needing a full archipelago
  update. Sovereignty Stack entry added to FEATURED_DEFINITIONS with
  a description that frames it as a Knots alternative, not a rival.
- core/archipelago/src/api/handler/mod.rs: handle_app_catalog_proxy
  is now instance-scoped (&self) and derives its upstream list from
  load_registries — each active container registry contributes one
  `<scheme>://<reg.url>/app-catalog/raw/branch/main/catalog.json` URL
  in priority order (scheme follows tls_verify). When the operator
  switches mirrors in Settings, the App Store now follows. Falls back
  to the legacy hardcoded .160/tx1138 pair only when registry config
  can't be loaded, so the App Store still renders on nodes that
  haven't persisted one yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 09:06:10 -04:00
Dorian
a7048f6d8e release(v1.7.35-alpha): rootless-netns self-heal + app update button + bitcoin-core 28.4 + Node DID unification
- core/archipelago/src/bootstrap.rs (NEW): embed scripts/container-doctor.sh
  and image-recipe/configs/archipelago-doctor.{service,timer} via
  include_str! and sync to disk + enable the timer on every archipelago
  startup. Idempotent (content-hash compare), dev-box symlink guard keeps
  the git checkout untouched, best-effort (warn-only on failure) so
  bootstrap never blocks server readiness. Wired in main.rs as a
  background tokio task.
- scripts/container-doctor.sh: add fix_rootless_netns_egress(). Detects
  when the rootless-netns has lost its pasta tap (container-to-container
  still works but outbound DNS/TCP fails) via an nsenter probe into
  aardvark-dns; with a two-probe 10s debounce to rule out transients and
  a host-precheck that bails out if the host itself is offline. When the
  rootless-netns is truly broken, does a graceful podman stop --all /
  start --all so pasta + aardvark-dns rebuild the netns from scratch.
  Bitcoin-knots and every other outbound container recover in one cycle.
- core/archipelago/src/update.rs: host_sudo → pub(crate) so bootstrap.rs
  can reuse the existing systemd-run escape hatch.
- apps/bitcoin-core/manifest.yml: bump app version 24.0.0 → 28.4.0 and
  image bitcoin/bitcoin:24.0 → bitcoin/bitcoin:28.4. Resources aligned
  with the real container-specs.sh large-disk tune (4 GiB memory cap,
  cpu_limit: 0 so bitcoind can run -par=auto across every core).
- neode-ui/src/views/apps/AppCard.vue + Apps.vue: add an Update button
  + Updating spinner to every app card that has available-update set.
  Wires through serverStore.updatePackage(id) — the same RPC the detail
  view already calls. common.update / common.updating i18n keys added in
  en.json and es.json.
- core/archipelago/src/identity_manager.rs: add create_from_signing_key()
  that mirrors an existing Ed25519 key as a manager-level identity with
  a deterministic id (`node-<pubkey16>`). Idempotent across restarts,
  gets the hex-SVG master avatar.
- core/archipelago/src/server.rs: the auto-create path on first boot now
  mirrors the node's own signing_key (seed-derived on onboarded installs)
  as a "Node" identity instead of generating a random "Default" keypair.
  Once this ships, the DID on the Web5 DID Status card (via node.did
  RPC), the Node entry on the Identities page (via identity.list), and
  the DID used for peer-to-peer connects (via server_info.pubkey) all
  resolve to the same seed-derived pubkey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 08:29:56 -04:00
Dorian
974fce5870 release(v1.7.32-alpha): fix frontend tarball layout + mDNS shutdown hang
- HOTFIX: v1.7.31-alpha's frontend tarball was packaged with a
  `neode-ui/` top-level directory instead of the flat layout v1.7.30
  and earlier used. Nodes that applied v1.7.31 ended up with
  `/opt/archipelago/web-ui/neode-ui/index.html` instead of
  `/opt/archipelago/web-ui/index.html`, and nginx returned 403/500.
  v1.7.32's tarball is built with `tar -C web/dist/neode-ui .` so
  files land directly at web-ui root. Broken nodes auto-heal on this
  update (web-ui dir is replaced).
- transport/lan.rs: add Drop impl that calls ServiceDaemon::shutdown()
  on the mdns_sd daemon. Without this the OS thread it spawns, plus
  the blocking `receiver.recv()` task, keep the tokio runtime alive
  past SIGTERM — long enough for systemd's TimeoutStopSec to SIGKILL
  the service and mark it Failed. Was visible on every update:
  "shut down cleanly" logged, then 15s later systemd forcibly kills.
- main.rs: after logging "Archipelago shut down cleanly", call
  `std::process::exit(0)` explicitly. Belt-and-suspenders against
  any future non-daemon thread creeping in (reqwest resolver pool,
  etc.) and causing the same SIGKILL regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 03:52:22 -04:00
Dorian
682b93f2d6 release(v1.7.31-alpha): idempotent IndeedHub install + auto-merge default mirrors/registries + 3rd OVH update mirror
- Backend: install.rs registry reachability probe now strips the
  `host[:port]/namespace` suffix before appending `/v2/` (the Docker
  V2 API lives at the host root, not under the namespace) and accepts
  HTTP 405 in addition to 200/401 as "registry daemon alive". This
  fixes false "unreachable" reports on the Test button for Gitea and
  other registries that protect their /v2/ endpoint.
- Backend: stacks.rs install_indeedhub_stack now force-removes any
  leftover indeedhub-* containers and indeedhub-net before creating
  the stack. A partial install (or the old first-boot stub racing the
  installer) used to leave containers around that blocked re-install
  with "name already in use". Re-running the App Store install now
  self-heals.
- Backend: registry.rs load_registries auto-merges any default
  registry URLs missing from the saved config (appended with priority
  max+10+i, persisted). Lets new default mirrors (e.g. Server 3 OVH)
  roll out to existing nodes without manual config edits. Explicit
  removals still stick — URLs absent from disk AND absent from
  defaults stay gone.
- Backend: update.rs adds DEFAULT_TERTIARY_MIRROR_URL at
  http://146.59.87.168:3000/ (Server 3 OVH) to default_mirrors, with
  the same auto-merge-on-load behavior as registries. Test updated
  for 3-mirror default (.160, tx1138, .168).
- Scripts: dropped the first-boot IndeedHub stub (~38 lines in
  first-boot-containers.sh §8b). It predated the proper stack
  installer, raced it, and was the main source of the name-conflict
  mess the stacks.rs cleanup above now also guards against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 03:26:09 -04:00
Dorian
18f0929614 release(v1.7.30-alpha): live install/uninstall progress + cleaner pull waterfall
- Backend: unified pull-progress streaming across primary AND fallback
  registries. Earlier code only streamed for the primary attempt; if it
  failed fast (VPS 404, etc.) the UI froze at 0% until the fallback
  finished. The waterfall now uses a single shared helper that streams
  podman stderr through update_install_progress for every URL tried.
- Backend: PackageDataEntry gains uninstall_stage, set at each phase of
  handle_package_uninstall ("Stopping containers (i/total)",
  "Cleaning up volumes", "Removing app data"). State flips to Removing
  during the pipeline.
- Frontend: MarketplaceAppCard renders the live progress bar with byte
  counts during installs, matching the System Update download bar style.
- Frontend: AppCard renders the live uninstall stage label per app.
  Modal closes immediately on confirm so concurrent uninstalls each
  show their own progress on their own card.
- Cleanup: removed dead helpers (image_candidates, rewrite_for_primary,
  primary_image_url, pull_from_registries_with_skip) made unused by
  the install.rs refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 19:11:36 -04:00
Dorian
1709149ebd release(v1.7.29-alpha): VPS as default app registry + settings UI
- New Settings → App registries page (/dashboard/settings/registries)
  that mirrors the update-mirrors experience: list of configured
  registries, test reachability, set primary, add/remove. New
  registry.set-primary RPC; existing registry.{list,add,remove,test}
  reused.
- Default RegistryConfig flipped: VPS (23.182.128.160:3000/lfg2025) is
  now Server 1 (primary), tx1138 is Server 2 (fallback).
- Install pipeline now rewrites the first pull to the primary registry
  URL before attempting it. Before this, installs always hit whichever
  registry the image was hardcoded to, so changing the primary didn't
  actually affect where images came from. On failure, the existing
  fallback walk skips the primary (already tried) and walks the rest.
- App catalog proxy UPSTREAMS order flipped so the catalog follows the
  same VPS-first rule.
- Reboot overlay: animated "a" logo now sits in the center of the ring
  (matches the screensaver composition). Extracted the logo-wrapper
  pattern inline.

7/7 registry tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:54:07 -04:00
Dorian
2664074210 release(v1.7.28-alpha): reboot progress overlay + VPS default primary
- New reboot progress overlay: full-screen black with the screensaver's
  pulsing ring, rebooting → reconnecting → back-online → stalled stages,
  elapsed counter, auto-reload on health-check success, manual reload
  button at 3 min stall. Mirrors the existing update overlay.
- Ring extracted from Screensaver.vue into a reusable ScreensaverRing
  component so the reboot overlay reuses the same animation.
- default_mirrors() now puts the VPS as Server 1 (primary) and tx1138 as
  Server 2 — new nodes fetch manifests from VPS first; existing nodes
  keep whatever mirror order they've customized.
- What's New entry prepended for v1.7.28-alpha.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:06:37 -04:00
Dorian
9868991900 release(v1.7.27-alpha): mirror transparency — served-by line + one-click test button
- New "Served by {mirror}" line on the System Update page so operators can see
  which mirror actually served the available manifest (vs. which is configured
  primary). Backend threads the served URL through UpdateState.manifest_mirror.
- New update.test-mirror RPC + per-row lightning-bolt button that pings a
  mirror and renders reachable/latency or error inline under the URL.
- UI polish on the mirrors section: Set Primary, Remove, and the new Test
  action are compact icon buttons; add-mirror form moved into a dialog.
- "What's New" block prepended for v1.7.27-alpha.

21/21 update module tests pass. vue-tsc + vite build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 13:05:42 -04:00
Dorian
0d15ca588a release(v1.7.26-alpha): mirror list + origin-relative download URLs
Adds a multi-mirror manifest fetch. `check_for_updates` walks a
configurable list (data_dir/update-mirrors.json) in priority order
and falls through to the next mirror on any HTTP / parse / timeout
failure. Two defaults bake in: Server 1 (git.tx1138.com) and Server 2
(23.182.128.160:3000).

Critical fix: after parsing a manifest, rewrite every component's
`download_url` so its origin matches the manifest URL we fetched.
Before this, the manifest hard-coded absolute URLs pointing at one
specific server — so even when a node fetched the manifest from a
faster mirror, the actual 200MB download went back to the slow
original. Now the faster mirror wins end-to-end.

New RPCs: update.list-mirrors, update.add-mirror, update.remove-mirror,
update.set-primary-mirror. New UI section on the System Update page
for operator management. 5 new unit tests for origin parsing and
manifest rewriting (21/21 green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 10:09:28 -04:00
Dorian
1c1416cc1a release(v1.7.25-alpha): TCP transport for public FIPS mesh + modal cleanup
Re-adds the TCP transport (`0.0.0.0:8443`) to the rendered fips.yaml
alongside UDP. Upstream factory default enables both; we had
inadvertently narrowed to UDP-only when the yaml rewriter was last
touched, which left nodes unable to reach fips.v0l.io (the public
anchor only answers on TCP right now) or talk across networks that
block UDP.

Backend startup now compares the installed yaml against the current
rendered schema and restarts whichever fips unit is active when they
differ — so OTA-upgrading nodes pick up the new transport without
anyone having to click Reconnect.

Dropped the earlier plan to auto-add federated peers as seed anchors:
invites don't carry a FIPS-reachable IP:port, and once TCP reconnects
the public mesh, federated peers become npub-routable without needing
a seed entry.

Seed Anchors modal cleanup: replaced malformed header icon with a
three-arc broadcast glyph, and the close button now matches the
What's New modal (embedded in the card header, same icon + hover
style) instead of the earlier floating off-design placeholder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 09:25:53 -04:00
Dorian
4b6a088e38 release(v1.7.22-alpha): honest anchor status + Reconnect works on all nodes
- fips::service::active_unit() picks whichever fips unit is running
  (archipelago-fips.service vs upstream fips.service) so
  handle_fips_restart and handle_fips_reconnect don't silently no-op
  on hosts where the archipelago-managed unit was never created.
- peer_connectivity_summary(anchor_candidates) replaces the old
  identity-cache check. anchor_connected is now true when at least
  one authenticated peer's npub matches the public anchor OR any
  entry in seed-anchors.json, which matches what the user actually
  cares about ("am I in the mesh?") rather than what the card used
  to claim ("is this one specific public anchor reachable?").
- FipsStatus::query takes data_dir now (so it can read seed-anchors)
  rather than identity_dir. All call-sites updated.
- handle_fips_reconnect re-pushes seed anchors after restart so the
  new daemon gets dialed without waiting for the 5-min apply loop.
- FipsNetworkCard label drops "(fips.v0l.io)" — misleading now that
  multiple anchors may be configured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 07:08:26 -04:00
Dorian
f8304aed90 release(v1.7.21-alpha): operator-editable FIPS seed anchors
Adds a local seed-anchor list at <data_dir>/seed-anchors.json. Each
entry is {npub, address, transport, label}. On archipelago startup
and every 5 minutes the list is pushed into the running fips daemon
via `fipsctl connect <npub> <addr> <transport>`, so a cluster can
anchor itself independently of the global fips.v0l.io. A flaky or
unreachable public anchor no longer strands a fresh install.

New RPCs:
- fips.list-seed-anchors
- fips.add-seed-anchor (validates npub1… + host:port)
- fips.remove-seed-anchor
- fips.apply-seed-anchors (on-demand re-dial)

New standalone UI card at views/server/FipsSeedAnchorsCard.vue. Not
wired into Home.vue / Server.vue — operator places it per the
entry-point convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 06:21:37 -04:00
Dorian
40f76013dc release(v1.7.20-alpha): stop auto-apply scheduler killing the service
The 3AM auto-update path called std::process::exit(0) immediately
after apply_update returned. apply_update had already spawned a 2s-
delayed systemctl restart, but exit(0) killed the runtime before that
spawned task could run — and the unit's Restart=on-failure does not
trigger on a clean exit 0, so the service stayed dead until someone
SSH'd in and started it manually (.253 hit this today).

Scheduler now returns from the task without killing the process;
apply_update's existing restart path (same one the UI's Install
Update button uses) brings the new version up cleanly.

Also hardens the ISO CI: the AIUI inclusion step now falls back to
extracting from the newest release tarball if the runner's cached
/opt/archipelago/web-ui/aiui path is missing, so a reprovisioned
runner can't silently ship a frontend tarball without AIUI. The ISO
build step also sanity-checks the binary exists before invoking the
builder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 04:33:11 -04:00
Dorian
4e2c6d210b release(v1.7.19-alpha): kill stale available_update + numeric version compare
load_state now drops any stored available_update whenever the running
binary version differs from what's on disk — the old migration only
cleared it when the stale entry happened to match the new version, so
skipping releases (e.g. sideloading 1.7.16 → 1.7.18 without 1.7.17)
left a pointer to an intermediate version as the "update available",
which the UI then offered as a downgrade prompt.

check_for_updates also uses a numeric version comparator so a stale or
cached manifest with an older version can't offer itself as an
update, and 1.7.10 correctly outranks 1.7.9 past the single-digit
patch boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 04:04:20 -04:00
Dorian
7d8ddcccef release(v1.7.18-alpha): transitive peers default Trusted + update-flow logs
Flip transitively-discovered federation peers to Trusted instead of
Observer. Hints are already only ingested from peers we trust and only
peers we trust are re-exported via build_local_state, so the chain of
trust is already vetted end-to-end — making the user promote each
newcomer by hand was friction with no security win.

Backend:
- federation/sync.rs: merge_transitive_peers now inserts TrustLevel::Trusted
  (doc comment updated to explain the transitive-trust rationale)
- update.rs: info! log at download start (version, components, total_bytes,
  staging path), cancel (staging wiped?, marker cleared?), and apply (backup
  path) so journalctl reveals where a stuck update actually is

Frontend:
- SystemUpdate What's New block gets a v1.7.18-alpha entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:20:36 -04:00
Dorian
f853d14421 release(v1.7.17-alpha): cancel download + stall detection
Add Cancel Download button + stall detection so a wedged download can
be recovered instead of leaving the UI stuck on a frozen progress bar.

Backend:
- update.rs: DOWNLOAD_CANCEL AtomicBool + DOWNLOAD_PROGRESS_AT AtomicU64
- download loop checks cancel between chunks and during retry backoff
  (500ms slices instead of one exponential sleep, so Cancel wakes fast)
- cancel_download() wipes staging + clears update_in_progress
- update.status exposes download_progress.stalled (30s no-progress)
- RPC: update.cancel-download + dispatcher entry

Frontend:
- SystemUpdate.vue: Cancel Download button, amber stall styling,
  stalled copy, cancel-download confirm branch in modal
- i18n keys (en + es) for cancel/stall flow
- v1.7.17-alpha What's New block in AccountInfoSection

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:10:34 -04:00
Dorian
f2360d570f release(v1.7.16-alpha): bidirectional + transitive federation, no self-peering
Federation join flow now notifies the inviter with the joiner's name and
immediately bumps state so the Federation UI reloads without a manual
Sync click. Accepting an invite that points back at the local node is
rejected up front (DID/pubkey/onion match). After a peer joins, we spawn
a transitive sync that pulls the new peer's federated peer hints so all
nodes in the federation learn about each other as Observer entries.
Federation.vue polls every 5s while mounted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 18:12:02 -04:00
Dorian
749234b8b0 release(v1.7.15-alpha): bulletproof downloads — resume, retry, real progress
download_update
  Each component download is now resumable via HTTP Range requests
  (Range: bytes=N-) and retried up to 6 times with exponential
  backoff (5/15/30/60/120/180s). On a dropped connection the next
  attempt picks up at the last written byte offset instead of
  restarting at zero. Streams via reqwest::Response::chunk() to the
  staging file so a 160 MB frontend tarball doesn't sit in RAM. SHA
  is verified over the complete file at the end of each component;
  mismatch nukes the staged file and restarts from scratch.

Real download progress counters
  New AtomicU64 globals DOWNLOAD_BYTES/DOWNLOAD_TOTAL are updated
  from the chunk loop. update.status exposes them as
  download_progress.{bytes_downloaded, total_bytes, active}. The
  SystemUpdate.vue progress bar now polls update.status every
  second instead of incrementing a fake random counter — and
  crucially, if the user navigates away and back, the component
  picks up the in-progress download from the backend atomics
  immediately.

Update-check retries
  handle_update_check now retries the manifest fetch up to 3 times
  with a 5s gap if the first try hits a transport error, so a
  momentary gitea hiccup doesn't make a node report "up to date"
  when there actually is a new release. Tight 10s connect timeout
  per attempt keeps the total bounded.

Artefacts:
  archipelago                                      1070c87f…c081c162b  40584792
  archipelago-frontend-1.7.15-alpha.tar.gz         8e630eba…63fd43f   162078068

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 17:17:58 -04:00
Dorian
be8e5ee46b release(v1.7.14-alpha): install overlay + FIPS real fix + AIUI restore
Install UX
  SystemUpdate.vue now shows a full-screen overlay after apply: the
  BitcoinFaceAscii logo, a target-version label, an indeterminate
  progress stripe (solid orange; solid green on ready), and an
  elapsed-time readout. Polls /health every 1.5s and auto-reloads
  once the backend reports the new version. 3-min stall → "Reload
  now" button. Download UI also shows a spinner + "Finishing
  download — verifying checksum…" while the fake bar sits at 95%.

FIPS reconnect — for real this time
  New fips.reconnect RPC does stop → start → wait 20s → re-poll →
  classify. Classification buckets: connected / daemon_down /
  no_seed_key / no_outbound_udp_or_anchor_down / peers_but_no_anchor,
  each with a plain-language hint surfaced verbatim by the Reconnect
  button. The real reason nodes like .198/.253 couldn't reach the
  anchor: identity::write_fips_key_from_seed was writing fips_key.pub
  as a bech32 npub TEXT file, but upstream fips expects 32 raw
  bytes. The daemon silently authenticated with garbage. Fix:
  PublicKey::to_bytes() → raw 32 bytes, and new
  fips::config::normalize_pub_file migrates legacy files by decoding
  the npub and rewriting in place. fips.reconnect also re-installs
  the config + healed keys to /etc/fips before restarting.

AIUI preservation + restore
  apply_update was wiping /opt/archipelago/web-ui/aiui because the
  Vue build doesn't include it — every OTA lost the Claude sidebar.
  The preserve block now copies aiui/ + archipelago-companion.apk
  from the old web-ui into the staging dir before the swap, and
  prefers new-tar versions if present. To restore it on the three
  nodes that already lost it (.116/.198/.253), this release bundles
  the 85 MB aiui build into the frontend tarball. Frontend component
  size is now ~155 MB.

Download / install timeouts
  Backend download client timeout 1800s → 3600s (1 h). Larger
  tarball + slow gitea raw throughput put us above the old cap.
  Frontend update.download rpc timeout 30 min → 65 min to match.
  package.install rpc timeout 15 min → 45 min — IndeedHub pulls
  6 images and was timing out mid-install.

UI nit
  "Rollback to Previous" → "Rollback Available".

App-catalog proxy already landed in v1.7.13.

Artefacts:
  archipelago                                      725e18e6…3c525e6   40462288
  archipelago-frontend-1.7.14-alpha.tar.gz         c35284be…ff2c16   162077052 (+aiui)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 16:40:25 -04:00
Dorian
687c216e65 release(v1.7.13-alpha): proxy app catalog server-side (CORS + CSP fix)
The Discover / Marketplace page fetched the app catalog directly from
git.tx1138.com/lfg2025/app-catalog/raw/.../catalog.json in the
browser. Two blockers hit the fleet simultaneously: (1) tx1138's
Gitea doesn't emit Access-Control-Allow-Origin so the HTTPS fetch
got CORS-blocked; (2) the HTTP IP-port fallback
(http://23.182.128.160:3000/...) falls outside the node's
`connect-src` CSP. Users saw the hardcoded fallback instead of the
live catalog.

Backend: new authenticated GET /api/app-catalog handler uses reqwest
to pull catalog.json server-side (15s timeout) and returns it with
application/json + 1h Cache-Control. Tries the HTTPS URL first,
HTTP IP-port second.

Frontend: curatedApps.ts now calls /api/app-catalog (same-origin,
no CORS/CSP) with credentials included so the session cookie
authenticates the proxy. Baked /catalog.json stays as the last
resort.

Artefacts:
  archipelago                                      0aaf7262…b979f22c  40371192
  archipelago-frontend-1.7.13-alpha.tar.gz         27505811…efc6f4142 76982505

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:43:45 -04:00
Dorian
79f4ec4bde release(v1.7.10-alpha): apply namespace fix + FIPS cascade + profile polish
THE apply fix
  archipelago.service uses ProtectSystem=strict, so /opt and /usr are
  read-only inside the service's mount namespace. sudo inherits that
  namespace — every sudo mkdir/mv/chown from apply_update was hitting
  EROFS even as root. Every prior "Failed to apply update" was a
  symptom of this. New `host_sudo()` helper wraps every filesystem
  call in `sudo systemd-run --wait --collect --pipe -- <cmd>`, which
  spawns a transient unit with systemd's default (no ProtectSystem)
  protections — the command runs in the host namespace and can touch
  /opt/archipelago + /usr/local/bin normally.

FIPS cascade (#2)
  Home.vue and Server.vue both carry a FIPS row that previously only
  looked at {installed, service_active, key_present}. Now they also
  read anchor_connected + authenticated_peer_count and mirror the
  full FIPS card: green "Active · N peers" when healthy, orange "No
  anchor" when the DHT bootstrap has failed.

Profile paste URL fallback (#4)
  Web5Identities.vue list + editor previously had `@error="display:none"`
  on the <img>, which hid the tag without re-rendering the fallback —
  a broken pasted URL showed up blank. Replaced with reactive
  pictureLoadFailed / listPictureFailed flags plus a watcher that
  resets on URL change. Broken URL now falls back to the initial (or
  identicon for seed-derived identities).

Small-upload data URL (#3)
  Uploaded profile pictures ≤ 64 KB are now inlined as
  `data:image/png;base64,...` into profile.picture on the client
  before calling update-profile. That kind-0 event is fetchable by
  any Nostr client — no Tor needed. Larger uploads fall back to the
  onion-rooted public_url with a hint telling the user to paste a
  public https:// URL for broader visibility.

Deferred: #1 FIPS Reconnect "actually fixes" — the current Reconnect
calls fips.restart which clears the daemon state, but when the
anchor is truly unreachable (UDP 8668 blocked by network/ISP), no
amount of restart can help. A richer diagnostic is out of scope for
this bundle.

Artefacts:
  archipelago                                      4a77c704…82aa6f8  40379696
  archipelago-frontend-1.7.10-alpha.tar.gz         0644a436…54f58    76983846

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:46:03 -04:00
Dorian
1e648800ad release(v1.7.8-alpha): fix apply ETXTBSY — use mv instead of install
apply_update's binary swap called `sudo install -m 0755 src
/usr/local/bin/archipelago`. install opens the destination for write
with O_TRUNC; the kernel returns ETXTBSY (exit 1) when the path is a
currently-running executable, which it always is during apply because
apply_update is called by the archipelago RPC handler — running as
archipelago itself. Every previous "Failed to apply update" was this
one root cause; the manual sideload path only worked because we
stopped the service first.

rename() doesn't modify the file it replaces — it repoints the path
at a new inode while the old inode stays alive for any process that
has it mapped. `mv` uses rename(). Switched to `sudo mv` (with prior
chmod+chown on the staging file) so the swap is atomic and tolerant
of the running binary.

Frontend tarball byte-identical to v1.7.7-alpha; only the binary
version string changes.

Artefacts:
  archipelago                                      2753daec…48094d  40377648
  archipelago-frontend-1.7.8-alpha.tar.gz          4fb79664…0172e9  76984615 (reused)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:04:09 -04:00
Dorian
4d1bd063e8 release(v1.7.6-alpha): robust apply_update + manifest-override env var
apply_update frontend swap
  Transient EROFS on .198 (filesystem hiccup — root FS mounts with
  errors=remount-ro so a fleeting glitch can bounce /opt to RO for a
  moment) caught the pre-cleanup `rm -rf web-ui.new web-ui.bak` mid-
  stride and aborted the apply. Rewrote the swap to use a timestamped
  staging dir (web-ui.new.<ms>) and a timestamped old-copy path so
  nothing needs to be rm'd before the extract. After the new tree is
  mv'd into place, the previous rollback copy is rotated aside with a
  .<ms> suffix (best-effort) and this apply's old copy becomes the new
  web-ui.bak. If the final mv fails, the staged old is restored so
  nginx keeps serving.

handle_update_check manifest override
  handle_update_check takes the git path whenever ~/archy/.git exists.
  On the dev box (.116) that meant the Pull & Rebuild button was
  always the only option even though the manifest-path OTA was
  already wired via ARCHIPELAGO_UPDATE_URL. Now: if that env var is
  set, we skip the git detection entirely and use the manifest path.
  The regular fleet (no env var, no repo) hits the manifest branch
  naturally; beta dev nodes (repo + no env var) still get Pull &
  Rebuild; dev nodes with the env var explicitly set can finally test
  the manifest OTA end-to-end.

Artefacts:
  archipelago                                      356e78cc…91a6dd  40372288
  archipelago-frontend-1.7.6-alpha.tar.gz          4fb79664…0172e9  76984615 (reused)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:33:10 -04:00