cargo check was showing five real warnings, all genuinely dead:
* container/mod.rs — re-exports compute_container_name, AdoptionReport,
ReconcileAction, ReconcileReport were unused outside
prod_orchestrator. Drop from the pub use line.
* prod_orchestrator — with_runtime + insert_manifest_for_test only exist
for the test module in the same file. Mark them
#[cfg(test)] so they don't appear in release builds.
* async_lifecycle — remove_package_entry has no callers; doc claims
"used for install-failure cleanup" but nothing
cleans up. Delete (10 lines).
* registry.rs — `use tracing::{debug, info};` had no consumers.
* fips.rs — unused-assignment chain on last_status. The poll
loop always sets it on every break path, so the
initial `None` and the unwrap_or_else fallback
were both dead. Refactored to `let after = loop
{ ...; break s; };`.
cargo check is now clean. cargo test --workspace --bins: 614 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If bitcoin-core was installed but never started (e.g. port 8332 already
bound by bitcoin-knots), the container sticks in `created` state forever.
The old conflict check refused EVERY future bitcoin install — including
re-install of the running variant — leaving no UI path to recovery.
Now the check distinguishes states:
- missing → no conflict, continue
- running → real conflict, refuse install
- created/exited/configured/... → stuck; auto-remove and continue
Volumes are untouched; only the dead container record goes away.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bitcoin containers were exiting in ms after start because the orchestrator
install path skipped the credential-materialisation step the legacy path
did. resolve_secret_env then failed to read
/var/lib/archipelago/secrets/bitcoin-rpc-password, the container started
with no password, and bitcoind crashed before logs were useful.
Two changes:
1. install.rs — call bitcoin_rpc_credentials() for bitcoin/bitcoin-core/
bitcoin-knots before any install branch runs. The function generates +
persists on first call (OnceCell-cached), so this is idempotent.
2. manifest.rs::resolve_secret_env — return ManifestError::Invalid when a
resolved secret trims to empty, instead of silently producing
`KEY=` env vars that crash auth.
Adds a unit test for the empty-secret rejection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3a of the install path consolidation. Two coupled changes:
1. install.rs handle_package_install: gate the legacy "container exists →
adopt + return" probe on !orchestrator_managed. Apps the orchestrator
knows about (bitcoin-knots, bitcoin-core, lnd, electrumx, fedimint,
filebrowser, btcpay-server stack apps, mempool stack apps, plus the
companion UIs that just moved to Quadlet) skip the legacy probe and
fall straight into the orchestrator branch.
The legacy adopt block was returning success on a bare `podman start`
exit-0 — even when the process inside the container crashed seconds
later. That's the .228 "running but unreachable" failure mode. The
orchestrator's ensure_running honors the manifest's health check and
pre-start hooks (e.g. re-renders bitcoin-ui's nginx.conf if the RPC
password rotated), so this is a behavioral upgrade, not just a
refactor.
2. ProdContainerOrchestrator::install: make idempotent. Previously it
blindly called install_fresh which would fail on `podman create` if
the container name already existed. Now it delegates to ensure_running:
- Container Running + healthy → no-op (refresh hooks, restart if
config rewritten)
- Container Stopped/Exited → start (with hook refresh)
- Container missing → install_fresh
- Container in wedged state (Created/Paused/Unknown) → force-recreate
Without this, change #1 would regress every "container already exists"
case for the 18 orchestrator-managed app IDs. With it, install becomes
the single source of truth for "make app X be in the desired state."
Tests: 654 passed across the workspace (614 unit + 37 orchestration + 3
rpc), 0 failures. The 20 prod_orchestrator tests cover the install /
ensure_running / reconcile paths the new install delegates through.
Net delta: install.rs grows by ~30 lines (gating wrapper + comments),
prod_orchestrator.rs grows by ~30 lines (idempotent install body). Both
are temporary — the larger deletions (~1700 lines) come once every app
has been verified through the orchestrator path in subsequent phases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion UI containers (archy-bitcoin-ui, archy-lnd-ui,
archy-electrs-ui) used to be launched as fire-and-forget tokio::spawn
blocks from install.rs. If archipelago crashed mid-spawn or the
container's cgroup was reaped, companions vanished from podman ps -a
and only a manual rm/run could bring them back (the .228 incident).
Now each companion is rendered as a Quadlet .container unit under
~/.config/containers/systemd/, daemon-reloaded, and started via
systemctl --user. systemd owns supervision from that point on:
- archipelago can crash, restart, or be uninstalled without touching
any companion.
- Quadlet's Restart=always + RestartSec=10 handles container exits.
- A 30s reconcile tick in boot_reconciler enumerates expected
companion units and re-installs any whose unit file or service
vanished — defense-in-depth against external tampering.
New module layout:
- container/quadlet.rs: pure unit renderer + atomic write_if_changed
+ systemctl helpers (daemon_reload_user / enable_now / disable_remove
/ is_active). 6 unit tests, no I/O in the renderer.
- container/companion.rs: per-app companion specs, install/remove/
reconcile, image presence (build local first, fall back to insecure
registry only via image_uses_insecure_registry whitelist). 2 tests.
install.rs handle_package_install now ends with a single call to
companion::install_for(package_id), replacing 287 lines of spawn-and-
hope shellouts plus a ~120-line nginx auth-injector helper that worked
around per-node RPC password baking. The helper is gone too — the
pre-start hook renders the per-node nginx.conf to /var/lib/archipelago/
bitcoin-ui/nginx.conf and the Quadlet unit bind-mounts it read-only.
runtime.rs handle_package_uninstall now disables companions before
the container rm loop. Otherwise systemd's Restart=always would
respawn each companion within ~10s of removal.
Tests: 53 container tests pass, including 6 quadlet renderer tests
(host network, bridge network, capability set, atomic write idempotence)
and 2 companion specs (per-app companion lookup, build_unit shape).
boot_reconciler tests gain a #[cfg(test)] without_companion_stage()
flag so the paused-clock fixtures don't race the real systemctl I/O.
A bats regression test (companion-survives-archipelago-restart.bats,
gated on ARCHY_ALLOW_DESTRUCTIVE=1) asserts the .228 failure mode
cannot recur: every installed companion has a unit file, services
stay active across systemctl --user restart archipelago, and a
deleted unit file is recreated within one reconcile tick.
Net delta: +941 / -363, but the +941 is mostly tests (~440 lines)
and the new declarative layer; the imperative tokio::spawn block and
its nginx-auth helper are gone, removing two failure classes
(orphan companions on archipelago crash, and post-start exec races
under tightly-confined cgroups) that previously needed manual SSH
recovery.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small, focused tightenings:
- core/container/src/podman_client.rs: drop the legacy Hetzner
23.182.128.160:3000 mirror from image_uses_insecure_registry().
It was decommissioned in v1.7.x and is stripped from active
registry config at load time; leaving it in the bypass list let
a stale config still skip TLS. Replace the inline match with a
named INSECURE_REGISTRY_HOSTS slice so future entries are one
line. Test now also pins the spoofing-immune semantics
("evil.example/146.59.87.168:3000/x" must NOT match).
- core/archipelago/src/api/rpc/package/config.rs: split bitcoin
from lnd in get_app_capabilities(). bitcoind never opens raw
sockets — drop CAP_NET_RAW from bitcoin/bitcoin-core/bitcoin-knots.
lnd/fedimint/fedimint-gateway keep it because they enumerate
network interfaces during cert generation.
- core/archipelago/src/bootstrap.rs: tighten_secrets_dir()
enforces 0700 on /var/lib/archipelago/secrets and 0600 on every
file inside on each startup. The dir-mode is the load-bearing
isolation boundary against rootless container escapes (their UID
maps to >=100000, can't traverse uid=1000/0700). The per-file
sweep is defense-in-depth against any installer that wrote 0644.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshots the in-flight hardening work so subsequent reconcile/Quadlet
phases land on a clean before/after diff.
Changes:
- core/container/src/podman_client.rs: image_uses_insecure_registry()
whitelist for the OVH (146.59.87.168:3000) and legacy Hetzner
(23.182.128.160:3000) HTTP mirrors; podman_network_settings() lifts
custom networks into the Networks map so containers can join them.
- core/archipelago/src/container/prod_orchestrator.rs:
ensure_container_network() creates per-manifest networks on demand;
apply_data_uid() now goes through host_sudo for mkdir -p + chown so
bind-mount roots get created and chowned without password prompts.
- core/archipelago/src/api/rpc/package/{install,update,stacks}.rs:
podman pull adds --tls-verify=false only for whitelisted registries.
- core/archipelago/src/bootstrap.rs: removes stale dev-mode systemd
override on startup (live nodes carried it from old installers).
- core/archipelago/src/config.rs: ignore ARCHIPELAGO_DEV_MODE in prod
binaries — it had been silently rerouting volumes to /tmp.
- apps/bitcoin-{core,knots}/manifest.yml: locate bitcoind at runtime
so image-layout differences don't break entrypoint.
- scripts/app-catalog-image-smoke-test.py: production catalog/image
smoke test that probes a target node before users click Install.
- .gitignore: cover .codex, .pnpm-store, __pycache__, *.bak.
Removes filebrowser.rs.bak and two stale catalog.json.bak files
(verified identical to live counterparts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sync-perf tuning for bitcoin/bitcoin-core/bitcoin-knots/electrumx.
- Drop the --cpus=2 cap on bitcoin/electrumx variants. Script verification
is parallelizable; the cap halved IBD speed on 4-8 core machines.
- Bump bitcoin --memory 4g→8g so dbcache=4096 has headroom for mempool +
connection buffers + I/O. 4g was OOM-prone during heavy IBD.
- Bump electrumx --memory 1g→2g + add CACHE_MB=2048 + MAX_SEND=10MB.
- bitcoin-core CLI args gain -dbcache=4096 -par=0 -maxconnections=125.
- bitcoin-knots manifest matched (1024MB pruned / 4096MB full + par=0).
Future v2: host-RAM-aware dbcache scaling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resilience-validated release. Three full sweeps of the new resilience
harness against .228 confirm no shipstoppers.
Big user-visible:
- Bitcoin RPC auth durably correct via host-rendered nginx.conf bind-mount,
replaces fragile post-start exec that failed under restricted-cap rootless
podman ("crun: write cgroup.procs: Permission denied")
- Multi-container stack installs (indeedhub, immich, btcpay, mempool) now
emit phase events at every boundary so the progress bar advances
- Apps no longer vanish from the dashboard mid-install (absent-scanner skips
packages in transitional states)
- Indeedhub fresh installs work end-to-end (was 8500+ restart loop): five
missing env vars (DATABASE_PORT, QUEUE_HOST, QUEUE_PORT,
S3_PRIVATE_BUCKET_NAME, AES_MASTER_SECRET) added to install code
- Tailscale install fixed: --entrypoint string was being passed as a single
shell-line arg; switched to custom_args array
- Catalog cleaned of broken entries (dwn, endurain, ollama removed; nextcloud
restored on docker.io)
- Bitcoin Core update path uses correct image (was looking for nonexistent
lfg2025/bitcoin:28.4)
- ISO installs now allocate swap on the encrypted data partition
Infra:
- New resilience harness (scripts/resilience/) — black-box state-machine
tester, every app × every transition. Run before each release.
Sweep #3 final: PASS 107 / FAIL 12 / SKIP 14. The 12 fails are 1 cosmetic
(homeassistant trusted_hosts), 8 harness/timing false-positives, and 3
non-shipstopper tracked items. Down from 23 in baseline sweep #1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The backend runs as `archipelago` and calls `install_log()` to append
audit lines to the install log on every install / update / remove /
start / stop / restart. Target path was /var/log/archipelago-container-installs.log,
which does not exist and cannot be created by the service because
/var/log/ is root-owned. OpenOptions errors were silently swallowed,
so the log was never written on any node.
Ship a tmpfiles.d rule that pre-creates /var/log/archipelago/ and
container-installs.log with archipelago:archipelago ownership. Move
the const path to match, keeping logs inside the directory logrotate
already rotates (image-recipe/configs/logrotate.conf). Install the
rule from both the ISO build and self-update, and apply it
immediately on self-update so existing nodes get a working log
without needing a reboot.
Verified on .228: file created, backend user can write, backend
binary rebuilt with new const.
The update flow removes the old container before starting the new
one. If the update fails after removal, the rollback path tries
`podman start <name>` first, then falls back to reconcile. But
reconcile without --create-missing treats the now-absent container
as an optional one that the install flow will (re)create later,
and skips it. Result: container stays destroyed until someone
notices and runs reconcile manually.
Add --create-missing to the rollback reconcile invocation so the
fallback actually rebuilds the container from its canonical spec.
Fixes the failure mode observed on .228 where a bitcoin-knots
update left the node with no bitcoin-knots container at all.
The Hetzner VPS at 23.182.128.160 was decommissioned. Replace it
everywhere with the OVH VPS at 146.59.87.168, which was previously
the tertiary mirror.
- update.rs: drop DEFAULT_TERTIARY_MIRROR_URL, promote .168 into
the secondary slot as "Server 1 (OVH)"; tx1138 becomes Server 2.
Default mirror list shrinks from 3 to 2.
- container/registry.rs: default RegistryConfig drops .23, promotes
.168 to Server 1 / priority 0, tx1138 stays Server 2 / priority 10.
- api/rpc/package/config.rs: trusted-registry allowlist swaps .23
for .168.
- api/handler/mod.rs: app-catalog fallback URL uses .168.
- neode-ui/views/marketplace/marketplaceData.ts: REGISTRY uses .168.
- scripts/image-versions.sh: ARCHY_REGISTRY_FALLBACK uses .168.
- image-recipe/build-auto-installer-iso.sh: installer ISO registries
use .168 (both podman registries.conf and backend registries.json).
Tests updated to assert on the new 2-entry default lists (registry +
mirror). URL-parser fixture tests in update.rs retain .23 strings —
they exercise string-parsing logic, not mirror policy.
Git remotes: dropped `gitea-vps` and the .23 push URL on the `origin`
multi-push alias (not part of this commit — pure working-copy change).
After install completes, the async-spawn wrapper wrote state=Running
but the skeletal install-time manifest (interfaces: None) persisted
until the next scheduled 60s scan. The frontend saw state=running but
hasUI=false and hid the Launch button for up to a full minute.
Add a shared Notify/watch pair between RpcHandler and the scan loop:
- scan_kick (Notify): scan loop selects! between the 60s interval
and this notify, running immediately on either.
- scan_tick (watch<u64>): scan loop bumps the counter after each
completed scan so callers can await completion.
Install and update success paths now call kick_scanner_and_wait before
flipping to Running. The scan merges via merge_preserving_transitional
(state stays Installing/Updating, manifest refreshed from live podman
with interfaces.main.ui populated from real port bindings). 2s timeout
falls back to pre-fix behavior on slow podman — no regression.
Podman emits zero parseable progress when stderr is piped (no TTY), so
the old byte-counter regex never matched in real installs. Users saw
0% for the whole pull, then a jump to 95%, then silence through
create-container, health-check, and post-install hooks.
Replace with 7 explicit lifecycle phases wired through install.rs and
update.rs: Preparing (5%), PullingImage (20%), CreatingContainer (70%),
StartingContainer (80%), WaitingHealthy (88%), PostInstall (95%),
Done (100%). Each maps to a fixed UI progress and status message.
Frontend PHASE_INFO mapper in stores/server.ts prioritizes phase when
present, falls back to byte-counter for legacy. A Math.max forward-only
guard ensures the bar never regresses. Deleted the duplicate watcher
in Discover.vue that was fighting the store's watcher with stale byte
logic. Added shimmer CSS on the fill (with prefers-reduced-motion
opt-out) so the bar looks alive during long phases.
create_installing_entry hardcoded /assets/img/app-icons/<id>.png for
every new install. About half the app icons ship as .svg or .webp
(lnd.svg, vaultwarden.webp, bitcoin-knots.webp, mempool.webp), so the
browser 404s on the wrong extension and renders the default broken-image
glyph for the 10-30s window before the scanner refreshes with real
manifest data.
Send empty icon. The frontend's icon computed in AppCard.vue falls
through to curatedMap which has correct extensions for bundled apps,
and handleImageError still guards any remaining misses with a
placeholder SVG.
Extend the async-spawn treatment previously shipped for Stop/Start/Restart
to the three remaining long-running lifecycle RPCs. Each wrapper validates
params, rejects duplicate in-flight ops, flips state to the transitional
variant (Installing/Removing/Updating), then spawns the existing inner
handler on tokio. RPC returns immediately with { status, package_id }; the
spawn task owns the terminal state write.
Install and update success arms explicitly set state=Running. The scan
loop merge (merge_preserving_transitional) refuses to overwrite
transitional states, so the spawn task must write the terminal state.
Uninstall's inner handler removes the entry entirely, so no explicit
terminal write is needed there.
Dispatcher and handler now thread self as Arc<Self> / &Arc<Self> so
spawned tasks can hold their own Arc without extra field cloning.
Transient install entry uses empty icon string. Hardcoding
/assets/img/app-icons/<id>.png 404s for apps that ship .svg or .webp
assets, which produces a broken-image flicker until the scanner refreshes
with manifest data. Empty string causes the frontend's icon computed to
fall through to the curated map, which has correct extensions.
Removed the inner "already updating" guard in update.rs — the wrapper
now owns duplicate-op detection for all three operations.
RPC handlers no longer block on podman operations. container-stop on
bitcoin-core used to hold the connection for up to 600s while the UI
showed a frozen spinner; it now returns in under a second with
{status: stopping} after flipping the package state to Stopping and
broadcasting over WebSocket. Same treatment for container-start and
the new container-restart route.
Widens container-list state mapping to emit the transitional variants
(stopping, starting, restarting, installing, updating, removing,
installed, and the backup states) instead of collapsing them to
"unknown". Keeps the mapping in sync with the UI ContainerStatus.state
union so the dashboard can render the right transitional label.
Mirrors the treatment in package/runtime.rs for package.start,
package.stop, and package.restart. The body of each handler is lifted
into pure do_package_* helpers that the background task runs; state
flipping is bracketed around the spawn with revert on error. The
pre-existing post-start exit-check verification and restart stop+start
fallback run inside the spawned task, not the RPC body.
Adds container-restart route to the dispatcher. mark_user_stopped
continues to run BEFORE the spawn, preserving the ordering contract
with the crash recovery layer at runtime.rs:145-148.
Introduces a new RPC-layer helper that bridges the synchronous
ContainerOrchestrator trait with RPC handlers that must return in <1s.
The helper flips the package state to a transitional variant
(Stopping / Starting / Restarting) in the StateManager so WebSocket
clients see the live label immediately, then tokio::spawns the
actual orchestrator call. On success it writes the final state; on
error it reverts to the pre-transition state and logs via
install_log().
The ContainerOrchestrator trait stays synchronous so the reconciler,
boot flow, unit tests, and chaos harness keep deterministic
behaviour. Async only lives in the RPC layer.
Not wired to any handler yet — Commit 2 consumes this helper.
Widens install_log visibility from pub(super) to
pub(in crate::api::rpc) so the new sibling module can reach it.
LND's admin.macaroon is owned by a rootless-podman subordinate UID
(typically 100000) with mode 640. The archipelago server runs as UID
1000 and cannot read the file directly, which caused every dashboard
LND RPC (getinfo, connect-info, export-channel-backup) and lnd_client
to fail with "Failed to read LND admin macaroon".
Add a read_lnd_admin_macaroon() helper that first tries a direct read
(for operators who have relaxed permissions) then falls back to
`sudo -n cat`, mirroring the pattern already used for Tor hidden
service hostnames in handle_lnd_connect_info. Centralise the canonical
macaroon path as LND_ADMIN_MACAROON_PATH and route all four callers
through the helper.
Verified on .228: GET /lnd-connect-info now returns 200 with cert,
macaroon, and tor_onion fields. Dashboard QR/connect-string UI
unblocked.
Step 6 of the rust-orchestrator migration. Construct the container
orchestrator once in main.rs, call load_manifests + adopt_existing
immediately after Config::load, log the adoption report, and spawn
BootReconciler::run_forever with the 30s default interval. Thread the
orchestrator through Server::new -> ApiHandler::new -> RpcHandler::new
so the reconciler and RPC layer share one instance.
Wire a tokio::sync::Notify through the SIGTERM/SIGINT shutdown path so
the reconciler exits cleanly alongside the server drain. Uses notify_one
so the signal stores a permit if the reconciler is mid reconcile_all
when the signal fires.
Delete the commented-out run_boot_reconciliation block in main.rs that
documented the prior bash-script approach being unsafe on unbundled
installs — the new reconciler is manifest-driven and only touches apps
present in /opt/archipelago/apps, fixing that concern.
cargo check -p archipelago clean (6 pre-existing dead-code warnings on
trait methods not yet exercised until Step 9 hot-swap). Container test
suite 43/44 pass; the one failure (container::image_versions::
test_parse_image_versions) is pre-existing and unrelated.
Step 4 of the rust-orchestrator migration. Unifies the container lifecycle
surface behind a single trait so the RPC layer stops caring whether it is
talking to the dev or prod orchestrator.
* New trait core/archipelago/src/container/traits.rs: ContainerOrchestrator
with install / start / stop / restart / remove / upgrade / status / list /
logs / health, all keyed by app_id. Every method is async_trait-based.
* ProdContainerOrchestrator: the lifecycle methods are moved from inherent
impl into the trait impl (avoids name-shadowing recursion). Adoption and
reconcile remain inherent since only main.rs / BootReconciler call them.
* DevContainerOrchestrator: new trait impl that forwards to the existing
Dev-named methods, applying the dev container-name + port-offset rules
internally. New load_manifest_for() helper resolves app_id to
<data_dir>/apps/<app_id>/manifest.yml so trait-level install(app_id)
works in dev too. install_container(manifest, path) stays inherent for
the manifest-path RPC shape.
* RpcHandler now holds Option<Arc<dyn ContainerOrchestrator>> and, when in
dev mode, a separate Option<Arc<DevContainerOrchestrator>> for the
manifest_path install RPC. In prod mode RpcHandler::new() constructs a
ProdContainerOrchestrator and calls load_manifests() at startup.
* All seven container-* RPC guards no longer say dev mode required.
container-install still requires dev mode because its manifest_path
argument has no prod meaning; every other container RPC now works in both
modes via the trait.
BOOT STILL DOES NOT USE THIS. main.rs wire-up (Step 6) and BootReconciler
(Step 5) come next. Until then the prod orchestrator is constructed but nothing
populates /opt/archipelago/apps so it has zero manifests to manage, matching
the pre-Step-4 behaviour.
Verification: cargo build -p archipelago clean (11 expected unused method
warnings for methods not yet wired from main.rs). cargo test -p archipelago:
all 21 container::* tests pass (16 prod_orchestrator + 5 others). 24 other
test failures are pre-existing and unrelated (identity_manager / session /
wallet / mesh / credentials — all independently flaky on file-backed state).
Closes failure mode adjacent to FM3 (docs/bulletproof-containers.md): on
a syncing pruned node, bitcoind's RPC thread blocks for 5-10s during block
validation. The old 10s client-side timeout was rejecting roughly 30% of
UI calls even though the node was perfectly healthy. 20x stress test on
the live .116 node (caught in IBD catch-up at block 797k) used to drop
10 of 20 calls; now drops 0 of 20.
What changed:
- core/archipelago/src/api/rpc/bitcoin.rs: bitcoin_rpc_call now retries up
to 3 times with 500ms and 1500ms backoffs between attempts. Only
transient transport errors (timeout, connect refused, send/recv IO)
trigger retry. A well-formed bitcoind error response is surfaced
immediately - real RPC bugs are never masked.
- Per-attempt hard deadline (tokio::time::timeout, 15s) layered on top
of reqwest's own timeout, so DNS starvation or TLS wedging can't
steal the entire retry budget.
- handle_bitcoin_getinfo client builder gained a 3s connect_timeout
so a dead bitcoind is fast-failed inside the first attempt instead
of eating the whole 15s.
- Retry policy extracted into a RetryConfig struct so tests can dial
down timeouts to ~100ms per attempt. Production defaults live in
RetryConfig::production().
Not changed (tracked as follow-up):
- mesh/mod.rs bitcoin_rpc_getblockcount and related helpers use the
same 10s-timeout pattern. Not migrated to the new wrapper in this
release; scheduled for v1.7.43 alongside the render_bitcoin_conf
work.
- lnd/info.rs and electrs_status have similar 10s/15s timeouts but
different failure profiles - audit first, migrate only the ones
that actually exhibit the bug.
Tests: 6 new unit tests under api::rpc::bitcoin::tests, all passing.
Uses an in-process hyper server (already a transitive dep) to simulate
bitcoind responses; no new crates required.
- happy_path_first_attempt: no retry when first attempt succeeds
- retries_on_timeout_then_succeeds: first attempt times out, second
succeeds, returns OK (uses a short-timeout RetryConfig so the test
runs in <1s instead of 15s)
- retries_exhausted_on_persistent_connect_refused: all attempts fail
against a closed port, error bubbles up, elapsed time confirms
backoffs actually ran
- does_not_retry_on_rpc_level_error: bitcoind-returned error body is
surfaced immediately, no retry
- does_not_retry_parse_errors: non-JSON response (e.g. 503 with html
body) is NOT retried - guards against the tempting "retry all
non-2xx" mistake that would mask real bitcoind misconfig
- retry_budget_invariants: asserts total wall-time ceiling stays
under 60s so a bumped constant can't silently hang a UI call
forever
Validated live on .116: 20/20 bitcoin.getinfo calls succeed during IBD
catch-up (chain at block 797419 -> 797464), vs ~40% baseline under the
old 10s timeout. Worst-case latency was 48.9s during peak validation;
happy-path latency (cached result) remains 28-77ms.
- auth.rs now infers onboarding-complete from setup_complete + password_hash so
nodes stop bouncing users through the intro wizard after browser clear / update
/ reboot; the flag self-heals to disk on next check
- frontend: "backend uncertain" no longer defaults to /onboarding/intro —
useOnboarding returns null + callers poll / retry instead of flashing the wizard
- login sounds (synthwave, welcome voice, pop, whoosh, oomph) gated by
isFirstInstallPhase(); typing sounds unaffected
- removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog,
frontend config, Rust AppMetadata + install dispatch + install_penpot_stack;
docker/fips-ui + docker/nostr-vpn-ui + apps/penpot dirs and 5 icons deleted;
15 image versions deleted from tx1138, .168, gitea-local registries (.160
Gitea was 502 at release time — follow-up)
- AIUI baked into frontend release tarball via demo/aiui/; deploy-to-target
falls back to demo/aiui/ when the AIUI sibling checkout is missing
- prebuild hook syncs app-catalog/catalog.json → public/catalog.json so the
two copies can no longer drift (was the source of the "apps still visible"
bug — public/ had stale data)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Install flow
- api/rpc/package/install.rs: always append the literal image URL as a
last-resort pull candidate in do_pull_image, so images not carried by
any configured mirror (docker.io/bitcoin/bitcoin:28.4) still install
instead of masquerading as a generic pull failure across every mirror.
- api/rpc/package/install.rs: write_bitcoin_conf now skips on any stat
error, not just "file exists". Once bitcoin-knots' first-boot chowns
/var/lib/archipelago/bitcoin into the container's user namespace (700
perms, UID 100100/100101), the archipelago daemon can't even traverse
in — try_exists returns Err which unwrap_or(false) treated as "not
present" and drove a doomed write. Now errors out of the directory
traversal are treated as "conf already owned by container user" and
the write is skipped. Mirrors the lnd.conf pattern.
- api/rpc/package/install.rs: drop the hardcoded `prune=550` from the
conf default. Operators with multi-TB drives shouldn't be silently
pruned; users who want a pruned node can set it in bitcoin.conf
themselves. Full archive is the only honest default.
- api/rpc/package/config.rs: bitcoin-core now passes explicit
-server/-rpcbind/-rpcallowip/-rpcport/-printtoconsole/-datadir CLI
args. Vanilla bitcoin/bitcoin:28.4 has no entrypoint wrapper and
reads conf + argv only; without these the RPC listens on 127.0.0.1
inside the container and rootlessport can't reach it, so the
bitcoin-ui companion gets 502 on every /bitcoin-rpc/ call.
Bitcoin Knots keeps its own entrypoint-driven defaults.
- container/docker_packages.rs: split bitcoin-core out of the shared
AppMetadata arm. bitcoin-core now surfaces as "Bitcoin Core" with
bitcoin-core.svg and a Reference-implementation description; the
bitcoin + bitcoin-knots ids keep the Knots branding. Fixes the home
card showing "Bitcoin Knots" for a Core install.
Bitcoin node UI (docker/bitcoin-ui)
- index.html: impl name/tagline/logo now dynamic. applyImplBranding()
reads subversion from getnetworkinfo — /Satoshi:X/Knots:Y/ resolves
to Bitcoin Knots, plain /Satoshi:X/ resolves to Bitcoin Core. Both
get their own icon and subtitle. Settings modal replaced its
hardcoded Regtest/txindex=1/port-18443 placeholders with live values
from getblockchaininfo + getindexinfo + getzmqnotifications.
- index.html: new Storage info card (Full Archive · X GB /
Pruned · X GB from blockchainInfo.pruned + size_on_disk) visible on
the main dashboard, same level as Network. Settings modal mirrors it
with the prune height when applicable.
- Dockerfile + assets/: bitcoin-core.svg, bitcoin-knots.webp, and the
bg-network.jpg used by the dashboard are now COPY'd into the image
under /usr/share/nginx/html/assets. Previously the <img src> pointed
at paths that 404'd into the SPA fallback and the onerror handler
hid the broken logo silently.
Frontend
- appSession/appSessionConfig.ts: add bitcoin-core to APP_PORTS (8334),
HTTPS_PROXY_PATHS (/app/bitcoin-ui/), and APP_TITLES (Bitcoin Core).
Without these the AppSessionFrame showed "No URL found for
bitcoin-core" and the home/app-list title fell through to the raw id.
- settings/AccountInfoSection.vue: backfill What's New entries for
v1.7.31 through v1.7.37 that had been missed in earlier cuts.
Release plumbing
- releases/v1.7.37-alpha/: binary + frontend tarball.
- releases/manifest.json: v1.7.37-alpha, sha256/size refreshed.
- Cargo.toml / package.json: version bumps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- neode-ui/public/assets/img/app-icons/bitcoin-core.svg (NEW): 256×256
Umbrel community Bitcoin icon sourced from getumbrel.github.io/
umbrel-apps-gallery/bitcoin/icon.svg. Referenced by the static
catalog, the curated fallback, and the upstream lfg2025/app-catalog
entry so every surface shows the same image.
- app-catalog/catalog.json + neode-ui/public/catalog.json: add
bitcoin-core (v28.4) entry pointing at bitcoin/bitcoin:28.4. Same
entry pushed to the lfg2025/app-catalog repo on .160 and the local
gitea mirror so nodes see it without needing a full archipelago
update. Sovereignty Stack entry added to FEATURED_DEFINITIONS with
a description that frames it as a Knots alternative, not a rival.
- core/archipelago/src/api/handler/mod.rs: handle_app_catalog_proxy
is now instance-scoped (&self) and derives its upstream list from
load_registries — each active container registry contributes one
`<scheme>://<reg.url>/app-catalog/raw/branch/main/catalog.json` URL
in priority order (scheme follows tls_verify). When the operator
switches mirrors in Settings, the App Store now follows. Falls back
to the legacy hardcoded .160/tx1138 pair only when registry config
can't be loaded, so the App Store still renders on nodes that
haven't persisted one yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Backend: install.rs registry reachability probe now strips the
`host[:port]/namespace` suffix before appending `/v2/` (the Docker
V2 API lives at the host root, not under the namespace) and accepts
HTTP 405 in addition to 200/401 as "registry daemon alive". This
fixes false "unreachable" reports on the Test button for Gitea and
other registries that protect their /v2/ endpoint.
- Backend: stacks.rs install_indeedhub_stack now force-removes any
leftover indeedhub-* containers and indeedhub-net before creating
the stack. A partial install (or the old first-boot stub racing the
installer) used to leave containers around that blocked re-install
with "name already in use". Re-running the App Store install now
self-heals.
- Backend: registry.rs load_registries auto-merges any default
registry URLs missing from the saved config (appended with priority
max+10+i, persisted). Lets new default mirrors (e.g. Server 3 OVH)
roll out to existing nodes without manual config edits. Explicit
removals still stick — URLs absent from disk AND absent from
defaults stay gone.
- Backend: update.rs adds DEFAULT_TERTIARY_MIRROR_URL at
http://146.59.87.168:3000/ (Server 3 OVH) to default_mirrors, with
the same auto-merge-on-load behavior as registries. Test updated
for 3-mirror default (.160, tx1138, .168).
- Scripts: dropped the first-boot IndeedHub stub (~38 lines in
first-boot-containers.sh §8b). It predated the proper stack
installer, raced it, and was the main source of the name-conflict
mess the stacks.rs cleanup above now also guards against.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Backend: unified pull-progress streaming across primary AND fallback
registries. Earlier code only streamed for the primary attempt; if it
failed fast (VPS 404, etc.) the UI froze at 0% until the fallback
finished. The waterfall now uses a single shared helper that streams
podman stderr through update_install_progress for every URL tried.
- Backend: PackageDataEntry gains uninstall_stage, set at each phase of
handle_package_uninstall ("Stopping containers (i/total)",
"Cleaning up volumes", "Removing app data"). State flips to Removing
during the pipeline.
- Frontend: MarketplaceAppCard renders the live progress bar with byte
counts during installs, matching the System Update download bar style.
- Frontend: AppCard renders the live uninstall stage label per app.
Modal closes immediately on confirm so concurrent uninstalls each
show their own progress on their own card.
- Cleanup: removed dead helpers (image_candidates, rewrite_for_primary,
primary_image_url, pull_from_registries_with_skip) made unused by
the install.rs refactor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New Settings → App registries page (/dashboard/settings/registries)
that mirrors the update-mirrors experience: list of configured
registries, test reachability, set primary, add/remove. New
registry.set-primary RPC; existing registry.{list,add,remove,test}
reused.
- Default RegistryConfig flipped: VPS (23.182.128.160:3000/lfg2025) is
now Server 1 (primary), tx1138 is Server 2 (fallback).
- Install pipeline now rewrites the first pull to the primary registry
URL before attempting it. Before this, installs always hit whichever
registry the image was hardcoded to, so changing the primary didn't
actually affect where images came from. On failure, the existing
fallback walk skips the primary (already tried) and walks the rest.
- App catalog proxy UPSTREAMS order flipped so the catalog follows the
same VPS-first rule.
- Reboot overlay: animated "a" logo now sits in the center of the ring
(matches the screensaver composition). Extracted the logo-wrapper
pattern inline.
7/7 registry tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New "Served by {mirror}" line on the System Update page so operators can see
which mirror actually served the available manifest (vs. which is configured
primary). Backend threads the served URL through UpdateState.manifest_mirror.
- New update.test-mirror RPC + per-row lightning-bolt button that pings a
mirror and renders reachable/latency or error inline under the URL.
- UI polish on the mirrors section: Set Primary, Remove, and the new Test
action are compact icon buttons; add-mirror form moved into a dialog.
- "What's New" block prepended for v1.7.27-alpha.
21/21 update module tests pass. vue-tsc + vite build clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a multi-mirror manifest fetch. `check_for_updates` walks a
configurable list (data_dir/update-mirrors.json) in priority order
and falls through to the next mirror on any HTTP / parse / timeout
failure. Two defaults bake in: Server 1 (git.tx1138.com) and Server 2
(23.182.128.160:3000).
Critical fix: after parsing a manifest, rewrite every component's
`download_url` so its origin matches the manifest URL we fetched.
Before this, the manifest hard-coded absolute URLs pointing at one
specific server — so even when a node fetched the manifest from a
faster mirror, the actual 200MB download went back to the slow
original. Now the faster mirror wins end-to-end.
New RPCs: update.list-mirrors, update.add-mirror, update.remove-mirror,
update.set-primary-mirror. New UI section on the System Update page
for operator management. 5 new unit tests for origin parsing and
manifest rewriting (21/21 green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- fips::service::active_unit() picks whichever fips unit is running
(archipelago-fips.service vs upstream fips.service) so
handle_fips_restart and handle_fips_reconnect don't silently no-op
on hosts where the archipelago-managed unit was never created.
- peer_connectivity_summary(anchor_candidates) replaces the old
identity-cache check. anchor_connected is now true when at least
one authenticated peer's npub matches the public anchor OR any
entry in seed-anchors.json, which matches what the user actually
cares about ("am I in the mesh?") rather than what the card used
to claim ("is this one specific public anchor reachable?").
- FipsStatus::query takes data_dir now (so it can read seed-anchors)
rather than identity_dir. All call-sites updated.
- handle_fips_reconnect re-pushes seed anchors after restart so the
new daemon gets dialed without waiting for the 5-min apply loop.
- FipsNetworkCard label drops "(fips.v0l.io)" — misleading now that
multiple anchors may be configured.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a local seed-anchor list at <data_dir>/seed-anchors.json. Each
entry is {npub, address, transport, label}. On archipelago startup
and every 5 minutes the list is pushed into the running fips daemon
via `fipsctl connect <npub> <addr> <transport>`, so a cluster can
anchor itself independently of the global fips.v0l.io. A flaky or
unreachable public anchor no longer strands a fresh install.
New RPCs:
- fips.list-seed-anchors
- fips.add-seed-anchor (validates npub1… + host:port)
- fips.remove-seed-anchor
- fips.apply-seed-anchors (on-demand re-dial)
New standalone UI card at views/server/FipsSeedAnchorsCard.vue. Not
wired into Home.vue / Server.vue — operator places it per the
entry-point convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Federation join flow now notifies the inviter with the joiner's name and
immediately bumps state so the Federation UI reloads without a manual
Sync click. Accepting an invite that points back at the local node is
rejected up front (DID/pubkey/onion match). After a peer joins, we spawn
a transitive sync that pulls the new peer's federated peer hints so all
nodes in the federation learn about each other as Observer entries.
Federation.vue polls every 5s while mounted.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
download_update
Each component download is now resumable via HTTP Range requests
(Range: bytes=N-) and retried up to 6 times with exponential
backoff (5/15/30/60/120/180s). On a dropped connection the next
attempt picks up at the last written byte offset instead of
restarting at zero. Streams via reqwest::Response::chunk() to the
staging file so a 160 MB frontend tarball doesn't sit in RAM. SHA
is verified over the complete file at the end of each component;
mismatch nukes the staged file and restarts from scratch.
Real download progress counters
New AtomicU64 globals DOWNLOAD_BYTES/DOWNLOAD_TOTAL are updated
from the chunk loop. update.status exposes them as
download_progress.{bytes_downloaded, total_bytes, active}. The
SystemUpdate.vue progress bar now polls update.status every
second instead of incrementing a fake random counter — and
crucially, if the user navigates away and back, the component
picks up the in-progress download from the backend atomics
immediately.
Update-check retries
handle_update_check now retries the manifest fetch up to 3 times
with a 5s gap if the first try hits a transport error, so a
momentary gitea hiccup doesn't make a node report "up to date"
when there actually is a new release. Tight 10s connect timeout
per attempt keeps the total bounded.
Artefacts:
archipelago 1070c87f…c081c162b 40584792
archipelago-frontend-1.7.15-alpha.tar.gz 8e630eba…63fd43f 162078068
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Install UX
SystemUpdate.vue now shows a full-screen overlay after apply: the
BitcoinFaceAscii logo, a target-version label, an indeterminate
progress stripe (solid orange; solid green on ready), and an
elapsed-time readout. Polls /health every 1.5s and auto-reloads
once the backend reports the new version. 3-min stall → "Reload
now" button. Download UI also shows a spinner + "Finishing
download — verifying checksum…" while the fake bar sits at 95%.
FIPS reconnect — for real this time
New fips.reconnect RPC does stop → start → wait 20s → re-poll →
classify. Classification buckets: connected / daemon_down /
no_seed_key / no_outbound_udp_or_anchor_down / peers_but_no_anchor,
each with a plain-language hint surfaced verbatim by the Reconnect
button. The real reason nodes like .198/.253 couldn't reach the
anchor: identity::write_fips_key_from_seed was writing fips_key.pub
as a bech32 npub TEXT file, but upstream fips expects 32 raw
bytes. The daemon silently authenticated with garbage. Fix:
PublicKey::to_bytes() → raw 32 bytes, and new
fips::config::normalize_pub_file migrates legacy files by decoding
the npub and rewriting in place. fips.reconnect also re-installs
the config + healed keys to /etc/fips before restarting.
AIUI preservation + restore
apply_update was wiping /opt/archipelago/web-ui/aiui because the
Vue build doesn't include it — every OTA lost the Claude sidebar.
The preserve block now copies aiui/ + archipelago-companion.apk
from the old web-ui into the staging dir before the swap, and
prefers new-tar versions if present. To restore it on the three
nodes that already lost it (.116/.198/.253), this release bundles
the 85 MB aiui build into the frontend tarball. Frontend component
size is now ~155 MB.
Download / install timeouts
Backend download client timeout 1800s → 3600s (1 h). Larger
tarball + slow gitea raw throughput put us above the old cap.
Frontend update.download rpc timeout 30 min → 65 min to match.
package.install rpc timeout 15 min → 45 min — IndeedHub pulls
6 images and was timing out mid-install.
UI nit
"Rollback to Previous" → "Rollback Available".
App-catalog proxy already landed in v1.7.13.
Artefacts:
archipelago 725e18e6…3c525e6 40462288
archipelago-frontend-1.7.14-alpha.tar.gz c35284be…ff2c16 162077052 (+aiui)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>