Hotfix: archipelago.service ExecStartPre now mkdirs /run/containers and
/var/lib/containers before the unit's mount-namespace setup tries to bind
them. Without this, fresh nodes that don't have /run/containers (e.g.
nodes provisioned without a prior podman session) fail at the namespace
step with:
Failed to set up mount namespacing: /run/containers: No such file or directory
Failed at step NAMESPACE spawning /bin/bash: No such file or directory
Existing nodes don't pick up systemd unit changes via OTA — they need a
one-time `systemctl edit archipelago` adding the same mkdir. ISO installs
from this version forward have the fix baked in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sync-perf tuning for bitcoin/bitcoin-core/bitcoin-knots/electrumx.
- Drop the --cpus=2 cap on bitcoin/electrumx variants. Script verification
is parallelizable; the cap halved IBD speed on 4-8 core machines.
- Bump bitcoin --memory 4g→8g so dbcache=4096 has headroom for mempool +
connection buffers + I/O. 4g was OOM-prone during heavy IBD.
- Bump electrumx --memory 1g→2g + add CACHE_MB=2048 + MAX_SEND=10MB.
- bitcoin-core CLI args gain -dbcache=4096 -par=0 -maxconnections=125.
- bitcoin-knots manifest matched (1024MB pruned / 4096MB full + par=0).
Future v2: host-RAM-aware dbcache scaling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resilience-validated release. Three full sweeps of the new resilience
harness against .228 confirm no shipstoppers.
Big user-visible:
- Bitcoin RPC auth durably correct via host-rendered nginx.conf bind-mount,
replaces fragile post-start exec that failed under restricted-cap rootless
podman ("crun: write cgroup.procs: Permission denied")
- Multi-container stack installs (indeedhub, immich, btcpay, mempool) now
emit phase events at every boundary so the progress bar advances
- Apps no longer vanish from the dashboard mid-install (absent-scanner skips
packages in transitional states)
- Indeedhub fresh installs work end-to-end (was 8500+ restart loop): five
missing env vars (DATABASE_PORT, QUEUE_HOST, QUEUE_PORT,
S3_PRIVATE_BUCKET_NAME, AES_MASTER_SECRET) added to install code
- Tailscale install fixed: --entrypoint string was being passed as a single
shell-line arg; switched to custom_args array
- Catalog cleaned of broken entries (dwn, endurain, ollama removed; nextcloud
restored on docker.io)
- Bitcoin Core update path uses correct image (was looking for nonexistent
lfg2025/bitcoin:28.4)
- ISO installs now allocate swap on the encrypted data partition
Infra:
- New resilience harness (scripts/resilience/) — black-box state-machine
tester, every app × every transition. Run before each release.
Sweep #3 final: PASS 107 / FAIL 12 / SKIP 14. The 12 fails are 1 cosmetic
(homeassistant trusted_hosts), 8 harness/timing false-positives, and 3
non-shipstopper tracked items. Down from 23 in baseline sweep #1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes failure mode adjacent to FM3 (docs/bulletproof-containers.md): on
a syncing pruned node, bitcoind's RPC thread blocks for 5-10s during block
validation. The old 10s client-side timeout was rejecting roughly 30% of
UI calls even though the node was perfectly healthy. 20x stress test on
the live .116 node (caught in IBD catch-up at block 797k) used to drop
10 of 20 calls; now drops 0 of 20.
What changed:
- core/archipelago/src/api/rpc/bitcoin.rs: bitcoin_rpc_call now retries up
to 3 times with 500ms and 1500ms backoffs between attempts. Only
transient transport errors (timeout, connect refused, send/recv IO)
trigger retry. A well-formed bitcoind error response is surfaced
immediately - real RPC bugs are never masked.
- Per-attempt hard deadline (tokio::time::timeout, 15s) layered on top
of reqwest's own timeout, so DNS starvation or TLS wedging can't
steal the entire retry budget.
- handle_bitcoin_getinfo client builder gained a 3s connect_timeout
so a dead bitcoind is fast-failed inside the first attempt instead
of eating the whole 15s.
- Retry policy extracted into a RetryConfig struct so tests can dial
down timeouts to ~100ms per attempt. Production defaults live in
RetryConfig::production().
Not changed (tracked as follow-up):
- mesh/mod.rs bitcoin_rpc_getblockcount and related helpers use the
same 10s-timeout pattern. Not migrated to the new wrapper in this
release; scheduled for v1.7.43 alongside the render_bitcoin_conf
work.
- lnd/info.rs and electrs_status have similar 10s/15s timeouts but
different failure profiles - audit first, migrate only the ones
that actually exhibit the bug.
Tests: 6 new unit tests under api::rpc::bitcoin::tests, all passing.
Uses an in-process hyper server (already a transitive dep) to simulate
bitcoind responses; no new crates required.
- happy_path_first_attempt: no retry when first attempt succeeds
- retries_on_timeout_then_succeeds: first attempt times out, second
succeeds, returns OK (uses a short-timeout RetryConfig so the test
runs in <1s instead of 15s)
- retries_exhausted_on_persistent_connect_refused: all attempts fail
against a closed port, error bubbles up, elapsed time confirms
backoffs actually ran
- does_not_retry_on_rpc_level_error: bitcoind-returned error body is
surfaced immediately, no retry
- does_not_retry_parse_errors: non-JSON response (e.g. 503 with html
body) is NOT retried - guards against the tempting "retry all
non-2xx" mistake that would mask real bitcoind misconfig
- retry_budget_invariants: asserts total wall-time ceiling stays
under 60s so a bumped constant can't silently hang a UI call
forever
Validated live on .116: 20/20 bitcoin.getinfo calls succeed during IBD
catch-up (chain at block 797419 -> 797464), vs ~40% baseline under the
old 10s timeout. Worst-case latency was 48.9s during peak validation;
happy-path latency (cached result) remains 28-77ms.
Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 +
v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500)
with no recovery path short of SSH. This release adds a self-check
guardrail to the update flow.
What changed:
- apply_update() writes a pending-verify marker with old+new version and
a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is
present and within its freshness window, the new binary waits 15s for
nginx + backend to settle, then probes https://127.0.0.1/ every 5s for
up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and
nothing else happens.
- On window-exhaust, the new binary:
1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts>
(quarantined, not deleted, so we can post-mortem).
2. Restores web-ui.bak on top of web-ui.
3. Calls rollback_update() to restore the previous binary.
4. Updates state.current_version to reflect the rollback.
5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without
probing, so a crashed-during-startup marker from weeks ago cannot
spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of
tokio::fs::copy, so it escapes the service's ProtectSystem=strict
mount namespace. Without this, the rollback silently failed with
EROFS on /usr/local/bin and orphaned the rollback - the exact
opposite of what auto-rollback is for.
Tests: 4 new unit tests in update::tests covering marker round-trip,
absent-marker noop, no-panic on verify_pending_update with nothing to
verify, and an invariant assert that the 90s probe window stays below
the 600s stale threshold. All passing.
Side fix: scripts/create-release-manifest.sh was dying with exit 141
(SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail.
Replaced with a single awk NR==1 that doesn't short-circuit the upstream
pipe, so the release-build flow is idempotent again.
v1.7.38 and v1.7.39 both shipped with `./` inside the frontend tarball marked
drwx------ (700). Tar extraction preserves archive perms, so every node that
pulled the OTA landed with /opt/archipelago/web-ui at 700, nginx (www-data)
returned 500 "permission denied" on every page, and the browser showed
"Internal Server Error nginx". .116 hit this on both v1.7.38 and v1.7.39
rollouts. The v1.7.39 runtime self-heal in main.rs was the wrong layer —
systemd's ReadOnlyPaths namespace made /opt/archipelago read-only from inside
the archipelago service, so chmod from there returned EROFS.
Root cause: create-release-manifest.sh used mktemp -d (700 default umask) for
staging, then tar preserved that 700 in the archive's root entry.
Fix the archive itself:
- chmod 755 staging dir + `find -type d -exec chmod 755` + `-type f chmod 644`
before tar, so the on-disk entries are correct.
- tar --owner=0 --group=0 --mode='u=rwX,go=rX' to normalize archive perms
belt-and-braces in case file-mode drift ever reappears.
- Post-tar verify: `tar tvzf | head -1` must show drwxr-xr-x at root, or
the release script aborts before the manifest is even generated.
Binary unchanged semantically — the main.rs self-heal stays in as a last-
resort belt (can't hurt on nodes whose FS isn't namespace-isolated), and the
update.rs in-extractor chmod stays in so v1.7.40-onwards extractors are
double-safe. The authoritative fix is the archive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1.7.38 shipped with an OTA bug: the tar-extracted staging dir inherited 700
perms and nginx (www-data) returned 500/403 on every request after the swap.
.116 hit this on rollout; had to chmod by hand to recover.
- update.rs: after extraction, explicitly chmod 755 dirs + 644 files on the
new staging dir before the mv into place, so nginx can stat/serve them.
- main.rs: self-heal on startup — if /opt/archipelago/web-ui is not
world-readable, run `sudo chmod -R u=rwX,go=rX` to repair. This is what
rescues nodes upgrading from v1.7.37/v1.7.38, since their extractor
(running on the old binary) doesn't have the chmod fix yet — the new
binary's first boot fixes the mess before nginx serves a single request.
Everything v1.7.38 shipped is still in this release:
- auth.rs auto-heals is_onboarding_complete() from setup_complete +
password_hash so nodes don't bounce back to /onboarding/intro after
browser clear / reboot / update
- useOnboarding tri-state: backend-unreachable no longer defaults to intro
- login sounds gated by isFirstInstallPhase() — silent after onboarding,
typing sounds unaffected
- FIPS app / Nostr Relay / Nostr VPN / Routstr / Penpot removed from
catalog + frontend + Rust + docker + icons; 15 image versions deleted
from tx1138, .168, gitea-local
- AIUI baked into release tarball via demo/aiui/
- prebuild hook syncs app-catalog/catalog.json → public/catalog.json
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>