Fixes from real fresh-install feedback (Framework node .81) + its log bundle:

Backend:
- websocket: subscribe before initial snapshot; broadcasts in the gap were
  silently lost, stranding clients on stale state until a hard refresh
  (the "everything needs ctrl-r" bug: My Apps stuck Loading, App Store
  stuck Checking, containers-scanned never arriving)
- crash recovery: check the crash marker BEFORE writing our own PID;
  recovery had never run on any node (always saw its own PID and skipped);
  PID-reuse guard via /proc cmdline
- boot status: pending-boot-starts registry (recovery, stack recovery,
  reconciler, adoption); scanner overlays queued-but-down apps as
  Restarting instead of Stopped after a reboot; scanner-authored
  Restarting resolves immediately on a settled scan (no transitional wedge)
- install deps: bounded wait (36x5s) when a dependency is installed but
  still starting ("Waiting for Bitcoin to start…") instead of instant
  rejection; dependency-gate rejections remove the optimistic entry (no
  phantom Stopped tile) and surface as a notification
- seed backup: auth.setup persists the onboarding mnemonic as the
  encrypted seed backup (reveal previously failed on EVERY node: nothing
  ever wrote master_seed.enc); seed.restore stashes too; error sanitizer
  lets seed/2FA errors through instead of "Check server logs"
- lnd: bitcoind.rpchost resolved from the running Bitcoin variant
  (hardcoded bitcoin-knots broke Core nodes); manifest uses derived_env
- bitcoin status: clean human message for connection-reset/startup; raw
  URLs + os-error chains no longer reach the app card
- fedimint-clientd: chown /var/lib/archipelago/fmcd to 1000:1000
  (root-created dir crash-looped the rootless container, EACCES);
  first-boot script + pre-start self-heal
- log volume (>1GB/day on a day-old node): journald caps drop-in (ISO +
  bootstrap self-heal), bitcoind -printtoconsole=0 everywhere (90% of the
  journal was IBD UpdateTip spam), tracing default debug→info

Frontend:
- Login: Enter advances to confirm field then submits; submit always
  clickable with inline errors (was silently disabled on mismatch);
  Restart Onboarding needs a confirming second click (the mismatch →
  "onboarding restarted" trap)
- sync store: 30s state reconciliation + refetch on re-entrant connect;
  20s containers-scanned escape hatch so Checking can never show forever;
  fresh empty node reaches the real "no apps yet" state
- intro video: CRF20 re-encode (SSIM 0.988) + faststart; moov was at EOF
  so playback needed the full 15MB first (the intro lag)
- backgrounds: 10 heaviest JPEGs → WebP q90 (9.4MB→6.6MB); 7 stayed JPEG
  (WebP larger on noisy sources)
- Web5ConnectedNodes: drop unused template ref that failed vue-tsc -b

ISO/kiosk:
- nginx: /assets/ 404s no longer cached immutable for a year; HTTPS block
  gained the missing /assets/ location (served index.html as images)
- kiosk: launcher/service spliced from configs/ at ISO build (stale
  heredoc force-disabled GPU); MemoryHigh/Max 1200/1500→2200/2800M (kiosk
  rode the reclaim throttle = the lag); firmware-intel-graphics +
  firmware-amd-graphics (trixie split DMC blobs out of misc-nonfree)

Verified: cargo test 898/898 green, npm run build green with dist
contents confirmed (webp refs, lnd.png, faststart video, new strings).

Handover for ISO build + deploy: docs/HANDOVER-2026-07-02-iso-feedback.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
343 lines
13 KiB
Rust
//! Cached Bitcoin node status for browser UIs.
//!
//! The bitcoin-ui should not poll Bitcoin RPC directly for display state.
//! During container restarts, reindexing, and IBD, direct browser RPC polling
//! turns short RPC gaps into visible UI failures. This module owns the RPC
//! polling loop, caches the last successful snapshot, and serves stale-but-known
//! state while the node is reconnecting.

use anyhow::{Context, Result};
use serde::Serialize;
use std::sync::OnceLock;
use std::time::{Duration, SystemTime, UNIX_EPOCH};
use tokio::sync::RwLock;
use tracing::{debug, warn};

// Poll frequently and recover fast so the cached snapshot tracks bitcoind's
// responsive windows during IBD. During heavy block-connection, getblockchaininfo
// can block briefly; a slow 10s/15s/20s cadence let one missed poll age the
// snapshot past the UI's 30s "stale" threshold, so the UI dwelled on
// "reconnecting…" long after bitcoind was answering again. Tight cadence + short
// timeout keeps last-known state fresh and clears the stale banner promptly.
const CACHE_REFRESH_SECS: u64 = 5;
const CACHE_ERROR_BACKOFF_SECS: u64 = 5;

// Grace window before a failing poll marks the snapshot "stale" for the UI.
// On a busy / swap-thrashing node (e.g. .198) getblockchaininfo intermittently
// exceeds the RPC timeout, so a single missed poll is normal and must NOT flip
// the UI to "reconnecting…". Only after the cached snapshot is genuinely old
// (several polls failed in a row) do we surface the banner.
const STALE_GRACE_MS: u64 = 20_000;

#[derive(Debug, Clone, Serialize)]
pub struct BitcoinNodeStatus {
    pub ok: bool,
    pub stale: bool,
    pub updated_at_ms: u64,
    // Server-computed age of the snapshot, filled in at serve time. The browser
    // must not derive this itself (Date.now() - updated_at_ms) because that
    // compares the browser clock against this node's clock; any skew made a
    // fresh snapshot look stale and the "reconnecting…" banner never cleared.
    pub age_ms: u64,
    pub error: Option<String>,
    pub blockchain_info: Option<serde_json::Value>,
    pub network_info: Option<serde_json::Value>,
    pub index_info: Option<serde_json::Value>,
    pub zmq_notifications: Option<serde_json::Value>,
}

impl Default for BitcoinNodeStatus {
    fn default() -> Self {
        Self {
            ok: false,
            stale: false,
            updated_at_ms: 0,
            age_ms: 0,
            error: Some("Connecting to Bitcoin node...".to_string()),
            blockchain_info: None,
            network_info: None,
            index_info: None,
            zmq_notifications: None,
        }
    }
}

static STATUS_CACHE: OnceLock<RwLock<BitcoinNodeStatus>> = OnceLock::new();

fn cache() -> &'static RwLock<BitcoinNodeStatus> {
    STATUS_CACHE.get_or_init(|| RwLock::new(BitcoinNodeStatus::default()))
}

fn now_ms() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_millis() as u64
}

fn transient_error(err_msg: &str) -> bool {
    let lower = err_msg.to_lowercase();
    lower.contains("connect")
        || lower.contains("reset")
        || lower.contains("refused")
        || lower.contains("timed out")
        || lower.contains("timeout")
        || lower.contains("broken pipe")
        || lower.contains("eof")
        || lower.contains("500 internal server error")
        || lower.contains("503 service unavailable")
        || lower.contains("work queue depth exceeded")
        || lower.contains("decode bitcoin rpc json")
        || lower.contains("error decoding response body")
        || lower.contains("expected value at line 1 column 1")
}

fn friendly_transient_error(has_cached_state: bool, err_msg: &str) -> String {
    let detail = err_msg
        .lines()
        .next()
        .unwrap_or(err_msg)
        .trim()
        .trim_end_matches('.');
    let lower = detail.to_lowercase();
    let state = if lower.contains("verifying blocks") {
        Some("verifying blocks after restart")
    } else if lower.contains("connection reset") {
        Some("starting up and not yet accepting RPC connections")
    } else if lower.contains("connection refused") || lower.contains("tcp connect error") {
        Some("waiting for the Bitcoin RPC listener")
    } else if lower.contains("timed out") || lower.contains("timeout") {
        Some("busy and not answering RPC before the timeout")
    } else {
        None
    };

    // Recognized transient causes get a clean human sentence only: the raw
    // transport error (URLs, repeated "os error 104" chains) is operator
    // noise that was ending up verbatim on the app card. Unrecognized errors
    // keep a bounded detail so a genuinely new failure stays diagnosable.
    let (state, detail) = match state {
        Some(state) => (state, None),
        None => (
            "starting or busy syncing",
            Some(if detail.len() > 120 {
                let mut cut = 120;
                while !detail.is_char_boundary(cut) {
                    cut -= 1;
                }
                format!("{}…", &detail[..cut])
            } else {
                detail.to_string()
            }),
        ),
    };

    let base = if has_cached_state {
        format!("Bitcoin node is {state}; showing last known state and retrying.")
    } else {
        format!("Bitcoin node is {state}; retrying automatically.")
    };
    match detail {
        Some(detail) => format!("{base} Detail: {detail}"),
        None => base,
    }
}

pub fn spawn_status_cache() {
    tokio::spawn(async {
        loop {
            let fresh = fetch_bitcoin_status().await;
            let mut cached = cache().write().await;
            let mut sleep_secs = CACHE_REFRESH_SECS;
            match fresh {
                Ok(mut status) => {
                    status.ok = true;
                    status.stale = false;
                    status.error = None;
                    *cached = status;
                }
                Err(e) => {
                    let err_msg = format!("{e:#}");
                    if transient_error(&err_msg) {
                        debug!("Bitcoin status: transient RPC failure: {}", err_msg);
                    } else {
                        warn!("Bitcoin status: RPC failure: {}", err_msg);
                    }
                    sleep_secs = CACHE_ERROR_BACKOFF_SECS;

                    if cached.blockchain_info.is_some() {
                        cached.ok = false;
                        // Only flip to "stale" once the last good snapshot is older
                        // than the grace window. A brief RPC gap on a busy node keeps
                        // showing last-known state silently instead of a banner flicker.
                        let snapshot_age_ms = now_ms().saturating_sub(cached.updated_at_ms);
                        cached.stale = snapshot_age_ms > STALE_GRACE_MS;
                        cached.error = Some(friendly_transient_error(true, &err_msg));
                    } else {
                        *cached = BitcoinNodeStatus {
                            ok: false,
                            stale: false,
                            updated_at_ms: now_ms(),
                            error: Some(friendly_transient_error(false, &err_msg)),
                            ..BitcoinNodeStatus::default()
                        };
                    }
                }
            }
            drop(cached);
            tokio::time::sleep(Duration::from_secs(sleep_secs)).await;
        }
    });
}

pub async fn get_bitcoin_status() -> BitcoinNodeStatus {
    let mut status = cache().read().await.clone();
    // Compute age here (server clock only) so the browser never has to subtract
    // across clocks. A successful snapshot serves age_ms ≈ 0 → the UI clears the
    // "reconnecting…" banner on its very next poll regardless of browser-clock skew.
    if status.updated_at_ms > 0 {
        status.age_ms = now_ms().saturating_sub(status.updated_at_ms);
    }
    status
}

async fn fetch_bitcoin_status() -> Result<BitcoinNodeStatus> {
    // 12s (not 8s): on a swap-thrashing node getblockchaininfo can answer slowly
    // but correctly; too tight a timeout turned working-but-slow polls into
    // failures and tripped the "reconnecting…" banner. Stays under STALE_GRACE_MS.
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(12))
        .build()
        .context("build Bitcoin status HTTP client")?;

    // Fetch all four calls concurrently: getblockchaininfo gates freshness, so a
    // slow auxiliary call (network/index/zmq) must not delay the snapshot or block
    // the next refresh. Only getblockchaininfo failing marks the status stale.
    let (blockchain_info, network_info, index_info, zmq_notifications) = tokio::join!(
        bitcoin_rpc_call(&client, "getblockchaininfo", serde_json::json!([])),
        bitcoin_rpc_call(&client, "getnetworkinfo", serde_json::json!([])),
        bitcoin_rpc_call(&client, "getindexinfo", serde_json::json!([])),
        bitcoin_rpc_call(&client, "getzmqnotifications", serde_json::json!([])),
    );
    let blockchain_info = blockchain_info.context("getblockchaininfo")?;

    Ok(BitcoinNodeStatus {
        ok: true,
        stale: false,
        updated_at_ms: now_ms(),
        age_ms: 0,
        error: None,
        blockchain_info: Some(blockchain_info),
        network_info: network_info.ok(),
        index_info: index_info.ok(),
        zmq_notifications: zmq_notifications.ok(),
    })
}

async fn bitcoin_rpc_call(
    client: &reqwest::Client,
    method: &str,
    params: serde_json::Value,
) -> Result<serde_json::Value> {
    let (rpc_user, rpc_pass) = crate::bitcoin_rpc::bitcoin_rpc_credentials().await;
    let body = serde_json::json!({
        "jsonrpc": "1.0",
        "id": "bitcoin-status",
        "method": method,
        "params": params,
    });

    let resp = client
        .post(crate::constants::BITCOIN_RPC_URL)
        .basic_auth(rpc_user, Some(rpc_pass))
        .header("Content-Type", "application/json")
        .json(&body)
        .send()
        .await
        .context("Bitcoin RPC request failed")?;

    let status = resp.status();
    let json: serde_json::Value = resp.json().await.context("decode Bitcoin RPC JSON")?;
    if !status.is_success() {
        anyhow::bail!("Bitcoin RPC returned {}: {}", status, json);
    }
    if let Some(error) = json.get("error").filter(|e| !e.is_null()) {
        anyhow::bail!("Bitcoin RPC {} error: {}", method, error);
    }
    json.get("result")
        .cloned()
        .context("missing Bitcoin RPC result")
}

#[cfg(test)]
mod tests {
    use super::friendly_transient_error;

    #[test]
    fn explains_verifying_blocks_without_generic_timeout_copy() {
        let msg = friendly_transient_error(
            false,
            r#"getblockchaininfo: Bitcoin RPC returned 500 Internal Server Error: {"error":{"code":-28,"message":"Verifying blocks..."}}"#,
        );

        assert!(msg.contains("verifying blocks after restart"));
        assert!(msg.contains("retrying automatically"));
    }

    #[test]
    fn explains_missing_rpc_listener() {
        let msg = friendly_transient_error(
            true,
            "getblockchaininfo: tcp connect error: Connection refused (os error 111)",
        );

        assert!(msg.contains("waiting for the Bitcoin RPC listener"));
        assert!(msg.contains("showing last known state"));
    }

    #[test]
    fn explains_rpc_timeout() {
        let msg = friendly_transient_error(
            false,
            "getblockchaininfo: Bitcoin RPC request failed: operation timed out",
        );

        assert!(msg.contains("busy and not answering RPC before the timeout"));
    }

    #[test]
    fn connection_reset_gets_clean_message_without_raw_detail() {
        // The exact string a fresh install showed on the app card: the raw
        // reqwest chain (URL + repeated "os error 104") must not surface.
        let msg = friendly_transient_error(
            false,
            "getblockchaininfo: Bitcoin RPC request failed: error sending request for url (http://127.0.0.1:8332/): connection error: Connection reset by peer (os error 104): connection error: Connection reset by peer (os error 104): Connection reset by peer (os error 104)",
        );

        assert!(msg.contains("starting up and not yet accepting RPC connections"));
        assert!(!msg.contains("os error"));
        assert!(!msg.contains("127.0.0.1"));
        assert!(!msg.contains("Detail:"));
    }

    #[test]
    fn recognized_causes_omit_detail_entirely() {
        for raw in [
            "x: Connection refused (os error 111)",
            "x: operation timed out",
            r#"x: {"error":{"code":-28,"message":"Verifying blocks..."}}"#,
        ] {
            let msg = friendly_transient_error(false, raw);
            assert!(!msg.contains("Detail:"), "leaked detail for: {raw}");
        }
    }

    #[test]
    fn unknown_errors_keep_bounded_detail() {
        let long = format!("weird new failure {}", "x".repeat(300));
        let msg = friendly_transient_error(false, &long);
        assert!(msg.contains("Detail: weird new failure"));
        assert!(msg.len() < 260);
    }
}
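
The grace-window and server-side age rules in the module can be exercised without any RPC or async runtime. The block below is a minimal sketch under stated assumptions, not code from this commit: `snapshot_age_ms` and `is_stale_after_failed_poll` are hypothetical helper names introduced only for illustration, and the constant is copied from the module. It shows why one missed poll (5s cadence plus a 12s timeout, roughly 17s since the last good snapshot) stays under the 20s grace window, and why `saturating_sub` keeps the age at zero if the clock steps backwards.

```rust
// Sketch only: helper names are hypothetical; the constant mirrors the module.
const STALE_GRACE_MS: u64 = 20_000;

/// Age of a snapshot, measured on a single clock (the server's).
/// `updated_at_ms == 0` means "no snapshot yet" and reports age 0,
/// mirroring the `updated_at_ms > 0` guard used when serving status.
fn snapshot_age_ms(now_ms: u64, updated_at_ms: u64) -> u64 {
    if updated_at_ms == 0 {
        0
    } else {
        // saturating_sub avoids underflow if the clock stepped backwards.
        now_ms.saturating_sub(updated_at_ms)
    }
}

/// After a failed poll, the snapshot is only stale once it is strictly
/// older than the grace window.
fn is_stale_after_failed_poll(now_ms: u64, updated_at_ms: u64) -> bool {
    snapshot_age_ms(now_ms, updated_at_ms) > STALE_GRACE_MS
}

fn main() {
    let updated = 1_000_000u64;

    // One missed poll: 5s sleep + 12s timeout = 17s since the last good snapshot.
    assert!(!is_stale_after_failed_poll(updated + 17_000, updated));

    // Exactly at the grace boundary is still not stale (strict comparison).
    assert!(!is_stale_after_failed_poll(updated + 20_000, updated));

    // Past the grace window the banner would be surfaced.
    assert!(is_stale_after_failed_poll(updated + 25_000, updated));

    // A clock that moved backwards yields age 0 rather than a huge value.
    assert_eq!(snapshot_age_ms(updated - 500, updated), 0);

    println!("grace-window checks passed");
}
```

Keeping the comparison as a pure function of two timestamps makes the boundary cases easy to test; the module itself computes the same expression inline inside the polling loop.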