docs/UNIFIED-TASK-TRACKER.md replaces hunting across SESSION-1.8.0-OTA-PROGRESS.md
and PRODUCTION-MASTER-PLAN.md for "what's left" — fastest/simplest tasks first.
Verified against live code/nodes rather than trusting doc text: several previously
"open" items (bind-dir chown, netbird legacy installer, launch-port fallback,
archival-bitcoin manifest field, progress-UI monotonicity, all-apps coverage,
fedimint test coverage, changelog backfill, portainer image pin, grafana quadlet
activation) turned out already shipped or non-issues, and are closed out here.
TESTING.md's release-gate checklist updated to match reality (cargo warnings,
5x gate, changelog already green; multinode/backend-default-flip/tag genuinely open).
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Two Phase-3 bats suites used `fail` (a bats-assert helper) but bats-assert
isn't installed on the alpha fleet (only bats-core), so every tripped
assertion crashed with `fail: command not found` (status 127) instead of
reporting a real pass/fail. Define the same minimal `fail() { echo ...;
return 1; }` the other suites already use (see mempool.bats). Without this
the gates were silently non-functional.
Also rewrite the obsolete "HealthCmd= implies Notify=healthy" assertion in
use-quadlet-backends-install.bats. Phase 3.4's Notify=healthy was
deliberately reverted: gating `systemctl start` on health hung boot
reconciliation for dependency-waiting apps (fedimint idles until Bitcoin
IBD; lnd until macaroon unlock), leaving units stuck "deactivating". The
renderer now emits HealthCmd= for Podman's health state but TimeoutStartSec=0
and NO Notify=healthy (quadlet.rs render() + contains_stale_health_gate()).
The test now asserts the current invariant: no backend unit gates start on
health.
Verified on the .228 canary node (ARCHIPELAGO_USE_QUADLET_BACKENDS=1):
use-quadlet-backends-install 6/6, backend-survives-archipelago-restart 3/3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fresh Meshtastic radios ship region-UNSET (RF-silent) and on mismatched
channels, so nodes only ever saw themselves. Bring them to MeshCore parity
using the official Meshtastic admin API:
- Auto-provision LoRa region (set_config, AdminMessage field 34) from a new
mesh-config `lora_region` (e.g. EU_868) when the radio's region differs.
- Auto-provision a shared primary channel (set_channel, field 33) with a
PSK derived deterministically from channel_name, so every node converges on
one mesh — the parity equivalent of MeshCore's named "archipelago" channel.
- Read current region/channel from want_config; only write when different
(no reboot loop); cap attempts so a radio that won't persist can't loop.
- Active NodeInfo advert scaffolding + aggressive serial drain.
Verified on .116+.228: region+channel persist, discovery works (both see each
other as named reachable contacts), bidirectional RF + sending confirmed.
Receiving in the running driver is still under diagnosis (instrumentation added).
Also removes the unwanted `meshtastic` daemon app from the registry (it was
never meant to be a container — native driver provides system-level support):
deletes apps/meshtastic + catalog entries (app-catalog, neode-ui, releases) +
test refs. Meshtastic stays native, like MeshCore.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Active counterpart to the read-only all-apps-matrix.bats: drives
stop/start/restart for every installed app and, under
ARCHY_ALLOW_CASCADE_DESTRUCTIVE, a FULL teardown (uninstall →
no-ghost → reinstall) — the broad coverage F needs beyond the ~8 core
suites. App set is discovered from My Apps ∩ the node catalog; reinstall
spec comes from catalog.json {dockerImage, containerConfig}.
PROTECTED by default (never cycled or torn down): bitcoin*/electrum*
(expensive resync) AND lnd/btcpay*/fedimint* (teardown = irreversible
wallet/channel/guardian loss). The user asked to protect only
bitcoin+electrum; the wallet apps are added for safety and can be
removed via ARCHY_MATRIX_PROTECT. Heavy + destructive → a supervised
pass, not folded into run-gate. Validated on .228: discovery excludes
the 6 protected installed apps; lifecycle tier cycles a single app
(botfights) stop/start/restart green; teardown gated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run-gate.sh ran only the DESTRUCTIVE tier; the cascade-uninstall suite
(uninstall→no-ghost→reinstall, the #13/#14/uninstall-hang regression
guard) existed but was never enabled by the gate. Add an opt-in single
cascade pass after the 5× loop (ARCHY_GATE_CASCADE=1, requires
ARCHY_ALLOW_DESTRUCTIVE=1), counted into the pass/fail tally. Kept out
of the 5× loop deliberately — uninstall/reinstall every iteration would
balloon runtime and re-pull images; one pass guards the class. Default
gate behavior unchanged. Validated: cascade-uninstall.bats 7/7 on .228.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 5x destructive gate on heavy nodes false-failed on transient windows
during stack recovery, not real regressions:
- immich.bats: lan_address port-publish probe 30s -> 90s. The postgres->redis
->server (DB migrations on boot) stack can take >30s to republish :2283 after
a churn-induced recreate; destructive-tier immich tests already allow 180-240s.
- mempool.bats: orphan-container check now polls to steady state (<=30s) instead
of a single-shot count, which caught a recreated member briefly visible
alongside its replacement mid-reconcile.
- run-gate.sh: settle cap 180s -> 300s and also gate on immich's :2283 when
installed, so the next iteration's read-only probe doesn't race a still-
recovering stack. Settle returns the instant every probe is green.
A genuinely unexposed/orphaned/unhealthy app still fails these checks; they only
absorb the transient recreate window under sustained churn.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The per-app suites cover ~8 core apps in depth; nothing covered the ~30 others
(jellyfin, vaultwarden, penpot, nextcloud, grafana, …). all-apps-matrix.bats
derives the app set from server.get-state package-data (no hardcoded list) and
asserts baseline health across EVERY installed app:
- settles to a non-transitional state within a window (the #13/#14 stuck-ghost
class, generalized fleet-wide — installing/removing that never settles)
- not in error/failed
- reports a recognized (non-garbage) state
- every running UI app (manifest ui=="true") exposes a non-null lan-address
(the immich/port-drift unreachable-UI failure, generalized to all UI apps)
Read-only, so it joins run.sh/run-gate.sh on every node and grows coverage as
nodes install more apps. Verified 5/5 on .228 (17 apps) and .116 (20 apps).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 5x gate is DESTRUCTIVE-only and never exercised uninstall/reinstall — where
the worst field bugs lived (#13 app ghosting in My Apps after uninstall, #14
reinstall stalling on stale state). New cascade-uninstall.bats drives the full
teardown path on a throwaway app (default grafana, precondition-skips if already
installed so it can't destroy real data) and asserts:
- fresh install reaches running via a truthful, non-silent progression
- uninstall makes the entry DISAPPEAR from server.get-state package-data
(the literal My Apps map) — no ghost, no stuck uninstall stage
- container + (on-node) data dir are gone
- reinstall returns to running
- node left as found
Opt-in via ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1; not yet folded into the canonical
gate. Verified 7/7 against .228.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It called bats-assert's `fail` (not loaded in this file) → "fail:
command not found"/127, masking the real reason. Emit+return instead,
bump the cold-restart RPC window 60s→120s (block-index reload), and
note a node mid-IBD legitimately can't serve getinfo (environmental
precondition, not a product regression).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5× run #4 flaked iter4 on "immich exposes its web UI lan-address
(port 2283)": container-list returned lan_address=null because
immich_server was momentarily mid-recreate when the read-only tier
queried it (passed the other 4 iterations; immich_server does publish
0.0.0.0:2283->2283). Same single-shot-read class as the bitcoin-knots
state probe — poll <=30s for the exposed port instead of one read. A
genuinely unexposed immich never publishes 2283, so real port drift
is still caught.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
package.restart resolved its container list via
ordered_containers_for_start, which injected every name from the
union startup_order list that wasn't already present — including
variant names not live on a given node (mysql-mempool,
archy-mempool-api, archy-mempool-web). The phantom mysql-mempool is
2nd in the mempool start order, so do_orchestrator_package_start hit
its unknown-app-id fallback, do_package_start failed the inspect
("no such object"), and the `?` aborted the whole start sequence —
leaving mempool-api + the frontend down until the health monitor
recovered them minutes later. That was the source of the 5× gate
flakes #73 (frontend not running in 180s) and #74 (api not queryable
in 300s); root-caused from the .228 journal
("Start failed: mysql-mempool").
Replace the inject-then-sort logic with a pure helper
order_present_containers that orders only the actually-present
containers and never adds phantom entries. startup_order remains a
union of name variants across install generations — it's now used
purely to order what's live, not to inject what isn't. +3 unit tests.
Also harden bitcoin-knots.bats "valid state" probe: poll ≤30s for a
settled state instead of a single-shot read, so a container caught
mid-reconcile (transient restarting/configured) can't flake a 20-min
iteration. A genuinely-stuck container never settles, so real
breakage is still caught.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rename run-20x.sh → run-gate.sh, default ARCHY_ITERATIONS 20→5, and scrub
20× references across CLAUDE.md, the master plan, TESTING.md, app-registry
status, the orchestrator/config doc-comments, and the bats suites. Also add
a minimal fail() helper to mempool.bats so guard failures report cleanly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The frontend nginx used a literal proxy_pass host with no resolver, so it
pinned mempool-api's IP at worker startup. When the backend restarts (gate,
OTA, crash, reboot re-IPAM) podman reassigns its IP and nginx keeps proxying
to the dead one -> /api hangs, websocket 502s, UI shows 'offline' until a
manual nginx reload. Same stale-upstream-IP class as the netbird 502.
Fix: mempool-frontend:v3.0.1 rewrites the generated nginx-mempool.conf to
re-resolve the backend per-request via 'resolver' + a variable proxy_pass.
Resolver address is read from /etc/resolv.conf (podman aardvark-dns answers
on the network gateway, not Docker's 127.0.0.11). Per-location path mapping
preserved (ws -> '/', /api/v1 identity via no-URI, /api/ -> /api/v1/ rewrite).
Proven on .228: backend IP change now auto-recovers with no reload; the
literal-host control still 502s. Migrated the manifest off the retired
tx1138 registry to vps2.
Also: mempool.bats #74 waited only 180s post-restart (the slow path) and
called an undefined 'fail' helper (status 127). Bumped to 300s to match the
passing parity probes and emit a real failure instead.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 1x gate is green; the 5x failed iters 1-2 on readiness-under-churn (apps DO
recover — lnd synced, mempool just mid-restart when probed — but slower than the
windows when restarted back-to-back). Hardening:
- run-20x.sh: best-effort settle_stack() before each iteration (wait for
mempool-api/frontend + lnd RPC healthy, 180s, on-node, never fails the run).
- required containers present/running (80/81): wait-loops (180s) not single-shot.
- mempool api/frontend (87/88): retry ~180s not single-shot.
- mempool queryable (74): 60s->180s. lnd restart-running (64): 120s->240s.
lnd getinfo (60): 90s->240s retry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC).
Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md
with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale
nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites.
Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On a proper on-node .228 run (synced bitcoin, 4-fix binary) the lifecycle matrix is
green; these 4 were test-harness issues:
- lnd 'recovers after restart' (65): bump retry window 90s->240s. lnd cold-restart
recovery (wallet unlock + bitcoind reconnect + graph sync) exceeds 90s on a loaded
node but DOES complete (synced_to_chain:true).
- bitcoin ui responds (89): retry ~120s instead of single-shot (companion nginx may
have just been recreated by the companion-survives test).
- probe_app_url (99 lnd proxy + all ui-coverage proxy probes): retry up to 90s for
post-restart proxy/UI readiness instead of single-shot.
- required endpoints after restart (94): :8081 is nginx-proxy-manager, an OPTIONAL
app (not in required_containers) — only assert it when NPM is installed; and make
the trailing lncli getinfo a retry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lnd's RPC isn't ready until its wallet auto-unlocks on (re)start, which lags the
container 'running' state — single-shot lncli getinfo raced that window and
false-failed (gate tests 60 + 85). Retry up to ~90s like a health probe. lnd is
functional (getinfo returns cleanly once ready).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- immich restart: bump wait 120s->240s. Restart = ordered stop+start of the 3-
container stack (postgres->redis->server w/ DB migrations), so it needs at least
as long as the start test (180s) — the old 120s was inconsistent and false-failed
on loaded nodes. immich does return to running.
- fedimint orphan check: the unanchored 'total' regex (^fedimint) counts the
legitimate fedimint-clientd (dual-ecash bridge) but the anchored 'known' regex
omitted it -> total>known false orphan on every node running fedimint-clientd.
Add fedimint-clientd to known.
Both run as LOCAL podman/systemctl on the gate runner, so they test the runner node
(.116), not the RPC target — surfaced while driving the .228 gate green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per user direction: the production test gate is 5x (ARCHY_ITERATIONS=5) on
.228 AND .198 for now, down from 20x. Restore to 20x before the final ship.
Updated CLAUDE.md, PRODUCTION-MASTER-PLAN.md, and tests/lifecycle/TESTING.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
container-list reports stack apps package-level (.name="immich"), so the suite
checks the "immich" package (presence, valid state, :2283 lan-address) rather than
individual container names. Destructive tier fires async stop/start/restart and
asserts on the end state via wait_for_container_status.
KNOWN: the destructive tier is flaky for slow multi-container stacks — bats runs
ops back-to-back with no settling while immich's async stack ops take 30s+, and
stopped reports as "exited" not "stopped". The immich migration itself is verified
working (manual stop/start/restart succeed; all 3 containers healthy). Hardening
the harness for stack apps (inter-op settling + stopped|exited acceptance) is a
follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
RPC-based (host-agnostic) lifecycle coverage for the manifest-driven immich stack
(immich + immich-postgres + immich-redis): presence + valid state of all 3 members,
a guard that no legacy underscore containers exist (catches botched migration /
legacy-installer fallback), destructive stop/start/restart of the server with
postgres+redis staying up, and cascade uninstall/reinstall (preserve_data).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Meshtastic DMs were falling back to a channel broadcast, so every node
on the LoRa channel saw a "direct" message. Send a directed MeshPacket
(to = node num, decoded from the synthetic pubkey's node-id bytes)
instead — the Meshtastic analog of the meshcore CMD_SEND_TXT_MSG fix.
DMs now reach only the recipient; firmware auto-PKC-encrypts them
end-to-end once NodeInfo keys are exchanged.
Capture E2E status at the driver level (no shared-type/UI change):
- learn each peer's real Curve25519 key from User.public_key (field 8)
and inbound MeshPacket.public_key (16), kept in a side-map separate
from the synthetic routing key so unicast routing is untouched
- detect inbound MeshPacket.pki_encrypted (17) to tell a true E2E DM
from a channel-PSK fallback
- peer_is_pkc_capable() seam for a future mesh-tab E2E badge
Hot-swap preserved: no dispatched MeshRadioDevice signature or the
shared ParsedContact changed, so meshcore and meshtastic stay
interchangeable behind the listener.
Adds tests/multinode/meshtastic.sh, a two/three-radio on-air parity
harness (detect, discover, DM round-trip, DM privacy, channel
broadcast, typed envelope, reachability).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CORE_RPC_HOST was hardcoded to bitcoin-knots in three env-render paths, so on a
bitcoin-core node (container named bitcoin-core) mempool-api could not reach
Bitcoin RPC. Both node variants are reachable on archy-net by container name —
only the name differs.
- Legacy direct-podman (stacks.rs) and config.rs::get_app_config now use a new
dependencies::detect_bitcoin_rpc_host() (pure, unit-tested pick_bitcoin_host).
- Quadlet/manifest path (the modern fleet default): add a {{BITCOIN_HOST}}
derived-env placeholder — HostFacts.bitcoin_host + resolve_derived_env render
it; prod_orchestrator detects Knots/Core via podman ps, resolved on demand
only for manifests that use the placeholder. mempool-api manifest moves
CORE_RPC_HOST from static env to derived_env: {{BITCOIN_HOST}}.
Tests: pick_bitcoin_host (5 cases incl. substring safety), container-crate
resolve_derived_env, and orchestrator mempool_core_rpc_host_follows_bitcoin_node
(core->bitcoin-core, knots->bitcoin-knots). No-regression confirmed: picker
returns bitcoin-knots live on .198. Live bitcoin-core validation pending (no
core node available). Sibling hardcodes (lnd/btcpay/electrumx/fedimint) tracked
as B12b.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The B13 template fix only fixed fresh ISOs. Already-deployed nodes keep their
old nginx config, where /app/fedimint/ proxies to :8175 without rewriting the
Guardian UI's root-rooted asset URLs (src="/assets/...", url("/assets/...")).
Those resolve against the SPA root: bg-network.jpg exists there by luck, but
app-icons/fedimint.jpg 404s (location /assets/ uses try_files =404) — the
visibly-broken icon.
bootstrap.rs::patch_nginx_conf now heals both paths on startup:
- Style A (main conf, HTTP): swaps the old single nostr-provider sub_filter tail
for the full reroot set; byte-matches the shipped template.
- Style B (HTTPS app-proxy snippet): the snippet's fedimint block has no
sub_filter and a per-node-varying trailing directive, so anchor on the unique
:8175 proxy_pass and insert the reroot set after it (nginx ignores directive
order). Snippet added to the bootstrap nginx loop (skipped on HTTP-only nodes).
missing_* flags are now gated on their splice anchors so the included snippet
neither attempts the main-conf-only patches nor logs warn-skips every boot.
Idempotent via the 'href="/' 'href="/app/fedimint/' marker.
Verified on .198 (both paths): fedimint app-icon 404 -> 200 image/jpeg; nginx -t
OK; containers survived restart (Quadlet); idempotent steady state, no warn spam.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fedimint UI HTML/CSS reference absolute /assets/* paths; under /app/fedimint/
those hit the main SPA, not the fedimint container, so the UI renders
unstyled. Add the proven sub_filter asset-rewrite pattern (as indeedhub/
botfights use) to the /app/fedimint/ block in the nginx template + https
snippet (also rewrites url(...) for the CSS background image). Bootstrap
self-heal for already-deployed nodes is the documented resume point.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
B15: Home system stats (incl. bitcoin sync %) polled every 30s — too slow;
now 10s so sync progress tracks the actual block height more closely.
B7: the ElectrumX sync overlay was gated only on status!=='synced', so if
the status never flips to 'synced' (ElectrumX stale/disconnected) the loader
stuck on top forever. Now the overlay hides and the app iframe loads when
the sync status is stale (fail-open), while still showing during active
indexing. type-check EXIT 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The B3 streaming proxy endpoint existed in the backend but nginx had no
location for /api/peer-content/*, so the browser's requests fell through to
the SPA (200 text/html) and media still wouldn't play. Add an
NGINX_PEER_CONTENT_BLOCK that bootstrap patches into every server block
(forwards Cookie for session auth + Range, proxy_buffering off). Idempotent;
covers fresh-ISO nodes too since bootstrap runs on every startup.
Verified on .198: after restart the async nginx patch lands and
/api/peer-content/<onion>/<id> returns 401 (reaches backend, auth-gated)
instead of the SPA; nginx block present in both server blocks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Peer media (music/video) wouldn't play: the frontend downloaded the whole
file via RPC as base64 and made a non-seekable Blob URL, so <video>/large
<audio> stalled and big files hit the RPC timeout.
Add GET /api/peer-content/<onion>/<id> — a same-origin, session-gated proxy
that forwards the browser's Range header to the peer's /content/<id> (which
already returns 206 Partial Content) and passes status + Content-Range +
Content-Type back. PeerFiles.playMedia() now points <video>/<audio> at this
streaming URL for free content instead of buffering a base64 blob, so the
player can seek and start immediately. Onion/id validated to prevent
SSRF/path traversal. (Paid preview keeps its existing flow.)
Verified: cargo build --release EXIT 0; vue-tsc --noEmit EXIT 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
content.browse-peer now returns the transport that actually reached the
peer (fips/tor/mesh/lan). PeerFiles shows it as a small coloured pill next
to the peer name (FIPS/Mesh green, LAN blue, Tor amber) and the loading
text no longer hardcodes "Connecting via Tor" (it was misleading when FIPS
was used). Pairs with B14 (transport recording).
Verified: cargo build --release EXIT 0; vue-tsc --noEmit EXIT 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The B14 commit referenced crate::federation::storage::record_peer_transport
but `storage` is a private module — record_peer_transport is re-exported at
crate::federation::. E0603 broke the build. Use the re-exported path (as
load_nodes/fips_npub_for_onion already do). Verified: cargo build --release
EXIT 0. Also logs B21 (Tor/FIPS pill) plan.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 4 content peer handlers (browse, download, download_paid, preview)
captured the transport returned by PeerRequest::send_get() but discarded
it, so the federation node's last_transport was never updated for cloud
activity — the UI showed Tor/none even when FIPS was used. Call
record_peer_transport() after each successful fetch (same as sync does).
Note: live data shows FIPS still reaches only some peers (many genuinely
fall back to Tor) — tracked separately as B14b (FIPS reachability).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
B1/B2: the same physical node can linger in the federation list under two
dids (e.g. after a did/key change). An onion is a node's unique stable
identity, so two entries with the same onion are one node. This showed the
node twice in the trusted-node list (B1) and as two mesh chat contacts —
one by name+logo, one by raw did (B2).
- storage::load_nodes now collapses same-onion entries (keep first, merge
fips_npub/name/last_state) so every consumer (list + chat seed + sync)
sees one entry per node.
- federation::sync merge_transitive_peers also matches by onion (not just
did) so new transitive hints don't re-add a known node under a new did.
- mesh::seed_federation_peers_into_mesh skips already-seeded onions (belt
and suspenders).
- Unit tests for dedup_nodes_by_onion (collapse + onion-suffix handling).
B4: filebrowser-client.listDirectory only checked res.ok before res.json(),
so when File Browser is absent (nginx serves the SPA index.html, 200) or
down (502) the JSON parse threw the opaque "Unexpected token '<'". Now it
checks the content-type and throws a friendly "File Browser is not
available" the Cloud view already renders as an empty state.
Verified: dedup unit tests 2/2; live .198 (15 entries→13 distinct onions)
restarted healthy on new binary; B4 guard present in built bundle + deployed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The LND wallet UI (served on its own app port) fetches /lnd-connect-info
and /proxy/lnd/* cross-origin, so both need correct CORS headers.
(a) Older nginx configs add their own Access-Control-Allow-Origin in the
/lnd-connect-info location on top of the one the backend sets, yielding
a DUPLICATE header that browsers reject ("multiple values"). bootstrap
now strips that redundant nginx add_header (backend owns CORS).
(b) /proxy/lnd/* returned a 401 with no CORS headers when the session
check failed, so the browser saw an opaque CORS error instead of a
readable 401. Add unauthorized_cors() and use it on that path.
Adds tests/production-quality/ (bug tracker + lnd-cors-test.sh harness).
Verified: harness 4/4 on .116, .198, .103.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds tests/multinode/smoke.sh on the existing multinode.bash lib: an
assertion suite (pass/fail + non-zero exit) driving two real nodes through
login, onion + FIPS identity, FIPS anchor-connected, federation pairing
both directions, peer content browse over the mesh, and the removed-node
tombstone (with an optional 3rd node C for the transitive-reappear case).
Guards the v1.7.94/v1.7.95 fixes. Content-browse + tombstone checks
skip-with-note against peers older than v1.7.95.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A wrong/locked LND wallet password leaves the wallet LOCKED after every
restart/OTA, breaking all Bitcoin-receive + Lightning ops fleet-wide — and the
harness was blind to it: live-lnd-address-type treats 'wallet locked' as PASS,
os-audit treated lnd-unreachable as WARN, and the archipelago lnd.getinfo RPC
masks a locked wallet (returns all-zero success).
- tests/release/run.sh: new 'live-lnd-unlocked' stage polls LND's unauth
/v1/state and FAILs if still LOCKED after a 60s grace window.
- tests/lifecycle/os-audit.sh: probe lnd.newaddress (the real receive path,
which surfaces LND_WALLET_LOCKED) instead of lnd.getinfo; locked = hard FAIL,
not-installed = WARN.
Proven on .116 (genuinely locked): os-audit now reports
'[FAIL] lnd wallet unlocked (lnd.newaddress) wallet LOCKED'.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The What's New modal (AccountInfoSection.vue) hardcodes one block per release
and had silently drifted: it sat at v1.7.84 while the fleet shipped through
v1.7.92, so eight releases of notes never reached users in Settings.
- scripts/sync-whats-new.py: renders a modal block from each CHANGELOG version
that's missing one (curated bullets, dev-process 'Validation…' lines dropped),
inserts newest-first; never touches older hand-written pre-CHANGELOG history.
--check mode lists anything missing and exits non-zero.
- tests/release/run.sh: new 'whats-new-sync' static gate runs --check, so a
release with an un-surfaced CHANGELOG entry fails before shipping.
- Backfilled the eight missing blocks (v1.7.85 … v1.7.92) into the modal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When ElectrumX is still building its index (or waiting on the Bitcoin node),
AppSessionFrame shows a sync 'pre UI'. The iframe-blocked fallback ('App not
reachable / retrying') was not gated on electrsSync, so it painted over the
sync screen and read as a hard connection error. Gate it on !electrsSync,
mirroring the iframe's own guard.
Also harden the lifecycle health probe: container_health used jq '// "unknown"',
which only catches null/false — an empty-string health (a brief window under
load) rendered as a blank 'bad health: X is '. Map empty to 'unknown' so the
retry loop keeps waiting instead of failing on a transient.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
os-audit.sh: one non-destructive scorecard tying backend/RPC health, the
all-apps lifecycle audit (delegates to remote-lifecycle.sh), and the FM-guards
(port-drift, secret-completeness, orphan-container sweep, OTA-wedge). The
per-boot building block for the reboot-survival loop. FM12 check uses jq has()
not // (// treats a legit false as empty). Section A validated all-PASS on .116.
docs: v1.7.91 release-pass resume notes + the bitcoinReceive blocker writeup.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>