lfg2025/archy

Author	SHA1	Message	Date
archipelago	79bbcca964	docs: consolidate OTA 1.8.0 + master-plan open items into one priority-ordered tracker docs/UNIFIED-TASK-TRACKER.md replaces hunting across SESSION-1.8.0-OTA-PROGRESS.md and PRODUCTION-MASTER-PLAN.md for "what's left" — fastest/simplest tasks first. Verified against live code/nodes rather than trusting doc text: several previously "open" items (bind-dir chown, netbird legacy installer, launch-port fallback, archival-bitcoin manifest field, progress-UI monotonicity, all-apps coverage, fedimint test coverage, changelog backfill, portainer image pin, grafana quadlet activation) turned out already shipped or non-issues, and are closed out here. TESTING.md's release-gate checklist updated to match reality (cargo warnings, 5x gate, changelog already green; multinode/backend-default-flip/tag genuinely open). Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>	2026-07-01 12:29:26 -04:00
archipelago	df9d3a55be	integration: preserve deployed 1.8.0 OTA work	2026-06-30 05:08:17 -04:00
archipelago	3cea7dd6c5	test(phase3): fix Phase-3 quadlet gates — define fail(), drop stale Notify=healthy assert Two Phase-3 bats suites used `fail` (a bats-assert helper) but bats-assert isn't installed on the alpha fleet (only bats-core), so every tripped assertion crashed with `fail: command not found` (status 127) instead of reporting a real pass/fail. Define the same minimal `fail() { echo ...; return 1; }` the other suites already use (see mempool.bats). Without this the gates were silently non-functional. Also rewrite the obsolete "HealthCmd= implies Notify=healthy" assertion in use-quadlet-backends-install.bats. Phase 3.4's Notify=healthy was deliberately reverted: gating `systemctl start` on health hung boot reconciliation for dependency-waiting apps (fedimint idles until Bitcoin IBD; lnd until macaroon unlock), leaving units stuck "deactivating". The renderer now emits HealthCmd= for Podman's health state but TimeoutStartSec=0 and NO Notify=healthy (quadlet.rs render() + contains_stale_health_gate()). The test now asserts the current invariant: no backend unit gates start on health. Verified on the .228 canary node (ARCHIPELAGO_USE_QUADLET_BACKENDS=1): use-quadlet-backends-install 6/6, backend-survives-archipelago-restart 3/3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 16:09:05 -04:00
archipelago	f9a6ae3f32	feat(mesh): Meshtastic region + shared-channel auto-provisioning (MeshCore parity) Fresh Meshtastic radios ship region-UNSET (RF-silent) and on mismatched channels, so nodes only ever saw themselves. Bring them to MeshCore parity using the official Meshtastic admin API: - Auto-provision LoRa region (set_config, AdminMessage field 34) from a new mesh-config `lora_region` (e.g. EU_868) when the radio's region differs. - Auto-provision a shared primary channel (set_channel, field 33) with a PSK derived deterministically from channel_name, so every node converges on one mesh — the parity equivalent of MeshCore's named "archipelago" channel. - Read current region/channel from want_config; only write when different (no reboot loop); cap attempts so a radio that won't persist can't loop. - Active NodeInfo advert scaffolding + aggressive serial drain. Verified on .116+.228: region+channel persist, discovery works (both see each other as named reachable contacts), bidirectional RF + sending confirmed. Receiving in the running driver is still under diagnosis (instrumentation added). Also removes the unwanted `meshtastic` daemon app from the registry (it was never meant to be a container — native driver provides system-level support): deletes apps/meshtastic + catalog entries (app-catalog, neode-ui, releases) + test refs. Meshtastic stays native, like MeshCore. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 04:46:35 -04:00
archipelago	43934eefa5	test(gate): destructive all-apps lifecycle matrix (WS-F#3) Active counterpart to the read-only all-apps-matrix.bats: drives stop/start/restart for every installed app and, under ARCHY_ALLOW_CASCADE_DESTRUCTIVE, a FULL teardown (uninstall → no-ghost → reinstall) — the broad coverage F needs beyond the ~8 core suites. App set is discovered from My Apps ∩ the node catalog; reinstall spec comes from catalog.json {dockerImage, containerConfig}. PROTECTED by default (never cycled or torn down): bitcoin/electrum (expensive resync) AND lnd/btcpay/fedimint (teardown = irreversible wallet/channel/guardian loss). The user asked to protect only bitcoin+electrum; the wallet apps are added for safety and can be removed via ARCHY_MATRIX_PROTECT. Heavy + destructive → a supervised pass, not folded into run-gate. Validated on .228: discovery excludes the 6 protected installed apps; lifecycle tier cycles a single app (botfights) stop/start/restart green; teardown gated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:29:22 -04:00
archipelago	b7d9210784	test(gate): optional ARCHY_GATE_CASCADE pass (wire the cascade tier in). run-gate.sh ran only the DESTRUCTIVE tier; the cascade-uninstall suite (uninstall→no-ghost→reinstall, the #13/#14/uninstall-hang regression guard) existed but was never enabled by the gate. Add an opt-in single cascade pass after the 5× loop (ARCHY_GATE_CASCADE=1, requires ARCHY_ALLOW_DESTRUCTIVE=1), counted into the pass/fail tally. Kept out of the 5× loop deliberately: uninstall/reinstall every iteration would balloon runtime and re-pull images; one pass guards the class. Default gate behavior unchanged. Validated: cascade-uninstall.bats 7/7 on .228. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 05:22:45 -04:00
archipelago	41e7f500f8	test(lifecycle): tolerate slow-but-healthy heavy-app recovery under 5x churn The 5x destructive gate on heavy nodes false-failed on transient windows during stack recovery, not real regressions: - immich.bats: lan_address port-publish probe 30s -> 90s. The postgres->redis->server (DB migrations on boot) stack can take >30s to republish :2283 after a churn-induced recreate; destructive-tier immich tests already allow 180-240s. - mempool.bats: orphan-container check now polls to steady state (<=30s) instead of a single-shot count, which caught a recreated member briefly visible alongside its replacement mid-reconcile. - run-gate.sh: settle cap 180s -> 300s and also gate on immich's :2283 when installed, so the next iteration's read-only probe doesn't race a still-recovering stack. Settle returns the instant every probe is green. A genuinely unexposed/orphaned/unhealthy app still fails these checks; they only absorb the transient recreate window under sustained churn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 09:18:34 -04:00
archipelago	0406af522c	test(lifecycle): add manifest-driven all-apps health matrix The per-app suites cover ~8 core apps in depth; nothing covered the ~30 others (jellyfin, vaultwarden, penpot, nextcloud, grafana, …). all-apps-matrix.bats derives the app set from server.get-state package-data (no hardcoded list) and asserts baseline health across EVERY installed app: - settles to a non-transitional state within a window (the #13/#14 stuck-ghost class, generalized fleet-wide — installing/removing that never settles) - not in error/failed - reports a recognized (non-garbage) state - every running UI app (manifest ui=="true") exposes a non-null lan-address (the immich/port-drift unreachable-UI failure, generalized to all UI apps) Read-only, so it joins run.sh/run-gate.sh on every node and grows coverage as nodes install more apps. Verified 5/5 on .228 (17 apps) and .116 (20 apps). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 05:27:10 -04:00
archipelago	57a69257c4	test(lifecycle): add CASCADE uninstall/reinstall tier (guards #13 ghost, #14 reinstall) The 5x gate is DESTRUCTIVE-only and never exercised uninstall/reinstall — where the worst field bugs lived (#13 app ghosting in My Apps after uninstall, #14 reinstall stalling on stale state). New cascade-uninstall.bats drives the full teardown path on a throwaway app (default grafana, precondition-skips if already installed so it can't destroy real data) and asserts: - fresh install reaches running via a truthful, non-silent progression - uninstall makes the entry DISAPPEAR from server.get-state package-data (the literal My Apps map) — no ghost, no stuck uninstall stage - container + (on-node) data dir are gone - reinstall returns to running - node left as found Opt-in via ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1; not yet folded into the canonical gate. Verified 7/7 against .228. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 05:13:53 -04:00
archipelago	ccb594fb85	test(gate): fix bitcoin-knots getinfo-after-restart helper + IBD note It called bats-assert's `fail` (not loaded in this file) → "fail: command not found"/127, masking the real reason. Emit+return instead, bump the cold-restart RPC window 60s→120s (block-index reload), and note a node mid-IBD legitimately can't serve getinfo (environmental precondition, not a product regression). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 06:28:20 -04:00
archipelago	2afd18c6de	test(gate): poll immich lan_address to absorb mid-recreate churn 5× run #4 flaked iter4 on "immich exposes its web UI lan-address (port 2283)": container-list returned lan_address=null because immich_server was momentarily mid-recreate when the read-only tier queried it (passed the other 4 iterations; immich_server does publish 0.0.0.0:2283->2283). Same single-shot-read class as the bitcoin-knots state probe — poll <=30s for the exposed port instead of one read. A genuinely unexposed immich never publishes 2283, so real port drift is still caught. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 03:20:18 -04:00
archipelago	92d7f52dd6	fix(orchestrator): order only live containers on package start/restart package.restart resolved its container list via ordered_containers_for_start, which injected every name from the union startup_order list that wasn't already present — including variant names not live on a given node (mysql-mempool, archy-mempool-api, archy-mempool-web). The phantom mysql-mempool is 2nd in the mempool start order, so do_orchestrator_package_start hit its unknown-app-id fallback, do_package_start failed the inspect ("no such object"), and the `?` aborted the whole start sequence — leaving mempool-api + the frontend down until the health monitor recovered them minutes later. That was the source of the 5× gate flakes #73 (frontend not running in 180s) and #74 (api not queryable in 300s); root-caused from the .228 journal ("Start failed: mysql-mempool"). Replace the inject-then-sort logic with a pure helper order_present_containers that orders only the actually-present containers and never adds phantom entries. startup_order remains a union of name variants across install generations — it's now used purely to order what's live, not to inject what isn't. +3 unit tests. Also harden bitcoin-knots.bats "valid state" probe: poll ≤30s for a settled state instead of a single-shot read, so a container caught mid-reconcile (transient restarting/configured) can't flake a 20-min iteration. A genuinely-stuck container never settles, so real breakage is still caught. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 02:22:50 -04:00
archipelago	57a013bc66	test(gate): make 5× the canonical gate, drop 20x naming Rename run-20x.sh → run-gate.sh, default ARCHY_ITERATIONS 20→5, and scrub 20× references across CLAUDE.md, the master plan, TESTING.md, app-registry status, the orchestrator/config doc-comments, and the bats suites. Also add a minimal fail() helper to mempool.bats so guard failures report cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:12:41 -04:00
archipelago	0f05f73a23	fix(mempool): self-healing nginx backend proxy (v3.0.1) + gate timeout The frontend nginx used a literal proxy_pass host with no resolver, so it pinned mempool-api's IP at worker startup. When the backend restarts (gate, OTA, crash, reboot re-IPAM) podman reassigns its IP and nginx keeps proxying to the dead one -> /api hangs, websocket 502s, UI shows 'offline' until a manual nginx reload. Same stale-upstream-IP class as the netbird 502. Fix: mempool-frontend:v3.0.1 rewrites the generated nginx-mempool.conf to re-resolve the backend per-request via 'resolver' + a variable proxy_pass. Resolver address is read from /etc/resolv.conf (podman aardvark-dns answers on the network gateway, not Docker's 127.0.0.11). Per-location path mapping preserved (ws -> '/', /api/v1 identity via no-URI, /api/ -> /api/v1/ rewrite). Proven on .228: backend IP change now auto-recovers with no reload; the literal-host control still 502s. Migrated the manifest off the retired tx1138 registry to vps2. Also: mempool.bats #74 waited only 180s post-restart (the slow path) and called an undefined 'fail' helper (status 127). Bumped to 300s to match the passing parity probes and emit a real failure instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:07:07 -04:00
archipelago	98f4fa44a8	test(gate): harden readiness for sustained 5x churn + inter-iteration settle The 1x gate is green; the 5x failed iters 1-2 on readiness-under-churn (apps DO recover — lnd synced, mempool just mid-restart when probed — but slower than the windows when restarted back-to-back). Hardening: - run-20x.sh: best-effort settle_stack() before each iteration (wait for mempool-api/frontend + lnd RPC healthy, 180s, on-node, never fails the run). - required containers present/running (80/81): wait-loops (180s) not single-shot. - mempool api/frontend (87/88): retry ~180s not single-shot. - mempool queryable (74): 60s->180s. lnd restart-running (64): 120s->240s. lnd getinfo (60): 90s->240s retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 17:11:15 -04:00
archipelago	27299ea687	docs: make the production test gate a SINGLE-NODE (.228) criterion; split out multinode Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC). Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites. Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 16:47:34 -04:00
archipelago	892ff083c4	test(gate): fix the last 4 readiness/config false-fails (none are product bugs) On a proper on-node .228 run (synced bitcoin, 4-fix binary) the lifecycle matrix is green; these 4 were test-harness issues: - lnd 'recovers after restart' (65): bump retry window 90s->240s. lnd cold-restart recovery (wallet unlock + bitcoind reconnect + graph sync) exceeds 90s on a loaded node but DOES complete (synced_to_chain:true). - bitcoin ui responds (89): retry ~120s instead of single-shot (companion nginx may have just been recreated by the companion-survives test). - probe_app_url (99 lnd proxy + all ui-coverage proxy probes): retry up to 90s for post-restart proxy/UI readiness instead of single-shot. - required endpoints after restart (94): :8081 is nginx-proxy-manager, an OPTIONAL app (not in required_containers) — only assert it when NPM is installed; and make the trailing lncli getinfo a retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 15:43:51 -04:00
archipelago	8893055810	test(gate): retry lnd getinfo for RPC readiness (wallet-unlock lags 'running') lnd's RPC isn't ready until its wallet auto-unlocks on (re)start, which lags the container 'running' state — single-shot lncli getinfo raced that window and false-failed (gate tests 60 + 85). Retry up to ~90s like a health probe. lnd is functional (getinfo returns cleanly once ready). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 14:45:36 -04:00
archipelago	53b8e47f1d	test(gate): fix two false-failing lifecycle tests (not product bugs) - immich restart: bump wait 120s->240s. Restart = ordered stop+start of the 3-container stack (postgres->redis->server w/ DB migrations), so it needs at least as long as the start test (180s); the old 120s was inconsistent and false-failed on loaded nodes. immich does return to running. - fedimint orphan check: the unanchored 'total' regex (^fedimint) counts the legitimate fedimint-clientd (dual-ecash bridge) but the anchored 'known' regex omitted it -> total>known false orphan on every node running fedimint-clientd. Add fedimint-clientd to known. Both run as LOCAL podman/systemctl on the gate runner, so they test the runner node (.116), not the RPC target; this surfaced while driving the .228 gate green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 14:11:35 -04:00
archipelago	84031e6209	docs: temporarily reduce release lifecycle gate from 20x to 5x Per user direction: the production test gate is 5x (ARCHY_ITERATIONS=5) on .228 AND .198 for now, down from 20x. Restore to 20x before the final ship. Updated CLAUDE.md, PRODUCTION-MASTER-PLAN.md, and tests/lifecycle/TESTING.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:11:00 -04:00
archipelago	b0b54a96fa	test(lifecycle): immich suite: package-level checks, wait-based destructive tier. container-list reports stack apps package-level (.name="immich"), so the suite checks the "immich" package (presence, valid state, :2283 lan-address) rather than individual container names. Destructive tier fires async stop/start/restart and asserts on the end state via wait_for_container_status. KNOWN: the destructive tier is flaky for slow multi-container stacks: bats runs ops back-to-back with no settling while immich's async stack ops take 30s+, and stopped reports as "exited" not "stopped". The immich migration itself is verified working (manual stop/start/restart succeed; all 3 containers healthy). Hardening the harness for stack apps (inter-op settling + stopped|exited acceptance) is a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 09:52:33 -04:00
archipelago	b1f175b927	test(lifecycle): add immich stack lifecycle suite RPC-based (host-agnostic) lifecycle coverage for the manifest-driven immich stack (immich + immich-postgres + immich-redis): presence + valid state of all 3 members, a guard that no legacy underscore containers exist (catches botched migration / legacy-installer fallback), destructive stop/start/restart of the server with postgres+redis staying up, and cascade uninstall/reinstall (preserve_data). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 09:01:19 -04:00
archipelago	03a4ee1b30	feat(container): manifest-declared generated secrets + companion/quadlet hardening Generated-secrets system: apps declare `generated_secrets` in their manifest (kinds hex16/hex32/bcrypt); `container::secrets::ensure_generated_secrets` materialises them 0600/rootless in resolve_dynamic_env — idempotent and self-healing (recovers wrongly root-owned secrets with no privilege). Replaces per-app Rust (deletes ensure_fmcd_password). fedimint-clientd/gateway manifests now declare fmcd-password / fedimint-gateway-hash. companion.rs: rebuild the auto-built :latest image when its build context changes (staleness check) so baked-in fixes (e.g. guardian-UI CSS) actually reach nodes. quadlet.rs: skip PublishPort under Network=host (podman rejects the combo, exit 125) + regression tests. UI: "Fedimint Guardian" rename, fedimint-clientd/nostr-rs-relay/meshtastic tagged as Services (headless backends), gateway icon fallback. Deployed + verified on .228 (generated-secrets fixed fedimint-gateway start; grafana/strfry orphan crash-loop units removed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 05:11:07 -04:00
archipelago	298595069d	fix(mesh): native Meshtastic unicast DMs + driver-level E2E status Meshtastic DMs were falling back to a channel broadcast, so every node on the LoRa channel saw a "direct" message. Send a directed MeshPacket (to = node num, decoded from the synthetic pubkey's node-id bytes) instead — the Meshtastic analog of the meshcore CMD_SEND_TXT_MSG fix. DMs now reach only the recipient; firmware auto-PKC-encrypts them end-to-end once NodeInfo keys are exchanged. Capture E2E status at the driver level (no shared-type/UI change): - learn each peer's real Curve25519 key from User.public_key (field 8) and inbound MeshPacket.public_key (16), kept in a side-map separate from the synthetic routing key so unicast routing is untouched - detect inbound MeshPacket.pki_encrypted (17) to tell a true E2E DM from a channel-PSK fallback - peer_is_pkc_capable() seam for a future mesh-tab E2E badge Hot-swap preserved: no dispatched MeshRadioDevice signature or the shared ParsedContact changed, so meshcore and meshtastic stay interchangeable behind the listener. Adds tests/multinode/meshtastic.sh, a two/three-radio on-air parity harness (detect, discover, DM round-trip, DM privacy, channel broadcast, typed envelope, reachability). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 09:52:26 -04:00
archipelago	4576964be4	docs(tracker): file new backlog as gitea #32-#35; relay UI + fedimint CSS live on .116	2026-06-16 06:41:22 -04:00
archipelago	82659e9f4e	docs(tracker): v1.7.97-alpha cut + mid-rollout state (116 deployed, 198 deploying, fleet pending)	2026-06-16 04:31:18 -04:00
archipelago	8a62ae008c	docs(tracker): B17 root-caused + fixed (data-volume mount ordering), verified .198	2026-06-16 03:38:58 -04:00
archipelago	dd0fac0e15	docs(tracker): B16 done (bitcoin tile retain/Updating…, unit-tested); image-opt staged for .97	2026-06-16 02:59:33 -04:00
archipelago	bf24bbc15a	fix(mempool): resolve CORE_RPC_HOST to the actual bitcoin node (Knots/Core) (B12) CORE_RPC_HOST was hardcoded to bitcoin-knots in three env-render paths, so on a bitcoin-core node (container named bitcoin-core) mempool-api could not reach Bitcoin RPC. Both node variants are reachable on archy-net by container name — only the name differs. - Legacy direct-podman (stacks.rs) and config.rs::get_app_config now use a new dependencies::detect_bitcoin_rpc_host() (pure, unit-tested pick_bitcoin_host). - Quadlet/manifest path (the modern fleet default): add a {{BITCOIN_HOST}} derived-env placeholder — HostFacts.bitcoin_host + resolve_derived_env render it; prod_orchestrator detects Knots/Core via podman ps, resolved on demand only for manifests that use the placeholder. mempool-api manifest moves CORE_RPC_HOST from static env to derived_env: {{BITCOIN_HOST}}. Tests: pick_bitcoin_host (5 cases incl. substring safety), container-crate resolve_derived_env, and orchestrator mempool_core_rpc_host_follows_bitcoin_node (core->bitcoin-core, knots->bitcoin-knots). No-regression confirmed: picker returns bitcoin-knots live on .198. Live bitcoin-core validation pending (no core node available). Sibling hardcodes (lnd/btcpay/electrumx/fedimint) tracked as B12b. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-16 02:07:39 -04:00
archipelago	987a961f4a	fix(nginx): self-heal fedimint asset rewrite on deployed nodes — HTTP + HTTPS (B13) The B13 template fix only fixed fresh ISOs. Already-deployed nodes keep their old nginx config, where /app/fedimint/ proxies to :8175 without rewriting the Guardian UI's root-rooted asset URLs (src="/assets/...", url("/assets/...")). Those resolve against the SPA root: bg-network.jpg exists there by luck, but app-icons/fedimint.jpg 404s (location /assets/ uses try_files =404) — the visibly-broken icon. bootstrap.rs::patch_nginx_conf now heals both paths on startup: - Style A (main conf, HTTP): swaps the old single nostr-provider sub_filter tail for the full reroot set; byte-matches the shipped template. - Style B (HTTPS app-proxy snippet): the snippet's fedimint block has no sub_filter and a per-node-varying trailing directive, so anchor on the unique :8175 proxy_pass and insert the reroot set after it (nginx ignores directive order). Snippet added to the bootstrap nginx loop (skipped on HTTP-only nodes). missing_* flags are now gated on their splice anchors so the included snippet neither attempts the main-conf-only patches nor logs warn-skips every boot. Idempotent via the 'href="/' 'href="/app/fedimint/' marker. Verified on .198 (both paths): fedimint app-icon 404 -> 200 image/jpeg; nginx -t OK; containers survived restart (Quadlet); idempotent steady state, no warn spam. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 18:03:04 -04:00
archipelago	a50b6df21b	fix(nginx): rewrite fedimint UI asset paths so CSS applies (B13, fresh-ISO) Fedimint UI HTML/CSS reference absolute /assets/* paths; under /app/fedimint/ those hit the main SPA, not the fedimint container, so the UI renders unstyled. Add the proven sub_filter asset-rewrite pattern (as indeedhub/botfights use) to the /app/fedimint/ block in the nginx template + https snippet (also rewrites url(...) for the CSS background image). Bootstrap self-heal for already-deployed nodes is the documented resume point. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 16:52:30 -04:00
archipelago	8427e219ea	docs(tracker): round-2 status (B15/B7 done, B13/B12/B16 deferred w/ plans) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 16:31:24 -04:00
archipelago	c0d41cf8cf	fix(ui): faster bitcoin sync refresh + unstick ElectrumX loader (B15,B7) B15: Home system stats (incl. bitcoin sync %) polled every 30s — too slow; now 10s so sync progress tracks the actual block height more closely. B7: the ElectrumX sync overlay was gated only on status!=='synced', so if the status never flips to 'synced' (ElectrumX stale/disconnected) the loader stuck on top forever. Now the overlay hides and the app iframe loads when the sync status is stale (fail-open), while still showing during active indexing. type-check EXIT 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 16:29:44 -04:00
archipelago	eb55c88e1a	docs(tracker): B6/B7/B12/B13/B15/B16 root causes + fix plans Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 14:43:01 -04:00
archipelago	31fe91b99a	docs(tracker): B13 fedimint CSS investigation progress Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 14:13:28 -04:00
archipelago	b9cc4bd780	docs(tracker): B14b FIPS reachability findings (dial-time, not npub/service) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 14:11:47 -04:00
archipelago	6c92eacba0	docs(tracker): add B22 (peer download/audio errors), B23 (group chat), B3 PASSED-http Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 14:09:31 -04:00
archipelago	602b9cd3df	fix(nginx): route /api/peer-content/* to the backend for B3 streaming The B3 streaming proxy endpoint existed in the backend but nginx had no location for /api/peer-content/*, so the browser's requests fell through to the SPA (200 text/html) and media still wouldn't play. Add an NGINX_PEER_CONTENT_BLOCK that bootstrap patches into every server block (forwards Cookie for session auth + Range, proxy_buffering off). Idempotent; covers fresh-ISO nodes too since bootstrap runs on every startup. Verified on .198: after restart the async nginx patch lands and /api/peer-content/<onion>/<id> returns 401 (reaches backend, auth-gated) instead of the SPA; nginx block present in both server blocks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 14:07:39 -04:00
archipelago	5c8707432b	fix(cloud): Range-streaming proxy for peer media so it plays/seeks (B3) Peer media (music/video) wouldn't play: the frontend downloaded the whole file via RPC as base64 and made a non-seekable Blob URL, so <video>/large <audio> stalled and big files hit the RPC timeout. Add GET /api/peer-content/<onion>/<id> — a same-origin, session-gated proxy that forwards the browser's Range header to the peer's /content/<id> (which already returns 206 Partial Content) and passes status + Content-Range + Content-Type back. PeerFiles.playMedia() now points <video>/<audio> at this streaming URL for free content instead of buffering a base64 blob, so the player can seek and start immediately. Onion/id validated to prevent SSRF/path traversal. (Paid preview keeps its existing flow.) Verified: cargo build --release EXIT 0; vue-tsc --noEmit EXIT 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 13:46:51 -04:00
archipelago	4cac6bc835	docs(tracker): record B1/B2/B4/B14/B21 done + B14b; next B3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 13:27:51 -04:00
archipelago	0801dd6632	feat(cloud): show Tor/FIPS transport pill on peer browse (B21) content.browse-peer now returns the transport that actually reached the peer (fips/tor/mesh/lan). PeerFiles shows it as a small coloured pill next to the peer name (FIPS/Mesh green, LAN blue, Tor amber) and the loading text no longer hardcodes "Connecting via Tor" (it was misleading when FIPS was used). Pairs with B14 (transport recording). Verified: cargo build --release EXIT 0; vue-tsc --noEmit EXIT 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 13:25:39 -04:00
archipelago	1c6dc153ce	fix(content): use re-exported federation::record_peer_transport path (repair build) The B14 commit referenced crate::federation::storage::record_peer_transport but `storage` is a private module — record_peer_transport is re-exported at crate::federation::. E0603 broke the build. Use the re-exported path (as load_nodes/fips_npub_for_onion already do). Verified: cargo build --release EXIT 0. Also logs B21 (Tor/FIPS pill) plan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 13:15:01 -04:00
archipelago	f2e3710c28	fix(content): record peer transport on cloud browse/download/preview (B14) The 4 content peer handlers (browse, download, download_paid, preview) captured the transport returned by PeerRequest::send_get() but discarded it, so the federation node's last_transport was never updated for cloud activity — the UI showed Tor/none even when FIPS was used. Call record_peer_transport() after each successful fetch (same as sync does). Note: live data shows FIPS still reaches only some peers (many genuinely fall back to Tor) — tracked separately as B14b (FIPS reachability). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 13:02:13 -04:00
archipelago	ed4931064b	fix(federation,cloud): dedup trusted nodes + chat contacts by onion; guard cloud my-folders (B1,B2,B4) B1/B2: the same physical node can linger in the federation list under two dids (e.g. after a did/key change). An onion is a node's unique stable identity, so two entries with the same onion are one node. This showed the node twice in the trusted-node list (B1) and as two mesh chat contacts — one by name+logo, one by raw did (B2). - storage::load_nodes now collapses same-onion entries (keep first, merge fips_npub/name/last_state) so every consumer (list + chat seed + sync) sees one entry per node. - federation::sync merge_transitive_peers also matches by onion (not just did) so new transitive hints don't re-add a known node under a new did. - mesh::seed_federation_peers_into_mesh skips already-seeded onions (belt and suspenders). - Unit tests for dedup_nodes_by_onion (collapse + onion-suffix handling). B4: filebrowser-client.listDirectory only checked res.ok before res.json(), so when File Browser is absent (nginx serves the SPA index.html, 200) or down (502) the JSON parse threw the opaque "Unexpected token '<'". Now it checks the content-type and throws a friendly "File Browser is not available" the Cloud view already renders as an empty state. Verified: dedup unit tests 2/2; live .198 (15 entries→13 distinct onions) restarted healthy on new binary; B4 guard present in built bundle + deployed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 12:29:12 -04:00
archipelago	1db720af13	fix(lnd): repair fleet-wide CORS on LND connect-wallet endpoints (B5) The LND wallet UI (served on its own app port) fetches /lnd-connect-info and /proxy/lnd/* cross-origin, so both need correct CORS headers. (a) Older nginx configs add their own Access-Control-Allow-Origin in the /lnd-connect-info location on top of the one the backend sets, yielding a DUPLICATE header that browsers reject ("multiple values"). bootstrap now strips that redundant nginx add_header (backend owns CORS). (b) /proxy/lnd/* returned a 401 with no CORS headers when the session check failed, so the browser saw an opaque CORS error instead of a readable 401. Add unauthorized_cors() and use it on that path. Adds tests/production-quality/ (bug tracker + lnd-cors-test.sh harness). Verified: harness 4/4 on .116, .198, .103. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 11:31:14 -04:00
archipelago	0c8991b519	test(multinode): assertion-based two-node E2E smoke suite Adds tests/multinode/smoke.sh on the existing multinode.bash lib: an assertion suite (pass/fail + non-zero exit) driving two real nodes through login, onion + FIPS identity, FIPS anchor-connected, federation pairing both directions, peer content browse over the mesh, and the removed-node tombstone (with an optional 3rd node C for the transitive-reappear case). Guards the v1.7.94/v1.7.95 fixes. Content-browse + tombstone checks skip-with-note against peers older than v1.7.95. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 09:03:58 -04:00
archipelago	8c8e4d7a29	test: gate that LND wallet is unlocked after restart (catches fleet-wide lock) A wrong/locked LND wallet password leaves the wallet LOCKED after every restart/OTA, breaking all Bitcoin-receive + Lightning ops fleet-wide — and the harness was blind to it: live-lnd-address-type treats 'wallet locked' as PASS, os-audit treated lnd-unreachable as WARN, and the archipelago lnd.getinfo RPC masks a locked wallet (returns all-zero success). - tests/release/run.sh: new 'live-lnd-unlocked' stage polls LND's unauth /v1/state and FAILs if still LOCKED after a 60s grace window. - tests/lifecycle/os-audit.sh: probe lnd.newaddress (the real receive path, which surfaces LND_WALLET_LOCKED) instead of lnd.getinfo; locked = hard FAIL, not-installed = WARN. Proven on .116 (genuinely locked): os-audit now reports '[FAIL] lnd wallet unlocked (lnd.newaddress) wallet LOCKED'. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 10:36:12 -04:00
archipelago	2fac63e58c	feat(release): gate that Settings 'What's New' modal stays in sync with CHANGELOG The What's New modal (AccountInfoSection.vue) hardcodes one block per release and had silently drifted: it sat at v1.7.84 while the fleet shipped through v1.7.92, so eight releases of notes never reached users in Settings. - scripts/sync-whats-new.py: renders a modal block from each CHANGELOG version that's missing one (curated bullets, dev-process 'Validation…' lines dropped), inserts newest-first; never touches older hand-written pre-CHANGELOG history. --check mode lists anything missing and exits non-zero. - tests/release/run.sh: new 'whats-new-sync' static gate runs --check, so a release with an un-surfaced CHANGELOG entry fails before shipping. - Backfilled the eight missing blocks (v1.7.85 … v1.7.92) into the modal. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 08:31:43 -04:00
archipelago	4232424b23	fix(ui): suppress app-unreachable overlay while ElectrumX sync screen shows When ElectrumX is still building its index (or waiting on the Bitcoin node), AppSessionFrame shows a sync 'pre UI'. The iframe-blocked fallback ('App not reachable / retrying') was not gated on electrsSync, so it painted over the sync screen and read as a hard connection error. Gate it on !electrsSync, mirroring the iframe's own guard. Also harden the lifecycle health probe: container_health used jq '// "unknown"', which only catches null/false — an empty-string health (a brief window under load) rendered as a blank 'bad health: X is '. Map empty to 'unknown' so the retry loop keeps waiting instead of failing on a transient. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 07:58:24 -04:00
archipelago	329e7811eb	test(lifecycle): add os-audit OS-wide health gate; docs: v1.7.91 resume notes os-audit.sh: one non-destructive scorecard tying backend/RPC health, the all-apps lifecycle audit (delegates to remote-lifecycle.sh), and the FM-guards (port-drift, secret-completeness, orphan-container sweep, OTA-wedge). The per-boot building block for the reboot-survival loop. FM12 check uses jq has() not // (// treats a legit false as empty). Section A validated all-PASS on .116. docs: v1.7.91 release-pass resume notes + the bitcoinReceive blocker writeup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 04:36:06 -04:00
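The same-onion dedup described in ed4931064b (keep the first entry, merge fips_npub/name/last_state from later ones) can be sketched roughly as below. This is an illustrative Python sketch, not the repository's Rust implementation: the `Node` fields, the `onion_key` helper, and the suffix rule (treating `abc` and `abc.onion` as the same identity) are assumptions made for the example, and only the function name `dedup_nodes_by_onion` comes from the commit message.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Node:
    # Hypothetical shape of a federation node entry (field names assumed).
    did: str
    onion: str
    name: Optional[str] = None
    fips_npub: Optional[str] = None
    last_state: Optional[str] = None


def onion_key(onion: str) -> str:
    """Normalize an onion so 'abc' and 'ABC.onion' compare equal (assumed rule)."""
    key = onion.strip().lower()
    return key[: -len(".onion")] if key.endswith(".onion") else key


def dedup_nodes_by_onion(nodes: list[Node]) -> list[Node]:
    """Collapse same-onion entries: keep the first, fill its gaps from later ones."""
    result: list[Node] = []
    by_onion: dict[str, Node] = {}
    for node in nodes:
        key = onion_key(node.onion)
        # Entries without an onion cannot be matched, so they are never merged.
        first = by_onion.get(key) if key else None
        if first is None:
            kept = replace(node)  # copy, so the caller's objects stay untouched
            result.append(kept)
            if key:
                by_onion[key] = kept
            continue
        # Merge only the fields the first entry lacks; never overwrite.
        for field in ("fips_npub", "name", "last_state"):
            if getattr(first, field) is None:
                setattr(first, field, getattr(node, field))
    return result
```

Keeping the first entry and only filling its missing fields means the surviving did stays stable across loads, while a later duplicate can still contribute data (for example a fips_npub) that the first entry never had.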


75 Commits