56 Commits

Author SHA1 Message Date
archipelago
84031e6209 docs: temporarily reduce release lifecycle gate from 20x to 5x
Per user direction: the production test gate is 5x (ARCHY_ITERATIONS=5) on
.228 AND .198 for now, down from 20x. Restore to 20x before the final ship.
Updated CLAUDE.md, PRODUCTION-MASTER-PLAN.md, and tests/lifecycle/TESTING.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 17:11:00 -04:00
archipelago
b0b54a96fa test(lifecycle): immich suite — package-level checks, wait-based destructive tier
container-list reports stack apps package-level (.name="immich"), so the suite
checks the "immich" package (presence, valid state, :2283 lan-address) rather than
individual container names. Destructive tier fires async stop/start/restart and
asserts on the end state via wait_for_container_status.

KNOWN: the destructive tier is flaky for slow multi-container stacks — bats runs
ops back-to-back with no settling while immich's async stack ops take 30s+, and
stopped reports as "exited" not "stopped". The immich migration itself is verified
working (manual stop/start/restart succeed; all 3 containers healthy). Hardening
the harness for stack apps (inter-op settling + stopped|exited acceptance) is a
follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 09:52:33 -04:00
archipelago
b1f175b927 test(lifecycle): add immich stack lifecycle suite
RPC-based (host-agnostic) lifecycle coverage for the manifest-driven immich stack
(immich + immich-postgres + immich-redis): presence + valid state of all 3 members,
a guard that no legacy underscore containers exist (catches botched migration /
legacy-installer fallback), destructive stop/start/restart of the server with
postgres+redis staying up, and cascade uninstall/reinstall (preserve_data).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 09:01:19 -04:00
archipelago
03a4ee1b30 feat(container): manifest-declared generated secrets + companion/quadlet hardening
Generated-secrets system: apps declare `generated_secrets` in their manifest
(kinds hex16/hex32/bcrypt); `container::secrets::ensure_generated_secrets`
materialises them 0600/rootless in resolve_dynamic_env — idempotent and
self-healing (recovers wrongly root-owned secrets with no privilege). Replaces
per-app Rust (deletes ensure_fmcd_password). fedimint-clientd/gateway manifests
now declare fmcd-password / fedimint-gateway-hash.

companion.rs: rebuild the auto-built :latest image when its build context changes
(staleness check) so baked-in fixes (e.g. guardian-UI CSS) actually reach nodes.

quadlet.rs: skip PublishPort under Network=host (podman rejects the combo, exit
125) + regression tests.

UI: "Fedimint Guardian" rename, fedimint-clientd/nostr-rs-relay/meshtastic tagged
as Services (headless backends), gateway icon fallback.

Deployed + verified on .228 (generated-secrets fixed fedimint-gateway start;
grafana/strfry orphan crash-loop units removed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:11:07 -04:00
archipelago
298595069d fix(mesh): native Meshtastic unicast DMs + driver-level E2E status
Meshtastic DMs were falling back to a channel broadcast, so every node
on the LoRa channel saw a "direct" message. Send a directed MeshPacket
(to = node num, decoded from the synthetic pubkey's node-id bytes)
instead — the Meshtastic analog of the meshcore CMD_SEND_TXT_MSG fix.
DMs now reach only the recipient; firmware auto-PKC-encrypts them
end-to-end once NodeInfo keys are exchanged.

Capture E2E status at the driver level (no shared-type/UI change):
- learn each peer's real Curve25519 key from User.public_key (field 8)
  and inbound MeshPacket.public_key (16), kept in a side-map separate
  from the synthetic routing key so unicast routing is untouched
- detect inbound MeshPacket.pki_encrypted (17) to tell a true E2E DM
  from a channel-PSK fallback
- peer_is_pkc_capable() seam for a future mesh-tab E2E badge

Hot-swap preserved: no dispatched MeshRadioDevice signature or the
shared ParsedContact changed, so meshcore and meshtastic stay
interchangeable behind the listener.

Adds tests/multinode/meshtastic.sh, a two/three-radio on-air parity
harness (detect, discover, DM round-trip, DM privacy, channel
broadcast, typed envelope, reachability).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:52:26 -04:00
archipelago
4576964be4 docs(tracker): file new backlog as gitea #32-#35; relay UI + fedimint CSS live on .116 2026-06-16 06:41:22 -04:00
archipelago
82659e9f4e docs(tracker): v1.7.97-alpha cut + mid-rollout state (116 deployed, 198 deploying, fleet pending) 2026-06-16 04:31:18 -04:00
archipelago
8a62ae008c docs(tracker): B17 root-caused + fixed (data-volume mount ordering), verified .198 2026-06-16 03:38:58 -04:00
archipelago
dd0fac0e15 docs(tracker): B16 done (bitcoin tile retain/Updating…, unit-tested); image-opt staged for .97 2026-06-16 02:59:33 -04:00
archipelago
bf24bbc15a fix(mempool): resolve CORE_RPC_HOST to the actual bitcoin node (Knots/Core) (B12)
CORE_RPC_HOST was hardcoded to bitcoin-knots in three env-render paths, so on a
bitcoin-core node (container named bitcoin-core) mempool-api could not reach
Bitcoin RPC. Both node variants are reachable on archy-net by container name —
only the name differs.

- Legacy direct-podman (stacks.rs) and config.rs::get_app_config now use a new
  dependencies::detect_bitcoin_rpc_host() (pure, unit-tested pick_bitcoin_host).
- Quadlet/manifest path (the modern fleet default): add a {{BITCOIN_HOST}}
  derived-env placeholder — HostFacts.bitcoin_host + resolve_derived_env render
  it; prod_orchestrator detects Knots/Core via podman ps, resolved on demand
  only for manifests that use the placeholder. mempool-api manifest moves
  CORE_RPC_HOST from static env to derived_env: {{BITCOIN_HOST}}.

Tests: pick_bitcoin_host (5 cases incl. substring safety), container-crate
resolve_derived_env, and orchestrator mempool_core_rpc_host_follows_bitcoin_node
(core->bitcoin-core, knots->bitcoin-knots). No-regression confirmed: picker
returns bitcoin-knots live on .198. Live bitcoin-core validation pending (no
core node available). Sibling hardcodes (lnd/btcpay/electrumx/fedimint) tracked
as B12b.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 02:07:39 -04:00
archipelago
987a961f4a fix(nginx): self-heal fedimint asset rewrite on deployed nodes — HTTP + HTTPS (B13)
The B13 template fix only fixed fresh ISOs. Already-deployed nodes keep their
old nginx config, where /app/fedimint/ proxies to :8175 without rewriting the
Guardian UI's root-rooted asset URLs (src="/assets/...", url("/assets/...")).
Those resolve against the SPA root: bg-network.jpg exists there by luck, but
app-icons/fedimint.jpg 404s (location /assets/ uses try_files =404) — the
visibly-broken icon.

bootstrap.rs::patch_nginx_conf now heals both paths on startup:
- Style A (main conf, HTTP): swaps the old single nostr-provider sub_filter tail
  for the full reroot set; byte-matches the shipped template.
- Style B (HTTPS app-proxy snippet): the snippet's fedimint block has no
  sub_filter and a per-node-varying trailing directive, so anchor on the unique
  :8175 proxy_pass and insert the reroot set after it (nginx ignores directive
  order). Snippet added to the bootstrap nginx loop (skipped on HTTP-only nodes).

missing_* flags are now gated on their splice anchors so the included snippet
neither attempts the main-conf-only patches nor logs warn-skips every boot.
Idempotent via the 'href="/' 'href="/app/fedimint/' marker.

Verified on .198 (both paths): fedimint app-icon 404 -> 200 image/jpeg; nginx -t
OK; containers survived restart (Quadlet); idempotent steady state, no warn spam.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:03:04 -04:00
archipelago
a50b6df21b fix(nginx): rewrite fedimint UI asset paths so CSS applies (B13, fresh-ISO)
Fedimint UI HTML/CSS reference absolute /assets/* paths; under /app/fedimint/
those hit the main SPA, not the fedimint container, so the UI renders
unstyled. Add the proven sub_filter asset-rewrite pattern (as indeedhub/
botfights use) to the /app/fedimint/ block in the nginx template + https
snippet (also rewrites url(...) for the CSS background image). Bootstrap
self-heal for already-deployed nodes is the documented resume point.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 16:52:30 -04:00
archipelago
8427e219ea docs(tracker): round-2 status (B15/B7 done, B13/B12/B16 deferred w/ plans)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 16:31:24 -04:00
archipelago
c0d41cf8cf fix(ui): faster bitcoin sync refresh + unstick ElectrumX loader (B15,B7)
B15: Home system stats (incl. bitcoin sync %) polled every 30s — too slow;
now 10s so sync progress tracks the actual block height more closely.

B7: the ElectrumX sync overlay was gated only on status!=='synced', so if
the status never flips to 'synced' (ElectrumX stale/disconnected) the loader
stuck on top forever. Now the overlay hides and the app iframe loads when
the sync status is stale (fail-open), while still showing during active
indexing. type-check EXIT 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 16:29:44 -04:00
archipelago
eb55c88e1a docs(tracker): B6/B7/B12/B13/B15/B16 root causes + fix plans
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 14:43:01 -04:00
archipelago
31fe91b99a docs(tracker): B13 fedimint CSS investigation progress
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 14:13:28 -04:00
archipelago
b9cc4bd780 docs(tracker): B14b FIPS reachability findings (dial-time, not npub/service)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 14:11:47 -04:00
archipelago
6c92eacba0 docs(tracker): add B22 (peer download/audio errors), B23 (group chat), B3 PASSED-http
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 14:09:31 -04:00
archipelago
602b9cd3df fix(nginx): route /api/peer-content/* to the backend for B3 streaming
The B3 streaming proxy endpoint existed in the backend but nginx had no
location for /api/peer-content/*, so the browser's requests fell through to
the SPA (200 text/html) and media still wouldn't play. Add an
NGINX_PEER_CONTENT_BLOCK that bootstrap patches into every server block
(forwards Cookie for session auth + Range, proxy_buffering off). Idempotent;
covers fresh-ISO nodes too since bootstrap runs on every startup.

Verified on .198: after restart the async nginx patch lands and
/api/peer-content/<onion>/<id> returns 401 (reaches backend, auth-gated)
instead of the SPA; nginx block present in both server blocks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 14:07:39 -04:00
archipelago
5c8707432b fix(cloud): Range-streaming proxy for peer media so it plays/seeks (B3)
Peer media (music/video) wouldn't play: the frontend downloaded the whole
file via RPC as base64 and made a non-seekable Blob URL, so <video>/large
<audio> stalled and big files hit the RPC timeout.

Add GET /api/peer-content/<onion>/<id> — a same-origin, session-gated proxy
that forwards the browser's Range header to the peer's /content/<id> (which
already returns 206 Partial Content) and passes status + Content-Range +
Content-Type back. PeerFiles.playMedia() now points <video>/<audio> at this
streaming URL for free content instead of buffering a base64 blob, so the
player can seek and start immediately. Onion/id validated to prevent
SSRF/path traversal. (Paid preview keeps its existing flow.)

Verified: cargo build --release EXIT 0; vue-tsc --noEmit EXIT 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:46:51 -04:00
archipelago
4cac6bc835 docs(tracker): record B1/B2/B4/B14/B21 done + B14b; next B3
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:27:51 -04:00
archipelago
0801dd6632 feat(cloud): show Tor/FIPS transport pill on peer browse (B21)
content.browse-peer now returns the transport that actually reached the
peer (fips/tor/mesh/lan). PeerFiles shows it as a small coloured pill next
to the peer name (FIPS/Mesh green, LAN blue, Tor amber) and the loading
text no longer hardcodes "Connecting via Tor" (it was misleading when FIPS
was used). Pairs with B14 (transport recording).

Verified: cargo build --release EXIT 0; vue-tsc --noEmit EXIT 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:25:39 -04:00
archipelago
1c6dc153ce fix(content): use re-exported federation::record_peer_transport path (repair build)
The B14 commit referenced crate::federation::storage::record_peer_transport
but `storage` is a private module — record_peer_transport is re-exported at
crate::federation::. E0603 broke the build. Use the re-exported path (as
load_nodes/fips_npub_for_onion already do). Verified: cargo build --release
EXIT 0. Also logs B21 (Tor/FIPS pill) plan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:15:01 -04:00
archipelago
f2e3710c28 fix(content): record peer transport on cloud browse/download/preview (B14)
The 4 content peer handlers (browse, download, download_paid, preview)
captured the transport returned by PeerRequest::send_get() but discarded
it, so the federation node's last_transport was never updated for cloud
activity — the UI showed Tor/none even when FIPS was used. Call
record_peer_transport() after each successful fetch (same as sync does).

Note: live data shows FIPS still reaches only some peers (many genuinely
fall back to Tor) — tracked separately as B14b (FIPS reachability).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:02:13 -04:00
archipelago
ed4931064b fix(federation,cloud): dedup trusted nodes + chat contacts by onion; guard cloud my-folders (B1,B2,B4)
B1/B2: the same physical node can linger in the federation list under two
dids (e.g. after a did/key change). An onion is a node's unique stable
identity, so two entries with the same onion are one node. This showed the
node twice in the trusted-node list (B1) and as two mesh chat contacts —
one by name+logo, one by raw did (B2).
- storage::load_nodes now collapses same-onion entries (keep first, merge
  fips_npub/name/last_state) so every consumer (list + chat seed + sync)
  sees one entry per node.
- federation::sync merge_transitive_peers also matches by onion (not just
  did) so new transitive hints don't re-add a known node under a new did.
- mesh::seed_federation_peers_into_mesh skips already-seeded onions (belt
  and suspenders).
- Unit tests for dedup_nodes_by_onion (collapse + onion-suffix handling).

B4: filebrowser-client.listDirectory only checked res.ok before res.json(),
so when File Browser is absent (nginx serves the SPA index.html, 200) or
down (502) the JSON parse threw the opaque "Unexpected token '<'". Now it
checks the content-type and throws a friendly "File Browser is not
available" the Cloud view already renders as an empty state.

Verified: dedup unit tests 2/2; live .198 (15 entries→13 distinct onions)
restarted healthy on new binary; B4 guard present in built bundle + deployed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 12:29:12 -04:00
archipelago
1db720af13 fix(lnd): repair fleet-wide CORS on LND connect-wallet endpoints (B5)
The LND wallet UI (served on its own app port) fetches /lnd-connect-info
and /proxy/lnd/* cross-origin, so both need correct CORS headers.

(a) Older nginx configs add their own Access-Control-Allow-Origin in the
    /lnd-connect-info location on top of the one the backend sets, yielding
    a DUPLICATE header that browsers reject ("multiple values"). bootstrap
    now strips that redundant nginx add_header (backend owns CORS).
(b) /proxy/lnd/* returned a 401 with no CORS headers when the session
    check failed, so the browser saw an opaque CORS error instead of a
    readable 401. Add unauthorized_cors() and use it on that path.

Adds tests/production-quality/ (bug tracker + lnd-cors-test.sh harness).
Verified: harness 4/4 on .116, .198, .103.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 11:31:14 -04:00
archipelago
0c8991b519 test(multinode): assertion-based two-node E2E smoke suite
Adds tests/multinode/smoke.sh on the existing multinode.bash lib: an
assertion suite (pass/fail + non-zero exit) driving two real nodes through
login, onion + FIPS identity, FIPS anchor-connected, federation pairing
both directions, peer content browse over the mesh, and the removed-node
tombstone (with an optional 3rd node C for the transitive-reappear case).
Guards the v1.7.94/v1.7.95 fixes. Content-browse + tombstone checks
skip-with-note against peers older than v1.7.95.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 09:03:58 -04:00
archipelago
8c8e4d7a29 test: gate that LND wallet is unlocked after restart (catches fleet-wide lock)
A wrong/locked LND wallet password leaves the wallet LOCKED after every
restart/OTA, breaking all Bitcoin-receive + Lightning ops fleet-wide — and the
harness was blind to it: live-lnd-address-type treats 'wallet locked' as PASS,
os-audit treated lnd-unreachable as WARN, and the archipelago lnd.getinfo RPC
masks a locked wallet (returns all-zero success).

- tests/release/run.sh: new 'live-lnd-unlocked' stage polls LND's unauth
  /v1/state and FAILs if still LOCKED after a 60s grace window.
- tests/lifecycle/os-audit.sh: probe lnd.newaddress (the real receive path,
  which surfaces LND_WALLET_LOCKED) instead of lnd.getinfo; locked = hard FAIL,
  not-installed = WARN.

Proven on .116 (genuinely locked): os-audit now reports
'[FAIL] lnd wallet unlocked (lnd.newaddress) wallet LOCKED'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:36:12 -04:00
archipelago
2fac63e58c feat(release): gate that Settings 'What's New' modal stays in sync with CHANGELOG
The What's New modal (AccountInfoSection.vue) hardcodes one block per release
and had silently drifted: it sat at v1.7.84 while the fleet shipped through
v1.7.92, so eight releases of notes never reached users in Settings.

- scripts/sync-whats-new.py: renders a modal block from each CHANGELOG version
  that's missing one (curated bullets, dev-process 'Validation…' lines dropped),
  inserts newest-first; never touches older hand-written pre-CHANGELOG history.
  --check mode lists anything missing and exits non-zero.
- tests/release/run.sh: new 'whats-new-sync' static gate runs --check, so a
  release with an un-surfaced CHANGELOG entry fails before shipping.
- Backfilled the eight missing blocks (v1.7.85 … v1.7.92) into the modal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 08:31:43 -04:00
archipelago
4232424b23 fix(ui): suppress app-unreachable overlay while ElectrumX sync screen shows
When ElectrumX is still building its index (or waiting on the Bitcoin node),
AppSessionFrame shows a sync 'pre UI'. The iframe-blocked fallback ('App not
reachable / retrying') was not gated on electrsSync, so it painted over the
sync screen and read as a hard connection error. Gate it on !electrsSync,
mirroring the iframe's own guard.

Also harden the lifecycle health probe: container_health used jq '// "unknown"',
which only catches null/false — an empty-string health (a brief window under
load) rendered as a blank 'bad health: X is '. Map empty to 'unknown' so the
retry loop keeps waiting instead of failing on a transient.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 07:58:24 -04:00
archipelago
329e7811eb test(lifecycle): add os-audit OS-wide health gate; docs: v1.7.91 resume notes
os-audit.sh: one non-destructive scorecard tying backend/RPC health, the
all-apps lifecycle audit (delegates to remote-lifecycle.sh), and the FM-guards
(port-drift, secret-completeness, orphan-container sweep, OTA-wedge). The
per-boot building block for the reboot-survival loop. FM12 check uses jq has()
not // (// treats a legit false as empty). Section A validated all-PASS on .116.

docs: v1.7.91 release-pass resume notes + the bitcoinReceive blocker writeup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 04:36:06 -04:00
archipelago
0ed892a412 fix: wallet receive reliability, bitcoin install self-heal, ElectrumX app tile
Fixes three Bitcoin/wallet failures observed across the fleet on v1.7.90-alpha
(all nodes were already on the latest build — these were live bugs, not stale
builds), plus the missing ElectrumX tile, and adds automated coverage so each
can't regress silently.

Receive address (".116 receive fails", ".228 false 'wallet is locked'"):
- LND publishes its REST API on a host port that can drift from the manifest
  (a container created when the mapping was 8080 kept publishing 8080 after the
  manifest moved to 18080). The in-process client connects to the manifest port,
  gets connection-refused, and wallet init fails forever while the container
  looks "Up". Add published-port drift detection to the reconciler
  (container_ports_drifted / host_port_bindings_drifted) that recreates a
  drifted backend even for restart-sensitive apps — a drifted container is
  already broken, so leaving it "untouched" only perpetuates the failure.
- Receive errors now carry a stable [CODE] token (REST_UNREACHABLE, WALLET_LOCKED,
  WALLET_UNINITIALIZED, SYNCING) and always start with "Bitcoin address" so they
  survive the RPC error sanitizer instead of collapsing to the generic
  "Operation failed". The UI maps the code instead of guessing wallet state from
  substrings — so an unreachable REST endpoint is no longer mislabelled "locked".

Bitcoin install (".198 bitcoin gone / reinstall just stops"):
- bitcoin-knots requires the secret bitcoin-rpc-txrelay-rpcauth, which was only
  generated by the tx-relay flow. Nodes that never used tx-relay lacked it, so
  secret resolution hard-failed and the whole Bitcoin stack cascaded. Generate
  it idempotently before bitcoin starts (ensure_app_secrets, reusing
  ensure_txrelay_credentials), and name the missing secret in the error so a
  genuine gap is actionable instead of a bare "IO error".

ElectrumX app tile missing on every node with it installed:
- The catalog generator dropped electrumx because the manifest had no
  interfaces.main block, so the tile had no launch URL and was hidden. Declare
  the companion UI port (50002) in the manifest, regenerate the catalog, and let
  an app with a known launch URL stay launchable while its backend is still
  "starting" (ElectrumX indexes for 10m+).

Test harness:
- New lifecycle bats suites: bitcoin-receive, port-drift, secret-completeness
  (validated live; port-drift catches the real .116 drift).
- Rust unit tests for drift detection, the receive reason-code classifier, and
  the named-missing-secret error; vitest for the UI code mapping.
- create-release.sh now runs tests/release/run.sh and aborts the release on
  failure — previously it ran no tests at all.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 03:12:56 -04:00
archipelago
c800293f1f fix: bitcoin receive, AIUI pointer input, electrs self-heal, OTA timeout
- LND wallet: request correct address type so receive-address generation
  no longer 400s
- AIUI/app session: on-screen pointer can click + type into app content
  (incl. app store search); "open in new tab" opens the phone browser;
  mobile credential modal centered instead of full-height
  (remote-relay.ts, AppSession.vue, AppSessionFrame.vue, AppIconGrid.vue,
  openExternal.ts, WebViewScreen.kt) + remote-relay tests
- health_monitor: electrs auto-recovers from a corrupt index and shows a
  percent/block-height progress screen while reindexing (useElectrsSync.ts)
- update.rs: drop retired tx1138 secondary mirror (one-time migration);
  longer download timeout for slow connections
- CHANGELOG: v1.7.90-alpha notes
- tests/release/run.sh: harness tweaks

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 04:49:32 -04:00
archipelago
c49e8fcacd fix: harden OTA updates, AIUI desktop gap, LND no-proxy
- update.rs: post-OTA probe falls back to http://127.0.0.1/ on connect
  error (nginx binds :80, not :443) so good updates are no longer rolled
  back; recover stuck update_in_progress; avoid ETXTBSY on running binary
- LND: REST client bypasses proxy, GET newaddress p2wkh, wallet
  readiness/unlock after restart
- Dashboard.vue: chat route back to plain h-full (desktop bottom-gap fix)
- vite.config.ts: dev-only /aiui proxy
- tests/release/run.sh: release gate harness (static+frontend+backend)
- CHANGELOG: v1.7.89-alpha notes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 01:23:32 -04:00
archipelago
d6f108d818 chore: snapshot release workspace 2026-06-12 03:00:15 -04:00
archipelago
c393b96da3 backend: harden rootless app lifecycle orchestration 2026-06-11 00:24:32 -04:00
archipelago
d736364ad7 fix(apps): stabilize btcpay and public proxy launch flows 2026-05-19 09:26:43 -04:00
archipelago
7804223152 chore: release v1.7.57-alpha 2026-05-17 17:30:04 -04:00
Dorian
835c525218 chore(release): stage v1.7.55-alpha 2026-05-13 15:09:22 -04:00
archipelago
745cb1c626 chore(release): stage v1.7.52-alpha 2026-05-05 11:29:18 -04:00
archipelago
10fbb8f87c docs(testing): track Phase 3.4 race fix + drift-sync hook
* L0 unit count: 630 → 631 (translate_health_check_http_does_not_double_prefix_scheme)
* Phase 3 row: add TimeoutStartSec=600 race fix (44f275ed) + drift-sync hook (0889367d)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:53:18 -04:00
archipelago
bd96c0475d feat(config): ARCHIPELAGO_USE_QUADLET_BACKENDS env override
Adds an env-var lever for Phase 3.2's use_quadlet_backends flag so the
20× harness can flip the path on per-node without a config.json edit
(which would require an archipelago.service restart — and that triggers
FM3 cgroup cascade until Phase 3.5 ships, so we can't ask anyone to
reconfigure live nodes that way today).

Truthy parsing centralised in `parse_truthy_env` (1, true, yes, on —
case-insensitive, whitespace-trimmed). Anything else is false. The
helper is unit-tested so future env-var flags can reuse the same shape.

Also adds a default-off regression test for use_quadlet_backends so
flipping the default ahead of the 20× verification fires immediately.

TESTING.md documents the Environment= snippet for the systemd drop-in
so the next operator can flip the flag on a debug node without
re-deriving the recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 05:44:09 -04:00
archipelago
9a89a000d4 test(lifecycle): post-condition gate for use_quadlet_backends path
A six-test bats suite that validates what install_via_quadlet (Phase 3.2)
is supposed to leave behind:

  * `.container` unit on disk in $XDG_CONFIG_HOME/containers/systemd/
    with [Container] / [Service] / [Install] sections, Image= present,
    and Restart=on-failure (the backend invariant — companions use Always)
  * Phase 3.4 cross-check: any unit with HealthCmd= must also emit
    Notify=healthy, otherwise systemctl start won't gate on health
  * `systemctl --user is-active` returns 0 for the .service
  * podman shows the container running
  * the container's cgroup is under user.slice/, NOT under
    archipelago.service — the kernel-level proof that FM3 cgroup
    cascade SIGKILL is structurally fixed for this container

Auto-skips on every test when no backend Quadlet units exist (today's
default state, use_quadlet_backends=false) — so the suite is a no-op
on current fleet boxes and turns into a hard regression gate the
moment anyone flips the flag and reinstalls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 05:34:47 -04:00
archipelago
97ce23d773 feat(quadlet): Phase 3.4 — health-gated startup via Notify=healthy
QuadletUnit gains an optional HealthSpec; from_manifest translates the
manifest's health_check (tcp/http/cmd) into a HealthCmd= directive and
emits Notify=healthy alongside it. systemctl start <unit>.service then
blocks until the container's first green probe — eliminating the
"container up but RPC not ready" race the orchestrator currently papers
over with post-start polling.

Translation policy:
* tcp,  endpoint "host:port"        -> nc -z host port
* http, endpoint "host:port", path  -> curl -fsS -m 5 http://endpoint<path>
* cmd,  endpoint "<shell command>"  -> verbatim
* unknown type / malformed endpoint -> None (skip Notify=healthy rather
  than emit a HealthCmd that hangs the unit start forever)

Companion units leave health: None and remain byte-identical to before
this PR — the renderer only emits the Health* / Notify= block when set.

+4 quadlet unit tests (19 total). Dropped a never-used test setter that
was generating a dead_code warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 05:21:57 -04:00
archipelago
65576bd755 feat(orchestrator): Phase 3.3 — in-place migration to Quadlet
When use_quadlet_backends flips from off → on, existing fleet boxes
have backend containers parented under archipelago.service's cgroup
(the bad shape that triggers FM3 cascade SIGKILL on every archipelago
restart). ensure_running now notices and corrects this:

* If there's already a `<name>.container` unit on disk → no-op
  (subsequent reconcile ticks take this fast path).
* Else if a podman container with that name exists → it's a pre-3.3
  artifact. Stop+remove it (volumes survive — bind mounts are not
  touched by `podman rm`), then write the Quadlet unit, daemon-reload,
  and start the new managed service.
* Else → fall through to install_fresh, which already routes through
  install_via_quadlet when the flag is on.

The migration is idempotent and self-healing: if a fleet box is
half-migrated (unit on disk but no service active, or service active
but stale unit), the next reconcile tick converges. Bitcoin chain
data, lnd wallet state, and electrumx index all live on host bind
mounts and are unaffected by the container-record swap.

Volume safety audited per backend in `uses_orchestrator_install_flow`
allowlist — every entry mounts its data dir as a host bind mount.

Default still off. To migrate a node:
  /etc/archipelago/config.toml: use_quadlet_backends = true
followed by `systemctl restart archipelago` — the next reconcile tick
walks every managed app and migrates each in turn.

Tests: 624 passing, 0 cargo warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:27:59 -04:00
archipelago
5b2e02bd43 feat(orchestrator): Phase 3.2 — wire Quadlet path behind feature flag
prod_orchestrator::install_fresh now branches on the new
Config::use_quadlet_backends flag (default false):

* off (today's production behavior) — unchanged: runtime.create_container
  + start_container, container parented under archipelago.service's
  cgroup, FM3 cascade SIGKILL on every archipelago restart.
* on  — install_via_quadlet renders the manifest as a Quadlet unit via
  QuadletUnit::from_manifest, writes it atomically into
  ~/.config/containers/systemd/, calls daemon-reload, and starts the
  generated <name>.service. Container ends up under user.slice — no
  more cgroup parented under archipelago, so archipelago restarts
  don't touch the container's lifetime.

Default off so this commit is structurally safe to ship: nothing
changes at runtime until an operator opts in. Flip the default once
tests/lifecycle/run-20x.sh has gone green against the new path on
.228 + .198 (the v1.7.52 release gate).

Plumbing:
* config.rs — `use_quadlet_backends: bool` w/ Default false
* prod_orchestrator.rs — flag stored on the struct, threaded through
  new(), with set_use_quadlet_backends(bool) test setter
* prod_orchestrator.rs — install_via_quadlet helper
* dropped the Phase-3.1 #[allow(dead_code)] markers on from_manifest /
  parse_memory_mib / RestartPolicy::OnFailure now that the call path
  exists; if a future revert removes the wiring, the warnings come back.

Tests: 624 passing, cargo check clean (0 warnings). Existing companion
behavior unaffected — render_skips_backend_directives_when_default
still passes byte-equal to before quadlet.rs grew the new fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:22:10 -04:00
archipelago
9becafafd3 feat(quadlet): backend-manifest renderer (Phase 3.1 of v1.7.52)
The QuadletUnit struct now covers everything a backend manifest needs
(ports, environment, devices, add_hosts, entrypoint+command, read-only
root, no_new_privileges, cpu_quota, restart policy choice). Adds
QuadletUnit::from_manifest(&AppManifest, name) that translates a parsed
manifest into a unit, plus parse_memory_mib for "1g"/"512m"/raw-MiB
forms. The renderer skips empty/false directives so existing companion
units render byte-identically — no behavior change for shipping
companions; the backend renderer is dead code until Phase 3.2 wires it
into the orchestrator.

Eight new unit tests cover:
* parse_memory_mib forms (1024, 512m, 2g, garbage)
* shell_join quoting (whitespace, embedded quotes)
* RestartPolicy → systemd string mapping
* render emits backend directives when set
* render skips them when defaulted (companion regression gate)
* from_manifest happy path on a bitcoin-knots-shaped manifest
* from_manifest read-only volume detection
* from_manifest tmpfs filtering
* end-to-end manifest → render bytes assertion

Tests: 615 → 624 (+9 net; one pre-existing parse_memory_mib path was
implicitly covered before but is now explicit). Cargo warnings: 0.

`from_manifest`, `parse_memory_mib`, and `RestartPolicy::OnFailure` are
marked allow(dead_code) with explicit references to Phase 3.2 — if
3.2 doesn't wire them, the dead-code warning resurfaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:09:50 -04:00
archipelago
5074572373 test(lifecycle): add btcpay + fedimint + mempool suites
Brings L1 (RPC API) + L3 (lifecycle survival) parity coverage to the
three multi-app stacks that were previously only touched by
required-stack.bats. Combined with bitcoin-knots / lnd / electrumx
already shipping, the six core apps now have dedicated bats files.

Each suite is shaped like the existing single-container suites
(bitcoin-knots / lnd / electrumx) and gates every assertion on the
backing container actually being present, so a node without the stack
installed gets clean skip messages instead of false fails.

* btcpay.bats — 9 tests, including stack-wide presence and a
  "supporting containers don't cascade-restart" guard
* fedimint.bats — 8 tests, single container
* mempool.bats — 9 tests, mixed legacy + orchestrator-managed stack;
  reuses the :8999 mempool-api probe from required-stack for parity

Total bats now: 88 (was 53 → +35).
TESTING.md matrix advances 23 → 50 of 110 cells.
UI URL coverage for these three apps already lives in
ui-coverage.bats, so this PR doesn't duplicate proxy-path probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:55:31 -04:00
archipelago
ec1dce93a9 docs(testing): canonical scorecard for container subsystem testing
Single source of truth for "where are we, where are we going" on the
v1.7.52 container excellence work. Replaces ad-hoc tracking in chat.

Sections:
* Test layers L0..L6 with toolchain + per-iteration latency
* Per-app × per-state coverage matrix (23 of 110 cells today; goal 110)
* Layer-by-layer status (L0+L1+L2 ●; L3 ◐; L4..L6 ○)
* Run commands (single suite / full suite / 20×)
* LoC budget — -270 committed, ~1,616 more possible if Phase 3 ships
* Performance KPIs (TBD — measure first, target second)
* Release gates — 8 boxes that must tick before v1.7.52 ships

The file lives in-repo so PR diffs to it answer "what did this commit
improve?". If you can't tick the box, the change isn't ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:52:42 -04:00
archipelago
b9eb6eb18a test(lifecycle): add UI surface coverage — HTTPS proxy + iframe URLs
Closes the coverage gap where existing bats suites would report green
on a node whose dashboard tiles 502 because the proxy upstream is dead.
First pass against .198 caught real prod issues immediately:
  /app/lnd/       → 502 (lnd container exited)
  /app/mempool/   → 502 (mempool container exited)
  /app/fedimint/  → 502 (fedimint container exited)
while existing tests reported only "container is up: false" with no
404/502 distinction.

* lib/ui-probes.bash — sourced helper. probe_https_200,
  probe_app_url (skip-if-container-down else assert-200),
  probe_dashboard_shell (asserts the Vue SPA HTML, not nginx default —
  catches the layout regression from feedback_release_tarball_layout.md),
  probe_dashboard_catalog (asserts /catalog.json non-empty).
* bats/ui-coverage.bats — 9 @test cases covering the dashboard +
  bitcoin-ui :8334 + the seven HTTPS_PROXY_PATHS most users hit
  (lnd, electrumx, mempool, fedimint, btcpay, filebrowser).

URL list mirrors HTTPS_PROXY_PATHS in
neode-ui/src/views/appSession/appSessionConfig.ts. Divergence between
the two is the exact bug class we're guarding against.

Loops clean under run-20x.sh. Container-state oracle is via local
podman inspect, so the suite must run on the archy host (same as
companion-survives-archipelago-restart.bats).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:49:30 -04:00