lfg2025/archy

Author	SHA1	Message	Date
archipelago	7b0748c868	fix(mesh): respect the radio's flashed LoRa region (don't force ours) ensure_lora_region previously force-overrode the device's region with the mesh-config region (EU_868) whenever they differed — which would shove a US/ANZ user's radio onto EU_868: an illegal band that also cuts it off from its local mesh. Off-the-shelf interop must respect whatever region the user flashed. Now: a radio that already reports a REAL region (US, EU_868, ANZ, …) is left untouched. We only set a region when the device reports UNSET (a fresh radio is RF-silent and can't mesh at all), using the operator-configured region as the fallback. Unknown/None (never reported) is also left alone. Pairs with the default-channel change so a meshtastic archy node behaves like a stock device. cargo check green (built into the same binary as the channel fix). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 08:36:04 -04:00
archipelago	810127fd3e	feat(mesh): meshtastic off-the-shelf interop — default channel + private archipelago Make a meshtastic-equipped archy node work like a stock Meshtastic device AND keep the private archy group, instead of being isolated on a custom primary: - slot 0 (PRIMARY) = the DEFAULT public channel (empty name + default key) → interoperates with every off-the-shelf device on LongFast and picks up default-channel users; our NodeInfo broadcasts ride here like normal. - slot 1 (SECONDARY) = "archipelago" (deterministic psk) → private archy↔archy. Previously the driver set "archipelago" as the PRIMARY, isolating archy from the public mesh. Now ensure_channel writes at most one channel per call (default primary first, then archipelago secondary), reusing the existing reboot→reconnect→re-check loop so it converges in ≤2 cycles without reboot-looping; primary_is_default() accepts the default key in 1-byte or expanded form so a stock radio is never needlessly rewritten. set_channel generalized to (index, name, psk, role); want_config parse tracks both slots. MeshCore needs no change — it never overrides channels (ensure_channel is a no-op) and already rides MeshCore's default Public channel off the shelf. cargo check green. NEEDS radio verify on .116/.198 (default-channel RX + archy group on the secondary). Channel provision cap (3) covers the 2-write migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 07:40:10 -04:00
archipelago	067002b04b	Merge branch 'bitcoin-version-bulletproof' into mesh-multiversion-integration	2026-06-29 06:45:50 -04:00
archipelago	20f762cb2c	feat(fips): auto-peer LAN-discovered federation nodes directly over FIPS Mesh/federation messages between co-located nodes were always falling back to Tor because the FIPS overlay had no direct peering — every node depended on the global anchor's spanning tree, and when that anchor link flaps a node is isolated and all FIPS dials time out. (Diagnosed live on .116/.198: pure-FIPS direct peering over UDP 8668 fixes it — 2.5ms vs timeout.) Generalize the manual fix: in the existing 5-min FIPS seed-anchor apply loop, also auto-connect every federation peer the PeerRegistry knows both a LAN address AND a FIPS npub for, dialing its FIPS UDP transport (port 8668) at its LAN IP via the same idempotent `fipsctl connect` path (new anchors::lan_fips_anchors). This is FIPS's own transport over the LAN — NOT Tailscale, NOT the HTTP/LAN messaging port. Transient (recomputed each tick from live mDNS discovery, never persisted) so changing IPs self-correct. Remote peers with no LAN address are untouched (still routed via the anchor). Registry Arc hoisted out of the transport-init block so the loop can read all_peers(). cargo check green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 06:42:18 -04:00
archipelago	11155055aa	feat(mesh): meshtastic PKI E2E pill — surface pki_encrypted on received DMs The synthetic meshcore-style frame the meshtastic driver builds can't carry the radio's PKI-encryption status, so received meshtastic DMs never lit the E2E pill. Thread it out-of-band: the device records `last_rx_encrypted` (= packet pki_encrypted) when it yields a text frame; the session loop reads it via `take_rx_encrypted()` right after dispatch and stamps the just-stored received message E2E (dispatch::stamp_received_encrypted, monotonic-id keyed). Meshcore returns false here (its E2E is derived in the frames decrypt path). Pure out-of-band signal — no change to the shared meshcore wire format. Built + deployed live in binary d937814e on .116/.198. cargo check green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 06:25:01 -04:00
archipelago	f4f45c1a09	docs: mark .228 reindex finish/verify as other-agent owned Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 06:04:01 -04:00
archipelago	ed1352d3a3	docs+catalog: bitcoin multi-version rollout handoff + reproducible generator - generate-app-catalog.sh: VERSIONS map now lists the full Knots set (29.3.knots20260508/20260507/20260210 + 29.2.knots20251110) and Core (adds 29.2 + a `latest` entry → newest); generator forces top-level `version` == the default entry's version (the 169ff2e2 invariant) so regeneration is reproducible. releases/app-catalog.json regenerated. - docs/bitcoin-version-bulletproof-rollout.md: full handoff — root causes, fixes, current .228 state, the coordinated fleet-rollout steps (incl. :latest repoint sequencing / fleet-safety), reindex finish procedure, and the switch-matrix test plan. - PRODUCTION-MASTER-PLAN.md: link the rollout doc (§6b-bis). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 06:02:24 -04:00
archipelago	095a76cd20	fix(bitcoin): bulletproof multi-version switching (Knots & Core) Three stacked bugs made "switch version" silently fail / crash-loop, and the data-access mismatch corrupted a node's index during recovery attempts. Backend renderer: - sync_quadlet_unit ignored the per-app pinned version and re-rendered the quadlet with the manifest's :latest every reconcile tick, reverting any switch. Factor the install-time catalog/pin resolution into a shared resolve_catalog_image() and call it in BOTH install_fresh and sync_quadlet_unit. - The renderer folded manifest `entrypoint: ["sh","-lc"]` into Exec=, which only worked when the image entrypoint was a passthrough shell wrapper. The versioned images use ENTRYPOINT ["bitcoind"], so Exec=sh -lc ... became `bitcoind sh -lc ...` and crash-looped. Emit a real Entrypoint= override; exec_changed now also compares Entrypoint=. Images: - Build all bitcoin images (Core + Knots, every version) as container-root (USER removed) like the legacy :latest image. Chain data is owned by the data_uid (container uid 102); root reads it via CAP_DAC_OVERRIDE (granted in the manifest). A non-root USER (the previous uid 1000) can't read existing chain data → "Error initializing block database". Still fully rootless: container-root maps to the unprivileged host service user. Catalog: - bitcoin-knots versions[]: 29.3.knots20260508/20260507/20260210 + 29.2.knots20251110, "latest" tracking newest. - bitcoin-core versions[]: add 29.2 + a "latest" entry. All images rebuilt root and published to the mirror. Frontend: - AppSidebar version dropdown: rename the latest option to "Always use the latest version" (no v prefix), fix right padding, and guarantee the current selection matches a real option (was rendering blank). - New InstallVersionModal: full-screen version chooser shown from the App Store / Discover install button for multi-version apps (Bitcoin Knots/Core), app icon + "Install <name>", latest pre-selected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 05:46:04 -04:00
archipelago	3c7c04a662	fix(mesh): meshtastic receive — drain frame batch per poll + rx diagnostics Addresses the open Meshtastic parity bug (project_meshtastic_parity): the running driver received nothing (`mesh.messages` stayed []) though the radio got the packets and sends worked. Root-cause candidate: `try_recv_frame` decoded ONE serial frame per poll and returned Ok(None) for every non-text FromRadio frame, so the session loop slept 50ms between frames. Under Meshtastic's frequent NodeInfo/telemetry stream a received text packet queued behind them, and read_from_radio's 64KB buffer cap could drain (drop) it before it was ever decoded — reception silently dead while sends kept working. - try_recv_frame now drains a bounded batch (64) per poll, processing each frame's side effects and returning the first inbound text frame, so a text packet is decoded the same poll it arrives and the buffer never grows enough to hit the lossy cap. Bounded so a continuous flood still yields to select!. - packet_to_inbound_frame logs every decoded packet (from/portnum/payload_len) and a "did not parse (dropped)" case, so one live radio pass is conclusive. The rest of the decode path was verified correct by inspection (FROM_RADIO_PACKET =2, wire-type-5 handled, parse_mesh_packet sound, 60s heartbeat present) — not a parse bug. cargo check green. NEEDS a live radio pass on a rig that isn't .228 (off-limits: bitcoin testing) to confirm. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 05:04:09 -04:00
archipelago	11038cdcc9	feat(mesh,ui): per-message transport pill (Mesh/FIPS/Tor) + fix E2E pill Adds a per-message transport badge to archy↔archy mesh chats and fixes the long-broken E2E badge — both meshcore and meshtastic, styled like the existing E2E pill. Transport pill: - New `MeshMessage.transport` ("lora"/"fips"/"tor"), surfaced in the UI beside the E2E badge (Mesh.vue transportLabel() → Mesh/FIPS/Tor, mesh-styles.css). - Sent LoRa → "lora"; sent federation → finalized to the real leg ("fips"/"tor") once the background send resolves (req.send_json transport), via an id-keyed store update. - Received: a post-dispatch stamp on handle_typed_envelope_direct's output (monotonic ids) tags both transports without threading through all 20 typed-dispatch sites — radio wrapper stamps "lora", federation injector stamps the peer's last_transport ("fips"/"tor", default tor; the inbound HTTP carries no FIPS-vs-Tor signal). - Plain native/channel LoRa frames → "lora"; channel broadcasts stay non-E2E. E2E pill fix: - `encrypted` was hardcoded false at every MeshMessage construction site, so the UI badge (Mesh.vue `v-if="msg.encrypted"`) never showed. Now: federation envelopes are E2E (identity-signed over an encrypted transport); the meshcore native-DM receive path already had a real `encrypted` flag (now also tagged with transport). meshtastic-PKI radio E2E flag threading is a noted follow-up. Backend cargo check + frontend vue-tsc build both green. Needs a live radio + multi-transport pass on .116/.228 to confirm end-to-end (see project_transport_pill / project_meshtastic_parity). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 04:29:25 -04:00
archipelago	169ff2e2cd	fix(bitcoin): knots catalog default must equal top-level version The knots versions[] marked 29.3.knots20260508 as default while the top-level catalog version is the floating 'latest' tag — violating the generator's own invariant (default:true MUST equal the top-level version so selecting it un-pins / tracks latest). Live effect via package.versions: catalog_default_version='latest' so the UI-highlighted default actually PINS+recreates (opposite of un-pin) and 'latest' was unreachable from the Version & Updates card. Add a 'latest' default entry (== the manifest's floating tag) and keep 29.3.knots20260508 as a pinnable option. Verified on .228: package.versions now returns default=latest with 2 selectable versions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 19:56:49 -04:00
archipelago	da20f67462	Merge bitcoin-multi-version: multi-version support for Core & Knots Integrate the bitcoin-multi-version feature (commit 6aa74c73): per-node choice/pin/switch of Bitcoin Core & Knots versions with auto-update toggle — catalog versions[] schema, install-time selection, package.versions + package.set-config RPCs, hourly per-app auto-update tick, build-bitcoin-image.sh (GPG+SHA verified rootless image builder), and UI (version select + Version & Updates card). Catalog regenerated; preserves the mempool 127.0.0.1 health fix. Not yet live-verified on .228 — gate any tagged release on that per CLAUDE.md.	2026-06-28 18:48:38 -04:00
archipelago	6aa74c7386	feat(bitcoin): multi-version support for Core & Knots (install/switch/pin/auto-update) Lets a node runner choose which Bitcoin Core / Knots version to install (latest pre-selected), then switch, pin, or opt into auto-update from the app's interface — all manifest/catalog-driven, rootless, signed-registry, zero-data-loss. Motivated by upcoming BIP-110 signalling: runners need a real choice of software version. Backend: - version_config.rs: per-app pin + auto-update persistence (atomic, merge-preserving), downgrade detection, auto-update enumeration (+ unit tests). - app_catalog.rs: CatalogVersion / versions[] schema, catalog_versions(), catalog_image_for_version() (same-repo guard); a pin suppresses the update badge. - prod_orchestrator.rs: pinned version wins over the catalog default on every install/recreate. - install.rs: install-time `version` param persisted (default = unpinned). - set_config.rs: package.versions (read) + package.set-config (write) RPCs; downgrade is gated behind explicit confirm (warn + confirm + allow). - update.rs/main.rs: hourly per-app auto-update tick via the orchestrator (opt-in, pin-respecting); fix handle_package_update to be non-fatal for orchestrator-managed apps lacking a catalog primary image (bitcoin-core). UI: - MarketplaceAppDetails.vue: install-time version selector (shown when an app offers >=2 versions). - appDetails/AppSidebar.vue: "Version & Updates" card (switch / pin / auto-update toggle / downgrade warning), per app. - rpc-client.ts + en.json: RPC methods, types, strings. Phase 0 image pipeline: - scripts/build-bitcoin-image.sh: download official tarball + SHA256SUMS(.asc), verify SHA-256 + pinned-maintainer OpenPGP signature (fail-closed), build a minimal rootless image, smoke-test, tag + push. - apps/bitcoin-core/Dockerfile rewritten (drops stale community base); apps/bitcoin-knots/Dockerfile added. - generate-app-catalog.sh: emit curated versions[]; published + catalog now offers Core 25.2/26.2/27.2/28.4/29.3/30.2/31.0 + Knots 29.3.knots20260508. docs/bitcoin-multi-version-design.md: live progress tracker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 18:46:17 -04:00
archipelago	3cea7dd6c5	test(phase3): fix Phase-3 quadlet gates — define fail(), drop stale Notify=healthy assert Two Phase-3 bats suites used `fail` (a bats-assert helper) but bats-assert isn't installed on the alpha fleet (only bats-core), so every tripped assertion crashed with `fail: command not found` (status 127) instead of reporting a real pass/fail. Define the same minimal `fail() { echo ...; return 1; }` the other suites already use (see mempool.bats). Without this the gates were silently non-functional. Also rewrite the obsolete "HealthCmd= implies Notify=healthy" assertion in use-quadlet-backends-install.bats. Phase 3.4's Notify=healthy was deliberately reverted: gating `systemctl start` on health hung boot reconciliation for dependency-waiting apps (fedimint idles until Bitcoin IBD; lnd until macaroon unlock), leaving units stuck "deactivating". The renderer now emits HealthCmd= for Podman's health state but TimeoutStartSec=0 and NO Notify=healthy (quadlet.rs render() + contains_stale_health_gate()). The test now asserts the current invariant: no backend unit gates start on health. Verified on the .228 canary node (ARCHIPELAGO_USE_QUADLET_BACKENDS=1): use-quadlet-backends-install 6/6, backend-survives-archipelago-restart 3/3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 16:09:05 -04:00
archipelago	d7c6f8c348	fix(mempool): health-check 127.0.0.1 not localhost (stops false-unhealthy loop) The archy-mempool-web health_check endpoint used http://localhost:8080. Inside the frontend image, wget resolves `localhost` to ::1 (IPv6) first, but nginx binds 0.0.0.0:8080 (IPv4) only -> the baked HealthCmd gets "connection refused" every probe -> container is perpetually unhealthy -> the reconciler recreates it forever (observed on .228: mempool container re-Started every ~3 min, Health=unhealthy). Proven live: in-container `wget http://localhost:8080/` = refused, `wget http://127.0.0.1:8080/` = OK. Pin the probe to 127.0.0.1 so it matches nginx's IPv4 bind. Updated both the source manifest and the embedded copy in releases/app-catalog.json (the catalog overlay wins over the disk manifest on fleet nodes, so the catalog copy is the one that actually reaches .228). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 15:09:34 -04:00
archipelago	83344b9f3a	fix(orchestrator): drop legacy mempool umbrella manifest on catalog-driven nodes The split-mempool-stack guard that skips the legacy monolithic `mempool` manifest (whose container collides with its split-stack frontend member `archy-mempool-web`) only ran over DISK manifests. On catalog-driven nodes (no disk manifests — e.g. the Phase-3/registry-manifest path), the legacy `mempool` manifest arrives via the registry-catalog overlay AFTER that guard, so both `mempool` and `archy-mempool-web` end up owning container `mempool` and rewrite+restart each other forever ("port binding drift" / "network alias drift" loop observed on .228, leaving mempool down). Enforce the guard once more over the merged (disk + catalog) manifest set: drop the `mempool` umbrella whenever all three split members are present. Installing `mempool` assembles the split stack, so `archy-mempool-web` owns the frontend container either way. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 14:04:41 -04:00
archipelago	05c22b6085	fix(mempool): correct frontend container port 4080->8080 (stops restart loop) The mempool manifest + embedded catalog declared the frontend container port as 4080, but mempool-frontend nginx listens on 8080 (the stack creates it as -p 4080:8080 with FRONTEND_HTTP_PORT=8080, see api/rpc/package/stacks.rs). So every reconcile rendered the quadlet as PublishPort=4080:4080, disagreed with the working 4080:8080 container, and restarted it ("port binding drift" -> "host port 4080 did not become reachable within 5s" -> "host listener disappeared; restarting") in a perpetual loop on .228. Correcting the manifest container port to 8080 makes the rendered quadlet match reality so the drift/restart loop stops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 13:49:54 -04:00
archipelago	6734947c3e	fix(fmcd): cap CPU + watchdog-restart the iroh relay hot-loop On NAT'd nodes that can reach the iroh federation neither directly nor via iroh's public relays, fmcd's embedded iroh networking enters a relay/hole-punch reconnect hot-loop that pegs its entire CPU allotment indefinitely (observed ~1 core sustained for 4 days on a Tailscale node, while LAN nodes that reach the guardian directly stay <3%). fmcd 0.8.0 exposes no iroh/relay knobs, so: - fmcd-run now samples fmcd's own CPU and restarts it when it stays near its allotment for ~15 min (a restart demonstrably clears the stuck iroh state; real work is bursty and never flat-pegs a core for minutes). - Lower cpu_limit 1 -> 0.25 core so a stuck instance can't starve the node (steady-state is <3% of a core; joins are brief). Ships as fmcd:0.8.1 (launcher-only rebuild, same fmcd binary). Bumped the image pin + cpu_limit in the manifest, image-versions.sh, the embedded catalog manifest (releases/app-catalog.json), and the UI catalogs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 12:19:27 -04:00
archipelago	4519dbf04f	fix(orchestrator): render manifest certs on the adopted-running reconcile path WS-F #10: a netbird reinstall that adopts a leftover running container skipped ensure_manifest_certs, so when its data dir was wiped the self-signed tls.crt/key were never regenerated; the next nginx.conf rewrite + restart then died on the missing cert (proxy 502, login broken). The Running branch of ensure_running_with_mode now calls ensure_manifest_certs before ensure_manifest_files, mirroring prepare_for_start's certs-before-files ordering. Idempotent: a no-op when crt+key already exist. Live-validated on .228: deleted netbird tls.crt/key under a Running container; reconciler regenerated a fresh CN=<host_ip> self-signed cert (1000:1000), https :8087 = 200. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 17:49:50 -04:00
archipelago	a38c9d5f29	docs(master-plan): §10d Meshtastic MeshCore-parity status (one open received-msg bug) Region (EU_868) + shared channel "archipelago" auto-provisioning shipped in 8fdb45e8 and riding the rolled #9 fleet binary (0060dcd6). Discovery, RF, and sending verified on .116+.228; the one open blocker is the running driver not surfacing received messages. Slotted after WS-F #9–11. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 04:53:06 -04:00
archipelago	f9a6ae3f32	feat(mesh): Meshtastic region + shared-channel auto-provisioning (MeshCore parity) Fresh Meshtastic radios ship region-UNSET (RF-silent) and on mismatched channels, so nodes only ever saw themselves. Bring them to MeshCore parity using the official Meshtastic admin API: - Auto-provision LoRa region (set_config, AdminMessage field 34) from a new mesh-config `lora_region` (e.g. EU_868) when the radio's region differs. - Auto-provision a shared primary channel (set_channel, field 33) with a PSK derived deterministically from channel_name, so every node converges on one mesh — the parity equivalent of MeshCore's named "archipelago" channel. - Read current region/channel from want_config; only write when different (no reboot loop); cap attempts so a radio that won't persist can't loop. - Active NodeInfo advert scaffolding + aggressive serial drain. Verified on .116+.228: region+channel persist, discovery works (both see each other as named reachable contacts), bidirectional RF + sending confirmed. Receiving in the running driver is still under diagnosis (instrumentation added). Also removes the unwanted `meshtastic` daemon app from the registry (it was never meant to be a container — native driver provides system-level support): deletes apps/meshtastic + catalog entries (app-catalog, neode-ui, releases) + test refs. Meshtastic stays native, like MeshCore. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 04:46:35 -04:00
archipelago	fd3a4ee4ef	fix(orchestrator): chown the whole fresh bind subtree, not just the leaf ensure_bind_mount_dirs chowned a freshly-created no-data_uid bind dir with --reference={immediate_parent}. For a NESTED bind source like jellyfin's /var/lib/archipelago/jellyfin/config (or netbird's .../netbird/data), `mkdir -p` creates the intermediate <app> dir root:root too, so referencing the immediate parent just copied ROOT — leaving the dir unwritable and the app EACCES-crash-looping on reinstall (found by the all-apps-lifecycle pass: jellyfin "/config/log denied" exit 139; netbird-server "unable to open database file"). It only ever worked for direct children of the data root (immich). Fix: anchor to the nearest PRE-EXISTING ancestor (the rootless data root, owned by the service user) and chown -R the entire newly-created subtree to it. Extracted the walk into fresh_subtree_anchor() with a unit test covering nested / direct / second-volume cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 04:46:35 -04:00
Dorian	38d2bbf570	chore(android): update companion APK download [skip ci]	2026-06-26 13:08:37 +01:00
Dorian	a90fea80ed	feat(android): edit server entries from in-app settings menu (NESMenu); bump to 0.4.12 (vc16) The 0.4.11 edit affordance only lived on ServerConnectScreen, which a connected user never sees. Add edit to NESMenu — the settings modal reached via two-finger hold while connected: a ✎ pencil on each saved server opens the form pre-populated (Edit Server header + Cancel), persists via ServerPreferences.updateSavedServer(), and reconnects when the edited server is the live one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 13:08:18 +01:00
Dorian	389e602097	chore(android): update companion APK download [skip ci]	2026-06-26 12:54:52 +01:00
Dorian	5677f9cca1	feat(android): edit saved server entries; bump companion to 0.4.11 (vc15) Add an edit affordance to each saved server in ServerConnectScreen: a pencil button loads the entry into the form (Edit Server mode) with Save Changes / Cancel actions. Persisted via a new ServerPreferences.updateSavedServer() that replaces by connection identity (address/port/scheme) and keeps the active record in sync when the edited server is the active one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 12:54:07 +01:00
archipelago	fc64b422e7	docs(master-plan): WS-F#3 first destructive run — 3 reinstall bugs found Full all-apps-lifecycle pass on .228: lifecycle 11/11, teardown 8/11. Surfaced (1) fresh-install bind-dir ownership root:root → reinstall EACCES (jellyfin/netbird; Fix B misses the install path), (2) netbird reinstall adopts leftover containers → skips manifest cert/file render, (3) portainer image pin lfg2025/portainer:2.19.4 unpublished (manifest unknown), pin overrides RPC dockerImage. .228 restored. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 07:47:24 -04:00
Dorian	07b9b5a3aa	docs(android): companion release + App-Not-Installed runbook Capture the 2026-06-26 lessons durably: ship via the hardened publish script only, v1+v2+v3 signing is enforced by apksigner (AGP ignores enableV1Signing at minSdk>=24), diagnose install failures with adb install FIRST, signature-key changes force a one-time uninstall, and keep all phone/adb work scoped to com.archipelago.app.debug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 12:21:48 +01:00
Dorian	ac59771560	fix(android): force v1+v2+v3 signing & clean-build guards in companion publish The published companion APK was v2-only (AGP silently ignores enableV1Signing for minSdk>=24) and clean builds broke on stray space-named resource dirs. Harden scripts/publish-companion-apk.sh: clean build, remove/reject space-named res dirs, force v1+v2+v3 via zipalign+apksigner, and abort unless all three schemes verify. Wire ship-companion.sh to the shared script. Re-sign the served 0.4.10 APK. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 11:53:25 +01:00
Dorian	d1f9e9ce88	chore(android): update companion apk download	2026-06-26 11:32:00 +01:00
Dorian	58847fc3d7	chore(android): bump companion to 0.4.10 (versionCode 14) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 11:31:36 +01:00
archipelago	a3e09eab57	docs(master-plan): WS-F#3 — destructive all-apps lifecycle matrix landed (43934eef) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:29:51 -04:00
archipelago	43934eefa5	test(gate): destructive all-apps lifecycle matrix (WS-F#3) Active counterpart to the read-only all-apps-matrix.bats: drives stop/start/restart for every installed app and, under ARCHY_ALLOW_CASCADE_DESTRUCTIVE, a FULL teardown (uninstall → no-ghost → reinstall) — the broad coverage F needs beyond the ~8 core suites. App set is discovered from My Apps ∩ the node catalog; reinstall spec comes from catalog.json {dockerImage, containerConfig}. PROTECTED by default (never cycled or torn down): bitcoin/electrum (expensive resync) AND lnd/btcpay/fedimint (teardown = irreversible wallet/channel/guardian loss). The user asked to protect only bitcoin+electrum; the wallet apps are added for safety and can be removed via ARCHY_MATRIX_PROTECT. Heavy + destructive → a supervised pass, not folded into run-gate. Validated on .228: discovery excludes the 6 protected installed apps; lifecycle tier cycles a single app (botfights) stop/start/restart green; teardown gated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:29:22 -04:00
archipelago	80146f4476	docs(master-plan): WS-F#2 — uninstall progress bar made truthful (9f17ba68) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:15:11 -04:00
archipelago	9f17ba6867	fix(ui): truthful uninstall progress bar (was a solid full-red block) AppCard's uninstall bar was hardcoded `w-full bg-red-400/60 animate-pulse` — a solid, full-width, red, fake-pulsing block that never moved and read as an error, no matter the actual teardown progress (the install bar, by contrast, renders a real percentage). Derive a truthful percentage from the backend's existing `uninstall-stage` label — "Stopping containers (X/N)" → 10–50%, "Cleaning up volumes" → 70%, "Removing app data" → 90% — and render it exactly like install: neutral fill, real width + percent, shimmer (not a fake pulse) carrying motion when a stage has no number. Frontend-only; the backend already broadcasts these stages. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:04:48 -04:00
archipelago	67426c0d41	docs(master-plan): cascade tier wired into the gate (b7d92107) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 05:24:07 -04:00
archipelago	b7d9210784	test(gate): optional ARCHY_GATE_CASCADE pass — wire the cascade tier in run-gate.sh ran only the DESTRUCTIVE tier; the cascade-uninstall suite (uninstall→no-ghost→reinstall, the #13/#14/uninstall-hang regression guard) existed but was never enabled by the gate. Add an opt-in single cascade pass after the 5× loop (ARCHY_GATE_CASCADE=1, requires ARCHY_ALLOW_DESTRUCTIVE=1), counted into the pass/fail tally. Kept out of the 5× loop deliberately — uninstall/reinstall every iteration would balloon runtime and re-pull images; one pass guards the class. Default gate behavior unchanged. Validated: cascade-uninstall.bats 7/7 on .228. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 05:22:45 -04:00
archipelago	292a2650df	docs(master-plan): WS-F — uninstall-hang root cause fixed + cascade validated Workstream F now in-progress: the immich/grafana uninstall hang → ghost/stuck-bar/reinstall-block is root-caused (unbounded systemctl/podman in quadlet::disable_remove) and fixed (71cc9ac4); cascade-uninstall.bats 7/7 on .228. Records the remaining F items + the pending gate-wiring decision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 05:18:39 -04:00
archipelago	71cc9ac46a	fix(uninstall): bound systemctl/podman teardown so uninstall can't hang Uninstalling immich/grafana could hang with a frozen full-red progress bar, leave a ghost entry stuck in My Apps, and then refuse reinstall. Single root cause: quadlet::disable_remove() — called first in the uninstall task (via companion + orchestrator teardown) — ran `systemctl --user stop`, daemon-reload, and `podman rm -f` with NO timeout. On rootless podman a generated unit can wedge in "deactivating" while podman hangs underneath, so `systemctl stop` blocks forever. The spawned uninstall task then never returns Ok or Err, so: - set_uninstall_stage() (after the stop) never fires → progress frozen; - remove_package_state_entry() never runs → entry stranded in `Removing` → ghost in My Apps; - the install guard rejects reinstall with "already Removing". The spawn wrapper already reverts state on Err and removes the entry on Ok — the only failure mode was a hang that returns neither. Bound the teardown so it always terminates: - systemctl stop → QUADLET_STOP_TIMEOUT, escalate to kill+reset-failed on timeout (reuses the existing helpers); - daemon_reload_user() → bounded systemctl_user_status (30s); - defensive `podman rm -f` → wrapped in tokio timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 04:27:02 -04:00
archipelago	2ebcd8f9a8	docs(master-plan): backlog — smart launch-port selection + manifest-driven archival-node blocker §10b: replace per-app static launch-port map with a manifest-first + non-HTTP-port-skipping heuristic (the gitea :2222 class). §10c: generalize the un-pruned/archival Bitcoin install blocker from a hardcoded requires_unpruned_bitcoin() match to a manifest-declared dependency, with a clear pre-install UX. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 03:47:25 -04:00
archipelago	3515344800	docs(master-plan): session h — zombie guard + gitea launch-port fix Banner + §8b: zombie-container guard (0a8db904, live-proven on .228) and gitea launch-port fix (670ebb06) shipped in binary 040df5ce, rolled to the fleet. Logs the mempool env-drift recreate-loop and nostr-rs-relay follow-ups. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 03:41:59 -04:00
archipelago	670ebb0666	fix(launcher): pin Gitea launch URL to web port 3001 (not SSH 2222) Gitea publishes two host ports — SSH on 2222 and the web UI on 3001. The launch URL comes from manifest_lan_address_for() (the manifest's interfaces.main → 3001), but Gitea had no entry in the static lan_address_for() fallback map. On a node where the gitea manifest is absent or stale (no interfaces block), the lookup returns None and the code falls through to extract_lan_address(), which returns whichever port podman lists first — frequently the SSH port. Result: the app launched at :2222 instead of :3001 (observed on tailscale node 100.82.34.38). Add the canonical "gitea" => http://localhost:3001 entry to the static map, matching every other core app, so the web UI is pinned regardless of manifest presence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 03:16:41 -04:00
archipelago	0a8db9044f	fix(orchestrator): recreate zombie "Up" containers whose process is dead podman trusts its own state DB: when a container's conmon dies without podman observing it (cgroup-cascade SIGKILL on archipelago.service restart, a crash), `podman ps` keeps reporting it "Up" long after the process is gone. The reconciler NoOp'd such a zombie forever, so a dead dependency with no published host port never recovered. Observed live on .228 (2026-06-25): netbird-dashboard reported "Up" with a dead State.Pid → its nginx proxy 502'd → NetBird login broke ("Unauthenticated"). The dashboard publishes no host port, so the Running branch had nothing to probe and never recreated it. Add a zombie guard to the Running branch: verify the recorded State.Pid is alive (its /proc entry exists) before trusting "running"; on a concrete dead PID, stop+remove+install_fresh from the manifest. Conservative by design — any uncertainty (inspect failed, PID unparseable) assumes alive, so a transient podman hiccup never destroys a healthy container. Unit test covers live/dead/out-of-range PIDs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 02:25:52 -04:00
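The zombie guard described in 0a8db9044f can be sketched roughly as below. This is a minimal illustration, not the code from the commit: `Liveness`, `pid_liveness`, and the `proc_root` parameter are hypothetical names, and treating a PID of `0` as "uncertain" is an assumption. The only parts taken from the commit message are the rule itself (a recorded PID with no `/proc` entry is dead) and the conservative default (anything unparseable counts as alive).

```rust
use std::path::Path;

/// Outcome of probing a container's recorded PID (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Liveness {
    /// The process exists, or we could not tell; uncertainty counts as alive.
    AssumeAlive,
    /// The PID parsed cleanly and has no entry under the proc root.
    Dead,
}

/// Decide whether a recorded PID is concretely dead.
/// `proc_root` would normally be "/proc"; it is a parameter so the check
/// can be exercised against a scratch directory.
fn pid_liveness(proc_root: &Path, raw_pid: &str) -> Liveness {
    match raw_pid.trim().parse::<u32>() {
        Ok(pid) if pid > 0 => {
            if proc_root.join(pid.to_string()).exists() {
                Liveness::AssumeAlive
            } else {
                Liveness::Dead
            }
        }
        // Unparseable, out of range, or zero (assumption): never recreate.
        _ => Liveness::AssumeAlive,
    }
}
```

Only a `Dead` result would justify the stop, remove, and reinstall path. Every other outcome leaves the container alone, which is what keeps a transient podman hiccup from destroying a healthy container.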
archipelago	43e700498b	fix(android): trust self-signed certs for the user's own node in WebView Node apps (e.g. NetBird on :8087) terminate TLS with a self-signed cert so the dashboard gets a secure context (OIDC / window.crypto.subtle, #15). The WebView's default onReceivedSslError CANCELs untrusted certs, so those apps rendered blank in the companion — exactly the netbird "won't load in the webview" report. Override onReceivedSslError in both WebViewClients (kiosk + in-app browser) to proceed() only when the failing cert's host matches the connected node; reject everything else (no blanket trust). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 18:13:52 -04:00
archipelago	89d397bb74	refactor(netbird): delete legacy Rust installer — #20 ph4 (manifest-driven only) netbird is fully manifest-driven (apps/netbird-/manifest.yml via the signed catalog): install_stack_via_orchestrator renders the 3-member stack with generated_certs (self-signed TLS for the #15 OIDC secure context), base64 generated_secrets, and templated config — and adopts the running stack by live container name. The hardcoded `podman run` fallback was therefore dead code on any node with the embedded catalog (verified live: .228 https:8087 -> 200). Removes the per-app Rust installer anti-pattern the master plan calls out: - install_netbird_stack: orchestrator -> adopt -> bail! (no in-Rust installer) - deletes 6 now-dead helpers (write_netbird_config_files, ensure_netbird_tls_cert, read_or_generate_b64_secret, netbird_net_resolver_ip, detect_netbird_public_host_ip, wait_for_netbird_oidc_ready), 3 NETBIRD__IMAGE consts, unused base64::Engine import - ~485 lines removed; prod_orchestrator doc-comments updated Behavioural parity: the manifest path already executed on the fleet, so this changes no live behavior. The legacy #10 OIDC-readiness wait was already bypassed by the manifest path; if that race resurfaces, add an OIDC-ready gate to the manifest rather than resurrecting the Rust fn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 11:04:01 -04:00
archipelago	41e7f500f8	test(lifecycle): tolerate slow-but-healthy heavy-app recovery under 5x churn The 5x destructive gate on heavy nodes false-failed on transient windows during stack recovery, not real regressions: - immich.bats: lan_address port-publish probe 30s -> 90s. The postgres->redis->server (DB migrations on boot) stack can take >30s to republish :2283 after a churn-induced recreate; destructive-tier immich tests already allow 180-240s. - mempool.bats: orphan-container check now polls to steady state (<=30s) instead of a single-shot count, which caught a recreated member briefly visible alongside its replacement mid-reconcile. - run-gate.sh: settle cap 180s -> 300s and also gate on immich's :2283 when installed, so the next iteration's read-only probe doesn't race a still-recovering stack. Settle returns the instant every probe is green. A genuinely unexposed/orphaned/unhealthy app still fails these checks; they only absorb the transient recreate window under sustained churn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 09:18:34 -04:00
archipelago	a721532f55	feat(orchestrator): desired-state recovery + recreate volume-ownership [UNVALIDATED WIP] NOT yet validated on a node or fleet-deployed — cargo check passes, release build + .228 canary validation pending. Committed as a checkpoint so the work survives. Two fixes the immich .198 incident exposed: Fix A (reconcile_all_with_mode): a previously-running app whose container vanished (e.g. a wedged podman teardown cleared by a reboot) was left absent on boot. Now, when boot reconcile would leave an app 'absent' but it was running at the last running-containers snapshot, recreate it (install_fresh). New crash_recovery::load_last_running_names() reads the snapshot without the PID/crash gate (+2 unit tests). Match is exact on compute_container_name (incl stack members); user-stopped + uninstalled apps are already excluded, so no false positives. Fix B (ensure_bind_mount_dirs): a freshly-created bind dir was left root:root, so a no-data_uid app running as container-root (→ host rootless user) hit EACCES and crash-looped (the exact immich upload-dir failure). Now a newly-created bind dir for a no-data_uid app is chowned via --reference=<parent> to match the rootless data root — no host-uid guessing, only fresh dirs (no regression for existing installs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 09:28:40 -04:00
archipelago	80f49cac1c	fix(ui): backoff remote-relay reconnects + stop cryptpad icon 404 Two console-noise fixes from a live error dump: - remote-relay.ts reconnected on a FIXED 5s interval with no backoff, so when the backend is briefly down it floods the console/network with failed-WS attempts for the whole outage. It's a secondary feature (companion input), so add exponential backoff 1s->30s (mirrors websocket.ts), reset on open/start. - cryptpad's catalog/marketplace entries pointed at a non-existent /assets/img/app-icons/cryptpad.webp -> a 404 on every marketplace render. Point it at the existing default icon (handleImageError swapped to it anyway). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 08:41:04 -04:00
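The reconnect policy from 80f49cac1c (exponential backoff from 1s up to a 30s cap, reset on open/start) can be sketched as follows. The real change lives in TypeScript (remote-relay.ts); this version is in Rust only to match the rest of the repository. The `Backoff` type, its method names, and the doubling factor are assumptions for illustration; the commit message specifies only the 1s and 30s bounds.

```rust
const BASE_DELAY_MS: u64 = 1_000; // first retry after 1s
const MAX_DELAY_MS: u64 = 30_000; // never wait longer than 30s

/// Tracks consecutive failed reconnect attempts (hypothetical type).
struct Backoff {
    attempt: u32,
}

impl Backoff {
    fn new() -> Self {
        Backoff { attempt: 0 }
    }

    /// Delay to wait before the next attempt: 1s, 2s, 4s, ... capped at 30s.
    fn next_delay_ms(&mut self) -> u64 {
        // checked_shl yields None once the shift would overflow 64 bits.
        let factor = 1u64.checked_shl(self.attempt).unwrap_or(u64::MAX);
        let delay = BASE_DELAY_MS.saturating_mul(factor).min(MAX_DELAY_MS);
        self.attempt = self.attempt.saturating_add(1);
        delay
    }

    /// Call when the socket opens (or the relay is restarted).
    fn reset(&mut self) {
        self.attempt = 0;
    }
}
```

Compared with a fixed 5s interval, a long outage settles at one attempt every 30s, while recovery from a brief blip still begins within a second.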
archipelago	2d8ade629b	fix(ui): log global errors silently instead of popping a toast + overlay The global error handler (Vue errorHandler + window error + unhandledrejection) fired a red 'Something went wrong: <raw msg>' toast AND an auto on-device overlay on every caught error — deliberately loud for bug-bash, but it surfaces benign, non-actionable noise (e.g. a transient RPC rejection during a ws reconnect, or the service worker failing to register over a self-signed cert) right in the user's face. Demote the catch-all to SILENT capture: keep console.error + the window.__archyErrors ring buffer, and expose the screenshot-able overlay on-demand via window.__archyShowErrors() — but never auto-pop. Components that need to report a specific, actionable failure still call toast.error() directly. Also filter known-benign environmental noise (PWA service-worker registration failing over a self-signed cert — needs a trusted cert, #56) so it doesn't even occupy a ring-buffer slot and push out real errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 05:55:49 -04:00
archipelago	0406af522c	test(lifecycle): add manifest-driven all-apps health matrix The per-app suites cover ~8 core apps in depth; nothing covered the ~30 others (jellyfin, vaultwarden, penpot, nextcloud, grafana, …). all-apps-matrix.bats derives the app set from server.get-state package-data (no hardcoded list) and asserts baseline health across EVERY installed app: - settles to a non-transitional state within a window (the #13/#14 stuck-ghost class, generalized fleet-wide — installing/removing that never settles) - not in error/failed - reports a recognized (non-garbage) state - every running UI app (manifest ui=="true") exposes a non-null lan-address (the immich/port-drift unreachable-UI failure, generalized to all UI apps) Read-only, so it joins run.sh/run-gate.sh on every node and grows coverage as nodes install more apps. Verified 5/5 on .228 (17 apps) and .116 (20 apps). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 05:27:10 -04:00

1470 Commits