lfg2025/archy

Author	SHA1	Message	Date
archipelago	169ff2e2cd	fix(bitcoin): knots catalog default must equal top-level version The knots versions[] marked 29.3.knots20260508 as default while the top-level catalog version is the floating 'latest' tag — violating the generator's own invariant (default:true MUST equal the top-level version so selecting it un-pins / tracks latest). Live effect via package.versions: catalog_default_version='latest' so the UI-highlighted default actually PINS+recreates (opposite of un-pin) and 'latest' was unreachable from the Version & Updates card. Add a 'latest' default entry (== the manifest's floating tag) and keep 29.3.knots20260508 as a pinnable option. Verified on .228: package.versions now returns default=latest with 2 selectable versions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 19:56:49 -04:00
archipelago	da20f67462	Merge bitcoin-multi-version: multi-version support for Core & Knots Integrate the bitcoin-multi-version feature (commit 6aa74c73): per-node choice/pin/switch of Bitcoin Core & Knots versions with auto-update toggle — catalog versions[] schema, install-time selection, package.versions + package.set-config RPCs, hourly per-app auto-update tick, build-bitcoin-image.sh (GPG+SHA verified rootless image builder), and UI (version select + Version & Updates card). Catalog regenerated; preserves the mempool 127.0.0.1 health fix. Not yet live-verified on .228 — gate any tagged release on that per CLAUDE.md.	2026-06-28 18:48:38 -04:00
archipelago	6aa74c7386	feat(bitcoin): multi-version support for Core & Knots (install/switch/pin/auto-update) Lets a node runner choose which Bitcoin Core / Knots version to install (latest pre-selected), then switch, pin, or opt into auto-update from the app's interface — all manifest/catalog-driven, rootless, signed-registry, zero-data-loss. Motivated by upcoming BIP-110 signalling: runners need a real choice of software version. Backend: - version_config.rs: per-app pin + auto-update persistence (atomic, merge-preserving), downgrade detection, auto-update enumeration (+ unit tests). - app_catalog.rs: CatalogVersion / versions[] schema, catalog_versions(), catalog_image_for_version() (same-repo guard); a pin suppresses the update badge. - prod_orchestrator.rs: pinned version wins over the catalog default on every install/recreate. - install.rs: install-time `version` param persisted (default = unpinned). - set_config.rs: package.versions (read) + package.set-config (write) RPCs; downgrade is gated behind explicit confirm (warn + confirm + allow). - update.rs/main.rs: hourly per-app auto-update tick via the orchestrator (opt-in, pin-respecting); fix handle_package_update to be non-fatal for orchestrator-managed apps lacking a catalog primary image (bitcoin-core). UI: - MarketplaceAppDetails.vue: install-time version selector (shown when an app offers >=2 versions). - appDetails/AppSidebar.vue: "Version & Updates" card (switch / pin / auto-update toggle / downgrade warning), per app. - rpc-client.ts + en.json: RPC methods, types, strings. Phase 0 image pipeline: - scripts/build-bitcoin-image.sh: download official tarball + SHA256SUMS(.asc), verify SHA-256 + pinned-maintainer OpenPGP signature (fail-closed), build a minimal rootless image, smoke-test, tag + push. - apps/bitcoin-core/Dockerfile rewritten (drops stale community base); apps/bitcoin-knots/Dockerfile added. - generate-app-catalog.sh: emit curated versions[]; published + catalog now offers Core 25.2/26.2/27.2/28.4/29.3/30.2/31.0 + Knots 29.3.knots20260508. docs/bitcoin-multi-version-design.md: live progress tracker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 18:46:17 -04:00
archipelago	3cea7dd6c5	test(phase3): fix Phase-3 quadlet gates — define fail(), drop stale Notify=healthy assert Two Phase-3 bats suites used `fail` (a bats-assert helper) but bats-assert isn't installed on the alpha fleet (only bats-core), so every tripped assertion crashed with `fail: command not found` (status 127) instead of reporting a real pass/fail. Define the same minimal `fail() { echo ...; return 1; }` the other suites already use (see mempool.bats). Without this the gates were silently non-functional. Also rewrite the obsolete "HealthCmd= implies Notify=healthy" assertion in use-quadlet-backends-install.bats. Phase 3.4's Notify=healthy was deliberately reverted: gating `systemctl start` on health hung boot reconciliation for dependency-waiting apps (fedimint idles until Bitcoin IBD; lnd until macaroon unlock), leaving units stuck "deactivating". The renderer now emits HealthCmd= for Podman's health state but TimeoutStartSec=0 and NO Notify=healthy (quadlet.rs render() + contains_stale_health_gate()). The test now asserts the current invariant: no backend unit gates start on health. Verified on the .228 canary node (ARCHIPELAGO_USE_QUADLET_BACKENDS=1): use-quadlet-backends-install 6/6, backend-survives-archipelago-restart 3/3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 16:09:05 -04:00
archipelago	d7c6f8c348	fix(mempool): health-check 127.0.0.1 not localhost (stops false-unhealthy loop) The archy-mempool-web health_check endpoint used http://localhost:8080. Inside the frontend image, wget resolves `localhost` to ::1 (IPv6) first, but nginx binds 0.0.0.0:8080 (IPv4) only -> the baked HealthCmd gets "connection refused" every probe -> container is perpetually unhealthy -> the reconciler recreates it forever (observed on .228: mempool container re-Started every ~3 min, Health=unhealthy). Proven live: in-container `wget http://localhost:8080/` = refused, `wget http://127.0.0.1:8080/` = OK. Pin the probe to 127.0.0.1 so it matches nginx's IPv4 bind. Updated both the source manifest and the embedded copy in releases/app-catalog.json (the catalog overlay wins over the disk manifest on fleet nodes, so the catalog copy is the one that actually reaches .228). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 15:09:34 -04:00
archipelago	83344b9f3a	fix(orchestrator): drop legacy mempool umbrella manifest on catalog-driven nodes The split-mempool-stack guard that skips the legacy monolithic `mempool` manifest (whose container collides with its split-stack frontend member `archy-mempool-web`) only ran over DISK manifests. On catalog-driven nodes (no disk manifests — e.g. the Phase-3/registry-manifest path), the legacy `mempool` manifest arrives via the registry-catalog overlay AFTER that guard, so both `mempool` and `archy-mempool-web` end up owning container `mempool` and rewrite+restart each other forever ("port binding drift" / "network alias drift" loop observed on .228, leaving mempool down). Enforce the guard once more over the merged (disk + catalog) manifest set: drop the `mempool` umbrella whenever all three split members are present. Installing `mempool` assembles the split stack, so `archy-mempool-web` owns the frontend container either way. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 14:04:41 -04:00
archipelago	05c22b6085	fix(mempool): correct frontend container port 4080->8080 (stops restart loop) The mempool manifest + embedded catalog declared the frontend container port as 4080, but mempool-frontend nginx listens on 8080 (the stack creates it as -p 4080:8080 with FRONTEND_HTTP_PORT=8080, see api/rpc/package/stacks.rs). So every reconcile rendered the quadlet as PublishPort=4080:4080, disagreed with the working 4080:8080 container, and restarted it ("port binding drift" -> "host port 4080 did not become reachable within 5s" -> "host listener disappeared; restarting") in a perpetual loop on .228. Correcting the manifest container port to 8080 makes the rendered quadlet match reality so the drift/restart loop stops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 13:49:54 -04:00
archipelago	6734947c3e	fix(fmcd): cap CPU + watchdog-restart the iroh relay hot-loop On NAT'd nodes that can reach the iroh federation neither directly nor via iroh's public relays, fmcd's embedded iroh networking enters a relay/hole-punch reconnect hot-loop that pegs its entire CPU allotment indefinitely (observed ~1 core sustained for 4 days on a Tailscale node, while LAN nodes that reach the guardian directly stay <3%). fmcd 0.8.0 exposes no iroh/relay knobs, so: - fmcd-run now samples fmcd's own CPU and restarts it when it stays near its allotment for ~15 min (a restart demonstrably clears the stuck iroh state; real work is bursty and never flat-pegs a core for minutes). - Lower cpu_limit 1 -> 0.25 core so a stuck instance can't starve the node (steady-state is <3% of a core; joins are brief). Ships as fmcd:0.8.1 (launcher-only rebuild, same fmcd binary). Bumped the image pin + cpu_limit in the manifest, image-versions.sh, the embedded catalog manifest (releases/app-catalog.json), and the UI catalogs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 12:19:27 -04:00
archipelago	4519dbf04f	fix(orchestrator): render manifest certs on the adopted-running reconcile path WS-F #10: a netbird reinstall that adopts a leftover running container skipped ensure_manifest_certs, so when its data dir was wiped the self-signed tls.crt/key were never regenerated; the next nginx.conf rewrite + restart then died on the missing cert (proxy 502, login broken). The Running branch of ensure_running_with_mode now calls ensure_manifest_certs before ensure_manifest_files, mirroring prepare_for_start's certs-before-files ordering. Idempotent: a no-op when crt+key already exist. Live-validated on .228: deleted netbird tls.crt/key under a Running container; reconciler regenerated a fresh CN=<host_ip> self-signed cert (1000:1000), https :8087 = 200. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 17:49:50 -04:00
archipelago	a38c9d5f29	docs(master-plan): §10d Meshtastic MeshCore-parity status (one open received-msg bug) Region (EU_868) + shared channel "archipelago" auto-provisioning shipped in 8fdb45e8 and riding the rolled #9 fleet binary (0060dcd6). Discovery, RF, and sending verified on .116+.228; the one open blocker is the running driver not surfacing received messages. Slotted after WS-F #9–11. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 04:53:06 -04:00
archipelago	f9a6ae3f32	feat(mesh): Meshtastic region + shared-channel auto-provisioning (MeshCore parity) Fresh Meshtastic radios ship region-UNSET (RF-silent) and on mismatched channels, so nodes only ever saw themselves. Bring them to MeshCore parity using the official Meshtastic admin API: - Auto-provision LoRa region (set_config, AdminMessage field 34) from a new mesh-config `lora_region` (e.g. EU_868) when the radio's region differs. - Auto-provision a shared primary channel (set_channel, field 33) with a PSK derived deterministically from channel_name, so every node converges on one mesh — the parity equivalent of MeshCore's named "archipelago" channel. - Read current region/channel from want_config; only write when different (no reboot loop); cap attempts so a radio that won't persist can't loop. - Active NodeInfo advert scaffolding + aggressive serial drain. Verified on .116+.228: region+channel persist, discovery works (both see each other as named reachable contacts), bidirectional RF + sending confirmed. Receiving in the running driver is still under diagnosis (instrumentation added). Also removes the unwanted `meshtastic` daemon app from the registry (it was never meant to be a container — native driver provides system-level support): deletes apps/meshtastic + catalog entries (app-catalog, neode-ui, releases) + test refs. Meshtastic stays native, like MeshCore. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 04:46:35 -04:00
archipelago	fd3a4ee4ef	fix(orchestrator): chown the whole fresh bind subtree, not just the leaf ensure_bind_mount_dirs chowned a freshly-created no-data_uid bind dir with --reference={immediate_parent}. For a NESTED bind source like jellyfin's /var/lib/archipelago/jellyfin/config (or netbird's .../netbird/data), `mkdir -p` creates the intermediate <app> dir root:root too, so referencing the immediate parent just copied ROOT, leaving the dir unwritable and the app EACCES-crash-looping on reinstall (found by the all-apps-lifecycle pass: jellyfin "/config/log denied" exit 139; netbird-server "unable to open database file"). It only ever worked for direct children of the data root (immich). Fix: anchor to the nearest PRE-EXISTING ancestor (the rootless data root, owned by the service user) and chown -R the entire newly-created subtree to it. Extracted the walk into fresh_subtree_anchor() with a unit test covering nested / direct / second-volume cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 04:46:35 -04:00
Dorian	38d2bbf570	chore(android): update companion APK download [skip ci]	2026-06-26 13:08:37 +01:00
Dorian	a90fea80ed	feat(android): edit server entries from in-app settings menu (NESMenu); bump to 0.4.12 (vc16) The 0.4.11 edit affordance only lived on ServerConnectScreen, which a connected user never sees. Add edit to NESMenu — the settings modal reached via two-finger hold while connected: a ✎ pencil on each saved server opens the form pre-populated (Edit Server header + Cancel), persists via ServerPreferences.updateSavedServer(), and reconnects when the edited server is the live one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 13:08:18 +01:00
Dorian	389e602097	chore(android): update companion APK download [skip ci]	2026-06-26 12:54:52 +01:00
Dorian	5677f9cca1	feat(android): edit saved server entries; bump companion to 0.4.11 (vc15) Add an edit affordance to each saved server in ServerConnectScreen: a pencil button loads the entry into the form (Edit Server mode) with Save Changes / Cancel actions. Persisted via a new ServerPreferences.updateSavedServer() that replaces by connection identity (address/port/scheme) and keeps the active record in sync when the edited server is the active one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 12:54:07 +01:00
archipelago	fc64b422e7	docs(master-plan): WS-F#3 first destructive run — 3 reinstall bugs found Full all-apps-lifecycle pass on .228: lifecycle 11/11, teardown 8/11. Surfaced (1) fresh-install bind-dir ownership root:root → reinstall EACCES (jellyfin/netbird; Fix B misses the install path), (2) netbird reinstall adopts leftover containers → skips manifest cert/file render, (3) portainer image pin lfg2025/portainer:2.19.4 unpublished (manifest unknown), pin overrides RPC dockerImage. .228 restored. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 07:47:24 -04:00
Dorian	07b9b5a3aa	docs(android): companion release + App-Not-Installed runbook Capture the 2026-06-26 lessons durably: ship via the hardened publish script only, v1+v2+v3 signing is enforced by apksigner (AGP ignores enableV1Signing at minSdk>=24), diagnose install failures with adb install FIRST, signature-key changes force a one-time uninstall, and keep all phone/adb work scoped to com.archipelago.app.debug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 12:21:48 +01:00
Dorian	ac59771560	fix(android): force v1+v2+v3 signing & clean-build guards in companion publish The published companion APK was v2-only (AGP silently ignores enableV1Signing for minSdk>=24) and clean builds broke on stray space-named resource dirs. Harden scripts/publish-companion-apk.sh: clean build, remove/reject space-named res dirs, force v1+v2+v3 via zipalign+apksigner, and abort unless all three schemes verify. Wire ship-companion.sh to the shared script. Re-sign the served 0.4.10 APK. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 11:53:25 +01:00
Dorian	d1f9e9ce88	chore(android): update companion apk download	2026-06-26 11:32:00 +01:00
Dorian	58847fc3d7	chore(android): bump companion to 0.4.10 (versionCode 14) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 11:31:36 +01:00
archipelago	a3e09eab57	docs(master-plan): WS-F#3 — destructive all-apps lifecycle matrix landed (43934eef) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:29:51 -04:00
archipelago	43934eefa5	test(gate): destructive all-apps lifecycle matrix (WS-F#3) Active counterpart to the read-only all-apps-matrix.bats: drives stop/start/restart for every installed app and, under ARCHY_ALLOW_CASCADE_DESTRUCTIVE, a FULL teardown (uninstall → no-ghost → reinstall) — the broad coverage F needs beyond the ~8 core suites. App set is discovered from My Apps ∩ the node catalog; reinstall spec comes from catalog.json {dockerImage, containerConfig}. PROTECTED by default (never cycled or torn down): bitcoin/electrum (expensive resync) AND lnd/btcpay/fedimint (teardown = irreversible wallet/channel/guardian loss). The user asked to protect only bitcoin+electrum; the wallet apps are added for safety and can be removed via ARCHY_MATRIX_PROTECT. Heavy + destructive → a supervised pass, not folded into run-gate. Validated on .228: discovery excludes the 6 protected installed apps; lifecycle tier cycles a single app (botfights) stop/start/restart green; teardown gated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:29:22 -04:00
archipelago	80146f4476	docs(master-plan): WS-F#2 — uninstall progress bar made truthful (9f17ba68) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:15:11 -04:00
archipelago	9f17ba6867	fix(ui): truthful uninstall progress bar (was a solid full-red block) AppCard's uninstall bar was hardcoded `w-full bg-red-400/60 animate-pulse` — a solid, full-width, red, fake-pulsing block that never moved and read as an error, no matter the actual teardown progress (the install bar, by contrast, renders a real percentage). Derive a truthful percentage from the backend's existing `uninstall-stage` label — "Stopping containers (X/N)" → 10–50%, "Cleaning up volumes" → 70%, "Removing app data" → 90% — and render it exactly like install: neutral fill, real width + percent, shimmer (not a fake pulse) carrying motion when a stage has no number. Frontend-only; the backend already broadcasts these stages. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:04:48 -04:00
archipelago	67426c0d41	docs(master-plan): cascade tier wired into the gate (b7d92107) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 05:24:07 -04:00
archipelago	b7d9210784	test(gate): optional ARCHY_GATE_CASCADE pass — wire the cascade tier in run-gate.sh ran only the DESTRUCTIVE tier; the cascade-uninstall suite (uninstall→no-ghost→reinstall, the #13/#14/uninstall-hang regression guard) existed but was never enabled by the gate. Add an opt-in single cascade pass after the 5× loop (ARCHY_GATE_CASCADE=1, requires ARCHY_ALLOW_DESTRUCTIVE=1), counted into the pass/fail tally. Kept out of the 5× loop deliberately — uninstall/reinstall every iteration would balloon runtime and re-pull images; one pass guards the class. Default gate behavior unchanged. Validated: cascade-uninstall.bats 7/7 on .228. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 05:22:45 -04:00
archipelago	292a2650df	docs(master-plan): WS-F — uninstall-hang root cause fixed + cascade validated Workstream F now in-progress: the immich/grafana uninstall hang → ghost/stuck-bar/reinstall-block is root-caused (unbounded systemctl/podman in quadlet::disable_remove) and fixed (71cc9ac4); cascade-uninstall.bats 7/7 on .228. Records the remaining F items + the pending gate-wiring decision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 05:18:39 -04:00
archipelago	71cc9ac46a	fix(uninstall): bound systemctl/podman teardown so uninstall can't hang Uninstalling immich/grafana could hang with a frozen full-red progress bar, leave a ghost entry stuck in My Apps, and then refuse reinstall. Single root cause: quadlet::disable_remove() — called first in the uninstall task (via companion + orchestrator teardown) — ran `systemctl --user stop`, daemon-reload, and `podman rm -f` with NO timeout. On rootless podman a generated unit can wedge in "deactivating" while podman hangs underneath, so `systemctl stop` blocks forever. The spawned uninstall task then never returns Ok or Err, so: - set_uninstall_stage() (after the stop) never fires → progress frozen; - remove_package_state_entry() never runs → entry stranded in `Removing` → ghost in My Apps; - the install guard rejects reinstall with "already Removing". The spawn wrapper already reverts state on Err and removes the entry on Ok — the only failure mode was a hang that returns neither. Bound the teardown so it always terminates: - systemctl stop → QUADLET_STOP_TIMEOUT, escalate to kill+reset-failed on timeout (reuses the existing helpers); - daemon_reload_user() → bounded systemctl_user_status (30s); - defensive `podman rm -f` → wrapped in tokio timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 04:27:02 -04:00
archipelago	2ebcd8f9a8	docs(master-plan): backlog — smart launch-port selection + manifest-driven archival-node blocker §10b: replace per-app static launch-port map with a manifest-first + non-HTTP-port-skipping heuristic (the gitea :2222 class). §10c: generalize the un-pruned/archival Bitcoin install blocker from a hardcoded requires_unpruned_bitcoin() match to a manifest-declared dependency, with a clear pre-install UX. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 03:47:25 -04:00
archipelago	3515344800	docs(master-plan): session h — zombie guard + gitea launch-port fix Banner + §8b: zombie-container guard (0a8db904, live-proven on .228) and gitea launch-port fix (670ebb06) shipped in binary 040df5ce, rolled to the fleet. Logs the mempool env-drift recreate-loop and nostr-rs-relay follow-ups. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 03:41:59 -04:00
archipelago	670ebb0666	fix(launcher): pin Gitea launch URL to web port 3001 (not SSH 2222) Gitea publishes two host ports — SSH on 2222 and the web UI on 3001. The launch URL comes from manifest_lan_address_for() (the manifest's interfaces.main → 3001), but Gitea had no entry in the static lan_address_for() fallback map. On a node where the gitea manifest is absent or stale (no interfaces block), the lookup returns None and the code falls through to extract_lan_address(), which returns whichever port podman lists first — frequently the SSH port. Result: the app launched at :2222 instead of :3001 (observed on tailscale node 100.82.34.38). Add the canonical "gitea" => http://localhost:3001 entry to the static map, matching every other core app, so the web UI is pinned regardless of manifest presence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 03:16:41 -04:00
archipelago	0a8db9044f	fix(orchestrator): recreate zombie "Up" containers whose process is dead podman trusts its own state DB: when a container's conmon dies without podman observing it (cgroup-cascade SIGKILL on archipelago.service restart, a crash), `podman ps` keeps reporting it "Up" long after the process is gone. The reconciler NoOp'd such a zombie forever, so a dead dependency with no published host port never recovered. Observed live on .228 (2026-06-25): netbird-dashboard reported "Up" with a dead State.Pid → its nginx proxy 502'd → NetBird login broke ("Unauthenticated"). The dashboard publishes no host port, so the Running branch had nothing to probe and never recreated it. Add a zombie guard to the Running branch: verify the recorded State.Pid is alive (its /proc entry exists) before trusting "running"; on a concrete dead PID, stop+remove+install_fresh from the manifest. Conservative by design — any uncertainty (inspect failed, PID unparseable) assumes alive, so a transient podman hiccup never destroys a healthy container. Unit test covers live/dead/out-of-range PIDs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 02:25:52 -04:00
archipelago	43e700498b	fix(android): trust self-signed certs for the user's own node in WebView Node apps (e.g. NetBird on :8087) terminate TLS with a self-signed cert so the dashboard gets a secure context (OIDC / window.crypto.subtle, #15). The WebView's default onReceivedSslError CANCELs untrusted certs, so those apps rendered blank in the companion — exactly the netbird "won't load in the webview" report. Override onReceivedSslError in both WebViewClients (kiosk + in-app browser) to proceed() only when the failing cert's host matches the connected node; reject everything else (no blanket trust). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 18:13:52 -04:00
archipelago	89d397bb74	refactor(netbird): delete legacy Rust installer — #20 ph4 (manifest-driven only) netbird is fully manifest-driven (apps/netbird-/manifest.yml via the signed catalog): install_stack_via_orchestrator renders the 3-member stack with generated_certs (self-signed TLS for the #15 OIDC secure context), base64 generated_secrets, and templated config — and adopts the running stack by live container name. The hardcoded `podman run` fallback was therefore dead code on any node with the embedded catalog (verified live: .228 https:8087 -> 200). Removes the per-app Rust installer anti-pattern the master plan calls out: - install_netbird_stack: orchestrator -> adopt -> bail! (no in-Rust installer) - deletes 6 now-dead helpers (write_netbird_config_files, ensure_netbird_tls_cert, read_or_generate_b64_secret, netbird_net_resolver_ip, detect_netbird_public_host_ip, wait_for_netbird_oidc_ready), 3 NETBIRD__IMAGE consts, unused base64::Engine import - ~485 lines removed; prod_orchestrator doc-comments updated Behavioural parity: the manifest path already executed on the fleet, so this changes no live behavior. The legacy #10 OIDC-readiness wait was already bypassed by the manifest path; if that race resurfaces, add an OIDC-ready gate to the manifest rather than resurrecting the Rust fn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 11:04:01 -04:00
archipelago	41e7f500f8	test(lifecycle): tolerate slow-but-healthy heavy-app recovery under 5x churn The 5x destructive gate on heavy nodes false-failed on transient windows during stack recovery, not real regressions: - immich.bats: lan_address port-publish probe 30s -> 90s. The postgres->redis->server (DB migrations on boot) stack can take >30s to republish :2283 after a churn-induced recreate; destructive-tier immich tests already allow 180-240s. - mempool.bats: orphan-container check now polls to steady state (<=30s) instead of a single-shot count, which caught a recreated member briefly visible alongside its replacement mid-reconcile. - run-gate.sh: settle cap 180s -> 300s and also gate on immich's :2283 when installed, so the next iteration's read-only probe doesn't race a still-recovering stack. Settle returns the instant every probe is green. A genuinely unexposed/orphaned/unhealthy app still fails these checks; they only absorb the transient recreate window under sustained churn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 09:18:34 -04:00
archipelago	a721532f55	feat(orchestrator): desired-state recovery + recreate volume-ownership [UNVALIDATED WIP] NOT yet validated on a node or fleet-deployed — cargo check passes, release build + .228 canary validation pending. Committed as a checkpoint so the work survives. Two fixes the immich .198 incident exposed: Fix A (reconcile_all_with_mode): a previously-running app whose container vanished (e.g. a wedged podman teardown cleared by a reboot) was left absent on boot. Now, when boot reconcile would leave an app 'absent' but it was running at the last running-containers snapshot, recreate it (install_fresh). New crash_recovery::load_last_running_names() reads the snapshot without the PID/crash gate (+2 unit tests). Match is exact on compute_container_name (incl stack members); user-stopped + uninstalled apps are already excluded, so no false positives. Fix B (ensure_bind_mount_dirs): a freshly-created bind dir was left root:root, so a no-data_uid app running as container-root (→ host rootless user) hit EACCES and crash-looped (the exact immich upload-dir failure). Now a newly-created bind dir for a no-data_uid app is chowned via --reference=<parent> to match the rootless data root — no host-uid guessing, only fresh dirs (no regression for existing installs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 09:28:40 -04:00
archipelago	80f49cac1c	fix(ui): backoff remote-relay reconnects + stop cryptpad icon 404 Two console-noise fixes from a live error dump: - remote-relay.ts reconnected on a FIXED 5s interval with no backoff, so when the backend is briefly down it floods the console/network with failed-WS attempts for the whole outage. It's a secondary feature (companion input), so add exponential backoff 1s->30s (mirrors websocket.ts), reset on open/start. - cryptpad's catalog/marketplace entries pointed at a non-existent /assets/img/app-icons/cryptpad.webp -> a 404 on every marketplace render. Point it at the existing default icon (handleImageError swapped to it anyway). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 08:41:04 -04:00
archipelago	2d8ade629b	fix(ui): log global errors silently instead of popping a toast + overlay The global error handler (Vue errorHandler + window error + unhandledrejection) fired a red 'Something went wrong: <raw msg>' toast AND an auto on-device overlay on every caught error — deliberately loud for bug-bash, but it surfaces benign, non-actionable noise (e.g. a transient RPC rejection during a ws reconnect, or the service worker failing to register over a self-signed cert) right in the user's face. Demote the catch-all to SILENT capture: keep console.error + the window.__archyErrors ring buffer, and expose the screenshot-able overlay on-demand via window.__archyShowErrors() — but never auto-pop. Components that need to report a specific, actionable failure still call toast.error() directly. Also filter known-benign environmental noise (PWA service-worker registration failing over a self-signed cert — needs a trusted cert, #56) so it doesn't even occupy a ring-buffer slot and push out real errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 05:55:49 -04:00
archipelago	0406af522c	test(lifecycle): add manifest-driven all-apps health matrix The per-app suites cover ~8 core apps in depth; nothing covered the ~30 others (jellyfin, vaultwarden, penpot, nextcloud, grafana, …). all-apps-matrix.bats derives the app set from server.get-state package-data (no hardcoded list) and asserts baseline health across EVERY installed app: - settles to a non-transitional state within a window (the #13/#14 stuck-ghost class, generalized fleet-wide — installing/removing that never settles) - not in error/failed - reports a recognized (non-garbage) state - every running UI app (manifest ui=="true") exposes a non-null lan-address (the immich/port-drift unreachable-UI failure, generalized to all UI apps) Read-only, so it joins run.sh/run-gate.sh on every node and grows coverage as nodes install more apps. Verified 5/5 on .228 (17 apps) and .116 (20 apps). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 05:27:10 -04:00
archipelago	57a69257c4	test(lifecycle): add CASCADE uninstall/reinstall tier (guards #13 ghost, #14 reinstall) The 5x gate is DESTRUCTIVE-only and never exercised uninstall/reinstall — where the worst field bugs lived (#13 app ghosting in My Apps after uninstall, #14 reinstall stalling on stale state). New cascade-uninstall.bats drives the full teardown path on a throwaway app (default grafana, precondition-skips if already installed so it can't destroy real data) and asserts: - fresh install reaches running via a truthful, non-silent progression - uninstall makes the entry DISAPPEAR from server.get-state package-data (the literal My Apps map) — no ghost, no stuck uninstall stage - container + (on-node) data dir are gone - reinstall returns to running - node left as found Opt-in via ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1; not yet folded into the canonical gate. Verified 7/7 against .228. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 05:13:53 -04:00
archipelago	d1cd42c821	fix(orchestrator): stop retrying unrepairable volume chowns every reconcile ensure_running_container_ownership re-probed and re-attempted the in-container chown on every reconcile pass. For a mount that can't be re-owned from inside the userns (observed: mempool-api /data -> 'Operation not permitted'), this burned CPU and logged a WARN on every pass, forever (~6x/30min on .228/.116). Remember hard chown failures in a process-lifetime set keyed by (container-id, dest) and skip the probe+chown for known-unrepairable mounts. Keyed by Id (not name) so a recreated container gets a fresh repair attempt. Verified on .116: one recorded failure at startup, then silent across subsequent reconciles. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 04:58:57 -04:00
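The failure memo described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `ChownFailureMemo`, `RepairOutcome`, and `repair_mount` are hypothetical names, the chown is passed in as a closure, and every error is treated as a hard failure for simplicity. In the real orchestrator the set would presumably live in process-lifetime state rather than a local value.

```rust
use std::collections::HashSet;

/// Hypothetical memo of mounts whose in-container chown failed hard.
/// Keyed by (container id, mount destination), as the commit describes,
/// so a recreated container (new id) gets a fresh repair attempt.
#[derive(Default)]
pub struct ChownFailureMemo {
    failed: HashSet<(String, String)>,
}

impl ChownFailureMemo {
    /// True if this exact (container id, dest) pair already failed once.
    pub fn is_known_unrepairable(&self, container_id: &str, dest: &str) -> bool {
        self.failed
            .contains(&(container_id.to_owned(), dest.to_owned()))
    }

    /// Remember a hard failure; returns true if it was newly recorded.
    pub fn record_failure(&mut self, container_id: &str, dest: &str) -> bool {
        self.failed
            .insert((container_id.to_owned(), dest.to_owned()))
    }
}

/// What one reconcile pass did for a single mount (illustrative only).
#[derive(Debug, PartialEq)]
pub enum RepairOutcome {
    Repaired,
    Skipped,
    FailedAndRecorded,
}

/// Attempt the chown unless the mount is already known to be unrepairable.
/// Assumption: any Err from `chown` counts as a hard failure.
pub fn repair_mount<F>(
    memo: &mut ChownFailureMemo,
    container_id: &str,
    dest: &str,
    chown: F,
) -> RepairOutcome
where
    F: FnOnce(&str, &str) -> Result<(), String>,
{
    if memo.is_known_unrepairable(container_id, dest) {
        // No probe, no chown, no WARN on this pass.
        return RepairOutcome::Skipped;
    }
    match chown(container_id, dest) {
        Ok(()) => RepairOutcome::Repaired,
        Err(_) => {
            memo.record_failure(container_id, dest);
            RepairOutcome::FailedAndRecorded
        }
    }
}
```

Keying on the container id rather than the name is what lets a recreated container retry: the new id is simply absent from the set.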
archipelago	3e3016f2bd	fix(ui): debounce connection-lost banner so transient ws blips don't flash The reconnect banner showed 'Connection lost'/'Reconnecting' instantly on every socket close, even ones that recover in 100ms-2s (load spikes, Tailscale/relay TCP resets). On a healthy node the drops are brief and self-healing, but each one flashed a jarring banner, reading as constant instability. Debounce the transient banner by 2.5s: only surface after the connection issue persists past the grace window; hide immediately on recovery. Deliberate server lifecycle transitions (restart/shutdown) bypass the debounce and still show at once. A genuine persistent outage keeps isOffline true and surfaces after 2.5s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 04:58:54 -04:00
archipelago	7d89b4d8b2	chore(registry): publish embedded app-catalog.json (52 manifests) for fleet fetch Force-add the gitignored releases/app-catalog.json so nodes resolve 146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/app-catalog.json (currently HTTP 404 → disk-manifest fallback). Embedded-manifest delivery is default-on; origin-wins overlay with disk as fallback. Unsigned (migration window accepts unsigned). Includes netbird x3 manifests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:45:31 -04:00
archipelago	15f65428b8	docs(master-plan): §8b — uninstall fix deployed+live-verifying, #15 guardian resolved Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 18:07:41 -04:00
archipelago	36015a19fe	docs(master-plan): §8b session-b state — connection-lost+netbird+UX-merge shipped to .228, uninstall ghost fix, workstream F in progress Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 15:26:17 -04:00
archipelago	e57514b690	fix(uninstall): never ghost a removed app in My Apps on cleanup residue handle_package_uninstall lumped every teardown failure into one `errors` vec and returned Err on any of them BEFORE removing the package state entry — so a non-fatal cleanup hiccup (a slow/failed `sudo rm -rf` of a large data dir, a volume/network removal) left the app's containers gone but its entry in package_data → a ghost in My Apps, and the spawned task reverted it to Installed. Split the failures: container removal that even force-rm can't complete (app genuinely still present) keeps the entry + returns Err; everything after the containers are gone is best-effort. Remove the state entry as soon as the containers are gone — BEFORE the slow volume/data teardown — so My Apps updates immediately and residue can never ghost the app. set_uninstall_stage is a no-op once the entry is gone (if-let guard), so the later stages don't re-create it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 15:23:16 -04:00
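The fatal/best-effort split in the uninstall commit above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual `handle_package_uninstall`: `uninstall_app` and `UninstallOutcome` are hypothetical names, package state is modelled as a plain `HashMap`, and the teardown steps are passed in as closures.

```rust
use std::collections::HashMap;

/// Result of an uninstall whose containers were removed (illustrative).
#[derive(Debug, PartialEq)]
pub enum UninstallOutcome {
    /// Containers and residue (volumes, data dir, networks) all cleaned up.
    Clean,
    /// Containers are gone and the entry is removed, but cleanup left residue.
    RemovedWithResidue(String),
}

/// Hypothetical uninstall flow:
/// 1. container removal is the only fatal step (entry kept, Err returned);
/// 2. the state entry is dropped as soon as the containers are gone;
/// 3. everything afterwards is best-effort and can never ghost the app.
pub fn uninstall_app<C, R>(
    package_data: &mut HashMap<String, String>,
    app_id: &str,
    remove_containers: C,
    cleanup_residue: R,
) -> Result<UninstallOutcome, String>
where
    C: FnOnce() -> Result<(), String>,
    R: FnOnce() -> Result<(), String>,
{
    // Fatal: if even force-removal fails, the app is genuinely still present.
    remove_containers()?;

    // Remove the entry BEFORE the slow volume/data teardown.
    package_data.remove(app_id);

    // Best-effort: report residue, but never fail the uninstall for it.
    match cleanup_residue() {
        Ok(()) => Ok(UninstallOutcome::Clean),
        Err(e) => Ok(UninstallOutcome::RemovedWithResidue(e)),
    }
}
```

The ordering is the point of the sketch: because the entry is removed before the slow cleanup runs, a cleanup error can only surface as reported residue, never as a lingering entry.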
archipelago	4346007d37	fix(orchestrator): only TCP host ports get reachability-probed wait_for_manifest_host_ports TCP-connect-probed every published port, including UDP/SCTP. netbird's 3478/udp STUN can never answer a TCP connect, so the probe failed forever and drove an endless host-port repair/reconcile loop on .228 (netbird-server restarting ~every 60s). Filter to tcp (empty protocol = tcp). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 14:40:48 -04:00
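The protocol filter from the commit above might look roughly like this. It is a sketch only: `PublishedPort` and `tcp_probe_targets` are hypothetical names, not the types used by `wait_for_manifest_host_ports`, and the case-insensitive comparison and whitespace trimming are assumptions added for robustness.

```rust
/// Hypothetical view of one published host port from an app manifest.
#[derive(Debug, Clone, PartialEq)]
pub struct PublishedPort {
    pub host_port: u16,
    /// "tcp", "udp", "sctp", or empty. Empty means tcp, per the commit.
    pub protocol: String,
}

/// Return only the host ports that a TCP connect probe can meaningfully test.
/// UDP/SCTP ports (e.g. a 3478/udp STUN port) are excluded, since a TCP
/// connect to them can never succeed.
pub fn tcp_probe_targets(ports: &[PublishedPort]) -> Vec<u16> {
    ports
        .iter()
        .filter(|p| {
            let proto = p.protocol.trim();
            // Assumption: protocol strings are compared case-insensitively.
            proto.is_empty() || proto.eq_ignore_ascii_case("tcp")
        })
        .map(|p| p.host_port)
        .collect()
}
```

With ports `3478/udp`, `80/tcp`, and `443` (empty protocol), only 80 and 443 would be probed; the port numbers other than 3478 are illustrative.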
archipelago	44f7af2017	merge: companion-mobile-ux UX (loader/store-driven launch/icons + android webview) into main # Conflicts: # Android/app/build.gradle.kts # Android/app/src/main/java/com/archipelago/app/ui/screens/WebViewScreen.kt # neode-ui/src/views/apps/appsConfig.ts	2026-06-23 14:07:44 -04:00
archipelago	9670af62b6	feat(registry): deliver app manifests via the signed catalog (embed by default) Turn on registry-distributed manifests for all apps: generate-app-catalog.sh now embeds each apps/<id>/manifest.yml by default (EMBED_MANIFESTS opt-out), so nodes install from the signed catalog (origin-wins overlay, disk = fallback) with no OTA-shipped disk manifest. main.rs awaits a bounded (25s) refresh_catalog before load_manifests so a fresh boot overlays the latest embedded catalog instead of a restart later; offline/ISO boot falls through to disk and never hangs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 13:39:54 -04:00


1460 Commits