lfg2025/archy

Author	SHA1	Message	Date
archipelago	3e3016f2bd	fix(ui): debounce connection-lost banner so transient ws blips don't flash The reconnect banner showed 'Connection lost'/'Reconnecting' instantly on every socket close, even ones that recover in 100ms-2s (load spikes, Tailscale/relay TCP resets). On a healthy node the drops are brief and self-healing, but each one flashed a jarring banner, reading as constant instability. Debounce the transient banner by 2.5s: only surface after the connection issue persists past the grace window; hide immediately on recovery. Deliberate server lifecycle transitions (restart/shutdown) bypass the debounce and still show at once. A genuine persistent outage keeps isOffline true and surfaces after 2.5s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 04:58:54 -04:00
archipelago	7d89b4d8b2	chore(registry): publish embedded app-catalog.json (52 manifests) for fleet fetch Force-add the gitignored releases/app-catalog.json so nodes resolve 146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/app-catalog.json (currently HTTP 404 → disk-manifest fallback). Embedded-manifest delivery is default-on; origin-wins overlay with disk as fallback. Unsigned (migration window accepts unsigned). Includes netbird x3 manifests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 23:45:31 -04:00
archipelago	15f65428b8	docs(master-plan): §8b — uninstall fix deployed+live-verifying, #15 guardian resolved Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 18:07:41 -04:00
archipelago	36015a19fe	docs(master-plan): §8b session-b state — connection-lost+netbird+UX-merge shipped to .228, uninstall ghost fix, workstream F in progress Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 15:26:17 -04:00
archipelago	e57514b690	fix(uninstall): never ghost a removed app in My Apps on cleanup residue handle_package_uninstall lumped every teardown failure into one `errors` vec and returned Err on any of them BEFORE removing the package state entry — so a non-fatal cleanup hiccup (a slow/failed `sudo rm -rf` of a large data dir, a volume/network removal) left the app's containers gone but its entry in package_data → a ghost in My Apps, and the spawned task reverted it to Installed. Split the failures: container removal that even force-rm can't complete (app genuinely still present) keeps the entry + returns Err; everything after the containers are gone is best-effort. Remove the state entry as soon as the containers are gone — BEFORE the slow volume/data teardown — so My Apps updates immediately and residue can never ghost the app. set_uninstall_stage is a no-op once the entry is gone (if-let guard), so the later stages don't re-create it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 15:23:16 -04:00
archipelago	4346007d37	fix(orchestrator): only TCP host ports get reachability-probed wait_for_manifest_host_ports TCP-connect-probed every published port, including UDP/SCTP. netbird's 3478/udp STUN can never answer a TCP connect, so the probe failed forever and drove an endless host-port repair/reconcile loop on .228 (netbird-server restarting ~every 60s). Filter to tcp (empty protocol = tcp). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 14:40:48 -04:00
archipelago	44f7af2017	merge: companion-mobile-ux UX (loader/store-driven launch/icons + android webview) into main # Conflicts: # Android/app/build.gradle.kts # Android/app/src/main/java/com/archipelago/app/ui/screens/WebViewScreen.kt # neode-ui/src/views/apps/appsConfig.ts	2026-06-23 14:07:44 -04:00
archipelago	9670af62b6	feat(registry): deliver app manifests via the signed catalog (embed by default) Turn on registry-distributed manifests for all apps: generate-app-catalog.sh now embeds each apps/<id>/manifest.yml by default (EMBED_MANIFESTS opt-out), so nodes install from the signed catalog (origin-wins overlay, disk = fallback) with no OTA-shipped disk manifest. main.rs awaits a bounded (25s) refresh_catalog before load_manifests so a fresh boot overlays the latest embedded catalog instead of a restart later; offline/ISO boot falls through to disk and never hangs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 13:39:54 -04:00
archipelago	a8b9b0f5e8	feat(netbird): manifest-driven migration via reusable orchestrator primitives Migrate the netbird stack (server/dashboard/proxy) off ~500 lines of per-app Rust to 3 declarative manifests, adding 4 reusable primitives: - SecretGenKind::Base64 (netbird relay authSecret + sqlite store encryptionKey) - GeneratedCert schema + ensure_manifest_certs (self-signed TLS so the dashboard gets a secure context for OIDC PKCE — issue #15; https proxy on 8087 preserved) - templated GeneratedFile render: {{HOST_IP}}/{{HOST_MDNS}}/{{NETWORK_GATEWAY}} (aardvark resolver for the #15 stale-IP fix) /{{secret:NAME}} (never logged) - legacy create_container now honours port.protocol (3478/udp STUN) install_netbird_stack routes via the orchestrator first (legacy kept as fallback, mirroring indeedhub); launch URL derives https://{host_ip}:8087 from host facts. Legacy Rust deletion deferred to post-live-verify. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 13:39:53 -04:00
archipelago	3c36cf1c40	fix(companion): stop image_exists journal flood that drops the UI websocket image_exists ran `podman image inspect <image>` via .status() (inherits the service stdout) with no --format, so every hit dumped the image's full ~249-line manifest JSON into the journal — once per companion image, every reconcile pass (.228: 21.6k journal lines / 10 min, 4131 inspect dumps). The service never crashed (NRestarts=0); the sustained journald/IO flood starved the async runtime and dropped the UI /ws/db websocket -> constant "connection lost"/reconnect. Discard the child's stdout/stderr; only the exit status is used. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 13:39:19 -04:00
archipelago	c4cd5fdc90	docs(master-plan): §8b resume — gate green + 6-node deploy + APK fix + workstream F Comprehensive resume for the session restart: single-node gate green (5/5 .228), latest backend + UX + one-tap companion APK deployed to 6 nodes (table w/ creds + pending 100.64.83.15 cred), workstream-F bugs from manual testing, agreed next order (netbird → Phase-3 → F → multinode), and loose ends (untracked AppLoadingScreen.vue, broken gitea-local mirror, don't-delete-bitcoin-data directive). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 06:56:54 -04:00
archipelago	ccb594fb85	test(gate): fix bitcoin-knots getinfo-after-restart helper + IBD note It called bats-assert's `fail` (not loaded in this file) → "fail: command not found"/127, masking the real reason. Emit+return instead, bump the cold-restart RPC window 60s→120s (block-index reload), and note a node mid-IBD legitimately can't serve getinfo (environmental precondition, not a product regression). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 06:28:20 -04:00
archipelago	deff380191	docs(master-plan): workstream F (lifecycle perfection) + §10 state-mgmt backlog The 2026-06-23 5×-green gate is DESTRUCTIVE-tier / ~8 core apps only — it skips uninstall/reinstall (cascade) and has no progress-UI or all-apps coverage. Manual multinode testing found real bugs it never ran (immich+grafana uninstall hangs at full-red bar + ghost in My Apps; grafana reinstall stops; fedimint guardian "waiting for bitcoin sync"). Adds §4 row F, §6b post-deploy order (netbird→Phase-3→F), §6c scope + observed bugs + definition-of-done, a §5 warning, and §10 backlog to investigate TanStack-Query/push-based state management for neode-ui. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 06:28:19 -04:00
Dorian	5c43e12782	chore(android): publish companion as raw APK instead of zip Serve the companion download as a plain .apk so a phone installs it straight from the link/QR with no unzip step. Repoint the in-app download URL, the ship + publish scripts, and the pre-push hook at archipelago-companion.apk, and drop the legacy .apk.zip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 09:41:10 +01:00
Dorian	e825bbed73	feat(android): file upload/download + in-app tab redesign Companion WebView now supports file inputs and downloads, and apps opened in the in-app tab get a proper loading splash and a footer control bar matching the web app-session bar. - onShowFileChooser wired to an ActivityResultLauncher so <input type=file> opens the system file browser (kiosk + in-app tab) - DownloadListener: http(s) via DownloadManager (forwarding session cookies), blob: via JS->base64->MediaStore, data: decoded inline - in-app tab: app-icon + progress loading splash (eager favicon fetch, upgraded via onReceivedIcon) - footer controls (back/forward/refresh/open/close) matched to the web AppSession mobile bar, with the same SVG glyphs as drawables - bump to 0.4.8 (versionCode 12) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 09:41:10 +01:00
archipelago	0dd19f0721	docs(CLAUDE.md): single-node gate GREEN — demote priority banner run-gate.sh 5/5 on .228. Reframe the TOP PRIORITY banner as gate-green; keep the master plan as north-star source of truth; mark the gate definition-of-done green and point at multinode as the next exit criterion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 04:35:50 -04:00
archipelago	ae47897601	docs: single-node production gate GREEN (5/5 on .228) — demote banner run-gate.sh 5×-green on .228, 0 not-ok (gate-5x5.log). Records the milestone in the header/banner, §4 workstream E, §6 sequence, and §8b; demotes the priority banner per §6 item 6. Next: bundled testing deploy (.116/.198 + UX frontend), multinode pass, workstreams B/C/D. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 04:27:36 -04:00
archipelago	256d354048	docs(master-plan): tick off §8 P1 mobile app-launch UX (code-complete) Mobile launch UX is code-complete on branch `companion-mobile-ux` (store-driven panel, no interstitial, in-app WebView footer + loader, mesh 100dvh, ElectrumX icon, companion v0.4.7 + shared debug keystore). Marked code-complete pending on-device/mobile-web verification and merge to main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 04:11:25 -04:00
archipelago	2a249b8a48	feat(android): companion in-app WebView footer controls + loader; shared debug key; v0.4.7 - InAppBrowser now has a bottom control bar (back/forward/reload/open-in-browser/close) mirroring the web mobile footer, plus a centered loading screen (app favicon + progress bar) instead of a bare top bar over black. - Commit a repo-dedicated debug keystore and pin signingConfigs.debug to it so every machine — and the published companion download — signs debug builds with the SAME key (fixes "App not installed" signature-mismatch on update). Force v1+v2. - Bump versionCode 10→11, versionName 0.4.6→0.4.7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 03:48:58 -04:00
archipelago	a7c7c44843	feat(neode-ui): mobile app-launch UX — store-driven panel, loader, ElectrumX icon - Mobile launches use the store-driven panel (no route push) so the background tab no longer changes and closing returns to where you launched from. - Tab-only apps open directly (in-app WebView on companion / new tab on PWA) — no "this app opens in a tab" interstitial. - Shared AppLoadingScreen (app icon + progress bar) on the app session and the legacy iframe overlay instead of a black screen. - Pin the dashboard to 100dvh on mobile so the mesh chat/tools panes stop sliding under the bottom tab bar in mobile browsers (no-op in the companion WebView). - ElectrumX/electrs/electrs-ui ids now resolve to the real ElectrumX icon in My Apps. - isMobile made reactive so overlay/footer/teleport decisions track the viewport. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 03:48:57 -04:00
archipelago	2afd18c6de	test(gate): poll immich lan_address to absorb mid-recreate churn 5× run #4 flaked iter4 on "immich exposes its web UI lan-address (port 2283)": container-list returned lan_address=null because immich_server was momentarily mid-recreate when the read-only tier queried it (passed the other 4 iterations; immich_server does publish 0.0.0.0:2283->2283). Same single-shot-read class as the bitcoin-knots state probe — poll <=30s for the exposed port instead of one read. A genuinely unexposed immich never publishes 2283, so real port drift is still caught. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 03:20:18 -04:00
archipelago	6511754545	docs: master-plan §8b — 5× triage, mempool restart bug fixed Record the overnight 5× outcome (2/5) and the triage: all three fails were distinct one-offs. iter1 #5 bitcoin-knots = pre-launch churn (hardened anyway); iter2 #74 + iter5 #73 = one real orchestrator bug (phantom stack-member injection in ordered_containers_for_start), now fixed + live-verified on .228. Update the resume check command to gate-5x4.log. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 02:23:07 -04:00
archipelago	92d7f52dd6	fix(orchestrator): order only live containers on package start/restart package.restart resolved its container list via ordered_containers_for_start, which injected every name from the union startup_order list that wasn't already present — including variant names not live on a given node (mysql-mempool, archy-mempool-api, archy-mempool-web). The phantom mysql-mempool is 2nd in the mempool start order, so do_orchestrator_package_start hit its unknown-app-id fallback, do_package_start failed the inspect ("no such object"), and the `?` aborted the whole start sequence — leaving mempool-api + the frontend down until the health monitor recovered them minutes later. That was the source of the 5× gate flakes #73 (frontend not running in 180s) and #74 (api not queryable in 300s); root-caused from the .228 journal ("Start failed: mysql-mempool"). Replace the inject-then-sort logic with a pure helper order_present_containers that orders only the actually-present containers and never adds phantom entries. startup_order remains a union of name variants across install generations — it's now used purely to order what's live, not to inject what isn't. +3 unit tests. Also harden bitcoin-knots.bats "valid state" probe: poll ≤30s for a settled state instead of a single-shot read, so a container caught mid-reconcile (transient restarting/configured) can't flake a 20-min iteration. A genuinely-stuck container never settles, so real breakage is still caught. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 02:22:50 -04:00
archipelago	57a013bc66	test(gate): make 5× the canonical gate, drop 20x naming Rename run-20x.sh → run-gate.sh, default ARCHY_ITERATIONS 20→5, and scrub 20× references across CLAUDE.md, the master plan, TESTING.md, app-registry status, the orchestrator/config doc-comments, and the bats suites. Also add a minimal fail() helper to mempool.bats so guard failures report cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:12:41 -04:00
archipelago	0f05f73a23	fix(mempool): self-healing nginx backend proxy (v3.0.1) + gate timeout The frontend nginx used a literal proxy_pass host with no resolver, so it pinned mempool-api's IP at worker startup. When the backend restarts (gate, OTA, crash, reboot re-IPAM) podman reassigns its IP and nginx keeps proxying to the dead one -> /api hangs, websocket 502s, UI shows 'offline' until a manual nginx reload. Same stale-upstream-IP class as the netbird 502. Fix: mempool-frontend:v3.0.1 rewrites the generated nginx-mempool.conf to re-resolve the backend per-request via 'resolver' + a variable proxy_pass. Resolver address is read from /etc/resolv.conf (podman aardvark-dns answers on the network gateway, not Docker's 127.0.0.11). Per-location path mapping preserved (ws -> '/', /api/v1 identity via no-URI, /api/ -> /api/v1/ rewrite). Proven on .228: backend IP change now auto-recovers with no reload; the literal-host control still 502s. Migrated the manifest off the retired tx1138 registry to vps2. Also: mempool.bats #74 waited only 180s post-restart (the slow path) and called an undefined 'fail' helper (status 127). Bumped to 300s to match the passing parity probes and emit a real failure instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:07:07 -04:00
archipelago	c8acc84506	docs: §2 invariant single-node (.228); multinode → separate plan	2026-06-22 17:23:19 -04:00
archipelago	8355453a7e	docs: exact cutoff-proof resume in master-plan SS8b (resume from any device) Captures: .228 1x-GREEN (110/110); hardened 5x DETACHED on .228 (/tmp/gate-5x2.log, nohup — survives terminal close) with the exact check-from-any-machine command; all shipped code fixes (commits) + deploy state (.228 + .198); node-state fixes NOT in repo (lnd nginx proxy 8081->18083, home-assistant orphan unit removed, electrumx re-registered); the run-ON-the-node lesson; and remaining work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 17:22:29 -04:00
archipelago	98f4fa44a8	test(gate): harden readiness for sustained 5x churn + inter-iteration settle The 1x gate is green; the 5x failed iters 1-2 on readiness-under-churn (apps DO recover — lnd synced, mempool just mid-restart when probed — but slower than the windows when restarted back-to-back). Hardening: - run-20x.sh: best-effort settle_stack() before each iteration (wait for mempool-api/frontend + lnd RPC healthy, 180s, on-node, never fails the run). - required containers present/running (80/81): wait-loops (180s) not single-shot. - mempool api/frontend (87/88): retry ~180s not single-shot. - mempool queryable (74): 60s->180s. lnd restart-running (64): 120s->240s. lnd getinfo (60): 90s->240s retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 17:11:15 -04:00
archipelago	22b05de6d9	docs(roadmap): P1 mobile app-launch UX — drop 'opens in a tab' interstitial Companion app: open every app in the in-app WebView (not just non-iframeable), carrying the mobile-iframe footer controls into the WebView. Mobile web (PWA): open tab-apps directly in a new tab. No interstitial on either surface. Touch points + prior commits (b5a9deb8, d1fbcd9b) noted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 16:57:44 -04:00
archipelago	27299ea687	docs: make the production test gate a SINGLE-NODE (.228) criterion; split out multinode Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC). Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites. Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 16:47:34 -04:00
archipelago	892ff083c4	test(gate): fix the last 4 readiness/config false-fails (none are product bugs) On a proper on-node .228 run (synced bitcoin, 4-fix binary) the lifecycle matrix is green; these 4 were test-harness issues: - lnd 'recovers after restart' (65): bump retry window 90s->240s. lnd cold-restart recovery (wallet unlock + bitcoind reconnect + graph sync) exceeds 90s on a loaded node but DOES complete (synced_to_chain:true). - bitcoin ui responds (89): retry ~120s instead of single-shot (companion nginx may have just been recreated by the companion-survives test). - probe_app_url (99 lnd proxy + all ui-coverage proxy probes): retry up to 90s for post-restart proxy/UI readiness instead of single-shot. - required endpoints after restart (94): :8081 is nginx-proxy-manager, an OPTIONAL app (not in required_containers) — only assert it when NPM is installed; and make the trailing lncli getinfo a retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 15:43:51 -04:00
archipelago	8893055810	test(gate): retry lnd getinfo for RPC readiness (wallet-unlock lags 'running') lnd's RPC isn't ready until its wallet auto-unlocks on (re)start, which lags the container 'running' state — single-shot lncli getinfo raced that window and false-failed (gate tests 60 + 85). Retry up to ~90s like a health probe. lnd is functional (getinfo returns cleanly once ready). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 14:45:36 -04:00
archipelago	53b8e47f1d	test(gate): fix two false-failing lifecycle tests (not product bugs) - immich restart: bump wait 120s->240s. Restart = ordered stop+start of the 3-container stack (postgres->redis->server w/ DB migrations), so it needs at least as long as the start test (180s) — the old 120s was inconsistent and false-failed on loaded nodes. immich does return to running. - fedimint orphan check: the unanchored 'total' regex (^fedimint) counts the legitimate fedimint-clientd (dual-ecash bridge) but the anchored 'known' regex omitted it -> total>known false orphan on every node running fedimint-clientd. Add fedimint-clientd to known. Both run as LOCAL podman/systemctl on the gate runner, so they test the runner node (.116), not the RPC target — surfaced while driving the .228 gate green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 14:11:35 -04:00
archipelago	f4727bfdb3	docs(gate): companion self-heal fix validated (10s) + test-31 harness caveat Independent companion loop (452f05d8) validated on .228: deleted archy-electrs-ui recreates in ~10s (was stuck 100s+). Also: companion-survives bats does LOCAL rm/systemctl --user, so running it from .116 via RPC tests .116's companions with .116's binary, NOT the remote target — must run ON the target node. Explains the 'failed on both nodes' runs (both silently tested .116). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 13:44:57 -04:00
archipelago	452f05d849	fix(reconciler): decouple companion self-heal onto its own cadence The companion-unit repair stage ran at the END of each boot-reconciler tick, after reconcile_existing(). On a heavily loaded node that per-app pass takes >60-90s, so a deleted/lost companion unit (electrs-ui, bitcoin-ui, …) wasn't repaired within any reasonable window (gate test 31 'deleted unit recreated within one reconcile tick' timed out at 90s on the 45-app .228 node). Detecting + rewriting a companion unit is cheap, so spawn it as its own ~interval(30s) loop, independent of the slow app pass. Handle is aborted when the main loop exits (shutdown uses notify_one, so a second waiter would steal the wake permit). tick() is now app-reconcile only. All 4 boot_reconciler cadence tests still green (companion_stage=false in tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 13:04:28 -04:00
archipelago	de7d3d83dc	docs(gate): final read — every failure fixed/explained, no lifecycle bugs remain Last 2 .228 stragglers confirmed load/timing, not bugs: test 31 (companion recreate) = contamination + ~108s reconcile cadence > 90s window; test 55 (immich restart) = heavy stack restarts >120s under load but DOES return. Path to literally-green gate is infra (bitcoin sync, re-quadletize .228) + minor test-window tuning. Optional product improvement noted: independent ~30s companion-reconcile cadence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 12:36:03 -04:00
archipelago	76b23adcc0	docs(gate): test 31 root-caused = .228 contamination (not a product bug) companion::reconcile only recreates a deleted companion unit when its parent backend is in manifest_ids. On contaminated .228, electrumx ran as plain podman and was NOT a tracked manifest install (manifest on disk but unloaded), so the reconciler never iterated it -> archy-electrs-ui companion orphaned. Proven: package.install electrumx re-registered it + restored the companion. Self-heal logic is sound; test 31 clears on re-quadletize. electrumx on .228 de-contaminated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 11:34:55 -04:00
archipelago	47a5148865	docs(gate): two-node result — stop blocker FIXED; residual red is bitcoin-IBD + node prep .228 104/110, .198 94/110 with the 3-fix binary. Every package.stop test passes on healthy apps. .198's 14/16 failures trace to bitcoin in IBD (test 83: ~137k blocks behind) cascading to lnd/btcpay/electrumx/mempool. 2 node-independent: companion recreate (31, both nodes), fedimint orphan pollution (44). Path to green 5x gate is now infra (sync bitcoin, re-quadletize .228) + minor (test 31), not lifecycle bugs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 11:09:12 -04:00
archipelago	b090235b04	docs(gate): 3 stop bugs FIXED, electrumx suite GREEN on .228 Stop failure was 3 real product bugs (grace / reconcile-resurrection / container-list user-stopped state), all fixed (2dad64b2, 760a32bc, 6e49ce6f) + deployed. electrumx lifecycle suite 10/10 green (66s). fedimint 'crash loop' was probe-induced churn (stable when left alone). Validating breadth next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 09:49:45 -04:00
archipelago	6e49ce6f88	fix(container-list): report user-stopped apps as stopped despite live UI companion A user-stopped backend (electrumx, bitcoin, lnd, fedimint) kept reading 'running' in container-list because its UI companion (electrs-ui, …) still serves the launch port, and the state-refresh upgrades any reachable launch port to 'running'. The gate's wait_for_container_status <app> stopped therefore never saw 'stopped'. Fix: load the user_stopped marker in handle_container_list and force 'stopped' for those apps before the launch-port refresh. The reconcile guard keeps the backend down, so the marker is authoritative. package.start clears it first, so a started app reports 'running' normally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 09:26:30 -04:00
archipelago	760a32bccf	fix(reconcile): keep user-stopped apps stopped (reconciler was resurrecting them) package.stop a dependency (e.g. electrumx, a mempool dep) and the reconciler restarts it within ~8s: the reconcile filter's dependency_required override re-includes a user-stopped app that an active app depends on, and the in-memory disabled set is wiped on manifest reload — so ensure_running runs, the stopped app's unreachable ports look like a fault, the host-port repair restarts it, and package.stop never sticks (gate 'transitions to stopped' times out). Fix: guard ensure_running_with_mode on the on-disk user_stopped marker (the single choke point every reconcile flows through) → Left('user-stopped'). Explicit install/start clear the marker first (added clear_user_stopped to orchestrator install/start, symmetric with disabled.remove; start/restart RPC already cleared it) so user actions are unaffected. The container itself already stopped correctly — this stops the resurrection. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 09:04:02 -04:00
archipelago	29cd167894	docs(gate): stop-grace fix shipped+validated; gate is multi-caused (5 issues) Fix deployed to .198+.228, vaultwarden stops clean (no regression). But validation showed the gate failures are multi-caused: (2) fedimint crash-looping/unhealthy on both nodes can't be stopped; (3) host-listener repair watchdog restarts port-unreachable containers fighting stop; (4) gate waits for 'stopped' but apps end 'exited'/'absent' (Exited->Stopped conversion key mismatch); (5) grace vs 60s gate-timeout (electrumx 300s); (6) .228 contamination. Documented + re-sequenced NEXT STEPS (fedimint health is the new top blocker). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 08:07:43 -04:00
archipelago	2dad64b2ee	fix(stop): honour per-app graceful-stop grace in orchestrator stop path package.stop left slow-to-SIGTERM apps (fedimint/electrumx/bitcoin/btcpay/immich) running: the orchestrator path hardcoded podman API ?t=10 / CLI -t 30 and the CLI wrapper deadline (30s) equalled the -t grace, so the await fired exactly as podman SIGKILLed -> stop reported failed -> state reverted to running. Reproduced live on clean .198 (fedimint). - container/runtime.rs: add ContainerRuntime::stop_container_with_grace (defaulted so mock/dev impls are unchanged); PodmanRuntime honours grace for API + CLI with deadline = grace + 15s buffer; AutoRuntime delegates. New canonical per-app table stop_grace_secs_for() + DEFAULT_STOP_GRACE_SECS / STOP_GRACE_DEADLINE_BUFFER_SECS. - podman_client.rs: stop_container_with_grace uses ?t=<grace> + longer HTTP deadline. - prod_orchestrator::stop: resolve grace = manifest stop_grace_secs (north-star) else the table; pass to quadlet::stop_service_with_timeout AND stop_container_with_grace. - quadlet.rs: stop_service_with_timeout so slow apps aren't SIGKILLed at 45s. - rpc/package/runtime.rs: doc-note its &str stop_timeout_secs mirrors the canonical table. - tests: resolve_stop_grace_secs (manifest field wins / table fallback / default 30). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 06:59:40 -04:00
archipelago	470e3c649a	docs(gate): ROOT-CAUSE the stop blocker — orchestrator ignores per-app stop grace Reproduced live on CLEAN .198: package.stop fedimint -> 'podman stop -t 30 timed out after 30s' -> stop fails -> state reverts to running. Real fleet-wide bug (NOT .228 contamination). stop_timeout_secs() per-app grace (bitcoin 600/lnd 330/electrumx 300/fedimint 60) is used by legacy stop paths but NOT the orchestrator path: ContainerRuntime::stop_container hardcodes API ?t=10 / CLI -t 30, and PODMAN_CLI_DEFAULT_TIMEOUT=30s == the -t grace so the await fires as podman SIGKILLs. Fix = thread per-app grace + widen wrapper deadline; owner picks table-based vs manifest-driven stop_grace_secs. Re-escalated to blocker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 06:17:23 -04:00
archipelago	a111d79a05	docs(gate): downgrade stop-blocker ⛔→⚠️ — .198 has quadlet units, .228 state was my contamination .198 ground truth: backend apps ARE quadlet (.container files present) -> quadlet is the intended runtime. .228's plain-podman state traced to my cascade-gate uninstall + package.start restore (no quadlet regen). Two real robustness sub-bugs remain (start should regen quadlet; stop podman-fallback gap). Next: canonical gate on CLEAN .198 first to tell real-bug from contamination. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 06:00:42 -04:00
archipelago	47026fae30	docs(gate): document package.stop blocker + quadlet-vs-podman finding (.228) 5x gate run surfaced a real blocker: package.stop does not stop electrumx/bitcoin-knots/btcpay/fedimint/immich (container stays running; gate stop-wait times out). Root cause chain: these backend apps run as plain podman --restart=unless-stopped, NOT quadlet units (PODMAN_SYSTEMD_UNIT empty; only UI companions + home-assistant have .container files; bitcoin-core.container is .disabled). orchestrator.stop() podman-fallback fires for filebrowser but not electrumx -> suspect loaded()/is_unknown_app_id_error gap. stop->stopped state reporting itself is correct (filebrowser proof, user_stopped guard). Also: corrected the canonical gate invocation (DESTRUCTIVE only, not CASCADE); restored .228 after my cascade-gate left apps stranded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 05:47:11 -04:00
archipelago	d6fa262d69	docs(#20): consolidate master-plan resume — indeedhub migration 2-node verified (.228+.198); cutoff-proof next-steps + deploy facts Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 04:23:52 -04:00
archipelago	e2a012d086	fix(indeedhub): frontend health = tcp:7777 not http GET / (stops reconcile churn) On the loaded .198 the frontend churned (created → "unhealthy" → reconciler recreates → loop). The http health check fetched / through nginx (SPA + sub_filter) and false-failed under node load; the reconciler then treated the frontend as wedged and recreated it. nginx binds 7777 at startup, so a tcp liveness check passes immediately and stays green under load while still catching a real "nginx not listening" failure. Generous retries/start_period. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 03:39:26 -04:00
archipelago	e4d3f94913	docs(#20): hook exec cgroup gap FIXED + verified on .228 (scoped exec) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:57:17 -04:00
archipelago	ff78b31212	fix(hooks): run post_install `exec` in a transient user scope (fixes cgroup denial) Live on .228 the post_install `exec` steps failed with "crun: write cgroup.procs: Permission denied / OCI permission denied": a `podman exec` launched from archipelago.service can't place its child in the container's cgroup (under the service's own slice). Wrap `exec` in `systemd-run --user --scope --quiet --collect podman exec …` so it gets its own delegated cgroup — same trick as `podman_user_scope` for pasta starts. `copy_from_host` (a host-side `cp`, no in-container process) stays direct. Without this only copy_from_host worked; indeedhub happened to be unaffected (its image pre-bakes the nginx config so the exec steps were no-ops), but the hook capability is only generally useful with exec working. hooks unit tests pass; live verify on .228 next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:38:23 -04:00
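
Commit 92d7f52dd6 above replaces inject-then-sort with a pure helper, order_present_containers, that orders only the containers actually present and never adds names from startup_order. The following is a minimal sketch of that idea, not the repository's code: the signature is an assumption, and the live container names in main (including mempool-db) are illustrative placeholders. Only mysql-mempool, archy-mempool-api and archy-mempool-web are named in the commit as variants that may not be live on a given node.

```rust
/// Hypothetical sketch: return the present containers sorted by their
/// position in `startup_order`. Names that appear only in `startup_order`
/// are never added to the result.
fn order_present_containers(present: &[String], startup_order: &[&str]) -> Vec<String> {
    let mut ordered = present.to_vec();
    // The sort is stable, so containers missing from `startup_order` keep
    // their relative order and land after every listed container.
    ordered.sort_by_key(|name| {
        startup_order
            .iter()
            .position(|candidate| *candidate == name.as_str())
            .unwrap_or(startup_order.len())
    });
    ordered
}

fn main() {
    // Union of name variants across install generations (illustrative list).
    let startup_order = [
        "mempool-db",
        "mysql-mempool",
        "mempool-api",
        "archy-mempool-api",
        "mempool-frontend",
        "archy-mempool-web",
    ];
    // Containers assumed to be live on this node, in arbitrary order.
    let present = vec![
        "mempool-frontend".to_string(),
        "mempool-api".to_string(),
        "mempool-db".to_string(),
    ];

    let ordered = order_present_containers(&present, &startup_order);

    // Phantom variants such as mysql-mempool are never injected.
    assert_eq!(ordered, vec!["mempool-db", "mempool-api", "mempool-frontend"]);
    println!("{:?}", ordered);
}
```

Because the helper only reorders its input, a variant name that is not live on the node can no longer reach the start path and abort the sequence.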


1418 Commits
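
Commit 2dad64b2ee in the list above resolves the stop grace as "manifest stop_grace_secs, else the per-app table, else default 30" and sets the wrapper deadline to grace + 15s. The sketch below illustrates that resolution order under stated assumptions: the constant and function names come from the commit message, but the signatures, the table keys, and the stop_deadline helper are hypothetical. The per-app values are the ones quoted in commit 470e3c649a.

```rust
use std::time::Duration;

const DEFAULT_STOP_GRACE_SECS: u64 = 30;
const STOP_GRACE_DEADLINE_BUFFER_SECS: u64 = 15;

/// Per-app fallback table (keys are illustrative app ids).
fn stop_grace_secs_for(app_id: &str) -> u64 {
    match app_id {
        "bitcoin" => 600,
        "lnd" => 330,
        "electrumx" => 300,
        "fedimint" => 60,
        _ => DEFAULT_STOP_GRACE_SECS,
    }
}

/// The manifest field wins; otherwise fall back to the table, which itself
/// falls back to the default.
fn resolve_stop_grace_secs(manifest_grace: Option<u64>, app_id: &str) -> u64 {
    manifest_grace.unwrap_or_else(|| stop_grace_secs_for(app_id))
}

/// Hypothetical helper: the caller's deadline must exceed the grace so the
/// wait does not expire at the same moment the runtime sends SIGKILL.
fn stop_deadline(grace_secs: u64) -> Duration {
    Duration::from_secs(grace_secs + STOP_GRACE_DEADLINE_BUFFER_SECS)
}

fn main() {
    // Manifest field wins over the table.
    assert_eq!(resolve_stop_grace_secs(Some(120), "fedimint"), 120);
    // Table fallback when the manifest is silent.
    assert_eq!(resolve_stop_grace_secs(None, "electrumx"), 300);
    // Default for apps absent from both.
    assert_eq!(resolve_stop_grace_secs(None, "filebrowser"), 30);
    // Deadline is always grace plus the buffer.
    assert_eq!(stop_deadline(30), Duration::from_secs(45));
    println!("fedimint deadline: {:?}", stop_deadline(stop_grace_secs_for("fedimint")));
}
```

Keeping the deadline strictly larger than the grace addresses the failure described in the commit, where a 30s wrapper deadline equalled the -t 30 grace and the stop was reported as failed.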