1428 Commits

Author SHA1 Message Date
archipelago
0a8db9044f fix(orchestrator): recreate zombie "Up" containers whose process is dead
podman trusts its own state DB: when a container's conmon dies without
podman observing it (cgroup-cascade SIGKILL on archipelago.service
restart, a crash), `podman ps` keeps reporting it "Up" long after the
process is gone. The reconciler NoOp'd such a zombie forever, so a dead
dependency with no published host port never recovered.

Observed live on .228 (2026-06-25): netbird-dashboard reported "Up" with
a dead State.Pid → its nginx proxy 502'd → NetBird login broke
("Unauthenticated"). The dashboard publishes no host port, so the
Running branch had nothing to probe and never recreated it.

Add a zombie guard to the Running branch: verify the recorded State.Pid
is alive (its /proc entry exists) before trusting "running"; on a
concrete dead PID, stop+remove+install_fresh from the manifest.
Conservative by design — any uncertainty (inspect failed, PID
unparseable) assumes alive, so a transient podman hiccup never destroys
a healthy container. Unit test covers live/dead/out-of-range PIDs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 02:25:52 -04:00
archipelago
43e700498b fix(android): trust self-signed certs for the user's own node in WebView
Node apps (e.g. NetBird on :8087) terminate TLS with a self-signed cert
so the dashboard gets a secure context (OIDC / window.crypto.subtle, #15).
The WebView's default onReceivedSslError CANCELs untrusted certs, so those
apps rendered blank in the companion — exactly the netbird "won't load in
the webview" report. Override onReceivedSslError in both WebViewClients
(kiosk + in-app browser) to proceed() only when the failing cert's host
matches the connected node; reject everything else (no blanket trust).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 18:13:52 -04:00
archipelago
89d397bb74 refactor(netbird): delete legacy Rust installer — #20 ph4 (manifest-driven only)
netbird is fully manifest-driven (apps/netbird-*/manifest.yml via the signed
catalog): install_stack_via_orchestrator renders the 3-member stack with
generated_certs (self-signed TLS for the #15 OIDC secure context), base64
generated_secrets, and templated config — and adopts the running stack by live
container name. The hardcoded `podman run` fallback was therefore dead code on
any node with the embedded catalog (verified live: .228 https:8087 -> 200).

Removes the per-app Rust installer anti-pattern the master plan calls out:
- install_netbird_stack: orchestrator -> adopt -> bail! (no in-Rust installer)
- deletes 6 now-dead helpers (write_netbird_config_files, ensure_netbird_tls_cert,
  read_or_generate_b64_secret, netbird_net_resolver_ip, detect_netbird_public_host_ip,
  wait_for_netbird_oidc_ready), 3 NETBIRD_*_IMAGE consts, unused base64::Engine import
- ~485 lines removed; prod_orchestrator doc-comments updated

Behavioural parity: the manifest path already executed on the fleet, so this
changes no live behavior. The legacy #10 OIDC-readiness wait was already bypassed
by the manifest path; if that race resurfaces, add an OIDC-ready gate to the
manifest rather than resurrecting the Rust fn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 11:04:01 -04:00
archipelago
41e7f500f8 test(lifecycle): tolerate slow-but-healthy heavy-app recovery under 5x churn
The 5x destructive gate on heavy nodes false-failed on transient windows
during stack recovery, not real regressions:

- immich.bats: lan_address port-publish probe 30s -> 90s. The postgres->redis
  ->server (DB migrations on boot) stack can take >30s to republish :2283 after
  a churn-induced recreate; destructive-tier immich tests already allow 180-240s.
- mempool.bats: orphan-container check now polls to steady state (<=30s) instead
  of a single-shot count, which caught a recreated member briefly visible
  alongside its replacement mid-reconcile.
- run-gate.sh: settle cap 180s -> 300s and also gate on immich's :2283 when
  installed, so the next iteration's read-only probe doesn't race a still-
  recovering stack. Settle returns the instant every probe is green.

A genuinely unexposed/orphaned/unhealthy app still fails these checks; they only
absorb the transient recreate window under sustained churn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 09:18:34 -04:00
archipelago
a721532f55 feat(orchestrator): desired-state recovery + recreate volume-ownership [UNVALIDATED WIP]
NOT yet validated on a node or fleet-deployed — cargo check passes, release build
+ .228 canary validation pending. Committed as a checkpoint so the work survives.

Two fixes the immich .198 incident exposed:

Fix A (reconcile_all_with_mode): a previously-running app whose container vanished
(e.g. a wedged podman teardown cleared by a reboot) was left absent on boot. Now,
when boot reconcile would leave an app 'absent' but it was running at the last
running-containers snapshot, recreate it (install_fresh). New
crash_recovery::load_last_running_names() reads the snapshot without the PID/crash
gate (+2 unit tests). Match is exact on compute_container_name (incl stack
members); user-stopped + uninstalled apps are already excluded, so no false
positives.

Fix B (ensure_bind_mount_dirs): a freshly-created bind dir was left root:root, so a
no-data_uid app running as container-root (→ host rootless user) hit EACCES and
crash-looped (the exact immich upload-dir failure). Now a newly-created bind dir
for a no-data_uid app is chowned via --reference=<parent> to match the rootless
data root — no host-uid guessing, only fresh dirs (no regression for existing
installs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 09:28:40 -04:00
archipelago
80f49cac1c fix(ui): backoff remote-relay reconnects + stop cryptpad icon 404
Two console-noise fixes from a live error dump:
- remote-relay.ts reconnected on a FIXED 5s interval with no backoff, so when
  the backend is briefly down it floods the console/network with failed-WS
  attempts for the whole outage. It's a secondary feature (companion input), so
  add exponential backoff 1s->30s (mirrors websocket.ts), reset on open/start.
- cryptpad's catalog/marketplace entries pointed at a non-existent
  /assets/img/app-icons/cryptpad.webp -> a 404 on every marketplace render.
  Point it at the existing default icon (handleImageError swapped to it anyway).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 08:41:04 -04:00
archipelago
2d8ade629b fix(ui): log global errors silently instead of popping a toast + overlay
The global error handler (Vue errorHandler + window error + unhandledrejection)
fired a red 'Something went wrong: <raw msg>' toast AND an auto on-device overlay
on every caught error — deliberately loud for bug-bash, but it surfaces benign,
non-actionable noise (e.g. a transient RPC rejection during a ws reconnect, or
the service worker failing to register over a self-signed cert) right in the
user's face.

Demote the catch-all to SILENT capture: keep console.error + the
window.__archyErrors ring buffer, and expose the screenshot-able overlay
on-demand via window.__archyShowErrors() — but never auto-pop. Components that
need to report a specific, actionable failure still call toast.error() directly.

Also filter known-benign environmental noise (PWA service-worker registration
failing over a self-signed cert — needs a trusted cert, #56) so it doesn't even
occupy a ring-buffer slot and push out real errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 05:55:49 -04:00
archipelago
0406af522c test(lifecycle): add manifest-driven all-apps health matrix
The per-app suites cover ~8 core apps in depth; nothing covered the ~30 others
(jellyfin, vaultwarden, penpot, nextcloud, grafana, …). all-apps-matrix.bats
derives the app set from server.get-state package-data (no hardcoded list) and
asserts baseline health across EVERY installed app:
  - settles to a non-transitional state within a window (the #13/#14 stuck-ghost
    class, generalized fleet-wide — installing/removing that never settles)
  - not in error/failed
  - reports a recognized (non-garbage) state
  - every running UI app (manifest ui=="true") exposes a non-null lan-address
    (the immich/port-drift unreachable-UI failure, generalized to all UI apps)

Read-only, so it joins run.sh/run-gate.sh on every node and grows coverage as
nodes install more apps. Verified 5/5 on .228 (17 apps) and .116 (20 apps).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 05:27:10 -04:00
archipelago
57a69257c4 test(lifecycle): add CASCADE uninstall/reinstall tier (guards #13 ghost, #14 reinstall)
The 5x gate is DESTRUCTIVE-only and never exercised uninstall/reinstall — where
the worst field bugs lived (#13 app ghosting in My Apps after uninstall, #14
reinstall stalling on stale state). New cascade-uninstall.bats drives the full
teardown path on a throwaway app (default grafana, precondition-skips if already
installed so it can't destroy real data) and asserts:
  - fresh install reaches running via a truthful, non-silent progression
  - uninstall makes the entry DISAPPEAR from server.get-state package-data
    (the literal My Apps map) — no ghost, no stuck uninstall stage
  - container + (on-node) data dir are gone
  - reinstall returns to running
  - node left as found

Opt-in via ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1; not yet folded into the canonical
gate. Verified 7/7 against .228.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 05:13:53 -04:00
archipelago
d1cd42c821 fix(orchestrator): stop retrying unrepairable volume chowns every reconcile
ensure_running_container_ownership re-probed and re-attempted the in-container
chown on every reconcile pass. For a mount that can't be re-owned from inside the
userns (observed: mempool-api /data -> 'Operation not permitted'), this burned
CPU and logged a WARN on every pass, forever (~6x/30min on .228/.116).

Remember hard chown failures in a process-lifetime set keyed by (container-id,
dest) and skip the probe+chown for known-unrepairable mounts. Keyed by Id (not
name) so a recreated container gets a fresh repair attempt. Verified on .116:
one recorded failure at startup, then silent across subsequent reconciles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 04:58:57 -04:00
archipelago
3e3016f2bd fix(ui): debounce connection-lost banner so transient ws blips don't flash
The reconnect banner showed 'Connection lost'/'Reconnecting' instantly on every
socket close, even ones that recover in 100ms-2s (load spikes, Tailscale/relay
TCP resets). On a healthy node the drops are brief and self-healing, but each one
flashed a jarring banner, reading as constant instability.

Debounce the transient banner by 2.5s: only surface after the connection issue
persists past the grace window; hide immediately on recovery. Deliberate server
lifecycle transitions (restart/shutdown) bypass the debounce and still show at
once. A genuine persistent outage keeps isOffline true and surfaces after 2.5s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 04:58:54 -04:00
archipelago
7d89b4d8b2 chore(registry): publish embedded app-catalog.json (52 manifests) for fleet fetch
Force-add the gitignored releases/app-catalog.json so nodes resolve
146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/app-catalog.json
(currently HTTP 404 → disk-manifest fallback). Embedded-manifest delivery
is default-on; origin-wins overlay with disk as fallback. Unsigned (migration
window accepts unsigned). Includes netbird x3 manifests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 23:45:31 -04:00
archipelago
15f65428b8 docs(master-plan): §8b — uninstall fix deployed+live-verifying, #15 guardian resolved
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 18:07:41 -04:00
archipelago
36015a19fe docs(master-plan): §8b session-b state — connection-lost+netbird+UX-merge shipped to .228, uninstall ghost fix, workstream F in progress
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 15:26:17 -04:00
archipelago
e57514b690 fix(uninstall): never ghost a removed app in My Apps on cleanup residue
handle_package_uninstall lumped every teardown failure into one `errors` vec
and returned Err on any of them BEFORE removing the package state entry — so a
non-fatal cleanup hiccup (a slow/failed `sudo rm -rf` of a large data dir, a
volume/network removal) left the app's containers gone but its entry in
package_data → a ghost in My Apps, and the spawned task reverted it to Installed.

Split the failures: container removal that even force-rm can't complete (app
genuinely still present) keeps the entry + returns Err; everything after the
containers are gone is best-effort. Remove the state entry as soon as the
containers are gone — BEFORE the slow volume/data teardown — so My Apps updates
immediately and residue can never ghost the app. set_uninstall_stage is a no-op
once the entry is gone (if-let guard), so the later stages don't re-create it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 15:23:16 -04:00
archipelago
4346007d37 fix(orchestrator): only TCP host ports get reachability-probed
wait_for_manifest_host_ports TCP-connect-probed every published port, including
UDP/SCTP. netbird's 3478/udp STUN can never answer a TCP connect, so the probe
failed forever and drove an endless host-port repair/reconcile loop on .228
(netbird-server restarting ~every 60s). Filter to tcp (empty protocol = tcp).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 14:40:48 -04:00
archipelago
44f7af2017 merge: companion-mobile-ux UX (loader/store-driven launch/icons + android webview) into main
# Conflicts:
#	Android/app/build.gradle.kts
#	Android/app/src/main/java/com/archipelago/app/ui/screens/WebViewScreen.kt
#	neode-ui/src/views/apps/appsConfig.ts
2026-06-23 14:07:44 -04:00
archipelago
9670af62b6 feat(registry): deliver app manifests via the signed catalog (embed by default)
Turn on registry-distributed manifests for all apps: generate-app-catalog.sh now
embeds each apps/<id>/manifest.yml by default (EMBED_MANIFESTS opt-out), so nodes
install from the signed catalog (origin-wins overlay, disk = fallback) with no
OTA-shipped disk manifest. main.rs awaits a bounded (25s) refresh_catalog before
load_manifests so a fresh boot overlays the latest embedded catalog instead of a
restart later; offline/ISO boot falls through to disk and never hangs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 13:39:54 -04:00
archipelago
a8b9b0f5e8 feat(netbird): manifest-driven migration via reusable orchestrator primitives
Migrate the netbird stack (server/dashboard/proxy) off ~500 lines of per-app Rust
to 3 declarative manifests, adding 4 reusable primitives:
- SecretGenKind::Base64 (netbird relay authSecret + sqlite store encryptionKey)
- GeneratedCert schema + ensure_manifest_certs (self-signed TLS so the dashboard
  gets a secure context for OIDC PKCE — issue #15; https proxy on 8087 preserved)
- templated GeneratedFile render: {{HOST_IP}}/{{HOST_MDNS}}/{{NETWORK_GATEWAY}}
  (aardvark resolver for the #15 stale-IP fix) /{{secret:NAME}} (never logged)
- legacy create_container now honours port.protocol (3478/udp STUN)
install_netbird_stack routes via the orchestrator first (legacy kept as fallback,
mirroring indeedhub); launch URL derives https://{host_ip}:8087 from host facts.
Legacy Rust deletion deferred to post-live-verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 13:39:53 -04:00
archipelago
3c36cf1c40 fix(companion): stop image_exists journal flood that drops the UI websocket
image_exists ran `podman image inspect <image>` via .status() (inherits the
service stdout) with no --format, so every hit dumped the image's full ~249-line
manifest JSON into the journal — once per companion image, every reconcile pass
(.228: 21.6k journal lines / 10 min, 4131 inspect dumps). The service never
crashed (NRestarts=0); the sustained journald/IO flood starved the async runtime
and dropped the UI /ws/db websocket -> constant "connection lost"/reconnect.
Discard the child's stdout/stderr; only the exit status is used.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 13:39:19 -04:00
archipelago
c4cd5fdc90 docs(master-plan): §8b resume — gate green + 6-node deploy + APK fix + workstream F
Comprehensive resume for the session restart: single-node gate green
(5/5 .228), latest backend + UX + one-tap companion APK deployed to 6
nodes (table w/ creds + pending 100.64.83.15 cred), workstream-F bugs
from manual testing, agreed next order (netbird → Phase-3 → F →
multinode), and loose ends (untracked AppLoadingScreen.vue, broken
gitea-local mirror, don't-delete-bitcoin-data directive).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 06:56:54 -04:00
archipelago
ccb594fb85 test(gate): fix bitcoin-knots getinfo-after-restart helper + IBD note
It called bats-assert's `fail` (not loaded in this file) → "fail:
command not found"/127, masking the real reason. Emit+return instead,
bump the cold-restart RPC window 60s→120s (block-index reload), and
note a node mid-IBD legitimately can't serve getinfo (environmental
precondition, not a product regression).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 06:28:20 -04:00
archipelago
deff380191 docs(master-plan): workstream F (lifecycle perfection) + §10 state-mgmt backlog
The 2026-06-23 5×-green gate is DESTRUCTIVE-tier / ~8 core apps only —
it skips uninstall/reinstall (cascade) and has no progress-UI or
all-apps coverage. Manual multinode testing found real bugs it never
ran (immich+grafana uninstall hangs at full-red bar + ghost in My Apps;
grafana reinstall stops; fedimint guardian "waiting for bitcoin sync").
Adds §4 row F, §6b post-deploy order (netbird→Phase-3→F), §6c scope +
observed bugs + definition-of-done, a §5 warning, and §10 backlog to
investigate TanStack-Query/push-based state management for neode-ui.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 06:28:19 -04:00
Dorian
5c43e12782 chore(android): publish companion as raw APK instead of zip
Serve the companion download as a plain .apk so a phone installs it
straight from the link/QR with no unzip step. Repoint the in-app
download URL, the ship + publish scripts, and the pre-push hook at
archipelago-companion.apk, and drop the legacy .apk.zip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 09:41:10 +01:00
Dorian
e825bbed73 feat(android): file upload/download + in-app tab redesign
Companion WebView now supports file inputs and downloads, and apps
opened in the in-app tab get a proper loading splash and a footer
control bar matching the web app-session bar.

- onShowFileChooser wired to an ActivityResultLauncher so <input
  type=file> opens the system file browser (kiosk + in-app tab)
- DownloadListener: http(s) via DownloadManager (forwarding session
  cookies), blob: via JS->base64->MediaStore, data: decoded inline
- in-app tab: app-icon + progress loading splash (eager favicon
  fetch, upgraded via onReceivedIcon)
- footer controls (back/forward/refresh/open/close) matched to the
  web AppSession mobile bar, with the same SVG glyphs as drawables
- bump to 0.4.8 (versionCode 12)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 09:41:10 +01:00
archipelago
0dd19f0721 docs(CLAUDE.md): single-node gate GREEN — demote priority banner
run-gate.sh 5/5 on .228. Reframe the TOP PRIORITY banner as
gate-green; keep the master plan as north-star source of truth; mark
the gate definition-of-done green and point at multinode as the next
exit criterion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 04:35:50 -04:00
archipelago
ae47897601 docs: single-node production gate GREEN (5/5 on .228) — demote banner
run-gate.sh 5×-green on .228, 0 not-ok (gate-5x5.log). Records the
milestone in the header/banner, §4 workstream E, §6 sequence, and §8b;
demotes the priority banner per §6 item 6. Next: bundled testing deploy
(.116/.198 + UX frontend), multinode pass, workstreams B/C/D.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 04:27:36 -04:00
archipelago
256d354048 docs(master-plan): tick off §8 P1 mobile app-launch UX (code-complete)
Mobile launch UX is code-complete on branch `companion-mobile-ux` (store-driven
panel, no interstitial, in-app WebView footer + loader, mesh 100dvh, ElectrumX
icon, companion v0.4.7 + shared debug keystore). Marked code-complete pending
on-device/mobile-web verification and merge to main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 04:11:25 -04:00
archipelago
2a249b8a48 feat(android): companion in-app WebView footer controls + loader; shared debug key; v0.4.7
- InAppBrowser now has a bottom control bar (back/forward/reload/open-in-browser/
  close) mirroring the web mobile footer, plus a centered loading screen
  (app favicon + progress bar) instead of a bare top bar over black.
- Commit a repo-dedicated debug keystore and pin signingConfigs.debug to it so
  every machine — and the published companion download — signs debug builds with
  the SAME key (fixes "App not installed" signature-mismatch on update). Force v1+v2.
- Bump versionCode 10→11, versionName 0.4.6→0.4.7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 03:48:58 -04:00
archipelago
a7c7c44843 feat(neode-ui): mobile app-launch UX — store-driven panel, loader, ElectrumX icon
- Mobile launches use the store-driven panel (no route push) so the background
  tab no longer changes and closing returns to where you launched from.
- Tab-only apps open directly (in-app WebView on companion / new tab on PWA) —
  no "this app opens in a tab" interstitial.
- Shared AppLoadingScreen (app icon + progress bar) on the app session and the
  legacy iframe overlay instead of a black screen.
- Pin the dashboard to 100dvh on mobile so the mesh chat/tools panes stop sliding
  under the bottom tab bar in mobile browsers (no-op in the companion WebView).
- ElectrumX/electrs/electrs-ui ids now resolve to the real ElectrumX icon in My Apps.
- isMobile made reactive so overlay/footer/teleport decisions track the viewport.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 03:48:57 -04:00
archipelago
2afd18c6de test(gate): poll immich lan_address to absorb mid-recreate churn
5× run #4 flaked iter4 on "immich exposes its web UI lan-address
(port 2283)": container-list returned lan_address=null because
immich_server was momentarily mid-recreate when the read-only tier
queried it (passed the other 4 iterations; immich_server does publish
0.0.0.0:2283->2283). Same single-shot-read class as the bitcoin-knots
state probe — poll <=30s for the exposed port instead of one read. A
genuinely unexposed immich never publishes 2283, so real port drift
is still caught.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 03:20:18 -04:00
archipelago
6511754545 docs: master-plan §8b — 5× triage, mempool restart bug fixed
Record the overnight 5× outcome (2/5) and the triage: all three
fails were distinct one-offs. iter1 #5 bitcoin-knots = pre-launch
churn (hardened anyway); iter2 #74 + iter5 #73 = one real
orchestrator bug (phantom stack-member injection in
ordered_containers_for_start), now fixed + live-verified on .228.
Update the resume check command to gate-5x4.log.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 02:23:07 -04:00
archipelago
92d7f52dd6 fix(orchestrator): order only live containers on package start/restart
package.restart resolved its container list via
ordered_containers_for_start, which injected every name from the
union startup_order list that wasn't already present — including
variant names not live on a given node (mysql-mempool,
archy-mempool-api, archy-mempool-web). The phantom mysql-mempool is
2nd in the mempool start order, so do_orchestrator_package_start hit
its unknown-app-id fallback, do_package_start failed the inspect
("no such object"), and the `?` aborted the whole start sequence —
leaving mempool-api + the frontend down until the health monitor
recovered them minutes later. That was the source of the 5× gate
flakes #73 (frontend not running in 180s) and #74 (api not queryable
in 300s); root-caused from the .228 journal
("Start failed: mysql-mempool").

Replace the inject-then-sort logic with a pure helper
order_present_containers that orders only the actually-present
containers and never adds phantom entries. startup_order remains a
union of name variants across install generations — it's now used
purely to order what's live, not to inject what isn't. +3 unit tests.

Also harden bitcoin-knots.bats "valid state" probe: poll ≤30s for a
settled state instead of a single-shot read, so a container caught
mid-reconcile (transient restarting/configured) can't flake a 20-min
iteration. A genuinely-stuck container never settles, so real
breakage is still caught.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 02:22:50 -04:00
archipelago
57a013bc66 test(gate): make 5× the canonical gate, drop 20x naming
Rename run-20x.sh → run-gate.sh, default ARCHY_ITERATIONS 20→5, and scrub
20× references across CLAUDE.md, the master plan, TESTING.md, app-registry
status, the orchestrator/config doc-comments, and the bats suites. Also add
a minimal fail() helper to mempool.bats so guard failures report cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:12:41 -04:00
archipelago
0f05f73a23 fix(mempool): self-healing nginx backend proxy (v3.0.1) + gate timeout
The frontend nginx used a literal proxy_pass host with no resolver, so it
pinned mempool-api's IP at worker startup. When the backend restarts (gate,
OTA, crash, reboot re-IPAM) podman reassigns its IP and nginx keeps proxying
to the dead one -> /api hangs, websocket 502s, UI shows 'offline' until a
manual nginx reload. Same stale-upstream-IP class as the netbird 502.

Fix: mempool-frontend:v3.0.1 rewrites the generated nginx-mempool.conf to
re-resolve the backend per-request via 'resolver' + a variable proxy_pass.
Resolver address is read from /etc/resolv.conf (podman aardvark-dns answers
on the network gateway, not Docker's 127.0.0.11). Per-location path mapping
preserved (ws -> '/', /api/v1 identity via no-URI, /api/ -> /api/v1/ rewrite).
Proven on .228: backend IP change now auto-recovers with no reload; the
literal-host control still 502s. Migrated the manifest off the retired
tx1138 registry to vps2.

Also: mempool.bats #74 waited only 180s post-restart (the slow path) and
called an undefined 'fail' helper (status 127). Bumped to 300s to match the
passing parity probes and emit a real failure instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:07:07 -04:00
archipelago
c8acc84506 docs: §2 invariant single-node (.228); multinode → separate plan 2026-06-22 17:23:19 -04:00
archipelago
8355453a7e docs: exact cutoff-proof resume in master-plan SS8b (resume from any device)
Captures: .228 1x-GREEN (110/110); hardened 5x DETACHED on .228 (/tmp/gate-5x2.log,
nohup — survives terminal close) with the exact check-from-any-machine command; all
shipped code fixes (commits) + deploy state (.228 + .198); node-state fixes NOT in
repo (lnd nginx proxy 8081->18083, home-assistant orphan unit removed, electrumx
re-registered); the run-ON-the-node lesson; and remaining work.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:22:29 -04:00
archipelago
98f4fa44a8 test(gate): harden readiness for sustained 5x churn + inter-iteration settle
The 1x gate is green; the 5x failed iters 1-2 on readiness-under-churn (apps DO
recover — lnd synced, mempool just mid-restart when probed — but slower than the
windows when restarted back-to-back). Hardening:
- run-20x.sh: best-effort settle_stack() before each iteration (wait for
  mempool-api/frontend + lnd RPC healthy, 180s, on-node, never fails the run).
- required containers present/running (80/81): wait-loops (180s) not single-shot.
- mempool api/frontend (87/88): retry ~180s not single-shot.
- mempool queryable (74): 60s->180s. lnd restart-running (64): 120s->240s.
  lnd getinfo (60): 90s->240s retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:11:15 -04:00
archipelago
22b05de6d9 docs(roadmap): P1 mobile app-launch UX — drop 'opens in a tab' interstitial
Companion app: open every app in the in-app WebView (not just non-iframeable),
carrying the mobile-iframe footer controls into the WebView. Mobile web (PWA):
open tab-apps directly in a new tab. No interstitial on either surface. Touch
points + prior commits (b5a9deb8, d1fbcd9b) noted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 16:57:44 -04:00
archipelago
27299ea687 docs: make the production test gate a SINGLE-NODE (.228) criterion; split out multinode
Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC).
Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md
with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale
nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites.
Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 16:47:34 -04:00
archipelago
892ff083c4 test(gate): fix the last 4 readiness/config false-fails (none are product bugs)
On a proper on-node .228 run (synced bitcoin, 4-fix binary) the lifecycle matrix is
green; these 4 were test-harness issues:
- lnd 'recovers after restart' (65): bump retry window 90s->240s. lnd cold-restart
  recovery (wallet unlock + bitcoind reconnect + graph sync) exceeds 90s on a loaded
  node but DOES complete (synced_to_chain:true).
- bitcoin ui responds (89): retry ~120s instead of single-shot (companion nginx may
  have just been recreated by the companion-survives test).
- probe_app_url (99 lnd proxy + all ui-coverage proxy probes): retry up to 90s for
  post-restart proxy/UI readiness instead of single-shot.
- required endpoints after restart (94): :8081 is nginx-proxy-manager, an OPTIONAL
  app (not in required_containers) — only assert it when NPM is installed; and make
  the trailing lncli getinfo a retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 15:43:51 -04:00
archipelago
8893055810 test(gate): retry lnd getinfo for RPC readiness (wallet-unlock lags 'running')
lnd's RPC isn't ready until its wallet auto-unlocks on (re)start, which lags the
container 'running' state — single-shot lncli getinfo raced that window and
false-failed (gate tests 60 + 85). Retry up to ~90s like a health probe. lnd is
functional (getinfo returns cleanly once ready).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 14:45:36 -04:00
archipelago
53b8e47f1d test(gate): fix two false-failing lifecycle tests (not product bugs)
- immich restart: bump wait 120s->240s. Restart = ordered stop+start of the 3-
  container stack (postgres->redis->server w/ DB migrations), so it needs at least
  as long as the start test (180s) — the old 120s was inconsistent and false-failed
  on loaded nodes. immich does return to running.
- fedimint orphan check: the unanchored 'total' regex (^fedimint) counts the
  legitimate fedimint-clientd (dual-ecash bridge) but the anchored 'known' regex
  omitted it -> total>known false orphan on every node running fedimint-clientd.
  Add fedimint-clientd to known.

Both run as LOCAL podman/systemctl on the gate runner, so they test the runner node
(.116), not the RPC target — surfaced while driving the .228 gate green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 14:11:35 -04:00
archipelago
f4727bfdb3 docs(gate): companion self-heal fix validated (10s) + test-31 harness caveat
Independent companion loop (452f05d8) validated on .228: deleted archy-electrs-ui
recreates in ~10s (was stuck 100s+). Also: companion-survives bats does LOCAL
rm/systemctl --user, so running it from .116 via RPC tests .116's companions with
.116's binary, NOT the remote target — must run ON the target node. Explains the
'failed on both nodes' runs (both silently tested .116).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 13:44:57 -04:00
archipelago
452f05d849 fix(reconciler): decouple companion self-heal onto its own cadence
The companion-unit repair stage ran at the END of each boot-reconciler tick, after
reconcile_existing(). On a heavily loaded node that per-app pass takes >60-90s, so a
deleted/lost companion unit (electrs-ui, bitcoin-ui, …) wasn't repaired within any
reasonable window (gate test 31 'deleted unit recreated within one reconcile tick'
timed out at 90s on the 45-app .228 node). Detecting + rewriting a companion unit is
cheap, so spawn it as its own ~interval(30s) loop, independent of the slow app pass.
Handle is aborted when the main loop exits (shutdown uses notify_one, so a second
waiter would steal the wake permit). tick() is now app-reconcile only.

All 4 boot_reconciler cadence tests still green (companion_stage=false in tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 13:04:28 -04:00
archipelago
de7d3d83dc docs(gate): final read — every failure fixed/explained, no lifecycle bugs remain
Last 2 .228 stragglers confirmed load/timing, not bugs: test 31 (companion recreate)
= contamination + ~108s reconcile cadence > 90s window; test 55 (immich restart) =
heavy stack restarts >120s under load but DOES return. Path to literally-green gate
is infra (bitcoin sync, re-quadletize .228) + minor test-window tuning. Optional
product improvement noted: independent ~30s companion-reconcile cadence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 12:36:03 -04:00
archipelago
76b23adcc0 docs(gate): test 31 root-caused = .228 contamination (not a product bug)
companion::reconcile only recreates a deleted companion unit when its parent
backend is in manifest_ids. On contaminated .228, electrumx ran as plain podman
and was NOT a tracked manifest install (manifest on disk but unloaded), so the
reconciler never iterated it -> archy-electrs-ui companion orphaned. Proven:
package.install electrumx re-registered it + restored the companion. Self-heal
logic is sound; test 31 clears on re-quadletize. electrumx on .228 de-contaminated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 11:34:55 -04:00
archipelago
47a5148865 docs(gate): two-node result — stop blocker FIXED; residual red is bitcoin-IBD + node prep
.228 104/110, .198 94/110 with the 3-fix binary. Every package.stop test passes on
healthy apps. .198's 14/16 failures trace to bitcoin in IBD (test 83: ~137k blocks
behind) cascading to lnd/btcpay/electrumx/mempool. 2 node-independent: companion
recreate (31, both nodes), fedimint orphan pollution (44). Path to green 5x gate is
now infra (sync bitcoin, re-quadletize .228) + minor (test 31), not lifecycle bugs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 11:09:12 -04:00
archipelago
b090235b04 docs(gate): 3 stop bugs FIXED, electrumx suite GREEN on .228
Stop failure was 3 real product bugs (grace / reconcile-resurrection /
container-list user-stopped state), all fixed (2dad64b2, 760a32bc, 6e49ce6f) +
deployed. electrumx lifecycle suite 10/10 green (66s). fedimint 'crash loop' was
probe-induced churn (stable when left alone). Validating breadth next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 09:49:45 -04:00
archipelago
6e49ce6f88 fix(container-list): report user-stopped apps as stopped despite live UI companion
A user-stopped backend (electrumx, bitcoin, lnd, fedimint) kept reading 'running'
in container-list because its UI companion (electrs-ui, …) still serves the launch
port, and the state-refresh upgrades any reachable launch port to 'running'. The
gate's wait_for_container_status <app> stopped therefore never saw 'stopped'.

Fix: load the user_stopped marker in handle_container_list and force 'stopped' for
those apps before the launch-port refresh. The reconcile guard keeps the backend
down, so the marker is authoritative. package.start clears it first, so a started
app reports 'running' normally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 09:26:30 -04:00