archy

lfg2025/archy

Author	SHA1	Message	Date
archipelago	43934eefa5	test(gate): destructive all-apps lifecycle matrix (WS-F#3) Active counterpart to the read-only all-apps-matrix.bats: drives stop/start/restart for every installed app and, under ARCHY_ALLOW_CASCADE_DESTRUCTIVE, a FULL teardown (uninstall → no-ghost → reinstall) — the broad coverage F needs beyond the ~8 core suites. App set is discovered from My Apps ∩ the node catalog; reinstall spec comes from catalog.json {dockerImage, containerConfig}. PROTECTED by default (never cycled or torn down): bitcoin/electrum (expensive resync) AND lnd/btcpay/fedimint (teardown = irreversible wallet/channel/guardian loss). The user asked to protect only bitcoin+electrum; the wallet apps are added for safety and can be removed via ARCHY_MATRIX_PROTECT. Heavy + destructive → a supervised pass, not folded into run-gate. Validated on .228: discovery excludes the 6 protected installed apps; lifecycle tier cycles a single app (botfights) stop/start/restart green; teardown gated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 06:29:22 -04:00
archipelago	b7d9210784	test(gate): optional ARCHY_GATE_CASCADE pass — wire the cascade tier in run-gate.sh ran only the DESTRUCTIVE tier; the cascade-uninstall suite (uninstall→no-ghost→reinstall, the #13/#14/uninstall-hang regression guard) existed but was never enabled by the gate. Add an opt-in single cascade pass after the 5× loop (ARCHY_GATE_CASCADE=1, requires ARCHY_ALLOW_DESTRUCTIVE=1), counted into the pass/fail tally. Kept out of the 5× loop deliberately — uninstall/reinstall every iteration would balloon runtime and re-pull images; one pass guards the class. Default gate behavior unchanged. Validated: cascade-uninstall.bats 7/7 on .228. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 05:22:45 -04:00
archipelago	41e7f500f8	test(lifecycle): tolerate slow-but-healthy heavy-app recovery under 5x churn The 5x destructive gate on heavy nodes false-failed on transient windows during stack recovery, not real regressions: - immich.bats: lan_address port-publish probe 30s -> 90s. The postgres->redis ->server (DB migrations on boot) stack can take >30s to republish :2283 after a churn-induced recreate; destructive-tier immich tests already allow 180-240s. - mempool.bats: orphan-container check now polls to steady state (<=30s) instead of a single-shot count, which caught a recreated member briefly visible alongside its replacement mid-reconcile. - run-gate.sh: settle cap 180s -> 300s and also gate on immich's :2283 when installed, so the next iteration's read-only probe doesn't race a still- recovering stack. Settle returns the instant every probe is green. A genuinely unexposed/orphaned/unhealthy app still fails these checks; they only absorb the transient recreate window under sustained churn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 09:18:34 -04:00
archipelago	0406af522c	test(lifecycle): add manifest-driven all-apps health matrix The per-app suites cover ~8 core apps in depth; nothing covered the ~30 others (jellyfin, vaultwarden, penpot, nextcloud, grafana, …). all-apps-matrix.bats derives the app set from server.get-state package-data (no hardcoded list) and asserts baseline health across EVERY installed app: - settles to a non-transitional state within a window (the #13/#14 stuck-ghost class, generalized fleet-wide — installing/removing that never settles) - not in error/failed - reports a recognized (non-garbage) state - every running UI app (manifest ui=="true") exposes a non-null lan-address (the immich/port-drift unreachable-UI failure, generalized to all UI apps) Read-only, so it joins run.sh/run-gate.sh on every node and grows coverage as nodes install more apps. Verified 5/5 on .228 (17 apps) and .116 (20 apps). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 05:27:10 -04:00
archipelago	57a69257c4	test(lifecycle): add CASCADE uninstall/reinstall tier (guards #13 ghost, #14 reinstall) The 5x gate is DESTRUCTIVE-only and never exercised uninstall/reinstall — where the worst field bugs lived (#13 app ghosting in My Apps after uninstall, #14 reinstall stalling on stale state). New cascade-uninstall.bats drives the full teardown path on a throwaway app (default grafana, precondition-skips if already installed so it can't destroy real data) and asserts: - fresh install reaches running via a truthful, non-silent progression - uninstall makes the entry DISAPPEAR from server.get-state package-data (the literal My Apps map) — no ghost, no stuck uninstall stage - container + (on-node) data dir are gone - reinstall returns to running - node left as found Opt-in via ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1; not yet folded into the canonical gate. Verified 7/7 against .228. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 05:13:53 -04:00
archipelago	ccb594fb85	test(gate): fix bitcoin-knots getinfo-after-restart helper + IBD note It called bats-assert's `fail` (not loaded in this file) → "fail: command not found"/127, masking the real reason. Emit+return instead, bump the cold-restart RPC window 60s→120s (block-index reload), and note a node mid-IBD legitimately can't serve getinfo (environmental precondition, not a product regression). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 06:28:20 -04:00
archipelago	2afd18c6de	test(gate): poll immich lan_address to absorb mid-recreate churn 5× run #4 flaked iter4 on "immich exposes its web UI lan-address (port 2283)": container-list returned lan_address=null because immich_server was momentarily mid-recreate when the read-only tier queried it (passed the other 4 iterations; immich_server does publish 0.0.0.0:2283->2283). Same single-shot-read class as the bitcoin-knots state probe — poll <=30s for the exposed port instead of one read. A genuinely unexposed immich never publishes 2283, so real port drift is still caught. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 03:20:18 -04:00
archipelago	92d7f52dd6	fix(orchestrator): order only live containers on package start/restart package.restart resolved its container list via ordered_containers_for_start, which injected every name from the union startup_order list that wasn't already present — including variant names not live on a given node (mysql-mempool, archy-mempool-api, archy-mempool-web). The phantom mysql-mempool is 2nd in the mempool start order, so do_orchestrator_package_start hit its unknown-app-id fallback, do_package_start failed the inspect ("no such object"), and the `?` aborted the whole start sequence — leaving mempool-api + the frontend down until the health monitor recovered them minutes later. That was the source of the 5× gate flakes #73 (frontend not running in 180s) and #74 (api not queryable in 300s); root-caused from the .228 journal ("Start failed: mysql-mempool"). Replace the inject-then-sort logic with a pure helper order_present_containers that orders only the actually-present containers and never adds phantom entries. startup_order remains a union of name variants across install generations — it's now used purely to order what's live, not to inject what isn't. +3 unit tests. Also harden bitcoin-knots.bats "valid state" probe: poll ≤30s for a settled state instead of a single-shot read, so a container caught mid-reconcile (transient restarting/configured) can't flake a 20-min iteration. A genuinely-stuck container never settles, so real breakage is still caught. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 02:22:50 -04:00
archipelago	57a013bc66	test(gate): make 5× the canonical gate, drop 20x naming Rename run-20x.sh → run-gate.sh, default ARCHY_ITERATIONS 20→5, and scrub 20× references across CLAUDE.md, the master plan, TESTING.md, app-registry status, the orchestrator/config doc-comments, and the bats suites. Also add a minimal fail() helper to mempool.bats so guard failures report cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:12:41 -04:00
archipelago	0f05f73a23	fix(mempool): self-healing nginx backend proxy (v3.0.1) + gate timeout The frontend nginx used a literal proxy_pass host with no resolver, so it pinned mempool-api's IP at worker startup. When the backend restarts (gate, OTA, crash, reboot re-IPAM) podman reassigns its IP and nginx keeps proxying to the dead one -> /api hangs, websocket 502s, UI shows 'offline' until a manual nginx reload. Same stale-upstream-IP class as the netbird 502. Fix: mempool-frontend:v3.0.1 rewrites the generated nginx-mempool.conf to re-resolve the backend per-request via 'resolver' + a variable proxy_pass. Resolver address is read from /etc/resolv.conf (podman aardvark-dns answers on the network gateway, not Docker's 127.0.0.11). Per-location path mapping preserved (ws -> '/', /api/v1 identity via no-URI, /api/ -> /api/v1/ rewrite). Proven on .228: backend IP change now auto-recovers with no reload; the literal-host control still 502s. Migrated the manifest off the retired tx1138 registry to vps2. Also: mempool.bats #74 waited only 180s post-restart (the slow path) and called an undefined 'fail' helper (status 127). Bumped to 300s to match the passing parity probes and emit a real failure instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:07:07 -04:00
archipelago	98f4fa44a8	test(gate): harden readiness for sustained 5x churn + inter-iteration settle The 1x gate is green; the 5x failed iters 1-2 on readiness-under-churn (apps DO recover — lnd synced, mempool just mid-restart when probed — but slower than the windows when restarted back-to-back). Hardening: - run-20x.sh: best-effort settle_stack() before each iteration (wait for mempool-api/frontend + lnd RPC healthy, 180s, on-node, never fails the run). - required containers present/running (80/81): wait-loops (180s) not single-shot. - mempool api/frontend (87/88): retry ~180s not single-shot. - mempool queryable (74): 60s->180s. lnd restart-running (64): 120s->240s. lnd getinfo (60): 90s->240s retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 17:11:15 -04:00
archipelago	27299ea687	docs: make the production test gate a SINGLE-NODE (.228) criterion; split out multinode Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC). Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites. Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 16:47:34 -04:00
archipelago	892ff083c4	test(gate): fix the last 4 readiness/config false-fails (none are product bugs) On a proper on-node .228 run (synced bitcoin, 4-fix binary) the lifecycle matrix is green; these 4 were test-harness issues: - lnd 'recovers after restart' (65): bump retry window 90s->240s. lnd cold-restart recovery (wallet unlock + bitcoind reconnect + graph sync) exceeds 90s on a loaded node but DOES complete (synced_to_chain:true). - bitcoin ui responds (89): retry ~120s instead of single-shot (companion nginx may have just been recreated by the companion-survives test). - probe_app_url (99 lnd proxy + all ui-coverage proxy probes): retry up to 90s for post-restart proxy/UI readiness instead of single-shot. - required endpoints after restart (94): :8081 is nginx-proxy-manager, an OPTIONAL app (not in required_containers) — only assert it when NPM is installed; and make the trailing lncli getinfo a retry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 15:43:51 -04:00
archipelago	8893055810	test(gate): retry lnd getinfo for RPC readiness (wallet-unlock lags 'running') lnd's RPC isn't ready until its wallet auto-unlocks on (re)start, which lags the container 'running' state — single-shot lncli getinfo raced that window and false-failed (gate tests 60 + 85). Retry up to ~90s like a health probe. lnd is functional (getinfo returns cleanly once ready). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 14:45:36 -04:00
archipelago	53b8e47f1d	test(gate): fix two false-failing lifecycle tests (not product bugs) - immich restart: bump wait 120s->240s. Restart = ordered stop+start of the 3- container stack (postgres->redis->server w/ DB migrations), so it needs at least as long as the start test (180s) — the old 120s was inconsistent and false-failed on loaded nodes. immich does return to running. - fedimint orphan check: the unanchored 'total' regex (^fedimint) counts the legitimate fedimint-clientd (dual-ecash bridge) but the anchored 'known' regex omitted it -> total>known false orphan on every node running fedimint-clientd. Add fedimint-clientd to known. Both run as LOCAL podman/systemctl on the gate runner, so they test the runner node (.116), not the RPC target — surfaced while driving the .228 gate green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 14:11:35 -04:00
archipelago	84031e6209	docs: temporarily reduce release lifecycle gate from 20x to 5x Per user direction: the production test gate is 5x (ARCHY_ITERATIONS=5) on .228 AND .198 for now, down from 20x. Restore to 20x before the final ship. Updated CLAUDE.md, PRODUCTION-MASTER-PLAN.md, and tests/lifecycle/TESTING.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:11:00 -04:00
archipelago	b0b54a96fa	test(lifecycle): immich suite — package-level checks, wait-based destructive tier container-list reports stack apps package-level (.name="immich"), so the suite checks the "immich" package (presence, valid state, :2283 lan-address) rather than individual container names. Destructive tier fires async stop/start/restart and asserts on the end state via wait_for_container_status. KNOWN: the destructive tier is flaky for slow multi-container stacks — bats runs ops back-to-back with no settling while immich's async stack ops take 30s+, and stopped reports as "exited" not "stopped". The immich migration itself is verified working (manual stop/start/restart succeed; all 3 containers healthy). Hardening the harness for stack apps (inter-op settling + stopped\|exited acceptance) is a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 09:52:33 -04:00
archipelago	b1f175b927	test(lifecycle): add immich stack lifecycle suite RPC-based (host-agnostic) lifecycle coverage for the manifest-driven immich stack (immich + immich-postgres + immich-redis): presence + valid state of all 3 members, a guard that no legacy underscore containers exist (catches botched migration / legacy-installer fallback), destructive stop/start/restart of the server with postgres+redis staying up, and cascade uninstall/reinstall (preserve_data). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 09:01:19 -04:00
archipelago	03a4ee1b30	feat(container): manifest-declared generated secrets + companion/quadlet hardening Generated-secrets system: apps declare `generated_secrets` in their manifest (kinds hex16/hex32/bcrypt); `container::secrets::ensure_generated_secrets` materialises them 0600/rootless in resolve_dynamic_env — idempotent and self-healing (recovers wrongly root-owned secrets with no privilege). Replaces per-app Rust (deletes ensure_fmcd_password). fedimint-clientd/gateway manifests now declare fmcd-password / fedimint-gateway-hash. companion.rs: rebuild the auto-built :latest image when its build context changes (staleness check) so baked-in fixes (e.g. guardian-UI CSS) actually reach nodes. quadlet.rs: skip PublishPort under Network=host (podman rejects the combo, exit 125) + regression tests. UI: "Fedimint Guardian" rename, fedimint-clientd/nostr-rs-relay/meshtastic tagged as Services (headless backends), gateway icon fallback. Deployed + verified on .228 (generated-secrets fixed fedimint-gateway start; grafana/strfry orphan crash-loop units removed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 05:11:07 -04:00
archipelago	8c8e4d7a29	test: gate that LND wallet is unlocked after restart (catches fleet-wide lock) A wrong/locked LND wallet password leaves the wallet LOCKED after every restart/OTA, breaking all Bitcoin-receive + Lightning ops fleet-wide — and the harness was blind to it: live-lnd-address-type treats 'wallet locked' as PASS, os-audit treated lnd-unreachable as WARN, and the archipelago lnd.getinfo RPC masks a locked wallet (returns all-zero success). - tests/release/run.sh: new 'live-lnd-unlocked' stage polls LND's unauth /v1/state and FAILs if still LOCKED after a 60s grace window. - tests/lifecycle/os-audit.sh: probe lnd.newaddress (the real receive path, which surfaces LND_WALLET_LOCKED) instead of lnd.getinfo; locked = hard FAIL, not-installed = WARN. Proven on .116 (genuinely locked): os-audit now reports '[FAIL] lnd wallet unlocked (lnd.newaddress) wallet LOCKED'. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 10:36:12 -04:00
archipelago	4232424b23	fix(ui): suppress app-unreachable overlay while ElectrumX sync screen shows When ElectrumX is still building its index (or waiting on the Bitcoin node), AppSessionFrame shows a sync 'pre UI'. The iframe-blocked fallback ('App not reachable / retrying') was not gated on electrsSync, so it painted over the sync screen and read as a hard connection error. Gate it on !electrsSync, mirroring the iframe's own guard. Also harden the lifecycle health probe: container_health used jq '// "unknown"', which only catches null/false — an empty-string health (a brief window under load) rendered as a blank 'bad health: X is '. Map empty to 'unknown' so the retry loop keeps waiting instead of failing on a transient. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 07:58:24 -04:00
archipelago	329e7811eb	test(lifecycle): add os-audit OS-wide health gate; docs: v1.7.91 resume notes os-audit.sh: one non-destructive scorecard tying backend/RPC health, the all-apps lifecycle audit (delegates to remote-lifecycle.sh), and the FM-guards (port-drift, secret-completeness, orphan-container sweep, OTA-wedge). The per-boot building block for the reboot-survival loop. FM12 check uses jq has() not // (// treats a legit false as empty). Section A validated all-PASS on .116. docs: v1.7.91 release-pass resume notes + the bitcoinReceive blocker writeup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 04:36:06 -04:00
archipelago	0ed892a412	fix: wallet receive reliability, bitcoin install self-heal, ElectrumX app tile Fixes three Bitcoin/wallet failures observed across the fleet on v1.7.90-alpha (all nodes were already on the latest build — these were live bugs, not stale builds), plus the missing ElectrumX tile, and adds automated coverage so each can't regress silently. Receive address (".116 receive fails", ".228 false 'wallet is locked'"): - LND publishes its REST API on a host port that can drift from the manifest (a container created when the mapping was 8080 kept publishing 8080 after the manifest moved to 18080). The in-process client connects to the manifest port, gets connection-refused, and wallet init fails forever while the container looks "Up". Add published-port drift detection to the reconciler (container_ports_drifted / host_port_bindings_drifted) that recreates a drifted backend even for restart-sensitive apps — a drifted container is already broken, so leaving it "untouched" only perpetuates the failure. - Receive errors now carry a stable [CODE] token (REST_UNREACHABLE, WALLET_LOCKED, WALLET_UNINITIALIZED, SYNCING) and always start with "Bitcoin address" so they survive the RPC error sanitizer instead of collapsing to the generic "Operation failed". The UI maps the code instead of guessing wallet state from substrings — so an unreachable REST endpoint is no longer mislabelled "locked". Bitcoin install (".198 bitcoin gone / reinstall just stops"): - bitcoin-knots requires the secret bitcoin-rpc-txrelay-rpcauth, which was only generated by the tx-relay flow. Nodes that never used tx-relay lacked it, so secret resolution hard-failed and the whole Bitcoin stack cascaded. Generate it idempotently before bitcoin starts (ensure_app_secrets, reusing ensure_txrelay_credentials), and name the missing secret in the error so a genuine gap is actionable instead of a bare "IO error". ElectrumX app tile missing on every node with it installed: - The catalog generator dropped electrumx because the manifest had no interfaces.main block, so the tile had no launch URL and was hidden. Declare the companion UI port (50002) in the manifest, regenerate the catalog, and let an app with a known launch URL stay launchable while its backend is still "starting" (ElectrumX indexes for 10m+). Test harness: - New lifecycle bats suites: bitcoin-receive, port-drift, secret-completeness (validated live; port-drift catches the real .116 drift). - Rust unit tests for drift detection, the receive reason-code classifier, and the named-missing-secret error; vitest for the UI code mapping. - create-release.sh now runs tests/release/run.sh and aborts the release on failure — previously it ran no tests at all. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 03:12:56 -04:00
archipelago	d6f108d818	chore: snapshot release workspace	2026-06-12 03:00:15 -04:00
archipelago	c393b96da3	backend: harden rootless app lifecycle orchestration	2026-06-11 00:24:32 -04:00
archipelago	d736364ad7	fix(apps): stabilize btcpay and public proxy launch flows	2026-05-19 09:26:43 -04:00
archipelago	7804223152	chore: release v1.7.57-alpha	2026-05-17 17:30:04 -04:00
Dorian	835c525218	chore(release): stage v1.7.55-alpha	2026-05-13 15:09:22 -04:00
archipelago	745cb1c626	chore(release): stage v1.7.52-alpha	2026-05-05 11:29:18 -04:00
archipelago	10fbb8f87c	docs(testing): track Phase 3.4 race fix + drift-sync hook * L0 unit count: 630 → 631 (translate_health_check_http_does_not_double_prefix_scheme) * Phase 3 row: add TimeoutStartSec=600 race fix (44f275ed) + drift-sync hook (0889367d) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:53:18 -04:00
archipelago	bd96c0475d	feat(config): ARCHIPELAGO_USE_QUADLET_BACKENDS env override Adds an env-var lever for Phase 3.2's use_quadlet_backends flag so the 20× harness can flip the path on per-node without a config.json edit (which would require an archipelago.service restart — and that triggers FM3 cgroup cascade until Phase 3.5 ships, so we can't ask anyone to reconfigure live nodes that way today). Truthy parsing centralised in `parse_truthy_env` (1, true, yes, on — case-insensitive, whitespace-trimmed). Anything else is false. The helper is unit-tested so future env-var flags can reuse the same shape. Also adds a default-off regression test for use_quadlet_backends so flipping the default ahead of the 20× verification fires immediately. TESTING.md documents the Environment= snippet for the systemd drop-in so the next operator can flip the flag on a debug node without re-deriving the recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 05:44:09 -04:00
archipelago	9a89a000d4	test(lifecycle): post-condition gate for use_quadlet_backends path A six-test bats suite that validates what install_via_quadlet (Phase 3.2) is supposed to leave behind: * `.container` unit on disk in $XDG_CONFIG_HOME/containers/systemd/ with [Container] / [Service] / [Install] sections, Image= present, and Restart=on-failure (the backend invariant — companions use Always) * Phase 3.4 cross-check: any unit with HealthCmd= must also emit Notify=healthy, otherwise systemctl start won't gate on health * `systemctl --user is-active` returns 0 for the .service * podman shows the container running * the container's cgroup is under user.slice/, NOT under archipelago.service — the kernel-level proof that FM3 cgroup cascade SIGKILL is structurally fixed for this container Auto-skips on every test when no backend Quadlet units exist (today's default state, use_quadlet_backends=false) — so the suite is a no-op on current fleet boxes and turns into a hard regression gate the moment anyone flips the flag and reinstalls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 05:34:47 -04:00
archipelago	97ce23d773	feat(quadlet): Phase 3.4 — health-gated startup via Notify=healthy QuadletUnit gains an optional HealthSpec; from_manifest translates the manifest's health_check (tcp/http/cmd) into a HealthCmd= directive and emits Notify=healthy alongside it. systemctl start <unit>.service then blocks until the container's first green probe — eliminating the "container up but RPC not ready" race the orchestrator currently papers over with post-start polling. Translation policy: * tcp, endpoint "host:port" -> nc -z host port * http, endpoint "host:port", path -> curl -fsS -m 5 http://endpoint<path> * cmd, endpoint "<shell command>" -> verbatim * unknown type / malformed endpoint -> None (skip Notify=healthy rather than emit a HealthCmd that hangs the unit start forever) Companion units leave health: None and remain byte-identical to before this PR — the renderer only emits the Health* / Notify= block when set. +4 quadlet unit tests (19 total). Dropped a never-used test setter that was generating a dead_code warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 05:21:57 -04:00
archipelago	65576bd755	feat(orchestrator): Phase 3.3 — in-place migration to Quadlet When use_quadlet_backends flips from off → on, existing fleet boxes have backend containers parented under archipelago.service's cgroup (the bad shape that triggers FM3 cascade SIGKILL on every archipelago restart). ensure_running now notices and corrects this: * If there's already a `<name>.container` unit on disk → no-op (subsequent reconcile ticks take this fast path). * Else if a podman container with that name exists → it's a pre-3.3 artifact. Stop+remove it (volumes survive — bind mounts are not touched by `podman rm`), then write the Quadlet unit, daemon-reload, and start the new managed service. * Else → fall through to install_fresh, which already routes through install_via_quadlet when the flag is on. The migration is idempotent and self-healing: if a fleet box is half-migrated (unit on disk but no service active, or service active but stale unit), the next reconcile tick converges. Bitcoin chain data, lnd wallet state, and electrumx index all live on host bind mounts and are unaffected by the container-record swap. Volume safety audited per backend in `uses_orchestrator_install_flow` allowlist — every entry mounts its data dir as a host bind mount. Default still off. To migrate a node: /etc/archipelago/config.toml: use_quadlet_backends = true followed by `systemctl restart archipelago` — the next reconcile tick walks every managed app and migrates each in turn. Tests: 624 passing, 0 cargo warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:27:59 -04:00
archipelago	5b2e02bd43	feat(orchestrator): Phase 3.2 — wire Quadlet path behind feature flag prod_orchestrator::install_fresh now branches on the new Config::use_quadlet_backends flag (default false): * off (today's production behavior) — unchanged: runtime.create_container + start_container, container parented under archipelago.service's cgroup, FM3 cascade SIGKILL on every archipelago restart. * on — install_via_quadlet renders the manifest as a Quadlet unit via QuadletUnit::from_manifest, writes it atomically into ~/.config/containers/systemd/, calls daemon-reload, and starts the generated <name>.service. Container ends up under user.slice — no more cgroup parented under archipelago, so archipelago restarts don't touch the container's lifetime. Default off so this commit is structurally safe to ship: nothing changes at runtime until an operator opts in. Flip the default once tests/lifecycle/run-20x.sh has gone green against the new path on .228 + .198 (the v1.7.52 release gate). Plumbing: * config.rs — `use_quadlet_backends: bool` w/ Default false * prod_orchestrator.rs — flag stored on the struct, threaded through new(), with set_use_quadlet_backends(bool) test setter * prod_orchestrator.rs — install_via_quadlet helper * dropped the Phase-3.1 #[allow(dead_code)] markers on from_manifest / parse_memory_mib / RestartPolicy::OnFailure now that the call path exists; if a future revert removes the wiring, the warnings come back. Tests: 624 passing, cargo check clean (0 warnings). Existing companion behavior unaffected — render_skips_backend_directives_when_default still passes byte-equal to before quadlet.rs grew the new fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:22:10 -04:00
archipelago	9becafafd3	feat(quadlet): backend-manifest renderer (Phase 3.1 of v1.7.52) The QuadletUnit struct now covers everything a backend manifest needs (ports, environment, devices, add_hosts, entrypoint+command, read-only root, no_new_privileges, cpu_quota, restart policy choice). Adds QuadletUnit::from_manifest(&AppManifest, name) that translates a parsed manifest into a unit, plus parse_memory_mib for "1g"/"512m"/raw-MiB forms. The renderer skips empty/false directives so existing companion units render byte-identically — no behavior change for shipping companions; the backend renderer is dead code until Phase 3.2 wires it into the orchestrator. Eight new unit tests cover: * parse_memory_mib forms (1024, 512m, 2g, garbage) * shell_join quoting (whitespace, embedded quotes) * RestartPolicy → systemd string mapping * render emits backend directives when set * render skips them when defaulted (companion regression gate) * from_manifest happy path on a bitcoin-knots-shaped manifest * from_manifest read-only volume detection * from_manifest tmpfs filtering * end-to-end manifest → render bytes assertion Tests: 615 → 624 (+9 net; one pre-existing parse_memory_mib path was implicitly covered before but is now explicit). Cargo warnings: 0. `from_manifest`, `parse_memory_mib`, and `RestartPolicy::OnFailure` are marked allow(dead_code) with explicit references to Phase 3.2 — if 3.2 doesn't wire them, the dead-code warning resurfaces. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:09:50 -04:00
archipelago	5074572373	test(lifecycle): add btcpay + fedimint + mempool suites Brings L1 (RPC API) + L3 (lifecycle survival) parity coverage to the three multi-app stacks that were previously only touched by required-stack.bats. Combined with bitcoin-knots / lnd / electrumx already shipping, the six core apps now have dedicated bats files. Each suite is shaped like the existing single-container suites (bitcoin-knots / lnd / electrumx) and gates every assertion on the backing container actually being present, so a node without the stack installed gets clean skip messages instead of false fails. * btcpay.bats — 9 tests, including stack-wide presence and a "supporting containers don't cascade-restart" guard * fedimint.bats — 8 tests, single container * mempool.bats — 9 tests, mixed legacy + orchestrator-managed stack; reuses the :8999 mempool-api probe from required-stack for parity Total bats now: 88 (was 53 → +35). TESTING.md matrix advances 23 → 50 of 110 cells. UI URL coverage for these three apps already lives in ui-coverage.bats, so this PR doesn't duplicate proxy-path probes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:55:31 -04:00
archipelago	ec1dce93a9	docs(testing): canonical scorecard for container subsystem testing Single source of truth for "where are we, where are we going" on the v1.7.52 container excellence work. Replaces ad-hoc tracking in chat. Sections: * Test layers L0..L6 with toolchain + per-iteration latency * Per-app × per-state coverage matrix (23 of 110 cells today; goal 110) * Layer-by-layer status (L0+L1+L2 ●; L3 ◐; L4..L6 ○) * Run commands (single suite / full suite / 20×) * LoC budget — -270 committed, ~1,616 more possible if Phase 3 ships * Performance KPIs (TBD — measure first, target second) * Release gates — 8 boxes that must tick before v1.7.52 ships The file lives in-repo so PR diffs to it answer "what did this commit improve?". If you can't tick the box, the change isn't ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:52:42 -04:00
archipelago	b9eb6eb18a	test(lifecycle): add UI surface coverage — HTTPS proxy + iframe URLs Closes the coverage gap where existing bats suites would report green on a node whose dashboard tiles 502 because the proxy upstream is dead. First pass against .198 caught real prod issues immediately: /app/lnd/ → 502 (lnd container exited) /app/mempool/ → 502 (mempool container exited) /app/fedimint/ → 502 (fedimint container exited) while existing tests reported only "container is up: false" with no 404/502 distinction. * lib/ui-probes.bash — sourced helper. probe_https_200, probe_app_url (skip-if-container-down else assert-200), probe_dashboard_shell (asserts the Vue SPA HTML, not nginx default — catches the layout regression from feedback_release_tarball_layout.md), probe_dashboard_catalog (asserts /catalog.json non-empty). * bats/ui-coverage.bats — 9 @test cases covering the dashboard + bitcoin-ui :8334 + the seven HTTPS_PROXY_PATHS most users hit (lnd, electrumx, mempool, fedimint, btcpay, filebrowser). URL list mirrors HTTPS_PROXY_PATHS in neode-ui/src/views/appSession/appSessionConfig.ts. Divergence between the two is the exact bug class we're guarding against. Loops clean under run-20x.sh. Container-state oracle is via local podman inspect, so the suite must run on the archy host (same as companion-survives-archipelago-restart.bats). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:49:30 -04:00
archipelago	01f416ae5d	test(lifecycle): regression gate for FM3 cgroup-cascade SIGKILL Sister suite to companion-survives-archipelago-restart.bats. That one tests the same property for UI companions, which already ship via Quadlet (commit 6e716f68) and so already pass. This new suite tests the property for backend containers (bitcoin-knots / bitcoin-core / lnd / electrumx). Until v1.7.52 Phase 3 ships these under Quadlet too, the suite is EXPECTED TO FAIL on fleet boxes — it's the executable definition of "FM3 fixed". Observed live on .198 on 2026-05-01: `sudo systemctl stop archipelago` killed every container in archipelago.service's cgroup. The dedicated "backends survive archipelago restart" test catches exactly that, and also verifies the SAME container instance survives (compares pre/post .Id), so an orchestrator that recreates a fresh container after the SIGKILL doesn't read as pass. Three @test cases: * destructive gate (skip-marker for the suite) * baseline: at least one backend installed + running * backends survive: same .Id pre + post archipelago restart Don't gate releases on this passing until Phase 3 lands; before then treat it as a "expected to fail / shows progress" indicator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:17:27 -04:00
archipelago	f80daff8ba	test(lifecycle): add dedicated electrumx.bats suite Same shape as bitcoin-knots.bats and lnd.bats so the 20× release-gate exercises electrumx through the same state matrix it uses for the other two core apps. electrumx previously had a single TCP-port check inside required-stack.bats; this adds destructive + cascade-destructive tiers. 10 @test cases: * read-only: presence, valid state, TCP port (50001) reachable, no orphan containers beyond {electrumx, archy-electrs-ui} * destructive: stop, start, restart, TCP port recovers within 120s of cold restart (longer than bitcoind because electrumx replays its index against bitcoind on start) * cascade: uninstall, reinstall (240s timeout for index rebuild) With this suite, the three single-container core apps (bitcoin-knots, lnd, electrumx) now have parity coverage. Multi-container stacks (btcpay, mempool, fedimint) come next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:11:02 -04:00
archipelago	1103c2c710	test(lifecycle): add dedicated lnd.bats suite Mirrors bitcoin-knots.bats so the 20× release-gate run exercises lnd through the same state matrix. lnd previously had only a single read-only check inside required-stack.bats; this adds the destructive and cascade-destructive tiers that match what we already test for bitcoin-knots. 10 @test cases: * read-only: presence, valid state, lncli getinfo, no orphan containers * destructive (ARCHY_ALLOW_DESTRUCTIVE=1): stop, start, restart, RPC recovers within 90s of cold restart (longer than bitcoind because the wallet has to unlock first) * cascade (ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1): uninstall, reinstall Reuses the same lncli invocation as required-stack.bats so divergence shows up clearly if either test breaks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:09:43 -04:00
archipelago	1b6c500657	test(lifecycle): add setup-teardown + run-20x harness scaffolding Phase 4 of the v1.7.52 container excellence plan: a release-gate harness that loops the bats suite N times in a row, with teardown between iterations, and reports a pass/fail tally. * setup-teardown.sh — clears /tmp/archy-rpc-session-* between runs so iteration N+1 doesn't reuse a logged-out cookie from iteration N. Idempotent; safe to run anytime. Designed to grow as we add suites that leave other transient state. * run-20x.sh — wraps run.sh in a loop of ARCHY_ITERATIONS (default 20). Tracks per-iteration pass/fail with wall-clock timing, prints a results block, exits non-zero on any failure. Honors ARCHY_FAIL_FAST for short-circuit during dev. Suggested release-gate command: ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 \ tests/lifecycle/run-20x.sh Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:06:09 -04:00
archipelago	23c4e7441f	refactor(container): move companion UIs to systemd via Quadlet Companion UI containers (archy-bitcoin-ui, archy-lnd-ui, archy-electrs-ui) used to be launched as fire-and-forget tokio::spawn blocks from install.rs. If archipelago crashed mid-spawn or the container's cgroup was reaped, companions vanished from podman ps -a and only a manual rm/run could bring them back (the .228 incident). Now each companion is rendered as a Quadlet .container unit under ~/.config/containers/systemd/, daemon-reloaded, and started via systemctl --user. systemd owns supervision from that point on: - archipelago can crash, restart, or be uninstalled without touching any companion. - Quadlet's Restart=always + RestartSec=10 handles container exits. - A 30s reconcile tick in boot_reconciler enumerates expected companion units and re-installs any whose unit file or service vanished — defense-in-depth against external tampering. New module layout: - container/quadlet.rs: pure unit renderer + atomic write_if_changed + systemctl helpers (daemon_reload_user / enable_now / disable_remove / is_active). 6 unit tests, no I/O in the renderer. - container/companion.rs: per-app companion specs, install/remove/ reconcile, image presence (build local first, fall back to insecure registry only via image_uses_insecure_registry whitelist). 2 tests. install.rs handle_package_install now ends with a single call to companion::install_for(package_id), replacing 287 lines of spawn-and- hope shellouts plus a ~120-line nginx auth-injector helper that worked around per-node RPC password baking. The helper is gone too — the pre-start hook renders the per-node nginx.conf to /var/lib/archipelago/ bitcoin-ui/nginx.conf and the Quadlet unit bind-mounts it read-only. runtime.rs handle_package_uninstall now disables companions before the container rm loop. Otherwise systemd's Restart=always would respawn each companion within ~10s of removal. Tests: 53 container tests pass, including 6 quadlet renderer tests (host network, bridge network, capability set, atomic write idempotence) and 2 companion specs (per-app companion lookup, build_unit shape). boot_reconciler tests gain a #[cfg(test)] without_companion_stage() flag so the paused-clock fixtures don't race the real systemctl I/O. A bats regression test (companion-survives-archipelago-restart.bats, gated on ARCHY_ALLOW_DESTRUCTIVE=1) asserts the .228 failure mode cannot recur: every installed companion has a unit file, services stay active across systemctl --user restart archipelago, and a deleted unit file is recreated within one reconcile tick. Net delta: +941 / -363, but the +941 is mostly tests (~440 lines) and the new declarative layer; the imperative tokio::spawn block and its nginx-auth helper are gone, removing two failure classes (orphan companions on archipelago crash, and post-start exec races under tightly-confined cgroups) that previously needed manual SSH recovery. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:45:07 -04:00
archipelago	8f83b37d51	feat(orchestrator): complete container migration and release hardening	2026-04-28 15:00:58 -04:00

45 Commits