lfg2025/archy

Author	SHA1	Message	Date
archipelago	c0751e2551	chore(release): stage v1.7.54-alpha	2026-05-06 09:23:57 -04:00
archipelago	1a0d8a432c	chore(release): stage v1.7.53-alpha	2026-05-05 13:59:50 -04:00
archipelago	745cb1c626	chore(release): stage v1.7.52-alpha	2026-05-05 11:29:18 -04:00
archipelago	aad0ba5234	feat(orchestrator): drift-sync existing Quadlet units on each reconcile When a Quadlet unit file already exists for an orchestrator-managed backend, sync its on-disk bytes against what the current renderer produces. write_if_changed makes this idempotent — when bytes match, no IO; when they differ (post-deploy of a renderer change), the file is rewritten and systemctl --user daemon-reload runs once. We deliberately do NOT restart the .service when the file changes: running containers keep their current config until the operator restarts them. That's the right tradeoff — file updates are cheap and non-destructive; service restarts are the SIGKILL cascade we're trying to eliminate. Why this matters: pre-this-commit, every renderer change required a fresh package.install RPC per app to take effect. Observed live on .228 2026-05-02 — the TimeoutStartSec=600 fix shipped in code but existing units stayed on the old format because nothing triggered a re-render. Combined with state.json being empty (so the reconciler's auto-install path didn't fire either), the fix was invisible until manual unit deletion. Companions (UI_APP_IDS) are skipped — companion.rs renders those units with a different shape; syncing here would clobber them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:43:18 -04:00
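The idempotence described for write_if_changed can be sketched as below. This is a minimal illustration, not the repository's code: the signature, the bool return value, and the temp-file-plus-rename strategy are assumptions; only the "no IO when bytes match, rewrite when they differ" behavior comes from the commit message.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes `contents` to `path` only when the on-disk bytes differ.
/// Returns Ok(true) if the file was (re)written, Ok(false) if it already matched.
/// (Signature and return type are assumptions made for this sketch.)
fn write_if_changed(path: &Path, contents: &[u8]) -> io::Result<bool> {
    match fs::read(path) {
        // Bytes already match: nothing to do.
        Ok(existing) if existing.as_slice() == contents => return Ok(false),
        // File exists but differs: fall through and rewrite.
        Ok(_) => {}
        // File does not exist yet: fall through and create it.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        // Any other error (permissions, EIO) is reported to the caller.
        Err(e) => return Err(e),
    }
    // Write a sibling temp file, then rename it over the target so a reader
    // never observes a half-written unit file.
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, contents)?;
    fs::rename(&tmp, path)?;
    Ok(true)
}
```

A caller could collect the returned flags across all units and run `systemctl --user daemon-reload` once if any of them is true.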
archipelago	281e65e697	fix(quadlet): TimeoutStartSec=600 when Notify=healthy is set Bug surfaced live on .228 2026-05-02 — every backend Quadlet unit (lnd, electrumx, fedimint, btcpay-server, mempool-api, bitcoin-knots) hit systemd's default 90s start timeout because Notify=healthy makes systemctl wait for the first green health probe, but HealthInterval=30s × HealthRetries=3 = 90s minimum even on a healthy service. Race: timeout fires the moment the third probe MIGHT succeed. Result was three different post-states (inactive+running, failed+missing, inactive+stopped) depending on whether systemd's ExecStopPost ran podman rm before the orchestrator's adoption logic re-grabbed the container. Fix: when health is set, render TimeoutStartSec=600 (10 minutes) into [Service]. Long enough for slow-starting backends (electrumx index replay, lnd wallet unlock) without being so long that a truly stuck unit hangs forever. Companions stay unchanged (no health → no override, default 90s applies). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 07:14:48 -04:00
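The timing argument (30s × 3 retries = 90s, equal to systemd's default start timeout) and the conditional override can be sketched as follows. The `Health` struct, its field names, and `render_service_section` are hypothetical names for illustration; only the 600-second value and the "no health, no override" rule come from the commit message.

```rust
/// Hypothetical health settings; the field names are illustrative only.
struct Health {
    interval_secs: u32,
    retries: u32,
}

/// Time until the last allowed probe has run, per the commit's arithmetic.
fn probe_window_secs(h: &Health) -> u32 {
    h.interval_secs * h.retries
}

/// Renders a [Service] section, adding the longer start timeout only when
/// a health check is configured (companions pass None and get no override).
fn render_service_section(health: Option<&Health>) -> String {
    let mut out = String::from("[Service]\n");
    if health.is_some() {
        out.push_str("TimeoutStartSec=600\n");
    }
    out
}
```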
archipelago	384f12de7a	fix(quadlet): http:// double-prefix + companion migration race Two bugs surfaced by the first real-node validation of Phase 3.2-3.4 on .228 (2026-05-02), both caught before flipping the default. Bug 1 — translate_health_check double-prefixed http://. Manifests in the wild carry the scheme inside the endpoint string ("http://localhost:8175"), and we were prepending another http:// unconditionally. Result on .228: every backend HealthCmd read `curl -fsS -m 5 http://http://localhost...`, every probe failed, fedimint hit a 14-restart loop. Now we accept either form and skip appending hc.path when the endpoint already carries one. Regression test asserts no double-prefix and that an in-endpoint path is honoured. Bug 2 — Phase 3.3 migration ran for UI companions (bitcoin-ui / electrs-ui / lnd-ui) that have shipped via Quadlet since v1.7.41. Migration tore down the running companion + raced companion.rs render, producing "Phase 3.3: re-install archy-bitcoin-ui via Quadlet" reconcile errors and leaving archy-bitcoin-ui down. Companions now short-circuit out of migrate_to_quadlet_if_needed before any IO. Also: when try_exists returns Err for an unrelated reason (permissions, EIO), we now skip migration instead of treating "I can't tell" as "go ahead and migrate" — migrating on top of a possibly-existing unit is destructive. What this does not fix yet: * the orchestrator's reconciler iterating every manifest in /opt/archipelago/apps/, not just installed apps. Pre-existing behavior (also affects the legacy path) — separate scope. * fedimint /data UID mismatch surfaced when Quadlet started fedimint fresh. Likely orthogonal — defer. * no rollback when install_via_quadlet fails after a remove_container. Tracked as Phase 3.3.1 — defer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 06:37:37 -04:00
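One way to accept endpoints with or without a scheme, and to skip the manifest path when the endpoint already carries one, is sketched below. The function name `health_url` and its two-argument shape are assumptions for this example; the real translate_health_check may be structured differently.

```rust
/// Builds a probe URL from a manifest endpoint that may or may not include
/// a scheme ("http://localhost:8175" or "localhost:8175") and a path.
/// (Hypothetical helper; not the repository's actual function.)
fn health_url(endpoint: &str, path: &str) -> String {
    // Keep an existing scheme; default to http when none is present.
    let (scheme, rest) = match endpoint.split_once("://") {
        Some((s, r)) => (s, r),
        None => ("http", endpoint),
    };
    if rest.contains('/') {
        // The endpoint already carries a path: honour it, ignore `path`.
        format!("{}://{}", scheme, rest)
    } else {
        format!("{}://{}{}", scheme, rest, path)
    }
}
```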
archipelago	bd96c0475d	feat(config): ARCHIPELAGO_USE_QUADLET_BACKENDS env override Adds an env-var lever for Phase 3.2's use_quadlet_backends flag so the 20× harness can flip the path on per-node without a config.json edit (which would require an archipelago.service restart — and that triggers FM3 cgroup cascade until Phase 3.5 ships, so we can't ask anyone to reconfigure live nodes that way today). Truthy parsing centralised in `parse_truthy_env` (1, true, yes, on — case-insensitive, whitespace-trimmed). Anything else is false. The helper is unit-tested so future env-var flags can reuse the same shape. Also adds a default-off regression test for use_quadlet_backends so flipping the default ahead of the 20× verification fires immediately. TESTING.md documents the Environment= snippet for the systemd drop-in so the next operator can flip the flag on a debug node without re-deriving the recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 05:44:09 -04:00
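The truthy-parsing rule (1, true, yes, on; case-insensitive; whitespace-trimmed; anything else false) could look roughly like this. The helper names and signatures here are assumptions; the commit only names `parse_truthy_env` and the environment variable.

```rust
/// Returns true only for "1", "true", "yes" or "on", ignoring ASCII case
/// and surrounding whitespace. (Hypothetical stand-in for parse_truthy_env.)
fn parse_truthy(raw: &str) -> bool {
    matches!(
        raw.trim().to_ascii_lowercase().as_str(),
        "1" | "true" | "yes" | "on"
    )
}

/// Reads the override from the environment; unset or unreadable means false.
fn use_quadlet_backends_from_env() -> bool {
    std::env::var("ARCHIPELAGO_USE_QUADLET_BACKENDS")
        .map(|v| parse_truthy(&v))
        .unwrap_or(false)
}
```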
archipelago	97ce23d773	feat(quadlet): Phase 3.4 — health-gated startup via Notify=healthy QuadletUnit gains an optional HealthSpec; from_manifest translates the manifest's health_check (tcp/http/cmd) into a HealthCmd= directive and emits Notify=healthy alongside it. systemctl start <unit>.service then blocks until the container's first green probe — eliminating the "container up but RPC not ready" race the orchestrator currently papers over with post-start polling. Translation policy: * tcp, endpoint "host:port" -> nc -z host port * http, endpoint "host:port", path -> curl -fsS -m 5 http://endpoint<path> * cmd, endpoint "<shell command>" -> verbatim * unknown type / malformed endpoint -> None (skip Notify=healthy rather than emit a HealthCmd that hangs the unit start forever) Companion units leave health: None and remain byte-identical to before this PR — the renderer only emits the Health* / Notify= block when set. +4 quadlet unit tests (19 total). Dropped a never-used test setter that was generating a dead_code warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 05:21:57 -04:00
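The translation policy listed in this commit can be sketched as a single match. This follows the policy as stated here (scheme-less "host:port" endpoints); the three string parameters are a simplification assumed for the example, not the real HealthSpec or manifest types.

```rust
/// Maps a manifest health check to a HealthCmd string, or None when the
/// check cannot be translated safely. (Parameter shape is hypothetical.)
fn translate_health_check(kind: &str, endpoint: &str, path: &str) -> Option<String> {
    match kind {
        "tcp" => {
            // Expect "host:port"; reject anything malformed.
            let (host, port) = endpoint.rsplit_once(':')?;
            if host.is_empty() || port.parse::<u16>().is_err() {
                return None;
            }
            Some(format!("nc -z {} {}", host, port))
        }
        "http" => {
            if endpoint.is_empty() {
                return None;
            }
            Some(format!("curl -fsS -m 5 http://{}{}", endpoint, path))
        }
        // Shell commands are passed through verbatim.
        "cmd" => {
            if endpoint.trim().is_empty() {
                None
            } else {
                Some(endpoint.to_string())
            }
        }
        // Unknown type: emit nothing rather than a probe that can never pass.
        _ => None,
    }
}
```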
archipelago	65576bd755	feat(orchestrator): Phase 3.3 — in-place migration to Quadlet When use_quadlet_backends flips from off → on, existing fleet boxes have backend containers parented under archipelago.service's cgroup (the bad shape that triggers FM3 cascade SIGKILL on every archipelago restart). ensure_running now notices and corrects this: * If there's already a `<name>.container` unit on disk → no-op (subsequent reconcile ticks take this fast path). * Else if a podman container with that name exists → it's a pre-3.3 artifact. Stop+remove it (volumes survive — bind mounts are not touched by `podman rm`), then write the Quadlet unit, daemon-reload, and start the new managed service. * Else → fall through to install_fresh, which already routes through install_via_quadlet when the flag is on. The migration is idempotent and self-healing: if a fleet box is half-migrated (unit on disk but no service active, or service active but stale unit), the next reconcile tick converges. Bitcoin chain data, lnd wallet state, and electrumx index all live on host bind mounts and are unaffected by the container-record swap. Volume safety audited per backend in `uses_orchestrator_install_flow` allowlist — every entry mounts its data dir as a host bind mount. Default still off. To migrate a node: /etc/archipelago/config.toml: use_quadlet_backends = true followed by `systemctl restart archipelago` — the next reconcile tick walks every managed app and migrates each in turn. Tests: 624 passing, 0 cargo warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:27:59 -04:00
archipelago	5b2e02bd43	feat(orchestrator): Phase 3.2 — wire Quadlet path behind feature flag prod_orchestrator::install_fresh now branches on the new Config::use_quadlet_backends flag (default false): * off (today's production behavior) — unchanged: runtime.create_container + start_container, container parented under archipelago.service's cgroup, FM3 cascade SIGKILL on every archipelago restart. * on — install_via_quadlet renders the manifest as a Quadlet unit via QuadletUnit::from_manifest, writes it atomically into ~/.config/containers/systemd/, calls daemon-reload, and starts the generated <name>.service. Container ends up under user.slice — no more cgroup parented under archipelago, so archipelago restarts don't touch the container's lifetime. Default off so this commit is structurally safe to ship: nothing changes at runtime until an operator opts in. Flip the default once tests/lifecycle/run-20x.sh has gone green against the new path on .228 + .198 (the v1.7.52 release gate). Plumbing: * config.rs — `use_quadlet_backends: bool` w/ Default false * prod_orchestrator.rs — flag stored on the struct, threaded through new(), with set_use_quadlet_backends(bool) test setter * prod_orchestrator.rs — install_via_quadlet helper * dropped the Phase-3.1 #[allow(dead_code)] markers on from_manifest / parse_memory_mib / RestartPolicy::OnFailure now that the call path exists; if a future revert removes the wiring, the warnings come back. Tests: 624 passing, cargo check clean (0 warnings). Existing companion behavior unaffected — render_skips_backend_directives_when_default still passes byte-equal to before quadlet.rs grew the new fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:22:10 -04:00
archipelago	9becafafd3	feat(quadlet): backend-manifest renderer (Phase 3.1 of v1.7.52) The QuadletUnit struct now covers everything a backend manifest needs (ports, environment, devices, add_hosts, entrypoint+command, read-only root, no_new_privileges, cpu_quota, restart policy choice). Adds QuadletUnit::from_manifest(&AppManifest, name) that translates a parsed manifest into a unit, plus parse_memory_mib for "1g"/"512m"/raw-MiB forms. The renderer skips empty/false directives so existing companion units render byte-identically — no behavior change for shipping companions; the backend renderer is dead code until Phase 3.2 wires it into the orchestrator. Eight new unit tests cover: * parse_memory_mib forms (1024, 512m, 2g, garbage) * shell_join quoting (whitespace, embedded quotes) * RestartPolicy → systemd string mapping * render emits backend directives when set * render skips them when defaulted (companion regression gate) * from_manifest happy path on a bitcoin-knots-shaped manifest * from_manifest read-only volume detection * from_manifest tmpfs filtering * end-to-end manifest → render bytes assertion Tests: 615 → 624 (+9 net; one pre-existing parse_memory_mib path was implicitly covered before but is now explicit). Cargo warnings: 0. `from_manifest`, `parse_memory_mib`, and `RestartPolicy::OnFailure` are marked allow(dead_code) with explicit references to Phase 3.2 — if 3.2 doesn't wire them, the dead-code warning resurfaces. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:09:50 -04:00
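A parser for the "1g" / "512m" / raw-MiB forms might look like the sketch below. The `Option<u64>` return type and the case-insensitive suffix handling are assumptions; the commit only names the accepted forms and the four tested inputs.

```rust
/// Parses a memory limit into MiB: "2g" -> 2048, "512m" -> 512, "1024" -> 1024.
/// Returns None for anything else. (Return type is an assumption.)
fn parse_memory_mib(raw: &str) -> Option<u64> {
    let s = raw.trim().to_ascii_lowercase();
    if let Some(n) = s.strip_suffix('g') {
        // Gibibytes: multiply by 1024, refusing values that would overflow.
        n.parse::<u64>().ok()?.checked_mul(1024)
    } else if let Some(n) = s.strip_suffix('m') {
        n.parse::<u64>().ok()
    } else {
        // A bare number is taken as MiB already.
        s.parse::<u64>().ok()
    }
}
```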
archipelago	c55a4f4e86	test(bootstrap): regression gate for the heal_podman_state socket bug Extracted the heal_podman_state cleanup list as a module-level HEAL_RUNTIME_SUBDIRS const so a unit test can structurally enforce the invariant: the list must contain "containers" + "libpod" but must NOT contain "podman" (which holds systemd's podman.sock listener and was the bug fixed in commit bb421803). If anyone re-adds "podman" — accidentally, by reverting, or by copy-paste from old plan memory — this test fires before we ship, not on the next deploy when it nukes the orchestrator's HTTP path. Total tests: 614 → 615. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:32:59 -04:00
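The structural invariant this test enforces can be expressed in a few lines. The constant's exact contents and the helper name `heal_list_is_safe` are assumptions for illustration; the commit only states which entries must and must not appear.

```rust
/// Runtime-state subdirectories the self-heal step may remove.
/// (Assumed contents: the commit requires "containers" and "libpod".)
const HEAL_RUNTIME_SUBDIRS: &[&str] = &["containers", "libpod"];

/// True when the list holds the two required entries and does not include
/// "podman", the directory that holds the systemd-bound podman.sock.
fn heal_list_is_safe(dirs: &[&str]) -> bool {
    dirs.contains(&"containers") && dirs.contains(&"libpod") && !dirs.contains(&"podman")
}
```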
archipelago	5be2febe13	fix(bootstrap): don't nuke podman socket dir during runtime self-heal Observed live on .198: heal_podman_state was removing $XDG_RUNTIME_DIR/podman/ alongside containers/ and libpod/. That dir holds the systemd-bound podman.sock — the listener systemd creates for socket-activated podman.service. Removing it broke every libpod HTTP call from the orchestrator until `systemctl --user restart podman.socket` ran. Far worse than any wedge it was trying to repair. Drop podman/ from the cleanup list. The runtime state we actually want to clean for FM6 (bolt_state.db drift) lives in containers/ and libpod/ only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:57:15 -04:00
archipelago	6bbe1b96cf	refactor: drop dead code surfaced by cargo cargo check was showing five real warnings, all genuinely dead: * container/mod.rs — re-exports compute_container_name, AdoptionReport, ReconcileAction, ReconcileReport were unused outside prod_orchestrator. Drop from the pub use line. * prod_orchestrator — with_runtime + insert_manifest_for_test only exist for the test module in the same file. Mark them #[cfg(test)] so they don't appear in release builds. * async_lifecycle — remove_package_entry has no callers; doc claims "used for install-failure cleanup" but nothing cleans up. Delete (10 lines). * registry.rs — `use tracing::{debug, info};` had no consumers. * fips.rs — unused-assignment chain on last_status. The poll loop always sets it on every break path, so the initial `None` and the unwrap_or_else fallback were both dead. Refactored to `let after = loop { ...; break s; };`. cargo check is now clean. cargo test --workspace --bins: 614 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:34:02 -04:00
archipelago	8f13298805	fix(bootstrap): self-heal wedged podman runtime state at startup Closes FM6 (podman bolt_state.db / runtime drift) — observed live on .198 today: bitcoind was running for several minutes, but podman's state DB reported the container as Exited. The reconciler then tried to "restart" it, racing the still-bound port 8332 and failing in a loop. heal_podman_state() runs as the last bootstrap stage, BEFORE the orchestrator's reconcile loop ticks. It probes `podman info` with a 5s timeout; on failure it removes the runtime-state dirs under $XDG_RUNTIME_DIR and re-probes. Persistent storage under ~/.local/share/containers/storage/ is never touched, so containers re-discover from manifests on next call. Cleanup never includes `podman system reset` or `system renumber` — those are destructive and must stay operator-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:23:36 -04:00
archipelago	6603227874	fix(install): auto-clean stuck OTHER-variant bitcoin container If bitcoin-core was installed but never started (e.g. port 8332 already bound by bitcoin-knots), the container sticks in `created` state forever. The old conflict check refused EVERY future bitcoin install — including re-install of the running variant — leaving no UI path to recovery. Now the check distinguishes states: - missing → no conflict, continue - running → real conflict, refuse install - created/exited/configured/... → stuck; auto-remove and continue Volumes are untouched; only the dead container record goes away. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:59:11 -04:00
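The three-way state distinction above can be sketched as a small classifier. The enum, the function name, and the use of `Option<&str>` for the container state are hypothetical; the mapping itself follows the commit message.

```rust
/// Outcome of checking the other bitcoin variant's container before install.
#[derive(Debug, PartialEq)]
enum ConflictDecision {
    NoConflict,
    Refuse,
    RemoveStuckThenContinue,
}

/// `state` is None when no container exists, otherwise podman's state string.
fn classify_other_variant(state: Option<&str>) -> ConflictDecision {
    match state {
        None => ConflictDecision::NoConflict,
        Some("running") => ConflictDecision::Refuse,
        // created / exited / configured / ...: a dead record, safe to remove.
        Some(_) => ConflictDecision::RemoveStuckThenContinue,
    }
}
```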
archipelago	27ff1d5b52	fix(install): generate bitcoin RPC password before orchestrator install Bitcoin containers were exiting in ms after start because the orchestrator install path skipped the credential-materialisation step the legacy path did. resolve_secret_env then failed to read /var/lib/archipelago/secrets/bitcoin-rpc-password, the container started with no password, and bitcoind crashed before logs were useful. Two changes: 1. install.rs — call bitcoin_rpc_credentials() for bitcoin/bitcoin-core/ bitcoin-knots before any install branch runs. The function generates + persists on first call (OnceCell-cached), so this is idempotent. 2. manifest.rs::resolve_secret_env — return ManifestError::Invalid when a resolved secret trims to empty, instead of silently producing `KEY=` env vars that crash auth. Adds a unit test for the empty-secret rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:39:56 -04:00
archipelago	f9e34fd0c6	refactor(install): route orchestrator-managed apps through orchestrator first Phase 3a of the install path consolidation. Two coupled changes: 1. install.rs handle_package_install: gate the legacy "container exists → adopt + return" probe on !orchestrator_managed. Apps the orchestrator knows about (bitcoin-knots, bitcoin-core, lnd, electrumx, fedimint, filebrowser, btcpay-server stack apps, mempool stack apps, plus the companion UIs that just moved to Quadlet) skip the legacy probe and fall straight into the orchestrator branch. The legacy adopt block was returning success on a bare `podman start` exit-0 — even when the process inside the container crashed seconds later. That's the .228 "running but unreachable" failure mode. The orchestrator's ensure_running honors the manifest's health check and pre-start hooks (e.g. re-renders bitcoin-ui's nginx.conf if the RPC password rotated), so this is a behavioral upgrade, not just a refactor. 2. ProdContainerOrchestrator::install: make idempotent. Previously it blindly called install_fresh which would fail on `podman create` if the container name already existed. Now it delegates to ensure_running: - Container Running + healthy → no-op (refresh hooks, restart if config rewritten) - Container Stopped/Exited → start (with hook refresh) - Container missing → install_fresh - Container in wedged state (Created/Paused/Unknown) → force-recreate Without this, change #1 would regress every "container already exists" case for the 18 orchestrator-managed app IDs. With it, install becomes the single source of truth for "make app X be in the desired state." Tests: 654 passed across the workspace (614 unit + 37 orchestration + 3 rpc), 0 failures. The 20 prod_orchestrator tests cover the install / ensure_running / reconcile paths the new install delegates through. Net delta: install.rs grows by ~30 lines (gating wrapper + comments), prod_orchestrator.rs grows by ~30 lines (idempotent install body). 
Both are temporary — the larger deletions (~1700 lines) come once every app has been verified through the orchestrator path in subsequent phases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:12:52 -04:00
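The idempotent install described in this commit amounts to a state-to-action table. The sketch below uses hypothetical enum names; `Running` here stands for "running and healthy", since the commit text does not say what happens to a running but unhealthy container.

```rust
/// Observed container state (names are illustrative, not the real types).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ContainerState {
    Running, // running and healthy
    Stopped,
    Exited,
    Missing,
    Created,
    Paused,
    Unknown,
}

#[derive(Debug, PartialEq)]
enum Action {
    NoOp,
    Start,
    InstallFresh,
    ForceRecreate,
}

/// Chooses the action that moves a container toward the desired state.
fn ensure_running_action(state: ContainerState) -> Action {
    match state {
        ContainerState::Running => Action::NoOp,
        ContainerState::Stopped | ContainerState::Exited => Action::Start,
        ContainerState::Missing => Action::InstallFresh,
        // Wedged states cannot be started reliably, so recreate.
        ContainerState::Created | ContainerState::Paused | ContainerState::Unknown => {
            Action::ForceRecreate
        }
    }
}
```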
archipelago	23c4e7441f	refactor(container): move companion UIs to systemd via Quadlet Companion UI containers (archy-bitcoin-ui, archy-lnd-ui, archy-electrs-ui) used to be launched as fire-and-forget tokio::spawn blocks from install.rs. If archipelago crashed mid-spawn or the container's cgroup was reaped, companions vanished from podman ps -a and only a manual rm/run could bring them back (the .228 incident). Now each companion is rendered as a Quadlet .container unit under ~/.config/containers/systemd/, daemon-reloaded, and started via systemctl --user. systemd owns supervision from that point on: - archipelago can crash, restart, or be uninstalled without touching any companion. - Quadlet's Restart=always + RestartSec=10 handles container exits. - A 30s reconcile tick in boot_reconciler enumerates expected companion units and re-installs any whose unit file or service vanished — defense-in-depth against external tampering. New module layout: - container/quadlet.rs: pure unit renderer + atomic write_if_changed + systemctl helpers (daemon_reload_user / enable_now / disable_remove / is_active). 6 unit tests, no I/O in the renderer. - container/companion.rs: per-app companion specs, install/remove/ reconcile, image presence (build local first, fall back to insecure registry only via image_uses_insecure_registry whitelist). 2 tests. install.rs handle_package_install now ends with a single call to companion::install_for(package_id), replacing 287 lines of spawn-and- hope shellouts plus a ~120-line nginx auth-injector helper that worked around per-node RPC password baking. The helper is gone too — the pre-start hook renders the per-node nginx.conf to /var/lib/archipelago/ bitcoin-ui/nginx.conf and the Quadlet unit bind-mounts it read-only. runtime.rs handle_package_uninstall now disables companions before the container rm loop. Otherwise systemd's Restart=always would respawn each companion within ~10s of removal. 
Tests: 53 container tests pass, including 6 quadlet renderer tests (host network, bridge network, capability set, atomic write idempotence) and 2 companion specs (per-app companion lookup, build_unit shape). boot_reconciler tests gain a #[cfg(test)] without_companion_stage() flag so the paused-clock fixtures don't race the real systemctl I/O. A bats regression test (companion-survives-archipelago-restart.bats, gated on ARCHY_ALLOW_DESTRUCTIVE=1) asserts the .228 failure mode cannot recur: every installed companion has a unit file, services stay active across systemctl --user restart archipelago, and a deleted unit file is recreated within one reconcile tick. Net delta: +941 / -363, but the +941 is mostly tests (~440 lines) and the new declarative layer; the imperative tokio::spawn block and its nginx-auth helper are gone, removing two failure classes (orphan companions on archipelago crash, and post-start exec races under tightly-confined cgroups) that previously needed manual SSH recovery. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:45:07 -04:00
archipelago	2bf8181110	refactor(security): tighten capability + TLS-bypass surface Three small, focused tightenings: - core/container/src/podman_client.rs: drop the legacy Hetzner 23.182.128.160:3000 mirror from image_uses_insecure_registry(). It was decommissioned in v1.7.x and is stripped from active registry config at load time; leaving it in the bypass list let a stale config still skip TLS. Replace the inline match with a named INSECURE_REGISTRY_HOSTS slice so future entries are one line. Test now also pins the spoofing-immune semantics ("evil.example/146.59.87.168:3000/x" must NOT match). - core/archipelago/src/api/rpc/package/config.rs: split bitcoin from lnd in get_app_capabilities(). bitcoind never opens raw sockets — drop CAP_NET_RAW from bitcoin/bitcoin-core/bitcoin-knots. lnd/fedimint/fedimint-gateway keep it because they enumerate network interfaces during cert generation. - core/archipelago/src/bootstrap.rs: tighten_secrets_dir() enforces 0700 on /var/lib/archipelago/secrets and 0600 on every file inside on each startup. The dir-mode is the load-bearing isolation boundary against rootless container escapes (their UID maps to >=100000, can't traverse uid=1000/0700). The per-file sweep is defense-in-depth against any installer that wrote 0644. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:59:11 -04:00
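The spoofing-immune check can be sketched by comparing only the registry component (everything before the first slash) against the named slice. The function body is an assumption about how such a check could be written; the host value and the must-not-match example come from the commit message.

```rust
/// Registries allowed to skip TLS verification (the OVH mirror only,
/// after the Hetzner entry was dropped).
const INSECURE_REGISTRY_HOSTS: &[&str] = &["146.59.87.168:3000"];

/// True only when the image's registry host, i.e. the text before the
/// first '/', is exactly one of the listed hosts.
fn image_uses_insecure_registry(image: &str) -> bool {
    match image.split_once('/') {
        Some((host, _)) => INSECURE_REGISTRY_HOSTS.iter().any(|h| *h == host),
        // No registry component at all.
        None => false,
    }
}
```

Matching on the first path component rather than using a substring search is what keeps a host name embedded later in the path from matching.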
archipelago	0684491072	chore: baseline codex hardening before lifecycle refactor Snapshots the in-flight hardening work so subsequent reconcile/Quadlet phases land on a clean before/after diff. Changes: - core/container/src/podman_client.rs: image_uses_insecure_registry() whitelist for the OVH (146.59.87.168:3000) and legacy Hetzner (23.182.128.160:3000) HTTP mirrors; podman_network_settings() lifts custom networks into the Networks map so containers can join them. - core/archipelago/src/container/prod_orchestrator.rs: ensure_container_network() creates per-manifest networks on demand; apply_data_uid() now goes through host_sudo for mkdir -p + chown so bind-mount roots get created and chowned without password prompts. - core/archipelago/src/api/rpc/package/{install,update,stacks}.rs: podman pull adds --tls-verify=false only for whitelisted registries. - core/archipelago/src/bootstrap.rs: removes stale dev-mode systemd override on startup (live nodes carried it from old installers). - core/archipelago/src/config.rs: ignore ARCHIPELAGO_DEV_MODE in prod binaries — it had been silently rerouting volumes to /tmp. - apps/bitcoin-{core,knots}/manifest.yml: locate bitcoind at runtime so image-layout differences don't break entrypoint. - scripts/app-catalog-image-smoke-test.py: production catalog/image smoke test that probes a target node before users click Install. - .gitignore: cover .codex, .pnpm-store, __pycache__, *.bak. Removes filebrowser.rs.bak and two stale catalog.json.bak files (verified identical to live counterparts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:52:29 -04:00
archipelago	05e6c2e738	fix: release v1.7.51-alpha install hardening	2026-05-01 05:02:39 -04:00
archipelago	be9f9528c3	fix: release v1.7.50-alpha OTA runtime repair	2026-05-01 03:14:07 -04:00
archipelago	7ab788d178	chore: release v1.7.49-alpha	2026-04-30 16:37:54 -04:00
archipelago	8a2899ab4a	chore: release v1.7.47-alpha Sync-perf tuning for bitcoin/bitcoin-core/bitcoin-knots/electrumx. - Drop the --cpus=2 cap on bitcoin/electrumx variants. Script verification is parallelizable; the cap halved IBD speed on 4-8 core machines. - Bump bitcoin --memory 4g→8g so dbcache=4096 has headroom for mempool + connection buffers + I/O. 4g was OOM-prone during heavy IBD. - Bump electrumx --memory 1g→2g + add CACHE_MB=2048 + MAX_SEND=10MB. - bitcoin-core CLI args gain -dbcache=4096 -par=0 -maxconnections=125. - bitcoin-knots manifest matched (1024MB pruned / 4096MB full + par=0). Future v2: host-RAM-aware dbcache scaling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:47:51 -04:00
archipelago	992b673b20	chore: release v1.7.46-alpha Follow-up to v1.7.45-alpha closing the remaining tasks identified by the resilience sweeps + the new bitcoin orphan / install-fail-vanish bugs. User-visible: - Health monitor: stop paging on orphaned containers from variant switches - Install fail: card stays visible (was vanishing) with error message - Stack pull progress: interpolate 20→70% (was stuck at 20%) - docker.io → lfg2025 mirror: bitcoin/gitea/nextcloud/valkey Internal: - Resilience harness: install-wait uses expected_containers_for, ui+auth probes retry with 60s backoff, dep-snapshot fix - InstallProgress gains optional `message` field (frontend renders it when phase is None) binary $(stat -c %s releases/v1.7.46-alpha/archipelago) sha256:$(sha256sum releases/v1.7.46-alpha/archipelago | awk '{print $1}') tarball $(stat -c %s releases/v1.7.46-alpha/archipelago-frontend-1.7.46-alpha.tar.gz) sha256:$(sha256sum releases/v1.7.46-alpha/archipelago-frontend-1.7.46-alpha.tar.gz | awk '{print $1}') Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 14:50:33 -04:00
archipelago	4ec6ca98c1	chore: release v1.7.45-alpha Resilience-validated release. Three full sweeps of the new resilience harness against .228 confirm no shipstoppers. Big user-visible: - Bitcoin RPC auth durably correct via host-rendered nginx.conf bind-mount, replaces fragile post-start exec that failed under restricted-cap rootless podman ("crun: write cgroup.procs: Permission denied") - Multi-container stack installs (indeedhub, immich, btcpay, mempool) now emit phase events at every boundary so the progress bar advances - Apps no longer vanish from the dashboard mid-install (absent-scanner skips packages in transitional states) - Indeedhub fresh installs work end-to-end (was 8500+ restart loop): five missing env vars (DATABASE_PORT, QUEUE_HOST, QUEUE_PORT, S3_PRIVATE_BUCKET_NAME, AES_MASTER_SECRET) added to install code - Tailscale install fixed: --entrypoint string was being passed as a single shell-line arg; switched to custom_args array - Catalog cleaned of broken entries (dwn, endurain, ollama removed; nextcloud restored on docker.io) - Bitcoin Core update path uses correct image (was looking for nonexistent lfg2025/bitcoin:28.4) - ISO installs now allocate swap on the encrypted data partition Infra: - New resilience harness (scripts/resilience/) — black-box state-machine tester, every app × every transition. Run before each release. Sweep #3 final: PASS 107 / FAIL 12 / SKIP 14. The 12 fails are 1 cosmetic (homeassistant trusted_hosts), 8 harness/timing false-positives, and 3 non-shipstopper tracked items. Down from 23 in baseline sweep #1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:31:45 -04:00
archipelago	8f83b37d51	feat(orchestrator): complete container migration and release hardening	2026-04-28 15:00:58 -04:00
archipelago	4bf35f95e6	test: repair stale test fixtures across identity, mesh, update, wallet, fips Several tests had drifted from the current production behavior: - identity_manager: create() already auto-provisions a Nostr key, so the explicit create_nostr_key() call failed with "already exists". Rewrite the test to assert on record.nostr_npub from create() directly. - mesh/protocol: test_build_app_start read the app name from frame[4..] but the v2 layout is [0:marker][1-2:len][3:cmd][4:version][5..:name]. test_identity_broadcast_roundtrip expected input DID = output DID but the v2 decoder derives DID from the ed25519 pubkey, so the roundtrip compares against did_key_from_pubkey_hex(&pub) now. - mesh/bitcoin_relay: test_build_block_header_announcement asserted sig.is_some(), but the builder intentionally emits an unsigned envelope to fit the 160-byte LoRa limit; assert sig.is_none(). Also widen placeholder hashes to the required 64 hex chars (32 bytes). - update: load_mirrors() now merges default mirrors post-migration, so the roundtrip test must assert the custom mirror survives alongside the defaults rather than strict equality. - wallet/cashu: test_proof_c_as_pubkey used hex that is not on the curve; replace with the secp256k1 generator point G so parsing succeeds. - fips: test_status_reports_no_key_pre_onboarding asserted npub.is_none(), which fails on dev boxes where the fips daemon is already running. Keep the !key_present assertion and drop the npub one.	2026-04-23 13:02:45 -04:00
archipelago	4edc420459	test(credentials): seed identity/node_key in test helper so encrypt/decrypt works Credentials tests created a fresh tempdir and immediately invoked encrypt/decrypt, but load_encryption_key reads <dir>/identity/node_key which did not exist, so every test failed with "node key not found". Add a test_dir_with_node_key() helper that writes a deterministic 32-byte key and switch all 8 call sites to it.	2026-04-23 13:02:28 -04:00
archipelago	7af048cc1a	fix(session): add test-only constructor so tests do not read real sessions SessionStore::new() reads /var/lib/archipelago/sessions.json, which on any node with an active dashboard contains live sessions that pollute test state and cause intermittent failures. Introduce a cfg(test) only new_for_tests(PathBuf) constructor and switch the test suite to it so tests always start from a clean tempdir.	2026-04-23 13:02:22 -04:00
archipelago	2843cc1e84	fix(container/image_versions): reject entries that are not image references The parser retained any key ending in _IMAGE, so a harmless-looking variable like NOT_AN_IMAGE="something" would be treated as a pinned container image. Add a value-shape check: the value must contain both a registry separator (/) and a tag separator (:) to qualify.	2026-04-23 13:02:15 -04:00
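The value-shape check reads naturally as a single predicate. The function name and the example values in the usage are hypothetical; the `_IMAGE` suffix rule and the `/` plus `:` requirement come from the commit message.

```rust
/// An entry counts as a pinned image only if the key ends in `_IMAGE`
/// and the value looks like an image reference: it contains both a
/// registry separator ('/') and a tag separator (':').
fn is_pinned_image_entry(key: &str, value: &str) -> bool {
    key.ends_with("_IMAGE") && value.contains('/') && value.contains(':')
}
```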
archipelago	c5ea41d0cb	fix(mesh/outbox): expire messages with zero TTL immediately is_expired used age > ttl_secs, so a message with ttl_secs=0 whose age rounded to 0 seconds was considered live forever. Switch to >= so the zero-TTL boundary expires on the first check, matching the intuitive meaning of TTL and the behavior the tests assert.	2026-04-23 13:02:07 -04:00
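The boundary fix is a one-character change in the comparison. A minimal sketch, assuming the age and TTL are both whole seconds passed in as integers (the real is_expired presumably derives the age from a timestamp):

```rust
/// A message is expired once its age reaches its TTL. Using `>=` rather
/// than `>` means a zero-TTL message expires on the first check.
fn is_expired(age_secs: u64, ttl_secs: u64) -> bool {
    age_secs >= ttl_secs
}
```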
archipelago	9d42645aa3	fix(avatar): prevent u16 overflow panic when seed byte is large hue_color and accent_color computed (seed as u16) * 360, which overflows u16 when seed >= 183 (182 * 360 = 65520 still fits; 183 * 360 = 65880 does not): debug builds panicked, release builds wrapped silently. Widen to u32 before the multiplication. This also unblocks several identity_manager tests that constructed avatars through master_node_svg and were aborting on the panic.	2026-04-23 13:02:01 -04:00
archipelago	f6efe2f356	fix(transport/chunking): stop overwriting first 4 bytes of user data encode_chunked() split the payload into shards first, then overwrote the first 4 bytes of shard 0 with a u32 length header, then re-ran Reed-Solomon to regenerate parity over the now-corrupted shards. The decoder correctly read the length header and trimmed `[4..4+len]` from the reconstructed buffer, but those first 4 bytes had already been destroyed on the encode side, so every chunked mesh payload lost its first 4 bytes. Restructure: reserve 4 bytes for the length header up front, build a single contiguous [len][data][pad] buffer, then split into shards. Parity is computed over the correct shards on the first pass, no double-encode needed. Update test_chunk_roundtrip_medium: 500 bytes + 4-byte header = 504 bytes, which is 5 data shards (ceil(504/124)), not 4. The old test assertion was wrong all along and masked the corruption bug because it only checked the roundtripped bytes, which is exactly what we need to verify. New assertion is correct. Verified: all 7 transport::chunking tests pass.	2026-04-23 12:29:10 -04:00
archipelago	cd6f8bad70	fix(install-log): pre-create /var/log/archipelago/ so non-root backend can write The backend runs as `archipelago` and calls `install_log()` to append audit lines to the install log on every install / update / remove / start / stop / restart. Target path was /var/log/archipelago-container-installs.log, which does not exist and cannot be created by the service because /var/log/ is root-owned. OpenOptions errors were silently swallowed, so the log was never written on any node. Ship a tmpfiles.d rule that pre-creates /var/log/archipelago/ and container-installs.log with archipelago:archipelago ownership. Move the const path to match, keeping logs inside the directory logrotate already rotates (image-recipe/configs/logrotate.conf). Install the rule from both the ISO build and self-update, and apply it immediately on self-update so existing nodes get a working log without needing a reboot. Verified on .228: file created, backend user can write, backend binary rebuilt with new const.	2026-04-23 12:02:46 -04:00
archipelago	694e5b0a9d	fix(update): pass --create-missing when rollback recreates a destroyed container The update flow removes the old container before starting the new one. If the update fails after removal, the rollback path tries `podman start <name>` first, then falls back to reconcile. But reconcile without --create-missing treats the now-absent container as an optional one that the install flow will (re)create later, and skips it. Result: container stays destroyed until someone notices and runs reconcile manually. Add --create-missing to the rollback reconcile invocation so the fallback actually rebuilds the container from its canonical spec. Fixes the failure mode observed on .228 where a bitcoin-knots update left the node with no bitcoin-knots container at all.	2026-04-23 10:06:55 -04:00
archipelago	12f93cc15e	fix(image-versions): locate image-versions.sh at its actual deployed path The Rust search path listed /opt/archipelago/image-versions.sh and scripts/image-versions.sh (repo-relative for dev), but the image recipe deploys the file to /opt/archipelago/scripts/image-versions.sh. Production nodes therefore silently failed every lookup: find_file returned None, load_image_versions returned an empty HashMap, and both pinned_image_for_app and pinned_images_for_stack returned no matches. Symptom on deployed nodes: every container scan emitted "image-versions.sh not found in any search path" at DEBUG level, and the version-comparison logic in docker_packages.rs plus the update-check logic in api/rpc/package/update.rs silently degraded to no-op — users would not see update-available badges and upgrade RPCs could not resolve pinned targets. Fix: put the canonical deployed path first in PATHS, keep the older /opt/archipelago/image-versions.sh as a fallback for not-yet-updated nodes, and retain scripts/image-versions.sh as the dev-repo-relative fallback. Verified on .228: backend now logs "Parsed 57 image versions from /opt/archipelago/scripts/image-versions.sh" on scan. Pre-existing test_parse_image_versions failure in this module is unrelated (the NOT_AN_IMAGE assertion was broken before this change because the parser's _IMAGE-suffix retain keeps it). Leaving that for the general cargo-test cleanup pass.	2026-04-23 09:29:15 -04:00
archipelago	28e38a36a9	fix(config): auto-purge decommissioned .23 VPS from saved registry/mirror configs load_registries + load_mirrors normally only ADD missing defaults to the persisted JSON — explicit removals stick. After retiring the .23 Hetzner VPS we need the opposite: existing nodes have .23 baked into their saved configs and would spend seconds per install/update timing out against a dead host until the operator manually removes it via the Settings UI. Add a targeted one-time migration in both loaders: if any saved entry has 23.182.128.160 in its URL, drop it on load and rewrite the file. This is an exception to the usual "explicit removals stick" rule — the user never chose to add this mirror, it was a default. Narrow-scope migration (one hardcoded IP match, no schema version) because the cost/benefit of a general migration system isn't worth it for a single decommissioned host. Future retirements can follow the same pattern.	2026-04-23 08:51:26 -04:00
archipelago	d9d5fa65e5	chore: retire .23 VPS mirror, promote .168 OVH to primary The Hetzner VPS at 23.182.128.160 was decommissioned. Replace it everywhere with the OVH VPS at 146.59.87.168, which was previously the tertiary mirror. - update.rs: drop DEFAULT_TERTIARY_MIRROR_URL, promote .168 into the secondary slot as "Server 1 (OVH)"; tx1138 becomes Server 2. Default mirror list shrinks from 3 to 2. - container/registry.rs: default RegistryConfig drops .23, promotes .168 to Server 1 / priority 0, tx1138 stays Server 2 / priority 10. - api/rpc/package/config.rs: trusted-registry allowlist swaps .23 for .168. - api/handler/mod.rs: app-catalog fallback URL uses .168. - neode-ui/views/marketplace/marketplaceData.ts: REGISTRY uses .168. - scripts/image-versions.sh: ARCHY_REGISTRY_FALLBACK uses .168. - image-recipe/build-auto-installer-iso.sh: installer ISO registries use .168 (both podman registries.conf and backend registries.json). Tests updated to assert on the new 2-entry default lists (registry + mirror). URL-parser fixture tests in update.rs retain .23 strings — they exercise string-parsing logic, not mirror policy. Git remotes: dropped `gitea-vps` and the .23 push URL on the `origin` multi-push alias (not part of this commit — pure working-copy change).	2026-04-23 08:22:32 -04:00
archipelago	980c1b25f4	fix(install): kick scanner post-install so Launch button appears immediately After install completes, the async-spawn wrapper wrote state=Running but the skeletal install-time manifest (interfaces: None) persisted until the next scheduled 60s scan. The frontend saw state=running but hasUI=false and hid the Launch button for up to a full minute. Add a shared Notify/watch pair between RpcHandler and the scan loop: - scan_kick (Notify): scan loop selects! between the 60s interval and this notify, running immediately on either. - scan_tick (watch<u64>): scan loop bumps the counter after each completed scan so callers can await completion. Install and update success paths now call kick_scanner_and_wait before flipping to Running. The scan merges via merge_preserving_transitional (state stays Installing/Updating, manifest refreshed from live podman with interfaces.main.ui populated from real port bindings). 2s timeout falls back to pre-fix behavior on slow podman — no regression.	2026-04-23 07:59:03 -04:00
archipelago	7e62ea07f7	feat(install): phase-based progress bar replaces unparseable pull bytes Podman emits zero parseable progress when stderr is piped (no TTY), so the old byte-counter regex never matched in real installs. Users saw 0% for the whole pull, then a jump to 95%, then silence through create-container, health-check, and post-install hooks. Replace with 7 explicit lifecycle phases wired through install.rs and update.rs: Preparing (5%), PullingImage (20%), CreatingContainer (70%), StartingContainer (80%), WaitingHealthy (88%), PostInstall (95%), Done (100%). Each maps to a fixed UI progress and status message. Frontend PHASE_INFO mapper in stores/server.ts prioritizes phase when present, falls back to byte-counter for legacy. A Math.max forward-only guard ensures the bar never regresses. Deleted the duplicate watcher in Discover.vue that was fighting the store's watcher with stale byte logic. Added shimmer CSS on the fill (with prefers-reduced-motion opt-out) so the bar looks alive during long phases.	2026-04-23 07:58:43 -04:00
archipelago	49b98e0271	fix(rpc): empty icon in transient install entry to avoid broken-image flicker create_installing_entry hardcoded /assets/img/app-icons/<id>.png for every new install. About half the app icons ship as .svg or .webp (lnd.svg, vaultwarden.webp, bitcoin-knots.webp, mempool.webp), so the browser 404s on the wrong extension and renders the default broken-image glyph for the 10-30s window before the scanner refreshes with real manifest data. Send empty icon. The frontend's icon computed in AppCard.vue falls through to curatedMap which has correct extensions for bundled apps, and handleImageError still guards any remaining misses with a placeholder SVG.	2026-04-23 06:58:12 -04:00
archipelago	1ad889608f	feat(rpc): async-spawn install/uninstall/update lifecycle Extend the async-spawn treatment previously shipped for Stop/Start/Restart to the three remaining long-running lifecycle RPCs. Each wrapper validates params, rejects duplicate in-flight ops, flips state to the transitional variant (Installing/Removing/Updating), then spawns the existing inner handler on tokio. RPC returns immediately with { status, package_id }; the spawn task owns the terminal state write. Install and update success arms explicitly set state=Running. The scan loop merge (merge_preserving_transitional) refuses to overwrite transitional states, so the spawn task must write the terminal state. Uninstall's inner handler removes the entry entirely, so no explicit terminal write is needed there. Dispatcher and handler now thread self as Arc<Self> / &Arc<Self> so spawned tasks can hold their own Arc without extra field cloning. Transient install entry uses empty icon string. Hardcoding /assets/img/app-icons/<id>.png 404s for apps that ship .svg or .webp assets, which produces a broken-image flicker until the scanner refreshes with manifest data. Empty string causes the frontend's icon computed to fall through to the curated map, which has correct extensions. Removed the inner "already updating" guard in update.rs — the wrapper now owns duplicate-op detection for all three operations.	2026-04-23 06:57:50 -04:00
archipelago	cd69c3b2f6	fix(state): preserve transitional state across container scans The 30s package scan loop used to blindly overwrite every package entry from podman inspect. While a user-initiated Stop / Start / Restart was in flight, the RPC spawn task would flip the state to Stopping / Starting / Restarting, the next scan would see podman still reporting "running" (for the duration of the graceful stop, up to 600s for bitcoin-core), and clobber the transitional state back to Running. The dashboard would then flip Running -> Stopping -> Running -> Stopped, making it look like the stop had silently failed until it eventually completed. The merge loop now treats transitional variants (Stopping, Starting, Restarting, Installing, Updating, Removing, and the three backup variants) as owned by the RPC spawn task. For those variants, merge_preserving_transitional keeps the existing state while still taking live observability fields (health, exit_code, installed, lan_address, manifest, static_files, available_update) from the fresh scan so the UI continues to see live health readings. Adds an escape hatch via a per-scan transitional_since side table: if a package has been in a transitional state for more than 1200s (2x the longest graceful stop at 600s on bitcoin-core), the scan loop assumes the spawn task died without cleanup and overrides with podman's live state. Prevents a crashed background task from wedging a package in Stopping forever. Three unit tests cover the merge rule, the observability passthrough, and the transitional-variant classifier.	2026-04-23 05:15:13 -04:00
archipelago	39dd1d9dcc	fix(rpc): async container stop/start/restart; widen state mapping RPC handlers no longer block on podman operations. container-stop on bitcoin-core used to hold the connection for up to 600s while the UI showed a frozen spinner; it now returns in under a second with {status: stopping} after flipping the package state to Stopping and broadcasting over WebSocket. Same treatment for container-start and the new container-restart route. Widens container-list state mapping to emit the transitional variants (stopping, starting, restarting, installing, updating, removing, installed, and the backup states) instead of collapsing them to "unknown". Keeps the mapping in sync with the UI ContainerStatus.state union so the dashboard can render the right transitional label. Mirrors the treatment in package/runtime.rs for package.start, package.stop, and package.restart. The body of each handler is lifted into pure do_package_* helpers that the background task runs; state flipping is bracketed around the spawn with revert on error. The pre-existing post-start exit-check verification and restart stop+start fallback run inside the spawned task, not the RPC body. Adds container-restart route to the dispatcher. mark_user_stopped continues to run BEFORE the spawn, preserving the ordering contract with the crash recovery layer at runtime.rs:145-148.	2026-04-23 04:59:45 -04:00
archipelago	5baced5f5b	feat(rpc): spawn_transitional helper for async lifecycle ops Introduces a new RPC-layer helper that bridges the synchronous ContainerOrchestrator trait with RPC handlers that must return in <1s. The helper flips the package state to a transitional variant (Stopping / Starting / Restarting) in the StateManager so WebSocket clients see the live label immediately, then tokio::spawns the actual orchestrator call. On success it writes the final state; on error it reverts to the pre-transition state and logs via install_log(). The ContainerOrchestrator trait stays synchronous so the reconciler, boot flow, unit tests, and chaos harness keep deterministic behaviour. Async only lives in the RPC layer. Not wired to any handler yet — Commit 2 consumes this helper. Widens install_log visibility from pub(super) to pub(in crate::api::rpc) so the new sibling module can reach it.	2026-04-23 04:55:52 -04:00
archipelago	30b31b3670	fix(lnd): read admin macaroon via sudo fallback LND's admin.macaroon is owned by a rootless-podman subordinate UID (typically 100000) with mode 640. The archipelago server runs as UID 1000 and cannot read the file directly, which caused every dashboard LND RPC (getinfo, connect-info, export-channel-backup) and lnd_client to fail with "Failed to read LND admin macaroon". Add a read_lnd_admin_macaroon() helper that first tries a direct read (for operators who have relaxed permissions) then falls back to `sudo -n cat`, mirroring the pattern already used for Tor hidden service hostnames in handle_lnd_connect_info. Centralise the canonical macaroon path as LND_ADMIN_MACAROON_PATH and route all four callers through the helper. Verified on .228: GET /lnd-connect-info now returns 200 with cert, macaroon, and tor_onion fields. Dashboard QR/connect-string UI unblocked.	2026-04-23 04:15:44 -04:00
archipelago	3e9c192b48	feat(container): bitcoin-ui pre-start hook renders nginx.conf from embedded template Replaces the first-boot-containers.sh sed/envsubst approach with a Rust-native render step bound into the ContainerOrchestrator lifecycle. - New container::bitcoin_ui module: embeds the nginx.conf template via include_str!, reads the plaintext RPC password from /var/lib/archipelago/secrets/bitcoin-rpc-password, substitutes {{BITCOIN_RPC_AUTH}} with base64(archipelago:<password>), and atomic-writes (tmp + rename) to /var/lib/archipelago/bitcoin-ui/nginx.conf. Idempotent: byte-compares before writing so unchanged input is a no-op (no inode churn, no restart cascade). - ProdContainerOrchestrator gains run_pre_start_hooks(app_id) returning HookOutcome::{Rewritten, Unchanged}. Fires in install_fresh before create_container, and in ensure_running: on Running + Rewritten triggers a restart; on Stopped re-renders then starts. - bitcoin-ui Dockerfile no longer COPYs a default.conf; the file now arrives via runtime bind-mount of the rendered config. If the bind-mount is ever missing, nginx starts with no site configured and returns 404 everywhere, a safe failure compared with serving upstream RPC with a stale Authorization header. - apps/{bitcoin,electrs,lnd}-ui/manifest.yml land as first-class manifests. bitcoin-ui declares the bind-mount target and a dependency on bitcoin-core; electrs-ui and lnd-ui declare their own deps and health checks. - 8 new unit tests on the render fn (idempotency, rotation, trimming, missing/empty secret, template invariants) plus an integration test asserting install(bitcoin-ui) actually lands a substituted nginx.conf on disk via the hook. 39/39 container:: tests pass (test_parse_image_versions pre-existing failure unchanged, out of scope).	2026-04-23 02:19:52 -04:00
archipelago	6a0809d386	feat(container): wire ProdContainerOrchestrator + BootReconciler into main Step 6 of the rust-orchestrator migration. Construct the container orchestrator once in main.rs, call load_manifests + adopt_existing immediately after Config::load, log the adoption report, and spawn BootReconciler::run_forever with the 30s default interval. Thread the orchestrator through Server::new -> ApiHandler::new -> RpcHandler::new so the reconciler and RPC layer share one instance. Wire a tokio::sync::Notify through the SIGTERM/SIGINT shutdown path so the reconciler exits cleanly alongside the server drain. Uses notify_one so the signal stores a permit if the reconciler is mid reconcile_all when the signal fires. Delete the commented-out run_boot_reconciliation block in main.rs that documented the prior bash-script approach being unsafe on unbundled installs — the new reconciler is manifest-driven and only touches apps present in /opt/archipelago/apps, fixing that concern. cargo check -p archipelago clean (6 pre-existing dead-code warnings on trait methods not yet exercised until Step 9 hot-swap). Container test suite 43/44 pass; the one failure (container::image_versions:: test_parse_image_versions) is pre-existing and unrelated.	2026-04-22 19:20:13 -04:00
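
The framing order described in f6efe2f356 (reserve the 4-byte length header first, build one contiguous [len][data][pad] buffer, then split into shards) can be sketched roughly as follows. This is a minimal illustration, not the repository's encode_chunked(): the function names, the big-endian header, and the zero padding are assumptions, the 124-byte shard size is inferred from the ceil(504/124) arithmetic in the commit message, and the Reed-Solomon parity step is omitted entirely.

```rust
/// Shard size in bytes, inferred from the ceil(504/124) arithmetic in the
/// commit message. All names in this sketch are hypothetical.
const SHARD_SIZE: usize = 124;
/// Length header size: a u32, as the commit message states.
const HEADER_LEN: usize = 4;

/// Build [len][data][pad] as one buffer, then split it into equal shards.
/// The header is written before any shard exists, so no user byte is
/// ever overwritten. Big-endian encoding is an assumption.
fn frame_into_shards(data: &[u8]) -> Vec<Vec<u8>> {
    let len = u32::try_from(data.len()).expect("payload too large for a u32 header");
    let mut buf = Vec::with_capacity(HEADER_LEN + data.len());
    buf.extend_from_slice(&len.to_be_bytes());
    buf.extend_from_slice(data);
    // Round up to a whole number of shards and zero-pad the tail.
    let shard_count = (buf.len() + SHARD_SIZE - 1) / SHARD_SIZE;
    buf.resize(shard_count * SHARD_SIZE, 0);
    buf.chunks(SHARD_SIZE).map(|chunk| chunk.to_vec()).collect()
}

/// Reassemble the shards, read the length header, and trim [4..4+len].
/// Returns None if the buffer is shorter than the header claims.
fn unframe_shards(shards: &[Vec<u8>]) -> Option<Vec<u8>> {
    let buf: Vec<u8> = shards.concat();
    let header: [u8; HEADER_LEN] = buf.get(..HEADER_LEN)?.try_into().ok()?;
    let len = u32::from_be_bytes(header) as usize;
    buf.get(HEADER_LEN..HEADER_LEN + len).map(|slice| slice.to_vec())
}
```

With a 500-byte payload this yields 5 shards of 124 bytes each, matching the corrected test expectation in the commit, and unframe_shards returns the original bytes, including the first four.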

409 Commits
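
The widening fix described in 9d42645aa3 can be illustrated with a small sketch. The two function names below are hypothetical stand-ins, not the repository's hue_color or accent_color; only the `(seed as u16) * 360` expression and the move to u32 come from the commit message. checked_mul is used here so the overflow shows up as None rather than as a debug-build panic.

```rust
/// Hypothetical stand-in for the pre-fix arithmetic: multiply in u16.
/// `checked_mul` returns None where a debug build would panic and a
/// release build would silently wrap.
fn scaled_u16(seed: u8) -> Option<u16> {
    (seed as u16).checked_mul(360)
}

/// Widened arithmetic as the commit describes: convert to u32 before
/// multiplying, so the largest seed (255 * 360 = 91800) fits comfortably.
fn scaled_u32(seed: u8) -> u32 {
    (seed as u32) * 360
}
```

For a small seed such as 100 both versions agree (36000), while for 255 the u16 version reports overflow and the u32 version returns 91800.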