128 Commits

Author SHA1 Message Date
archipelago
57a013bc66 test(gate): make 5× the canonical gate, drop 20x naming
Rename run-20x.sh → run-gate.sh, default ARCHY_ITERATIONS 20→5, and scrub
20× references across CLAUDE.md, the master plan, TESTING.md, app-registry
status, the orchestrator/config doc-comments, and the bats suites. Also add
a minimal fail() helper to mempool.bats so guard failures report cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:12:41 -04:00
archipelago
452f05d849 fix(reconciler): decouple companion self-heal onto its own cadence
The companion-unit repair stage ran at the END of each boot-reconciler tick, after
reconcile_existing(). On a heavily loaded node that per-app pass takes >60-90s, so a
deleted/lost companion unit (electrs-ui, bitcoin-ui, …) wasn't repaired within any
reasonable window (gate test 31 'deleted unit recreated within one reconcile tick'
timed out at 90s on the 45-app .228 node). Detecting + rewriting a companion unit is
cheap, so spawn it as its own ~interval(30s) loop, independent of the slow app pass.
Handle is aborted when the main loop exits (shutdown uses notify_one, so a second
waiter would steal the wake permit). tick() is now app-reconcile only.

All 4 boot_reconciler cadence tests still green (companion_stage=false in tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 13:04:28 -04:00
archipelago
760a32bccf fix(reconcile): keep user-stopped apps stopped (reconciler was resurrecting them)
package.stop a dependency (e.g. electrumx, a mempool dep) and the reconciler
restarts it within ~8s: the reconcile filter's dependency_required override
re-includes a user-stopped app that an active app depends on, and the in-memory
disabled set is wiped on manifest reload — so ensure_running runs, the stopped
app's unreachable ports look like a fault, the host-port repair restarts it, and
package.stop never sticks (gate 'transitions to stopped' times out).

Fix: guard ensure_running_with_mode on the on-disk user_stopped marker (the single
choke point every reconcile flows through) → Left('user-stopped'). Explicit
install/start clear the marker first (added clear_user_stopped to orchestrator
install/start, symmetric with disabled.remove; start/restart RPC already cleared
it) so user actions are unaffected. The container itself already stopped correctly
— this stops the resurrection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 09:04:02 -04:00
archipelago
2dad64b2ee fix(stop): honour per-app graceful-stop grace in orchestrator stop path
package.stop left slow-to-SIGTERM apps (fedimint/electrumx/bitcoin/btcpay/immich)
running: the orchestrator path hardcoded podman API ?t=10 / CLI -t 30 and the CLI
wrapper deadline (30s) equalled the -t grace, so the await fired exactly as podman
SIGKILLed -> stop reported failed -> state reverted to running. Reproduced live on
clean .198 (fedimint).

- container/runtime.rs: add ContainerRuntime::stop_container_with_grace (defaulted
  so mock/dev impls are unchanged); PodmanRuntime honours grace for API + CLI with
  deadline = grace + 15s buffer; AutoRuntime delegates. New canonical per-app table
  stop_grace_secs_for() + DEFAULT_STOP_GRACE_SECS / STOP_GRACE_DEADLINE_BUFFER_SECS.
- podman_client.rs: stop_container_with_grace uses ?t=<grace> + longer HTTP deadline.
- prod_orchestrator::stop: resolve grace = manifest stop_grace_secs (north-star) else
  the table; pass to quadlet::stop_service_with_timeout AND stop_container_with_grace.
- quadlet.rs: stop_service_with_timeout so slow apps aren't SIGKILLed at 45s.
- rpc/package/runtime.rs: doc-note its &str stop_timeout_secs mirrors the canonical table.
- tests: resolve_stop_grace_secs (manifest field wins / table fallback / default 30).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 06:59:40 -04:00
archipelago
ff78b31212 fix(hooks): run post_install exec in a transient user scope (fixes cgroup denial)
Live on .228 the post_install `exec` steps failed with "crun: write
cgroup.procs: Permission denied / OCI permission denied": a `podman exec`
launched from archipelago.service can't place its child in the container's
cgroup (under the service's own slice). Wrap `exec` in
`systemd-run --user --scope --quiet --collect podman exec …` so it gets its own
delegated cgroup — same trick as `podman_user_scope` for pasta starts.
`copy_from_host` (a host-side `cp`, no in-container process) stays direct.

Without this only copy_from_host worked; indeedhub happened to be unaffected
(its image pre-bakes the nginx config so the exec steps were no-ops), but the
hook capability is only generally useful with exec working. hooks unit tests
pass; live verify on .228 next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 17:38:23 -04:00
archipelago
b73084dbb0 refactor(indeedhub): delete orchestrator special-cases; use generic path (#20 phase 3)
The fresh-create path was blocked by hardcoded indeedhub orchestrator logic
that predated and conflicted with the manifest migration:
- ensure_running routed app_id=="indeedhub" → reconcile_indeedhub_stack, which
  REFUSED to create the frontend from its manifest (returned Left("stack-managed")).
- run_pre_start_hooks("indeedhub") → start_indeedhub_backends →
  wait_for_indeedhub_dependencies_ready(120) — a DNS gate with a chicken-and-egg
  bug (required the frontend's own alias present before the frontend could be
  created), which failed install_fresh with "dependencies were not ready within
  120s" and left the frontend down (caught live on .228).

Delete all of it (−382 lines): reconcile_indeedhub_stack, start_indeedhub_backends,
wait_for_indeedhub_dependencies_ready, indeedhub_api_dependency_dns_ready,
indeedhub_required_aliases_present, repair_indeedhub_network_aliases,
indeedhub_alias_present, patch_indeedhub_nostr_provider, and the INDEEDHUB_*
consts. The manifests now carry everything these did: network_aliases (short
hostnames), generated_secrets, dependencies, and the post_install nginx hook. So
"indeedhub" + every member flows through the generic install_fresh/reconcile path
— the frontend fresh-creates normally and runs its hook.

(crash_recovery.rs's frontend-after-deps ordering guard is kept — it's beneficial
startup ordering, not a blocker.) cargo check + release build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 17:11:33 -04:00
archipelago
b94b61f640 feat(manifest): network_aliases — extra DNS aliases on a container's network
Add `container.network_aliases: Vec<String>` (serde default, DNS-label
validated) so a stack member can answer to short hostnames its peers bake
in, beyond its own container name. Rendered in both runtime paths:
- podman_client: merged (deduped) into the custom-network aliases array.
- quadlet from_manifest: appended after the container name; emitted only
  for Bridge networks (slirp/pasta reject aliases).

Needed for the indeedhub migration: its frontend nginx proxies to
`api:4000` / `minio:9000` / `relay:8080`, so those members declare
`network_aliases: [api|minio|relay]` to keep the short names resolvable on
the dedicated indeedhub-net (vs. colliding generic aliases on archy-net).

Also fixes 4 pre-existing from_manifest test failures (unrelated to this
change, surfaced now that the quadlet suite runs green): test manifests
used the long-invalid `network_policy: archy-net` (allowlist is
isolated/bridge/host → moved to network_policy: isolated + container.network)
and bind sources outside /var/lib/archipelago.

Tests: container crate 53 pass; archipelago quadlet+alias 47 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 15:45:11 -04:00
archipelago
955c54b713 feat(hooks): post_install executor + install-path wiring (#20 phase 2)
Add container::hooks::run_post_install — runs an app's declarative
post_install hooks against its own running container:
- Exec  -> podman exec <container> <args…> (60s timeout-bounded)
- CopyFromHost -> resolve src against allowlist roots (<data_dir>/<app>
  and /opt/archipelago), canonicalise + prefix-check (defeats symlink
  escape), then podman cp <abs-src> <container>:<dest>

Best-effort + idempotent: a failed step is warned and skipped, never
fails the install — matching the legacy patch_indeedhub_nostr_provider
behaviour this replaces. Wired into install_fresh after the container is
up, so it runs only on a freshly created container (not plain start), and
re-applies on recreate-after-drift.

5 unit tests on resolve_copy_src (accept in-data-dir, reject absolute /
traversal / missing / symlink-escape). cargo test -p archipelago green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 11:45:28 -04:00
archipelago
9e6c5370fc feat(immich): manifest-driven stack via orchestrator — live-migrated on .228
Completes the immich migration off the legacy hardcoded install_immich_stack
(podman run + sudo chown) to the registry-manifest + orchestrator path. Validated
live on .228 (clean single set, healthy v2.7.4, data dir ownership correct).

- install_immich_stack now tries install_stack_via_orchestrator(immich_stack_app_ids)
  first; legacy remains only as the no-manifests fallback.
- immich-{postgres,redis,server} manifests corrected from live findings:
  * named by app_id (dropped container_name override) — using container_name
    spawned DUPLICATE containers (app_id-named install vs name-override reconcile)
    on the same PGDATA, which corrupted a postgres cluster. Server reaches its
    siblings via app_id aliases (DB_HOSTNAME=immich-postgres, REDIS=immich-redis).
  * immich-postgres data_uid 100998:100998 (postgres drops to container 999 →
    host 100998 under rootless; verified the fresh dir is chowned correctly).
  * immich-server version "release"→"2.7.4" (manifest validation requires a digit;
    the bad version made the manifest silently skip → partial orchestrator install
    → legacy fallback → the duplicate corruption above).
- HARDEN install_stack_via_orchestrator: only fall back to the legacy installer
  when NOTHING was installed yet. An "unknown app_id" AFTER a member is up now
  errors instead of double-creating containers on shared data (the corruption
  root cause).
- Strict the all-manifests round-trip test: fail (not skip) on any invalid shipped
  manifest — this gap let the bad immich-server version through.

Known follow-up (pre-existing, platform-wide): orchestrator-installed backends
(immich, btcpay-db) run as podman --restart, not Quadlet, and podman-restart.service
is disabled on .228 → reboot-survival gap independent of this migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 07:08:45 -04:00
archipelago
7bfbe8fe40 feat(registry-manifest): phase 2 — publisher embeds manifests into signed catalog
generate-app-catalog.sh gains opt-in EMBED_MANIFESTS=1: embeds each
apps/<id>/manifest.yml into its catalog entry's `manifest` field (whole document,
top-level app: preserved — exactly what the Rust side deserializes). Default off
so routine catalog regen is unchanged during the migration window; turn on
deliberately, then sign via the existing release-root ceremony. Verified: default
embeds 0; EMBED_MANIFESTS=1 embeds 40 manifests (generated_secrets preserved).

Adds a round-trip guard test: every shipped apps/*/manifest.yml must deserialize
+ validate through catalog_manifest_to_overlay (image apps accepted, build apps
defer to disk) — catches schema drift between disk manifests and the catalog path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:46:17 -04:00
archipelago
220666d3a9 feat(registry-manifest): phase 1 — orchestrator consumes manifests from signed catalog
Workstream B phase 1 (node-side consume). The signed app-catalog can now carry a
full manifest per entry; the orchestrator overlays it over the disk manifest
(origin-wins) with disk as the migration fallback. Moves apps toward
registry-distributed manifests with no OTA-shipped disk file.

- app_catalog: `manifest: Option<Value>` on AppCatalogEntry (forward-compatible,
  covered by the existing release-root signature over the raw JSON);
  `catalog_manifest_values()` accessor.
- prod_orchestrator: `load_manifests` overlays catalog manifests after the disk
  walk; `catalog_manifest_to_overlay()` returns None (→ disk fallback) on
  unparseable value / app-id mismatch / failed validate() / build source
  (build contexts aren't registry-distributed yet — phase 1 is image-only).
- manifest_dir stays PathBuf (build-only field); image-only apps never read it.
- 6 unit tests; compiles clean. No-op until a catalog embeds a manifest, so
  existing nodes are unaffected.

See docs/registry-manifest-design.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:30:38 -04:00
archipelago
03a4ee1b30 feat(container): manifest-declared generated secrets + companion/quadlet hardening
Generated-secrets system: apps declare `generated_secrets` in their manifest
(kinds hex16/hex32/bcrypt); `container::secrets::ensure_generated_secrets`
materialises them 0600/rootless in resolve_dynamic_env — idempotent and
self-healing (recovers wrongly root-owned secrets with no privilege). Replaces
per-app Rust (deletes ensure_fmcd_password). fedimint-clientd/gateway manifests
now declare fmcd-password / fedimint-gateway-hash.

companion.rs: rebuild the auto-built :latest image when its build context changes
(staleness check) so baked-in fixes (e.g. guardian-UI CSS) actually reach nodes.

quadlet.rs: skip PublishPort under Network=host (podman rejects the combo, exit
125) + regression tests.

UI: "Fedimint Guardian" rename, fedimint-clientd/nostr-rs-relay/meshtastic tagged
as Services (headless backends), gateway icon fallback.

Deployed + verified on .228 (generated-secrets fixed fedimint-gateway start;
grafana/strfry orphan crash-loop units removed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:11:07 -04:00
archipelago
63b98599e8 Revert "fix(fedimint): run fmcd with seccomp=unconfined so its DHT can start (#7)"
This reverts commit 409543c41e78025354acbdde5ffc6445895d4508.
2026-06-20 14:37:24 -04:00
archipelago
409543c41e fix(fedimint): run fmcd with seccomp=unconfined so its DHT can start (#7)
fmcd crash-looped "Operation not permitted (os error 1)" on .116 (kernel
6.12.74): the default rootless seccomp profile blocks a syscall its Mainline-DHT
/ iroh transport needs, so the REST API never came up (:8178 → HTTP 000) and
federations couldn't be joined. Verified: with seccomp=unconfined fmcd boots and
answers /v2/* (HTTP 401 instead of dead). fmcd works on other nodes, so this is
kernel/seccomp-specific — but the relaxation is safe for an outbound-networking
daemon and harmless where not needed.

- new `security.seccomp_unconfined` manifest flag (SecurityPolicy);
- libpod backend sets `seccomp_profile_path: "unconfined"` (== --security-opt
  seccomp=unconfined); quadlet backend emits `SeccompProfile=unconfined`;
- enabled in apps/fedimint-clientd/manifest.yml.

NOTE: manifests live on-disk at /opt/archipelago/apps/<id>/manifest.yml, so the
node needs the updated manifest deployed + the fmcd container recreated to apply.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 13:08:13 -04:00
archipelago
1bce694ebb feat(ui): mobile mesh tabs, AIUI-style audio player, cloud grid + map fixes
UI (this session):
- Global audio player now scales the whole interface into the space above it
  on desktop (sidebar + main) and docks directly above the tab bar on mobile;
  it stays visible while navigating.
- Mesh mobile redesign: floating Chat / BTC / Dead Man / AI / Map tab strip
  with a single fixed, internally-scrolling pane (page no longer scrolls);
  tabs hide while a conversation is open; floating back button; collapsible
  Device panel (starts collapsed); keyboard-aware conversation sizing via
  VisualViewport so the chat sits just above the keyboard.
- Cloud file grid: uniform 4/3 card heights (folders + images match).
- Swipe left/right switches tabs on the Apps and Web5 screens.
- Map tool fills its pane (no bottom gap); fix skewed Share Location toggle
  on mobile (global min-height rule was deforming the switch).
- Trim redundant helper copy from the mesh AI tab.

Also bundles pre-existing in-progress work that was already in the tree:
mesh listener/session + wallet + container + bitcoin-status backend changes,
docker UI updates, and assorted other UI tweaks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:52:26 -04:00
archipelago
c4855526fe feat(wallet): wire fmcd as core app + dual-ecash receive
Fedimint never appeared in Wallet > Settings > Fedimint because the
fmcd (fedimint-clientd) sidecar was never installed: ensure_default_
federation() needs the fmcd password to reach the daemon, found none,
and silently no-oped, leaving the registry empty.

- prod_orchestrator: add fedimint-clientd to the baseline auto-install
  set so it self-heals onto every node and auto-joins the default
  federation; generate the fmcd-password secret before secret_env
  resolves.
- fedimint_client: ensure_fmcd_password (random hex, 0600) shared with
  the container's secret_env; from_node reads the same secret (legacy
  fmcd/password kept as fallback); reissue_into_any redeems received
  notes into the first joined federation that accepts them.
- wallet.ecash-receive: dual-token — cashu* tokens redeem at the mint,
  anything else is reissued via fmcd; returns the kind + federation_id.
- UI: receive box advertises "Cashu or Fedimint" and reports which kind.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:52:26 -04:00
archipelago
83bb589ea6 style: cargo fmt for v1.7.99-alpha release gate
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:50:46 -04:00
archipelago
bd567cd165 feat(wallet,content,seed): Fedimint dual-ecash, paid content streaming, seed ceremony
- Fedimint ecash alongside Cashu: fedimint-clientd (fmcd) HTTP bridge,
  fedimint_client, fedimint RPC, wallet wiring
- Paid peer content: content invoices + streaming content server + content RPCs
- Seed-phrase ceremony/reveal RPCs and CLI ceremony tool
- LND wallet, mesh status/messaging, app-stack (netbird HTTPS), and
  decoupled-update wiring; Fedimint Client core app in catalog

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:21:07 -04:00
archipelago
7c458ede8e Merge agent-trust-wip (DHT Phases 0–4) into main
Integrates the DHT/peer-distribution line with the v1.7.98-alpha release
fixes:
- Phase 0 signed-catalog trust + release-root key (KAT-pinned)
- Phase 1 BLAKE3 content addressing alongside SHA-256
- Phase 2 swarm-assist fetch seam (origin always wins) + iroh-blobs
  provider — heavy iroh deps stay behind the off-by-default `iroh-swarm`
  feature, so the default build/deploy is unaffected
- Phase 3 signed Nostr seed-advertisement + discovery glue + paid swarm
  serving + "Networking Profits" Settings page
- Phase 4 paid swarm streaming (cross-mint ecash, Shape-A paid ALPN,
  streaming.prepare-payment), also iroh-swarm-gated

Conflicts resolved: seed.rs (kept release-root KAT tests), update.rs
(comment-only, OTA logic identical), Cargo.lock (regenerated against the
merged Cargo.toml). Default-feature build is clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 07:50:06 -04:00
archipelago
3aea8c5bfa fix(orchestrator): rebuild local UI images when source changes (#34)
The prod orchestrator only checked whether a build-image tag was *present*
before deciding to skip the build. The local UI images (bitcoin-ui, lnd-ui,
electrs-ui) COPY a built neode-ui dist, so a UI update changed the source but
left the old tag in place and the new UI never shipped.

Gate the build on a content fingerprint of the build context (sorted relative
path + length + mtime, SHA-256) recorded in a per-tag stamp under data_dir.
Rebuild whenever the fingerprint differs from the one that produced the
existing image; podman's own COPY-layer cache keeps a no-op rebuild cheap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 03:09:56 -04:00
archipelago
27f11bf85a feat(trust): wire Phase 0 signed-catalog verification + pin release-root KAT
Completes the parked trust module and wires it into the live build:
- main.rs: register `mod trust`
- app_catalog::fetch_one: verify the release-root detached signature when
  present (verify against raw JSON so forward-compat fields stay in the
  signed preimage); accept unsigned during the migration window, hard-reject
  a present-but-bad signature so a tampering mirror can't pass altered bytes
- seed: pin release-root Ed25519 known-answer test (priv+pub) for the
  signing ceremony / pinned-anchor / external-verifier cross-check
- signed_doc: drop unused import

20/20 Phase 0 unit tests pass (trust::canonical/did/signed_doc/anchor,
seed release-root, app_catalog). Crate compiles clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 12:40:57 -04:00
archipelago
aa9e0f02b7 fix(cloud): pin peer file-card filename + action buttons to the bottom (#11)
Make each peer file card a flex column filling its grid cell (flex flex-col
h-full) and pin the body row (filename + Play/Download) with mt-auto, so cards
with a media preview and cards without line their footers up across the row.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 09:27:29 -04:00
archipelago
bf24bbc15a fix(mempool): resolve CORE_RPC_HOST to the actual bitcoin node (Knots/Core) (B12)
CORE_RPC_HOST was hardcoded to bitcoin-knots in three env-render paths, so on a
bitcoin-core node (container named bitcoin-core) mempool-api could not reach
Bitcoin RPC. Both node variants are reachable on archy-net by container name —
only the name differs.

- Legacy direct-podman (stacks.rs) and config.rs::get_app_config now use a new
  dependencies::detect_bitcoin_rpc_host() (pure, unit-tested pick_bitcoin_host).
- Quadlet/manifest path (the modern fleet default): add a {{BITCOIN_HOST}}
  derived-env placeholder — HostFacts.bitcoin_host + resolve_derived_env render
  it; prod_orchestrator detects Knots/Core via podman ps, resolved on demand
  only for manifests that use the placeholder. mempool-api manifest moves
  CORE_RPC_HOST from static env to derived_env: {{BITCOIN_HOST}}.

Tests: pick_bitcoin_host (5 cases incl. substring safety), container-crate
resolve_derived_env, and orchestrator mempool_core_rpc_host_follows_bitcoin_node
(core->bitcoin-core, knots->bitcoin-knots). No-regression confirmed: picker
returns bitcoin-knots live on .198. Live bitcoin-core validation pending (no
core node available). Sibling hardcodes (lnd/btcpay/electrumx/fedimint) tracked
as B12b.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 02:07:39 -04:00
archipelago
1973d76427 style: rustfmt lnd migrate_locked_wallet matches! call
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 14:41:40 -04:00
archipelago
3214d6aff3 fix(lnd): self-heal unrecoverable locked wallet via wipe+recreate
When an existing LND wallet is locked and none of the candidate passwords
(per-node secret, legacy constant) open it, the node can never auto-unlock
unattended. unlock_existing_wallet now returns Ok(false) for "all candidates
actively rejected" (vs Err for transient "LND not ready"), and
ensure_wallet_initialized responds by recreating the wallet:

  - mark the lnd container user-stopped so the health monitor won't
    re-launch it (and re-open the wallet) mid-wipe,
  - stop lnd, delete its wallet/chain/graph state as root,
  - start lnd, wait for NON_EXISTING, re-init a fresh wallet on the
    per-node secret, then clear the user-stopped flag.

LND runs as a plain bridge-network podman container (not a Quadlet unit),
so it is restarted via `systemd-run --user --scope podman`, matching the
orchestrator/health-monitor path.

Alpha nodes hold no funds and a wallet locked with an unknown password is
already inaccessible, so the wipe loses nothing reachable. Completes the
forward fix from 91adc281 for nodes whose wallet pre-dates the per-node
secret and whose password is unrecorded (e.g. .116/.228).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 14:08:33 -04:00
archipelago
91adc281ca fix(lnd): per-node wallet password + locked-wallet self-heal on login
Replaces the fleet-wide hardcoded WALLET_PASSWORD='hellohello' that left wallets
LOCKED after OTA/reboot (auto-unlock used the wrong password fleet-wide).

Forward fix (both init paths unified, validated cargo check + LND REST mechanics
on a scratch wallet):
- Per-node random 256-bit secret in secrets/lnd-wallet-password (0600), mirroring
  secrets/bitcoin-rpc-password. read_wallet_password (no-gen) vs
  ensure_wallet_password (gen at init only).
- container/lnd.rs init AND api/rpc/lnd/wallet.rs seed-derived init both use the
  per-node secret (wallet.rs keeps recoverable derived entropy; password unified).
- Unlock tries [per-node secret, legacy 'hellohello']; single-attempt primitive
  distinguishes invalid-passphrase (fail fast, try next) from not-ready (retry),
  so a wrong password no longer hangs the boot path ~60s.

Migration (candidate-unlock + rotate, best-effort at login):
- change_wallet_password (WalletUnlocker.ChangePassword) + migrate_locked_wallet:
  if LOCKED, try candidates as current pw and ChangePassword onto the per-node
  secret so future boots auto-unlock. Hooked into auth.login (non-blocking) with
  the just-verified password as the candidate.

NOT YET: seed-recovery fallback for wallets where no candidate matches (e.g.
.116/.228) — destructive, needs entropy-source/funds-safety handling; next pass.
NOT shipped: pending end-to-end validation on a real node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:19:56 -04:00
archipelago
a483fe4baa fix: derive launch port from URL authority, not naive rsplit
reachable_lan_address() parsed the launch port with url.rsplit(':')
which yields "8096/" for manifest interfaces.main URLs that carry a
path (http://localhost:8096/). That fails to parse and silently drops
a perfectly reachable launch URL, so apps like jellyfin, btcpay-server,
fedimint, gitea, nextcloud and portainer showed running with no launch
link in the UI. New launch_url_port() reads digits after the final
colon (mirroring port_from_url in the RPC layer) and tolerates a
trailing path. Adds regression tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 03:35:19 -04:00
archipelago
0ed892a412 fix: wallet receive reliability, bitcoin install self-heal, ElectrumX app tile
Fixes three Bitcoin/wallet failures observed across the fleet on v1.7.90-alpha
(all nodes were already on the latest build — these were live bugs, not stale
builds), plus the missing ElectrumX tile, and adds automated coverage so each
can't regress silently.

Receive address (".116 receive fails", ".228 false 'wallet is locked'"):
- LND publishes its REST API on a host port that can drift from the manifest
  (a container created when the mapping was 8080 kept publishing 8080 after the
  manifest moved to 18080). The in-process client connects to the manifest port,
  gets connection-refused, and wallet init fails forever while the container
  looks "Up". Add published-port drift detection to the reconciler
  (container_ports_drifted / host_port_bindings_drifted) that recreates a
  drifted backend even for restart-sensitive apps — a drifted container is
  already broken, so leaving it "untouched" only perpetuates the failure.
- Receive errors now carry a stable [CODE] token (REST_UNREACHABLE, WALLET_LOCKED,
  WALLET_UNINITIALIZED, SYNCING) and always start with "Bitcoin address" so they
  survive the RPC error sanitizer instead of collapsing to the generic
  "Operation failed". The UI maps the code instead of guessing wallet state from
  substrings — so an unreachable REST endpoint is no longer mislabelled "locked".

Bitcoin install (".198 bitcoin gone / reinstall just stops"):
- bitcoin-knots requires the secret bitcoin-rpc-txrelay-rpcauth, which was only
  generated by the tx-relay flow. Nodes that never used tx-relay lacked it, so
  secret resolution hard-failed and the whole Bitcoin stack cascaded. Generate
  it idempotently before bitcoin starts (ensure_app_secrets, reusing
  ensure_txrelay_credentials), and name the missing secret in the error so a
  genuine gap is actionable instead of a bare "IO error".

ElectrumX app tile missing on every node with it installed:
- The catalog generator dropped electrumx because the manifest had no
  interfaces.main block, so the tile had no launch URL and was hidden. Declare
  the companion UI port (50002) in the manifest, regenerate the catalog, and let
  an app with a known launch URL stay launchable while its backend is still
  "starting" (ElectrumX indexes for 10m+).

Test harness:
- New lifecycle bats suites: bitcoin-receive, port-drift, secret-completeness
  (validated live; port-drift catches the real .116 drift).
- Rust unit tests for drift detection, the receive reason-code classifier, and
  the named-missing-secret error; vitest for the UI code mapping.
- create-release.sh now runs tests/release/run.sh and aborts the release on
  failure — previously it ran no tests at all.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 03:12:56 -04:00
archipelago
c49e8fcacd fix: harden OTA updates, AIUI desktop gap, LND no-proxy
- update.rs: post-OTA probe falls back to http://127.0.0.1/ on connect
  error (nginx binds :80, not :443) so good updates are no longer rolled
  back; recover stuck update_in_progress; avoid ETXTBSY on running binary
- LND: REST client bypasses proxy, GET newaddress p2wkh, wallet
  readiness/unlock after restart
- Dashboard.vue: chat route back to plain h-full (desktop bottom-gap fix)
- vite.config.ts: dev-only /aiui proxy
- tests/release/run.sh: release gate harness (static+frontend+backend)
- CHANGELOG: v1.7.89-alpha notes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 01:23:32 -04:00
archipelago
d6f108d818 chore: snapshot release workspace 2026-06-12 03:00:15 -04:00
archipelago
6a30ff11bd chore: release v1.7.84-alpha 2026-06-11 04:44:58 -04:00
archipelago
f818f1dcc1 app-platform: remove unsupported saleor release surface 2026-06-11 01:16:21 -04:00
archipelago
c393b96da3 backend: harden rootless app lifecycle orchestration 2026-06-11 00:24:32 -04:00
archipelago
34c4e87d14 feat(apps): add saleor storefront 2026-05-20 23:02:57 -04:00
archipelago
f4368785f0 fix(apps): unblock saleor and netbird first-use flows 2026-05-20 00:28:30 -04:00
archipelago
92c58141af fix(apps): stabilize saleor and netbird launch 2026-05-19 21:45:17 -04:00
archipelago
522c046525 feat(apps): add saleor and harden netbird repair 2026-05-19 20:11:22 -04:00
archipelago
bd69ef41d5 fix(apps): repair netbird login and iframe focus 2026-05-19 19:21:43 -04:00
archipelago
f0bd49d03d fix(apps): repair netbird install and app icons 2026-05-19 17:20:32 -04:00
archipelago
ab96c97cb9 fix(apps): self-host netbird and stabilize app sessions 2026-05-19 16:02:35 -04:00
archipelago
87be717f40 fix(apps): keep slow installs visible 2026-05-19 14:29:20 -04:00
archipelago
413d50116e fix(apps): restore mobile and website launching 2026-05-17 19:22:18 -04:00
archipelago
7804223152 chore: release v1.7.57-alpha 2026-05-17 17:30:04 -04:00
Dorian
b8053c00ca fix: clear stale health notifications 2026-05-14 08:57:54 -04:00
Dorian
f95e9a1cd0 fix: quote quadlet environment values 2026-05-14 01:15:22 -04:00
Dorian
2ff47f88a7 fix: harden container reconcile and launch behavior 2026-05-13 22:59:55 -04:00
Dorian
835c525218 chore(release): stage v1.7.55-alpha 2026-05-13 15:09:22 -04:00
archipelago
c0751e2551 chore(release): stage v1.7.54-alpha 2026-05-06 09:23:57 -04:00
archipelago
745cb1c626 chore(release): stage v1.7.52-alpha 2026-05-05 11:29:18 -04:00
archipelago
aad0ba5234 feat(orchestrator): drift-sync existing Quadlet units on each reconcile
When a Quadlet unit file already exists for an orchestrator-managed
backend, sync its on-disk bytes against what the current renderer
produces. write_if_changed makes this idempotent — when bytes match,
no IO; when they differ (post-deploy of a renderer change), the file
is rewritten and systemctl --user daemon-reload runs once.

We deliberately do NOT restart the .service when the file changes:
running containers keep their current config until the operator
restarts them. That's the right tradeoff — file updates are cheap and
non-destructive; service restarts are the SIGKILL cascade we're
trying to eliminate.

Why this matters: pre-this-commit, every renderer change required a
fresh package.install RPC per app to take effect. Observed live on
.228 2026-05-02 — the TimeoutStartSec=600 fix shipped in code but
existing units stayed on the old format because nothing triggered a
re-render. Combined with state.json being empty (so the reconciler's
auto-install path didn't fire either), the fix was invisible until
manual unit deletion.

Companions (UI_APP_IDS) are skipped — companion.rs renders those units
with a different shape; syncing here would clobber them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:43:18 -04:00