661 Commits

Author SHA1 Message Date
archipelago
daf750688d merge: mesh multiversion and transport pills
# Conflicts:
#	core/archipelago/src/mesh/listener/decode.rs
#	core/archipelago/src/mesh/meshtastic.rs
2026-06-30 05:19:58 -04:00
archipelago
df9d3a55be integration: preserve deployed 1.8.0 OTA work 2026-06-30 05:08:17 -04:00
archipelago
7b0748c868 fix(mesh): respect the radio's flashed LoRa region (don't force ours)
ensure_lora_region previously force-overrode the device's region with the
mesh-config region (EU_868) whenever they differed — which would shove a US/ANZ
user's radio onto EU_868: an illegal band that also cuts it off from its local
mesh. Off-the-shelf interop must respect whatever region the user flashed.

Now: a radio that already reports a REAL region (US, EU_868, ANZ, …) is left
untouched. We only set a region when the device reports UNSET (a fresh radio is
RF-silent and can't mesh at all), using the operator-configured region as the
fallback. Unknown/None (never reported) is also left alone. Pairs with the
default-channel change so a meshtastic archy node behaves like a stock device.

cargo check green (built into the same binary as the channel fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 08:36:04 -04:00
archipelago
810127fd3e feat(mesh): meshtastic off-the-shelf interop — default channel + private archipelago
Make a meshtastic-equipped archy node work like a stock Meshtastic device AND
keep the private archy group, instead of being isolated on a custom primary:
- slot 0 (PRIMARY)  = the DEFAULT public channel (empty name + default key) →
  interoperates with every off-the-shelf device on LongFast and picks up
  default-channel users; our NodeInfo broadcasts ride here like normal.
- slot 1 (SECONDARY) = "archipelago" (deterministic psk) → private archy↔archy.

Previously the driver set "archipelago" as the PRIMARY, isolating archy from the
public mesh. Now ensure_channel writes at most one channel per call (default
primary first, then archipelago secondary), reusing the existing reboot→
reconnect→re-check loop so it converges in ≤2 cycles without reboot-looping;
primary_is_default() accepts the default key in 1-byte or expanded form so a
stock radio is never needlessly rewritten. set_channel generalized to
(index, name, psk, role); want_config parse tracks both slots.

MeshCore needs no change — it never overrides channels (ensure_channel is a
no-op) and already rides MeshCore's default Public channel off the shelf.

cargo check green. NEEDS radio verify on .116/.198 (default-channel RX + archy
group on the secondary). Channel provision cap (3) covers the 2-write migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 07:40:10 -04:00
archipelago
067002b04b Merge branch 'bitcoin-version-bulletproof' into mesh-multiversion-integration 2026-06-29 06:45:50 -04:00
archipelago
20f762cb2c feat(fips): auto-peer LAN-discovered federation nodes directly over FIPS
Mesh/federation messages between co-located nodes were always falling back to
Tor because the FIPS overlay had no direct peering — every node depended on the
global anchor's spanning tree, and when that anchor link flaps a node is
isolated and all FIPS dials time out. (Diagnosed live on .116/.198: pure-FIPS
direct peering over UDP 8668 fixes it — 2.5ms vs timeout.)

Generalize the manual fix: in the existing 5-min FIPS seed-anchor apply loop,
also auto-connect every federation peer the PeerRegistry knows both a LAN
address AND a FIPS npub for, dialing its FIPS UDP transport (port 8668) at its
LAN IP via the same idempotent `fipsctl connect` path (new
anchors::lan_fips_anchors). This is FIPS's own transport over the LAN — NOT
Tailscale, NOT the HTTP/LAN messaging port. Transient (recomputed each tick from
live mDNS discovery, never persisted) so changing IPs self-correct. Remote peers
with no LAN address are untouched (still routed via the anchor).

Registry Arc hoisted out of the transport-init block so the loop can read
all_peers(). cargo check green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 06:42:18 -04:00
archipelago
11155055aa feat(mesh): meshtastic PKI E2E pill — surface pki_encrypted on received DMs
The synthetic meshcore-style frame the meshtastic driver builds can't carry the
radio's PKI-encryption status, so received meshtastic DMs never lit the E2E pill.
Thread it out-of-band: the device records `last_rx_encrypted` (= packet
pki_encrypted) when it yields a text frame; the session loop reads it via
`take_rx_encrypted()` right after dispatch and stamps the just-stored received
message E2E (dispatch::stamp_received_encrypted, monotonic-id keyed). Meshcore
returns false here (its E2E is derived in the frames decrypt path). Pure
out-of-band signal — no change to the shared meshcore wire format.

Built + deployed live in binary d937814e on .116/.198. cargo check green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 06:25:01 -04:00
archipelago
095a76cd20 fix(bitcoin): bulletproof multi-version switching (Knots & Core)
Three stacked bugs made "switch version" silently fail / crash-loop, and
the data-access mismatch corrupted a node's index during recovery attempts.

Backend renderer:
- sync_quadlet_unit ignored the per-app pinned version and re-rendered the
  quadlet with the manifest's :latest every reconcile tick, reverting any
  switch. Factor the install-time catalog/pin resolution into a shared
  resolve_catalog_image() and call it in BOTH install_fresh and
  sync_quadlet_unit.
- The renderer folded manifest `entrypoint: ["sh","-lc"]` into Exec=, which
  only worked when the image entrypoint was a passthrough shell wrapper. The
  versioned images use ENTRYPOINT ["bitcoind"], so Exec=sh -lc ... became
  `bitcoind sh -lc ...` and crash-looped. Emit a real Entrypoint= override;
  exec_changed now also compares Entrypoint=.

Images:
- Build all bitcoin images (Core + Knots, every version) as container-root
  (USER removed) like the legacy :latest image. Chain data is owned by the
  data_uid (container uid 102); root reads it via CAP_DAC_OVERRIDE (granted in
  the manifest). A non-root USER (the previous uid 1000) can't read existing
  chain data → "Error initializing block database". Still fully rootless:
  container-root maps to the unprivileged host service user.

Catalog:
- bitcoin-knots versions[]: 29.3.knots20260508/20260507/20260210 +
  29.2.knots20251110, "latest" tracking newest.
- bitcoin-core versions[]: add 29.2 + a "latest" entry. All images rebuilt
  root and published to the mirror.

Frontend:
- AppSidebar version dropdown: rename the latest option to "Always use the
  latest version" (no v prefix), fix right padding, and guarantee the current
  selection matches a real option (was rendering blank).
- New InstallVersionModal: full-screen version chooser shown from the App
  Store / Discover install button for multi-version apps (Bitcoin Knots/Core),
  app icon + "Install <name>", latest pre-selected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 05:46:04 -04:00
archipelago
3c7c04a662 fix(mesh): meshtastic receive — drain frame batch per poll + rx diagnostics
Addresses the open Meshtastic parity bug (project_meshtastic_parity): the
running driver received nothing (`mesh.messages` stayed []) though the radio
got the packets and sends worked.

Root-cause candidate: `try_recv_frame` decoded ONE serial frame per poll and
returned Ok(None) for every non-text FromRadio frame, so the session loop slept
50ms between frames. Under Meshtastic's frequent NodeInfo/telemetry stream a
received text packet queued behind them, and read_from_radio's 64KB buffer cap
could drain (drop) it before it was ever decoded — reception silently dead while
sends kept working.

- try_recv_frame now drains a bounded batch (64) per poll, processing each
  frame's side effects and returning the first inbound text frame, so a text
  packet is decoded the same poll it arrives and the buffer never grows enough
  to hit the lossy cap. Bounded so a continuous flood still yields to select!.
- packet_to_inbound_frame logs every decoded packet (from/portnum/payload_len)
  and a "did not parse (dropped)" case, so one live radio pass is conclusive.

The rest of the decode path was verified correct by inspection (FROM_RADIO_PACKET
=2, wire-type-5 handled, parse_mesh_packet sound, 60s heartbeat present) — not a
parse bug. cargo check green. NEEDS a live radio pass on a rig that isn't .228
(off-limits: bitcoin testing) to confirm.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 05:04:09 -04:00
archipelago
11038cdcc9 feat(mesh,ui): per-message transport pill (Mesh/FIPS/Tor) + fix E2E pill
Adds a per-message transport badge to archy↔archy mesh chats and fixes the
long-broken E2E badge — both meshcore and meshtastic, styled like the existing
E2E pill.

Transport pill:
- New `MeshMessage.transport` ("lora"/"fips"/"tor"), surfaced in the UI beside
  the E2E badge (Mesh.vue transportLabel() → Mesh/FIPS/Tor, mesh-styles.css).
- Sent LoRa → "lora"; sent federation → finalized to the real leg ("fips"/"tor")
  once the background send resolves (req.send_json transport), via an id-keyed
  store update.
- Received: a post-dispatch stamp on handle_typed_envelope_direct's output
  (monotonic ids) tags both transports without threading through all 20 typed-
  dispatch sites — radio wrapper stamps "lora", federation injector stamps the
  peer's last_transport ("fips"/"tor", default tor; the inbound HTTP carries no
  FIPS-vs-Tor signal).
- Plain native/channel LoRa frames → "lora"; channel broadcasts stay non-E2E.

E2E pill fix:
- `encrypted` was hardcoded false at every MeshMessage construction site, so the
  UI badge (Mesh.vue `v-if="msg.encrypted"`) never showed. Now: federation
  envelopes are E2E (identity-signed over an encrypted transport); the meshcore
  native-DM receive path already had a real `encrypted` flag (now also tagged
  with transport). meshtastic-PKI radio E2E flag threading is a noted follow-up.

Backend cargo check + frontend vue-tsc build both green. Needs a live radio +
multi-transport pass on .116/.228 to confirm end-to-end (see
project_transport_pill / project_meshtastic_parity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 04:29:25 -04:00
archipelago
6aa74c7386 feat(bitcoin): multi-version support for Core & Knots (install/switch/pin/auto-update)
Lets a node runner choose which Bitcoin Core / Knots version to install
(latest pre-selected), then switch, pin, or opt into auto-update from the
app's interface — all manifest/catalog-driven, rootless, signed-registry,
zero-data-loss. Motivated by upcoming BIP-110 signalling: runners need a
real choice of software version.

Backend:
- version_config.rs: per-app pin + auto-update persistence (atomic, merge-
  preserving), downgrade detection, auto-update enumeration (+ unit tests).
- app_catalog.rs: CatalogVersion / versions[] schema, catalog_versions(),
  catalog_image_for_version() (same-repo guard); a pin suppresses the update
  badge.
- prod_orchestrator.rs: pinned version wins over the catalog default on every
  install/recreate.
- install.rs: install-time `version` param persisted (default = unpinned).
- set_config.rs: package.versions (read) + package.set-config (write) RPCs;
  downgrade is gated behind explicit confirm (warn + confirm + allow).
- update.rs/main.rs: hourly per-app auto-update tick via the orchestrator
  (opt-in, pin-respecting); fix handle_package_update to be non-fatal for
  orchestrator-managed apps lacking a catalog primary image (bitcoin-core).

UI:
- MarketplaceAppDetails.vue: install-time version selector (shown when an app
  offers >=2 versions).
- appDetails/AppSidebar.vue: "Version & Updates" card (switch / pin / auto-
  update toggle / downgrade warning), per app.
- rpc-client.ts + en.json: RPC methods, types, strings.

Phase 0 image pipeline:
- scripts/build-bitcoin-image.sh: download official tarball + SHA256SUMS(.asc),
  verify SHA-256 + pinned-maintainer OpenPGP signature (fail-closed), build a
  minimal rootless image, smoke-test, tag + push.
- apps/bitcoin-core/Dockerfile rewritten (drops stale community base);
  apps/bitcoin-knots/Dockerfile added.
- generate-app-catalog.sh: emit curated versions[]; published + catalog now
  offers Core 25.2/26.2/27.2/28.4/29.3/30.2/31.0 + Knots 29.3.knots20260508.

docs/bitcoin-multi-version-design.md: live progress tracker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 18:46:17 -04:00
archipelago
83344b9f3a fix(orchestrator): drop legacy mempool umbrella manifest on catalog-driven nodes
The split-mempool-stack guard that skips the legacy monolithic `mempool`
manifest (whose container collides with its split-stack frontend member
`archy-mempool-web`) only ran over DISK manifests. On catalog-driven nodes
(no disk manifests — e.g. the Phase-3/registry-manifest path), the legacy
`mempool` manifest arrives via the registry-catalog overlay AFTER that
guard, so both `mempool` and `archy-mempool-web` end up owning container
`mempool` and rewrite+restart each other forever ("port binding drift" /
"network alias drift" loop observed on .228, leaving mempool down).

Enforce the guard once more over the merged (disk + catalog) manifest set:
drop the `mempool` umbrella whenever all three split members are present.
Installing `mempool` assembles the split stack, so `archy-mempool-web`
owns the frontend container either way.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 14:04:41 -04:00
archipelago
4519dbf04f fix(orchestrator): render manifest certs on the adopted-running reconcile path
WS-F #10: a netbird reinstall that adopts a leftover running container
skipped ensure_manifest_certs, so when its data dir was wiped the self-
signed tls.crt/key were never regenerated; the next nginx.conf rewrite +
restart then died on the missing cert (proxy 502, login broken). The
Running branch of ensure_running_with_mode now calls ensure_manifest_certs
before ensure_manifest_files, mirroring prepare_for_start's certs-before-
files ordering. Idempotent: a no-op when crt+key already exist.

Live-validated on .228: deleted netbird tls.crt/key under a Running
container; reconciler regenerated a fresh CN=<host_ip> self-signed cert
(1000:1000), https :8087 = 200.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 17:49:50 -04:00
archipelago
f9a6ae3f32 feat(mesh): Meshtastic region + shared-channel auto-provisioning (MeshCore parity)
Fresh Meshtastic radios ship region-UNSET (RF-silent) and on mismatched
channels, so nodes only ever saw themselves. Bring them to MeshCore parity
using the official Meshtastic admin API:

- Auto-provision LoRa region (set_config, AdminMessage field 34) from a new
  mesh-config `lora_region` (e.g. EU_868) when the radio's region differs.
- Auto-provision a shared primary channel (set_channel, field 33) with a
  PSK derived deterministically from channel_name, so every node converges on
  one mesh — the parity equivalent of MeshCore's named "archipelago" channel.
- Read current region/channel from want_config; only write when different
  (no reboot loop); cap attempts so a radio that won't persist can't loop.
- Active NodeInfo advert scaffolding + aggressive serial drain.

Verified on .116+.228: region+channel persist, discovery works (both see each
other as named reachable contacts), bidirectional RF + sending confirmed.
Receiving in the running driver is still under diagnosis (instrumentation added).

Also removes the unwanted `meshtastic` daemon app from the registry (it was
never meant to be a container — native driver provides system-level support):
deletes apps/meshtastic + catalog entries (app-catalog, neode-ui, releases) +
test refs. Meshtastic stays native, like MeshCore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 04:46:35 -04:00
archipelago
fd3a4ee4ef fix(orchestrator): chown the whole fresh bind subtree, not just the leaf
ensure_bind_mount_dirs chowned a freshly-created no-data_uid bind dir
with --reference={immediate_parent}. For a NESTED bind source like
jellyfin's /var/lib/archipelago/jellyfin/config (or netbird's .../netbird/
data), `mkdir -p` creates the intermediate <app> dir root:root too, so
referencing the immediate parent just copied ROOT — leaving the dir
unwritable and the app EACCES-crash-looping on reinstall (found by the
all-apps-lifecycle pass: jellyfin "/config/log denied" exit 139;
netbird-server "unable to open database file"). It only ever worked for
direct children of the data root (immich).

Fix: anchor to the nearest PRE-EXISTING ancestor (the rootless data root,
owned by the service user) and chown -R the entire newly-created subtree
to it. Extracted the walk into fresh_subtree_anchor() with a unit test
covering nested / direct / second-volume cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 04:46:35 -04:00
archipelago
71cc9ac46a fix(uninstall): bound systemctl/podman teardown so uninstall can't hang
Uninstalling immich/grafana could hang with a frozen full-red progress
bar, leave a ghost entry stuck in My Apps, and then refuse reinstall.
Single root cause: quadlet::disable_remove() — called first in the
uninstall task (via companion + orchestrator teardown) — ran
`systemctl --user stop`, daemon-reload, and `podman rm -f` with NO
timeout. On rootless podman a generated unit can wedge in "deactivating"
while podman hangs underneath, so `systemctl stop` blocks forever. The
spawned uninstall task then never returns Ok or Err, so:
  - set_uninstall_stage() (after the stop) never fires → progress frozen;
  - remove_package_state_entry() never runs → entry stranded in
    `Removing` → ghost in My Apps;
  - the install guard rejects reinstall with "already Removing".

The spawn wrapper already reverts state on Err and removes the entry on
Ok — the only failure mode was a hang that returns neither. Bound the
teardown so it always terminates:
  - systemctl stop → QUADLET_STOP_TIMEOUT, escalate to kill+reset-failed
    on timeout (reuses the existing helpers);
  - daemon_reload_user() → bounded systemctl_user_status (30s);
  - defensive `podman rm -f` → wrapped in tokio timeout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 04:27:02 -04:00
archipelago
670ebb0666 fix(launcher): pin Gitea launch URL to web port 3001 (not SSH 2222)
Gitea publishes two host ports — SSH on 2222 and the web UI on 3001.
The launch URL comes from manifest_lan_address_for() (the manifest's
interfaces.main → 3001), but Gitea had no entry in the static
lan_address_for() fallback map. On a node where the gitea manifest is
absent or stale (no interfaces block), the lookup returns None and the
code falls through to extract_lan_address(), which returns whichever
port podman lists first — frequently the SSH port. Result: the app
launched at :2222 instead of :3001 (observed on tailscale node
100.82.34.38).

Add the canonical "gitea" => http://localhost:3001 entry to the static
map, matching every other core app, so the web UI is pinned regardless
of manifest presence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 03:16:41 -04:00
archipelago
0a8db9044f fix(orchestrator): recreate zombie "Up" containers whose process is dead
podman trusts its own state DB: when a container's conmon dies without
podman observing it (cgroup-cascade SIGKILL on archipelago.service
restart, a crash), `podman ps` keeps reporting it "Up" long after the
process is gone. The reconciler NoOp'd such a zombie forever, so a dead
dependency with no published host port never recovered.

Observed live on .228 (2026-06-25): netbird-dashboard reported "Up" with
a dead State.Pid → its nginx proxy 502'd → NetBird login broke
("Unauthenticated"). The dashboard publishes no host port, so the
Running branch had nothing to probe and never recreated it.

Add a zombie guard to the Running branch: verify the recorded State.Pid
is alive (its /proc entry exists) before trusting "running"; on a
concrete dead PID, stop+remove+install_fresh from the manifest.
Conservative by design — any uncertainty (inspect failed, PID
unparseable) assumes alive, so a transient podman hiccup never destroys
a healthy container. Unit test covers live/dead/out-of-range PIDs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 02:25:52 -04:00
archipelago
89d397bb74 refactor(netbird): delete legacy Rust installer — #20 ph4 (manifest-driven only)
netbird is fully manifest-driven (apps/netbird-*/manifest.yml via the signed
catalog): install_stack_via_orchestrator renders the 3-member stack with
generated_certs (self-signed TLS for the #15 OIDC secure context), base64
generated_secrets, and templated config — and adopts the running stack by live
container name. The hardcoded `podman run` fallback was therefore dead code on
any node with the embedded catalog (verified live: .228 https:8087 -> 200).

Removes the per-app Rust installer anti-pattern the master plan calls out:
- install_netbird_stack: orchestrator -> adopt -> bail! (no in-Rust installer)
- deletes 6 now-dead helpers (write_netbird_config_files, ensure_netbird_tls_cert,
  read_or_generate_b64_secret, netbird_net_resolver_ip, detect_netbird_public_host_ip,
  wait_for_netbird_oidc_ready), 3 NETBIRD_*_IMAGE consts, unused base64::Engine import
- ~485 lines removed; prod_orchestrator doc-comments updated

Behavioural parity: the manifest path already executed on the fleet, so this
changes no live behavior. The legacy #10 OIDC-readiness wait was already bypassed
by the manifest path; if that race resurfaces, add an OIDC-ready gate to the
manifest rather than resurrecting the Rust fn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 11:04:01 -04:00
archipelago
a721532f55 feat(orchestrator): desired-state recovery + recreate volume-ownership [UNVALIDATED WIP]
NOT yet validated on a node or fleet-deployed — cargo check passes, release build
+ .228 canary validation pending. Committed as a checkpoint so the work survives.

Two fixes the immich .198 incident exposed:

Fix A (reconcile_all_with_mode): a previously-running app whose container vanished
(e.g. a wedged podman teardown cleared by a reboot) was left absent on boot. Now,
when boot reconcile would leave an app 'absent' but it was running at the last
running-containers snapshot, recreate it (install_fresh). New
crash_recovery::load_last_running_names() reads the snapshot without the PID/crash
gate (+2 unit tests). Match is exact on compute_container_name (incl stack
members); user-stopped + uninstalled apps are already excluded, so no false
positives.

Fix B (ensure_bind_mount_dirs): a freshly-created bind dir was left root:root, so a
no-data_uid app running as container-root (→ host rootless user) hit EACCES and
crash-looped (the exact immich upload-dir failure). Now a newly-created bind dir
for a no-data_uid app is chowned via --reference=<parent> to match the rootless
data root — no host-uid guessing, only fresh dirs (no regression for existing
installs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 09:28:40 -04:00
archipelago
d1cd42c821 fix(orchestrator): stop retrying unrepairable volume chowns every reconcile
ensure_running_container_ownership re-probed and re-attempted the in-container
chown on every reconcile pass. For a mount that can't be re-owned from inside the
userns (observed: mempool-api /data -> 'Operation not permitted'), this burned
CPU and logged a WARN on every pass, forever (~6x/30min on .228/.116).

Remember hard chown failures in a process-lifetime set keyed by (container-id,
dest) and skip the probe+chown for known-unrepairable mounts. Keyed by Id (not
name) so a recreated container gets a fresh repair attempt. Verified on .116:
one recorded failure at startup, then silent across subsequent reconciles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 04:58:57 -04:00
archipelago
e57514b690 fix(uninstall): never ghost a removed app in My Apps on cleanup residue
handle_package_uninstall lumped every teardown failure into one `errors` vec
and returned Err on any of them BEFORE removing the package state entry — so a
non-fatal cleanup hiccup (a slow/failed `sudo rm -rf` of a large data dir, a
volume/network removal) left the app's containers gone but its entry in
package_data → a ghost in My Apps, and the spawned task reverted it to Installed.

Split the failures: container removal that even force-rm can't complete (app
genuinely still present) keeps the entry + returns Err; everything after the
containers are gone is best-effort. Remove the state entry as soon as the
containers are gone — BEFORE the slow volume/data teardown — so My Apps updates
immediately and residue can never ghost the app. set_uninstall_stage is a no-op
once the entry is gone (if-let guard), so the later stages don't re-create it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 15:23:16 -04:00
archipelago
4346007d37 fix(orchestrator): only TCP host ports get reachability-probed
wait_for_manifest_host_ports TCP-connect-probed every published port, including
UDP/SCTP. netbird's 3478/udp STUN can never answer a TCP connect, so the probe
failed forever and drove an endless host-port repair/reconcile loop on .228
(netbird-server restarting ~every 60s). Filter to tcp (empty protocol = tcp).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 14:40:48 -04:00
archipelago
9670af62b6 feat(registry): deliver app manifests via the signed catalog (embed by default)
Turn on registry-distributed manifests for all apps: generate-app-catalog.sh now
embeds each apps/<id>/manifest.yml by default (EMBED_MANIFESTS opt-out), so nodes
install from the signed catalog (origin-wins overlay, disk = fallback) with no
OTA-shipped disk manifest. main.rs awaits a bounded (25s) refresh_catalog before
load_manifests so a fresh boot overlays the latest embedded catalog instead of a
restart later; offline/ISO boot falls through to disk and never hangs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 13:39:54 -04:00
archipelago
a8b9b0f5e8 feat(netbird): manifest-driven migration via reusable orchestrator primitives
Migrate the netbird stack (server/dashboard/proxy) off ~500 lines of per-app Rust
to 3 declarative manifests, adding 4 reusable primitives:
- SecretGenKind::Base64 (netbird relay authSecret + sqlite store encryptionKey)
- GeneratedCert schema + ensure_manifest_certs (self-signed TLS so the dashboard
  gets a secure context for OIDC PKCE — issue #15; https proxy on 8087 preserved)
- templated GeneratedFile render: {{HOST_IP}}/{{HOST_MDNS}}/{{NETWORK_GATEWAY}}
  (aardvark resolver for the #15 stale-IP fix) /{{secret:NAME}} (never logged)
- legacy create_container now honours port.protocol (3478/udp STUN)
install_netbird_stack routes via the orchestrator first (legacy kept as fallback,
mirroring indeedhub); launch URL derives https://{host_ip}:8087 from host facts.
Legacy Rust deletion deferred to post-live-verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 13:39:53 -04:00
archipelago
3c36cf1c40 fix(companion): stop image_exists journal flood that drops the UI websocket
image_exists ran `podman image inspect <image>` via .status() (inherits the
service stdout) with no --format, so every hit dumped the image's full ~249-line
manifest JSON into the journal — once per companion image, every reconcile pass
(.228: 21.6k journal lines / 10 min, 4131 inspect dumps). The service never
crashed (NRestarts=0); the sustained journald/IO flood starved the async runtime
and dropped the UI /ws/db websocket -> constant "connection lost"/reconnect.
Discard the child's stdout/stderr; only the exit status is used.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 13:39:19 -04:00
archipelago
92d7f52dd6 fix(orchestrator): order only live containers on package start/restart
package.restart resolved its container list via
ordered_containers_for_start, which injected every name from the
union startup_order list that wasn't already present — including
variant names not live on a given node (mysql-mempool,
archy-mempool-api, archy-mempool-web). The phantom mysql-mempool is
2nd in the mempool start order, so do_orchestrator_package_start hit
its unknown-app-id fallback, do_package_start failed the inspect
("no such object"), and the `?` aborted the whole start sequence —
leaving mempool-api + the frontend down until the health monitor
recovered them minutes later. That was the source of the 5× gate
flakes #73 (frontend not running in 180s) and #74 (api not queryable
in 300s); root-caused from the .228 journal
("Start failed: mysql-mempool").

Replace the inject-then-sort logic with a pure helper
order_present_containers that orders only the actually-present
containers and never adds phantom entries. startup_order remains a
union of name variants across install generations — it's now used
purely to order what's live, not to inject what isn't. +3 unit tests.

Also harden bitcoin-knots.bats "valid state" probe: poll ≤30s for a
settled state instead of a single-shot read, so a container caught
mid-reconcile (transient restarting/configured) can't flake a 20-min
iteration. A genuinely-stuck container never settles, so real
breakage is still caught.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 02:22:50 -04:00
archipelago
57a013bc66 test(gate): make 5× the canonical gate, drop 20x naming
Rename run-20x.sh → run-gate.sh, default ARCHY_ITERATIONS 20→5, and scrub
20× references across CLAUDE.md, the master plan, TESTING.md, app-registry
status, the orchestrator/config doc-comments, and the bats suites. Also add
a minimal fail() helper to mempool.bats so guard failures report cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:12:41 -04:00
archipelago
452f05d849 fix(reconciler): decouple companion self-heal onto its own cadence
The companion-unit repair stage ran at the END of each boot-reconciler tick, after
reconcile_existing(). On a heavily loaded node that per-app pass takes >60-90s, so a
deleted/lost companion unit (electrs-ui, bitcoin-ui, …) wasn't repaired within any
reasonable window (gate test 31 'deleted unit recreated within one reconcile tick'
timed out at 90s on the 45-app .228 node). Detecting + rewriting a companion unit is
cheap, so spawn it as its own ~interval(30s) loop, independent of the slow app pass.
Handle is aborted when the main loop exits (shutdown uses notify_one, so a second
waiter would steal the wake permit). tick() is now app-reconcile only.

All 4 boot_reconciler cadence tests still green (companion_stage=false in tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 13:04:28 -04:00
archipelago
6e49ce6f88 fix(container-list): report user-stopped apps as stopped despite live UI companion
A user-stopped backend (electrumx, bitcoin, lnd, fedimint) kept reading 'running'
in container-list because its UI companion (electrs-ui, …) still serves the launch
port, and the state-refresh upgrades any reachable launch port to 'running'. The
gate's wait_for_container_status <app> stopped therefore never saw 'stopped'.

Fix: load the user_stopped marker in handle_container_list and force 'stopped' for
those apps before the launch-port refresh. The reconcile guard keeps the backend
down, so the marker is authoritative. package.start clears it first, so a started
app reports 'running' normally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 09:26:30 -04:00
archipelago
760a32bccf fix(reconcile): keep user-stopped apps stopped (reconciler was resurrecting them)
package.stop a dependency (e.g. electrumx, a mempool dep) and the reconciler
restarts it within ~8s: the reconcile filter's dependency_required override
re-includes a user-stopped app that an active app depends on, and the in-memory
disabled set is wiped on manifest reload — so ensure_running runs, the stopped
app's unreachable ports look like a fault, the host-port repair restarts it, and
package.stop never sticks (gate 'transitions to stopped' times out).

Fix: guard ensure_running_with_mode on the on-disk user_stopped marker (the single
choke point every reconcile flows through) → Left('user-stopped'). Explicit
install/start clear the marker first (added clear_user_stopped to orchestrator
install/start, symmetric with disabled.remove; start/restart RPC already cleared
it) so user actions are unaffected. The container itself already stopped correctly
— this stops the resurrection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 09:04:02 -04:00
archipelago
2dad64b2ee fix(stop): honour per-app graceful-stop grace in orchestrator stop path
package.stop left slow-to-SIGTERM apps (fedimint/electrumx/bitcoin/btcpay/immich)
running: the orchestrator path hardcoded podman API ?t=10 / CLI -t 30 and the CLI
wrapper deadline (30s) equalled the -t grace, so the await fired exactly as podman
SIGKILLed -> stop reported failed -> state reverted to running. Reproduced live on
clean .198 (fedimint).

- container/runtime.rs: add ContainerRuntime::stop_container_with_grace (defaulted
  so mock/dev impls are unchanged); PodmanRuntime honours grace for API + CLI with
  deadline = grace + 15s buffer; AutoRuntime delegates. New canonical per-app table
  stop_grace_secs_for() + DEFAULT_STOP_GRACE_SECS / STOP_GRACE_DEADLINE_BUFFER_SECS.
- podman_client.rs: stop_container_with_grace uses ?t=<grace> + longer HTTP deadline.
- prod_orchestrator::stop: resolve grace = manifest stop_grace_secs (north-star) else
  the table; pass to quadlet::stop_service_with_timeout AND stop_container_with_grace.
- quadlet.rs: stop_service_with_timeout so slow apps aren't SIGKILLed at 45s.
- rpc/package/runtime.rs: doc-note its &str stop_timeout_secs mirrors the canonical table.
- tests: resolve_stop_grace_secs (manifest field wins / table fallback / default 30).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 06:59:40 -04:00
archipelago
ff78b31212 fix(hooks): run post_install exec in a transient user scope (fixes cgroup denial)
Live on .228 the post_install `exec` steps failed with "crun: write
cgroup.procs: Permission denied / OCI permission denied": a `podman exec`
launched from archipelago.service can't place its child in the container's
cgroup (under the service's own slice). Wrap `exec` in
`systemd-run --user --scope --quiet --collect podman exec …` so it gets its own
delegated cgroup — same trick as `podman_user_scope` for pasta starts.
`copy_from_host` (a host-side `cp`, no in-container process) stays direct.

Without this only copy_from_host worked; indeedhub happened to be unaffected
(its image pre-bakes the nginx config so the exec steps were no-ops), but the
hook capability is only generally useful with exec working. hooks unit tests
pass; live verify on .228 next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 17:38:23 -04:00
archipelago
b73084dbb0 refactor(indeedhub): delete orchestrator special-cases; use generic path (#20 phase 3)
The fresh-create path was blocked by hardcoded indeedhub orchestrator logic
that predated and conflicted with the manifest migration:
- ensure_running routed app_id=="indeedhub" → reconcile_indeedhub_stack, which
  REFUSED to create the frontend from its manifest (returned Left("stack-managed")).
- run_pre_start_hooks("indeedhub") → start_indeedhub_backends →
  wait_for_indeedhub_dependencies_ready(120) — a DNS gate with a chicken-and-egg
  bug (required the frontend's own alias present before the frontend could be
  created), which failed install_fresh with "dependencies were not ready within
  120s" and left the frontend down (caught live on .228).

Delete all of it (−382 lines): reconcile_indeedhub_stack, start_indeedhub_backends,
wait_for_indeedhub_dependencies_ready, indeedhub_api_dependency_dns_ready,
indeedhub_required_aliases_present, repair_indeedhub_network_aliases,
indeedhub_alias_present, patch_indeedhub_nostr_provider, and the INDEEDHUB_*
consts. The manifests now carry everything these did: network_aliases (short
hostnames), generated_secrets, dependencies, and the post_install nginx hook. So
"indeedhub" + every member flows through the generic install_fresh/reconcile path
— the frontend fresh-creates normally and runs its hook.

(crash_recovery.rs's frontend-after-deps ordering guard is kept — it's beneficial
startup ordering, not a blocker.) cargo check + release build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 17:11:33 -04:00
archipelago
b1eea8c053 feat(indeedhub): manifest-driven 7-member stack, orchestrator-first (#20 phase 3)
Author the IndeedHub stack as 7 manifests (postgres/redis/minio/relay/api/
ffmpeg + frontend) and route install_indeedhub_stack through the
orchestrator first (immich pattern), falling back to the legacy installer
only when the manifests aren't deployed.

Data-preserving by construction — the manifests reproduce the live install
exactly so an existing node ADOPTS rather than recreates:
- container_name = the live hyphenated names the runtime already references
  (health_monitor tiers/deps, crash_recovery).
- named volumes indeedhub-{postgres,redis,minio,relay}-data (not bind mounts).
- dedicated indeedhub-net + network_aliases [postgres|redis|minio|relay|api]
  so the api/ffmpeg env hostnames and the frontend nginx upstreams resolve
  unchanged.
- generated_secrets (indeedhub-db-password/-minio-password owned by their
  backends, indeedhub-jwt by the api) reuse the live /var/lib/archipelago/
  secrets values (ensure_one no-ops on existing files; postgres pw is fixed
  at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept.

The frontend carries the post_install hook (#20) that replaces the hardcoded
patch_indeedhub_nostr_provider: strip X-Frame-Options, refresh
nostr-provider.js from /opt/archipelago/web-ui, inject the <script> if
absent, reload nginx — defensive/idempotent since indeedhub:1.0.0 already
bakes these. Frontend manifest also corrected off its dead Next.js shape
(health check now nginx :7777, tmpfs /run + /var/cache/nginx).

Builds + unit-tested; live adoption/lifecycle verification on .228 next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 15:46:26 -04:00
archipelago
b94b61f640 feat(manifest): network_aliases — extra DNS aliases on a container's network
Add `container.network_aliases: Vec<String>` (serde default, DNS-label
validated) so a stack member can answer to short hostnames its peers bake
in, beyond its own container name. Rendered in both runtime paths:
- podman_client: merged (deduped) into the custom-network aliases array.
- quadlet from_manifest: appended after the container name; emitted only
  for Bridge networks (slirp/pasta reject aliases).

Needed for the indeedhub migration: its frontend nginx proxies to
`api:4000` / `minio:9000` / `relay:8080`, so those members declare
`network_aliases: [api|minio|relay]` to keep the short names resolvable on
the dedicated indeedhub-net (vs. colliding generic aliases on archy-net).

Also fixes 4 pre-existing from_manifest test failures (unrelated to this
change, surfaced now that the quadlet suite runs green): test manifests
used the long-invalid `network_policy: archy-net` (allowlist is
isolated/bridge/host → moved to network_policy: isolated + container.network)
and bind sources outside /var/lib/archipelago.

Tests: container crate 53 pass; archipelago quadlet+alias 47 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 15:45:11 -04:00
archipelago
955c54b713 feat(hooks): post_install executor + install-path wiring (#20 phase 2)
Add container::hooks::run_post_install — runs an app's declarative
post_install hooks against its own running container:
- Exec  -> podman exec <container> <args…> (60s timeout-bounded)
- CopyFromHost -> resolve src against allowlist roots (<data_dir>/<app>
  and /opt/archipelago), canonicalise + prefix-check (defeats symlink
  escape), then podman cp <abs-src> <container>:<dest>

Best-effort + idempotent: a failed step is warned and skipped, never
fails the install — matching the legacy patch_indeedhub_nostr_provider
behaviour this replaces. Wired into install_fresh after the container is
up, so it runs only on a freshly created container (not plain start), and
re-applies on recreate-after-drift.

5 unit tests on resolve_copy_src (accept in-data-dir, reject absolute /
traversal / missing / symlink-escape). cargo test -p archipelago green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 11:45:28 -04:00
archipelago
4c1a4e5976 feat(hooks): manifest lifecycle-hooks schema (#20 phase 1) + fix container test literals
Add controlled post_install/pre_start hook schema to AppDefinition:
LifecycleHooks/HookStep (Exec | CopyFromHost)/HostCopy with allowlist
validation (relative src, no '..', absolute container dest, non-empty
exec). Re-exported from the crate root. Design: docs/manifest-hooks-design.md.

Also add the missing generated_secrets: vec![] field to three
pre-existing ContainerConfig test literals (the field was added to the
struct in 03a4ee1b but the container crate's own tests were never rerun,
so -p archipelago-container failed to compile). cargo test green: 53 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 11:07:00 -04:00
archipelago
f160e0c404 fix(reboot): enable podman-restart.service at startup (--restart reboot-survival)
Orchestrator-installed backends (immich, btcpay-db, …) run as plain podman
`--restart=unless-stopped` containers until the Phase-3 Quadlet rollout flips
use_quadlet_backends on. Nothing in the codebase enabled the user's
podman-restart.service, so those containers had NO reboot-survival mechanism.
Enable it (idempotent, best-effort) at orchestrator startup so unless-stopped
containers come back after a reboot. Already applied manually on .228 (covers
31 containers incl. immich + btcpay); this codifies it fleet-wide.

The deeper fix (render Quadlet for all orchestrator installs) remains the gated
Phase-3 Quadlet-everywhere rollout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 08:23:19 -04:00
archipelago
d5ef45731a fix(immich): restore canonical app_id "immich" (title + icon)
After the manifest migration the launcher installed as "immich-server" (app_id),
which has no catalog entry → showed the raw id and no icon. Rename the server
manifest app_id immich-server→immich so it matches the catalog/curated "immich"
entry (title "Immich", icon immich.png) and is recognised as a known launcher app
(APP_CATEGORY_MAP) → stays in My Apps. immich_stack_app_ids now installs
[immich-postgres, immich-redis, immich]; orchestrator.install bypasses package
routing so there's no recursion with the "immich"→stack-installer mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 08:07:08 -04:00
archipelago
9e6c5370fc feat(immich): manifest-driven stack via orchestrator — live-migrated on .228
Completes the immich migration off the legacy hardcoded install_immich_stack
(podman run + sudo chown) to the registry-manifest + orchestrator path. Validated
live on .228 (clean single set, healthy v2.7.4, data dir ownership correct).

- install_immich_stack now tries install_stack_via_orchestrator(immich_stack_app_ids)
  first; legacy remains only as the no-manifests fallback.
- immich-{postgres,redis,server} manifests corrected from live findings:
  * named by app_id (dropped container_name override) — using container_name
    spawned DUPLICATE containers (app_id-named install vs name-override reconcile)
    on the same PGDATA, which corrupted a postgres cluster. Server reaches its
    siblings via app_id aliases (DB_HOSTNAME=immich-postgres, REDIS=immich-redis).
  * immich-postgres data_uid 100998:100998 (postgres drops to container 999 →
    host 100998 under rootless; verified the fresh dir is chowned correctly).
  * immich-server version "release"→"2.7.4" (manifest validation requires a digit;
    the bad version made the manifest silently skip → partial orchestrator install
    → legacy fallback → the duplicate corruption above).
- HARDEN install_stack_via_orchestrator: only fall back to the legacy installer
  when NOTHING was installed yet. An "unknown app_id" AFTER a member is up now
  errors instead of double-creating containers on shared data (the corruption
  root cause).
- Strict the all-manifests round-trip test: fail (not skip) on any invalid shipped
  manifest — this gap let the bad immich-server version through.

Known follow-up (pre-existing, platform-wide): orchestrator-installed backends
(immich, btcpay-db) run as podman --restart, not Quadlet, and podman-restart.service
is disabled on .228 → reboot-survival gap independent of this migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 07:08:45 -04:00
archipelago
7bfbe8fe40 feat(registry-manifest): phase 2 — publisher embeds manifests into signed catalog
generate-app-catalog.sh gains opt-in EMBED_MANIFESTS=1: embeds each
apps/<id>/manifest.yml into its catalog entry's `manifest` field (whole document,
top-level app: preserved — exactly what the Rust side deserializes). Default off
so routine catalog regen is unchanged during the migration window; turn on
deliberately, then sign via the existing release-root ceremony. Verified: default
embeds 0; EMBED_MANIFESTS=1 embeds 40 manifests (generated_secrets preserved).

Adds a round-trip guard test: every shipped apps/*/manifest.yml must deserialize
+ validate through catalog_manifest_to_overlay (image apps accepted, build apps
defer to disk) — catches schema drift between disk manifests and the catalog path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:46:17 -04:00
archipelago
220666d3a9 feat(registry-manifest): phase 1 — orchestrator consumes manifests from signed catalog
Workstream B phase 1 (node-side consume). The signed app-catalog can now carry a
full manifest per entry; the orchestrator overlays it over the disk manifest
(origin-wins) with disk as the migration fallback. Moves apps toward
registry-distributed manifests with no OTA-shipped disk file.

- app_catalog: `manifest: Option<Value>` on AppCatalogEntry (forward-compatible,
  covered by the existing release-root signature over the raw JSON);
  `catalog_manifest_values()` accessor.
- prod_orchestrator: `load_manifests` overlays catalog manifests after the disk
  walk; `catalog_manifest_to_overlay()` returns None (→ disk fallback) on
  unparseable value / app-id mismatch / failed validate() / build source
  (build contexts aren't registry-distributed yet — phase 1 is image-only).
- manifest_dir stays PathBuf (build-only field); image-only apps never read it.
- 6 unit tests; compiles clean. No-op until a catalog embeds a manifest, so
  existing nodes are unaffected.

See docs/registry-manifest-design.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:30:38 -04:00
archipelago
03a4ee1b30 feat(container): manifest-declared generated secrets + companion/quadlet hardening
Generated-secrets system: apps declare `generated_secrets` in their manifest
(kinds hex16/hex32/bcrypt); `container::secrets::ensure_generated_secrets`
materialises them 0600/rootless in resolve_dynamic_env — idempotent and
self-healing (recovers wrongly root-owned secrets with no privilege). Replaces
per-app Rust (deletes ensure_fmcd_password). fedimint-clientd/gateway manifests
now declare fmcd-password / fedimint-gateway-hash.

companion.rs: rebuild the auto-built :latest image when its build context changes
(staleness check) so baked-in fixes (e.g. guardian-UI CSS) actually reach nodes.

quadlet.rs: skip PublishPort under Network=host (podman rejects the combo, exit
125) + regression tests.

UI: "Fedimint Guardian" rename, fedimint-clientd/nostr-rs-relay/meshtastic tagged
as Services (headless backends), gateway icon fallback.

Deployed + verified on .228 (generated-secrets fixed fedimint-gateway start;
grafana/strfry orphan crash-loop units removed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:11:07 -04:00
archipelago
db7d424bff feat(content): owned-content persistence + Fedimint paid downloads, fmcd caps fix, FIPS warm-path perf
Buyer-side paid downloads now persist: purchases are cached on disk
(content_owned.rs) keyed by (seller onion, content_id), the gallery shows
an "Owned" badge unblurred, and items view/play in-app from the local
cache with no re-payment or reliance on a browser download (which
silently failed on the mobile companion). New RPCs content.owned-list /
content.owned-get. Validated e2e .116<-.198 (paid 100 sats via Fedimint,
166KB jpeg returns, survives restart).

fedimint-clientd manifest: restore the standard container capability set
(CHOWN/DAC_OVERRIDE/FOWNER/SETUID/SETGID) so fmcd's startup chown of an
existing-federation /data succeeds instead of dying EPERM (#7). Confirmed
the orchestrator applies these to the running container.

FIPS perf: tighten the supervisor warm-path keepalive 45s -> 25s so peer
paths stay inside the ~30-60s NAT cold window. Dials now reliably land on
FIPS instead of re-punching and falling back to Tor. Measured to the same
peer: cloud browse 18-22s -> 0.4s; full Fedimint paid download 29s -> 11s
(residual is the seller-side guardian reissue round-trip).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 18:58:52 -04:00
archipelago
63b98599e8 Revert "fix(fedimint): run fmcd with seccomp=unconfined so its DHT can start (#7)"
This reverts commit 409543c41e78025354acbdde5ffc6445895d4508.
2026-06-20 14:37:24 -04:00
archipelago
409543c41e fix(fedimint): run fmcd with seccomp=unconfined so its DHT can start (#7)
fmcd crash-looped "Operation not permitted (os error 1)" on .116 (kernel
6.12.74): the default rootless seccomp profile blocks a syscall its Mainline-DHT
/ iroh transport needs, so the REST API never came up (:8178 → HTTP 000) and
federations couldn't be joined. Verified: with seccomp=unconfined fmcd boots and
answers /v2/* (HTTP 401 instead of dead). fmcd works on other nodes, so this is
kernel/seccomp-specific — but the relaxation is safe for an outbound-networking
daemon and harmless where not needed.

- new `security.seccomp_unconfined` manifest flag (SecurityPolicy);
- libpod backend sets `seccomp_profile_path: "unconfined"` (== --security-opt
  seccomp=unconfined); quadlet backend emits `SeccompProfile=unconfined`;
- enabled in apps/fedimint-clientd/manifest.yml.

NOTE: manifests live on-disk at /opt/archipelago/apps/<id>/manifest.yml, so the
node needs the updated manifest deployed + the fmcd container recreated to apply.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 13:08:13 -04:00
archipelago
12f54e390d feat(wallet): ecash pay confirmation screen + auto-refund on failed sale (#3)
- PeerFiles: new confirmation step after "pay from ecash" — shows the amount and
  which wallet will be spent (Cashu/Fedimint) with balances, lets the user switch
  backends, and a styled Confirm button. The chosen backend is passed to the
  payment so it spends exactly what was confirmed.
- content.download-peer-paid: accept `method` (cashu|fedimint) to honor the
  confirmed choice; log the backend + outcome; backend-specific rejection errors
  ("not in the same Fedimint federation" / "doesn't accept your Cashu mint").
- AUTO-REFUND: a minted token whose sale fails (peer unreachable, rejected, or
  error) is now reclaimed (fedimint reissue / cashu receive) so the buyer no
  longer loses the spent ecash — fixes the stuck-Fedimint-notes report.
- wallet.ecash-balance already reports cashu_sats/fedimint_sats/total_sats which
  the confirm screen uses to pick/show the covering wallet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 12:16:02 -04:00
archipelago
a6957a48f7 fix(netbird): wait for OIDC discovery before reporting install done (#10)
Right after install the dashboard SPA opens and, if it loads before NetBird's
embedded OIDC provider is serving, caches a bad auth state — the user appears
logged-in but can't log out until it self-corrects. Container "running" != OIDC
ready, so gate the install's Done phase on the management server's
/oauth2/.well-known/openid-configuration answering (best-effort, 60s cap, never
fails the install since the stack is already up).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 08:57:37 -04:00
archipelago
8f06d88fbf feat(wallet): pay for peer files from BOTH Cashu and Fedimint ecash (#3)
Paying for a peer file minted a Cashu-only token, so a node whose ecash balance
lived in Fedimint couldn't pay even with funds. Now both backends are tried:

- payer (content.download-peer-paid): mint a Cashu token first; on failure fall
  back to spending Fedimint notes. Only error if BOTH backends can't cover it.
- seller (verify_and_receive_payment): accept Fedimint notes as well as Cashu —
  anything not starting with "cashu" is redeemed via reissue_into_any.
- new fedimint_client::spend_from_any() — spend from whichever joined federation
  has the balance, returning the notes + federation id (mirrors reissue_into_any).
- wallet.ecash-balance now also reports fedimint_sats + combined total_sats; the
  pay-for-file pre-check uses the combined total so a Fedimint-funded node isn't
  wrongly blocked.

Compiles (cargo check + vue-tsc). Live cross-node federation validation pending
(dual-ecash phase 6) — needs two nodes sharing a federation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 08:13:23 -04:00