Mesh/federation messages between co-located nodes were always falling back to
Tor because the FIPS overlay had no direct peering — every node depended on the
global anchor's spanning tree, and when that anchor link flaps a node is
isolated and all FIPS dials time out. (Diagnosed live on .116/.198: pure-FIPS
direct peering over UDP 8668 fixes it — 2.5ms vs timeout.)
Generalize the manual fix: in the existing 5-min FIPS seed-anchor apply loop,
also auto-connect every federation peer the PeerRegistry knows both a LAN
address AND a FIPS npub for, dialing its FIPS UDP transport (port 8668) at its
LAN IP via the same idempotent `fipsctl connect` path (new
anchors::lan_fips_anchors). This is FIPS's own transport over the LAN — NOT
Tailscale, NOT the HTTP/LAN messaging port. Transient (recomputed each tick from
live mDNS discovery, never persisted) so changing IPs self-correct. Remote peers
with no LAN address are untouched (still routed via the anchor).
Registry Arc hoisted out of the transport-init block so the loop can read
all_peers(). cargo check green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The synthetic meshcore-style frame the meshtastic driver builds can't carry the
radio's PKI-encryption status, so received meshtastic DMs never lit the E2E pill.
Thread it out-of-band: the device records `last_rx_encrypted` (= packet
pki_encrypted) when it yields a text frame; the session loop reads it via
`take_rx_encrypted()` right after dispatch and stamps the just-stored received
message E2E (dispatch::stamp_received_encrypted, monotonic-id keyed). Meshcore
returns false here (its E2E is derived in the frames decrypt path). Pure
out-of-band signal — no change to the shared meshcore wire format.
Built + deployed live in binary d937814e on .116/.198. cargo check green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three stacked bugs made "switch version" silently fail / crash-loop, and
the data-access mismatch corrupted a node's index during recovery attempts.
Backend renderer:
- sync_quadlet_unit ignored the per-app pinned version and re-rendered the
quadlet with the manifest's :latest every reconcile tick, reverting any
switch. Factor the install-time catalog/pin resolution into a shared
resolve_catalog_image() and call it in BOTH install_fresh and
sync_quadlet_unit.
- The renderer folded manifest `entrypoint: ["sh","-lc"]` into Exec=, which
only worked when the image entrypoint was a passthrough shell wrapper. The
versioned images use ENTRYPOINT ["bitcoind"], so Exec=sh -lc ... became
`bitcoind sh -lc ...` and crash-looped. Emit a real Entrypoint= override;
exec_changed now also compares Entrypoint=.
Images:
- Build all bitcoin images (Core + Knots, every version) as container-root
(USER removed) like the legacy :latest image. Chain data is owned by the
data_uid (container uid 102); root reads it via CAP_DAC_OVERRIDE (granted in
the manifest). A non-root USER (the previous uid 1000) can't read existing
chain data → "Error initializing block database". Still fully rootless:
container-root maps to the unprivileged host service user.
Catalog:
- bitcoin-knots versions[]: 29.3.knots20260508/20260507/20260210 +
29.2.knots20251110, "latest" tracking newest.
- bitcoin-core versions[]: add 29.2 + a "latest" entry. All images rebuilt
root and published to the mirror.
Frontend:
- AppSidebar version dropdown: rename the latest option to "Always use the
latest version" (no v prefix), fix right padding, and guarantee the current
selection matches a real option (was rendering blank).
- New InstallVersionModal: full-screen version chooser shown from the App
Store / Discover install button for multi-version apps (Bitcoin Knots/Core),
app icon + "Install <name>", latest pre-selected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses the open Meshtastic parity bug (project_meshtastic_parity): the
running driver received nothing (`mesh.messages` stayed []) though the radio
got the packets and sends worked.
Root-cause candidate: `try_recv_frame` decoded ONE serial frame per poll and
returned Ok(None) for every non-text FromRadio frame, so the session loop slept
50ms between frames. Under Meshtastic's frequent NodeInfo/telemetry stream a
received text packet queued behind them, and read_from_radio's 64KB buffer cap
could drain (drop) it before it was ever decoded — reception silently dead while
sends kept working.
- try_recv_frame now drains a bounded batch (64) per poll, processing each
frame's side effects and returning the first inbound text frame, so a text
packet is decoded the same poll it arrives and the buffer never grows enough
to hit the lossy cap. Bounded so a continuous flood still yields to select!.
- packet_to_inbound_frame logs every decoded packet (from/portnum/payload_len)
and a "did not parse (dropped)" case, so one live radio pass is conclusive.
The rest of the decode path was verified correct by inspection (FROM_RADIO_PACKET
=2, wire-type-5 handled, parse_mesh_packet sound, 60s heartbeat present) — not a
parse bug. cargo check green. NEEDS a live radio pass on a rig that isn't .228
(off-limits: bitcoin testing) to confirm.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a per-message transport badge to archy↔archy mesh chats and fixes the
long-broken E2E badge — both meshcore and meshtastic, styled like the existing
E2E pill.
Transport pill:
- New `MeshMessage.transport` ("lora"/"fips"/"tor"), surfaced in the UI beside
the E2E badge (Mesh.vue transportLabel() → Mesh/FIPS/Tor, mesh-styles.css).
- Sent LoRa → "lora"; sent federation → finalized to the real leg ("fips"/"tor")
once the background send resolves (req.send_json transport), via an id-keyed
store update.
- Received: a post-dispatch stamp on handle_typed_envelope_direct's output
(monotonic ids) tags both transports without threading through all 20 typed-
dispatch sites — radio wrapper stamps "lora", federation injector stamps the
peer's last_transport ("fips"/"tor", default tor; the inbound HTTP carries no
FIPS-vs-Tor signal).
- Plain native/channel LoRa frames → "lora"; channel broadcasts stay non-E2E.
E2E pill fix:
- `encrypted` was hardcoded false at every MeshMessage construction site, so the
UI badge (Mesh.vue `v-if="msg.encrypted"`) never showed. Now: federation
envelopes are E2E (identity-signed over an encrypted transport); the meshcore
native-DM receive path already had a real `encrypted` flag (now also tagged
with transport). meshtastic-PKI radio E2E flag threading is a noted follow-up.
Backend cargo check + frontend vue-tsc build both green. Needs a live radio +
multi-transport pass on .116/.228 to confirm end-to-end (see
project_transport_pill / project_meshtastic_parity).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The split-mempool-stack guard that skips the legacy monolithic `mempool`
manifest (whose container collides with its split-stack frontend member
`archy-mempool-web`) only ran over DISK manifests. On catalog-driven nodes
(no disk manifests — e.g. the Phase-3/registry-manifest path), the legacy
`mempool` manifest arrives via the registry-catalog overlay AFTER that
guard, so both `mempool` and `archy-mempool-web` end up owning container
`mempool` and rewrite+restart each other forever ("port binding drift" /
"network alias drift" loop observed on .228, leaving mempool down).
Enforce the guard once more over the merged (disk + catalog) manifest set:
drop the `mempool` umbrella whenever all three split members are present.
Installing `mempool` assembles the split stack, so `archy-mempool-web`
owns the frontend container either way.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
WS-F #10: a netbird reinstall that adopts a leftover running container
skipped ensure_manifest_certs, so when its data dir was wiped the self-
signed tls.crt/key were never regenerated; the next nginx.conf rewrite +
restart then died on the missing cert (proxy 502, login broken). The
Running branch of ensure_running_with_mode now calls ensure_manifest_certs
before ensure_manifest_files, mirroring prepare_for_start's certs-before-
files ordering. Idempotent: a no-op when crt+key already exist.
Live-validated on .228: deleted netbird tls.crt/key under a Running
container; reconciler regenerated a fresh CN=<host_ip> self-signed cert
(1000:1000), https :8087 = 200.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fresh Meshtastic radios ship region-UNSET (RF-silent) and on mismatched
channels, so nodes only ever saw themselves. Bring them to MeshCore parity
using the official Meshtastic admin API:
- Auto-provision LoRa region (set_config, AdminMessage field 34) from a new
mesh-config `lora_region` (e.g. EU_868) when the radio's region differs.
- Auto-provision a shared primary channel (set_channel, field 33) with a
PSK derived deterministically from channel_name, so every node converges on
one mesh — the parity equivalent of MeshCore's named "archipelago" channel.
- Read current region/channel from want_config; only write when different
(no reboot loop); cap attempts so a radio that won't persist can't loop.
- Active NodeInfo advert scaffolding + aggressive serial drain.
Verified on .116+.228: region+channel persist, discovery works (both see each
other as named reachable contacts), bidirectional RF + sending confirmed.
Receiving in the running driver is still under diagnosis (instrumentation added).
Also removes the unwanted `meshtastic` daemon app from the registry (it was
never meant to be a container — native driver provides system-level support):
deletes apps/meshtastic + catalog entries (app-catalog, neode-ui, releases) +
test refs. Meshtastic stays native, like MeshCore.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ensure_bind_mount_dirs chowned a freshly-created no-data_uid bind dir
with --reference={immediate_parent}. For a NESTED bind source like
jellyfin's /var/lib/archipelago/jellyfin/config (or netbird's .../netbird/
data), `mkdir -p` creates the intermediate <app> dir root:root too, so
referencing the immediate parent just copied ROOT — leaving the dir
unwritable and the app EACCES-crash-looping on reinstall (found by the
all-apps-lifecycle pass: jellyfin "/config/log denied" exit 139;
netbird-server "unable to open database file"). It only ever worked for
direct children of the data root (immich).
Fix: anchor to the nearest PRE-EXISTING ancestor (the rootless data root,
owned by the service user) and chown -R the entire newly-created subtree
to it. Extracted the walk into fresh_subtree_anchor() with a unit test
covering nested / direct / second-volume cases.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Uninstalling immich/grafana could hang with a frozen full-red progress
bar, leave a ghost entry stuck in My Apps, and then refuse reinstall.
Single root cause: quadlet::disable_remove() — called first in the
uninstall task (via companion + orchestrator teardown) — ran
`systemctl --user stop`, daemon-reload, and `podman rm -f` with NO
timeout. On rootless podman a generated unit can wedge in "deactivating"
while podman hangs underneath, so `systemctl stop` blocks forever. The
spawned uninstall task then never returns Ok or Err, so:
- set_uninstall_stage() (after the stop) never fires → progress frozen;
- remove_package_state_entry() never runs → entry stranded in
`Removing` → ghost in My Apps;
- the install guard rejects reinstall with "already Removing".
The spawn wrapper already reverts state on Err and removes the entry on
Ok — the only failure mode was a hang that returns neither. Bound the
teardown so it always terminates:
- systemctl stop → QUADLET_STOP_TIMEOUT, escalate to kill+reset-failed
on timeout (reuses the existing helpers);
- daemon_reload_user() → bounded systemctl_user_status (30s);
- defensive `podman rm -f` → wrapped in tokio timeout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
podman trusts its own state DB: when a container's conmon dies without
podman observing it (cgroup-cascade SIGKILL on archipelago.service
restart, a crash), `podman ps` keeps reporting it "Up" long after the
process is gone. The reconciler NoOp'd such a zombie forever, so a dead
dependency with no published host port never recovered.
Observed live on .228 (2026-06-25): netbird-dashboard reported "Up" with
a dead State.Pid → its nginx proxy 502'd → NetBird login broke
("Unauthenticated"). The dashboard publishes no host port, so the
Running branch had nothing to probe and never recreated it.
Add a zombie guard to the Running branch: verify the recorded State.Pid
is alive (its /proc entry exists) before trusting "running"; on a
concrete dead PID, stop+remove+install_fresh from the manifest.
Conservative by design — any uncertainty (inspect failed, PID
unparseable) assumes alive, so a transient podman hiccup never destroys
a healthy container. Unit test covers live/dead/out-of-range PIDs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
netbird is fully manifest-driven (apps/netbird-*/manifest.yml via the signed
catalog): install_stack_via_orchestrator renders the 3-member stack with
generated_certs (self-signed TLS for the #15 OIDC secure context), base64
generated_secrets, and templated config — and adopts the running stack by live
container name. The hardcoded `podman run` fallback was therefore dead code on
any node with the embedded catalog (verified live: .228 https:8087 -> 200).
Removes the per-app Rust installer anti-pattern the master plan calls out:
- install_netbird_stack: orchestrator -> adopt -> bail! (no in-Rust installer)
- deletes 6 now-dead helpers (write_netbird_config_files, ensure_netbird_tls_cert,
read_or_generate_b64_secret, netbird_net_resolver_ip, detect_netbird_public_host_ip,
wait_for_netbird_oidc_ready), 3 NETBIRD_*_IMAGE consts, unused base64::Engine import
- ~485 lines removed; prod_orchestrator doc-comments updated
Behavioural parity: the manifest path already executed on the fleet, so this
changes no live behavior. The legacy #10 OIDC-readiness wait was already bypassed
by the manifest path; if that race resurfaces, add an OIDC-ready gate to the
manifest rather than resurrecting the Rust fn.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NOT yet validated on a node or fleet-deployed — cargo check passes, release build
+ .228 canary validation pending. Committed as a checkpoint so the work survives.
Two fixes the immich .198 incident exposed:
Fix A (reconcile_all_with_mode): a previously-running app whose container vanished
(e.g. a wedged podman teardown cleared by a reboot) was left absent on boot. Now,
when boot reconcile would leave an app 'absent' but it was running at the last
running-containers snapshot, recreate it (install_fresh). New
crash_recovery::load_last_running_names() reads the snapshot without the PID/crash
gate (+2 unit tests). Match is exact on compute_container_name (incl stack
members); user-stopped + uninstalled apps are already excluded, so no false
positives.
Fix B (ensure_bind_mount_dirs): a freshly-created bind dir was left root:root, so a
no-data_uid app running as container-root (→ host rootless user) hit EACCES and
crash-looped (the exact immich upload-dir failure). Now a newly-created bind dir
for a no-data_uid app is chowned via --reference=<parent> to match the rootless
data root — no host-uid guessing, only fresh dirs (no regression for existing
installs).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ensure_running_container_ownership re-probed and re-attempted the in-container
chown on every reconcile pass. For a mount that can't be re-owned from inside the
userns (observed: mempool-api /data -> 'Operation not permitted'), this burned
CPU and logged a WARN on every pass, forever (~6x/30min on .228/.116).
Remember hard chown failures in a process-lifetime set keyed by (container-id,
dest) and skip the probe+chown for known-unrepairable mounts. Keyed by Id (not
name) so a recreated container gets a fresh repair attempt. Verified on .116:
one recorded failure at startup, then silent across subsequent reconciles.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
handle_package_uninstall lumped every teardown failure into one `errors` vec
and returned Err on any of them BEFORE removing the package state entry — so a
non-fatal cleanup hiccup (a slow/failed `sudo rm -rf` of a large data dir, a
volume/network removal) left the app's containers gone but its entry in
package_data → a ghost in My Apps, and the spawned task reverted it to Installed.
Split the failures: container removal that even force-rm can't complete (app
genuinely still present) keeps the entry + returns Err; everything after the
containers are gone is best-effort. Remove the state entry as soon as the
containers are gone — BEFORE the slow volume/data teardown — so My Apps updates
immediately and residue can never ghost the app. set_uninstall_stage is a no-op
once the entry is gone (if-let guard), so the later stages don't re-create it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wait_for_manifest_host_ports TCP-connect-probed every published port, including
UDP/SCTP. netbird's 3478/udp STUN can never answer a TCP connect, so the probe
failed forever and drove an endless host-port repair/reconcile loop on .228
(netbird-server restarting ~every 60s). Filter to tcp (empty protocol = tcp).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Turn on registry-distributed manifests for all apps: generate-app-catalog.sh now
embeds each apps/<id>/manifest.yml by default (EMBED_MANIFESTS opt-out), so nodes
install from the signed catalog (origin-wins overlay, disk = fallback) with no
OTA-shipped disk manifest. main.rs awaits a bounded (25s) refresh_catalog before
load_manifests so a fresh boot overlays the latest embedded catalog instead of a
restart later; offline/ISO boot falls through to disk and never hangs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
image_exists ran `podman image inspect <image>` via .status() (inherits the
service stdout) with no --format, so every hit dumped the image's full ~249-line
manifest JSON into the journal — once per companion image, every reconcile pass
(.228: 21.6k journal lines / 10 min, 4131 inspect dumps). The service never
crashed (NRestarts=0); the sustained journald/IO flood starved the async runtime
and dropped the UI /ws/db websocket -> constant "connection lost"/reconnect.
Discard the child's stdout/stderr; only the exit status is used.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
package.restart resolved its container list via
ordered_containers_for_start, which injected every name from the
union startup_order list that wasn't already present — including
variant names not live on a given node (mysql-mempool,
archy-mempool-api, archy-mempool-web). The phantom mysql-mempool is
2nd in the mempool start order, so do_orchestrator_package_start hit
its unknown-app-id fallback, do_package_start failed the inspect
("no such object"), and the `?` aborted the whole start sequence —
leaving mempool-api + the frontend down until the health monitor
recovered them minutes later. That was the source of the 5× gate
flakes #73 (frontend not running in 180s) and #74 (api not queryable
in 300s); root-caused from the .228 journal
("Start failed: mysql-mempool").
Replace the inject-then-sort logic with a pure helper
order_present_containers that orders only the actually-present
containers and never adds phantom entries. startup_order remains a
union of name variants across install generations — it's now used
purely to order what's live, not to inject what isn't. +3 unit tests.
Also harden bitcoin-knots.bats "valid state" probe: poll ≤30s for a
settled state instead of a single-shot read, so a container caught
mid-reconcile (transient restarting/configured) can't flake a 20-min
iteration. A genuinely-stuck container never settles, so real
breakage is still caught.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rename run-20x.sh → run-gate.sh, default ARCHY_ITERATIONS 20→5, and scrub
20× references across CLAUDE.md, the master plan, TESTING.md, app-registry
status, the orchestrator/config doc-comments, and the bats suites. Also add
a minimal fail() helper to mempool.bats so guard failures report cleanly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The companion-unit repair stage ran at the END of each boot-reconciler tick, after
reconcile_existing(). On a heavily loaded node that per-app pass takes >60-90s, so a
deleted/lost companion unit (electrs-ui, bitcoin-ui, …) wasn't repaired within any
reasonable window (gate test 31 'deleted unit recreated within one reconcile tick'
timed out at 90s on the 45-app .228 node). Detecting + rewriting a companion unit is
cheap, so spawn it as its own ~interval(30s) loop, independent of the slow app pass.
Handle is aborted when the main loop exits (shutdown uses notify_one, so a second
waiter would steal the wake permit). tick() is now app-reconcile only.
All 4 boot_reconciler cadence tests still green (companion_stage=false in tests).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A user-stopped backend (electrumx, bitcoin, lnd, fedimint) kept reading 'running'
in container-list because its UI companion (electrs-ui, …) still serves the launch
port, and the state-refresh upgrades any reachable launch port to 'running'. The
gate's wait_for_container_status <app> stopped therefore never saw 'stopped'.
Fix: load the user_stopped marker in handle_container_list and force 'stopped' for
those apps before the launch-port refresh. The reconcile guard keeps the backend
down, so the marker is authoritative. package.start clears it first, so a started
app reports 'running' normally.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
package.stop a dependency (e.g. electrumx, a mempool dep) and the reconciler
restarts it within ~8s: the reconcile filter's dependency_required override
re-includes a user-stopped app that an active app depends on, and the in-memory
disabled set is wiped on manifest reload — so ensure_running runs, the stopped
app's unreachable ports look like a fault, the host-port repair restarts it, and
package.stop never sticks (gate 'transitions to stopped' times out).
Fix: guard ensure_running_with_mode on the on-disk user_stopped marker (the single
choke point every reconcile flows through) → Left('user-stopped'). Explicit
install/start clear the marker first (added clear_user_stopped to orchestrator
install/start, symmetric with disabled.remove; start/restart RPC already cleared
it) so user actions are unaffected. The container itself already stopped correctly
— this stops the resurrection.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Live on .228 the post_install `exec` steps failed with "crun: write
cgroup.procs: Permission denied / OCI permission denied": a `podman exec`
launched from archipelago.service can't place its child in the container's
cgroup (under the service's own slice). Wrap `exec` in
`systemd-run --user --scope --quiet --collect podman exec …` so it gets its own
delegated cgroup — same trick as `podman_user_scope` for pasta starts.
`copy_from_host` (a host-side `cp`, no in-container process) stays direct.
Without this only copy_from_host worked; indeedhub happened to be unaffected
(its image pre-bakes the nginx config so the exec steps were no-ops), but the
hook capability is only generally useful with exec working. hooks unit tests
pass; live verify on .228 next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fresh-create path was blocked by hardcoded indeedhub orchestrator logic
that predated and conflicted with the manifest migration:
- ensure_running routed app_id=="indeedhub" → reconcile_indeedhub_stack, which
REFUSED to create the frontend from its manifest (returned Left("stack-managed")).
- run_pre_start_hooks("indeedhub") → start_indeedhub_backends →
wait_for_indeedhub_dependencies_ready(120) — a DNS gate with a chicken-and-egg
bug (required the frontend's own alias present before the frontend could be
created), which failed install_fresh with "dependencies were not ready within
120s" and left the frontend down (caught live on .228).
Delete all of it (−382 lines): reconcile_indeedhub_stack, start_indeedhub_backends,
wait_for_indeedhub_dependencies_ready, indeedhub_api_dependency_dns_ready,
indeedhub_required_aliases_present, repair_indeedhub_network_aliases,
indeedhub_alias_present, patch_indeedhub_nostr_provider, and the INDEEDHUB_*
consts. The manifests now carry everything these did: network_aliases (short
hostnames), generated_secrets, dependencies, and the post_install nginx hook. So
"indeedhub" + every member flows through the generic install_fresh/reconcile path
— the frontend fresh-creates normally and runs its hook.
(crash_recovery.rs's frontend-after-deps ordering guard is kept — it's beneficial
startup ordering, not a blocker.) cargo check + release build green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author the IndeedHub stack as 7 manifests (postgres/redis/minio/relay/api/
ffmpeg + frontend) and route install_indeedhub_stack through the
orchestrator first (immich pattern), falling back to the legacy installer
only when the manifests aren't deployed.
Data-preserving by construction — the manifests reproduce the live install
exactly so an existing node ADOPTS rather than recreates:
- container_name = the live hyphenated names the runtime already references
(health_monitor tiers/deps, crash_recovery).
- named volumes indeedhub-{postgres,redis,minio,relay}-data (not bind mounts).
- dedicated indeedhub-net + network_aliases [postgres|redis|minio|relay|api]
so the api/ffmpeg env hostnames and the frontend nginx upstreams resolve
unchanged.
- generated_secrets (indeedhub-db-password/-minio-password owned by their
backends, indeedhub-jwt by the api) reuse the live /var/lib/archipelago/
secrets values (ensure_one no-ops on existing files; postgres pw is fixed
at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept.
The frontend carries the post_install hook (#20) that replaces the hardcoded
patch_indeedhub_nostr_provider: strip X-Frame-Options, refresh
nostr-provider.js from /opt/archipelago/web-ui, inject the <script> if
absent, reload nginx — defensive/idempotent since indeedhub:1.0.0 already
bakes these. Frontend manifest also corrected off its dead Next.js shape
(health check now nginx :7777, tmpfs /run + /var/cache/nginx).
Builds + unit-tested; live adoption/lifecycle verification on .228 next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `container.network_aliases: Vec<String>` (serde default, DNS-label
validated) so a stack member can answer to short hostnames its peers bake
in, beyond its own container name. Rendered in both runtime paths:
- podman_client: merged (deduped) into the custom-network aliases array.
- quadlet from_manifest: appended after the container name; emitted only
for Bridge networks (slirp/pasta reject aliases).
Needed for the indeedhub migration: its frontend nginx proxies to
`api:4000` / `minio:9000` / `relay:8080`, so those members declare
`network_aliases: [api|minio|relay]` to keep the short names resolvable on
the dedicated indeedhub-net (vs. colliding generic aliases on archy-net).
Also fixes 4 pre-existing from_manifest test failures (unrelated to this
change, surfaced now that the quadlet suite runs green): test manifests
used the long-invalid `network_policy: archy-net` (allowlist is
isolated/bridge/host → moved to network_policy: isolated + container.network)
and bind sources outside /var/lib/archipelago.
Tests: container crate 53 pass; archipelago quadlet+alias 47 pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add container::hooks::run_post_install — runs an app's declarative
post_install hooks against its own running container:
- Exec -> podman exec <container> <args…> (60s timeout-bounded)
- CopyFromHost -> resolve src against allowlist roots (<data_dir>/<app>
and /opt/archipelago), canonicalise + prefix-check (defeats symlink
escape), then podman cp <abs-src> <container>:<dest>
Best-effort + idempotent: a failed step is warned and skipped, never
fails the install — matching the legacy patch_indeedhub_nostr_provider
behaviour this replaces. Wired into install_fresh after the container is
up, so it runs only on a freshly created container (not plain start), and
re-applies on recreate-after-drift.
5 unit tests on resolve_copy_src (accept in-data-dir, reject absolute /
traversal / missing / symlink-escape). cargo test -p archipelago green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Orchestrator-installed backends (immich, btcpay-db, …) run as plain podman
`--restart=unless-stopped` containers until the Phase-3 Quadlet rollout flips
use_quadlet_backends on. Nothing in the codebase enabled the user's
podman-restart.service, so those containers had NO reboot-survival mechanism.
Enable it (idempotent, best-effort) at orchestrator startup so unless-stopped
containers come back after a reboot. Already applied manually on .228 (covers
31 containers incl. immich + btcpay); this codifies it fleet-wide.
The deeper fix (render Quadlet for all orchestrator installs) remains the gated
Phase-3 Quadlet-everywhere rollout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After the manifest migration the launcher installed as "immich-server" (app_id),
which has no catalog entry → showed the raw id and no icon. Rename the server
manifest app_id immich-server→immich so it matches the catalog/curated "immich"
entry (title "Immich", icon immich.png) and is recognised as a known launcher app
(APP_CATEGORY_MAP) → stays in My Apps. immich_stack_app_ids now installs
[immich-postgres, immich-redis, immich]; orchestrator.install bypasses package
routing so there's no recursion with the "immich"→stack-installer mapping.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the immich migration off the legacy hardcoded install_immich_stack
(podman run + sudo chown) to the registry-manifest + orchestrator path. Validated
live on .228 (clean single set, healthy v2.7.4, data dir ownership correct).
- install_immich_stack now tries install_stack_via_orchestrator(immich_stack_app_ids)
first; legacy remains only as the no-manifests fallback.
- immich-{postgres,redis,server} manifests corrected from live findings:
* named by app_id (dropped container_name override) — using container_name
spawned DUPLICATE containers (app_id-named install vs name-override reconcile)
on the same PGDATA, which corrupted a postgres cluster. Server reaches its
siblings via app_id aliases (DB_HOSTNAME=immich-postgres, REDIS=immich-redis).
* immich-postgres data_uid 100998:100998 (postgres drops to container 999 →
host 100998 under rootless; verified the fresh dir is chowned correctly).
* immich-server version "release"→"2.7.4" (manifest validation requires a digit;
the bad version made the manifest silently skip → partial orchestrator install
→ legacy fallback → the duplicate corruption above).
- HARDEN install_stack_via_orchestrator: only fall back to the legacy installer
when NOTHING was installed yet. An "unknown app_id" AFTER a member is up now
errors instead of double-creating containers on shared data (the corruption
root cause).
- Strict the all-manifests round-trip test: fail (not skip) on any invalid shipped
manifest — this gap let the bad immich-server version through.
Known follow-up (pre-existing, platform-wide): orchestrator-installed backends
(immich, btcpay-db) run as podman --restart, not Quadlet, and podman-restart.service
is disabled on .228 → reboot-survival gap independent of this migration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
generate-app-catalog.sh gains opt-in EMBED_MANIFESTS=1: embeds each
apps/<id>/manifest.yml into its catalog entry's `manifest` field (whole document,
top-level app: preserved — exactly what the Rust side deserializes). Default off
so routine catalog regen is unchanged during the migration window; turn on
deliberately, then sign via the existing release-root ceremony. Verified: default
embeds 0; EMBED_MANIFESTS=1 embeds 40 manifests (generated_secrets preserved).
Adds a round-trip guard test: every shipped apps/*/manifest.yml must deserialize
+ validate through catalog_manifest_to_overlay (image apps accepted, build apps
defer to disk) — catches schema drift between disk manifests and the catalog path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Workstream B phase 1 (node-side consume). The signed app-catalog can now carry a
full manifest per entry; the orchestrator overlays it over the disk manifest
(origin-wins) with disk as the migration fallback. Moves apps toward
registry-distributed manifests with no OTA-shipped disk file.
- app_catalog: `manifest: Option<Value>` on AppCatalogEntry (forward-compatible,
covered by the existing release-root signature over the raw JSON);
`catalog_manifest_values()` accessor.
- prod_orchestrator: `load_manifests` overlays catalog manifests after the disk
walk; `catalog_manifest_to_overlay()` returns None (→ disk fallback) on
unparseable value / app-id mismatch / failed validate() / build source
(build contexts aren't registry-distributed yet — phase 1 is image-only).
- manifest_dir stays PathBuf (build-only field); image-only apps never read it.
- 6 unit tests; compiles clean. No-op until a catalog embeds a manifest, so
existing nodes are unaffected.
See docs/registry-manifest-design.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Buyer-side paid downloads now persist: purchases are cached on disk
(content_owned.rs) keyed by (seller onion, content_id), the gallery shows
an "Owned" badge unblurred, and items view/play in-app from the local
cache with no re-payment or reliance on a browser download (which
silently failed on the mobile companion). New RPCs content.owned-list /
content.owned-get. Validated e2e .116<-.198 (paid 100 sats via Fedimint,
166KB jpeg returns, survives restart).
fedimint-clientd manifest: restore the standard container capability set
(CHOWN/DAC_OVERRIDE/FOWNER/SETUID/SETGID) so fmcd's startup chown of an
existing-federation /data succeeds instead of dying EPERM (#7). Confirmed
the orchestrator applies these to the running container.
FIPS perf: tighten the supervisor warm-path keepalive 45s -> 25s so peer
paths stay inside the ~30-60s NAT cold window. Dials now reliably land on
FIPS instead of re-punching and falling back to Tor. Measured to the same
peer: cloud browse 18-22s -> 0.4s; full Fedimint paid download 29s -> 11s
(residual is the seller-side guardian reissue round-trip).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fmcd crash-looped "Operation not permitted (os error 1)" on .116 (kernel
6.12.74): the default rootless seccomp profile blocks a syscall its Mainline-DHT
/ iroh transport needs, so the REST API never came up (:8178 → HTTP 000) and
federations couldn't be joined. Verified: with seccomp=unconfined fmcd boots and
answers /v2/* (HTTP 401 instead of dead). fmcd works on other nodes, so this is
kernel/seccomp-specific — but the relaxation is safe for an outbound-networking
daemon and harmless where not needed.
- new `security.seccomp_unconfined` manifest flag (SecurityPolicy);
- libpod backend sets `seccomp_profile_path: "unconfined"` (== --security-opt
seccomp=unconfined); quadlet backend emits `SeccompProfile=unconfined`;
- enabled in apps/fedimint-clientd/manifest.yml.
NOTE: manifests live on-disk at /opt/archipelago/apps/<id>/manifest.yml, so the
node needs the updated manifest deployed + the fmcd container recreated to apply.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- PeerFiles: new confirmation step after "pay from ecash" — shows the amount and
which wallet will be spent (Cashu/Fedimint) with balances, lets the user switch
backends, and a styled Confirm button. The chosen backend is passed to the
payment so it spends exactly what was confirmed.
- content.download-peer-paid: accept `method` (cashu|fedimint) to honor the
confirmed choice; log the backend + outcome; backend-specific rejection errors
("not in the same Fedimint federation" / "doesn't accept your Cashu mint").
- AUTO-REFUND: a minted token whose sale fails (peer unreachable, rejected, or
error) is now reclaimed (fedimint reissue / cashu receive) so the buyer no
longer loses the spent ecash — fixes the stuck-Fedimint-notes report.
- wallet.ecash-balance already reports cashu_sats/fedimint_sats/total_sats which
the confirm screen uses to pick/show the covering wallet.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Right after install the dashboard SPA opens and, if it loads before NetBird's
embedded OIDC provider is serving, caches a bad auth state — the user appears
logged-in but can't log out until it self-corrects. Container "running" != OIDC
ready, so gate the install's Done phase on the management server's
/oauth2/.well-known/openid-configuration answering (best-effort, 60s cap, never
fails the install since the stack is already up).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Paying for a peer file minted a Cashu-only token, so a node whose ecash balance
lived in Fedimint couldn't pay even with funds. Now both backends are tried:
- payer (content.download-peer-paid): mint a Cashu token first; on failure fall
back to spending Fedimint notes. Only error if BOTH backends can't cover it.
- seller (verify_and_receive_payment): accept Fedimint notes as well as Cashu —
anything not starting with "cashu" is redeemed via reissue_into_any.
- new fedimint_client::spend_from_any() — spend from whichever joined federation
has the balance, returning the notes + federation id (mirrors reissue_into_any).
- wallet.ecash-balance now also reports fedimint_sats + combined total_sats; the
pay-for-file pre-check uses the combined total so a Fedimint-funded node isn't
wrongly blocked.
Compiles (cargo check + vue-tsc). Live cross-node federation validation pending
(dual-ecash phase 6) — needs two nodes sharing a federation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A node reachable both over LoRa and federation has two MeshPeer rows (radio
twin: low contact_id + firmware key; federation twin: high contact_id +
archipelago key), and messages key by peer_contact_id split across the two ids
— so opening one twin shows an empty thread (the .120->.89 symptom).
- backend: new group_peer_twins() helper groups peers by arch_pubkey_hex (set on
BOTH twins by bind_federation_twins), keeps the radio id as the mesh-first
send target, and unions messages across all twin ids. Wired into
conversations.list / conversations.messages / mesh.contacts-list. +3 unit tests.
- frontend: the live chat list merges client-side (mergedPeers) and matched twins
by the "Archy-z6Mk..." advert prefix, which the Meshtastic device rename broke
(radio now advertises the server name). Merge by arch_pubkey_hex instead, which
the backend reliably sets on both twins. Expose arch_pubkey_hex on MeshPeer.
- fix unrelated stale test: EcashTransaction test missing the new `kind` field.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Meshtastic device rename was a no-op — set_advert_name only updated an
in-memory field and never told the radio, so the device kept its firmware
default ('Meshtastic xxxx') and wasn't findable from external Meshtastic
apps. MeshCore already renamed correctly (CMD_SET_ADVERT_NAME); this brings
Meshtastic to parity.
Send an AdminMessage{set_owner=User{long_name,short_name}} to the locally
connected node (admin packet to our own node_num on the ADMIN_APP port).
Local serial admin needs no session passkey, matching the official client.
long_name = server name (<=39 chars); short_name = first 4 alphanumerics,
upper-cased. Verified on real hardware: .120 -> 'Archy-X250-EXP', .5 ->
'Archy-X250-Beta' (name read back from the radio after reconnect).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A stock meshcore client (e.g. a phone) can't sign our typed envelopes, so it is
never 'authenticated' — which meant ticking it as an allowed assistant contact
had no effect and !ai stayed denied. The explicit per-contact allowlist is a
deliberate operator opt-in for a specific key, so match it regardless of
authentication, keyed on the asker's resolved identity (bound archipelago key,
else firmware routing key — how meshcore addresses the contact). The spoofable
federation-trust-list match still requires authentication.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- reissue_into_any now tries the UNION of the local registry AND fmcd's live
joined set (/v2/admin/info) before failing, so a valid Fedimint token isn't
wrongly rejected when the registry has drifted. On all-fail it returns a
friendly message: notes already redeemed into this wallet (funds safe) vs
didn't match any connected federation.
- Unified transaction history: a local Fedimint tx log (recorded on each
successful redeem) is merged with the Cashu history in wallet.ecash-history,
newest-first, each tagged kind=cashu|fedimint. Previously a Fedimint receive
appeared nowhere.
- fedimint-clientd healthcheck -> type:tcp. It was probing /health, which fmcd
doesn't serve (only /v2/*), pinning the container in (starting) forever; the
TCP probe is skipped by the Quadlet renderer (host-side lifecycle verifies),
so it reports running. Cosmetic for ecash, which worked throughout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A !ai (or any typed message) from a trusted, federated node was denied when
it arrived over the radio. The radio half of a node that is also a federation
peer carried no archipelago identity (identity adverts are no longer broadcast
on the public channel), so the trusted_only gate and signature verification
had no key to check the asker against — and the same node showed up as two
contacts (a radio twin + a federation twin).
- bind_federation_twins(): correlate a radio contact with its federation twin
by exact, case-insensitive advert_name and copy the federation peer's
arch_pubkey_hex/did/x25519 onto the radio record. Called from
upsert_federation_peer and refresh_contacts. Ambiguous names (held by >1
federation peer) are skipped. This is only a CANDIDATE key — security is
unchanged: the inbound envelope signature must still verify against it.
- send_message now signs the typed Text envelope (new_signed) so a radio !ai
authenticates against the bound key. A meshcore node merely named like a
trusted node cannot forge the signature, so it is still denied.
Receiver-side verification (handle_typed_envelope_direct) and federation-trust
matching (is_sender_allowed) already existed; this supplies the missing key
binding and signature. Also resolves the radio/federation duplicate-contact
display for same-named nodes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Messages to a federated peer that is out of LoRa range (e.g. on another
continent) were dropped into the radio with no fallback, or hung on a dead
FIPS path before reaching Tor — so they never arrived.
- Route a radio contact over the federation transport (FIPS->Tor) when it is
the same node as a federated peer (known archipelago identity -> onion) AND
it is not currently reachable over the radio. Reachable radio peers stay on
the mesh (preferred); oversized/file envelopes still always take federation.
- Resolve the onion via the archipelago identity key (arch_pubkey_hex), not
the firmware routing key, so a radio contact maps to its nodes.json onion.
- Add .fips_timeout(8s) to the federation message POST so an unreachable FIPS
overlay fast-fails to Tor (~3-5s) instead of burning the 120s budget.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>