archy/docs/HANDOFF-2026-06-20-mesh-netbird.md
archipelago b0c9bd2a0c docs: #7 exhaustive isolation — seccomp ruled out; fmcd runs standalone, orchestrator-managed fails (open)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 14:39:33 -04:00

Handoff — Mesh device rename, mesh routing, duplicate contacts, netbird logout (2026-06-20)

Session is a test-build iteration toward the 1.8.0 bug-bash release — sideload patched binaries to test nodes, NO version bump / NO OTA release (manifest stays 1.7.99-alpha). Because the version string never changes, verify a deploy by sha256-matching the deployed binary, not by current_version.

Test node roster (creds in the operator's local notes / agent memory — NOT in this repo)

  • .116 192.168.1.116 — this build host (archi-thinkpad), dev/validation.
  • .198 192.168.1.198, .228 192.168.1.228 — LAN resilience nodes.
  • .5 Tailscale 100.72.136.5 (archy-x250-beta) — Meshtastic radio.
  • .120 Tailscale 100.66.157.120 (archy-x250-exp) — Meshtastic radio.
  • .89 Tailscale 100.89.209.89 (archy-x250-pa) — dual radio: ttyACM0 Meshtastic (probe FAILS), ttyUSB0 MeshCore (active). Configured device_path = ttyACM0. Runs netbird (v2.38.0).

Deploy driver used this session: /tmp/archy-deploy/deploy-node.sh <user@host> <pw> <label> (scp binary + stream web/dist/neode-ui + sudo swap /usr/local/bin/archipelago, preserve aiui + claude-login.html, chown 1000:1000, restart, verify sha256+health). Recreate from this doc if /tmp is gone.

Deploy state (binary sha) at handoff

  • b5183dfc… (HEAD d00d1b20, includes Meshtastic rename) → on .5 and .120 (verified).
  • f702b4f1… (the 3 wallet/mesh/ui fixes, pre-rename) → on .116, .198, .228.
  • 7c17a96… (OLD, pre-f702b4f1) → .89 is STALE — update before re-testing .120→.89.

DONE

  1. Meshtastic device rename → server name — committed d00d1b20 (pushed to gitea-vps2/main). meshtastic.rs set_advert_name was a no-op (in-memory only). Now sends AdminMessage{set_owner=User{long_name,short_name}} to the local node on ADMIN_APP port (6), set_owner field = 32. long_name = server name (≤39), short_name = first 4 alphanumerics upper-cased. Hardware-verified: .120 radio now reads back Archy-X250-EXP, .5 reads back Archy-X250-Beta. MeshCore already renamed (CMD_SET_ADVERT_NAME, serial.rs:147) — unchanged, now at parity.
  2. Routing priority confirmed = Mesh → FIPS → Tor. send_typed_wire (mesh/mod.rs:1007): reachable radio peer → LoRa; federation-synthetic OR (!reachable && arch_pubkey_hex.is_some()) → federation. send_typed_wire_via_federation (mod.rs:1124): FIPS first w/ .fips_timeout(8s), Tor fallback.
  3. .120→.89 "non-delivery" diagnosed: it is NOT a delivery failure. .120 sends to .89's federation contact_id 3027572739, logs Federation envelope delivered transport=tor (gated on HTTP 2xx, mod.rs:1185). The receiver returns 2xx ONLY after ed25519-verify + successful inject_typed_from_federation (node_message.rs:217-263). Identity matches (.89 pubkey 031875b4…). .89→.120 works. So .120's messages ARE injected into .89's state under contact_id 2679725907 = federation_peer_contact_id(.120 pubkey 535fb91f…), name "Archy-X250-EXP". It's a duplicate-contact SURFACING problem (user confirmed doubles).
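
The name derivation from item 1 (long_name = server name capped at 39, short_name = first 4 alphanumerics upper-cased) can be sketched roughly as below. This is not the code in meshtastic.rs: the function names are hypothetical, and two details are assumptions the notes do not state: that the 39 limit counts UTF-8 bytes, and that "alphanumerics" means ASCII only.

```rust
/// Upper bound on long_name taken from the notes (<= 39).
/// ASSUMPTION: the limit counts UTF-8 bytes rather than characters.
const LONG_NAME_MAX: usize = 39;

/// Hypothetical helper: cap the server name at LONG_NAME_MAX bytes
/// without cutting a multi-byte character in half.
fn derive_long_name(server_name: &str) -> String {
    let mut end = server_name.len().min(LONG_NAME_MAX);
    // Step back until `end` sits on a character boundary (0 always is).
    while !server_name.is_char_boundary(end) {
        end -= 1;
    }
    server_name[..end].to_string()
}

/// Hypothetical helper: first four alphanumerics, upper-cased.
/// ASSUMPTION: only ASCII letters and digits count.
fn derive_short_name(server_name: &str) -> String {
    server_name
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .take(4)
        .map(|c| c.to_ascii_uppercase())
        .collect()
}
```

Under these assumptions the hyphen in a name such as Archy-X250-EXP is skipped rather than counted, so the short name comes from the letters and digits only.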

SESSION 3 PROGRESS (2026-06-20 — deployed fleet-wide, binary e1f2e88)

  • #5 Arch Mobile messages CONFIRMED FIXED by the #12 dedup — user verified MeshCore surfaces them.
  • #3 ecash pay-for-file — confirm UI + auto-refund (12f54e39): PeerFiles shows a confirmation step (amount + which wallet Cashu/Fedimint + balances + switch + styled Confirm); content.download-peer-paid takes method, logs the backend+outcome, gives backend-specific rejection errors, and RECLAIMS the spent token on any failure (fedimint reissue / cashu receive) so funds aren't lost. Root cause of the user's failed pay: .198 had no Cashu → spent Fedimint notes → seller .89 not in the SAME federation → rejected → notes stuck (now auto-refunded; old stuck notes auto-return in ~1h via the 3600s spend timeout). To COMPLETE a fedimint pay, payer+seller must share a federation (or share a Cashu mint w/ balance).
  • #1 companion crash — added an on-screen red error overlay (242baf5d) since chrome://inspect isn't reachable on the WebView; user reproduces → screenshots the box → that's the real error to fix on.
  • #7 NEW: can't add Fedimint federations on .116 — fmcd sidecar crash-loops Operation not permitted (os error 1), so :8178 answers HTTP 000 and wallet.fedimint-join fails. fmcd WORKS on .198/.89. EXHAUSTIVE black-box isolation on .116 (seccomp default vs unconfined; cap-drop ALL vs caps restored; fresh data vs a cp -a COPY of the real /data; default net vs archy-net; /data 755 vs 777) — fmcd ran in EVERY standalone podman run config, including full real security (cap-drop ALL + readonly + no-new-priv + archy-net + copy of real data). Only the ORCHESTRATOR-created container EPERMs. So:
    • seccomp is NOT the cause (default-seccomp standalone runs) — the seccomp "fix" was reverted (63b98599).
    • NOT caps, NOT /data perms/ownership, NOT the existing multimint.db (the copy runs), NOT archy-net.
    • The differentiator is something specific to the orchestrator's libpod-API create vs podman run that I did NOT pin (a related symptom: the orchestrator's volume self-heal logs chown /data: Operation not permitted because the container has cap-drop ALL → no CAP_CHOWN). NEXT: create fmcd via the libpod API socket directly (replicating prod_orchestrator's exact body) to repro outside the orchestrator, then diff. WORKAROUND for now: test Fedimint on .198/.89 (working fmcd), not .116. Not the ecash code.
  • Deploy: all 6 nodes verified on e1f2e88; pushed gitea-vps2 (gitea-local token still 401s).
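
The reclaim-on-failure behaviour described for #3 can be sketched as follows. Everything named here (the Wallet trait, MemWallet, pay_with_reclaim, the token format) is hypothetical and only illustrates the control flow: spend, attempt delivery to the seller, and take the token back if the seller rejects it. It is not the content.download-peer-paid implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Method { Cashu, Fedimint }

#[derive(Debug, PartialEq)]
enum PayOutcome {
    Delivered,
    /// Seller rejected; the spent token was taken back.
    Refunded { reason: String },
    /// Seller rejected and the reclaim also failed (funds need attention).
    RefundFailed { reason: String, reclaim_error: String },
}

/// Hypothetical wallet surface, NOT the real archipelago API.
trait Wallet {
    fn spend(&mut self, method: Method, sats: u64) -> Result<String, String>;
    fn reclaim(&mut self, method: Method, token: &str) -> Result<(), String>;
}

fn pay_with_reclaim<W: Wallet>(
    wallet: &mut W,
    method: Method,
    sats: u64,
    deliver: impl FnOnce(&str) -> Result<(), String>,
) -> Result<PayOutcome, String> {
    // If the spend itself fails there is nothing to reclaim.
    let token = wallet.spend(method, sats)?;
    match deliver(&token) {
        Ok(()) => Ok(PayOutcome::Delivered),
        Err(reason) => match wallet.reclaim(method, &token) {
            Ok(()) => Ok(PayOutcome::Refunded { reason }),
            Err(reclaim_error) => Ok(PayOutcome::RefundFailed { reason, reclaim_error }),
        },
    }
}

/// Toy in-memory wallet used only to exercise the flow.
struct MemWallet { balance: u64 }

impl Wallet for MemWallet {
    fn spend(&mut self, _method: Method, sats: u64) -> Result<String, String> {
        if sats > self.balance {
            return Err("insufficient balance".to_string());
        }
        self.balance -= sats;
        Ok(format!("token:{}", sats))
    }
    fn reclaim(&mut self, _method: Method, token: &str) -> Result<(), String> {
        let sats: u64 = token
            .strip_prefix("token:")
            .and_then(|s| s.parse().ok())
            .ok_or("bad token")?;
        self.balance += sats;
        Ok(())
    }
}
```

In the .198 → .89 case above, the deliver step would fail because the seller is not in the same federation, and the outcome would be Refunded with the balance restored.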

SESSION 2 PROGRESS (2026-06-20, code-complete — NOT yet deployed; user held deploy)

All committed to local main; NOT pushed to gitea-vps2/origin yet, NOT sideloaded.

  • #12 dup contacts DONE (f92e442b, +3 unit tests pass). Backend group_peer_twins() helper (mesh/mod.rs) dedups by arch_pubkey_hex, radio twin = canonical send id, unions messages; wired into conversations.list/messages + mesh.contacts-list. KEY FINDING: conversations.list/messages have NO frontend consumer — the live chat list renders the frontend merge mergedPeers (Mesh.vue), which matched twins by the Archy-z6Mk… advert prefix that the device RENAME broke. Real fix = merge by arch_pubkey_hex (now exposed on the MeshPeer TS type). Should also clear .120→.89 and likely #5 (Arch Mobile on .116, same bug).
  • Companion crash diagnostic SHIPPED (b3633ec5): main.ts global handler now shows the REAL error + keeps a 25-entry window.__archyErrors ring buffer + catches async/unhandledrejection. Still need to deploy + repro on the optiplex node (read window.__archyErrors via chrome://inspect) to get the actual throw. User says LAN/mobile-browser fine → Tailscale-WebView-specific.
  • #3 dual-ecash pay-for-file DONE (8f06d88f, compiles): payer tries Cashu→Fedimint, seller accepts both (verify_and_receive_payment: non-"cashu" = reissue_into_any), new fedimint_client::spend_from_any(), wallet.ecash-balance reports total_sats. LIVE federation validation pending (two nodes sharing a federation).
  • #2 mobile scroll cutoff DONE (a8c668ee): DashboardMobileNav wrote --mobile-tab-bar-height:0px when the bar was hidden/unlaid-out, defeating the 88px var() fallback → bar covered last row. Now never writes 0 (removes var → fallback), re-measures on rAF + post-WebView-injection. Backup hypothesis if it persists: .dashboard-view is min-h-screen(100vh) → mobile-browser toolbar overlap, switch to dvh.
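
The payer-side ordering in the #3 item (Cashu first, then Fedimint, with a combined total_sats for display) can be sketched as below. The struct and method names are hypothetical, and selecting by balance is an assumption; the real payer may instead attempt a Cashu spend and fall through on error.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend { Cashu, Fedimint }

/// Hypothetical view of the two ecash balances on the paying node.
#[derive(Debug, Clone, Copy)]
struct EcashBalance {
    cashu_sats: u64,
    fedimint_sats: u64,
}

impl EcashBalance {
    /// Combined figure, as a stand-in for the total_sats field.
    fn total_sats(&self) -> u64 {
        self.cashu_sats.saturating_add(self.fedimint_sats)
    }

    /// Cashu first, then Fedimint. ASSUMPTION: one payment uses one
    /// backend, so the two balances are never combined.
    fn pick_backend(&self, price_sats: u64) -> Option<Backend> {
        if self.cashu_sats >= price_sats {
            Some(Backend::Cashu)
        } else if self.fedimint_sats >= price_sats {
            Some(Backend::Fedimint)
        } else {
            None
        }
    }
}
```

One consequence of the single-backend assumption: total_sats can exceed the price while pick_backend still returns None, which is worth surfacing in any confirmation UI.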

DEPLOYED 2026-06-20 to ALL 6 nodes — binary sha 4a8f2198… (release build of commit a6957a48 + this handoff), FE rebuilt, all sha-verified + service active: .116(local) .198 .228 .89 .5 .120. .5/.120 needed a 30-min timeout (slow DERP). #10 netbird OIDC gate also shipped in this build. REMAINING VERIFICATION (on real hardware, user-side):

  • #12/#5: open mesh chat on .116 (and .89/.120) — confirm a federated node shows ONCE with its messages (no radio/federation double), and that "Arch Mobile" messages now surface.
  • #1 companion crash: open the companion app to the optiplex node over Tailscale, reproduce the crash, then read the REAL error from window.__archyErrors (chrome://inspect the WebView) or the now-detailed toast. That error is what's needed to write the actual fix. Confirm which node = optiplex.
  • #3: pay for a peer file when the buyer's balance is only in Fedimint (needs two nodes in a federation).
  • #2: check Cloud/files bottom rows clear the tab bar on mobile browser.

Commits are LOCAL on main (f92e442b/b3633ec5/8f06d88f/a8c668ee/a6957a48 + docs); NOT pushed to gitea-vps2/origin (no version bump; bug-bash sideload only).

TODO (original resume — #12 now DONE above)

#12 Fix duplicate mesh contacts ← DONE this session (see SESSION 2 PROGRESS)

Root cause: handle_mesh_contacts_list (api/rpc/mesh/typed_messages.rs:1126) and handle_conversations_list (api/rpc/mesh/status.rs:89) emit one row per state.peers entry with no cross-transport dedup. A node can have TWO peers: a radio peer (low contact_id, firmware key) and a federation peer (high contact_id ≥ 0x8000_0000, archipelago key). bind_federation_twins (mesh/mod.rs:85) correlates them by exact advert_name and copies arch_pubkey_hex onto the radio twin, but LEAVES BOTH ROWS. Messages are keyed by peer_contact_id (split across the two ids), so the federation-injected messages sit on the federation row while the user may open the radio row → empty.

Design constraint (important): the two twins have DIFFERENT routing. Collapsing must NOT break "mesh-first": the canonical SEND contact_id should be the RADIO twin when one exists (so send_typed_wire routes LoRa-if-reachable, else federation via the bound arch key), else the federation id. The merged THREAD must union messages from ALL twin contact_ids (group by arch_pubkey_hex). Apply the dedup in:

  • handle_conversations_list (status.rs:89) — one conversation per identity group; last msg = newest across twins.
  • handle_mesh_contacts_list (typed_messages.rs:1126).
  • handle_conversations_messages (status.rs ~146): when asked for a contact_id, resolve its group's twin ids and filter messages by ANY of them.

Add a shared helper (e.g. group peers by arch_pubkey_hex when Some, else singleton by contact_id). Do NOT merge/re-key at bind_federation_twins time, because that would force federation routing and break mesh-first. MeshPeer struct: mesh/types.rs:28 (fields: contact_id, advert_name, did, pubkey_hex, arch_pubkey_hex, reachable…).
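
A minimal sketch of the grouping rule described above: group by arch_pubkey_hex when present, singleton otherwise, and prefer the radio twin as the send id. The Peer and PeerGroup types, the u32 id width, and the function names are assumptions for illustration; this is not the committed group_peer_twins().

```rust
use std::collections::BTreeMap;

/// Federation contact ids start here, per the notes (>= 0x8000_0000).
const FEDERATION_ID_BASE: u32 = 0x8000_0000;

/// Trimmed, hypothetical stand-in for MeshPeer.
#[derive(Debug, Clone)]
struct Peer {
    contact_id: u32,
    arch_pubkey_hex: Option<String>,
}

#[derive(Debug, PartialEq)]
struct PeerGroup {
    /// Canonical id to send to: the radio twin when one exists.
    send_id: u32,
    /// Every contact_id whose messages belong in the merged thread.
    twin_ids: Vec<u32>,
}

fn group_twins(peers: &[Peer]) -> Vec<PeerGroup> {
    let mut keyed: BTreeMap<String, Vec<u32>> = BTreeMap::new();
    let mut groups = Vec::new();
    for p in peers {
        match &p.arch_pubkey_hex {
            Some(key) => {
                keyed.entry(key.clone()).or_default().push(p.contact_id);
            }
            // No archipelago key: the peer stays a singleton group.
            None => groups.push(PeerGroup {
                send_id: p.contact_id,
                twin_ids: vec![p.contact_id],
            }),
        }
    }
    for (_key, mut ids) in keyed {
        ids.sort_unstable();
        // Mesh-first: pick a radio id if the group has one,
        // otherwise fall back to the lowest federation id.
        let send_id = ids
            .iter()
            .copied()
            .find(|id| *id < FEDERATION_ID_BASE)
            .unwrap_or(ids[0]);
        groups.push(PeerGroup { send_id, twin_ids: ids });
    }
    groups
}

/// Resolve the ids to filter messages by when a thread is opened
/// under any one of its twin ids.
fn twin_ids_for(groups: &[PeerGroup], contact_id: u32) -> Vec<u32> {
    groups
        .iter()
        .find(|g| g.twin_ids.contains(&contact_id))
        .map(|g| g.twin_ids.clone())
        .unwrap_or_else(|| vec![contact_id])
}
```

Grouping at list/query time, rather than re-keying the peers, leaves both underlying entries intact, so routing still sees the radio twin's reachable flag.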

Before testing #12: update .89 to the current build (it's on stale 7c17a96), then re-check whether .120 ("Archy-X250-EXP") shows once with its messages. NB: .89 had 0 journal mentions of "Archy-X250-EXP" and no radio contact for .120 — so its specific double may be a stale-binary artifact; confirm on fresh build.

#10 Netbird logout race

  • Symptom: right after install netbird shows logged-in but can't log out; self-corrects after a while.
  • Map: install stacks.rs install_netbird_stack (~1760-1918): 3 containers (netbird-server :8086, dashboard, nginx proxy :8087→443 self-signed TLS). wait_for_stack_containers waits for "running", NOT OIDC-ready. Dashboard is netbird's own SPA, opened in a NEW TAB (appLauncher.ts ~52-60, secure-context/crypto.subtle).
  • Hypothesis: startup race. The dashboard loads before netbird-server's OIDC provider is ready and caches a bad auth state; the logout endpoint is not ready either.
  • Likely fix: gate install completion / launch on netbird-server OIDC readiness (poll an endpoint) rather than container "running".
  • Repro: on .89 (has netbird running).
  • Prior note: AccountInfoSection.vue ~602 release note claims a previous unified-origin fix for the 404 logout/login loop; the initial-state race remains.
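
The "gate on readiness, not on running" idea can be sketched as a generic poll-until-ready loop. The function name is hypothetical, and the probe is deliberately left as a caller-supplied closure: which netbird-server endpoint proves OIDC readiness is not established in these notes.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `probe` until it reports ready or `timeout` elapses.
/// Returns true when ready, false on timeout. The probe is always
/// called at least once, even with a zero timeout.
fn wait_until_ready(
    mut probe: impl FnMut() -> bool,
    interval: Duration,
    timeout: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

In the install flow this would run after the existing wait for "running", with a probe that performs the HTTP check and returns true only on a healthy response. Install completion and the dashboard launch would then wait on its result.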

Mesh parity directive

MeshCore "works great"; Meshtastic must reach the SAME parity (rename done; duplicate-contact + routing fallback shared across both). Meshtastic↔MeshCore are INCOMPATIBLE over-the-air, so cross-protocol federated peers (.120↔.89) rely entirely on the FIPS/Tor fallback.
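
The routing order that both radio stacks share (Mesh → FIPS → Tor, as recorded under DONE) can be sketched as two small decisions. The types and function names are hypothetical, and the Unroutable case is an assumption: the notes do not say what send_typed_wire does when neither condition holds.

```rust
use std::time::Duration;

/// FIPS attempt timeout from the notes (8s).
const FIPS_TIMEOUT: Duration = Duration::from_secs(8);

#[derive(Debug, PartialEq)]
enum Route { Lora, Federation, Unroutable }

#[derive(Debug, PartialEq)]
enum Transport { Fips, Tor }

/// Hypothetical, trimmed view of a peer for routing purposes.
struct PeerView {
    reachable: bool,
    federation_synthetic: bool,
    has_arch_pubkey: bool,
}

/// Step 1: a reachable radio peer goes over LoRa; a federation-synthetic
/// peer, or an unreachable peer with a bound archipelago key, goes over
/// federation.
fn pick_route(p: &PeerView) -> Route {
    if !p.federation_synthetic && p.reachable {
        Route::Lora
    } else if p.federation_synthetic || p.has_arch_pubkey {
        Route::Federation
    } else {
        // ASSUMPTION: no radio path and no key means nothing to try.
        Route::Unroutable
    }
}

/// Step 2: within federation, try FIPS with a timeout, then Tor.
fn send_via_federation(
    try_fips: impl FnOnce(Duration) -> Result<(), String>,
    try_tor: impl FnOnce() -> Result<(), String>,
) -> Result<Transport, String> {
    match try_fips(FIPS_TIMEOUT) {
        Ok(()) => Ok(Transport::Fips),
        Err(fips_err) => try_tor()
            .map(|_| Transport::Tor)
            .map_err(|tor_err| format!("fips: {}; tor: {}", fips_err, tor_err)),
    }
}
```

For a cross-protocol pair such as .120↔.89 the radio twin is never reachable over the air, so pick_route always lands on Federation and delivery depends on step 2.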