archy/docs/SESSION-1.8.0-OTA-PROGRESS.md
archipelago 0eb5c258f5 fix(mesh): Meshtastic 3ccc pkc_capable pill + Sideband image interop + critical CBOR wire-bloat fix
Merges in the meshtastic agent's now-finished work alongside this session's
continuation: stock-peer (3ccc) PKI-capability is now stamped through
get_contacts -> refresh_contacts -> MeshPeer.pkc_capable, so a directed DM to/from
a PKC-capable stock Meshtastic peer correctly shows the E2E pill on the Sent row,
not just received messages. Confirmed live: .198 sees "Meshtastic 3ccc" with
pkc_capable=true.

Also fixes two real interop/correctness bugs found while live-testing the
Reticulum <-> Sideband link:
  - Receive: the daemon only ever read LXMF's plain-text content, silently
    dropping native FIELD_IMAGE/FIELD_FILE_ATTACHMENTS fields — a stock
    Sideband/NomadNet photo vanished into a blank-space message. Now decoded
    into the same ContentInline typed envelope our own attachments use.
  - Send: images to a non-archy (stock) peer now use native LXMF FIELD_IMAGE
    instead of our own opaque CBOR wire format, which Sideband can't decode.
  - Root cause of a garbled MC-chunk-fragment bug: TypedEnvelope.v/.sig (the
    OUTER wrapper every message type uses) serialized raw bytes as a CBOR
    array-of-integers instead of a native byte string, bloating every
    message on the wire ~2-3.5x — enough to push even a tiny ReadReceipt
    over the 140-byte single-frame chunking threshold. Root-caused by
    reading ciborium's deserializer source directly (deserialize_bytes only
    works within its internal scratch buffer; deserialize_byte_buf streams
    unbounded).
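
A rough size sketch of the byte-string vs array-of-integers difference described above. It hand-computes encoded sizes from the CBOR rules (RFC 8949) rather than calling ciborium, and the 64-byte field length is an illustrative assumption, not a value taken from TypedEnvelope:

```rust
// Bytes needed for a CBOR item header (major type + length/count argument).
// Assumes n fits in 32 bits, which is plenty for a single mesh frame.
fn header_len(n: usize) -> usize {
    if n < 24 { 1 } else if n <= 0xff { 2 } else if n <= 0xffff { 3 } else { 5 }
}

// Size when raw bytes are serialized as a CBOR array of unsigned integers:
// one array header, then 1 byte per value 0..=23 and 2 bytes per value 24..=255.
fn size_as_int_array(data: &[u8]) -> usize {
    let body: usize = data.iter().map(|&b| if b < 24 { 1 } else { 2 }).sum();
    header_len(data.len()) + body
}

// Size when the same bytes are serialized as a native CBOR byte string.
fn size_as_byte_string(data: &[u8]) -> usize {
    header_len(data.len()) + data.len()
}

fn main() {
    // Hypothetical 64-byte signature-like field with high-entropy bytes.
    let sig = [0xabu8; 64];
    println!("array of ints: {} bytes", size_as_int_array(&sig));
    println!("byte string:   {} bytes", size_as_byte_string(&sig));
}
```

For random-looking bytes the array form approaches 2x per affected field; the larger whole-message factors quoted above come from the commit message and are not reproduced by this sketch.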

Frontend: consolidated the attach/record buttons into a single animated "+"
menu (was overflowing the compose row).

857/857 tests pass. Verified live across all 5 deploy-roster nodes.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
2026-06-30 22:07:45 -04:00


1.8.0 OTA Session Progress

Updated: 2026-06-30


▶️▶️▶️▶️ LIVE CHECKPOINT 2026-06-30 (evening) — #17 deployed + verified on .198/.228

#17 (3ccc / stock-peer E2E pill) is now built, deployed, and live-verified on .198 and .228 only (.116 skipped per the hardware notice below — its radio is mid-reflash to RNode).

  • Built release binary sha b1d695fc626a7382 from the working tree (cargo check + cargo test -p archipelago mesh:: both green, 99 passed/0 failed/1 ignored, right before building — tree was settled, no collision with the Reticulum agent's concurrent edits).
  • Deployed via stop/swap/start to .198 (192.168.1.198) and .228 (192.168.1.228), sha256 confirmed matching on both, systemctl is-active = active on both (.228 took its usual ~couple-minute convergence — heavy resilience node, unrelated bitcoind/fedimint container startup noise in the logs during that window, no mesh errors).
  • Live-verified the actual fix, not just deploy: on .198, mesh.peers shows "advert_name":"Meshtastic 3ccc", "pkc_capable":true, and mesh.send to 3ccc (contact_id:1128152268) now returns "encrypted":true — confirms the archy || peer_pkc_capable(contact_id) TX fix is live, not just compiled.
  • .228's RPC password in memory (password123) was stale — user confirmed the correct password is ThisIsWeb54321@ (same as .198/.116, i.e. fully unified now). Re-verified via RPC: mesh.peers shows 3ccc pkc_capable:true, and mesh.send to 3ccc returns "encrypted":true — #17 confirmed live on .228 too, not just .198.

NOT yet done: push commit to gitea-vps2 (still uncommitted in the working tree, by design — shares the tree with the Reticulum agent's uncommitted work); user on-device confirmation that the E2E pill actually renders in the Mesh UI for 3ccc.


🛠️ HARDWARE NOTICE 2026-06-30 (~16:30) — .116's Heltec V3 is being repurposed

The Reticulum agent is reflashing .116's Heltec V3 (the board on /dev/ttyUSB0, currently .116's live Meshtastic radio) to RNode firmware, with explicit user approval, to unblock the Reticulum Phase-0 hardware gates (real RNode needed; see docs/RETICULUM-TRANSPORT-PROGRESS.md). This was user-confirmed specifically because it takes .116 offline as a Meshtastic radio.

Effect on this workstream: do all on-device Meshtastic testing on .198 and .228 only, because .116 no longer has a Meshtastic-firmware radio attached once this lands. cargo check/cargo test -p archipelago were both confirmed clean (99/99 mesh tests) right before the reflash started, so the earlier "wait for their edit to settle" blocker (the ~15:50 checkpoint below) is cleared. Software-side it's safe to build/test/deploy; only .116's physical radio role changed.


▶️▶️▶️ LIVE CHECKPOINT 2026-06-30 (later PM, ~15:50) — READ THIS FIRST IF RESUMING

#17 (3ccc / stock-peer E2E pill) is CODE-COMPLETE in the working tree, isolated to meshtastic.rs/protocol.rs/types.rs/mod.rs as planned (no session.rs transport-plumbing changes from this side):

  • ParsedContact.pkc_capable (protocol.rs) + MeshPeer.pkc_capable (types.rs), both #[serde(default)]/defaulted false at every construction site.
  • MeshtasticDevice::get_contacts() now stamps pkc_capable per contact from the existing peer_is_pkc_capable(node_num) seam (its #[allow(dead_code)] is removed now that it has a caller).
  • listener/session.rs::refresh_contacts ORs the new value into MeshPeer.pkc_capable (capability only grows, never cleared by a transient refresh) — this IS a touch of session.rs, but additive/non-colliding with the Reticulum device-enum match arms already there; did not touch transport plumbing/routing.
  • mod.rs::MeshService::send_message now does archy || self.peer_pkc_capable(contact_id) for the Sent-row encrypted flag (was archy-only before).
  • Verified via cargo check -p archipelago --bin archipelago (clean, exit 0) before the other agent's latest edit landed.
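
A minimal sketch of the two behaviours listed above: the grow-only capability merge in refresh_contacts and the Sent-row flag. The struct is trimmed to the fields that matter here, and merge_refresh / sent_row_encrypted are hypothetical names, not the real signatures:

```rust
// Trimmed stand-in for MeshPeer; the real type has more fields.
#[derive(Debug, Default, Clone)]
struct MeshPeer {
    contact_id: u32,
    pkc_capable: bool,
}

impl MeshPeer {
    // OR the freshly observed value in: once a peer is known to be
    // PKC-capable, a transient refresh that reports false cannot clear it.
    fn merge_refresh(&mut self, observed_pkc_capable: bool) {
        self.pkc_capable |= observed_pkc_capable;
    }
}

// Sent-row flag: archy peers were already marked encrypted; PKC-capable
// stock peers now are too.
fn sent_row_encrypted(is_archy_peer: bool, peer_pkc_capable: bool) -> bool {
    is_archy_peer || peer_pkc_capable
}

fn main() {
    let mut peer = MeshPeer { contact_id: 1128152268, pkc_capable: false };
    peer.merge_refresh(true); // NodeInfo with a public key was seen
    peer.merge_refresh(false); // a later refresh without it
    println!(
        "contact {} pkc_capable={} sent_encrypted={}",
        peer.contact_id,
        peer.pkc_capable,
        sent_row_encrypted(false, peer.pkc_capable)
    );
}
```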

NOT YET DONE: rebuild release binary → redeploy 5 nodes → push → user on-device test (same as #16, both still pending live verification).

⚠️ BLOCKED right now — do not build/deploy/push until this clears: the Reticulum agent is actively mid-edit in the same working tree. A cargo test run right after the clean cargo check above failed with a real (but transient, not mine) signature mismatch: session.rs::auto_detect_and_open / run_mesh_session were observed with a new device_kind: Option<DeviceType> param that listener/mod.rs's call site didn't have yet — a normal in-flight snapshot of their work, not a regression to fix here. Action on resume: re-run cargo check first; if it's clean, the other agent's edit has settled and it's safe to proceed to build/test/deploy. If still broken, wait — do not stash, revert, or patch their in-progress session.rs/listener/mod.rs changes (see memory feedback_concurrent_agent_tree.md). Also: building/deploying right now would bundle their not-yet-finished reticulum.rs wiring into the binary — confirm with the user before shipping a combined build, since only the meshtastic #17 piece has been asked for/owned by this session.


▶️▶️ LIVE CHECKPOINT 2026-06-30 (late PM) — READ THIS FIRST

Fleet state: all 5 test nodes on binary 38c456b0bacec3c4 + frontend Mesh-CAkPgvLo.js, archipelago active on each: .116, .198, .228 (LAN, archipelago@ + ~/.ssh/archipelago-deploy), 100.72.136.5, 100.89.209.89 (Tailscale, same key — installed this session; SSH user archipelago / pw ThisIsWeb54321@; NOPASSWD sudo on all 5).

Shipped this session (commit 12e7990b on main, pushed to gitea-vps2):

  • #16 public-channel routing: inbound Meshtastic text to BROADCAST_NUM now files under the public channel thread (contact_id u32::MAX - idx), attributed to its real sender, instead of polluting per-sender DM threads. Directed text (to == our node) still routes to the DM thread (regression test packet_to_inbound_frame_directed_dm_stays_a_contact_message). send_channel_text now sets MeshPacket.channel so archy transmits on channel 0 (public). Code: meshtastic.rs (packet_to_inbound_frame, parse_mesh_packet to/channel, send_channel_text), protocol.rs (RESP_MESHTASTIC_CHANNEL_TEXT = 0x70), listener/frames.rs (handler + sender attribution), Mesh.vue (senderLabelFor). Tests green (95 mesh tests). Pending: user on-device test with the radios.
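
The routing rule for #16 can be sketched roughly as below. Thread, route_text and the Ignored arm are hypothetical illustrations (the real logic lives in packet_to_inbound_frame); BROADCAST_NUM is assumed to be 0xFFFFFFFF as in stock Meshtastic:

```rust
// Assumed broadcast address (all ones), as used by stock Meshtastic.
const BROADCAST_NUM: u32 = 0xFFFF_FFFF;

#[derive(Debug, PartialEq)]
enum Thread {
    // Public channel thread, still attributed to the real sender.
    PublicChannel { contact_id: u32, sender: u32 },
    // Per-sender DM thread.
    DirectMessage { contact_id: u32 },
    // Text addressed to some other node (assumption: not surfaced).
    Ignored,
}

// Channel threads count down from the top of the u32 space.
fn channel_contact_id(channel_idx: u32) -> u32 {
    u32::MAX - channel_idx
}

fn route_text(from: u32, to: u32, channel_idx: u32, our_node: u32) -> Thread {
    if to == BROADCAST_NUM {
        Thread::PublicChannel { contact_id: channel_contact_id(channel_idx), sender: from }
    } else if to == our_node {
        Thread::DirectMessage { contact_id: from }
    } else {
        Thread::Ignored
    }
}

fn main() {
    let us = 1135977788; // .116
    let stock = 1128152268; // 3ccc
    println!("{:?}", route_text(stock, BROADCAST_NUM, 0, us));
    println!("{:?}", route_text(stock, us, 0, us));
}
```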

Push access: main is a PROTECTED branch on gitea-vps2. Direct push uses the dedicated ai account via remote gitea-ai (git push gitea-ai main). See memory reference_gitea_ai_push_account.md.

Coordination: another agent owns Reticulum (reticulum-daemon/ + Rust transport wiring). DO NOT touch mesh/listener/session.rs transport plumbing or mod.rs routing in ways that collide. Keep #17 work isolated to meshtastic.rs RX/TX + (if needed) the sent-row encrypted flag.

CODE-COMPLETE (not yet deployed/tested live) — #17 (3ccc / stock-peer E2E pill)

Goal: DMs to and from a PKC-capable stock peer (3ccc, NodeInfo public_key key_len=32 confirmed) must show the E2E pill.

  • RX side is already correct: parse_mesh_packet reads public_key (field 16) + pki_encrypted (field 17) per the MeshPacket proto; the directed-DM RX path promotes to RESP_CONTACT_MSG_V3_E2E when pki_encrypted. (Verify live.)
  • TX bug (root cause), FIXED: mod.rs::send_message now records the Sent row with encrypted = archy || peer_pkc_capable(contact_id). peer_is_pkc_capable (meshtastic.rs) is wired out via get_contacts() → ParsedContact.pkc_capable → refresh_contacts (session.rs) → MeshPeer.pkc_capable → MeshService::peer_pkc_capable. See the ~15:50 LIVE CHECKPOINT above for the exact touch points.
  • NEXT STEP when resuming: confirm cargo check is clean (the other agent's Reticulum work shares this tree and may be mid-edit — see top checkpoint), then rebuild → redeploy 5 nodes → push → user test (same pending step as #16).

Remaining open after #17: #12 (provisioning robustness — HOLD, session.rs churn risks reticulum collision), #8 (Device-tab settings panel + reboot button — RPC mesh.reboot-radio already exists), #6 (onboarding modal), #7 (.116 re-verify), #14 (RSSI/SNR per-contact indicator), #15 (peer-location map, POSITION_APP portnum=3).


▶️ RESUME HERE — archy↔archy LoRa (2026-06-30 PM) — READ FIRST

Goal: archy↔archy text over Meshtastic LoRa must DELIVER and show the E2E pill, identical in off-grid and normal mode. Test bed = .116 / .198 / .228 (all EU_868). Don't touch the federation/FIPS path.

SOLVED 2026-06-30 — archy↔archy LoRa WORKS (delivery + E2E pill + identity)

VERIFIED: .198→.228 directed DM → .228 row RECEIVED enc=True peer="Arch Optiplex". All three nodes (.116/.198/.228) now hear each other + stock peer 3ccc. Deployed binary 737b16c3235b active on all three. Fix source COMMITTED as a57ae388 on main (not yet pushed to gitea-vps2/origin).

THE fix (receive stream): archy ignored FromRadio.rebooted (field 8). Every config write reboots the radio → firmware PhoneAPI resets to STATE_SEND_NOTHING and stops streaming received packets until the client re-sends want_config. archy never did → went deaf to inbound (that's why old messages only arrived after a full restart = fresh want_config). Fix: handle FROM_RADIO_REBOOTED → set pending_reinit → re-send want_config; plus a 10s keepalive heartbeat (insurance vs 15-min idle serial close) and a pinned modem_preset=LONG_FAST so all radios share frequency. Combined with the earlier E2E send fix (plain TEXT_MESSAGE_APP DM, firmware PKC) this closes archy↔archy LoRa.
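
A minimal sketch of the recovery loop described above, assuming a simplified session that is ticked with a millisecond clock. Session, Action and the two-variant FromRadio enum are hypothetical stand-ins for the real listener types; only the pending_reinit flag and the 10s keepalive come from the notes:

```rust
const KEEPALIVE_MS: u64 = 10_000; // 10s heartbeat from the fix above

#[derive(Debug, PartialEq)]
enum Action {
    SendWantConfig, // re-subscribe so the radio streams packets again
    SendHeartbeat,  // keep the serial link from idling out
    Idle,
}

// Only the two cases this sketch needs.
enum FromRadio {
    Rebooted,
    Packet,
}

struct Session {
    pending_reinit: bool,
    last_tx_ms: u64,
}

impl Session {
    fn new(now_ms: u64) -> Self {
        Session { pending_reinit: false, last_tx_ms: now_ms }
    }

    // A reboot notice means the radio has stopped streaming to us.
    fn on_from_radio(&mut self, msg: FromRadio) {
        match msg {
            FromRadio::Rebooted => self.pending_reinit = true,
            FromRadio::Packet => {}
        }
    }

    // Decide what to write to the radio on this tick.
    fn tick(&mut self, now_ms: u64) -> Action {
        if self.pending_reinit {
            self.pending_reinit = false;
            self.last_tx_ms = now_ms;
            Action::SendWantConfig
        } else if now_ms - self.last_tx_ms >= KEEPALIVE_MS {
            self.last_tx_ms = now_ms;
            Action::SendHeartbeat
        } else {
            Action::Idle
        }
    }
}

fn main() {
    let mut s = Session::new(0);
    s.on_from_radio(FromRadio::Packet);
    s.on_from_radio(FromRadio::Rebooted); // e.g. after a config write
    println!("{:?}", s.tick(1_000));
    println!("{:?}", s.tick(2_000));
    println!("{:?}", s.tick(11_000));
}
```

Without the Rebooted arm, the session would keep returning Idle or heartbeats forever and never re-subscribe, which matches the "deaf until full restart" symptom.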

Open follow-ups: #A surface received msgs under archy identity in all UI views; #6 device-onboarding modal; #8 Device-tab settings panel; #7 re-verify .116 in rotation; #12 make modem_preset authoritative + hot-swap re-binding + RX-stall watchdog; #14 signal-strength (RSSI/SNR) indicator per contact (from MeshPacket rx_rssi/rx_snr); #15 map view plotting peer locations where shared (Meshtastic POSITION_APP portnum=3 lat/lon). See the resume memory project_session_resume_2026_06_30_lora.md for the full task list.

(historical) earlier TL;DR — RF-layer suspicion, now RESOLVED by the reboot-recovery fix

The archy software is correct and deployed. The blocker was at the radio/RF layer: the three radios are not hearing each other over the air at all. No amount of archy code change will fix that until the radios actually RF-link. Resume by testing the radios directly at home (Meshtastic phone app over Bluetooth) — see "DO THIS FIRST AT HOME" below. ← this turned out to be the want_config resubscribe bug above.

What is DONE and deployed (commit pending — see below)

  • E2E send fix (core/archipelago/src/mesh/mod.rs send_message, ~L1542): archy↔archy plain chat text is now sent as a native TEXT_MESSAGE_APP DM (firmware PKC-encrypts it E2E), NOT wrapped in our binary typed envelope. Archy peers' Sent rows are marked encrypted=true so the pill shows. Rich typed msgs still use send_typed_wire. This was the original root-cause fix (envelope-wrapped text silently broke archy↔archy LoRa).
  • NEW: software radio-reboot end-to-end, so a wedged/RX-deaf radio can be rebooted without physical access (and for the Device-tab settings panel the user requested):
    • meshtastic.rs: reboot(seconds) driver method + ADMIN_REBOOT_SECONDS_FIELD = 97 (verified vs meshtastic/protobufs admin.proto — set_owner=32/set_channel=33/set_config=34 matched our existing constants, confirming the proto read).
    • listener/mod.rs: MeshCommand::RebootRadio { seconds }.
    • listener/session.rs: device-enum reboot() dispatch (Meshtastic only) + handler arm.
    • mesh/mod.rs: MeshService::reboot_radio(seconds).
    • api/rpc/mesh/messaging.rs: handle_mesh_reboot_radio → RPC mesh.reboot-radio {seconds?} (default 2); dispatcher arm in api/rpc/dispatcher.rs.
    • cargo check passes. Built release sha ba4aed590027690d and DEPLOYED + active on .116/.198/.228. The RPC works ({"reboot":true,"seconds":2}).
    • ⚠️ Caveat: when called, archy logged "Sent Meshtastic radio reboot" but the radio did not visibly reboot afterward (no config re-stream). Either field 97 is still off, or newer firmware requires an admin session passkey even over local serial, or the USB serial stayed open through the 2s reboot so no reconnect was logged. Needs on-device verification.

The hard evidence (why "nothing works")

  • Directed DM tests .198→.228 AND .116→.228 (neither path reflashed): sender logs Sent plain native DM dest=30d258436d65 part=1 total=1 and RPC returns sent:true, encrypted:true, but .228 logs nothing — packet never reaches archy from the radio.
  • A raw broadcast from .198 (mesh.broadcast) was accepted by its radio but not heard by .228/.116.
  • In an 8-minute window, all three nodes received 0 inbound OTA packets from any other node. Each only logs its OWN once-a-minute Broadcast Meshtastic NodeInfo advert + local TX field=11 queue-status. .228 mesh.status = messages_received:1 total.
  • .198's radio is alive and transmitting NodeInfo every 60s — so it's not dead; it's that reception is broken on the receivers. A radio cannot drop a broadcast AND a unicast to its own node number while config matches, unless it simply isn't on the same airwaves.
  • archy provisioning is correct & identical across nodes (read back from device): PRIMARY = public LongFast (name="" psk_len=1), SECONDARY = archipelago, region=3 (EU_868). Admin field constants verified. The send path hands the radio a correct unicast MeshPacket (to=node, want_ack, hop_limit=3, plaintext decoded for the firmware to PKC-encrypt).

PRIME SUSPECT (software-fixable) — modem-preset / frequency mismatch

archy only ever writes region + use_preset and never explicitly pins modem_preset (it parses region but not preset; set_lora_region relies on the LongFast default). If ANY radio has a non-default modem preset / frequency slot persisted (e.g. set via the Meshtastic app, or a different factory default after the .198 reflash), the radios are on different airwaves despite identical channel name + region, and archy would never correct it.

DO THIS FIRST AT HOME (decisive, ~2 min, only the user can do it)

Open the Meshtastic phone app over Bluetooth (works alongside archy's USB serial) on each of .116/.198/.228 and check:

  1. Do the 3 nodes see each other in the node list (recent "heard")? → if NO, they're not RF-reaching (preset/freq/antenna/range).
  2. Do all 3 show the same Modem preset (LongFast), Region (EU_868), Frequency slot, and the same PRIMARY channel? → any difference = the cause. This single test separates "archy misconfigures the radios" from "radios physically can't reach each other."

THEN — the archy fix to apply (if preset/config differs)

Make archy authoritatively write the full LoRaConfig and force re-provision so all radios converge: in core/archipelago/src/mesh/meshtastic.rs::set_lora_region (and its caller/guard ensure_lora_region ~L304), explicitly set modem_preset = LONG_FAST (0) as a field in the LoRaConfig (it's currently omitted/defaulted), and make the startup provision path rewrite LoRa config when the preset doesn't match, then reboot the radio (use the new mesh.reboot-radio). Also verify that mesh.reboot-radio actually reboots the radio on-device (the caveat above).
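
A sketch of the "compare read-back config, rewrite on mismatch" guard described above. The struct is a hypothetical three-field slice of LoRaConfig, and plan / Provision are illustrative names; the numeric values (region 3 = EU_868, LONG_FAST = 0) are the ones quoted in these notes:

```rust
const REGION_EU_868: u32 = 3;
const MODEM_PRESET_LONG_FAST: u32 = 0;

// Hypothetical slice of the LoRa config: only the fields that decide
// whether two radios share the same airwaves in this scenario.
#[derive(Debug, Clone, Copy, PartialEq)]
struct LoRaConfig {
    region: u32,
    use_preset: bool,
    modem_preset: u32,
}

fn desired() -> LoRaConfig {
    LoRaConfig {
        region: REGION_EU_868,
        use_preset: true,
        modem_preset: MODEM_PRESET_LONG_FAST,
    }
}

#[derive(Debug, PartialEq)]
enum Provision {
    UpToDate,
    // Write the full config, then reboot so it takes effect.
    RewriteAndReboot(LoRaConfig),
}

// Compare every field, not just region, so a stray preset gets corrected.
fn plan(read_back: LoRaConfig) -> Provision {
    let want = desired();
    if read_back == want {
        Provision::UpToDate
    } else {
        Provision::RewriteAndReboot(want)
    }
}

fn main() {
    // Same region, but a non-default preset persisted on the device.
    let drifted = LoRaConfig { region: 3, use_preset: true, modem_preset: 1 };
    println!("{:?}", plan(drifted));
    println!("{:?}", plan(desired()));
}
```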

TEST RECIPE (works on each node)

  • RPC helper used this session: a node-side rpc.sh that logs in (password ThisIsWeb54321@), grabs the csrf_token cookie, echoes it as X-CSRF-Token, and POSTs to http://127.0.0.1:5678/rpc/v1. Recreate it or run archy's RPC directly. Methods: mesh.peers, mesh.status, mesh.messages, mesh.send {contact_id,message}, mesh.broadcast, mesh.reboot-radio {seconds}.
  • LoRa contact ids: .116=1135977788 (prefix 3ca5b543), .198=3677050140 (db2b551c), .228=1129894448 (prefix 30d25843), stock 3ccc=1128152268.
  • Link health check (run on each node): look for inbound from=Some("!...") lines in journalctl -u archipelago that are NOT the node's own Broadcast ... NodeInfo advert. If zero across all nodes → RF link is down (the current state).
  • E2E success criteria: send .198→.228, the marker appears in .228 mesh.messages as an inbound row with encrypted:true / transport:"lora", AND .116↔.228 likewise.
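
For cross-checking the ids in the recipe above against log lines, a small helper sketch (le_hex / be_hex are hypothetical names). By plain arithmetic, the "prefix" values listed for .116 and .228 are the little-endian bytes of the contact id, while the value listed for .198 and the "3ccc"/"551c" suffixes correspond to the big-endian hex of the node number:

```rust
// Hex of the id's bytes in little-endian order, matching the listed
// "prefix" values and the start of dest=30d258436d65 in the logs.
fn le_hex(id: u32) -> String {
    id.to_le_bytes().iter().map(|b| format!("{:02x}", b)).collect()
}

// Hex of the id as a number; its last four digits are the short
// radio suffix such as "3ccc" or "551c".
fn be_hex(id: u32) -> String {
    format!("{:08x}", id)
}

fn main() {
    for (node, id) in [
        (".116", 1135977788u32),
        (".198", 3677050140),
        (".228", 1129894448),
        ("3ccc", 1128152268),
    ] {
        println!("{:>5} id={:<10} le={} be={}", node, id, le_hex(id), be_hex(id));
    }
}
```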

DEPLOY / BUILD RECIPE

  • Build: from core/, CARGO_TARGET_DIR=/tmp/archy-hotfix-target CARGO_INCREMENTAL=0 cargo build --release -p archipelago --bin archipelago. (If rust-lld: undefined hidden symbol, it's incremental cache — CARGO_INCREMENTAL=0 fixes it.)
  • SSH key ~/.ssh/archipelago-deploy is authorized on .116/.198/.228. SSH/UI/RPC password ThisIsWeb54321@. Per node: scp the binary, sudo systemctl stop archipelago → kill -9 $(pgrep -x archipelago) → install -m0755 to /usr/local/bin/archipelago → systemctl start archipelago. Verify by sha256sum match + systemctl is-active.
  • Current deployed sha on all 3 = ba4aed590027690d (the reboot-enabled build).

Fleet state (as of 2026-06-30 PM)

  • All 3 nodes on binary ba4aed59, active. Off-grid mode currently OFF (mesh_only:false).
  • .198 radio was reflashed to factory firmware-heltec-v3-2.7.26 (recovered from corrupt NVS); region EU_868 persists. Its archy identity is NOT re-bound on .228 (.228 shows .198 as raw radio "Meshtastic 551c", arch_pubkey_hex absent) because .228 hasn't heard .198's identity broadcast — a downstream symptom of the dead RF link, not a separate bug.
  • The radios are powered & each transmitting; they are simply not hearing each other.

Deferred UI (after LoRa works)

  • Device-tab settings panel (gear/desktop) — host the "Reboot radio" button there; calls mesh.reboot-radio. Scoping done: add to the Mesh.vue actions row (mirrors Broadcast/Off-Grid buttons) + a rebootRadio() method in neode-ui/src/stores/mesh.ts. See Mesh.vue ~L1484 actions row and mesh.ts ~L373 broadcastIdentity() pattern.
  • Device-onboarding modal (detect plugged-in radio).

Current scope:

  • Preserve existing mesh work: E2E indicators, FIPS/Tor transport indicators, typed-message paths, Meshtastic region/channel provisioning, and dirty Meshtastic receive-attempt changes.
  • Take over the 3ccc stock Meshtastic peer bug: LoRa text from 3ccc to Archipelago .116 does not surface in mesh.messages.
  • Keep release-gate fixes already made in this session.

Local gate status so far:

  • cargo test -p archipelago --bin archipelago: green, 849/849 after Meshtastic fixes.
  • python3 scripts/check-app-catalog-drift.py --release --strict: green.
  • npm run type-check: green.

Key changes made so far:

  • Added cascade uninstall progress truthfulness assertion to tests/lifecycle/bats/cascade-uninstall.bats.
  • Fixed release catalog drift filters and regenerated catalog metadata.
  • Fixed invalid apps/fedimint-clientd/manifest.yml cpu_limit schema value.
  • Updated stale/tight Rust tests without changing production behavior.

Remaining non-automatable / operational gates:

  • Workstream B signing is blocked on the offline RELEASE_MASTER_MNEMONIC; code + runbook exist, but the publisher must pin/sign the release-root catalog.
  • Phase-3 Quadlet backend rollout is implemented behind use_quadlet_backends and default-off. The gate skip-passes until explicitly enabled on a node; flipping it fleet-wide requires a coordinated flag rollout plus backend reinstall/migration verification.
  • .116 read-only use-quadlet-backends-install.bats: 6/6 skip-clean; no backend .container units, so Phase-3 is not active on that node.
  • Release metadata still says 1.7.99-alpha in releases/manifest.json; changelog top is v1.8.00-alpha. Cutting an actual 1.8.0 OTA requires an explicit version/manifest update.

Do not discard:

  • core/archipelago/src/mesh/listener/decode.rs
  • core/archipelago/src/mesh/listener/session.rs
  • core/archipelago/src/mesh/meshtastic.rs

3ccc bug current hypothesis:

  • The prior attempted Meshtastic fix added a hard stale-packet filter using rx_time.
  • Stock Meshtastic radios without GPS/RTC can report tiny nonzero epoch values until time sync.
  • That would make live 3ccc packets look older than 10 minutes and get dropped before mesh.messages.
  • Current patch treats implausibly early rx_time values as unknown rather than stale.
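
The hypothesis above can be sketched as a three-way classification. The 10-minute window is from these notes; the plausibility cutoff (2020-01-01 UTC) and the names are assumptions for illustration, not the values used in the actual patch:

```rust
#[derive(Debug, PartialEq)]
enum Freshness {
    Unknown, // clock not synced yet: do not filter on age
    Fresh,
    Stale,
}

const STALE_AFTER_SECS: u64 = 10 * 60;
// Hypothetical cutoff: 2020-01-01T00:00:00Z. Anything earlier is treated
// as "radio has no real time yet" rather than as an old packet.
const MIN_PLAUSIBLE_EPOCH: u64 = 1_577_836_800;

fn classify(rx_time: u32, now_epoch: u64) -> Freshness {
    let rx = rx_time as u64;
    if rx < MIN_PLAUSIBLE_EPOCH {
        return Freshness::Unknown;
    }
    // saturating_sub: an rx_time slightly in the future counts as fresh.
    if now_epoch.saturating_sub(rx) > STALE_AFTER_SECS {
        Freshness::Stale
    } else {
        Freshness::Fresh
    }
}

fn main() {
    let now: u64 = 1_782_777_600; // an arbitrary "now" in epoch seconds
    println!("{:?}", classify(1_234, now)); // tiny uptime-like value
    println!("{:?}", classify((now - 60) as u32, now));
    println!("{:?}", classify((now - 3_600) as u32, now));
}
```

Only Stale packets would be dropped; Unknown ones pass through, which is what lets text from a GPS-less stock radio reach mesh.messages.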

.116 live validation after 2026-06-30 hotfix:

  • .116 reachable by SSH; archipelago active; /dev/mesh-radio -> ttyUSB0 attached.
  • Current canary deploy is commit b4531bb4; backend sha 4ab53e539d89679ef664401a9a57996267772fed02327abc2912c3e77543acbf; frontend bundle index-YOAeJF7w.js / Mesh-BSAo88jN.js.
  • main pushed to gitea-vps2.
  • RPC on .116:
    • transport.status currently reports mesh_only:false (off-grid mode is not enabled unless the user toggles it).
    • mesh.status reports Meshtastic connected: device_type:"meshtastic", self_node_id:1135977788, peer_count:13.
    • Recent .116 -> 3ccc sent rows are stored with real 2026 timestamps and transport:"lora".
  • UI/backend fixes included in b4531bb4:
    • transportLabel("lora") displays LoRa.
    • mesh sends refetch messages after send so transport pills settle without browser refresh.
    • off-grid mode blocks the mesh-chat FIPS/Tor federation fallback and forces LoRa-only sends; banner text is Tor/FIPS disabled - LoRa only.
    • empty mesh-chat placeholder opacity reduced.
  • Meshtastic diagnostics now identify the remaining blocker:
    • 3ccc NodeInfo is discovered: Meshtastic peer is PKC-capable (NodeInfo public_key) node=1128152268 key_len=32.
    • Bytes from stock Meshtastic text reach .116, but the custom parser rejects the packet: Meshtastic FromRadio.packet did not parse into a decoded MeshPacket len=73 head=0dcc3c3e43153ca5b5432a16df56cbed.
    • Non-text packets decode and are ignored with port numbers (portnum=3/4/5), so the serial read path is alive. Resume inside core/archipelago/src/mesh/meshtastic.rs::parse_mesh_packet.
  • LoRa is therefore not fully fixed yet: stock 3ccc -> .116 text does not surface in mesh.messages, and .116 -> 3ccc still needs user-visible confirmation in the Meshtastic app.
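
As a starting point for the parse_mesh_packet investigation, the logged head bytes can be walked by hand with a tiny protobuf tag reader. This is a diagnostic sketch, not the project's parser: it only handles single-byte tags with fixed32 and short length-delimited fields, and all names are hypothetical:

```rust
fn hex_to_bytes(s: &str) -> Vec<u8> {
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).unwrap())
        .collect()
}

#[derive(Debug, PartialEq)]
enum Field {
    Fixed32(u32, u32),        // (field number, value)
    LenDelimited(u32, usize), // (field number, declared length)
}

// Walk leading fields; stop at anything this sketch does not understand.
fn walk(buf: &[u8]) -> Vec<Field> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < buf.len() {
        let tag = buf[i];
        if tag & 0x80 != 0 {
            break; // multi-byte tag (field number >= 16): out of scope
        }
        let field_no = (tag >> 3) as u32;
        match tag & 0x07 {
            // Wire type 5: 4 little-endian bytes follow.
            5 if i + 5 <= buf.len() => {
                let v = u32::from_le_bytes([buf[i + 1], buf[i + 2], buf[i + 3], buf[i + 4]]);
                out.push(Field::Fixed32(field_no, v));
                i += 5;
            }
            // Wire type 2: one-byte length, payload truncated in the log.
            2 if i + 2 <= buf.len() && buf[i + 1] < 0x80 => {
                out.push(Field::LenDelimited(field_no, buf[i + 1] as usize));
                break;
            }
            _ => break,
        }
    }
    out
}

fn main() {
    let head = hex_to_bytes("0dcc3c3e43153ca5b5432a16df56cbed");
    for f in walk(&head) {
        println!("{:?}", f);
    }
}
```

The head decodes to from=1128152268 (3ccc), to=1135977788 (.116), then a 22-byte length-delimited field number 5. If field 5 is the encrypted payload variant and field 4 the decoded one, as in the upstream mesh.proto (worth confirming against the vendored copy), the radio handed this packet over undecrypted, which would explain the "did not parse into a decoded MeshPacket" log.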