archy/docs/SESSION-1.8.0-OTA-PROGRESS.md
1.8.0 OTA Session Progress

Updated: 2026-06-30


▶️ RESUME HERE — archy↔archy LoRa (2026-06-30 PM) — READ FIRST

Goal: archy↔archy text over Meshtastic LoRa must DELIVER and show the E2E pill, identical in off-grid and normal mode. Test bed = .116 / .198 / .228 (all EU_868). Don't touch the federation/FIPS path.

SOLVED 2026-06-30 — archy↔archy LoRa WORKS (delivery + E2E pill + identity)

VERIFIED: .198→.228 directed DM → .228 row RECEIVED enc=True peer="Arch Optiplex". All three nodes (.116/.198/.228) now hear each other + stock peer 3ccc. Deployed binary 737b16c3235b active on all three. Fix source COMMITTED as a57ae388 on main (not yet pushed to gitea-vps2/origin).

THE fix (receive stream): archy ignored FromRadio.rebooted (field 8). Every config write reboots the radio → firmware PhoneAPI resets to STATE_SEND_NOTHING and stops streaming received packets until the client re-sends want_config. archy never did → went deaf to inbound (that's why old messages only arrived after a full restart = fresh want_config). Fix: handle FROM_RADIO_REBOOTED → set pending_reinit → re-send want_config; plus a 10s keepalive heartbeat (insurance vs 15-min idle serial close) and a pinned modem_preset=LONG_FAST so all radios share frequency. Combined with the earlier E2E send fix (plain TEXT_MESSAGE_APP DM, firmware PKC) this closes archy↔archy LoRa.
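The recovery logic described above can be sketched as a small state machine. This is a minimal illustration, not the code in a57ae388: the type and method names (`RadioEvent`, `Action`, `Session`, `on_event`, `tick`) are hypothetical, and only the `pending_reinit` flag and the 10s heartbeat interval come from the notes.

```rust
use std::time::{Duration, Instant};

/// Events the serial reader might surface (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RadioEvent {
    /// FromRadio.rebooted (field 8) was seen on the stream.
    Rebooted,
    /// Any other FromRadio frame.
    Other,
}

/// What the session loop should send to the radio next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    SendWantConfig,
    SendHeartbeat,
}

/// Keepalive interval from the notes (insurance against idle serial close).
const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(10);

struct Session {
    pending_reinit: bool,
    last_heartbeat: Instant,
}

impl Session {
    fn new(now: Instant) -> Self {
        Session { pending_reinit: false, last_heartbeat: now }
    }

    /// A reboot notice only sets a flag; the resubscribe happens on the next tick.
    fn on_event(&mut self, event: RadioEvent) {
        if event == RadioEvent::Rebooted {
            self.pending_reinit = true;
        }
    }

    /// Called periodically by the session loop.
    fn tick(&mut self, now: Instant) -> Vec<Action> {
        let mut actions = Vec::new();
        if self.pending_reinit {
            self.pending_reinit = false;
            actions.push(Action::SendWantConfig);
        }
        if now.duration_since(self.last_heartbeat) >= HEARTBEAT_INTERVAL {
            self.last_heartbeat = now;
            actions.push(Action::SendHeartbeat);
        }
        actions
    }
}

fn main() {
    let t0 = Instant::now();
    let mut session = Session::new(t0);
    session.on_event(RadioEvent::Other);
    session.on_event(RadioEvent::Rebooted);
    println!("{:?}", session.tick(t0));
    println!("{:?}", session.tick(t0 + Duration::from_secs(10)));
}
```

Deferring the resubscribe to `tick` keeps the reader path free of writes; whether the real session loop is structured this way is an assumption.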

Open follow-ups:

  • #A surface received msgs under archy identity in all UI views.
  • #6 device-onboarding modal.
  • #8 Device-tab settings panel.
  • #7 re-verify .116 in rotation.
  • #12 make modem_preset authoritative + hot-swap re-binding + RX-stall watchdog.
  • #14 signal-strength (RSSI/SNR) indicator per contact (from MeshPacket rx_rssi/rx_snr).
  • #15 map view plotting peer locations where shared (Meshtastic POSITION_APP portnum=3 lat/lon).

See the resume memory project_session_resume_2026_06_30_lora.md for the full task list.

(historical) earlier TL;DR — RF-layer suspicion, now RESOLVED by the reboot-recovery fix

The archy software is correct and deployed. The blocker was at the radio/RF layer: the three radios are not hearing each other over the air at all. No amount of archy code change will fix that until the radios actually RF-link. Resume by testing the radios directly at home (Meshtastic phone app over Bluetooth) — see "DO THIS FIRST AT HOME" below. ← this turned out to be the want_config resubscribe bug above.

What is DONE and deployed (commit pending — see below)

  • E2E send fix (core/archipelago/src/mesh/mod.rs send_message, ~L1542): archy↔archy plain chat text is now sent as a native TEXT_MESSAGE_APP DM (firmware PKC-encrypts it E2E), NOT wrapped in our binary typed envelope. Archy peers' Sent rows are marked encrypted=true so the pill shows. Rich typed msgs still use send_typed_wire. This was the original root-cause fix (envelope-wrapped text silently broke archy↔archy LoRa).
  • NEW: software radio-reboot end-to-end, so a wedged/RX-deaf radio can be rebooted without physical access (and for the Device-tab settings panel the user requested):
    • meshtastic.rs: reboot(seconds) driver method + ADMIN_REBOOT_SECONDS_FIELD = 97 (verified vs meshtastic/protobufs admin.proto — set_owner=32/set_channel=33/set_config=34 matched our existing constants, confirming the proto read).
    • listener/mod.rs: MeshCommand::RebootRadio { seconds }.
    • listener/session.rs: device-enum reboot() dispatch (Meshtastic only) + handler arm.
    • mesh/mod.rs: MeshService::reboot_radio(seconds).
    • api/rpc/mesh/messaging.rs: handle_mesh_reboot_radio → RPC mesh.reboot-radio {seconds?} (default 2); dispatcher arm in api/rpc/dispatcher.rs.
    • cargo check passes. Built release sha ba4aed590027690d and DEPLOYED + active on .116/.198/.228. The RPC works ({"reboot":true,"seconds":2}).
    • ⚠️ Caveat: when called, archy logged "Sent Meshtastic radio reboot" but the radio did not visibly reboot afterward (no config re-stream). Either field 97 is still off, or newer firmware requires an admin session passkey even over local serial, or the USB serial stayed open through the 2s reboot so no reconnect was logged. Needs on-device verification.
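To help with the on-device verification, here is a minimal sketch of what the reboot admin payload would look like on the wire, assuming reboot_seconds is a plain varint field numbered 97 as the notes state. The helper names are hypothetical and this is not the driver code in meshtastic.rs; it only encodes the AdminMessage body, not the surrounding MeshPacket/ToRadio framing.

```rust
/// Field number for the reboot-seconds admin field, as stated in the notes.
const ADMIN_REBOOT_SECONDS_FIELD: u32 = 97;

/// Append `value` as a protobuf base-128 varint.
fn put_varint(out: &mut Vec<u8>, mut value: u64) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Encode a varint-typed field: the tag (field number << 3, wire type 0), then the value.
fn put_varint_field(out: &mut Vec<u8>, field: u32, value: u64) {
    put_varint(out, u64::from(field) << 3);
    put_varint(out, value);
}

/// Hypothetical helper: the AdminMessage body for "reboot in `seconds`".
/// Only non-negative delays are handled here.
fn encode_reboot_admin(seconds: u32) -> Vec<u8> {
    let mut out = Vec::new();
    put_varint_field(&mut out, ADMIN_REBOOT_SECONDS_FIELD, u64::from(seconds));
    out
}

fn main() {
    let bytes = encode_reboot_admin(2);
    println!("{:02x?}", bytes);
}
```

For seconds=2 the body is the three bytes 0x88 0x06 0x02. Comparing that against what archy actually writes to the serial port is one way to confirm or rule out the "field 97 is still off" explanation before looking at passkeys or reconnect logging.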

The hard evidence (why "nothing works")

  • Directed DM tests .198→.228 AND .116→.228 (neither path reflashed): sender logs Sent plain native DM dest=30d258436d65 part=1 total=1 and RPC returns sent:true, encrypted:true, but .228 logs nothing — packet never reaches archy from the radio.
  • A raw broadcast from .198 (mesh.broadcast) was accepted by its radio but not heard by .228/.116.
  • In an 8-minute window, all three nodes received 0 inbound OTA packets from any other node. Each only logs its OWN once-a-minute Broadcast Meshtastic NodeInfo advert + local TX field=11 queue-status. .228 mesh.status = messages_received:1 total.
  • .198's radio is alive and transmitting NodeInfo every 60s — so it's not dead; it's that reception is broken on the receivers. A radio cannot drop a broadcast AND a unicast to its own node number while config matches, unless it simply isn't on the same airwaves.
  • archy provisioning is correct & identical across nodes (read back from device): PRIMARY = public LongFast (name="" psk_len=1), SECONDARY = archipelago, region=3 (EU_868). Admin field constants verified. The send path hands the radio a correct unicast MeshPacket (to=node, want_ack, hop_limit=3, plaintext decoded for the firmware to PKC-encrypt).

PRIME SUSPECT (software-fixable) — modem-preset / frequency mismatch

archy only ever writes region + use_preset and never explicitly pins modem_preset (it parses region but not preset; set_lora_region relies on the LongFast default). If ANY radio has a non-default modem preset / frequency slot persisted (e.g. set via the Meshtastic app, or a different factory default after the .198 reflash), the radios are on different airwaves despite identical channel name + region, and archy would never correct it.

DO THIS FIRST AT HOME (decisive, ~2 min, only the user can do it)

Open the Meshtastic phone app over Bluetooth (works alongside archy's USB serial) on each of .116/.198/.228 and check:

  1. Do the 3 nodes see each other in the node list (recent "heard")? → if NO, they're not RF-reaching (preset/freq/antenna/range).
  2. Do all 3 show the same Modem preset (LongFast), Region (EU_868), Frequency slot, and the same PRIMARY channel? → any difference = the cause. This single test separates "archy misconfigures the radios" from "radios physically can't reach each other."

THEN — the archy fix to apply (if preset/config differs)

Make archy authoritatively write the full LoRaConfig and force re-provisioning so all radios converge. In core/archipelago/src/mesh/meshtastic.rs::set_lora_region (and its caller/guard ensure_lora_region ~L304), explicitly set modem_preset = LONG_FAST (0) in the LoRaConfig (it is currently omitted/defaulted). Make the startup provision path rewrite the LoRa config when the preset doesn't match, then reboot the radio (use the new mesh.reboot-radio). Also verify on-device that mesh.reboot-radio actually reboots the radio (the caveat above).
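The startup decision could be sketched roughly as follows. The struct, enum, and function names are hypothetical, and the sketch assumes the read-back LoRa settings are available as plain values; the numeric values for LONG_FAST and EU_868 and the 2s reboot default are taken from these notes.

```rust
/// Values as given in these notes.
const LONG_FAST: u32 = 0;
const EU_868: u32 = 3;

/// The subset of LoRa settings archy would treat as authoritative (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LoraSettings {
    use_preset: bool,
    modem_preset: u32,
    region: u32,
}

/// Steps the provision path would run, in order (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    WriteLoraConfig(LoraSettings),
    RebootRadio { seconds: u32 },
}

fn desired() -> LoraSettings {
    LoraSettings { use_preset: true, modem_preset: LONG_FAST, region: EU_868 }
}

/// Compare what the radio reports with the desired settings.
/// Any drift triggers a full rewrite followed by a reboot; a match does nothing.
fn plan_provision(read_back: LoraSettings) -> Vec<Step> {
    let want = desired();
    if read_back == want {
        Vec::new()
    } else {
        vec![Step::WriteLoraConfig(want), Step::RebootRadio { seconds: 2 }]
    }
}

fn main() {
    let drifted = LoraSettings { use_preset: true, modem_preset: 1, region: EU_868 };
    println!("{:?}", plan_provision(drifted));
    println!("{:?}", plan_provision(desired()));
}
```

Comparing the whole struct rather than only the region is the point of the change: a radio with the right region but a different preset would otherwise pass the guard untouched.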

TEST RECIPE (works on each node)

  • RPC helper used this session: a node-side rpc.sh that logs in (password ThisIsWeb54321@), grabs the csrf_token cookie, echoes it as X-CSRF-Token, and POSTs to http://127.0.0.1:5678/rpc/v1. Recreate it or run archy's RPC directly. Methods: mesh.peers, mesh.status, mesh.messages, mesh.send {contact_id,message}, mesh.broadcast, mesh.reboot-radio {seconds}.
  • LoRa contact ids: .116=1135977788 (prefix 3ca5b543), .198=3677050140 (db2b551c), .228=1129894448 (prefix 30d25843), stock 3ccc=1128152268.
  • Link health check (run on each node): look for inbound from=Some("!...") lines in journalctl -u archipelago that are NOT the node's own Broadcast ... NodeInfo advert. If zero across all nodes → RF link is down (the current state).
  • E2E success criteria: send .198→.228, the marker appears in .228 mesh.messages as an inbound row with encrypted:true / transport:"lora", AND .116↔.228 likewise.
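When cross-checking the contact ids listed above against log lines, it can help to convert between the decimal node number and its hex forms. The sketch below uses only integer formatting; the function names are hypothetical. One observation from running the numbers: the prefixes listed for .116 and .228 equal the little-endian bytes of the node number, while the value listed for .198 equals the big-endian hex of its node number, so it may be worth confirming which form each tool prints.

```rust
/// Hex of the node number's bytes in little-endian order
/// (the order a fixed32 appears in on the wire).
fn le_hex(node: u32) -> String {
    node.to_le_bytes().iter().map(|b| format!("{:02x}", b)).collect()
}

/// Hex of the node number as an ordinary (big-endian) integer.
fn be_hex(node: u32) -> String {
    format!("{:08x}", node)
}

/// Last four hex digits, matching short labels such as "3ccc" and "551c" in the notes.
fn short_suffix(node: u32) -> String {
    format!("{:04x}", node & 0xffff)
}

fn main() {
    for (label, node) in [
        (".116", 1135977788u32),
        (".198", 3677050140),
        (".228", 1129894448),
        ("3ccc", 1128152268),
    ] {
        println!(
            "{label}: le={} be={} short={}",
            le_hex(node),
            be_hex(node),
            short_suffix(node)
        );
    }
}
```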

DEPLOY / BUILD RECIPE

  • Build: from core/, CARGO_TARGET_DIR=/tmp/archy-hotfix-target CARGO_INCREMENTAL=0 cargo build --release -p archipelago --bin archipelago. (If the link fails with rust-lld: undefined hidden symbol, the incremental cache is the cause; CARGO_INCREMENTAL=0 fixes it.)
  • SSH key ~/.ssh/archipelago-deploy is authorized on .116/.198/.228. SSH/UI/RPC password ThisIsWeb54321@. Per node: scp the binary → sudo systemctl stop archipelago → kill -9 $(pgrep -x archipelago) → install -m0755 to /usr/local/bin/archipelago → systemctl start archipelago. Verify by sha256sum match + systemctl is-active.
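  • Deployed sha on all 3 when this recipe was written = ba4aed590027690d (the reboot-enabled build); the SOLVED section above lists 737b16c3235b as the binary active after the reboot-recovery fix.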

Fleet state (as of 2026-06-30 PM)

  • All 3 nodes on binary ba4aed59, active. Off-grid mode currently OFF (mesh_only:false).
  • .198 radio was reflashed to factory firmware-heltec-v3-2.7.26 (recovered from corrupt NVS); region EU_868 persists. Its archy identity is NOT re-bound on .228 (.228 shows .198 as raw radio "Meshtastic 551c", arch_pubkey_hex absent) because .228 hasn't heard .198's identity broadcast — a downstream symptom of the dead RF link, not a separate bug.
  • The radios are powered & each transmitting; they are simply not hearing each other.

Deferred UI (after LoRa works)

  • Device-tab settings panel (gear/desktop) — host the "Reboot radio" button there; calls mesh.reboot-radio. Scoping done: add to the Mesh.vue actions row (mirrors Broadcast/Off-Grid buttons) + a rebootRadio() method in neode-ui/src/stores/mesh.ts. See Mesh.vue ~L1484 actions row and mesh.ts ~L373 broadcastIdentity() pattern.
  • Device-onboarding modal (detect plugged-in radio).

Current scope:

  • Preserve existing mesh work: E2E indicators, FIPS/Tor transport indicators, typed-message paths, Meshtastic region/channel provisioning, and the dirty (uncommitted) Meshtastic receive-attempt changes.
  • Take over the 3ccc stock Meshtastic peer bug: LoRa text from 3ccc to Archipelago .116 does not surface in mesh.messages.
  • Keep release-gate fixes already made in this session.

Local gate status so far:

  • cargo test -p archipelago --bin archipelago: green, 849/849 after Meshtastic fixes.
  • python3 scripts/check-app-catalog-drift.py --release --strict: green.
  • npm run type-check: green.

Key changes made so far:

  • Added cascade uninstall progress truthfulness assertion to tests/lifecycle/bats/cascade-uninstall.bats.
  • Fixed release catalog drift filters and regenerated catalog metadata.
  • Fixed invalid apps/fedimint-clientd/manifest.yml cpu_limit schema value.
  • Updated stale/tight Rust tests without changing production behavior.

Remaining non-automatable / operational gates:

  • Workstream B signing is blocked on the offline RELEASE_MASTER_MNEMONIC; code + runbook exist, but the publisher must pin/sign the release-root catalog.
  • Phase-3 Quadlet backend rollout is implemented behind use_quadlet_backends and default-off. The gate skip-passes until explicitly enabled on a node; flipping it fleet-wide requires a coordinated flag rollout plus backend reinstall/migration verification.
  • .116 read-only use-quadlet-backends-install.bats: 6/6 skip-clean; no backend .container units, so Phase-3 is not active on that node.
  • Release metadata still says 1.7.99-alpha in releases/manifest.json; changelog top is v1.8.00-alpha. Cutting an actual 1.8.0 OTA requires an explicit version/manifest update.

Do not discard:

  • core/archipelago/src/mesh/listener/decode.rs
  • core/archipelago/src/mesh/listener/session.rs
  • core/archipelago/src/mesh/meshtastic.rs

3ccc bug current hypothesis:

  • The prior attempted Meshtastic fix added a hard stale-packet filter using rx_time.
  • Stock Meshtastic radios without GPS/RTC can report tiny nonzero epoch values until time sync.
  • That would make live 3ccc packets look older than 10 minutes and get dropped before mesh.messages.
  • Current patch treats implausibly early rx_time values as unknown rather than stale.
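The hypothesis above amounts to a three-way classification instead of a two-way one. A minimal sketch, assuming a plausibility floor of 2020-01-01 UTC (a hypothetical cutoff chosen for illustration; the value used by the actual patch is not stated in these notes) and the 10-minute staleness window mentioned above:

```rust
#[derive(Debug, PartialEq, Eq)]
enum RxAge {
    /// The radio's clock is not trustworthy; do not filter on it.
    Unknown,
    Fresh,
    Stale,
}

/// Staleness window from the notes: 10 minutes.
const STALE_AFTER_SECS: u64 = 10 * 60;

/// Hypothetical plausibility floor: 2020-01-01T00:00:00Z as Unix seconds.
const MIN_PLAUSIBLE_EPOCH: u64 = 1_577_836_800;

/// Classify a packet by its rx_time (Unix seconds as reported by the radio).
fn classify_rx_time(rx_time: u32, now_secs: u64) -> RxAge {
    let rx = u64::from(rx_time);
    if rx < MIN_PLAUSIBLE_EPOCH {
        // Zero or a tiny uptime-like value: the radio has no time sync yet.
        return RxAge::Unknown;
    }
    // saturating_sub treats a timestamp slightly in the future as age 0.
    if now_secs.saturating_sub(rx) > STALE_AFTER_SECS {
        RxAge::Stale
    } else {
        RxAge::Fresh
    }
}

fn main() {
    let now: u64 = 1_782_777_600; // roughly 2026-06-30 UTC
    println!("{:?}", classify_rx_time(42, now));
    println!("{:?}", classify_rx_time((now - 60) as u32, now));
    println!("{:?}", classify_rx_time((now - 3600) as u32, now));
}
```

Only `Stale` would be dropped; `Unknown` packets would continue to mesh.messages, which is the behavior the current patch is described as having.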

.116 live validation after 2026-06-30 hotfix:

  • .116 reachable by SSH; archipelago active; /dev/mesh-radio -> ttyUSB0 attached.
  • Current canary deploy is commit b4531bb4; backend sha 4ab53e539d89679ef664401a9a57996267772fed02327abc2912c3e77543acbf; frontend bundle index-YOAeJF7w.js / Mesh-BSAo88jN.js.
  • main pushed to gitea-vps2.
  • RPC on .116:
    • transport.status currently reports mesh_only:false (off-grid mode is not enabled unless the user toggles it).
    • mesh.status reports Meshtastic connected: device_type:"meshtastic", self_node_id:1135977788, peer_count:13.
    • Recent .116 -> 3ccc sent rows are stored with real 2026 timestamps and transport:"lora".
  • UI/backend fixes included in b4531bb4:
    • transportLabel("lora") displays LoRa.
    • mesh sends refetch messages after send so transport pills settle without browser refresh.
    • off-grid mode blocks the mesh-chat FIPS/Tor federation fallback and forces LoRa-only sends; banner text is Tor/FIPS disabled - LoRa only.
    • empty mesh-chat placeholder opacity reduced.
  • Meshtastic diagnostics now identify the remaining blocker:
    • 3ccc NodeInfo is discovered: Meshtastic peer is PKC-capable (NodeInfo public_key) node=1128152268 key_len=32.
    • Bytes from stock Meshtastic text reach .116, but the custom parser rejects the packet: Meshtastic FromRadio.packet did not parse into a decoded MeshPacket len=73 head=0dcc3c3e43153ca5b5432a16df56cbed.
    • Non-text packets decode and are ignored with port numbers (portnum=3/4/5), so the serial read path is alive. Resume inside core/archipelago/src/mesh/meshtastic.rs::parse_mesh_packet.
  • LoRa is therefore not fully fixed yet: stock 3ccc -> .116 text does not surface in mesh.messages, and .116 -> 3ccc still needs user-visible confirmation in the Meshtastic app.
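As a starting point for the parse_mesh_packet investigation, the logged head bytes can be walked with a generic protobuf field reader. The sketch below is self-contained and hypothetical (it is not archy's parser). Interpreting the result assumes that in MeshPacket field 1 is from, field 2 is to, field 4 is the decoded payload, and field 5 is the encrypted payload; that is my reading of mesh.proto and should be checked against the proto file.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Field {
    Fixed32 { number: u32, value: u32 },
    Varint { number: u32, value: u64 },
    /// Length-delimited field; `available` may be short if the dump was truncated.
    Bytes { number: u32, declared_len: usize, available: usize },
}

fn hex_to_bytes(s: &str) -> Vec<u8> {
    (0..s.len() / 2)
        .map(|i| u8::from_str_radix(&s[2 * i..2 * i + 2], 16).expect("valid hex"))
        .collect()
}

fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    while *pos < buf.len() && shift < 64 {
        let b = buf[*pos];
        *pos += 1;
        value |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
    None
}

/// Walk top-level fields, stopping quietly at truncation or an unhandled wire type.
fn walk_fields(buf: &[u8]) -> Vec<Field> {
    let mut pos = 0;
    let mut out = Vec::new();
    while pos < buf.len() {
        let Some(tag) = read_varint(buf, &mut pos) else { break };
        let number = (tag >> 3) as u32;
        match tag & 7 {
            0 => match read_varint(buf, &mut pos) {
                Some(value) => out.push(Field::Varint { number, value }),
                None => break,
            },
            5 => {
                if pos + 4 > buf.len() {
                    break;
                }
                let value =
                    u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]]);
                pos += 4;
                out.push(Field::Fixed32 { number, value });
            }
            2 => {
                let Some(len) = read_varint(buf, &mut pos) else { break };
                let declared_len = len as usize;
                let available = declared_len.min(buf.len() - pos);
                out.push(Field::Bytes { number, declared_len, available });
                pos += available;
                if available < declared_len {
                    break;
                }
            }
            _ => break,
        }
    }
    out
}

fn main() {
    // Head bytes copied from the log line above.
    let head = hex_to_bytes("0dcc3c3e43153ca5b5432a16df56cbed");
    for field in walk_fields(&head) {
        println!("{:?}", field);
    }
}
```

On the logged head this yields field 1 = 1128152268 (the 3ccc node), field 2 = 1135977788 (.116), then a length-delimited field 5 declaring 22 bytes. If the field-number assumption holds, the radio delivered this packet with an encrypted payload rather than a decoded one, which would point at decryption on the radio (keys/PKC) rather than at a parser bug; that possibility seems worth ruling out before changing the parser.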