Pin RELEASE_ROOT_PUBKEY_HEX from the 2026-07-02 release-root signing ceremony
(signer did:key:z6MkkidEnEpo6qHMCNSZoNKWtvQvxq3whnaME9wGgEFhq7ur) so nodes verify
the publisher identity of the app-catalog. Sign releases/app-catalog.json in place.
Fix two floats that made the catalog unsignable: archy-btcpay-db manifest version
-> string, fedimint-clientd cpu_limit 0.25 -> 1 (u32). Add scripts/sign-catalog.sh
helper, the 1.8.0 release-hardening plan/tracker, and the commit-and-push project
rule in CLAUDE.md.
Backward-compatible: old binaries still accept the signed catalog; the pinned-anchor
binary ships in the next build/OTA.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Archipelago 1.8.0 — Release Hardening Plan & Tracker
The one living checklist for shipping 1.8.0. Derived from a full-system deep audit (2026-07-02): backend security, backend code-quality, frontend, mesh, tests/release pipeline, and the ISO build. Supersedes nothing — it sits above
docs/UNIFIED-TASK-TRACKER.md (day-to-day) as the release exit-criteria list. Keep it updated: tick a box the moment an item lands, with the commit sha.
Definition of done for 1.8.0: the supply chain is authenticated end-to-end (§A), OTA self-update is safe and rollback-proven on real hardware (§B), no secrets ship in the image (§F), and the single-node gate stays 5/5 green through all of it. Everything else is polish that should not block the tag.
Legend: [ ] open · [~] in progress · [x] done · 🔴 critical · 🟠 high ·
🟡 medium · 🟢 low/polish · ⛔ blocked on you.
🎯 The single most important insight
The release signing ceremony (Workstream B) is the linchpin. ✅ The ceremony
KEY was generated (user confirmed 2026-07-02) — the hard offline part is done. But
the outputs are not yet wired into the repo: anchor.rs:21 is still None and
releases/app-catalog.json carries no signature/signed_by (its image_signature
fields are literal "cosign://..." placeholders). Three mechanical steps remain,
split by who can run them: (1) pin the pubkey — needs only the public hex, can
be done in-repo now; (2) sign the catalog with the RELEASE_MASTER_MNEMONIC —
only the publisher, secret never touches a host; (3) implement + flip cosign
enforcement on the pull path. Until (1)+(2) land, every "verify the signature" task
below is written but not enforced. This is still the critical path; §A converges on it.
§A — Supply-chain authentication (🔴 THE release blocker)
Today an attacker who controls the mirror IP (or any MITM on the plaintext HTTP
path) can ship an arbitrary root binary, arbitrary container images, and an
arbitrary app catalog to the entire fleet — fully unattended under
auto_apply. These four items are one story and must land together.
- 🔴 Pin RELEASE_ROOT_PUBKEY_HEX + sign the catalog: DONE 2026-07-02. anchor.rs pinned to 5d15cbee…d469951 (signer did:key:z6MkkidEnEpo6qHMCNSZoNKWtvQvxq3whnaME9wGgEFhq7ur); trust tests updated (16/16 green). releases/app-catalog.json signed in place (signed_by matches, 64-byte sig); two blocking floats fixed en route (archy-btcpay-db version → string, cpu_limit 0.25 → 1). Ship order (backward-compatible): signed catalog goes out first (old binaries still accept it), pinned-anchor binary follows in the next build/OTA. Still ahead: (a) the pinned-anchor binary must actually be built + shipped for enforcement to be live on nodes; (b) flip "accept unsigned" → "reject unsigned" only after the whole fleet is on the pinned binary (container/app_catalog.rs:397, the Unsigned arm); see the next item.
- 🔴 Enforce a signature on the OTA manifest before trusting it. update.rs:68 fetches http://146.59.87.168:3000/.../manifest.json over cleartext and parses/trusts it with no trust::verify_detached call; component sha256/blake3 are only checked against that same unauthenticated manifest → remote root RCE. Move to HTTPS + pinned cert, require an Ed25519 release-root signature, and refuse auto_apply until the anchor is pinned.
- 🔴 Implement container image signature verification (cosign). container/src/podman_client.rs:255: pull_image(.., _signature) silently discards the signature that the manifest threads all the way down (prod_orchestrator.rs:1978/2435). Wire sigstore-rs/cosign verify (or podman pull --signature-policy); hard-fail when a declared signature doesn't verify.
- 🟠 Move the image mirror to HTTPS; drop --tls-verify=false. podman_client.rs:641 INSECURE_REGISTRY_HOSTS = ["146.59.87.168:3000"] + config.rs:104,124 allowlist pull images over unauthenticated HTTP. Remove the raw-IP entries; give the mirror a valid/pinned cert. (Same host also baked insecurely into the ISO; see §F.)
- 🟠 Validate every image string at the pull site, not just the RPC boundary. is_valid_docker_image runs in install.rs:224 / runtime.rs:549 but prod_orchestrator::install_fresh (1978) and resolve_catalog_image (944-971) pass catalog/manifest images straight to pull_image. Call the validator right before every pull.
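The last item above (validate at the pull site) can be sketched as a wrapper that every pull has to go through. This is a minimal illustration under stated assumptions, not the project's code: is_plausible_image_ref is a hypothetical stand-in for the real is_valid_docker_image (whose exact rules are not shown here), and pull_checked is a hypothetical name for the single choke point in front of pull_image.

```rust
/// One path component: lowercase alphanumerics plus '.', '-', '_',
/// and never starting with '-' (which could be read as a CLI flag).
fn component_ok(c: &str) -> bool {
    !c.is_empty()
        && !c.starts_with('-')
        && c.bytes()
            .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || matches!(b, b'.' | b'-' | b'_'))
}

/// Hypothetical stand-in for is_valid_docker_image; the real rules may differ.
/// Accepts `[host[:port]/]path[:tag][@sha256:<64 hex>]`.
pub fn is_plausible_image_ref(image: &str) -> bool {
    if image.is_empty() || image.len() > 255 {
        return false;
    }
    // Split off an optional digest and check its shape.
    let (name_tag, digest) = match image.split_once('@') {
        Some((n, d)) => (n, Some(d)),
        None => (image, None),
    };
    if let Some(d) = digest {
        let ok = d
            .strip_prefix("sha256:")
            .map_or(false, |h| h.len() == 64 && h.bytes().all(|b| b.is_ascii_hexdigit()));
        if !ok {
            return false;
        }
    }
    // A tag is what follows the last ':' AFTER the last '/', so a registry
    // port such as host:3000/app is not mistaken for a tag.
    let last_slash = name_tag.rfind('/').map_or(0, |i| i + 1);
    let (name, tag) = match name_tag[last_slash..].rfind(':') {
        Some(i) => (&name_tag[..last_slash + i], Some(&name_tag[last_slash + i + 1..])),
        None => (name_tag, None),
    };
    let mut parts: Vec<&str> = name.split('/').collect();
    // Only the first of several components may carry a numeric ":port".
    let first = parts[0];
    if parts.len() > 1 {
        if let Some((host, port)) = first.split_once(':') {
            if port.is_empty() || !port.bytes().all(|b| b.is_ascii_digit()) {
                return false;
            }
            parts[0] = host;
        }
    }
    let name_ok = parts.iter().all(|c| component_ok(c));
    let tag_ok = tag.map_or(true, |t| {
        !t.is_empty()
            && t.len() <= 128
            && t.bytes().all(|b| b.is_ascii_alphanumeric() || matches!(b, b'.' | b'-' | b'_'))
    });
    name_ok && tag_ok
}

/// Hypothetical choke point: the pull closure is never invoked for a
/// reference that fails validation.
pub fn pull_checked<F>(image: &str, pull: F) -> Result<(), String>
where
    F: FnOnce(&str) -> Result<(), String>,
{
    if !is_plausible_image_ref(image) {
        return Err(format!("refusing to pull invalid image reference: {image:?}"));
    }
    pull(image)
}
```

Putting the check inside the wrapper, rather than at each call site, means a new caller such as install_fresh cannot forget it. Validation of the string is separate from signature verification; it only keeps malformed or flag-like references away from the podman command line.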
§B — OTA self-update safety (🔴 1.8.0's headline feature is untested live)
The apply path itself is well-built (resumable download, staged-complete marker, atomic swap, single-depth backup). The gaps are authenticity (§A) and verification depth — plus the fact that the upgrade path has never run end-to-end on real hardware.
- 🔴 Deepen the post-OTA health check. update.rs:456 (probe_frontend_once) passes on any 2xx/3xx from GET /, and verify_pending_update (494-593) only rolls back on that. A release with a broken RPC API, dead containers, or failed LND unlock passes and never rolls back. Add /rpc/v1 update.status + container-list/required-stack health assertions before clearing the pending-verify marker.
- 🟠 Run one real upgrade-from-vN-1 soak on hardware before tagging. No test installs the previous version, points it at a staged 1.8.0 manifest, applies, and asserts health + rollback. This is the top release risk for an OTA release. A two-VM (or two-node) harness is enough.
- 🟡 Guard the frontend-build-no-op in the actual release path. The ui-dist-version grep guard (tests/release/run.sh:82) is behind --with-build, which scripts/create-release.sh:90 never passes → a stale frontend can ship with a valid sha256. Call run.sh --with-build --manifest from create-release (or fold the grep in).
- 🟢 publish-release-assets verifies size, not sha256 (publish-release-assets.sh:97). Add a HEAD/GET sha256 compare so a size-correct/content-wrong mirror asset fails the publish gate.
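The deeper health check in the first item above could be structured as a pure decision over collected probe results, as in this rough sketch. Everything here is illustrative: HealthReport, Verdict, and evaluate are hypothetical names, and how the probes are gathered (HTTP call, update.status RPC, container listing) is left out on purpose.

```rust
/// Hypothetical snapshot of probe results gathered after an OTA apply.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HealthReport {
    /// HTTP status of GET /, or None if the frontend was unreachable.
    pub frontend_status: Option<u16>,
    /// Whether the update.status RPC answered successfully.
    pub rpc_update_status_ok: bool,
    /// Names of containers currently reported as running.
    pub running_containers: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Safe to clear the pending-verify marker.
    Healthy,
    /// Roll back; the strings say which assertions failed.
    RollBack(Vec<String>),
}

/// Every assertion must hold; a passing frontend probe alone is not enough.
pub fn evaluate(report: &HealthReport, required: &[&str]) -> Verdict {
    let mut failures = Vec::new();
    match report.frontend_status {
        Some(s) if (200..400).contains(&s) => {}
        Some(s) => failures.push(format!("frontend returned HTTP {s}")),
        None => failures.push("frontend unreachable".to_string()),
    }
    if !report.rpc_update_status_ok {
        failures.push("update.status RPC failed".to_string());
    }
    for name in required {
        if !report.running_containers.iter().any(|c| c.as_str() == *name) {
            failures.push(format!("required container not running: {name}"));
        }
    }
    if failures.is_empty() {
        Verdict::Healthy
    } else {
        Verdict::RollBack(failures)
    }
}
```

Keeping the verdict separate from the probing makes the rollback rule unit-testable without a node, and the failure list gives the rollback log a reason instead of a bare "unhealthy".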
§C — Backend robustness (🟠 stability, mostly low-effort/high-ROI)
Note: the .unwrap()/panic! worry is a non-issue — nearly all are in test
modules; production request/boot paths are essentially panic-free. The real risks:
- 🟠 Log swallowed persistence writes. ~30-40 dangerous let _ = save_*().await sites discard durability failures with zero diagnostics: server.rs:270 (mesh config), bitcoin_relay.rs:865 (relay state), update.rs:163/1223 (mirrors/update state), registry.rs:158, mesh/status.rs:286, scheduler.rs:179, install.rs:34. Convert to if let Err(e) = … { warn!(…) }; leave genuinely fire-and-forget ones commented.
- 🟠 Remove blocking std::process::Command from async handlers. install.rs:2222 published_host_port (sync podman on the install path), dependencies.rs:316 (df), system/handlers.rs:578 (sudo), transport/fips.rs:50 (systemctl) stall tokio workers under load. Convert to tokio::process or spawn_blocking. Only 8 files use std::process::Command, so the work is bounded.
- 🟡 Restrict Bitcoin RPC exposure. bootstrap.rs:409 writes rpcallowip=0.0.0.0/0. Scope to the container subnet / 127.0.0.1.
- 🟡 Move generated secrets from env to file mounts. manifest.rs:1208-1226 injects secrets as -e KEY=value, readable via podman inspect / /proc/<pid>/environ. Prefer bind-mounting the existing 0600 secret file or podman --secret.
- 🟡 Harden rate-limit IP extraction. middleware.rs:120-128 trusts client-spoofable X-Real-IP / X-Forwarded-For → per-request bucket rotation defeats the login limiter. Trust forwarded headers only from a configured proxy; have nginx set them.
- 🟢 Include seq in the mesh signed preimage. message_types.rs:245-288 signs (t,v,ts) but sets the anti-replay seq after signing → a radio MITM can alter ordering without breaking the signature.
- 🟢 Guard the short-DID slice panic (mesh/listener/decode.rs:566) and gate the dev-mode password123 bypass (auth.rs:18) behind #[cfg] before it can reach a release build.
- 🟢 Apply the seccomp/apparmor profile: security/src/container_policies.rs:71 is a TODO; the profile is defined but never applied to podman.
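For the rate-limit item above, one possible shape of "trust forwarded headers only from a configured proxy" is sketched below. The function name client_ip and its parameters are assumptions for illustration; the sketch also assumes the proxy (nginx in this document) overwrites X-Real-IP and appends the address it saw to X-Forwarded-For, which has to be configured on the proxy side.

```rust
use std::net::IpAddr;

/// Pick the address used as the rate-limit bucket key.
/// Forwarded headers are honoured only when the TCP peer is a trusted proxy;
/// otherwise the client could rotate buckets by sending a new header per request.
pub fn client_ip(
    peer: IpAddr,
    x_real_ip: Option<&str>,
    x_forwarded_for: Option<&str>,
    trusted_proxies: &[IpAddr],
) -> IpAddr {
    if !trusted_proxies.contains(&peer) {
        return peer; // direct connection: headers are attacker-controlled
    }
    if let Some(ip) = x_real_ip.and_then(|v| v.trim().parse::<IpAddr>().ok()) {
        return ip;
    }
    // Walk X-Forwarded-For right to left: entries appended by trusted proxies
    // are skipped, and the first untrusted hop is taken as the client.
    if let Some(xff) = x_forwarded_for {
        for hop in xff.rsplit(',') {
            match hop.trim().parse::<IpAddr>() {
                Ok(ip) if trusted_proxies.contains(&ip) => continue,
                Ok(ip) => return ip,
                Err(_) => break, // malformed chain: fall back to the peer
            }
        }
    }
    peer
}
```

Reading X-Forwarded-For from the right matters because anything to the left of the proxy's own entry was supplied by the client and can be forged.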
§D — Frontend security & performance (🟠)
The untrusted mesh/LoRa chat path is safe (interpolation, no v-html — good).
The real issues are the app-bridge origin model and a bloated bundle.
- 🟠 Validate event.origin + add consent gates in the NIP-07 nostr bridge. stores/appLauncher.ts:385-490 derives the caller from the launcher's own URL, never event.origin, and getPublicKey / nip04.decrypt / nip44.decrypt have no consent gate → any co-resident iframe can deanonymize the nostr identity or use the node as a decryption oracle while an app is open. Check event.origin against the open app's real origin; key approvals on it; gate decrypt/getPublicKey like signEvent.
- 🟠 Origin-check the share-to-mesh handler. App.vue:450-464 acts on {type:'share-to-mesh', cid} from any sender and force-navigates to /mesh with the CID pre-staged. Add ev.origin === window.location.origin (as Chat.vue:95 already does).
- 🟡 Decide the app-iframe isolation model. AppSessionFrame.vue:54 / AppLauncherOverlay.vue:79 embed apps same-origin with no meaningful sandbox; a same-origin app can read the CSRF cookie + localStorage. Ideal fix (serve apps from a per-app subdomain origin) is architectural; at minimum decide + document for 1.8.0.
- 🟡 Shrink the 93 MB dist. assets/video/video-intro.mp4 is 14.7 MB (precached by the service worker → blocks PWA install), plus ~18 MB of ~1 MB full-screen JPEGs. Convert backgrounds to WebP/AVIF at responsive sizes, lazy/stream the intro video, and exclude video/audio from the Workbox precache. Biggest, easiest perf win.
- 🟢 DOMPurify the Server.vue QR SVG (:283 / :295 render v-html unsanitized while TwoFactorSection.vue sanitizes the analogous SVG); guard the unguarded pollInterval (Mesh.vue:391); surface silent data-fetch failures (curatedApps.ts:58/71).
§E — Mesh transports (🟢 mostly done — verify & polish)
Confirmed fixed in HEAD: B8 (1970 timestamps), B6 (inbound RX surfacing), the per-message transport pill, and the archy↔archy plain-TEXT-DM E2E fix. Remaining:
- 🟠 Active Reticulum daemon-death detection. reticulum.rs:589 only warn!s on socket EOF and try_recv_frame then returns Ok(None) forever; nothing calls child.try_wait(). On an idle link a crashed daemon is invisible for up to 30 min (the RX-stall timeout). Treat socket EOF as Err → immediate respawn. (Pairs with the current fix/reticulum-daemon-pdeathsig branch work.)
- 🟡 Persist chat history across restarts. state.messages boots empty (listener/mod.rs:283) while outbox/scheduler/peers survive, which is inconsistent; bubbles vanish on restart. Add mesh-messages.json mirroring the scheduler.rs / outbox.rs pattern (or explicitly accept the loss).
- 🟡 Tighten the 30 s legacy dedup (listener/mod.rs:383-389): it silently drops a peer legitimately sending identical text twice within 30 s.
- 🟢 Wire the PyInstaller daemon binary into the release tarball / deploy script (Rust expects /usr/local/bin/archy-reticulum-daemon, reticulum.rs:80); add the RNode udev rule; finish ARCHY:2: announce → arch_pubkey_hex binding (reticulum.rs:119).
- 🟢 Duty-cycle guard for LoRa TX: none exists; EU 868 is legally 1%. At minimum an airtime budget/warning.
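The airtime budget mentioned in the last item could look roughly like the sliding-window sketch below. AirtimeBudget and try_reserve are hypothetical names; the one-hour window in the usage note is an assumption about how the 1% limit is measured, and the per-frame airtime is taken as an input (computing it from spreading factor, bandwidth, and payload size is out of scope here).

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Sliding-window airtime budget. Timestamps are monotonic offsets
/// (e.g. time since boot) so the type stays easy to test.
pub struct AirtimeBudget {
    window: Duration,
    limit: Duration,
    /// (time of transmission, airtime used), oldest first.
    sent: VecDeque<(Duration, Duration)>,
}

impl AirtimeBudget {
    /// `duty_cycle_percent` of `window` may be spent transmitting.
    pub fn new(window: Duration, duty_cycle_percent: u32) -> Self {
        Self {
            window,
            limit: window * duty_cycle_percent / 100,
            sent: VecDeque::new(),
        }
    }

    /// Drop entries that have left the window, then sum what remains.
    fn used(&mut self, now: Duration) -> Duration {
        while let Some(&(at, _)) = self.sent.front() {
            if now.saturating_sub(at) >= self.window {
                self.sent.pop_front();
            } else {
                break;
            }
        }
        self.sent.iter().map(|&(_, air)| air).sum()
    }

    /// Record the transmission and return true if it fits the budget;
    /// return false (and record nothing) if it would exceed the limit.
    pub fn try_reserve(&mut self, now: Duration, airtime: Duration) -> bool {
        if self.used(now) + airtime > self.limit {
            return false;
        }
        self.sent.push_back((now, airtime));
        true
    }
}
```

With AirtimeBudget::new(Duration::from_secs(3600), 1) the limit works out to 36 s of transmit time per rolling hour. A caller that gets false could queue the frame, or only surface a warning, which matches the "at minimum" wording above.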
§F — ISO / image build (🔴 one secret leak; otherwise 🟠 hardening)
image-recipe/_archived/build-auto-installer-iso.sh (3604 lines) is the real
builder; OTA is the normal update path but the ISO is what produces installable
media (latest artifact only one minor behind).
- ⛔🔴 Anthropic API key: INTENTIONAL for alpha/beta, hard GO-LIVE gate. build-auto-installer-iso.sh:2645 bakes a live sk-ant-… key into claude-api-proxy.service so alpha/beta testers get frictionless AI (deliberate, per user 2026-07-02). Do NOT remove for alpha/beta. Before public GA it MUST be removed + rotated + injected at runtime (a second copy also exists in a worktree). Track it here so it can't be forgotten at launch.
- 🔴 Per-device secrets on first boot. The self-signed TLS private key is generated at build time (:426) → every device ships the same key; SSH host keys likewise not regenerated. Generate TLS + SSH host keys on first boot.
- 🟠 Kill default credentials. archipelago/archipelago (SSH+root), web password123, and SSH PasswordAuthentication yes (:411) all ship. Lock root, force credential creation in onboarding, disable SSH password auth (or force-change on first login).
- 🟠 Sign + checksum the ISO. Pipeline ends at xorriso with no SHA256SUMS, no GPG/minisign, no Secure Boot (BOOTX64.EFI is unsigned though grub-efi-amd64-signed is installed). Emit + sign checksums; wire signed Secure Boot.
- 🟠 Registries over HTTPS in the image too: 146.59.87.168:3000 / git.tx1138.com are baked insecure=true / tls_verify:false (:216, :2308). (Ties to §A.)
- 🟡 Add unattended-upgrades + a default-deny nftables firewall (allow 22/80/443 + mesh/WG). Neither exists today; OS packages drift until reflash and there is no host firewall.
- 🟡 Pin the build for reproducibility. FIPS daemon is built from unpinned upstream main, Tailscale from its live apt repo, and scripts/image-versions.sh uses many :latest / stable tags (+ bitcoin-ui:1.7.84-alpha, 15 behind). Pin to commits/versions; snapshot apt. Wire ISO version to Cargo.toml so it can't drift.
- 🟢 Harden LUKS + roadmap A/B partitioning. The LUKS data key sits in plaintext on the unencrypted root (:2137); add TPM2/passphrase binding. Longer-term: A/B (or factory-reset) partitions for safe OTA rollback, and a real install-time TUI (docs/INSTALL-SCREENS-DESIGN.md exists but the installer is headless "press Enter").
§G — Refactor & code health (🟢 not release-blocking; do after the tag or opportunistically)
- 🟢 Manifest-drive per-app special-casing. App names are branched on across 5-7 Rust files (config.rs 36 match arms, runtime.rs 17, install.rs:275-287 dispatch, prod_orchestrator.rs:54-83 baseline/restart-sensitive lists). Move baseline, restart_sensitive, stack_members, multi_container into the manifest schema; collapse the five near-identical install_*_stack() wrappers into one generic call. Biggest maintainability win.
- 🟢 Route all podman/systemctl through podman_client. 113 raw Command::new("podman") + 32 systemctl calls bypass the existing 952-LOC wrapper → untestable + the blocking-call risk (§C). Consolidating also unlocks unit tests for the thinly-tested package/ handlers (stacks.rs 1 test, config.rs 2, runtime.rs 3, install.rs 7).
- 🟢 Split the god-modules. prod_orchestrator.rs (5,263 LOC) → orchestrator/{reconcile,host_ports,ownership,hooks}.rs; Mesh.vue (2,485 LOC / 241 KB chunk) → sub-components. Both are well-tested, so safe.
- 🟢 Delete dead code. ~4,100 LOC of orphan StartOS crates (js-engine, models, helpers, container-init) not in the workspace or linked; the committed AppleDouble ._*.rs files; the committed .venv/, build/, and __pycache__ under the duplicate reticulum-daemon/ tree; promote MeshRadioDevice enum → trait.
- 🟢 Resolve the Quadlet flag & dep hygiene. Decide use_quadlet_backends' fate (flip default + delete the legacy create_container branch, or freeze as experimental; don't ship both half-maintained). Consolidate the mixed hyper 0.14/1.x ecosystem; bump stale majors (reqwest, base64, thiserror, tokio-tungstenite).
§H — Testing gaps that gate confidence (🟠)
- 🟠 Add the OTA upgrade soak (same as §B item 2) — the highest-value missing test.
- 🟡 Add a host-reboot survival tier: every app is ○ (untested) for reboot in TESTING.md:138; the gate can't reboot the node it runs on. Run SSH-reboot-then-reprobe out-of-band per node.
- 🟡 Make the release gate run the full Rust suite (or hard-require a green CI sha). tests/release/run.sh:101 runs only a 6-module slice because the full 1000-test suite hangs PTYs on the dev box → 994 tests unverified at release time if CI is stale.
- 🟡 Add --max-time to node_rpc() (tests/multinode/lib/multinode.bash): a slow server-side RPC hangs the whole multinode suite with no feedback.
- 🟢 De-hardcode creds/IPs in tests (tests/multinode/smoke.sh:32, remote-lifecycle.sh:136); snapshot/restore node baseline between destructive iterations (teardown currently only clears /tmp session files).
§I — Carried-over open items (from UNIFIED-TASK-TRACKER.md, still valid)
- [~] 🟠 Multinode gate pass: 5× destructive gate was launched on node .5; bring the rest of the fleet to precondition, then run the existing (undocumented-but-present) tests/multinode/{smoke,meshtastic}.sh cross-node suites.
- 🟠 Federation remove-node tombstone regression. federation/storage.rs:187 does let _ = tombstone_did(...), which swallows the write error, so a removed peer reappears after the next sync. (This is a specific, confirmed instance of the §C swallowed-writes class.) Needs a careful fix + smoke.sh re-verify.
- 🟠 Phase-3 Quadlet default-flip: validated + opt-in on .228/.198; flip config.rs:256 once the .5 gate reports clean.
- 🟠 Developer CLI suite (archy app validate/render/install/test): gates external app publishing (APP-PACKAGING-MIGRATION-PLAN.md step 5).
- ⛔🟡 Version-naming decision (1.7.99-alpha → 1.8.0 vs 1.8.00-alpha): a one-line call, then a mechanical bump + tag. Needs your decision.
- ⛔🟢 Bitcoin multi-version fleet OTA: .228 working on branch; rollout timing is held for your call (docs/bitcoin-version-bulletproof-rollout.md).
- ⛔🟢 3ccc stock-Meshtastic RF validation: code fix in place; needs a live radio send.
Suggested order of attack
- The critical path: §A signing ceremony → then turn on manifest/catalog/image signature enforcement (§A) + OTA HTTPS/signature + deeper health check (§B).
- Cheap high-ROI stability: §C swallowed-writes + blocking-calls; §D nostr-bridge + share-to-mesh origin checks; §H OTA soak + reboot tier.
- Image hardening: rest of §F (per-device secrets, default creds, ISO signing, firewall/unattended-upgrades, pinning).
- Polish, post-tag: §G refactors, §E mesh persistence/dedup, §D bundle shrink.
- Decisions you own (⛔): version name, signing mnemonic, bitcoin OTA timing, 3ccc test.
- Before public GA only (NOT alpha/beta): remove + rotate the Anthropic key (§F) — intentionally left in for frictionless AI during alpha/beta.
Last updated: 2026-07-02 (initial deep-audit synthesis). Update this line + tick boxes with commit shas as items land.