Archipelago 1.8.0 — Release Hardening Plan & Tracker

The one living checklist for shipping 1.8.0. Derived from a full-system deep audit (2026-07-02): backend security, backend code-quality, frontend, mesh, tests/release pipeline, and the ISO build. Supersedes nothing — it sits above docs/UNIFIED-TASK-TRACKER.md (day-to-day) as the release exit-criteria list. Keep it updated: tick a box the moment an item lands, with the commit sha.

Definition of done for 1.8.0: the supply chain is authenticated end-to-end (§A), OTA self-update is safe and rollback-proven on real hardware (§B), no secrets ship in the image (§F), and the single-node gate stays 5/5 green through all of it. Everything else is polish that should not block the tag.

Legend: [ ] open · [~] in progress · [x] done · 🔴 critical · 🟠 high · 🟡 medium · 🟢 low/polish · blocked on you.


🎯 The single most important insight

The release signing ceremony (Workstream B) is the linchpin. The ceremony KEY was generated (user confirmed 2026-07-02) — the hard offline part is done. But the outputs are not yet wired into the repo: anchor.rs:21 is still None and releases/app-catalog.json carries no signature/signed_by (its image_signature fields are literal "cosign://..." placeholders). Three mechanical steps remain, split by who can run them: (1) pin the pubkey — needs only the public hex, can be done in-repo now; (2) sign the catalog with the RELEASE_MASTER_MNEMONIC — only the publisher, secret never touches a host; (3) implement + flip cosign enforcement on the pull path. Until (1)+(2) land, every "verify the signature" task below is written but not enforced. This is still the critical path; §A converges on it.


§A — Supply-chain authentication (🔴 THE release blocker)

Today an attacker who controls the mirror IP (or any MITM on the plaintext HTTP path) can ship an arbitrary root binary, arbitrary container images, and an arbitrary app catalog to the entire fleet — fully unattended under auto_apply. These four items are one story and must land together.

  • [x] 🔴 Pin RELEASE_ROOT_PUBKEY_HEX + sign the catalog — DONE 2026-07-02. anchor.rs pinned to 5d15cbee…d469951 (signer did:key:z6MkkidEnEpo6qHMCNSZoNKWtvQvxq3whnaME9wGgEFhq7ur); trust tests updated (16/16 green). releases/app-catalog.json signed in place (signed_by matches, 64-byte sig); two blocking floats fixed en route (archy-btcpay-db version→string, cpu_limit 0.25→1). Ship order (backward-compatible): signed catalog goes out first (old binaries still accept it), pinned-anchor binary follows in the next build/OTA. Still ahead: (a) the pinned-anchor binary must actually be built + shipped for enforcement to be live on nodes; (b) flip "accept unsigned" → "reject unsigned" only after the whole fleet is on the pinned binary (container/app_catalog.rs:397, the Unsigned arm) — see the next item.
  • [~] 🔴 Enforce a signature on the OTA manifest before trusting it. Signature verification LANDED 2026-07-02: check_for_updates now fetches raw JSON and runs trust::verify_detached — a present-but-invalid/wrong-signer signature hard-rejects the mirror; unsigned manifests are offered for MANUAL apply only (manifest_signed surfaced in UpdateState) and auto-apply refuses them. Publisher side: create-release.sh signs the manifest inline (ceremony), publish-release-assets.sh hard-refuses to ship unsigned (grep + ceremony verify crypto gate), and scripts/sign-manifest.sh exists for re-signs. Still open: move the mirror to HTTPS + pinned cert (tracked with the next item); flip unsigned-manual-apply → hard-reject once the fleet is on a pinned-anchor binary.
  • 🔴 Implement container image signature verification (cosign). container/src/podman_client.rs:255: pull_image(.., _signature) silently discards the signature that the manifest threads all the way down (prod_orchestrator.rs:1978/2435). Wire sigstore-rs/cosign verify (or podman pull --signature-policy); hard-fail when a declared signature doesn't verify.
  • 🟠 Move the image mirror to HTTPS; drop --tls-verify=false. podman_client.rs:641 INSECURE_REGISTRY_HOSTS = ["146.59.87.168:3000"] + config.rs:104,124 allowlist pull images over unauthenticated HTTP. Remove the raw-IP entries; give the mirror a valid/pinned cert. (Same host also baked insecurely into the ISO — see §F.)
  • 🟠 Validate every image string at the pull site, not just the RPC boundary. is_valid_docker_image runs in install.rs:224/runtime.rs:549 but prod_orchestrator::install_fresh (1978) and resolve_catalog_image (944-971) pass catalog/manifest images straight to pull_image. Call the validator right before every pull.
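
The manifest rules in this section reduce to a small decision table: an invalid or wrong-signer signature rejects the mirror, an unsigned manifest is allowed for manual apply only, and a verified one may be applied unattended. The following is a minimal sketch of that table, not the shipped code. The names `ManifestTrust`, `ApplyMode`, `Decision`, `decide`, and `decide_strict` are hypothetical, and the cryptographic check itself (trust::verify_detached) is assumed to have already produced the `ManifestTrust` value.

```rust
/// Outcome of checking the detached signature on the raw manifest bytes.
/// Hypothetical type for illustration; the real check is trust::verify_detached.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestTrust {
    /// Signature present and valid for the pinned release root.
    Verified,
    /// No signature on the manifest at all.
    Unsigned,
    /// Signature present but invalid, or made by the wrong signer.
    Invalid,
}

/// Who is asking for the update to be applied.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ApplyMode {
    /// The unattended scheduler.
    Auto,
    /// An operator explicitly requesting the apply.
    Manual,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    /// Go ahead and apply.
    Apply,
    /// Keep the update on offer, but do not apply it in this mode.
    Refuse,
    /// Drop this mirror's manifest entirely.
    RejectMirror,
}

/// Transitional policy: unsigned manifests are still tolerated for manual apply.
pub fn decide(trust: ManifestTrust, mode: ApplyMode) -> Decision {
    match (trust, mode) {
        (ManifestTrust::Invalid, _) => Decision::RejectMirror,
        (ManifestTrust::Verified, _) => Decision::Apply,
        (ManifestTrust::Unsigned, ApplyMode::Manual) => Decision::Apply,
        (ManifestTrust::Unsigned, ApplyMode::Auto) => Decision::Refuse,
    }
}

/// End-state policy once the whole fleet runs a pinned-anchor binary:
/// anything that is not verified is rejected, regardless of mode.
pub fn decide_strict(trust: ManifestTrust) -> Decision {
    match trust {
        ManifestTrust::Verified => Decision::Apply,
        ManifestTrust::Unsigned | ManifestTrust::Invalid => Decision::RejectMirror,
    }
}
```

Keeping the two policies as separate functions makes the planned flip from manual-apply to hard-reject a one-line call-site change rather than a rewrite of the match.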

§B — OTA self-update safety (🔴 1.8.0's headline feature is untested live)

The apply path itself is well-built (resumable download, staged-complete marker, atomic swap, single-depth backup). The gaps are authenticity (§A) and verification depth — plus the fact that the upgrade path has never run end-to-end on real hardware.

  • 🔴 Deepen the post-OTA health check. update.rs:456 (probe_frontend_once) passes on any 2xx/3xx from GET /, and verify_pending_update (494-593) only rolls back on that. A release with a broken RPC API, dead containers, or failed LND unlock passes and never rolls back. Add /rpc/v1 update.status + container-list/required-stack health assertions before clearing the pending-verify marker.
  • 🟠 Run one real upgrade-from-vN-1 soak on hardware before tagging. No test installs the previous version, points it at a staged 1.8.0 manifest, applies, and asserts health + rollback. This is the top release risk for an OTA release. A two-VM (or two-node) harness is enough.
  • 🟡 Guard the frontend-build-no-op in the actual release path. The ui-dist-version grep guard (tests/release/run.sh:82) is behind --with-build, which scripts/create-release.sh:90 never passes → a stale frontend can ship with a valid sha256. Call run.sh --with-build --manifest from create-release (or fold the grep in).
  • 🟢 publish-release-assets verifies size, not sha256 (publish-release-assets.sh:97). Add a HEAD/GET sha256 compare so a size-correct/content-wrong mirror asset fails the publish gate.
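
The deeper health check could be expressed as an aggregate verdict over several probes rather than a single frontend request. The sketch below is illustrative only: `HealthReport`, `Verdict`, and `evaluate` are hypothetical names, and the probes themselves (the RPC call, the container listing) are assumed to run elsewhere and fill in the report.

```rust
/// Results of the individual post-update probes. All field names are hypothetical.
#[derive(Debug, Clone, Default)]
pub struct HealthReport {
    /// GET / returned 2xx/3xx (the only signal checked today).
    pub frontend_ok: bool,
    /// The RPC API answered an update.status call.
    pub rpc_ok: bool,
    /// Required containers that were found not running.
    pub missing_containers: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Safe to clear the pending-verify marker.
    Healthy,
    /// Roll back; the strings are human-readable reasons for the log.
    RollBack(Vec<String>),
}

/// Collect every failed assertion so the rollback log explains itself.
pub fn evaluate(report: &HealthReport) -> Verdict {
    let mut reasons = Vec::new();
    if !report.frontend_ok {
        reasons.push("frontend probe failed".to_string());
    }
    if !report.rpc_ok {
        reasons.push("RPC update.status did not answer".to_string());
    }
    for name in &report.missing_containers {
        reasons.push(format!("required container not running: {}", name));
    }
    if reasons.is_empty() {
        Verdict::Healthy
    } else {
        Verdict::RollBack(reasons)
    }
}
```

Collecting all reasons instead of returning on the first failure costs nothing and gives the operator the full picture after an automatic rollback.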

§C — Backend robustness (🟠 stability, mostly low-effort/high-ROI)

Note: the .unwrap()/panic! worry is a non-issue — nearly all are in test modules; production request/boot paths are essentially panic-free. The real risks:

  • 🟠 Log swallowed persistence writes. ~30-40 dangerous let _ = save_*().await sites discard durability failures with zero diagnostics: server.rs:270 (mesh config), bitcoin_relay.rs:865 (relay state), update.rs:163/1223 (mirrors/update state), registry.rs:158, mesh/status.rs:286, scheduler.rs:179, install.rs:34. Convert to if let Err(e) = … { warn!(…) }; leave genuinely fire-and-forget ones commented.
  • 🟠 Remove blocking std::process::Command from async handlers. install.rs:2222 published_host_port (sync podman on the install path), dependencies.rs:316 (df), system/handlers.rs:578 (sudo), transport/fips.rs:50 (systemctl) stall tokio workers under load. Convert to tokio::process or spawn_blocking. Only 8 files use std::process::Command — bounded.
  • 🟡 Restrict Bitcoin RPC exposure. bootstrap.rs:409 writes rpcallowip=0.0.0.0/0. Scope to the container subnet / 127.0.0.1.
  • 🟡 Move generated secrets from env to file mounts. manifest.rs:1208-1226 injects secrets as -e KEY=value, readable via podman inspect / /proc/<pid>/environ. Prefer bind-mounting the existing 0600 secret file or podman --secret.
  • 🟡 Harden rate-limit IP extraction. middleware.rs:120-128 trusts client-spoofable X-Real-IP/X-Forwarded-For → per-request bucket rotation defeats the login limiter. Trust forwarded headers only from a configured proxy; have nginx set them.
  • 🟢 Include seq in the mesh signed preimage. message_types.rs:245-288 signs (t,v,ts) but sets the anti-replay seq after signing → a radio MITM can alter ordering without breaking the signature.
  • 🟢 Guard the short-DID slice panic (mesh/listener/decode.rs:566) and gate the dev-mode password123 bypass (auth.rs:18) behind #[cfg] before it can reach a release build.
  • 🟢 Apply the seccomp/apparmor profile. security/src/container_policies.rs:71 is a TODO; the profile is defined but never applied to podman.
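
For the rate-limit item, one way to stop header spoofing is to honor forwarded headers only when the TCP peer is a configured proxy. The following is a minimal sketch, assuming a single trusted reverse proxy that appends the address it saw to X-Forwarded-For (or sets X-Real-IP to a single value). `client_ip` and `trusted_proxies` are hypothetical names, not the middleware's actual API.

```rust
use std::net::IpAddr;

/// Pick the address the rate limiter should key on.
///
/// * `peer` is the address of the TCP connection itself.
/// * `forwarded` is the raw X-Forwarded-For / X-Real-IP header value, if any.
/// * `trusted_proxies` would come from configuration (hypothetical).
pub fn client_ip(peer: IpAddr, forwarded: Option<&str>, trusted_proxies: &[IpAddr]) -> IpAddr {
    // Headers sent by an untrusted peer are attacker-controlled: ignore them.
    if !trusted_proxies.contains(&peer) {
        return peer;
    }
    // With one trusted proxy appending to the list, the right-most entry is
    // the address that proxy actually observed. Earlier entries are client-supplied.
    forwarded
        .and_then(|value| value.rsplit(',').next())
        .and_then(|entry| entry.trim().parse::<IpAddr>().ok())
        .unwrap_or(peer)
}
```

Falling back to `peer` on a missing or unparsable header keeps every request in some bucket, so a malformed header cannot be used to escape the limiter.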

§D — Frontend security & performance (🟠)

The untrusted mesh/LoRa chat path is safe (interpolation, no v-html — good). The real issues are the app-bridge origin model and a bloated bundle.

  • 🟠 Validate event.origin + add consent gates in the NIP-07 nostr bridge. stores/appLauncher.ts:385-490 derives the caller from the launcher's own URL, never event.origin, and getPublicKey/nip04.decrypt/nip44.decrypt have no consent gate → any co-resident iframe can deanonymize the nostr identity or use the node as a decryption oracle while an app is open. Check event.origin against the open app's real origin; key approvals on it; gate decrypt/getPublicKey like signEvent.
  • 🟠 Origin-check the share-to-mesh handler. App.vue:450-464 acts on {type:'share-to-mesh', cid} from any sender and force-navigates to /mesh with the CID pre-staged. Add ev.origin === window.location.origin (as Chat.vue:95 already does).
  • 🟡 Decide the app-iframe isolation model. AppSessionFrame.vue:54 / AppLauncherOverlay.vue:79 embed apps same-origin with no meaningful sandbox; a same-origin app can read the CSRF cookie + localStorage. Ideal fix (serve apps from a per-app subdomain origin) is architectural — at minimum decide + document for 1.8.0.
  • 🟡 Shrink the 93 MB dist. assets/video/video-intro.mp4 is 14.7 MB (precached by the service worker → blocks PWA install), plus ~18 MB of ~1 MB full-screen JPEGs. Convert backgrounds to WebP/AVIF at responsive sizes, lazy/stream the intro video, and exclude video/audio from the Workbox precache. Biggest, easiest perf win.
  • 🟢 DOMPurify the Server.vue QR SVG (:283/:295 render v-html unsanitized while TwoFactorSection.vue sanitizes the analogous SVG); guard the unguarded pollInterval (Mesh.vue:391); surface silent data-fetch failures (curatedApps.ts:58/71).

§E — Mesh transports (🟢 mostly done — verify & polish)

Confirmed fixed in HEAD: B8 (1970 timestamps), B6 (inbound RX surfacing), the per-message transport pill, and the archy↔archy plain-TEXT-DM E2E fix. Remaining:

  • 🟠 Active Reticulum daemon-death detection. reticulum.rs:589 only warn!s on socket EOF and try_recv_frame then returns Ok(None) forever; nothing calls child.try_wait(). On an idle link a crashed daemon is invisible for up to 30 min (the RX-stall timeout). Treat socket EOF as Err → immediate respawn. (Pairs with the current fix/reticulum-daemon-pdeathsig branch work.)
  • 🟡 Persist chat history across restarts. state.messages boots empty (listener/mod.rs:283) while outbox/scheduler/peers survive — inconsistent; bubbles vanish on restart. Add mesh-messages.json mirroring the scheduler.rs/outbox.rs pattern (or explicitly accept the loss).
  • 🟡 Tighten the 30 s legacy dedup (listener/mod.rs:383-389) — it silently drops a peer legitimately sending identical text twice within 30 s.
  • 🟢 Wire the PyInstaller daemon binary into the release tarball / deploy script (Rust expects /usr/local/bin/archy-reticulum-daemon, reticulum.rs:80); add the RNode udev rule; finish ARCHY:2: announce→arch_pubkey_hex binding (reticulum.rs:119).
  • 🟢 Duty-cycle guard for LoRa TX — none exists; EU 868 is legally 1%. At minimum an airtime budget/warning.
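
A duty-cycle guard can be as small as a sliding-window airtime budget. The sketch below assumes a one-hour window, which makes the 1% limit equal to 36 s of transmit time. `AirtimeBudget` and its methods are hypothetical, and time is passed in explicitly (as an offset from an arbitrary start) so the logic stays testable.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Sliding-window airtime budget for a duty-cycle-limited band.
pub struct AirtimeBudget {
    window: Duration,
    allowed: Duration,
    /// (start time, airtime) of transmissions still inside the window.
    sent: VecDeque<(Duration, Duration)>,
}

impl AirtimeBudget {
    /// `percent` is the permitted duty cycle, e.g. 1 for 1%.
    pub fn new(window: Duration, percent: u32) -> Self {
        Self {
            window,
            allowed: window * percent / 100,
            sent: VecDeque::new(),
        }
    }

    /// Airtime consumed inside the window that ends at `now`.
    fn used(&mut self, now: Duration) -> Duration {
        let cutoff = now.saturating_sub(self.window);
        // Forget transmissions that started before the window opened.
        while matches!(self.sent.front(), Some((at, _)) if *at < cutoff) {
            self.sent.pop_front();
        }
        self.sent.iter().map(|(_, airtime)| *airtime).sum()
    }

    /// Returns true and records the transmission if it fits the budget;
    /// returns false and records nothing otherwise.
    pub fn try_send(&mut self, now: Duration, airtime: Duration) -> bool {
        if self.used(now) + airtime > self.allowed {
            return false;
        }
        self.sent.push_back((now, airtime));
        true
    }
}
```

A caller that only wants a warning rather than a hard block could log on `false` and transmit anyway; the bookkeeping is the same either way.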

§F — ISO / image build (🔴 one secret leak; otherwise 🟠 hardening)

image-recipe/_archived/build-auto-installer-iso.sh (3604 lines) is the real builder; OTA is the normal update path but the ISO is what produces installable media (latest artifact only one minor behind).

  • 🔴 Anthropic API key — INTENTIONAL for alpha/beta, hard GO-LIVE gate. build-auto-installer-iso.sh:2645 bakes a live sk-ant-… key into claude-api-proxy.service so alpha/beta testers get frictionless AI (deliberate — per user 2026-07-02). Do NOT remove for alpha/beta. Before public GA it MUST be removed + rotated + injected at runtime (a second copy also exists in a worktree). Track it here so it can't be forgotten at launch.
  • 🔴 Per-device secrets on first boot. The self-signed TLS private key is generated at build time (:426) → every device ships the same key; SSH host keys likewise not regenerated. Generate TLS + SSH host keys on first boot.
  • 🟠 Kill default credentials. archipelago/archipelago (SSH+root), web password123, and SSH PasswordAuthentication yes (:411) all ship. Lock root, force credential creation in onboarding, disable SSH password auth (or force-change on first login).
  • 🟠 Sign + checksum the ISO. Pipeline ends at xorriso with no SHA256SUMS, no GPG/minisign, no Secure Boot (BOOTX64.EFI is unsigned though grub-efi-amd64-signed is installed). Emit + sign checksums; wire signed Secure Boot.
  • 🟠 Registries over HTTPS in the image too. 146.59.87.168:3000 / git.tx1138.com are baked in with insecure=true/tls_verify:false (:216, :2308). (Ties to §A.)
  • 🟡 Add unattended-upgrades + a default-deny nftables firewall (allow 22/80/443 + mesh/WG). Neither exists today; OS packages drift until reflash and there is no host firewall.
  • 🟡 Pin the build for reproducibility. FIPS daemon is built from unpinned upstream main, Tailscale from its live apt repo, and scripts/image-versions.sh uses many :latest/stable tags (+ bitcoin-ui:1.7.84-alpha, 15 behind). Pin to commits/versions; snapshot apt. Wire ISO version to Cargo.toml so it can't drift.
  • 🟢 Harden LUKS + roadmap A/B partitioning. The LUKS data key sits in plaintext on the unencrypted root (:2137); add TPM2/passphrase binding. Longer-term: A/B (or factory-reset) partitions for safe OTA rollback, and a real install-time TUI (docs/INSTALL-SCREENS-DESIGN.md exists but the installer is headless "press Enter").

§G — Refactor & code health (🟢 not release-blocking; do after the tag or opportunistically)

  • 🟢 Manifest-drive per-app special-casing. App names are branched on across 5-7 Rust files (config.rs 36 match arms, runtime.rs 17, install.rs:275-287 dispatch, prod_orchestrator.rs:54-83 baseline/restart-sensitive lists). Move baseline, restart_sensitive, stack_members, multi_container into the manifest schema; collapse the five near-identical install_*_stack() wrappers into one generic call. Biggest maintainability win.
  • 🟢 Route all podman/systemctl through podman_client. 113 raw Command::new("podman") + 32 systemctl calls bypass the existing 952-LOC wrapper → untestable + the blocking-call risk (§C). Consolidating also unlocks unit tests for the thinly-tested package/ handlers (stacks.rs 1 test, config.rs 2, runtime.rs 3, install.rs 7).
  • 🟢 Split the god-modules. prod_orchestrator.rs (5,263 LOC) → orchestrator/{reconcile,host_ports,ownership,hooks}.rs; Mesh.vue (2,485 LOC / 241 KB chunk) → sub-components. Both are well-tested, so safe.
  • 🟢 Delete dead code. ~4,100 LOC of orphan StartOS crates (js-engine, models, helpers, container-init) not in the workspace or linked; the committed AppleDouble ._*.rs files; the committed .venv/, build/, and __pycache__ directories under the duplicate reticulum-daemon/ tree; promote MeshRadioDevice enum → trait.
  • 🟢 Resolve the Quadlet flag & dep hygiene. Decide use_quadlet_backends' fate (flip default + delete the legacy create_container branch, or freeze as experimental — don't ship both half-maintained). Consolidate the mixed hyper 0.14/1.x ecosystem; bump stale majors (reqwest, base64, thiserror, tokio-tungstenite).

§H — Testing gaps that gate confidence (🟠)

  • 🟠 Add the OTA upgrade soak (same as §B item 2) — the highest-value missing test.
  • 🟡 Add a host-reboot survival tier — every app is (untested) for reboot in TESTING.md:138; the gate can't reboot the node it runs on. Run SSH-reboot-then-reprobe out-of-band per node.
  • 🟡 Make the release gate run the full Rust suite (or hard-require a green CI sha). tests/release/run.sh:101 runs only a 6-module slice because the full 1000-test suite hangs PTYs on the dev box → 994 tests unverified at release time if CI is stale.
  • 🟡 Add --max-time to node_rpc() (tests/multinode/lib/multinode.bash) — a slow server-side RPC hangs the whole multinode suite with no feedback.
  • 🟢 De-hardcode creds/IPs in tests (tests/multinode/smoke.sh:32, remote-lifecycle.sh:136); snapshot/restore node baseline between destructive iterations (teardown currently only clears /tmp session files).

§I — Carried-over open items (from UNIFIED-TASK-TRACKER.md, still valid)

  • [~] 🟠 Multinode gate pass — 5× destructive gate was launched on node .5; bring the rest of the fleet to precondition, then run the existing (undocumented-but-present) tests/multinode/{smoke,meshtastic}.sh cross-node suites.
  • 🟠 Federation remove-node tombstone regression. federation/storage.rs:187 does let _ = tombstone_did(...) — swallows the write error, so a removed peer reappears after the next sync. (This is a specific, confirmed instance of the §C swallowed-writes class.) Needs a careful fix + smoke.sh re-verify.
  • 🟠 Phase-3 Quadlet default-flip — validated + opt-in on .228/.198; flip config.rs:256 once the .5 gate reports clean.
  • 🟠 Developer CLI suite (archy app validate/render/install/test) — gates external app publishing (APP-PACKAGING-MIGRATION-PLAN.md step 5).
  • 🟡 Version-naming decision (1.7.99-alpha → 1.8.0 vs 1.8.00-alpha) — a one-line call, then a mechanical bump + tag. Needs your decision.
  • 🟢 Bitcoin multi-version fleet OTA. Working on .228 on a branch; rollout timing is held for your call (docs/bitcoin-version-bulletproof-rollout.md).
  • 🟢 3ccc stock-Meshtastic RF validation — code fix in place; needs a live radio send.
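
The tombstone regression above comes down to not discarding a `Result`. The following sketches one possible shape of the fix with an in-memory stand-in for the federation store. `Store`, `StoreError`, `remove_node`, and `fail_writes` are hypothetical; only `tombstone_did` is named in the tracker, and its real signature may differ.

```rust
use std::collections::HashSet;
use std::fmt;

#[derive(Debug, PartialEq, Eq)]
pub struct StoreError(pub String);

impl fmt::Display for StoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl std::error::Error for StoreError {}

/// In-memory stand-in for the federation store.
/// `fail_writes` simulates a persistence failure such as a full disk.
#[derive(Default)]
pub struct Store {
    pub peers: HashSet<String>,
    pub tombstones: HashSet<String>,
    pub fail_writes: bool,
}

impl Store {
    fn tombstone_did(&mut self, did: &str) -> Result<(), StoreError> {
        if self.fail_writes {
            return Err(StoreError(format!("could not persist tombstone for {}", did)));
        }
        self.tombstones.insert(did.to_string());
        Ok(())
    }

    /// Remove a peer. The tombstone is written first and its error is
    /// propagated, so a removal that a later sync would undo is never
    /// reported as successful.
    pub fn remove_node(&mut self, did: &str) -> Result<(), StoreError> {
        self.tombstone_did(did)?; // instead of `let _ = tombstone_did(...)`
        self.peers.remove(did);
        Ok(())
    }
}
```

Writing the tombstone before dropping the peer means a failed write leaves the store unchanged, which is easier to retry than a half-applied removal.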

Suggested order of attack

  1. The critical path: §A signing ceremony → then turn on manifest/catalog/image signature enforcement (§A) + OTA HTTPS/signature + deeper health check (§B).
  2. Cheap high-ROI stability: §C swallowed-writes + blocking-calls; §D nostr-bridge + share-to-mesh origin checks; §H OTA soak + reboot tier.
  3. Image hardening: rest of §F (per-device secrets, default creds, ISO signing, firewall/unattended-upgrades, pinning).
  4. Polish, post-tag: §G refactors, §E mesh persistence/dedup, §D bundle shrink.
  5. Decisions you own (blocked on you): version name, signing mnemonic, bitcoin OTA timing, 3ccc test.
  6. Before public GA only (NOT alpha/beta): remove + rotate the Anthropic key (§F) — intentionally left in for frictionless AI during alpha/beta.

Last updated: 2026-07-02 (initial deep-audit synthesis). Update this line + tick boxes with commit shas as items land.