

Archipelago 1.8.0 — Release Hardening Plan & Tracker

The one living checklist for shipping 1.8.0. Derived from a full-system deep audit (2026-07-02): backend security, backend code-quality, frontend, mesh, tests/release pipeline, and the ISO build. Supersedes nothing — it sits above docs/UNIFIED-TASK-TRACKER.md (day-to-day) as the release exit-criteria list. Keep it updated: tick a box the moment an item lands, with the commit sha.

Definition of done for 1.8.0: the supply chain is authenticated end-to-end (§A), OTA self-update is safe and rollback-proven on real hardware (§B), no secrets ship in the image (§F), and the single-node gate stays 5/5 green through all of it. Everything else is polish that should not block the tag.

Legend: [ ] open · [~] in progress · [x] done · 🔴 critical · 🟠 high · 🟡 medium · 🟢 low/polish · blocked on you.


🎯 The single most important insight

The release signing ceremony (Workstream B) is the linchpin. The ceremony KEY was generated (user confirmed 2026-07-02), so the hard offline part is done. At audit time the outputs were not yet wired into the repo: anchor.rs:21 was still None and releases/app-catalog.json carried no signature/signed_by (its image_signature fields are literal "cosign://..." placeholders). Three mechanical steps followed, split by who can run them: (1) pin the pubkey, which needs only the public hex and can be done in-repo; (2) sign the catalog with the RELEASE_MASTER_MNEMONIC, which only the publisher can do, so the secret never touches a host; (3) implement + flip cosign enforcement on the pull path. Steps (1) and (2) landed 2026-07-02 (see §A); step (3) is still open. Until the pinned-anchor binary ships and unsigned input is hard-rejected, every "verify the signature" task below is written but not fully enforced. This is still the critical path; §A converges on it.


§A — Supply-chain authentication (🔴 THE release blocker)

Today an attacker who controls the mirror IP (or any MITM on the plaintext HTTP path) can ship an arbitrary root binary, arbitrary container images, and an arbitrary app catalog to the entire fleet, fully unattended under auto_apply. The items below are one story and must land together.

  • 🔴 Pin RELEASE_ROOT_PUBKEY_HEX + sign the catalog — DONE 2026-07-02. anchor.rs pinned to 5d15cbee…d469951 (signer did:key:z6MkkidEnEpo6qHMCNSZoNKWtvQvxq3whnaME9wGgEFhq7ur); trust tests updated (16/16 green). releases/app-catalog.json signed in place (signed_by matches, 64-byte sig); two blocking floats fixed en route (archy-btcpay-db version→string, cpu_limit 0.25→1). Ship order (backward-compatible): signed catalog goes out first (old binaries still accept it), pinned-anchor binary follows in the next build/OTA. Still ahead: (a) the pinned-anchor binary must actually be built + shipped for enforcement to be live on nodes; (b) flip "accept unsigned" → "reject unsigned" only after the whole fleet is on the pinned binary (container/app_catalog.rs:397, the Unsigned arm) — see the next item.
  • [~] 🔴 Enforce a signature on the OTA manifest before trusting it. Signature verification LANDED 2026-07-02: check_for_updates now fetches raw JSON and runs trust::verify_detached — a present-but-invalid/wrong-signer signature hard-rejects the mirror; unsigned manifests are offered for MANUAL apply only (manifest_signed surfaced in UpdateState) and auto-apply refuses them. Publisher side: create-release.sh signs the manifest inline (ceremony), publish-release-assets.sh hard-refuses to ship unsigned (grep + ceremony verify crypto gate), and scripts/sign-manifest.sh exists for re-signs. Still open: move the mirror to HTTPS + pinned cert (tracked with the next item); flip unsigned-manual-apply → hard-reject once the fleet is on a pinned-anchor binary.
  • 🔴 Implement container image signature verification (cosign). In container/src/podman_client.rs:255, pull_image(.., _signature) silently discards the signature that the manifest threads all the way down (prod_orchestrator.rs:1978/2435). Wire sigstore-rs/cosign verify (or podman pull --signature-policy); hard-fail when a declared signature doesn't verify.
  • 🟠 Move the image mirror to HTTPS; drop --tls-verify=false. podman_client.rs:641 INSECURE_REGISTRY_HOSTS = ["146.59.87.168:3000"] + config.rs:104,124 allowlist pull images over unauthenticated HTTP. Remove the raw-IP entries; give the mirror a valid/pinned cert. (Same host also baked insecurely into the ISO — see §F.)
  • 🟠 Validate every image string at the pull site, not just the RPC boundary. DONE 2026-07-03: policy extracted to container::image_policy (single source of truth; RPC-boundary check delegates to it) and BOTH orchestrator pull sites (install_fresh + ensure_resolved_source_available) hard-bail on refs that fail it. Policy accepts trusted-registry refs + registry-less Docker Hub shorthand (grafana/grafana — used by 8 manifests, can't name an attacker host); rejects any explicit non-allowlisted registry host, shell metachars, malformed refs. 4 new unit tests; container 159 / package 46 green.
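
The image-ref policy in the last item above can be sketched as a single pure function. This is a minimal illustration, not the code in container::image_policy: the names `TRUSTED_REGISTRIES`, `explicit_registry`, `has_only_ref_chars`, and `is_allowed_image` and the allowlist contents are hypothetical. It relies on Docker's reference convention that the first path component names a registry only when it contains `.` or `:` or equals `localhost`, and it assumes registry-less shorthand resolves to Docker Hub.

```rust
/// Registry hosts this sketch trusts. Hypothetical values, not the project's list.
const TRUSTED_REGISTRIES: &[&str] = &["docker.io", "ghcr.io"];

/// Returns the explicit registry host of an image reference, if it names one.
/// The first path component counts as a registry only when it contains '.'
/// or ':' or is exactly "localhost"; otherwise the ref is shorthand.
fn explicit_registry(image: &str) -> Option<&str> {
    let (first, _) = image.split_once('/')?;
    if first.contains('.') || first.contains(':') || first == "localhost" {
        Some(first)
    } else {
        None
    }
}

/// True when every character is one a well-formed reference may contain.
/// Allowlisting characters rejects shell metacharacters and whitespace.
fn has_only_ref_chars(image: &str) -> bool {
    image
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || "._-/:@".contains(c))
}

/// Accepts trusted-registry refs and registry-less shorthand; rejects
/// everything else, including malformed refs.
fn is_allowed_image(image: &str) -> bool {
    let malformed = image.is_empty()
        || image.starts_with('-') // would be parsed as a CLI option
        || image.starts_with('/')
        || image.ends_with('/')
        || image.contains("//");
    if malformed || !has_only_ref_chars(image) {
        return false;
    }
    match explicit_registry(image) {
        Some(host) => TRUSTED_REGISTRIES.contains(&host),
        None => true, // shorthand such as grafana/grafana cannot name a host
    }
}

fn main() {
    for image in [
        "grafana/grafana",
        "ghcr.io/org/app:1.0",
        "146.59.87.168:3000/x/app",
        "nginx; rm -rf /",
    ] {
        println!("{image:?} -> {}", is_allowed_image(image));
    }
}
```

Calling the same function from both the RPC boundary and each pull site keeps the policy in one place, which is the property the item describes.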

§B — OTA self-update safety (🔴 1.8.0's headline feature is untested live)

The apply path itself is well-built (resumable download, staged-complete marker, atomic swap, single-depth backup). The gaps are authenticity (§A) and verification depth — plus the fact that the upgrade path has never run end-to-end on real hardware.

  • 🔴 Deepen the post-OTA health check. DONE 2026-07-03: verify_pending_update now requires, in the same attempt, (1) frontend 2xx/3xx via nginx, (2) backend RPC liveness — unauthenticated POST /rpc/v1; 401/403 = alive, 5xx/404/refused = dead, so a 502-behind-static-files release now rolls back, (3) rootless podman ps reachability; plus a pre-loop binary-version==marker assertion that catches a silent or half swap (new frontend + old binary) deterministically. Per-app container assertions deliberately EXCLUDED — the pre-Quadlet service restart legitimately kills containers and the reconciler can need minutes (false-rollback risk); revisit after the Phase-3 flip. LND-unlock-level checks remain out of scope for the 90s window.
  • 🟠 Run one real upgrade-from-vN-1 soak on hardware before tagging. No test installs the previous version, points it at a staged 1.8.0 manifest, applies, and asserts health + rollback. This is the top release risk for an OTA release. A two-VM (or two-node) harness is enough.
  • 🟡 Guard the frontend-build-no-op in the actual release path. The ui-dist-version grep guard (tests/release/run.sh:82) is behind --with-build, which scripts/create-release.sh:90 never passes → a stale frontend can ship with a valid sha256. Call run.sh --with-build --manifest from create-release (or fold the grep in).
  • 🟢 publish-release-assets verifies size, not sha256 (publish-release-assets.sh:97). Add a HEAD/GET sha256 compare so a size-correct/content-wrong mirror asset fails the publish gate.
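
The decision logic of the deepened health check (first item above) can be sketched without any I/O. This is an illustration under assumptions, not the body of verify_pending_update: the types `Probe`, `Attempt`, and `Verdict` and the function names are hypothetical, and the probes are taken as already-collected inputs. The source only specifies 401/403 as alive and 5xx/404/refused as dead; treating every other RPC status as dead is a conservative assumption of this sketch.

```rust
/// Result of one HTTP probe.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Probe {
    /// The server answered with this HTTP status code.
    Status(u16),
    /// Connection refused or timed out.
    Unreachable,
}

/// One verification attempt: all three probes are evaluated together.
#[derive(Debug, Clone, Copy)]
struct Attempt {
    frontend: Probe,
    rpc: Probe,
    podman_ps_ok: bool,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Healthy,
    Rollback(&'static str),
}

/// Frontend via nginx: any 2xx/3xx counts.
fn frontend_ok(p: Probe) -> bool {
    matches!(p, Probe::Status(200..=399))
}

/// Unauthenticated POST to the RPC endpoint: an auth rejection proves the
/// backend is up. Anything else is treated as dead (assumption for codes
/// the plan does not list).
fn rpc_alive(p: Probe) -> bool {
    matches!(p, Probe::Status(401 | 403))
}

fn attempt_passes(a: &Attempt) -> bool {
    frontend_ok(a.frontend) && rpc_alive(a.rpc) && a.podman_ps_ok
}

/// Pre-loop version assertion first, then require one fully passing attempt.
fn verify(running: &str, marker: &str, attempts: &[Attempt]) -> Verdict {
    if running != marker {
        return Verdict::Rollback("binary version does not match marker");
    }
    if attempts.iter().any(attempt_passes) {
        Verdict::Healthy
    } else {
        Verdict::Rollback("no attempt passed all three probes")
    }
}

fn main() {
    let bad_gateway = Attempt {
        frontend: Probe::Status(200),
        rpc: Probe::Status(502),
        podman_ps_ok: true,
    };
    let refused = Attempt { rpc: Probe::Unreachable, ..bad_gateway };
    let good = Attempt { rpc: Probe::Status(401), ..bad_gateway };

    println!("{:?}", verify("1.8.0", "1.8.0", &[bad_gateway, refused]));
    println!("{:?}", verify("1.8.0", "1.8.0", &[bad_gateway, good]));
    println!("{:?}", verify("1.7.9", "1.8.0", &[good]));
}
```

Requiring all three probes inside the same attempt is what makes a 502 behind healthy static files fail: a frontend 200 from one attempt cannot be combined with an RPC 401 from another.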

§C — Backend robustness (🟠 stability, mostly low-effort/high-ROI)

Note: the .unwrap()/panic! worry is a non-issue — nearly all are in test modules; production request/boot paths are essentially panic-free. The real risks:

  • 🟠 Log swallowed persistence writes. DONE 2026-07-02 (full-workspace re-inventory found 19 production sites): 16 converted to if let Err(e) = … { warn!(…) } — mesh config (server.rs), relay tor endpoint (bitcoin_relay.rs), update mirrors/state + staging flush/sync (update.rs), registry config, radio-contact blocklist, mesh outbox sweep (scheduler.rs), block-header cache (mesh/mod.rs), 7× peer-transport badge (sync.rs + content.rs). Federation tombstone/untombstone upgraded to hard errors (see §I). Install-log line write left fire-and-forget with an explanatory comment.
  • 🟠 Remove blocking std::process::Command from async handlers. DONE 2026-07-03. Converted to tokio::process: published_host_port (install), detect_disk_gb (dependencies), factory-reset restart (system/handlers), config.rs detect_host_ip, the orchestrator host-facts helpers (detect_host_ip/mdns/disk_gb, bitcoin_host, resolve_dynamic_env now async through all 6 call sites), and AutoRuntime::new probes. transport/fips.rs is_available() (sync trait method on the async route path) now serves the cached value and refreshes via a background thread (stale-while-revalidate) instead of blocking on systemctl. image_verifier.rs cosign sites have no callers yet; they are handled with the §A cosign item. Tests: container 155 / transport 29 / config 29 / package 46 all green.
  • 🟡 Restrict Bitcoin RPC exposure. bootstrap.rs:409 writes rpcallowip=0.0.0.0/0. Scope to the container subnet / 127.0.0.1.
  • 🟡 Move generated secrets from env to file mounts. manifest.rs:1208-1226 injects secrets as -e KEY=value, readable via podman inspect / /proc/<pid>/environ. Prefer bind-mounting the existing 0600 secret file or podman --secret.
  • 🟡 Harden rate-limit IP extraction. middleware.rs:120-128 trusts client-spoofable X-Real-IP/X-Forwarded-For → per-request bucket rotation defeats the login limiter. Trust forwarded headers only from a configured proxy; have nginx set them.
  • 🟢 Include seq in the mesh signed preimage. message_types.rs:245-288 signs (t,v,ts) but sets the anti-replay seq after signing → a radio MITM can alter ordering without breaking the signature.
  • 🟢 Guard the short-DID slice panic (mesh/listener/decode.rs:566) and gate the dev-mode password123 bypass (auth.rs:18) behind #[cfg] before it can reach a release build.
  • 🟢 Apply the seccomp/apparmor profile. security/src/container_policies.rs:71 is a TODO; the profile is defined but never applied to podman.
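
For the rate-limit item above, the fix "trust forwarded headers only from a configured proxy" can be sketched as follows. The function name `client_ip`, its parameters, and the idea of passing the header value in as a plain string are assumptions for illustration; the real middleware.rs types are not shown in this plan.

```rust
use std::net::IpAddr;

/// Picks the address to key the rate limiter on.
///
/// `peer` is the socket peer address, which a client cannot spoof.
/// `x_real_ip` is the raw header value, if any. It is honoured only when
/// the peer is one of the configured proxies (hypothetical config value).
fn client_ip(peer: IpAddr, x_real_ip: Option<&str>, trusted_proxies: &[IpAddr]) -> IpAddr {
    if !trusted_proxies.contains(&peer) {
        // Direct connection: any forwarded header is client-controlled.
        return peer;
    }
    x_real_ip
        .and_then(|value| value.trim().parse::<IpAddr>().ok())
        // A missing or malformed header falls back to the proxy address.
        .unwrap_or(peer)
}

fn main() {
    let proxies = ["127.0.0.1".parse::<IpAddr>().unwrap()];
    let attacker: IpAddr = "203.0.113.7".parse().unwrap();

    // A direct client rotating X-Real-IP stays in the same bucket.
    println!("{}", client_ip(attacker, Some("10.1.2.3"), &proxies));
    // The same header arriving via the local proxy is trusted.
    println!("{}", client_ip(proxies[0], Some("198.51.100.9"), &proxies));
}
```

With this shape, rotating the header per request no longer rotates the limiter bucket unless the request actually arrived through the proxy, which is expected to overwrite the header itself.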

§D — Frontend security & performance (🟠)

The untrusted mesh/LoRa chat path is safe (interpolation, no v-html — good). The real issues are the app-bridge origin model and a bloated bundle.

  • 🟠 Validate event.origin + add consent gates in the NIP-07 nostr bridge. DONE 2026-07-02: handleNostrRequest rejects senders whose event.origin doesn't match the open app's URL origin, and ALL identity-sensitive methods (getPublicKey, signEvent, nip04/nip44 encrypt+decrypt) now go through the consent/approved-origins gate, not just signEvent. Verified present in the built bundle.
  • 🟠 Origin-check the share-to-mesh handler. DONE 2026-07-02: App.vue onShareToMeshMessage now requires ev.origin === window.location.origin (matching Chat.vue).
  • 🟡 Decide the app-iframe isolation model. AppSessionFrame.vue:54 / AppLauncherOverlay.vue:79 embed apps same-origin with no meaningful sandbox; a same-origin app can read the CSRF cookie + localStorage. Ideal fix (serve apps from a per-app subdomain origin) is architectural — at minimum decide + document for 1.8.0.
  • 🟡 Shrink the 93 MB dist. assets/video/video-intro.mp4 is 14.7 MB (precached by the service worker → blocks PWA install), plus ~18 MB of ~1 MB full-screen JPEGs. Convert backgrounds to WebP/AVIF at responsive sizes, lazy/stream the intro video, and exclude video/audio from the Workbox precache. Biggest, easiest perf win.
  • 🟢 DOMPurify the Server.vue QR SVG / guard Mesh.vue pollInterval / surface curatedApps.ts fetch failures. DONE 2026-07-03: WireGuard peer QR now sanitized with the same USE_PROFILES: {svg} call as TwoFactorSection; Mesh poll interval guarded + nulled on unmount; catalog fetch failures log per-URL console.warn incl. the all-sources-failed fallback. Bundle-verified.

§E — Mesh transports (🟢 mostly done — verify & polish)

Confirmed fixed in HEAD: B8 (1970 timestamps), B6 (inbound RX surfacing), the per-message transport pill, and the archy↔archy plain-TEXT-DM E2E fix. Remaining:

  • 🟠 Active Reticulum daemon-death detection. reticulum.rs:589 only warn!s on socket EOF and try_recv_frame then returns Ok(None) forever; nothing calls child.try_wait(). On an idle link a crashed daemon is invisible for up to 30 min (the RX-stall timeout). Treat socket EOF as Err → immediate respawn. (Pairs with the current fix/reticulum-daemon-pdeathsig branch work.)
  • 🟡 Persist chat history across restarts. state.messages boots empty (listener/mod.rs:283) while outbox/scheduler/peers survive — inconsistent; bubbles vanish on restart. Add mesh-messages.json mirroring the scheduler.rs/outbox.rs pattern (or explicitly accept the loss).
  • 🟡 Tighten the 30 s legacy dedup (listener/mod.rs:383-389) — it silently drops a peer legitimately sending identical text twice within 30 s.
  • 🟢 Wire the PyInstaller daemon binary into the release tarball / deploy script (Rust expects /usr/local/bin/archy-reticulum-daemon, reticulum.rs:80); add the RNode udev rule; finish ARCHY:2: announce→arch_pubkey_hex binding (reticulum.rs:119).
  • 🟢 Duty-cycle guard for LoRa TX — none exists; EU 868 is legally 1%. At minimum an airtime budget/warning.
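
One possible shape for the duty-cycle guard in the last item above is a sliding-window airtime budget. This is a sketch under assumptions: `AirtimeBudget` and its methods are hypothetical names, the one-hour window is an illustrative choice, and the per-packet airtime is passed in by the caller rather than derived from LoRa modulation parameters.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Sliding-window airtime budget. Timestamps are expressed as time elapsed
/// since an arbitrary start, so the logic stays independent of the clock.
struct AirtimeBudget {
    window: Duration,
    limit: Duration,
    /// (send time, airtime) for transmissions still inside the window.
    sent: VecDeque<(Duration, Duration)>,
}

impl AirtimeBudget {
    /// `duty_cycle_percent = 1` with a one-hour window allows 36 s of airtime.
    fn new(window: Duration, duty_cycle_percent: u32) -> Self {
        Self {
            window,
            limit: window * duty_cycle_percent / 100,
            sent: VecDeque::new(),
        }
    }

    /// Drops expired entries and returns the airtime used inside the window.
    fn used(&mut self, now: Duration) -> Duration {
        while let Some(&(at, _)) = self.sent.front() {
            if now.saturating_sub(at) >= self.window {
                self.sent.pop_front();
            } else {
                break;
            }
        }
        self.sent.iter().map(|&(_, airtime)| airtime).sum()
    }

    /// Records the transmission and returns true only if it fits the budget.
    fn try_send(&mut self, now: Duration, airtime: Duration) -> bool {
        if self.used(now) + airtime > self.limit {
            return false;
        }
        self.sent.push_back((now, airtime));
        true
    }
}

fn main() {
    let secs = Duration::from_secs;
    let mut budget = AirtimeBudget::new(secs(3600), 1);

    println!("{}", budget.try_send(secs(0), secs(20))); // fits: 20 s of 36 s
    println!("{}", budget.try_send(secs(10), secs(20))); // refused: would be 40 s
    println!("{}", budget.try_send(secs(3600), secs(20))); // first send has expired
}
```

A caller that only wants the "warning" variant mentioned in the item could log when `try_send` returns false and transmit anyway; the bookkeeping is the same.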

§F — ISO / image build (🔴 one secret leak; otherwise 🟠 hardening)

image-recipe/_archived/build-auto-installer-iso.sh (3604 lines) is the real builder; OTA is the normal update path but the ISO is what produces installable media (latest artifact only one minor behind).

  • 🔴 Anthropic API key — INTENTIONAL for alpha/beta, hard GO-LIVE gate. build-auto-installer-iso.sh:2645 bakes a live sk-ant-… key into claude-api-proxy.service so alpha/beta testers get frictionless AI (deliberate — per user 2026-07-02). Do NOT remove for alpha/beta. Before public GA it MUST be removed + rotated + injected at runtime (a second copy also exists in a worktree). Track it here so it can't be forgotten at launch.
  • 🔴 Per-device secrets on first boot. The self-signed TLS private key is generated at build time (:426) → every device ships the same key; SSH host keys likewise not regenerated. Generate TLS + SSH host keys on first boot.
  • 🟠 Kill default credentials. archipelago/archipelago (SSH+root), web password123, and SSH PasswordAuthentication yes (:411) all ship. Lock root, force credential creation in onboarding, disable SSH password auth (or force-change on first login).
  • 🟠 Sign + checksum the ISO. Pipeline ends at xorriso with no SHA256SUMS, no GPG/minisign, no Secure Boot (BOOTX64.EFI is unsigned though grub-efi-amd64-signed is installed). Emit + sign checksums; wire signed Secure Boot.
  • 🟠 Registries over HTTPS in the image too. 146.59.87.168:3000 / git.tx1138.com are baked insecure=true/tls_verify:false (:216, :2308). (Ties to §A.)
  • 🟡 Add unattended-upgrades + a default-deny nftables firewall (allow 22/80/443 + mesh/WG). Neither exists today; OS packages drift until reflash and there is no host firewall.
  • 🟡 Pin the build for reproducibility. FIPS daemon is built from unpinned upstream main, Tailscale from its live apt repo, and scripts/image-versions.sh uses many :latest/stable tags (+ bitcoin-ui:1.7.84-alpha, 15 behind). Pin to commits/versions; snapshot apt. Wire ISO version to Cargo.toml so it can't drift.
  • 🟢 Harden LUKS + roadmap A/B partitioning. The LUKS data key sits in plaintext on the unencrypted root (:2137); add TPM2/passphrase binding. Longer-term: A/B (or factory-reset) partitions for safe OTA rollback, and a real install-time TUI (docs/INSTALL-SCREENS-DESIGN.md exists but the installer is headless "press Enter").

§G — Refactor & code health (🟢 not release-blocking; do after the tag or opportunistically)

  • 🟢 Manifest-drive per-app special-casing. App names are branched on across 5-7 Rust files (config.rs 36 match arms, runtime.rs 17, install.rs:275-287 dispatch, prod_orchestrator.rs:54-83 baseline/restart-sensitive lists). Move baseline, restart_sensitive, stack_members, multi_container into the manifest schema; collapse the five near-identical install_*_stack() wrappers into one generic call. Biggest maintainability win.
  • 🟢 Route all podman/systemctl through podman_client. 113 raw Command::new("podman") + 32 systemctl calls bypass the existing 952-LOC wrapper → untestable + the blocking-call risk (§C). Consolidating also unlocks unit tests for the thinly-tested package/ handlers (stacks.rs 1 test, config.rs 2, runtime.rs 3, install.rs 7).
  • 🟢 Split the god-modules. prod_orchestrator.rs (5,263 LOC) → orchestrator/{reconcile,host_ports,ownership,hooks}.rs; Mesh.vue (2,485 LOC / 241 KB chunk) → sub-components. Both are well-tested, so safe.
  • 🟢 Delete dead code. ~4,100 LOC of orphan StartOS crates (js-engine, models, helpers, container-init) not in the workspace or linked; the committed AppleDouble ._*.rs files; the committed .venv/, build/, and __pycache__ directories under the duplicate reticulum-daemon/ tree; promote MeshRadioDevice enum → trait.
  • 🟢 Resolve the Quadlet flag & dep hygiene. Decide use_quadlet_backends' fate (flip default + delete the legacy create_container branch, or freeze as experimental — don't ship both half-maintained). Consolidate the mixed hyper 0.14/1.x ecosystem; bump stale majors (reqwest, base64, thiserror, tokio-tungstenite).

§H — Testing gaps that gate confidence (🟠)

  • 🟠 Add the OTA upgrade soak (same as §B item 2) — the highest-value missing test.
  • 🟡 Add a host-reboot survival tier. Every app is marked (untested) for reboot in TESTING.md:138; the gate can't reboot the node it runs on. Run SSH-reboot-then-reprobe out-of-band per node.
  • 🟡 Make the release gate run the full Rust suite (or hard-require a green CI sha). tests/release/run.sh:101 runs only a 6-module slice because the full 1000-test suite hangs PTYs on the dev box → 994 tests unverified at release time if CI is stale.
  • 🟡 Add --max-time to node_rpc() (tests/multinode/lib/multinode.bash) — a slow server-side RPC hangs the whole multinode suite with no feedback.
  • 🟢 De-hardcode creds/IPs in tests (tests/multinode/smoke.sh:32, remote-lifecycle.sh:136); snapshot/restore node baseline between destructive iterations (teardown currently only clears /tmp session files).

§I — Carried-over open items (from UNIFIED-TASK-TRACKER.md, still valid)

  • [~] 🟠 Multinode gate pass — 5× destructive gate was launched on node .5; bring the rest of the fleet to precondition, then run the existing (undocumented-but-present) tests/multinode/{smoke,meshtastic}.sh cross-node suites.
  • [~] 🟠 Federation remove-node tombstone regression. Code fix DONE 2026-07-02: remove_node now tombstones BEFORE trimming the node list and propagates the write error (idempotent, so retries are clean); add_node's untombstone likewise propagates before mutating. Still open: tests/multinode/smoke.sh re-verify on real nodes.
  • 🟠 Phase-3 Quadlet default-flip — validated + opt-in on .228/.198; flip config.rs:256 once the .5 gate reports clean.
  • 🟠 Developer CLI suite (archy app validate/render/install/test) — gates external app publishing (APP-PACKAGING-MIGRATION-PLAN.md step 5).
  • 🟡 Version-naming decision (1.7.99-alpha1.8.0 vs 1.8.00-alpha) — a one-line call, then a mechanical bump + tag. Needs your decision.
  • 🟢 Bitcoin multi-version fleet OTA: .228 working on branch; rollout timing is held for your call (docs/bitcoin-version-bulletproof-rollout.md).
  • 🟢 3ccc stock-Meshtastic RF validation — code fix in place; needs a live radio send.
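
The ordering described in the federation tombstone item above (persist the tombstone first, propagate the write error, only then trim the node list) can be illustrated with a small sketch. The `Federation` struct, the `persist` callback, and the field names are hypothetical stand-ins; the real remove_node signature is not shown in this plan.

```rust
use std::collections::BTreeSet;
use std::io;

/// Hypothetical in-memory federation state.
struct Federation {
    nodes: Vec<String>,
    tombstones: BTreeSet<String>,
}

/// Removes a node. The tombstone set is persisted BEFORE any in-memory
/// mutation, so a failed write leaves the state untouched and the caller
/// can retry. Inserting into a set makes retries idempotent.
fn remove_node(
    fed: &mut Federation,
    did: &str,
    mut persist: impl FnMut(&BTreeSet<String>) -> io::Result<()>,
) -> io::Result<()> {
    let mut next = fed.tombstones.clone();
    next.insert(did.to_string());
    persist(&next)?; // propagate the write error instead of swallowing it
    fed.tombstones = next;
    fed.nodes.retain(|node| node != did);
    Ok(())
}

fn main() {
    let mut fed = Federation {
        nodes: vec!["did:a".to_string(), "did:b".to_string()],
        tombstones: BTreeSet::new(),
    };

    // Simulated disk failure: nothing changes.
    let failed = remove_node(&mut fed, "did:a", |_| {
        Err(io::Error::new(io::ErrorKind::Other, "disk full"))
    });
    println!("failed write: {:?}, nodes left: {}", failed.is_err(), fed.nodes.len());

    // The retry succeeds and the node is tombstoned and trimmed.
    remove_node(&mut fed, "did:a", |_| Ok(())).unwrap();
    println!("nodes: {:?}, tombstones: {:?}", fed.nodes, fed.tombstones);
}
```

If the trim happened first and the write then failed, the node would be gone from the list with no durable tombstone, which is the kind of state a later sync could undo.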

Suggested order of attack

  1. The critical path: §A signing ceremony → then turn on manifest/catalog/image signature enforcement (§A) + OTA HTTPS/signature + deeper health check (§B).
  2. Cheap high-ROI stability: §C swallowed-writes + blocking-calls; §D nostr-bridge + share-to-mesh origin checks; §H OTA soak + reboot tier.
  3. Image hardening: rest of §F (per-device secrets, default creds, ISO signing, firewall/unattended-upgrades, pinning).
  4. Polish, post-tag: §G refactors, §E mesh persistence/dedup, §D bundle shrink.
  5. Decisions you own: version name, signing mnemonic, bitcoin OTA timing, 3ccc test.
  6. Before public GA only (NOT alpha/beta): remove + rotate the Anthropic key (§F) — intentionally left in for frictionless AI during alpha/beta.

Last updated: 2026-07-02 PM (hardening session 1: §A anchor+catalog 1977bdef, §A manifest signature 51647b21, §D origin checks 206d5fe8, §C swallowed writes + §I tombstone fix — this commit). Update this line + tick boxes with commit shas as items land.