# Archipelago 1.8.0 — Release Hardening Plan & Tracker
> **The one living checklist for shipping 1.8.0.** Derived from a full-system deep
> audit (2026-07-02): backend security, backend code-quality, frontend, mesh,
> tests/release pipeline, and the ISO build. Supersedes nothing — it *sits above*
> `docs/UNIFIED-TASK-TRACKER.md` (day-to-day) as the release exit-criteria list.
> **Keep it updated: tick a box the moment an item lands, with the commit sha.**

**Definition of done for 1.8.0:** the supply chain is authenticated end-to-end
(§A), OTA self-update is safe and rollback-proven on real hardware (§B), no
secrets ship in the image (§F), and the single-node gate stays 5/5 green through
all of it. Everything else is polish that should not block the tag.

**Legend:** `[ ]` open · `[~]` in progress · `[x]` done · 🔴 critical · 🟠 high ·
🟡 medium · 🟢 low/polish · ⛔ blocked on you.

---
## 🎯 The single most important insight
The **release signing ceremony (Workstream B) is the linchpin.** ✅ The ceremony
KEY was generated (user confirmed 2026-07-02) — the hard offline part is done. But
the outputs are **not yet wired into the repo**: `anchor.rs:21` is still `None` and
`releases/app-catalog.json` carries no `signature`/`signed_by` (its `image_signature`
fields are literal `"cosign://..."` placeholders). Three mechanical steps remain,
split by who can run them: **(1)** pin the pubkey — needs only the *public* hex, can
be done in-repo now; **(2)** sign the catalog with the `RELEASE_MASTER_MNEMONIC`
(publisher only; the secret never touches a host); **(3)** implement + flip cosign
enforcement on the pull path. Until (1)+(2) land, every "verify the signature" task
below is written but not enforced. **This is still the critical path; §A converges on it.**

---
## §A — Supply-chain authentication (🔴 THE release blocker)
Today an attacker who controls the mirror IP (or any MITM on the plaintext HTTP
path) can ship an arbitrary root binary, arbitrary container images, and an
arbitrary app catalog to the entire fleet — fully unattended under
`auto_apply`. The items below are one story and must land together.
- [x] 🔴 **Pin `RELEASE_ROOT_PUBKEY_HEX` + sign the catalog** — DONE 2026-07-02.
`anchor.rs` pinned to `5d15cbee…d469951` (signer
`did:key:z6MkkidEnEpo6qHMCNSZoNKWtvQvxq3whnaME9wGgEFhq7ur`); trust tests updated (16/16
green). `releases/app-catalog.json` signed in place (`signed_by` matches, 64-byte sig);
two blocking floats fixed en route (`archy-btcpay-db` version→string, `cpu_limit` 0.25→1).
Ship order (backward-compatible): signed catalog goes out first (old binaries still accept
it), pinned-anchor binary follows in the next build/OTA. **Still ahead:** (a) the
pinned-anchor binary must actually be built + shipped for enforcement to be live on nodes;
(b) flip "accept unsigned" → "reject unsigned" only after the whole fleet is on the pinned
binary (`container/app_catalog.rs:397`, the `Unsigned` arm) — see the next item.
- [ ] 🔴 **Enforce a signature on the OTA manifest before trusting it.**
`update.rs:68` fetches `http://146.59.87.168:3000/.../manifest.json` over cleartext
and parses/trusts it with no `trust::verify_detached` call; component sha256/blake3
are only checked against that same unauthenticated manifest → remote root RCE.
Move to HTTPS + pinned cert, require an Ed25519 release-root signature, and
**refuse `auto_apply` until the anchor is pinned.**
- [ ] 🔴 **Implement container image signature verification (cosign).**
`container/src/podman_client.rs:255` `pull_image(.., _signature)` silently discards
the signature that the manifest threads all the way down
(`prod_orchestrator.rs:1978/2435`). Wire `sigstore-rs`/`cosign verify` (or
`podman pull --signature-policy`); hard-fail when a declared signature doesn't verify.
- [ ] 🟠 **Move the image mirror to HTTPS; drop `--tls-verify=false`.**
`podman_client.rs:641` `INSECURE_REGISTRY_HOSTS = ["146.59.87.168:3000"]` +
`config.rs:104,124` allowlist pull images over unauthenticated HTTP. Remove the raw-IP
entries; give the mirror a valid/pinned cert. (Same host also baked insecurely into
the ISO — see §F.)
- [ ] 🟠 **Validate every image string at the pull site, not just the RPC boundary.**
`is_valid_docker_image` runs in `install.rs:224`/`runtime.rs:549` but
`prod_orchestrator::install_fresh` (1978) and `resolve_catalog_image` (944-971) pass
catalog/manifest images straight to `pull_image`. Call the validator right before
every pull.
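
A minimal sketch of the last item above (validate at the pull site), under stated assumptions: `is_plausible_image_ref`, `checked_pull`, and `PullError` are hypothetical names for illustration, not the project's `is_valid_docker_image` or `pull_image`, and the character rules are a simplification. The only point is that the check wraps the pull itself, so no caller can reach the backend with an unvalidated string. A call such as `checked_pull("--privileged", |_| Ok(()))` would return `Err(PullError::InvalidImage(..))` without invoking the closure.

```rust
/// Hypothetical stand-in for the project's validator; the real rules live in
/// the codebase and may be stricter.
fn is_plausible_image_ref(image: &str) -> bool {
    if image.is_empty() || image.len() > 255 {
        return false;
    }
    // Reject a leading '-' so the string can never be parsed as a CLI flag.
    if image.starts_with('-') {
        return false;
    }
    // Allow only characters that appear in registry/repo:tag@digest references.
    image
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || matches!(c, '.' | '-' | '_' | '/' | ':' | '@'))
}

#[derive(Debug, PartialEq)]
enum PullError {
    /// The image string failed validation; the backend was never called.
    InvalidImage(String),
    /// The backend pull itself failed.
    Backend(String),
}

/// Validate at the pull site, then delegate to the supplied pull function.
/// `pull` stands in for whatever actually talks to podman.
fn checked_pull<F>(image: &str, pull: F) -> Result<(), PullError>
where
    F: FnOnce(&str) -> Result<(), String>,
{
    if !is_plausible_image_ref(image) {
        return Err(PullError::InvalidImage(image.to_string()));
    }
    pull(image).map_err(PullError::Backend)
}
```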
---
## §B — OTA self-update safety (🔴 1.8.0's headline feature is untested live)
The apply path itself is well-built (resumable download, staged-complete marker,
atomic swap, single-depth backup). The gaps are **authenticity** (§A) and
**verification depth** — plus the fact that the upgrade path has never run
end-to-end on real hardware.
- [ ] 🔴 **Deepen the post-OTA health check.** `update.rs:456` (`probe_frontend_once`)
passes on any 2xx/3xx from `GET /`, and `verify_pending_update` (494-593) only rolls
back on that. A release with a broken RPC API, dead containers, or failed LND unlock
passes and never rolls back. Add `/rpc/v1 update.status` + container-list/required-stack
health assertions before clearing the pending-verify marker.
- [ ] 🟠 **Run one real upgrade-from-vN-1 soak on hardware before tagging.**
No test installs the previous version, points it at a staged 1.8.0 manifest, applies,
and asserts health + rollback. This is the top release risk for an OTA release. A
two-VM (or two-node) harness is enough.
- [ ] 🟡 **Guard the frontend-build-no-op in the *actual* release path.** The
`ui-dist-version` grep guard (`tests/release/run.sh:82`) is behind `--with-build`, which
`scripts/create-release.sh:90` never passes → a stale frontend can ship with a valid
sha256. Call `run.sh --with-build --manifest` from create-release (or fold the grep in).
- [ ] 🟢 **publish-release-assets verifies size, not sha256** (`publish-release-assets.sh:97`).
Add a HEAD/GET sha256 compare so a size-correct/content-wrong mirror asset fails the
publish gate.
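
The deeper health check in the first item above could be shaped roughly like this. It is a sketch, not the code in `update.rs`: `HealthReport`, `Verdict`, and `verdict` are assumed names, and the three probes are placeholders for whatever the real checks turn out to be. The idea is that the pending-verify marker is cleared only when every probe passes, and a rollback carries the list of failed probes for the log.

```rust
/// Results of the individual post-update probes (all names are illustrative).
#[derive(Debug, Clone, Copy)]
struct HealthReport {
    /// `GET /` answered with 2xx/3xx (the only check performed today).
    frontend_ok: bool,
    /// An RPC call such as `update.status` answered successfully.
    rpc_ok: bool,
    /// The required container stack is up.
    required_stack_ok: bool,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    /// All probes passed: safe to clear the pending-verify marker.
    Commit,
    /// At least one probe failed: roll back, reporting which ones.
    Rollback(Vec<&'static str>),
}

/// Combine the probes; a single failure is enough to trigger rollback.
fn verdict(report: HealthReport) -> Verdict {
    let checks = [
        ("frontend", report.frontend_ok),
        ("rpc", report.rpc_ok),
        ("required-stack", report.required_stack_ok),
    ];
    let failed: Vec<&'static str> = checks
        .iter()
        .filter(|(_, ok)| !*ok)
        .map(|(name, _)| *name)
        .collect();
    if failed.is_empty() {
        Verdict::Commit
    } else {
        Verdict::Rollback(failed)
    }
}
```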
---
## §C — Backend robustness (🟠 stability, mostly low-effort/high-ROI)
Note: the `.unwrap()`/`panic!` worry is a **non-issue** — nearly all are in test
modules; production request/boot paths are essentially panic-free. The real risks:
- [ ] 🟠 **Log swallowed persistence writes.** ~30-40 dangerous `let _ = save_*().await`
sites discard durability failures with zero diagnostics: `server.rs:270` (mesh config),
`bitcoin_relay.rs:865` (relay state), `update.rs:163/1223` (mirrors/update state),
`registry.rs:158`, `mesh/status.rs:286`, `scheduler.rs:179`, `install.rs:34`. Convert to
`if let Err(e) = … { warn!(…) }`; leave genuinely fire-and-forget ones commented.
- [ ] 🟠 **Remove blocking `std::process::Command` from async handlers.**
`install.rs:2222` `published_host_port` (sync podman on the install path),
`dependencies.rs:316` (`df`), `system/handlers.rs:578` (`sudo`), `transport/fips.rs:50`
(`systemctl`) stall tokio workers under load. Convert to `tokio::process` or
`spawn_blocking`. Only 8 files use `std::process::Command` — bounded.
- [ ] 🟡 **Restrict Bitcoin RPC exposure.** `bootstrap.rs:409` writes
`rpcallowip=0.0.0.0/0`. Scope to the container subnet / `127.0.0.1`.
- [ ] 🟡 **Move generated secrets from env to file mounts.** `manifest.rs:1208-1226`
injects secrets as `-e KEY=value`, readable via `podman inspect` / `/proc/<pid>/environ`.
Prefer bind-mounting the existing `0600` secret file or `podman --secret`.
- [ ] 🟡 **Harden rate-limit IP extraction.** `middleware.rs:120-128` trusts
client-spoofable `X-Real-IP`/`X-Forwarded-For` → per-request bucket rotation defeats the
login limiter. Trust forwarded headers only from a configured proxy; have nginx set them.
- [ ] 🟢 **Include `seq` in the mesh signed preimage.** `message_types.rs:245-288` signs
`(t,v,ts)` but sets the anti-replay `seq` after signing → a radio MITM can alter ordering
without breaking the signature.
- [ ] 🟢 **Guard the short-DID slice panic** (`mesh/listener/decode.rs:566`) and gate the
dev-mode `password123` bypass (`auth.rs:18`) behind `#[cfg]` before it can reach a
release build.
- [ ] 🟢 **Apply the seccomp/apparmor profile.** `security/src/container_policies.rs:71` is a
TODO; the profile is defined but never applied to podman.
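
For the rate-limit item above, one possible shape of "trust forwarded headers only from a configured proxy" is sketched here. `rate_limit_key` and `trusted_proxies` are hypothetical names, and the sketch assumes a single proxy hop that appends the address it saw to `X-Forwarded-For`, so the right-most entry is the one the proxy vouches for. A direct client that sends its own header is keyed on its TCP peer address, which it cannot rotate per request.

```rust
use std::net::IpAddr;

/// Pick the address to rate-limit on.
/// `peer` is the TCP peer address of the connection.
/// `forwarded` is the raw `X-Forwarded-For` header value, if present.
/// `trusted_proxies` is an assumed configuration list of proxy addresses.
fn rate_limit_key(peer: IpAddr, forwarded: Option<&str>, trusted_proxies: &[IpAddr]) -> IpAddr {
    // Headers from anyone but a configured proxy are client-controlled: ignore them.
    if !trusted_proxies.contains(&peer) {
        return peer;
    }
    // Take the right-most entry (the one added by the trusted proxy).
    // Fall back to the peer address if the header is missing or unparsable.
    forwarded
        .and_then(|value| value.rsplit(',').next())
        .and_then(|entry| entry.trim().parse::<IpAddr>().ok())
        .unwrap_or(peer)
}
```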
---
## §D — Frontend security & performance (🟠)
The untrusted mesh/LoRa chat path is **safe** (interpolation, no `v-html` — good).
The real issues are the app-bridge origin model and a bloated bundle.
- [ ] 🟠 **Validate `event.origin` + add consent gates in the NIP-07 nostr bridge.**
`stores/appLauncher.ts:385-490` derives the caller from the launcher's own URL, never
`event.origin`, and `getPublicKey`/`nip04.decrypt`/`nip44.decrypt` have no consent gate →
any co-resident iframe can deanonymize the nostr identity or use the node as a decryption
oracle while an app is open. Check `event.origin` against the open app's real origin; key
approvals on it; gate decrypt/getPublicKey like `signEvent`.
- [ ] 🟠 **Origin-check the `share-to-mesh` handler.** `App.vue:450-464` acts on
`{type:'share-to-mesh', cid}` from any sender and force-navigates to `/mesh` with the CID
pre-staged. Add `ev.origin === window.location.origin` (as `Chat.vue:95` already does).
- [ ] 🟡 **Decide the app-iframe isolation model.** `AppSessionFrame.vue:54` /
`AppLauncherOverlay.vue:79` embed apps same-origin with no meaningful `sandbox`; a
same-origin app can read the CSRF cookie + `localStorage`. Ideal fix (serve apps from a
per-app subdomain origin) is architectural — at minimum decide + document for 1.8.0.
- [ ] 🟡 **Shrink the 93 MB dist.** `assets/video/video-intro.mp4` is **14.7 MB**
(precached by the service worker → blocks PWA install), plus ~18 MB of ~1 MB full-screen
JPEGs. Convert backgrounds to WebP/AVIF at responsive sizes, lazy/stream the intro video,
and exclude video/audio from the Workbox precache. Biggest, easiest perf win.
- [ ] 🟢 **DOMPurify the `Server.vue` QR SVG** (`:283/:295` render `v-html` unsanitized while
`TwoFactorSection.vue` sanitizes the analogous SVG); guard the unguarded `pollInterval`
(`Mesh.vue:391`); surface silent data-fetch failures (`curatedApps.ts:58/71`).
---
## §E — Mesh transports (🟢 mostly done — verify & polish)
Confirmed **fixed in HEAD:** B8 (1970 timestamps), B6 (inbound RX surfacing), the
per-message transport pill, and the archy↔archy plain-TEXT-DM E2E fix. Remaining:
- [ ] 🟠 **Active Reticulum daemon-death detection.** `reticulum.rs:589` only `warn!`s on
socket EOF and `try_recv_frame` then returns `Ok(None)` forever; nothing calls
`child.try_wait()`. On an idle link a crashed daemon is invisible for up to 30 min (the
RX-stall timeout). Treat socket EOF as `Err` → immediate respawn. (Pairs with the current
`fix/reticulum-daemon-pdeathsig` branch work.)
- [ ] 🟡 **Persist chat history across restarts.** `state.messages` boots empty
(`listener/mod.rs:283`) while outbox/scheduler/peers survive — inconsistent; bubbles
vanish on restart. Add `mesh-messages.json` mirroring the `scheduler.rs`/`outbox.rs`
pattern (or explicitly accept the loss).
- [ ] 🟡 **Tighten the 30 s legacy dedup** (`listener/mod.rs:383-389`) — it silently drops a
peer legitimately sending identical text twice within 30 s.
- [ ] 🟢 **Wire the PyInstaller daemon binary into the release tarball / deploy script**
(Rust expects `/usr/local/bin/archy-reticulum-daemon`, `reticulum.rs:80`); add the RNode
udev rule; finish `ARCHY:2:` announce→`arch_pubkey_hex` binding (`reticulum.rs:119`).
- [ ] 🟢 **Duty-cycle guard for LoRa TX.** None exists, yet the EU 868 MHz band is legally
limited to a 1% duty cycle on its common sub-bands. At minimum add an airtime budget/warning.
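
The duty-cycle item above amounts to a sliding-window airtime budget: with a 1% limit over one hour, at most 36 s of transmission is allowed in any 3600 s window. A minimal sketch follows, with assumptions stated: `AirtimeBudget` is a hypothetical type, the one-hour window is an illustrative choice, each transmission is attributed to its start time, and timestamps are passed in as offsets so the logic can be tested without a real clock. Computing the per-frame airtime from the LoRa settings is out of scope here.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Sliding-window airtime budget (illustrative, not an existing project type).
struct AirtimeBudget {
    window: Duration,
    allowed: Duration,
    /// Past transmissions as (start offset, airtime), oldest first.
    sent: VecDeque<(Duration, Duration)>,
}

impl AirtimeBudget {
    /// `duty_permille` is the duty cycle in parts per thousand (10 = 1%).
    fn new(window: Duration, duty_permille: u32) -> Self {
        Self {
            window,
            allowed: window * duty_permille / 1000,
            sent: VecDeque::new(),
        }
    }

    /// Drop entries that left the window, then sum the remaining airtime.
    fn used(&mut self, now: Duration) -> Duration {
        while let Some(&(when, _)) = self.sent.front() {
            if now.saturating_sub(when) >= self.window {
                self.sent.pop_front();
            } else {
                break;
            }
        }
        self.sent.iter().map(|&(_, airtime)| airtime).sum()
    }

    /// Record the transmission and return true if it fits the budget;
    /// return false (and record nothing) if it would exceed it.
    fn try_transmit(&mut self, now: Duration, airtime: Duration) -> bool {
        if self.used(now) + airtime > self.allowed {
            return false;
        }
        self.sent.push_back((now, airtime));
        true
    }
}
```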
---
## §F — ISO / image build (🔴 one secret leak; otherwise 🟠 hardening)
`image-recipe/_archived/build-auto-installer-iso.sh` (3604 lines) is the real
builder; OTA is the normal update path but the ISO is what produces installable
media (latest artifact only one minor behind).
- [ ] ⛔🔴 **Anthropic API key — INTENTIONAL for alpha/beta, hard GO-LIVE gate.**
`build-auto-installer-iso.sh:2645` bakes a live `sk-ant-…` key into `claude-api-proxy.service`
so alpha/beta testers get frictionless AI (deliberate — per user 2026-07-02). **Do NOT
remove for alpha/beta.** Before public GA it MUST be removed + rotated + injected at runtime
(a second copy also exists in a worktree). Track it here so it can't be forgotten at launch.
- [ ] 🔴 **Per-device secrets on first boot.** The self-signed TLS **private key is generated
at build time** (`:426`) → every device ships the same key; SSH host keys likewise not
regenerated. Generate TLS + SSH host keys on first boot.
- [ ] 🟠 **Kill default credentials.** `archipelago`/`archipelago` (SSH+root), web `password123`,
and SSH `PasswordAuthentication yes` (`:411`) all ship. Lock root, force credential
creation in onboarding, disable SSH password auth (or force-change on first login).
- [ ] 🟠 **Sign + checksum the ISO.** Pipeline ends at `xorriso` with no `SHA256SUMS`, no
GPG/minisign, no Secure Boot (`BOOTX64.EFI` is unsigned though `grub-efi-amd64-signed` is
installed). Emit + sign checksums; wire signed Secure Boot.
- [ ] 🟠 **Registries over HTTPS in the image too.** `146.59.87.168:3000` / `git.tx1138.com`
are baked `insecure=true`/`tls_verify:false` (`:216`, `:2308`). (Ties to §A.)
- [ ] 🟡 **Add `unattended-upgrades` + a default-deny nftables firewall** (allow 22/80/443 +
mesh/WG). Neither exists today; OS packages drift until reflash and there is no host
firewall.
- [ ] 🟡 **Pin the build for reproducibility.** FIPS daemon is built from unpinned upstream
`main`, Tailscale from its live apt repo, and `scripts/image-versions.sh` uses many
`:latest`/`stable` tags (+ `bitcoin-ui:1.7.84-alpha`, 15 behind). Pin to commits/versions;
snapshot apt. Wire ISO version to `Cargo.toml` so it can't drift.
- [ ] 🟢 **Harden LUKS + roadmap A/B partitioning.** The LUKS data key sits in plaintext on the
unencrypted root (`:2137`); add TPM2/passphrase binding. Longer-term: A/B (or
factory-reset) partitions for safe OTA rollback, and a real install-time TUI
(`docs/INSTALL-SCREENS-DESIGN.md` exists but the installer is headless "press Enter").
---
## §G — Refactor & code health (🟢 not release-blocking; do after the tag or opportunistically)
- [ ] 🟢 **Manifest-drive per-app special-casing.** App names are branched on across 5-7 Rust
files (`config.rs` 36 match arms, `runtime.rs` 17, `install.rs:275-287` dispatch,
`prod_orchestrator.rs:54-83` baseline/restart-sensitive lists). Move `baseline`,
`restart_sensitive`, `stack_members`, `multi_container` into the manifest schema; collapse
the five near-identical `install_*_stack()` wrappers into one generic call. **Biggest
maintainability win.**
- [ ] 🟢 **Route all podman/systemctl through `podman_client`.** 113 raw `Command::new("podman")`
+ 32 `systemctl` calls bypass the existing 952-LOC wrapper → untestable + the blocking-call
risk (§C). Consolidating also unlocks unit tests for the thinly-tested `package/` handlers
(`stacks.rs` 1 test, `config.rs` 2, `runtime.rs` 3, `install.rs` 7).
- [ ] 🟢 **Split the god-modules.** `prod_orchestrator.rs` (5,263 LOC) → `orchestrator/{reconcile,
host_ports,ownership,hooks}.rs`; `Mesh.vue` (2,485 LOC / 241 KB chunk) → sub-components.
Both are well-tested, so safe.
- [ ] 🟢 **Delete dead code.** ~4,100 LOC of orphan StartOS crates (`js-engine`, `models`,
`helpers`, `container-init`) not in the workspace or linked; the committed AppleDouble
`._*.rs` files; the committed `.venv/`/`build/`/`__pycache__` under the duplicate
`reticulum-daemon/` tree; promote `MeshRadioDevice` enum → trait.
- [ ] 🟢 **Resolve the Quadlet flag & dep hygiene.** Decide `use_quadlet_backends`' fate
(flip default + delete the legacy `create_container` branch, or freeze as experimental —
don't ship both half-maintained). Consolidate the mixed hyper 0.14/1.x ecosystem; bump
stale majors (reqwest, base64, thiserror, tokio-tungstenite).
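
To make the first item above concrete, here is a rough sketch of what manifest-driven special-casing might look like. The four flag names come from that item; everything else (`AppManifest` as a struct, `install_plan`, `baseline_apps`, and the field types) is assumed for illustration and is not the current schema. The intent is that per-app `match` arms and `install_*_stack()` wrappers become data lookups on the manifest.

```rust
/// Illustrative subset of a manifest schema carrying the flags named above.
#[derive(Debug, Clone, Default)]
struct AppManifest {
    name: String,
    baseline: bool,
    restart_sensitive: bool,
    multi_container: bool,
    /// Container names making up the stack, in install order.
    stack_members: Vec<String>,
}

/// One generic planner in place of per-app stack wrappers:
/// returns the container names to install, in order.
fn install_plan(app: &AppManifest) -> Vec<String> {
    if app.multi_container && !app.stack_members.is_empty() {
        app.stack_members.clone()
    } else {
        vec![app.name.clone()]
    }
}

/// Replaces a hardcoded baseline list with a filter over manifest data.
fn baseline_apps(apps: &[AppManifest]) -> Vec<&str> {
    apps.iter()
        .filter(|app| app.baseline)
        .map(|app| app.name.as_str())
        .collect()
}
```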
---
## §H — Testing gaps that gate confidence (🟠)
- [ ] 🟠 **Add the OTA upgrade soak** (same as §B item 2) — the highest-value missing test.
- [ ] 🟡 **Add a host-reboot survival tier** — every app is `○` (untested) for reboot in
`TESTING.md:138`; the gate can't reboot the node it runs on. Run SSH-`reboot`-then-reprobe
out-of-band per node.
- [ ] 🟡 **Make the release gate run the full Rust suite** (or hard-require a green CI sha).
`tests/release/run.sh:101` runs only a 6-module slice because the full 1000-test suite
hangs PTYs on the dev box → 994 tests unverified at release time if CI is stale.
- [ ] 🟡 **Add `--max-time` to `node_rpc()`** (`tests/multinode/lib/multinode.bash`) — a slow
server-side RPC hangs the whole multinode suite with no feedback.
- [ ] 🟢 **De-hardcode creds/IPs in tests** (`tests/multinode/smoke.sh:32`,
`remote-lifecycle.sh:136`); snapshot/restore node baseline between destructive iterations
(teardown currently only clears `/tmp` session files).
---
## §I — Carried-over open items (from `UNIFIED-TASK-TRACKER.md`, still valid)
- [~] 🟠 **Multinode gate pass** — 5× destructive gate was launched on node `.5`; bring the
rest of the fleet to precondition, then run the existing (undocumented-but-present)
`tests/multinode/{smoke,meshtastic}.sh` cross-node suites.
- [ ] 🟠 **Federation `remove-node` tombstone regression.**
`federation/storage.rs:187` does `let _ = tombstone_did(...)` — swallows the write error,
so a removed peer reappears after the next sync. (This is a specific, confirmed instance
of the §C swallowed-writes class.) Needs a careful fix + `smoke.sh` re-verify.
- [ ] 🟠 **Phase-3 Quadlet default-flip** — validated + opt-in on .228/.198; flip
`config.rs:256` once the .5 gate reports clean.
- [ ] 🟠 **Developer CLI suite** (`archy app validate/render/install/test`) — gates external
app publishing (`APP-PACKAGING-MIGRATION-PLAN.md` step 5).
- [ ] ⛔🟡 **Version-naming decision** (`1.7.99-alpha` → `1.8.0` vs `1.8.00-alpha`): a one-line
call, then a mechanical bump + tag. **Needs your decision.**
- [ ] ⛔🟢 **Bitcoin multi-version fleet OTA.** `.228` working on branch; rollout timing is
held for your call (`docs/bitcoin-version-bulletproof-rollout.md`).
- [ ] ⛔🟢 **3ccc stock-Meshtastic RF validation** — code fix in place; needs a live radio send.
---
## Suggested order of attack
1. **The critical path:** §A signing ceremony → then turn on manifest/catalog/image
signature enforcement (§A) + OTA HTTPS/signature + deeper health check (§B).
2. **Cheap high-ROI stability:** §C swallowed-writes + blocking-calls; §D nostr-bridge
+ share-to-mesh origin checks; §H OTA soak + reboot tier.
3. **Image hardening:** rest of §F (per-device secrets, default creds, ISO signing,
firewall/unattended-upgrades, pinning).
4. **Polish, post-tag:** §G refactors, §E mesh persistence/dedup, §D bundle shrink.
5. **Decisions you own (⛔):** version name, signing mnemonic, bitcoin OTA timing, 3ccc test.
6. **Before public GA only (NOT alpha/beta):** remove + rotate the Anthropic key (§F) —
intentionally left in for frictionless AI during alpha/beta.

*Last updated: 2026-07-02 (initial deep-audit synthesis). Update this line + tick
boxes with commit shas as items land.*