Pin RELEASE_ROOT_PUBKEY_HEX from the 2026-07-02 release-root signing ceremony
(signer did🔑z6MkkidEnEpo6qHMCNSZoNKWtvQvxq3whnaME9wGgEFhq7ur) so nodes verify
the publisher identity of the app-catalog. Sign releases/app-catalog.json in place.
Fix two floats that made the catalog unsignable: archy-btcpay-db manifest version
-> string, fedimint-clientd cpu_limit 0.25 -> 1 (u32). Add scripts/sign-catalog.sh
helper, the 1.8.0 release-hardening plan/tracker, and the commit-and-push project
rule in CLAUDE.md.
Backward-compatible: old binaries still accept the signed catalog; the pinned-anchor
binary ships in the next build/OTA.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
302 lines
20 KiB
Markdown
302 lines
20 KiB
Markdown
# Archipelago 1.8.0 — Release Hardening Plan & Tracker
|
||
|
||
> **The one living checklist for shipping 1.8.0.** Derived from a full-system deep
|
||
> audit (2026-07-02): backend security, backend code-quality, frontend, mesh,
|
||
> tests/release pipeline, and the ISO build. Supersedes nothing — it *sits above*
|
||
> `docs/UNIFIED-TASK-TRACKER.md` (day-to-day) as the release exit-criteria list.
|
||
> **Keep it updated: tick a box the moment an item lands, with the commit sha.**
|
||
|
||
**Definition of done for 1.8.0:** the supply chain is authenticated end-to-end
|
||
(§A), OTA self-update is safe and rollback-proven on real hardware (§B), no
|
||
secrets ship in the image (§F), and the single-node gate stays 5/5 green through
|
||
all of it. Everything else is polish that should not block the tag.
|
||
|
||
**Legend:** `[ ]` open · `[~]` in progress · `[x]` done · 🔴 critical · 🟠 high ·
|
||
🟡 medium · 🟢 low/polish · ⛔ blocked on you.
|
||
|
||
---
|
||
|
||
## 🎯 The single most important insight
|
||
|
||
The **release signing ceremony (Workstream B) is the linchpin.** ✅ The ceremony
|
||
KEY was generated (user confirmed 2026-07-02) — the hard offline part is done. But
|
||
the outputs are **not yet wired into the repo**: `anchor.rs:21` is still `None` and
|
||
`releases/app-catalog.json` carries no `signature`/`signed_by` (its `image_signature`
|
||
fields are literal `"cosign://..."` placeholders). Three mechanical steps remain,
|
||
split by who can run them: **(1)** pin the pubkey — needs only the *public* hex, can
|
||
be done in-repo now; **(2)** sign the catalog with the `RELEASE_MASTER_MNEMONIC` —
|
||
only the publisher, secret never touches a host; **(3)** implement + flip cosign
|
||
enforcement on the pull path. Until (1)+(2) land, every "verify the signature" task
|
||
below is written but not enforced. **This is still the critical path; §A converges on it.**
|
||
|
||
---
|
||
|
||
## §A — Supply-chain authentication (🔴 THE release blocker)
|
||
|
||
Today an attacker who controls the mirror IP (or any MITM on the plaintext HTTP
|
||
path) can ship an arbitrary root binary, arbitrary container images, and an
|
||
arbitrary app catalog to the entire fleet — fully unattended under
|
||
`auto_apply`. These four items are one story and must land together.
|
||
|
||
- [x] 🔴 **Pin `RELEASE_ROOT_PUBKEY_HEX` + sign the catalog** — DONE 2026-07-02.
|
||
`anchor.rs` pinned to `5d15cbee…d469951` (signer
|
||
`did:key:z6MkkidEnEpo6qHMCNSZoNKWtvQvxq3whnaME9wGgEFhq7ur`); trust tests updated (16/16
|
||
green). `releases/app-catalog.json` signed in place (`signed_by` matches, 64-byte sig);
|
||
two blocking floats fixed en route (`archy-btcpay-db` version→string, `cpu_limit` 0.25→1).
|
||
Ship order (backward-compatible): signed catalog goes out first (old binaries still accept
|
||
it), pinned-anchor binary follows in the next build/OTA. **Still ahead:** (a) the
|
||
pinned-anchor binary must actually be built + shipped for enforcement to be live on nodes;
|
||
(b) flip "accept unsigned" → "reject unsigned" only after the whole fleet is on the pinned
|
||
binary (`container/app_catalog.rs:397`, the `Unsigned` arm) — see the next item.
|
||
- [ ] 🔴 **Enforce a signature on the OTA manifest before trusting it.**
|
||
`update.rs:68` fetches `http://146.59.87.168:3000/.../manifest.json` over cleartext
|
||
and parses/trusts it with no `trust::verify_detached` call; component sha256/blake3
|
||
are only checked against that same unauthenticated manifest → remote root RCE.
|
||
Move to HTTPS + pinned cert, require an Ed25519 release-root signature, and
|
||
**refuse `auto_apply` until the anchor is pinned.**
|
||
- [ ] 🔴 **Implement container image signature verification (cosign).**
|
||
`container/src/podman_client.rs:255` — `pull_image(.., _signature)` silently discards
|
||
the signature that the manifest threads all the way down
|
||
(`prod_orchestrator.rs:1978/2435`). Wire `sigstore-rs`/`cosign verify` (or
|
||
`podman pull --signature-policy`); hard-fail when a declared signature doesn't verify.
|
||
- [ ] 🟠 **Move the image mirror to HTTPS; drop `--tls-verify=false`.**
|
||
`podman_client.rs:641` `INSECURE_REGISTRY_HOSTS = ["146.59.87.168:3000"]` +
|
||
`config.rs:104,124` allowlist pull images over unauthenticated HTTP. Remove the raw-IP
|
||
entries; give the mirror a valid/pinned cert. (Same host also baked insecurely into
|
||
the ISO — see §F.)
|
||
- [ ] 🟠 **Validate every image string at the pull site, not just the RPC boundary.**
|
||
`is_valid_docker_image` runs in `install.rs:224`/`runtime.rs:549` but
|
||
`prod_orchestrator::install_fresh` (1978) and `resolve_catalog_image` (944-971) pass
|
||
catalog/manifest images straight to `pull_image`. Call the validator right before
|
||
every pull.
|
||
|
||
---
|
||
|
||
## §B — OTA self-update safety (🔴 1.8.0's headline feature is untested live)
|
||
|
||
The apply path itself is well-built (resumable download, staged-complete marker,
|
||
atomic swap, single-depth backup). The gaps are **authenticity** (§A) and
|
||
**verification depth** — plus the fact that the upgrade path has never run
|
||
end-to-end on real hardware.
|
||
|
||
- [ ] 🔴 **Deepen the post-OTA health check.** `update.rs:456` (`probe_frontend_once`)
|
||
passes on any 2xx/3xx from `GET /`, and `verify_pending_update` (494-593) only rolls
|
||
back on that. A release with a broken RPC API, dead containers, or failed LND unlock
|
||
passes and never rolls back. Add `/rpc/v1 update.status` + container-list/required-stack
|
||
health assertions before clearing the pending-verify marker.
|
||
- [ ] 🟠 **Run one real upgrade-from-vN-1 soak on hardware before tagging.**
|
||
No test installs the previous version, points it at a staged 1.8.0 manifest, applies,
|
||
and asserts health + rollback. This is the top release risk for an OTA release. A
|
||
two-VM (or two-node) harness is enough.
|
||
- [ ] 🟡 **Guard the frontend-build-no-op in the *actual* release path.** The
|
||
`ui-dist-version` grep guard (`tests/release/run.sh:82`) is behind `--with-build`, which
|
||
`scripts/create-release.sh:90` never passes → a stale frontend can ship with a valid
|
||
sha256. Call `run.sh --with-build --manifest` from create-release (or fold the grep in).
|
||
- [ ] 🟢 **publish-release-assets verifies size, not sha256** (`publish-release-assets.sh:97`).
|
||
Add a HEAD/GET sha256 compare so a size-correct/content-wrong mirror asset fails the
|
||
publish gate.
|
||
|
||
---
|
||
|
||
## §C — Backend robustness (🟠 stability, mostly low-effort/high-ROI)
|
||
|
||
Note: the `.unwrap()`/`panic!` worry is a **non-issue** — nearly all are in test
|
||
modules; production request/boot paths are essentially panic-free. The real risks:
|
||
|
||
- [ ] 🟠 **Log swallowed persistence writes.** ~30-40 dangerous `let _ = save_*().await`
|
||
sites discard durability failures with zero diagnostics: `server.rs:270` (mesh config),
|
||
`bitcoin_relay.rs:865` (relay state), `update.rs:163/1223` (mirrors/update state),
|
||
`registry.rs:158`, `mesh/status.rs:286`, `scheduler.rs:179`, `install.rs:34`. Convert to
|
||
`if let Err(e) = … { warn!(…) }`; leave genuinely fire-and-forget ones commented.
|
||
- [ ] 🟠 **Remove blocking `std::process::Command` from async handlers.**
|
||
`install.rs:2222` `published_host_port` (sync podman on the install path),
|
||
`dependencies.rs:316` (`df`), `system/handlers.rs:578` (`sudo`), `transport/fips.rs:50`
|
||
(`systemctl`) stall tokio workers under load. Convert to `tokio::process` or
|
||
`spawn_blocking`. Only 8 files use `std::process::Command` — bounded.
|
||
- [ ] 🟡 **Restrict Bitcoin RPC exposure.** `bootstrap.rs:409` writes
|
||
`rpcallowip=0.0.0.0/0`. Scope to the container subnet / `127.0.0.1`.
|
||
- [ ] 🟡 **Move generated secrets from env to file mounts.** `manifest.rs:1208-1226`
|
||
injects secrets as `-e KEY=value`, readable via `podman inspect` / `/proc/<pid>/environ`.
|
||
Prefer bind-mounting the existing `0600` secret file or `podman --secret`.
|
||
- [ ] 🟡 **Harden rate-limit IP extraction.** `middleware.rs:120-128` trusts
|
||
client-spoofable `X-Real-IP`/`X-Forwarded-For` → per-request bucket rotation defeats the
|
||
login limiter. Trust forwarded headers only from a configured proxy; have nginx set them.
|
||
- [ ] 🟢 **Include `seq` in the mesh signed preimage.** `message_types.rs:245-288` signs
|
||
`(t,v,ts)` but sets the anti-replay `seq` after signing → a radio MITM can alter ordering
|
||
without breaking the signature.
|
||
- [ ] 🟢 **Guard the short-DID slice panic** (`mesh/listener/decode.rs:566`) and gate the
|
||
dev-mode `password123` bypass (`auth.rs:18`) behind `#[cfg]` before it can reach a
|
||
release build.
|
||
- [ ] 🟢 **Apply the seccomp/apparmor profile** — `security/src/container_policies.rs:71` is a
|
||
TODO; the profile is defined but never applied to podman.
|
||
|
||
---
|
||
|
||
## §D — Frontend security & performance (🟠)
|
||
|
||
The untrusted mesh/LoRa chat path is **safe** (interpolation, no `v-html` — good).
|
||
The real issues are the app-bridge origin model and a bloated bundle.
|
||
|
||
- [ ] 🟠 **Validate `event.origin` + add consent gates in the NIP-07 nostr bridge.**
|
||
`stores/appLauncher.ts:385-490` derives the caller from the launcher's own URL, never
|
||
`event.origin`, and `getPublicKey`/`nip04.decrypt`/`nip44.decrypt` have no consent gate →
|
||
any co-resident iframe can deanonymize the nostr identity or use the node as a decryption
|
||
oracle while an app is open. Check `event.origin` against the open app's real origin; key
|
||
approvals on it; gate decrypt/getPublicKey like `signEvent`.
|
||
- [ ] 🟠 **Origin-check the `share-to-mesh` handler.** `App.vue:450-464` acts on
|
||
`{type:'share-to-mesh', cid}` from any sender and force-navigates to `/mesh` with the CID
|
||
pre-staged. Add `ev.origin === window.location.origin` (as `Chat.vue:95` already does).
|
||
- [ ] 🟡 **Decide the app-iframe isolation model.** `AppSessionFrame.vue:54` /
|
||
`AppLauncherOverlay.vue:79` embed apps same-origin with no meaningful `sandbox`; a
|
||
same-origin app can read the CSRF cookie + `localStorage`. Ideal fix (serve apps from a
|
||
per-app subdomain origin) is architectural — at minimum decide + document for 1.8.0.
|
||
- [ ] 🟡 **Shrink the 93 MB dist.** `assets/video/video-intro.mp4` is **14.7 MB**
|
||
(precached by the service worker → blocks PWA install), plus ~18 MB of ~1 MB full-screen
|
||
JPEGs. Convert backgrounds to WebP/AVIF at responsive sizes, lazy/stream the intro video,
|
||
and exclude video/audio from the Workbox precache. Biggest, easiest perf win.
|
||
- [ ] 🟢 **DOMPurify the `Server.vue` QR SVG** (`:283/:295` render `v-html` unsanitized while
|
||
`TwoFactorSection.vue` sanitizes the analogous SVG); guard the unguarded `pollInterval`
|
||
(`Mesh.vue:391`); surface silent data-fetch failures (`curatedApps.ts:58/71`).
|
||
|
||
---
|
||
|
||
## §E — Mesh transports (🟢 mostly done — verify & polish)
|
||
|
||
Confirmed **fixed in HEAD:** B8 (1970 timestamps), B6 (inbound RX surfacing), the
|
||
per-message transport pill, and the archy↔archy plain-TEXT-DM E2E fix. Remaining:
|
||
|
||
- [ ] 🟠 **Active Reticulum daemon-death detection.** `reticulum.rs:589` only `warn!`s on
|
||
socket EOF and `try_recv_frame` then returns `Ok(None)` forever; nothing calls
|
||
`child.try_wait()`. On an idle link a crashed daemon is invisible for up to 30 min (the
|
||
RX-stall timeout). Treat socket EOF as `Err` → immediate respawn. (Pairs with the current
|
||
`fix/reticulum-daemon-pdeathsig` branch work.)
|
||
- [ ] 🟡 **Persist chat history across restarts.** `state.messages` boots empty
|
||
(`listener/mod.rs:283`) while outbox/scheduler/peers survive — inconsistent; bubbles
|
||
vanish on restart. Add `mesh-messages.json` mirroring the `scheduler.rs`/`outbox.rs`
|
||
pattern (or explicitly accept the loss).
|
||
- [ ] 🟡 **Tighten the 30 s legacy dedup** (`listener/mod.rs:383-389`) — it silently drops a
|
||
peer legitimately sending identical text twice within 30 s.
|
||
- [ ] 🟢 **Wire the PyInstaller daemon binary into the release tarball / deploy script**
|
||
(Rust expects `/usr/local/bin/archy-reticulum-daemon`, `reticulum.rs:80`); add the RNode
|
||
udev rule; finish `ARCHY:2:` announce→`arch_pubkey_hex` binding (`reticulum.rs:119`).
|
||
- [ ] 🟢 **Duty-cycle guard for LoRa TX** — none exists; EU 868 is legally 1%. At minimum an
|
||
airtime budget/warning.
|
||
|
||
---
|
||
|
||
## §F — ISO / image build (🔴 one secret leak; otherwise 🟠 hardening)
|
||
|
||
`image-recipe/_archived/build-auto-installer-iso.sh` (3604 lines) is the real
|
||
builder; OTA is the normal update path but the ISO is what produces installable
|
||
media (latest artifact only one minor behind).
|
||
|
||
- [ ] ⛔🔴 **Anthropic API key — INTENTIONAL for alpha/beta, hard GO-LIVE gate.**
|
||
`build-auto-installer-iso.sh:2645` bakes a live `sk-ant-…` key into `claude-api-proxy.service`
|
||
so alpha/beta testers get frictionless AI (deliberate — per user 2026-07-02). **Do NOT
|
||
remove for alpha/beta.** Before public GA it MUST be removed + rotated + injected at runtime
|
||
(a second copy also exists in a worktree). Track it here so it can't be forgotten at launch.
|
||
- [ ] 🔴 **Per-device secrets on first boot.** The self-signed TLS **private key is generated
|
||
at build time** (`:426`) → every device ships the same key; SSH host keys likewise not
|
||
regenerated. Generate TLS + SSH host keys on first boot.
|
||
- [ ] 🟠 **Kill default credentials.** `archipelago`/`archipelago` (SSH+root), web `password123`,
|
||
and SSH `PasswordAuthentication yes` (`:411`) all ship. Lock root, force credential
|
||
creation in onboarding, disable SSH password auth (or force-change on first login).
|
||
- [ ] 🟠 **Sign + checksum the ISO.** Pipeline ends at `xorriso` with no `SHA256SUMS`, no
|
||
GPG/minisign, no Secure Boot (`BOOTX64.EFI` is unsigned though `grub-efi-amd64-signed` is
|
||
installed). Emit + sign checksums; wire signed Secure Boot.
|
||
- [ ] 🟠 **Registries over HTTPS in the image too** — `146.59.87.168:3000` / `git.tx1138.com`
|
||
are baked `insecure=true`/`tls_verify:false` (`:216`, `:2308`). (Ties to §A.)
|
||
- [ ] 🟡 **Add `unattended-upgrades` + a default-deny nftables firewall** (allow 22/80/443 +
|
||
mesh/WG). Neither exists today; OS packages drift until reflash and there is no host
|
||
firewall.
|
||
- [ ] 🟡 **Pin the build for reproducibility.** FIPS daemon is built from unpinned upstream
|
||
`main`, Tailscale from its live apt repo, and `scripts/image-versions.sh` uses many
|
||
`:latest`/`stable` tags (+ `bitcoin-ui:1.7.84-alpha`, 15 behind). Pin to commits/versions;
|
||
snapshot apt. Wire ISO version to `Cargo.toml` so it can't drift.
|
||
- [ ] 🟢 **Harden LUKS + roadmap A/B partitioning.** The LUKS data key sits in plaintext on the
|
||
unencrypted root (`:2137`); add TPM2/passphrase binding. Longer-term: A/B (or
|
||
factory-reset) partitions for safe OTA rollback, and a real install-time TUI
|
||
(`docs/INSTALL-SCREENS-DESIGN.md` exists but the installer is headless "press Enter").
|
||
|
||
---
|
||
|
||
## §G — Refactor & code health (🟢 not release-blocking; do after the tag or opportunistically)
|
||
|
||
- [ ] 🟢 **Manifest-drive per-app special-casing.** App names are branched on across 5-7 Rust
|
||
files (`config.rs` 36 match arms, `runtime.rs` 17, `install.rs:275-287` dispatch,
|
||
`prod_orchestrator.rs:54-83` baseline/restart-sensitive lists). Move `baseline`,
|
||
`restart_sensitive`, `stack_members`, `multi_container` into the manifest schema; collapse
|
||
the five near-identical `install_*_stack()` wrappers into one generic call. **Biggest
|
||
maintainability win.**
|
||
- [ ] 🟢 **Route all podman/systemctl through `podman_client`.** 113 raw `Command::new("podman")`
|
||
+ 32 `systemctl` calls bypass the existing 952-LOC wrapper → untestable + the blocking-call
|
||
risk (§C). Consolidating also unlocks unit tests for the thinly-tested `package/` handlers
|
||
(`stacks.rs` 1 test, `config.rs` 2, `runtime.rs` 3, `install.rs` 7).
|
||
- [ ] 🟢 **Split the god-modules.** `prod_orchestrator.rs` (5,263 LOC) → `orchestrator/{reconcile,
|
||
host_ports,ownership,hooks}.rs`; `Mesh.vue` (2,485 LOC / 241 KB chunk) → sub-components.
|
||
Both are well-tested, so safe.
|
||
- [ ] 🟢 **Delete dead code.** ~4,100 LOC of orphan StartOS crates (`js-engine`, `models`,
|
||
`helpers`, `container-init`) not in the workspace or linked; the committed AppleDouble
|
||
`._*.rs` files; the committed `.venv/`/`build/`/`__pycache__` under the duplicate
|
||
`reticulum-daemon/` tree; promote `MeshRadioDevice` enum → trait.
|
||
- [ ] 🟢 **Resolve the Quadlet flag & dep hygiene.** Decide `use_quadlet_backends`' fate
|
||
(flip default + delete the legacy `create_container` branch, or freeze as experimental —
|
||
don't ship both half-maintained). Consolidate the mixed hyper 0.14/1.x ecosystem; bump
|
||
stale majors (reqwest, base64, thiserror, tokio-tungstenite).
|
||
|
||
---
|
||
|
||
## §H — Testing gaps that gate confidence (🟠)
|
||
|
||
- [ ] 🟠 **Add the OTA upgrade soak** (same as §B item 2) — the highest-value missing test.
|
||
- [ ] 🟡 **Add a host-reboot survival tier** — every app is `○` (untested) for reboot in
|
||
`TESTING.md:138`; the gate can't reboot the node it runs on. Run SSH-`reboot`-then-reprobe
|
||
out-of-band per node.
|
||
- [ ] 🟡 **Make the release gate run the full Rust suite** (or hard-require a green CI sha).
|
||
`tests/release/run.sh:101` runs only a 6-module slice because the full 1000-test suite
|
||
hangs PTYs on the dev box → 994 tests unverified at release time if CI is stale.
|
||
- [ ] 🟡 **Add `--max-time` to `node_rpc()`** (`tests/multinode/lib/multinode.bash`) — a slow
|
||
server-side RPC hangs the whole multinode suite with no feedback.
|
||
- [ ] 🟢 **De-hardcode creds/IPs in tests** (`tests/multinode/smoke.sh:32`,
|
||
`remote-lifecycle.sh:136`); snapshot/restore node baseline between destructive iterations
|
||
(teardown currently only clears `/tmp` session files).
|
||
|
||
---
|
||
|
||
## §I — Carried-over open items (from `UNIFIED-TASK-TRACKER.md`, still valid)
|
||
|
||
- [~] 🟠 **Multinode gate pass** — 5× destructive gate was launched on node `.5`; bring the
|
||
rest of the fleet to precondition, then run the existing (undocumented-but-present)
|
||
`tests/multinode/{smoke,meshtastic}.sh` cross-node suites.
|
||
- [ ] 🟠 **Federation `remove-node` tombstone regression.**
|
||
`federation/storage.rs:187` does `let _ = tombstone_did(...)` — swallows the write error,
|
||
so a removed peer reappears after the next sync. (This is a specific, confirmed instance
|
||
of the §C swallowed-writes class.) Needs a careful fix + `smoke.sh` re-verify.
|
||
- [ ] 🟠 **Phase-3 Quadlet default-flip** — validated + opt-in on .228/.198; flip
|
||
`config.rs:256` once the .5 gate reports clean.
|
||
- [ ] 🟠 **Developer CLI suite** (`archy app validate/render/install/test`) — gates external
|
||
app publishing (`APP-PACKAGING-MIGRATION-PLAN.md` step 5).
|
||
- [ ] ⛔🟡 **Version-naming decision** (`1.7.99-alpha` → `1.8.0` vs `1.8.00-alpha`) — a one-line
|
||
call, then a mechanical bump + tag. **Needs your decision.**
|
||
- [ ] ⛔🟢 **Bitcoin multi-version fleet OTA** — `.228` working on branch; rollout timing is
|
||
held for your call (`docs/bitcoin-version-bulletproof-rollout.md`).
|
||
- [ ] ⛔🟢 **3ccc stock-Meshtastic RF validation** — code fix in place; needs a live radio send.
|
||
|
||
---
|
||
|
||
## Suggested order of attack
|
||
|
||
1. **The critical path:** §A signing ceremony → then turn on manifest/catalog/image
|
||
signature enforcement (§A) + OTA HTTPS/signature + deeper health check (§B).
|
||
2. **Cheap high-ROI stability:** §C swallowed-writes + blocking-calls; §D nostr-bridge
|
||
+ share-to-mesh origin checks; §H OTA soak + reboot tier.
|
||
3. **Image hardening:** rest of §F (per-device secrets, default creds, ISO signing,
|
||
firewall/unattended-upgrades, pinning).
|
||
4. **Polish, post-tag:** §G refactors, §E mesh persistence/dedup, §D bundle shrink.
|
||
5. **Decisions you own (⛔):** version name, signing mnemonic, bitcoin OTA timing, 3ccc test.
|
||
6. **Before public GA only (NOT alpha/beta):** remove + rotate the Anthropic key (§F) —
|
||
intentionally left in for frictionless AI during alpha/beta.
|
||
|
||
*Last updated: 2026-07-02 (initial deep-audit synthesis). Update this line + tick
|
||
boxes with commit shas as items land.*
|