archy/docs/SESSION-2026-03-18.md

57 lines
2.4 KiB
Markdown
Raw Normal View History

release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 + v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500) with no recovery path short of SSH. This release adds a self-check guardrail to the update flow. What changed: - apply_update() writes a pending-verify marker with old+new version and a 150s deadline immediately before scheduling the service restart. - verify_pending_update() runs from main.rs startup. If the marker is present and within its freshness window, the new binary waits 15s for nginx + backend to settle, then probes https://127.0.0.1/ every 5s for up to 90s (self-signed certs accepted). - On any probe success within the window, the marker is cleared and nothing else happens. - On window-exhaust, the new binary: 1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts> (quarantined, not deleted, so we can post-mortem). 2. Restores web-ui.bak on top of web-ui. 3. Calls rollback_update() to restore the previous binary. 4. Updates state.current_version to reflect the rollback. 5. systemctl --no-block restart archipelago so the OLD binary boots. - Markers older than 10 minutes are treated as stale and cleared without probing, so a crashed-during-startup marker from weeks ago cannot spontaneously roll back a healthy node on a later reboot. - rollback_update() binary copy now goes through host_sudo instead of tokio::fs::copy, so it escapes the service's ProtectSystem=strict mount namespace. Without this, the rollback silently failed with EROFS on /usr/local/bin and orphaned the rollback - the exact opposite of what auto-rollback is for. Tests: 4 new unit tests in update::tests covering marker round-trip, absent-marker noop, no-panic on verify_pending_update with nothing to verify, and an invariant assert that the 90s probe window stays below the 600s stale threshold. All passing. Side fix: scripts/create-release-manifest.sh was dying with exit 141 (SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail. Replaced with a single awk NR==1 that doesn't short-circuit the upstream pipe, so the release-build flow is idempotent again.
2026-04-22 16:14:35 -04:00
# Session 2026-03-18 — Resume Guide
## What Was Done
### Rootless Podman Migration (TASK-11 DONE)
- .228: 30 containers running rootless with full security hardening
- All `sudo podman` removed from Rust backend (9 files) + deploy script
- UID mapping: container UID N → host UID (100000 + N - 1)
- Deploy script auto-fixes ownership + sysctl + linger on every deploy
### .198 Migration (IN PROGRESS)
- Root containers stopped, UID ownership fixed, IndeedHub images migrated
- `/etc/hosts` fixed to 644 (rootless podman needs read access)
- **Only 2 containers running — needs full container recreation**
- Next: run container setup (Bitcoin, LND, ElectrumX, all apps)
- The `--both` deploy only copies binary+frontend, doesn't create containers
### Security Hardening (TASK-8 — 9/12 pentest findings fixed)
- C1: /lnd-connect-info requires session auth
- C3: DEV_MODE removed from production service
- H1: node-message verifies ed25519 signatures
- M1: content.add rejects `..` path traversal
- M2: NIP-07 postMessage uses specific origin
- M3: AIUI nginx checks session_id cookie
- L2: Strict v3 onion validation
- **Still open**: H2/H3 (federation signature verification), H4 (bind ports to 127.0.0.1)
### UI/UX Fixes
- Mesh serial: auto-detect, backoff, udev rule, Connect button
- External iframes: CSP https: added
- Container startup: "Checking..." shimmer, marketplace sort
- Port mapping: all nginx+frontend+backend synced
- ElectrumX: shows index size during indexing
- Fedimintd → "Fedimint Guardian"
- IndeedHub Studio version
- On-Chain first in receive modals
- Tab-launch icons, iframe error screen, CPU alert threshold
- Mesh mobile: header hidden, overflow fixed
- Federation/Cloud: DID on hover
### Git Tags
- v1.2.0-alpha.1 through v1.2.0-alpha.8 (current)
## Resume Checklist
1. **Finish .198 containers** — create Bitcoin, LND, ElectrumX, MariaDB, Mempool, BTCPay, Grafana, etc.
2. **H2/H3** — federation peer-joined/address-changed signature verification
3. **H4** — bind service ports to 127.0.0.1
4. **BUG-1** — CSRF mismatch (P0 critical)
5. **Many /task items** in MASTER_PLAN.md from testing session
6. **Tailscale migration** for other nodes (preserve auth state)
## Key Facts
- Rootless subnet: 10.89.0.0/16
- Bitcoin RPC: rpcallowip=0.0.0.0/0, password in /var/lib/archipelago/secrets/
- .198 /etc/hosts must be 644
- Deploy --both only copies, --live creates containers