archy/loop/prompt.md


release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet

Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 + v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500) with no recovery path short of SSH. This release adds a self-check guardrail to the update flow.

What changed:

- apply_update() writes a pending-verify marker with old+new version and a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is present and within its freshness window, the new binary waits 15s for nginx + backend to settle, then probes https://127.0.0.1/ every 5s for up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and nothing else happens.
- On window-exhaust, the new binary:
  1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts> (quarantined, not deleted, so we can post-mortem).
  2. Restores web-ui.bak on top of web-ui.
  3. Calls rollback_update() to restore the previous binary.
  4. Updates state.current_version to reflect the rollback.
  5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without probing, so a crashed-during-startup marker from weeks ago cannot spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of tokio::fs::copy, so it escapes the service's ProtectSystem=strict mount namespace. Without this, the rollback silently failed with EROFS on /usr/local/bin and orphaned the rollback - the exact opposite of what auto-rollback is for.

Tests: 4 new unit tests in update::tests covering marker round-trip, absent-marker noop, no-panic on verify_pending_update with nothing to verify, and an invariant assert that the 90s probe window stays below the 600s stale threshold. All passing.

Side fix: scripts/create-release-manifest.sh was dying with exit 141 (SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail. Replaced with a single awk NR==1 that doesn't short-circuit the upstream pipe, so the release-build flow is idempotent again.
2026-04-22 16:14:35 -04:00
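
The startup decision described in the commit message above (absent marker, stale marker, probe window, rollback), as an added Rust example that is not the project's own code. The type and function names, the injected `probe` closure and the simulated clock are assumptions; only the 15s, 5s, 90s and 10-minute timings come from the message, and the marker's 150s deadline field is not modelled.

```rust
use std::time::Duration;
// Timings quoted in the commit message; every name below is hypothetical.
const SETTLE: Duration = Duration::from_secs(15);
const PROBE_INTERVAL: Duration = Duration::from_secs(5);
const PROBE_WINDOW: Duration = Duration::from_secs(90);
const STALE_AFTER: Duration = Duration::from_secs(600);

// Compile-time form of the invariant the commit's unit test asserts.
const _: () = assert!(PROBE_WINDOW.as_secs() < STALE_AFTER.as_secs());

#[derive(Debug, PartialEq)]
pub enum Outcome {
    NoMarker,     // no update pending: do nothing
    StaleCleared, // old marker: clear it, never probe, never roll back
    Verified,     // UI answered inside the window: clear the marker
    RollBack,     // window exhausted: quarantine web-ui, restore old binary
}

/// `marker_age` is the marker's age at startup (`None` when absent).
/// `probe` receives the simulated time since startup and returns true when
/// the local UI answered; real code would sleep and issue an HTTPS request.
pub fn verify_pending(
    marker_age: Option<Duration>,
    mut probe: impl FnMut(Duration) -> bool,
) -> Outcome {
    let age = match marker_age {
        Some(age) => age,
        None => return Outcome::NoMarker,
    };
    if age > STALE_AFTER {
        return Outcome::StaleCleared;
    }
    let mut elapsed = SETTLE; // first probe only after the settle delay
    while elapsed <= SETTLE + PROBE_WINDOW {
        if probe(elapsed) {
            return Outcome::Verified;
        }
        elapsed += PROBE_INTERVAL;
    }
    Outcome::RollBack
}

fn main() {
    // Constructed scenarios: UI healthy after 40s, UI never healthy, stale marker.
    let fresh = Some(Duration::from_secs(20));
    println!("{:?}", verify_pending(fresh, |t| t >= Duration::from_secs(40)));
    println!("{:?}", verify_pending(fresh, |_| false));
    println!("{:?}", verify_pending(Some(Duration::from_secs(3600)), |_| true));
}
```

The stale check runs before any probe, so an old marker can never reach the rollback branch regardless of the UI's state.
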
You are working through an overnight automation plan for the Archipelago (archy) project. Read these files first:
1. `loop/plan.md` -- Your task checklist (mark items `- [x]` as you complete them)
2. `CLAUDE.md` -- Project conventions, architecture, and coding standards
## Working Process
For each task in `loop/plan.md`:
1. Find the first unchecked `- [ ]` item
2. Read the task description carefully
3. Read the relevant source files before making changes
4. Implement following CLAUDE.md conventions
5. Run any test/build commands specified in the task
6. Fix all errors before continuing
7. Commit with conventional format: `type: description`
8. Mark it done `- [x]` in `loop/plan.md`
9. Move to the next unchecked task immediately
## Critical Rules
- **Deploy-test-fix LOOPS**: Many tasks require you to deploy, test, find failures, fix them, redeploy, and retest. Do NOT mark a task complete until ALL tests in that task pass. If a fix introduces a new failure, fix that too. Keep looping.
- **Read logs obsessively**: After every deploy, read `journalctl`, `podman logs`, and curl output. The logs tell you what's broken.
- **Fix the root cause**: Don't patch symptoms. If a container won't restart, find out WHY (wrong restart policy? health check failing? missing dependency?) and fix the actual cause.
- Never skip a testing gate -- if tests fail, fix before moving on
- If a task is proving difficult, make at least 10 genuine attempts before moving on
- Always read source files before editing them
- Do not stop until all tasks are checked or you are rate limited
- Commit after each completed fix (multiple commits per task is fine)
- DO NOT PUSH -- a CI build is in progress, we will push manually later
- Deploy to .228 -- `ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228`
- Run Rust builds/checks on .228, NOT macOS
- Production-quality code only -- no shortcuts, no TODO comments, no unwrap()
## SSH Quick Reference
```bash
SSH="ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228"
# Deploy from macOS:
./scripts/deploy-to-target.sh --target 192.168.1.228
# Build Rust on .228:
$SSH "cd ~/archy/core && cargo clippy --all-targets --all-features && cargo test --all-features"
# Check containers:
$SSH "podman ps -a --format '{{.Names}} {{.State}} {{.Status}}' | sort"
# Read container logs:
$SSH "podman logs bitcoin-knots --tail 30"
# Check backend:
$SSH "journalctl -u archipelago --no-pager -n 50"
```