2026-03-22 03:30:21 +00:00
You are working through an overnight automation plan for the Archipelago (archy) project. Read these files first:
1. `loop/plan.md` -- Your task checklist (mark items `- [x]` as you complete them)
2. `CLAUDE.md` -- Project conventions, architecture, and coding standards
## Working Process
For each task in `loop/plan.md` :
1. Find the first unchecked `- [ ]` item
2. Read the task description carefully
3. Read the relevant source files before making changes
4. Implement following CLAUDE.md conventions
5. Run any test/build commands specified in the task
6. Fix all errors before continuing
7. Commit with conventional format: `type: description`
8. Mark it done `- [x]` in `loop/plan.md`
9. Move to the next unchecked task immediately
fix: overhaul container lifecycle — recovery, health, uninstall, UI state
Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy
UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry
Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)
Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0
Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00
## Critical Rules
2026-03-22 03:30:21 +00:00
fix: overhaul container lifecycle — recovery, health, uninstall, UI state
Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy
UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry
Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)
Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0
Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00
- **Deploy-test-fix LOOPS**: Many tasks require you to deploy, test, find failures, fix them, redeploy, and retest. Do NOT mark a task complete until ALL tests in that task pass. If a fix introduces a new failure, fix that too. Keep looping.
- **Read logs obsessively**: After every deploy, read `journalctl` , `podman logs` , and curl output. The logs tell you what's broken.
- **Fix the root cause**: Don't patch symptoms. If a container won't restart, find out WHY (wrong restart policy? health check failing? missing dependency?) and fix the actual cause.
2026-03-22 03:30:21 +00:00
- Never skip a testing gate -- if tests fail, fix before moving on
- If a task is proving difficult, make at least 10 genuine attempts before moving on
- Always read source files before editing them
- Do not stop until all tasks are checked or you are rate limited
fix: overhaul container lifecycle — recovery, health, uninstall, UI state
Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy
UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry
Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)
Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0
Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00
- Commit after each completed fix (multiple commits per task is fine)
- DO NOT PUSH -- a CI build is in progress, we will push manually later
- Deploy to .228 -- `ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228`
- Run Rust builds/checks on .228, NOT macOS
- Production-quality code only -- no shortcuts, no TODO comments, no unwrap()
## SSH Quick Reference
```bash
SSH="ssh -i ~/.ssh/archipelago-deploy archipelago@192 .168.1.228"
# Deploy from macOS:
./scripts/deploy-to-target.sh --target 192.168.1.228
# Build Rust on .228:
$SSH "cd ~/archy/core & & cargo clippy --all-targets --all-features & & cargo test --all-features"
# Check containers:
$SSH "podman ps -a --format '{{.Names}} {{.State}} {{.Status}}' | sort"
# Read container logs:
$SSH "podman logs bitcoin-knots --tail 30"
# Check backend:
$SSH "journalctl -u archipelago --no-pager -n 50"
```