Dorian 1e283daf13 fix: overhaul container lifecycle — recovery, health, uninstall, UI state
Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00

---
name: podman
description: Rootless Podman container management — diagnose, fix, and harden uptime. Use for container issues, port problems, UID mapping, health checks, or uptime hardening.
disable-model-invocation: true
allowed-tools: Bash, Read, Edit, Write, Glob, Grep
argument-hint: "[diagnose|fix|uptime] [container-name]"
---
# Podman — Container Management
Archipelago runs rootless Podman as the `archipelago` user (UID 1000). All `podman` commands run without sudo. UID mapping: container UID N → host UID (100000 + N).

**SSH**: `ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228`
## Diagnose
```bash
# Container status
podman ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}\t{{.Networks}}"
# Restart policies (must be "unless-stopped")
for c in $(podman ps -a --format "{{.Names}}"); do
    echo -n "$c: "; podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}"
done
# Health checks
for c in $(podman ps --format "{{.Names}}"); do
    health=$(podman inspect "$c" --format "{{.State.Health.Status}}" 2>/dev/null)
    [ -n "$health" ] && [ "$health" != "<no value>" ] && echo "$c: $health"
done
# Resource usage + recent deaths
podman stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
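# --stream=false makes podman exit after the last known event instead of following new ones
podman events --filter event=died --since 24h --stream=false 2>/dev/null | tail -10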
# Rootless prerequisites
echo "XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR" # must be /run/user/1000
grep archipelago /etc/subuid # must show archipelago:100000:65536
ls /var/lib/systemd/linger/ | grep archipelago # must exist
grep DEFAULT_FORWARD_POLICY /etc/default/ufw # must be ACCEPT
```
Cross-check 4 layers for port consistency: Backend config (package.rs) → Podman ports → Nginx proxy → Frontend appLauncher.ts. See `references/port-map.md`.
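The host side of that cross-check can be partly scripted. The following is a minimal sketch, not the project's tooling: it assumes a hypothetical `ports.txt` with one expected host port per line (the format of `references/port-map.md` is not shown here) and reports every expected port that has no TCP listener according to `ss -tln`.

```bash
#!/usr/bin/env bash
# Print each expected port that nothing is listening on.
#   $1    : file with one expected host port per line (hypothetical ports.txt)
#   stdin : output of `ss -tln`
missing_listeners() {
    local expected_file=$1 listening port
    # Column 4 of `ss -tln` is "Local Address:Port"; keep the part after the last colon.
    listening=$(awk '$1 == "LISTEN" { n = split($4, a, ":"); print a[n] }')
    while read -r port; do
        [ -z "$port" ] && continue
        printf '%s\n' "$listening" | grep -qxF -- "$port" || echo "$port"
    done < "$expected_file"
}
```

Usage: `ss -tln | missing_listeners ports.txt`. An empty result means every expected port has a listener. It does not prove that Nginx or the frontend point at the same port, so the other layers still need a manual look.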
## Fix
**Restart policy missing**: `podman update --restart unless-stopped CONTAINER_NAME`

**UID mapping (permission denied)**: `sudo chown -R HOST_UID:HOST_UID /var/lib/archipelago/APP`. Formula: host_uid = 100000 + container_uid. See `references/uid-mapping.md`.

**Port conflict**: `ss -tlnp | grep :PORT` to find the offender. Ports cannot be added to a running container; it must be recreated.

**Network missing**: `podman network connect archy-net CONTAINER_NAME`

**UFW blocking LAN**: `sudo sed -i 's/DEFAULT_FORWARD_POLICY="DROP"/DEFAULT_FORWARD_POLICY="ACCEPT"/' /etc/default/ufw && sudo ufw reload`

**Stale processes**: `pgrep -c -f "podman ps"`; if the count exceeds 10, kill the stuck processes with `pkill -f "podman ps"`.

See `references/common-failures.md` for the full error→cause→fix lookup table.
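Before running `chown` for a UID mapping fix, it can help to read the mapping table the kernel actually uses rather than rely on arithmetic alone. With the default rootless mapping, container UID 0 is the `archipelago` user itself and the subordinate range starts at container UID 1. The helper below is a sketch with a hypothetical name (`host_uid_for`); it translates a container UID using the three-column table printed by `podman unshare cat /proc/self/uid_map`.

```bash
#!/usr/bin/env bash
# Translate a container UID to the host UID using a uid_map table.
#   $1    : container UID to look up
#   stdin : lines "container_start host_start length", as printed by
#           podman unshare cat /proc/self/uid_map
# Prints the host UID, or returns 1 if the container UID is not mapped.
host_uid_for() {
    local want=$1
    awk -v want="$want" '
        want >= $1 && want < $1 + $3 { print $2 + (want - $1); found = 1; exit }
        END { exit found ? 0 : 1 }
    '
}
```

Usage: `podman unshare cat /proc/self/uid_map | host_uid_for 999`. Containers started with their own `--userns` or `--uidmap` options may use a different table, which this sketch does not inspect.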
## Uptime Hardening
### Layer 1: Restart policies
```bash
for c in $(podman ps -a --format "{{.Names}}"); do
    policy=$(podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}")
    if [ "$policy" = "no" ] || [ -z "$policy" ]; then
        podman update --restart unless-stopped "$c"
    fi
done
```
### Layer 2: Watchdog timer
Create `/usr/local/bin/archipelago-container-watchdog.sh`, which restarts stopped or unhealthy containers, and run it every 2 minutes from a systemd timer. The script runs as the `archipelago` user with `XDG_RUNTIME_DIR=/run/user/1000`.
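One possible shape for the watchdog script is sketched below. It is not the deployed script: the function names, the `run` argument, and the choice of which states count as "stopped" (`exited`, `stopped`, `created`) are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch of /usr/local/bin/archipelago-container-watchdog.sh (assumed layout).
export XDG_RUNTIME_DIR="${XDG_RUNTIME_DIR:-/run/user/1000}"

# Succeed if a container with this state and health status should be restarted.
needs_restart() {
    local state=$1 health=$2
    case "$state" in
        exited|stopped|created) return 0 ;;
    esac
    [ "$health" = "unhealthy" ]
}

# One watchdog pass over every container known to this user.
watchdog_pass() {
    local name state health
    podman ps -a --format "{{.Names}}" | while read -r name; do
        [ -z "$name" ] && continue
        state=$(podman inspect "$name" --format "{{.State.Status}}" 2>/dev/null)
        health=$(podman inspect "$name" --format "{{.State.Health.Status}}" 2>/dev/null)
        if needs_restart "$state" "$health"; then
            echo "restarting $name (state=$state health=${health:-none})"
            podman restart "$name" >/dev/null
        fi
    done
}

# The timer's service unit would invoke: archipelago-container-watchdog.sh run
if [ "${1:-}" = "run" ]; then
    watchdog_pass
fi
```

Note that this sketch also restarts containers that were stopped on purpose. If that is not wanted, an allow-list of container names would be one way to narrow it. A timer with `OnUnitActiveSec=2min` would give the 2-minute cadence.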
### Layer 3: Ordered startup
The Bitcoin stack has a dependency chain: bitcoin-knots → electrumx + lnd → mempool + btcpay + fedimint → UI containers. Create `/usr/local/bin/archipelago-ordered-start.sh` with wait-for-container logic between tiers.
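The wait-for-container logic could look roughly like the sketch below. It assumes the container names match the names in the chain above, treats a container as ready once its state is `running`, and leaves out the UI tier because its members are not listed here. The function names, the `run` argument, and `POLL_INTERVAL` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of /usr/local/bin/archipelago-ordered-start.sh (assumed layout).
export XDG_RUNTIME_DIR="${XDG_RUNTIME_DIR:-/run/user/1000}"

# Block until container $1 reports "running"; fail after $2 seconds (default 120).
wait_for_running() {
    local name=$1 timeout=${2:-120} waited=0 step=${POLL_INTERVAL:-2}
    until [ "$(podman inspect "$name" --format "{{.State.Status}}" 2>/dev/null)" = "running" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$step"
        waited=$((waited + step))
    done
}

# Each argument is one tier: a space-separated list of container names.
# A tier is started as a group, then every member must be running before the next tier.
start_in_order() {
    local tier name
    for tier in "$@"; do
        for name in $tier; do
            podman start "$name" >/dev/null || return 1
        done
        for name in $tier; do
            wait_for_running "$name" || { echo "timeout waiting for $name" >&2; return 1; }
        done
    done
}

if [ "${1:-}" = "run" ]; then
    start_in_order "bitcoin-knots" "electrumx lnd" "mempool btcpay fedimint"
fi
```

Waiting for `running` is the weakest useful readiness signal. For services that define a health check, waiting for `{{.State.Health.Status}}` to report `healthy` would be a stricter alternative.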
### Verification
```bash
sudo reboot # then SSH back after 3 min
podman ps --format "{{.Names}}" | sort # should match pre-reboot list
```
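Comparing against the pre-reboot list is easier with a saved snapshot. A small sketch, assuming two hypothetical snapshot files written with `podman ps --format "{{.Names}}"` before and after the reboot:

```bash
#!/usr/bin/env bash
# Print container names present in snapshot $1 (before) but absent from snapshot $2 (after).
missing_after_reboot() {
    # comm -23 keeps only the lines unique to the first sorted input.
    comm -23 <(sort "$1") <(sort "$2")
}

# Before the reboot:  podman ps --format "{{.Names}}" > ~/containers.before
# After the reboot:   podman ps --format "{{.Names}}" > ~/containers.after
#                     missing_after_reboot ~/containers.before ~/containers.after
```

No output means every container that was running before the reboot is running again.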
## Systemd Requirements
The `archipelago.service` unit needs these settings for rootless Podman:
- `ProtectHome=no` (Podman stores containers in ~/.local/share/containers/)
- `PrivateTmp=no` (the runtime directory is /tmp/podman-run-1000/)
- Do not set `RestrictNamespaces=` or `SystemCallFilter=`
- `Environment=XDG_RUNTIME_DIR=/run/user/1000`
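These requirements can be checked against the unit text. The sketch below uses a hypothetical helper, `check_unit_text`, that reads the output of `systemctl cat archipelago.service` and prints each violation. It assumes a missing `ProtectHome=` or `PrivateTmp=` line is acceptable, since systemd leaves both off by default. It does not model drop-in override order or `Environment=` lines that carry several variables.

```bash
#!/usr/bin/env bash
# Read unit file text on stdin; print one line per violated requirement.
# Returns 1 if any problem was found, 0 otherwise.
check_unit_text() {
    awk -F= '
        function bad(msg) { print msg; problems++ }
        /^[[:space:]]*[#;]/ { next }    # skip comment lines
        $1 == "ProtectHome" && $2 !~ /^(no|false|off|0)$/ { bad("ProtectHome=" $2) }
        $1 == "PrivateTmp"  && $2 !~ /^(no|false|off|0)$/ { bad("PrivateTmp=" $2) }
        ($1 == "RestrictNamespaces" || $1 == "SystemCallFilter") && $2 != "" { bad($1 " is set") }
        $0 == "Environment=XDG_RUNTIME_DIR=/run/user/1000" { have_env = 1 }
        END {
            if (!have_env) bad("missing Environment=XDG_RUNTIME_DIR=/run/user/1000")
            exit problems ? 1 : 0
        }
    '
}
```

Usage: `systemctl cat archipelago.service | check_unit_text && echo "unit looks compatible with rootless Podman"`.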