Dorian 1e283daf13 fix: overhaul container lifecycle — recovery, health, uninstall, UI state
Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00

name: podman
description: Rootless Podman container management: diagnose, fix, and harden uptime. Use for container issues, port problems, UID mapping, health checks, or uptime hardening.
disable-model-invocation: true
allowed-tools: Bash, Read, Edit, Write, Glob, Grep
argument-hint: [diagnose|fix|uptime] [container-name]

Podman — Container Management

Archipelago runs rootless Podman as the archipelago user (UID 1000), so all podman commands run without sudo. UID mapping: container UID N → host UID (100000 + N).

SSH: ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228

Diagnose

# Container status
podman ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}\t{{.Networks}}"

# Restart policies (must be "unless-stopped")
for c in $(podman ps -a --format "{{.Names}}"); do
  echo -n "$c: "; podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}"
done

# Health checks
for c in $(podman ps --format "{{.Names}}"); do
  health=$(podman inspect "$c" --format "{{.State.Health.Status}}" 2>/dev/null)
  [ -n "$health" ] && [ "$health" != "<no value>" ] && echo "$c: $health"
done

# Resource usage + recent deaths
podman stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
podman events --filter event=died --since 24h --stream=false 2>/dev/null | tail -10  # --stream=false exits after the last known event instead of following

# Rootless prerequisites
echo "XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR"  # must be /run/user/1000
grep archipelago /etc/subuid                # must show archipelago:100000:65536
ls /var/lib/systemd/linger/ | grep archipelago  # must exist
grep DEFAULT_FORWARD_POLICY /etc/default/ufw    # must be ACCEPT

Cross-check 4 layers for port consistency: Backend config (package.rs) → Podman ports → Nginx proxy → Frontend appLauncher.ts. See references/port-map.md.
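
As a rough illustration of the cross-check for two of the four layers, the sketch below extracts published host ports from Podman's Ports column and upstream ports from nginx proxy_pass lines so the two sets can be compared. The function names are hypothetical, and the assumption that nginx proxies to 127.0.0.1 or localhost is not taken from the Archipelago configuration. Published port ranges are not handled.

```shell
# Hypothetical helpers; names and input formats are assumptions.

# Read `podman ps --format "{{.Ports}}"` output on stdin and print the
# published host ports (the number before "->"), one per line, sorted.
podman_host_ports() {
  grep -oE ':[0-9]+->' | tr -cd '0-9\n' | sort -u
}

# Read nginx configuration text on stdin and print the ports used by
# proxy_pass directives that target the local host.
nginx_upstream_ports() {
  grep -oE 'proxy_pass[[:space:]]+https?://(127\.0\.0\.1|localhost):[0-9]+' \
    | grep -oE '[0-9]+$' | sort -u
}

# Print ports that nginx proxies to but Podman does not publish.
# $1 = file of Podman ports, $2 = file of nginx ports (both from the helpers).
unpublished_ports() {
  comm -13 "$1" "$2"
}
```

One possible use: podman ps --format "{{.Ports}}" | podman_host_ports > /tmp/podman.ports, then sudo nginx -T 2>/dev/null | nginx_upstream_ports > /tmp/nginx.ports, then unpublished_ports /tmp/podman.ports /tmp/nginx.ports. Any port printed deserves a look in references/port-map.md.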

Fix

Restart policy missing: podman update --restart unless-stopped CONTAINER_NAME

UID mapping (permission denied): sudo chown -R HOST_UID:HOST_UID /var/lib/archipelago/APP. Formula: host_uid = 100000 + container_uid. See references/uid-mapping.md.
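
Before running chown, it can help to derive HOST_UID from the live mapping rather than from arithmetic alone. The sketch below reads a uid_map table (container start, host start, length per line) and translates one container UID; host_uid_for is a hypothetical helper name. With a default rootless map, container UID 0 maps to the user's own UID and the subordinate range starts at container UID 1, so the result can be one lower than the shorthand formula suggests; the live map is authoritative.

```shell
# Hypothetical helper: translate a container UID to a host UID.
# stdin: uid_map lines of the form "container_start host_start length".
# $1:    the container UID to look up.
host_uid_for() {
  local want="$1" cstart hstart len
  while read -r cstart hstart len; do
    [ -z "$len" ] && continue            # skip blank or malformed lines
    if [ "$want" -ge "$cstart" ] && [ "$want" -lt $((cstart + len)) ]; then
      echo $((hstart + want - cstart))
      return 0
    fi
  done
  return 1                               # UID is outside every mapped range
}
```

Usage, assuming the container process runs as UID 999: podman unshare cat /proc/self/uid_map | host_uid_for 999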

Port conflict: run ss -tlnp | grep :PORT to find the offending process. Ports cannot be added to a running container; it must be recreated.

Network missing: podman network connect archy-net CONTAINER_NAME

UFW blocking LAN: sudo sed -i 's/DEFAULT_FORWARD_POLICY="DROP"/DEFAULT_FORWARD_POLICY="ACCEPT"/' /etc/default/ufw && sudo ufw reload

Stale processes: pgrep -c -f "podman ps" counts running "podman ps" invocations. If the count exceeds 10, kill the stuck processes (for example with pkill -f "podman ps").

See references/common-failures.md for the full error→cause→fix lookup table.

Uptime Hardening

Layer 1: Restart policies

for c in $(podman ps -a --format "{{.Names}}"); do
  policy=$(podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}")
  if [ "$policy" = "no" ] || [ -z "$policy" ]; then
    podman update --restart unless-stopped "$c"
  fi
done

Layer 2: Watchdog timer

Create /usr/local/bin/archipelago-container-watchdog.sh to restart stopped or unhealthy containers, and run it every 2 minutes from a systemd timer. The script runs as the archipelago user with XDG_RUNTIME_DIR=/run/user/1000.
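
The watchdog script itself is not shown here, so the following is only a minimal sketch of what it might contain. The function names and the exact selection rule (exited, created, or stopped state, or running with an unhealthy health check) are assumptions. It reuses the inspect format strings from the Diagnose section.

```shell
# Sketch of /usr/local/bin/archipelago-container-watchdog.sh (assumed contents).
export XDG_RUNTIME_DIR="${XDG_RUNTIME_DIR:-/run/user/1000}"

# Read "name state health" lines on stdin and print the names of containers
# that are not running, or that are running but report "unhealthy".
needs_restart() {
  local name state health
  while read -r name state health; do
    case "$state" in
      exited|created|stopped) echo "$name" ;;
      running) if [ "$health" = "unhealthy" ]; then echo "$name"; fi ;;
    esac
  done
  return 0
}

# One watchdog pass: collect state and health per container, then restart
# every container that needs_restart selects.
run_watchdog() {
  local c state health
  for c in $(podman ps -a --format "{{.Names}}"); do
    state=$(podman inspect "$c" --format "{{.State.Status}}" 2>/dev/null)
    health=$(podman inspect "$c" --format "{{.State.Health.Status}}" 2>/dev/null)
    echo "$c $state $health"
  done | needs_restart | while read -r c; do
    echo "watchdog: restarting $c"
    podman restart "$c" </dev/null
  done
}

# The installed script would end by calling: run_watchdog
```

This sketch also restarts containers that were stopped on purpose and ignores dependency order, so a real script may need an exclusion list. For the 2-minute cadence, a timer unit can use OnUnitActiveSec=2min, paired with a service that sets User=archipelago.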

Layer 3: Ordered startup

The Bitcoin stack has a dependency chain: bitcoin-knots → electrumx + lnd → mempool + btcpay + fedimint → UI containers. Create /usr/local/bin/archipelago-ordered-start.sh with wait-for-container logic between tiers.
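
A minimal sketch of the wait-for-container logic, assuming the container names match the names in the chain and that "running" is an adequate readiness signal. A real script may want to wait on health checks or RPC availability instead. The function names, the 120-second default timeout, and the 2-second poll interval are illustrative choices.

```shell
# Sketch of /usr/local/bin/archipelago-ordered-start.sh (assumed contents).

# Poll until the container reports "running" or the timeout (seconds) expires.
wait_for_container() {
  local name="$1" timeout="${2:-120}" waited=0 state
  while [ "$waited" -lt "$timeout" ]; do
    state=$(podman inspect "$name" --format "{{.State.Status}}" 2>/dev/null)
    [ "$state" = "running" ] && return 0
    sleep 2
    waited=$((waited + 2))
  done
  return 1
}

# Each argument is one tier: a space-separated list of container names.
# Start a whole tier, then wait for every member before the next tier.
start_tiers() {
  local tier c
  for tier in "$@"; do
    for c in $tier; do
      podman start "$c" >/dev/null || echo "ordered-start: could not start $c" >&2
    done
    for c in $tier; do
      if ! wait_for_container "$c"; then
        echo "ordered-start: $c is not running, stopping at this tier" >&2
        return 1
      fi
    done
  done
}

# Example call (the UI tier is omitted because its members are not listed):
# start_tiers "bitcoin-knots" "electrumx lnd" "mempool btcpay fedimint"
```

Starting a tier in full before waiting lets its members come up in parallel, while the return on failure keeps later tiers from starting against a missing dependency.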

Verification

sudo reboot  # then SSH back after 3 min
podman ps --format "{{.Names}}" | sort  # should match pre-reboot list
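
To make "should match pre-reboot list" checkable, the list can be saved before the reboot and compared afterwards. This is a sketch only; missing_after_reboot and the file locations in the usage note are hypothetical.

```shell
# Hypothetical helper: print containers that were running before the reboot
# but are absent afterwards. Both files hold one name per line, sorted.
# Returns 0 when nothing is missing and 1 otherwise.
missing_after_reboot() {
  local missing
  missing=$(comm -23 "$1" "$2")
  [ -z "$missing" ] && return 0
  echo "$missing"
  return 1
}
```

Before the reboot, run podman ps --format "{{.Names}}" | sort > ~/containers.before (the home directory is used because /tmp may be cleared). After the reboot, write the same listing to ~/containers.after and run missing_after_reboot ~/containers.before ~/containers.after.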

Systemd Requirements

The archipelago.service needs these for rootless Podman:

  • ProtectHome=no (podman stores in ~/.local/share/containers/)
  • PrivateTmp=no (runtime in /tmp/podman-run-1000/)
  • Do not set RestrictNamespaces= or SystemCallFilter=
  • Environment=XDG_RUNTIME_DIR=/run/user/1000
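
The requirements above can be checked mechanically. The sketch below scans a unit file for settings that conflict with them; lint_unit_for_podman is a hypothetical name, and the check only matches directives written one per line in the usual Key=value form.

```shell
# Hypothetical lint: print each finding for a unit file that conflicts with
# the rootless Podman requirements; return non-zero if any were found.
lint_unit_for_podman() {
  local unit="$1" rc=0
  if grep -Eq '^ProtectHome=(yes|true|read-only|tmpfs)' "$unit"; then
    echo "ProtectHome must be no"; rc=1
  fi
  if grep -Eq '^PrivateTmp=(yes|true)' "$unit"; then
    echo "PrivateTmp must be no"; rc=1
  fi
  if grep -Eq '^(RestrictNamespaces|SystemCallFilter)=' "$unit"; then
    echo "RestrictNamespaces and SystemCallFilter must not be set"; rc=1
  fi
  if ! grep -Eq '^Environment=.*XDG_RUNTIME_DIR=/run/user/1000' "$unit"; then
    echo "Environment=XDG_RUNTIME_DIR=/run/user/1000 is missing"; rc=1
  fi
  return "$rc"
}
```

One way to feed it the effective unit: systemctl cat archipelago.service > /tmp/archipelago.unit, then lint_unit_for_podman /tmp/archipelago.unit.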