- Correct off-by-one in UID mapping: container UID N → host UID (100000 + N - 1), not (100000 + N) - Deploy script auto-fixes UID ownership on every deploy - Bitcoin UI nginx uses __BITCOIN_RPC_AUTH__ placeholder injected from secrets at deploy time - container rules updated for rootless podman architecture Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
13 KiB
name, description, allowed-tools
| name | description | allowed-tools |
|---|---|---|
| podman-uptime | Ensure 100% container uptime on Archipelago. Sets up systemd watchdog timers, verifies restart policies, creates health check monitors, and configures auto-recovery for all containers. Handles rootless Podman (user: archipelago, UID 1000, subuid 100000:65536). Use when asked to "ensure uptime", "containers keep dying", "auto-restart", "watchdog", "container monitoring", "uptime guarantee", "keep containers running", "survive reboot", or to harden container reliability. | Bash Read Edit Write Glob Grep |
Podman Uptime — Container Reliability Guardian
Ensures every Archipelago container survives reboots, recovers from crashes, and stays healthy. Sets up the three layers of uptime defense: restart policies, systemd watchdog, and health-based auto-recovery.
SSH command: ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228
ROOTLESS PODMAN: All
podmancommands run as thearchipelagouser — NO sudo. Only usesudofor: systemd unit files, chown on volumes, UFW changes. The archipelago user runs containers directly via user namespaces.
Prerequisites for Rootless Uptime
Before setting up uptime infrastructure, verify rootless Podman basics are working:
# Must be the archipelago user
whoami # archipelago
# User lingering must be enabled (keeps user services running after logout)
ls /var/lib/systemd/linger/ | grep archipelago || sudo loginctl enable-linger archipelago
# XDG_RUNTIME_DIR must be set
echo $XDG_RUNTIME_DIR # /run/user/1000
# Subuid/subgid must be configured
grep archipelago /etc/subuid # archipelago:100000:65536
# UFW forward policy must be ACCEPT (for LAN access to containers)
grep DEFAULT_FORWARD_POLICY /etc/default/ufw # Must be "ACCEPT"
Layer 1: Restart Policies (Survive Reboots)
Every container MUST have --restart unless-stopped. This is non-negotiable.
Audit and fix all containers
# Audit
for c in $(podman ps -a --format "{{.Names}}"); do
policy=$(podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}")
echo "$c: $policy"
done
# Fix any with "no" or empty policy
for c in $(podman ps -a --format "{{.Names}}"); do
policy=$(podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}")
if [ "$policy" = "no" ] || [ -z "$policy" ]; then
echo "Fixing: $c"
podman update --restart unless-stopped "$c"
fi
done
Ensure podman auto-starts containers on boot
For rootless Podman, containers with restart policies are auto-started by podman-restart as a user service:
# Enable the rootless podman-restart user service
systemctl --user enable podman-restart.service 2>/dev/null
# If the user service doesn't exist, create a system-level one
# (runs as archipelago user via User= directive)
cat <<'EOF' | sudo tee /etc/systemd/system/podman-restart.service
[Unit]
Description=Podman Start All Containers With Restart Policy
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=archipelago
Group=archipelago
Environment=XDG_RUNTIME_DIR=/run/user/1000
ExecStart=/usr/bin/podman start --all --filter restart-policy=unless-stopped
RemainAfterExit=yes
TimeoutStartSec=300
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable podman-restart.service
Layer 2: Systemd Watchdog (Detect and Recover)
Create a systemd timer that checks container health every 2 minutes and restarts unhealthy or stopped containers.
Create the watchdog script
cat <<'SCRIPT' | sudo tee /usr/local/bin/archipelago-container-watchdog.sh
#!/bin/bash
# Archipelago Container Watchdog (Rootless Podman)
# Runs as archipelago user — NO sudo for podman commands
LOG_TAG="container-watchdog"
# Run podman as the archipelago user with correct XDG path
export XDG_RUNTIME_DIR=/run/user/1000
PODMAN="/usr/bin/podman"
# Restart any stopped containers that should be running (have restart policy)
for c in $($PODMAN ps -a --filter status=exited --filter restart-policy=unless-stopped --format "{{.Names}}" 2>/dev/null); do
logger -t "$LOG_TAG" "Restarting stopped container: $c"
$PODMAN start "$c" 2>&1 | logger -t "$LOG_TAG"
done
# Restart unhealthy containers
for c in $($PODMAN ps --filter health=unhealthy --format "{{.Names}}" 2>/dev/null); do
logger -t "$LOG_TAG" "Restarting unhealthy container: $c"
$PODMAN restart "$c" 2>&1 | logger -t "$LOG_TAG"
done
# Check for containers in "created" state (never started)
for c in $($PODMAN ps -a --filter status=created --format "{{.Names}}" 2>/dev/null); do
logger -t "$LOG_TAG" "Starting created container: $c"
$PODMAN start "$c" 2>&1 | logger -t "$LOG_TAG"
done
SCRIPT
sudo chmod +x /usr/local/bin/archipelago-container-watchdog.sh
Create the systemd timer
# Service unit — runs as archipelago user for rootless podman
cat <<'EOF' | sudo tee /etc/systemd/system/archipelago-watchdog.service
[Unit]
Description=Archipelago Container Watchdog
After=podman-restart.service
[Service]
Type=oneshot
User=archipelago
Group=archipelago
Environment=XDG_RUNTIME_DIR=/run/user/1000
ExecStart=/usr/local/bin/archipelago-container-watchdog.sh
EOF
# Timer unit — runs every 2 minutes
cat <<'EOF' | sudo tee /etc/systemd/system/archipelago-watchdog.timer
[Unit]
Description=Run Archipelago Container Watchdog every 2 minutes
[Timer]
OnBootSec=120
OnUnitActiveSec=120
AccuracySec=30
[Install]
WantedBy=timers.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now archipelago-watchdog.timer
Verify watchdog is running
sudo systemctl status archipelago-watchdog.timer
sudo systemctl list-timers | grep archipelago
# Check watchdog logs
sudo journalctl -t container-watchdog --since "1 hour ago" --no-pager
Layer 3: Dependency-Aware Startup Order
Some containers depend on others. The watchdog handles restarts, but initial boot order matters.
Create ordered startup script
cat <<'SCRIPT' | sudo tee /usr/local/bin/archipelago-ordered-start.sh
#!/bin/bash
# Ordered container startup for Archipelago (Rootless Podman)
# Runs as archipelago user — NO sudo for podman commands
# Respects dependency chain: bitcoin → electrs/lnd → mempool/btcpay
LOG_TAG="ordered-start"
export XDG_RUNTIME_DIR=/run/user/1000
PODMAN="/usr/bin/podman"
wait_for_container() {
local name=$1
local max_wait=${2:-60}
local waited=0
while [ $waited -lt $max_wait ]; do
status=$($PODMAN inspect "$name" --format "{{.State.Running}}" 2>/dev/null)
if [ "$status" = "true" ]; then
logger -t "$LOG_TAG" "$name is running"
return 0
fi
sleep 5
waited=$((waited + 5))
done
logger -t "$LOG_TAG" "WARNING: $name not running after ${max_wait}s"
return 1
}
# Tier 0: Infrastructure
logger -t "$LOG_TAG" "Starting Tier 0: Infrastructure"
$PODMAN start tailscale 2>/dev/null
# Tier 1: Databases (must start before services that depend on them)
logger -t "$LOG_TAG" "Starting Tier 1: Databases"
$PODMAN start mempool-db 2>/dev/null
$PODMAN start btcpay-postgres 2>/dev/null
$PODMAN start immich_postgres 2>/dev/null
sleep 5
# Tier 2: Bitcoin (foundation for Lightning and explorers)
logger -t "$LOG_TAG" "Starting Tier 2: Bitcoin"
$PODMAN start bitcoin-knots 2>/dev/null
wait_for_container bitcoin-knots 120
# Tier 3: Bitcoin-dependent services
logger -t "$LOG_TAG" "Starting Tier 3: Bitcoin-dependent"
$PODMAN start electrumx 2>/dev/null
$PODMAN start lnd 2>/dev/null
wait_for_container electrumx 90
wait_for_container lnd 90
# Tier 4: Services depending on Tier 3
logger -t "$LOG_TAG" "Starting Tier 4: Second-order dependencies"
$PODMAN start mempool 2>/dev/null
$PODMAN start nbxplorer 2>/dev/null
sleep 10
$PODMAN start btcpay-server 2>/dev/null
$PODMAN start fedimint 2>/dev/null
$PODMAN start fedimint-gateway 2>/dev/null
# Tier 5: Independent apps (start all remaining)
logger -t "$LOG_TAG" "Starting Tier 5: Independent apps"
$PODMAN start --all 2>/dev/null
# Tier 6: UI containers (need parent apps running first)
logger -t "$LOG_TAG" "Starting Tier 6: UI containers"
$PODMAN start bitcoin-ui 2>/dev/null
$PODMAN start lnd-ui 2>/dev/null
$PODMAN start electrs-ui 2>/dev/null
logger -t "$LOG_TAG" "Startup sequence complete"
SCRIPT
sudo chmod +x /usr/local/bin/archipelago-ordered-start.sh
Wire into boot sequence
# Runs as archipelago user for rootless podman
cat <<'EOF' | sudo tee /etc/systemd/system/archipelago-containers.service
[Unit]
Description=Archipelago Ordered Container Startup
After=network-online.target
Wants=network-online.target
Before=archipelago.service
[Service]
Type=oneshot
User=archipelago
Group=archipelago
Environment=XDG_RUNTIME_DIR=/run/user/1000
ExecStart=/usr/local/bin/archipelago-ordered-start.sh
RemainAfterExit=yes
TimeoutStartSec=600
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable archipelago-containers.service
Rootless-Specific Uptime Considerations
Volume ownership survives reboots
Volume ownership doesn't change on reboot, but if a container image is updated (re-pulled), the new container may run as a different UID. Always verify after image updates:
# Quick ownership audit after image pull
podman inspect CONTAINER_NAME --format "{{.Config.User}}"
# Then verify: sudo stat -c '%u:%g' /var/lib/archipelago/APP_NAME
# Formula: host_uid = 100000 + container_uid
XDG_RUNTIME_DIR on boot
Rootless Podman requires /run/user/1000 to exist. This is created by pam_systemd when the user logs in, or by loginctl enable-linger. If it's missing after boot, containers won't start.
# Verify it exists
ls -la /run/user/1000/ || echo "CRITICAL: /run/user/1000 missing — run: sudo loginctl enable-linger archipelago"
Systemd sandbox must not block podman
If the archipelago.service sandbox blocks namespace/syscall operations, the Rust backend can't scan containers. See Fix 10 in /podman-fix.
Verification Checklist
After setting up all 3 layers, verify:
echo "=== Rootless Podman Prerequisites ==="
echo "User: $(whoami)"
echo "XDG_RUNTIME_DIR: $XDG_RUNTIME_DIR"
grep archipelago /etc/subuid | head -1
ls /var/lib/systemd/linger/ | grep archipelago && echo "Linger: enabled" || echo "Linger: DISABLED"
grep DEFAULT_FORWARD_POLICY /etc/default/ufw
echo ""
echo "=== Layer 1: Restart Policies ==="
for c in $(podman ps -a --format "{{.Names}}"); do
policy=$(podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}")
echo " $c: $policy"
done
echo ""
echo "=== Layer 2: Watchdog Timer ==="
sudo systemctl is-active archipelago-watchdog.timer
sudo systemctl list-timers | grep archipelago
echo ""
echo "=== Layer 3: Boot Services ==="
sudo systemctl is-enabled podman-restart.service 2>/dev/null || echo "podman-restart: not found"
sudo systemctl is-enabled archipelago-containers.service 2>/dev/null || echo "ordered-start: not found"
sudo systemctl is-enabled archipelago-watchdog.timer 2>/dev/null || echo "watchdog: not found"
echo ""
echo "=== Container Health Summary ==="
total=$(podman ps -a --format "{{.Names}}" | wc -l)
running=$(podman ps --format "{{.Names}}" | wc -l)
stopped=$((total - running))
unhealthy=$(podman ps --filter health=unhealthy --format "{{.Names}}" | wc -l)
echo " Total: $total | Running: $running | Stopped: $stopped | Unhealthy: $unhealthy"
echo ""
echo "=== Volume Ownership Spot Check ==="
for dir in bitcoin lnd grafana; do
if [ -d "/var/lib/archipelago/$dir" ]; then
echo " $dir: $(stat -c '%u:%g' /var/lib/archipelago/$dir)"
fi
done
Reboot Test
The ultimate uptime test — reboot the server and verify everything comes back:
# Before reboot: record running containers
podman ps --format "{{.Names}}" | sort > /tmp/before-reboot.txt
# Reboot
sudo reboot
# After reboot (wait ~3 minutes, then SSH back in):
podman ps --format "{{.Names}}" | sort > /tmp/after-reboot.txt
# Compare
diff /tmp/before-reboot.txt /tmp/after-reboot.txt
# Should show no differences
# Also verify XDG_RUNTIME_DIR survived reboot
ls /run/user/1000/ || echo "CRITICAL: lingering not working"
Monitoring
Check uptime status anytime:
# Quick status
podman ps -a --format "table {{.Names}}\t{{.Status}}" | sort
# Watchdog activity
sudo journalctl -t container-watchdog --since "24 hours ago" --no-pager
# Container events (starts, stops, deaths)
podman events --since 24h --filter event=start --filter event=stop --filter event=died 2>/dev/null | tail -30
# Check for permission denied errors (rootless UID mapping issue)
podman ps -a --filter status=exited --format "{{.Names}}" | while read c; do
podman logs --tail 5 "$c" 2>&1 | grep -i "permission denied" && echo " ^ UID mapping issue in: $c"
done
Integration
- Run
/podman-doctorfirst to identify issues (includes rootless health checks) - Run
/podman-fixfor specific container repairs (includes UID mapping fixes) - Run
/podman-uptimeto set up permanent reliability infrastructure - Add to ISO build: copy watchdog scripts to
image-recipe/configs/and enable in first-boot