lfg2025/archy

Dorian 5008cb6d1f fix: rootless UID mapping corrections + credential injection

- Correct off-by-one in UID mapping: container UID N → host UID
  (100000 + N - 1), not (100000 + N)
- Deploy script auto-fixes UID ownership on every deploy
- Bitcoin UI nginx uses __BITCOIN_RPC_AUTH__ placeholder injected
  from secrets at deploy time
- container rules updated for rootless podman architecture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 15:57:16 +00:00

name: podman-uptime
description: Ensure 100% container uptime on Archipelago. Sets up systemd watchdog timers, verifies restart policies, creates health check monitors, and configures auto-recovery for all containers. Handles rootless Podman (user: archipelago, UID 1000, subuid 100000:65536). Use when asked to "ensure uptime", "containers keep dying", "auto-restart", "watchdog", "container monitoring", "uptime guarantee", "keep containers running", "survive reboot", or to harden container reliability.
allowed-tools: Bash Read Edit Write Glob Grep

Podman Uptime — Container Reliability Guardian

Ensures every Archipelago container survives reboots, recovers from crashes, and stays healthy. Sets up three layers of uptime defense: restart policies, a systemd watchdog with health-based auto-recovery, and dependency-aware startup ordering.

SSH command: ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228

ROOTLESS PODMAN: All podman commands run as the archipelago user — NO sudo. Only use sudo for: systemd unit files, chown on volumes, UFW changes. The archipelago user runs containers directly via user namespaces.

Prerequisites for Rootless Uptime

Before setting up uptime infrastructure, verify rootless Podman basics are working:

# Must be the archipelago user
whoami  # archipelago

# User lingering must be enabled (keeps user services running after logout)
ls /var/lib/systemd/linger/ | grep archipelago || sudo loginctl enable-linger archipelago

# XDG_RUNTIME_DIR must be set
echo $XDG_RUNTIME_DIR  # /run/user/1000

# Subuid/subgid must be configured
grep archipelago /etc/subuid  # archipelago:100000:65536

# UFW forward policy must be ACCEPT (for LAN access to containers)
grep DEFAULT_FORWARD_POLICY /etc/default/ufw  # Must be "ACCEPT"
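
The subuid check above only prints the matching line. A minimal sketch of a scriptable version follows, assuming the standard name:start:count layout of /etc/subuid; subid_range_ok is a hypothetical helper name, not part of Podman or shadow-utils.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if USER has a subordinate-ID range of at least MIN_COUNT IDs.
# subid_range_ok is a hypothetical helper, not a Podman or shadow-utils command.
# Usage: subid_range_ok USER [FILE] [MIN_COUNT]
subid_range_ok() {
  local user=$1 file=${2:-/etc/subuid} min=${3:-65536}
  # Each line has the form name:start:count
  awk -F: -v u="$user" -v min="$min" '
    $1 == u && $3 + 0 >= min + 0 { found = 1 }
    END { exit !found }
  ' "$file"
}

# Example (not run here):
#   subid_range_ok archipelago /etc/subuid && subid_range_ok archipelago /etc/subgid \
#     || echo "subuid/subgid range missing or too small"
```

Because the function returns an exit status instead of printing, it can gate the rest of a setup script.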

Layer 1: Restart Policies (Survive Reboots)

Every container MUST have --restart unless-stopped. This is non-negotiable.

Audit and fix all containers

# Audit
for c in $(podman ps -a --format "{{.Names}}"); do
  policy=$(podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}")
  echo "$c: $policy"
done

# Fix any with "no" or empty policy
for c in $(podman ps -a --format "{{.Names}}"); do
  policy=$(podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}")
  if [ "$policy" = "no" ] || [ -z "$policy" ]; then
    echo "Fixing: $c"
    podman update --restart unless-stopped "$c"
  fi
done
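
To preview what the fix loop would change before running it, the decision can be split out into a small filter. This is a sketch only: containers_needing_restart_fix is a hypothetical name, and it applies the same rule as the loop above (a policy of "no" or an empty policy needs fixing).

```shell
#!/usr/bin/env bash
# Sketch: read "name policy" pairs on stdin and print the names whose
# restart policy is empty or "no". Hypothetical helper, dry-run only.
containers_needing_restart_fix() {
  local name policy
  while read -r name policy; do
    [ -n "$name" ] || continue
    if [ -z "$policy" ] || [ "$policy" = "no" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Example feed, using the same podman calls as the audit loop (not run here):
#   for c in $(podman ps -a --format "{{.Names}}"); do
#     printf '%s %s\n' "$c" "$(podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}")"
#   done | containers_needing_restart_fix
```

Nothing is modified by this filter; it only lists candidates for podman update.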

Ensure podman auto-starts containers on boot

For rootless Podman, containers with restart policies are auto-started by podman-restart as a user service:

# Enable the rootless podman-restart user service
systemctl --user enable podman-restart.service 2>/dev/null

# If the user service doesn't exist, create a system-level one
# (runs as archipelago user via User= directive)
cat <<'EOF' | sudo tee /etc/systemd/system/podman-restart.service
[Unit]
Description=Podman Start All Containers With Restart Policy
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=archipelago
Group=archipelago
Environment=XDG_RUNTIME_DIR=/run/user/1000
ExecStart=/usr/bin/podman start --all --filter restart-policy=unless-stopped
RemainAfterExit=yes
TimeoutStartSec=300

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable podman-restart.service

Layer 2: Systemd Watchdog (Detect and Recover)

Create a systemd timer that checks container health every 2 minutes and restarts unhealthy or stopped containers.

Create the watchdog script

cat <<'SCRIPT' | sudo tee /usr/local/bin/archipelago-container-watchdog.sh
#!/bin/bash
# Archipelago Container Watchdog (Rootless Podman)
# Runs as archipelago user — NO sudo for podman commands

LOG_TAG="container-watchdog"

# Run podman as the archipelago user with correct XDG path
export XDG_RUNTIME_DIR=/run/user/1000
PODMAN="/usr/bin/podman"

# Restart any stopped containers that should be running (have restart policy)
for c in $($PODMAN ps -a --filter status=exited --filter restart-policy=unless-stopped --format "{{.Names}}" 2>/dev/null); do
  logger -t "$LOG_TAG" "Restarting stopped container: $c"
  $PODMAN start "$c" 2>&1 | logger -t "$LOG_TAG"
done

# Restart unhealthy containers
for c in $($PODMAN ps --filter health=unhealthy --format "{{.Names}}" 2>/dev/null); do
  logger -t "$LOG_TAG" "Restarting unhealthy container: $c"
  $PODMAN restart "$c" 2>&1 | logger -t "$LOG_TAG"
done

# Check for containers in "created" state (never started)
for c in $($PODMAN ps -a --filter status=created --format "{{.Names}}" 2>/dev/null); do
  logger -t "$LOG_TAG" "Starting created container: $c"
  $PODMAN start "$c" 2>&1 | logger -t "$LOG_TAG"
done
SCRIPT

sudo chmod +x /usr/local/bin/archipelago-container-watchdog.sh

Create the systemd timer

# Service unit — runs as archipelago user for rootless podman
cat <<'EOF' | sudo tee /etc/systemd/system/archipelago-watchdog.service
[Unit]
Description=Archipelago Container Watchdog
After=podman-restart.service

[Service]
Type=oneshot
User=archipelago
Group=archipelago
Environment=XDG_RUNTIME_DIR=/run/user/1000
ExecStart=/usr/local/bin/archipelago-container-watchdog.sh
EOF

# Timer unit — runs every 2 minutes
cat <<'EOF' | sudo tee /etc/systemd/system/archipelago-watchdog.timer
[Unit]
Description=Run Archipelago Container Watchdog every 2 minutes

[Timer]
OnBootSec=120
OnUnitActiveSec=120
AccuracySec=30

[Install]
WantedBy=timers.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now archipelago-watchdog.timer

Verify watchdog is running

sudo systemctl status archipelago-watchdog.timer
sudo systemctl list-timers | grep archipelago
# Check watchdog logs
sudo journalctl -t container-watchdog --since "1 hour ago" --no-pager

Layer 3: Dependency-Aware Startup Order

Some containers depend on others. The watchdog handles restarts, but initial boot order matters.

Create ordered startup script

cat <<'SCRIPT' | sudo tee /usr/local/bin/archipelago-ordered-start.sh
#!/bin/bash
# Ordered container startup for Archipelago (Rootless Podman)
# Runs as archipelago user — NO sudo for podman commands
# Respects dependency chain: bitcoin → electrs/lnd → mempool/btcpay

LOG_TAG="ordered-start"
export XDG_RUNTIME_DIR=/run/user/1000
PODMAN="/usr/bin/podman"

wait_for_container() {
  local name=$1
  local max_wait=${2:-60}
  local waited=0
  local status
  while [ "$waited" -lt "$max_wait" ]; do
    status=$($PODMAN inspect "$name" --format "{{.State.Running}}" 2>/dev/null)
    if [ "$status" = "true" ]; then
      logger -t "$LOG_TAG" "$name is running"
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  logger -t "$LOG_TAG" "WARNING: $name not running after ${max_wait}s"
  return 1
}

# Tier 0: Infrastructure
logger -t "$LOG_TAG" "Starting Tier 0: Infrastructure"
$PODMAN start tailscale 2>/dev/null

# Tier 1: Databases (must start before services that depend on them)
logger -t "$LOG_TAG" "Starting Tier 1: Databases"
$PODMAN start mempool-db 2>/dev/null
$PODMAN start btcpay-postgres 2>/dev/null
$PODMAN start immich_postgres 2>/dev/null
sleep 5

# Tier 2: Bitcoin (foundation for Lightning and explorers)
logger -t "$LOG_TAG" "Starting Tier 2: Bitcoin"
$PODMAN start bitcoin-knots 2>/dev/null
wait_for_container bitcoin-knots 120

# Tier 3: Bitcoin-dependent services
logger -t "$LOG_TAG" "Starting Tier 3: Bitcoin-dependent"
$PODMAN start electrumx 2>/dev/null
$PODMAN start lnd 2>/dev/null
wait_for_container electrumx 90
wait_for_container lnd 90

# Tier 4: Services depending on Tier 3
logger -t "$LOG_TAG" "Starting Tier 4: Second-order dependencies"
$PODMAN start mempool 2>/dev/null
$PODMAN start nbxplorer 2>/dev/null
sleep 10
$PODMAN start btcpay-server 2>/dev/null
$PODMAN start fedimint 2>/dev/null
$PODMAN start fedimint-gateway 2>/dev/null

# Tier 5: UI containers (need parent apps running first)
logger -t "$LOG_TAG" "Starting Tier 5: UI containers"
$PODMAN start bitcoin-ui 2>/dev/null
$PODMAN start lnd-ui 2>/dev/null
$PODMAN start electrs-ui 2>/dev/null

# Tier 6: Independent apps (start all remaining)
# Runs last because "start --all" also starts every container not named above
logger -t "$LOG_TAG" "Starting Tier 6: Independent apps"
$PODMAN start --all 2>/dev/null

logger -t "$LOG_TAG" "Startup sequence complete"
SCRIPT

sudo chmod +x /usr/local/bin/archipelago-ordered-start.sh
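
The wait_for_container loop in the script only checks that a container is running, not that it is ready. The same polling pattern can be sketched with a pluggable condition. Both wait_until and container_healthy are hypothetical helper names. The .State.Health.Status field is an assumption about the installed Podman version (older releases expose it under a different key), and the check only applies to containers that define a healthcheck.

```shell
#!/usr/bin/env bash
# Sketch: poll an arbitrary command until it succeeds or a timeout expires.
# Usage: wait_until MAX_SECONDS INTERVAL COMMAND [ARGS...]
wait_until() {
  local max_wait=$1 interval=$2 waited=0
  shift 2
  while [ "$waited" -lt "$max_wait" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}

# Hypothetical predicate: true only when the container's healthcheck reports "healthy".
# Assumes the inspect field .State.Health.Status exists in this Podman version.
container_healthy() {
  [ "$(podman inspect "$1" --format '{{.State.Health.Status}}' 2>/dev/null)" = "healthy" ]
}

# Example (not run here):
#   wait_until 120 5 container_healthy bitcoin-knots || echo "bitcoin-knots not healthy yet"
```

Separating the timeout loop from the condition lets each tier wait on whatever signal fits it, such as running state, health status, or an open port.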

Wire into boot sequence

# Runs as archipelago user for rootless podman
cat <<'EOF' | sudo tee /etc/systemd/system/archipelago-containers.service
[Unit]
Description=Archipelago Ordered Container Startup
After=network-online.target
Wants=network-online.target
Before=archipelago.service

[Service]
Type=oneshot
User=archipelago
Group=archipelago
Environment=XDG_RUNTIME_DIR=/run/user/1000
ExecStart=/usr/local/bin/archipelago-ordered-start.sh
RemainAfterExit=yes
TimeoutStartSec=600

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable archipelago-containers.service

Rootless-Specific Uptime Considerations

Volume ownership survives reboots

Volume ownership doesn't change on reboot, but if a container image is updated (re-pulled), the new container may run as a different UID. Always verify after image updates:

# Quick ownership audit after image pull
podman inspect CONTAINER_NAME --format "{{.Config.User}}"
# Then verify: sudo stat -c '%u:%g' /var/lib/archipelago/APP_NAME
# Formula: host_uid = 100000 + container_uid - 1 (for container_uid >= 1)
# Container UID 0 maps to the archipelago user itself (host UID 1000)
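
The mapping arithmetic can be sketched as a small shell function. It assumes the default rootless mapping used in this document: subuid range starting at 100000, archipelago at host UID 1000, and no --userns or --uidmap overrides. Under that mapping, container UID 0 is the archipelago user itself and container UID N (N >= 1) lands at 100000 + N - 1. host_uid_for is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: compute the host UID that owns files written by a container UID
# under the default rootless mapping. Hypothetical helper, defaults assumed.
# Usage: host_uid_for CONTAINER_UID [SUBUID_START] [HOST_USER_UID]
host_uid_for() {
  local cuid=$1 start=${2:-100000} owner=${3:-1000}
  if [ "$cuid" -eq 0 ]; then
    # Container root maps to the unprivileged user running podman
    printf '%s\n' "$owner"
  else
    # Container UIDs 1..count map onto the subuid range, shifted by one
    printf '%s\n' "$((start + cuid - 1))"
  fi
}

# Example (not run here): compare against the actual owner of a volume
#   expected=$(host_uid_for 1000)
#   actual=$(sudo stat -c '%u' /var/lib/archipelago/APP_NAME)
#   [ "$expected" = "$actual" ] || echo "ownership mismatch: want $expected, got $actual"
```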

XDG_RUNTIME_DIR on boot

Rootless Podman requires /run/user/1000 to exist. systemd-logind creates it when the archipelago user logs in (via pam_systemd), or at boot when lingering has been enabled with loginctl enable-linger. If it is missing after boot, containers won't start.

# Verify it exists
ls -la /run/user/1000/ || echo "CRITICAL: /run/user/1000 missing — run: sudo loginctl enable-linger archipelago"

Systemd sandbox must not block podman

If the archipelago.service sandbox blocks namespace/syscall operations, the Rust backend can't scan containers. See Fix 10 in /podman-fix.

Verification Checklist

After setting up all 3 layers, verify:

echo "=== Rootless Podman Prerequisites ==="
echo "User: $(whoami)"
echo "XDG_RUNTIME_DIR: $XDG_RUNTIME_DIR"
grep archipelago /etc/subuid | head -1
ls /var/lib/systemd/linger/ | grep archipelago && echo "Linger: enabled" || echo "Linger: DISABLED"
grep DEFAULT_FORWARD_POLICY /etc/default/ufw

echo ""
echo "=== Layer 1: Restart Policies ==="
for c in $(podman ps -a --format "{{.Names}}"); do
  policy=$(podman inspect "$c" --format "{{.HostConfig.RestartPolicy.Name}}")
  echo "  $c: $policy"
done

echo ""
echo "=== Layer 2: Watchdog Timer ==="
sudo systemctl is-active archipelago-watchdog.timer
sudo systemctl list-timers | grep archipelago

echo ""
echo "=== Layer 3: Boot Services ==="
sudo systemctl is-enabled podman-restart.service 2>/dev/null || echo "podman-restart: not found"
sudo systemctl is-enabled archipelago-containers.service 2>/dev/null || echo "ordered-start: not found"
sudo systemctl is-enabled archipelago-watchdog.timer 2>/dev/null || echo "watchdog: not found"

echo ""
echo "=== Container Health Summary ==="
total=$(podman ps -a --format "{{.Names}}" | wc -l)
running=$(podman ps --format "{{.Names}}" | wc -l)
stopped=$((total - running))
unhealthy=$(podman ps --filter health=unhealthy --format "{{.Names}}" | wc -l)
echo "  Total: $total | Running: $running | Stopped: $stopped | Unhealthy: $unhealthy"

echo ""
echo "=== Volume Ownership Spot Check ==="
for dir in bitcoin lnd grafana; do
  if [ -d "/var/lib/archipelago/$dir" ]; then
    echo "  $dir: $(stat -c '%u:%g' /var/lib/archipelago/$dir)"
  fi
done

Reboot Test

The ultimate uptime test — reboot the server and verify everything comes back:

# Before reboot: record running containers
podman ps --format "{{.Names}}" | sort > /tmp/before-reboot.txt

# Reboot
sudo reboot

# After reboot (wait ~3 minutes, then SSH back in):
podman ps --format "{{.Names}}" | sort > /tmp/after-reboot.txt

# Compare
diff /tmp/before-reboot.txt /tmp/after-reboot.txt
# Should show no differences

# Also verify XDG_RUNTIME_DIR survived reboot
ls /run/user/1000/ || echo "CRITICAL: lingering not working"
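
A plain diff shows every difference in both directions. To list only the containers that failed to come back, comm can be used on the same two files. This is a sketch: missing_after_reboot is a hypothetical helper name, and it relies on both files being sorted, as they are when produced by the commands above.

```shell
#!/usr/bin/env bash
# Sketch: print container names present before the reboot but absent after it.
# Both input files must be sorted (the "| sort" above takes care of that).
# Usage: missing_after_reboot BEFORE_FILE AFTER_FILE
missing_after_reboot() {
  local before=$1 after=$2
  # comm -23 suppresses lines unique to file 2 and lines common to both,
  # leaving only lines unique to file 1
  comm -23 "$before" "$after"
}

# Example (not run here):
#   missing_after_reboot /tmp/before-reboot.txt /tmp/after-reboot.txt
```

Empty output means every container that was running before the reboot is running again.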

Monitoring

Check uptime status anytime:

# Quick status
podman ps -a --format "table {{.Names}}\t{{.Status}}" | sort

# Watchdog activity
sudo journalctl -t container-watchdog --since "24 hours ago" --no-pager

# Container events (starts, stops, deaths)
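# --stream=false makes podman events exit after the last known event instead of following
podman events --stream=false --since 24h --filter event=start --filter event=stop --filter event=died 2>/dev/null | tail -30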

# Check for permission denied errors (rootless UID mapping issue)
podman ps -a --filter status=exited --format "{{.Names}}" | while read -r c; do
  podman logs --tail 5 "$c" 2>&1 | grep -i "permission denied" && echo "  ^ UID mapping issue in: $c"
done

Integration

Run /podman-doctor first to identify issues (includes rootless health checks)
Run /podman-fix for specific container repairs (includes UID mapping fixes)
Run /podman-uptime to set up permanent reliability infrastructure
Add to ISO build: copy watchdog scripts to image-recipe/configs/ and enable in first-boot
