archy/.claude/plans/smooth-roaming-wadler.md
Dorian 1e283daf13 fix: overhaul container lifecycle — recovery, health, uninstall, UI state
Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00


Container Orchestration Dev Testing Infrastructure

Context

Container orchestration has been unreliable for months. Every fix requires a full deploy to .228 (5+ minutes), manual SSH debugging, and prayer. There is no way to test orchestration logic locally or to catch regressions before deploying. We need three layers of testing so that orchestration is bulletproof before it ever touches a server.

Three Layers

Layer C: Mock Podman in Rust Unit Tests (runs on macOS, instant)

Tests the orchestration LOGIC without any containers. Runs in cargo test, takes seconds.

What it tests: Retry backoff timing, restart tracker persistence, tier ordering, stop grace periods, failsafe install flow, health monitor state machine, crash recovery.

Implementation:

Create core/archipelago/src/container/mock_podman.rs — a fake podman command executor:

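use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64};
use std::sync::{Arc, Mutex};

use chrono::{DateTime, Utc};

/// In-memory stand-in for podman; the flags let a test inject failures.
pub struct MockPodman {
    containers: Arc<Mutex<HashMap<String, MockContainer>>>,
    fail_pull: Arc<AtomicBool>,        // simulate registry down
    fail_start: Arc<AtomicBool>,       // simulate container crash on start
    pull_delay_ms: Arc<AtomicU64>,     // simulate slow pull
}

struct MockContainer {
    name: String,
    image: String,
    state: ContainerState,  // Created/Running/Exited/Stopped
    exit_code: i32,
    created_at: DateTime<Utc>,
}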

Key trait to add in runtime.rs:

#[async_trait]
pub trait CommandExecutor: Send + Sync {
    async fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput>;
}

Production uses RealExecutor (calls tokio::process::Command). Tests use MockPodman.
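
A minimal sketch of how this executor seam might look, simplified to synchronous, std-only Rust so it stands alone (the trait in this plan is async via async_trait). The fields of CommandOutput, the String error type, the state and set_fail_start helpers, and the start_container function are assumptions for illustration, not the project's actual API.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Assumed shape of the output type; the real fields may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct CommandOutput {
    pub exit_code: i32,
    pub stdout: String,
}

/// Synchronous stand-in for the async trait described in the plan.
pub trait CommandExecutor: Send + Sync {
    fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput, String>;
}

/// Mock that tracks container state in memory instead of calling podman.
#[derive(Default)]
pub struct MockPodman {
    containers: Arc<Mutex<HashMap<String, String>>>, // name -> state
    fail_start: Arc<AtomicBool>,
}

impl MockPodman {
    /// Hypothetical test helper: current state of a container, if known.
    pub fn state(&self, name: &str) -> Option<String> {
        self.containers.lock().unwrap().get(name).cloned()
    }

    /// Hypothetical test helper: make subsequent `start` calls fail.
    pub fn set_fail_start(&self, fail: bool) {
        self.fail_start.store(fail, Ordering::SeqCst);
    }
}

impl CommandExecutor for MockPodman {
    fn execute(&self, _program: &str, args: &[&str]) -> Result<CommandOutput, String> {
        match args {
            ["start", name] => {
                if self.fail_start.load(Ordering::SeqCst) {
                    return Err(format!("mock: {name} failed to start"));
                }
                self.containers
                    .lock()
                    .unwrap()
                    .insert(name.to_string(), "running".to_string());
                Ok(CommandOutput { exit_code: 0, stdout: name.to_string() })
            }
            // Only `start` is modelled in this sketch.
            _ => Err(format!("mock: unsupported args {args:?}")),
        }
    }
}

/// Orchestration code depends only on the trait, so tests can pass the mock.
pub fn start_container(exec: &dyn CommandExecutor, name: &str) -> Result<(), String> {
    exec.execute("podman", &["start", name]).map(|_| ())
}
```

Because start_container only sees a &dyn CommandExecutor, the same function can run against the real executor in production and against MockPodman in cargo test.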

Test file: core/archipelago/tests/orchestration_tests.rs

Tests to write:

  1. test_stop_grace_periods — bitcoin gets 600s, lnd 330s, unknown gets 30s
  2. test_pull_retry_backoff — fail twice, succeed third, verify 5s/15s delays
  3. test_pull_all_attempts_fail — fail 3x, verify error returned
  4. test_restart_tracker_persistence — save to disk, reload, verify counters survive
  5. test_restart_tracker_stability_reset — after 1h, counters clear
  6. test_failsafe_install_rollback — container exits immediately, verify cleanup
  7. test_failsafe_install_image_missing — pull succeeds but image not found, verify error
  8. test_health_monitor_tier_ordering — databases restart before apps
  9. test_health_monitor_skips_user_stopped — user-stopped containers not restarted
  10. test_health_monitor_max_attempts — stops after 3 failures
  11. test_crash_recovery_loads_snapshot — PID file + snapshot → containers restarted
  12. test_crash_recovery_skips_user_stopped — user-stopped not recovered
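
As an illustration of tests 2 and 3, here is a sketch of retry logic that can be tested without waiting for real time to pass. The function name pull_with_retry and the idea of injecting the sleep function are assumptions; only the 5s/15s delays and the three-attempt limit come from the list above.

```rust
use std::time::Duration;

/// Delays between attempts, taken from the test list (5s, then 15s).
const RETRY_DELAYS: [Duration; 2] = [Duration::from_secs(5), Duration::from_secs(15)];

/// Runs `pull` up to three times, calling `sleep` between failed attempts.
/// Returns the last error if every attempt fails.
pub fn pull_with_retry<P, S>(mut pull: P, mut sleep: S) -> Result<(), String>
where
    P: FnMut() -> Result<(), String>,
    S: FnMut(Duration),
{
    let mut last_err = String::new();
    for attempt in 0..=RETRY_DELAYS.len() {
        match pull() {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
        // Sleep only if another attempt follows.
        if let Some(delay) = RETRY_DELAYS.get(attempt) {
            sleep(*delay);
        }
    }
    Err(last_err)
}
```

A test can pass a closure that records each requested delay instead of sleeping, then assert on the recorded values. Production code would pass a real sleep.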

Files to modify:

  • core/archipelago/src/container/mod.rs — add pub mod mock_podman;
  • core/archipelago/src/container/mock_podman.rs — NEW mock implementation
  • core/archipelago/tests/orchestration_tests.rs — NEW test file
  • core/archipelago/src/health_monitor.rs — extract logic into testable functions (pure functions that take data, not functions that call podman)
  • core/archipelago/src/api/rpc/package/runtime.rs — make stop_timeout_secs public for testing
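
The stop_timeout_secs change in the last bullet might look roughly like the sketch below. The grace periods are the ones named in test 1; the signature and the matching on a bare container name are assumptions, since the existing function is not shown here.

```rust
/// Grace period in seconds before a container is force-stopped.
/// Matching on the bare container name is an assumption of this sketch.
pub fn stop_timeout_secs(container: &str) -> u64 {
    match container {
        "bitcoin" => 600,
        "lnd" => 330,
        _ => 30,
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_stop_grace_periods() {
        assert_eq!(stop_timeout_secs("bitcoin"), 600);
        assert_eq!(stop_timeout_secs("lnd"), 330);
        assert_eq!(stop_timeout_secs("filebrowser"), 30);
    }
}
```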

Key refactors to make code testable:

  • Make stop_timeout_secs() a pub fn so tests can call it directly
  • Extract health monitor check_and_restart() into a function that takes container list + tracker + user_stopped, returns actions to take (restart X, notify Y, skip Z) — pure logic, no IO
  • Extract RestartTracker + RestartHistory into own file for independent testing
  • Make pull_image_with_progress retry logic independent of progress streaming
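
The pure-logic extraction of check_and_restart() could be sketched as follows. All type names, the numeric tier field, and the plan_actions name are hypothetical; the sketch only shows the shape of a function that takes data and returns actions without calling podman.

```rust
use std::collections::{HashMap, HashSet};

/// What the health monitor should do for one container (names are assumed).
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Restart(String),
    Notify(String), // gave up: restart attempts exhausted
    Skip(String),   // stopped by the user on purpose
}

/// Minimal view of a container for decision making (fields are assumed).
pub struct ContainerInfo {
    pub name: String,
    pub running: bool,
    pub tier: u8, // lower tiers (e.g. databases) are handled first
}

/// Pure decision step: data in, actions out, no IO.
pub fn plan_actions(
    containers: &[ContainerInfo],
    restart_counts: &HashMap<String, u32>,
    user_stopped: &HashSet<String>,
    max_attempts: u32,
) -> Vec<Action> {
    let mut down: Vec<&ContainerInfo> = containers.iter().filter(|c| !c.running).collect();
    // Stable sort: containers within one tier keep their input order.
    down.sort_by_key(|c| c.tier);
    down.into_iter()
        .map(|c| {
            let attempts = restart_counts.get(&c.name).copied().unwrap_or(0);
            if user_stopped.contains(&c.name) {
                Action::Skip(c.name.clone())
            } else if attempts >= max_attempts {
                Action::Notify(c.name.clone())
            } else {
                Action::Restart(c.name.clone())
            }
        })
        .collect()
}
```

With this split, the tier-ordering, user-stopped, and max-attempts tests reduce to building input data and comparing the returned Vec<Action>. A thin IO layer then executes the actions through the executor.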

Layer A: SSH Dev Loop in dev-start.sh (real containers on .228)

New option 9 in dev-start.sh: "Container orchestration dev (live on .228)"

What it does:

  1. Rsync code to .228 (2 seconds)
  2. Build backend on .228 (incremental: 5-15 seconds)
  3. Restart archipelago service
  4. Run orchestration smoke tests via RPC
  5. Show container status + health monitor logs
  6. Loop: edit locally → press Enter → rsync+rebuild+test

What it tests: Real podman, real containers, real networking. The actual install/start/stop/restart/health cycle.

Implementation:

Add option 9 to scripts/dev-start.sh:

9)
    echo "Container Orchestration Dev (live testing on .228)"
    exec "$SCRIPT_DIR/dev-container-test.sh"
    ;;

Create scripts/dev-container-test.sh (~150 lines):

#!/bin/bash
# Fast edit-build-test loop for container orchestration on .228
#
# Usage: ./scripts/dev-container-test.sh [--once]
#
# Syncs code, builds, restarts, runs orchestration smoke tests.
# Press Enter to re-run, Ctrl+C to stop.

SSH="ssh -o StrictHostKeyChecking=no -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228"

sync_and_build() {
    rsync (same excludes as deploy script)
    ssh: cargo build --release -p archipelago (incremental)
    ssh: sudo systemctl restart archipelago
    ssh: wait for health endpoint (15s timeout)
}

run_smoke_tests() {
    # Test 1: Container list works
    curl -s /rpc/v1 -d '{"method":"container.list"}'

    # Test 2: Install filebrowser (small, fast, no deps)
    curl -s /rpc/v1 -d '{"method":"package.install","params":{"id":"filebrowser","dockerImage":"..."}}'
    # Wait for running state

    # Test 3: Stop with grace period
    curl -s /rpc/v1 -d '{"method":"package.stop","params":{"id":"filebrowser"}}'
    # Verify stopped

    # Test 4: Start
    curl -s /rpc/v1 -d '{"method":"package.start","params":{"id":"filebrowser"}}'
    # Verify running

    # Test 5: Health check
    curl -s /rpc/v1 -d '{"method":"container.health"}'

    # Test 6: Check restart-tracker.json exists
    ssh: cat /var/lib/archipelago/restart-tracker.json

    # Test 7: Check health monitor logs for errors
    ssh: journalctl -u archipelago --since "2 min ago" | grep -i "error\|panic\|fail"

    # Test 8: Uninstall
    curl -s /rpc/v1 -d '{"method":"package.uninstall","params":{"id":"filebrowser"}}'
}

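# Main loop (a single pass when --once is given, as documented in Usage)
while true; do
    sync_and_build
    run_smoke_tests
    if [ "${1:-}" = "--once" ]; then break; fi
    echo "Press Enter to re-run, Ctrl+C to stop"
    read -r
done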

Files:

  • scripts/dev-start.sh — add option 9
  • scripts/dev-container-test.sh — NEW

Layer B: CI Integration Tests (runs on .228 via Gitea Actions)

Extend the existing CI to run container orchestration tests on every push to dev-iso or main that touches core/ or the container and reconcile scripts.

What it tests: Full lifecycle on real hardware after every code change. Catches regressions automatically.

Implementation:

Create .gitea/workflows/container-tests.yml:

name: Container Orchestration Tests
on:
  push:
    branches: [dev-iso, main]
    paths:
      - 'core/**'
      - 'scripts/container-*.sh'
      - 'scripts/reconcile-*.sh'

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Rust unit tests (orchestration)
        run: cargo test -p archipelago --test orchestration_tests --no-fail-fast

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to test node
        run: |
          # Rsync + build on .228
          # Run orchestration smoke tests
          bash scripts/run-container-tests.sh

Create scripts/run-container-tests.sh (~200 lines): Reuses the smoke test logic from dev-container-test.sh but structured for CI:

  • JSON output for CI parsing
  • Exit codes for pass/fail
  • Timeout handling (5 min max)
  • Cleanup after test (remove test containers)
  • Tests: install, start, stop, restart, uninstall, health check, restart tracker, reconciliation

Files:

  • .gitea/workflows/container-tests.yml — NEW
  • scripts/run-container-tests.sh — NEW

Execution Order

  1. Layer C first (mock tests) — Get the logic tested, runs locally, fast feedback
  2. Layer A second (dev loop) — Test against real containers with fast iteration
  3. Layer B last (CI) — Automate regression catching

Files Summary

File                                             Action                         Layer
core/archipelago/src/container/mock_podman.rs    NEW                            C
core/archipelago/src/container/mod.rs            MODIFY                         C
core/archipelago/tests/orchestration_tests.rs    NEW                            C
core/archipelago/src/health_monitor.rs           REFACTOR (extract pure logic)  C
core/archipelago/src/api/rpc/package/runtime.rs  MODIFY (pub fn)                C
scripts/dev-start.sh                             MODIFY (add option 9)          A
scripts/dev-container-test.sh                    NEW                            A
.gitea/workflows/container-tests.yml             NEW                            B
scripts/run-container-tests.sh                   NEW                            B

Verification

  • Layer C: cargo test -p archipelago --test orchestration_tests passes on macOS
  • Layer A: ./scripts/dev-start.sh → option 9 → green smoke tests on .228
  • Layer B: Push to dev-iso → CI green on container-tests workflow