lfg2025/archy

Dorian 1e283daf13 fix: overhaul container lifecycle — recovery, health, uninstall, UI state

Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-31 07:03:57 +01:00


Container Orchestration Dev Testing Infrastructure

Context

Container orchestration has been unreliable for months. Every fix requires a full deploy to .228 (5+ minutes), manual SSH debugging, and prayer. There is no way to test orchestration logic locally or to catch regressions before deploy. We need three layers of testing so that orchestration bugs are caught before the code ever touches a server.

Three Layers

Layer C: Mock Podman in Rust Unit Tests (runs on macOS, instant)

Tests the orchestration LOGIC without any containers. Runs in cargo test, takes seconds.

What it tests: Retry backoff timing, restart tracker persistence, tier ordering, stop grace periods, failsafe install flow, health monitor state machine, crash recovery.

Implementation:

Create core/archipelago/src/container/mock_podman.rs — a fake podman command executor:

use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64};
use std::sync::{Arc, Mutex};

use chrono::{DateTime, Utc}; // DateTime<Utc> is assumed to come from the chrono crate

pub struct MockPodman {
    containers: Arc<Mutex<HashMap<String, MockContainer>>>,
    fail_pull: Arc<AtomicBool>,        // simulate registry down
    fail_start: Arc<AtomicBool>,       // simulate container crash on start
    pull_delay_ms: Arc<AtomicU64>,     // simulate slow pull
}

struct MockContainer {
    name: String,
    image: String,
    state: ContainerState,  // Created/Running/Exited/Stopped
    exit_code: i32,
    created_at: DateTime<Utc>,
}

Key trait to add in runtime.rs:

#[async_trait]
pub trait CommandExecutor: Send + Sync {
    async fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput>;
}

Production uses RealExecutor (calls tokio::process::Command). Tests use MockPodman.
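
To show how the executor seam lets tests run without podman, here is a minimal sketch. It is deliberately simplified: the trait is synchronous and uses only the standard library (the real trait above is async), the mock tracks state as plain strings, and the install function, the state and calls helpers, and the error type are hypothetical names introduced for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
pub struct CommandOutput {
    pub exit_code: i32,
    pub stdout: String,
}

/// Simplified, synchronous stand-in for the async CommandExecutor trait.
pub trait CommandExecutor: Send + Sync {
    fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput, String>;
}

/// Test double: keeps container state in memory and records every call.
#[derive(Default)]
pub struct MockPodman {
    containers: Mutex<HashMap<String, &'static str>>, // name -> state
    calls: Mutex<Vec<String>>,
    pub fail_start: bool, // simulate a container that refuses to start
}

impl MockPodman {
    pub fn state(&self, name: &str) -> Option<&'static str> {
        self.containers.lock().unwrap().get(name).copied()
    }

    pub fn calls(&self) -> Vec<String> {
        self.calls.lock().unwrap().clone()
    }
}

impl CommandExecutor for MockPodman {
    fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput, String> {
        self.calls.lock().unwrap().push(format!("{program} {}", args.join(" ")));
        let mut containers = self.containers.lock().unwrap();
        match args {
            ["create", name] => {
                containers.insert(name.to_string(), "created");
            }
            ["start", name] => {
                if self.fail_start {
                    return Err(format!("start failed: {name}"));
                }
                match containers.get_mut(*name) {
                    Some(state) => *state = "running",
                    None => return Err(format!("no such container: {name}")),
                }
            }
            _ => return Err(format!("unsupported command: {args:?}")),
        }
        Ok(CommandOutput { exit_code: 0, stdout: String::new() })
    }
}

/// Hypothetical logic under test: create, then start, surfacing any failure.
pub fn install(exec: &dyn CommandExecutor, name: &str) -> Result<(), String> {
    exec.execute("podman", &["create", name])?;
    exec.execute("podman", &["start", name])?;
    Ok(())
}
```

Because install only sees the trait, a test can pass a MockPodman with fail_start set and assert that the error is returned and the container is left in the created state, with no real containers involved.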

Test file: core/archipelago/tests/orchestration_tests.rs

Tests to write:

test_stop_grace_periods — bitcoin gets 600s, lnd 330s, unknown gets 30s
test_pull_retry_backoff — fail twice, succeed third, verify 5s/15s delays
test_pull_all_attempts_fail — fail 3x, verify error returned
test_restart_tracker_persistence — save to disk, reload, verify counters survive
test_restart_tracker_stability_reset — after 1h, counters clear
test_failsafe_install_rollback — container exits immediately, verify cleanup
test_failsafe_install_image_missing — pull succeeds but image not found, verify error
test_health_monitor_tier_ordering — databases restart before apps
test_health_monitor_skips_user_stopped — user-stopped containers not restarted
test_health_monitor_max_attempts — stops after MAX_RESTART_ATTEMPTS failures (currently 10, previously 3)
test_crash_recovery_loads_snapshot — PID file + snapshot → containers restarted
test_crash_recovery_skips_user_stopped — user-stopped not recovered
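
As one way to make test_pull_retry_backoff and test_pull_all_attempts_fail instant, the sleep can be injected so the test records delays instead of waiting. The sketch below is an assumption about shape, not the existing pull code: retry_with_backoff and the BACKOFF constant are hypothetical names, and only the 5s/15s delays and the three-attempt limit come from the list above.

```rust
use std::time::Duration;

/// Delays between attempts (5s, then 15s), giving three attempts in total.
const BACKOFF: [Duration; 2] = [Duration::from_secs(5), Duration::from_secs(15)];

/// Runs `op` until it succeeds or the backoff schedule is exhausted.
/// `sleep` is injected so tests can record delays instead of blocking.
pub fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut last = op();
    for delay in BACKOFF {
        if last.is_ok() {
            break;
        }
        sleep(delay);
        last = op();
    }
    last
}
```

In production the sleep argument would be a real sleep; in a test it can push each Duration into a Vec, which is then compared against the expected schedule.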

Files to modify:

core/archipelago/src/container/mod.rs — add pub mod mock_podman;
core/archipelago/src/container/mock_podman.rs — NEW mock implementation
core/archipelago/tests/orchestration_tests.rs — NEW test file
core/archipelago/src/health_monitor.rs — extract logic into testable functions (pure functions that take data, not functions that call podman)
core/archipelago/src/api/rpc/package/runtime.rs — make stop_timeout_secs public for testing
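
To illustrate what a public stop_timeout_secs makes testable, here is a minimal sketch using the grace periods named in test_stop_grace_periods. Matching on plain package id strings is an assumption; the real function in runtime.rs may key on something else and cover more packages.

```rust
/// Grace period in seconds before a container is force-stopped.
/// Values follow test_stop_grace_periods; the id strings are assumed.
pub fn stop_timeout_secs(package_id: &str) -> u64 {
    match package_id {
        "bitcoin" => 600, // long flush of chainstate on shutdown
        "lnd" => 330,
        _ => 30, // default for anything not listed
    }
}
```

With the function public, the test is a handful of direct assertions and needs neither the mock nor a running service.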

Key refactors to make code testable:

Extract stop_timeout_secs() → pub fn so tests can call it directly
Extract health monitor check_and_restart() into a function that takes container list + tracker + user_stopped, returns actions to take (restart X, notify Y, skip Z) — pure logic, no IO
Extract RestartTracker + RestartHistory into own file for independent testing
Make pull_image_with_progress retry logic independent of progress streaming

Layer A: SSH Dev Loop in dev-start.sh (real containers on .228)

New option 9 in dev-start.sh: "Container orchestration dev (live on .228)"

What it does:

Rsync code to .228 (2 seconds)
Build backend on .228 (incremental: 5-15 seconds)
Restart archipelago service
Run orchestration smoke tests via RPC
Show container status + health monitor logs
Loop: edit locally → press Enter → rsync+rebuild+test

What it tests: Real podman, real containers, real networking. The actual install/start/stop/restart/health cycle.

Implementation:

Add option 9 to scripts/dev-start.sh:

9)
    echo "Container Orchestration Dev (live testing on .228)"
    exec "$SCRIPT_DIR/dev-container-test.sh"
    ;;

Create scripts/dev-container-test.sh (~150 lines):

#!/bin/bash
# Fast edit-build-test loop for container orchestration on .228
#
# Usage: ./scripts/dev-container-test.sh [--once]
#
# Syncs code, builds, restarts, runs orchestration smoke tests.
# Press Enter to re-run, Ctrl+C to stop.

SSH="ssh -o StrictHostKeyChecking=no -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228"

sync_and_build() {
    # TODO: rsync with the same excludes as the deploy script
    # Incremental build; must run from the synced checkout on .228
    $SSH "cargo build --release -p archipelago"
    $SSH "sudo systemctl restart archipelago"
    # TODO: wait for the health endpoint (15s timeout)
}

run_smoke_tests() {
    # Test 1: Container list works
    curl -s /rpc/v1 -d '{"method":"container.list"}'

    # Test 2: Install filebrowser (small, fast, no deps)
    curl -s /rpc/v1 -d '{"method":"package.install","params":{"id":"filebrowser","dockerImage":"..."}}'
    # Wait for running state

    # Test 3: Stop with grace period
    curl -s /rpc/v1 -d '{"method":"package.stop","params":{"id":"filebrowser"}}'
    # Verify stopped

    # Test 4: Start
    curl -s /rpc/v1 -d '{"method":"package.start","params":{"id":"filebrowser"}}'
    # Verify running

    # Test 5: Health check
    curl -s /rpc/v1 -d '{"method":"container.health"}'

    # Test 6: Check restart-tracker.json exists
    $SSH "cat /var/lib/archipelago/restart-tracker.json"

    # Test 7: Check health monitor logs for errors
    $SSH "journalctl -u archipelago --since '2 min ago' | grep -iE 'error|panic|fail'"

    # Test 8: Uninstall
    curl -s /rpc/v1 -d '{"method":"package.uninstall","params":{"id":"filebrowser"}}'
}

# Main loop (pass --once to run a single iteration)
while true; do
    sync_and_build
    run_smoke_tests
    [ "${1:-}" = "--once" ] && break
    echo "Press Enter to re-run, Ctrl+C to stop"
    read -r
done

Files:

scripts/dev-start.sh — add option 9
scripts/dev-container-test.sh — NEW

Layer B: CI Integration Tests (runs on .228 via Gitea Actions)

Extend the existing CI to run container orchestration tests on every push to dev-iso or main that touches core/ or the container and reconcile scripts.

What it tests: Full lifecycle on real hardware after every code change. Catches regressions automatically.

Implementation:

Create .gitea/workflows/container-tests.yml:

name: Container Orchestration Tests
on:
  push:
    branches: [dev-iso, main]
    paths:
      - 'core/**'
      - 'scripts/container-*.sh'
      - 'scripts/reconcile-*.sh'

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Rust unit tests (orchestration)
        run: cargo test -p archipelago --no-fail-fast --test orchestration_tests

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to test node
        run: |
          # Rsync + build on .228
          # Run orchestration smoke tests
          bash scripts/run-container-tests.sh

Create scripts/run-container-tests.sh (~200 lines): Reuses the smoke test logic from dev-container-test.sh but structured for CI:

JSON output for CI parsing
Exit codes for pass/fail
Timeout handling (5 min max)
Cleanup after test (remove test containers)
Tests: install, start, stop, restart, uninstall, health check, restart tracker, reconciliation
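
The script itself is planned as shell, but the output contract in the list above (JSON summary plus a pass/fail exit code) can be pinned down independently. The sketch below uses Rust only to make that contract concrete; the field names passed, failed, tests, name, and detail are assumptions, as are the TestResult type and both function names.

```rust
pub struct TestResult {
    pub name: &'static str,
    pub passed: bool,
    pub detail: String,
}

/// Escapes characters that must not appear raw inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// One-line JSON summary that a CI step can parse.
pub fn summary_json(results: &[TestResult]) -> String {
    let failed = results.iter().filter(|r| !r.passed).count();
    let tests: Vec<String> = results
        .iter()
        .map(|r| {
            format!(
                "{{\"name\":\"{}\",\"passed\":{},\"detail\":\"{}\"}}",
                json_escape(r.name),
                r.passed,
                json_escape(&r.detail)
            )
        })
        .collect();
    format!(
        "{{\"passed\":{},\"failed\":{},\"tests\":[{}]}}",
        results.len() - failed,
        failed,
        tests.join(",")
    )
}

/// Process exit code: 0 only when every test passed.
pub fn exit_code(results: &[TestResult]) -> i32 {
    if results.iter().all(|r| r.passed) { 0 } else { 1 }
}
```

Keeping the summary on a single line makes it easy for the workflow to capture the last line of output, and deriving the exit code from the same results keeps the JSON and the job status from disagreeing.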

Files:

.gitea/workflows/container-tests.yml — NEW
scripts/run-container-tests.sh — NEW

Execution Order

Layer C first (mock tests) — Get the logic tested, runs locally, fast feedback
Layer A second (dev loop) — Test against real containers with fast iteration
Layer B last (CI) — Automate regression catching

Files Summary

| File | Action | Layer |
|---|---|---|
| `core/archipelago/src/container/mock_podman.rs` | NEW | C |
| `core/archipelago/src/container/mod.rs` | MODIFY | C |
| `core/archipelago/tests/orchestration_tests.rs` | NEW | C |
| `core/archipelago/src/health_monitor.rs` | REFACTOR (extract pure logic) | C |
| `core/archipelago/src/api/rpc/package/runtime.rs` | MODIFY (pub fn) | C |
| `scripts/dev-start.sh` | MODIFY (add option 9) | A |
| `scripts/dev-container-test.sh` | NEW | A |
| `.gitea/workflows/container-tests.yml` | NEW | B |
| `scripts/run-container-tests.sh` | NEW | B |

Verification

Layer C: cargo test -p archipelago --test orchestration_tests passes on macOS
Layer A: ./scripts/dev-start.sh → option 9 → green smoke tests on .228
Layer B: Push to dev-iso → CI green on container-tests workflow
