Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Container Orchestration Dev Testing Infrastructure
Context
Container orchestration has been unreliable for months. Every fix requires a full deploy to .228 (5+ minutes), manual SSH debugging, and prayer. There is currently no way to test orchestration logic locally or to catch regressions before deploy. We need three layers of testing so that orchestration changes are verified before they ever touch a server.
Three Layers
Layer C: Mock Podman in Rust Unit Tests (runs on macOS, instant)
Tests the orchestration LOGIC without any containers. Runs in cargo test, takes seconds.
What it tests: Retry backoff timing, restart tracker persistence, tier ordering, stop grace periods, failsafe install flow, health monitor state machine, crash recovery.
Implementation:
Create core/archipelago/src/container/mock_podman.rs — a fake podman command executor:
    use std::collections::HashMap;
    use std::sync::atomic::{AtomicBool, AtomicU64};
    use std::sync::{Arc, Mutex};

    use chrono::{DateTime, Utc};

    pub struct MockPodman {
        containers: Arc<Mutex<HashMap<String, MockContainer>>>,
        fail_pull: Arc<AtomicBool>,   // simulate registry down
        fail_start: Arc<AtomicBool>,  // simulate container crash on start
        pull_delay_ms: Arc<AtomicU64>, // simulate slow pull
    }

    struct MockContainer {
        name: String,
        image: String,
        state: ContainerState, // Created/Running/Exited/Stopped
        exit_code: i32,
        created_at: DateTime<Utc>,
    }
Key trait to add in runtime.rs:
    use async_trait::async_trait;

    #[async_trait]
    pub trait CommandExecutor: Send + Sync {
        // Runs `program` with `args` and returns its captured output.
        async fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput>;
    }
Production uses RealExecutor (calls tokio::process::Command). Tests use MockPodman.
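As a rough illustration of how the trait lets tests swap in a fake, here is a minimal sketch that uses only the standard library. It is a simplification under several assumptions: the trait is made synchronous so no async runtime is needed, and the fields of CommandOutput, the String error type, the reduced ContainerState enum, the set_fail_start and state_of helpers, and the podman argument shapes handled here are all hypothetical rather than taken from the codebase.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Simplified stand-in for the plan's `CommandOutput` (fields are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct CommandOutput {
    pub exit_code: i32,
    pub stdout: String,
}

/// Synchronous variant of `CommandExecutor`, used here to avoid an async runtime.
pub trait CommandExecutor: Send + Sync {
    fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput, String>;
}

/// Reduced state set for the sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ContainerState {
    Created,
    Running,
    Exited,
}

/// In-memory fake: tracks only a state per container name.
#[derive(Default)]
pub struct MockPodman {
    containers: Arc<Mutex<HashMap<String, ContainerState>>>,
    fail_start: Arc<AtomicBool>,
}

impl MockPodman {
    /// Hypothetical helper: make every later `start` fail.
    pub fn set_fail_start(&self, fail: bool) {
        self.fail_start.store(fail, Ordering::SeqCst);
    }

    /// Hypothetical helper: read back the recorded state for assertions.
    pub fn state_of(&self, name: &str) -> Option<ContainerState> {
        self.containers.lock().unwrap().get(name).copied()
    }
}

impl CommandExecutor for MockPodman {
    fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput, String> {
        if program != "podman" {
            return Err(format!("unexpected program: {program}"));
        }
        let mut containers = self.containers.lock().unwrap();
        match args {
            ["create", name] => {
                containers.insert(name.to_string(), ContainerState::Created);
            }
            ["start", name] => {
                if !containers.contains_key(*name) {
                    return Err(format!("no such container: {name}"));
                }
                if self.fail_start.load(Ordering::SeqCst) {
                    // Simulate a container that crashes on start.
                    containers.insert(name.to_string(), ContainerState::Exited);
                    return Err(format!("simulated start failure: {name}"));
                }
                containers.insert(name.to_string(), ContainerState::Running);
            }
            _ => return Err(format!("unsupported arguments: {args:?}")),
        }
        Ok(CommandOutput { exit_code: 0, stdout: String::new() })
    }
}
```

A test can then create and start a container through the trait, flip fail_start, and assert on state_of, all without a real podman binary.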
Test file: core/archipelago/tests/orchestration_tests.rs
Tests to write:
- test_stop_grace_periods: bitcoin gets 600s, lnd 330s, unknown gets 30s
- test_pull_retry_backoff: fail twice, succeed third, verify 5s/15s delays
- test_pull_all_attempts_fail: fail 3x, verify error returned
- test_restart_tracker_persistence: save to disk, reload, verify counters survive
- test_restart_tracker_stability_reset: after 1h, counters clear
- test_failsafe_install_rollback: container exits immediately, verify cleanup
- test_failsafe_install_image_missing: pull succeeds but image not found, verify error
- test_health_monitor_tier_ordering: databases restart before apps
- test_health_monitor_skips_user_stopped: user-stopped containers not restarted
- test_health_monitor_max_attempts: stops after 3 failures
- test_crash_recovery_loads_snapshot: PID file + snapshot → containers restarted
- test_crash_recovery_skips_user_stopped: user-stopped not recovered
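To make the two pull-retry tests concrete, here is one possible shape for the retry logic, written so that a test can observe the delays instead of waiting for them. This is a sketch under assumptions: retry_with_backoff and its injected sleep callback are hypothetical names, and only the 5s/15s delays and the three-attempt limit come from the test descriptions above.

```rust
use std::time::Duration;

/// Delays between attempts, from the test description (5s, then 15s).
const BACKOFF: [Duration; 2] = [Duration::from_secs(5), Duration::from_secs(15)];

/// Runs `op` up to `BACKOFF.len() + 1` times. After each failed attempt that
/// still has a retry left, `sleep` is called with the matching delay.
/// `op` receives the zero-based attempt number.
pub fn retry_with_backoff<T, E>(
    mut op: impl FnMut(usize) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) => match BACKOFF.get(attempt) {
                Some(delay) => sleep(*delay),
                // No delay left means the last attempt just failed.
                None => return Err(err),
            },
        }
        attempt += 1;
    }
}
```

Production code would pass a real sleep for the callback, while a test passes a closure that records each Duration, so test_pull_retry_backoff can assert on the recorded delays and still finish immediately.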
Files to modify:
- core/archipelago/src/container/mod.rs: add pub mod mock_podman;
- core/archipelago/src/container/mock_podman.rs: NEW mock implementation
- core/archipelago/tests/orchestration_tests.rs: NEW test file
- core/archipelago/src/health_monitor.rs: extract logic into testable functions (pure functions that take data, not functions that call podman)
- core/archipelago/src/api/rpc/package/runtime.rs: make stop_timeout_secs public for testing
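Once stop_timeout_secs is public, the grace-period test reduces to a few direct calls. A minimal sketch of what the function might look like follows; the three values come from the test description, while the signature and the choice to match on the exact container name are assumptions.

```rust
/// Grace period, in seconds, before a container is force-stopped.
/// Matching on the bare container name is an assumption of this sketch.
pub fn stop_timeout_secs(container: &str) -> u64 {
    match container {
        "bitcoin" => 600, // longest grace period in the plan
        "lnd" => 330,
        _ => 30, // default for anything not listed
    }
}
```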
Key refactors to make code testable:
- Extract stop_timeout_secs() → pub fn so tests can call it directly
- Extract health monitor check_and_restart() into a function that takes container list + tracker + user_stopped and returns the actions to take (restart X, notify Y, skip Z): pure logic, no IO
- Extract RestartTracker + RestartHistory into own file for independent testing
- Make pull_image_with_progress retry logic independent of progress streaming
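The check_and_restart() extraction could look roughly like the pure function below, which takes plain data and returns a list of actions without touching podman. Everything named here is hypothetical (Action, ContainerInfo, plan_actions, the numeric tier field, and the use of a plain count map in place of the real tracker); the sketch only illustrates the restart/notify/skip split, tier ordering, and the attempt limit described above.

```rust
use std::collections::{HashMap, HashSet};

/// What the health monitor should do for one container that is down.
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Restart(String),
    Notify(String), // attempt limit reached, stop retrying
    Skip(String),   // stopped by the user, leave it alone
}

/// Minimal view of a container for decision making.
pub struct ContainerInfo {
    pub name: String,
    pub running: bool,
    pub tier: u8, // assumption: lower tiers (e.g. databases) restart first
}

/// Pure decision logic: no IO, so it can be unit tested directly.
pub fn plan_actions(
    containers: &[ContainerInfo],
    restart_counts: &HashMap<String, u32>,
    user_stopped: &HashSet<String>,
    max_attempts: u32,
) -> Vec<Action> {
    let mut down: Vec<&ContainerInfo> = containers.iter().filter(|c| !c.running).collect();
    // Stable sort: containers in the same tier keep their input order.
    down.sort_by_key(|c| c.tier);
    down.into_iter()
        .map(|c| {
            let attempts = restart_counts.get(&c.name).copied().unwrap_or(0);
            if user_stopped.contains(&c.name) {
                Action::Skip(c.name.clone())
            } else if attempts >= max_attempts {
                Action::Notify(c.name.clone())
            } else {
                Action::Restart(c.name.clone())
            }
        })
        .collect()
}
```

Passing max_attempts as a parameter keeps the limit out of the logic, so the tier-ordering, user-stopped, and max-attempts tests can each build a small input and compare the returned Vec<Action>.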
Layer A: SSH Dev Loop in dev-start.sh (real containers on .228)
New option 9 in dev-start.sh: "Container orchestration dev (live on .228)"
What it does:
- Rsync code to .228 (2 seconds)
- Build backend on .228 (incremental: 5-15 seconds)
- Restart archipelago service
- Run orchestration smoke tests via RPC
- Show container status + health monitor logs
- Loop: edit locally → press Enter → rsync+rebuild+test
What it tests: Real podman, real containers, real networking. The actual install/start/stop/restart/health cycle.
Implementation:
Add option 9 to scripts/dev-start.sh:
    9)
        echo "Container Orchestration Dev (live testing on .228)"
        exec "$SCRIPT_DIR/dev-container-test.sh"
        ;;
Create scripts/dev-container-test.sh (~150 lines):
    #!/bin/bash
    # Fast edit-build-test loop for container orchestration on .228
    #
    # Usage: ./scripts/dev-container-test.sh [--once]
    #
    # Syncs code, builds, restarts, runs orchestration smoke tests.
    # Press Enter to re-run, Ctrl+C to stop.
    #
    # Outline only: steps marked TODO still need real commands, and the
    # curl calls omit the RPC base URL.

    SSH="ssh -o StrictHostKeyChecking=no -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228"

    sync_and_build() {
        # TODO: rsync (same excludes as deploy script)
        # TODO: over ssh, cargo build --release -p archipelago (incremental)
        $SSH "sudo systemctl restart archipelago"
        # TODO: over ssh, wait for health endpoint (15s timeout)
    }

    run_smoke_tests() {
        # Test 1: Container list works
        curl -s /rpc/v1 -d '{"method":"container.list"}'

        # Test 2: Install filebrowser (small, fast, no deps)
        curl -s /rpc/v1 -d '{"method":"package.install","params":{"id":"filebrowser","dockerImage":"..."}}'
        # TODO: wait for running state

        # Test 3: Stop with grace period
        curl -s /rpc/v1 -d '{"method":"package.stop","params":{"id":"filebrowser"}}'
        # TODO: verify stopped

        # Test 4: Start
        curl -s /rpc/v1 -d '{"method":"package.start","params":{"id":"filebrowser"}}'
        # TODO: verify running

        # Test 5: Health check
        curl -s /rpc/v1 -d '{"method":"container.health"}'

        # Test 6: Check restart-tracker.json exists
        $SSH "cat /var/lib/archipelago/restart-tracker.json"

        # Test 7: Check health monitor logs for errors
        # (the inner quotes keep "2 min ago" as one argument on the remote side)
        $SSH 'journalctl -u archipelago --since "2 min ago"' | grep -i "error\|panic\|fail"

        # Test 8: Uninstall
        curl -s /rpc/v1 -d '{"method":"package.uninstall","params":{"id":"filebrowser"}}'
    }

    # Main loop
    while true; do
        sync_and_build
        run_smoke_tests
        # --once: run a single pass and exit instead of waiting for Enter
        [ "$1" = "--once" ] && break
        echo "Press Enter to re-run, Ctrl+C to stop"
        read -r
    done
Files:
- scripts/dev-start.sh: add option 9
- scripts/dev-container-test.sh: NEW
Layer B: CI Integration Tests (runs on .228 via Gitea Actions)
Extend the existing CI to run container orchestration tests on every push to dev-iso.
What it tests: Full lifecycle on real hardware after every code change. Catches regressions automatically.
Implementation:
Create .gitea/workflows/container-tests.yml:
    name: Container Orchestration Tests
    on:
      push:
        branches: [dev-iso, main]
        paths:
          - 'core/**'
          - 'scripts/container-*.sh'
          - 'scripts/reconcile-*.sh'
    jobs:
      unit-tests:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Rust unit tests (orchestration)
            # --test selects tests/orchestration_tests.rs; --no-fail-fast is a
            # cargo flag, so it must come before any "--" separator.
            run: cargo test -p archipelago --test orchestration_tests --no-fail-fast
      integration-tests:
        runs-on: ubuntu-latest
        needs: unit-tests
        steps:
          - uses: actions/checkout@v4
          - name: Deploy to test node
            run: |
              # Rsync + build on .228
              # Run orchestration smoke tests
              bash scripts/run-container-tests.sh
Create scripts/run-container-tests.sh (~200 lines):
Reuses the smoke test logic from dev-container-test.sh but structured for CI:
- JSON output for CI parsing
- Exit codes for pass/fail
- Timeout handling (5 min max)
- Cleanup after test (remove test containers)
- Tests: install, start, stop, restart, uninstall, health check, restart tracker, reconciliation
Files:
- .gitea/workflows/container-tests.yml: NEW
- scripts/run-container-tests.sh: NEW
Execution Order
- Layer C first (mock tests) — Get the logic tested, runs locally, fast feedback
- Layer A second (dev loop) — Test against real containers with fast iteration
- Layer B last (CI) — Automate regression catching
Files Summary
| File | Action | Layer |
|---|---|---|
| core/archipelago/src/container/mock_podman.rs | NEW | C |
| core/archipelago/src/container/mod.rs | MODIFY | C |
| core/archipelago/tests/orchestration_tests.rs | NEW | C |
| core/archipelago/src/health_monitor.rs | REFACTOR (extract pure logic) | C |
| core/archipelago/src/api/rpc/package/runtime.rs | MODIFY (pub fn) | C |
| scripts/dev-start.sh | MODIFY (add option 9) | A |
| scripts/dev-container-test.sh | NEW | A |
| .gitea/workflows/container-tests.yml | NEW | B |
| scripts/run-container-tests.sh | NEW | B |
Verification
- Layer C: cargo test -p archipelago --test orchestration_tests → all pass on macOS
- Layer A: ./scripts/dev-start.sh → option 9 → green smoke tests on .228
- Layer B: Push to dev-iso → CI green on container-tests workflow