# Container Orchestration Dev Testing Infrastructure

## Context

Container orchestration has been unreliable for months. Every fix requires a full deploy to .228 (5+ minutes), manual SSH debugging, and prayer. There is no way to test orchestration logic locally or to catch regressions before deploy. We need three layers of testing so that orchestration is proven before it ever touches a server.

## Three Layers

### Layer C: Mock Podman in Rust Unit Tests (runs on macOS, instant)

Tests the orchestration LOGIC without any containers. Runs in `cargo test`, takes seconds.

**What it tests:** Retry backoff timing, restart tracker persistence, tier ordering, stop grace periods, failsafe install flow, health monitor state machine, crash recovery.

**Implementation:**

Create `core/archipelago/src/container/mock_podman.rs`, a fake podman command executor:

```rust
// NOTE: the generic parameters in this snippet were lost from the plan text.
// The ones shown (Mutex, HashMap, AtomicBool, AtomicU64, chrono's Utc) are
// assumptions; substitute the types the crate actually uses.
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64};
use std::sync::{Arc, Mutex};

use chrono::{DateTime, Utc};

pub struct MockPodman {
    containers: Arc<Mutex<HashMap<String, MockContainer>>>, // assumed: keyed by name
    fail_pull: Arc<AtomicBool>,    // simulate registry down
    fail_start: Arc<AtomicBool>,   // simulate container crash on start
    pull_delay_ms: Arc<AtomicU64>, // simulate slow pull
}

struct MockContainer {
    name: String,
    image: String,
    state: ContainerState, // Created/Running/Exited/Stopped
    exit_code: i32,
    created_at: DateTime<Utc>,
}
```

Key trait to add in `runtime.rs`:

```rust
use async_trait::async_trait;

#[async_trait]
pub trait CommandExecutor: Send + Sync {
    // The Result's type parameter was lost from the plan text; `Output` is an
    // assumed placeholder for whatever the real executor returns.
    async fn execute(&self, program: &str, args: &[&str]) -> Result<Output>;
}
```

Production uses `RealExecutor` (calls `tokio::process::Command`). Tests use `MockPodman`.

**Test file:** `core/archipelago/tests/orchestration_tests.rs`

Tests to write:

1. `test_stop_grace_periods`: bitcoin gets 600s, lnd 330s, unknown gets 30s
2. `test_pull_retry_backoff`: fail twice, succeed third, verify 5s/15s delays
3. `test_pull_all_attempts_fail`: fail 3x, verify error returned
4. `test_restart_tracker_persistence`: save to disk, reload, verify counters survive
5. `test_restart_tracker_stability_reset`: after 1h, counters clear
6. `test_failsafe_install_rollback`: container exits immediately, verify cleanup
7. `test_failsafe_install_image_missing`: pull succeeds but image not found, verify error
8. `test_health_monitor_tier_ordering`: databases restart before apps
9. `test_health_monitor_skips_user_stopped`: user-stopped containers not restarted
10. `test_health_monitor_max_attempts`: stops after 3 failures
11. `test_crash_recovery_loads_snapshot`: PID file + snapshot → containers restarted
12. `test_crash_recovery_skips_user_stopped`: user-stopped not recovered

**Files to modify:**

- `core/archipelago/src/container/mod.rs`: add `pub mod mock_podman;`
- `core/archipelago/src/container/mock_podman.rs`: NEW mock implementation
- `core/archipelago/tests/orchestration_tests.rs`: NEW test file
- `core/archipelago/src/health_monitor.rs`: extract logic into testable functions (pure functions that take data, not functions that call podman)
- `core/archipelago/src/api/rpc/package/runtime.rs`: make `stop_timeout_secs` public for testing

**Key refactors to make code testable:**

- Extract `stop_timeout_secs()` → `pub fn` so tests can call it directly
- Extract health monitor `check_and_restart()` into a function that takes container list + tracker + user_stopped and returns the actions to take (restart X, notify Y, skip Z): pure logic, no IO
- Extract `RestartTracker` + `RestartHistory` into their own file for independent testing
- Make the `pull_image_with_progress` retry logic independent of progress streaming

---

### Layer A: SSH Dev Loop in dev-start.sh (real containers on .228)

New option 9 in `dev-start.sh`: "Container orchestration dev (live on .228)"

**What it does:**

1. Rsync code to .228 (2 seconds)
2. Build backend on .228 (incremental: 5-15 seconds)
3. Restart archipelago service
4. Run orchestration smoke tests via RPC
5. Show container status + health monitor logs
6. Loop: edit locally → press Enter → rsync+rebuild+test

**What it tests:** Real podman, real containers, real networking. The actual install/start/stop/restart/health cycle.


**Implementation:**

Add option 9 to `scripts/dev-start.sh`:

```bash
9)
    echo "Container Orchestration Dev (live testing on .228)"
    exec "$SCRIPT_DIR/dev-container-test.sh"
    ;;
```

Create `scripts/dev-container-test.sh` (~150 lines):

```bash
#!/bin/bash
# Fast edit-build-test loop for container orchestration on .228
#
# Usage: ./scripts/dev-container-test.sh [--once]
#
# Syncs code, builds, restarts, runs orchestration smoke tests.
# Press Enter to re-run, Ctrl+C to stop.

# Array form keeps the options intact; $HOME is used because "~" does not
# expand inside quotes.
SSH=(ssh -o StrictHostKeyChecking=no -i "$HOME/.ssh/archipelago-deploy" archipelago@192.168.1.228)

# Placeholder name: the RPC base URL is not specified in this plan.
RPC_URL="${RPC_URL:?set RPC_URL to the /rpc/v1 endpoint of the node}"

sync_and_build() {
    # TODO: rsync (same excludes as deploy script)
    # Run from the synced checkout on .228 (path not specified here); incremental build
    "${SSH[@]}" "cargo build --release -p archipelago"
    "${SSH[@]}" "sudo systemctl restart archipelago"
    # TODO: wait for health endpoint (15s timeout)
}

run_smoke_tests() {
    # Test 1: Container list works
    curl -s "$RPC_URL" -d '{"method":"container.list"}'

    # Test 2: Install filebrowser (small, fast, no deps)
    curl -s "$RPC_URL" -d '{"method":"package.install","params":{"id":"filebrowser","dockerImage":"..."}}'
    # Wait for running state

    # Test 3: Stop with grace period
    curl -s "$RPC_URL" -d '{"method":"package.stop","params":{"id":"filebrowser"}}'
    # Verify stopped

    # Test 4: Start
    curl -s "$RPC_URL" -d '{"method":"package.start","params":{"id":"filebrowser"}}'
    # Verify running

    # Test 5: Health check
    curl -s "$RPC_URL" -d '{"method":"container.health"}'

    # Test 6: Check restart-tracker.json exists
    "${SSH[@]}" "cat /var/lib/archipelago/restart-tracker.json"

    # Test 7: Check health monitor logs for errors
    "${SSH[@]}" 'journalctl -u archipelago --since "2 min ago"' | grep -i "error\|panic\|fail"

    # Test 8: Uninstall
    curl -s "$RPC_URL" -d '{"method":"package.uninstall","params":{"id":"filebrowser"}}'
}

# Main loop
while true; do
    sync_and_build
    run_smoke_tests
    [[ "${1:-}" == "--once" ]] && break
    echo "Press Enter to re-run, Ctrl+C to stop"
    read -r
done
```

**Files:**

- `scripts/dev-start.sh`: add option 9
- `scripts/dev-container-test.sh`: NEW

---

### Layer B: CI Integration Tests (runs on .228 via Gitea Actions)

Extend the existing CI to run container orchestration tests on every push to dev-iso or main.

**What it tests:** Full lifecycle on real hardware after every code change. Catches regressions automatically.

**Implementation:**

Create `.gitea/workflows/container-tests.yml`:

```yaml
name: Container Orchestration Tests
on:
  push:
    branches: [dev-iso, main]
    paths:
      - 'core/**'
      - 'scripts/container-*.sh'
      - 'scripts/reconcile-*.sh'

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Rust unit tests (orchestration)
        # --no-fail-fast is a cargo flag, so it must not follow `--`;
        # --test selects tests/orchestration_tests.rs by target name.
        run: cargo test -p archipelago --no-fail-fast --test orchestration_tests

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to test node
        run: |
          # Rsync + build on .228
          # Run orchestration smoke tests
          bash scripts/run-container-tests.sh
```

Create `scripts/run-container-tests.sh` (~200 lines). It reuses the smoke test logic from `dev-container-test.sh` but is structured for CI:

- JSON output for CI parsing
- Exit codes for pass/fail
- Timeout handling (5 min max)
- Cleanup after test (remove test containers)
- Tests: install, start, stop, restart, uninstall, health check, restart tracker, reconciliation

**Files:**

- `.gitea/workflows/container-tests.yml`: NEW
- `scripts/run-container-tests.sh`: NEW

---

## Execution Order

1. **Layer C first** (mock tests): get the logic tested; runs locally with fast feedback
2. **Layer A second** (dev loop): test against real containers with fast iteration
3. **Layer B last** (CI): automate regression catching

## Files Summary

| File | Action | Layer |
|------|--------|-------|
| `core/archipelago/src/container/mock_podman.rs` | NEW | C |
| `core/archipelago/src/container/mod.rs` | MODIFY | C |
| `core/archipelago/tests/orchestration_tests.rs` | NEW | C |
| `core/archipelago/src/health_monitor.rs` | REFACTOR (extract pure logic) | C |
| `core/archipelago/src/api/rpc/package/runtime.rs` | MODIFY (pub fn) | C |
| `scripts/dev-start.sh` | MODIFY (add option 9) | A |
| `scripts/dev-container-test.sh` | NEW | A |
| `.gitea/workflows/container-tests.yml` | NEW | B |
| `scripts/run-container-tests.sh` | NEW | B |

## Verification

- Layer C: `cargo test -p archipelago --test orchestration_tests` passes on macOS
- Layer A: `./scripts/dev-start.sh` → option 9 → green smoke tests on .228
- Layer B: Push to dev-iso → CI green on container-tests workflow
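
To make the Layer C check more concrete, here is a minimal sketch of what the pure-logic refactor from that layer might look like once extracted. It is not the existing code: `Observed`, `Action`, `plan_actions`, `MAX_ATTEMPTS`, the numeric `tier` field, and matching on the literal ids `"bitcoin"` and `"lnd"` are all assumptions made for illustration. Only the grace-period values and the limit of 3 attempts come from the test list in this plan.

```rust
use std::collections::{HashMap, HashSet};

/// Stop grace period in seconds, using the values listed for
/// `test_stop_grace_periods`. Matching on these literal ids is an assumption.
pub fn stop_timeout_secs(id: &str) -> u64 {
    match id {
        "bitcoin" => 600,
        "lnd" => 330,
        _ => 30,
    }
}

/// Hypothetical, simplified container state for decision-making.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContainerState {
    Running,
    Exited,
}

/// Hypothetical snapshot row; a lower `tier` restarts first
/// (for example databases before apps).
#[derive(Debug, Clone)]
pub struct Observed {
    pub name: String,
    pub state: ContainerState,
    pub tier: u8,
}

/// What the caller should do next; the IO itself happens elsewhere.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Action {
    Restart(String),
    Notify(String), // restart budget exhausted
    Skip(String),   // stopped on purpose by the user
}

/// Restart budget taken from `test_health_monitor_max_attempts`.
pub const MAX_ATTEMPTS: u32 = 3;

/// Pure decision step: takes data, returns actions, never calls podman.
pub fn plan_actions(
    containers: &[Observed],
    restart_counts: &HashMap<String, u32>,
    user_stopped: &HashSet<String>,
) -> Vec<Action> {
    let mut down: Vec<&Observed> = containers
        .iter()
        .filter(|c| c.state == ContainerState::Exited)
        .collect();
    // Stable sort: lower tiers first, input order preserved within a tier.
    down.sort_by_key(|c| c.tier);

    down.into_iter()
        .map(|c| {
            let name = c.name.clone();
            if user_stopped.contains(&c.name) {
                Action::Skip(name)
            } else if restart_counts.get(&c.name).copied().unwrap_or(0) >= MAX_ATTEMPTS {
                Action::Notify(name)
            } else {
                Action::Restart(name)
            }
        })
        .collect()
}
```

With a shape like this, a test such as `test_health_monitor_tier_ordering` could build a `Vec<Observed>` by hand and assert on the returned `Vec<Action>`, with no mock executor involved; `MockPodman` would then only be needed for the tests that exercise pull, start, and stop.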