Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Container Orchestration Dev Testing Infrastructure

## Context

Container orchestration has been unreliable for months. Every fix requires a full deploy to .228 (5+ minutes), manual SSH debugging, and prayer. There is no way to test orchestration logic locally or to catch regressions before deploy. We need three layers of testing so that orchestration bugs are caught before the code ever touches a server.

## Three Layers

### Layer C: Mock Podman in Rust Unit Tests (runs on macOS, instant)

Tests the orchestration logic without any containers. Runs under `cargo test` and takes seconds.

**What it tests:** Retry backoff timing, restart tracker persistence, tier ordering, stop grace periods, failsafe install flow, health monitor state machine, crash recovery.

**Implementation:**

Create `core/archipelago/src/container/mock_podman.rs`, a fake podman command executor:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64};
use std::sync::{Arc, Mutex};

use chrono::{DateTime, Utc};

/// In-memory stand-in for podman; the flags let tests inject failures.
pub struct MockPodman {
    containers: Arc<Mutex<HashMap<String, MockContainer>>>,
    fail_pull: Arc<AtomicBool>,    // simulate registry down
    fail_start: Arc<AtomicBool>,   // simulate container crash on start
    pull_delay_ms: Arc<AtomicU64>, // simulate slow pull
}

/// One fake container tracked by the mock.
struct MockContainer {
    name: String,
    image: String,
    state: ContainerState, // Created/Running/Exited/Stopped
    exit_code: i32,
    created_at: DateTime<Utc>,
}
```

Key trait to add in `runtime.rs`:

```rust
use async_trait::async_trait;

#[async_trait]
pub trait CommandExecutor: Send + Sync {
    /// Runs `program` with `args` and returns its captured output.
    async fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput>;
}
```

Production uses `RealExecutor` (calls `tokio::process::Command`). Tests use `MockPodman`.
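
To make the split between `RealExecutor` and `MockPodman` concrete, here is a minimal synchronous sketch of the pattern using only the standard library. The real trait is async and returns the project's own `CommandOutput`. The simplified `CommandOutput` fields, the `String` error type, the `calls` log, and the `pull_image` helper are illustrative assumptions, not the project's API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Simplified stand-in for the project's `CommandOutput` (assumed fields).
#[derive(Debug, Clone, PartialEq)]
pub struct CommandOutput {
    pub exit_code: i32,
    pub stdout: String,
}

/// Synchronous variant of the trait, to keep the sketch dependency-free.
pub trait CommandExecutor: Send + Sync {
    fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput, String>;
}

/// Mock that records every command and can be told to fail pulls.
#[derive(Default)]
pub struct MockPodman {
    pub fail_pull: AtomicBool,
    pub calls: Mutex<Vec<String>>,
}

impl CommandExecutor for MockPodman {
    fn execute(&self, program: &str, args: &[&str]) -> Result<CommandOutput, String> {
        // Record the full command line so tests can assert on it later.
        self.calls
            .lock()
            .unwrap()
            .push(format!("{} {}", program, args.join(" ")));
        if args.first() == Some(&"pull") && self.fail_pull.load(Ordering::SeqCst) {
            return Err("simulated registry outage".to_string());
        }
        Ok(CommandOutput { exit_code: 0, stdout: String::new() })
    }
}

/// Hypothetical orchestration helper: it depends on the trait, not on podman.
pub fn pull_image(exec: &dyn CommandExecutor, image: &str) -> Result<(), String> {
    exec.execute("podman", &["pull", image]).map(|_| ())
}
```

A test can then flip `fail_pull` with `store(true, Ordering::SeqCst)` and assert that `pull_image` returns an error, without a podman binary on the machine.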

**Test file:** `core/archipelago/tests/orchestration_tests.rs`

Tests to write:

1. `test_stop_grace_periods`: bitcoin gets 600s, lnd gets 330s, unknown containers get 30s
2. `test_pull_retry_backoff`: fail twice, succeed on the third attempt, verify 5s/15s delays
3. `test_pull_all_attempts_fail`: fail 3x, verify an error is returned
4. `test_restart_tracker_persistence`: save to disk, reload, verify counters survive
5. `test_restart_tracker_stability_reset`: after 1h, counters clear
6. `test_failsafe_install_rollback`: container exits immediately, verify cleanup
7. `test_failsafe_install_image_missing`: pull succeeds but image not found, verify error
8. `test_health_monitor_tier_ordering`: databases restart before apps
9. `test_health_monitor_skips_user_stopped`: user-stopped containers are not restarted
10. `test_health_monitor_max_attempts`: stops after 3 failures
11. `test_crash_recovery_loads_snapshot`: PID file + snapshot → containers restarted
12. `test_crash_recovery_skips_user_stopped`: user-stopped containers are not recovered
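
Tests 2 and 3 are only instant if the retry loop does not really sleep. One way to get there, sketched below under assumptions, is to inject the sleep function so a test can record the delays. The names `PULL_BACKOFF` and `retry_with_backoff` are hypothetical, and the sketch is synchronous for brevity; only the 5s/15s delays and the three attempts come from the list above.

```rust
use std::time::Duration;

/// Delays between attempts, taken from the 5s/15s figures in the test list.
pub const PULL_BACKOFF: [Duration; 2] = [Duration::from_secs(5), Duration::from_secs(15)];

/// Runs `op` up to `PULL_BACKOFF.len() + 1` times (three attempts here).
/// `sleep` is injected so tests can record delays instead of waiting.
pub fn retry_with_backoff<T, E>(
    mut op: impl FnMut(usize) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if attempt >= PULL_BACKOFF.len() => return Err(err),
            Err(_) => {
                sleep(PULL_BACKOFF[attempt]);
                attempt += 1;
            }
        }
    }
}
```

In a test, `sleep` can be a closure that pushes each `Duration` into a `Vec`, which is then compared against the expected 5s and 15s.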

**Files to modify:**

- `core/archipelago/src/container/mod.rs`: add `pub mod mock_podman;`
- `core/archipelago/src/container/mock_podman.rs`: NEW mock implementation
- `core/archipelago/tests/orchestration_tests.rs`: NEW test file
- `core/archipelago/src/health_monitor.rs`: extract logic into testable functions (pure functions that take data, not functions that call podman)
- `core/archipelago/src/api/rpc/package/runtime.rs`: make `stop_timeout_secs` public for testing
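
As an illustration of the last item, a public `stop_timeout_secs` might look like the sketch below. The timeout values come from `test_stop_grace_periods`; the signature and the match on the container name are assumptions about how the real function in `runtime.rs` decides.

```rust
/// Grace period in seconds before a container is force-stopped.
/// Values mirror `test_stop_grace_periods`; matching on the container
/// name is an assumption about the real implementation.
pub fn stop_timeout_secs(container_name: &str) -> u64 {
    match container_name {
        "bitcoin" => 600,
        "lnd" => 330,
        // Anything not listed gets the default grace period.
        _ => 30,
    }
}
```

With the function public, the test reduces to three `assert_eq!` calls, for example `assert_eq!(stop_timeout_secs("bitcoin"), 600)`.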

**Key refactors to make code testable:**

- Make `stop_timeout_secs()` a `pub fn` so tests can call it directly
- Extract health monitor `check_and_restart()` into a function that takes container list + tracker + user_stopped and returns the actions to take (restart X, notify Y, skip Z): pure logic, no IO
- Extract `RestartTracker` + `RestartHistory` into their own file for independent testing
- Make the `pull_image_with_progress` retry logic independent of progress streaming
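
The `check_and_restart()` extraction could take roughly the following shape. This is a sketch under assumptions: `Action`, `ContainerInfo`, `plan_actions`, the numeric `tier` field, and the plain counter map standing in for `RestartTracker` are all hypothetical names. The function only decides what to do; the caller performs the podman calls.

```rust
use std::collections::{HashMap, HashSet};

/// What the health monitor decided for one container.
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Restart(String),
    Notify(String), // restart budget exhausted, tell the user instead
    Skip(String),   // stopped by the user on purpose
}

/// Minimal view of a container; `tier` orders restarts (lower goes first).
pub struct ContainerInfo {
    pub name: String,
    pub running: bool,
    pub tier: u8,
}

/// Pure decision step: no podman calls, so it can be unit tested directly.
pub fn plan_actions(
    containers: &[ContainerInfo],
    restart_counts: &HashMap<String, u32>,
    user_stopped: &HashSet<String>,
    max_attempts: u32,
) -> Vec<Action> {
    let mut down: Vec<&ContainerInfo> = containers.iter().filter(|c| !c.running).collect();
    // Stable sort: containers in the same tier keep their input order.
    down.sort_by_key(|c| c.tier);
    down.into_iter()
        .map(|c| {
            let attempts = restart_counts.get(&c.name).copied().unwrap_or(0);
            if user_stopped.contains(&c.name) {
                Action::Skip(c.name.clone())
            } else if attempts >= max_attempts {
                Action::Notify(c.name.clone())
            } else {
                Action::Restart(c.name.clone())
            }
        })
        .collect()
}
```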

---

### Layer A: SSH Dev Loop in dev-start.sh (real containers on .228)

New option 9 in `dev-start.sh`: "Container orchestration dev (live on .228)"

**What it does:**

1. Rsync code to .228 (2 seconds)
2. Build backend on .228 (incremental: 5-15 seconds)
3. Restart archipelago service
4. Run orchestration smoke tests via RPC
5. Show container status + health monitor logs
6. Loop: edit locally → press Enter → rsync+rebuild+test

**What it tests:** Real podman, real containers, real networking. The actual install/start/stop/restart/health cycle.

**Implementation:**

Add option 9 to `scripts/dev-start.sh`:

```bash
  9)
    echo "Container Orchestration Dev (live testing on .228)"
    exec "$SCRIPT_DIR/dev-container-test.sh"
    ;;
```

Create `scripts/dev-container-test.sh` (~150 lines):

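```bash
#!/bin/bash
# Fast edit-build-test loop for container orchestration on .228
#
# Usage: ./scripts/dev-container-test.sh [--once]
#
# Syncs code, builds, restarts, runs orchestration smoke tests.
# Press Enter to re-run, Ctrl+C to stop.

HOST="archipelago@192.168.1.228"
SSH_OPTS="-o StrictHostKeyChecking=no -i $HOME/.ssh/archipelago-deploy"
SSH="ssh $SSH_OPTS $HOST"
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"

# Placeholders: this plan does not fix these values, so they must be supplied.
: "${REMOTE_DIR:?set REMOTE_DIR to the checkout path on .228}"
: "${RPC_URL:?set RPC_URL to the full URL of the /rpc/v1 endpoint}"
: "${HEALTH_URL:?set HEALTH_URL to the health endpoint URL}"
RSYNC_EXCLUDES=()  # TODO: same excludes as deploy script

# POST one JSON request body to the RPC endpoint and print the response.
rpc() {
  curl -sf -H 'Content-Type: application/json' "$RPC_URL" -d "$1" && echo
}

sync_and_build() {
  rsync -az -e "ssh $SSH_OPTS" "${RSYNC_EXCLUDES[@]}" \
    "$REPO_ROOT/" "$HOST:$REMOTE_DIR/" || return 1
  # Incremental build on the node
  $SSH "cd '$REMOTE_DIR' && cargo build --release -p archipelago" || return 1
  $SSH "sudo systemctl restart archipelago" || return 1
  # Wait for health endpoint (15s timeout)
  for _ in $(seq 1 15); do
    curl -sf --max-time 2 "$HEALTH_URL" >/dev/null && return 0
    sleep 1
  done
  echo "Health endpoint not ready after 15s" >&2
  return 1
}

run_smoke_tests() {
  local failed=0

  # Test 1: Container list works
  rpc '{"method":"container.list"}' || failed=1

  # Test 2: Install filebrowser (small, fast, no deps)
  rpc '{"method":"package.install","params":{"id":"filebrowser","dockerImage":"..."}}' || failed=1
  # Wait for running state

  # Test 3: Stop with grace period
  rpc '{"method":"package.stop","params":{"id":"filebrowser"}}' || failed=1
  # Verify stopped

  # Test 4: Start
  rpc '{"method":"package.start","params":{"id":"filebrowser"}}' || failed=1
  # Verify running

  # Test 5: Health check
  rpc '{"method":"container.health"}' || failed=1

  # Test 6: Check restart-tracker.json exists
  $SSH "cat /var/lib/archipelago/restart-tracker.json" || failed=1

  # Test 7: Check health monitor logs for errors (any match counts as a failure)
  if $SSH "journalctl -u archipelago --since '2 min ago' | grep -i 'error\|panic\|fail'"; then
    failed=1
  fi

  # Test 8: Uninstall
  rpc '{"method":"package.uninstall","params":{"id":"filebrowser"}}' || failed=1

  return "$failed"
}

# Main loop
while true; do
  if sync_and_build && run_smoke_tests; then
    status=0
    echo "Smoke tests passed"
  else
    status=1
    echo "Smoke tests FAILED" >&2
  fi
  # With --once, run a single pass and report the result via the exit code.
  [ "${1:-}" = "--once" ] && exit "$status"
  echo "Press Enter to re-run, Ctrl+C to stop"
  read -r
done
```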

**Files:**

- `scripts/dev-start.sh`: add option 9
- `scripts/dev-container-test.sh`: NEW

---

### Layer B: CI Integration Tests (runs on .228 via Gitea Actions)

Extend the existing CI to run container orchestration tests on every push to dev-iso or main.

**What it tests:** Full lifecycle on real hardware after every code change. Catches regressions automatically.

**Implementation:**

Create `.gitea/workflows/container-tests.yml`:

```yaml
name: Container Orchestration Tests
on:
  push:
    branches: [dev-iso, main]
    paths:
      - 'core/**'
      - 'scripts/container-*.sh'
      - 'scripts/reconcile-*.sh'

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Rust unit tests (orchestration)
        # --test selects tests/orchestration_tests.rs; --no-fail-fast is a cargo flag
        run: cargo test -p archipelago --test orchestration_tests --no-fail-fast

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to test node
        run: |
          # Rsync + build on .228
          # Run orchestration smoke tests
          bash scripts/run-container-tests.sh
```

Create `scripts/run-container-tests.sh` (~200 lines). It reuses the smoke test logic from `dev-container-test.sh` but is structured for CI:

- JSON output for CI parsing
- Exit codes for pass/fail
- Timeout handling (5 min max)
- Cleanup after test (remove test containers)
- Tests: install, start, stop, restart, uninstall, health check, restart tracker, reconciliation

**Files:**

- `.gitea/workflows/container-tests.yml`: NEW
- `scripts/run-container-tests.sh`: NEW

---

## Execution Order

1. **Layer C first** (mock tests): get the logic tested; runs locally with fast feedback
2. **Layer A second** (dev loop): test against real containers with fast iteration
3. **Layer B last** (CI): automate regression catching

## Files Summary

| File | Action | Layer |
|------|--------|-------|
| `core/archipelago/src/container/mock_podman.rs` | NEW | C |
| `core/archipelago/src/container/mod.rs` | MODIFY | C |
| `core/archipelago/tests/orchestration_tests.rs` | NEW | C |
| `core/archipelago/src/health_monitor.rs` | REFACTOR (extract pure logic) | C |
| `core/archipelago/src/api/rpc/package/runtime.rs` | MODIFY (pub fn) | C |
| `scripts/dev-start.sh` | MODIFY (add option 9) | A |
| `scripts/dev-container-test.sh` | NEW | A |
| `.gitea/workflows/container-tests.yml` | NEW | B |
| `scripts/run-container-tests.sh` | NEW | B |

## Verification

- Layer C: `cargo test -p archipelago --test orchestration_tests`: all pass on macOS
- Layer A: `./scripts/dev-start.sh` → option 9 → green smoke tests on .228
- Layer B: Push to dev-iso → CI green on container-tests workflow