archy/scripts/smoke-test.sh

87 lines
2.3 KiB
Bash
Raw Normal View History

release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 + v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500) with no recovery path short of SSH. This release adds a self-check guardrail to the update flow. What changed: - apply_update() writes a pending-verify marker with old+new version and a 150s deadline immediately before scheduling the service restart. - verify_pending_update() runs from main.rs startup. If the marker is present and within its freshness window, the new binary waits 15s for nginx + backend to settle, then probes https://127.0.0.1/ every 5s for up to 90s (self-signed certs accepted). - On any probe success within the window, the marker is cleared and nothing else happens. - On window-exhaust, the new binary: 1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts> (quarantined, not deleted, so we can post-mortem). 2. Restores web-ui.bak on top of web-ui. 3. Calls rollback_update() to restore the previous binary. 4. Updates state.current_version to reflect the rollback. 5. systemctl --no-block restart archipelago so the OLD binary boots. - Markers older than 10 minutes are treated as stale and cleared without probing, so a crashed-during-startup marker from weeks ago cannot spontaneously roll back a healthy node on a later reboot. - rollback_update() binary copy now goes through host_sudo instead of tokio::fs::copy, so it escapes the service's ProtectSystem=strict mount namespace. Without this, the rollback silently failed with EROFS on /usr/local/bin and orphaned the rollback - the exact opposite of what auto-rollback is for. Tests: 4 new unit tests in update::tests covering marker round-trip, absent-marker noop, no-panic on verify_pending_update with nothing to verify, and an invariant assert that the 90s probe window stays below the 600s stale threshold. All passing. Side fix: scripts/create-release-manifest.sh was dying with exit 141 (SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail. Replaced with a single awk NR==1 that doesn't short-circuit the upstream pipe, so the release-build flow is idempotent again.
2026-04-22 16:14:35 -04:00
#!/bin/bash
# Smoke test for Archipelago — verifies critical endpoints
# Usage: ./scripts/smoke-test.sh [host]
# Exit 0 if all pass, exit 1 on any failure.
set -euo pipefail
HOST="${1:-192.168.1.198}"
PASS=0
FAIL=0
FAILURES=""
check() {
local name="$1" cmd="$2"
if eval "$cmd" >/dev/null 2>&1; then
echo "$name"
PASS=$((PASS + 1))
else
echo "$name"
FAIL=$((FAIL + 1))
FAILURES="$FAILURES\n - $name"
fi
}
echo "=== Archipelago Smoke Test ==="
echo "Target: $HOST"
echo ""
# 1. Health endpoint
check "GET /health returns OK" \
"curl -sf http://${HOST}/health | grep -q '\"status\"'"
# 2. Login via RPC
SESSION=$(curl -sf -X POST "http://${HOST}/rpc/v1" \
-H 'Content-Type: application/json' \
-d '{"method":"auth.login","params":{"password":"'"${TEST_PASSWORD:-password123}"'"}}' \
-c - 2>/dev/null | grep session | awk '{print $NF}' || echo "")
if [ -n "$SESSION" ]; then
echo " ✓ Login via RPC"
PASS=$((PASS + 1))
else
echo " ✗ Login via RPC"
FAIL=$((FAIL + 1))
FAILURES="$FAILURES\n - Login via RPC"
fi
# 3. Authenticated RPC call
check "server.get-info returns JSON" \
"curl -sf -X POST http://${HOST}/rpc/v1 \
-H 'Content-Type: application/json' \
-b 'session=${SESSION}' \
-d '{\"method\":\"server.get-info\"}' | grep -q '\"result\"'"
# 4. Container list
check "container.list returns JSON" \
"curl -sf -X POST http://${HOST}/rpc/v1 \
-H 'Content-Type: application/json' \
-b 'session=${SESSION}' \
-d '{\"method\":\"container.list\"}' | grep -q '\"result\"'"
# 5. WebSocket upgrade
check "WebSocket upgrade (101)" \
"curl -sf -o /dev/null -w '%{http_code}' \
-H 'Upgrade: websocket' -H 'Connection: Upgrade' \
-H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
-H 'Sec-WebSocket-Version: 13' \
http://${HOST}/ws/db | grep -q '101'"
# 6. Static assets served
check "Frontend index.html served" \
"curl -sf http://${HOST}/ | grep -q '<div id=\"app\"'"
# 7. Onboarding check (unauthenticated)
check "auth.isOnboardingComplete RPC" \
"curl -sf -X POST http://${HOST}/rpc/v1 \
-H 'Content-Type: application/json' \
-d '{\"method\":\"auth.isOnboardingComplete\"}' | grep -q '\"result\"'"
echo ""
echo "=== Results: $PASS passed, $FAIL failed ==="
if [ $FAIL -gt 0 ]; then
echo -e "Failures:$FAILURES"
exit 1
fi
exit 0