archy/scripts/check-release-manifest.sh

79 lines
3.1 KiB
Bash
Raw Normal View History

release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 + v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500) with no recovery path short of SSH. This release adds a self-check guardrail to the update flow. What changed: - apply_update() writes a pending-verify marker with old+new version and a 150s deadline immediately before scheduling the service restart. - verify_pending_update() runs from main.rs startup. If the marker is present and within its freshness window, the new binary waits 15s for nginx + backend to settle, then probes https://127.0.0.1/ every 5s for up to 90s (self-signed certs accepted). - On any probe success within the window, the marker is cleared and nothing else happens. - On window-exhaust, the new binary: 1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts> (quarantined, not deleted, so we can post-mortem). 2. Restores web-ui.bak on top of web-ui. 3. Calls rollback_update() to restore the previous binary. 4. Updates state.current_version to reflect the rollback. 5. systemctl --no-block restart archipelago so the OLD binary boots. - Markers older than 10 minutes are treated as stale and cleared without probing, so a crashed-during-startup marker from weeks ago cannot spontaneously roll back a healthy node on a later reboot. - rollback_update() binary copy now goes through host_sudo instead of tokio::fs::copy, so it escapes the service's ProtectSystem=strict mount namespace. Without this, the rollback silently failed with EROFS on /usr/local/bin and orphaned the rollback - the exact opposite of what auto-rollback is for. Tests: 4 new unit tests in update::tests covering marker round-trip, absent-marker noop, no-panic on verify_pending_update with nothing to verify, and an invariant assert that the 90s probe window stays below the 600s stale threshold. All passing. Side fix: scripts/create-release-manifest.sh was dying with exit 141 (SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail. Replaced with a single awk NR==1 that doesn't short-circuit the upstream pipe, so the release-build flow is idempotent again.
2026-04-22 16:14:35 -04:00
#!/bin/bash
# Validate releases/manifest.json:
# - version matches core/archipelago/Cargo.toml
# - changelog is non-empty (release notes are mandatory per product policy)
# - every component's download_url exists on disk and matches sha256/size
#
# Run on every push from CI, and also locally before publishing a release:
# scripts/check-release-manifest.sh
#
# Exits non-zero on any mismatch so the release process fails loud.
set -eo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
MANIFEST="$REPO_ROOT/releases/manifest.json"
if [ ! -f "$MANIFEST" ]; then
echo "❌ releases/manifest.json missing"
exit 1
fi
fail() { echo "$*"; exit 1; }
ok() { echo "$*"; }
MANIFEST_VERSION=$(python3 -c "import json; print(json.load(open('$MANIFEST'))['version'])")
CARGO_VERSION=$(grep '^version' "$REPO_ROOT/core/archipelago/Cargo.toml" | head -1 | sed -E 's/.*"([^"]+)".*/\1/')
if [ "$MANIFEST_VERSION" != "$CARGO_VERSION" ]; then
fail "manifest version ($MANIFEST_VERSION) ≠ Cargo.toml ($CARGO_VERSION)"
fi
ok "version matches: $MANIFEST_VERSION"
# Release notes mandatory — ships stuff nobody can read otherwise.
CHANGELOG_COUNT=$(python3 -c "import json; print(len(json.load(open('$MANIFEST'))['changelog']))")
if [ "$CHANGELOG_COUNT" -eq 0 ]; then
fail "changelog is empty — every release MUST have release notes"
fi
ok "changelog has $CHANGELOG_COUNT lines"
# Each component: the artifact on disk under releases/v<version>/ must match
# the declared sha256 and size_bytes.
VERSION_DIR="$REPO_ROOT/releases/v${MANIFEST_VERSION}"
if [ ! -d "$VERSION_DIR" ]; then
fail "releases/v${MANIFEST_VERSION}/ missing — artifacts not staged"
fi
COMPONENT_COUNT=$(python3 -c "import json; print(len(json.load(open('$MANIFEST'))['components']))")
for i in $(seq 0 $((COMPONENT_COUNT - 1))); do
NAME=$(python3 -c "import json; print(json.load(open('$MANIFEST'))['components'][$i]['name'])")
DECLARED_SHA=$(python3 -c "import json; print(json.load(open('$MANIFEST'))['components'][$i]['sha256'])")
DECLARED_SIZE=$(python3 -c "import json; print(json.load(open('$MANIFEST'))['components'][$i]['size_bytes'])")
# Component names other than exactly "archipelago" are the tarball's
# filename; use as-is. The bare "archipelago" component maps to the
# binary file literally named `archipelago`.
FILE="$VERSION_DIR/$NAME"
if [ "$NAME" = "archipelago" ]; then
FILE="$VERSION_DIR/archipelago"
fi
if [ ! -f "$FILE" ]; then
fail "component '$NAME' file missing at $FILE"
fi
ACTUAL_SHA=$(sha256sum "$FILE" | awk '{print $1}')
ACTUAL_SIZE=$(stat -c%s "$FILE")
if [ "$ACTUAL_SHA" != "$DECLARED_SHA" ]; then
fail "component '$NAME' sha256 mismatch (declared=$DECLARED_SHA actual=$ACTUAL_SHA)"
fi
if [ "$ACTUAL_SIZE" != "$DECLARED_SIZE" ]; then
fail "component '$NAME' size mismatch (declared=$DECLARED_SIZE actual=$ACTUAL_SIZE)"
fi
ok "component '$NAME': sha256 + size match on-disk artifact"
done
echo
ok "releases/manifest.json passes all checks — safe to publish v${MANIFEST_VERSION}"