archy/scripts/validate-app-manifest.sh

168 lines
4.8 KiB
Bash
Raw Normal View History

release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 + v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500) with no recovery path short of SSH. This release adds a self-check guardrail to the update flow. What changed: - apply_update() writes a pending-verify marker with old+new version and a 150s deadline immediately before scheduling the service restart. - verify_pending_update() runs from main.rs startup. If the marker is present and within its freshness window, the new binary waits 15s for nginx + backend to settle, then probes https://127.0.0.1/ every 5s for up to 90s (self-signed certs accepted). - On any probe success within the window, the marker is cleared and nothing else happens. - On window-exhaust, the new binary: 1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts> (quarantined, not deleted, so we can post-mortem). 2. Restores web-ui.bak on top of web-ui. 3. Calls rollback_update() to restore the previous binary. 4. Updates state.current_version to reflect the rollback. 5. systemctl --no-block restart archipelago so the OLD binary boots. - Markers older than 10 minutes are treated as stale and cleared without probing, so a crashed-during-startup marker from weeks ago cannot spontaneously roll back a healthy node on a later reboot. - rollback_update() binary copy now goes through host_sudo instead of tokio::fs::copy, so it escapes the service's ProtectSystem=strict mount namespace. Without this, the rollback silently failed with EROFS on /usr/local/bin and orphaned the rollback - the exact opposite of what auto-rollback is for. Tests: 4 new unit tests in update::tests covering marker round-trip, absent-marker noop, no-panic on verify_pending_update with nothing to verify, and an invariant assert that the 90s probe window stays below the 600s stale threshold. All passing. Side fix: scripts/create-release-manifest.sh was dying with exit 141 (SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail. Replaced with a single awk NR==1 that doesn't short-circuit the upstream pipe, so the release-build flow is idempotent again.
2026-04-22 16:14:35 -04:00
#!/usr/bin/env bash
#
# validate-app-manifest.sh — Validate a community-submitted app manifest
#
# Usage: ./scripts/validate-app-manifest.sh <manifest.yml>
#
# Checks:
# 1. Valid YAML syntax
# 2. Required fields present (id, title, version, image, description)
# 3. Image from trusted registry (docker.io, ghcr.io, quay.io)
# 4. No :latest tag (must pin specific version)
# 5. Resource limits specified (memory, cpu)
# 6. Security: no privileged mode, no host networking
# 7. No hardcoded secrets/passwords in environment
# 8. Port conflicts with existing apps
#
# Exit 0 = valid, Exit 1 = issues found
set -euo pipefail
if [[ $# -lt 1 ]]; then
echo "Usage: $0 <manifest.yml>"
exit 1
fi
MANIFEST="$1"
PASS=0
FAIL=0
WARN=0
check() {
local desc="$1" result="$2"
if [[ "$result" == "pass" ]]; then
PASS=$((PASS + 1))
echo " PASS: $desc"
elif [[ "$result" == "warn" ]]; then
WARN=$((WARN + 1))
echo " WARN: $desc"
else
FAIL=$((FAIL + 1))
echo " FAIL: $desc"
fi
}
echo "Validating: $MANIFEST"
echo ""
# 1. File exists and is readable
if [[ ! -f "$MANIFEST" ]]; then
echo " FAIL: File not found: $MANIFEST"
exit 1
fi
check "File exists" "pass"
# 2. Valid YAML
if ! python3 -c "import yaml; yaml.safe_load(open('$MANIFEST'))" 2>/dev/null; then
check "Valid YAML syntax" "fail"
echo " Cannot continue with invalid YAML"
exit 1
fi
check "Valid YAML syntax" "pass"
# 3. Required fields
CONTENT=$(python3 -c "
import yaml, json
with open('$MANIFEST') as f:
d = yaml.safe_load(f)
print(json.dumps(d))
" 2>/dev/null)
for field in id title version description; do
val=$(echo "$CONTENT" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('$field',''))" 2>/dev/null)
if [[ -n "$val" && "$val" != "None" ]]; then
check "Required field '$field' present" "pass"
else
check "Required field '$field' present" "fail"
fi
done
# 4. Image reference
IMAGE=$(echo "$CONTENT" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('image','') or d.get('docker_image','') or '')" 2>/dev/null)
if [[ -z "$IMAGE" || "$IMAGE" == "None" ]]; then
check "Container image specified" "fail"
else
check "Container image specified" "pass"
# Check trusted registry
TRUSTED=false
for reg in "docker.io" "ghcr.io" "quay.io" "registry.hub.docker.com" "git.tx1138.com"; do
if echo "$IMAGE" | grep -q "$reg"; then
TRUSTED=true
break
fi
done
# Also allow short-form Docker Hub images (no registry prefix)
if ! echo "$IMAGE" | grep -q "/"; then
TRUSTED=true # single-name images are Docker Hub official
fi
if [[ "$TRUSTED" == "true" ]]; then
check "Image from trusted registry" "pass"
else
check "Image from trusted registry ($IMAGE)" "warn"
fi
# Check no :latest
if echo "$IMAGE" | grep -q ":latest$"; then
check "No :latest tag (pin specific version)" "fail"
elif ! echo "$IMAGE" | grep -q ":"; then
check "No version tag specified (should pin version)" "warn"
else
check "Version tag pinned" "pass"
fi
fi
# 5. Security checks
PRIVILEGED=$(echo "$CONTENT" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('privileged', False))" 2>/dev/null)
if [[ "$PRIVILEGED" == "True" ]]; then
check "No privileged mode" "fail"
else
check "No privileged mode" "pass"
fi
HOST_NET=$(echo "$CONTENT" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('host_network', d.get('network_mode','')))" 2>/dev/null)
if [[ "$HOST_NET" == "host" ]]; then
check "No host networking" "fail"
else
check "No host networking" "pass"
fi
# 6. Check for hardcoded secrets in env vars
ENV_VARS=$(echo "$CONTENT" | python3 -c "
import sys,json
d=json.load(sys.stdin)
env = d.get('environment', d.get('env', {}))
if isinstance(env, dict):
for k,v in env.items():
print(f'{k}={v}')
elif isinstance(env, list):
for e in env:
print(e)
" 2>/dev/null || echo "")
SECRET_PATTERNS="password|secret|api_key|private_key|token"
if echo "$ENV_VARS" | grep -iqE "$SECRET_PATTERNS"; then
check "No hardcoded secrets in environment" "warn"
else
check "No hardcoded secrets in environment" "pass"
fi
# 7. Memory limit
MEM=$(echo "$CONTENT" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('memory', d.get('mem_limit', d.get('resources',{}).get('memory',''))))" 2>/dev/null)
if [[ -n "$MEM" && "$MEM" != "None" && "$MEM" != "" ]]; then
check "Memory limit specified ($MEM)" "pass"
else
check "Memory limit specified" "warn"
fi
echo ""
echo "Results: $PASS passed, $FAIL failed, $WARN warnings"
if [[ "$FAIL" -gt 0 ]]; then
echo "STATUS: REJECTED — fix failures before resubmitting"
exit 1
else
echo "STATUS: APPROVED (with $WARN warnings)"
exit 0
fi