70 lines
1.7 KiB
Plaintext
Raw Normal View History

release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 + v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500) with no recovery path short of SSH. This release adds a self-check guardrail to the update flow. What changed: - apply_update() writes a pending-verify marker with old+new version and a 150s deadline immediately before scheduling the service restart. - verify_pending_update() runs from main.rs startup. If the marker is present and within its freshness window, the new binary waits 15s for nginx + backend to settle, then probes https://127.0.0.1/ every 5s for up to 90s (self-signed certs accepted). - On any probe success within the window, the marker is cleared and nothing else happens. - On window-exhaust, the new binary: 1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts> (quarantined, not deleted, so we can post-mortem). 2. Restores web-ui.bak on top of web-ui. 3. Calls rollback_update() to restore the previous binary. 4. Updates state.current_version to reflect the rollback. 5. systemctl --no-block restart archipelago so the OLD binary boots. - Markers older than 10 minutes are treated as stale and cleared without probing, so a crashed-during-startup marker from weeks ago cannot spontaneously roll back a healthy node on a later reboot. - rollback_update() binary copy now goes through host_sudo instead of tokio::fs::copy, so it escapes the service's ProtectSystem=strict mount namespace. Without this, the rollback silently failed with EROFS on /usr/local/bin and orphaned the rollback - the exact opposite of what auto-rollback is for. Tests: 4 new unit tests in update::tests covering marker round-trip, absent-marker noop, no-panic on verify_pending_update with nothing to verify, and an invariant assert that the 90s probe window stays below the 600s stale threshold. All passing. Side fix: scripts/create-release-manifest.sh was dying with exit 141 (SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail. Replaced with a single awk NR==1 that doesn't short-circuit the upstream pipe, so the release-build flow is idempotent again.
2026-04-22 16:14:35 -04:00
#!/bin/sh
# ----------------------------------------------------------------------
# extract-ikconfig - Extract the .config file from a kernel image
#
# This will only work when the kernel was compiled with CONFIG_IKCONFIG.
#
# The obscure use of the "tr" filter is to work around older versions of
# "grep" that report the byte offset of the line instead of the pattern.
#
# (c) 2009,2010 Dick Streefland <dick@streefland.net>
# Licensed under the terms of the GNU General Public License.
# ----------------------------------------------------------------------
cf1='IKCFG_ST\037\213\010'
cf2='0123456789'
dump_config()
{
if pos=`tr "$cf1\n$cf2" "\n$cf2=" < "$1" | grep -abo "^$cf2"`
then
pos=${pos%%:*}
tail -c+$(($pos+8)) "$1" | zcat > $tmp1 2> /dev/null
if [ $? != 1 ]
then # exit status must be 0 or 2 (trailing garbage warning)
cat $tmp1
exit 0
fi
fi
}
try_decompress()
{
for pos in `tr "$1\n$2" "\n$2=" < "$img" | grep -abo "^$2"`
do
pos=${pos%%:*}
tail -c+$pos "$img" | $3 > $tmp2 2> /dev/null
dump_config $tmp2
done
}
# Check invocation:
me=${0##*/}
img=$1
if [ $# -ne 1 -o ! -s "$img" ]
then
echo "Usage: $me <kernel-image>" >&2
exit 2
fi
# Prepare temp files:
tmp1=/tmp/ikconfig$$.1
tmp2=/tmp/ikconfig$$.2
trap "rm -f $tmp1 $tmp2" 0
# Initial attempt for uncompressed images or objects:
dump_config "$img"
# That didn't work, so retry after decompression.
try_decompress '\037\213\010' xy gunzip
try_decompress '\3757zXZ\000' abcde unxz
try_decompress 'BZh' xy bunzip2
try_decompress '\135\0\0\0' xxx unlzma
try_decompress '\211\114\132' xy 'lzop -d'
try_decompress '\002\041\114\030' xyy 'lz4 -d -l'
try_decompress '\050\265\057\375' xxx unzstd
# Bail out:
echo "$me: Cannot find kernel config." >&2
exit 1