release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet

Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 and
v1.7.39 rollouts left every affected node with an unreachable UI (nginx 500)
and no recovery path short of SSH. This release adds a self-check
guardrail to the update flow.
What changed:
- apply_update() writes a pending-verify marker with old+new version and
a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is
present and within its freshness window, the new binary waits 15s for
nginx + backend to settle, then probes https://127.0.0.1/ every 5s for
up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and
nothing else happens.
- If the window is exhausted without a successful probe, the new binary:
  1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts>
     (quarantined, not deleted, so we can post-mortem).
  2. Restores web-ui.bak on top of web-ui.
  3. Calls rollback_update() to restore the previous binary.
  4. Updates state.current_version to reflect the rollback.
  5. Runs systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without
probing, so a crashed-during-startup marker from weeks ago cannot
spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of
tokio::fs::copy, so it escapes the service's ProtectSystem=strict
mount namespace. Without this, the rollback silently failed with
EROFS on /usr/local/bin and orphaned the rollback - the exact
opposite of what auto-rollback is for.
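
The startup decision described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this commit: PendingMarker, StartupAction, Outcome, classify and run_probe_window are hypothetical names, the marker's real on-disk format is not shown here, and the probe and sleep are injected as closures so the timing logic can be exercised without nginx or a real clock. The 150s deadline stored in the marker is not modeled; only the 10-minute stale threshold is. Whether the real loop probes before or after each 5s wait is also an assumption.

```rust
use std::time::Duration;

// Numbers taken from the commit message; the constant names are hypothetical.
const SETTLE_DELAY: Duration = Duration::from_secs(15);
const PROBE_INTERVAL: Duration = Duration::from_secs(5);
const PROBE_WINDOW: Duration = Duration::from_secs(90);
const STALE_THRESHOLD: Duration = Duration::from_secs(600);

/// Hypothetical marker shape: old and new version plus the write time.
#[derive(Debug, Clone, PartialEq)]
struct PendingMarker {
    old_version: String,
    new_version: String,
    written_at_secs: u64, // unix seconds when the marker was written
}

#[derive(Debug, PartialEq)]
enum StartupAction {
    Nothing,         // no marker on disk
    ClearStale,      // marker too old: clear it without probing
    ProbeThenDecide, // fresh marker: run the probe window
}

/// Decide what startup should do with the marker, given the current time.
fn classify(marker: Option<&PendingMarker>, now_secs: u64) -> StartupAction {
    match marker {
        None => StartupAction::Nothing,
        Some(m) => {
            // saturating_sub treats a marker "from the future" (clock skew) as age 0.
            let age = now_secs.saturating_sub(m.written_at_secs);
            if age > STALE_THRESHOLD.as_secs() {
                StartupAction::ClearStale
            } else {
                StartupAction::ProbeThenDecide
            }
        }
    }
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Verified, // at least one probe succeeded: clear the marker
    RollBack, // window exhausted: quarantine web-ui and restore the old binary
}

/// Wait for the settle delay, then probe at a fixed interval until the first
/// success or until the probe window is used up.
fn run_probe_window<P, S>(mut probe: P, mut sleep: S) -> Outcome
where
    P: FnMut() -> bool,
    S: FnMut(Duration),
{
    sleep(SETTLE_DELAY);
    let mut elapsed = Duration::ZERO;
    while elapsed < PROBE_WINDOW {
        if probe() {
            return Outcome::Verified;
        }
        sleep(PROBE_INTERVAL);
        elapsed += PROBE_INTERVAL;
    }
    Outcome::RollBack
}
```

In this sketch a probe that never succeeds is attempted 18 times (90s / 5s) before the rollback branch is taken, and a single success at any point short-circuits to Verified.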
Tests: 4 new unit tests in update::tests covering marker round-trip,
absent-marker noop, no-panic on verify_pending_update with nothing to
verify, and an invariant assert that the 90s probe window stays below
the 600s stale threshold. All passing.
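
For illustration, the timing invariant mentioned above could also be pinned at compile time rather than in a unit test. The sketch below is an assumption about how that might look, not the test added in this commit; all constant and function names are hypothetical. It also works out the worst-case time to a verdict (15s settle + 90s probing = 105s), which fits inside the 150s deadline written into the marker, although the commit message does not state that the deadline is meant to bound it.

```rust
// Values from the commit message; names are hypothetical.
const SETTLE_SECS: u64 = 15;
const PROBE_WINDOW_SECS: u64 = 90;
const MARKER_DEADLINE_SECS: u64 = 150;
const STALE_THRESHOLD_SECS: u64 = 600;

// Compile-time form of the invariant: the build fails if someone raises the
// probe window to or past the stale threshold.
const _: () = assert!(PROBE_WINDOW_SECS < STALE_THRESHOLD_SECS);

/// Longest time from startup to a verify-or-rollback verdict in this model.
const fn worst_case_verdict_secs() -> u64 {
    SETTLE_SECS + PROBE_WINDOW_SECS
}

/// True when the verdict is reached before the marker deadline and the
/// deadline itself is below the stale threshold.
const fn timings_are_consistent() -> bool {
    worst_case_verdict_secs() < MARKER_DEADLINE_SECS
        && MARKER_DEADLINE_SECS < STALE_THRESHOLD_SECS
}
```

A const assertion catches a bad edit to the constants at build time, whereas the unit test described above catches it only when the test suite runs.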
Side fix: scripts/create-release-manifest.sh was dying with exit 141
(SIGPIPE from tar tvzf | head | awk) under set -euo pipefail: head exits
after the first line, tar is killed by SIGPIPE on its next write, and
pipefail turns that into a script failure. Replaced with a single
awk NR==1 that doesn't short-circuit the upstream pipe, so the
release-build flow is idempotent again.
2026-04-22 16:14:35 -04:00
FROM git.tx1138.com/lfg2025/nginx:1.27.4-alpine
COPY index.html /usr/share/nginx/html/
COPY 50x.html /usr/share/nginx/html/
COPY qrcode.js /usr/share/nginx/html/
COPY nginx.conf /etc/nginx/conf.d/default.conf
# Run nginx as root to avoid chown failures in rootless Podman user namespaces
RUN sed -i 's/^user nginx;/user root;/' /etc/nginx/nginx.conf && \
    mkdir -p /var/cache/nginx/client_temp /var/cache/nginx/proxy_temp \
             /var/cache/nginx/fastcgi_temp /var/cache/nginx/uwsgi_temp \
             /var/cache/nginx/scgi_temp
EXPOSE 50002
ENTRYPOINT []
CMD ["nginx", "-g", "daemon off;"]