Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 +
v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500)
with no recovery path short of SSH. This release adds a self-check
guardrail to the update flow.
What changed:
- apply_update() writes a pending-verify marker with old+new version and
a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is
present and within its freshness window, the new binary waits 15s for
nginx + backend to settle, then probes https://127.0.0.1/ every 5s for
up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and
nothing else happens.
- On window-exhaust, the new binary:
1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts>
(quarantined, not deleted, so we can post-mortem).
2. Restores web-ui.bak on top of web-ui.
3. Calls rollback_update() to restore the previous binary.
4. Updates state.current_version to reflect the rollback.
5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without
probing, so a crashed-during-startup marker from weeks ago cannot
spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of
tokio::fs::copy, so it escapes the service's ProtectSystem=strict
mount namespace. Without this, the rollback silently failed with
EROFS on /usr/local/bin and orphaned the rollback - the exact
opposite of what auto-rollback is for.
Tests: 4 new unit tests in update::tests covering marker round-trip,
absent-marker noop, no-panic on verify_pending_update with nothing to
verify, and an invariant assert that the 90s probe window stays below
the 600s stale threshold. All passing.
Side fix: scripts/create-release-manifest.sh was dying with exit 141
(SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail.
Replaced with a single awk NR==1 that doesn't short-circuit the upstream
pipe, so the release-build flow is idempotent again.
v1.7.38 and v1.7.39 both shipped with `./` inside the frontend tarball marked
drwx------ (700). Tar extraction preserves archive perms, so every node that
pulled the OTA landed with /opt/archipelago/web-ui at 700, nginx (www-data)
returned 500 "permission denied" on every page, and the browser showed
"Internal Server Error nginx". .116 hit this on both v1.7.38 and v1.7.39
rollouts. The v1.7.39 runtime self-heal in main.rs was the wrong layer —
systemd's ReadOnlyPaths namespace made /opt/archipelago read-only from inside
the archipelago service, so chmod from there returned EROFS.
Root cause: create-release-manifest.sh used mktemp -d (700 default umask) for
staging, then tar preserved that 700 in the archive's root entry.
Fix the archive itself:
- chmod 755 staging dir + `find -type d -exec chmod 755` + `-type f chmod 644`
before tar, so the on-disk entries are correct.
- tar --owner=0 --group=0 --mode='u=rwX,go=rX' to normalize archive perms
belt-and-braces in case file-mode drift ever reappears.
- Post-tar verify: `tar tvzf | head -1` must show drwxr-xr-x at root, or
the release script aborts before the manifest is even generated.
Binary unchanged semantically — the main.rs self-heal stays in as a last-
resort belt (can't hurt on nodes whose FS isn't namespace-isolated), and the
update.rs in-extractor chmod stays in so v1.7.40-onwards extractors are
double-safe. The authoritative fix is the archive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- auth.rs now infers onboarding-complete from setup_complete + password_hash so
nodes stop bouncing users through the intro wizard after browser clear / update
/ reboot; the flag self-heals to disk on next check
- frontend: "backend uncertain" no longer defaults to /onboarding/intro —
useOnboarding returns null + callers poll / retry instead of flashing the wizard
- login sounds (synthwave, welcome voice, pop, whoosh, oomph) gated by
isFirstInstallPhase(); typing sounds unaffected
- removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog,
frontend config, Rust AppMetadata + install dispatch + install_penpot_stack;
docker/fips-ui + docker/nostr-vpn-ui + apps/penpot dirs and 5 icons deleted;
15 image versions deleted from tx1138, .168, gitea-local registries (.160
Gitea was 502 at release time — follow-up)
- AIUI baked into frontend release tarball via demo/aiui/; deploy-to-target
falls back to demo/aiui/ when the AIUI sibling checkout is missing
- prebuild hook syncs app-catalog/catalog.json → public/catalog.json so the
two copies can no longer drift (was the source of the "apps still visible"
bug — public/ had stale data)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Added new dependencies: `adler2`, `crc32fast`, `flate2`, `miniz_oxide`, and `libredox`.
- Updated existing dependencies: `tokio-rustls` to version 0.26.4 and `filetime` to version 0.2.27.
- Removed the `backup.rs` file as it is no longer needed.
- Introduced tests for configuration and credential management.
- Enhanced the `identity` module to generate W3C compliant DID documents.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>