From 92612ddc700e2478551f506dd75e2d866f06c54b Mon Sep 17 00:00:00 2001
From: archipelago
Date: Thu, 23 Apr 2026 09:42:19 -0400
Subject: [PATCH] feat(reconcile): add --create-missing flag for recovering
 from failed-update rollbacks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Context: when a package update fails after remove-old-container but
before reconcile-recreate, the rollback path in update.rs tries to
restart the old container by name. If the container is already gone
(removed in step 3 of the update), rollback fails silently and the node
is left with no live container for that app but on-disk data still
intact. This is exactly the state .228 ended up in after the
reconcile-script-missing bug killed bitcoin-knots and lnd.

Reconcile was designed to only repair existing containers for optional
apps (SPEC_OPTIONAL=true): it skips "not installed" entries on the
assumption that the install RPC creates them. That safety check is
correct for normal operation but blocks recovery when an
optional-marked container has been destroyed by a failed update.

Fix: add a --create-missing flag that overrides the SPEC_OPTIONAL skip.
When set, reconcile treats absent containers exactly the same as broken
containers: it creates them from the canonical spec using the existing
on-disk data directory. Narrow-scope override; the default behaviour is
unchanged. Updated --help to document all five flags.

Verified on .228: after the failed bitcoin-core update took out both
bitcoin-knots and lnd, running

    reconcile --container=bitcoin-knots --create-missing --force

(as the archipelago user, not root; podman is rootless) brought
bitcoin-knots back using the pruned chainstate at
/var/lib/archipelago/bitcoin. Repeated for lnd. All containers now
running; electrumx reconnecting; UIs recovering.

Does NOT fix the underlying update-flow rollback hole (rollback should
be able to re-create a container from spec, not just restart by name).
That is a separate commit; this flag is the manual recovery tool plus
the primitive the improved rollback will call.
---
 scripts/reconcile-containers.sh | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/scripts/reconcile-containers.sh b/scripts/reconcile-containers.sh
index a9bb2473..b19484f6 100755
--- a/scripts/reconcile-containers.sh
+++ b/scripts/reconcile-containers.sh
@@ -18,16 +18,25 @@ SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
 # ── Parse arguments ──────────────────────────────────────────────────
 CHECK_ONLY=false
 FORCE=false
+CREATE_MISSING=false
 FILTER_TIER=""
 FILTER_CONTAINER=""
 for arg in "$@"; do
     case "$arg" in
         --check-only) CHECK_ONLY=true ;;
         --force) FORCE=true ;;
+        --create-missing) CREATE_MISSING=true ;;
         --tier=*) FILTER_TIER="${arg#*=}" ;;
         --container=*) FILTER_CONTAINER="${arg#*=}" ;;
         -h|--help)
-            echo "Usage: $0 [--check-only] [--force] [--tier=N] [--container=NAME]"
+            echo "Usage: $0 [--check-only] [--force] [--create-missing] [--tier=N] [--container=NAME]"
+            echo ""
+            echo "  --check-only       Audit only, no changes."
+            echo "  --force            Override user-stopped state."
+            echo "  --create-missing   Override SPEC_OPTIONAL for containers that have on-disk"
+            echo "                     data but no live container (recovery from failed updates)."
+            echo "  --tier=N           Only reconcile containers in tier N."
+            echo "  --container=NAME   Only reconcile the named container (spec key)."
             exit 0 ;;
     esac
 done
@@ -213,7 +222,9 @@ reconcile() {
 
     # Optional apps: only reconcile if already installed (container exists).
     # The install RPC creates the container; the reconciler just keeps it running.
-    if [ "$SPEC_OPTIONAL" = "true" ] && ! container_exists "$name"; then
+    # --create-missing overrides this so we can recover from failed-update rollbacks
+    # that deleted a container without restoring it (on-disk data still present).
+    if [ "$SPEC_OPTIONAL" = "true" ] && ! container_exists "$name" && ! $CREATE_MISSING; then
         skip "$name — not installed"
         COUNT_SKIPPED=$((COUNT_SKIPPED + 1))
         return
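
Review note: the skip decision changed by the second hunk can be sketched in isolation. This is a minimal sketch for illustration, not code from reconcile-containers.sh. The helpers should_skip and decide are hypothetical, and the real container_exists lookup is replaced by a plain true/false argument so the logic can be exercised without podman.

```shell
#!/usr/bin/env bash
# Hypothetical stand-alone model of the patched condition:
#   SPEC_OPTIONAL=true AND container absent AND --create-missing not set => skip.
# Arguments: $1 = SPEC_OPTIONAL, $2 = container exists, $3 = CREATE_MISSING
# (each "true" or "false"; assumed inputs, not read from a real spec).
should_skip() {
    local spec_optional="$1" exists="$2" create_missing="$3"
    if [ "$spec_optional" = "true" ] && [ "$exists" != "true" ] && [ "$create_missing" != "true" ]; then
        return 0    # skip: optional app, no container, no override
    fi
    return 1        # proceed: reconcile creates or repairs from the spec
}

# Print the decision as a word so it is easy to inspect.
decide() {
    if should_skip "$@"; then
        echo "skip"
    else
        echo "reconcile"
    fi
}
```

Under these assumptions, `decide true false false` prints `skip` (the unchanged default), while `decide true false true` prints `reconcile`, which is the recovery case the flag was added for. Non-optional specs and containers that still exist are reconciled regardless of the flag.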