feat(reconcile): add --create-missing flag for recovering from failed-update rollbacks

Context: when a package update fails after remove-old-container but
before reconcile-recreate, the rollback path in update.rs tries to
restart the old container by name. If the container is already gone
(removed in step 3 of the update), rollback fails silently and the
node is left with no live container for that app, although its
on-disk data is still intact. This is exactly the state .228 ended
up in after the reconcile-script-missing bug killed bitcoin-knots
and lnd.

For optional apps (SPEC_OPTIONAL=true), reconcile was designed to
repair only containers that already exist: it skips "not installed"
entries on the assumption that the install RPC creates them. That
safety check is correct for normal operation but blocks recovery when
an optional-marked container has been destroyed by a failed update.

Fix: add a --create-missing flag that overrides the SPEC_OPTIONAL
skip. When set, reconcile treats absent containers exactly the same
as broken containers: it creates them from the canonical spec using
the existing on-disk data directory. The override is narrow in scope;
the default behaviour is unchanged.
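
The skip decision described above can be sketched in isolation. This is
an illustrative sketch only: should_skip_missing is a hypothetical
helper, not a function in the reconcile script, and it takes plain
"true"/"false" strings instead of calling container_exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of the reconcile script.
# Arguments: <spec_optional> <container_exists> <create_missing>,
# each the string "true" or "false".
should_skip_missing() {
    local spec_optional="$1" exists="$2" create_missing="$3"
    if [ "$spec_optional" = "true" ] \
        && [ "$exists" = "false" ] \
        && [ "$create_missing" = "false" ]; then
        return 0   # skip: optional app with no container, flag not set
    fi
    return 1       # proceed: reconcile repairs or creates the container
}

# An optional app whose container is missing, with and without the flag.
for create_missing in false true; do
    if should_skip_missing true false "$create_missing"; then
        echo "create_missing=$create_missing: skip"
    else
        echo "create_missing=$create_missing: reconcile"
    fi
done
```

Only the combination optional + missing + flag unset is skipped; every
other combination falls through to the normal reconcile path.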

Updated --help to document all five flags.

Verified on .228: after the failed bitcoin-core update took out both
bitcoin-knots and lnd, running reconcile --container=bitcoin-knots
--create-missing --force (as the archipelago user, not root —
podman is rootless) brought bitcoin-knots back using the pruned
chainstate at /var/lib/archipelago/bitcoin. Repeated for lnd. All
containers now running; electrumx reconnecting; UIs recovering.
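
The recovery sequence above can be written as a small dry-run loop that
prints one command per container. This is a sketch under assumptions:
the script path in RECONCILE and the recover_cmd helper are
hypothetical; only the flags and container names come from this commit.
Run the printed commands as the archipelago user.

```shell
#!/usr/bin/env bash
# Assumed path to the reconcile script; adjust to the real location.
RECONCILE="${RECONCILE:-./reconcile.sh}"

# Hypothetical helper: print the recovery command for one container
# without executing it.
recover_cmd() {
    printf '%s --container=%s --create-missing --force\n' "$RECONCILE" "$1"
}

# The two containers that were recovered on .228, in the same order.
for app in bitcoin-knots lnd; do
    recover_cmd "$app"
done
```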

Does NOT fix the underlying update-flow rollback hole (rollback
should be able to re-create a container from spec, not just restart
by name). That is a separate commit — this flag is the manual
recovery tool plus the primitive the improved rollback will call.
archipelago 2026-04-23 09:42:19 -04:00
parent 353825b66c
commit 92612ddc70


@@ -18,16 +18,25 @@ SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
 # ── Parse arguments ──────────────────────────────────────────────────
 CHECK_ONLY=false
 FORCE=false
+CREATE_MISSING=false
 FILTER_TIER=""
 FILTER_CONTAINER=""
 for arg in "$@"; do
     case "$arg" in
         --check-only) CHECK_ONLY=true ;;
         --force) FORCE=true ;;
+        --create-missing) CREATE_MISSING=true ;;
         --tier=*) FILTER_TIER="${arg#*=}" ;;
         --container=*) FILTER_CONTAINER="${arg#*=}" ;;
         -h|--help)
-            echo "Usage: $0 [--check-only] [--force] [--tier=N] [--container=NAME]"
+            echo "Usage: $0 [--check-only] [--force] [--create-missing] [--tier=N] [--container=NAME]"
+            echo ""
+            echo " --check-only Audit only, no changes."
+            echo " --force Override user-stopped state."
+            echo " --create-missing Override SPEC_OPTIONAL for containers that have on-disk"
+            echo " data but no live container (recovery from failed updates)."
+            echo " --tier=N Only reconcile containers in tier N."
+            echo " --container=NAME Only reconcile the named container (spec key)."
             exit 0 ;;
     esac
 done
@@ -213,7 +222,9 @@ reconcile() {
     # Optional apps: only reconcile if already installed (container exists).
     # The install RPC creates the container; the reconciler just keeps it running.
-    if [ "$SPEC_OPTIONAL" = "true" ] && ! container_exists "$name"; then
+    # --create-missing overrides this so we can recover from failed-update rollbacks
+    # that deleted a container without restoring it (on-disk data still present).
+    if [ "$SPEC_OPTIONAL" = "true" ] && ! container_exists "$name" && ! $CREATE_MISSING; then
         skip "$name — not installed"
         COUNT_SKIPPED=$((COUNT_SKIPPED + 1))
         return