test(lifecycle): add os-audit OS-wide health gate; docs: v1.7.91 resume notes

os-audit.sh: one non-destructive scorecard tying backend/RPC health, the
all-apps lifecycle audit (delegates to remote-lifecycle.sh), and the FM-guards
(port-drift, secret-completeness, orphan-container sweep, OTA-wedge). The
per-boot building block for the reboot-survival loop. FM12 check uses jq has()
not // (// treats a legit false as empty). Section A validated all-PASS on .116.

docs: v1.7.91 release-pass resume notes + the bitcoinReceive blocker writeup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
archipelago 2026-06-14 04:36:06 -04:00
parent 21aaacc8b4
commit 329e7811eb
4 changed files with 518 additions and 2 deletions

View File

@ -26,8 +26,20 @@ State right now, so any disconnect resumes cleanly:
`feedback_no_systemctl_deploy_until_quadlet` cgroup-cascade warning does NOT apply to `.116`'s
current config. (The reconciler does recreate a few app containers like jellyfin/fedimint on
adoption — normal level-triggered behavior, not casualties.)
- **NEXT (gated on user OK):** full fleet release via `create-release.sh` → publish vps2 →
push origin + gitea-local → tag. User chose "validate on .116 first" — validation now GREEN.
- **RELEASE IN PROGRESS — v1.7.91-alpha (user approved 2026-06-14).** Bundles the other agent's
4 fixes (`0ed892a4`) + F1 (`a483fe4b`) + changelog (`ab858271`). Steps:
1. ✅ Freed `/tmp` (removed stale published frontend tarballs 1.7.83→1.7.89; ~1.1G free) —
`create-release.sh` writes the 184MB frontend tarball to `/tmp` (hardcoded, NOT TMPDIR).
2. ✅ `cargo fmt -p archipelago --check` clean; curated layman changelog added + committed.
3. 🔄 `TMPDIR=/home/archipelago/.buildtmp scripts/create-release.sh 1.7.91-alpha`
(runs `tests/release/run.sh` gate → bumps Cargo.toml/package.json → builds backend+frontend
→ manifest → commit "chore: release v1.7.91-alpha" → tag `v1.7.91-alpha`). MUST set TMPDIR
or cargo's ring C-build fails on the full `/tmp` tmpfs.
- **AFTER create-release.sh:** `scripts/publish-release-assets.sh 1.7.91-alpha gitea-vps2`
`git push origin main && git push gitea-local main``git push --tags` (origin+gitea-local).
Ship target per memory: vps2 (146.59.87.168) is PRIMARY OTA manifest; tx1138 RETIRED.
- Verify packaged tarball actually contains the new version string before trusting the build
(npm run build can silently produce stale dist — see `feedback_frontend_build_verify`).
## Validation node (ACTIVE)

44
docs/PROGRESS_MEMORY.md Normal file
View File

@ -0,0 +1,44 @@
# Progress Memory
Last updated: 2026-06-13
## Current State
- `v1.7.90-alpha` release is complete, tagged, pushed, uploaded, and verified on vps2.
- Release commit: `bb808df8` (chore: release v1.7.90-alpha).
- Feature commit: `c800293f` (fix: bitcoin receive, AIUI pointer input, electrs self-heal, OTA timeout).
- Gitea tag: `v1.7.90-alpha` (on origin/gitea-vps2).
- Live OTA manifest on the update host (146.59.87.168) now resolves to `1.7.90-alpha`; both
artifact download URLs (binary + frontend tarball) return HTTP 200.
- v1.7.89-alpha was already fully shipped before this session.
## What shipped in v1.7.90-alpha
- Bitcoin receive address generation fixed (correct address type, no more 400).
- AIUI/app session: on-screen pointer can click + type into app content (incl. app store
search); "open in new tab" opens the phone browser; mobile credential modal centered.
- Electrs self-heals from a corrupt index and shows a percent/block-height progress screen.
- update.rs: retired tx1138 secondary mirror dropped (one-time migration); longer download
timeout for slow connections.
## Verification
- Full release harness green (8 stages): git-diff, cargo-fmt, catalog-drift, release-manifest,
ui-type-check, ui-unit-tests (80 files / 655 tests), cargo-check, cargo-test-weekly.
- Freshly built binary embeds `1.7.90-alpha` (no stale 1.7.89); frontend dist rebuilt fresh
(new AppSession bundle); manifest sha256 + size match on-disk artifacts.
## Known gaps / follow-ups
- `gitea-local` (localhost:3000) push FAILS from this node — redirects to /login (auth).
The v1.7.88 and v1.7.89 tags were also already missing there, so this is a pre-existing
condition on this node, not a v1.7.90 regression. vps2 is the primary OTA mirror and is fine.
- OTA self-update verification on THIS node (.116) not yet observed this session — the node
should auto-apply from the live 1.7.90-alpha manifest; confirm
`update_state.json.current_version == 1.7.90-alpha` after the scheduler runs.
## Resume Context
- If a later session resumes, continue from the next active product/release task, not this
finished release.
- Broader context: docs/WEEKLY_RELEASE_TRACKER.md, docs/RESUME.md, docs/NEXT_TERMINAL_HANDOFF.md

View File

@ -0,0 +1,221 @@
# Weekly Release Tracker
Last updated: 2026-06-14 (session on node .116 / archi-thinkpad)
---
# ▶ CURRENT PASS — v1.7.91-alpha (2026-06-14)
## RESUME PROMPT (paste into a fresh session, any machine)
> Resume the v1.7.91-alpha release pass for the `archy` repo (on node .116 / archi-thinkpad
> the tree is at /home/archipelago/Projects/archy; on another machine, clone/pull `main` from
> gitea-vps2 http://146.59.87.168:3000/lfg2025/archy.git — my fix commit is pushed there).
> Read the top section of docs/WEEKLY_RELEASE_TRACKER.md FIRST — it has the blocker, the fix
> already made, and exact next steps. Two goals: (1) cut & PUBLISH v1.7.91-alpha, (2) finish
> validating + integrating the new tests/lifecycle/os-audit.sh OS-wide health harness.
> Do NOT redo: the bitcoinReceive.ts TS2538 fix (done, committed) or the os-audit jq false-trap
> fix (done). Resume at "EXACT NEXT STEPS — v1.7.91" below. .116 login password: ThisIsWeb54321@
> (.116 serves http on :80 → ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=http).
## What happened this session
- `scripts/create-release.sh 1.7.91-alpha` was running; its release gate PASSED all 7 checks,
backend built clean (7m22s), then it **FAILED at step [4/8] frontend build** with:
`src/utils/bitcoinReceive.ts(23,24): error TS2538: Type 'undefined' cannot be used as an index type.`
Cause: `noUncheckedIndexedAccess``codeMatch[1]` is `string | undefined` and was used directly
to index `RECEIVE_CODE_MESSAGES`. **FIXED**`const code = message.match(/\[([A-Z_]+)\]/)?.[1]`
then `if (code && RECEIVE_CODE_MESSAGES[code])`. `npx vue-tsc --noEmit` is now clean (exit 0).
The failed run aborted BEFORE bumping the manifest (still 1.7.90) or tagging (no v1.7.91 tag),
but it HAD already partial-bumped Cargo.toml/package.json/locks to 1.7.91 — those partial bumps
are reverted (create-release.sh re-owns the bump); only the genuine TS fix + harness are committed.
- Built a new OS-wide health harness `tests/lifecycle/os-audit.sh` (non-destructive, one scorecard):
Section A backend/RPC health, Section B all-apps lifecycle audit (delegates to remote-lifecycle.sh),
Section C FM-guards (port-drift + secret-completeness bats, orphan-container sweep). Section A
validated all-PASS on .116. Fixed a jq bug in the FM12 OTA-wedge check: `//` treats a legit
`false` as empty and fell through to "unknown" — now uses `has()`. Section B is slow (~3 min) and
opaque while running because output is captured (`out=$(...)`) not streamed — minor wart, TODO.
## EXACT NEXT STEPS — v1.7.91 (in order)
1. Confirm clean tree + on main (`git status`; create-release.sh requires `git diff --quiet HEAD`).
The TS fix + os-audit.sh are committed & pushed; version-bump artifacts reverted to 1.7.90.
2. Re-run the release: `scripts/create-release.sh 1.7.91-alpha`. Backend is cached (only a .ts
changed) so it's fast; the frontend build now passes. It bumps versions, builds, writes
releases/manifest.json (→1.7.91-alpha), commits, and tags v1.7.91-alpha.
- Memory guards: grep the staged frontend tarball for "1.7.91-alpha" before shipping (silent
vue-tsc failures); tarball must be flat (`tar -C web/dist/neode-ui .`).
3. Publish: `scripts/publish-release-assets.sh 1.7.91-alpha gitea-vps2`, then
`git push origin main && git push origin --tags` (origin pushes to BOTH gitea-local + vps2).
4. Verify manifest LIVE (this is "published"):
`curl -fsS http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
must show `1.7.91-alpha`. **Then notify the user — they asked to be told when 1.7.91 publishes.**
5. os-audit harness: run a full green pass on .116
(`ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=http ARCHY_PASSWORD='ThisIsWeb54321@' tests/lifecycle/os-audit.sh`),
confirm Section A FM12 now reads `update_in_progress=false` (PASS not WARN), review B + C findings,
then wire os-audit.sh into the reboot-survival (L3) loop as the per-boot gate.
---
# ─ HISTORY — v1.7.89-alpha pass (2026-06-12), superseded ─
Last updated: 2026-06-12 ~17:45 EDT (session on node .116)
## RESUME PROMPT (paste into a fresh session)
> Continue the v1.7.89-alpha release pass from /home/archipelago/Projects/archy on node .116.
> Read docs/WEEKLY_RELEASE_TRACKER.md fully first — it has root causes, fixes already made,
> and exact next steps. Do NOT redo: AIUI revert (done, validated), updater fixes in
> core/archipelago/src/update.rs (done, uncommitted), .116 OTA unwedge (done). Resume at
> "EXACT NEXT STEPS" below.
## EXACT NEXT STEPS (in order)
1. Backend focused tests were running in background:
`cd core && timeout 1500 cargo test -p archipelago -- update:: lnd container::image_versions scanner`
(log: /tmp/claude-.../tasks/bds4jk19e.output — if lost, just rerun the command; first
attempt died at 400s timeout during test compile, 1500s is the right budget).
Need: all green.
2. RESOLVED before session end: vitest recheck passed clean — EXIT=0, 79 files / 645 tests,
even while cargo test was compiling. The earlier harness ui-unit-tests FAIL was load/flake
(machine saturated by the parallel cargo test compile), not a real failure. On resume just
rerun `tests/release/run.sh --quick` WITHOUT a parallel cargo build to confirm green;
if it ever fails again, the failing test name is in the stage output (drop `--silent`).
3. Run full harness: `tests/release/run.sh` (static+frontend+backend). Then commit ALL
working-tree changes (one commit, e.g. "fix: harden OTA updates, AIUI desktop gap, LND
no-proxy" — CHANGELOG v1.7.89 section is already curated).
4. Cut release: `scripts/create-release.sh 1.7.89-alpha` (needs clean tree, on main,
validates CHANGELOG section exists — it does). Then
`tests/release/run.sh --manifest` should pass, and grep the staged frontend tarball
for 1.7.89-alpha (memory: silent build failures).
5. Publish: `scripts/publish-release-assets.sh 1.7.89-alpha gitea-vps2`, then
`git push origin main && git push origin --tags` and push gitea-local + tags too.
Verify manifest live on http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json
6. Verify OTA on THIS node (.116): schedule is auto_apply; either wait for the scheduler
or trigger via UI. Confirm /var/lib/archipelago/update_state.json current_version
becomes 1.7.89-alpha, `update_in_progress` returns to false, web-ui + binary versions
MATCH (this node currently has web-ui 1.7.84 / binary 1.7.85 mismatch — the OTA heals it),
and journalctl shows "Post-OTA verification succeeded" (the new probe falls back to
http://127.0.0.1/ which is what .116 serves).
7. Update this tracker + docs/PROGRESS_MEMORY.md, mark tasks done.
Purpose: live tracker for this pass — test everything shipped this week (v1.7.83→v1.7.89),
build the release test harness, fix OTA updates on .116, make updates bulletproof, cut v1.7.89-alpha.
If the session is cut off, resume from here.
## Task status
| # | Task | Status |
|---|------|--------|
| 1 | AIUI revert (mobile back/close gone, desktop gap fixed) | DONE — validated |
| 2 | Dev server on :8100 with embedded AIUI | DONE — see below |
| 3 | Inventory this week's release-log items | DONE — see checklist |
| 4 | Test harness covering this week + seed of system-wide harness | IN PROGRESS |
| 5 | Fix OTA updates on .116 + bulletproof updates | IN PROGRESS — diagnosis below |
| 6 | Cut v1.7.89-alpha release | PENDING (gates: 4, 5) |
## State of the working tree
- HEAD = 495b9078 (v1.7.89 changelog + AIUI mobile restore committed).
- Uncommitted, intended for v1.7.89-alpha:
- `neode-ui/src/views/Dashboard.vue` — chat route back to plain `h-full` (desktop bottom-gap fix). Validated.
- `core/.../rpc/lnd/*` + `container/lnd.rs` — LND REST no-proxy + wallet readiness/unlock fixes.
- Version bumps to 1.7.89-alpha (Cargo.toml, package.json, locks), CHANGELOG entry.
- `neode-ui/vite.config.ts` — added `/aiui` dev proxy (keep; dev-only convenience).
## AIUI validation (task 1) — DONE
- HEAD already removed the mobile back button and restored `hideClose=true` (495b9078).
- Working-tree Dashboard.vue removes `dashboard-scroll-panel mobile-scroll-pad` from the chat
route (that padding caused the desktop bottom gap); mesh keeps its styling.
- Chat CSS verified byte-identical to last-good 34c4e87d (May 20).
- Playwright check (desktop 1440x900, mobile 390x844): chat fills full viewport, no bottom gap,
no mobile back/close. `npm run type-check` + focused route tests + full vitest (645/645) pass.
## Dev server on :8100 (task 2) — DONE
- Running: `BACKEND_URL=http://127.0.0.1:5678 VITE_AIUI_URL=/aiui/ npx vite --host 0.0.0.0 --port 8100`
from `neode-ui/` (real local backend on 5678).
- AIUI now embeds in /dashboard/chat via new vite proxy `/aiui``http://127.0.0.1:80`
(the node's deployed AIUI), same-origin like production.
- Secondary throwaway instance for automated checks: :8101 against mock backend
(`node mock-backend.js` on 5959, password `password123`).
## This week's shipped items (v1.7.83 → v1.7.89) — test checklist
### Frontend (vitest/type-check/build cover most; full suite 645/645 green 2026-06-12)
- [x] AIUI fast launch, no availability probe (v1.7.88) — covered by visual check + Chat.vue tests
- [x] AIUI mobile layout restore (v1.7.89) — playwright visual check
- [x] App-session launch metadata from manifests / typed interfaces (v1.7.83) — appSessionConfig tests
- [x] OnlyOffice + Saleor removal (v1.7.83) — catalog tests
- [ ] Bitcoin receive UI flow end-to-end (v1.7.87/88) — needs live LND node check
- [ ] Fleet tab keeps node list/alerts during refresh, names not hashes (v1.7.85/86) — store tests?
- [ ] Credential interstitial full-screen overlay (v1.7.87) — visual
- [ ] Mobile federation/system-update buttons full width (v1.7.86) — visual
### Backend (cargo)
- [ ] LND REST no-proxy client + GET newaddress p2wkh (v1.7.88/89) — unit tests + live check
- [ ] LND wallet readiness/unlock after restart (v1.7.89) — unit + live
- [ ] Bitcoin trusted-node relay rpcauth/txrelay (v1.7.84) — unit tests exist? check
- [ ] Container scanner RAII in-flight guard (v1.7.84) — cargo test
- [ ] ElectrumX health-check startup window + cache tuning (v1.7.85/86)
- [ ] Portainer pin 2.19.4 / bitcoin-ui image pin (v1.7.84/85) — image-versions tests
- [ ] Fleet telemetry name/hostname/URL fields (v1.7.85)
- [ ] Federation no self-import (v1.7.85)
- [ ] Kiosk safe-area + self-update refreshes kiosk files (v1.7.84)
- [ ] Wi-Fi scan error/retry/escaped SSID/open networks (v1.7.84)
### OTA / updates (task 5)
- [ ] .116 stuck: current 1.7.85-alpha, `update_in_progress: true` since 1.7.88 attempt — diagnose+fix
- [ ] Updater hardening: stuck-in-progress recovery, resumable/atomic apply, verify post-restart version
## OTA diagnosis on .116 — ROOT CAUSES FOUND + FIXED (code staged for v1.7.89)
Four bugs, all reproduced from the journal (Jun 12 03:4504:33):
1. Post-OTA probe only tries `https://127.0.0.1/`; .116's nginx binds only :80 (443 is
tailscale's) → connection refused × 18 → a GOOD 1.7.85 update was "rolled back".
FIX: probe falls back to `http://127.0.0.1/` on connect error (update.rs probe_frontend_once).
2. That rollback's binary restore did `host_sudo cp` onto the RUNNING binary → ETXTBSY exit 1
→ binary stayed 1.7.85 while web-ui rolled back to 1.7.84 (mismatch confirmed live).
FIX: rollback now cp→tmp→atomic mv, same pattern as apply (update.rs rollback_update).
3. The rollback chown'd `update-backup/archipelago` root:root IN PLACE → next apply's
fs::copy (as service user) hit EACCES → "Failed to backup current binary" × 3 → 1.7.86/88
never applied. FIX: apply unlinks stale backup first; rollback chowns only its temp copy.
4. Failed apply left `update_in_progress: true` wedged (staging still populated so the
stale-flag guard never fires). Unwedged operationally; fixed structurally by 13.
Operational cleanup DONE on .116 (2026-06-12 17:15): removed root-owned
`update-backup/archipelago`, stale `update-staging/` (1.7.86), and the stale
`update-pending-verify.json`. Next state load clears `update_in_progress`.
NOTE: live web-ui is 1.7.84 / binary 1.7.85 (mismatch from bug 2). Not hand-patched —
the v1.7.89 OTA will resync both. Good 1.7.85 frontend is quarantined at
`/opt/archipelago/web-ui.failed.1781250438247`.
Verification plan: after v1.7.89 release, watch .116 auto-apply (schedule auto_apply),
confirm `update_state.json.current_version == 1.7.89-alpha` and web-ui version matches.
## Test harness (task 4) — CREATED at tests/release/run.sh
- Stages: static (git diff --check, cargo fmt, catalog drift, optional --manifest),
frontend (type-check, full vitest), optional --with-build (build + grep dist for version),
backend (cargo check + focused cargo test: update:: lnd container::image_versions scanner,
all wrapped in `timeout`), optional --live URL smoke (/, /aiui/, /rpc/v1).
- Results so far (2026-06-12): type-check PASS, full vitest 645/645 PASS, cargo fmt PASS,
cargo check PASS, catalog drift PASS (3 pre-existing MISSING_CATALOG warnings, exit 0,
identical on HEAD). Focused backend cargo tests running (first run hit the known slow
test-compile on .116 at 400s timeout; rerunning with 1500s).
- AIUI embed verified end-to-end via playwright on :8101 (mock backend): iframe loads,
`ready` handshake clears the loading overlay, hideClose honored.
- Release flow confirmed: commit all → `scripts/create-release.sh 1.7.89-alpha` (validates
curated CHANGELOG section, builds, manifests, commits, tags) →
`scripts/publish-release-assets.sh 1.7.89-alpha gitea-vps2` → push origin main + tags.
Tarball layout/perms safety is already inside create-release-manifest.sh.
- CHANGELOG v1.7.89 section rewritten layman-readable (updater fixes added).
## Release gates for v1.7.89-alpha (task 6)
1. All harness stages green locally.
2. OTA fix for stuck `update_in_progress` included + .116 updates successfully to the new release.
3. Frontend build: grep packaged tarball for "1.7.89-alpha" before shipping (memory: silent vue-tsc failures).
4. Flat tarball layout (`tar -C web/dist/neode-ui .`).
5. Commit, tag `v1.7.89-alpha`, push origin + gitea-local + tags, publish release assets, verify
manifest + node OTA picks it up.

239
tests/lifecycle/os-audit.sh Executable file
View File

@ -0,0 +1,239 @@
#!/usr/bin/env bash
# tests/lifecycle/os-audit.sh — one non-destructive OS-wide health gate.
#
# Ties together, in a single pass with one scorecard + exit code:
# A. Backend / RPC health — node is up, not wedged mid-OTA, core daemons answer
# B. All-apps lifecycle audit — every catalog app: valid state, real health,
# reachable launch URL, populated launch metadata
# (delegates to remote-lifecycle.sh, audit-only)
# C. FM-guards — the concrete failure modes that have bitten the
# fleet: port-drift (FM8), secret-completeness (FM2),
# orphaned container states (FM9), OTA wedge (FM12)
#
# Everything here is READ-ONLY: no install/stop/start/uninstall, no service bounce.
# Safe to run against a live production node. It is the per-boot building block the
# reboot-survival harness (L3) calls after each reboot.
#
# Env:
# ARCHY_HOST (default 127.0.0.1)
# ARCHY_SCHEME (default https; use http for .116 / nginx-:80-only nodes)
# ARCHY_PASSWORD (required)
# ARCHY_LOCAL (auto: 1 when ARCHY_HOST is loopback) — gates host-only podman checks
#
# Usage:
# ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=http ARCHY_PASSWORD=... tests/lifecycle/os-audit.sh
#
# Exit: 0 = every section green; 1 = one or more checks failed; 2 = setup/usage error.
set -uo pipefail
HERE="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
ARCHY_HOST="${ARCHY_HOST:-127.0.0.1}"
ARCHY_SCHEME="${ARCHY_SCHEME:-https}"
ARCHY_PASSWORD="${ARCHY_PASSWORD:-}"
BASE_URL="${ARCHY_SCHEME}://${ARCHY_HOST}"
# Host-only checks (podman sweeps) make sense only when this script runs ON the node.
if [[ -z "${ARCHY_LOCAL:-}" ]]; then
case "$ARCHY_HOST" in
127.0.0.1|localhost|::1) ARCHY_LOCAL=1 ;;
*) ARCHY_LOCAL=0 ;;
esac
fi
if [[ -z "$ARCHY_PASSWORD" ]]; then
echo "ARCHY_PASSWORD env var must be set." >&2
exit 2
fi
for tool in curl jq; do
command -v "$tool" >/dev/null 2>&1 || { echo "missing required tool: $tool" >&2; exit 2; }
done
# ── scorecard state ───────────────────────────────────────────────────────────
PASS=0; FAIL=0; WARN=0
declare -a RESULTS=()
record() { # record <PASS|FAIL|WARN> <label> [detail]
local status="$1" label="$2" detail="${3:-}"
case "$status" in
PASS) PASS=$((PASS+1)) ;;
FAIL) FAIL=$((FAIL+1)) ;;
WARN) WARN=$((WARN+1)) ;;
esac
RESULTS+=("$(printf '%-4s %-38s %s' "$status" "$label" "$detail")")
printf ' [%s] %s %s\n' "$status" "$label" "$detail"
}
# ── minimal RPC client (session + CSRF) ────────────────────────────────────────
SESSION=""; CSRF=""
rpc_login() {
local hdr; hdr=$(mktemp)
curl -sk -D "$hdr" -X POST "${BASE_URL}/rpc/v1" -H 'Content-Type: application/json' \
-d "$(jq -nc --arg p "$ARCHY_PASSWORD" '{jsonrpc:"2.0",id:1,method:"auth.login",params:{password:$p}}')" \
-o /dev/null 2>/dev/null
SESSION=$(grep -i '^set-cookie: session=' "$hdr" | head -1 | sed -E 's/.*session=([^;]+).*/\1/' | tr -d '\r')
CSRF=$(grep -i '^set-cookie: csrf_token=' "$hdr" | head -1 | sed -E 's/.*csrf_token=([^;]+).*/\1/' | tr -d '\r')
rm -f "$hdr"
[[ -n "$SESSION" && -n "$CSRF" ]]
}
# rpc <method> [params-json] -> prints raw JSON response
rpc() {
local method="$1" params="${2:-{\}}"
curl -sk -X POST "${BASE_URL}/rpc/v1" -H 'Content-Type: application/json' \
-H "Cookie: session=${SESSION}; csrf_token=${CSRF}" -H "X-CSRF-Token: ${CSRF}" \
-d "$(jq -nc --arg m "$method" --argjson p "$params" '{jsonrpc:"2.0",id:2,method:$m,params:$p}')" 2>/dev/null
}
# rpc_ok <method> [params] -> 0 if a result came back with no error
rpc_ok() {
local resp; resp=$(rpc "$@")
[[ -n "$resp" ]] && [[ "$(jq -r '.error // empty' <<<"$resp" 2>/dev/null)" == "" ]] \
&& [[ "$(jq -r 'has("result")' <<<"$resp" 2>/dev/null)" == "true" ]]
}
# ══ Section A — Backend / RPC health ═══════════════════════════════════════════
section_a() {
echo
echo "== A. Backend / RPC health =="
# unauth health probe first (doesn't need a session)
local health; health=$(curl -sk -X POST "${BASE_URL}/rpc/v1" -H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"health","params":{}}' 2>/dev/null)
if [[ "$(jq -r '.result.status // empty' <<<"$health" 2>/dev/null)" =~ ^(ok|degraded)$ ]]; then
record PASS "node responds (health)" "status=$(jq -r '.result.status' <<<"$health")"
else
record FAIL "node responds (health)" "no/invalid health response — node down?"
return
fi
if ! rpc_login; then
record FAIL "auth.login" "could not establish session (wrong password or rate-limited)"
return
fi
record PASS "auth.login" "session established"
# FM12 — OTA must not be wedged mid-apply.
# NB: must use has() not `//` — jq's `//` treats a legit `false` as empty and
# would fall through to "unknown" on a perfectly healthy node.
local us; us=$(rpc update.status)
local inprog; inprog=$(jq -r '
if (.result|type=="object") and (.result|has("update_in_progress")) then .result.update_in_progress
elif (.result|type=="object") and (.result|has("in_progress")) then .result.in_progress
else "unknown" end' <<<"$us" 2>/dev/null)
if [[ "$inprog" == "false" ]]; then
record PASS "OTA not wedged (update.status)" "update_in_progress=false"
elif [[ "$inprog" == "unknown" ]]; then
record WARN "OTA not wedged (update.status)" "could not read update_in_progress"
else
record FAIL "OTA not wedged (update.status)" "update_in_progress=$inprog (FM12 wedge)"
fi
# Core daemons answer (only assert for ones present on this node)
if rpc_ok bitcoin.getinfo || rpc_ok bitcoin.relay-status; then
record PASS "bitcoin RPC reachable" ""
else
record WARN "bitcoin RPC reachable" "bitcoin.getinfo/relay-status did not answer (not installed?)"
fi
if rpc_ok lnd.getinfo; then
record PASS "lnd RPC reachable" ""
else
record WARN "lnd RPC reachable" "lnd.getinfo did not answer (not installed / wallet locked?)"
fi
if rpc_ok system.stats || rpc_ok system.get-metrics; then
record PASS "system metrics reachable" ""
else
record WARN "system metrics reachable" "system.stats/get-metrics did not answer"
fi
# FM13 — disk pressure early-warning (best-effort; field names vary by version)
local ds; ds=$(rpc system.disk-status)
local usep; usep=$(jq -r '[.result.use_percent,.result.used_percent,.result.percent]|map(select(.!=null))|first // empty' <<<"$ds" 2>/dev/null)
if [[ -n "$usep" ]]; then
if (( ${usep%.*} >= 90 )); then
record FAIL "disk pressure (system.disk-status)" "${usep}% used (FM13 risk)"
else
record PASS "disk pressure (system.disk-status)" "${usep}% used"
fi
fi
}
# ══ Section B — All-apps lifecycle audit (delegates to remote-lifecycle.sh) ═════
section_b() {
echo
echo "== B. All-apps lifecycle audit (non-destructive, all catalog apps) =="
local out rc
# No ARCHY_APPS + no ARCHY_FULL_LIFECYCLE => audit every catalog app (audit_app).
out=$(ARCHY_HOST="$ARCHY_HOST" ARCHY_SCHEME="$ARCHY_SCHEME" ARCHY_PASSWORD="$ARCHY_PASSWORD" \
ARCHY_APPS="" ARCHY_FULL_LIFECYCLE=0 \
"$HERE/remote-lifecycle.sh" 2>&1)
rc=$?
# Surface the per-app lines but drop the noisy optional-probe jq parse errors.
echo "$out" | grep -vE '^jq: (parse )?error' | sed 's/^/ /'
if (( rc == 0 )); then
record PASS "broad all-apps audit" "remote-lifecycle.sh exit 0"
else
local n; n=$(echo "$out" | grep -oE 'FAILED checks: [0-9]+' | grep -oE '[0-9]+' | tail -1)
record FAIL "broad all-apps audit" "remote-lifecycle.sh exit $rc (${n:-?} app checks failed)"
fi
}
# ══ Section C — FM-guards ══════════════════════════════════════════════════════
run_bats_guard() { # run_bats_guard <suite> <label> <fm>
local suite="$1" label="$2" fm="$3" out rc
if ! command -v bats >/dev/null 2>&1; then
record WARN "$label" "bats not installed — $fm guard skipped"
return
fi
out=$(ARCHY_HOST="$ARCHY_HOST" ARCHY_SCHEME="$ARCHY_SCHEME" ARCHY_PASSWORD="$ARCHY_PASSWORD" \
"$HERE/run.sh" "$suite" 2>&1); rc=$?
if (( rc == 0 )); then
record PASS "$label" "$fm guard green"
else
record FAIL "$label" "$fm$(echo "$out" | grep -E '^not ok' | head -1)"
fi
}
section_c() {
echo
echo "== C. FM-guards (the concrete fleet failure modes) =="
run_bats_guard port-drift "port bindings match manifest" "FM8"
run_bats_guard secret-completeness "all referenced secrets exist" "FM2"
# FM9 — orphaned container states (host-only: needs local podman)
if [[ "$ARCHY_LOCAL" == "1" ]] && command -v podman >/dev/null 2>&1; then
local orphans
orphans=$(podman ps -a --format '{{.Names}} {{.Status}}' 2>/dev/null \
| grep -iE '(^| )(stopping|removing|created)( |$)' || true)
if [[ -z "$orphans" ]]; then
record PASS "no orphaned container states" "no stopping/removing/created"
else
record FAIL "no orphaned container states" "FM9: $(echo "$orphans" | tr '\n' ';')"
fi
else
record WARN "no orphaned container states" "remote node — host podman sweep skipped"
fi
}
# ── run ────────────────────────────────────────────────────────────────────────
echo "=============================================================="
echo " OS-wide audit — ${BASE_URL} ($(date '+%Y-%m-%d %H:%M:%S'))"
echo " local=${ARCHY_LOCAL}"
echo "=============================================================="
section_a
# Only proceed to apps/FM-guards if the node itself answered.
if (( FAIL == 0 )) || [[ -n "$SESSION" ]]; then
section_b
section_c
fi
echo
echo "=============================================================="
echo " SCORECARD: ${PASS} pass / ${FAIL} fail / ${WARN} warn"
echo "=============================================================="
printf '%s\n' "${RESULTS[@]}"
echo
if (( FAIL > 0 )); then
echo "RESULT: FAIL ($FAIL critical checks failed)"
exit 1
fi
echo "RESULT: PASS"
exit 0