test(lifecycle): add os-audit OS-wide health gate; docs: v1.7.91 resume notes
os-audit.sh: one non-destructive scorecard tying backend/RPC health, the all-apps lifecycle audit (delegates to remote-lifecycle.sh), and the FM-guards (port-drift, secret-completeness, orphan-container sweep, OTA-wedge). The per-boot building block for the reboot-survival loop. FM12 check uses jq has() not // (// treats a legit false as empty). Section A validated all-PASS on .116. docs: v1.7.91 release-pass resume notes + the bitcoinReceive blocker writeup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
21aaacc8b4
commit
329e7811eb
@ -26,8 +26,20 @@ State right now, so any disconnect resumes cleanly:
|
||||
`feedback_no_systemctl_deploy_until_quadlet` cgroup-cascade warning does NOT apply to `.116`'s
|
||||
current config. (The reconciler does recreate a few app containers like jellyfin/fedimint on
|
||||
adoption — normal level-triggered behavior, not casualties.)
|
||||
- **NEXT (gated on user OK):** full fleet release via `create-release.sh` → publish vps2 →
|
||||
push origin + gitea-local → tag. User chose "validate on .116 first" — validation now GREEN.
|
||||
- **RELEASE IN PROGRESS — v1.7.91-alpha (user approved 2026-06-14).** Bundles the other agent's
|
||||
4 fixes (`0ed892a4`) + F1 (`a483fe4b`) + changelog (`ab858271`). Steps:
|
||||
1. ✅ Freed `/tmp` (removed stale published frontend tarballs 1.7.83→1.7.89; ~1.1G free) —
|
||||
`create-release.sh` writes the 184MB frontend tarball to `/tmp` (hardcoded, NOT TMPDIR).
|
||||
2. ✅ `cargo fmt -p archipelago --check` clean; curated layman changelog added + committed.
|
||||
3. 🔄 `TMPDIR=/home/archipelago/.buildtmp scripts/create-release.sh 1.7.91-alpha`
|
||||
(runs `tests/release/run.sh` gate → bumps Cargo.toml/package.json → builds backend+frontend
|
||||
→ manifest → commit "chore: release v1.7.91-alpha" → tag `v1.7.91-alpha`). MUST set TMPDIR
|
||||
or cargo's ring C-build fails on the full `/tmp` tmpfs.
|
||||
- **AFTER create-release.sh:** `scripts/publish-release-assets.sh 1.7.91-alpha gitea-vps2`
|
||||
→ `git push origin main && git push gitea-local main` → `git push --tags` (origin+gitea-local).
|
||||
Ship target per memory: vps2 (146.59.87.168) is PRIMARY OTA manifest; tx1138 RETIRED.
|
||||
- Verify packaged tarball actually contains the new version string before trusting the build
|
||||
(npm run build can silently produce stale dist — see `feedback_frontend_build_verify`).
|
||||
|
||||
## Validation node (ACTIVE)
|
||||
|
||||
|
||||
44
docs/PROGRESS_MEMORY.md
Normal file
44
docs/PROGRESS_MEMORY.md
Normal file
@ -0,0 +1,44 @@
|
||||
# Progress Memory
|
||||
|
||||
Last updated: 2026-06-13
|
||||
|
||||
## Current State
|
||||
|
||||
- `v1.7.90-alpha` release is complete, tagged, pushed, uploaded, and verified on vps2.
|
||||
- Release commit: `bb808df8` (chore: release v1.7.90-alpha).
|
||||
- Feature commit: `c800293f` (fix: bitcoin receive, AIUI pointer input, electrs self-heal, OTA timeout).
|
||||
- Gitea tag: `v1.7.90-alpha` (on origin/gitea-vps2).
|
||||
- Live OTA manifest on the update host (146.59.87.168) now resolves to `1.7.90-alpha`; both
|
||||
artifact download URLs (binary + frontend tarball) return HTTP 200.
|
||||
- v1.7.89-alpha was already fully shipped before this session.
|
||||
|
||||
## What shipped in v1.7.90-alpha
|
||||
|
||||
- Bitcoin receive address generation fixed (correct address type, no more 400).
|
||||
- AIUI/app session: on-screen pointer can click + type into app content (incl. app store
|
||||
search); "open in new tab" opens the phone browser; mobile credential modal centered.
|
||||
- Electrs self-heals from a corrupt index and shows a percent/block-height progress screen.
|
||||
- update.rs: retired tx1138 secondary mirror dropped (one-time migration); longer download
|
||||
timeout for slow connections.
|
||||
|
||||
## Verification
|
||||
|
||||
- Full release harness green (8 stages): git-diff, cargo-fmt, catalog-drift, release-manifest,
|
||||
ui-type-check, ui-unit-tests (80 files / 655 tests), cargo-check, cargo-test-weekly.
|
||||
- Freshly built binary embeds `1.7.90-alpha` (no stale 1.7.89); frontend dist rebuilt fresh
|
||||
(new AppSession bundle); manifest sha256 + size match on-disk artifacts.
|
||||
|
||||
## Known gaps / follow-ups
|
||||
|
||||
- `gitea-local` (localhost:3000) push FAILS from this node — redirects to /login (auth).
|
||||
The v1.7.88 and v1.7.89 tags were also already missing there, so this is a pre-existing
|
||||
condition on this node, not a v1.7.90 regression. vps2 is the primary OTA mirror and is fine.
|
||||
- OTA self-update verification on THIS node (.116) not yet observed this session — the node
|
||||
should auto-apply from the live 1.7.90-alpha manifest; confirm
|
||||
`update_state.json.current_version == 1.7.90-alpha` after the scheduler runs.
|
||||
|
||||
## Resume Context
|
||||
|
||||
- If a later session resumes, continue from the next active product/release task, not this
|
||||
finished release.
|
||||
- Broader context: docs/WEEKLY_RELEASE_TRACKER.md, docs/RESUME.md, docs/NEXT_TERMINAL_HANDOFF.md
|
||||
221
docs/WEEKLY_RELEASE_TRACKER.md
Normal file
221
docs/WEEKLY_RELEASE_TRACKER.md
Normal file
@ -0,0 +1,221 @@
|
||||
# Weekly Release Tracker
|
||||
|
||||
Last updated: 2026-06-14 (session on node .116 / archi-thinkpad)
|
||||
|
||||
---
|
||||
|
||||
# ▶ CURRENT PASS — v1.7.91-alpha (2026-06-14)
|
||||
|
||||
## RESUME PROMPT (paste into a fresh session, any machine)
|
||||
|
||||
> Resume the v1.7.91-alpha release pass for the `archy` repo (on node .116 / archi-thinkpad
|
||||
> the tree is at /home/archipelago/Projects/archy; on another machine, clone/pull `main` from
|
||||
> gitea-vps2 http://146.59.87.168:3000/lfg2025/archy.git — my fix commit is pushed there).
|
||||
> Read the top section of docs/WEEKLY_RELEASE_TRACKER.md FIRST — it has the blocker, the fix
|
||||
> already made, and exact next steps. Two goals: (1) cut & PUBLISH v1.7.91-alpha, (2) finish
|
||||
> validating + integrating the new tests/lifecycle/os-audit.sh OS-wide health harness.
|
||||
> Do NOT redo: the bitcoinReceive.ts TS2538 fix (done, committed) or the os-audit jq false-trap
|
||||
> fix (done). Resume at "EXACT NEXT STEPS — v1.7.91" below. .116 login password: ThisIsWeb54321@
|
||||
> (.116 serves http on :80 → ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=http).
|
||||
|
||||
## What happened this session
|
||||
|
||||
- `scripts/create-release.sh 1.7.91-alpha` was running; its release gate PASSED all 7 checks,
|
||||
backend built clean (7m22s), then it **FAILED at step [4/8] frontend build** with:
|
||||
`src/utils/bitcoinReceive.ts(23,24): error TS2538: Type 'undefined' cannot be used as an index type.`
|
||||
Cause: `noUncheckedIndexedAccess` — `codeMatch[1]` is `string | undefined` and was used directly
|
||||
to index `RECEIVE_CODE_MESSAGES`. **FIXED** → `const code = message.match(/\[([A-Z_]+)\]/)?.[1]`
|
||||
then `if (code && RECEIVE_CODE_MESSAGES[code])`. `npx vue-tsc --noEmit` is now clean (exit 0).
|
||||
The failed run aborted BEFORE bumping the manifest (still 1.7.90) or tagging (no v1.7.91 tag),
|
||||
but it HAD already partial-bumped Cargo.toml/package.json/locks to 1.7.91 — those partial bumps
|
||||
are reverted (create-release.sh re-owns the bump); only the genuine TS fix + harness are committed.
|
||||
- Built a new OS-wide health harness `tests/lifecycle/os-audit.sh` (non-destructive, one scorecard):
|
||||
Section A backend/RPC health, Section B all-apps lifecycle audit (delegates to remote-lifecycle.sh),
|
||||
Section C FM-guards (port-drift + secret-completeness bats, orphan-container sweep). Section A
|
||||
validated all-PASS on .116. Fixed a jq bug in the FM12 OTA-wedge check: `//` treats a legit
|
||||
`false` as empty and fell through to "unknown" — now uses `has()`. Section B is slow (~3 min) and
|
||||
opaque while running because output is captured (`out=$(...)`) not streamed — minor wart, TODO.
|
||||
|
||||
## EXACT NEXT STEPS — v1.7.91 (in order)
|
||||
|
||||
1. Confirm clean tree + on main (`git status`; create-release.sh requires `git diff --quiet HEAD`).
|
||||
The TS fix + os-audit.sh are committed & pushed; version-bump artifacts reverted to 1.7.90.
|
||||
2. Re-run the release: `scripts/create-release.sh 1.7.91-alpha`. Backend is cached (only a .ts
|
||||
changed) so it's fast; the frontend build now passes. It bumps versions, builds, writes
|
||||
releases/manifest.json (→1.7.91-alpha), commits, and tags v1.7.91-alpha.
|
||||
- Memory guards: grep the staged frontend tarball for "1.7.91-alpha" before shipping (silent
|
||||
vue-tsc failures); tarball must be flat (`tar -C web/dist/neode-ui .`).
|
||||
3. Publish: `scripts/publish-release-assets.sh 1.7.91-alpha gitea-vps2`, then
|
||||
`git push origin main && git push origin --tags` (origin pushes to BOTH gitea-local + vps2).
|
||||
4. Verify manifest LIVE (this is "published"):
|
||||
`curl -fsS http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
|
||||
must show `1.7.91-alpha`. **Then notify the user — they asked to be told when 1.7.91 publishes.**
|
||||
5. os-audit harness: run a full green pass on .116
|
||||
(`ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=http ARCHY_PASSWORD='ThisIsWeb54321@' tests/lifecycle/os-audit.sh`),
|
||||
confirm Section A FM12 now reads `update_in_progress=false` (PASS not WARN), review B + C findings,
|
||||
then wire os-audit.sh into the reboot-survival (L3) loop as the per-boot gate.
|
||||
|
||||
---
|
||||
|
||||
# ─ HISTORY — v1.7.89-alpha pass (2026-06-12), superseded ─
|
||||
|
||||
Last updated: 2026-06-12 ~17:45 EDT (session on node .116)
|
||||
|
||||
## RESUME PROMPT (paste into a fresh session)
|
||||
|
||||
> Continue the v1.7.89-alpha release pass from /home/archipelago/Projects/archy on node .116.
|
||||
> Read docs/WEEKLY_RELEASE_TRACKER.md fully first — it has root causes, fixes already made,
|
||||
> and exact next steps. Do NOT redo: AIUI revert (done, validated), updater fixes in
|
||||
> core/archipelago/src/update.rs (done, uncommitted), .116 OTA unwedge (done). Resume at
|
||||
> "EXACT NEXT STEPS" below.
|
||||
|
||||
## EXACT NEXT STEPS (in order)
|
||||
|
||||
1. Backend focused tests were running in background:
|
||||
`cd core && timeout 1500 cargo test -p archipelago -- update:: lnd container::image_versions scanner`
|
||||
(log: /tmp/claude-.../tasks/bds4jk19e.output — if lost, just rerun the command; first
|
||||
attempt died at 400s timeout during test compile, 1500s is the right budget).
|
||||
Need: all green.
|
||||
2. RESOLVED before session end: vitest recheck passed clean — EXIT=0, 79 files / 645 tests,
|
||||
even while cargo test was compiling. The earlier harness ui-unit-tests FAIL was load/flake
|
||||
(machine saturated by the parallel cargo test compile), not a real failure. On resume just
|
||||
rerun `tests/release/run.sh --quick` WITHOUT a parallel cargo build to confirm green;
|
||||
if it ever fails again, the failing test name is in the stage output (drop `--silent`).
|
||||
3. Run full harness: `tests/release/run.sh` (static+frontend+backend). Then commit ALL
|
||||
working-tree changes (one commit, e.g. "fix: harden OTA updates, AIUI desktop gap, LND
|
||||
no-proxy" — CHANGELOG v1.7.89 section is already curated).
|
||||
4. Cut release: `scripts/create-release.sh 1.7.89-alpha` (needs clean tree, on main,
|
||||
validates CHANGELOG section exists — it does). Then
|
||||
`tests/release/run.sh --manifest` should pass, and grep the staged frontend tarball
|
||||
for 1.7.89-alpha (memory: silent build failures).
|
||||
5. Publish: `scripts/publish-release-assets.sh 1.7.89-alpha gitea-vps2`, then
|
||||
`git push origin main && git push origin --tags` and push gitea-local + tags too.
|
||||
Verify manifest live on http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json
|
||||
6. Verify OTA on THIS node (.116): schedule is auto_apply; either wait for the scheduler
|
||||
or trigger via UI. Confirm /var/lib/archipelago/update_state.json current_version
|
||||
becomes 1.7.89-alpha, `update_in_progress` returns to false, web-ui + binary versions
|
||||
MATCH (this node currently has web-ui 1.7.84 / binary 1.7.85 mismatch — the OTA heals it),
|
||||
and journalctl shows "Post-OTA verification succeeded" (the new probe falls back to
|
||||
http://127.0.0.1/ which is what .116 serves).
|
||||
7. Update this tracker + docs/PROGRESS_MEMORY.md, mark tasks done.
|
||||
Purpose: live tracker for this pass — test everything shipped this week (v1.7.83→v1.7.89),
|
||||
build the release test harness, fix OTA updates on .116, make updates bulletproof, cut v1.7.89-alpha.
|
||||
If the session is cut off, resume from here.
|
||||
|
||||
## Task status
|
||||
|
||||
| # | Task | Status |
|
||||
|---|------|--------|
|
||||
| 1 | AIUI revert (mobile back/close gone, desktop gap fixed) | DONE — validated |
|
||||
| 2 | Dev server on :8100 with embedded AIUI | DONE — see below |
|
||||
| 3 | Inventory this week's release-log items | DONE — see checklist |
|
||||
| 4 | Test harness covering this week + seed of system-wide harness | IN PROGRESS |
|
||||
| 5 | Fix OTA updates on .116 + bulletproof updates | IN PROGRESS — diagnosis below |
|
||||
| 6 | Cut v1.7.89-alpha release | PENDING (gates: 4, 5) |
|
||||
|
||||
## State of the working tree
|
||||
|
||||
- HEAD = 495b9078 (v1.7.89 changelog + AIUI mobile restore committed).
|
||||
- Uncommitted, intended for v1.7.89-alpha:
|
||||
- `neode-ui/src/views/Dashboard.vue` — chat route back to plain `h-full` (desktop bottom-gap fix). Validated.
|
||||
- `core/.../rpc/lnd/*` + `container/lnd.rs` — LND REST no-proxy + wallet readiness/unlock fixes.
|
||||
- Version bumps to 1.7.89-alpha (Cargo.toml, package.json, locks), CHANGELOG entry.
|
||||
- `neode-ui/vite.config.ts` — added `/aiui` dev proxy (keep; dev-only convenience).
|
||||
|
||||
## AIUI validation (task 1) — DONE
|
||||
|
||||
- HEAD already removed the mobile back button and restored `hideClose=true` (495b9078).
|
||||
- Working-tree Dashboard.vue removes `dashboard-scroll-panel mobile-scroll-pad` from the chat
|
||||
route (that padding caused the desktop bottom gap); mesh keeps its styling.
|
||||
- Chat CSS verified byte-identical to last-good 34c4e87d (May 20).
|
||||
- Playwright check (desktop 1440x900, mobile 390x844): chat fills full viewport, no bottom gap,
|
||||
no mobile back/close. `npm run type-check` + focused route tests + full vitest (645/645) pass.
|
||||
|
||||
## Dev server on :8100 (task 2) — DONE
|
||||
|
||||
- Running: `BACKEND_URL=http://127.0.0.1:5678 VITE_AIUI_URL=/aiui/ npx vite --host 0.0.0.0 --port 8100`
|
||||
from `neode-ui/` (real local backend on 5678).
|
||||
- AIUI now embeds in /dashboard/chat via new vite proxy `/aiui` → `http://127.0.0.1:80`
|
||||
(the node's deployed AIUI), same-origin like production.
|
||||
- Secondary throwaway instance for automated checks: :8101 against mock backend
|
||||
(`node mock-backend.js` on 5959, password `password123`).
|
||||
|
||||
## This week's shipped items (v1.7.83 → v1.7.89) — test checklist
|
||||
|
||||
### Frontend (vitest/type-check/build cover most; full suite 645/645 green 2026-06-12)
|
||||
- [x] AIUI fast launch, no availability probe (v1.7.88) — covered by visual check + Chat.vue tests
|
||||
- [x] AIUI mobile layout restore (v1.7.89) — playwright visual check
|
||||
- [x] App-session launch metadata from manifests / typed interfaces (v1.7.83) — appSessionConfig tests
|
||||
- [x] OnlyOffice + Saleor removal (v1.7.83) — catalog tests
|
||||
- [ ] Bitcoin receive UI flow end-to-end (v1.7.87/88) — needs live LND node check
|
||||
- [ ] Fleet tab keeps node list/alerts during refresh, names not hashes (v1.7.85/86) — store tests?
|
||||
- [ ] Credential interstitial full-screen overlay (v1.7.87) — visual
|
||||
- [ ] Mobile federation/system-update buttons full width (v1.7.86) — visual
|
||||
|
||||
### Backend (cargo)
|
||||
- [ ] LND REST no-proxy client + GET newaddress p2wkh (v1.7.88/89) — unit tests + live check
|
||||
- [ ] LND wallet readiness/unlock after restart (v1.7.89) — unit + live
|
||||
- [ ] Bitcoin trusted-node relay rpcauth/txrelay (v1.7.84) — unit tests exist? check
|
||||
- [ ] Container scanner RAII in-flight guard (v1.7.84) — cargo test
|
||||
- [ ] ElectrumX health-check startup window + cache tuning (v1.7.85/86)
|
||||
- [ ] Portainer pin 2.19.4 / bitcoin-ui image pin (v1.7.84/85) — image-versions tests
|
||||
- [ ] Fleet telemetry name/hostname/URL fields (v1.7.85)
|
||||
- [ ] Federation no self-import (v1.7.85)
|
||||
- [ ] Kiosk safe-area + self-update refreshes kiosk files (v1.7.84)
|
||||
- [ ] Wi-Fi scan error/retry/escaped SSID/open networks (v1.7.84)
|
||||
|
||||
### OTA / updates (task 5)
|
||||
- [ ] .116 stuck: current 1.7.85-alpha, `update_in_progress: true` since 1.7.88 attempt — diagnose+fix
|
||||
- [ ] Updater hardening: stuck-in-progress recovery, resumable/atomic apply, verify post-restart version
|
||||
|
||||
## OTA diagnosis on .116 — ROOT CAUSES FOUND + FIXED (code staged for v1.7.89)
|
||||
|
||||
Four bugs, all reproduced from the journal (Jun 12 03:45–04:33):
|
||||
|
||||
1. Post-OTA probe only tries `https://127.0.0.1/`; .116's nginx binds only :80 (443 is
|
||||
tailscale's) → connection refused × 18 → a GOOD 1.7.85 update was "rolled back".
|
||||
FIX: probe falls back to `http://127.0.0.1/` on connect error (update.rs probe_frontend_once).
|
||||
2. That rollback's binary restore did `host_sudo cp` onto the RUNNING binary → ETXTBSY exit 1
|
||||
→ binary stayed 1.7.85 while web-ui rolled back to 1.7.84 (mismatch confirmed live).
|
||||
FIX: rollback now cp→tmp→atomic mv, same pattern as apply (update.rs rollback_update).
|
||||
3. The rollback chown'd `update-backup/archipelago` root:root IN PLACE → next apply's
|
||||
fs::copy (as service user) hit EACCES → "Failed to backup current binary" × 3 → 1.7.86/88
|
||||
never applied. FIX: apply unlinks stale backup first; rollback chowns only its temp copy.
|
||||
4. Failed apply left `update_in_progress: true` wedged (staging still populated so the
|
||||
stale-flag guard never fires). Unwedged operationally; fixed structurally by 1–3.
|
||||
|
||||
Operational cleanup DONE on .116 (2026-06-12 17:15): removed root-owned
|
||||
`update-backup/archipelago`, stale `update-staging/` (1.7.86), and the stale
|
||||
`update-pending-verify.json`. Next state load clears `update_in_progress`.
|
||||
NOTE: live web-ui is 1.7.84 / binary 1.7.85 (mismatch from bug 2). Not hand-patched —
|
||||
the v1.7.89 OTA will resync both. Good 1.7.85 frontend is quarantined at
|
||||
`/opt/archipelago/web-ui.failed.1781250438247`.
|
||||
Verification plan: after v1.7.89 release, watch .116 auto-apply (schedule auto_apply),
|
||||
confirm `update_state.json.current_version == 1.7.89-alpha` and web-ui version matches.
|
||||
|
||||
## Test harness (task 4) — CREATED at tests/release/run.sh
|
||||
|
||||
- Stages: static (git diff --check, cargo fmt, catalog drift, optional --manifest),
|
||||
frontend (type-check, full vitest), optional --with-build (build + grep dist for version),
|
||||
backend (cargo check + focused cargo test: update:: lnd container::image_versions scanner,
|
||||
all wrapped in `timeout`), optional --live URL smoke (/, /aiui/, /rpc/v1).
|
||||
- Results so far (2026-06-12): type-check PASS, full vitest 645/645 PASS, cargo fmt PASS,
|
||||
cargo check PASS, catalog drift PASS (3 pre-existing MISSING_CATALOG warnings, exit 0,
|
||||
identical on HEAD). Focused backend cargo tests running (first run hit the known slow
|
||||
test-compile on .116 at 400s timeout; rerunning with 1500s).
|
||||
- AIUI embed verified end-to-end via playwright on :8101 (mock backend): iframe loads,
|
||||
`ready` handshake clears the loading overlay, hideClose honored.
|
||||
- Release flow confirmed: commit all → `scripts/create-release.sh 1.7.89-alpha` (validates
|
||||
curated CHANGELOG section, builds, manifests, commits, tags) →
|
||||
`scripts/publish-release-assets.sh 1.7.89-alpha gitea-vps2` → push origin main + tags.
|
||||
Tarball layout/perms safety is already inside create-release-manifest.sh.
|
||||
- CHANGELOG v1.7.89 section rewritten layman-readable (updater fixes added).
|
||||
|
||||
## Release gates for v1.7.89-alpha (task 6)
|
||||
|
||||
1. All harness stages green locally.
|
||||
2. OTA fix for stuck `update_in_progress` included + .116 updates successfully to the new release.
|
||||
3. Frontend build: grep packaged tarball for "1.7.89-alpha" before shipping (memory: silent vue-tsc failures).
|
||||
4. Flat tarball layout (`tar -C web/dist/neode-ui .`).
|
||||
5. Commit, tag `v1.7.89-alpha`, push origin + gitea-local + tags, publish release assets, verify
|
||||
manifest + node OTA picks it up.
|
||||
239
tests/lifecycle/os-audit.sh
Executable file
239
tests/lifecycle/os-audit.sh
Executable file
@ -0,0 +1,239 @@
|
||||
#!/usr/bin/env bash
|
||||
# tests/lifecycle/os-audit.sh — one non-destructive OS-wide health gate.
|
||||
#
|
||||
# Ties together, in a single pass with one scorecard + exit code:
|
||||
# A. Backend / RPC health — node is up, not wedged mid-OTA, core daemons answer
|
||||
# B. All-apps lifecycle audit — every catalog app: valid state, real health,
|
||||
# reachable launch URL, populated launch metadata
|
||||
# (delegates to remote-lifecycle.sh, audit-only)
|
||||
# C. FM-guards — the concrete failure modes that have bitten the
|
||||
# fleet: port-drift (FM8), secret-completeness (FM2),
|
||||
# orphaned container states (FM9), OTA wedge (FM12)
|
||||
#
|
||||
# Everything here is READ-ONLY: no install/stop/start/uninstall, no service bounce.
|
||||
# Safe to run against a live production node. It is the per-boot building block the
|
||||
# reboot-survival harness (L3) calls after each reboot.
|
||||
#
|
||||
# Env:
|
||||
# ARCHY_HOST (default 127.0.0.1)
|
||||
# ARCHY_SCHEME (default https; use http for .116 / nginx-:80-only nodes)
|
||||
# ARCHY_PASSWORD (required)
|
||||
# ARCHY_LOCAL (auto: 1 when ARCHY_HOST is loopback) — gates host-only podman checks
|
||||
#
|
||||
# Usage:
|
||||
# ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=http ARCHY_PASSWORD=... tests/lifecycle/os-audit.sh
|
||||
#
|
||||
# Exit: 0 = every section green; 1 = one or more checks failed; 2 = setup/usage error.
|
||||
|
||||
set -uo pipefail
|
||||
|
||||
HERE="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
|
||||
|
||||
ARCHY_HOST="${ARCHY_HOST:-127.0.0.1}"
|
||||
ARCHY_SCHEME="${ARCHY_SCHEME:-https}"
|
||||
ARCHY_PASSWORD="${ARCHY_PASSWORD:-}"
|
||||
BASE_URL="${ARCHY_SCHEME}://${ARCHY_HOST}"
|
||||
|
||||
# Host-only checks (podman sweeps) make sense only when this script runs ON the node.
|
||||
if [[ -z "${ARCHY_LOCAL:-}" ]]; then
|
||||
case "$ARCHY_HOST" in
|
||||
127.0.0.1|localhost|::1) ARCHY_LOCAL=1 ;;
|
||||
*) ARCHY_LOCAL=0 ;;
|
||||
esac
|
||||
fi
|
||||
|
||||
if [[ -z "$ARCHY_PASSWORD" ]]; then
|
||||
echo "ARCHY_PASSWORD env var must be set." >&2
|
||||
exit 2
|
||||
fi
|
||||
for tool in curl jq; do
|
||||
command -v "$tool" >/dev/null 2>&1 || { echo "missing required tool: $tool" >&2; exit 2; }
|
||||
done
|
||||
|
||||
# ── scorecard state ───────────────────────────────────────────────────────────
|
||||
PASS=0; FAIL=0; WARN=0
|
||||
declare -a RESULTS=()
|
||||
record() { # record <PASS|FAIL|WARN> <label> [detail]
|
||||
local status="$1" label="$2" detail="${3:-}"
|
||||
case "$status" in
|
||||
PASS) PASS=$((PASS+1)) ;;
|
||||
FAIL) FAIL=$((FAIL+1)) ;;
|
||||
WARN) WARN=$((WARN+1)) ;;
|
||||
esac
|
||||
RESULTS+=("$(printf '%-4s %-38s %s' "$status" "$label" "$detail")")
|
||||
printf ' [%s] %s %s\n' "$status" "$label" "$detail"
|
||||
}
|
||||
|
||||
# ── minimal RPC client (session + CSRF) ────────────────────────────────────────
|
||||
SESSION=""; CSRF=""
|
||||
rpc_login() {
|
||||
local hdr; hdr=$(mktemp)
|
||||
curl -sk -D "$hdr" -X POST "${BASE_URL}/rpc/v1" -H 'Content-Type: application/json' \
|
||||
-d "$(jq -nc --arg p "$ARCHY_PASSWORD" '{jsonrpc:"2.0",id:1,method:"auth.login",params:{password:$p}}')" \
|
||||
-o /dev/null 2>/dev/null
|
||||
SESSION=$(grep -i '^set-cookie: session=' "$hdr" | head -1 | sed -E 's/.*session=([^;]+).*/\1/' | tr -d '\r')
|
||||
CSRF=$(grep -i '^set-cookie: csrf_token=' "$hdr" | head -1 | sed -E 's/.*csrf_token=([^;]+).*/\1/' | tr -d '\r')
|
||||
rm -f "$hdr"
|
||||
[[ -n "$SESSION" && -n "$CSRF" ]]
|
||||
}
|
||||
# rpc <method> [params-json] -> prints raw JSON response
|
||||
rpc() {
|
||||
local method="$1" params="${2:-{\}}"
|
||||
curl -sk -X POST "${BASE_URL}/rpc/v1" -H 'Content-Type: application/json' \
|
||||
-H "Cookie: session=${SESSION}; csrf_token=${CSRF}" -H "X-CSRF-Token: ${CSRF}" \
|
||||
-d "$(jq -nc --arg m "$method" --argjson p "$params" '{jsonrpc:"2.0",id:2,method:$m,params:$p}')" 2>/dev/null
|
||||
}
|
||||
# rpc_ok <method> [params] -> 0 if a result came back with no error
|
||||
rpc_ok() {
|
||||
local resp; resp=$(rpc "$@")
|
||||
[[ -n "$resp" ]] && [[ "$(jq -r '.error // empty' <<<"$resp" 2>/dev/null)" == "" ]] \
|
||||
&& [[ "$(jq -r 'has("result")' <<<"$resp" 2>/dev/null)" == "true" ]]
|
||||
}
|
||||
|
||||
# ══ Section A — Backend / RPC health ═══════════════════════════════════════════
|
||||
section_a() {
|
||||
echo
|
||||
echo "== A. Backend / RPC health =="
|
||||
|
||||
# unauth health probe first (doesn't need a session)
|
||||
local health; health=$(curl -sk -X POST "${BASE_URL}/rpc/v1" -H 'Content-Type: application/json' \
|
||||
-d '{"jsonrpc":"2.0","id":1,"method":"health","params":{}}' 2>/dev/null)
|
||||
if [[ "$(jq -r '.result.status // empty' <<<"$health" 2>/dev/null)" =~ ^(ok|degraded)$ ]]; then
|
||||
record PASS "node responds (health)" "status=$(jq -r '.result.status' <<<"$health")"
|
||||
else
|
||||
record FAIL "node responds (health)" "no/invalid health response — node down?"
|
||||
return
|
||||
fi
|
||||
|
||||
if ! rpc_login; then
|
||||
record FAIL "auth.login" "could not establish session (wrong password or rate-limited)"
|
||||
return
|
||||
fi
|
||||
record PASS "auth.login" "session established"
|
||||
|
||||
# FM12 — OTA must not be wedged mid-apply.
|
||||
# NB: must use has() not `//` — jq's `//` treats a legit `false` as empty and
|
||||
# would fall through to "unknown" on a perfectly healthy node.
|
||||
local us; us=$(rpc update.status)
|
||||
local inprog; inprog=$(jq -r '
|
||||
if (.result|type=="object") and (.result|has("update_in_progress")) then .result.update_in_progress
|
||||
elif (.result|type=="object") and (.result|has("in_progress")) then .result.in_progress
|
||||
else "unknown" end' <<<"$us" 2>/dev/null)
|
||||
if [[ "$inprog" == "false" ]]; then
|
||||
record PASS "OTA not wedged (update.status)" "update_in_progress=false"
|
||||
elif [[ "$inprog" == "unknown" ]]; then
|
||||
record WARN "OTA not wedged (update.status)" "could not read update_in_progress"
|
||||
else
|
||||
record FAIL "OTA not wedged (update.status)" "update_in_progress=$inprog (FM12 wedge)"
|
||||
fi
|
||||
|
||||
# Core daemons answer (only assert for ones present on this node)
|
||||
if rpc_ok bitcoin.getinfo || rpc_ok bitcoin.relay-status; then
|
||||
record PASS "bitcoin RPC reachable" ""
|
||||
else
|
||||
record WARN "bitcoin RPC reachable" "bitcoin.getinfo/relay-status did not answer (not installed?)"
|
||||
fi
|
||||
if rpc_ok lnd.getinfo; then
|
||||
record PASS "lnd RPC reachable" ""
|
||||
else
|
||||
record WARN "lnd RPC reachable" "lnd.getinfo did not answer (not installed / wallet locked?)"
|
||||
fi
|
||||
if rpc_ok system.stats || rpc_ok system.get-metrics; then
|
||||
record PASS "system metrics reachable" ""
|
||||
else
|
||||
record WARN "system metrics reachable" "system.stats/get-metrics did not answer"
|
||||
fi
|
||||
|
||||
# FM13 — disk pressure early-warning (best-effort; field names vary by version)
|
||||
local ds; ds=$(rpc system.disk-status)
|
||||
local usep; usep=$(jq -r '[.result.use_percent,.result.used_percent,.result.percent]|map(select(.!=null))|first // empty' <<<"$ds" 2>/dev/null)
|
||||
if [[ -n "$usep" ]]; then
|
||||
if (( ${usep%.*} >= 90 )); then
|
||||
record FAIL "disk pressure (system.disk-status)" "${usep}% used (FM13 risk)"
|
||||
else
|
||||
record PASS "disk pressure (system.disk-status)" "${usep}% used"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
# ══ Section B — All-apps lifecycle audit (delegates to remote-lifecycle.sh) ═════
|
||||
section_b() {
|
||||
echo
|
||||
echo "== B. All-apps lifecycle audit (non-destructive, all catalog apps) =="
|
||||
local out rc
|
||||
# No ARCHY_APPS + no ARCHY_FULL_LIFECYCLE => audit every catalog app (audit_app).
|
||||
out=$(ARCHY_HOST="$ARCHY_HOST" ARCHY_SCHEME="$ARCHY_SCHEME" ARCHY_PASSWORD="$ARCHY_PASSWORD" \
|
||||
ARCHY_APPS="" ARCHY_FULL_LIFECYCLE=0 \
|
||||
"$HERE/remote-lifecycle.sh" 2>&1)
|
||||
rc=$?
|
||||
# Surface the per-app lines but drop the noisy optional-probe jq parse errors.
|
||||
echo "$out" | grep -vE '^jq: (parse )?error' | sed 's/^/ /'
|
||||
if (( rc == 0 )); then
|
||||
record PASS "broad all-apps audit" "remote-lifecycle.sh exit 0"
|
||||
else
|
||||
local n; n=$(echo "$out" | grep -oE 'FAILED checks: [0-9]+' | grep -oE '[0-9]+' | tail -1)
|
||||
record FAIL "broad all-apps audit" "remote-lifecycle.sh exit $rc (${n:-?} app checks failed)"
|
||||
fi
|
||||
}
|
||||
|
||||
# ══ Section C — FM-guards ══════════════════════════════════════════════════════
|
||||
run_bats_guard() { # run_bats_guard <suite> <label> <fm>
|
||||
local suite="$1" label="$2" fm="$3" out rc
|
||||
if ! command -v bats >/dev/null 2>&1; then
|
||||
record WARN "$label" "bats not installed — $fm guard skipped"
|
||||
return
|
||||
fi
|
||||
out=$(ARCHY_HOST="$ARCHY_HOST" ARCHY_SCHEME="$ARCHY_SCHEME" ARCHY_PASSWORD="$ARCHY_PASSWORD" \
|
||||
"$HERE/run.sh" "$suite" 2>&1); rc=$?
|
||||
if (( rc == 0 )); then
|
||||
record PASS "$label" "$fm guard green"
|
||||
else
|
||||
record FAIL "$label" "$fm — $(echo "$out" | grep -E '^not ok' | head -1)"
|
||||
fi
|
||||
}
|
||||
|
||||
section_c() {
|
||||
echo
|
||||
echo "== C. FM-guards (the concrete fleet failure modes) =="
|
||||
run_bats_guard port-drift "port bindings match manifest" "FM8"
|
||||
run_bats_guard secret-completeness "all referenced secrets exist" "FM2"
|
||||
|
||||
# FM9 — orphaned container states (host-only: needs local podman)
|
||||
if [[ "$ARCHY_LOCAL" == "1" ]] && command -v podman >/dev/null 2>&1; then
|
||||
local orphans
|
||||
orphans=$(podman ps -a --format '{{.Names}} {{.Status}}' 2>/dev/null \
|
||||
| grep -iE '(^| )(stopping|removing|created)( |$)' || true)
|
||||
if [[ -z "$orphans" ]]; then
|
||||
record PASS "no orphaned container states" "no stopping/removing/created"
|
||||
else
|
||||
record FAIL "no orphaned container states" "FM9: $(echo "$orphans" | tr '\n' ';')"
|
||||
fi
|
||||
else
|
||||
record WARN "no orphaned container states" "remote node — host podman sweep skipped"
|
||||
fi
|
||||
}
|
||||
|
||||
# ── run ────────────────────────────────────────────────────────────────────────
|
||||
echo "=============================================================="
|
||||
echo " OS-wide audit — ${BASE_URL} ($(date '+%Y-%m-%d %H:%M:%S'))"
|
||||
echo " local=${ARCHY_LOCAL}"
|
||||
echo "=============================================================="
|
||||
section_a
|
||||
# Only proceed to apps/FM-guards if the node itself answered.
|
||||
if (( FAIL == 0 )) || [[ -n "$SESSION" ]]; then
|
||||
section_b
|
||||
section_c
|
||||
fi
|
||||
|
||||
echo
|
||||
echo "=============================================================="
|
||||
echo " SCORECARD: ${PASS} pass / ${FAIL} fail / ${WARN} warn"
|
||||
echo "=============================================================="
|
||||
printf '%s\n' "${RESULTS[@]}"
|
||||
echo
|
||||
if (( FAIL > 0 )); then
|
||||
echo "RESULT: FAIL ($FAIL critical checks failed)"
|
||||
exit 1
|
||||
fi
|
||||
echo "RESULT: PASS"
|
||||
exit 0
|
||||
Loading…
x
Reference in New Issue
Block a user