docs(tracker): B17 root-caused + fixed (data-volume mount ordering), verified .198

This commit is contained in:
archipelago 2026-06-16 03:38:58 -04:00
parent 9da66da776
commit 8a62ae008c

View File

@ -140,8 +140,12 @@ FIX (commit 83dbd25c): added a `bitcoinStale` flag to homeStatus.ts —
Wired `bitcoinStale` through Home.vue `systemStats` → HomeSystemCard prop; card shows "Updating…" (dimmed) when stale.
**Harness:** `neode-ui/src/stores/__tests__/homeStatus.test.ts` (6 cases) — RED before fix (5/6 fail), GREEN after (6/6). `vue-tsc --noEmit` exit 0. Full vitest suite: only pre-existing AppIconGrid cross-test teardown flake (passes 7/7 standalone; not my change). UI-confirm on .116/.198 still recommended (hard to trigger transient failure on demand — unit test is the authoritative harness here).
### B17 — archipelago.service flaps on boot before starting — TODO
On some boots, `[FAILED] Failed to start archipelago.service - Archipelago Backend` prints ~20 times over ~5 min before it finally starts properly. Likely a startup dependency/timing race (DB lock, port bind, crash-recovery, or a dependency not ready) causing systemd restart loop until a precondition is met. Check service Restart=/RestartSec, ExecStartPre gates, and what the early failures log. May tie to B16/crash-recovery.
### B17 — archipelago.service flaps on boot before starting — FIXED + VERIFIED on .198 (commit 34b1fdc1)
On some boots, `[FAILED] Failed to start archipelago.service` printed ~20× over ~5 min before starting. ROOT CAUSE (proven live on .198): on production nodes `/var/lib/archipelago` is a **separate `/dev/mapper/archipelago-data` ext4 volume** (systemd unit `var-lib-archipelago.mount`), and podman's **graphroot=`/var/lib/archipelago/containers/storage`** lives on it too. The unit ordered only `After=network-online.target` — NO mount dependency — so on cold boots the service (and its `ExecStartPre`) could start BEFORE the volume mounted, write to the bare mountpoint on rootfs, fail every podman call, exit, and be restarted every 5s (`Restart=on-failure RestartSec=5`) until the mount appeared. Smoking gun in .198's journal: `var-lib-archipelago.mount: Directory /var/lib/archipelago to mount over is not empty, mounting anyway` — the service had written there pre-mount. Dev laptop .116 has the data dir on rootfs → never flaps (explains "on some boots"). Diagnostic: every node showed `banners == "Server listening"` (process always succeeds once it runs) ⇒ failure is systemd-level, not a Rust crash.
FIX (commit 34b1fdc1): `RequiresMountsFor=/var/lib/archipelago` (adds `Requires=` + `After=` on the mount unit).
- `image-recipe/configs/archipelago.service`: ships the directive on fresh ISOs.
- `bootstrap::ensure_archipelago_mount_ordering()`: self-heals already-deployed nodes' installed `/etc/systemd/system/archipelago.service` + `daemon-reload` (boot-ordering only — effective next reboot; never restarts the running service). Idempotent; harmless on rootfs installs.
VERIFIED on .198: applied directive → `systemctl show -p After` includes `var-lib-archipelago.mount`, `systemd-analyze verify` clean → rebooted: mount@07:35:22, archipelago banner@07:35:35 (13s AFTER mount), `banners=1 listening=1 failed_to_start=0` (zero flap), directive persisted. `cargo check` EXIT 0. NOTE: self-heal CODE (auto-patch on deployed nodes) still to be exercised with the built binary on .228 (directive was applied manually on .198); residual rootfs shadow files under the mountpoint are benign.
### B18 — Apps stop right after install (or become unstartable) — TODO
Many apps install but immediately stop, requiring a manual Start — or become unstartable entirely. Likely the install→start handoff / reconciler doesn't bring them up (or starts then they exit). Related to B9 (IndeedHub stopping), B10 (Immich). Possibly linked to the cgroup-SIGKILL-on-archipelago.service-restart issue (feedback_no_systemctl_deploy_until_quadlet) — but NOTE: on .116 (Quadlet) containers survived a service restart cleanly, so the reconciler may be fine there; reproduce on the affected nodes. Check post-install start sequencing + boot_reconciler + container restart policy + cgroup placement.