docs(tracker): add explicit container-flapping/reconciler-churn workstream to Tier 2

Was implicit across the Phase-3 Quadlet flip and Workstream F; now one
consolidated pre-tag item with the lever list, an observability proposal
(per-app restart counter + flap log line), and an already-landed list so
nothing gets re-done.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
archipelago 2026-07-04 17:41:27 -04:00
parent 9020b8526c
commit 291f2d7186

View File

@ -113,6 +113,25 @@ those are marked ✅ below with the commit that did it, so we stop re-litigating
manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd
are 3 separate single-container apps with manifest dependency edges, not a
coordinated stack. Workstream A's stack-migration tail is fully closed.
- [ ] **Container thrashing/flapping + reconciler churn** (added 2026-07-04 — was
implicit across other tracks, now an explicit pre-tag concern). The root cause
of restart-storm flapping is pre-Quadlet architecture: restarting
`archipelago.service` SIGKILLs every container in its cgroup, then the
reconciler rebuilds the world over several minutes (the post-OTA health check
deliberately skips per-app container assertions because of exactly this).
Consolidated lever list, in order of impact:
- **Phase-3 Quadlet default-flip** (tracked above) — removes the SIGKILL-the-world
behavior entirely; the single biggest fix.
- **Workstream F lifecycle items** — immich/grafana uninstall hangs + ghost
containers, grafana reinstall stops, fedimint guardian sync
(`docs/PRODUCTION-MASTER-PLAN.md` workstream F).
- **Reconciler churn observability** — no metric/log today distinguishes "settling
after restart" from "flapping"; add a per-app restart counter + log line when an
app restarts >N times in M minutes so thrash is visible instead of anecdotal.
- Already landed, don't re-do: boot-reconciler circuit breaker (2026-07-01),
indeedhub crashloop fix (2026-07-01), async blocking-Command pass (`4c75bb3d`,
removes executor stalls that made the API janky under reconcile load).
- Perf polish riding along: 93 MB frontend dist shrink (hardening plan §D 🟡).
- [ ] **Developer tooling CLI suite** (validate/render/local-install/lifecycle-test) —
APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
- [x] ~~**Consolidated deploy 2026-07-01**: merged PR #67 (reticulum daemon