docs(tracker): add explicit container-flapping/reconciler-churn workstream to Tier 2
Was implicit across the Phase-3 Quadlet flip and Workstream F; now one consolidated pre-tag item with the lever list, an observability proposal (per-app restart counter + flap log line), and an already-landed list so nothing gets re-done. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
parent
9020b8526c
commit
291f2d7186
@ -113,6 +113,25 @@ those are marked ✅ below with the commit that did it, so we stop re-litigating
|
||||
manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd
|
||||
are 3 separate single-container apps with manifest dependency edges, not a
|
||||
coordinated stack. Workstream A's stack-migration tail is fully closed.
|
||||
- [ ] **Container thrashing/flapping + reconciler churn** (added 2026-07-04 — was
|
||||
implicit across other tracks, now an explicit pre-tag concern). The root cause
|
||||
of restart-storm flapping is pre-Quadlet architecture: restarting
|
||||
`archipelago.service` SIGKILLs every container in its cgroup, then the
|
||||
reconciler rebuilds the world over several minutes (the post-OTA health check
|
||||
deliberately skips per-app container assertions because of exactly this).
|
||||
Consolidated lever list, in order of impact:
|
||||
- **Phase-3 Quadlet default-flip** (tracked above) — removes the SIGKILL-the-world
|
||||
behavior entirely; the single biggest fix.
|
||||
- **Workstream F lifecycle items** — immich/grafana uninstall hangs + ghost
|
||||
containers, grafana reinstall stops, fedimint guardian sync
|
||||
(`docs/PRODUCTION-MASTER-PLAN.md` workstream F).
|
||||
- **Reconciler churn observability** — no metric/log today distinguishes "settling
|
||||
after restart" from "flapping"; add a per-app restart counter + log line when an
|
||||
app restarts >N times in M minutes so thrash is visible instead of anecdotal.
|
||||
- Already landed, don't re-do: boot-reconciler circuit breaker (2026-07-01),
|
||||
indeedhub crashloop fix (2026-07-01), async blocking-Command pass (`4c75bb3d`,
|
||||
removes executor stalls that made the API janky under reconcile load).
|
||||
- Perf polish riding along: 93 MB frontend dist shrink (hardening plan §D 🟡).
|
||||
- [ ] **Developer tooling CLI suite** (validate/render/local-install/lifecycle-test) —
|
||||
APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
|
||||
- [x] ~~**Consolidated deploy 2026-07-01**: merged PR #67 (reticulum daemon
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user