[Bug] IndeeHub crashes fleet-wide; recovers only after 1-2 container restarts #41
IndeeHub went down on all nodes ~simultaneously, then came back after 1-2 container restarts per node.
NOT a registry problem: vps2 has all three images (indeedhub, indeedhub-api, indeedhub-ffmpeg). The need for 1-2 restarts points at a transient runtime crash + a recovery gap for the multi-container stack (postgres/redis/minio/relay/api/ffmpeg/frontend).
Investigate: which stack member crashes and why; whether crash_recovery/reconcile restarts a stack member after it dies (the #25 fix only added restart-retry at install time in wait_for_stack_containers); startup dependency ordering / readiness gating between stack members; possible OOM. Files: crash_recovery.rs, prod_orchestrator.rs reconcile, IndeeHub stack definition.
Partial fix in commit (IndeeHub API now lists indeedhub-minio as a restart dependency, fixing the health-monitor restart-ordering race that caused 'recovers after 1-2 restarts'). NOTE: the deeper fix is first-START ordering in the IndeeHub stack/compose definition (minio before api), which needs live confirmation on a node that minio readiness is the actual blocker. Health monitor already restarts stopped/exited containers generally.
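For the first-START ordering follow-up, a minimal sketch of deriving a start order from declared dependencies (Kahn's topological sort). This is illustrative only, not the project's code: `start_order` and the input shape are hypothetical, and the only edge taken from this issue is api -> minio. Any other edges in the real stack would need to come from the actual stack definition.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Hypothetical helper: returns a start order in which every container
/// appears after all of its dependencies, or None if there is a cycle.
fn start_order(deps: &[(&str, &[&str])]) -> Option<Vec<String>> {
    // Build name -> set of dependencies, registering dependencies as nodes too.
    let mut graph: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for (name, needs) in deps {
        graph.entry(*name).or_default().extend(needs.iter().copied());
        for need in needs.iter() {
            graph.entry(*need).or_default();
        }
    }

    // Number of dependencies each container is still waiting on.
    let mut pending: BTreeMap<&str, usize> =
        graph.iter().map(|(name, needs)| (*name, needs.len())).collect();

    // Containers with no dependencies can start immediately.
    let mut ready: VecDeque<&str> = pending
        .iter()
        .filter(|(_, count)| **count == 0)
        .map(|(name, _)| *name)
        .collect();

    let mut order = Vec::new();
    while let Some(started) = ready.pop_front() {
        order.push(started.to_string());
        // Release every container that was waiting on the one just started.
        for (name, needs) in &graph {
            if needs.contains(&started) {
                let count = pending.get_mut(name).expect("every node has a counter");
                *count -= 1;
                if *count == 0 {
                    ready.push_back(*name);
                }
            }
        }
    }

    // If some container never became ready, the graph has a cycle.
    if order.len() == graph.len() {
        Some(order)
    } else {
        None
    }
}
```

Ordering alone only guarantees minio is started before api. Whether minio is actually ready at that point is the readiness question that still needs live confirmation on a node.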
PARTIAL FIX + DEPLOYED to .116/.198 (commit d4c0587d): IndeeHub API now lists indeedhub-minio as a restart dependency, fixing the health-monitor restart-ordering race behind 'recovers after 1-2 restarts'. Deeper first-START ordering in the stack/compose definition is a separate follow-up needing live confirmation that minio readiness is the blocker.
Verified fixed: IndeeHub stack health now waits for MinIO before (re)starting the API and gates on deps_are_running() (core/archipelago/src/health_monitor.rs:84,1043); boot path waits for MinIO in crash_recovery.rs:393-419. Addresses the transient crash + recovery-gap root cause. Closing as implemented (please reopen if it recurs on a fresh fleet).
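For readers following the thread, a rough sketch of what gating restarts on running dependencies looks like. It is not the code in health_monitor.rs: `State`, `all_deps_running`, and `plan_restarts` are hypothetical names, and the real `deps_are_running()` signature is not shown in this issue.

```rust
use std::collections::HashMap;

/// Hypothetical container state, reduced to what the gate needs.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum State {
    Running,
    Stopped,
}

/// True only when every listed dependency is currently running.
/// A dependency missing from `states` is treated as not running.
fn all_deps_running(deps: &[&str], states: &HashMap<&str, State>) -> bool {
    deps.iter().all(|d| states.get(*d) == Some(&State::Running))
}

/// One monitor pass: pick containers that are down AND whose dependencies
/// are already up. A dependent that is down is skipped until a later pass.
fn plan_restarts<'a>(
    stack: &[(&'a str, &[&'a str])],
    states: &HashMap<&str, State>,
) -> Vec<&'a str> {
    stack
        .iter()
        .filter(|(name, deps)| {
            states.get(*name) != Some(&State::Running) && all_deps_running(deps, states)
        })
        .map(|(name, _)| *name)
        .collect()
}
```

With both indeedhub-minio and indeedhub-api down, the first pass of this sketch selects only minio and the api is picked up on a later pass once minio reports running, instead of being restarted into a missing dependency.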