API timeouts elevated
Resolved
Aug 06 at 05:19am PDT
The issue is fully resolved. Apologies for the disruption and thank you for your patience.
The issue was caused by a load spike that triggered a scale-up to a large number of API and Worker pods. These pods interface heavily with our Dragonfly (Redis-compatible) instance via BullMQ for job queueing. The increased connection and request volume forced Dragonfly to start queueing pipeline operations, which delayed BullMQ operations, caused failures across the system, and allowed scrape jobs to accumulate.
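For context, a minimal sketch of the queueing path described above. The queue name, host, and job payload are illustrative assumptions, not our production code; Dragonfly speaks the Redis protocol, so BullMQ connects to it like any Redis host.

```ts
import { Queue, Worker } from 'bullmq';

// Hypothetical Dragonfly endpoint; every pod holds its own connections to it.
const connection = { host: 'dragonfly', port: 6379 };

// API pods enqueue scrape jobs onto a shared queue.
const scrapeQueue = new Queue('scrape', { connection });
await scrapeQueue.add('scrape-url', { url: 'https://example.com' });

// Worker pods pull jobs off the same queue. Each worker issues frequent commands
// (job fetches, lock renewals, progress updates), so scaling pods up multiplies
// the command and connection load on the single Dragonfly instance.
const worker = new Worker(
  'scrape',
  async (job) => {
    // ... perform the scrape ...
    return { status: 'done', url: job.data.url };
  },
  { connection, concurrency: 10 },
);
```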
We lowered the cap on the maximum number of Worker and API pods, and we made our BullMQ lock renewal less aggressive. After these changes, the pipeline queue length metric returned to nominal levels.
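A minimal sketch of the lock-renewal side of that change, using BullMQ's worker options. The durations below are illustrative assumptions, not the values we actually deployed; the pod cap itself lives in the Kubernetes autoscaler configuration (a lower maxReplicas), so it is not shown here.

```ts
import { Worker } from 'bullmq';

const connection = { host: 'dragonfly', port: 6379 };

// A longer lock duration and a longer renewal interval mean the lock-extension
// script runs far less often per active job, cutting the per-worker command
// rate against Dragonfly.
const worker = new Worker(
  'scrape',
  async (job) => {
    // ... process the scrape job ...
  },
  {
    connection,
    lockDuration: 120_000, // hold each job lock for 2 minutes at a time
    lockRenewTime: 60_000, // renew once a minute instead of BullMQ's shorter default
  },
);
```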
Affected services
Created
Aug 06 at 04:35am PDT
We are working on resolving the issue.
Affected services