API is degraded
Resolved
Aug 13 at 04:37pm PDT
The API has recovered.
We quickly restarted the workers to get them un-stuck again. After that, we revisited the core issue. A lot of jobs were finishing at the same time, started hammering the same sorted set in Redis at the same time, and ran into the same race conditions. We decided to break up these clumps of requests by applying a timeout of random length before retrying every time the anti-race mechanism activates. This spreads out the workers nicely.
After applying the fix and observing, the issue is resolved.
Affected services
Updated
Aug 13 at 04:30pm PDT
The issue has regressed. We are continuing to work towards a solution.
Affected services
Updated
Aug 13 at 04:26pm PDT
The API has recovered.
The concurrency limit queue has been used more heavily than previously expected, causing race conditions to occur at rates higher than expected. The anti-race mechanism was triggered in a de facto infinite loop, causing workers to stop handling new scrape jobs. The workers were restarted and the code causing the event loop exhaustion was quickly patched.
Affected services
Created
Aug 13 at 04:10pm PDT
We are investigating.
Affected services