API is degraded
Resolved
Aug 14 at 02:26pm PDT
We are back. Job timeout metrics have recovered to pre-incident levels.
The issue came down to misconfigured pipeline queue limits on Dragonfly: we set them to high values in anticipation of heavy production load, but they turned out to be far too high. As a result, Dragonfly's backpressure mechanisms kicked in much too late, by which point the instance was effectively unrecoverable. We have tuned the configuration and will continue to monitor.
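For illustration only, a minimal client-side sketch of the kind of bound involved: capping how deep a single pipeline can grow before it is flushed, so one client cannot pile an unbounded command backlog onto the server. The host, queue name, and batch size below are placeholders, not our actual configuration, and this is not the server-side limit that was mistuned.

```python
import redis

# Placeholder values, not production settings.
MAX_PIPELINE_BATCH = 256

client = redis.Redis(host="dragonfly.internal", port=6379)

def enqueue_jobs(job_payloads):
    """Push job payloads onto the queue in bounded pipeline batches."""
    pipe = client.pipeline(transaction=False)
    pending = 0
    for payload in job_payloads:
        pipe.lpush("scrape:queue", payload)
        pending += 1
        if pending >= MAX_PIPELINE_BATCH:
            pipe.execute()  # flush before the batch grows unbounded
            pending = 0
    if pending:
        pipe.execute()
```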
Updated
Aug 14 at 01:48pm PDT
The issue has regressed. We are working on a fix.
Updated
Aug 14 at 01:03pm PDT
This issue was caused by an elevated number of connections overloading the Dragonfly instance that operates the scrape queue. We have applied a fix to reduce the number of extraneous connections, and we are planning further architectural work to keep connection counts down. Apologies for the disruption.
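As a rough illustration of the connection-reduction idea (not our exact code): with a Redis-compatible client such as redis-py, a single bounded connection pool can be shared per worker process instead of each task opening its own connection to Dragonfly. The hostname, port, and pool size below are placeholder assumptions.

```python
import redis

# Placeholder values, not production settings.
POOL = redis.ConnectionPool(
    host="dragonfly.internal",
    port=6379,
    max_connections=50,  # hard cap on connections from this process
)

def get_client() -> redis.Redis:
    """Return a client backed by the shared, bounded pool."""
    return redis.Redis(connection_pool=POOL)
```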
Updated
Aug 14 at 10:54am PDT
The issue is now resolved.
Updated
Aug 14 at 10:44am PDT
The issue has been mitigated. We are now slowly allowing more workers to come online.
Created
Aug 14 at 10:33am PDT
We are investigating this issue.