On October 4, 16, and 21, Sentry experienced three incidents with the same root causes. While each incident had a different trigger, the failure mode for all three was as follows:
While each incident only lasted a few minutes, cleanup and customer-visible impact extended beyond the initial period. Further, these three incidents are related to incidents that occurred earlier in October.
The root cause of the most severe part of each of these incidents was a scaling flaw in our edge design. As traffic slowed or shifted, autoscaling of our infrastructure was either too slow to respond or, in some cases, performed the wrong scaling action entirely. Originally, our edge scaled on the CPU load of the servers involved in ingestion; however, detecting impact through CPU alone took too long, and if a slowdown showed up as iowait rather than CPU, scaling would not happen at all. This failure to scale led to a cascade of impacts affecting customers in the US region.
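To illustrate the gap, here is a minimal sketch (in Python, with hypothetical metric names and thresholds, not our actual autoscaler) of why a CPU-only scaling signal misreads an iowait-bound stall:

```python
# Illustrative only: a simplified scaling decision, not Sentry's production policy.
# Metric names and thresholds are hypothetical.
from dataclasses import dataclass


@dataclass
class EdgeMetrics:
    cpu_utilization: float   # 0.0 - 1.0, averaged across ingestion servers
    iowait_fraction: float   # fraction of time spent waiting on I/O
    request_backlog: int     # queued requests waiting for a worker


def cpu_based_decision(m: EdgeMetrics, cpu_target: float = 0.70) -> str:
    """Scale only on average CPU.

    An iowait-bound slowdown keeps CPU low, so this policy sees a
    'healthy' edge and holds, or even scales in.
    """
    if m.cpu_utilization > cpu_target:
        return "scale_out"
    if m.cpu_utilization < cpu_target / 2:
        return "scale_in"
    return "hold"


def backlog_aware_decision(m: EdgeMetrics,
                           cpu_target: float = 0.70,
                           max_backlog: int = 500) -> str:
    """Also consider queued work, so stalls that don't burn CPU still scale out."""
    if m.request_backlog > max_backlog or m.cpu_utilization > cpu_target:
        return "scale_out"
    if m.request_backlog == 0 and m.cpu_utilization < cpu_target / 2:
        return "scale_in"
    return "hold"


# An iowait-bound stall: CPU looks idle while requests pile up.
stalled = EdgeMetrics(cpu_utilization=0.20, iowait_fraction=0.60, request_backlog=4000)
print(cpu_based_decision(stalled))      # -> "scale_in"  (the wrong action)
print(backlog_aware_decision(stalled))  # -> "scale_out"
```

The point of the sketch is the shape of the problem rather than any specific fix: a scaling signal that only reflects CPU cannot distinguish an idle edge from one that is stalled on I/O.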
The problem with our edge scaling was identified in prior incidents, and work to improve our reliability has been underway over the past few weeks. Changes that have already been made include:
We have already seen a significant improvement in stability and in the rate of recovery as a result of this work, and we are working on one additional change: