On October 4, 16, and 21, Sentry experienced three incidents with the same root causes. While each incident had a different trigger, the failure mode for all three was as follows:
While each incident only lasted a few minutes, cleanup and customer-visible impact extended beyond the initial period. Further, these three incidents are related to incidents that occurred earlier in October.
The root cause of the most severe part of each of these incidents was a scaling flaw in our edge design. As traffic slowed or shifted, autoscaling of our infrastructure was either too slow to respond or, in some cases, performed the wrong scaling action entirely. Originally, our edge scaled on the CPU load of the servers involved in ingestion; however, detecting impact through CPU alone took too long, and if a slowdown showed up as iowait rather than CPU, scaling would not happen at all. This failure to scale led to a cascade of impacts affecting customers in the US region.
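To illustrate the gap, here is a minimal sketch (in Python, with hypothetical metric names and thresholds, not our actual autoscaler) of why a CPU-only scaling signal misreads an iowait-bound stall:

```python
# Illustrative only: a simplified scaling decision, not Sentry's production policy.
# Metric names and thresholds are hypothetical.
from dataclasses import dataclass


@dataclass
class EdgeMetrics:
    cpu_utilization: float   # 0.0 - 1.0, averaged across ingestion servers
    iowait_fraction: float   # fraction of time spent waiting on I/O
    request_backlog: int     # queued requests waiting for a worker


def cpu_based_decision(m: EdgeMetrics, cpu_target: float = 0.70) -> str:
    """Scale only on average CPU.

    An iowait-bound slowdown keeps CPU low, so this policy sees a
    'healthy' edge and holds, or even scales in.
    """
    if m.cpu_utilization > cpu_target:
        return "scale_out"
    if m.cpu_utilization < cpu_target / 2:
        return "scale_in"
    return "hold"


def backlog_aware_decision(m: EdgeMetrics,
                           cpu_target: float = 0.70,
                           max_backlog: int = 500) -> str:
    """Also consider queued work, so stalls that don't burn CPU still scale out."""
    if m.request_backlog > max_backlog or m.cpu_utilization > cpu_target:
        return "scale_out"
    if m.request_backlog == 0 and m.cpu_utilization < cpu_target / 2:
        return "scale_in"
    return "hold"


# An iowait-bound stall: CPU looks idle while requests pile up.
stalled = EdgeMetrics(cpu_utilization=0.20, iowait_fraction=0.60, request_backlog=4000)
print(cpu_based_decision(stalled))      # -> "scale_in"  (the wrong action)
print(backlog_aware_decision(stalled))  # -> "scale_out"
```

The point of the sketch is the shape of the problem rather than any specific fix: a scaling signal that only reflects CPU cannot distinguish an idle edge from one that is stalled on I/O.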
The problem with our edge scaling was identified in prior incidents, and work to improve our reliability has been underway over the past few weeks. Changes that have already been made include:
We have already seen a significant improvement in stability and in the rate of recovery as a result of this work, and we are working on one additional change: