Elevated errors for ingestion
Incident Report for Sentry
Postmortem

Summary

On October 4, 16, and 21, Sentry experienced three incidents with the same root cause. While each incident had a different trigger, the failure mode in all three was the same:

  • An ingestion region saw an increase in backlogged traffic
  • Traffic shifted away from the affected region to other regions
  • Those regions then developed ingestion backlogs of their own, leading to rejected traffic

While each incident lasted only a few minutes, cleanup and customer-visible impact extended beyond the initial period. Further, these three incidents are related to incidents that occurred earlier in October.

Root Cause

The root cause of the most severe part of each of these incidents was a scaling flaw in our edge design. As traffic slowed or shifted, autoscaling of infrastructure was either too slow to respond or, in some cases, performed the wrong scaling action entirely. Originally, our edge scaled on the CPU load of the servers involved in ingestion; however, CPU load alone did not surface impacts quickly enough. If a slowdown manifested as iowait rather than CPU usage, scaling would not happen at all. This failure to scale led to a cascade of impacts affecting customers in the US region.
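
To illustrate the failure mode, the sketch below uses a hypothetical toy scaler written in Python; it is not our actual edge autoscaler, and the metric names and thresholds are invented for the example. The point is that a CPU-only rule sees an iowait-bound node as idle and can take the wrong action, while a rule that also watches ingestion backlog reacts correctly.

```python
# Hypothetical illustration only: not Sentry's real autoscaler.
# It contrasts a CPU-only scaling rule with one that also considers
# iowait and ingestion queue lag.

from dataclasses import dataclass


@dataclass
class NodeMetrics:
    cpu_util: float           # 0.0-1.0, fraction of time executing
    iowait: float             # 0.0-1.0, fraction of time blocked on I/O
    queue_lag_seconds: float  # age of the oldest unprocessed ingestion message

CPU_SCALE_UP_THRESHOLD = 0.80   # invented threshold
LAG_SCALE_UP_THRESHOLD = 30.0   # seconds; invented threshold


def cpu_only_decision(m: NodeMetrics) -> str:
    # Original-style rule: only CPU is considered. An iowait-bound node
    # looks "idle" (low cpu_util) and can even trigger a scale DOWN.
    if m.cpu_util > CPU_SCALE_UP_THRESHOLD:
        return "scale_up"
    if m.cpu_util < 0.30:
        return "scale_down"
    return "hold"


def multi_signal_decision(m: NodeMetrics) -> str:
    # Stronger-signal rule: queue lag can force a scale-up even when CPU
    # looks healthy, and iowait or lag blocks scale-down.
    if m.queue_lag_seconds > LAG_SCALE_UP_THRESHOLD or m.cpu_util > CPU_SCALE_UP_THRESHOLD:
        return "scale_up"
    if m.cpu_util < 0.30 and m.iowait < 0.10 and m.queue_lag_seconds < 5.0:
        return "scale_down"
    return "hold"


# An iowait-bound backlog: CPU looks idle, but messages are piling up.
stalled = NodeMetrics(cpu_util=0.20, iowait=0.70, queue_lag_seconds=120.0)
print(cpu_only_decision(stalled))      # "scale_down" -- the wrong action
print(multi_signal_decision(stalled))  # "scale_up"
```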

Remediation and Prevention

The problem with our edge scaling was detected in prior incidents and work has been underway to improve our reliability over the past few weeks. Changes that have already been made include:

  1. Creating more, smaller edge regions to reduce the likelihood of cascading failures. This work added three new regions: a first region in South America, a second in North America, and a second in Europe
  2. Updating our autoscaling rules to capture stronger signals than the original CPU load metric
  3. Decreasing the rate of scaling down to prevent iowait-based backlogs from triggering scale-down events (sketched after this list).
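
The sketch below is a hypothetical illustration of the scale-down dampening described in item 3, not our actual implementation: scale-down is only permitted once the backlog signal has been healthy for a sustained window, so a temporary lull in CPU usage during an iowait-driven backlog cannot trigger it. The window size and lag threshold are invented for the example.

```python
# Hypothetical scale-down damper: requires a sustained run of healthy
# backlog readings before scale-down is allowed.

from collections import deque


class ScaleDownDamper:
    def __init__(self, window: int = 10, max_lag_seconds: float = 5.0):
        self.window = window                    # consecutive healthy samples required
        self.max_lag_seconds = max_lag_seconds  # backlog must stay below this
        self.samples = deque(maxlen=window)

    def observe(self, queue_lag_seconds: float) -> bool:
        """Record one metric sample; return True only when scale-down is safe."""
        self.samples.append(queue_lag_seconds < self.max_lag_seconds)
        return len(self.samples) == self.window and all(self.samples)


damper = ScaleDownDamper(window=5)
for lag in [2.0, 1.5, 60.0, 2.0, 2.0, 1.0, 1.0]:
    allowed = damper.observe(lag)

# The lag spike resets the run of healthy samples, so scale-down stays
# blocked until five consecutive low-lag readings accumulate again.
print(allowed)  # False: only four healthy samples since the spike
```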

We have already seen a significant improvement in stability and in the speed of recovery as a result of this work, and we are working on one additional change:

  1. Improving the metrics being produced by our edge to further refine our autoscaling rules and sensitivity.
Posted Nov 08, 2024 - 16:45 PST

Resolved
This incident has been resolved.
Posted Oct 21, 2024 - 12:19 PDT
Update
We are continuing to monitor for any further issues.
Posted Oct 21, 2024 - 12:16 PDT
Update
A fix has been implemented and we are monitoring the results.
Posted Oct 21, 2024 - 12:13 PDT
Monitoring
We were unable to ingest data from 9:42AM PT to 10:06AM PT in US SaaS. Services have recovered and we’re monitoring the situation.
Posted Oct 21, 2024 - 11:25 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 21, 2024 - 10:16 PDT
Investigating
We are currently investigating an issue with event ingestion.
Posted Oct 21, 2024 - 10:05 PDT
This incident affected: US Ingestion (US Error Ingestion, US Transaction Ingestion, US Attachment Ingestion, US Profile Ingestion, US Replay Ingestion, US Span Ingestion, US Cron Monitoring).