Delayed ingestion in US region
Incident Report for Sentry
Postmortem

Summary

We were alerted by our monitors to a transaction processing slowdown in a portion of our pipeline. After investigation, we discovered that a code change to our debug file processing system had led to unbalanced behavior. This small change caused a generalized slowdown and ultimately an ingestion backlog of all event types (not just transactions) that took significant time to recover from.

Root Cause

The trigger for the incident was a code change; however, the most significant contributor was our inability to absorb the backlog and process it in real time.

Remediation and Prevention

At the beginning of October, the team embarked on a project to dramatically increase the processing capacity of our Transactions pipeline. With a target of 4x our current throughput on the same infrastructure, this project was designed to give us sufficient capacity to absorb future slowdowns while also processing any resulting backlog at high speed. The project was completed in early November and has already been tested at up to 3x our original maximum capacity.
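To illustrate the recovery approach described in the updates below (prioritizing recent events while backfilling the backlog), here is a minimal sketch in Python. The queue names, batch size, and in-memory queues are assumptions for illustration only, not Sentry's actual pipeline code.

    from collections import deque

    def process(event):
        # Placeholder for the actual ingestion work (parsing, storing, indexing).
        print(f"processed {event}")

    def drain(realtime, backlog, backfill_batch=100):
        # Recent events always take priority so live data stays fresh; the
        # backlog is worked down in bounded batches whenever the real-time
        # queue is empty, so new events are never stuck behind older traffic.
        while realtime or backlog:
            while realtime:
                process(realtime.popleft())
            for _ in range(min(backfill_batch, len(backlog))):
                process(backlog.popleft())

    # Example usage with hypothetical event names.
    realtime_queue = deque(f"recent-{i}" for i in range(3))
    backlog_queue = deque(f"old-{i}" for i in range(5))
    drain(realtime_queue, backlog_queue)

The same idea applies to any queue-backed pipeline: bounding the backfill batch keeps real-time latency low while the backlog burns down in the background.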

Posted Nov 08, 2024 - 16:37 PST

Resolved
This incident has been resolved.
Posted Oct 01, 2024 - 20:15 PDT
Monitoring
The transaction backlog has finished processing. All ingestion services are operating normally again.
Posted Oct 01, 2024 - 19:30 PDT
Update
We are still processing transactions produced prior to 15:30 UTC and expect to finish within half an hour.
Posted Oct 01, 2024 - 18:58 PDT
Update
We are still processing transactions produced prior to 15:30 UTC. The processing has taken longer than anticipated; we estimate another hour until they're all processed.
Posted Oct 01, 2024 - 17:24 PDT
Update
We are still processing transactions produced prior to 15:30 UTC. We estimate another hour and a half until they're all processed.
Posted Oct 01, 2024 - 15:53 PDT
Update
Transactions produced prior to 15:30 UTC are still in the backlog; we estimate another 7 hours from now until they're all processed.
Posted Oct 01, 2024 - 12:51 PDT
Update
We're currently ingesting all events in real time (including errors and transactions).
All errors produced prior to this outage have been processed. All transactions produced prior to 15:30 UTC are still in the backlog. We will provide an ETA for completion once the burn rate stabilizes.
Posted Oct 01, 2024 - 11:24 PDT
Update
We're continuing to ingest current events and backfill older events in parallel.
Posted Oct 01, 2024 - 10:04 PDT
Update
We've implemented the change to prioritize ingestion of more recent events while continuing to process our backlog to backfill older events.
Posted Oct 01, 2024 - 08:45 PDT
Update
We're continuing to process our event ingestion backlog in our US region and are working on a change to prioritize the ingestion of recent events.
Posted Oct 01, 2024 - 08:04 PDT
Update
Starting at 08:10 UTC we encountered an issue that required us to restart some ingestion services. During this activity we hit a defect in the underlying platform that triggered an ingestion backlog for all US event types. We've worked with our cloud provider to implement a mitigation until they can provide a long-term fix. We're currently processing our event ingestion backlog and continuing to monitor.
Posted Oct 01, 2024 - 07:21 PDT
Update
We're continuing to investigate a fix for our ingestion backlog in the US region; ingestion of all event types is delayed in this region.
Posted Oct 01, 2024 - 04:52 PDT
Identified
We've identified a problem causing excessive backlogging, and are working on a fix for it now.
Posted Oct 01, 2024 - 03:47 PDT
Investigating
We're currently investigating ingestion delays in the US region.
Posted Oct 01, 2024 - 02:28 PDT
This incident affected: US Ingestion (US Error Ingestion, US Transaction Ingestion, US Attachment Ingestion, US Profile Ingestion, US Replay Ingestion, US Span Ingestion, US Cron Monitoring, Custom Metrics (beta)).