Delayed ingestion in US region

Incident Report for Sentry

Postmortem

Summary

Our monitors alerted us to a slowdown in transaction processing in a portion of our pipeline. After investigation, we discovered that a code change to our debug file processing system had caused unbalanced behavior. This small change produced a generalized slowdown and ultimately an ingestion backlog for all event types (not just transactions) that required significant time to recover from.

Root Cause

The trigger for the incident was a code change. However, the most significant contributor was our inability to absorb the backlog and process it in real time.

Remediation and Prevention

At the beginning of October, the team embarked on a project to dramatically increase the processing capacity of our Transactions pipeline. With a target of 4x our current throughput on the same infrastructure, this project was designed to give us enough capacity to absorb future slowdowns while processing any resulting backlog at high speed. The project was completed in early November and has already been tested at up to 3x our original maximum capacity.
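
As a rough illustration of why throughput headroom matters for backlog recovery, the sketch below estimates how long a backlog takes to drain while live traffic keeps arriving. The function name and all numbers are hypothetical and normalized (live traffic = 1.0 unit per hour); they are not measurements from this incident.

```python
import math


def drain_time_hours(backlog: float, inflow_per_hour: float, capacity_per_hour: float) -> float:
    """Estimate hours needed to clear a backlog while live traffic continues.

    Only the capacity left over after serving live traffic (the headroom)
    can be spent on the backlog. All inputs are hypothetical, normalized units.
    """
    headroom = capacity_per_hour - inflow_per_hour
    if headroom <= 0:
        # No spare capacity: the backlog never shrinks.
        return math.inf
    return backlog / headroom


# Hypothetical scenario: a backlog equal to 7 hours of normal traffic.
backlog = 7.0
inflow = 1.0

for capacity in (1.0, 1.25, 4.0):
    hours = drain_time_hours(backlog, inflow, capacity)
    print(f"capacity {capacity}x -> drain time {hours:.2f} h")
```

Under these assumed numbers, a pipeline with only 25% headroom needs far longer to recover than one running at 4x, because recovery time scales with the spare capacity rather than with total capacity.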

Posted Nov 09, 2024 - 00:37 UTC

Resolved

This incident has been resolved.

Posted Oct 02, 2024 - 03:15 UTC

Monitoring

The transaction backlog has finished processing. All ingestion services are operating normally again.

Posted Oct 02, 2024 - 02:30 UTC

Update

We are still processing transactions produced prior to 15:30 UTC and aim to finish within half an hour.

Posted Oct 02, 2024 - 01:58 UTC

Update

We are still processing transactions produced prior to 15:30 UTC. The processing has taken longer than anticipated. We estimate another hour until they're all processed.

Posted Oct 02, 2024 - 00:24 UTC

Update

We are still processing transactions produced prior to 15:30 UTC. We estimate another hour and a half until they're all processed.

Posted Oct 01, 2024 - 22:53 UTC

Update

Transactions produced prior to 15:30 UTC are still in the backlog. We estimate another 7 hours from now until they’re all processed.

Posted Oct 01, 2024 - 19:51 UTC

Update

We’re currently ingesting all events in real time (including errors and transactions).
All errors produced prior to this outage have been processed. All transactions produced prior to 15:30 UTC are still in the backlog. We will provide an ETA for completion once the burn rate stabilizes.

Posted Oct 01, 2024 - 18:24 UTC

Update

We're continuing to ingest current events and backfill older events in parallel.

Posted Oct 01, 2024 - 17:04 UTC

Update

We've implemented the change to prioritize the ingestion of more recent events and continue to process our backlog to backfill older events.

Posted Oct 01, 2024 - 15:45 UTC

Update

We're continuing to process our event ingestion backlog in our US region and are working on a change to prioritize the ingestion of recent events.

Posted Oct 01, 2024 - 15:04 UTC

Update

Starting at 08:10 UTC, we encountered an issue that required us to restart some ingestion services. During this activity, we encountered a defect in the underlying platform that triggered an ingestion backlog for all US event types. We've worked with our cloud provider to implement a mitigation until they can provide a long-term fix. We're currently processing our event ingestion backlog and continuing to monitor.

Posted Oct 01, 2024 - 14:21 UTC

Update

We're continuing to investigate a fix for our ingestion backlog in our US region. Ingestion of all event types is delayed in this region.

Posted Oct 01, 2024 - 11:52 UTC

Identified

We've identified a problem causing excessive backlogging and are working on a fix now.

Posted Oct 01, 2024 - 10:47 UTC

Investigating

We're currently investigating ingestion delays in the US region.

Posted Oct 01, 2024 - 09:28 UTC

This incident affected: US Ingestion (US Error Ingestion, US Transaction Ingestion, US Attachment Ingestion, US Profile Ingestion, US Replay Ingestion, US Span Ingestion, US Cron Monitoring, Custom Metrics (beta)).