Ingestion was unavailable for all event types between 8:16 AM PT and 9:00 AM PT for Sentry US.
This incident was caused by a code change that uncovered a capacity limitation within Relay: the disk buffer could not keep up with the high write throughput, leading to cascading failures across multiple services. This bottleneck highlighted the need for a more resilient disk buffering solution and for a more accurate, real-time way to assess the traffic that could affect write throughput.
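To illustrate the failure mode, the sketch below shows a disk buffer that enforces a byte budget and signals backpressure instead of accepting unbounded writes. This is a hypothetical Rust example, not Relay's actual spooling implementation; the names BoundedDiskBuffer and push, and the 64 MiB budget, are assumptions made for illustration.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};

/// Why a push was rejected or failed.
#[derive(Debug)]
enum PushError {
    Full,          // byte budget exceeded: caller should back off or shed load
    Io(io::Error), // the underlying disk write failed
}

struct BoundedDiskBuffer {
    file: File,
    written_bytes: u64,
    max_bytes: u64, // budget derived from benchmarked sustainable throughput
}

impl BoundedDiskBuffer {
    fn open(path: &str, max_bytes: u64) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file, written_bytes: 0, max_bytes })
    }

    /// Try to persist one payload; report backpressure when over budget.
    fn push(&mut self, payload: &[u8]) -> Result<(), PushError> {
        if self.written_bytes + payload.len() as u64 > self.max_bytes {
            return Err(PushError::Full);
        }
        self.file.write_all(payload).map_err(PushError::Io)?;
        self.written_bytes += payload.len() as u64;
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut buffer = BoundedDiskBuffer::open("spool.bin", 64 * 1024 * 1024)?;
    match buffer.push(b"example event payload") {
        Ok(()) => println!("buffered"),
        Err(PushError::Full) => println!("buffer full, shedding load"),
        Err(PushError::Io(err)) => eprintln!("disk write failed: {err}"),
    }
    Ok(())
}
```

The key point is that callers receive an explicit "buffer full" signal and can shed or defer load, rather than the buffer silently falling behind and dragging down dependent services.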
Several factors contributed to the escalation of this incident. The code change was initially deployed to only one pod, which made it harder to detect the impact early on. Rollback was further complicated by uncertainty about which version to revert to and by the team's limited familiarity with the rollback tooling. The failure mode grew more complex with multiple pod versions in play, each reacting differently under stress. This ultimately produced a cascading failure pattern as memory limits were breached, especially on newer pod versions, degrading system performance.
To prevent similar incidents, the team is setting up a load-testing environment that mirrors production traffic and will conduct stress tests before full deployments. Benchmarking write throughput under realistic conditions should help forecast potential bottlenecks. While active monitoring and alerting mechanisms performed as expected, increased observability in Relay would further mitigate impact and streamline detection and response.
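As a rough illustration of that benchmarking step, the sketch below measures sustained sequential write throughput on the disk backing the buffer, which can then be compared against expected production event volume. This is a hypothetical probe, not Sentry's load-testing tooling; the file name, chunk size, and total volume are arbitrary assumptions.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::time::Instant;

fn main() -> io::Result<()> {
    let chunk = vec![0u8; 1024 * 1024]; // 1 MiB payload standing in for buffered envelopes
    let total_chunks = 512;             // write 512 MiB in total

    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open("throughput_probe.bin")?;

    let start = Instant::now();
    for _ in 0..total_chunks {
        file.write_all(&chunk)?;
    }
    file.sync_all()?; // include fsync so the figure reflects durable writes
    let elapsed = start.elapsed().as_secs_f64();

    let mib_written = total_chunks as f64;
    println!(
        "wrote {:.0} MiB in {:.2}s -> {:.1} MiB/s sustained",
        mib_written,
        elapsed,
        mib_written / elapsed
    );
    Ok(())
}
```

Running a probe like this under production-like concurrency gives a ceiling for sustainable write throughput, which is the number a load test needs to stay honest against.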