Ingestion was unavailable for all event types between 8:16 AM PT and 9:00 AM PT for Sentry US.
This incident was caused by a code change that uncovered a capacity limitation within Relay: the disk buffer could not keep up with the high write throughput, leading to cascading failures across multiple services. This bottleneck highlighted the need for a more resilient disk buffering solution and for a more accurate, real-time way to assess the traffic that could affect write throughput.
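To illustrate the failure mode, the sketch below shows a disk buffer that enforces a byte budget and signals backpressure instead of accepting unbounded writes. This is a hypothetical Rust example, not Relay's actual spooling implementation; the names BoundedDiskBuffer and push, and the 64 MiB budget, are assumptions made for illustration.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};

/// Why a push was rejected or failed.
#[derive(Debug)]
enum PushError {
    Full,          // byte budget exceeded: caller should back off or shed load
    Io(io::Error), // the underlying disk write failed
}

struct BoundedDiskBuffer {
    file: File,
    written_bytes: u64,
    max_bytes: u64, // budget derived from benchmarked sustainable throughput
}

impl BoundedDiskBuffer {
    fn open(path: &str, max_bytes: u64) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file, written_bytes: 0, max_bytes })
    }

    /// Try to persist one payload; report backpressure when over budget.
    fn push(&mut self, payload: &[u8]) -> Result<(), PushError> {
        if self.written_bytes + payload.len() as u64 > self.max_bytes {
            return Err(PushError::Full);
        }
        self.file.write_all(payload).map_err(PushError::Io)?;
        self.written_bytes += payload.len() as u64;
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut buffer = BoundedDiskBuffer::open("spool.bin", 64 * 1024 * 1024)?;
    match buffer.push(b"example event payload") {
        Ok(()) => println!("buffered"),
        Err(PushError::Full) => println!("buffer full, shedding load"),
        Err(PushError::Io(err)) => eprintln!("disk write failed: {err}"),
    }
    Ok(())
}
```

The key point is that callers receive an explicit "buffer full" signal and can shed or defer load, rather than the buffer silently falling behind and dragging down dependent services.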
Several factors contributed to the escalation of this incident. The code change was initially deployed to only one pod, which made it harder to detect the impact early on. Rollback was further complicated by uncertainty about which version to revert to and by the team's limited familiarity with the rollback tooling. The failure mode grew more complex with multiple pod versions in play, each reacting differently under stress. This ultimately produced a cascading failure pattern as memory limits were breached, especially on newer pod versions, degrading system performance.
To prevent similar incidents, the team is setting up a load-testing environment that mirrors production traffic and will conduct stress tests before full deployments. Benchmarking write throughput under realistic conditions should help forecast potential bottlenecks. While active monitoring and alerting mechanisms performed as expected, increased observability in Relay would further mitigate impact and streamline detection and response.
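As a rough illustration of that benchmarking step, the sketch below measures sustained sequential write throughput on the disk backing the buffer, which can then be compared against expected production event volume. This is a hypothetical probe, not Sentry's load-testing tooling; the file name, chunk size, and total volume are arbitrary assumptions.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::time::Instant;

fn main() -> io::Result<()> {
    let chunk = vec![0u8; 1024 * 1024]; // 1 MiB payload standing in for buffered envelopes
    let total_chunks = 512;             // write 512 MiB in total

    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open("throughput_probe.bin")?;

    let start = Instant::now();
    for _ in 0..total_chunks {
        file.write_all(&chunk)?;
    }
    file.sync_all()?; // include fsync so the figure reflects durable writes
    let elapsed = start.elapsed().as_secs_f64();

    let mib_written = total_chunks as f64;
    println!(
        "wrote {:.0} MiB in {:.2}s -> {:.1} MiB/s sustained",
        mib_written,
        elapsed,
        mib_written / elapsed
    );
    Ok(())
}
```

Running a probe like this under production-like concurrency gives a ceiling for sustainable write throughput, which is the number a load test needs to stay honest against.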