US region transaction ingestion delay

Incident Report for Sentry

Postmortem

Summary

On October 2, 2024, transaction ingestion in the US region was delayed between 6:45 AM PT and 8:54 AM PT.

Root Cause

The caching layer could not autoscale to handle an unexpected load increase introduced by a recent code change, which strained the system and degraded performance. Because there were no targeted alerts on cache load, the only sign of trouble, an “HPA max” alert, went unnoticed. In addition, some consumers relied on a shared cache pool rather than dedicated ones, which spread the impact across multiple systems.
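To illustrate why reaching the HPA maximum stops a pool from absorbing more load, here is a minimal sketch of the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation, simplified to omit the tolerance band and stabilization windows. The function names and the numbers in the usage note are illustrative assumptions, not values from this incident.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified HPA rule: scale in proportion to metric/target, then clamp."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # The autoscaler never goes beyond the configured bounds.
    return max(min_replicas, min(raw, max_replicas))


def is_capped(current_replicas: int, current_metric: float,
              target_metric: float, max_replicas: int) -> bool:
    """True when the unclamped recommendation exceeds the HPA maximum."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return raw > max_replicas
```

With hypothetical values of 10 replicas, a metric at 90 against a target of 60, and a maximum of 12, the rule asks for 15 replicas but the pool is held at 12, so the extra load stays on the existing pods.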

Remediation and Prevention

The primary fix is to reconfigure the cache pools with more appropriate HPA maximums so they can better absorb load fluctuations, and to create dedicated pools for specific workloads such as ingest, web, and background processes so that future disruptions stay contained. Additional improvements include alerts for sudden spikes in cache load, monitoring for sustained HPA max usage, and better observability through a more detailed dashboard. These changes aim to improve detection, speed up recovery, and let teams respond proactively, reducing the risk of similar issues. This incident is related to subsequent incidents, for which fixes are continuing.
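The two alert conditions mentioned above could be sketched roughly as follows. This is a hypothetical illustration only: the function names, sample windows, and thresholds are assumptions, and it does not reflect the actual alerting rules or tooling used.

```python
from typing import Sequence


def sustained_at_max(replica_samples: Sequence[int], max_replicas: int,
                     min_consecutive: int) -> bool:
    """True if the pool sat at its HPA maximum for min_consecutive samples in a row."""
    run = 0
    for replicas in replica_samples:
        run = run + 1 if replicas >= max_replicas else 0
        if run >= min_consecutive:
            return True
    return False


def load_spike(load_samples: Sequence[float], baseline_window: int,
               factor: float) -> bool:
    """True if the latest sample exceeds factor times the mean of the preceding window."""
    if len(load_samples) <= baseline_window:
        return False  # not enough history to form a baseline
    window = load_samples[-baseline_window - 1:-1]
    baseline = sum(window) / baseline_window
    return baseline > 0 and load_samples[-1] > factor * baseline
```

For example, `sustained_at_max([8, 12, 12, 12], max_replicas=12, min_consecutive=3)` fires, while a brief touch of the maximum does not. Requiring consecutive samples is one way to separate a pool that is stuck at its ceiling from one that is scaling normally.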

Posted Nov 09, 2024 - 00:39 UTC

Resolved

The issue causing transaction ingestion delays has been resolved.
Posted Oct 02, 2024 - 18:32 UTC

Monitoring

We have mitigated the issue causing transaction ingestion delays in our US region and are monitoring the results.
Posted Oct 02, 2024 - 15:35 UTC

Investigating

We are currently investigating an issue causing a 15-minute delay in transaction ingestion in our US region.
Posted Oct 02, 2024 - 14:59 UTC
This incident affected: US Ingestion (US Transaction Ingestion).