On October 2, 2024, transaction ingestion was delayed between 6:45 AM PT and 8:54 AM PT.
The incident occurred when the caching layer could not autoscale to absorb an unexpected load increase introduced by a recent code change, which strained the system and degraded performance. Because there were no targeted alerts on cache load, the only indication of trouble, an “HPA max” alert, went unnoticed. In addition, some consumers relied on a shared cache pool rather than dedicated pools, which spread the impact across multiple systems.
The primary fix is to reconfigure cache pools with more appropriate HPA maximums so they can absorb load fluctuations, and to create dedicated pools for specific workloads such as ingest, web, and background processes so that future disruptions stay contained. Additional improvements include alerts for sudden spikes in cache load, monitoring for sustained HPA max usage, and a more detailed dashboard for better observability. These changes aim to improve detection, speed up recovery, and let teams respond proactively, reducing the risk of similar incidents. This incident is related to subsequent incidents and ongoing fixes.
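
To make the two alert conditions mentioned above more concrete, here is a minimal Python sketch of how "sustained HPA max usage" and "sudden spike in cache load" might be evaluated over a window of samples. It is an illustration only, not the alerting configuration used in this incident: the `HpaSample` shape, the function names, and the thresholds are all hypothetical, and a real setup would more likely express these rules in the monitoring system's own rule language.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HpaSample:
    """One autoscaler observation (hypothetical shape, not a real API)."""
    timestamp_s: int        # seconds since epoch
    current_replicas: int
    max_replicas: int


def sustained_at_max(samples: Iterable[HpaSample], min_duration_s: int) -> bool:
    """True if the pool has been pinned at its maximum, without a gap,
    for at least min_duration_s up to and including the latest sample."""
    ordered = sorted(samples, key=lambda s: s.timestamp_s)
    if not ordered:
        return False
    latest = ordered[-1]
    if latest.current_replicas < latest.max_replicas:
        return False
    # Walk backwards to find where the current at-max streak began.
    streak_start = latest.timestamp_s
    for sample in reversed(ordered):
        if sample.current_replicas < sample.max_replicas:
            break
        streak_start = sample.timestamp_s
    return latest.timestamp_s - streak_start >= min_duration_s


def load_spiked(loads: List[float], factor: float) -> bool:
    """True if the latest load exceeds `factor` times the mean of the
    earlier samples in the window (an assumed, simple spike definition)."""
    if len(loads) < 2:
        return False
    baseline = sum(loads[:-1]) / (len(loads) - 1)
    return baseline > 0 and loads[-1] > factor * baseline


# Example: the pool reaches its max at t=60 and stays there through t=360.
window = [
    HpaSample(0, 3, 10),
    HpaSample(60, 10, 10),
    HpaSample(120, 10, 10),
    HpaSample(360, 10, 10),
]
should_page = sustained_at_max(window, min_duration_s=300)
```

Requiring a minimum duration before alerting is a deliberate choice in this sketch: a pool that briefly touches its maximum is often normal, whereas one that stays pinned there suggests the ceiling is too low for the current load.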