Our monitors alerted us to a Transaction processing slowdown in a portion of our pipeline. After investigation, we discovered that a code change to our debug file processing system had led to unbalanced behavior. This small change caused a generalized slowdown and ultimately an ingestion backlog of all event types (not just transactions) that required significant time to recover from.
The trigger for the event was a code change; however, the most significant contributor to the incident was our inability to absorb the backlog and process it in real time.
At the beginning of October, the team embarked on a project to dramatically increase the processing capacity of our Transactions pipeline. With a target of achieving 4x our current throughput on the same infrastructure, this project was designed to give us enough capacity to absorb future slowdowns while also processing any resulting backlog at high speed. This project was completed in early November and has already been tested at up to 3x our original maximum capacity.
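To illustrate why spare capacity matters for recovery, here is a minimal sketch of the backlog arithmetic. It assumes a simple steady-state model in which a backlog drains at the difference between processing capacity and the incoming event rate. The function name `drain_hours` and all numbers are hypothetical and chosen for illustration; they are not measurements from this incident or from our pipeline.

```python
import math


def drain_hours(backlog_hours: float, arrival_rate: float, capacity: float) -> float:
    """Estimate the hours needed to clear a backlog (hypothetical model).

    backlog_hours: backlog size, expressed as hours of normal incoming traffic.
    arrival_rate:  incoming events per hour (any consistent unit).
    capacity:      maximum events processed per hour (same unit).

    Assumes arrival rate and capacity stay constant while draining.
    """
    headroom = capacity - arrival_rate
    if headroom <= 0:
        # With no spare capacity the backlog never shrinks.
        return math.inf
    backlog_events = backlog_hours * arrival_rate
    return backlog_events / headroom


# Illustrative only: a backlog worth 6 hours of traffic, normal load = 1.0.
print(drain_hours(6, 1.0, 1.5))  # 1.5x capacity
print(drain_hours(6, 1.0, 4.0))  # 4x capacity
```

Under these assumptions, the same backlog clears several times faster at 4x capacity than at 1.5x, because recovery speed depends on the headroom above live traffic rather than on raw throughput alone.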