On July 21, 2022, at 9:30 UTC we completed an upgrade to our database cluster. Shortly afterward, we started experiencing high CPU load on our database servers. This caused delays in our queries and eventually led to failed API/Web requests. The engineering team was able to mitigate the problem by 10:18 UTC.
We apologize for any inconvenience this may have caused.
09:30 Database upgrade completed
09:50 Alerted on CPU usage and observed the database was overloaded with queries; the incident was created
10:07 Manually killed queries to prevent event backlog from growing which caused API and web requests to fail
10:13 Fixed database statistics and observed queries running fast again
10:18 Updated status page
10:31 Backlog of events cleared
10:43 All clear given in incident
Duration of issue: 49 minutes
Incomplete query planner statistics after our database upgrade which caused a suboptimal query plan to be used by our queries.
We are looking to improve our upgrade process by: