Delays in processing and intermittent UI/API failures

Incident Report for Sentry

Postmortem

Summary

On July 21, 2022, at 9:30 UTC we completed an upgrade to our database cluster. Shortly afterward, we started experiencing high CPU load on our database servers. This caused delays in our queries and eventually led to failed API/Web requests. The engineering team was able to mitigate the problem by 10:18 UTC.

We apologize for any inconvenience this may have caused.

Timeline (in UTC)

09:30 Database upgrade completed

09:50 Alerted on CPU usage and observed the database was overloaded with queries; the incident was created

10:07 Manually killed queries to prevent event backlog from growing which caused API and web requests to fail

10:13 Fixed database statistics and observed queries running fast again

10:18 Updated status page

10:31 Backlog of events cleared

10:43 All clear given in incident

Duration of issue: 49 minutes

Root Cause

Incomplete query planner statistics after our database upgrade which caused a suboptimal query plan to be used by our queries.

Remediation Plan

We are looking to improve our upgrade process by:

Investigating ways to preserve table attributes as code
Ensuring statistics are fully calculated after an upgrade

Posted Aug 02, 2022 - 19:51 UTC

Resolved

This incident has been resolved.

Posted Jul 21, 2022 - 10:43 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 21, 2022 - 10:18 UTC

Investigating

We are currently investigating this issue.

Posted Jul 21, 2022 - 09:53 UTC

This incident affected: Dashboard, API and US Ingestion (US Error Ingestion, US Transaction Ingestion, US Attachment Ingestion).