Sentry US is unavailable due to a database issue
Incident Report for Sentry
Postmortem

Summary

From 3:42 PM PT to 4:04 PM PT, both the site and ingestion were unavailable for Sentry US.

Root Cause

The downtime stemmed from a database migration that used the RunSQL command for certain database operations. Our systems lacked sufficient safeguards to ensure that the necessary lock and statement timeouts were in place during SQL execution. As a result, queries ran for a prolonged time and exhausted the available connections to the database, which then became inaccessible. This caused both the site and ingestion to become unavailable.
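To illustrate the kind of safeguard described above, here is a minimal sketch that wraps raw migration SQL in a transaction with explicit PostgreSQL `lock_timeout` and `statement_timeout` settings. The helper name `with_timeouts` and the default values are hypothetical and chosen for illustration only; this is not the tooling Sentry uses.

```python
from typing import List


def with_timeouts(
    statements: List[str],
    lock_timeout_ms: int = 5000,        # hypothetical default
    statement_timeout_ms: int = 30000,  # hypothetical default
) -> List[str]:
    """Wrap migration SQL in a transaction that sets PostgreSQL timeouts.

    SET LOCAL applies only until the end of the current transaction, so the
    timeouts do not leak into other work on the same connection.
    """
    if lock_timeout_ms <= 0 or statement_timeout_ms <= 0:
        # In PostgreSQL a value of 0 disables the timeout entirely.
        raise ValueError("timeouts must be positive")
    return [
        "BEGIN",
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms'",
        f"SET LOCAL statement_timeout = '{statement_timeout_ms}ms'",
        *statements,
        "COMMIT",
    ]


if __name__ == "__main__":
    # Example with a made-up table and column.
    for sql in with_timeouts(["ALTER TABLE example ADD COLUMN note text"]):
        print(sql + ";")
```

With timeouts like these, a statement that cannot acquire its lock fails quickly rather than queueing behind other sessions and holding connections open. Statements that cannot run inside a transaction block, such as `CREATE INDEX CONCURRENTLY`, would need the timeouts set at the session level instead.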

Remediation and Prevention

To avoid similar issues, engineering plans to eliminate RunSQL commands altogether by replacing them with more robust solutions within our migration framework. This includes updating the database framework to the latest version, which handles some of the operations RunSQL previously managed. For faster recovery, runbooks have been created to guide responders through lock or timeout issues during migrations. The team has also enhanced alerting and observability so that we can react sooner and minimize future downtime.
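One way to keep RunSQL from creeping back in is a check that scans migration files before they are merged. The sketch below uses Python's standard `ast` module to find `RunSQL(...)` calls in migration source. The function name `find_runsql_calls` and the `migrations.RunSQL(...)` call style are assumptions made for illustration; this is not Sentry's actual check.

```python
import ast
from typing import List


def find_runsql_calls(source: str) -> List[int]:
    """Return the line numbers of every RunSQL(...) call in migration source."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handles both `migrations.RunSQL(...)` and a bare `RunSQL(...)`.
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name == "RunSQL":
            lines.append(node.lineno)
    return sorted(lines)


if __name__ == "__main__":
    # Made-up migration snippet; the source is parsed, never executed.
    example = (
        "operations = [\n"
        "    migrations.RunSQL('UPDATE example SET note = NULL'),\n"
        "    migrations.AddField('example', 'note', None),\n"
        "]\n"
    )
    for lineno in find_runsql_calls(example):
        print(f"RunSQL call found on line {lineno}")
```

A check like this could fail a build when it finds a match, so that any remaining raw SQL gets an explicit review.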

Posted Nov 08, 2024 - 16:38 PST

Resolved
This incident has been resolved.
Posted Oct 10, 2024 - 17:47 PDT
Monitoring
All services have been restored and are operating normally.
Posted Oct 10, 2024 - 16:40 PDT
Identified
A fix has been issued. The dashboard and ingestion are recovering.
Posted Oct 10, 2024 - 16:05 PDT
Update
We are continuing to investigate this issue.
Posted Oct 10, 2024 - 16:00 PDT
Investigating
We are currently investigating an issue with our Postgres database that is causing the site to be unavailable.
Posted Oct 10, 2024 - 15:58 PDT
This incident affected: Dashboard, API and US Ingestion (US Error Ingestion, US Transaction Ingestion, US Attachment Ingestion, US Profile Ingestion, US Replay Ingestion, US Span Ingestion, US Cron Monitoring, Custom Metrics (beta)).