Sentry US is unavailable due to a database issue

Incident Report for Sentry

Postmortem

Summary

On October 10, 2024, from 3:42 PM PT to 4:04 PM PT, both the site and ingestion were unavailable for Sentry US.

Root Cause

The downtime stemmed from issues with database migrations, specifically migrations that used the RunSQL command for certain database operations. Our systems lacked sufficient safety measures to ensure that the necessary lock and statement timeouts were in place during SQL execution. As a result, long-running queries exhausted the available connections to the database, which then became inaccessible. This made the site and ingestion unavailable.
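
As an illustration of the kind of safeguard described above, here is a minimal sketch of wrapping a raw migration statement with Postgres lock and statement timeouts. It is not Sentry's actual tooling: the helper name `with_timeouts`, the default values, and the table name in the usage note are all hypothetical. It relies only on the standard Postgres settings `lock_timeout` and `statement_timeout`, applied with `SET LOCAL` so they last for one transaction only.

```python
from typing import List


def with_timeouts(
    sql: str,
    lock_timeout_ms: int = 5000,        # assumed default, not from the report
    statement_timeout_ms: int = 30000,  # assumed default, not from the report
) -> List[str]:
    """Return the statements needed to run `sql` with bounded waits.

    Hypothetical helper: with these settings Postgres aborts the statement
    if it waits too long for a lock or runs too long, instead of holding a
    connection indefinitely.
    """
    if lock_timeout_ms <= 0 or statement_timeout_ms <= 0:
        # In Postgres a value of 0 disables the timeout, so reject it here.
        raise ValueError("timeouts must be positive")
    return [
        "BEGIN",
        # SET LOCAL only applies until the end of the current transaction.
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms'",
        f"SET LOCAL statement_timeout = '{statement_timeout_ms}ms'",
        sql.rstrip().rstrip(";"),
        "COMMIT",
    ]
```

For example, `with_timeouts("ALTER TABLE example_table ADD COLUMN note text;", 2000, 10000)` returns five statements that a migration runner could execute in order on a single connection. If the lock cannot be acquired within two seconds, Postgres cancels the statement and the migration fails fast rather than blocking other queries.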

Remediation and Prevention

To avoid similar issues, engineering plans to eliminate RunSQL commands altogether by replacing them with more robust solutions within our migration framework. This includes updating the database framework to the latest version, which handles some of the operations RunSQL previously managed. For faster recovery, runbooks have been created to guide the response to lock or timeout issues during migrations. The team has also enhanced alerting and observability so that we can react sooner and minimize future downtime.

Posted Nov 09, 2024 - 00:38 UTC

Resolved

This incident has been resolved.
Posted Oct 11, 2024 - 00:47 UTC

Monitoring

All services have been restored and are operating normally.
Posted Oct 10, 2024 - 23:40 UTC

Identified

A fix has been issued. The dashboard and ingestion are recovering.
Posted Oct 10, 2024 - 23:05 UTC

Update

We are continuing to investigate this issue.
Posted Oct 10, 2024 - 23:00 UTC

Investigating

We are currently investigating an issue with our Postgres database that is causing the site to be unavailable.
Posted Oct 10, 2024 - 22:58 UTC
This incident affected: Dashboard, API and US Ingestion (US Error Ingestion, US Transaction Ingestion, US Attachment Ingestion, US Profile Ingestion, US Replay Ingestion, US Span Ingestion, US Cron Monitoring, Custom Metrics (beta)).