From 3:42PM PT to 4:04PM PT, both the site and ingestion were unavailable for Sentry US.
The downtime stemmed from issues with database migrations, particularly migrations that used the RunSQL command for certain database operations. Our systems lacked sufficient safety measures to ensure that the necessary lock and statement timeouts were applied during SQL execution. As a result, queries ran for a prolonged time and exhausted the available connections to the database, which then became inaccessible. This caused both the site and ingestion to become unavailable.
To avoid similar issues, engineering plans to eliminate RunSQL commands altogether by replacing them with more robust solutions within our migration framework. This includes updating the database framework to the latest version, which handles some of the operations that RunSQL previously managed. For faster recovery, runbooks have been created to guide responders through lock or timeout issues during migrations. The team has also enhanced alerting and observability so that we can react sooner and minimize future downtime.
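As an illustration of the lock and statement timeouts described above, here is a minimal sketch of one way to guard a raw SQL statement. It is not Sentry's actual tooling. It assumes a PostgreSQL database (the summary does not name the database), and the helper name `with_timeouts` and the default timeout values are hypothetical.

```python
from typing import List


def with_timeouts(
    statement: str,
    lock_timeout_ms: int = 5000,        # illustrative default, not a recommendation
    statement_timeout_ms: int = 30000,  # illustrative default, not a recommendation
) -> List[str]:
    """Return the SQL statements needed to run `statement` with both timeouts set.

    Assumes PostgreSQL, where `SET LOCAL` scopes a setting to the current
    transaction and a timeout of 0 disables the limit entirely.
    """
    if lock_timeout_ms <= 0 or statement_timeout_ms <= 0:
        raise ValueError("timeouts must be positive; 0 disables them in PostgreSQL")
    return [
        "BEGIN",
        # Give up if a lock cannot be acquired in time, rather than queueing
        # behind other sessions and blocking the sessions queued behind us.
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms'",
        # Abort the statement itself if it runs longer than expected.
        f"SET LOCAL statement_timeout = '{statement_timeout_ms}ms'",
        statement.rstrip().rstrip(";"),
        "COMMIT",
    ]


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    for sql in with_timeouts("ALTER TABLE example ADD COLUMN note text;"):
        print(sql + ";")
```

With guards like these, a migration that cannot obtain its lock fails quickly and can be retried, instead of holding connections open until the pool is exhausted.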