Significant Outage

Incident Report for Sentry

Postmortem

On January 12th, Sentry experienced a large outage across our services. The primary impact was seen in both our event ingestion pipeline and our UI. Availability of individual services varied based on their points of failure, and our event pipeline suffered the largest issues. Sentry experienced either a full or partial outage lasting from 14:59 UTC until 23:20 UTC. Most importantly, if your application sent events to Sentry between 14:59 UTC and 18:40 UTC, it is likely that we were not able to process them and that these events were not captured.

The abridged version is that we experienced an issue with database availability in one of our core systems. This system is responsible for managing metadata across event aggregations (often called an Issue, or Group). At its core, the problem was an invalid schema constraint, which caused various writes to fail until it was altered. In terms of impact, events that would have created a new issue instead errored out during processing. This also cascaded into our UI, which was unavailable for a period of time. Once we resolved the constraint, we were able to restore the UI and event processing. However, due to the abnormal length of this outage, our event backlog had become quite significant. We opted to dual process events, processing new events alongside the backlog, which restored service in a degraded fashion.
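As a rough illustration of what processing new events alongside a backlog can look like, here is a minimal sketch. It is not Sentry's actual implementation, which this report does not describe. The function name `interleave`, the `live_per_backlog` ratio, and the in-memory queues are all hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Deque, Iterator, Tuple


def interleave(
    live: Deque[str],
    backlog: Deque[str],
    live_per_backlog: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, event) pairs from two queues.

    Hypothetical policy: take up to `live_per_backlog` newly received
    events for every one backlog event, so fresh data stays reasonably
    current while the backlog still drains.
    """
    while live or backlog:
        # Prefer new events first, up to the configured ratio.
        for _ in range(live_per_backlog):
            if not live:
                break
            yield ("live", live.popleft())
        # Then make progress on the backlog, one event at a time.
        if backlog:
            yield ("backlog", backlog.popleft())


if __name__ == "__main__":
    new_events = deque(["n1", "n2", "n3"])
    old_events = deque(["b1", "b2", "b3"])
    for source, event in interleave(new_events, old_events):
        print(source, event)
```

The trade-off in a scheme like this is that neither stream runs at full speed: new events are delayed slightly by backlog work, and the backlog takes longer to clear than it would with dedicated capacity.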

We apologize for any issues or inconvenience this may have caused. We know outages, whether planned or unplanned, are never convenient, but they are also an unfortunate inevitability. It should go without saying that we take our uptime very seriously, and we are looking into ways to avoid these kinds of technical issues in the future. Additionally, we will be making several investments to create better redundancies and improve our ability to restore service (especially event processing) more quickly. Lastly, we recognize that we need to look at the urgency of our communication in these kinds of scenarios to make sure that we are more timely, more transparent, and more accurate in informing our customers of problems.

Posted Jan 15, 2021 - 21:46 UTC

Resolved

We have cleared our events backlog. Thanks for sticking with us.

Posted Jan 12, 2021 - 23:20 UTC

Update

Due to the size of our event backlog, we have begun to process newly received events and will be backfilling the events we received during the incident in parallel.
Additionally, we have finished processing our backlog for transactions and attachments.

Posted Jan 12, 2021 - 21:07 UTC

Update

We are now processing all events from our backlog.

Posted Jan 12, 2021 - 19:36 UTC

Update

We have identified additional issues with processing events aggregated into Issues and have paused processing them until we apply our fix. Transactions are not impacted.

Posted Jan 12, 2021 - 19:12 UTC

Monitoring

Maintenance is complete and the backlog is being processed now.

Posted Jan 12, 2021 - 18:41 UTC

Update

We are continuing to apply our fix during the emergency maintenance and will resume functionality shortly.

Posted Jan 12, 2021 - 18:25 UTC

Update

We have identified that the fixes we are applying have taken the API and Dashboard out of service. They will return shortly.

Posted Jan 12, 2021 - 17:54 UTC

Update

We are continuing to work on a fix for this issue. We will be providing an update when we start processing events.
Some users may be unable to load certain Issue searches and Issue details during the incident.

Posted Jan 12, 2021 - 17:47 UTC

Identified

We have identified the issue with processing events and are in the process of remediating it.

Posted Jan 12, 2021 - 17:46 UTC

Investigating

We are currently investigating an issue with event processing and have temporarily stopped that service.

Posted Jan 12, 2021 - 17:37 UTC

This incident affected: Dashboard and API.