On January 12th Sentry experienced a large outage across our services. The primary impact was seen in both our event ingestion pipeline and our UI. Availability of individual services varied based on their points of failure, and our event pipeline suffered the most. Sentry experienced either a full or partial outage lasting from 14:59 UTC until 23:20 UTC. Most importantly, if your application sent events to Sentry between 14:59 UTC and 18:40 UTC, it is likely we were not able to process them and these events were not captured.
The abridged version is that we experienced an issue with database availability in one of our core systems. This system is responsible for managing metadata across event aggregations (often called an Issue, or Group). At its core, the problem was an invalid schema constraint, which caused various writes to fail until it was altered. In terms of impact, events which would have created a new issue instead errored out during processing. This also cascaded into the availability of our UI for a period of time. Once we resolved the constraint we were able to restore the UI and event processing. However, due to the abnormal length of this outage, our event backlog had become quite significant. We opted to dual process events, processing new events alongside the backlog, and thus restored service in a degraded fashion.
We apologize for any issues or inconvenience this may have caused. We know outages - whether planned or unplanned - are never convenient, but they are also an unfortunate inevitability. It should go without saying, but we take our uptime very seriously and are looking into ways to avoid these kinds of technical issues in the future. Additionally, we will be making several investments to create better redundancies and improve our ability to restore service (especially event processing) more quickly. Lastly, we recognize that we need to take a look at the urgency of our communication in these kinds of scenarios to make sure that we are more timely, more transparent, and more accurate in informing our customers of problems.