On May 23, 2022, at 10:30 UTC, Sentry experienced networking issues across part of our infrastructure, which resulted in delays in ingestion, notifications, and alerts. We alerted our service provider, Google Cloud Platform, who restored connectivity. Our engineering teams worked through several issues during recovery before we were able to resume ingesting new events at a normal rate and process the backlog of data, which was fully completed by May 24, 2022, 02:05 UTC.
We apologize for any inconvenience this may have caused.
10:25 — Responded to alerts that Snuba consumers were accumulating large backlogs of data.
10:30 — Identified that the issues originated within our metrics Clickhouse cluster.
10:37 — Experienced timeouts across our metrics services, and some hosts became unreachable via SSH.
10:40 — Restarting hosts seemed to resolve the issue; however, one of our Kafka clusters became unreachable.
11:14 — Reached out to Google Cloud support.
11:15 — Continued to restart other unreachable hosts with varying success.
13:03 — Met with Google Cloud, who confirmed infra issues; we proceeded with their recommendation to migrate our hosts.
14:27 — Google Cloud investigation completed and mitigation underway.
15:39 — Web hosts restarted on different nodes to restore Sentry.io availability.
15:55 — Diverted Clickhouse traffic to healthy hosts.
16:10 — Backend systems started ingesting new events and processing the backlog.
16:35 — Google Cloud incident given the all clear after a rollback of changes.
16:37 — Ingestion had stopped; started to re-attach file storage with additional space.
16:46 — Ingestion restarted at a normal rate.
17:12 — Snuba consumer memory increased to cope with large batches.
18:27 — Clickhouse replication completed and queries were evenly distributed again.
22:59 — Backlog processing optimization job scheduled for 24:00.
02:05 — Backlog processing completed (May 24).
Duration of Instability: 9 hours, 5 minutes
A networking change was rolled out to Google Cloud Networking, which caused some instances in Google Compute Engine in us-central1-b to become unreachable. This negatively impacted the parts of our stack that could not successfully handle partial failures.
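As an illustration of what handling a partial failure like this can look like (a minimal sketch, not our actual implementation; the host names and ports below are placeholders), a service can probe each replica with a short connection timeout and route work only to hosts that are still reachable, rather than blocking on instances the network has cut off:

```python
import socket

# Hypothetical replica endpoints -- placeholders, not our real topology.
REPLICAS = [
    ("clickhouse-1.internal", 9000),
    ("clickhouse-2.internal", 9000),
    ("clickhouse-3.internal", 9000),
]

def is_reachable(host, port, timeout=2.0):
    """Probe a host with a short timeout so a single unreachable
    instance cannot stall the whole request path."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy_replicas(replicas):
    """Keep only the replicas that answered the probe; traffic is
    diverted to these instead of waiting on unreachable hosts."""
    return [(host, port) for host, port in replicas if is_reachable(host, port)]

if __name__ == "__main__":
    for host, port in healthy_replicas(REPLICAS):
        print(f"routing queries to {host}:{port}")
```

In practice this kind of detection usually lives in load balancers and client configuration rather than application code, but the principle is the same: detect unreachable hosts quickly and route around them.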
We have short-term plans to improve our data backlog processing times by:
Our longer-term goal is to enable a standby Clickhouse cluster that resides in a separate zone/region. This is aligned with our overall strategy, as outlined after the May 6th incident, to provide better failover capabilities.
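As a rough sketch of what that failover could look like from the query side (assuming Clickhouse's standard HTTP interface on port 8123; the cluster endpoints below are hypothetical, not our actual topology), a client would try the primary cluster first and fall back to the standby in the other zone/region when the primary cannot be reached:

```python
import urllib.parse
import urllib.request

# Hypothetical endpoints: primary cluster plus a standby in a separate zone/region.
PRIMARY = "http://clickhouse-primary.us-central1.internal:8123"
STANDBY = "http://clickhouse-standby.us-east1.internal:8123"

def run_query(sql, timeout=5.0):
    """Try the primary Clickhouse cluster first; on failure (unreachable
    host, timeout, or an HTTP error) fail over to the standby cluster."""
    last_error = None
    for base in (PRIMARY, STANDBY):
        url = f"{base}/?query={urllib.parse.quote(sql)}"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read().decode()
        except OSError as error:
            last_error = error  # try the next cluster
    raise RuntimeError(f"no Clickhouse cluster reachable: {last_error}")

if __name__ == "__main__":
    print(run_query("SELECT 1"))
```

A real deployment would also need replication between the clusters and health checks before cutover; this only illustrates the client-side failover path.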