Networking issues

Incident Report for Sentry

Postmortem

Summary

On May 23, 2022, at 10:30 UTC, Sentry experienced networking issues across part of our infrastructure, which resulted in some delays in ingestion, notifications, and alerts. We alerted our service provider, Google Cloud Platform, who restored connectivity. Our engineering teams encountered some issues during recovery before we were able to resume ingesting new events at a normal rate and begin processing the backlog of data. Backlog processing was fully completed by May 24, 2022, 02:05 UTC.

We apologize for any inconvenience this may have caused.

Timeline (in UTC)

May 23

10:25 — Responded to alerts that Snuba consumers were accumulating large backlogs of data.

10:30 — Identified that the issues originated within our metrics Clickhouse cluster.

10:37 — Experienced timeouts across our metrics services, and some hosts became unreachable via SSH.

10:40 — Restarting hosts seemed to resolve the issue; however, one of our Kafka clusters became unreachable.

11:14 — Reached out to Google Cloud support

11:15 — We continued to restart other unreachable hosts with varying success

13:03 — Meeting with Google Cloud confirmed infra issues; we proceeded with their recommendation to migrate our hosts

14:27 — Google Cloud investigation completed and mitigation underway

15:39 — Web hosts restarted on different nodes to fix Sentry.io availability

15:55 — Diverted Clickhouse traffic to healthy hosts

16:10 — Backend systems started ingesting new events; backend systems started processing backlog

16:35 — Google Cloud incident given all clear with a rollback of changes

16:37 — Ingestion stopped; started re-attaching file storage with additional space

16:46 — Ingestion restarts at normal rate

17:12 — Snuba consumer memory increased to cope with large batches

18:27 — Clickhouse replication completed and queries are evenly distributed again

22:59 — Optimize backlog processing job scheduled for 24:00

May 24

02:05 — Backlog processing completed

Duration of Instability: 9 hours, 5 minutes

Root Cause

A networking change was rolled out to Google Cloud Networking which caused some instances in Google Compute Engine in us-central1-b to become unreachable. This negatively impacted parts of our stack that could not successfully handle partial failures.

Remediation Plan

We have short-term plans to improve our data backlog processing times by:

Adding connection pooling to Redis
Scaling up our Celery processing pipeline
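
As a rough illustration of the first item, connection pooling keeps a bounded set of open connections and reuses them across tasks instead of opening a new one per operation. The sketch below is a minimal, generic pool using only the Python standard library. It is not Sentry's implementation: `ConnectionPool`, `connect`, and `max_connections` are hypothetical names chosen for this example, and client libraries such as redis-py ship their own pooling that would normally be used instead.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Bounded pool that hands out reusable connections (illustrative only)."""

    def __init__(self, connect, max_connections=4):
        # `connect` is a hypothetical zero-argument factory that opens one connection.
        self._connect = connect
        self._idle = queue.LifoQueue(maxsize=max_connections)
        # Pre-fill with empty slots so connections are opened lazily, on first use.
        for _ in range(max_connections):
            self._idle.put(None)

    @contextmanager
    def connection(self, timeout=1.0):
        # Wait for a free slot; raises queue.Empty if none frees up within `timeout`.
        conn = self._idle.get(timeout=timeout)
        if conn is None:
            conn = self._connect()
        try:
            yield conn
        finally:
            # Hand the connection back so the next caller can reuse it.
            # A production pool would also discard connections that failed.
            self._idle.put(conn)


if __name__ == "__main__":
    opened = []

    def fake_connect():
        # Stand-in for opening a real network connection.
        opened.append(object())
        return opened[-1]

    pool = ConnectionPool(fake_connect, max_connections=2)
    for _ in range(5):
        with pool.connection():
            pass  # a task would issue its commands here
    print(f"connections opened: {len(opened)}")
```

Because the pool is bounded, a burst of work waits for a free connection rather than opening an unbounded number of new ones, which is the property that matters when a large backlog is being drained.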

Our longer-term goal is to enable a standby Clickhouse cluster that resides in a separate zone/region. This aligns with our overall strategy, as outlined following the May 6th incident, to provide better failover capabilities.

Posted Jun 10, 2022 - 18:57 UTC

Resolved

The backfill is complete. All events are now available.

Posted May 24, 2022 - 02:11 UTC

Update

We continue to monitor the backfilling of events. There are no systems issues at this time. ETA for the final resolution of this incident is 02:00 UTC.

Posted May 24, 2022 - 01:31 UTC

Update

We continue to monitor the backfilling of events. There are no systems issues at this time.

Posted May 23, 2022 - 22:52 UTC

Update

We are continuing to monitor the backfill process to ensure all events from earlier today are available to customers. This backfill does not impact any new events coming through the system, and all new event data is being processed within our normal SLAs.

The team anticipates no further impacts from this incident and it will be resolved as soon as the backfill is complete.

We apologize again for the disruption this morning and any delays experienced for these older events.

Posted May 23, 2022 - 20:33 UTC

Monitoring

Mitigations have been made and we're now processing and serving events. We'll be monitoring as we work through our backlog and as the upstream incident continues to be mitigated.

Posted May 23, 2022 - 16:10 UTC

Identified

We have applied the mitigation to part of our stack and continue to work to restore full service.

Posted May 23, 2022 - 14:59 UTC

Update

We have been working with our provider to resolve connectivity and availability problems with nodes in our storage layer. We have some potential workarounds that are being applied while the root cause is further investigated.

These problems have led to intermittent backlogs on web and API requests as well as delays in ingestion processing. We apologize for this disruption and are working to restore services as quickly as possible.

Posted May 23, 2022 - 13:55 UTC

Update

We are continuing to investigate the issue.

Posted May 23, 2022 - 12:54 UTC

Update

We are continuing to investigate the issue.

Posted May 23, 2022 - 12:14 UTC

Investigating

We are currently investigating issues with networking. Ingestion, notification, and alerts may be delayed.

Posted May 23, 2022 - 11:28 UTC

This incident affected: Notification Delivery (Slack, Email, PagerDuty) and US Ingestion (US Error Ingestion, US Transaction Ingestion, US Attachment Ingestion).