System Outage

Incident Report for Universe

Postmortem

Pre-incident

In preparation for a client on sale (a “countdown” style sale of a large volume of tickets), we undertook precautionary measures to deliver service reliability during the sale. Specifically, we had two engineers ready and available prior to, and for the duration of the sale, and we scaled up our instances on AWS to the same levels previously used for previous onsales for this particular client, and others.

Incident Timeline

12:00 GMT - the sale begins, we process 34,500 requests per minute on our GraphQL API, serving the calendar checkout, with good response times

12:02 GMT - we begin to receive extremely slow requests from the Stripe Webhooks API, notifying us of successful charges (note: at the time, this was unknown to engineers)

12:03 GMT - 4000 tickets successfully sold

12:03 GMT - Read Tickets Available in our primary MongoDB node reached 0, and Queued Readers climbed to over 1,000 (note: at the time, this was unknown to engineers)

12:04 GMT - first internal report of potential unavailability, with sales reporting an empty calendar

12:05 GMT - 5000 tickets successfully sold

12:08 GMT - 6000 tickets successfully sold, and we receive additional reports of unavailability

12:08 GMT - engineers decide to bring additional application-tier capacity online, and issue command to AWS to do so (this takes a couple to several minutes to complete)

12:13 GMT - first report of issues from client, via sales and additional capacity from AWS comes online

12:13 GMT - primary MongoDB node stops emitting telemetry, stops serving queries, but continues to send a valid heartbeat

12:14 GMT - Universe’s automated monitoring and reporting systems begin to emit system wide alarms, including the reporting on status.universe.com

12:16 GMT - call between engineers starts, who immediately begin prepping a fix to halt all traffic from the client’s website, in an effort to restore service by halting inflow

12:18 GMT - engineers deploy edge configuration change to start to bounce traffic from client’s website

12:20 GMT - engineers issue escalation to two other engineers

12:45 GMT - access to logs on primary MongoDB node is achieved via SSH, a segfault exception is misidentified as the culprit

12:55 GMT - a support ticket is issued to MongoDB, who sends an engineer to join our call

01:00 GMT - primary MongoDB is still unavailable, but continues to respond to health checks, preventing a leadership election from happening with the secondaries

01:00 GMT - engineers begin co-ordinating with MongoDB Inc support to forcefully failover the primary node

01:10 GMT - a kill command is issued to the mongod process on the primary node

01:20 GMT - the mongod process continues to run, ignoring the kill request - decision is made to wait, in hopes that a rollback volume will be successfully written given writeConcern on the cluster

01:30 GMT - Universe decides further outages are not tolerable, brainstorms another solution which is to firewall off the primary node

01:35 GMT - primary node is firewalled off by removal from the AWS Security Group

01:40 GMT - leadership election completes, a secondary becomes the new primary

01:50 GMT - application traffic is restored to the database, service restoration begins

02:20 GMT - Universe re-adds original primary to the AWS Security Group

Root cause

Universe believes that the issue began at 12:02 GMT, when we began to process webhooks from Stripe. In reviewing our Mongo operation log, we notice ~2300 of queries generated by this processing activity. Under load, some of these queries individually consumed more than 10 minutes of processing time, eating a Read Ticket by Mongo’s storage engine WiredTiger. Our configuration provides us with 128 Read Tickets, and so just 128 of these events processing simultaneously consumes all of them, generating large Reader Queues. During this time, CPU utilization on the primary node hit, and remained at 100%. At 12:08 GMT when we brought additional application-tier capacity online, our new app servers created additional connections at a time when the primary node had 0 Read Tickets available, 100% CPU utilization, and a Reader Queue of over 1k. Immediately after these connections, we observe the unresponsiveness of the system via a number of external pingdoms.

No actual exceptions are visible in the logs for the primary node during this time. We observe a stream of query operations, and then only internal system commands, and client connections. At the moment of unresponsiveness we also observe a drop in CPU to the single digit percent range. During this time, the primary continued to emit heartbeats, despite not processing any actual queries. Our inability later to kill the mongod process on the server later suggests to us that the process reached an only partly-broken state. Because of the continued response to heartbeats, a secondary leadership election was not called.

The root cause of this incident involves (1) expensive queries under load, (2) an unexpected primary node state responsive to heartbeats but not to actual query operations, created by some combination of 100% CPU utilization, large new client connection volumes, large Reader Queues, and empty Reader Ticket availability. We continue to work with MongoDB Inc to understand how this unexpected state is possible.

Lessons Learned and Next Steps

  1. Universe did in fact have an appropriate index for the queries generated by receiving Stripe webhooks defined in code configuration, committed, and reviewed by peers. Unfortunately, our operational process for deploying these indexes to production required manual intervention on the part of the contributor creating them. Universe commits to addressing this by adding index creation to our CI/CD pipeline to mitigate the risk of human error.
  2. Universe does not send reads to our secondary nodes because our Mongo cluster operates with a writeConcern of 1. This ultimately bottlenecks the number of read operations possible in our cluster. Universe will investigate the performance and reliability costs of changing our writeConcern. Universe will separately investigate the feasibility of using secondaries for reads, where appropriate given different writeConcern configurations.
  3. In general, allowing unbounded volumes of traffic into our checkout at once creates unacceptable operational risk. Universe commits to investigating the feasibility of throttling request volumes at edge, or a “perma-queue” strategy for our checkout, or a combination of both.
  4. Universe has access to internal tools that instrument the performance of our application. While we have active monitoring at the system-level, we do not have active monitoring of the performance of individual routes in our application in production. Universe commits to creating active monitoring to automatically detect and alert individual route level performance degradation. Doing so will allow us to capture and triage performance issues that can create cascading failure conditions in production — like missing indices.
Posted Feb 03, 2018 - 16:44 EST

Resolved

We have resolved our production issues with MongoDB, and are completing a post-mortem and a root cause analysis.
Posted Feb 01, 2018 - 09:48 EST

Update

We are currently processing ticket sales globally, service should be restored for all users.
Posted Feb 01, 2018 - 09:18 EST

Monitoring

We have partially restored service. We are continuing to monitor our MongoDB installation.
Posted Feb 01, 2018 - 09:05 EST

Update

We are experiencing a significant outage on our primary MongoDB node. We use MongoDB as our primary data store. We are currently working with our vendor (MongoDB Inc) to restore access.
Posted Feb 01, 2018 - 08:28 EST

Identified

We are currently experiencing a site-wide system outage.
Posted Feb 01, 2018 - 07:46 EST