In preparation for a client on sale (a “countdown” style sale of a large volume of tickets), we undertook precautionary measures to deliver service reliability during the sale. Specifically, we had two engineers ready and available prior to, and for the duration of the sale, and we scaled up our instances on AWS to the same levels previously used for previous onsales for this particular client, and others.
12:00 GMT - the sale begins, we process 34,500 requests per minute on our GraphQL API, serving the calendar checkout, with good response times
12:02 GMT - we begin to receive extremely slow requests from the Stripe Webhooks API, notifying us of successful charges (note: at the time, this was unknown to engineers)
12:03 GMT - 4000 tickets successfully sold
12:03 GMT - Read Tickets Available in our primary MongoDB node reached 0, and Queued Readers climbed to over 1,000 (note: at the time, this was unknown to engineers)
12:04 GMT - first internal report of potential unavailability, with sales reporting an empty calendar
12:05 GMT - 5000 tickets successfully sold
12:08 GMT - 6000 tickets successfully sold, and we receive additional reports of unavailability
12:08 GMT - engineers decide to bring additional application-tier capacity online, and issue command to AWS to do so (this takes a couple to several minutes to complete)
12:13 GMT - first report of issues from client, via sales and additional capacity from AWS comes online
12:13 GMT - primary MongoDB node stops emitting telemetry, stops serving queries, but continues to send a valid heartbeat
12:14 GMT - Universe’s automated monitoring and reporting systems begin to emit system wide alarms, including the reporting on status.universe.com
12:16 GMT - call between engineers starts, who immediately begin prepping a fix to halt all traffic from the client’s website, in an effort to restore service by halting inflow
12:18 GMT - engineers deploy edge configuration change to start to bounce traffic from client’s website
12:20 GMT - engineers issue escalation to two other engineers
12:45 GMT - access to logs on primary MongoDB node is achieved via SSH, a segfault exception is misidentified as the culprit
12:55 GMT - a support ticket is issued to MongoDB, who sends an engineer to join our call
01:00 GMT - primary MongoDB is still unavailable, but continues to respond to health checks, preventing a leadership election from happening with the secondaries
01:00 GMT - engineers begin co-ordinating with MongoDB Inc support to forcefully failover the primary node
01:10 GMT - a kill command is issued to the mongod process on the primary node
01:20 GMT - the mongod process continues to run, ignoring the kill request - decision is made to wait, in hopes that a rollback volume will be successfully written given writeConcern on the cluster
01:30 GMT - Universe decides further outages are not tolerable, brainstorms another solution which is to firewall off the primary node
01:35 GMT - primary node is firewalled off by removal from the AWS Security Group
01:40 GMT - leadership election completes, a secondary becomes the new primary
01:50 GMT - application traffic is restored to the database, service restoration begins
02:20 GMT - Universe re-adds original primary to the AWS Security Group
Universe believes that the issue began at 12:02 GMT, when we began to process webhooks from Stripe. In reviewing our Mongo operation log, we notice ~2300 of queries generated by this processing activity. Under load, some of these queries individually consumed more than 10 minutes of processing time, eating a Read Ticket by Mongo’s storage engine WiredTiger. Our configuration provides us with 128 Read Tickets, and so just 128 of these events processing simultaneously consumes all of them, generating large Reader Queues. During this time, CPU utilization on the primary node hit, and remained at 100%. At 12:08 GMT when we brought additional application-tier capacity online, our new app servers created additional connections at a time when the primary node had 0 Read Tickets available, 100% CPU utilization, and a Reader Queue of over 1k. Immediately after these connections, we observe the unresponsiveness of the system via a number of external pingdoms.
No actual exceptions are visible in the logs for the primary node during this time. We observe a stream of query operations, and then only internal system commands, and client connections. At the moment of unresponsiveness we also observe a drop in CPU to the single digit percent range. During this time, the primary continued to emit heartbeats, despite not processing any actual queries. Our inability later to kill the mongod process on the server later suggests to us that the process reached an only partly-broken state. Because of the continued response to heartbeats, a secondary leadership election was not called.
The root cause of this incident involves (1) expensive queries under load, (2) an unexpected primary node state responsive to heartbeats but not to actual query operations, created by some combination of 100% CPU utilization, large new client connection volumes, large Reader Queues, and empty Reader Ticket availability. We continue to work with MongoDB Inc to understand how this unexpected state is possible.