Platform Outage
Incident Report for Universe
Postmortem

Universe sincerely apologizes for a brief disruption in platform availability on the morning of Nov 6, starting at 8:44 EST. We additionally apologize for the follow-on performance degradation that lasted until 13:50 EST. Universe prides itself on operational excellence, and we know how important service availability is to our customers and to their fans.

Timeline

8:40 EST - MongoDB, our cloud database vendor, executes a replica set restart on our cluster to upgrade TLS. Universe was not informed of this action.

8:44 EST - shard 00 (secondary), shard 01 (secondary), and shard 02 (primary) are restarted

8:44 EST - shard 01 is elected primary

8:45 EST - Universe automatically receives exception notifications and Pingdom alerts notifying us of the outage

9:00 EST - API autoscaling cluster adds a new instance on schedule

9:11 EST - new instance is fully available and online; Pingdom alerts resolve because this machine is accepting new connections

9:13 EST - operations team executes a restart on the other API instances; no further instability is observed, and the team believes the issue is resolved

9:13 EST to 13:17 EST - degraded application performance; API requests take approximately 4x as long as normal, and the product team receives customer support reports

13:00 EST - operations team correctly identifies that the new primary, shard 01, has an additional 20ms of latency to the EC2 instances running the API

13:13 EST - operations team executes a second failover; normal latency between the API and the database (1ms) is restored, and the issue is resolved

Root Cause

The root cause of the initial outage was our API servers' mishandling of the failover condition that followed a new primary election in our Mongo cluster. The root cause of the follow-on latency, and in particular its duration throughout the morning, was a lack of automated alerting at the application performance monitoring level. Universe has a latency threshold on our load balancer that is meant to perform this function, but the threshold was not breached, even though performance was 4x worse than under normal operating conditions.
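To illustrate the kind of handling we intend to add, the following is a minimal sketch assuming the API talks to the cluster through the Ruby mongo driver. The hostnames, replica set name, database name, and the with_failover_retry helper are illustrative assumptions rather than our production code; the point is simply that a brief primary election should result in a short retry rather than surfaced errors.

require 'mongo'

# Illustrative client configuration; hostnames, replica set name, and database
# name are placeholders for this sketch, not our production setup.
client = Mongo::Client.new(
  ['shard-00.example.net:27017',
   'shard-01.example.net:27017',
   'shard-02.example.net:27017'],
  database: 'universe_production',   # placeholder database name
  replica_set: 'rs0',                # placeholder replica set name
  read: { mode: :primary },
  server_selection_timeout: 5        # seconds to wait for a new primary during an election
)

# Retry a block a few times when the driver reports that no server is
# available or that an operation hit a stepped-down node, which is what
# happens briefly while the replica set elects a new primary.
def with_failover_retry(attempts: 3)
  yield
rescue Mongo::Error::NoServerAvailable, Mongo::Error::OperationFailure
  attempts -= 1
  raise if attempts.zero?
  sleep 1 # give the cluster a moment to elect and advertise the new primary
  retry
end

# Example usage: a read that survives a primary step-down with a short retry
# instead of surfacing an error to the client.
pending_orders = with_failover_retry do
  client[:orders].find(status: 'pending').limit(10).to_a
end

In practice we would scope the rescued errors more narrowly (for example, only election-related failures) and cap the total retry time, but the shape of the fix is the same.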

Next Steps

1. Determine why shard 01 has elevated latency and take action, so that in future leadership elections (failover conditions) we are not exposed to this additional latency

2. Establish a ping check with automatic alerting for elevated shard latency in our Mongo cluster (see the sketch following this list)

3. Attempt to reproduce in our API the failover mishandling that followed the leadership election

4. Maintain and develop custom failover handling for our Ruby-based Mongo driver dependency

5. Improve general application query performance by reducing the total number of queries required by common application activities

6. Improve our application performance alerting so that we are alerted on outlier conditions rather than relying on arbitrary threshold breaches

7. Request that MongoDB, our cloud vendor, inform us of cluster maintenance activities before they are executed
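Items 2 and 6 point in the same direction: continuously probe each shard's latency and alert when a reading is an outlier relative to that shard's own recent baseline, rather than waiting for a fixed threshold to be breached. The sketch below shows one way this could look, again assuming the Ruby mongo driver; the shard hostnames, sampling window, and 3x-median rule are illustrative assumptions, not an agreed design.

require 'mongo'

# Hypothetical latency probe. Hostnames, the sampling window, and the
# 3x-median alert rule are placeholders for illustration only.
SHARDS = {
  'shard-00' => 'shard-00.example.net:27017',
  'shard-01' => 'shard-01.example.net:27017',
  'shard-02' => 'shard-02.example.net:27017'
}.freeze

WINDOW = 60 # keep the last 60 samples per shard

def measure_ms(address)
  client = Mongo::Client.new([address], database: 'admin', connect: :direct,
                             connect_timeout: 2, server_selection_timeout: 2)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  client.database.command(ping: 1) # cheap round trip to this member
  ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round(2)
ensure
  client&.close
end

def median(values)
  sorted = values.sort
  sorted[sorted.length / 2] # upper median is close enough for this purpose
end

history = Hash.new { |h, k| h[k] = [] }

loop do
  SHARDS.each do |name, address|
    latency = measure_ms(address)
    samples = history[name]
    # Compare against this shard's own recent baseline instead of a fixed
    # global threshold, so a 20ms member stands out even when 20ms would
    # never trip a load-balancer-level alert.
    if samples.length >= 10 && latency > 3 * median(samples)
      warn "ALERT: #{name} latency #{latency}ms is an outlier (recent median #{median(samples)}ms)"
    end
    samples << latency
    samples.shift if samples.length > WINDOW
  end
  sleep 30
end

The warn call stands in for paging the on-call engineer through our alerting provider. The important property is that the baseline is rolling and per shard, which is exactly the condition that went unnoticed for four hours during this incident.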

Posted Nov 06, 2018 - 15:37 EST

Resolved
We have not observed any further instability issues. A post-mortem will follow.
Posted Nov 06, 2018 - 10:50 EST
Monitoring
We are observing platform stability and continue to actively monitor the situation. Queue inflow has resumed.
Posted Nov 06, 2018 - 09:24 EST
Identified
We are observing a high number of 500-series HTTP responses, but continue to serve traffic intermittently. Our checkout queue remains paused.

We observed a failover of our Mongo HA cluster to a secondary backup at 13:44 UTC. We believe our web cluster is incorrectly connecting to the previous primary intermittently, and are executing a fix.
Posted Nov 06, 2018 - 09:09 EST
Investigating
We are investigating a platform outage. We have paused our checkout queue.
Posted Nov 06, 2018 - 08:54 EST
This incident affected: API and Ticket Processing Queue.