Universe sincerely apologizes for a brief disruption in platform availability on the morning of Nov 6, starting at 8:44 EST, and for the follow-on performance degradation that lasted until 13:50 EST. Universe prides itself on operational excellence, and we know how important service availability is to our customers and to their fans.
Timeline
8:40 EST - MongoDB, our cloud database vendor, executes a replica set restart on our cluster to upgrade TLS. Universe was not informed of this action.
8:44 EST - shard 00 (secondary) is restarted, shard 01 (secondary) is restarted, shard 02 (primary) is restarted
8:44 EST - shard 01 is elected primary
8:45 EST - Universe receives automated exception notifications and Pingdom alerts notifying us of the outage
9:00 EST - API autoscaling cluster adds a new instance, on schedule
9:11 EST - new instance is fully available and online; Pingdom checks recover because this machine is accepting new connections
9:13 EST - operations team executes a restart on the other API instances; no further instability is observed, and the team believes the issue is resolved
9:13 EST to 13:17 EST - degraded application performance; API requests take approximately 4x as long, and the product team receives customer support reports
13:00 EST - operations team correctly identifies that the new primary, shard 01, has an additional 20ms of latency to the EC2 instances running the API
13:13 EST - operations team executes a second failover; normal latency between the API and the database (1ms) is restored, and the incident is resolved
Root Cause
The root cause of the initial outage was our API servers mishandling the failover condition that followed a new primary election in our Mongo cluster. The root cause of the follow-on latency, and in particular its duration throughout the morning, was a lack of automated alerting at the application performance monitoring level. Universe has a latency threshold on our load balancer that performs this function, but the threshold was never breached, even though performance was 4x worse than under normal operating conditions.
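As a sketch of the kind of driver configuration that softens a primary election, the Ruby Mongo driver exposes retry and server-selection options. The hostnames, replica set name, and timeout values below are illustrative, not our production settings:

```ruby
require 'mongo'

# Illustrative client options for tolerating a primary step-down.
# retry_writes/retry_reads let the driver transparently retry an
# operation once after a "not primary" error, and
# server_selection_timeout bounds how long the driver waits for a
# new primary to be elected before raising.
client = Mongo::Client.new(
  ['shard00.example.com:27017',
   'shard01.example.com:27017',
   'shard02.example.com:27017'],
  replica_set: 'rs0',
  retry_writes: true,
  retry_reads: true,
  server_selection_timeout: 10, # seconds to wait for a usable primary
  connect_timeout: 5
)
```

Whether these defaults are appropriate depends on reproducing the bad failover behavior first (see Next Steps), since a retry that masks an election is only safe for idempotent operations.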
Next Steps
1. Determine why shard 01 has elevated latency and take action, so that in future leadership elections (failover conditions) we are not exposed to this additional latency
2. Establish a ping check with automatic alerting for elevated shard latency in our Mongo cluster
3. Attempt to reproduce the bad failover conditions after the leadership election in our API
4. Maintain and develop custom failover handling for our Ruby-based Mongo driver dependency
5. Improve general application query performance by reducing the number of total queries required in common application activities
6. Improve our application performance alerting so that we know about outlier conditions instead of arbitrary threshold breaches
7. Request that our service provider/vendor MongoDB inform us of cluster maintenance activities before they are executed
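To illustrate step 6, here is a minimal sketch of outlier-based alerting (method names, baseline values, and the sensitivity `k` are all hypothetical): rather than comparing each sample against a fixed threshold, flag samples that sit several median absolute deviations (MAD) above a recent baseline, which would have caught a 4x regression that never crossed an absolute limit.

```ruby
# Flag latency samples that are statistical outliers relative to a
# baseline window, rather than comparing against a fixed threshold.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# True when `sample_ms` sits more than `k` MADs above the baseline median.
def latency_outlier?(baseline_ms, sample_ms, k: 5)
  med = median(baseline_ms)
  mad = median(baseline_ms.map { |v| (v - med).abs })
  mad = 1.0 if mad.zero? # avoid a degenerate scale for flat baselines
  (sample_ms - med) > k * mad
end

baseline = [48, 50, 52, 49, 51, 50, 47, 53] # normal ~50ms responses
latency_outlier?(baseline, 52)  # => false: ordinary jitter
latency_outlier?(baseline, 200) # => true: a 4x regression is flagged
```

The advantage over our current load-balancer threshold is that the alert scales with observed behavior: the same check flags a regression whether the baseline is 50ms or 500ms.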