API Outage
Incident Report for Universe
Postmortem

At 1:32 PM EDT a developer issued a deploy on our Core Service API. This API runs inside an Auto Scaling Group fronted by an Elastic Load Balancer. Our deployment process for this service runs the following procedure:

  1. Pause autoscaling activities on the Auto Scaling Group
  2. For each EC2 instance, stagger the code deployment and restart phase
  3. On each instance, before the code deployment and restart begins, remove the instance from the load balancer
  4. On successful restart, re-add the instance to the load balancer
  5. On successfully restarting all instances, re-enable autoscaling activities.
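The procedure above could be sketched roughly as follows, assuming boto3 clients for the classic Elastic Load Balancer and Auto Scaling APIs; the function and parameter names here are illustrative, not our actual deploy tooling:

```python
def deploy_core_service(asg, elb, asg_name, lb_name, instance_ids, deploy_and_restart):
    """Staggered deploy: pause scaling, roll each instance out of the load
    balancer, deploy and restart it, re-add it, then resume scaling.
    `asg` and `elb` are assumed to be boto3 'autoscaling' and 'elb' clients;
    `deploy_and_restart` is a hypothetical per-instance deploy hook."""
    # Step 1: pause autoscaling activities on the Auto Scaling Group.
    asg.suspend_processes(AutoScalingGroupName=asg_name)
    try:
        # Step 2: stagger the deployment, one instance at a time.
        for instance_id in instance_ids:
            # Step 3: remove the instance from the load balancer first.
            elb.deregister_instances_from_load_balancer(
                LoadBalancerName=lb_name,
                Instances=[{"InstanceId": instance_id}])
            deploy_and_restart(instance_id)
            # Step 4: on successful restart, re-add it to the load balancer.
            elb.register_instances_with_load_balancer(
                LoadBalancerName=lb_name,
                Instances=[{"InstanceId": instance_id}])
    finally:
        # Step 5: re-enable autoscaling activities, even if a step failed.
        asg.resume_processes(AutoScalingGroupName=asg_name)
```

Note that each instance is out of service between steps 3 and 4, which is why a failure to complete step 4 leaves capacity reduced.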

During the 1:32 PM EDT deploy, the API requests we make to AWS began to fail, returning 503 error codes. As the deploy proceeded, no newly restarted instances were available to serve traffic, because none had completed step 4 (re-adding the instance to the load balancer). At this time, our operations team was automatically alerted to the outage and began investigating.

At 1:37, one instance completed step 4 (it was re-added to the load balancer) and resumed serving traffic. Our automated alerting reported the outage as resolved, leading operations to believe no further issues were present. However, a single instance is not sufficient to serve our traffic volume. Latency on the load balancer quickly grew, causing intermittent health check failures between 1:37 and 1:50 and triggering a separate latency alert at 1:50. At this point, we paused the inflow on our checkout queue, preventing users from seeing errors during purchases. At 1:51 our operations team identified the root cause as insufficient instances behind the load balancer, and manually re-added the instances whose re-registration via the AWS API had failed.
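The manual fix amounted to reconciling the set of instances that should be serving traffic against what the load balancer actually reported. A minimal sketch of that reconciliation, assuming a boto3 classic ELB client (the helper name is hypothetical):

```python
def readd_missing_instances(elb, lb_name, expected_ids):
    """Re-register any expected instance the load balancer does not know
    about. `elb` is assumed to be a boto3 'elb' (classic ELB) client."""
    # Ask the load balancer which instances it currently has attached.
    health = elb.describe_instance_health(LoadBalancerName=lb_name)
    attached = {state["InstanceId"] for state in health["InstanceStates"]}
    # Anything expected but not attached needs to be re-registered.
    missing = [i for i in expected_ids if i not in attached]
    if missing:
        elb.register_instances_with_load_balancer(
            LoadBalancerName=lb_name,
            Instances=[{"InstanceId": i} for i in missing])
    return missing
```

Running a check like this automatically after every deploy would have surfaced the partially failed re-registration before the latency alert fired.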

At 1:55 our operations team validated that the issue had been resolved, and closed the issue at 1:56, resuming the queue.

Universe takes the operational requirements of our service very seriously. To improve the quality of our infrastructure, and to mitigate issues like this, we have been working on a migration to Kubernetes as our service orchestration platform. We are currently aiming to complete the migration of our Core Service to Kubernetes by the end of Q3 2018. When this is complete, our deploys of this service will follow the blue/green deployment pattern, which stages the deploy into segments, validating success along the way. This will replace the deployment procedure described above, and we expect it to deliver substantially stronger reliability guarantees.
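As a rough illustration of the staged, validate-as-you-go behavior we expect from blue/green deploys, here is a minimal sketch; the traffic-shifting and health-check hooks are assumptions for illustration, not our actual tooling:

```python
def blue_green_cutover(shift_traffic, healthy, stages=(0.1, 0.5, 1.0)):
    """Shift traffic to the new ('green') deployment in stages, validating
    health at each step. On any failed check, shift all traffic back to the
    old ('blue') deployment and report failure.

    `shift_traffic(fraction)` and `healthy()` are hypothetical hooks."""
    for fraction in stages:
        shift_traffic(fraction)       # send this fraction of traffic to green
        if not healthy():
            shift_traffic(0.0)        # roll back: all traffic to blue
            return False
    return True                       # green now serves all traffic
```

The key property relative to the old procedure is that capacity is never reduced mid-deploy: blue keeps serving until green has been validated at each stage.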

Posted Jun 21, 2018 - 14:20 EDT

Resolved
We have resolved the issue and will post a post-mortem shortly.
Posted Jun 21, 2018 - 13:59 EDT
Identified
We have identified that the majority of our core API servers are not receiving traffic from our load balancer (an ELB on AWS). We are manually adding these API servers back to the load balancer.
Posted Jun 21, 2018 - 13:56 EDT
Investigating
Universe is currently experiencing an API outage. We are investigating.

We have temporarily paused our checkout queue to preserve purchases.
Posted Jun 21, 2018 - 13:51 EDT
This incident affected: API and Ticket Processing Queue.