At 1:32 EST, a developer started a deploy of our Core Service API. This API runs inside an Auto Scaling Group fronted by an Elastic Load Balancer. Our deployment process for this service follows this procedure:
During the 1:32 EST deploy, the API requests we make to AWS began to fail, returning 503 error codes. As the deploy proceeded, no restarted instances were available to serve traffic, because none of them had completed step 4. At this time, our operations team was automatically alerted to the outage and began investigating.
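A common guard against transient 503 responses like these is to retry the failing call with exponential backoff. The sketch below is illustrative only and is not taken from our deploy tooling: `TransientAPIError`, `call_with_retries`, and the `operation` callable are hypothetical names, and the retry count and delays are assumed values.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical error type standing in for a 503 from the provider API."""


def call_with_retries(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on TransientAPIError with exponential backoff.

    Returns the operation's result, or re-raises the last error once
    max_attempts calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientAPIError:
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

In a deploy script, `operation` would wrap whichever call adds an instance back to the load balancer, so that a short burst of 503s delays the step instead of silently skipping it.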
At 1:37, one instance completed step 4 and resumed serving traffic. Our automated alerting reported the outage as resolved, leading operations to believe no further issues were present. However, a single instance is not sufficient to serve our traffic volume. Latency on the load balancer grew quickly, leading to intermittent health check failures between 1:37 and 1:50 and triggering a separate latency alert at 1:50. At this point, we paused the inflow on our checkout queue to prevent users from seeing errors during purchases. At 1:51, our operations team identified the root cause as an insufficient number of instances in our load balancer, and manually re-added the instances that had not been successfully added back via the AWS API.
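The gap between "one instance is serving" and "enough instances are serving" can be made explicit in a recovery check. The following is a minimal sketch under stated assumptions, not our actual alerting logic: the function names, the instance IDs, and the required count of 4 are all hypothetical.

```python
def outage_resolved(in_service_count, required_count):
    """Treat the outage as resolved only when enough instances serve traffic.

    A check that passes as soon as any instance is healthy would report
    recovery with a single instance; comparing against the required count
    avoids that.
    """
    return in_service_count >= required_count


def missing_instances(expected_ids, in_service_ids):
    """Return the expected instances that are not yet back in the load balancer."""
    return sorted(set(expected_ids) - set(in_service_ids))


# Hypothetical fleet: four instances expected, only one re-registered.
expected = ["i-a", "i-b", "i-c", "i-d"]
in_service = ["i-b"]

print(outage_resolved(len(in_service), required_count=4))  # False
print(missing_instances(expected, in_service))  # ['i-a', 'i-c', 'i-d']
```

With a check of this shape, the alert would have stayed open at 1:37, and the list of missing instances would point directly at what needs to be re-added.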
At 1:55, our operations team confirmed that the problem had been resolved. They closed the issue at 1:56 and resumed the checkout queue.
Universe takes the operational requirements of our service very seriously. To improve the quality of our infrastructure and to reduce the likelihood of issues like this, we have been working on a migration to Kubernetes as a service orchestration platform. We are currently aiming to complete the migration of our Core Service to Kubernetes by the end of Q3 2018. When this is complete, our deploys of this service will follow the blue/green deployment pattern, which stages the deploy into segments and validates success along the way. This will replace the deployment procedure described above, and we expect it to deliver substantially greater reliability.