Latency

Incident Report for Universe

Postmortem

On Thursday, June 28, the operations team deployed a configuration change to the Kubernetes cluster that runs our billing system. The change modified the authentication mechanism. No outages were visible after the deployment.

On Friday, June 29, the operations team rotated the Kubernetes pods in response to reported delays in order processing. After the rotation, the ingress controller could no longer route traffic to the pods, because it did not support the authentication change deployed on Thursday.

This triggered an automatic outage alert; the response is detailed in the timeline previously posted. For the duration of the outage, we paused the checkout queue to prevent user checkout errors. Ticket Manager users were not automatically notified of the outage. Universe intends to improve this so that communication with all end users is visible throughout any business interruption.

Universe regrets the errors leading to this outage. We are proud of this service's uptime record of 99.97% over the past 12 months.
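
For readers who want to put the uptime figure in concrete terms, the short sketch below converts an availability percentage into the downtime it allows over a period. The helper name `downtime_hours` is hypothetical, and the calculation assumes a 365-day year; it illustrates the arithmetic only and is not a record of actual downtime.

```python
# Hypothetical helper: converts an availability percentage into the
# amount of downtime it permits over a given period.
def downtime_hours(availability_percent: float, period_hours: float) -> float:
    return period_hours * (1 - availability_percent / 100)

# Assumption: the 12-month period is a 365-day year.
HOURS_PER_YEAR = 365 * 24

# 99.97% availability over one year permits roughly 2.6 hours of downtime.
print(round(downtime_hours(99.97, HOURS_PER_YEAR), 2))
```

Under these assumptions, 99.97% uptime over 12 months corresponds to a little over two and a half hours of total unavailability.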

Posted Jul 02, 2018 - 16:16 EDT

Resolved

The incident has been resolved. Universe sincerely apologizes for the service interruption. We will post a full post-mortem within 24 hours.

Posted Jun 29, 2018 - 19:46 EDT

Update

We are continuing to work on a fix for this issue.

Posted Jun 29, 2018 - 19:44 EDT

Identified

We believe that we have identified the source of the outage, and our operations team is working towards a resolution.

Specifically, we believe that a configuration change deployed to our Kubernetes cluster yesterday gradually degraded service and finally led to an outage.

Posted Jun 29, 2018 - 19:41 EDT

Update

We are escalating this issue to a major outage. We have paused the queue to prevent loss of purchases. We will update shortly.

Posted Jun 29, 2018 - 18:57 EDT

Investigating

We are observing latency in our ticket processing system. All tickets are being processed; this is not an outage. Instead, we are observing much higher latency than usual. Practically speaking, this means that purchasing tickets may take longer than usual. We are investigating.

Posted Jun 29, 2018 - 18:12 EDT

This incident affected: Ticket Processing.