Incident Report for Universe

On Thursday, June 28, the operations team deployed a configuration change to the Kubernetes cluster that runs our billing system. The change affected the authentication mechanism. No outages were visible after the deployment.

On Friday, June 29, the operations team rotated the Kubernetes pods in response to reported delays in order processing. After the rotation, the ingress controller could no longer route traffic to the pods, because it did not support the authentication change deployed on Thursday.

This triggered an automatic outage alert; the response is detailed in the timeline previously posted. During the outage, we paused the checkout queue to prevent user checkout errors. Ticket Manager users, however, were not automatically notified of the outage. Universe aims to improve this so that communication is visible to all end users throughout any business interruption.

Universe regrets the errors leading to this outage. We are proud of this service's uptime record of 99.97% over the past 12 months.

Posted 11 months ago. Jul 02, 2018 - 16:16 EDT

The incident has been resolved. Universe offers its sincere apologies for the service interruption. We will post a full post-mortem within 24 hours.
Posted 11 months ago. Jun 29, 2018 - 19:46 EDT
We are continuing to work on a fix for this issue.
Posted 11 months ago. Jun 29, 2018 - 19:44 EDT
We believe that we have identified the source of the outage, and our operations team is working towards a resolution.

Specifically, we believe that a configuration change deployed to our Kubernetes cluster yesterday gradually degraded service and ultimately caused the outage.
Posted 11 months ago. Jun 29, 2018 - 19:41 EDT
We are escalating this issue to a major outage. We have paused the queue to prevent loss of purchases. We will update shortly.
Posted 11 months ago. Jun 29, 2018 - 18:57 EDT
We are observing latency in our ticket processing system. All tickets are being processed; this is not an outage. Instead, we are observing much higher latency than usual. Practically speaking, this means that purchasing tickets may take longer than usual. We are investigating.
Posted 11 months ago. Jun 29, 2018 - 18:12 EDT
This incident affected: Ticket Processing Queue.