On Thursday, June 28, the operations team deployed a configuration change to the authentication mechanism of the Kubernetes cluster that runs our billing system. No outages were visible after the deployment.
On Friday, June 29, the operations team rotated the Kubernetes pods in response to reported delays in order processing. Once rotated, the pods could no longer receive traffic from the ingress controller, because the controller did not yet support Thursday's authentication change.
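To illustrate the failure mode, here is a hypothetical sketch; the resource names, annotation, and namespace-less layout below are assumptions for illustration, not our actual manifests. A pod template is updated to require a new authentication method (standing in for Thursday's change), while the Ingress resource the controller consumes is left unchanged, so freshly rotated pods reject the controller's connections:

```yaml
# Hypothetical backend Deployment: new pods require mTLS from callers
# (stands in for Thursday's authentication change).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api                        # assumed name
spec:
  selector:
    matchLabels:
      app: billing-api
  template:
    metadata:
      labels:
        app: billing-api
      annotations:
        auth.example.com/require-mtls: "true"   # hypothetical annotation
    spec:
      containers:
      - name: billing-api
        image: registry.example.com/billing-api:latest
---
# Ingress left unchanged: the controller still proxies plain HTTP to the
# backend, so pods created after the change refuse its connections.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: billing-ingress                    # assumed name
spec:
  rules:
  - host: billing.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: billing-api
            port:
              number: 80
```

This also explains the one-day delay: pods created before Thursday's change kept accepting the controller's traffic, so the mismatch only surfaced when rotation replaced them on Friday.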
This triggered an automatic outage alert; the response is detailed in the timeline posted previously. During the outage we paused the checkout queue to prevent checkout errors for users. Ticket Manager users, however, were not automatically notified. Universe is working to improve this so that all end users receive visible communication throughout any business interruption.
Universe regrets the errors that led to this outage. We remain proud of this service's uptime record of 99.97% over the past 12 months.