On Tuesday, the Universe API (and by extension our platform generally) experienced a period of severely degraded performance. This period lasted 73 minutes, starting at 9:07 UTC and ending at 10:20 UTC. We then experienced a second, shorter period of degraded performance from 14:10 UTC to 14:15 UTC.
I am reaching out to apologize.
Universe knows how important service reliability is to your business. We work hard to maintain an industry-leading level of reliability, and on Tuesday we failed in that core mission. The analysis below represents the preliminary results of our investigation into this incident. We will invest in the appropriate process and infrastructure remediations to prevent this issue from recurring, and to accelerate our time-to-resolution in the future. Over the trailing 12 months, we have maintained an uptime of 99.95% on our core API, and we will work to earn back your trust.
Joshua Kelly - Universe CTO
9:07 UTC - A latency alarm is triggered on the core API
9:30 UTC - Our operations team begins incident response.
9:35 UTC - As a first course of action, additional server resources are allocated in AWS (in our core API autoscaling group). At the time, increased demand from an onsale running concurrently with the incident was believed to be the origin of the issue.
9:36 UTC - The global checkout queue for Universe is paused to observe the effect of the additional server resources.
9:40 UTC - Latency decreases, and the incident operations team believes the issue has been addressed. The queue is re-enabled at a slower inflow rate.
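Universe's actual queueing implementation is not shown in this report. As a minimal sketch of the kind of throttled inflow described above, assuming a simple token-bucket rate limiter (all names here are illustrative):

```python
import time


class TokenBucket:
    """Minimal token-bucket limiter: admit checkout attempts at a
    configurable steady rate, with a small burst allowance."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec        # tokens replenished per second
        self.capacity = burst           # maximum stored tokens
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Accrue tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Re-enabling the queue at a slower inflow then amounts to running the
# limiter with a reduced rate while the incident is investigated.
normal_inflow = TokenBucket(rate_per_sec=100, burst=50)
reduced_inflow = TokenBucket(rate_per_sec=10, burst=5)
```

Under this scheme, checkout attempts that exceed the configured rate are held back rather than rejected outright, which keeps load on downstream services bounded during recovery.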
9:45 UTC - After 5 minutes, high latency is once again observed on the platform. The queue is paused again, and further investigation begins.
10:06 UTC - By this time, the incident operations team identifies that the source of the high latency is not, in fact, service demand, but slow query performance in the database managing user Sessions. This database serves both user Sessions and user reporting needs. Contention from expensive reporting queries is particularly damaging to the platform, because Session access is required to resolve every API query that requires user authentication, and these queries constitute the bulk of our "operational" activity. The incident operations team escalates the issue to the core development team in the Eastern time zone.
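The failure mode described here (an authentication-critical lookup sharing a database with analytical reporting workloads) can be illustrated with a small sketch. The schema and queries below are hypothetical, not Universe's actual schema: an indexed Session lookup touches only a handful of rows, while a reporting-style aggregate on the same table forces a full scan, competing for the same database resources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (token TEXT, user_id INTEGER, created_at TEXT)"
)
conn.execute("CREATE INDEX idx_sessions_token ON sessions (token)")

# Session resolution: a point lookup that can use the index.
auth_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT user_id FROM sessions WHERE token = ?",
    ("abc",),
).fetchall()

# Reporting-style aggregate: no index applies, so the whole table is scanned.
report_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT user_id, COUNT(*) FROM sessions GROUP BY user_id"
).fetchall()

# The plan details show the asymmetry: the auth lookup searches via
# idx_sessions_token, while the report performs a table scan.
print(auth_plan[0][3])
print(report_plan[0][3])
```

When both workloads share one database, a scan like the second one can hold resources long enough to starve the point lookups that every authenticated API request depends on; a common remediation is to move reporting onto a read replica.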
10:20 UTC - After consulting with the core development team, the decision is made to restart the Sessions database. Latency falls back within normal ranges, the queue is re-enabled successfully, and sales continue. The issue is deemed resolved.
14:10 UTC - A new latency alarm is triggered, and the incident operations team immediately responds. The issue is quickly determined to be similar in nature to the incident resolved at 10:20. A separate status incident is opened for this issue.
14:15 UTC - The incident operations team completes a second reboot of the Sessions database, and latency again falls within normal ranges.
15:17 UTC - The core development team releases a change to the reporting service, de-duplicating a particularly expensive query believed to be the root cause of the instability in the Sessions service. This change is expected to prevent the issue from recurring.
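The actual change to the reporting service is not shown in this report. As a minimal sketch of what de-duplicating an expensive query can look like (the function names and query text below are illustrative), repeated identical requests can be collapsed into a single database execution whose result is reused:

```python
import functools

EXECUTIONS = {"count": 0}


def run_query(sql: str) -> str:
    """Stand-in for the real database call; counts how often it runs."""
    EXECUTIONS["count"] += 1
    return f"rows for: {sql}"


@functools.lru_cache(maxsize=128)
def run_report_query(sql: str) -> str:
    """De-duplicated entry point: identical query text is executed once,
    and subsequent calls reuse the cached result instead of hitting
    the database again."""
    return run_query(sql)


# Several dashboard widgets requesting the same expensive aggregate
# now cost one database execution instead of three.
for _ in range(3):
    run_report_query(
        "SELECT event_id, SUM(amount) FROM orders GROUP BY event_id"
    )

print(EXECUTIONS["count"])  # 1
```

In a production service the cache would also need an expiry policy so reports do not go stale, but the principle is the same: the Sessions database sees one copy of the expensive query rather than one per caller.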