On Tuesday, the Universe API (and by extension our platform generally) experienced a period of severely degraded performance. This period lasted 73 minutes, starting at 9:07 UTC and ending at 10:20 UTC. We then experienced a second, shorter period of degraded performance from 14:10 UTC to 14:15 UTC.
I am reaching out to apologize.
Universe knows how important service reliability is to your business. We work hard to maintain an industry-leading level of reliability, and on Tuesday we failed in that core mission. The analysis below represents the preliminary results of our investigation into this incident. We will invest in the appropriate process and infrastructure remediations to prevent this issue from recurring, and to accelerate our time-to-resolution in the future. Over the trailing 12 months, we have maintained an uptime of 99.95% on our core API, and we will work to earn back your trust.
Joshua Kelly - Universe CTO
9:07 UTC - A latency alarm is triggered on the core API
9:30 UTC - Our operations team begins incident response.
9:35 UTC - As a first course of action, additional server resources are allocated in AWS (in our core API autoscaling group). At the time, increased demand from an onsale running concurrently with the incident was believed to be the origin of the issue.
9:36 UTC - The global checkout queue for Universe is paused to observe the effect of the additional server resources.
9:40 UTC - Latency decreases, and the incident operations team believes the issue has been addressed. The queue is re-enabled at a slower inflow rate.
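Universe's actual queueing implementation is not shown in this report. As a minimal sketch of the kind of throttled inflow described above, assuming a simple token-bucket rate limiter (all names here are illustrative):

```python
import time


class TokenBucket:
    """Minimal token-bucket limiter: admit checkout attempts at a
    configurable steady rate, with a small burst allowance."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec        # tokens replenished per second
        self.capacity = burst           # maximum stored tokens
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Accrue tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Re-enabling the queue at a slower inflow then amounts to running the
# limiter with a reduced rate while the incident is investigated.
normal_inflow = TokenBucket(rate_per_sec=100, burst=50)
reduced_inflow = TokenBucket(rate_per_sec=10, burst=5)
```

Under this scheme, checkout attempts that exceed the configured rate are held back rather than rejected outright, which keeps load on downstream services bounded during recovery.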
9:45 UTC - After 5 minutes, high latency is once again observed on the platform. The queue is paused again, and further investigation begins.
10:06 UTC - By this time, the incident operations team identifies that the source of the high latency is not, in fact, service demand, but slow query performance in the database managing user Sessions. This database serves both user Sessions and user reporting needs. Contention from expensive reporting queries is particularly damaging to the platform, because Session access is required to resolve every API query that requires user authentication, and these queries constitute the bulk of our "operational" activity. The incident operations team escalates the issue to the core development team in the Eastern time zone.
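The failure mode described here (an authentication-critical lookup sharing a database with analytical reporting workloads) can be illustrated with a small sketch. The schema and queries below are hypothetical, not Universe's actual schema: an indexed Session lookup touches only a handful of rows, while a reporting-style aggregate on the same table forces a full scan, competing for the same database resources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (token TEXT, user_id INTEGER, created_at TEXT)"
)
conn.execute("CREATE INDEX idx_sessions_token ON sessions (token)")

# Session resolution: a point lookup that can use the index.
auth_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT user_id FROM sessions WHERE token = ?",
    ("abc",),
).fetchall()

# Reporting-style aggregate: no index applies, so the whole table is scanned.
report_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT user_id, COUNT(*) FROM sessions GROUP BY user_id"
).fetchall()

# The plan details show the asymmetry: the auth lookup searches via
# idx_sessions_token, while the report performs a table scan.
print(auth_plan[0][3])
print(report_plan[0][3])
```

When both workloads share one database, a scan like the second one can hold resources long enough to starve the point lookups that every authenticated API request depends on; a common remediation is to move reporting onto a read replica.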
10:20 UTC - After consulting with the core development team, the decision is made to restart the Sessions database. Latency falls back within normal ranges, the queue is re-enabled successfully, and sales continue. The issue is deemed resolved.
14:10 UTC - A new latency alarm is triggered, and the incident operations team immediately responds. The issue is quickly determined to be similar in nature to the incident resolved at 10:20. A separate status incident is opened for this issue.
14:15 UTC - The incident operations team completes a second reboot of the Sessions database, and latency again falls within normal ranges.
15:17 UTC - The core development team releases a change to the reporting service, de-duplicating a particularly expensive query believed to be the root cause of the instability in the Sessions service. This change is expected to prevent the issue from recurring.
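The actual change to the reporting service is not shown in this report. As a minimal sketch of what de-duplicating an expensive query can look like (the function names and query text below are illustrative), repeated identical requests can be collapsed into a single database execution whose result is reused:

```python
import functools

EXECUTIONS = {"count": 0}


def run_query(sql: str) -> str:
    """Stand-in for the real database call; counts how often it runs."""
    EXECUTIONS["count"] += 1
    return f"rows for: {sql}"


@functools.lru_cache(maxsize=128)
def run_report_query(sql: str) -> str:
    """De-duplicated entry point: identical query text is executed once,
    and subsequent calls reuse the cached result instead of hitting
    the database again."""
    return run_query(sql)


# Several dashboard widgets requesting the same expensive aggregate
# now cost one database execution instead of three.
for _ in range(3):
    run_report_query(
        "SELECT event_id, SUM(amount) FROM orders GROUP BY event_id"
    )

print(EXECUTIONS["count"])  # 1
```

In a production service the cache would also need an expiry policy so reports do not go stale, but the principle is the same: the Sessions database sees one copy of the expensive query rather than one per caller.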