Just before 1 am ET, one of the databases we use to track payments experienced an out-of-memory error. This delayed the final step in the automatic capture of credit card funds for the duration of the outage. Once service was restored, payment collection resumed automatically, retroactively finalizing transactions created during the outage.
While we have extensive monitoring in place to notify us of disruptive conditions in advance, we’ve determined that the alarm for this particular metric was the wrong type, so the warning signs of the impending outage were missed. Once the database was disrupted, upstream health checks began to fail, and those failures were immediately surfaced to the entire dev team in Slack.
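The post doesn’t describe how those health-check failures reach Slack; as a rough sketch only, a check that posts failures to a Slack incoming webhook might look like the following, where the webhook URL and health endpoint are both hypothetical placeholders:

```python
import requests

# Hypothetical values; substitute your real webhook and service endpoint.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"
HEALTH_ENDPOINT = "https://payments.example.com/health"

def check_and_alert():
    """Ping the service health endpoint; post to Slack if it fails."""
    try:
        resp = requests.get(HEALTH_ENDPOINT, timeout=5)
        healthy = resp.status_code == 200
    except requests.RequestException:
        healthy = False
    if not healthy:
        # Slack incoming webhooks accept a simple JSON payload with a "text" field.
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f":rotating_light: Health check failed for {HEALTH_ENDPOINT}"},
            timeout=5,
        )

if __name__ == "__main__":
    check_and_alert()
```

A script like this run on a short interval gives the team an immediate signal the moment a dependency goes unhealthy, which is the behavior described above.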
With the problem identified, we increased the memory allocated to the payment tracking database and service was quickly restored.
We take reliability & stability very seriously at Universe. While our already robust infrastructure allowed us to recover from this outage without dropping any transactions, we’re committed to taking that resilience to the next level. We’re taking several immediate steps to ensure this type of outage is impossible in the future: We’re migrating to an autoscaling database system which would prevent this type of error, and we’re implementing a full alarm & notification review to ensure that we’re alerted proactively and consistently when resources are under stress.