Search and Reporting Outage
Incident Report for Universe
Postmortem

We fixed a bug that was affecting search in Discover. Some data fields were indexed for full-text search and were throwing off results, so we redefined the index to exclude them. We shipped this change and kicked off a reindex job to apply the new configuration. We had recently migrated this service to Kubernetes, and we use Kubernetes Jobs to run these types of background tasks.

Shortly after the job started, we noticed problems with the Kubernetes cluster that affected the Universe reporting and search APIs, as well as the Kubernetes cluster UI itself.

After some investigation, we discovered a problem with running these tasks as Kubernetes Jobs. Although we had set RestartPolicy=Never, that setting applies to the containers in a Pod, not to the Job itself: when a Pod fails, the Job controller creates a replacement Pod, so the work can run more than once. See: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#handling-pod-and-container-failures
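To illustrate (this is a sketch, not our actual manifest; the job name and image are hypothetical), the retry behavior is controlled on the Job spec itself rather than by restartPolicy:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: discover-reindex          # hypothetical name
spec:
  backoffLimit: 0                 # do not create replacement Pods after a failure
  activeDeadlineSeconds: 3600     # hard cap on the Job's total runtime
  template:
    spec:
      restartPolicy: Never        # only prevents the kubelet restarting containers in place
      containers:
        - name: reindex
          image: example/reindex:latest   # hypothetical image
```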

The Job would consume all available resources on a Node, and that Node would become unhealthy and get killed. Kubernetes would then reschedule the greedy Job onto another Node, which suffered the same fate. This continued until every Node was either under repair or, as soon as it came back online, quickly killed again. The cycle ended only when we removed the offending Job from the scheduler, after which all Nodes auto-healed and returned to a fully healthy state.
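Separately from the retry behavior, resource requests and limits on the Job's container would keep a single runaway Pod from starving a Node. A sketch of the container section that would slot into a Pod template like the one above (the values are illustrative):

```yaml
containers:
  - name: reindex
    image: example/reindex:latest   # hypothetical image
    resources:
      requests:
        cpu: "1"        # what the scheduler reserves on a Node for this Pod
        memory: 2Gi
      limits:
        cpu: "2"        # CPU is throttled above this
        memory: 4Gi     # the container is OOM-killed if it exceeds this
```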

Posted Jul 21, 2017 - 15:19 EDT

Resolved
This incident has been resolved.
Posted Jul 21, 2017 - 14:55 EDT
Investigating
We are investigating reports of search and reporting unavailability.
Posted Jul 21, 2017 - 14:37 EDT