We experienced a partial outage on the Hosted Metrics us-west2 cluster due to issues on our provider’s cloud platform. A few hours after the cluster recovered, a similar but less severe event occurred.
A large number of pods were unschedulable, causing slow query performance and reduced instance availability for a number of customers on the cluster.
2018-12-05
16:43: We started seeing alerts for hm-us-west2 and began discussion in the Slack ops channel
16:44: We identified that four nodes in the cluster were unavailable. This caused four nodes' worth of pods to attempt to reschedule elsewhere. The cluster did not have enough excess capacity to absorb them, leaving many pods stuck in the Pending state
16:46: We started seeing API server failures on the cluster
16:52: The nodes reported Ready again, and pending pods began scheduling onto the nodes that came back up
16:53: Both the hm-us-west2 and us-west1 clusters reported 100% API failures, both in us-west1b
16:58: We saw periodic API failures; some pods were crashlooping and/or getting OOM-killed
17:02: Some tsdb-gw pods were crashlooping because their memory limits were too low to work through the backlog of metrics (an edge case)
17:08: All crashlooping tsdb-gw pods had their memory limits increased and came back online
17:50: All pods were ready and all alerts cleared
18:00: The incident was marked as resolved
20:34: We saw issues with the cluster again: workloads were not showing in the portal, the portal reported API failures, and kubectl commands were failing
20:35: We did not see any nodes go down this time, but there were over a hundred unschedulable pods, so any node outage must have been very brief
20:45: The API server began responding, and we added a node to increase our pod capacity
20:46: Kafka and Cassandra were recovering slowly, which slightly impacted performance
22:34: Everything recovered, and the incident was marked as fully resolved
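The 17:02–17:08 fix amounted to raising the memory limit on the tsdb-gw workload so the pods could work through the metrics backlog without being OOM-killed. A minimal sketch of the relevant deployment fragment, with illustrative names and values (the actual limits are not part of this report):

```yaml
# Hypothetical fragment of the tsdb-gw Deployment spec; values are illustrative.
spec:
  template:
    spec:
      containers:
        - name: tsdb-gw
          resources:
            requests:
              memory: 4Gi   # what the scheduler reserves on a node
            limits:
              memory: 8Gi   # raised so the backlog can be processed without an OOM kill
```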
Four nodes in the Kubernetes cluster became unavailable at the same time, causing all of the pods running on those nodes to be rescheduled elsewhere. The cluster did not have enough spare capacity to absorb the displaced pods. We also saw issues with the Kubernetes API server, possibly because it was overloaded while rescheduling the pods from the unavailable nodes.
We added a node to the cluster. The nodes that went down came back online and were able to schedule pods again. We worked with the provider to understand why the nodes went down and how it can be prevented in the future.
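The capacity gap behind the Pending pods can be sketched with simple arithmetic: when nodes drop out, their pods can only reschedule into the free slots left on the surviving nodes. A hedged sketch with made-up numbers (the report does not state actual cluster sizes):

```shell
#!/bin/sh
# Hypothetical numbers -- the real cluster dimensions are not in this report.
NODES=20            # nodes in the cluster
PODS_PER_NODE=30    # pods running per node before the outage
MAX_PODS=35         # pod capacity per node
LOST=4              # nodes that became unavailable

displaced=$(( LOST * PODS_PER_NODE ))                     # pods needing rescheduling
spare=$(( (NODES - LOST) * (MAX_PODS - PODS_PER_NODE) ))  # free slots on surviving nodes
stuck=$(( displaced - spare ))                            # pods left Pending

echo "displaced=$displaced spare=$spare stuck_pending=$stuck"
```

Keeping spare capacity at or above the pod count of the largest expected simultaneous node loss (N+k headroom) is the general remediation that adding a node moves toward.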