Outage in us-central1

Incident Report for Grafana Cloud

Resolved

All systems are operating normally.

Posted Nov 07, 2019 - 20:47 UTC

Update

Both the Prometheus and Graphite platforms are operating normally again.
Post-mortem to follow.

Posted Nov 07, 2019 - 19:37 UTC

Update

We are continuing to monitor for any further issues.

Posted Nov 07, 2019 - 19:10 UTC

Monitoring

A fix has been implemented and things are returning to normal. We're monitoring actively.

Posted Nov 07, 2019 - 19:06 UTC

Update

We're continuing to investigate and have reduced the number of failing requests by draining nodes and rolling back to older versions of the software.

Posted Nov 07, 2019 - 18:43 UTC

Update

We are continuing to investigate this issue.

Posted Nov 07, 2019 - 17:45 UTC

Update

We believe our issues may be caused by network issues affecting ~3 machines; we have begun mitigation steps.

Posted Nov 07, 2019 - 17:32 UTC

Update

We're seeing issues with our Kubernetes cluster in us-central1 affecting Prometheus reads and writes.

Posted Nov 07, 2019 - 17:16 UTC

Update

We are continuing to investigate this issue.

Posted Nov 07, 2019 - 17:14 UTC

Investigating

We are currently investigating this issue.

Posted Nov 07, 2019 - 17:13 UTC

This incident affected: Grafana.com, Grafana Cloud: Hosted Grafana (GCP US Central - prod-us-central-0), Grafana Cloud: Graphite (GCP US Central - prod-us-central-0: Querying, GCP US Central - prod-us-central-0: Ingestion), and Grafana Cloud: Prometheus (GCP US Central - prod-us-central-0: Querying, GCP US Central - prod-us-central-0: Ingestion).