Resolved -
From Oct 31 17:40 UTC to Nov 3 14:50 UTC:
Due to internal authentication issues, the components evaluating Loki-managed rules failed to push the results of recording and alerting rule evaluations to the metrics endpoint for some tenants.
Nov 3, 15:04 UTC
Resolved -
This incident has been resolved.
Nov 3, 14:13 UTC
Monitoring -
A fix has been implemented and we can see systems recovering everywhere. We're monitoring the situation.
Nov 3, 12:28 UTC
Investigating -
We became aware of an issue with services in prod-us-central-7. Affected users may encounter errors when sending data to Grafana Cloud.
Resolved -
From approximately 16:30-8:15 UTC, a configuration change inadvertently removed a required headless service for hosted traces in one of our production regions. This caused elevated error rates and increased service-level objective (SLO) burn for the trace ingestion pathway. The underlying issue was a mismatch in internal configuration references following a prior migration. Re-enabling the headless service restored normal operation.
Oct 31, 15:30 UTC
Resolved -
The issue affecting Tempo (prod-us-east-0 & prod-us-west-0) and Loki (prod-us-central-5) has been fully resolved.
Metrics generation is now operating normally across all regions, and we continue to monitor for stability.
Oct 31, 15:04 UTC
Monitoring -
We’ve identified that Loki clusters in prod-us-central-5 were also affected by the same underlying issue impacting Tempo in prod-us-east-0.
Querying and ingestion remained fully operational throughout. The issue has been mitigated and services are stabilizing.
Oct 31, 15:04 UTC
Identified -
We’re investigating an issue in Tempo prod-us-east-0 and prod-us-west-0 that began around 19:00 UTC on Oct 30, causing a total outage for metrics generation.
This affects Tempo metrics generation, but Tempo query and ingestion remain fully operational.
Engineering is actively engaged and assessing the issue. We will provide updates accordingly.
Oct 31, 14:31 UTC
Resolved -
This incident has been resolved.
Oct 31, 13:49 UTC
Update -
We are continuing to work on a fix for this issue.
Oct 31, 10:06 UTC
Identified -
Parallel queriers were down from ~8:05 UTC to 9:05 UTC. Alerts and recording rules may have failed to evaluate during this timeframe.
Oct 31, 10:05 UTC
Resolved -
We consider this incident resolved. Regarding the cause: a slow physical partition of the backend database used by the control plane of a critical component caused increased latency and occasional overloading, leading to failures in the write path. Once writes switched to a different partition, latency dropped and the error rate went down.
Oct 30, 13:36 UTC
Monitoring -
The elevated latency has not recurred since the last update, but we're still monitoring the situation. The write path is now operational for the eu-west-3 region.
Oct 30, 12:42 UTC
Update -
We're still actively working to mitigate the root cause; symptoms continue to recur intermittently.
Oct 30, 11:58 UTC
Identified -
The issue has been identified and a fix is being implemented.
Oct 30, 09:51 UTC
Investigating -
Since 9:15 UTC, Hosted Logs tenants in the eu-west-3 region have been experiencing elevated latency in the write path. Our team is working to identify and resolve the issue.
Oct 30, 09:30 UTC
Resolved -
This incident has been resolved.
Oct 29, 20:13 UTC
Monitoring -
We identified an issue affecting a limited number of Tempo users in this week's release. When targetInfo is enabled and traces contain resource.job and/or resource.instance attributes, some metric series may not have been processed correctly due to label duplication.
The issue does not impact all environments or tenants. A fix is being rolled out to affected environments, and no action is required from users.
Below is a list of all impacted regions:
prod-ap-south-1, prod-ap-southeast-0, prod-au-southeast-0, prod-eu-west-2, prod-eu-west-3, prod-eu-north-0, prod-gb-south-0, prod-gb-south-1, prod-us-east-1
Oct 29, 16:25 UTC
Resolved -
This incident has been resolved.
Oct 24, 17:15 UTC
Monitoring -
A fix has been implemented and we are monitoring the results.
Oct 24, 15:02 UTC
Identified -
The issue has been identified and we are working on a fix.
We will provide updates here as more information becomes available.
Oct 24, 14:39 UTC
Update -
We are continuing to investigate this issue.
Oct 24, 13:13 UTC
Investigating -
We are experiencing issues with Private Datasource queries failing in prod-us-central-3 and prod-us-central-4 regions. We are actively investigating this matter.
Oct 24, 13:10 UTC
Resolved -
This incident has been resolved.
Oct 23, 22:36 UTC
Monitoring -
A fix has been implemented and we are monitoring the results.
Oct 23, 21:30 UTC
Identified -
The issue has been identified and a fix is being implemented.
Oct 23, 21:00 UTC
Investigating -
Some pages and signups may be failing due to an issue connecting to our billing service provider. We are actively working on a fix.
Oct 23, 20:59 UTC
Resolved -
This incident has been resolved.
Oct 22, 12:48 UTC
Monitoring -
A fix has been implemented and we are monitoring the results.
Oct 22, 12:06 UTC
Update -
We are continuing to investigate this issue.
Oct 22, 11:44 UTC
Investigating -
We are experiencing issues with Private Datasource API connectivity in some regions. The API hosts are not reachable. We are actively investigating this matter.
Oct 22, 11:24 UTC
Resolved -
This incident has been resolved.
Oct 21, 16:42 UTC
Monitoring -
A fix has been implemented and we are monitoring the results. With this fix in place, customers should no longer see these errors.
Oct 21, 16:10 UTC
Investigating -
We are currently investigating an issue that is causing several alerts indicating various errors. We will provide more information as it becomes available.
Oct 21, 15:32 UTC