Resolved -
This incident has been resolved.
Dec 12, 14:38 UTC
Identified -
TraceQL queries containing "= nil" in Explore Traces and parts of Drilldown Traces are failing with 400 Bad Request errors. The issue has been identified, and a fix is currently being rolled out.
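For reference, a hypothetical example of an affected query (the attribute name is illustrative, not taken from the incident): a TraceQL filter such as { span.http.status_code = nil }, which matches spans where the attribute is unset, would have returned a 400 Bad Request during this window.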
Dec 12, 14:17 UTC
Resolved -
We continue to observe a sustained period of recovery. At this time, we consider this issue resolved. No further updates.
Dec 11, 23:17 UTC
Update -
Synthetic Monitoring has now also recovered. Customers should no longer experience alert rules failing to evaluate.
We continue to monitor for recurrence and will provide updates accordingly.
Dec 11, 23:02 UTC
Monitoring -
Engineering has released a fix, and as of 22:25 UTC customers should no longer experience ingestion issues. We will continue to monitor for recurrence and provide updates accordingly.
Dec 11, 22:33 UTC
Update -
While investigating this issue, we also became aware that Synthetic Monitoring is affected. Some customers may have alert rules failing to evaluate.
Dec 11, 22:23 UTC
Investigating -
As of 21:30 UTC, we are experiencing a partial ingestion outage in Grafana Mimir. This is affecting the write path, where some ingestion requests are failing or timing out.
Our engineering team is actively investigating and working to identify the root cause.
Dec 11, 21:43 UTC
Completed -
The scheduled maintenance has been completed.
Dec 11, 11:20 UTC
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Dec 11, 11:00 UTC
Scheduled -
Alert instances for Synthetic Monitoring provisioned alert rules that are firing during the maintenance might resolve and fire again once within a ~1m time window.
Dec 11, 10:55 UTC
Resolved -
We continue to observe a sustained period of recovery. At this time, we consider this issue resolved. No further updates.
Dec 10, 23:05 UTC
Monitoring -
We are observing an improving trend after implementing a fix. We will continue to monitor and provide updates accordingly.
During our investigation, we also became aware that some alerts associated with Synthetic Monitoring checks have been failing to evaluate correctly.
Dec 10, 22:22 UTC
Identified -
Our engineering team has identified a potential root cause, and a fix is being implemented.
Dec 10, 21:38 UTC
Update -
Our engineering team has engaged our Cloud Service Provider and is working with them to continue investigating this issue.
Dec 10, 20:48 UTC
Update -
We are continuing to investigate this issue.
Dec 10, 19:49 UTC
Update -
We have identified that trace ingestion may also be affected. Some customers may experience elevated latency and intermittent errors when sending traces. Investigation is ongoing.
Dec 10, 19:49 UTC
Investigating -
We have detected an issue causing some customers to experience failed metric pushes as well as increased latency when sending metrics. The issue was first observed at 18:30 UTC.
Our engineering team is actively investigating the root cause and working to restore normal operation as quickly as possible. We will provide further updates as more information becomes available.
Thank you for your patience while we work to resolve this issue.
Dec 10, 19:29 UTC
Resolved -
Users experienced failed log pushes as well as increased latency when sending logs to the Loki service hosted on the prod-eu-west-0 cluster between 18:30 UTC and ~23:00 UTC.
Our engineering team engaged our Cloud Service Provider, and a fix was implemented that mitigated the issue.
Dec 10, 19:30 UTC
Resolved -
The incident has been resolved.
Dec 10, 09:06 UTC
Monitoring -
The read path was restored at 08:23 UTC, and queries are fully functioning again. The read path outage lasted from 08:04 to 08:23 UTC.
Dec 10, 08:30 UTC
Investigating -
At 08:04 UTC we detected a read path outage (queries) on cortex-prod-13. We are currently investigating this issue.
The ingestion path (writes) is not affected.
Dec 10, 08:23 UTC
Resolved -
The issue has been resolved.
Dec 9, 13:10 UTC
Monitoring -
The query service is operational again, and log reads should be available on the cluster. Our engineers are monitoring the health of the service to ensure full recovery.
Dec 9, 13:06 UTC
Identified -
Since approximately 12:30 UTC today, December 9th, we have been experiencing problems on the Loki read path of the eu-west-2 cluster. Customers on this cluster may have difficulty querying logs, and alerts and other services based on these logs may also be affected. Our engineers are actively working to restore the service.
Dec 9, 12:59 UTC
Resolved -
This incident has been resolved.
Dec 5, 09:44 UTC
Investigating -
We are currently experiencing disruptions to Hosted Grafana services due to a widespread Cloudflare outage impacting connectivity across multiple regions. Our team is actively monitoring the situation and will provide updates as Cloudflare works to restore normal operation.
Dec 5, 09:25 UTC
Completed -
The scheduled maintenance has been completed.
Dec 3, 12:00 UTC
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Dec 3, 11:00 UTC
Scheduled -
ProbeFailedExecutionsTooHigh alert rule instances that are firing during the maintenance might resolve and fire again once in a ~1m time window.
Dec 3, 08:35 UTC
Resolved -
The Loki prod-ap-northeast-0-loki-prod-030 cell experienced write degradation between 08:11 and 08:58 UTC. The engineering team mitigated the situation and the cell is now stable.
Dec 1, 08:00 UTC