Resolved -
This incident has been resolved.
Jan 23, 18:44 UTC
Monitoring -
We are observing significant improvement, and the affected services are stabilizing as expected. Our engineering teams will continue to monitor progress.
Jan 23, 16:55 UTC
Investigating -
We are currently investigating an issue impacting Email delivery for some Services, including Alert Notifications.
Jan 23, 15:37 UTC
Resolved -
The incident is resolved. We are in contact with customers affected by this change.
Jan 22, 22:29 UTC
Identified -
During the secrets migration in https://status.grafana.com/incidents/47d1q4sphrmj, secrets proxy URLs for some customers were updated in the following regions: prod-us-central-0, prod-us-east-0, and prod-eu-west-2. This was an unexpected breaking change affecting a subset of customers.
This will specifically affect customers who are using secrets on private probes behind a firewall.
We are investigating. If your private probes are impacted, we ask you to update firewall rules for the secrets proxy to allow outbound connections to the updated hosts:
Note that this URL change affects only a small subset of customers; the majority of customers will not need to update firewall rules. For affected customers, private probes will show an error similar to the following in probe logs: Error during test execution: failed to get secret: Get "https://gsm-proxy-prod-us-east-2.grafana.net/api/v1/secrets/.../decrypt": Forbidden undefined
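For customers verifying a firewall change, the following is a minimal, unofficial sketch that checks outbound HTTPS connectivity from a probe host to the secrets proxy. It assumes gsm-proxy-prod-us-east-2.grafana.net (the host from the example error above); point it at the updated host for your region.
```python
# Minimal sketch (assumption: gsm-proxy-prod-us-east-2.grafana.net, taken from
# the example error above) to verify that the probe host can open an outbound
# TLS connection to the secrets proxy after the firewall rules are updated.
import socket
import ssl

HOST = "gsm-proxy-prod-us-east-2.grafana.net"  # replace with your region's secrets proxy host
PORT = 443

context = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=5) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
        # A successful handshake means the firewall allows the outbound connection.
        print(f"Connected to {HOST}:{PORT} via {tls_sock.version()}")
```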
Jan 21, 21:16 UTC
Completed -
The scheduled maintenance has been completed.
Jan 21, 16:00 UTC
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 21, 13:00 UTC
Scheduled -
We will perform planned maintenance to synthetic monitoring secrets on Wednesday January 21st, from 13:00 to 16:00 UTC, in the following regions: prod-us-central-0, prod-us-east-0, and prod-eu-west-2.
During maintenance, synthetic monitoring checks that use secrets will continue to run normally, but secrets will be in a read-only state. Attempts to create, modify, or delete secrets during the maintenance window will return an error until the maintenance is complete.
This maintenance is required to ensure the reliability of the secrets management system as we prepare for general availability of the feature. We will provide updates here as the maintenance progresses.
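As an illustration only, a client could defer secret create/modify/delete calls until the read-only window has passed. The sketch below is not official guidance; the year 2026 is an assumption inferred from the stated weekday.
```python
# Illustrative sketch only: defer secret create/modify/delete calls during the
# read-only maintenance window (13:00-16:00 UTC on January 21; the year 2026 is
# an assumption inferred from the stated weekday).
from datetime import datetime, timezone

MAINTENANCE_START = datetime(2026, 1, 21, 13, 0, tzinfo=timezone.utc)
MAINTENANCE_END = datetime(2026, 1, 21, 16, 0, tzinfo=timezone.utc)

def secrets_read_only(now=None):
    """Return True while secrets are expected to be read-only."""
    now = now or datetime.now(timezone.utc)
    return MAINTENANCE_START <= now < MAINTENANCE_END

if secrets_read_only():
    print("Secrets are read-only; deferring create/modify/delete operations.")
else:
    print("Outside the maintenance window; secret writes should work normally.")
```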
Jan 15, 19:25 UTC
Resolved -
We consider this incident resolved since latency has not been elevated since the fix was applied. The issue was caused by a latency spike in a downstream dependency, which created increased backpressure on the Hosted Traces ingestion path, degraded gateway performance, and resulted in elevated write latency. After the affected gateway services were cleared, the degraded state subsided and normal operation was restored.
Jan 21, 15:21 UTC
Monitoring -
The issue was identified and a fix was applied. After the fix, latency returned to regular, expected levels. We are currently monitoring the component's health before resolving the incident.
Jan 21, 13:35 UTC
Investigating -
We're currently investigating elevated write latency in the Hosted Traces prod-us-central-0 region, which has been experiencing sustained high write latency since 07:20 UTC. Only a small subset of requests is impacted.
Jan 21, 13:24 UTC
Resolved -
Impact: Between 14:30 and 14:38 UTC, some customers in prod-eu-west-2 may have experienced issues querying metrics. During this time, read requests to the metrics backend were unavailable, resulting in failed or incomplete query responses. The root cause of the issue was identified and addressed.
Resolution: The affected components were restored, and service was fully available by 14:38 UTC. We have taken additional steps to prevent this type of disruption from occurring in the future.
Next Steps: We are reviewing monitoring and safeguards around this failure mode to further improve reliability.
Jan 19, 14:30 UTC
Resolved -
This incident has been resolved.
Jan 19, 01:21 UTC
Monitoring -
The issue has not been observed for a reasonable amount of time and did not recur when it was expected to. We're still closely monitoring system behaviour and will update this incident accordingly.
Jan 18, 09:24 UTC
Investigating -
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
Jan 18, 02:20 UTC
Identified -
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
Jan 18, 02:14 UTC
Update -
We are continuing to monitor for any further issues.
Jan 17, 20:16 UTC
Monitoring -
The impact has been mitigated at this time and we are currently monitoring.
Jan 17, 20:00 UTC
Update -
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
Jan 17, 18:08 UTC
Update -
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
Jan 17, 16:47 UTC
Investigating -
We are currently investigating an issue causing degraded write performance across multiple products in the AWS us-east-2 region. Our engineering team is actively working to determine the full scope and impact of the issue and restore normal service levels.
Jan 17, 11:28 UTC
Resolved -
This incident has been resolved.
Jan 17, 09:04 UTC
Investigating -
We are currently investigating an issue causing degraded write performance across multiple products in the AWS us-east-2 region. Our engineering team is actively working to determine the full scope and impact of the issue and restore normal service levels.
Jan 17, 07:51 UTC
Both read and write 5xx errors and increased latency were experienced in two periods: 23:56:15 to 00:32:45 UTC and 00:55:30 to 01:36:15 UTC.
Jan 16, 04:00 UTC
Monitoring -
Customers should no longer experience issues.
We will continue to monitor and provide updates.
Jan 16, 02:27 UTC
Update -
We are continuing to investigate this issue.
Jan 16, 01:35 UTC
Investigating -
Users may experience intermittent 5xx errors when writing metrics, though retries may eventually succeed, which can lead to delayed or missing data.
We continue to investigate and will update when we have more to share.
Jan 16, 01:33 UTC
Monitoring -
As of 00:28 UTC, we have observed improvement with the partial write outage. Customers should no longer experience issues with metrics ingestion.
We will continue to monitor and provide updates.
Jan 16, 00:59 UTC
Investigating -
As of 23:57 UTC, our engineers became aware of an issue with prod-us-west-0 resulting in a partial write outage. Users may experience intermittent 5xx errors when writing metrics, though retries may eventually succeed, which can lead to delayed or missing data.
We continue to investigate and will update when we have more to share.
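For context on the retry behaviour mentioned above, here is a generic, hedged sketch of client-side retries with exponential backoff for transient 5xx responses on a write endpoint; the URL, payload, and headers are placeholders, not Grafana Cloud specifics.
```python
# Generic sketch of client-side retry with exponential backoff for transient
# 5xx responses on a metrics write endpoint. The URL, payload, and headers are
# placeholders, not Grafana Cloud specifics.
import time
import requests

def push_with_retry(url, payload, headers=None, max_attempts=5):
    backoff = 0.5  # seconds
    for attempt in range(1, max_attempts + 1):
        response = requests.post(url, data=payload, headers=headers, timeout=10)
        if response.status_code < 500:
            return response  # success, or a client error that should not be retried
        if attempt == max_attempts:
            response.raise_for_status()  # surface the final 5xx after exhausting retries
        time.sleep(backoff)
        backoff = min(backoff * 2, 30)  # double the wait, capped at 30 seconds
```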
Jan 16, 00:28 UTC
Completed -
The scheduled maintenance has been completed.
Jan 15, 12:00 UTC
Verifying -
Verification is currently underway for the maintenance items.
Jan 15, 11:20 UTC
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 15, 11:00 UTC
Scheduled -
Alert instances of the Synthetic Monitoring ProbeFailedExecutionsTooHigh provisioned alert rule that are firing during the maintenance might resolve and then fire again once, within approximately one minute.
Jan 15, 09:37 UTC
Resolved -
The scope of this incident was smaller than originally anticipated.
As of 16:27 UTC our engineering team merged a fix for those affected, and we are considering this resolved.
Jan 14, 20:17 UTC
Investigating -
We're experiencing an issue with connectivity loss for Azure PrivateLink endpoints in all available Azure regions. The issue affects users trying to ingest Alloy data or use PDC over Azure PrivateLink. Our team is actively investigating to determine the root cause.
Jan 14, 14:30 UTC
Completed -
The scheduled maintenance has been completed.
Jan 14, 09:00 UTC
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 14, 07:00 UTC
Scheduled -
During the maintenance window, we will perform planned minor version upgrades on Grafana databases. Users may experience brief service interruptions lasting up to one minute. During this period, Grafana instances may become inaccessible. Other services are unaffected.
Jan 7, 10:02 UTC
Resolved -
We have observed a continued period of recovery. At this time, we are considering this issue resolved. No further updates.
Jan 12, 18:21 UTC
Monitoring -
Engineering has released a fix and as of 17:01 UTC, customers should no longer experience connectivity issues. We will continue to monitor for recurrence and provide updates accordingly.
Jan 12, 17:01 UTC
Identified -
Engineering has identified the issue and will be deploying a fix shortly. At this time, users will continue to experience disruptions for queries routed via PDC.
We will continue to provide updates as more information is shared.
Jan 12, 16:50 UTC
Investigating -
We are investigating an issue in prod-eu-west-3 where PDC agents are failing to maintain/re-establish connectivity. Some agents are struggling to reconnect, which may cause disruptions or degraded performance for customer queries routed over PDC. We’ll share updates as we learn more.
Jan 12, 15:44 UTC
Resolved -
Engineering has released a fix and we continue to observe a period of recovery. As of 15:12 UTC we are considering this resolved.
Jan 12, 15:26 UTC
Update -
There was a full degradation of the write service between 09:13 and 09:35 UTC. The cell is operational but there is still degradation in the write path. Our Engineering team is actively working on this.
Jan 12, 11:41 UTC
Update -
We are continuing to investigate this issue.
Jan 12, 09:09 UTC
Investigating -
We have been alerted to an issue with Tempo write degradation in prod-eu-west-3 - tempo-prod-08. The cell is operational but there is degradation in the write path. Write requests are taking longer than normal. This started at 07:00 UTC. Our Engineering team is actively investigating this.
Jan 12, 09:03 UTC
Resolved -
Between 20:23 UTC and 20:53 UTC, Grafana Cloud Logs in prod-us-east-3 experienced a write degradation, which may have resulted in delayed or failed log ingestion for some customers.
The issue has been fully resolved, and the cell is currently operating normally. We are continuing to investigate the root cause and will provide additional details if relevant.
Jan 9, 20:30 UTC