Resolved -
This incident has been resolved.
Jan 28, 20:25 UTC
Monitoring -
A fix has been implemented, and we are monitoring the results.
Jan 28, 18:24 UTC
Investigating -
We are currently investigating an issue impacting dashboards for users in the prod-us-central-3 region. This is preventing impacted dashboards from loading as expected.
This is also impacting a very small subset of users in the prod-us-central-0 region.
We will provide more details regarding the scope as they become available.
Jan 28, 17:27 UTC
Resolved -
We continue to observe sustained recovery. At this time, we are considering this issue resolved. No further updates.
Jan 28, 00:22 UTC
Monitoring -
As of 22:55 UTC, we have observed marked improvement with the incident impacting IRM and OnCall. We are still investigating and will continue to monitor and provide updates.
Jan 27, 22:56 UTC
Investigating -
We are currently investigating an issue impacting some customers when accessing Grafana OnCall and IRM. Impacted customers may experience long load times or even timeouts when attempting to access these components. We'll provide more information as it becomes available.
Jan 27, 20:37 UTC
Resolved -
We experienced an increased write error rate for logs in prod-us-west-0 from 6:55 to 7:15 UTC. We have since observed continued stability and are marking this as resolved.
Jan 27, 07:49 UTC
Resolved -
Engineering has released a fix and as of 00:13 UTC, customers should no longer experience issues upgrading from Free to Pro subscriptions. At this time, we are considering this issue resolved. No further updates.
Jan 27, 00:13 UTC
Identified -
Engineering has identified the issue and is currently exploring remediation options. At this time, users remain unable to upgrade from Free to Pro subscriptions.
We will continue to provide updates as more information is shared.
Jan 26, 21:52 UTC
Investigating -
At 20:05 UTC, our engineering team became aware of an issue related to subscription plan upgrades. Users experiencing this issue will not be able to upgrade from a Free plan to a Pro subscription.
Engineering is actively engaged and assessing the issue. We will provide updates accordingly.
Jan 26, 20:53 UTC
Resolved -
This incident has been resolved.
Jan 23, 18:44 UTC
Monitoring -
We are noticing significant improvement, and things are stabilizing as expected. Our engineering teams will continue to monitor progress.
Jan 23, 16:55 UTC
Investigating -
We are currently investigating an issue impacting email delivery for some services, including alert notifications.
Jan 23, 15:37 UTC
Resolved -
The incident is resolved. We are in contact with customers affected by this change.
Jan 22, 22:29 UTC
Identified -
During the secrets migration in https://status.grafana.com/incidents/47d1q4sphrmj, secrets proxy URLs for some customers were updated in the following regions: prod-us-central-0, prod-us-east-0, and prod-eu-west-2. This was an unexpected breaking change affecting a subset of customers.
This will specifically affect customers who are using secrets on private probes behind a firewall.
We are investigating. If your private probes are impacted, we ask you to update firewall rules for the secrets proxy to allow outbound connections to the updated hosts:
Note that this URL change affects only a small subset of customers; the majority of customers will not need to update firewall rules. For affected customers, private probes will show an error like the following in probe logs:
Error during test execution: failed to get secret: Get "https://gsm-proxy-prod-us-east-2.grafana.net/api/v1/secrets/.../decrypt": Forbidden undefined
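To check whether a private probe is affected, one option is to scan its logs for the error quoted above. The following is a minimal sketch only: the function name find_blocked_secret_hosts and the regular expression are illustrative assumptions derived from that single example line, and real probe log formats may differ.

```python
import re

# Pattern derived from the single example error in this notice (an assumption;
# actual probe log lines may be formatted differently).
SECRET_ERROR = re.compile(
    r'failed to get secret: Get "https://(?P<host>[^/"]+)/[^"]*": Forbidden'
)


def find_blocked_secret_hosts(log_lines):
    """Return the sorted, de-duplicated hosts named in Forbidden secret errors."""
    hosts = set()
    for line in log_lines:
        match = SECRET_ERROR.search(line)
        if match:
            hosts.add(match.group("host"))
    return sorted(hosts)


if __name__ == "__main__":
    sample = [
        'Error during test execution: failed to get secret: Get '
        '"https://gsm-proxy-prod-us-east-2.grafana.net/api/v1/secrets/.../decrypt": '
        'Forbidden undefined',
        "check completed successfully",
    ]
    for host in find_blocked_secret_hosts(sample):
        print(host)
```

Any host this prints is a candidate for an outbound allow rule on the firewall in front of the probe; confirm it against the officially published host list before changing rules.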
Jan 21, 21:16 UTC
Completed -
The scheduled maintenance has been completed.
Jan 21, 16:00 UTC
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 21, 13:00 UTC
Scheduled -
We will perform planned maintenance on synthetic monitoring secrets on Wednesday, January 21st, from 13:00 to 16:00 UTC, in the following regions: prod-us-central-0, prod-us-east-0, and prod-eu-west-2.
During maintenance, synthetic monitoring checks using secrets will continue to run normally, but the secrets will be in a read-only state. Attempts to create, modify, or delete secrets during maintenance will return an error until the maintenance is complete.
This maintenance is required to ensure the reliability of the secrets management system as we prepare for general availability of the feature. We will provide updates here as the maintenance progresses.
Jan 15, 19:25 UTC
Resolved -
We consider this incident resolved, as latency has not been elevated since the fix was applied. The issue was caused by a latency spike in a downstream dependency, which increased backpressure on the Hosted Traces ingestion path, degraded gateway performance, and resulted in elevated write latency. After we cleared the affected gateway services, the degraded state went away and normal operation was restored.
Jan 21, 15:21 UTC
Monitoring -
The issue was identified and a fix was applied. After applying the fix, latency went down to a regular and expected value. We're currently monitoring the component's health before resolving the incident.
Jan 21, 13:35 UTC
Investigating -
We're currently investigating an issue with elevated write latency in the Hosted Traces prod-us-central-0 region. It has been experiencing sustained high write latency since 7:20 AM UTC. Only a small subset of requests is impacted.
Jan 21, 13:24 UTC
Resolved -
Impact: Between 14:30 and 14:38 UTC, some customers in prod-eu-west-2 may have experienced issues querying metrics. During this time, read requests to the metrics backend were unavailable, resulting in failed or incomplete query responses. The root cause of the issue was identified and addressed.
Resolution: The affected components were restored, and service was fully available by 14:38 UTC. We have taken additional steps to prevent this type of disruption from occurring in the future.
Next Steps: We are reviewing monitoring and safeguards around this failure mode to further improve reliability.
Jan 19, 14:30 UTC
Resolved -
This incident has been resolved.
Jan 19, 01:21 UTC
Monitoring -
The issue hasn't been seen for a reasonable amount of time and did not recur when it was next expected to. We're still closely monitoring system behaviour and will update this incident accordingly.
Jan 18, 09:24 UTC
Investigating -
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
Jan 18, 02:20 UTC
Identified -
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
Jan 18, 02:14 UTC
Update -
We are continuing to monitor for any further issues.
Jan 17, 20:16 UTC
Monitoring -
The impact has been mitigated, and we are currently monitoring.
Jan 17, 20:00 UTC
Update -
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
Jan 17, 18:08 UTC
Update -
We are continuing to investigate this issue. It is impacting all components using the write path in cortex-prod-13 and mimir-prod-56. We do not yet have a root cause but have found that this issue seems to occur every 4 hours.
Jan 17, 16:47 UTC
Investigating -
We are currently investigating an issue causing degraded write performance across multiple products in the AWS us-east-2 region. Our engineering team is actively working to determine the full scope and impact of the issue and restore normal service levels.
Jan 17, 11:28 UTC
Resolved -
This incident has been resolved.
Jan 17, 09:04 UTC
Investigating -
We are currently investigating an issue causing degraded write performance across multiple products in the AWS us-east-2 region. Our engineering team is actively working to determine the full scope and impact of the issue and restore normal service levels.
Jan 17, 07:51 UTC