Wondering why your CloudWatch alarm triggered without any breaching data points? We can help you with this!
As a part of our AWS Support Services, we often receive similar requests from our AWS customers.
Today, let’s see the steps our Support Techs follow to help customers resolve this CloudWatch alarm trigger issue.
CloudWatch alarm trigger without any breaching data points
Amazon CloudWatch is a monitoring and observability service from AWS. CloudWatch alarms that measure time-aggregated metrics perform this measurement continuously in a rolling window.
CloudWatch alarms evaluate metrics based on the data points available at a specific moment. As new values continue to flow into the CloudWatch metric, each successive alarm evaluation might use different aggregated data points. Values that arrive after an evaluation can change the aggregate for that period, so the data point that triggered the alarm may no longer appear as breaching when we look at the metric afterwards.
By reviewing the event history later, we can see the complete set of data points that have since flowed into the metric.
Detect breaching data point
To detect a breaching data point in the CloudWatch alarm metric’s graph, we have to change the graph’s Statistic to Maximum or Minimum.
Here is an example alarm configuration:
- Standard resolution alarm
- Metric: CPUUtilization
- Threshold: 60%
- Statistic: Average
- Period: 120 seconds
- Evaluation Period: 1
- Detailed Monitoring: enabled for the monitored Amazon EC2 instance.
The metric received the following values by the time the alarm evaluated the example period 11:00:00 – 11:02:00 IST:
Sample-1: 11:00:05 IST, numeric value: 80.96470588235294
Sample-2: 11:00:16 IST, numeric value: 16.929612366666664
Sample-3: 11:00:27 IST, numeric value: 53.57142857142857
Sample-4: 11:01:38 IST, numeric value: 94.89033212334336
The average of the above values is about 61.59, which breaches the threshold of 60%. So the alarm changes to the ALARM state. The alarm’s event history lists the aggregated value exceeding the threshold as the reason for the state change.
When the alarm is evaluated again later, additional values have flowed in for the period 11:00:00 – 11:02:00 IST.
For example:
Sample-1: 11:00:05 IST, numeric value: 80.96470588235294
Sample-2: 11:00:16 IST, numeric value: 16.929612366666664
Sample-3: 11:00:27 IST, numeric value: 53.57142857142857
Sample-4: 11:01:38 IST, numeric value: 94.89033212334336
Sample-5: 11:01:45 IST, numeric value: 15.18181818181819
Sample-6: 11:00:51 IST, numeric value: 10.26490
The new average is about 45.3, which does not breach the threshold of 60%. So the alarm changes back to the OK state. The alarm’s event history lists the aggregated value being below the threshold as the reason for the state change.
So now we may not see the breaching data point in our CloudWatch metric’s graph. The Average is listed as 45.3 in the CPUUtilization metric’s graph.
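The arithmetic behind the two evaluations can be reproduced with a short Python sketch. It uses the sample values from the example above; the `alarm_state` helper is a hypothetical, simplified stand-in for a single-period alarm with a "greater than threshold" comparison, not CloudWatch's actual evaluation code.

```python
from statistics import mean

THRESHOLD = 60.0  # percent, from the example alarm configuration

# Values available when the alarm first evaluated the 11:00:00 to 11:02:00 period
first_evaluation = [
    80.96470588235294,
    16.929612366666664,
    53.57142857142857,
    94.89033212334336,
]

# Values that flowed into the metric afterwards for the same period
late_arrivals = [15.18181818181819, 10.26490]


def alarm_state(values, threshold=THRESHOLD):
    """Hypothetical helper: 'ALARM' if the Average exceeds the threshold, else 'OK'."""
    return "ALARM" if mean(values) > threshold else "OK"


second_evaluation = first_evaluation + late_arrivals
first_avg = mean(first_evaluation)
second_avg = mean(second_evaluation)

print(f"first evaluation:  average={first_avg:.2f} -> {alarm_state(first_evaluation)}")
print(f"second evaluation: average={second_avg:.2f} -> {alarm_state(second_evaluation)}")
```

The same period produces two different aggregated values depending on when it is evaluated, which is why the graph viewed later shows no breach.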
We can see the breaching value 94.89 in the period starting at 11:00:00 IST if we change the CloudWatch metric graph’s Statistic to Maximum.
Similarly, if the alarm is configured to trigger when data falls below the threshold, we need to change the CloudWatch metric graph’s Statistic to Minimum.
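To illustrate why switching the Statistic helps, the following Python sketch computes Average, Maximum, and Minimum over the six example samples. It is only an illustration of the aggregation on the values listed above and assumes a "greater than" comparison against the 60% threshold.

```python
THRESHOLD = 60.0  # percent, from the example alarm configuration

# All six samples eventually recorded for the example period
samples = [
    80.96470588235294,
    16.929612366666664,
    53.57142857142857,
    94.89033212334336,
    15.18181818181819,
    10.26490,
]

# The same raw values, aggregated with three different statistics
stats = {
    "Average": sum(samples) / len(samples),
    "Maximum": max(samples),
    "Minimum": min(samples),
}

for name, value in stats.items():
    verdict = "above" if value > THRESHOLD else "not above"
    print(f"{name}: {value:.2f} is {verdict} the {THRESHOLD:.0f}% threshold")
```

Only the Maximum exposes the high sample once the late values have pulled the Average back under the threshold.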
Configure an “M out of N” alarm
To prevent a single breaching data point from changing the alarm to the ALARM state, we can configure an “M out of N” alarm, where the Evaluation Periods (N) and the Datapoints to Alarm (M) have different values.
This makes the alarm evaluate a larger number of aggregated data points, and the state of the alarm changes only if at least a certain number of data points (M) are breaching in a given set of data points (N).
Here is an example of this alarm configuration:
- Standard resolution alarm
- Metric: CPUUtilization
- Threshold: 60%
- Statistic: Average
- Period: 120 seconds
- Datapoints to Alarm: 2 out of 3
- Detailed Monitoring: enabled for the monitored Amazon EC2 instance
This alarm configuration is the same as the previous one except for the evaluation settings: the alarm changes to the ALARM state only when at least 2 of the 3 most recent data points breach the threshold.
The metric received the following values by the time the alarm evaluated the example period starting at 11:00:00 IST:
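As a rough sketch, the configuration above could be expressed as parameters for the boto3 `put_metric_alarm` call. The alarm name, the instance ID, and the `GreaterThanThreshold` comparison are assumptions made for illustration; the remaining values come from the example configuration.

```python
def build_alarm_params(instance_id):
    """Build keyword arguments for CloudWatch put_metric_alarm (2 out of 3 alarm)."""
    return {
        "AlarmName": "cpu-above-60-2-of-3",  # hypothetical alarm name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 120,                # seconds
        "EvaluationPeriods": 3,       # N: data points considered
        "DatapointsToAlarm": 2,       # M: breaching data points required
        "Threshold": 60.0,
        "ComparisonOperator": "GreaterThanThreshold",  # assumed comparison
    }


params = build_alarm_params("i-0123456789abcdef0")  # placeholder instance ID

# Creating the alarm needs boto3 and valid AWS credentials, so it is left commented out:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes it easy to review the M and N values before anything is sent to AWS.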
Sample-1: 11:00:05 IST, numeric value: 80.96470588235294
Sample-2: 11:00:16 IST, numeric value: 16.929612366666664
Sample-3: 11:00:27 IST, numeric value: 53.57142857142857
Sample-4: 11:01:38 IST, numeric value: 94.89033212334336
Because the alarm now evaluates three data points, CloudWatch also considers the aggregated data points from before 11:00:00 IST:
10:58:00 IST, Average=41.874304539920
10:59:00 IST, Average=5.230773650991253
11:00:00 IST, Average=64.93403361344538
Here the aggregated data point at 11:00:00 IST breaches the threshold. But the alarm remains in the OK state and doesn’t change to the ALARM state. This happens because only one out of three data points breaches the threshold, whereas two out of three are required to trigger the alarm.
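The "2 out of 3" decision can be sketched in a few lines of Python using the three aggregated values listed above. The `m_out_of_n_state` function is a hypothetical, simplified model: it assumes a "greater than" comparison and ignores how CloudWatch treats missing data.

```python
THRESHOLD = 60.0          # percent
DATAPOINTS_TO_ALARM = 2   # M
EVALUATION_PERIODS = 3    # N

# Aggregated Average data points from the example, oldest first
aggregated = [41.874304539920, 5.230773650991253, 64.93403361344538]


def m_out_of_n_state(datapoints, threshold, m, n):
    """Hypothetical helper: 'ALARM' if at least m of the last n data points breach."""
    recent = datapoints[-n:]
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= m else "OK"


state = m_out_of_n_state(aggregated, THRESHOLD, DATAPOINTS_TO_ALARM, EVALUATION_PERIODS)
print(state)

# For comparison, a "1 out of 1" alarm on the same data reacts to the latest data point alone
single_state = m_out_of_n_state(aggregated, THRESHOLD, 1, 1)
print(single_state)
```

With the same data, the "1 out of 1" alarm reacts to the single breaching data point while the "2 out of 3" alarm stays in the OK state.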
Conclusion
To conclude, today we discussed the steps our Support Engineers follow to help customers resolve the issue ‘CloudWatch alarm trigger without any breaching data points’.