Wondering how to set up automatic recovery of an EC2 instance using CloudWatch? We can help you!
Here at Bobcares, we often get similar requests from our customers as a part of our Server Management Services.
Today, let’s see how our Support Engineers do this setup for customers with EC2 instances.
How to set up automatic recovery of an EC2 instance using CloudWatch
When an instance fails a system status check, we can use CloudWatch alarm actions to automatically recover it.
However, the recover option works only for system check failures, not for instance status check failures. In addition, if we terminate our instance, we will not be able to recover it with CloudWatch.
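The distinction between the two check types can be sketched in a few lines of Python. This is only an illustration: the function name classify_status and its return labels are hypothetical, and the input is assumed to have the shape of one entry from the InstanceStatuses list that the EC2 DescribeInstanceStatus API returns.

```python
def classify_status(status: dict) -> str:
    """Decide whether the CloudWatch recover action is relevant.

    `status` is assumed to look like one element of `InstanceStatuses`
    from EC2 DescribeInstanceStatus (an assumption of this sketch).
    The return labels are hypothetical names used only here.
    """
    system = status.get("SystemStatus", {}).get("Status", "unknown")
    instance = status.get("InstanceStatus", {}).get("Status", "unknown")

    if system == "impaired":
        # System status check failure: the recover action applies.
        return "recover"
    if instance == "impaired":
        # Instance status check failure: recover does not apply.
        return "not-recoverable-by-alarm"
    return "no-action"


# Sample inputs written by hand for illustration, not real API output.
host_problem = {
    "SystemStatus": {"Status": "impaired"},
    "InstanceStatus": {"Status": "ok"},
}
guest_problem = {
    "SystemStatus": {"Status": "ok"},
    "InstanceStatus": {"Status": "impaired"},
}

print(classify_status(host_problem))
print(classify_status(guest_problem))
```

Only the first sample is a case where the recover alarm action would help. The second needs a fix inside the instance itself.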
The following are some causes of system status check failures:
1. Loss of network connectivity.
2. Loss of system power.
3. Software issues on the physical host.
4. Hardware issues on the physical host that impact network connectivity.
Now we will see the steps that our Support Engineers follow to set up recovery with CloudWatch.
Steps to set up automatic recovery using CloudWatch
The following steps set up automatic recovery for an EC2 instance:
1. Firstly, we have to open the Amazon EC2 console.
2. Then go to Instances and select the instance we wish to set up recovery for.
3. After that, go to Actions and choose Monitor and troubleshoot.
4. From there, go to Manage CloudWatch alarms.
5. Then click on Create an alarm.
To create an alarm, we need AWS Identity and Access Management (IAM) permissions to stop and start the associated instance.
6. Then for Alarm notification, we can select an existing Amazon Simple Notification Service (Amazon SNS) topic.
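7. After that, toggle on Alarm action and select Recover.
8. Next, for Group samples by and Type of data to sample, we must give a statistic and metric appropriate for our use case. Since recovery responds only to system status check failures, the metric to sample is the system status check failure metric (StatusCheckFailed_System in CloudWatch).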
9. And for Consecutive period and Period, we must specify the evaluation period for the alarm.
Furthermore, we can modify the automatically created Alarm name (this is optional).
10. Finally, we can click on Create.
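For those who prefer to script the console steps above, here is a minimal sketch of the equivalent alarm parameters in Python. It only builds the argument dictionary for the CloudWatch PutMetricAlarm call. The helper name build_recover_alarm, the alarm name format, the instance ID, and the default period and evaluation values are all assumptions for illustration, so adjust them to the use case.

```python
from typing import Optional


def build_recover_alarm(
    instance_id: str,
    region: str,
    period_seconds: int = 60,
    evaluation_periods: int = 2,
    sns_topic_arn: Optional[str] = None,
) -> dict:
    """Build keyword arguments for CloudWatch PutMetricAlarm.

    Hypothetical helper: the defaults are illustrative choices,
    not values taken from the console walkthrough.
    """
    # The recover action is addressed through this automate ARN.
    actions = [f"arn:aws:automate:{region}:ec2:recover"]
    if sns_topic_arn:
        # Optional notification, matching the Alarm notification step.
        actions.append(sns_topic_arn)

    return {
        "AlarmName": f"recover-{instance_id}",  # assumed naming scheme
        "Namespace": "AWS/EC2",
        # Recovery responds to system status check failures only.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }


# Placeholder instance ID and region, for illustration only.
params = build_recover_alarm("i-0123456789abcdef0", "us-east-1")
print(params["AlarmActions"])

# With boto3 installed and credentials configured, the alarm could
# then be created like this (not executed in this sketch):
#   import boto3
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary lets us review them before anything is sent to AWS.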
Troubleshoot instance recovery failures
The following issues can cause automatic recovery of the instance to fail:
1. Firstly, it can be due to temporarily insufficient capacity of replacement hardware.
2. The instance has attached instance store storage, which is unsupported for automatic instance recovery.
3. Furthermore, there might be an ongoing Service Health Dashboard event preventing the recovery process from successfully executing.
The automatic recovery process attempts to recover our instance for up to three separate failures per day.
If the instance system status check failure persists, we can manually stop and start the instance.
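The manual stop and start can also be scripted. The sketch below assumes a client object that exposes the same method names as the boto3 EC2 client (stop_instances, start_instances, get_waiter). The RecordingClient class is a hypothetical stand-in that only records calls, so the example runs without AWS access, and the instance ID is a placeholder.

```python
class RecordingClient:
    """Hypothetical stand-in for a boto3 EC2 client; it only records calls."""

    def __init__(self):
        self.calls = []

    def stop_instances(self, InstanceIds):
        self.calls.append(("stop_instances", tuple(InstanceIds)))

    def start_instances(self, InstanceIds):
        self.calls.append(("start_instances", tuple(InstanceIds)))

    def get_waiter(self, name):
        client = self

        class _Waiter:
            def wait(self, InstanceIds):
                client.calls.append(("wait:" + name, tuple(InstanceIds)))

        return _Waiter()


def stop_and_start(ec2_client, instance_id):
    """Stop the instance, wait until it is stopped, then start it again."""
    ids = [instance_id]
    ec2_client.stop_instances(InstanceIds=ids)
    # Wait for the stop to finish before starting the instance again.
    ec2_client.get_waiter("instance_stopped").wait(InstanceIds=ids)
    ec2_client.start_instances(InstanceIds=ids)
    ec2_client.get_waiter("instance_running").wait(InstanceIds=ids)


# Dry run against the stand-in client; no AWS call is made.
demo = RecordingClient()
stop_and_start(demo, "i-0123456789abcdef0")
for name, _ in demo.calls:
    print(name)

# With boto3 installed and credentials configured, a real client
# could be passed instead (not executed in this sketch):
#   import boto3
#   stop_and_start(boto3.client("ec2"), "i-0123456789abcdef0")
```

Keep in mind that stopping an instance discards any data on its instance store volumes, so we should back up that data first.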