Fix long VMware Fault Tolerance failover time with proper vSAN setup, NIC settings, and VM activity checks.
Understanding VMware Fault Tolerance Failover Time: Common Issues and Fixes
If you’re struggling with VMware Fault Tolerance failover time, you’re not alone. Many administrators run into unexpected delays during failovers, especially when vSAN and Fault Tolerance are combined without a proper fault domain configuration. This post covers common reasons behind longer failover times and unplanned failovers, and how to fix them.
Default Fault Domain Configuration Can Affect Failover
In my case, the cause turned out to be the default settings in the vSAN wizard, which created two fault domains and placed one node into each, leaving the third node without a fault domain. This setup does work with both vSAN and Fault Tolerance, but it causes the failover to take about half a minute.
After I created a third fault domain for the leftover node, Fault Tolerance worked as expected and virtually all the problems that came with the long failover time disappeared. If you encounter a similar problem, review your fault domain setup first.
This small configuration mistake directly impacted the VMware Fault Tolerance failover time, and fixing it reduced the failover delay significantly.
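The check described above can be sketched in a few lines of Python. This is only an illustration: the host names, fault domain names, and the `hosts_without_fault_domain` helper are hypothetical, and the mapping is typed in by hand rather than read from vCenter.

```python
from typing import Dict, List, Optional


def hosts_without_fault_domain(assignments: Dict[str, Optional[str]]) -> List[str]:
    """Return the hosts that have no explicit fault domain, sorted by name."""
    return sorted(host for host, domain in assignments.items() if not domain)


# Hypothetical three-node cluster mirroring the default layout described above.
cluster = {
    "esxi-01": "fd-1",
    "esxi-02": "fd-2",
    "esxi-03": None,  # left without a fault domain by the wizard defaults
}

unassigned = hosts_without_fault_domain(cluster)
if unassigned:
    print("Hosts missing a fault domain:", ", ".join(unassigned))
else:
    print("Every host belongs to a fault domain.")
```

Any host that shows up in the result is a candidate for its own fault domain, as in the fix described above.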
Causes of Unplanned Failovers Without Host Crash
A Primary or Secondary VM can fail over even though its ESXi host has not crashed. In such cases, virtual machine execution is not interrupted, but redundancy is temporarily lost. To avoid this type of failover, be aware of the scenarios below and take appropriate measures:
Partial Hardware Failure Related to Storage
This problem can arise when access to storage is slow or down for one of the hosts. When this occurs, there are many storage errors listed in the VMkernel log.
Fix: Address your storage-related problems (see https://bobcares.com/blog/vmware-storage-drs-configuration/).
Partial Hardware Failure Related to Network
If the logging NIC is not functioning or connections to other hosts through that NIC are down, this can trigger a fault tolerant virtual machine to be failed over so that redundancy can be reestablished.
Fix: Dedicate a separate NIC each to vMotion and to FT logging traffic, and perform vMotion only when the VMs are less active, especially on systems affected by hyperthreading vulnerabilities.
Insufficient Bandwidth on the Logging NIC Network
This usually happens because of too many fault tolerant VMs on one host.
Fix:
- Distribute FT VM pairs across multiple hosts
- Use a 10-Gbit logging network for FT
- Verify that the network is low latency
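To judge whether a host carries too many FT VMs for its logging link, a rough estimate can help. The sketch below uses a rule of thumb from older VMware guidance for legacy FT, which is an assumption here and not something this article measured: logging bandwidth is roughly (average disk reads in MB/s x 8 + average network input in Mbit/s) x 1.2 for 20% headroom. The VM names and traffic figures are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FtVm:
    name: str
    disk_read_mbytes_per_s: float  # average disk reads, in MB/s
    net_input_mbits_per_s: float   # average network input, in Mbit/s


def estimate_logging_mbps(vm: FtVm, headroom: float = 1.2) -> float:
    """Rule-of-thumb FT logging bandwidth for one VM, in Mbit/s (assumed formula)."""
    return (vm.disk_read_mbytes_per_s * 8 + vm.net_input_mbits_per_s) * headroom


def link_utilization(vms: Iterable[FtVm], link_mbps: float) -> float:
    """Fraction of the logging link that the estimated FT traffic would use."""
    return sum(estimate_logging_mbps(vm) for vm in vms) / link_mbps


# Hypothetical FT VMs that all log over the same NIC.
vms_on_host = [
    FtVm("app-01", disk_read_mbytes_per_s=20, net_input_mbits_per_s=40),
    FtVm("db-01", disk_read_mbytes_per_s=50, net_input_mbits_per_s=100),
]

for link in (1_000, 10_000):  # 1-Gbit and 10-Gbit logging networks
    print(f"{link} Mbit/s link: {link_utilization(vms_on_host, link):.1%} estimated use")
```

If the estimate approaches the link capacity, that points toward the fixes listed above: spread the FT pairs across more hosts or move FT logging to a 10-Gbit network.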
vMotion Failures Due to High VM Activity
If the vMotion migration of a fault tolerant virtual machine fails, it might need to be failed over. This usually occurs when the VM is too active to migrate smoothly.
Fix: Perform vMotion only when the virtual machines are less active.
Excessive Activity on VMFS Volume
File system locking, VM power ons/offs, or multiple vMotions on a single VMFS volume can trigger a failover. A common symptom is multiple SCSI reservation warnings in the VMkernel log.
Fix:
- Reduce file system operations
- Avoid placing FT-enabled VMs on busy VMFS volumes
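A quick way to see whether SCSI reservation warnings are piling up is to count matching lines in a copy of the VMkernel log. The sketch below is a minimal example under stated assumptions: the regular expression is a guess at how such warnings are worded, and the sample lines are invented placeholders, not real VMkernel output. Adjust the pattern to the messages you actually see.

```python
import re
from typing import Iterable

# Assumed wording: any line that mentions SCSI together with "reservation",
# or a "reservation conflict". Real log messages may be phrased differently.
RESERVATION_PATTERN = re.compile(
    r"scsi.*reservation|reservation.*conflict", re.IGNORECASE
)


def count_reservation_warnings(lines: Iterable[str]) -> int:
    """Count the log lines that match the assumed reservation pattern."""
    return sum(1 for line in lines if RESERVATION_PATTERN.search(line))


# Invented placeholder lines, used only to exercise the function.
sample_lines = [
    "placeholder: SCSI reservation conflict on device A",
    "placeholder: unrelated message",
    "placeholder: scsi reservation warning on device B",
]

print("Reservation-related lines:", count_reservation_warnings(sample_lines))
```

To run it against a real log, open the file and pass the file object to `count_reservation_warnings`, since iterating over a file yields its lines.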
Lack of File System Space Prevents Secondary VM Startup
Check whether your / (root) or /vmfs/datasource file systems have available space. These file systems can fill up for many reasons, and if they’re full, you won’t be able to start a new Secondary VM.
Fix: Free up space on the required file systems.
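The free-space check can be scripted with Python's standard `shutil.disk_usage`. This is a small sketch rather than a definitive tool: the `free_space_report` helper and the 10% warning threshold are assumptions chosen for illustration, and a path that does not exist on the machine running the script is reported as unavailable.

```python
import shutil
from typing import Dict, Iterable, Optional


def free_space_report(paths: Iterable[str]) -> Dict[str, Optional[float]]:
    """Map each path to its free-space fraction, or None if it cannot be read."""
    report: Dict[str, Optional[float]] = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            report[path] = None  # path missing or not accessible
            continue
        report[path] = usage.free / usage.total if usage.total else 0.0
    return report


MIN_FREE_FRACTION = 0.10  # assumed warning threshold, tune to your environment

for path, free in free_space_report(["/", "/vmfs/datasource"]).items():
    if free is None:
        print(f"{path}: not available on this system")
    elif free < MIN_FREE_FRACTION:
        print(f"{path}: only {free:.1%} free, clean up before starting a Secondary VM")
    else:
        print(f"{path}: {free:.1%} free")
```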
Conclusion
By following these steps and optimizing your setup, you can significantly reduce VMware Fault Tolerance failover time. Ensuring proper network, storage, and VM activity management can help you maintain a high-availability environment without unwanted delays.
