Burning in your new server: How server hardware load testing helps us improve data center infrastructure reliability
The data was conclusive. The servers orion-47, orion-50 and orion-52 needed RAM upgrades. Over the past month, their RAM usage had mostly stayed above 85% and was trending upward. Swap usage had grown by more than 20%, which resulted in higher I/O wait and, in turn, a slight tendency toward high load.
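A check like the one described above can be approximated on a single host with a few lines of shell. This is only a minimal sketch, not the tooling we used: the helper names `mem_report` and `ram_over_threshold` are hypothetical, and it assumes a Linux kernel recent enough to expose `MemAvailable` in /proc/meminfo. It also looks at one point in time, whereas the review above was based on a month of trend data.

```shell
#!/bin/sh
# Print RAM and swap usage as whole percentages from a meminfo-style file.
# Usage: mem_report [FILE]   (FILE defaults to /proc/meminfo)
mem_report() {
    awk '
        /^MemTotal:/     { mt = $2 }
        /^MemAvailable:/ { ma = $2 }
        /^SwapTotal:/    { st = $2 }
        /^SwapFree:/     { sf = $2 }
        END {
            ram  = (mt > 0) ? int((mt - ma) * 100 / mt) : 0
            swap = (st > 0) ? int((st - sf) * 100 / st) : 0
            printf "ram_used_pct=%d swap_used_pct=%d\n", ram, swap
        }
    ' "${1:-/proc/meminfo}"
}

# Succeed (exit status 0) when RAM usage is above the threshold.
# Usage: ram_over_threshold [THRESHOLD_PCT] [FILE]   (threshold defaults to 85)
ram_over_threshold() {
    pct=$(mem_report "$2" | sed 's/^ram_used_pct=\([0-9]*\) .*/\1/')
    [ "$pct" -gt "${1:-85}" ]
}
```

Calling `mem_report` with no arguments prints the current figures for the local machine, and `ram_over_threshold 85 && echo "RAM above 85%"` could be run from cron to feed an alerting system.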
These servers were part of a load balancing cluster that served a SaaS application in a data center managed by our Dedicated Linux Systems Administrators. The occasion was our weekly review of alert trends and of the corrective actions needed to prevent performance degradation. Regular analysis of alert trends allows us to predict future resource bottlenecks and prevent service deterioration.
The decision was taken to upgrade the RAM over the weekend, with 3 days allocated for reliability testing of the new hardware. Reliability testing, also sometimes referred to as torture testing, stress testing or load testing, allows us to understand the limits of the system in its new configuration, and thereby helps in capacity planning.
Since all the servers ran Linux, we used the “stress” command to simulate high load on each server. An example usage is:
stress --cpu 300 --vm 3 --vm-bytes 31G --io 4 --timeout 4d --hdd 4 --verbose
Run over a period of 3 days, a successful test leaves the system stable and responsive. Test failures usually show up as system freezes or segmentation faults. The causes of failure range from defective or incompatible hardware to the deliberate probing of performance limits. After the testing period, the system is truly broken in, with the failure limits recorded in our asset log. This allows us to predict when the server might be due for another upgrade, or for retirement.
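The worker counts and memory sizes passed to stress depend on the machine being tested. The following is a rough sketch of how such a command line could be derived from the core count and installed RAM. The helper name `build_stress_cmd` and the 90% memory ratio are assumptions made for illustration, not values from our procedure. Note that `--vm-bytes` applies to each memory worker, so the sketch divides the target amount by the number of workers.

```shell
#!/bin/sh
# Print a stress command line scaled to the given core count and RAM size.
# Arguments: CORES  MEM_TOTAL_KB  DURATION (for example 3d)
build_stress_cmd() {
    cores=$1
    mem_kb=$2
    duration=$3
    vm_workers=3    # same number of memory workers as the example above
    # Assumed ratio: the memory workers together touch about 90% of RAM.
    per_worker_mb=$(( mem_kb * 90 / 100 / 1024 / vm_workers ))
    echo "stress --cpu $cores --vm $vm_workers --vm-bytes ${per_worker_mb}M" \
         "--io 4 --hdd 4 --timeout $duration --verbose"
}

# Possible use on the host under test (left commented out, because it
# would start a multi-day load test):
#   cores=$(nproc)
#   mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
#   cmd=$(build_stress_cmd "$cores" "$mem_kb" 3d)
#   if $cmd; then echo "burn-in passed"; else echo "burn-in FAILED"; fi
```

stress exits with a non-zero status when one of its workers fails, so the exit status can be logged alongside the command that was run. A system freeze will not produce an exit status at all, which is one reason the test still needs outside monitoring.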
Servers need to be “broken in” and load tested before they are deployed into a production environment, to make sure they can take the expected load. Here are a couple of pointers:
- Load testing can be conservatively done by creating 3 scenarios: normal load conditions, maximum load conditions and extreme load conditions. Normal load conditions will help you understand if there are hardware defects. Maximum and extreme load conditions will help you detect hardware spec incompatibilities and the limits of system capability.
- Constantly monitor the system for failures. Constant monitoring shows you errors in their context, and it helps you intervene in case the system develops critical issues like overheating.
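The two pointers above can be sketched in shell as well. This is an illustration under stated assumptions rather than a recommended configuration: the helper names `cpu_workers_for` and `max_temp_c` are hypothetical, the worker multipliers for each scenario are guesses that should be tuned per workload, and the temperature check assumes the host exposes sensors under /sys/class/thermal, which not every server does.

```shell
#!/bin/sh
# Print the number of CPU workers to use for a named load scenario.
# Arguments: SCENARIO (normal|maximum|extreme)  CORES
cpu_workers_for() {
    case $1 in
        normal)  w=$(( $2 / 2 )) ;;   # assumed: about half the cores busy
        maximum) w=$(( $2 )) ;;       # assumed: one worker per core
        extreme) w=$(( $2 * 4 )) ;;   # assumed: heavy oversubscription
        *) echo "unknown scenario: $1" >&2; return 1 ;;
    esac
    [ "$w" -ge 1 ] || w=1             # always run at least one worker
    echo "$w"
}

# Print the hottest reading, in whole degrees Celsius, among the given
# sysfs-style temperature files (each holds a value in millidegrees).
max_temp_c() {
    [ "$#" -gt 0 ] || return 1
    cat "$@" 2>/dev/null | awk '$1 > m { m = $1 } END { print int(m / 1000) }'
}

# Possible monitoring loop for a second terminal (left commented out):
#   while sleep 60; do
#       echo "$(date) load=$(cut -d' ' -f1-3 /proc/loadavg)" \
#            "temp=$(max_temp_c /sys/class/thermal/thermal_zone*/temp)C"
#   done
```

For example, `stress --cpu "$(cpu_workers_for extreme "$(nproc)")" --timeout 1h` would run the extreme CPU scenario for an hour, while the loop records load averages and temperatures so that any failure can be read in context.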
Bobcares systems administrators take care of tech support and infrastructure management for data centers and web hosts. Are you looking for ways to improve your service quality?