Stability is a common reason businesses choose cloud servers. Cloud systems are widely perceived to recover from failures automatically and keep data safe. While that is true to a large extent, cloud systems are still susceptible to failure, like any other system.
One common failure point is the network. Cloud systems comprise many sub-systems, such as storage, backups, and compute devices. These sub-systems talk to each other to perform critical functions such as creating new cloud instances and editing resource limits. If there’s a network issue, servers can’t talk to each other, and cloud management functions fail.
Network issues can happen at any time. They range from simple authentication errors to network hardware failures. Bobcares helps data centers and cloud providers trace and fix network issues through our dedicated support services and server administration services. As part of our services, we monitor server infrastructure 24/7, and resolve service issues.
One day, we got a notification that attempts to create new cloud instances were failing in a data center we manage. We tried creating a new cloud instance ourselves, and everything worked until OnApp tried to allocate storage using the “BuildDisk Action”. The error shown was:
Remote Server: 10.0.1.21
Running: Storage API Call: POST 10.0.1.21:8080/is/Datastore/pmjl4tghe52pzq/VDisk "{\"name\":\"dfrnlizerqysvz\",\"size\":\"49152\",\"hostids\":\"3,2,4\"}"
Errno::ECONNREFUSED Connection refused - connect(2)
Fatal: Errno::ECONNREFUSED Connection refused - connect(2)
Executing Rollback...
Remote Server: 10.0.1.21
Running: Storage API Call: GET 10.0.1.21:8080/is/Id nil
Errno::ECONNREFUSED Connection refused - connect(2)
The error indicated that the OnApp management server was unable to communicate with a service called “Storage API” that ran on a backup server. OnApp relies on this Storage API service to allocate storage locations for new cloud instances. New cloud instance creation was failing because OnApp was unable to allocate storage.
A break in communication between two servers can happen for many reasons. In this post, we’ll go through how we resolved this particular issue.
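When a transaction log is long, it can help to extract just the API calls that were refused. The following is a minimal Python sketch, assuming the log lines follow the format shown above. The function name `refused_calls` and the regular expression are illustrative, not part of OnApp.

```python
import re

# Matches lines such as:
#   Running: Storage API Call: POST 10.0.1.21:8080/is/Datastore/... "{...}"
CALL_RE = re.compile(
    r"Storage API Call:\s+(?P<method>[A-Z]+)\s+"
    r"(?P<host>[\d.]+):(?P<port>\d+)(?P<path>/\S*)"
)

def refused_calls(log_text):
    """Return (method, host, port, path) for each API call followed by ECONNREFUSED."""
    refused = []
    last_call = None
    for line in log_text.splitlines():
        match = CALL_RE.search(line)
        if match:
            last_call = (match["method"], match["host"],
                         int(match["port"]), match["path"])
        elif "ECONNREFUSED" in line and last_call is not None:
            refused.append(last_call)
            last_call = None  # count each call once, even if the error is logged twice
    return refused

# Two lines taken from the error output above.
SAMPLE_LOG = r'''Running: Storage API Call: GET 10.0.1.21:8080/is/Id nil
Errno::ECONNREFUSED Connection refused - connect(2)'''

for method, host, port, path in refused_calls(SAMPLE_LOG):
    print(f"{method} {path} refused by {host}:{port}")
```

Every refused call in our log pointed at the same host and port, which narrowed the problem to a single service on a single server.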
Resolving “Fatal: Errno::ECONNREFUSED Connection refused”
For OnApp to communicate with the Storage API, three conditions must hold:
- The OnApp management server should be able to connect to the backup servers via SSH.
- The “Storage API” service should be running on the backup server.
- Connections should be allowed to the backup server’s port 8080.
To troubleshoot the issue, we checked each of these conditions in turn.
Checking OnApp SSH keys
The OnApp management server relies on a set of SSH keys to connect to all servers in the cloud system. If these SSH keys are lost or corrupted in any way, OnApp can’t connect to the servers, and management actions fail. The SSH keys are installed under a user called “onapp” on all these servers, which allows the OnApp management server to execute commands remotely.
To test this, we tried connecting to these servers:
# sudo -u onapp ssh root@10.0.1.21
The connection went through without any problem, so we knew the SSH keys were not the cause.
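The same SSH check can be scripted so that it never hangs on a password prompt. This is a rough sketch, assuming the OpenSSH client is installed and the script runs as the user that owns the keys. The function names are hypothetical. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess

def ssh_check_command(host, user="root", timeout_s=5):
    """Build an ssh command that fails fast instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never prompt; fail if key auth does not work
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable hosts
        f"{user}@{host}",
        "true",                              # trivial remote command; only the exit code matters
    ]

def ssh_key_login_works(host, user="root", timeout_s=5):
    """Return True if key-based login succeeds (ssh exits with status 0)."""
    result = subprocess.run(
        ssh_check_command(host, user, timeout_s),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `ssh_key_login_works("10.0.1.21")` returns `False` for a missing or rejected key as well as for an unreachable host, so a failure here still needs a manual look at the SSH error message.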
Checking if Storage API is running
Next, we tried connecting to port 8080 of the backup server (IP 10.0.1.21). The connection failed, which meant that either the Storage API was not running or firewall rules were blocking the connection.
So, we logged in to the backup server and checked whether the Storage API was indeed running:
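A port check like this is easy to script. Here is a minimal sketch using Python's standard `socket` module; the function name `probe_tcp` and its return labels are our own. It separates a refused connection from a timeout. A refusal usually means the host is reachable but nothing accepted the connection, or a firewall actively rejected it. A timeout more often points to packets being silently dropped.

```python
import socket

def probe_tcp(host, port, timeout_s=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # same condition as Errno::ECONNREFUSED in the OnApp log
    except socket.timeout:
        return "timeout"   # no answer at all within timeout_s
    except OSError:
        return "error"     # e.g. no route to host, name resolution failure

if __name__ == "__main__":
    print(probe_tcp("10.0.1.21", 8080))
```

Running the probe from the management server, rather than from the backup server itself, matters here because a firewall may treat local and remote connections differently.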
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 0 21694 3644/python
The Storage API service, which listens on port 8080, was indeed running. So, it was likely that a firewall rule was blocking connections to port 8080.
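The listener line above also shows which address the service is bound to, and that detail is worth checking. A service bound only to 127.0.0.1 would refuse remote connections even with no firewall in place. The sketch below parses one such line, assuming the netstat-style column layout shown above. Both function names are hypothetical.

```python
def parse_listener(line):
    """Parse one netstat-style LISTEN line into a dict, or return None."""
    fields = line.split()
    if len(fields) < 6 or fields[5] != "LISTEN":
        return None
    address, _, port = fields[3].rpartition(":")   # "0.0.0.0:8080" -> "0.0.0.0", "8080"
    pid, _, program = fields[-1].partition("/")    # "3644/python" -> "3644", "python"
    return {
        "proto": fields[0],
        "address": address,
        "port": int(port),
        "pid": int(pid) if pid.isdigit() else None,
        "program": program or None,
    }

def reachable_from_network(listener):
    """A listener bound only to loopback cannot be reached from another server."""
    return listener["address"] not in ("127.0.0.1", "::1")

listener = parse_listener(
    "tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 0 21694 3644/python"
)
print(listener["port"], listener["program"], reachable_from_network(listener))
```

In our case the service was bound to 0.0.0.0, meaning all interfaces, so the bind address was not the reason remote connections failed.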
Checking for firewall blocks
For situations exactly like this, we keep backups of critical configuration files on each server. We restored the firewall configuration file from the daily backup, and restarted the firewall service and the Storage API service.
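Before restoring a backup, it can be useful to see which lines actually differ, so the cause of the change can be followed up later. This is a small sketch using Python's standard `difflib`. The sample rules are hypothetical and written in iptables-save style purely for illustration; they are not the rules from the affected server.

```python
import difflib

def config_changes(backup_text, current_text):
    """Return lines removed from ('- ') or added to ('+ ') the current config."""
    diff = difflib.ndiff(backup_text.splitlines(), current_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]

# Hypothetical example: the rule allowing port 8080 is missing from the current file.
BACKUP = """-A INPUT -p tcp --dport 22 -j ACCEPT
-A INPUT -p tcp --dport 8080 -j ACCEPT"""
CURRENT = """-A INPUT -p tcp --dport 22 -j ACCEPT"""

for change in config_changes(BACKUP, CURRENT):
    print(change)
```

Lines starting with `- ` exist only in the backup, and lines starting with `+ ` exist only in the current file.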
Now, connections to port 8080 started giving a response.
# nc 10.0.1.21 8080
HTTP/1.1 400 Bad Request
Content-Length: 30
Content-Type: text/plain
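Any HTTP status, even 400 Bad Request, shows that the port is reachable and the service is answering. The 400 only means the test input was not a valid API request. A scripted version of this check might look like the sketch below, which uses Python's standard `http.client`. The function name `http_status` is our own. The `/is/Id` path is taken from the error log above, and what it returns on a healthy system is not covered here.

```python
import http.client

def http_status(host, port, path="/", timeout_s=5.0):
    """Return the HTTP status code for a GET request, or None if the connection fails."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None   # refused, timed out, or not speaking HTTP
    finally:
        conn.close()

if __name__ == "__main__":
    status = http_status("10.0.1.21", 8080, "/is/Id")
    print("no response" if status is None else f"service answered with HTTP {status}")
```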
Next, we tried creating a new cloud instance, and everything worked perfectly fine. 🙂
Network errors can happen for a variety of reasons. Here we’ve covered some common causes in OnApp systems. A network can fail if any of its sub-components fails. To troubleshoot network downtime quickly, it is important to know how network components interact with each other. Bobcares helps data centers and cloud service providers minimize service downtimes through proactive systems audits, 24/7 monitoring, and 24/7 emergency administration.
Bobcares helps data centers, web hosts and other online businesses deliver reliable, secure services through 24/7 technical support and server management.