
Linux administration


How to manage a Linux server? Our Linux admins explain

These days, you can get a server from AWS or Google Cloud within 1 hour.

The trouble is that these servers run Linux by default, which is a tough nut to crack for people who are familiar only with Windows or iOS.

How to fix “554 Too many recipients” email error

Here at Bobcares.com, we provide Outsourced Tech Support to web hosting providers. As part of our service, we resolve email errors posted by hosting users.

A common email bounce error we see on VPS and shared servers is:

The rejected e-mail address was 'user@domain.com'. Subject 'My mail subject', Server Error: 554, Server Response: 554 Too many recipients, Server: 'mx.sender.com', Windows Live Mail Error ID: 0x800CCC79, Protocol: SMTP, Port: 587, Secure(SSL): No

This error appears when someone tries to send a group mail with a large CC or BCC list.
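Since the error is tied to the size of the recipient list, one common workaround is to split a large list into smaller batches and send one message per batch. The snippet below is only a minimal sketch of that idea: the function name `batch_recipients`, the `max_per_message` value, and the example addresses are all hypothetical, and the real per-message limit depends on how the mail server is configured.

```python
from typing import Iterator, List


def batch_recipients(recipients: List[str], max_per_message: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `recipients`, each no longer than `max_per_message`."""
    if max_per_message < 1:
        raise ValueError("max_per_message must be at least 1")
    for start in range(0, len(recipients), max_per_message):
        yield recipients[start:start + max_per_message]


# Hypothetical data: seven addresses and an assumed limit of three per message.
addresses = [f"user{i}@example.com" for i in range(1, 8)]
batches = list(batch_recipients(addresses, max_per_message=3))
print([len(batch) for batch in batches])  # [3, 3, 1]
```

Each batch would then be sent as its own message, so no single message exceeds the limit assumed for the server.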

Fault tolerant service logging – How remote logging was made resilient to crashes

Logs from alpha-p3 are missing!

We were responding to an issue raised by an onsite technician at a data center we managed. System logs from one server were missing from the central log server. It looked like the Rsyslog service used for central logging had crashed on the source server, leading to 2 hours of lost log information.

Logs are critical to day-to-day server management, so the missing logs were treated as an urgent, high-priority issue. We restarted the Rsyslog service on the source server and enabled debugging to identify what had gone wrong. The update logs showed that the Rsyslog package had recently been updated, which pointed to a possible bug. A quick look at the Rsyslog bug tracker on GitHub confirmed that crashes had been reported and that a patch was available. We updated all servers to fix the issue. But that still left a question: what if a future update causes a similar crash? We needed a way to make central logging resilient to such failures.

Cornering an SLA killer – How systematic resolution of an OpenVZ crash protected uptime guarantees


It was a peaceful night shift at a data center we managed. Just a few routine server provisioning tasks and customer queries were keeping us occupied. Suddenly, all the alarm bells started ringing.

25+ managed server instances had gone offline, and the alert priority was among the highest. Each passing minute was eating into our SLA guarantee. An OpenVZ node had gone down with almost no warning at all. The monitors had shown a slight increase in load, but well within normal range.

OK, first order of business: bring the server back online. The OpenVZ kernel booted up, and all instances were back online in less than 15 minutes, but that cut our uptime to 99.96%. We could not afford to let it happen again, so we started digging.
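To see how a 15-minute outage translates into an uptime figure, here is a rough calculation. It assumes uptime is measured over a 30-day window, which the post does not state; the function name `uptime_percent` is ours and purely illustrative.

```python
def uptime_percent(downtime_minutes: float, window_days: int = 30) -> float:
    """Return uptime as a percentage of the measurement window.

    The 30-day default window is an assumption, not something stated in the post.
    """
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return 100.0 * (1 - downtime_minutes / window_minutes)


print(f"{uptime_percent(15):.3f}%")  # 99.965%
```

Under that assumption, 15 minutes of downtime works out to roughly 99.965% uptime, in line with the 99.96% quoted above.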