Server Crash Notification

I have a couple of servers (one at home and one at work) and I need to make sure that they are up all the time. However, sometimes a day or two goes by with no one using a server, so I don’t always know right away when one is down. I need some sort of immediate notification when a server goes down.

I know that various commercial products will do this, but I think it’s better to do it myself: I learn more, I am not dependent on some other company to take care of my situation, and I keep appropriate privacy controls.

In this post I’ll explain how I did this and provide all of the scripts. I’m assuming that all of the computers are running Linux (they are servers, after all).

I’m using a hosted domain as the central computer that the other servers contact (technically it should be called a hosted website). You could also use a VPS or any other server to act as your control server instead. (I am not checking to make sure that my hosted domain is up.) You could also just do this with two servers: ServerA checks to see if ServerB is up and vice versa.


Overview

Servers sending "up" status messages to main monitoring computer.


  1. Each server sends a simple message to the main computer (basically my hosted domain name, quarkphysics.ca) every 15 minutes.
  2. If the main computer does not detect a message after a certain time, it sends a text/SMS message to my cell phone.
  3. A web interface on the hosted domain shows the status of each server and has other features for controlling things.

Only steps 1 and 2 are necessary to get this to work.

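Note that at least one server is behind a firewall: you cannot access it from outside, nor can you ping it. This server is for internal use only, and yet I have to know right away if it goes down. That is why each server reports in to the main computer, rather than the main computer polling the servers.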

Note that none of these scripts has any authentication, so anyone who knows the path and the script name can control the SMS notifications for your servers and add fake entries. So I’ve changed the path names, script names and computer names in this document to prevent people from messing with my setup.

Web Interface

[Screenshot: status page]

In this example, you can see that ServerA has been down for more than 30 minutes (status=error). SMS is disabled for it (light red background behind the buttons), so I will not be getting text messages about it. ServerB is up, so its buttons to enable/disable SMS don’t do anything. ServerC has missed one notification, so it has been down for between 15 and 30 minutes. Since the background under its buttons is light green, I’ll be receiving SMS messages about it every 15 minutes.
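
The three states described above can be sketched roughly as follows. This is a Python illustration, not the PHP behind the page; the labels "ok" and "late" are my own placeholders (only "error" appears in the screenshot), and the exact cut-offs are assumed from the 15-minute interval.

```python
def classify(age_seconds, interval=15 * 60):
    """Map the age of the last ping (in seconds) to a status label."""
    if age_seconds <= interval:
        return "ok"      # pinged within the last interval
    if age_seconds <= 2 * interval:
        return "late"    # one notification missed (15 to 30 minutes)
    return "error"       # silent for more than 30 minutes


def status_rows(last_seen, now):
    """Return (name, minutes_since_ping, status) tuples sorted by name.

    last_seen maps a server name to the Unix timestamp of its last ping.
    """
    rows = []
    for name in sorted(last_seen):
        age = now - last_seen[name]
        rows.append((name, int(age // 60), classify(age)))
    return rows
```

Calling `status_rows({"ServerA": 0, "ServerB": 2900}, now=3000)` would give one "error" row for ServerA and one "ok" row for ServerB.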

Everything is done via filenames and timestamps stored in those files:
[Screenshot: contents of the /somepath/ folder]

None of these are the names of my actual servers, so the Delete button comes in handy here. If you delete a real server, no problem. The entry just gets recreated next time a “ping” arrives.


PHP scripts

Overview

  • each server runs a crontab entry
  • central hosted domain runs the following scripts
    • aliveStamp.php
    • notifyServerDown.php
    • serverList.php
    • disableSMS.php

Part 1a: Server up notification

Each server runs the following command in its crontab:

NOTES:

  • All that is needed is to access ONE webpage
  • All you have to do is decide on a computer name for each server and be consistent in its use and capitalisation
  • I’m not doing this on the 15 minute mark, since the hosted domain is doing its own checks on the 15 minute mark
  • Later on, test everything by commenting out this line in your crontab.
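
As a rough illustration of the single request each server makes, here is a Python sketch (it is not the original crontab command). The URL, the `name` query parameter and the crontab minutes in the comment are all placeholders I made up for the example; the real path and script name are deliberately not published.

```python
# Hypothetical crontab entry, offset from the 0/15/30/45 minute marks:
#   7,22,37,52 * * * * /usr/bin/python3 /home/me/send_alive.py
import urllib.parse
import urllib.request

# Placeholder URL, not the real location of the script.
BASE_URL = "https://example.com/cgi-bin/aliveStamp.php"


def build_ping_url(server_name, base_url=BASE_URL):
    """Return the URL that reports `server_name` as alive."""
    # The parameter name "name" is an assumption for this sketch.
    query = urllib.parse.urlencode({"name": server_name})
    return f"{base_url}?{query}"


def send_ping(server_name, timeout=30.0):
    """Fetch the page once; return True if the server answered with HTTP 200."""
    try:
        url = build_ping_url(server_name)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

A script run from cron would then just call `send_ping("ServerA")`, always using exactly the same spelling of the server name.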

Part 1b: notification script on hosted domain

The hosted domain (or other server) has the following script in its cgi-bin directory:
aliveStamp.php

So, every time this is run, it will create/overwrite a file called AliveStatusServerA.txt and put the current time stamp into it, where “ServerA” is the server name. The contents will be something like 1467647701. (This is a Unix timestamp: the number of seconds since the “epoch”, 1 January 1970 UTC. It can be converted to human readable time in various ways.)
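
Here is a minimal Python sketch of the same idea (not the PHP script itself): write the current Unix time into AliveStatus<name>.txt, and convert such a value back to readable time. The name check is my own addition, since the scripts have no authentication; the function names and the `status_dir` argument are hypothetical.

```python
import re
import time
from datetime import datetime, timezone
from pathlib import Path


def write_alive_stamp(status_dir, server_name, now=None):
    """Create or overwrite AliveStatus<name>.txt with the current Unix time."""
    # Assumption: only allow letters, digits, '_' and '-' so that a request
    # cannot write a file outside status_dir.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", server_name):
        raise ValueError(f"invalid server name: {server_name!r}")
    stamp = int(time.time() if now is None else now)
    path = Path(status_dir) / f"AliveStatus{server_name}.txt"
    path.write_text(str(stamp))
    return path


def epoch_to_text(stamp):
    """Convert a Unix timestamp to a readable UTC string."""
    moment = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")
```

For the example value above, `epoch_to_text(1467647701)` gives `2016-07-04 15:55:01 UTC`.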

Part 2: Sending out a text message

  • Note: Telus and Koodo have a webpage that allows you to send an SMS message from it. All that you have to do is to fill in the fields correctly.
  • I don’t know which other cell phone companies have this. If your provider does not offer it, you can (i) contact them, (ii) add a 3G modem card and a SIM card to one of your computers so that it can send text messages, and use that computer instead of the “hosted domain”, or (iii) change the scripts to send emails instead.

notifyServerDown.php
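
A rough Python sketch of the check step (step 2 in the overview), assuming the stamp files described in Part 1b. It is not the PHP script: the `send_sms` callable stands in for whatever actually delivers the message (provider web form, modem or email), and `sms_disabled` is a hypothetical set of server names, since how the real scripts record the "SMS disabled" state is not shown here.

```python
import time
from pathlib import Path

CHECK_INTERVAL = 15 * 60  # seconds between expected pings


def find_silent_servers(status_dir, now=None, max_age=CHECK_INTERVAL):
    """Return {server_name: age_in_seconds} for servers whose stamp is too old."""
    now = time.time() if now is None else now
    silent = {}
    for path in sorted(Path(status_dir).glob("AliveStatus*.txt")):
        name = path.stem[len("AliveStatus"):]
        try:
            last_seen = int(path.read_text().strip())
        except ValueError:
            continue  # skip stamp files that do not contain a number
        age = now - last_seen
        if age > max_age:
            silent[name] = age
    return silent


def notify_down(status_dir, send_sms, sms_disabled=(), now=None):
    """Call send_sms(message) for each silent server that has SMS enabled."""
    notified = []
    for name, age in find_silent_servers(status_dir, now).items():
        if name in sms_disabled:
            continue
        send_sms(f"{name} has not reported for {int(age // 60)} minutes")
        notified.append(name)
    return notified
```

Run every 15 minutes, this keeps sending a message about a silent server until it reports in again or its name is added to `sms_disabled`.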

Part 3: Server Page – Web Interface

Here’s the script that generates the webpage (see the screenshot above)

serverList.php

Here’s the CSS

(I need to change .btnRefresh to .btnDelete)

status.css

And here’s the disableSMS.php script which is called by serverList.php when you click on a button on the web interface.

disableSMS.php

One problem I encountered:
The location for scripts called by a browser (or curl) is cgi-bin. However, this cgi-bin is not the same as the location for scripts run by the web host’s “scheduled jobs”! This means that you have to know which script goes where in order to get the paths right, otherwise nothing will work. Basically, all of the scripts go into cgi-bin except for notifyServerDown.php, which goes into your “scheduled jobs” location.


UPDATES:

I just found out (thanks Reddit!) that there are two really good packages that you can download that will do this whole thing for you: Nagios and Xymon (but sometimes it’s fun reinventing the wheel).

Instead of running curl on each server, wget is probably a better choice. It is more lightweight and doesn’t require as fast a network response. curl often gives false errors: