Monitoring Site and Service Uptime

Testing that a site is operating correctly, and that its required services are also available, can be challenging. There are various surface metrics you can test, but on their own they are often unreliable and give little depth of information about many important factors.

Surface Level Data

When it comes to web services, you can get a good set of working-or-broken tests running with ease. Testing only surface data tells you, with some certainty, whether everything is up or something is down.

There are loads of online systems that offer free site uptime checks. I've used Pingdom for it before but there are many others. The Jetpack WordPress plugin also has an uptime monitor feature which I have used.

Pinging Hosts

Many hosts are accessible through the internet and will respond when you ask them to. You can ping the domain and assume a response from the host means it's OK. Ping response times and packet loss are decent metrics as well.

This doesn't check that what is sent back for user requests is what you want returned. It only checks that the host is reachable.
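
A minimal sketch of the ping check described above: send a few echo requests and pull the packet-loss figure out of the summary line. The function names are made up for illustration, and the `-c` option and the "% packet loss" wording assume the Linux iputils version of `ping`; other platforms may differ.

```shell
# Extract the packet-loss percentage from ping's summary line on stdin.
# Assumes a summary such as "... 0% packet loss ..." (Linux iputils wording).
packet_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Ping a host a few times and print the packet-loss percentage.
# Prints 100 when ping produces no summary at all (for example, a DNS failure).
host_packet_loss() {
    host=$1
    count=${2:-4}
    loss=$(ping -c "$count" "$host" 2>/dev/null | packet_loss_percent)
    echo "${loss:-100}"
}
```

Calling `host_packet_loss example.com` would print a number you can compare against whatever loss threshold you consider acceptable.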

Making HTTP Requests

When checking that a website is running, you can go a step further than pinging and send an HTTP request to the site. Every HTTP response carries a status code which can be used to determine success or failure.

When the HTTP service returns code 200, it indicates success. The downside of relying on HTTP response codes is that even success codes don't necessarily mean that a site is running properly. Other services might not be working correctly and the site might not be giving the correct output.
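
As a rough illustration of a status-code check, the sketch below asks `curl` for nothing but the response code and sorts it into a coarse category. The function names and the 10-second timeout are assumptions chosen for the example, not a recommendation.

```shell
# Print the HTTP status code for a URL, or 000 if no response was received.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Classify a status code as a rough up/down signal.
classify_status() {
    case $1 in
        2??) echo "ok" ;;
        3??) echo "redirect" ;;
        4??) echo "client error" ;;
        5??) echo "server error" ;;
        *)   echo "no response" ;;
    esac
}
```

Something like `classify_status "$(http_status https://example.com/)"` would then give a one-word answer for a cron job or alert script to act on.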

One way to enhance HTTP testing for site and service uptime is to do additional checks when success codes are returned. Test the response for some known output (for example, look for a certain tag in the page header, perhaps the one that includes a style.css file). If your known output doesn't exist in the response even though a success code was returned, there is a chance a supporting service is down.
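
The known-output idea can be sketched like this, assuming `curl` and `grep` are available. The function names are hypothetical, and the marker is whatever string you expect in a healthy page (`style.css` in the example above).

```shell
# Succeed only if the text on stdin contains the marker as a literal string.
body_has_marker() {
    grep -qF -- "$1"
}

# Fetch a URL and check the body for the marker.
# curl -f turns HTTP errors (4xx/5xx) into a failure exit code, so an error
# page is never searched for the marker.
page_looks_healthy() {
    url=$1
    marker=$2
    body=$(curl -fsS --max-time 10 "$url") || return 1
    printf '%s\n' "$body" | body_has_marker "$marker"
}
```

A call such as `page_looks_healthy https://example.com/ 'style.css' || echo "site check failed"` covers both the status code and the content in one step.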

Deeper System Health Metrics

Surface-level metrics are an easy way to tell "mostly all working" apart from "something is broken somewhere". They often don't give any insight into what is broken or how well the working services are performing.

You can get all kinds of information from the server that runs your sites and services if you are able to open a session to the host.

Shared hosts rarely give shell access. When they do, it's usually severely limited to ensure security between customers.

System Monitor

Even in a limited shell you can probably get information about your own running processes. Most Linux systems provide the `top` command. It's essentially a task manager that shows things like CPU usage, memory usage and so on.

In `top` you should be able to see the total CPU cores, RAM, virtual memory, average system load and some detailed information about the running processes. In limited shells you may only see processes running under your own user account, but on a dedicated server or VM you will probably be able to see all of the processes, which system resources each one is using, and how much.

Realtime system metrics like this can show what is happening right now on a host.
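
For scripted checks, the same load figures that `top` displays can be read directly. The sketch below compares the 1-minute load average with the number of CPU cores. It assumes Linux (`/proc/loadavg`) and GNU coreutils (`nproc`), and the function names are invented for the example.

```shell
# Succeed if the load average exceeds the number of CPU cores.
# Arguments: load average (may be fractional), core count.
load_exceeds_cores() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 > cores + 0) }'
}

# Print a one-line summary of the 1-minute load on this machine.
# Assumes Linux: /proc/loadavg for the load, nproc for the core count.
load_summary() {
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    cores=$(nproc)
    if load_exceeds_cores "$load" "$cores"; then
        echo "load $load is above $cores cores"
    else
        echo "load $load is within $cores cores"
    fi
}
```

If you want the full picture rather than a single number, `top -b -n 1` prints one snapshot in batch mode, which is easier to capture from a script than the interactive view.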

Checking on Important Services

There are a number of ways to check the status of different services.

Upstart Scripts

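Many services provide a way to check their status. Often this comes as a script for your operating system to execute. I've heard them called startup scripts, upstart scripts and init scripts (strictly speaking, Upstart is one particular init system).

Depending on your OS, commands like these can be used to check the status of some services. Service names vary between distributions: `httpd` and `mysqld` are typical of Red Hat-style systems, while `apache2` and `mysql` are typical of Debian-style ones.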

service httpd status
service mysqld status
service memcached status
/etc/init.d/mysql status
/etc/init.d/apache2 status
/etc/init.d/memcached status
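
Status commands like the ones above report through their exit code as well as their printed output, which makes them easy to script. The sketch below loops over a list of service names and translates the exit code using the LSB convention (0 running, 3 not running). Not every init script follows that convention, so treat the mapping as an assumption; the function names are hypothetical.

```shell
# Translate an LSB-style status exit code into a short description.
describe_status_code() {
    case $1 in
        0)   echo "running" ;;
        1|2) echo "dead, stale pid or lock file" ;;
        3)   echo "not running" ;;
        *)   echo "unknown" ;;
    esac
}

# Report the status of each named service, one per line.
report_services() {
    for name in "$@"; do
        code=0
        service "$name" status >/dev/null 2>&1 || code=$?
        echo "$name: $(describe_status_code "$code")"
    done
}
```

Running `report_services httpd mysqld memcached` would print one line per service, using whichever names your distribution gives them.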

Checking Log Files

Most production software has built-in logging facilities. It can push data into different system logs or to its own logging mechanism. Usually logs end up as easily readable text files stored somewhere on a system. Many *nix systems store a lot of the logs in /var/log/ or /home/[username]/logs/.

When it comes to running websites, a very common setup is a LAMP stack. By default these systems usually log requests, some types of queries and PHP errors.

Reading logs can give you all kinds of useful information about services. There are also ways to configure some services to output more verbose data to the logs.
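
As one small example of pulling a number out of a log, the sketch below counts requests that ended in a 5xx status. It assumes the common or combined access log format used by default in Apache, where the status code is the ninth whitespace-separated field; a custom log format would need a different field number. The function name is made up for the example.

```shell
# Count requests on stdin that ended in a 5xx status.
# Assumes the common/combined access log format, where the status code is
# the ninth whitespace-separated field.
count_server_errors() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }'
}
```

You could feed it the tail of an access log, for example `tail -n 1000 /var/log/apache2/access.log | count_server_errors`, where the log path is only a guess at a typical location and will differ between systems.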

External Site, Service & Infrastructure Monitors

There are a number of dedicated server health monitoring suites available. Premium services like New Relic and Datadog run purpose-built reporting agents on your systems, which collect all kinds of deep-level data about your servers and processes and report it back.

Until very recently I was a customer of New Relic for personal sites. I used them mainly for infrastructure monitoring and deep error reporting, and I would highly recommend them if that's what you're looking for. NOTE: New Relic also offer other services I did not use; check them out to see all the features.

Open Source Monitoring Suites

In addition to the premium monitoring services, there are also some fairly strong and very capable contenders in the Open Source market.

Most can run on a host and check metrics for it and many can also check remote hosts as well.

Nagios springs to mind right away. It can do various service checks, pings, resource monitoring and system tests in a very configurable way. Its highly configurable nature makes it extremely powerful.
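
Part of that configurability comes from Nagios checks being ordinary programs that print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below shows what a custom check following that convention might look like. The function names and the thresholds in the usage note are invented for illustration, and this is not taken from any official plugin.

```shell
# Compare an integer value against warning and critical thresholds, print a
# status line, and return the matching plugin exit code.
check_threshold() {
    label=$1
    value=$2
    warn=$3
    crit=$4
    case $value in
        ''|*[!0-9]*) echo "UNKNOWN - $label is not a number"; return 3 ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $label is $value"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $label is $value"; return 1
    fi
    echo "OK - $label is $value"
}

# Percentage of the root filesystem in use, as a bare integer.
root_disk_percent() {
    df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

A disk check built from these could be as short as `check_threshold "disk usage" "$(root_disk_percent)" 80 90`, warning at 80 percent and going critical at 90.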

Munin is another tool I've used to keep track of things like network and disk IO as well as post queue monitoring.

I can recommend both Nagios and Munin as potential monitoring solutions if you want to self-host them.