I was reading through a question about monitoring on r/devops the other day. The author wanted to add uptime monitoring to his service and was looking for tools that might help. However, one of the comments made me realize that there may be some confusion about what uptime monitoring actually does. Systems monitoring and uptime monitoring are very different things.
I'm going to present a very brief overview of the two types of monitoring. While I might mention some tools, I won't go into any of the gory details.
The purpose of systems monitoring is to give your team insight into the status of all of your subsystems: servers, services, and so on. It should give you access to all of the available metrics: not just RAM, network I/O, and other server-based telemetry, but also application- and infrastructure-specific metrics. For example, having a monitor on the number of records imported can be extremely helpful.
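As a rough illustration of an application-specific metric, here is a minimal sketch that reports a "records imported" counter. It assumes a StatsD-style collector listening for UDP on 127.0.0.1:8125 (the usual StatsD default). The metric name records.imported and both function names are made up for this example.

```python
import socket


def format_counter(name: str, value: int) -> str:
    """Build a StatsD counter line such as 'records.imported:250|c'."""
    return f"{name}:{value}|c"


def send_counter(name: str, value: int,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one counter increment over UDP (fire-and-forget).

    The host and port defaults are assumptions; point them at
    wherever your collector actually listens.
    """
    payload = format_counter(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

An import job could call send_counter("records.imported", 250) after each batch. Since UDP is fire-and-forget, a missing collector won't slow the job down, but it also won't tell you that the metric was dropped.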
Systems monitoring also lends itself to making some great-looking dashboards. That may seem a little unnecessary, but if you're staring at them all day, it makes a big difference.
Unlike uptime monitoring, systems monitoring is more white box. It needs access to the servers and services in question, either by pulling from specific endpoints or by placing agents on the nodes being watched.
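To make the "pulling from specific endpoints" option a bit more concrete, here is a small sketch of the service side: an HTTP endpoint that a monitoring server could scrape. It assumes the Prometheus text exposition format and a /metrics path. The METRICS dictionary, the handler, and the port are all hypothetical stand-ins for whatever your app really tracks.

```python
import http.server

# Hypothetical in-process counters an application might maintain.
METRICS = {"records_imported_total": 0}


def render_metrics(metrics: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; anything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the sketch quiet; a real service would log requests.
        pass


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics for a scraper to pull."""
    http.server.HTTPServer(("", port), MetricsHandler).serve_forever()
```

In practice you would more likely use an existing client library or exporter than hand-roll this, but the shape is the same: the monitored node exposes numbers, and the monitoring system comes and gets them.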
Installing this type of monitoring basically means installing a tool such as StatsD, Prometheus, Nagios, or New Relic. You will also need to install agents or exporters on the nodes you want to watch. This level of monitoring can be a bit complex or expensive, but it's completely worth it.
There's a single purpose to uptime monitoring: to ensure that your clients are able to make it to your service. This type of monitoring can catch many classes of errors that systems monitoring cannot, such as DNS resolution errors, network problems, or cloud provider issues. It can also alert you to problems with your app itself, but hopefully you caught those with your systems monitors.
The main requirement for an uptime monitor is that it has to be external to your infrastructure. In other words, it has to be able to access your service as a client would. It does no good to have an uptime monitor that fails along with your infrastructure. The better products in this market will try to access your service from multiple points around the globe.
Implementing an uptime monitor can be as easy as repeatedly running curl against your host from a crontab, but there are much better tools available. I've personally had good success with Pingdom, Runscope, and New Relic.
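For the do-it-yourself end of that spectrum, a single check might look something like the sketch below. It is the rough equivalent of the curl approach, written with Python's standard library. The function name and the ten-second timeout are my own choices, and it only tells you anything useful if it runs from a machine outside your infrastructure.

```python
import time
import urllib.error
import urllib.request
from typing import Tuple


def check_once(url: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Return (is_up, detail) for a single GET request against url."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed_ms = (time.monotonic() - start) * 1000
            return True, f"HTTP {resp.status} in {elapsed_ms:.0f} ms"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, TLS error, etc.
        return False, f"unreachable: {exc}"
```

A script that calls check_once with your service's public URL and exits non-zero on failure can be scheduled from a crontab every few minutes. What this sketch does not give you is alerting, history, or checks from multiple locations, which is where the hosted tools earn their keep.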
Wrapping it up
So, there you go. While both types of monitoring are useful, and in many cases required, they are completely different in what they measure, their tooling, and in how much access to your infrastructure they need. Systems monitoring is an internal monitor that gives you insight into how your infrastructure and app are working, while uptime monitoring lets you know what the client sees.