Zenoss: Tame Report Noise (and Lose Nagios?)
We started our first Zenoss article with the full intention of completely replacing Nagios for host and service monitoring. We are happy to report Zenoss is, in fact, fully up to the task. Here's how the last hurdles, monitoring services and alerting, were implemented.
It can't be all peachy news, can it? No, not quite. Configuring service monitoring in Zenoss is semi-frustrating. It is not difficult to do; in fact, the only reason it's frustrating at all is that you're tempted to implement all kinds of fanciness that Nagios cannot. If you stick to the straight ping test and checks of basic services, like HTTP, Zenoss almost configures itself.
If you click on the Service class, and drill down a bit to where you can search for a specific service (IPservice»Privileged), then search for something like SMTP, you can begin configuring site-wide monitoring. SMTP is monitored by default; for any other service, once you set Monitor to True, Zenoss will automatically start monitoring it on every server that runs it. The great thing about Zenoss is that it knows which services are running on your servers, so you can just enable monitoring, and it works. That is also the drawback, at least in some people's opinion. Say you only care about five SMTP servers, but decided to "turn it on" as described above. You will now get alerts about the SMTP service on all managed servers.
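The effect of a site-wide Monitor flag can be sketched as a toy model. This is not the Zenoss API: the `ServiceInstance` class, the `monitor_override` field, and the server names are hypothetical and exist only to illustrate how one class-level default fans out to every server that runs the service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceInstance:
    """One service found on one device (hypothetical model, not a Zenoss class)."""
    device: str
    service: str
    # None means "inherit the default set on the service class".
    monitor_override: Optional[bool] = None


def is_monitored(instance, class_defaults):
    """A per-instance setting wins; otherwise fall back to the class default."""
    if instance.monitor_override is not None:
        return instance.monitor_override
    return class_defaults.get(instance.service, False)


# Monitor=True on the SMTP service class, and 20 servers that all run SMTP.
class_defaults = {"smtp": True}
instances = [ServiceInstance(f"server{i:02d}", "smtp") for i in range(1, 21)]

monitored = [i.device for i in instances if is_monitored(i, class_defaults)]
print(len(monitored))  # 20: every server is monitored, not only the five you care about
```

In this model the only ways to narrow the set are to override the flag per instance or to leave the default off and opt in per server.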
Something truly annoying we discovered is that if you enable service monitoring for a few servers, but leave the global default as Monitor=False, Zenoss will dutifully delete the service you just manually added. We were wondering what that "lock" functionality did. It turns out that you must "lock" services that are added manually, if you want them to stick around through the next device modeling, which happens every six hours by default.
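The lock behavior described above can be illustrated with a small simulation. It is a simplification built only from what we observed, not Zenoss code: the `Service` class, its fields, and the `remodel` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """A service attached to a device (hypothetical model, not a Zenoss class)."""
    name: str
    manually_added: bool = False
    locked: bool = False


def remodel(services, monitored_by_default):
    """Return the services that survive one modeling pass in this toy model.

    Assumption: a manually added service is dropped when its class default
    is Monitor=False, unless it has been locked.
    """
    kept = []
    for svc in services:
        default_on = monitored_by_default.get(svc.name, False)
        if svc.manually_added and not svc.locked and not default_on:
            continue  # silently removed by the modeling pass
        kept.append(svc)
    return kept


services = [
    Service("smtp", manually_added=True, locked=False),
    Service("imap", manually_added=True, locked=True),
]
after = remodel(services, {"smtp": False, "imap": False})
print([s.name for s in after])  # ['imap']
```

Only the locked service is still there after the pass, which matches what we saw roughly six hours after adding services by hand.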
Your basic services, like SMTP, FTP, and IMAP, are a snap to configure. You can replicate the behavior of Nagios without having to specify how each server gets monitored for these services. You probably won't want to, for reasons that will be discussed in the next section about Alerting.
There are, however, a few teaser ZenPacks available on the Zenoss Web site. A ZenPack is a zip file containing a plug-in, so to speak. One particularly attractive ZenPack, the HttpMonitor, is quite useful. HttpMonitor will allow you to monitor and graph Web site load times and page sizes over time. This ZenPack is undocumented, but there is a good community-written document available in the Zenoss wiki.
At first the HttpMonitor seemed cumbersome. You must add a whole new Device, just as you would with a real server, but name it after the Web site. You'll disable SNMP monitoring, so it's not as much overkill as it sounds. Only the HttpMonitor performance monitors will look at these Web site "devices." We created a new device class called /Services/HTTP, and threw tons of Web sites in there. Each site automatically inherits the /Services/HTTP monitoring properties, which essentially are: "apply the HttpMonitor template and set the failure severity to Critical." The failure severity identifies the level at which an event will be logged.
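The inheritance at work here can be sketched as a lookup that walks up the device class path until it finds a value. This is an illustration rather than the Zenoss implementation: the property names (`snmp_monitoring`, `templates`, `failure_severity`), the root defaults, and the `resolve` function are assumptions.

```python
def resolve(props_by_class, device_class, name):
    """Return the nearest setting of `name`, searching from the class up to the root."""
    path = device_class
    while True:
        if name in props_by_class.get(path, {}):
            return props_by_class[path][name]
        if path in ("", "/"):
            raise KeyError(name)
        # Strip the last path component: "/Services/HTTP" -> "/Services" -> "/".
        path = path.rsplit("/", 1)[0] or "/"


# Hypothetical settings: defaults at the root, overrides on /Services/HTTP.
props_by_class = {
    "/": {"snmp_monitoring": True, "failure_severity": "Warning"},
    "/Services/HTTP": {
        "snmp_monitoring": False,
        "templates": ["HttpMonitor"],
        "failure_severity": "Critical",
    },
}

# A Web site "device" placed in /Services/HTTP picks up the overrides.
print(resolve(props_by_class, "/Services/HTTP", "failure_severity"))  # Critical
print(resolve(props_by_class, "/Services/HTTP", "snmp_monitoring"))   # False
# A device elsewhere in the tree falls through to the root defaults.
print(resolve(props_by_class, "/Server/Linux", "failure_severity"))   # Warning
```

Because the settings live on the class, adding another Web site is just a matter of dropping a new device into /Services/HTTP.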
We now have monitoring, and graphs of page load times and sizes. Since all Web sites have their own device class, it's quite easy to see at a glance when something is misbehaving. What seemed strange at first, creating "devices" for Web sites, is now quite nice.
Every Zenoss user we've heard from says, "Too many alerts!" It's true. As mentioned before, if you enable monitoring of a service system-wide, you will get an alert every time any server has an issue with that service. At this point people generally start creating server groups, changing the default severity of service alerts, and many other things to try to get the frequency of alerts down.
Luckily, it is quite easy to manage the noise problem. When alerting rules are configured, you can specify that you only care about servers identified as "Production," for example. Different alerting constraints based on the device groups and classes are also available. In fact, nearly any property of a service or device can be evaluated before an alert is sent.
The basics, for example, "alert on all errors about servers in class Y only during business hours," are quite easy to implement. The more complicated constraints made us rethink our logic in implementing device groups, mainly because there is an outstanding bug that doesn't allow multiple "groups is not" statements, but also because Zenoss takes a little getting used to before you can design a structure that represents a complex environment well.
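To make the kind of rule discussed above concrete, here is a sketch of the same logic as a plain Python predicate. It is not how Zenoss stores or evaluates alerting rules; the `Event` fields, the numeric severity scale, the 9-to-5 weekday window, and the `/Server/Linux` class are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Event:
    """A monitoring event (hypothetical model, not a Zenoss class)."""
    device: str
    device_class: str
    production_state: str
    severity: int  # higher is worse; 4 stands for "error" in this sketch
    when: datetime


def should_alert(event, device_class="/Server/Linux", min_severity=4,
                 start=time(9, 0), end=time(17, 0)):
    """Alert on errors from Production devices in one class, business hours only."""
    in_class = (event.device_class == device_class
                or event.device_class.startswith(device_class + "/"))
    return (
        event.production_state == "Production"
        and in_class
        and event.severity >= min_severity
        and event.when.weekday() < 5            # Monday through Friday
        and start <= event.when.time() < end    # inside the business-hours window
    )


weekday_error = Event("mail01", "/Server/Linux", "Production", 4,
                      datetime(2024, 1, 8, 10, 30))   # a Monday morning
weekend_error = Event("mail01", "/Server/Linux", "Production", 4,
                      datetime(2024, 1, 13, 10, 30))  # a Saturday morning
test_box_error = Event("lab07", "/Server/Linux", "Test", 4,
                       datetime(2024, 1, 8, 10, 30))

print(should_alert(weekday_error))   # True
print(should_alert(weekend_error))   # False
print(should_alert(test_box_error))  # False
```

Each condition removes one source of noise, which is why a handful of constraints is usually enough to quiet things down.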
Current bugs notwithstanding, Zenoss alerting is far easier to configure than Nagios. We are sad to report that one very important feature does not exist in Zenoss: service dependencies. Zenoss implements network-based dependencies, which are automatic and transparent to the administrator, but for some reason we cannot define a service dependency, either across servers or on a single server. If you're monitoring five services on a server that crashes, you'll get six alerts: one to say it's down (ping check), and one for each of the services that have failed. We believe Zenoss will be implementing this feature soon, but have no definitive timeline.
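For readers who have not used service dependencies, the following sketch shows what the missing feature would do with the six alerts in the example above. It is not something Zenoss provides, and the check names and the `suppress_dependents` function are hypothetical.

```python
def suppress_dependents(down_checks, depends_on):
    """Keep only the failures whose parent check is not itself failing."""
    down = set(down_checks)
    return [check for check in down_checks if depends_on.get(check) not in down]


# Five services on one server, each declared to depend on that server's ping check.
services = ["smtp", "http", "imap", "ftp", "ssh"]
depends_on = {svc: "ping" for svc in services}

# The server crashes: the ping check and all five service checks fail.
down_checks = ["ping"] + services
print(len(down_checks))                              # 6 alerts without dependencies
print(suppress_dependents(down_checks, depends_on))  # ['ping']

# If only one service fails while the server is up, its alert still goes out.
print(suppress_dependents(["smtp"], depends_on))     # ['smtp']
```

The point of the dependency is that one root-cause alert replaces six, while an isolated service failure is still reported.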
We find it refreshing that the majority of the time spent implementing a test Zenoss deployment on 150 servers went into configuring fanciness that no other monitoring system in the Unix world provides. We expected something like Zenoss to be horribly complex and impossible to apply to our strange setup, like OpenNMS, but it wasn't.
Think back to the time investment that was put into any Nagios deployment, and specifically think about the time it took to write a script to generate host entries. If you can drop in a replacement for Nagios that also provides so many other wonderful features, there's certainly no reason not to. Be careful: you will almost certainly become addicted to tweaking Zenoss once it's deployed, so start testing it on a Monday.