Homebrew NMS: Which Reports Should You Care About?
You can vacuum up all sorts of loose data, but you have to be able to make sense of it, too. Here's how to use your homebrew NMS to get the most value out of your information.
The most frequently used reports generated by an NMS are those relating to service availability. Nagios, for instance, makes it quite easy to generate reports of uptime and service levels for everything that's monitored. This is basic information that every IT organization needs to know, and the existing systems manage well. It would be nice, however, to correlate this data with specific reasons for outages; more on that in a bit.
Existing reports in the Network Discovery realm, for the NMS solutions that actually do true discovery, include things like IP allocation per-subnet and VLAN saturation. Your discovery should be seeding the "network database" with information about what physical switch port a device is connected to, and with that information, you can construct further interesting reports. It may be wise to track the migration pattern of a server, especially when you had no idea it was moving!
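As a rough illustration of the "migration pattern" report, here is a minimal Python sketch. It assumes your discovery process records one observation per device per scan as a (timestamp, device, switch, port) tuple; that layout, the MAC addresses, and the switch names are all hypothetical, not something any particular NMS produces.

```python
def find_moves(observations):
    """Return (timestamp, device, old_location, new_location) for every move.

    observations: iterable of (timestamp, device, switch, port) tuples in any
    order. Timestamps are assumed to be ISO-8601 strings so they sort correctly.
    """
    last_seen = {}  # device -> (switch, port) from the most recent observation
    moves = []
    for ts, device, switch, port in sorted(observations):
        location = (switch, port)
        previous = last_seen.get(device)
        if previous is not None and previous != location:
            moves.append((ts, device, previous, location))
        last_seen[device] = location
    return moves


# Hypothetical discovery data: one device changes switch port on the third scan.
observations = [
    ("2024-03-01T02:00", "00:11:22:33:44:55", "sw-a", "Gi0/1"),
    ("2024-03-02T02:00", "00:11:22:33:44:55", "sw-a", "Gi0/1"),
    ("2024-03-03T02:00", "00:11:22:33:44:55", "sw-b", "Gi0/7"),
    ("2024-03-03T02:00", "66:77:88:99:aa:bb", "sw-a", "Gi0/2"),
]

moves = find_moves(observations)
for ts, device, old, new in moves:
    print(f"{ts}: {device} moved from {old[0]} {old[1]} to {new[0]} {new[1]}")
```

A device seen for the first time is not reported as a move; only a change from a previously recorded location is.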
Another aspect, or feature, of some NMS solutions is visualization of the network. The "maps" generated, at either Layer 2 or Layer 3, can be extremely helpful in planning for upgrades in your network or with reviewing changes. There isn't much to improve on if you already have a correct and manageable network mapping solution, so let's focus on the reports themselves.
In a previous NMS article, One Database To Rule Them All, we spoke at length about how all of this data could be correlated. You should be able to quickly tell what changes have been applied to a particular service or the server it runs on. Having this information available means that we can track problems much more efficiently. We also should be able to know with absolute certainty that a specific server has a specific configuration.
Sure, configuration management systems ensure that managed configuration files are correct, but they don't tell us when something else changes. Many companies use host-based intrusion detection software, like Tripwire, to identify unauthorized changes. Every morning, for example, we'd get a report of all files on a system that changed the day before. This is wonderful, but introduces another set of data that's just looked at and then forgotten.
If an IDS (a "change detection" system, when used for this purpose) inserts its data into your existing database, you can associate a long history of changes with particular hosts, or even with particular trouble tickets. Once again, we have a single point for information, and furthermore, it's searchable based on anything you need to know. Which hosts actually received a new /etc/group file on Friday? Which of those hosts were involved in incident #3245? Those questions are easy to answer if you have the information in a single place.
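To make those two questions concrete, here is a small sketch using Python's built-in sqlite3 module. The schema (a file_changes table and a ticket_hosts table), the host names, and the sample date standing in for "Friday" are assumptions made for illustration; your own database will look different.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_changes (host TEXT, path TEXT, changed_on TEXT);
    CREATE TABLE ticket_hosts (ticket_id INTEGER, host TEXT);
""")
conn.executemany(
    "INSERT INTO file_changes VALUES (?, ?, ?)",
    [
        ("web01", "/etc/group", "2024-03-01"),
        ("web02", "/etc/group", "2024-03-01"),
        ("db01", "/etc/passwd", "2024-03-01"),
        ("web03", "/etc/group", "2024-02-28"),
    ],
)
conn.executemany(
    "INSERT INTO ticket_hosts VALUES (?, ?)",
    [(3245, "web02"), (3245, "db01")],
)


def hosts_with_change(conn, path, day):
    """Hosts where the change-detection data shows `path` changed on `day`."""
    rows = conn.execute(
        "SELECT DISTINCT host FROM file_changes "
        "WHERE path = ? AND changed_on = ? ORDER BY host",
        (path, day),
    )
    return [row[0] for row in rows]


def hosts_with_change_in_ticket(conn, path, day, ticket_id):
    """The same question, narrowed to hosts linked to one trouble ticket."""
    rows = conn.execute(
        "SELECT DISTINCT c.host FROM file_changes c "
        "JOIN ticket_hosts t ON t.host = c.host "
        "WHERE c.path = ? AND c.changed_on = ? AND t.ticket_id = ? "
        "ORDER BY c.host",
        (path, day, ticket_id),
    )
    return [row[0] for row in rows]


friday = "2024-03-01"  # sample date standing in for "Friday"
print(hosts_with_change(conn, "/etc/group", friday))
print(hosts_with_change_in_ticket(conn, "/etc/group", friday, 3245))
```

The point is not the particular schema but that both questions become a single query once the change data and the ticket data share a host key.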
The Value of Network-Based Detection
Knowing what has changed on a network is not limited to host-based detection. Network-based detection can reveal some interesting things as well. Open ports on a server, for example, are extremely useful to keep track of. A newly opened TCP port on a server usually means that a new service has been deployed, but not always. A quick daily scan of your systems can reveal many things you wish you had known about your network exposure. Once the initial shock of, "oops, we're still running that old RPC service?" has died down, you can start to strategize. Ideally, you'll want a daily report of which servers have started listening on new ports. Storing a history of open ports gives you knowledge of what services are running and, most importantly, notification when that changes. Many security incidents would have been discovered much sooner if such a system had been in place.
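The daily "new ports" report comes down to comparing today's scan with yesterday's. The sketch below assumes each scan has already been reduced to a mapping of host name to a set of open TCP ports; how you produce that mapping (whatever scanner you use, and however you parse its output) is left out, and the host names and ports are invented.

```python
def new_ports(previous, current):
    """Return {host: sorted ports} for ports open in `current` but not `previous`.

    Both arguments map a host name to a set of open TCP ports. A host that was
    absent from the previous scan has all of its ports reported as new.
    """
    report = {}
    for host, ports in current.items():
        added = set(ports) - set(previous.get(host, ()))
        if added:
            report[host] = sorted(added)
    return report


# Hypothetical scan results for two consecutive days.
yesterday = {"web01": {22, 80}, "db01": {22, 5432}}
today = {"web01": {22, 80, 8080}, "db01": {22, 5432}, "app01": {22, 111}}

for host, ports in sorted(new_ports(yesterday, today).items()):
    print(f"{host}: newly open TCP ports {ports}")
```

Swapping the two arguments gives the opposite report, ports that have closed since the previous scan, which can be just as interesting.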
Gathering the data from many different tools that you probably already use will allow the generation of reports that would otherwise be impossible. Weekly reports of availability of a service are much more useful if they're accompanied by data about any odd results. Imagine a report that shows: relevant trouble tickets, files changed, changes approved and executed, unauthorized changes, vital statistics about every other aspect of the server, and anything else you can gather. In one place you can determine the problem, cause, proposed changes, actual changes, and results; this is much better than a "there was an unknown problem" report.
Various data inputs from the different types of NMS solutions are all related. We're generally talking about a network node, so that's the common factor used to link all these reports together. Instead of generating five different reports on-demand for five pieces of data, we should be able to generate all available reports regarding a specific node. Perhaps not even a node; we may want to generate reports based on a trouble ticket, something akin to, "when this was reported, what changed about the server this application runs on?" As discussed above, "what changed" is a lot more involved than simply asking what the sysadmin did to fix it.
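A ticket-driven report of this kind might look something like the following Python sketch. It assumes every tool's output has been normalized into one list of (timestamp, node, source, detail) events, and that each ticket records the node it concerns and when it was reported. The event list, the ticket record, and the 24-hour look-back window are all hypothetical choices for the example.

```python
from datetime import datetime, timedelta

# Hypothetical consolidated events: (timestamp, node, source, detail).
events = [
    (datetime(2024, 3, 1, 9, 0), "web02", "change-detection", "/etc/group modified"),
    (datetime(2024, 3, 1, 11, 15), "web02", "port-scan", "TCP 8080 newly open"),
    (datetime(2024, 2, 27, 16, 0), "web02", "change-management", "approved change applied"),
    (datetime(2024, 3, 1, 10, 0), "db01", "change-detection", "/etc/passwd modified"),
]

# Hypothetical ticket record keyed by ticket number.
tickets = {3245: {"node": "web02", "reported": datetime(2024, 3, 1, 14, 30)}}


def changes_before_ticket(ticket_id, window=timedelta(hours=24)):
    """Events on the ticket's node in the `window` leading up to the report time."""
    ticket = tickets[ticket_id]
    start = ticket["reported"] - window
    return sorted(
        event
        for event in events
        if event[1] == ticket["node"] and start <= event[0] <= ticket["reported"]
    )


for ts, node, source, detail in changes_before_ticket(3245):
    print(f"{ts:%Y-%m-%d %H:%M} {node} [{source}] {detail}")
```

Because the node is the common key, the same function serves host-based and network-based data alike; widening the window is a one-argument change.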
A daily report of a few basic items can reveal the unintended consequences of the previous day's activities. Companies are becoming increasingly proactive in detecting unauthorized configuration changes, and this is yet another thing your consolidated IT Management System database can provide.
A consolidated system for all IT operations gives you an endless supply of data, for any purpose. Over the years each of the different types of management systems has been improved, and at each step along the way we see more functionality jammed into each. The NMS of today does nearly everything, but there are always a few gaps. Regardless, the data being tracked is always consistent, so it shouldn't matter what tools we use. Just mine the NMS data, turn it into your own internal format, and generate reports based on the consolidated information.