The main goal in network management and monitoring is simple: understand what is happening. Achieving that understanding is far from simple, but the right combination of tools can ease the burden. Narrowing the scope from the first part of this article, we now focus on three goals: bandwidth monitoring, host/service monitoring, and network discovery.
Network Discovery
Netdisco is a practical network discovery and management solution. It uses SNMP to gather important layer 2 data (MAC addresses) and associate it with more tangible layer 3 data (IP addresses). This prized information allows network engineers to identify exactly where a computer or other network device is plugged in.
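The correlation described above can be illustrated with a minimal Python sketch. It is not Netdisco's implementation; the two tables are hypothetical sample data standing in for what SNMP polling of a router and a switch might return, and the function name is made up for illustration.

```python
# Hypothetical sample data; a real tool would fill these tables via SNMP.
# ARP table from a router: IP address -> MAC address (layer 3 -> layer 2).
arp_table = {
    "10.0.0.5": "00:11:22:33:44:55",
    "10.0.0.9": "00:11:22:33:44:66",
}

# Forwarding table from a switch: MAC address -> (switch, port).
forwarding_table = {
    "00:11:22:33:44:55": ("switch-a", "Gi1/0/3"),
    "00:11:22:33:44:66": ("switch-a", "Gi1/0/7"),
}


def locate_ip(ip, arp, fdb):
    """Return the (switch, port) where an IP address was seen, or None."""
    mac = arp.get(ip)
    if mac is None:
        return None  # the router has no ARP entry for this address
    return fdb.get(mac)  # None if no switch has learned this MAC
```

Calling `locate_ip("10.0.0.5", arp_table, forwarding_table)` walks from the IP address to its MAC address and then to the switch port, which is the same two-step lookup an engineer would otherwise do by hand.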
The installation of Netdisco is a bit painful if the administrator opts for anything other than the default file locations. It requires a bit of hacking, but the documentation leads you through it. After the installation, device discovery is the last step and is fairly easy to configure.
Netdisco’s Web interface presents much more than discovery information. It is clear that it was written by network engineers for network engineers. There are numerous reports and statistics that provide invaluable data. Netdisco has built-in smarts that most commercial applications lack. Its useful reports and statistics include:
- ports with multiple devices attached
- device ports that have a duplex mismatch
- devices that are using IP addresses not found in DNS
- IP addresses that haven’t been used recently
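To give a feel for how two of these reports could be derived from collected data, here is a rough Python sketch. The input shapes (tuples of switch, port, and MAC; a dictionary of last-seen timestamps) and the 30-day cutoff are assumptions made for illustration, not Netdisco's actual schema or defaults.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def ports_with_multiple_devices(fdb_entries):
    """Given (switch, port, mac) tuples, return ports that saw more than one MAC."""
    macs_by_port = defaultdict(set)
    for switch, port, mac in fdb_entries:
        macs_by_port[(switch, port)].add(mac)
    return {port: sorted(macs) for port, macs in macs_by_port.items() if len(macs) > 1}


def stale_ips(last_seen, now, max_age=timedelta(days=30)):
    """Given ip -> last-seen datetime, return IPs not seen within max_age."""
    return sorted(ip for ip, seen in last_seen.items() if now - seen > max_age)
```

A port that reports several MAC addresses usually has a hub, an unmanaged switch, or a virtualization host behind it, which is why the first report is handy when hunting for unauthorized equipment.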
Netdisco does a great job with the discovery portion. Many applications have tremendous trouble in this department, since there are numerous factors that influence the data received from a switch or router. Netdisco seems to understand these situations, and where it doesn’t, it will notify you that it couldn’t make reasonable assumptions about the topology.
Bandwidth Monitoring
Cricket is our tool of choice for bandwidth monitoring.
Cricket is an RRDtool-based system for collecting various statistics over time and presenting them as time-series graphs. Cricket is flexible and relatively easy to use. It runs on all Unix systems, and some people have been successful running it under Windows 2000.
Cricket has two basic parts: the collection engine and the grapher. The collector runs at an interval set by the administrator (normally every 5 minutes) and gathers data from various network devices via SNMP. The information gathered is stored in RRD format so the grapher can parse it and create the graphs we are looking for. Because an RRD file is a fixed-size round-robin database that consolidates older samples into coarser averages, graphs can be maintained over a long period of time with no increase in storage requirements.
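Two ideas in this paragraph are easy to show in code: turning successive SNMP counter readings into a rate, and storing samples in a structure that never grows. The Python sketch below is only an illustration under simple assumptions (a 32-bit octet counter that wraps at most once between polls, and a single archive with no consolidation); it is not how Cricket or RRDtool is implemented, and the names are made up.

```python
from collections import deque

COUNTER32_MAX = 2 ** 32  # a 32-bit SNMP counter wraps back to zero at this value


def rate_bits_per_sec(prev_octets, curr_octets, interval_sec):
    """Convert two readings of an octet counter into bits per second."""
    delta = curr_octets - prev_octets
    if delta < 0:
        # Assume the counter wrapped exactly once between the two polls.
        delta += COUNTER32_MAX
    return delta * 8 / interval_sec


class RoundRobinArchive:
    """Fixed-size store: once full, each new sample overwrites the oldest one."""

    def __init__(self, slots):
        self.samples = deque(maxlen=slots)

    def update(self, value):
        self.samples.append(value)
```

With a 5-minute polling interval, an archive of 288 slots would hold exactly one day of samples no matter how long the collector runs. That property is what keeps storage flat.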
A normal installation of Cricket can be a bit time consuming, but tools exist to make this less burdensome. For example, if you want to monitor a large switch with 6 line cards and 48 ports on each one, you have to tell Cricket the ports you want data collected from. There are scripts available on the Cricket website that will connect to the switch and dump a listing of all ports in the correct format, making the configuration phase a great deal easier.
Once Cricket is installed and monitoring everything, the fun begins. With the correct SNMP OIDs (Object Identifiers), you can even get CPU and temperature information. After the collector has run a few times, the graphs start coming to life. You can view every port of the previously mentioned switch and see how much traffic is going through each one. Cricket will display the port description string associated with a certain port on the graph page, making it easy to find a specific computer’s graph.
Cricket graphs will sometimes show very large spikes and other anomalies. Gathering this data is probably the main reason we want to monitor in the first place. Many possible trouble scenarios can be debugged by looking at traffic graphs, including spanning-tree loops and multicast routing problems.
For example, if all trunk ports suddenly see ten times the normal traffic, there’s a pretty good chance you have a layer 2 loop. Likewise, if a certain subset of workstations on the same subnet starts receiving much more than the normal traffic, this could be an indication of a multicast problem: the router isn’t pruning properly and simply sends every packet to everyone.
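The "ten times the normal traffic" rule of thumb can be turned into a simple check. The Python sketch below assumes you already have recent per-port rates from your collector; the function name, the data shapes, and the use of the median as the baseline are illustrative choices, not something Cricket provides.

```python
from statistics import median


def surging_ports(history, current, factor=10.0):
    """Return ports whose latest rate is at least `factor` times their baseline.

    history: dict of port -> list of recent rates (the baseline window)
    current: dict of port -> latest rate
    """
    flagged = []
    for port, rate in current.items():
        past = history.get(port)
        if not past:
            continue  # no baseline yet, so nothing to compare against
        baseline = median(past)
        if baseline > 0 and rate >= factor * baseline:
            flagged.append(port)
    return sorted(flagged)
```

If every trunk port shows up in the result at the same moment, a loop is a likely suspect. If only a handful of access ports on one subnet appear, multicast flooding is the more plausible explanation.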
When these large spikes occur in RRD graphs, they can throw the scale off, making it hard to see the normal data once things calm back down. There is a contributed utility called “killspike” available from the Cricket website that will smooth out these large graph spikes.
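The general idea of spike removal can be sketched in a few lines of Python. This is not the algorithm killspike uses, which is not described here; it is one simple approach that treats any sample far above the median as unknown so it no longer stretches the graph scale. The factor of 10 is an arbitrary assumption.

```python
from statistics import median


def remove_spikes(samples, factor=10.0):
    """Replace samples greater than factor * median with None (unknown)."""
    known = [s for s in samples if s is not None]
    if not known:
        return list(samples)
    limit = factor * median(known)
    if limit <= 0:
        return list(samples)  # an all-zero baseline gives no usable limit
    return [s if s is None or s <= limit else None for s in samples]
```

Marking a spike as unknown rather than deleting it keeps the time axis intact, so the rest of the graph stays aligned.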
Cricket can do more than graph data. It can be configured to send alerts based on thresholds set by the administrator. Common uses may include a general “page me if the aggregate data rate goes above 8 Mb/s on this router” or “page me if router interface 2 starts sending less than 500 Kb/s.”
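The two example alerts boil down to an upper and a lower bound on a measured rate. The following Python sketch shows that logic in isolation; the `Threshold` class and `check_threshold` function are hypothetical names, and this is not Cricket's configuration syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    name: str
    max_bps: Optional[float] = None  # alert when the rate rises above this
    min_bps: Optional[float] = None  # alert when the rate falls below this


def check_threshold(threshold, rate_bps):
    """Return an alert message if the rate violates the threshold, else None."""
    if threshold.max_bps is not None and rate_bps > threshold.max_bps:
        return f"{threshold.name}: {rate_bps:.0f} b/s is above {threshold.max_bps:.0f} b/s"
    if threshold.min_bps is not None and rate_bps < threshold.min_bps:
        return f"{threshold.name}: {rate_bps:.0f} b/s is below {threshold.min_bps:.0f} b/s"
    return None
```

The 8 Mb/s rule would be `Threshold("router aggregate", max_bps=8_000_000)` and the 500 Kb/s rule would be `Threshold("interface 2", min_bps=500_000)`. A lower bound is often the more useful of the two, since a link that goes quiet is frequently a link that is broken.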
Host and Service Monitoring
Nagios provides very advanced server and device monitoring solutions. It has become the de facto standard among service monitoring applications. A rather simple installation and configuration make Nagios a desirable app for most networkers.
Nagios can monitor servers and their services. More primitive monitoring solutions only allow for a simple ping to detect whether or not a server is still up and running. All too often administrators find that a server will respond to pings, but no services are actually working. Nagios will connect to many different services to test for functionality. To test, for example, a mail server, it will connect and wait to get the SMTP greeting. Nagios will monitor most common network-based services out of the box, and plug-ins exist for almost everything else.
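The difference between a ping and a real service check is easy to see in code. The Python sketch below connects to a mail server and looks for the SMTP greeting, which begins with the reply code 220 when the server is ready. It illustrates the idea only and is not the code of any Nagios plug-in; the function names and the default timeout are assumptions.

```python
import socket


def is_smtp_ready(banner):
    """An SMTP server that is ready for mail greets clients with reply code 220."""
    return banner.startswith("220")


def check_smtp_greeting(host, port=25, timeout=10.0):
    """Connect to host:port and return (ok, detail) based on the greeting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(512).decode("ascii", errors="replace").strip()
    except OSError as exc:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False, f"connection failed: {exc}"
    return is_smtp_ready(banner), banner
```

A host can answer pings while this check fails, for example when the mail daemon has crashed or is rejecting connections. That is exactly the situation a ping-only monitor misses.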
Plug-ins are where Nagios really shines. People have written countless feature extensions for Nagios, from SNMP checks to instant messenger hooks that allow notices to be sent via ICQ. Nagios has the built-in ability to send notifications of outages to a group of administrators, normally via an email-to-pager gateway. By utilizing the plug-ins available for download, you can configure Nagios’s notifications in endless ways. One of the more popular plug-ins is an SNMP daemon that receives traps and generates alerts based on the information received.
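Part of the reason so many plug-ins exist is that the interface is small: a plug-in is any program that prints a line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The Python sketch below shows that convention for a made-up response-time check; the `evaluate` function and its thresholds are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(response_sec, warn_sec, crit_sec):
    """Map a measured response time to a (exit_code, status_line) pair."""
    if response_sec is None:
        return UNKNOWN, "UNKNOWN - no measurement available"
    if response_sec >= crit_sec:
        return CRITICAL, f"CRITICAL - response took {response_sec:.2f}s"
    if response_sec >= warn_sec:
        return WARNING, f"WARNING - response took {response_sec:.2f}s"
    return OK, f"OK - response took {response_sec:.2f}s"
```

A complete plug-in would take the measurement, call `evaluate`, print the status line, and pass the code to `sys.exit`. Nagios then decides whom to notify based on that exit code.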
WhatsUP is a similar but Windows-based solution. This software takes monitoring a bit further by attempting to add network topology graph capabilities. The host monitoring portion is fairly good.
With WhatsUP, administrators can monitor most network services, just as with Nagios. The main difference is that WhatsUP is primarily a Windows GUI application, although it does have a robust Web interface for remote access. WhatsUP can monitor Windows services and server resources, and it also has the built-in ability to receive SNMP traps from network devices.
WhatsUP’s features are quite similar to Nagios’s, but some people prefer the easier installation and graphical interface. WhatsUP is a commercial product that comes with support and runs on Windows, an attractive offering for people who don’t have a lot of time to spend working with software to get it set up “just right.”
Many different packages offer features similar to the ones described above. The very large and complex ones, namely HP OpenView, perform marginally, despite their not-so-marginal price tag. Almost everything that complex NMS software can do can be accomplished with the software described above, and in most circumstances it is done better. Every tool has a few weaknesses, but in general, all of the desired features discussed in this article can be realized with one or two free (in most cases) and easy-to-use tools.