Your users are complaining that “the Internet is, like, all slow.” Users are always complaining, but you’re seeing a lot of timeouts when you check mail, surf the Web, or try to log in for remote administration. Or even worse, latency is so bad that you keep getting killed all to heck in your favorite gory violent online multi-player game, so you know there is a problem. But there are a lot of potential bottlenecks between your PC and the outside world, like your Internet gateway, proxy server, firewall, Internet service provider, and so forth, so where do you begin?
Network Admin’s Best Tool
One of the best and most versatile network tools you can have is a notebook PC running Linux. This lets you plug in anywhere to run tests and find out what is going on. Make it a nothing-to-lose box — don’t keep data on it so you can wipe and reinstall the operating system as necessary, because you want to be able to run tests outside of firewalls. Don’t run any services. You can put a minimal iptables firewall on it, as there is no point in being totally exposed, but keep it simple. (Use MondoRescue to make a system snapshot for fast restores.)
Eliminate The Obvious
You know the drill — is everything plugged in? Are there blinky lights? Did you pay your electric and ISP bills?
Start by taking your trusty laptop, or whatever machine you have available, and plug directly into your connection source. Bypass your firewall, router, proxy, content scanners, everything that stands between your LAN and the big bad Internet, so you can quickly find out whether the problem is local or outside of your LAN. You probably don’t want to take your entire LAN offline, so you’ll need to set up a DMZ (demilitarized zone) segment to plug into.
Next, fire up mtr (My Traceroute), which, if you don’t have it already, is a free download. mtr combines traceroute and ping in a single handy-dandy utility. First try it on a large, well-supported site like Yahoo, Google, CNN.com, or some such, just to test it out:
$ mtr yahoo.com
A window will pop up and show you the live progress, like in Figure 1.
This example shows a problem highlighted in red. Packet losses under 5% are not important; this is typical of the Internet on a busy day. But 29% is definitely a problem. Should you contact the admins at Level3.net? You might, as a courtesy. In this case, it’s safe to assume there are many routes to Yahoo, so it’s probably not all that important to you. (Don’t forget to turn mtr off after you’ve run 100-200 packets!)
The interesting hop in this example is the very first one, router.ortelco.net. That is my ISP, the very first stop on the path from my computer to yahoo.com. Since the example shows 1% packet loss, the ISP is not the bottleneck. (Please don’t run tests on ortelco.net; use your own ISP. It’s a small-town ISP, so be kind.)
Mail Server Blues
Here is a real-life example of tracking down problems with my own mail server. I use a hosting service, and they are very good. But lately I’ve had trouble receiving and sending mail. It takes a long time, and often times out. Figure 2 shows the output of the following command:
$ mtr mail.bratgrrl.com
This shows that the entire path from my PC to my mail server is a congested mess. Of particular interest are the first and last hops: router.ortelco.net shows 21% packet loss, and my mail server, venus.euao.com, shows a 36% packet loss. Should I send a LART to my service providers? Not quite yet. The next step is to run the same test at different times of day over the next couple of days. To generate nicely-formatted, copy-able output, add the -r flag, and limit the number of packets sent with the -c flag. Then store the output in a file:
$ mtr -r -c100 yahoo.com >> test.txt
Slap it into a cron job, running it every hour or so, and in a day or two you have a good snapshot of what is happening on a particular link. Suppose that this shows venus.euao.com is consistently dropping bales of packets; what comes next? Create a trouble ticket using the output of mtr to show the service provider that the problem is specific to their server, and not somewhere else downstream.
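Here is a minimal sketch of a wrapper for that cron job, so each report in the log carries a timestamp. Everything named here is an assumption made for illustration: the target mail.example.com, the log path, and the function name log_report are hypothetical, and the MTR variable exists only so you can point at wherever mtr lives on your system.

```shell
#!/bin/sh
# Hypothetical wrapper: append a timestamped mtr report to a log file.
# TARGET, LOGFILE, and MTR are illustrative defaults; override as needed.
TARGET="${TARGET:-mail.example.com}"
LOGFILE="${LOGFILE:-$HOME/mtr-log.txt}"
MTR="${MTR:-mtr}"

log_report() {
    {
        # Header line so each run can be told apart later.
        printf '=== %s %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$TARGET"
        # Same flags as in the text: report mode, 100 packets.
        "$MTR" -r -c 100 "$TARGET"
    } >> "$LOGFILE" 2>&1
}
```

Saved as, say, $HOME/bin/mtr-log.sh, an hourly crontab entry along the lines of `0 * * * * . $HOME/bin/mtr-log.sh && log_report` should do the job. cron often runs with a short PATH, so you may need to set MTR to the full path of the binary.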
What if your external testing demonstrates no particular problems? Then you know you need to investigate your LAN for the source of your network troubles. mtr works on the inside just as well as the outside.
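Whether the reports come from inside or outside the LAN, reading a day's worth by eye gets old fast. The sketch below picks out hops whose loss exceeds a threshold, using the 5% rule of thumb from earlier as the default. It assumes the report layout where each hop line starts with a hop number and the loss is the first field ending in a percent sign; mtr's output varies between versions, so check it against what yours prints. The function name flag_lossy_hops is made up for this example.

```shell
#!/bin/sh
# Hypothetical filter: read `mtr -r` output on stdin and print
# "host loss%" for every hop above the threshold (default 5).
flag_lossy_hops() {
    threshold="${1:-5}"
    awk -v limit="$threshold" '
        # Hop lines start with a hop number such as "1." or "1.|--".
        $1 ~ /^[0-9]+\./ {
            for (i = 2; i <= NF; i++) {
                # The first field ending in "%" is taken to be the loss.
                if ($i ~ /^[0-9.]+%$/) {
                    loss = $i
                    sub(/%$/, "", loss)
                    # The field just before it is taken to be the host.
                    if (loss + 0 > limit + 0) print $(i - 1), $i
                    break
                }
            }
        }'
}
```

For example, `mtr -r -c100 yahoo.com | flag_lossy_hops 5` prints only the hops worth worrying about, and prints nothing on a good day.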
Good Ole Traceroute
Another good tool is plain old traceroute. Remote traceroutes are especially interesting. Find these at traceroute.org. This is good for showing recalcitrant support techs that the problem really is theirs, and not something you are doing, because you can show traceroutes originating from different locations. I know, the old support-by-consensus is lame — “but ma’am, no one else is reporting a problem!” But we often must deal with it.
A side note on traceroute.org — a lot of the links are no longer valid, so it would be nice to politely inform them of any dead links you find.
Traceroute is also worth having your service provider run from their end back to you. Sometimes the return path contains problems, so this is an important step, especially if they are being fussy about admitting there is a problem.
Good Old Ping
Ole ping is still useful for network testing. Suppose that mtr shows that your ISP is dropping packets like they were toxic waste — ping the offending host and collect some nice statistics:
$ ping ortelco.net
PING ortelco.net (220.127.116.11): 56 data bytes
64 bytes from 18.104.22.168: icmp_seq=0 ttl=63 time=41.0 ms
64 bytes from 22.214.171.124: icmp_seq=3 ttl=63 time=1740.2 ms
64 bytes from 126.96.36.199: icmp_seq=4 ttl=63 time=2289.4 ms
64 bytes from 188.8.131.52: icmp_seq=5 ttl=63 time=1971.1 ms
--- ortelco.net ping statistics ---
100 packets transmitted, 34 packets received, 66% packet loss
round-trip min/avg/max = 41.0/1510.4/2289.4 ms
The interesting bits here are the dropped packets and the horrendous ping times. Anything under 100 milliseconds is good. At 150 milliseconds, you’ll notice slower-loading Web pages, and possibly mail and Web timeouts. So anything in the thousands is obviously unworkable.
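If you collect ping summaries regularly, a small filter can boil each one down to a single line. This is a rough sketch, assuming the summary contains a "packet loss" line and a "min/avg/max" line like the output above; different ping versions word these lines differently. The cutoffs follow the rule of thumb in the text, except that lumping everything from 100 milliseconds up to a full second together as "slow" is my own simplification. The function name classify_ping is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: read ping output on stdin and print one summary line.
classify_ping() {
    awk '
        /packet loss/ {
            # The loss figure is the field ending in "%".
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
        }
        /min\/avg\/max/ {
            # Values follow the "=" sign, separated by slashes: min/avg/max.
            split($0, halves, "=")
            split(halves[2], rtt, "/")
            avg = rtt[2] + 0
            seen = 1
        }
        END {
            if (!seen)           verdict = "no replies"
            else if (avg < 100)  verdict = "good"
            else if (avg < 1000) verdict = "slow"
            else                 verdict = "unworkable"
            printf "loss=%s avg=%.1fms verdict=%s\n", loss, avg, verdict
        }'
}
```

Run it as `ping -c 100 your.isp.example | classify_ping`, substituting your own ISP's host, and append the result to the same file as your mtr reports.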
When you contact your service provider, don’t try to attempt a diagnosis. Just show them the statistics you have collected. There are all kinds of things going on behind the scenes that ping, mtr, and traceroute cannot show you. All you want to do is show proof of a problem — it’s up to them to diagnose and fix it.
Finding Out Who To LART
Presumably you have contact information for your own service providers. What if you find a problem, such as the Level3.net example (above), and you want to report it? whois tells all. Use the -H flag to turn off the reams of annoying useless legalese:
$ whois -H level3.net
The WHOIS information may be out of date or false, but it’s the first place to look.
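WHOIS records come in many layouts, so fishing out the contact addresses by eye can be tedious. As a rough sketch, the filter below simply pulls anything that looks like an email address out of whatever whois prints. It assumes a grep that supports the -o option (GNU and BSD grep do), the pattern is deliberately loose, and the function name whois_emails is invented for this example.

```shell
#!/bin/sh
# Hypothetical filter: read whois output on stdin and list the unique
# email-like strings it contains, one per line.
whois_emails() {
    grep -Eio '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' | sort -u
}
```

For example, `whois -H level3.net | whois_emails` gives you a short list of addresses to try. Some registries redact contact details, so an empty result does not mean you did anything wrong.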
“Network Troubleshooting Tools” by Joseph D. Sloan is a most worthy reference book for all aspects of network troubleshooting: hardware, software, and performance analysis.