The Simplicity and Serenity of DHCP Fault Tolerance
Fault tolerance is a factor in the provision of almost every network service. Because downtime costs money, the need to provide fault tolerance for networked environments has gone beyond just making sure that users have uninterrupted access to the network. It is widely accepted that the level of fault tolerance can affect the viability and bottom-line success of an organization.
Nevertheless, while some network services get ample attention when it comes to fault tolerance, others, such as DHCP, often do not. Some would claim this is because DHCP, for reasons we’ll explore shortly, is less critical to the ongoing operation of the network than services like DNS. While this may be true, try taking down a DHCP server and see what happens. The problems may not appear immediately, but ultimately the result will be the same: the network will stop operating.
In many organizations, provision of fault tolerance for DHCP is seen as a minor consideration. After all, if the DHCP server goes down, it only takes about 10 minutes to install the service on another server and redefine the address scopes. Although this is a simple solution, it goes against the grain of today’s highly controlled network environments.
For example, what server do you install the DHCP service on? The server running the corporate accounting system, or perhaps the one servicing the e-commerce website that is the lifeblood of the company? OK, so perhaps that’s a little overdramatic, but the fact remains that in today’s world of micro-managed and controlled networks, you do not install a new application, or a new server for that matter, without a healthy measure of consideration and planning.
The Inner Workings of DHCP
One of the reasons that DHCP is often not as well protected, from a fault-tolerance viewpoint, is that a DHCP failure is generally not an immediate mission-critical concern. The mechanics of DHCP are such that the failure of a DHCP server may not have an impact on the network for hours or even days. This is due to the way in which DHCP leases work.
When a client system obtains an IP address via DHCP, the address is given to the system, or leased, for a given period. At set points during the lease (by default, 50% and 87.5% of the lease duration), the client system will attempt to renew the lease with the DHCP server. If it cannot renew the lease, it will still use 100% of the lease term before ceasing to use the address. With an address lease duration of 3 days (which is quite common), this would mean that a system could go as long as 3 days, and no less than about a day and a half if it had been renewing normally at the halfway point, before the inability to contact the DHCP server becomes an issue.
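The lease arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes the default renewal and rebinding points of 50% and 87.5% of the lease (the defaults given in RFC 2131, which an implementation may override), and the function name, dictionary keys, and start time are made up for the example.

```python
from datetime import datetime, timedelta


def lease_timers(granted_at, lease):
    """Return the renewal, rebinding, and expiry times for one lease.

    Assumes the default timer fractions of 0.5 and 0.875; a real server
    or client may be configured with different values.
    """
    return {
        "renew": granted_at + lease * 0.5,     # first renewal attempt
        "rebind": granted_at + lease * 0.875,  # later, wider renewal attempt
        "expires": granted_at + lease,         # address must be given up
    }


# Hypothetical lease granted at 09:00 on 1 January with a 3-day duration.
timers = lease_timers(datetime(2024, 1, 1, 9, 0), timedelta(days=3))
for name, when in timers.items():
    print(name, when)
# renew 2024-01-02 21:00:00
# rebind 2024-01-04 00:00:00
# expires 2024-01-04 09:00:00
```

If the server fails just after this lease is granted, the client is unaffected until the expiry time, a full 3 days later.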
Problems can arise, though, when DHCP address leases are configured for a particularly short period, such as a few hours. In these cases, a DHCP server failure can create more of an issue, as a few hours may not be enough to recognize the failure and bring another DHCP server online and into service. A simple example of this might be if the failure occurred overnight, and was only realized in the morning when users were unable to log on to the system. You could say that the simple solution to this problem is not to use short DHCP leases, but that’s not always possible.
Another aspect of DHCP leases that justifies the need for fault tolerance is the way that addresses are handled after they are issued. Once an address is leased out to a system, that address cannot then be assigned to another system until it is released, harvested, or the lease expires. In an environment where there are a large number of system changes on the network, this can cause problems.
For example, with a highly mobile workforce that connects to and disconnects from the network frequently, available addresses can get used up quickly. The common solution to this problem is to shorten the DHCP lease duration. As we just discussed, though, this puts you at a higher level of risk in the event of a DHCP server failure.
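To make the address-exhaustion problem concrete, here is a toy lease table in Python. It is a simplified model rather than how any particular DHCP server is implemented: the class name, method names, client names, and the use of plain integer seconds for time are all assumptions made for the illustration.

```python
import ipaddress


class LeasePool:
    """Toy address pool: an address is unavailable until released or expired."""

    def __init__(self, first, last):
        start = int(ipaddress.IPv4Address(first))
        end = int(ipaddress.IPv4Address(last))
        self.free = [ipaddress.IPv4Address(n) for n in range(start, end + 1)]
        self.leases = {}  # address -> (client_id, expiry time in seconds)

    def allocate(self, client_id, now, lease_seconds):
        """Lease the next free address, or return None if the pool is empty."""
        self.reclaim_expired(now)
        if not self.free:
            return None
        address = self.free.pop(0)
        self.leases[address] = (client_id, now + lease_seconds)
        return address

    def release(self, address):
        """Return a leased address to the free list."""
        if self.leases.pop(address, None) is not None:
            self.free.append(address)

    def reclaim_expired(self, now):
        """Free every address whose lease has run out."""
        expired = [a for a, (_, expiry) in self.leases.items() if expiry <= now]
        for address in expired:
            self.release(address)


# Two addresses, three laptops, one-hour leases.
pool = LeasePool("192.168.1.1", "192.168.1.2")
first = pool.allocate("laptop-a", now=0, lease_seconds=3600)
second = pool.allocate("laptop-b", now=0, lease_seconds=3600)
refused = pool.allocate("laptop-c", now=10, lease_seconds=3600)    # pool is full
granted = pool.allocate("laptop-c", now=3600, lease_seconds=3600)  # leases expired
print(refused, granted)  # None 192.168.1.1
```

The third laptop is refused even if the first two have already left the building, because their addresses stay reserved until the leases expire. A shorter lease shrinks that waiting period, at the cost of a shorter grace period when the server is down.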
Implementing DHCP Fault Tolerance
Having established that there is justification for providing more than one DHCP server, the question then becomes exactly how to do it. The first thing you will need is another server somewhere on the network, but as the need for DHCP fault tolerance is most likely only a reality in larger installations, this should not pose a significant issue. No matter where the other server is on the network, it can still act as a backup for the primary DHCP server in the event of a failure.
A surprisingly common misconception is that the DHCP server must be on the same subnet as the clients it serves, which is not the case. Having a DHCP server on the same subnet as the clients it serves will reduce the amount of DHCP-related traffic on the rest of the network, but it does not affect how the clients receive addresses, or how the DHCP service is configured.
The key to understanding how a DHCP server can service clients from a remote subnet is in appreciating how DHCP requests from clients are transmitted through the network. Much of the DHCP client-to-server communication is achieved via broadcast, though generally speaking, routers do not forward broadcast transmissions.
It is this principle that is at the root of the myth that DHCP servers must be connected to the same subnet as the clients they serve. In the case of DHCP traffic, routers can be configured to make an exception.
Not only will a router forward DHCP broadcast traffic, it will also insert the address of the interface on which the request was received into the packet, in the field known as giaddr (gateway IP address). When the DHCP server receives the request, it can then use this information to see which subnet the request originated from, and examine the configured scopes to see if it has an address for that subnet that it can supply to the client.
The ability of a DHCP server to determine the originating subnet is an important consideration in DHCP implementations, as it makes it possible to place DHCP servers on subnets other than the ones they directly serve. It also makes it possible for a single DHCP server to provide addressing services to multiple subnets.
Using DHCP servers to service remote subnets provides additional flexibility to your fault-tolerant DHCP implementation, but it also means that you will have broadcast DHCP traffic traveling on the other subnets, which is undesirable.
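The scope lookup described above can be sketched in Python. This is a simplified illustration, not the logic of any specific DHCP server: the scope table, the scope names, the /24 subnets, and the function name are all assumptions, and the only protocol detail relied on is that a forwarded request carries the forwarding device's interface address in the gateway address (giaddr) field, while a request received directly carries 0.0.0.0 there.

```python
import ipaddress

# Hypothetical scope table: subnet -> scope name.
SCOPES = {
    ipaddress.ip_network("192.168.1.0/24"): "scope-192.168.1",
    ipaddress.ip_network("192.168.2.0/24"): "scope-192.168.2",
}


def select_scope(giaddr, server_interface):
    """Pick the scope for a request, or None if no scope matches.

    giaddr is the gateway address carried in the request; 0.0.0.0 means
    the request was not forwarded, so the server's own interface address
    identifies the subnet instead.
    """
    source = ipaddress.ip_address(giaddr)
    if source.is_unspecified:
        source = ipaddress.ip_address(server_interface)
    for network, name in SCOPES.items():
        if source in network:
            return name
    return None


# Forwarded from a remote subnet to a server sitting on 10.0.0.0.
print(select_scope("192.168.2.254", "10.0.0.5"))  # scope-192.168.2
# Received directly on the server's own subnet.
print(select_scope("0.0.0.0", "192.168.1.10"))    # scope-192.168.1
```

A request from a subnet with no configured scope returns None, which corresponds to the server having nothing to offer that client.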
A solution to this problem is to use DHCP relay agents (also known as BOOTP relay agents), which collect DHCP traffic from the local network and then send it directly to the DHCP server. They are able to do this because they are configured with the address of the DHCP server. DHCP relay agents can be implemented on many hardware routers, or if you are using a software-based router, DHCP relay agents are available for most common network operating systems.
Configuring Redundant DHCP Servers
Having established that DHCP servers need not be on the same subnet as the clients they serve, we can now look more closely at exactly how to configure multiple DHCP servers to service a single subnet. This issue boils down simply to the use of scopes.
Multiple DHCP servers cannot serve addresses from the same scope. If you do configure two DHCP servers with the same scope, it won’t be long before duplicate IP addresses start to appear on the network. The solution is simply to distribute the range of available addresses across the DHCP servers. The generally accepted principle for doing this is referred to as the 80:20 rule.
As the name implies, the rule dictates that 80% of the available scope is defined on one of the DHCP servers, with 20% being defined in a scope on the other. For example, if you were using a scope of 192.168.1.1 to 192.168.1.100, you would configure a scope of 192.168.1.1 to 192.168.1.80 on one server (Server1), and a scope of 192.168.1.81 to 192.168.1.100 on the other server (Server2).
Now, if one of the DHCP servers is down, there are addresses available in the scope on the other server to service clients from the 192.168.1 subnet. If you had another subnet, let’s say 192.168.2.1 to 192.168.2.100, you could reverse the 80:20 rule, by placing 20% of the scope on Server1, and 80% of the scope on Server2. You would then have DHCP fault tolerance for both subnets.
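The split itself is simple arithmetic, and a short Python sketch shows how the two ranges in the example fall out. The function name and its percentage parameter are made up for the illustration, and the sketch only computes the ranges; entering them as scopes on each server is still done with whatever tools your DHCP service provides.

```python
import ipaddress


def split_scope(first, last, first_percent=80):
    """Split an address range in two, giving first_percent to the first part.

    Returns ((start, end), (start, end)) as IPv4Address pairs.
    """
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    total = end - start + 1
    first_count = total * first_percent // 100
    if not 0 < first_count < total:
        raise ValueError("split must leave at least one address on each side")
    cut = start + first_count  # first address of the second part
    return (
        (ipaddress.IPv4Address(start), ipaddress.IPv4Address(cut - 1)),
        (ipaddress.IPv4Address(cut), ipaddress.IPv4Address(end)),
    )


# Subnet 1: Server1 takes 80%, Server2 takes 20%.
server1_a, server2_a = split_scope("192.168.1.1", "192.168.1.100", 80)
# Subnet 2: the rule is reversed, so Server1 takes 20% and Server2 takes 80%.
server1_b, server2_b = split_scope("192.168.2.1", "192.168.2.100", 20)

print(server1_a, server2_a)
print(server1_b, server2_b)
```

For the first subnet this yields 192.168.1.1 to 192.168.1.80 for Server1 and 192.168.1.81 to 192.168.1.100 for Server2. For the second it yields 192.168.2.1 to 192.168.2.20 for Server1 and 192.168.2.21 to 192.168.2.100 for Server2. Because the two parts never share an address, the servers cannot hand out duplicates.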
The obvious rationale behind the 80:20 rule is that during the period in which the failed server is down, the other server can service requests for addresses from that range. Only new address requests will be serviced, though. Because of the way the DHCP leasing process works, a system that has been leased an address by one DHCP server will always attempt to renew the address with that server before giving up and contacting another DHCP server for a new address.
While configuring the 80:20 rule on the servers does not make up for a shortage of IP addresses, it will not make the situation any worse, either. If a server has no addresses left in its scope, it will simply ignore requests from clients for an address. When both servers are up and running, they can both reply to requests if they have available addresses. Likewise, when the servers run out of addresses, they run out. It is no different than how a single DHCP server would operate in this respect.
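A rough way to see how much headroom the split leaves is to count how many new clients the running servers can absorb. The Python sketch below is a deliberately crude model: the function name is hypothetical, it ignores renewals and lease expiry, and it lets the first running server with a free address answer, whereas on a real network the client accepts whichever offer it prefers.

```python
def serve_new_clients(free_addresses, servers_up, new_clients):
    """Count how many new clients receive an address from the running servers."""
    remaining = {name: free_addresses[name] for name in servers_up}
    served = 0
    for _ in range(new_clients):
        # Any running server with a free address may answer; take the first.
        donor = next((n for n, free in remaining.items() if free > 0), None)
        if donor is None:
            break  # every running server is out of addresses and stays silent
        remaining[donor] -= 1
        served += 1
    return served


# Free addresses for the 192.168.1 subnet under the 80:20 split.
free = {"Server1": 80, "Server2": 20}

print(serve_new_clients(free, ["Server1", "Server2"], 30))   # 30
print(serve_new_clients(free, ["Server2"], 30))              # 20
print(serve_new_clients(free, ["Server1", "Server2"], 120))  # 100
```

With Server1 down, only the 20 addresses held by Server2 are available to new clients, which is why the 20% share should be sized against how many new clients are expected during a realistic outage.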
Given the simplicity of creating a fault-tolerant DHCP implementation, if you do not already have one in place, you should certainly consider doing so. Unlike so many other fault-tolerance measures, there is no additional software or hardware to purchase. It’s simply a case of planning your implementation and putting it into place.