Get a Handle on VM Sprawl
Taking elements of your network virtual may reduce the number of boxes plugged into the wall, but it doesn't mean you can forget about managing their virtualized twins.
Virtualization is great; that much we can all agree on. Virtual machines (VMs) can tend to grow out of control, however, now that it’s so easy to create them. This should not be all that surprising, but apparently many small to medium businesses are also dabbling in VMs, and they are suddenly overwhelmed by the VM growth.
Each VM is another server that an administrator must manage. Security updates must be applied and global configuration changes now need to be propagated to all these new machines. While it’s easy to create 3-4 (or more) servers on one physical piece of hardware, you’ll certainly struggle if you aren’t already set up to scale.
The number of physical machines in a small company may drop dramatically; maybe 40 percent, when virtualization is implemented. Unfortunately, the number of operating system (OS) instances will generally increase by two-fold or more at the same time. The power and cooling savings are realized, as was promised by virtualization, but taking 20 servers to 12 servers, for example, will mean you may soon have 40 OS instances to manage.
The reasons for VM proliferation depend on your culture, but the most common reason is that delegating control of an entire OS is easier than managing an application for customers. IT customers, be they engineers, application developers, or smaller IT units within an organization, frequently need more access than central IT is willing to give. The easy solution: Give them a server of their own. Test environments, too, are well served by virtual machines.
To keep hardware (and power and cooling) costs down, many companies introduce policies about the implementation of new services. New applications and servers need to be run on VMs first, unless they really requires their own server. Policies such as these are good, in that they limit wastefulness, but they do tend to exacerbate VM sprawl.
Sprawl aside; it’s worth noting that higher utilization levels on your servers does not mean that they’ll use an appreciably larger amount of power. In fact, the power savings claims are really true, and can be even greater if your utilization is low and you use VirtualCenter’s power management features. VMWare can migrate VMs to fewer servers if utilization isn’t high enough, and actually power off unnecessary servers. This works best with Dell hardware, but other large vendors are supported as well. Imagine: all your VMs migrating to a few blades in a blade server during the nighttime, and then as utilization increases during the day, blades quickly boot up and take the load as needed. Granted, I don’t personally know any enterprise environments that are brave enough to try it yet, but in theory the concept is wonderful.
Something magical happens when a company grows to around 50 operating system instances. That’s too many to manage by simply logging in and running commands, so people start to write scripts. In Windows land, if it hasn’t already happened, you must implement Active Directory . For the Unix/Linux servers, configuration management becomes even more important. Writing a script that SSH ’s to each server and runs a command doesn’t scale, no matter how hard people want it to. You need a real configuration management system (such as puppet or cfengine) to ensure that servers are configured exactly how you want.
If you already operate in a large environment with good automated installations and configuration management systems, chances are scaling 100-fold won’t be a problem. Barring scaling issues with the management software itself, that is. A good network-booting deployment system is only half the battle, because every server isn’t going to be configured identically. If you’re “doing it right,” you should be able to arbitrarily reinstall any server, walk away, and know that it’ll come back up patched and running all the services it’s supposed to. Servers, or rather the OS that runs on them, should be truly disposable.
VMWare promises management of a “golden image”, probably because ITIL mentions it, but it doesn’t really help in practice. You have to create your images (somehow). There’s no mechanism to update a golden image with security patches and apply them to existing systems; you’ll generally have to reinstall the OS instances. And that’s what you should do periodically, but without some kind of configuration management system, you’ll also be manually installing and configuring the services that the VMs used to provide in order to restore service functionality.
VM growth, therefore, is no different from server growth. It may be easier and cheaper, but from the OS management viewpoint, you’re doing the same thing. Likewise, the availability of your services is also in danger. Running five VMs on a single piece of hardware means that a hardware failure takes out five servers instead of one. VMWare and Xen can both be clustered and run from shared storage, such that a hardware failure will result in the VMs immediately (instantly, even) being migrated to other servers. The problem is that VMotion requires the most expensive VMWare license, and a VirtualCenter server. Shared storage isn’t as big of an issues these days with iSCSI , but it's still another aspect that must be configured. We’ll cover this issue in-depth in a future article, focusing on Xen and RHEL Clustering Services.
The point is: dealing with VM sprawl is no different than dealing with scaling up to support more physical servers. Use whatever mechanisms are available on your given platforms, and “do it right.” A VM is, and always will be, just another server.