Time to Converge Monitoring and Management in Linux and Unix
I admit it: I'm slightly jealous of Microsoft server administrators. You see, in the Linux world, we have the power to create crazily robust and creative systems, but we're often reinventing the wheel.
Specifically, in Microsoft land, it takes very little time to set up Active Directory and a boot server. New Windows machines can then be deployed, and group policies define what software and configurations exist on a group of servers. Then, if they buy Microsoft System Center Operations Manager, all the deployed servers can be automatically monitored. It's all point and click, and it's very easy.
In the Linux/Unix world, it takes a lot of planning and learning new tools to get the same functionality. You might run Puppet or Chef for configuration management, and then write special scripts in your deployment system (Cobbler or a home-grown system) to automatically add servers to your monitoring systems, such as Nagios or Zenoss. Once you've done this, however, everything else that gets added to the configuration management system is far more powerful and useful than anything that can be done with Microsoft servers. Only after many weeks spent designing, learning, and implementing the fundamental systems are sysadmins finally able to get real work done.
The time spent is surely worth it, but what if it didn't take a top 10 percent type of sysadmin, one who recognizes the payoff of a fully automated infrastructure and has the focus and determination to make it all work? What would the Linux and Unix server world look like if that were the easy part?
I am not advocating for a feature-limited and GUI-based infrastructure-in-a-box type of solution. Instead, I'm looking at all these critical infrastructure systems and wondering why it takes so much time to get them all jibing with each other. There is no possible way that a one-size-fits-all solution could work, because the Linux world allows (requires) a lot of customization. There are, however, benefits to getting everything tightly integrated.
I am talking about Self Healing, primarily.
How often does information from a monitoring system lead to action on the system administrator's part? Almost always. Say a virtual machine server is running low on RAM: we need to deploy a new one and move some VMs to it. Or we notice that our Web application servers are heavily utilized: we deploy a new Web server or two to spread the CPU load. These decisions are often made while looking at the monitoring system, so it makes sense to be able to deploy new servers from there.
The concept of "Self Healing" is scary, but it doesn't have to be automated. Ideally, we'd start by making note of the criteria that went into a particular decision. "Application servers were at 75 percent CPU all day, so I deployed a new one." That logic is simple and could even be automated. There might be another reason for high load which would require sysadmin intervention, but at the very least it would be a huge time saver to simply hit a button and deploy a new server.
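To make that concrete, here is a minimal sketch of what "making note of the criteria" could look like. It is not tied to any particular monitoring tool; the `ScaleRule` class, the `recommend` function, and the group name are all hypothetical, and I'm assuming "all day" means every CPU reading was at or above the threshold. It only suggests an action, leaving the button press to the sysadmin.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ScaleRule:
    """One written-down decision: 'if this group stays busy, add a server'."""
    group: str            # hypothetical group name, e.g. "app"
    cpu_threshold: float  # percent CPU that counts as "busy"
    min_samples: int      # how many readings we need before trusting the data


def recommend(rule: ScaleRule, cpu_samples: Sequence[float]) -> Optional[str]:
    """Return a suggestion for the sysadmin, or None if no action is needed."""
    if len(cpu_samples) < rule.min_samples:
        return None  # not enough data to say the load was sustained
    if min(cpu_samples) < rule.cpu_threshold:
        return None  # load dipped below the threshold at some point
    return (f"{rule.group} servers stayed at or above "
            f"{rule.cpu_threshold:.0f}% CPU: deploy one more?")


rule = ScaleRule(group="app", cpu_threshold=75.0, min_samples=3)
print(recommend(rule, [78.0, 81.5, 76.2]))  # sustained load, so a suggestion
print(recommend(rule, [78.0, 40.0, 90.0]))  # one quiet reading, so None
```

Returning a message rather than acting on it keeps a human in the loop, which is the less scary starting point.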
To do this, the monitoring and configuration management systems need to have the same idea about "types" and "groups" of servers.
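One way to picture a shared idea of "groups" is a single table that both systems read from. The sketch below is purely illustrative: the group names, class names, and template names are made up, and neither Puppet nor any monitoring system stores its data in this form.

```python
from typing import Dict, List, TypedDict


class GroupDef(TypedDict):
    config_classes: List[str]  # what configuration management applies
    monitor_template: str      # what the monitoring system checks


# Hypothetical group names, class names, and template names.
SERVER_GROUPS: Dict[str, GroupDef] = {
    "web": {"config_classes": ["base", "webserver"],
            "monitor_template": "http-checks"},
    "db": {"config_classes": ["base", "database"],
           "monitor_template": "db-checks"},
}


def lookup(group: str) -> GroupDef:
    """Both systems resolve a group name through the same table."""
    try:
        return SERVER_GROUPS[group]
    except KeyError:
        raise ValueError(f"unknown server group: {group!r}") from None


print(lookup("web")["config_classes"])
print(lookup("web")["monitor_template"])
```

With one definition per group, a "web" server means the same thing whether you are configuring it or watching it.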
Some integration does currently exist, in that a sysadmin could cobble it all together to make deploying new servers a bit less manual.
The deployment system, Cobbler, can manage kickstart and all the net booting services, and it even talks to Puppet. Cobbler can tell Puppet to add a new node and which classes it should be a member of, i.e. what type of server it is.
Once the server is running, Puppet can then inform Zenoss or Nagios that it has a new node which needs to be monitored. Puppet can supply information about the node, informing the monitoring system what "group" or type of server it should add. From this information, the new server is added and the appropriate monitoring begins.
But that is the extent of the information flow; it is one-way. Cobbler informs Puppet, and Puppet informs Zenoss (or Nagios). As I've already pointed out, Zenoss and Nagios have useful information that can be used to inform the systems in the other direction.
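The flow described above, plus the missing return path, can be sketched as a toy model. This is not how Cobbler, Puppet, Zenoss, or Nagios actually talk to each other; the `Pipeline` class and its method names are hypothetical stand-ins, and the hostname numbering is just an assumption to keep the example short.

```python
from typing import Dict, List


class Pipeline:
    """Toy stand-in for deployment -> configuration -> monitoring."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.monitored: Dict[str, str] = {}  # hostname -> group

    def deploy(self, hostname: str, group: str) -> None:
        # Forward step 1: deployment tells config management about the node.
        self.log.append(f"deploy -> config: {hostname} is a {group} server")
        self.configure(hostname, group)

    def configure(self, hostname: str, group: str) -> None:
        # Forward step 2: config management tells monitoring what to watch.
        self.log.append(f"config -> monitor: watch {hostname} as {group}")
        self.monitored[hostname] = group

    def request_capacity(self, group: str) -> str:
        # The missing reverse step: monitoring asks for another server.
        count = sum(1 for g in self.monitored.values() if g == group)
        hostname = f"{group}{count + 1:02d}"
        self.log.append(f"monitor -> deploy: need another {group} server")
        self.deploy(hostname, group)
        return hostname


pipeline = Pipeline()
pipeline.deploy("web01", "web")
new_host = pipeline.request_capacity("web")
print(new_host)
print("\n".join(pipeline.log))
```

The only new piece is `request_capacity`; everything it triggers is the same one-way flow we already have.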
What we have now is very loose. We need tighter integration between these systems, but of course the systems themselves need to stay distinct and loosely coupled. If someone were to attempt to create a single system that covered all of this functionality, the project would surely fail. It wouldn't be flexible enough, and more importantly it wouldn't be robust enough. Each of these systems is extremely complex, and the people working in each area are the ones who understand its needs.
I spoke with Luke Kanies, author of Puppet and CEO of Reductive Labs, to ask what he thought the long-term landscape should look like. Luke agreed that these systems need to work together more closely, adding, "In the long term, the integration should be much tighter - I'd like to see monitoring tools functionally being the integration tests for services."
Many sysadmins create loosely integrated systems themselves. In Web-based management GUIs, you often see link-backs and embedded frames used to loosely tie these systems together. We aren't anywhere near the utopian world where every developer makes sure their software speaks to the popular monitoring systems, or where sysadmins all go as far as automating the entire infrastructure and creating sentient self-healing systems.
We are getting closer, and now that these types of tools have matured a bit and are seeing widespread adoption, it's time to start talking about the next steps.