"As soon as is practical after the event, document not only the problem, but the solution, too. " |
Sooner or later it happens to even the most well-protected systems: Something burns out, wears out, falls over, or crashes. Although your systems can be protected by redundancy, backups, and fault-tolerant devices, there will always be something that is either beyond your control or simply unforeseen. As a systems manager or administrator, coming through these failures successfully depends on your approach, your attitude, and, of course, your technical ability. This article offers tips that can help you keep your head while all around you are losing theirs.
Be prepared:
Before a problem even occurs, compile a list of technical support resources, such as phone numbers, Web sites, and contacts for the products you use. Don’t just save this list on the server, because that system could be down–keep a paper copy along with the system documentation. You are likely to need both in the event of a system failure.
Know where to get replacement parts:
Find a reliable–and preferably local–source for obtaining new parts or products in a hurry. The middle of a system crash is no time to be reaching for the Yellow Pages. Make sure the company you find is helpful, is knowledgeable, and maintains a good stock of spare parts or even spare units for the equipment you use. Once you have found such a company, set up an account so that you can order parts quickly and easily.
Make sure you have working backup systems:
It sounds obvious, but many people just assume that their backup systems are working. You should make sure that your backup systems are reliable–and, more importantly, restorable. Faulty equipment can be replaced or repaired, but the loss of data is a little harder to remedy. Knowing that you have a reliable backup can make a big difference when tackling a system failure.
Stay calm:
When a failure does occur, no matter how bad it seems, don’t panic or let yourself appear overly flustered by the situation. This is the time when both your managers and your peers need to be reassured by your ability to deal with the problem. At the same time, though, try not to appear nonchalant, or as if you have all the time in the world. The very same people who want to see your calm attitude also want to sense that you are keen to cure the problem as quickly as possible.
Manage user expectations:
In some situations, you can easily estimate the length of time it will take to fix a problem; in other situations, you can’t. For example, if the power supply in a server has failed, you should have a general idea of how long it will take to obtain a new power supply and replace it. Tell the users your estimates, but be very realistic. It is far better to say that it will take an hour and live with the moans and groans than to say three times that you’ll need another 20 minutes. If you don’t know how long the problem will take to fix, give the users some idea of when you may know. Then, do everything you can to get an accurate picture of what has happened and make a time estimate.
Don’t just think of the obvious:
When things go wrong, it is easy to become focused on the obvious solution when an alternative may be quicker and easier. Using the scenario of a failed power supply as an example, rather than just thinking about how quickly you can obtain and fit a new power supply, think about whether you can take one out of a less important machine and swap them over. This approach may not always be possible, but widening your thought process can often pay dividends.
Keep lines of communications open:
Management and users will be far more understanding if they know a little of what the problem is. When explaining the problem to them, put it into words that they will understand–but never underestimate their capability to comprehend a problem.
Consider your options:
Before you take any action, consider your various options and how likely they are to succeed. If you are lucky enough to be working within a team, canvass ideas from all team members. Then, as a team, go through each option and discuss which ones to try. Once you have selected an approach, delegate responsibility for certain tasks, if practical. Not only does doing so reduce the workload on you, but it can also serve to build team morale and productivity.
Keep track of what you do:
When attempting a fix, keep notes of what steps you take and any changes you make. These notes can be very useful when you document the problem and will also make it easier to backtrack, if necessary. In addition, record keeping can help you avoid wasting time and effort re-trying steps that you have already tried.
Test, and then test again:
Once you think you have solved the problem, test the solution thoroughly. If at all possible, test from a user’s perspective, from their workstation or terminal. Just because you can now access the system from your PC does not mean that they can.
Predict further failure:
With the problem solved, you should look immediately at whether the problem is likely to reoccur. This step is particularly relevant when the problem seemed to go away on its own, rather than as a result of anything you did–if it went away on its own, it can come back on its own. If you’ve isolated the fault to a particular item, consider whether that fault may affect other systems. If you think it might, consider making the same changes on these other systems.
Document the problem:
As soon as is practical after the event, document not only the problem, but the solution, too. Do it while the information is still fresh in your mind, and try to write down as much detail as you can. Although the same problem may never reoccur, knowing how you fixed it may help you with another situation at a later date. Don’t just rely on memory, because you might forget what you did–and, the next time the system goes down, you may be on a well-earned vacation.
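One way to make the note-taking and documentation habits above concrete is a small script that records each step as you take it and then prints a report you can file with the system documentation. The following is only a minimal sketch: the `IncidentRecord` class, its method names, and the "file server" example are hypothetical illustrations, not part of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """Hypothetical record of one outage: the problem, each step tried, and the fix."""

    system: str
    problem: str
    steps: list = field(default_factory=list)
    solution: str = ""

    def log_step(self, action, outcome, when=None):
        """Note a step taken during the fix, with its outcome and a UTC timestamp."""
        when = when or datetime.now(timezone.utc)
        self.steps.append((when, action, outcome))

    def already_tried(self, action):
        """Return True if this action was logged before (case-insensitive match)."""
        wanted = action.strip().lower()
        return any(a.strip().lower() == wanted for _, a, _ in self.steps)

    def report(self):
        """Render the record as plain text suitable for printing and filing."""
        lines = [
            f"System:   {self.system}",
            f"Problem:  {self.problem}",
            "Steps taken:",
        ]
        for when, action, outcome in self.steps:
            lines.append(f"  {when:%Y-%m-%d %H:%M} UTC  {action} -> {outcome}")
        lines.append(f"Solution: {self.solution or '(not yet recorded)'}")
        return "\n".join(lines)


# Illustrative data only, based on the failed power supply scenario.
record = IncidentRecord(system="file server", problem="Server will not power on")
record.log_step("Checked power cable and outlet", "no change")
record.log_step("Swapped in power supply from a less important machine", "server boots")
record.solution = "Replaced failed power supply; ordered a spare"
print(record.report())
```

Checking `already_tried` before each new attempt is a cheap guard against repeating a step, and the printed report can go into the same paper folder as the support contact list, so it is still readable if the server is down again.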
Conclusion
System failures and outages are times when you must demonstrate not only your technical abilities, but also your diplomatic and communication skills. By combining thorough preparation, a methodical approach, and attention to detail during and after the event, you will be able to respond to such situations with maximum effectiveness. This professional approach will also serve to increase your recognition as an individual in whom people have confidence when the chips, or indeed the systems, are down.
Drew Bird has been working in the IT industry for over 12 years. After starting his career as a mainframe operator, he quickly moved into the networking arena and has since had a variety of roles including network systems analyst, networking consultant, and instructor. Drew currently works as a freelance instructor and consultant.