Beyond Backups: The Next Steps for Fault Tolerance, Pt. 2
In the first part of this two-part article, I discussed the difference between preventing data loss and preventing downtime. In this second part, I will look at the measures you can use to make these targets a reality.
Studies by Intel show that a full 50% of all component failures can be attributed to hard disk failure, which explains the depth to which RAID has been developed. A few years ago RAID was the domain of large organizations with big budgets, but the declining price of server hardware now makes a very strong argument for a RAID implementation in any server. As mentioned in part one, disk mirroring (RAID Level 1) provides a very cost-effective way of protecting data, and it reduces downtime in the event of hard disk failure because a failed mirror is relatively easy to rebuild. Disk duplexing, which is simply disk mirroring with the drives attached to separate controllers, is also a valid measure. However, an adapter card is ten times less likely to fail than a drive, so, as we discussed in part one, the cost of this fault-tolerant measure must be weighed against the protection it provides.
Moving up the scale, even disk striping with parity (RAID Level 5) need not be limited to high-budget implementations. The street price of a good quality RAID controller is around $500. If you add, say, three 20GB SCSI drives at $300 each, you have a complete RAID 5 solution for less than $1500.
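To make that arithmetic explicit, here is a minimal sketch in Python. The function name raid5_summary is my own invention for illustration. The prices and drive sizes are the figures quoted above, and the capacity calculation relies on the standard property of RAID 5 that one drive's worth of space is given over to parity.

```python
def raid5_summary(controller_cost, drive_cost, drive_count, drive_size_gb):
    """Return (total_cost, usable_gb) for a single RAID 5 array.

    Hypothetical helper for illustration only; prices are whatever
    you pass in, not a quote from any vendor.
    """
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    total_cost = controller_cost + drive_cost * drive_count
    # One drive's worth of capacity is consumed by parity information.
    usable_gb = (drive_count - 1) * drive_size_gb
    return total_cost, usable_gb


# Figures from the article: $500 controller, three 20GB drives at $300 each.
cost, usable_gb = raid5_summary(500, 300, 3, 20)
print(f"Total cost: ${cost}, usable capacity: {usable_gb}GB")
```

With the figures above, the total comes to $1400 for 40GB of usable, parity-protected space. Adding a fourth drive under the same assumptions would raise usable capacity to 60GB for another $300.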
While we are on the subject of hard drives and failure, it is worth mentioning that performance is not the only reason SCSI is preferred in servers. SCSI drives typically have a much higher Mean Time Between Failures (MTBF) than their IDE equivalents. A longer MTBF for any device that is attached to the server translates into prevention rather than cure.
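For a rough feel of what an MTBF figure means in practice, the sketch below converts it into the chance of a failure within one year of continuous operation. It assumes a constant failure rate (the exponential model), which is a simplification, and the two MTBF values are hypothetical numbers chosen only to show the comparison, not specifications for any real SCSI or IDE drive.

```python
import math

HOURS_PER_YEAR = 24 * 365  # continuous, round-the-clock operation


def annual_failure_probability(mtbf_hours):
    """Probability of at least one failure in a year of continuous use.

    Assumes a constant failure rate (exponential model), which is a
    simplification of how real drives age.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# Hypothetical MTBF figures, for illustration only.
for label, mtbf in (("lower-MTBF drive", 500_000), ("higher-MTBF drive", 1_000_000)):
    print(f"{label}: {annual_failure_probability(mtbf):.2%} chance of failure per year")
```

Under this model, doubling the MTBF roughly halves the yearly chance of failure for a single drive, and the benefit compounds across every drive in the server.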
If drives are half the problem (literally), there are still many more accidents waiting to happen to your server. Next on the name-and-shame list are power supplies, which, again according to Intel, are the cause of 28% of the problems. Although the obvious answer to the power supply problem is to add a second supply, many cases and system boards do not accommodate dual power supplies. If you find yourself in exactly this situation and do not have the facilities to upgrade the case and motherboard, you can take other steps.
A more practical approach is to overspecify your power supply so that it is not working as hard. A supply operating at 60% of capacity is less likely to fail than one operating at 95% of capacity. Because power supplies are relatively inexpensive, ordering a spare unit may also be a practical solution. Although you don't get fault tolerance, you do get the possibility of a fast switchover.
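As a rough guide to sizing, the sketch below works out the smallest supply rating that keeps a given load at or below a target utilization. The function name min_psu_rating, the 240W load, and the assumption that supplies are sold in 50W steps are all mine, chosen for illustration. Only the 60% and 95% figures come from the discussion above.

```python
def min_psu_rating(load_watts, target_percent=60, step_watts=50):
    """Smallest supply rating, in multiples of step_watts, that keeps
    load_watts at or below target_percent of the supply's capacity.

    Hypothetical helper; uses integer arithmetic to avoid rounding surprises.
    """
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be between 1 and 100")
    if load_watts <= 0 or step_watts <= 0:
        raise ValueError("load_watts and step_watts must be positive")
    # Ceiling division: number of steps needed so that
    # load / rating <= target_percent / 100.
    steps = -(-(load_watts * 100) // (target_percent * step_watts))
    return steps * step_watts


load = 240  # hypothetical measured system load in watts
print(f"Run at 60% or less: {min_psu_rating(load, 60)}W supply")
print(f"Run at 95% or less: {min_psu_rating(load, 95)}W supply")
```

For the assumed 240W load, the 60% target calls for a 400W supply, whereas a 300W unit would satisfy the 95% target. The difference in rating is the headroom you are buying.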

