Stack Up the Bytes with NAS, SAN and Linux
NAS (Network Attached Storage) and SAN (Storage Area Network) are all the rage these days, and with good reason. Hard drive capacity is zooming into the stratosphere, strange laws require that no business record of any kind ever be discarded for any reason, including random sticky notes and e-mailed lunch orders, and users do weird things like send simple email messages as Word document attachments and never delete anything, not even spam. It's nuts. Future historians are going to take one look at all these yottabytes of data and give up on the spot; they're not going to even try to make sense of it all.
Meanwhile, we are faced with the problem of storing, and even worse, organizing all that data. Storage is easy—retrieving stuff is hard, just like flinging junk into the attic is easy, and finding what you want later is hard. I predict that librarians and file clerks will become our new world overlords, because only they will know how to find things.
NAS and SAN: Palindrome or Different Things?
Much brainpower and resources are being devoted to the problem of building bigger attics. The major trends in networked storage are NAS and SAN. Some folks see these as two different things: NAS is ordinary packet-pushing over TCP/IP, like a LAN fileserver, and SAN is SCSI over Fibre Channel. Others see a SAN as any high-speed storage network, including one built on Gigabit Ethernet.
Some of the more interesting developments in the SAN world have to do with IP-storage protocols. When you run an ordinary Samba file server, for example, it uses the high-level file access protocol SMB/CIFS, which is an example of file-mode access. A faster way to move data is by transporting block-mode data. SANs do this in various ways; one is iSCSI, which encapsulates SCSI commands in TCP/IP packets. This lets you use less-expensive Ethernet hardware and cabling, instead of hideously expensive fibre switches, host bus adapters, and so forth. The end result is ordinary block devices that can be mounted over the network like any local block device, formatted with any Linux filesystem, and written to just as if they were local hard drives.
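To make "block-mode" a little more concrete, here is a minimal Python sketch of what addressing storage by block number looks like, as opposed to asking a file server for a named file. It is only an illustration: a temporary regular file stands in for a block device, the 512-byte block size is an assumption, and the helper names `read_block` and `make_demo_device` are made up for this example. It says nothing about how iSCSI itself moves those blocks over the wire.

```python
import os
import tempfile

BLOCK_SIZE = 512  # bytes; a common sector size, assumed here for illustration


def read_block(fd: int, block_number: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Read one fixed-size block by number, the way a block device is addressed."""
    return os.pread(fd, block_size, block_number * block_size)


def make_demo_device(num_blocks: int = 4) -> str:
    """Create a regular file standing in for a block device (hypothetical helper)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for n in range(num_blocks):
            # Fill each block with one repeated letter so blocks are easy to tell apart.
            f.write(bytes([ord("A") + n]) * BLOCK_SIZE)
        return f.name


if __name__ == "__main__":
    path = make_demo_device()
    fd = os.open(path, os.O_RDONLY)
    try:
        block = read_block(fd, 2)  # third block; no filenames or directories involved
        print(len(block), block[:4])
    finally:
        os.close(fd)
        os.remove(path)
```

On a real system the path would be a device node, and the filesystem layered on top is what turns those numbered blocks back into files and directories.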
iSCSI is flexible, allowing you to use all kinds of storage media and interoperate with diverse platforms, and there are Free (GPL) Linux implementations. (See Resources for more information.) I'm not ready to call them ready for prime-time, but admins are successfully using them in production environments. And there is a lot of industry support behind iSCSI, so it's worth getting acquainted with it.
It's still an open question which will deliver more raw speed: fibre or Ethernet. Ethernet speeds are growing exponentially. The SANs of the future will probably be like the WANs and LANs of today: hodge-podges of different technologies and protocols.
If you don't need super-duper-warp-speed performance, but can get by with ordinary poky old Gigabit Ethernet, building some nice Linux-based NAS is easy and not too expensive. Managing ordinary data storage doesn't require the latest and greatest technology, and you can build a nice one-terabyte storage server for under $600. While one terabyte sounds like a lot, keep in mind this is four 250-Gigabyte hard drives, which is not that unusual anymore. This doesn't include niceties like hot-swappable power or drives, but it's still quite a bargain. And, there are some advantages to sticking with oldfangled technology—you'll get all the customizability of Linux, and you know everything will work right.
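How much of that terabyte you actually get to use depends on how the four drives are combined. The sketch below applies the usual textbook capacity rules for common RAID levels to identical drives. The function name and the level labels are made up for illustration, the minimum drive counts are conventional ones rather than limits enforced by any particular tool, and real arrays lose a little more to metadata and filesystem overhead.

```python
def usable_capacity_gb(level: str, drives: int, drive_gb: int) -> int:
    """Return usable capacity in GB for an array of identical drives."""
    minimum = {"raid0": 2, "raid1": 2, "raid5": 3, "raid6": 4, "raid10": 4}
    if level not in minimum:
        raise ValueError(f"unknown RAID level: {level}")
    if drives < minimum[level]:
        raise ValueError(f"{level} needs at least {minimum[level]} drives")

    if level == "raid0":
        return drives * drive_gb          # striping only, no redundancy
    if level == "raid1":
        return drive_gb                   # every drive holds the same data
    if level == "raid5":
        return (drives - 1) * drive_gb    # one drive's worth of parity
    if level == "raid6":
        return (drives - 2) * drive_gb    # two drives' worth of parity
    # raid10: striped pairs of mirrors, assuming two copies of each block
    if drives % 2:
        raise ValueError("this sketch assumes an even number of drives for raid10")
    return (drives // 2) * drive_gb


if __name__ == "__main__":
    for level in ("raid0", "raid1", "raid5", "raid6", "raid10"):
        print(level, usable_capacity_gb(level, drives=4, drive_gb=250), "GB")
```

With four 250-gigabyte drives, only plain striping gives you the full terabyte; anything with redundancy trades some of it away.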
Get yourself a good quality full-tower server case with a good power supply, add a PCI-Express motherboard, a Gig-E NIC, a reasonably-fast CPU, lots of RAM, and stuff it with a batch of SATA hard drives. If the motherboard supports only two, which is typical, a 4-port PCI SATA controller can be had for less than $60.
PCI-Express is very different from the old PCI, and will eventually replace both it and AGP. It is software-compatible with PCI, and many PCI-E motherboards still carry a few old-style PCI slots, so you won't have to chuck all of your old stuff. PCI-E uses point-to-point serial links instead of a shared bus, so each device gets its own dedicated connection. A device that needs more bandwidth gets more lanes, which is why you'll see slots of different sizes on motherboards, like PCI-Express x1, x4, x8, and x16. A first-generation PCI-E x16 slot can theoretically move about 4 gigabytes per second in each direction, or 8 GB/s counting both directions.
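The theoretical figure for an x16 slot can be worked out from the per-lane rate. This sketch assumes first-generation PCI-Express, which signals at 2.5 gigatransfers per second per lane and uses 8b/10b encoding, so 8 of every 10 bits on the wire are payload. The function names are made up for illustration, and real-world throughput will be lower once protocol overhead is counted.

```python
GEN1_MEGATRANSFERS_PER_LANE = 2500  # 2.5 GT/s raw signalling rate per lane
PAYLOAD_BITS = 8                    # 8b/10b encoding: 8 payload bits...
ENCODED_BITS = 10                   # ...carried in every 10 bits on the wire


def lane_mb_per_sec() -> int:
    """Theoretical payload rate of one lane, one direction, in megabytes per second."""
    payload_mbit = GEN1_MEGATRANSFERS_PER_LANE * PAYLOAD_BITS // ENCODED_BITS
    return payload_mbit // 8  # bits to bytes


def link_mb_per_sec(lanes: int, both_directions: bool = False) -> int:
    """Theoretical payload rate of a link made of `lanes` lanes."""
    if lanes < 1:
        raise ValueError("a link needs at least one lane")
    per_direction = lanes * lane_mb_per_sec()
    return per_direction * 2 if both_directions else per_direction


if __name__ == "__main__":
    for lanes in (1, 4, 8, 16):
        print(f"x{lanes}: {link_mb_per_sec(lanes)} MB/s each way, "
              f"{link_mb_per_sec(lanes, both_directions=True)} MB/s both ways")
```

Even a single lane comfortably outruns a Gigabit Ethernet NIC, which tops out at roughly 125 MB/s, so for this kind of storage server the slot is not going to be the bottleneck.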
If you're going to use more than 896 MB of RAM, get a 64-bit-capable AMD Sempron and install the 64-bit version of your favorite Linux. The Sempron is a great processor with a low price tag. A 32-bit x86 kernel can directly map only the first 896 megabytes of RAM and has to reach the rest through slower high-memory mappings, while a 64-bit kernel addresses all of it directly. Lots of memory and fast disks are more important on a storage server than a super-powerful CPU.
Then use Linux RAID for redundancy and to get the most flexibility, Enterprise Volume Management System (EVMS) for freedom from the tyranny of physical hard disk and partitioning limitations, rsync over SSH for efficient, secure backups and restores, and Samba for cross-platform file sharing, and you have one nice powerhouse of a storage server.
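For the rsync-over-SSH piece, a backup run might be scripted along these lines. This is a sketch, not a finished backup tool: the host name, user, and paths are placeholders, the function names are made up, and it assumes `rsync` and `ssh` are installed and that key-based SSH login to the backup host already works. The flags are standard rsync options: `-a` preserves permissions and timestamps, `-z` compresses in transit, `--delete` removes files on the destination that no longer exist at the source, and `-e ssh` runs the transfer over SSH.

```python
import subprocess


def build_rsync_command(source_dir: str, remote_host: str, remote_dir: str,
                        dry_run: bool = True) -> list[str]:
    """Build an rsync-over-SSH command line as an argument list."""
    cmd = ["rsync", "-a", "-z", "--delete", "-e", "ssh"]
    if dry_run:
        # Show what would be transferred without changing anything.
        cmd.append("--dry-run")
    # A trailing slash tells rsync to copy the directory's contents,
    # not the directory itself.
    source = source_dir.rstrip("/") + "/"
    cmd += [source, f"{remote_host}:{remote_dir}"]
    return cmd


def run_backup(source_dir: str, remote_host: str, remote_dir: str,
               dry_run: bool = True) -> None:
    """Run the backup; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, remote_host, remote_dir, dry_run),
                   check=True)


if __name__ == "__main__":
    # Placeholder host and paths; only prints the command instead of running it.
    print(" ".join(build_rsync_command("/srv/data",
                                       "backup@storage.example.com",
                                       "/backups/data")))
```

Defaulting to a dry run is deliberate: `--delete` is unforgiving if the source and destination are swapped, so it is worth looking at what rsync intends to do before letting it loose. Restores are the same command with the source and destination reversed.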
As your storage needs grow, what do you do? In an ideal world you could add more storage servers and combine them logically as one big storage unit. As far as I know, the Linux universe does not have the ability to create and manage pooled storage. There are a large number of commercial products for this, both software and hardware; just search for "IP storage."
I have only one good idea for long-term digital storage: don't let your data be trapped in closed, proprietary file formats. Big deal if your archives survive for fifty years, and the means to read the data no longer exist. Stick with open, free, well-documented formats. Open code will save you if the documentation vanishes.
As for physical longevity, I have no good ideas. I have old 78s from the 1940s that are still playable, and home-recorded cassette tapes over 30 years old that still sound good. What's going to happen to our digital archives? I predict most of them will disappear. Which might not be such a bad thing, as their sheer size is already unmanageable.