Backup vs RAID

Backups are a crucial part of our digital life. Every computer, from a giant database server to a single personal computer or mobile device, needs a backup: a place where the most relevant user data can be stored for a long time, in such a manner that it is recoverable in the time of need. We can draw a distinction between the data on our currently running system, let's call it live data, and the backed up data. The latter is stored away from the system that is using the live data.

RAID concerns itself with the live data. It is a mechanism with which a running system combines multiple disks into a single storage entity. The data is then spread across all the disks in such a way that it can survive the failure of one or more of the physical disks. The simplest type of RAID array is RAID1, or mirroring. This is where you copy (or mirror) the same data across two or more disks such that if one of the disks fails, the data survives and can still be actively used. There are other RAID configurations as well, and we will discuss those as we go along.

About RAID

RAID, or Redundant Array of Inexpensive Disks, is a mechanism to store data across disks. There is a wide “array” of RAID setups that you can go with, but the two basic mechanisms that they are all based on are the following:

1. Mirroring:

Mirroring implies that your data blocks are copied, mirrored, across multiple disks. If you mirror your data across three disks, you can survive up to two disks failing at any given time, and the failed disks can then be replaced with new ones without much hassle. Similarly, if you copy data across n+1 disks, you can withstand up to n disks failing. The downside of this is that you only get storage capacity equal to that of the smallest disk in your RAID array.

2. Parity:

A second approach is to split your data into two parts and use these two blocks of user data to compute a third 'parity' block. The three blocks are all of the same size and are spread across different devices. A minimum of three devices is necessary for this configuration to work. If any one of the disks fails, you can recreate the block stored on that disk using the other two blocks. For example, if the second user block is lost, the first block and the parity block can be used to compute the second user block. If you are interested in how this works check out this wonderful explanation.
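To make the recovery step concrete, here is a minimal sketch. It assumes the parity block is the byte-wise XOR of the two user blocks, which is how single-parity schemes such as RAID 5 commonly compute it; the block contents and the helper name xor_blocks are made up for illustration.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


# Two user blocks (illustrative contents) and the parity block derived from them.
block1 = b"RAID"
block2 = b"DATA"
parity = xor_blocks(block1, block2)

# Pretend the disk holding block2 has failed:
# XOR-ing the surviving user block with the parity block rebuilds it.
recovered = xor_blocks(block1, parity)
assert recovered == block2
```

The same call works in the other direction: XOR-ing block2 with the parity block gives back block1, so it does not matter which of the three disks is lost.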

This method can be improved upon further to have 2 or even 3 parity blocks, but more than 3 parity blocks aren't seen in the industry that often. If you have one parity block you can survive one disk failure. Two parity blocks mean you can withstand two disks failing, and so on.

Parity is more efficient than mirroring in terms of storage utilization. With one parity block for every two data blocks, you only need 50% more physical storage than the user data you are storing. This means that to store 1GB of data you will need 1.5GB of storage (plus a small overhead for the metadata). This is way more efficient than even the most efficient mirroring scheme, where you need at least 2GB of storage to mirror 1GB of data between two disks.
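The arithmetic behind those numbers can be sketched as follows. The function name raw_storage_needed is hypothetical, and the calculation deliberately ignores the metadata overhead mentioned above.

```python
def raw_storage_needed(user_gb: float, data_blocks: int, redundancy_blocks: int) -> float:
    """Physical storage needed to hold user_gb of data, ignoring metadata.

    data_blocks:       blocks of user data in each group
    redundancy_blocks: extra parity blocks (or extra mirror copies) per group
    """
    return user_gb * (data_blocks + redundancy_blocks) / data_blocks


# Two data blocks plus one parity block, as described above.
parity_raw = raw_storage_needed(1.0, data_blocks=2, redundancy_blocks=1)

# Two-way mirror: one data block plus one full copy of it.
mirror_raw = raw_storage_needed(1.0, data_blocks=1, redundancy_blocks=1)

print(parity_raw, mirror_raw)  # 1.5 2.0
```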

The downside is that random write operations are going to be slowed down, thanks to the extra bit of computation and the extra write operation associated with the parity block. Also, the reliability isn't as good as that of a mirror across n+1 disks, where you can prepare for any arbitrary number of disks failing.

RAID configurations can be as complex or as simple as you like them to be. You can combine the parity and mirroring strategies and modify them to your enterprise's liking. There are dedicated RAID controllers to which you connect your physical disks, and the OS then sees a single logical disk as presented by the controller. LSI is one such vendor of RAID controllers. You can also perform RAID in software, and OpenZFS is probably the best bet you have in that regard.

One last kind of RAID that gets an honourable mention is RAID 0. Technically, it is not a RAID scheme, because there is no redundancy involved. The idea behind RAID 0 is to simply spread your data across multiple storage devices without any resilience against disk failures. The advantage is that you get performance improvements by doing this. If you are writing 1GB of data to a single disk, the process is slow. The disk can only do a limited number of write operations per second, and your OS has to wait for it to finish each operation before new data is sent its way. If you spread the same 1GB of data across two such disks, you can write to (and read from) both of them simultaneously and gain quite a bit of performance.
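As a rough illustration of striping, the sketch below deals fixed-size chunks out to the disks in round-robin order and then reassembles them. The in-memory lists standing in for disks, the chunk size and the function names stripe and unstripe are all assumptions made for the example; a real array does this at the block-device level and issues the per-disk I/O in parallel.

```python
from typing import List


def stripe(data: bytes, n_disks: int, chunk_size: int) -> List[List[bytes]]:
    """Deal fixed-size chunks of data out to n_disks in round-robin order."""
    disks: List[List[bytes]] = [[] for _ in range(n_disks)]
    for offset in range(0, len(data), chunk_size):
        chunk_index = offset // chunk_size
        disks[chunk_index % n_disks].append(data[offset:offset + chunk_size])
    return disks


def unstripe(disks: List[List[bytes]]) -> bytes:
    """Read the chunks back in the order they were dealt out."""
    chunks = []
    rows = max(len(disk) for disk in disks)
    for row in range(rows):
        for disk in disks:
            if row < len(disk):
                chunks.append(disk[row])
    return b"".join(chunks)


disks = stripe(b"abcdefghij", n_disks=2, chunk_size=2)
# Disk 0 holds chunks 0, 2, 4 and disk 1 holds chunks 1, 3.
assert disks == [[b"ab", b"ef", b"ij"], [b"cd", b"gh"]]
assert unstripe(disks) == b"abcdefghij"
```

Note that neither disk holds a complete copy: lose either list and the original data cannot be reassembled, which is exactly the missing redundancy described above.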

Backups

The concept of backups is arguably more important than that of RAID. A backup, in the context of storage management, is a known good copy of data, from a given point in time, from which you can restore files back into your main system when needed. In terms of implementation, there are many cloud hosted solutions, and many offline ones as well, that can be used.

Tarsnap and Backblaze are my favorite managed backup services for both private and business use cases. You can also include Google Drive, iCloud or Dropbox in this definition of a backup solution, but they are targeted more towards the consumer market than the enterprise. However, the underlying principle is still the same. When you sign in to a new iPhone or iPad, all the data (your contacts, photos, media library, etc.) is synced from your iCloud account seamlessly, and as you keep using your device the newer data gets silently backed up to the cloud without you having to worry about it.

Your backup solution can be as simple as copying data to an external hard disk, or using rsync (or zfs send, if you are using OpenZFS) to periodically generate a copy of all the relevant information. This could include your Documents folder, your database, your source repository or even your entire root file system packed into a single zip or tarball. The important criteria that a good backup solution should meet are the following:
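For instance, the tarball variant can be sketched with nothing but Python's standard library. The function name make_backup, the timestamped file name and the example paths are assumptions for illustration, and this produces a full copy every time rather than the incremental transfer that rsync or zfs send would give you.

```python
import tarfile
import time
from pathlib import Path


def make_backup(source_dir: str, backup_dir: str) -> Path:
    """Pack source_dir into a timestamped .tar.gz inside backup_dir."""
    source = Path(source_dir)
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # A timestamp in the name keeps older backups from being overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname stores paths relative to the folder, not the full absolute path.
        tar.add(str(source), arcname=source.name)
    return archive


# Example call (hypothetical paths):
# make_backup("/home/alice/Documents", "/mnt/usb-disk/backups")
```

Running this from a scheduler such as cron would give you periodic copies; the criteria that follow are what decide how often to run it and how many archives to keep.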

Backups should occur often - If you back up data every month, instead of every week, you risk losing up to a month's worth of data when disaster strikes.
Your backups should go back in time - The backup storage is finite. Sometimes you have to throw away older backups. The more storage you have, the further back your backups can go. Suppose you back up your data weekly, but throw away backups older than 2 weeks. If a file gets accidentally deleted, and this goes unnoticed for more than two weeks, you won't have a way to bring it back.
Your files should actually be restorable - If you have never tried recovering your data from the backup, you don't have a backup. You should not have to learn how to recover data at the critical time when you have just suffered a data loss. Plan ahead and know how to restore the system from the last known good backup.
Your backup should be segregated from the running system - When disaster strikes, and all the files on your production server get encrypted, deleted or corrupted, you need to make sure that the same doesn't happen to your backup. One good way of ensuring this is to make sure your backup device is not 'connected' to your production environment, i.e., unplug your USB hard disk or unmount your NFS file system when you are done backing up. At the very least, don't give the production system the privilege to overwrite or modify your backup data. Make it read-only.
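The retention example from the second point can be worked through in a few lines. This is only a sketch under stated assumptions: the dates are invented, the function names are hypothetical, and a backup is assumed to contain the file only if it was taken strictly before the day the file was deleted.

```python
from datetime import date, timedelta


def retained_backups(backup_dates, today, keep_days):
    """Backups that have not yet been thrown away under the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return [d for d in backup_dates if d >= cutoff]


def can_recover(deleted_on, backup_dates, today, keep_days):
    """True if some retained backup was taken before the file was deleted."""
    kept = retained_backups(backup_dates, today, keep_days)
    return any(d < deleted_on for d in kept)


# Weekly backups, and a deletion that is only noticed 17 days later.
weekly = [date(2024, 1, 1) + timedelta(weeks=i) for i in range(5)]
today = date(2024, 1, 29)
deleted_on = date(2024, 1, 12)

print(can_recover(deleted_on, weekly, today, keep_days=14))  # False
print(can_recover(deleted_on, weekly, today, keep_days=28))  # True
```

With a 14 day window, every backup that still contained the file has already been discarded; doubling the window keeps two of them around.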

Now that we know a little bit about both RAID and backup, let's highlight some differences between them.

Files and Blocks

RAID is always concerned with blocks of data, not with how the filesystem presents that data to the user. Both software and hardware RAID deal with data as blocks of information, and the size of those blocks depends on the configuration, for example anywhere from 128 KiB to 1 MiB.

Backups, on the other hand, are much more flexible. They are usually performed at the file system level, although there is no hard and fast rule that this has to be the case. They are also more granular. You can restore a single file from your backup, if your solution is flexible enough. RAID arrays are not backups, they are just a way to spread data across multiple disks. If a file is deleted, all its mirrored blocks and parity blocks are freed. End of story.

Use Cases

Backups are for everyone. The approach and extent may vary from personal use to the enterprise, but everyone with a digital life needs a backup. RAID is more of a business/enterprise specific feature. You see RAID arrays in servers, storage devices like NAS and SANs, cloud hypervisors, etc. Pretty much any place that stores live critical data uses some form of RAID. Even the servers that run your cloud hosted backups probably use RAID arrays. These are not mutually exclusive technologies.

This doesn't mean you can't use RAID for your personal use case, it just has more utility in the enterprise. Part of the reason behind this is that in the enterprise, disks are pounded with IO operations 24/7. In a production environment, like the storage behind a database, a video streaming service or a cloud hypervisor, the storage devices of your server will be under constant, grueling load. Data is constantly being read from and written to these devices, often by several applications simultaneously. In these conditions your drives are much more likely to fail. Having a RAID configuration means that if a drive fails you suffer little or no downtime. Most servers can continue to operate even after a disk failure, so you don't lose the new information and requests coming in every second.

An average desktop computer can hardly recreate the same stressful conditions. Even if the disk dies, if you are using a backup solution like Backblaze you can retrieve most of your lost data, and losing a few hours' worth of work is probably the worst thing that can happen. Even this is becoming a rarity thanks to cloud hosted solutions like Adobe Creative Cloud, Office 365, etc.

RAID is not a substitute for Backup

If there is a single takeaway you want from this article, it should be this: RAID is NOT a substitute for Backup. Always back your data up! There are many people out there who think that if you have RAID, the data is safe across multiple disks and so there is no need to back it up. Nothing could be further from the truth. RAID is meant to deal with a single specific issue - the disks failing or giving back erroneous data. Having RAID won't protect you from a million other threats like the following:

User errors and accidental deletions
Application or OS bugs causing widespread data corruption
Ransomware or other malware encrypting, deleting or corrupting your data
Failure of RAID controllers themselves

The data on your RAID array is live. If the OS, an application (or a user) goes haywire and deletes a few files here and there, then those files will be deleted all across your RAID array. Having a separate copy of your data, a backup, is the only way you can ever protect yourself against this kind of scenario.
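A toy model makes the difference visible. Everything here is illustrative: the MirroredStore class is hypothetical, dictionaries stand in for disks, and the 'backup' is simply a copy taken at one point in time and kept outside the array.

```python
class MirroredStore:
    """Toy two-way mirror: every write and delete hits all disks."""

    def __init__(self, n_disks=2):
        self.disks = [{} for _ in range(n_disks)]

    def write(self, name, data):
        for disk in self.disks:
            disk[name] = data

    def delete(self, name):
        for disk in self.disks:
            disk.pop(name, None)

    def fail_disk(self, index):
        del self.disks[index]

    def read(self, name):
        for disk in self.disks:  # any surviving disk will do
            if name in disk:
                return disk[name]
        raise FileNotFoundError(name)


store = MirroredStore(2)
store.write("report.txt", b"quarterly numbers")
backup = dict(store.disks[0])  # point-in-time copy, stored away from the array

# A disk failure is survivable: the other mirror still has the file.
store.fail_disk(0)
survived_disk_failure = store.read("report.txt") == b"quarterly numbers"

# A deletion is not: it is applied to every remaining disk.
store.delete("report.txt")
try:
    store.read("report.txt")
    deleted_everywhere = False
except FileNotFoundError:
    deleted_everywhere = True

restored = backup["report.txt"]  # only the backup can bring the file back
```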

Conclusion

If you are worried about your data, your first concern should be a backup solution. Most desktop users, except maybe power users, should invest more in a reliable backup instead of fiddling with RAID1, RAID5 or RAIDZ. If you want to build your own backup server, you need to think of a decent backup policy and a reliable storage backend. This article may be a good place to start. You can use rsync or zfs send to take periodic copies of your data to this backend.

If you are in the enterprise and are considering a RAID solution to store all of your live data, consider using OpenZFS. It offers a very flexible solution, with everything from n-disk mirroring to RAIDZ1 with one parity block to RAIDZ2 and RAIDZ3 with 2 and 3 parity blocks. You need to consider a lot about your application's requirements before making a decision. There are trade-offs between read-write performance, resilience and storage efficiency. However, I would recommend that you only think of RAID after you have decided upon a backup solution.