Sunday, May 17, 2015

RAID Problems

RAID is good. Right? Keep on reading.

First read this article in The Register.

tl;dr - "Rebuild times are so long that the chances of an unrecoverable read error (URE) occurring are dangerously high."

Here's the long version.

RAID 5 uses a parity set to recover from a failed drive. The problem is that spinning disks are getting larger. A rebuild has to read the surviving drives in full, so bigger drives mean a bigger chance of an unrecoverable read error (URE) occurring during the rebuild of a failed drive from the parity set. If that happens, the rebuild fails and your data is gone.

The math behind this is really complicated but here's the punch line:
"Consumer magnetic disk error rate is ... an error every 12.5TB."
Now let's look at that for today's big drives.
"Putting this into rather brutal context, consider the data sheet for the 8TB Archive Drive from Seagate. This has an error rate of [one per] 10^14 bits [read]. That is one URE every 12.5TB. That means Seagate will not guarantee that you can fully read the entire drive twice before encountering a URE."
Gulp!
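
To check that 12.5TB figure, here's a quick sketch of the arithmetic in Python. It assumes decimal units (1TB = 10^12 bytes, the way drive vendors count) and reads the data sheet number as one URE per 10^14 bits read. The function names are mine, not from the article.

```python
# Sketch: turn a data-sheet URE spec into "terabytes read per error".
# Assumption: decimal units, 1 TB = 10**12 bytes = 8 * 10**12 bits.
BITS_PER_TB = 8 * 10**12


def tb_per_ure(bits_per_error):
    """Terabytes read, on average, per unrecoverable read error."""
    return bits_per_error / BITS_PER_TB


def full_reads_per_ure(drive_tb, bits_per_error):
    """How many end-to-end reads of one drive fit into that many terabytes."""
    return tb_per_ure(bits_per_error) / drive_tb


consumer = 10**14  # one URE per 10^14 bits read
print(tb_per_ure(consumer))             # 12.5
print(full_reads_per_ure(8, consumer))  # 1.5625
```

So on an 8TB drive the spec covers about one and a half full reads, which is where "can't read it twice" comes from.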

How big are the drives in your RAID? Mine are 2TB consumer class.
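
For a rough feel of what that means for a rebuild, here's a minimal sketch. The assumptions are all mine: a hypothetical 4-drive RAID 5, a rebuild that reads every surviving drive end to end, and UREs arriving independently at exactly the data-sheet rate of one per 10^14 bits (a Poisson approximation). Real drives and real RAID boxes will differ.

```python
import math

BITS_PER_TB = 8 * 10**12  # decimal terabyte


def p_ure_during_rebuild(drive_tb, n_drives, bits_per_error=10**14):
    """Chance of at least one URE while rebuilding one failed drive.

    Assumes the rebuild reads all (n_drives - 1) surviving drives in full
    and that errors are independent at the given rate.
    """
    tb_read = drive_tb * (n_drives - 1)
    expected_errors = tb_read * BITS_PER_TB / bits_per_error
    # P(at least one error) = 1 - exp(-expected_errors)
    return -math.expm1(-expected_errors)


# Hypothetical 4-drive arrays: 2TB drives vs. 8TB drives.
print(f"{p_ure_during_rebuild(2, 4):.0%}")  # roughly 38%
print(f"{p_ure_during_rebuild(8, 4):.0%}")  # roughly 85%
```

Under those assumptions even modest 2TB drives give a rebuild that is far from a sure thing, and 8TB drives make it more likely to fail than not.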

What's a person to do?

Buy more expensive drives.
"Enterprise magnetic disk error rate is ... an error every 125TB."
That reduces the URE rate by an order of magnitude.

The elapsed time of the rebuild is still problematic. Granted, a Drobo (1st generation) is not an enterprise-class RAID system, but in my experience its rebuild time is in excess of 24 hours per TB.

Or buy SSDs.
"Enterprise SSD error rates are ... an error every 12.5PB."
That gets you another two orders of magnitude. But those are expensive.
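
To put the three classes side by side, here's a small sketch that uses the "an error every ..." figures quoted in this post. The 6TB read size is a made-up example (say, three surviving 2TB drives), and the model assumes independent errors at exactly the quoted rates, which is a simplification.

```python
import math

# Terabytes read per URE, taken from the figures quoted in this post.
TB_PER_URE = {
    "consumer disk": 12.5,
    "enterprise disk": 125.0,
    "enterprise SSD": 12500.0,  # 12.5PB
}


def p_ure(tb_read, tb_per_ure):
    """Chance of at least one URE while reading tb_read terabytes."""
    return -math.expm1(-tb_read / tb_per_ure)


REBUILD_TB = 6.0  # hypothetical amount of data read during a rebuild

for name, tb in TB_PER_URE.items():
    print(f"{name:16s} {p_ure(REBUILD_TB, tb):.3%}")
```

The consumer number comes out somewhere around one in three, the enterprise disk around one in twenty, and the SSD down in the hundredths of a percent.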

There are also alternative RAID modes that help: RAID 6 (double parity) gives more protection, and RAID 10 (striped mirrors) gives better recovery time.
