RAID brings down second grid: What is it, and how to protect yourself

(Image courtesy David Goehring via Flickr.)

OSgrid, the oldest OpenSim grid, was brought down by RAID last Monday. Today, The Next Reality Grid was brought down by RAID, as well.

“Like OSgrid, our RAID has also failed completely,” said Next Reality owner Mike Hart in an announcement today. “Luckily we do have backup OARs of all the regions. But not sure about our registered avatars details in the database. It’s going to be a few days before the RAID can be replaced and everything restarted again.”

Does your grid have RAIDs?

It probably does!

Stephen Colbert realizes his grid has RAIDs. (No, just kidding. He’s having a totally unrelated panic attack.)

What is RAID?

Although it sounds like a fatal grid disease — or a toxic spray — RAID is actually a kind of storage. The acronym stands for “redundant array of independent disks.”

A typical RAID storage array. (Image courtesy Newegg.)

The idea is that RAID storage fails less often than a single hard drive, because of the “redundant” part.

Even grids that use other types of infrastructure may be using RAID somewhere.

Avination, for example, a large commercial grid, bases its central infrastructure on cluster technology, grid owner Melanie Thielker told Hypergrid Business.

“Avination uses a cluster to provide a group of identical machines, each able to take over for any of the others,” she said. “However, these machines still do use RAID internally because that is the easiest and most effective way to recover from hard disk failure.”

Not all RAIDs are created equal

The two most popular types of RAID are RAID 0 and RAID 1.

RAID 0 uses multiple drives to store data, in such a way that a single file might be chopped up into pieces and each piece stored on a different drive. The advantage of doing it this way is that you can save the file faster: while the first drive is saving the first half of the file, say, the other drive is saving the second half. You can get the file back faster, too. But it doesn’t provide a built-in backup.

RAID 1 also uses multiple drives to store data, but it stores a full copy of every file on each disk. Loading data can still be faster, because if one drive is busy you can read from the other one. As a bonus, if one drive goes down, you still have all your data on the other drive. The downside is that you only get half as much storage capacity from the same hardware.
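
To make the difference concrete, here is a minimal Python sketch of the two layouts. It is a toy model, not how any real RAID controller is implemented: the function names, the four-byte chunk size and the 500 GB drive size are all assumptions chosen for illustration.

```python
def stripe(data, n_drives, chunk=4):
    """RAID 0 style: deal fixed-size chunks out to the drives in turn."""
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), chunk):
        drives[(i // chunk) % n_drives] += data[i:i + chunk]
    return [bytes(d) for d in drives]


def mirror(data, n_drives):
    """RAID 1 style: every drive holds a full copy of the data."""
    return [data for _ in range(n_drives)]


def usable_capacity(level, n_drives, drive_size_gb):
    """Usable space for the two levels discussed here (illustrative only)."""
    if level == 0:
        return n_drives * drive_size_gb   # every drive adds space
    if level == 1:
        return drive_size_gb              # the extra drives are copies
    raise ValueError("only RAID 0 and RAID 1 are modeled in this sketch")


data = b"ABCDEFGHIJKLMNOP"
print(stripe(data, 2))   # each drive holds only part of the file
print(mirror(data, 2))   # each drive holds the whole file
print(usable_capacity(0, 2, 500), usable_capacity(1, 2, 500))
```

With two 500 GB drives, the striped layout offers 1,000 GB of usable space and the mirrored layout 500 GB, which is the "half as much storage" trade-off.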

According to Zetamex CEO Timothy Rogers, RAID 0 offers no real advantage for OpenSim.

(Image courtesy SERT Data Recovery Services.)

In fact, if even one drive goes down, it means the whole system goes down. Take the example of a file where half of it is stored on one drive and half stored on a second drive. If the first drive goes down, you’re left with just half a file — a very difficult situation to recover from. In this sense, having RAID 0 is worse than having no RAID at all.

Zetamex itself uses RAID 1 on all its servers, he said. If one drive is down, all the data is still there, fully copied on the other drive.
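
The failure behavior described here can be simulated in a few lines of Python. This is only an illustrative sketch: a failed drive is represented by `None`, and the helper names and chunk size are assumptions, not part of any real RAID software.

```python
def read_striped(drives, chunk=4):
    """Reassemble a RAID 0 style stripe; one missing drive is fatal."""
    if any(d is None for d in drives):
        raise OSError("a drive in the stripe failed: file cannot be rebuilt")
    out = bytearray()
    offsets = [0] * len(drives)
    i = 0
    # Take one chunk from each drive in turn until all drives are read.
    while any(offsets[k] < len(drives[k]) for k in range(len(drives))):
        k = i % len(drives)
        out += drives[k][offsets[k]:offsets[k] + chunk]
        offsets[k] += chunk
        i += 1
    return bytes(out)


def read_mirrored(drives):
    """Read a RAID 1 style mirror; any one surviving copy is enough."""
    for d in drives:
        if d is not None:
            return d
    raise OSError("all mirrors failed")


striped = [None, b"EFGHMNOP"]            # first drive of the stripe is dead
mirrored = [None, b"ABCDEFGHIJKLMNOP"]   # first drive of the mirror is dead

print(read_mirrored(mirrored))           # full file, from the second copy
try:
    read_striped(striped)
except OSError as err:
    print("RAID 0:", err)
```

The mirrored read succeeds from the surviving drive, while the striped read has only half the chunks to work with and gives up.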

But Zetamex doesn’t just rely on the RAID mirroring for backups, he added.

“We have fail-safes for our fail-safes,” Rogers said. “We back up to one on-site server at every data center — this means we have a spare server that sits in every data center we have that is just used for file backups. Then we have three off-site backup locations: Amazon S3 Glacier, Google Cloud Storage, and CrashPlan.”

Zetamex backs up to all three of these providers daily, he said. “I believe that if X goes out of business or something, I have others to fall back on.”

Rogers urges other grid owners to make regular backups as well. CrashPlan Pro, for example, costs just $10 a month to back up a server.

“You just point it at your OpenSim directory and your MySQL directory and let it run in the background and never have to touch it again,” he said. “It works on Windows, Macs, and Linux.”

For those looking to do it themselves, he recommends the following three steps:

  1. Make a daily export of all OpenSim databases, including Robust
  2. Save OARs — region backup files — regularly
  3. Keep all these files in a separate, secure location
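
The three steps above could be scripted along the following lines. This is a rough sketch under several assumptions, not Zetamex's actual procedure: the database names `opensim` and `robust`, every directory path, and the function names are hypothetical; MySQL credentials are assumed to live in an option file such as `~/.my.cnf`; and the OAR files are assumed to have been saved already with the OpenSim console command `save oar`.

```python
import datetime
import shutil
import subprocess
from pathlib import Path


def mysqldump_command(databases, out_dir, today=None):
    """Step 1: build a mysqldump call that writes a dated SQL export."""
    today = today or datetime.date.today()
    out_file = Path(out_dir) / f"opensim-db-{today.isoformat()}.sql"
    cmd = [
        "mysqldump",
        "--single-transaction",        # consistent snapshot for InnoDB tables
        f"--result-file={out_file}",
        "--databases", *databases,     # hypothetical names, adjust to your grid
    ]
    return cmd, out_file


def copy_to_separate_location(files, target_dir):
    """Step 3: copy the backup files to a separate location."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(f, target)) for f in files]


def run_daily_backup():
    """Run steps 1 to 3 once. All paths below are placeholders."""
    cmd, dump = mysqldump_command(["opensim", "robust"], "/var/backups/opensim")
    subprocess.run(cmd, check=True)    # raises if the export fails
    # Step 2: pick up OAR files previously written with "save oar".
    oars = sorted(Path("/var/backups/opensim/oars").glob("*.oar"))
    return copy_to_separate_location([dump, *oars], "/mnt/offsite/opensim")
```

Nothing runs until `run_daily_backup()` is called, for example from a daily cron job. The target directory only counts as a "separate, secure location" if it is on different hardware, such as a mounted remote volume.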

Not all RAIDs are created equal, Part 2

Keeping track of which files are saved to which disks in the RAID system is tricky. One way to do this is with software. Another is to use dedicated hardware in the form of special RAID controller cards, which offer slightly better performance.

Dierk Brunner

Dreamland Metaverse CEO Dierk Brunner told Hypergrid Business that the most likely cause of the problems at OSgrid and Next Reality is that their RAID controller cards failed.

“If that happens you need a new, compatible RAID controller as replacement,” he said. “For compatibility, the RAID controller must not only be a product of the same hardware vendor, it must be the same kind of RAID controller card, often even with a similar firmware revision number. And that is where most problems start and I guess these grids currently experience such a problem.”

In the beginning, he said, Dreamland Metaverse used software-based RAID, and the impact on performance was minimal. “The advantage is that replacing failed disks is easy and there is no risk of failing RAID hardware that cannot be replaced by a compatible new card.”

Brunner said that his servers currently have hardware-based RAID controllers, but the data centers that host the servers guarantee that they have compatible RAID controller replacements in stock.

“Once, we had such a RAID failure and within less than a day we could replace the failed card with another new one,” he said.

Brunner said that the recent incidents underscore the need for backups, instead of relying on RAID alone.

“We do backups once a day,” he said. “And once a week we store backup data on a server at another geographic location. This way even bigger catastrophic events such as fires or earthquakes only result in a data loss of maximum seven days. Usually the lost changes are for less than 24 hours.”
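
The arithmetic behind those two figures can be checked with a short Python sketch. The dates, the midnight backup time and the function name are assumptions made up for the example. Only the daily and weekly intervals come from Brunner's description.

```python
from datetime import datetime, timedelta


def last_backup_before(moment, every_days, first_backup):
    """Most recent scheduled backup at or before `moment`.

    Backups are assumed to run every `every_days` days, starting at
    `first_backup`.
    """
    period = timedelta(days=every_days)
    completed = (moment - first_backup) // period   # whole periods elapsed
    return first_backup + completed * period


first_backup = datetime(2014, 9, 1, 0, 0)   # hypothetical schedule start
disaster = datetime(2014, 9, 7, 23, 0)      # hypothetical failure time

# Server fails, data center survives: restore from the daily backup.
print(disaster - last_backup_before(disaster, 1, first_backup))

# Whole site is lost: only the weekly off-site copy is left.
print(disaster - last_backup_before(disaster, 7, first_backup))
```

In this example the daily backup leaves 23 hours of changes unrecoverable, while falling back to the weekly off-site copy loses six days and 23 hours, just under the seven-day worst case.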

Maria Korolov