I apologize to all my readers for the extended downtime. Apparently there was a massive hardware failure early Wednesday at 1and1, my hosting provider. According to an email I received from them:
“The web server that your site is hosted on has been offline due to an apc power failure of 3 server racks. While all the other affected servers came back online, this web server did not recognize its RAID subsystem.
RAID stands for “Redundant Array of Independent Disks” and is a technology that employs the simultaneous use of two or more hard disk drives to achieve greater levels of reliability and performance.
Your website is stored across the RAID system twice over different hard drives, if one of the hard drives fails your web site will continue to run. The failed hard drive is replaced and the data that was on the drive copied again from the other drives within the RAID, this is known as rebuilding the RAID, and normally happens seamlessly without any effect to the web hosting server or your website. This is a daily task performed in our data centers and is standard for large data storage systems such as used in the web hosting environment.
In this instance, we replaced the failed drive with a new drive and the RAID started to rebuild. While this was happening the rebuild process failed, corrupting all the data within the RAID set. This should not happen and we have open tickets with the RAID manufacturer to understand what went wrong in this case and to ensure that they can prevent this for the future.
Our system administrators do not rely on the RAID system as our only source of backup. We run a rolling backup of the live system to external backup servers to ensure that in a case like this we have a restore solution.
After the RAID corruption occurred, our engineers analyzed the situation and found that the only solution left to us was to recover the data from our backup systems. At this point the RAID was reinitialized ready to receive data, this process itself takes several hours to perform.
We are currently copying and restoring the data from our backup systems to the web hosting server that your site runs from. The restore process takes time and will be finished not before Saturday afternoon.”
While the email makes it sound like a rare perfect storm of cascading hardware failures, this is the third multi-day outage that I’ve experienced since September. It’s now late Saturday evening and my 1and1 shared host is still down. It has now been down for 89 hours. I’ve used 1and1 for the last five years. For the first four of those years it was the perfect hosting provider: cheap, fast and very reliable. In the last twelve months that reliability has disappeared. It is time for me to say goodbye to 1and1. I’ve moved the site to Hostgator, which several people have recommended highly. If you are reading this, the transition was successful. Please leave a comment if you find anything that is not working.