Wap Review Downtime

I apologize to all my readers for the extended downtime. Apparently there was a massive hardware failure early Wednesday at 1and1, my hosting provider. According to an email I received from them:

“The web server that your site is hosted on has been offline due to an apc power failure of 3 server racks. While all the other affected servers came back online, this web server did not recognize its RAID subsystem.

RAID stands for “Redundant Array of Independent Disks” and is a technology that employs the simultaneous use of two or more hard disk drives to achieve greater levels of reliability and performance.

Your website is stored across the RAID system twice over different hard drives, if one of the hard drives fails your web site will continue to run. The failed hard drive is replaced and the data that was on the drive copied again from the other drives within the RAID, this is known as rebuilding the RAID, and normally happens seamlessly without any effect to the web hosting server or your website. This is a daily task performed in our data centers and is standard for large data storage systems such as used in the web hosting environment.

In this instance, we replaced the failed drive with a new drive and the RAID started to rebuild. While this was happening the rebuild process failed, corrupting all the data within the RAID set. This should not happen and we have open tickets with the RAID manufacturer to understand what went wrong in this case and to ensure that they can prevent this for the future.

Our system administrators do not rely on the RAID system as our only source of backup. We run a rolling backup of the live system to external backup servers to ensure that in a case like this we have a restore solution.

After the RAID corruption occurred, our engineers analyzed the situation and found that the only solution left to us was to recover the data from our backup systems. At this point the RAID was reinitialized ready to receive data, this process itself takes several hours to perform.

We are currently copying and restoring the data from our backup systems to the web hosting server that your site runs from. The restore process takes time and will be finished not before Saturday afternoon.”
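
For readers wondering what "stored twice" and "rebuilding the RAID" mean in practice, here is a toy sketch in Python of a two-disk mirror. It is only an illustration under my own assumptions: the `Mirror` class and its method names are made up for this post, the "disks" are plain lists, and none of it reflects how 1and1's actual RAID hardware works.

```python
class Mirror:
    """Toy two-disk mirror: every block is written to both disks."""

    def __init__(self, num_blocks):
        # Each "disk" is a list of blocks; None marks a failed disk.
        self.disks = [[b""] * num_blocks, [b""] * num_blocks]

    def write(self, block, data):
        # Store the same data on every disk that is still working.
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def read(self, block):
        # Any surviving disk can serve the read.
        for disk in self.disks:
            if disk is not None:
                return disk[block]
        raise IOError("all disks failed; restore from backup")

    def fail(self, index):
        # Simulate a drive dying.
        self.disks[index] = None

    def rebuild(self, index):
        # Swap in a new drive and copy every block from the survivor.
        survivor = self.disks[1 - index]
        if survivor is None:
            raise IOError("no surviving disk to rebuild from")
        self.disks[index] = list(survivor)


mirror = Mirror(num_blocks=4)
mirror.write(0, b"index.html")
mirror.fail(0)          # one drive dies
print(mirror.read(0))   # the site keeps running from the other drive
mirror.rebuild(0)       # the replacement drive is filled from the survivor
print(mirror.disks[0] == mirror.disks[1])
```

Note that `rebuild` depends entirely on the surviving copy being readable. If the rebuild itself goes wrong, as the email says it did here, the mirror has nothing left to copy from and the external backups are the only way back.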

While the email makes it sound like a rare perfect storm of unrelated cascading hardware failures, this is the third multi-day outage that I’ve experienced since September. It’s now late Saturday evening and my 1and1 shared host is still down. It has now been down for 89 hours. I’ve used 1and1 for the last five years. For the first four of those years it was the perfect hosting provider: cheap, fast and very reliable. In the last twelve months the reliability has disappeared. It is time for me to say goodbye to 1and1. I’ve moved the site to HostGator, which several people have recommended highly. If you are reading this, the transition was successful. Please leave a comment if you find anything that is not working.

5 thoughts on “Wap Review Downtime”

  1. Sorry to hear about your troubles with 1and1. I wonder how many times they have had to use that same form letter!

    I gave up on 1and1 and moved everything to HostGator.com and am really glad I did. Since the move my sites are faster and I’ve experienced no significant downtime. I’m also impressed with HostGator’s technical support. Unlike 1and1’s clueless support message takers, with HostGator I get prompt support from knowledgeable technicians 24/7.

    I hope your site is up soon.

  2. I thought you would find this email from 1&1 interesting… EXACTLY the same spiel they sent you. I wonder what REALLY happened… My webspace has now been “down” for four days and counting.
    -Mike

    Dear Michael,

    On December 10th at 2:00am EST, our technicians discovered a malfunction in the server that hosts your site. Currently, your website is still offline.

    However, we have isolated the malfunction and are working to restore your service as quickly as possible. The restoration of your website will be complete tonight around 10:00pm EST. You will not experience any loss of data.

    We know that you trust us with one of your most important resources, your website, and we are truly sorry for the negative impact that this outage has had on your business or personal web presence.

    Please be assured that we have done everything within our capacity in handling this matter. Our technicians understand the priority of this, and have followed all procedures in place to ensure limited downtime for your site, which is our primary concern.

    We are investigating any improvements we can make to our processes for the future to ensure that an incident like this does not develop to a major outage.

    ***************************************
    Technical Information:
    ***************************************

    The web server that your site is hosted on has been offline due to a hardware failure in the RAID setup.

    RAID stands for “Redundant Array of Independent Disks” and is a technology that employs the simultaneous use of two or more hard disk drives to achieve greater levels of reliability and performance.

    Your website is stored across the RAID system twice over different hard drives, if one of the hard drives fails your web site will continue to run. The failed hard drive is replaced and the data that was on the drive copied again from the other drives within the RAID, this is known as rebuilding the RAID, and normally happens seamlessly without any effect to the web hosting server or your website.

    This is a daily task performed in our data centers and is standard for large data storage systems such as used in the web hosting environment.

    In this instance, we replaced the failed drive with a new drive and the RAID started to rebuild. While this was happening the rebuild process failed, corrupting all the data within the RAID set. This should not happen and we have open tickets with the RAID manufacturer to understand what went wrong in this case and to ensure that they can prevent this for the future.

    Our system administrators do not rely on the RAID system as our only source of backup. We run a rolling backup of the live system to external backup servers to ensure that in a case like this we have a restore solution.

    After the RAID corruption occurred, our engineers analyzed the situation and found that the only solution left to us was to recover the data from our backup systems. At this point the RAID was reinitialized ready to receive data, this process itself takes several hours to perform.

    Sincerely,
    Your 1&1 Internet Team
    1&1 Internet Inc.
    http://1and1.com

  3. 1and1 Internet has an F rating with the Pennsylvania Better Business Bureau, which is their lowest possible rating.
    BBB Rating
    Based on BBB files, this business has a BBB Rating of F
    Reasons for this rating include:
    • Number of complaints filed against business
    • Failure to respond to complaints filed against business
    • Number of complaints filed against business that were unresolved
    • Overall complaint history with BBB

  4. At first I thought you had pissed off the China internet censors … and they had blocked your website. Nice to see you back in the saddle. We missed U :-)

    Kent
