Ocean Mist

22 Jun 2008

Hard Drives and Spontaneous Reboots

Posted by astromme

I haven’t had much luck with hard drives in the past few weeks. After having a 500GB drive die on me while I was in Norway, I now have a computer that loves to spontaneously reboot whenever a write gets about 30% of the way into an IDE drive.

I first noticed the latter problem while investigating a software-based RAID1 (mirrored) solution. The array was created successfully and I was able to transfer 80GB of data over (these are 120GB drives, so well over 50% of the drive). Then I wanted to simulate a device failure and a RAID1 rebuild:

mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1

Then, /proc/mdstat showed a degraded RAID1 array (as it should),

md0 : active raid1 sdc1[0]
117242240 blocks [2/1] [U_]

Then I tried to re-add the device,

mdadm /dev/md0 --add /dev/sdb1

Checking /proc/mdstat again showed Linux happily rebuilding the array. I come back after 20 minutes to check /proc/mdstat, only to find that the computer has restarted itself in the meantime and the RAID didn’t get rebuilt.
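(Rather than re-reading /proc/mdstat by hand, the rebuild can be followed continuously. A quick sketch, assuming `watch` and `mdadm` are available:)

```shell
# Refresh /proc/mdstat every 2 seconds to follow the rebuild progress.
watch -n 2 cat /proc/mdstat

# Or query the array state directly; mdadm reports a "Rebuild Status"
# line with the percent complete while recovery is running.
mdadm --detail /dev/md0 | grep -i 'state\|rebuild'
```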

So I try again, wondering what could be the matter, this time watching /proc/mdstat. At 30-some percent I suddenly get a beep, some sort of ATA error (it flashes by for less than a second), and the computer promptly restarts itself.
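(Since the error flashes by too quickly to read, it helps to make it survive the reboot. A sketch, assuming syslog is writing kernel messages to /var/log/kern.log — the exact path varies by distro:)

```shell
# Append a timestamped snapshot of /proc/mdstat to disk every 10 seconds,
# so the last state before the crash is preserved.
while sleep 10; do
    { date; cat /proc/mdstat; } >> /root/mdstat.log
    sync  # flush to disk so the log survives the sudden reboot
done &

# After the machine comes back up, dig the ATA error out of the kernel log:
grep -i 'ata\|exception\|error' /var/log/kern.log | tail -n 20
```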

Great, I think, do I have a bad drive? Better to find out now rather than after storing crucial data on it (and spending the time to get it all set up). Out of curiosity, I reformat the drive as ext3 and write zeros through the filesystem rather than at the device level:

dd if=/dev/zero of=/mnt/sdb1

When I return in an hour, this has completed successfully. Quite odd. I forget about the raid array and am busy with other work for a few days.
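(For a more systematic check than a single dd pass, a destructive write test exercises every block with several patterns. A sketch, assuming the suspect drive really is /dev/sdb and everything on it is expendable:)

```shell
# DESTRUCTIVE: badblocks -w writes four test patterns across the whole
# drive and reads each one back, logging any blocks that fail
# (-s shows progress, -v is verbose).
badblocks -wsv /dev/sdb > /root/badblocks-sdb.log 2>&1

# SMART counters are another hint; nonzero reallocated or pending
# sector counts point at the drive rather than the chipset.
smartctl -a /dev/sdb | grep -i 'reallocated\|pending'
```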

Yesterday I returned to the test computer to investigate rdiff-backup as a backup solution. (Note: I really like the idea of rdiff-backup, and I’m inclined to write a KIO slave for it so recovery is easy from KDE.) I start the rdiff-backup of the 80GB of data and let it run overnight. This morning I find out that the machine rebooted itself less than 30 minutes after I started the backup, at least according to uptime. I then realize that I’m using the flaky drive/IDE chipset/whatever-it-is that gave me RAID problems earlier.

My next step is to remove the IDE drive, put in a 250GB SATA drive, and try the same thing. I have a worrying feeling that it’s the chipset. The source drive for this backup is SATA and the destination was IDE; that combination caused reboots for both the RAID setup and the rdiff-backup setup. However, using the same IDE drive with dd and /dev/zero did NOT cause a reboot. Also, I can read the data from the SATA drive just fine. (Or can I? I should try reading everything back to /dev/null.)
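(A sketch of that read-back test: checksumming every file forces a full end-to-end read of the source drive, and the manifest doubles as a way to verify the backup later. `/data` here stands in for wherever the SATA drive is mounted:)

```shell
# Read every file end-to-end; an unreadable sector surfaces as an
# I/O error from md5sum rather than a silent success.
find /data -type f -exec md5sum {} + > /root/data.md5

# The same manifest can later verify the rdiff-backup copy:
md5sum -c --quiet /root/data.md5
```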
