I’ve noticed something … in recent memory, I have not suffered one single hard drive failure.
I’ve only suffered multiple hard drive failures … all my drive failures seem to happen in batches.
Last weekend the refurbished Seagate hard drive in my laptop (Rohan) started generating errors. About the same time, the main drive in Gondor started to flake out.
My laptop had recently been backed up with Ghost, so restoring it to a spare 100 GB hard drive I had wasn’t a problem. I did struggle a bit because there was a Linux partition on the replacement drive … one that Ghost didn’t know how to delete.
The drive in Gondor was a bit more problematic … although both Linux and the Dell hard drive diagnostics were reporting problems with the drive, no problems were reported when I ran Spinrite over it.
I decided to let the drive sit and see if the problems came back.
Obviously they did … this time, however, when I ran Spinrite on the drive it found a bad cluster. Luckily it was able to recover the cluster. After Spinrite was done, I copied the old drive to a new 300 GB drive. Now I just have to get Dell to send me a new drive. Not sure what I’m going to do with a spare 80 GB SATA drive.
Of course, all these hard drive problems got me to thinking … why the heck don’t operating systems raise serious alerts when a drive failure is detected?
On Windows XP, the drive problem was silently being logged to the “System Event Log”. I think it should have popped up a warning message telling me that something was wrong.
On Linux, the drive problems were also being logged to syslog … but if you aren’t actively monitoring the system logs, it’s easy to miss something like that. I’m going to investigate some system monitoring software (something like Nagios) to keep an eye on problems of this nature.
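In the meantime, a lightweight check can be as simple as scanning the log for drive-related messages. The Python below is only a rough sketch: the patterns are guesses at what kernel and smartd messages tend to look like (the exact wording varies by kernel version and driver), and the names `DRIVE_ERROR_PATTERNS`, `find_drive_errors` and `scan_log`, as well as the `/var/log/syslog` default path, are assumptions made for illustration.

```python
import re

# ASSUMPTION: these patterns are illustrative guesses. Real kernel and smartd
# messages differ between versions, so tune them against your own logs.
DRIVE_ERROR_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"\b(ata\d+|[sh]d[a-z])\b.*\b(error|failed)\b", re.IGNORECASE),
    re.compile(r"smartd.*(pending|uncorrectable|fail)", re.IGNORECASE),
]


def find_drive_errors(lines):
    """Return the lines that match any of the assumed drive-error patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in DRIVE_ERROR_PATTERNS)
    ]


def scan_log(path="/var/log/syslog"):
    """Print suspicious lines from a log file (the default path is an assumption)."""
    with open(path, errors="replace") as log:
        for hit in find_drive_errors(log):
            print(hit)
```

Calling `scan_log()` from a daily cron job would at least put the suspicious lines in front of you, since cron normally mails a job's output to its owner when local mail is set up. It is no substitute for a proper monitoring tool, but it beats finding out from a dead drive.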
First off: I am the maintainer of the rsyslog project (http://www.rsyslog.com), a drop-in replacement for syslogd. An alternative to “heavy” log monitoring solutions is to use rsyslog and have it email alerts for specific messages (e.g. anything that mentions the hard drive).
I just thought I’d mention it.
Rainer, thanks … I’ll certainly look into it. Nagios is pretty heavy, so something lighter weight might be a good solution for the short term.
I’m also going to be looking into a RAID solution of some sort.