Archive for June, 2008

SMARTy

Thursday, June 26th, 2008

HDD manufacturers invented S.M.A.R.T. some years ago.
So we should be happy, though I am not.

For one thing, there are no default error rates for attributes/thresholds, but manufacturer’s define (see also) when a drive is bad, and when it is good. Then of course they define it “to the extremities” so a drive in some cases can never go to bad SMART state even if it has constant problems. See more on this at: http://www.hdsentinel.com/smart/, from section “#1 Incorrect thresholds”.

I understand that current technology – in the microns – needs different approach than 10-15 years ago, but I fail to understand for example how a “197/C5” (Current Pending Sector Count) attribute can exist and increase without big red warnings. This means that the sector was successfully written once, but later on it was couldn’t be read (equals data loss). And this doesn’t count as an error (according to harddisk manufacturers), only an increase of an attribute (which can decrease too!). My point of view is that this is sort of the equivalent of the “old day’s” dreadful “bad sector” term. Though that time this things usually happened at write time, so you could immediately notice.

This is a picture of one of my (brand new) Samsung HD501LJ harddisks after 2 days of operation.

The second one followed it’s “path” some days later.

They were mirrored, but swap got corrupted, then ssh and console got swapped out and couldn’t make it back to the memory. So eventually I had to power off the server and since the mirror broke, I didn’t have a fully readable, “mirrorable” array or disk, so I had to do a file by file copy to new disks. Of course off peak, so it was like from 01:00 to 04:00. Was fun… [not].

I also installed a server with 8 Samsung 500 drives, eventually we had to replace all (Hitachis seem to work fine).
If you format/rewrite a harddisk with a bunch of these “read errors”, then voila: the errors go away. Then manufacturer  refuses to replace the harddrive – because of “no errors”. So we stopped selling Samsung harddisks.

I consulted my friend who recovers data from damaged disks, and he confirmed that Samsung is “experiencing problems” with the PMR technology and recommended Hitachi and Seagate drives to use. I then used then a pair consisting of a Hitachi and a Seagate drives to avoid simultaneous failure because of same technology/same time manufacturing.

“Hitachi drives use quite special own technology to park HDD heads outside of magnetic disks area to a special parking ramp. This causes HDD heads not to suffer from parking – they’re NEVER land on disk surface during parking. So, actually, Hitachi HDDs can handle a LOTS of starts/stops without any real problems.” [quoted from here] – [original hitachi article / same in html, from google cache]
Parking _on_ the platter can be seen here (picture 1 and 2).

Even if your server runs 24/7 in a server room with proper power and climate, it can happen that you stop your server and it’s harddisk[s] would never spin up again – because of the contact with the drive’s surface it can get stuck in the dirt (then might even fell off at a restart).

Additionally meanwhile most manufacturers (Hitachi/IBM, Seagate and even Samsung) use embedded servo on all platters nowadays, some models have only one servo information for all platters (“Format Disk with Servo Tracks Once, Use Servo Information with Many Heads“) which makes an occasional recovery less possible because even when a professional disassembles a faulty drive, the platters can move, then chances to recover anything from those platters without servo information is near to impossible.

So kids, avoid Samsung drives for the time being…