Monday, March 26, 2007

NetApp Disk Replacement

When a Network Appliance filer (e.g. hooke.math) loses a disk, NetApp ships us a replacement disk. Console access is required for the replacement. If you don't know the password, it's in the usual place. To replace the dead disk:

find the id of the dead disk.

The sysconfig -r command on the console reveals the needed id in the section at the end titled "Broken disks".

locate the physical drive.

The drive ids are of the form controller.unit. The newer controllers are labelled (e.g. 4a and 4b). The older controller (0) isn't. The drives themselves aren't labelled; the numbering goes right to left starting at 0.
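As a rough illustration of that numbering, here is a small Python sketch that splits an id such as 4a.3 into controller and unit, then converts the unit into a position counted from the left of the shelf. The function names and the 14-bay shelf width are assumptions made up for the example, and it assumes the unit number is simply the bay index on a single shelf; check the actual cabinet before pulling anything.

```python
import re

# Ids have the form controller.unit, e.g. "4a.3" or "0.5".
# The helper names below are hypothetical, chosen for this example only.
DRIVE_ID = re.compile(r"^([0-9]+[a-z]?)\.([0-9]+)$")


def parse_drive_id(drive_id):
    """Split an id such as '4a.3' into ('4a', 3)."""
    match = DRIVE_ID.match(drive_id.strip())
    if match is None:
        raise ValueError(f"not a controller.unit id: {drive_id!r}")
    return match.group(1), int(match.group(2))


def position_from_left(unit, bays):
    """Return the 1-based position of a unit, counted from the left.

    Units are numbered right to left starting at 0, so unit 0 is the
    rightmost bay. `bays` is the number of bays in the shelf (assumed).
    """
    if not 0 <= unit < bays:
        raise ValueError(f"unit {unit} does not fit in a {bays}-bay shelf")
    return bays - unit


if __name__ == "__main__":
    controller, unit = parse_drive_id("4a.3")
    # Assumed shelf width of 14 bays; adjust to the real hardware.
    print(controller, unit, position_from_left(unit, bays=14))
```

With the assumed 14 bays, unit 3 on controller 4a is the 11th drive from the left, i.e. the 4th from the right.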

pick a time for the replacement.

The replacement can happen very quickly (say 20 seconds), so we've not been booking downtime for it. But it's polite to pick a time when the load is low, with an advance announcement of a potential pause in file service while the drive is being replaced.

replace the drive.

The electrical transients caused by the swap shouldn't cause a problem with i/o to other disks, because it's supposed to be a "hot swap" cabinet. But just in case, i/o is temporarily halted for the swap. On the console, the disk swap command will stop disk i/o to allow the swap.

The disk container is plastic, so static isn't much of a concern, but it's always polite to electronics to keep oneself grounded. Touching the copper strip around the disk slot once the old drive is removed can do that. Do the swap as quickly as is practical, and few will even notice.

Once the swap has happened, the filer will notice, and resume normal operation. The log entries once the swap starts can look like:

Tue Oct  5 08:25:40 EDT [disk_config_admin]: *** NOTICE ***
  A disk has been swapped (removed or added) to a modular
  storage shelf.  The system will wait 15 seconds and
  then check the status of all disk drives.
Tue Oct  5 08:26:26 EDT [disk_config_admin]: *** NOTICE ***
  Disk unit status check has completed.

If the swap takes more than about 30 seconds, disk activity will resume (notice the blinking lights on the disks) even if the replacement disk hasn't been inserted. If that happens, run another disk swap command before inserting the new disk.
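To check afterwards how long file service was actually paused, the timestamps on the two notices can be compared. The Python sketch below does that for the log excerpt shown earlier. The function names are hypothetical, and it assumes both notices were logged on the same day and that the last notice marks the end of the status check.

```python
import re

# Log excerpt copied from the filer messages quoted above.
LOG = """\
Tue Oct  5 08:25:40 EDT [disk_config_admin]: *** NOTICE ***
  A disk has been swapped (removed or added) to a modular
  storage shelf.  The system will wait 15 seconds and
  then check the status of all disk drives.
Tue Oct  5 08:26:26 EDT [disk_config_admin]: *** NOTICE ***
  Disk unit status check has completed.
"""

# Matches the leading "Tue Oct  5 08:25:40 EDT [disk_config_admin]" part
# and captures hours, minutes and seconds.
STAMP = re.compile(
    r"^\w{3} \w{3} [ \d]\d (\d{2}):(\d{2}):(\d{2}) \w+ \[disk_config_admin\]"
)


def notice_times(log_text):
    """Return seconds since midnight for each disk_config_admin entry."""
    times = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            hours, minutes, seconds = (int(part) for part in match.groups())
            times.append(hours * 3600 + minutes * 60 + seconds)
    return times


def swap_duration(log_text):
    """Seconds between the first and last notice (same day assumed)."""
    times = notice_times(log_text)
    if len(times) < 2:
        raise ValueError("need at least two disk_config_admin entries")
    return times[-1] - times[0]


if __name__ == "__main__":
    print(swap_duration(LOG), "seconds")
```

For the excerpt above this reports 46 seconds between the swap notice and the completed status check.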

ship back the dead drive to Network Appliance.

We use the same box and packing they shipped the replacement drive in. There are usually instructions in the box for shipping at their expense. Wendy usually takes care of this.

Documentation for the disk command may be found via man na_disk on math.