
Summary



VStor servers run on the ZFS file system. If a drive becomes unhealthy, it can be replaced using the procedure below.

Step By Step



Possible States of ZFS Devices

Run the command “zpool status” to determine the health of the ZFS pool. The command reports the aggregate state of the pool as well as the state of each VDEV and drive beneath it. Below are some of the states that you may encounter.
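
For a quick overall check, ‘zpool status -x’ prints only a short summary when every pool is healthy and full detail for pools with problems; on a healthy system it typically reports “all pools are healthy”. A minimal sketch (the pool name vpool3 matches the examples later in this document):

# One-line health summary across all pools
zpool status -x

# Detailed status for a single pool
zpool status vpool3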

ONLINE
All devices are in the ONLINE state; this includes the pool, the VDEVs, and the drives beneath them.

OFFLINE
Only disk drives can be OFFLINE. This is an administrative state: a drive that has been taken offline can later be “onlined” and actively brought back into the pool (see the example after this list).

UNAVAIL
The device or VDEV cannot be opened. This is roughly equivalent to a faulted disk. If a top-level VDEV is in this state, the pool will not be accessible.

DEGRADED
A device has faulted or been taken offline, but sufficient replicas exist and the pool is still operable; redundancy is reduced or lost.

REMOVED
The device has been detected as removed.

FAULTED
Both the VDEVs and the drives of a pool can be in a FAULTED state, in which case the affected device is completely inaccessible. If a top-level VDEV faults, the pool itself becomes inaccessible.

INUSE
This state is reserved for the case where a spare replaces a faulted drive.
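
For reference, the OFFLINE state described above is entered and exited administratively with the ‘zpool offline’ and ‘zpool online’ commands. A minimal sketch, using the pool name and a device ID from the examples below (only take a drive offline if the pool has sufficient remaining redundancy):

# Administratively take a drive offline; the pool stays up if redundancy allows
zpool offline vpool3 scsi-36000c29695a814f39e33c36dee8c100c

# Bring the same drive back into the pool; ZFS resilvers any writes it missed
zpool online vpool3 scsi-36000c29695a814f39e33c36dee8c100c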

Drive Replacement Procedure

Below is an example of replacing a drive in a ZFS pool made up of two RAIDZ2 groups, as reported by ‘zpool status’.

Healthy Pool

[root@eh-centos-vsnap198 ~]# zpool status
  pool: vpool3
 state: ONLINE
  scan: resilvered 23.5K in 0h0m with 0 errors on Wed Mar  7 15:25:40 2018
config:
 
        NAME                                        STATE     READ WRITE CKSUM
        vpool3                                      ONLINE       0     0     0
          raidz2-0                                  ONLINE       0     0     0
            scsi-36000c29746d921bca14fb9c73513d776  ONLINE       0     0     0
            scsi-36000c29695a814f39e33c36dee8c100c  ONLINE       0     0     0
            scsi-36000c294c40b18a83e880530b14ffc52  ONLINE       0     0     0
            scsi-36000c29d07a0ded097910901edfa4ce4  ONLINE       0     0     0
            scsi-36000c29dc85e7adc7d10a171b00cda17  ONLINE       0     0     0
            scsi-36000c2934fbda7c1dbb2946cafae7e90  ONLINE       0     0     0
            scsi-36000c29eab355532cdd8a321af882903  ONLINE       0     0     0
          raidz2-1                                  ONLINE       0     0     0
            scsi-36000c2941b16e2fbc1e63dea59c81ff3  ONLINE       0     0     0
            scsi-36000c296fcb7ff9c1a2967b163f381a2  ONLINE       0     0     0
            scsi-36000c29911c2a5d39f044cf0e3093b4a  ONLINE       0     0     0
            scsi-36000c297c3c161c5ece0299e79f0cdc5  ONLINE       0     0     0
            scsi-36000c29cb48660757c4ea71d0c98e917  ONLINE       0     0     0
            scsi-36000c2957fe280fc3b86e364babb80dc  ONLINE       0     0     0
            scsi-36000c29e5cd27c18813f857459910749  ONLINE       0     0     0
 
errors: No known data errors


 
You can determine how the unique SCSI IDs map to physical devices by performing a long listing of the by-id directory:
‘ls -l /dev/disk/by-id/’.
 
There are many devices; the two we will be dealing with are shown below, specifically /dev/sdb (the drive with the simulated failure) and /dev/sdp (the replacement drive).
[root@eh-centos-vsnap198 ~]# ls -l /dev/disk/by-id/ | grep sdb
lrwxrwxrwx. 1 root root  9 Mar  7 15:26 scsi-36000c29746d921bca14fb9c73513d776 -> ../../sdb
lrwxrwxrwx. 1 root root 10 Mar  7 15:26 scsi-36000c29746d921bca14fb9c73513d776-part1 -> ../../sdb1
lrwxrwxrwx. 1 root root 10 Mar  7 15:26 scsi-36000c29746d921bca14fb9c73513d776-part9 -> ../../sdb9
lrwxrwxrwx. 1 root root  9 Mar  7 15:26 wwn-0x6000c29746d921bca14fb9c73513d776 -> ../../sdb
lrwxrwxrwx. 1 root root 10 Mar  7 15:26 wwn-0x6000c29746d921bca14fb9c73513d776-part1 -> ../../sdb1
lrwxrwxrwx. 1 root root 10 Mar  7 15:26 wwn-0x6000c29746d921bca14fb9c73513d776-part9 -> ../../sdb9
[root@eh-centos-vsnap198 ~]# ls -l /dev/disk/by-id/ | grep sdp
lrwxrwxrwx. 1 root root  9 Mar  7 16:05 scsi-36000c292d5aa6e30c9fc84fc57490f25 -> ../../sdp
lrwxrwxrwx. 1 root root  9 Mar  7 16:05 wwn-0x6000c292d5aa6e30c9fc84fc57490f25 -> ../../sdp
 

Simulated Drive Corruption

Use ‘dd’ to overwrite the header of the drive /dev/sdb:
[root@eh-centos-vsnap198 ~]# dd if=/dev/zero of=/dev/sdb bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 2.05309 s, 523 MB/s

Note that ZFS did not immediately recognize the drive as failed, most likely because the drive label is only read when the device is opened and the hardware behind this drive is not actually reporting a failed state to ZFS; in other words, this is a corruption scenario rather than a hardware fault. Rebooting the machine, which can be scheduled as part of a maintenance window, forces the label to be re-read and brings the drive back in a failed state, as shown below.
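
Depending on the ZFS version, it may also be possible to surface this kind of corruption without a reboot by forcing ZFS to read every device, for example with a scrub. Treat this as an optional check rather than a replacement for the reboot described above:

# Start a scrub, which reads and verifies all data in the pool
zpool scrub vpool3

# Check scrub progress and any read/checksum errors it has uncovered
zpool status vpool3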

Pool Status After Failure

The pool status after the drive failure is shown below. Consult the link that ‘zpool status’ prints in its “see:” field in conjunction with this document; it walks you through a procedure based on the specific error condition. In addition, if you have not physically labeled the drives by WWN, there may be optional utilities for your operating system that can activate the drive LED for a given WWN. You can also use the installed ‘lsscsi’ command to determine which ports the connected drives are on (see the sketch after the output below).
 
[root@eh-centos-vsnap198 ~]# zpool status
  pool: vpool3
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: resilvered 23.5K in 0h0m with 0 errors on Wed Mar  7 15:25:40 2018
config:
 
        NAME                                        STATE     READ WRITE CKSUM
        vpool3                                      DEGRADED     0     0     0
          raidz2-0                                  DEGRADED     0     0     0
            7835885528809609590                     UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-36000c29746d921bca14fb9c73513d776-part1
            scsi-36000c29695a814f39e33c36dee8c100c  ONLINE       0     0     0
            scsi-36000c294c40b18a83e880530b14ffc52  ONLINE       0     0     0
            scsi-36000c29d07a0ded097910901edfa4ce4  ONLINE       0     0     0
            scsi-36000c29dc85e7adc7d10a171b00cda17  ONLINE       0     0     0
            scsi-36000c2934fbda7c1dbb2946cafae7e90  ONLINE       0     0     0
            scsi-36000c29eab355532cdd8a321af882903  ONLINE       0     0     0
          raidz2-1                                  ONLINE       0     0     0
            scsi-36000c2941b16e2fbc1e63dea59c81ff3  ONLINE       0     0     0
            scsi-36000c296fcb7ff9c1a2967b163f381a2  ONLINE       0     0     0
            scsi-36000c29911c2a5d39f044cf0e3093b4a  ONLINE       0     0     0
            scsi-36000c297c3c161c5ece0299e79f0cdc5  ONLINE       0     0     0
            scsi-36000c29cb48660757c4ea71d0c98e917  ONLINE       0     0     0
            scsi-36000c2957fe280fc3b86e364babb80dc  ONLINE       0     0     0
            scsi-36000c29e5cd27c18813f857459910749  ONLINE       0     0     0
 
errors: No known data errors
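
As mentioned above, ‘lsscsi’ can help map SCSI addresses to block devices, and on hardware that supports enclosure LED control you can blink the slot of the failed drive. A sketch, assuming the lsscsi and ledmon (which provides ‘ledctl’) packages are installed and that your controller/enclosure supports LED control:

# Map SCSI [host:channel:target:lun] addresses to /dev/sd* block devices
lsscsi

# Blink the locate LED on the failed drive's slot so it can be found and pulled (ledmon package)
ledctl locate=/dev/sdb

# Turn the locate LED back off once the drive has been physically identified
ledctl locate_off=/dev/sdb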

Drive Replacement

Then replace the disk with the ‘zpool replace’ command, using the same device-ID nomenclature reported by the long listing of /dev/disk/by-id. Note that the process of reintegrating the disk into the pool is called resilvering, and it may take some time.
 
[root@eh-centos-vsnap198 ~]# zpool replace vpool3 /dev/disk/by-id/scsi-36000c29746d921bca14fb9c73513d776-part1 /dev/disk/by-id/scsi-36000c292d5aa6e30c9fc84fc57490f25
[root@eh-centos-vsnap198 ~]# zpool status
  pool: vpool3
 state: ONLINE
  scan: resilvered 1.41M in 0h0m with 0 errors on Wed Mar  7 16:39:45 2018
config:
 
        NAME                                        STATE     READ WRITE CKSUM
        vpool3                                      ONLINE       0     0     0
          raidz2-0                                  ONLINE       0     0     0
            scsi-36000c292d5aa6e30c9fc84fc57490f25  ONLINE       0     0     0
            scsi-36000c29695a814f39e33c36dee8c100c  ONLINE       0     0     0
            scsi-36000c294c40b18a83e880530b14ffc52  ONLINE       0     0     0
            scsi-36000c29d07a0ded097910901edfa4ce4  ONLINE       0     0     0
            scsi-36000c29dc85e7adc7d10a171b00cda17  ONLINE       0     0     0
            scsi-36000c2934fbda7c1dbb2946cafae7e90  ONLINE       0     0     0
            scsi-36000c29eab355532cdd8a321af882903  ONLINE       0     0     0
          raidz2-1                                  ONLINE       0     0     0
            scsi-36000c2941b16e2fbc1e63dea59c81ff3  ONLINE       0     0     0
            scsi-36000c296fcb7ff9c1a2967b163f381a2  ONLINE       0     0     0
            scsi-36000c29911c2a5d39f044cf0e3093b4a  ONLINE       0     0     0
            scsi-36000c297c3c161c5ece0299e79f0cdc5  ONLINE       0     0     0
            scsi-36000c29cb48660757c4ea71d0c98e917  ONLINE       0     0     0
            scsi-36000c2957fe280fc3b86e364babb80dc  ONLINE       0     0     0
            scsi-36000c29e5cd27c18813f857459910749  ONLINE       0     0     0
 
errors: No known data errors
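
The simulated pool above resilvers almost instantly, but on production pools with large drives resilvering can take many hours. The ‘scan:’ line of ‘zpool status’ reports progress while a resilver is running; a simple way to keep an eye on it:

# Re-run zpool status periodically; the scan: line shows resilver progress
watch -n 60 zpool status vpool3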

Note that, depending on the failure scenario, additional commands may be required after replacement to take the failed drive offline and remove it from the pool, via ‘zpool offline <pool> <disk>’ and ‘zpool remove <pool> <disk>’.