This post describes how to replace a failed disk in a mirrored rootvg.
In short, the procedure is as follows:
1. unmirrorvg rootvg hdisk0 (savebase -v)
2. reducevg rootvg hdisk0
3. rmdev -Rdl hdisk0
4. diag (safely remove hot swap device/drive), physically remove the old disk
5. insert the new disk (diag, hot plug task)
6. cfgmgr -vl scsi0
7. extendvg -f rootvg hdisk0
8. mirrorvg -m rootvg hdisk0
9. bosboot -ad hdisk0
10. bootlist -m normal hdisk0 hdisk1
11. bootlist -m normal -o
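Before touching a production system it can help to print the whole command sequence for review. The following is only a sketch: the function name plan_rootvg_disk_swap and its three arguments are made up for illustration, it prints the commands rather than running them, and it assumes the new disk is configured under the same name as the old one (as happens in this post). It also lists chpv -c, which is run in the detailed walkthrough although the short list omits it.

```shell
#!/bin/sh
# Print the replacement commands in order instead of running them.
# Arguments (hypothetical interface): failed disk, surviving mirror disk,
# parent SCSI adapter.
plan_rootvg_disk_swap() {
    failed=$1
    good=$2
    adapter=$3
    cat <<EOF
unmirrorvg rootvg $failed
chpv -c $failed
savebase -v
reducevg rootvg $failed
rmdev -Rdl $failed
# diag: hot plug task, swap the physical drive
cfgmgr -vl $adapter
extendvg -f rootvg $failed
mirrorvg -m rootvg $failed
bosboot -ad $failed
bootlist -m normal $failed $good
bootlist -m normal -o
EOF
}
```

Called as plan_rootvg_disk_swap hdisk0 hdisk1 scsi0, it prints the steps used in this post, one per line.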
The procedure is described in more detail below, starting with the errpt
entries that indicate a drive problem. The errpt output shows a problem
with a disk:
# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
EAA3D429 0216144412 U S LVDD PHYSICAL PARTITION MARKED STALE
EAA3D429 0216144412 U S LVDD PHYSICAL PARTITION MARKED STALE
F7DDA124 0216144412 U H LVDD PHYSICAL VOLUME DECLARED MISSING
52715FA5 0216144412 U H LVDD FAILED TO WRITE VOLUME GROUP STATUS AREA
E86653C3 0216144412 P H LVDD I/O ERROR DETECTED BY LVM
8647C4E2 0216144412 P H hdisk0 DISK OPERATION ERROR
41BF2110 0216144412 U H LVDD MIRROR WRITE CACHE WRITE FAILED
8647C4E2 0216144412 P H hdisk0 DISK OPERATION ERROR
EAA3D429 0216144412 U S LVDD PHYSICAL PARTITION MARKED STALE
E86653C3 0216144412 P H LVDD I/O ERROR DETECTED BY LVM
8647C4E2 0216144412 P H hdisk0 DISK OPERATION ERROR
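On a busy system the summary can be long, so it may be easier to count the permanent hardware errors per resource. This is a small sketch, not part of the original procedure: perm_hw_errors is a made-up helper that reads errpt summary output from stdin and assumes the column layout shown above (type in column 3, class in column 4, resource name in column 5).

```shell
#!/bin/sh
# Count permanent (T = P) hardware (C = H) entries per resource.
# Reads `errpt` summary output from stdin; the header line never matches
# because its third and fourth fields are "T" and "C".
perm_hw_errors() {
    awk '$3 == "P" && $4 == "H" { n[$5]++ }
         END { for (r in n) print r, n[r] }' | LC_ALL=C sort
}
```

Usage would be errpt | perm_hw_errors. For the listing above this reports three entries for hdisk0 and two for LVDD.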
Let's take a detailed look at the first error:
# errpt -aj 8647C4E2
---------------------------------------------------------------------------
LABEL: DISK_ERR3
IDENTIFIER: 8647C4E2
Date/Time: Thu Feb 16 14:44:06 GMT 2012
Sequence Number: 3706
Machine Id: 00CF405E4C00
Node Id: power1
Class: H
Type: PERM
WPAR: Global
Resource Name: hdisk0
Resource Class: disk
Resource Type: scsd
Location: U787F.001.DPM28WG-P1-T10-L5-L0
VPD:
Manufacturer................IBM H0
Machine Type and Model......HUS103073FL3800
FRU Number..................03N5262
ROS Level and ID............52505152
Serial Number...............
EC Level....................H17923D
Part Number.................26K5573
Device Specific.(Z0)........000004129F00013E
Device Specific.(Z1)........RPQR
Device Specific.(Z2)........0068
Device Specific.(Z3)........06131
Device Specific.(Z4)........0001
Device Specific.(Z5)........22
Device Specific.(Z6)........H17923D
Brand.......................H0
Description
DISK OPERATION ERROR
Probable Causes
DASD DEVICE
STORAGE DEVICE CABLE
Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS
STORAGE DEVICE CABLE
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
PATH ID
0
SENSE DATA
0A05 0000 2E00 0000 0080 0000 0800 0000 0200 0800 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0117 0002
Diagnostic Analysis
Diagnostic Log sequence number: 928
Resource tested: hdisk0
Resource Description: 16 Bit LVD SCSI Disk Drive
Location: U787F.001.DPM28WG-P1-T10-L5-L0
SRN: 2643-129
Description: Error log analysis indicates a SCSI bus problem.
Possible FRUs:
n/a FRU: n/a
U787F.001.DPM28WG-P1-T10-L5-L0
SCSI bus problem: cables, terminators or other SCSI
devices
hdisk0 FRU: 03N5262
U787F.001.DPM28WG-P1-T10-L5-L0
16 Bit LVD SCSI Disk Drive
sisscsia0 FRU: 42R7379 U787F.001.DPM28WG-P1
PCI-X Dual Channel Ultra320 SCSI Adapter
n/a FRU: n/a
U787F.001.DPM28WG-P1-T10-L5-L0
Software
The lsvg and lspv commands show that the disk is in the missing state:
# lsvg -p rootvg
rootvg:
PV_NAME PV STATE TOTAL PPs FREE PPs FREE DISTRIBUTION
hdisk0 missing 546 41 05..01..00..00..35
hdisk1 active 546 41 05..00..00..00..36
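When checking several volume groups, the PV state column can be filtered rather than read by eye. A minimal sketch, assuming the lsvg -p layout shown above (PV name in the first column, state in the second); inactive_pvs is a hypothetical helper name and it reads the command output from stdin.

```shell
#!/bin/sh
# Print every physical volume whose state is not "active".
# Reads `lsvg -p <vg>` output from stdin; the "rootvg:" line is skipped
# because it has fewer than five fields, the header because it starts
# with PV_NAME.
inactive_pvs() {
    awk 'NF >= 5 && $1 != "PV_NAME" && $2 != "active" { print $1, $2 }'
}
```

Run as lsvg -p rootvg | inactive_pvs. With the output above it prints a single line for hdisk0 with the state missing, and nothing at all once the mirror is healthy again.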
# lspv hdisk0
PHYSICAL VOLUME: hdisk0 VOLUME GROUP: rootvg
PV IDENTIFIER: 00c0e90dce6c290a VG IDENTIFIER 00c0e90d00004c000000012ff4e24eaa
PV STATE: missing
STALE PARTITIONS: 21 ALLOCATABLE: yes
PP SIZE: 128 megabyte(s) LOGICAL VOLUMES: 24
TOTAL PPs: 546 (69888 megabytes) VG DESCRIPTORS: 1
FREE PPs: 41 (5248 megabytes) HOT SPARE: no
USED PPs: 505 (64640 megabytes) MAX REQUEST: 256 kilobytes
FREE DISTRIBUTION: 05..01..00..00..35
USED DISTRIBUTION: 105..108..109..109..74
MIRROR POOL: None
An attempt to read from the disk using dd failed, so it really looks like
the disk has died. (A faulty controller or cable could produce the same
symptoms, but in this scenario it is the disk.)
# dd if=/dev/hdisk0 of=/tmp/disk0 bs=100 count=1024
dd: 0511-051 The read failed.
: There is an input or output error.
0+0 records in.
0+0 records out.
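The same read test can be wrapped in a function that sends the data to /dev/null, so that nothing is written to /tmp. This is a sketch under assumptions: can_read is a made-up name, and the block size and count are arbitrary small values chosen for a quick probe.

```shell
#!/bin/sh
# Return 0 if the first few blocks of the given device or file can be
# read, non-zero otherwise. Data is discarded, so no file is created.
can_read() {
    dd if="$1" of=/dev/null bs=512 count=8 >/dev/null 2>&1
}
```

For example, can_read /dev/hdisk0 || echo "hdisk0 is not readable" would report the failure seen above without leaving a file behind.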
To replace the disk, we have to do the following:
# unmirrorvg rootvg hdisk0
0516-1734 rmlvcopy: Warning, savebase failed. Please manually run 'savebase' before rebooting.
0516-1734 rmlvcopy: Warning, savebase failed. Please manually run 'savebase' before rebooting.
0516-1734 rmlvcopy: Warning, savebase failed. Please manually run 'savebase' before rebooting.
0516-1734 rmlvcopy: Warning, savebase failed. Please manually run 'savebase' before rebooting.
0516-1734 rmlvcopy: Warning, savebase failed. Please manually run 'savebase' before rebooting.
0516-1734 rmlvcopy: Warning, savebase failed. Please manually run 'savebase' before rebooting.
0516-1246 rmlvcopy: If hd5 is the boot logical volume, please run 'chpv -c <diskname>'
as root user to clear the boot record and avoid a potential boot
off an old boot image that may reside on the disk from which this
logical volume is moved/removed.
0301-108 mkboot: Unable to read file blocks. Return code: -1
0516-1798 lchangevg: Cannot change quorum without losing quorum.
0516-732 chvg: Unable to change volume group rootvg.
0516-1144 unmirrorvg: rootvg successfully unmirrored, user should perform
bosboot of system to reinitialize boot records. Then, user must modify
bootlist to just include: hdisk1.
# chpv -c hdisk0
# savebase -v
saving to '/dev/hd5'
75 CuDv objects to be saved
174 CuAt objects to be saved
25 CuDep objects to be saved
39 CuVPD objects to be saved
387 CuDvDr objects to be saved
2 CuPath objects to be saved
0 CuPathAt objects to be saved
0 CuData objects to be saved
0 CuAtDef objects to be saved
Number of bytes of data to save = 38834
Compressing data
Compressed data size is = 9840
bi_start = 0x3600
bi_size = 0x1820000
bd_size = 0x1800000
ram FS start = 0x8d6ca0
ram FS size = 0xea2902
sba_start = 0x1803600
sba_size = 0x20000
sbd_size = 0x2674
Checking boot image size:
new save base byte cnt = 0x2674
Wrote 9844 bytes
Successful completion
Now remove the disk from the VG and delete its device definition with
rmdev. Afterwards, use diag to safely remove the physical disk from the
system:
# reducevg rootvg hdisk0
# rmdev -Rdl hdisk0
hdisk0 deleted
# diag
Task selection > Hot plug task > SCSI and SCSI RAID Hot Plug Manager > Replace/remove device.
For the safe physical insertion of the new disk, use diag as well.
Next, the system has to detect and configure the new drive. We then add
it to the VG, mirror rootvg onto it, write a boot image to its boot
logical volume (BLV) with bosboot and set the boot order:
# cfgmgr -vl scsi0
# extendvg -f rootvg hdisk0
# mirrorvg -m rootvg hdisk0
0516-1126 mirrorvg: rootvg successfully mirrored, user should perform
bosboot of system to initialize boot records. Then, user must modify
bootlist to include: hdisk0 hdisk1.
# bosboot -ad hdisk0
bosboot: Boot image is 49180 512 byte blocks.
# bootlist -m normal hdisk0 hdisk1
# bootlist -m normal -o
hdisk0 blv=hd5 pathid=0
hdisk1 blv=hd5 pathid=0
And finally check we are done.
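One part of that check can be scripted: confirming that both disks really are in the normal boot list. The helper below is a hypothetical sketch (bootlist_has is not an AIX command); it reads bootlist -m normal -o output from stdin and assumes the disk name is the first field of each line, as in the output above.

```shell
#!/bin/sh
# Succeed only if every disk named as an argument appears as the first
# field of a line in `bootlist -m normal -o` output read from stdin.
bootlist_has() {
    listed=$(awk '{ print $1 }')
    for d in "$@"; do
        printf '%s\n' "$listed" | grep -qxF "$d" || return 1
    done
}
```

Usage would be bootlist -m normal -o | bootlist_has hdisk0 hdisk1 && echo "boot list OK". In addition, lsvg -p rootvg should now show both disks as active and lsvg rootvg should report no stale PPs.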