
May 19, 2013

Identification of under-performing disks in Exadata


Starting with Exadata software version 11.2.3.2, an under-performing disk can be detected and removed from an active configuration. This feature applies to both hard disks and flash disks.
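
A quick way to confirm that a storage cell is on a release with this feature is to check the active image or cell software version on the cell itself. A minimal check (run as root or celladmin on the storage cell; a sketch, exact output varies by release):

# active storage server image version (11.2.3.2 or later includes this feature)
imageinfo | grep "Active image version"
# or through CellCLI
cellcli -e "list cell attributes cellVersion"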

About storage cell software processes

The Cell Server (CELLSRV) is the main component of the Exadata storage server software. It services I/O requests and provides advanced Exadata features, such as predicate processing offload. CELLSRV is implemented as a multithreaded process and is expected to use the largest portion of processor cycles on a storage cell.

The Management Server (MS) handles storage cell management and configuration tasks.

Disk state changes

Possibly under-performing - confined online

When poor disk performance is detected by CELLSRV, the cell disk status changes to 'normal - confinedOnline' and the physical disk status changes to 'warning - confinedOnline'. This is expected behavior and indicates that the disk has entered the first phase of under-performing disk identification. This is a transient phase, i.e. the disk does not stay in this status for a prolonged period of time.

That disk status change would be associated with the following entry in the storage cell alerthistory:

[MESSAGE ID] [date and time] info "Hard disk entered confinement status. The LUN n_m changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status: WARNING - CONFINEDONLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"

At the same time, the following will be logged in the storage cell alert log:

CDHS: Mark cd health state change [cell disk name]  with newState HEALTH_BAD_ONLINE pending HEALTH_BAD_ONLINE ongoing INVALID cur HEALTH_GOOD
Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0 inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0
global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0
[date and time]
CDHS: Do cd health state change [cell disk name] from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
CDHS: Done cd health state change  from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
ABSOLUTE SERVICE TIME VIOLATION DETECTED ON DISK [device name]: CD name - [cell disk name] AVERAGE SERVICETIME: 130.913043 ms. AVERAGE WAITTIME: 101.565217 ms. AVERAGE REQUESTSIZE: 625 sectors. NUMBER OF IOs COMPLETED IN LAST CYCLE ON DISK: 23 THRESHOLD VIOLATION COUNT: 6 NON_ZERO SERVICETIME COUNT: 6 SET CONFINE SUCCESS: 1
NOTE: Initiating ASM Instance operation: Query ASM Deactivation Outcome on 3 disks
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
...
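
While the disk is in this transient state, the statuses can be confirmed directly on the storage cell; a quick check with CellCLI (the affected disk should show the confinedOnline statuses described above):

# physical disk status: 'warning - confinedOnline' for the affected disk
cellcli -e "list physicaldisk attributes name, status, slotNumber"
# cell disk status: 'normal - confinedOnline'
cellcli -e "list celldisk attributes name, status"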

Prepare for test - confined offline

The next step is to take all grid disks on that cell disk offline and run performance tests on it. CELLSRV asks ASM to take the grid disks offline and, if possible, ASM does so. In that case, the cell disk status changes to 'normal - confinedOffline' and the physical disk status changes to 'warning - confinedOffline'.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] warning "Hard disk entered confinement offline status. The LUN n_m changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status: WARNING - CONFINEDOFFLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"

The following will be logged in the storage cell alert log:
NOTE: Initiating ASM Instance operation: ASM OFFLINE disk on 3 disks
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 31801
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE
CDHS: Done cd health state change  from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE

Note that ASM will take the grid disks offline only if it is safe to do so. That means ASM will not offline any disks if doing so would result in a disk group dismount. For example, if a partner disk is already offline, ASM will not offline this disk. In that case, the cell disk status will stay at 'normal - confinedOnline' until the disk can be safely taken offline.

In that case, CELLSRV will repeatedly log 'Query ASM Deactivation Outcome' messages in the cell alert log. This is expected behavior and the messages will stop once ASM can take the grid disks offline.
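
Whether ASM can safely offline the grid disks is also visible from the storage cell; a quick check (asmDeactivationOutcome reads 'Yes' when a grid disk can be offlined without dismounting the disk group):

cellcli -e "list griddisk attributes name, status, asmmodestatus, asmdeactivationoutcome"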

Under stress test

Once all grid disks are offline, MS runs performance tests on the cell disk. If it turns out that the disk is performing well, MS notifies CELLSRV that the disk is fine, and CELLSRV then notifies ASM to put the grid disks back online.

Poor performance - drop force

If MS finds that the disk is indeed performing poorly, the cell disk status changes to 'proactive failure' and the physical disk status changes to 'warning - poor performance'. Such a disk needs to be removed from the active configuration. In that case MS notifies CELLSRV, which in turn notifies ASM to drop all grid disks from that cell disk.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] critical "Hard disk entered poor performance status. Status: WARNING - POOR PERFORMANCE Manufacturer: [name] Model Number: [model]  Size: [size]  Serial Number: [S/N] Firmware: [F/W version] Slot Number: m Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for poor performance : threshold for service time exceeded"

The following will be logged in the storage cell alert log:
CDHS: Do cd health state change  after confinement [cell disk name] testFailed 1
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL
NOTE: Initiating ASM Instance operation: ASM DROP dead disk on 3 disks
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Done cd health state change  from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL

In the ASM alert log we will see the drop disk force operations for the respective grid disks, followed by the disk group rebalance operation.

Once the rebalance completes, the problem disk should be replaced, following the same process as for a disk with the predictive failure status.
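
The rebalance can be monitored from any database node before the disk is replaced; a sketch using SQL*Plus against the ASM instance (assumes the Grid Infrastructure environment is set; not an official procedure):

# run as the Grid Infrastructure owner with the ASM environment (ORACLE_SID=+ASMn) set
sqlplus -s / as sysasm <<'EOF'
select inst_id, group_number, operation, state, power, sofar, est_work, est_minutes
from   gv$asm_operation;
EOF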

All well - back to normal

If the MS tests determine that there are no performance issues with the disk, it will pass that information on to CELLSRV, which will in turn ask ASM to put the grid disks back online. The cell disk and physical disk statuses will change back to normal.

Disk confinement triggers

Any of the following conditions can trigger a disk confinement; the associated reason code is recorded in the storage cell alert log and alert history (see the example after this list):
  1. Hung cell disk (the cause code in the storage cell alert log will be CD_PERF_HANG).
  2. Slow cell disk, e.g. high service time threshold (CD_PERF_SLOW_ABS), high relative service time threshold (CD_PERF_SLOW_RLTV), etc.
  3. High read or write latency, e.g. high latency on writes (CD_PERF_SLOW_LAT_WT), high latency on reads (CD_PERF_SLOW_LAT_RD), high latency on both reads and writes (CD_PERF_SLOW_LAT_RW), very high absolute latency on individual I/Os happening frequently (CD_PERF_SLOW_LAT_ERR), etc.
  4. Errors, e.g. I/O errors (CD_PERF_IOERR).
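
The reason code for a particular confinement can be found in the storage cell alert log and in the alert history; for example (the $CELLTRACE location is the usual cell trace directory on recent releases, but the path may differ on your system):

# the CD_PERF_* cause code appears in the cell alert.log entries shown earlier
grep CD_PERF $CELLTRACE/alert.log
# the alert history records the confinement and the reason for it
cellcli -e list alerthistory detail
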
Conclusion

As a single under-performing disk can impact overall system performance, a new feature has been introduced in Exadata to identify such disks and remove them from the active configuration. This is a fully automated process that includes an automatic service request (ASR) for disk replacement.

I have recently published this on MOS as Doc ID 1509105.1.

11 comments:

  1. Very good to know - thank you for publishing this instructive material about ASM in general and on Exadata!

    ReplyDelete
    Replies
    1. Thanks Uwe,
      Encouragement always helps.
      Cheers,
      Bane

      Delete
  2. Good Stuff! I don't see metrics like CD_PERF_SLOW_LAT_RW in cellsrv 11.2.3.2.1. In which version would these be introduced?

    ReplyDelete
    Replies
    1. Thanks Tanel!

      Those CD_PERF_% are not metrics - they are the return codes from the tests. You would see them in the storage cell alert log as the reason for the test trigger/failure.

      Reading my post again, I see how I implied that those might be metrics - e.g. I said "high service time threshold (CD_PERF_SLOW_ABS)...". What I meant was that when such threshold is exceeded, that code would be shown in the alert log.

      Hope this clarifies the confusion.

      Cheers,
      Bane

      Delete
    2. Ok, cool. Thanks! Are those shown only in alert log or would they get accumulated in cell stats so that cellsrvstat would show these as counters too? (like the io_ltrlw and io_ltow metrics in cellsrvstat for example?)

      Delete
    3. I am not 100% sure (would need to see an actual case to confirm), but I don't think those would be in cellsrvstat, as those are not really stats. While those may be considered the trigger points, they are really status codes.

      Delete
    4. Hi Bane,
      I am interested to know whether there is a way to see how much of a disk group is used by each database.
      If I have 2 DBs and 1 disk group, say DB1, DB2 and DATA, with DATA being 500 GB in total, DB1 taking 200 GB and DB2 taking 300 GB,

      can you provide us a query to get that result? I am looking for a SQL query. [We can find it through du, but it takes too much time, so we want to rely on a SQL query.]

      Appreciate your help

      Delete
      I don't have a good canned query for this. I guess it depends on what you want to count as 'space taken by the database'. For example, you may want to sum up:
      BYTES from V$DATAFILE
      FILE_SIZE_BLKS*BLOCK_SIZE from V$CONTROLFILE
      BYTES from V$LOG
      BYTES from V$TEMPFILE

      But you also may want to include the sum of:
      BYTES from V$BACKUP_FILES
      BLOCKS*BLOCK_SIZE from V$ARCHIVED_LOG
      SPFILE size

      There is also ASM metadata used to manage those files, so you may argue that is also the space taken by the database.

      I still think 'asmcmd du' is the best tool for the job here.

      Cheers,
      Bane

      Delete
  3. Hi Bane. Thanks for this. We see an alert such as the following in the database alert log when these confinement tests run -

    Errors in file /u01/app/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_pr0c_00001.trc:
    ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.10.11/DATA01_CD_05_exa01cel01 at offset 21903915414 for data length 293952
    ORA-27626: Exadata error: 201 (Generic I/O error)
    WARNING: Read Failed. group:1 disk:17 AU:4984 offset:1204348 size:264232
    path:o/192.168.10.11/DATA01_CD_05_exa01cel01
    incarnation:0xe369af44 asynchronous result:'I/O error'
    subsys:OSS iop:0x5ed92f713200 bufp:0x7ea91d67b000 osderr:0xe9 osderr1:0x0
    WARNING: failed to read mirror side 1 of virtual extent 321 logical extent 0 of file 1386 in group [1.2631526429] from disk DATA01_CD_05_EXA01CEL01 allocation unit 4984 reason error; if possible, will try another mirror side
    NOTE: successfully read mirror side 2 of virtual extent 321 logical extent 1 of file 1386 in group [1.2631526429] from disk DATA01_CD_03_EXA01CEL07 allocation unit 5360
    Tue May 28 18:07:14 2013
    NOTE: disk 17 (DATA01_CD_05_EXA01CEL01) in group 1 (DATA01) is offline for reads
    NOTE: disk 17 (DATA01_CD_05_EXA01CEL01) in group 1 (DATA01) is offline for writes
    NOTE: disk 17 (RECO01_CD_05_EXA01CEL01) in group 4 (RECO01) is offline for reads
    NOTE: disk 17 (RECO01_CD_05_EXA01CEL01) in group 4 (RECO01) is offline for writes

    Useful to know when this feature came in.

    ReplyDelete
    Replies
    1. Thanks Matthew,

      I guess I should have mentioned what we see in the database and ASM alert logs. It's always a question of how much detail to include...

      Cheers,
      Bane

      Delete
  4. Absolutely! Just thought I'd mention it. It's a shame, really, that it gets passed up as far as the database layer. It's not important to the database users because their query is going to be serviced regardless, and it just makes them panic! Thanks again though.

    ReplyDelete