Starting with Exadata software version 11.2.3.2, an under-performing disk can be detected and removed from an active configuration. This feature applies to both hard disks and flash disks.
About storage cell software processes
The Cell Server (CELLSRV) is the main component of the Exadata software. It services I/O requests and provides advanced Exadata services, such as predicate processing offload. CELLSRV is implemented as a multithreaded process and is expected to use the largest portion of processor cycles on a storage cell.
The Management Server (MS) provides storage cell management and configuration functionality.
Disk state changes
Possibly under-performing - confined online
When CELLSRV detects poor disk performance, the cell disk status changes to 'normal - confinedOnline' and the physical disk status changes to 'warning - confinedOnline'. This is expected behavior and indicates that the disk has entered the first phase of under-performing disk identification. This phase is transient, i.e. the disk does not stay in this status for a prolonged period of time.
That disk status change would be associated with the following entry in the storage cell alerthistory:
[MESSAGE ID] [date and time] info "Hard disk entered confinement status. The LUN n_m changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status: WARNING - CONFINEDONLINE Manufacturer: [name] Model Number: [model] Size: [size] Serial Number: [S/N] Firmware: [F/W version] Slot Number: m Cell Disk: [cell disk name] Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"
At the same time, the following will be logged in the storage cell alert log:
CDHS: Mark cd health state change [cell disk name] with newState HEALTH_BAD_ONLINE pending HEALTH_BAD_ONLINE ongoing INVALID cur HEALTH_GOOD
Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0 inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0
global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0
[date and time]
CDHS: Do cd health state change [cell disk name] from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
CDHS: Done cd health state change from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
ABSOLUTE SERVICE TIME VIOLATION DETECTED ON DISK [device name]: CD name - [cell disk name] AVERAGE SERVICETIME: 130.913043 ms. AVERAGE WAITTIME: 101.565217 ms. AVERAGE REQUESTSIZE: 625 sectors. NUMBER OF IOs COMPLETED IN LAST CYCLE ON DISK: 23 THRESHOLD VIOLATION COUNT: 6 NON_ZERO SERVICETIME COUNT: 6 SET CONFINE SUCCESS: 1
NOTE: Initiating ASM Instance operation: Query ASM Deactivation Outcome on 3 disks
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name], ClientPID = 26502
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name], ClientPID = 28966
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name], ClientPID = 11912
...
Prepare for test - confined offline
The next step is to take all grid disks on the cell disk offline and run performance tests on it. CELLSRV asks ASM to take the grid disks offline and, if it is safe to do so, ASM takes them offline. The cell disk status then changes to 'normal - confinedOffline' and the physical disk status changes to 'warning - confinedOffline'.
That action would be associated with the following entry in the cell alerthistory:
[MESSAGE ID] [date and time] warning "Hard disk entered confinement offline status. The LUN n_m changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status: WARNING - CONFINEDOFFLINE Manufacturer: [name] Model Number: [model] Size: [size] Serial Number: [S/N] Firmware: [F/W version] Slot Number: m Cell Disk: [cell disk name] Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"
The following will be logged in the storage cell alert log:
NOTE: Initiating ASM Instance operation: ASM OFFLINE disk on 3 disks
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name], ClientPID = 28966
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name], ClientPID = 31801
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name], ClientPID = 26502
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE
CDHS: Done cd health state change from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE
Note that ASM will take the grid disks offline only if it is safe to do so, i.e. ASM will not offline any disks if that would result in the disk group being dismounted. For example, if a partner disk is already offline, ASM will not offline this disk. In that case, the cell disk status will stay at 'normal - confinedOnline' until the disk can be safely taken offline.
While it is waiting, CELLSRV will repeatedly log 'Query ASM Deactivation Outcome' messages in the cell alert log. This is expected behavior, and the messages will stop once ASM can take the grid disks offline.
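To see which grid disks ASM has actually taken offline at this point, something like the following can be run from any ASM instance (just a sketch; REPAIR_TIMER shows the remaining repair timer for each offline disk):
-- List grid disks that ASM currently has offline, with the remaining repair timer.
SELECT dg.name AS diskgroup, d.name AS disk, d.mode_status, d.repair_timer
FROM   v$asm_diskgroup dg, v$asm_disk d
WHERE  dg.group_number = d.group_number
AND    d.mode_status <> 'ONLINE';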
Under stress test
Once all grid disks are offline, the MS runs the performance tests on the cell disk. If the tests show that the disk is performing well, MS notifies CELLSRV that the disk is fine, and CELLSRV then asks ASM to put the grid disks back online.
Poor performance - drop force
If the MS finds that the disk is indeed performing poorly, the cell disk status changes to 'proactive failure' and the physical disk status changes to 'warning - poor performance'. Such a disk needs to be removed from the active configuration. In that case, MS notifies CELLSRV, which in turn asks ASM to drop all grid disks on that cell disk.
That action would be associated with the following entry in the cell alerthistory:
[MESSAGE ID] [date and time] critical "Hard disk entered poor performance status. Status: WARNING - POOR PERFORMANCE Manufacturer: [name] Model Number: [model] Size: [size] Serial Number: [S/N] Firmware: [F/W version] Slot Number: m Cell Disk: [cell disk name] Grid Disk: [grid disk 1], [grid disk 2] ... Reason for poor performance : threshold for service time exceeded"
The following will be logged in the storage cell alert log:
CDHS: Do cd health state change after confinement [cell disk name] testFailed 1
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL
NOTE: Initiating ASM Instance operation: ASM DROP dead disk on 3 disks
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name], ClientPID = 28966
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name], ClientPID = 11912
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name], ClientPID = 26502
CDHS: Done cd health state change from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL
In the ASM alert log we will see the drop disk force operations for the respective grid disks, followed by the disk group rebalance operation.
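The progress of that rebalance can be followed from the ASM instance with a query along these lines (a simple sketch; no rows returned means no rebalance is currently running):
-- Monitor the rebalance triggered by the drop force.
SELECT group_number, operation, state, power, sofar, est_work, est_minutes
FROM   v$asm_operation;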
Once the rebalance completes, the problem disk should be replaced by following the same process as for a disk with the status 'predictive failure'.
All well - back to normal
If the MS tests determine that there are no performance issues with the disk, it passes that information on to CELLSRV, which in turn asks ASM to put the grid disks back online. The cell and physical disk status then change back to normal.
Disk confinement triggers
Any of the following conditions can trigger a disk confinement:
- Hung cell disk (the cause code in the storage cell alert log will be CD_PERF_HANG).
- Slow cell disk, e.g. high service time threshold (CD_PERF_SLOW_ABS), high relative service time threshold (CD_PERF_SLOW_RLTV), etc.
- High read or write latency, e.g. high latency on writes (CD_PERF_SLOW_LAT_WT), high latency on reads (CD_PERF_SLOW_LAT_RD), high latency on both reads and writes (CD_PERF_SLOW_LAT_RW), very high absolute latency on individual I/Os happening frequently (CD_PERF_SLOW_LAT_ERR), etc.
- Errors, e.g. I/O errors (CD_PERF_IOERR).
As a single under-performing disk can impact overall system performance, a new feature has been introduced in Exadata to identify and remove such disks from the active configuration. This is a fully automated process that includes an automatic service request (ASR) for disk replacement.
I have recently published this on MOS as Doc ID 1509105.1.
Very good to know - thank you for publishing this instructive material about ASM in general and on Exadata!
Thanks Uwe,
Encouragement always helps.
Cheers,
Bane
Good Stuff! I don't see metrics like CD_PERF_SLOW_LAT_RW in cellsrv 11.2.3.2.1. In which version would these be introduced?
Thanks Tanel!
Those CD_PERF_% are not metrics - they are the return codes from the tests. You would see them in the storage cell alert log as the reason for the test trigger/failure.
Reading my post again, I see how I implied that those might be metrics - e.g. I said "high service time threshold (CD_PERF_SLOW_ABS)...". What I meant was that when such a threshold is exceeded, that code would be shown in the alert log.
Hope this clarifies the confusion.
Cheers,
Bane
Ok, cool. Thanks! Are those shown only in alert log or would they get accumulated in cell stats so that cellsrvstat would show these as counters too? (like the io_ltrlw and io_ltow metrics in cellsrvstat for example?)
I am not 100% sure (would need to see an actual case to confirm), but I don't think those would be in cellsrvstat, as those are not really stats. While those may be considered the trigger points, they are really status codes.
Hi Bane,
I am interested to know whether there is a way to see how much of a disk group is taken up by each database.
Say I have 2 databases, DB1 and DB2, sharing 1 disk group, DATA; the total size of DATA is 500 GB, of which DB1 took 200 GB and DB2 took 300 GB.
Can you provide a SQL query to get that result? (We can find this through du, but it takes too much time, so we want to rely on a SQL query.)
Appreciate your help.
I don't have a good canned query for this. I guess it depends on what you want to count as 'space taken by the database'. For example, you may want to sum up:
BYTES from V$DATAFILE
FILE_SIZE_BLKS*BLOCK_SIZE from V$CONTROLFILE
BYTES from V$LOG
BYTES from V$TEMPFILE
But you also may want to include the sum of:
BYTES from V$BACKUP_FILES
BLOCKS*BLOCK_SIZE from V$ARCHIVED_LOG
SPFILE size
There is also ASM metadata used to manage those files, so you may argue that is also the space taken by the database.
I still think 'asmcmd du' is the best tool for the job here.
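If you do want a SQL starting point, here is a minimal sketch that covers just the first group of views above (datafiles, control files, online redo logs and temp files; backups, archived logs, the SPFILE, ASM mirroring and metadata are left out). Run it connected to each database:
-- Rough size of the database files, in GB.
SELECT ROUND(SUM(bytes)/1024/1024/1024, 2) AS size_gb
FROM (SELECT bytes FROM v$datafile
      UNION ALL
      SELECT file_size_blks * block_size FROM v$controlfile
      UNION ALL
      SELECT bytes FROM v$log
      UNION ALL
      SELECT bytes FROM v$tempfile);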
Cheers,
Bane
Hi Bane. Thanks for this. We see an alert such as the following in the database alert log when these confinement tests run -
Errors in file /u01/app/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_pr0c_00001.trc:
ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.10.11/DATA01_CD_05_exa01cel01 at offset 21903915414 for data length 293952
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:17 AU:4984 offset:1204348 size:264232
path:o/192.168.10.11/DATA01_CD_05_exa01cel01
incarnation:0xe369af44 asynchronous result:'I/O error'
subsys:OSS iop:0x5ed92f713200 bufp:0x7ea91d67b000 osderr:0xe9 osderr1:0x0
WARNING: failed to read mirror side 1 of virtual extent 321 logical extent 0 of file 1386 in group [1.2631526429] from disk DATA01_CD_05_EXA01CEL01 allocation unit 4984 reason error; if possible, will try another mirror side
NOTE: successfully read mirror side 2 of virtual extent 321 logical extent 1 of file 1386 in group [1.2631526429] from disk DATA01_CD_03_EXA01CEL07 allocation unit 5360
Tue May 28 18:07:14 2013
NOTE: disk 17 (DATA01_CD_05_EXA01CEL01) in group 1 (DATA01) is offline for reads
NOTE: disk 17 (DATA01_CD_05_EXA01CEL01) in group 1 (DATA01) is offline for writes
NOTE: disk 17 (RECO01_CD_05_EXA01CEL01) in group 4 (RECO01) is offline for reads
NOTE: disk 17 (RECO01_CD_05_EXA01CEL01) in group 4 (RECO01) is offline for writes
Useful to know when this feature came in.
Thanks Matthew,
I guess I should have mentioned what we see in the database and ASM alert logs. It's always a question of how much detail to include...
Cheers,
Bane
Absolutely! Just thought I'd mention it. It's a shame really that it gets passed up as far as the database layer. It's not important to the database users because their query is going to be serviced regardless, and it just makes them panic! Thanks again though.