
May 19, 2013

Identification of under-performing disks in Exadata


Starting with Exadata software version 11.2.3.2, an under-performing disk can be detected and removed from an active configuration. This feature applies to both hard disks and flash disks.
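
A quick way to confirm that a storage cell is on a release with this feature is to check the active image or cell software version on the cell itself. A minimal check (run as root or celladmin on the storage cell; a sketch, exact output varies by release):

# active storage server image version (11.2.3.2 or later includes this feature)
imageinfo | grep "Active image version"
# or through CellCLI
cellcli -e "list cell attributes cellVersion"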

About storage cell software processes

The Cell Server (CELLSRV) is the main component of the Exadata storage server software. It services I/O requests and provides advanced Exadata features, such as predicate processing offload. CELLSRV is implemented as a multithreaded process and is expected to use the largest portion of processor cycles on a storage cell.

The Management Server (MS) handles storage cell management and configuration tasks.

Disk state changes

Possibly under-performing - confined online

When poor disk performance is detected by CELLSRV, the cell disk status changes to 'normal - confinedOnline' and the physical disk status changes to 'warning - confinedOnline'. This is expected behavior and indicates that the disk has entered the first phase of under-performing disk identification. This is a transient phase, i.e. the disk does not stay in this status for a prolonged period of time.

That disk status change would be associated with the following entry in the storage cell alerthistory:

[MESSAGE ID] [date and time] info "Hard disk entered confinement status. The LUN n_m changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status: WARNING - CONFINEDONLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"

At the same time, the following will be logged in the storage cell alert log:

CDHS: Mark cd health state change [cell disk name]  with newState HEALTH_BAD_ONLINE pending HEALTH_BAD_ONLINE ongoing INVALID cur HEALTH_GOOD
Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0 inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0
global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0
[date and time]
CDHS: Do cd health state change [cell disk name] from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
CDHS: Done cd health state change  from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
ABSOLUTE SERVICE TIME VIOLATION DETECTED ON DISK [device name]: CD name - [cell disk name] AVERAGE SERVICETIME: 130.913043 ms. AVERAGE WAITTIME: 101.565217 ms. AVERAGE REQUESTSIZE: 625 sectors. NUMBER OF IOs COMPLETED IN LAST CYCLE ON DISK: 23 THRESHOLD VIOLATION COUNT: 6 NON_ZERO SERVICETIME COUNT: 6 SET CONFINE SUCCESS: 1
NOTE: Initiating ASM Instance operation: Query ASM Deactivation Outcome on 3 disks
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
...
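
While the disk is in this transient state, the statuses can be confirmed directly on the storage cell; a quick check with CellCLI (the affected disk should show the confinedOnline statuses described above):

# physical disk status: 'warning - confinedOnline' for the affected disk
cellcli -e "list physicaldisk attributes name, status, slotNumber"
# cell disk status: 'normal - confinedOnline'
cellcli -e "list celldisk attributes name, status"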

Prepare for test - confined offline

The next step is to take all grid disks on that cell disk offline and run performance tests on it. CELLSRV asks ASM to take the grid disks offline and, if possible, ASM does so. In that case, the cell disk status changes to 'normal - confinedOffline' and the physical disk status changes to 'warning - confinedOffline'.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] warning "Hard disk entered confinement offline status. The LUN n_m changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status: WARNING - CONFINEDOFFLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"

The following will be logged in the storage cell alert log:
NOTE: Initiating ASM Instance operation: ASM OFFLINE disk on 3 disks
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 31801
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE
CDHS: Done cd health state change  from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE

Note that ASM will take the grid disks offline only if it is safe to do so. That means ASM will not offline any disks if doing so would result in a disk group dismount. For example, if a partner disk is already offline, ASM will not offline this disk. In that case, the cell disk status will stay at 'normal - confinedOnline' until the disk can be safely taken offline.

In that case, CELLSRV will repeatedly log 'Query ASM Deactivation Outcome' messages in the cell alert log. This is expected behavior and the messages will stop once ASM can take the grid disks offline.
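
Whether ASM can safely offline the grid disks is also visible from the storage cell; a quick check (asmDeactivationOutcome reads 'Yes' when a grid disk can be offlined without dismounting the disk group):

cellcli -e "list griddisk attributes name, status, asmmodestatus, asmdeactivationoutcome"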

Under stress test

Once all grid disks are offline, MS runs performance tests on the cell disk. If it turns out that the disk is performing well, MS notifies CELLSRV that the disk is fine, and CELLSRV then notifies ASM to put the grid disks back online.

Poor performance - drop force

If MS finds that the disk is indeed performing poorly, the cell disk status changes to 'proactive failure' and the physical disk status changes to 'warning - poor performance'. Such a disk needs to be removed from the active configuration. In that case MS notifies CELLSRV, which in turn notifies ASM to drop all grid disks from that cell disk.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] critical "Hard disk entered poor performance status. Status: WARNING - POOR PERFORMANCE Manufacturer: [name] Model Number: [model]  Size: [size]  Serial Number: [S/N] Firmware: [F/W version] Slot Number: m Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for poor performance : threshold for service time exceeded"

The following will be logged in the storage cell alert log:
CDHS: Do cd health state change  after confinement [cell disk name] testFailed 1
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL
NOTE: Initiating ASM Instance operation: ASM DROP dead disk on 3 disks
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Done cd health state change  from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL

In the ASM alert log we will see the drop disk force operations for the respective grid disks, followed by the disk group rebalance operation.

Once the rebalance completes, the problem disk should be replaced, following the same process as for a disk with the predictive failure status.
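
The rebalance can be monitored from any database node before the disk is replaced; a sketch using SQL*Plus against the ASM instance (assumes the Grid Infrastructure environment is set; not an official procedure):

# run as the Grid Infrastructure owner with the ASM environment (ORACLE_SID=+ASMn) set
sqlplus -s / as sysasm <<'EOF'
select inst_id, group_number, operation, state, power, sofar, est_work, est_minutes
from   gv$asm_operation;
EOF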

All well - back to normal

If the MS tests determine that there are no performance issues with the disk, it will pass that information on to CELLSRV, which will in turn ask ASM to put the grid disks back online. The cell disk and physical disk statuses will change back to normal.

Disk confinement triggers

Any of the following conditions can trigger a disk confinement; the associated reason code is recorded in the storage cell alert log and alert history (see the example after this list):
  1. Hung cell disk (the cause code in the storage cell alert log will be CD_PERF_HANG).
  2. Slow cell disk, e.g. high service time threshold (CD_PERF_SLOW_ABS), high relative service time threshold (CD_PERF_SLOW_RLTV), etc.
  3. High read or write latency, e.g. high latency on writes (CD_PERF_SLOW_LAT_WT), high latency on reads (CD_PERF_SLOW_LAT_RD), high latency on both reads and writes (CD_PERF_SLOW_LAT_RW), very high absolute latency on individual I/Os happening frequently (CD_PERF_SLOW_LAT_ERR), etc.
  4. Errors, e.g. I/O errors (CD_PERF_IOERR).
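
The reason code for a particular confinement can be found in the storage cell alert log and in the alert history; for example (the $CELLTRACE location is the usual cell trace directory on recent releases, but the path may differ on your system):

# the CD_PERF_* cause code appears in the cell alert.log entries shown earlier
grep CD_PERF $CELLTRACE/alert.log
# the alert history records the confinement and the reason for it
cellcli -e list alerthistory detail
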
Conclusion

As a single under-performing disk can impact overall system performance, a new feature has been introduced in Exadata to identify such disks and remove them from the active configuration. This is a fully automated process that includes an automatic service request (ASR) for disk replacement.

I have recently published this on MOS as Doc ID 1509105.1.

11 comments:

  1. Very good to know - thank you for publishing this instructive material about ASM in general and on Exadata!

    ReplyDelete
    Replies
    1. Thanks Uwe,
      Encouragement always helps.
      Cheers,
      Bane

      Delete
  2. Good Stuff! I don't see metrics like CD_PERF_SLOW_LAT_RW in cellsrv 11.2.3.2.1. In which version would these be introduced?

    ReplyDelete
    Replies
    1. Thanks Tanel!

      Those CD_PERF_% are not metrics - they are the return codes from the tests. You would see them in the storage cell alert log as the reason for the test trigger/failure.

      Reading my post again, I see how I implied that those might be metrics - e.g. I said "high service time threshold (CD_PERF_SLOW_ABS)...". What I meant was that when such threshold is exceeded, that code would be shown in the alert log.

      Hope this clarifies the confusion.

      Cheers,
      Bane

      Delete
    2. Ok, cool. Thanks! Are those shown only in alert log or would they get accumulated in cell stats so that cellsrvstat would show these as counters too? (like the io_ltrlw and io_ltow metrics in cellsrvstat for example?)

      Delete
    3. I am not 100% sure (would need to see an actual case to confirm), but I don't think those would be in cellsrvstat, as those are not really stats. While those may be considered the trigger points, they are really status codes.

      Delete
    4. Hi Bane,
      I am interested to know whether there is a way to see how much of a disk group is used by each database.
      If I have 2 DBs and 1 disk group, say DB1, DB2 and DATA, with DATA being 500 GB in total, DB1 taking 200 GB and DB2 taking 300 GB,

      can you provide us a query to get that result? I am looking for a SQL query. [We can find it through du, but it takes too much time, so we want to rely on a SQL query.]

      Appreciate your help

      Delete
      I don't have a good canned query for this. I guess it depends on what you want to count as 'space taken by the database'. For example, you may want to sum up:
      BYTES from V$DATAFILE
      FILE_SIZE_BLKS*BLOCK_SIZE from V$CONTROLFILE
      BYTES from V$LOG
      BYTES from V$TEMPFILE

      But you also may want to include the sum of:
      BYTES from V$BACKUP_FILES
      BLOCKS*BLOCK_SIZE from V$ARCHIVED_LOG
      SPFILE size

      There is also ASM metadata used to manage those files, so you may argue that is also the space taken by the database.

      I still think 'asmcmd du' is the best tool for the job here.

      Cheers,
      Bane

      Delete
  3. Hi Bane. Thanks for this. We see an alert such as the following in the database alert log when these confinement tests run -

    Errors in file /u01/app/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_pr0c_00001.trc:
    ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.10.11/DATA01_CD_05_exa01cel01 at offset 21903915414 for data length 293952
    ORA-27626: Exadata error: 201 (Generic I/O error)
    WARNING: Read Failed. group:1 disk:17 AU:4984 offset:1204348 size:264232
    path:o/192.168.10.11/DATA01_CD_05_exa01cel01
    incarnation:0xe369af44 asynchronous result:'I/O error'
    subsys:OSS iop:0x5ed92f713200 bufp:0x7ea91d67b000 osderr:0xe9 osderr1:0x0
    WARNING: failed to read mirror side 1 of virtual extent 321 logical extent 0 of file 1386 in group [1.2631526429] from disk DATA01_CD_05_EXA01CEL01 allocation unit 4984 reason error; if possible, will try another mirror side
    NOTE: successfully read mirror side 2 of virtual extent 321 logical extent 1 of file 1386 in group [1.2631526429] from disk DATA01_CD_03_EXA01CEL07 allocation unit 5360
    Tue May 28 18:07:14 2013
    NOTE: disk 17 (DATA01_CD_05_EXA01CEL01) in group 1 (DATA01) is offline for reads
    NOTE: disk 17 (DATA01_CD_05_EXA01CEL01) in group 1 (DATA01) is offline for writes
    NOTE: disk 17 (RECO01_CD_05_EXA01CEL01) in group 4 (RECO01) is offline for reads
    NOTE: disk 17 (RECO01_CD_05_EXA01CEL01) in group 4 (RECO01) is offline for writes

    Useful to know when this feature came in.

    ReplyDelete
    Replies
    1. Thanks Matthew,

      I guess I should have mentioned what we see in the database and ASM alert logs. It's always a question of how much detail to include...

      Cheers,
      Bane

      Delete
  4. Absolutely! Just thought I'd mention it. It's a shame, really, that it gets passed up as far as the database layer. It's not important to the database users because their query is going to be serviced regardless, and it just makes them panic! Thanks again though.

    ReplyDelete