November 25, 2015

Error count

This tiny Perl script reports the error type and count for ASM disks in engineered systems, including Exadata. In those systems, ASM uses griddisks, which are created from celldisks. The celldisks are in turn created from the physical disks.

errorCount.pl script

To quickly check for errors on any of those disks, we can use the errorCount.pl Perl script. This is the complete script with comments:

#!/usr/bin/perl
# Process lines from standard input or from one or more files
while (<>) {
 # Strip all whitespace, including the trailing newline
 s/\s+//g;
 # Get a disk name
 if ( /name/ ) {
  $name = $_;
 }
 # Get error type for non-zero counts
 elsif ( /err.*?Count:[1-9]/ ) {
  $errTypeCount = $_;
  # Print the disk name and the error type/count
  print "$name, $errTypeCount\n";
 }
}

Stripped to the bare bones, errorCount.pl becomes:

#!/usr/bin/perl
while (<>) {
 s/\s+//g;
 if ( /name/ ) { $name = $_ }
 elsif ( /err.*?Count:[1-9]/ ) { print "$name, $_\n" }
}
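To make the two tests easier to follow, here is a minimal sketch that applies the same logic to a few lines held in memory. The helper name error_report is hypothetical and not part of errorCount.pl. The first two sample lines mimic the padded cellcli detail output shown later in this post, and the second disk is made up. Stripping the whitespace is what lets the pattern match errorCount: immediately followed by a digit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of errorCount.pl): apply the same two tests
# to a list of lines and return the report lines instead of printing them.
sub error_report {
    my $name = '';
    my @report;
    for my $line (@_) {
        # "errorCount:        80" becomes "errorCount:80"
        ( my $text = $line ) =~ s/\s+//g;
        if    ( $text =~ /name/ )              { $name = $text }
        elsif ( $text =~ /err.*?Count:[1-9]/ ) { push @report, "$name, $text" }
    }
    return @report;
}

# Sample lines in the shape of cellcli detail output (second disk is made up)
my @sample = (
    "name:                       CD_03_exacell01\n",
    "errorCount:                 80\n",
    "name:                       CD_04_exacell01\n",
    "errorCount:                 0\n",
);
print "$_\n" for error_report(@sample);
# prints: name:CD_03_exacell01, errorCount:80
```

The disk with a zero count is skipped because the character after the colon must be in the range 1 to 9.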

Usage

On an Exadata storage cell, pipe the output of the cellcli -e list physicaldisk|celldisk|griddisk detail command to the script. The examples assume that errorCount.pl is executable and in the PATH. For example:

# cellcli -e list griddisk detail | errorCount.pl
name:DATA_CD_00_exacell03, errorCount:342
name:RECO_CD_00_exacell03, errorCount:728
name:RECO_CD_06_exacell03, errorCount:8
#

The script also works with the output of a dcli command, which is normally run on a database server. For example:

# dcli -g cell_group -l root cellcli -e list celldisk detail | errorCount.pl
exacell01:name:CD_03_exacell01, exacell01:errorCount:80
exacell01:name:CD_06_exacell01, exacell01:errorCount:64
#

The above shows errors on cell disks 3 and 6 on storage cell 1. To have a closer look at those cell disks:

# dcli -c exacell01 -l root cellcli -e list celldisk CD_03_exacell01,CD_06_exacell01 detail
exacell01: name:                       CD_03_exacell01
exacell01: comment:
exacell01: creationTime:               2015-09-22T10:59:08+10:00
exacell01: deviceName:                 /dev/sdd
exacell01: devicePartition:            /dev/sdd
exacell01: diskType:                   HardDisk
exacell01: errorCount:                 80
exacell01: freeSpace:                  0
exacell01: id:                         bb74cae4-bb47-4d95-b7ee-e3cc5bdf780f
exacell01: interleaving:               none
exacell01: lun:                        0_3
exacell01: physicalDisk:               E1D9RY
exacell01: raidLevel:                  0
exacell01: size:                       557.859375G
exacell01: status:                     normal
exacell01:
exacell01: name:                       CD_06_exacell01
exacell01: comment:
exacell01: creationTime:               2015-09-22T10:59:08+10:00
exacell01: deviceName:                 /dev/sdg
exacell01: devicePartition:            /dev/sdg
exacell01: diskType:                   HardDisk
exacell01: errorCount:                 64
exacell01: freeSpace:                  0
exacell01: id:                         404565b2-1be7-4171-8678-9991157156da
exacell01: interleaving:               none
exacell01: lun:                        0_6
exacell01: physicalDisk:               E1EB4J
exacell01: raidLevel:                  0
exacell01: size:                       557.859375G
exacell01: status:                     normal
#
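With dcli input, every line carries the cell name, so the report repeats it in both halves (exacell01:name:..., exacell01:errorCount:...). The following is a minimal sketch of a variant that prints the cell name once. It assumes that every input line has the form cell: attribute: value, as in the listing above. The helper name dcli_error_report is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report non-zero error counts from dcli output,
# printing the cell name once per report line.
sub dcli_error_report {
    my ( $disk, @report ) = ('');
    for my $line (@_) {
        ( my $text = $line ) =~ s/\s+//g;
        # "exacell01:name:CD_03_exacell01": remember the disk name
        if ( $text =~ /^[^:]+:name:(.+)$/ ) {
            $disk = $1;
        }
        # "exacell01:errorCount:80": capture cell, attribute and count
        elsif ( $text =~ /^([^:]+):(err.*?Count):([1-9]\d*)/ ) {
            push @report, "$1: $disk, $2:$3";
        }
    }
    return @report;
}

# Lines taken from the dcli listing above
my @sample = (
    "exacell01: name:                       CD_03_exacell01\n",
    "exacell01: errorCount:                 80\n",
    "exacell01: status:                     normal\n",
    "exacell01: name:                       CD_06_exacell01\n",
    "exacell01: errorCount:                 64\n",
);
print "$_\n" for dcli_error_report(@sample);
# prints:
# exacell01: CD_03_exacell01, errorCount:80
# exacell01: CD_06_exacell01, errorCount:64
```

Anchoring the name test right after the cell prefix is a design choice: it is stricter than the plain /name/ test, but it only suits dcli output.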

The script also works with the sundiag [physicaldisk|celldisk|griddisk]-detail.out files. For example, on a celldisk detail report:

# errorCount.pl celldisk-detail.out
name:CD_00_exacell03, errorCount:1070
name:CD_04_exacell03, errorCount:4200
name:CD_06_exacell03, errorCount:8
name:FD_02_exacell03, errorCount:5300
#

Or on a physical disk detail report:

# errorCount.pl physicaldisk-detail.out
name:20:0, errMediaCount:1000
name:20:5, errMediaCount:2000
name:FLASH_1_0, errHardWriteCount:3000
name:FLASH_1_0, errMediaCount:4000
name:FLASH_1_0, errSeekCount:5000
name:FLASH_1_1, errOtherCount:6000
name:FLASH_4_0, errHardReadCount:7000
#

Yes, I made the numbers up to make the output interesting.

The diamond operator (<>) in the while loop lets us process multiple files, like this:

# errorCount.pl celldisk-detail.out physicaldisk-detail.out
...

But a quicker way to do the above would be:

# cat *detail.out | errorCount.pl
name:CD_03_dmq1cel04, errorCount:2
name:CD_07_dmq1cel04, errorCount:2
name:CD_09_dmq1cel04, errorCount:1
name:CD_11_dmq1cel04, errorCount:1
name:DATA_CD_03_dmq1cel04, errorCount:2
name:DATA_CD_07_dmq1cel04, errorCount:2
name:DATA_CD_09_dmq1cel04, errorCount:1
name:DATA_CD_11_dmq1cel04, errorCount:1
#
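When several files go through the script in one run, the report does not say which file a line came from. Below is a minimal sketch of a variant that adds the file name. It relies on Perl's built-in $ARGV variable, which holds the name of the file currently being read by the diamond operator. The helper name scan_files is hypothetical, and clearing the disk name at the end of each file is my own choice, so that a name from one file never labels a count from the next.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: scan the named files and return one report line per
# non-zero error count, prefixed with the name of the file it came from.
sub scan_files {
    my @report;
    my $name = '';
    local @ARGV = @_;    # the diamond operator reads the files listed in @ARGV
    while (<>) {
        s/\s+//g;
        if    ( /name/ )              { $name = $_ }
        elsif ( /err.*?Count:[1-9]/ ) { push @report, "$ARGV: $name, $_" }
        # eof without parentheses is true at the end of each input file
        $name = '' if eof;
    }
    return @report;
}

# Only run when file names are given on the command line
if (@ARGV) {
    print "$_\n" for scan_files(@ARGV);
}
```

This only helps when the files are passed as arguments. With cat, the input arrives on standard input and the file names are lost.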

Check any count

The script can easily be modified to report on any disk attribute that has a non-zero value. For example, to check if there is free space on a cell disk, we can use the modified script freeSpace.pl:

#!/usr/bin/perl
while (<>) {
 s/\s+//g;
 if ( /name/ ) { $name = $_ }
 elsif ( /freeSpace:[1-9]/ ) { print "$name, $_\n" }
}

Like this:

# dcli -g cell_group -l root cellcli -e list celldisk detail | freeSpace.pl
exacell01:name:CD_00_exacell01, exacell01:freeSpace:528.6875G
#
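Rather than keeping one script per attribute, the attribute name can be passed as an argument. The following is a minimal sketch with a hypothetical helper, nonzero_attr, and a hypothetical script name, attrCount.pl. It also swaps the [1-9] pattern for a numeric comparison, because [1-9] would skip a value that starts with zero but is not zero, such as 0.5G. Whether cellcli ever reports such a value is an assumption on my part.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return "name, attribute" report lines for every line
# where the given attribute (matched literally) has a value greater than zero.
sub nonzero_attr {
    my ( $attr, @lines ) = @_;
    my $name = '';
    my @report;
    for my $line (@lines) {
        ( my $text = $line ) =~ s/\s+//g;
        if ( $text =~ /name/ ) {
            $name = $text;
        }
        # Capture the leading number and compare it numerically,
        # so that 0.5G counts as non-zero and 0 does not
        elsif ( $text =~ /\b\Q$attr\E:(\d+(?:\.\d+)?)/ && $1 > 0 ) {
            push @report, "$name, $text";
        }
    }
    return @report;
}

# Usage (assumed script name): attrCount.pl ATTRIBUTE [FILE...]
if (@ARGV) {
    my $attr = shift @ARGV;
    print "$_\n" for nonzero_attr( $attr, <> );
}
```

For example, running it with freeSpace as the attribute would replace freeSpace.pl, and errMediaCount would report media errors only.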

Conclusion

In engineered systems, including Exadata, ASM uses griddisks, which are created from celldisks. The celldisks are in turn created from the physical disks. To quickly check for errors on any of those disks, we can use the errorCount.pl Perl script, either directly on the cell or via the dcli utility, which we run on a database server.