December 25, 2013

Tell me about your ASM

When diagnosing ASM issues, it helps to know a bit about the setup: disk group names and types, the state of the disks, ASM instance initialisation parameters and whether any rebalance operations are in progress. In those cases I usually ask for an HTML report, produced by running a SQL script against one of the ASM instances. This post is about that script, with comments about its output.

The script

First, here is the script, which may be saved as asm_report.sql:

spool /tmp/ASM_report.html
set markup html on
set echo off
set feedback off
set pages 10000
break on INST_ID on GROUP_NUMBER
prompt ASM report
select to_char(SYSDATE, 'DD-Mon-YYYY HH24:MI:SS') "Time" from dual;
prompt Version
select * from V$VERSION where BANNER like '%Database%' order by 1;
prompt Cluster wide operations
select * from GV$ASM_OPERATION order by 1;
prompt
prompt Disk groups, including the dismounted disk groups
select * from V$ASM_DISKGROUP order by 1, 2, 3;
prompt All disks, including the candidate disks
select GROUP_NUMBER, DISK_NUMBER, FAILGROUP, NAME, LABEL, PATH, MOUNT_STATUS, HEADER_STATUS, STATE, OS_MB, TOTAL_MB, FREE_MB, CREATE_DATE, MOUNT_DATE, SECTOR_SIZE, VOTING_FILE, FAILGROUP_TYPE
from V$ASM_DISK
where MODE_STATUS='ONLINE'
order by 1, 2;
prompt Offline disks
select GROUP_NUMBER, DISK_NUMBER, FAILGROUP, NAME, MOUNT_STATUS, HEADER_STATUS, STATE, REPAIR_TIMER
from V$ASM_DISK
where MODE_STATUS='OFFLINE'
order by 1, 2;
prompt Disk group attributes
select GROUP_NUMBER, NAME, VALUE from V$ASM_ATTRIBUTE where NAME not like 'template%' order by 1;
prompt Connected clients
select * from V$ASM_CLIENT order by 1, 2;
prompt Non-default ASM specific initialisation parameters, including the hidden ones
select KSPPINM "Parameter", KSPFTCTXVL "Value"
from X$KSPPI a, X$KSPPCV2 b
where a.INDX + 1 = KSPFTCTXPN and (KSPPINM like '%asm%' or KSPPINM like '%balance%' or KSPPINM like '%auto_manage%') and kspftctxdf = 'FALSE'
order by 1 desc;
prompt Memory, cluster and instance specific initialisation parameters
select NAME "Parameter", VALUE "Value", ISDEFAULT "Default"
from V$PARAMETER
where NAME like '%target%' or NAME like '%pool%' or NAME like 'cluster%' or NAME like 'instance%'
order by 1;
prompt Disk group imbalance
select g.NAME "Diskgroup",
100*(max((d.TOTAL_MB-d.FREE_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))/(d.TOTAL_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576)))-min((d.TOTAL_MB-d.FREE_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))/(d.TOTAL_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))))/max((d.TOTAL_MB-d.FREE_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))/(d.TOTAL_MB + (128*g.ALLOCATION_UNIT_SIZE/1048576))) "Imbalance",
count(*) "Disk count",
g.TYPE "Type"
from V$ASM_DISK_STAT d , V$ASM_DISKGROUP_STAT g
where d.GROUP_NUMBER = g.GROUP_NUMBER and d.STATE = 'NORMAL' and d.MOUNT_STATUS = 'CACHED'
group by g.NAME, g.TYPE;
prompt End of ASM report
set markup html off
set echo on
set feedback on
exit

To produce the report, which will be saved as /tmp/ASM_report.html, run the script as the OS user that owns the Grid Infrastructure home (usually grid or oracle), against an ASM instance (say +ASM1), like this:

$ sqlplus -S / as sysasm @asm_report.sql

To save the output in a different location or under a different name, just modify the spool command (line 1 in the script).

The report

The report first shows the time of the report and the ASM version.

It then shows if there are any ASM operations in progress. In this excerpt we see a rebalance running in ASM instance 1. It can also be seen that the resync and rebalance have completed and that the compacting is the only outstanding operation:


Next we see the information about all disk groups, including the dismounted disk groups. This is then followed by the info about disks, again with the note that this includes the candidate disks.

I have separated the info about offline disks, as this may be of interest when dealing with disk issues. That section looks like this:



Next are the disk group attributes, with the note that this will be displayed only for ASM version 11.1 and later, as we did not have the disk group attributes in earlier versions.

This is followed by the list of connected clients, usually database instances served by that ASM instance.

The section with ASM initialisation parameters includes hidden and some Exadata specific (_auto_manage) parameters. Here is a small sample:

I have also separated the memory, cluster and instance specific initialisation parameters as they are often of special interest.

The last section shows the disk group imbalance report.
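To make the imbalance figure easier to read, here is a minimal Python sketch of the same arithmetic the SQL performs for one disk group. The function name and the disk sizes are made up for illustration; only the formula comes from the script.

```python
def disk_group_imbalance(disks, au_size_bytes):
    """Return the imbalance percentage for one disk group.

    disks is a list of (total_mb, free_mb) tuples, one tuple per disk.
    """
    # Same padding term as in the SQL: 128 allocation units, expressed in MB.
    pad_mb = 128 * au_size_bytes / 1048576
    used_fractions = [
        (total_mb - free_mb + pad_mb) / (total_mb + pad_mb)
        for total_mb, free_mb in disks
    ]
    fullest = max(used_fractions)
    emptiest = min(used_fractions)
    return 100 * (fullest - emptiest) / fullest


# Hypothetical figures: three 10 GB disks and a 1 MB allocation unit.
AU_1MB = 1048576
balanced = disk_group_imbalance([(10240, 5120)] * 3, AU_1MB)
skewed = disk_group_imbalance(
    [(10240, 5120), (10240, 5120), (10240, 2048)], AU_1MB
)
print(f"balanced: {balanced:.1f}%, skewed: {skewed:.1f}%")
```

A value close to zero means the disks are evenly filled; the larger the number, the further the fullest disk is from the emptiest one.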

Conclusion

While I use this report for a quick overview of ASM, it can also serve as 'backup' information about your ASM setup. You are welcome to modify the script to produce a report that suits your needs. Please let me know if you find any issues with the script or if you have suggestions for improvements.

Acknowledgments

The bulk of the script is based on My Oracle Support (MOS) Doc ID 470211.1, by Oracle Support engineer Esteban D. Bernal.

The imbalance SQL is based on the Reporting Disk Imbalances script from Oracle Press book Oracle Automatic Storage Management, Under-the-Hood & Practical Deployment Guide, by Nitin Vengurlekar, Murali Vallath and Rich Long.

September 29, 2013

ASM metadata blocks


An ASM instance manages the metadata needed to make ASM files available to Oracle databases and other ASM clients. ASM metadata is stored in disk groups and organised in metadata structures. These metadata structures consist of one or more ASM metadata blocks. For example, the ASM disk header consists of a single ASM metadata block. Other structures, like the Partnership and Status Table, consist of exactly one allocation unit (AU). Some ASM metadata, like the File Directory, can span multiple AUs and does not have a predefined size; in fact, the File Directory grows as needed and is managed like any other ASM file.

ASM metadata block types

The following are the ASM metadata block types:
  • KFBTYP_DISKHEAD - The ASM disk header - the very first block in every ASM disk. A copy of this block will be in the second last Partnership and Status Table (PST) block (in ASM version 11.1.0.7 and later). The copy of this block will also be in the very first block in Allocation Unit 11, for disk groups with COMPATIBLE.ASM=12.1 or higher.
  • KFBTYP_FREESPC - The Free Space Table block.
  • KFBTYP_ALLOCTBL - The Allocation Table block.
  • KFBTYP_PST_META - The Partnership and Status Table (PST) block. The PST blocks 0 and 1 will be of this type.
  • KFBTYP_PST_DTA - The PST blocks with the actual PST data.
  • KFBTYP_PST_NONE - The PST block with no PST data. Remember that Allocation Unit 1 (AU1) on every disk is reserved for the PST, but only some disks will have the PST data.
  • KFBTYP_HBEAT - The heartbeat block, in the PST.
  • KFBTYP_FILEDIR - The File Directory block.
  • KFBTYP_INDIRECT - The Indirect File Directory block, containing a pointer to another file directory block.
  • KFBTYP_LISTHEAD - The Disk Directory block. The very first block in the ASM disk directory. The field kfdhdb.f1b1locn in the ASM disk header will point to the allocation unit whose block 0 will be of this type.
  • KFBTYP_DISKDIR - The rest of the blocks in the Disk Directory will be of this type.
  • KFBTYP_ACDC - The Active Change Directory (ACD) block. The very first block of the ACD will be of this type.
  • KFBTYP_CHNGDIR - The blocks with the actual ACD data.
  • KFBTYP_COD_BGO - The Continuing Operations Directory (COD) block for background operations data.
  • KFBTYP_COD_RBO - The COD block that marks the rollback operations data.
  • KFBTYP_COD_DATA - The COD block with the actual rollback operations data.
  • KFBTYP_TMPLTDIR - The Template Directory block.
  • KFBTYP_ALIASDIR - The Alias Directory block.
  • KFBTYP_SR - The Staleness Registry block.
  • KFBTYP_STALEDIR - The Staleness Directory block.
  • KFBTYP_VOLUMEDIR - The ADVM Volume Directory block.
  • KFBTYP_ATTRDIR - The Attributes Directory block.
  • KFBTYP_USERDIR - The User Directory block.
  • KFBTYP_GROUPDIR - The User Group Directory block.
  • KFBTYP_USEDSPC - The Disk Used Space Directory block.
  • KFBTYP_ASMSPFALS - The ASM spfile alias block.
  • KFBTYP_PASWDDIR - The ASM Password Directory block.
  • KFBTYP_INVALID - Not an ASM metadata block.
Note that KFBTYP_INVALID is not an actual block type stored in an ASM metadata block. Instead, ASM will return this if it encounters a block whose type is not one of the valid ASM metadata block types. For example, if the ASM disk header is corrupt, say zeroed out, ASM will report it as KFBTYP_INVALID. We will see the same when reading such a block with the kfed tool.

ASM metadata block

The default ASM metadata block size is 4096 bytes. The block size will be specified in the ASM disk header field kfdhdb.blksize. Note that the ASM metadata block size has nothing to do with the database block size.

ASM metadata block header

The first 32 bytes of an ASM metadata block contain the block header (not to be confused with the ASM disk header). The block header has the following information:
  • kfbh.endian - Platform endianness.
  • kfbh.hard - H.A.R.D. (Hardware Assisted Resilient Data) signature.
  • kfbh.type - Block type.
  • kfbh.datfmt - Block data format.
  • kfbh.block.blk - Location (block number). 
  • kfbh.block.obj - Data type held in this block.
  • kfbh.check - Block checksum.
  • kfbh.fcn.base - Block change control number (base).
  • kfbh.fcn.wrap - Block change control number (wrap).
The FCN is the ASM equivalent of database SCN.

The rest of the contents of an ASM metadata block will be specific to the block type. In other words, an ASM disk header block will have the disk header specific data - disk number, disk name, disk group name, etc. A file directory block will have the extent location data for a file, etc.
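As an illustration only, the following Python sketch decodes a 32-byte block header from a raw block. The byte layout is my assumption: the first three offsets agree with the kfed listings in the Allocation Table and Free Space Table posts, while the widths of the remaining fields, the little-endian byte order and the two spare words are assumed, not taken from any Oracle documentation. The sample header is made up and was not read from a disk.

```python
import struct

# Assumed layout, little-endian platform. Offsets 0x000-0x002 agree with the
# kfed listings in these posts; everything after that is an assumption.
KFBH_FORMAT = "<BBBBIIIIIII"
KFBH_FIELDS = (
    "endian", "hard", "type", "datfmt",
    "block_blk", "block_obj", "check",
    "fcn_base", "fcn_wrap",
    "spare1", "spare2",  # hypothetical names for the remaining 8 bytes
)

# Type numbers taken from the kfed listings in these posts.
KNOWN_TYPES = {2: "KFBTYP_FREESPC", 3: "KFBTYP_ALLOCTBL"}


def parse_block_header(block):
    """Decode the first 32 bytes of a metadata block into a dict."""
    values = struct.unpack_from(KFBH_FORMAT, block, 0)
    header = dict(zip(KFBH_FIELDS, values))
    header["type_name"] = KNOWN_TYPES.get(header["type"], "not decoded here")
    return header


# A made-up header for illustration, padded to a 4096-byte block.
sample = struct.pack(KFBH_FORMAT, 1, 130, 3, 2, 2, 0, 0, 0, 0, 0, 0)
header = parse_block_header(sample + bytes(4096 - len(sample)))
print(header["type_name"], header["block_blk"])
```

For real work the kfed tool remains the right way to read these blocks; the sketch is only meant to show how little structure the header carries.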

Conclusion

An ASM instance manages ASM metadata blocks. It creates them, updates them, calculates and updates the checksum on writes, reads and verifies the checksum on reads, exchanges the blocks with other instances, etc. ASM metadata structures consist of one or more ASM metadata blocks. A tool like kfed can be used to read and modify ASM metadata blocks.

August 31, 2013

Partnership and Status Table


The Partnership and Status Table (PST) contains the information about all ASM disks in a disk group – disk number, disk status, partner disk number, heartbeat info and the failgroup info (11g and later).

Allocation unit number 1 on every ASM disk is reserved for the PST, but only some disks will have the PST data.

PST count

In an external redundancy disk group there will be only one copy of the PST.

In a normal redundancy disk group there will be at least two copies of the PST. If there are three or more failgroups, there will be three copies of the PST.

In a high redundancy disk group there will be at least three copies of the PST. If there are four failgroups, there will be four PST copies, and if there are five or more failgroups there will be five copies of the PST.
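The rules above can be written down as a small Python function. This is just a sketch of the counting rule as stated in this post; the function name is made up, and the lower limits (two failgroups for normal redundancy, three for high) are the usual ASM minimums for those redundancy levels.

```python
def pst_copy_count(redundancy, failgroups):
    """Return the expected number of PST copies in a disk group."""
    redundancy = redundancy.lower()
    if redundancy == "external":
        return 1
    if redundancy == "normal":
        if failgroups < 2:
            raise ValueError("normal redundancy needs at least 2 failgroups")
        return min(failgroups, 3)
    if redundancy == "high":
        if failgroups < 3:
            raise ValueError("high redundancy needs at least 3 failgroups")
        return min(failgroups, 5)
    raise ValueError(f"unknown redundancy: {redundancy}")


# Five disks with no failgroups specified means five failgroups.
for level in ("external", "normal", "high"):
    print(level, pst_copy_count(level, 5))
```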

Let's have a look. Note that in each example, the disk group is created with five disks.

External redundancy disk group

SQL> CREATE DISKGROUP DG1 EXTERNAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log:

Sat Aug 31 20:44:59 2013
SQL> CREATE DISKGROUP DG1 EXTERNAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:44:59 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:45:00 2013
GMON updating group 2 at 50 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
Sat Aug 31 20:45:00 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates only one copy of the PST.

Normal redundancy disk group

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log

Sat Aug 31 20:49:28 2013
SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:49:28 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
Sat Aug 31 20:49:28 2013
NOTE: group 2 PST updated.
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:49:28 2013
GMON updating group 2 at 68 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)
Sat Aug 31 20:49:28 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates three copies of the PST.

High redundancy disk group

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 HIGH REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log

Sat Aug 31 20:51:52 2013
SQL> CREATE DISKGROUP DG1 HIGH REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:51:52 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
Sat Aug 31 20:51:53 2013
NOTE: group 2 PST updated.
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:51:53 2013
GMON updating group 2 at 77 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)
NOTE: group DG1: initial PST location: disk 0003 (PST copy 3)
NOTE: group DG1: initial PST location: disk 0004 (PST copy 4)
Sat Aug 31 20:51:53 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates five copies of the PST.

PST relocation

The PST would be relocated in the following cases:
  • The disk with the PST is not available (on ASM startup)
  • The disk goes offline
  • There was an I/O error while reading from or writing to the PST
  • The disk is dropped gracefully
In all cases the PST would be relocated to another disk in the same failgroup (if a disk is available in the same failure group) or to another failgroup (that doesn't already contain a copy of the PST).

Let's have a look.

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8';

Diskgroup created.

ASM alert log shows the PST copies are on disks 0, 1 and 2:

NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)

Let's drop disk 0:

SQL> select disk_number, name, path from v$asm_disk_stat
where group_number = (select group_number from v$asm_diskgroup_stat where name='DG1');

DISK_NUMBER NAME                           PATH
----------- ------------------------------ ----------------
          3 DG1_0003                       /dev/sdc8
          2 DG1_0002                       /dev/sdc7
          1 DG1_0001                       /dev/sdc6
          0 DG1_0000                       /dev/sdc5

SQL> alter diskgroup DG1 drop disk DG1_0000;

Diskgroup altered.

ASM alert log

Sat Aug 31 21:04:29 2013
SQL> alter diskgroup DG1 drop disk DG1_0000
...
NOTE: initiating PST update: grp 2 (DG1), dsk = 0/0xe9687ff6, mask = 0x6a, op = clear
Sat Aug 31 21:04:37 2013
GMON updating disk modes for group 2 at 96 for pid 24, osid 16502
NOTE: group DG1: updated PST location: disk 0001 (PST copy 0)
NOTE: group DG1: updated PST location: disk 0002 (PST copy 1)
NOTE: group DG1: updated PST location: disk 0003 (PST copy 2)
...

We see that the PST copy from disk 0 was moved to disk 3.
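When the alert log is long, the PST location messages can be picked out with a short script. Here is a minimal Python sketch that keeps the most recent location of every PST copy; the function name is made up, and the regular expression covers only the two message forms shown in this post.

```python
import re

PST_LOCATION = re.compile(
    r"NOTE: group (?P<group>\S+): (?:initial|updated) PST location: "
    r"disk (?P<disk>\d+) \(PST copy (?P<copy>\d+)\)"
)


def latest_pst_locations(alert_log_lines):
    """Return {group name: {PST copy number: disk number}}."""
    locations = {}
    for line in alert_log_lines:
        match = PST_LOCATION.search(line)
        if match is None:
            continue
        copy = int(match["copy"])
        if copy == 0:
            # Assumption: each listing starts at copy 0, so start afresh.
            locations[match["group"]] = {}
        locations.setdefault(match["group"], {})[copy] = int(match["disk"])
    return locations


log = [
    "NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)",
    "NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)",
    "NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)",
    "SQL> alter diskgroup DG1 drop disk DG1_0000",
    "NOTE: group DG1: updated PST location: disk 0001 (PST copy 0)",
    "NOTE: group DG1: updated PST location: disk 0002 (PST copy 1)",
    "NOTE: group DG1: updated PST location: disk 0003 (PST copy 2)",
]
print(latest_pst_locations(log))
```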

Disk Partners

A disk partnership is a symmetric relationship between two disks in a high or normal redundancy disk group. There are no disk partnerships in external redundancy disk groups. For a discussion on this topic, please see the post How many partners.

PST Availability

The PST has to be available before the rest of ASM metadata. When the disk group mount is requested, the GMON process (on the instance requesting a mount) reads all disks in the disk group to find and verify all available PST copies. Once it verifies that there are enough PSTs for a quorum, it mounts the disk group. From that point on, the PST is available in the ASM instance cache, stored in the GMON PGA and protected by an exclusive lock on the PT.n.0 enqueue.

As other ASM instances in the same cluster come online, they cache the PST in their GMON PGA with a shared PT.n.0 enqueue.

Only the GMON (the CKPT in 10gR1) that holds an exclusive lock on the PT enqueue can update the PST information on disks.

PST (GMON) tracing

The GMON trace file will log the PST info every time a disk group mount is attempted. Note that I said attempted, not mounted, as the GMON will log the information regardless of the mount being successful or not. This information may be valuable to Oracle Support in diagnosing disk group mount failures.

This is the typical information logged in the GMON trace file on a disk group mount:

=============== PST ====================
grpNum:    2
grpTyp:    2
state:     1
callCnt:   103
bforce:    0x0
(lockvalue) valid=1 ver=0.0 ndisks=3 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR --------------------
next:    7
last:    7
pst count:       3
pst locations:   1  2  3
incarn:          4
dta size:        4
version:         0
ASM version:     168820736 = 10.1.0.0.0
contenttype:     0
--------------- LOC MAP ----------------
0: dirty 0       cur_loc: 0      stable_loc: 0
1: dirty 0       cur_loc: 0      stable_loc: 0
--------------- DTA --------------------
1: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 2 (amp) 3 (amp)
2: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 1 (amp) 3 (amp)
3: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 1 (amp) 2 (amp)
...

The section marked === PST === tells us the group number (grpNum), type (grpTyp) and state. The section marked --- HDR --- shows the number of PST copies (pst count) and the disk numbers that have those copies (pst locations). The section marked --- DTA --- shows the actual state of the disks with the PST.
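The two HDR fields of interest can also be extracted with a few lines of Python. This is a minimal sketch that assumes the trace lines look exactly like the ones above; the function name is made up.

```python
def parse_pst_header(trace_lines):
    """Pick 'pst count' and 'pst locations' out of a GMON trace excerpt."""
    header = {}
    for line in trace_lines:
        key, separator, value = line.partition(":")
        if not separator:
            continue  # no "name: value" pair on this line
        key = key.strip()
        if key == "pst count":
            header["pst_count"] = int(value)
        elif key == "pst locations":
            header["pst_locations"] = [int(disk) for disk in value.split()]
    return header


trace = [
    "--------------- HDR --------------------",
    "next:    7",
    "last:    7",
    "pst count:       3",
    "pst locations:   1  2  3",
    "incarn:          4",
]
pst = parse_pst_header(trace)
print(pst)
# Sanity check: one location per PST copy.
print(len(pst["pst_locations"]) == pst["pst_count"])
```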

Conclusion

The Partnership and Status Table contains the information about all ASM disks in a disk group – disk number, disk status, partner disk number, heartbeat info and the failgroup info (11g and later).

Allocation unit number 1 on every ASM disk is reserved for the PST, but only some disks will have the PST data. As the PST is valuable ASM metadata, it is mirrored three times in a normal redundancy disk group and five times in a high redundancy disk group - provided there are enough failgroups of course.


August 24, 2013

Allocation Table


Every ASM disk contains at least one Allocation Table (AT) that describes the contents of the disk. The AT has one entry for every allocation unit (AU) on the disk. If an AU is allocated, the Allocation Table will have the extent number and the file number the AU belongs to.

Finding the Allocation Table

The location of the first block of the Allocation Table is stored in the ASM disk header (field kfdhdb.altlocn). In the following example, the lookup of that field shows that the AT starts at block 2.

$ kfed read /dev/sdc1 | grep kfdhdb.altlocn
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002

Let’s have a closer look at the first block of the Allocation Table.

$ kfed read /dev/sdc1 blkn=2 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
kfdatb.aunum:                         0 ; 0x000: 0x00000000
kfdatb.shrink:                      448 ; 0x004: 0x01c0
...

The kfdatb.aunum=0 means that AU0 is the first AU described by this AT block. The kfdatb.shrink=448 means that this AT block can hold the information for 448 AUs. In the next AT block we should see kfdatb.aunum=448, meaning that it describes AU448 and the 447 AUs that follow. Let’s have a look:

$ kfed read /dev/sdc1 blkn=3 | grep kfdatb.aunum
kfdatb.aunum:                       448 ; 0x000: 0x000001c0

The next AT block should show kfdatb.aunum=896:

$ kfed read /dev/sdc1 blkn=4 | grep kfdatb.aunum
kfdatb.aunum:                       896 ; 0x000: 0x00000380

And so on...
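Put differently, finding the AT block and entry for a given AU is a simple division. Here is a minimal Python sketch for the first stride of the disk in this example; the function name is made up, and the constants are the values kfed reported above.

```python
ENTRIES_PER_AT_BLOCK = 448   # kfdatb.shrink in this example
FIRST_AT_BLOCK = 2           # kfdhdb.altlocn in this example
STRIDE_AUS = 113792          # kfdhdb.mfact for this disk


def at_entry_for_au(au_number):
    """Return (blkn, kfdate index) describing an AU in the first stride."""
    if not 0 <= au_number < STRIDE_AUS:
        raise ValueError("this sketch only covers the first stride")
    block_offset, index = divmod(au_number, ENTRIES_PER_AT_BLOCK)
    return FIRST_AT_BLOCK + block_offset, index


for au in (0, 448, 896):
    blkn, index = at_entry_for_au(au)
    print(f"AU{au}: kfed read /dev/sdc1 blkn={blkn}, entry kfdate[{index}]")
```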

Allocation table entries

For allocated AUs, the Allocation Table entry (kfdate[i]) holds the extent number, the file number and the state of the allocation unit: allocated (flag V=1) or free (flag V=0).

Let’s have a look at Allocation Table block 3.

$ kfed read /dev/sdc1 blkn=3 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
kfdatb.aunum:                       448 ; 0x000: 0x000001c0
...
kfdate[142].discriminator:            1 ; 0x498: 0x00000001
kfdate[142].allo.lo:                  0 ; 0x498: XNUM=0x0
kfdate[142].allo.hi:            8388867 ; 0x49c: V=1 I=0 H=0 FNUM=0x103
kfdate[143].discriminator:            1 ; 0x4a0: 0x00000001
kfdate[143].allo.lo:                  1 ; 0x4a0: XNUM=0x1
kfdate[143].allo.hi:            8388867 ; 0x4a4: V=1 I=0 H=0 FNUM=0x103
kfdate[144].discriminator:            1 ; 0x4a8: 0x00000001
kfdate[144].allo.lo:                  2 ; 0x4a8: XNUM=0x2
kfdate[144].allo.hi:            8388867 ; 0x4ac: V=1 I=0 H=0 FNUM=0x103
kfdate[145].discriminator:            1 ; 0x4b0: 0x00000001
kfdate[145].allo.lo:                  3 ; 0x4b0: XNUM=0x3
kfdate[145].allo.hi:            8388867 ; 0x4b4: V=1 I=0 H=0 FNUM=0x103
kfdate[146].discriminator:            1 ; 0x4b8: 0x00000001
kfdate[146].allo.lo:                  4 ; 0x4b8: XNUM=0x4
kfdate[146].allo.hi:            8388867 ; 0x4bc: V=1 I=0 H=0 FNUM=0x103
kfdate[147].discriminator:            1 ; 0x4c0: 0x00000001
kfdate[147].allo.lo:                  5 ; 0x4c0: XNUM=0x5
kfdate[147].allo.hi:            8388867 ; 0x4c4: V=1 I=0 H=0 FNUM=0x103
kfdate[148].discriminator:            0 ; 0x4c8: 0x00000000
kfdate[148].free.lo.next:            16 ; 0x4c8: 0x0010
kfdate[148].free.lo.prev:            16 ; 0x4ca: 0x0010
kfdate[148].free.hi:                  2 ; 0x4cc: V=0 ASZM=0x2
kfdate[149].discriminator:            0 ; 0x4d0: 0x00000000
kfdate[149].free.lo.next:             0 ; 0x4d0: 0x0000
kfdate[149].free.lo.prev:             0 ; 0x4d2: 0x0000
kfdate[149].free.hi:                  0 ; 0x4d4: V=0 ASZM=0x0
...

The excerpt shows the Allocation Table entries for file 259 (hexadecimal FNUM=0x103), which start at kfdate[142] and end at kfdate[147]. That shows that ASM file 259 has a total of 6 AUs. The AU number is the index of kfdate[i] plus the offset (kfdatb.aunum=448). In other words, 142+448=590, 143+448=591 ... 147+448=595. Let's verify that by querying X$KFFXP:

SQL> select AU_KFFXP
from X$KFFXP
where GROUP_KFFXP=1  -- disk group 1
and NUMBER_KFFXP=259 -- file 259
;

  AU_KFFXP
----------
       590
       591
       592
       593
       594
       595

6 rows selected.
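The same numbers can be reproduced in a few lines of Python. Treat the bit layout as an assumption: the kfed listing only tells us that 8388867 is 0x800103, that V=1 and that FNUM=0x103, so bit 23 as the V flag follows from that, while the 20-bit width of the file number field is my guess.

```python
VALID_BIT = 1 << 23   # the only bit set outside FNUM in the listing (V=1)
FNUM_MASK = 0xFFFFF   # assumption: the file number sits in the low 20 bits


def decode_allo_hi(value):
    """Return (V flag, file number) for a kfdate[i].allo.hi value."""
    valid = 1 if value & VALID_BIT else 0
    return valid, value & FNUM_MASK


AT_BLOCK_FIRST_AU = 448             # kfdatb.aunum of AT block 3
file_259_entries = range(142, 148)  # kfdate[142] .. kfdate[147]
file_259_aus = [AT_BLOCK_FIRST_AU + index for index in file_259_entries]

print(decode_allo_hi(8388867))
print(file_259_aus)
```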

Free space

In the above kfed output, we see that kfdate[148] and kfdate[149] have the word free next to them, which marks them as free or unallocated allocation units (flagged with V=0). That kfed output is truncated, but there are many more free allocation units described by this AT block.

The stride

Each AT block can describe 448 AUs (the kfdatb.shrink value from the Allocation Table), and the whole AT can have 254 blocks (the kfdfsb.max value from the Free Space Table). This means that one Allocation Table can describe 254x448=113792 allocation units. This is called the stride, and the stride size - expressed in number of allocation units - is in the field kfdhdb.mfact, in ASM disk header:

$ kfed read /dev/sdc1 | grep kfdhdb.mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

The stride size in this example is for an AU size of 1MB, where AU0 can fit 256 metadata blocks. Block 0 is the disk header and block 1 is the Free Space Table, which leaves 254 blocks for the Allocation Table.

With the AU size of 4MB (default in Exadata), the stride size will be 454272 allocation units or 1817088 MB. With the larger AU size, the stride will also be larger.
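Here is the arithmetic as a short Python sketch. For the 1MB case it derives the stride from the block counts; for the 4MB case it starts from the stride size quoted above, because 454272 AUs works out to 1014 AT blocks, so I do not assume that the "blocks in AU0 minus two" shortcut carries over to other AU sizes.

```python
METADATA_BLOCK_SIZE = 4096   # default ASM metadata block size, in bytes
ENTRIES_PER_AT_BLOCK = 448   # kfdatb.shrink

# 1 MB allocation unit, as in the kfed example.
blocks_in_au0 = 1048576 // METADATA_BLOCK_SIZE
at_blocks = blocks_in_au0 - 2          # minus the disk header and the FST
stride_aus_1mb = at_blocks * ENTRIES_PER_AT_BLOCK

# 4 MB allocation unit: start from the quoted stride size instead.
stride_aus_4mb = 454272
stride_mb_4mb = stride_aus_4mb * 4
at_blocks_4mb, leftover = divmod(stride_aus_4mb, ENTRIES_PER_AT_BLOCK)

print(blocks_in_au0, at_blocks, stride_aus_1mb)
print(stride_mb_4mb, at_blocks_4mb, leftover)
```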

How many Allocation Tables

Large ASM disks may have more than one stride. Each stride will have its own physically addressed metadata, which means that it will have its own Allocation Table.

The second stride will have its physically addressed metadata in the first AU of the stride. Let's have a look.

$ kfed read /dev/sdc1 | grep mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

This shows the stride size is 113792 AUs. Let's check the AT entries for the second stride. Those should be in blocks 2-255 in AU113792.

$ kfed read /dev/sdc1 aun=113792 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
$ kfed read /dev/sdc1 aun=113792 blkn=255 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

As expected, we have another AT in AU113792. If we had another stride, there would be another AT at the beginning of that stride. As it happens, I have a large disk with a few strides, so we see the AT at the beginning of the third stride as well:

$ kfed read /dev/sdc1 aun=227584 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
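The starting AU of every stride is simply a multiple of the stride size. The following Python sketch lists them and prints the matching kfed commands; the disk size of 300000 AUs is a made-up figure, and the function name is hypothetical.

```python
def stride_start_aus(disk_size_aus, stride_aus):
    """Return the first AU of every stride on a disk."""
    return list(range(0, disk_size_aus, stride_aus))


STRIDE_AUS = 113792        # kfdhdb.mfact from the example
DISK_SIZE_AUS = 300000     # hypothetical disk size, in allocation units

starts = stride_start_aus(DISK_SIZE_AUS, STRIDE_AUS)
for aun in starts:
    # Block 2 of the first AU in each stride should be an AT block.
    print(f"kfed read /dev/sdc1 aun={aun} blkn=2 | grep type")
```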

Conclusion

Every ASM disk contains at least one Allocation Table that describes the contents of the disk. The AT has one entry for every allocation unit on the disk. If the disk has more than one stride, each stride will have its own Allocation Table.

August 23, 2013

Free Space Table


The ASM Free Space Table (FST) provides a summary of which allocation table blocks have free space. It contains an array of bit patterns indexed by allocation table block number. The table is used to speed up the allocation of new allocation units by avoiding reading blocks that are full.

The FST is technically part of the Allocation Table (AT), and is at block 1 of the AT. The Free Space Table and the Allocation Table are so-called physically addressed metadata, as they are always at a fixed location on each ASM disk.

Locating the Free Space Table

The location of the FST block is stored in the ASM disk header (field kfdhdb.fstlocn). In the following example, the lookup of that field in the disk header shows that the FST is in block 1.

$ kfed read /dev/sdc1 | grep kfdhdb.fstlocn
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001

Let’s have a closer look at the FST:

$ kfed read /dev/sdc1 blkn=1 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
...
kfdfsb.aunum:                         0 ; 0x000: 0x00000000
kfdfsb.max:                         254 ; 0x004: 0x00fe
kfdfsb.cnt:                         254 ; 0x006: 0x00fe
kfdfsb.bound:                         0 ; 0x008: 0x0000
kfdfsb.flag:                          1 ; 0x00a: B=1
kfdfsb.ub1spare:                      0 ; 0x00b: 0x00
kfdfsb.spare[0]:                      0 ; 0x00c: 0x00000000
kfdfsb.spare[1]:                      0 ; 0x010: 0x00000000
kfdfsb.spare[2]:                      0 ; 0x014: 0x00000000
kfdfse[0].fse:                      119 ; 0x018: FREE=0x7 FRAG=0x7
kfdfse[1].fse:                       16 ; 0x019: FREE=0x0 FRAG=0x1
kfdfse[2].fse:                       16 ; 0x01a: FREE=0x0 FRAG=0x1
kfdfse[3].fse:                       16 ; 0x01b: FREE=0x0 FRAG=0x1
...
kfdfse[4037].fse:                     0 ; 0xfdd: FREE=0x0 FRAG=0x0
kfdfse[4038].fse:                     0 ; 0xfde: FREE=0x0 FRAG=0x0
kfdfse[4039].fse:                     0 ; 0xfdf: FREE=0x0 FRAG=0x0

For this FST block, the first allocation table block is in AU 0:

kfdfsb.aunum:                         0 ; 0x000: 0x00000000

The maximum number of FST entries this block can hold is 254:

kfdfsb.max:                         254 ; 0x004: 0x00fe
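Each kfdfse[i].fse value is a single byte that kfed splits into FREE and FRAG. Judging by the listing above (119 is 0x77, shown as FREE=0x7 FRAG=0x7, and 16 is 0x10, shown as FREE=0x0 FRAG=0x1), FREE is the low nibble and FRAG the high nibble. A minimal Python sketch of that decoding, with a made-up function name:

```python
def decode_fse(value):
    """Split one kfdfse[i].fse byte into (FREE, FRAG) nibbles."""
    free = value & 0x0F
    frag = (value >> 4) & 0x0F
    return free, frag


for fse in (119, 16, 0):
    free, frag = decode_fse(fse)
    print(f"fse={fse}: FREE={free:#x} FRAG={frag:#x}")

# The listing runs from kfdfse[0] at 0x018 to kfdfse[4039] at 0xfdf,
# one byte per entry.
entry_count = 0xFDF - 0x018 + 1
print(entry_count)
```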

How many Free Space Tables

Large ASM disks may have more than one stride. The field kfdhdb.mfact in the ASM disk header shows the stride size, expressed in allocation units. Each stride will have its own physically addressed metadata, which means that it will have its own Free Space Table.

The second stride will have its physically addressed metadata in the first AU of the stride. Let's have a look.

$ kfed read /dev/sdc1 | grep mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

This shows the stride size is 113792 AUs. Let's check the FST for the second stride. That should be in block 1 in AU113792.

$ kfed read /dev/sdc1 aun=113792 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC

As expected, we have another FST in AU113792. If we had another stride, there would be another FST at the beginning of that stride. As it happens, I have a large disk with a few strides, so we see the FST at the beginning of the third stride as well:

$ kfed read /dev/sdc1 aun=227584 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC

Conclusion

The Free Space Table is in block 1 of allocation unit 0 of every ASM disk. If the disk has more than one stride, each stride will have its own Free Space Table.


August 17, 2013

Physical metadata replication


Starting with version 12.1, ASM replicates the physically addressed metadata. This means that ASM maintains two copies of the disk header, the Free Space Table and the Allocation Table data. Note that this metadata is not mirrored, but replicated. ASM mirroring refers to copies of the same data on different disks. The copies of the physical metadata are on the same disk, hence the term replicated. This also means that the physical metadata is replicated even in an external redundancy disk group.

The Partnership and Status Table (PST) is also referred to as physically addressed metadata, but the PST is not replicated. This is because the PST is protected by mirroring - in normal and high redundancy disk groups.

Where is the replicated metadata

The physically addressed metadata is in allocation unit 0 (AU0) on every ASM disk. With this feature enabled, ASM will copy the contents of AU0 into allocation unit 11 (AU11), and from that point on, it will maintain both copies. This feature will be automatically enabled when a disk group is created with ASM compatibility of 12.1 or higher, or when ASM compatibility is advanced to 12.1 or higher, for an existing disk group.

If there is data in AU11, when the ASM compatibility is advanced to 12.1 or higher, ASM will simply move that data somewhere else, and use AU11 for the physical metadata replication.

Since version 11.1.0.7, ASM keeps a copy of the disk header in the second last block of AU1. Interestingly, in version 12.1, ASM still keeps the copy of the disk header in AU1, which means that now every ASM disk will have three copies of the disk header block.
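To see where the three copies live, here is a minimal Python sketch that works out the block addresses and prints the matching kfed commands. The function name and the device path are placeholders, and the sketch assumes the default 4096-byte metadata block.

```python
def disk_header_copies(au_size_bytes, block_size=4096, replicated=True):
    """Return (aun, blkn) pairs for every copy of the disk header."""
    blocks_per_au = au_size_bytes // block_size
    copies = [
        (0, 0),                  # the disk header itself
        (1, blocks_per_au - 2),  # second last block of AU1 (11.1.0.7+)
    ]
    if replicated:
        copies.append((11, 0))   # first block of AU11 (12.1 replication)
    return copies


copies_1mb = disk_header_copies(1048576)
for aun, blkn in copies_1mb:
    print(f"kfed read /dev/sdc1 aun={aun} blkn={blkn} | grep kfbh.type")
```

With a 1MB allocation unit the copy in AU1 is in block 254; with a larger allocation unit the block number grows accordingly.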

Disk group attribute PHYS_META_REPLICATED

The status of the physical metadata replication can be checked by querying the disk group attribute PHYS_META_REPLICATED. Here is an example with the asmcmd command that shows how to check the replication status for disk group DATA:

$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name Value
phys_meta_replicated true

The phys_meta_replicated=true means that the physical metadata for disk group DATA has been replicated.

The kfdhdb.flags field in the ASM disk header indicates the status of the physical metadata replication as follows:
  • kfdhdb.flags = 0 - the physical metadata has not been replicated
  • kfdhdb.flags = 1 - the physical metadata has been replicated
  • kfdhdb.flags = 2 - the physical metadata replication is in progress
Once the flag is set to 1, it will never go back to 0.

Metadata replication in action

As stated earlier, the physical metadata will be replicated in disk groups with ASM compatibility of 12.1 or higher. Let's first have a look at a disk group with ASM compatibility set to 12.1:

$ asmcmd lsattr -G DATA -l compatible.asm
Name            Value
compatible.asm  12.1.0.0.0
$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

This shows that the physical metadata has been replicated. Now verify that all disks in the disk group have the kfdhdb.flags set to 1:

$ for disk in `asmcmd lsdsk -G DATA --suppressheader`; do kfed read $disk | egrep "dskname|flags"; done
kfdhdb.dskname:               DATA_0000 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0001 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0002 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0003 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

This shows that all disks have the replication flag set to 1, i.e. that the physical metadata has been replicated for all disks in the disk group.
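
The same check can be scripted. Here is a minimal sketch in Python that parses kfed output like the above and maps kfdhdb.flags to the three states listed earlier. The function and variable names are my own, and the sketch assumes the flag only ever holds the values 0, 1 or 2:

```python
import re

# Meaning of kfdhdb.flags, as listed earlier in this post.
FLAG_STATUS = {
    0: "not replicated",
    1: "replicated",
    2: "replication in progress",
}

def replication_status(kfed_output):
    """Return {disk name: status} parsed from 'kfed read' output lines."""
    status = {}
    disk = None
    for line in kfed_output.splitlines():
        m = re.match(r"kfdhdb\.(dskname|flags):\s+(\S+)\s*;", line.strip())
        if not m:
            continue
        field, value = m.groups()
        if field == "dskname":
            disk = value          # remember the disk this flags line belongs to
        elif disk is not None:
            status[disk] = FLAG_STATUS.get(int(value), "unknown")
    return status

sample = """\
kfdhdb.dskname:               DATA_0000 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0001 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
"""
print(replication_status(sample))
```

In practice the sample text would be replaced with the captured output of the kfed loop shown above.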

Let's now have a look at a disk group with ASM compatibility 11.2, that is later advanced to 12.1:

SQL> create diskgroup DG1 external redundancy
  2  disk '/dev/sdi1'
  3  attribute 'COMPATIBLE.ASM'='11.2';

Diskgroup created.

Check the replication status:

$ asmcmd lsattr -G DG1 -l phys_meta_replicated
Name  Value

Nothing - no such attribute. That is because the ASM compatibility is less than 12.1. We also expect that the kfdhdb.flags is 0 for the only disk in that disk group:

$ kfed read /dev/sdi1 | egrep "type|dskname|grpname|flags"
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.grpname:                     DG1 ; 0x048: length=3
kfdhdb.flags:                         0 ; 0x0fc: 0x00000000

Let's now advance the ASM compatibility to 12.1:

$ asmcmd setattr -G DG1 compatible.asm 12.1.0.0.0

Check the replication status:

$ asmcmd lsattr -G DG1 -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

The physical metadata has been replicated, so we should now see the kfdhdb.flags set to 1:

$ kfed read /dev/sdi1 | egrep "dskname|flags"
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

The physical metadata should be replicated in AU11:

$ kfed read /dev/sdi1 aun=11 | egrep "type|dskname|flags"
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

$ kfed read /dev/sdi1 aun=11 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
$ kfed read /dev/sdi1 aun=11 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

This shows that AU11 has a copy of the data from AU0.

Finally check for the disk header copy in AU1:

$ kfed read /dev/sdi1 aun=1 blkn=254 | grep type
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD

This shows that there is also a copy of the disk header in the second last block of AU1.
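
The block number 254 used above follows from simple arithmetic. The sketch below works it out, assuming a 1 MB allocation unit and a 4 KB ASM metadata block. Both values are assumptions about this particular disk group, not something shown in the kfed output, and the function names are my own:

```python
AU_SIZE = 1024 * 1024   # assumption: 1 MB allocation unit
BLOCK_SIZE = 4096       # assumption: 4 KB ASM metadata block

def blocks_per_au(au_size=AU_SIZE, block_size=BLOCK_SIZE):
    return au_size // block_size

def header_copy_locations(au_size=AU_SIZE, block_size=BLOCK_SIZE):
    """(aun, blkn) pairs where a disk header block is expected in version 12.1."""
    second_last = blocks_per_au(au_size, block_size) - 2   # blocks are numbered from 0
    return [(0, 0), (11, 0), (1, second_last)]

def byte_offset(aun, blkn, au_size=AU_SIZE, block_size=BLOCK_SIZE):
    """Byte offset of the block from the start of the disk."""
    return aun * au_size + blkn * block_size

for aun, blkn in header_copy_locations():
    print(f"aun={aun} blkn={blkn} offset={byte_offset(aun, blkn)}")
```

With a different AU size, the second last block of AU1 would have a different block number, so the blkn value passed to kfed would need to change accordingly.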

Conclusion

ASM version 12 replicates the physically addressed metadata, i.e. it keeps a copy of AU0 in AU11 - on the same disk. This allows ASM to automatically recover from damage to any data in AU0. Note that ASM will not be able to recover from loss of any other data in an external redundancy disk group. In a normal redundancy disk group, ASM will be able to recover from a loss of any data in one or more disks in a single failgroup. In a high redundancy disk group, ASM will be able to recover from a loss of any data in one or more disks in any two failgroups.


August 14, 2013

ASM version 12c is out


Oracle Database version 12c has been released, which means a brand new version of ASM is out! Notable new features are Flex ASM, proactive data validation and better handling of disk management operations. Let's have an overview with more details in separate posts.

Flex ASM

No need to run ASM instances on all nodes in the cluster. In a default installation there would be three ASM instances, irrespective of the number of nodes in the cluster. An ASM instance can serve both local and remote databases. If an ASM instance fails, the database instances do not crash; instead they fail over to another ASM instance in the cluster.

Flex ASM introduces a new instance type - an I/O server or ASM proxy instance. There will be a few (default is 3) I/O server instances in an Oracle Flex Cluster environment, serving indirect clients (typically an ACFS cluster file system). An I/O server instance can run on the same node as an ASM instance or on a different node in a flex cluster. In all cases, an I/O server instance needs to talk to a Flex ASM instance to get metadata information on behalf of an indirect client.

Flex ASM is an optional feature in 12c.

Physical metadata replication

In addition to replicating the disk header (available since 11.1.0.7), ASM 12c also replicates the allocation table, within each disk. This makes ASM more resilient to bad disk sectors and external corruptions. The disk group attribute PHYS_META_REPLICATED is provided to track the replication status of a disk group.

$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name Value
phys_meta_replicated true

The physical metadata replication status flag is in the disk header (kfdhdb.flags). This flag only ever goes from 0 to 1 (once the physical metadata has been replicated) and it never goes back to 0.

More storage

ASM 12c supports 511 disk groups, with the maximum disk size of 32 PB.

Online with power

ASM 12c has a fast mirror resync power limit to control resync parallelism and improve performance. Disk resync checkpoint functionality provides faster recovery from instance failures by enabling the resync to resume from the point at which the process was interrupted or stopped, instead of starting from the beginning. ASM 12c also provides a time estimate for the completion of a resync operation.

Use power limit for disk resync operations, similar to disk rebalance, with the range from 1 to 1024:

$ asmcmd online -G DATA -D DATA_DISK1 --power 42

Disk scrubbing - proactive data validation and repair

In ASM 12c, disk scrubbing checks for data corruptions and repairs them automatically in normal and high redundancy disk groups. This is done during disk group rebalance if the disk group attribute CONTENT.CHECK is set to TRUE. The check can also be performed manually by running the ALTER DISKGROUP SCRUB command.

The scrubbing can be performed at the disk group, disk or file level and can be monitored via the V$ASM_OPERATION view.

Even read for disk groups

In previous ASM versions, the data was always read from the primary copy (in normal or high redundancy disk groups) unless a preferred failgroup was set up. The data from the mirror would be read only if the primary copy of the data was unavailable. With the even read feature, each read request can be sent to the least loaded of the possible source disks. The least loaded in this context is simply the disk with the least number of read requests.

Even read functionality is enabled by default on all Oracle Database and Oracle ASM instances of version 12.1 and higher in non-Exadata environments. The functionality is enabled in an Exadata environment when there is a failure. Even read functionality is applicable only to disk groups with normal or high redundancy.

Replace an offline disk

We now have a new ALTER DISKGROUP REPLACE DISK command, which is a mix of the rebalance and fast mirror resync functionality. Instead of a full rebalance, the replacement disk is populated with data read from the surviving partner disks only. This effectively reduces the time to replace a failed disk.

Note that the disk being replaced must be in OFFLINE state. If the disk offline timer has expired, the disk is dropped, which initiates the rebalance. On a disk add, there will be another rebalance.

ASM password file in a disk group

ASM version 11.2 allowed the ASM spfile to be placed in a disk group. In 12c we can also put the ASM password file in an ASM disk group. Unlike the ASM spfile, access to the ASM password file is possible only after ASM startup and once the disk group containing the password file is mounted.

The orapwd utility now accepts an ASM disk group as a password file destination. The asmcmd has also been enhanced to allow ASM password file management.

Failgroup repair timer

We now have a failgroup repair timer with the default value of 24 hours. Note that the disk repair timer still defaults to 3.6 hours.

Rebalance rebalanced

The rebalance work is now estimated based on the detailed work plan, that can be generated and viewed separately. We now have a new EXPLAIN WORK command and a new V$ASM_ESTIMATE view.

In ASM 12c we (finally) have a priority ordered rebalance - the critical files (typically control files and redo logs) are rebalanced before other database files.

In Exadata, the rebalance can be offloaded to storage cells.

Thin provisioning support

ASM 12c enables thin provisioning support for some operations (that are typically associated with the disk group rebalance). The feature is disabled by default, and can be enabled at the disk group creation time or later by setting disk group attribute THIN_PROVISIONED to TRUE.

Enhanced file access control (ACL)

Easier file ownership and permission changes, e.g. a file permission can be changed on an open file. ACL has also been implemented for Microsoft Windows OS.

Oracle Cluster Registry (OCR) backup in ASM disk group

Storing the OCR backup in an Oracle ASM disk group simplifies OCR management by permitting access to the OCR backup from any node in the cluster should an OCR recovery become necessary.

Use ocrconfig command to specify an OCR backup location in an Oracle ASM disk group:

# ocrconfig -backuploc +DATA

June 16, 2013

How many allocation units per file

This post is about the amount of space allocated to ASM based files.

The smallest amount of space ASM allocates is an allocation unit (AU). The default AU size is 1 MB, except in Exadata where the default AU size is 4 MB.

The space for ASM based files is allocated in extents, which consist of one or more AUs. In version 11.2, the first 20000 extents consist of 1 AU, the next 20000 extents have 4 AUs, and extents beyond that have 16 AUs. This is known as the variable size extent feature. In version 11.1, the extent growth was 1-8-64 AUs. In version 10, we don't have variable size extents, so all extent sizes are exactly 1 AU.
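
To make the 11.2 extent growth rule concrete, here is a small sketch in Python. It only models the 1-4-16 rule as described above, counts data AUs only (the file header is ignored), and the function names are my own:

```python
def extent_size_aus(extent_number):
    """AUs in the given (0-based) extent under the 11.2 rule: 1, then 4, then 16."""
    if extent_number < 20000:
        return 1
    if extent_number < 40000:
        return 4
    return 16

def extents_needed(data_aus):
    """Number of extents needed to hold data_aus allocation units of data."""
    first = 20000        # AUs covered by the first 20000 one-AU extents
    second = 20000 * 4   # AUs covered by the next 20000 four-AU extents
    if data_aus <= first:
        return data_aus
    if data_aus <= first + second:
        return 20000 + -(-(data_aus - first) // 4)         # ceiling division
    return 40000 + -(-(data_aus - first - second) // 16)   # ceiling division

print(extents_needed(100))      # a 100 MB file with 1 MB AUs
print(extents_needed(100000))   # the point where 16 AU extents start
```

With a 1 MB AU, this means a file stays on single AU extents until it grows past roughly 20 GB, which is why the examples in this post only ever see 1 AU extents.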

Bytes vs space

The definition for V$ASM_FILE view says the following for BYTES and SPACE columns:
  • BYTES - Number of bytes in the file
  • SPACE - Number of bytes allocated to the file
There is a subtle difference in the definition and very large difference in numbers. Let's have a closer look. For the examples in this post I will use database and ASM version 11.2.0.3, with ASMLIB based disks.

First get some basic info about disk group DATA where most of my datafiles are. Run the following SQL connected to the database instance.

SQL> select NAME, GROUP_NUMBER, ALLOCATION_UNIT_SIZE/1024/1024 "AU size (MB)", TYPE
from V$ASM_DISKGROUP
where NAME='DATA';

NAME             GROUP_NUMBER AU size (MB) TYPE
---------------- ------------ ------------ ------
DATA                        1            1 NORMAL

Now create one small file (under 60 extents) and one large file (over 60 extents).

SQL> create tablespace T1 datafile '+DATA' size 10 M;

Tablespace created.

SQL> create tablespace T2 datafile '+DATA' size 100 M;

Tablespace created.

Get the ASM file numbers for those two files:

SQL> select NAME, round(BYTES/1024/1024) "MB" from V$DATAFILE;

NAME                                               MB
------------------------------------------ ----------
...
+DATA/br/datafile/t1.272.818281717                 10
+DATA/br/datafile/t2.271.818281741                100

The small file is ASM file number 272 and the large file is ASM file number 271.

Get the bytes and space information (in AUs) for these two files.

SQL> select FILE_NUMBER, round(BYTES/1024/1024) "Bytes (AU)", round(SPACE/1024/1024) "Space (AUs)", REDUNDANCY
from V$ASM_FILE
where FILE_NUMBER in (271, 272) and GROUP_NUMBER=1;

FILE_NUMBER Bytes (AU) Space (AUs) REDUND
----------- ---------- ----------- ------
        272         10          22 MIRROR
        271        100         205 MIRROR

The BYTES column shows the actual file size. For the small file, it shows 10 AUs = 10 MB (the AU size is 1 MB). The space required for the small file is 22 AUs: 10 AUs for the actual datafile plus 1 AU for the file header and, because the file is mirrored, double that, so 22 AUs in total.

For the large file, bytes shows the file size is 100 AUs = 100 MB. So far so good. But the space required for the large file is 205 AUs, not 202 as one might expect. What are those extra 3 AUs for? Let's find out.

ASM space

The following query (run in ASM instance) will show us the extent distribution for ASM file 271.

SQL> select XNUM_KFFXP "Virtual extent", PXN_KFFXP "Physical extent", DISK_KFFXP "Disk number", AU_KFFXP "AU number"
from X$KFFXP
where GROUP_KFFXP=1 and NUMBER_KFFXP=271
order by 1,2;

Virtual extent Physical extent Disk number  AU number
-------------- --------------- ----------- ----------
             0               0           3       1155
             0               1           0       1124
             1               2           0       1125
             1               3           2       1131
             2               4           2       1132
             2               5           0       1126
...
           100             200           3       1418
           100             201           1       1412
    2147483648               0           3       1122
    2147483648               1           0       1137
    2147483648               2           2       1137

205 rows selected.

As the file is mirrored, we see that each virtual extent has two physical extents. But the interesting part of the result is the last three allocation units, for virtual extent number 2147483648, which is triple mirrored. We will have a closer look at those with kfed, and for that we will need disk names.

Get the disk names.

SQL> select DISK_NUMBER, PATH
from V$ASM_DISK
where GROUP_NUMBER=1;

DISK_NUMBER PATH
----------- ---------------
          0 ORCL:ASMDISK1
          1 ORCL:ASMDISK2
          2 ORCL:ASMDISK3
          3 ORCL:ASMDISK4

Let's now check what type of data is in those allocation units.

$ kfed read /dev/oracleasm/disks/ASMDISK4 aun=1122 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

$ kfed read /dev/oracleasm/disks/ASMDISK1 aun=1137 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

$ kfed read /dev/oracleasm/disks/ASMDISK3 aun=1137 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

These additional allocation units hold ASM metadata for the large file. More specifically, they hold extent map information that could not fit into the ASM file directory block. The file directory needs extra space to keep track of files larger than 60 extents, so it needs an additional allocation unit to do so. While the file directory needs only a few extra ASM metadata blocks, the smallest unit of space ASM can allocate is an AU. And because this is metadata, this AU is triple mirrored (even in a normal redundancy disk group), hence 3 extra allocation units for the large file. In an external redundancy disk group, there would be only one extra AU per large file.

Conclusion

The amount of space ASM needs for a file depends on two factors - the file size and the disk group redundancy.

In an external redundancy disk group, the required space will be the file size + 1 AU for the file header + 1 AU for indirect extents if the file is larger than 60 AUs.

In a normal redundancy disk group, the required space will be twice the file size + 2 AUs for the file header + 3 AUs for indirect extents if the file is larger than 60 AUs.

In a high redundancy disk group, the required space will be three times the file size + 3 AUs for the file header + 3 AUs for indirect extents if the file is larger than 60 AUs.
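
The three rules above can be put into a few lines of Python. This is only a sketch of the arithmetic in this post: it assumes the file is small enough to use 1 AU extents only, it takes the "larger than 60 AUs" threshold literally, and the names are my own:

```python
# Number of copies of the file data and file header per redundancy type.
COPIES = {"external": 1, "normal": 2, "high": 3}

# Indirect extents are metadata: triple mirrored unless the disk group
# has external redundancy.
INDIRECT_AUS = {"external": 1, "normal": 3, "high": 3}

def asm_space_aus(file_aus, redundancy):
    """Space in AUs that ASM allocates for a file of file_aus allocation units."""
    copies = COPIES[redundancy]
    space = copies * (file_aus + 1)   # data plus 1 AU for the file header, per copy
    if file_aus > 60:                 # extent map no longer fits in the file directory block
        space += INDIRECT_AUS[redundancy]
    return space

print(asm_space_aus(10, "normal"))    # the 10 MB datafile from the example
print(asm_space_aus(100, "normal"))   # the 100 MB datafile from the example
```

For the two datafiles in this post the function gives 22 and 205 AUs, which matches the SPACE column in V$ASM_FILE.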

June 13, 2013

Auto disk management feature in Exadata


The automatic disk management feature is about automating ASM disk operations in an Exadata environment. The automation functionality applies to both planned actions (for example, deactivating griddisks in preparation for storage cell patching) and unplanned events (for example, disk failure).

Exadata disks

In an Exadata environment we have the following disk types:
  • Physicaldisk is a hard disk on a storage cell. Each storage cell has 12 physical disks, all with the same capacity (600 GB, 2 TB or 3 TB).
  • Flashdisk is a Sun Flash Accelerator PCIe solid state disk on a storage cell. Each storage cell has 16 flashdisks - 24 GB each in X2 (Sun Fire X4270 M2) and 100 GB each in X3 (Sun Fire X4270 M3) servers.
  • Celldisk is a logical disk created on every physicaldisk and every flashdisk on a storage cell. Celldisks created on physicaldisks are named CD_00_cellname, CD_01_cellname ... CD_11_cellname. Celldisks created on flashdisks are named FD_00_cellname, FD_01_cellname ... FD_15_cellname.
  • Griddisk is a logical disk that can be created on a celldisk. In a standard Exadata deployment we create griddisks on hard disk based celldisks only. While it is possible to create griddisks on flashdisks, this is not a standard practice; instead we use flash based celldisks for the flashcache and flashlog.
  • ASM disk in an Exadata environment is a griddisk.
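
The celldisk naming convention described above is easy to reproduce. Here is a short sketch in Python that generates the expected celldisk names for one storage cell. The cell name cell01 is a made-up example, and the disk counts are the defaults from the list above:

```python
def celldisk_names(cellname, hard_disks=12, flash_disks=16):
    """Expected celldisk names on one storage cell: CD_nn_ for hard disks, FD_nn_ for flash."""
    hard = [f"CD_{i:02d}_{cellname}" for i in range(hard_disks)]
    flash = [f"FD_{i:02d}_{cellname}" for i in range(flash_disks)]
    return hard + flash

names = celldisk_names("cell01")   # hypothetical cell name
print(names[0], names[11], names[12], names[-1])
```

A list like this can be compared with the output of 'cellcli -e list celldisk' to spot a missing celldisk.
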

Automated disk operations

These are the disk operations that are automated in Exadata:

1. Griddisk status change to OFFLINE/ONLINE

If a griddisk becomes temporarily unavailable, it will be automatically OFFLINED by ASM. When the griddisk becomes available, it will be automatically ONLINED by ASM.

2. Griddisk DROP/ADD

If a physicaldisk fails, all griddisks on that physicaldisk will be DROPPED with FORCE option by ASM. If a physicaldisk status changes to predictive failure, all griddisks on that physical disk will be DROPPED by ASM. If a flashdisk performance degrades, the corresponding griddisks (if any) will be DROPPED with FORCE option by ASM.

When a physicaldisk is replaced, the celldisk and griddisks will be recreated by CELLSRV, and the griddisks will be automatically ADDED by ASM.

NOTE: If a griddisk in NORMAL state and in ONLINE mode status, is manually dropped with FORCE option (for example, by a DBA with 'alter diskgroup ... drop disk ... force'), it will be automatically added back by ASM. In other words, dropping a healthy disk with a force option will not achieve the desired effect.

3. Griddisk OFFLINE/ONLINE for rolling Exadata software (storage cells) upgrade

Before the rolling upgrade all griddisks will be inactivated on the storage cell by CELLSRV and OFFLINED by ASM. After the upgrade all griddisks will be activated on the storage cell and ONLINED in ASM.

4. Manual griddisk activation/inactivation

If a griddisk is manually inactivated on a storage cell, by running 'cellcli -e alter griddisk ... inactive', it will be automatically OFFLINED by ASM. When a griddisk is activated on a storage cell, it will be automatically ONLINED by ASM.

5. Griddisk confined ONLINE/OFFLINE

If a griddisk is taken offline by CELLSRV, because the underlying disk is suspected of poor performance, all griddisks on that celldisk will be automatically OFFLINED by ASM. If the tests confirm that the celldisk is performing poorly, ASM will drop all griddisks on that celldisk. If the tests find that the disk is actually fine, ASM will online all griddisks on that celldisk.

Software components

1. Cell Server (CELLSRV)

The Cell Server (CELLSRV) runs on the storage cell and it's the main component of Exadata software. In the context of automatic disk management, its tasks are to process the Management Server notifications and handle ASM queries about the state of griddisks.

2. Management Server (MS)

The Management Server (MS) runs on the storage cell and implements a web service for cell management commands, and runs background monitoring threads. The MS monitors the storage cell for hardware changes (e.g. disk plugged in) or alerts (e.g. disk failure), and notifies the CELLSRV about those events.

3. Automatic Storage Management (ASM)

The Automatic Storage Management (ASM) instance runs on the compute (database) node and has two processes that are relevant to the automatic disk management feature:
  • Exadata Automation Manager (XDMG) initiates automation tasks involved in managing Exadata storage. It monitors all configured storage cells for state changes, such as a failed disk getting replaced, and performs the required tasks for such events. Its primary tasks are to watch for inaccessible disks and cells and when they become accessible again, to initiate the ASM ONLINE operation.
  • Exadata Automation Manager (XDWK) performs automation tasks requested by XDMG. It gets started when asynchronous actions such as disk ONLINE, DROP and ADD are requested by XDMG. After a 5 minute period of inactivity, this process will shut itself down.

Working together

All three software components work together to achieve automatic disk management.

In the case of disk failure, the MS detects that the disk has failed. It then notifies the CELLSRV about it. If there are griddisks on the failed disk, the CELLSRV notifies ASM about the event. ASM then drops all griddisks from the corresponding disk groups.

In the case of a replacement disk inserted into the storage cell, the MS detects the new disk and checks the cell configuration file to see if celldisk and griddisks need to be created on it. If yes, it notifies the CELLSRV to do so. Once finished, the CELLSRV notifies ASM about new griddisks and ASM then adds them to the corresponding disk groups.

In the case of a poorly performing disk, the CELLSRV first notifies ASM to offline the disk. If possible, ASM then offlines the disk. One example of when ASM would refuse to offline the disk is when a partner disk is already offline. Offlining the disk would result in the disk group dismount, so ASM would not do that. Once the disk is offlined by ASM, it notifies the CELLSRV that the performance tests can be carried out. Once done with the tests, the CELLSRV will either tell ASM to drop that disk (if it failed the tests) or online it (if it passed the tests).

The actions by MS, CELLSRV and ASM are coordinated in a similar fashion, for other disk events.

ASM initialization parameters

The following are the ASM initialization parameters relevant to the auto disk management feature:
  • _AUTO_MANAGE_EXADATA_DISKS controls the auto disk management feature. To disable the feature set this parameter to FALSE. Range of values: TRUE [default] or FALSE.
  • _AUTO_MANAGE_NUM_TRIES controls the maximum number of attempts to perform an automatic operation. Range of values: 1-10. Default value is 2.
  • _AUTO_MANAGE_MAX_ONLINE_TRIES controls maximum number of attempts to ONLINE a disk. Range of values: 1-10. Default value is 3.
All three parameters are static, which means that changing them requires an ASM instance restart. Note that all these are hidden (underscore) parameters that should not be modified unless advised by Oracle Support.

Files

The following are the files relevant to the automatic disk management feature:

1. Cell configuration file - $OSSCONF/cell_disk_config.xml. An XML file on the storage cell that contains information about all configured objects (storage cell, disks, IORM plans, etc) except alerts and metrics. The CELLSRV reads this file during startup and writes to it when an object is updated (e.g. updates to IORM plan).

2. Grid disk file - $OSSCONF/griddisk.owners.dat. A binary file on the storage cell that contains the following information for all griddisks:
  • ASM disk name
  • ASM disk group name
  • ASM failgroup name
  • Cluster identifier (which cluster this disk belongs to)
  • Requires DROP/ADD (should the disk be dropped from or added to ASM)
3. MS log and trace files - ms-odl.log and ms-odl.trc in $ADR_BASE/diag/asm/cell/`hostname -s`/trace directory on the storage cell.

4. CELLSRV alert log - alert.log in $ADR_BASE/diag/asm/cell/`hostname -s`/trace directory on the storage cell.

5. ASM alert log - alert_+ASMn.log in $ORACLE_BASE/diag/asm/+asm/+ASMn/trace directory on the compute node.

6. XDMG and XDWK trace files - +ASMn_xdmg_nnnnn.trc and +ASMn_xdwk_nnnnn.trc in $ORACLE_BASE/diag/asm/+asm/+ASMn/trace directory on the compute node.

Conclusion

In an Exadata environment, the ASM has been enhanced to provide the automatic disk management functionality. Three software components that work together to provide this facility are the Exadata Cell Server (CELLSRV), Exadata Management Server (MS) and Automatic Storage Management (ASM).

I have also published this via MOS as Doc ID 1484274.1.

May 19, 2013

Identification of under-performing disks in Exadata


Starting with Exadata software version 11.2.3.2, an under-performing disk can be detected and removed from an active configuration. This feature applies to both hard disks and flash disks.

About storage cell software processes

The Cell Server (CELLSRV) is the main component of Exadata software, which services I/O requests and provides advanced Exadata services, such as predicate processing offload. CELLSRV is implemented as a multithreaded process and is expected to use the largest portion of processor cycles on a storage cell.

The Management Server (MS) provides storage cell management and configuration tasks.

Disk state changes

Possibly under-performing - confined online

When poor disk performance is detected by the CELLSRV, the cell disk status changes to 'normal - confinedOnline' and the physical disk status changes to 'warning - confinedOnline'. This is expected behavior and it indicates that the disk has entered the first phase of the identification of an under-performing disk. This is a transient phase, i.e. the disk does not stay in this status for a prolonged period of time.

That disk status change would be associated with the following entry in the storage cell alerthistory:

[MESSAGE ID] [date and time] info "Hard disk entered confinement status. The LUN n_m changed status to warning - confinedOnline. CellDisk changed status to normal - confinedOnline. Status: WARNING - CONFINEDONLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"

At the same time, the following will be logged in the storage cell alert log:

CDHS: Mark cd health state change [cell disk name]  with newState HEALTH_BAD_ONLINE pending HEALTH_BAD_ONLINE ongoing INVALID cur HEALTH_GOOD
Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0 inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0
global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0
[date and time]
CDHS: Do cd health state change [cell disk name] from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
CDHS: Done cd health state change  from HEALTH_GOOD to newState HEALTH_BAD_ONLINE
ABSOLUTE SERVICE TIME VIOLATION DETECTED ON DISK [device name]: CD name - [cell disk name] AVERAGE SERVICETIME: 130.913043 ms. AVERAGE WAITTIME: 101.565217 ms. AVERAGE REQUESTSIZE: 625 sectors. NUMBER OF IOs COMPLETED IN LAST CYCLE ON DISK: 23 THRESHOLD VIOLATION COUNT: 6 NON_ZERO SERVICETIME COUNT: 6 SET CONFINE SUCCESS: 1
NOTE: Initiating ASM Instance operation: Query ASM Deactivation Outcome on 3 disks
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events Query ASM Deactivation Outcome on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
...
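
When going through many cell alert logs, it can help to pull the numbers out of the service time violation message. Here is a minimal sketch in Python that does so. The regular expression is based only on the single message shown above, so treat the exact message layout as an assumption, and the function name is my own:

```python
import re

# Pattern derived from the one sample message in this post.
PATTERN = re.compile(
    r"AVERAGE SERVICETIME: (?P<service_ms>[\d.]+) ms\. "
    r"AVERAGE WAITTIME: (?P<wait_ms>[\d.]+) ms\. "
    r"AVERAGE REQUESTSIZE: (?P<request_sectors>\d+) sectors\."
)

def parse_violation(line):
    """Return the averages from a service time violation message, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "service_ms": float(m.group("service_ms")),
        "wait_ms": float(m.group("wait_ms")),
        "request_sectors": int(m.group("request_sectors")),
    }

line = ("ABSOLUTE SERVICE TIME VIOLATION DETECTED ON DISK [device name]: "
        "CD name - [cell disk name] AVERAGE SERVICETIME: 130.913043 ms. "
        "AVERAGE WAITTIME: 101.565217 ms. AVERAGE REQUESTSIZE: 625 sectors. "
        "NUMBER OF IOs COMPLETED IN LAST CYCLE ON DISK: 23")
print(parse_violation(line))
```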

Prepare for test - confined offline

The next action is to take all grid disks on the cell disk offline and run the performance tests on it. The CELLSRV asks ASM to take the grid disks offline and, if possible, the ASM takes the grid disks offline. In that case, the cell disk status changes to 'normal - confinedOffline' and the physical disk status changes to 'warning - confinedOffline'.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] warning "Hard disk entered confinement offline status. The LUN n_m changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. Status: WARNING - CONFINEDOFFLINE  Manufacturer: [name]  Model Number: [model]  Size: [size]  Serial Number: [S/N]  Firmware: [F/W version]  Slot Number: m  Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for confinement: threshold for service time exceeded"

The following will be logged in the storage cell alert log:

NOTE: Initiating ASM Instance operation: ASM OFFLINE disk on 3 disks
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 31801
Published 1 grid disk events ASM OFFLINE disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE
CDHS: Done cd health state change  from HEALTH_BAD_ONLINE to newState HEALTH_BAD_OFFLINE

Note that ASM will take the grid disks offline if possible. That means that ASM will not offline any disks if that would result in the disk group dismount. For example if a partner disk is already offline, ASM will not offline this disk. In that case, the cell disk status will stay at 'normal - confinedOnline' until the disk can be safely taken offline.

In that case, the CELLSRV will repeatedly log 'Query ASM Deactivation Outcome' messages in the cell alert log. This is expected behavior and the messages will stop once ASM can take the grid disks offline.

Under stress test

Once all grid disks are offline, the MS runs the performance tests on the cell disk. If it turns out that the disk is performing well, MS will notify CELLSRV that the disk is fine. The CELLSRV will then notify ASM to put the grid disks back online.

Poor performance - drop force

If the MS finds that the disk is indeed performing poorly, the cell disk status will change to 'proactive failure' and the physical disk status will change to 'warning - poor performance'. Such a disk will need to be removed from an active configuration. In that case the MS notifies the CELLSRV, which in turn notifies ASM to drop all grid disks from that cell disk.

That action would be associated with the following entry in the cell alerthistory:

[MESSAGE ID] [date and time] critical "Hard disk entered poor performance status. Status: WARNING - POOR PERFORMANCE Manufacturer: [name] Model Number: [model]  Size: [size]  Serial Number: [S/N] Firmware: [F/W version] Slot Number: m Cell Disk: [cell disk name]  Grid Disk: [grid disk 1], [grid disk 2] ... Reason for poor performance : threshold for service time exceeded"
The following will be logged in the storage cell alert log:
CDHS: Do cd health state change  after confinement [cell disk name] testFailed 1
CDHS: Do cd health state change [cell disk name] from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL
NOTE: Initiating ASM Instance operation: ASM DROP dead disk on 3 disks
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 28966
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 11912
Published 1 grid disk events ASM DROP dead disk on DG [disk group name] to:
ClientHostName = [database node name],  ClientPID = 26502
CDHS: Done cd health state change  from HEALTH_BAD_OFFLINE to newState HEALTH_FAIL

In the ASM alert log we will see the drop disk force operations for the respective grid disks, followed by the disk group rebalance operation.

Once the rebalance completes, the problem disk should be replaced by following the same process as for a disk with the status 'predictive failure'.

All well - back to normal

If the MS tests determine that there are no performance issues with the disk, it will pass that information on to CELLSRV, which will in turn ask ASM to put the grid disks back online. The cell disk and physical disk status will change back to normal.
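The confinement flow described in the sections above can be summarised as a small state machine. This is only an illustrative Python sketch, not how CELLSRV or MS are implemented. The states HEALTH_BAD_ONLINE, HEALTH_BAD_OFFLINE and HEALTH_FAIL are taken from the alert log excerpts; HEALTH_GOOD and all the event names are hypothetical labels introduced here.

```python
# Health states as they appear in the cell alert log excerpts.
# HEALTH_GOOD is a hypothetical name for the normal state; the
# excerpts do not show what the real state is called.
TRANSITIONS = {
    # ASM was able to take all grid disks offline
    ("HEALTH_BAD_ONLINE", "asm_offlined_grid_disks"): "HEALTH_BAD_OFFLINE",
    # MS tests found no problem: grid disks go back online
    ("HEALTH_BAD_OFFLINE", "tests_passed"): "HEALTH_GOOD",
    # MS tests confirmed poor performance: grid disks are dropped
    ("HEALTH_BAD_OFFLINE", "tests_failed"): "HEALTH_FAIL",
}


def next_state(state, event):
    """Return the new state, or the same state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


def run(events, start="HEALTH_BAD_ONLINE"):
    """Apply a sequence of (hypothetical) events to a confined disk."""
    state = start
    for event in events:
        state = next_state(state, event)
    return state


# Disk confirmed as poorly performing
print(run(["asm_offlined_grid_disks", "tests_failed"]))
# Disk turned out to be fine
print(run(["asm_offlined_grid_disks", "tests_passed"]))
# Partner disk already offline, so ASM cannot offline this one yet
print(run(["asm_cannot_offline"]))
```

Any event without an entry in the table leaves the state unchanged, which models the case where the disk stays confined online until ASM can safely take it offline.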

Disk confinement triggers

Any of the following conditions can trigger a disk confinement:
  1. Hung cell disk (the cause code in the storage cell alert log will be CD_PERF_HANG).
  2. Slow cell disk, e.g. high service time threshold (CD_PERF_SLOW_ABS), high relative service time threshold (CD_PERF_SLOW_RLTV), etc.
  3. High read or write latency, e.g. high latency on writes (CD_PERF_SLOW_LAT_WT), high latency on reads (CD_PERF_SLOW_LAT_RD), high latency on both reads and writes (CD_PERF_SLOW_LAT_RW), very high absolute latency on individual I/Os happening frequently (CD_PERF_SLOW_LAT_ERR), etc.
  4. Errors, e.g. I/O errors (CD_PERF_IOERR).
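For quick triage, the cause codes listed above can be kept in a lookup table and matched against alert log text. This is a minimal Python sketch; the descriptions are copied from the list above, while the function name and the sample log line are made up for illustration (the exact format of the line that carries the cause code is not shown in this post).

```python
import re

# Cause codes and descriptions, as given in the list above.
CAUSE_CODES = {
    "CD_PERF_HANG": "hung cell disk",
    "CD_PERF_SLOW_ABS": "high service time threshold",
    "CD_PERF_SLOW_RLTV": "high relative service time threshold",
    "CD_PERF_SLOW_LAT_WT": "high latency on writes",
    "CD_PERF_SLOW_LAT_RD": "high latency on reads",
    "CD_PERF_SLOW_LAT_RW": "high latency on both reads and writes",
    "CD_PERF_SLOW_LAT_ERR": "very high absolute latency on individual I/Os happening frequently",
    "CD_PERF_IOERR": "I/O errors",
}

CODE_RE = re.compile(r"\bCD_PERF_[A-Z_]+\b")


def explain_cause_codes(text):
    """Return (code, description) pairs for every CD_PERF_* code in the text."""
    return [
        (code, CAUSE_CODES.get(code, "not in the list above"))
        for code in CODE_RE.findall(text)
    ]


# Hypothetical log line, for illustration only.
line = "cell disk confined, cause code CD_PERF_SLOW_LAT_WT"
for code, description in explain_cause_codes(line):
    print(code, "-", description)
```

Codes that are not in the table are still reported, just flagged as unknown, since the list above ends each category with 'etc.' and is not exhaustive.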

Conclusion

As a single underperforming disk can impact overall system performance, a new feature has been introduced in Exadata to identify and remove such disks from an active configuration. This is a fully automated process that includes an automatic service request (ASR) for disk replacement.

I have recently published this on MOS as Doc ID 1509105.1.