August 31, 2013

Partnership and Status Table


The Partnership and Status Table (PST) contains the information about all ASM disks in a disk group – disk number, disk status, partner disk number, heartbeat info and the failgroup info (11g and later).

Allocation unit number 1 on every ASM disk is reserved for the PST, but only some disks will have the PST data.

PST count

In an external redundancy disk group there will be only one copy of the PST.

In a normal redundancy disk group there will be at least two copies of the PST. If there are three or more failgroups, there will be three copies of the PST.

In a high redundancy disk group there will be at least three copies of the PST. If there are four failgroups, there will be four PST copies, and if there are five or more failgroups there will be five copies of the PST.
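The counting rules above can be summarised in a few lines of Python. This is only a sketch of the rules as stated in this post, not ASM code. The helper name pst_copies is made up for illustration, and the minimum failgroup counts (two for normal, three for high redundancy) are assumptions.

```python
def pst_copies(redundancy: str, failgroups: int) -> int:
    """Return the number of PST copies according to the rules in this post.

    Hypothetical helper, not an ASM API. Assumes at least 2 failgroups
    for normal redundancy and at least 3 for high redundancy.
    """
    redundancy = redundancy.lower()
    if redundancy == "external":
        if failgroups < 1:
            raise ValueError("a disk group needs at least one failgroup")
        return 1
    if redundancy == "normal":
        if failgroups < 2:
            raise ValueError("assumed minimum of 2 failgroups for normal redundancy")
        # Two copies with two failgroups, three copies otherwise.
        return 3 if failgroups >= 3 else 2
    if redundancy == "high":
        if failgroups < 3:
            raise ValueError("assumed minimum of 3 failgroups for high redundancy")
        # Three, four or five copies, capped at five.
        return min(failgroups, 5)
    raise ValueError(f"unknown redundancy type: {redundancy}")
```

With five failgroups this gives 1, 3 and 5 copies for external, normal and high redundancy respectively.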

Let's have a look. Note that in each example, the disk group is created with five disks.

External redundancy disk group.

SQL> CREATE DISKGROUP DG1 EXTERNAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log:

Sat Aug 31 20:44:59 2013
SQL> CREATE DISKGROUP DG1 EXTERNAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:44:59 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:45:00 2013
GMON updating group 2 at 50 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
Sat Aug 31 20:45:00 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates only one copy of the PST.

Normal redundancy disk group

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log

Sat Aug 31 20:49:28 2013
SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:49:28 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
Sat Aug 31 20:49:28 2013
NOTE: group 2 PST updated.
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:49:28 2013
GMON updating group 2 at 68 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)
Sat Aug 31 20:49:28 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates three copies of the PST.

High redundancy disk group

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 HIGH REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9';

Diskgroup created.

ASM alert log

Sat Aug 31 20:51:52 2013
SQL> CREATE DISKGROUP DG1 HIGH REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8', '/dev/sdc9'
Sat Aug 31 20:51:52 2013
NOTE: Assigning number (2,0) to disk (/dev/sdc5)
NOTE: Assigning number (2,1) to disk (/dev/sdc6)
NOTE: Assigning number (2,2) to disk (/dev/sdc7)
NOTE: Assigning number (2,3) to disk (/dev/sdc8)
NOTE: Assigning number (2,4) to disk (/dev/sdc9)
...
Sat Aug 31 20:51:53 2013
NOTE: group 2 PST updated.
NOTE: initiating PST update: grp = 2
Sat Aug 31 20:51:53 2013
GMON updating group 2 at 77 for pid 22, osid 9873
NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)
NOTE: group DG1: initial PST location: disk 0003 (PST copy 3)
NOTE: group DG1: initial PST location: disk 0004 (PST copy 4)
Sat Aug 31 20:51:53 2013
NOTE: PST update grp = 2 completed successfully
...

We see that ASM creates five copies of the PST.

PST relocation

The PST would be relocated in the following cases:
  • The disk with the PST is not available (on ASM startup)
  • The disk goes offline
  • There was an I/O error while reading from or writing to the PST
  • The disk is dropped gracefully
In all cases the PST would be relocated to another disk in the same failgroup (if a disk is available in the same failure group) or to another failgroup (that doesn't already contain a copy of the PST).
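The placement rule in the previous paragraph can be sketched as follows. This is an illustration of the rule as described, not the actual GMON logic. The function name, the data structures and the choice of the lowest numbered candidate disk are all assumptions.

```python
from typing import Dict, Optional, Set


def pick_pst_target(lost_disk: int,
                    failgroup_of: Dict[int, str],
                    pst_disks: Set[int],
                    available: Set[int]) -> Optional[int]:
    """Pick a disk to receive the PST copy that lived on lost_disk.

    Hypothetical sketch: prefer a disk in the same failgroup, otherwise
    a disk in a failgroup that holds no PST copy yet.
    """
    lost_fg = failgroup_of[lost_disk]
    # Disks that are usable and do not already hold a PST copy.
    candidates = sorted(d for d in available
                        if d != lost_disk and d not in pst_disks)

    # 1. Another disk in the same failgroup, if there is one.
    for disk in candidates:
        if failgroup_of[disk] == lost_fg:
            return disk

    # 2. A disk in a failgroup without a PST copy.
    fgs_with_pst = {failgroup_of[d] for d in pst_disks if d != lost_disk}
    for disk in candidates:
        if failgroup_of[disk] not in fgs_with_pst:
            return disk

    return None  # nowhere to relocate the copy
```

For four disks that each form their own failgroup, with PST copies on disks 0, 1 and 2, losing disk 0 leaves disk 3 as the only candidate.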

Let's have a look.

SQL> drop diskgroup DG1;

Diskgroup dropped.

SQL> CREATE DISKGROUP DG1 NORMAL REDUNDANCY
DISK '/dev/sdc5', '/dev/sdc6', '/dev/sdc7', '/dev/sdc8';

Diskgroup created.

ASM alert log shows the PST copies are on disks 0, 1 and 2:

NOTE: group DG1: initial PST location: disk 0000 (PST copy 0)
NOTE: group DG1: initial PST location: disk 0001 (PST copy 1)
NOTE: group DG1: initial PST location: disk 0002 (PST copy 2)

Let's drop disk 0:

SQL> select disk_number, name, path from v$asm_disk_stat
where group_number = (select group_number from v$asm_diskgroup_stat where name='DG1');

DISK_NUMBER NAME                           PATH
----------- ------------------------------ ----------------
          3 DG1_0003                       /dev/sdc8
          2 DG1_0002                       /dev/sdc7
          1 DG1_0001                       /dev/sdc6
          0 DG1_0000                       /dev/sdc5

SQL> alter diskgroup DG1 drop disk DG1_0000;

Diskgroup altered.

ASM alert log

Sat Aug 31 21:04:29 2013
SQL> alter diskgroup DG1 drop disk DG1_0000
...
NOTE: initiating PST update: grp 2 (DG1), dsk = 0/0xe9687ff6, mask = 0x6a, op = clear
Sat Aug 31 21:04:37 2013
GMON updating disk modes for group 2 at 96 for pid 24, osid 16502
NOTE: group DG1: updated PST location: disk 0001 (PST copy 0)
NOTE: group DG1: updated PST location: disk 0002 (PST copy 1)
NOTE: group DG1: updated PST location: disk 0003 (PST copy 2)
...

We see that the PST copy from disk 0 was moved to disk 3.

Disk Partners

A disk partnership is a symmetric relationship between two disks in a high or normal redundancy disk group. There are no disk partnerships in external redundancy disk groups. For a discussion on this topic, please see the post How many partners.

PST Availability

The PST has to be available before the rest of ASM metadata. When the disk group mount is requested, the GMON process (on the instance requesting a mount) reads all disks in the disk group to find and verify all available PST copies. Once it verifies that there are enough PSTs for a quorum, it mounts the disk group. From that point on, the PST is available in the ASM instance cache, stored in the GMON PGA and protected by an exclusive lock on the PT.n.0 enqueue.

As other ASM instances, in the same cluster, come online they cache the PST in their GMON PGA with shared PT.n.0 enqueue.

Only the GMON (the CKPT in 10gR1) that holds an exclusive lock on the PT enqueue can update the PST information on disks.

PST (GMON) tracing

The GMON trace file will log the PST info every time a disk group mount is attempted. Note that I said attempted, not mounted, as the GMON will log the information regardless of the mount being successful or not. This information may be valuable to Oracle Support in diagnosing disk group mount failures.

This is the typical information logged in the GMON trace file on a disk group mount:

=============== PST ====================
grpNum:    2
grpTyp:    2
state:     1
callCnt:   103
bforce:    0x0
(lockvalue) valid=1 ver=0.0 ndisks=3 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR --------------------
next:    7
last:    7
pst count:       3
pst locations:   1  2  3
incarn:          4
dta size:        4
version:         0
ASM version:     168820736 = 10.1.0.0.0
contenttype:     0
--------------- LOC MAP ----------------
0: dirty 0       cur_loc: 0      stable_loc: 0
1: dirty 0       cur_loc: 0      stable_loc: 0
--------------- DTA --------------------
1: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 2 (amp) 3 (amp)
2: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 1 (amp) 3 (amp)
3: sts v v(rw) p(rw) a(x) d(x) fg# = 0 addTs = 0 parts: 1 (amp) 2 (amp)
...

The section marked === PST === tells us the group number (grpNum), type (grpTyp) and state. The section marked --- HDR --- shows the number of PST copies (pst count) and the disk numbers that have those copies (pst locations). The section marked --- DTA --- shows the actual state of the disks with the PST.
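When going through many GMON trace files, the two HDR fields mentioned above can be pulled out with a short script. This is a minimal sketch that assumes the trace lines look exactly like the excerpt in this post. The function name parse_pst_header is made up for illustration.

```python
import re


def parse_pst_header(trace: str) -> dict:
    """Extract 'pst count' and 'pst locations' from GMON trace text.

    Hypothetical helper; assumes the line layout shown in this post.
    """
    count = re.search(r"^pst count:[ \t]+(\d+)", trace, re.MULTILINE)
    locations = re.search(r"^pst locations:[ \t]+([\d \t]+)$", trace, re.MULTILINE)
    if count is None or locations is None:
        raise ValueError("PST header fields not found in trace text")
    return {
        "pst_count": int(count.group(1)),
        "pst_locations": [int(n) for n in locations.group(1).split()],
    }


SAMPLE_HDR = """--------------- HDR --------------------
next:    7
last:    7
pst count:       3
pst locations:   1  2  3
incarn:          4
"""

header = parse_pst_header(SAMPLE_HDR)
# A simple sanity check: the count should match the number of locations.
consistent = header["pst_count"] == len(header["pst_locations"])
```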

Conclusion

The Partnership and Status Table contains the information about all ASM disks in a disk group – disk number, disk status, partner disk number, heartbeat info and the failgroup info (11g and later).

Allocation unit number 1 on every ASM disk is reserved for the PST, but only some disks will have the PST data. As the PST is valuable ASM metadata, it is mirrored three times in a normal redundancy disk group and five times in a high redundancy disk group - provided there are enough failgroups of course.


August 24, 2013

Allocation Table


Every ASM disk contains at least one Allocation Table (AT) that describes the contents of the disk. The AT has one entry for every allocation unit (AU) on the disk. If an AU is allocated, the Allocation Table will have the extent number and the file number the AU belongs to.

Finding the Allocation Table

The location of the first block of the Allocation Table is stored in the ASM disk header (field kfdhdb.altlocn). In the following example, the look up of that field shows that the AT starts at block 2.

$ kfed read /dev/sdc1 | grep kfdhdb.altlocn
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002

Let’s have a closer look at the first block of the Allocation Table.

$ kfed read /dev/sdc1 blkn=2 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
kfdatb.aunum:                         0 ; 0x000: 0x00000000
kfdatb.shrink:                      448 ; 0x004: 0x01c0
...

The kfdatb.aunum=0 means that AU0 is the first AU described by this AT block. The kfdatb.shrink=448 means that this AT block can hold the information for 448 AUs. In the next AT block we should see kfdatb.aunum=448, meaning that it describes AU448 and the 447 AUs that follow it. Let’s have a look:

$ kfed read /dev/sdc1 blkn=3 | grep kfdatb.aunum
kfdatb.aunum:                       448 ; 0x000: 0x000001c0

The next AT block should show kfdatb.aunum=896:

$ kfed read /dev/sdc1 blkn=4 | grep kfdatb.aunum
kfdatb.aunum:                       896 ; 0x000: 0x00000380

And so on...
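The pattern can be written down as simple arithmetic. The sketch below assumes the values seen in this example (the first AT block is block 2 and each block describes 448 AUs) and applies to the first Allocation Table on the disk only. The function name is made up for illustration.

```python
FIRST_AT_BLOCK = 2          # kfdhdb.altlocn in this example
AT_ENTRIES_PER_BLOCK = 448  # kfdatb.shrink in this example
LAST_AT_BLOCK = 255         # blocks 2-255 hold the AT in a 1MB AU


def at_block_au_range(blkn: int) -> tuple:
    """Return (first AU, last AU) described by AT block blkn in AU0."""
    if not FIRST_AT_BLOCK <= blkn <= LAST_AT_BLOCK:
        raise ValueError("not an Allocation Table block in this example")
    first_au = (blkn - FIRST_AT_BLOCK) * AT_ENTRIES_PER_BLOCK
    return first_au, first_au + AT_ENTRIES_PER_BLOCK - 1
```

The first value of each pair is what kfed reports as kfdatb.aunum for that block: 0 for block 2, 448 for block 3 and 896 for block 4.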

Allocation table entries

For allocated AUs, the Allocation Table entry (kfdate[i]) holds the extent number, file number and the state of the allocation unit - normally allocated (flag V=1), vs a free or unallocated AU (flag V=0).

Let’s have a look at Allocation Table block 3.

$ kfed read /dev/sdc1 blkn=3 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
kfdatb.aunum:                       448 ; 0x000: 0x000001c0
...
kfdate[142].discriminator:            1 ; 0x498: 0x00000001
kfdate[142].allo.lo:                  0 ; 0x498: XNUM=0x0
kfdate[142].allo.hi:            8388867 ; 0x49c: V=1 I=0 H=0 FNUM=0x103
kfdate[143].discriminator:            1 ; 0x4a0: 0x00000001
kfdate[143].allo.lo:                  1 ; 0x4a0: XNUM=0x1
kfdate[143].allo.hi:            8388867 ; 0x4a4: V=1 I=0 H=0 FNUM=0x103
kfdate[144].discriminator:            1 ; 0x4a8: 0x00000001
kfdate[144].allo.lo:                  2 ; 0x4a8: XNUM=0x2
kfdate[144].allo.hi:            8388867 ; 0x4ac: V=1 I=0 H=0 FNUM=0x103
kfdate[145].discriminator:            1 ; 0x4b0: 0x00000001
kfdate[145].allo.lo:                  3 ; 0x4b0: XNUM=0x3
kfdate[145].allo.hi:            8388867 ; 0x4b4: V=1 I=0 H=0 FNUM=0x103
kfdate[146].discriminator:            1 ; 0x4b8: 0x00000001
kfdate[146].allo.lo:                  4 ; 0x4b8: XNUM=0x4
kfdate[146].allo.hi:            8388867 ; 0x4bc: V=1 I=0 H=0 FNUM=0x103
kfdate[147].discriminator:            1 ; 0x4c0: 0x00000001
kfdate[147].allo.lo:                  5 ; 0x4c0: XNUM=0x5
kfdate[147].allo.hi:            8388867 ; 0x4c4: V=1 I=0 H=0 FNUM=0x103
kfdate[148].discriminator:            0 ; 0x4c8: 0x00000000
kfdate[148].free.lo.next:            16 ; 0x4c8: 0x0010
kfdate[148].free.lo.prev:            16 ; 0x4ca: 0x0010
kfdate[148].free.hi:                  2 ; 0x4cc: V=0 ASZM=0x2
kfdate[149].discriminator:            0 ; 0x4d0: 0x00000000
kfdate[149].free.lo.next:             0 ; 0x4d0: 0x0000
kfdate[149].free.lo.prev:             0 ; 0x4d2: 0x0000
kfdate[149].free.hi:                  0 ; 0x4d4: V=0 ASZM=0x0
...

The excerpt shows the Allocation Table entries for file 259 (hexadecimal FNUM=0x103), which start at kfdate[142] and end at kfdate[147]. That shows that ASM file 259 has a total of 6 AUs. The AU number is the index of kfdate[i] plus the offset (kfdatb.aunum=448). In other words, 142+448=590, 143+448=591 ... 147+448=595. Let's verify that by querying X$KFFXP:

SQL> select AU_KFFXP
from X$KFFXP
where GROUP_KFFXP=1  -- disk group 1
and NUMBER_KFFXP=259 -- file 259
;

  AU_KFFXP
----------
       590
       591
       592
       593
       594
       595

6 rows selected.
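The same index plus offset calculation can be done directly on saved kfed output. This is a minimal sketch that assumes the allo.hi lines look like the ones in the excerpt above. The function name aus_for_file is made up for illustration.

```python
import re

# Matches allocated entries only (V=1) and captures the index and FNUM.
ALLO_HI = re.compile(
    r"kfdate\[(\d+)\]\.allo\.hi:.*\bV=1\b.*\bFNUM=(0x[0-9a-fA-F]+)"
)


def aus_for_file(kfed_output: str, aunum: int, file_number: int) -> list:
    """Return the AU numbers that one AT block records for file_number.

    aunum is the kfdatb.aunum value of the AT block being read.
    """
    aus = []
    for line in kfed_output.splitlines():
        match = ALLO_HI.search(line)
        if match and int(match.group(2), 16) == file_number:
            aus.append(aunum + int(match.group(1)))
    return aus


SAMPLE_AT = """\
kfdate[142].allo.hi:            8388867 ; 0x49c: V=1 I=0 H=0 FNUM=0x103
kfdate[147].allo.hi:            8388867 ; 0x4c4: V=1 I=0 H=0 FNUM=0x103
kfdate[148].free.hi:                  2 ; 0x4cc: V=0 ASZM=0x2
"""

file_259_aus = aus_for_file(SAMPLE_AT, aunum=448, file_number=259)
```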

Free space

In the above kfed output, we see that kfdate[148] and kfdate[149] have the word free next to them, which marks them as free or unallocated allocation units (flagged with V=0). That kfed output is truncated, but there are many more free allocation units described by this AT block.

The stride

Each AT block can describe 448 AUs (the kfdatb.shrink value from the Allocation Table), and the whole AT can have 254 blocks (the kfdfsb.max value from the Free Space Table). This means that one Allocation Table can describe 254x448=113792 allocation units. This is called the stride, and the stride size - expressed in number of allocation units - is in the field kfdhdb.mfact, in ASM disk header:

$ kfed read /dev/sdc1 | grep kfdhdb.mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

The stride size in this example is for an AU size of 1MB, which fits 256 metadata blocks in AU0. Block 0 is for the disk header and block 1 is for the Free Space Table, which leaves 254 blocks for the Allocation Table.

With the AU size of 4MB (default in Exadata), the stride size will be 454272 allocation units or 1817088 MB. With the larger AU size, the stride will also be larger.
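The numbers above are easy to check. The sketch below works through the 1MB case only and assumes a 4096 byte metadata block, which is what 256 blocks in a 1MB AU implies. It then converts the quoted 4MB stride from allocation units to MB. It does not try to derive the number of AT blocks for other AU sizes.

```python
MB = 1024 * 1024
METADATA_BLOCK_SIZE = 4096   # bytes; assumed, 1MB / 256 blocks
AT_ENTRIES_PER_BLOCK = 448   # kfdatb.shrink

# 1MB allocation unit: 256 blocks, minus the disk header and the FST.
blocks_in_au0 = (1 * MB) // METADATA_BLOCK_SIZE
at_blocks = blocks_in_au0 - 2
stride_1mb_aus = at_blocks * AT_ENTRIES_PER_BLOCK

# 4MB allocation unit: stride size as quoted above, converted to MB.
stride_4mb_aus = 454272
stride_4mb_in_mb = stride_4mb_aus * 4

print(blocks_in_au0, at_blocks, stride_1mb_aus, stride_4mb_in_mb)
```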

How many Allocation Tables

Large ASM disks may have more than one stride. Each stride will have its own physically addressed metadata, which means that it will have its own Allocation Table.

The second stride will have its physically addressed metadata in the first AU of the stride. Let's have a look.

$ kfed read /dev/sdc1 | grep mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

This shows the stride size is 113792 AUs. Let's check the AT entries for the second stride. Those should be in blocks 2-255 in AU113792.

$ kfed read /dev/sdc1 aun=113792 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
...
$ kfed read /dev/sdc1 aun=113792 blkn=255 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

As expected, we have another AT in AU113792. If we had another stride, there would be another AT at the beginning of that stride. As it happens, I have a large disk, with a few strides, so we see the AT at the beginning of the third stride as well:

$ kfed read /dev/sdc1 aun=227584 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
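The starting AU of every stride follows directly from the stride size. Here is a small sketch, assuming the kfdhdb.mfact value from this example. The function name and the disk size used below are made up for illustration.

```python
def stride_start_aus(disk_size_aus: int, mfact: int = 113792) -> list:
    """Return the first AU of every stride on a disk of disk_size_aus AUs.

    Each of these AUs holds the physically addressed metadata for its
    stride. mfact defaults to the kfdhdb.mfact value in this example.
    """
    if disk_size_aus <= 0 or mfact <= 0:
        raise ValueError("disk size and stride size must be positive")
    return list(range(0, disk_size_aus, mfact))


# A hypothetical disk of 300000 AUs has three strides.
starts = stride_start_aus(300000)
commands = [f"kfed read /dev/sdc1 aun={au} blkn=2 | grep type" for au in starts]
```

For this made-up disk size the strides start at AU0, AU113792 and AU227584, which are the allocation units checked with kfed above.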

Conclusion

Every ASM disk contains at least one Allocation Table that describes the contents of the disk. The AT has one entry for every allocation unit on the disk. If the disk has more than one stride, each stride will have its own Allocation Table.

August 23, 2013

Free Space Table


The ASM Free Space Table (FST) provides a summary of which allocation table blocks have free space. It contains an array of bit patterns indexed by allocation table block number. The table is used to speed up the allocation of new allocation units by avoiding reading blocks that are full.

The FST is technically part of the Allocation Table (AT), and is at block 1 of the AT. The Free Space Table and the Allocation Table are so-called physically addressed metadata, as they are always at a fixed location on each ASM disk.

Locating the Free Space Table

The location of the FST block is stored in the ASM disk header (field kfdhdb.fstlocn). In the following example, the lookup of that field in the disk header shows that the FST is in block 1.

$ kfed read /dev/sdc1 | grep kfdhdb.fstlocn
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001

Let’s have a closer look at the FST:

$ kfed read /dev/sdc1 blkn=1 | more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
...
kfdfsb.aunum:                         0 ; 0x000: 0x00000000
kfdfsb.max:                         254 ; 0x004: 0x00fe
kfdfsb.cnt:                         254 ; 0x006: 0x00fe
kfdfsb.bound:                         0 ; 0x008: 0x0000
kfdfsb.flag:                          1 ; 0x00a: B=1
kfdfsb.ub1spare:                      0 ; 0x00b: 0x00
kfdfsb.spare[0]:                      0 ; 0x00c: 0x00000000
kfdfsb.spare[1]:                      0 ; 0x010: 0x00000000
kfdfsb.spare[2]:                      0 ; 0x014: 0x00000000
kfdfse[0].fse:                      119 ; 0x018: FREE=0x7 FRAG=0x7
kfdfse[1].fse:                       16 ; 0x019: FREE=0x0 FRAG=0x1
kfdfse[2].fse:                       16 ; 0x01a: FREE=0x0 FRAG=0x1
kfdfse[3].fse:                       16 ; 0x01b: FREE=0x0 FRAG=0x1
...
kfdfse[4037].fse:                     0 ; 0xfdd: FREE=0x0 FRAG=0x0
kfdfse[4038].fse:                     0 ; 0xfde: FREE=0x0 FRAG=0x0
kfdfse[4039].fse:                     0 ; 0xfdf: FREE=0x0 FRAG=0x0

For this FST block, the first allocation table block is in AU 0:

kfdfsb.aunum:                         0 ; 0x000: 0x00000000

The maximum number of FST entries this block can hold is 254:

kfdfsb.max:                         254 ; 0x004: 0x00fe
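Each kfdfse entry is a single byte that kfed splits into FREE and FRAG. Going only by the sample values above (119 is FREE=0x7 FRAG=0x7, 16 is FREE=0x0 FRAG=0x1), FREE looks like the low four bits and FRAG the high four bits. The sketch below decodes an entry on that assumption. The real field widths are not confirmed here.

```python
def decode_fse(value: int) -> dict:
    """Split one kfdfse[i].fse byte into its FREE and FRAG parts.

    Assumption inferred from the kfed samples in this post:
    FREE is the low nibble and FRAG is the high nibble.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("fse is a single byte")
    return {"FREE": value & 0x0F, "FRAG": value >> 4}


for raw in (119, 16, 0):
    print(raw, decode_fse(raw))
```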

How many Free Space Tables

Large ASM disks may have more than one stride. The field kfdhdb.mfact in the ASM disk header, shows the stride size - expressed in allocation units. Each stride will have its own physically addressed metadata, which means that it will have its own Free Space Table.

The second stride will have its physically addressed metadata in the first AU of the stride. Let's have a look.

$ kfed read /dev/sdc1 | grep mfact
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

This shows the stride size is 113792 AUs. Let's check the FST for the second stride. That should be in block 1 in AU113792.

$ kfed read /dev/sdc1 aun=113792 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC

As expected, we have another FST in AU113792. If we had another stride, there would be another FST at the beginning of that stride. As it happens, I have a large disk, with a few strides, so we see the FST at the beginning of the third stride as well:

$ kfed read /dev/sdc1 aun=227584 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC

Conclusion

The Free Space Table is in block 1 of allocation unit 0 of every ASM disk. If the disk has more than one stride, each stride will have its own Free Space Table.


August 17, 2013

Physical metadata replication


Starting with version 12.1, ASM replicates the physically addressed metadata. This means that ASM maintains two copies of the disk header, the Free Space Table and the Allocation Table data. Note that this metadata is not mirrored, but replicated. ASM mirroring refers to copies of the same data on different disks. The copies of the physical metadata are on the same disk, hence the term replicated. This also means that the physical metadata is replicated even in an external redundancy disk group.

The Partnership and Status Table (PST) is also referred to as physically addressed metadata, but the PST is not replicated. This is because the PST is protected by mirroring - in normal and high redundancy disk groups.

Where is the replicated metadata

The physically addressed metadata is in allocation unit 0 (AU0) on every ASM disk. With this feature enabled, ASM will copy the contents of AU0 into allocation unit 11 (AU11), and from that point on, it will maintain both copies. This feature will be automatically enabled when a disk group is created with ASM compatibility of 12.1 or higher, or when ASM compatibility is advanced to 12.1 or higher, for an existing disk group.

If there is data in AU11, when the ASM compatibility is advanced to 12.1 or higher, ASM will simply move that data somewhere else, and use AU11 for the physical metadata replication.

Since version 11.1.0.7, ASM keeps a copy of the disk header in the second last block of AU1. Interestingly, in version 12.1, ASM still keeps the copy of the disk header in AU1, which means that now every ASM disk will have three copies of the disk header block.

Disk group attribute PHYS_META_REPLICATED

The status of the physical metadata replication can be checked by querying the disk group attribute PHYS_META_REPLICATED. Here is an example with the asmcmd command that shows how to check the replication status for disk group DATA:

$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name Value
phys_meta_replicated true

The phys_meta_replicated=true means that the physical metadata for disk group DATA has been replicated.

The kfdhdb.flags field in the ASM disk header indicates the status of the physical metadata replication as follows:
  • kfdhdb.flags = 0 - no physical metadata has been replicated
  • kfdhdb.flags = 1 - physical metadata has been replicated
  • kfdhdb.flags = 2 - physical metadata replication in progress
Once the flag is set to 1, it will never go back to 0.
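The three flag values can be turned into a readable status when checking many disks. This is a minimal sketch that assumes the kfdhdb.flags line has the layout kfed normally prints for disk header fields. The function name and the status strings are made up for illustration.

```python
import re

# Meaning of kfdhdb.flags as listed above.
REPLICATION_STATUS = {
    0: "not replicated",
    1: "replicated",
    2: "replication in progress",
}


def replication_status(kfed_output: str) -> str:
    """Return the replication status found in kfed disk header output."""
    match = re.search(r"^kfdhdb\.flags:\s+(\d+)", kfed_output, re.MULTILINE)
    if match is None:
        raise ValueError("kfdhdb.flags not found in kfed output")
    return REPLICATION_STATUS.get(int(match.group(1)), "unknown")


SAMPLE_HEADER = "kfdhdb.flags:                         1 ; 0x0fc: 0x00000001\n"
status = replication_status(SAMPLE_HEADER)
```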

Metadata replication in action

As stated earlier, the physical metadata will be replicated in disk groups with ASM compatibility of 12.1 or higher. Let's first have a look at a disk group with ASM compatible set to 12.1:

$ asmcmd lsattr -G DATA -l compatible.asm
Name            Value
compatible.asm  12.1.0.0.0
$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

This shows that the physical metadata has been replicated. Now verify that all disks in the disk group have the kfdhdb.flags set to 1:

$ for disk in `asmcmd lsdsk -G DATA --suppressheader`; do kfed read $disk | egrep "dskname|flags"; done
kfdhdb.dskname:               DATA_0000 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0001 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0002 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001
kfdhdb.dskname:               DATA_0003 ; 0x028: length=9
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

This shows that all disks have the replication flag set to 1, i.e. that the physical metadata has been replicated for all disks in the disk group.

Let's now have a look at a disk group with ASM compatibility 11.2, that is later advanced to 12.1:

SQL> create diskgroup DG1 external redundancy
  2  disk '/dev/sdi1'
  3  attribute 'COMPATIBLE.ASM'='11.2';

Diskgroup created.

Check the replication status:

$ asmcmd lsattr -G DG1 -l phys_meta_replicated
Name  Value

Nothing - no such attribute. That is because the ASM compatibility is less than 12.1. We also expect that the kfdhdb.flags is 0 for the only disk in that disk group:

$ kfed read /dev/sdi1 | egrep "type|dskname|grpname|flags"
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.grpname:                     DG1 ; 0x048: length=3
kfdhdb.flags:                         0 ; 0x0fc: 0x00000000

Let's now advance the ASM compatibility to 12.1:

$ asmcmd setattr -G DG1 compatible.asm 12.1.0.0.0

Check the replication status:

$ asmcmd lsattr -G DG1 -l phys_meta_replicated
Name                  Value
phys_meta_replicated  true

The physical metadata has been replicated, so we should now see the kfdhdb.flags set to 1:

$ kfed read /dev/sdi1 | egrep "dskname|flags"
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

The physical metadata should be replicated in AU11:

$ kfed read /dev/sdi1 aun=11 | egrep "type|dskname|flags"
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:                DG1_0000 ; 0x028: length=8
kfdhdb.flags:                         1 ; 0x0fc: 0x00000001

$ kfed read /dev/sdi1 aun=11 blkn=1 | grep type
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
$ kfed read /dev/sdi1 aun=11 blkn=2 | grep type
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL

This shows that the AU11 has the copy of the data from AU0.

Finally check for the disk header copy in AU1:

$ kfed read /dev/sdi1 aun=1 blkn=254 | grep type
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD

This shows that there is also a copy of the disk header in the second last block of AU1.
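To sum up where the three disk header copies live, here is a small sketch. It assumes 256 metadata blocks per allocation unit, which matches blkn=254 being the second last block of AU1 in this example. The function name is made up for illustration.

```python
def disk_header_copies(blocks_per_au: int = 256) -> list:
    """Return (allocation unit, block) pairs holding a disk header copy.

    blocks_per_au is an assumption that fits this example, where the
    second last block of AU1 is block 254.
    """
    return [
        (0, 0),                   # the disk header itself
        (1, blocks_per_au - 2),   # copy kept since 11.1.0.7
        (11, 0),                  # copy made by 12.1 metadata replication
    ]


for aun, blkn in disk_header_copies():
    print(f"kfed read /dev/sdi1 aun={aun} blkn={blkn} | grep type")
```

Each printed command should report KFBTYP_DISKHEAD on a disk in a disk group with ASM compatibility of 12.1 or higher.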

Conclusion

ASM version 12 replicates the physically addressed metadata, i.e. it keeps the copy of AU0 in AU11 - on the same disk. This allows ASM to automatically recover from damage to any data in AU0. Note that ASM will not be able to recover from loss of any other data in an external redundancy disk group. In a normal redundancy disk group, ASM will be able to recover from a loss of any data in one or more disks in a single failgroup. In a high redundancy disk group, ASM will be able to recover from a loss of any data in one or more disks in any two failgroups.


August 14, 2013

ASM version 12c is out


Oracle Database version 12c has been released, which means a brand new version of ASM is out! Notable new features are Flex ASM, proactive data validation and better handling of disk management operations. Let's have an overview with more details in separate posts.

Flex ASM

No need to run ASM instances on all nodes in the cluster. In a default installation there would be three ASM instances, irrespective of the number of nodes in the cluster. An ASM instance can serve both local and remote databases. If an ASM instance fails, the database instances do not crash; instead they fail over to another ASM instance in the cluster.

Flex ASM introduces a new instance type - an I/O server or ASM proxy instance. There will be a few (default is 3) I/O server instances in an Oracle flex cluster environment, serving indirect clients (typically an ACFS cluster file system). An I/O server instance can run on the same node as an ASM instance or on a different node in a flex cluster. In all cases, an I/O server instance needs to talk to a flex ASM instance to get metadata information on behalf of an indirect client.

Flex ASM is an optional feature in 12c.

Physical metadata replication

In addition to replicating the disk header (available since 11.1.0.7), ASM 12c also replicates the allocation table, within each disk. This makes ASM more resilient to bad disk sectors and external corruptions. The disk group attribute PHYS_META_REPLICATED is provided to track the replication status of a disk group.

$ asmcmd lsattr -G DATA -l phys_meta_replicated
Name Value
phys_meta_replicated true

The physical metadata replication status flag is in the disk header (kfdhdb.flags). This flag only ever goes from 0 to 1 (once the physical metadata has been replicated) and it never goes back to 0.

More storage

ASM 12c supports 511 disk groups, with the maximum disk size of 32 PB.

Online with power

ASM 12c has a fast mirror resync power limit to control resync parallelism and improve performance. Disk resync checkpoint functionality provides faster recovery from instance failures by enabling the resync to resume from the point at which the process was interrupted or stopped, instead of starting from the beginning. ASM 12c also provides a time estimate for the completion of a resync operation.

Use power limit for disk resync operations, similar to disk rebalance, with the range from 1 to 1024:

$ asmcmd online -G DATA -D DATA_DISK1 --power 42

Disk scrubbing - proactive data validation and repair

In ASM 12c the disk scrubbing checks for data corruptions and repairs them automatically in normal and high redundancy disk groups. This is done during disk group rebalance if a disk group attribute CONTENT.CHECK is set to TRUE. The check can also be performed manually by running ALTER DISKGROUP SCRUB command.

The scrubbing can be performed at the disk group, disk or a file level and can be monitored via V$ASM_OPERATION view.

Even read for disk groups

In previous ASM versions, the data was always read from the primary copy (in normal or high redundancy disk groups) unless a preferred failgroup was set up. The data from the mirror would be read only if the primary copy of the data was unavailable. With the even read feature, each read request can be sent to the least loaded of the possible source disks. The least loaded in this context is simply the disk with the fewest read requests.
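The selection rule can be illustrated with a few lines of Python. This is only a sketch of the idea of picking the least loaded source disk, not how ASM implements it. The function name, the disk names and the request counts are made up for illustration.

```python
def pick_read_source(read_requests: dict) -> str:
    """Return the disk with the fewest read requests.

    read_requests maps each disk that holds a copy of the data to its
    current number of read requests. Ties go to the first disk listed.
    """
    if not read_requests:
        raise ValueError("no source disks to choose from")
    return min(read_requests, key=read_requests.get)


# Primary copy on DATA_0000, mirror copy on DATA_0003 (hypothetical load).
source = pick_read_source({"DATA_0000": 12, "DATA_0003": 4})
```

In this made-up case the read goes to DATA_0003, the mirror copy, because that disk has fewer read requests than the disk holding the primary copy.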

Even read functionality is enabled by default on all Oracle Database and Oracle ASM instances of version 12.1 and higher in non-Exadata environments. The functionality is enabled in an Exadata environment when there is a failure. Even read functionality is applicable only to disk groups with normal or high redundancy.

Replace an offline disk

We now have a new ALTER DISKGROUP REPLACE DISK command, that is a mix of the rebalance and fast mirror resync functionality. Instead of a full rebalance, the new, replacement disk, is populated with data read from the surviving partner disks only. This effectively reduces the time to replace a failed disk.

Note that the disk being replaced must be in OFFLINE state. If the disk offline timer has expired, the disk is dropped, which initiates the rebalance. On a disk add, there will be another rebalance.

ASM password file in a disk group

ASM version 11.2 allowed the ASM spfile to be placed in a disk group. In 12c we can also put the ASM password file in an ASM disk group. Unlike the ASM spfile, the ASM password file can be accessed only after ASM startup and once the disk group containing the password file is mounted.

The orapwd utility now accepts an ASM disk group as a password file destination. The asmcmd has also been enhanced to allow ASM password file management.

Failgroup repair timer

We now have a failgroup repair timer with the default value of 24 hours. Note that the disk repair timer still defaults to 3.6 hours.

Rebalance rebalanced

The rebalance work is now estimated based on the detailed work plan, that can be generated and viewed separately. We now have a new EXPLAIN WORK command and a new V$ASM_ESTIMATE view.

In ASM 12c we (finally) have a priority ordered rebalance - the critical files (typically control files and redo logs) are rebalanced before other database files.

In Exadata, the rebalance can be offloaded to storage cells.

Thin provisioning support

ASM 12c enables thin provisioning support for some operations (that are typically associated with the disk group rebalance). The feature is disabled by default, and can be enabled at the disk group creation time or later by setting disk group attribute THIN_PROVISIONED to TRUE.

Enhanced file access control (ACL)

File ownership and permission changes are now easier, e.g. a file permission can be changed on an open file. ACL has also been implemented for Microsoft Windows OS.

Oracle Cluster Registry (OCR) backup in ASM disk group

Storing the OCR backup in an Oracle ASM disk group simplifies OCR management by permitting access to the OCR backup from any node in the cluster should an OCR recovery become necessary.

Use the ocrconfig command to specify an OCR backup location in an Oracle ASM disk group:

# ocrconfig -backuploc +DATA