June 16, 2013

How many allocation units per file

This post is about the amount of space allocated to ASM based files.

The smallest amount of space ASM allocates is an allocation unit (AU). The default AU size is 1 MB, except in Exadata where the default AU size is 4 MB.

The space for ASM based files is allocated in extents, which consist of one or more AUs. In version 11.2, the first 20000 extents consist of 1 AU, the next 20000 extents have 4 AUs, and extents beyond that have 16 AUs. This is known as the variable size extent feature. In version 11.1, the extent growth was 1-8-64 AUs. In version 10, we don't have variable size extents, so all extent sizes are exactly 1 AU.
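
As a rough illustration of the extent growth rule, here is a minimal Python sketch. The 1-4-16 sizes and the 20000-extent tiers for 11.2 come from the paragraph above; applying the same 20000-extent tiers to 11.1 is my assumption, and the function names are made up for this example.

```python
# Sketch of the variable size extent rule described above.
# Assumption: version 11.1 uses the same 20000-extent tiers as 11.2.
EXTENT_GROWTH = {
    "10": (1, 1, 1),     # no variable size extents
    "11.1": (1, 8, 64),
    "11.2": (1, 4, 16),
}
TIER_SIZE = 20000  # number of extents in each of the first two tiers


def extent_size_in_aus(extent_number, version="11.2"):
    """Return the size, in AUs, of the given zero-based extent."""
    first, second, third = EXTENT_GROWTH[version]
    if extent_number < TIER_SIZE:
        return first
    if extent_number < 2 * TIER_SIZE:
        return second
    return third


def aus_covered(extent_count, version="11.2"):
    """Total AUs covered by the first extent_count extents (one copy only)."""
    return sum(extent_size_in_aus(n, version) for n in range(extent_count))


print(extent_size_in_aus(0))       # first extent
print(extent_size_in_aus(20000))   # first extent of the second tier
print(aus_covered(20001))          # 20000 one-AU extents plus one 4-AU extent
```

Both 10 MB and 100 MB files used later in this post stay well inside the first tier, so every extent in the examples is exactly 1 AU.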

Bytes vs space

The definition for V$ASM_FILE view says the following for BYTES and SPACE columns:
  • BYTES - Number of bytes in the file
  • SPACE - Number of bytes allocated to the file

There is a subtle difference in the definitions and a very large difference in the numbers. Let's have a closer look. For the examples in this post I will use database and ASM version 11.2.0.3, with ASMLIB based disks.

First get some basic info about disk group DATA where most of my datafiles are. Run the following SQL connected to the database instance.

SQL> select NAME, GROUP_NUMBER, ALLOCATION_UNIT_SIZE/1024/1024 "AU size (MB)", TYPE
from V$ASM_DISKGROUP
where NAME='DATA';

NAME             GROUP_NUMBER AU size (MB) TYPE
---------------- ------------ ------------ ------
DATA                        1            1 NORMAL

Now create one small file (under 60 extents) and one large file (over 60 extents).

SQL> create tablespace T1 datafile '+DATA' size 10 M;

Tablespace created.

SQL> create tablespace T2 datafile '+DATA' size 100 M;

Tablespace created.

Get the ASM file numbers for those two files:

SQL> select NAME, round(BYTES/1024/1024) "MB" from V$DATAFILE;

NAME                                               MB
------------------------------------------ ----------
...
+DATA/br/datafile/t1.272.818281717                 10
+DATA/br/datafile/t2.271.818281741                100

The small file is ASM file number 272 and the large file is ASM file number 271.

Get the bytes and space information (in AUs) for these two files.

SQL> select FILE_NUMBER, round(BYTES/1024/1024) "Bytes (AU)", round(SPACE/1024/1024) "Space (AUs)", REDUNDANCY
from V$ASM_FILE
where FILE_NUMBER in (271, 272) and GROUP_NUMBER=1;

FILE_NUMBER Bytes (AU) Space (AUs) REDUND
----------- ---------- ----------- ------
        272         10          22 MIRROR
        271        100         205 MIRROR

The BYTES column shows the actual file size. For the small file, it shows the file size is 10 AUs = 10 MB (the AU size is 1 MB). The space required for the small file is 22 AUs: 10 AUs for the actual datafile plus 1 AU for the file header, and because the file is mirrored, double that, so 22 AUs in total.
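
Note that dividing by 1024*1024 gives a count of AUs only because this disk group has a 1 MB AU. In a disk group with a different AU size (for example the 4 MB Exadata default), SPACE would have to be divided by the ALLOCATION_UNIT_SIZE instead. A small Python sketch of that conversion, with a made-up function name:

```python
MB = 1024 * 1024


def space_in_aus(space_bytes, au_size_bytes):
    """Convert a V$ASM_FILE SPACE value (bytes) into allocation units.

    SPACE is allocated in whole AUs, so a remainder means the wrong
    AU size was passed in.
    """
    aus, remainder = divmod(space_bytes, au_size_bytes)
    if remainder:
        raise ValueError("SPACE is not a multiple of the AU size")
    return aus


# The small file from the example: SPACE of 22 MB in a 1 MB AU disk group.
print(space_in_aus(22 * MB, 1 * MB))
```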

For the large file, bytes shows the file size is 100 AUs = 100 MB. So far so good. But the space required for the large file is 205 AUs, not 202 as one might expect. What are those extra 3 AUs for? Let's find out.

ASM space

The following query (run in the ASM instance) will show us the extent distribution for ASM file 271.

SQL> select XNUM_KFFXP "Virtual extent", PXN_KFFXP "Physical extent", DISK_KFFXP "Disk number", AU_KFFXP "AU number"
from X$KFFXP
where GROUP_KFFXP=1 and NUMBER_KFFXP=271
order by 1,2;

Virtual extent Physical extent Disk number  AU number
-------------- --------------- ----------- ----------
             0               0           3       1155
             0               1           0       1124
             1               2           0       1125
             1               3           2       1131
             2               4           2       1132
             2               5           0       1126
...
           100             200           3       1418
           100             201           1       1412
    2147483648               0           3       1122
    2147483648               1           0       1137
    2147483648               2           2       1137

205 rows selected.

As the file is mirrored, we see that each virtual extent has two physical extents. But the interesting part of the result is the last three allocation units, for virtual extent number 2147483648, which is triple mirrored. We will have a closer look at those with kfed, and for that we will need disk names.
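
To make the split explicit, here is a small Python sketch that separates the data extents from the rows with virtual extent number 2147483648 (which is 2**31). The rows are the ones visible in the output above (the elided middle part is left out), and the function name is made up for this example.

```python
INDIRECT_XNUM = 2147483648  # 2**31, the virtual extent number seen above

# (virtual extent, physical extent, disk number, AU number)
SAMPLE_ROWS = [
    (0, 0, 3, 1155),
    (0, 1, 0, 1124),
    (1, 2, 0, 1125),
    (1, 3, 2, 1131),
    (100, 200, 3, 1418),
    (100, 201, 1, 1412),
    (2147483648, 0, 3, 1122),
    (2147483648, 1, 0, 1137),
    (2147483648, 2, 2, 1137),
]


def split_extents(rows):
    """Split X$KFFXP rows into data extents and the special virtual extent."""
    data = [r for r in rows if r[0] != INDIRECT_XNUM]
    indirect = [r for r in rows if r[0] == INDIRECT_XNUM]
    return data, indirect


data, indirect = split_extents(SAMPLE_ROWS)
print(len(indirect), "AUs for the special virtual extent")
print("on disks:", sorted(r[2] for r in indirect))
```

The three copies sit on three different disks, which is what we would expect from a triple mirrored extent.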

Get the disk names.

SQL> select DISK_NUMBER, PATH
from V$ASM_DISK
where GROUP_NUMBER=1;

DISK_NUMBER PATH
----------- ---------------
          0 ORCL:ASMDISK1
          1 ORCL:ASMDISK2
          2 ORCL:ASMDISK3
          3 ORCL:ASMDISK4

Let's now check what type of data is in those allocation units.

$ kfed read /dev/oracleasm/disks/ASMDISK4 aun=1122 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

$ kfed read /dev/oracleasm/disks/ASMDISK1 aun=1137 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

$ kfed read /dev/oracleasm/disks/ASMDISK3 aun=1137 | grep type
kfbh.type:                           12 ; 0x002: KFBTYP_INDIRECT

These additional allocation units hold ASM metadata for the large file. More specifically, they hold extent map information that could not fit into the ASM file directory block. The file directory needs extra space to keep track of files larger than 60 extents, so it needs an additional allocation unit to do so. While the file directory needs only a few extra ASM metadata blocks, the smallest unit of space ASM can allocate is an AU. And because this is metadata, this AU is triple mirrored (even in a normal redundancy disk group), hence 3 extra allocation units for the large file. In an external redundancy disk group, there would be only one extra AU per large file.

Conclusion

The amount of space ASM needs for a file depends on two factors - the file size and the disk group redundancy.

In an external redundancy disk group, the required space will be the file size + 1 AU for the file header + 1 AU for indirect extents if the file is larger than 60 AUs.

In a normal redundancy disk group, the required space will be twice the file size + 2 AUs for the file header + 3 AUs for indirect extents if the file is larger than 60 AUs.

In a high redundancy disk group, the required space will be three times the file size + 3 AUs for the file header + 3 AUs for indirect extents if the file is larger than 60 AUs.
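
The three rules above can be put into a short Python sketch. It assumes all extents are 1 AU in size (true for the files in this post) and that a single AU of indirect extents is enough; the names are made up for this example.

```python
DATA_COPIES = {"external": 1, "normal": 2, "high": 3}
# Indirect extents are metadata, so they are triple mirrored
# in both normal and high redundancy disk groups.
INDIRECT_COPIES = {"external": 1, "normal": 3, "high": 3}
DIRECT_LIMIT_AUS = 60  # files larger than this need indirect extents


def required_space_in_aus(file_size_aus, redundancy):
    """Return the AUs ASM needs for a file of the given size."""
    copies = DATA_COPIES[redundancy]
    # File data plus 1 AU for the file header, for every copy.
    space = copies * (file_size_aus + 1)
    if file_size_aus > DIRECT_LIMIT_AUS:
        space += INDIRECT_COPIES[redundancy]
    return space


print(required_space_in_aus(10, "normal"))    # the small file
print(required_space_in_aus(100, "normal"))   # the large file
```

For the two files in this post the sketch gives 22 and 205 AUs, matching the SPACE values reported by V$ASM_FILE.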

June 13, 2013

Auto disk management feature in Exadata


The automatic disk management feature is about automating ASM disk operations in an Exadata environment. The automation functionality applies to both planned actions (for example, deactivating griddisks in preparation for storage cell patching) and unplanned events (for example, disk failure).

Exadata disks

In an Exadata environment we have the following disk types:
  • Physicaldisk is a hard disk on a storage cell. Each storage cell has 12 physical disks, all with the same capacity (600 GB, 2 TB or 3 TB).
  • Flashdisk is a Sun Flash Accelerator PCIe solid state disk on a storage cell. Each storage cell has 16 flashdisks - 24 GB each in X2 (Sun Fire X4270 M2) and 100 GB each in X3 (Sun Fire X4270 M3) servers.
  • Celldisk is a logical disk created on every physicaldisk and every flashdisk on a storage cell. Celldisks created on physicaldisks are named CD_00_cellname, CD_01_cellname ... CD_11_cellname. Celldisks created on flashdisks are named FD_00_cellname, FD_01_cellname ... FD_15_cellname.
  • Griddisk is a logical disk that can be created on a celldisk. In a standard Exadata deployment we create griddisks on hard disk based celldisks only. While it is possible to create griddisks on flashdisks, this is not a standard practice; instead we use flash based celldisks for the flashcache and flashlog.
  • ASM disk in an Exadata environment is a griddisk.
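
The celldisk naming convention described above can be sketched in a few lines of Python. The cell name "cell01" and the function name are made up for this example; the disk counts (12 hard disks, 16 flashdisks) are the ones quoted in the list.

```python
def celldisk_names(cellname, hard_disks=12, flash_disks=16):
    """Return the celldisk names for one storage cell."""
    hard = [f"CD_{i:02d}_{cellname}" for i in range(hard_disks)]
    flash = [f"FD_{i:02d}_{cellname}" for i in range(flash_disks)]
    return hard + flash


names = celldisk_names("cell01")  # hypothetical cell name
print(names[0], names[11], names[12], names[-1])
```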

Automated disk operations

These are the disk operations that are automated in Exadata:

1. Griddisk status change to OFFLINE/ONLINE

If a griddisk becomes temporarily unavailable, it will be automatically OFFLINED by ASM. When the griddisk becomes available, it will be automatically ONLINED by ASM.

2. Griddisk DROP/ADD

If a physicaldisk fails, all griddisks on that physicaldisk will be DROPPED with FORCE option by ASM. If a physicaldisk status changes to predictive failure, all griddisks on that physicaldisk will be DROPPED by ASM. If a flashdisk performance degrades, the corresponding griddisks (if any) will be DROPPED with FORCE option by ASM.

When a physicaldisk is replaced, the celldisk and griddisks will be recreated by CELLSRV, and the griddisks will be automatically ADDED by ASM.

NOTE: If a griddisk in NORMAL state and ONLINE mode status is manually dropped with FORCE option (for example, by a DBA with 'alter diskgroup ... drop disk ... force'), it will be automatically added back by ASM. In other words, dropping a healthy disk with a force option will not achieve the desired effect.

3. Griddisk OFFLINE/ONLINE for rolling Exadata software (storage cells) upgrade

Before the rolling upgrade all griddisks will be inactivated on the storage cell by CELLSRV and OFFLINED by ASM. After the upgrade all griddisks will be activated on the storage cell and ONLINED in ASM.

4. Manual griddisk activation/inactivation

If a griddisk is manually inactivated on a storage cell, by running 'cellcli -e alter griddisk ... inactive', it will be automatically OFFLINED by ASM. When a griddisk is activated on a storage cell, it will be automatically ONLINED by ASM.

5. Griddisk confined ONLINE/OFFLINE

If a griddisk is taken offline by CELLSRV, because the underlying disk is suspected of poor performance, all griddisks on that celldisk will be automatically OFFLINED by ASM. If the tests confirm that the celldisk is performing poorly, ASM will drop all griddisks on that celldisk. If the tests find that the disk is actually fine, ASM will online all griddisks on that celldisk.
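
As a quick summary of the operations above, here is a lookup table written as a Python sketch. The event names are made up for this example (they are not Exadata or ASM identifiers); the actions are the ones described in this section.

```python
# Event names are hypothetical labels, not real Exadata identifiers.
AUTOMATED_ACTIONS = {
    "griddisk_temporarily_unavailable": "OFFLINE",
    "griddisk_available_again": "ONLINE",
    "physicaldisk_failed": "DROP FORCE",
    "physicaldisk_predictive_failure": "DROP",
    "flashdisk_performance_degraded": "DROP FORCE",
    "physicaldisk_replaced": "ADD",
    "griddisk_inactivated_on_cell": "OFFLINE",
    "griddisk_activated_on_cell": "ONLINE",
    "confinement_tests_failed": "DROP",
    "confinement_tests_passed": "ONLINE",
}


def asm_action(event):
    """Return the ASM operation performed automatically for an event."""
    return AUTOMATED_ACTIONS[event]


print(asm_action("physicaldisk_failed"))
print(asm_action("physicaldisk_predictive_failure"))
```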

Software components

1. Cell Server (CELLSRV)

The Cell Server (CELLSRV) runs on the storage cell and it's the main component of Exadata software. In the context of automatic disk management, its tasks are to process the Management Server notifications and handle ASM queries about the state of griddisks.

2. Management Server (MS)

The Management Server (MS) runs on the storage cell and implements a web service for cell management commands, and runs background monitoring threads. The MS monitors the storage cell for hardware changes (e.g. disk plugged in) or alerts (e.g. disk failure), and notifies the CELLSRV about those events.

3. Automatic Storage Management (ASM)

The Automatic Storage Management (ASM) instance runs on the compute (database) node and has two processes that are relevant to the automatic disk management feature:
  • Exadata Automation Manager (XDMG) initiates automation tasks involved in managing Exadata storage. It monitors all configured storage cells for state changes, such as a failed disk getting replaced, and performs the required tasks for such events. Its primary tasks are to watch for inaccessible disks and cells and, when they become accessible again, to initiate the ASM ONLINE operation.
  • Exadata Automation Manager (XDWK) performs automation tasks requested by XDMG. It gets started when asynchronous actions such as disk ONLINE, DROP and ADD are requested by XDMG. After a 5 minute period of inactivity, this process will shut itself down.

Working together

All three software components work together to achieve automatic disk management.

In the case of disk failure, the MS detects that the disk has failed. It then notifies the CELLSRV about it. If there are griddisks on the failed disk, the CELLSRV notifies ASM about the event. ASM then drops all griddisks from the corresponding disk groups.

In the case of a replacement disk inserted into the storage cell, the MS detects the new disk and checks the cell configuration file to see if celldisk and griddisks need to be created on it. If yes, it notifies the CELLSRV to do so. Once finished, the CELLSRV notifies ASM about new griddisks and ASM then adds them to the corresponding disk groups.

In the case of a poorly performing disk, the CELLSRV first notifies ASM to offline the disk. If possible, ASM then offlines the disk. One example of when ASM would refuse to offline the disk is when a partner disk is already offline. Offlining the disk would result in the disk group dismount, so ASM would not do that. Once the disk is offlined by ASM, it notifies the CELLSRV that the performance tests can be carried out. Once done with the tests, the CELLSRV will either tell ASM to drop that disk (if it failed the tests) or online it (if it passed the tests).
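
The poorly performing disk scenario has the most branches, so here it is as a Python sketch. This is only an illustration of the sequence described above, not how CELLSRV or ASM implement it; the function and argument names are made up.

```python
def confine_disk(partner_offline, tests_passed):
    """Return the ordered steps for a disk suspected of poor performance."""
    steps = ["CELLSRV asks ASM to offline the disk"]
    if partner_offline:
        # Offlining would dismount the disk group, so ASM refuses.
        steps.append("ASM refuses: partner disk is already offline")
        return steps
    steps.append("ASM offlines the disk")
    steps.append("ASM tells CELLSRV the performance tests can run")
    steps.append("CELLSRV runs the performance tests")
    if tests_passed:
        steps.append("CELLSRV tells ASM to online the disk")
    else:
        steps.append("CELLSRV tells ASM to drop the disk")
    return steps


for step in confine_disk(partner_offline=False, tests_passed=False):
    print(step)
```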

The actions by MS, CELLSRV and ASM are coordinated in a similar fashion for other disk events.

ASM initialization parameters

The following are the ASM initialization parameters relevant to the auto disk management feature:
  • _AUTO_MANAGE_EXADATA_DISKS controls the auto disk management feature. To disable the feature set this parameter to FALSE. Range of values: TRUE [default] or FALSE.
  • _AUTO_MANAGE_NUM_TRIES controls the maximum number of attempts to perform an automatic operation. Range of values: 1-10. Default value is 2.
  • _AUTO_MANAGE_MAX_ONLINE_TRIES controls the maximum number of attempts to ONLINE a disk. Range of values: 1-10. Default value is 3.

All three parameters are static, which means changing them requires an ASM instance restart. Note that all these are hidden (underscore) parameters that should not be modified unless advised by Oracle Support.
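
For reference, the defaults and ranges above can be captured in a small Python sketch that checks whether a proposed value is within the documented range. This is only a sanity check on paper; it does not talk to ASM, and the function name is made up.

```python
PARAMETERS = {
    "_AUTO_MANAGE_EXADATA_DISKS": {"default": True, "allowed": (True, False)},
    "_AUTO_MANAGE_NUM_TRIES": {"default": 2, "allowed": range(1, 11)},
    "_AUTO_MANAGE_MAX_ONLINE_TRIES": {"default": 3, "allowed": range(1, 11)},
}


def is_valid(name, value):
    """Check a proposed value against the documented type and range."""
    spec = PARAMETERS[name]
    # Compare types exactly, because in Python True == 1.
    same_type = type(value) is type(spec["default"])
    return same_type and value in spec["allowed"]


print(is_valid("_AUTO_MANAGE_NUM_TRIES", 10))
print(is_valid("_AUTO_MANAGE_NUM_TRIES", 11))
```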

Files

The following are the files relevant to the automatic disk management feature:

1. Cell configuration file - $OSSCONF/cell_disk_config.xml. An XML file on the storage cell that contains information about all configured objects (storage cell, disks, IORM plans, etc) except alerts and metrics. The CELLSRV reads this file during startup and writes to it when an object is updated (e.g. updates to IORM plan).

2. Grid disk file - $OSSCONF/griddisk.owners.dat. A binary file on the storage cell that contains the following information for all griddisks:
  • ASM disk name
  • ASM disk group name
  • ASM failgroup name
  • Cluster identifier (which cluster this disk belongs to)
  • Requires DROP/ADD (should the disk be dropped from or added to ASM)

3. MS log and trace files - ms-odl.log and ms-odl.trc in $ADR_BASE/diag/asm/cell/`hostname -s`/trace directory on the storage cell.

4. CELLSRV alert log - alert.log in $ADR_BASE/diag/asm/cell/`hostname -s`/trace directory on the storage cell.

5. ASM alert log - alert_+ASMn.log in $ORACLE_BASE/diag/asm/+asm/+ASMn/trace directory on the compute node.

6. XDMG and XDWK trace files - +ASMn_xdmg_nnnnn.trc and +ASMn_xdwk_nnnnn.trc in $ORACLE_BASE/diag/asm/+asm/+ASMn/trace directory on the compute node.
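
When chasing a disk event across the cell and the compute node, it helps to have the log locations in one place. The following Python sketch just builds the paths listed above; the base directories, host name and instance name passed in are made-up example values, and the nnnnn part of the trace file names is left as a wildcard.

```python
import posixpath


def cell_log_paths(adr_base, cell_hostname):
    """Paths of the MS and CELLSRV logs on a storage cell."""
    trace_dir = posixpath.join(adr_base, "diag", "asm", "cell",
                               cell_hostname, "trace")
    return {
        "ms_log": posixpath.join(trace_dir, "ms-odl.log"),
        "ms_trace": posixpath.join(trace_dir, "ms-odl.trc"),
        "cellsrv_alert": posixpath.join(trace_dir, "alert.log"),
    }


def asm_log_paths(oracle_base, asm_sid):
    """Paths of the ASM alert log and XDMG/XDWK traces on a compute node."""
    trace_dir = posixpath.join(oracle_base, "diag", "asm", "+asm",
                               asm_sid, "trace")
    return {
        "alert": posixpath.join(trace_dir, f"alert_{asm_sid}.log"),
        "xdmg_traces": posixpath.join(trace_dir, f"{asm_sid}_xdmg_*.trc"),
        "xdwk_traces": posixpath.join(trace_dir, f"{asm_sid}_xdwk_*.trc"),
    }


# Hypothetical values, for illustration only.
print(cell_log_paths("/opt/oracle/cell", "cell01")["cellsrv_alert"])
print(asm_log_paths("/u01/app/oracle", "+ASM1")["alert"])
```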

Conclusion

In an Exadata environment, ASM has been enhanced to provide the automatic disk management functionality. The three software components that work together to provide this facility are the Exadata Cell Server (CELLSRV), Exadata Management Server (MS) and Automatic Storage Management (ASM).

I have also published this via MOS as Doc ID 1484274.1.