April 1, 2012

ASM in Exadata


ASM is a critical component of the Exadata software stack, and it behaves a bit differently than in non-Exadata environments. It still manages your disk groups, but builds them from grid disks. It still takes care of disk errors, but also handles predictive disk failures. It doesn't like external redundancy, but it makes disk groups smart scan capable. Let's have a closer look.

Grid disks

In Exadata, the ASM disks live on storage cells and are presented to compute nodes (where ASM instances run) via Oracle's proprietary iDB protocol. Each storage cell has 12 hard disks and 16 flash disks. During Exadata deployment, grid disks are created on those 12 hard disks. Flash disks are used for the flash cache and redo log cache, so grid disks are normally not created on flash disks.

Grid disks are not exposed to the operating system, so only database instances, ASM and related utilities that speak iDB can see them. kfod, the ASM disk discovery tool, is one such utility. Here is an example of kfod discovering grid disks in one Exadata environment:

$ kfod disks=all
-----------------------------------------------------------------
 Disk          Size Path                           User     Group
=================================================================

   1:     433152 Mb o/192.168.10.9/DATA_CD_00_exacell01  
   2:     433152 Mb o/192.168.10.9/DATA_CD_01_exacell01  
   3:     433152 Mb o/192.168.10.9/DATA_CD_02_exacell01  
  ...
  13:      29824 Mb o/192.168.10.9/DBFS_DG_CD_02_exacell01
  14:      29824 Mb o/192.168.10.9/DBFS_DG_CD_03_exacell01
  15:      29824 Mb o/192.168.10.9/DBFS_DG_CD_04_exacell01
  ...
  23:     108224 Mb o/192.168.10.9/RECO_CD_00_exacell01  
  24:     108224 Mb o/192.168.10.9/RECO_CD_01_exacell01  
  25:     108224 Mb o/192.168.10.9/RECO_CD_02_exacell01  
  ...
 474:     108224 Mb o/192.168.10.22/RECO_CD_09_exacell14  
 475:     108224 Mb o/192.168.10.22/RECO_CD_10_exacell14  
 476:     108224 Mb o/192.168.10.22/RECO_CD_11_exacell14  

-----------------------------------------------------------------
ORACLE_SID ORACLE_HOME
=================================================================
  +ASM1 /u01/app/11.2.0.3/grid
  +ASM2 /u01/app/11.2.0.3/grid
  +ASM3 /u01/app/11.2.0.3/grid
  ...
  +ASM8 /u01/app/11.2.0.3/grid
$

Note that grid disks are prefixed with either DATA, RECO or DBFS_DG. Those are ASM disk group names in this environment. Each grid disk name ends with the storage cell name. It is also important to note that disks with the same prefix have the same size. The above example is from a full rack - hence 14 storage cells and 8 ASM instances.
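The naming convention can be unpacked mechanically. Here is a minimal Python sketch, assuming the path layout o/<cell IP>/<prefix>_CD_<nn>_<cell name> seen in the kfod output above and assuming cell names contain no underscores. The names GridDisk and parse_grid_disk_path are made up for this illustration and are not part of any Oracle tool.

```python
import re
from typing import NamedTuple


class GridDisk(NamedTuple):
    cell_ip: str     # storage cell IP address from the path
    prefix: str      # disk group prefix, e.g. DATA, RECO or DBFS_DG
    cell_disk: int   # the NN in CD_NN
    cell_name: str   # storage cell name at the end of the grid disk name


# Assumed layout: o/<cell IP>/<prefix>_CD_<nn>_<cell name>
_PATH_RE = re.compile(
    r"^o/(?P<ip>[^/]+)/(?P<prefix>.+)_CD_(?P<nn>\d+)_(?P<cell>[^_/]+)$"
)


def parse_grid_disk_path(path: str) -> GridDisk:
    """Split a grid disk path into its parts (hypothetical helper)."""
    m = _PATH_RE.match(path.strip())
    if m is None:
        raise ValueError(f"not a grid disk path: {path!r}")
    return GridDisk(m["ip"], m["prefix"], int(m["nn"]), m["cell"])


print(parse_grid_disk_path("o/192.168.10.9/DBFS_DG_CD_02_exacell01"))
print(parse_grid_disk_path("o/192.168.10.22/RECO_CD_11_exacell14"))
```

As a sanity check on the listing itself: the first cell appears to contribute disks 1 to 34 (DATA at 1 to 12, DBFS_DG at 13 to 22, RECO at 23 to 34), and 14 cells times 34 grid disks gives 476, which is the last disk number kfod reports.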

ASM_DISKSTRING

In Exadata, ASM_DISKSTRING='o/*/*'. This tells ASM that it is running on an Exadata compute node and should expect grid disks.

$ sqlplus / as sysasm
SQL> show parameter asm_diskstring
NAME           TYPE   VALUE
-------------- ------ -----
asm_diskstring string o/*/*

Automatic failgroups

There are no external redundancy disk groups in Exadata - you have a choice of either normal or high redundancy. When creating disk groups, ASM automatically puts all grid disks from the same storage cell into the same failgroup. The failgroup is then named after the storage cell.

Here is an example of creating a disk group in an Exadata environment (note how the grid disk prefix comes in handy):

SQL> create diskgroup RECO
disk 'o/*/RECO*'
attribute
'COMPATIBLE.ASM'='11.2.0.0.0',
'COMPATIBLE.RDBMS'='11.2.0.0.0',
'CELL.SMART_SCAN_CAPABLE'='TRUE';

Once the disk group is created we can check the disk and failgroup names:

SQL> select name, failgroup, path from v$asm_disk_stat where name like 'RECO%';

NAME                 FAILGROUP PATH
-------------------- --------- -----------------------------------
RECO_CD_08_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_08_exacell01
RECO_CD_07_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_07_exacell01
RECO_CD_01_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_01_exacell01
...
RECO_CD_00_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_00_exacell02
RECO_CD_05_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_05_exacell02
RECO_CD_04_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_04_exacell02
...

SQL>

Note that we did not specify the failgroup names in the CREATE DISKGROUP statement. ASM has automatically put grid disks from the same storage cell in the same failgroup.
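To make the grouping concrete, here is a rough Python sketch of what the result looks like: select paths with a wildcard pattern, then group them by the cell name at the end of each grid disk name. This only imitates the outcome. Python's fnmatch is an approximation of ASM's discovery string matching, the function name failgroups is hypothetical, and the DATA path in the sample is added purely to show the filter.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def failgroups(paths, pattern):
    """Group grid disks matching pattern by upper-cased cell name."""
    groups = defaultdict(list)
    for path in paths:
        if not fnmatchcase(path, pattern):
            continue
        grid_disk = path.rsplit("/", 1)[1]
        # Assumption: the cell name is the text after the last underscore.
        cell_name = grid_disk.rsplit("_", 1)[1]
        # ASM disk and failgroup names appear in upper case in v$asm_disk_stat.
        groups[cell_name.upper()].append(grid_disk.upper())
    return dict(groups)


paths = [
    "o/192.168.10.3/RECO_CD_08_exacell01",
    "o/192.168.10.3/RECO_CD_07_exacell01",
    "o/192.168.10.3/DATA_CD_00_exacell01",   # illustrative, filtered out below
    "o/192.168.10.4/RECO_CD_00_exacell02",
]

for failgroup, disks in sorted(failgroups(paths, "o/*/RECO*").items()):
    print(failgroup, disks)
```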

cellip.ora


cellip.ora is the configuration file, present on every database server, that tells ASM instances which cells are available to the cluster.

Here are the contents of a typical cellip.ora file for a quarter rack system:

$ cat /etc/oracle/cell/network-config/cellip.ora
cell="192.168.10.3"
cell="192.168.10.4"
cell="192.168.10.5"

Now that we have seen what is in cellip.ora, the grid disk paths in the examples above should make more sense: each path has the form o/<cell IP address>/<grid disk name>.
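The link between cellip.ora and the grid disk paths can be sketched in a few lines of Python. This is an illustration only: it handles just the single-address cell="..." form shown above, and parse_cellip is a made-up helper name, not an Oracle utility.

```python
import re


def parse_cellip(text):
    """Return the cell IP addresses found in cellip.ora-style text."""
    return re.findall(r'^\s*cell\s*=\s*"([^"]+)"', text, flags=re.MULTILINE)


sample = '''cell="192.168.10.3"
cell="192.168.10.4"
cell="192.168.10.5"
'''

# Each cell IP becomes the middle component of the grid disk paths.
for ip in parse_cellip(sample):
    print(f"o/{ip}/*")
```

Three entries mean three storage cells, which matches the quarter rack in this example.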

Disk group attributes

The following attributes and their values are recommended in Exadata environments:
  • COMPATIBLE.ASM - Should be set to the ASM software version in use.
  • COMPATIBLE.RDBMS - Should be set to the database software version in use.
  • CELL.SMART_SCAN_CAPABLE - Has to be set to TRUE. This attribute/value is actually mandatory in Exadata.
  • AU_SIZE - Should be set to 4M. This is the default value in recent ASM versions for Exadata environments.

Initialization parameters

The following recommendations are for ASM version 11.2.0.3.


Parameter              Value
---------------------  ----------------------------------------------------------------
CLUSTER_INTERCONNECTS  Bondib0 IP address for X2-2. Colon-delimited Bondib* IP addresses for X2-8.
ASM_POWER_LIMIT        1 for a quarter rack, 2 for all other racks.
SGA_TARGET             1250 MB
PGA_AGGREGATE_TARGET   400 MB
MEMORY_TARGET          0
MEMORY_MAX_TARGET      0
PROCESSES              For fewer than 10 instances per node: 50*(#db instances per node + 1).
                       For 10 or more instances per node: [50*MIN(#db instances per node + 1, 11)] + [10*MAX(#db instances per node - 10, 0)]
USE_LARGE_PAGES        ONLY
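
The PROCESSES rule is easier to check with a few numbers. The Python sketch below simply transcribes the two-branch formula from the table; the function name asm_processes is invented for this example.

```python
def asm_processes(db_instances_per_node: int) -> int:
    """Recommended ASM PROCESSES value for a given number of database
    instances per node, transcribed from the formula in the table."""
    n = db_instances_per_node
    if n < 0:
        raise ValueError("instance count cannot be negative")
    if n < 10:
        return 50 * (n + 1)
    return 50 * min(n + 1, 11) + 10 * max(n - 10, 0)


for n in (1, 3, 9, 10, 12):
    print(n, asm_processes(n))
```

For the compute node shown later in this post, which runs three database instances, the formula gives 50*(3 + 1) = 200. From the tenth instance onward the first term is capped at 550 and each further instance adds 10.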

Voting disks and disk group redundancy

The default location for voting disks in Exadata is the ASM disk group DBFS_DG. That disk group can be either normal or high redundancy, except in a quarter rack, where it has to be normal redundancy.

This is because of the voting disks requirement for the minimal number of failgroups in a given ASM disk group. If we put voting disks in a normal redundancy disk group, that disk group has to have at least 3 failgroups. If we put voting disks in a high redundancy disk group, that disk group has to have at least 5 failgroups.

In a quarter rack, where we have only 3 storage cells, all disk groups can have at most 3 failgroups. While we can create a high redundancy disk group with 3 storage cells, voting disks cannot go into that disk group as it does not have 5 failgroups.
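
The failgroup rule above boils down to a simple comparison. Here is a small Python sketch of it, using only the minimums stated in this section; the names MIN_FAILGROUPS_FOR_VOTING and can_hold_voting_disks are hypothetical.

```python
# Minimum number of failgroups a disk group needs before it can hold
# voting disks, per redundancy level (values taken from the text above).
MIN_FAILGROUPS_FOR_VOTING = {"normal": 3, "high": 5}


def can_hold_voting_disks(redundancy: str, failgroups: int) -> bool:
    """True if a disk group with this redundancy and failgroup count
    satisfies the voting disk requirement."""
    return failgroups >= MIN_FAILGROUPS_FOR_VOTING[redundancy.lower()]


# Quarter rack: 3 storage cells, so at most 3 failgroups.
print(can_hold_voting_disks("normal", 3))  # True
print(can_hold_voting_disks("high", 3))    # False
# Full rack: 14 storage cells.
print(can_hold_voting_disks("high", 14))   # True
```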

XDMG and XDWK background processes

These two processes run in ASM instances on compute nodes. XDMG monitors all configured Exadata cells for storage state changes and performs the required tasks for such events. Its primary role is to watch for inaccessible disks and to initiate disk online operations when they become accessible again. Those operations are then handled by XDWK.

XDWK gets started when asynchronous actions such as disk ONLINE, DROP and ADD are requested by XDMG. After a 5 minute period of inactivity, this process will shut itself down.

Exadata Server, which runs on the storage cells, monitors disk health and performance. If a disk's performance degrades, Exadata Server can put that disk into proactive failure mode. It also watches for predictive failures based on the disk's SMART (Self-Monitoring, Analysis and Reporting Technology) data. In both cases, Exadata Server notifies XDMG to take those disks offline.

When a faulty disk is replaced on the storage cell, Exadata Server recreates all grid disks on the new disk. It then notifies XDMG to bring those grid disks online or, if they were already dropped, to add them back to their disk groups.

The diskmon

The master diskmon process (diskmon.bin) can be seen running in all Grid Infrastructure installs, but it's only in Exadata that it's actually doing any work. On every compute node there will be one master diskmon process and one DSKM (slave diskmon) process for every Oracle instance, including ASM. Here is an example from one compute node:

# ps -ef | egrep "diskmon|dskm" | grep -v grep
oracle    3205     1  0 Mar16 ?        00:01:18 ora_dskm_ONE2
oracle   10755     1  0 Mar16 ?        00:32:19 /u01/app/11.2.0.3/grid/bin/diskmon.bin -d -f
oracle   17292     1  0 Mar16 ?        00:01:17 asm_dskm_+ASM2
oracle   24388     1  0 Mar28 ?        00:00:21 ora_dskm_TWO2
oracle   27962     1  0 Mar27 ?        00:00:24 ora_dskm_THREE2
#

In Exadata, the diskmon is responsible for
  • Handling of storage cell failures and I/O fencing
  • Monitoring of Exadata Server state on all storage cells in the cluster (heartbeat)
  • Broadcasting intra-database IORM (I/O Resource Manager) plans from databases to storage cells
  • Monitoring of the control messages from database and ASM instances to storage cells
  • Communicating with other diskmons in the cluster

ACFS

ACFS (ASM Cluster File System) is supported in Exadata environments starting with ASM version 12.1.0.2. Alternatives to ACFS are DBFS (Database File System) and NFS (Network File System). Many Exadata customers have an Oracle ZFS Appliance that can provide high performance, InfiniBand connected NFS storage.

Conclusion

There are quite a few extra features and differences in ASM compared to non-Exadata environments. Most of them are about storage cells and grid disks, and some are about tuning ASM for the extreme Exadata performance.