The views expressed on this blog are my own and do not necessarily reflect the views of Oracle

April 1, 2012

ASM in Exadata


ASM is a critical component of the Exadata software stack, and it behaves a bit differently than in non-Exadata environments. It still manages your disk groups, but builds them from grid disks. It still takes care of disk errors, but also handles predictive disk failures. It doesn't like external redundancy, but it makes disk groups smart scan capable. Let's have a closer look.

Grid disks

In Exadata, the ASM disks live on storage cells and are presented to compute nodes (where ASM instances run) via Oracle's proprietary iDB protocol. Each storage cell has 12 hard disks and 16 flash disks. During Exadata deployment, grid disks are created on those 12 hard disks. Flash disks are used for the flash cache and the redo log cache, so grid disks are normally not created on flash disks.

Grid disks are not exposed to the operating system, so only database instances, ASM and related utilities that speak iDB can see them. kfod, the ASM disk discovery tool, is one such utility. Here is an example of kfod discovering grid disks in one Exadata environment:

$ kfod disks=all
-----------------------------------------------------------------
 Disk          Size Path                           User     Group
=================================================================

   1:     433152 Mb o/192.168.10.9/DATA_CD_00_exacell01  
   2:     433152 Mb o/192.168.10.9/DATA_CD_01_exacell01  
   3:     433152 Mb o/192.168.10.9/DATA_CD_02_exacell01  
  ...
  13:      29824 Mb o/192.168.10.9/DBFS_DG_CD_02_exacell01
  14:      29824 Mb o/192.168.10.9/DBFS_DG_CD_03_exacell01
  15:      29824 Mb o/192.168.10.9/DBFS_DG_CD_04_exacell01
  ...
  23:     108224 Mb o/192.168.10.9/RECO_CD_00_exacell01  
  24:     108224 Mb o/192.168.10.9/RECO_CD_01_exacell01  
  25:     108224 Mb o/192.168.10.9/RECO_CD_02_exacell01  
  ...
 474:     108224 Mb o/192.168.10.22/RECO_CD_09_exacell14  
 475:     108224 Mb o/192.168.10.22/RECO_CD_10_exacell14  
 476:     108224 Mb o/192.168.10.22/RECO_CD_11_exacell14  

-----------------------------------------------------------------
ORACLE_SID ORACLE_HOME
=================================================================
  +ASM1 /u01/app/11.2.0.3/grid
  +ASM2 /u01/app/11.2.0.3/grid
  +ASM3 /u01/app/11.2.0.3/grid
  ...
  +ASM8 /u01/app/11.2.0.3/grid
$

Note that grid disks are prefixed with either DATA, RECO or DBFS_DG. Those are ASM disk group names in this environment. Each grid disk name ends with the storage cell name. It is also important to note that disks with the same prefix have the same size. The above example is from a full rack - hence 14 storage cells and 8 ASM instances.
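The naming convention is regular enough to take apart mechanically. Here is a minimal Python sketch, assuming every path has the form o/<cell IP>/<prefix>_CD_<nn>_<cell name> as in the kfod listing, and that cell names contain no underscore. The names parse_grid_disk_path and GridDisk are made up for illustration and are not part of any Oracle tool.

```python
import re
from typing import NamedTuple


class GridDisk(NamedTuple):
    cell_ip: str    # storage cell IP address from the path
    prefix: str     # grid disk prefix, e.g. DATA, RECO or DBFS_DG
    cell_disk: int  # cell disk number (the NN in _CD_NN_)
    cell_name: str  # storage cell name at the end of the grid disk name


# Assumed layout: o/<cell IP>/<prefix>_CD_<nn>_<cell name>
_PATH_RE = re.compile(r"^o/([^/]+)/(.+)_CD_(\d+)_([^_/]+)$")


def parse_grid_disk_path(path: str) -> GridDisk:
    """Split a grid disk path, as printed by kfod, into its components."""
    match = _PATH_RE.match(path.strip())
    if match is None:
        raise ValueError(f"not a recognised grid disk path: {path!r}")
    ip, prefix, number, cell = match.groups()
    return GridDisk(ip, prefix, int(number), cell)


disk = parse_grid_disk_path("o/192.168.10.9/DBFS_DG_CD_02_exacell01")
print(disk.prefix, disk.cell_disk, disk.cell_name)
```

Grouping the parsed disks by prefix is then a quick way to confirm that all disks with the same prefix report the same size.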

ASM_DISKSTRING

In Exadata, ASM_DISKSTRING='o/*/*'. This tells ASM that it is running on an Exadata compute node and should expect grid disks.

$ sqlplus / as sysasm
SQL> show parameter asm_diskstring
NAME           TYPE   VALUE
-------------- ------ -----
asm_diskstring string o/*/*
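To get a feel for what such a wildcard string selects, here is a rough Python sketch. It uses Python's own shell-style matching (fnmatchcase) as a stand-in, which is an assumption on my part: ASM's real discovery logic is not documented here and may differ in detail. The /dev path is a hypothetical non-Exadata device added only for contrast.

```python
from fnmatch import fnmatchcase

paths = [
    "o/192.168.10.9/DATA_CD_00_exacell01",
    "o/192.168.10.9/RECO_CD_00_exacell01",
    "/dev/mapper/asmdisk1",  # hypothetical non-Exadata device path
]


def matching(pattern, candidates):
    """Return the candidate paths that match a shell-style pattern."""
    return [p for p in candidates if fnmatchcase(p, pattern)]


# The discovery string picks up every grid disk on every cell.
print(matching("o/*/*", paths))
# A narrower pattern picks up only the grid disks with one prefix.
print(matching("o/*/RECO*", paths))
```

The second pattern is the same style of wildcard that is handy when naming the disks for a single disk group.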

Automatic failgroups

There are no external redundancy disk groups in Exadata - you have a choice of either normal or high redundancy. When creating disk groups, ASM automatically puts all grid disks from the same storage cell into the same failgroup. The failgroup is then named after the storage cell.

Here is an example of creating a disk group in an Exadata environment (note how the grid disk prefix comes in handy):

SQL> create diskgroup RECO
disk 'o/*/RECO*'
attribute
'COMPATIBLE.ASM'='11.2.0.0.0',
'COMPATIBLE.RDBMS'='11.2.0.0.0',
'CELL.SMART_SCAN_CAPABLE'='TRUE';

Once the disk group is created we can check the disk and failgroup names:

SQL> select name, failgroup, path from v$asm_disk_stat where name like 'RECO%';

NAME                 FAILGROUP PATH
-------------------- --------- -----------------------------------
RECO_CD_08_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_08_exacell01
RECO_CD_07_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_07_exacell01
RECO_CD_01_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_01_exacell01
...
RECO_CD_00_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_00_exacell02
RECO_CD_05_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_05_exacell02
RECO_CD_04_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_04_exacell02
...

SQL>

Note that we did not specify the failgroup names in the CREATE DISKGROUP statement. ASM has automatically put grid disks from the same storage cell in the same failgroup.
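The resulting layout can be mimicked in a few lines of Python. This is only a sketch of the naming outcome, not of how ASM implements it; it assumes the cell name is the last underscore-separated token of the grid disk name and that disk and failgroup names are simply upper-cased, as the v$asm_disk_stat output suggests. The function name failgroups is my own.

```python
from collections import defaultdict


def failgroups(paths):
    """Group grid disk paths by storage cell, one failgroup per cell."""
    groups = defaultdict(list)
    for path in paths:
        disk_name = path.rsplit("/", 1)[-1]       # RECO_CD_08_exacell01
        cell_name = disk_name.rsplit("_", 1)[-1]  # exacell01
        groups[cell_name.upper()].append(disk_name.upper())
    return dict(groups)


paths = [
    "o/192.168.10.3/RECO_CD_08_exacell01",
    "o/192.168.10.3/RECO_CD_07_exacell01",
    "o/192.168.10.4/RECO_CD_00_exacell02",
]
for failgroup, disks in failgroups(paths).items():
    print(failgroup, disks)
```

With one failgroup per cell, losing a whole storage cell takes out exactly one failgroup, which is what normal and high redundancy are designed to survive.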

cellip.ora


The cellip.ora file is the configuration file, present on every database server, that tells ASM instances which cells are available to the cluster.

Here is the content of a typical cellip.ora file for a quarter rack system:

$ cat /etc/oracle/cell/network-config/cellip.ora
cell="192.168.10.3"
cell="192.168.10.4"
cell="192.168.10.5"

Now that we have seen what is in cellip.ora, the grid disk paths in the examples above should make more sense.
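To make the connection explicit, here is a small Python sketch that reads the cell addresses out of cellip.ora text and builds a grid disk path from one of them. It assumes one cell="..." entry per line with a single address, exactly as in the example; the function names cell_addresses and discovery_path are hypothetical.

```python
import re

# Sample content copied from the quarter rack example above.
CELLIP_ORA = '''\
cell="192.168.10.3"
cell="192.168.10.4"
cell="192.168.10.5"
'''


def cell_addresses(text):
    """Return the addresses listed in cell="..." lines, in file order."""
    return re.findall(r'^\s*cell\s*=\s*"([^"]+)"', text, flags=re.MULTILINE)


def discovery_path(cell_ip, grid_disk):
    """Build the o/<cell IP>/<grid disk name> path that ASM reports."""
    return f"o/{cell_ip}/{grid_disk}"


cells = cell_addresses(CELLIP_ORA)
print(cells)
print(discovery_path(cells[0], "RECO_CD_08_exacell01"))
```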

Disk group attributes

The following attributes and their values are recommended in Exadata environments:
  • COMPATIBLE.ASM - Should be set to the ASM software version in use.
  • COMPATIBLE.RDBMS - Should be set to the database software version in use.
  • CELL.SMART_SCAN_CAPABLE - Has to be set to TRUE. This attribute/value is actually mandatory in Exadata.
  • AU_SIZE - Should be set to 4M. This is the default value in recent ASM versions for Exadata environments.

Initialization parameters

The following recommendations are for ASM version 11.2.0.3.


Parameter              Value
---------------------  ----------------------------------------------------------
CLUSTER_INTERCONNECTS  Bondib0 IP address for X2-2. Colon-delimited Bondib* IP addresses for X2-8.
ASM_POWER_LIMIT        1 for a quarter rack, 2 for all other racks.
SGA_TARGET             1250 MB
PGA_AGGREGATE_TARGET   400 MB
MEMORY_TARGET          0
MEMORY_MAX_TARGET      0
PROCESSES              Fewer than 10 instances per node: 50*(#db instances per node + 1).
                       10 or more instances per node: [50*MIN(#db instances per node + 1, 11)] + [10*MAX(#db instances per node - 10, 0)]
USE_LARGE_PAGES        ONLY
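The PROCESSES recommendation is easier to check with a few numbers plugged in. Here is a minimal Python sketch of the two-branch formula exactly as given above; the function name asm_processes is my own.

```python
def asm_processes(db_instances_per_node: int) -> int:
    """PROCESSES value for an ASM instance, per the recommendation above."""
    n = db_instances_per_node
    if n < 0:
        raise ValueError("instance count cannot be negative")
    if n < 10:
        # Fewer than 10 instances: 50 per instance, plus 50 for ASM itself.
        return 50 * (n + 1)
    # 10 or more: the 50-per-instance part is capped at 11 instances,
    # and each instance beyond the tenth adds another 10.
    return 50 * min(n + 1, 11) + 10 * max(n - 10, 0)


for n in (1, 4, 9, 10, 12):
    print(n, asm_processes(n))
```

For example, 4 database instances per node gives 250, 10 gives 550 and 12 gives 570.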

Voting disks and disk group redundancy

The default location for voting disks in Exadata is the ASM disk group DBFS_DG. That disk group can be either normal or high redundancy, except in a quarter rack, where it has to be normal redundancy.

This is because of the voting disks requirement for the minimal number of failgroups in a given ASM disk group. If we put voting disks in a normal redundancy disk group, that disk group has to have at least 3 failgroups. If we put voting disks in a high redundancy disk group, that disk group has to have at least 5 failgroups.

In a quarter rack, where we have only 3 storage cells, all disk groups can have at most 3 failgroups. While we can create a high redundancy disk group with 3 storage cells, voting disks cannot go into that disk group as it does not have 5 failgroups.
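The rule boils down to a simple lookup. Here is a small Python sketch using only the failgroup minimums stated above (3 for normal redundancy, 5 for high); the names MIN_FAILGROUPS_FOR_VOTING and can_hold_voting_disks are made up for illustration.

```python
# Minimum number of failgroups needed to hold voting disks, per the text above.
MIN_FAILGROUPS_FOR_VOTING = {"normal": 3, "high": 5}


def can_hold_voting_disks(redundancy: str, failgroups: int) -> bool:
    """True if a disk group meets the voting disk failgroup requirement."""
    try:
        required = MIN_FAILGROUPS_FOR_VOTING[redundancy.lower()]
    except KeyError:
        raise ValueError(f"unsupported redundancy: {redundancy!r}") from None
    return failgroups >= required


# Quarter rack: 3 storage cells, hence 3 failgroups.
print(can_hold_voting_disks("normal", 3))  # True
print(can_hold_voting_disks("high", 3))    # False
# Full rack: 14 storage cells.
print(can_hold_voting_disks("high", 14))   # True
```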

XDMG and XDWK background processes

These two processes run in ASM instances on compute nodes. XDMG monitors all configured Exadata cells for storage state changes and performs the required tasks for such events. Its primary role is to watch for inaccessible disks and to initiate the disk online operations when they become accessible again. Those operations are then handled by XDWK.

XDWK gets started when asynchronous actions such as disk ONLINE, DROP and ADD are requested by XDMG. After a 5 minute period of inactivity, this process will shut itself down.

The Exadata Server software, which runs on the storage cells, monitors disk health and performance. If a disk's performance degrades, the Exadata Server can put that disk into proactive failure mode. It also monitors for predictive failures based on the disk's SMART (Self-Monitoring, Analysis and Reporting Technology) data. In both cases, the Exadata Server notifies XDMG to take those disks offline.

When a faulty disk is replaced on the storage cell, the Exadata Server will recreate all grid disks on the new disk. It will then notify XDMG to bring those grid disks online, or to add them back to their disk groups in case they were already dropped.

The diskmon

The master diskmon process (diskmon.bin) can be seen running in all Grid Infrastructure installs, but it's only in Exadata that it's actually doing any work. On every compute node there will be one master diskmon process and one DSKM (slave diskmon) process for every Oracle instance, including ASM. Here is an example from one compute node:

# ps -ef | egrep "diskmon|dskm" | grep -v grep
oracle    3205     1  0 Mar16 ?        00:01:18 ora_dskm_ONE2
oracle   10755     1  0 Mar16 ?        00:32:19 /u01/app/11.2.0.3/grid/bin/diskmon.bin -d -f
oracle   17292     1  0 Mar16 ?        00:01:17 asm_dskm_+ASM2
oracle   24388     1  0 Mar28 ?        00:00:21 ora_dskm_TWO2
oracle   27962     1  0 Mar27 ?        00:00:24 ora_dskm_THREE2
#
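If you want to check the "one master, one DSKM per instance" layout on a node, the ps output can be summarised with a short script. Here is a Python sketch that works on the sample output above pasted in as text; the function name summarize_diskmon is hypothetical, and the process name patterns are taken from this one listing only.

```python
import re

# Sample ps output copied from the listing above.
PS_OUTPUT = """\
oracle    3205     1  0 Mar16 ?        00:01:18 ora_dskm_ONE2
oracle   10755     1  0 Mar16 ?        00:32:19 /u01/app/11.2.0.3/grid/bin/diskmon.bin -d -f
oracle   17292     1  0 Mar16 ?        00:01:17 asm_dskm_+ASM2
oracle   24388     1  0 Mar28 ?        00:00:21 ora_dskm_TWO2
oracle   27962     1  0 Mar27 ?        00:00:24 ora_dskm_THREE2
"""


def summarize_diskmon(ps_text):
    """Count master diskmon processes and list instances with a DSKM slave."""
    masters = len(re.findall(r"/diskmon\.bin\b", ps_text))
    instances = re.findall(r"\b(?:ora|asm)_dskm_(\S+)", ps_text)
    return masters, instances


masters, instances = summarize_diskmon(PS_OUTPUT)
print(masters, instances)
```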

In Exadata, the diskmon is responsible for
  • Handling of storage cell failures and I/O fencing
  • Monitoring of Exadata Server state on all storage cells in the cluster (heartbeat)
  • Broadcasting intra database IORM (I/O Resource Manager) plans from databases to storage cells
  • Monitoring of the control messages from database and ASM instances to storage cells
  • Communicating with other diskmons in the cluster

ACFS

The ACFS (ASM Cluster File System) is supported in Exadata environments starting with ASM version 12.1.0.2. Alternatives to the ACFS are the DBFS (Database based File System) and the NFS (Network File System). Many Exadata customers have an Oracle ZFS Appliance that can provide high performance, InfiniBand connected NFS storage.

Conclusion

There are quite a few extra features and differences in ASM compared to non-Exadata environments. Most of them are about storage cells and grid disks, and some are about tuning ASM for the extreme Exadata performance.

16 comments:

  1. Hi Bane- Thanks for excellent post, may I ask why does ACFS not supported with Exadata ?

    Thanks

    1. Thanks Jagjeet,
      I think the official reason is that Exadata disks are not exposed directly to database nodes (kernel). But I guess the real reasons are to do with technical challenges around implementing ACFS in Exadata environment and ensuring good ACFS performance...
      Cheers,
      Bane

  2. Hi Bane,

    Could you please elaborate the difference between terms "proactive failure","predictive failure","poor performance","failure" with respect to both flash disk and grid disk.

    For the image version :11.2.3.2.0
    Thanks,
    Uday

    1. Hi Uday,

      Exadata monitors disk/flash performance at all times and if a disk/flash is under-performing it will take it out of the system. That would be proactive failure as the disk/flash is taken out before it has actually failed.

      Exadata also monitors for the number of media and other disk/flash failures (e.g. an I/O write failure due to physical media damage). If there are too many of those, Exadata is 'predicting' that it will soon fail and it takes it out of the system.

      I believe the poor performance case is the same as the proactive failure - the disk/flash would be removed from the system if it is under-performing. For a bit more on this topic have a look at MOS Doc ID 1484274.1 that I have published recently.

      Cheers,
      Bane

  3. Hi Bane,

    can u please provide me with the conditions under which an exadata cell fails..

    1. A storage cell is an x86 server, running Linux and Exadata software stack. A motherboard can fail, a CPU can fail, Linux can crash, Exadata software can become unresponsive, etc. All these would be critical failures.
      It would be good if you could clarify what exactly you are after or what you are worried about.
      Cheers,
      Bane

    2. physical disk failure.. why this happens... showing the status as critical.. reasons

  4. a physical disk failed showing status as critical... what are the conditions...

    1. OK, so you meant a single cell disk failure, not the cell failure.

      That simply means the disk has failed and it needs to be replaced.

      Cheers,
      Bane

  5. Hi Bane,
    I am very new exadata.. As i know ACFS is not supporting exadata. For the same reason we have DBFS file system which is running over ASM..I am trying to get information regarding how we configure disk (Any LVM LUN and all)for ASM. As we know in exadata we have cell server for storage and the same every cell we have 12 hard disk and 14 flash.. How we are managing and configuring the same with ASM.

    May i ask some silly question but really believe me i am trying me best to clear my drought on same.

    And also request you to pls provide your some notes for the same topics.

    Regards'
    Raj Gupta

      Thanks for your kind words Raj. You made a good point. I see that I haven't really talked about setting up ASM in Exadata. This is done at the deployment time by Oracle and if you are not there at the time, you miss the whole thing. I will write a post on that...

      To answer your questions - in Exadata we create grid disks on storage cells and then ASM uses those grid disks to create disk groups. This is done at the deployment time and rarely changes, so DBAs and storage admins don't get to learn about it. A common reason for a change is a need to resize a disk group. You would then need to drop the grid disks, recreate them, etc. All this is done on the storage cells, using the cellcli command.

      I recommend you review the following documents:
      1. Oracle® Exadata Storage Server Software User's Guide [http://docs.oracle.com/html/E13861_14/toc.htm]
      2. Resize diskgroup without downtime in Exadata (Doc ID 1272569.1) [no link as you need to access it via My Oracle Support (MOS)]
      3. How to resize ASM disk/Grid disk in Exadata Environment (Doc ID 1245494.1) [same with this one - access via MOS only]

      Let me know if you have more questions or if you run into any issues.

      Cheers,
      Bane

    2. Hi Bane',

      Again' I am very much thankful to you for your reply on above.. As per your comment I will check the same notes and will ask further question on its.. Now I have very positive feeling to learn exadata because now I have someone where I can share my drought and clear it.

      Thanks again.

      Regards'
      Raj gupta

  6. And also really Thank you so much for this wonderful blogs

  7. Very nice blog, I will keep on visiting this.
