October 27, 2012

Where is my data


Sometimes we want to know where exactly is a particular database block - on which ASM disk, in which allocation unit on that disk and in which block of that allocation unit. In this post I will show how to work that out.

Database instance

In the first part of this exercise I am logged into the database instance. Let's create a tablespace first.

SQL> create tablespace T1 datafile '+DATA';

Tablespace created.

SQL> select f.FILE#, f.NAME "File", t.NAME "Tablespace"
from V$DATAFILE f, V$TABLESPACE t
where t.NAME='T1' and f.TS# = t.TS#;

FILE# File                               Tablespace
----- ---------------------------------- ----------
   6  +DATA/br/datafile/t1.272.797809075 T1

SQL>

Note the ASM file number is 272.

Let's now create a table and insert some data into it

SQL> create table TAB1 (n number, name varchar2(16)) tablespace T1;

Table created.

SQL> insert into TAB1 values (1, 'CAT');

1 row created.

SQL> commit;

Commit complete.

SQL>

Get the block number.

SQL> select ROWID, NAME from TAB1;

ROWID              NAME
------------------ ----
AAASxxAAGAAAACHAAA CAT

SQL> select DBMS_ROWID.ROWID_BLOCK_NUMBER('AAASxxAAGAAAACHAAA') "Block number" from DUAL;

Block number
------------
         135

SQL>

Get the block size for the datafile.

SQL> select BLOCK_SIZE from V$DATAFILE where FILE#=6;

BLOCK_SIZE
----------
      8192

SQL>

From the above I see that the data is in block 135 and that the block size is 8KB.

ASM instance

I now connect to the ASM instance and first check the extent distributions for ASM datafile 272.

SQL> select GROUP_NUMBER from V$ASM_DISKGROUP where NAME='DATA';

GROUP_NUMBER
------------
           1

SQL> select PXN_KFFXP, -- physical extent number
XNUM_KFFXP,            -- virtual extent number
DISK_KFFXP,            -- disk number
AU_KFFXP               -- allocation unit number
from X$KFFXP
where NUMBER_KFFXP=272 -- ASM file 272
AND GROUP_KFFXP=1      -- group number 1
order by 1;

PXN_KFFXP  XNUM_KFFXP DISK_KFFXP   AU_KFFXP
---------- ---------- ---------- ----------
         0          0          0       1175
         1          0          3       1170
         2          1          3       1175
         3          1          2       1179
         4          2          1       1175
...

SQL>

As expected, the file extents are spread over all disks and each (physical) extent is mirrored, as this file is normal redundancy. Note that I said the file is normal redundancy. By default the file inherits the disk group redundancy. The controlfile is an exception, as it gets created as high redundancy, even in the normal redundancy disk group - if the disk group has at least three failgroups.

I also need to know the ASM allocation unit size for this disk group.

SQL> select VALUE from V$ASM_ATTRIBUTE where NAME='au_size' and GROUP_NUMBER=1;

VALUE
-------
1048576

SQL>

The allocation unit size is 1MB. Note that each disk group can have a different allocation unit size.

Where is my block

I know my data is in block 135 of ASM file 272. With the block size of 8K each allocation unit can hold 128 blocks (1MB/8KB=128). That means the block 135 is 7th (135-128=7) in the second virtual extent. The second virtual extent consists of allocation unit 1175 on disk 3 and allocation unit 1179 on disk 2, as per the select from X$KFFXP.

Let's get the names of disks 2 and 3.

SQL> select DISK_NUMBER, NAME
from V$ASM_DISK
where DISK_NUMBER in (2,3);

DISK_NUMBER NAME
----------- ------------------------------
         2 ASMDISK3
         3 ASMDISK4

SQL>

I am using ASMLIB, so at the OS level, those disks are /dev/oracleasm/disks/ASMDISK3 and /dev/oracleasm/disks/ASMDISK4.

Show me the money

Let's recap. My data is 7 blocks into the allocation unit 1175. That allocation unit is 1175 MB into the disk /dev/oracleasm/disks/ASMDISK4.

Let's first extract that allocation unit.

$ dd if=/dev/oracleasm/disks/ASMDISK4 bs=1024k count=1 skip=1175 of=AU1175.dd
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.057577 seconds, 18.2 MB/s
$ ls -l AU1175.dd
-rw-r--r-- 1 grid oinstall 1048576 Oct 27 22:45 AU1175.dd
$

Note the arguments to the dd command:
  • bs=1024k - allocation unit size
  • skip=1175 - allocation unit I am interested in
  • count=1 - I only need one allocation unit

Let's now extract block 7 out of that allocation unit.

$ dd if=AU1175.dd bs=8k count=1 skip=7 of=block135.dd
$

Note the arguments to the dd command now - bs=8k (data block size) and skip=7 (block I am interested in).

Let's now look at that block.

$ od -c block135.dd
...
0017760 \0 \0 , 001 002 002 301 002 003 C A T 001 006 332 217
0020000
$

At the bottom of that block I see my data (CAT). Remember that Oracle blocks are populated from the bottom up.

Note that I would see the same if I looked at the allocation unit 1179 on disk /dev/oracleasm/disks/ASMDISK3.

Conclusion

To locate an Oracle data block in ASM, I had to know in which datafile that block was stored. I then queried X$KFFXP in ASM to see the extent distribution for that datafile. I also had to know both the datafile block size and ASM allocation unit size, to work out in which allocation unit my block was.

None of this is ASM or RDBMS version specific (except the query from V$ASM_ATTRIBUTE, as there is no such view in 10g). The ASM disk group redundancy is also irrelevant. Of course, with normal and high redundancy we will have multiple copies of data, but the method to find the data location is exactly the same for all types of disk group redundancy.

July 22, 2012

When will my rebalance complete


This is the top ASM questions people ask. But if you expect me to respond with a number of minutes, you will be disappointed.

After all, ASM has given you an estimate, and you still want to know when exactly is that rebalance going to finish. Instead, I will show you how to check if the rebalance is actually progressing, what phase it is in, and if there is a reason for concern.

Understanding the rebalance

As explained in the rebalancing act, the rebalance operation has three phases - planning, extents relocation and compacting. As far as the overall time to complete is concerned, the planing phase time is insignificant so there is no need to worry about it. The extent relocation phase will take most of the time, so the main focus will be on that.I will also show what is going on during the compacting phase.

It is important to know why the rebalance is running. If you are adding a new disk, say to increase the available disk group space, it doesn't really matter how long it will take for the rebalance to complete. OK maybe it does, if your database is hung because you ran out of space in your archive log destination. Similarly if you are resizing or dropping disk(s), to adjust the disk group space, you are generally not concerned with the time it takes for the rebalance to complete.

But if a disk has failed and ASM has initiated rebalance, there may be legitimate reason for concern. If your disk group is normal redundancy AND if another disk fails AND it's the partner of that disk that has already failed, your disk group will be dismounted, all your databases that use that disk group will crash and you may lose data. In such cases I understand that you want to know when that rebalance will complete. Actually, you want to see the relocation phase completed, as once it does, all your data is fully redundant again.

Extents relocation

To have a closer look at the extents relocation phase, I drop one of the disks with the default rebalance power:

SQL> show parameter power

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------
asm_power_limit                      integer     1

SQL> set time on
16:40:57 SQL> alter diskgroup DATA1 drop disk DATA1_CD_06_CELL06;

Diskgroup altered.

Initial estimated time to complete is 26 minutes:

16:41:21 SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION where GROUP_NUMBER=1;

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT          1
         2 REBAL RUN           1        516      53736       2012          26
         4 REBAL WAIT          1

About 10 minutes into the rebalance, the estimate is 24 minutes:

16:50:25 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT          1
         2 REBAL RUN           1      19235      72210       2124          24
         4 REBAL WAIT          1

While that EST_MINUTES doesn't give me much confidence, I see that the SOFAR (number of allocation units moved so far) is going up, which is a good sign.

ASM alert log shows the time of the drop disk, the OS process ID of the ARB0 doing all the work, and most importantly - that there are no errors:

Wed Jul 11 16:41:15 2012
SQL> alter diskgroup DATA1 drop disk DATA1_CD_06_CELL06
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
...
NOTE: starting rebalance of group 1/0x6ecaf3e6 (DATA1) at power 1
Starting background process ARB0
Wed Jul 11 16:41:24 2012
ARB0 started with pid=41, OS id=58591
NOTE: assigning ARB0 to group 1/0x6ecaf3e6 (DATA1) with 1 parallel I/O
NOTE: F1X0 copy 3 relocating from 0:2 to 55:35379 for diskgroup 1 (DATA1)
...

ARB0 trace file should show which file extents are being relocated. It does, and that is how I know that ARB0 is doing what it is supposed to do:

$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_arb0_58591.trc
...
ARB0 relocating file +DATA1.282.788356359 (120 entries)
*** 2012-07-11 16:48:44.808
ARB0 relocating file +DATA1.283.788356383 (120 entries)
...
*** 2012-07-11 17:13:11.761
ARB0 relocating file +DATA1.316.788357201 (120 entries)
*** 2012-07-11 17:13:16.326
ARB0 relocating file +DATA1.316.788357201 (120 entries)
...

Note that there may be lot of arb0 trace files in the trace directory, so that's why we need to know the OS process ID of the ARB0 actually doing the rebalance. That information is in the alert log of the ASM instance performing the rebalance.

I can also look at the pstack of the ARB0 process to see what is going on. It does show me that ASM is relocating extents (key functions on the stack being kfgbRebalExecute - kfdaExecute - kffRelocate):

# pstack 58591
#0  0x0000003957ccb6ef in poll () from /lib64/libc.so.6
...
#9  0x0000000003d711e0 in kfk_reap_oss_async_io ()
#10 0x0000000003d70c17 in kfk_reap_ios_from_subsys ()
#11 0x0000000000aea50e in kfk_reap_ios ()
#12 0x0000000003d702ae in kfk_io1 ()
#13 0x0000000003d6fe54 in kfkRequest ()
#14 0x0000000003d76540 in kfk_transitIO ()
#15 0x0000000003cd482b in kffRelocateWait ()
#16 0x0000000003cfa190 in kffRelocate ()
#17 0x0000000003c7ba16 in kfdaExecute ()
#18 0x0000000003d4beaa in kfgbRebalExecute ()
#19 0x0000000003d39627 in kfgbDriver ()
#20 0x00000000020e8d23 in ksbabs ()
#21 0x0000000003d4faae in kfgbRun ()
#22 0x00000000020ed95d in ksbrdp ()
#23 0x0000000002322343 in opirip ()
#24 0x0000000001618571 in opidrv ()
#25 0x0000000001c13be7 in sou2o ()
#26 0x000000000083ceba in opimai_real ()
#27 0x0000000001c19b58 in ssthrdmain ()
#28 0x000000000083cda1 in main ()

After about 35 minutes the EST_MINUTES dropps to 0:

17:16:54 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         2 REBAL RUN           1      74581      75825       2129           0
         3 REBAL WAIT          1
         4 REBAL WAIT          1

And soon after that, the ASM alert log shows:
  • Disk emptied
  • Disk header erased
  • PST update completed successfully
  • Disk closed
  • Rebalance completed
Wed Jul 11 17:17:32 2012
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
Wed Jul 11 17:17:41 2012
GMON updating for reconfiguration, group 1 at 20 for pid 38, osid 93832
NOTE: group 1 PST updated.
SUCCESS: grp 1 disk DATA1_CD_06_CELL06 emptied
NOTE: erasing header on grp 1 disk DATA1_CD_06_CELL06
NOTE: process _x000_+asm2 (93832) initiating offline of disk 0.3916039210 (DATA1_CD_06_CELL06) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96a042a, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 21 for pid 38, osid 93832
NOTE: PST update grp = 1 completed successfully
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96a042a, mask = 0x7e, op = clear
GMON updating disk modes for group 1 at 22 for pid 38, osid 93832
NOTE: cache closing disk 0 of grp 1: DATA1_CD_06_CELL06
NOTE: PST update grp = 1 completed successfully
GMON updating for reconfiguration, group 1 at 23 for pid 38, osid 93832
NOTE: cache closing disk 0 of grp 1: (not open) DATA1_CD_06_CELL06
NOTE: group 1 PST updated.
Wed Jul 11 17:17:41 2012
NOTE: membership refresh pending for group 1/0x6ecaf3e6 (DATA1)
GMON querying group 1 at 24 for pid 19, osid 38421
GMON querying group 1 at 25 for pid 19, osid 38421
NOTE: Disk  in mode 0x8 marked for de-assignment
SUCCESS: refreshed membership for 1/0x6ecaf3e6 (DATA1)
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0x6ecaf3e6 (DATA1)
NOTE: Attempting voting file refresh on diskgroup DATA1

So the estimated time was 26 minutes and the rebalance actually took about 36 minutes (in this particular case the compacting took less than a minute so I have ignored it). That is why it is more important to understand what is going on, then to know when will the rebalance complete.

Note that the estimated time may also be increasing. If the system is under heavy load, the rebalance will take more time - especially with the rebalance power 1. For a large disk group (many TB) and large number of files, the rebalance can take hours and possibly days.

If you want to get an idea how long will a drop disk take in your environment, you need to test it. Just drop one of the disks, while your system is under normal/typical load. Your data is fully redundant during such disk drop, so you are not exposed to a disk group dismount in case its partner disk fails during the rebalance.

Compacting

In another example, to look at the compacting phase, I add the same disk back, with rebalance power 10:

17:26:48 SQL> alter diskgroup DATA1 add disk '/o/*/DATA1_CD_06_celll06' rebalance power 10;

Diskgroup altered.

Initial estimated time to complete is 6 minutes:

17:27:22 SQL> select INST_ID, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_RATE, EST_MINUTES from GV$ASM_OPERATION where GROUP_NUMBER=1;

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         2 REBAL RUN          10        489      53851       7920           6
         3 REBAL WAIT         10
         4 REBAL WAIT         10

After about 10 minutes, the EST_MINUTES drops to 0:

17:39:05 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT         10
         2 REBAL RUN          10      92407      97874       8716           0
         4 REBAL WAIT         10

And I see the following in the ASM alert log

Wed Jul 11 17:39:49 2012
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
Wed Jul 11 17:39:58 2012
GMON updating for reconfiguration, group 1 at 31 for pid 43, osid 115117
NOTE: group 1 PST updated.
Wed Jul 11 17:39:58 2012
NOTE: membership refresh pending for group 1/0x6ecaf3e6 (DATA1)
GMON querying group 1 at 32 for pid 19, osid 38421
SUCCESS: refreshed membership for 1/0x6ecaf3e6 (DATA1)
NOTE: Attempting voting file refresh on diskgroup DATA1

That means ASM has completed the second phase of the rebalance and is compacting now. If that is true, the pstack should show kfdCompact() function. Indeed it does:

# pstack 103326
#0  0x0000003957ccb6ef in poll () from /lib64/libc.so.6
...
#9  0x0000000003d711e0 in kfk_reap_oss_async_io ()
#10 0x0000000003d70c17 in kfk_reap_ios_from_subsys ()
#11 0x0000000000aea50e in kfk_reap_ios ()
#12 0x0000000003d702ae in kfk_io1 ()
#13 0x0000000003d6fe54 in kfkRequest ()
#14 0x0000000003d76540 in kfk_transitIO ()
#15 0x0000000003cd482b in kffRelocateWait ()
#16 0x0000000003cfa190 in kffRelocate ()
#17 0x0000000003c7ba16 in kfdaExecute ()
#18 0x0000000003c4b737 in kfdCompact ()
#19 0x0000000003c4c6d0 in kfdExecute ()
#20 0x0000000003d4bf0e in kfgbRebalExecute ()
#21 0x0000000003d39627 in kfgbDriver ()
#22 0x00000000020e8d23 in ksbabs ()
#23 0x0000000003d4faae in kfgbRun ()
#24 0x00000000020ed95d in ksbrdp ()
#25 0x0000000002322343 in opirip ()
#26 0x0000000001618571 in opidrv ()
#27 0x0000000001c13be7 in sou2o ()
#28 0x000000000083ceba in opimai_real ()
#29 0x0000000001c19b58 in ssthrdmain ()
#30 0x000000000083cda1 in main ()

The tail on ARB0 trace file now shows relocating just 1 entry at the time (another sign of compacting):

$ tail -f /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_arb0_103326.trc
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
ARB0 relocating file +DATA1.321.788357323 (1 entries)
...

The V$ASM_OPERATION keeps showing EST_MINUTES=0 (compacting):

17:42:39 SQL> /

   INST_ID OPERA STAT      POWER      SOFAR   EST_WORK   EST_RATE EST_MINUTES
---------- ----- ---- ---------- ---------- ---------- ---------- -----------
         3 REBAL WAIT         10
         4 REBAL WAIT         10
         2 REBAL RUN          10      98271      98305       7919           0

The X$KFGMG shows REBALST_KFGMG=2 (compacting):

17:42:50 SQL> select NUMBER_KFGMG, OP_KFGMG, ACTUAL_KFGMG, REBALST_KFGMG from X$KFGMG;

NUMBER_KFGMG   OP_KFGMG ACTUAL_KFGMG REBALST_KFGMG
------------ ---------- ------------ -------------
           1          1           10             2

Once the compacting phase completes, the alert log shows "stopping process ARB0" and "rebalance completed":

Wed Jul 11 17:43:48 2012
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0x6ecaf3e6 (DATA1)

In this case, the extents relocation took about 12 minutes and the compacting took about 4 minutes.

The compacting phase can actually take significant amount of time. In one case I have seen the extents relocation run for 60 minutes and the compacting after that took another 30 minutes. But it doesn't really matter how long it takes for the compacting to complete, because as soon as the second phase of the rebalance (extent relocation) completes, all data is fully redundant and we are not exposed to disk group dismount due to partner disk failure.

Changing the power

Rebalance power can be changed dynamically, i.e. during the rebalance, so if your rebalance with the default power is 'too slow', you can increase it. How much? Well, do you understand your I/O load, your I/O throughput and most importantly your limits? If not, increase the power to 5 (just run 'ALTER DISKGROUP ... REBALANCE POWER 5;') and see if it makes a difference. Give it 10-15 minutes, before you jump to conclusions. Should you go higher? Again, as long as you are not adversely impacting your database I/O performance, you can keep increasing the power. But I haven't seen much improvement beyond power 30.

The testing is the key here. You really need to test this under your regular load and in your production environment. There is no point testing with no databases running and on a system that  runs off different storage system.

June 22, 2012

Exadata motherboard replacement


This photo story is about a motherboard replacement in an Exadata X2-2 database server (Sun Fire X4170 M2).

While setting up a half rack Exadata system, we discovered a problem with the motherboard in database server 4, so we got ourselves a new one. Even for an internal system, I had to log an SR, get our hardware team to review the diagnostics and have them confirm that a new motherboard is required. I had a full customer experience with me being both the customer and Support Engineer! Anyway, a replacement motherboard arrived the next day.














We stopped the clusterware on that database server and powered it down. It was time to take the new motherboard to that noisy computer room.

We disconnected both power cords from the server, but we left all other cables plugged in. We then (slowly, very slowly) pulled the server out of the rack (it's on rails, so it just slides out). In a field engineer lingo, we 'extended the server to the maintenance position'. All cables were still plugged in, but the cable management arm took good care of them. From the photo bellow, it can be seen that this was a half rack system, with all other database servers and storage cells up and running.














Once the server was fully extended, we removed the top cover and disconnected the cables (Ethernet, InfiniBand and KVM). We then removed the PCIe cards (RAID controller, dual port 10Gb Ethernet and dual port InfiniBand HCA). On the photo bellow, the PCIe cards and memory modules were already removed so the motherboard is fully exposed.














Motherboard was then taken out of the chassis. On the photo below we see the dual power supply (top left), two CPU heat sinks (removed from the CPUs and resting on top of the power supply), row of fans (middle) and disk drives (right). Well, we cannot see the disk drives as they are covered, but they are there under those tools.














It was then time to transfer the CPUs and memory modules from the old motherboard to the new one.














The memory modules are easy to put in as they just click in. The CPUs are tricky with all their pins, so they need to be put in very carefully.














After that the new motherboard was ready to go into the server. On the photo bellow we can see two CPUs, without the heat sinks, and two rows of memory modules. This server had 'only' 96MB of RAM, with plastic fillers in 6 slots. There is room for the total of 144GB of RAM on that motherboard.














The motherboard is shipped with the thermal paste, so now was the time to apply that paste on top of the CPUs, and put the heat sinks on top of them.














The photo bellow shows that all parts were back on the motherboard. Note that three PCIe cards are not plugged into the motherboard directly. They are plugged into the little raiser boards and those are plugged into the motherboard.














Plug back in all the cables (Ethernet, InfiniBand and KVM), put the cover on and slide the server back into the rack. Connect the power cables and turn the server on.

Our server came up fine and the only thing we had to do was set up ILOM. That is done by running /opt/oracle.cellos/ipconf.pl and specifying the ILOM name, IP address and other network details. We also had to reset the ILOM password.

Finally, we started up the clusterware and all services came up fine. All this was done with only a single database server out of action (for about an hour and a half) with the clusterware and database(s) running on the remaining three nodes in the cluster.

April 1, 2012

ASM in Exadata


ASM is a critical component of the Exadata software stack. It is also a bit different - compared to non-Exadata environments. It still manages your disk groups, but builds those with grid disks. It still takes care of disk errors, but also handles predictive disk failures. It doesn't like external redundancy, but it makes the disk group smart scan capable. Let's have a closer look.

Grid disks

In Exadata the ASM disks live on storage cells and are presented to compute nodes (where ASM instances run) via Oracle proprietary iDB protocol. Each storage cell has 12 hard disks and 16 flash disks. During Exadata deployment grid disks are created on those 12 hard disks. Flash disks are used for the flash and redo log cache, so grid disks are normally not created on flash disks.

Grid disks are not exposed to the Operating System, so only database instances, ASM and related utilities, that speak iDB, can see them. The kfod, ASM discovery tool, is one such utility. Here is an example of kfod discovering grid disks in one Exadata environment:

$ kfod disks=all
-----------------------------------------------------------------
 Disk          Size Path                           User     Group
=================================================================

   1:     433152 Mb o/192.168.10.9/DATA_CD_00_exacell01  
   2:     433152 Mb o/192.168.10.9/DATA_CD_01_exacell01  
   3:     433152 Mb o/192.168.10.9/DATA_CD_02_exacell01  
  ...
  13:      29824 Mb o/192.168.10.9/DBFS_DG_CD_02_exacell01
  14:      29824 Mb o/192.168.10.9/DBFS_DG_CD_03_exacell01
  15:      29824 Mb o/192.168.10.9/DBFS_DG_CD_04_exacell01
  ...
  23:     108224 Mb o/192.168.10.9/RECO_CD_00_exacell01  
  24:     108224 Mb o/192.168.10.9/RECO_CD_01_exacell01  
  25:     108224 Mb o/192.168.10.9/RECO_CD_02_exacell01  
  ...
 474:     108224 Mb o/192.168.10.22/RECO_CD_09_exacell14  
 475:     108224 Mb o/192.168.10.22/RECO_CD_10_exacell14  
 476:     108224 Mb o/192.168.10.22/RECO_CD_11_exacell14  

-----------------------------------------------------------------
ORACLE_SID ORACLE_HOME
=================================================================
  +ASM1 /u01/app/11.2.0.3/grid
  +ASM2 /u01/app/11.2.0.3/grid
  +ASM3 /u01/app/11.2.0.3/grid
  ...
  +ASM8 /u01/app/11.2.0.3/grid
$

Note that grid disks are prefixed with either DATA, RECO or DBFS_DG. Those are ASM disk group names in this environment. Each grid disk name ends with the storage cell name. It is also important to note that disks with the same prefix have the same size. The above example is from a full rack - hence 14 storage cells and 8 ASM instances.

ASM_DISKSTRING

In Exadata ASM_DISKSTRING='o/*/*'. That is suggesting to ASM that it is running on an Exadata compute node and to expect grid disks.

$ sqlplus / as sysasm
SQL> show parameter asm_diskstring
NAME           TYPE   VALUE
-------------- ------ -----
asm_diskstring string o/*/*

Automatic failgroups

There are no external redundancy disk groups in Exadata - you have a choice of either normal or high redundancy. When creating disk groups, ASM automatically puts all grid disks from the same storage cell into the same failgroup. The failgroup is then named after the storage cell.

This would be an example of creating a diskgroup in Exadata environment (note how that grid disk prefix comes in handy):

SQL> create diskgroup RECO
disk 'o/*/RECO*'
attribute
'COMPATIBLE.ASM'='11.2.0.0.0',
'COMPATIBLE.RDBMS'='11.2.0.0.0',
'CELL.SMART_SCAN_CAPABLE'='TRUE';

Once the disk group is created we can check the disk and failgroup names:

SQL> select name, failgroup, path from v$asm_disk_stat where name like 'RECO%';

NAME                 FAILGROUP PATH
-------------------- --------- -----------------------------------
RECO_CD_08_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_08_exacell01
RECO_CD_07_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_07_exacell01
RECO_CD_01_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_01_exacell01
...
RECO_CD_00_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_00_exacell02
RECO_CD_05_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_05_exacell02
RECO_CD_04_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_04_exacell02
...

SQL>

Note that we did not specify the failgroup names in the CREATE DISKGROUP statement. ASM has automatically put grid disks from the same storage cell in the same failgroup.

cellip.ora


The cellip.ora is the configuration file, on every database server, that tells ASM instances which cells are available to the cluster.

Here is a content of a typical cellip.ora file for a quarter rack system:

$ cat /etc/oracle/cell/network-config/cellip.ora
cell="192.168.10.3"
cell="192.168.10.4"
cell="192.168.10.5"

Now that we see what is in the cellip.ora, the grid disk path, in the examples above, should make more sense.

Disk group attributes

The following attributes and their values are recommended in Exadata environments:
  • COMPATIBLE.ASM - Should be set to the ASM software version in use.
  • COMPATIBLE.RDBMS - Should be set to the database software version in use.
  • CELL.SMART_SCAN_CAPABLE - Has be set to TRUE. This attribute/value is actually mandatory in Exadata.
  • AU_SIZE - Should be set to 4M. This is the default value in recent ASM versions for Exadata environments.
Initialization parameters

The following recommendations are for ASM version 11.2.0.3.


Parameter  Value 
CLUSTER_INTERCONNECTS Bondib0 IP address for X2-2. Colon delimited Bondib* IP addresses for X2-8.
ASM_POWER_LIMIT 1 for a quarter rack, 2 for all other racks.
SGA_TARGET 1250 MB
PGA_AGGREGATE_TARGET 400 MB
MEMORY_TARGET 0
MEMORY_MAX_TARGET 0
PROCESSES For less than 10 instances per node: 50*(#db instances per node + 1). For 10 0r more more instances per node: [50*MIN(#db instances per node + 1, 11)] + [10*MAX(#db instance per node - 10, 0)]
USE_LARGE_PAGES ONLY

Voting disks and disk group redundancy

Default location for voting disks in Exadata is ASM disk group DBFS_DG. That disk group can be either normal or high redundancy, except in a quarter rack where it has to be a normal redundancy.

This is because of the voting disks requirement for the minimal number of failgroups in a given ASM disk group. If we put voting disks in a normal redundancy disk group, that disk group has to have at least 3 failgroups. If we put voting disks in a high redundancy disk group, that disk group has to have at least 5 failgroups.

In a quarter rack, where we have only 3 storage cells, all disk groups can have at most 3 failgroups. While we can create a high redundancy disk group with 3 storage cells, voting disks cannot go into that disk group as it does not have 5 failgroups.

XDMG and XDWK background processes

These two process run in ASM instances on compute nodes. XDMG monitors all configured Exadata cells for storage state changes and performs the required tasks for such events. Its primary role is to watch for inaccessible disks and to initiate the disk online operations, when they become accessible again. Those operations are then handled by XDWK.

XDWK gets started when asynchronous actions such as disk ONLINE, DROP and ADD are requested by XDMG. After a 5 minute period of inactivity, this process will shut itself down.

Exadata Server, that runs on the storage cells, monitors disk health and performance. If the disk performance degrades it can put it into proactive failure mode. It also monitors for predictive failures based on the disk's SMART (Self-monitoring, Analysis and Reporting Technology) data. In both cases, the Exadata Server notifies XDMG to take those disks offline.

When a faulty disk is replacedf on the storage cell, the Exadata Server will recrate all grid disks on a new disk. It will then notify XDMG to bring those grid disks online or add them back to disk groups, in case they were already dropped.

The diskmon

The master diskmon process (diskmon.bin) can be seen running in all Grid Infrastructure installs, but it's only in Exadata that it's actually doing any work. On every compute node there will be one master diskmon process and one DSKM, slave diskmon process, per every Oracle instance (including ASM). Here is an example from one compute node:

# ps -ef | egrep "diskmon|dskm" | grep -v grep
oracle    3205     1  0 Mar16 ?        00:01:18 ora_dskm_ONE2
oracle   10755     1  0 Mar16 ?        00:32:19 /u01/app/11.2.0.3/grid/bin/diskmon.bin -d -f
oracle   17292     1  0 Mar16 ?        00:01:17 asm_dskm_+ASM2
oracle   24388     1  0 Mar28 ?        00:00:21 ora_dskm_TWO2
oracle   27962     1  0 Mar27 ?        00:00:24 ora_dskm_THREE2
#

In Exadata, the diskmon is responsible for
  • Handling of storage cell failures and I/O fencing
  • Monitoring of Exadata Server state on all storage cells in the cluster (heartbeat)
  • Broadcasting intra database IORM (I/O Resource Manager) plans from databases to storage cells
  • Monitoring or the control messages from database and ASM instances to storage cells
  • Communicating with other diskmons in the cluster

ACFS

The ACFS (ASM Cluster File System) is supported in Exadata environments staring with ASM version 12.1.0.2. Alternatives to the ACFS are the DBFS (Database based File System) and the NFS (Network File System). Many Exadata customers have an Oracle ZFS Appliance that can provide a high performance, InfiniBand connected, NFS storage.

Conclusion

There are quite a few extra features and differences in ASM compared to non-Exadata environments. Most of them are about storage cells and grid disks, and some are about tuning ASM for the extreme Exadata performance.