The views expressed on this blog are my own and do not necessarily reflect the views of Oracle

October 5, 2011

Offline or drop?


When an ASM disk becomes unavailable, ASM drops it from the disk group, right? Well, that depends on the ASM version and on the disk group redundancy. An external redundancy disk group would be simply dismounted, so we will focus on the normal and high redundancy disk groups.

In ASM version 10g, the disk would simply be dropped. Starting with 11gR1, a disk that becomes unavailable is first taken offline and the disk repair timer starts. If the timer exceeds the value of the DISK_REPAIR_TIME disk group attribute, the disk is dropped from the disk group. If the disk becomes available and is brought back online before the timer expires, it is not dropped. But how would ASM learn that the disk is available, and who would put the disk online?

Unavailable

A disk is considered unavailable if it cannot be read from or written to, by ASM or an ASM client. A database is a typical but not the only ASM client.

A disk can become unavailable for many reasons: a faulty SCSI cable (for local disks), a SAN network or switch issue (for SAN based storage), an NFS server problem (for NFS based disks), a site failure (in a stretched cluster setup), a disk failure, etc. Whatever the reason, ASM and/or the ASM client would report an I/O error and ASM would take action.

Drop

In version 10g, ASM immediately drops a disk that becomes unavailable. That triggers a rebalance operation that attempts to restore the data redundancy. Once the rebalance finishes, the data redundancy is fully restored and the disk is expelled from the disk group. Once the problem is resolved, the disk can be added back into the disk group with an alter diskgroup command like this:

alter diskgroup DATA add disk 'ORCL:DISK077';

That will again trigger a rebalance and once it finishes, the disk will again be part of the disk group.
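To see whether such a rebalance is still running, one option is to query V$ASM_OPERATION in the ASM instance. This is only a minimal sketch; the choice of columns is mine and the estimates are approximate.

```sql
-- Run in the ASM instance.
-- No rows returned means no rebalance (or other long-running
-- ASM operation) is currently in progress.
select group_number, operation, state, power,
       sofar, est_work, est_minutes
from v$asm_operation;
```

EST_MINUTES is an estimate that ASM revises as the operation progresses, so treat it as a rough guide rather than a deadline.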

But what if multiple disks fail at the same time? What if one disk fails, the rebalance starts and then another disk fails? The outcome depends on the disk group redundancy, on whether the disks are from the same or different failgroups, and on whether the disks are partners or not.

In a normal redundancy disk group, ASM can tolerate one or more (including all) disks becoming unavailable if they are all from the same failgroup. If disks from different failgroups become unavailable, ASM will tolerate it as long as the disks are not partners. By tolerate I mean the disk group will stay online and there will be no interruption for ASM clients.

In a high redundancy disk group, ASM can tolerate one or more (including all) disks becoming unavailable if they all come from no more than two failgroups. For disks from more than two failgroups, the same partnership rule applies. Basically, ASM will tolerate unavailability of any number of disks as long as they are not partners.
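A quick way to see how unavailable disks are spread across failgroups is to count them per failgroup. The query below is a sketch that uses only V$ASM_DISK and V$ASM_DISKGROUP. It does not show disk partnerships, so it can confirm the "all from the same failgroup" case but not the partner rule.

```sql
-- Run in the ASM instance.
-- One row per disk group and failgroup, with the number of
-- member disks and how many of them are currently offline.
select g.name      as diskgroup,
       g.type      as redundancy,
       d.failgroup as failgroup,
       count(*)    as disks,
       sum(case when d.mode_status = 'OFFLINE' then 1 else 0 end) as offline_disks
from v$asm_disk d
join v$asm_diskgroup g on g.group_number = d.group_number
group by g.name, g.type, d.failgroup
order by g.name, d.failgroup;
```

If offline disks show up in more than one failgroup of a normal redundancy disk group, whether the disk group survives depends on partnership, which this query cannot tell you.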

Offline

So when a disk is dropped, the whole disk group needs to be rebalanced and that takes time. During that time other disks can fail, increasing the risk of data loss. For that reason, fast disk resync(hronization) was introduced in 11gR1. Instead of dropping the disk when it becomes unavailable, ASM simply takes the disk offline. The idea here is that the ASM administrator will be notified and the disk issue resolved before the disk repair timer expires.

The default value for the disk repair timer is 3.6 hours. That can be adjusted, to say 12 hours, with an alter diskgroup command, like this:

alter diskgroup DATA set attribute 'DISK_REPAIR_TIME' = '12h';
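To confirm the setting, the disk group attributes can be read from V$ASM_ATTRIBUTE. A small sketch follows; I have included the two compatibility attributes as well, on the assumption that you will want to check that the disk group is at 11.1 or later, which the disk repair timer needs.

```sql
-- Run in the ASM instance.
-- Attribute names are stored in lower case in V$ASM_ATTRIBUTE.
select g.name as diskgroup,
       a.name as attribute,
       a.value
from v$asm_attribute a
join v$asm_diskgroup g on g.group_number = a.group_number
where a.name in ('disk_repair_time', 'compatible.asm', 'compatible.rdbms')
order by g.name, a.name;
```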

While the disk is offline, ASM keeps track of the changes that would have gone to the offlined disk. If the disk is made available and put back online before the timer expires, ASM applies the outstanding changes and no rebalance is required. That is called the fast disk resync.

If the issue that made the disk unavailable is not resolved and the disk repair timer expires, the disk is dropped from the disk group.
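To see how much time is left before an offline disk is dropped, you can look at the REPAIR_TIMER column of V$ASM_DISK, which reports the seconds remaining. A minimal sketch:

```sql
-- Run in the ASM instance (11gR1 or later).
-- REPAIR_TIMER is the number of seconds left before ASM
-- drops the offline disk from the disk group.
select name, failgroup, mode_status, state, repair_timer
from v$asm_disk
where mode_status = 'OFFLINE';
```

No rows returned means no disk is currently offline.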

Online

So a system engineer or an ASM administrator fixes the issue that caused the disk to become unavailable. Say they replace the faulty cable. But who puts the disk online, and how? Can that be automated?

Again it depends. If you are on Exadata or Oracle Database Appliance, the disk is put back online automatically. In all other cases an ASM administrator has to put the disk online with an alter diskgroup command, like this:

alter diskgroup DATA online disk DISK077;

or

alter diskgroup DATA online all;
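After the online command, a quick check at the disk group level shows whether any disks are still offline. This is only a sketch of one way to verify the result:

```sql
-- Run in the ASM instance.
-- OFFLINE_DISKS should be 0 for a disk group once all of its
-- disks are back online.
select name, type, state, offline_disks
from v$asm_diskgroup
order by name;
```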

Conclusion


It is always a good idea to understand what happens in a fault scenario, what your version of ASM can and cannot do, and what level of protection your disk group redundancy gives you.