ASM Support Guy: December 2015

According to Wikipedia, data scrubbing is “an error correction technique that uses a background task to periodically inspect main memory or storage for errors, and then correct detected errors using redundant data in form of different checksums or copies of data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, leading to reduced risks of uncorrectable errors.”

Oracle documentation says that ASM disk scrubbing “improves availability and reliability by searching for data that may be less likely to be read. Disk scrubbing checks logical data corruptions and repairs them automatically in normal and high redundancy disks groups. The scrubbing process repairs logical corruptions using the mirror disks. Disk scrubbing can be combined with disk group rebalance to reduce I/O resources.” This feature is available in ASM version 12c.

Setup

First let's look at my setup. I have a single normal redundancy disk group DATA, with ASM filter driver (AFD) disks.

[grid@dbserver]$ sqlplus / as sysasm

SQL*Plus: Release 12.1.0.2.0 Production on Tue Dec 8 14:08:22 2015

...

SQL> select NAME, TYPE, STATE from v$asm_diskgroup_stat;

NAME TYPE STATE

------------ ------ -----------

DATA NORMAL MOUNTED

SQL> select DISK_NUMBER, PATH from v$asm_disk_stat;

DISK_NUMBER PATH

----------- ----------------

0 AFD:ASMDISK01

2 AFD:ASMDISK03

4 AFD:ASMDISK05

1 AFD:ASMDISK02

6 AFD:ASMDISK07

5 AFD:ASMDISK06

3 AFD:ASMDISK04

7 rows selected.

SQL>

And I have some datafiles in that disk group:

[grid@dbserver]$ asmcmd ls +DATA/BR/DATAFILE

SYSTEM.261.887497785

USERS.262.8874978313

UNDOTBS1.269.887497831

SYSAUX.270.887497739

T1.305.897911209

T2.306.897911479

T3.307.897911659

Scrubbing a file

We can scrub a disk group, an individual disk or an individual file. In this post I will demonstrate the feature by scrubbing a single file.

To scrub a file, we connect to an ASM instance and run the ALTER DISKGROUP SCRUB FILE command:

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair power high wait;

...

The REPAIR keyword is optional, in which case a corruption would be reported, but not repaired.

In the ASM alert log we see the time stamped command, and once the scrubbing finishes, we see the outcome, which is something like this:

Tue Dec 08 11:55:56 2015

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair power high wait

Tue Dec 08 11:55:56 2015

NOTE: Waiting for scrubbing to finish

Tue Dec 08 11:56:43 2015

NOTE: Scrubbing finished

Tue Dec 08 11:56:43 2015

SUCCESS: alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair power high wait

There were no corruptions in my file, so none were reported.

While the scrubbing was running, I saw two ASM processes doing the actual work:

[grid@dbserver]$ ps -ef | grep asm_sc

grid 17902 1 0 11:27 ? 00:00:00 asm_scrb_+ASM

grid 24365 1 0 11:49 ? 00:00:01 asm_scc0_+ASM

...

SCRB - ASM disk scrubbing master process that coordinates disk scrubbing operations.

SCCn - ASM disk scrubbing slave check process that performs the checking operations. The possible processes are SCC0-SCC9.

We would see additional processes during the repair operation:

SCRn - ASM disk scrubbing slave repair process that performs repair operation. The possible processes are SCR0-SCR9.

SCVn - ASM disk scrubbing slave verify process that performs the verifying operations. The possible processes are SCV0-SCV9.

Corrupted block found

To make this interesting, let's corrupt one block - say block 200 - in that datafile. First I used find_block.pl script, to locate both copies of the block:

[grid@dbserver ]$ $ORACLE_HOME/perl/bin/perl find_block.pl +DATA/BR/DATAFILE/t3.307.897911659 200

dd if=/dev/sdo1 bs=8192 count=1 skip=1460552 of=block_200.dd

dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200.dd

Then I used those dd commands to extract the blocks.

[root@dbserver]# dd if=/dev/sdo1 bs=8192 count=1 skip=1460552 of=block_200_primary.dd

[root@dbserver]# dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200_mirror.dd

Using diff command I verified both copies are exactly the same.

[root@dbserver]# diff block_200_primary.dd block_200_mirror.dd

[root@dbserver]#

Let's now corrupt the mirror block, which is not trivial with the ASM filter driver.

First create a text file, to use as a corruption. Sure, I could have just used /dev/zero for that.

[root@dbserver]# od -c block_200_mirror.dd > block_200_corrupt.txt

Shut down the database and ASM, then unload the ASM filter driver, otherwise I cannot dd into that disk.

[root@dbserver]# modprobe -r oracleafd

[root@dbserver]# lsmod | grep oracle

oracleacfs 3308260 0

oracleadvm 508030 0

oracleoks 506741 2 oracleacfs,oracleadvm

Now corrupt the mirror copy of block 200.

[root@dbserver]# dd if=block_200_corrupt.txt of=/dev/sdd1 bs=8192 count=1 seek=1462088

1+0 records in

1+0 records out

8192 bytes (8.2 kB) copied, 0.00160027 s, 5.1 MB/s

Confirm the mirror block is corrupt, by reading the content again.

[root@dbserver]# dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200_corr.dd

1+0 records in

1+0 records out

8192 bytes (8.2 kB) copied, 0.00152298 s, 5.4 MB/s

[root@dbserver]# diff block_200_primary.dd block_200_corr.dd

Binary files block_200_primary.dd and block_200_corr.dd differ

[root@dbserver]#

As we can see, the content of the mirror block is now different.

Scrub a file

Load the ASM filter driver back.

[root@dbserver]# insmod /lib/modules/.../oracleafd.ko

And restart the ASM instance.

We can now scrub the datafile.

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' wait;

...

The ASM alert log shows that a corrupted block was found.

Tue Dec 08 13:25:48 2015

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' wait

Starting background process SCRB

Tue Dec 08 13:25:48 2015

SCRB started with pid=25, OS id=4585

Tue Dec 08 13:25:48 2015

NOTE: Waiting for scrubbing to finish

Tue Dec 08 13:25:49 2015

NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659

Tue Dec 08 13:25:50 2015

NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659

Tue Dec 08 13:26:39 2015

NOTE: Scrubbing finished

Tue Dec 08 13:26:39 2015

SUCCESS: alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' wait

In the background dump destination we see newly generated trace files.

[grid@dbserver]$ ls -lrt | tail

-rw-r-----. 1 grid oinstall 36887 Dec 8 13:25 +ASM_scc0_4587.trc

-rw-r-----. 1 grid oinstall 36967 Dec 8 13:25 +ASM_scv0_4592.trc

-rw-r-----. 1 grid oinstall 5088 Dec 8 13:25 +ASM_gmon_16705.trc

-rw-r-----. 1 grid oinstall 36965 Dec 8 13:25 +ASM_scv1_4599.trc

-rw-r-----. 1 grid oinstall 551218 Dec 8 13:26 alert_+ASM.log

And if we have a closer look, in, say the SCC0 trace file, we see the scrubbing report.

[grid@dbserver]$ more ./+ASM_scc0_4587.trc

Trace file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_scc0_4587.trc

Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

...

HARD VIOLATION(S) DETECTED!

CORRUPTED BLOCK INFORMATION:

memory address: 0x7f13531fb000

block size: 8192

block number: 200 [0xc8]

block offset in this write request of 1 block(s): 0

file type: datafile (type value: 2)

...

HARD check failed at block 200

One way to get a quick scrubbing report would be to grep the trace files for the relevant key words.

[grid@dbserver]$ egrep -i "corrupted|block number|failed" *sc*trc

+ASM_scc0_4587.trc:CORRUPTED BLOCK INFORMATION:

+ASM_scc0_4587.trc: block number: 200 [0xc8]

+ASM_scc0_4587.trc: Magic number Block size Block number Checksum Head-tail match

+ASM_scc0_4587.trc:FAILED CHECK(S):

+ASM_scc0_4587.trc: MAGIC NUMBER CHECK FAILED

+ASM_scc0_4587.trc: BLOCK SIZE CHECK FAILED

+ASM_scc0_4587.trc: BLOCK NUMBER CHECK FAILED

+ASM_scc0_4587.trc: HEAD-TAIL MATCH CHECK FAILED

+ASM_scc0_4587.trc:HARD check failed at block 200

+ASM_scv0_4592.trc:CORRUPTED BLOCK INFORMATION:

+ASM_scv0_4592.trc: block number: 200 [0xc8]

+ASM_scv0_4592.trc: Magic number Block size Block number Checksum Head-tail match

+ASM_scv0_4592.trc:FAILED CHECK(S):

+ASM_scv0_4592.trc: MAGIC NUMBER CHECK FAILED

+ASM_scv0_4592.trc: BLOCK SIZE CHECK FAILED

+ASM_scv0_4592.trc: BLOCK NUMBER CHECK FAILED

+ASM_scv0_4592.trc: HEAD-TAIL MATCH CHECK FAILED

+ASM_scv0_4592.trc:HARD check failed at block 200

+ASM_scv0_4592.trc:NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659

+ASM_scv1_4599.trc:CORRUPTED BLOCK INFORMATION:

+ASM_scv1_4599.trc: block number: 200 [0xc8]

+ASM_scv1_4599.trc: Magic number Block size Block number Checksum Head-tail match

+ASM_scv1_4599.trc:FAILED CHECK(S):

+ASM_scv1_4599.trc: MAGIC NUMBER CHECK FAILED

+ASM_scv1_4599.trc: BLOCK SIZE CHECK FAILED

+ASM_scv1_4599.trc: BLOCK NUMBER CHECK FAILED

+ASM_scv1_4599.trc: HEAD-TAIL MATCH CHECK FAILED

+ASM_scv1_4599.trc:HARD check failed at block 200

+ASM_scv1_4599.trc:NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659

[grid@dbserver]$

Fix the corruption

But the corruption wasn't repaired, since I didn't ask for that. Let's fix it - sorry, scrub it.

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair wait;

...

This time the ASM alert log shows the block was scrubbed.

Tue Dec 08 13:35:02 2015

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair wait

Tue Dec 08 13:35:02 2015

NOTE: Waiting for scrubbing to finish

Tue Dec 08 13:35:03 2015

NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659

Tue Dec 08 13:35:04 2015

NOTE: Scrubbing block 200 in file 307.897911659 in slave

NOTE: Successfully scrubbed block 200 in file 307.897911659

Tue Dec 08 13:35:53 2015

NOTE: Scrubbing finished

Tue Dec 08 13:35:53 2015

SUCCESS: alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair wait

Let's check, by reading the block back and comparing it to the primary copy.

[root@dbserver ~]# dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200_after_scrub_repair.dd

1+0 records in

1+0 records out

8192 bytes (8.2 kB) copied, 0.000856456 s, 9.6 MB/s

[root@dbserver]# diff block_200_primary.dd block_200_after_scrub_repair.dd

[root@dbserver]#

This shows that the block was successfully repaired.

Conclusion

ASM data scrubbing can detect and repair Oracle data blocks that are affected by media or logical corruptions. It can also correct Oracle data blocks written to incorrect location, or overwritten by an external, non-Oracle process (like in my example).

December 8, 2015

ASM data scrubbing