December 8, 2015

ASM data scrubbing

According to Wikipedia, data scrubbing is “an error correction technique that uses a background task to periodically inspect main memory or storage for errors, and then correct detected errors using redundant data in form of different checksums or copies of data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, leading to reduced risks of uncorrectable errors.”

Oracle documentation says that ASM disk scrubbing “improves availability and reliability by searching for data that may be less likely to be read. Disk scrubbing checks logical data corruptions and repairs them automatically in normal and high redundancy disks groups. The scrubbing process repairs logical corruptions using the mirror disks. Disk scrubbing can be combined with disk group rebalance to reduce I/O resources.” This feature is available in ASM version 12c.

Setup

First let's look at my setup. I have a single normal redundancy disk group DATA, with ASM filter driver (AFD) disks.

[grid@dbserver]$ sqlplus / as sysasm
SQL*Plus: Release 12.1.0.2.0 Production on Tue Dec 8 14:08:22 2015
...

SQL> select NAME, TYPE, STATE from v$asm_diskgroup_stat;

NAME         TYPE   STATE
------------ ------ -----------
DATA         NORMAL MOUNTED

SQL> select DISK_NUMBER, PATH from v$asm_disk_stat;

DISK_NUMBER PATH
----------- ----------------
        0 AFD:ASMDISK01
        2 AFD:ASMDISK03
        4 AFD:ASMDISK05
        1 AFD:ASMDISK02
        6 AFD:ASMDISK07
        5 AFD:ASMDISK06
        3 AFD:ASMDISK04

7 rows selected.

SQL>

And I have some datafiles in that disk group:

[grid@dbserver]$ asmcmd ls +DATA/BR/DATAFILE
SYSTEM.261.887497785
USERS.262.8874978313
UNDOTBS1.269.887497831
SYSAUX.270.887497739
T1.305.897911209
T2.306.897911479
T3.307.897911659

Scrubbing a file

We can scrub a disk group, an individual disk or an individual file. In this post I will demonstrate the feature by scrubbing a single file.

To scrub a file, we connect to an ASM instance and run the ALTER DISKGROUP SCRUB FILE command:

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair power high wait;
...

The REPAIR keyword is optional, in which case a corruption would be reported, but not repaired.

In the ASM alert log we see the time stamped command, and once the scrubbing finishes, we see the outcome, which is something like this:

Tue Dec 08 11:55:56 2015
SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair power high wait
Tue Dec 08 11:55:56 2015
NOTE: Waiting for scrubbing to finish
Tue Dec 08 11:56:43 2015
NOTE: Scrubbing finished
Tue Dec 08 11:56:43 2015
SUCCESS: alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair power high wait

There were no corruptions in my file, so none were reported.

While the scrubbing was running, I saw two ASM processes doing the actual work:

[grid@dbserver]$ ps -ef | grep asm_sc
grid     17902     1  0 11:27 ?        00:00:00 asm_scrb_+ASM
grid     24365     1  0 11:49 ?        00:00:01 asm_scc0_+ASM
...

SCRB - ASM disk scrubbing master process that coordinates disk scrubbing operations.

SCCn - ASM disk scrubbing slave check process that performs the checking operations. The possible processes are SCC0-SCC9.
We would see additional processes during the repair operation:

SCRn - ASM disk scrubbing slave repair process that performs repair operation. The possible processes are SCR0-SCR9.
  
SCVn - ASM disk scrubbing slave verify process that performs the verifying operations. The possible processes are SCV0-SCV9.

Corrupted block found

To make this interesting, let's corrupt one block - say block 200 - in that datafile. First I used find_block.pl script, to locate both copies of the block:

[grid@dbserver ]$ $ORACLE_HOME/perl/bin/perl find_block.pl +DATA/BR/DATAFILE/t3.307.897911659 200
dd if=/dev/sdo1 bs=8192 count=1 skip=1460552 of=block_200.dd
dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200.dd

Then I used those dd commands to extract the blocks.

[root@dbserver]# dd if=/dev/sdo1 bs=8192 count=1 skip=1460552 of=block_200_primary.dd
[root@dbserver]# dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200_mirror.dd
Using diff command I verified both copies are exactly the same.

[root@dbserver]# diff block_200_primary.dd block_200_mirror.dd
[root@dbserver]#

Let's now corrupt the mirror block, which is not trivial with the ASM filter driver.

First create a text file, to use as a corruption. Sure, I could have just used /dev/zero for that.

[root@dbserver]# od -c block_200_mirror.dd > block_200_corrupt.txt

Shut down the database and ASM, then unload the ASM filter driver, otherwise I cannot dd into that disk.

[root@dbserver]# modprobe -r oracleafd
[root@dbserver]# lsmod | grep oracle
oracleacfs           3308260  0
oracleadvm            508030  0
oracleoks             506741  2 oracleacfs,oracleadvm

Now corrupt the mirror copy of block 200.

[root@dbserver]# dd if=block_200_corrupt.txt of=/dev/sdd1 bs=8192 count=1 seek=1462088
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.00160027 s, 5.1 MB/s

Confirm the mirror block is corrupt, by reading the content again.

[root@dbserver]# dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200_corr.dd
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.00152298 s, 5.4 MB/s

[root@dbserver]# diff block_200_primary.dd  block_200_corr.dd
Binary files block_200_primary.dd and block_200_corr.dd differ
[root@dbserver]#

As we can see, the content of the mirror block is now different.

Scrub a file

Load the ASM filter driver back.

[root@dbserver]# insmod /lib/modules/.../oracleafd.ko

And restart the ASM instance.

We can now scrub the datafile.

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' wait;
...

The ASM alert log shows that a corrupted block was found.

Tue Dec 08 13:25:48 2015
SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' wait
Starting background process SCRB
Tue Dec 08 13:25:48 2015
SCRB started with pid=25, OS id=4585
Tue Dec 08 13:25:48 2015
NOTE: Waiting for scrubbing to finish
Tue Dec 08 13:25:49 2015
NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659
Tue Dec 08 13:25:50 2015
NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659
Tue Dec 08 13:26:39 2015
NOTE: Scrubbing finished
Tue Dec 08 13:26:39 2015
SUCCESS: alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' wait

In the background dump destination we see newly generated trace files.

[grid@dbserver]$ ls -lrt | tail
-rw-r-----. 1 grid oinstall  36887 Dec  8 13:25 +ASM_scc0_4587.trc
-rw-r-----. 1 grid oinstall  36967 Dec  8 13:25 +ASM_scv0_4592.trc
-rw-r-----. 1 grid oinstall   5088 Dec  8 13:25 +ASM_gmon_16705.trc
-rw-r-----. 1 grid oinstall  36965 Dec  8 13:25 +ASM_scv1_4599.trc
-rw-r-----. 1 grid oinstall 551218 Dec  8 13:26 alert_+ASM.log

And if we have a closer look, in, say the SCC0 trace file, we see the scrubbing report.

[grid@dbserver]$ more ./+ASM_scc0_4587.trc
Trace file /u01/app/grid/diag/asm/+asm/+ASM/trace/+ASM_scc0_4587.trc
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
...
HARD VIOLATION(S) DETECTED!
CORRUPTED BLOCK INFORMATION:
memory address: 0x7f13531fb000
block size: 8192
block number: 200 [0xc8]
block offset in this write request of 1 block(s): 0
file type: datafile (type value: 2)
...
HARD check failed at block 200
One way to get a quick scrubbing report would be to grep the trace files for the relevant key words.

[grid@dbserver]$ egrep -i "corrupted|block number|failed" *sc*trc
+ASM_scc0_4587.trc:CORRUPTED BLOCK INFORMATION:
+ASM_scc0_4587.trc:  block number: 200 [0xc8]
+ASM_scc0_4587.trc:  Magic number  Block size  Block number  Checksum  Head-tail match
+ASM_scc0_4587.trc:FAILED CHECK(S):
+ASM_scc0_4587.trc:  MAGIC NUMBER CHECK FAILED
+ASM_scc0_4587.trc:  BLOCK SIZE CHECK FAILED
+ASM_scc0_4587.trc:  BLOCK NUMBER CHECK FAILED
+ASM_scc0_4587.trc:  HEAD-TAIL MATCH CHECK FAILED
+ASM_scc0_4587.trc:HARD check failed at block 200

+ASM_scv0_4592.trc:CORRUPTED BLOCK INFORMATION:
+ASM_scv0_4592.trc:  block number: 200 [0xc8]
+ASM_scv0_4592.trc:  Magic number  Block size  Block number  Checksum  Head-tail match
+ASM_scv0_4592.trc:FAILED CHECK(S):
+ASM_scv0_4592.trc:  MAGIC NUMBER CHECK FAILED
+ASM_scv0_4592.trc:  BLOCK SIZE CHECK FAILED
+ASM_scv0_4592.trc:  BLOCK NUMBER CHECK FAILED
+ASM_scv0_4592.trc:  HEAD-TAIL MATCH CHECK FAILED
+ASM_scv0_4592.trc:HARD check failed at block 200
+ASM_scv0_4592.trc:NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659

+ASM_scv1_4599.trc:CORRUPTED BLOCK INFORMATION:
+ASM_scv1_4599.trc:  block number: 200 [0xc8]
+ASM_scv1_4599.trc:  Magic number  Block size  Block number  Checksum  Head-tail match
+ASM_scv1_4599.trc:FAILED CHECK(S):
+ASM_scv1_4599.trc:  MAGIC NUMBER CHECK FAILED
+ASM_scv1_4599.trc:  BLOCK SIZE CHECK FAILED
+ASM_scv1_4599.trc:  BLOCK NUMBER CHECK FAILED
+ASM_scv1_4599.trc:  HEAD-TAIL MATCH CHECK FAILED
+ASM_scv1_4599.trc:HARD check failed at block 200
+ASM_scv1_4599.trc:NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659
[grid@dbserver]$

Fix the corruption

But the corruption wasn't repaired, since I didn't ask for that. Let's fix it - sorry, scrub it.

SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair wait;
...

This time the ASM alert log shows the block was scrubbed.

Tue Dec 08 13:35:02 2015
SQL> alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair wait
Tue Dec 08 13:35:02 2015
NOTE: Waiting for scrubbing to finish
Tue Dec 08 13:35:03 2015
NOTE: Corrupted block 200 found in file +DATA/BR/DATAFILE/t3.307.897911659
Tue Dec 08 13:35:04 2015
NOTE: Scrubbing block 200 in file 307.897911659 in slave
NOTE: Successfully scrubbed block 200 in file 307.897911659
Tue Dec 08 13:35:53 2015
NOTE: Scrubbing finished
Tue Dec 08 13:35:53 2015
SUCCESS: alter diskgroup DATA scrub file '+DATA/BR/DATAFILE/t3.307.897911659' repair wait

Let's check, by reading the block back and comparing it to the primary copy.

[root@dbserver ~]# dd if=/dev/sdd1 bs=8192 count=1 skip=1462088 of=block_200_after_scrub_repair.dd
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000856456 s, 9.6 MB/s

[root@dbserver]# diff block_200_primary.dd block_200_after_scrub_repair.dd
[root@dbserver]#

This shows that the block was successfully repaired.

Conclusion

ASM data scrubbing can detect and repair Oracle data blocks that are affected by media or logical corruptions. It can also correct Oracle data blocks written to incorrect location, or overwritten by an external, non-Oracle process (like in my example).