Published by Darrell Green. Modified over 8 years ago.
1
Recovery Check FAQs
Ibrahim Shamel
EMC CONFIDENTIAL - INTERNAL USE ONLY
2
Agenda
– What is SMLink Failure
– What is SMLink_Check Tool
– What is Recovery Check
– FAQs
– Q&A
© Copyright 2014 EMC Corporation. All rights reserved.
3
What is SMLink Failure
SML stands for “Slice Manager Link”.
In R31 it was possible for Slice Manager Link corruption to be present in Pool LUNs. This corruption is found in the metadata of Pool LUNs, not in the actual data itself.
R32 has stricter validation of Pool LUNs built in. As a result, after an upgrade to R32, Pool LUNs affected by the SMLink issue will go offline.
4
What is SMLink Failure
From KB 16126:
When planning a non-disruptive upgrade (NDU) from VNX OE 05.31 to VNX OE 05.32, it is important to run Recoverycheck on all pools if the array has, at any point in its time of operation, run a version of 05.31 prior to 05.31.502 (Franklin) with FAST VP (auto-tiering) enabled.
There is no need to run Recoverycheck on arrays that have never run a version earlier than 05.31.502, or on arrays that ran versions earlier than 05.31.502 but never ran FAST VP (auto-tiering).
VNX OE 05.32 adds a file system check for link corruption, so there is a chance that VP LUNs are taken offline, causing a data unavailable (DU) condition, after a non-disruptive upgrade (NDU) to VNX OE 05.32.
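The KB rule above can be sketched in a few lines of Python. This is only an illustration, not the logic of any EMC tool: the function names are hypothetical, and it assumes that a revision string such as 05.31.000.5.502 corresponds to "05.31.502", that is, that the last field is the build number. It also simplifies "FAST VP enabled at some point" to a single flag.

```python
from typing import List, Tuple

FIXED_BUILD = 502  # 05.31.502 (Franklin), taken from the KB text


def parse_release(revision: str) -> Tuple[int, int, int]:
    """Split '05.31.000.5.502' into (major, minor, build).

    Assumption: the last dotted field is the build number.
    """
    parts = revision.split(".")
    return int(parts[0]), int(parts[1]), int(parts[-1])


def ran_affected_release(code_history: List[str]) -> bool:
    """True if any revision in the history is 05.31 earlier than build 502."""
    for revision in code_history:
        major, minor, build = parse_release(revision)
        if (major, minor) == (5, 31) and build < FIXED_BUILD:
            return True
    return False


def recoverycheck_needed(code_history: List[str], fast_vp_ever_enabled: bool) -> bool:
    """Both conditions from the KB must hold for Recoverycheck to be required."""
    return fast_vp_ever_enabled and ran_affected_release(code_history)


if __name__ == "__main__":
    history = ["05.31.000.5.008", "05.31.000.5.704", "05.31.000.5.709"]
    print(recoverycheck_needed(history, fast_vp_ever_enabled=True))
```

An array whose history starts at 05.31.000.5.502, or one that never ran FAST VP, comes back as not requiring Recoverycheck under these assumptions.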
5
What is SMLink_Check Tool
SMLink_Check checks whether the array ever ran a Release 31 version with Autotiering (FAST VP), and whether a pool was ever created while the array was running an affected release of code.
SMLink_Check does not find corruption. It only attempts to categorize whether or not a pool can be affected by the SMLink footprint; it does no data validation.
See emc308955 to download SMLink_Check.
6
What is SMLink_Check Tool
If SMLink_Check advises that certain pools need to be reviewed, a Recovery Check activity is needed before we can proceed with an upgrade to R32.
What SMLink_Check does:
– Verifies if the array was ever running Elias code
– Verifies if the array has Autotiering installed
– Verifies if the pool(s) were created while the array was running Elias
If a pool matches the above rules, the pool is marked vulnerable to the issue and requires Recoverycheck.
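The three rules above amount to a per-pool filter. The following is a minimal sketch of that idea, not the actual SMLink_Check implementation: the Pool type, the creation timestamps, and the function name are all hypothetical, and "created while running Elias" is modelled as "created before the post-Elias code was installed".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Pool:
    pool_id: int
    name: str
    created: datetime  # hypothetical field; the slides do not show creation times


def pools_needing_recoverycheck(
    ran_elias: bool,
    autotiering_installed: bool,
    post_elias_installed: Optional[datetime],
    pools: List[Pool],
) -> List[Pool]:
    """Return the pools that match all three rules."""
    if not (ran_elias and autotiering_installed):
        return []
    if post_elias_installed is None:
        # Array is still on Elias code, so every pool was created under it.
        return list(pools)
    return [p for p in pools if p.created < post_elias_installed]


if __name__ == "__main__":
    upgrade = datetime(2012, 1, 24, 1, 56, 3)
    pools = [
        Pool(0, "Pool_A", datetime(2011, 9, 1)),  # created under Elias
        Pool(1, "Pool_B", datetime(2012, 3, 1)),  # created after the upgrade
    ]
    for pool in pools_needing_recoverycheck(True, True, upgrade, pools):
        print(hex(pool.pool_id), pool.name)
```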
7
SMLink_Check Output - Pass
Welcome to the VNX SMLINK_check program
SMLINK_check version 1.0.6
Program start : 12/02/2012 20:08:37
SP information: FCN00113900048 05.31.000.5.709
This array was running code : 05.31.000.5.502 on : 2011/09/30 07:05:22
This array was running code : 05.31.000.5.509 on : 2011/12/14 13:38:27
This array was running code : 05.31.000.5.704 on : 2012/02/10 14:58:14
This array was running code : 05.31.000.5.709 on : 2012/03/21 14:54:12
This array is NOT potentially affected. Exiting...

SMLink check tool completed and found no issues. It is OK to proceed with R31 to R32 NDU.
8
SMLink_Check Output - Fail
Welcome to the VNX SMLink_check program
SMLink_check version 1.0.6
Program start : 12/03/2012 12:33:18
SP information: APM00113003040 05.31.000.5.709
This array was running code : 05.31.000.5.008 on : 2011/08/02 00:09:42
This array was running code : 05.31.000.5.012 on : 2011/08/25 18:51:45
This array was running code : 05.31.000.5.704 on : 2012/01/24 01:56:03
This array was running code : 05.31.000.5.709 on : 2012/04/20 22:39:10
This array is running FAST Virtual Provisioning
This array has one or more pools configured
Post Elias Code installed on: 2012/01/24 01:56:03
Pool ID 0x0 with name GP_FAST_R5_Pool_3 was created under ELIAS code
Pool ID 0x1 with name GP_FAST_R5_Pool_4 was created under ELIAS code
Pool ID 0x2 with name Hi_Perf_R5_Pool_5 was created under ELIAS code
Pool ID 0x3 with name Hi_Perf_R10_Pool_6 was created under ELIAS code
The following Pool ID's need a recovery check, please follow EMC306064(EMC Internal only) or escalate to EMC
Pool ID 0x0 with name GP_FAST_R5_Pool_3
Pool ID 0x1 with name GP_FAST_R5_Pool_4
Pool ID 0x2 with name Hi_Perf_R5_Pool_5
Pool ID 0x3 with name Hi_Perf_R10_Pool_6
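When several arrays are being reviewed, the list of pools flagged in a saved SMLink_Check output can be pulled out with a short script. The sketch below assumes the output keeps the wording shown in the Pass and Fail samples (version 1.0.6); the function name is hypothetical and other tool versions may word things differently.

```python
import re
from typing import List, Tuple

# Wording taken from the sample Fail output.
MARKER = "need a recovery check"
POOL_RE = re.compile(r"Pool ID (0x[0-9a-fA-F]+) with name (\S+)")


def pools_to_check(output: str) -> List[Tuple[int, str]]:
    """Return (pool_id, pool_name) pairs listed after the marker line.

    Returns an empty list when the marker is absent, as in the Pass sample.
    """
    _, found, tail = output.partition(MARKER)
    if not found:
        return []
    return [(int(pool_id, 16), name) for pool_id, name in POOL_RE.findall(tail)]


SAMPLE = """\
Pool ID 0x0 with name GP_FAST_R5_Pool_3 was created under ELIAS code
Pool ID 0x1 with name GP_FAST_R5_Pool_4 was created under ELIAS code
The following Pool ID's need a recovery check, please follow EMC306064(EMC Internal only) or escalate to EMC
Pool ID 0x0 with name GP_FAST_R5_Pool_3
Pool ID 0x1 with name GP_FAST_R5_Pool_4
"""

if __name__ == "__main__":
    for pool_id, name in pools_to_check(SAMPLE):
        print(hex(pool_id), name)
```

Only the lines after the marker are scanned, so the earlier "was created under ELIAS code" lines are not counted twice.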
9
What is Recovery Check
Recoverycheck is a detailed tool from Escalation Engineering.
The Recovery Check activity checks for possible corruption of the SMLink.
Recoverycheck is a read-only tool and does not make any changes.
It is advisable to run Recoverycheck at times of lower I/O to prevent false positives.
Questions have been asked about what this means. Realistically, it means a period that is not full production. If this cannot be avoided, that is OK but not ideal, as it may require many additional Recovery Check attempts.
10
What is Recovery Check
More info on the Recovery Check:
Recovery Check is a read-only tool. It does not make any modifications to the storage system or the pools.
Before running the Recovery Check, we need to:
– Make sure there is no or very minimal I/O on the array
– Disable FAST Cache
– Disable Autotiering
– Disable Compression
After running the Recovery Check tool, FAST Cache and Compression are re-enabled. Autotiering needs to be left disabled until the upgrade takes place.
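The before and after feature states can be written down as a small model, which helps when recording what has to be restored. This is a sketch only: the type and function names are hypothetical, no array commands are issued, and it assumes that "re-enabled" means restoring FAST Cache and Compression to whatever state they had before the check.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Features:
    fast_cache: bool
    autotiering: bool
    compression: bool


def state_for_check() -> Features:
    """All three data-moving features are disabled while Recovery Check runs."""
    return Features(fast_cache=False, autotiering=False, compression=False)


def state_after_check(original: Features) -> Features:
    """Restore FAST Cache and Compression; keep Autotiering off until the upgrade."""
    return replace(original, autotiering=False)


if __name__ == "__main__":
    original = Features(fast_cache=True, autotiering=True, compression=False)
    print(state_for_check())
    print(state_after_check(original))
```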
11
FAQs
Frequently Asked Questions
How long does running Recovery Check take?
– The answer is that it varies. When running Recovery Check you will need to generate SAT files from each SP, which takes anywhere from 2 to 15 minutes (both can be run at the same time). Once this is complete, the actual run of the Recovery Check takes between 5 and 30 minutes per pool, depending on the configuration of the array and how many pool LUNs there are.
– FAST Cache, if enabled, increases the time to perform the Recovery Check, since we need to destage it first.
– As a rule of thumb:
  Recovery Check with FAST Cache: 6 hours
  Recovery Check without FAST Cache: 3 hours
– You can determine whether FAST Cache is enabled from the RCM TRiiAGE:
  Array Serial Number: FNM00104100030
  Array Model: VNX5500 ( BLOCK )
  Array Software Revision: 05.32.000.5.207 05.32.000.5.207
  IP Address: 10.241.164.136 10.241.164.137
  SP Uptime: 10 days 04:25:05 10 days 04:35:44
  EFD/FAST Cache Feature Enabled: Yes
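The timing figures above can be turned into a rough planning estimate. The sketch below is illustrative only: the function names are hypothetical, the run-time range covers just SAT generation plus the per-pool run, and the TRiiAGE parsing assumes the "EFD/FAST Cache Feature Enabled" line appears exactly as in the sample.

```python
from typing import Optional, Tuple


def run_time_range_minutes(pool_count: int) -> Tuple[int, int]:
    """Best and worst case: SAT generation (2-15 min, both SPs in parallel)
    plus 5-30 minutes per pool."""
    return 2 + 5 * pool_count, 15 + 30 * pool_count


def fast_cache_enabled(triage_text: str) -> Optional[bool]:
    """Read the FAST Cache line from TRiiAGE output; None if the line is missing."""
    for line in triage_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "EFD/FAST Cache Feature Enabled":
            return value.strip().lower() == "yes"
    return None


def rule_of_thumb_hours(fast_cache: bool) -> int:
    """Overall window from the slide: 6 hours with FAST Cache, 3 without."""
    return 6 if fast_cache else 3


TRIAGE_SAMPLE = """\
Array Serial Number: FNM00104100030
Array Model: VNX5500 ( BLOCK )
SP Uptime: 10 days 04:25:05 10 days 04:35:44
EFD/FAST Cache Feature Enabled: Yes
"""

if __name__ == "__main__":
    print(run_time_range_minutes(4))  # (22, 135)
    enabled = fast_cache_enabled(TRIAGE_SAMPLE)
    if enabled is not None:
        print(rule_of_thumb_hours(enabled))
```

For four pools the run itself falls between 22 and 135 minutes; the 3 and 6 hour rule-of-thumb figures are larger and come directly from the slide.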
12
FAQs
Frequently Asked Questions
Why do you need to disable Data Compression?
– Data compression is a feature that analyzes the data on a disk and applies algorithms that reduce the size of repetitive sequences of bits that are inherent in some types of files. During the compression operation for a RAID group LUN, the software migrates and compresses the LUN data to a thin LUN in a pool. The LUN becomes a compressed thin LUN. Compression operations for pool LUNs (thick and thin) take place within the pool in which the LUN being compressed resides. Whenever data is compressed, there is data movement, which will affect the results of the Recovery Check.
Why do you need to disable Auto-Tiering?
– The auto-tiering feature migrates data between storage tiers or different storage media (EFD, FC and SATA). The purpose of tiered storage is to retain the most frequently accessed or important data on fast, high-performance (more expensive) drives, and to move the less frequently accessed and less important data to low-performance (less expensive) drives. As with Data Compression, Auto-Tiering involves data movement, which can also cause false results.
Why do you need to disable FAST Cache?
– Like the two features above, FAST Cache also involves data movement. When FAST Cache is enabled on a RAID group LUN or in a pool, data in the FAST Cache that is used less frequently is moved from FAST Cache to the HDDs. This data is re-promoted to the FAST Cache when it becomes busy or more frequently used.
13
FAQs
Frequently Asked Questions
Does running Recovery Check cause LUNs to go offline?
– No, running Recoverycheck does not take LUNs offline.
Must pools be taken offline during recovery?
– This depends on the analysis of the data previously provided. Escalation Engineering will endeavor not to take the LUN offline and to repair the affected LUN via a scripted batch file. However, under certain circumstances the repair does require that affected LUNs with SMLink corruption be taken offline for recovery.
– If one or more affected LUNs are used for database or File Data Mover applications, the whole application may need to be brought offline. Keep in mind that there is always a risk of unintentionally taking a pool offline when performing this type of manual recovery. Recovery should be performed during no/low I/O time periods.
14
FAQs
Frequently Asked Questions
Do FAST Cache and Compression need to remain disabled after running Recoverycheck, but before recovery?
– No, FAST Cache and Compression can be re-enabled on pool LUNs after Recoverycheck has run. Both features must be disabled again on affected pools prior to recovery. This should be factored into the total time required for the recovery operation.
How long does running Recoverycheck take?
– The answer is that it varies. When running Recoverycheck you will need to generate SAT files from each SP, which takes anywhere from 2 to 15 minutes (both can be run at the same time). Once this is complete, the actual run of the Recoverycheck takes between 5 and 30 minutes, depending on the configuration of the array and how many pool LUNs there are.
15
FAQs
Frequently Asked Questions
FAST Cache has been disabled, but there is no progress
– You need to make sure that setstats is enabled.
If FAST Cache is taking too long, can I start the Recovery Check?
– You can, and if you get a clean run the first time, that is enough. However, if you find corruptions, you will need to wait for the FAST Cache to finish.
16
Q&A