OCA Crash Analysis Andre Vachon Software Development Lead Windows Product Feedback Microsoft Corporation.


1 OCA Crash Analysis Andre Vachon Software Development Lead Windows Product Feedback Microsoft Corporation

2 What Is OCA
- Online Crash Analysis
- Free failure analysis service, supported on Windows XP and later operating systems
- Gathers direct customer data about customer Windows crashes
- Helps Microsoft and IHVs understand customer problems

3 Goals Of OCA Data Analysis
- Provide feedback to customers to improve overall satisfaction
  - Real-time feedback about what caused the problem on their machine
  - Links to help customers solve problems
- Make Windows a more reliable platform
  - Find and fix bugs for all kernel mode bluescreens
  - Make crash data more actionable for developers
  - Help Microsoft and IHVs prioritize problems

4 OCA Data Analysis Process
- Fully automated
  - No human intervention
  - Runs in 2-3 seconds
- Takes dumps received from the customer and sends them to the debugger
  - Execute !analyze in the debugger
  - Generate a bucket ID
  - Store the output of the analysis into the OCA Database
  - If the bucket ID has a solution, send the solution back to the customer

5 What Does OCA Collect
- Dump files
  - Minidumps by default
  - Optionally, customers can submit full dumps
- XML data
  - List of .sys files on the machine
  - List of PnP IDs enumerated by PnP on the machine
- All the data is packaged in a .cab file

6 What Is In A Kernel Minidump
- Header, basic OS information, PRCB
- OS module list (loaded and unloaded)
- Faulting EPROCESS, ETHREAD, stack and context
- Data pages pointed to by the context
- Data pages pointed to by the bugcheck params (Windows XP SP1)
- Some optional data pages, if space is available in the dump file
- Optional bugcheck callback data
- Minidumps will never contain all the information (neither will full dumps)
  - Targeted data collection to allow analysis of the majority of failures
  - We ask specific customers to send us additional data when needed
- User minidumps contain different types of information

7 Minidump Improvements
- Windows XP SP1 minidump improvements
  - Sysdata.xml contains PnP IDs
  - Save data pages pointed to by bugcheck parameters
- KeBugCheck routine improvements in Windows XP SP1 and SP2 to collect more targeted data for crashes
  - More data pages pointed to by registers
- Windows XP SP2 minidump improvements
  - More accurately save the context of the crash
  - Save all the pages backed by those registers
  - SMBIOS data tables
  - MM pool changes better isolate a number of pool corruptions

8 Debugging A Kernel Minidump
- Debuggers
  - Kernel minidumps require using KD or WinDbg
  - Both WinDbg and VS support debugging user mode minidumps
- Step 1: Get the images
  - A minidump contains minimal data, so code images must be loaded at debug time
  - Use the module timestamps stored in the dump files to find the correct images
  - All MS kernel mode code for recent OSes is on the internet symbol server
- Step 2: Extract PDB information from the images
  - The debug record stored in an image is used to look for the symbols
  - If you have the wrong image, wrong symbols will be loaded
- Step 3: Get symbols
  - Symbol server is again the best solution
- Data in the minidump is limited
  - Look at what you can
  - Some minidumps will not yield useful results if critical information is missing
- Read the docs for details on loading a minidump in the debugger

9 What Is A Bucket?
- Identifies component most likely responsible for the crash
  - Based on heuristics in !analyze
  - Heuristics are continually improved
- Represents a unique bug or problem
  - If multiple bugs map to a bucket, we split the bucket
- Responses and solutions are associated to a bucket
  - A human has to verify the analysis results before a response can be attached to a bucket

10 Sample Buckets
- OLD_IMAGE_FOO.SYS
  - Crash caused by an old version of foo.sys
- OLD_IMAGE_foo.sys_DEV_3577
  - Crash caused by an old version of foo.sys on device ID 3577
- 0x44_BUGCHECKING_DRIVER_foo
  - Driver foo.sys is known to commonly cause bugcheck 0x44
- POOL_CORRUPTION_foo
  - Driver foo.sys is known to cause pool corruption
- 0xBE_foo!bar+1a
  - Driver foo.sys crashed in routine bar
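The sample bucket names follow a few recognizable shapes. The sketch below classifies a bucket ID using patterns inferred only from the five samples on this slide; the pattern names and the `classify_bucket` helper are hypothetical, and the real !analyze naming rules are likely richer than this.

```python
import re

# Patterns inferred from the five sample bucket IDs only (an assumption,
# not the actual !analyze naming rules). Order matters: the more specific
# OLD_IMAGE form with a device ID is tried before the plain one.
PATTERNS = [
    ("old_image_device",
     re.compile(r"^OLD_IMAGE_(?P<image>.+)_DEV_(?P<device>[0-9A-Fa-f]+)$")),
    ("old_image",
     re.compile(r"^OLD_IMAGE_(?P<image>.+)$")),
    ("bugchecking_driver",
     re.compile(r"^(?P<bugcheck>0x[0-9A-Fa-f]+)_BUGCHECKING_DRIVER_(?P<module>.+)$")),
    ("pool_corruption",
     re.compile(r"^POOL_CORRUPTION_(?P<module>.+)$")),
    ("symbol_offset",
     re.compile(r"^(?P<bugcheck>0x[0-9A-Fa-f]+)_(?P<module>[^!]+)!"
                r"(?P<routine>.+)\+(?P<offset>[0-9A-Fa-f]+)$")),
]


def classify_bucket(bucket_id):
    """Return (kind, fields) for the first pattern that matches."""
    for kind, pattern in PATTERNS:
        match = pattern.match(bucket_id)
        if match:
            return kind, match.groupdict()
    return "unknown", {}


print(classify_bucket("0xBE_foo!bar+1a"))
```

Anything that matches none of the patterns is reported as "unknown" rather than guessed at.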

11 Customer Interaction
- Send back to customers information about their problem in real-time
  - Currently Web-based interaction
  - Contains link to web pages hosted by the third-party
  - Better integration in the OS in the future
- Two categories of feedback
  - Response: link to a page describing a problem we know about, but that is not solved yet
    - General troubleshooting steps of KB article
    - Company wants direct customer feedback
  - Solutions: Content that describes how to “fix” a problem
    - New drivers
    - Hosted by ISV, IHV, OEM or Windows Update
    - Service Pack
    - Tools to resolve a problem
- End-of-life statements are acceptable when hosted by the company

12 Creating Responses
- Responses are linked by the OCA team
  - Send mail to pfat @ microsoft.com when you find the root cause of a bucket and have a fix for it
- Microsoft has generic templates for various solutions and responses
  - Redirection to third party sites
  - Redirection to Windows Update
  - KB Articles, etc.
- IHVs and ISVs need to provide static web pages to have redirects

13 Customer Connection
- We collect very limited user feedback today
  - We collect whether responses were helpful or not to the customer
- OCA intends to improve interaction with customers
  - Collect customer repro steps
  - Enable direct contact between customer and developer
  - Ability for customers to get updated status on past crashes

14 OCA Crash Investigation
- Data collected by OCA is stored in a large database for crash analysis purposes
- Primary categorization is BucketID
- Additional crash data stored in the OCA DB
  - OS version
  - Failure date
  - Faulting driver
  - Faulting driver timestamp
  - OEM name
  - CPU information
  - Bug number
  - More data as we scale our SQL implementation

15 OCA Data Sharing
- IHVs
  - https://winqual.microsoft.com hosts the Error Reporting Site
  - Secure data sharing with any IHV signed up with WinQual
  - Data sharing is done based on file name and file version
  - Statistics and actual customer dump files are shared with IHVs
  - More improvements coming to the site
  - If you need more information to debug problems, send us mail
- OEMs
  - OCA data is shared with OEMs on a regular basis
  - OEMs see a list of all the crashes that happen on their machines
  - Expect to hear from your OEM if you have a lot of OCA crashes

16 OCA Data Normalization
- The OCA data cannot be normalized to determine absolute quality of a driver
  - OCA is an anonymous, opt-in system
  - We don’t know how many users send in reports and how often
  - We don’t know the software usage scenarios of customers
  - We don’t get reports for “success” scenarios
  - We don’t know what the actual problem was until it’s fixed
- Just fix the largest buckets first
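Since the data cannot be normalized, the practical ordering is by raw report count per bucket. A minimal sketch, assuming the reports are available as a plain list of bucket ID strings (the `largest_buckets` helper and the sample data are hypothetical, not part of OCA):

```python
from collections import Counter


def largest_buckets(bucket_ids, top=3):
    """Return the `top` most frequent bucket IDs with their report counts."""
    return Counter(bucket_ids).most_common(top)


# Hypothetical report stream: one bucket ID per received crash report.
reports = (
    ["POOL_CORRUPTION_foo"] * 5
    + ["0xBE_foo!bar+1a"] * 3
    + ["OLD_IMAGE_FOO.SYS"] * 1
)

for bucket, count in largest_buckets(reports):
    print(count, bucket)
```

The counts say which bucket to look at first; they say nothing about how a driver compares with others, for the reasons listed on the slide.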

17 What Is !analyze
- Debugger extension designed to find root cause of bugs
- Automated analysis
  - Simplifies analysis of known problems
  - Understands various states of the OS
- Provides good starting point to analyze complex problems
  - Extracts commonly used debugging information
- Results of the analysis are
  - “Bucket ID”
    - Unique string representing the bug
  - An owner for the problem, extracted from triage.ini
- In verbose mode
  - Detailed list of all the data found during the analysis

18 !analyze Output
kd> !analyze -v
THREAD_STUCK_IN_DEVICE_DRIVER (ea)
<text>
Debugging Details:
------------------
FAULTING_THREAD: 82493da8
DEFAULT_BUCKET_ID: GRAPHICS_DRIVER_FAULT
BUGCHECK_STR: 0xEA
LAST_CONTROL_TRANSFER: from bf9c148e to bf9c1c8f
STACK_TEXT:
ae328db0 bf9c148e af0df9c0 013bca06 ae328df0 xxxxxx!vDmaCopy_r6+0x495
ae328dfc bf9a94ef 00000026 ae328ec0 ae329304 xxxxxx!vCopyFBToDMABuffer+0x17a
…
STACK_COMMAND: .thread ffffffff82493da8 ; kb
FOLLOWUP_IP:
xxxxxx!vDmaCopy_r6+495
bf9c1c8f 3b1f cmp ebx,[edi]
FOLLOWUP_NAME: xxxxxx
SYMBOL_NAME: xxxxxx!vDmaCopy_r6+495
MODULE_NAME: xxxxxx
IMAGE_NAME: xxxxxx.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 3edc0abb
BUCKET_ID: 0xEA_xxxxxx!vDmaCopy_r6+495
INTERNAL_BUCKET_URL: http://dbgportal/DBGPortal_ViewBucket.asp?BucketID=0xEA_xxxxxx!vDmaCopy_r6%2b495&FrameID=undefined
OCA_CRASHES: xxxx
INTERNAL_RAID_BUG: http://watson/bug.aspx?DB=6&BugID=840654
Followup: xxxxxx
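Output of this shape is easy to post-process when it has been saved to a text file. The following is a rough sketch, assuming each field sits on its own line as `UPPER_CASE_KEY: value`; the `parse_analyze_fields` helper is hypothetical and is not part of the debugger.

```python
import re

# An all-caps key, a colon, then the rest of the line as the value.
FIELD = re.compile(r"^([A-Z][A-Z0-9_]*):\s*(.*)$")


def parse_analyze_fields(text):
    """Collect KEY: value pairs from saved !analyze -v style output."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


# A few lines copied from the output on this slide.
SAMPLE = """\
DEFAULT_BUCKET_ID: GRAPHICS_DRIVER_FAULT
BUGCHECK_STR: 0xEA
STACK_TEXT:
ae328db0 bf9c148e af0df9c0 013bca06 ae328df0 xxxxxx!vDmaCopy_r6+0x495
IMAGE_NAME: xxxxxx.dll
BUCKET_ID: 0xEA_xxxxxx!vDmaCopy_r6+495
"""

fields = parse_analyze_fields(SAMPLE)
print(fields["BUCKET_ID"])
```

Multi-line fields such as STACK_TEXT come back with an empty value here; a fuller parser would attach the lines that follow to the preceding key.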

19 !analyze Algorithm
- Multi step algorithm
  - Uses bugcheck or verifier code as initial input
  - Does stack analysis
  - Uses additional data about known problems provided by developers
  - Iterates on all the data above to determine the root cause

20 Analysis Step 1
- Use bugcheck parameters to extract basic information
  - Each bugcheck is processed by a separate routine that understands the meaning of each parameter
  - Save trap frame, context recording, faulting thread, etc.
- If specific follow-up or faulting driver is found, report results

21 Analysis Step 2
- Use information in step 1 to get faulting stack
  - Scan the stack for special functions such as Trap0E to find alternate stack
- Analyze frames on the final stack to determine most likely culprit
  - Different weights are assigned to routines
    - Internal kernel routines have lowest weight
    - Device drivers have highest weight
    - Fine grain control provided by triage.ini
  - Highest weight frame found on the stack is treated as the culprit
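The weighting idea can be sketched in a few lines. The numeric weights, the set of kernel module names, the tie-break (first highest-weight frame from the top of the stack), and the `overrides` dictionary standing in for triage.ini are all assumptions made for illustration; the slides do not give the real values.

```python
# Assumed weights: the slide only says kernel routines rank lowest and
# device drivers highest.
KERNEL_WEIGHT = 1
DRIVER_WEIGHT = 10
KERNEL_MODULES = {"nt", "hal"}  # hypothetical list of kernel modules


def split_frame(frame):
    """Split 'module!routine+0xNN' (or 'module+0xNN') into its parts."""
    symbol = frame.split("+", 1)[0]
    module, _, routine = symbol.partition("!")
    return module, routine or None


def frame_weight(frame, overrides):
    """Weight a frame; `overrides` plays the role of triage.ini entries."""
    module, routine = split_frame(frame)
    if routine and f"{module}!{routine}" in overrides:
        return overrides[f"{module}!{routine}"]
    if module in overrides:
        return overrides[module]
    return KERNEL_WEIGHT if module in KERNEL_MODULES else DRIVER_WEIGHT


def pick_culprit(stack, overrides=None):
    """Return the first frame (from the top) with the highest weight."""
    overrides = overrides or {}
    return max(stack, key=lambda frame: frame_weight(frame, overrides))


stack = [
    "nt!KeBugCheckEx+0x19",
    "nt!IopfCallDriver+0x18",
    "Fastfat!FatSingleAsync+0x74",
    "SYMEVENT+0x61cb",
]

# Hypothetical override: rank the file system driver below other drivers.
print(pick_culprit(stack, {"Fastfat": 5}))
```

Lowering a module's weight in `overrides` is the sketch's analogue of telling the analysis to look past that module.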

22 Symbol Server And Minidumps
- Minidumps store the timestamp of images
  - Debugger uses the file name, timestamp and image size to map the image
  - Debugger looks for the symbol file name in the mapped image
  - If the wrong image is loaded by the debugger, the symbols will also be wrong
- Storing images and symbols in symbol server is the best way for the debugger to get the correct version of the image
  - Also simplifies archiving of driver versions
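To illustrate how file name, timestamp and image size can identify one exact image, the sketch below builds a lookup path under the assumption that the store is laid out as file name, then timestamp and image size in hex, then file name again. That layout is the commonly described symbol-store convention rather than something stated in these slides, and the `image_store_path` helper and the image size used are hypothetical.

```python
def image_store_path(file_name, timestamp, size_of_image):
    """Build 'name/<timestamp><size>/name' for an image lookup.

    Assumed formatting: timestamp as 8 upper-case hex digits, image size
    as lower-case hex without padding.
    """
    key = f"{timestamp:08X}{size_of_image:x}"
    return f"{file_name}/{key}/{file_name}"


# Timestamp taken from the sample !analyze output; the size is made up.
print(image_store_path("xxxxxx.dll", 0x3EDC0ABB, 0x5E000))
```

Two builds of the same driver differ in timestamp or size, so they land in different directories, which is what lets the debugger fetch the matching version.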

23 IHV And ISV Symbols
- Symbols greatly help with the automated analysis of failures
  - Don’t lose your symbols!
- Sharing symbols with Microsoft
  - You can submit symbols with driver submissions to WHQL
  - On-site vendors can host their own symbol server
- Symbol data is stored securely
  - Symbols are not shared with other IHVs internally
  - Symbols are not shared on the external public symbol server
- Sharing symbols is totally optional, but encouraged

24 Analysis Step 2 – IHV Symbols
Without valid symbols
f18e7968 nt!KeBugCheckEx+0x19
f18e7980 nt!IopfCallDriver+0x18
f18e7990 Fastfat!FatSingleAsync+0x74
f18e7a5c Fastfat!FatCommonRead+0x88e
f18e7acc Fastfat!FatFsdRead+0x136
f18e7adc nt!IopfCallDriver+0x31
f18e7b0c SYMEVENT+0x61cb
f18e7b2c nt!IoPageRead+0x19
f18e7b9c nt!MiDispatchFault+0x270
f18e7bec nt!MmAccessFault+0x5b7
f18e7bec nt!_KiTrap0E+0xb8
f18e7cc4 nt!CcMapData+0xef
f18e7cf0 Fastfat!FatReadVolumeFile+0x38
f18e7e78 Fastfat!FatMountVolume+0x1f7
f18e7e98 Fastfat!FatCommonFileSystemControl+0x47
f18e7ee4 Fastfat!FatFsdFileSystemControl+0x85
f18e7ef4 nt!IopfCallDriver+0x31
f18e7f44 nt!IopMountVolume+0x1d1
BUCKET_ID: 0x35_SYMEVENT+61cb
With valid symbols
f18e7968 nt!KeBugCheckEx+0x19
f18e7980 nt!IopfCallDriver+0x18
f18e7990 Fastfat!FatSingleAsync+0x74
f18e7a5c Fastfat!FatCommonRead+0x88e
f18e7acc Fastfat!FatFsdRead+0x136
f18e7adc nt!IopfCallDriver+0x31
f18e7ae8 SYMEVENT!CSymIrp::IrpRead+0x4b
f18e7af8 nt!IopfCallDriver+0x31
f18e7b0c nt!IopPageReadInternal+0xf2
f18e7b2c nt!IoPageRead+0x19
f18e7b9c nt!MiDispatchFault+0x270
f18e7bec nt!MmAccessFault+0x5b7
f18e7bec nt!_KiTrap0E+0xb8
f18e7cc4 nt!CcMapData+0xef
f18e7cf0 Fastfat!FatReadVolumeFile+0x38
f18e7e78 Fastfat!FatMountVolume+0x1f7
f18e7e98 Fastfat!FatCommonFileSystemControl+0x47
BUCKET_ID: POOL_CORRUPTION_Foo.sys

25 Analysis Step 3
- If stack does not yield an interesting frame, analyze raw stack data
  - Iterate on all stack values using the same weight algorithm
  - The ‘dps’ command will show that output
  - This finds drivers that corrupt the stack
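Scanning raw stack data amounts to asking, for every pointer-sized value, whether it falls inside a loaded module. A small sketch of that lookup follows; the module bases and sizes are invented for the example, and the helper names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical loaded-module list: (base address, size, name), sorted by base.
MODULES = sorted([
    (0x804D7000, 0x1F0000, "nt"),
    (0xBF9A0000, 0x40000, "xxxxxx"),
])
BASES = [base for base, _, _ in MODULES]


def module_for(address):
    """Return the module whose address range contains `address`, if any."""
    index = bisect_right(BASES, address) - 1
    if index >= 0:
        base, size, name = MODULES[index]
        if address < base + size:
            return name
    return None


def modules_on_raw_stack(values):
    """List modules referenced by raw stack values, in first-seen order."""
    seen = []
    for value in values:
        name = module_for(value)
        if name and name not in seen:
            seen.append(name)
    return seen


# Values taken from the STACK_TEXT lines of the sample output.
raw = [0xAE328DB0, 0xBF9C148E, 0xAF0DF9C0, 0x013BCA06, 0xBF9A94EF]
print(modules_on_raw_stack(raw))
```

The modules found this way could then be ranked with the same weights used for ordinary stack frames.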

26 Analysis Step 4
- Check for presence of memory or pool corrupting drivers
- Check for corrupted code streams using !chkimg
  - Bad RAM
- Check for other possible problems, such as invalid call sequences
  - Possible CPU problem

27 Pool Corruption
- Pool corruption is very bad
  - Driver A crashes because of driver B’s bug
  - Very hard to identify the culprit
  - We estimate about 15% of all crashes are caused by pool corruption
- Many OCA failures are due to pool corruption
  - Every vendor has buckets assigned to them that are due to another driver
- Run Driver Verifier!
- Track down all pool corruptions and fix them!

28 Hardware Issues
- Hardware problems are quite common
- Heating issues
  - Investigating data in SMBIOS and ACPI to help with this
- Bad DMA
  - May be detectable in the future with new hardware support in the processor
- Bad disk
  - Diagnosis tools are being investigated
- Chipset problems (timing issues)
  - No known detection mechanisms
- CPU bugs
  - No known detection mechanisms
- Power glitches, surge
  - No known detection mechanisms
- Bad memory
  - Developing algorithms to detect bad memory from a minidump
  - Shipping a stand-alone memory checker
  - http://oca.microsoft.com/en/windiag.asp

29 Analysis Step 5
- Generate final bucket ID and follow-up based on all gathered information
  - Determine which fields need to be embedded in the bucket ID
  - Assign ownership of failure
- Lookup in the OCA database for bug ID or solution for this bucket

30 Triage.ini
- Data file used to drive !analyze heuristics. It contains
  - Lists of known bad drivers
  - Reliability of certain routines within a driver
  - Who owns a particular module or routine
  - How certain bucket IDs should be generated
- !analyze parses all the data in triage.ini to generate the final results
- Data updated on a daily basis
- New tokens to control bucketing added regularly

31 Triage.ini Tokens
- Timestamps – link date, in HEX format
- Driver – full name of the image
- Module – name of the image without the extension
- Name – owner of that routine or module
- poolcorruptors! =
- memorycorruptors! =
- oldimages! =
- bugcheckingdriver!0x6_ =
- Additional_DriverInfo! = Build, deviceID, Offset
- ! = Ignore_
- ! = maybe_
- ! = specific_
- ! = last_
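A hex link date can be turned into a readable date by treating it as a count of seconds since 1970-01-01 UTC, which is how image timestamps are normally encoded. The `link_date` helper below is a hypothetical convenience, shown with the DEBUG_FLR_IMAGE_TIMESTAMP value from the sample !analyze output.

```python
from datetime import datetime, timezone


def link_date(hex_timestamp):
    """Convert a hex link date (seconds since 1970 UTC) to a datetime."""
    return datetime.fromtimestamp(int(hex_timestamp, 16), tz=timezone.utc)


print(link_date("3edc0abb").isoformat())
```

Comparing such dates is one way to decide whether a given image is older than a known fixed version.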

32 Changing Your OCA Buckets
- Images and Symbols
  - Sharing images and symbols with Microsoft can allow your buckets to be merged, or routines ignored
- Triage.ini change
- Algorithm changes
  - !analyze is not directly extensible by third parties yet
  - !analyze can call driver specific analysis routines. Can be used to parse bugcheck data block
- For any improvements, send mail to pfat @ microsoft.com

33 Retriaging
- Process of re-analyzing crashes
  - Re-execute !analyze on the dump file and update the database information
- Done when a developer gives us an analysis change
  - Triage.ini
  - New !analyze heuristic
- Dumps that are retriaged can go into new buckets

34 Call To Action
- Look at your OCA failures
  - These are REAL customer problems
- Fix your pool corruption problems
- Tell us about the bugs you fix, so we can update !analyze and point customers to your driver updates
- Attend the WinDbg Ask the Experts sessions

35 Resources
- Debugger URL and download site
  - http://www.microsoft.com/whdc/ddk/debugging
- Debugger e-mail – for debugger bug reports and feature requests
  - windbgfb @ microsoft.com
  - We try to fix all the bugs people report
  - We do not provide general debugging support on this alias
- Debugger newsgroup
  - Microsoft.public.windbg
  - Good place for general debugging issues
