Error Management Solutions Synergy With WHEA John Strange Software Design Engineer Core OS microsoft.com Microsoft Corporation.


1 Error Management Solutions Synergy With WHEA John Strange Software Design Engineer Core OS JohnStra @ microsoft.com Microsoft Corporation

2 Session Outline: WHEA Overview; Hardware Error Sources; Hardware Error Management Solutions; WHEA Integration; PCI Express Advanced Error Reporting (AER) Example

3 Session Goals: Attendees should leave this session with a good understanding of how platform hardware/firmware, device drivers, and error management software integrate with WHEA, and with knowledge of where to find resources for WHEA.

4 WHEA Overview

5 Architecture - Overview

6 Key Components: Platform Specific Hardware Error Driver (PSHED); Low-Level Hardware Error Handler (LLHEH); WheaReportHwError, the entry point to OS common error handling; Error Record, the common OS error record; Error Event Consumers

7 Error Sources

8 Hardware Error Source: An error source is a mechanism that notifies software of hardware error conditions and provides information to describe the error condition. Notification may be via interrupt, polling of error status registers, or callback from system firmware. Error data may be recorded in hardware registers, mapped to PCI configuration space, provided by a system firmware interface, etc.

9 Hardware Error Sources and WHEA: WHEA targets platform-level error sources. Platform-level error sources usually aggregate error reporting for multiple devices.

Error Source               Hardware
Machine Check              Processor, Cache, TLBs, Memory
Corrected Platform Error   Memory controller
Non-maskable Interrupt     IO Bus
PCI Express                Device, Root Complex

10 Managing Error Sources

11 Managing Hardware Error Sources: WHEA enables management of error sources. A number of attributes associated with a given error source may be manageable. Platform OEMs specify this functionality; they can decide which attributes are exposed to be viewed and/or modified. WHEA enables programmatic control over the attributes associated with an error source: whether an error source is enabled/disabled, thresholds associated with an error source, control register settings of a particular error source, error severity mappings, and error masking settings.

12 Managing Hardware Error Sources (con’t): The OS queries the PSHED for a table of all the error sources on a given platform. The PSHED interfaces with the platform to extract this information and return it to the OS. The OS makes this information available to management applications. Some of this information may be settable only by privileged entities. These interfaces will be available during OS install, so platform-appropriate settings may be applied during setup. This capability solves BIOS/OS conflicts over error source settings.
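The management model on slides 11 and 12 can be sketched in a few lines of C. This is only an illustration under stated assumptions: every name here (`sketch_error_source`, `sketch_set_threshold`, the status codes, the table contents) is hypothetical and is not taken from the WHEA headers. It models a table of error sources, an OEM decision about which attributes may be modified, and a privilege check on writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not WHEA definitions. */
typedef enum {
    SKETCH_SRC_MACHINE_CHECK,
    SKETCH_SRC_CORRECTED_PLATFORM,
    SKETCH_SRC_NMI,
    SKETCH_SRC_PCI_EXPRESS
} sketch_source_type;

typedef enum {
    SKETCH_OK,
    SKETCH_NOT_FOUND,     /* platform does not report this error source */
    SKETCH_READ_ONLY,     /* OEM did not expose the attribute for modification */
    SKETCH_ACCESS_DENIED  /* caller is not a privileged entity */
} sketch_status;

typedef struct {
    sketch_source_type type;
    bool enabled;     /* whether the error source is enabled */
    unsigned threshold; /* illustrative threshold attribute */
    bool modifiable;  /* OEM decision: may software change the attributes? */
} sketch_error_source;

/* Stand-in for the table the OS would obtain from the PSHED (made-up values). */
sketch_error_source sketch_table[] = {
    { SKETCH_SRC_MACHINE_CHECK,      true,  1,  false },
    { SKETCH_SRC_CORRECTED_PLATFORM, true,  10, true  },
    { SKETCH_SRC_PCI_EXPRESS,        false, 4,  true  },
};

/* Look up one error source in the table, or return NULL if it is absent. */
sketch_error_source *sketch_find(sketch_source_type type)
{
    size_t count = sizeof sketch_table / sizeof sketch_table[0];
    for (size_t i = 0; i < count; i++) {
        if (sketch_table[i].type == type)
            return &sketch_table[i];
    }
    return NULL;
}

/* Change a threshold, honoring the OEM's exposure choice and caller privilege. */
sketch_status sketch_set_threshold(sketch_source_type type, unsigned threshold,
                                   bool caller_is_privileged)
{
    sketch_error_source *src = sketch_find(type);
    if (src == NULL)
        return SKETCH_NOT_FOUND;
    if (!src->modifiable)
        return SKETCH_READ_ONLY;
    if (!caller_is_privileged)
        return SKETCH_ACCESS_DENIED;
    src->threshold = threshold;
    return SKETCH_OK;
}
```

A setup program running with privilege could call `sketch_set_threshold(SKETCH_SRC_PCI_EXPRESS, 8, true)` to apply a platform-appropriate value, while the same call from an unprivileged caller would be refused.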

13 Hardware Error Management Solutions: Existing hardware error management solutions are necessarily proprietary. Even those based on standards such as the Intelligent Platform Management Interface (IPMI) record error information in a proprietary format in the SEL (system event log). A generic SDR (sensor data record) is used, and record size constraints limit the richness of the error records. Proprietary applications can consume and perform management operations on the proprietary error data. These applications retrieve the error information in a proprietary manner, usually via a collection of device drivers that present the information to the management application.

14 Hardware Error Management Solutions (con’t): WHEA enables generic hardware error management solutions through a published error record format and an ETW-based error eventing model that allows management applications to subscribe to the events in which they are interested. WHEA permits value-add extensibility by having unstructured (e.g. proprietary) error data added to error records. WHEA error records are potentially very rich in content and include OS context information to aid in problem diagnosis and resolution.
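One way to picture a common record that still leaves room for proprietary data is a fixed header followed by tagged sections. The sketch below assumes such a layout purely for illustration; the type names, field names, and size limits are hypothetical and do not reproduce the published WHEA error record format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout for illustration; not the published WHEA record format. */
#define SKETCH_MAX_SECTIONS 4
#define SKETCH_MAX_SECTION_BYTES 64

typedef enum {
    SKETCH_SECTION_STANDARD,     /* content any generic consumer can parse */
    SKETCH_SECTION_UNSTRUCTURED  /* proprietary, value-add content */
} sketch_section_kind;

typedef struct {
    sketch_section_kind kind;
    size_t length;
    unsigned char data[SKETCH_MAX_SECTION_BYTES];
} sketch_section;

typedef struct {
    unsigned source_id;  /* which error source produced the record */
    unsigned severity;   /* severity as seen by the OS */
    size_t section_count;
    sketch_section sections[SKETCH_MAX_SECTIONS];
} sketch_error_record;

/* Append one section; returns 0 on success, -1 if it does not fit. */
int sketch_add_section(sketch_error_record *rec, sketch_section_kind kind,
                       const void *data, size_t length)
{
    assert(rec != NULL);
    if (rec->section_count >= SKETCH_MAX_SECTIONS ||
        length > SKETCH_MAX_SECTION_BYTES ||
        (data == NULL && length > 0))
        return -1;

    sketch_section *s = &rec->sections[rec->section_count++];
    s->kind = kind;
    s->length = length;
    if (length > 0)
        memcpy(s->data, data, length);
    return 0;
}

/* Count the sections of one kind, e.g. how much proprietary data is attached. */
size_t sketch_count_sections(const sketch_error_record *rec, sketch_section_kind kind)
{
    size_t n = 0;
    for (size_t i = 0; i < rec->section_count; i++) {
        if (rec->sections[i].kind == kind)
            n++;
    }
    return n;
}
```

A generic consumer would read the header and the standard sections and skip anything tagged unstructured, while a vendor tool that knows the private layout could decode those sections as well.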

15 WHEA Integration

16 How do solution providers integrate with WHEA? System firmware/platform support: implement platform interfaces required by WHEA (e.g. Error Source Discovery and Error Record Serialization). PSHED plug-ins: augment and/or override the behavior of the default per-processor-architecture PSHED. LLHEHs: device drivers for some hardware error sources may be made WHEA-aware to report hardware errors to the system. Consumer applications: user-mode applications that perform health monitoring and other higher-level error management functions.

17 WHEA Integration - Platform Support: OEMs will be required to implement at least minimal WHEA support to obtain the Logo: Error Source Discovery and Error Record Serialization. Opportunities exist for even tighter integration with the OS: adopting the WHEA error record format as the platform's native error record, and improved platform-level mechanisms for reporting error conditions to the OS (e.g. using extended PCI config space and a structured error data format).

18 WHEA Integration - LLHEHs: Bus drivers might be in charge of error sources that need to be exposed to WHEA; endpoint devices are not expected to do this. Device drivers that fall into this category implement LLHEHs, which handle errors and report them to the kernel.

19 WHEA Integration - PSHED Plug-Ins: The PSHED houses all hardware-error-related interactions between the OS and the platform. The PSHED represents an opportunity for OEMs to rethink how some error handling features are implemented. Some functionality may be moved into the PSHED rather than BIOS/FW. Portions of the functionality may stay in BIOS/FW, and PSHED plug-ins may interface with these functions.

20 WHEA Integration – Management Applications: Management applications implement high-level error monitoring, reporting, and potentially recovery capabilities. These applications subscribe to receive error event notifications via ETW. Generic processing of all error events is possible given the common error record format. Extended processing of error events is possible through unstructured (private) error information recorded in the error record.
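The subscription idea can be sketched with a plain callback registry. To be clear about the assumptions: this is not ETW and uses no ETW calls; the registry, the `sketch_` names, and the three-level severity ordering are all hypothetical. It only shows consumers registering interest and receiving just the events they asked for.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-process registry; a stand-in for the idea, not for ETW. */
typedef enum {
    SKETCH_SEV_CORRECTED,   /* ordered from least to most severe (assumption) */
    SKETCH_SEV_RECOVERABLE,
    SKETCH_SEV_FATAL
} sketch_severity;

typedef struct {
    unsigned source_id;
    sketch_severity severity;
} sketch_event;

typedef void (*sketch_handler)(const sketch_event *ev, void *context);

typedef struct {
    sketch_handler handler;
    void *context;
    sketch_severity min_severity; /* the consumer's filter */
} sketch_subscription;

#define SKETCH_MAX_SUBSCRIPTIONS 8
static sketch_subscription sketch_subs[SKETCH_MAX_SUBSCRIPTIONS];
static size_t sketch_sub_count;

/* Register a consumer; returns 0 on success, -1 if the registry is full. */
int sketch_subscribe(sketch_handler handler, void *context, sketch_severity min_severity)
{
    assert(handler != NULL);
    if (sketch_sub_count >= SKETCH_MAX_SUBSCRIPTIONS)
        return -1;
    sketch_subs[sketch_sub_count].handler = handler;
    sketch_subs[sketch_sub_count].context = context;
    sketch_subs[sketch_sub_count].min_severity = min_severity;
    sketch_sub_count++;
    return 0;
}

/* Deliver an event to every interested consumer; returns how many were notified. */
size_t sketch_publish(const sketch_event *ev)
{
    size_t notified = 0;
    for (size_t i = 0; i < sketch_sub_count; i++) {
        if (ev->severity >= sketch_subs[i].min_severity) {
            sketch_subs[i].handler(ev, sketch_subs[i].context);
            notified++;
        }
    }
    return notified;
}

/* Example consumer: counts the events it receives in an unsigned counter. */
void sketch_count_handler(const sketch_event *ev, void *context)
{
    (void)ev;
    (*(unsigned *)context)++;
}
```

A health monitor might subscribe at the lowest severity to track corrected-error trends, while a fail-over service subscribes only to fatal events.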

21 PCI Express Example

22 PCI Express AER Example: PCI Express Advanced Error Reporting (AER) is a good technology to use as an example. This example will show how PCI Express AER support can be integrated into WHEA.

23 PCI Express AER Example – Platform-Level Support: The platform BIOS must surface PCI Express AER as a platform error source; possible mechanisms include an ACPI table or an EFI runtime interface. The platform must grant the OS control of PCI Express error handling via ACPI _OSC. Assume our example platform implements some non-standard PCI Express error registers that capture platform-specific information in addition to the standardized AER error registers.

24 PCI Express AER Example – LLHEH: The PCI bus driver will implement the root port interrupt handler, which receives error interrupts. Therefore, the PCI bus driver will implement the LLHEH for PCI Express AER. To accomplish this, the PCI bus driver must: (1) implement an ErrorSourceInitializer callback routine to initialize error reporting resources; (2) from its DriverEntry routine, register the ErrorSourceInitializer callback by calling WheaRegisterErrSrcInitializer. After the initializer routine has been called, the bus driver can report hardware errors to the kernel.

25 PCI Express AER Example – LLHEH (con’t): Upon detecting a PCI Express error, the PCI bus driver does the following: (1) creates and initializes a WHEA_ERROR_PACKET using the error information it extracts from the PCI Express AER error status in extended config space (the driver is responsible for mapping the error severity reported by the device into one of WHEA’s error severity levels); (2) calls the PSHED’s PshedRetrieveErrorInfo routine, passing a pointer to the WHEA_ERROR_PACKET; (3) calls WheaReportHwError, supplying a pointer to the WHEA_ERROR_PACKET.
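The LLHEH sequence on slides 24 and 25 can be walked through with a user-mode simulation. Nothing below is the real kernel interface: the slides do not give signatures for WheaRegisterErrSrcInitializer, PshedRetrieveErrorInfo, or WheaReportHwError, so each is replaced by a hypothetical `sketch_` stand-in, and the packet is a made-up struct rather than WHEA_ERROR_PACKET. The severity mapping (correctable to corrected, non-fatal to recoverable, fatal to fatal) is one plausible choice, not a documented rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-mode simulation with hypothetical stand-ins; not the WHEA kernel API. */
typedef enum { SKETCH_AER_CORRECTABLE, SKETCH_AER_NONFATAL, SKETCH_AER_FATAL } sketch_aer_class;
typedef enum { SKETCH_SEV_CORRECTED, SKETCH_SEV_RECOVERABLE, SKETCH_SEV_FATAL } sketch_severity;

typedef struct {            /* made-up packet, not WHEA_ERROR_PACKET */
    sketch_severity severity;
    uint32_t aer_status;    /* raw status read from extended config space */
    bool platform_info_added;
} sketch_packet;

typedef void (*sketch_initializer_fn)(void);

static sketch_initializer_fn sketch_registered_initializer;
static bool sketch_source_initialized;
unsigned sketch_reports;    /* how many errors reached the simulated kernel */

/* Stand-in for the driver's ErrorSourceInitializer callback. */
static void sketch_error_source_initializer(void)
{
    sketch_source_initialized = true;
}

/* Stand-in for the registration call made from DriverEntry. */
void sketch_register_initializer(sketch_initializer_fn fn)
{
    sketch_registered_initializer = fn;
}

/* Stand-in for DriverEntry: only registers the callback. */
void sketch_driver_entry(void)
{
    sketch_register_initializer(sketch_error_source_initializer);
}

/* Simulates the kernel invoking the registered initializer at a later time. */
void sketch_kernel_run_initializer(void)
{
    if (sketch_registered_initializer != NULL)
        sketch_registered_initializer();
}

/* One plausible mapping from AER error class to an OS severity level. */
sketch_severity sketch_map_severity(sketch_aer_class cls)
{
    switch (cls) {
    case SKETCH_AER_CORRECTABLE: return SKETCH_SEV_CORRECTED;
    case SKETCH_AER_NONFATAL:    return SKETCH_SEV_RECOVERABLE;
    default:                     return SKETCH_SEV_FATAL;
    }
}

/* Stand-in for the PSHED call that lets the platform add information. */
void sketch_retrieve_error_info(sketch_packet *packet)
{
    assert(packet != NULL);
    packet->platform_info_added = true;
}

/* Stand-in for the report call; refuses reports before initialization. */
int sketch_report_hw_error(const sketch_packet *packet)
{
    assert(packet != NULL);
    if (!sketch_source_initialized)
        return -1;
    sketch_reports++;
    return 0;
}

/* The three steps: build the packet, let the PSHED add data, report it. */
int sketch_handle_aer_error(sketch_aer_class cls, uint32_t aer_status)
{
    sketch_packet packet = { sketch_map_severity(cls), aer_status, false };
    sketch_retrieve_error_info(&packet);
    return sketch_report_hw_error(&packet);
}
```

In this simulation, calling `sketch_handle_aer_error` before `sketch_kernel_run_initializer` returns -1, which mirrors the slide's point that reporting becomes possible only after the initializer has run.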

26 PCI Express AER Example – PSHED: Remember, our example platform implements a set of non-standard PCI Express error registers. A PSHED plug-in might participate in the error source discovery functionality to ensure that the OS sizes the WHEA_ERROR_PACKET for the PCI Express error source to accommodate the additional error information. PshedRetrieveErrorInfo is called by the LLHEH when it detects an error condition. A plug-in could extract the information in the non-standard error registers and add that information to the error packet.

27 PCI Express AER Example – PSHED (con’t): The PSHED will be called by WheaReportHwError to finalize construction of the error record. At this point, a PSHED plug-in could use platform-specific information to populate additional error sections in the error record. Note that the approach suggested gracefully accommodates platform differentiation. An entry-level server line might ship without the PSHED plug-in, and its error reporting capabilities would not include the additional non-standard registers. A higher-level server line could ship with the plug-in and therefore offer extended error reporting (and possibly recovery) capabilities.
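The differentiation argument on slides 26 and 27 amounts to an optional hook. The sketch below models it with a function pointer that is either absent (entry-level platform) or installed (platform with the plug-in). All names, the 8-byte reservation, and the register bytes are invented for illustration; the real plug-in interface is not described in these slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical plug-in model; the real PSHED plug-in interface is not shown here. */
#define SKETCH_EXTRA_MAX 16

typedef struct {
    uint32_t aer_status;              /* standardized AER information */
    size_t extra_capacity;            /* room reserved at discovery time */
    size_t extra_length;              /* bytes of platform data actually added */
    uint8_t extra[SKETCH_EXTRA_MAX];  /* non-standard register contents */
} sketch_packet;

/* A plug-in copies platform data into the buffer and returns the byte count. */
typedef size_t (*sketch_plugin_fn)(uint8_t *buffer, size_t capacity);

static sketch_plugin_fn sketch_plugin; /* NULL when no plug-in ships */

void sketch_install_plugin(sketch_plugin_fn fn)
{
    sketch_plugin = fn;
}

/* Discovery step: with a plug-in, the packet is sized for the extra data. */
size_t sketch_extra_capacity(void)
{
    return sketch_plugin != NULL ? 8 : 0; /* 8 is an arbitrary example size */
}

sketch_packet sketch_make_packet(uint32_t aer_status)
{
    sketch_packet packet;
    memset(&packet, 0, sizeof packet);
    packet.aer_status = aer_status;
    packet.extra_capacity = sketch_extra_capacity();
    return packet;
}

/* Stand-in for the retrieve-info step: gives the plug-in a chance to add data. */
void sketch_retrieve_error_info(sketch_packet *packet)
{
    assert(packet != NULL);
    if (sketch_plugin != NULL && packet->extra_capacity > 0)
        packet->extra_length = sketch_plugin(packet->extra, packet->extra_capacity);
}

/* Example vendor plug-in returning made-up register contents. */
size_t sketch_vendor_plugin(uint8_t *buffer, size_t capacity)
{
    static const uint8_t regs[4] = { 0xDE, 0xAD, 0xBE, 0xEF };
    size_t n = capacity < sizeof regs ? capacity : sizeof regs;
    memcpy(buffer, regs, n);
    return n;
}
```

The same error path runs on both product lines; only the presence of the plug-in changes how much information the packet carries.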

28 PCI Express AER Example – Consumers: A targeted consumer (management application) might be written with special knowledge of the information contained in the platform’s non-standard PCI Express error registers. The consumer might implement extended error reporting, health monitoring, and even fail-over services.

29 Call To Action: Send us your questions. Watch for WHEA logo requirements for Windows codenamed “Longhorn”. Evaluate how your products will integrate with WHEA.

30 Community Resources: Windows Hardware & Driver Central (WHDC), www.microsoft.com/whdc/default.mspx; Technical Communities, www.microsoft.com/communities/products/default.mspx; Non-Microsoft Community Sites, www.microsoft.com/communities/related/default.mspx; Microsoft Public Newsgroups, www.microsoft.com/communities/newsgroups; Technical Chats and Webcasts, www.microsoft.com/communities/chats/default.mspx and www.microsoft.com/webcasts; Microsoft Blogs, www.microsoft.com/communities/blogs

31 Additional Resources Email: Send feedback and questions to WHEAFB @ microsoft.com


33 © 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

