Published by Kristina Curtis. Modified over 9 years ago.
1
Graphics Hardware 2006
The Visual Vulnerability Spectrum: Characterizing Architectural Vulnerability for Graphics Hardware
Jeremy Sheaffer†, David Luebke‡, and Kevin Skadron†
† University of Virginia, Department of Computer Science
‡ NVIDIA Research
18
Motivation
• Transient errors can cause undesirable artifacts, such as:
  – Single pixel errors
  – Single texel errors
    ▪ Which might be stretched, interpolated, or repeated
  – Single vertex errors
  – A corrupted frame
  – A computer crash
  – Corrupted rendering state
• Clearly, GPGPU changes the equation
  – In this work, we address only graphics workloads
25
Causes of Transient Faults
• Traditional causes
  – Cosmic radiation: gamma particles
    ▪ Soft Error Rate (SER) is proportional to cosmic ray flux
    ▪ Flux at sea level is about 1 particle/(cm²·s)
    ▪ Maximum flux of about 100 particles/(cm²·s) occurs at airplane altitudes
    ▪ Particles at higher altitudes tend to have higher energies due to less cascading
  – Terrestrial radiation: alpha particles
    ▪ Initially discovered at nuclear test sites in the ’50s
  – Called soft errors or single event upsets (SEUs)
26
Traditional Causes
• Well understood and quantified
• Very important in supercomputing installations
  – On the order of 10 transient faults/day
• Memory is the primary victim, not logic
• Ziegler shows that only 1 in 40,000 incident particles collides with the silicon crystalline structure
  – Assuming a 1 cm² processor
    ▪ Approximately 1 collision every 11 hours at sea level
    ▪ Approximately 1 collision every 6 minutes in an airplane
    ▪ Not all collisions have high enough energy to cause errors
    ▪ Clearly not very important for traditional graphics!
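The collision intervals quoted on this slide follow from the flux figures on the previous slide and the 1-in-40,000 collision ratio. A minimal sketch of that arithmetic, assuming the quoted flux values and a 1 cm² die (the function and constant names are illustrative, not from the talk):

```python
# Back-of-the-envelope check of the collision intervals on the slide.
# Assumptions: 1 in 40,000 incident particles collides with the silicon
# lattice, and flux is given in particles per cm^2 per second.

PARTICLES_PER_COLLISION = 40_000


def seconds_between_collisions(flux: float, area_cm2: float = 1.0) -> float:
    """Mean time between collisions for a die of the given area."""
    incident_per_second = flux * area_cm2
    return PARTICLES_PER_COLLISION / incident_per_second


sea_level_hours = seconds_between_collisions(flux=1.0) / 3600
airplane_minutes = seconds_between_collisions(flux=100.0) / 60

print(f"sea level: one collision every {sea_level_hours:.1f} hours")
print(f"airplane:  one collision every {airplane_minutes:.1f} minutes")
```

With these inputs the sea-level interval is about 11 hours and the airplane interval is between 6 and 7 minutes, matching the rounded figures on the slide.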
27
Near Future Causes
• External EM noise
• EM noise from crosstalk
• di/dt and voltage droop
• Parameter variations
• Most of these can be at least partially accounted for through architectural or circuit-level techniques
  – e.g. capacitors to compensate for di/dt
• Overclocking exacerbates these problems
28
Near Future Causes
• Neither well understood nor well quantified
  – Borkar shows exponential growth of transient errors at a rate of 8%/generation
  – Very little other literature exists
    ▪ Primarily because nobody yet understands how to analyze the problem
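To get a feel for what 8% per generation means when compounded, here is a small sketch. It assumes the rate compounds multiplicatively per process generation; the function name and the choice of generation counts are illustrative only.

```python
# Compound growth of the transient error rate, relative to generation 0.
# Assumption: the 8% figure compounds once per process generation.


def relative_error_rate(generations: int, growth_per_generation: float = 0.08) -> float:
    """Error rate after `generations` steps, as a multiple of the starting rate."""
    return (1.0 + growth_per_generation) ** generations


for n in (1, 5, 10):
    print(f"after {n:2d} generations: {relative_error_rate(n):.2f}x the original rate")
```

Under this assumption the rate roughly doubles after nine to ten generations.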
29
Transient Fault Mitigation Techniques (CPU)
• ECC and parity
  – These protect memory but not combinational logic
    ▪ Until recently, memory has been the primary concern and ECC and parity the primary solutions
• Scrubbing
  – Used in conjunction with ECC to reduce 2-bit errors
• Hardware fingerprinting or state dump with rollback
  – Poorly evaluated
• Larger or radiation-hardened gates
  – Increases the critical charge Q_crit
• Redundancy
  – Primarily employed to protect logic
  – Also sometimes used for memory
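As an illustration of how ECC and scrubbing work together, the sketch below uses a Hamming(7,4) code. The slides do not say which ECC scheme is meant; Hamming(7,4) is chosen here only because it is the smallest textbook single-error-correcting code, and all function names are made up for the example. Scrubbing walks memory and repairs single-bit errors before a second flip in the same word can turn them into uncorrectable 2-bit errors.

```python
from typing import List, Tuple

DATA_POSITIONS = (3, 5, 6, 7)  # 1-based codeword positions that carry data bits


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    code = [0] * 8  # index 0 unused so list indices match bit positions
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos] = bit
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, syndrome); a nonzero syndrome is the flipped position."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1  # correct the single flipped bit
    return [code[pos] for pos in DATA_POSITIONS], syndrome


def scrub(memory: List[List[int]]) -> int:
    """Rewrite every word that holds a correctable error; return the repair count."""
    corrected = 0
    for i, word in enumerate(memory):
        data, syndrome = decode(word)
        if syndrome:
            memory[i] = encode(data)
            corrected += 1
    return corrected
```

A plain parity bit, by contrast, would only detect the flip, which is sufficient when the data can be reloaded from elsewhere.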
30
Reliability Through Redundancy
• Primary topic in recent transient fault reliability literature
• Many clever ideas proposed and (sort of) evaluated, including
  – Triple redundancy with voting
    ▪ Boeing 777 uses triple redundancy in all fly-by-wire components and triple redundancy in all computers for ‘triple triple’ or 9× redundancy
  – Lockstepped processors
  – Redundant Multithreading
    ▪ CRT: Chip-level Redundantly Threaded processors
    ▪ SRT: Simultaneous and Redundantly Threaded processors
    ▪ The concepts of a ‘Sphere of Replication’ and leading and trailing threads
    ▪ LVQ: load value queue
    ▪ BOQ: branch outcome queue
    ▪ Compiler-assisted techniques like the checking store buffer (CSB)
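Triple redundancy with voting can be sketched in a few lines. The example below is a generic bitwise majority voter over three redundant copies of a word; it is not taken from any of the systems named on the slide, and the function name is illustrative.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of the same word.

    Each output bit takes the value held by at least two of the three inputs,
    so a fault confined to one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


golden = 0b1010_0110
faulty = golden ^ 0b0000_1000  # one copy suffers a single bit flip

# The voter masks the fault regardless of which copy was hit.
print(bin(majority_vote(golden, golden, faulty)))
print(bin(majority_vote(faulty, golden, golden)))
```

The voter itself is unprotected logic in this sketch; if two copies are corrupted in the same bit position, the vote returns the wrong value.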
31
Architectural Vulnerability Factor
• DUE: Detectable Unrecoverable Error
• SDC: Silent Data Corruption
• ACE: required for Architecturally Correct Execution
• AVF: Architectural Vulnerability Factor
  – The likelihood that a transient error in a structure will lead to a computational error
32
Architectural Vulnerability Factor
• B is the set of all bits in some structure
• t_b is the total time that bit b is ACE
• Δt is the total time of the simulation
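The equation on this slide did not survive extraction. The three definitions are consistent with the usual form of AVF, the sum of t_b over all bits b in B, divided by |B| × Δt; that form is assumed here rather than recovered from the slide. A minimal sketch with made-up ACE times:

```python
from typing import Sequence


def avf(ace_times: Sequence[float], total_time: float) -> float:
    """Architectural Vulnerability Factor of one structure.

    ace_times[i] is t_b for bit i (time the bit was ACE); total_time is
    delta-t, the length of the simulation. The result is the average
    fraction of time that a bit in the structure is ACE.
    """
    if not ace_times or total_time <= 0:
        raise ValueError("need at least one bit and a positive simulation time")
    return sum(ace_times) / (len(ace_times) * total_time)


# Hypothetical 4-bit structure observed for 10 time units.
print(avf([10, 0, 5, 5], total_time=10))
```

Note that every ACE bit contributes equally to the sum, which is the property the next slide objects to for graphics workloads.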
33
AVF is Not Good for Graphics
• Primarily because it assumes that all bits that influence the computation are equally important
  – Similarly, with AVF, any corruption of an ACE bit gives a ‘wrong answer’
• Caveat: GPGPU
34
Visual Vulnerability Spectrum
• We note that many transient faults in graphics workloads do not matter and propose the Visual Vulnerability Spectrum to characterize them
• The VVS consists of three orthogonal axes
  – Extent: how many pixels will be affected by an error
  – Magnitude: how severe the error is within the affected region
  – Persistence: how long the error will affect the output
• For an error to be important, it must rank high on all three axes
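The "high on all three axes" rule can be expressed as a conjunction over the three ratings. The sketch below is one possible encoding, not the authors' method: the 0-to-1 scale, the 0.5 threshold, and the example ratings are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VVSRating:
    """Position of an error on the three VVS axes (hypothetical 0..1 scale)."""
    extent: float       # 0 = single pixel, 1 = whole screen
    magnitude: float    # 0 = unnoticeable, 1 = insufferable
    persistence: float  # 0 = transient, 1 = indefinite


def is_important(rating: VVSRating, threshold: float = 0.5) -> bool:
    """An error matters only if it ranks high on every axis."""
    lowest = min(rating.extent, rating.magnitude, rating.persistence)
    return lowest >= threshold


# Illustrative ratings only; the talk does not give numeric values.
one_frame_pixel_glitch = VVSRating(extent=0.0, magnitude=1.0, persistence=0.0)
corrupt_matrix_stack = VVSRating(extent=0.9, magnitude=0.9, persistence=0.9)

print(is_important(one_frame_pixel_glitch))
print(is_important(corrupt_matrix_stack))
```

Taking the minimum makes a single low axis sufficient to dismiss an error, which is what distinguishes this view from AVF's treatment of every ACE bit as equally critical.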
42
VVS Examples
[Figure: examples plotted on the three VVS axes. Axes: Extent, Magnitude, Persistence. Scale labels: Unnoticeable, Insufferable, Whole Screen, Transient, Indefinite. Plotted examples: Single Pixel, Single Texel, Single Vertex, Uniform Shader State, Clear Color, Antialiasing State.]
43
Application of the VVS
• We analyzed the OpenGL 2.0 state vector using the VVS
  – A “proxy” for real microarchitectures
  – Has shortcomings, but is a reasonable first-order approximation of GPU state
  – We identified some sets of structures of:
    ▪ High importance
    ▪ Intermediate importance
    ▪ Little importance
44
Important Structures
• Matrix stack
• Scissor, depth, and alpha test enable bits and functions
• Viewport and clip plane function coefficients
• Depth range
• Lighting enable bits
• Culling enable bits
• Various polygon state, including fill mode, offset, and stippling
• Various texture state, including enable, active texture, and current texture unit
• Current drawbuffer
• Uniform and control-related shader state
45
Less Important State
• Various vertex array state, including size, type, stride, etc.
• Similar state for other types of arrays: texture, fog, color, etc.
• High levels of the hierarchical Z-pyramid
• Texture contents
46
Unimportant State
• The framebuffer
• Shader data registers
• Antialiasing state
• Various non-array vertex attribute state
• Obviously all of this changes in the context of GPGPU
47
Mitigation Techniques for Graphics Applications
• Full protection, via ECC or similar, on small, not easily recovered important state
  – Various enable bits, matrix stacks, shader control state, clip and viewport coefficients, etc.
• Parity on slightly less important state that can be easily recovered, e.g. shader store
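For state that can be reloaded cheaply, detection alone is enough: a parity bit flags the error and the word is fetched again from a reliable copy. The sketch below illustrates that idea under stated assumptions; the `ParityStore` class, its fields, and the in-memory "backing" list standing in for driver-held data are all hypothetical.

```python
from typing import List


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1


class ParityStore:
    """Parity-protected store that reloads a word when its parity check fails."""

    def __init__(self, backing: List[int]) -> None:
        self.backing = list(backing)  # assumed reliable copy (e.g. held by the driver)
        self.words = list(backing)    # vulnerable on-chip copy
        self.parity = [parity_bit(w) for w in backing]
        self.reloads = 0

    def read(self, index: int) -> int:
        if parity_bit(self.words[index]) != self.parity[index]:
            # Detected an odd number of flipped bits: recover from the backing copy.
            self.words[index] = self.backing[index]
            self.parity[index] = parity_bit(self.backing[index])
            self.reloads += 1
        return self.words[index]
```

Parity misses any even number of flips within one word, which is the trade-off the slide accepts for state of lower importance.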
48
Mitigation Techniques for Graphics Applications
• Periodic Detection
  – Requires reliable backing store and driver support
  – Takes advantage of ‘acceptable error’
  – Techniques
    ▪ Refresh-based
      · Piggyback error detection on DRAM refresh
    ▪ Demand
      · Piggyback detection on use
      · Example: CRC on vertex array, checked as read
      · Analogs for texture reads?
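The demand-based example, a CRC on a vertex array that is checked as the array is read, might look roughly like the following in software. This is a sketch only: the `CheckedVertexArray` class is hypothetical, CRC-32 from Python's `zlib` stands in for whatever checksum hardware would use, and the backing copy is assumed to be reliable, as the slide requires.

```python
import struct
import zlib
from typing import Sequence, Tuple


class CheckedVertexArray:
    """Vertex data guarded by a CRC that is verified on every read."""

    def __init__(self, floats: Sequence[float]) -> None:
        # Assumed-reliable backing store (for example, the driver's copy).
        self._backing = struct.pack(f"{len(floats)}f", *floats)
        self._crc = zlib.crc32(self._backing)
        # Vulnerable working copy, standing in for on-GPU memory.
        self.data = bytearray(self._backing)
        self.recoveries = 0

    def read(self) -> Tuple[float, ...]:
        if zlib.crc32(self.data) != self._crc:
            # Corruption detected on use: restore from the backing store.
            self.data = bytearray(self._backing)
            self.recoveries += 1
        count = len(self.data) // 4
        return struct.unpack(f"{count}f", self.data)


vertices = CheckedVertexArray([0.0, 1.0, 2.0])
vertices.data[5] ^= 0x10  # simulate a transient bit flip
print(vertices.read(), vertices.recoveries)
```

Because the check runs only when the array is read, an error can persist until the next use, which is the ‘acceptable error’ window the slide refers to.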
49
Conclusions
• Architectural vulnerability increasingly deserves the attention of the graphics community
  – But AVF is a poor metric for graphics computation
• The Visual Vulnerability Spectrum provides a more useful metric
  – Extent, persistence, magnitude
• Graphics hardware design suggests some novel, simple error mitigation techniques
  – Partitioned protection, periodic detection
50
Future Work
• GPGPU
  – Opportunities afforded by GPU design
  – Redundancy
    ▪ Macro: SLI/Crossfire/video-out based
    ▪ Micro: redundantly combine shader units in space or time
  – Secure backing store
    ▪ Suggests checkpointing-type solutions
51
Acknowledgements
• This work is supported by
  – A Graduate Research Fellowship from ATI
  – NSF Grants CCF-0429765 and CCR-0306404
  – Army Research Office grant #W911NF-04-1-0288
  – And a research grant from Intel MRL
• Thanks to the reviewers for their helpful and insightful comments
• Thanks also to Shubu Mukherjee for some otherwise unavailable information
52
Thank you!
• Questions?