On Cosmic Rays, Bat Droppings, and What to Do about Them
David Walker, Princeton University
with Jay Ligatti, Lester Mackey, George Reis, and David August

A Little-Publicized Fact: … = 23

How do Soft Faults Happen?
High-energy particles pass through devices and collide with silicon atoms. The collision generates an electric charge that can flip a single bit.
- "Galactic particles" are high-energy particles that penetrate to Earth's surface, through buildings and walls.
- "Solar particles" affect satellites; they cause < 5% of terrestrial problems.
- Alpha particles come from bat droppings.

How Often do Soft Faults Happen?
[Chart: soft-fail rates for NYC; Tucson, AZ; Denver, CO; and Leadville, CO. IBM Soft Fail Rate Study, mainframes, 1983-86. [Zeiger-Puchner 2004]]
Some data points:
- 1983-86: Leadville (the highest incorporated city in the US): 1 fail every 2 days
- 1983-86: a subterranean experiment, under 50 ft of rock: no fails in 9 months
- 2004: 1 fail per year for a laptop with 1 GB of RAM at sea level
- 2004: 1 fail per trans-Pacific round trip [Zeiger-Puchner 2004]

How Often do Soft Faults Happen?
[Chart: soft error rate trends [Shekhar Borkar, Intel, 2004], with markers "we are approximately here" and "6 years from now".]
Soft error rates go up as:
- voltages decrease
- feature sizes decrease
- transistor density increases
- clock rates increase
...and all of these are future manufacturing trends.

How Often do Soft Faults Happen?
- In 1948, Presper Eckert noted that the cascading effects of a single-bit error destroyed hours of Eniac's work. [Zeiger-Puchner 2004]
- In 2000, Sun server systems deployed at America Online, eBay, and others crashed due to cosmic rays. [Baumann 2002]
- "The wake-up call came in the end of [...] billion-dollar factory ground to a halt every month due to... a single bit flip." [Zeiger-Puchner 2004]
- Los Alamos National Lab's Hewlett-Packard ASC Q 2048-node supercomputer was crashing regularly from soft faults due to cosmic radiation. [Michalak 2005]

What Problems do Soft Faults Cause?
A single bit in memory gets flipped, or a single bit in the processor logic gets flipped, and then one of the following happens:
- there is no difference in externally observable behavior
- the processor completely locks up
- the computation is silently corrupted:
  - a register value is corrupted (simple data fault)
  - a control-flow transfer goes to the wrong place (control-flow fault)
  - a different opcode is interpreted (instruction fault)

Mitigation Techniques
Hardware: error-correcting codes; redundant hardware.
- Pros: fast for a fixed policy
- Cons: the fault-tolerance policy is decided at hardware design time; mistakes cost millions; one-size-fits-all policy; expensive
Software and hybrid schemes: replicate computations.
- Pros: immediate deployment; policies customized to the environment and application; reduced hardware cost
- Cons: for the same universal policy, slower (but not as much as you'd think); and it may not actually work! Much research in the HW/compilers community is completely lacking proof.

Agenda
Answer basic scientific questions about software-controlled fault tolerance:
- Do software-only or hybrid SW/HW techniques actually work?
- For what fault models? How do we specify them?
- How can we prove it?
Build compilers that produce software that runs reliably on faulty hardware.
Moreover: let's not replace faulty hardware with faulty software. A killer app for type systems & proof-carrying code.

Lambda Zap: A Baby Step
Lambda Zap [ICFP 06] is:
- a lambda calculus that exhibits intermittent data faults, plus operators to detect and correct them
- a type system that guarantees the observable outputs of well-typed programs do not change in the presence of a single fault
- expressive enough to implement an ordinary typed lambda calculus
End result: the foundation for a fault-tolerant typed intermediate language.

The Fault Model
Lambda Zap models simple data faults only:

  ( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

Not modelled:
- memory faults (better protected using ECC hardware)
- control-flow faults (i.e., faults during control-flow transfer)
- instruction faults (i.e., faults in instruction opcodes)
Goal: to construct programs that tolerate 1 fault; observers cannot distinguish between fault-free and 1-fault runs.
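To make the fault model concrete, here is a minimal OCaml sketch (the machine type and zap function are illustrative names, not from the paper): a toy machine whose registers hold integers, where a single register value can be replaced by arbitrary junk while memory stays intact, mirroring ( M, F[ v1 ] ) ---> ( M, F[ v2 ] ).

  (* A toy machine state: memory is assumed safe (e.g., ECC-protected);
     only register values are subject to zaps. *)
  type machine = { memory : int array; registers : int array }

  (* Model a single data fault: one register's value v1 is replaced by
     an arbitrary value v2; memory is untouched. *)
  let zap (m : machine) (reg : int) (junk : int) : machine =
    let regs = Array.copy m.registers in
    regs.(reg) <- junk;
    { m with registers = regs }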

Lambda to Lambda Zap: The Main Idea
Original:
  let x = 2 in
  let y = x + x in
  out y
The translation replicates instructions, with an atomic majority vote + output at the end:
  let x1 = 2 in let x2 = 2 in let x3 = 2 in
  let y1 = x1 + x1 in
  let y2 = x2 + x2 in
  let y3 = x3 + x3 in
  out [y1, y2, y3]
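Here is a minimal OCaml sketch of the same idea, assuming at most one fault per run; vote is an illustrative stand-in for the atomic majority vote built into out. If a = b, the majority is a (c may be the corrupted copy); if a <> b, one of a and b took the fault, so c must still be correct.

  (* Majority of three, sound under the single-fault assumption. *)
  let vote a b c = if a = b then a else c

  let () =
    let x1 = 2 and x2 = 2 and x3 = 2 in
    let y1 = x1 + x1 and y2 = x2 + x2 and y3 = x3 + x3 in
    print_int (vote y1 y2 y3)   (* plays the role of out [y1, y2, y3] *)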

Lambda to Lambda Zap: The Main Idea
Now suppose a fault corrupts one copy (x3 becomes 7). The corrupted value is copied and percolates through the computation, but the final output is unchanged:
  let x1 = 2 in let x2 = 2 in let x3 = 7 in
  let y1 = x1 + x1 in
  let y2 = x2 + x2 in
  let y3 = x3 + x3 in
  out [y1, y2, y3]

Lambda to Lambda Zap: Control-Flow
Recursively translate subexpressions, with a majority vote on the control-flow transfer (function calls replicate arguments, results, and the function itself):
  let x = 2 in
  if x then e1 else e2
becomes
  let x1 = 2 in let x2 = 2 in let x3 = 2 in
  if [x1, x2, x3] then [[ e1 ]] else [[ e2 ]]
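A sketch of the voted branch in OCaml (vote as above; then_e and else_e are hypothetical thunks standing for the translated branches [[ e1 ]] and [[ e2 ]]): a single corrupted copy of the guard cannot divert the transfer.

  let vote a b c = if a = b then a else c

  (* Vote the three copies of the guard before transferring control. *)
  let branch b1 b2 b3 then_e else_e =
    if vote b1 b2 b3 then then_e () else else_e ()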

Almost too easy. Can anything go wrong?...

Yes! Optimization reduces replication overhead dramatically (e.g., ~43% for 2 copies), but it can be unsound! The original implementation of SWIFT [Reis et al.] optimized away all of the redundancy, leaving an unreliable implementation!!

Faulty Optimizations
In general, optimizations eliminate redundancy, but fault tolerance requires redundancy. For example, CSE rewrites
  let x1 = 2 in let x2 = 2 in let x3 = 2 in
  let y1 = x1 + x1 in
  let y2 = x2 + x2 in
  let y3 = x3 + x3 in
  out [y1, y2, y3]
into
  let x1 = 2 in
  let y1 = x1 + x1 in
  out [y1, y1, y1]
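Replaying the optimized program with the illustrative vote from above shows why this is fatal: after CSE, all three voter inputs alias one computation, so a single fault in y1 becomes the "majority".

  let vote a b c = if a = b then a else c

  let () =
    let x1 = 2 in
    let y1 = x1 + x1 in
    (* If y1 is zapped, vote y1 y1 y1 returns the zapped value:
       the redundancy the voter relies on is gone. *)
    print_int (vote y1 y1 y1)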

The Essential Problem
Bad code: the voters depend on a common value x1.
  let x1 = 2 in
  let y1 = x1 + x1 in
  out [y1, y1, y1]
Good code: the voters do not depend on a common value (red on red; green on green; blue on blue).
  let x1 = 2 in let x2 = 2 in let x3 = 2 in
  let y1 = x1 + x1 in
  let y2 = x2 + x2 in
  let y3 = x3 + x3 in
  out [y1, y2, y3]

A Type System for Lambda Zap
Key idea: types track the "color" of the underlying value and prevent interference between colors.
  Colors  C ::= R | G | B
  Types   T ::= C int | C bool | C (T1, T2, T3) -> (T1', T2', T3')
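One way to render this grammar as an OCaml datatype (a sketch of the syntax above, not the paper's formal development):

  type color = R | G | B

  type ty =
    | Int of color                                      (* C int *)
    | Bool of color                                     (* C bool *)
    | Arrow of color * (ty * ty * ty) * (ty * ty * ty)  (* C (T1,T2,T3) -> (T1',T2',T3') *)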

Sample Typing Rules
Judgement form: G |--z e : T, where z ::= C | .
Simple value typing rules:

  (x : T) in G
  ------------
  G |--z x : T

  G |--z C n : C int

  G |--z C true : C bool

Sample Typing Rules
Judgement form: G |--z e : T, where z ::= C | .
Sample expression typing rules:

  G |--z e1 : C int    G |--z e2 : C int
  --------------------------------------
  G |--z e1 + e2 : C int

  G |--z e1 : R bool   G |--z e2 : G bool   G |--z e3 : B bool
  G |--z e4 : T        G |--z e5 : T
  ------------------------------------------------------------
  G |--z if [e1, e2, e3] then e4 else e5 : T

  G |--z e1 : R int    G |--z e2 : G int    G |--z e3 : B int
  G |--z e4 : T
  -----------------------------------------------------------
  G |--z out [e1, e2, e3]; e4 : T
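As a sketch of how the "+" rule enforces non-interference (check_add, Type_error, and the stripped-down ty are illustrative, not the paper's checker): both operands must be ints of one and the same color C, and the result gets color C, so differently colored values never mix.

  type color = R | G | B
  type ty = Int of color | Bool of color   (* arrows omitted in this fragment *)

  exception Type_error of string

  (* check : env -> expr -> ty is assumed and passed in as a parameter. *)
  let check_add check env e1 e2 =
    match check env e1, check env e2 with
    | Int c1, Int c2 when c1 = c2 -> Int c1   (* G |--z e1 + e2 : C int *)
    | _ -> raise (Type_error "operands of + must be ints of the same color")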

Sample Typing Rules
Judgement form: G |--z e : T, where z ::= C | .
Recall the "zap rule" from the operational semantics:

  ( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

Before the fault: |-- v1 : T. After the fault: |-- v2 : T need not hold.
==> How will we obtain type preservation?

Sample Typing Rules
Judgement form: G |--z e : T, where z ::= C | .
Recall the "zap rule" from the operational semantics:

  ( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

Before the fault: |-- v1 : C U. After the fault: |--C v2 : C U, by the rule (no premises):

  G |--C C v : C U

"Faulty typing" occurs within a single color only.

Theorems
- Theorem 1: Well-typed programs are safe, even when there is a single error. [ICFP 06]
- Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors [with a caveat]. [ICFP 06]
- Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap [that satisfies the caveat]. [ICFP 06]
- Theorem 4: There is an extended type system for which Theorem 2 is completely true, without the caveat. [Lester Mackey, undergrad project]

Future Work
- Advanced fault models: control-flow faults, instruction faults ==> requires encoding analysis
- New hybrid SW/HW fault detection algorithms
- A type- and reliability-preserving compiler:
  - typed assembly language [type safety with control-flow faults proven, but much research remains]
  - type- and reliability-preserving optimizations

Conclusions
- Semiconductor manufacturers are deeply worried about how to deal with soft faults in future architectures (10+ years out).
- This is a killer app for proofs and types.
- Ad: I'm looking for grad students and a post-doc. Help me work on ZAP and PADS!

end!

The Caveat

Bad, but well-typed code: out [2, 3, 3]
- After no faults, out [2, 3, 3] outputs 3.
- After 1 fault, e.g. out [2, 2, 3], it outputs 2.
Goal: 0-fault and 1-fault executions should be indistinguishable.
Solution: the three computations must be independent, but equivalent.
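Replaying the example with the illustrative vote from earlier makes the distinguishability concrete:

  let vote a b c = if a = b then a else c

  let () =
    assert (vote 2 3 3 = 3);  (* no faults: out [2, 3, 3] outputs 3 *)
    assert (vote 2 2 3 = 2)   (* one fault turned a 3 into a 2: outputs 2 *)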

The Caveat
Modified typing:

  G |--z e1 : R U    G |--z e2 : G U    G |--z e3 : B U    G |--z e4 : T
  G |--z e1 ~~ e2    G |--z e2 ~~ e3
  ----------------------------------------------------------------------
  G |--z out [e1, e2, e3]; e4 : T

See Lester Mackey's 60-page TR (a single-semester undergrad project).

Function operational semantics follows

Lambda Zap: Triples
"Triples" (as opposed to tuples) make the typing and translation rules very elegant, so we baked them right into the calculus.
Introduction form: [e1, e2, e3]
- a collection of 3 items, not a pointer to a struct
- each of the 3 is stored in a separate register
- a single fault affects at most one
Elimination form: let [x1, x2, x3] = e1 in e2

Lambda to Lambda Zap: Control-Flow
Function calls also get a majority vote on the control-flow transfer:
  let f = \x.e in f 2
becomes
  let [f1, f2, f3] = \x. [[ e ]] in
  [f1, f2, f3] [2, 2, 2]
Operational semantics:
  (M; let [f1, f2, f3] = \x.e1 in e2) ---> (M, l = \x.e1; e2[l/f1][l/f2][l/f3])

Related Work Follows

Software Mitigation Techniques
Examples: N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], etc.
Hybrid hardware-software techniques: watchdog processors, CRAFT [Reis et al. 2005], etc.
Pros:
- immediate deployment: if your system is suffering soft-error-related failures, you may deploy new software immediately (this would have benefitted Los Alamos Labs, etc.)
- policies may be customized to the environment and application
- reduced hardware cost
Cons:
- for the same universal policy, slower (but not as much as you'd think)
- IT MIGHT NOT ACTUALLY WORK!