Fault Analysis Using Pin
Srilatha (Bobbie) Manne, Intel
Copyright © 2006 Intel Corporation. All Rights Reserved.

What are we trying to do?
- Purpose: Simulate the occurrence of transient (or persistent) faults and analyze their impact on applications.
- Why Pin?
  - Easy to model faults and measure their impact
  - Relatively fast (5-10 minutes per fault injection)
  - Provides full program analysis

Pros & Cons
[Figure: fault-analysis approaches (Software Instrumentation, Architectural Simulator, RTL, Silicon) compared on Accuracy and Ease of Use]

Pin's View of the World
[Figure: Arch Reg, Memory, uArch State]

Modeling Microarchitectural Faults in Pin
- Accuracy of the fault methodology depends on the complexity of the underlying microarchitecture
  - Easier to model faults in an in-order, single-issue machine
- Build a microarchitectural model into Pin
  - A low-fidelity model may suffice
  - Adds complexity and slows down simulation
- Mimic certain types of microarchitectural faults in Pin

Example: Destination Register Transmission Fault
- Fault occurs in latches when forwarding instruction output
- Change the architectural value of the destination register at the instruction where the fault occurs
- NOTE: This is different from inserting a fault into the register file, because the destination is selected based on the instruction where the fault occurs
[Figure: Exec Unit, Latches, Bypass Logic, ROB, RS]

Example: Load Data Transmission Faults
- Fault occurs when loading data from the memory system
  1. Before the load instruction, insert the fault into memory
  2. Execute the load instruction
  3. After the load instruction, remove the fault from memory (Cleanup)
- NOTE: This models a fault occurring in the transmission of data from the STB or L1 Cache
[Figure: Load Buffer, STB, DCache, Latches]

Five-Step Program for Fault Analysis
1. Determine 'when' the fault occurs
2. Determine 'where' the fault occurs
3. Inject Fault
4. Cleanup (Optional)
5. Determine Outcome

Step 1: WHEN
- Reality: Assuming that environmental conditions stay the same, transient faults can occur with equal probability at any time during the run of the application.
- Approximation: Transient faults occur on any dynamic instruction with equal probability.

Step 1: WHEN
Sample Pin Tool: InstCount.C
- Purpose: Efficiently determines the number of dynamic instances of each static instruction.
- Output: For each static instruction
  - Function name
  - Dynamic instructions per static instruction

    IP: Count: Func: propagate_block.104
    IP: Count: Func: propagate_block.104
    IP: Count: Func: propagate_block.104
    IP: Count: Func: propagate_block.104
    IP: Count: Func: propagate_block.104
    IP: Count: Func: propagate_block.104

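A minimal sketch of how a per-static-instruction profile could be used to pick the "when": choose a dynamic-instruction index uniformly over the whole run, then map it back to a static instruction through cumulative counts. This is not part of InstCount.C; the names `StaticInstCount`, `TotalDynamicInsts`, and `StaticInstForDynamicIndex` are hypothetical, and the profile order stands in for execution order only for the purpose of weighting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: one row of an InstCount-style profile.
struct StaticInstCount {
    std::uint64_t ip;     // address of the static instruction
    std::uint64_t count;  // number of dynamic executions observed
};

// Total number of dynamic instructions in the profile.
std::uint64_t TotalDynamicInsts(const std::vector<StaticInstCount>& profile) {
    std::uint64_t total = 0;
    for (const auto& entry : profile) total += entry.count;
    return total;
}

// Map a dynamic index in [0, total) to the row whose cumulative range
// contains it. Returns profile.size() when the index is out of range.
std::size_t StaticInstForDynamicIndex(const std::vector<StaticInstCount>& profile,
                                      std::uint64_t target) {
    std::uint64_t seen = 0;
    for (std::size_t i = 0; i < profile.size(); ++i) {
        seen += profile[i].count;
        if (target < seen) return i;
    }
    return profile.size();
}
```

Drawing `target` uniformly from `[0, total)` gives each static instruction a chance proportional to its dynamic count, which matches the equal-probability-per-dynamic-instruction approximation.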
Step 2: WHERE
- Reality: The likelihood that a transient fault lands in a given structure is a function of that structure's size on the chip. Faults can occur in both architectural and microarchitectural state.
- Approximation: Pin only provides architectural state, not microarchitectural state (no uops, for instance). Two choices:
  - Inject faults only into architectural state, or
  - Build an approximation for some microarchitectural state

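One way to read "a function of the size of the structure" for architectural registers is to weight each register by its bit width: pick a flat bit index over all register bits, then translate it to a (register, bit) pair. The sketch below assumes exactly that; `RegInfo`, `FaultSite`, and `SiteForFlatBit` are hypothetical names, not the deck's `GetFaultyBit`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one injectable register.
struct RegInfo {
    const char* name;
    unsigned bits;  // width of the register in bits
};

// Result of the lookup; valid is false when the flat index is out of range.
struct FaultSite {
    std::size_t reg;
    unsigned bit;
    bool valid;
};

// Translate a flat bit index (0 .. sum of widths - 1) into (register, bit).
// Wider registers cover more flat indices, so they are chosen more often.
FaultSite SiteForFlatBit(const std::vector<RegInfo>& regs, std::uint64_t flat) {
    for (std::size_t i = 0; i < regs.size(); ++i) {
        if (flat < regs[i].bits) {
            return FaultSite{i, static_cast<unsigned>(flat), true};
        }
        flat -= regs[i].bits;
    }
    return FaultSite{regs.size(), 0u, false};
}
```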
Step 3: Injecting Fault
- Pass the context and other relevant information to an analysis routine that modifies the architectural state
- Inject the fault
- Flush the code cache to force immediate reinstrumentation
- Force execution at a particular point using the context

Step 4: Cleanup
- Cleanup is an optional step and is only necessary for modeling microarchitectural faults, not architectural faults
- Example: modeling a fault in the transmission of data to a load op

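The inject, load, cleanup sequence for a load data transmission fault can be illustrated outside Pin with plain memory. This is only a sketch of the idea under the assumption of a 32-bit load and a single-bit flip; `FaultyLoad` is a hypothetical helper, and in a real Pin tool the three steps would be split across analysis routines placed before and after the load.

```cpp
#include <cassert>
#include <cstdint>

// Model a transmission fault on one load: the loaded value is corrupted,
// but memory itself is left unchanged once the load has completed.
// Precondition: bit < 32.
std::uint32_t FaultyLoad(std::uint32_t* addr, unsigned bit) {
    const std::uint32_t mask = std::uint32_t{1} << bit;
    *addr ^= mask;                        // insert the fault into memory
    const std::uint32_t loaded = *addr;   // the load observes the faulty value
    *addr ^= mask;                        // cleanup: restore the original data
    return loaded;
}
```

Because XOR with the same mask is its own inverse, the cleanup step restores memory exactly, so only the consumer of this one load sees the fault.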
Step 5: Determining Outcome
Outcomes that can be tracked:
- Did the program complete?
- Did the program complete and have the correct IO result?
- If the program crashed, how many instructions were executed after fault injection before the crash?
- If the program crashed, why did it crash (trapping signals)?

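The questions above can be folded into a small classifier run after each injection. The sketch below is an assumption about how one might bucket results; the `Outcome` categories and the `RunResult` fields are hypothetical names chosen for illustration, not output of the deck's tools.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one fault-injection run.
struct RunResult {
    bool completed;               // program ran to completion
    bool ioMatchesGolden;         // output equals the fault-free output
    int signal;                   // trapping signal number, 0 if none
    std::uint64_t postFaultInsts; // instructions executed after injection
};

enum class Outcome { Masked, IoError, Crash };

// Bucket a run: a crash takes priority, then correctness of the IO result.
Outcome Classify(const RunResult& r) {
    if (!r.completed || r.signal != 0) return Outcome::Crash;
    return r.ioMatchesGolden ? Outcome::Masked : Outcome::IoError;
}
```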
Fault Insertion State Diagram
- Pre-Fault
  - START: Count By Basic Block
  - Reached Threshold? No: keep counting by basic block. Yes: Count Every Instruction
  - Found Inst? No: keep counting every instruction. Yes: go to Fault
- Fault
  - Insert Fault
  - Clear Code Cache
  - Restart Using Context
- Post Fault
  - Cleanup? Yes: Cleanup Fault. No: continue
  - Count Insts After Fault
  - Reached CheckPoint? Yes: Print HB & Update Checkpoint Counter. No: keep counting
  - Reached Max HB? Yes: Detach From Pin & Run to Completion. No: keep counting

Register Fault Pin Tool: RegFault.C (MAIN)

    int main(int argc, char *argv[])
    {
        if (PIN_Init(argc, argv)) {
            return Usage();
        }

        out_file.open(KnobOutputFile.Value().c_str());
        faultInst = KnobFaultInst.Value();   // dynamic instruction to fault

        // Coarse (per basic block) and fine (per instruction) counting.
        TRACE_AddInstrumentFunction(Trace, 0);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);

        // Trap the signals that indicate a crash after fault injection.
        PIN_AddSignalInterceptFunction(SIGSEGV, SigFunc, 0);
        PIN_AddSignalInterceptFunction(SIGFPE, SigFunc, 0);
        PIN_AddSignalInterceptFunction(SIGILL, SigFunc, 0);
        PIN_AddSignalInterceptFunction(SIGSYS, SigFunc, 0);

        PIN_StartProgram();   // never returns
        return 0;
    }

TRACE Instrumentation

    // Coarse-grain phase: count dynamic instructions one basic block at a time.
    if (fineGrainCount == false) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
            BBL_InsertIfCall(bbl, IPOINT_BEFORE, (AFUNPTR)FindFineGrainThreshold,
                             IARG_UINT32, BBL_NumIns(bbl), IARG_END);
            BBL_InsertThenCall(bbl, IPOINT_BEFORE, (AFUNPTR)SwitchToFineGrainCounting,
                               IARG_END);
        }
    }

TRACE Analysis

    // Add the block's instruction count; report when the fault point is near.
    UINT32 FindFineGrainThreshold(UINT32 i)
    {
        curDynInst += i;
        return (curDynInst >= (faultInst - fineGrainTrigger));
    }

    // Switch to per-instruction counting and force reinstrumentation.
    VOID SwitchToFineGrainCounting()
    {
        if (fineGrainCount == false) {
            fineGrainCount = true;
            PIN_RemoveInstrumentation();
        }
    }

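The two-phase counting scheme can be exercised without Pin. The sketch below mirrors the variable names from the slide (`curDynInst`, `faultInst`, `fineGrainTrigger`, `fineGrainCount`) but wraps them in a hypothetical `TwoPhaseCounter` struct. It assumes `fineGrainTrigger` is at least as large as the biggest basic block, so the coarse phase cannot step past the fault instruction, and it compares with an addition rather than the slide's subtraction to avoid unsigned underflow when `faultInst` is small.

```cpp
#include <cassert>
#include <cstdint>

struct TwoPhaseCounter {
    std::uint64_t faultInst;         // dynamic instruction to fault
    std::uint64_t fineGrainTrigger;  // margin at which fine counting starts
    std::uint64_t curDynInst;
    bool fineGrainCount;

    TwoPhaseCounter(std::uint64_t fault_inst, std::uint64_t trigger)
        : faultInst(fault_inst), fineGrainTrigger(trigger),
          curDynInst(0), fineGrainCount(false) {}

    // Coarse phase: called once per executed basic block.
    // Returns true once per-instruction counting should take over.
    bool OnBasicBlock(std::uint32_t numIns) {
        curDynInst += numIns;
        if (curDynInst + fineGrainTrigger >= faultInst) fineGrainCount = true;
        return fineGrainCount;
    }

    // Fine phase: called once per executed instruction.
    // Returns true when the fault instruction has been reached.
    bool OnInstruction() {
        ++curDynInst;
        return curDynInst >= faultInst;
    }
};
```

The coarse phase keeps overhead low for most of the run; only the last `fineGrainTrigger` or so instructions before the fault pay for a per-instruction analysis call.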
Instruction Instrumentation

    VOID Instruction(INS ins, VOID *v)
    {
        if (fineGrainCount == true) {
            if (faultDone == 0) {
                // Fine-grain phase: count every instruction until the
                // fault instruction is found, then inject the fault.
                INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)FindFaultInst, IARG_END);
                INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)InsertFault,
                                   IARG_CONTEXT, IARG_END);
            }
            if (faultDone == 1) {
                ….
            }
        }
    }

Instruction Analysis

    // Returns nonzero once the chosen dynamic instruction is reached.
    INT32 FindFaultInst()
    {
        curDynInst++;
        return (curDynInst >= faultInst);
    }

Fault Insertion Analysis Routine

    VOID InsertFault(CONTEXT* _ctxt)
    {
        srand(curDynInst);
        GetFaultyBit(_ctxt, &faultReg, &faultBit);   // choose register and bit

        UINT32 old_val;
        UINT32 new_val;
        old_val = PIN_GetContextReg(_ctxt, faultReg);
        faultMask = (1U << faultBit);     // unsigned shift, safe for bit 31
        new_val = old_val ^ faultMask;    // flip the selected bit
        PIN_SetContextReg(_ctxt, faultReg, new_val);

        PIN_RemoveInstrumentation();      // clear code cache
        faultDone = 1;
        PIN_ExecuteAt(_ctxt);             // restart from modified context; does not return
    }

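The core of the injection is a single-bit XOR. As a standalone illustration (no Pin calls), the hypothetical helper `FlipBit` below reproduces the mask arithmetic; the check value is taken from the deck's sample output, where a fault in bit 24 of esi turns bffeca90 into befeca90.

```cpp
#include <cassert>
#include <cstdint>

// Flip one bit of a 32-bit register value. Precondition: bit < 32.
// The mask is built from an unsigned 1 so that bit 31 is well defined.
std::uint32_t FlipBit(std::uint32_t value, unsigned bit) {
    const std::uint32_t mask = std::uint32_t{1} << bit;
    return value ^ mask;
}
```

Applying `FlipBit` twice with the same bit restores the original value, which is the property the optional cleanup step relies on.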
Post Fault Instruction Instrumentation

    VOID Instruction(INS ins, VOID *v)
    {
        if (fineGrainCount == true) {
            if (faultDone == 0) {
                ….
            }
            if (faultDone == 1) {
                // Count each instruction executed after the fault, on both
                // the fall-through path and the taken-branch path.
                if (INS_HasFallThrough(ins)) {
                    INS_InsertCall(ins, IPOINT_AFTER, (AFUNPTR)PrintHeartbeat, IARG_END);
                }
                if (INS_IsBranchOrCall(ins)) {
                    INS_InsertCall(ins, IPOINT_TAKEN_BRANCH, (AFUNPTR)PrintHeartbeat,
                                   IARG_END);
                }
            }
        }
    }

Post Fault Analysis

    VOID PrintHeartbeat()
    {
        postFaultInsts++;
        // Print a heartbeat each time the count reaches the next power of two.
        if (postFaultInsts & dumpMask) {
            out_file << "H: " << dec << dumpMask << endl;
            out_file.flush();
            dumpMask = dumpMask << 1;
        }
        // Past the last heartbeat: detach and run natively to completion.
        if (dumpMask > maxHB) {
            PIN_Detach();
        }
    }

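The heartbeat logic prints at exponentially spaced instruction counts (1, 2, 4, 8, ...), which keeps the log short while still bounding how long the program survived after the fault. The sketch below replays that schedule without Pin. The `Heartbeat` struct and its `printed` vector are hypothetical stand-ins for the tool's globals and output file, and the initial `dumpMask` of 1 is an assumption inferred from the sample output starting at "H: 1".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Heartbeat {
    std::uint64_t postFaultInsts;
    std::uint64_t dumpMask;               // next power of two to report
    std::uint64_t maxHB;                  // last heartbeat before detaching
    std::vector<std::uint64_t> printed;   // stands in for the "H: n" lines

    explicit Heartbeat(std::uint64_t max_hb)
        : postFaultInsts(0), dumpMask(1), maxHB(max_hb) {}

    // Call once per post-fault instruction.
    // Returns true when the tool should detach.
    bool Tick() {
        ++postFaultInsts;
        if (postFaultInsts & dumpMask) {
            printed.push_back(dumpMask);
            dumpMask <<= 1;
        }
        return dumpMask > maxHB;
    }
};
```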
OUTPUT

Fault Masked:

    IP: 8192fcf COUNT: REG: esi FBIT: 24 MASK: OLD: bffeca90 NEW: befeca90
    H: 1
    H: 2
    H: 4
    H: 8
    .
    H:

Program Failure:

    IP: 80babc0 COUNT: REG: ebp FBIT: 20 MASK: OLD: 0 NEW:
    H: 1
    H: 2
    H: 4
    H: 8
    H: 16
    H: 32
    Signal: 11
    PostFaultInsts: 38

Sample Results

Step 5: Determining Outcome, Extreme Edition
- In the InjectFault step (STEP 3):
  - Fork a process and inject the fault into one process (the parent process)
  - Communicate information between processes (mkfifo)
- After fault injection, keep track of all writes to memory
- At each checkpoint, compare architectural state and stores
- What if there's a control deviation?
  - For every control operation, compare the next IP between processes
  - If the control flow deviates, wait until both routines return from the function where the deviation occurred before checking state

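The checkpoint comparison amounts to diffing two snapshots: the register file and the set of bytes stored since the fault. The sketch below shows one possible shape for that comparison; it leaves out the fork and mkfifo plumbing, and `Snapshot`, `StateDiff`, and `CompareSnapshots` are hypothetical names. It assumes both processes report the same number of registers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical checkpoint state for one process.
struct Snapshot {
    std::vector<std::uint32_t> regs;               // architectural registers
    std::map<std::uint32_t, std::uint8_t> stores;  // address -> last byte stored
};

struct StateDiff {
    std::size_t regs;   // registers whose values differ
    std::size_t bytes;  // stored bytes that differ or exist on one side only
};

StateDiff CompareSnapshots(const Snapshot& golden, const Snapshot& faulty) {
    StateDiff diff = {0, 0};
    const std::size_t n = std::min(golden.regs.size(), faulty.regs.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (golden.regs[i] != faulty.regs[i]) ++diff.regs;
    }
    // Bytes written by the fault-free run that are missing or different.
    for (const auto& kv : golden.stores) {
        const auto it = faulty.stores.find(kv.first);
        if (it == faulty.stores.end() || it->second != kv.second) ++diff.bytes;
    }
    // Bytes written only by the faulty run.
    for (const auto& kv : faulty.stores) {
        if (golden.stores.count(kv.first) == 0) ++diff.bytes;
    }
    return diff;
}
```

A diff of zero registers and zero bytes at a checkpoint means the fault has not propagated to architectural state so far, which is the condition for stopping the comparison early.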
Step 5: Extreme Edition
- Adding this fork-and-compare feature takes time, but it can be done.
- What does it buy?
  - Does the fault propagate?
  - How far does it propagate? How many registers and bytes of memory does it impact?
  - What happens when there is a control deviation? Is there a higher incidence of program failure or IO error in the presence of a control deviation?

Pin Based Fault Checker
[Figure: the Fault Insertion State Diagram repeated, with a "No Change" annotation]

Fault Checker: Fault Insertion
[Flowchart, entry: Fault Insertion, exit: Post Fault]
- Nodes: Fork Process & Setup Communication Links; Parent Process? (Yes: Insert Fault); Restart Using Context; Cleanup Required?; Parent Process? (Yes: Cleanup Fault)
- Legend: Parent, Child, Both

Fault Checker: Post Fault
[Flowchart, entry: Post Fault]
- Nodes: Get Next Inst & Count Insts; CheckPoint?; Store OP?; Old Data != New Data?; Save Data; Ctrl OP?; Parent IP != Child IP?; Ctrl Deviation
- Checkpoint Comparison nodes: Parent?; Communicate Reg & Store Data to Parent; Done Or Cont?; Read Info From Child & Compare State; Parent State == Child State?; Send Continue Signal to Child; Send Done Signal to Child & Detach; Done?; Detach & Exit
- Legend: Parent, Child, Both

Fault Checker: Ctrl Deviation
[Flowchart, entry: Ctrl Deviation, exit: Checkpoint Comparison]
- Nodes: Get Next Inst; Store OP?; Old Data != New Data?; Save Data; Function Call (Call Counter ++); Function Return? (Call Counter --); Call Counter < 0? (Yes: Call Counter = 0)
- Legend: Parent, Child, Both

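The call counter in this flowchart detects when a process has returned from the function in which the control deviation occurred: nested calls increment it, returns decrement it, and a return that drives it below zero means the enclosing function has been left. The sketch below is one reading of the flowchart; `CtrlEvent` and `DeviationTracker` are hypothetical names.

```cpp
#include <cassert>

enum class CtrlEvent { Call, Return, Other };

struct DeviationTracker {
    int callCounter;

    DeviationTracker() : callCounter(0) {}

    // Feed one executed instruction's control kind.
    // Returns true when the process has returned from the function where
    // the deviation occurred, i.e. state can be compared at a checkpoint.
    bool OnEvent(CtrlEvent e) {
        if (e == CtrlEvent::Call) {
            ++callCounter;              // entered a nested function
        } else if (e == CtrlEvent::Return) {
            --callCounter;
            if (callCounter < 0) {      // left the deviating function
                callCounter = 0;
                return true;
            }
        }
        return false;
    }
};
```

Each process would run its own tracker, and the comparison would start only after both have reported true.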
Fault Checker: Additional Info
- Cannot check faults beyond a system call
  - Kill the child process and detach the parent process from Pin
  - Run the parent (faulty) process to completion
- Although not shown in the flow chart, the Pin tool detaches after reaching a maximum number of checkpoints
- Providing tighter bounds on ctrl deviation:
  - It may take a long time before returning from the function call
  - On a control deviation:
    - For both parent and child processes, save each store address and data
    - For the parent process, tag the store with the number of instructions executed since the control deviation occurred
  - After control merges, and if architectural state is the same between the two processes, walk the list of stores from oldest to youngest and determine where the two processes matched

Conclusion
Fault insertion using Pin is an effective way to determine the impact faults have within an application:
- Easy to use
- Enables full program analysis
- Accurately describes fault behavior once the fault has reached architectural state