Instruction-level Tracing: Framework & Applications Sanjay Bhansali Binary Technologies Group Center for Software Excellence (CSE) Microsoft 11/04/2005
Context
  Program analysis and transformation technology can have a huge impact on the engineering of software: complex systems become more reliable, defects are found earlier, and programs perform better.
  Center for Software Excellence
    Part of the Windows Core OS Division
    Balances research on innovation with a focus on deployment
  Binary Technologies Group
    Binary analysis
    Static and dynamic approaches
  9/20/2018
Outline
  Applications of Execution Traces
  Dynamic Translation
  Trace Capture
  Trace Replay
  Applications
  Related Work
  Summary
Applications of Execution Traces
  Debugging
  Regression analysis
  Bug detection
  Coverage analysis
  Optimization
  Impact analysis
  Usage analysis
  …
  We want to build a framework that makes it easy to attack many different kinds of problems.
Run Once, Analyze Many
  Complete instruction-level trace
  Deterministic, full-fidelity replay of user-mode execution
  Pros
    Run once, analyze multiple times
  Cons
    Trace size, performance
Framework for Instruction-level Tracing and Analysis
  Task and machine independent
  User-mode processes
  Modest overhead (space and time)
  On-demand tracing
  Reduce engineering effort for building analysis tools
Dynamic Binary Translation
  Runtime interpretation/translation of binary instructions
  Pros
    Requires no static instrumentation or special symbol information
    Handles dynamically generated code and self-modifying code
  Cons
    Approximately 5x slower than native execution
  The competing alternative is to do this statically.
Nirvana Architecture
  The core component is an instruction emulator that we call Nirvana.
  [Figure: architecture diagram. Labels: Nirvana Client, Nirvana API, Code Cache, JIT translator, Application, VM monitor, User/Kernel boundary, Nirvana Driver, Operating System.]
JIT Translation Example
  Native code:
    mov eax, [ebp]
  Translated code:
    mov EDX, tls.ebp
    mov EAX, [EDX]
JIT Translation Example (with a memory-read callback)
  Native code:
    mov eax, [ebp]
  Translated code:
    mov EDX, tls.ebp
    mov ECX, tls
    call MemReadCallback
    mov EAX, [EDX]
Code Cache Management
  Single code cache
    Contention, locality
  Per-thread code cache
    Code bloat
  P+d code caches, where P = number of processors
    Reuse code caches when possible
    Fall back on interpretation
Self-modifying code
  Snoop on system calls to flush hardware cache
  Watch page protection of code bytes
    Mark page if non-writable, and flush code cache on page protection change
    Insert self-modification instruction check otherwise
  Fall back on interpretation if there are too many code cache flushes
Nirvana API
  RegisterEventCallback(event, callback)
  Events:
    Translation
    InstructionStart
    MemRead
    MemWrite
    FlowChange
    Sequencing
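The event-callback model on this slide can be illustrated with a small registry. This is only a sketch under assumptions: the `Event` enum, the `Registry` class, the `Callback` signature, and the `Fire` method are hypothetical names invented for the illustration and are not the Nirvana API. Only the event names and the shape of `RegisterEventCallback(event, callback)` come from the slide.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical event kinds, named after the events listed on the slide.
enum class Event { Translation, InstructionStart, MemRead, MemWrite, FlowChange, Sequencing };

// Hypothetical callback signature: the address touched and the access size.
using Callback = std::function<void(const void* addr, std::size_t nBytes)>;

class Registry {
public:
    // Remember a client callback for one kind of event.
    void RegisterEventCallback(Event e, Callback cb) {
        callbacks_[e].push_back(std::move(cb));
    }

    // What an emulator might do when an event occurs: invoke every callback
    // registered for it. Returns the number of callbacks invoked.
    std::size_t Fire(Event e, const void* addr, std::size_t nBytes) const {
        auto it = callbacks_.find(e);
        if (it == callbacks_.end()) return 0;
        for (const auto& cb : it->second) cb(addr, nBytes);
        return it->second.size();
    }

private:
    std::map<Event, std::vector<Callback>> callbacks_;
};
```

A client would register a lambda for `Event::MemRead` and leave the other events unregistered, so that only memory reads reach its code.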
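Example Nirvana Client
  /* Memory Read Logger */
  void MemCallback(NirvContext *ctx, void *pAddr, int nBytes);

  bool Initialize()
  {
      /* Register the callback only if the client initialized successfully. */
      if (!InitializeNirvanaClient())
          return false;
      RegisterCallback(MemReadEvent, MemCallback);
      return true;
  }

  /* Log the instruction pointer, address, and size of every memory read. */
  void MemCallback(NirvContext *ctx, void *pAddr, int nBytes)
  {
      X86REGS *pRegs = (X86REGS *) ctx->cpuRegs;
      Log(pRegs->InstructionPtr(), pAddr, nBytes);
  }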
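Tracing & Replay Overview
  [Figure: record and playback. Record Process labels: Application, Nirvana, Emulation, Trace Writer. Playback Process labels: Trace Reader, Nirvana, Replay, Debugger, with playback controls (>>, <<, ||) and a marked Defect. The two processes share the Trace Log and can run on different machines.]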
Trace Writer
  Log only what cannot be regenerated by the processor
    Values read from memory
    Values changed by the kernel
    Machine- and time-sensitive instructions (cpuid, rdtsc)
  Everything else can be regenerated
  Trace size is ~4-5 bytes per instruction
  Trace files are self-contained
Optimization: Trace select reads
  Observation: hardware caches eliminate most off-chip reads
  Use the same trick to optimize logging:
    Have the logger and replayer simulate identical cache memories
    Only log cache misses
  Average trace size is <1 bit per instruction
Example
  for (j = 0; j < 10; j++) {
      i = i + j;
  }
  k = i;            // value read is 46
  System_call();
  k = i;            // value read is 0 (not predicted)

  The only read not predicted and logged follows the system call.
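The idea of logging only mispredicted reads can be sketched as follows. This is a minimal illustration under assumptions, not the actual trace format: `PredictionCache`, `Recorder`, `Replayer`, and `LogEntry` are hypothetical names, the cache is a simple direct-mapped table, and each log entry stores an explicit read index, which is far less compact than the sub-bit-per-instruction figure quoted on the slide.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One logged (mispredicted) read: which read it was, and the value seen.
struct LogEntry { std::uint64_t readIndex; std::uint32_t value; };

// Direct-mapped cache simulated identically by recorder and replayer.
class PredictionCache {
public:
    explicit PredictionCache(std::size_t slots = 64) : slots_(slots) {}
    bool Predict(std::uint32_t addr, std::uint32_t& value) const {
        const Slot& s = slots_[addr % slots_.size()];
        if (!s.valid || s.addr != addr) return false;
        value = s.value;
        return true;
    }
    void Update(std::uint32_t addr, std::uint32_t value) {
        slots_[addr % slots_.size()] = Slot{true, addr, value};
    }
private:
    struct Slot { bool valid; std::uint32_t addr; std::uint32_t value; };
    std::vector<Slot> slots_;
};

class Recorder {
public:
    // Called for every memory read the traced program performs.
    void OnRead(std::uint32_t addr, std::uint32_t actual) {
        std::uint32_t predicted = 0;
        bool hit = cache_.Predict(addr, predicted) && predicted == actual;
        if (!hit) log_.push_back({reads_, actual});  // misprediction: log it
        cache_.Update(addr, actual);
        ++reads_;
    }
    // Program writes are regenerated on replay, so they only update the cache.
    void OnWrite(std::uint32_t addr, std::uint32_t value) { cache_.Update(addr, value); }
    const std::vector<LogEntry>& log() const { return log_; }
private:
    PredictionCache cache_;
    std::vector<LogEntry> log_;
    std::uint64_t reads_ = 0;
};

class Replayer {
public:
    explicit Replayer(std::vector<LogEntry> log) : log_(std::move(log)) {}
    // Returns the value the original execution read at this point.
    std::uint32_t OnRead(std::uint32_t addr) {
        std::uint32_t value = 0;
        if (next_ < log_.size() && log_[next_].readIndex == reads_) {
            value = log_[next_++].value;   // logged read: take it from the trace
        } else {
            cache_.Predict(addr, value);   // unlogged read: the cache predicted it
        }
        cache_.Update(addr, value);
        ++reads_;
        return value;
    }
    void OnWrite(std::uint32_t addr, std::uint32_t value) { cache_.Update(addr, value); }
private:
    PredictionCache cache_;
    std::vector<LogEntry> log_;
    std::size_t next_ = 0;
    std::uint64_t reads_ = 0;
};
```

In the slide's example, the write of 46 to `i` reaches both caches, so the first `k = i` is predicted and costs nothing in the log. The kernel's change to `i` bypasses `OnWrite`, so the second read mispredicts and is the only entry logged.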
Sequence points & Checkpoints
  [Figure: per-thread timelines with sequence points at: lock xadd, User/Kernel, Kernel/User, Exception, Module Load.]
  Tracing uses per-thread streams for performance
  Sequence points are used to impose a partial order on instruction executions across threads
  Checkpoint frames allow random access into the trace (every 5 million instructions)
  Hard key frames enable a ring buffer
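One way to picture how sequence points order per-thread streams is to merge the streams by sequence number. The sketch below assumes that each sequence point carries a number drawn from a single shared counter; that detail, along with the `SequencePoint` struct and `MergeStreams` function, is an assumption made for illustration and is not stated on the slide. Instructions between two sequence points of different threads remain unordered, which is why the order is only partial.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record written to a thread's stream at a sequence point.
struct SequencePoint {
    std::uint64_t seq;   // assumed: value of a counter shared by all threads
    int thread;          // which per-thread stream the record came from
    std::string what;    // e.g. "lock xadd", "module load"
};

// Merge per-thread streams (each already ascending in seq) into one
// sequence ordered by seq, repeatedly taking the smallest stream head.
std::vector<SequencePoint> MergeStreams(
        const std::vector<std::vector<SequencePoint>>& streams) {
    std::vector<std::size_t> cursor(streams.size(), 0);
    std::vector<SequencePoint> merged;
    for (;;) {
        int best = -1;
        for (std::size_t t = 0; t < streams.size(); ++t) {
            if (cursor[t] == streams[t].size()) continue;  // stream exhausted
            if (best < 0 ||
                streams[t][cursor[t]].seq < streams[best][cursor[best]].seq) {
                best = static_cast<int>(t);
            }
        }
        if (best < 0) break;  // all streams exhausted
        merged.push_back(streams[best][cursor[best]++]);
    }
    return merged;
}
```

A replayer could use the merged order to decide which thread to run up to its next sequence point.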
Trace Writer Performance
  Application  Simulated instructions (millions)  Trace file size  Bits/instruction  Native execution time  Time while tracing  Overhead (x)
  Gzip         24,097                             245 MB           0.09              11.7 s                 187 s               15.98
  Excel        1,781                              99 MB            0.47              18.2 s                 105 s               5.76
  PowerPoint   7,392                              528 MB           0.60              43.6 s                 247 s               5.66
  IE           116                                5 MB             0.50              0.499 s                6.94 s              13.90
  Vulcan       2,408                              152 MB           0.53              2.74 s                 46.6 s              17.01
  Satsolver    9,431                              1300 MB          1.16              9.78 s                 127 s               12.98
Trace Reader - Replay
  Nirvana requests code and data via the Fetch operations
  TraceReader uses the same prediction cache as TraceWriter
  [Figure: replay data flow. Labels: Instruction Fetch, Trace Log, Data Read, Nirvana, Miss, Data Fetch, Prediction Cache, Data Write.]
Trace Reader - Navigation
  Navigation involves going back to the closest checkpoint frame before the destination and executing forward to the destination from there.
  [Figure: timeline numbered 1 to 8 showing checkpoint frames, the current position, and the destination. Label T1, T2, T3.]
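The navigation rule above amounts to a search over checkpoint positions. The following is a minimal sketch under assumptions: positions are instruction counts from the start of the trace, `NearestCheckpoint`, `PlanNavigation`, and `NavigationPlan` are hypothetical names, and a checkpoint exactly at the destination is treated as usable.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the position of the closest checkpoint at or before `destination`.
// `checkpoints` must be sorted ascending; if none qualifies, position 0
// (the start of the trace) is returned.
std::uint64_t NearestCheckpoint(const std::vector<std::uint64_t>& checkpoints,
                                std::uint64_t destination) {
    // upper_bound finds the first checkpoint strictly after the destination;
    // the one just before it is the checkpoint we want.
    auto it = std::upper_bound(checkpoints.begin(), checkpoints.end(), destination);
    return it == checkpoints.begin() ? 0 : *(it - 1);
}

// Where to restore state from, and how far to execute forward afterwards.
struct NavigationPlan {
    std::uint64_t checkpoint;
    std::uint64_t instructionsToReplay;
};

NavigationPlan PlanNavigation(const std::vector<std::uint64_t>& checkpoints,
                              std::uint64_t destination) {
    std::uint64_t cp = NearestCheckpoint(checkpoints, destination);
    return NavigationPlan{cp, destination - cp};
}
```

With checkpoints every 5 million instructions, as on the earlier slide, any destination is reached by replaying fewer than 5 million instructions, whether the destination lies ahead of or behind the current position.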
Time Travel Debugging
  Examine a program as it runs backwards to figure out the root cause of a problem.
    Reverse breakpoint
    Step back
    Search backwards in time
  Used to diagnose bugs in shipped products
Truscan: Defect Detection Tool
  Scan traces for bugs that “hide”
    Memory leaks
    Dangling pointers
    Uninitialized memory
  Report bugs that really happen: no false positives
  Debug with time travel debugging
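Example: Memory Leak Detection
  Instruction sequence:
    eax = HeapAlloc(42);
    mov [0x4004], eax
    eax = 0;
    mov [0x4004], 0
  Tracked allocation: ADDR = 0x3004, SIZE = 42, REFCOUNT (references: eax, 0x4004)
  [Animation: the REFCOUNT changes as eax and 0x4004 gain and lose the pointer (slide labels: 2, 1, 1, pass), ending in Leak!!]
  This example is trivial, but …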
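The reference-counting idea in this example can be sketched as a small scanner that is fed allocation and store events from a trace. This is an illustration under assumptions, not Truscan's implementation: `LeakScanner` and its methods are hypothetical names, locations are identified by strings, and only pointers to the exact start of an allocation are tracked.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

class LeakScanner {
public:
    // `dest` receives the pointer returned by the allocator.
    void OnAlloc(const std::string& dest, std::uint32_t addr, std::uint32_t size) {
        allocs_[addr] = Alloc{size, 0};
        OnStore(dest, addr);
    }

    // A register or memory location `loc` is overwritten with `value`.
    void OnStore(const std::string& loc, std::uint32_t value) {
        // Count the new reference first, so that rewriting the same
        // pointer into a location is never reported as a leak.
        auto target = allocs_.find(value);
        if (target != allocs_.end()) ++target->second.refCount;

        // Drop the reference the location held before, if any.
        auto held = holds_.find(loc);
        if (held != holds_.end()) {
            Release(held->second);
            holds_.erase(held);
        }
        if (target != allocs_.end()) holds_[loc] = value;
    }

    // A proper free ends tracking without reporting anything.
    void OnFree(std::uint32_t addr) { allocs_.erase(addr); }

    // Start addresses of allocations whose last reference was overwritten.
    const std::vector<std::uint32_t>& leaks() const { return leaks_; }

private:
    struct Alloc { std::uint32_t size; int refCount; };

    void Release(std::uint32_t addr) {
        auto it = allocs_.find(addr);
        if (it == allocs_.end()) return;      // already freed
        if (--it->second.refCount == 0) {     // last reference is gone
            leaks_.push_back(addr);
            allocs_.erase(it);
        }
    }

    std::map<std::uint32_t, Alloc> allocs_;        // live allocations by address
    std::map<std::string, std::uint32_t> holds_;   // location -> allocation it points to
    std::vector<std::uint32_t> leaks_;
};
```

Feeding the slide's four instructions to the scanner reports nothing until the final `mov [0x4004], 0`, which overwrites the last reference to the block at 0x3004.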
Statistics
  A Windows application (under development)
    600 million instructions
    80,000 allocations
    30 million pointers
    48 leaks (8 unique bugs)
  Native: ~9 seconds
  Trace: ~44 seconds
  Analyze: ~41 minutes (3 GHz, single-threaded, 1 GB RAM)
Regression Analysis
  [Figure: TraceDiff compares Trace 1 and Trace 2. Panel labels: Callgraph (OS1, OS2, App 1; Foo, bar, m1, m2, m3, m4, p1), Instruction Sequence (mov edi, edi; push ebp; mov ebp, esp; sub esp, 0x54; cmp [ebp+24], 0; jne …; call …; mov …), Coverage (Foo, Bar, m1, m2, m3, p1).]
Related Work
  Process Virtualization
    DynamoRIO, Mojo, DELI, ReVirt, Valgrind
  Instrumentation
    ATOM, Vulcan, SHADE, Pin
  Trace Compression
    VPC
  Reverse Debugging
    ReVirt, Traceback, BugNet, Flashback, FDR
  Program/Trace Diffing & Applications
    Zeller, Zhang & Gupta
Summary
  Flexible framework for instruction-level tracing and analysis
    Complete, full-fidelity traces
    Run once, analyze multiple times
    Reasonable overhead
  Many useful applications
    Debugging, defect detection, optimization, …
Shameless self-promotion!
  Hiring for internships and full-time positions at all levels
  Contact: sanjaybh@microsoft.com
Questions