Instruction-level Tracing: Framework & Applications

Instruction-level Tracing: Framework & Applications. Sanjay Bhansali, Binary Technologies Group, Center for Software Excellence (CSE), Microsoft. 11/04/2005

Context: Program analysis and transformation technology can have a huge impact on the engineering of software. Center for Software Excellence: part of the Windows Core OS Division; balances research and innovation with a focus on deployment. Binary Technologies Group: binary analysis, using static and dynamic approaches. Program analysis (PA) can have a big impact on engineering complex systems: making them more reliable, finding defects early, and making them perform better. 9/20/2018

Outline: Applications of Execution Traces; Dynamic Translation; Trace Capture; Trace Replay; Applications; Related Work; Summary.

Applications of Execution Traces: Debugging; Regression Analysis; Bug detection; Coverage Analysis; Optimization; Impact analysis; Usage analysis; … We want to build a framework that makes it easy to attack many different kinds of problems.

Run Once, Analyze Many: A complete instruction-level trace gives deterministic, full-fidelity replay of user-mode execution. Pros: run once, analyze multiple times. Cons: trace size, performance.

Framework for Instruction-level Tracing and Analysis: Task and machine independent; user-mode processes; modest overhead (space and time); on-demand tracing; reduces engineering effort for building analysis tools.

Dynamic Binary Translation: Runtime interpretation/translation of binary instructions. Pros: requires no static instrumentation or special symbol information; handles dynamically generated code and self-modifying code. Cons: approximately 5x slower than native execution. The competing alternative is to do it statically.

Nirvana Architecture: The core component is an instruction emulator that we call Nirvana. [Figure labels: Nirvana Client, Nirvana API, Code Cache, JIT translator, Application, VM monitor; User/Kernel boundary; Nirvana Driver, Operating System.]

JIT Translation Example. Native code: mov eax, [ebp]. Translated code: mov EDX, tls.ebp; mov EAX, [EDX].

JIT Translation Example (with a memory-read callback). Native code: mov eax, [ebp]. Translated code: mov EDX, tls.ebp; mov ECX, tls; call MemReadCallback; mov EAX, [EDX].

Code Cache Management: Single code cache: contention, locality. Per-thread code cache: code bloat. P+d code caches, where P = number of processors: reuse code caches when possible; fall back on interpretation.
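The "P+d code caches" idea can be sketched as a small fixed pool. This is only an illustration under stated assumptions: the slide does not say how caches are assigned, so the class name CodeCachePool, the constant kInterpret, and the rule "interpret when no cache is free" are hypothetical, and the sketch is single-threaded.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical marker: no cache is free, so the caller should interpret.
const int kInterpret = -1;

// A fixed pool of P + d code caches that threads borrow and return.
class CodeCachePool {
public:
    CodeCachePool(std::size_t processors, std::size_t extra)
        : inUse_(processors + extra, false) {}

    std::size_t capacity() const { return inUse_.size(); }

    // Hand out the lowest-numbered free cache, reusing released ones.
    int acquire() {
        for (std::size_t i = 0; i < inUse_.size(); ++i) {
            if (!inUse_[i]) {
                inUse_[i] = true;
                return static_cast<int>(i);
            }
        }
        return kInterpret;  // pool exhausted: fall back on interpretation
    }

    // Return a cache to the pool so another thread can reuse it.
    void release(int id) {
        assert(id >= 0 && static_cast<std::size_t>(id) < inUse_.size());
        inUse_[static_cast<std::size_t>(id)] = false;
    }

private:
    std::vector<bool> inUse_;
};
```

With P = 2 and d = 1, three threads get caches 0 to 2 and a fourth gets kInterpret until one of the caches is released.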

Self-modifying code: Snoop on system calls that flush the hardware cache. Watch the page protection of code bytes: mark the page if it is non-writable, and flush the code cache on a page protection change; otherwise insert a self-modification check for the instruction. Fall back on interpretation if there are too many code cache flushes.

Nirvana API: RegisterEventCallback(event, callback). Events: Translation, InstructionStart, MemRead, MemWrite, FlowChange, Sequencing.

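Example Nirvana Client:

    /* Memory Read Logger */
    bool Initialize() {
        /* Give up if the Nirvana client cannot be initialized. */
        if (!InitializeNirvanaClient())
            return false;
        RegisterCallback(MemReadEvent, MemCallback);
        return true;
    }

    /* Called on every memory read: log the instruction pointer, address and size. */
    void MemCallback(NirvContext *ctx, void *pAddr, int nBytes) {
        X86REGS *pRegs = (X86REGS *) ctx->cpuRegs;
        Log(pRegs->InstructionPtr(), pAddr, nBytes);
    }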

Tracing & Replay Overview: [Figure labels: Record Process: Application, Nirvana Emulation, Trace Writer; Trace Log; Playback Process: Trace Reader, Nirvana Replay, Debugger with >>, << and || controls; Defect; … Different Machines.]

Trace Writer: Log only what cannot be regenerated by the processor: values read from memory; values changed by the kernel; machine- and time-sensitive instructions (cpuid, rdtsc). Everything else can be regenerated. Trace size is ~4-5 bytes per instruction. Trace files are self-contained.

Optimization: Trace select reads. Observation: hardware caches eliminate most off-chip reads. Use the same trick to optimize logging: have the logger and replayer simulate identical cache memories, and log only cache misses. Average trace size is <1 bit per instruction.

Example:

    for (j = 0; j < 10; j++) { i = i + j; }
    k = i;            // value read is 46
    System_call();
    k = i;            // value read is 0 (not predicted)

The only read that is not predicted, and is therefore logged, is the one that follows the system call.
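The prediction-cache optimization can be sketched as follows. This is a minimal sketch, not Nirvana's implementation: the direct-mapped layout, the 256-entry size, 32-bit values, and the names PredictionCache, LogRecord, TraceLogger and TraceReplayer are all assumptions made for illustration. Both sides update an identical cache on every write and every logged read, so the replayer can tell from a count of predicted reads when the next log record applies.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Direct-mapped cache of last-seen values (size and indexing are assumptions).
class PredictionCache {
public:
    PredictionCache() : entries_() {}  // value-initialized: every entry invalid
    bool lookup(std::uint64_t addr, std::uint32_t& value) const {
        const Entry& e = entries_[index(addr)];
        if (!e.valid || e.addr != addr) return false;
        value = e.value;
        return true;
    }
    void store(std::uint64_t addr, std::uint32_t value) {
        Entry& e = entries_[index(addr)];
        e.valid = true;
        e.addr = addr;
        e.value = value;
    }
private:
    struct Entry { bool valid; std::uint64_t addr; std::uint32_t value; };
    static std::size_t index(std::uint64_t addr) {
        return static_cast<std::size_t>((addr >> 2) % 256);
    }
    std::array<Entry, 256> entries_;
};

struct LogRecord {
    std::uint64_t predictedBefore;  // correctly predicted reads since the last record
    std::uint32_t value;            // the value the cache failed to predict
};

class TraceLogger {
public:
    TraceLogger() : hits_(0) {}
    void onWrite(std::uint64_t addr, std::uint32_t value) { cache_.store(addr, value); }
    void onRead(std::uint64_t addr, std::uint32_t actual) {
        std::uint32_t predicted = 0;
        if (cache_.lookup(addr, predicted) && predicted == actual) {
            ++hits_;  // predicted: nothing is written to the log
            return;
        }
        LogRecord record = {hits_, actual};
        log_.push_back(record);  // miss: log the value and resynchronize
        hits_ = 0;
        cache_.store(addr, actual);
    }
    const std::vector<LogRecord>& log() const { return log_; }
private:
    PredictionCache cache_;
    std::vector<LogRecord> log_;
    std::uint64_t hits_;
};

class TraceReplayer {
public:
    explicit TraceReplayer(const std::vector<LogRecord>& log)
        : log_(log), next_(0), hits_(0) {}
    void onWrite(std::uint64_t addr, std::uint32_t value) { cache_.store(addr, value); }
    std::uint32_t onRead(std::uint64_t addr) {
        if (next_ < log_.size() && hits_ == log_[next_].predictedBefore) {
            std::uint32_t logged = log_[next_++].value;  // this read was a miss
            hits_ = 0;
            cache_.store(addr, logged);
            return logged;
        }
        std::uint32_t value = 0;
        bool hit = cache_.lookup(addr, value);
        assert(hit && "logger and replayer caches diverged");
        (void)hit;
        ++hits_;
        return value;
    }
private:
    PredictionCache cache_;
    std::vector<LogRecord> log_;
    std::size_t next_;
    std::uint64_t hits_;
};
```

Replaying the slide's scenario with this sketch, the write of 46 to i lets the first read be predicted, and only the read after the system call produces a log record.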

Sequence Points & Checkpoints: Tracing uses per-thread streams for performance. Sequence points are used to impose a partial order on instruction executions across threads. Checkpoint frames allow random access into the trace (every 5 million instructions). Hard key frames enable a ring buffer. [Figure labels: lock xadd, User/Kernel, Kernel/User, Exception, Module Load.]
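One way to picture sequence points is a shared counter sampled by each per-thread stream. The sketch below is an assumption-laden illustration, not the trace format itself: the names SequencePoint, ThreadStream and orderedBefore are hypothetical, and a single atomic counter stands in for whatever mechanism the tracer uses.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// A point in one thread's stream, stamped with a globally unique number.
struct SequencePoint {
    std::uint64_t sequence;          // global order across all threads
    std::uint64_t instructionIndex;  // position within this thread's stream
};

// Per-thread stream: counts instructions locally and touches the shared
// counter only at sequencing events (e.g. lock-prefixed instructions or
// user/kernel transitions).
class ThreadStream {
public:
    explicit ThreadStream(std::atomic<std::uint64_t>& counter)
        : counter_(counter), instructions_(0) {}

    void onInstruction() { ++instructions_; }

    void onSequencingEvent() {
        SequencePoint point = {counter_.fetch_add(1), instructions_};
        points_.push_back(point);
    }

    const std::vector<SequencePoint>& points() const { return points_; }

private:
    std::atomic<std::uint64_t>& counter_;
    std::uint64_t instructions_;
    std::vector<SequencePoint> points_;
};

// Two distinct sequence points are ordered by their global numbers.
// Instructions between sequence points on different threads stay unordered,
// which is why the order is only partial.
inline bool orderedBefore(const SequencePoint& a, const SequencePoint& b) {
    assert(a.sequence != b.sequence);  // numbers are unique by construction
    return a.sequence < b.sequence;
}
```

Because threads only synchronize at sequencing events, the common path (onInstruction) stays thread-local, which matches the slide's point about per-thread streams being used for performance.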

Trace Writer Performance

Application   Simulated instructions (millions)   Trace file size   Bits/instruction   Native execution time   Time while tracing   Overhead
Gzip          24,097                              245 MB            0.09               11.7s                   187s                 15.98
Excel         1,781                               99 MB             0.47               18.2s                   105s                 5.76
Power Point   7,392                               528 MB            0.60               43.6s                   247s                 5.66
IE            116                                 5 MB              0.50               0.499s                  6.94s                13.90
Vulcan        2,408                               152 MB            0.53               2.74s                   46.6s                17.01
Satsolver     9,431                               1300 MB           1.16               9.78s                   127s                 12.98

Trace Reader - Replay: Nirvana requests code & data via the Fetch operations. TraceReader uses the same prediction cache as TraceWriter. [Figure labels: Instruction Fetch, Trace Log, Data Read, Nirvana, Miss, Data Fetch, Prediction Cache, Data Write.]

Trace Reader - Navigation: Navigation involves going back to the closest checkpoint frame before the destination and executing forward to the destination from there. [Figure: timeline positions 1-8 marking the Destination, the Current Position and the Checkpoint Frame; labels T1, T2, T3.]
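The navigation rule can be sketched as a binary search over checkpoint positions. This is a hedged sketch: it assumes checkpoints are identified by instruction counts kept in a sorted vector, treats a checkpoint exactly at the destination as usable, and the names SeekPlan and planSeek are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Where to restart from and how far to execute forward afterwards.
struct SeekPlan {
    bool found;                  // false if no checkpoint precedes the destination
    std::uint64_t checkpoint;    // position of the chosen checkpoint frame
    std::uint64_t stepsForward;  // instructions to replay from that checkpoint
};

// Pick the closest checkpoint at or before `destination`.
// `checkpoints` must be sorted in increasing order.
inline SeekPlan planSeek(const std::vector<std::uint64_t>& checkpoints,
                         std::uint64_t destination) {
    assert(std::is_sorted(checkpoints.begin(), checkpoints.end()));
    SeekPlan plan = {false, 0, 0};
    // upper_bound finds the first checkpoint strictly after the destination;
    // the one just before it is the closest usable checkpoint.
    std::vector<std::uint64_t>::const_iterator it =
        std::upper_bound(checkpoints.begin(), checkpoints.end(), destination);
    if (it == checkpoints.begin()) return plan;
    --it;
    plan.found = true;
    plan.checkpoint = *it;
    plan.stepsForward = destination - *it;
    return plan;
}
```

The same plan serves forward and backward navigation, since the current position does not enter the calculation; with checkpoints every 5 million instructions, at most about 5 million instructions are re-executed per seek.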

Time Travel Debugging: Examine a program as it runs backwards to figure out the root cause of a problem. Reverse breakpoint; step back; search backwards in time. Used to diagnose bugs in shipped products.

Truscan: Defect Detection Tool. Scan traces for bugs that “hide”: memory leaks, dangling pointers, uninitialized memory. Report bugs that really happen: no false positives. Debug with time travel debugging.

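Example: Memory Leak Detection. Trace: eax = HeapAlloc(42); mov [0x4004], eax; eax = 0; mov [0x4004], 0. Tracked heap block: ADDR = 0x3004, SIZE = 42, REFCOUNT counting the references held in eax and at 0x4004. When the last reference is overwritten, the block is reported: Leak!! This example is trivial, but …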
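The reference-counting idea in this example can be sketched as a scanner driven by trace events. This is an illustration only, not Truscan's algorithm: the class name LeakScanner, its event methods, and the use of strings to name registers and memory locations are hypothetical, and only exact base-address pointers are counted (no interior pointers).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

class LeakScanner {
public:
    // A heap allocation returned `block`; nothing references it yet.
    void onAlloc(std::uint64_t block, std::size_t size) {
        Block b = {size, 0};
        blocks_[block] = b;
    }

    // The block was freed properly, so it can no longer leak.
    void onFree(std::uint64_t block) { blocks_.erase(block); }

    // `location` (a register or memory cell) now holds `value`.
    void onStore(const std::string& location, std::uint64_t value) {
        // Count the new reference first so that re-storing the same pointer
        // never drops the count to zero in between.
        std::map<std::uint64_t, Block>::iterator b = blocks_.find(value);
        const bool isPointer = (b != blocks_.end());
        if (isPointer) ++b->second.refCount;

        std::map<std::string, std::uint64_t>::iterator old = holds_.find(location);
        if (old != holds_.end()) {
            const std::uint64_t oldBlock = old->second;
            holds_.erase(old);
            release(oldBlock);
        }
        if (isPointer) holds_[location] = value;
    }

    // Blocks whose last reference was overwritten without a free.
    const std::vector<std::uint64_t>& leaks() const { return leaks_; }

private:
    struct Block { std::size_t size; int refCount; };

    void release(std::uint64_t block) {
        std::map<std::uint64_t, Block>::iterator b = blocks_.find(block);
        if (b == blocks_.end()) return;  // already freed
        assert(b->second.refCount > 0);
        if (--b->second.refCount == 0) {
            leaks_.push_back(block);
            blocks_.erase(b);
        }
    }

    std::map<std::uint64_t, Block> blocks_;
    std::map<std::string, std::uint64_t> holds_;
    std::vector<std::uint64_t> leaks_;
};
```

Feeding it the four steps from the slide takes the count for block 0x3004 through 1, 2, 1 and 0, and the block is reported when [0x4004] is overwritten with 0.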

Statistics: A Windows application (under development): 600 million instructions; 80,000 allocations; 30 million pointers; 48 leaks (8 unique bugs). Native: ~9 seconds. Trace: ~44 seconds. Analyze: ~41 minutes (3 GHz, single-threaded, 1 GB RAM).

Regression Analysis: [Figure: TraceDiff compares Trace 1 and Trace 2. Panel labels: Callgraph (OS1, OS2, App 1; Foo, Bar, m1, m2, m3, m4, p1); Instruction Sequence (mov edi, edi; push ebp; mov ebp, esp; sub esp, 0x54; cmp [ebp+24], 0; jne …; call …; mov …); Coverage (Foo, Bar, m1, m2, m3, p1).]

Related Work: Process virtualization: DynamoRIO, Mojo, DELI, ReVirt, Valgrind. Instrumentation: ATOM, Vulcan, SHADE, Pin. Trace compression: VPC. Reverse debugging: ReVirt, Traceback, BugNet, Flashback, FDR. Program/trace diffing & applications: Zeller, Zhang & Gupta.

Summary: Flexible framework for instruction-level tracing and analysis. Complete, full-fidelity traces. Run once, analyze multiple times. Reasonable overhead. Many useful applications: debugging, defect detection, optimization, …

Shameless self promotion! Hiring for internships and full-time positions at all levels Contact: sanjaybh@microsoft.com 9/20/2018

Questions