1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14.

Slides:



Advertisements
Similar presentations
D. Tam, R. Azimi, L. Soares, M. Stumm, University of Toronto Appeared in ASPLOS XIV (2009) Reading Group by Theo 1.
Advertisements

Reducing Leakage Power in Peripheral Circuits of L2 Caches Houman Homayoun and Alex Veidenbaum Dept. of Computer Science, UC Irvine {hhomayou,
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The.
Using Virtual Load/Store Queues (VLSQs) to Reduce The Negative Effects of Reordered Memory Instructions Aamer Jaleel and Bruce Jacob Electrical and Computer.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Enabling Efficient On-the-fly Microarchitecture Simulation Thierry Lafage September 2000.
Pipelined Profiling and Analysis on Multi-core Systems Qin Zhao Ioana Cutcutache Weng-Fai Wong PiPA.
SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.
Memory System Characterization of Big Data Workloads
Persistent Code Caching Exploiting Code Reuse Across Executions & Applications † Harvard University ‡ University of Colorado at Boulder § Intel Corporation.
Online Performance Auditing Using Hot Optimizations Without Getting Burned Jeremy Lau (UCSD, IBM) Matthew Arnold (IBM) Michael Hind (IBM) Brad Calder (UCSD)
Colorado Computer Architecture Research Group Architectural Support for Enhanced SMT Job Scheduling Alex Settle Joshua Kihm Andy Janiszewski Daniel A.
Phase Detection Jonathan Winter Casey Smith CS /05/05.
Decomposing Memory Performance Data Structures and Phases Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley Department of Computer Sciences.
On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks Bernd Mohr 1, Allen D. Malony 2, Rudi Eigenmann 3 1 Forschungszentrum.
LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks Feng Qin, Cheng Wang, Zhenmin Li, Ho-seop Kim, Yuanyuan.
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
The Memory Behavior of Data Structures Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley Department of Computer Sciences The University.
Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A.Wood Computer Sciences Department, University of Wisconsin- Madison.
Variational Path Profiling Erez Perelman*, Trishul Chilimbi †, Brad Calder* * University of Califonia, San Diego †Microsoft Research, Redmond.
Code Coverage Testing Using Hardware Performance Monitoring Support Alex Shye, Matthew Iyer, Vijay Janapa Reddi and Daniel A. Connors University of Colorado.
DATA ADDRESS PREDICTION Zohair Hyder Armando Solar-Lezama CS252 – Fall 2003.
Catching Accurate Profiles in Hardware Satish Narayanasamy, Timothy Sherwood, Suleyman Sair, Brad Calder, George Varghese Presented by Jelena Trajkovic.
Basics of Operating Systems March 4, 2001 Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard.
Rapid Identification of Architectural Bottlenecks via Precise Event Counting John Demme, Simha Sethumadhavan Columbia University
A Comparison of Software and Hardware Techniques for x86 Virtualization Keith Adams Ole Agesen Oct. 23, 2006.
Benchmarks Prepared By : Arafat El-madhoun Supervised By:eng. Mohammad temraz.
Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
PMaC Performance Modeling and Characterization Performance Modeling and Analysis with PEBIL Michael Laurenzano, Ananta Tiwari, Laura Carrington Performance.
Introduction 1-1 Introduction to Virtual Machines From “Virtual Machines” Smith and Nair Chapter 1.
Electrical and Computer Engineering University of Wisconsin - Madison Prefetching Using a Global History Buffer Kyle J. Nesbit and James E. Smith.
ACMSE’04, ALDepartment of Electrical and Computer Engineering - UAH Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++
Authors – Jeahyuk huh, Doug Burger, and Stephen W.Keckler Presenter – Sushma Myneni Exploring the Design Space of Future CMPs.
Memory Hierarchy Adaptivity An Architectural Perspective Alex Veidenbaum AMRM Project sponsored by DARPA/ITO.
Adaptive GPU Cache Bypassing Yingying Tian *, Sooraj Puthoor†, Joseph L. Greathouse†, Bradford M. Beckmann†, Daniel A. Jiménez * Texas A&M University *,
Operating Systems: Internals and Design Principles
Virtual Application Profiler (VAPP) Problem – Increasing hardware complexity – Programmers need to understand interactions between architecture and their.
BarrierWatch: Characterizing Multithreaded Workloads across and within Program-Defined Epochs Socrates Demetriades and Sangyeun Cho Computer Frontiers.
Guiding Ispike with Instrumentation and Hardware (PMU) Profiles CGO’04 Tutorial 3/21/04 CK. Luk Massachusetts Microprocessor Design.
Static Identification of Delinquent Loads V.M. Panait A. Sasturkar W.-F. Fong.
Combining Software and Hardware Monitoring for Improved Power and Performance Tuning Eric Chi, A. Michael Salem, and R. Iris Bahar Brown University Division.
Dynamic Branch Prediction During Context Switches Jonathan Creekmore Nicolas Spiegelberg T NT.
Application Communities Phase 2 (AC2) Project Overview Nov. 20, 2008 Greg Sullivan BAE Systems Advanced Information Technologies (AIT)
Best detection scheme achieves 100% hit detection with
Protecting C and C++ programs from current and future code injection attacks Yves Younan, Wouter Joosen and Frank Piessens DistriNet Department of Computer.
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association SYSTEM ARCHITECTURE GROUP DEPARTMENT OF COMPUTER.
An Offline Approach for Whole-Program Paths Analysis using Suffix Arrays G. Pokam, F. Bodin.
BITS Pilani, Pilani Campus Today’s Agenda Role of Performance.
1 University of Maryland Using Information About Cache Evictions to Measure the Interactions of Application Data Structures Bryan R. Buck Jeffrey K. Hollingsworth.
Qin Zhao1, Joon Edward Sim2, WengFai Wong1,2 1SingaporeMIT Alliance 2Department of Computer Science National University of Singapore
Agile Paging: Exceeding the Best of Nested and Shadow Paging
PINTOS: An Execution Phase Based Optimization and Simulation Tool) PINTOS: An Execution Phase Based Optimization and Simulation Tool) Wei Hsu, Jinpyo Kim,
Simone Campanoni A research CAT Simone Campanoni
Gift Nyikayaramba 30 September 2014
Assessing and Understanding Performance
Lazy Diagnosis of In-Production Concurrency Bugs
Antonia Zhai, Christopher B. Colohan,
Predictive Performance
Using Dead Blocks as a Virtual Victim Cache
Phase Capture and Prediction with Applications
Adaptive Code Unloading for Resource-Constrained JVMs
Massachusetts Institute of Technology
Introduction to Virtual Machines
COMS 361 Computer Organization
Introduction to Virtual Machines
Dynamic Binary Translators and Instrumenters
Phase based adaptive Branch predictor: Seeing the forest for the trees
Canturk Isci Gilberto Contreras Margaret Martonosi
Presentation transcript:

1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March San Jose, CA

2 The complexity crunch hardware complexity ↑ + software complexity ↑ ⇓ too many things happening concurrently, complex interactions and side-effects ⇓ less understanding of program execution behavior ↓

3 S/W developer –Computation vs. communication ratio –Function/path frequency –Test coverage H/W designer –Cache behavior –Branch prediction System administrator –Interaction between processes The importance of program behavior characterization Better understanding of program characteristics can lead to more robust software, and more efficient hardware and systems

4 Common approaches to program understanding Overhead Level of detail Versatility Space and time overhead Coarse grained summaries vs. instruction level and contextual information Portability and ability to adapt to different uses

5 Common approaches to program understanding ProfilingSimulation Overheadhighvery high Level of detail very high Versatilityhighvery high

6 Common approaches to program understanding ProfilingSimulation HW Counters Overheadhighvery highvery low Level of detail very high very low Versatilityhighvery highvery low

7 Slowdown due to HW counters (counting L1 misses for 181.mcf on P4) Time (seconds) (>2000%) (1%) (10%) (1%) (35%) (325%) (% slowdown)

8 Common approaches to program understanding ProfilingSimulation HW Counters Desirable Approach Overheadhighvery highvery lowlow Level of detail very high very lowhigh Versatilityhighvery highvery lowhigh

9 Common approaches to program understanding ProfilingSimulation HW Counters Desirable Approach Overheadhighvery highvery lowlow Level of detail very high very lowhigh Versatilityhighvery highvery lowhigh UMI

10 Key components Dynamic Binary Instrumentation –Complete coverage, transparent, language independent, versatile, … Bursty Simulation –Sampling and fast forwarding techniques –Detailed context information –Reasonable extrapolation and prediction

11 Ubiquitous Memory Introspection online mini-simulations analyze short memory access profiles recorded from frequently executed code regions Key concepts –Focus on hot code regions –Selectively instrument instructions –Fast online mini-simulations –Actionable profiling results for online memory optimizations

12 Working prototype Implemented in DynamoRIO –Runs on Linux –Used on Intel P4 and AMD K7 Benchmarks –SPEC 2000, SPEC 2006, Olden –Server apps: MySQL, Apache –Desktop apps: Acrobat-reader, MEncoder

13 UMI is cheap and non-intrusive ( SPEC2K reference workloads on P4) Average 0% 20% 40% 60% 80% 168.wupwise 171.swim 172.mgrid 173.applu 177.mesa 178.galgel 179.art 183.equake187.facerec 188.ammp 189.lucas 191.fma3d 200.sixtrack 301.apsi164.gzip 175.vpr 176.gcc 181.mcf 186.crafty 197.parser 252.eon 253.perlbmk 254.gap 255.vortex 56.bzip2 300.twolf em3d health mst treeadd tsp ft DynamoRIO UMI Average overhead is 14% 1% more than DynamoRIO % slowdown compared to native execution

14 What can UMI do for you? Inexpensive introspection everywhere Coarse grained memory analysis –Quick and dirty Fine grained memory analysis –Expose opportunities for optimizations Runtime memory-specific optimizations –Pluggable prefetching, learning, adaptation

15 Coarse grained memory analysis Experiment: measure cache misses in three ways –HW counters –Full cache simulator (Cachegrind) –UMI Report correlation between measurements –Linear relationship of two sets of data 1 strong positive correlation 0 strong negative correlation no correlation

16 Cache miss correlation results HW counter vs. Cachegrind –0.99 –20x to 100x slowdown HW counter vs. UMI –0.88 –Less than 2x slowdown in worst case (HW, CG) (HW, UMI)

17 What can UMI do for you? Inexpensive introspection everywhere Coarse grained memory analysis –Quick and dirty Fine grained memory analysis –Expose opportunities for optimizations Runtime memory-specific optimizations –Pluggable prefetching, learning, adaptation

18 Fine grained memory analysis Experiment: predict delinquent loads using UMI –Individual loads with cache miss rate greater than threshold Delinquent load set determined according to full cache simulator (Cachegrind) –Loads that contribute 90% of total cache misses Measure and report two metrics Delinquent loads identified by Cachegrind Delinquent loads predicted by UMI recall false positive

19 UMI delinquent load prediction accuracy Recall (higher is better) False positive (lower is better) benchmarks with ≥ 1% miss rate 88%55% benchmarks with < 1% miss rate 26%59%

20 What can UMI do for you? Inexpensive introspection everywhere Coarse grained memory analysis –Quick and dirty Fine grained memory analysis –Expose opportunities for optimizations Runtime memory-specific optimizations –Pluggable prefetcher, learning, adaptation

21 Experiment: online stride prefetching Use results of delinquent load prediction Discover stride patterns for delinquent loads Insert instructions to prefetch data to L2 Compare runtime for UMI and P4 with HW stride prefetcher

22 Data prefetching results summary running time normalized to native execution (lower is better) SW prefetching + DynamoRIO + UMI HW prefetching + Native Binary Combined prefetching + DynamoRIO + UMI ft Average mcf 171.swim 172.mgrid 179.art 183.equake 188.ammp 191.fma3d 301.apsi em3d mst ft Average More than offsets slowdowns from binary instrumentation!

23 The Gory Details

24 UMI components Region selector Instrumentor Profile analyzer

25 Region selector Identify representative code regions –Focus on traces, loops –Frequently executed code –Piggy back on binary instrumentor tricks Reinforce with sampling –Time based, or leverage HW counters –Naturally adapt to program phases

26 Instrumentor Record address references –Insert instructions to record address referenced by memory operation Manage profiling overhead –Clone code trace (akin to Arnold-Ryder scheme) –Selective instrumentation of memory operations E.g., ignore stack and static data

27 Recording profiles Code Trace Profile T1 T2 T1 counter 0x1040x0280x013 0x012 0x100 0x0240x011 op3op2op1 Code Trace T1 Code Trace T2 0x0310x032 op2op1 counter page protection to detect profile overflow Address Profiles early trace exist

28 Mini-simulator Triggered when code or address profile is full Simple cache simulator –Currently simulate L2 cache of host –LRU replacement –Improve approximations with techniques similar to offline fast forwarding simulators Warm up and periodic flushing Other possible analyzer –Reference affinity model –Data reuse and locality analysis

29 Mini-simulations and parameter sensitivity 181.mcf sampling frequency threshold recall false positive 1.40 normalized performancerecall ratiofalse positive ratio Regular data structures If sampling threshold too high, starts to exceed loop bounds: miss out on profiling important loops Adaptive threshold is best

30 Mini-simulations and parameter sensitivity K2K4K8K32K address profile length 197.parser recall false positive 16K normalized performancerecall ratiofalse positive ratio Irregular data structures Need longer profiles to reach useful conclusions

31 Summary UMI is lightweight and has a low overhead –1% more than DynamoRIO –Can be done with Pin, Valgrind, etc. –No added hardware necessary –No synchronization or syscall headaches –Other cores can do real work! Practical for extracting detailed information –Online and workload specific –Instruction level memory reference profiles –Versatile and user-programmable analysis Facilitate migration of offline memory optimizations to online setting

32 Future work More types of online analysis –Include global information –Incremental (leverage previous execution info) –Combine analysis across multiple threads More runtime optimizations –E.g., advanced prefetch optimization Hot data stream prefetch Markov prefetcher –Locality enhancing data reorganization Pool allocation Cooperative allocation between different threads Your ideas here…