Instant Profiling: Instrumentation Sampling for Profiling Datacenter Applications Hyoun Kyu Cho 1, Tipp Moseley 2, Richard Hank 2, Derek Bruening 2, Scott.

Presentation transcript:

Instant Profiling: Instrumentation Sampling for Profiling Datacenter Applications
Hyoun Kyu Cho 1, Tipp Moseley 2, Richard Hank 2, Derek Bruening 2, Scott Mahlke 1
1 University of Michigan, 2 Google

2 Datacenter Applications
- In 2010, US datacenters spent 70~90 billion kWh *
- Datacenter application performance is critical
- Profiling can help
* [Koomey'11]

3 Traditional Profiling
Diagram labels: Source Code, Instrumentation Build, Instrumented Binary, Input Data, Training Run, Profile Data
- Challenges for Datacenters
  - Need to run on live traffic
    - Difficult to isolate
  - Overheads
    - Value profiling: 3.8x slowdown (1)
    - Path profiling 31%, edge profiling 16% (2)
  - Binary management
    - Many programs, multiple versions
(1) [Calder'99] (2) [Ball'96]

4 Google-Wide Profiling [Ren et al.'10]
- Continuous profiling infrastructure for datacenters
- Negligible overhead
  - Sampling based
  - Aggregated profiling overhead less than 0.01%
- Limitations
  - Relies heavily on Performance Monitoring Units
  - Limited flexibility and portability

5 Goals
- Unified profiling infrastructure for datacenters
  - Flexible types of profile data
  - Portable across heterogeneous datacenter
- While maintaining
  - Low overhead
  - Does not burden binary management
Diagram labels: Sampling, Dynamic Binary Instrumentation

6 Instrumentation Sampling
Diagram labels: hardware, operating system, application, system call gateway

6 Instrumentation Sampling
Diagram labels: hardware, operating system, application, DynamoRIO [Bruening'04], dispatch, instrumentation engine, client, code cache, context switch

6 Instrumentation Sampling
Diagram labels: hardware, operating system, application, shepherding thread, start profiling, stop profiling, dispatch, instrumentation engine, client, code cache
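
The start/stop arrangement on the slide above can be pictured as a duty cycle. The following is a minimal sketch, not the authors' implementation: `SamplingSchedule`, its fields, and the timing values are hypothetical, and the model assumes the program runs at native speed outside profiling and ignores the cost of building instrumented code and of context switches. The 1.16 factor simply reuses the 16% edge-profiling overhead quoted on slide 3 as an illustrative input.

```python
from dataclasses import dataclass


@dataclass
class SamplingSchedule:
    """Hypothetical duty-cycle schedule driven by a shepherding thread."""
    profile_ms: float  # wall-clock time spent instrumented in each period
    period_ms: float   # length of one full sampling period

    def duty_cycle(self):
        """Fraction of wall-clock time spent under instrumentation."""
        return self.profile_ms / self.period_ms

    def events(self, total_ms):
        """Start/stop requests the shepherding thread would issue."""
        out = []
        t = 0.0
        while t < total_ms:
            out.append((t, "start profiling"))
            stop = t + self.profile_ms
            if stop < total_ms:
                out.append((stop, "stop profiling"))
            t += self.period_ms
        return out

    def overall_slowdown(self, instrumented_slowdown):
        """Whole-run slowdown under the simplifying assumptions above.

        In one unit of wall-clock time the program completes (1 - d)
        units of work natively plus d / s units while instrumented.
        """
        d = self.duty_cycle()
        work_per_unit_time = (1.0 - d) + d / instrumented_slowdown
        return 1.0 / work_per_unit_time


# Illustrative values only: profile 10 ms out of every second.
schedule = SamplingSchedule(profile_ms=10.0, period_ms=1000.0)
print(schedule.events(2000.0))
print(round(schedule.overall_slowdown(1.16) - 1.0, 5))
```

Shortening `profile_ms` or lengthening `period_ms` lowers the modelled overhead at the price of collecting less profile data per unit of time.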

7 Problems with Basic Implementation
- Unbounded profiling periods due to fragment linking
- Latency degradation due to initial instrumentation
- Multi-threaded programs

8 Temporal Unlinking/Relinking of Fragments
Diagram labels: code cache, BB1, BB2, dispatch, context switch, BB2->BB1
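
Why linked fragments can make a profiling period unbounded, and how temporary unlinking bounds it, can be illustrated with a toy model. This is a sketch under stated assumptions, not DynamoRIO code: `Fragment`, `CodeCache`, and `run_instrumented` are hypothetical names, and the two-block loop mirrors the BB1/BB2 labels on the slide. In the model, a linked fragment jumps straight to its successor, so control never reaches dispatch and a stop request is never observed; unlinking forces the next fragment exit through dispatch.

```python
class Fragment:
    """One basic block in the software code cache (toy model)."""
    def __init__(self, name, successor):
        self.name = name
        self.successor = successor  # name of the block executed next
        self.linked = False         # True: exit jumps directly to successor


class CodeCache:
    def __init__(self):
        self.fragments = {}

    def add(self, name, successor):
        self.fragments[name] = Fragment(name, successor)

    def link_all(self):
        for frag in self.fragments.values():
            frag.linked = True

    def unlink_all(self):
        for frag in self.fragments.values():
            frag.linked = False


def run_instrumented(cache, start, stop_at, unlink_on_stop, max_blocks=1000):
    """Return how many blocks ran before dispatch honoured the stop request,
    or None if control never came back to dispatch within max_blocks."""
    stop_requested = False
    executed = 0
    current = start
    while executed < max_blocks:
        if executed == stop_at:
            stop_requested = True        # shepherding thread asks to stop
            if unlink_on_stop:
                cache.unlink_all()       # force exits back through dispatch
        frag = cache.fragments[current]
        executed += 1
        current = frag.successor
        if frag.linked:
            continue                     # direct jump, dispatch is bypassed
        if stop_requested:
            return executed              # dispatch switches back to native
    return None


cache = CodeCache()
cache.add("BB1", "BB2")
cache.add("BB2", "BB1")
cache.link_all()
print(run_instrumented(cache, "BB1", stop_at=5, unlink_on_stop=False))
print(run_instrumented(cache, "BB1", stop_at=5, unlink_on_stop=True))
cache.link_all()  # relink so the next profiling phase runs at full speed
```

Relinking at the end restores the direct jumps, so later profiling phases do not pay a dispatch round trip on every block.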

9 S/W Code Cache Pre-population
Diagram labels: hardware, operating system, application, shepherding thread, dispatch, instrumentation engine, client, code cache
- Still have latency degradation for initial instrumentation phases

10 Multithreaded Program Support
- Sampling makes it possible to miss thread operations
- Forces Instant Profiling’s signal handler for every thread
- Enumerates all threads and sends profiling start signal to each thread
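
The enumerate-and-signal step can be modelled in a few lines. This is only an illustration of the control flow, not the mechanism the authors use: real per-thread signal delivery is replaced here by a hypothetical `SignalLog` that records which threads were targeted, and `start_profiling_all_threads` and `demo` are made-up names.

```python
import threading


class SignalLog:
    """Stand-in for per-thread signal delivery; records who was signalled."""
    def __init__(self):
        self._lock = threading.Lock()
        self._signalled = set()

    def send_start(self, thread):
        with self._lock:
            self._signalled.add(thread.ident)

    def signalled(self):
        with self._lock:
            return set(self._signalled)


def start_profiling_all_threads(log):
    """Enumerate live threads and 'signal' each one except the caller."""
    me = threading.current_thread()
    targets = [t for t in threading.enumerate() if t is not me]
    for t in targets:
        log.send_start(t)
    return targets


def demo(num_workers=3):
    """Start idle workers, signal every thread, report whether all were covered."""
    release = threading.Event()
    workers = [threading.Thread(target=release.wait) for _ in range(num_workers)]
    for w in workers:
        w.start()
    log = SignalLog()
    start_profiling_all_threads(log)
    covered = all(w.ident in log.signalled() for w in workers)
    release.set()
    for w in workers:
        w.join()
    return covered


print(demo())
```

Enumerating at the moment profiling starts covers threads that were created while the program was running natively, which is the case the first bullet warns about.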

11 Experimental Setup
- 6-core Intel Xeon 2.67GHz w/ 12MB L3
- 12GB main memory
- Linux kernel
- gcc w/ -O3
- SPEC INT2006, BigTable, Web search
- Edge profiling client

12 Naïve Edge Profiling

13 Profiling Overhead

14 S/W Code Cache Prepopulation

15 Profiling Accuracy

16 Asymptotic Accuracy
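
One way to picture accuracy improving as samples accumulate is to merge the edge counts from successive profiling bursts and compare the running total against a full profile. The sketch below is an assumption-laden illustration: the overlap metric (sum over edges of the smaller normalized weight) is a commonly used profile-similarity measure chosen here for illustration, not necessarily the metric behind the slides, and the edge counts are made up.

```python
from collections import Counter


def overlap(reference, sampled):
    """Overlap of two edge profiles after normalizing each to sum to 1.

    Returns 1.0 when the relative edge weights match exactly and 0.0
    when the profiles share no edges (assumed metric, see lead-in).
    """
    ref_total = sum(reference.values())
    smp_total = sum(sampled.values())
    if ref_total == 0 or smp_total == 0:
        return 0.0
    edges = set(reference) | set(sampled)
    return sum(
        min(reference.get(e, 0) / ref_total, sampled.get(e, 0) / smp_total)
        for e in edges
    )


def accumulate(bursts):
    """Yield the merged edge counts after each profiling burst."""
    total = Counter()
    for burst in bursts:
        total.update(burst)
        yield Counter(total)


# Made-up edge counts: a full profile and two short sampled bursts.
full_profile = {("BB1", "BB2"): 50, ("BB2", "BB1"): 30, ("BB2", "BB3"): 20}
bursts = [
    {("BB1", "BB2"): 5},
    {("BB2", "BB1"): 3, ("BB2", "BB3"): 2},
]
for merged in accumulate(bursts):
    print(round(overlap(full_profile, merged), 3))
```

Because the merged counts only ever grow, short bursts that each see a small part of the program can still add up to a profile close to the full one.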

17 Conclusion
- Low-overhead, portable, flexible profiling needed
- Instant Profiling
  - Combines sampling and DBI
  - Pre-populates S/W code cache
  - Tunable tradeoff between overhead and information
  - Provides eventual profiling accuracy
- Less than 5% overhead, more than 80% accuracy for naïve edge profiling client

18 Thank you!