Wake Up and Smell the Coffee: Performance Analysis Methodologies for the 21st Century. Kathryn S. McKinley, Department of Computer Sciences, University of Texas at Austin

Presentation transcript:

1 Wake Up and Smell the Coffee: Performance Analysis Methodologies for the 21st Century. Kathryn S. McKinley, Department of Computer Sciences, University of Texas at Austin

2 Shocking News! In 2000, Java overtook C and C++ as the most popular programming language [TIOBE]

3 Systems Research in Industry and Academia. ISCA papers use C and/or C++: 5 papers are orthogonal to the programming language; 2 papers use specialized programming languages; 2 papers use Java and C from SPEC; 1 paper uses only Java from SPEC

4 What is Experimental Computer Science?

5 An idea. An implementation in some system. An evaluation.

6 The success of most systems innovation hinges on evaluation methodologies. 1. Benchmarks reflect current and, ideally, future reality. 2. Experimental design is appropriate. 3. Statistical data analysis.

7 The success of most systems innovation hinges on experimental methodologies. 1. Benchmarks reflect current and, ideally, future reality [DaCapo Benchmarks 2006]. 2. Experimental design is appropriate. 3. Statistical data analysis [Georges et al. 2006].
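
To make the statistics step concrete: Georges et al. recommend reporting a mean with a confidence interval across runs rather than a single best time. Below is a minimal sketch of that calculation in Java; the timings are placeholders, and the Student-t critical value is hard-coded for five runs.

```java
import java.util.Arrays;

/**
 * Sketch: summarize repeated measurements with a mean and a 95%
 * confidence interval instead of a single "best of N" number.
 * The run times below are placeholders, not real data.
 */
public class ConfidenceInterval {
    public static void main(String[] args) {
        double[] millis = {812.0, 799.5, 805.2, 821.7, 808.9};  // one time per run

        double mean = Arrays.stream(millis).average().orElse(Double.NaN);
        double variance = Arrays.stream(millis)
                                .map(t -> (t - mean) * (t - mean))
                                .sum() / (millis.length - 1);     // sample variance
        double sem = Math.sqrt(variance / millis.length);         // std. error of the mean

        // 2.776 is the two-sided 95% Student-t value for n-1 = 4 degrees of
        // freedom; for a different number of runs, look up the matching value.
        double half = 2.776 * sem;
        System.out.printf("mean = %.1f ms, 95%% CI = [%.1f, %.1f]%n",
                          mean, mean - half, mean + half);
    }
}
```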

8 Experimental Design. We're not in Kansas anymore! –JIT compilation, GC, dynamic checks, etc. Methodology has not adapted –Needs to be updated and institutionalized. "…this sophistication provides a significant challenge to understanding complete system performance, not found in traditional languages such as C or C++" [Hauswirth et al. OOPSLA '04]

9 Comprehensive comparison –3 state-of-the-art JVMs –Best of 5 executions –19 benchmarks –Platform: 2 GHz Pentium M, 1 GB RAM, Linux

10 Experimental Design

11 Experimental Design

12 Experimental Design

13 Experimental Design (chart: first, second, and third iterations)

14 Experimental Design: Another Experiment. Compare two garbage collectors –Semispace full-heap garbage collector –Marksweep full-heap garbage collector

15 Experimental Design: Another Experiment. Compare two garbage collectors –Semispace full-heap garbage collector –Marksweep full-heap garbage collector. Experimental design –Same JVM, same compiler settings –Second iteration for both –Best of 5 executions –One benchmark: SPEC _209_db –Platform: 2 GHz Pentium M, 1 GB RAM, Linux
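
For illustration, a minimal sketch of the "best of 5 executions" design this slide describes. The benchmark body is a placeholder; a real harness would launch each execution in a fresh JVM (e.g., running SPEC _209_db) rather than looping in-process.

```java
/**
 * Sketch of a "best of 5 executions" harness. runBenchmark() is a
 * stand-in workload; only the best (minimum) time is reported, which
 * is exactly the design under discussion here.
 */
public class BestOfFive {
    static long runBenchmark() {
        long start = System.nanoTime();
        long sum = 0;                        // placeholder work
        for (int i = 0; i < 50_000_000; i++) sum += i;
        if (sum == 42) System.out.println(); // keep the loop from being optimized away
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long best = Long.MAX_VALUE;
        for (int run = 0; run < 5; run++) {
            best = Math.min(best, runBenchmark());
        }
        System.out.printf("best of 5: %.1f ms%n", best / 1e6);
    }
}
```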

16 Marksweep vs Semispace

17 Marksweep vs Semispace

18 Marksweep vs Semispace (chart: Semispace and Marksweep)

19 Experimental Design

20 Experimental Design: Best Practices –Measuring JVM innovations –Measuring JIT innovations –Measuring GC innovations –Measuring architecture innovations

21 JVM Innovation Best Practices. Examples: –Thread scheduling –Performance monitoring. Workload triggers differences –Real workloads & perhaps microbenchmarks –E.g., force the frequency of thread switching. Measure & report multiple iterations –Start up –Steady state (aka server mode) –Never configure the VM to use completely unoptimized code! Use a modest heap size, or multiple heap sizes, computed as a function of the maximum live size of the application. Use & report multiple architectures.
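
As one illustration of "measure & report multiple iterations", here is a minimal in-process harness sketch that reports the first (start-up) iteration and a steady-state average separately; benchmarkIteration() is a placeholder for the workload under study.

```java
/**
 * Sketch: time each iteration separately so start-up (iteration 1,
 * heavy JIT activity) and steady state (later iterations) are reported
 * as distinct numbers rather than blended into one.
 */
public class IterationTimer {
    static void benchmarkIteration() {
        // ... one full iteration of the benchmark (placeholder) ...
    }

    public static void main(String[] args) {
        final int iterations = 10;
        double[] ms = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            benchmarkIteration();
            ms[i] = (System.nanoTime() - start) / 1e6;
            System.out.printf("iteration %d: %.1f ms%n", i + 1, ms[i]);
        }
        double steady = java.util.Arrays.stream(ms, 5, iterations).average().orElse(0);
        System.out.printf("start-up: %.1f ms, steady state (avg of last 5): %.1f ms%n",
                          ms[0], steady);
    }
}
```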

22 Best Practices (charts: Pentium M, AMD Athlon, SPARC)

23 JIT Innovation Best Practices. Example: a new compiler optimization –Code quality: does it improve the application code? –Compile time: how much compile time does it add? –Total time: compiler and application time together –Problem: adaptive compilation responds to the compilation load –Question: how do we tease all these effects apart?

24 JIT Innovation Best Practices. Teasing apart compile time and code quality requires multiple experiments. Total time: mix methodology –Run the adaptive system as intended; the result is a mixture of optimized and unoptimized code –First & second iterations (which include compile time) –Set and/or report the heap size as a function of the maximum live size of the application –Report the average and show the statistical error. Code quality –OK: run iterations until performance stabilizes and report the best, or –Better: run several iterations of the benchmark, turn off the compiler, and measure a run guaranteed to have no compilation –Best: replay mix compilation. Compile time –Requires the compiler to be deterministic –Replay mix compilation.
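
On a stock JVM, one rough way to see compile time and application time side by side is the standard CompilationMXBean, which exposes the JIT's cumulative compilation time. This is only an approximation (compilation runs on background threads, concurrently with the application) and is not the replay methodology the next slide describes, but it illustrates the accounting problem; benchmarkIteration() is again a placeholder.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

/**
 * Sketch: sample cumulative JIT compilation time around each iteration
 * to see how much compilation load each iteration induces.
 */
public class CompileTimeProbe {
    static void benchmarkIteration() {
        // ... one iteration of the benchmark (placeholder) ...
    }

    public static void main(String[] args) {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            System.out.println("compilation time monitoring not supported");
            return;
        }
        for (int i = 0; i < 5; i++) {
            long jitBefore = jit.getTotalCompilationTime();   // milliseconds
            long start = System.nanoTime();
            benchmarkIteration();
            long wallMs = (System.nanoTime() - start) / 1_000_000;
            long jitMs = jit.getTotalCompilationTime() - jitBefore;
            System.out.printf("iter %d: %d ms total, %d ms spent compiling%n",
                              i + 1, wallMs, jitMs);
        }
    }
}
```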

25 Replay Compilation. Force the JIT to produce a deterministic result by building a compilation profiler & replayer. Profiler –Profile first or later iterations with the adaptive JIT; pick the best or the average –Record the profiling information used in compilation decisions, e.g., dynamic profiles of edges, paths, &/or the dynamic call graph –Record compilation decisions, e.g., compile method bar at level two, inline method foo into bar –Record a mix of optimized and unoptimized code, or all optimized/unoptimized. Replayer –Reads in the profile –As the system loads each class, applies the profile +/- the innovation. Result –Controlled experiments with deterministic compiler behavior –Reduced statistical variance in measurements. Still not a perfect methodology for inlining.
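
A schematic sketch of the profiler/replayer pair follows. Everything VM-facing in it is hypothetical: the VMHooks class stands in for internal hooks a VM builder (e.g., in a research VM like Jikes RVM) would have to expose, and no stock JVM offers such an API. The profile format here is simply one "method optLevel" pair per line.

```java
import java.io.*;

/**
 * Sketch of replay compilation. VMHooks is a hypothetical stand-in for
 * VM-internal hooks; a real implementation lives inside the VM itself.
 */
public class ReplayCompilation {
    /** Hypothetical callback fired whenever the adaptive JIT compiles a method. */
    interface CompileListener { void compiled(String method, int optLevel); }

    /** Hypothetical VM interface; the bodies here are intentionally empty. */
    static class VMHooks {
        static void onCompile(CompileListener l) { /* VM-internal */ }
        static void compileAt(String method, int optLevel) { /* VM-internal */ }
    }

    // Profiler side: record every compilation decision the adaptive JIT makes.
    static void profile(File out) throws IOException {
        try (PrintWriter w = new PrintWriter(new FileWriter(out))) {
            VMHooks.onCompile((method, level) -> w.println(method + " " + level));
            // ... run the benchmark under the adaptive system ...
        }
    }

    // Replayer side: reapply the recorded decisions, yielding a deterministic
    // mix of optimized and unoptimized code across experiments.
    static void replay(File in) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(in))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split(" ");
                VMHooks.compileAt(parts[0], Integer.parseInt(parts[1]));
            }
            // ... run the benchmark; the JIT makes no further decisions ...
        }
    }
}
```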

26 GC Innovation Best Practices. Requires more than one experiment... Use & report a range of fixed heap sizes –Explore the space-time tradeoff –Measure heap size with respect to the maximum live size of the application –VMs should report total memory, not just application memory: different GC algorithms vary in the meta-data they require, and the JIT and VM use memory too. Measure time with a constant workload –Do not measure throughput. Best: run two experiments –Mix with the adaptive methodology: what users are likely to see in practice –Replay: hold compiler activity constant, and choose a profile with the best application performance to keep bad code from hiding mutator overheads.
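
As a small illustration of the heap-size sweep, the sketch below emits command lines for a range of fixed heaps expressed as multiples of a measured maximum live size. The 40 MB figure and the jar name are placeholders, and setting -Xms equal to -Xmx pins the heap on HotSpot-style JVMs.

```java
/**
 * Sketch: generate a space-time tradeoff sweep of fixed heap sizes,
 * each a multiple of the benchmark's measured maximum live size.
 */
public class HeapSweep {
    public static void main(String[] args) {
        int maxLiveMB = 40;  // placeholder: measured minimum heap for the benchmark
        for (double factor = 1.0; factor <= 3.0; factor += 0.25) {
            int heapMB = (int) Math.ceil(maxLiveMB * factor);
            System.out.printf("java -Xms%dm -Xmx%dm -jar benchmark.jar  # %.2fx live size%n",
                              heapMB, heapMB, factor);
        }
    }
}
```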

27 Architecture Innovation Best Practices. Requires more than one experiment... Use more than one VM. Set a modest heap size and/or report the heap size as a function of the maximum live size. Use a mixture of optimized and uncompiled code; in many cases the simulator needs the same code to perform comparisons. Best for microarchitecture-only changes: multiple traces from a live system with the adaptive methodology –Start up, and steady state with the compiler turned off –What users are likely to see in practice. Won't work if the architecture change requires recompilation, e.g., a new sampling mechanism –Use replay to make the code as similar as possible.

28 There are lies, damn lies, and benchmarks (after Disraeli's "statistics"). Quotes from recent research papers: "sometimes more than twice as fast"; "our …. is better or almost as good as …. across the board"; "garbage collection degrades performance by 70%"; "speedups of 1.2x to 6.4x on a variety of benchmarks"; "our prototype has usable performance"; "the overhead …. is on average negligible"; "…demonstrating high efficiency and scalability"; "our algorithm is highly efficient"; "can reduce garbage collection time by 50% to 75%"; "speedups…. are very significant (up to 54-fold)"; "speed up by 10-25% in many cases…"; "…about 2x in two cases…"; "…more than 10x in two small benchmarks"; "…improves throughput by up to 41x"

29 Conclusions. Methodology includes –Benchmarks –Experimental design –Statistical analysis [OOPSLA 2007]. Poor methodology –Can focus or misdirect innovation and energy. We have a unique opportunity –Transactional memory, multicore performance, dynamic languages, … What we can do –Enlist VM builders to include replay –Fund and broaden participation in benchmarking: research and industrial partnerships; funding through NSF, ACM, SPEC, industry, or ?? –Participate in building community workloads.

30 Thank you!