Kremlin: Rethinking and Rebooting gprof for the Multicore Era


Kremlin: Rethinking and Rebooting gprof for the Multicore Era Saturnino Garcia, Donghwan Jeon, Chris Louie, Michael B. Taylor Computer Science & Engineering Department University of California, San Diego

Motivating a “gprof for parallelization” How effective are programmers at picking the right parts of a program to parallelize? A user study* we performed at UC San Diego (UCSD IRB #100056):
- First- and second-year CS graduate students
- Users parallelized their programs and submitted them to a job queue for timing
- 32-core AMD machine, Cilk++, access to gprof
- Students were graded on the effectiveness of their parallel speedup; they were told serial optimization would not help their grade
*Disclaimer: No graduate students were harmed as a result of this study

User Study: Results Examined students' activities to determine the results of their efforts. A significant fraction of effort was fruitless because of three basic problems:
- Low Parallelism: the region was not parallel enough
- Low Coverage: the region's execution time was too small
- Poor Planning: speedup was negated by subsequent parallelization
[Figure: timelines for Users 143, 249, and 371 with fruitless parallelization effort highlighted]

gprof answers the question: “What parts of this program should I spend time optimizing?”
Kremlin answers the question: “What parts of this program should I spend time parallelizing?”

Kremlin’s Usage Model Usage model inspired by gprof:
1. Compile an instrumented binary:
   $> make CC=kremlin-cc
2. Profile with a sample input:
   $> ./tracking lolcats
3. Run the analysis tool to create a plan:
   $> kremlin tracking --personality=openmp

     File (lines)               Self-P   Cov (%)
   1 imageBlur.c (49-58)        145.3    9.7
   2 imageBlur.c (37-45)        145.3    8.7
   3 getInterpPatch.c (26-35)    25.3    8.9
   4 calcSobel_dX.c (59-68)     126.2    8.1
   5 calcSobel_dX.c (46-55)     126.2    8.1
   ...
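For illustration only, here is how a programmer might then act on the top-ranked region. The loop below is a hypothetical stand-in for the imageBlur.c lines in the plan; the function name, arrays, and blur kernel are assumptions for this sketch, not code taken from the benchmark:

    #include <omp.h>

    /* Hypothetical stand-in for the region ranked #1 in Kremlin's plan.
     * Each output row is independent, so the outer loop is marked parallel. */
    void blur_rows(float *dst, const float *src, int height, int width)
    {
        #pragma omp parallel for
        for (int y = 1; y < height - 1; y++) {
            for (int x = 0; x < width; x++) {
                dst[y * width + x] =
                    (src[(y - 1) * width + x] +
                     src[y * width + x] +
                     src[(y + 1) * width + x]) / 3.0f;
            }
        }
    }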

Kremlin’s Key Components Serial source code flows through two stages to produce a parallelization plan:
Parallelism Discovery: “What’s the potential parallel speedup of each part of this program?”
- Hierarchical Critical Path Analysis (HCPA): quantifies self-parallelism in each program region
- Self-Parallelism: estimates the ideal parallel speedup of a specific region
Parallelism Planning: “What regions must I parallelize to get the maximum benefit on this system?”
- Planning Personalities: incorporate target-specific constraints into parallelization

Developing an Approach for Parallelism Discovery Existing technique: 1980's-era Critical Path Analysis (CPA)
- Finds the critical path through the dynamic execution of a program
- Mainly used in research studies to quantify limits of parallelism

    parallelism = work / critical path length,   where work ~= # of instrs

[Figure: dynamic dependence graph of instructions connected by data or control dependences, with the critical path (cp) highlighted]
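A minimal sketch of the calculation, not Kremlin's implementation: given a dynamic dependence DAG in topological order with unit-cost instructions, work is the node count, the critical path is the longest dependence chain, and parallelism is their ratio. The graph here is an invented example:

    #include <stdio.h>

    #define N 6   /* number of dynamic instructions (nodes) */

    int main(void)
    {
        /* dep[i][j] = 1 if instruction j depends on instruction i (i < j). */
        int dep[N][N] = {0};
        dep[0][2] = dep[1][2] = 1;      /* example data/control dependences */
        dep[2][3] = dep[2][4] = 1;
        dep[3][5] = dep[4][5] = 1;

        int longest[N];                 /* longest path ending at each node */
        int cp = 0;
        for (int j = 0; j < N; j++) {
            longest[j] = 1;             /* the instruction itself */
            for (int i = 0; i < j; i++)
                if (dep[i][j] && longest[i] + 1 > longest[j])
                    longest[j] = longest[i] + 1;
            if (longest[j] > cp)
                cp = longest[j];
        }

        double work = N;
        printf("work = %.0f, cp = %d, parallelism = %.2f\n",
               work, cp, work / cp);
        return 0;
    }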

Benefits of CPA as a Basis for Parallelism Discovery
- Evaluates a program's potential for parallelization under relatively optimistic assumptions
  - A closer approximation to what human experts can achieve, versus the pessimistic static analysis in automatic parallelizing compilers
- The result is predictive of parallelism after typical parallelization transformations, e.g., loop interchange, loop fission, locality enhancement

Improving CPA with Hierarchical CPA (HCPA) CPA is typically run on an entire program:
- Not helpful for identifying specific regions to parallelize
- Doesn't help evaluate the execution time of a program if only a subset of the program is parallelized
Hierarchical CPA is a region-based analysis:
- Self-Parallelism (sp) identifies parallelism in specific regions
- Provides a basis for estimating the parallel speedup of individual regions

    for(i=1..100) {              // sp=100: the i iterations are independent
      for(j=1..100) {            // sp=1: each j iteration depends on the previous one
        a[i][j] = a[i][j-1]+3;   // sp=2: the two statements in the body
        b[i][j] = b[i][j-1]+5;   //       are independent of each other
      }
    }

HCPA Step 1: Hierarchically Apply CPA Goal: Introduce localization through region-based analysis
- Shadow-memory based implementation
- Performs CPA analysis on every program region
- Single pass: concurrently analyzes multiple nested regions
On the loop nest above, HCPA produces a dynamic region tree annotated with (work, cp length):
- for(i): (100000, 500)
- each of the 100 for(j) instances: (1000, 500)
- each of the 100 body instances per for(j): (10, 5)
Plain CPA on the outer region gives ✗ for(i): p = W / CP = 100000 / 500 = 200, which also counts the parallelism nested inside for(j).
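A minimal sketch of the single-pass, region-stack bookkeeping this step implies; the shadow-memory critical-path tracking is omitted, and the names and structure are illustrative assumptions rather than Kremlin's actual implementation. Work is charged to the innermost open region and rolled up into its parent when the region exits:

    #include <stdio.h>

    #define MAX_DEPTH 64

    static struct { const char *name; double work; } regions[MAX_DEPTH];
    static int depth = 0;

    static void region_enter(const char *name)
    {
        regions[depth].name = name;
        regions[depth].work = 0.0;
        depth++;
    }

    static double region_exit(void)
    {
        double w = regions[--depth].work;
        if (depth > 0)
            regions[depth - 1].work += w;    /* roll nested work into the parent */
        return w;
    }

    static void count_instruction(void)
    {
        if (depth > 0)
            regions[depth - 1].work += 1.0;  /* charge the innermost open region */
    }

    int main(void)
    {
        region_enter("for(i)");
        for (int i = 0; i < 100; i++) {
            region_enter("for(j)");
            for (int j = 0; j < 100; j++) {
                region_enter("body");
                for (int k = 0; k < 10; k++)
                    count_instruction();     /* ~10 instructions per loop body */
                region_exit();
            }
            region_exit();
        }
        printf("for(i): work = %.0f\n", region_exit());  /* 100000, as in the example */
        return 0;
    }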

HCPA Step 2: Calculate Self-Parallelism Goal: Eliminate the effect of nested parallelism in the parallelism calculation
- Approximate self-parallelism using only the HCPA output
- "Subtracts" nested parallelism from overall parallelism
In the example, work(for_i) = 100 * work(for_j), so the work of for(j) contributes to the parallelism of both the inner and the outer region:

    ✗ for(i): p = W / CP = 100000 / 500 = 200

Adjusting the work by replacing each child's work with its critical path length, work'(for_i) = 100 * CP(for_j), isolates the outer loop:

    ✔ for(i): self-p = W' / CP = (100 * 500) / 500 = 100
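A minimal sketch of this adjustment (not Kremlin's actual code; the struct layout and names are assumptions): the region's work is recomputed by substituting each child's critical path length for its work, and self-parallelism is the adjusted work divided by the region's own critical path length.

    #include <stdio.h>

    struct region {
        double work;        /* total dynamic work in this region            */
        double cp;          /* critical path length of this region          */
        double child_work;  /* summed work of all immediate child regions   */
        double child_cp;    /* summed cp of all immediate child regions     */
    };

    /* work' replaces each child's work with its critical path length,
     * removing the parallelism that belongs to nested regions. */
    static double self_parallelism(const struct region *r)
    {
        double adjusted_work = r->work - r->child_work + r->child_cp;
        return adjusted_work / r->cp;
    }

    int main(void)
    {
        /* The for(i) region from the example: work 100000, cp 500,
         * 100 for(j) children, each with work 1000 and cp 500. */
        struct region for_i = { 100000.0, 500.0, 100 * 1000.0, 100 * 500.0 };
        printf("self-parallelism(for_i) = %.1f\n", self_parallelism(&for_i));  /* 100.0 */
        return 0;
    }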

HCPA Step 3: Compute Static Region Data Goal: Convert dynamic region data to static region output
- Merge dynamic nodes associated with the same static region
- Work: sum of work across dynamic instances
- Self-Parallelism: weighted average across dynamic instances
Example: the dynamic tree with for(i) = (work, sp) = (100000, 100), 100 for(j) instances each (1000, 1), and 100 body instances per for(j) each (10, 2) merges into static regions (work, avg. sp):
- for(i): (100000, 100)
- for(j): (100000, 1)
- body:   (100000, 2)
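A minimal sketch of the merge step, under the assumption (consistent with the slide) that self-parallelism is averaged with each dynamic instance weighted by its work; all names are illustrative:

    #include <stdio.h>

    /* Merge dynamic instances of one static region: sum the work,
     * and take the work-weighted average of self-parallelism. */
    struct static_region {
        double work;       /* accumulated work                    */
        double sp_weight;  /* accumulated work * self-parallelism */
    };

    static void merge_instance(struct static_region *s, double work, double sp)
    {
        s->work += work;
        s->sp_weight += work * sp;
    }

    static double avg_sp(const struct static_region *s)
    {
        return s->work > 0.0 ? s->sp_weight / s->work : 1.0;
    }

    int main(void)
    {
        struct static_region for_j = { 0.0, 0.0 };
        for (int i = 0; i < 100; i++)        /* 100 dynamic for(j) instances */
            merge_instance(&for_j, 1000.0, 1.0);
        printf("for(j): work = %.0f, avg sp = %.1f\n", for_j.work, avg_sp(&for_j));
        return 0;
    }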

Further Details on Discovery in Our Paper
- Kremlin handles much more complex structures than just nested for loops: it finds parallelism in arbitrary code, including recursion
- The self-parallelism metric is defined and discussed in detail in the paper
- A compression technique is used to reduce the size of the HCPA output

Creating a Parallelization Plan Goal: Use the HCPA output to select the best regions for the target system. Planning personalities allow the user to incorporate system constraints:
- Software constraints: what types of parallelism can I specify?
- Hardware constraints: synchronization overhead, etc.
- The planning algorithm can change based on these constraints

An OpenMP Planner
- Based on the OpenMP 2.0 specification
- Focused on loop-level parallelism; disallows nested parallelism because of its overhead
- Planning algorithm based on dynamic programming

    parallelized time reduction (PTR) = W - (W / SP)

[Figure: example region tree of regions A-F with a Region / Work / SP table (e.g., A: work 100k, SP 2; E: work 25k, SP 5) and PTR values of 50k, 45k, 20k, and 25k; check marks indicate the regions the planner selects]
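A minimal sketch of one plausible form of this dynamic program (the paper's actual algorithm and data structures may differ; the names here are assumptions): because nested parallelism is disallowed, each subtree contributes either the PTR of its root region or the best combined plan of its children, whichever is larger.

    #include <stdio.h>

    #define MAX_CHILDREN 8

    struct region {
        double work;                        /* profiled work W     */
        double sp;                          /* self-parallelism SP */
        int nchildren;
        struct region *children[MAX_CHILDREN];
    };

    static double ptr(const struct region *r)   /* parallelized time reduction */
    {
        return r->work - r->work / r->sp;
    }

    /* Best achievable time reduction for this subtree with no nesting:
     * parallelize this region (and none of its descendants), or recurse. */
    static double best_reduction(const struct region *r)
    {
        double from_children = 0.0;
        for (int i = 0; i < r->nchildren; i++)
            from_children += best_reduction(r->children[i]);
        double from_self = ptr(r);
        return from_self > from_children ? from_self : from_children;
    }

    int main(void)
    {
        struct region inner = { 25000.0, 5.0, 0, {0} };           /* e.g., region E */
        struct region outer = { 100000.0, 2.0, 1, { &inner } };   /* e.g., region A */
        printf("best time reduction = %.0f\n", best_reduction(&outer));  /* 50000 */
        return 0;
    }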

Evaluation Methodology Benchmarks: NAS OpenMP and SpecOMP
- Have both serial and parallel versions
- Wide range of parallel speedup (min: 1.85x, max: 25.89x) on 32 cores
Procedure:
- Ran Kremlin on the serial versions, targeting OpenMP
- Parallelized according to Kremlin's plan
- Gathered performance results on an 8-socket AMD 8380 quad-core machine
- Compared against third-party parallelized versions (3rd Party)

How much effort is saved using Kremlin? Number of regions parallelized:

  Suite     Benchmark   3rd Party   Kremlin   Reduction
  SpecOMP   art         3           4         0.75x
            ammp        6           3         2.00x
            equake      10          6         1.67x
  NPB       ep          1           1         1.00x
            is
            ft
            mg          10          8         1.25x
            cg          22          9         2.44x
            lu          28          11        2.55x
            bt          54          27        2.00x
            sp          70          58        1.21x
  Overall               211         134       1.57x

1.89x average improvement for the 5 most widely modified benchmarks.

How good is Kremlin-guided performance? Compared performance against the expert, third-party versions:
- Significantly better results on two benchmarks
- Required 65 fewer regions to get within 4% of third-party performance on the others (a 1.87x improvement)

Does Kremlin pick the best regions first? Determined what % of the speedup comes from the first {25, 50, 75, 100}% of recommended regions.

  Fraction of Kremlin plan applied           First 25%   Second 25%   Third 25%   Last 25%
  Marginal benefit (avg % of max speedup)    56.2%       30.2%        9.2%        4.4%

86.4% of the benefit in the first half + decreasing marginal benefit = highest-impact regions are favored

Conclusion Kremlin helps a programmer determine: “What parts of this program should I spend time parallelizing?”
Three key techniques introduced by Kremlin:
- Hierarchical CPA: how much total parallelism is in each region?
- Self-Parallelism: how much parallelism is only in this region?
- Planning Personalities: what regions are best for my target system?
Compelling results:
- 1.57x average reduction in the number of regions parallelized
- Greatly improved performance on 2 of 11 benchmarks; very close on the others

***

Self-Parallelism for Three Common Loop Types With adjusted work (ET') of N * CP, self-parallelism = adjusted work / loop's critical path length:
- DOALL: loop CP = CP, so self-parallelism = (N * CP) / CP = N
- DOACROSS: loop CP = (N/2) * CP, so self-parallelism = (N * CP) / ((N/2) * CP) = 2.0
- Serial: loop CP = N * CP, so self-parallelism = (N * CP) / (N * CP) = 1.0
We have examined a DOALL loop example where all iterations are independent of each other, but Parkour's self-parallelism calculation works well for other loop types as well. With a DOACROSS loop, where half of the work in an iteration can overlap with the next iteration, the critical path length of the loop is (N/2) * CP, and the resulting self-parallelism is 2.0, which conforms to our intuition. For a serial loop, where the critical path length is identical to the adjusted work, the calculated self-parallelism is 1.0. In all cases, Parkour successfully quantifies self-parallelism by extending CPA.

Kremlin System Architecture

Interpreting the Parallelism Metric Parallelism is a result of execution time spent off of the critical path:
- Parallelism of 1.0: totally serial; all work is on the critical path (ET == CP)
- Parallelism of 100.0, 10000.0, and beyond: increasingly parallel; most work is off the critical path (ET >> CP)