Kismet: Parallel Speedup Estimates for Serial Programs. Donghwan Jeon, Saturnino Garcia, Chris Louie, and Michael Bedford Taylor. Computer Science and Engineering, University of California, San Diego.

Questions in Parallel Software Engineering
– I heard about these new-fangled multicore chips. How much faster will PowerPoint be with 128 cores?
– We wasted 3 months for 0.5% parallel speedup. Can't we get parallel performance estimates earlier?
– How can I set the parallel performance goals for my intern, Asok?
– Dilbert asked me to achieve 128X speedup. How can I convince him it is impossible without changing the algorithm?

Kismet Helps Answer These Questions
Kismet automatically provides an estimated parallel speedup upperbound from serial source code, with an easy-to-use usage model:
1. Produce an instrumented binary with kismet-cc: $> make CC=kismet-cc
2. Perform parallelism profiling with a sample input: $> $(PROGRAM) $(INPUT)
3. Estimate speedup under the given constraints: $> kismet -opteron -openmp (reports the estimated speedup for each core count)

Kismet Overview
Kismet extends critical path analysis to incorporate the constraints that affect real-world speedup. It takes serial source code, a sample input, and parallelization constraints, and produces speedup estimates in two stages:
– Hierarchical Critical Path Analysis: measures parallelism and produces a parallelism profile.
– Parallelization Planner: uses the profile to find the best parallelization for the target machine.

Outline: Introduction; Background: Critical Path Analysis; How Kismet Works; Experimental Results; Conclusion.

The Promise of Critical Path Analysis (CPA)
Definition: a program analysis that computes the longest dependence chain in the dynamic execution of a serial program.
Typical use: approximate the upperbound on parallel speedup without parallelizing the code.
Assumes an ideal execution environment:
– All parallelism exploitable
– Unlimited cores
– Zero parallelization overhead

How Does CPA Compute the Critical Path?
CPA builds a dependence graph over the dynamic instruction stream. Example instruction sequence:
la $2, $ADDR
load $3, $2(0)
addi $4, $2, #4
store $4, $2(4)
store $3, $2(8)
– node: a dynamic instruction, weighted by its latency
– edge: a dependence between instructions
– work: the serial execution time, i.e. the total sum of node weights (work = 8 in this example)
– critical path length (cp): the minimum parallel execution time, i.e. the longest weighted dependence chain (cp = 6 in this example)
Total-Parallelism = work / critical path length = 8 / 6 ≈ 1.33
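To make the computation concrete, here is a small self-contained C sketch (not Kismet's implementation) that computes work, critical path length, and Total-Parallelism for a dependence DAG like the one above. The per-instruction latencies are assumptions chosen only so that the totals match the slide's example (work = 8, cp = 6).

/* Minimal sketch, not Kismet's implementation: compute work, critical path
   length, and Total-Parallelism over a small dependence DAG.  The latencies
   below are assumed values chosen so the totals match the example above. */
#include <stdio.h>

#define N 5  /* five dynamic instructions, in program order */

/* la $2,$ADDR | load $3,$2(0) | addi $4,$2,#4 | store $4,$2(4) | store $3,$2(8) */
static const int latency[N] = {1, 3, 1, 1, 2};   /* assumed node weights */
static const int deps[N][N] = {   /* deps[i][j] = 1: instruction i depends on j */
    {0,0,0,0,0},                  /* la:    no dependences                      */
    {1,0,0,0,0},                  /* load:  needs $2 from la                    */
    {1,0,0,0,0},                  /* addi:  needs $2 from la                    */
    {1,0,1,0,0},                  /* store: needs $2 from la, $4 from addi      */
    {1,1,0,0,0},                  /* store: needs $2 from la, $3 from load      */
};

int main(void) {
    int finish[N];                /* earliest finish time of each instruction   */
    int work = 0, cp = 0;
    for (int i = 0; i < N; i++) { /* program order is a valid topological order */
        int start = 0;
        for (int j = 0; j < i; j++)
            if (deps[i][j] && finish[j] > start)
                start = finish[j];
        finish[i] = start + latency[i];
        work += latency[i];
        if (finish[i] > cp)
            cp = finish[i];       /* cp = longest weighted dependence chain     */
    }
    printf("work=%d cp=%d Total-Parallelism=%.2f\n", work, cp, (double)work / cp);
    return 0;
}

Running this prints work=8 cp=6 Total-Parallelism=1.33, matching the example.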

Total-Parallelism Metric: Captures the Ideal Speedup of a Program
Total-Parallelism ranges from a minimum of 1.0, for a totally serial program where all the work is on the critical path, up to large values for highly parallel programs where most of the work is off the critical path.

Why CPA Is a Good Thing
– Works on original, unmodified serial programs.
– Provides an approximate upperbound on speedup after applying typical parallelization transformations, e.g. loop interchange, loop fusion, index-set splitting, ...
– Output is invariant of the serial expression of the program: reordering two independent statements does not change the measured parallelism (see the sketch below).
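As a tiny illustration of the last point (illustrative only, not from the paper), the two functions below contain the same pair of independent statements in opposite orders; assuming a and b do not alias, their dependence graphs are identical, so CPA reports the same work and critical path for both.

/* Illustrative only: reordering independent statements does not change the
   dependence graph, so CPA's work and critical path length are unchanged.
   (Assumes a and b do not alias.) */
void version_a(int *a, int *b, int i) {
    a[i] = a[i] + 1;   /* independent of the next statement */
    b[i] = b[i] - 1;
}

void version_b(int *a, int *b, int i) {
    b[i] = b[i] - 1;   /* same statements, opposite order   */
    a[i] = a[i] + 1;
}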

A Brief History of CPA
Employed to characterize parallelism in research:
– COMET [Kumar '88]: Fortran statement level
– Paragraph [Austin '92]: instruction level
– Limit studies for ILP-exploiting processors [Wall, Lam '92]
Yet CPA is not widely used in programmer-facing parallelization tools.

Why Isn't CPA Commonly Used in Programmer-Facing Tools?
[Chart: CPA-estimated speedup versus measured 16-core speedup for the benchmarks ep, life, is, sp, unstruct, and sha, with the resulting optimism ratio for each.]
CPA-estimated speedups do not correlate with real-world speedups.

CPA Problem #1: Data-flow Style Execution Model Is Unrealistic
void outer( ) { ... middle(); }
void middle( ) { ... inner(); }
void inner( ) { ... /* parallel doall for-loop, reduction */ }
[Timeline: the invocations of middle() and inner() during the execution of outer().]
This execution model is difficult to map onto a von Neumann machine and an imperative programming language.

CPA Problem #2: Key Parallelization Constraints Are Ignored
– Exploitability: what type of parallelism is supported by the target platform? e.g. thread-level (TLP), data-level (DLP), instruction-level (ILP)
– Resource constraints: how many cores are available for parallelization?
– Overhead: do overheads eclipse the benefit of the parallelism? e.g. scheduling, communication, synchronization

Outline: Introduction; Background: Critical Path Analysis; How Kismet Works; Experimental Results; Conclusion.

Kismet Extends CPA to Provide Practical Speedup Estimates
Kismet builds on CPA in two stages:
– Hierarchical Critical Path Analysis (HCPA): measure parallelism with a hierarchical region model.
– Parallelization Planner: find the best parallelization strategy under the target's constraints.

Revisiting CPA Problem #1: Data-flow Style Execution Model Is Unrealistic
void outer( ) { ... middle(); }
void middle( ) { ... inner(); }
void inner( ) { ... /* parallel doall for-loop, reduction */ }

Hierarchical Critical Path Analysis (HCPA)
Step 1. Model a program execution with hierarchical regions.
Step 2. Recursively apply CPA to each nested region.
Step 3. Quantify self-parallelism.
HCPA Step 1, Hierarchical Region Modeling. Example:
for (i=0 to 4)
  for (j=0 to 32)
    foo();
  for (k=0 to 2)
    bar1();
    bar2();
Region tree: loop i at the root; loop j and loop k nested inside it; foo(), bar1(), and bar2() nested inside those loops.
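To make the region model concrete, here is a hedged sketch of the example with explicit region boundaries. The region_enter()/region_exit() hooks are hypothetical names used only for illustration, not Kismet's real API; Kismet inserts its actual instrumentation automatically via kismet-cc.

/* Sketch only: hypothetical region hooks illustrating hierarchical regions.
   region_enter()/region_exit() are invented names, not Kismet's real API.  */
#include <stdio.h>

static void region_enter(const char *id) { printf("enter %s\n", id); }
static void region_exit(const char *id)  { printf("exit  %s\n", id); }
static void foo(void)  {}
static void bar1(void) {}
static void bar2(void) {}

static void example(void) {
    region_enter("loop_i");
    for (int i = 0; i < 4; i++) {
        region_enter("loop_j");
        for (int j = 0; j < 32; j++)
            foo();                     /* leaf region: foo()  */
        region_exit("loop_j");

        region_enter("loop_k");
        for (int k = 0; k < 2; k++) {
            bar1();                    /* leaf region: bar1() */
            bar2();                    /* leaf region: bar2() */
        }
        region_exit("loop_k");
    }
    region_exit("loop_i");
}

int main(void) { example(); return 0; }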

HCPA Step 2: Recursively Apply CPA
– Total-Parallelism from inner() alone: ~7X
– Total-Parallelism from middle() and inner(): ~6X
– Total-Parallelism from outer(), middle(), and inner(): ~5X
Question: what is a region's parallelism excluding the parallelism contributed by its nested regions?

HCPA: Introducing Self-Parallelism
Example:
for (i=0 to 100) {
  a[i] = a[i] + 1;
  b[i] = b[i] - 1;
}
Total-Parallelism (from CPA) for the whole loop: 200X. Self-Parallelism (from HCPA): 100X for the loop region and 2X for the loop body.
Self-Parallelism:
– Represents a region's ideal speedup
– Differentiates a parent's parallelism from its children's
– Analogous to self-time in serial profilers

HCPA Step 3: Quantifying Self-Parallelism
Generalized Self-Parallelism equation:
Self-Parallelism(Parent) = Total-Parallelism(Parent) / Total-Parallelism(Children)
For the loop above: Self-Parallelism(loop) = 200X / 2X = 100X, the parallelism contributed by the loop itself rather than by its body.

Self-Parallelism: Localizing Parallelism to a Region
– Self-Parallelism(inner) = ~7.0X
– Self-Parallelism(middle) = ~1.0X
– Self-Parallelism(outer) = ~1.0X
The parallelism is correctly attributed to inner() rather than being smeared across the enclosing regions.

Classifying Parallelism Type (see the sketch below)
– Is the region a leaf region? Yes: ILP.
– If not a leaf, is it a loop region? No: TLP.
– If it is a loop region, is its P bit == 1? Yes: DOALL. No: DOACROSS.
See our paper for details.
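The same decision tree written as a small C function; the enum and field names are invented for illustration and are not Kismet's data structures.

/* Sketch of the parallelism-type decision tree above.  The types and field
   names are invented for illustration; they are not Kismet's real code.    */
typedef enum { ILP, TLP, DOALL, DOACROSS } ptype_t;

typedef struct region {
    int is_leaf;   /* region has no child regions                            */
    int is_loop;   /* region corresponds to a loop                           */
    int p_bit;     /* the slide's "P bit" (interpretation: loop iterations   */
                   /* carry no observed cross-iteration dependences)         */
} region_t;

ptype_t classify(const region_t *r) {
    if (r->is_leaf) return ILP;             /* leaf regions expose ILP           */
    if (!r->is_loop) return TLP;            /* non-loop regions: thread-level    */
    return r->p_bit ? DOALL : DOACROSS;     /* loops: DOALL if the P bit is set  */
}

In Kismet this classification matters because the planner only parallelizes regions whose parallelism type is exploitable on the target.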

Why HCPA Is an Even Better Thing
HCPA:
– Keeps all the desirable properties of CPA
– Localizes parallelism to a region via the self-parallelism metric and hierarchical region modeling
– Facilitates the classification of parallelism
– Enables more realistic modeling of parallel execution (see the following slides)

Outline: Introduction; Background: Critical Path Analysis; How Kismet Works (HCPA, Parallelization Planner); Experimental Results; Conclusion.

Revisiting CPA Problem #2: Key Parallelization Constraints Are Ignored
– Exploitability: what type of parallelism is supported by the target platform? e.g. thread-level (TLP), data-level (DLP), instruction-level (ILP)
– Resource constraints: how many cores are available for parallelization?
– Overhead: do overheads eclipse the benefit of the parallelism? e.g. scheduling, communication, synchronization

Parallelization Planner Overview
Goal: find the speedup upperbound based on the HCPA results and the parallelization constraints.
Inputs to the target-dependent planner:
– HCPA profile: region structure and self-parallelism
– Constraints: core count, exploitability, overhead, ...
The planning algorithm drives a parallel execution time model to produce the parallel speedup upperbound.

Planning Algorithm: Allocates Cores with Key Constraints
A sample core allocation over the region tree from the earlier example (loop i self-p=4.0; loop j self-p=32.0; loop k self-p=1.5; foo() self-p=1.5; bar1() self-p=2.0; bar2() self-p=5.0), subject to three constraints:
– Self-Parallelism: the core count allocated to a region should not exceed ceil(self-p).
– Exploitability: if a region's parallelism is not exploitable on the target, do not parallelize the region.
– Core count: the product of allocated cores from the root to any leaf should not exceed the total available core count.
A sketch of these checks follows below.
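Here is a hedged C sketch of those three checks over a region tree; the data structures and recursion are illustrative assumptions, not Kismet's planner code.

/* Illustrative sketch of the allocation constraints above; not Kismet's code. */
#include <math.h>

#define MAX_CHILDREN 8

typedef struct region {
    double self_p;                  /* self-parallelism from HCPA              */
    int exploitable;                /* can the target exploit this region?     */
    int cores;                      /* cores this plan assigns to the region   */
    int nchildren;
    struct region *children[MAX_CHILDREN];
} region_t;

/* Returns 1 if the allocation rooted at r satisfies all three constraints,
   given the product of cores already allocated on the path from the root.   */
int allocation_ok(const region_t *r, int path_product, int total_cores) {
    if (r->cores > (int)ceil(r->self_p)) return 0;  /* <= ceil(self-p)         */
    if (!r->exploitable && r->cores > 1) return 0;  /* unexploitable: serial   */
    path_product *= r->cores;
    if (path_product > total_cores) return 0;       /* root-to-leaf product    */
    for (int i = 0; i < r->nchildren; i++)
        if (!allocation_ok(r->children[i], path_product, total_cores))
            return 0;
    return 1;
}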

Planning Algorithm: Finding the Best Core Allocation
The planner enumerates candidate core allocation plans for 32 cores (the example plans achieve speedups of 14.5X, 7.2X, and 3.3X), estimates the execution time for each plan, and picks the plan with the highest speedup.
How can we evaluate the execution time for a specific core allocation?

Parallel Execution Time Model: A Bottom-Up Approach
The parallel execution time ptime(R) is evaluated bottom-up over the region tree, using each region's estimated speedup and its parallelization overhead O(R), with separate cases for leaf and non-leaf regions. In the example allocation, loop i and loop j get speedup 4.0 while loop k, foo(), bar1(), and bar2() get speedup 1.0, and ptime(loop i) is computed from ptime(loop j) and ptime(loop k).
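The transcript does not give the exact recurrence, so the following is only a hedged sketch of a bottom-up time model in this spirit; the way children's times are combined and overhead is charged is an assumption, not the paper's formula.

/* Hedged sketch of a bottom-up parallel execution time model.  The exact way
   Kismet combines child times, applies speedup, and charges overhead O(R) is
   described in the paper; the recurrence below is only an illustrative guess. */
#define MAX_CHILDREN 8

typedef struct region {
    double work;                   /* serial execution time of this region     */
    double speedup;                /* speedup chosen by the planner            */
    double overhead;               /* parallelization overhead O(R)            */
    int nchildren;
    struct region *children[MAX_CHILDREN];
} region_t;

double ptime(const region_t *r) {
    if (r->nchildren == 0)                      /* leaf region                 */
        return r->work / r->speedup + r->overhead;

    double child_time = 0.0;                    /* non-leaf region: combine    */
    for (int i = 0; i < r->nchildren; i++)      /* children (assumed serial    */
        child_time += ptime(r->children[i]);    /* with respect to each other) */
    return child_time / r->speedup + r->overhead;
}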

More Details in the Paper
– How do we reduce the log file size of HCPA by orders of magnitude?
– What is the impact of exploitability on speedup estimation?
– How do we predict superlinear speedup?
– And many others...

Outline: Introduction; Background: Critical Path Analysis; How Kismet Works; Experimental Results; Conclusion.

Methodology
To show Kismet's wide applicability, we targeted two very different platforms and compared estimated and measured speedups.
– Multicore platform: 8 x quad-core AMD Opteron processors; parallelization method: OpenMP (manual); exploitable parallelism: loop-level parallelism (LLP); synchronization overhead: high (> 10,000 cycles).
– Raw platform: 16-core MIT Raw processor; parallelization method: RawCC (automatic); exploitable parallelism: instruction-level parallelism (ILP); synchronization overhead: low (< 100 cycles).

Speedup Upperbound Predictions: NAS Parallel Benchmarks
[Chart: estimated speedup upperbounds versus measured speedups for the NAS Parallel Benchmarks.]

Speedup Upperbound Predictions: NAS Parallel Benchmarks (continued)
[Chart comparing predictions with and without the cache model; with the cache model, Kismet predicts superlinear speedup.]

Speedup Upperbound Predictions: Low-Parallelism SpecInt Benchmarks
[Chart: estimated speedup upperbounds versus measured speedups for low-parallelism SpecInt benchmarks.]

Conclusion
– Kismet provides a parallel speedup upperbound from serial source code.
– HCPA profiles self-parallelism using a hierarchical region model, and the parallelization planner finds the best parallelization strategy.
– We demonstrated Kismet's ability to accurately estimate parallel speedup on two very different platforms.
– Kismet will be made available for public download.

***

Self-Parallelism for Three Common Loop Types
For a loop of N iterations whose per-iteration critical path length is CP, the work is N * CP in all three cases:
– Serial loop: loop cp = N * CP; self-parallelism = (N * CP) / (N * CP) = 1.0
– DOACROSS loop: loop cp = (N/2) * CP; self-parallelism = (N * CP) / ((N/2) * CP) = 2.0
– DOALL loop: loop cp = CP; self-parallelism = (N * CP) / CP = N
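For concreteness (these examples are illustrative, not from the slides), the three C loops below exhibit the three patterns. In the DOACROSS case, the a[i-2] dependence leaves two independent chains, matching the factor-of-two self-parallelism in the table; the OpenMP pragma on the DOALL loop shows how that case maps to the multicore target.

/* Illustrative examples of the three loop types; not taken from the paper. */
#define LEN 1024

void serial_loop(int *a) {         /* each iteration depends on the previous  */
    for (int i = 1; i < LEN; i++)  /* one: self-parallelism ~ 1.0             */
        a[i] = a[i - 1] + 1;
}

void doacross_loop(int *a) {       /* dependence skips one iteration, leaving */
    for (int i = 2; i < LEN; i++)  /* two independent chains: DOACROSS-style, */
        a[i] = a[i - 2] + 1;       /* self-parallelism ~ 2.0                  */
}

void doall_loop(int *a) {          /* iterations are fully independent:       */
    #pragma omp parallel for       /* DOALL, self-parallelism ~ N             */
    for (int i = 0; i < LEN; i++)
        a[i] = a[i] + 1;
}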

Raw Platform: Targeting Instruction-Level Parallelism
RawCC:
– Exploits ILP in each basic block by executing instructions on multiple cores
– Leverages a low-latency inter-core network to enable fine-grained parallelization
– Employs loop unrolling to increase the ILP in a basic block

Adapting Kismet to Raw
Constraints to filter unprofitable patterns:
– Target only leaf regions, as they capture ILP
– Like RawCC, Kismet performs loop unrolling to increase ILP, possibly bringing superlinear speedup
Greedy planning algorithm:
– A greedy algorithm works well, as leaf regions run independently of each other
– Parallelization overhead limits the optimal core count for each region
[Diagram: a region tree whose ILP leaf regions are each assigned their own core counts, e.g. 16, 8, and 16 cores.]

Speedup Upperbound Predictions: Raw Benchmarks
[Chart: estimated versus measured speedups on Raw; some benchmarks show superlinear speedup.]

Multicore Platform: Targeting Loop-Level Parallelism
– Models OpenMP parallelization, focusing on loop-level parallelism
– Disallows nested parallelization due to excessive synchronization overhead via shared memory
– Models the cache effect to incorporate the increased aggregate cache size from multiple cores

Adapting Kismet to Multicore
Constraints to filter unprofitable OpenMP usage:
– Target only loop-level parallelism
– Disallow nested parallelization
Bottom-up dynamic programming (sketched below):
– Parallelize either the parent region or a set of its descendants
– Save the best parallelization for a region R in Solution(R)
Example: Solution(A) = {A:32} (parallelize the parent A on 32 cores) or {C:8, D:32} (parallelize descendants C and D on 8 and 32 cores).
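A hedged C sketch of that bottom-up dynamic program; the data layout and the way times are combined are illustrative assumptions, not Kismet's implementation.

/* Illustrative sketch of the bottom-up DP described above; not Kismet's code.
   For each region R we keep the best achievable parallel time, choosing
   between (a) parallelizing R itself as one OpenMP loop and (b) leaving R
   serial and combining the best solutions of its descendants.              */
#define MAX_CHILDREN 8

typedef struct region {
    double serial_time;            /* serial execution time of R               */
    double parallel_time;          /* time if R itself is parallelized,        */
                                   /* from the execution time model            */
    int is_parallelizable_loop;    /* loop-level, non-nested candidate         */
    int nchildren;
    struct region *children[MAX_CHILDREN];
    double best_time;              /* Solution(R): best time found for R       */
} region_t;

double solve(region_t *r) {
    /* Option (b): leave R serial; each child's serial time is replaced by the
       best time found for that child's subtree.                              */
    double descendants = r->serial_time;
    for (int i = 0; i < r->nchildren; i++)
        descendants += solve(r->children[i]) - r->children[i]->serial_time;

    /* Option (a): parallelize R itself, if the target allows it.             */
    double parent = r->is_parallelizable_loop ? r->parallel_time : descendants;

    r->best_time = (parent < descendants) ? parent : descendants;
    return r->best_time;
}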

Impact of the Memory System
– Gather cache miss ratios for different cache sizes
– Log load/store counts for each region
– Integrate memory access time into the time model (see the sketch below)
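A minimal sketch of how the logged memory behavior might enter the time model; the linear hit/miss-penalty formula and the parameter names are assumptions, not the paper's exact model.

/* Hedged sketch: estimating a region's memory access time from its logged
   load/store counts and a cache miss ratio measured for the effective cache
   size of the target.  The linear hit/miss-penalty model is an assumption. */
typedef struct {
    long loads, stores;        /* counts logged per region                   */
    double miss_ratio;         /* miss ratio for the effective cache size    */
} mem_profile_t;

double memory_time(const mem_profile_t *m, double hit_cycles, double miss_cycles) {
    double accesses = (double)(m->loads + m->stores);
    return accesses * ((1.0 - m->miss_ratio) * hit_cycles
                       + m->miss_ratio * miss_cycles);
}

/* The result would be added to a region's estimated compute time.  With more
   cores the effective cache grows, the miss ratio drops, and the predicted
   speedup can become superlinear, as in the NAS results above.             */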
