Techniques for Reducing the Overhead of Run-time Parallelization
Lawrence Rauchwerger
Department of Computer Science, Texas A&M University

Reduce Run-time Overhead

What is the problem?
- Parallel programming is difficult.
- Run-time techniques can succeed where static compilation fails, e.g., when access patterns are input dependent.

Goal of the present work:
- Reduce run-time overhead as much as possible.
- Explore different run-time techniques for sparse codes.

Fundamental Work: The LRPD Test

Main idea:
- Speculatively apply the most promising transformations.
- Speculatively execute the loop in parallel and subsequently test the correctness of the execution.
- Memory references are marked during the loop.
- If the execution was incorrect, the loop is re-executed sequentially.

Related ideas:
- Array privatization and reduction parallelization are applied and verified.
- When the processor-wise test is used, only cross-processor dependences are tested.

LRPD Test: Algorithm

Main components for speculative execution:
- a checkpointing/restoration mechanism;
- an error (hazard) detection method for testing the validity of the speculative parallel execution;
- an automatable strategy.

[Flow chart: initialize shadow arrays -> checkpoint shared arrays -> execute loop as DOALL -> dependence analysis -> success? If yes: commit result, done. If no: restore from checkpoint, re-execute loop in serial.]
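
The flow chart can be summarized in a few lines of code. The following is a minimal C sketch of that strategy; every helper here is a hypothetical stub standing in for the real checkpointing, marking, and analysis machinery, not the actual run-time API.

    #include <stdbool.h>
    #include <string.h>

    #define N 1000
    static double shared_data[N], checkpoint[N];

    static void save_checkpoint(void)    { memcpy(checkpoint, shared_data, sizeof shared_data); }
    static void restore_checkpoint(void) { memcpy(shared_data, checkpoint, sizeof checkpoint); }
    static void init_shadow(void)        { /* clear the shadow arrays */ }
    static void run_doall(void)          { /* parallel loop body, marking each reference */ }
    static bool shadow_analysis(void)    { return true; /* placeholder verdict */ }
    static void run_serial(void)         { /* sequential re-execution */ }

    /* Drive one speculatively parallelized loop through the flow chart above. */
    void speculative_execute(void)
    {
        init_shadow();               /* initialize shadow arrays   */
        save_checkpoint();           /* checkpoint shared arrays   */
        run_doall();                 /* execute loop as DOALL      */
        if (!shadow_analysis()) {    /* dependence analysis failed */
            restore_checkpoint();    /* restore from checkpoint    */
            run_serial();            /* re-execute loop in serial  */
        }
        /* on success the speculative results are simply kept (committed) */
    }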

LRPD Test: Example

Original loop and input data:

    do i = 1, 5
      z = A(K(i))
      if (B(i) .eqv. .true.) then
        A(L(i)) = z + C(i)
      endif
    enddo

    B(1:5) = ( )
    K(1:5) = ( )
    L(1:5) = ( )

Loop transformed for speculative execution; the markread and markwrite operations update the appropriate shadow arrays:

    do i = 1, 5
      markread(K(i))
      z = A(K(i))
      if (B(i) .eqv. .true.) then
        markwrite(L(i))
        A(L(i)) = z + C(i)
      endif
    enddo

After loop execution, the shadow arrays are analyzed.
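
To make the marking and analysis concrete, here is a runnable C version of the transformed loop together with a simplified post-execution test. The input values for B, K, and L did not survive in the transcript, so the vectors below are hypothetical sample data, and the analysis shown is only the basic independence check; the full LRPD test additionally validates privatization and reduction parallelization.

    #include <stdio.h>
    #include <stdbool.h>

    #define N 5                 /* loop iterations */
    #define M 8                 /* size of the array A under test */

    /* Hypothetical sample inputs (slot 0 unused so indices stay 1-based). */
    int  K[N + 1] = {0, 1, 2, 3, 4, 5};     /* read subscripts  */
    int  L[N + 1] = {0, 2, 3, 4, 5, 6};     /* write subscripts */
    bool B[N + 1] = {0, 1, 0, 1, 0, 1};     /* branch predicate */

    bool Rd[M + 1], Wr[M + 1];              /* shadow arrays    */

    int main(void)
    {
        double A[M + 1] = {0}, C[N + 1] = {0}, z;

        /* Transformed loop: every access to A is marked in the shadows. */
        for (int i = 1; i <= N; i++) {
            Rd[K[i]] = true;                /* markread(K(i))  */
            z = A[K[i]];
            if (B[i]) {
                Wr[L[i]] = true;            /* markwrite(L(i)) */
                A[L[i]] = z + C[i];
            }
        }

        /* Simplified post-execution analysis: if some element of A was
         * both read and written, a cross-iteration dependence may exist. */
        bool independent = true;
        for (int j = 1; j <= M; j++)
            if (Rd[j] && Wr[j])
                independent = false;

        puts(independent ? "loop was fully parallel: commit"
                         : "possible dependence: restore and re-execute serially");
        return 0;
    }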

Sparse Applications
- The dimension of the array under test is much larger than the number of distinct elements referenced by the loop.
- Using shadow arrays is therefore very expensive; a compacted shadow structure is needed.
- Sparse codes rely on indirect, multi-level addressing.
- Example: SPICE 2G6.
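
A compacted shadow structure can be organized, for example, as a hash table keyed by the referenced addresses, so that storage grows with the number of distinct references instead of with the size of the array under test. This is a minimal sketch under our own assumptions, not the paper's actual data structure:

    #include <stdbool.h>
    #include <stdlib.h>

    #define TABLE_SIZE 1024     /* sized for the expected few distinct refs */

    struct entry {
        long addr;              /* address (or index) under test */
        bool read, written;     /* shadow marks for this address */
        struct entry *next;     /* chaining for hash collisions  */
    };

    static struct entry *table[TABLE_SIZE];

    /* Find the shadow entry for addr, creating it on first reference. */
    static struct entry *lookup(long addr)
    {
        unsigned h = (unsigned long)addr % TABLE_SIZE;
        for (struct entry *e = table[h]; e != NULL; e = e->next)
            if (e->addr == addr)
                return e;
        struct entry *e = calloc(1, sizeof *e);
        if (e == NULL)
            abort();            /* out of memory: abandon speculation */
        e->addr = addr;
        e->next = table[h];
        table[h] = e;
        return e;
    }

    void markread(long addr)  { lookup(addr)->read    = true; }
    void markwrite(long addr) { lookup(addr)->written = true; }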

Overhead Minimization

Targets:
- the number of markings for memory references during speculative execution;
- the cost of the shadow structure: speculate about the data structures and reference patterns of the loop, and customize the shape and size of the shadow structure accordingly.

[Flow chart: the same speculative-execution flow as above, with the marking and dependence-analysis stages labeled "overhead reduced".]

Redundant Marking Elimination
- Aggregation based on the same address and type: memory references are classified as RO (read-only), WF (write-first), RW (read-write), or NO (not referenced).
- Mark at a minimal set of sites by exploiting the dominance relationship.
- The algorithm relies on recursive aggregation by a DFS traversal of the CDG (control dependence graph).
- Markings are combined according to the rules shown below.
- Boolean expression simplification enhances the ability to remove more redundant markings.
- Loop-invariant predicates on references enable extracting an inspector loop.
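
As a hypothetical illustration of the dominance idea (the loop, the subscript array idx, and the markwrite/markread helpers are ours, not taken from the paper): when a write to an address dominates a later read of the same address within an iteration, the reference is write-first (WF) and the read mark is redundant.

    #include <stdbool.h>

    #define M 100
    static bool wf[M], rd[M];           /* shadow flags for A */
    #define markwrite(a) (wf[a] = true)
    #define markread(a)  (rd[a] = true)

    /* idx[i] is assumed to stay in [0, M). */
    void loop_body(int n, const int idx[], const double B[],
                   double A[], double C[])
    {
        for (int i = 0; i < n; i++) {
            markwrite(idx[i]);          /* single mark at the dominating write */
            A[idx[i]] = B[i];
            /* markread(idx[i]) would be redundant here: the write above
             * dominates this read, so the reference is classified WF.   */
            C[i] = A[idx[i]] * 2.0;
        }
    }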

Redundant Marking Elimination: Rules

[Figure: aggregation rules for combining markings over the reference types NO, RO, WF, and RW.]

Grouping of Related References
- Two references are related if they satisfy: 1) the subscripts of the two references are of the form ptr + affine function; 2) the corresponding base pointers (ptr) are the same.
- Related references of the same type can be grouped for the purpose of marking reduction.
- The grouping procedure finds a minimum number of disjoint sets of references.
- Result: 1) a reduced number of marking instructions; 2) a reduced size of the shadow structure.
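
A minimal sketch of the effect (the names and the range-marking helper are our own): three writes through the same base pointer p at affine offsets 0, 1, and 2 form one group, so a single grouped mark per iteration, and a single shadow entry per group, suffices.

    #include <stdbool.h>

    #define M 100
    static bool wf[M];

    /* Placeholder: record one shadow entry covering a contiguous range. */
    static void markwrite_range(int lo, int hi)
    {
        for (int a = lo; a <= hi; a++)
            wf[a] = true;
    }

    /* p[i] + {0,1,2} are related references: same base pointer, affine
     * offsets.  One grouped mark replaces three individual markwrites. */
    void loop_body(int n, const int p[], double A[], const double B[])
    {
        for (int i = 0; i < n; i++) {
            markwrite_range(p[i], p[i] + 2);
            A[p[i]]     = B[i];
            A[p[i] + 1] = B[i] * 2.0;
            A[p[i] + 2] = B[i] * 3.0;
        }
    }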

CDG and colorCDG
- The colorCDG contains the same information as the CDG, but represents the predicates in a clearer form.

[Figure: a CDG with root S0 and successors S1..S5 guarded by predicates A and !A, next to the equivalent colorCDG in which edges are colored by predicate.]

Grouping Algorithm

    Procedure Grouping
    Input:   Statement N; Predicate C
    Output:  Groups return_groups
    Local:   Groups branch_groups; Branch branch;
             Statement node; Predicate path_condition

    return_groups = CompGroups(N, C)
    if (N.succ_entries > 0)
        for (branch in N.succ)
            Initialize(branch_groups)
            path_condition = C & branch.predicate
            for (node in branch.succ)
                branch_groups = branch_groups ∪ Grouping(node, path_condition)
            end
            return_groups = return_groups ∪ branch_groups
        end
    end
    return return_groups

Sparse Access Classification
- Identify base pointers.
- Classify the references of each base pointer as:
  A) monotonic access + constant stride;
  B) monotonic access + variable stride;
  C) random access.
- The shadow structure is chosen to match: a triplet such as (1, 4, 4) for constant-stride monotonic accesses, an array + range info, or a hash table + range info.
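
The sketch below shows one plausible classifier for a base pointer's address trace. The triplet encoding (min, max, stride) and all names are our assumptions for illustration:

    #include <stdio.h>

    enum pattern { MONO_CONST_STRIDE, MONO_VAR_STRIDE, RANDOM_ACCESS };

    /* Classify one base pointer's address sequence, following the
     * three categories above. */
    enum pattern classify(const int addr[], int n)
    {
        int stride = (n > 1) ? addr[1] - addr[0] : 0;
        enum pattern p = MONO_CONST_STRIDE;
        for (int i = 1; i < n; i++) {
            int d = addr[i] - addr[i - 1];
            if (d <= 0)
                return RANDOM_ACCESS;      /* not monotonic */
            if (d != stride)
                p = MONO_VAR_STRIDE;       /* monotonic, stride varies */
        }
        return p;
    }

    int main(void)
    {
        int a[] = {1, 4, 7, 10};    /* triplet (1, 10, 3)            */
        int b[] = {1, 4, 5, 9};     /* monotonic, variable stride    */
        int c[] = {5, 2, 9, 1};     /* random: needs a hash table    */
        printf("%d %d %d\n", classify(a, 4), classify(b, 4), classify(c, 4));
        return 0;
    }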

Adaptive Dependence Analysis
- The run-time marking routines are adaptive; the type of each reference is recorded in a bit vector.
- Access to and analysis of the shadow structures is normally expensive, but becomes cheap when the speculation on the access pattern is correct.

[Flow chart: check the bit vector for the types of references and the shadow structure in use. For triplets: compare ranges (min/max); if they do not overlap, return success, otherwise compare and merge the triplets. For arrays: array merging. For hash tables: hash-table merging. Each path returns success or failure.]
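
For the triplet case the cheap test is just a range comparison. A minimal sketch, using the same assumed (min, max, stride) encoding as above:

    #include <stdbool.h>

    struct triplet { int min, max, stride; };   /* assumed summary encoding */

    /* Disjoint address ranges cannot carry a cross-iteration dependence. */
    static bool ranges_disjoint(struct triplet a, struct triplet b)
    {
        return a.max < b.min || b.max < a.min;
    }

    /* Returns true when the loop is proven independent by the cheap test;
     * false means the ranges overlap, i.e., fall back to merging the
     * per-element shadow structures (array or hash table). */
    bool quick_independence_test(struct triplet reads, struct triplet writes)
    {
        return ranges_disjoint(reads, writes);
    }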

Experimental Results

Results:
- Marking overhead reduction for speculative loops.
- Shadow structures for sparse codes: SPICE 2G6, BJT loop.

Experimental setup:
- 16-processor HP V-Class, 4 GB memory, HP-UX 11 operating system.
- Techniques implemented in the Polaris infrastructure.
- Codes: SPICE 2G6, P3M, OCEAN, TFFT2.

Marking Overhead Reduction

[Figure: speedup ratios achieved by Polaris + run-time pass + grouping.]

Experiment: SPICE 2G6, BJT Loop
- The BJT loop accounts for about 30% of the total sequential execution time.
- The unstructured loop is converted into a DO loop, which is then distributed into: a sequential loop that collects all nodes of the linked list into a temporary array, and a parallel loop that does the real work.
- Most indices are based on a common base pointer.
- SPICE does its own memory management.

We applied:
- the grouping method;
- the choice of shadow structures.

Experiment: Speedup

[Figure: speedup graphs for two inputs: one extended from the long input of the Perfect suite, and one extended from an 8-bit adder.]

Experiment: Speedup

[Figure: speedup graphs for two inputs: one extended from the long input of SPEC 92, and one extended from an 8-bit adder.]

Experiment: Execution Time

[Figure: execution times for two inputs: one extended from the long input of the Perfect suite, and one extended from an 8-bit adder.]

Conclusion

The proposed techniques:
- increased the potential speedup and efficiency of run-time parallelized loops;
- will dramatically increase the coverage of automatic parallelization of Fortran programs.

The techniques used for SPICE may also be applicable to C codes.