EECS 583 – Class 18
Research Topic 1: Breaking Dependences, Dynamic Parallelization
University of Michigan
November 14, 2012

Announcements & Reading Material

- Exam next Monday, Nov 19, 10:40-12:30
  - Check room assignments on the course website
- No class next Wednesday (Nov 21)
  - Sleep in and recover from the exam!
- Today's class reading
  - "Spice: Speculative Parallel Iteration Chunk Execution," E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc. of the 2008 International Symposium on Code Generation and Optimization, April 2008.
- Next class reading (Monday, Nov 26)
  - "Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs," M. Gordon, W. Thies, and S. Amarasinghe, Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2006.

The Data Dependence Problem

    while (ptr != NULL) {
        ...
        ptr = ptr->next;
        sum = sum + foo;
    }

- How to deal with control dependences?
- How to deal with linked data structures?
- How to remove programmatic dependences?

We Know How to Break Some of These Dependences – Recall ILP Optimizations

Apply accumulator variable expansion! The single accumulator update

    sum += x

becomes two private accumulators, one per thread:

    sum1 += x    (Thread 0)
    sum2 += x    (Thread 1)

with the partial sums combined after both threads finish:

    sum = sum1 + sum2
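To make the transformation concrete, here is a minimal sketch of accumulator variable expansion using two pthreads; the array reduction and all names (accumulate, chunk, and so on) are illustrative, not from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000
    static double x[N];

    struct chunk { int lo, hi; double partial; };

    /* Each thread accumulates into its own private copy of the sum. */
    static void *accumulate(void *arg) {
        struct chunk *c = arg;
        double sum_local = 0.0;          /* expanded accumulator */
        for (int i = c->lo; i < c->hi; i++)
            sum_local += x[i];
        c->partial = sum_local;
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        struct chunk c0 = { 0, N / 2, 0.0 };
        struct chunk c1 = { N / 2, N, 0.0 };

        for (int i = 0; i < N; i++) x[i] = 1.0;

        pthread_create(&t0, NULL, accumulate, &c0);
        pthread_create(&t1, NULL, accumulate, &c1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* Combine the partial sums: sum = sum1 + sum2. */
        printf("sum = %f\n", c0.partial + c1.partial);
        return 0;
    }

This is valid because addition is associative; the same pattern applies to any reduction with an associative, commutative operator.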

DOALL Coverage – Provable and Profiled

Still not good enough! In many loops, parallelization is hindered by only a few dependences.

Speculative Loop Fission

Original loop:

    1: while (node) {
    2:     work(node);
    3:     node = node->next;
    }

If this were traversing an array, it would be a DOALL loop. So: separate out the data structure access from the work, and parallelize the work.

Sequential part (collect the nodes into an array):

    1: while (node) {
    4:     node_array[count++] = node;
    3:     node = node->next;
    }

Parallel part (each thread j speculatively runs a chunk of CS iterations starting at index IS):

    XBEGIN
    5:  node = node_array[IS]; i = 0;
    1': while (node && i++ < CS) {
    2:      work(node);
    3':     node = node->next;
    }
    RECV(THREAD j-1)
    if (node != node_array[IS+CS]) {
        update_node_array;
        kill_other_threads();
    }
    XCOMMIT
    SEND(THREAD j+1)

Each thread waits for a token from its predecessor (RECV), checks that the node it ended on matches the recorded start of the next chunk (if not, the array is stale: update it and kill the later threads), commits, and passes the token on (SEND).

Execution of Fissed Loop

[Figure: execution timeline of the fissed loop from the previous slide. The sequential part fills node_array, then the threads run their chunks in parallel, passing the commit token from thread to thread.]

Infrequent Dependence Isolation

[Figure: a loop body with blocks A, B, C. The dependence through C is exercised only 1% of the time; the transformed loop runs A, B, C' on the common (99%) path and breaks out to the original C on the infrequent (1%) path.]

Infrequent Dependence Isolation – cont.

Sample loop from the yacc benchmark. The cross-iteration update of best/times executes only ~1% of the time:

    for (j = 0; j <= nstate; ++j) {
        if (tystate[j] == 0) continue;
        if (tystate[j] == best) continue;
        count = 0;
        cbest = tystate[j];
        for (k = j; k <= nstate; ++k)
            if (tystate[k] == cbest) ++count;
        if (count > times) {    /* infrequent (~1%) */
            best = cbest;
            times = count;
        }
    }

After isolating the infrequent dependence, the inner for loop is free of the cross-iteration update, which is handled outside it on the rare path:

    j = 0;
    while (j <= nstate) {
        for (; j <= nstate; ++j) {
            if (tystate[j] == 0) continue;
            if (tystate[j] == best) continue;
            count = 0;
            cbest = tystate[j];
            for (k = j; k <= nstate; ++k)
                if (tystate[k] == cbest) ++count;
            if (count > times) break;    /* rare: leave the parallel region */
        }
        if (count > times) {
            best = cbest;
            times = count;
            j++;
        }
    }

Speculative Prematerialization

- Iterations are sequential due to a distance-1 recurrence on the red (highlighted) variables: each iteration's value feeds the next
- Decouple iteration chunks by speculatively prematerializing each chunk's starting value of the recurrence outside the loop
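A minimal sketch of the idea, assuming the list length N is known and the list does not change while the chunks run (the speculative version would validate this and re-execute on a mismatch); heavy_work, process_chunk, and prematerialized are illustrative names, not from the slides.

    struct node { struct node *next; int data; };

    static void heavy_work(struct node *p) { p->data++; }  /* stand-in loop body */

    static void process_chunk(struct node *p, int n) {
        for (int i = 0; i < n && p; i++, p = p->next)
            heavy_work(p);
    }

    void prematerialized(struct node *head, int N) {
        /* Prematerialize: run only the distance-1 recurrence (p = p->next)
         * ahead of time to compute the second chunk's starting pointer. */
        struct node *mid = head;
        for (int i = 0; i < N / 2 && mid; i++)
            mid = mid->next;

        /* The two chunks no longer depend on each other's pointer chain,
         * so they could run on two threads; shown back-to-back here for
         * brevity. */
        process_chunk(head, N / 2);
        process_chunk(mid, N - N / 2);
    }

The prematerialization loop executes only the cheap recurrence, so its cost is small compared with running the full loop body sequentially.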

DOALL Coverage – Profiled and Transformed

Coverage Breakdown

Improving DSWP with Value Speculation – Motivating Example

    l = create_list();
    while (...) {
        process_list(l);
        modify_list(l);
    }

The list traversal, which tracks the minimum-weight node, carries a pointer-chasing dependence from iteration to iteration:

    while (c != null) {
        c1 = c;
        c = c->next_cl;
        int w = c1->pick_weight;
        if (w < wm) {
            wm = w;
            cm = c1;
        }
    }

Need to Break Cross-thread Dependences

[Figure: two schedules for Threads 1 and 2. With synchronization, Thread 2 stalls each iteration waiting for c from Thread 1. With value speculation, Thread 2 value-predicts c and runs ahead without waiting.]

Value Speculation

[Figure: the list nodes live at addresses 0x100, 0x180, 0x120, 0x200, 0x220, 0x300. Thread 1 traverses from the head while Thread 2 value-predicts c and starts from a node further down the list.]

Value Speculation for TLP: Observation 1

- Predicting the value an operation produces in one particular iteration is harder than predicting that it produces a given value somewhere within a range of iterations (the slide's analogy: forecasting for April 15 exactly vs. sometime between April 15 and April 30).

Value Speculation for TLP: Observation 2

- It is sufficient to predict the values in only a few iterations to get TLP.

[Figure: with nodes at 0x100, 0x180, 0x120, 0x200, 0x220, 0x300, predicting just c = 0x200 lets Thread 2 start its chunk at that node while Thread 1 covers the earlier ones.]

Memoize and Speculative Parallel Execution

[Figure: Core 1 traverses the list in memory and memoizes the addresses of selected nodes into a value array (labeled SVA in the slide); a later invocation uses a memoized address as Core 2's predicted starting point.]
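A single-threaded sketch of the memoize-then-speculate idea, assuming the loop is invoked repeatedly over a mostly stable list; all names (memo_mid, traverse_and_memoize, and so on) are illustrative, not from the Spice paper.

    struct node { struct node *next; int weight; };

    static void work(struct node *c) { c->weight++; }  /* stand-in loop body */

    static struct node *memo_mid;  /* pointer memoized during a prior invocation */

    /* Invocation 1: traverse normally and memoize a node where a second
     * thread could start next time (here: the midpoint of n nodes). */
    void traverse_and_memoize(struct node *head, int n) {
        int i = 0;
        for (struct node *c = head; c; c = c->next, i++) {
            if (i == n / 2)
                memo_mid = c;
            work(c);
        }
    }

    /* Later invocations: the first chunk runs from head and validates
     * that it really reaches the memoized node; the second chunk (on
     * another core in the real scheme) starts speculatively from
     * memo_mid. Returns 0 on mis-speculation. */
    int traverse_speculatively(struct node *head) {
        struct node *c = head;
        while (c && c != memo_mid) {
            work(c);
            c = c->next;
        }
        if (c != memo_mid)
            return 0;  /* list changed: memoized node no longer reachable */
        for (; c; c = c->next)
            work(c);   /* the second chunk's portion */
        return 1;
    }

The validation is the key: the prediction is checked against the actual traversal, so a stale memoized pointer only costs a re-execution, never an incorrect result.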

Speculative Parallel Execution

[Figure: Core 1 and Core 2 traverse their chunks of the list in parallel, Core 2 starting from the memoized node.]

Mis-speculation

[Figure: the list has changed since memoization, so Core 2's predicted starting node is wrong and its speculative work must be squashed.]

Memoizing Once: Problems – Delete One of the Starting Nodes

Memoizing Once: Problems – The List Grows after the Initial Traversal

Discussion Point 1 – Fission vs. Value Speculation

- Fission vs. value speculation
  - When is fission better?
  - When is value speculation better?
  - Is either more powerful than the other?
  - Which would you use in your compiler, and why?
  - Would you be confident your code would run correctly if these transformations were done?
- Memoization/caching is a powerful optimization
  - Where is this used today? Can it be used more often?

Discussion Point 2 – Dependences

- Are these "tricks" valid for other common data structures?
  - Are any generalizations needed?
- What other common data dependences are we missing in loops?

Dynamic Parallelization – Is this Practical?

Client-side Computation in JavaScript

- Flexibility, ease of prototyping, and portability
- Poor performance is one of the main challenges

JavaScript Parallelization

A typical static parallelization flow:
- Compile time: memory dependence analysis → parallel code generation
- Runtime: parallel execution

Dynamic parallelization must do everything at runtime:
- Memory profiling → data flow analysis → parallel code generation → parallel execution on top of a speculation engine (software transactional memory)
- The slide marks the expensive stages of this runtime flow with "$"

ParaScript Approach

- Lightweight dynamic analysis & code generation for speculative DOALL loops
- Low-cost customized SW speculation with a single checkpoint

Runtime flow: hot loop detection → initial parallelizability assessment (loop selection) → parallel code generation → parallel execution with customized speculation → finish, or abort back to sequential execution. (A sketch of this dispatch follows.)
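A hedged sketch of that dispatch, with placeholder analyses; every name here is illustrative, since the slide shows only the flow, not the implementation.

    #include <stdbool.h>

    typedef struct loop loop_t;

    /* Placeholders standing in for the real analyses and generators. */
    static bool is_hot(loop_t *l)                { (void)l; return true; }
    static bool assess_parallelizable(loop_t *l) { (void)l; return true; }
    static void gen_parallel_code(loop_t *l)     { (void)l; }
    static bool run_speculative(loop_t *l)       { (void)l; return true; }
    static void run_sequential(loop_t *l)        { (void)l; }

    void execute_loop(loop_t *l) {
        /* Loop selection: only hot loops that pass the initial
         * parallelizability assessment are considered. */
        if (is_hot(l) && assess_parallelizable(l)) {
            gen_parallel_code(l);
            if (run_speculative(l))   /* customized SW speculation */
                return;               /* finish */
            /* Abort: the single checkpoint is restored, fall through. */
        }
        run_sequential(l);
    }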

Scalar Array Conflict Detection

- Initial assessment catches trivial conflicts
- Keep track of the maximum and minimum accessed element indices per array
- Cross-check RD/WR sets after thread execution

Example: Thread 1 executes A[0] = … and A[5] = …; Thread 2 executes B[7] = … and A[6] = A[5]+1. The tracked sets, as (ptr, min, max) triples:
- Thread 1 write-set: (&A, 0, 5)
- Thread 2 write-set: (&A, 6, 6), (&B, 7, 7)
- Thread 2 read-set: (&A, 5, 5)

Thread 2's read range on A overlaps Thread 1's write range on A, so a conflict is detected.
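A minimal sketch of this min/max range tracking, assuming one (ptr, min, max) record per tracked array; the names are illustrative.

    #include <stdbool.h>
    #include <stdio.h>

    struct range_set { const void *ptr; int min, max; };  /* one tracked array */

    /* Record an access, widening the [min, max] window. */
    static void record(struct range_set *s, int idx) {
        if (idx < s->min) s->min = idx;
        if (idx > s->max) s->max = idx;
    }

    /* Two access sets to the same array conflict if their windows overlap. */
    static bool ranges_conflict(const struct range_set *a,
                                const struct range_set *b) {
        return a->ptr == b->ptr && a->min <= b->max && b->min <= a->max;
    }

    int main(void) {
        int A[8];
        /* Thread 1 writes A[0] and A[5]; Thread 2 reads A[5]. */
        struct range_set t1_writes = { A, 0, 0 };
        record(&t1_writes, 5);
        struct range_set t2_reads = { A, 5, 5 };

        printf("conflict: %d\n", ranges_conflict(&t1_writes, &t2_reads)); /* 1 */
        return 0;
    }

The check is deliberately conservative: sparse accesses widen the window, so it may report conflicts that never actually occur (triggering unnecessary mis-speculation), but it never misses a real one.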

Object Array Conflict Detection

- More involved than scalar arrays
- Different indices of the same array may point to the same object

[Figure: object headers carry a pointer and a reference count (RefCnt); myObj0 has RefCnt 1, while myObj1, reachable from slots of both A and B, has RefCnt 2. Slots that reach the same object are flagged as dependent based on data-flow analysis.]
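A hedged sketch of the reference-count idea, assuming each object header carries a count of the array slots that reach it; this is only one plausible reading of the slide, and all names are illustrative.

    #include <stdio.h>

    struct obj { int refcnt; int data; };  /* "header" carries RefCnt */

    /* Store an object pointer into an array slot, maintaining RefCnt. */
    static void slot_assign(struct obj **slot, struct obj *o) {
        if (*slot) (*slot)->refcnt--;
        if (o) o->refcnt++;
        *slot = o;
    }

    int main(void) {
        struct obj myObj0 = { 0, 0 }, myObj1 = { 0, 0 };
        struct obj *A[2] = { 0 }, *B[2] = { 0 };

        slot_assign(&A[0], &myObj0);
        slot_assign(&A[1], &myObj1);
        slot_assign(&B[0], &myObj1);  /* a second slot reaching myObj1 */

        /* RefCnt > 1 means two slots alias the same object, so accesses
         * through different indices may touch the same data and must be
         * treated as dependent. */
        printf("myObj0 aliased: %d\n", myObj0.refcnt > 1);  /* 0 */
        printf("myObj1 aliased: %d\n", myObj1.refcnt > 1);  /* 1 */
        return 0;
    }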

Experimental Setup

- Implemented in Firefox 3.7a1pre
- Subset of the SunSpider benchmark suite
  - The other benchmarks were identified as not parallelizable early on, incurring only a 2% slowdown from the initial analysis
- A set of Pixastic image processing filters
- 8-processor system: 2 Intel Xeon quad-cores, running Ubuntu 9.10
- Each benchmark was run 10 times and the average taken

Parallelism Coverage

[Chart callout: a high fraction of sequential execution remains in the getImageData() browser function, a DOALL loop that extracts pixel RGB & alpha values.]

SunSpider

[Chart callout: a long iteration dominates execution.]