1 Speculative Run-Time Parallelization
Lawrence Rauchwerger, Parasol Laboratory, CS, Texas A&M University
http://parasol.tamu.edu/

2 Outline
– Motivation
– The LRPD test, a speculative run-time parallelization framework
– The recursive LRPD test for partially parallel loops
– A run-time test for loops with sparse memory accesses
– Summary and current work

3 Data Dependence
Data dependence (DD): data dependence relations are used as the essential ordering constraints among statements or operations in a program. A data dependence occurs when two memory accesses touch the same memory location and at least one of them writes to it.
Three basic data dependence relations:
flow:    X = ..   followed by   .. = X
anti:    .. = X   followed by   X = ..
output:  X = ..   followed by   X = ..

4 Parallel Loop
Can a loop be executed in parallel? Test procedure:
FOR every pair of load/store and store/store operations DO
  IF the two operations could access the same location in different iterations
  THEN the LOOP is sequential
For arrays, the memory accesses are functions of the loop indices; these functions can be linear, non-linear, or an unknown map.

A parallel loop:           A sequential loop:
for i = ..                 for i = ..
  A[i] = A[i] + B[i]         A[i+1] = A[i] + B[i]

5 Parallelism Enabling Transformations
'temp' is used as temporary storage in each iteration, creating anti and output dependences. Parallel or not?
Privatization: let each iteration have a separate location for 'temp' (a C sketch follows).

for i = ..                 for i = ...
  temp = A[i]                private temp
  A[i] = B[i]                temp = A[i]
  B[i] = temp                A[i] = B[i]
                             B[i] = temp
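As an illustration (not from the slides), the same privatization expressed in C with OpenMP; the function name swap_arrays is hypothetical, and the private copy comes from declaring temp inside the loop body:

#include <stddef.h>

/* Privatization sketch: declaring 'temp' inside the loop gives each
   iteration its own copy, removing the anti and output dependences on
   the shared temporary, so the swap loop becomes a DOALL. */
void swap_arrays(double *A, double *B, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        double temp = A[i];   /* private per iteration */
        A[i] = B[i];
        B[i] = temp;
    }
}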

6 Reduction Parallelization
A reduction is an associative and commutative operation of the form x = x ⊕ exp, where x does not occur in exp or anywhere else in the loop (with some exceptions). Reductions are usually extracted by pattern matching.

do i = ..
  A = A + B[i]
The statement carries flow, anti, and output dependences on A. Parallel or not?

Reduction parallelization: accumulate into private storage pA, then merge.
pA[1:P] = 0
do p = 1, P
  do i = my_i ...
    pA[p] = pA[p] + B[i]
do p = 1, P
  A = A + pA[p]
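A minimal C sketch of the same transformation (our illustration; reduce_sum is a hypothetical name): each thread accumulates into a private partial, the pA[p] of the slide, and the partials are merged at the end.

#include <omp.h>

/* Reduction parallelization sketch: accumulate in private storage,
   then combine across threads (the pA[1:P] scheme from the slide). */
double reduce_sum(const double *B, long n) {
    double A = 0.0;
    #pragma omp parallel
    {
        double pA = 0.0;                 /* private accumulator */
        #pragma omp for nowait
        for (long i = 0; i < n; i++)
            pA += B[i];
        #pragma omp atomic
        A += pA;                         /* merge phase: A = A + pA[p] */
    }
    return A;   /* complete after the parallel region's implicit barrier */
}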

7 Irregular Applications: The Most Challenging
Irregular applications involve problems defined on irregular or adaptive data structures: CFD, molecular dynamics, sparse linear algebra. They also adapt at run time: matrix pivoting, adaptive mesh refinement, etc. By one estimate, more than 50% of scientific programs are irregular [Kennedy '94].
From the compiler's viewpoint, in irregular programs:
– data arrays are accessed via indirections that are usually input dependent;
– optimization and parallelization information is not available statically.

for i = ..
  A[B[i]] = A[C[i]] + B[i]
Problem: the loop is sequential if any B[i] = C[j], i ≠ j (flow), or any B[i] = B[j], i ≠ j (output).

8 Irregular Programs: an Example
Dense MVM (regular):
DO I=1,N              ! Row ID
  DO J=1,N            ! Column ID
    B[I] += M[I,J] * X[J]

Sparse symmetric MVM (irregular: data accesses via indirections), computing B = M x X where M is a sparse symmetric matrix:
DO I=1,N                   ! Row ID
  DO K=row[I], row[I+1]
    J=col[K]               ! Column ID
    B[I] += M[K] * X[J]    ! M[K]=M[I,J]
    B[J] += M[K] * X[I]    ! M[K]=M[J,I]
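For concreteness, a C rendering of the sparse loop (our translation, assuming 0-based CSR-style row/col arrays; B must be zero-initialized by the caller):

/* Sparse symmetric MVM, B += M * X: row[i]..row[i+1]-1 index the stored
   entries of row i; only one triangle of M is stored. The indirect write
   B[col[k]] is what defeats static dependence analysis. */
void sym_spmv(int n, const int *row, const int *col,
              const double *M, const double *X, double *B) {
    for (int i = 0; i < n; i++) {
        for (int k = row[i]; k < row[i + 1]; k++) {
            int j = col[k];
            B[i] += M[k] * X[j];   /* M[k] = M(i,j) */
            B[j] += M[k] * X[i];   /* M[k] = M(j,i), by symmetry */
        }
    }
}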

9 Run-Time Parallelization: The Idea
Perform program analysis and optimization during program execution (run-time), and select dynamically between pre-optimized versions of the code [Wolfe 97].

do i = 1,N
  A[i+K] = A[i] + B[i]     (K, N from input; -N < K < N)
At run time, a test on K and N dispatches either the parallel version (doall i = 1,N) or the sequential version (do i = 1,N) of the loop; a C sketch follows.
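The slide leaves the test abstract, so the dependence condition below (the loop is a DOALL iff K = 0 or |K| >= N) is our reconstruction for this particular loop; update is a hypothetical name.

/* Two-version code: once K and N are known from the input, dispatch
   either the pre-parallelized or the sequential version of the loop. */
void update(double *A, const double *B, long N, long K) {
    if (K == 0 || K >= N || K <= -N) {
        /* no cross-iteration dependence: either the write range
           A[K..N+K) misses the reads A[0..N), or (K = 0) each element
           is read and written only within its own iteration */
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            A[i + K] = A[i] + B[i];
    } else {
        /* possible loop-carried dependence: run sequentially */
        for (long i = 0; i < N; i++)
            A[i + K] = A[i] + B[i];
    }
}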

10 Outline
– Motivation
– The LRPD test, a speculative run-time parallelization framework
– The recursive LRPD test for partially parallel loops
– A run-time test for loops with sparse memory accesses
– Summary and current work

11 Run-time Parallelization: Irregular Apps.
DO i = ..
  A[W[i]] = A[R[i]] + C[i]
Problem: the loop is not parallel if any R[i] = W[j], i ≠ j.
Solution: instrument the code to
1. collect the data access pattern (represented by W[i], R[i]);
2. verify whether any data dependence could occur.
Two approaches, inspector/executor and speculation:

Inspector/executor:
do i = ...        ! inspector: trace W[i], R[i]
analyze and schedule
doall i = ...     ! executor: A[W[i]] = A[R[i]] ...

Speculation:
doall i = ...     ! trace W[i], R[i]; A[W[i]] = A[R[i]] ...
analyze
if (fail) do i = ...   ! re-execute sequentially
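A minimal inspector sketch in C (our illustration; a real inspector would also build a parallel schedule rather than just answer yes/no). last_w and last_r are caller-provided scratch arrays of length a_len:

#include <stdbool.h>
#include <string.h>

/* Inspector for the loop A[W[i]] = A[R[i]] + C[i]: trace the index
   arrays and report whether two different iterations touch the same
   element of A with at least one write. */
bool inspect_doall(const int *W, const int *R, int n, int a_len,
                   int *last_w, int *last_r) {
    memset(last_w, 0xFF, a_len * sizeof(int));   /* all entries -1 */
    memset(last_r, 0xFF, a_len * sizeof(int));
    for (int i = 0; i < n; i++) {
        /* every access recorded below came from an iteration < i */
        if (last_w[R[i]] >= 0) return false;     /* flow dependence   */
        if (last_r[W[i]] >= 0) return false;     /* anti dependence   */
        if (last_w[W[i]] >= 0) return false;     /* output dependence */
        last_r[R[i]] = i;
        last_w[W[i]] = i;
    }
    return true;   /* safe to execute as a doall */
}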

12 Run-time Parallelization: Approaches
Inspector/Executor [Saltz et al. '88, '91]: inspector (tracing + scheduling) -> executor -> end; if the access pattern has not changed since the last execution, the schedule is reused.
Speculative Execution [Rauchwerger & Padua '95]: checkpoint -> speculative execution + tracing -> test; on success -> end; on failure -> roll back + sequential loop.

13 Overview of Speculative Parallelization
Run-time optimization postpones analysis and optimization until execution time. It requires:
– a checkpoint/restoration mechanism;
– an error detection method to test the validity of speculation.
It uses the actual run-time values of the program parameters affecting performance, and selects dynamically between pre-optimized versions of the code.
Compile time: the source code passes through Polaris (static analysis, run-time transformations).
Run time: checkpoint -> speculative parallel execution -> error detection (data dependence analysis) -> on success, done; on failure, restore and execute sequentially.

14 Speculative DOALL Parallelization [Rauchwerger & Padua '94]
Problem:
DO i = ..
  A[W[i]] = A[R[i]] + C[i]
Main idea: speculatively execute the loop in parallel and record accesses to the data under test in shadow structures (a replicated shadow of the data array A per processor, merged afterwards). Then analyze whether the loop was truly parallel (no actual dependences) by identifying multiple accesses to the same location.
Flow: checkpoint -> speculative parallel execution + MARK reads and writes -> analysis -> on success, end; on failure, restore and execute sequentially.

15 DOALL Test: Marking and Analysis
Parallel speculative execution:
– Mark read and write operations into separate private shadow arrays; marking a write clears the read mark.
– Increment a private write counter (# of write operations).
Post-speculation analysis:
– Merge the private shadow arrays into global shadow arrays.
– Count the elements that have been marked written.
– If (write shadow ∧ read shadow) ≠ ∅, an anti or flow dependence exists.
– If (# of modified elements) < (# of write operations), an output dependence exists.

16 LRPD Test: Main Ideas [Rauchwerger & Padua '95]
Lazy (value-based) Reduction Privatization DOALL test. Speculation fails on:
– a loop-carried flow dependence;
– a loop-carried anti or output dependence on an array that is not privatizable;
– a speculatively applied privatization or reduction parallelization transformation that turns out to be INVALID.

17 Error Detection for Speculative Parallelization
Types of errors:
– Data dependence related: writing a memory location in different iterations; writing and reading a memory location in different iterations.
– Exceptions.
General approach for data dependence detection:
– Keep shadow arrays for the data under test.
– Record accesses to the data under test in the shadows.
– Detect errors (dependences) by identifying multiple accesses to the same location.

18 Speculative Transformations: Privatization and Reductions
Reduction shared variables: form var = var ⊕ exp(), accessed only in the reduction statement, with ⊕ an associative (commutative) operator. Change the algorithm: accumulate in private storage.
do i=1,n
  do j=1,m
    f(i,j) = ...
    A(j) = A(j) + f(i,j)
  enddo
enddo

Privatizable shared variables: every read covered by a write. Allocate private storage per processor.
do i=1,n
  do j=1,m
    A(j) = f(i,j)
    B(i,j) = A(j)
  enddo
enddo

Independent shared variables: read only, or accessed only once. No transformation needed.
do I=1,n
  f(I) = A(I)
  B(I) = g(I)
enddo

19 LRPD Test: an Example
The problem: parallelization of the following loop, with all iterations executed concurrently. This is unsafe if some A(K(i)) == A(L(j)), i ≠ j: the data-dependence errors are writing a memory location in different iterations, and writing and reading a memory location in different iterations.
Do I=1,5
  z = A(K(I))
  if B(I) then
    A(L(I)) = z + C(I)
  endif
Enddo
B(1:5) = (1,0,1,0,1)
K(1:5) = (1,2,3,4,1)
L(1:5) = (2,2,4,4,2)

20 LRPD Test: Marking Phase
Parallel speculative execution and marking:
– Allocate shadow arrays Aw, Ar, Anp, one set per processor.
– Speculatively privatize A and execute the loop in parallel, recording accesses to the data under test in the shadows.
markwrite(i): if this is the first time A(i) is written in the current iteration, mark Aw(i), clear Ar(i), and increment tw_A (the write counter).
markread(i): if A(i) has not already been written in the current iteration, mark Ar(i) and mark Anp(i) (not privatizable).

Original loop:
do i = 1, 5
S1  z = A[K[i]]
    if (B[i]) then
S2    A[L[i]] = z + C[i]
    endif
enddo

Instrumented loop:
doall i = 1, 5
S1  z = A[K[i]]
    if (B[i]) then
      markread(K[i])
      markwrite(L[i])    ! also increments tw_A
S2    A[L[i]] = z + C[i]
    endif
enddoall
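A C sketch of these marking operations (markwrite/markread are the slide's names; the per-element iteration stamp used to detect "first write in this iteration" is our representation). Fields are assumed allocated by the caller, with last_write initialized to -1 and the marks and tw to 0:

/* Per-processor shadow state for one array A under test. */
typedef struct {
    int  *last_write;    /* iteration that last wrote each element, or -1 */
    char *Aw, *Ar, *Anp; /* write / exposed-read / not-privatizable marks */
    long  tw;            /* writes, counted once per (element, iteration) */
} Shadow;

/* First write to A[i] in iteration 'it': mark Aw, clear Ar, count it. */
static void markwrite(Shadow *s, int i, int it) {
    if (s->last_write[i] != it) {
        s->last_write[i] = it;
        s->Aw[i] = 1;
        s->Ar[i] = 0;
        s->tw++;
    }
}

/* A read of A[i] not preceded by a write in the same iteration is an
   exposed read, so A is not privatizable at this element. */
static void markread(Shadow *s, int i, int it) {
    if (s->last_write[i] != it) {
        s->Ar[i]  = 1;
        s->Anp[i] = 1;
    }
}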

21 LRPD Test: Analysis Phase
Post-execution analysis: detect errors (dependences) by identifying multiple accesses to the same location.
– Compute tm(A) = sum of the marks in Aw across processors (the total number of writes in distinct iterations).
– If Aw ∧ Ar ≠ ∅, the loop was NOT a DOALL.
– Else if tw = tm, the loop was a DOALL.
– Else if Aw ∧ Anp ≠ ∅, the loop was NOT a DOALL.
– Otherwise privatization was valid and the loop was a DOALL.

For the example (tw attempted = 3, tm counted = 2, so the test FAILs on the output dependence):
Shadow array   1  2  3  4
Aw(1:4)        0  1  0  1
Ar(1:4)        1  0  1  0
Anp(1:4)       1  0  1  0
Aw ∧ Ar        0  0  0  0   (pass)
Aw ∧ Anp       0  0  0  0   (pass)
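And a C sketch of the analysis over the merged shadows, encoding the checks above (lrpd_analyze is a hypothetical name; tw is the summed write counter):

#include <stdbool.h>

/* Returns true iff the speculative execution was a valid (possibly
   privatized) DOALL, following the decision order on the slide. */
bool lrpd_analyze(const char *Aw, const char *Ar, const char *Anp,
                  int a_len, long tw) {
    long tm = 0;
    for (int i = 0; i < a_len; i++) {
        if (Aw[i] && Ar[i]) return false;    /* anti or flow dependence */
        tm += Aw[i];
    }
    if (tw == tm) return true;               /* every element written in
                                                at most one iteration */
    for (int i = 0; i < a_len; i++)
        if (Aw[i] && Anp[i]) return false;   /* output dep., not privatizable */
    return true;                             /* valid as a privatized DOALL */
}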

22 Outline
– Motivation
– The LRPD test, a speculative run-time parallelization framework
– The recursive LRPD test for partially parallel loops
– A run-time test for loops with sparse memory accesses
– Summary and current work

23 Partially Parallel Loop Example
do i = 1, 8
  z = A[K[i]]
  A[L[i]] = z + C[i]
end do
K[1:8] = [1,2,3,1,4,2,1,1]
L[1:8] = [4,5,5,4,3,5,3,3]

Access pattern (R = read, W = write, . = no access):
iter   1  2  3  4  5  6  7  8
A(1)   R  .  .  R  .  .  R  R
A(2)   .  R  .  .  .  R  .  .
A(3)   .  .  R  .  W  .  W  W
A(4)   W  .  .  W  R  .  .  .
A(5)   .  W  W  .  .  W  .  .

For the LRPD test:
– One data dependence can invalidate the speculative parallelization.
– The slowdown is proportional to the speculative parallel execution time.
– Partial parallelism is not exploited.

24 The Recursive LRPD [Dang, Yu and Rauchwerger '02]
Main idea:
– Transform a partially parallel loop into a sequence of fully parallel, block-scheduled loops.
– Iterations before the first data dependence are correct and are committed.
– Re-apply the LRPD test on the remaining iterations.
Worst case: sequential execution time plus the testing overhead.

25 Algorithm
Initialize -> checkpoint -> execute the block-scheduled iterations as a DOALL -> analyze.
On success: commit. On failure: restore, reinitialize the shadows, and restart from the first failed iteration.
Across processors p0..p3, each stage commits the prefix of iterations up to the first dependence; the iterations remaining after the 1st stage form the 2nd stage, and so on until all iterations are committed.

26 Example
do i = 1, 8
  B[i] = f(i)
  z = A[K[i]]
  A[L[i]] = z + C[i]
enddo
K[1:8] = [1,2,3,4,1,2,4,2]
L[1:8] = [2,2,4,4,2,1,5,5]

Transformed code:
start = newstart = 1; success = false; end = 8
initialize shadow array; checkpoint B
while (.not. success)
  doall i = newstart, end
    B[i] = f(i)
    z = pA[K[i]]
    pA[L[i]] = z + C[i]
    markread(K[i]); markwrite(L[i])
  end doall
  analyze(success, newstart)
  commit(pA, A, start, newstart-1)
  if (.not. success) then
    restore B[newstart:end]
    reinitialize shadow array
  endif
end while

27 Implementation
Implemented as a run-time pass in Polaris, plus additional hand-inserted code:
– privatization with copy-in/copy-out for the arrays under test;
– replicated buffers for reductions;
– backup arrays for checkpointing.

28 Recursive LRPD Example
do i = 1, 8
  z = A[K[i]]
  A[L[i]] = z + C[i]
end do
K[1:8] = [1,2,3,1,4,2,1,1]
L[1:8] = [4,5,5,4,3,5,3,3]

First stage, detecting cross-processor data dependences (e.g., A(4) is written in iterations 1 and 4 but read in iteration 5):
proc   P1   P2   P3   P4
iter   1-2  3-4  5-6  7-8
A(1)   R    R    .    RR
A(2)   R    .    R    .
A(3)   .    R    W    WW
A(4)   W    W    R    .
A(5)   W    W    W    .
Iterations 1-4 are committed.

Second stage (iterations 5-8): fully parallel.
proc   P3   P4
iter   5-6  7-8
A(1)   .    RR
A(2)   R    .
A(3)   W    WW
A(4)   R    .
A(5)   W    .

29 Work Redistribution
Redistribute the remaining iterations across all processors, so that the execution time of each stage decreases.
Disadvantages:
– it may uncover new dependences across processors;
– it may incur remote cache misses caused by the data redistribution.
Without redistribution, each stage keeps the original block schedule (processors whose blocks have committed stay idle); with redistribution, the remaining iterations are re-blocked across p1..p4 at every stage.

30 Work Redistribution Example
do i = 1, 8
  z = A[K[i]]
  A[L[i]] = z + C[i]
end do
K[1:8] = [1,2,3,1,4,2,1,1]
L[1:8] = [4,5,5,4,2,5,3,3]

First stage:
proc   P1   P2   P3   P4
iter   1-2  3-4  5-6  7-8
A(1)   R    R    .    RR
A(2)   R    .    WR   .
A(3)   .    R    .    WW
A(4)   W    W    R    .
A(5)   W    W    W    .
(fails at iteration 5; iterations 1-4 commit)

Second stage, iterations 5-8 redistributed one per processor:
proc   P1  P2  P3  P4
iter   5   6   7   8
A(1)   .   .   R   R
A(2)   W   R   .   .
A(3)   .   .   W   W
A(4)   R   .   .   .
A(5)   .   W   .   .
(fails at iteration 6; iteration 5 commits)

Third stage, iterations 6-8 redistributed:
proc   P1  P2  P3
iter   6   7   8
A(1)   .   R   R
A(2)   R   .   .
A(3)   .   W   W
A(4)   .   .   .
A(5)   W   .   .
(fully parallel)

31 Outline
– Motivation
– The LRPD test, a speculative run-time parallelization framework
– The recursive LRPD test for partially parallel loops
– A run-time test for loops with sparse memory accesses
– Summary and current work

32 Overhead of the Speculative DOALL Test
Overhead of using shadow arrays:
– Marking (speculation) phase overhead: proportional to the number of distinct array references in the loop; it can be further reduced by making better use of control-flow information.
– Analysis phase overhead: proportional to the size of the data array; ideally it would be proportional only to the number of touched array elements.
Overhead reduction [CC '00]:
– Use a different shadow structure for loops with sparse access patterns.
– Reduce redundant markings as much as possible.

33 Sparse Memory Accesses [Yu and Rauchwerger '00]
Question: do multiple accesses to the same location exist? There are two ways to log the information needed to answer it:
1. For each data element, record which operations accessed it. Complexity: proportional to the number of elements.
2. For each memory operation, record which element it accessed. Complexity: proportional to the number of iterations.
(The slide's figure contrasts dense and sparse mappings from operations/iterations to data elements.)
The LRPD test uses the first way:
– marking (speculation) phase: proportional to the number of operations;
– analysis phase: proportional to the number of elements.
Hence it is not efficient for loops with sparse accesses.

34 Reduce Overhead of Run-Time Test
Use a compacted shadow structure for loops with sparse access patterns (the second way of logging):
1. Closed form (triplet): monotonic access with constant stride.
2. List: monotonic access with variable stride.
3. Hash table: random access.
The run-time library adaptively selects among the closed-form, list, and hash shadow structures (a sketch follows), and a compile-time technique reduces redundant markings.
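A sketch of the closed-form (triplet) shadow and its analysis-phase comparison (our illustration; list and hash handling, and the escalation bookkeeping, are omitted):

#include <stdbool.h>

/* Closed-form shadow: the access stream is kept as start + k*stride
   while it stays monotonic with a constant stride. */
typedef struct { int start, stride, count; } Triplet;

/* Log one access; returns false when the pattern breaks, at which point
   the run-time library would demote this log to a list (and to a hash
   table for fully random patterns). Initialize with count = 0. */
bool triplet_log(Triplet *t, int addr) {
    if (t->count == 0) { t->start = addr; t->count = 1; return true; }
    if (t->count == 1) { t->stride = addr - t->start; t->count = 2; return true; }
    if (addr == t->start + t->count * t->stride) { t->count++; return true; }
    return false;
}

/* Analysis phase: compare two triplet logs without touching the data
   array. This is a conservative hull check for positive strides: it may
   report an overlap for disjoint strided ranges, which is safe (it only
   fails speculation toward sequential execution). */
bool triplets_may_overlap(const Triplet *a, const Triplet *b) {
    int a_end = a->start + (a->count - 1) * a->stride;
    int b_end = b->start + (b->count - 1) * b->stride;
    return !(a_end < b->start || b_end < a->start);
}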

35 Run-Time Test for Loops with Sparse Access
Speculative execution:
– At every static marking site, mark in a temporary private shadow structure.
– At the end of each iteration, adaptively aggregate the markings (triplet -> list -> hash table).
– Overhead: proportional to the number of distinct array references.
Analysis phase:
– Compare the aggregated shadow structures pairwise.
– This may reduce to comparisons of ranges or triplets.
– Overhead: proportional to the number of dynamic marking sites, i.e. a constant times the number of distinct array references.

36 Combine Marks
One mark can cover multiple references if:
– their subscripts differ only in a loop-invariant expression;
– they are of the same type among RO, RW, WR;
– they have the same guarding predicates.
Combining procedure:
– Partition the subscript expressions.
– Apply set operations during a recursive traversal of the control dependence graph (CDG).

Original:
do ...
  if (pred1) then
    A(1+W(i)) = ...
    A(2+W(i)) = ...
  ...
  if (pred1) then
    A(3+W(i)) = ...

With combined marking (one Mark call covers all three writes):
do ...
  if (pred1) then
    Mark(A, W(i), WF)
    A(1+W(i)) = ...
    A(2+W(i)) = ...
  ...
  if (pred1) then
    A(3+W(i)) = ...

37 Outline
– Motivation
– The LRPD test, a speculative run-time parallelization framework
– The recursive LRPD test for partially parallel loops
– A run-time test for loops with sparse memory accesses
– Summary and current work

38 Effect of Speculative Parallelization
Program   Technique              Coverage   Suite
TRACK     R-LRPD                 98%        Perfect
SPICE     Sparse Test / R-LRPD   89%        SPEC '92
FMA3D     R-LRPD                 71%        SPEC '00
MDG       LRPD                   99%        Perfect

39 Speculative Run-time Parallelization: Summary
– Run-time techniques apply program analysis and optimization transformations during program execution.
– Speculative run-time parallelization techniques (the LRPD test, etc.) collect the memory access history while executing a loop in parallel.
– The recursive LRPD test can speculatively parallelize any loop.
– The overhead of run-time speculation can be further reduced by adaptively applying different shadow data structures.

40 The New Computing Challenge
The challenge: easy to use and high performance.
Today's scientific applications (bio, multi-physics, etc.):
– are time consuming, with dynamic features and irregular data structures;
– need automatic optimization techniques to shorten execution.
Examples: GAUSSIAN (quantum chemistry system), CHARMM (molecular dynamics of organic systems), SPICE (circuit simulation), ASCI (multi-physics simulations).
Today's systems (general purpose, heterogeneous):
– have poor portability and low efficiency;
– need automatic system-level software support.

41 Today: System Centric Computing
In system-centric computing, the application (algorithm), the compiler (static), and the system (OS & architecture) are developed, analyzed, and optimized separately; the input data only meets them at execution.
There is no global optimization in the interest of dynamic applications:
– compilers are conservative;
– OS services are generic;
– the architecture is generic.

42 Approach: Application Centric Computing
SmartApps: application-centric computing. The application controls instance-specific optimization that spans compiler + OS + architecture + data + feedback.
Development, analysis, and optimization involve the application together with a run-time compiler; a run-time system then performs execution, analysis, and optimization over the input data on a modular OS and a reconfigurable architecture.

43 SmartApps System Architecture
– Compile time: a parallelizing compiler augmented with run-time techniques produces a configurable executable plus compiler-internal information.
– Get run-time information: sample input, system state, etc.
– Generate an optimal application and system configuration: recompile the application and/or reconfigure the system.
– Execute the application: continuously monitor performance and adapt as necessary.
– Adaptive software: run-time tuning (no recompile/reconfigure) handles small adaptations; recompilation/reconfiguration handles large adaptations (failure, phase change).

44 Related Publications
– The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization, L. Rauchwerger and D. Padua, PLDI '95.
– Parallelizing While Loops for Multiprocessor Systems, L. Rauchwerger and D. Padua, IPPS '95.
– Run-time Methods for Parallelizing Partially Parallel Loops, L. Rauchwerger, N. Amato and D. Padua, ICS '95.
– SmartApps: An Application Centric Approach to High Performance Computing: Compiler-Assisted Software and Hardware Support for Reduction Operations, F. Dang, M. J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, N. Amato, L. Rauchwerger and J. Torrellas, NSF NGS, 2002.
– The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops, F. Dang, H. Yu and L. Rauchwerger, IPDPS '02.
– Hybrid Analysis: Static & Dynamic Memory Reference Analysis, S. Rus, L. Rauchwerger and J. Hoeflinger, ICS '02.
– Techniques for Reducing the Overhead of Run-time Parallelization, H. Yu and L. Rauchwerger, CC '00.
– Adaptive Reduction Parallelization Techniques, H. Yu and L. Rauchwerger, ICS '00.
http://parasol.tamu.edu/

