
Speculative Run-Time Parallelization
Lawrence Rauchwerger, Parasol Laboratory, CS Department, Texas A&M University

Outline: Motivation; the LRPD test, a speculative run-time parallelization framework; the recursive LRPD test for partially parallel loops; a run-time test for loops with sparse memory accesses; summary and current work.

Data Dependence. Data dependence relations are the essential ordering constraints among statements or operations in a program. A data dependence exists when two memory accesses touch the same memory location and at least one of them writes to it. The three basic data dependence relations are:
flow:   X = ..  followed by  .. = X
anti:   .. = X  followed by  X = ..
output: X = ..  followed by  X = ..
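For concreteness, a minimal C sketch of the three relations (the statement labels S1/S2 and variable names are illustrative):

    x = a + 1;   /* S1 writes x */
    y = x * 2;   /* S2 reads x: flow (true) dependence S1 -> S2 */

    y = x * 2;   /* S1 reads x */
    x = a + 1;   /* S2 writes x: anti dependence S1 -> S2 */

    x = a + 1;   /* S1 writes x */
    x = b + 2;   /* S2 writes x: output dependence S1 -> S2 */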

Parallel Loop. Can a loop be executed in parallel? Test procedure: FOR every pair of load/store and store/store operations, IF the load L and store S could access the same location in different iterations, THEN the loop is sequential. For arrays, the memory accesses are functions of the loop indices; these functions can be linear, non-linear, or an unknown map.
A parallel loop:   for i = ..  A[i] = A[i] + B[i]
A sequential loop: for i = ..  A[i+1] = A[i] + B[i]
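As a hedged illustration in C with OpenMP (the loop bound n and the array declarations are assumed), the first loop can run as a DOALL because every iteration touches a distinct A[i], while the second cannot because iteration i writes A[i+1], which iteration i+1 then reads:

    /* parallel: each iteration accesses its own A[i] */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        A[i] = A[i] + B[i];

    /* sequential: loop-carried flow dependence through A[i+1] */
    for (int i = 0; i < n - 1; i++)
        A[i+1] = A[i] + B[i];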

Parallelism Enabling Transformations. Parallel or not?
for i = ..  { temp = A[i]; A[i] = B[i]; B[i] = temp }
The shared 'temp' carries anti and output dependences across iterations, but 'temp' is only used as temporary storage within each iteration. Privatization lets each iteration have a separate location for 'temp':
for i = ..  { private temp; temp = A[i]; A[i] = B[i]; B[i] = temp }
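In C with OpenMP this transformation is just a private clause; a minimal sketch, assuming the arrays and bound n are declared elsewhere:

    double temp;
    #pragma omp parallel for private(temp)   /* each thread gets its own temp */
    for (int i = 0; i < n; i++) {
        temp = A[i];      /* the anti and output dependences on temp disappear */
        A[i] = B[i];
        B[i] = temp;
    }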

Reduction Parallelization. A reduction is an associative and commutative operation of the form x = x ⊗ exp, where x does not occur in exp or anywhere else in the loop (with some exceptions); reductions are usually extracted by pattern matching. Parallel or not?
do i = ..  A = A + B[i]
The loop carries flow, anti, and output dependences on A. Reduction parallelization accumulates into private storage and merges afterwards:
pA[1:P] = 0
do p = 1, P:  do i = my_i ..:  pA[p] = pA[p] + B[i]
do p = 1, P:  A = A + pA[p]
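A sketch of the transformed reduction in C with OpenMP, mirroring the pA[1:P] scheme above (P, n, B, and A are assumed to be declared; a production version would pad pA to avoid false sharing):

    #include <omp.h>

    double pA[P];                          /* one partial accumulator per thread */
    for (int p = 0; p < P; p++) pA[p] = 0.0;

    #pragma omp parallel num_threads(P)
    {
        int p = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            pA[p] += B[i];                 /* accumulate in private storage */
    }
    for (int p = 0; p < P; p++)
        A += pA[p];                        /* final cross-processor reduction */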

Irregular Applications: The Most Challenging. Irregular applications involve problems defined on irregular or adaptive data structures (CFD, molecular dynamics, sparse linear algebra) and adaptation such as matrix pivoting or adaptive mesh refinement. Approximately more than 50% of scientific programs are irregular [Kennedy'94]. From the compiler's viewpoint, in irregular programs data arrays are accessed via indirections that are usually input dependent, so optimization and parallelization information is not available statically.
for i = ..  A[B[i]] = A[C[i]] + B[i]
Problem: the loop is sequential if any B[i] = C[j], i ≠ j, or any B[i] = B[j], i ≠ j.

Irregular Programs: an Example.
Dense MVM (regular):
  DO I=1,N            ! row ID
    DO J=1,N          ! column ID
      B[I] += M[I,J] * X[J]
Sparse symmetric MVM (irregular: data accesses via indirections):
  DO I=1,N            ! row ID
    DO K=row[I], row[I+1]
      J=col[K]        ! column ID
      B[I] += M[K] * X[J]   ! M[K]=M[I,J]
      B[J] += M[K] * X[I]   ! M[K]=M[J,I]
[Figure: B = M x X with a sparse symmetric matrix M.]
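The same sparse kernel in C, as a sketch assuming CSR-style arrays row[], col[], and M[]; the B[j] update is what makes the outer loop unsafe to parallelize statically, since two different rows i may share a column index j:

    /* sparse symmetric MVM: B = M * X, with M stored by rows */
    for (int i = 0; i < n; i++) {                  /* row id */
        for (int k = row[i]; k < row[i+1]; k++) {  /* nonzeros of row i */
            int j = col[k];                        /* column id */
            B[i] += M[k] * X[j];                   /* M[k] = M[i][j] */
            B[j] += M[k] * X[i];                   /* M[k] = M[j][i] (symmetry) */
        }
    }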

Run-Time Parallelization: The Idea [Wolfe 97]. Perform program analysis and optimization during program execution (at run time); select dynamically between pre-optimized versions of the code. Example (K and N come from the input):
do i = 1,N  A[i+K] = A[i] + B[i]
A run-time test on K and N decides: if it passes (T), execute doall i = 1,N A[i+K] = A[i] + B[i]; if it fails (F), execute the sequential do i = 1,N A[i+K] = A[i] + B[i].
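A hedged C/OpenMP sketch of such two-version code; the run-time condition shown here (K == 0 or |K| >= N) is one sufficient test under which no iteration's write A[i+K] can collide with another iteration's read A[i]:

    if (K == 0 || K >= N || -K >= N) {
        #pragma omp parallel for        /* test passed: run the DOALL version */
        for (int i = 0; i < N; i++)
            A[i+K] = A[i] + B[i];
    } else {
        for (int i = 0; i < N; i++)     /* test failed: run sequentially */
            A[i+K] = A[i] + B[i];
    }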

Outline: Motivation; the LRPD test, a speculative run-time parallelization framework; the recursive LRPD test for partially parallel loops; a run-time test for loops with sparse memory accesses; summary and current work.

Run-time Parallelization: Irregular Applications. Problem:
DO i = ..  A[ W[i] ] = A[ R[i] ] + C[i]
The loop is not parallel if any R[i] = W[j], i ≠ j. Solution: instrument the code to (1) collect the data access pattern (represented by W[i], R[i]) and (2) verify whether any data dependence could occur. Two approaches:
Inspector/executor: do i = .. trace W[i], R[i]; analyze and schedule; doall i = .. A[W[i]] = A[R[i]] ..
Speculation: doall i = .. trace W[i], R[i]; A[W[i]] = A[R[i]] ..; analyze; if (fail) re-execute do i = .. sequentially.
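A minimal C sketch of the inspector side; the function name and the dense last_write/last_read bookkeeping are illustrative, not the papers' implementation. It replays only the subscripts W[i] and R[i] and reports whether any cross-iteration conflict exists:

    #include <stdbool.h>
    #include <stdlib.h>

    bool loop_is_parallel(const int *W, const int *R, int n, int a_size) {
        int *last_write = malloc(a_size * sizeof(int));  /* last iteration that wrote A[x] */
        int *last_read  = malloc(a_size * sizeof(int));  /* last iteration that read A[x]  */
        for (int x = 0; x < a_size; x++) last_write[x] = last_read[x] = -1;
        bool ok = true;
        for (int i = 0; i < n && ok; i++) {
            if (last_write[R[i]] >= 0 && last_write[R[i]] != i) ok = false;  /* flow dep.   */
            if (last_write[W[i]] >= 0 && last_write[W[i]] != i) ok = false;  /* output dep. */
            if (last_read[W[i]]  >= 0 && last_read[W[i]]  != i) ok = false;  /* anti dep.   */
            last_read[R[i]]  = i;
            last_write[W[i]] = i;
        }
        free(last_write); free(last_read);
        return ok;   /* true: the executor may run the loop as a DOALL */
    }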

Run-time Parallelization: Approaches.
Inspector/Executor [Saltz et al. 88, 91]: inspector (tracing + scheduling); if the access pattern has not changed, reuse the previous schedule; executor; end.
Speculative Execution [Rauchwerger & Padua, 95]: checkpoint; speculative execution + tracing; test; if the test succeeds, end; otherwise roll back and execute the loop sequentially.

Overview of Speculative Parallelization. Run-time optimization postpones analysis and optimization until execution time; it requires a checkpoint/restoration mechanism and an error detection method to test the validity of speculation. It uses the actual run-time values of the program parameters that affect performance and selects dynamically between pre-optimized versions of the code. Flow: the Polaris compiler performs static analysis and inserts run-time transformations at compile time; at run time, the code checkpoints, executes the speculative parallel version, and runs error detection (data dependence analysis); on success it continues, otherwise it restores the checkpoint and executes sequentially.

Speculative DOALL Parallelization [Rauchwerger & Padua '94]. Problem:
DO i = ..  A[ W[i] ] = A[ R[i] ] + C[i]
Main idea: speculatively execute the loop in parallel and record accesses to the data under test in shadow structures (a replicated, per-processor shadow of the data array A, later merged); afterwards, analyze whether the loop was truly parallel (no actual dependences) by identifying multiple accesses to the same location. Flow: checkpoint; speculative parallel execution + mark reads and writes; analysis; on success, done; otherwise restore and execute sequentially.

DOALL Test: Marking and Analysis.
Parallel speculative execution: mark read and write operations in separate private shadow arrays; marking a write clears the read mark; increment a private write counter (the number of write operations).
Post-speculation analysis: merge the private shadow arrays into global shadow arrays; count the elements that have been marked written; if (write shadow ∧ read shadow ≠ 0), an anti or flow dependence exists; if (the number of modified elements < the number of write operations), an output dependence exists.
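A hedged C sketch of the marking routines only (the shadow arrays Aw and Ar are private per processor, and the sketch assumes each location is marked at most once per iteration, as the rules above require):

    /* called at a speculative write to A[x] */
    void markwrite(char *Aw, char *Ar, int x, long *tw) {
        Aw[x] = 1;       /* remember the write           */
        Ar[x] = 0;       /* a write clears the read mark */
        (*tw)++;         /* private write counter        */
    }

    /* called at a speculative read of A[x] */
    void markread(char *Aw, char *Ar, int x) {
        if (!Aw[x])      /* the read is exposed only if A[x] was not already written */
            Ar[x] = 1;
    }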

LRPD Test: Main Ideas [Rauchwerger & Padua '95]. Main ideas: Lazy (value-based) testing; Reduction parallelization; Privatization; DOALL test. Errors that invalidate speculation: a loop-carried flow dependence; a loop-carried anti or output dependence for arrays that are not privatizable; speculatively applied privatization and reduction parallelization transformations that turn out to be INVALID.

Error Detection for Speculative Parallelization. Types of errors: (1) data dependence related, i.e., writing a memory location in different iterations, or writing and reading a memory location in different iterations; (2) exceptions. General approach for data dependence detection: keep shadow arrays for the data under test, record accesses to the data under test in the shadows, and detect errors (dependences) by identifying multiple accesses to the same location.

Speculative Transformations: Privatization and Reductions.
Reduction shared variables: of the form var = var ⊗ exp(), accessed only in the reduction statement, with ⊗ an associative (commutative) operator; change the algorithm to accumulate in private storage.
  do i=1,n  do j=1,m  f(i,j) = ...; A(j) = A(j) + f(i,j)  enddo
Privatizable shared variables: every read is covered by a write; allocate private storage per processor.
  do i=1,n  do j=1,m  A(j) = f(i,j); B(i,j) = A(j)  enddo
Independent shared variables: read only, or accessed only once; no transformation needed.
  do I=1,n  f(I) = A(I); B(I) = g(I)  enddo

LRPD Test: an Example. The problem: parallelization of the following loop:
Do I=1,5
  z = A(K(I))
  if B(I) then A(L(I)) = z + C(I) endif
Enddo
with B(1:5) = (1,0,1,0,1), K(1:5) = (1,2,3,4,1), L(1:5) = (2,2,4,4,2). All iterations are executed concurrently; this is unsafe if some A(K(i)) == A(L(j)), i ≠ j. The errors to detect are data dependence related: writing a memory location in different iterations, or writing and reading a memory location in different iterations.

LRPD Test: Marking Phase. Allocate shadow arrays Aw, Ar, Anp, one set per processor; speculatively privatize A and execute the loop in parallel; record accesses to the data under test in the shadows.
markwrite(i): if this is the first time A(i) is written in the iteration, mark Aw(i), clear Ar(i), and increment tw_A (the write counter).
markread(i): if A(i) has not already been written in the iteration, mark Ar(i) and mark Anp(i) (not privatizable).
Original loop:     do i = 1, 5  S1: z = A[K[i]];  if (B[i]) then  S2: A[L[i]] = z+C[i]  endif  enddo
Instrumented loop: doall i = 1, 5  S1: z = A[K[i]];  if (B[i]) then  markread(K[i]); markwrite(L[i]); increment(tw_A);  S2: A[L[i]] = z+C[i]  endif  enddo

LRPD Test: Analysis Phase. Post-execution analysis: detect errors (dependences) by identifying multiple accesses to the same location. Compute tm(A) = the sum of the marks in Aw across processors (the total number of writes in distinct iterations). Then:
if Aw ∧ Ar ≠ 0, the loop was NOT a DOALL;
if tw = tm, the loop was a DOALL;
if Aw ∧ Anp ≠ 0, the loop was NOT a DOALL;
otherwise, privatization was valid and the loop was a DOALL.
For the example loop above:
Shadow array | 1 2 3 4 | attempted (tw) | counted (tm) | outcome
Aw(1:4)      | 0 1 0 1 | 3              | 2            | FAIL (tw ≠ tm)
Ar(1:4)      | 1 0 1 0 |
Anp(1:4)     | 1 0 1 0 |
Aw ∧ Ar      | 0 0 0 0 |                |              | Pass
Aw ∧ Anp     | 0 0 0 0 |                |              | Pass
Since Aw ∧ Ar = 0 and Aw ∧ Anp = 0, privatization was valid and the loop is a DOALL even though tw ≠ tm.
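A compact C sketch of this analysis over the merged shadows; the function name is illustrative:

    /* Aw, Ar, Anp: shadows merged (OR-ed) across processors; m = size of A;
       tw = total number of writes counted during speculation */
    int lrpd_is_doall(const char *Aw, const char *Ar, const char *Anp, int m, long tw) {
        long tm = 0;
        int priv_conflict = 0;
        for (int x = 0; x < m; x++) {
            if (Aw[x] && Ar[x])  return 0;        /* exposed read: flow or anti dependence */
            if (Aw[x] && Anp[x]) priv_conflict = 1;
            tm += Aw[x];                          /* number of distinct locations written  */
        }
        if (tw == tm) return 1;                   /* each location written in at most one iteration */
        return !priv_conflict;                    /* otherwise a DOALL only if privatization was valid */
    }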

Outline: Motivation; the LRPD test, a speculative run-time parallelization framework; the recursive LRPD test for partially parallel loops; a run-time test for loops with sparse memory accesses; summary and current work.

Partially Parallel Loop Example.
do i = 1, 8
  z = A[K[i]]
  A[L[i]] = z + C[i]
end do
with K[1:8] = [1,2,3,1,4,2,1,1] and L[1:8] = [4,5,5,4,3,5,3,3]. Access pattern per element of A: A(1) is read in iterations 1, 4, 7, 8; A(2) is read in 2 and 6; A(3) is read in 3 and written in 5, 7, 8; A(4) is written in 1 and 4 and read in 5; A(5) is written in 2, 3, 6. For the LRPD test: one data dependence can invalidate speculative parallelization; the slowdown is proportional to the speculative parallel execution time; and partial parallelism is not exploited.

The Recursive LRPD Test [Dang, Yu and Rauchwerger '02]. Main idea: transform a partially parallel loop into a sequence of fully parallel, block-scheduled loops; iterations before the first data dependence are correct and are committed; the LRPD test is then re-applied on the remaining iterations. Worst case: sequential execution time plus the testing overhead.

Algorithm: initialize; checkpoint; execute the remaining iterations as a block-scheduled DOALL; analyze; on success, commit; on failure, restore, reinitialize, and restart on the uncommitted iterations. [Figure: block-scheduled iterations on processors p0..p3, showing the work remaining after the 1st and 2nd stages.]

Example. Original loop:
do i = 1, 8
  B[i] = f(i)
  z = A[K[i]]
  A[L[i]] = z + C[i]
enddo
with L[1:8] = [2,2,4,4,2,1,5,5] and K[1:8] = [1,2,3,4,1,2,4,2]. Transformed code:
start = newstart = 1; success = false; end = 8
initialize shadow array; checkpoint B
while (.not. success)
  doall i = newstart, end
    B[i] = f(i)
    z = pA[K[i]]
    pA[L[i]] = z + C[i]
    markread(K[i]); markwrite(L[i])
  end doall
  analyze(success, newstart)
  commit(pA, A, start, newstart-1)
  if (.not. success) then
    restore B[newstart:end]
    reinitialize shadow array
  endif
end while

Implementation. Implemented as a run-time pass in Polaris plus additional hand-inserted code: privatization with copy-in/copy-out for the arrays under test; replicated buffers for reductions; backup arrays for checkpointing.

Recursive LRPD Example.
do i = 1, 8
  z = A[K[i]]
  A[L[i]] = z + C[i]
end do
with K[1:8] = [1,2,3,1,4,2,1,1] and L[1:8] = [4,5,5,4,3,5,3,3], block-scheduled on processors P1..P4. First stage: the per-processor shadow marks for A(1:5) reveal cross-processor data dependences, so only the iterations before the first dependence are committed. Second stage: the remaining iterations are re-executed and their marks show no remaining conflicts, so the stage is fully parallel. [Tables: read/write shadow marks per element of A and per processor for the first and second stages.]

Work Redistribution. Redistribute the remaining iterations across processors, so the execution time of each stage decreases. Disadvantages: redistribution may uncover new dependences across processors, and it may incur remote cache misses from moving the data. [Figure: stages with and without redistribution, showing the work remaining after the 1st and 2nd stages on p1..p4.]

Work Redistribution Example.
do i = 1, 8
  z = A[K[i]]
  A[L[i]] = z + C[i]
end do
with K[1:8] = [1,2,3,1,4,2,1,1] and L[1:8] = [4,5,5,4,2,5,3,3]. [Tables: read/write shadow marks per element of A and per processor for the first, second, and third stages; the remaining iterations are redistributed across P1..P4 after each stage.]

Outline: Motivation; the LRPD test, a speculative run-time parallelization framework; the recursive LRPD test for partially parallel loops; a run-time test for loops with sparse memory accesses; summary and current work.

Overhead of the Speculative DOALL Test. Overhead of using a shadow array: the marking (speculation) phase overhead is proportional to the number of distinct array references in the loop and can be further reduced by making better use of control-flow information; the analysis phase overhead is proportional to the size of the data array, whereas ideally it would be proportional only to the number of touched array elements. Overhead reduction [CC'00]: use a different shadow structure for loops with sparse access patterns, and reduce redundant markings as much as possible.

Sparse Memory Accesses [Yu and Rauchwerger '00]. Question: do multiple accesses to the same location exist? There are two ways to log the information needed to answer it: (1) for each data element, record which operations accessed it (complexity proportional to the number of elements); (2) for each memory operation, record which element it accessed (complexity proportional to the number of iterations). [Figure: dense vs. sparse mappings from operations (iterations) to data elements.] The LRPD test uses the first way: its marking (speculation) phase is proportional to the number of operations and its analysis phase is proportional to the number of elements, which is not efficient for loops with sparse accesses.

Reduce the Overhead of the Run-Time Test. Use a compacted shadow structure for loops with sparse access patterns (the second way): (1) closed form (triplet) for monotonic accesses with a constant stride; (2) list for monotonic accesses with a variable stride; (3) hash table for random accesses. The run-time library adaptively selects among closed form, list, and hash table, and a compile-time technique reduces redundant markings.
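A hedged sketch in C of how such an adaptive shadow might be organized; the struct layout and demotion policy are illustrative, not the library's actual interface. Accesses are first summarized as a (base, stride, count) triplet and demoted to a list, and ultimately a hash table, when the pattern stops being regular:

    typedef enum { TRIPLET, LIST, HASH } shadow_kind;

    typedef struct {
        shadow_kind kind;
        long base, stride, count;   /* closed form: base + k*stride, 0 <= k < count */
        /* list and hash table representations are elided in this sketch */
    } shadow_t;

    void shadow_mark(shadow_t *s, long addr) {
        if (s->kind != TRIPLET) { /* insert into the list or hash table (elided) */ return; }
        if (s->count == 0)      { s->base = addr; s->stride = 0; s->count = 1; }
        else if (s->count == 1) { s->stride = addr - s->base;    s->count = 2; }
        else if (addr == s->base + s->count * s->stride) s->count++;
        else s->kind = LIST;    /* irregular access: demote to a list, later to a hash table */
    }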

Run-Time Test for Loops with Sparse Accesses.
Speculative execution: at every static marking site, mark into a temporary private shadow structure; at the end of each iteration, adaptively aggregate the markings (triplet → list → hash table); overhead is proportional to the number of distinct array references.
Analysis phase: compare the aggregated shadow structures pair by pair, which may reduce to comparing ranges or triplets; overhead is proportional to the number of dynamic marking sites, i.e., a constant times the number of distinct array references.
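When two aggregated shadows have both stayed in closed form, the comparison can be a constant-time range test; a conservative C sketch, assuming positive strides and nonzero counts:

    /* do two triplets (base, stride, count) touch overlapping index ranges? */
    int triplets_may_conflict(long b1, long s1, long n1,
                              long b2, long s2, long n2) {
        long hi1 = b1 + (n1 - 1) * s1;
        long hi2 = b2 + (n2 - 1) * s2;
        return !(hi1 < b2 || hi2 < b1);   /* conservative: range overlap only, strides ignored */
    }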

Combine Marks. One mark can cover multiple references if: their subscripts differ only by a loop-invariant expression; they are of the same type among RO, RW, and WR; and they have the same guarding predicates. Combining procedure: partition the subscript expressions and apply set operations during a recursive traversal of the CDG (control dependence graph). Example, before:
do ...  if (pred1) then  A(1+W(i)) = ...;  A(2+W(i)) = ...;  ...;  if (pred1) then  A(3+W(i)) = ...
After, a single mark covers all three writes:
do ...  if (pred1) then  Mark(A, W(i), WF);  A(1+W(i)) = ...;  A(2+W(i)) = ...;  ...;  if (pred1) then  A(3+W(i)) = ...

Outline: Motivation; the LRPD test, a speculative run-time parallelization framework; the recursive LRPD test for partially parallel loops; a run-time test for loops with sparse memory accesses; summary and current work.

Effect of Speculative Parallelization ProgramTechniquesCoverageSuite TRACKR-LRPD98%Perfect SPICESparse Test / R-LRPD89%SPEC’92 FMA3DR-LRPD71%SPEC’00 MDGLRPD99%Perfect

Speculative Run-time Parallelization: Summary. Run-time techniques apply program analysis and optimization transformations during program execution. Speculative run-time parallelization techniques (the LRPD test, etc.) collect the memory access history while executing a loop in parallel. The recursive LRPD test can speculatively parallelize any loop. The overhead of run-time speculation can be further reduced by adaptively applying different shadow data structures.

The New Computing Challenge. Today's scientific applications (GAUSSIAN, a quantum chemistry system; CHARMM, molecular dynamics of organic systems; SPICE, circuit simulation; ASCI multi-physics simulations) are time consuming and have dynamic features and irregular data structures; they need automatic optimization techniques to shorten execution time. Today's systems are general purpose and heterogeneous, with poor portability and low efficiency; they need automatic system-level software support. The challenge: easy to use and high performance.

Today: System-Centric Computing. Development, analysis, and optimization involve the application (algorithm), the compiler (static), and the system (OS & architecture); execution is driven by the input data. There is no global optimization: compilers are conservative (in the interest of dynamic applications), OS services are generic, and the architecture is generic. [Figure: application, compiler, OS, and HW as separate layers in system-centric computing.]

Approach: Application-Centric Computing (SmartApps). Development, analysis, and optimization: the application plus a run-time compiler. The run-time system performs execution, analysis, and optimization: a SmartApp puts the application in control and combines a run-time compiler, a modular OS, and a reconfigurable architecture. Instance-specific optimization: compiler + OS + architecture + data + feedback.

SmartApps System Architecture. The application is compiled by a parallelizing compiler augmented with run-time techniques, producing a configurable executable together with compiler-internal information. At run time: get run-time information (sample input, system state, etc.); generate the optimal application and system configuration; for a small adaptation (tuning), use adaptive software and run-time tuning (no recompile/reconfigure); for a large adaptation (failure, phase change), recompile the application and/or reconfigure the system; then execute the application, continuously monitoring performance and adapting as necessary.

Related Publications
The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization, Lawrence Rauchwerger and David Padua, PLDI'95.
Parallelizing While Loops for Multiprocessor Systems, Lawrence Rauchwerger and David Padua, IPPS'95.
Run-time Methods for Parallelizing Partially Parallel Loops, Lawrence Rauchwerger, Nancy Amato and David Padua, ICS'95.
SmartApps: An Application Centric Approach to High Performance Computing: Compiler-Assisted Software and Hardware Support for Reduction Operations, F. Dang, M. J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, N. Amato, L. Rauchwerger and J. Torrellas, NSF NGS, 2002.
The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops, F. Dang, H. Yu and L. Rauchwerger, IPDPS'02.
Hybrid Analysis: Static & Dynamic Memory Reference Analysis, S. Rus, L. Rauchwerger and J. Hoeflinger, ICS'02.
Techniques for Reducing the Overhead of Run-time Parallelization, H. Yu and L. Rauchwerger, CC'00.
Adaptive Reduction Parallelization Techniques, H. Yu and L. Rauchwerger, ICS'00.