Thread-Level Speculation
Karan Singh
CS 612, 2/23/2006
Introduction
- extraction of parallelism at compile time is limited
- Thread-Level Speculation (TLS) is a form of optimistic parallelization
- TLS allows automatic parallelization by supporting speculative thread execution without advance knowledge of any dependence violations
Introduction
- Zhang et al.: extensions to the cache coherence protocol hardware to detect dependence violations
- Pickett et al.: design for a Java-specific software TLS system that operates at the bytecode level
Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors
Ye Zhang, Lawrence Rauchwerger, Josep Torrellas
Outline
- Loop parallelization basics
- Speculative Run-Time Parallelization in Software
- Speculative Run-Time Parallelization in Hardware
- Evaluation and Comparison
Loop parallelization basics
- a loop can be executed in parallel without synchronization only if the outcome is independent of the order of iterations
- need to analyze data dependences across iterations: flow, anti, and output
- if there are no dependences: doall loop
- if there are only anti or output dependences: privatization, scalar expansion, ...
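The order-independence criterion can be illustrated with a small sketch (hypothetical loop bodies, not from the slides): a body that touches only A[i] gives the same result in any iteration order, while a body with a cross-iteration flow dependence does not.

```python
def doall_body(A, i):
    A[i] = A[i] * 2          # touches only A[i]: no cross-iteration dependence

def flow_body(A, i):
    A[i] = A[i - 1] + 1      # reads the element the previous iteration wrote

def run(body, order, n=8):
    A = [1] * n
    for i in order:
        body(A, i)
    return A

# The doall body gives the same result in any iteration order...
assert run(doall_body, range(8)) == run(doall_body, reversed(range(8)))
# ...but the flow-dependent body does not, so that loop is not a doall.
assert run(flow_body, range(1, 8)) != run(flow_body, reversed(range(1, 8)))
```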
Loop parallelization basics
To parallelize such a loop speculatively, we need:
- a way of saving and restoring state
- a method to detect cross-iteration dependences
Speculative Run-Time Parallelization in Software
mechanism for saving/restoring state:
- before executing speculatively, save the state of the arrays that will be modified
- dense access: save the whole array; sparse access: save individual elements
- only modified shared arrays are saved
- in all cases, if the loop is found not to be parallel after execution, the arrays are restored from their backups
Speculative Run-Time Parallelization in Software
LRPD test to detect dependences:
- flags the existence of cross-iteration dependences
- applied to those arrays whose dependences cannot be analyzed at compile time
- two phases: Marking and Analysis
LRPD test: setup
- back up A(1:s)
- initialize the shadow arrays Ar(1:s) and Aw(1:s) to zero
- initialize the scalar Atw to zero
LRPD test: marking
performed for each iteration during the speculative parallel execution of the loop:
- write to A(i): set Aw(i)
- read from A(i): if A(i) has not been written in this iteration, set Ar(i)
- at the end of the iteration, count how many different elements of A have been written and add the count to Atw
LRPD test: analysis
performed after the speculative execution:
- compute Atm = number of non-zero Aw(i) over all elements i of the shadow array
- if any(Aw(:) ^ Ar(:)), the loop is not a doall; abort execution
- else if Atw == Atm, the loop is a doall
- otherwise (Atw > Atm), some element was written in more than one iteration, so the loop is not a doall
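The marking and analysis phases above can be simulated sequentially. The sketch below assumes a simple trace format (a list of read/write events per iteration); that format is an illustration of the test's logic, not the paper's implementation, which marks during the loop itself.

```python
def lrpd_test(iterations, n):
    """Sequential simulation of the LRPD test on an array of size n.

    `iterations` is a list of per-iteration access traces, each a list of
    ('r', i) / ('w', i) events on array indices (an assumed format).
    """
    Aw = [0] * n      # shadow write flags
    Ar = [0] * n      # shadow flags for exposed reads
    Atw = 0           # total writes, accumulated per iteration

    # Marking phase: one pass per speculative iteration.
    for trace in iterations:
        written = set()               # elements written in *this* iteration
        for op, i in trace:
            if op == 'w':
                Aw[i] = 1
                written.add(i)
            elif i not in written:    # read not covered by a same-iteration write
                Ar[i] = 1
        Atw += len(written)

    # Analysis phase, after the speculative execution.
    Atm = sum(Aw)                     # number of non-zero Aw(i)
    if any(w and r for w, r in zip(Aw, Ar)):
        return False                  # cross-iteration flow/anti dependence
    return Atw == Atm                 # Atw > Atm: output dependence, not a doall

# The four slide examples, on a one-element array (index 0):
assert not lrpd_test([[('w', 0)], [('w', 0)]], 1)   # two iterations write x
assert not lrpd_test([[('w', 0)], [('r', 0)]], 1)   # write and exposed read
assert not lrpd_test([[('r', 0), ('w', 0)]], 1)     # read before write
assert lrpd_test([[('w', 0), ('r', 0)]], 1)         # write, then covered read
```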
Example
Two parallel threads each write to element x of the array:
- Aw = 1, Ar = 0
- Aw ^ Ar = 0, so any(Aw ^ Ar) = 0
- Atw = 2, Atm = 1
Since Atw ≠ Atm, parallelization fails.
Example
Element x is written in one iteration and read in a different iteration:
- Aw = 1, Ar = 1
- Aw ^ Ar = 1, so any(Aw ^ Ar) = 1
- Atw = 1, Atm = 1
Since any(Aw ^ Ar) == 1, parallelization fails.
Example
Element x is read in one iteration and written in a different iteration:
- Aw = 1, Ar = 1
- Aw ^ Ar = 1, so any(Aw ^ Ar) = 1
- Atw = 1, Atm = 1
Since any(Aw ^ Ar) == 1, parallelization fails.
Example
One iteration writes element x and then reads it:
- Aw = 1, Ar = 0* (so any(Aw ^ Ar) = 0)
- Atw = 1, Atm = 1
Since Atw == Atm, the loop is a doall.
* a read sets Ar(i) only if A(i) has not been written in this iteration
Speculative Run-Time Parallelization in Software
implementation:
- in a DSM system, each processor allocates a private copy of the shadow arrays
- the marking phase is performed locally
- for the analysis phase, the private shadow arrays are merged in parallel
compiler integration:
- part of a front-end parallelizing compiler
- loops to parallelize are chosen based on user feedback or heuristics about the previous success rate
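The merge of the private shadow arrays can be pictured as follows (a sequential sketch; the per-processor (Aw, Ar, Atw) triple format is an assumption): the read and write flags are OR-ed element-wise, the private write counts are summed, and then the same analysis as before runs on the merged arrays.

```python
def merge_and_analyze(shadows):
    """Merge per-processor private shadow arrays and run the LRPD analysis.

    `shadows` is a list of (Aw, Ar, Atw) triples, one per processor
    (an assumed format).  In the paper the merge itself runs in parallel.
    """
    n = len(shadows[0][0])
    Aw, Ar, Atw = [0] * n, [0] * n, 0
    for aw, ar, atw in shadows:
        for i in range(n):
            Aw[i] |= aw[i]            # OR the write flags
            Ar[i] |= ar[i]            # OR the exposed-read flags
        Atw += atw                    # sum the private write counts
    Atm = sum(Aw)
    if any(w and r for w, r in zip(Aw, Ar)):
        return False                  # cross-iteration dependence detected
    return Atw == Atm
```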
Speculative Run-Time Parallelization in Software
improvements:
- privatization
- iteration-wise vs. processor-wise marking
shortcomings:
- overhead of the analysis phase and of the extra marking instructions
- we learn that parallelization failed only after the loop completes execution
Privatization example

  for i = 1 to N
    tmp = f(i)        /* f is some operation */
    A(i) = A(i) + tmp
  enddo

In privatization, each processor creates private copies of the variables that cause anti or output dependences (here, the scalar tmp).
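A minimal sketch of the idea in Python (the thread pool and the definition of f are illustrative assumptions, not from the slides): making tmp local to each task gives every worker its own private copy, removing the anti/output dependences on the shared scalar, so the iterations can run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def f(i):                        # placeholder for "some operation"
    return i * i

def parallel_with_privatization(A, workers=4):
    """Run the slide's loop in parallel with tmp privatized per task."""
    def body(i):
        tmp = f(i)               # tmp is local to the call: a private copy
        A[i] = A[i] + tmp        # iterations touch disjoint A(i): safe
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(body, range(len(A))))
    return A
```

Had tmp been a single shared variable, concurrent iterations could overwrite each other's value between the assignment and the use.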
Speculative Run-Time Parallelization in Hardware
- extend the cache coherence protocol hardware of a DSM multiprocessor with extra transactions to flag any cross-iteration data dependences
- on detection, parallel execution is immediately aborted
- extra state in the tags of all caches; fast memory in the directories
Speculative Run-Time Parallelization in Hardware
two sets of transactions:
- non-privatization algorithm
- privatization algorithm
Non-privatization algorithm
- identifies as parallel those loops where each element of the array under test is either read-only or accessed by only one processor
- a pattern where an element is read by several processors and later written by one is flagged as not parallel
Non-privatization algorithm
- the fast memory holds three entries per element: ROnly, NoShr, First
- these entries are also sent to the cache and stored in the tags of the corresponding cache line
- the per-element bits in the tags of the different caches and directories are kept coherent
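The condition that the access bits encode can be checked sequentially as a sketch (the trace format is an assumption of this illustration; the real scheme evaluates the condition incrementally in hardware and aborts on the first violation):

```python
def speculation_ok(accesses, n):
    """Check the non-privatization condition on an access trace.

    `accesses` is a list of (proc, 'r' | 'w', index) events (an assumed
    format).  An element passes if it is read-only, or if every access
    to it comes from a single processor.
    """
    readers = [set() for _ in range(n)]     # processors that read element i
    writers = [set() for _ in range(n)]     # processors that wrote element i
    for proc, op, i in accesses:
        (writers if op == 'w' else readers)[i].add(proc)
    for i in range(n):
        if not writers[i]:
            continue                        # read-only element: fine
        if len(readers[i] | writers[i]) > 1:
            return False                    # shared and written: not parallel
    return True
```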
Speculative Run-Time Parallelization in Hardware
implementation needs three supports:
- storage for the access bits
- logic to test and change the bits
- a table in the directory to find the access bits for a given physical address
three parts are modified: the primary cache, the secondary cache, and the directory
Implementation: primary cache
- the access bits are stored in an SRAM table called the Access Bit Array
- the algorithm's operations are determined by the Control input
- the Test Logic performs the operations
Implementation: secondary cache
- also needs an Access Bit Array
- on an L1 miss that hits in L2:
  - L2 provides the data and access bits to L1
  - the access bits are sent directly to the test logic in L1
  - the bits generated are stored in the Access Bit Array of L1
Implementation: directory
- a small dedicated memory for access bits, with a lookup table
- access bits generated by the logic are sent to the processor
- the transaction is overlapped with the memory and directory access
Evaluation
- execution-driven simulations of a CC-NUMA shared-memory multiprocessor using Tango-lite
- loops from applications in the Perfect Club set and one application from NCSA: Ocean, P3m, Adm, Track
- four environments compared: Serial, Ideal, SW, HW
- loops run with 16 processes (except Ocean, which runs with 8)
Evaluation
(figure: loop execution speedup)
Evaluation
(figure: slowdown due to failure)
Evaluation
(figure: scalability)
Software vs. Hardware
- in hardware, failure to parallelize is detected on the fly
- several operations are performed in hardware, which reduces overheads
- the hardware scheme scales better with the number of processors
- the hardware scheme has less space overhead
Software vs. Hardware
- in hardware, the non-privatization test is processor-wise without requiring static scheduling
- the hardware scheme can be applied more efficiently to pointer-based C code
- however, the software implementation requires no extra hardware!