Thread-Level Speculation
Karan Singh
CS 612
Introduction
- Thread-Level Speculation (TLS) is a form of optimistic parallelization.
- Extraction of parallelism at compile time is limited.
- TLS enables automatic parallelization by supporting thread execution without advance knowledge of any dependence violations.
Introduction
- Zhang et al.: extensions to cache coherence protocol hardware to detect dependence violations.
- Pickett et al.: design for a Java-specific software TLS system that operates at the bytecode level.
Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors
Ye Zhang, Lawrence Rauchwerger, Josep Torrellas
Outline
- Loop parallelization basics
- Speculative Run-Time Parallelization in Software
- Speculative Run-Time Parallelization in Hardware
- Evaluation and Comparison
Loop Parallelization Basics
- A loop can be executed in parallel without synchronization only if the outcome is independent of the order of iterations.
- We need to analyze data dependences across iterations: flow, anti, and output.
- If there are no cross-iteration dependences, the loop is a doall loop.
- If there are only anti or output dependences, we can apply privatization, scalar expansion, …
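As a sketch of the three dependence kinds (the loops and array names here are illustrative, not from the paper):

```python
# Illustrative loops showing cross-iteration dependences.

N = 8

# Flow (true) dependence: iteration i reads the value written by
# iteration i-1, so sequential order matters -- not a doall.
A = [0] * (N + 1)
for i in range(1, N + 1):
    A[i] = A[i - 1] + 1

# Anti dependence: iteration i reads B[i+1], which a later iteration
# overwrites. Copying the read values first (a form of privatization)
# removes the dependence.
B = list(range(N + 2))
old = list(B)
C = [0] * (N + 1)
for i in range(N + 1):
    C[i] = old[i + 1] * 2
    B[i + 1] = 0

# No cross-iteration dependence: each iteration touches only its own
# element, so this loop is a doall and iterations can run in any order.
D = [0] * N
for i in range(N):
    D[i] = i * i
```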
Loop Parallelization Basics
To parallelize such a loop speculatively, we need:
- a way of saving and restoring state
- a method to detect cross-iteration dependences
Speculative Run-Time Parallelization in Software
Mechanism for saving/restoring state:
- Before executing speculatively, we save the state of the arrays that will be modified.
- Dense access: save the whole array. Sparse access: save individual elements.
- Only modified shared arrays are saved.
- In all cases, if the loop is found not to be parallel after execution, the arrays are restored from their backups.
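The save/restore mechanism can be sketched as follows (the array and the stand-in loop body are illustrative; a real system would restore only on a failed dependence test):

```python
# Sketch of speculative save/restore: back up the array that will be
# modified, run the loop speculatively, and restore from the backup if
# the loop turns out not to be parallel.

import copy

A = [float(i) for i in range(8)]
backup = copy.deepcopy(A)      # dense access: save the whole array

def speculative_loop(A):
    for i in range(len(A)):
        A[i] *= 2.0
    return False               # pretend the dependence test failed

if not speculative_loop(A):
    A[:] = backup              # loop was not parallel: restore state
```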
Speculative Run-Time Parallelization in Software
The LRPD test detects dependences:
- flags the existence of cross-iteration dependences
- applied to those arrays whose dependences cannot be analyzed at compile time
- two phases: Marking and Analysis
LRPD Test: Setup
- back up A(1:s)
- initialize the shadow arrays Ar(1:s) and Aw(1:s) to zero
- initialize the scalar Atw to zero
LRPD Test: Marking
Performed for each iteration during speculative parallel execution of the loop:
- on a write to A(i): set Aw(i)
- on a read from A(i): if A(i) has not been written in this iteration, set Ar(i)
- at the end of the iteration, count how many distinct elements of A have been written and add the count to Atw
LRPD Test: Analysis
Performed after the speculative execution:
- compute Atm = number of non-zero Aw(i) over all elements i of the shadow array
- if any(Aw(:) ∧ Ar(:)), the loop is not a doall; abort the execution
- else if Atw == Atm, the loop is a doall
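A minimal sequential sketch of the two phases (the trace format is an assumption for illustration: each iteration is a list of ("r", i) / ("w", i) accesses to the array under test):

```python
# Sketch of the LRPD test: marking during each iteration, analysis after.

def lrpd_test(size, iterations):
    Ar = [0] * size       # shadow: read before any write in its iteration
    Aw = [0] * size       # shadow: element was written
    Atw = 0               # total distinct writes, summed per iteration

    # Marking phase, performed during each speculative iteration.
    for accesses in iterations:
        written_here = set()
        for op, i in accesses:
            if op == "w":
                Aw[i] = 1
                written_here.add(i)
            elif op == "r" and i not in written_here:
                Ar[i] = 1
        Atw += len(written_here)

    # Analysis phase, performed after the loop finishes.
    Atm = sum(Aw)                       # distinct elements written overall
    if any(w and r for w, r in zip(Aw, Ar)):
        return False                    # cross-iteration dependence: abort
    return Atw == Atm                   # equal => no element written twice

# Each iteration writes only its own element: the test passes.
doall = lrpd_test(4, [[("w", 0)], [("w", 1)], [("w", 2)]])

# Two iterations write element 1: Atw = 2 but Atm = 1, so the test fails.
not_doall = lrpd_test(4, [[("w", 1)], [("w", 1)]])
```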
Example
- Two parallel threads each write element x of the array: w(x) in both iterations.
- Aw(x) = 1 and Ar(x) = 0, so any(Aw ∧ Ar) = 0.
- Atw = 2, Atm = 1.
- Since Atw ≠ Atm, parallelization fails.
Example
- One thread writes element x; a later iteration on another thread reads x: w(x), then r(x).
- Aw(x) = 1 and Ar(x) = 1, so any(Aw ∧ Ar) = 1.
- Atw = 1, Atm = 1.
- Since any(Aw ∧ Ar) == 1, parallelization fails.
Example
- One thread reads element x; a different iteration on another thread writes x.
- Again Aw(x) = 1 and Ar(x) = 1, so any(Aw ∧ Ar) = 1.
- Atw = 1, Atm = 1.
- Since any(Aw ∧ Ar) == 1, parallelization fails.
Example
- The same iteration writes element x and then reads it: w(x) followed by r(x).
- Ar(x) stays 0, since Ar(i) is set only if A(i) was not written earlier in the same iteration.
- So any(Aw ∧ Ar) = 0, and Atw = 1 = Atm.
- Since Atw == Atm, the loop is a doall.
Speculative Run-Time Parallelization in Software
Implementation:
- In a DSM system, each processor allocates a private copy of the shadow arrays.
- The marking phase is performed locally; for the analysis phase, the private shadow arrays are merged in parallel.
Compiler integration:
- part of a front-end parallelizing compiler
- loops to parallelize are chosen based on user feedback or heuristics about previous success rates
Speculative Run-Time Parallelization in Software
Improvements:
- privatization
- iteration-wise vs. processor-wise testing
Shortcomings:
- overhead of the analysis phase and the extra instructions for marking
- we learn that parallelization failed only after the loop completes execution
Privatization Example

for i = 1 to N
    tmp = f(i)        /* f is some operation */
    A(i) = A(i) + tmp
enddo

In privatization, we create, for each processor, private copies of the variables that cause anti or output dependences. Here the shared scalar tmp causes anti and output dependences across iterations; giving each processor a private tmp makes the loop a doall.
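The loop above can be sketched with privatization applied (the function f and the sizes are illustrative; in Python, a variable assigned inside the loop body function is already local, which plays the role of the private copy):

```python
# Privatization sketch: tmp is local to each invocation of the loop body,
# so no two iterations share it and every iteration writes only A[i].

from concurrent.futures import ThreadPoolExecutor

N = 16
A = [1.0] * N

def f(i):
    return i * 0.5         # stand-in for "some operation"

def body(i):
    tmp = f(i)             # private copy of tmp per iteration
    A[i] = A[i] + tmp      # each iteration touches only its own element

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(body, range(N)))
```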
Speculative Run-Time Parallelization in Hardware
- Extend the cache coherence protocol hardware of a DSM multiprocessor with extra transactions to flag any cross-iteration data dependences.
- On detection, the parallel execution is immediately aborted.
- Requires extra state in the tags of all caches and fast memory in the directories.
Speculative Run-Time Parallelization in Hardware
Two sets of transactions:
- a non-privatization algorithm
- a privatization algorithm
Non-Privatization Algorithm
- Identify as parallel those loops where each element of the array under test is either read-only or accessed by only one processor.
- A pattern where an element is read by several processors and later written by one is flagged as not parallel.
Non-Privatization Algorithm
- The fast memory has three entries per element: ROnly, NoShr, and First.
- These entries are also sent to the caches and stored in the tags of the corresponding cache line.
- The per-element bits in the tags of the different caches and directories are kept coherent.
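The rule above can be mirrored in software (the trace format and function name are illustrative; the real scheme encodes this in per-element bits such as ROnly/NoShr in cache tags and directories, and this sketch only reproduces the rule, not the bit encoding):

```python
# Software sketch of the non-privatization rule: every element must be
# either read-only or accessed by only one processor; "read by several
# processors, later written by one" is flagged as not parallel.

def check_non_privatization(trace, size):
    written = [False] * size       # element has been written
    owner = [None] * size          # first processor to touch the element
    shared_read = [False] * size   # read by more than one processor
    for proc, op, i in trace:      # (processor id, "r"/"w", element index)
        if owner[i] is None:
            owner[i] = proc
        elif owner[i] != proc:
            if op == "w" or written[i]:
                return False       # a written element shared by two procs
            shared_read[i] = True  # reads from several procs: ok so far
        if op == "w":
            if shared_read[i]:
                return False       # read by several procs, later written
            written[i] = True
    return True

# Element 0 is read-only across two processors: allowed.
ok = check_non_privatization([(0, "r", 0), (1, "r", 0), (0, "w", 1)], 2)

# Element 0 is read by two processors, then written: flagged not parallel.
bad = check_non_privatization([(0, "r", 0), (1, "r", 0), (1, "w", 0)], 2)
```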
Speculative Run-Time Parallelization in Hardware
Implementation needs three supports:
- storage for the access bits
- logic to test and change the bits
- a table in the directory to find the access bits for a given physical address
Three parts are modified: the primary cache, the secondary cache, and the directory.
Implementation: Primary Cache
- The access bits are stored in an SRAM table called the Access Bit Array.
- The algorithm operations are determined by a Control input.
- The Test Logic performs the operations.
Implementation: Secondary Cache
- The secondary cache also needs an Access Bit Array.
- On an L1 miss that hits in L2, L2 provides the data and access bits to L1.
- The access bits are sent directly to the test logic in L1, and the bits it generates are stored in L1's Access Bit Array.
Implementation: Directory
- A small dedicated memory holds the access bits, with a lookup table.
- The access bits generated by the logic are sent to the processor.
- The transaction is overlapped with the memory and directory access.
Evaluation
- Execution-driven simulations of a CC-NUMA shared-memory multiprocessor using Tango-lite.
- Loops from applications in the Perfect Club set and one application from NCSA: Ocean, P3m, Adm, Track.
- Four environments compared: Serial, Ideal, SW, HW.
- Loops run with 16 processes (except Ocean, which runs with 8 processes).
Evaluation
[Figures: loop execution speedup; slowdown due to failure; scalability]
Software vs. Hardware
- In hardware, failure to parallelize is detected on the fly.
- Several operations are performed in hardware, which reduces overheads.
- The hardware scheme scales better with the number of processors.
- The hardware scheme has less space overhead.
Software vs. Hardware
- In hardware, the non-privatization test is processor-wise without requiring static scheduling.
- The hardware scheme can be applied to pointer-based C code more efficiently.
- However, the software implementation does not require any extra hardware!