Published by Darcy Heath. Modified over 9 years ago.
1
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡
*UCSD †UPC Barcelona ¤Hewlett-Packard Laboratories ‡UCSD/Microsoft
2
Motivations
CMPs are not just small multiprocessors
  – Different computation/communication ratio
  – Different shared resources
Inter-core fabric offers potential to support optimizations/acceleration
  – CMPs for vector, streaming workloads
3
Fine-grained Parallelism
CMPs in role of vector processors
  – Software synchronization still expensive
  – Can target inner-loop parallelism
Barriers a straightforward organizing tool
  – Opportunity for hardware acceleration
Faster barriers allow greater parallelism
  – 1.2x to 6.4x on 256-element vectors
  – 3x to 12.2x on 1024-element vectors
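The idea of using barriers to organize inner-loop parallelism can be sketched in software. The example below is only an illustration: it assumes an inner product as the inner loop, uses Python's standard `threading.Barrier` as a stand-in for any barrier implementation, and the function name `parallel_dot` is hypothetical. It shows the phase structure, not the performance (CPython threads will not give the speedups quoted on the slide).

```python
import threading

def parallel_dot(x, y, num_threads=4):
    """Inner product split across threads, with barriers separating phases."""
    n = len(x)
    partial = [0.0] * num_threads      # one slot per thread
    result = [0.0]
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        # Phase 1: each thread reduces its own contiguous slice.
        lo = tid * n // num_threads
        hi = (tid + 1) * n // num_threads
        partial[tid] = sum(a * b for a, b in zip(x[lo:hi], y[lo:hi]))
        barrier.wait()                 # every partial sum has been written
        # Phase 2: one thread combines the partial sums.
        if tid == 0:
            result[0] = sum(partial)
        barrier.wait()                 # the result is ready for all threads

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

With short vectors, each phase does very little work, so the cost of the two `barrier.wait()` calls dominates. That is the regime in which a faster barrier extends the range of problem sizes worth parallelizing.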
4
Accelerating Barriers
Barrier Filters: a new method for barrier synchronization
  – No dedicated networks
  – No new instructions
  – Changes only in shared memory system
  – CMP-friendly design point
Competitive with dedicated barrier network
  – Achieves 77% to 95% of dedicated network performance
5
Outline
Introduction
Barrier Filter Overview
Barrier Filter Implementation
Results
Summary
6
Observation and Intuition
Observations
  – Barriers need to stall forward progress
  – There exist events that already stall processors
Co-opt and extend existing stall behavior
  – Cache misses
      Either I-Cache or D-Cache suffices
7
High Level Barrier Behavior
A thread can be in one of three states
1. Executing
  – Perform work
  – Enforce memory ordering
  – Signal arrival at barrier
2. Blocking
  – Stall at barrier until all arrive
3. Resuming
  – Release from barrier
8
Barrier Filter Example
CMP augmented with filter
  – Private L1
  – Shared, banked L2
Filter State (# Threads: 3)
  Arrived-counter: 0
  Thread A: EXECUTING
  Thread B: EXECUTING
  Thread C: EXECUTING
9
Example: Memory Ordering
Before/after for memory
  – Each thread executes a memory fence
Filter State (# Threads: 3)
  Arrived-counter: 0
  Thread A: EXECUTING
  Thread B: EXECUTING
  Thread C: EXECUTING
10
Example: Signaling Arrival
Communication with filter
  – Each thread invalidates a designated cache line
Filter State (# Threads: 3)
  Arrived-counter: 0
  Thread A: EXECUTING
  Thread B: EXECUTING
  Thread C: EXECUTING
11
Example: Signaling Arrival
Invalidation propagates to shared L2 cache
Filter snoops the invalidation
  – Checks address for match
  – Records arrival
Filter State (# Threads: 3), before Thread A's invalidation
  Arrived-counter: 0
  Thread A: EXECUTING
  Thread B: EXECUTING
  Thread C: EXECUTING
Filter State, after Thread A's invalidation
  Arrived-counter: 1
  Thread A: BLOCKING
12
Example: Signaling Arrival
Invalidation propagates to shared L2 cache
Filter snoops the invalidation
  – Checks address for match
  – Records arrival
Filter State (# Threads: 3), before Thread C's invalidation
  Arrived-counter: 1
  Thread A: BLOCKING
  Thread B: EXECUTING
  Thread C: EXECUTING
Filter State, after Thread C's invalidation
  Arrived-counter: 2
  Thread C: BLOCKING
13
Example: Stalling
Thread A attempts to fetch the invalidated data
Fill request not satisfied
  – Thread stalling mechanism
Filter State (# Threads: 3)
  Arrived-counter: 2
  Thread A: BLOCKING
  Thread B: EXECUTING
  Thread C: BLOCKING
14
Example: Release
Last thread signals arrival
Barrier release
  – Counter resets
  – Filter state for all threads switches
Filter State (# Threads: 3), before release
  Arrived-counter: 2
  Thread A: BLOCKING
  Thread B: EXECUTING
  Thread C: BLOCKING
Filter State, after release
  Arrived-counter: 0
  Thread A: RESUMING
  Thread B: RESUMING
  Thread C: RESUMING
15
Example: Release
After release
  – New cache-fill requests served
  – Filter serves pending cache fills
Filter State (# Threads: 3)
  Arrived-counter: 0
  Thread A: RESUMING
  Thread B: RESUMING
  Thread C: RESUMING
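The walk-through above can be replayed with a small software model. This is a minimal sketch under stated assumptions, not the hardware design: the class and method names (`BarrierFilter`, `snoop_invalidate`, `snoop_fill`) are hypothetical, each thread is assumed to have one designated line, and the model covers only arrival and release (the later Resuming-to-Executing transition is left out).

```python
from enum import Enum

class State(Enum):
    EXECUTING = 1
    BLOCKING = 2
    RESUMING = 3

class BarrierFilter:
    """Toy model of one barrier filter (arrival and release only)."""

    def __init__(self, thread_ids):
        self.state = {t: State.EXECUTING for t in thread_ids}
        self.arrived = 0
        self.pending = set()    # threads whose fill requests are withheld

    def snoop_invalidate(self, tid):
        """A thread invalidated its designated line: record its arrival.

        Returns the threads whose pending fills are served by a release.
        """
        if self.state[tid] is not State.EXECUTING:
            raise ValueError(f"thread {tid} has already arrived")
        self.state[tid] = State.BLOCKING
        self.arrived += 1
        if self.arrived == len(self.state):
            return self._release()
        return []

    def snoop_fill(self, tid):
        """A thread refetches its line. True means the fill is served."""
        if self.state[tid] is State.BLOCKING:
            self.pending.add(tid)   # withhold the fill: the thread stalls
            return False
        return True

    def _release(self):
        """Last arrival: reset the counter and let every thread resume."""
        self.arrived = 0
        for t in self.state:
            self.state[t] = State.RESUMING
        served = sorted(self.pending)
        self.pending.clear()
        return served
```

Replaying the slides: create `BarrierFilter(["A", "B", "C"])`, call `snoop_invalidate("A")` and `snoop_invalidate("C")`, then `snoop_fill("A")`, which is withheld. The final `snoop_invalidate("B")` resets the counter, moves all three threads to `RESUMING`, and returns `["A"]` as the pending fill to serve.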
18
Software Interface
Communication requirements
  – Let hardware know # of threads
  – Let threads know signal addresses
Barrier filters as virtualized resource
  – Library interface
  – Pure software fallback
User scenario
  – Application calls OS to create barrier with # threads
  – OS allocates barrier filter, relays address and # threads
  – OS returns address to application
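The user scenario above, including the pure software fallback, might look roughly like the following from the library's side. Everything here is an assumption made for illustration: `FilterPool`, `create_barrier`, the tuple return value, and the example address are hypothetical, and `threading.Barrier` merely stands in for whatever software barrier a real library would fall back to.

```python
import threading

class FilterPool:
    """Stand-in for the OS view of the hardware: a fixed set of filter addresses."""

    def __init__(self, addresses):
        self._free = list(addresses)

    def allocate(self, num_threads):
        # A real OS would also relay num_threads to the filter at this point.
        return self._free.pop() if self._free else None

def create_barrier(pool, num_threads):
    """Return ("hardware", address) if a filter is free, else a software barrier."""
    address = pool.allocate(num_threads)
    if address is not None:
        return ("hardware", address)
    return ("software", threading.Barrier(num_threads))
```

Treating filters as a virtualized resource means the application code is the same in both cases; only the handle it receives differs.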
19
Barrier Filter Hardware
Additional hardware: “address filter”
  – In controller for shared memory level
  – State table, associated FSMs
  – Snoops invalidations, fill requests for designated addresses
Makes use of existing instructions and existing interconnect network
20
Barrier Filter Internals
Each barrier filter supports one barrier
  – Barrier state
  – Per-thread state, FSMs
Multiple barrier filters
  – In each controller
  – In banked caches, at a particular bank
23
Why have an exit address?
Needed for re-entry to barriers
  – When does Resuming again become Executing?
  – Additional fill requests may be issued
Delivery is not a guarantee of receipt
  – Context switches
  – Migration
  – Cache eviction
24
Ping-Pong Optimization
Draws from sense reversal barriers
  – Entry and exit operations as duals
Two alternating arrival addresses
  – Each conveys exit to the other’s barrier
  – Eliminates explicit invalidate of exit address
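For readers unfamiliar with the sense-reversal barriers this optimization draws from, here is a minimal software sketch of one. It is a generic centralized sense-reversing barrier, not the filter hardware; the class name is hypothetical, and the spin loop yields with `time.sleep(0)` only because this is Python.

```python
import threading
import time

class SenseReversingBarrier:
    """Centralized barrier whose release signal alternates between episodes."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = num_threads
        self.sense = False                  # shared sense, flipped on release
        self.lock = threading.Lock()
        self.local = threading.local()      # per-thread private sense

    def wait(self):
        # Each thread flips its private sense on every barrier episode.
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.count -= 1
            last = self.count == 0
            if last:
                self.count = self.num_threads   # reset for the next episode
                self.sense = my_sense           # release the waiters
        if not last:
            while self.sense != my_sense:       # spin until the shared sense flips
                time.sleep(0)
```

Because the release value alternates, arriving at episode k+1 never gets confused with a stale release from episode k, and no separate reset step is needed. The two alternating arrival addresses on the slide play a comparable role: signaling arrival on one address doubles as the exit from the barrier on the other.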
26
Methodology
Used modified version of SMT-Sim
We performed experiments using 7 different barrier implementations
  – Software: centralized, combining tree
  – Hardware: filter barrier (4 variants), dedicated barrier network
We examined performance over a set of parallelizable kernels
  – Livermore loops 2, 3, 6
  – EEMBC kernels autocorrelation, viterbi
27
Benchmark Selection
Barriers are seen as heavyweight operations
  – Infrequently executed in most workloads
Example: Ocean from SPLASH-2
  – On simulated 16-core CMP: 4% of time in barriers
Barriers will be used more frequently on CMPs
28
Latency Micro-benchmark
Average time of barrier execution (in isolation)
  – #threads = #cores
29
Latency Micro-benchmark
Notable effects due to bus saturation
  – Barrier filter scales well up until this point
30
Latency Micro-benchmark
Filters closer to dedicated network than software
  – Significant speedup vs. software still exhibited
31
Autocorrelation Kernel
On 16-core CMP
  – 7.98x speedup for dedicated network
  – 7.31x speedup for best filter barrier
  – 3.86x speedup for best software barrier
Significant speedup opportunities with fast barriers
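As a quick consistency check, the autocorrelation figures can be compared with the range the deck quotes for filters versus the dedicated network (77% to 95%). The variable names are illustrative; the three speedups are the ones on this slide.

```python
# Speedups on the 16-core CMP, as reported on the slide.
dedicated, best_filter, best_software = 7.98, 7.31, 3.86

filter_vs_dedicated = best_filter / dedicated      # fraction of dedicated-network speedup
filter_vs_software = best_filter / best_software   # gain over the best software barrier

print(f"{filter_vs_dedicated:.1%}")   # 91.6%
print(f"{filter_vs_software:.2f}x")   # 1.89x
```

On this kernel the best filter barrier reaches about 91.6% of the dedicated network's speedup, which falls inside the quoted range, and is roughly 1.89 times the speedup of the best software barrier.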
32
Viterbi Kernel
Not all applications can scale to arbitrary number of cores
Viterbi performance higher on 4 or 8 cores than on 16 cores
Figure: Viterbi on 4-core CMP
33
Livermore Loops
Serial/parallel crossover
  – HW barriers reach the crossover on a 4x smaller problem
Figure: Livermore Loop 3 on 16-core CMP
34
Livermore Loops
Reduction in parallelism to avoid false sharing
Figure: Livermore Loop 3 on 16-core CMP
35
Result Summary
Fine-grained parallelism on CMPs
  – Significant speedups possible
      1.2x to 6.4x on 256-element vectors
      3x to 12.2x on 1024-element vectors
  – False sharing affects problem size/scaling
Faster barriers allow greater parallelism
  – HW approaches extend worthwhile problem sizes
Barrier filters give competitive performance
  – 77% to 95% of dedicated network performance
36
Conclusions
Fast barriers
  – Can organize fine-grained data parallelism on a CMP
CMPs can act in a vector processor role
  – Exploit inner-loop parallelism
Barrier filters
  – CMP-oriented fast barrier
37
(FIN) Questions?
40
Extra Graphs