Slide 1: Complementing User-Level Coarse-Grain Parallelism with Implicit Speculative Parallelism
Nikolas Ioannou, Marcelo Cintra
School of Informatics, University of Edinburgh
Slide 2: Introduction
Intl. Symp. on Microarchitecture - December 2011
Multi-cores and many-cores are here to stay (figure source: Intel)
Slide 3: Introduction
Multi-cores and many-cores are here to stay
Parallel programming is essential to realize their potential
Focus on coarse-grain parallelism
Some parallel applications show weak or no scaling
Can we exploit under-utilized cores to complement coarse-grain parallelism?
– Nested parallelism in multi-threaded applications
– Exploit it using implicit speculative parallelism
Slide 4: Contributions
Evaluation of implicit speculative parallelism on top of explicit parallelism to improve scalability:
– Improves scalability by 40% on average
– At the same energy consumption
Detailed analysis of multi-threaded scalability:
– Performance bottlenecks
– Behavior on different input datasets
Auto-tuning to dynamically select the number of explicit and implicit threads
Slide 5: Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions
Slide 6: Bottlenecks: Large Critical Sections
[Timeline of threads T0–T3 showing large critical sections in Integer Sort (IS), NASPB]
Slide 7: Bottlenecks: Load Imbalance
[Timeline of threads T0–T3 showing load imbalance in radiosity, SPLASH-2]
Can we use these idle cores to accelerate this application?
Slide 8: Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
– Low-power nested parallelism
Conclusions
Slide 9: Proposal
Programming:
– Users explicitly parallelize code
– Trade off development time for performance gains
Architecture and compiler:
– Exploit fine-grain parallelism on top of user threads
– Thread-Level Speculation (TLS) within each user thread
Hardware:
– Support both explicit and implicit threads simultaneously, in a nested fashion
Slide 10: Proposal

#pragma omp parallel for
for (j = 0; j < M; ++j) {
  ...
  for (i = 0; i < N; ++i) {
    ... = A[L[i]] + ...
    ...
    A[K[i]] = ...
  }
  ...
}

Explicit threads T0 ... TK, TL, ... TM each run a chunk of the outer loop; within an explicit thread (e.g., TK), consecutive inner-loop iterations run as speculative implicit threads T_K,i, T_K,i+1, T_K,i+2, T_K,i+3.
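The inner loop above is only *speculatively* parallel: the store to A[K[i]] may alias the load of A[L[j]] in another iteration, and a compiler cannot rule this out statically. The TLS hardware resolves the ambiguity at run time. As a minimal sketch (the helper name and interface are hypothetical, not part of the proposal), the conflict condition between two inner-loop iterations can be expressed as:

```c
/* Illustrative only: static independence test for two iterations of the
 * inner loop "... = A[L[i]] + ...; A[K[i]] = ...".  When this returns 0,
 * the iterations may conflict, which is exactly the case TLS hardware
 * must detect dynamically by tracking loads and stores. */
static int inner_iters_independent(const int *K, const int *L,
                                   int i1, int i2) {
    /* Write A[K[i1]] vs. read A[L[i2]] (RAW/WAR), the symmetric case,
     * and write vs. write (WAW) all conflict when the indices match. */
    if (K[i1] == L[i2] || K[i2] == L[i1] || K[i1] == K[i2])
        return 0;
    return 1;
}
```

With irregular index arrays K and L, some iteration pairs are independent and some are not, which is why speculation pays off when conflicts are rare.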
Slide 11: Proposal: Many-core Architecture
Many-core partitioned into clusters (tiles)
Coherence (MESI):
– Snooping coherence within a cluster
– Directory coherence across clusters
Support for TLS only within a cluster:
– Snooping TLS protocol
– Speculative buffering in the L1 data caches
Slide 12: Proposal: Many-core Architecture
[Diagram: tiled many-core running threads T0–T31; each tile holds four cores (C0–C3) with private IL1/DL1 caches, a shared L2 bank, and a directory/router; memory controllers at the chip edge]
Slide 13: Complementing Coarse-Grain Parallelism
[Timelines: 4 explicit threads (T0–T3) vs. 2x explicit threads (T0–T7)]
Slide 14: Complementing Coarse-Grain Parallelism
[Timelines: 4 explicit threads (T0–T3) vs. 4 explicit threads plus 4 implicit speculative threads (4 ETs + 4 ISTs)]
Slide 17: Expected Speedup Behavior
[Chart: expected speedup behavior]
Slide 18: Proposal: Auto-Tuning the Thread Count
Find the scalability tipping point dynamically
Choose whether to employ implicit threads
Simple hill-climbing approach
Applicable to OpenMP applications amenable to Dynamic Concurrency Throttling (DCT) [Curtis-Maury PACT08]
Prototype developed in the Omni OpenMP system
Slide 19: Auto-tuning example (learning step 1)

#pragma omp parallel for
for (j = 0; j < M; ++j) {   /* M = 32 */
  ...
  for (i = 0; i < N; ++i) {
    ... = A[L[i]] + ...
    ...
    A[K[i]] = ...
  }
  ...
}

OMP parallel region i detected, first time:
– Can the iteration count be computed statically, and is it less than the maximum core count? Yes → set initial Tcount to 32
– Measure execution time t_i^1
Slide 20: Auto-tuning example (learning step 2)
OMP parallel region i detected again:
– Set Tcount to the next value (16)
– Measure execution time t_i^2
– t_i^2 < t_i^1 → continue exploration
Slide 21: Auto-tuning example (learning step 3)
OMP parallel region i detected again:
– Set Tcount to the next value (8)
– Measure execution time t_i^3
– t_i^3 > t_i^2 → stop exploration
Slide 22: Auto-tuning example (learning step 4)
OMP parallel region i detected:
– Use Tcount = 16, no further exploration
– Set TLS to 4-way
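The four learning steps above boil down to a simple descent over candidate thread counts: keep halving while the measured region time improves, then settle on the last count that improved and hand the remaining cores to implicit speculative threads. A minimal sketch of that selection logic (names and the offline time array are illustrative, not the Omni prototype's actual code):

```c
/* Hill-climbing selection of the explicit thread count for one OMP
 * parallel region.  counts[k] is the k-th candidate in exploration
 * order (e.g., 32, 16, 8) and times[k] the execution time measured
 * with that many threads.  Returns the chosen thread count. */
static int hill_climb_tcount(const int *counts, const double *times, int n) {
    int best = counts[0];
    for (int k = 1; k < n; ++k) {
        if (times[k] < times[k - 1])
            best = counts[k];   /* still improving: keep descending */
        else
            break;              /* got worse: stop exploration */
    }
    return best;
}
```

In the example, times of t^1 > t^2 < t^3 for 32, 16, and 8 threads make the tuner settle on 16 explicit threads, leaving room for 4-way TLS within each.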
Slide 23: Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions
Slide 24: Evaluation Methodology
SESC simulator, extended to model our scheme
Architecture:
– Core: 4-issue out-of-order superscalar, 96-entry ROB, 3GHz; 32KB 4-way DL1 cache; 32KB 2-way IL1 cache; 16Kbit hybrid branch predictor
– Tile/System: 128 cores partitioned into 2-way or 4-way tiles (both evaluated); shared 8MB 8-way L2 cache with 64 MSHRs; directory with full bit-vector sharer list; grid interconnect with 64B links; 48GB/s to main memory
Slide 25: Evaluation Methodology
Benchmarks:
– 12 workloads from PARSEC 2.1, SPLASH-2, and NASPB
– Parallel region simulated to completion
Compilation:
– MIPS binaries generated with GCC 3.4.4
– Speculation added automatically through a source-to-source compiler
– Speculation regions selected through manual profiling
Power:
– CACTI 4.2 and Wattch
Slide 26: Evaluation Methodology
Alternative schemes compared against:
– Core Fusion [Ipek ISCA07]: dynamic combination of cores to handle low-thread-count applications; approximated through wide 8-issue cores with all core resources doubled and no latency increase → an upper bound
– Frequency Boost, inspired by Turbo Boost [Intel08]: for each idle core, one other core gains an 800MHz frequency boost with a 200mV voltage increase (same power cap)
All these schemes shift resources to a subset of cores in order to improve performance
Slide 27: Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions
Slide 28: Bottom Line
[Chart: speedup over the best scalability point]
TLS-4: 41% avg; TLS-2: 27% avg
Slides 29–32: Energy
[Chart: energy, showing the best performing point for each scheme]
– Energy consumption slightly lower on average
– Less time spent in busy-wait synchronization
– High mispeculation → higher energy
– Little synchronization → higher energy
Slide 33: Serial/Critical Sections — is (NASPB) [chart]
Slide 34: Load Imbalance — radiosity (SPLASH-2) [chart]
Slide 35: Synchronization Heavy — ocean (SPLASH-2) [chart]
Slide 36: Coarse-Grain Partitioning — swaptions (PARSEC) [chart]
Slide 37: Poor Static Partitioning — sp (NASPB) [chart]
Slide 38: Effect of Dataset Size — unchanged behavior: cholesky (also canneal, ocean, ft, is, sp) [chart]
Slide 39: Effect of Dataset Size — improved scalability, but the TLS boost remains: swaptions (also bodytrack, radiosity, ep) [chart]
Slide 40: Effect of Dataset Size — improved scalability, lessened TLS boost: streamcluster [chart]
Slide 41: Effect of Dataset Size — worse scalability, even better TLS boost: water [chart]
Slide 42: Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions
Slide 43: Conclusions
Multi-cores and many-cores are here to stay:
– Parallel programming is essential to exploit the new hardware
– Some coarse-grain parallel programs do not scale
– There is enough nested parallelism to improve scalability
Proposed speculative parallelization through implicit speculative threads on top of explicit threads:
– Significant scalability improvement of 40% on average
– No increase in total energy consumption
– An auto-tuning mechanism that dynamically chooses the number of threads and performs within 6% of the oracle
Slide 45: Related Work
[von Praun PPoPP07] Implicit ordered transactions
[Kim Micro10] Speculative parallel-stage decoupled software pipelining
[Ooi ICS01] Multiplex
[Madriles ISCA09] Anaphase
[Rajwar MICRO01], [Martinez ASPLOS02] Speculative lock elision, speculative synchronization
[Moravan ASPLOS06], etc. Nested transactional memory
Slide 46: Bibliography
[Intel08] Intel Corp. Intel Turbo Boost Technology in Intel Core Microarchitecture (Nehalem) Based Processors. White paper, 2008.
[Ipek ISCA07] E. Ipek et al. Core Fusion: Accommodating software diversity in chip multiprocessors. ISCA 2007.
[von Praun PPoPP07] C. von Praun et al. Implicit parallelism with ordered transactions. PPoPP 2007.
[Kim Micro10] Kim et al. Scalable speculative parallelization in commodity clusters. MICRO 2010.
[Ooi ICS01] C.-L. Ooi et al. Multiplex: Unifying conventional and speculative thread-level parallelism on a chip multiprocessor. ICS 2001.
[Madriles ISCA09] C. Madriles et al. Boosting single-thread performance in multi-core systems through fine-grain multi-threading. ISCA 2009.

Slide 47: Bibliography (cont.)
[Rajwar MICRO01] R. Rajwar and J. R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. MICRO 2001.
[Martinez ASPLOS02] J. Martinez and J. Torrellas. Speculative synchronization: Applying thread-level speculation to explicitly parallel applications. ASPLOS 2002.
[Moravan ASPLOS06] M. Moravan et al. Supporting nested transactional memory in LogTM. ASPLOS 2006.
[Curtis-Maury PACT08] M. Curtis-Maury et al. Prediction models for multi-dimensional power-performance optimization on many cores. PACT 2008.
Slide 48: Benchmark Details [table]
Slide 49: Fetched Instructions [chart]
Slide 50: Failed Speculation [chart]
Slide 51: Serial/Critical Sections — bodytrack (PARSEC) [chart]
Slide 52: Background: Speculative Parallelization
Assume no dependences and execute threads in parallel
Track data accesses
Detect violations
Squash offending threads and restart them

for (i = 0; i < N; ++i) {
  ... = A[L[i]] + ...
  ...
  A[K[i]] = ...
}
Slide 53: Background: Speculative Parallelization

for (i = 0; i < N; ++i) {
  ... = A[L[i]] + ...
  ...
  A[K[i]] = ...
}

Iteration J:   ... = A[4] + ...;  A[5] = ...
Iteration J+1: ... = A[2] + ...;  A[2] = ...
Iteration J+2: ... = A[5] + ...;  A[6] = ...

Threads T_J, T_J+1, T_J+2 run these iterations speculatively in parallel. T_J's store to A[5] arrives after the more speculative T_J+2 has already loaded A[5]: a RAW violation, so T_J+2 is squashed and restarted.
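The violation check the slide illustrates can be sketched in software (in the proposal it is done by the snooping TLS hardware on speculative state buffered in the L1 caches; the struct and fixed-size tables here are purely illustrative): each speculative thread records what it has loaded, and when a less speculative thread commits a store to an address a later thread already read, that later thread is squashed.

```c
/* Illustrative model of TLS read tracking and RAW-violation detection. */
#define MAX_READS 64

struct spec_thread {
    int reads[MAX_READS];  /* indices into A this thread has loaded */
    int nreads;
    int squashed;          /* set when a RAW violation is detected */
};

/* Speculative thread performs a load of A[idx]: remember the index. */
static void record_load(struct spec_thread *t, int idx) {
    if (t->nreads < MAX_READS)
        t->reads[t->nreads++] = idx;
}

/* An earlier (less speculative) thread stores to A[idx]: any later
 * thread that already loaded A[idx] read a stale value and must be
 * squashed and restarted. */
static void check_store(struct spec_thread *later, int idx) {
    for (int k = 0; k < later->nreads; ++k)
        if (later->reads[k] == idx)
            later->squashed = 1;
}
```

Replaying the slide's example, T_J+2's load of A[5] followed by T_J's store to A[5] trips the check, while its load of A[6] (say) against the same store would not.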
Slide 54: Energy — best performing point for each scheme [chart]
Slide 55: Bottom Line — speedup over the best scalability point [chart]
Slide 56: Auto-tuning OpenMP apps — performs within 6% of the static oracle [chart]