Tight Analysis of the Performance Potential of Thread Speculation using SPEC CPU2006
Arun Kejariwal‡,¥, Xinmin Tian‡, Milind Girkar‡, Wei Li‡, Sergey Kozhukhov‡, Hideki Saito‡, Utpal Banerjee‡, Alexandru Nicolau¥, Alexander V. Veidenbaum¥, Constantine D. Polychronopoulos*
‡ Software and Solutions Group, Intel Corporation
¥ Center for Embedded Computer Systems, University of California, Irvine
* Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign
March 16, 2007

Parallelism Becoming Ubiquitous
- Emergence of multi-core and hyper-threaded processors
  - Intel® Core™ 2 Duo processor
  - Intel® Kentsfield (quad-core) processor
- Program parallelization
  - Auto-parallelization
  - Hardware-assisted (speculative) parallelization
  - Loop-level
- Main contributions
  - Evaluating the performance potential of thread-level parallelism using SPEC* CPU2006
  - Trade-off between ILP and TLP (thread-level parallelism)
  - Analysis w.r.t. threading overhead, transformations and conflict probability
*Other brands and names may be claimed as the property of others.

Clarification of Chosen Approach
Tight upper bounds on the performance potential of TLS (thread-level speculation), obtained by filtering out:
- Inherently parallel program regions
- Non-profitable candidates for TLS
Focus on parallelism that cannot be exploited with state-of-the-art compiler technology but can be exploited uniquely via TLS.

Global View

Differentiating TLP and TLS
Loop-level parallelism
- Parallel (DOALL) loops: correspond to true TLP
- Non-parallel (non-DOALL) loops: correspond to sTLP (speculative TLP)
Performance achievable by each technique standalone
Performance achievable by each technique in conjunction with others
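As a rough illustration of the DOALL versus non-DOALL distinction, the sketch below contrasts a loop whose iterations are independent with one that carries a dependence from each iteration to the next. The function names are hypothetical, and Python stands in here for the C, C++ and Fortran loops found in the benchmarks.

```python
from concurrent.futures import ThreadPoolExecutor


def doall_scale(a, k):
    """DOALL: iteration i touches only a[i], so any execution order works."""
    return [k * x for x in a]


def doall_scale_parallel(a, k, workers=4):
    """Same loop with iterations handed to a thread pool; the result is unchanged."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, whatever the execution order.
        return list(pool.map(lambda x: k * x, a))


def non_doall_prefix_sum(a):
    """Non-DOALL: iteration i reads the value produced by iteration i - 1."""
    out = []
    running = 0
    for x in a:
        running += x  # loop-carried dependence on `running`
        out.append(running)
    return out
```

The first loop can be parallelized without speculation (true TLP). The second cannot be split naively across threads, which is the kind of loop the non-DOALL category refers to.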

Taxonomy
- CS = Control Speculation
- DDS = Data Dependence Speculation
- DVS = Data Value Speculation

Methodology Details
The analysis refers only to innermost loops (however…).
Two-step approach: filter out loops
- that can be parallelized using state-of-the-art compiler techniques (dependence analysis, pointer analysis, IPA, etc.)
- for which it is more profitable to exploit ILP instead of sTLP
Filtering was done using the Intel compiler and manual analysis (for <10% of the loops).
The remaining loops are considered for evaluating the performance potential of TLS at the innermost loop level.

The Baseline
- Auto-parallelization
- Example loop (figure not reproduced): writing 0 in each iteration
- Semantic-driven parallelization: obviates the need for TLS

ILP/sTLP Trade-off
Determine what is achievable beyond existing ILP techniques.
Example loop (figure not reproduced):
- A candidate loop for DVS
- Also a candidate for software pipelining
- Too "small" for TLS to be profitable (too little computation vs. threading overhead)
Filter out such loops from the set of candidates for speculative parallelization.

ILP/sTLP Trade-off (contd.)
CS+DVS vs. Perfect Pipelining (PP)
- Profitability of CS+DVS: high coverage per iteration
- PP can be applied in an unrestricted fashion
Example loop (figure not reproduced):
- Less than 1% coverage
- Very small coverage per iteration
- Not suitable for TLS

ILP/sTLP Trade-off (contd.)
Symbolic Analysis
- Converts a non-DOALL loop into a DOALL loop
Example loop (figure not reproduced; panels: non-DOALL loop, DOALL loop):
- No need for TLS
- Too small to exploit non-speculative TLP in a profitable fashion
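The example loop from this slide is not available here. As a stand-in, the sketch below shows one common way symbolic analysis can turn a non-DOALL loop into a DOALL loop: replacing an induction variable that is updated every iteration with a closed-form expression in the loop index. The function names and the choice of transformation are assumptions made for illustration, not the authors' example.

```python
def fill_non_doall(n, start, step):
    """Non-DOALL form: j is carried from one iteration to the next."""
    out = [0] * n
    j = start
    for i in range(n):
        out[i] = j
        j += step  # loop-carried dependence on j
    return out


def fill_doall(n, start, step):
    """DOALL form: j is rewritten as start + i * step, so iterations are independent."""
    return [start + i * step for i in range(n)]
```

Both versions compute the same values. Only the second can be parallelized without speculation, although a loop body this small is unlikely to repay any threading overhead.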

Procedural-level TLS
Subject to:
- Limitations of the inlining heuristic of the compiler
- The strength of the dependence analysis supported
Example loop (figure not reproduced):
- Procedure calls phi0, phi1 and phi2 are loop invariant
- Hoist the function calls
- Resulting loop: a DOALL loop
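A minimal sketch of the hoisting step, assuming that phi0, phi1 and phi2 are side-effect-free and return the same value on every call. The loop body is hypothetical, since the original example loop is not available here.

```python
def loop_with_calls(xs, phi0, phi1, phi2):
    """Original shape: the three calls are repeated in every iteration."""
    out = []
    for x in xs:
        out.append(phi0() * x * x + phi1() * x + phi2())
    return out


def loop_hoisted(xs, phi0, phi1, phi2):
    """Hoisted shape: the loop-invariant calls are evaluated once, before the loop."""
    c0, c1, c2 = phi0(), phi1(), phi2()
    # Each iteration now depends only on its own x, so the loop is DOALL.
    return [c0 * x * x + c1 * x + c2 for x in xs]
```

When the compiler can neither inline the calls nor prove them invariant, it has to treat them as potential dependences, which is why the outcome depends on the inlining heuristic and the strength of the dependence analysis.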

Evaluation of TLS
- SPEC* INT and SPEC* FP 2006
- Intel® auto-parallelizing compiler
- System configuration
- Evaluate the performance potential of various speculation techniques, at the innermost loop level, subject to practical constraints such as threading overhead
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel® products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference or call (U.S.) or

Notes on Results
Important issue 1: previous studies of the potential of TLS assumed unrealistically small threading overheads.
- The authors' study: threading overhead ≈ 1000 cycles
- Linux NPTL: ≈ 10k cycles
Threading overhead consists of:
- thread creation
- thread management
- thread synchronization

Notes on Results (contd.)
Important issue 2: all results presented correspond to dynamic scheduling with one iteration scheduled at a time.
Loops whose coverage per iteration is smaller than the threading overhead are filtered out.
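The filtering rule can be sketched as follows, assuming that the per-iteration cost of each loop and the threading overhead are both expressed in cycles. The loop names and cycle counts in the usage note are made up for illustration and are not measurements from the study.

```python
def tls_candidates(loops, t_ovhd):
    """Keep only loops whose per-iteration cost is not smaller than the threading overhead.

    `loops` is a list of (name, cycles_per_iteration) pairs.
    """
    return [name for name, t_iter in loops if t_iter >= t_ovhd]
```

For example, with hypothetical loops `[("loop_a", 50), ("loop_b", 1000), ("loop_c", 4000)]` and an overhead of 1000 cycles, `loop_a` is dropped and the other two remain. With an overhead of 10k cycles none of them survives.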

Variation of TLS Performance Potential w.r.t. Threading Overhead
- With a state-of-the-art threading library, the overhead is at least 1K cycles
- Ideal case: 40%
- Practical case: <3%
- Highlights the high sensitivity to the threading overhead

Variation of TLS Performance Potential w.r.t. Threading Overhead
- Ideal case: 40%
- Practical case: 0%
- TLS is not useful at all in this case

Mitigating the Threading Overhead
Loop unrolling
- Higher inter-thread destructive interference in the D-cache
- Increases the number of potential dependences between any two iterations of the loop
- Results in a higher misspeculation rate
Figure (not reproduced): loop unrolled twice
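A minimal sketch of why unrolling reduces the threading overhead paid per iteration, assuming that one speculative thread is spawned per unrolled chunk. The helper names are hypothetical, and the model ignores the cache-interference and misspeculation costs listed above.

```python
def unrolled_chunks(n_iter, unroll):
    """Group iterations so that each speculative thread runs `unroll` of them."""
    if unroll < 1:
        raise ValueError("unroll factor must be at least 1")
    return [list(range(s, min(s + unroll, n_iter)))
            for s in range(0, n_iter, unroll)]


def overhead_per_iteration(t_ovhd, unroll):
    """Threading overhead amortized over the iterations executed by one thread."""
    return t_ovhd / unroll
```

Unrolling twice halves the number of threads and the amortized overhead, but each thread now covers two iterations' worth of memory accesses, which is where the extra dependences and conflicts come from.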

Mitigating the Threading Overhead (contd.)
Using a large number of processors
- Tovhd < (Np - 1) x Titer
- Tovhd < (Niter - 1) x Titer for an unbounded number of processors
Figure (not reproduced): an arrow marks the trend.
Note: the model does not account for the increase in conflict probability.
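The two inequalities can be checked numerically with a short sketch. The function names are hypothetical, all quantities are assumed to be positive integers in cycles, and, as the slide notes, conflict probability is not modeled.

```python
def satisfies_bound(t_ovhd, t_iter, n_proc):
    """Check the condition Tovhd < (Np - 1) x Titer."""
    return t_ovhd < (n_proc - 1) * t_iter


def min_processors(t_ovhd, t_iter):
    """Smallest Np for which Tovhd < (Np - 1) x Titer holds."""
    # Np - 1 must be strictly greater than t_ovhd / t_iter.
    return t_ovhd // t_iter + 2


def satisfiable_unbounded(t_ovhd, t_iter, n_iter):
    """With unbounded processors, Np is limited only by the iteration count."""
    return t_ovhd < (n_iter - 1) * t_iter
```

With an overhead of 1000 cycles and iterations of 100 cycles, `min_processors` gives 12, because 11 processors yield (11 - 1) x 100 = 1000, which is not strictly greater than the overhead.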

Performance Potential of Different Types of TLS
- SPEC CINT2006 (refer to the paper for CFP2006)
- Gray columns: assuming Tovhd = 1000 cycles
- White columns: assuming Tovhd = 10 cycles

Bounds on Conflict Probability
- Model the impact of the misspeculation penalty
- For m points of speculation in an iteration (formula not reproduced)

Bounds on Conflict Probability (contd.)
Example: 401.perlbench
- Applying TLS to Loop #8 is profitable only if the misspeculation probability is < 0.28
- Applying TLS to the other loops is not beneficial
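The authors' conflict-probability model is not available here, so the sketch below uses a deliberately simple stand-in to show how a break-even threshold of this kind can arise. It assumes that the m points of speculation fail independently, and that a misspeculated iteration pays a fixed extra penalty on top of its speculative execution time. All names and the cost model itself are assumptions and should not be read as the formula from the paper.

```python
def iteration_conflict_probability(p_point, m):
    """P(at least one of m independent points of speculation fails)."""
    return 1.0 - (1.0 - p_point) ** m


def break_even_probability(t_seq, t_spec, t_penalty):
    """Largest misspeculation probability p for which speculation still pays off.

    Toy cost model: expected speculative time = t_spec + p * t_penalty,
    which must stay below the sequential time t_seq.
    """
    if t_penalty <= 0:
        raise ValueError("penalty must be positive")
    p = (t_seq - t_spec) / t_penalty
    return max(0.0, min(1.0, p))
```

Under this toy model, a loop whose speculative time already exceeds its sequential time has a threshold of 0, which corresponds to the case where applying TLS is not beneficial at any misspeculation rate.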

Conclusions
- Performance potential of innermost loop-level TLS (excluding TLP), assuming a threading overhead of 1K cycles: geometric mean 1%
- Performance potential of standalone CS, DDS and DVS is rather limited
- Future work: evaluate the trade-off between TLS and run-time parallelization
