Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J.


1 Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)
David J. Lilja
Department of Electrical and Computer Engineering, University of Minnesota
lilja@ece.umn.edu

2 Acknowledgements
- Graduate students (who did the real work)
  - Ying Chen
  - Resit Sendag
  - Joshua Yi
- Faculty collaborator
  - Douglas Hawkins (School of Statistics)
- Funders
  - National Science Foundation
  - IBM
  - HP/Compaq
  - Minnesota Supercomputing Institute

3 Problem #1
- Speculative execution is becoming more popular
  - Branch prediction
  - Value prediction
  - Speculative multithreading
- Potentially higher performance
- What about the impact on the memory system?
  - Pollute cache/memory hierarchy?
  - Leads to more misses?

4 Problem #2
- Computer architecture research relies on simulation
- Simulation is slow
  - Years to simulate SPEC CPU2000 benchmarks
- Simulation can be wildly inaccurate
  - Did I really mean to build that system?
- Results are difficult to reproduce
- Need statistical rigor

5 Outline (Part 1)
- The Superthreaded Architecture
- The Wrong Execution Cache (WEC)
- Experimental Methodology
- Performance of the WEC
[Chen, Sendag, Lilja, IPDPS, 2003]

6 Hard-to-Parallelize Applications
- Early exit loops
- Pointers and aliases
- Complex branching behaviors
- Small basic blocks
- Small loop counts
→ Hard to parallelize with conventional techniques.

7 Introduce Maybe dependences
- Data dependence?
- Pointer aliasing?
  - Yes
  - No
  - Maybe
- Maybe allows aggressive compiler optimizations
  - When in doubt, parallelize
- Run-time check to correct wrong assumptions

8 Thread Pipelining Execution Model
[Figure: threads i, i+1, and i+2 overlap in a pipeline, linked by Fork and Sync arrows. Each thread runs the same four stages.]
- CONTINUATION: values needed to fork next thread
- TARGET STORE: forward addresses of maybe dependences
- COMPUTATION: forward addresses and computed data as needed
- WRITE-BACK

9 The Superthreaded Architecture
[Figure: an instruction cache and a data cache connected to four thread units. Each thread unit is a superscalar core with an execution unit, dependence buffer, registers, PC, and Comm unit.]

10 Wrong Path Execution Within Superscalar Core
[Figure: a branch with a predicted path and a correct path. Labels: speculative execution; prediction result is wrong; wrong path; wrong path execution; not ready to be executed.]

11 Wrong Thread Execution
[Figure: a sequential region between two parallel regions. Labels: mark the successor threads as wrong threads; kill all the wrong threads from the previous parallel region; wrong thread kills itself.]

12 How Could Wrong Thread Execution Help Improve Performance?
Parallelized loop nest:
    for (i=0; i<10; i++) {
        ……
        for (j=0; j<i; j++) {
            ……
            x=y[j];
            ……
        }
        ……
    }
- When i=4, j=0,1,2,3 => y[0], y[1], y[2], y[3], y[4]…
- When i=5, j=0,1,2,3,4 => y[0], y[1], y[2], y[3], y[4], y[5]…
[Figure: thread units TU1-TU4 for i=4 and for i=5, each labeled with the element of y it loads (y[0] through y[6]); the loads issued by wrong threads are marked.]
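The intuition behind this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `wrong_threads` (how many extra inner-loop iterations are forked before the loop exit is detected) is a hypothetical parameter, and the model ignores cache capacity and timing entirely.

```python
def elements_touched(i, wrong_threads=2):
    """Indices of y loaded while running the inner loop for outer iteration i.

    Returns (correct, wrong): `correct` are the real iterations j = 0..i-1,
    `wrong` are the iterations forked past the loop exit (hypothetical count).
    """
    correct = list(range(i))
    wrong = list(range(i, i + wrong_threads))
    return correct, wrong


correct_4, wrong_4 = elements_touched(4)
correct_5, _ = elements_touched(5)

# Elements that wrong threads loaded during i=4 and that i=5 really needs.
prefetched = sorted(set(wrong_4) & set(correct_5))
print(prefetched)
```

Under these assumptions, the wrong threads of iteration i=4 act as an accidental prefetch of y[4] for iteration i=5, which is the effect the WEC is meant to capture.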

13 Operation of the WEC
[Figure: handling of wrong execution versus correct execution]

14 Processor Configurations for Simulations
SIMCA (the SIMulator for the Superthreaded Architecture)
- Configurations: orig, vc, wth-wp-vc, wth-wp-wec, nlp
- Features:
  - baseline (orig)
  - wrong path (wp)
  - wrong thread (wth)
  - wrong execution cache (wec)
  - prefetch into WEC
  - victim cache (vc)
  - next line prefetch (nlp)
[Table: configurations (rows) versus features (columns)]

15 Parameters for Each Thread Unit

| Parameter | Value |
|---|---|
| Issue rate | 8 instrs/cycle per thread unit |
| Branch target buffer | 4-way associative, 1024 entries |
| Speculative memory buffer | fully associative, 128 entries |
| Round-trip memory latency | 200 cycles |
| Fork delay | 4 cycles |
| Unidirectional communication ring | 2 requests/cycle bandwidth |
| Load/store queue | 64 entries |
| Reorder buffer | 64 entries |
| INT ALU, INT multiply/divide units | 8, 4 |
| FP adders, FP multiply/divide units | 4, 4 |
| WEC | 8 entries (same block size as L1 cache), fully associative |
| L1 data cache | distributed, 8 KB, 1-way associative, block size of 64 bytes |
| L1 instruction caches | distributed, 32 KB, 2-way associative, block size of 64 bytes |
| L2 cache | unified, 512 KB, 4-way associative, block size of 128 bytes |

16 Characteristics of the Parallelized SPEC2000 Benchmarks

| Benchmark | SPEC 2000 type | Input set | Fraction parallelized |
|---|---|---|---|
| 175.vpr | INT | SPEC test | 8.6% |
| 164.gzip | INT | MinneSPEC large | 15.7% |
| 181.mcf | INT | MinneSPEC large | 36.1% |
| 197.parser | INT | MinneSPEC medium | 17.2% |
| 183.equake | FP | MinneSPEC large | 21.3% |
| 177.mesa | FP | SPEC test | 17.3% |

[The slide also marks which transformations were applied to each benchmark: loop coalescing, loop unrolling, statement reordering to increase overlap.]

17 Performance of the Superthreaded Architecture for the Parallelized Portions of the Benchmarks

| # of TUs | 1 (baseline configuration) | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|---|
| Issue rate | 1 | 16 | 8 | 4 | 2 | 1 |
| Reorder buffer size | 8 | 128 | 64 | 32 | 16 | 8 |
| INT ALU | 1 | 16 | 8 | 4 | 2 | 1 |
| INT MULT | 1 | 8 | 4 | 2 | 1 | 1 |
| FP ALU | 1 | 16 | 8 | 4 | 2 | 1 |
| FP MULT | 1 | 8 | 4 | 2 | 1 | 1 |
| L1 data cache size (KB) | 2 | 32 | 16 | 8 | 4 | 2 |

18 Performance of the wth-wp-wec Configuration on Top of the Parallel Execution

19 Performance Improvements Due to the WEC

20 Sensitivity to L1 Data Cache Size

21 Sensitivity to WEC Size Compared to a Victim Cache

22 Sensitivity to WEC Size Compared to Next-Line Prefetching (NLP)

23 Additional Loads and Reduction of Misses %

24 Conclusions for the WEC
- Allow loads to continue executing even after they are known to be incorrectly issued
  - Do not let them change state
- 45.5% average reduction in number of misses
  - 9.7% average improvement on top of parallel execution
  - 4% average improvement over victim cache
  - 5.6% average improvement over next-line prefetching
- Cost
  - 14% additional loads
  - Minor hardware complexity

25 Typical Computer Architecture Study
1. Find an interesting problem/performance bottleneck
   - E.g., memory delays
2. Invent a clever idea for solving it
   - This is the hard part
3. Implement the idea in a processor/system simulator
   - This is the part grad students usually like best
4. Run simulations on n "standard" benchmark programs
   - This is time-consuming and boring
5. Compare performance with and without your change
   - Execution time, clocks per instruction (CPI), etc.

26 Problem #2 – Simulation in Computer Architecture Research
- Simulators are an important tool for computer architecture research and design
  - Low cost
  - Faster than building a new system
  - Very flexible

27 Performance Evaluation Techniques Used in ISCA Papers
[Figure]
* Some papers used more than one evaluation technique.

28 Simulation is Very Popular, But …
- Current simulation methodology is not
  - Formal
  - Rigorous
  - Statistically-based
- Never enough simulations
  - Design a new processor based on a few seconds of actual execution time
- What are benchmark programs really exercising?

29 An Example -- Sensitivity Analysis
- Which parameters should be varied? Fixed?
- What range of values should be used for each variable parameter?
- What values should be used for the constant parameters?
- Are there interactions between variable and fixed parameters?
- What is the magnitude of those interactions?

30 Let's Introduce Some Statistical Rigor
- Decreases the number of errors
  - Modeling
  - Implementation
  - Set up
  - Analysis
- Helps find errors more quickly
- Provides greater insight
  - Into the processor
  - Effects of an enhancement
- Provides objective confidence in results
- Provides statistical support for conclusions

31 Outline (Part 2)
- A statistical technique for
  - Examining the overall impact of an architectural change
  - Classifying benchmark programs
  - Ranking the importance of processor/simulation parameters
  - Reducing the total number of simulation runs
[Yi, Lilja, Hawkins, HPCA, 2003]

32 A Technique to Limit the Number of Simulations
- Plackett and Burman designs (1946)
  - Multifactorial designs
  - Originally proposed for mechanical assemblies
- Effects of main factors only
  - Logically minimal number of experiments to estimate effects of m input parameters (factors)
  - Ignores interactions
- Requires O(m) experiments
  - Instead of O(2^m) or O(v^m)

33 Plackett and Burman Designs
- PB designs exist only in sizes that are multiples of 4
- Requires X experiments for m parameters
  - X = next multiple of 4 greater than m (an X-run design has X-1 columns)
- PB design matrix
  - Rows = configurations
  - Columns = parameters' values in each config
    - High/low = +1/-1
  - First row = from P&B paper
  - Subsequent rows = circular right shift of preceding row
  - Last row = all (-1)
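The construction rules on this slide can be sketched in Python as follows. The names `pb_runs`, `pb_matrix`, and `FIRST_ROW_8` are illustrative, not from the talk, and the first row used here is the commonly published 8-run Plackett and Burman generator row, which is an assumption about what the slide means by "from P&B paper". The run count assumes an X-run design can hold at most X-1 parameters.

```python
def pb_runs(m):
    """Smallest multiple of 4 strictly greater than m (an X-run design has X-1 columns)."""
    return 4 * (m // 4 + 1)


# Assumed first row for the 8-run design (7 parameters).
FIRST_ROW_8 = [+1, +1, +1, -1, +1, -1, -1]


def pb_matrix(first_row):
    """Build a PB design: shifted copies of first_row, then a row of all -1."""
    n = len(first_row)
    rows = [list(first_row)]
    for _ in range(n - 1):
        prev = rows[-1]
        rows.append([prev[-1]] + prev[:-1])  # circular right shift by one
    rows.append([-1] * n)                    # last row: every parameter low
    return rows


design = pb_matrix(FIRST_ROW_8)
for row in design:
    print(" ".join(f"{v:+d}" for v in row))
print(pb_runs(7), "runs for 7 parameters")
```

Each column of the resulting matrix holds four +1 and four -1 entries, so every parameter is simulated at its high and low value equally often.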

34-39 PB Design Matrix

| Config | A | B | C | D | E | F | G | Response |
|---|---|---|---|---|---|---|---|---|
| 1 | +1 | +1 | +1 | -1 | +1 | -1 | -1 | 9 |
| 2 | -1 | +1 | +1 | +1 | -1 | +1 | -1 | 11 |
| 3 | -1 | -1 | +1 | +1 | +1 | -1 | +1 | 2 |
| 4 | +1 | -1 | -1 | +1 | +1 | +1 | -1 | 1 |
| 5 | -1 | +1 | -1 | -1 | +1 | +1 | +1 | 9 |
| 6 | +1 | -1 | +1 | -1 | -1 | +1 | +1 | 74 |
| 7 | +1 | +1 | -1 | +1 | -1 | -1 | +1 | 7 |
| 8 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 4 |
| Effect | 65 | -45 | 75 | -75 | -75 | 73 | 67 | |

- Effect of a parameter = sum over all configurations of (that parameter's +1/-1 entry × response)
- Effect A = 9 - 11 - 2 + 1 - 9 + 74 + 7 - 4 = 65
- Effect B = 9 + 11 - 2 - 1 + 9 - 74 + 7 - 4 = -45

40 PB Design
- Only magnitude of effect is important
  - Sign is meaningless
- In example, most → least important effects:
  - [C, D, E] → F → G → A → B
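The effect calculation and the ranking by magnitude can be sketched in Python. The responses (9, 11, 2, 1, 9, 74, 7, 4) come from the slides; the +1/-1 entries are an assumption, rebuilt from the circular-shift rule with the standard 8-run first row, and they reproduce the effects the slides quote for A (65) and B (-45).

```python
# Assumed 8-run PB design (rows = configurations, columns = parameters A..G).
DESIGN = [
    [+1, +1, +1, -1, +1, -1, -1],
    [-1, +1, +1, +1, -1, +1, -1],
    [-1, -1, +1, +1, +1, -1, +1],
    [+1, -1, -1, +1, +1, +1, -1],
    [-1, +1, -1, -1, +1, +1, +1],
    [+1, -1, +1, -1, -1, +1, +1],
    [+1, +1, -1, +1, -1, -1, +1],
    [-1, -1, -1, -1, -1, -1, -1],
]
RESPONSES = [9, 11, 2, 1, 9, 74, 7, 4]  # one simulated response per configuration
FACTORS = "ABCDEFG"

# Effect of a parameter: signed sum of the responses down its column.
effects = {
    name: sum(row[k] * resp for row, resp in zip(DESIGN, RESPONSES))
    for k, name in enumerate(FACTORS)
}

# Only the magnitude matters; ties keep their original order (stable sort).
ranked = sorted(FACTORS, key=lambda name: -abs(effects[name]))

print(effects)
print(" -> ".join(ranked))
```

C, D, and E tie at a magnitude of 75, which is why the slide brackets them together at the top of the ordering.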

41 Case Study #1
- Determine the most significant parameters in a processor simulator.

42 Determine the Most Significant Processor Parameters
- Problem
  - So many parameters in a simulator
  - How to choose parameter values?
  - How to decide which parameters are most important?
- Approach
  - Choose reasonable upper/lower bounds.
  - Rank parameters by impact on total execution time.

43 Simulation Environment
- SimpleScalar simulator
  - sim-outorder 3.0
- Selected SPEC 2000 Benchmarks
  - gzip, vpr, gcc, mesa, art, mcf, equake, parser, vortex, bzip2, twolf
- MinneSPEC Reduced Input Sets
- Compiled with gcc (PISA) at O3

44 Functional Unit Values

| Parameter | Low Value | High Value |
|---|---|---|
| Int ALUs | 1 | 4 |
| Int ALU Latency | 2 Cycles | 1 Cycle |
| Int ALU Throughput | 1 | 1 |
| FP ALUs | 1 | 4 |
| FP ALU Latency | 5 Cycles | 1 Cycle |
| FP ALU Throughputs | 1 | 1 |
| Int Mult/Div Units | 1 | 4 |
| Int Mult Latency | 15 Cycles | 2 Cycles |
| Int Div Latency | 80 Cycles | 10 Cycles |
| Int Mult Throughput | 1 | 1 |
| Int Div Throughput | Equal to Int Div Latency | Equal to Int Div Latency |
| FP Mult/Div Units | 1 | 4 |
| FP Mult Latency | 5 Cycles | 2 Cycles |
| FP Div Latency | 35 Cycles | 10 Cycles |
| FP Sqrt Latency | 35 Cycles | 15 Cycles |
| FP Mult Throughput | Equal to FP Mult Latency | Equal to FP Mult Latency |
| FP Div Throughput | Equal to FP Div Latency | Equal to FP Div Latency |
| FP Sqrt Throughput | Equal to FP Sqrt Latency | Equal to FP Sqrt Latency |

45 Memory System Values, Part I

| Parameter | Low Value | High Value |
|---|---|---|
| L1 I-Cache Size | 4 KB | 128 KB |
| L1 I-Cache Assoc | 1-Way | 8-Way |
| L1 I-Cache Block Size | 16 Bytes | 64 Bytes |
| L1 I-Cache Repl Policy | Least Recently Used | Least Recently Used |
| L1 I-Cache Latency | 4 Cycles | 1 Cycle |
| L1 D-Cache Size | 4 KB | 128 KB |
| L1 D-Cache Assoc | 1-Way | 8-Way |
| L1 D-Cache Block Size | 16 Bytes | 64 Bytes |
| L1 D-Cache Repl Policy | Least Recently Used | Least Recently Used |
| L1 D-Cache Latency | 4 Cycles | 1 Cycle |
| L2 Cache Size | 256 KB | 8192 KB |
| L2 Cache Assoc | 1-Way | 8-Way |
| L2 Cache Block Size | 64 Bytes | 256 Bytes |

46 Memory System Values, Part II

| Parameter | Low Value | High Value |
|---|---|---|
| L2 Cache Repl Policy | Least Recently Used | Least Recently Used |
| L2 Cache Latency | 20 Cycles | 5 Cycles |
| Mem Latency, First | 200 Cycles | 50 Cycles |
| Mem Latency, Next | 0.02 * Mem Latency, First | 0.02 * Mem Latency, First |
| Mem Bandwidth | 4 Bytes | 32 Bytes |
| I-TLB Size | 32 Entries | 256 Entries |
| I-TLB Page Size | 4 KB | 4096 KB |
| I-TLB Assoc | 2-Way | Fully Assoc |
| I-TLB Latency | 80 Cycles | 30 Cycles |
| D-TLB Size | 32 Entries | 256 Entries |
| D-TLB Page Size | Same as I-TLB Page Size | Same as I-TLB Page Size |
| D-TLB Assoc | 2-Way | Fully Assoc |
| D-TLB Latency | Same as I-TLB Latency | Same as I-TLB Latency |

47 Processor Core Values

| Parameter | Low Value | High Value |
|---|---|---|
| Fetch Queue Entries | 4 | 32 |
| Branch Predictor | 2-Level | Perfect |
| Branch MPred Penalty | 10 Cycles | 2 Cycles |
| RAS Entries | 4 | 64 |
| BTB Entries | 16 | 512 |
| BTB Assoc | 2-Way | Fully Assoc |
| Spec Branch Update | In Commit | In Decode |
| Decode/Issue Width | 4-Way | 4-Way |
| ROB Entries | 8 | 64 |
| LSQ Entries | 0.25 * ROB | 1.0 * ROB |
| Memory Ports | 1 | 4 |

48 Determining the Most Significant Parameters
1. Run simulations to find the response
   - With input parameters at high/low, on/off values
[Table: the PB design matrix from the earlier PB Design Matrix slides (configs 1, 2, 3, … against parameters A-G), with the Response column filled in for config 1 (9)]

49 Determining the Most Significant Parameters
2. Calculate the effect of each parameter
   - Across configurations
[Table: the same PB design matrix, with the Effect row filled in for parameter A (65)]

50 Determining the Most Significant Parameters
3. For each benchmark
   - Rank the parameters in descending order of effect (1 = most important, …)

| Parameter | Benchmark 1 | Benchmark 2 | Benchmark 3 |
|---|---|---|---|
| A | 3 | 12 | 8 |
| B | 29 | 4 | 22 |
| C | 2 | 6 | 7 |
| … | … | … | … |

51 Determining the Most Significant Parameters
4. For each parameter
   - Average the ranks

| Parameter | Benchmark 1 | Benchmark 2 | Benchmark 3 | Average |
|---|---|---|---|---|
| A | 3 | 12 | 8 | 7.67 |
| B | 29 | 4 | 22 | 18.3 |
| C | 2 | 6 | 7 | 5 |
| … | … | … | … | … |
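Steps 3 and 4 can be sketched in Python. The rank values for A, B, and C are the ones on the slide; `rank_by_effect` is an illustrative helper, and breaking ties by original parameter order is an assumption since the slides do not say how ties are handled.

```python
def rank_by_effect(effects):
    """Map each parameter to its rank (1 = largest |effect|) for one benchmark."""
    ordered = sorted(effects, key=lambda name: -abs(effects[name]))
    return {name: position for position, name in enumerate(ordered, start=1)}


# Step 3 on the 7-parameter example from the PB Design Matrix slides.
example_effects = {"A": 65, "B": -45, "C": 75, "D": -75, "E": -75, "F": 73, "G": 67}
print(rank_by_effect(example_effects))

# Step 4: average each parameter's rank across benchmarks (values from the slide).
ranks = {
    "A": [3, 12, 8],
    "B": [29, 4, 22],
    "C": [2, 6, 7],
}
average_rank = {name: sum(r) / len(r) for name, r in ranks.items()}

for name in sorted(average_rank, key=average_rank.get):
    print(f"{name}: {average_rank[name]:.2f}")
```

A lower average rank means the parameter matters more across the benchmark set, so C (5.00) outranks A (7.67) and B (18.33) here.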

52 Most Significant Parameters

| Number | Parameter | gcc | gzip | art | Average |
|---|---|---|---|---|---|
| 1 | ROB Entries | 4 | 1 | 2 | 2.77 |
| 2 | L2 Cache Latency | 2 | 4 | 4 | 4.00 |
| 3 | Branch Predictor Accuracy | 5 | 2 | 27 | 7.69 |
| 4 | Number of Integer ALUs | 8 | 3 | 29 | 9.08 |
| 5 | L1 D-Cache Latency | 7 | 7 | 8 | 10.00 |
| 6 | L1 I-Cache Size | 1 | 6 | 12 | 10.23 |
| 7 | L2 Cache Size | 6 | 9 | 1 | 10.62 |
| 8 | L1 I-Cache Block Size | 3 | 16 | 10 | 11.77 |
| 9 | Memory Latency, First | 9 | 36 | 3 | 12.31 |
| 10 | LSQ Entries | 10 | 12 | 39 | 12.62 |
| 11 | Speculative Branch Update | 28 | 8 | 16 | 18.23 |

53 General Procedure
- Determine upper/lower bounds for parameters
- Simulate configurations to find response
- Compute effect of each parameter across the configurations
- Rank the parameters for each benchmark based on effects
- Average the ranks across benchmarks
- Focus on top-ranked parameters for subsequent analysis

54 Case Study #2
- Determine the "big picture" impact of a system enhancement.

55 Determining the Overall Effect of an Enhancement
- Problem:
  - Performance analysis is typically limited to single metrics
    - Speedup, power consumption, miss rate, etc.
  - Simple analysis
    - Discards a lot of good information

56 Determining the Overall Effect of an Enhancement
- Find most important parameters without enhancement
  - Using Plackett and Burman
- Find most important parameters with enhancement
  - Again using Plackett and Burman
- Compare parameter ranks

57 Example: Instruction Precomputation
- Profile to find the most common operations
  - 0+1, 1+1, etc.
- Insert the results of common operations in a table when the program is loaded into memory
- Query the table when an instruction is issued
- Don't execute the instruction if it is already in the table
- Reduces contention for function units
[Yi, Sendag, Lilja, Europar, 2002]

58-60 The Effect of Instruction Precomputation

| Parameter | Average Rank Before | Average Rank After | Difference |
|---|---|---|---|
| ROB Entries | 2.77 | 2.77 | 0.00 |
| L2 Cache Latency | 4.00 | 4.00 | 0.00 |
| Branch Predictor Accuracy | 7.69 | 7.92 | -0.23 |
| Number of Integer ALUs | 9.08 | 10.54 | -1.46 |
| L1 D-Cache Latency | 10.00 | 9.62 | 0.38 |
| L1 I-Cache Size | 10.23 | 10.15 | 0.08 |
| L2 Cache Size | 10.62 | 10.54 | 0.08 |
| L1 I-Cache Block Size | 11.77 | 11.38 | 0.39 |
| Memory Latency, First | 12.31 | 11.62 | 0.69 |
| LSQ Entries | 12.62 | 13.00 | -0.38 |
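The before/after comparison is a subtraction of average ranks. A minimal Python sketch using the values from these slides follows; the dictionary names are illustrative.

```python
# Average ranks without (before) and with (after) instruction precomputation.
before = {
    "ROB Entries": 2.77, "L2 Cache Latency": 4.00,
    "Branch Predictor Accuracy": 7.69, "Number of Integer ALUs": 9.08,
    "L1 D-Cache Latency": 10.00, "L1 I-Cache Size": 10.23,
    "L2 Cache Size": 10.62, "L1 I-Cache Block Size": 11.77,
    "Memory Latency, First": 12.31, "LSQ Entries": 12.62,
}
after = {
    "ROB Entries": 2.77, "L2 Cache Latency": 4.00,
    "Branch Predictor Accuracy": 7.92, "Number of Integer ALUs": 10.54,
    "L1 D-Cache Latency": 9.62, "L1 I-Cache Size": 10.15,
    "L2 Cache Size": 10.54, "L1 I-Cache Block Size": 11.38,
    "Memory Latency, First": 11.62, "LSQ Entries": 13.00,
}

# Difference = before - after; a negative value means the parameter
# became less significant once the enhancement was added.
difference = {name: round(before[name] - after[name], 2) for name in before}
largest_shift = max(difference, key=lambda name: abs(difference[name]))

print(largest_shift, difference[largest_shift])
```

The largest shift belongs to the number of integer ALUs, which fits the slide's claim that instruction precomputation reduces contention for function units.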

61 Case Study #3
- Benchmark program classification.

62 Benchmark Classification
- By application type
  - Scientific and engineering applications
  - Transaction processing applications
  - Multimedia applications
- By use of processor function units
  - Floating-point code
  - Integer code
  - Memory intensive code
- Etc., etc.

63 Another Point-of-View
- Classify by overall impact on processor
- Define:
  - Two benchmark programs are similar if they stress the same components of a system to similar degrees
- How to measure this similarity?
  - Use Plackett and Burman design to find ranks
  - Then compare ranks

64 Similarity metric
- Use rank of each parameter as elements of a vector
- For benchmark program X, let
  - $X = (x_1, x_2, \ldots, x_{n-1}, x_n)$
  - $x_1$ = rank of parameter 1
  - $x_2$ = rank of parameter 2
  - …

65 Vector Defines a Point in n-space
[Figure: points $(x_1, x_2, x_3)$ and $(y_1, y_2, y_3)$ plotted on axes Param #1, Param #2, and Param #3, separated by distance D]

66 Similarity Metric
- Euclidean Distance Between Points
  - $D = \left[ (x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2 \right]^{1/2}$

67 Most Significant Parameters

| Number | Parameter | gcc | gzip | art |
|---|---|---|---|---|
| 1 | ROB Entries | 4 | 1 | 2 |
| 2 | L2 Cache Latency | 2 | 4 | 4 |
| 3 | Branch Predictor Accuracy | 5 | 2 | 27 |
| 4 | Number of Integer ALUs | 8 | 3 | 29 |
| 5 | L1 D-Cache Latency | 7 | 7 | 8 |
| 6 | L1 I-Cache Size | 1 | 6 | 12 |
| 7 | L2 Cache Size | 6 | 9 | 1 |
| 8 | L1 I-Cache Block Size | 3 | 16 | 10 |
| 9 | Memory Latency, First | 9 | 36 | 3 |
| 10 | LSQ Entries | 10 | 12 | 39 |
| 11 | Speculative Branch Update | 28 | 8 | 16 |

68 Distance Computation
- Rank vectors
  - Gcc = (4, 2, 5, 8, …)
  - Gzip = (1, 4, 2, 3, …)
  - Art = (2, 4, 27, 29, …)
- Euclidean distances
  - $D(\text{gcc}, \text{gzip}) = [(4-1)^2 + (2-4)^2 + (5-2)^2 + \ldots]^{1/2}$
  - $D(\text{gcc}, \text{art}) = [(4-2)^2 + (2-4)^2 + (5-27)^2 + \ldots]^{1/2}$
  - $D(\text{gzip}, \text{art}) = [(1-2)^2 + (4-4)^2 + (2-27)^2 + \ldots]^{1/2}$
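The distance computation can be sketched in Python. Only the first four rank components shown on the slide are used here, so the printed values are partial illustrations and will not match the full distances, which are computed over every parameter in the design. `rank_distance` is an illustrative name.

```python
import math


def rank_distance(x, y):
    """Euclidean distance between two rank vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("rank vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# First four components of each rank vector, as listed on the slide.
gcc = (4, 2, 5, 8)
gzip_ranks = (1, 4, 2, 3)
art = (2, 4, 27, 29)

print(f"gcc-gzip: {rank_distance(gcc, gzip_ranks):.2f}")
print(f"gcc-art:  {rank_distance(gcc, art):.2f}")
print(f"gzip-art: {rank_distance(gzip_ranks, art):.2f}")
```

Even on this truncated vector, gcc and gzip are much closer to each other than either is to art, because art ranks branch prediction and integer ALUs far lower.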

69 Euclidean Distances for Selected Benchmarks

| | gcc | gzip | art | mcf |
|---|---|---|---|---|
| gcc | 0 | 81.9 | 92.6 | 94.5 |
| gzip | | 0 | 113.5 | 109.6 |
| art | | | 0 | 98.6 |
| mcf | | | | 0 |

70 Dendrogram of Distances Showing (Dis-)Similarity
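A dendrogram is produced by repeatedly merging the closest clusters. The sketch below applies single-linkage agglomerative clustering to the four-benchmark distance table from the previous slide. The linkage rule is an assumption, since the slides do not state which one was used, and with only four benchmarks the merges are not expected to reproduce the final groupings of the full benchmark set.

```python
def single_linkage(names, dist):
    """Return the merge sequence [(cluster_a, cluster_b, height), ...]."""
    clusters = [frozenset([n]) for n in names]
    merges = []

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        height, i, j = min(
            (cluster_distance(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j]), height))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Pairwise Euclidean distances from the slide.
DIST = {
    frozenset(("gcc", "gzip")): 81.9,
    frozenset(("gcc", "art")): 92.6,
    frozenset(("gcc", "mcf")): 94.5,
    frozenset(("gzip", "art")): 113.5,
    frozenset(("gzip", "mcf")): 109.6,
    frozenset(("art", "mcf")): 98.6,
}

merges = single_linkage(["gcc", "gzip", "art", "mcf"], DIST)
for left, right, height in merges:
    print(f"merge {left} + {right} at distance {height}")
```

Cutting the merge sequence at a chosen distance threshold yields benchmark groups; the threshold itself is a judgment call.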

71 Final Benchmark Groupings

| Group | Benchmarks |
|---|---|
| I | Gzip, mesa |
| II | Vpr-Place, twolf |
| III | Vpr-Route, parser, bzip2 |
| IV | Gcc, vortex |
| V | Art |
| VI | Mcf |
| VII | Equake |
| VIII | ammp |

72 Conclusion
- Multifactorial (Plackett and Burman) design
  - Requires only O(m) experiments
  - Determines effects of main factors only
  - Ignores interactions
- Logically minimal number of experiments to estimate effects of m input parameters
- Powerful technique for obtaining a big-picture view of a lot of simulation data

73 Conclusion
- Demonstrated for
  - Ranking importance of simulation parameters
  - Finding overall impact of processor enhancement
  - Classifying benchmark programs
- Current work comparing simulation strategies
  - Reduced input sets (e.g. MinneSPEC)
  - Sampling (e.g. SimPoints, sampling)

74 Goals
- Develop/understand tools for interpreting large quantities of data
- Increase insights into processor design
- Improve rigor in computer architecture research

