Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J. Lilja Department of Electrical and Computer Engineering University of Minnesota

Acknowledgements
- Graduate students (who did the real work): Ying Chen, Resit Sendag, Joshua Yi
- Faculty collaborator: Douglas Hawkins (School of Statistics)
- Funders: National Science Foundation, IBM, HP/Compaq, Minnesota Supercomputing Institute

Problem #1
- Speculative execution is becoming more popular: branch prediction, value prediction, speculative multithreading
- Potentially higher performance
- What about the impact on the memory system? Does speculation pollute the cache/memory hierarchy and lead to more misses?

Problem #2
- Computer architecture research relies on simulation
- Simulation is slow: years to simulate the SPEC CPU2000 benchmarks
- Simulation can be wildly inaccurate: did I really mean to build that system?
- Results are difficult to reproduce
- Need statistical rigor

Outline (Part 1)
- The Superthreaded Architecture
- The Wrong Execution Cache (WEC)
- Experimental Methodology
- Performance of the WEC
[Chen, Sendag, Lilja, IPDPS, 2003]

Hard-to-Parallelize Applications
- Early exit loops
- Pointers and aliases
- Complex branching behaviors
- Small basic blocks
- Small loop counts
→ Hard to parallelize with conventional techniques.

Introduce "Maybe" Dependences
- Data dependence? Pointer aliasing? Yes / No / Maybe
- Maybe allows aggressive compiler optimizations: when in doubt, parallelize
- Run-time check to correct wrong assumptions.

Thread Pipelining Execution Model
Threads i, i+1, and i+2 each execute the same pipeline of stages, with each thread forking its successor and synchronizing with it:
- CONTINUATION: values needed to fork the next thread
- TARGET STORE: forward addresses of maybe dependences
- COMPUTATION: forward addresses and computed data as needed
- WRITE-BACK

The Superthreaded Architecture
[Diagram: a super-scalar core with shared instruction and data caches feeding multiple thread units, each with its own execution unit, dependence buffer, registers, PC, and communication port]

Wrong Path Execution Within a Superscalar Core
[Figure legend: predicted path vs. correct path; when the prediction result is wrong, speculative execution continues down the wrong path (wrong path execution); some instructions are not ready to be executed]

Wrong Thread Execution
[Figure: in a sequential region, the successor threads are marked as wrong threads; in the sequential region between two parallel regions, all wrong threads from the previous parallel region are killed; a wrong thread can also kill itself]

How Could Wrong Thread Execution Help Improve Performance?

    for (i = 0; i < 10; i++) {
        ...
        for (j = 0; j < i; j++) {   /* parallelized */
            ...
            x = y[j];
            ...
        }
        ...
    }

When i=4, thread units TU1-TU4 load y[0], y[1], y[2], y[3], and the wrong threads go on to touch y[4], y[5], ... When i=5, the loads of y[4] and y[5] then find their data already in the cache: the wrong threads have effectively prefetched it.

Operation of the WEC
[Figure: wrong execution vs. correct execution]

Processor Configurations for Simulations
Simulated with SIMCA (the SIMulator for the Superthreaded Architecture).
Features: baseline (orig), wrong path (wp), wrong thread (wth), wrong execution cache (wec), prefetch into WEC, victim cache (vc), next-line prefetch (nlp).
Configurations compared: orig, vc, wth-wp-vc, wth-wp-wec, nlp.

Parameters for Each Thread Unit
- Issue rate: 8 instrs/cycle per thread unit
- Branch target buffer: 4-way associative, 1024 entries
- Speculative memory buffer: fully associative, 128 entries
- Round-trip memory latency: 200 cycles
- Fork delay: 4 cycles
- Unidirectional communication ring: 2 requests/cycle bandwidth
- Load/store queue: 64 entries
- Reorder buffer: 64 entries
- INT ALU, INT multiply/divide units: 8, 4
- FP adders, FP multiply/divide units: 4, 4
- WEC: 8 entries (same block size as L1 cache), fully associative
- L1 data cache: distributed, 8 KB, 1-way associative, block size of 64 bytes
- L1 instruction caches: distributed, 32 KB, 2-way associative, block size of 64 bytes
- L2 cache: unified, 512 KB, 4-way associative, block size of 128 bytes

Characteristics of the Parallelized SPEC 2000 Benchmarks
Transformations applied: loop coalescing, loop unrolling, and statement reordering to increase overlap.
- 175.vpr (INT, SPEC test input): 8.6% parallelized
- 164.gzip (INT, MinneSPEC large input): 15.7% parallelized
- 181.mcf (INT, MinneSPEC large input): 36.1% parallelized
- 197.parser (INT, MinneSPEC medium input): 17.2% parallelized
- 183.equake (FP, MinneSPEC large input): 21.3% parallelized
- 177.mesa (FP, SPEC test input): 17.3% parallelized

Performance of the Superthreaded Architecture for the Parallelized Portions of the Benchmarks
[Chart; parameters varied around the baseline configuration: # of TUs, issue rate, reorder buffer size, INT ALU, INT MULT, FP ALU, FP MULT, L1 data cache size (KB)]

Performance of the wth-wp-wec Configuration on Top of the Parallel Execution

Performance Improvements Due to the WEC

Sensitivity to L1 Data Cache Size

Sensitivity to WEC Size Compared to a Victim Cache

Sensitivity to WEC Size Compared to Next-Line Prefetching (NLP)

Additional Loads and Reduction of Misses (%)

Conclusions for the WEC
- Allow loads to continue executing even after they are known to be incorrectly issued, but do not let them change state
- 45.5% average reduction in number of misses
- 9.7% average improvement on top of parallel execution
- 4% average improvement over a victim cache
- 5.6% average improvement over next-line prefetching
- Cost: 14% additional loads, minor hardware complexity

Typical Computer Architecture Study
1. Find an interesting problem/performance bottleneck, e.g. memory delays
2. Invent a clever idea for solving it (this is the hard part)
3. Implement the idea in a processor/system simulator (the part grad students usually like best)
4. Run simulations on n "standard" benchmark programs (this is time-consuming and boring)
5. Compare performance with and without your change: execution time, clocks per instruction (CPI), etc.

Problem #2 – Simulation in Computer Architecture Research
- Simulators are an important tool for computer architecture research and design: low cost, faster than building a new system, very flexible

Performance Evaluation Techniques Used in ISCA Papers
* Some papers used more than one evaluation technique.

Simulation is Very Popular, But …
- Current simulation methodology is not formal, rigorous, or statistically based
- Never enough simulations: we design a new processor based on a few seconds of actual execution time
- What are benchmark programs really exercising?

An Example -- Sensitivity Analysis
- Which parameters should be varied? Which held fixed?
- What range of values should be used for each variable parameter?
- What values should be used for the constant parameters?
- Are there interactions between variable and fixed parameters?
- What is the magnitude of those interactions?

Let's Introduce Some Statistical Rigor
- Decreases the number of errors in modeling, implementation, set up, and analysis
- Helps find errors more quickly
- Provides greater insight into the processor and the effects of an enhancement
- Provides objective confidence in results
- Provides statistical support for conclusions

Outline (Part 2)
A statistical technique for:
- Examining the overall impact of an architectural change
- Classifying benchmark programs
- Ranking the importance of processor/simulation parameters
- Reducing the total number of simulation runs
[Yi, Lilja, Hawkins, HPCA, 2003]

A Technique to Limit the Number of Simulations
- Plackett and Burman designs (1946): multifactorial designs, originally proposed for mechanical assemblies
- Estimates the effects of main factors only: the logically minimal number of experiments to estimate the effects of m input parameters (factors); ignores interactions
- Requires O(m) experiments instead of O(2^m) or O(v^m) for v values per parameter

Plackett and Burman Designs
- PB designs exist only in sizes that are multiples of 4
- Requires X experiments for m parameters, where X = next multiple of 4 ≥ m
- PB design matrix: rows = configurations; columns = parameters' values in each configuration (high/low = +1/−1)
- First row = from the P&B paper; each subsequent row = circular right shift of the preceding row; last row = all −1
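The construction rules above can be sketched in a few lines of Python. The seed row below is the standard 8-run Plackett-Burman generator for up to 7 parameters; the slide does not reproduce the exact row from the P&B paper, so treat it as an illustrative assumption.

```python
# Sketch of Plackett-Burman design construction for m = 7 parameters,
# so X = 8 experiments (the next multiple of 4). The seed row is the
# commonly tabulated 8-run P&B generator (an assumption, since the talk
# does not give the row itself).
SEED = [+1, +1, +1, -1, +1, -1, -1]

def pb_design(seed):
    """Rows: seed, then circular right shifts, then an all-(-1) row."""
    rows, row = [], list(seed)
    for _ in range(len(seed)):
        rows.append(list(row))
        row = [row[-1]] + row[:-1]  # circular right shift
    rows.append([-1] * len(seed))
    return rows

X = pb_design(SEED)
# Each column is balanced (it sums to 0) and the columns are mutually
# orthogonal, which is why O(m) runs suffice to estimate m main effects.
```

Each of the 8 rows is one simulator configuration: a +1 in column j means parameter j is set to its high value in that run, a −1 means its low value.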

PB Design Matrix
[Table: 8 configurations (rows) × 7 input parameters/factors A-G (columns) with +1/−1 entries, the measured response for each configuration, and the computed effect of each parameter (the build-up shows example effects of 65 and −45; the remaining numeric values were not preserved)]


PB Design
- Only the magnitude of an effect is important; the sign is meaningless
- In the example, most → least important effects: [C, D, E] → F → G → A → B

Case Study #1: Determine the most significant parameters in a processor simulator.

Determine the Most Significant Processor Parameters
- Problem: so many parameters in a simulator; how to choose parameter values? how to decide which parameters are most important?
- Approach: choose reasonable upper/lower bounds; rank parameters by impact on total execution time.

Simulation Environment
- SimpleScalar simulator (sim-outorder 3.0)
- Selected SPEC 2000 benchmarks: gzip, vpr, gcc, mesa, art, mcf, equake, parser, vortex, bzip2, twolf
- MinneSPEC reduced input sets
- Compiled with gcc (PISA) at -O3

Functional Unit Values (low value / high value)
- Int ALUs: 1 / 4
- Int ALU latency: 2 cycles / 1 cycle
- Int ALU throughput: 1
- FP ALUs: 1 / 4
- FP ALU latency: 5 cycles / 1 cycle
- FP ALU throughput: 1
- Int mult/div units: 1 / 4
- Int mult latency: 15 cycles / 2 cycles
- Int div latency: 80 cycles / 10 cycles
- Int mult throughput: 1
- Int div throughput: equal to Int div latency
- FP mult/div units: 1 / 4
- FP mult latency: 5 cycles / 2 cycles
- FP div latency: 35 cycles / 10 cycles
- FP sqrt latency: 35 cycles / 15 cycles
- FP mult throughput: equal to FP mult latency
- FP div throughput: equal to FP div latency
- FP sqrt throughput: equal to FP sqrt latency

Memory System Values, Part I (low value / high value)
- L1 I-cache size: 4 KB / 128 KB
- L1 I-cache associativity: 1-way / 8-way
- L1 I-cache block size: 16 bytes / 64 bytes
- L1 I-cache replacement policy: least recently used
- L1 I-cache latency: 4 cycles / 1 cycle
- L1 D-cache size: 4 KB / 128 KB
- L1 D-cache associativity: 1-way / 8-way
- L1 D-cache block size: 16 bytes / 64 bytes
- L1 D-cache replacement policy: least recently used
- L1 D-cache latency: 4 cycles / 1 cycle
- L2 cache size: 256 KB / 8192 KB
- L2 cache associativity: 1-way / 8-way
- L2 cache block size: 64 bytes / 256 bytes

Memory System Values, Part II (low value / high value)
- L2 cache replacement policy: least recently used
- L2 cache latency: 20 cycles / 5 cycles
- Memory latency, first: 200 cycles / 50 cycles
- Memory latency, next: 0.02 × memory latency, first
- Memory bandwidth: 4 bytes / 32 bytes
- I-TLB size: 32 entries / 256 entries
- I-TLB page size: 4 KB / 4096 KB
- I-TLB associativity: 2-way / fully associative
- I-TLB latency: 80 cycles / 30 cycles
- D-TLB size: 32 entries / 256 entries
- D-TLB page size: same as I-TLB page size
- D-TLB associativity: 2-way / fully associative
- D-TLB latency: same as I-TLB latency

Processor Core Values (low value / high value)
- Fetch queue entries: 4 / 32
- Branch predictor: 2-level / perfect
- Branch mispredict penalty: 10 cycles / 2 cycles
- RAS entries: 4 / 64
- BTB entries: …
- BTB associativity: 2-way / fully associative
- Speculative branch update: in commit / in decode
- Decode/issue width: 4-way
- ROB entries: 8 / 64
- LSQ entries: 0.25 × ROB / 1.0 × ROB
- Memory ports: 1 / 4

Determining the Most Significant Parameters
1. Run simulations to find the response, with the input parameters at their high/low (on/off) values.
[PB design matrix as before: configurations × parameters A-G, plus a response column]

Determining the Most Significant Parameters
2. Calculate the effect of each parameter across configurations.
[Same matrix, with the effect row filled in, e.g. 65]
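Step 2 amounts to a dot product: the effect of a parameter is the sum of the responses weighted by that parameter's +1/−1 column. A minimal sketch, in which both the seed row and the response values are illustrative stand-ins (the talk's actual response numbers were not preserved in this transcript):

```python
# Sketch of effect computation over a Plackett-Burman design.
# Design construction as on the earlier slide; responses are made-up
# execution times, one per configuration.
def pb_design(seed):
    rows, row = [], list(seed)
    for _ in range(len(seed)):
        rows.append(list(row))
        row = [row[-1]] + row[:-1]  # circular right shift
    rows.append([-1] * len(seed))
    return rows

X = pb_design([+1, +1, +1, -1, +1, -1, -1])
responses = [12, 30, 7, 44, 19, 25, 8, 16]  # illustrative only

def effect(col):
    # Response-weighted sum of the parameter's +1/-1 settings.
    return sum(X[i][col] * responses[i] for i in range(len(X)))

effects = [effect(j) for j in range(7)]
# Rank parameters by |effect|; the sign is meaningless (next slide).
ranking = sorted(range(7), key=lambda j: -abs(effects[j]))
```

A large |effect| means flipping that parameter between its low and high bound moves the response a lot, so that parameter deserves attention in subsequent studies.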

Determining the Most Significant Parameters
3. For each benchmark, rank the parameters in descending order of effect (1 = most important, …).
[Table of per-benchmark ranks; e.g. parameter C ranks 2, 6, and 7 on benchmarks 1-3]

Determining the Most Significant Parameters
4. For each parameter, average the ranks.
[Same table with an average column; e.g. parameter C: ranks 2, 6, 7 → average 5]
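Steps 3-4 above reduce to averaging each parameter's per-benchmark rank. Parameter C's ranks (2, 6, 7 → average 5) are from the slide; the ranks for A and B did not survive transcription, so the values below are made-up placeholders.

```python
# Steps 3-4: average each parameter's importance ranks across benchmarks.
# C's ranks come from the slide; A's and B's are placeholder values.
ranks = {
    "A": [3, 10, 12],   # placeholder
    "B": [9, 4, 22],    # placeholder
    "C": [2, 6, 7],     # from the slide: average 5
}
average_rank = {p: sum(r) / len(r) for p, r in ranks.items()}

# Parameters are then ordered by ascending average rank
# (1 = most important).
overall_order = sorted(average_rank, key=average_rank.get)
```

With these numbers C comes out first overall, matching the slide's example of C being a top-ranked parameter.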

Most Significant Parameters
[Per-benchmark ranks for gcc, gzip, and art, and the averages, were not preserved; ranked order:]
1. ROB Entries
2. L2 Cache Latency
3. Branch Predictor Accuracy
4. Number of Integer ALUs
5. L1 D-Cache Latency
6. L1 I-Cache Size
7. L2 Cache Size
8. L1 I-Cache Block Size
9. Memory Latency, First
10. LSQ Entries
11. Speculative Branch Update

General Procedure
1. Determine upper/lower bounds for the parameters
2. Simulate the configurations to find the response
3. Compute the effects of each parameter for each configuration
4. Rank the parameters for each benchmark based on the effects
5. Average the ranks across benchmarks
6. Focus on the top-ranked parameters for subsequent analysis

Case Study #2: Determine the "big picture" impact of a system enhancement.

Determining the Overall Effect of an Enhancement
- Problem: performance analysis is typically limited to single metrics (speedup, power consumption, miss rate, etc.)
- Such simple analysis discards a lot of good information

Determining the Overall Effect of an Enhancement
- Find the most important parameters without the enhancement, using Plackett and Burman
- Find the most important parameters with the enhancement, again using Plackett and Burman
- Compare the parameter ranks

Example: Instruction Precomputation
- Profile to find the most common operations (0+1, 1+1, etc.)
- Insert the results of common operations in a table when the program is loaded into memory
- Query the table when an instruction is issued
- Don't execute the instruction if its result is already in the table
- Reduces contention for function units
[Yi, Sendag, Lilja, Europar, 2002]
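The profile-then-lookup idea can be sketched as follows. This is only a functional model of the mechanism, not the hardware: the two-operation "ISA", the trace, and the table size are all made up for illustration.

```python
from collections import Counter

# Functional sketch of instruction precomputation: profile the most
# frequent (opcode, operand, operand) computations, preload their results
# into a fixed-size table, and satisfy matching dynamic instructions from
# the table instead of a function unit.
def compute(op, a, b):
    # Toy two-operation "ISA" used only for this sketch.
    return a + b if op == "add" else a * b

def build_table(profile_trace, size):
    # Keep the 'size' most frequent computations, with their results.
    counts = Counter(profile_trace)
    return {key: compute(*key) for key, _ in counts.most_common(size)}

def run(trace, table):
    hits = 0
    for instr in trace:
        if instr in table:
            hits += 1          # result read from the table: no execution
        else:
            compute(*instr)    # falls through to a function unit
    return hits

trace = [("add", 0, 1)] * 6 + [("mul", 2, 3)] * 3 + [("add", 5, 7)]
table = build_table(trace, size=2)
hits = run(trace, table)  # 9 of the 10 dynamic instructions hit the table
```

Because a small number of trivial computations dominate dynamic instruction mixes, even a tiny table captures a large fraction of executions, which is what relieves function-unit contention.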

The Effect of Instruction Precomputation
[Table of average ranks before / after the enhancement, with differences; the "after" and "difference" columns and several "before" values were not preserved. Surviving "before" ranks:]
- ROB entries: 2.77
- L2 cache latency: 4.00
- Branch predictor accuracy: 7.69
- Number of integer ALUs: 9.08
- L1 D-cache latency, L1 I-cache size, L2 cache size, L1 I-cache block size, memory latency (first): values not preserved
- LSQ entries: 12.62


Case Study #3: Benchmark program classification.

Benchmark Classification
- By application type: scientific and engineering, transaction processing, multimedia
- By use of processor function units: floating-point code, integer code, memory-intensive code
- Etc., etc.

Another Point of View
- Classify by overall impact on the processor
- Define: two benchmark programs are similar if they stress the same components of a system to similar degrees
- How to measure this similarity? Use a Plackett and Burman design to find the ranks, then compare the ranks

Similarity Metric
- Use the rank of each parameter as the elements of a vector
- For benchmark program X, let X = (x1, x2, …, xn-1, xn), where x1 = rank of parameter 1, x2 = rank of parameter 2, …

Vector Defines a Point in n-Space
[Figure: two points (x1, x2, x3) and (y1, y2, y3) plotted against the Param #1, Param #2, and Param #3 axes, separated by distance D]

Similarity Metric
- Euclidean distance between the points: D(X, Y) = [(x1 − y1)² + (x2 − y2)² + … + (xn − yn)²]^1/2

Most Significant Parameters
Ranks for gcc / gzip / art (partially recoverable from the next slide's rank vectors):
1. ROB Entries: 4 / 1 / 2
2. L2 Cache Latency: 2 / 4 / 4
3. Branch Predictor Accuracy: 5 / 2 / 27
4. Number of Integer ALUs: 8 / 3 / 29
5. L1 D-Cache Latency
6. L1 I-Cache Size
7. L2 Cache Size
8. L1 I-Cache Block Size
9. Memory Latency, First
10. LSQ Entries
11. Speculative Branch Update: 28 / 8 / 16

Distance Computation
- Rank vectors: gcc = (4, 2, 5, 8, …), gzip = (1, 4, 2, 3, …), art = (2, 4, 27, 29, …)
- Euclidean distances:
  D(gcc, gzip) = [(4−1)² + (2−4)² + (5−2)² + …]^1/2
  D(gcc, art) = [(4−2)² + (2−4)² + (5−27)² + …]^1/2
  D(gzip, art) = [(1−2)² + (4−4)² + (2−27)² + …]^1/2
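The distance arithmetic above can be sketched directly. The vectors below are the truncated four-element rank prefixes from this slide; the full vectors have one entry per processor parameter.

```python
import math

# Euclidean distance between two benchmarks' parameter-rank vectors.
def distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Truncated rank vectors from the "Distance Computation" slide.
gcc  = (4, 2, 5, 8)
gzip = (1, 4, 2, 3)
art  = (2, 4, 27, 29)

d_gcc_gzip = distance(gcc, gzip)  # sqrt(9 + 4 + 9 + 25) = sqrt(47)
d_gcc_art  = distance(gcc, art)   # sqrt(4 + 4 + 484 + 441) = sqrt(933)
```

The large art terms (27, 29) illustrate the point of the metric: a benchmark whose bottom-ranked parameters are another benchmark's top-ranked ones ends up far away, i.e. the two stress different parts of the machine.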

Euclidean Distances for Selected Benchmarks
[Table: pairwise distances among gcc, gzip, art, and mcf; the numeric entries were not preserved (diagonal entries are 0)]

Dendrogram of Distances Showing (Dis-)Similarity

Final Benchmark Groupings
- Group I: gzip, mesa
- Group II: vpr-Place, twolf
- Group III: vpr-Route, parser, bzip2
- Group IV: gcc, vortex
- Group V: art
- Group VI: mcf
- Group VII: equake
- Group VIII: ammp

Conclusion
- Multifactorial (Plackett and Burman) design: requires only O(m) experiments; determines the effects of main factors only; ignores interactions
- Logically minimal number of experiments to estimate the effects of m input parameters
- Powerful technique for obtaining a big-picture view of a lot of simulation data

Conclusion
- Demonstrated for: ranking the importance of simulation parameters, finding the overall impact of a processor enhancement, classifying benchmark programs
- Current work: comparing simulation strategies, e.g. reduced input sets (MinneSPEC) and sampling (SimPoint)

Goals
- Develop/understand tools for interpreting large quantities of data
- Increase insights into processor design
- Improve rigor in computer architecture research