Methodologies for Performance Simulation of Super-scalar OOO processors Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project.

Presentation transcript:

Methodologies for Performance Simulation of Super-scalar OOO processors Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project

Introduction: modeling, simulation, performance study, processor design.

Architectural Simulators: explore the design space, evaluate existing hardware, or predict the performance of proposed hardware; the designer has full control. Functional simulators model the architecture (the programmer's focus), e.g., sim-fast, sim-safe. Performance simulators model the microarchitecture (the designer's focus), e.g., cycle-by-cycle simulation (sim-outoforder).

Simulation Issues: Real applications take too long for cycle-by-cycle simulation, and the design space is vast. Design parameters: code properties, value prediction, dynamic instruction distance, basic block size, instruction fetch mechanisms, etc. Architectural metrics: IPC/ILP, cache miss rate, branch prediction accuracy, etc. The goal is to find design flaws and provide design improvements, which calls for a "robust" simulation methodology.

Two Methodologies. HLS, a hybrid of statistical and symbolic simulation (REF: HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs. M. Oskin, F. T. Chong, and M. Farrens. Proc. ISCA 2000). BBDA, basic block distribution analysis (REF: Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. T. Sherwood, E. Perelman, and B. Calder. Proc. PACT 2001).

HLS: An Overview. HLS is a hybrid processor simulator that combines a statistical model with symbolic execution to produce performance contours spanned by the design-space parameters. What can be achieved? Exploring design changes in architectures and compilers that would be impractical to simulate using conventional simulators.

HLS: Main Idea. The application code is statistically profiled from its instruction and data streams, yielding machine-independent characteristics (basic block size, dynamic instruction distance, instruction mix) and architecture metrics (cache behavior, branch prediction accuracy). Synthetically generated code with these properties then drives a structural simulation of the functional units and issue pipeline units.

Statistical Code Generation. Each "synthetic instruction" carries the following parameters, drawn from the statistical profile: functional unit requirements, dynamic instruction distances, and cache behavior.
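As a rough illustration only (the paper gives no code, and every name below is hypothetical), a generator of this kind can be sketched by sampling each parameter from the profiled distributions:

import random

# Minimal sketch of statistical code generation: each synthetic instruction
# draws its properties from the measured profile distributions. All class
# and field names are hypothetical, not from the HLS paper.
class SyntheticInstruction:
    def __init__(self, fu_type, dyn_inst_distance, is_cache_hit):
        self.fu_type = fu_type                        # functional unit requirement
        self.dyn_inst_distance = dyn_inst_distance    # distance to producer instruction
        self.is_cache_hit = is_cache_hit              # modeled cache behavior

def generate_synthetic_block(profile, block_size):
    """Generate one synthetic basic block from a statistical profile.
    `profile` is assumed to hold an instruction-mix distribution, a
    dynamic-instruction-distance distribution, and an L1 hit rate."""
    block = []
    for _ in range(block_size):
        fu = random.choices(profile["fu_types"], weights=profile["fu_mix"])[0]
        did = random.choices(profile["did_values"], weights=profile["did_dist"])[0]
        hit = random.random() < profile["l1_hit_rate"]
        block.append(SyntheticInstruction(fu, did, hit))
    return block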

Validation of HLS against SimpleScalar. For varying combinations of design parameters: run the original benchmark code on SimpleScalar (sim-outoforder), run the statistically generated code on HLS, and compare the SimpleScalar IPC against the HLS IPC.

Validation: Single- and Multi-value Correlations. IPC vs. L1-cache hit rate: for SPECint95, HLS errors are within 5-7% of the cycle-by-cycle results.

Validation: L1 Instruction Cache Miss Penalty vs. Hit Rate. The correlation suggests that the cache hit rate should be at least 88% for it to dominate performance.

HLS: Code Properties. Basic block size vs. L1-cache hit rate: the correlation suggests that increasing the basic block size helps only when the L1 cache hit rate is above 96% or below 82%.

HLS: Code Properties. Dynamic instruction distance (DID) vs. basic block size: the correlation suggests that moderate DID values suffice for IPC, and large basic block sizes (>8) do not help without an increase in DID.

HLS: Value Prediction. DID vs. value predictability. Goal: break true dependencies. Plotted: the stall penalty for a mispredict vs. value-prediction knowledge.

HLS: More Multi-value Correlations. L1-cache hit rate vs. value predictability; DID vs. superscalar issue width.

HLS: Discussion. The error rate is low only on the SPECint95 benchmark suite; error rates are high on SPECfp95 and the STREAM benchmarks (findings by R. H. Bell et al., 2004). Reason: instruction-level granularity of the workload model. Recommended improvement: basic block-level granularity.

Goals (of BBDA): find the end of the initialization, find the period of the program, find the ideal place to simulate given a specific number of instructions one has to simulate, and obtain an accurate confidence estimate for the chosen simulation point.

Program Behavior. Program behavior has ramifications for architectural techniques, and it differs across different parts of execution: an initialization phase followed by cyclic (periodic) behavior.

Basic Block Distribution Analysis. Each basic block is executed a certain number of times, and the execution counts of all basic blocks form a fingerprint of an interval. These fingerprints are used to find representative areas of the program to simulate. How does fingerprinting help?

Cyclic Behavior of Programs. Cyclic behavior is not representative of all programs, but it is the common case for compute-bound applications. The SPEC95 wave program, for example, executes 7 billion instructions before it reaches the code that accounts for the bulk of execution.

Basic Block Vectors. Fast profiling determines the number of times each basic block executes. Because the behavior of the program is directly related to the code it is executing, profiling gives a basic block fingerprint for each interval of time. The aim is to pick an interval that spends proportionally the same amount of time in the same code as the full execution of the program. Fingerprints are collected in intervals of 100 million instructions.

Basic Block Vector (BBV). A BBV is a one-dimensional array with one element for each basic block in the program; each element counts how many times that basic block was entered during an interval. Intervals can vary in size: a BBV collected over an interval of N × 100 million instructions is a BBV of duration N.
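A minimal sketch of BBV collection, assuming a profiling hook (hypothetical) that yields each basic block id as it is entered, with one BBV emitted per 100-million-instruction interval:

from collections import defaultdict

INTERVAL = 100_000_000

def collect_bbvs(executed_blocks, block_sizes):
    """executed_blocks: iterable of basic-block ids in execution order.
    block_sizes: maps a block id to its instruction count."""
    bbvs, current, executed = [], defaultdict(int), 0
    for block_id in executed_blocks:
        current[block_id] += 1                 # count entries into this block
        executed += block_sizes[block_id]      # instructions executed so far
        if executed >= INTERVAL:
            bbvs.append(dict(current))         # close out this interval's BBV
            current, executed = defaultdict(int), 0
    if current:
        bbvs.append(dict(current))             # final partial interval
    return bbvs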

Basic Block Vectors (continued). Each BBV is normalized: every element is divided by the sum of all elements. The target BBV is the BBV for the entire execution of the program. Objective: find a BBV of small duration that is similar to the target BBV.

Basic Block Vector Difference. The difference between two normalized BBVs is computed by element-wise subtraction followed by summing the absolute values (Manhattan distance), giving a number between 0 and 2; Euclidean distance can also be used.
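A minimal sketch of this difference, using the dictionary-style BBVs from the collection sketch above (normalize each vector, then take the Manhattan distance):

def normalize(bbv):
    total = sum(bbv.values())
    return {b: c / total for b, c in bbv.items()}

def manhattan_distance(bbv_a, bbv_b):
    a, b = normalize(bbv_a), normalize(bbv_b)
    blocks = set(a) | set(b)
    # Sum of absolute element-wise differences; bounded by 2 for normalized vectors.
    return sum(abs(a.get(blk, 0.0) - b.get(blk, 0.0)) for blk in blocks)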

Basic Block Difference Graph. A plot of how well each individual sample of the program compares to the target BBV: for each interval of 100 million instructions, we create a BBV and calculate its difference from the target BBV. The graph is used to find the initialization phase and the period of the program.

Basic Block Difference Graph (diagram).

Initialization. Detecting initialization is not trivial, yet it matters for simulating representative sections of code, so detecting the end of the initialization phase is important. Initialization Difference Graph: the Initial Representative Signal (IRS) is the first quarter of the BB difference graph; slide it across the BB difference graph, computing the difference at each point over the first half of the BBDG. When the IRS reaches the end of the initialization stage on the BB difference graph, the difference is maximized.
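A minimal sketch of this sliding-window search, assuming the BBDG is simply a list of per-interval differences from the target BBV:

def find_initialization_end(bbdg):
    irs = bbdg[: len(bbdg) // 4]                  # Initial Representative Signal
    best_offset, best_diff = 0, float("-inf")
    # Slide the IRS over the first half of the BBDG and keep the offset
    # where the signal-vs-graph difference is largest.
    for offset in range(len(bbdg) // 2 - len(irs)):
        window = bbdg[offset : offset + len(irs)]
        diff = sum(abs(a - b) for a, b in zip(irs, window))
        if diff > best_diff:
            best_offset, best_diff = offset, diff
    return best_offset                            # interval index where initialization ends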

Initialization (diagram).

Period. Period Difference Graph: the Period Representative Signal (PRS) is the part of the BBDG starting at the end of initialization and extending for one quarter of the length of program execution. Slide it across half of the BBDG; the distance between the minimum points on the Y-axis is the period. Using BBVs of larger duration creates a BBDG that emphasizes the larger periods.
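A minimal sketch of period detection under the same assumptions; the selection of minima here is a simplification (spacing between the first two local minima), not the paper's exact procedure:

def find_period(bbdg, init_end):
    prs_len = len(bbdg) // 4
    prs = bbdg[init_end : init_end + prs_len]     # Period Representative Signal
    diffs = []
    for offset in range(init_end, len(bbdg) // 2):
        window = bbdg[offset : offset + len(prs)]
        if len(window) < len(prs):
            break
        diffs.append(sum(abs(a - b) for a, b in zip(prs, window)))
    # Local minima of the difference curve; their spacing estimates the period
    # in units of intervals.
    minima = [i for i in range(1, len(diffs) - 1)
              if diffs[i] < diffs[i - 1] and diffs[i] < diffs[i + 1]]
    return minima[1] - minima[0] if len(minima) > 1 else None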

Period (diagram).

Method. SimpleScalar was modified to output and clear its statistics counters every 100 million committed instructions. Graphed data: IPC, % RUU occupancy, cache miss rate, etc. To get the most representative sample of a program, at least one full period must be simulated.

Results

Basic Block Similarity Matrix. A phase of program behavior can be defined as all similar sections of execution, regardless of temporal adjacency. The similarity matrix is the upper triangle of an N × N matrix, where N is the number of intervals in the program execution; the entry at (x, y) is the Manhattan distance between the BBV at interval x and the BBV at interval y.
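A minimal sketch of this matrix, assuming the BBVs have already been normalized and stored as an (N × B) array of per-interval vectors:

import numpy as np

def similarity_matrix(bbvs):
    n = len(bbvs)
    matrix = np.zeros((n, n))
    for x in range(n):
        for y in range(x + 1, n):                 # upper triangle only
            matrix[x, y] = np.abs(bbvs[x] - bbvs[y]).sum()   # Manhattan distance
    return matrix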

Basic Block Similarity Matrix (image).

Finding Basic Block Similarity. Many intervals of execution are similar to each other, so it makes sense to group them together; this is essentially a clustering problem.

Clustering. The goal is to divide a set of points into groups such that points within each group are similar to one another by some metric; the same problem arises in fields such as computer vision and genomics. Two types of clustering algorithms exist: partitioning algorithms, which choose an initial solution and then iteratively update it to find a better one (linear time complexity), and hierarchical algorithms, either divisive or agglomerative (quadratic time complexity).

Phase Finding Algorithm. Generate BBVs with a duration of 1; reduce the dimension of the BBVs to 15; apply the clustering algorithm to the reduced BBVs; score the clusterings and choose the most suitable one.

Random Projection. The curse of dimensionality: the BBV dimension is the number of executed basic blocks and can grow to millions. The options are dimension selection or dimension reduction; here, a random linear projection is used.
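A minimal sketch of a random linear projection to 15 dimensions, assuming the BBVs are stacked into an (intervals × blocks) matrix; the uniform random projection matrix is one common choice, not necessarily the paper's exact construction:

import numpy as np

def random_projection(bbv_matrix, target_dim=15, seed=0):
    rng = np.random.default_rng(seed)
    n_intervals, n_blocks = bbv_matrix.shape
    # Random (blocks x target_dim) matrix; multiplying by it projects each
    # BBV down to target_dim dimensions.
    projection = rng.uniform(-1.0, 1.0, size=(n_blocks, target_dim))
    return bbv_matrix @ projection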

Clustering Algorithm. K-means is an iterative optimization algorithm with two repeating phases (assignment and centroid update) that converge. (Work in progress.)
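A minimal k-means sketch showing the two repeating phases on the projected BBVs (standard textbook k-means, not the authors' exact implementation):

import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids from k randomly chosen points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iterations):
        # Phase 1: assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                 # converged: assignments unchanged
        labels = new_labels
        # Phase 2: recompute each centroid as the mean of its members.
        for c in range(k):
            members = points[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels, centroids

For example, labels, _ = kmeans(random_projection(bbv_matrix), k=5) would group the intervals of a run into five candidate phases.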