Combining Statistical and Symbolic Simulation. Mark Oskin, Fred Chong, and Matthew Farrens. Dept. of Computer Science, University of California at Davis.

Overview
- HLS is a hybrid performance simulator: statistical + symbolic
- Fast
- Accurate
- Flexible

Motivation
- I-cache hit rate
- I-cache miss penalty
- Branch mispredict penalty
- Basic block size
- Dispatch bandwidth

Motivation
- Fast simulation
  - seconds instead of hours or days
  - ideally interactive
- Abstract simulation
  - simulate the performance of unknown designs
  - model application characteristics, not applications

Outline
- Simulation technologies and HLS
- From applications to profiles
- Validation
- Examples
- Issues
- Conclusion

Design Flow with HLS
[Figure: design flow diagram with the nodes Cycle-by-Cycle Simulation, HLS, Profile, Design Issue, Possible Solution, and Estimate Performance]

Traditional Simulation Techniques
- Cycle-by-cycle (SimpleScalar, SimOS, etc.)
  + accurate
  - slow
- Native emulation / basic block models (ATOM, Pixie)
  + fast, handles complex applications
  - useful only to a point (no low-level modifications)

Statistical / Symbolic Execution (HLS)
+ fast (near interactive)
+ accurate / - within regions
+ permits variation of low-level parameters
+ arbitrary design points / - use carefully

HLS: A Superscalar Statistical and Symbolic Simulator
[Figure: block diagram showing the L2 cache, L1 I-cache, L1 D-cache, main memory, branch predictor, fetch unit, out-of-order dispatch unit, out-of-order completion unit, and out-of-order execution core, each marked as either statistical or symbolic]

Workflow
[Figure: workflow diagram with the nodes Code, Binary, sim-stat, sim-outorder, app profile, Stat-binary, HLS, machine-profile, R10k, and machine-configuration]

Machine Configurations
- Number of functional units (I, F, [L, S], B)
- Functional unit pipeline depths
- Fetch, dispatch, and completion bandwidths
- Memory access latencies
- Mis-speculation penalties
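The parameters listed on this slide could be grouped into a single configuration record. The sketch below is only an illustration of that idea: the class name `MachineConfig`, every field name, and all numeric values are hypothetical and are not taken from HLS or its source code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MachineConfig:
    """Hypothetical container for the machine parameters named on the slide."""
    functional_units: dict   # unit class -> count, e.g. I, F, LS (load/store), B
    pipeline_depths: dict    # unit class -> pipeline depth in stages
    fetch_bandwidth: int     # instructions fetched per cycle
    dispatch_bandwidth: int  # instructions dispatched per cycle
    completion_bandwidth: int
    l1_latency: int          # memory access latencies, in cycles
    l2_latency: int
    memory_latency: int
    branch_mispredict_penalty: int  # mis-speculation penalty, in cycles

    def problems(self):
        """Return a list of human-readable consistency problems (empty if none)."""
        found = []
        if set(self.functional_units) != set(self.pipeline_depths):
            found.append("every functional unit class needs a pipeline depth")
        if any(count < 1 for count in self.functional_units.values()):
            found.append("functional unit counts must be at least 1")
        widths = (self.fetch_bandwidth, self.dispatch_bandwidth,
                  self.completion_bandwidth)
        if any(width < 1 for width in widths):
            found.append("bandwidths must be at least 1")
        if not (self.l1_latency <= self.l2_latency <= self.memory_latency):
            found.append("latencies should grow from L1 to L2 to memory")
        return found


# Illustrative values only; they do not describe any real processor.
example_config = MachineConfig(
    functional_units={"I": 2, "F": 2, "LS": 1, "B": 1},
    pipeline_depths={"I": 1, "F": 4, "LS": 2, "B": 1},
    fetch_bandwidth=4,
    dispatch_bandwidth=4,
    completion_bandwidth=4,
    l1_latency=1,
    l2_latency=6,
    memory_latency=34,
    branch_mispredict_penalty=3,
)
```

Keeping the record immutable makes it cheap to derive design points with `replace(example_config, fetch_bandwidth=8)` while leaving the baseline untouched.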

Profiles
- Machine profile:
  - cache hit rates => (  )
  - branch prediction accuracy => (  )
- Application profile:
  - basic block size => ( ,  )
  - instruction mix (% of I, F, L, S, B)
  - dynamic instruction distance (histogram)
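One way to picture the two profiles is as small records from which instruction classes and dependence distances can be drawn at random. This is a sketch under stated assumptions: the class and field names are hypothetical, the numbers are made up, and representing basic block size by a mean and a standard deviation is an assumption, since the symbols on the slide did not survive extraction.

```python
import random
from dataclasses import dataclass


@dataclass
class MachineProfile:
    """Hypothetical machine profile: hit rates and accuracy in [0, 1]."""
    l1_icache_hit_rate: float
    l2_icache_hit_rate: float
    l1_dcache_hit_rate: float
    l2_dcache_hit_rate: float
    branch_prediction_accuracy: float


@dataclass
class AppProfile:
    """Hypothetical application profile."""
    bb_size_mean: float      # assumption: block size summarized by a mean...
    bb_size_std: float       # ...and a standard deviation
    instruction_mix: dict    # instruction class -> fraction of instructions
    dep_distance_hist: dict  # dynamic instruction distance -> observed count

    def check(self):
        """Raise ValueError if the instruction mix does not sum to 1."""
        total = sum(self.instruction_mix.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"instruction mix sums to {total}, expected 1.0")

    def sample_kind(self, rng):
        """Draw one instruction class according to the instruction mix."""
        kinds = list(self.instruction_mix)
        weights = list(self.instruction_mix.values())
        return rng.choices(kinds, weights=weights, k=1)[0]

    def sample_distance(self, rng):
        """Draw one dependence distance according to the histogram."""
        distances = list(self.dep_distance_hist)
        weights = list(self.dep_distance_hist.values())
        return rng.choices(distances, weights=weights, k=1)[0]


# Illustrative numbers only; they are not measurements of any benchmark.
example_machine = MachineProfile(0.98, 0.95, 0.94, 0.90, 0.92)
example_app = AppProfile(
    bb_size_mean=5.0,
    bb_size_std=2.0,
    instruction_mix={"I": 0.45, "F": 0.05, "L": 0.25, "S": 0.10, "B": 0.15},
    dep_distance_hist={1: 40, 2: 25, 3: 15, 4: 10, 8: 10},
)
```

Passing a seeded `random.Random` instance into the samplers keeps every draw reproducible, which matters when two design points are compared on the same synthetic workload.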

Statistical Binary
- 100 basic blocks
- Correlated:
  - random instruction mix
  - random assignment of dynamic instruction distance
  - random distribution of cache and branch behaviors

Statistical Binary
load (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)
integer (l1 i-cache, l2 i-cache, dependence 0, dependence 1)
branch (l1 i-cache, l2 i-cache, branch-predictor accr., dep 0, dep 1)
store (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dep 0, dep 1)
load (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)
Figure annotations:
- core functional unit requirements
- cache behavior during I-fetch
- cache behavior during data access
- dynamic instruction distance
- branch predictor behavior
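The statistical binary described on the two slides above can be pictured as a list of basic blocks whose instructions carry pre-drawn cache, branch, and dependence outcomes. The generator below is a minimal sketch of that idea, not the HLS implementation: the function names, the default rates, the Gaussian block-size model, the choice to end every block with a branch, and the treatment of the `float` class (which the slide does not show) are all assumptions.

```python
import random

# Illustrative defaults; none of these numbers come from the slides.
DEFAULT_MIX = {"integer": 0.45, "float": 0.05, "load": 0.25, "store": 0.10}
DEFAULT_RATES = {"l1_icache": 0.98, "l2_icache": 0.95,
                 "l1_dcache": 0.94, "l2_dcache": 0.90, "branch": 0.92}
DEFAULT_DEP_HIST = {1: 40, 2: 25, 3: 15, 4: 10, 8: 10}


def make_instruction(kind, rng, rates, dep_hist):
    """Build one symbolic instruction with randomly drawn behaviors."""
    ins = {
        "kind": kind,
        # cache behavior during instruction fetch (True means hit)
        "l1_icache": rng.random() < rates["l1_icache"],
        "l2_icache": rng.random() < rates["l2_icache"],
    }
    if kind in ("load", "store"):
        # cache behavior during data access
        ins["l1_dcache"] = rng.random() < rates["l1_dcache"]
        ins["l2_dcache"] = rng.random() < rates["l2_dcache"]
    if kind == "branch":
        ins["predicted"] = rng.random() < rates["branch"]
    # loads carry one dependence on the slide; other classes carry two
    n_deps = 1 if kind == "load" else 2
    distances = list(dep_hist)
    weights = list(dep_hist.values())
    ins["deps"] = rng.choices(distances, weights=weights, k=n_deps)
    return ins


def generate_stat_binary(rng, n_blocks=100, bb_mean=5.0, bb_std=2.0,
                         mix=None, rates=None, dep_hist=None):
    """Return a list of basic blocks; each block is a list of instructions."""
    mix = mix or DEFAULT_MIX
    rates = rates or DEFAULT_RATES
    dep_hist = dep_hist or DEFAULT_DEP_HIST
    kinds, weights = list(mix), list(mix.values())
    binary = []
    for _ in range(n_blocks):
        size = max(1, round(rng.gauss(bb_mean, bb_std)))
        body = rng.choices(kinds, weights=weights, k=size - 1)
        block = [make_instruction(k, rng, rates, dep_hist) for k in body]
        # assumption: every basic block is terminated by a branch
        block.append(make_instruction("branch", rng, rates, dep_hist))
        binary.append(block)
    return binary
```

Calling `generate_stat_binary(random.Random(1))` yields 100 blocks; reusing the same seed reproduces the same synthetic program.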

HLS Instruction Fetch Stage
[Figure: stream of symbolic instructions: integer (...), branch (...), store (...), load (...), integer (...), branch (...), load (...), integer (...)]
Similar to conventional instruction fetch:
- has a PC
- has a fetch window
- interacts with caches
- utilizes a branch predictor
- passes instructions to dispatch
Differences:
- caches and branch predictor are statistical models
The fetch stage fetches symbolic instructions and interacts with a statistical memory system and branch predictor model.
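A fetch stage of this kind can be sketched as a loop that walks a PC over symbolic instructions and charges stall cycles according to the outcomes each instruction already carries. The code below is a toy model written for illustration: `SymInstr`, `fetch_cycles`, the latency and penalty values, and the rule that a fetch window ends at a branch are all assumptions, not details confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass
class SymInstr:
    """Symbolic instruction with pre-drawn statistical outcomes."""
    kind: str
    l1_icache_hit: bool = True
    l2_icache_hit: bool = True
    predicted_correctly: bool = True  # only meaningful for branches


def fetch_cycles(program, fetch_width=4, l2_latency=6, mem_latency=30,
                 mispredict_penalty=3):
    """Return (cycles spent fetching, instructions passed to dispatch)."""
    pc = 0
    cycles = 0
    dispatched = []
    while pc < len(program):
        cycles += 1  # one cycle per fetch window
        for ins in program[pc:pc + fetch_width]:
            if not ins.l1_icache_hit:
                # L1 miss: pay the L2 latency, or the memory latency on an L2 miss
                cycles += l2_latency if ins.l2_icache_hit else mem_latency
            dispatched.append(ins)
            pc += 1
            if ins.kind == "branch":
                if not ins.predicted_correctly:
                    cycles += mispredict_penalty
                break  # assumption: the fetch window ends at a branch
    return cycles, dispatched


example_program = [
    SymInstr("integer"),
    SymInstr("load"),
    SymInstr("branch", predicted_correctly=False),
    SymInstr("integer", l1_icache_hit=False),
]
```

Because the hit, miss, and prediction outcomes travel with each instruction, the loop never consults a real cache or predictor; changing a hit rate only changes how the instruction stream is generated, not the fetch logic.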

Validation - SimpleScalar vs. HLS

Validation - R10k vs. HLS

HLS Multi-Value Validation with SimpleScalar (Perl)
[Figure: side-by-side panels for HLS and SimpleScalar]

HLS Multi-Value Validation with SimpleScalar (Xlisp)
[Figure: side-by-side panels for HLS and SimpleScalar]

Example Use of HLS (Perl)
An intuitive result: branch prediction accuracy becomes less important (crosses fewer iso-IPC contour lines) as basic block size increases.

Example Use of HLS (Perl)
Another intuitive result: gains in IPC due to basic block size are front-loaded.
There is a trade-off between front-end (fetch/dispatch) and back-end (ILP) processor performance.

Example use of HLS This space intentionally left blank. (Perl)

Related Work
- R. Carl and J.E. Smith. Modeling superscalar processors via statistical simulation. PAID Workshop, June.
- N. Jouppi. The non-uniform distribution of instruction-level and machine parallelism and its effect on performance. IEEE Trans.
- D. Noonburg and John Shen. Theoretical modeling of superscalar processor performance. MICRO27, November 1994.

Questions & Future Directions
- How important are different well-performing benchmarks anyway?
  - easily summarized
  - summaries are not precise => yet precise enough
- Will the statistical + symbolic technique work for poorly behaved applications?
- Will it extend to deeper pipelines and more real processors (e.g. Alpha, P6 architecture)?

Conclusion
- HLS: statistical + symbolic execution
  - intuitive design space exploration
  - fast
  - accurate
  - flexible
- Validated against cycle-by-cycle simulation and the R10k
- Future work: deeper pipelines, more hardware validations, additional domains
- Source code at: