Accurately Approximating Superscalar Processor Performance from Traces
Kiyeon Lee, Shayne Evans, and Sangyeun Cho
Dept. of Computer Science, University of Pittsburgh

Motivation
- How can we simulate the dynamic behavior of a superscalar processor?
  - Execution-driven simulation
  - Trace-driven simulation with a full trace
  - Trace-driven simulation with a filtered trace?
    Use timing-aware L1-filtered traces to predict a core's dynamic behavior
- Goal: accurate trace-driven simulation using storage-efficient filtered traces
  - Perfect branch prediction and a perfect instruction cache are assumed
  - Forms a baseline study for multicore simulation at an early design stage using traces

Our contributions
- Three trace-driven simulation models using L1-filtered traces
  - L1-filtered traces are generated with a perfect L2 cache
- Model 1: Isolated Cache Miss Model (IsoCMM)
  - No MLP (memory-level parallelism): assumes every L1 dcache miss is isolated
  - Pessimistic when scheduling trace items
- Model 2: Independent Cache Miss Model (IndepCMM)
  - MLP: assumes every L1 dcache miss is independent
  - Optimistic when scheduling trace items
- Model 3: Pairwise Dependent Cache Miss Model (PDepCMM)
  - MLP: built on top of the independent cache miss model
  - Honors dependencies between trace items

Isolated cache miss model
- Sequentially process each trace item
  - Update statistics and access the L2 cache
  - If an L2 cache miss occurs, add the estimated impact of the L2 miss to the program execution time
- Real impact of an L2 miss on address A: T_run2 - T_run1, where run 1 sees an L1 miss/L2 hit on A and run 2 sees an L1 miss/L2 miss on A
  - This difference is smaller than the full memory access latency
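The per-item loop below is a minimal sketch of this model, not the authors' implementation; the trace format, the toy LRU L2, and the fixed miss penalty are all assumptions.

    from collections import OrderedDict

    class SimpleL2:
        """Toy fully-associative LRU L2 model (illustrative only)."""
        def __init__(self, num_blocks, block_size=64):
            self.num_blocks, self.block_size = num_blocks, block_size
            self.blocks = OrderedDict()
        def access(self, addr):
            tag = addr // self.block_size
            if tag in self.blocks:
                self.blocks.move_to_end(tag)     # refresh LRU position
                return True                      # L2 hit
            self.blocks[tag] = None              # L2 miss: fill the block
            if len(self.blocks) > self.num_blocks:
                self.blocks.popitem(last=False)  # evict the LRU block
            return False

    def isocmm_runtime(trace, l2, l2_miss_penalty):
        """trace: (addr, gap_cycles) per L1 dcache miss, where gap_cycles
        were measured with a perfect L2. Every L2 miss is charged in full,
        one after another: no memory-level parallelism."""
        total = 0
        for addr, gap in trace:
            total += gap
            if not l2.access(addr):
                total += l2_miss_penalty
        return total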

Instruction permeability analysis
- Assume all L1 dcache misses are isolated
- For each trace item, record two timings from separate runs: a_i cycles when the access hits in a perfect L2 (run 1), and b_i cycles when it misses in the L2 (run 2, run 3, ...)
- Trace simulation then charges a_i cycles on an L2 hit and b_i cycles on an L2 miss for that trace item
- [Figure: runs over addresses A, B, C, D showing per-item hit/miss timings, e.g. hit on A costs a1 cycles, miss on A costs b1 cycles]
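For illustration, the measured timings can be viewed as a per-item lookup table consulted during trace simulation; the numbers below are invented, not taken from the paper.

    # Per trace item: cycles to charge on an L2 hit (a_i, from the run with a
    # perfect L2) vs. on an L2 miss (b_i, from a run where that item misses).
    permeability = {
        'A': {'hit': 5,  'miss': 105},   # a1 / b1
        'B': {'hit': 15, 'miss': 92},    # a2 / c2
        'C': {'hit': 20, 'miss': 118},   # a3 / b3
    }

    def charge(item, l2_hit):
        return permeability[item]['hit' if l2_hit else 'miss']

    print(charge('A', True), charge('A', False))   # 5 105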

Accuracy of isolated cache miss model
- [Figure: per-benchmark CPI error. Errors are large with a high L1 dcache miss rate and clustered L1 dcache misses, and small with a low L1 dcache miss rate and sparse L1 dcache misses]

Weakness of isolated cache miss model
- Memory accesses are processed sequentially (no MLP is possible)
  - The impacts of L2 cache misses accumulate
  - A superscalar processor can overlap multiple independent memory accesses
- Our approach: consider MLP, using a model that can overlap memory accesses
  - Model 2: Independent cache miss model
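A back-of-the-envelope illustration of why serializing misses is pessimistic; the memory latency and MLP degree are assumed numbers.

    import math

    def stall_cycles(n_misses, mem_latency, mlp):
        # Groups of up to `mlp` overlapping misses expose only one latency each.
        return math.ceil(n_misses / mlp) * mem_latency

    print(stall_cycles(4, 100, 1))   # 400: IsoCMM's serialized view
    print(stall_cycles(4, 100, 4))   # 100: fully overlapped by an OOO core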

How to process trace items in parallel?
- Instructions can be issued once they are placed in the reorder buffer (ROB)
- Assign a unique ROB sequence ID to every instruction
- Estimate which trace items can be processed in parallel by comparing the ROB sequence IDs recorded in the trace items
- [Figure: trace file entries inst #1, #35, #64, #80 mapped onto a 64-entry reorder buffer]
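A minimal sketch of the ROB-window test, assuming a 64-entry ROB and trace items that carry their ROB sequence IDs.

    ROB_SIZE = 64

    def can_overlap(older_seq_id, younger_seq_id, rob_size=ROB_SIZE):
        """Two trace items can be in flight together only if the younger one
        enters the ROB before the older one commits."""
        return younger_seq_id - older_seq_id < rob_size

    print(can_overlap(10, 40))   # True: #10 and #40 fit in a 64-entry ROB
    print(can_overlap(10, 85))   # False: #85 must wait for #10 to commit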

Independent cache miss model
- L1 misses are assumed to be independent
- Example: trace items A (#2), B (#10), C (#40), D (#85), and E (#90) all miss in both L1 and L2, with 5, 15, 20, and 8 cycles recorded between consecutive items
  - B and C fit in the 64-entry ROB together, so they are processed according to the cycle counts recorded in their trace items
  - D (#85) is blocked until B (#10) commits and frees ROB space
  - E (#90) is then processed according to the cycle count recorded in its trace item
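The loop below sketches this scheduling policy under the same assumptions as before (fixed L2 miss penalty, ROB sequence IDs in the trace items); it is illustrative, not the paper's simulator.

    ROB_SIZE = 64

    def indepcmm_runtime(trace, l2_miss_penalty):
        """trace: (seq_id, gap_cycles, l2_miss) in program order. Misses
        overlap freely as long as they fit in the same ROB window."""
        clock = 0
        outstanding = []                 # (resolve cycle, seq_id), oldest first
        for seq_id, gap, l2_miss in trace:
            clock += gap
            # Block until the oldest in-flight item is within ROB_SIZE entries.
            while outstanding and seq_id - outstanding[0][1] >= ROB_SIZE:
                resolve, _ = outstanding.pop(0)
                clock = max(clock, resolve)
            if l2_miss:
                outstanding.append((clock + l2_miss_penalty, seq_id))
        return max([clock] + [t for t, _ in outstanding])

    # The slide's example: B (#10) and C (#40) overlap; D (#85) blocks on B.
    trace = [(2, 0, True), (10, 5, True), (40, 15, True),
             (85, 20, True), (90, 8, True)]
    print(indepcmm_runtime(trace, 100))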

Accuracy of Independent cache miss model
- [Figure: per-benchmark CPI error. About 80% of trace items have dependencies]

Weakness of independent cache miss model
- Can be too optimistic: dependencies between trace items have to be considered
- Our approach: a model that honors the dependencies between trace items
  - Model 3: Pairwise dependent cache miss model
- [Figure: a dependence chain among inst #1, inst #35, and inst #64]

Pairwise dependent cache miss model
- Based on the independent cache miss model
- Honors the dependencies between trace items
- Example (same trace as before: A #2, B #10, C #40, D #85, E #90, all L1 miss/L2 miss): trace item E has to wait until trace item D, which it depends on, is resolved, even though E fits in the 64-entry ROB
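Extending the independent-model sketch with one recorded producer per trace item gives a pairwise-dependence sketch; the dep_seq_id field is an assumed trace format, not necessarily the paper's.

    ROB_SIZE = 64

    def pdepcmm_runtime(trace, l2_miss_penalty):
        """trace: (seq_id, gap_cycles, l2_miss, dep_seq_id) in program order;
        dep_seq_id names the trace item this one depends on (None if none)."""
        clock = 0
        resolve_time = {}                # seq_id -> cycle its miss resolves
        outstanding = []                 # (resolve cycle, seq_id), oldest first
        for seq_id, gap, l2_miss, dep in trace:
            clock += gap
            while outstanding and seq_id - outstanding[0][1] >= ROB_SIZE:
                resolve, _ = outstanding.pop(0)
                clock = max(clock, resolve)
            if dep in resolve_time:
                clock = max(clock, resolve_time[dep])   # honor the dependency
            if l2_miss:
                resolve_time[seq_id] = clock + l2_miss_penalty
                outstanding.append((resolve_time[seq_id], seq_id))
        return max([clock] + [t for t, _ in outstanding])

    # E (#90) now waits for D (#85) even though E fits in the ROB.
    trace = [(2, 0, True, None), (10, 5, True, None), (40, 15, True, None),
             (85, 20, True, None), (90, 8, True, 85)]
    print(pdepcmm_runtime(trace, 100))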

Accuracy of Pairwise dependent cache miss model
- [Figure: per-benchmark CPI error; simulation speedup of 30.7x on average. Some benchmarks still show noticeable errors. Why?]

Dependency issues
- L1 dcache delayed hits are included in the trace file in addition to the L1 dcache misses
  - A delayed hit is a subsequent access to a cache block B while an earlier L1 dcache miss on B is still being serviced
  - With a perfect L2 cache these accesses behave like cache hits, but with long memory access latencies they stall like misses
- Large CPI errors due to dependencies can be reduced by considering all potential dependencies between trace items
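One plausible way to tag delayed hits at trace-generation time, sketched under assumed structures and latencies; the paper's trace generator may differ.

    def capture(accesses, miss_latency=100, block_size=64):
        """accesses: (cycle, addr) pairs that missed in the L1 dcache or
        re-referenced a block being fetched. An access to a block whose fill
        is still in flight is recorded as a delayed hit."""
        inflight = {}                    # block -> cycle its fill completes
        trace = []
        for cycle, addr in accesses:
            blk = addr // block_size
            if cycle < inflight.get(blk, -1):
                trace.append((cycle, addr, 'delayed_hit'))
            else:
                inflight[blk] = cycle + miss_latency
                trace.append((cycle, addr, 'miss'))
        return trace

    # Accesses at cycles 2 and 3 touch block 0 while its fill is outstanding.
    print(capture([(0, 0), (2, 8), (3, 16), (200, 0)]))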

- Can we apply the pairwise dependent cache miss model to a multicore system simulation?
- Two important questions need to be answered

Does our model preserve a processor's dynamic behavior?
- Can the model predict how shared resources (e.g., the L2 cache) are temporally used by multiple superscalar cores?
- All six benchmarks showed at least 93.5% similarity

Can the model accurately project the relative performance?
- Given the baseline configuration, can the pairwise dependent cache miss model accurately project the performance of different configurations relative to the baseline?
  - Conf. 1: larger L2 cache
  - Conf. 2: smaller L2 cache
  - Conf. 3: faster memory
  - Conf. 4: slower memory
  - Conf. 5: slower L2 cache

Conclusions
- Introduced three models that use filtered traces to predict superscalar processor performance
- The pairwise dependent cache miss model is the best of the three
  - Accurately predicts superscalar processor performance (4.8% RMS error)
  - Simulation speedup of 30.7x on average on the SPEC2000 benchmark suite in our current implementation
- Our model is amenable to multicore simulations targeting multiprogrammed workloads
  - Preserves processor cores' dynamic resource usage behavior
  - Predicts relative performance with precision

Questions?
Accurately Approximating Superscalar Processor Performance from Traces