Instruction Based Memory Distance Analysis and its Application to Optimization. Changpeng Fang, Steve Carr, Soner Önder, Zhenlin Wang

Presentation transcript:

1 Instruction Based Memory Distance Analysis and its Application to Optimization Changpeng Fang Steve Carr Soner Önder Zhenlin Wang

2 Motivation  Widening gap between processor and memory speed  memory wall  Static compiler analysis has limited capability  regular array references only  index arrays  integer code  Reuse distance prediction across program inputs  number of distinct memory locations accessed between two references to the same memory location  applicable to more than just regular scientific code  locality as a function of data size  predictable on whole program and per instruction basis for scientific codes
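The definition above (the number of distinct memory locations accessed between two references to the same location) can be computed directly from an address trace. A minimal O(n²) sketch in Python; the function name and element granularity are illustrative, not from the talk (production tools use tree-based counting):

```python
def reuse_distances(trace):
    """For each access, the number of distinct addresses touched since
    the previous access to the same address (None on first use)."""
    last_pos = {}          # address -> index of its previous access
    out = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            # distinct addresses seen strictly between the two accesses
            between = set(trace[last_pos[addr] + 1 : i])
            between.discard(addr)
            out.append(len(between))
        else:
            out.append(None)   # cold reference: infinite reuse distance
        last_pos[addr] = i
    return out

# trace a b c a b: 'a' reuses over {b, c}, 'b' reuses over {c, a}
print(reuse_distances(list("abcab")))  # [None, None, None, 2, 2]
```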

3 Motivation  Memory distance  a dynamic, quantifiable distance in terms of memory references between two accesses to the same memory location  reuse distance  access distance  value distance  Is memory distance predictable across both integer and floating-point codes?  predict miss rates  predict critical instructions  identify instructions for load speculation

4 Related Work  Reuse distance  Mattson, et al. ’70  Sugamar and Abraham ’94  Beyls and D’Hollander ’02  Ding and Zhong ’03  Zhong, Dropsho and Ding ’03  Shen, Zhong and Ding ’04  Fang, Carr, Önder and Wang ‘04  Marin and Mellor-Crummey ’04  Load speculation  Moshovos and Sohi ’98  Chyrsos and Emer ’98  Önder and Gupta ‘02

5 Background  Memory distance  can use any granularity (cache line, address, etc.)  either forward or backward  represented as a pattern  Represent memory distance as a pattern  divide consecutive ranges into intervals  we use powers of 2 up to 1K and then 1K intervals  Data size  the largest reuse distance for an input set  characterize reuse distance as a function of the data size  Given two sets of patterns for two runs, can we predict a third set of patterns given its data size?
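The binning scheme on this slide (power-of-two intervals up to 1K, then fixed 1K-wide intervals) can be sketched as a small mapping function; the function name is illustrative:

```python
def bin_index(d):
    """Map a memory distance d to its bin: power-of-two bins up to 1K
    (bins [0], [1], [2,3], [4,7], ..., [512,1023]), then fixed
    1K-wide bins, per the interval scheme described on the slide."""
    if d < 1024:
        return d.bit_length()          # 11 power-of-two bins: 0..10
    return 11 + (d - 1024) // 1024     # 1K-wide bins from bin 11 on
```

For example, distances 4 through 7 share one bin, while 1024 through 2047 share the first 1K-wide bin.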

6 Background  Let di(s1) be the memory distance of the i-th bin in the first pattern and di(s2) be that of the second pattern. Given the data sizes s1 and s2, we can fit the memory distances using di(s) = ci + ei · fi(s), where fi(s) is constant, linear, or square-root in the data size s.  Given ci, ei, and fi, we can predict the memory distance of another input set with its data size
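With two training points per bin, the two coefficients of di(s) = ci + ei · fi(s) are fixed by solving a pair of linear equations. A sketch under the assumption that fi is one of the constant, linear, or square-root forms named on the surrounding slides (function and parameter names are illustrative):

```python
import math

def fit_bin(s1, d1, s2, d2, form="linear"):
    """Fit d_i(s) = c_i + e_i * f_i(s) through the two training points
    (s1, d1) and (s2, d2); f_i is constant, linear, or sqrt in s."""
    f = {"constant": lambda s: 0.0,
         "linear":   lambda s: float(s),
         "sqrt":     math.sqrt}[form]
    if form == "constant":
        return d1, 0.0, f                  # prediction is always d1
    e = (d2 - d1) / (f(s2) - f(s1))        # slope in the fitted basis
    c = d1 - e * f(s1)                     # intercept
    return c, e, f

# predict a third run's memory distance from its data size
c, e, f = fit_bin(1000, 120, 4000, 420, "linear")
print(c + e * f(16000))   # roughly 1620 for these sample points
```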

7 Instruction Based Memory Distance Analysis  How can we represent the memory distance of an instruction?  For each active interval, we record 4 words of data: min, max, mean, frequency  Some locality patterns cross interval boundaries: merge adjacent intervals i and i + 1; the merging process stops when an interval with a minimum frequency is found  needed to make reuse distance predictable  The set of merged intervals makes up the memory distance patterns
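The slide does not give the exact merge test, so the sketch below assumes a simple one: adjacent (min, max, mean, frequency) intervals merge when their distance ranges touch, and merging stops at an interval whose frequency falls below a minimum, as the slide describes. All names and the min_freq value are illustrative:

```python
def merge_patterns(bins, min_freq=0.1):
    """Greedily merge adjacent (min, max, mean, freq) intervals into
    memory distance patterns. Assumed merge test: ranges touch and
    both frequencies are at least min_freq (the slide elides the
    precise condition)."""
    merged = [bins[0]]
    for lo, hi, mean, freq in bins[1:]:
        plo, phi, pmean, pfreq = merged[-1]
        if lo <= phi and freq >= min_freq and pfreq >= min_freq:
            nf = pfreq + freq
            merged[-1] = (plo, max(phi, hi),
                          (pmean * pfreq + mean * freq) / nf, nf)
        else:
            merged.append((lo, hi, mean, freq))
    return merged
```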

8 Merging Example

9 What do we do with patterns?  Verify that we can predict patterns given two training runs  coverage  accuracy  Predict miss rates for instructions  Predict loads that may be speculated

10 Prediction Coverage  Prediction coverage indicates the percentage of instructions whose memory distance can be predicted  appears in both training runs  access pattern appears in both runs and memory distance does not decrease with an increase in data size (spatial locality)  same number of intervals in both runs  called a regular pattern  For each instruction, we predict its i-th pattern by  curve fitting the i-th pattern of both training runs  applying the fitting function to construct a new min, max and mean for the third run  Simple, fast prediction

11 Prediction Accuracy  An instruction’s memory distance is correctly predicted if all of its patterns are predicted correctly  predicted and observed patterns fall in the same interval  or, given two patterns A and B such that B.min ≤ A.max ≤ B.max
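The agreement test above is a range-overlap check between the predicted and observed (min, max) patterns; a small sketch with illustrative names, applying the slide's condition symmetrically:

```python
def patterns_agree(pred, obs):
    """True when the predicted and observed (min, max) patterns
    overlap: obs.min <= pred.max <= obs.max, or the symmetric case."""
    (amin, amax), (bmin, bmax) = pred, obs
    return (bmin <= amax <= bmax) or (amin <= bmax <= amax)

# overlapping ranges agree; disjoint ranges do not
print(patterns_agree((0, 10), (8, 20)))   # True
print(patterns_agree((0, 5), (6, 9)))     # False
```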

12 Experimental Methodology  Use 11 CFP2000 and 11 CINT2000 benchmarks  others don’t compile correctly  Use ATOM to collect reuse distance statistics  Use test and train data sets for training runs  Evaluation based on dynamic weighting  Report reuse distance prediction accuracy  value and access very similar

13 Reuse Distance Prediction  Per-suite results for CFP2000 and CINT2000: patterns (% constant, % linear), prediction coverage (%), and prediction accuracy (%)

14 Coverage Issues  Reasons for no coverage 1. instruction does not appear in at least one training run 2. reuse distance of test is larger than train 3. number of patterns does not remain constant in both training runs

Suite       Reason 1   Reason 2   Reason 3
CFP2000     —          0.3%       2.5%
CINT2000    —          4.4%       1.8%

15 Prediction Details  Other patterns  183.equake has 13.6% square-root patterns  200.sixtrack, 186.crafty: all constant (no data size change)  Low coverage  189.lucas – 31% of static memory operations do not appear in the training runs  164.gzip – the test reuse distance is greater than the train reuse distance due to cache-line alignment

16 Number of Patterns

Suite       1     2       3      4      ≥5
CFP2000     —     10.5%   4.8%   1.4%   1.5%
CINT2000    —     10.9%   7.6%   4.6%   5.3%

17 Miss Rate Prediction  Predict a miss for a reference if the backward reuse distance is greater than the cache size.  neglects conflict misses  Accurate miss rate prediction
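For a fully associative LRU cache of C blocks, the slide's rule is exact: a reference misses if and only if its backward reuse distance, counted in distinct cache blocks, is at least C. A sketch (names illustrative; conflict misses are ignored, as the slide notes):

```python
def predict_misses(reuse_dists, cache_blocks):
    """Predict a miss iff the backward reuse distance reaches the
    number of cache blocks (exact for fully associative LRU).
    A cold reference (None, i.e. infinite distance) always misses."""
    return [d is None or d >= cache_blocks for d in reuse_dists]

# 8-block cache: cold miss, hit, capacity miss, hit
print(predict_misses([None, 3, 100, 7], 8))  # [True, False, True, False]
```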

18 Miss Rate Prediction Methodology  Three miss-rate prediction schemes  TCS – test cache simulation: use the actual miss rates from running the program on the test data as the reference data miss rates  RRD – reference reuse distance: use the actual reuse distance of the reference data set to predict the miss rate for the reference data set; an upper bound on using reuse distance  PRD – predicted reuse distance: use the predicted reuse distance for the reference data set to predict the miss rate

19 Cache Configurations

config no.   L1                  L2
1            32K, fully assoc.   1M, fully assoc.
2            32K, 2-way          1M, 8-way
3            32K, 2-way          1M, 4-way
4            32K, 2-way          1M, 2-way

20 L1 Miss Rate Prediction Accuracy  Per-suite accuracy for CFP2000 and CINT2000 under the PRD, RRD, and TCS schemes

21 L2 Miss Rate Accuracy

            2-way                 Fully Associative
Suite       PRD   RRD    TCS      PRD   RRD     TCS
CFP2000     91%   93%    87%      97%   99.9%   91%
CINT2000    91%   95%    87%      94%   99.9%   89%

22 Critical Instructions  Given the reuse distance for an instruction  Can we determine which instructions are critical in terms of cache performance?  An instruction is critical if it is in the set of instructions that generate the most L2 cache misses  those top miss-rate instructions whose cumulative total misses account for 95% of the misses in a program  Use the execution frequency of one training run to estimate the relative contribution (number of misses) of each instruction  Compare the actual critical instructions with the predicted ones  Use cache configuration 2
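The 95% cumulative cutoff above amounts to sorting instructions by estimated misses and taking the worst offenders until their total reaches 95% of all misses. A sketch with illustrative names and a toy miss profile:

```python
def critical_instructions(miss_counts, cutoff=0.95):
    """Return the instructions (keyed by PC) whose cumulative misses
    reach the cutoff fraction of all misses, worst offenders first."""
    total = sum(miss_counts.values())
    acc, critical = 0, []
    for pc, m in sorted(miss_counts.items(), key=lambda kv: -kv[1]):
        if acc >= cutoff * total:
            break                      # already covered 95% of misses
        critical.append(pc)
        acc += m
    return critical

# i1 and i2 together account for 95 of 100 misses
print(critical_instructions({"i1": 70, "i2": 25, "i3": 4, "i4": 1}))
```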

23 Critical Instruction Prediction

Suite       PRD   RRD   TCS   %pred   %act
CFP2000     92%   98%   51%   1.66%   1.67%
CINT2000    89%   98%   53%   0.94%   0.97%

24 Critical Instruction Patterns  Distribution of the number of patterns (1, 2, 3, 4, ≥5) for the critical instructions in CFP2000 and CINT2000

25 Miss Rate Discussion  PRD performs better than TCS when data size is a factor  TCS performs better when data size doesn’t change much and there are conflict misses  PRD is much better at identifying the critical instructions than TCS  these instructions should be targets of optimization

26 Memory Disambiguation  Load speculation  Can a load safely be issued prior to a preceding store?  Use memory distance to predict the likelihood that a store to the same address has not finished  Access distance  the number of memory operations between a store to and a load from the same address  correlated to instruction distance and window size  Use only two runs: if the access distance is not constant, use the access distance of the larger of the two data sets as a lower bound on the access distance

27 When to Speculate  Definitely “no”  access distance less than threshold  Definitely “yes”  access distance greater than threshold  Otherwise, the threshold lies among the intervals  compute the predicted mis-speculation frequency (PMSF); speculate if PMSF < 5%  when the threshold does not intersect an interval, PMSF is the total of the frequencies that lie below the threshold  otherwise, the intersected interval also contributes part of its frequency
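The PMSF computation can be sketched over an instruction's access-distance patterns. The proportional share for an interval the threshold cuts through is an assumption (the slide elides that case's formula), and all names are illustrative:

```python
def pmsf(patterns, threshold):
    """Predicted mis-speculation frequency: the fraction of the load's
    access-distance profile below the speculation threshold.
    patterns: list of (min, max, freq) with frequencies summing to 1.
    Assumption: an interval the threshold intersects contributes a
    proportional share of its frequency."""
    total = 0.0
    for lo, hi, freq in patterns:
        if hi < threshold:
            total += freq                                  # fully below
        elif lo < threshold <= hi:
            total += freq * (threshold - lo) / (hi - lo + 1)  # partial
    return total

def should_speculate(patterns, threshold):
    return pmsf(patterns, threshold) < 0.05   # the slide's 5% cutoff
```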

28 Value-based Prediction  Memory dependence only if addresses and values match: store a1, v1; store a2, v2; store a3, v3; load a4, v4  the load can move ahead if a1 = a2 = a3 = a4, v2 = v3 and v1 ≠ v2  The access distance of a load to the first store in a sequence of stores storing the same value is called the value distance

29 Experimental Design  SPEC CPU2000 programs  SPEC CFP swim, 172.mgrid, 173.applu, 177.mesa, 179.art, 183.equake, 188.ammp, 301.apsi  SPEC CINT gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 300.twolf  Compile with gcc –O3  Comparison  Access distance, value distance  Store set with 16KB table, also with values  Perfect disambiguation

30 Micro-architecture

issue width: 8; fetch width: 8; retire width: 16; window size: 128; load/store queue: 128; functional units: 8; fetch: multiblock; branch predictor: gshare; data cache: perfect; memory ports: 2

Operation        Latency
load             2
int division     8
int multiply     4
other int        1
float multiply   4
float addition   3
float division   8
other float      2

31 IPC and Mis-speculation  IPC for access-distance-based speculation vs. a store set with a 16KB table vs. perfect disambiguation, plus mis-speculation rate and % of speculated loads for the access-distance and store-set schemes (CFP2000 and CINT2000)

32 Value-based Disambiguation  IPC for value-distance-based speculation vs. a 16KB store set with values, plus mis-speculation rate and % of speculated loads (CFP2000 and CINT2000)

33 Cache Model  IPC with a realistic cache model: access distance vs. a 16K store set, and value distance vs. a 16K store set with values (CFP2000 and CINT2000)

34 Summary  Over 90% of memory operations can have their reuse distance predicted with 97% and 93% accuracy for floating-point and integer programs, respectively  We can accurately predict miss rates for floating-point and integer codes  We can identify 92% of the instructions that cause 95% of the L2 misses  Access- and value-distance-based memory disambiguation is competitive with the best hardware techniques without a hardware table

35 Future Work  Develop a prefetching mechanism that uses the identified critical loads.  Develop an MLP system that uses critical loads and access distance.  Path-sensitive memory distance analysis  Apply memory distance to working-set based cache optimizations  Apply access distance to EPIC style architectures for memory disambiguation.