On the Importance of Optimizing the Configuration of Stream Prefetchers
Ilya Ganusov, Martin Burtscher
Computer Systems Laboratory, Cornell University



June 12, 2005, MSP

Introduction
- Memory wall
  - Increasing gap between processor and memory speeds
  - Concentration on bandwidth at the expense of latency
- Prefetch important data
  - Do not wait until the processor requests data
  - Proactively fetch data that is likely to be consumed in the near future

Stream Prefetching
- Prefetching with outcome-based prediction
  - Use the history of previous misses to guess data addresses that are likely to miss soon
- Stream prefetching
  - A special case of outcome-based prediction
  - Proposed 15 years ago
  - The only hardware prefetching scheme used in modern microprocessors

Contributions
- Detailed sensitivity analysis of the main prefetcher parameters on SPECcpu2000 programs
  - No such study exists in the literature
  - Many research papers fail to specify prefetcher parameters in comparative studies
- Case study
  - Evaluate the performance of runahead execution on baselines with different stream prefetcher parameters

Outline
- Introduction
- Stream Prefetcher Operation
- Evaluation Methodology
- Experimental Results
- Conclusion

How Stream Prefetchers Work
[Diagram: a stream table whose entries hold a valid bit, a stream address, and a stride. Each miss address is checked against the global miss history to decide whether a stream exists; if so, the AGU computes the prefetch address as addr + stride * lookahead and issues the prefetch.]
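The table lookup and address generation in the diagram can be sketched in a few lines of Python. This is an illustrative toy model, not the authors' hardware: the training rule (allocating a stream once two misses in the global history exhibit the same stride) and all field names are assumptions made for the sketch.

```python
# Toy model of a stream prefetcher: global miss history + stream table + AGU.
class StreamPrefetcher:
    def __init__(self, num_streams=8, history_len=16, lookahead=4):
        self.num_streams = num_streams   # stream table entries
        self.history_len = history_len   # global miss history length
        self.lookahead = lookahead       # prefetch distance
        self.history = []                # recent miss addresses
        self.streams = []                # entries: {valid, addr, stride}

    def on_miss(self, addr):
        # 1. If the miss continues an existing stream, advance the stream
        #    and return the prefetch address: addr + stride * lookahead.
        for s in self.streams:
            if s["valid"] and addr == s["addr"] + s["stride"]:
                s["addr"] = addr
                return addr + s["stride"] * self.lookahead

        # 2. Otherwise, try to detect a constant stride in the miss history
        #    and allocate a new stream (evicting the oldest when full).
        for prev in self.history:
            stride = addr - prev
            if stride != 0 and (prev - stride) in self.history:
                if len(self.streams) >= self.num_streams:
                    self.streams.pop(0)
                self.streams.append({"valid": True, "addr": addr, "stride": stride})
                break

        # 3. Record the miss in the bounded global history.
        self.history.append(addr)
        if len(self.history) > self.history_len:
            self.history.pop(0)
        return None  # no prefetch issued
```

After three misses at a constant stride of 8 (100, 108, 116), a fourth miss at 124 matches the trained stream and triggers a prefetch 4 strides ahead, at 156.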

Measured Parameters
[Same diagram, annotated with the three parameters under study: the miss history length, the number of supported streams (stream table entries), and the prefetch distance (the lookahead in addr + stride * lookahead).]

Evaluation Methodology
- Benchmarks
  - 22 SPECcpu2000 programs, highly optimized
  - All F77, C, and C++ programs
  - Multiple reference inputs per program
  - SimPoint interval of 500 million instructions
- Simulated architecture
  - SimpleScalar v4.0 cycle-accurate simulator
  - Aggressive superscalar, Alpha-like core

Simulated System

Execution Core
- Fetch/issue/commit: 4/4/4
- I-window/ROB/LSQ: 64/128/64
- LdSt/Int/FP units: 2/4/2
- Execution latencies: similar to Alpha
- Branch predictor: 16K-entry bimodal/gshare hybrid

Memory Subsystem
- Cache sizes: 64KB IL1, 64KB DL1, 1MB L2
- Cache associativity: 2-way L1, 4-way L2
- Cache latencies: 2 cyc L1, 20 cyc L2
- Main memory latency: 400 cycles

Outline
- Introduction
- Motivation
- Implementation
- Experimental Results
- Conclusion

Miss History Length
- 7 programs are very sensitive
- A 16-entry history is enough

Number of Stream Table Entries
- Only 3 programs are sensitive
- More than 8 streams provides little benefit

L2 Cache Prefetch Distance
- 11 programs are very sensitive
- FP speedup varies from 80% to 140%
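A toy timeliness model illustrates why the prefetch distance is the most sensitive knob. Using the 400-cycle main memory latency from the simulated system, a prefetch hides the full miss latency only if it is issued far enough ahead of the demand access. The `cycles_per_access` value below is a made-up assumption for illustration, not a figure from the paper.

```python
# Toy model: a prefetch for access i+distance is issued on access i and takes
# mem_latency cycles to complete; the demand access arrives roughly
# distance * cycles_per_access cycles later. Too small a distance makes the
# prefetch late; too large a distance risks cache pollution before use.
def timely(distance, mem_latency=400, cycles_per_access=50):
    return distance * cycles_per_access >= mem_latency

for d in (2, 4, 8, 16):
    print(d, "timely" if timely(d) else "late")
```

Under these assumed numbers, distances below 8 leave part of the 400-cycle latency exposed, which is consistent with small distance changes producing large speedup swings.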

Case Study: Runahead Execution
- The performance of stream prefetching is highly dependent on parameter choice
- Another proposal: runahead execution
  - Pseudo-retire long-latency loads that stall the pipeline and continue executing
  - Roll back to a checkpoint after the load comes back from memory
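The runahead mechanism above can be sketched as a small trace-driven model. This is a deliberately simplified illustration, not the authors' simulator: it only captures the key effect that executing past a blocking miss lets later misses in the instruction window issue as prefetches before execution rolls back.

```python
# Toy model of runahead execution: on each blocking miss, checkpoint,
# "execute ahead" through the next `window` loads to discover future misses
# (turning them into prefetches), then roll back and resume normally.
def runahead_prefetches(trace, misses, window=8):
    """trace: list of load addresses; misses: set of addresses that miss.
    Returns the set of miss addresses discovered early while running ahead."""
    prefetched = set()
    for i, addr in enumerate(trace):
        if addr in misses:
            # Checkpoint here; speculatively execute up to `window` more loads.
            for future in trace[i + 1 : i + 1 + window]:
                if future in misses:
                    prefetched.add(future)  # miss found early -> prefetch it
            # Rollback: speculative state is discarded, execution resumes.
    return prefetched
```

For a trace [1, 2, 3, 4, 5] where addresses 1 and 4 miss, running ahead past the miss on 1 discovers the miss on 4 early, so its latency overlaps with the first miss instead of adding to it.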

Speedup over Stream Prefetching
- SPEC fp speedup drops by more than 2x

Conclusion
- Key observations
  - The performance of the stream prefetcher is highly dependent on its configuration
  - Varying the prefetch distance alone almost doubles the average performance benefit
  - Choosing a non-optimal stream prefetcher as a baseline can distort results by a factor of two
- Conclusion
  - Parameter optimizations are imperative when comparing stream prefetchers to other prefetching techniques
