Exploiting Load Latency Tolerance for Relaxing Cache Design Constraints
Ramu Pyreddy, Gary Tyson
Advanced Computer Architecture Laboratory, University of Michigan

Motivation
- Increasing memory vs. processor frequency gap
- Large data caches hide long memory latencies
- Larger caches mean longer access latencies [McFarland 98]
  - Processor cycle time determines cache size
  - Intel Pentium III: 16K DL1 cache, 3-cycle access
  - Intel Pentium 4: 8K DL1 cache, 2-cycle access
- Need large AND fast caches!

Related Work
- Load latency tolerance [Srinivasan & Lebeck, MICRO 98]
  - All loads are NOT equal
  - Determining criticality is very complex
  - Requires a sophisticated simulator with rollback
- Non-critical buffer [Fisk & Bahar, ICCD 99]
  - Criticality determined by performance degradation and dependency chains
  - Non-critical buffer serves as a victim cache for non-critical loads
  - Small performance improvements (up to 4%)

Related Work (contd.)
- Locality vs. criticality [Srinivasan et al., ISCA 01]
  - Criticality determined by practical heuristics
  - Potential for improvement: 40%
  - Locality is better than criticality
- Non-vital loads [Rakvic et al., HPCA 02]
  - Criticality determined by run-time heuristics
  - Small, fast vital cache for vital loads
  - 17% performance improvement

Load Latency Tolerance

Criticality
- Criticality: the effect of load latency on performance
- Two thresholds: performance and latency
- A very direct estimation of criticality
- Computation intensive!
- Static
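
Since the deck does not spell out the mechanism, here is a minimal sketch of how such a static, two-threshold criticality test could work, using the thresholds from the next slide. The run_simulation hook and all names are hypothetical stand-ins for a cycle-accurate simulator, not the authors' tooling:

```python
PERF_THRESHOLD = 0.996    # next slide: IPC threshold = 99.6% of baseline
LATENCY_THRESHOLD = 8     # next slide: latency threshold = 8 cycles

def classify_loads(load_pcs, run_simulation):
    """run_simulation(extra_latency) is an assumed hook into a
    cycle-accurate simulator that returns IPC; extra_latency maps a
    static load PC to added cycles. A load is critical if stretching
    it by the latency threshold drags IPC below the performance
    threshold."""
    baseline_ipc = run_simulation(extra_latency={})
    return {
        pc: run_simulation(extra_latency={pc: LATENCY_THRESHOLD})
            < PERF_THRESHOLD * baseline_ipc
        for pc in load_pcs
    }
```

The cost is one full simulation per candidate load, which is why the slide calls the approach computation intensive, and the result is a static, per-instruction classification.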

Determining Criticality: A Closer Look
IPC threshold = 99.6%, latency threshold = 8 cycles

Most Frequently Executed Loads

Criticality (contd.)

Benchmark (SPECINT 2000)    # of load insns accounting for 80% of load references
BZIP2                        130
CRAFTY                       905
EON                          550
GAP                          100
GCC                         4650 (!)
GZIP                          74
MCF                          115
PARSER                       305
TWOLF                        185
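
The counts above come from a reference profile. A short sketch of how such an 80% coverage set could be computed; the profile format (load PC to dynamic reference count) is assumed, not specified in the deck:

```python
from collections import Counter

def loads_covering(profile: Counter, fraction: float = 0.80):
    """Given a profile mapping static load PCs to dynamic reference
    counts, return the smallest set of hottest loads that together
    account for `fraction` of all load references (the metric in the
    table above)."""
    total = sum(profile.values())
    covered, hot = 0, []
    for pc, count in profile.most_common():   # descending by count
        hot.append(pc)
        covered += count
        if covered >= fraction * total:
            break
    return hot
```

The "!" marks GCC as the outlier: its hot-load set is an order of magnitude larger than the other benchmarks'.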

Critical Cache Configuration
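
The transcript preserves only this slide's title, so the organization below is an illustrative reading rather than the paper's exact design: loads classified as critical are steered to a small, fast critical cache (the 1K structure referenced in the results), while the remaining loads use the conventional DL1, and both miss into L2. Every class, size, and latency here is a placeholder:

```python
class SimpleCache:
    """Tiny fully associative LRU cache model, just enough to make
    the steering sketch executable."""
    def __init__(self, num_lines, hit_latency, line_size=64):
        self.num_lines, self.hit_latency = num_lines, hit_latency
        self.line_size = line_size
        self.tags = []                       # index 0 = least recently used
    def access(self, addr):
        tag = addr // self.line_size
        if tag in self.tags:
            self.tags.remove(tag)
            self.tags.append(tag)            # promote to MRU
            return True, self.hit_latency
        self.tags.append(tag)                # allocate on miss
        if len(self.tags) > self.num_lines:
            self.tags.pop(0)                 # evict LRU
        return False, self.hit_latency

class CriticalCacheHierarchy:
    """Loads whose PC was classified critical probe a small, fast
    critical cache; all other loads use the conventional DL1."""
    def __init__(self, critical_pcs):
        self.critical_pcs = critical_pcs
        self.critical = SimpleCache(num_lines=16, hit_latency=1)   # ~1K, fast
        self.dl1 = SimpleCache(num_lines=512, hit_latency=3)       # larger, slower
        self.l2 = SimpleCache(num_lines=16384, hit_latency=10)
    def load(self, pc, addr):
        first = self.critical if pc in self.critical_pcs else self.dl1
        hit, latency = first.access(addr)
        if not hit:
            _, l2_lat = self.l2.access(addr)
            latency += l2_lat
        return latency
```

Because the classification is static, steering by the load's PC keeps the decision off the critical path; it can be made as early as decode.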

Effectiveness?
- Load reference distribution
  - What percentage of loads are identified as critical?
  - Miss rate for critical load references
- Critical cache configuration compared with a faster conventional cache configuration
  - DL1/DL2 latencies: 3/10, 6/20, 9/30 cycles
- Critical cache configuration compared with a larger conventional cache configuration
  - DL1 sizes: 8KB, 16KB, 32KB, 64KB

Processor Configuration
Similar to Alpha, modeled with SimpleScalar 3.0 [Austin, Burger 97]

Fetch width              8 instructions per cycle
Fetch queue size         64
Branch predictor         2-level, 4K-entry level 2
Branch target buffer     2K entries, 8-way associative
Issue width              4 instructions per cycle
Decode width             4 instructions per cycle
RUU size                 128
Load/store queue size    32
Instruction cache        64KB, 2-way, 64-byte lines
L2 cache                 1MB, 2-way, 128-byte lines
Memory latency           64 cycles
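
For reproduction, the table maps onto stock sim-outorder flags roughly as follows. Flag spellings follow the SimpleScalar 3.0 distribution, but the exact mapping (notably the fetch-width and branch-history settings) is an assumption, not the authors' script:

```python
import subprocess

# Approximate sim-outorder invocation for the table above; the marked
# values are assumed where the slide leaves them unspecified.
args = [
    "sim-outorder",
    "-fetch:ifqsize", "64",                   # fetch queue size
    "-fetch:speed", "2",                      # 2x front end -> ~8 fetched/cycle (assumed)
    "-bpred", "2lev",
    "-bpred:2lev", "1", "4096", "12", "0",    # 4K-entry level-2 table (history len assumed)
    "-bpred:btb", "256", "8",                 # 2K entries = 256 sets x 8 ways
    "-decode:width", "4",
    "-issue:width", "4",
    "-ruu:size", "128",
    "-lsq:size", "32",
    "-cache:il1", "il1:512:64:2:l",           # 64KB, 2-way, 64-byte lines
    "-cache:dl2", "ul2:4096:128:2:l",         # unified 1MB, 2-way, 128-byte lines
    "-cache:il2", "dl2",                      # IL2 shares the unified L2
    "-mem:lat", "64", "2",                    # 64-cycle first chunk (inter-chunk assumed)
    "benchmark_binary",                       # placeholder workload
]
subprocess.run(args)
```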

Results
Per-benchmark table for BZIP2, CRAFTY, EON, GAP, GZIP, MCF, PARSER, and TWOLF: number of critical load instructions, critical load references as a percentage of all load references, and the miss rate of critical loads in a 1K critical cache.

Results: Comparison with a Faster Conventional Cache Configuration
IPCs normalized to the 16K, 1-cycle configuration
25-66% of the penalty due to a slower cache is eliminated
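
A plausible formalization of the "penalty eliminated" figure, with made-up IPC values for illustration (the real per-benchmark IPCs are in the chart, which the transcript does not preserve):

```python
def penalty_eliminated(ipc_fast, ipc_slow, ipc_critical):
    """Fraction of the slowdown caused by the slower cache that the
    critical cache configuration recovers, relative to the fast
    baseline."""
    return (ipc_critical - ipc_slow) / (ipc_fast - ipc_slow)

# hypothetical numbers for illustration only
print(penalty_eliminated(ipc_fast=1.00, ipc_slow=0.88, ipc_critical=0.95))  # ~0.58
```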

Results: Comparison with a Faster Conventional Cache Configuration
IPCs normalized to the 32K, 1-cycle configuration
25-70% of the penalty due to a slower cache is eliminated

Results: Comparison with a Larger Conventional Cache Configuration
IPCs normalized to the 16K, 3-cycle configuration

Results: Comparison with a Larger Conventional Cache Configuration
IPCs normalized to the 32K, 6-cycle configuration
The critical cache configuration outperforms a larger conventional cache

Conclusions & Future Work
Conclusions
- Compares well with a faster conventional cache
- Outperforms a larger conventional cache in most cases
Future Work
- More heuristics to refine "criticality"
- Why are "critical loads" critical?
- Criticality of a memory address vs. criticality of a load instruction
- Criticality for low-power caches