LA-LRU: A Latency-Aware Replacement Policy for Variation Tolerant Caches
Aarul Jain, Cambridge Silicon Radio, Phoenix
Aviral Shrivastava, Arizona State University, Tempe
Chaitali Chakrabarti, Arizona State University, Tempe

Introduction: Variations
 Process Variation: due to loss of control in the manufacturing process; variation in channel length, oxide thickness, and doping concentration.
 – Systematic Variation: wafer-to-wafer variation.
 – Random Variation: within-die variation.
 Voltage Variation: the voltage within a chip varies; IR drop ~3%, LDO tolerance ~5%.
 Temperature Variation.

Introduction: Variations
 To design reliable circuits, the worst-case corner is used as the sign-off criterion; it is a compromise between yield and performance.
 In sub-100nm technologies, delay variation significantly impacts the maximum operable frequency.

Cell Type          FF        TYP       SS
SEN_INV_1          6.11ps    7.97ps    10.95ps
8192x16 memory     960ps     1330ps    2110ps

Introduction: Techniques to Compensate for Variation
 System-level techniques can compensate for variation at the expense of performance. [Roy, VLSI Design '09]
 – CRISTA: Critical Path Isolation for Timing Adaptiveness.
 – Variation tolerance by trading off quality of results.
 Memories suffer from more variation than logic due to their small transistor sizes, and caches often determine the maximum operable frequency of a system.
 Our paper presents system-level techniques to improve the performance of a variation-tolerant cache by reducing the use of cache blocks with large delay variation:
 – LA-LRU: Latency-Aware LRU policy.
 – Block Rearrangement.

Agenda
 Adaptive Cache Architecture [Ben Naser, TVLSI '08]
 LA-LRU
 Block Rearrangement
 Simulation Results
 Summary

Adaptive Cache Architecture (1/2)
 [Ben Naser, TVLSI '08]: Monte Carlo simulations using 32nm PTM models showed that accesses to 25% of cache blocks require two cycles, and occasionally even three, to accommodate delay variability.
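To make the 25% figure concrete, here is a minimal Monte Carlo sketch in Python; the normal delay model, mean, sigma, and cycle-time threshold are invented placeholders, not the 32nm PTM parameters used in the cited work.

import random

def classify_blocks(n_blocks, mu=1.0, sigma=0.12, cycle_time=1.08):
    # Sample each block's critical-path delay (normalized to the nominal
    # delay) and count how many cycles the adaptive cache must budget,
    # capped at three as in the adaptive architecture.
    latencies = []
    for _ in range(n_blocks):
        delay = random.gauss(mu, sigma)
        cycles = 1
        while delay > cycles * cycle_time and cycles < 3:
            cycles += 1
        latencies.append(cycles)
    return latencies

lat = classify_blocks(4096)
print({c: round(100 * lat.count(c) / len(lat), 1) for c in (1, 2, 3)})

With these placeholder numbers, roughly a quarter of the blocks land in the two-cycle bin, which matches the flavor of the quoted result.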

Adaptive Cache Architecture (2/2)

LA-LRU (1/6)
 Adaptive cache architecture with LRU policy.
 The delay storage provides the latency of each cache block, which can be used to modify the replacement policy to increase single-cycle accesses.

Type of Access                  % of Cache Accesses
Hit with one-cycle access       —
Hit with two-cycle access       —
Hit with three-cycle access     2.224
Miss                            0.760

LA-LRU (2/6)
 Conventional Least Recently Used (LRU) replacement policy.
 MRU data can end up in high-latency ways (modeled in the sketch below).

LRU mechanism in a conventional cache (000 -> MRU data, 111 -> LRU data)
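As a reference point, the following Python sketch models one set under conventional LRU with a fixed per-way access latency, as in the adaptive architecture; the class and variable names are illustrative, not taken from the paper.

class LRUSet:
    # One cache set: ways[i] holds a tag (or None), way_latency[i] is the
    # fixed access latency of physical way i, and recency lists the way
    # indices from MRU to LRU.
    def __init__(self, way_latency):
        self.way_latency = list(way_latency)
        self.ways = [None] * len(way_latency)
        self.recency = list(range(len(way_latency)))

    def access(self, tag):
        for i, t in enumerate(self.ways):
            if t == tag:                    # hit: pay this way's latency
                self.recency.remove(i)
                self.recency.insert(0, i)   # promote to MRU
                return self.way_latency[i]
        victim = self.recency.pop()         # miss: evict the LRU way
        self.ways[victim] = tag
        self.recency.insert(0, victim)
        return None                         # caller applies the miss penalty

Nothing here ties recency to latency: a hot block filled into a three-cycle way keeps paying three cycles on every subsequent hit.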

LA-LRU (3/6)
 Latency-Aware Least Recently Used (LA-LRU) replacement policy.
 The LRU data is always kept in the high-latency ways (see the sketch below).

LA-LRU mechanism (000 -> MRU data, 111 -> LRU data)
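A corresponding sketch of the latency-aware policy, reusing the LRUSet model above. After every access, blocks are re-paired with ways so that recency rank matches latency rank: MRU data migrates into the fastest ways and LRU data into the slowest. The wholesale re-sort is a behavioral simplification; the hardware performs incremental exchanges, which is where the 2-4 and 3-6 cycle hit ranges on a later slide come from.

class LALRUSet(LRUSet):
    def access(self, tag):
        cycles = super().access(tag)  # pay the latency of the current way
        # Re-pair recency order with latency order: the k-th most recently
        # used block is moved into the k-th fastest way.
        fast_first = sorted(range(len(self.ways)),
                            key=lambda i: self.way_latency[i])
        blocks = [self.ways[i] for i in self.recency]  # tags, MRU..LRU
        for rank, way in enumerate(fast_first):
            self.ways[way] = blocks[rank]
        self.recency = fast_first
        return cycles

s = LALRUSet([1, 1, 2, 3])       # two fast ways, one 2-cycle, one 3-cycle
for t in ("A", "B", "A"):
    s.access(t)
assert s.way_latency[s.ways.index("A")] == 1   # MRU block sits in a fast way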

LA-LRU (4/6)
 Cache access distribution: a significant increase in one-cycle accesses.

Type of Access                  % of Cache Accesses
                                LRU        LA-LRU
Hit with one-cycle access       —          —
Hit with two-cycle access       —          —
Hit with three-cycle access     2.224      —
Miss                            0.760      0.760

LA-LRU (5/6)
 Architecture block diagram: 64KB/8/32 cache.

LA-LRU (6/6)
 Latency overhead with LA-LRU (summarized by the helper below):
 – Hit in ways with latency 1 = 1 cycle.
 – Hit in ways with latency 2 = 2-4 cycles.
 – Hit in ways with latency 3 = 3-6 cycles.
 – Miss in cache = cache miss penalty.
 We assume the tag array is unaffected by process variation and that exchanges within it complete in one cycle.
 Synthesis using Synopsys shows that the power overhead is only 3.5% of the total LRU logic.
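These ranges collapse to a simple rule: a hit costs the raw way latency L, plus up to L more cycles when LA-LRU must also exchange the block toward a faster way. A tiny, purely illustrative helper:

def hit_cost_range(way_latency):
    # Per this slide: latency-1 hits are always 1 cycle; a latency-L hit
    # (L > 1) costs L cycles alone and up to 2*L cycles with an exchange.
    worst = way_latency if way_latency == 1 else 2 * way_latency
    return (way_latency, worst)

for L in (1, 2, 3):
    print(L, hit_cost_range(L))   # (1, 1), (2, 4), (3, 6)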

Block Rearrangement (1/1)
 High-latency blocks are randomly distributed across sets.
 Modify the address decoder so that high-latency ways are distributed uniformly among sets (a software sketch follows below).
 Overhead of Perfect BRT: log2(number of sets) mux stages in the decoder.

Figure panels: no block rearrangement; perfect block rearrangement; paired block rearrangement.
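One way to picture block rearrangement is as a logical-to-physical row remap computed once from the per-block delay map. The greedy spread below is my illustration of "uniform distribution", not the paper's exact mechanism; in hardware the chosen permutation is realized with the log2(number of sets) mux stages quoted above.

def perfect_brt(latency):
    # latency[s][w] in {1, 2, 3} for physical set s, way w. For each way
    # column, hand the slowest physical rows to the logical sets that have
    # accumulated the fewest slow blocks so far, spreading high-latency
    # ways roughly uniformly across logical sets.
    nsets, nways = len(latency), len(latency[0])
    slow_count = [0] * nsets
    remap = []                       # remap[w][logical_set] = physical_set
    for w in range(nways):
        rows = sorted(range(nsets), key=lambda s: -latency[s][w])
        targets = sorted(range(nsets), key=lambda s: slow_count[s])
        col = [0] * nsets
        for phys, logi in zip(rows, targets):
            col[logi] = phys
            if latency[phys][w] > 1:
                slow_count[logi] += 1
        remap.append(col)
    return remap

Paired BRT restricts the same idea to exchanges between two adjacent sets, trading some uniformity for a much cheaper decoder change.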

Simulation Results (1/3)
 Comparing the following architectures:
 – NPV: no process variation.
 – WORST: every access takes three cycles.
 – ADAPT: cache access time depends on the latency of the block.
 – LA-LRU: the proposed replacement policy.
 – LA-LRU with PairedBRT: LA-LRU with block rearrangement within two adjacent sets.
 – LA-LRU with PerfectBRT: LA-LRU with block rearrangement amongst all sets.

Simulation Results (2/3)
 Simulation environment:
 – Modified Wattch/SimpleScalar to measure performance for Xscale-, PowerPC-, and Alpha21265-like processor configurations using SPEC2000 benchmarks.
 – Generate random latency distributions for the following two variation models (sketched below):
   a) 15% two-cycle and 0% three-cycle latency blocks.
   b) 25% two-cycle and 1% three-cycle latency blocks.
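A sketch of how such a latency map can be drawn; the two models' fractions come straight from this slide, while the array shape and seeding are illustrative.

import random

def make_latency_map(nsets, nways, p2, p3, seed=0):
    # Each block is independently three-cycle with probability p3,
    # two-cycle with probability p2, and single-cycle otherwise.
    rng = random.Random(seed)
    def block():
        r = rng.random()
        return 3 if r < p3 else (2 if r < p3 + p2 else 1)
    return [[block() for _ in range(nways)] for _ in range(nsets)]

model_a = make_latency_map(256, 8, p2=0.15, p3=0.00)   # model (a)
model_b = make_latency_map(256, 8, p2=0.25, p3=0.01)   # model (b)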

Simulation Results (3/3)
 Average degradation in memory access latency (% degradation, variation models (a) and (b)):
 LA-LRU is sufficient for Xscale and PowerPC.
 LA-LRU + PerfectBRT is better than LA-LRU for Alpha21265.

Cache Architecture     Xscale (32K/32/32)   PowerPC (32K/8/32)   Alpha21265 (64K/2/64)
                       (a)      (b)         (a)      (b)         (a)      (b)
NPV                    0.00     0.00        0.00     0.00        0.00     0.00
WORST                  —        —           —        —           —        —
ADAPT                  —        —           —        —           —        —
LA-LRU                 —        —           —        —           —        —
LA-LRU+PairedBRT       —        —           —        —           —        —
LA-LRU+PerfectBRT      —        —           —        —           —        —

Summary (1/1)
 LA-LRU, combined with adaptive techniques that vary cache access latency, significantly improves the performance of a cache affected by variation.
 LA-LRU reduces the variation-induced degradation in memory access latency to almost zero for almost any cache configuration.
 For low-associativity caches, block rearrangement may be combined with LA-LRU to further reduce any remaining performance degradation.
 The power overhead of implementing the LA-LRU scheme is negligible: the LA-LRU logic is exercised less than 1% of the time and consumes only 3.5% of the power of the LRU logic.
 Similar results were observed for other replacement policies, such as FIFO.