University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.

Slides:

Advertisements

Similar presentations

Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. YuGuy G.F. Lemieux September 15, 2005.

Advertisements

Computer Organization and Architecture

Computer Organization and Architecture

1 Lecture 13: Cache and Virtual Memroy Review Cache optimization approaches, cache miss classification, Adapted from UCB CS252 S01.

LEVERAGING ACCESS LOCALITY FOR THE EFFICIENT USE OF MULTIBIT ERROR-CORRECTING CODES IN L2 CACHE By Hongbin Sun, Nanning Zheng, and Tong Zhang Joseph Schneider.

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.

+ CS 325: CS Hardware and Software Organization and Architecture Internal Memory.

5-1 Memory System. Logical Memory Map. Each location size is one byte (Byte Addressable) Logical Memory Map. Each location size is one byte (Byte Addressable)

Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.

Non-Uniform Cache Architecture Prof. Hsien-Hsin S

CSC 4250 Computer Architectures December 8, 2006 Chapter 5. Memory Hierarchy.

Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. Yu August 15, 2005.

Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. Yu August 15, 2005.

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science August 20, 2009 Enabling.

Lecture 12: DRAM Basics Today: DRAM terminology and basics, energy innovations.

A Probabilistic Approach to Nano- computing J. Chen, J. Mundy, Y. Bai, S.-M. C. Chan, P. Petrica and R. I. Bahar Division of Engineering Brown University.

Memory Organization.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 26: April 18, 2007 Et Cetera…

1 Chapter 8 Virtual Memory Virtual memory is a storage allocation scheme in which secondary memory can be addressed as though it were part of main memory.

FPGA Defect Tolerance: Impact of Granularity Anthony YuGuy Lemieux December 14, 2005.

©UCB CS 162 Ch 7: Virtual Memory LECTURE 13 Instructor: L.N. Bhuyan

1 Lecture 14: Virtual Memory Today: DRAM and Virtual memory basics (Sections )

Jieyi Long and Seda Ogrenci Memik Dept. of EECS, Northwestern Univ. Jieyi Long and Seda Ogrenci Memik Dept. of EECS, Northwestern Univ. Automated Design.

COMPUTER ARCHITECTURE & OPERATIONS I Instructor: Hao Ji.

University of Michigan Electrical Engineering and Computer Science 1 StageNet: A Reconfigurable CMP Fabric for Resilient Systems Shantanu Gupta Shuguang.

University of Utah 1 The Effect of Interconnect Design on the Performance of Large L2 Caches Naveen Muralimanohar Rajeev Balasubramonian.

Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation Kypros Constantinides University of Michigan Onur.

TLC: Transmission Line Caches Brad Beckmann David Wood Multifacet Project University of Wisconsin-Madison 12/3/03.

1 University of Utah & HP Labs 1 Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 Naveen Muralimanohar Rajeev Balasubramonian.

Roza Ghamari Bogazici University.  Current trends in transistor size, voltage, and clock frequency, future microprocessors will become increasingly susceptible.

Reconfigurable Caches and their Application to Media Processing Parthasarathy (Partha) Ranganathan Dept. of Electrical and Computer Engineering Rice University.

1 Lecture: Virtual Memory, DRAM Main Memory Topics: virtual memory, TLB/cache access, DRAM intro (Sections 2.2)

1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah

Post-Manufacturing ECC Customization Based on Orthogonal Latin Square Codes and Its Application to Ultra-Low Power Caches Rudrajit Datta and Nur A. Touba.

Frank Casilio Computer Engineering May 15, 1997 Multithreaded Processors.

Z. Feng MTU EE4800 CMOS Digital IC Design & Analysis 12.1 EE4800 CMOS Digital IC Design & Analysis Lecture 12 SRAM Zhuo Feng.

CPEN Digital System Design

A Robust Pulse-triggered Flip-Flop and Enhanced Scan Cell Design

Self-* Systems CSE 598B Paper title: Dynamic ECC tuning for caches Presented by: Niranjan Soundararajan.

Reconfigurable Computing Using Content Addressable Memory (CAM) for Improved Performance and Resource Usage Group Members: Anderson Raid Marie Beltrao.

RDIS: A Recursively Defined Invertible Set Scheme to Tolerate Multiple Stuck-At Faults in Resistive Memory Rami Melhem, Rakan Maddah and Sangyeun cho Computer.

Garo Bournoutian and Alex Orailoglu Proceedings of the 45th ACM/IEEE Design Automation Conference (DAC’08) June /10/28.

VIRTUAL MEMORY By Thi Nguyen. Motivation  In early time, the main memory was not large enough to store and execute complex program as higher level languages.

LA-LRU: A Latency-Aware Replacement Policy for Variation Tolerant Caches Aarul Jain, Cambridge Silicon Radio, Phoenix Aviral Shrivastava, Arizona State.

Implicit-Storing and Redundant- Encoding-of-Attribute Information in Error-Correction-Codes Yiannakis Sazeides 1, Emre Ozer 2, Danny Kershaw 3, Panagiota.

CPS3340 COMPUTER ARCHITECTURE Fall Semester, /3/2013 Lecture 9: Memory Unit Instructor: Ashraf Yaseen DEPARTMENT OF MATH & COMPUTER SCIENCE CENTRAL.

Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.

ASPLOS’02 Presented by Kim, Sun-Hee.  Technology trends ◦ The rate of frequency scaling is slowing down  Performance must come from exploiting concurrency.

Copyright © 2010 Houman Homayoun Houman Homayoun National Science Foundation Computing Innovation Fellow Department of Computer Science University of California.

In-Place Decomposition for Robustness in FPGA Ju-Yueh Lee, Zhe Feng, and Lei He Electrical Engineering Dept., UCLA Presented by Ju-Yueh Lee Address comments.

University of Michigan Electrical Engineering and Computer Science Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators.

Design For Manufacturability in Nanometer Era

07/11/2005 Register File Design and Memory Design Presentation E CSE : Introduction to Computer Architecture Slides by Gojko Babić.

Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.

Computer Architecture Chapter (5): Internal Memory

University of Michigan Electrical Engineering and Computer Science Dynamic Voltage/Frequency Scaling in Loop Accelerators using BLADES Ganesh Dasika 1,

Advanced Caches Smruti R. Sarangi.

CS161 – Design and Architecture of Computer

COSC3330 Computer Architecture

Alireza Shafaei, Shuang Chen, Yanzhi Wang, and Massoud Pedram

Computer Architecture & Operations I

William Stallings Computer Organization and Architecture 7th Edition

BIC 10503: COMPUTER ARCHITECTURE

Lecture 24: Memory, VM, Multiproc

Restrictive Compression Techniques to Increase Level 1 Cache Capacity

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Architectures in High Defect Density Technologies Micro-42 December 14, 2009 Z ereh C ache: Armoring Cache Amin Ansari, Shantanu Gupta, Shuguang Feng, and Scott Mahlke University of Michigan, Ann Arbor

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Process Variation 2  Process variation o Intra-die and inter-die variations  Intra-die variation became significant from 0.13um o Systematic and random variations  Systematic: equipment, process (sub-wave length lithography)  Random: inherent variation (random dopant fluctuation) o Causes delay variations by affecting transistors and metal wires too slow too leaky [Borkar’03]  Robustness of silicon chips affected by o Process/manufacturing variation o Operating condition variation (temperature) [Austin’08]

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science On-Chip Cache Reliability  More than 70% of the chip area in the modern processors can be devoted to the on-chip SRAM arrays  The yield of an unprotected cache in 45 nm can be as low as 30% [Agarwal’05] 3  Lower granularity of redundancy required to provide the robustness guarantee → complexity of the spare substitution A 2-bank 2MB L2 cache with 128B block size and 256b word

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Our Goals  High failure rate tolerant cache protection scheme o A unified architectural approach for both L1 and L2  Providing high degree of freedom in spare substitution o Minimizing the amount of redundancy required while keeping the access delay and power overheads as low as possible  Using our scheme to mitigate the yield loss, caused by process variation induced failures, in high- performance microprocessors 4

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Functional Spare Lines Cache Lines collision 1 collision 2 Motivation  We want to use redundancy in multiple ways o Minimizing the amount of required redundancy  Lines with the same color form a “logical group” o Several word-lines along with their corresponding spare line  Changing group of the lines can solve the collisions  Line swapping allows a line to change its group 5 Not functional

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science ZerehCache Architecture Sense Amp & Drivers Data Column Decoder Row Decoder BIST block Interleaved Spare Cache (2) (3) Interconnection Network Hit/Miss Mux Level == = (1) Data Array Offset Network Configuration Storage Tag (1)Set (2)Word (3) Address Format … (2) Fault Map Comparator Tag Array 6 Data array divided to equal sized groups MUXing level: Does the selection between the main cache and spare cache data chunks Interconnection network: Between decoder and main cache Benes network is leveraged Shuffling between the lines to resolve the collisions Depth of network determines the degree of freedom in line swapping (access delay) Group 1 Group 2 Spare cache: Each line in spare cache corresponds to a logical group of lines in the main cache Each line is broken up to smaller redundancy units with fixed size Comparison stage: Determines whether the unit of redundancy should replace the main cache data chunk Network configuration storage: Contains the proper configuration of the interconnection network (non-volatile memory) Fault map: Same number of lines as spare cache For each redundancy unit, it stores the row number in the corresponding logical group which utilized that redundancy

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Benes Network (BN) Group 1 Data Array Row Decoder Group 2 Group 3 Group 4 Row Decoder Interconnection Network Data Array Butterfly Interconnection Network 7 Interconnection net consists of multiple unidirectional local BNs with small depth Shuffling the cache lines and forming the logical groups Two back-to-back butterfly nets Non-blocking, full permutation among the lines in the same “swapping set” Logarithmic diameter (scalable) 2n-1 stages for connecting 2 n lines Efficient implementation [Shi’03]

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Spare Lines Cache Lines collision 1 collision 2 Configuration Necessity 8  Configuration: Assignment of the cache word-lines to the spare word-lines [essential for having a functional cache] Not functional

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science ZC Configuration  Two algorithmic problems: o Effective group formation o Benes network configuration problem  Effective group formation [Graph Coloring] o The base graph is constructed from the collision pattern in the main/spare caches o A heavily optimized solver (IBSC) [manufacturing test time]  Benes network configuration o It maps a valid coloring assignment to the actual content of the network configuration storage [Recursive Subnet Mapping] 9

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science ab dc Local BN 2 Construction of Collision Graph ab dc  An edge exists between two nodes, if the corresponding lines cannot be in the same logical group o Intrinsic edges: bold lines, limitations of network o Fault edges: dotted lines, collision pattern Lines with the same color in the main/spare cache form a single logical group Each logical group contains a single spare line

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science a b c d X X X X Spare Cache Main Cache a b c d Spare Cache Main Cache Local BN 2 Local BN 1 Usage of Local Benes Networks 11 Virtual cache layout Actual cache layout Between the even lines Between the odd lines

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Design Space Exploration  A large set of values for the main design parameters: o Number of cache spare lines (from {2 i | i {0, 1, …, 8}}) o Depth of BNs (from {1, 3, 5, …, 19}) o MUXing granularity (from {2i | i {0, 1, …, 10}}) o 1980 initial design points for L1 and L2 ZCs  Practical constraints on design space: o Graph-coloring solver time o Cache access latency o Probability of operation o Area and power overhead 12

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Design Space Exploration (cont) 13  Graph-coloring solver time o Manufacturing test time is expensive (10s from several mins)  Cache access latency o An additional cycle for caches assuming no slack  P op : Probability that a specific ZC can properly operate for a given P F and also architecture parameters o Calculated using probabilistic analysis of random graphs o MUXing granularity ↓ results in P op ↑ o Number of redundant lines ↑ results in P op ↑ o Depth of BNs ↑ results in P op ↑

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Area Overhead  CACTI along with PowerCompiler and DesignCompiler are used to evaluate overheads of our extra structures 14 L1/2 - # of spare lines - MUXing granularity - BN depth

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Dynamic Energy Overhead  CACTI along with PowerCompiler and DesignCompiler are used to evaluate overheads of our extra structures 15 L1/2 - # of spare lines - MUXing granularity - BN depth

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science  Comparing ZC with: row-redundancy, 1-bit, and 2-bit error correction coding schemes.  For each protection mechanism and a given P F, area overhead increased from 0% until P op > 90%. Protecting L2 is much harder than L1 because of larger size and longer lines ECC-2 is costly even for the low failure rate situations ( 50% energy consumption overhead ) Comparison with Conventional Methods 16

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Yield Analysis  Inter-die variation, intra-die variation, and clustering effect (using large area clustering negative binomial model [Koren’00])  VARIUS used to model the inter-die variation and module-level intra-die variation  Population of 1000 chips generated based on the different process variation characteristics for 45nm 17 An L2 bank Yield of a cache without any protection mechanism can be as low as 33% Around 30% is reported in [Agarwal’05] For our selected L1 and L2 ZCs, 99% and 96% manufacturing yield can be achieved, respectively. Working enabled by ZC

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Comparison with Recently Proposed Methods 18 Comparing with: Wilkerson et. al. [WD/BF] and 8T SRAM bit-cells

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Conclusion  ZC minimizes the required redundancy by: o Static row multiplexing o Dynamic line swapping o Effective group formation  Efficient design space exploration methodology for finding the optimal ZCs based on the design objectives  Outperforming the conventional and recently proposed cache protection mechanisms o L1 ZC: 16% area and15% power overheads o L2 ZC: 8% area and 12% power overheads 19

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Thank You 20