August 8th, 2011 Kevan Thompson Creating a Scalable Coherent L2 Cache


Outline 2: Motivation; Cache Background; System Overview; Methodology; Progress; Future Work

Motivation 3 Goal: Create a configurable shared Last-Level Cache for use in the PolyBlaze system

Introduction 4: Zia, Eric, Kevan

Cache Background 5 In modern systems, processors outperform main memory, creating a bottleneck. This problem is only exacerbated as more cores contend for the memory, and it is reduced if each processor maintains a local copy of the data.

Caches 6 A cache is a small amount of memory on the same die as the processor. The cache is capable of providing lower latency and higher throughput than the main memory. Systems may include multiple cache levels: the smallest and most local cache is the L1 cache, the next level is the L2, etc.

Shared Last-Level Cache 7 Acts as a common location for data and can be used to maintain cache coherency between processors. It does not exist in the current MicroBlaze system, so we will design our own shared L2 cache to maintain cache coherency.

Cache Speeds 8 In typical systems: an L1 cache is very fast (1 or 2 cycles); an L2 cache is slower (tens of cycles); main memory is very slow (hundreds of cycles).

Cache Speeds 9 In our system we expect: the L1 cache to be very fast (1 or 2 cycles); the L2 cache to be slower (on the order of 10 cycles); main memory to be comparatively fast (tens of cycles). In order to model the memory bottleneck of a much faster system, we will need to stall the main memory.
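The effect of these per-level latencies can be combined into an average memory access time (AMAT). A minimal sketch, with assumed latency and miss-rate figures that are illustrative only (not measurements of PolyBlaze):

```python
# Average Memory Access Time (AMAT) sketch with assumed figures:
# AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * mem_time)

def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    """All times in cycles; miss rates are fractions in [0, 1]."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_time)

# Assumed: 2-cycle L1, 10% L1 miss rate, 10-cycle L2,
# 20% L2 miss rate, 200-cycle main memory.
print(amat(2, 0.10, 10, 0.20, 200))  # 2 + 0.1*(10 + 0.2*200) = 7.0
```

This also shows why stalling the main memory matters for the experiment: raising `mem_time` is exactly the knob that recreates the bottleneck of a faster processor relative to memory.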

Direct-Mapped Cache 10 Caches store data, a valid bit, and a unique identifier called a tag.

Tags 11 As an example, imagine a system with the following: 32-bit address bus and 32-bit word size; 64-KByte cache with a 32-byte line size. Therefore we have 2048 (2^11) lines.
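The field widths for this example can be checked with a short sketch (illustrative only): 64 KBytes / 32 bytes per line gives 2048 lines, so a 32-bit address splits into a 5-bit offset, an 11-bit index, and a 16-bit tag.

```python
# Address-field breakdown for the example direct-mapped cache:
# 32-bit address, 64-KByte cache, 32-byte lines.
CACHE_BYTES = 64 * 1024
LINE_BYTES = 32
NUM_LINES = CACHE_BYTES // LINE_BYTES          # 2048 lines = 2**11

OFFSET_BITS = LINE_BYTES.bit_length() - 1      # 5
INDEX_BITS = NUM_LINES.bit_length() - 1        # 11
TAG_BITS = 32 - INDEX_BITS - OFFSET_BITS       # 16

def split_address(addr):
    """Return (tag, index, offset) fields of a 32-bit address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(NUM_LINES, OFFSET_BITS, INDEX_BITS, TAG_BITS)  # 2048 5 11 16
```

On a lookup, the index selects the line, the stored tag is compared against the address tag, and the valid bit gates the hit.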

Set-Associative Cache 12 A cache with n possible entries for each address is called an n-way set-associative cache, e.g. a 4-way set-associative cache.

Replacement Policies 13 When an entry needs to be evicted from the cache, we need to decide which way it is evicted from. To do this we use a replacement policy: LRU, Clock, or FIFO.

LRU 14 Keep track of when each entry is accessed and always evict the Least Recently Used entry. Can be implemented using a stack ordered from MRU to LRU.
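A minimal software sketch of one set under LRU (the hardware uses a stack of way indices; here an ordered dict plays that role, with the MRU entry at the back):

```python
from collections import OrderedDict

# LRU sketch for a single cache set: OrderedDict order is the LRU stack,
# front = least recently used, back = most recently used.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.entries = OrderedDict()   # tag -> data

    def access(self, tag, data=None):
        """Return True on hit, False on miss (filling, evicting if full)."""
        if tag in self.entries:
            self.entries.move_to_end(tag)         # hit: promote to MRU
            return True
        if len(self.entries) >= self.ways:
            self.entries.popitem(last=False)      # evict LRU at the front
        self.entries[tag] = data                  # fill as MRU
        return False

s = LRUSet(ways=4)
for t in [1, 2, 3, 4, 2]:
    s.access(t)
s.access(5)              # set is full: evicts tag 1, the LRU entry
print(list(s.entries))   # [3, 4, 2, 5]
```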

Clock 15 For each way we store a reference (R) bit, and we also store a pointer to the oldest entry (the hand). Starting at the hand, we test and clear each R bit until we reach one that is 0; that way is evicted.
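The sweep above can be sketched for one set as follows (a software model of the hardware behaviour, not the RTL itself):

```python
# Clock replacement sketch for one set: each way has a reference bit;
# the hand sweeps, clearing R bits, until it finds one already 0.
class ClockSet:
    def __init__(self, ways):
        self.r = [0] * ways        # reference bits, one per way
        self.hand = 0              # points at the next candidate way

    def touch(self, way):
        self.r[way] = 1            # set R bit when the way is accessed

    def pick_victim(self):
        while self.r[self.hand] == 1:
            self.r[self.hand] = 0                  # second chance: clear R
            self.hand = (self.hand + 1) % len(self.r)
        victim = self.hand                         # R bit already 0: evict
        self.hand = (self.hand + 1) % len(self.r)
        return victim

c = ClockSet(4)
for w in (0, 1, 3):
    c.touch(w)
print(c.pick_victim())   # 2: the first way whose R bit is already 0
```

Note the loop always terminates: in the worst case one full sweep clears every R bit, including the one the hand started on.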

System Overview 16

PolyBlaze L2 Cache 17: Set-associative cache; LRU or Clock replacement policy; 32- or 64-byte line width; 64-bit memory interface; write-back cache.

L2 Cache 18

Reuse Policy 19 Determines which way is evicted on a cache miss. Currently uses the LRU policy.

Tag Bank 20 Contains the tags and valid bits, stored on the FPGA using BRAMs. One bank is instantiated for each way.

Control Unit 21 Finite state machine for the L2 cache. Pipelining: if a request is outstanding on the NPI, we can service other requests in SRAM.

Data Bank 22 Control interface for off-chip SRAM

SRAM 23 ZBT synchronous SRAM, 1 MB

Methodology 24 Break the L2 cache into three parts and test each separately, then combine and test the system: SRAM Controller; NPI Interface; L2 Core; Complete L2 Cache.

SRAM Controller 25 Create a wrapper that connects the SRAM controller to the MicroBlaze over an FSL, then write a program that writes and reads data at all addresses in the SRAM: Write all 1's √; Write all 0's √; Alternate writing all 1's and all 0's √; Write random data √
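The four patterns above can be sketched as a small test harness. This is illustrative only: the in-memory dict merely stands in for the real FSL-connected SRAM controller, and the word count is an assumed placeholder.

```python
import random

# Sketch of the four SRAM test patterns: write every address with the
# pattern, then read everything back and compare. The dict stands in
# for the actual SRAM behind the controller.
def run_pattern(mem_words, make_word):
    mem = {}
    expected = {}
    for addr in range(mem_words):          # write phase
        w = make_word(addr)
        mem[addr] = w
        expected[addr] = w
    return all(mem[a] == expected[a] for a in range(mem_words))  # read-back

N = 1024                                   # hypothetical word count
rng = random.Random(0)                     # seeded so runs are repeatable
patterns = {
    "all ones":    lambda a: 0xFFFFFFFF,
    "all zeros":   lambda a: 0x00000000,
    "alternating": lambda a: 0xFFFFFFFF if a % 2 == 0 else 0x00000000,
    "random":      lambda a: rng.getrandbits(32),
}
for name, pat in patterns.items():
    print(name, run_pattern(N, pat))
```

Against a real device the read-back comparison is what catches stuck bits (all 1's / all 0's), coupling between adjacent cells (alternating), and addressing faults (random data).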

NPI Interface 26 Uses a custom FSL width, so we cannot test using the MicroBlaze. Instead, create a hardware test bench to read and write data at all addresses: Write all 1's X; Write all 0's X; Alternate writing all 1's and all 0's X; Write random data X

L2 Core 27 Simulate the core of the L2 cache in iSim X. Write a test bench that approximates the responses from the L1/L2 Arbiter, SRAM Controller, and NPI Interface X. The test bench will write to each line multiple times to create a large number of cache misses X.

Complete L2 Cache 28 Combine the L2 cache with the rest of PolyBlaze X. Write test programs that read and write various regions of memory X.

Current Progress 29 SRAM Controller and Data Bank: Designed and Tested NPI Interface: Testing and Debugging in Progress L2 Core: Testing and Debugging in Progress

Future Work 30 Add the Clock replacement policy to the L2 cache; add a write-back buffer to the L2 cache; migrate the system from the XUPV5 to a BEE3 so we can create a system with more cores; modify the L2 cache into a NUMA system; add custom hardware accelerators to PolyBlaze.

Questions? 31