
Hit or Miss ? !!!

 Cache RAM is high-speed memory (usually SRAM).  The cache stores frequently requested data.  When the CPU needs data, it checks the high-speed cache memory first before looking in the slower main memory.  Cache memory can be three to five times faster than system DRAM.

 Most computers have two separate memory caches: L1 cache, located on the CPU, and L2 cache, located between the CPU and DRAM.  L1 cache is faster than L2 and is the first place the CPU looks for its data. If the data is not found in the L1 cache, the search continues to the L2 cache and then to DRAM.

 Shared cache: a cache shared among several processors.  In a multi-core system, the shared cache is usually overloaded with accesses from the different cores.  Our goal is to reduce the load on the shared cache.  To achieve this goal, we build a predictor that predicts whether an access to the shared cache will be a hit or a miss.

 Small size.  Simple and fast.  Implementable in hardware.  Does not need too much power.  Does not predict a miss when the access is actually a hit.  Has a high hit rate, especially on misses.

 Bloom filter: a method of representing a set A of n elements (a1, …, an) to support membership queries.  The idea is to allocate a vector v of m bits, initially all set to 0.  Choose k independent hash functions h1, …, hk, each with range 1…m.  For each element a, the bits at positions h1(a), …, hk(a) in v are set to 1.

 Given a query for b, we check the bits at positions h1(b), h2(b), …, hk(b).  If any of them is 0, then b is certainly not in the set A.  Otherwise we conjecture that b is in the set, although there is some probability that we are wrong; this is called a "false positive".  The parameters k and m should be chosen so that the probability of a false positive (and hence of a false hit) is acceptable.
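The insert and query steps above can be sketched in a few lines of Python. This is a minimal sketch: the array size M, hash count K, and the salted-digest trick used to imitate k independent hash functions are illustrative assumptions, not the presentation's implementation.

```python
import hashlib

M = 32  # bits in the vector v (illustrative)
K = 3   # number of hash functions (illustrative)

def hk(item, i):
    # i-th hash position: a salted digest imitates k independent
    # hash functions h_1 ... h_k, each with range 0 .. M-1
    d = hashlib.sha256(f"{i}:{item}".encode()).digest()
    return int.from_bytes(d[:4], "big") % M

v = [0] * M  # the bit vector, initially all 0

def insert(a):
    # set the bits at positions h_1(a), ..., h_K(a)
    for i in range(K):
        v[hk(a, i)] = 1

def member(b):
    # any zero bit -> b is certainly not in the set;
    # all ones -> probably in the set (false positives possible)
    return all(v[hk(b, i)] for i in range(K))

for a in (123, 456, 764, 227):
    insert(a)
print(member(123))  # True: inserted elements are always reported present
```

Inserted elements can never be reported absent, which is exactly the "does not predict miss if we have a hit" property the predictor needs.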

A = {123, 456, 764, 227}, H(x) = x % 16, Bloom array of 16 bits.  Insert(123): H(123) = 11.  Insert(456): H(456) = 8.  Insert(764): H(764) = 12.  Insert(227): H(227) = 3.  Is 227 in A? H(227) = 3, Bloom[3] = 1  "I think, yes it is." Right prediction.  Is 151 in A? H(151) = 7, Bloom[7] = 0  "Certainly no." Right prediction.  Is 504 in A? H(504) = 8, Bloom[8] = 1  "I think, yes it is." Oops! False positive.
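The worked example above can be reproduced directly, using the slide's single hash H(x) = x % 16:

```python
M = 16
H = lambda x: x % M  # the slide's hash function

bloom = [0] * M
for a in (123, 456, 764, 227):  # A = {123, 456, 764, 227}
    bloom[H(a)] = 1             # sets bits 11, 8, 12, 3

def query(x):
    # a zero bit is a definite "no"; a set bit is only a "maybe"
    return "maybe in A" if bloom[H(x)] else "certainly not in A"

print(query(227))  # maybe in A         (right prediction)
print(query(151))  # certainly not in A (right prediction)
print(query(504))  # maybe in A         (false positive: H(504) = H(456) = 8)
```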

 We used a separate predictor for each set in the L2 cache: one Bloom array per cache set (Array 0 for Set 0, Array 1 for Set 1, …, Array N for Set N).
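A per-set predictor just indexes a Bloom array by the cache set an address maps to. A sketch of that mapping; the line size, set count, and array size here are illustrative values, not the project's configuration:

```python
LINE_SIZE = 64   # bytes per cache line (illustrative)
NUM_SETS = 1024  # number of L2 sets (illustrative)
ARRAY_BITS = 64  # Bloom array entries per set (illustrative)

# one Bloom array per cache set
bloom_arrays = [[0] * ARRAY_BITS for _ in range(NUM_SETS)]

def set_index(address):
    # drop the byte-offset bits, then take the set-index bits
    return (address // LINE_SIZE) % NUM_SETS

def predictor_for(address):
    # the Bloom array consulted for this access
    return bloom_arrays[set_index(address)]
```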

SSmall size. SSimple and fast. IImplementable with hardware. DDoes not need too much power. DDoes not predict miss if we have a hit.

 If A is a dynamic set, and in our case it is, it is very hard to update the array when removing an element e from A: we cannot simply turn off Bloom[H(e)], because first we must check that there is no other element e1 in A such that H(e) = H(e1), and this takes a lot of time.  If we don't update the array, the hit rate becomes low.

 Use counters instead of binary cells, so that when removing an element we simply decrement the appropriate counters.  The problem with this solution: the size becomes large.

 Note that the number of elements in each set is usually small (the cache associativity), which allows us to use limited counters, for example 2-bit counters.  In this way we get a small predictor, but we still have a problem when a counter reaches saturation, although this happens with low probability.
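A counting Bloom filter with 2-bit saturating counters might look like the sketch below; the single modulo hash and the array size are assumptions for brevity, not the project's design:

```python
MAX_COUNT = 3  # 2-bit counters saturate at 3

class CountingBloom:
    def __init__(self, size=16):
        self.counters = [0] * size

    def _pos(self, item):
        return item % len(self.counters)  # single hash for brevity

    def insert(self, item):
        p = self._pos(item)
        if self.counters[p] < MAX_COUNT:
            self.counters[p] += 1
        # at MAX_COUNT the counter saturates: we lose track of how
        # many elements actually mapped to this entry

    def remove(self, item):
        p = self._pos(item)
        if 0 < self.counters[p] < MAX_COUNT:
            self.counters[p] -= 1
        # a saturated counter cannot safely be decremented:
        # the true count might exceed MAX_COUNT

    def query(self, item):
        return self.counters[self._pos(item)] > 0
```

The saturation problem the slide mentions shows up in `remove`: once a counter hits MAX_COUNT, decrementing it could make a still-present element disappear from the filter.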

 Adding an overflow flag to each Bloom array allows us, in some cases, to decrement a counter that has reached saturation.  Overflow flag = 1 if and only if we tried to increment a saturated counter in that array.  How does it help?  If the overflow flag is 0, we can safely decrement a saturated counter, which we were unable to do before.
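The overflow flag can be sketched as an extension of the counting filter; again the single modulo hash and the sizes are illustrative assumptions:

```python
class OverflowBloom:
    def __init__(self, size=16, max_count=3):
        self.counters = [0] * size
        self.max_count = max_count
        self.overflow = False  # one flag per Bloom array

    def _pos(self, item):
        return item % len(self.counters)

    def insert(self, item):
        p = self._pos(item)
        if self.counters[p] == self.max_count:
            self.overflow = True  # tried to increment a saturated counter
        else:
            self.counters[p] += 1

    def remove(self, item):
        p = self._pos(item)
        if self.counters[p] == self.max_count and self.overflow:
            return False  # unsafe: some increment was lost in this array
        if self.counters[p] > 0:
            self.counters[p] -= 1  # safe even from saturation, since
                                   # no increment was ever dropped here
        return True
```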

 How can we solve the problem of arrays that are still not updated?  We enter the arrays that need an update into a queue, and every N cycles we update one of them (much as DRAM lines are refreshed).  When do we enter an array into the queue?  After K failed attempts to decrement a counter in the array due to overflow.

 We do not have an infinite queue in hardware, so what can we do if the queue is full and we need to enter an array into it?  We turn on a flag indicating that the array needs an update but has not yet entered the queue; the next time we access the array, we try again to enter it into the queue.
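The queue-and-retry scheme from the last two slides might look like this in software. This is a sketch: the queue capacity, the threshold K, and all names are illustrative assumptions.

```python
from collections import deque

QUEUE_CAPACITY = 4  # hardware queues are finite (illustrative)
K_FAILED = 2        # failed decrements before an array needs a rebuild

class BloomArrayState:
    def __init__(self):
        self.failed_decrements = 0
        self.pending = False  # needs an update but not yet queued

update_queue = deque()

def try_enqueue(array):
    if array in update_queue:
        return
    if len(update_queue) < QUEUE_CAPACITY:
        update_queue.append(array)
        array.pending = False
    else:
        array.pending = True  # queue full: retry on the next access

def on_failed_decrement(array):
    array.failed_decrements += 1
    if array.failed_decrements >= K_FAILED:
        try_enqueue(array)

def on_access(array):
    if array.pending:
        try_enqueue(array)  # second chance to enter the queue

def periodic_update():
    # every N cycles, rebuild one queued array from the cache contents
    if update_queue:
        array = update_queue.popleft()
        array.failed_decrements = 0
```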

 We obtained all the L2 accesses from Simics for 9 benchmarks.  We implemented a simulator of the cache and the predictor in Perl.  On the command line we can choose the configuration we want by changing the following parameters:

 Cache parameters:  Lines number – the number of lines in the cache.  Line size – the size of each line in the cache.  Associativity – the associativity of the cache.

 Predictor parameters:  Bloom array size – the number of entries in a Bloom array.  Bloom max counter – the counter limit for each entry.  Number of hashes – the number of hash functions the algorithm uses.

 Predictor parameters (continued):  Bloom max not updated – the number of failed attempts to decrement the Bloom counter in a specific entry, the failures being due to the counter's saturation.  Enable bloom update – enable array updates.  Bloom update period – the number of L2 accesses between two updates.

 The following graphs show the hit rate of the predictor versus the cache hit rate.  We configured the predictor and the cache with the following parameters:  Bloom array size = 64  Bloom max counter = 3  Associativity = 16  Line size = 64  Update period = 1

 Project goal achieved:  We saw in the graphs above that we get a high hit rate on misses; for example, the average hit rate on misses with a 16M cache is 93.5%.  What's next?  Applying the predictor idea to other units in the computer, for example the DRAM.
