TWO FAST AND HIGH-ASSOCIATIVITY CACHE SCHEMES By Amala Nitya Rama Venkata Prasad

Outline
 Cache Mapping Techniques: DM Cache, Fully Associative Cache, N-way Set Associative
 Traditional Set Associative Cache
 Search Approaches: Parallel Approach, Sequential Approach
 Cache Schemes: HR Cache, Column-Associative Cache
 Sequential Multicolumn Cache
 Parallel Multicolumn Cache
 Performance Evaluation, Simulation Results & Bottlenecks

Introduction In the race to improve cache performance, many researchers have proposed schemes that increase a cache's associativity. What is associativity?  The associativity of a cache is the number of places in the cache where a block may reside.  A direct-mapped cache has an associativity of 1, meaning there is only one location to search for a match.  Increasing associativity reduces the miss rate.

Cache Mapping & Associativity  A very important factor in determining the effectiveness of a cache is how the cache is mapped to system memory.  There are many different ways to allocate the storage in a cache to the memory addresses it serves.  Generally, this mapping can be done in three different ways.

Direct Mapped Cache  The simplest way to allocate the cache to system memory is to determine how many cache lines there are (16,384, for example) and chop system memory into the same number of chunks.  Each chunk then gets the use of one cache line. This is called direct mapping.  So with 64 MB of main memory, each cache line would be shared by 4,096 memory addresses (64 M divided by 16 K).
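To make the arithmetic concrete, here is a minimal sketch of direct-mapped placement; the line count follows the example above, while the 16-byte block size is an assumption added only for illustration.

```python
# Minimal sketch of direct-mapped placement.
# NUM_LINES comes from the slide's example; BLOCK_SIZE is an assumed value.
NUM_LINES = 16_384
BLOCK_SIZE = 16

def direct_mapped_line(address: int) -> int:
    """Each memory block can live in exactly one cache line."""
    block_number = address // BLOCK_SIZE
    return block_number % NUM_LINES

# Two addresses whose block numbers differ by NUM_LINES collide on the same line:
print(direct_mapped_line(0))                        # -> 0
print(direct_mapped_line(NUM_LINES * BLOCK_SIZE))   # -> 0, so they evict each other
```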

Fully Associative Cache  Instead of hard-allocating cache lines to particular memory locations, it is possible to design the cache so that any line can store the contents of any memory location.  This is called fully associative mapping.  The fully associative cache has the best hit ratio because any line in the cache can hold any address that needs to be cached.

N-way Set Associative  This is a compromise between the direct-mapped and fully associative designs.  In this case the cache is broken into sets, where each set contains "N" cache lines, say 4. Each memory address is assigned to a set and can be cached in any one of those 4 locations within the set it is assigned to.  This design means that there are "N" possible places where a given memory location may be in the cache.
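The sketch below shows how an address narrows to a single set of N candidate locations; the block size and set count are illustrative assumptions, not values from the slides.

```python
# Illustrative address decomposition for an N-way set-associative cache.
BLOCK_SIZE = 16     # bytes per block -- assumed for the example
NUM_SETS   = 1_024  # number of sets  -- assumed for the example
N_WAYS     = 4      # candidate locations per address, "N" in the slide

def decompose(address: int):
    offset  = address % BLOCK_SIZE
    set_idx = (address // BLOCK_SIZE) % NUM_SETS
    tag     = address // (BLOCK_SIZE * NUM_SETS)
    # The block may reside in any of the N_WAYS ways of set `set_idx`.
    return tag, set_idx, offset
```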

Observations
 HIT RATIO: Direct Mapped Cache - lowest; Fully Associative - best; N-way Set Associative - very good, and better as N increases
 SEARCH SPEED: Direct Mapped Cache - best; Fully Associative - slowest; N-way Set Associative - good, but worse as N increases

Traditional Set Associative Cache  A traditional implementation of the set-associative cache has the disadvantage of a longer access cycle time than that of a direct-mapped cache.  Several methods have been proposed for implementing associativity in non-traditional ways.  However, most of them have only achieved an associativity of two; the others, with higher associativity, still have longer access cycle times or suffer from larger average access times.

Contd…  In a traditional n-way set-associative cache, a data array holds data blocks and a tag array holds their associated tags; both arrays are organized into n banks.  There is a one-to-one correspondence between data banks and tag banks.  On a reference, the cache system uses the lower bits of the block address to select a row of n tags in the tag array and a row (with the same row number) of n data items in the data array.

 The system reads the n tags simultaneously and compares them against the tag bits of the address in parallel.  At the same time, it reads the n candidate data items, one of which a multiplexer later selects using the outcome of the tag comparisons.  In an n-way set-associative cache, a mapping function determines the n candidate locations (which can be any blocks in the cache) that make up a set, and a search approach determines how to examine these locations.
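A behavioral sketch of this lookup follows, with the parallel tag comparison modeled as a loop and the multiplexer as an index selection; the data structures are assumptions made for illustration.

```python
# Behavioral sketch of a traditional n-way set-associative read:
# one row of n tags and n data items is read, all tags are compared
# "in parallel", and the comparison outcome drives the output multiplexer.
def set_assoc_read(tag_array, data_array, block_address):
    """tag_array[row][way] and data_array[row][way]; rows = number of sets."""
    num_rows = len(tag_array)
    row = block_address % num_rows             # lower bits of the block address
    tag = block_address // num_rows            # remaining bits form the tag
    matches = [stored == tag for stored in tag_array[row]]   # parallel compare
    if any(matches):
        way = matches.index(True)              # tag-comparison outcome...
        return data_array[row][way]            # ...selects the data item (multiplexer)
    return None                                # miss
```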

Mapping Functions  Most cache designs use bit selection as the mapping function.  Bit selection works by conceptually dividing the cache into n equal banks.  It then uses the lower bits of the block address as an index to select one candidate location from each bank.
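A minimal sketch of bit selection, assuming a power-of-two bank depth chosen only for the example:

```python
# Bit selection, sketched: the lower bits of the block address select the same
# row in each of the n banks, giving one candidate location per bank.
ROWS_PER_BANK = 1_024   # assumed bank depth; must be a power of two here

def candidate_locations(block_address: int, n: int):
    index = block_address & (ROWS_PER_BANK - 1)   # lower bits of the block address
    return [(bank, index) for bank in range(n)]   # n candidate locations, one per bank
```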

Search Approaches The search approach determines how the cache system examines candidate locations for a match when servicing a reference. Types of searches  Parallel search  Sequential search

Parallel Search  This approach examines all locations in parallel: while the cache reads and compares the tags, it also reads all the candidate data items, and the multiplexer later selects one data item if the reference turns out to be a hit.  Each location in the set is of equal importance.  Traditional implementations of set-associative caches use this search approach.  These implementations organize both tag memory and data memory into multiple banks.

Sequential Search  A sequential search examines locations one by one until it finds a match or exhausts all the locations.  This approach can organize both tag memory and data memory into a single bank; alternatively, to implement an associativity of n, the tag memory and data memory can be divided into n banks.  With a sequential search, the cache access time for a hit may vary for different references, depending on how many locations the cache system examines before it finds the match.  We call a reference an ith hit if the system finds a match in the ith location; its access time is the ith hit time.
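A minimal sketch of the sequential probe loop and its ith-hit bookkeeping, with the data layout assumed for illustration:

```python
# Sequential search sketch: probe candidate locations one at a time (MRU entry
# first) and report which probe hit, since an i-th hit costs i probe cycles.
def sequential_lookup(locations, tag):
    """locations: list of (stored_tag, data) in search order."""
    for i, (stored_tag, data) in enumerate(locations, start=1):
        if stored_tag == tag:
            return data, i              # an i-th hit: access time is the i-th hit time
    return None, len(locations)         # a miss after examining all locations (an i-th fetch)
```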

Contd…  The ratio of the number of ith hits to the total number of hits is the ith hit rate.  We call a reference an ith fetch if it turns out to be a miss after the system examines i locations; in that case the system must fetch a block from lower-level memory.  To ensure that most hits are first hits, the MRU location should be the entry point for the search.

Approaches for Search Order The cache system must determine the search order, including the entry point, before a search begins. Types of approaches  Mapping Approach  Swapping Approach

Mapping Approach Here the search order for a reference is determined by looking up steering information in a mapping table. The mapping will lengthen the cache access cycle, however, if it uses the effective address as the index to look up the steering information. This approach also requires extra memory for the mapping table.

Swapping Approach  This approach determines the search order statically. In other words, it uses a fixed order and assumes that the most recently used block is in location 0, the second most recently used block is in location 1, and so forth.  Performing appropriate swaps of blocks after each reference can guarantee the correct order.  This approach is impractical for associativity larger than two.

Implementing Associativity >2 The column-associative cache and the predictive sequential associative cache seem to have achieved optimal performance for an associativity of two. Increasing associativity beyond two is one of the most important ways to further improve cache performance. There are two schemes for implementing associativity >2: the sequential multicolumn cache and the parallel multicolumn cache. For an associativity of 4, both achieve a low miss rate.

Advantages & Disadvantages of High Associativity Increasing associativity, however, also increases the hit access time, which generally leads to a longer cycle time.  The multiplexing logic that selects the correct data item from the referenced set adds delay.  Only after tag checking completes does the multiplexing logic know which data item to select and dispatch to the CPU.  Therefore, it is difficult to implement speculative dispatch of data to the CPU.

Contd… Increasing associativity also has the advantage of reducing the probability of thrashing. Repeatedly accessing m different blocks that map to the same set (with m greater than the set's associativity) will cause thrashing.

Cache Schemes Hash-Rehash Cache (HR cache)  The HR-cache organization uses a sequential search. It keeps MRU blocks in the first location of a set.  The cache blocks are divided into two banks; each bank contains one of the two blocks that comprise a set.

Mechanism  If the data is not found in the first block, the second block is examined. If the data is found in the second block, the contents of the first and second blocks are exchanged.  If the data is not found by the second probe, a miss occurs. The first block is moved to the second block, since the first block was more likely to have been more recently referenced than the second block, and the missing block is placed in the first block.


Column-Associative Cache (CA cache)  The CA cache improves the miss rate and average access time of the HR cache, but still requires that entire cache blocks be exchanged.  In the CA cache, a rehash bit is associated with each cache block, indicating that the data stored in that block would be found using the second hash function.

Mechanism  Consider the process of fetching the addresses and When the address is fetched, index two would be examined first, followed by index six.  A miss occurs, the data from index two is moved to index six, and is loaded into index two.  The address at index six would now be found using the second hashing function, and the corresponding rehash bit is set.  Note that the more recently referenced block was discarded.

 Now a third address is referenced, one that indexes to six. Index six is examined first; that block does not contain the referenced address, but its rehash bit has been set, indicating that index two cannot contain the address either.  Index two is not examined, and the miss is issued one cycle earlier than in the case of the HR cache.
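The sketch below captures the rehash-bit control flow described in the last two slides; the hash functions, data layout, and replacement details are simplifying assumptions, not the exact CA-cache design.

```python
# Column-associative (CA) cache sketch: one rehash bit per line; a set rehash
# bit at the first-probe location lets the cache declare a miss one cycle early.
class CACache:
    def __init__(self, num_lines):                 # num_lines assumed a power of two
        self.lines = [None] * num_lines            # each entry: (tag, data) or None
        self.rehash = [False] * num_lines          # True if the block here was placed by h2
        self.n = num_lines

    def _h1(self, tag): return tag % self.n
    def _h2(self, tag): return self._h1(tag) ^ (self.n // 2)   # flip the top index bit

    def access(self, tag, fetch):
        a = self._h1(tag)
        if self.lines[a] and self.lines[a][0] == tag:          # first-probe hit
            return self.lines[a][1]
        if self.rehash[a]:                                     # resident block is rehashed:
            self.lines[a] = (tag, fetch(tag))                  # miss, no second probe needed
            self.rehash[a] = False
            return self.lines[a][1]
        b = self._h2(tag)
        if self.lines[b] and self.lines[b][0] == tag:          # second-probe hit: swap blocks
            self.lines[a], self.lines[b] = self.lines[b], self.lines[a]
            self.rehash[a], self.rehash[b] = False, True
            return self.lines[a][1]
        self.lines[b], self.rehash[b] = self.lines[a], True    # miss: old block moves to b
        self.lines[a], self.rehash[a] = (tag, fetch(tag)), False
        return self.lines[a][1]
```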


Observations
 HR cache: higher miss rate and higher average access time
 CA cache: lower miss rate and lower average access time

Sequential multicolumn cache (SMC)  The SMC is an extension of the column-associative cache. It uses a sequential search approach starting from the major location.  Sequentially searching the non-major locations, which are much less important, would degrade performance at higher associativity; therefore, a search index is used to improve search efficiency.

 The search first examines the major location. At the same time, the cache system reads the index from the index table and uses it to form the address of the next selected location. If the first probe is a miss, the system examines the next selected location using the index information.  This process continues until it produces a match or until it has examined all the selected locations. Each probe requires one cycle.  For a non-first hit or a miss, a swap operation is necessary to move the new MRU block to the major location of the current reference.
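A simplified sketch of this probe sequence is shown below; the cache layout, the contents of the index table, and the swap handling are assumptions made for illustration, and updates to the index table itself are omitted.

```python
# SMC lookup sketch: probe the major location first, then only the non-major
# locations listed in the search index, one per cycle.
def smc_lookup(cache, index_table, row, major_col, tag):
    """cache[row][col] -> (tag, data) or None;
    index_table[row][major_col] -> non-major columns worth probing."""
    entry = cache[row][major_col]                     # cycle 1: the major location
    if entry and entry[0] == tag:
        return entry[1], 1                            # first hit
    for i, col in enumerate(index_table[row][major_col], start=2):
        entry = cache[row][col]                       # one extra cycle per probe
        if entry and entry[0] == tag:
            # swap so the new MRU block returns to its major location
            cache[row][major_col], cache[row][col] = entry, cache[row][major_col]
            return entry[1], i                        # an i-th hit
    return None, None                                 # miss: fetch from lower-level memory
```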

Parallel multicolumn cache  A PMC separates the implementation of the tag array from that of the data array. It keeps the data memory in one bank but organizes the tag memory into n banks.  The cache system accesses the data bank sequentially, one word at a time, but performs tag checking in parallel. This decoupling makes it possible to dispatch data speculatively to the CPU; no multiplexing operation on the data is necessary.

 The PMC uses a parallel search: it reads the data in the major location from the data bank and dispatches it to the CPU. At the same time, it reads and checks the n tags in parallel. A first hit occurs if the major location's tag matches the tag of the reference. If any of the other tags matches the tag of the reference, a second hit occurs, requiring one more cycle to access the data bank again for the desired data.  The cache system generates the address for this second access using the outcome of the tag checking. If the reference is a miss, it fetches a new block from lower-level memory. As in the SMC, a swap operation is necessary in the case of a miss or a second hit.
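A simplified sketch of the decoupled tag/data access is shown below; the structures and cycle accounting are illustrative assumptions, and the swap and miss-handling paths are omitted.

```python
# PMC lookup sketch: data from the major location is dispatched speculatively
# while all n tags are checked in parallel; a match in another bank costs one
# more access to the single data bank.
def pmc_lookup(tag_banks, data_bank, row, major_way, tag):
    """tag_banks[way][row] -> stored tag; data_bank[(row, way)] -> data."""
    speculative = data_bank[(row, major_way)]             # sent to the CPU right away
    matches = [bank[row] == tag for bank in tag_banks]    # n tag checks in parallel
    if matches[major_way]:
        return speculative, 1                             # first hit: speculation was correct
    if any(matches):
        way = matches.index(True)                         # tag-check outcome forms the address
        return data_bank[(row, way)], 2                   # second hit: one extra cycle
    return None, None                                     # miss: fetch from lower-level memory
```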

Performance evaluation  We compare the performance of our cache schemes with that of the column-associative cache because that cache performed best in previous comparison studies.  Agarwal and Pudar [4] compared the performance of the hash-rehash cache, the column-associative cache, and Jouppi's victim cache using ATUM (Address Traces Using Microcode) [11] traces. The results show that the column-associative cache has the lowest miss rate of the three.  Calder compared the performance of the hash-rehash cache, the column-associative cache, and Kessler's two-way MRU cache; the column-associative cache has the lowest average access time.

 For our comparisons we used the ATUM traces, which include traces for the following 10 programs: dec0, fora, forf.3, fsxzz, macr, memxx, mul8.3, pasc, spic, and ue02. These traces represent realistic workloads and include both operating system and multiprogramming activity.  Agarwal and Pudar used the ATUM traces to evaluate the column-associative cache, so using the same traces makes the comparison fair. We extended the Dinero III simulator extensively to support our schemes and the column-associative cache.

 We tested associativities of 4, 8, and 16, and cache sizes from 1 to 128 Kbytes. We chose a block size of 16 bytes for our simulation, the same as in Agarwal and Pudar's simulation [4].  We also studied the performance of the SMC and PMC for miss penalties ranging from 20 to 100 cycles, with the cache size fixed at 4 Kbytes.

Performance metrics.  The miss rate and the average access time are two important measures of cache performance. The average access time reflects the actual performance of the cache design more directly and more accurately than the miss rate does.  Suppose M is the miss penalty (in cycles), R is the total number of references in a trace, Hi is the number of ith hits, and Fi is the number of ith fetches in the simulation. Also suppose that TCOL, TSMCn, TPMCn, and T′PMCn are the average access times for the column-associative cache, the SMC (with associativity n), and the two PMC variants (PMC 1 and PMC 2), respectively.
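One plausible way to combine these quantities into an average access time, assuming an ith hit costs i cycles and an ith fetch costs i + M cycles (the exact per-scheme expressions may differ), is:

```latex
T = \frac{\sum_{i} i \cdot H_i \;+\; \sum_{i} (i + M) \cdot F_i}{R}
```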

 To compare the performance of our multicolumn caches with that of the direct-mapped cache and the column-associative cache, we define the improvement in average access time over the direct-mapped cache (IMP):
 IMP = (TDIR - T) / TDIR * 100%
 TDIR is the average access time of a direct-mapped cache, and T is TCOL, TSMCn, TPMCn, or T′PMCn, depending on the cache under consideration.
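As a worked example with purely hypothetical access times:

```python
# IMP computation with made-up numbers, just to illustrate the formula above.
def imp(t_dir: float, t: float) -> float:
    """Improvement in average access time over the direct-mapped cache, in percent."""
    return (t_dir - t) / t_dir * 100.0

print(imp(2.0, 1.6))    # -> 20.0  (a 20% improvement in this hypothetical case)
```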

Simulation results.

Considering the IMP as a function of the miss penalty, we conclude the following:  The curves for the SMC and PMC rise more sharply than that of the column-associative cache as the miss penalty increases, which means the multicolumn caches offer more benefit for larger miss penalties.  The improvement of the PMC 1 approaches that of the SMC as the miss penalty increases; therefore, the PMC scheme works well for a pipelined cache when the miss penalty is large.  There is still a relatively large gain in performance when n increases from 4 to 8 and the miss penalty is large. The gain is much smaller, however, when n increases to 16.

TWO MAJOR PERFORMANCE BOTTLENECKS  The two major bottlenecks in implementing set-associative caches are the multiplexing operation needed to select the data item when using a parallel search, and the delay of the tag comparisons when using a sequential search. A direct-mapped cache needs no multiplexing, and a hit is always a first hit; however, its miss rate is significantly higher than that of an n-way set-associative cache.  Our multicolumn cache schemes gain the speed advantage of direct-mapped caches by first checking the major location, while providing high-degree n-way associativity and minimizing the number of sequential searches. The additional hardware cost is small and acceptable.

THANK YOU