Towards a Theory of Cache-Efficient Algorithms
Summary for the seminar "Analysis of Algorithms in Hierarchical Memory", Spring 2004, by Gala Golan.

The RAM Model
In the previous lecture we discussed caching in an operating system, and we saw a lower bound on sorting in the I/O model: Ω((N/B) log_{M/B}(N/B)) block transfers, where
N = number of elements to sort
B = number of elements in each block
M = size of the fast memory

The I/O Model
1. A datum can be accessed only from fast memory.
2. B elements are brought into fast memory in each access.
3. Computation cost << I/O cost.
4. A block of data can be placed anywhere in fast memory.
5. I/O operations are explicit.

The Cache Model
1. A datum can be accessed only from fast memory. √
2. B elements are brought into fast memory in each access. √
3. Computation cost << I/O cost. Instead, L denotes the normalized cache latency: accessing a block from the cache costs 1, accessing main memory costs L.
4. A block of data can be placed anywhere in fast memory. Instead, a fixed mapping distributes main-memory blocks among the cache lines.
5. I/O operations are explicit. Instead, the cache is not visible to the programmer.

Notation
I(M,B) – the I/O model.
C(M,B,L) – the cache model.
n = N/B, m = M/B – the size of the data and of the memory in blocks (instead of elements).
The goal of algorithm design in this model is to minimize running time = (number of cache accesses) + L × (number of main-memory accesses).
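As a quick illustration of this cost measure, here is a minimal sketch; the counts and the latency value are illustrative, not from the slides:

```python
# Sketch of the C(M,B,L) cost measure: a cache access costs 1,
# a main-memory access costs L (normalized cache latency).

def running_time(cache_accesses: int, memory_accesses: int, L: int) -> int:
    """Total cost in the cache model, per the formula above."""
    return cache_accesses + L * memory_accesses

# Example: 10^6 accesses, 5% of which go to main memory, L = 50.
accesses = 1_000_000
misses = accesses // 20
print(running_time(accesses - misses, misses, L=50))
```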

Reminder – Cache Associativity
Associativity specifies the number of different cache frames in which a memory block can reside: fully associative, direct mapped, or k-way set associative (e.g., 2-way).
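A small sketch of what associativity means for placement; the function and parameter names are assumptions made for illustration:

```python
# Which cache lines may hold a given memory block? For a cache with
# num_lines lines and associativity k, the lines are grouped into
# num_lines/k sets of k lines each, and a block maps to exactly one set.

def candidate_lines(block_number: int, num_lines: int, k: int) -> range:
    """Return the cache lines that may hold this memory block."""
    num_sets = num_lines // k          # k lines per set
    s = block_number % num_sets        # the set this block maps to
    return range(s * k, s * k + k)     # all k lines of that set

print(list(candidate_lines(37, num_lines=8, k=1)))  # direct mapped: one line
print(list(candidate_lines(37, num_lines=8, k=2)))  # 2-way: two lines
print(list(candidate_lines(37, num_lines=8, k=8)))  # fully associative: any line
```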

Emulation Theorem
An algorithm A in I(M,B) using T block transfers and I processing time can be converted to an equivalent algorithm A_c in C(M,B,L) that runs in O(I + (L + B)·T) steps. The additional memory requirement is m blocks.
In other words, an algorithm that is efficient in main memory can be made efficient in cache.

Proof (1)–(6)
[Figures: the emulation maintains a buffer Buf[] of m blocks in main memory alongside the data Mem[] of n blocks; individual elements (a, b, q in the figures) are copied between Mem[] and Buf[] through the direct-mapped cache C[].]
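A sketch of the emulation step, using the naming of the proof figures; the sizes and function names are illustrative assumptions:

```python
# Mem[] is main memory (n blocks of B elements); Buf[] is an m-block
# region of main memory standing in for the fast memory of I(M,B).
# Copying one block element-by-element touches O(B) elements and O(1)
# cache lines, so each block transfer of A costs O(L + B) in C(M,B,L).

B = 4                       # elements per block (illustrative)
n, m = 16, 4                # data size and buffer size, in blocks
Mem = [[0] * B for _ in range(n)]
Buf = [[0] * B for _ in range(m)]

def emulate_read(block: int, frame: int) -> None:
    """Emulate 'bring a Mem block into fast-memory frame' from I(M,B)."""
    for j in range(B):                 # element-wise copy through the cache
        Buf[frame][j] = Mem[block][j]

def emulate_write(frame: int, block: int) -> None:
    """Emulate 'write a fast-memory frame back to Mem' from I(M,B)."""
    for j in range(B):
        Mem[block][j] = Buf[frame][j]

emulate_read(block=7, frame=2)         # one O(L + B) step of the emulation
```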

Block efficient algorithms
In a block-efficient algorithm, computation is performed on at least a constant fraction of the elements in each block transferred. In that case B·T = O(I), so an algorithm for I(M,B) can be emulated in C(M,B,L) in O(I + L·T) steps. The algorithms for sorting, FFT, and matrix transposition are block efficient.

Extension to set-associative cache
In a k-way set-associative cache, when all k lines of a set are occupied, the hardware uses LRU to choose which block of the set to evict for the referenced block. In the emulation technique described before we do not have explicit control over replacement. Instead, a property of LRU will be used, and the cache will be used only partially.

Optimal Replacement Algorithm for Cache
OPT (or MIN) – a hypothetical algorithm that minimizes cache misses for a given (finite) access trace.
It is offline – it knows in advance which blocks will be accessed next.
It evicts the block in the cache whose next access lies farthest in the future.
It was proven optimal – no replacement algorithm, online or offline, incurs fewer misses.
Proposed by Belady in 1966; used to theoretically evaluate the efficiency of online algorithms.

LRU vs. OPT
For any constant factor c > 1, LRU with fast memory size m makes at most c times as many misses as OPT with fast memory size (1 − 1/c)m.
For example (c = 3): LRU with cache size m incurs at most 3 times the misses of OPT with memory of size (1 − 1/3)m = (2/3)m. If OPT misses X times, LRU misses at most 3X times.
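To make the claim concrete, here are minimal simulators for LRU and Belady's OPT that can be used to check the relationship empirically on a random trace; the trace and sizes are illustrative assumptions:

```python
import random
from collections import OrderedDict

def lru_misses(trace, size):
    """Count misses of an LRU cache holding `size` blocks."""
    cache, misses = OrderedDict(), 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)             # x becomes most recently used
        else:
            misses += 1
            if len(cache) == size:
                cache.popitem(last=False)    # evict least recently used
            cache[x] = True
    return misses

def opt_misses(trace, size):
    """Count misses of Belady's OPT: evict the block used farthest ahead."""
    cache, misses = set(), 0
    for i, x in enumerate(trace):
        if x in cache:
            continue
        misses += 1
        if len(cache) == size:
            def next_use(y):                 # position of y's next access
                for j in range(i + 1, len(trace)):
                    if trace[j] == y:
                        return j
                return float('inf')          # never used again
            cache.discard(max(cache, key=next_use))
        cache.add(x)
    return misses

random.seed(0)
trace = [random.randrange(20) for _ in range(2000)]
m, c = 9, 3
# The bound predicts: LRU(m) <= c * OPT((1 - 1/c) m), up to lower-order terms.
print("LRU(m)             =", lru_misses(trace, m))
print("c * OPT((1-1/c)m)  =", c * opt_misses(trace, m - m // c))
```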

Extension to set-associative cache – Cont.
Similarly (c = 2), LRU with cache size m incurs at most 2 times the misses of OPT with memory of size m/2. So:
We emulate the I/O algorithm using only half the size of Buf[]; instead of k cache lines for every set, we use only k/2.
These k/2 blocks are managed optimally, by the optimality of the I/O algorithm.
In the real cache, the k lines of each set are managed by LRU and experience at most twice the misses.

Extension to set-associative cache – Cont.
[Figure: as in the proof of the emulation theorem – main memory Mem[], buffer Buf[], and cache C[] – with only half of Buf[] in use.]

Generalized Emulation Theorem
An algorithm A in I(M/2,B) using T block transfers and I processing time can be converted to an equivalent algorithm A_c in the k-way set-associative cache model C(M,B,L) that runs in O(I + (L + B)·T) steps. The additional memory requirement is m/2 blocks.

The cache complexity of sorting
The lower bound for sorting in I(M,B) is T = Ω((N/B) log_{M/B}(N/B)) block transfers.
The lower bound for sorting in C(M,B,L) is Ω(I + L·T), where I denotes the computation and T the I/O operations (see the reconstruction below).
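Written out (a reconstruction; the original slide rendered the bounds as images, and the cache-model bound below combines the standard Ω(N log N) comparison bound with L times the I/O bound):

```latex
% I/O model: block-transfer lower bound for sorting N elements
T_{\mathrm{sort}} = \Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)
                  = \Omega\!\left(n \log_m n\right)

% Cache model: running time combines computation I and transfers T
\Omega\!\left(I + L \cdot T\right)
  = \Omega\!\left(N \log N + L\,\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)
```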

Cache Miss Classes
Compulsory miss – a block is referenced for the first time.
Capacity miss – a block was evicted because the cache is too small.
Conflict miss – a block was evicted because another block was mapped to the same set.
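A sketch of how these classes are commonly attributed in simulation (the standard "3C" accounting: compulsory = first reference; capacity = would also miss in a fully associative LRU cache of the same size; conflict = the rest); sizes and the trace are illustrative:

```python
from collections import OrderedDict

def classify_misses(trace, num_lines):
    """Classify direct-mapped misses as compulsory/capacity/conflict."""
    seen, fa = set(), OrderedDict()          # fa = fully associative LRU
    direct = {}                              # direct mapped: line -> block
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for x in trace:
        fa_hit = x in fa                     # would the FA cache hit?
        if fa_hit:
            fa.move_to_end(x)
        else:
            if len(fa) == num_lines:
                fa.popitem(last=False)
            fa[x] = True
        line = x % num_lines                 # direct-mapped placement
        if direct.get(line) == x:
            continue                         # direct-mapped hit
        direct[line] = x
        if x not in seen:
            counts["compulsory"] += 1
            seen.add(x)
        elif not fa_hit:
            counts["capacity"] += 1          # FA cache of same size misses too
        else:
            counts["conflict"] += 1          # only the mapping is to blame
    return counts

print(classify_misses([0, 8, 0, 8, 1, 2, 3, 0], num_lines=4))
```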

Average case performance of merge-sort in the cache model
We want to estimate the number of cache misses incurred by the algorithm:
Compulsory misses are unavoidable.
Capacity misses are minimized by the I/O algorithm.
We can quantify the expected number of conflict misses.

When does a conflict miss occur?
s cache sets are available for k runs S_1 … S_k. The expected number of elements in any run S_i is N/k.
A leading block is a cache line containing the leading element of a run; b_i is the leading block of S_i.
A conflict occurs when two leading blocks are mapped to the same cache set.

When does a conflict miss occur – Cont.
Formally, a conflict miss occurs for element S_{i,j+1} when there is at least one element x in a leading block b_k, k ≠ i, such that S_{i,j} < x < S_{i,j+1} and S(b_i) = S(b_k).

How many conflict misses to expect
P_i = the probability of conflict for element i, 1 ≤ i ≤ N.
Assume a uniform distribution of:
- the leading blocks among the cache sets,
- the leading element within its leading block.
If k is Ω(s), then P_i is Ω(1), so in each merge pass the number of conflict misses is Ω(N).
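A Monte Carlo sketch of the underlying birthday-style effect: place the k leading blocks uniformly among s cache sets and estimate how often at least two of them collide. This is only an illustrative proxy for P_i, with assumed parameter values:

```python
import random

def collision_probability(k, s, trials=10_000):
    """Estimate P(some two of k leading blocks map to the same set)."""
    hits = 0
    for _ in range(trials):
        sets = [random.randrange(s) for _ in range(k)]
        if len(set(sets)) < k:               # at least one shared set
            hits += 1
    return hits / trials

# As k grows toward s, collisions become near-certain (k = Omega(s)).
for k in (8, 32, 128, 512):
    print(k, collision_probability(k, s=512))
```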

How many conflict misses to expect – Cont.
The expected number of conflict misses throughout merge-sort is Ω(N) per pass, i.e., on the order of N log_k(N/M) over the roughly log_k(N/M) merge passes.
By choosing k << s we reduce the probability of conflict misses, but we incur more capacity misses.

Conclusions
There is a way to transform I/O-efficient algorithms into cache-efficient algorithms.
The emulation shown applies only to a blocking, direct-mapped cache that does not distinguish between reads and writes.
The constants hidden in these orders of magnitude matter in practice.