
The Storage Hierarchy

Registers                        (fast, expensive, few)
Cache memory
Main memory (RAM)
Hard disk
Removable media (CD, DVD, etc.)
Internet                         (slow, cheap, a lot)

What is Main Memory?
Where data reside for a program that is currently running.
Sometimes called RAM (Random Access Memory): you can load from or store into any main memory location at any time.
Much slower than cache => much cheaper per byte => much bigger.

What Main Memory Looks Like
You can think of main memory as a big long 1D array of bytes, addressed 0, 1, 2, ..., 536,870,911 (for example, a 512 MB memory).

Cache

Cache
Caching is a technology based on the memory subsystem of your computer. The main purpose of a cache is to accelerate your computer while keeping its price low: caching lets you do your computer tasks more rapidly.

Librarian Example
Imagine a librarian behind a desk whose job is to fetch the books you ask for from a set of stacks in a storeroom.
A customer asks for the book Moby Dick. The librarian goes into the storeroom, gets the book, returns to the counter, and gives it to the customer.
Later, the customer comes back to return the book. The librarian takes it, carries it back to the storeroom, and then returns to the counter to wait for the next customer.
Suppose the next customer also asks for Moby Dick. The librarian has to go back into the storeroom to fetch the very book he just put away and hand it to the customer.

Librarian Example, Continued
Under this model, the librarian makes a complete round trip to fetch every book, even very popular ones that are requested frequently.
Is there a way to improve the performance of the librarian?

Librarian Example with Cache
Let's give the librarian a backpack that can store 10 books (in computer terms, the librarian now has a 10-book cache). Into this backpack he will put the books that customers return to him, up to a maximum of 10.
The day starts. The librarian's backpack is empty. Our first customer arrives and asks for Moby Dick. The librarian goes to the storeroom, gets the book, and gives it to the customer.

Librarian Example with Cache, Continued
Later, the customer returns the book. Instead of carrying it back to the storeroom, the librarian checks whether the backpack is full; it isn't, so he puts the book in the backpack and stays at the counter.
Another customer arrives and asks for Moby Dick. Before going to the storeroom, the librarian checks whether the title is in his backpack. He finds it! All he has to do is take the book from the backpack and give it to the customer. There's no trip to the storeroom, so the customer is served more efficiently.

Not in the Backpack
What if a customer asks for a title that is not in the cache (the backpack)? In that case the librarian is less efficient with a cache than without one, because he takes the time to look for the book in the backpack first.
Even in this simple example, though, the latency (the waiting time) of searching the cache is so small compared to the time of a walk to the storeroom that it is irrelevant: the cache is small (10 books), and noticing a miss takes only a tiny fraction of the time that a storeroom trip takes.
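In code, the backpack is just a tiny lookup structure that gets searched before the slow fetch. Here is a minimal Fortran sketch of that idea; the module, the helper names, and the details are illustrative, not from the original slides:

  ! Sketch of the librarian's backpack as a tiny cache.
  module backpack_cache
    implicit none
    integer, parameter :: CAPACITY = 10     ! a 10-book cache
    integer :: books(CAPACITY) = 0          ! 0 marks an empty slot
  contains
    logical function in_backpack(book_id)   ! search the cache first
      integer, intent(in) :: book_id
      in_backpack = any(books == book_id)
    end function in_backpack

    subroutine return_book(book_id)         ! keep a returned book if there is room
      integer, intent(in) :: book_id
      integer :: slot
      do slot = 1, CAPACITY
        if (books(slot) == 0) then
          books(slot) = book_id
          return
        end if
      end do
      ! Backpack full: the book would go back to the storeroom instead.
    end subroutine return_book
  end module backpack_cache

Serving a request is then: if in_backpack(id) succeeds, hand the book over immediately (a hit); otherwise make the round trip to the storeroom (a miss).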

What is Cache?
A special kind of memory where data reside that are about to be used or have just been used.
Very fast => very expensive => very small (typically 100 to 10,000 times as expensive as RAM per byte).
Data in cache can be loaded into or stored from registers at speeds comparable to the speed of performing computations.
Data that are not in cache (but that are in main memory) take much longer to load or store.
Cache is near the CPU: either inside the CPU or on the motherboard that the CPU sits on.

The Relationship Between Main Memory & Cache

RAM is Slow
The speed of data transfer between main memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.

From Cache to the CPU
Typically, data move between cache and the CPU at speeds relatively near to that of the CPU performing calculations.

Why Have Cache?
Cache is much closer to the speed of the CPU, so the CPU doesn't have to wait nearly as long for things that are already in cache: it can do more operations per second!

Important Facts about Cache
Cache technology is the use of a faster but smaller memory type to accelerate a slower but larger memory type.
When using a cache, you must check the cache to see whether an item is in there. If it is, it's called a cache hit. If not, it's called a cache miss, and the computer must wait for a round trip to the larger, slower memory area.
A cache has some maximum size that is much smaller than the larger storage area.

Facts about Cache, Continued
It is possible to have multiple layers of cache. In the librarian example, the smaller but faster memory type is the backpack, and the storeroom represents the larger and slower memory type: a one-level cache.
There might be another layer of cache consisting of a shelf behind the counter that can hold 100 books. The librarian checks the backpack, then the shelf, and then the storeroom: a two-level cache.

How Cache Works
When you request data from a particular address in main memory, here's what happens:
1. The hardware checks whether the data for that address are already in cache. If so, it uses them.
2. Otherwise, it loads from main memory the entire cache line that contains the address.
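How might the hardware do step 1? In a direct-mapped cache (the organization assumed in Example 1 below), each memory block can live in exactly one cache line, and a stored tag records which block each line currently holds, so the check is a single comparison. The following Fortran sketch uses Example 1's parameters (32-byte lines, 64 lines); it illustrates the bookkeeping, and is not how any particular CPU implements it:

  ! Sketch of a direct-mapped cache lookup (illustrative).
  module direct_mapped_cache
    implicit none
    integer, parameter :: CLS = 32       ! bytes per cache line
    integer, parameter :: CS  = 64       ! number of cache lines
    integer :: tags(0:CS-1) = -1         ! which memory block each line holds
  contains
    logical function lookup(address)
      integer, intent(in) :: address
      integer :: block, line
      block = address / CLS              ! which CLS-byte block of memory
      line  = mod(block, CS)             ! the one line this block may occupy
      if (tags(line) == block) then
        lookup = .true.                  ! hit: the line already holds this block
      else
        tags(line) = block               ! miss: load the whole line from RAM,
        lookup = .false.                 ! evicting whatever was there before
      end if
    end function lookup
  end module direct_mapped_cache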

If It's in Cache, It's Also in RAM
If a particular memory address is currently in cache, then it's also in main memory (RAM). That is, all of a program's data are in main memory, but some are also in cache.

Cache Use Jargon
Cache hit: the data that the CPU needs right now are already in cache (remember the librarian example).
Cache miss: the data that the CPU needs right now are not currently in cache.
If all of your data are small enough to fit in cache, then when you run your program, you'll get almost all cache hits (except at the very beginning), which means your performance can be excellent!
Sadly, this rarely happens in real life: most problems of scientific or engineering interest are bigger than just a few MB.
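To see how much the hit rate matters, plug in some illustrative numbers (these are assumptions, not from the slides): suppose a hit costs 1 ns and a miss costs 100 ns. At a 99% hit rate, the average access time is 0.99 × 1 + 0.01 × 100 ≈ 2 ns; at a 90% hit rate, it is 0.90 × 1 + 0.10 × 100 = 10.9 ns. Losing just nine points of hit rate makes the average memory access roughly five times slower.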

Cache Lines
A cache line is a small, contiguous region in cache, corresponding to a contiguous region in RAM of the same size, that is loaded all at once. Typical cache line sizes: 32 to 1024 bytes.
Cache reuse: your program is much more efficient if all of its data and instructions fit in cache; try to use what's in cache many times before touching anything that isn't in cache (e.g., by tiling).

Improving Your Cache Hit Rate
Many scientific codes use far more data than can fit in cache all at once. You therefore need to achieve a high cache hit rate even though you have much more data than cache.
So, how can you improve your cache hit rate? Use the same solution as in real estate: location, location, location!

Cache Example 1
Cache Line Size (CLS) = 32 bytes
Cache Size (CS) = 64 lines
Direct mapping
Each array element is 4 bytes long

  real a(1024, 100)
  ...
  do i = 1, 1024
    do j = 1, 100
      a(i,j) = a(i,j) + 1
    end do
  end do

Cache misses = 100 * 1024 = 102,400
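Why does every access miss? Fortran stores arrays column-major, so a(i,j) and a(i,j+1) are 1024 elements × 4 bytes = 4096 bytes apart. With i in the outer loop, the inner loop jumps 4096 bytes, or 4096 / 32 = 128 cache blocks, on every iteration, and 128 mod 64 = 0, so under direct mapping every access in an inner-loop pass competes for the same cache line. Each line is evicted long before its neighboring elements are needed again, so all 102,400 accesses miss.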

Example 1 Using Loop Interchange
Interchange the i and j loops (note that the loop bounds move with the loop variables):

  real a(1024, 100)
  ...
  do j = 1, 100
    do i = 1, 1024
      a(i,j) = a(i,j) + 1
    end do
  end do

This takes advantage of spatial locality to reduce cache misses.
Cache misses = 102,400 / 8 = 12,800, the number of cache lines occupied by the array a.
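You can measure the difference yourself. Below is a small self-contained harness (an illustrative sketch, not from the slides) using the standard CPU_TIME intrinsic; an array this small may fit in an outer cache level, so scale the dimensions up if the gap looks small:

  ! Times both loop orders from Example 1.
  program loop_order_timing
    implicit none
    real :: a(1024, 100)
    real :: t0, t1, t2
    integer :: i, j

    a = 0.0
    call cpu_time(t0)
    do i = 1, 1024           ! bad: consecutive accesses 4096 bytes apart
      do j = 1, 100
        a(i,j) = a(i,j) + 1
      end do
    end do
    call cpu_time(t1)
    do j = 1, 100            ! good: consecutive accesses adjacent in memory
      do i = 1, 1024
        a(i,j) = a(i,j) + 1
      end do
    end do
    call cpu_time(t2)
    print *, 'i-outer (bad stride):  ', t1 - t0, ' seconds'
    print *, 'j-outer (unit stride): ', t2 - t1, ' seconds'
  end program loop_order_timing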

Tiling

Tile: a small rectangular subdomain of a problem domain, sometimes called a block or a chunk.
Tiling: breaking the domain into tiles.
Tiling strategy: operate on each tile to completion, then move on to the next tile.
Tile size can be set at runtime, according to what's best for the machine that you're running on.

Example 2

  do j = 1, 100
    do i = 1, 4096
      a(i) = a(i) * a(i)
    end do
  end do

Cache misses = (4096 / 8) * 100 = 51,200
The array takes 4096 × 4 = 16,384 bytes but the cache holds only 64 × 32 = 2048 bytes (512 elements), so after accessing elements a(1..512) the cache is full, and every further line loaded must evict a line already in cache. Each pass of the j loop therefore reloads the entire array.
Each element a(i) is reused 100 times => we can apply tiling to reduce the number of cache misses.

Example 2 Using Tiling

  do ii = 1, 4096, B
    do j = 1, 100
      do i = ii, min(ii+B-1, 4096)
        a(i) = a(i) * a(i)
      end do
    end do
  end do

The cache can accommodate 512 elements of a. Thus, if we choose B = 512, each tile stays in cache for all 100 passes of the j loop, and each of the array's 512 cache lines is loaded from memory only once: cache misses = 4096 / 8 = 512.
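Where does B = 512 come from? The cache holds CS × CLS = 64 × 32 = 2048 bytes, and each element is 4 bytes, so at most 2048 / 4 = 512 elements fit at once. On a real machine you would make B a runtime parameter and tune it, since the cache is also shared with other data and instructions.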

A Sample Application: Matrix-Matrix Multiply

Let A, B and C be matrices of sizes nr × nc, nr × nq and nq × nc, respectively (dst, src1 and src2 in the code below), with A = B C; that is,

  A(r,c) = sum over q = 1, ..., nq of B(r,q) * C(q,c)

Matrix Multiply: Naïve Version

  SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
   &                                   nr, nc, nq)
    IMPLICIT NONE
    INTEGER,INTENT(IN) :: nr, nc, nq
    REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
    REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
    REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
    INTEGER :: r, c, q
    DO c = 1, nc
      DO r = 1, nr
        dst(r,c) = 0.0
        DO q = 1, nq
          dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
        END DO
      END DO
    END DO
  END SUBROUTINE matrix_matrix_mult_naive

Multiplying Within a Tile

  SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
   &           rstart, rend, cstart, cend, qstart, qend)
    IMPLICIT NONE
    INTEGER,INTENT(IN) :: nr, nc, nq
    REAL,DIMENSION(nr,nc),INTENT(INOUT) :: dst   ! INOUT: dst accumulates across tile calls
    REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
    REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
    INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend
    INTEGER :: r, c, q
    DO c = cstart, cend
      DO r = rstart, rend
        IF (qstart == 1) dst(r,c) = 0.0          ! initialize only on the first q tile
        DO q = qstart, qend
          dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
        END DO
      END DO
    END DO
  END SUBROUTINE matrix_matrix_mult_tile
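The tile routine needs a driver that sweeps over the tiles. The slides shown here don't include one, so the following is a minimal sketch; the name matrix_matrix_mult_by_tiling and the three tile-size arguments are assumptions, but the CALL matches the argument list of matrix_matrix_mult_tile above:

  SUBROUTINE matrix_matrix_mult_by_tiling (dst, src1, src2, nr, nc, nq, &
   &           rtilesize, ctilesize, qtilesize)
    IMPLICIT NONE
    INTEGER,INTENT(IN) :: nr, nc, nq
    REAL,DIMENSION(nr,nc),INTENT(INOUT) :: dst
    REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
    REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
    INTEGER,INTENT(IN) :: rtilesize, ctilesize, qtilesize
    INTEGER :: rstart, cstart, qstart
    DO cstart = 1, nc, ctilesize
      DO rstart = 1, nr, rtilesize
        DO qstart = 1, nq, qtilesize    ! qstart == 1 makes the tile routine zero dst
          CALL matrix_matrix_mult_tile(dst, src1, src2, nr, nc, nq,  &
   &          rstart, MIN(rstart + rtilesize - 1, nr),               &
   &          cstart, MIN(cstart + ctilesize - 1, nc),               &
   &          qstart, MIN(qstart + qtilesize - 1, nq))
        END DO
      END DO
    END DO
  END SUBROUTINE matrix_matrix_mult_by_tiling

Setting all three tile sizes equal to nr, nc and nq collapses this to a single call over the whole problem, which is the "turn tiling off" trick mentioned on the next slide.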

The Advantages of Tiling
It allows your code to exploit data locality better, getting much more cache reuse: your code runs faster!
It's a relatively modest amount of extra coding (typically a few wrapper functions and some changes to loop bounds).
If you don't need tiling (because of the hardware, the compiler or the problem size), you can turn it off simply by setting the tile size equal to the problem size.

Will Tiling Always Work?
Tiling won't always work. Why? Tiling works well when:
the order in which calculations occur doesn't matter much, AND
there are lots and lots of calculations to do for each memory movement.
If either condition is absent, then tiling may not help.