
Memory Operation and Performance Goal: to understand the memory architecture so that you can write programs that take advantage of it and run faster. This lecture covers: Memory Systems, Caches, Virtual Memory (VM).

Caches – fast memory between the CPU and main memory. Cache Design Parameters. A Diversity of Caches (different levels of caches). Looking at the Caches. Cache-aware Programming (column-major vs. row-major order; row-major access makes the program faster).

A Diversity of Caches Multiple Levels of Caches (L1, L2, and L3; L1 is faster than L2, and L2 is faster than L3; L1 is also more expensive per byte than L2). On-Chip Caches (the cache built into the CPU chip, i.e. L1). Instruction and Data Caches (instructions and data are kept in separate caches; it is likely that an instruction will be reused soon).

Instruction and Data Cache – [two photos of a CPU (Central Processing Unit): the bottom photo shows the CPU chip from the outside; the top photo is a die map of the inside of the CPU, showing the data cache and the instruction cache.]

Multiple Levels of Caches Modern computer systems don't have just one cache between the CPU and memory; there is usually a hierarchy of caches. The caches are usually called L1, L2, etc., which is shorthand for Level 1, Level 2, and so on. The L1 cache is within the CPU and is therefore the fastest and smallest, but the most expensive per byte. The last cache (usually L2 or L3) is the one that loads data directly from the DRAM main memory, and is the least expensive.

L1 and L2 Cache Level 1 cache memory is included in the CPU itself. Level 2 cache memory is outside the CPU. The photo below shows Level 2 cache memory on the processor.

Example – shows the benefit of cache memory at different levels. The idea is to divide the data into two halves, sort each half recursively, then merge:

/* Assumes n is a power of two */
void merge_sort(int *data, int n)
{
    int half = n >> 1;
    if (n == 1) return;
    merge_sort(data, half);          /* sort the left half          */
    merge_sort(data + half, half);   /* sort the right half         */
    merge(data, data + half, half);  /* merge the two sorted halves */
}
// no need to memorise
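The slide does not show merge(). A minimal sketch, assuming it merges two adjacent sorted runs of equal length through a temporary buffer (the buffer strategy is an assumption, not from the slide):

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: merge sorted runs a[0..n) and b[0..n),
 * where b == a + n, writing the merged result back over a and b. */
void merge(int *a, int *b, int n)
{
    int *tmp = malloc(2 * n * sizeof(int));
    int i = 0, j = 0, k = 0;
    while (i < n && j < n)
        tmp[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < n) tmp[k++] = a[i++];
    while (j < n) tmp[k++] = b[j++];
    memcpy(a, tmp, 2 * n * sizeof(int));
    free(tmp);
}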

Graph of Merge Sort – the access times in nanoseconds (ns) for the L1 cache (T1), L2 cache (T2), L3 cache (T3), and main memory (Tm).

Merge sort – in a fast, small cache (data fits in L1 only). Look at the total time.

Merge sort – in a slow cache (data fits in L3 only). Look at the total time; it is longer.

On-Chip Caches Multilevel caches can improve the performance of a computer. However, there is usually no major difference between having a single L3-sized cache and three levels of caches; it is not as significant as the difference between a single large cache and a single small one.

Instruction and Data Caches Programs fetch instructions in much more predictable ways than they access data. For instance, instruction fetches exhibit much more spatial locality than data accesses, because it is very likely that an instruction fetch will soon be followed by the fetch of the instruction next to it. For example, if the program is executing a++, there is a high chance it will next execute b++ and then c = a + b*3, as in: a++; b++; c = a + b*3; Even when a branch or jump instruction makes this untrue, it is very likely that the instruction fetched next will be one that has already been fetched recently.

Multiple levels of caches – note that L1 is within the CPU chip.

Looking at the cache design We can deduce many things about the cache design of a particular computer by carefully examining its memory performance. We can design a benchmark program whose locality we control, such as:

int data[MAXSIZE];
for (r = 0; r < repeat; r++) {
    for (i = 0; i < N; i++) {
        dummy = data[i];
    }
}

Explanation of the program This loop accesses a chunk of memory repeatedly. By varying N, we vary the temporal locality of the accesses. For example, for N == 4, each value data[i] is accessed once every 4 iterations, but if N is 16, each data[i] is accessed only once every 16 iterations. A cache of size 16 would make the benchmark perform much worse for N == 32 than for N == 8, because for N == 32 each data[i] would have been evicted (removed) from the cache before it was accessed again.

Control the spatial locality Here, stride controls the amount of spatial locality:

int data[MAXSIZE];
for (r = 0; r < repeat; r++) {
    for (i = 0; i < N; i += stride) {
        dummy = data[i];
    }
}
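A self-contained version of this benchmark that can be compiled and run directly might look like the sketch below. The array size, repeat count, timer (clock_gettime), and the volatile sink are assumptions added here, not part of the slide; the volatile keeps the compiler from optimising the loads away.

#include <stdio.h>
#include <time.h>

#define MAXSIZE (1 << 24)           /* 16M ints: larger than any cache */

static int data[MAXSIZE];
static volatile int dummy;          /* volatile: loads are not optimised out */

/* Touch N ints with the given stride, 'repeat' times, and report MB/s. */
static void run(int N, int stride, int repeat)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < repeat; r++)
        for (int i = 0; i < N; i += stride)
            dummy = data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb = (double)repeat * (N / stride) * sizeof(int) / (1 << 20);
    printf("N=%8d stride=%2d: %8.1f MB/s\n", N, stride, mb / secs);
}

int main(void)
{
    /* Vary temporal locality: double the working set each run. */
    for (int N = 1 << 10; N <= MAXSIZE; N <<= 1)
        run(N, 1, 100);
    return 0;
}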

Result of benchmark Transfer rate in MB/s. You can see the performance is not simply proportional to the L1 cache size. Why? The cache is effective for working sets between 512 bytes and 4 KB.

Interpretation of result We immediately notice that memory performance comes in three discrete steps. In the best performing step, the program accesses so little data that all of its references fit in the L1 cache, and the rest of the hierarchy is almost never needed. In the next step down, the references no longer fit in L1 but do fit in the L2 cache, and access to main memory is almost never required. So try to fit your working set into L1, then L2, and so on.

Graph showing the size of L1 and performance (performance = transfer rate).

The effect of stride (steps)

Cache-Aware Programming That is, how to optimise performance by avoiding: Instruction Cache Overflow; Cache Collisions; Under-used Cache Lines; Insufficient Temporal Locality.

Example (1) – 4 ms (assume 1 MB)

Example (2) – 3 ms (assume 1 MB)

Example (3) – 3 ms (assume 512 KB)

Example (4) – 2.5 ms (assume 512 KB)

Example (5) – 2.3 ms (assume 256 KB)

Example (5) – 2.3 ms

Instruction cache – program with a complicated for loop. Below is a program involving three complicated operations inside one loop (the operations themselves are not shown on the slide):

for (i = 0; i < MAX; i++) {
    /* operation 1; operation 2; operation 3 */
}

It is better to separate it into three loops, so that the code of each complicated operation can make full use of the instruction cache (see the sketch after this slide):

for (i = 0; i < MAX; i++) {
    /* operation 1 */
}
for (i = 0; i < MAX; i++) {
    /* operation 2 */
}
for (i = 0; i < MAX; i++) {
    /* operation 3 */
}
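As a concrete sketch of this loop-splitting idea (the operations blur, sharpen, and invert are hypothetical stand-ins, not from the slides):

/* Hypothetical operations on three arrays. */
for (i = 0; i < MAX; i++) a[i] = blur(a[i]);     /* operation 1 */
for (i = 0; i < MAX; i++) b[i] = sharpen(b[i]);  /* operation 2 */
for (i = 0; i < MAX; i++) c[i] = invert(c[i]);   /* operation 3 */

Each operation's code now stays resident in the instruction cache for the whole of its loop, instead of the three operations evicting one another on every iteration.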

Cache Collisions Cache collisions can also cause our programs to execute slowly. A cache collision occurs when a cache line is evicted (switched out) even though the cache is not full. It happens when a cache set is full and the system has to decide which line to remove (switch out).

Program of cache collision Below is a program involving the arrays a, b, and c:

int a[N];
int b[N];
int c[N];
for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}

Reason for Cache Collision It is possible that the compiler allocates a, b, and c at memory addresses that map to the same cache set. In this case, the assignment c[i] = a[i] + b[i] causes three cache misses in every iteration of the loop, because the cache keeps evicting exactly the cache line the CPU requires next: c[i], a[i], and b[i] all compete for the same cache line.
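As a concrete (hypothetical) illustration of how addresses map to sets: in a direct-mapped cache with 32-byte lines and 1024 sets (32 KB total; these numbers are assumptions, not from the slides), the set index of a byte address is computed as below, so any two arrays whose start addresses differ by a multiple of 32 KB collide on every corresponding element.

/* Hypothetical cache geometry, not from the slides. */
#define LINE 32
#define SETS 1024

unsigned set_index(unsigned long addr)
{
    return (addr / LINE) % SETS;   /* which set this byte maps to */
}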

Graph showing the Cache collision

The solution is to offset the memory locations (using the 32-byte cache line size from above):

#define CACHELINESIZE 32
#define COFFSET ((2 * CACHELINESIZE) / sizeof(int))

int a[N];
int b[N];
int c[N + COFFSET];
for (i = 0; i < N; i++) {
    c[i + COFFSET] = a[i] + b[i];
}

Graph showing cache after change

Under-used Cache Lines Suppose the cache line is 32 bytes wide, as it often is. If a program is reading contiguous (adjacent) 4-byte integers, the reference to the first will cause the first eight integers (integers 0–7) to be loaded into the cache. The reference to the 9th will cause integers 8–15 to be loaded, and so on. The hit ratio, even on a cold cache, will be at least 7/8, or 0.875. Now consider a program that reads integers with a stride of eight or more. This means the program reads the first integer, then the 9th (or beyond), then the 17th, etc.
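A worked version of this arithmetic (assuming the 32-byte lines and 4-byte ints above): a line holds 32 / 4 = 8 ints, so with a stride of s ints (s ≤ 8) the cold-cache hit ratio is (8 − s) / 8, e.g. 7/8 = 0.875 for s = 1 and 4/8 = 0.5 for s = 4. For s ≥ 8, every access touches a new line, so the cold-cache hit ratio drops to 0 and only 4 of every 32 loaded bytes are ever used.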

Graph showing the effect: cache misses.

Example of a matrix:

int data[M][N];
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        sum += data[j][i];   /* strides down a column: poor spatial locality */
    }
}

Row-major and Column-major

Accessing data in column-major order

Accessing row data It will be faster, as it accesses [0][0], [0][1], [0][2], ... in order; a whole run of elements, from [0][0] up to [1][3], is loaded into the cache line as soon as [0][0] is read.
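In code, swapping the loop order of the earlier matrix example makes the inner loop walk along a row, so consecutive accesses fall in the same cache line (a sketch using the same arrays as before):

int data[M][N];
for (j = 0; j < M; j++) {        /* row index varies slowly  */
    for (i = 0; i < N; i++) {    /* column index varies fast */
        sum += data[j][i];       /* contiguous in memory     */
    }
}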

Changing the order of the iterations is not always better. Below is an example: whichever loop order we choose, one of the two accesses (transposed[i][j] or original[j][i]) has good locality and the other does not.

int original[M][N];
int transposed[N][M];
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        transposed[i][j] = original[j][i];
    }
}

Effect of transposing This is the effect of the previous program: it rotates (transposes) the image.

Insufficient Temporal Locality A blocked version of the transpose:

int original[M][N];
int transposed[N][M];
for (k = 0; k < N / m; k++) {
    for (l = 0; l < M / n; l++) {
        for (i = k*m; i < (k+1)*m; i++) {
            for (j = l*n; j < (l+1)*n; j++) {
                transposed[i][j] = original[j][i];
            }
        }
    }
}

Blocked transpose gets around cache misses The block dimensions m and n are determined by the cache line size (say, 32 bytes), chosen so that each m × n block fits into the cache.
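A worked example of choosing the block size (assuming the 32-byte lines and 4-byte ints used earlier): a line holds 8 ints, so taking m = n = 8 makes each row of a block exactly one cache line; an 8 × 8 block of each array is then 8 lines (256 bytes), which fits comfortably in even a small cache, so every loaded line is fully used before it is evicted.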

Virtual Memory (VM) The term virtual memory refers to a combination of hardware and operating-system software that solves several computing problems. It gets a single name because it is a single mechanism, but it meets several goals: to simplify memory management and program loading by providing virtual addresses; and to allow multiple large programs to run without the need for large amounts of RAM, by providing virtual storage.

Virtual Addresses Segmentation – groups pages together into segments of different sizes. Memory Protection – because more than one process is supported, each process's memory must be protected from corruption by the others. Paging – uses the same fixed page size on disk and in memory, loading pages into memory and writing them back to disk. Computers hold several programs in memory at the same time.

Virtual Memory – Explanation [Figure: the sequence of virtual-memory operation when the program is larger than main memory; labels: Memory, Disk, Page.]

Two seemingly contradictory facts about VM: The compiler determines the addresses at which a program will execute, by hard-wiring many addresses of variables and instructions into the machine code it generates. Yet the location of the program is not determined until the program is executed, and it may be anywhere in main memory.

Solutions to the contradictory facts
Code Relocation: have the compiler generate addresses relative to a base address, and change the base address when the program is executed. This means the address of each reference is calculated explicitly by adding the relative address to the base address. Drawback:
Address Translation: at run time, give programs the illusion that there are no other programs in memory. Compilers can then generate any absolute addresses they wish; two programs may contain references to the same address without interference.

Virtual and Physical Addresses The addresses issued by the compiler are called virtual addresses. The addresses that result from translation are called physical addresses, because they refer to locations on an actual memory chip.
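As a sketch of the translation arithmetic (the 4 KB page size and the one-level page table are assumptions added here; real MMUs use multi-level tables):

#define PAGE_SIZE 4096UL

/* Hypothetical one-level page table: page_table[vpn] holds the
 * physical frame number for virtual page vpn. */
unsigned long translate(unsigned long vaddr, const unsigned long *page_table)
{
    unsigned long vpn    = vaddr / PAGE_SIZE;   /* virtual page number    */
    unsigned long offset = vaddr % PAGE_SIZE;   /* offset within the page */
    return page_table[vpn] * PAGE_SIZE + offset;
}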

Multiple programs without relocation Program A overlaps memory locations belonging to Program B.

Relocatable code can share memory Program A uses only the memory locations belonging to itself.

Summary Caches: L1 (within the CPU), L2, and L3. Data cache and instruction cache. Programming: column-major vs. row-major order; row-major access enhances performance. Virtual memory: main memory is too small to hold the whole program, so pages are loaded into memory on demand.