Comp 120, Fall 2005 (November 8)
9 classes to go! Read Section 7.5, especially important!
Assignment 7 Bug
As assigned, the forward and backward sums come out equal! I wanted them to be different! Using X = 0.99 will fix it.
Direct-Mapping Example
[Figure: cache with Tag and Data columns, alongside Memory]
With 8-byte BLOCKS, the bottom 3 bits determine the byte in the BLOCK.
With 4 cache BLOCKS, the next 2 bits determine which BLOCK to use.
1028d = 10000000100b, line = 00b = 0d
1000d = 01111101000b, line = 01b = 1d
1040d = 10000010000b, line = 10b = 2d
Direct Mapping Miss
What happens when we now ask for address 1012?
1012d = 01111110100b, line = 10b = 2d, but earlier we put 1040d there!
1040d = 10000010000b, line = 10b = 2d
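A minimal sketch of the index arithmetic the two slides above walk through, in C++ (the function name is made up; it assumes the example's 4-block cache with 8-byte blocks):

#include <cstdio>
#include <cstdint>

// Decompose an address for a direct-mapped cache with
// 8-byte blocks (3 offset bits) and 4 blocks (2 index bits).
void decompose(uint32_t addr) {
    uint32_t offset = addr & 0x7;        // bottom 3 bits: byte within the block
    uint32_t line   = (addr >> 3) & 0x3; // next 2 bits: which cache block
    uint32_t tag    = addr >> 5;         // remaining bits: the tag
    std::printf("%4u -> line %u, offset %u, tag %u\n", addr, line, offset, tag);
}

int main() {
    decompose(1028); // line 0
    decompose(1000); // line 1
    decompose(1040); // line 2
    decompose(1012); // line 2: collides with 1040!
}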
Some Associativity Can Help
Direct-mapped caches are very common but can cause problems...
SET ASSOCIATIVITY can help: build multiple direct-mapped caches, then compare multiple TAGS (see the sketch after this list).
–2-way set associative = 2 direct-mapped caches + 2 TAG comparisons
–4-way set associative = 4 direct-mapped caches + 4 TAG comparisons
Now an array size that is a power of 2 doesn't get us in trouble.
But:
–slower
–less memory in the same area
–maybe direct mapped wins...
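A sketch of what "multiple direct-mapped caches + multiple TAG comparisons" means, assuming the same 8-byte blocks as the earlier example; the sizes, struct, and function names are illustrative, not from the slides:

#include <cstdint>

const int SETS = 4;     // illustrative size
const int WAYS = 2;     // 2-way set associative

struct Line { bool valid; uint32_t tag; uint8_t data[8]; };
Line cache[SETS][WAYS]; // WAYS direct-mapped arrays side by side

// Index both ways with the same set bits, then compare both tags.
// In hardware the WAYS comparisons happen in parallel; here it's a loop.
bool lookup(uint32_t addr, Line*& hit) {
    uint32_t set = (addr >> 3) % SETS;  // skip the 3 block-offset bits
    uint32_t tag = addr >> 5;
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            hit = &cache[set][w];
            return true;                // one of the tag comparators matched
        }
    return false;                       // miss: pick a victim way and refill
}

With this organization, 1012 and 1040 can live in the two ways of the same set instead of evicting each other.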
Associative Cache
[Figure: block diagram of a set-associative cache]
What About Store?
What happens in the cache on a store?
–WRITE BACK CACHE: put it in the cache, write memory on replacement
–WRITE THROUGH CACHE: put it in the cache and in memory
What happens on a store that MISSes?
–WRITE BACK will fetch the line into the cache
–WRITE THROUGH might just put it in memory
(A sketch of both policies follows.)
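Here is one way the two store policies could look in C++. This is a sketch only: the struct, the dirty bit, and the three helper declarations are stand-ins I've invented, not anything from the slides.

#include <cstdint>

struct CacheLine { bool valid, dirty; uint32_t tag; uint8_t data[8]; };

bool       cache_lookup(uint32_t addr, CacheLine*& line); // stand-in: tag match?
CacheLine* fetch_line(uint32_t addr);                     // stand-in: refill from memory
void       write_memory(uint32_t addr, uint8_t value);    // stand-in: DRAM write

// Write-through: update the cache line if present, and ALWAYS write memory.
void store_write_through(uint32_t addr, uint8_t value) {
    CacheLine* line;
    if (cache_lookup(addr, line))
        line->data[addr & 0x7] = value;
    write_memory(addr, value);       // memory is always kept up to date
}

// Write-back: on a miss, fetch the line first; memory is written only
// when the dirty line is eventually replaced.
void store_write_back(uint32_t addr, uint8_t value) {
    CacheLine* line;
    if (!cache_lookup(addr, line))
        line = fetch_line(addr);     // a store miss allocates the line
    line->data[addr & 0x7] = value;
    line->dirty = true;              // flushed on replacement, not now
}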
Cache Block Size and Hit Rate
Increasing the block size tends to decrease the miss rate:
[Figure: miss rate vs. block size]
Use split instruction/data caches, because there is more spatial locality in code:
[Figure: miss rates for split caches]
Cache Performance
Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
Two ways of improving performance:
–decreasing the miss ratio
–decreasing the miss penalty
What happens if we increase block size?
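The simplified model is easy to write down as code. A sketch (function and parameter names are mine, and it assumes the miss ratio is per instruction, as the slide's formula does):

// execution time = (execution cycles + stall cycles) * cycle time
double exec_time_ns(double instructions, double base_cpi,
                    double miss_ratio, double miss_penalty_cycles,
                    double cycle_time_ns) {
    double exec_cycles  = instructions * base_cpi;
    double stall_cycles = instructions * miss_ratio * miss_penalty_cycles;
    return (exec_cycles + stall_cycles) * cycle_time_ns;
}

Note the tension when block size grows: the miss ratio usually drops (more spatial locality captured) but the miss penalty rises (more bytes to fetch per miss).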
Associative Performance
[Figure: performance vs. associativity]
Multilevel Caches
We can reduce the miss penalty with a 2nd-level cache. Add a second level cache:
–often the primary cache is on the same chip as the processor
–use SRAMs to add another cache above primary memory (DRAM)
–miss penalty goes down if the data is in the 2nd-level cache
Example:
–Base CPI = 1.0 on a 500 MHz machine with a 5% miss rate and 200 ns DRAM access
–Adding a 2nd-level cache with 20 ns access time decreases the miss rate to 2%
Using multilevel caches:
–try to optimize the hit time on the 1st-level cache
–try to optimize the miss rate on the 2nd-level cache
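Working the example's numbers (assuming one memory reference per instruction, as the textbook's version of this example does): at 500 MHz the cycle time is 2 ns, so a 200 ns DRAM access costs 100 cycles. Without the 2nd-level cache, effective CPI = 1.0 + 5% × 100 = 6.0. With a 20 ns (10-cycle) 2nd-level cache, CPI = 1.0 + 5% × 10 + 2% × 100 = 3.5, about a 1.7× speedup.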
Matrix Multiply
A VERY common operation in scientific programs.
Multiply an L×M matrix by an M×N matrix to get an L×N matrix result.
This requires L*N inner products, each requiring M multiplies and M adds,
so 2*L*M*N floating point operations in all.
Definitely a FLOATING POINT INTENSIVE application:
with L = M = N = 100, that's 2 million floating point operations.
Matrix Multiply
const int L = 2;
const int M = 3;
const int N = 4;

void mm(double A[L][M], double B[M][N], double C[L][N])
{
    for(int i=0; i<L; i++)
        for(int j=0; j<N; j++) {
            double sum = 0.0;
            for(int k=0; k<M; k++)
                sum = sum + A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
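A quick driver to exercise the routine above (appended to the slide's code; the values are arbitrary, chosen so the result is easy to eyeball):

#include <cstdio>

int main() {
    double A[L][M] = {{1, 2, 3}, {4, 5, 6}};
    double B[M][N] = {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}};
    double C[L][N];
    mm(A, B, C);                       // C should be A with a zero column appended
    for (int i = 0; i < L; i++) {
        for (int j = 0; j < N; j++)
            std::printf("%6.1f ", C[i][j]);
        std::printf("\n");
    }
}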
Matrix Memory Layout
Our memory is a 1D array of bytes. How can we put a 2D thing in a 1D memory?
double A[2][3];
Row major:    addr = base + (i*3 + j)*8
Column major: addr = base + (i + j*2)*8
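C and C++ are row major (column major is Fortran's convention), so the first formula can be checked directly. A small sketch that prints each element's byte offset next to the formula's prediction:

#include <cstdio>
#include <cstdint>

int main() {
    double A[2][3];
    uint8_t* base = (uint8_t*)&A[0][0];
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++) {
            // Row major: the actual offset should match base + (i*3 + j)*8
            long offset = (uint8_t*)&A[i][j] - base;
            std::printf("A[%d][%d] at byte offset %ld (formula: %d)\n",
                        i, j, offset, (i * 3 + j) * 8);
        }
}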
Where Does the Time Go?
The inner loop takes all the time:
for(int k=0; k<M; k++)
    sum = sum + A[i][k] * B[k][j];

L1: mul   $t1, i, M        # address of A[i][k]:
    add   $t1, $t1, k      #   (i*M + k)*8 + A
    mul   $t1, $t1, 8
    add   $t1, $t1, A
    l.d   $f1, 0($t1)      # load A[i][k]
    mul   $t2, k, N        # address of B[k][j]:
    add   $t2, $t2, j      #   (k*N + j)*8 + B
    mul   $t2, $t2, 8
    add   $t2, $t2, B
    l.d   $f2, 0($t2)      # load B[k][j]
    mul.d $f3, $f1, $f2    # multiply
    add.d $f4, $f4, $f3    # accumulate into sum
    add   k, k, 1          # k++
    slt   $t0, k, M        # loop while k < M
    bne   $t0, $zero, L1
Change Index * to +
The inner loop takes all the time:
for(int k=0; k<M; k++)
    sum = sum + A[i][k] * B[k][j];

L1: l.d   $f1, 0($t1)
    add   $t1, $t1, AColStep   # AColStep = 8
    l.d   $f2, 0($t2)
    add   $t2, $t2, BRowStep   # BRowStep = 8 * N
    mul.d $f3, $f1, $f2
    add.d $f4, $f4, $f3
    add   k, k, 1
    slt   $t0, k, M
    bne   $t0, $zero, L1
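The same strength reduction written back out at the source level, as a sketch (names are mine): the A[i][k] and B[k][j] index multiplications become two striding pointers.

double inner_product(const double* a, const double* b, int M, int N) {
    double sum = 0.0;
    for (int k = 0; k < M; k++) {
        sum += *a * *b;
        a += 1;   // AColStep: next element of A's row, 8 bytes away
        b += N;   // BRowStep: next element of B's column, 8*N bytes away
    }
    return sum;
}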
Eliminate k, Use an Address Instead
The inner loop takes all the time:
for(int k=0; k<M; k++)
    sum = sum + A[i][k] * B[k][j];

L1: l.d   $f1, 0($t1)
    add   $t1, $t1, AColStep
    l.d   $f2, 0($t2)
    add   $t2, $t2, BRowStep
    mul.d $f3, $f1, $f2
    add.d $f4, $f4, $f3
    bne   $t1, LastA, L1    # loop until A's address reaches LastA
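And the counter elimination at the source level, a sketch with hypothetical names: loop until A's pointer hits one past the end of the row, which is exactly what the `bne $t1, LastA, L1` test does.

double inner_product_ptr(const double* a, const double* b, int M, int N) {
    const double* lastA = a + M;   // one past the end of A's row
    double sum = 0.0;
    while (a != lastA) {           // the LastA comparison replaces k < M
        sum += *a++ * *b;
        b += N;
    }
    return sum;
}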
We Made It Faster
The inner loop takes all the time:
for(int k=0; k<M; k++)
    sum = sum + A[i][k] * B[k][j];

L1: l.d   $f1, 0($t1)
    add   $t1, $t1, AColStep
    l.d   $f2, 0($t2)
    add   $t2, $t2, BRowStep
    mul.d $f3, $f1, $f2
    add.d $f4, $f4, $f3
    bne   $t1, LastA, L1

Now this is FAST! Only 7 instructions in the inner loop!
BUT... when we try it on big matrices it slows way down. What's up?
Now Where Is the Time?
The inner loop takes all the time:
for(int k=0; k<M; k++)
    sum = sum + A[i][k] * B[k][j];

L1: l.d   $f1, 0($t1)
    add   $t1, $t1, AColStep
    l.d   $f2, 0($t2)      # lots of time wasted here!
    add   $t2, $t2, BRowStep
    mul.d $f3, $f1, $f2
    add.d $f4, $f4, $f3    # possibly a little stall right here
    bne   $t1, LastA, L1
Why?
The inner loop takes all the time:
for(int k=0; k<M; k++)
    sum = sum + A[i][k] * B[k][j];

L1: l.d   $f1, 0($t1)      # this load usually hits (maybe 3 of 4)
    add   $t1, $t1, AColStep
    l.d   $f2, 0($t2)      # this load always misses!
    add   $t2, $t2, BRowStep
    mul.d $f3, $f1, $f2
    add.d $f4, $f4, $f3
    bne   $t1, LastA, L1

The A load strides by 8 bytes, so with 32-byte blocks 3 of every 4 loads hit in the same block. The B load strides by 8*N bytes, so for a big matrix every load lands in a different block.
Matrix Multiply Simulation
[Plot: Cycles per multiply-accumulate (MAC) vs. matrix size N×N, for a simulated 2k direct-mapped cache with 32- and 16-byte blocks]
Cultural Highlight
18th Annual Sculpture in the Garden Exhibition at the NC Botanical Garden
Veteran's Day Friday
8 classes to go!
Read Section 7.5, especially important!