CS 61C: Great Ideas in Computer Architecture: Performance Programming, Technology Trends. Instructor: Justin Hsia. 7/10/2012, Summer 2012 -- Lecture #13
Review of Last Lecture: Cache performance measured in CPI_stall and AMAT; parameters that matter: HT, MR, MP. Multilevel caches reduce miss penalty; standard to have 2-3 cache levels (and split I$/D$); makes CPI/AMAT calculations more complicated. Set associativity reduces miss rate; the index field now determines the set. Cache design choices change performance parameters and cost.
Question: How many total bits are stored in the following cache? 4-way set associative cache, cache size 1 KiB, block size 16 B, write-back, 16-bit address space. 2^6 × (2^7 + 2^3 + 2^1) = 8.625 Kib ☐ 2^4 × (2^7 + 2^3 + 2^0) = 2.140625 Kib ☐ 2^4 × (2^7 + 2^3 + 2^1) = 2.15625 Kib ☐ 2^4 × (2^7 + 6 + 2^1) = 2.125 Kib ☐
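A worked derivation for checking the arithmetic (added here for reference; it is not on the original slide, but follows directly from the given parameters):

\[
\begin{aligned}
\text{blocks} &= \frac{1\ \text{KiB}}{16\ \text{B}} = 64 = 2^6, \qquad \text{sets} = \frac{64}{4} = 16 \Rightarrow 4\ \text{index bits}\\
\text{offset} &= \log_2 16 = 4\ \text{bits}, \qquad \text{tag} = 16 - 4 - 4 = 8\ \text{bits}\\
\text{bits per block} &= \underbrace{128}_{\text{data}} + \underbrace{8}_{\text{tag}} + \underbrace{1}_{\text{valid}} + \underbrace{1}_{\text{dirty}} = 2^7 + 2^3 + 2^1\\
\text{total} &= 2^6 \times (2^7 + 2^3 + 2^1) = 8832\ \text{bits} = 8.625\ \text{Kib}
\end{aligned}
\]

which matches the first option (the dirty bit appears because the cache is write-back).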
Question (previous midterm question): Which of the following cache changes will definitely increase L1 Hit Time? (more than one may be correct) Adding a unified L2$, which is larger than L1 but smaller than memory ☐ Increasing block size while keeping cache size constant ☐ Increasing associativity while keeping cache size constant ☐ Increasing cache size while keeping block size constant ☐ Answer: the last two (increasing associativity and increasing cache size). Increased associativity means we must check multiple Valid bits and Tags (and select among them using a MUX); increased cache size means we must index through more rows/sets.
Agenda Performance Programming Administrivia Perf Prog: Matrix Multiply Technology Trends The Need for Parallelism
Performance Programming: Adjust memory accesses in code (software) to improve miss rate. With an understanding of how caches work (cache size, block size, multiple levels), we can revise programs to improve cache utilization.
Performance of Loops and Arrays Array performance often limited by memory speed Goal: Increase performance by minimizing traffic from cache to memory Reduce your miss rate by getting better reuse of data already in the cache It is okay to access memory in different orderings as long as you still end up with the correct result Cache Blocking: “shrink” the problem by performing multiple iterations on smaller chunks that “fit well” in your cache Use Matrix Multiply as an example (Lab 6 and Project 2)
Ex: Looping Performance (1/5) We have an array int A[1024] that we want to increment (i.e. A[i]++). What will our miss rate be for a D$ with 1-word blocks? (array not in $ at start) 100% MR because each array element (word) is accessed just once. Can code choices change this? No. The array was just initialized, so it’s impossible for the data to be in the cache already. This means the D$ will miss on every access of A[] since we never revisit the same entry.
Ex: Looping Performance (2/5) We have an array int A[1024] that we want to increment (i.e. A[i]++). Now for a D$ with 2-word blocks, what are the best and worst miss rates we can achieve? Best: 50% MR via standard incrementation (each block will miss, then hit). Code: for(int i=0; i<1024; i++) A[i]++; The array was just initialized, so it’s impossible for the data to be in the cache already.
Ex: Looping Performance (3/5) We have an array int A[1024] that we want to increment (i.e. A[i]++) Now for a D$ with 2-word blocks, what are the best and worst miss rates we can achieve? Worst: 100% MR by skipping elements (assuming D$ smaller than half of array size) Code: for(int i=0; i<1024; i+=2) A[i]++; for(int i=1; i<1024; i+=2) A[i]++; The array was just initialized, so it’s impossible for the data to be in the cache already. This means the D$ will miss on every access of A[] since we never revisit the same entry.
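Putting the two orderings side by side as compilable C (a small sketch; the function names are invented for illustration, and the miss rates in the comments assume a cold cache with 2-word blocks, as in the slides above):

    #include <stddef.h>

    #define N 1024
    int A[N];                 /* assume A is not cached when each function starts */

    /* Unit stride: each miss loads A[i] and A[i+1] together, so every
       other access hits -> ~50% miss rate with 2-word blocks. */
    void increment_sequential(void) {
        for (size_t i = 0; i < N; i++)
            A[i]++;
    }

    /* Stride of 2: touch even elements, then odd elements. If the cache
       holds less than half the array, each block is evicted before its
       second word is touched -> ~100% miss rate. */
    void increment_strided(void) {
        for (size_t i = 0; i < N; i += 2)
            A[i]++;
        for (size_t i = 1; i < N; i += 2)
            A[i]++;
    }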
Ex: Looping Performance (4/5) We have an array int A[1024] that we want to increment (i.e. A[i]++). What does the increment operation look like in assembly?

    # assume $s0 holds the address of A[i]
    lw    $t0, 0($s0)     # load A[i]
    addiu $t0, $t0, 1     # A[i] + 1
    sw    $t0, 0($s0)     # store back to A[i]
    addiu $s0, $s0, 4     # advance to the next element
                          #   (this increment value may change depending on your loop in code)
Ex: Looping Performance (5/5) We have an array int A[1024] that we want to increment (i.e. A[i]++) For an I$ with 1-word blocks, what happens if we don’t use labels/loops? 100% MR, as all instructions are explicitly written out sequentially What if we loop by incrementing i by 1? Will miss on first pass over code, but should be found in I$ for all subsequent passes
Agenda Performance Programming Administrivia Perf Prog: Matrix Multiply Technology Trends The Need for Parallelism
Administrivia HW3 due Wednesday Midterm: Friday @ 9am in 245 Li Ka Shing Take old exams for practice (see Piazza post @209) One-sided sheet of handwritten notes MIPS Green Sheet provided; no calculators Will cover up through caches Mid-Semester Survey (part of Lab 7) Justin’s OH this week: Wed 5-7pm, Room TBD
Agenda Performance Programming Administrivia Perf Prog: Matrix Multiply Technology Trends The Need for Parallelism
Matrix Multiplication: c_ij = a_i* · b_*j, i.e., entry (i,j) of C is the dot product of row i of A with column j of B. Subscript indices are given in (row, col).
Naïve Matrix Multiply Advantage: Code simplicity. Disadvantage: Blindly marches through memory and caches.

    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            for (k=0; k<N; k++)
                c[i][j] += a[i][k] * b[k][j];
Matrices in Memory (1/2) Matrices are stored as 1-D arrays in memory (because memory is a 1-D array). Column major: A(i,j) at A + i + j*n. Row major: A(i,j) at A + i*n + j. C default is row major. [Figure: the same matrix with its elements numbered by memory position under column-major vs. row-major layout.]
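Connecting the two slides above, here is the naïve multiply written out as compilable C over row-major 1-D arrays (a sketch; the function name and the use of double are choices for this example, not taken from the lab code):

    #include <stddef.h>

    /* C = C + A*B for n x n matrices stored row-major: X(i,j) is X[i*n + j]. */
    void matmul_naive(size_t n, const double *A, const double *B, double *C) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                for (size_t k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

Note that consecutive accesses to B in the innermost loop are n elements apart (the loop walks down a column), which is exactly the cache-unfriendly pattern the next slides discuss.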
Matrices in Memory (2/2) How do cache blocks fit into this scheme? For a column-major matrix in memory, a single ROW of the matrix is spread among many different cache blocks (the slide’s figure highlights the row and the cache block boundaries).
Naïve Matrix Multiply (cache view) This is pseudocode, not actually written in any particular language:

    # implements C = C + A*B
    for i = 1 to n
        # read row i of A into cache
        for j = 1 to n
            # read c(i,j) into cache
            # read column j of B into cache
            for k = 1 to n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            # write c(i,j) back to main memory

Each (i,j) iteration computes C(i,j) = C(i,j) + A(i,:) * B(:,j).
Linear Algebra to the Rescue (1/2) We can get the same result for a matrix multiplication by splitting the matrices into smaller submatrices (“blocks”). For example, consider multiplying two 4×4 matrices.
Linear Algebra to the Rescue (2/2) Matrices of size N×N, split into a 4×4 grid of blocks, each of size r×r (N = 4r). Then C22 = A21B12 + A22B22 + A23B32 + A24B42 = Σk A2k*Bk2. The multiplication operates on small “block” matrices; choose their size so that they fit in the cache!
Blocked Matrix Multiply Blocked version of the naïve algorithm, where r = block size (assume r divides N) and X[i][j] = submatrix of X defined by block row i and block column j:

    for (i=0; i<N/r; i++)
        for (j=0; j<N/r; j++)
            for (k=0; k<N/r; k++)
                C[i][j] += A[i][k]*B[k][j];   // "+=" is an r × r matrix addition,
                                              // "*" is an r × r matrix multiplication
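The same idea can also be written as element-level C, in the style of the Lab 6 / Project 2 code. The sketch below is illustrative only: the function name, the use of double, and the BLOCK_SIZE value are assumptions for this example, and it assumes BLOCK_SIZE divides n evenly.

    #include <stddef.h>

    #define BLOCK_SIZE 32   /* example value; tune it so three blocks fit in your cache */

    /* C = C + A*B for n x n row-major matrices, assuming BLOCK_SIZE divides n. */
    void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
        for (size_t i = 0; i < n; i += BLOCK_SIZE)
            for (size_t j = 0; j < n; j += BLOCK_SIZE)
                for (size_t k = 0; k < n; k += BLOCK_SIZE)
                    /* multiply one r x r block of A by one r x r block of B
                       and accumulate into the corresponding block of C */
                    for (size_t i2 = i; i2 < i + BLOCK_SIZE; i2++)
                        for (size_t j2 = j; j2 < j + BLOCK_SIZE; j2++) {
                            double cij = C[i2*n + j2];
                            for (size_t k2 = k; k2 < k + BLOCK_SIZE; k2++)
                                cij += A[i2*n + k2] * B[k2*n + j2];
                            C[i2*n + j2] = cij;
                        }
    }

The blocks of A, B, and C touched by the inner three loops are only BLOCK_SIZE² elements each, so all three can stay resident in the cache while they are reused.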
Blocked Matrix Multiply (cache view) Same pseudocode style as before, but now i, j, k index r × r blocks (so each loop runs N/r times):

    # implements C = C + A*B
    for i = 1 to N/r
        for j = 1 to N/r
            # read block C(i,j) into cache
            for k = 1 to N/r
                # read block A(i,k) into cache
                # read block B(k,j) into cache
                C(i,j) = C(i,j) + A(i,k) * B(k,j)   # matrix multiply on blocks
            # write block C(i,j) back to main memory
Matrix Multiply Comparison Naïve Matrix Multiply: N = 100, 1000 cache blocks, 1 word/block ⇒ ≈ 1,020,000 cache misses. Blocked Matrix Multiply: N = 100, 1000 cache blocks, 1 word/block, r = 30 ⇒ ≈ 90,000 cache misses.
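A back-of-the-envelope check of the naïve number (not on the original slide; it assumes row i of A and the current c(i,j) stay cached within an iteration, while each column of B is evicted before it is needed again):

\[
\underbrace{n^3}_{\text{B: column } j \text{ re-read for every } (i,j)}
+ \underbrace{n^2}_{\text{A: each row read once, reused across } j}
+ \underbrace{n^2}_{\text{C: each } c(i,j) \text{ read once}}
= 10^6 + 10^4 + 10^4 = 1{,}020{,}000 \quad (n = 100).
\]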
Maximum Block Size The blocking optimization only works if the blocks fit in cache: we must fit 3 blocks of size r × r in the cache (for A, B, and C). For a cache of size M (in elements/words), we must have 3r² ≤ M, or r ≤ √(M/3). The ratio of cache misses, unblocked vs. blocked, is up to ≈ √M (play with sizes to optimize). From the comparison: the ratio was ≈ 11 while √M = 31.6, so “only” 11X vs. 30X; the matrix is small enough that a row of A in the simple version fits completely in cache, among other effects.
Get to Know Your Staff Category: Food
Agenda Performance Programming Administrivia Perf Prog: Matrix Multiply Technology Trends The Need for Parallelism
Six Great Ideas in Computer Architecture Layers of Representation/Interpretation Moore’s Law Principle of Locality/Memory Hierarchy Parallelism Performance Measurement & Improvement Dependability via Redundancy
Technology Cost over Time What does improving technology look like? [Figure: cost ($) vs. time curves for four technologies labeled A-D.]
Tech Cost: Successive Generations How can Tech Gen 2 replace Tech Gen 1? [Figure: cost ($) vs. time curves for Tech Gen 1, Tech Gen 2, and Tech Gen 3.]
Tech Performance over Time
Moore’s Law Gordon Moore, “Cramming more components onto integrated circuits,” Electronics, Volume 38, Number 8, April 19, 1965 “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. …That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.” (from 50 in 1965) “Integrated circuits will lead to such wonders as home computers--or at least terminals connected to a central computer--automatic controls for automobiles, and personal portable communications equipment. The electronic wristwatch needs only a display to be feasible today.”
Great Idea #2: Moore’s Law Predicts: Transistor count per chip doubles every 2 years. [Figure: number of transistors on an integrated circuit (IC) vs. year.] Gordon Moore, Intel cofounder, B.S. Cal 1950.
Memory Chip Size Growth in memory capacity is slowing. [Figure: DRAM chip capacity vs. year, with a 4x-in-3-years trend line marked.]
End of Moore’s Law? It is as much a law of investment in equipment, and of increasing volume of integrated circuits that need more transistors per chip, as of technology. Exponential growth cannot last forever; the growth in transistors/chip will end during your careers. 2020? 2025? (When) will something replace it?
Uniprocessor Performance Improvements in processor performance have slowed. Why? Performance measured using SPECint benchmarking (http://en.wikipedia.org/wiki/SPECint)
Computer Technology: Growing, But More Slowly
Processor speed: 2x / 1.5 years (’85-’05) [slowing!]; now +2 cores / 2 years. When you graduate: 3-4 GHz, 6-8 cores in client machines, 10-14 in servers.
Memory (DRAM) capacity: 2x / 2 years (since ’96) [slowing!]; now 2x / 3-4 years. When you graduate: 8-16 GigaBytes.
Disk capacity: 2x / 1 year (since ’97); 250x in size over the last decade. When you graduate: 6-12 TeraBytes.
Network: core 2x every 2 years; access 100-1000 Mbps from home, 1-10 Mbps cellular.
Limits to Performance: Faster Means More Power
Dynamic Power Power = C × V² × f: proportional to capacitance, voltage², and frequency of switching. What is the effect on power consumption of: a “simpler” implementation (fewer transistors)? ↓; reduced voltage? ↓↓; increased clock frequency? ↑
Multicore Helps Energy Efficiency Power = C × V² × f From: William Holt, HOT Chips 2005
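An illustrative calculation of why this helps (the 0.85 scaling factors are assumed for this example, not taken from the Holt figure): replace one core running at voltage V and frequency f with two cores each running at 0.85V and 0.85f.

\[
\frac{P_{\text{2 cores}}}{P_{\text{1 core}}} = \frac{2 \times C \times (0.85V)^2 \times (0.85f)}{C \times V^2 \times f} = 2 \times 0.85^3 \approx 1.23,
\qquad
\frac{\text{peak throughput}_{\text{2 cores}}}{\text{peak throughput}_{\text{1 core}}} \approx 2 \times 0.85 = 1.7
\]

So, for parallelizable work, we get roughly 1.7x the throughput for about 1.2x the power, i.e. roughly 40% better performance per watt, because power depends on V² while throughput scales only linearly with f.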
Transition to Multicore [Figure: sequential application performance over time.]
Parallelism - The Challenge The only path to performance is parallelism: clock rates are flat or declining. The key challenge is to craft parallel programs that have high performance on multiprocessors as the number of processors increases (i.e., that scale). Issues: scheduling, load balancing, time for synchronization, overhead for communication. Project #2: fastest matrix multiply code on 16-processor (16-core) computers.
Summary Performance programming: with an understanding of your computer’s architecture, you can optimize code to take advantage of your system’s cache; especially useful for loops and arrays. “Cache blocking” will improve the speed of Matrix Multiply with appropriately-sized blocks. Processors have hit the power wall, so the only option is to go parallel.