CS 61C: Great Ideas in Computer Architecture
Performance Programming, Technology Trends
Instructor: Justin Hsia
Summer 2013, Lecture #14 (7/17/2013)

Presentation transcript:


Review of Last Lecture
- Multilevel caches reduce miss penalty
  - Standard to have 2-3 cache levels (and split I$/D$)
  - Makes CPI/AMAT calculations more complicated
- Cache design choices change performance parameters and cost

Question (based on previous midterm question): Which of the following cache changes will definitely increase L1 Hit Time?
(A) Adding unified L2$, which is larger than L1 but smaller than memory
(B) Increasing block size while keeping cache size constant
(C) Increasing associativity while keeping cache size constant
(D)

Agenda
- Performance Programming
- Administrivia
- Perf Prog: Matrix Multiply
- Technology Trends
  - The Need for Parallelism

Performance Programming
- Adjust memory accesses in code (software) to improve miss rate
- With an understanding of how caches work, we can revise programs to improve cache utilization

Performance of Loops and Arrays
- Array performance is often limited by memory speed
- Goal: increase performance by minimizing traffic from cache to memory
  - Reduce your miss rate by getting better reuse of data already in the cache
  - It is okay to access memory in different orderings as long as you still end up with the correct result
- Cache blocking: “shrink” the problem by performing multiple iterations on smaller chunks that “fit well” in your cache
  - Use Matrix Multiply as an example (Lab 6 and Project 2)

Ex: Looping Performance (1/5)
- We have an array int A[1024] that we want to increment (i.e. A[i]++)
- What does the increment operation look like in assembly?

    # A -> $s0
    lw    $t0,0($s0)
    addiu $t0,$t0,1
    sw    $t0,0($s0)    # guaranteed hit!
    addiu $s0,$s0,4

Ex: Looping Performance (2/5)
- We have an array int A[1024] that we want to increment (i.e. A[i]++)
- What will our miss rate be for a D$ with 1-word blocks? (array not in $ at start)
  - 50% MR: each array element (word) is visited just once, so its lw misses and its sw hits
- Can code choices change this?
  - No

Ex: Looping Performance (3/5)
- We have an array int A[1024] that we want to increment (i.e. A[i]++)
- Now for a D$ with 2-word blocks, what are the best and worst miss rates we can achieve?
  - Best: 25% MR via standard incrementation (each block will miss, then hit, hit, hit)
  - Code:

        for(int i=0; i<1024; i++) A[i]++;

Ex: Looping Performance (4/5)
- We have an array int A[1024] that we want to increment (i.e. A[i]++)
- Now for a D$ with 2-word blocks, what are the best and worst miss rates we can achieve?
  - Worst: 50% MR by skipping elements (assuming D$ smaller than half of array size)
  - Code:

        for(int i=0; i<1024; i+=2) A[i]++;
        for(int i=1; i<1024; i+=2) A[i]++;
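The best and worst cases for 2-word blocks can be checked with a small simulation. The following is a minimal sketch, assuming a direct-mapped, write-allocate D$ that starts empty; the `Cache` struct and all function names are hypothetical and not part of the lecture. Each `A[i]++` is modeled as a `lw` followed by a `sw` to the same word, so a sequential pass gives one miss per four accesses to a block, while the even-then-odd order gives one miss per two accesses.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_WORDS 1024   /* int A[1024], as in the slides */
#define MAX_LINES   1024   /* upper bound on simulated cache lines */

/* Hypothetical direct-mapped D$ model: tracks tags only, no data. */
typedef struct {
    size_t num_lines;       /* number of cache blocks */
    size_t block_words;     /* words per block */
    long   tag[MAX_LINES];  /* block number held by each line, -1 if empty */
    size_t accesses, misses;
} Cache;

static void cache_init(Cache *c, size_t num_lines, size_t block_words) {
    assert(num_lines > 0 && num_lines <= MAX_LINES && block_words > 0);
    c->num_lines = num_lines;
    c->block_words = block_words;
    for (size_t i = 0; i < num_lines; i++) c->tag[i] = -1;
    c->accesses = c->misses = 0;
}

/* Record one word access (load or store) to word index `word`. */
static void cache_access(Cache *c, size_t word) {
    size_t block = word / c->block_words;
    size_t line = block % c->num_lines;
    c->accesses++;
    if (c->tag[line] != (long)block) {   /* miss: bring the block in */
        c->misses++;
        c->tag[line] = (long)block;
    }
}

/* A[i]++ is a lw followed by a sw to the same word. */
static void increment(Cache *c, size_t i) {
    cache_access(c, i);   /* lw */
    cache_access(c, i);   /* sw */
}

/* Miss rate (percent) for: for(i=0; i<1024; i++) A[i]++; */
double miss_rate_sequential(size_t num_lines, size_t block_words) {
    Cache c;
    cache_init(&c, num_lines, block_words);
    for (size_t i = 0; i < ARRAY_WORDS; i++) increment(&c, i);
    return 100.0 * (double)c.misses / (double)c.accesses;
}

/* Miss rate (percent) for the even pass followed by the odd pass. */
double miss_rate_two_pass(size_t num_lines, size_t block_words) {
    Cache c;
    cache_init(&c, num_lines, block_words);
    for (size_t i = 0; i < ARRAY_WORDS; i += 2) increment(&c, i);
    for (size_t i = 1; i < ARRAY_WORDS; i += 2) increment(&c, i);
    return 100.0 * (double)c.misses / (double)c.accesses;
}
```

With an assumed cache of 64 lines, `miss_rate_sequential(64, 2)` models the best case and `miss_rate_two_pass(64, 2)` the worst case; `miss_rate_sequential(64, 1)` models the 1-word-block case from slide 2/5.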

Ex: Looping Performance (5/5)
- We have an array int A[1024] that we want to increment (i.e. A[i]++)
- For an I$ with 1-word blocks, what happens if we don’t use labels/loops?
  - 100% MR, as all instructions are explicitly written out sequentially
- What if we loop by incrementing i by 1?
  - Will miss on first pass over code, but should be found in I$ for all subsequent passes


Administrivia
- HW4 due Sunday
- Midterm: 9am in 1 Pimentel
  - Take old exams for practice (see Piazza)
  - Double-sided sheet of handwritten notes
  - MIPS Green Sheet provided; no calculators
  - Will cover up through caches
- Mid-Semester Survey (part of Lab 6)


Matrix Multiplication
[Figure: C = A × B, where element c_ij is computed from row a_i* of A and column b_*j of B]

Naïve Matrix Multiply

    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        for (k=0; k<N; k++)
          c[i][j] += a[i][k] * b[k][j];

- Advantage: Code simplicity
- Disadvantage: Blindly marches through memory and caches

Matrices in Memory (1/2)
- Matrices stored as 1-D arrays in memory
  - Column major: A(i,j) at A+i+j*n
  - Row major: A(i,j) at A+i*n+j
- C default is row major
[Figure: element numbering of a matrix in column-major and row-major order]
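The two addressing formulas can be written out as index helpers. This is a small sketch, assuming a square n × n matrix stored in a flat array with zero-based indices; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset (in elements) of A(i,j) when the n x n matrix is stored
 * column major: consecutive elements of a column are adjacent. */
size_t col_major_index(size_t i, size_t j, size_t n) {
    assert(i < n && j < n);
    return i + j * n;
}

/* Offset (in elements) of A(i,j) when the n x n matrix is stored
 * row major (the C default): consecutive elements of a row are adjacent. */
size_t row_major_index(size_t i, size_t j, size_t n) {
    assert(i < n && j < n);
    return i * n + j;
}

/* Read A(i,j) from a flat row-major array. */
double get_row_major(const double *a, size_t i, size_t j, size_t n) {
    return a[row_major_index(i, j, n)];
}
```

For a 4 × 4 matrix, element (1,2) sits at offset 6 in row-major order and at offset 9 in column-major order, so walking along a row touches adjacent words in one layout and words n apart in the other.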

Matrices in Memory (2/2)
- How do cache blocks fit into this scheme?
  - Column major matrix in memory:
[Figure: a ROW of the matrix (blue) is spread among the cache blocks (red)]

Naïve Matrix Multiply (cache view)

    # move along rows of A
    for i = 1 to n
      # move along columns of B
      for j = 1 to n
        # EACH k loop reads row of A, col of B
        # Also read & write c(i,j) n times
        for k = 1 to n
          c(i,j) = c(i,j) + a(i,k) * b(k,j)

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

Linear Algebra to the Rescue (1/2)
- Can get the same result of a matrix multiplication by splitting the matrices into smaller submatrices (matrix “blocks”)
- For example, multiply two 4×4 matrices:

Linear Algebra to the Rescue (2/2)
- Matrices of size N×N, split into a 4×4 grid of blocks, each of size r×r (N = 4r):

    C11 C12 C13 C14     A11 A12 A13 A14     B11 B12 B13 B14
    C21 C22 C23 C24  =  A21 A22 A23 A24  ×  B21 B22 B23 B24
    C31 C32 C33 C34     A31 A32 A33 A34     B31 B32 B33 B34
    C41 C42 C43 C44     A41 A42 A43 A44     B41 B42 B43 B44

- $C_{22} = A_{21}B_{12} + A_{22}B_{22} + A_{23}B_{32} + A_{24}B_{42} = \sum_k A_{2k} B_{k2}$
- Multiplication operates on small “block” matrices
  - Choose size so that they fit in the cache!

Blocked Matrix Multiply
- Blocked version of the naïve algorithm:

    for(i = 0; i < N/r; i++)
      for(j = 0; j < N/r; j++)
        for(k = 0; k < N/r; k++)
          C[i][j] += A[i][k]*B[k][j]   // "*" is an r × r matrix multiplication, "+=" is an r × r matrix addition

  - r = matrix block size (assume r divides N)
  - X[i][j] = submatrix of X, defined by block row i and block column j

Blocked Matrix Multiply (cache view)

    # move along block row of A
    for i = 1 to N/r
      # move along block col of B
      for j = 1 to N/r
        # each k loop reads block of A and B
        # Also read and write block of C
        for k = 1 to N/r
          C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: matrix multiply on blocks, C(i,j) = C(i,j) + A(i,k) * B(k,j)]
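The block-level pseudocode can be expanded into element-level loops. The following is one possible sketch, assuming square matrices stored as flat row-major arrays of `double` and a block size `r` that divides `n`; the function names and signatures are hypothetical and are not the Lab 6 or Project 2 interface. The three outer loops walk over blocks, and the three inner loops perform the r × r block multiply and accumulate.

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference: C += A * B for n x n row-major matrices. */
void matmul_naive(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Blocked version: C += A * B, processed in r x r blocks.
 * Assumes r > 0 and r divides n, as on the slide. */
void matmul_blocked(size_t n, size_t r,
                    const double *a, const double *b, double *c) {
    assert(r > 0 && n % r == 0);
    for (size_t bi = 0; bi < n; bi += r)          /* block row of C and A */
        for (size_t bj = 0; bj < n; bj += r)      /* block column of C and B */
            for (size_t bk = 0; bk < n; bk += r)  /* block index along the sum */
                /* C block (bi,bj) += A block (bi,bk) * B block (bk,bj) */
                for (size_t i = bi; i < bi + r; i++)
                    for (size_t j = bj; j < bj + r; j++) {
                        double sum = c[i * n + j];
                        for (size_t k = bk; k < bk + r; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first. The blocked version performs the same multiplications and additions as the naive one, only in a different order, which is why the result is unchanged while data reuse in the cache improves.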

Matrix Multiply Comparison
- Naïve Matrix Multiply
  - N = 100, 1000 cache blocks, 1 word/block
  - YouTube: Slow / Fast-forward
  - ≈ 1,020,000 cache misses
- Blocked Matrix Multiply
  - N = 100, 1000 cache blocks, 1 word/block, r = 30
  - YouTube: Slow / Fast-forward
  - ≈ 90,000 cache misses

Maximum Block Size
- Blocking optimization only works if the blocks fit in cache
  - Must fit 3 blocks of size r × r in the cache (for A, B, and C)
- For a cache of size M (in elements/words), we must have $3r^2 \le M$, or $r \le \sqrt{M/3}$
- Ratio of cache misses unblocked vs. blocked is up to ≈ $\sqrt{M}$ (play with sizes to optimize)
  - From comparison: ratio was ≈ 11, $\sqrt{M} \approx 31.6$
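The bound on r can be evaluated directly. This is a minimal sketch, assuming the cache size M is given in words and that three r × r blocks must be resident at once; the function name is hypothetical. It uses integer arithmetic only, so it needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* Largest block size r such that three r x r blocks (one each of
 * A, B, and C) fit in a cache of m_words words, i.e. 3*r*r <= m_words. */
size_t max_block_size(size_t m_words) {
    size_t r = 0;
    while (3 * (r + 1) * (r + 1) <= m_words)
        r++;
    assert(3 * r * r <= m_words);   /* the three blocks fit */
    return r;
}
```

For the M = 1000 words used in the comparison, `max_block_size(1000)` gives the largest r that satisfies the inequality; in practice the best block size is found by experimenting around this bound, as the slide suggests.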

Get To Know Your Staff
- Category: Food


Six Great Ideas in Computer Architecture
1. Layers of Representation/Interpretation
2. Moore’s Law
3. Principle of Locality/Memory Hierarchy
4. Parallelism
5. Performance Measurement & Improvement
6. Dependability via Redundancy

Technology Cost over Time
- What does improving technology look like?
[Figure: cost ($) vs. time, with candidate curves A, B, C, D]

Tech Cost: Successive Generations
- How can Tech Gen 2 replace Tech Gen 1?
[Figure: cost ($) vs. time for Tech Gen 1, Tech Gen 2, and Tech Gen 3]

Tech Performance over Time
[Figure: performance vs. time]

Moore’s Law
- “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. …That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.” (from 50 in 1965)
- “Integrated circuits will lead to such wonders as home computers--or at least terminals connected to a central computer--automatic controls for automobiles, and personal portable communications equipment. The electronic wristwatch needs only a display to be feasible today.”
Gordon Moore, “Cramming more components onto integrated circuits,” Electronics, Volume 38, Number 8, April 19, 1965

Great Idea #2: Moore’s Law
- Predicts: transistor count per chip doubles every 2 years
- Gordon Moore: Intel cofounder, B.S. Cal 1950
[Figure: # of transistors on an integrated circuit (IC) vs. year]

End of Moore’s Law?
- Exponential growth cannot last forever
- More transistors/chip will end during your careers
  - 2020? 2025?
  - (When) will something replace it?
- It’s also a law of investment in equipment as well as increasing volume of integrated circuits that need more transistors per chip

Computer Technology: Growing, But More Slowly
- Processor
  - Speed 2x / 1.5 years (since ’85-’05) [slowing!]
  - Now +2 cores / 2 years
  - When you graduate: 3-4 GHz, 6-8 Cores in client, in server
- Memory (DRAM)
  - Capacity: 2x / 2 years (since ’96) [slowing!]
  - Now 2X/3-4 years
  - When you graduate: 8-16 GigaBytes
- Disk
  - Capacity: 2x / 1 year (since ’97)
  - 250X size last decade
  - When you graduate: 6-12 TeraBytes
- Network
  - Core: 2x every 2 years
  - Access: mbps from home, 1-10 mbps cellular

Memory Chip Size
- Growth in memory capacity slowing
[Figure: memory chip capacity over time; growth slows from 4x in 3 years to 2x in 3 years]

Uniprocessor Performance
- Improvements in processor performance have slowed.

Limits to Performance: Faster Means More Power

Dynamic Power
- $\text{Power} = C \times V^2 \times f$
  - Proportional to capacitance, voltage squared, and frequency of switching
- What is the effect on power consumption of:
  - “Simpler” implementation (fewer transistors)? ↓
  - Reduced voltage? ↓↓
  - Increased clock frequency? ↑
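The three effects can be made concrete by scaling each factor in the dynamic power formula. This is a minimal sketch with a hypothetical helper that returns power relative to a baseline of 1.0; the scale factors used in the note after the code are illustrative values, not measurements.

```c
#include <assert.h>

/* Dynamic power relative to a baseline, after scaling capacitance,
 * voltage, and switching frequency by the given factors.
 * Follows Power = C * V^2 * f, so voltage enters quadratically. */
double relative_dynamic_power(double c_scale, double v_scale, double f_scale) {
    assert(c_scale >= 0.0 && v_scale >= 0.0 && f_scale >= 0.0);
    return c_scale * v_scale * v_scale * f_scale;
}
```

Halving the voltage alone, `relative_dynamic_power(1.0, 0.5, 1.0)`, cuts power to a quarter of the baseline, which is why reduced voltage earns the double arrow. Halving the capacitance alone halves power, and doubling the clock frequency alone doubles it.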

Multicore Helps Energy Efficiency
- $\text{Power} = C \times V^2 \times f$
From: William Holt, HOT Chips

Transition to Multicore
[Figure: sequential app performance]

Parallelism - The Challenge
- Only path to performance is parallelism
  - Clock rates flat or declining
- Key challenge is to craft parallel programs that have high performance on multiprocessors as the number of processors increases, i.e. that scale
  - Scheduling, load balancing, time for synchronization, overhead for communication
- Project #2: fastest matrix multiply code on 16-processor (16-core) computers

Summary
- Performance programming
  - With an understanding of your computer’s architecture, you can optimize code to take advantage of your system’s cache
  - Especially useful for loops and arrays
  - “Cache blocking” will improve speed of Matrix Multiply with appropriately-sized blocks
- Processors have hit the power wall, so the only option is to go parallel