CS194 Lecture 3: Memory Hierarchies and Optimizations (Kathy Yelick)

Presentation transcript:

CS194 Lecture 3, Slide 1: Memory Hierarchies and Optimizations (Kathy Yelick)

CS194 Lecture 3, Slide 2: Motivation
- Multiple cores or processors on a single system are there for performance
- Many applications run well below the "peak" of the system, often under 10% of arithmetic performance
- Perhaps optimizing the code on a single core will give as much benefit as writing it in parallel
- Most of the single-processor performance loss is in the memory system
- To understand this, we need to look under the hood of modern processors
- A parallel programmer is also a performance programmer: understand your hardware

CS194 Lecture 3, Slide 3: Outline
- Idealized and actual costs in modern processors
- Parallelism within single processors
- Memory hierarchies
- Use of microbenchmarks to characterize performance
- Case study: Matrix Multiplication
- Use of performance models to understand performance

CS194 Lecture 3, Slide 4: Outline
- Idealized and actual costs in modern processors
- Parallelism within single processors
- Memory hierarchies
- Use of microbenchmarks to characterize performance
- Case study: Matrix Multiplication
- Use of performance models to understand performance

CS194 Lecture 3, Slide 5: Idealized Uniprocessor Model
- Processor names bytes, words, etc. in its address space
  - These represent integers, floats, pointers, arrays, etc.
- Operations include
  - Read and write into very fast memory called registers
  - Arithmetic and other logical operations on registers
- Order specified by program
  - Read returns the most recently written data
  - Compiler and architecture translate high-level expressions into "obvious" lower-level instructions
  - Hardware executes instructions in the order specified by the compiler
- Idealized cost
  - Each operation has roughly the same cost (read, write, add, multiply, etc.)
- Example: A = B + C compiles to
    Read address(B) to R1
    Read address(C) to R2
    R3 = R1 + R2
    Write R3 to address(A)

CS194 Lecture 3, Slide 6: Uniprocessors in the Real World
- Real processors have
  - registers and caches: small amounts of fast memory that store values of recently used or nearby data; different memory ops can have very different costs
  - parallelism: multiple "functional units" that can run in parallel; different orders and instruction mixes have different costs
  - pipelining: a form of parallelism, like an assembly line in a factory
- Why is this your problem?
  - In theory, compilers understand all of this and can optimize your program; in practice they don't.
  - Even if they could optimize one algorithm, they won't know about a different algorithm that might be a much better "match" to the processor

CS194 Lecture 3, Slide 7: Outline
- Idealized and actual costs in modern processors
- Parallelism within single processors
  - Hidden from software (sort of)
  - Pipelining
  - SIMD units
- Memory hierarchies
- Use of microbenchmarks to characterize performance
- Case study: Matrix Multiplication
- Use of performance models to understand performance

CS194 Lecture 3, Slide 8: What is Pipelining?
Dave Patterson's laundry example: 4 people doing laundry; wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency per load.
[Figure: timeline of loads A-D overlapping in the wash/dry/fold stages, starting at 6 PM]
- In this example:
  - Sequential execution takes 4 * 90 min = 6 hours
  - Pipelined execution takes 30 + 4*40 + 20 min = 3.5 hours
- Bandwidth = loads/hour
  - BW = 4/6 loads/hour without pipelining
  - BW = 4/3.5 loads/hour with pipelining
  - BW <= 1.5 loads/hour with pipelining, for more total loads
- Pipelining helps bandwidth but not latency (still 90 min per load)
- Bandwidth is limited by the slowest pipeline stage
- Potential speedup = number of pipe stages
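A small worked sketch of the timing arithmetic above, using the laundry numbers from the slide; the fill + steady-state + drain formula below is a simplification that assumes the slowest stage (drying) dominates:

    #include <stdio.h>

    int main(void)
    {
        int wash = 30, dry = 40, fold = 20;   /* minutes per stage */
        int loads = 4;
        int sequential = loads * (wash + dry + fold);   /* 4 * 90 = 360 min */
        int pipelined  = wash + loads * dry + fold;     /* 30 + 4*40 + 20 = 210 min */
        printf("sequential: %d min, pipelined: %d min\n", sequential, pipelined);
        printf("bandwidth: %.2f vs %.2f loads/hour\n",
               60.0 * loads / sequential, 60.0 * loads / pipelined);
        return 0;
    }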

CS194 Lecture 3, Slide 9: Example: 5 Steps of MIPS Datapath
(Figure 3.4, page 134, CA:AQA 2e by Patterson and Hennessy)
[Figure: pipelined MIPS datapath with the stages Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calculation, Memory Access, and Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
- Pipelining is also used within arithmetic units: a floating-point multiply may have a latency of 10 cycles, but a throughput of 1 per cycle

CS194 Lecture 3, Slide 10: SIMD: Single Instruction, Multiple Data
- Scalar processing
  - the traditional mode
  - one operation produces one result: X + Y
- SIMD processing with SSE / SSE2
  - one operation produces multiple results: [x3,x2,x1,x0] + [y3,y2,y1,y0] = [x3+y3, x2+y2, x1+y1, x0+y0]
(Slide source: Alex Klimovitski & Dean Macri, Intel Corporation)

CS194 Lecture 3, Slide 11: SSE / SSE2 SIMD on Intel
- SSE2 data types: anything that fits into 16 bytes, e.g.,
  - 16x bytes
  - 4x single-precision floats
  - 2x double-precision doubles
- Instructions perform add, multiply, etc. on all the data in this 16-byte register in parallel
- Challenges:
  - Data needs to be contiguous in memory and aligned
  - Some instructions exist to move data around from one part of a register to another
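As a concrete illustration of the 4x-float case above, the hedged sketch below uses the SSE intrinsics from <xmmintrin.h> to add two float arrays four elements at a time; the function name and the assumption that n is a multiple of 4 are made up for the example:

    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

    /* Add two float arrays 4 elements at a time using 128-bit SSE registers.
       Assumes n is a multiple of 4; unaligned loads/stores are used so the
       arrays need not be 16-byte aligned. */
    void add_sse(const float *x, const float *y, float *z, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(&x[i]);            /* load x[i..i+3] */
            __m128 vy = _mm_loadu_ps(&y[i]);            /* load y[i..i+3] */
            _mm_storeu_ps(&z[i], _mm_add_ps(vx, vy));   /* z[i..i+3] = x + y */
        }
    }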

CS194 Lecture 3, Slide 12: What does this mean to you?
- In addition to SIMD extensions, the processor may have other special instructions
  - Fused Multiply-Add (FMA) instructions: x = y + c * z is so common that some processors execute the multiply and add as a single instruction, at the same rate (bandwidth) as + or * alone
- In theory, the compiler understands all of this
  - When compiling, it will rearrange instructions to get a good "schedule" that maximizes pipelining and uses FMAs and SIMD
  - It works with the mix of instructions inside an inner loop or other block of code
- But in practice the compiler may need your help
  - Choose a different compiler, optimization flags, etc.
  - Rearrange your code to make things more obvious
  - Use special functions ("intrinsics") or write in assembly
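A minimal sketch of the FMA idea using the standard C99 fma() function from <math.h>; whether this actually becomes a single hardware instruction depends on the target architecture and compiler flags:

    #include <math.h>

    /* y + c*z computed with a single rounding; on hardware with FMA units
       and suitable compiler flags this can compile to one fused instruction. */
    double axpy_element(double y, double c, double z)
    {
        return fma(c, z, y);   /* c*z + y */
    }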

CS194 Lecture 3, Slide 13: Peak Performance
- Whenever you look at performance, it's useful to understand what you're expecting
- Machine peak is the maximum number of arithmetic operations a processor (or set of them) can perform
  - Often referred to as the "speed of light" or "guaranteed not to exceed" number
- Treat integer and floating point separately, and be clear on how wide (how many bits) the operands are
- If you see a number higher than peak, you have a mistake in your timers, formulas, or ...

CS194 Lecture 3, Slide 14: Peak Performance Examples
- Example 1: Power5 processors in Bassi at NERSC
  - 1.9 GHz (G cycles/sec) with 4 double-precision floating-point ops/cycle -> 7.6 GFlop/s peak per processor
  - Two FP units, each can do a fused MADD per cycle (throughput!)
  - Each SMP has 8 processors -> 60.8 GFlop/s
- Example 2: 20 of the PSI nodes (in CITRIS)
  - 2.33 GHz with SSE2 (2-wide for 64-bit FPUs, with FMA) -> 9.32 GFlop/s per core
  - 2 quad-cores per node -> 8 x 9.32 = 74.56 GFlop/s per node
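The peak numbers above are just clock rate times flops per cycle times core count; a small sketch that reproduces the slide's arithmetic:

    #include <stdio.h>

    /* Peak GFlop/s = clock (GHz) * flops per cycle per core * cores.
       Values are the ones quoted on the slide. */
    int main(void)
    {
        double power5   = 1.9 * 4;    /* 7.6 GFlop/s per Power5 processor */
        double psi_core = 2.33 * 4;   /* 9.32 GFlop/s per core (2-wide SSE2 * FMA) */
        printf("Power5: %.2f GFlop/s, Bassi SMP (8 procs): %.2f GFlop/s\n",
               power5, power5 * 8);
        printf("PSI core: %.2f GFlop/s, node (2 quad-cores): %.2f GFlop/s\n",
               psi_core, psi_core * 8);
        return 0;
    }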

CS194 Lecture 3, Slide 15: Outline
- Idealized and actual costs in modern processors
- Parallelism within single processors
- Memory hierarchies
  - Temporal and spatial locality
  - Basics of caches
- Use of microbenchmarks to characterize performance
- Case study: Matrix Multiplication
- Use of performance models to understand performance

CS194 Lecture 3, Slide 16: Memory Hierarchy
- Most programs have a high degree of locality in their accesses
  - spatial locality: accessing things near previous accesses
  - temporal locality: reusing an item that was previously accessed
- The memory hierarchy tries to exploit locality
[Figure: processor (control, datapath, registers, on-chip cache), second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape); speeds roughly 1 ns, 10 ns, 100 ns, 10 ms, 10 s and sizes roughly bytes, KB, MB, GB, TB moving down the hierarchy]

CS194 Lecture 3, Slide 17: Processor-DRAM Gap (latency)
[Figure: processor and DRAM performance over time on a log scale; CPU performance ("Moore's Law") improves ~60%/year while DRAM improves ~7%/year, so the processor-memory performance gap grows ~50%/year]
- Memory hierarchies are getting deeper
  - Processors get faster more quickly than memory

CS194 Lecture 3, Slide 18: Approaches to Handling Memory Latency
- Bandwidth has improved more than latency
- Approaches to the memory latency problem:
  - Eliminate memory operations by saving values in small, fast memory (cache) and reusing them: needs temporal locality in the program
  - Take advantage of better bandwidth by getting a chunk of memory, saving it in small fast memory (cache), and using the whole chunk: needs spatial locality in the program
  - Take advantage of better bandwidth by allowing the processor to issue multiple reads to the memory system at once: needs concurrency in the instruction stream, e.g., loading a whole array as in vector processors, or prefetching

CS194 Lecture 3, Slide 19: Little's Law
- Latency vs. bandwidth
  - Latency is physics (wire length): e.g., the network latency on the Earth Simulator is only about 2x the speed-of-light delay across the machine room
  - Bandwidth is cost: add more wires to increase bandwidth (an over-simplification)
- Principle (Little's Law): for a production system in steady state,
    Inventory = Throughput x Flow Time
- For parallel computing, Little's Law gives the concurrency required to be limited by bandwidth rather than latency:
    Required concurrency = Bandwidth x Latency (the bandwidth-delay product)

CS194 Lecture 3, Slide 20: Little's Law
- Example 1: a single processor
  - If the latency to memory is 50 ns, and the bandwidth is 5 GB/s (0.2 ns/byte = 12.8 ns per 64-byte cache line)
  - The system must support 50 / 12.8 ~= 4 outstanding cache-line misses to stay balanced (i.e., to run at bandwidth speed)
  - An application must be able to prefetch 4 cache-line misses in parallel (without dependencies between them)
- Example 2: a 1000-processor system
  - 1 GHz clock, 100 ns memory latency, 100 words of memory in the data paths between CPU and memory
  - Main memory bandwidth is ~ 1000 x 100 words x 10^9 /s = 10^14 words/sec
  - To achieve full performance, an application needs ~ 10^-7 s x 10^14 words/s = 10^7-way concurrency (some of that may be hidden in the instruction stream)
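A small worked sketch of Example 1's arithmetic; the numbers are the slide's, and the code just evaluates the bandwidth-delay product in cache lines:

    #include <stdio.h>

    int main(void)
    {
        double latency_ns   = 50.0;    /* memory latency */
        double bandwidth_gb = 5.0;     /* GB/s */
        double line_bytes   = 64.0;    /* cache line size */
        /* bytes / (GB/s) gives ns, since both carry a factor of 10^9 */
        double line_time_ns = line_bytes / bandwidth_gb;   /* 12.8 ns per line */
        printf("Outstanding misses needed: %.1f (~= 4)\n",
               latency_ns / line_time_ns);
        return 0;
    }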

CS194 Lecture 3, Slide 21: Cache Basics (Review)
- Cache is fast (expensive) memory that keeps a copy of data in main memory; it is hidden from software
- Cache hit: in-cache memory access (cheap)
- Cache miss: non-cached memory access (expensive); need to access the next, slower level of cache
- Consider a tiny cache (for illustration only)
  [Table: a 4-entry example cache; each entry holds 2 bytes and covers one small range of addresses]
- Cache line length: the number of bytes loaded together in one entry (2 in the example above)
- Associativity
  - direct-mapped: only 1 address (line) in a given range can be in the cache
  - n-way: n >= 2 lines with different addresses can be stored
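As a hedged illustration of how a direct-mapped cache locates a line (not the slide's tiny example, whose address table was lost in the transcript): with a line size of 2^LINE_BITS bytes and 2^SET_BITS sets, the low bits select the byte within the line, the next bits select the set, and the remainder is the tag. The specific sizes below are made up for the example.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6   /* 64-byte lines */
    #define SET_BITS  7   /* 128 sets (direct-mapped: 1 line per set) */

    int main(void)
    {
        uint64_t addr   = 0x12345678;                       /* example address */
        uint64_t offset = addr & ((1u << LINE_BITS) - 1);   /* byte within the line */
        uint64_t set    = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
        uint64_t tag    = addr >> (LINE_BITS + SET_BITS);   /* identifies which line */
        printf("offset=%llu set=%llu tag=0x%llx\n",
               (unsigned long long)offset, (unsigned long long)set,
               (unsigned long long)tag);
        return 0;
    }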

CS194 Lecture 3, Slide 22: Why Have Multiple Levels of Cache?
- On-chip vs. off-chip
  - On-chip caches are faster, but limited in size
- A large cache has delays
  - Hardware to check longer addresses in the cache takes more time
  - Associativity, which gives a more general set of data in the cache, also takes more time
- Some examples:
  - Cray T3E eliminated one cache to speed up misses
  - IBM uses a level of cache as a "victim cache", which is cheaper
- There are other levels of the memory hierarchy
  - Registers, pages (TLB, virtual memory), ...
  - And it isn't always a hierarchy

CS194 Lecture 3, Slide 23: Experimental Study of Memory (Membench)
- Microbenchmark for memory system performance: time the following loop (repeat many times and average)
    for i from 0 to L
      load A[i] from memory (4 bytes)
- One experiment:
    for array A of length L from 4 KB to 8 MB by 2x
      for stride s from 4 bytes (1 word) to L/2 by 2x
        time the following loop (repeat many times and average)
          for i from 0 to L by s
            load A[i] from memory (4 bytes)
[Figure: array A with every s-th element accessed]
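A hedged C sketch of the timed inner loop described above; the real membench adds repetition, warm-up, and a sufficiently fine-grained timer, and the names here are made up:

    #include <stddef.h>

    /* Touch every 'stride_bytes'-th 4-byte element of an array of
       'len_bytes' bytes. 'sink' keeps the compiler from optimizing
       the loads away. */
    volatile int sink;

    void strided_reads(const int *a, size_t len_bytes, size_t stride_bytes)
    {
        size_t n      = len_bytes / sizeof(int);
        size_t stride = stride_bytes / sizeof(int);
        int s = 0;
        for (size_t i = 0; i < n; i += stride)
            s += a[i];          /* one load per iteration */
        sink = s;
    }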

CS194 Lecture 3, Slide 24: Membench: What to Expect
- Consider the average cost per load
  - Plot one line for each array length: time vs. stride
  - Small stride is best: if a cache line holds 4 words, at most 1/4 of the accesses miss
  - If the array is smaller than a given cache, all of those accesses will hit (after the first run, which is negligible for large enough repetition counts)
  - The picture assumes only one level of cache
  - Values have gotten more difficult to measure on modern processors
[Figure: average cost per access vs. stride s; arrays that fit in L1 stay at the cache hit time, larger arrays rise toward memory time as the stride grows]

CS194 Lecture 3, Slide 25: Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz; one curve per array length.
- L1: 16 KB, 16-byte lines, 2 cycles (6 ns)
- L2: 2 MB, 64-byte lines, 12 cycles (36 ns)
- Mem: 396 ns (132 cycles)
- 8 KB pages, 32 TLB entries

CS194 Lecture 3, Slide 26: Memory Hierarchy on a Pentium III
Katmai processor on Millennium, 550 MHz; one curve per array size.
- L1: 64 KB, 32-byte lines(?), 5 ns, 4-way(?)
- L2: 512 KB, 60 ns

CS194 Lecture 3, Slide 27: Memory Hierarchy on a Power3 (Seaborg)
Power3, 375 MHz; one curve per array size.
- L1: 32 KB, 128-byte lines, 0.5-2 cycles
- L2: 8 MB, 128-byte lines, 9 cycles
- Mem: 396 ns (132 cycles)

CS194 Lecture 3, Slide 28: Memory Performance on Itanium 2 (CITRIS)
Itanium 2, 900 MHz; one curve per array size.
- L1: 32 KB, 64-byte lines, 0.34-1 cycles
- L2: 256 KB, 128-byte lines, 0.5-4 cycles
- L3: 2 MB, 128-byte lines, 3-20 cycles
- Mem: ? cycles
Uses the MAPS benchmark.

CS194 Lecture 3, Slide 29: Stanza Triad
- An even smaller benchmark, for prefetching
- Derived from STREAM Triad
- Stanza (L) is the length of a unit-stride run
    while i < arraylength
      for each L-element stanza
        A[i] = scalar * X[i] + Y[i]
      skip k elements
[Figure: 1) do L triads, 2) skip k elements, 3) do L triads, ...]
(Source: Kamil et al., MSP05)
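A hedged C sketch of the stanza-triad kernel as described above; the published benchmark adds timing and many repetitions, and the function name is made up:

    #include <stddef.h>

    /* Perform the triad A[i] = scalar*X[i] + Y[i] in unit-stride stanzas of
       length L, skipping k elements between stanzas. */
    void stanza_triad(double *A, const double *X, const double *Y,
                      double scalar, size_t n, size_t L, size_t k)
    {
        size_t i = 0;
        while (i < n) {
            size_t end = (i + L < n) ? i + L : n;   /* one stanza */
            for (; i < end; i++)
                A[i] = scalar * X[i] + Y[i];
            i += k;                                 /* skip k elements */
        }
    }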

CS194 Lecture 3, Slide 30: Stanza Triad Results
- The graph's x-axis starts at a cache line size (>= 16 bytes)
- If cache locality were the only thing that mattered, we would expect flat lines equal to the measured memory peak bandwidth (STREAM), as on the Pentium 3
- Prefetching gets the next cache line (pipelining) while the current one is being used
  - This does not "kick in" immediately, so performance depends on L

CS194 Lecture 3, Slide 31: Lessons
- The actual performance of a simple program can be a complicated function of the architecture
  - Slight changes in the architecture or the program change the performance significantly
  - To write fast programs, you need to consider the architecture (true on sequential or parallel processors)
  - We would like simple models to help us design efficient algorithms
- We will illustrate with a common technique for improving cache performance, called blocking or tiling
  - Idea: use divide-and-conquer to define a problem that fits in registers / L1 cache / L2 cache (see the sketch below)
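As a preview of the tiling idea (the matrix-multiply case study itself comes next), here is a hedged sketch of a blocked matrix multiply in C; the matrix size N and block size B are example values, with B chosen so that three B x B blocks fit in the cache level being targeted:

    #define N 1024      /* matrix dimension (example value) */
    #define B 64        /* block size: 3*B*B doubles should fit in cache */

    /* C += A*Bm for row-major N x N matrices, computed block by block so each
       B x B sub-problem stays resident in cache while it is reused. */
    void matmul_blocked(const double *A, const double *Bm, double *C)
    {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int i = ii; i < ii + B && i < N; i++)
                        for (int j = jj; j < jj + B && j < N; j++) {
                            double sum = C[i*N + j];
                            for (int k = kk; k < kk + B && k < N; k++)
                                sum += A[i*N + k] * Bm[k*N + j];
                            C[i*N + j] = sum;
                        }
    }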

CS194 Lecture 3, Slide 32: Outline
- Idealized and actual costs in modern processors
- Parallelism within single processors
- Memory hierarchies
- Use of microbenchmarks to characterize performance
- Case study: Matrix Multiplication
- Use of performance models to understand performance
  - Simple cache model
  - Warm-up: matrix-vector multiplication
  - Case study continued next time

CS194 Lecture 3, Slide 33: Why Matrix Multiplication?
- An important kernel in scientific problems
  - Appears in many linear algebra algorithms
  - Closely related to other algorithms, e.g., transitive closure on a graph using Floyd-Warshall
- Optimization ideas can be used in other problems
- The best case for optimization payoffs
- The most-studied algorithm in parallel computing

CS194 Lecture 3, Slide 34: Administrivia
- Homework 1 (with the code!) is online
  - A lot to understand, relatively little code to write
  - Understanding the performance results and code is important for the grade!
- Homework 0 should be done
- Submit should (finally) be working, but perhaps not on the CITRIS/PSI machines (use the instructional machines to submit)

CS194 Lecture 3, Slide 35: Note on Matrix Storage
- A matrix is a 2-D array of elements, but memory addresses are "1-D"
- Conventions for matrix layout
  - by column, or "column major" (Fortran default): A(i,j) at A + i + j*n
  - by row, or "row major" (C default): A(i,j) at A + i*n + j
  - recursive (later)
- Column major is assumed for now
[Figure: a column-major and a row-major matrix in memory; the blue row of the matrix is stored in the red cache lines. Figure source: Larry Carter, UCSD]
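A small illustration of the two layouts in C; the macro names and the 2 x 2 example are made up for the sketch:

    #include <stdio.h>

    /* Index an n x n matrix stored as a flat 1-D array. */
    #define ROW_MAJOR(A, i, j, n)  ((A)[(i)*(n) + (j)])   /* C default       */
    #define COL_MAJOR(A, i, j, n)  ((A)[(i) + (j)*(n)])   /* Fortran default */

    int main(void)
    {
        double A[2*2] = {1, 2, 3, 4};   /* same 4 words, two interpretations */
        /* Row major reads them as [1 2; 3 4], column major as [1 3; 2 4]. */
        printf("row major A(1,0)=%g, column major A(1,0)=%g\n",
               ROW_MAJOR(A, 1, 0, 2), COL_MAJOR(A, 1, 0, 2));
        return 0;
    }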

CS194 Lecture 3, Slide 36: Using a Simple Model of Memory to Optimize
- Assume just 2 levels in the hierarchy, fast and slow
- All data is initially in slow memory
  - m  = number of memory elements (words) moved between fast and slow memory
  - tm = time per slow memory operation
  - f  = number of arithmetic operations
  - tf = time per arithmetic operation, tf << tm
  - q  = f / m = average number of flops per slow memory access
- Minimum possible time = f * tf, when all data is in fast memory
- Actual time = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
- Larger q means time is closer to the minimum f * tf
  - q >= tm/tf is needed to get at least half of peak speed
- Computational intensity (q) is the key to algorithm efficiency; machine balance (tm/tf) is the key to machine efficiency

CS194 Lecture 3, Slide 37: Warm-up: Matrix-Vector Multiplication
  {implements y = y + A*x}
  for i = 1:n
    for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
[Figure: y(i) is updated by the dot product of row A(i,:) with x(:)]

CS194 Lecture 3, Slide 38: Warm-up: Matrix-Vector Multiplication
  {read x(1:n) into fast memory}
  {read y(1:n) into fast memory}
  for i = 1:n
    {read row i of A into fast memory}
    for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
  {write y(1:n) back to slow memory}
- m = number of slow memory references = 3n + n^2
- f = number of arithmetic operations = 2n^2
- q = f / m ~= 2
- Matrix-vector multiplication is limited by slow memory speed
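For reference, a direct C transcription of the pseudocode above (row-major storage, 0-based indexing):

    /* y = y + A*x for a row-major n x n matrix A.
       Slow-memory traffic: x and y are read once (2n words), y is written
       once (n words), and every element of A is read once (n^2 words),
       giving m = 3n + n^2 while f = 2n^2 flops, so q = f/m ~= 2. */
    void matvec(const double *A, const double *x, double *y, int n)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int j = 0; j < n; j++)
                yi += A[i*n + j] * x[j];
            y[i] = yi;
        }
    }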

CS194 Lecture 3, Slide 39: Modeling Matrix-Vector Multiplication
- Compute time for an n x n = 1000 x 1000 matrix:
    Time = f*tf + m*tm = f * tf * (1 + (tm/tf) * (1/q)) = 2*n^2 * tf * (1 + (tm/tf) * (1/2))
- For tf and tm, use data from R. Vuduc's PhD thesis (pp. 351-3)
  - For tm, use minimum memory latency / words per cache line
[Table: measured tf and tm and the resulting machine balance (the q needed for at least half of peak speed) for several machines; values not captured in this transcript]

CS194 Lecture 3, Slide 40: Simplifying Assumptions
- What simplifying assumptions did we make in this analysis?
  - Ignored parallelism between memory and arithmetic within the processor
    - Sometimes the arithmetic term is dropped in this type of analysis
  - Assumed fast memory was large enough to hold the three vectors
    - Reasonable if we are talking about any level of cache
    - Not if we are talking about registers (~32 words)
  - Assumed the cost of a fast memory access is 0
    - Reasonable if we are talking about registers
    - Not necessarily if we are talking about cache (1-2 cycles for L1)
  - Assumed memory latency is constant
- Could simplify even further by ignoring memory operations on the x and y vectors
  - Mflop rate per element = 2 / (2*tf + tm)

CS194 Lecture 3, Slide 41: Validating the Model
- How well does the model predict actual performance?
  - Actual: DGEMV, the most highly optimized code for each platform
- The model is sufficient to compare across machines
  - But it under-predicts on the most recent ones, due to the latency estimate

CS194 Lecture 3, Slide 42: Summary
- Details of the machine are important for performance
  - Processor and memory system (not just parallelism)
  - Before you parallelize, make sure you're getting good serial performance
  - What to expect? Use your understanding of hardware limits
- There is parallelism hidden within processors
  - Pipelining, SIMD, etc.
- Locality is at least as important as computation
  - Temporal: reuse of data recently used
  - Spatial: using data near data that was recently used
- Machines have memory hierarchies
  - 100s of cycles to read from DRAM (main memory)
  - Caches are fast (small) memories that optimize the average case
- Can rearrange code/data to improve locality