Uniprocessor Optimizations and Matrix Multiplication

Slides:

Advertisements

Similar presentations

PipelineCSCE430/830 Pipeline: Introduction CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Prof. Yifeng Zhu, U of Maine Fall,

Advertisements

Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.

Pipelining I (1) Fall 2005 Lecture 18: Pipelining I.

CS252/Patterson Lec 1.1 1/17/01 Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer.

Instruction-Level Parallelism (ILP)

High Performance Parallel Programming Dirk van der Knijff Advanced Research Computing Information Division.

CS-447– Computer Architecture Lecture 12 Multiple Cycle Datapath

Data Locality CS 524 – High-Performance Computing.

08/29/2002CS267 Lecure 21 CS 267: Optimizing for Uniprocessors—A Case Study in Matrix Multiplication Katherine Yelick

CS267 L2 Memory Hierarchies.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 2: Memory Hierarchies and Optimizing Matrix Multiplication.

Uniprocessor Optimizations and Matrix Multiplication

CS430 – Computer Architecture Introduction to Pipelined Execution

Computer ArchitectureFall 2007 © November 12th, 2007 Majd F. Sakr CS-447– Computer Architecture.

Optimizing for the serial processors Scaled speedup: operate near the memory boundary. Memory systems on modern processors are complicated. The performance.

High Performance Computing 1 Numerical Linear Algebra An Introduction.

1 Lecture 2 Single Processor Machines: Memory Hierarchies and Processor Features UCSB CS240A, Winter 2013 Modified from Demmel/Yelick’s slides.

1 Single Processor Machines: Memory Hierarchies and Processor Features Case Study: Tuning Matrix Multiply Based on slides by James Demmel

CS1104: Computer Organisation School of Computing National University of Singapore.

EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.

CS194 Lecture 31 Memory Hierarchies and Optimizations Kathy Yelick

1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.

1 Pipelining Part I CS What is Pipelining? Like an Automobile Assembly Line for Instructions –Each step does a little job of processing the instruction.

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and

CS252/Patterson Lec 1.1 1/17/01 معماري کامپيوتر - درس نهم pipeline برگرفته از درس : Prof. David A. Patterson.

Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.

CPE 779 Parallel Computing - Spring Memory Hierarchies & Optimizations Case Study: Tuning Matrix Multiply Based on slides by: Kathy Yelick

Lecture 18: Pipelining I.

Computer Organization

Review: Instruction Set Evolution

The Goal: illusion of large, fast, cheap memory

CDA 3101 Spring 2016 Introduction to Computer Organization

Cache Memories CSE 238/2038/2138: Systems Programming

Optimizing Cache Performance in Matrix Multiplication

Performance of Single-cycle Design

CMSC 611: Advanced Computer Architecture

5 Steps of MIPS Datapath Figure A.2, Page A-8

5.2 Eleven Advanced Optimizations of Cache Performance

Cache Memory Presentation I

Appendix C Pipeline implementation

Optimizing Cache Performance in Matrix Multiplication

BLAS: behind the scenes

Chapter 4 The Processor Part 2

High Performance Programming on a Single Processor: Memory Hierarchies Matrix Multiplication Automatic Performance Tuning James Demmel

Memory Hierarchies.

Lecturer: Alan Christopher

Serial versus Pipelined Execution

Memory Hierarchy Memory: hierarchy of components of various speeds and capacities Hierarchy driven by cost and performance In early days Primary memory.

Pipeline control unit (highly abstracted)

The Processor Lecture 3.4: Pipelining Datapath and Control

An Introduction to pipelining

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Morgan Kaufmann Publishers Memory Hierarchy: Introduction

Instruction Execution Cycle

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipeline control unit (highly abstracted)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining Appendix A and Chapter 3.

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Memory System Performance Chapter 3

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay steps.

A relevant question Assuming you’ve got: One washer (takes 30 minutes)

CS 140 : Matrix multiplication

Caches & Memory.

Pipelined datapath and control

Presentation transcript:

Uniprocessor Optimizations and Matrix Multiplication BeBOP Summer 2002 http://www.cs.berkeley.edu/~richie/bebop 11/6/2018 CS267 Lecure 2 CS267 Lecture 2

Applications ... Scientific simulation and modeling Signal processing Weather and earthquakes Cars and buildings The universe Signal processing Audio and image compression Machine vision Speech recognition Information retrieval Web searching Human genome Computer graphics and computational geometry Structural models Films: Final Fantasy, Shrek 11/6/2018 CS267 Lecure 2

… and their Building Blocks (Kernels) Scientific simulation and modeling Matrix-vector/matrix-matrix multiply Solving linear systems Signal processing Performing fast transforms: Fourier, trigonometric, wavelet Filtering Linear algebra on structured matrices Information retrieval Sorting Finding eigenvalues and eigenvectors Computer graphics and computational geometry Matrix multiply Computing matrix determinants 11/6/2018 CS267 Lecure 2

Parallelism in Modern Processors Memory Hierarchies Outline Parallelism in Modern Processors Memory Hierarchies Matrix Multiply Cache Optimizations Bag of Tricks 11/6/2018 CS267 Lecure 2

Modern Processors: Theory & Practice Idealized Uniprocessor Model Execution order specified by program Operations (load/store, +/*, branch) have roughly the same cost Processors in the Real World Registers and caches Small amounts of fast memory Memory ops have widely varying costs Exploit Instruction-Level Parallelism (ILP) Superscalar — multiple functional units Pipelined — decompose units of execution into parallel stages Different instruction mixes/orders have different costs Why is this your problem? In theory, compilers understand all this mumbo-jumbo and optimize your programs; in practice, they don’t. 11/6/2018 CS267 Lecure 2

What is Pipelining? 6 PM 7 8 9 30 40 20 A B C D In this example: Dave Patterson’s Laundry example: 4 people doing laundry wash (30 min) + dry (40 min) + fold (20 min) 6 PM 7 8 9 In this example: Sequential execution takes 4 * 90min = 6 hours Pipelined execution takes 30+4*40+20 = 3.3 hours Pipelining helps throughput, but not latency Pipeline rate limited by slowest pipeline stage Potential speedup = Number pipe stages Time to “fill” pipeline and time to “drain” it reduces speedup Time T a s k O r d e 30 40 20 A B C D 11/6/2018 CS267 Lecure 2

Limits of ILP Hazards prevent next instruction from executing in its designated clock cycle T a s k O r d e Structural: single person to fold and put clothes away Data: missing socks Control: dyed clothes need to be rewashed Compiler will try to reduce these, but careful coding helps! A B C D 11/6/2018 CS267 Lecure 2

Parallelism in Modern Processors Memory Hierarchies Outline Parallelism in Modern Processors Memory Hierarchies Matrix Multiply Cache Optimizations Bag of Tricks 11/6/2018 CS267 Lecure 2

Memory Hierarchy Most programs have a high degree of locality in their accesses spatial locality: accessing things nearby previous accesses temporal locality: reusing an item that was previously accessed Memory hierarchy tries to exploit locality processor control Second level cache (SRAM) Secondary storage (Disk) Main memory (DRAM) Tertiary storage (Disk/Tape/WWW) datapath on-chip cache registers Speed (ns): 1 10 100 10 ms 10 sec Size (bytes): 100,1Ks Ms Gs Ts Ps 11/6/2018 CS267 Lecure 2

Processor-DRAM Gap (latency) Memory hierarchies are getting deeper Processors get faster more quickly than memory µProc 60%/yr. 1000 CPU “Moore’s Law” 100 Processor-Memory Performance Gap: (grows 50% / year) Performance 10 DRAM 7%/yr. Y-axis is performance X-axis is time Latency Cliché: Not e that x86 didn’t have cache on chip until 1989 DRAM 1 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 Time 11/6/2018 CS267 Lecure 2 CS267 Lecture 2

Cache Basics Cache hit: in-cache memory access—cheap Cache miss: non-cached memory access—expensive Consider a tiny cache (for illustration only) X000 X001 X010 X011 X100 X101 X110 X111 Cache line length: # of bytes loaded together in one entry Associativity direct-mapped: only one address (line) in a given range in cache n-way: 2 or more lines with different addresses exist 11/6/2018 CS267 Lecure 2

Experimental Study of Memory Microbenchmark for memory system performance (Saavedra ’92) time the following program for each size(A) and stride s (repeat to obtain confidence and mitigate timer resolution) for array A of size from 4KB to 8MB by 2x for stride s from 8 Bytes (1 word) to size(A)/2 by 2x for i from 0 to size by s load A[i] from memory (8 Bytes) 11/6/2018 CS267 Lecure 2

Memory Hierarchy on a Sun Ultra-IIi Sun Ultra-IIi, 333 MHz Array size Mem: 396 ns (132 cycles) L2: 2 MB, 36 ns (12 cycles) L1: 16K, 6 ns (2 cycle) L1: 16 byte line L2: 64 byte line 8 K pages See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details 11/6/2018 CS267 Lecure 2

Memory Hierarchy on a Pentium III Array size Katmai processor on Millennium, 550 MHz L2: 512 KB 60 ns L1: 64K 5 ns, 4-way? L1: 32 byte line ? 11/6/2018 CS267 Lecure 2

Lessons True performance can be a complicated function of the architecture Slight changes in architecture or program change performance significantly To write fast programs, need to consider architecture We would like simple models to help us design efficient algorithms Is this possible? Next: Example of improving cache performance: blocking or tiling Idea: decompose problem workload into cache-sized pieces 11/6/2018 CS267 Lecure 2

Parallelism in Modern Processors Memory Hierarchies Outline Parallelism in Modern Processors Memory Hierarchies Matrix Multiply Cache Optimizations Bag of Tricks 11/6/2018 CS267 Lecure 2

Note on Matrix Storage A matrix is a 2-D array of elements, but memory addresses are “1-D” Conventions for matrix layout by column, or “column major” (Fortran default) by row, or “row major” (C default) Column major Row major 5 10 15 1 2 3 1 6 11 16 4 5 6 7 2 7 12 17 8 9 10 11 3 8 13 18 12 13 14 15 4 9 14 19 16 17 18 19 11/6/2018 CS267 Lecure 2

Note on “Performance” For linear algebra, measure performance as rate of execution: Millions of floating point operations per second (Mflop/s) Higher is better Comparing Mflop/s is not the same as comparing time unless flops are constant! Speedup taken wrt time Speedup of A over B = (Running time of B) / (Running time of A) 11/6/2018 CS267 Lecure 2

Using a Simple Model of Memory to Optimize Assume just 2 levels in the hierarchy, fast and slow All data initially in slow memory m = number of memory elements (words) moved between fast and slow memory tm = time per slow memory operation f = number of arithmetic operations tf = time per arithmetic operation << tm q = f / m average number of flops per slow element access Minimum possible time = f* tf when all data in fast memory Actual time Larger q means time closer to minimum f * tf Key to algorithm efficiency Key to machine efficiency 11/6/2018 CS267 Lecure 2

Warm up: Matrix-vector multiplication {implements y = y + A*x} for i = 1:n for j = 1:n y(i) = y(i) + A(i,j)*x(j) A(i,:) = + * y(i) y(i) x(:) 11/6/2018 CS267 Lecure 2

Warm up: Matrix-vector multiplication {read x(1:n) into fast memory} {read y(1:n) into fast memory} for i = 1:n {read row i of A into fast memory} for j = 1:n y(i) = y(i) + A(i,j)*x(j) {write y(1:n) back to slow memory} m = number of slow memory refs = 3n + n2 f = number of arithmetic operations = 2n2 q = f / m ~= 2 Matrix-vector multiplication limited by slow memory speed 11/6/2018 CS267 Lecure 2

“Naïve” Matrix Multiply {implements C = C + A*B} for i = 1 to n for j = 1 to n for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) A(i,:) C(i,j) C(i,j) B(:,j) = + * 11/6/2018 CS267 Lecure 2

“Naïve” Matrix Multiply {implements C = C + A*B} for i = 1 to n {read row i of A into fast memory} for j = 1 to n {read C(i,j) into fast memory} {read column j of B into fast memory} for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) {write C(i,j) back to slow memory} A(i,:) C(i,j) C(i,j) B(:,j) = + * 11/6/2018 CS267 Lecure 2

“Naïve” Matrix Multiply Number of slow memory references on unblocked matrix multiply m = n3 read each column of B n times + n2 read each row of A once + 2n2 read and write each element of C once = n3 + 3n2 So q = f / m = 2n3 / (n3 + 3n2) ~= 2 for large n, no improvement over matrix-vector multiply A(i,:) C(i,j) C(i,j) B(:,j) = + * 11/6/2018 CS267 Lecure 2

Blocked (Tiled) Matrix Multiply Consider A,B,C to be N by N matrices of b by b subblocks where b=n / N is called the block size for i = 1 to N for j = 1 to N {read block C(i,j) into fast memory} for k = 1 to N {read block A(i,k) into fast memory} {read block B(k,j) into fast memory} C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks} {write block C(i,j) back to slow memory} A(i,k) C(i,j) C(i,j) = + * B(k,j) 11/6/2018 CS267 Lecure 2

Blocked (Tiled) Matrix Multiply Recall: m : # of moves from slow to fast memory Matrix is n x n, and N x N blocks each of size b x b f = 2n3 for this problem q = f / m is algorithmic memory efficiency So: m = N*n2 read each block of B N3 times (N3 * n/N * n/N) + N*n2 read A + 2n2 read and write each block of C once = (2N + 2) * n2 So q = f / m = 2n3 / ((2N + 2) * n2) ~= n / N = b for large n So we can improve performance by increasing the block size b Can be much faster than matrix-vector multiply (q=2) 11/6/2018 CS267 Lecure 2

Limits to Optimizing Matrix Multiply Blocked algorithm has ratio q ~= b Larger block size => faster implementation Limit: All three blocks from A,B,C must fit in fast memory (cache): 3b2 <= M So: q ~= b <= sqrt(M/3) Lower bound: Theorem (Hong & Kung, 1981): Any reorganization of this algorithm (using only algebraic associativity) is limited to: q = O(sqrt(M)) 11/6/2018 CS267 Lecure 2

Basic Linear Algebra Subroutines Industry standard interface (evolving) Hardware vendors, others supply optimized implementations History BLAS1 (1970s): vector operations: dot product, saxpy (y=a*x+y), etc m=2*n, f=2*n, q ~1 or less BLAS2 (mid 1980s) matrix-vector operations: matrix vector multiply, etc m=n^2, f=2*n^2, q~2, less overhead somewhat faster than BLAS1 BLAS3 (late 1980s) matrix-matrix operations: matrix matrix multiply, etc m >= 4n^2, f=O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2 Good algorithms used BLAS3 when possible (LAPACK) See www.netlib.org/blas, www.netlib.org/lapack 11/6/2018 CS267 Lecure 2

BLAS speeds on an IBM RS6000/590 Peak speed = 266 Mflops Peak BLAS 3 BLAS 2 BLAS 1 BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2 (n-by-n matrix vector multiply) vs BLAS 1 (saxpy of n vectors) 11/6/2018 CS267 Lecure 2

Locality in Other Algorithms The performance of any algorithm is limited by q In matrix multiply, we increase q by changing computation order increased temporal locality For other algorithms and data structures, even hand- transformations are still an open problem sparse matrices (reordering, blocking) trees (B-Trees are for the disk level of the hierarchy) linked lists (some work done here) 11/6/2018 CS267 Lecure 2

Parallelism in Modern Processors Memory Hierarchies Outline Parallelism in Modern Processors Memory Hierarchies Matrix Multiply Cache Optimizations Bag of Tricks 11/6/2018 CS267 Lecure 2

Tiling Alone Might Not Be Enough Naïve and a “naïvely tiled” code 11/6/2018 CS267 Lecure 2

Optimizing in Practice Tiling for registers loop unrolling, use of named “register” variables Tiling for multiple levels of cache Exploiting fine-grained parallelism in processor superscalar; pipelining Complicated compiler interactions Hard to do by hand (but you’ll try) Automatic optimization an active research area BeBOP: www.cs.berkeley.edu/~richie/bebop PHiPAC: www.icsi.berkeley.edu/~bilmes/phipac in particular tr-98-035.ps.gz ATLAS: www.netlib.org/atlas 11/6/2018 CS267 Lecure 2

PHiPAC: Portable High Performance ANSI C Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops 11/6/2018 CS267 Lecure 2

ATLAS (DGEMM n = 500) Source: Jack Dongarra ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor. 11/6/2018 CS267 Lecure 2

Removing False Dependencies Using local variables, reorder operations to remove false dependencies a[i] = b[i] + c; a[i+1] = b[i+1] * d; false read-after-write hazard between a[i] and b[i+1] float f1 = b[i]; float f2 = b[i+1]; a[i] = f1 + c; a[i+1] = f2 * d; With some compilers, you can say explicitly (via flag or pragma) that a and b are not aliased. 11/6/2018 CS267 Lecure 2

Exploit Multiple Registers Reduce demands on memory bandwidth by pre-loading into local variables while( … ) { *res++ = filter[0]*signal[0] + filter[1]*signal[1] + filter[2]*signal[2]; signal++; } float f0 = filter[0]; float f1 = filter[1]; float f2 = filter[2]; while( … ) { *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2]; signal++; } also: register float f0 = …; 11/6/2018 CS267 Lecure 2

Minimize Pointer Updates Replace pointer updates for strided memory addressing with constant array offsets f0 = *r8; r8 += 4; f1 = *r8; r8 += 4; f2 = *r8; r8 += 4; f0 = r8[0]; f1 = r8[4]; f2 = r8[8]; r8 += 12; 11/6/2018 CS267 Lecure 2

Loop Unrolling Expose instruction-level parallelism float f0 = filter[0], f1 = filter[1], f2 = filter[2]; float s0 = signal[0], s1 = signal[1], s2 = signal[2]; *res++ = f0*s0 + f1*s1 + f2*s2; do { signal += 3; s0 = signal[0]; res[0] = f0*s1 + f1*s2 + f2*s0; s1 = signal[1]; res[1] = f0*s2 + f1*s0 + f2*s1; s2 = signal[2]; res[2] = f0*s0 + f1*s1 + f2*s2; res += 3; } while( … ); 11/6/2018 CS267 Lecure 2

Expose Independent Operations Hide instruction latency Use local variables to expose independent operations that can execute in parallel or in a pipelined fashion Balance the instruction mix (what functional units are available?) f1 = f5 * f9; f2 = f6 + f10; f3 = f7 * f11; f4 = f8 + f12; 11/6/2018 CS267 Lecure 2

Copy optimization Copy input operands or blocks Reduce cache conflicts Constant array offsets for fixed size blocks Expose page-level locality Original matrix (numbers are addresses) Reorganized into 2x2 blocks 4 8 12 2 8 10 1 5 9 13 1 3 9 11 2 6 10 14 4 6 12 13 3 7 11 15 5 7 14 15 11/6/2018 CS267 Lecure 2

Summary Performance programming on uniprocessors requires understanding of fine-grained parallelism in processor produce good instruction mix understanding of memory system levels, costs, sizes improve locality Blocking (tiling) is a basic approach Techniques apply generally, but the details (e.g., block size) are architecture dependent Similar techniques are possible on other data structures and algorithms Now it’s your turn: Homework 0 (due 6/25/02) http://www.cs.berkeley.edu/~richie/bebop/notes/matmul2002 11/6/2018 CS267 Lecure 2

End (Extra slides follow) 11/6/2018 CS267 Lecure 2

Example: 5 Steps of MIPS Datapath Figure 3 Example: 5 Steps of MIPS Datapath Figure 3.4, Page 134 , CA:AQA 2e by Patterson and Hennessy Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back Next PC IF/ID ID/EX MEM/WB EX/MEM MUX Next SEQ PC Next SEQ PC 4 Adder Zero? RS1 Reg File Address Memory MUX RS2 ALU Memory Data MUX MUX Sign Extend Imm WB Data RD RD RD Pipelining is also used within arithmetic units a fp multiply may have latency 10 cycles, but throughput of 1/cycle 11/6/2018 CS267 Lecure 2 CS267 Lecture 2

Dependences (Data Hazards) Limit Parallelism A dependence or data hazard is one of the following: true of flow dependence: a writes a location that b later reads (read-after write or RAW hazard) anti-dependence a reads a location that b later writes (write-after-read or WAR hazard) output dependence a writes a location that b later writes (write-after-write or WAW hazard) true anti output a = = a 11/6/2018 CS267 Lecure 2

Observing a Memory Hierarchy Dec Alpha, 21064, 150 MHz clock Mem: 300 ns (45 cycles) L2: 512 K, 52 ns (8 cycles) L1: 8K, 6.7 ns (1 cycle) 32 byte cache line 8 K pages See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details 11/6/2018 CS267 Lecure 2