
Slide 1: Minimizing Communication in Numerical Linear Algebra: Introduction, Technological Trends
Jim Demmel, EECS & Math Departments, UC Berkeley
www.cs.berkeley.edu/~demmel, demmel@cs.berkeley.edu

Slide 2: Outline (of all lectures)
Technology trends
- Why all computers must be parallel processors
- Arithmetic is cheap; what costs is moving data, between processors or between levels of the memory hierarchy
"Direct" Linear Algebra
- Lower bounds on how much data must be moved to solve linear algebra problems like Ax = b, Ax = λx, etc.
- Algorithms that attain these lower bounds
- Not in standard libraries like Sca/LAPACK (yet!)
- Large speed-ups possible
"Iterative" Linear Algebra (Krylov Subspace Methods)
- Ditto
Extensions, open problems... (time permitting)

Slide 3: Outline (of all lectures), repeated verbatim from Slide 2.

Slide 4: Technology Trends: Microprocessor Capacity
"Moore's Law": 2x transistors/chip every 1.5 years.
Microprocessors have become smaller, denser, and more powerful. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra

Slide 5: (figure-only slide; no text in the transcript)

Slide 6: Impact of Device Shrinkage
What happens when the feature size (transistor size) shrinks by a factor of x?
- Clock rate goes up by ~x because wires are shorter (actually less than x, because of power consumption)
- Transistors per unit area go up by x^2
- Die size also tends to increase, typically by another factor of ~x
- So the raw computing power of the chip goes up by ~x^4!
- Typically the x^3 from density and die size is devoted to on-chip
  - parallelism: hidden parallelism such as ILP
  - locality: caches
- So most programs run ~x^3 times faster, without changing them
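
A tiny C sketch of the arithmetic above, as a sanity check; the shrink factor x = 2 is an assumed example, not a value from the slides:

```c
#include <stdio.h>

int main(void) {
    double x = 2.0;  /* assumed shrink factor, for illustration only */
    printf("clock rate:            ~%gx (wires are shorter)\n", x);
    printf("transistors per area:  ~%gx\n", x * x);
    printf("raw computing power:   ~%gx\n", x * x * x * x);
    printf("spent on ILP/caches:   ~%gx\n", x * x * x);
    return 0;
}
```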

Slide 7: But there are limiting forces
- Moore's 2nd law (Rock's law): manufacturing costs go up
- Yield: what percentage of the chips are usable? E.g., the Cell processor (PS3) is sold with 7 out of 8 cores "on" to improve yield
- Manufacturing costs and yield problems limit the use of density
(Figure: demo of 0.06 micron CMOS. Source: Forbes Magazine)

Slide 8: Power Density Limits Serial Performance
(figure-only slide)

Slide 9: Parallelism Revolution is Happening Now
- Chip density continues to increase ~2x every 2 years
  - Clock speed is not increasing
  - The number of processor cores may double instead
- There is little or no more hidden parallelism (ILP) to be found
- Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Slide 10: Motif/Dwarf: Common Computational Methods (Red Hot → Blue Cool)

Slide 11: Outline (of all lectures), repeated verbatim from Slide 2.

Slide 12: Understanding Bandwidth and Latency (1/2)
Bandwidth and latency are metrics for the cost of moving data.
For a freeway between Berkeley and Sacramento:
- Bandwidth: how many cars/hour can get from Berkeley to Sacramento?
  #cars/hour = density (#cars/mile) × velocity (miles/hour) × #lanes
- Latency: how long does it take for the first car to get from Berkeley to Sacramento?
  #hours = distance (#miles) / velocity (miles/hour)
For accessing a disk:
- Bandwidth: how many bits/second can you read from disk?
  #bits/sec = density (#bits/cm) × velocity (cm/sec) × #read_heads
- Latency: how long do you wait for the first bit?
  #seconds = distance_to_move_read_head (cm) / velocity (cm/sec) + distance_half_way_around (cm) / velocity (cm/sec)
For sending data from one processor to another: same idea.
For moving data from DRAM to on-chip cache: same idea.
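
To make the analogy concrete, here is a minimal C sketch of the two formulas; all of the numbers (density, velocity, lane count, distance) are invented for illustration:

```c
#include <stdio.h>

int main(void) {
    double density  = 40.0;  /* cars per mile per lane (assumed) */
    double velocity = 65.0;  /* miles per hour (assumed) */
    double lanes    =  4.0;  /* number of lanes (assumed) */
    double distance = 80.0;  /* Berkeley to Sacramento, miles (approx., assumed) */

    double bandwidth = density * velocity * lanes;  /* cars per hour */
    double latency   = distance / velocity;         /* hours until first car */

    printf("bandwidth: %.0f cars/hour\n", bandwidth);
    printf("latency:   %.2f hours\n", latency);
    return 0;
}
```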

Slide 13: Understanding Bandwidth and Latency (2/2)
- Bandwidth = #bits (or words...) per second that can be communicated
- Latency = #seconds for the first bit (or word) to arrive
Notation, using #words as the unit:
- 1/Bandwidth ≡ β, units of seconds/word
- Latency ≡ α, units of seconds
Basic timing model for communication: moving one "message" of n words costs time = α + β·n.
We will model the running time of an algorithm as a sum of 3 terms:
- #flops × time_per_flop
- #words moved / bandwidth
- #messages × latency
Which term is largest? That's the one to minimize.
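
A minimal C sketch of this three-term model; the machine parameters (gamma, beta, alpha) and the algorithm's counts are made up for illustration, just to show how the terms compare:

```c
#include <stdio.h>

int main(void) {
    double gamma = 1e-9;  /* seconds per flop (assumed) */
    double beta  = 1e-8;  /* seconds per word moved (assumed) */
    double alpha = 1e-5;  /* seconds per message (assumed) */

    double flops = 2e9, words = 1e7, messages = 1e3;  /* example counts (assumed) */

    double t_flops = gamma * flops;
    double t_bw    = beta  * words;
    double t_lat   = alpha * messages;

    printf("flop term:      %.3f s\n", t_flops);
    printf("bandwidth term: %.3f s\n", t_bw);
    printf("latency term:   %.3f s\n", t_lat);
    printf("total:          %.3f s\n", t_flops + t_bw + t_lat);
    return 0;
}
```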

Slide 14: Experimental Study of Memory (Membench)
Microbenchmark for memory system performance:

for array A of length L from 4 KB to 8 MB by 2x:
    for stride s from 4 bytes (1 word) to L/2 by 2x:
        time the following loop (repeat many times and average):
            for i from 0 to L by s:
                load A[i] from memory (4 bytes)

Each (L, s) pair is one experiment.
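
Below is a minimal C sketch of this experiment, showing the loop structure only; a real membench must take more care with timers, warm-up, and preventing the compiler from optimizing the loads away:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    const size_t max_bytes = 8u << 20;  /* 8 MB */
    const int reps = 100;
    char *A = malloc(max_bytes);
    if (!A) return 1;
    memset(A, 1, max_bytes);            /* touch every page once */
    volatile char sink;                 /* keep the loads alive */

    for (size_t L = 4u << 10; L <= max_bytes; L *= 2)   /* 4 KB .. 8 MB */
        for (size_t s = 4; s <= L / 2; s *= 2) {        /* stride in bytes */
            clock_t t0 = clock();
            for (int r = 0; r < reps; r++)              /* repeat and average */
                for (size_t i = 0; i < L; i += s)
                    sink = A[i];                        /* one load */
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("L=%8zu  s=%7zu  avg=%.2e s/load\n",
                   L, s, secs / ((double)reps * (L / s)));
        }
    free(A);
    return 0;
}
```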

Slide 15: Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz. (Figure: average access time vs. stride, one curve per array length.)
See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details.

Slide 16: Memory Hierarchy
Most programs have a high degree of locality in their accesses:
- spatial locality: accessing things nearby previous accesses
- temporal locality: reusing an item that was previously accessed
The memory hierarchy tries to exploit locality:

Level                                   Speed     Size
processor (registers, on-chip cache)    ~1 ns     ~B
second level cache (SRAM)               ~10 ns    ~KB
main memory (DRAM)                      ~100 ns   ~MB
secondary storage (disk)                ~10 ms    ~GB
tertiary storage (disk/tape)            ~10 sec   ~TB

Slide 17: Membench: What to Expect
Consider the average cost per load:
- Plot one line for each array length, time vs. stride
- Small stride is best: if a cache line holds 4 words, at most 1/4 of accesses miss
- If the array is smaller than a given cache, all those accesses will hit (after the first run, which is negligible for large enough run counts)
- Picture assumes only one level of cache
- Values have gotten more difficult to measure on modern processors
(Figure: average cost per access vs. stride s; flat at the cache hit time when total size < L1, rising toward memory time when size > L1.)
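
A small C sketch of the reasoning above, predicting the average cost per load for an array larger than the cache; the hit time, miss penalty, and line size are assumed values, not measurements from these slides:

```c
#include <stdio.h>

int main(void) {
    double t_hit  =   6.0;  /* ns, cache hit time (assumed) */
    double t_miss = 100.0;  /* ns, miss penalty to memory (assumed) */
    int line_words = 4;     /* words per cache line (assumed) */

    /* Array larger than the cache: each new cache line costs a miss. */
    for (int s = 1; s <= 16; s *= 2) {  /* stride in words */
        int touched = s < line_words ? s : line_words;
        double miss_rate = (double)touched / line_words;  /* misses per load */
        double avg = t_hit + miss_rate * t_miss;
        printf("stride %2d words: miss rate %.2f, avg %.1f ns/load\n",
               s, miss_rate, avg);
    }
    return 0;
}
```

For stride 1 with 4-word lines, only every 4th load misses (the "at most 1/4 miss" on the slide); once the stride reaches the line size, every load misses and the curve flattens at the memory time.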

Slide 18: Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz. Read off the plot:
- L1: 16 KB, 16-byte line, 2 cycles (6 ns)
- L2: 2 MB, 64-byte line, 12 cycles (36 ns)
- Memory: 396 ns (132 cycles)
- 8 KB pages, 32 TLB entries
See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details.

Slide 19: Processor-DRAM Gap (latency)
(Figure: performance vs. time, 1980-2000, log scale. CPU performance, "Moore's Law", improves ~60%/yr; DRAM improves ~7%/yr; the processor-memory performance gap grows ~50%/yr.)
Memory hierarchies are getting deeper:
- Processors get faster more quickly than memory

Slide 20: Why avoiding communication is important (2/2)
Running time of an algorithm is a sum of 3 terms:
- #flops × time_per_flop
- #words moved / bandwidth   (communication)
- #messages × latency        (communication)
time_per_flop << 1/bandwidth << latency, and the gaps are growing exponentially with time.

Annual improvements:
           time_per_flop   bandwidth   latency
  Network       60%           26%        15%
  DRAM                        23%         7%
(the 60%/year for time_per_flop applies to flops overall)

Goal: reorganize linear algebra to avoid communication
- between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
- not just hiding communication (speedup ≤ 2x)
- arbitrary speedups possible
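
A short C sketch of why these gaps matter: compounding the annual rates in the table above shows how quickly flop speed pulls away from DRAM bandwidth and latency (compile with -lm):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double flop_rate = 1.60;  /* time_per_flop improves 60%/year */
    double dram_bw   = 1.23;  /* DRAM bandwidth improves 23%/year */
    double dram_lat  = 1.07;  /* DRAM latency improves 7%/year */

    for (int years = 5; years <= 20; years += 5) {
        double bw_gap  = pow(flop_rate / dram_bw,  years);
        double lat_gap = pow(flop_rate / dram_lat, years);
        printf("%2d years: flop/bandwidth gap grows %.0fx, flop/latency gap %.0fx\n",
               years, bw_gap, lat_gap);
    }
    return 0;
}
```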

Slide 21: Memory Hierarchy on a Pentium III
Katmai processor on Millennium, 550 MHz. Read off the plot:
- L1: 64 KB, 5 ns, 4-way?, 32-byte line?
- L2: 512 KB, 60 ns

Slide 22: Memory Hierarchy on an IBM Power3 (Seaborg)
Power3, 375 MHz. Read off the plot:
- L1: 32 KB, 128-byte line, 0.5-2 cycles
- L2: 8 MB, 128-byte line, 9 cycles
- Memory: 396 ns (132 cycles)

Slide 23: Implications so far
Communication (moving data) is much more expensive than arithmetic, and getting more so over time.
Communication occurs at many levels in the machine:
- between levels of the memory hierarchy (registers ↔ L1 ↔ L2 ↔ DRAM ↔ disk...)
- between levels of parallelism:
  - between cores on a multicore chip
  - between chips on a multisocket board (e.g., CPU and GPU)
  - between boards in a rack
  - between racks in a cluster/supercomputer/"cloud"
  - between cities in a grid
- all of these are expensive compared to arithmetic
What are we going to do about it?
- It is not enough to hope the cache policy will deal with it; we need better algorithms
- True not just for linear algebra
Strategy: deal with two levels at a time, and try to apply recursion.

Slide 24: Outline of lectures
1. Introduction, technological trends
2. Case study: Matrix Multiplication
3. Communication Lower Bounds for Direct Linear Algebra
4. Optimizing One-sided Factorizations (LU and QR)
5. Multicore and GPU implementations
6. Communication-optimal Eigenvalue and SVD algorithms
7. Optimizing Sparse-Matrix-Vector Multiplication (SpMV)
8. Communication-optimal Krylov Subspace Methods
9. Further topics, time permitting...
Lots of open problems (i.e., homework...)

Slide 25: Collaborators
- Grey Ballard, UCB EECS
- Ioana Dumitriu, U. Washington
- Laura Grigori, INRIA
- Ming Gu, UCB Math
- Mark Hoemmen, UCB EECS
- Olga Holtz, UCB Math & TU Berlin
- Julien Langou, U. Colorado Denver
- Marghoob Mohiyuddin, UCB EECS
- Oded Schwartz, TU Berlin
- Hua Xiang, INRIA
- Kathy Yelick, UCB EECS & NERSC
- BeBOP group at Berkeley

Slide 26: Where to find papers, slides, video, software
- bebop.cs.berkeley.edu
- www.cs.berkeley.edu/~demmel/cs267_Spr10
- parlab.eecs.berkeley.edu
  - parlab.cs.berkeley.edu/2009bootcamp
  - Sign-up for the 2010 bootcamp available

Slide 27: EXTRA SLIDES

