1 Minimizing Communication in Numerical Linear Algebra: Introduction, Technological Trends
Jim Demmel, EECS & Math Departments, UC Berkeley
Slides: www.cs.berkeley.edu/~demmel

Summer School Lecture 1 2 Outline (of all lectures)
Technology trends
- Why all computers must be parallel processors
- Arithmetic is cheap, what costs is moving data
  - Between processors, or between levels of the memory hierarchy
“Direct” Linear Algebra
- Lower bounds on how much data must be moved to solve linear algebra problems like Ax=b, Ax = λx, etc.
- Algorithms that attain these lower bounds
  - Not in standard libraries like Sca/LAPACK (yet!)
  - Large speed-ups possible
“Iterative” Linear Algebra (Krylov Subspace Methods)
- Ditto
Extensions, open problems… (time permitting)

Summer School Lecture 1 4 Technology Trends: Microprocessor Capacity
2X transistors/chip every 1.5 years: called “Moore’s Law”
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra

6 Impact of Device Shrinkage
What happens when the feature size (transistor size) shrinks by a factor of x?
- Clock rate goes up by x because wires are shorter
  - actually less than x, because of power consumption
- Transistors per unit area go up by x²
- Die size also tends to increase
  - typically by another factor of ~x
- Raw computing power of the chip goes up by ~x⁴!
  - typically x³ of this is devoted to either on-chip parallelism (hidden parallelism such as ILP) or locality (caches)
- So most programs run ~x³ times faster, without changing them
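A quick worked check of the scaling above (illustrative numbers, not from the slide): for a shrink factor x = 2, the clock rate rises by about 2x, transistors per unit area by 2² = 4x, and die area by roughly another 2x, so raw capability grows by about 2 · 4 · 2 = 16x = x⁴; of that, roughly x³ = 8x historically went into hidden parallelism (ILP) and caches rather than into anything the programmer had to change.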

7 But there are limiting forces
Moore’s 2nd law (Rock’s law): manufacturing costs go up
- (photo: demo of 0.06 micron CMOS; source: Forbes Magazine)
Yield
- What percentage of the chips are usable?
- E.g., the Cell processor (PS3) is sold with 7 out of 8 “on” to improve yield
Manufacturing costs and yield problems limit use of density

Summer School Lecture 1 8 Power Density Limits Serial Performance

Summer School Lecture 1 9 Parallelism Revolution is Happening Now
Chip density is continuing to increase ~2x every 2 years
- Clock speed is not
- Number of processor cores may double instead
There is little or no more hidden parallelism (ILP) to be found
Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Summer School Lecture 1 Motif/Dwarf: Common Computational Methods (Red Hot → Blue Cool)

Summer School Lecture 1 12 Understanding Bandwidth and Latency (1/2)
Bandwidth and latency are metrics for the cost of moving data
For a freeway between Berkeley and Sacramento
- Bandwidth: how many cars/hour can get from Berkeley to Sac?
  - #cars/hour = density (#cars/mile) × velocity (miles/hour) × #lanes
- Latency: how long does it take for the first car to get from Berkeley to Sac?
  - #hours = distance (#miles) / velocity (miles/hour)
For accessing a disk
- Bandwidth: how many bits/second can you read from disk?
  - #bits/sec = density (#bits/cm) × velocity (cm/sec) × #read_heads
- Latency: how long do you wait for the first bit?
  - #seconds = distance_to_move_read_head (cm) / velocity (cm/sec) + distance_half_way_around (cm) / velocity (cm/sec)
For sending data from one processor to another – same idea
For moving data from DRAM to on-chip cache – same idea …
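To attach numbers to the freeway analogy (made-up figures, not from the slide): with an 80-mile freeway, cars at 60 miles/hour, 40 cars per mile of packed traffic, and 4 lanes, latency = 80 / 60 ≈ 1.3 hours before the first car arrives, while bandwidth = 40 × 60 × 4 = 9600 cars/hour once traffic is flowing. Adding lanes (or read heads, or wires) raises bandwidth but does nothing for latency, which is why the two are tracked separately.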

Summer School Lecture 1 13 Understanding Bandwidth and Latency (2/2)
Bandwidth = #bits (or words…) per second that can be communicated
Latency = #seconds for the first bit (or word) to arrive
Notation: use #words as the unit
- 1/Bandwidth ≡ β, units of seconds/word
- Latency ≡ α, units of seconds
Basic timing model for communication
- moving one “message” of n bits (or words) costs α + β·n
We will model the running time of an algorithm as the sum of 3 terms:
- # flops * time_per_flop
- # words moved / bandwidth
- # messages * latency
Which term is largest? That’s the one to minimize.
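The three-term model above is easy to evaluate directly. The C sketch below (all machine parameters and operation counts are made up for illustration, not measurements from the lecture) just computes the three terms and reports which one dominates:

    #include <stdio.h>

    /* Model: time = gamma*#flops + beta*#words_moved + alpha*#messages,
       where gamma = time per flop, beta = 1/bandwidth (sec/word),
       alpha = latency (sec/message). All numbers below are hypothetical. */
    int main(void) {
        double gamma = 1e-10;   /* 10 Gflop/s  -> 1e-10 s per flop */
        double beta  = 1e-9;    /* 1 Gword/s   -> 1e-9  s per word */
        double alpha = 1e-6;    /* 1 microsecond per message       */

        double flops = 2e9, words = 1e8, messages = 1e4;  /* hypothetical counts */

        double t_flops = gamma * flops;
        double t_bw    = beta  * words;
        double t_lat   = alpha * messages;

        printf("flop term = %g s, bandwidth term = %g s, latency term = %g s\n",
               t_flops, t_bw, t_lat);
        printf("total modeled time = %g s\n", t_flops + t_bw + t_lat);
        return 0;
    }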

Summer School Lecture 1 14 Experimental Study of Memory (Membench)
Microbenchmark for memory system performance

  time the following loop (repeat many times and average):
    for i from 0 to L
      load A[i] from memory (4 Bytes)

  for array A of length L from 4 KB to 8 MB by 2x
    for stride s from 4 Bytes (1 word) to L/2 by 2x
      time the following loop (repeat many times and average):
        for i from 0 to L by s
          load A[i] from memory (4 Bytes)

(each (array length, stride) pair is 1 experiment)
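A minimal C sketch of the membench loops described above, under the assumptions that a word is a 4-byte int and that clock() is an adequate timer; a real benchmark would repeat each measurement many more times, subtract loop overhead, and use a higher-resolution timer:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* For each array length L and stride s, time a pass that touches every
       s-th element and report the average time per load. */
    int main(void) {
        const size_t max_bytes = 8u << 20;                 /* 8 MB */
        int *A = malloc(max_bytes);
        if (!A) return 1;
        for (size_t i = 0; i < max_bytes / sizeof(int); i++) A[i] = 1;

        for (size_t bytes = 4u << 10; bytes <= max_bytes; bytes *= 2) {  /* 4 KB .. 8 MB */
            size_t L = bytes / sizeof(int);                              /* length in 4-byte words */
            for (size_t s = 1; s <= L / 2; s *= 2) {                     /* stride in words */
                volatile int sink = 0;
                clock_t t0 = clock();
                for (int rep = 0; rep < 10; rep++)                       /* repeat and average */
                    for (size_t i = 0; i < L; i += s)
                        sink += A[i];                                    /* one 4-byte load */
                double secs  = (double)(clock() - t0) / CLOCKS_PER_SEC;
                double loads = 10.0 * (double)(L / s);
                printf("array = %7zu bytes, stride = %6zu words: %6.2f ns/load\n",
                       bytes, s, 1e9 * secs / loads);
            }
        }
        free(A);
        return 0;
    }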

Summer School Lecture 1 15 Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz
(plot: average memory access time vs. stride, one curve per array length)

Summer School Lecture 1 16 Memory Hierarchy
Most programs have a high degree of locality in their accesses
- spatial locality: accessing things nearby previous accesses
- temporal locality: reusing an item that was previously accessed
The memory hierarchy tries to exploit locality
- Levels: processor (control, datapath, registers, on-chip cache), second level cache (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (disk/tape)
- Speed, roughly, going down the hierarchy: 1 ns, 10 ns, 100 ns, 10 ms, 10 sec
- Size, roughly, going down the hierarchy: B, KB, MB, GB, TB
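As a small, generic illustration of spatial locality (not code from the lecture), the two C routines below sum the same row-major n×n matrix; the first walks memory with unit stride and uses every word of each cache line it brings in, while the second strides by a whole row and typically misses on nearly every access once the matrix no longer fits in cache:

    /* Sum an n x n matrix stored in row-major order, two ways. */
    double sum_row_order(const double *A, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                s += A[(size_t)i * n + j];   /* consecutive addresses: good spatial locality */
        return s;
    }

    double sum_col_order(const double *A, int n) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                s += A[(size_t)i * n + j];   /* jumps n*8 bytes per access: poor locality */
        return s;
    }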

Summer School Lecture 1 17 Membench: What to Expect
Consider the average cost per load
- Plot one line for each array length, time vs. stride
- Small stride is best: if a cache line holds 4 words, at most ¼ of the loads miss
- If the array is smaller than a given cache, all those accesses will hit (after the first run, which is negligible for large enough runs)
- Picture assumes only one level of cache
- Values have gotten more difficult to measure on modern processors
(sketch: average cost per access vs. stride s; flat at the cache hit time while the total size < L1, rising toward the memory time once the size > L1)
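One simple model for the expected curve (my notation, filling in the slide's sketch):

  average time per load ≈ hit_time, if the array fits in cache;
  average time per load ≈ hit_time + min(1, stride_in_bytes / line_size) · miss_penalty, if it does not.

So each curve is flat at the cache hit time for small arrays, rises with stride until every load misses, and then flattens again at roughly the memory access time.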

Summer School Lecture 1 18 Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz
- L1: 16 KB, 16 B line, 2 cycles (6 ns)
- L2: 2 MB, 64 byte line, 12 cycles (36 ns)
- Mem: 396 ns (132 cycles)
- 8 K pages, 32 TLB entries
(annotations on the plot of average access time vs. stride, one curve per array length)

Summer School Lecture 1 19 Processor-DRAM Gap (latency)
(plot: performance vs. time for CPU and DRAM, 1982 onward)
- µProc performance improves ~60%/yr (“Moore’s Law”)
- DRAM performance improves ~7%/yr
- Processor-memory performance gap grows ~50% / year
Memory hierarchies are getting deeper
- Processors get faster more quickly than memory

Summer School Lecture 1 20 Why avoiding communication is important (2/2)
Running time of an algorithm is the sum of 3 terms:
- # flops * time_per_flop
- # words moved / bandwidth   (communication)
- # messages * latency        (communication)
Time_per_flop << 1/bandwidth << latency, and the gaps are growing exponentially with time
Annual improvements:
- Time_per_flop: 60%
- Network: bandwidth 26%, latency 15%
- DRAM: bandwidth 23%, latency 7%
Goal: reorganize linear algebra to avoid communication
- Between all memory hierarchy levels: L1, L2, DRAM, network, etc.
- Not just hiding communication (speedup ≤ 2x)
- Arbitrary speedups possible
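The “gaps growing exponentially” claim is just compound growth of the annual rates in the table. The short C sketch below (the starting ratio is arbitrary; only the 60% and 23% rates come from the table above) projects how the flop-rate to DRAM-bandwidth ratio grows year over year:

    #include <stdio.h>

    /* Compound the annual improvement rates: time_per_flop improves ~60%/year,
       DRAM bandwidth ~23%/year, so flops-per-word-moved grows every year. */
    int main(void) {
        double flop_rate = 1.0;   /* arbitrary starting units */
        double dram_bw   = 1.0;
        for (int year = 0; year <= 10; year++) {
            printf("year %2d: relative flops per word moved from DRAM = %.1f\n",
                   year, flop_rate / dram_bw);
            flop_rate *= 1.60;    /* 60%/year faster arithmetic */
            dram_bw   *= 1.23;    /* 23%/year more DRAM bandwidth */
        }
        return 0;
    }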

Summer School Lecture 1 21 Memory Hierarchy on a Pentium III
Katmai processor on Millennium, 550 MHz
- L1: 64 KB, 32 byte line?, 5 ns, 4-way?
- L2: 512 KB, 60 ns
(annotations on the plot of average access time vs. stride, one curve per array size)

Summer School Lecture 1 22 Memory Hierarchy on an IBM Power3 (Seaborg)
Power3, 375 MHz
- L1: 32 KB, 128 B line, 0.5-2 cycles
- L2: 8 MB, 128 B line, 9 cycles
- Mem: 396 ns (132 cycles)
(annotations on the plot of average access time vs. stride, one curve per array size)

Summer School Lecture 1 23 Implications so far
Communication (moving data) is much more expensive than arithmetic, and getting more so over time
Communication occurs at many levels in the machine
- Between levels of the memory hierarchy (registers ↔ L1 ↔ L2 ↔ DRAM ↔ disk…)
- Between levels of parallelism
  - Between cores on a multicore chip
  - Between chips in a multisocket board (e.g., CPU and GPU)
  - Between boards in a rack
  - Between racks in a cluster/supercomputer/“cloud”
  - Between cities in a grid
- All of these are expensive compared to arithmetic
What are we going to do about it?
- Not enough to hope the cache policy will deal with it; we need better algorithms
- True not just for linear algebra
Strategy: deal with two levels at a time, and try to apply recursion

Summer School Lecture 1 24 Outline of lectures
1. Introduction, technological trends
2. Case study: Matrix Multiplication
3. Communication Lower Bounds for Direct Linear Algebra
4. Optimizing One-sided Factorizations (LU and QR)
5. Multicore and GPU implementations
6. Communication-optimal Eigenvalue and SVD algorithms
7. Optimizing Sparse-Matrix-Vector-Multiplication (SpMV)
8. Communication-optimal Krylov Subspace Methods
9. Further topics, time permitting…
Lots of open problems (i.e., homework…)

Summer School Lecture 1 25 Collaborators
Grey Ballard, UCB EECS; Ioana Dumitriu, U. Washington; Laura Grigori, INRIA; Ming Gu, UCB Math; Mark Hoemmen, UCB EECS; Olga Holtz, UCB Math & TU Berlin; Julien Langou, U. Colorado Denver; Marghoob Mohiyuddin, UCB EECS; Oded Schwartz, TU Berlin; Hua Xiang, INRIA; Kathy Yelick, UCB EECS & NERSC; and the BeBOP group at Berkeley

Summer School Lecture 1 26 Where to find papers, slides, video, software
- bebop.cs.berkeley.edu
- parlab.eecs.berkeley.edu
  - parlab.cs.berkeley.edu/2009bootcamp
  - Sign up for the 2010 bootcamp when it becomes available

Summer School Lecture 1 27 EXTRA SLIDES