Communication [Lower] Bounds for Heterogeneous Architectures Julian Bui

Outline
– Problem
– Goal
– Matrix-vector (MV) multiplication within a constant factor of the lower bound
– Matrix-matrix multiplication within a constant factor of the lower bound (with examples!)
– If there's time: derivation of the proofs of the lower bounds for N^2 and N^3 problems

Problem
– N^2 problems, like MV multiply: dominated by the communication needed to read the inputs from and write the outputs to main memory.
– N^3 problems, like matrix-matrix multiply: dominated by the communication between global and local memory (bounded via the Loomis-Whitney inequality).
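As a hedged aside (these are standard results, stated here for context rather than taken from the slides; n is the matrix dimension and M the local memory size), the two regimes correspond to the bounds

\[
  W_{\mathrm{MV}} = \Omega(n^2), \qquad
  W_{\mathrm{MM}} = \Omega\!\left(\frac{n^3}{\sqrt{M}}\right),
\]

the first because the matrix itself must be read from main memory, the second via the Loomis-Whitney inequality.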

Architecture

Goal
Let's create a model based on the communication scheme in order to minimize the overall time. Execution time then becomes a function of processing speed, bandwidth, message latency, and memory size.

Optimization Function
The objective is the overall execution time. Its terms (labelled on the slide's equation, whose figure is not reproduced here):
– Communication cost of reading the inputs and writing the outputs
– Number of bytes and number of messages passed between global and shared memory, weighted by the inverse bandwidth and by the latency to shared memory
– Number of FLOPs times the computation speed
The slowest processor is the weakest link: the overall time can be no smaller than the time taken by any single element, even the fastest heterogeneous one (a GPU, for example).
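To make the shape of this objective concrete, here is a minimal sketch (an assumption of this transcript, not code from the talk; the names Proc, proc_time, overall_time and the symbols gamma, beta, alpha are illustrative):

# Minimal sketch of the heterogeneous cost model described above.
# Each processor is assumed to be characterized by
#   gamma: seconds per flop, beta: seconds per word (inverse bandwidth),
#   alpha: seconds per message (latency).
from dataclasses import dataclass

@dataclass
class Proc:
    gamma: float   # time per floating-point operation
    beta: float    # time per word moved (inverse bandwidth)
    alpha: float   # time per message (latency)

def proc_time(p: Proc, flops: float, words: float, msgs: float) -> float:
    """Time for one processor: computation + bandwidth + latency terms."""
    return p.gamma * flops + p.beta * words + p.alpha * msgs

def overall_time(procs, assignments):
    """Overall time is set by the most heavily loaded (slowest) element."""
    return max(proc_time(p, *a) for p, a in zip(procs, assignments))

# Example: a CPU-like and a GPU-like element with different characteristics.
cpu = Proc(gamma=1e-9, beta=1e-8, alpha=1e-6)
gpu = Proc(gamma=1e-10, beta=2e-8, alpha=5e-6)
print(overall_time([cpu, gpu], [(1e9, 1e7, 1e3), (9e9, 9e7, 1e3)]))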

Heterogeneous MV multiply
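The slide's partitioning figure is not reproduced in the transcript. As an illustration only (hetero_mv and the proportional row split are assumptions, not necessarily the talk's exact scheme), a heterogeneous MV multiply can assign row blocks of A in proportion to each processor's effective speed:

# Illustrative sketch: split the rows of A among heterogeneous processors in
# proportion to their effective speeds, then combine the partial results.
import numpy as np

def hetero_mv(A, x, speeds):
    """y = A @ x, with one row block per processor, sized by its 'speed'."""
    n = A.shape[0]
    shares = np.array(speeds, dtype=float)
    shares /= shares.sum()
    # Cumulative row boundaries, one block per (heterogeneous) processor.
    bounds = np.concatenate(([0], np.round(np.cumsum(shares) * n).astype(int)))
    bounds[-1] = n
    parts = [A[bounds[i]:bounds[i + 1]] @ x for i in range(len(speeds))]
    return np.concatenate(parts)

A = np.arange(36.0).reshape(6, 6)
x = np.ones(6)
assert np.allclose(hetero_mv(A, x, speeds=[1.0, 3.0]), A @ x)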

Heterogeneous, recursive MM
Solve the optimization equation for each processor, then assign it an appropriate amount of work, in units that are multiples of 8 sub matrix-multiplication blocks.
Recursively solve each sub-block using a divide-and-conquer method.
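A minimal sketch of the recursive step (scheduling the eight sub-products onto heterogeneous processors is omitted; recursive_mm is an illustrative name, and square power-of-two sizes are assumed):

# One n x n product splits into 8 sub-products of size n/2, which is why work
# is handed out in multiples of 8 sub matrix-multiplication blocks.
import numpy as np

def recursive_mm(A, B, cutoff=2):
    n = A.shape[0]                      # assume square, n a power of two
    if n <= cutoff:
        return A @ B                    # base case: small block product
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 8 sub-multiplications; each pair sums into one output block.
    C11 = recursive_mm(A11, B11, cutoff) + recursive_mm(A12, B21, cutoff)
    C12 = recursive_mm(A11, B12, cutoff) + recursive_mm(A12, B22, cutoff)
    C21 = recursive_mm(A21, B11, cutoff) + recursive_mm(A22, B21, cutoff)
    C22 = recursive_mm(A21, B12, cutoff) + recursive_mm(A22, B22, cutoff)
    return np.block([[C11, C12], [C21, C22]])

A = np.arange(1, 17, dtype=float).reshape(4, 4)
B = np.arange(1, 17, dtype=float).reshape(4, 4)
assert np.allclose(recursive_mm(A, B), A @ B)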

Recursive, Parallel MM
[Figures showing the block decomposition of the matrices are not reproduced in the transcript.] The product is formed from 2 x 2 blocks: with the left operand's blocks labelled [[C, D], [E, F]] and the right operand's blocks labelled [[G, H], [I, J]], the block product is [[CG + DI, CH + DJ], [EG + FI, EH + FJ]].

Review
The bottom-right entry of the block product is EH + FJ, so that is the partial result the next slides compute.

Partial products for the first pair of sub-blocks, [[9, 10], [13, 14]] times [[3, 4], [7, 8]]:
CG = 9 * 3 = 27    CH = 9 * 4 = 36
EG = 13 * 3 = 39   EH = 13 * 4 = 52
DI = 10 * 7 = 70   DJ = 10 * 8 = 80
FI = 14 * 7 = 98   FJ = 14 * 8 = 112

Partial products for the second pair of sub-blocks, [[11, 12], [15, 16]] times [[11, 12], [15, 16]]:
CG = 11 * 11 = 121   CH = 11 * 12 = 132
EG = 15 * 11 = 165   EH = 15 * 12 = 180
DI = 12 * 15 = 180   DJ = 12 * 16 = 192
FI = 16 * 15 = 240   FJ = 16 * 16 = 256
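A quick sanity check of the tables above (a sketch using only the sub-block values that survive in the transcript):

# The two 2x2 sub-block products whose partial results (CG, CH, ..., FJ)
# appear in the tables above.
import numpy as np

L1 = np.array([[9, 10], [13, 14]])    # [[C, D], [E, F]] for the first pair
R1 = np.array([[3, 4], [7, 8]])       # [[G, H], [I, J]] for the first pair
L2 = np.array([[11, 12], [15, 16]])   # second pair: both blocks are equal
R2 = np.array([[11, 12], [15, 16]])

print(L1 @ R1)   # [[CG+DI, CH+DJ], [EG+FI, EH+FJ]] = [[ 97 116] [137 164]]
print(L2 @ R2)   # [[301 324] [405 436]]; here EH + FJ = 180 + 256 = 436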

Derivation of Proofs
For MM, a node can only do O(M * sqrt(M)) useful arithmetic operations per phase, where a phase is a segment of execution during which the processor moves on the order of M words.
– If a single processor accesses rows of matrix A and some of these rows have at least sqrt(M) elements, then the processor can touch at most sqrt(M) such rows of matrix A per phase, since only about M words fit in its memory.
– Since each row of C is a product of a row in A and all of B, the useful multiplications on a single processor that involve those rows of matrix A are bounded by those sqrt(M) rows times the O(M) elements of B resident during the phase, i.e., O(M * sqrt(M)) per phase.

Derivation Cont'd
Number of total phases = the total amount of work G per processor divided by the amount of work a processor can do per phase: on the order of G / (M * sqrt(M)).
Total amount of communication = number of phases times the memory used per phase (~M words).

Derivation Cont'd
Total number of words moved, W:
W >= (number of phases - 1) * M
W >= (G / (M * sqrt(M)) - 1) * M
W >= G * M / (M * sqrt(M)) - M
W >= G / sqrt(M) - M
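As a hedged closing note (not on the original slides): if the n^3 multiplications of classical MM are divided evenly over P processors, then G = n^3 / P and the bound specializes to

\[
  W \;\ge\; \frac{n^3}{P\sqrt{M}} \;-\; M,
\]

which matches the Loomis-Whitney-based bound alluded to on the Problem slide.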