Programming for Performance Laxmikant Kale CS 433

Causes of performance loss: If each processor is rated at k MFLOPS and there are p processors, why don't we see k*p MFLOPS of performance? There are several causes; each must be understood separately, but they interact with one another in complex ways: a solution to one problem may create another, and one problem may mask another that manifests itself only under other conditions (e.g. increased p).

Causes: sequential issues (cache performance); communication overhead; algorithmic overhead ("extra work"); speculative work; load imbalance; (long) critical paths; bottlenecks.

Algorithmic overhead: Parallel algorithms may have a higher operation count. Example: parallel prefix (also called "scan"): how do we parallelize this? B[0] = A[0]; for (i=1; i<N; i++) B[i] = B[i-1] + A[i];

Parallel prefix, continued: How do we do this operation in parallel? It seems inherently sequential. The recursive doubling algorithm works, with an operation count of N*log(P). A better algorithm takes the blocking of the data into account: each processor calculates the sum of its own block, participates in a parallel prefix over those P partial sums to obtain the sum of all elements to its left, and then adds that offset while computing the prefix of its own elements. Operation count: N + log(P) + N, i.e. about double the sequential count (see the sketch below).
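To make the blocked algorithm concrete, here is a minimal sketch in C with MPI (an illustration under assumed names, not the lecture's code); each rank holds n doubles of its block of A, writes the prefix into B, and uses MPI_Exscan for the parallel step over the P partial sums:

  #include <mpi.h>

  /* Block-based parallel prefix: a first local pass (~n adds), one exclusive
     scan over the P partial sums (~log P steps), and a second local pass. */
  void block_prefix(const double *A, double *B, int n, MPI_Comm comm)
  {
      double local_sum = 0.0;
      for (int i = 0; i < n; i++)
          local_sum += A[i];

      double left_sum = 0.0;            /* sum of all elements on lower ranks */
      MPI_Exscan(&local_sum, &left_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
      int rank;
      MPI_Comm_rank(comm, &rank);
      if (rank == 0)
          left_sum = 0.0;               /* MPI_Exscan leaves rank 0 undefined */

      double running = left_sum;
      for (int i = 0; i < n; i++) {
          running += A[i];
          B[i] = running;
      }
  }

The two local passes account for the 2N operations, and the exclusive scan supplies the log(P) communication steps.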

Bottlenecks: Consider the "primes" program (or the "pi" program): what happens when we run it on 1000 PEs? How to eliminate bottlenecks: two structures are useful in most such cases: spanning trees (organize the processors in a tree) and hypercube-based dimensional exchange.

Communication overhead Components: –per message and per byte –sending, receiving and network –capacity constraints Grainsize analysis: –How much computation per message –Computation-to-communication ratio

Communication overhead examples: Usually one must reorganize data or work to reduce communication; combining communication also helps. Examples follow.

Communication overhead: Communication delay is the time interval from sending on one processor to receipt on another: time = alpha + beta*N for an N-byte message. Communication overhead is the time a processor is held up (both sender and receiver are held up), again of the form alpha + beta*N. Typical values: alpha on the order of microseconds, beta around 2-10 ns per byte.

Grainsize control: A simple definition of grainsize is the amount of computation per message; the problem is that it ignores the difference between short and long messages. A more realistic measure is the computation-to-communication ratio: computation time / (alpha + beta*N) for one message.

Example: matrix multiplication. How do we parallelize this?
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)      // C[i][j] == 0 initially
      for (k=0; k<N; k++)
        C[i][j] += A[i][k] * B[k][j];

A simple algorithm: Distribute A by rows and B by columns, so any processor can request a row of A and get it (in two messages); same for a column of B. Distribute the work of computing each element of C using some load balancing scheme, so the approach works even on machines with varying processor capabilities (e.g. timeshared clusters). What is the computation-to-communication ratio? For each object: 2N operations, and 2 messages of N bytes each.

Computation-to-communication ratio: 2N / (2*alpha + 2*N*beta); assuming 0.01 us per operation, alpha = 10 us and beta = 0.002 us per byte, this is (2N * 0.01) / (2*10 + 2*0.002*N).
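For a concrete feel (same assumed constants, with N = 1000): the computation per element of C is 2*1000*0.01 = 20 us, while the communication is 2*10 + 2*0.002*1000 = 24 us, so the ratio is about 0.83; each object spends roughly as much time communicating as computing, which is why this simple scheme scales poorly.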

A better algorithm: Store A as a collection of row-bunches, each bunch holding g rows; do the same for B's columns. Each object now computes a g x g section of C. Computation-to-communication ratio: 2*g*g*N operations for 2 messages of g*N bytes each, so the alpha ratio is 2*g*g*N/2 = g*g*N operations per message and the beta ratio is g operations per byte.
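For example (illustrative value, not from the slide): with g = 10, each object does 2*100*N operations for 2 messages of 10*N bytes, so the alpha ratio is 100*N operations per message and the beta ratio is 10 operations per byte, roughly a hundredfold and a tenfold improvement over the per-element scheme.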

Alpha vs beta: The per-message cost is significantly larger than the per-byte cost, by a factor of several thousand. So several optimizations are possible that trade a larger beta cost for a smaller alpha cost, i.e. send fewer, larger messages. Applications of this idea: message combining, and library support for complex communication patterns (each-to-all, ...).

Example: each-to-all communication: each processor wants to send a distinct N-byte message to every other processor. A simple (direct) implementation costs each processor about alpha*P + N*beta*P. Typical values: alpha = 10 microseconds, beta = 2 nanoseconds per byte.
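As a hedged illustration with assumed sizes: for P = 1024 processors and N = 1000 bytes per message, the alpha term is 1024 * 10 us, about 10 ms per processor, while the beta term is 1024 * 1000 * 0.002 us, about 2 ms, so the per-message cost dominates even for kilobyte-sized messages.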

Programming for performance, the steps: select or design a parallel algorithm; decide on a decomposition; select a load balancing strategy; plan the communication structure; examine synchronization needs (global synchronizations, critical paths).

Design philosophy: Parallel algorithm design: ensure good performance (total operation count), generate sufficient parallelism, and avoid or minimize "extra work". Decomposition: break the computation into many small pieces, down to the smallest grain that still amortizes the overhead.

Design principles, continued: Load balancing: select a static, dynamic, or quasi-dynamic strategy, and measurement-based vs prediction-based load estimation. Principle: it is acceptable to let a processor idle, but avoid overloading one (think about why). Reduce communication overhead: algorithmic reorganization (change the mapping), message combining, and efficient communication libraries.

Design principles: synchronization. Eliminate unnecessary global synchronization. If T(i,j) is the time spent in the i'th phase on the j'th PE, then with a barrier after each phase the total time is sum over i of ( max over j of T(i,j) ), while without it the total is max over j of ( sum over i of T(i,j) ), which is never larger. Critical paths: look for long chains of dependences; draw timeline pictures with the dependences.
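A small worked example with hypothetical numbers: two phases on two PEs with times T(1,1)=3, T(1,2)=1, T(2,1)=1, T(2,2)=3. With a barrier after phase 1 the total is max(3,1) + max(1,3) = 6; without it, each PE simply runs its own work and the finish time is max(3+1, 1+3) = 4.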

Diagnosing performance problems. Tools: back-of-the-envelope (i.e. simple) analysis; post-mortem analysis with performance logs (visualization of performance data, automatic analysis, phase-by-phase analysis, since a program may have many phases). What to measure: load distribution, (communication) overhead, and idle time, along with their averages, max/min, and variances; profiling: time spent in individual modules/subroutines.

Diagnostic techniques, tell-tale signs: max load >> average, and the number of PEs above average is >> 1: load imbalance. Max load >> average, and the number of PEs above average is ~1: a possible bottleneck (if there is a dependence on it). The profile shows the total time in some routine f growing as the number of PEs grows: algorithmic overhead. Communication overhead: obvious from the measurements.

Communication Optimization. Example problem from the last lecture: molecular dynamics. Each processor, assumed to house just one cell, needs to send 26 short messages to "neighboring" processors. Assume alpha = 10 us and beta = 2 ns for each send and each receive. Time spent (note: 26 sends and 26 receives): 26*2*(10 us) = 520 us. If there is more than one cell on each PE, multiply this number accordingly. Can this be improved? How?

Message combining: If there are multiple cells per processor, the neighbors of a cell may be on the same neighboring processor, and the neighbors of two different cells may be on the same processor; combine messages going to the same processor into one.

Communication Optimization I: Take advantage of the structure of the communication and do it in stages. If my coordinates are (x,y,z): send to (x+1,y,z) anything that goes to (x+1,*,*), and send to (x-1,y,z) anything that goes to (x-1,*,*); wait for the messages from the x neighbors, then send the y neighbors combined messages (and similarly for z). A total of 6 messages instead of 26, at the cost of an apparently longer critical path.
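In alpha terms (using the same assumed 10 us per send and per receive), the per-step cost drops from 26*2*10 = 520 us to 6*2*10 = 120 us; the combined messages are larger, but since the original messages were short, the added beta cost is small.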

Communication Optimization II Send all migrating atoms to processor 0 –Let processor 0 sort them out and send 1 message to each processor –Works ok if the number of processors is small Otherwise, bottleneck at 0

Communication Optimization III: The generalized problem is each-to-all with individualized messages. Dimensional exchange: instead of sending all data to node 0, can we do a distributed exchange? Arrange the processors in a virtual hypercube, using the binary representation of each processor's number; its neighbors are all processors whose number differs from its own in exactly one bit. There are log P phases; in each phase, send to the neighbor across that dimension all the data that belongs in the other partition.
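Below is a minimal sketch of dimensional exchange for each-to-all individualized messages, in C with MPI (an illustration, not the lecture's code). It assumes P is a power of two, that the payload for each (source, destination) pair is a fixed MSG_INTS ints, and it tags each packet with a small header so packets can be forwarded across phases:

  #include <mpi.h>
  #include <stdlib.h>
  #include <string.h>

  #define MSG_INTS 4                    /* assumed payload size per pair */
  #define PKT_INTS (2 + MSG_INTS)       /* packet = [src, dest, payload] */

  /* outbuf: P*MSG_INTS ints, the data this rank wants to send to each dest.
     inbuf : P*MSG_INTS ints, on return holds the data received from each src. */
  static void dim_exchange(const int *outbuf, int *inbuf,
                           int P, int me, MPI_Comm comm)
  {
      int npkts = P;                    /* invariant: we always hold P packets */
      int *pkts = malloc(P * PKT_INTS * sizeof(int));
      int *keep = malloc(P * PKT_INTS * sizeof(int));
      int *send = malloc(P * PKT_INTS * sizeof(int));
      int *recv = malloc(P * PKT_INTS * sizeof(int));
      for (int d = 0; d < P; d++) {     /* initial packets all have src = me */
          pkts[d*PKT_INTS + 0] = me;
          pkts[d*PKT_INTS + 1] = d;
          memcpy(&pkts[d*PKT_INTS + 2], &outbuf[d*MSG_INTS], MSG_INTS*sizeof(int));
      }
      for (int bit = 1; bit < P; bit <<= 1) {   /* one phase per dimension */
          int partner = me ^ bit, nkeep = 0, nsend = 0;
          for (int i = 0; i < npkts; i++) {     /* split: stays here vs. goes */
              int dest = pkts[i*PKT_INTS + 1];
              int *slot = ((dest & bit) != (me & bit))
                          ? &send[(nsend++)*PKT_INTS]
                          : &keep[(nkeep++)*PKT_INTS];
              memcpy(slot, &pkts[i*PKT_INTS], PKT_INTS*sizeof(int));
          }
          MPI_Status st;
          int nrecv_ints;
          MPI_Sendrecv(send, nsend*PKT_INTS, MPI_INT, partner, 0,
                       recv, P*PKT_INTS,     MPI_INT, partner, 0, comm, &st);
          MPI_Get_count(&st, MPI_INT, &nrecv_ints);
          memcpy(pkts, keep, nkeep*PKT_INTS*sizeof(int));
          memcpy(&pkts[nkeep*PKT_INTS], recv, nrecv_ints*sizeof(int));
          npkts = nkeep + nrecv_ints/PKT_INTS;  /* remains equal to P */
      }
      for (int i = 0; i < npkts; i++) { /* every packet now has dest == me */
          int src = pkts[i*PKT_INTS + 0];
          memcpy(&inbuf[src*MSG_INTS], &pkts[i*PKT_INTS + 2], MSG_INTS*sizeof(int));
      }
      free(pkts); free(keep); free(send); free(recv);
  }

Each of the log2(P) phases sends half of the roughly P packets held locally, matching the (lg P)(alpha + 0.5*n*P*beta) cost in the analysis on the next slide.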

Dimensional exchange, analysis: Each PE is sending n bytes to each other PE, so the total data sent (and received) by each processor is n(P-1), or about nP bytes. The baseline algorithm (direct sends): each processor incurs an overhead of (P-1)(alpha + n*beta). Dimensional exchange: each processor sends half of the data it currently holds to its neighbor in each phase, for a cost of (lg P)(alpha + 0.5*n*P*beta). The alpha factor is significantly reduced, but the beta factor has increased, because most data items travel multiple hops. This is a win when n is sufficiently small (how small?).
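A hedged comparison with assumed sizes (P = 1024, n = 100 bytes, alpha = 10 us, beta = 0.002 us/byte): direct sends cost about 1023 * (10 + 0.2), roughly 10.4 ms per processor, while dimensional exchange costs about 10 * (10 + 0.5*100*1024*0.002), roughly 1.1 ms. For these values, once n grows beyond roughly a kilobyte the 0.5*n*P beta term dominates and direct sends win again.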

Another idea: We must reduce the number of hops traveled by each data item (log P may be 10 or more for a 1024-processor system). Arrange the processors in a 2D grid and use 2 phases: (I) each processor sends sqrt(P)-1 messages within its column; (II) each processor waits for the messages within its column, and then sends sqrt(P)-1 combined messages within its row. Now the beta factor is proportional to 2 (each item travels 2 hops), while the alpha factor is proportional to 2*sqrt(P).
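With the same assumed P = 1024: sqrt(P) = 32, so each processor sends 2*31 = 62 messages (versus 1023 for direct sends and 10 for the hypercube), and each data item travels at most 2 hops, so the beta cost is only about twice that of direct sends.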

Generalization: Arrange the processors in a k-ary "hypercube": there are k processors along each dimension (each row), and D dimensions, so P = k^D.

Each-to-all multicast: here an identical message is sent from each processor to all others (a special case of each-to-all communication; with a single source it is a broadcast). Can we adapt the previous algorithms? Sending everything to one processor? No. Dimensional exchange and row-column broadcast are the alternatives to direct individual messages.

Optimizing Reductions. The operation: each processor contributes data that must be "added" via some commutative-associative operation; the result may be needed on only one processor, or on all of them; assume all PEs are ready with their data simultaneously. Naive algorithm: all PEs send to PE 0, which is O(P). Basic spanning-tree algorithm: organize the processors in a k-ary tree; leaves send their contributions to their parents; an internal node waits for data from all its children, adds its own contribution, and then, if it is not the root, sends the result to its parent (a sketch follows below). What is a good value of k?
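A minimal sketch of the basic spanning-tree reduction in C with MPI (an illustration under these assumptions, not the lecture's code); the ranks form an implicit k-ary tree rooted at 0, with rank p's children at k*p+1 through k*p+k:

  #include <mpi.h>

  /* Reduce one double per rank over a k-ary spanning tree; the sum ends up
     on rank 0 (the root).  Internal nodes receive from each existing child,
     add their own value, and forward the partial sum upward. */
  double tree_reduce(double myval, int k, MPI_Comm comm)
  {
      int me, P;
      MPI_Comm_rank(comm, &me);
      MPI_Comm_size(comm, &P);
      double sum = myval;
      for (int c = 1; c <= k; c++) {
          int child = k*me + c;
          if (child >= P) break;         /* no more children */
          double v;
          MPI_Recv(&v, 1, MPI_DOUBLE, child, 0, comm, MPI_STATUS_IGNORE);
          sum += v;
      }
      if (me != 0) {                     /* non-root: pass the partial sum up */
          MPI_Send(&sum, 1, MPI_DOUBLE, (me - 1)/k, 0, comm);
          return 0.0;                    /* result is valid only on the root */
      }
      return sum;
  }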

Reduction via spanning tree: The time to finish is roughly proportional to k * log_k(P) = k * ln(P) / ln(k), since the tree has log_k(P) levels and each internal node must receive and add k messages. This expression is minimized near k = e (about 2.7), so the minimum over integers is at k = 3; a more precise model that separates send overhead, receive overhead, and latency shifts the optimum slightly.

Better spanning trees: Observation: only one level of the tree is active at a time. Also, a PE cannot deal with data from its second child until it has finished receiving the data from the first, so the second child could delay sending its data with no impact on the total time; it can collect data from someone else in the meanwhile.

Hypercube-based spanning tree: Use a variant of dimensional exchange: in each phase i, send the accumulated data to the neighbor in the i'th dimension if its serial number is smaller than mine; until it is my turn to send, accumulate data from neighbors (sketch below). There are log P phases, with at most one receive per processor per phase. More complex spanning trees can exploit the actual values of the send overhead, latency, and receive overhead.
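A sketch of that dimensional-exchange variant, again an assumption-laden illustration in C with MPI rather than the lecture's code; it assumes P is a power of two and accumulates the result on rank 0:

  #include <mpi.h>

  double hypercube_reduce(double myval, MPI_Comm comm)
  {
      int me, P;
      MPI_Comm_rank(comm, &me);
      MPI_Comm_size(comm, &P);
      double sum = myval;
      for (int bit = 1; bit < P; bit <<= 1) {
          int partner = me ^ bit;
          if (partner < me) {            /* my turn to send; then I am done */
              MPI_Send(&sum, 1, MPI_DOUBLE, partner, 0, comm);
              break;
          }
          double v;                      /* otherwise accumulate from the higher neighbor */
          MPI_Recv(&v, 1, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
          sum += v;
      }
      return sum;                        /* meaningful only on rank 0 */
  }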

Reductions with large datasets What if n is large? –Example: simpler formulation of molecular dynamics: Each PE has an array of forces for all atoms Each PE is assigned a subset of pairs of atoms Accumulated forces must be summed up across processors New optimizations become possible with large n: –Essential idea: use multiple concurrent reductions to keep all levels of the tree busy

Concurrent reductions: Use a normal spanning tree (for example). Divide the data (n items) into segments of k items each and start a reduction for each segment: n/k pipelined phases (i.e. the phases overlap in time).
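Roughly, and with hypothetical numbers: for n = 1,000,000 items, segments of k = 10,000 items, and a tree of depth d levels, the pipelined reduction takes about n/k + d - 1 = 99 + d segment-steps, each operating on only k items, instead of d steps each operating on the full million-item array; once the pipeline fills, every level of the tree stays busy.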

Concurrent reductions and load balancing: the leaves of the spanning tree are doing little work. Use a different spanning tree for successive reductions: e.g. the first reduction uses a normal spanning tree rooted at 0, while the second uses a mirror-image tree rooted at P-1. This load balancing improves performance considerably.

Intro to Load Balancing. Example: 500 processors and a given number of units of work. What should the objective of load balancing be?