EECC756 - Shaaban lec # 5 Spring 2001 3-27-2001: Parallel Programming for Performance, A Process of Successive Refinement

Presentation transcript:

EECC756 - Shaaban #1 lec # 5 Spring 2001 Parallel Programming for Performance: A Process of Successive Refinement
Partitioning for Performance:
–Load Balancing and Synch Wait Time Reduction
–Identifying & Managing Concurrency
  Static vs. Dynamic Assignment
  Determining Optimal Task Granularity
  Reducing Serialization
–Reducing Inherent Communication
  Minimizing communication-to-computation ratio
–Efficient Domain Decomposition
–Reducing Additional Overheads
Extended Memory-Hierarchy View of Multiprocessors:
–Exploiting Spatial Locality
–Structuring Communication
  Reducing Contention
  Overlapping Communication

EECC756 - Shaaban #2 lec # 5 Spring 2001 Successive Refinement
Partitioning is often independent of architecture, and may be done first:
–View machine as a collection of communicating processors
  Balancing the workload
  Reducing the amount of inherent communication
  Reducing extra work
–Above three issues are conflicting.
Then deal with interactions with architecture:
–View machine as an extended memory hierarchy
  Extra communication due to architectural interactions
  Cost of communication depends on how it is structured
–This may inspire changes in partitioning.

EECC756 - Shaaban #3 lec # 5 Spring 2001 Partitioning for Performance
Balancing the workload and reducing wait time at synch points
Reducing inherent communication
Reducing extra work
These algorithmic issues have extreme trade-offs:
–Minimize communication => run on 1 processor => extreme load imbalance.
–Maximize load balance => random assignment of tiny tasks => no control over communication.
–Good partition may imply extra work to compute or manage it.
The goal is to compromise between the above extremes
–Fortunately, often not difficult in practice.

EECC756 - Shaaban #4 lec # 5 Spring 2001 Load Balancing and Synch Wait Time Reduction
Limit on speedup:
  Speedup_problem(p) ≤ Sequential Work / Max Work on any Processor
–Work includes data access and other costs.
–Not just equal work, but must be busy at same time.
Four parts to load balancing and reducing synch wait time:
1. Identify enough concurrency.
2. Decide how to manage it.
3. Determine the granularity at which to exploit it.
4. Reduce serialization and cost of synchronization.
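For intuition, a small worked example with hypothetical numbers: suppose the sequential work is 100 time units and the work assigned to four processors is 30, 25, 25, and 20 units; the busiest processor bounds the speedup.

% Hypothetical illustration of the load-balance limit on speedup
\text{Speedup}_{problem}(4) \le \frac{\text{Sequential Work}}{\max_i(\text{Work}_i)} = \frac{100}{\max(30,\,25,\,25,\,20)} = \frac{100}{30} \approx 3.3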

EECC756 - Shaaban #5 lec # 5 Spring 2001 Identifying Concurrency
Techniques seen for equation solver:
–Loop structure, fundamental dependences, new algorithms
Data Parallelism versus Function Parallelism
Often see orthogonal levels of parallelism; e.g. VLSI routing

EECC756 - Shaaban #6 lec # 5 Spring 2001 Identifying Concurrency (continued)
Function parallelism:
–Entire large tasks (procedures) that can be done in parallel on same or different data, e.g. different independent grid computations in Ocean.
–Pipelining, as in video encoding/decoding, or polygon rendering.
–Degree usually modest and does not grow with input size
–Difficult to load balance.
–Often used to reduce synch between data parallel phases.
Most scalable programs: Data parallel (per this loose definition)
–Function parallelism reduces synch between data parallel phases.

EECC756 - Shaaban #7 lec # 5 Spring 2001 Managing Concurrency
Static versus Dynamic techniques
Static:
–Algorithmic assignment based on input; won't change
–Low runtime overhead
–Computation must be predictable
–Preferable when applicable (except in multiprogrammed/heterogeneous environment)
Dynamic:
–Adapt at runtime to balance load
–Can increase communication and reduce locality
–Can increase task management overheads

EECC756 - Shaaban #8 lec # 5 Spring 2001 Dynamic Assignment
Profile-based (semi-static):
–Profile work distribution at runtime, and repartition dynamically
–Applicable in many computations, e.g. Barnes-Hut, some graphics
Dynamic Tasking:
–Deal with unpredictability in program or environment (e.g. Raytrace)
  Computation, communication, and memory system interactions
  Multiprogramming and heterogeneity
  Used by runtime systems and OS too
–Pool of tasks; take and add tasks until done
–E.g. "self-scheduling" of loop iterations (shared loop counter)
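A minimal sketch of the shared-loop-counter ("self-scheduling") idea in C with pthreads and C11 atomics; the iteration count, thread count, chunk size, and work() body are hypothetical placeholders.

/* Self-scheduling of loop iterations via a shared counter (sketch). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N_ITERS   1000
#define N_THREADS 4
#define GRAIN     16              /* iterations taken per fetch */

static atomic_int next_iter = 0;  /* shared loop counter */

static void work(int i) { (void)i; /* body of one iteration (placeholder) */ }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Atomically grab the next chunk of iterations. */
        int start = atomic_fetch_add(&next_iter, GRAIN);
        if (start >= N_ITERS) break;
        int end = (start + GRAIN < N_ITERS) ? start + GRAIN : N_ITERS;
        for (int i = start; i < end; i++)
            work(i);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("done\n");
    return 0;
}

The chunk size GRAIN trades load balance (small chunks) against contention on the shared counter (large chunks), which is exactly the granularity trade-off discussed below.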

EECC756 - Shaaban #9 lec # 5 Spring 2001 Dynamic Tasking with Task Queues
Centralized versus distributed queues.
Task stealing with distributed queues:
–Can compromise communication and locality, and increase synchronization.
–Whom to steal from, how many tasks to steal, ...
–Termination detection
–Maximum imbalance related to size of task

EECC756 - Shaaban #10 lec # 5 Spring 2001 Impact of Dynamic Assignment
On SGI Origin 2000 (cache-coherent shared memory):

EECC756 - Shaaban #11 lec # 5 Spring 2001 Determining Task Granularity
Recall that task granularity: amount of work associated with a task.
General rule:
–Coarse-grained => often less load balance
–Fine-grained => more overhead; often more communication, contention
Comm., contention actually affected by assignment, not size
–Overhead by size itself too, particularly with task queues

EECC756 - Shaaban #12 lec # 5 Spring 2001 Reducing Serialization
Careful assignment and orchestration (including scheduling)
Event synchronization:
–Reduce use of conservative synchronization
  e.g. point-to-point instead of barriers, or granularity of pt-to-pt
–But fine-grained synch is more difficult to program, and means more synch operations.
Mutual exclusion:
–Separate locks for separate data (see sketch below)
  e.g. locking records in a database: lock per process, record, or field
  Lock per task in task queue, not per queue
  Finer grain => less contention/serialization, more space, less reuse
–Smaller, less frequent critical sections
  No reading/testing in critical section, only modification
  e.g. searching for task to dequeue in task queue, building tree
–Stagger critical sections in time
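A minimal sketch of the "separate locks for separate data" point above, in C with pthreads; the record layout, NRECORDS, and function names are hypothetical.

/* One lock per record instead of one lock for the whole table (sketch). */
#include <pthread.h>

#define NRECORDS 1024

struct record {
    pthread_mutex_t lock;   /* finer grain: one lock per record      */
    double          value;  /* data protected by that record's lock  */
};

static struct record table[NRECORDS];

/* Coarse-grained alternative: every update serializes on this one lock. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

void table_init(void)
{
    for (int i = 0; i < NRECORDS; i++)
        pthread_mutex_init(&table[i].lock, NULL);
}

void update_coarse(int i, double delta)
{
    pthread_mutex_lock(&table_lock);       /* all updates contend here */
    table[i].value += delta;
    pthread_mutex_unlock(&table_lock);
}

void update_fine(int i, double delta)
{
    pthread_mutex_lock(&table[i].lock);    /* only updates to record i  */
    table[i].value += delta;               /* contend; critical section */
    pthread_mutex_unlock(&table[i].lock);  /* stays small               */
}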

EECC756 - Shaaban #13 lec # 5 Spring 2001 Implications of Load Balancing
Extends speedup limit expression to:
  Speedup_problem(p) ≤ Sequential Work / Max (Work + Synch Wait Time)
Generally, responsibility of software
Architecture can support task stealing and synch efficiently:
–Fine-grained communication, low-overhead access to queues
  Efficient support allows smaller tasks, better load balancing
–Naming logically shared data in the presence of task stealing
  Need to access data of stolen tasks, esp. multiply-stolen tasks
  => Hardware shared address space advantageous
–Efficient support for point-to-point communication

EECC756 - Shaaban #14 lec # 5 Spring 2001 Reducing Inherent Communication
Measure: communication-to-computation ratio
Focus here is on inherent communication:
–Determined by assignment of tasks to processes
–Actual communication can be greater
Assign tasks that access the same data to the same process
The optimal solution that reduces communication and achieves an optimal load balance is NP-hard in the general case
Simple heuristic solutions work well in practice:
–Due to specific structure of applications.

EECC756 - Shaaban #15 lec # 5 Spring 2001 Domain Decomposition
Works well for scientific, engineering, graphics, ... applications
Exploits the local-biased nature of physical problems:
–Information requirements often short-range
–Or long-range but fall off with distance
Simple example: nearest-neighbor grid computation
  Perimeter-to-area comm-to-comp ratio (area-to-volume in 3-d)
  Depends on n, p: decreases with n, increases with p

EECC756 - Shaaban #16 lec # 5 Spring 2001 Domain Decomposition (continued)
Best domain decomposition depends on information requirements
Nearest-neighbor example: block versus strip decomposition:
  Comm-to-comp ratio: 4*√p / n for block, 2*p / n for strip
Retain block from here on
Application dependent: strip may be better in other cases
  e.g. particle flow in tunnel
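A brief derivation of those two ratios for an n x n grid on p processors, assuming a nearest-neighbor update and counting one unit of computation per grid point:

% Block: each processor owns an (n/\sqrt{p}) x (n/\sqrt{p}) block and
% communicates its 4 block edges; computation is n^2/p points.
\frac{\text{comm}}{\text{comp}}\bigg|_{\text{block}} = \frac{4\,n/\sqrt{p}}{n^{2}/p} = \frac{4\sqrt{p}}{n}
% Strip: each processor owns n/p full rows and communicates 2 boundary
% rows of n points each.
\frac{\text{comm}}{\text{comp}}\bigg|_{\text{strip}} = \frac{2n}{n^{2}/p} = \frac{2p}{n}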

EECC756 - Shaaban #17 lec # 5 Spring 2001 Finding a Domain Decomposition
Static, by inspection:
–Must be predictable: grid example above, and Ocean
Static, but not by inspection:
–Input-dependent, requires analyzing input structure
–E.g. sparse matrix computations, data mining
Semi-static (periodic repartitioning):
–Characteristics change but slowly; e.g. Barnes-Hut
Static or semi-static, with dynamic task stealing:
–Initial decomposition, but highly unpredictable; e.g. ray tracing

EECC756 - Shaaban #18 lec # 5 Spring 2001 Other Partitioning Techniques
Scatter Decomposition, e.g. initial partition in Raytrace
Preserve locality in task stealing:
–Steal large tasks for locality, steal from same queues, ...

EECC756 - Shaaban #19 lec # 5 Spring 2001 Implications of Communication-to-Computation Ratio
Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
Architects must examine application needs
If denominator is execution time, ratio gives average BW needs
If operation count, gives extremes in impact of latency and bandwidth:
–Latency: assume no latency hiding
–Bandwidth: assume all latency hidden
–Reality is somewhere in between
Actual impact of communication depends on structure and cost as well:
–Need to keep communication balanced across processors as well.

EECC756 - Shaaban #20 lec # 5 Spring 2001 Reducing Extra Work (Overheads)
Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
Common sources of extra work:
–Computing a good partition
  e.g. partitioning in Barnes-Hut or sparse matrix
–Using redundant computation to avoid communication
–Task, data and process management overhead
  Applications, languages, runtime systems, OS
–Imposing structure on communication
  Coalescing messages, allowing effective naming
Architectural implications:
–Reduce need by making communication and orchestration efficient

EECC756 - Shaaban #21 lec # 5 Spring 2001 Summary of Parallel Algorithms Analysis
Speedup < Sequential Instructions / Max (Instructions + Synch Wait Time + Extra Instructions)
Requires characterization of multiprocessor system and algorithm
Historical focus on algorithmic aspects: partitioning, mapping
PRAM model: data access and communication are free
–Only load balance (including serialization) and extra work matter
–Useful for early development, but unrealistic for real performance
–Ignores communication and also the imbalances it causes
–Can lead to poor choice of partitions as well as orchestration
–More recent models incorporate communication costs: BSP, LogP, ...

EECC756 - Shaaban #22 lec # 5 Spring 2001 Limitations of Parallel Algorithm Analysis
Inherent communication in parallel algorithm is not the only communication present:
–Artifactual communication caused by program implementation and architectural interactions can even dominate.
–Thus, actual amount of communication may not be dealt with adequately
Cost of communication determined not only by amount:
–Also how communication is structured
–... and cost of communication in system
Both architecture-dependent, and addressed in orchestration step.

EECC756 - Shaaban #23 lec # 5 Spring 2001 Extended Memory-Hierarchy View of Multiprocessors
Levels in extended hierarchy:
–Registers, caches, local memory, remote memory (topology)
–Glued together by communication architecture
–Levels communicate at a certain granularity of data transfer
Need to exploit spatial and temporal locality in hierarchy:
–Otherwise extra communication may also be caused
–Especially important since communication is expensive

EECC756 - Shaaban #24 lec # 5 Spring 2001 Extended Hierarchy
Idealized view: local cache hierarchy + single main memory
But reality is more complex:
–Centralized Memory: caches of other processors
–Distributed Memory: some local, some remote; + network topology
–Management of levels:
  Caches managed by hardware
  Main memory depends on programming model
    SAS: data movement between local and remote transparent
    Message passing: explicit
–Improve performance through architecture or program locality
–Tradeoff with parallelism; need good node performance and parallelism

EECC756 - Shaaban #25 lec # 5 Spring 2001 Artifactual Communication in Extended Hierarchy
Accesses not satisfied in local portion cause communication
–Inherent communication, implicit or explicit, causes transfers
  Determined by program
–Artifactual communication:
  Determined by program implementation and architectural interactions
  Poor allocation of data across distributed memories
  Unnecessary data in a transfer
  Unnecessary transfers due to system granularities
  Redundant communication of data due to finite replication capacity (in cache or main memory)
–Inherent communication assumes unlimited capacity, small transfers, perfect knowledge of what is needed.
–More on artifactual communication later; first consider replication-induced communication further.

EECC756 - Shaaban #26 lec # 5 Spring 2001 Communication and Replication
Comm. induced by finite capacity is most fundamental artifact:
–Similar to cache size and miss rate or memory traffic in uniprocessors.
–Extended memory hierarchy view useful for this relationship
View as three-level hierarchy for simplicity:
–Local cache, local memory, remote memory (ignore network topology).
Classify "misses" in "cache" at any level as for uniprocessors:
  Compulsory or cold misses (no size effect)
  Capacity misses (yes)
  Conflict or collision misses (yes)
  Communication or coherence misses (no)
–Each may be helped/hurt by large transfer granularity (spatial locality).

EECC756 - Shaaban #27 lec # 5 Spring 2001 Working Set Perspective
At a given level of the hierarchy (to the next further one):
–Hierarchy of working sets
–At first-level cache (fully associative, one-word block), the working set curve for a program is inherent to the algorithm
–Traffic from any type of miss can be local or nonlocal (communication)

EECC756 - Shaaban #28 lec # 5 Spring 2001 Orchestration for Performance
Reducing amount of communication:
–Inherent: change logical data sharing patterns in algorithm
–Artifactual: exploit spatial, temporal locality in extended hierarchy
  Techniques often similar to those on uniprocessors
Structuring communication to reduce cost
We'll examine techniques for both...

EECC756 - Shaaban #29 lec # 5 Spring 2001 Reducing Artifactual Communication
Message passing model:
–Communication and replication are both explicit
–Even artifactual communication is in explicit messages
Shared address space model:
–More interesting from an architectural perspective
–Occurs transparently due to interactions of program and system
  sizes and granularities in extended memory hierarchy
Use shared address space to illustrate issues

EECC756 - Shaaban #30 lec # 5 Spring 2001 Exploiting Temporal Locality
Structure algorithm so working sets map well to hierarchy
–Often techniques to reduce inherent communication do well here
–Schedule tasks for data reuse once assigned
Multiple data structures in same phase
–e.g. database records: local versus remote
Solver example: blocking (see sketch below)
–More useful when O(n^(k+1)) computation on O(n^k) data
–Many linear algebra computations (factorization, matrix multiply)
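A minimal sketch of blocking the grid sweep in C; the grid size N, block size B, and the 5-point update are illustrative assumptions, with B chosen so the rows of the tile being worked on stay in cache.

/* Blocked (tiled) sweep over a 2-d grid for temporal locality (sketch). */
#define N 4096
#define B 64

void sweep_blocked(double (*A)[N])
{
    for (int ii = 1; ii < N - 1; ii += B)
        for (int jj = 1; jj < N - 1; jj += B)
            /* Finish one B x B tile before moving on, so its rows are
             * reused out of the cache rather than refetched. */
            for (int i = ii; i < ii + B && i < N - 1; i++)
                for (int j = jj; j < jj + B && j < N - 1; j++)
                    A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j]
                                   + A[i][j-1] + A[i][j+1]);
}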

EECC756 - Shaaban #31 lec # 5 Spring 2001 Exploiting Spatial Locality
Besides capacity, granularities are important:
–Granularity of allocation
–Granularity of communication or data transfer
–Granularity of coherence
Major spatial-related causes of artifactual communication:
–Conflict misses
–Data distribution/layout (allocation granularity)
–Fragmentation (communication granularity)
–False sharing of data (coherence granularity)
All depend on how spatial access patterns interact with data structures:
–Fix problems by modifying data structures, or layout/alignment
Examine later in context of architectures:
–One simple example here: data distribution in SAS solver

EECC756 - Shaaban #32 lec # 5 Spring 2001 Spatial Locality Example
Repeated sweeps over 2-d grid, each time adding 1 to elements
Natural 2-d versus higher-dimensional array representation
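A minimal sketch of the two layouts in C, taking the higher-dimensional representation to be a 4-d ("block-major") array so that each process's block is contiguous; N, NP, and the helper names are hypothetical.

/* 2-d vs 4-d array layout for a block-partitioned SAS grid (sketch). */
#define N  1024
#define NP 2                      /* blocks per dimension (sqrt of p)  */
#define BS (N / NP)               /* block size per dimension          */

/* Natural 2-d layout: a processor's sub-block is made of strided row
 * segments, so page- and line-sized granules straddle partitions. */
static double grid2d[N][N];

/* 4-d layout: first two indices pick the owning block, last two index
 * within it, so each block is one contiguous BS*BS region. */
static double grid4d[NP][NP][BS][BS];

static inline double *elem2d(int i, int j) { return &grid2d[i][j]; }

static inline double *elem4d(int i, int j)
{
    return &grid4d[i / BS][j / BS][i % BS][j % BS];
}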

EECC756 - Shaaban #33 lec # 5 Spring 2001 Tradeoffs with Inherent Communication
Partitioning grid solver: blocks versus rows
–Blocks still have a spatial locality problem on remote data
–Row-wise can perform better despite worse inherent comm-to-comp ratio
  Good spatial locality on nonlocal accesses at row-oriented boundary
  Poor spatial locality on nonlocal accesses at column-oriented boundary
Result depends on n and p

EECC756 - Shaaban #34 lec # 5 Spring 2001 Example Performance Impact
Equation solver on SGI Origin2000

EECC756 - Shaaban #35 lec # 5 Spring 2001 Architectural Implications of Locality
Communication abstraction that makes exploiting it easy
For cache-coherent SAS, e.g.:
–Size and organization of levels of memory hierarchy
  cost-effectiveness: caches are expensive
  caveats: flexibility for different and time-shared workloads
–Replication in main memory useful? If so, how to manage?
  hardware, OS/runtime, program?
–Granularities of allocation, communication, coherence (?)
  small granularities => high overheads, but easier to program
Machine granularity (resource division among processors, memory, ...)

EECC756 - Shaaban #36 lec # 5 Spring 2001 Structuring Communication
Given amount of comm (inherent or artifactual), goal is to reduce cost
Cost of communication as seen by process:
  C = f * ( o + l + (n_c / m) / B + t_c - overlap )
  f = frequency of messages
  o = overhead per message (at both ends)
  l = network delay per message
  n_c = total data sent
  m = number of messages
  B = bandwidth along path (determined by network, NI, assist)
  t_c = cost induced by contention per message
  overlap = amount of latency hidden by overlap with comp. or comm.
–Portion in parentheses is cost of a message (as seen by processor)
–That portion, ignoring overlap, is latency of a message
–Goal: reduce terms in latency and increase overlap
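Plugging in hypothetical numbers shows how the terms combine: say f = 100 messages, o = 1 us, l = 2 us, n_c = 100 KB sent in m = 100 messages (about 1 KB each), B = 1 GB/s (roughly 1 us of transfer time per message), t_c = 0.5 us, and no overlap:

% Hypothetical illustration of the communication cost expression
C = f\left(o + l + \frac{n_c/m}{B} + t_c - \text{overlap}\right) = 100\,(1 + 2 + 1 + 0.5 - 0)\ \mu s = 450\ \mu s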

EECC756 - Shaaban #37 lec # 5 Spring 2001 Reducing Overhead
Can reduce no. of messages m or overhead per message o
o is usually determined by hardware or system software:
–Program should try to reduce m by coalescing messages
–More control when communication is explicit
Coalescing data into larger messages:
–Easy for regular, coarse-grained communication
–Can be difficult for irregular, naturally fine-grained communication
  May require changes to algorithm and extra work
  coalescing data and determining what and to whom to send
Will discuss more in implications for programming models later
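A minimal sketch of message coalescing, assuming an MPI message-passing program; NVALS, dest, and TAG are hypothetical.

/* Many small sends versus one coalesced send (sketch). */
#include <mpi.h>

#define NVALS 1000
#define TAG   0

void send_uncoalesced(const double *vals, int dest)
{
    /* NVALS messages: pays overhead o and delay l on every one. */
    for (int i = 0; i < NVALS; i++)
        MPI_Send(&vals[i], 1, MPI_DOUBLE, dest, TAG, MPI_COMM_WORLD);
}

void send_coalesced(const double *vals, int dest)
{
    /* One larger message: overhead and delay are paid once. */
    MPI_Send(vals, NVALS, MPI_DOUBLE, dest, TAG, MPI_COMM_WORLD);
}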

EECC756 - Shaaban #38 lec # 5 Spring 2001 Reducing Network Delay
Network delay component = f * h * t_h
  h = number of hops traversed in network
  t_h = link + switch latency per hop
Reducing f: communicate less, or make messages larger
Reducing h:
–Map communication patterns to network topology
  e.g. nearest-neighbor on mesh and ring; all-to-all
–How important is this?
  Used to be a major focus of parallel algorithms
  Depends on no. of processors, how t_h compares with other components
  Less important on modern machines
    Overheads, processor count, multiprogramming
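For scale, a hypothetical plug-in: f = 100 messages, each traversing h = 5 hops at t_h = 50 ns per hop:

% Hypothetical illustration of the network delay component
f \cdot h \cdot t_h = 100 \times 5 \times 50\,\text{ns} = 25\ \mu s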

EECC756 - Shaaban #39 lec # 5 Spring 2001 Reducing Contention
All resources have nonzero occupancy:
–Memory, communication controller, network link, etc.
–Can only handle so many transactions per unit time.
Effects of contention:
–Increased end-to-end cost for messages.
–Reduced available bandwidth for individual messages.
–Causes imbalances across processors.
Particularly insidious performance problem:
–Easy to ignore when programming
–Slow down messages that don't even need that resource by causing other dependent resources to also congest
–Effect can be devastating: Don't flood a resource!

EECC756 - Shaaban #40 lec # 5 Spring 2001 Types of Contention
Network contention and end-point contention (hot-spots)
Location and Module Hot-spots:
–Location: e.g. accumulating into global variable, barrier
  Solution: tree-structured communication (see sketch below)
–Module: all-to-all personalized comm. in matrix transpose
  Solution: stagger access by different processors to same node temporally
In general, reduce burstiness; may conflict with making messages larger
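A minimal sketch of tree-structured accumulation in C with pthreads, replacing P threads all updating one global hot-spot variable; P (a power of two), the barrier-per-round scheme, and the omission of padding against false sharing are simplifying assumptions.

/* Combine partial sums pairwise in log2(P) rounds instead of having
 * all P threads hammer one shared location (sketch). */
#include <pthread.h>

#define P 8

static double partial[P];                 /* one slot per thread */
static pthread_barrier_t round_barrier;

void reduce_init(void) { pthread_barrier_init(&round_barrier, NULL, P); }

/* Called by every thread with its id and local contribution;
 * after the final barrier, partial[0] holds the global sum. */
double tree_accumulate(int id, double my_value)
{
    partial[id] = my_value;
    for (int stride = 1; stride < P; stride *= 2) {
        pthread_barrier_wait(&round_barrier);  /* previous round done  */
        if (id % (2 * stride) == 0)            /* active half combines */
            partial[id] += partial[id + stride];
    }
    pthread_barrier_wait(&round_barrier);
    return partial[0];
}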

EECC756 - Shaaban #41 lec # 5 Spring 2001 Overlapping Communication
Cannot afford to stall for high latencies
Overlap with computation or communication to hide latency
Requires extra concurrency (slackness), higher bandwidth
Techniques:
–Prefetching
–Block data transfer
–Proceeding past communication
–Multithreading
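A minimal sketch of "proceeding past communication" using nonblocking message passing, assuming an MPI program; buffer sizes, the neighbor rank, and the interior computation are hypothetical.

/* Overlap a boundary exchange with independent interior work (sketch). */
#include <mpi.h>

#define NB 1024

void exchange_and_compute(double *send_buf, double *recv_buf,
                          int neighbor, double *interior, int n)
{
    MPI_Request reqs[2];

    /* Start the boundary exchange... */
    MPI_Irecv(recv_buf, NB, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, NB, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ...and compute on interior data that does not depend on it,
     * hiding (some of) the message latency. */
    for (int i = 0; i < n; i++)
        interior[i] *= 0.5;

    /* Wait for the exchange to finish before touching boundary data. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}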

EECC756 - Shaaban #42 lec # 5 Spring 2001 Summary of Tradeoffs
Different goals often have conflicting demands:
–Load Balance
  Fine-grain tasks
  Random or dynamic assignment
–Communication
  Usually coarse-grain tasks
  Decompose to obtain locality: not random/dynamic
–Extra Work
  Coarse-grain tasks
  Simple assignment
–Communication Cost:
  Big transfers: amortize overhead and latency
  Small transfers: reduce contention

EECC756 - Shaaban #43 lec # 5 Spring 2001 Relationship Between Perspectives

EECC756 - Shaaban #44 lec # 5 Spring 2001 Summary
Speedup_prob(p) = (Busy(1) + Data(1)) / (Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p))
–Goal is to reduce denominator components
–Both programmer and system have role to play
–Architecture cannot do much about load imbalance or too much communication
–But it can:
  reduce incentive for creating ill-behaved programs (efficient naming, communication and synchronization)
  reduce artifactual communication
  provide efficient naming for flexible assignment
  allow effective overlapping of communication