Computer Architecture II 1 Computer architecture II Introduction

Computer Architecture II 2 Recap
Parallelization strategies
– What to partition?
– Embarrassingly parallel computations
– Divide-and-conquer
– Pipelined computations
Application examples
Parallelization steps
3 programming models
– Data parallel
– Shared memory
– Message passing

Computer Architecture II 3 4 Steps in Creating a Parallel Program
1. Decomposition of the computation into tasks
2. Assignment of tasks to processes
3. Orchestration of data access, communication, and synchronization
4. Mapping of processes to processors

Computer Architecture II 4 Plan for today
Programming for performance
Amdahl's law
Partitioning for performance
– Addressing decomposition and assignment
Orchestration for performance

Computer Architecture II 5 Creating a Parallel Program
Assumption: a sequential algorithm is given
– Sometimes a very different algorithm is needed, but that is beyond our scope
Pieces of the job:
– Identify work that can be done in parallel
– Partition work and perhaps data among processes
– Manage data access, communication and synchronization
– Note: work includes computation, data access and I/O
Main goal: speedup (plus low programming effort and resource needs)
Speedup(p) = Performance(p) / Performance(1)
For a fixed problem: Speedup(p) = Time(1) / Time(p)

Computer Architecture II 6 Amdahl's law
Suppose a fraction f of your application is not parallelizable, and the remaining 1-f is parallelizable on p processors. Then
Speedup(p) = T_1 / T_p <= T_1 / (f*T_1 + (1-f)*T_1/p) = 1 / (f + (1-f)/p) <= 1/f
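As a quick check of the formula, here is a minimal C sketch (function name and values chosen for illustration) that evaluates the Amdahl bound for a few processor counts; with f = 0.05 the speedup saturates near 1/f = 20 no matter how many processors are added.

    #include <stdio.h>

    /* Amdahl's law: speedup achievable on p processors when a
     * fraction f of the work is inherently serial. */
    static double amdahl_speedup(double f, int p)
    {
        return 1.0 / (f + (1.0 - f) / p);
    }

    int main(void)
    {
        const double f = 0.05;              /* 5% serial fraction (illustrative) */
        const int procs[] = { 1, 8, 64, 1024 };

        for (int i = 0; i < 4; i++)
            printf("p = %4d  speedup <= %.2f\n", procs[i], amdahl_speedup(f, procs[i]));
        /* The bound approaches 1/f = 20 as p grows. */
        return 0;
    }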

Computer Architecture II 7 Amdahl's Law (for 1024 processors)
See: Gustafson, Montry, Benner, "Development of Parallel Methods for a 1024-Processor Hypercube", SIAM J. Sci. Stat. Comput. 9, No. 4, 1988, p. 609.

Computer Architecture II 8 Amdahl's law, but:
– Many problems can be "embarrassingly" parallelized (e.g. image processing, differential equation solvers)
– In some cases the serial fraction does not increase with the problem size
– Additional speedup can be achieved from additional resources (super-linear speedup due to more memory)

Computer Architecture II 9 Performance Goal => Speedup
Architect's goal
– observe how the program uses the machine and improve the design to enhance performance
Programmer's goal
– observe how the program uses the machine and improve the implementation to enhance performance

Computer Architecture II 10 4 Steps in Creating a Parallel Program
1. Decomposition of the computation into tasks
2. Assignment of tasks to processes
3. Orchestration of data access, communication, and synchronization
4. Mapping of processes to processors

Computer Architecture II 11 Partitioning for Performance
First two phases of the parallelization process: decomposition & assignment
Goals:
1. Balance the workload and reduce wait time at synchronization points
2. Reduce inherent communication
3. Reduce the extra work for determining and managing a good assignment (static versus dynamic)
Tensions between the 3 goals
– Maximizing load balance => smaller tasks => more communication
– No communication (run on 1 processor) => extreme load imbalance (all other processors idle)
– Load balance => extra work to compute or manage the partitioning (e.g. dynamic techniques)

Computer Architecture II 12 Load Balance
– Work: data access, computation
– Not just equal work: processors must also be busy at the same time
Speedup <= Sequential Work / Max Work on any Processor
– Ex: if the sequential work is 1000 units and the most loaded processor does 400 units, Speedup <= 1000/400 = 2.5

Computer Architecture II 13 Load balance
a) Identify enough concurrency: data and functional parallelism (last class)
b) Managing concurrency
c) Task granularity
d) Reduce communication and synchronization

Computer Architecture II 14 1.b Static versus dynamic assignment
Static: before the program starts it is clear who does what

    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

Dynamic
– External scheduler
– Self-scheduled: each process picks a chunk of loop iterations and executes it

    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

– Guided self-scheduling: processes first take larger chunks and then reduce the chunk size progressively

    #pragma omp parallel for schedule(guided, 4)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
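For experimentation, the loop above can be wrapped into a self-contained program; the array size and chunk size below are illustrative choices, not values from the lecture. Compile with, e.g., gcc -fopenmp.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void)
    {
        int i;
        for (i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        double t0 = omp_get_wtime();

        /* Change 'static' to 'dynamic, 4' or 'guided, 4' to compare policies. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++)
            a[i] = a[i] + b[i];

        double t1 = omp_get_wtime();
        printf("elapsed: %f s, a[N-1] = %f\n", t1 - t0, a[N - 1]);
        return 0;
    }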

Computer Architecture II 15 Dynamic Tasking with Task Queues
Centralized queue: simple protocol
– Problems: communication, synchronization, contention
Distributed queues: more complicated protocol
– Initial distribution of jobs may cause load imbalance
  Solution: task stealing (whom to steal from, how many tasks to steal, ...)
– Termination detection
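A minimal sketch of centralized self-scheduling in the same C/OpenMP style (task body and counts are placeholders, not from the lecture): workers repeatedly claim the next task index from a shared counter, which keeps the protocol simple but makes the counter a point of contention.

    #include <stdio.h>
    #include <omp.h>

    #define NUM_TASKS 1000

    /* Placeholder task body: stands in for a variable amount of work. */
    static double do_task(int id)
    {
        double s = 0.0;
        for (int k = 0; k < (id % 100) * 1000; k++)
            s += k * 0.5;
        return s;
    }

    int main(void)
    {
        int next = 0;          /* shared "queue": index of the next unclaimed task */
        double total = 0.0;

        #pragma omp parallel reduction(+:total)
        {
            while (1) {
                int my_task;
                /* Atomically claim one task from the central counter. */
                #pragma omp atomic capture
                my_task = next++;

                if (my_task >= NUM_TASKS)
                    break;                 /* no tasks left: terminate */
                total += do_task(my_task);
            }
        }
        printf("total = %f\n", total);
        return 0;
    }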

Computer Architecture II 16 1.c Task granularity
Task granularity: the amount of work associated with a task
General rule:
– Coarse-grained: often worse load balance
– Fine-grained: better load balance, but more overhead, often more communication and contention
[Figure: coarse-grained versus fine-grained assignment of tasks to Processor 1 and Processor 2]

Computer Architecture II 17 1.d Reducing Serialization
Synchronization for task assignment may cause serialization (for instance, access to a shared queue)
Speedup <= Sequential Work / Max (Work + Synch Wait Time)
[Figure: timeline of Processes 1-3 showing work, synchronization points, and synchronization wait time]
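For a quick illustration of the bound (numbers invented here, not from the lecture): if the sequential work is 1000 units and, with the chosen assignment, the most loaded process does 250 units of work but also spends 150 units waiting at synchronization points, then Speedup <= 1000 / (250 + 150) = 2.5, even though the work itself is balanced well enough for a speedup of 4.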

Computer Architecture II 18 Reducing Serialization
Event synchronization
– Reduce the use of conservative synchronization: point-to-point synchronization instead of barriers; finer granularity of access may reduce synchronization time
– But fine-grained synchronization is more difficult to program and needs more synchronization operations
Mutual exclusion
– Separate locks for separate data: e.g. a lock per task in a task queue, not per queue; finer grain => less contention/serialization, more space, less reuse
– Smaller, less frequent critical sections: don't do reading/testing in the critical section, only modification; e.g. search for the task to dequeue outside the critical section
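To illustrate the last point, a hedged sketch with invented helper names (C with OpenMP locks): the scan for a ready task happens outside the critical section, and only the actual claim is protected by the lock, with a re-check under the lock. A lock per task slot, as the slide suggests, would reduce contention further.

    #include <omp.h>

    #define QSIZE 256

    typedef struct {
        int taken;      /* 0 = still in the queue, 1 = already claimed */
        int work_id;
    } task_t;

    static task_t queue[QSIZE];
    static omp_lock_t queue_lock;   /* initialize once with omp_init_lock() */

    /* Returns the work_id of a claimed task, or -1 if none is left.
     * The reading/testing happens outside the critical section;
     * only the final modification is protected by the lock. */
    int dequeue_task(void)
    {
        for (int i = 0; i < QSIZE; i++) {
            if (queue[i].taken)          /* unprotected test: just a hint */
                continue;

            int id = -1;
            omp_set_lock(&queue_lock);
            if (!queue[i].taken) {       /* re-check under the lock */
                queue[i].taken = 1;
                id = queue[i].work_id;
            }
            omp_unset_lock(&queue_lock);

            if (id != -1)
                return id;               /* claimed slot i */
        }
        return -1;                       /* queue empty */
    }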

Computer Architecture II 19 2. Reducing Inherent Communication
Communication is expensive!
Measure: communication-to-computation ratio
Inherent communication
– Determined by the assignment of tasks to processes
– Actual communication may be larger (artifactual)
One principle: assign tasks that access the same data to the same process
Speedup <= Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
[Figure: timeline of Processes 1-3 showing work, communication, synchronization points, and synchronization wait time]

Computer Architecture II 20 Domain Decomposition
Ocean example: communicate with the neighbors, compute in the assigned domain
Perimeter-to-area communication-to-computation ratio (area-to-volume in 3-D)
– Depends on n and p: decreases with n, increases with p

Computer Architecture II 21 Domain Decomposition
Block versus strip decomposition:
– Communication-to-computation ratio: 4*sqrt(p)/n for block, 2*p/n for strip
– Block is better
– Application dependent: strip may be better in other cases
Best domain decomposition depends on the information requirements
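To make the comparison concrete (numbers chosen here for illustration, not from the lecture): for an n x n grid with n = 1024 and p = 64 processors, block decomposition gives 4*sqrt(64)/1024 = 32/1024 ≈ 0.031, while strip decomposition gives 2*64/1024 = 128/1024 = 0.125, i.e. roughly four times as much communication per unit of computation.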

Computer Architecture II 22 Finding a Domain Decomposition
GOALS: load balance & low communication
Static, by inspection
– Computation must be predictable: e.g. Ocean
Static, but not by inspection
– Input-dependent, requires analyzing the input structure; e.g. sparse matrix computations
Semi-static (periodic repartitioning)
– Characteristics change, but slowly; e.g. Barnes-Hut
Static or semi-static, with dynamic task stealing
– Initial decomposition, but highly unpredictable; e.g. ray tracing

Computer Architecture II 23 3. Reducing Extra Work
Common sources of extra work:
– Computing a good partition (e.g. partitioning in Barnes-Hut)
– Using redundant computation to avoid communication
– Task, data and process management overhead (applications, languages, runtime systems, OS)
– Imposing structure on communication (coalescing small messages)
Speedup <= Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

Computer Architecture II 24 PART II: Memory-aware optimizations
So far we have seen the parallel computer as a collection of communicating processors
– Goals: balance the load, reduce inherent communication and extra work
– We have assumed unlimited memory
In reality the parallel computer uses a multi-cache, multi-memory system

Computer Architecture II 25 Memory-oriented View
Multiprocessor as an extended memory hierarchy
Levels in the extended hierarchy:
– Registers, caches, local memory, remote memory
– Glued together by the communication architecture
– Levels communicate at a certain granularity of data transfer
Moving down the hierarchy: granularity increases, access time increases, capacity increases, cost/unit decreases
[Figure: two nodes, each with processor, L1/L2/L3 caches and memory, connected by potential interconnects]

Computer Architecture II 26 Memory-oriented view
Performance depends heavily on the memory hierarchy
Time spent by a program (usually given in cycles):
Time_prog = Time_compute + Time_access
Data access time can be reduced by:
– Optimizing the machine: larger caches, lower latency, larger bandwidth
– Optimizing the program: temporal and spatial locality

Computer Architecture II 27 Artifactual Communication in the Extended Hierarchy
– Poor allocation of data across distributed memories: data accessed by a node resides in the memory of another
– Unnecessary data in a transfer
– Unnecessary transfers due to system granularities
– Redundant communication of data
– Finite replication capacity (in cache or main memory)

Computer Architecture II 28 Replication-induced artifactual communication
Communication induced by finite capacity is the most fundamental artifact
– Analogous to cache size and miss rate or memory traffic in uniprocessors
View as a three-level hierarchy for simplicity
– Local cache, local memory, remote memory (ignore network topology)
Classify "misses" in the "cache" at any level as for uniprocessors (the 4 "C"s):
– Compulsory or cold misses (no size effect)
– Capacity misses (size effect)
– Conflict or collision misses (size effect)
– Communication or coherence misses (no size effect)
Each may be helped or hurt by a large transfer granularity (spatial locality)

Computer Architecture II 29 Working Set Perspective
– Hierarchy of working sets
– At the first-level cache (fully associative, one-word blocks) the working-set curve is inherent to the algorithm
– Traffic from any type of miss can be local or nonlocal (communication)
[Figure: working set curve of the program]

Computer Architecture II 30 Orchestration for Performance
Reducing the amount of communication:
– Inherent: change the partitioning (seen earlier)
– Artifactual: exploit spatial and temporal locality in the memory hierarchy; techniques are often similar to those on uniprocessors
Structuring communication to reduce its cost

Computer Architecture II 31 Reducing Artifactual Communication
Message-passing model
– Communication and replication are both explicit
– Even artifactual communication is in explicit messages
Shared address space model
– Communication occurs transparently due to interactions of program and system
– Used here for explanation

Computer Architecture II 32 Exploiting Temporal Locality
– Definition: reuse of data elements already brought into the cache
– Structure the algorithm so that working sets fit into the cache
  Often the same techniques that reduce inherent communication:
  – assign tasks accessing the same elements to the same processor
  – schedule tasks for data reuse once assigned
– Ocean solver example: blocking
  Each grid element is accessed 5 times: brought into the cache the first time, then reused
  Rewrite the loops so that the reuses hit in the cache
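A generic sketch of blocking (loop tiling) in C, not the actual Ocean solver code: the blocked sweep visits the grid in B x B tiles so that the elements of a tile are reused while they are still in the cache; B is a tuning parameter chosen here for illustration.

    #define N 1024
    #define B 64           /* tile size: illustrative, tune to the cache size */

    static double a[N][N];

    /* Unblocked sweep: walks the whole grid once per pass, so by the time an
     * element is touched again it may already have been evicted. */
    void sweep(void)
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                a[i][j] = 0.2 * (a[i][j] + a[i - 1][j] + a[i + 1][j]
                                          + a[i][j - 1] + a[i][j + 1]);
    }

    /* Blocked sweep: the same update applied tile by tile, so the elements of
     * a B x B tile (plus its halo) are reused while still cached. */
    void sweep_blocked(void)
    {
        for (int ii = 1; ii < N - 1; ii += B)
            for (int jj = 1; jj < N - 1; jj += B)
                for (int i = ii; i < ii + B && i < N - 1; i++)
                    for (int j = jj; j < jj + B && j < N - 1; j++)
                        a[i][j] = 0.2 * (a[i][j] + a[i - 1][j] + a[i + 1][j]
                                                  + a[i][j - 1] + a[i][j + 1]);
    }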

Computer Architecture II 33 Exploiting Spatial Locality
Definition: when a data element is accessed, its neighbors are also accessed
Major spatial-locality-related causes of artifactual communication:
– Conflict misses
– Data distribution/layout (allocation granularity)
– Fragmentation (communication granularity)
– False sharing of data (coherence granularity)
AVOIDING ARTIFACTUAL COMMUNICATION: keep the data accessed by one processor contiguous
– Fix problems by modifying data structures, or layout/alignment
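One concrete instance of the coherence-granularity problem is false sharing. The sketch below (C with OpenMP, illustrative sizes and an assumed 64-byte cache line) compares per-thread counters that share cache lines with counters padded so that each occupies its own line.

    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS 8
    #define ITERS    10000000
    #define LINE     64                    /* assumed cache-line size in bytes */

    /* Counters packed together: several share one cache line, so every
     * increment by one thread invalidates the line in the others' caches. */
    static long packed[NTHREADS];

    /* Each counter padded out to its own cache line: no false sharing. */
    static struct { long value; char pad[LINE - sizeof(long)]; } padded[NTHREADS];

    int main(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long k = 0; k < ITERS; k++)
                packed[id]++;
        }
        double t1 = omp_get_wtime();

        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long k = 0; k < ITERS; k++)
                padded[id].value++;
        }
        double t2 = omp_get_wtime();

        printf("packed: %.3f s, padded: %.3f s\n", t1 - t0, t2 - t1);
        return 0;
    }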

Computer Architecture II 34 Spatial Locality Example
– Repeated sweeps over a 2-D grid, each time adding 1 to the elements
– Use a 4-D array to achieve spatial locality: (processor row x processor column x row index within the block x column index within the block)
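A hedged sketch of the two layouts in C (sizes and names invented for illustration): with the plain 2-D array, each processor's block is made of row segments scattered through memory, while the 4-D layout makes every block a contiguous chunk, so allocation and cache-line granularity match the partitioning.

    #define NP 4                 /* processors per dimension (p = NP*NP) */
    #define BS 256               /* block size: n = NP*BS */

    /* 2-D layout: a processor's BS x BS block consists of BS separate row
     * segments, each NP*BS doubles apart in memory. */
    static double grid2d[NP * BS][NP * BS];

    /* 4-D layout: block (pi, pj) is the contiguous region grid4d[pi][pj],
     * so one processor's data occupies consecutive memory. */
    static double grid4d[NP][NP][BS][BS];

    /* Sweep of the block owned by processor (pi, pj) in each layout. */
    void sweep_block_2d(int pi, int pj)
    {
        for (int i = 0; i < BS; i++)
            for (int j = 0; j < BS; j++)
                grid2d[pi * BS + i][pj * BS + j] += 1.0;
    }

    void sweep_block_4d(int pi, int pj)
    {
        for (int i = 0; i < BS; i++)
            for (int j = 0; j < BS; j++)
                grid4d[pi][pj][i][j] += 1.0;
    }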

Computer Architecture II 35 Tradeoffs with Inherent Communication
Partitioning the grid solver: blocks versus rows
– Blocks have a spatial-locality problem on remote data: when accessing the elements of neighboring processors at a column boundary, whole cache blocks are fetched
– Row-wise partitioning can perform better despite its worse inherent communication-to-computation ratio

Computer Architecture II 36 Example Performance Impact on the Origin2000
[Figure: performance on the SGI Origin2000: Ocean application and kernel solver]