Compilers, Languages, and Libraries
ECE Dept., University of Tehran
Parallel Processing Course Seminar
Hadi Esmaeilzadeh
Introduction
- Distributed systems are heterogeneous in:
  - Power
  - Architecture
  - Data representation
- Data access latencies are long and vary with the underlying network traffic
- Network bandwidths are limited and can vary dramatically with the underlying load
Programming Support Systems: Principles
- Principle: each component of the system should do what it does best
- The application developer should be able to concentrate on problem analysis and decomposition at a fairly high level of abstraction
Programming Support Systems: Goals
- Make applications easy to develop
- Build applications that are portable across different architectures and computing configurations
- Achieve high performance, close to what an expert programmer can achieve using the underlying features of the network and computing configurations
- Exploit various forms of parallelism to balance load across a heterogeneous configuration:
  - Minimizing the computation time
  - Matching the communication to the underlying bandwidths and latencies
- Ensure that performance variability remains within certain bounds
Autoparallelization
- The user focuses on what is being computed rather than how
- The performance penalty should be no worse than a factor of two
- Automatic vectorization
- Dependence analysis (see the sketch below)
- Asynchronous (MIMD) parallel processing
- Symmetric multiprocessors (SMP)
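A minimal C++ sketch of the dependence question an autoparallelizing compiler must answer; the loops, array names, and sizes are illustrative:

    #include <vector>

    // No loop-carried dependence: each iteration writes a[i] from inputs
    // no other iteration touches, so a compiler may vectorize this loop
    // or run its iterations in parallel.
    void independent(std::vector<double>& a, const std::vector<double>& b) {
        for (std::size_t i = 0; i < a.size(); ++i)
            a[i] = 2.0 * b[i];
    }

    // Loop-carried dependence: iteration i reads a[i-1], the value just
    // produced by iteration i-1, so the iterations must run in order.
    void dependent(std::vector<double>& a) {
        for (std::size_t i = 1; i < a.size(); ++i)
            a[i] = a[i - 1] + 1.0;
    }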
Distributed Memory Architecture
- Caches
- Higher latency of large memories
- Determine how to apportion data to the memories of processors in a way that:
  - Maximizes local memory access
  - Minimizes communication
- Regions of parallel execution must be large enough to compensate for the overhead of initiation and synchronization
- Interprocedural analysis and optimization
- Mechanisms that involve the programmer in the design of the parallelization, as well as in the problem solution, will be required
Explicit Communication
- Message passing to get data from remote memories
- A single version of the program runs on all processors
- The computation is specialized to specific processors by extracting the processor number and indexing into the processor's own data (see the SPMD sketch below)
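A minimal SPMD sketch in C++ using the MPI C API: one program runs everywhere, specialized per processor by its rank. The global array size and the work split are illustrative.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this processor's number
        MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processors

        // Same executable on every node; each rank indexes its own slice.
        const int n = 1000;                    // illustrative global size
        int chunk = n / size;
        int begin = rank * chunk;
        int end   = (rank == size - 1) ? n : begin + chunk;

        double local_sum = 0.0;
        for (int i = begin; i < end; ++i)
            local_sum += i * 0.5;              // stand-in for real work

        std::printf("rank %d handled [%d, %d), sum %f\n",
                    rank, begin, end, local_sum);
        MPI_Finalize();
        return 0;
    }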
Send-Receive Model
- Two-sided message passing for distributed-memory environments
- Each processor not only receives the data it needs but must also send the data that other processors require
- PVM
- MPI
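A minimal two-sided exchange in C++ via the MPI C API; both sides must participate, which is the defining obligation of this model. The tag, count, and payload are illustrative.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double payload = 3.14;                 // illustrative data
        if (rank == 0) {
            // Rank 0 must explicitly send what rank 1 requires...
            MPI_Send(&payload, 1, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            // ...and rank 1 must explicitly post a matching receive.
            MPI_Recv(&payload, 1, MPI_DOUBLE, 0, /*tag=*/0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("rank 1 received %f\n", payload);
        }
        MPI_Finalize();
        return 0;
    }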
Get-Put Model
- The processor that needs data from a remote memory can explicitly get it, without requiring any explicit action by the remote processor
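One-sided MPI realizes the get-put model (the original get-put libraries, such as Cray SHMEM, used different calls); a minimal C++ sketch, with the window contents and ranks illustrative:

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Each rank exposes one double to remote access through a window.
        double local = rank * 10.0;            // illustrative contents
        MPI_Win win;
        MPI_Win_create(&local, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        double fetched = 0.0;
        MPI_Win_fence(0, win);
        if (rank == 0)
            // Rank 0 pulls rank 1's value; rank 1 takes no explicit action.
            MPI_Get(&fetched, 1, MPI_DOUBLE, /*target=*/1,
                    /*disp=*/0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        if (rank == 0) std::printf("rank 0 got %f\n", fetched);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }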
Discussion
- The programmer is responsible for:
  - Decomposition of the computation
  - Accounting for the power of each individual processor
  - Load balancing
  - Layout of memory
  - Management of latency
  - Organization and optimization of communication
- Explicit communication can be thought of as an assembly language for grids
Distributed Shared Memory
- DSM is a vehicle for hiding the complexities of memory and communication management
- The address space appears as flat to the programmer as on a single-processor machine
- The hardware/software is responsible for retrieving data from remote memories by generating the needed communication
Hardware Approach
- Stanford DASH, HP/Convex Exemplar, SGI Origin
- Local cache misses initiate data transfers from remote memory when needed
Software Scheme
- Shared Virtual Memory, TreadMarks
- Relies on the paging mechanism of the operating system
- Transfers whole pages on demand between nodes
- This makes both the granularity and the latency significantly large
- Used in conjunction with relaxed memory-consistency models and support for latency hiding
Discussion
- The programmer is freed from handling thread packaging and parallel loops
- Carries performance penalties, and is therefore most useful for coarser-grained parallelism
- Works best with some help from the programmer on the layout of memory
- A promising strategy for simplifying the programming model
Data-Parallel Languages
- High performance on distributed memory requires allocating data to the processors' memories so as to maximize locality and minimize communication
- Data parallelism is necessary for scaling parallelism to hundreds or thousands of processors
- Data parallelism: subdividing the data domain in some manner and assigning the subdomains to different processors (data layout)
- These observations are the foundations of the data-parallel languages:
  - Fortran D, Vienna Fortran, CM Fortran, C*, data-parallel C, and pC++
  - High Performance Fortran (HPF) and High Performance C++ (HPC++)
HPF
- Provides directives for data layout on top of Fortran 90 and Fortran 95
- The directives have no effect on the meaning of the program: they advise the compiler on how to assign elements of the program's arrays and data structures to different processors
- These specifications are relatively machine independent
- The principal focus is the layout of arrays, since arrays are typically associated with the data domains of the underlying problem
- The principal drawback: limited support for problems on irregular meshes
  - Distribution via run-time arrays
  - Generalized block distribution (blocks may be of different sizes)
- For heterogeneous machines, block sizes can be adapted to the powers of the target nodes (generalized block distribution), as sketched below
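HPF itself expresses layout with directives such as DISTRIBUTE; the C++ sketch below only illustrates the arithmetic behind a generalized block distribution, with the relative node powers as assumed inputs.

    #include <cstdio>
    #include <vector>

    // Split n array elements into contiguous blocks whose sizes are
    // proportional to each node's relative power (generalized block).
    std::vector<int> block_sizes(int n, const std::vector<double>& power) {
        double total = 0.0;
        for (double p : power) total += p;

        std::vector<int> sizes(power.size());
        int assigned = 0;
        for (std::size_t i = 0; i + 1 < power.size(); ++i) {
            sizes[i] = static_cast<int>(n * power[i] / total);
            assigned += sizes[i];
        }
        sizes.back() = n - assigned;           // remainder to the last node
        return sizes;
    }

    int main() {
        // Illustrative: node 1 is twice as fast as nodes 0 and 2.
        std::vector<double> power = {1.0, 2.0, 1.0};
        for (int s : block_sizes(1000, power))
            std::printf("%d ", s);             // prints: 250 500 250
        std::printf("\n");
        return 0;
    }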
HPC++
- Unsynchronized for-loops (see the sketch below)
- Parallel template libraries, with parallel or distributed data structures as the basis
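HPC++'s own loop constructs differ; this standard C++17 sketch conveys only the idea of an unsynchronized parallel loop, where iterations carry no dependence and so need no synchronization among themselves.

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<double> a(1000, 1.0);
        // Each element is updated independently, so the runtime is free
        // to execute the iterations in parallel, unsynchronized.
        std::for_each(std::execution::par, a.begin(), a.end(),
                      [](double& x) { x = 2.0 * x + 1.0; });
        return 0;
    }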
Task Parallelism
- Different components of the same computation are executed in parallel
- Different tasks can be allocated to different nodes of the grid
- Object parallelism: different tasks may be components of objects of different classes
- Task parallelism need not be restricted to shared-memory systems and can be defined in terms of a communication library
HPF 2.0 Extensions for Task Parallelism
- Can be implemented on both shared- and distributed-memory systems
- Provide a way for a set of cases to be run in parallel, with no communication until a synchronization at the end (see the sketch below)
- Remaining problems in using HPF on a computational grid:
  - Load matching
  - Communication optimization
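HPF 2.0 expresses this in Fortran with directives; the C++ sketch below mimics only the execution pattern: independent cases launched in parallel, no communication among them, and a single synchronization at the end. The case workload is illustrative.

    #include <cstdio>
    #include <future>
    #include <vector>

    double run_case(int id) {                  // illustrative independent case
        double sum = 0.0;
        for (int i = 0; i < 1000; ++i) sum += id * 0.001 * i;
        return sum;
    }

    int main() {
        std::vector<std::future<double>> cases;
        // Launch the cases in parallel; they do not communicate.
        for (int id = 0; id < 4; ++id)
            cases.push_back(std::async(std::launch::async, run_case, id));
        // Single synchronization point at the end.
        for (auto& c : cases) std::printf("%f\n", c.get());
        return 0;
    }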
Coarse-Grained Software Integration
- A complete application is not a single program: it is a collection of programs that must all be run, passing data to one another
- The main technical challenge of integration is preventing the performance degradation caused by sequential processing of the various programs
- Each program can be viewed as a task; the tasks are collected and matched to the power of the various nodes in the grid
Latency Tolerance
- Dealing with long memory or communication latencies
- Latency hiding: data communication is overlapped with computation (e.g., software prefetching)
- Latency reduction: programs are reorganized to reuse more data in local memories (e.g., loop blocking for cache, sketched below)
- Both are more complex to implement on heterogeneous distributed computers:
  - Latencies are large and variable
  - More time must be spent estimating running times
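A minimal sketch of latency reduction by loop blocking: the classic tiled matrix multiply in C++, with an illustrative tile size. Matrices are stored row-major in flat vectors.

    #include <algorithm>
    #include <vector>

    // Tiled matrix multiply: working on B-by-B tiles keeps each tile
    // resident in cache while it is reused, reducing memory latency.
    void matmul_blocked(const std::vector<double>& a,
                        const std::vector<double>& b,
                        std::vector<double>& c, int n) {
        const int B = 64;                      // illustrative tile size
        for (int ii = 0; ii < n; ii += B)
            for (int kk = 0; kk < n; kk += B)
                for (int jj = 0; jj < n; jj += B)
                    for (int i = ii; i < std::min(ii + B, n); ++i)
                        for (int k = kk; k < std::min(kk + B, n); ++k)
                            for (int j = jj; j < std::min(jj + B, n); ++j)
                                c[i * n + j] += a[i * n + k] * b[k * n + j];
    }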
Load Balancing
- Spreading the calculation evenly across processors while minimizing communication
- Techniques include simulated annealing and neural nets
- Recursive bisection: at each stage, the work is divided into two equal parts (see the sketch below)
- For a grid, the power of each node must be taken into account, so performance prediction of the components is essential
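A minimal 1-D sketch of recursive bisection over per-element work estimates: at each stage the range is split where the accumulated work reaches half the total. Real balancers bisect in multiple dimensions, and handling unequal node powers would bias the split point. The work values are illustrative.

    #include <cstdio>
    #include <vector>

    // Recursively split [lo, hi) so each half gets about half the work.
    void bisect(const std::vector<double>& work, int lo, int hi,
                int depth, std::vector<int>& owner, int id) {
        if (depth == 0) {
            for (int i = lo; i < hi; ++i) owner[i] = id;
            return;
        }
        double total = 0.0, half = 0.0;
        for (int i = lo; i < hi; ++i) total += work[i];
        int mid = lo;
        while (mid < hi && half + work[mid] <= total / 2.0)
            half += work[mid++];               // advance to ~half the work
        bisect(work, lo, mid, depth - 1, owner, 2 * id);
        bisect(work, mid, hi, depth - 1, owner, 2 * id + 1);
    }

    int main() {
        std::vector<double> work = {1, 1, 1, 1, 4, 4, 4, 4};  // uneven costs
        std::vector<int> owner(work.size());
        bisect(work, 0, static_cast<int>(work.size()), 2, owner, 0);
        for (int o : owner) std::printf("%d ", o);  // prints: 0 0 0 0 1 2 3 3
        std::printf("\n");
        return 0;
    }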
Runtime Compilation
- Addresses a problem for automatic load balancing (especially on irregular grids):
  - Unknown loop upper bounds
  - Unknown array sizes
- Inspector/executor model (sketched below):
  - Inspector: executed a single time at runtime; establishes a plan for efficient execution
  - Executor: executed on each iteration; carries out the plan defined by the inspector
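A minimal, single-address-space sketch of the inspector/executor pattern for an irregular access x[idx[i]]: the inspector runs once, when the indirection array is finally known, to build a plan; the executor replays the plan every iteration. In a real system the plan would also separate local from remote accesses and hold a communication schedule. All data here is illustrative.

    #include <cstdio>
    #include <vector>

    struct Plan { std::vector<int> targets; };  // precomputed access order

    // Inspector: run a single time at runtime, once idx is known.
    Plan inspect(const std::vector<int>& idx) {
        Plan p;
        p.targets = idx;   // a real inspector would also build a comm schedule
        return p;
    }

    // Executor: run on each iteration, carrying out the precomputed plan.
    void execute(const Plan& p, const std::vector<double>& x,
                 std::vector<double>& y) {
        for (std::size_t i = 0; i < p.targets.size(); ++i)
            y[i] += x[p.targets[i]];            // irregular gather
    }

    int main() {
        std::vector<double> x = {10, 20, 30, 40};
        std::vector<int> idx = {3, 0, 0, 2};    // known only at runtime
        std::vector<double> y(idx.size(), 0.0);

        Plan plan = inspect(idx);               // once
        for (int step = 0; step < 5; ++step)    // many times
            execute(plan, x, y);
        for (double v : y) std::printf("%f ", v);
        std::printf("\n");
        return 0;
    }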
Libraries
- Functional libraries: parallelized versions of standard functions are applied to user-defined data structures (ScaLAPACK, FFTPACK)
- Data-structure libraries: a parallel data structure is maintained within the library, and its representation is hidden from the user (DAGH)
  - Well suited to object-oriented languages
  - Give the library developer maximum flexibility to manage runtime challenges:
    - Heterogeneous networks
    - Adaptive gridding
    - Variable latencies
- Drawback: library components are currently treated by compilers as black boxes
- Some collaboration between compiler and library might be possible, particularly in an interprocedural compilation system
Programming Tools
- Performance-tuning tools like Pablo, Gist, and Upshot can show where performance bottlenecks exist
Future Directions (Assumptions)
- The user is responsible for both problem decomposition and assignment
- Some kind of service negotiator runs prior to execution and determines the available nodes and their relative power
- Some portion of compilation will be invoked after this service runs
Task Compilation
- Constructing a task graph, along with an estimate of the running time of each task:
  - Task-graph construction and decomposition
  - Performance estimation
  - Restructuring the program to better suit the target grid configuration
  - Assignment of the components of the task graph to the available nodes (see the sketch below)
- Java
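A minimal sketch of the assignment step: greedily give each task to the node that would finish it earliest, using estimated task times and relative node powers. All numbers are illustrative, and dependence edges between tasks are omitted for brevity.

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> cost  = {4, 3, 2, 2, 1};  // estimated task times
        std::vector<double> power = {1.0, 2.0};       // relative node speeds
        std::vector<double> busy(power.size(), 0.0);  // when each node frees up

        // Greedy: each task goes to the node that finishes it earliest.
        for (std::size_t t = 0; t < cost.size(); ++t) {
            std::size_t best = 0;
            double best_finish = 1e300;
            for (std::size_t n = 0; n < power.size(); ++n) {
                double finish = busy[n] + cost[t] / power[n];
                if (finish < best_finish) { best_finish = finish; best = n; }
            }
            busy[best] = best_finish;
            std::printf("task %zu -> node %zu (done at %.1f)\n",
                        t, best, best_finish);
        }
        return 0;
    }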
Grid Shared Memory (Challenges)
- Different nodes have different page sizes and paging mechanisms
- Good performance estimation
- Managing the system-level interactions that provide DSM
Global Grid Compilation
- Providing a programming language and compilation strategy targeted at the grid
- A mixture of parallelism styles, data parallelism and task parallelism:
  - Data decomposition
  - Function decomposition