Compiler, Languages, and Libraries
Parallel Processing Course Seminar, ECE Dept., University of Tehran
Hadi Esmaeilzadeh

Introduction
Distributed systems are heterogeneous in:
- Power
- Architecture
- Data representation
Data access latencies are significantly long and vary with the underlying network traffic.
Network bandwidths are limited and can vary dramatically with the underlying load.

Programming Support Systems: Principles
Principle: each component of the system should do what it does best.
The application developer should be able to concentrate on problem analysis and decomposition at a fairly high level of abstraction.

Programming Support Systems: Goals
- They should make applications easy to develop.
- Build applications that are portable across different architectures and computing configurations.
- Achieve high performance, close to what an expert programmer can achieve using the underlying features of the network and computing configurations.
- Exploit various forms of parallelism to balance load across a heterogeneous configuration:
  - Minimizing the computation time
  - Matching the communication to the underlying bandwidths and latencies
- Ensure that performance variability remains within certain bounds.

Autoparallelization
- The user focuses on what is being computed rather than how.
- The performance penalty should be no worse than a factor of two.
- Automatic vectorization
- Dependence analysis
- Asynchronous (MIMD) parallel processing
- Symmetric multiprocessors (SMPs)
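As a minimal illustration of what dependence analysis decides (the function names below are only for the example), consider two C loops: the first has independent iterations and can be vectorized or run asynchronously on an SMP by the compiler; the second carries a dependence between iterations and cannot be naively parallelized.

```c
#include <stddef.h>

/* Independent iterations: a dependence analyzer can prove that no iteration
 * reads a value written by another, so the loop may be vectorized or run as
 * asynchronous (MIMD) parallel work, e.g. across the processors of an SMP. */
void scale(double *a, const double *b, double s, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = s * b[i];
}

/* Loop-carried dependence: iteration i reads a[i-1], written by the previous
 * iteration, so naive parallelization would change the result. */
void prefix_sum(double *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```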

Distributed Memory Architecture
- Caches
- Higher latency of large memories
- Determine how to apportion data to the memories of the processors in a way that:
  - Maximizes local memory access
  - Minimizes communication
- Regions of parallel execution must be large enough to compensate for the overhead of initiation and synchronization.
- Interprocedural analysis and optimization
- Mechanisms that involve the programmer in the design of the parallelization, as well as the problem solution, will be required.

Explicit Communication
- Message passing is used to get data from remote memories.
- A single version of the program runs on all processors.
- The computation is specialized to specific processors by extracting the processor number and using it to index each processor's own data.
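A minimal MPI sketch of this single-program style, under the assumption of a one-dimensional array of N elements split into contiguous blocks: every processor runs the same program text, obtains its rank, and uses it to compute the index range of its own data.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this processor's number */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* One program text, specialized per processor: each rank works on its
     * own block of a conceptual global array of N elements. */
    const int N = 1000000;
    int chunk = N / nprocs;
    int lo = rank * chunk;
    int hi = (rank == nprocs - 1) ? N : lo + chunk;

    printf("rank %d of %d handles indices [%d, %d)\n", rank, nprocs, lo, hi);

    MPI_Finalize();
    return 0;
}
```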

Send-Receive Model
- A two-sided, message-passing model: each processor not only receives the data it needs but also sends the data that other processors require.
- Examples: PVM, MPI
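A brief sketch of the send-receive style using MPI point-to-point calls; the halo-exchange scenario and buffer size are illustrative only. The key property is that the owner of the data must post a matching send for every receive the consumer posts.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double halo[64] = {0};
    if (rank == 0) {
        /* The owner must explicitly send the data the neighbor needs. */
        MPI_Send(halo, 64, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The consumer blocks until the matching send arrives. */
        MPI_Recv(halo, 64, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received halo from rank 0\n");
    }

    MPI_Finalize();
    return 0;
}
```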

Get-Put Model
- The processor that needs data from a remote memory can explicitly get it (or put data into it) without requiring any explicit action by the remote processor.
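MPI-2 one-sided communication is used below as one concrete instance of the get-put style; the window size and offsets are illustrative. Rank 1 pulls data out of rank 0's exposed memory window, and rank 0 takes no per-transfer action beyond the collective fences.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes a window of local memory for one-sided access. */
    double local[128] = {0};
    MPI_Win win;
    MPI_Win_create(local, sizeof(local), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    double buf[16];
    if (rank == 1)      /* fetch 16 doubles from rank 0's window, offset 0 */
        MPI_Get(buf, 16, MPI_DOUBLE, /*target=*/0, /*disp=*/0, 16, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```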

Discussion
- The programmer is responsible for:
  - Decomposition of the computation
  - Accounting for the power of each individual processor
  - Load balancing
  - Layout of the memory
  - Management of latency
  - Organization and optimization of communication
- Explicit communication can be thought of as an assembly language for grids.

Distributed Shared Memory
- DSM is a vehicle for hiding the complexities of memory and communication management.
- To the programmer, the address space is as flat as on a single-processor machine.
- The hardware/software is responsible for retrieving data from remote memories by generating the needed communication.

Hardware Approach
- Examples: Stanford DASH, HP/Convex Exemplar, SGI Origin
- Local cache misses initiate data transfer from remote memory if needed.

Software Scheme
- Examples: Shared Virtual Memory, TreadMarks
- Relies on the paging mechanism in the operating system.
- Whole pages are transferred on demand between nodes.
- This makes both the granularity and the latency significantly larger.
- Used in conjunction with relaxed memory consistency models and support for latency hiding.

Discussion
- The programmer is freed from handling thread packaging and parallel loops.
- DSM has performance penalties and is therefore best suited to coarser-grained parallelism.
- It works best with some help from the programmer on the layout of memory.
- It is a promising strategy for simplifying the programming model.

Data-Parallel Languages
- High performance on distributed memory requires allocating data to the various processor memories so as to maximize locality and minimize communication.
- Data parallelism is necessary for scaling parallelism to hundreds or thousands of processors.
- Data parallelism: subdividing the data domain in some manner and assigning the subdomains to different processors (data layout).
- These ideas are the foundation of data-parallel languages:
  - Fortran D, Vienna Fortran, CM Fortran, C*, data-parallel C, and PC++
  - High Performance Fortran (HPF) and High Performance C++ (HPC++)

HPF
- Provides directives for data layout on top of Fortran 90/95.
- The directives have no effect on the meaning of the program; they advise the compiler on how to assign elements of the program's arrays and data structures to different processors.
- These specifications are relatively machine independent.
- The principal focus is the layout of arrays, since arrays are typically associated with the data domains of the underlying problem.
- The principal drawback: limited support for problems on irregular meshes.
  - Distribution via a run-time array
  - Generalized block distribution (blocks may be of different sizes)
- For heterogeneous machines, block sizes can be adapted to the powers of the target machines (generalized block distribution).
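A sketch, written in C purely for illustration, of how a generalized block distribution might size blocks in proportion to node power; the node weights are assumed, and HPF itself would express such a layout with DISTRIBUTE directives rather than explicit code like this.

```c
#include <stdio.h>

/* Size each node's block proportionally to its relative power, so a faster
 * node owns a larger slice of an n-element array. Any rounding remainder is
 * given to the last node. */
void generalized_block(int n, int nnodes, const double *power, int *block) {
    double total = 0.0;
    for (int p = 0; p < nnodes; p++) total += power[p];

    int assigned = 0;
    for (int p = 0; p < nnodes; p++) {
        block[p] = (int)(n * (power[p] / total));
        assigned += block[p];
    }
    block[nnodes - 1] += n - assigned;
}

int main(void) {
    double power[3] = {1.0, 2.0, 4.0};   /* relative node speeds (assumed) */
    int block[3];
    generalized_block(700, 3, power, block);
    for (int p = 0; p < 3; p++)
        printf("node %d owns %d elements\n", p, block[p]);
    return 0;
}
```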

HPC++
- Unsynchronized for-loops
- Parallel template libraries, with parallel or distributed data structures as their basis

Task Parallelism
- Different components of the same computation are executed in parallel.
- Different tasks can be allocated to different nodes of the grid.
- Object parallelism: different tasks may be components of objects of different classes.
- Task parallelism need not be restricted to shared-memory systems; it can also be defined in terms of a communication library.
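A small illustration of task parallelism on a shared-memory node using OpenMP sections; the two solver routines are hypothetical placeholders. On a grid, the same decomposition could instead dispatch each task to a different node through a communication library.

```c
#include <stdio.h>
#include <omp.h>

/* Two unrelated components of one computation run as separate tasks. */
void solve_flow(void)      { printf("flow solver on thread %d\n", omp_get_thread_num()); }
void solve_structure(void) { printf("structural solver on thread %d\n", omp_get_thread_num()); }

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        solve_flow();
        #pragma omp section
        solve_structure();
    }
    return 0;
}
```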

HPF 2.0 Extensions for Task Parallelism
- Can be implemented on both shared- and distributed-memory systems.
- Provide a way for a set of cases to be run in parallel, with no communication until synchronization at the end.
- Remaining problems in using HPF on a computational grid:
  - Load matching
  - Communication optimization

Coarse-Grained Software Integration
- A complete application is not a single simple program; it is a collection of programs that must all be run, passing data to one another.
- The main technical challenge of integration is preventing the performance degradation caused by sequential processing of the various programs.
- Each program can be viewed as a task; tasks are collected and matched to the power of the various nodes in the grid.

Latency Tolerance
- Techniques for dealing with long memory or communication latencies:
  - Latency hiding: data communication is overlapped with computation (e.g., software prefetching).
  - Latency reduction: programs are reorganized to reuse more data in local memories (e.g., loop blocking for cache).
- Both are more complex to implement on heterogeneous distributed computers:
  - Latencies are large and variable.
  - More time must be spent on estimating running times.
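A standard latency-reduction example: loop blocking (tiling) of matrix multiplication so that a tile-sized working set is reused from cache instead of being re-fetched from slow memory on every pass. The block size is a tunable assumption; C is assumed to be zero-initialized by the caller.

```c
#define BLOCK 64

/* C += A * B for n x n row-major matrices, tiled over j and k so that a
 * BLOCK x BLOCK working set of B stays resident in cache while it is reused. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < n && j < jj + BLOCK; j++) {
                    double sum = C[i * n + j];
                    for (int k = kk; k < n && k < kk + BLOCK; k++)
                        sum += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = sum;
                }
}
```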

Load Balancing
- Spreading the calculation evenly across processors while minimizing communication.
- Techniques: simulated annealing, neural nets, recursive bisection (at each stage, the work is divided into two equal parts).
- For a grid, the power of each node must be taken into account, so performance prediction of the components is essential.
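A minimal sketch of one-dimensional recursive bisection over weighted work items: each range is cut where the accumulated weight reaches half of the total, then each half is cut again until there is one piece per processor. The per-item weights and the power-of-two processor count are simplifying assumptions; real partitioners work on multidimensional meshes and would also weight by node power on a grid.

```c
#include <stdio.h>

static void bisect(const double *w, int lo, int hi, int nparts, int *owner, int first) {
    if (nparts == 1) {                        /* base case: assign the whole range */
        for (int i = lo; i < hi; i++) owner[i] = first;
        return;
    }
    double total = 0.0, acc = 0.0;
    for (int i = lo; i < hi; i++) total += w[i];

    int cut = lo;                             /* cut where half the weight is reached */
    while (cut < hi && acc + w[cut] <= total / 2.0) acc += w[cut++];

    bisect(w, lo, cut, nparts / 2, owner, first);
    bisect(w, cut, hi, nparts / 2, owner, first + nparts / 2);
}

int main(void) {
    double w[8] = {4, 1, 1, 2, 2, 2, 3, 1};   /* per-item work estimates (assumed) */
    int owner[8];
    bisect(w, 0, 8, 4, owner, 0);
    for (int i = 0; i < 8; i++) printf("item %d -> proc %d\n", i, owner[i]);
    return 0;
}
```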

Runtime Compilation
- Automatic load balancing is a problem (especially on irregular grids) because of:
  - Unknown loop upper bounds
  - Unknown array sizes
- Inspector/executor model:
  - Inspector: executed a single time at run time; establishes a plan for efficient execution.
  - Executor: executed on each iteration; carries out the plan defined by the inspector.
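A conceptual sketch of the inspector/executor split; the data structures and function names are hypothetical stand-ins for the communication schedules a real runtime system would build. The loop bounds and the indirection array are unknown until run time, so the inspector scans them once, and the executor replays the resulting plan on every step of the outer time loop.

```c
#include <stdlib.h>

typedef struct { int count; int *targets; } plan_t;   /* hypothetical access plan */

/* Inspector: run once at run time, when idx[] and n are finally known. */
plan_t inspect(const int *idx, int n) {
    plan_t plan = { n, malloc(n * sizeof(int)) };
    for (int i = 0; i < n; i++)
        plan.targets[i] = idx[i];      /* e.g. record owners, schedule remote fetches */
    return plan;
}

/* Executor: run on each iteration, carrying out the precomputed plan. */
void execute(const plan_t *plan, double *x, const double *y) {
    for (int i = 0; i < plan->count; i++)
        x[i] += y[plan->targets[i]];   /* indirect (possibly remote) access */
}
```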

Libraries
- Functional library: parallelized versions of standard functions are applied to user-defined data structures (ScaLAPACK, FFTPACK).
- Data-structure library: a parallel data structure is maintained within the library, and its representation is hidden from the user (DAGH).
  - Well suited to object-oriented languages.
  - Gives the library developer maximum flexibility to manage runtime challenges: heterogeneous networks, adaptive gridding, variable latencies.
- Drawback: library components are currently treated by compilers as black boxes. Some sort of collaboration between compiler and library might be possible, particularly through interprocedural compilation.

Programming Tools
- Tools such as Pablo, Gist, and Upshot can show where performance bottlenecks exist.
- Performance-tuning tools

Future Directions (Assumptions)
- The user is responsible for both problem decomposition and assignment.
- Some kind of service negotiator runs prior to execution and determines the available nodes and their relative power.
- Some portion of compilation will be invoked after this service runs.

Task Compilation
- Construct a task graph, along with an estimate of the running time for each task:
  - Task-graph construction and decomposition
  - Performance estimation
- Restructure the program to better suit the target grid configuration.
- Assign components of the task graph to the available nodes.
- Java
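A hypothetical sketch of the assignment step only: tasks carrying estimated running times are placed greedily on whichever node would finish them earliest, given each node's relative power as reported by the service negotiator. The task names, time estimates, and node powers are invented for the example, and precedence constraints in the task graph are ignored for brevity.

```c
#include <stdio.h>

typedef struct { const char *name; double est_time; } task_t;

int main(void) {
    task_t tasks[4] = {{"decompose", 3.0}, {"solve_A", 10.0},
                       {"solve_B", 8.0},  {"merge", 2.0}};
    double power[2]  = {1.0, 2.0};     /* relative node power (assumed) */
    double finish[2] = {0.0, 0.0};     /* time at which each node becomes free */

    for (int t = 0; t < 4; t++) {
        int best = 0;
        for (int p = 1; p < 2; p++)    /* pick the node with the earliest finish */
            if (finish[p] + tasks[t].est_time / power[p] <
                finish[best] + tasks[t].est_time / power[best])
                best = p;
        finish[best] += tasks[t].est_time / power[best];
        printf("%s -> node %d (node busy until %.1f)\n",
               tasks[t].name, best, finish[best]);
    }
    return 0;
}
```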

Grid Shared Memory (Challenges)
- Different nodes have different page sizes and paging mechanisms.
- Good performance estimation is needed.
- Managing the system-level interactions that provide DSM.

Global Grid Compilation
- Provide a programming language and compilation strategy targeted to the grid.
- A mixture of parallelism styles: data parallelism and task parallelism.
  - Data decomposition
  - Function decomposition