1
High Performance Computing
Overview of the Course Approach
Master Degree Program in Computer Science and Networking
Academic Year
Dr. Gabriele Mencagli, PhD
Department of Computer Science, University of Pisa
High Performance Computing, G. Mencagli, 23/11/2018
2
A Complex Application
- Apply to each image a noise reduction algorithm
- On each denoised image, update each pixel value as a function of the values of its neighbors, using the extracted features as input parameters
- Feature extraction from an image (e.g., blob detection): blob detection methods are aimed at detecting regions in a digital image that differ in properties, such as brightness or color, compared to surrounding regions
- Initial functional specification: a computation graph (workflow) of cooperating modules, working in parallel on data streams
- Non-functional requirements: bandwidth (throughput), response time, memory size, power consumption, ...
3
A Complex Application [figure: the computation graph with a bottleneck module, and a functionally equivalent parallelized version]
How much parallelism?
- Modules may be computation-intensive and/or data-intensive
- Internal parallelization of one or more modules ('bottlenecks'): each is transformed into a subgraph of cooperating modules
- Performance requirements: computational bandwidth (service time), latency and response time, completion time
- Cost models are needed (also to compare alternative parallel versions)
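The bottleneck reasoning above can be sketched numerically. A minimal Python sketch, with entirely hypothetical module names and service times: a pipeline's service time is that of its slowest module, and giving that module an internal parallelism degree n ideally divides its service time by n.

```python
# Hypothetical per-module service times (ms), for illustration only
service_times = {"denoise": 40.0, "feature_extract": 10.0, "update": 5.0}

def pipeline_service_time(times):
    """A pipeline's service time equals that of its slowest stage (the bottleneck)."""
    return max(times.values())

def parallelize(times, module, n):
    """Ideal effect of giving one stage parallelism degree n: its time divided by n."""
    new_times = dict(times)
    new_times[module] = times[module] / n
    return new_times

assert pipeline_service_time(service_times) == 40.0
# Parallelizing the bottleneck with degree 4 rebalances the pipeline:
assert pipeline_service_time(parallelize(service_times, "denoise", 4)) == 10.0
```

With the bottleneck transformed into a 4-worker subgraph, the next-slowest module (here feature extraction) bounds the throughput.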
4
Which Architectures?
- Shared-memory multiprocessor (with external memory); currently: on-chip multicores (8, 16, 64, ..., 1024, ... cores)
- Distributed-memory multicomputer: clusters with a high-performance interconnect (e.g., InfiniBand); clusters of shared-memory multicore-based machines
- Data center / private cloud
It is possible to compile a parallel program once and use it with different parallelism degrees and on different machines, by specifying through proper configuration files which resources, and how many of them, you want to use. The run-time support will be different depending on whether two processes communicate between different nodes of a cluster or between two cores of a multicore.
5
Programming Models/Tools?
- Low-level libraries: message passing (e.g., MPI); shared variables (e.g., OpenMP, Intel TBB)
- Higher-level frameworks and programming environments: skeletons and parallel design patterns (e.g., FastFlow, SkePU); MapReduce / Hadoop / Apache Storm
A methodology for structuring and designing parallel computations provides:
- Low complexity of parallel program design
- A good trade-off between programmability and performance
- Systematic methods for implementing the run-time support on several architectures
- Cost models (= performance models) associated with the parallel program, once it is mapped and executed onto the target architecture
6
Structured Parallel Programming
In the past, parallel programs were arbitrary collections of processes/threads cooperating in different ways ('unstructured parallelism'):
- Exchanging messages (e.g., POSIX sockets, MPI, or IPC mechanisms)
- Sharing variables (for threads, using special mechanisms to protect critical sections)
Over the last 15 years, parallel patterns and algorithmic skeletons have been introduced to model recurrent recipes in parallel computing:
- Parametric in the parallelism degree
- Easy to use and understand
- Equipped with cost models
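As a concrete illustration of a pattern that is parametric in the parallelism degree, here is a minimal task-farm sketch in Python. The skeleton frameworks named earlier (FastFlow, SkePU) are C++ libraries; this is only a language-neutral sketch of the idea, not their API.

```python
from concurrent.futures import ThreadPoolExecutor

def farm(worker, stream, n_workers):
    """Task-farm skeleton: emitter/collector logic is hidden inside the
    pattern; the user supplies only the worker function and the degree."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map distributes items to workers and preserves input order
        return list(pool.map(worker, stream))

# The same program runs unchanged with any parallelism degree:
assert farm(lambda x: x * x, range(5), n_workers=3) == [0, 1, 4, 9, 16]
assert farm(lambda x: x * x, range(5), n_workers=1) == [0, 1, 4, 9, 16]
```

Changing `n_workers` changes only performance, never the result: this is exactly the "parametric in the parallelism degree" property of structured patterns.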
7
A Complex Example
[figure: a computation graph of processes, transformed into a structured parallel program mapped onto the processing nodes of a parallel architecture]
We will be able to:
- Analyse the initial computation, evaluate its performance metrics, and recognize bottlenecks
- Apply the parallel programming methodology to parallelize the bottleneck modules according to high-level parallel paradigms
- Design the run-time support of the parallel modules on the target parallel architecture (multiprocessor, multicomputer, mixed combinations)
- Evaluate the performance according to general cost models
8
Level Structure
- Applications level: independent of the process concept; applications are developed through user-friendly tools; compiled/interpreted into the level below
- Processes level: the parallel program as a collection of cooperating processes (message passing and/or shared variables); architecture-independent; compiled/interpreted into a program executable by the levels below. Optimizations can be applied here, using cost models!
- Assembler and Firmware levels: uniprocessor (instruction-level parallelism); shared-memory multiprocessor (SMP, NUMA, ...); distributed-memory (cluster, MPP, ...)
- Hardware level: Architecture 1, Architecture 2, ..., Architecture m. The run-time support of processes and process cooperation is distinct and different for each architecture
At the application level, applications are designed using formalisms independent of the machine and of the process concept, in particular of how processes are implemented and how they cooperate with each other. At the process level, we know how processes are implemented and the cooperation model adopted; however, the program at this level is still independent of the specific machine (e.g., send/receive primitives are implemented in different ways in shared-memory systems w.r.t. distributed-memory architectures).
9
Cost Models and Abstract Architecture
- At the application level, applications are designed independently of the process concept, of how processes cooperate, and of the target machine
- The process-level version of a parallel program represents a sort of intermediate version (it is still independent of the machine)
- At this level, several optimizations are possible before executing the application, like choosing the correct parallelism degree and evaluating/predicting the performance
General approach:
- Parallel programs are developed independently of the architecture on which they will be executed
- Parallel program performance can be predicted/evaluated using cost models designed with an Abstract Architecture in mind
- The impact of the underlying architecture is captured by some input parameters of the cost models (i.e., calculation times and communication latency)
- The compiler performs optimizations by applying cost models whose input parameters (their values) depend on the physical machine
10
Example of Cost Models
Transformation of a stream-based sequential module into an equivalent parallel module with parallelism degree n:

    n = ideal service time / interarrival time = f_n(T_calc, L_com) / T_A

where:
- T_calc: calculation time; T_calc = <evaluation of the sequential program processing time> = f_calc(..., rho_arch, ...)
- L_com: inter-process communication latency; L_com = <evaluation of the inter-process communication run-time support latency> = f_Lcom(..., rho_arch, ...)
- rho_arch: utilization factor of some server processing units in the underlying architecture (shared-memory modules or network interface modules); rho_arch = phi(interconnection network latency, memory access time, cache management strategies, inter-process communication strategies, average number of conflicting nodes, average distance between two server requests, ...)
These are functions of both computational and architectural issues: all 'hardware-software' levels must be studied in an integrated manner.
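A worked instance of the formula above. The concrete form of f_n is paradigm-specific; here we assume, purely for illustration, a farm in which each task's ideal service time is T_calc plus L_com of communication overhead.

```python
import math

def farm_parallelism_degree(t_calc, l_com, t_a):
    """n = ideal service time / interarrival time, under the (assumed)
    farm cost model f_n(T_calc, L_com) = T_calc + L_com."""
    return math.ceil((t_calc + l_com) / t_a)

# Hypothetical figures: T_calc = 90, L_com = 10, interarrival time T_A = 25
assert farm_parallelism_degree(90.0, 10.0, 25.0) == 4
```

With these toy values, 4 workers suffice to make the parallel module's effective service time match the interarrival time, so it is no longer a bottleneck.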
11
Cost Models and Abstract Architecture: Example
Application level: M1 = Seq(F1); M2 = Farm(F2); M3 = Seq(F3); M1.out = M2; M2.out = M3
Processes level (P1, E, W, C, P3): P1:: {... code with send/receive ...}; E:: {... code with send/receive ...}; Wi:: {... code with send/receive ...}; C:: {... code with send/receive ...}; P3:: {... code with send/receive ...}
Abstract architecture: e.g., as many processing nodes as the number of processes of the application. Nodes interact through a fully interconnected network, i.e., each arc between processes is a link between nodes in the abstract architecture.
In the Abstract Architecture we have a high number of processors and of network links. In the physical architecture this is not true: we can have fewer processors than processes, relying on operating system scheduling; analogously, the interconnection network likely has a smaller number of links, shared among processors/memories. In both cases, we encapsulate the sharing/contention for physical resources in the values of the two parameters T_calc and L_com, which depend on the machine, the parallelism degree of the computation, the layout and allocation of the data structures used by the parallel program, and so forth.
Cost models of the architectures will be used to derive good approximations for T_calc and L_com which, as the next step, allow us to instantiate the abstract architecture and the cost models used by parallel paradigms (e.g., to compute the optimal parallelism degree of the application).
Cost models are parametric in T_calc and L_com: these two parameters are derived by studying the specific target machine. If you change the machine, you have to change the values of T_calc and L_com.
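The three-module graph above can be mimicked in plain Python. The stage functions F1, F2, F3 below are hypothetical stand-ins for the slide's modules; only the farm stage M2 runs with a parallelism degree n.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage functions, for illustration only
def F1(x): return x + 1          # M1 = Seq(F1)
def F2(x): return x * x          # M2 = Farm(F2), the parallel stage
def F3(x): return x - 1          # M3 = Seq(F3)

def run_graph(stream, n_workers):
    """Seq(F1) -> Farm(F2) -> Seq(F3): only F2 is replicated n_workers times."""
    stage1 = map(F1, stream)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        stage2 = list(pool.map(F2, stage1))   # emitter/workers/collector
    return [F3(y) for y in stage2]

assert run_graph([1, 2, 3], n_workers=2) == [3, 8, 15]
```

The composition M1.out = M2; M2.out = M3 is expressed by function chaining; the abstract architecture would assign one node to P1, E, each Wi, C, and P3.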
12
Finding the Input Parameters of the Cost Model
[figure: the parallel program (P1, E, W, C, P3), using a feasible parallelism paradigm, mapped onto the abstract architecture]
- Cost model of the parallelism paradigm: Rq = F1(T_calc, L_com, ...), Bw = F2(T_calc, L_com, ...)
- Contention in the memory hierarchy, and in the interconnection network (links and switching units)
- The cost model of the architecture will be used to evaluate the impact of contention and to derive the average T_calc and the average L_com
13
Course Big Picture
On the left-hand part of the figure is the traditional level-based view of a computing system: the Applications level, independent of the process concept and developed through user-friendly tools, is compiled/interpreted into the Processes level, a collection of cooperating processes (message passing and/or shared variables), still architecture-independent; this in turn is compiled/interpreted, through the Assembler and Firmware levels, into a program executable by one of the concrete architectures (uniprocessor with instruction-level parallelism; shared-memory multiprocessor: SMP, NUMA, ...; distributed-memory: cluster, MPP, ...; Architecture 1, Architecture 2, ..., Architecture m), each with its own distinct run-time support of processes and process cooperation.
On the right-hand part is the conceptual and technological scheme for the development of high-performance applications: an abstract architecture and associated cost models for the different concrete architectures (..., Ti = fi(a, b, c, d, ...), Tj = fj(a, b, c, d, ...), ...).
14
Using the Abstract Architecture
- Abstract Processing Elements (PEs), having all the main features of real PEs (processor, assembler, memory hierarchy, external memory, I/O, etc.); one PE for each process
- Abstract interconnection network: all the needed direct links corresponding to the inter-process communication channels
- Evaluation of calculation times T_calc
- Abstraction of the physical interconnect, memory hierarchy, I/O, process run-time support, process mapping onto PEs, etc.: all the physical details are condensed into a small number of parameters used to evaluate L_com
- Result: a cost model of the specific parallel program executed on the specific parallel architecture
15
The Steps of the Methodology
The methodology consists of several conceptual steps:
1. Start from the sequential process P and derive the basic parameters L_com and T_calc without considering congestion in the architecture
2. Compute the optimal parallelism degree and choose a parallel paradigm; evaluate the cost model of the parallel paradigm to check scalability issues
3. Extract some parameters (T_p, p, R_Q-0, ...) that depend both on the application and on the hardware characteristics, and apply the cost model of the architecture to evaluate congestion
4. Re-evaluate the cost model of the parallel paradigm with the updated L_com and T_calc values, which take into account the congestion effect on the hardware resources (e.g., memories, caches, networks)
5. Obtain the final parallel version
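The feedback between the steps can be sketched as a small fixed-point loop. The congestion model below is entirely hypothetical (a penalty growing linearly with the degree n); in the course, the inflated T_calc and L_com would come from the architecture's cost model.

```python
import math

def refine_parallelism_degree(t_calc_0, l_com_0, t_a, congestion, rounds=3):
    """Start from congestion-free T_calc and L_com, choose n, then
    re-evaluate both with a congestion penalty that grows with n,
    and recompute n until it settles (or the rounds run out)."""
    n = math.ceil((t_calc_0 + l_com_0) / t_a)       # congestion-free estimate
    for _ in range(rounds):
        t_calc = t_calc_0 * (1 + congestion * n)    # hypothetical penalty
        l_com = l_com_0 * (1 + congestion * n)
        n = math.ceil((t_calc + l_com) / t_a)
    return n

# Congestion pushes the required degree from 4 up to 5 in this toy setting
assert refine_parallelism_degree(90.0, 10.0, 25.0, congestion=0.05) == 5
```

With zero congestion the loop returns the congestion-free degree; a positive penalty forces extra workers to compensate for the slower, contended resources.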
16
Prerequisites and Techniques
- Process-level modules: parallel programming, communication and synchronization, run-time support
- Firmware-level modules (processing units): realization, clock cycle, communications
- CPUs: assembler level, compilation, optimizations
- Shared memory: addressing spaces, SMP and NUMA models, input-output processing
- Memory hierarchies: virtual memory, cache memories
All these concepts and techniques must be CONCRETELY APPLIED and USED during the HPC course.