HPC Technology Track: Foundations of Computational Science, Lecture 2
Dr. Greg Wettstein, Ph.D.
Research Support Group Leader, Division of Information Technology
Adjunct Professor, Department of Computer Science
North Dakota State University
What is High Performance Computing? Definition: The solution of problems involving high degrees of computational complexity or data analysis which require specialized hardware and software systems.
What is Parallel Computing? Definition: A strategy of decreasing the time to solution of a computational problem by carrying out multiple elements of the computation at the same time.
Does HPC imply Parallel Computing? Typically, but not always. HPC solutions may require specialized systems due to memory and/or I/O performance issues rather than computational concurrency. Conversely, parallel computing does not necessarily imply high performance computing.
Flynn's Taxonomy: Classification Strategy for Concurrent Execution
SISD - Single Instruction, Single Data
MISD - Multiple Instruction, Single Data
SIMD * - Single Instruction, Multiple Data
MIMD * - Multiple Instruction, Multiple Data
* = Relevant to HPC
SIMD: The Origin of HPC
Architectural model at the heart of 'vector processors'.
The performance enhancement in the machines at the origin of HPC: the CDC STAR-100 and Cray-1.
Utility is predicated on the fact that mathematical operations on vectors and vector spaces are at the heart of linear algebra.
Vector Processing Diagram: parallel mathematical operations (+, -, *, /) applied across the elements of vectors with a vector length of 8 'words'.
Current SIMD Examples
Embedded in modern x86 and x86_64 architectures, primarily focused on graphics/signal processing: MMX, PNI, SSE2-4, AVX.
Foundation for the current trend in 'GPGPU computing': NVIDIA Tesla architecture.
Component of the Larrabee architecture.
SSE Implementation Diagram: vector elements packed into a 128-bit XMM register; parallel operations (100+ with SSE4); stride length.
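As a rough illustration (a minimal sketch, not taken from the course code; the vec_add() routine is invented here), the following C++ fragment adds two float arrays with SSE intrinsics. Each __m128 value occupies one 128-bit XMM register, so the stride is four single-precision elements per iteration:

#include <xmmintrin.h>   /* SSE intrinsics */

/* Add two float arrays four elements at a time.  Each __m128 maps onto a
 * 128-bit XMM register, so the stride is 4 single-precision values and the
 * loop body executes n / 4 times (n assumed to be a multiple of 4). */
void vec_add(const float *a, const float *b, float *c, int n)
{
        for (int i = 0; i < n; i += 4) {
                __m128 va = _mm_loadu_ps(a + i);
                __m128 vb = _mm_loadu_ps(b + i);
                _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
        }
}

In many cases a compiler invoked with optimization and SSE enabled (e.g. gcc -O3 -msse2) will generate equivalent code from the plain scalar loop, which is the compiler participation referred to below in the discussion of orthogonalization.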
MIMD: Multiple Instruction, Multiple Data. Characterized by multiple execution threads operating on separate data elements. Threads may operate in shared or disjoint (distributed) memory configurations. Implementation example: SMP (Symmetric Multi-Processing).
SPMD: The Basis for Modern HPC
Defined as multiple processes executing a common program at different points.
Different from SIMD in that execution is not in lockstep.
Common implementations:
  shared memory: OpenMP, Pthreads
  distributed memory: MPI
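A minimal SPMD sketch using MPI (illustrative only; the rank-strided work division and the particular sum computed are arbitrary choices made here): every rank runs the same program but operates on a disjoint slice of the iteration space, and the partial results are combined with a reduction.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Same program, different execution points: each rank handles the
         * indices i == rank, rank+size, rank+2*size, ... */
        const int N = 1000000;
        double local = 0.0;
        for (int i = rank; i < N; i += size)
                local += 1.0 / (i + 1);

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
                printf("sum = %f\n", total);

        MPI_Finalize();
        return 0;
}

Unlike SIMD, nothing forces the ranks to execute in lockstep; each proceeds independently until it reaches the collective MPI_Reduce() call.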
Characteristics of MD Models
MIMD/SPMD requires active participation by the programmer to implement 'orthogonalization'.
SIMD requires active participation by the compiler, with consideration by the programmer, to support orthogonalization.
Orthogonalization, defined: the isolation of a problem into discrete elements capable of being independently resolved.
The Real World - A Continuum
Practical programs do not exhibit strict model partitioning. A more pragmatic model is to consider the 'dimensions' of parallelism available to a program. Currently a total of four dimensions of parallelism are exploitable.
Dimensions of Parallelism
First dimension:
  standard sequential programming with processor-supplied ILP (Instruction Level Parallelism)
  referred to as 'free' or 'invisible' parallelism
Second dimension:
  SIMD or OpenMP loop parallelism
  characterized by isolation of the problem within a single system image
  primarily supported by the programming language or compiler (a minimal OpenMP sketch follows)
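A minimal sketch of second-dimension loop parallelism with OpenMP (the scale() routine is a made-up example, not part of the lecture code): the programmer asserts that the iterations are orthogonal and the compiler/runtime distributes them across the threads of a single system image.

#include <omp.h>

/* Scale an array in place.  Each iteration touches only x[i], so the
 * iterations are orthogonal and may be divided among threads. */
void scale(double *x, double a, int n)
{
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
                x[i] *= a;
}

Compiled without -fopenmp (or the equivalent flag) the pragma is ignored and the routine degrades gracefully to the sequential, first-dimension form.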
Dimensions of Parallelism - cont.
Third dimension - two subtypes:
  use of MPI to partition the problem into orthogonal elements; the partitioning is frequently implemented across multiple system images
  MIMD threading on a single system image; separate threads are dispatched to handle separate tasks which can execute asynchronously
A common HPC example is to 'thread' computation and Input/Output (I/O), as sketched below.
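A sketch of the compute/I/O threading pattern using C++ std::thread (the chunked computation and output file are hypothetical stand-ins, not taken from the course materials): the I/O for the chunk just produced is handed to a separate thread while the next chunk is being computed.

#include <fstream>
#include <thread>
#include <vector>

/* Hypothetical compute step: fill a chunk with values derived from its index. */
static void compute_chunk(int c, std::vector<double> &buf)
{
        buf.assign(1 << 16, static_cast<double>(c));
}

int main()
{
        std::ofstream out("results.bin", std::ios::binary);
        std::vector<double> ready, next;
        const int nchunks = 8;

        compute_chunk(0, next);
        for (int c = 0; c < nchunks; ++c) {
                ready.swap(next);

                /* I/O thread writes the finished chunk ... */
                std::thread io([&]{
                        out.write(reinterpret_cast<const char *>(ready.data()),
                                  ready.size() * sizeof(double));
                });

                /* ... while the main thread computes the next one. */
                if (c + 1 < nchunks)
                        compute_chunk(c + 1, next);

                io.join();
        }
        return 0;
}

The same pattern is commonly written with Pthreads; std::thread is used here only to keep the sketch short.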
Dimensions of Parallelism - cont.
Fourth dimension:
  partitioning of the problem into orthogonal elements which can be dispatched to a heterogeneous instruction architecture
  examples: GPGPU/CUDA, PowerXcell SPU, FPGA
Depth of Parallelism
A measure of the complexity of the parallelism implemented. The simplest metric is the count of programmer-implemented dimensions of parallelism on a single system image.
Example: an MPI implementation with SIMD loop vectorization on each node has a parallelism depth of two.
Parallelism Analysis Examples
Process-based MIMD application: depth = 1
MPI simulation with OpenMP loop vectorization: depth = 2
MPI partitioning with CUDA PTree offload and SIMD loop vectorization: depth = 3
Escalation of Complexity
Architectural decisions must be based on a cost/benefit analysis of the performance returns.
Diagram: complexity ranges from least to most as the depth of parallelism increases from 1 to N.
Exercise
Verify you have the changeset which adds experimental code for SSE/SIMD-based boolean PTree operators.
Study the class methods implementing the AND and OR operators.
Review and understand how the vector and stride lengths affect the number of times a loop needs to be executed.
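For orientation while working the exercise, here is a hedged sketch of what an SSE boolean AND over a bit map might look like; the ptree_and() name and signature are invented for illustration and are not the actual class methods in the changeset. With a 128-bit XMM register the stride is 128 bits, so a map of nbits bits needs nbits / 128 passes through the loop.

#include <emmintrin.h>   /* SSE2 integer intrinsics */
#include <cstddef>

/* Hypothetical stand-in for a PTree AND operator: combine two bit maps,
 * 128 bits per iteration.  nbits is assumed to be a multiple of 128. */
void ptree_and(const __m128i *a, const __m128i *b, __m128i *out, size_t nbits)
{
        size_t iterations = nbits / 128;        /* vector length / stride length */

        for (size_t i = 0; i < iterations; i++)
                _mm_storeu_si128(out + i,
                                 _mm_and_si128(_mm_loadu_si128(a + i),
                                               _mm_loadu_si128(b + i)));
}

Substituting _mm_or_si128() for _mm_and_si128() gives the corresponding OR operator; in either case the loop count is governed entirely by the vector length divided by the stride length.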
goto skills_lecture1;