CS 420 - Design of Algorithms: Parallel Computer Architecture and Software Models

CS 420 - Design of Algorithms: Parallel Computer Architecture and Software Models

Parallel Computing – it’s about performance. Greater performance is the reason for parallel computing. Many types of scientific and engineering programs are too large and too complex for traditional uniprocessors. Such large problems are common in ocean modeling, weather modeling, astrophysics, solid-state physics, power systems, CFD, and more.

FLOPS – a measure of performance. FLOPS – Floating Point Operations per Second – a measure of how much computation can be done in a certain amount of time. MegaFLOPS (MFLOPS) – 10^6 FLOPS; GigaFLOPS (GFLOPS) – 10^9 FLOPS; TeraFLOPS (TFLOPS) – 10^12 FLOPS; PetaFLOPS (PFLOPS) – 10^15 FLOPS.

How fast? Cray-1 – ~150 MFLOPS. Pentium 4 – 3–6 GFLOPS. IBM’s BlueGene/L – ~367 TFLOPS. PSC’s Big Ben – 10 TFLOPS. Humans – it depends: as calculators – … MFLOPS; as information processors – ~10 PFLOPS.

FLOPS vs. MIPS. FLOPS is concerned only with floating-point calculations; other performance issues include memory latency, cache performance, I/O capacity, and the interconnect.

See… biannual performance reports and … rankings of the fastest computers in the world

Performance: Speedup(n processors) = time(1 processor) / time(n processors). ** Culler, Singh, and Gupta, Parallel Computer Architecture: A Hardware/Software Approach
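
As a small worked illustration of this formula, here is a minimal C sketch (the runtimes are made-up placeholder numbers, not measurements from the slides) that computes speedup and the corresponding parallel efficiency:

#include <stdio.h>

/* Speedup(n processors) = time(1 processor) / time(n processors) */
double speedup(double t1, double tn) { return t1 / tn; }

/* Efficiency = speedup / number of processors */
double efficiency(double t1, double tn, int n) { return speedup(t1, tn) / n; }

int main(void) {
    double t1 = 120.0;   /* hypothetical serial runtime, in seconds      */
    double t8 = 18.0;    /* hypothetical runtime on 8 processors         */
    printf("speedup    = %.2f\n", speedup(t1, t8));       /* prints 6.67 */
    printf("efficiency = %.2f\n", efficiency(t1, t8, 8)); /* prints 0.83 */
    return 0;
}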

Consider… from:

… a model of the Indian Ocean – 73,000,000 square kilometers. One data point per 100 meters – 7,300,000,000 surface points. Model the ocean at depth too – say every 10 meters down to 200 meters – 20 depth data points. Every 10 minutes for 4 hours – 24 time steps.

So – 73 × 10^6 (square kilometers of surface) × 10^2 (points per sq. km) × 20 (depth points per surface point) × 24 (time steps) = 3,504,000,000,000 data points in the model grid. Suppose the calculation at each grid point takes 100 instructions – 350,400,000,000,000 instructions per run of the model.

Then – imagine you have a computer that can run 1 billion (10^9) instructions per second: 3.504 × 10^14 / 10^9 = 350,400 seconds, or roughly 97 hours.

But – on a 10-teraflops computer: 3.504 × 10^14 / 10^13 = 35.0 seconds.
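
The whole back-of-the-envelope calculation above fits in a few lines of C; this sketch simply re-derives the slide’s numbers (the machine speeds are the same 10^9 instructions/second and 10 TFLOPS assumed there):

#include <stdio.h>

int main(void) {
    double surface_km2  = 73.0e6;   /* area of the Indian Ocean            */
    double pts_per_km2  = 100.0;    /* one point per 100 m in each axis    */
    double depth_points = 20.0;     /* every 10 m down to 200 m            */
    double time_steps   = 24.0;     /* every 10 minutes for 4 hours        */
    double instr_per_pt = 100.0;    /* assumed work per grid point         */

    double grid_points  = surface_km2 * pts_per_km2 * depth_points * time_steps;
    double instructions = grid_points * instr_per_pt;

    printf("grid points  : %.4g\n", grid_points);             /* 3.504e12 */
    printf("instructions : %.4g\n", instructions);            /* 3.504e14 */
    printf("at 10^9 i/s  : %.1f hours\n", instructions / 1e9 / 3600.0);
    printf("at 10 TFLOPS : %.1f seconds\n", instructions / 1e13);
    return 0;
}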

Gaining performance: Pipelining. More instructions, faster – more instructions in execution at the same time in a single processor. Not usually an attractive strategy these days – why?

Instruction Level Parallelism (ILP) – based on the fact that many instructions do not depend on the instructions that come before them. The processor has extra hardware – multiple adders, for example – to execute several instructions at the same time.

Pipelining and ILP are not the solution to our problem – why? They give only incremental improvements in performance, they have already been done, and we need orders-of-magnitude improvements in performance.

Gaining Performance: Vector Processors. Scientific and engineering computations are often vector and matrix operations – e.g., graphic transformations such as shifting an object x units to the right. Redundant arithmetic hardware and vector registers operate on an entire vector in one step (SIMD).

Gaining Performance: Vector Processors. Their popularity declined for a while – the hardware was expensive. Popularity is returning – applications in science, engineering, cryptography, media/graphics. The Earth Simulator… your computer?

Parallel Computer Architecture: Shared Memory Architectures and Distributed Memory Architectures.

Shared Memory Systems. Multiple processors connected to, and sharing, the same pool of memory (SMP). Every processor potentially has access to and control of every memory location.

[Diagram slides: shared-memory computers – several processors all connected to a single memory, either directly or through a switch.]

Shared Memory Computers: SGI Origin2000 at NCSA (Balder) – … MHz R10000 processors, 128 GBytes of memory.

Shared Memory Computers: Rachel at PSC – … GHz EV7 processors, 256 GBytes of shared memory.

Distributed Memory Systems. Multiple processors, each with its own memory, interconnected to share/exchange data and processing. The modern architectural approach to supercomputers. Supercomputers and clusters are similar. **Hybrid distributed/shared memory

[Diagram: a cluster with distributed memory – each processor has its own memory, and all are joined by an interconnect.]

[Diagram: a cluster of SMP nodes – each node has two processors sharing one memory, and the nodes are joined by an interconnect.]

Distributed Memory Supercomputer: BlueGene/L (DOE/IBM) – 0.7 GHz PowerPC processors; 367 TFLOPS (previously 70 TFLOPS).

Distributed Memory Supercomputer: Thunder at LLNL – number 19 (was number 5); 20 TFLOPS; 4096 Itanium processors at 1.4 GHz.

Earth Simulator (Japan, built by NEC) – number 14 (was number 1); 40 TFLOPS; 640 nodes, each node = 8 vector processors; 640×640 full crossbar interconnect.

Grid Computing Systems. What is a grid? It means different things to different people. Distributed processors – around campus, around the state, around the world.

Grid Computing Systems Widely distributed Loosely connected (i.e. Internet) No central management

Grid Computing Systems Connected Clusters/other dedicated scientific computers I2/Abilene

[Diagram: a grid computing system – a control/scheduler node farms work out over the Internet to machines whose idle cycles are harvested.]

Grid Computing Systems. Dedicated grids: TeraGrid, Sabre, NASA Information Power Grid. Cycle-harvesting grids: Condor, *Global Grid Exchange (Parabon).

Flynn’s Taxonomy Single Instruction/Single Data - SISD Multiple Instruction/Single Data - MISD Single Instruction/Multiple Data - SIMD Multiple Instruction/Multiple Data - MIMD *Single Program/Multiple Data - SPMD

SISD – Single Instruction, Single Data. Single instruction stream – one instruction executed per clock cycle. Single data stream – one piece of data per clock cycle. Deterministic. Traditional CPUs, most single-CPU PCs. Example: Load x to A; Load y to B; Add B to A; Store A; Load x to A; …

SIMD – Single Instruction, Multiple Data. One instruction stream, multiple data streams (partitions) – a given instruction operates on multiple data elements in lockstep. Deterministic. Processor arrays and vector processors – CM-2, Cray C90. Example: PE-1 computes C(1)=A(1)*B(1), PE-2 computes C(2)=A(2)*B(2), …, PE-n computes C(n)=A(n)*B(n), all executing the same Load/Multiply/Store sequence at the same time.

MISD – Multiple Instruction, Single Data. Multiple instruction streams operate on a single data stream – several instructions operate on the same data element concurrently. A bit strange – CMU. Uses: multi-pass filters, encryption/code cracking. Example: PE-1, PE-2, …, PE-n all load the same A(1), and each computes its own result from it – C(1)=A(1)*4, C(2)=A(1)*4, …, C(n)=A(1)*4.

MIMD – Multiple Instruction, Multiple Data. Multiple instruction streams, multiple data streams – each processor has its own instructions and its own data. Most supercomputers, clusters, and grids. Example: PE-1 computes C(1)=A(1)*4, PE-2 computes C=A*Pi, …, PE-n calls func1(B,C) and func2(C,G) – different code on different data on every PE.

SPMD – Single Program, Multiple Data. A single code image/executable; each processor has its own data; instruction execution is steered under program control (if PE=1 then…, if PE=2 then…, if PE=n then…). Used on distributed memory clusters and SMPs. Example: every PE runs the same code – Load A, Load B, C=A*B, Store C – but each branches on its own PE id.
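
A minimal SPMD sketch in C using MPI (MPI itself is introduced later in these slides, so treat it here as an assumed tool): every rank runs the same executable and branches on its own id, just like the “if PE=1 then…” pattern above.

/* spmd.c -- build with: mpicc spmd.c -o spmd ; run with: mpirun -np 4 ./spmd */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my PE id       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of PEs  */

    /* same program everywhere; behavior is steered by the PE id */
    if (rank == 0)
        printf("PE %d of %d: I do the coordination work\n", rank, size);
    else
        printf("PE %d of %d: I work on my own slice of the data\n", rank, size);

    MPI_Finalize();
    return 0;
}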

MPMD – Multiple Program, Multiple Data. Like SPMD… except each processor runs a separate, independent executable (ProgA, ProgB, ProgC, ProgD). How to implement interprocess communication? Sockets, MPI-2 – more later.

UMA and NUMA. UMA – Uniform Memory Access: all processors have equal access to memory. Usually found in SMPs with identical processors. Difficult to implement as the number of processors increases. Good processor-to-memory bandwidth. Cache coherency (CC) is important and can be implemented in hardware.

UMA and NUMA. NUMA – Non-Uniform Memory Access: access to memory differs by processor – the local processor gets good access, non-local processors get not-so-good access. Usually multiple computers or multiple SMPs. Memory access across the interconnect is slow. Cache coherency (CC) can be done, and is usually not a problem.

Let’s revisit speedup… we can (theoretically) achieve speedup by using more processors, but a number of factors may limit speedup: interprocessor communication, interprocess synchronization, load balance, and the parallelizability of the algorithm.

Amdahl’s Law. According to Amdahl’s Law, Speedup = 1 / (S + (1 - S)/N), where S is the purely sequential fraction of the program and N is the number of processors.

Amdahl’s Law – what does it mean? Part of a program is parallelizable; part of the program must remain sequential (S). Amdahl’s law says speedup is constrained by the portion of the program that must remain sequential relative to the part that is parallelized. Note: if S is very small, the problem is “embarrassingly parallel” – sometimes, anyway!
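
A small C sketch of Amdahl’s Law as stated above, tabulating the predicted speedup for a hypothetical program whose sequential fraction S is 10%:

#include <stdio.h>

/* Amdahl's law: speedup = 1 / (S + (1 - S) / N) */
double amdahl(double S, int N) { return 1.0 / (S + (1.0 - S) / N); }

int main(void) {
    double S = 0.10;   /* assumed sequential fraction of the program */
    int procs[] = {1, 2, 4, 8, 16, 64, 1024};

    for (int i = 0; i < (int)(sizeof procs / sizeof procs[0]); i++)
        printf("N = %4d  ->  speedup = %6.2f\n", procs[i], amdahl(S, procs[i]));

    /* the speedup approaches 1/S = 10 no matter how many processors we add */
    return 0;
}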

Software models for parallel computing Sockets and other P2P models Threads Shared Memory Message Passing Data Parallel

Sockets and others. TCP sockets: establish TCP links among processes and send messages through the sockets. Also RPC, CORBA, DCOM, web services, SOAP…

Threads. A single executable runs… at specific points in execution it launches new threads… threads can be launched on other PEs… when the threads finish, control returns to the main program – fork and join. POSIX, Microsoft. OpenMP is implemented with threads. [Diagram: a master thread forks worker threads t0–t3 and later joins them.]
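
A minimal POSIX-threads fork/join sketch in C, illustrating the pattern just described (the worker function and the thread count are illustrative, not from the slides):

/* fork_join.c -- build with: gcc fork_join.c -o fork_join -lpthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];

    /* fork: launch the threads */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);

    /* join: wait for them, then control returns to the main program */
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("all threads joined, back in main\n");
    return 0;
}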

Shared Memory. Processes share a common memory space and share data through that common memory. A protocol is needed to “play nice” with memory. OpenMP. [Diagram: several processors attached to one shared memory.]
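
A minimal OpenMP sketch in C of the shared-memory model: all threads see the same array, and the reduction clause is the “protocol” that keeps the update of the shared sum well behaved (array size and contents are assumptions for illustration):

/* shared_sum.c -- build with: gcc -fopenmp shared_sum.c -o shared_sum */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];          /* one array in shared memory        */
    double sum = 0.0;

    for (int i = 0; i < N; i++)  /* fill it with something to add up  */
        a[i] = 1.0;

    /* every thread works on part of the same shared array;
       the reduction avoids a race on the shared variable 'sum' */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}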

Distributed Memory - Message Passing. Data messages are passed from PE to PE. Message passing is explicit – under program control. The parallelization is designed by the programmer… and implemented by the programmer. [Diagram: processors, each with its own memory, connected by an interconnect.]

Message Passing. Message passing is usually implemented as a library – functions and subroutine calls. Most common – MPI, the Message Passing Interface. Standards: MPI-1, MPI-2. Implementations: MPICH, OpenMPI, MPICH-GM (Myrinet), MPICH-G2, MPICH-G.
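
A minimal MPI message-passing sketch in C: rank 0 explicitly sends a message that rank 1 explicitly receives, all under program control as described above (the payload value is made up):

/* ping.c -- build with: mpicc ping.c -o ping ; run with: mpirun -np 2 ./ping */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double payload = 3.14;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* explicit send: the programmer decides what moves and when */
        MPI_Send(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        printf("PE 0 sent %.2f to PE 1\n", payload);
    } else if (rank == 1) {
        double msg;
        MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("PE 1 received %.2f from PE 0\n", msg);
    }

    MPI_Finalize();
    return 0;
}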

Message Passing on a hybrid DM/SMP system. How does it look from a message-passing perspective? How is MPI implemented? [Diagram: SMP nodes – two processors sharing one memory – joined by an interconnect.]

Data Parallel. Processes work concurrently on pieces of a single data structure. SMP – each process works on a portion of the structure in common memory. DMS – the data structure is partitioned, distributed, computed (and collected).

Data Parallel. Can be done with calls to libraries or with compiler directives… and can be automatic (sort of). High Performance Fortran (HPF), Fortran 95.
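
The slides point to HPF and Fortran 95 for data-parallel programming; as an illustrative stand-in in C, this sketch uses an OpenMP parallel loop so that each thread works concurrently on its own piece of a single array (the array size and the operation are assumptions):

/* data_parallel.c -- build with: gcc -fopenmp data_parallel.c -o data_parallel */
#include <omp.h>
#include <stdio.h>

#define N 16

int main(void) {
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* one data structure, partitioned across threads:
       each thread computes its own chunk of c = a * b */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i];

    for (int i = 0; i < N; i++)
        printf("c[%2d] = %5.1f\n", i, c[i]);
    return 0;
}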

Comments on Automatic Parallelization. Some compilers can automatically parallelize portions of code (e.g., HPF). Usually loops are the target. The result is essentially a serial algorithm with portions pushed out to other processors. Problems: it is not a parallel algorithm and it is not (entirely) under programmer control – it might be wrong, and it might result in a slowdown.

See…