An Overview of Parallel Computing

Hardware There are many varieties of parallel computing hardware and many different architectures. The original classification of parallel computers is popularly known as Flynn’s taxonomy. Flynn classified systems according to the number of instruction streams and the number of data streams: 1. SISD (the von Neumann machine) 2. MIMD (the most general: a collection of autonomous processors, each operating on its own data stream) 3. SIMD 4. MISD

The Classical von Neumann Machine Divided into a CPU and main memory. The CPU is divided into a control unit and an ALU. The memory stores both instructions and data: the control unit directs the execution of programs, and the ALU carries out the calculations. While a program is running, the instructions and data it is using are stored in very fast memory locations called registers. Both data and instructions are moved between memory and the CPU’s registers via a bus. A bus is a collection of parallel wires; faster buses have more wires.

The Classical von Neumann Machine To be useful, some additional devices are needed, including input devices, output devices, and extended storage devices (disks). The bottleneck is the transfer of data and instructions between memory and the CPU. Few computers today are pure von Neumann machines: most now have a hierarchical memory, with a cache used to achieve faster access.

Pipeline and Vector Architecture The first widely used extension to the von Neumann model was pipelining. Suppose we have the following program: float x[100], y[100], z[100]; for (i = 0; i < 100; i++) z[i] = x[i] + y[i]; A single floating-point addition consists of the following operations: 1. Fetch the operands from memory; 2. Compare exponents; 3. Shift one operand; 4. Add; 5. Normalize the result; 6. Store the result in memory.

Pipeline and Vector Architecture A further improvement: add vector instructions. With vector instructions, each of the basic instructions only needs to be issued once: one short instruction encodes N operations. Another improvement is the use of multiple memory banks. Different authors place vector processors in different categories (MISD, a variant of SIMD, or even not really parallel machines). Examples: the CRAY C90 and NEC SX-4.

Pipeline and Vector Architecture Advantages: it is relatively easy to write programs that obtain very high performance, so these machines are very popular for high-performance scientific computing. Disadvantages: they don’t work well for programs that use irregular structures or many branches.

SIMD Systems A pure SIMD system, as opposed to a vector processor, has a single control CPU together with a collection of subordinate processors. During each instruction cycle, the control processor broadcasts an instruction to all of the subordinate processors; each of them either executes the instruction or idles. Example: for (i = 0; i < 1000; i++) if (y[i] != 0.0) z[i] = x[i]/y[i]; else z[i] = x[i];

SIMD Systems Each subordinate processor would execute: Time Step 1: Test whether local_y is 0.0. Time Step 2: a. If local_y was nonzero, z[i] = x[i]/y[i]; b. If local_y was zero, do nothing. Time Step 3: a. If local_y was nonzero, do nothing; b. If local_y was zero, z[i] = x[i]. Execution is completely synchronous: a given subordinate processor is either active or idle at any given instant of time.

SIMD Systems The disadvantage is clear: in a program with many conditional branches, or with long segments of code whose execution depends on conditionals, many processors may be idle for long periods of time. Programming is easy if the underlying problem has a regular structure. The most famous examples of SIMD machines are the CM-1 and CM-2 Connection Machines produced by Thinking Machines.

General MIMD Systems The key difference between SIMD and MIMD is that in MIMD the processors are autonomous. MIMD systems are asynchronous: there is often no global clock, and there may be no correspondence between the states of different processors even if they execute the same program. MIMD systems are divided into shared-memory and distributed-memory systems, also sometimes called multiprocessors and multicomputers.

Shared-Memory MIMD The generic shared-memory architecture

Bus-based Architecture The simplest interconnection network is a bus. If multiple processors access memory at the same time, the bus becomes saturated, causing long delays, so each processor typically has a fairly large cache. Due to the limited bandwidth of a bus, these systems do not scale to large numbers of processors.

Switch-based Architecture Most other shared-memory systems rely on some type of switch-based network. A crossbar can be viewed as a rectangular mesh of wires with switches at the points of intersection, and terminals on its left and top edges.

Switch-based Architecture Processors or memory modules can be connected to the terminals. The switches can either allow a signal to pass through in both directions simultaneously, or they can redirect a signal from vertical to horizontal or vice versa. Different processors can simultaneously access different memory modules, so crossbars don’t suffer from the saturation problems of a bus. However, they are very expensive: an m×n crossbar needs mn hardware switches.

Cache Coherence Cache coherence is a problem for any shared-memory architecture: when a processor accesses a shared variable in its cache, how does it know whether the value stored there is current? Example (assume x = 2 initially):

Time    P1          P2
0       y0 = x;     y1 = 3*x;
1       x = 7;      z = 6;
2       y = 5;      z1 = 4*x;

y0 ends up 2 and y1 ends up 6. How about z1? It should be 4*7 = 28, but if P2 still uses a stale cached copy of x, it will be 4*2 = 8.

Cache Coherence The simplest solution is probably the snoopy protocol. Each CPU has a cache controller that monitors the bus traffic. When a processor updates a shared variable, it also updates the corresponding main memory location; the cache controllers on the other processors detect the write to main memory and mark their copies of the variable as invalid. This approach is only suitable for bus-based shared memory, because any traffic on the bus can be monitored by all the controllers.

Distributed-Memory MIMD Each processor has its own private memory. If we view a generic distributed-memory MIMD machine as a graph, the edges are communication wires. Either each vertex corresponds to a processor/memory pair (a node), or some vertices correspond to nodes and others correspond to switches. These are called static networks and dynamic networks, respectively.

Distributed-Memory MIMD Different types of distributed-memory systems: (a) a static network (mesh); (b) a dynamic network (crossbar)

Dynamic Interconnection Networks Example: an omega network

Dynamic Interconnection Networks A less expensive solution is to use a multistage switching network, such as the omega network. For p nodes, (p log p)/2 switches are needed, fewer than the p² switches of a crossbar. The delay in transmitting a message is increased, since log p switches must be set.

Static Interconnection Networks A fully connected interconnection network is the ideal case from the standpoints of both performance and programming: communication has no delay. However, the costs are huge: p(p-1)/2 wires for p nodes.

Static Interconnection Networks A linear array or a ring is relatively inexpensive (p-1 or p wires, respectively), and it is easy to increase the size of the network. However, the number of available wires is extremely limited, and the longest path is p-1 (linear array) or about p/2 (ring).

Static Interconnection Networks Hypercube: practically the closest to the fully connected network. A d-dimensional hypercube has 2^d nodes, and any two nodes are at most d wires apart. Drawback: a relative lack of scalability, since the degree of every node grows with d.

Static Interconnection Networks Mesh or torus

Static Interconnection Networks A mesh or torus lies between the hypercube and the linear array: it scales better than a hypercube, and it is quite popular.

Communication and Routing If two nodes are not directly connected, or if a processor is not directly connected to a memory module, how is data transmitted between the two? If there are multiple routes, how is the route chosen? Is it always the shortest path? Most systems use a deterministic shortest-path algorithm. How do intermediate nodes forward communications? The two basic approaches are store-and-forward routing and cut-through routing. Store-and-forward routing uses considerably more memory, and most systems use some variant of cut-through routing.

Store-and-Forward Routing An intermediate node reads in the entire message, and only then sends it on to C

Cut-Through Routing Each node immediately forwards each identifiable piece of the message