
Parallel Computing Erik Robbins

Limits on single-processor performance
- Over time, computers have become better and faster, but there are constraints to further improvement.
- Physical barriers
  - Heat and electromagnetic interference limit chip transistor density.
  - Processor speeds are constrained by the speed of light.
- Economic barriers
  - Cost will eventually increase beyond the price anybody is willing to pay.

Parallelism
- Improvement of processor performance by distributing the computational load among several processors.
- The processing elements can be diverse:
  - Single computer with multiple processors
  - Several networked computers

Drawbacks to Parallelism
- Adds cost.
- Imperfect speed-up:
  - Given n processors, perfect speed-up would imply an n-fold increase in power.
  - A small portion of a program that cannot be parallelized will limit the overall speed-up.
  - "The bearing of a child takes nine months, no matter how many women are assigned."

Amdahl's Law
- This limit on speed-up is captured by the equation:
  - S = 1 / ((1 - P) + P / n), where n is the number of processors.
  - As n grows without bound, the speed-up approaches its upper limit, S = 1 / (1 - P).
- S is the speed-up of the program (as a factor of its original sequential runtime).
- P is the fraction of the program that is parallelizable.
- Web Applet

Amdahl’s Law
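
To make the formula concrete, here is a minimal C sketch (not part of the original slides; the parallel fraction and processor counts are illustrative values) that evaluates Amdahl's Law for several processor counts and prints the limiting speed-up:

```c
#include <stdio.h>

/* Amdahl's Law: speed-up with n processors when a fraction p of the work
 * is parallelizable and the remaining (1 - p) must run sequentially. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double p = 0.90;                    /* assume 90% of the program parallelizes */
    const int counts[] = {1, 2, 4, 8, 16, 1024};

    for (size_t i = 0; i < sizeof counts / sizeof counts[0]; i++)
        printf("n = %4d  speed-up = %5.2f\n", counts[i], amdahl_speedup(p, counts[i]));

    /* Upper bound as n grows without limit: 1 / (1 - p) = 10x when p = 0.90. */
    printf("limit     speed-up = %5.2f\n", 1.0 / (1.0 - p));
    return 0;
}
```

Even with 1024 processors the speed-up stays below 10x when 10% of the program is sequential, which is exactly the point of the child-bearing quote above.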

History of Parallel Computing – Examples
- 1954 – IBM 704
  - Gene Amdahl was a principal architect.
  - Used fully automatic floating-point arithmetic commands.
- 1962 – Burroughs Corporation D825
  - Four-processor computer.
- 1967 – Amdahl and Daniel Slotnick publish a debate about the feasibility of parallel computing.
  - "Amdahl's Law" coined.
- 1969 – Honeywell Multics system
  - Capable of running up to eight processors in parallel.
- 1970s – Cray supercomputers (vector SIMD architecture).
- 1984 – Synapse N+1
  - First bus-connected multiprocessor with snooping caches.

History of Parallel Computing – Overview of Evolution
- 1950s – Interest in parallel computing began.
- 1960s and 1970s – Advances surfaced in the form of supercomputers.
- Mid-1980s – Massively parallel processors (MPPs) came to dominate the top end of computing.
- Late 1980s – Clusters (a type of parallel computer built from large numbers of computers connected by a network) competed with and eventually displaced MPPs.
- Today – Parallel computing has become mainstream, based on multi-core processors in home computers. The scaling of Moore's Law predicts a transition from a few cores to many.

Multiprocessor Architectures
- Instruction-Level Parallelism (ILP)
  - Superscalar and VLIW
- SIMD Architectures (single instruction stream, multiple data streams)
  - Vector Processors
- MIMD Architectures (multiple instruction streams, multiple data streams)
  - Interconnection Networks
  - Shared Memory Multiprocessors
  - Distributed Computing
- Alternative Parallel Processing Approaches
  - Dataflow Computing
  - Neural Networks (SIMD)
  - Systolic Arrays (SIMD)
  - Quantum Computing

Superscalar
- A design methodology that allows multiple instructions to be executed simultaneously in each clock cycle.
- Analogous to adding another lane to a highway; the "additional lanes" are called execution units.
- Instruction Fetch Unit
  - Critical component.
  - Retrieves multiple instructions simultaneously from memory and passes them on to the...
- Decoding Unit
  - Determines whether the instructions have any type of dependency.
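
The dependency check performed by the decoding unit is easiest to see in source form. In the hypothetical fragment below (my illustration, not from the slides), the first two statements are independent and could be dispatched to separate execution units in the same clock cycle, while the last two have a read-after-write dependency and must be kept in order:

```c
/* Illustrative only: the hardware, not the programmer, discovers this
 * parallelism at run time by examining the instruction stream. */
int superscalar_demo(int x, int y, int p, int q)
{
    /* Independent: neither result feeds the other, so a two-way superscalar
     * core can issue both operations in the same cycle. */
    int a = x + y;
    int b = p - q;

    /* Dependent: d reads c (a read-after-write hazard), so the decoding
     * unit must serialize these two instructions across cycles. */
    int c = a * b;
    int d = c + 1;

    return d;
}
```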

VLIW
- Superscalar processors rely on both hardware and the compiler.
- VLIW processors rely entirely on the compiler:
  - The compiler packs independent instructions into one very long instruction word that tells the execution units what to do.
- The compiler cannot have an overall picture of the run-time code:
  - It is compelled to be conservative in its scheduling.
- The VLIW compiler also arbitrates all dependencies.

Vector Processors
- Often referred to as supercomputers (the Cray series is the most famous).
- Based on vector arithmetic:
  - A vector is a fixed-length, one-dimensional array of values, or an ordered series of scalar quantities.
  - Operations include addition, subtraction, and multiplication.
- Each instruction specifies a set of operations to be carried out over an entire vector.
- Vector registers – specialized registers that can hold several vector elements at one time.
- Vector instructions are efficient for two reasons:
  - The machine fetches fewer instructions.
  - The processor knows it will have a continuous source of data, so it can pre-fetch pairs of values.
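
As a concrete illustration (my example, not from the slides), the scalar loop below computes the classic "y = a*x + y" operation one element at a time. On a vector processor the same work is expressed as a few vector instructions (load x, load y, multiply-add, store y), each acting on a whole vector register of elements, which is why far fewer instruction fetches are needed:

```c
#include <stddef.h>

/* Scalar version: one multiply and one add instruction per element.
 * A vector machine processes an entire vector register's worth of
 * elements per instruction instead. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```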

MIMD Architectures
- Communication is essential for synchronized processing and data sharing.
- The manner of passing messages determines the overall design.
- Two approaches:
  - Shared Memory – one large memory accessed identically by all processors.
  - Interconnection Network – each processor has its own memory, but processors are allowed to access each other's memories via the network.

Interconnection Networks
- Categorized according to topology, routing strategy, and switching technique.
- Networks can be either static or dynamic, and either blocking or non-blocking.
  - Dynamic – allows the path between two entities (two processors, or a processor and memory) to change between communications. Static is the opposite.
  - Blocking – does not allow new connections in the presence of other simultaneous connections.

Network Topologies
- The way in which the components are interconnected.
- A major determining factor in the overhead of message passing.
- Efficiency is limited by:
  - Bandwidth – information-carrying capacity of the network
  - Message latency – time required for the first bit of a message to reach its destination
  - Transport latency – time a message spends in the network
  - Overhead – message-processing activities in the sender and receiver
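
One common way to reason about these factors (a standard cost model, not stated on the slide) is to treat a message transfer as a fixed startup latency plus the message size divided by the bandwidth. The sketch below uses made-up but plausible constants to show that short messages are dominated by latency while long messages are dominated by bandwidth:

```c
#include <stdio.h>

/* Linear cost model: time = startup latency + bytes / bandwidth.
 * The constants are illustrative, not measurements of any real network. */
static double transfer_time_us(double bytes)
{
    const double latency_us = 2.0;                 /* fixed per-message startup cost    */
    const double bandwidth_bytes_per_us = 10000.0; /* 10 GB/s expressed per microsecond */
    return latency_us + bytes / bandwidth_bytes_per_us;
}

int main(void)
{
    const double sizes[] = {64.0, 4096.0, 1048576.0};
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("%8.0f bytes -> %8.2f microseconds\n", sizes[i], transfer_time_us(sizes[i]));
    return 0;
}
```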

Static Topologies
- Completely Connected – all components are connected to all other components.
  - Expensive to build and difficult to manage.
- Star – has a central hub through which all messages must pass.
  - Excellent connectivity, but the hub can be a bottleneck.
- Linear Array or Ring – each entity can communicate directly with its two neighbors.
  - Other communications have to go through multiple entities.
- Mesh – links each entity to four or six neighbors.
- Tree – arranges entities in tree structures.
  - Potential for bottlenecks at the root.
- Hypercube – multidimensional extension of mesh networks in which each dimension has two processors.

Static Topologies
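
To give one of these topologies a concrete feel, the sketch below (my illustration, not from the slides) uses the standard hypercube addressing property: in a d-dimensional hypercube with 2^d nodes, two nodes are directly linked exactly when their binary labels differ in one bit, so the minimum number of hops between any two nodes is the number of differing bits (the Hamming distance):

```c
#include <stdio.h>

/* Minimum hops between two hypercube nodes = the Hamming distance between
 * their labels; each differing bit is one link crossed in that dimension. */
static int hypercube_hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;   /* bits set where the labels disagree */
    int hops = 0;
    while (diff) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}

int main(void)
{
    /* A 3-dimensional hypercube has 8 nodes labeled 0..7. */
    printf("hops from node 0 to node 7: %d\n", hypercube_hops(0, 7)); /* 3: opposite corners         */
    printf("hops from node 2 to node 6: %d\n", hypercube_hops(2, 6)); /* 1: labels differ in one bit */
    return 0;
}
```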

Dynamic Topology
- Dynamic networks use either a bus or a switch to alter routes through a network.
- Bus-based networks are the simplest and most efficient when the number of entities is moderate.
  - A bottleneck can result as the number of entities grows large.
  - Parallel buses can alleviate bottlenecks, but at considerable cost.

Switches
- Crossbar Switches
  - Are either open or closed.
  - A crossbar network is a non-blocking network.
  - If there is only one switch at each crosspoint, n entities require n^2 switches. In reality, many switches may be required at each crosspoint.
  - Practical only in high-speed multiprocessor vector computers.

Switches
- 2x2 Switches
  - Two inputs and two outputs; capable of routing its inputs to different destinations.
  - Four states:
    - Through (inputs feed directly through to the outputs)
    - Cross (upper input directed to lower output, and vice versa)
    - Upper broadcast (upper input broadcast to both outputs)
    - Lower broadcast (lower input directed to both outputs)
  - The Through and Cross states are the ones relevant to interconnection networks.

2x2 Switches
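
A quick way to see why 2x2 switches matter is to count hardware. A crossbar needs on the order of n^2 crosspoint switches for n entities, while a multistage network built from 2x2 switches (an Omega network, for example) needs log2(n) stages of n/2 switches each. The comparison below is a small sketch of that arithmetic (my example, not from the slides):

```c
#include <stdio.h>

int main(void)
{
    /* Switch counts for n processors:
     *   crossbar           : n * n crosspoint switches
     *   multistage (Omega) : log2(n) stages, each with n/2 two-by-two switches */
    for (unsigned n = 8; n <= 512; n *= 4) {
        unsigned stages = 0;
        for (unsigned m = n; m > 1; m >>= 1)
            stages++;                         /* stages = log2(n), n a power of two */

        unsigned long crossbar   = (unsigned long)n * n;
        unsigned long multistage = (unsigned long)stages * (n / 2);
        printf("n = %4u  crossbar = %8lu  multistage (2x2) = %6lu\n",
               n, crossbar, multistage);
    }
    return 0;
}
```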

Shared Memory Multiprocessors
- Tightly coupled systems that use the same memory.
- Three common organizations:
  - Global Shared Memory – a single memory shared by multiple processors.
  - Distributed Shared Memory – each processor has local memory, but it is shared with the other processors.
  - Global Shared Memory with a separate cache at each processor.

UMA Shared Memory
- Uniform Memory Access
  - All memory accesses take the same amount of time.
  - One pool of shared memory, and all processors have equal access.
- Scalability of UMA machines is limited. As the number of processors increases:
  - Switched networks quickly become very expensive.
  - Bus-based systems saturate when the bandwidth becomes insufficient.
  - Multistage networks run into wiring constraints and significant latency.
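
From the programmer's point of view, a shared memory machine is usually programmed with threads that all read and write one address space. The OpenMP sketch below (my example, not from the slides) sums an array that lives in the single shared pool of memory; the reduction clause coordinates the threads' partial results:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double data[N];              /* one array in the shared address space */
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    double sum = 0.0;
    /* All threads read the same shared array; the reduction gives each thread
     * a private partial sum and combines them when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```

Built with an OpenMP-aware compiler (for example, gcc -fopenmp).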

NUMA Shared Memory
- Nonuniform Memory Access
  - Provides each processor with its own piece of memory.
  - Processors see this memory as a single contiguous addressable entity.
  - Nearby memory takes less time to read than memory that is farther away, so memory access time is inconsistent.
- Prone to cache coherence problems:
  - Each processor maintains a private cache.
  - Modified data needs to be updated in all caches.
  - This is handled by special hardware units known as snoopy cache controllers.
  - Write-through with update – updates stale values in other caches.
  - Write-through with invalidation – removes stale values from other caches.

Distributed Computing
- Means different things to different people.
  - In a sense, all multiprocessor systems are distributed systems.
- Usually refers to a very loosely coupled multicomputer system.
  - Such systems depend on a network for communication among processors.
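
On a multicomputer there is no common memory to read, so processes cooperate purely by exchanging messages over the network. Below is a minimal MPI sketch (my example; the slides show no code) in which every process contributes a value from its own private memory and the results are combined by message passing:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identifier */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Each process computes a value in its own local memory... */
    double local = (double)rank;

    /* ...and the values are combined by messages sent over the network. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks across %d processes = %.0f\n", size, total);

    MPI_Finalize();
    return 0;
}
```

Launched with a command such as mpirun -np 4 ./program.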

Grid Computing
- An example of distributed computing.
- Uses the resources of many computers connected by a network (e.g., the Internet) to solve computational problems that are too large for any single supercomputer.
- Global Computing
  - A specialized form of grid computing that uses the computing power of volunteers, whose computers work on a problem while the system is otherwise idle.
  - Example: a volunteer screen-saver project whose six-year run accumulated two million years of CPU time and 50 TB of data.

Questions?