Supercomputers Parallel Processing. By: Lecturer Aisha Dawood


Communication Model of Parallel Platforms There are two primary forms of data exchange between parallel tasks: accessing a shared data space and exchanging messages. Platforms that provide a shared data space are called shared-address-space machines or multiprocessors. Platforms that support messaging are called message-passing platforms or multicomputers.

Shared-Address-Space Platforms Part (or all) of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared address space. If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise it is a non-uniform memory access (NUMA) machine (caches are not considered in this classification).

NUMA and UMA Shared-Address-Space Platforms Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.

NUMA and UMA Shared-Address-Space Platforms The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. On NUMA machines, where accessing local memory is cheaper than accessing global memory, algorithms and data structures must be built to exploit locality. Programming these platforms is easier, since reads and writes are implicitly visible to other processors (global memory space). However, concurrent read/write access to shared data must be coordinated. Caches in such machines require coordinated access to multiple copies; this leads to the cache coherence problem. Cache coherence problem: the presence of multiple copies of a single memory word being changed by multiple processors. A weaker model of these machines provides an address map but not coordinated access; such models are called non-cache-coherent shared-address-space machines.

Shared-Address-Space vs. Shared Memory Machines It is important to note the difference between the terms shared address space and shared memory. We refer to the former as a programming abstraction and to the latter as a physical machine attribute. It is possible to provide a shared address space using a physically distributed memory.

Message-Passing Platforms The logical machine view of a message-passing platform consists of p processing nodes, each with its own exclusive address space (exclusive memory). Interactions between processes running on different nodes must be accomplished using messages, hence the name message passing. These platforms are programmed using (variants of) send and receive primitives. Libraries such as MPI and PVM provide such primitives.
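As a minimal sketch of the message-passing style (not MPI or PVM themselves), the fragment below uses Python threads and queues to stand in for nodes and network links: each "node" keeps private state and interacts only through explicit send and receive operations. All names here are illustrative assumptions.

```python
# Sketch: message passing between two "nodes" (threads) via explicit
# send/receive over queues; no shared variables are touched.
import threading, queue

def node(inbox: queue.Queue, outbox: queue.Queue):
    msg = inbox.get()        # receive: blocks until a message arrives
    outbox.put(msg * 2)      # send: the only way to return a result

def exchange(value):
    to_node, from_node = queue.Queue(), queue.Queue()
    t = threading.Thread(target=node, args=(to_node, from_node))
    t.start()
    to_node.put(value)       # explicit send to the worker node
    result = from_node.get() # explicit receive of its reply
    t.join()
    return result

print(exchange(21))          # prints 42
```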

Message Passing vs. Shared Address Space Platforms Message passing requires little hardware support beyond a network, but more programming support. Shared-address-space platforms can easily emulate message passing; the reverse is more difficult to do (in an efficient manner).

Physical Organization of Parallel Platforms We begin this discussion with an ideal parallel machine called the Parallel Random Access Machine, or PRAM.

Architecture of an Ideal Parallel Computer A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM. PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors. Processors share a common clock but may execute different instructions in each cycle.

Architecture of an Ideal Parallel Computer Depending on how simultaneous memory accesses are handled, PRAMs can be divided into four subclasses. ▫Exclusive-read, exclusive-write (EREW) PRAM. Access to a memory location is exclusive; no concurrent read or write operations are allowed (the weakest model, with minimum concurrency). ▫Concurrent-read, exclusive-write (CREW) PRAM. Multiple read accesses to a memory location are allowed; multiple write accesses are serialized. ▫Exclusive-read, concurrent-write (ERCW) PRAM. Multiple write accesses to a memory location are allowed, but read accesses are serialized. ▫Concurrent-read, concurrent-write (CRCW) PRAM. Allows multiple read and write accesses to a common memory location; this is the most powerful PRAM model.
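The four subclasses differ only in which simultaneous accesses they permit, which can be checked mechanically. A small illustrative sketch follows; the representation of one PRAM step as (processor, address, op) triples is an assumption, not from the source, and mixed read/write conflicts on one cell are ignored for simplicity.

```python
# Sketch: decide whether one PRAM step's accesses are legal under a
# subclass, given whether concurrent reads/writes are permitted.
from collections import Counter

def legal(accesses, concurrent_read, concurrent_write):
    reads  = Counter(addr for _, addr, op in accesses if op == "R")
    writes = Counter(addr for _, addr, op in accesses if op == "W")
    if not concurrent_read and any(c > 1 for c in reads.values()):
        return False     # exclusive-read model, but a cell is read twice
    if not concurrent_write and any(c > 1 for c in writes.values()):
        return False     # exclusive-write model, but a cell is written twice
    return True

step = [(0, 5, "R"), (1, 5, "R"), (2, 7, "W")]   # two reads of cell 5
print(legal(step, False, False))  # EREW forbids the concurrent reads: False
print(legal(step, True,  False))  # CREW permits them: True
```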

Architecture of an Ideal Parallel Computer What does concurrent write mean, anyway? Concurrent reads pose no problems, but concurrent write accesses must be resolved. The most commonly used protocols to resolve concurrent writes are: ▫Common: write only if all values are identical. ▫Arbitrary: write the value from one arbitrarily selected processor; the others fail. ▫Priority: follow a predetermined priority list; the processor with the highest priority succeeds and the rest fail. ▫Sum: write the sum of all data items; this can be extended to any associative operator defined on the quantities being written.
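These protocols can be sketched as small resolution functions. The representation of simultaneous writes to one cell as a dict {processor_id: value} is illustrative, not from the source, and the tie-breaking choices (first entry for arbitrary, lowest id for priority) are assumptions.

```python
# Sketch of the four concurrent-write resolution protocols: each function
# takes the simultaneous writes to one memory cell and returns the value
# actually stored in that cell.
def common(writes):
    vals = set(writes.values())
    if len(vals) != 1:                       # succeed only if all agree
        raise ValueError("common: conflicting values")
    return vals.pop()

def arbitrary(writes):
    return next(iter(writes.values()))       # any single writer may win

def priority(writes):
    return writes[min(writes)]               # lowest processor id wins here

def reduce_sum(writes):
    return sum(writes.values())              # generalizes to any operator

w = {0: 3, 1: 5, 2: 7}
print(priority(w), reduce_sum(w))            # prints: 3 15
print(common({0: 4, 1: 4}))                  # prints: 4
```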

Interconnection Networks for Parallel Computers Interconnection networks carry data between processors and to memory. Interconnects are made of switches and links (wires, fiber). Interconnection networks are classified as static or dynamic. Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks. Dynamic networks connect nodes using switches and communication links and are also referred to as indirect networks.

Static and Dynamic Interconnection Networks Classification of interconnection networks: (a) a static network; and (b) a dynamic network.

Interconnection Networks Switch functionality: ▫Switches map a fixed number of inputs to outputs. ▫Switches may provide internal buffering, for when the requested output port is busy. ▫Switches may support routing, to prevent collisions in the network. ▫Switches may support multicast (the same output on multiple ports). The total number of ports on a switch is the degree of the switch. The cost of a switch is determined by the cost of its mapping hardware and its packaging.

Interconnection Networks: Network Interfaces Processors connect to the network via a network interface, which has input and output ports that pipe data into and out of the network. The network interface may be placed on the I/O bus or the memory bus; the latter supports higher bandwidth, since I/O buses are slower. The relative speeds of the I/O and memory buses impact the performance of the network.

Network Topologies A variety of network topologies have been proposed and implemented. These topologies trade off performance against cost. Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.

Network Topologies: Bus-Based Networks Some of the simplest and earliest parallel machines used buses. All processors access a common bus for exchanging data. The distance between any two nodes is 1 in a bus. The bus also provides a convenient broadcast medium. However, the bandwidth of the shared bus is a major bottleneck as the number of nodes increases. The demand on bus bandwidth can be reduced by ensuring that the majority of data accesses are local to the node, with the bus used only for remote data.

Network Topologies: Bus-Based Networks Bus-based interconnects: (a) with no local caches; (b) with local memory/caches. Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.

Network Topologies: Crossbars A crossbar network uses a p×b grid of switches to connect p inputs to b outputs in a non-blocking manner: a completely non-blocking crossbar network can connect p processors to b memory banks.

Network Topologies: Crossbars The cost of a crossbar connecting p processors to b memory banks grows as p×b, so the network is not cost-scalable as the number of processing nodes increases; limiting the number of switches (for example, using fewer memory banks than processors) introduces memory access blocking.

Network Topologies: Multistage Networks Crossbars have excellent performance scalability but poor cost scalability. Buses have excellent cost scalability, but poor performance scalability. Multistage interconnects strike a compromise between these extremes.
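The compromise can be made concrete by comparing switch counts. The formulas below assume a square p×p crossbar (one crosspoint per input/output pair) and an Omega network built from 2×2 switches, with p/2 switches per stage and log2 p stages; the function names are illustrative.

```python
# Sketch: switch-cost growth of a square crossbar vs. an Omega network.
from math import log2

def crossbar_cost(p):
    return p * p                      # one crosspoint per (input, output) pair

def omega_cost(p):
    return (p // 2) * int(log2(p))    # (p/2) 2x2 switches in each of log2(p) stages

for p in (8, 64, 1024):
    print(p, crossbar_cost(p), omega_cost(p))  # omega grows as p log p, not p**2
```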

Network Topologies: Multistage Networks The schematic of a typical multistage interconnection network.

Network Topologies: Multistage Omega Network One of the most commonly used multistage interconnects is the Omega network. This network consists of log p stages, where p is the number of inputs/outputs. At each stage, input i is connected to output j where:
j = 2i, for 0 ≤ i ≤ p/2 − 1
j = 2i + 1 − p, for p/2 ≤ i ≤ p − 1

Network Topologies: Multistage Omega Network Each stage of the Omega network implements a perfect shuffle (equivalently, j is obtained by a left cyclic rotation of the binary representation of i): a perfect shuffle interconnection for eight inputs and outputs.
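The shuffle mapping can be sketched directly from its definition. The helper below is a hypothetical name, and p is assumed to be a power of two.

```python
# Sketch of the perfect-shuffle wiring: input i of a stage is connected
# to output j = 2i (first half) or j = 2i + 1 - p (second half).
def shuffle(i, p):
    if i < p // 2:
        return 2 * i              # j = 2i          for 0 <= i < p/2
    return 2 * i + 1 - p          # j = 2i + 1 - p  for p/2 <= i < p

print([shuffle(i, 8) for i in range(8)])   # prints [0, 2, 4, 6, 1, 3, 5, 7]
```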

Network Topologies: Multistage Omega Network The perfect shuffle patterns are connected using 2×2 switches. The switches operate in two modes: pass-through or cross-over. Two switching configurations of the 2×2 switch: (a) Pass-through; (b) Cross-over.

Network Topologies: Multistage Omega Network A complete Omega network, built from the perfect-shuffle interconnects and 2×2 switches, can now be illustrated: a complete Omega network connecting eight inputs and eight outputs.

Network Topologies: Multistage Omega Network – Routing An example of blocking in the Omega network: one of the messages (010 to 111, or 110 to 100) is blocked at link AB. Blocking is an important property of the Omega network: a memory access by one processor may block an access to a different memory location by another processor. A network with this property is called a blocking network.
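The collision between the two routes named above can be sketched by simulating destination-tag routing through the stages: at stage s, the 2×2 switch sends the message to its upper or lower output according to bit s of the destination (most significant bit first). Representing a link by its stage and line number is an illustrative assumption for this sketch.

```python
# Sketch: destination-tag routing in an 8-input Omega network, and a
# blocking check for two simultaneous messages.
def shuffle(i, p):
    return 2 * i if i < p // 2 else 2 * i + 1 - p   # perfect-shuffle wiring

def route(src, dst, p):
    """Line numbers a message occupies after each of the log2(p) stages."""
    n = p.bit_length() - 1
    path, cur = [src], src
    for stage in range(n):
        cur = shuffle(cur, p)                 # shuffle into the stage
        bit = (dst >> (n - 1 - stage)) & 1    # destination bit, MSB first
        cur = (cur & ~1) | bit                # switch: upper (0) / lower (1)
        path.append(cur)
    return path

def blocks(a, b, p):
    """Two routes block each other if they share a link at the same stage."""
    pa, pb = route(*a, p), route(*b, p)
    return any(x == y for x, y in zip(pa[1:], pb[1:]))

print(route(0b010, 0b111, 8))                      # prints [2, 5, 3, 7]
print(blocks((0b010, 0b111), (0b110, 0b100), 8))   # prints True: collision
```

Both messages need the same stage-0 output line (line 5), which corresponds to the blocked link AB in the figure.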