Parallel Computer Architecture and Interconnect 1b.1

Types of Parallel Computer Architecture 1b.2 Two principal types: Shared memory multiprocessor: from a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus-based) access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical address space. Distributed memory multicomputer: in hardware, refers to network-based access to memory that is not physically shared. As a programming model, tasks can only logically "see" local machine memory and must use communication to access memory on other machines. Ref: slides from B. Wilkinson, UNC-Charlotte, and Kumar, Introduction to Parallel Computing.

Shared Memory Multiprocessor 1b.3

Conventional Computer 1b.4 Virtually all computers have followed a common machine model known as the von Neumann computer, named after the Hungarian mathematician John von Neumann. A von Neumann computer uses the stored-program concept: the CPU executes a stored program that specifies a sequence of read and write operations on the memory. Each main memory location is identified by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in an address.
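For example, with b = 32 address bits there are 2^32 = 4,294,967,296 addressable locations (4 GiB, if each location holds one byte).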

Shared Memory Multiprocessor System 1b.5 A natural way to extend the single-processor model is to have multiple processors connected to multiple memory modules, such that each processor can access any memory module. Multiple processors can operate independently but share the same memory resources. Changes in a memory location made by one processor are visible to all other processors. Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.

UMA and NUMA 1b.6
Uniform Memory Access (UMA):
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Equal access and access times to memory
- Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update.
Non-Uniform Memory Access (NUMA):
- Often made by physically linking two or more SMPs
- One SMP can directly access the memory of another SMP
- Not all processors have equal access time to all memories
- Memory access across the link is slower
- If cache coherency is maintained, may also be called CC-NUMA (Cache Coherent NUMA)

Shared Memory Computers 1b.7
Advantages:
- The global address space provides a user-friendly programming interface to memory
- Data sharing between tasks is both fast and uniform
Disadvantages:
- The primary disadvantage is the lack of scalability between memory and CPUs: adding more CPUs increases traffic on the shared memory-CPU path
- The programmer is responsible for synchronization constructs that ensure "correct" access to global memory and consistent results
- Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever-increasing numbers of processors

Distributed Memory Computer 1b.8 Because each processor has its own local memory, it operates independently: changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply. When a processor needs data that resides on another processor, it is usually the task of the programmer to explicitly define how and when the data is communicated. Synchronization between tasks is likewise the programmer's responsibility.

Distributed Memory Computer 1b.9
Advantages:
- Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionately
- Each processor can rapidly access its own memory without interference and without the overhead incurred in maintaining cache coherency
- Cost effectiveness: can use commodity, off-the-shelf processors and networking such as Ethernet
Disadvantages:
- The programmer is responsible for many of the details associated with data communication between processors
- Non-uniform memory access (NUMA) times

Hybrid Computer 1b.10 The largest and fastest computers in the world today employ both shared and distributed memory architectures. The shared memory component is usually a cache-coherent SMP machine: processors on a given SMP can address that machine's memory as global. The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory, not the memory on another SMP, so network communication is required to move data from one SMP to another.

Example: Quad Shared Memory Multiprocessor 1b.11 Real computer systems have cache memory between the main memory and the processors: Level 1 (L1) cache and Level 2 (L2) cache.
[Figure: four processors, each with its own L1 cache, L2 cache, and bus interface, connected by a processor/memory bus to a memory controller for the shared memory and to an I/O interface on the I/O bus.]

Programming Shared Memory Computers 1b.12 Several possible ways. 1. Use threads: the programmer decomposes the program into individual parallel sequences (threads), each able to access the shared and global variables declared. Each thread has local data, but also shares the entire resources of the program (e.g., a.out); this saves the overhead associated with replicating a program's resources for each thread. Any thread can execute any subroutine at the same time as other threads. Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that no two threads update the same global address at the same time. Example: Pthreads.
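A minimal Pthreads sketch in C (the thread count and loop bound are illustrative): four threads increment a shared global counter, with a mutex providing the synchronization construct described above.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    long counter = 0;   /* shared global variable */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *work(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);   /* only one thread at a time */
            counter++;                   /* may update the shared     */
            pthread_mutex_unlock(&lock); /* location                  */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* 400000 */
        return 0;
    }

Without the mutex, the concurrent increments would race and the final value would be unpredictable. Compile with the -pthread flag.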

1b.13 2. Use library functions and preprocessor compiler directives with a sequential programming language to declare shared variables and specify parallelism.
- Portable/multi-platform, including Unix and Windows NT platforms
- Available in C/C++ and Fortran implementations
- Can be very easy and simple to use
Example: OpenMP, the industry standard. Consists of library functions, compiler directives, and environment variables; needs an OpenMP-aware compiler.
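A minimal OpenMP sketch, assuming an OpenMP-capable C compiler (e.g., gcc -fopenmp); the loop body is illustrative. One directive parallelizes the loop, and the reduction clause gives each thread a private partial sum and combines them safely.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        double sum = 0.0;
        /* compiler directive: split loop iterations across threads */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1);
        printf("sum = %f using up to %d threads\n",
               sum, omp_get_max_threads());
        return 0;
    }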

Programming Distributed Memory Computers 1b.14 Message-passing model: tasks exchange data through communication, by sending and receiving messages. Data transfer usually requires cooperative operations to be performed by each process; for example, a send operation must have a matching receive operation. In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message-passing implementations.
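A minimal MPI sketch of the matching send/receive pairing described above, intended to run with two processes (e.g., mpirun -np 2); the payload value and message tag are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            /* the send on rank 0 ...                     */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ... must be matched by a receive on rank 1 */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }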

Interconnection Networks 1b.15 Provide mechanisms for data transfer between processors, or between processors and memory. A typical network is built from links (physical media such as wires and fibers) and switches (which provide a mapping from inputs to outputs). Static network: point-to-point links. Dynamic network: switches and links; communication paths are established dynamically among processors and memory.

Interconnection Networks 1b.16 Static: 2- and 3-dimensional meshes; hypercube (not now common). Using switches: crossbar; trees; multistage interconnection networks.

1b.17 Bus-Based Networks Ideal for broadcasting: the distance between any two nodes is constant. However, the bounded bandwidth of a bus places limitations on performance as the number of nodes increases. Caches are used to improve access time. Scalable in cost but not in performance.

Crossbar Networks 1b.18 p x b switches are employed to connect p processors to b memory banks; with b >= p the network is non-blocking. The lower bound on the total number of switches is therefore Omega(p^2). Not scalable in terms of cost; scalable in terms of performance.
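For example, a crossbar connecting p = 64 processors to b = 64 memory banks requires 64 x 64 = 4096 switching points, and doubling both p and b quadruples the count, which is why crossbar cost grows quadratically and does not scale.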

1b.19 Multistage Networks An intermediate class of networks lies between these two extremes. The omega network consists of log2 p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory modules).

1b.20 For input i and output j, a link exists if:
j = 2i,         for 0 <= i <= p/2 - 1, or
j = 2i + 1 - p, for p/2 <= i <= p - 1.
This is a left rotation by one bit of the input's binary representation (the perfect shuffle).
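The connection rule above can be computed directly. A small C sketch (assuming p is a power of two; names are illustrative):

    #include <stdio.h>

    /* Omega-network perfect-shuffle connection: returns the output j
       wired to input i, for p inputs (p a power of two). Equivalent to
       rotating i's log2(p)-bit binary representation left by one bit. */
    int shuffle(int i, int p) {
        if (i < p / 2)
            return 2 * i;         /* j = 2i,         0 <= i <= p/2 - 1 */
        else
            return 2 * i + 1 - p; /* j = 2i + 1 - p, p/2 <= i <= p - 1 */
    }

    int main(void) {
        int p = 8;
        for (int i = 0; i < p; i++)
            printf("input %d -> output %d\n", i, shuffle(i, p));
        return 0;
    }

For p = 8, input 3 (binary 011) maps to output 6 (binary 110): a one-bit left rotation.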

1b.21 The p inputs at each stage are fed into a set of p/2 switches. Each switch is in one of two connection modes: 1) Pass-through: inputs are sent straight through to the outputs. 2) Cross-over: inputs are crossed over and then sent out.

1b.22 Total number of switches? With log2 p stages of p/2 switches each, the omega network uses (p/2) log2 p switches in total; for example, p = 8 gives 3 stages of 4 switches = 12 switches, compared with p^2 for a crossbar.

1b.23 An internal link (such as the link A-B in the figure) may be needed by another node-to-memory pair at the same time; such a communication will be blocked. The omega network is therefore a blocking network.

1b.24 A completely-connected network is good in the sense that any two nodes can exchange a message in a single step; it is similar to a crossbar network in its non-blocking property. A star-connected network is similar to a bus-based network: communication between any pair of nodes is routed through the central processor, so the central node is the bottleneck, just like the bus.


1b.26 Hypercube: the total number of nodes is 2^d. In general, a d-dimensional hypercube is constructed by connecting corresponding nodes of two (d-1)-dimensional hypercubes.
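Under the standard binary labeling of hypercube nodes (implied by the recursive construction), two nodes are connected exactly when their d-bit labels differ in a single bit, so a node's neighbors can be computed by flipping one bit at a time. A small C sketch, with illustrative names:

    #include <stdio.h>

    /* Neighbor of node n across dimension k in a d-dimensional
       hypercube: flip bit k of n's binary label. */
    int neighbor(int n, int k) {
        return n ^ (1 << k);
    }

    int main(void) {
        int d = 3, node = 5;          /* 2^3 = 8 nodes; node 5 = 101 */
        for (int k = 0; k < d; k++)
            printf("neighbor of %d across dimension %d: %d\n",
                   node, k, neighbor(node, k));
        return 0;
    }

Node 5 (101) has neighbors 4 (100), 7 (111), and 1 (001), one per dimension.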

1b.27 Tree-based networks: a. A static tree network has a processing node at each tree node. b. A dynamic tree has switching nodes at the intermediate levels and processing nodes at the leaf level. To route a message, the source node sends the message up the tree until it reaches the node that is the root of the smallest subtree containing both sender and receiver, then down to the receiver; see the sketch below.
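Assuming a complete binary tree with the p processing nodes as leaves labeled 0 to p-1 in left-to-right order (an assumption beyond the slide), the root of the smallest subtree containing both sender and receiver lies as many levels up as the position of the highest bit in which the two labels differ. A small C sketch:

    #include <stdio.h>

    /* Number of levels a message must climb from leaf src so that the
       subtree rooted there also contains leaf dst: the position of the
       highest differing bit of the two labels, counted from 1. */
    int levels_up(int src, int dst) {
        int x = src ^ dst, h = 0;
        while (x) { x >>= 1; h++; }
        return h;
    }

    int main(void) {
        /* leaves 2 (010) and 3 (011) share a parent: climb 1 level;
           leaves 2 (010) and 7 (111) only share the root: climb 3 */
        printf("%d %d\n", levels_up(2, 3), levels_up(2, 7));
        return 0;
    }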

Cache Coherence 1b.28 In shared-address-space computers, additional hardware is required to keep multiple copies of data consistent with each other. In particular, when multiple processors cache the same location, how do we ensure they all use the same updated value? If a processor changes the value of its copy, one of two things must happen: the other copies must be invalidated, or the other copies must be updated.

1b.29 [Figure: state diagram of the coherence protocol, described on the next slide.]

1b.30 In the state diagram, solid lines represent processor actions and dashed lines represent coherence actions. A read of invalid data causes a transition to the shared state by fetching the remote value. A write to shared data causes a transition to dirty, and a coherence write (c_write) marks the other copies invalid.
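A sketch of these transitions in C, tracking the state of a single cached copy; the function and state names are illustrative, and the coherence actions on remote copies are only noted in comments:

    #include <stdio.h>

    /* The three states of a cached copy in the protocol above. */
    typedef enum { INVALID, SHARED, DIRTY } State;

    const char *name(State s) {
        return s == INVALID ? "invalid" : s == SHARED ? "shared" : "dirty";
    }

    /* Processor read: an invalid copy fetches the remote value and
       becomes shared; shared and dirty copies hit locally. */
    State on_read(State s) {
        return (s == INVALID) ? SHARED : s;
    }

    /* Processor write: this copy becomes dirty; the matching coherence
       action (c_write) must invalidate the other copies, which this
       single-copy sketch does not model. */
    State on_write(State s) {
        (void)s;
        return DIRTY;
    }

    int main(void) {
        State s = INVALID;
        s = on_read(s);  printf("after read:  %s\n", name(s)); /* shared */
        s = on_write(s); printf("after write: %s\n", name(s)); /* dirty  */
        return 0;
    }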
