Introduction to Parallel Processing with Multi-core Part III – Architecture Jie Liu, Ph.D. Professor Department of Computer Science Western Oregon University USA liuj@wou.edu

Part I outline Three models of parallel computation Processor organizations Processor arrays Multiprocessors Multi-computers Flynn’s taxonomy Affordable parallel computers Algorithms with processor organizations

Processor Organization In a parallel computer, processors need to "cooperate." To do so, a processor must be able to "reach" other processors. The method of connecting the processors in a parallel computer is called the processor organization. In a processor organization chart, vertices represent processors and edges represent communication paths.

Processor Organization Criteria Diameter: the largest distance between two nodes. The lower, the better, because it affects communication costs. Bisection width: the minimum number of edges that must be removed to divide the network into two halves (within one node). The higher, the better, because it affects the number of concurrent communication channels. Number of edges per node: we consider this best if it is constant, i.e., independent of the number of processors, because it affects scalability. Maximum edge length: again, we consider this best if it is constant, i.e., independent of the number of processors, because it affects scalability.

Mesh Networks A mesh always has a dimension q, which could be 1, 2, 3, or even higher. Each interior node can communicate with 2q other processors.

Mesh Networks (2) For a q-dimensional mesh with n = k^q nodes (as shown) Diameter: q(k-1) (too large to have an NC algorithm) Bisection width: k^(q-1) (reasonable) Maximum number of edges per node: 2q (constant -- good) Maximum edge length: constant (good) Many parallel computers used this architecture because it is simple and scalable; Intel's Paragon XP/S used this architecture.
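As a quick sanity check of these measures, the sketch below (plain C; the helper names are my own) computes them for one concrete mesh, assuming k is even so the bisection-width formula k^(q-1) applies.

    #include <stdio.h>

    /* Metrics for a q-dimensional mesh with k nodes per dimension (n = k^q nodes).
       Formulas follow the slide above; bisection width k^(q-1) assumes k is even. */
    static long ipow(long base, int exp) {
        long r = 1;
        while (exp-- > 0) r *= base;
        return r;
    }

    int main(void) {
        int q = 2, k = 8;                       /* example: an 8 x 8 mesh      */
        long n         = ipow(k, q);            /* number of nodes             */
        long diameter  = (long)q * (k - 1);     /* q(k-1)                      */
        long bisection = ipow(k, q - 1);        /* k^(q-1), k even             */
        int  degree    = 2 * q;                 /* edges per interior node     */
        printf("n=%ld diameter=%ld bisection=%ld edges/node=%d\n",
               n, diameter, bisection, degree);
        return 0;
    }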

Hypertree Networks Degree k = 4 and depth d = 2

Hypertree Networks For a hypertree of degree k and depth d (generally we only consider the case k = 4) Number of nodes: 2^d(2^(d+1) - 1) Diameter: 2d (good for designing NC-class algorithms) Bisection width: 2^(d+1) (reasonable) Maximum number of edges per node: 6 (effectively constant) Maximum edge length: changes depending on d Only one parallel computer, Thinking Machines' CM-5 (Connection Machine), used this architecture. The designed maximum number of processors was 64K. The processors were vector processors capable of performing 32 pairs of arithmetic operations per clock cycle.

Butterfly Network A butterfly network has (k+1)2^k nodes. The one on the right has k = 3. In practice, rank 0 and rank k are combined, so each node has four connections. Each rank contains n = 2^k nodes. If n(i, j) is the jth node on the ith rank, then it connects to two nodes on rank i-1: n(i-1, j) and n(i-1, m), where m is the integer formed by inverting the ith most significant bit in the k-bit binary representation of j. For example, n(2,3) is connected to n(1,3) and n(1,1) because 3 is 011; inverting the second most significant bit makes it 001, which is 1.

Butterfly Network (2) For a butterfly network with (k+1)2^k nodes Diameter: 2k - 1 (good for designing NC-class algorithms) Bisection width: n/2, where n = 2^k (very good) Maximum number of edges per node: 4 (constant) Maximum edge length: changes depending on k The network is also called an omega network. A few computers used this connection network, including BBN's TC2000.

Routing on Butterfly Network To route a message from rank 0 to a node on rank k, each switch node picks off the lead bit of the destination address from the message: if the bit is zero, the message goes to the left; otherwise, it goes to the right. The chart shows routing from n(0, 2) to n(3, 5). Routing from rank k back to a node on rank 0 works the same way, one destination bit per rank.
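The bit-by-bit decision can be written out directly. Below is a small C sketch (the function name is mine; it compares destination bits rather than drawing left/right, which amounts to the same rule) that reproduces the n(0,2) to n(3,5) route.

    #include <stdio.h>

    /* Route a message through a k-rank butterfly from column src at rank 0 to
       column dst at rank k.  At rank i the switch looks at bit (k-1-i) of dst:
       if it matches the current column's bit we take the straight edge,
       otherwise the cross edge, which flips exactly that bit. */
    static void butterfly_route(int k, int src, int dst) {
        int col = src;
        printf("n(0,%d)", col);
        for (int i = 0; i < k; i++) {
            int bit = k - 1 - i;                 /* bit examined at this rank  */
            if (((col >> bit) & 1) != ((dst >> bit) & 1))
                col ^= 1 << bit;                 /* cross edge: flip that bit  */
            printf(" -> n(%d,%d)", i + 1, col);  /* straight edge keeps column */
        }
        printf("\n");
    }

    int main(void) {
        butterfly_route(3, 2, 5);   /* reproduces the slide's example route */
        return 0;
    }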

Hypercube A hypercube is a butterfly in which each column of switch nodes is collapsed into a single node. A binary n-cube has 2^n processors and an equal number of switch nodes. The chart on the right shows a hypercube of degree 4. Two switch nodes are connected if their binary labels differ in exactly one bit position.

Hypercube Measures For a hypercube with n = 2^k nodes Diameter: k (good for designing NC-class algorithms) Bisection width: n/2 (very good) Maximum number of edges per node: k Maximum edge length: depends on the number of nodes Routing: just find the bits in which the two addresses differ and correct them one bit at a time, either from left to right or from right to left. For example, to go from 0100 to 1011 we can go 0100 -> 1100 -> 1000 -> 1010 -> 1011, or 0100 -> 0101 -> 0111 -> 0011 -> 1011. A company named nCUBE Corporation made machines of this structure up to k = 13 (theoretically). The company was later bought by Oracle.
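The left-to-right routing described above is easy to express in code. Below is a small C sketch (helper names are mine) that reproduces the 0100 to 1011 example.

    #include <stdio.h>

    static void print_bits(unsigned x, int k) {
        for (int b = k - 1; b >= 0; b--) putchar((x >> b) & 1 ? '1' : '0');
    }

    /* Hypercube routing by dimension correction: flip the differing address bits
       one at a time, here from the most significant to the least significant.
       Reproduces the slide's route 0100 -> 1100 -> 1000 -> 1010 -> 1011. */
    static void hypercube_route(int k, unsigned src, unsigned dst) {
        unsigned cur = src;
        print_bits(cur, k);
        for (int b = k - 1; b >= 0; b--) {
            if (((cur ^ dst) >> b) & 1) {        /* addresses differ in bit b  */
                cur ^= 1u << b;                  /* move along that dimension  */
                printf(" -> ");
                print_bits(cur, k);
            }
        }
        printf("\n");
    }

    int main(void) {
        hypercube_route(4, 0x4 /* 0100 */, 0xB /* 1011 */);
        return 0;
    }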

Shuffle-Exchange Network It has n = 2^k nodes numbered 0, 1, ..., n-1 and two kinds of connections: shuffle and exchange. Exchange connections link two nodes whose numbers differ in their least significant bit. Shuffle connections link node i with node (2i) mod (n-1), with node n-1 linked to itself; equivalently, a shuffle link performs a cyclic left shift of the node's k-bit binary address. Below is a small shuffle-exchange network.
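The two kinds of links can be captured in two one-line functions. A small C sketch (the names are mine) lists both neighbors of every node in an 8-node network.

    #include <stdio.h>

    /* Neighbors in a shuffle-exchange network with n = 2^k nodes.
       exchange(i) flips the least significant bit; shuffle(i) is a cyclic left
       shift of i's k-bit address, i.e. (2i) mod (n-1) for i < n-1, with node
       n-1 mapping to itself. */
    static unsigned exchange(unsigned i)        { return i ^ 1u; }
    static unsigned shuffle(unsigned i, int k) {
        unsigned n = 1u << k;
        return ((i << 1) | (i >> (k - 1))) & (n - 1);   /* rotate left by 1 */
    }

    int main(void) {
        int k = 3;                                       /* n = 8 nodes */
        for (unsigned i = 0; i < (1u << k); i++)
            printf("node %u: exchange -> %u, shuffle -> %u\n",
                   i, exchange(i), shuffle(i, k));
        return 0;
    }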

Shuffle-Exchange Network For a shuffle-exchange network with n = 2^k nodes Diameter: 2k - 1 (good for designing NC-class algorithms) Bisection width: approximately n/k (very good) Maximum number of edges per node: 2 Maximum edge length: depends on the number of nodes Routing is not easy. It is hard to build a real shuffle-exchange network because the links cross each other. This architecture is studied mainly for its theoretical significance.

Summary

Processor Array

Processor Array (2) Parallel computers that employ processor-array technology can perform many arithmetic operations per clock cycle, achieved either with pipelined vector processors, such as the Cray-1, or with processor arrays, such as Thinking Machines' CM-200. This type of parallel computer did not really survive because: the special CPUs cost $$$$$; it is hard to utilize all the processors; it cannot handle if-then-else style statements well, because all the processors must carry out the same instruction; partitioning is very difficult; and it really needs to deal with very large amounts of data, which makes I/O the bottleneck.

Multiprocessors Parallel computers with multiple CPUs and a shared memory space. + can use commodity CPUs, so costs are reasonable + support multiple users + different CPUs can execute different instructions UMA (uniform memory access), also called symmetric multiprocessor (SMP): all the processors access any memory address at the same cost. The Sequent Symmetry could have up to 32 Intel 386 processors, with all the CPUs sharing the same bus. The problem with SMP/UMA is that the number of processors is limited. NUMA (nonuniform memory access): each processor has its own local memory, which other processors can also access, but local accesses are much cheaper. Processors are connected through some connection network, such as a butterfly. Kendall Square Research machines supported over 1000 processors. The connection network costs too much, around 40% of the overall cost.
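Both UMA and NUMA machines present a single shared address space, so they are usually programmed with a shared-memory model; the Quinn text cited above uses OpenMP for this. A minimal, hedged sketch (not from the slides) of a shared-memory summation:

    #include <stdio.h>
    #include <omp.h>

    /* Shared-memory summation: every thread reads the same array a[] directly;
       the reduction clause handles the concurrent updates to sum. */
    int main(void) {
        enum { N = 1000000 };
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
        return 0;
    }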

UMA VS. NUMA

Cache Coherence Problem

Multicomputers Parallel computers with multiple CPUs and NO shared memory; processors interact through message passing. + all the advantages of multiprocessors, and it is possible to have a large number of CPUs - message passing is hard to implement and takes a lot of time to carry out The first generation of message passing was store-and-forward, where a processor receives the complete message and then forwards it to the next processor (iPSC and nCUBE/10). The second generation was circuit-switched, where a path is first established at some cost, and subsequent messages use the path without the start-up cost (iPSC/2 and nCUBE 2). The cost of message passing (www.cs.bgu.ac.il/~discsp/papers/ConstrDisAC.pdf) consists of: the startup time, which is incurred even if you send an empty message; the per-byte cost; and, for comparison, the cost of one floating-point operation.
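The startup-time plus per-byte cost model can be estimated with a simple ping-pong test between two processes. A hedged sketch in C with MPI (message size, repetition count, and output format are my own choices):

    #include <stdio.h>
    #include <mpi.h>

    /* Ping-pong between ranks 0 and 1: an empty message approximates the startup
       (latency) cost, a large one approximates the per-byte cost. */
    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { REPS = 1000, BYTES = 1 << 20 };
        static char buf[BYTES];

        for (int size = 0; size <= BYTES; size += BYTES) {   /* 0 bytes, then 1 MB */
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int r = 0; r < REPS; r++) {
                if (rank == 0) {
                    MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = (MPI_Wtime() - t0) / (2.0 * REPS);    /* one-way time */
            if (rank == 0)
                printf("%7d bytes: %.3f microseconds one-way\n", size, t * 1e6);
        }
        MPI_Finalize();
        return 0;
    }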

Multicomputers--nCUBE An nCUBE parallel computer has three parts: the frontend, the backend, and the I/O subsystem. The frontend is a fully functioning computer. The backend nodes, each a computer of its own, run a simple OS that supports message passing. Note that the capability of the frontend stays the same regardless of the number of processors in the backend. The largest nCUBE can have 8K processors.

Multicomputers—CM-5 Each node consists of a SPARC CPU, up to 32 MB of RAM, and four pipelined vector units, each delivering 32 MFlops. It can have up to 16K nodes, giving a theoretical peak speed of about 2 teraflops (16K nodes x 4 units x 32 MFlops) in 1991.

Flynn’s Taxonomy SISD – single-core PC. SIMD – processor array or CM-200. MISD – systolic array. MIMD – multi-core PC, nCUBE, Symmetry, CM-5, Paragon XP/S.
                             Single Data    Multiple Data
    Single-Instruction       SISD           SIMD
    Multiple-Instruction     MISD           MIMD

Inexpensive “Parallel Computers” Beowulf – PCs connected by a switch. NOW – workstations on an intranet. Multi-core – PCs with a few multi-core CPUs.
                   # of nodes      Cost     Performance    Easy to program    Dedicated
    Beowulf        Few to 100               OK                                Yes
    NOW            100s            none                                       No
    Multi-core     Two to a few    low      Low

Summation on Hypercube

Summation on Hypercube
    for j = (log p) - 1 down to 0 do
    {
        for all Pi where 0 <= i <= p - 1
            if (i < 2^j)
            {
                tmp <= [i + 2^j] sum   // variable tmp on Pi receives the value of sum from P(i + 2^j)
                sum = sum + tmp
            }
    }

Summation on Hypercube (2) What could the code look like?
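One possible answer, as a hedged sketch in C with MPI rather than the course's actual code: it assumes the number of processes p is a power of two and that each process starts with a local value in sum. After the loop, process 0 holds the global total.

    #include <stdio.h>
    #include <mpi.h>

    /* Hypercube summation following the pseudocode above: in each round the
       upper half of the remaining processes sends its partial sum to a partner
       2^j positions below, which adds it in. */
    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int i, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &i);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double sum = i + 1.0;                            /* each process's local value */

        for (int half = p / 2; half >= 1; half /= 2) {   /* half plays the role of 2^j */
            if (i < half) {
                double tmp;
                MPI_Recv(&tmp, 1, MPI_DOUBLE, i + half, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += tmp;                              /* tmp <= [i + 2^j] sum */
            } else if (i < 2 * half) {
                MPI_Send(&sum, 1, MPI_DOUBLE, i - half, 0, MPI_COMM_WORLD);
            }
        }
        if (i == 0) printf("total = %f\n", sum);         /* p(p+1)/2 for these values */
        MPI_Finalize();
        return 0;
    }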

Summation on 2-D mesh – SIMD code The mesh has l x l processors, 1-based.
    for i = l - 1 down to 1 do                   // push from right to left
    {
        for all P[j, i] where 1 <= j <= l do     // only l processors are working
        {
            tmp <= [j, i+1] sum                  // variable tmp on P[j, i] receives the value of sum from P[j, i+1]
            sum = sum + tmp
        }
    }
    for i = l - 1 down to 1 do                   // push from bottom up
    {
        for all P[i, 1]                          // really only two processors are working
        {
            tmp <= [i+1, 1] sum                  // variable tmp on P[i, 1] receives the value of sum from P[i+1, 1]
            sum = sum + tmp
        }
    }
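To check the data movement, the two phases can be simulated sequentially on one machine. A small C sketch (mesh size and values are arbitrary choices of mine; indices are 1-based as in the pseudocode):

    #include <stdio.h>

    /* Sequential simulation of the 2-D mesh SIMD summation: phase 1 pushes
       partial sums right-to-left along each row, phase 2 pushes the first
       column bottom-up, so the total ends up on processor [1][1]. */
    enum { L = 4 };

    int main(void) {
        double sum[L + 2][L + 2] = {{0}};
        for (int j = 1; j <= L; j++)
            for (int i = 1; i <= L; i++)
                sum[j][i] = 1.0;                   /* each processor holds 1    */

        for (int i = L - 1; i >= 1; i--)           /* push from right to left   */
            for (int j = 1; j <= L; j++)
                sum[j][i] += sum[j][i + 1];        /* tmp <= [j, i+1] sum       */

        for (int i = L - 1; i >= 1; i--)           /* push from bottom up       */
            sum[i][1] += sum[i + 1][1];            /* tmp <= [i+1, 1] sum       */

        printf("sum at [1][1] = %g (expected %d)\n", sum[1][1], L * L);
        return 0;
    }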

Summation on 2-D mesh

Summation on Shuffle-exchange Shuffle-exchange SIMD code:
    for j = 0 to (log p) - 1 do
    {
        for all Pi where 0 <= i <= p - 1
        {
            Shuffle(sum) <= sum    // sum travels the shuffle connection and replaces sum at the destination
            Exchange(tmp) <= sum   // sum travels the exchange connection and is stored in tmp at the destination
            sum = sum + tmp
        }
    }
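A sequential C simulation of these steps (my sketch; p = 8 nodes chosen arbitrarily) confirms that after log p iterations every node holds the total.

    #include <stdio.h>

    /* Sequential simulation of the shuffle-exchange SIMD summation for p = 2^K
       nodes.  shuffle(i) is a cyclic left shift of i's K-bit address; the
       exchange partner of i is i XOR 1. */
    enum { K = 3, P = 1 << K };

    static unsigned shuffle(unsigned i) { return ((i << 1) | (i >> (K - 1))) & (P - 1); }

    int main(void) {
        double sum[P], next[P], tmp[P];
        for (unsigned i = 0; i < P; i++) sum[i] = i + 1.0;              /* local values 1..P      */

        for (int j = 0; j < K; j++) {
            for (unsigned i = 0; i < P; i++) next[shuffle(i)] = sum[i]; /* Shuffle(sum) <= sum    */
            for (unsigned i = 0; i < P; i++) sum[i] = next[i];
            for (unsigned i = 0; i < P; i++) tmp[i ^ 1u] = sum[i];      /* Exchange(tmp) <= sum   */
            for (unsigned i = 0; i < P; i++) sum[i] += tmp[i];          /* sum = sum + tmp        */
        }
        printf("every node now holds %g (expected %d)\n", sum[0], P * (P + 1) / 2);
        return 0;
    }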