
Chapter 5 Array Processors

Introduction  Major characteristics of SIMD architectures –A single processor(CP) –Synchronous array processors(PEs) –Data-parallel architectures –Hardware intensive architectures –Interconnection network

Associative Processor
 An SIMD whose main component is an associative memory (Figure 2.19)
 AM (associative memory): Figure 2.18
 – Used in fast search operations
 – Data register
 – Mask register
 – Word selector
 – Result register
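
To make the register roles above concrete, here is a minimal Python sketch of an associative-memory search; the function and variable names are hypothetical, and the word selector and result register are modeled as plain Python lists.

def associative_search(words, data_reg, mask_reg, word_selector=None):
    """Return the result register: one match bit per stored word."""
    if word_selector is None:
        word_selector = [True] * len(words)      # all words participate
    result_reg = []
    for enabled, word in zip(word_selector, words):
        # A word matches when its bits, at the positions selected by the
        # mask register, equal the corresponding bits of the data register.
        match = enabled and (word & mask_reg) == (data_reg & mask_reg)
        result_reg.append(match)
    return result_reg

# Example: search 8-bit words whose upper nibble is 0xA.
memory = [0xA3, 0x17, 0xAF, 0x2C]
print(associative_search(memory, data_reg=0xA0, mask_reg=0xF0))
# -> [True, False, True, False]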

Introduction (continued)
 Associative processor architectures also belong to the SIMD classification.
 – STARAN
 – Goodyear Aerospace's MPP (massively parallel processor)
 The systolic architectures are a special type of synchronous array processor architecture.

5.1 SIMD Organization
 Figure 5.1 shows an SIMD processing model. (Compare with Figure 4.1.)
 Example 5.1
 – SIMDs offer an N-fold throughput enhancement over SISDs, provided the application exhibits data parallelism of degree N.
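
A toy Python sketch of the point made in Example 5.1 (the function and data are made up for illustration): one broadcast instruction lets N PEs each complete one element operation, so an application with data parallelism of degree N finishes in one step instead of N.

def simd_step(operation, a_slice, b_slice):
    """One broadcast instruction executed by len(a_slice) PEs in lock-step."""
    return [operation(a, b) for a, b in zip(a_slice, b_slice)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
# Degree-8 data parallelism: one "ADD" broadcast finishes all 8 additions,
# where an SISD machine would need 8 separate instruction issues.
print(simd_step(lambda x, y: x + y, a, b))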

5.1 SIMD Organization (continued)  Memory –Data are distributed among the memory blocks –A data alignment network allows any data memory to be accessed by any PE.

5.1 SIMD Organization (continued)  Control processor –To fetch instructions and decode them –To transfer instructions to PEs for executions –To perform all address computations –To retrieve some data elements from the memory –To broadcast them to all PEs as required.

5.1 SIMD Organization (continued)  Arithmetic/Logic processors –To perform the arithmetic and logical operations on the data –Each PE corresponding to data paths and arithmetic/logic units of an SISD processor capable of responding to control control signals from the control unit.

5.1 SIMD Organization (continued)  Interconnection network (Refer to Figure 2.9) –In type 1 and type 2 SIMD architectures, the PE to memory interconnection through n x n switch –In type 3, there is no PE-to-PE interconnection network. There is a n x n alignment switch between PEs and the memory block.

5.1 SIMD Organization (continued)  Registers, instruction set, performance considerations –The instruction set contains two types of index manipulation instructions, one set for global registers and the other for local registers

5.2 Data Storage Techniques and Memory Organization
 Straight storage / skewed storage
 GCD (greatest common divisor)
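
As a rough illustration of why the GCD matters for skewed storage (the mapping functions here are a common textbook scheme, not necessarily the one in the chapter's figures): with straight storage a whole matrix column falls into one memory module and column accesses serialize, while skewing spreads it out, conflict-free whenever the skew distance and the number of modules are co-prime.

from math import gcd

def module_straight(i, j, M):
    return j % M                     # column j always maps to the same module

def module_skewed(i, j, M, skew=1):
    return (j + skew * i) % M        # shift each row by the skew distance

N, M = 4, 4
col = 2
print([module_straight(i, col, M) for i in range(N)])   # [2, 2, 2, 2]  conflicts
print([module_skewed(i, col, M) for i in range(N)])     # [2, 3, 0, 1]  spread out
print(gcd(1, M) == 1)   # skew of 1 is co-prime with M, so column access is conflict-free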

5.3 Interconnection Networks  Terminology and performance measures –Nodes –Links –Messages –Paths: dedicated / shared –Switches –Directed(or indirect) message transfer –Centralized (or decentralized) indirect message transfer

5.3 Interconnection Networks (continued)  Terminology and performance measures –Performance measures Connectivity Bandwidth Latency Average distance Hardware complexity Cost Place modularity Regularity Reliability and fault tolerance Additional functionality

5.3 Interconnection Networks (continued)  Terminology and performance measures –Design choices(by Feng): refer to Figure 5.9 Switching mode Control strategy Topology Mode of operation

5.3 Interconnection Networks (continued)  Routing protocols –Circuit switching –Packet switching –Worm hole switching  Routing mechanism –Static / dynamic  Switching setting functions –Centralized / distributed

5.3 Interconnection Networks (continued)  Static topologies –Linear array and ring –Two dimensional mesh –Star –Binary tree –Complete interconnection –hypercube

5.3 Interconnection Networks (continued)  Dynamic topologies –Bus networks –Crossbar network –Switching networks Perfect shuffle –Single stage –Multistage

5.4 Performance Evaluation and Scalability  The speedup S of a parallel computer system:  Theoretically, the maximum speed possible with a p processor system is p. ( A superlinear speedup is an exception) –Maximum speedup is not possible in practice, because all the processors in the system cannot be kept busy performing useful computations all the time.

5.4 Performance Evaluation and Scalability (continued)  The timing diagram of Figure 5.20 illustrates the operation of a typical SIMD system.  Efficiency, E is a measure of the fraction of the time that the processors are busy. In Figure 5.20, s is the fraction of the time spent in serial code. 0  E  1

5.4 Performance Evaluation and Scalability (continued)  The serial execution time in Figure 5.20 is one unit and if the code that can be run in parallel takes N time units on a single processor system,  The efficiency is also defines as

5.4 Performance Evaluation and Scalability (continued)  The cost is the product of the parallel run time and the number of processors. –Cost optimal: if the cost of a parallel system is proportional to the execution time of the fastest algorithm.  Scalability is a measure of its ability to increase speedup as the number of processors increases.

5.5 Programming SIMDs
 The SIMD instruction set contains additional instructions for interconnection network (IN) operations, manipulating local and global registers, and setting activity bits based on data conditions.
 Popular high-level languages such as FORTRAN, C, and LISP have been extended to allow data-parallel programming on SIMDs.
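
In such extended data-parallel languages, an operation is written once and applied to whole arrays, with a mask playing the role of the per-PE activity bits. This NumPy fragment (array names invented for illustration) mimics that style rather than any specific SIMD language extension.

import numpy as np

temperatures = np.array([12.0, 37.5, 41.0, 18.3])
activity = temperatures > 30.0            # set activity bits by a data test
temperatures[activity] -= 5.0             # only "active" elements are updated
print(temperatures)                       # [12.  32.5 36.  18.3]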

5.6 Example Systems  ILLIAC-IV –The ILLIAC-IV project was started in 1966 at the University of Illinois. – A system with 256 processors controlled by a CP was envisioned. –The set of processors was divided into four quadrants of 64 processors. –Figure 5.21 shows the system structure. –Figure 5.22 shows the configuration of a quadrant. –The PE array is arranged as an 8x8 torus.

5.6 Example Systems (continued)  CM-2 –The CM-2, introduced in 1987, is a massively parallel SIMD machine. –Table 5.1 summarizes its characteristics. –Figure 5.23 shows the architecture of CM- 2.

5.6 Example Systems (continued)  CM-2 –Processors The 16 processors are connected by a 4x4 mesh. (Figure 5.24) Figure 5.25 shows a processing cell. –Hypercube The processors are linked by a 12-dimensional hypercube router network. The following parallel communication operations permit elements of parallel variables: reduce & broadcast, grid(NEWS), general(send, get), scan, spread, sort.

5.6 Example Systems (continued)  CM-2 –Nexus A 4x4 crosspoint switch, –Router It is used to transmit data from a processor to the other. –NEWS Grid A two-dimensional mesh that allows nearest- neighbor communication.

5.6 Example Systems (continued)  CM-2 –Input/Output system Each 8-K processor section is connected to one of the eight I/O channels (Figure 5,26). Data is passed along the channels to I/O controller (Figure 5.27). –Software Assembly language, Paris *LISP, CM-LISP, and *C –Applications: refer to page 211.

5.6 Example Systems (continued)  MasPar MP –The MasPar MP-1 is a data parallel SIMD with basic configuration consisting of the data parallel unit(DDP) and a host workstation. –The DDP consists of from 1,024 to 16,384 processing elements. –The programming environment is UNIX-based. Programming languages are MDF(MasPar FORTRAN), MPL(MasPar Programming Language)

5.6 Example Systems (continued)  MasPar MP –Hardware architecture The DPU consists of a PE array and an array control unit(ACU). The PE array(Figure 5.28) is configurable from 1 to 16 identical processor boards. Each processor board has 64 PE clusters(PECs) of 16 PEs per cluster. Each processor board thus contains 1024 PEs.

5.7 Systolic Arrays
 A systolic array is a special-purpose planar array of simple processors that features a regular, near-neighbor interconnection network.
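
The "heartbeat" flavor of a systolic array can be seen in a tiny simulation of a 1-D FIR (convolution) array. This is a simplified, semi-systolic sketch in which each new sample is made available to every cell in the same beat; a fully systolic design would also pipeline the samples from cell to cell. All names are illustrative.

def systolic_fir(weights, samples):
    """Each cell holds one fixed weight; partial sums march rightward,
    gaining one multiply-accumulate per cell per beat."""
    n = len(weights)
    cell_w = list(reversed(weights))          # preload weights, last tap first
    y_regs = [0] * n                          # per-cell partial-sum registers
    outputs = []
    for x in list(samples) + [0] * (n - 1):   # extra beats flush the pipeline
        # One beat: every cell adds (its weight * the current sample) to the
        # partial sum arriving from its left neighbor.
        y_regs = [(y_regs[i - 1] if i > 0 else 0) + cell_w[i] * x
                  for i in range(n)]
        outputs.append(y_regs[-1])            # finished sums leave on the right
    return outputs

print(systolic_fir([1, 2, 3], [1, 1, 1, 1]))  # [1, 3, 6, 6, 5, 3]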

Figure 5-31(iWarP System)  iWarp (Intel 1991) –Developed jointly by CMU and Intel Corp. –A programmable systolic array –Memory communication & systolic communication –The advantages of systolic communication Fine grain communication Reduced access to local memory Increased instruction level parallelism Reduced size of local memory

Figure 5.31 (iWarp System)

 An iWarp system is made of an array of iWarp cells.
 Each iWarp cell consists of an iWarp component and local memory.
 The iWarp component contains independent communication and computation agents.
