Computer Architecture Guidance Keio University AMANO, Hideharu . ics . keio . ac . jp.


Contents
- Techniques of two key architectures for future system LSIs:
  - Parallel architectures
  - Reconfigurable architectures
- Advanced uniprocessor architecture → Special Course of Microprocessors (by Prof. Yamasaki, Fall term)

Class
- Lecture using PowerPoint (70 min.)
  - The ppt file is uploaded to the web site, so you can download/print it before the lecture.
  - When the file is uploaded, a notification message is sent to you by e-mail.
  - Textbook: "Parallel Computers" by H. Amano (Sho-ko-do)
- Exercise (20 min.)
  - Simple design or calculation on design issues

Evaluation
- Exercise on parallel programming using SCore on RHiNET-2 (50%)
- Exercise after every lecture (50%)

Computer Architecture 1 Introduction to Parallel Architectures Keio University AMANO, Hideharu . ics . keio . ac . jp

Parallel Architecture
- Purposes
- Classifications
- Terms
- Trends
A parallel architecture consists of multiple processing units which work simultaneously.

Boundary between parallel machines and uniprocessors
- ILP (Instruction-Level Parallelism)
  - A single program counter
  - Parallelism inside/between instructions → uniprocessors
- TLP (Thread-Level Parallelism)
  - Multiple program counters
  - Parallelism between processes and jobs → parallel machines
Definition from Hennessy & Patterson's "Computer Architecture: A Quantitative Approach"

Increasing the number of simultaneously issued instructions vs. tighter coupling
- Single pipeline
- Multiple-instruction issue
- Multiple-thread execution
- Connecting multiple processors (shared memory, shared registers, on-chip implementation)
Tighter coupling → performance improvement

Purposes of providing multiple processors
- Performance: a job can be executed quickly with multiple processors.
- Fault tolerance: if a processing unit is damaged, the total system remains available → redundant systems.
- Resource sharing: multiple jobs share memory and/or I/O modules for cost-effective processing → distributed systems.
- Low power: high performance with low-frequency operation.
Parallel architecture is performance centric!

Flynn's classification
- The number of instruction streams: M (multiple) / S (single)
- The number of data streams: M / S
- SISD: uniprocessors (including superscalar and VLIW)
- MISD: not existing as a practical class (analog computers are sometimes cited)
- SIMD
- MIMD
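The two-letter codes follow mechanically from the two stream counts. As a toy illustration, a hypothetical helper (not part of the lecture material) can generate them:

```python
# Toy generator for Flynn's taxonomy labels (illustration only).
def flynn(instruction_streams, data_streams):
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"

assert flynn(1, 1) == "SISD"   # uniprocessor (incl. superscalar, VLIW)
assert flynn(1, 8) == "SIMD"   # one instruction stream, many data streams
assert flynn(4, 4) == "MIMD"   # independent instruction streams
```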

SIMD (Single Instruction stream, Multiple Data streams)
- All processing units execute the same instruction: one instruction memory drives many processing units, each with its own data memory (figure).
- Low degree of flexibility
- Illiac-IV / MMX instructions (coarse grain)
- CM-2 type (fine grain)

Two types of SIMD
- Coarse grain: each node performs floating-point numerical operations
  - ILLIAC-IV, BSP, GF-11
  - Multimedia instructions in recent high-end CPUs
  - Dedicated on-chip approach: NEC's IMEP
- Fine grain: each node performs only a few bits of operation
  - ICL DAP, CM-2, MP-2
  - Image/signal processing
  - Connection Machines extended the application domain to artificial intelligence (CmLisp)
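The fine-grain style can be sketched in a few lines: every PE applies the same broadcast operation to its own local datum, and a per-PE context flag (modeled loosely on CM-2's context bit) masks which PEs participate. This is a simplified illustration, not any machine's actual ISA:

```python
# Minimal sketch of fine-grain SIMD execution (simplified model, not real hardware):
# every PE applies the SAME broadcast operation to its OWN local data, and a
# per-PE context flag masks participation (cf. CM-2's context bit).

def simd_step(op, local_data, context):
    """Apply one broadcast instruction across all PEs in lockstep."""
    return [op(x) if active else x           # masked PEs keep their old value
            for x, active in zip(local_data, context)]

data = [1, 2, 3, 4]
context = [True, True, False, True]          # PE 2 is masked off
result = simd_step(lambda x: x * 10, data, context)
assert result == [10, 20, 3, 40]
```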

A processing unit of CM-2: a 1-bit serial ALU with inputs A, B, F and carry C, 256 bits of memory, operation code, flags, and a context bit (figure).

Element of CM-2: one LSI chip holds a 4×4 processor array (16 PEs) and a router with 12 links for the 4096-chip hypercube connection; each PE has 256 bits of RAM. 4096 chips = 64K PEs (figure).

Thinking Machines' CM-2 (1996)

The future of SIMD
- Coarse-grain SIMD
  - A large-scale supercomputer like Illiac-IV/GF-11 will not revive.
  - Multimedia instructions will continue to be used.
  - Special-purpose on-chip systems will become popular.
- Fine-grain SIMD
  - Advantageous for specific applications like image processing
  - General-purpose machines are difficult to build (e.g., CM-2 → CM-5)

MIMD
- Processors and memory modules (holding instructions and data) are connected by interconnection networks (figure).
- Each processor executes individual instructions.
- Synchronization is required.
- High degree of flexibility: various structures are possible.
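The contrast with SIMD can be sketched with threads: each "processor" runs a different instruction stream against shared data, and a lock supplies the synchronization mentioned above. A minimal sketch, not an actual multiprocessor:

```python
# Sketch of MIMD on shared memory: each thread runs a DIFFERENT instruction
# stream, and a lock provides the required synchronization.
import threading

counter = {"value": 0}
lock = threading.Lock()

def add_ten():                 # one instruction stream
    for _ in range(10):
        with lock:
            counter["value"] += 1

def add_five():                # a different instruction stream
    with lock:
        counter["value"] += 5

t1 = threading.Thread(target=add_ten)
t2 = threading.Thread(target=add_five)
t1.start(); t2.start()
t1.join(); t2.join()
assert counter["value"] == 15
```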

Classification of MIMD machines: by the structure of shared memory
- UMA (Uniform Memory Access model): provides shared memory that can be accessed from all processors in the same manner.
- NUMA (Non-Uniform Memory Access model): provides shared memory, but it is not uniformly accessed.
- NORA/NORMA (NO Remote Memory Access model): provides no shared memory; communication is done with message passing.

UMA
- The simplest structure of shared-memory machine: the extension of uniprocessors
- An OS that is an extension of a single-processor OS can be used.
- Programming is easy.
- System size is limited.
  - Bus connected
  - Switch connected
- A total system can be implemented on a single chip: on-chip multiprocessor / chip multiprocessor / single-chip multiprocessor

An example of UMA: a bus-connected SMP (Symmetric MultiProcessor) or on-chip multiprocessor, where each PU has a snoop cache and shares the main memory over a shared bus (figure).

Switch-connected UMA: CPUs with local memory access the main memory through a switch via an interface (figure). The gap between switch and bus becomes small.

NUMA
- Each processor has a local memory and accesses other processors' memory through the network.
- Address translation and cache control often make the hardware structure complicated.
- Scalable:
  - Programs for UMA can run without modification.
  - Performance improves with system size.
- Competitive with WS/PC clusters using software DSM

Typical structure of NUMA: nodes 0-3 connected by an interconnection network; the nodes' local memories together form a single logical address space (figure).

Classification of NUMA
- Simple NUMA: remote memory is not cached. Simple structure, but the access cost of remote memory is large.
- CC-NUMA (Cache-Coherent NUMA): cache consistency is maintained in hardware. The structure tends to be complicated.
- COMA (Cache-Only Memory Architecture): no home memory; complicated control mechanism.
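The trade-off between the classes can be made concrete with a toy average-latency model; all latency numbers below are hypothetical illustration values, not measurements of any real machine:

```python
# Toy model of average memory-access latency (all numbers are hypothetical
# illustration values, not measurements of any real machine).

def avg_latency(local_ns, remote_ns, remote_fraction):
    """Weighted average of local and remote access times."""
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns

# Simple NUMA: every remote access pays the full network round trip.
simple = avg_latency(local_ns=100, remote_ns=1000, remote_fraction=0.2)

# CC-NUMA: suppose 90% of remote accesses hit a locally cached copy.
cc = avg_latency(local_ns=100, remote_ns=0.9 * 100 + 0.1 * 1000,
                 remote_fraction=0.2)

assert simple == 280.0   # 0.8*100 + 0.2*1000
assert cc < simple       # caching remote data pays off, at hardware cost
```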

Cray's T3D: a simple NUMA supercomputer (1993), using the Alpha 21064.

The Earth Simulator (2002)

SGI Origin: main memory is connected directly to the Hub chip; nodes are linked by a bristled-hypercube network; one cluster consists of 2 PEs (figure).

SGI's CC-NUMA Origin 3000 (2000), using the R12000.

DDM (Data Diffusion Machine): an example of COMA (figure).

NORA/NORMA
- No shared memory; communication is done with message passing.
- Simple structure but high peak performance.
- The fastest machine has always been a NORA (except the Earth Simulator).
- Hard to program because of inter-PU communications.
- Cluster computing
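Message passing can be sketched with queues between threads standing in for nodes: the two "PUs" share no data structures and exchange values only through explicit send/receive operations (a simulation; a real NORA machine would use a message-passing library such as MPI):

```python
# Sketch of NORA-style message passing: the two "PUs" share no variables and
# communicate only through explicit send/receive queues. Threads stand in for
# separate nodes; a real NORA machine would use MPI or similar.
import threading, queue

to_worker, to_master = queue.Queue(), queue.Queue()

def worker():
    data = to_worker.get()            # receive a message
    to_master.put(sum(data))          # compute locally, then send the result

t = threading.Thread(target=worker)
t.start()
to_worker.put([1, 2, 3, 4])           # master sends work
result = to_master.get()              # master receives the result
t.join()
assert result == 10
```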

An early hypercube machine: nCUBE2.

Fujitsu's AP1000 (1990): a NORA machine with mesh connection, using SPARC processors.

Intel's Paragon XP/S (1991): mesh connection, using the i860.

PC Cluster
- Beowulf cluster (NASA's Beowulf Project, 1994, by Sterling)
  - Commodity components
  - TCP/IP
  - Free software
- Others
  - Commodity components
  - High-performance networks like Myrinet / InfiniBand
  - Dedicated software

RHiNET-2 cluster

Terms (1)
- Multiprocessors:
  - MIMD machines with shared memory
  - Strict definition (by Enslow Jr.): shared memory, shared I/O, distributed OS, homogeneous
  - Extended definition: all parallel machines (wrong usage)

Terms (2)
- Multicomputer
  - MIMD machines without shared memory, that is, NORA/NORMA
- Array computer
  - A machine consisting of an array of processing elements: SIMD
  - A supercomputer for array calculation (wrong usage)
- Loosely coupled / tightly coupled
  - Loosely coupled: NORA; tightly coupled: UMA
  - But are NORAs really loosely coupled? Avoid these terms if possible.

Classification
- Stored-programming based
  - MIMD
    - Multiprocessors
      - UMA: bus connected, switch connected
      - NUMA: simple NUMA, CC-NUMA, COMA
    - Multicomputers
      - NORA
  - SIMD: fine grain, coarse grain
- Others
  - Systolic architecture
  - Data flow architecture
  - Mixed control
  - Demand-driven architecture

Systolic architecture: data streams (x, y) are inserted into an array of special-purpose computing nodes in a certain rhythm (figure). Introduced again in the Reconfigurable Architectures part.
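As a rough simulation (not real systolic hardware: the per-cycle sum is formed in one step here rather than pipelined through the cells), a 1-D systolic array computing a convolution can be modeled like this. Each cell holds one fixed weight, and the data stream shifts one cell per cycle:

```python
# Cycle-by-cycle sketch of a 1-D systolic array computing the convolution
# y[i] = sum_k w[k] * x[i-k]. Each cell holds one fixed weight; the data x
# marches through the array one cell per cycle, in rhythm.

def systolic_convolve(w, x):
    n_cells = len(w)
    cells = [0] * n_cells                    # data value held in each cell
    out = []
    for t in range(len(x) + n_cells - 1):    # one iteration = one clock cycle
        xin = x[t] if t < len(x) else 0      # feed zeros after the stream ends
        cells = [xin] + cells[:-1]           # data shifts one cell per cycle
        out.append(sum(wk * xk for wk, xk in zip(w, cells)))
    return out

# Matches a direct convolution of [1, 2] with [3, 4, 5].
assert systolic_convolve([1, 2], [3, 4, 5]) == [3, 10, 13, 10]
```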

Data flow machines: a process is driven by the arrival of its data, e.g., computing (a+b)×(c+(d×e)) as a graph of + and × nodes (figure). Also introduced in the Reconfigurable Systems part.
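The data-driven firing rule can be sketched directly: a node executes as soon as both of its operand tokens have arrived, independent of any program order. A minimal interpreter for the example expression (node names like `n1` are arbitrary):

```python
# Data-driven evaluation of (a+b)*(c+(d*e)): a node "fires" as soon as both
# of its operand tokens are available, regardless of program order.
import operator

def run_dataflow(nodes, tokens):
    """nodes: name -> (op, operand1, operand2); tokens: initially known values."""
    ops = {"+": operator.add, "*": operator.mul}
    pending = dict(nodes)
    while pending:
        for name, (op, a, b) in list(pending.items()):
            if a in tokens and b in tokens:          # both operands ready: fire
                tokens[name] = ops[op](tokens[a], tokens[b])
                del pending[name]
    return tokens

graph = {"n1": ("*", "d", "e"), "n2": ("+", "c", "n1"),
         "n3": ("+", "a", "b"), "out": ("*", "n3", "n2")}
tokens = run_dataflow(graph, {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
assert tokens["out"] == (1 + 2) * (3 + (4 * 5))      # == 69
```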

Exercise 1
In this class, the PC cluster RHiNET is used for parallel-programming exercises. Is RHiNET classified as a Beowulf cluster? Why do you think so? If you take this class, send the answer with your name and student number to