Other Architectures & Examples
Anshul Kumar, CSE IITD
- Multithreaded architectures
- Dataflow architectures
- Multiprocessor examples
1st May, 2006



Context switching
Delays and poor resource utilization are due to:
- data/control hazards
- cache misses
- waiting for some event
Solution: context switch to another thread
Context switch mechanism:
- operating system: slow
- hardware: fast
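Why the cost of the switch mechanism matters can be seen with a back-of-the-envelope utilization model. The sketch below is an illustrative assumption, not part of the slides: every thread runs for `run` cycles, then stalls for `latency` cycles, and each switch costs `switch` cycles. The function and parameter names are hypothetical.

```python
def utilization(threads, run, latency, switch):
    """Fraction of cycles spent on useful work (simplified deterministic model).

    Assumes a switch is taken on every stall, so the single-thread case
    is slightly pessimistic when switch > 0.
    """
    # Too few threads: the processor still idles while every thread is stalled.
    linear = threads * run / (run + switch + latency)
    # Enough threads: stalls are fully hidden, only the switch overhead remains.
    saturated = run / (run + switch)
    return min(linear, saturated)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(utilization(n, run=10, latency=40, switch=1), 3))
```

In this model a large `switch` value (an operating-system switch) caps utilization at a low level no matter how many threads are available, while a small value (a hardware switch) lets a handful of threads hide most of the latency.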

Multithreaded architecture
Hardware context switching
- Models: control flow, or hybrid (control flow, data flow)
- Granularity: fine grain or coarse grain
- Memory organization: shared? distributed? cache coherent?
- No. of threads: small, medium, large

ILP and Multithreading
[Figure: comparison of ILP, coarse MT, fine MT and SMT. Source: Hennessy and Patterson]

Chip level multithreading
Executing instructions from multiple threads within one processor chip at the same time.
- Multithreading: interleaved issue of multiple instructions from different threads
- Simultaneous multithreading (SMT): issue multiple instructions from multiple threads in one cycle
- Chip-level multiprocessing (CMP or multicore): integrate two or more superscalar processors into one chip, each executing one thread independently
- Any combination of multithreading/SMT/CMP
Source: Wikipedia
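Interleaved issue can be illustrated with a toy single-issue scheduler. This is a minimal sketch under assumed rules, not a model of any real processor: each thread is a list of per-instruction latencies (at least 1), meaning how many cycles must pass before that thread may issue again, and ready threads are picked round-robin. The function name is hypothetical.

```python
def fine_grained_issue(latencies):
    """Return, per cycle, the id of the thread that issued (None = bubble)."""
    n = len(latencies)
    next_instr = [0] * n   # index of the next instruction in each thread
    ready_at = [0] * n     # first cycle in which each thread may issue again
    timeline = []
    last = n - 1           # so that thread 0 is considered first
    cycle = 0
    while any(next_instr[t] < len(latencies[t]) for t in range(n)):
        issued = None
        # Round-robin: start with the thread after the one that issued last.
        for k in range(1, n + 1):
            t = (last + k) % n
            if next_instr[t] < len(latencies[t]) and ready_at[t] <= cycle:
                issued = t
                break
        if issued is not None:
            ready_at[issued] = cycle + latencies[issued][next_instr[issued]]
            next_instr[issued] += 1
            last = issued
        timeline.append(issued)
        cycle += 1
    return timeline


if __name__ == "__main__":
    print(fine_grained_issue([[2, 2]]))          # one thread: bubbles appear
    print(fine_grained_issue([[2, 2], [2, 2]]))  # two threads fill the bubbles
```

With one thread the issue slot sits empty while the thread waits; with two threads the second thread's instructions fill those slots.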

Historical Examples
Machine            Granularity  Procs            Threads/proc        Memory               Year
HEP from Denelcor  fine         max 16           8 active, 64 max    shared, centralized  1978
Tera               fine         max …            …                   distributed shared   1990
Alewife (MIT)      coarse       max 512 Sparcle  1 active, 3 loaded  CC                   1990

Modern examples
- Pentium 4: Hyperthreading
- MIPS MT
- IBM Power 5: dual core, 2 threads each
- UltraSPARC T1: 8 cores with 4 threads each, fine-grained multithreading

HEP
[Figure: HEP processor block diagram. Blocks: PSW queue, increment control, program memory, operand fetch, registers, matching unit, function units FU1, FU2, ..., FUn, SFU (scheduler function unit), to/from data memory. Annotations: control loop, 8-stage pipeline.]

Control Flow & Data Flow models
Control Flow (von Neumann)
- control flows through a sequence of instructions; branches can alter the flow
- instructions get data from or put data in memory
- explicit parallelism through control operators (fork/join)
Data Flow
- instructions are triggered by availability of data
- data flows from instruction to instruction
- explicit parallelism

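Dataflow Model
[Figure: dataflow graph for R = (A-B)*(B+1). Inputs A and B feed a subtract node (A-B); B and the constant 1 feed an add node (B+1); both results feed a multiply node that produces R.]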
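The firing rule for R = (A-B)*(B+1) can be sketched as a small token-driven interpreter. This is an illustrative sketch only: the node names (`sub`, `add`, `mul`), the `(label, slot, value)` token format, and the function name are assumptions made for the example, not the encoding of any particular dataflow machine.

```python
import operator
from collections import deque


def run_dataflow(program, initial_tokens):
    """Fire each instruction as soon as all of its operand slots are filled.

    program: label -> (operation, arity, [(dest_label, dest_slot), ...])
    Tokens addressed to a label that is not in the program are treated
    as final outputs.
    """
    slots = {label: {} for label in program}
    outputs, fired = {}, []
    queue = deque(initial_tokens)
    while queue:
        label, slot, value = queue.popleft()
        if label not in program:
            outputs[label] = value
            continue
        op, arity, dests = program[label]
        slots[label][slot] = value
        if len(slots[label]) == arity:          # all operands present: fire
            args = [slots[label][i] for i in range(1, arity + 1)]
            result = op(*args)
            slots[label] = {}
            fired.append(label)
            for dest_label, dest_slot in dests:  # data flows to consumers
                queue.append((dest_label, dest_slot, result))
    return outputs, fired


PROGRAM = {
    "sub": (operator.sub, 2, [("mul", 1)]),   # A - B
    "add": (operator.add, 2, [("mul", 2)]),   # B + 1
    "mul": (operator.mul, 2, [("R", 1)]),     # (A - B) * (B + 1)
}

if __name__ == "__main__":
    a, b = 7, 3
    tokens = [("sub", 1, a), ("sub", 2, b), ("add", 1, b), ("add", 2, 1)]
    print(run_dataflow(PROGRAM, tokens))
```

No program counter appears anywhere: the order of execution is determined only by when operands arrive, so `sub` and `add` could fire in either order or in parallel.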

Dataflow Program
R = (A-B)*(B+1)
L1: Compute B   -> L2/2, L3/1
L2: -           -> L4/1        (A-B)
L3: + 1         -> L4/2        (B+1)
L4: *           -> L6/1

Static Dataflow Architecture
[Figure: static dataflow processing element. Blocks: fetch unit, instruction queue, function units FU1, FU2, ..., FUn, update unit, activity store; links to/from other PEs.]

Tagged-token dataflow architecture
[Figure: tagged-token dataflow processing element. Blocks: token queue, matching unit, matching store, fetch unit, instruction/data memory, function units FU1, FU2, ..., FUn, form token unit; links to/from other PEs.]
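The role of the matching unit and matching store can be sketched as follows. This is a simplified illustration under assumed conventions: a token is `(instruction, tag, slot, value)`, every instruction takes `arity` operands, and two tokens match when they carry the same instruction and the same tag. The function name is hypothetical.

```python
def match_tokens(tokens, arity=2):
    """Pair up tokens by (instruction, tag); return fired activations and leftovers."""
    store = {}   # matching store: (instruction, tag) -> {slot: value}
    fired = []
    for instr, tag, slot, value in tokens:
        key = (instr, tag)
        waiting = store.setdefault(key, {})
        waiting[slot] = value
        if len(waiting) == arity:
            # All partner tokens for this tag have arrived: the instruction can fire.
            operands = tuple(waiting[s] for s in sorted(waiting))
            fired.append((instr, tag, operands))
            del store[key]
    return fired, store


if __name__ == "__main__":
    stream = [("add", 0, 1, 10), ("add", 1, 1, 20), ("add", 1, 2, 2), ("add", 0, 2, 1)]
    print(match_tokens(stream))
```

Because the tag is part of the key, several activations of the same instruction (for example, different loop iterations) can be in flight at once without their operands being mixed up.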

UMA Examples
- Earlier approach: large number of processors (e.g. Denelcor HEP, NYU Ultracomputer)
- Now realized: good only for a small number of processors (e.g. Encore Multimax, SGI Power Challenge)

SGI Power Challenge
- 18 MIPS R… processors
- … GB RAM, 8-way interleaved
- 4 Power Channel-2 I/O buses, each 320 MB/s
- Power Path-2: split-transaction shared bus (256-bit data, 40-bit address)
- Snoopy cache coherence protocol

NUMA Examples
- BBN TC2000
- IBM RP3
- Hector
- Cray T3D

Hector
Hierarchical structure:
- global ring
- local rings
- stations
- processor modules (P+C+M), I/O module

Hector
[Figure: Hector organization. A global ring connects local rings; each local ring connects stations; within a station, a station bus links the processor modules, the I/O module and the station controller.]

Cray T3D
- Alpha processors
- Cray Y-MP host
- up to 128 GB memory
- 3D torus: 4x4x4, configurable up to 8x8x8
- 2 PEs in each node
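The wrap-around connectivity of a 3D torus can be sketched in a few lines. This is a generic illustration of the topology, not a description of the T3D router; the function names and the `(x, y, z)` coordinate convention are assumptions for the example.

```python
def torus_neighbors(node, dims):
    """Nodes directly linked to `node` in a torus of size `dims` (wrap-around links)."""
    result = set()
    for axis in range(len(dims)):
        for step in (-1, 1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % dims[axis]
            result.add(tuple(coords))
    return sorted(result)


def torus_hops(a, b, dims):
    """Minimum number of links between nodes a and b, going either way round each ring."""
    return sum(min((p - q) % d, (q - p) % d) for p, q, d in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4, 4)
    print(torus_neighbors((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (3, 3, 3), dims))
    print(dims[0] * dims[1] * dims[2] * 2, "PEs with 2 PEs per node")
```

The wrap-around links are what keep distances short: opposite corners of the 4x4x4 grid are only one hop apart in each dimension.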

CC-NUMA examples
Machine               Nodes                          Mem             Cache               Net
Wisconsin Multicube   single proc                    per col bus     snoopy              bus grid
Aquarius Multi-multi  single proc                    per node        snoopy + directory  bus grid
Stanford Dash         cluster: 4 R3000 + FPU on bus  per cluster     snoopy + directory  pair of meshes
Stanford Flash        single proc: T5 + magic chip   per node        directory           2D mesh
Convex Exemplar       hyper node: 8 PA-RISC          per hyper node  SCI                 X bar (hyper node), multi rings
Magic chip: memory + I/O + network controller

COMA examples
DDM (Data Diffusion Machine)
- single bus (split transaction)
- can be made hierarchical
KSR 1
- hierarchical rings
- distributed directory is a matrix: rows for pages, columns for caches
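The "rows for pages, columns for caches" directory can be pictured as a presence matrix. The sketch below is a deliberately simplified assumption: one boolean per (page, cache) pair and an invalidate-on-write policy. It does not reproduce the actual KSR 1 directory states or protocol, and the class and method names are hypothetical.

```python
class MatrixDirectory:
    """Presence matrix: present[page][cache] is True if that cache holds the page."""

    def __init__(self, num_pages, num_caches):
        self.present = [[False] * num_caches for _ in range(num_pages)]

    def record_copy(self, page, cache):
        """Note that `cache` now holds a copy of `page`."""
        self.present[page][cache] = True

    def holders(self, page):
        """Scan one row to find every cache holding the page."""
        return [c for c, held in enumerate(self.present[page]) if held]

    def write(self, page, cache):
        """Give `cache` the only copy; return the caches that were invalidated."""
        invalidated = [c for c in self.holders(page) if c != cache]
        for c in invalidated:
            self.present[page][c] = False
        self.present[page][cache] = True
        return invalidated


if __name__ == "__main__":
    d = MatrixDirectory(num_pages=4, num_caches=3)
    d.record_copy(2, 0)
    d.record_copy(2, 2)
    print(d.holders(2))
    print(d.write(2, 1))
    print(d.holders(2))
```

Looking up a page is a scan along its row, which is why the location of data can be found without a fixed home memory.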

Distr Mem Arch Examples
Machine            Comp. proc   Comm. proc  Vec. proc  Switch       Topology
nCUBE2             custom       custom      -          -            hypercube
iPSC2              i386         yes         yes        -            hypercube
Intel Paragon      i860         i860        -          custom       2D mesh
Genesis            i870         i870        -          custom       2-level X bar
Manna              i860         i860        -          16x16 X bar  hierarchical
Parsytec           PowerPC 601  T805        -          C004         3D mesh
Transtech Paramid  i860         T805        -          C004         variable
IBM SP2            Power2       i860        -          custom       fat tree
Meiko CS-2         SPARC        custom      Fujitsu    custom       fat tree
Parsys SN9800      T900         T900        -          C104         hierarchical switch

References
D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.