Chapter 2 Parallel Architecture

Moore’s Law
The number of transistors on a chip doubles every 18-24 months.
– Has been valid for over 40 years
– Can’t improve performance with increased frequency due to heat issues
What to do with the additional transistors?
– Faster ways to compute operations
– Pipelining
– Cache
– MULTICORE!!

Levels of Parallelism
Bit Level
– Word-size related – how many bits do we work on at one time (4, 8, 16, 32, 64). Parallel vs. serial addition (see the sketch after this slide)
Pipelining
– Break an instruction into components and process them like an assembly line: Fetch, Decode, Execute, Write-back
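The following is a minimal C sketch of bit-level parallelism (an illustration, not from the slides; the function name swar_add_bytes and the byte-packing scheme are assumptions): a 64-bit word is treated as eight packed 8-bit values, so one word-wide operation performs eight additions at once.

#include <stdint.h>
#include <stdio.h>

/* Add eight packed 8-bit lanes with a single 64-bit operation.
   The masks keep carries from spilling across lane boundaries. */
uint64_t swar_add_bytes(uint64_t a, uint64_t b) {
    const uint64_t HI = 0x8080808080808080ULL;  /* high bit of each byte   */
    const uint64_t LO = 0x7F7F7F7F7F7F7F7FULL;  /* low 7 bits of each byte */
    return ((a & LO) + (b & LO)) ^ ((a ^ b) & HI);
}

int main(void) {
    uint64_t a = 0x0102030405060708ULL;
    uint64_t b = 0x1010101010101010ULL;
    printf("%016llx\n", (unsigned long long)swar_add_bytes(a, b));
    /* prints 1112131415161718 - each byte was added independently */
    return 0;
}

A wider word therefore buys parallelism even before extra cores are involved, which is why the slide lists word size first.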

Parallelism Levels
Pipeline (cont.)
– Each stage can work on a different instruction in parallel.
– Superpipeline – many stages.
Multiple functional units
– Have a separate unit for each kind of operation (integer arithmetic, floating-point arithmetic, load/store, etc.); the units can execute in parallel as long as there are no dependencies.
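A hedged C sketch of why dependencies matter (illustrative only; the function names are assumptions): in the first loop every addition waits on the previous one, so extra adders sit idle, while the second loop keeps four independent accumulators that separate functional units (or pipeline slots) can work on concurrently.

/* Dependent chain: each add must wait for the previous result. */
double sum_chained(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the adds have no dependencies on
   one another, so they can be issued to parallel units. */
double sum_unrolled(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}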

Parallelism Levels (cont.)
Process/Thread Level
– Used for multicore/multiple processors
– Issues with shared memory/cache vs. distributed memory
– Done at the programming level
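Since this level is "done at the programming level," here is a minimal pthreads sketch (pthreads is an assumption; the slides do not name a threading API) in which the programmer explicitly creates the parallel threads:

#include <pthread.h>
#include <stdio.h>

/* Each thread runs this function, possibly on its own core. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}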

Flynn’s Taxonomy

Data \ Instruction    Single Instruction    Multiple Instruction
Single Data           SISD                  MISD
Multiple Data         SIMD                  MIMD

SISD – Standard single processor working on a single data item (pair) at a time
MISD – NOT FEASIBLE
SIMD – One instruction is performed on multiple data items – also called a vector processor since it looks like the operands are vectors of data (see the sketch after this slide)
MIMD – Multiple processing elements working on data independently
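A small C sketch of the SIMD idea (illustrative; whether the compiler actually emits vector instructions is an assumption about the toolchain): one conceptual "add" is applied across whole arrays of data, which is exactly the loop shape a vectorizing compiler turns into SIMD instructions.

/* One operation, many data items: c[i] = a[i] + b[i].
   A vectorizing compiler can process several elements per instruction. */
void vector_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}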

Memory Organization
Distributed Memory
– Memory is local and private to each processor
– Sharing information is done via message passing between nodes
– Faster if the nodes have DMA (Direct Memory Access): the controller can get values from or put values into memory without using the processor, so the processor can keep processing while information is transferred
– Faster with routers that handle data transfer without bothering the processor
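A minimal MPI sketch of message passing between private memories (MPI is an assumption; the slides only say "message passing"): rank 0's value is invisible to rank 1 until it is explicitly sent.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;  /* exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);  /* a copy now exists here */
    }

    MPI_Finalize();
    return 0;
}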

Memory Organization (cont.)
Shared Memory
– Memory is global and “public”
– Processes share variables for communication
– Concern: “race” conditions – different results with different execution orders. Example:

Processor 1               Processor 2
(A) LW   $t1, X           (D) LW   $t1, X
(B) ADDI $t1, $t1, 1      (E) ADDI $t1, $t1, 1
(C) SW   $t1, X           (F) SW   $t1, X

The order ABCDEF leaves X+2 in X (each processor sees the other’s update), while ADBECF leaves only X+1 (both processors load the original value).
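The same race written in C with pthreads (a hedged sketch; the threading API and variable names are assumptions): both threads perform the unprotected load-add-store on a shared variable, so the final value depends on how the steps interleave.

#include <pthread.h>
#include <stdio.h>

int x = 0;  /* shared, unprotected: the "X" of the slide */

static void *increment(void *arg) {
    (void)arg;
    x = x + 1;  /* load, add 1, store: can interleave with the other thread */
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, increment, NULL);
    pthread_create(&p2, NULL, increment, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("x = %d (2 expected, but 1 is possible)\n", x);
    return 0;
}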

Shared Memory
– Generally easier programming
– Does not scale well (hardware issues with many processors hitting memory at the same time)
– Cache coherence – if each processor/core has its own (non-shared) cache, the same global memory location may be mapped into two caches that get updated independently (see the sketch after this slide)
– NUMA (non-uniform memory access) – there is a hierarchy of memories with different access times
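One hedged illustration of the cost of keeping caches coherent is false sharing (my example, not the slides'): two threads update logically independent counters that sit in the same cache line, so the line bounces between the cores' caches on every update.

#include <pthread.h>
#include <stdio.h>

/* The two counters almost certainly share one cache line, so each
   update invalidates the other core's cached copy of that line. */
struct { long a; long b; } counters = {0, 0};

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000; i++) counters.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000; i++) counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", counters.a, counters.b);
    return 0;
}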

Memory Organization
Virtually Shared Memory
– There is a difference between the programmer’s view and the hardware
– The programmer writes the code as if the memory is shared, but the memory, in reality, is distributed
– The system automatically generates the messages to get values to the proper processor
– Definitely NUMA

Cache Memory
Small, fast memory between the processor and main memory
– Feasible because of temporal and spatial locality (see the sketch after this slide)
– Holds a subset of main memory
– Cache hit vs. miss
– Cache mapping issues
– COHERENCY
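A hedged C sketch of spatial locality (illustrative; the matrix size is an assumption): row-major traversal touches consecutive addresses and mostly hits in cache, while column-major traversal of the same data jumps a full row between accesses and tends to miss.

#define N 1024

/* Row-major traversal: consecutive accesses fall in the same
   cache line, so most of them are hits (spatial locality). */
long sum_rows(const int a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same array: each access jumps
   N * sizeof(int) bytes, so it frequently misses in cache. */
long sum_cols(const int a[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}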

Thread Level Parallelism
Multithreading – multiple threads executing “simultaneously” on a single processor
– Can’t be truly simultaneous on a single processor
– The CPU swaps between threads based on time (timeslicing) and/or switch-on-event (e.g., one thread waiting for I/O)
Multiple cores allow true simultaneous execution.

Simultaneous Multithreading
Requires multiple functional units and replication of the PC register and all the general user registers (the state of the machine)
– Creates “logical processors”
Allows multiple instructions from different threads to execute simultaneously as long as they do not compete for the same functional units

Multicore Processors
– Needs an OS that recognizes the different cores and schedules tasks on them
– Can run different programs on different cores
– A single program must be written so that parts of it can run simultaneously on separate cores for multiple cores to yield any improvement in time
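A hedged OpenMP sketch of one program split across cores (OpenMP is an assumption; the slides do not name a programming model): the loop iterations are divided among the available cores, and the partial sums are combined at the end.

#include <omp.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;

    /* Iterations are divided among the cores; the reduction clause
       safely combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;

    printf("partial harmonic sum = %f (up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}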

Multicore Architecture
One main issue is the location of the cache(s):
– Each core has a local cache
– Every core shares a cache
– Each core has a local L1 cache and shares an L2 cache
Caches need to communicate for coherency
– Network communication
– Pipeline (go to the next)

Interconnection Networks
Ideally, every node would be connected to every other node so that two mutually exclusive pairs of nodes could communicate simultaneously.
– Requires O(n²) connections – does not scale well
Most systems use a significantly reduced set of connections (cost vs. speed)
– A routing technique is needed if the network is not fully connected
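A small C sketch of the scaling argument (the ring is my example of a reduced-connectivity network, not the slides'): a fully connected network of n nodes needs n(n-1)/2 links, while a ring needs only n, which is why large machines accept routing over a sparser topology.

#include <stdio.h>

int main(void) {
    for (int n = 4; n <= 1024; n *= 4) {
        long full = (long)n * (n - 1) / 2;  /* complete graph: O(n^2) links */
        long ring = n;                      /* ring network: O(n) links     */
        printf("n=%4d  fully connected=%8ld links  ring=%5ld links\n",
               n, full, ring);
    }
    return 0;
}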