
Multiprocessing

Going Multi-core Helps Energy Efficiency (William Holt, HOT Chips 2005). Adapted from UC Berkeley "The Beauty and Joy of Computing"

Processor Parallelism Processor parallelism : the ability to run multiple instruction streams simultaneously

Flynn's Taxonomy Categorization of architectures based on – Number of simultaneous instruction streams – Number of simultaneous data streams

Flynn's Taxonomy Categorization of architectures

SISD SISD : Single Instruction – Single Data – One instruction sent to one processing unit to work on one piece of data – May be pipelined or superscalar

Flynn's Taxonomy Categorization of architectures

SIMD Roots ILLIAC IV – One instruction issued to 64 processing units

SIMD Roots Cray-1 – Vector processor – One instruction applied to all elements of a vector register

Modern SIMD x86 Processors – SSE Units : Streaming SIMD Extensions – Operate on special 128-bit registers 4 × 32-bit chunks 2 × 64-bit chunks 16 × 8-bit chunks …
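The deck shows no code; as a minimal host-side sketch of the idea, one SSE instruction adding four packed floats at once via compiler intrinsics (the variable names are illustrative):

```
// Host-only sketch of SSE: one ADDPS instruction operates on four packed floats.
// Illustrative example, not from the slides.
#include <immintrin.h>
#include <cstdio>

int main() {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float out[4];

    __m128 va = _mm_loadu_ps(a);      // load 4 floats into a 128-bit register
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   // single instruction, 4 additions
    _mm_storeu_ps(out, vc);

    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```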

Modern SIMD Graphics Cards – e.g., NVIDIA's Fermi architecture – Becoming less and less "S"

Coprocessors Graphics processing : specialized for floating point – i7 ~ 100 gigaflops – Kepler GPU ~ 1300 gigaflops

CUDA Compute Unified Device Architecture – Programming model for general-purpose work on GPU hardware – GPU is organized as Streaming Multiprocessors (SMs), each with many CUDA cores

CUDA Designed for thousands of threads – Threads are grouped into "warps" of 32 – An entire warp runs on an SM in lock step – Branch divergence cuts speed
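The transcript carries no code for this slide; as a hedged sketch of the model, a minimal CUDA kernel in which each thread handles one array element (the kernel name and sizes are illustrative, not from the deck):

```
// Minimal CUDA sketch: one thread per element. Threads run in warps of 32;
// a warp executes in lock step, so the if below is a mild divergence point.
#include <cstdio>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                      // threads past the end simply idle
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory, for brevity
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // ~4096 blocks of 256 threads
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```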

Flynn's Taxonomy Categorization of architectures

MISD MISD : Multiple Instruction – Single Data – Different instructions applied to the same data – Rare – Space Shuttle : five processors handle fly-by-wire input and vote on the result

Flynn's Taxonomy Categorization of architectures

MIMD MIMD : Multiple Instruction – Multiple Data – Different instructions working on different data in different processing units – The most common form of parallel machine

Coprocessors Coprocessor : assists the main CPU with some part of the work

Coprocessors Graphics processing : specialized for floating point – i7 ~ 100 gigaflops – Kepler GPU ~ 1300 gigaflops

Other Coprocessors CPUs used to have floating-point coprocessors (e.g., Intel's 8087) Other examples – Audio cards – PhysX physics accelerators – Crypto : SSL encryption for servers

Multiprocessing Multiprocessing : Many processors, shared memory – May have local cache/special memory

Homogeneous Multicore i7 : homogeneous multicore – 4 identical cores on one chip – Separate L2 cache per core, shared L3
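As a tiny host-side sketch (not from the slides), a program can ask how many concurrent hardware threads a multicore chip exposes:

```
// Host-only sketch: query the hardware thread count
// (e.g., 8 on a 4-core hyper-threaded i7).
#include <cstdio>
#include <thread>

int main() {
    unsigned n = std::thread::hardware_concurrency();
    printf("hardware threads: %u\n", n);
    return 0;
}
```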

Heterogeneous Multicore Different cores for different jobs – Specialized media processing in mobile devices Examples – NVIDIA Tegra – PS3 Cell

Multiprocessing & Memory Memory conflict demo…
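The demo itself is not in the transcript; as a hypothetical stand-in for the kind of memory conflict it illustrates, two host threads increment a shared counter with no synchronization, so updates are lost:

```
// Stand-in for the memory-conflict demo (host-only code, not from the deck):
// counter++ is a load-add-store sequence, so concurrent updates can be lost.
#include <cstdio>
#include <thread>

int counter = 0;  // shared memory, no lock

void work() {
    for (int i = 0; i < 1000000; i++)
        counter++;  // data race: interleaved read-modify-write loses updates
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // Expected 2000000; with the race, it typically prints less.
    printf("counter = %d\n", counter);
    return 0;
}
```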

UMA Uniform Memory Access – Every processor sees every memory using same addresses – Same access time for any CPU to any memory word

NUMA Non-Uniform Memory Access – Single memory address space visible to all CPUs – Some memory is local : fast – Some memory is remote : accessed the same way, but slower

Connections Bus : One communication channel – Scales poorly

Connections Crossbar switched – Segmented memory – Any processor can directly link to any memory bank – Requires N² switches

Connections Other topologies – Balance complexity, flexibility and latency

BlueGene A major supercomputer player

BG/P Compute Cards – 4 processors per card – Fully coherent caches – Connected in a 3D torus to neighboring nodes

BG/P Full system : 72 × 32 × 32 torus of nodes

Titan The king : descendant of Red Storm

Flynn's Taxonomy Categorization of architectures

Distributed Systems No common memory space – Messages passed between processors
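A hedged sketch of message passing using MPI (host-only code, not from the slides): rank 0 sends an integer to rank 1, with no shared memory involved. Run with something like mpirun -np 2.

```
// Message-passing sketch with MPI: two processes, no shared address space.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // dest=1, tag=0
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```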

COW Cluster of Workstations

Grid Computing – Multicomputing at internet scale – Resources owned by multiple parties

Parallel Algorithms Some problems are highly parallel, others are not

Speedup Issues : Amdahl's Law – Applications can almost never be completely parallelized; some serial code remains – Speedup is always limited by the serial part of the program [Figure: execution time vs. number of cores, split into parallel and serial portions]

Speedup Issues : Amdahl's Law Amdahl's law : Speedup = 1 / (s + (1 - s) / P) – s is the serial fraction of the program, P is the number of processors
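For a concrete feel (a worked example, not from the slides): with s = 0.1 and P = 8, speedup = 1 / (0.1 + 0.9/8) ≈ 4.7, and even with unlimited processors the speedup can never exceed 1/s = 10.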

Ouch More processors only help when a high percentage of the code is parallelized

Amdahl's Law is Optimistic Each new processor means more – Load balancing – Scheduling – Communication – Etc…