Processor Level Parallelism 2

How We Got Here: Developments in PC CPUs

Development: Single Core

Development: Single Core with Multithreading – 2002: Pentium 4 / Xeon

Development: Multiprocessor – Multiple processors coexisting in one system – Reached the PC space in ~1995

Development: Multicore – Multiple CPUs on one chip – Reached the PC space in ~2005

Power Density Prediction circa 2000 [chart; Core 2 marked] – Adapted from UC Berkeley "The Beauty and Joy of Computing"

Going Multi-core Helps Energy Efficiency – William Holt, HOT Chips 2005

Moore's Law Related Curves – Adapted from UC Berkeley "The Beauty and Joy of Computing"

Development: Modern Complexity – Many cores – Private / shared cache levels

Homogeneous Multicore – i7: homogeneous multicore – 4 identical cores on one chip – separate L2 cache per core, shared L3

Heterogeneous Multicore: Different cores for different jobs – Standard CPU – Low-power CPU – Graphics – Video

Coprocessors – Coprocessor: assists the main CPU with some part of the work

Coprocessors – Graphics card: specialized for floating point – 100s–1000s of SIMD cores – i7: ~100 gigaflops – Kepler GPU: ~1300 gigaflops

CUDA: Compute Unified Device Architecture – Programming model for general-purpose work on GPU hardware – Streaming Multiprocessors (SMs), each with multiple CUDA cores

CUDA: Designed for 1000s of threads – Broken into "warps" of 32 threads – An entire warp runs on an SM in lock step – Branch divergence cuts speed (see the kernel sketch below)
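
To make the warp model concrete, here is a minimal CUDA sketch (the kernel name, sizes, and branch are illustrative, not from the slides). Each thread handles one array element; a data-dependent branch splits a warp into two serialized paths:

```
// Minimal CUDA sketch (hypothetical kernel, not from the slides).
// Threads are launched in blocks; the hardware groups them into
// 32-thread warps that execute in lock step on an SM.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Branch divergence: if threads in the same warp take
        // different sides of this branch, the warp executes both
        // paths one after the other, cutting throughput.
        if (data[i] < 0.0f)
            data[i] = -data[i];
        else
            data[i] = data[i] * 2.0f;
    }
}

// Host-side launch: thousands of threads, e.g. 256 per block:
//   scale<<<(n + 255) / 256, 256>>>(d_data, n);
```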

Other Coprocessors – CPUs used to have floating point coprocessors (Intel) – Audio cards – Crypto: SSL encryption for servers

Parallelism & Memory

Multiprocessing & Memory – Memory demo…

Memory Access: Multiple processes accessing the same memory = interactions – e.g., one process adds 10 to x and another adds 1: x may end up increased by 10, 1, or 11 (see the sketch below)
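
A minimal host-side sketch of that interaction (a hypothetical example; the slides only give the 10/1/11 outcome). Both threads perform an unsynchronized read-modify-write on x, so one update can be lost. Note this is a data race, i.e. undefined behavior, shown only to illustrate the lost-update problem:

```
#include <thread>
#include <cstdio>

int x = 0;

int main() {
    // Each thread reads x, adds its value, and writes back.
    std::thread a([] { x = x + 10; });  // read x, add 10, write back
    std::thread b([] { x = x + 1; });   // read x, add 1, write back
    a.join();
    b.join();
    // If both threads read x == 0 before either writes, one update
    // is lost: x ends at 10 or 1 instead of 11.
    std::printf("x = %d\n", x);
    return 0;
}
```

Making the updates atomic (e.g., std::atomic<int>, or a mutex) restores the guaranteed result of 11.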

UMA: Uniform Memory Access – Every processor sees all memory using the same addresses – Same access time for any CPU to any memory word

NUMA: Non-Uniform Memory Access – Single memory address space visible to all CPUs – Some memory is local: fast – Some memory is remote: accessed in the same way, but slower (see the sketch below)
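
As a rough illustration of what "local vs. remote" means to software, here is a sketch using the Linux libnuma API (an assumption on my part; the slides do not name any particular API):

```
// Node-aware allocation with Linux libnuma (link with -lnuma).
// Memory placed on a CPU's own node is fast; memory on another
// node is accessed through the same pointer, just more slowly.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("not a NUMA system\n");
        return 1;
    }
    // Place this buffer on node 0: CPUs on node 0 get local (fast)
    // access, CPUs on other nodes get remote (slower) access.
    size_t size = 1 << 20;
    void *buf = numa_alloc_onnode(size, 0);
    /* ... use buf ... */
    numa_free(buf, size);
    return 0;
}
```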

NUMA & Cache: Memory problems are compounded by caches – e.g., one core's cache holds X = 10 while another's holds X = 15

Cache Coherence – Cores need to "snoop" other cores' reads – Cores need to broadcast their writes

MESI: a cache coherence protocol – Modified: I have this cached and I have changed it – Exclusive: I have this cached and unmodified, and I am the only one with it – Shared: I and another core both have this cached – Invalid: I do not have this cached

State Change: Transitions based on a core's OWN actions [state diagram; e.g., a read fulfilled by another cache]

State Change: Transitions based on OTHER cores' actions – "I have the only modified copy of this… write it out to memory and make the other core wait"

State Change Sample: CPU 2 broadcasts a write message… CPU 1 invalidates its copy

State Change Sample: CPU 2 snoops a read… has to write its modified value to memory – CPU 2 snoops a write… has to write its modified value to memory (modeled in the sketch below)
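
A toy model of these snooped transitions (illustrative only; real coherence logic lives in hardware). It encodes the MESI states from the earlier slide and the two samples above:

```
// One cache line's state in one core's cache.
enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID };

// Another core's bus activity is "snooped" and changes our state.
State snoop(State s, bool other_is_write) {
    if (other_is_write) {
        // Another core broadcasts a write: if we hold the modified
        // copy, flush it to memory first, then invalidate our copy.
        /* if (s == MODIFIED) write_back_to_memory(); */
        return INVALID;
    }
    // Another core reads the line:
    if (s == MODIFIED) {
        // We hold the only up-to-date copy: write it back to
        // memory (the reader waits), then both of us share it.
        /* write_back_to_memory(); */
        return SHARED;
    }
    if (s == EXCLUSIVE)
        return SHARED;  // no longer the sole holder
    return s;           // SHARED and INVALID are unchanged by a read
}
```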

Parallelism: Bad News

Parallel Speedup – In theory: N cores = N times speedup

Issues: Not every part of a problem scales well – Parallel: can run at the same time – Serial: must run one at a time, in order

Amdahl's Law – In practice, Amdahl's law applied to N processors on a task where P is the parallel portion: Speedup = 1 / ((1 − P) + P/N)

Amdahl's Law – 60% of a job can be made parallel and we use 2 processors: Speedup = 1 / (0.4 + 0.6/2) = 1 / 0.7 ≈ 1.43x faster with 2 processors than with 1 (see the function below)
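
The same arithmetic as a tiny function (a minimal sketch; the function name is made up):

```
// Amdahl's law: P = parallel fraction, N = processor count.
double amdahl_speedup(double P, int N) {
    return 1.0 / ((1.0 - P) + P / N);
}

// amdahl_speedup(0.6, 2)    == 1 / (0.4 + 0.3)    ≈ 1.43
// amdahl_speedup(0.6, 1000) == 1 / (0.4 + 0.0006) ≈ 2.50
// The serial 40% dominates no matter how many cores we add.
```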

Speedup Issues: Amdahl's Law – Applications can almost never be completely parallelized; some serial code remains [chart: time vs. number of cores, split into parallel and serial portions]

Speedup Issues: Amdahl's Law – The serial portion becomes the limiting factor [chart: time vs. number of cores]

Ouch: More processors only help when a high percentage of the code is parallelized

Amdahl's Law is Optimistic: Each new processor means more overhead – Load balancing – Scheduling – Communication – Etc…

Parallel Algorithms – Some problems are highly parallel, others are not: