UC Berkeley CS61C : Machine Structures – Lecture 39: Intra-machine Parallelism


inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C : Machine Structures
Lecture 39 – Intra-machine Parallelism (2010-04-30)
Head TA Scott Beamer, www.cs.berkeley.edu/~sbeamer
"Old-Fashioned Mud-Slinging with Flash": Apple, Adobe, and Microsoft are all pointing fingers at each other over Flash's crashes; Apple and Microsoft will back HTML5.

Review
Parallelism is necessary for performance
- It looks like it's the future of computing
- It is unlikely that serial computing will ever catch up with parallel computing
Software Parallelism
- Grids and clusters, networked computers
- MPI & MapReduce – two common ways to program them
Parallelism is often difficult
- Speedup is limited by the serial portion of the code and by communication overhead (see the worked Amdahl's Law form below)
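The "serial portion" limit is Amdahl's Law; a short worked form (standard material, not spelled out on the slide itself):

    % p = parallelizable fraction of the program, N = number of processors.
    \[
      \mathrm{Speedup}(N) = \frac{1}{(1-p) + p/N},
      \qquad
      \lim_{N \to \infty} \mathrm{Speedup}(N) = \frac{1}{1-p}
    \]
    % Example: p = 0.9 caps the speedup at 10x, however many cores you add.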

Today
- Superscalars
- Power, Energy & Parallelism
- Multicores
- Vector Processors
- Challenges for Parallelism

Disclaimers
- Please don't let today's material confuse what you have already learned about CPUs and pipelining
- When a "programmer" is mentioned today, it means whoever is generating the assembly code (so it is probably a compiler)
- Many of the concepts described today are difficult to implement, so if something sounds easy, think about the possible hazards

Superscalar
Add more functional units or pipelines to the CPU
- Directly reduces CPI by doing more per cycle
- Consider what happens if we:
  - Add another ALU
  - Add 2 more read ports to the RegFile
  - Add 1 more write port to the RegFile

Simple Superscalar MIPS CPU
[Datapath diagram: instruction memory fetches two instructions (Inst0, Inst1) per cycle; the register file gains two extra read ports and one extra write port; a second ALU sits alongside the first, sharing the PC, next-address logic, and data memory.]
Can now do 2 instructions in 1 cycle!

Simple Superscalar MIPS CPU (cont.)
Considerations
- The ISA now has to be changed
- Forwarding for pipelining is now harder
Limitations of our example
- The programmer must explicitly generate instruction-parallel code (sketched below)
- Improvement only if other instructions can fill the slots
- Doesn't scale well
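A minimal C sketch (illustrative, not from the slides) of what "instruction-parallel code" means to a 2-issue machine:

    /* Dependent chain: each add needs the previous result, so a
       dual-issue machine cannot pair these instructions. */
    int chain(int x) {
        x = x + 1;
        x = x + 2;
        x = x + 3;
        return x;
    }

    /* Independent operations: the compiler can schedule a+b and c+d
       into the same cycle, one on each ALU. */
    int pair_sum(int a, int b, int c, int d) {
        int s0 = a + b;   /* issue slot 0 */
        int s1 = c + d;   /* issue slot 1 */
        return s0 + s1;
    }

In the first function every second issue slot is wasted; in the second, independent work fills both slots.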

Superscalars in Practice
Modern superscalar processors often hide behind a scalar ISA
- Gives the illusion of a scalar processor
- Dynamically schedules instructions
  - Tries to fill all slots with useful instructions
  - Detects hazards and avoids them
Instruction Level Parallelism (ILP)
- Multiple instructions from the same instruction stream in flight at the same time
- Examples: pipelining and superscalars

Scaling
[Two figure-only slides; the scaling graphics are not captured in the transcript.]

Energy & Power Considerations
[Two figure-only slides, courtesy of Chris Batten; the graphics are not captured in the transcript.]

Parallelism Helps Energy Efficiency
- Dynamic power scales as P ∝ CV²f
- Circuit delay is roughly linear in V, so lowering the clock rate f also permits lowering the supply voltage V
- Two cores at half the frequency (and correspondingly lower voltage) can match one fast core's throughput at a fraction of the power (worked numbers below)
William Holt, HOT Chips 2005
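A worked form of the slide's argument (the halve-frequency/halve-voltage step is the idealized textbook assumption, not a measured figure):

    % Dynamic power: P \propto C V^2 f, and circuit delay \propto 1/V.
    \[
      P_{\text{1 core}} \propto C V^2 f
    \]
    % Two cores at half frequency can run at roughly half voltage:
    \[
      P_{\text{2 cores}} \propto 2 \, C \left(\tfrac{V}{2}\right)^{2} \tfrac{f}{2}
                         = \tfrac{1}{4} \, C V^2 f
    \]
    % Same aggregate throughput at about a quarter of the power.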

Multicore
- Put multiple CPUs on the same die
- Cost of multicore: complexity and slower single-thread execution
- Task/Thread Level Parallelism (TLP): multiple instruction streams in flight at the same time (sketched in code below)
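A minimal TLP sketch with POSIX threads (illustrative, not from the slides); each thread is an independent instruction stream that the OS can schedule onto its own core:

    #include <pthread.h>
    #include <stdio.h>

    /* Each worker is an independent instruction stream. */
    void *worker(void *arg) {
        long id = (long)arg;
        printf("thread %ld running, possibly on its own core\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);   /* wait for both streams to finish */
        return 0;
    }

Compile with cc -pthread; on a machine with two cores, the two streams really do execute at the same time.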

Cache Coherence Problem
Shared main memory initially holds value 5 at address 16.
1. CPU0: LW R2, 16(R0) – CPU0's cache now holds (addr 16 → 5)
2. CPU1: LW R2, 16(R0) – CPU1's cache also holds (addr 16 → 5)
3. CPU1: SW R0, 16(R0) – CPU1 writes 0 (R0 is hardwired zero) to address 16; its cache now holds (addr 16 → 0) while CPU0's cache still holds the stale 5
View of memory is no longer "coherent": loads of location 16 from CPU0 and CPU1 see different values!
From: John Lazzaro
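Coherence proper is solved in hardware (e.g., by snooping or directory protocols that invalidate stale copies). At the software level, the analogous publish-then-observe pattern is handled with atomics; a hedged C11 sketch, not from the slides:

    #include <stdatomic.h>

    int        value = 5;              /* like memory location 16 */
    atomic_int ready = 0;

    void writer(void)                  /* runs on CPU1 */
    {
        value = 0;                     /* like SW R0, 16(R0) */
        atomic_store(&ready, 1);       /* publish the update */
    }

    int reader(void)                   /* runs on CPU0 */
    {
        while (!atomic_load(&ready))   /* wait for the publication */
            ;
        return value;                  /* sees 0, not the stale 5 */
    }

The atomic store/load pair orders the write to value before the flag, so a reader that observes ready == 1 is guaranteed to see the new value.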

Real World Example: IBM POWER7
[Figure-only slide; the graphic is not captured in the transcript.]

Administrivia
- Proj3 due tonight at 11:59pm
- Performance Contest posted tonight
- Office Hours Next Week: watch website for announcements
- Final Exam Review – 5/9, 3-6pm, 10 Evans
- Final Exam – 5/14, 8-11am, 230 Hearst Gym
  - You get to bring: two 8.5"x11" sheets of handwritten notes + MIPS green sheet

Flynn's Taxonomy
Classification of Parallel Architectures (www.wikipedia.org):
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data

Vector Processors
Vector processors implement SIMD
- One operation is applied to a whole vector
- Not all programs fit easily into this model
- Can get high performance and energy efficiency by amortizing instruction fetch
  - Spend more silicon and power on ALUs
Data Level Parallelism (DLP)
- Compute on multiple pieces of data with the same instruction stream

Vectors (continued)
Vector Extensions
- Extra instructions added to a scalar ISA to provide short vectors
- Examples: MMX, SSE, AltiVec (a short SSE sketch follows below)
Graphics Processing Units (GPU)
- Also leverage SIMD
- Initially very fixed-function for graphics, but now becoming more flexible to support general-purpose computation
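A minimal sketch of such a vector extension in use (illustrative; x86 SSE intrinsics, compile with -msse):

    #include <xmmintrin.h>

    /* Add two float arrays four elements at a time.  Assumes n is a
       multiple of 4 and the arrays are 16-byte aligned. */
    void vec_add(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);           /* load 4 floats */
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));  /* 4 adds, 1 instruction */
        }
    }

One instruction fetch now drives four arithmetic operations – exactly the amortization described above.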

Parallelism at Different Levels
Intra-Processor (within the core)
- Pipelining & superscalar (ILP)
- Vector extensions & GPU (DLP)
Intra-System (within the box)
- Multicore – multiple cores per socket (TLP)
- Multisocket – multiple sockets per system
Intra-Cluster (within the room)
- Multiple systems per rack and multiple racks per cluster

Conventional Wisdom (CW) in Computer Architecture
- Old CW: Power is free, but transistors are expensive
  New CW: Power wall – power is expensive, transistors are "free" (we can put more transistors on a chip than we have the power to turn on)
- Old CW: Multiplies are slow, but loads are fast
  New CW: Memory wall – loads are slow, multiplies are fast (200 clocks to DRAM, but even FP multiplies take only 4 clocks)
- Old CW: More ILP via compiler/architecture innovation (branch prediction, speculation, out-of-order execution, VLIW, …)
  New CW: ILP wall – diminishing returns on more ILP
- Old CW: 2X CPU performance every 18 months
  New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
Implication: we must go parallel!

Parallelism again? What's different this time?
"This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism; instead, this plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures." – Berkeley View, December 2006
The HW/SW industry has bet its future that breakthroughs will appear before it's too late.
view.eecs.berkeley.edu

Peer Instruction
1. All energy-efficient systems are low power
2. My GPU at peak can do more FLOP/s than my CPU
a) FF   b) FT   c) TF   d) TT

Peer Instruction Answer
1. All energy-efficient systems are low power … FALSE – an energy-efficient system could peak at high power to get the work done quickly and then go to sleep
2. My GPU at peak can do more FLOP/s than my CPU … TRUE – current GPUs tend to have more FPUs than current CPUs; they just aren't as programmable
Answer: b) FT

"And in conclusion…"
- Parallelism is all about energy efficiency
- Types of parallelism: ILP, TLP, DLP
- Parallelism already exists at many levels
- Great opportunity for change