Kevin Walsh, CS 3410, Spring 2010, Computer Science, Cornell University. Multicore & Parallel Processing. P&H Chapter 4.10-11, 7.1-6.

Presentation transcript:


2 Problem Statement
Q: How to improve system performance?
- Increase CPU clock rate?
- But I/O speeds are limited (disk, memory, networks, etc.)
Recall: Amdahl’s Law
Solution: Parallelism

3 Instruction-Level Parallelism (ILP)
Pipelining: execute multiple instructions in parallel
Q: How to get more instruction-level parallelism?
A: Deeper pipeline
- Less work per stage → shorter clock cycle
A: Multiple-issue pipeline
- Start multiple instructions per clock cycle in duplicate stages
- Example: 1 GHz 4-way multiple-issue: peak CPI = 0.25 → 4 billion instructions per second
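The peak figures for a multiple-issue pipeline follow from two multiplications, sketched below. The function names `peak_ips` and `peak_cpi` are illustrative and not from the slides, and the numbers are upper bounds that assume every issue slot is filled on every cycle.

```python
def peak_ips(clock_hz, issue_width):
    """Peak instructions per second: one instruction per issue slot per cycle."""
    return clock_hz * issue_width


def peak_cpi(issue_width):
    """Peak cycles per instruction when every issue slot is used."""
    return 1.0 / issue_width


if __name__ == "__main__":
    # A 4-way machine has a peak CPI of 0.25 regardless of clock rate.
    print(peak_cpi(4))
    # At 1 GHz that is 4 billion instructions per second;
    # reaching 16 billion per second would take a 4 GHz clock.
    print(peak_ips(1e9, 4))
    print(peak_ips(4e9, 4))
```

Real programs fall short of these peaks because dependencies and memory delays leave issue slots empty.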

4 Static Multiple Issue
a.k.a. Very Long Instruction Word (VLIW)
Compiler groups instructions to be issued together
- Packages them into “issue slots”
- Simple HW: compiler detects and avoids hazards
Example: Static Dual-Issue MIPS
Instructions come in pairs (64-bit aligned)
- One ALU/branch instruction (or nop)
- One load/store instruction (or nop)

5 Scheduling Example
Compiler scheduling for dual-issue MIPS…

TOP: lw   $t0, 0($s1)      # $t0 = A[i]
     lw   $t1, 4($s1)      # $t1 = A[i+1]
     addu $t0, $t0, $s2    # add $s2
     addu $t1, $t1, $s2    # add $s2
     sw   $t0, 0($s1)      # store A[i]
     sw   $t1, 4($s1)      # store A[i+1]
     addi $s1, $s1, +8     # increment pointer
     bne  $s1, $s3, TOP    # continue if $s1 != end

      ALU/branch slot        Load/store slot      cycle
TOP:  nop                    lw $t0, 0($s1)       1
      addi $s1, $s1, +8      lw $t1, 4($s1)       2
      addu $t0, $t0, $s2     nop                  3
      addu $t1, $t1, $s2     sw $t0, -8($s1)      4
      bne $s1, $s3, TOP      sw $t1, -4($s1)      5
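As a rough check on how well this schedule fills the two issue slots, the sketch below counts the non-nop entries per cycle. The schedule is transcribed from the table above, with the pointer increment written as `addi` to match the unscheduled loop. The helper name `issue_stats` is hypothetical.

```python
# One (ALU/branch slot, load/store slot) pair per cycle, transcribed from the slide.
SCHEDULE = [
    ("nop",                "lw $t0, 0($s1)"),
    ("addi $s1, $s1, +8",  "lw $t1, 4($s1)"),
    ("addu $t0, $t0, $s2", "nop"),
    ("addu $t1, $t1, $s2", "sw $t0, -8($s1)"),
    ("bne $s1, $s3, TOP",  "sw $t1, -4($s1)"),
]


def issue_stats(schedule):
    """Return (useful instructions, cycles, IPC, CPI) for a dual-issue schedule."""
    cycles = len(schedule)
    useful = sum(1 for pair in schedule for slot in pair if slot != "nop")
    return useful, cycles, useful / cycles, cycles / useful


if __name__ == "__main__":
    useful, cycles, ipc, cpi = issue_stats(SCHEDULE)
    print(f"{useful} instructions in {cycles} cycles: IPC = {ipc}, CPI = {cpi}")
```

Eight useful instructions in five cycles gives an IPC of 1.6, short of the dual-issue peak of 2 because two slots hold nops.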

6 Dynamic Multiple Issue
a.k.a. Superscalar processor
CPU examines instruction stream and chooses multiple instructions to issue each cycle
- Compiler can help by reordering instructions…
- … but CPU is responsible for resolving hazards
Even better: Speculation / Out-of-order Execution
- Guess results of branches, loads, etc.
- Execute instructions as early as possible
- Roll back if guesses were wrong
- Don’t commit results until all previous insts. are retired

7 Does Multiple Issue Work?
Q: Does multiple issue / ILP work?
A: Kind of… but not as much as we’d like
Limiting factors?
- Program dependencies
- Hard to detect dependencies → be conservative
  e.g. pointer aliasing: A[0] += 1; B[0] *= 2;
- Hard to expose parallelism
  Can only issue a few instructions ahead of PC
- Structural limits
  Memory delays and limited bandwidth
- Hard to keep pipelines full
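The pointer-aliasing example can be made concrete with a small sketch. Here Python lists stand in for the arrays `A` and `B`. The function names are illustrative only. The point is that swapping the two statements is safe when `A` and `B` are distinct, but changes the result when they refer to the same storage, which is why hardware and compilers must be conservative.

```python
def in_order(A, B):
    """The two statements in program order."""
    A[0] += 1
    B[0] *= 2


def reordered(A, B):
    """The same two statements, swapped as a scheduler might wish to do."""
    B[0] *= 2
    A[0] += 1


def final_values(update, aliased, start=5):
    """Run `update` on fresh arrays; if `aliased`, A and B are the same array."""
    A = [start]
    B = A if aliased else [start]
    update(A, B)
    return A[0], B[0]


if __name__ == "__main__":
    # Distinct arrays: both orders give the same answer.
    print(final_values(in_order, aliased=False), final_values(reordered, aliased=False))
    # Aliased arrays: (5 + 1) * 2 differs from 5 * 2 + 1.
    print(final_values(in_order, aliased=True), final_values(reordered, aliased=True))
```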

8 Power Efficiency
Q: Does multiple issue / ILP cost much?
A: Yes. Dynamic issue and speculation require power.

CPU            | Year | Clock Rate | Pipeline Stages | Issue Width | Out-of-order/Speculation | Cores | Power
i              |      | MHz        | 5               | 1           | No                       | 1     | 5W
Pentium        | 1993 | 66MHz      | 5               | 2           | No                       | 1     | 10W
Pentium Pro    |      | MHz        | 10              | 3           | Yes                      | 1     | 29W
P4 Willamette  |      | MHz        | 22              | 3           | Yes                      | 1     | 75W
UltraSparc III |      | MHz        | 14              | 4           | No                       | 1     | 90W
P4 Prescott    |      | MHz        | 31              | 3           | Yes                      | 1     | 103W
Core           |      | MHz        | 14              | 4           | Yes                      | 2     | 75W
UltraSparc T   |      | MHz        | 6               | 1           | No                       | 8     | 70W

Multiple simpler cores may be better?

9 Moore’s Law
[Figure: Moore’s Law chart; labeled processors include Pentium, Atom, P4, Itanium 2, K8, K10, Dual-core Itanium 2]

10 Why Multicore?
Moore’s law
- A law about transistors
- Smaller means more transistors per die
- And smaller means faster too
But: Power consumption growing too…

11 Power Limits

12 Power Wall
$\text{Power} = \text{capacitance} \times \text{voltage}^2 \times \text{frequency}$
In practice: $\text{Power} \propto \text{voltage}^3$
- Reducing voltage helps (a lot)
- ... so does reducing clock speed
- Better cooling helps
The power wall
- We can’t reduce voltage further
- We can’t remove more heat

13 Why Multicore?

Configuration                  | Performance | Power
Single-Core                    | 1.0x        | 1.0x
Single-Core Overclocked +20%   | 1.2x        | 1.7x
Single-Core Underclocked -20%  | 0.8x        | 0.51x
Dual-Core Underclocked -20%    | 1.6x        | 1.02x
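The numbers on this slide are consistent with a simple model, sketched here under two stated assumptions that go beyond the slides: voltage is scaled in step with clock frequency, so power grows with the cube of the clock scale, and performance scales linearly with both clock rate and core count (an idealization that ignores serial code and memory limits). The function names are hypothetical.

```python
def relative_power(clock_scale, cores=1):
    """Power relative to one core at nominal clock, assuming power ~ clock_scale**3."""
    return cores * clock_scale ** 3


def relative_performance(clock_scale, cores=1):
    """Idealized performance: linear in clock rate and in core count."""
    return cores * clock_scale


if __name__ == "__main__":
    configs = [
        ("Single-Core", 1.0, 1),
        ("Single-Core Overclocked +20%", 1.2, 1),
        ("Single-Core Underclocked -20%", 0.8, 1),
        ("Dual-Core Underclocked -20%", 0.8, 2),
    ]
    for name, scale, cores in configs:
        perf = relative_performance(scale, cores)
        power = relative_power(scale, cores)
        print(f"{name}: performance {perf:.2f}x, power {power:.2f}x")
```

Under this model two underclocked cores deliver about 1.6x the performance for roughly the same power as one core at nominal clock, which is the argument for multicore.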

14 Inside the Processor AMD Barcelona: 4 processor cores

15 Parallel Programming
Q: So let’s just all use multicore from now on!
A: Software must be written as a parallel program
Multicore difficulties
- Partitioning work
- Coordination & synchronization
- Communications overhead
- Balancing load over cores
- How do you write parallel programs?
  ... without knowing exact underlying architecture?

16 Work Partitioning Partition work so all cores have something to do

17 Load Balancing Need to partition so all cores are actually working

18 Amdahl’s Law
If tasks have a serial part and a parallel part…
Example:
  step 1: divide input data into n pieces
  step 2: do work on each piece
  step 3: combine all results
Recall: Amdahl’s Law
As number of cores increases…
- time to execute parallel part?
- time to execute serial part?
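The three steps in the example above can be sketched as follows. This is a structural illustration only: the workload (a sum of squares) and the names `divide` and `parallel_sum_of_squares` are assumptions, and in CPython a thread pool shows the shape of the computation without giving real CPU parallelism for pure-Python work.

```python
from concurrent.futures import ThreadPoolExecutor


def divide(data, n):
    """Step 1 (serial): split data into n nearly equal contiguous pieces."""
    size, extra = divmod(len(data), n)
    pieces, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        pieces.append(data[start:end])
        start = end
    return pieces


def work(piece):
    """Step 2 (parallel): independent work on one piece."""
    return sum(x * x for x in piece)


def parallel_sum_of_squares(data, n=4):
    pieces = divide(data, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(work, pieces))
    # Step 3 (serial): combine all partial results.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), n=4))
```

Adding cores can shrink only step 2; steps 1 and 3 take the same time however many workers there are.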

19 Amdahl’s Law
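Amdahl's Law in its usual form gives the overall speedup as 1 / (serial fraction + parallel fraction / cores). A minimal sketch, with `amdahl_speedup` as an illustrative name and the assumption that the parallel part divides perfectly across cores:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup when only `parallel_fraction` of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # With 90% of the work parallel, more cores help less and less.
    for cores in (1, 2, 4, 8, 16, 1024):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

As the core count grows, the time for the parallel part approaches zero while the serial part is unchanged, so a program that is 90% parallel can never run more than 10 times faster.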