Exploiting Parallelism


Exploiting Parallelism
Greg Stitt
ECE Department, University of Florida

Why Are Custom Circuits Fast?
Circuits can exploit massive amounts of parallelism.
Microprocessor/VLIW/superscalar parallelism: ~5 instructions/cycle in the best case, which rarely occurs.
Digital circuits: potentially thousands of operations per cycle, limited only by how many operations fit in the device.
Circuits also support several different types of parallelism, discussed below.

Types of Parallelism: Bit-Level Parallelism
Example: reversing the bits of a 32-bit value.

C code for bit reversal:

x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

On a processor, binary compilation produces a long sequence of shift/mask instructions:

sll $v1[3], $v0[2], 0x10
srl $v0[2], $v0[2], 0x10
or  $v0[2], $v1[3], $v0[2]
srl $v1[3], $v0[2], 0x8
and $v1[3], $v1[3], $t5[13]
sll $v0[2], $v0[2], 0x8
and $v0[2], $v0[2], $t4[12]
srl $v1[3], $v0[2], 0x4
and $v1[3], $v1[3], $t3[11]
sll $v0[2], $v0[2], 0x4
and $v0[2], $v0[2], $t2[10]
...

The processor requires between 32 and 128 cycles. A circuit that simply rewires the original x value into the bit-reversed x value requires only 1 cycle, a speedup of 32x to 128x at the same clock rate.
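For readers who want to run the sequence above, it drops directly into a small C program; the function name bit_reverse32 and the test value are my additions, not from the slides.

/* Runnable version of the slide's bit-reversal code. */
#include <stdio.h>
#include <stdint.h>

static uint32_t bit_reverse32(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
    return x;
}

int main(void) {
    printf("%08x\n", bit_reverse32(0x00000001)); /* prints 80000000 */
    return 0;
}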

Types of Parallelism: Arithmetic-Level Parallelism (i.e., "wide" parallelism)

C code:

for (i=0; i < 128; i++)
  y += c[i] * x[i];

Circuit: 128 multipliers operating in parallel, feeding a tree of adders (127 adders in 7 levels) that reduces the products into y.

Processor: thousands of instructions, several thousand cycles (assuming ~1 operation per cycle).
FPGA: ~7 cycles, the depth of the adder tree.
Speedup > 100x at the same clock rate.
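As a sanity check on the ~7-cycle claim, the following sketch models the adder tree in software, treating each tree level as one cycle; the test data and cycle accounting are my assumptions.

/* Models 128 parallel multiplies (1 cycle) followed by a 7-level adder tree. */
#include <stdio.h>

int main(void) {
    int c[128], x[128], p[128];
    for (int i = 0; i < 128; i++) { c[i] = 1; x[i] = i; }

    for (int i = 0; i < 128; i++) p[i] = c[i] * x[i]; /* cycle 1: all multipliers */
    int n = 128, cycles = 1;
    while (n > 1) {                 /* one adder-tree level per cycle */
        for (int i = 0; i < n / 2; i++) p[i] = p[2*i] + p[2*i + 1];
        n /= 2;
        cycles++;
    }
    printf("y = %d after %d cycles\n", p[0], cycles); /* 8128 after 8 cycles */
    return 0;
}

With one multiply stage plus seven add levels, the model counts 8 cycles, consistent with the slide's "~7 cycles" figure.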

Types of Parallelism: Pipeline Parallelism (i.e., "deep" parallelism)

for (i=0; i < 100000; i++) {
  y[i] += c[i] * x[i] + c[i+1] * x[i+1] + ... + c[i+11] * x[i+11];
}

Problem: 12 * 100,000 multipliers would require huge area.
Solution: use only the resources required by one iteration, with registers between levels, and start a new iteration every cycle. After the pipeline fills, the datapath performs 12 multiplies and 12 adds every cycle.
Performance can be increased further by "unrolling" the loop or replicating the datapath to perform multiple iterations every cycle.
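The cycle count for any such pipeline follows a simple formula: fill the pipeline once, then retire one iteration per cycle. The sketch below applies it to this loop; the 5-stage depth (one multiply level plus a 4-level adder tree for 12 products) is my estimate, not stated on the slide.

/* Back-of-the-envelope cycle count for the pipelined 12-tap loop above. */
#include <stdio.h>

int main(void) {
    long iterations = 100000;
    int depth = 5;  /* assumed: 1 multiply stage + ceil(log2(12)) = 4 add stages */
    long cycles = depth + (iterations - 1);  /* fill once, then 1 result/cycle */
    printf("~%ld cycles for %ld iterations\n", cycles, iterations);
    return 0;
}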

Types of Parallelism: Task-Level Parallelism
e.g., MPEG-2: each box in the block diagram is a task, and all tasks execute in parallel. Each task may itself contain bit-level, wide, and deep parallelism.

How to Exploit Parallelism
General idea:
1) Identify tasks.
2) Create a circuit for each task that exploits deep and wide parallelism.
3) Add buffers between tasks to enable communication.
How do we create the circuit for each task?

Pipelining Example
Step 1: Create a DFG (dataflow graph) for the body of the loop.

for (i=0; i < 100; i++)
  a[i] = b[i] + b[i+1] + b[i+2];

The DFG has inputs b[i], b[i+1], and b[i+2]: one adder computes b[i] + b[i+1], a second adder combines that sum with b[i+2], and the result is a[i].

Pipelining Example
Step 2: Add pipeline stages to each level of the DFG, so every adder output (and every value that must wait a cycle) is registered.

Pipelining Example
Cycle 1: b[0], b[1], and b[2] enter the pipeline; the first adder computes b[0] + b[1] while b[2] is captured in a delay register.

Pipelining Example
Cycle 2: b[1], b[2], and b[3] enter the pipeline. The stage-1 registers now hold b[0]+b[1] and the delayed b[2], which the second adder combines.

Pipelining Example
Cycle 3: b[2], b[3], and b[4] enter. The stage-1 registers hold b[1]+b[2] and the delayed b[3], and the stage-2 register captures b[0]+b[1]+b[2].

Pipelining Example
Cycle 4: b[3], b[4], and b[5] enter. The stage-1 registers hold b[2]+b[3] and the delayed b[4], the stage-2 register holds b[1]+b[2]+b[3], and the first output a[0] appears. It takes 4 cycles to fill the pipeline.

Pipelining Example
Cycle 5: a[1] appears. From this point on the pipeline produces one output per cycle, with 99 more outputs until completion.
Total cycles => 4 to fill the pipeline + 99 = 103.
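The 103-cycle count is easy to verify with a small cycle-by-cycle software model of the pipeline; the register names and test data below are my own, and the registers are read in reverse pipeline order to mimic simultaneous updates.

/* Cycle-accurate model of the 2-adder pipelined DFG; verifies 103 cycles. */
#include <stdio.h>

int main(void) {
    int b[102], a[100];
    for (int i = 0; i < 102; i++) b[i] = i;

    int r_sum = 0, r_delay = 0, r_out = 0;  /* pipeline registers */
    int cycles = 0, outputs = 0;

    for (int i = 0; outputs < 100; i++, cycles++) {
        if (cycles >= 3) a[outputs++] = r_out;  /* output stage */
        r_out = r_sum + r_delay;                /* second adder */
        if (i < 100) {                          /* first adder + delay register */
            r_sum = b[i] + b[i + 1];
            r_delay = b[i + 2];
        }
    }
    /* prints: 103 cycles, a[0]=3, a[99]=300 */
    printf("%d cycles, a[0]=%d, a[99]=%d\n", cycles, a[0], a[99]);
    return 0;
}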

uP Performance Comparison

for (i=0; i < 100; i++)
  a[i] = b[i] + b[i+1] + b[i+2];

Assumptions: 10 instructions for the loop body; CPI (cycles per instruction) = 1.5; the processor clock is 2x faster than the circuit's.
Total software cycles: 100 * 10 * 1.5 = 1,500 cycles.
Circuit speedup: (1500/103) * (1/2) = 7.3x.

uP Performance Comparison

for (i=0; i < 100; i++)
  a[i] = b[i] + b[i+1] + b[i+2];

Assumptions: 10 instructions for the loop body; CPI = 1.5; the processor clock is 10x faster than the circuit's (e.g., an FPGA running at 200 MHz).
Total software cycles: 100 * 10 * 1.5 = 1,500 cycles.
Circuit speedup: (1500/103) * (1/10) = 1.46x.
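The same speedup formula applies to both comparisons: (software cycles / circuit cycles) divided by the clock ratio. A tiny helper, my wrapper around the numbers on these two slides, reproduces both results:

/* Speedup = (SW cycles / HW cycles) / (uP clock / circuit clock). */
#include <stdio.h>

static double speedup(double sw_cycles, double hw_cycles, double clk_ratio) {
    return (sw_cycles / hw_cycles) / clk_ratio;
}

int main(void) {
    double sw = 100 * 10 * 1.5;                                   /* 1500 cycles */
    printf("2x faster uP clock:  %.1fx\n", speedup(sw, 103, 2));  /* 7.3x  */
    printf("10x faster uP clock: %.2fx\n", speedup(sw, 103, 10)); /* 1.46x */
    return 0;
}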

Entire Circuit
The pipelined datapath sits between two memories:
RAM -> Input Address Generator -> Buffer -> Pipelined Datapath -> Output Buffer -> Output Address Generator -> RAM
(with a controller coordinating the address generators and buffers).
The input RAM delivers "streams" of data to the datapath; a separate RAM receives the "streams" of data written by the datapath.
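The input buffer is what lets the datapath see several overlapping operands per cycle while only one new word is fetched from RAM. Below is a software sketch of that idea, sized for the 3-input example loop; the array contents and names are mine.

/* Sliding-window input buffer: 3 operands presented per cycle,
   only 1 new RAM fetch per cycle after the initial fill. */
#include <stdio.h>
#define N 100

int main(void) {
    int ram[N + 2], a[N];
    for (int i = 0; i < N + 2; i++) ram[i] = i;

    int win[3] = { ram[0], ram[1], ram[2] };      /* initial fill */
    for (int i = 0; i < N; i++) {
        a[i] = win[0] + win[1] + win[2];          /* datapath consumes window */
        win[0] = win[1]; win[1] = win[2];         /* shift the buffer */
        if (i + 3 < N + 2) win[2] = ram[i + 3];   /* one new fetch per cycle */
    }
    printf("a[0]=%d a[99]=%d\n", a[0], a[99]);    /* prints a[0]=3 a[99]=300 */
    return 0;
}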

Pipelining, Cont.
Possible improvement: why not execute multiple iterations at the same time, e.g., via loop unrolling? This is only possible when there are no dependencies between iterations.

for (i=0; i < 100; i++)
  a[i] = b[i] + b[i+1] + b[i+2];

Unrolled DFG: iteration i consumes b[i], b[i+1], b[i+2] to produce a[i]; alongside it, iteration i+1 consumes b[i+1], b[i+2], b[i+3] to produce a[i+1]; and so on.

Pipelining, Cont.
How much should we unroll? Unrolling is limited by memory bandwidth and area:
- all inputs must be read once per cycle,
- all outputs must be written once per cycle,
- and there must be sufficient area for every operation in the unrolled DFG.

Unrolling Example
Original circuit: the first iteration requires 3 inputs (b[0], b[1], b[2]) to produce a[0].

Unrolling Example
Each unrolled iteration requires only one additional input: unrolling once, the 4 inputs b[0], b[1], b[2], b[3] produce both a[0] and a[1].

Unrolling Example
In steady state, each cycle brings in 4 inputs (instead of 6), because adjacent iterations share two of their three inputs: e.g., b[1] through b[4] enter while the stage-1 registers hold b[0]+b[1], b[2], b[1]+b[2], and b[3].

Performance After Unrolling
How much unrolling? Assume b[] elements are 8 bits. The first iteration requires 3 elements = 24 bits; each additional unrolled iteration requires only 1 more element = 8 bits, due to the overlapping inputs.
Assume memory bandwidth = 64 bits/cycle. Then 6 iterations can be performed in parallel: 24 + 8 + 8 + 8 + 8 + 8 = 64 bits.
New performance (assume a 3 GHz uP and a 200 MHz FPGA): the unrolled pipeline requires 4 cycles to fill, plus (100-6)/6 iterations, or ~20 cycles total.
With unrolling, the FPGA is (1500/20) * (1/15) = 5x faster than the 3 GHz microprocessor!
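The unroll factor and resulting speedup follow mechanically from the bus width, so a small helper can reproduce this slide and the next one together; the rounding choices are mine and match the slides' approximations.

/* Unroll factor = (bus bits - 24) / 8 + 1; cycles = fill + remaining groups. */
#include <stdio.h>

int main(void) {
    int widths[] = { 64, 128 };
    for (int w = 0; w < 2; w++) {
        int bus = widths[w];
        int unroll = (bus - 24) / 8 + 1;               /* parallel iterations */
        double cycles = 4 + (100.0 - unroll) / unroll; /* fill + steady state */
        double speedup = (1500.0 / cycles) / 15;       /* 3 GHz uP vs 200 MHz FPGA */
        printf("%3d-bit bus: %2d iterations, ~%.0f cycles, ~%.0fx speedup\n",
               bus, unroll, cycles, speedup);
    }
    return 0; /* prints ~20 cycles/5x for 64 bits, ~10 cycles/10x for 128 bits */
}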

Importance of Memory Bandwidth
Performance with wider memories: with a 128-bit bus, the 64 extra bits allow 8 more parallel iterations (64 bits / 8 bits per iteration), so 8 + the 6 original unrolled iterations = 14 total parallel iterations.
Total cycles = 4 to fill the pipeline + (100-14)/14 = ~10.
Speedup: (1500/10) * (1/15) = 10x. Doubling the memory width increased the speedup from 5x to 10x!
Important point: the performance of pipelined hardware is often limited by memory bandwidth. More bandwidth => more unrolling => more parallelism => BIG SPEEDUP.

Delay Registers
Common mistake: forgetting to add registers for values that are not used during a given cycle. Such values must be "delayed" through registers until they are needed. In the incorrect circuit, b[i+2] is wired directly into the second adder; in the correct circuit, it first passes through a delay register so it arrives in the same cycle as b[i] + b[i+1].

Delay Registers
Illustration of incorrect delays, cycle 1: b[0], b[1], and b[2] enter. The first adder computes b[0] + b[1], but nothing captures b[2].

Delay Registers
Illustration of incorrect delays, cycle 2: b[1], b[2], and b[3] enter. The second adder receives b[0]+b[1] from the stage register, but its other input now carries b[3] instead of the b[2] it needs, so the output is wrong (?????).

Delay Registers
Illustration of incorrect delays, cycle 3: b[2], b[3], and b[4] enter. The value that should be b[0]+b[1]+b[2] was computed with the wrong operand, and every subsequent result (b[1]+b[2] combined with b[4], and so on) is also corrupted (?????).
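The misalignment is easy to reproduce in software by modeling the two variants side by side; the register names and data are my own.

/* Contrasts the un-delayed wire (incorrect) with a delay register (correct). */
#include <stdio.h>

int main(void) {
    int b[8];
    for (int i = 0; i < 8; i++) b[i] = i;

    int sum_reg = 0, delay_reg = 0;      /* stage-1 pipeline registers */
    for (int i = 0; i < 4; i++) {
        int bad  = sum_reg + b[i + 2];   /* incorrect: reads the current wire */
        int good = sum_reg + delay_reg;  /* correct: reads the registered copy */
        if (i > 0)
            printf("i=%d: incorrect=%d correct=%d (expected %d)\n",
                   i, bad, good, b[i-1] + b[i] + b[i+1]);
        sum_reg = b[i] + b[i + 1];       /* registers update at end of cycle */
        delay_reg = b[i + 2];
    }
    return 0; /* e.g., i=1: incorrect=4 correct=3 (expected 3) */
}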

Another Example (Your Turn)

short b[1004], a[1000];
for (i=0; i < 1000; i++)
  a[i] = (b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4]) / 5;

Steps:
1) Build the DFG for the body of the loop.
2) Add pipeline stages.
3) Map operations to hardware resources (assume divide takes one cycle).
4) Determine the maximum amount of unrolling (memory bandwidth = 128 bits/cycle).
5) Determine the performance compared to a uP. Assume 15 instructions per iteration, CPI = 1.5, and a clock 15x faster than the RC circuit's.
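When working the exercise, a plain software version of the loop is handy as a golden reference for checking a pipelined implementation; the test data below is arbitrary and my own.

/* Golden reference model for the exercise loop. */
#include <stdio.h>

short b[1004], a[1000];

int main(void) {
    for (int i = 0; i < 1004; i++) b[i] = (short)(i % 7);
    for (int i = 0; i < 1000; i++)
        a[i] = (short)((b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4]) / 5);
    printf("a[0]=%d a[999]=%d\n", a[0], a[999]);
    return 0;
}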