Exploiting Parallelism
Greg Stitt, ECE Department, University of Florida
Why are Custom Circuits Fast?
- Circuits can exploit massive amounts of parallelism.
- Microprocessor/VLIW/superscalar parallelism: ~5 instructions per cycle in the best case, which rarely occurs.
- Digital circuits: potentially thousands of operations per cycle, i.e., as many operations as will fit in the device.
- Circuits also support different types of parallelism.
Types of Parallelism: Bit-level Parallelism

C code for bit reversal:

    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Binary compilation for the processor (MIPS disassembly, excerpt):

    sll $v1[3], $v0[2], 0x10
    srl $v0[2], $v0[2], 0x10
    or  $v0[2], $v1[3], $v0[2]
    srl $v1[3], $v0[2], 0x8
    and $v1[3], $v1[3], $t5[13]
    sll $v0[2], $v0[2], 0x8
    and $v0[2], $v0[2], $t4[12]
    srl $v1[3], $v0[2], 0x4
    and $v1[3], $v1[3], $t3[11]
    sll $v0[2], $v0[2], 0x4
    and $v0[2], $v0[2], $t2[10]
    ...

Processor: requires between 32 and 128 cycles to execute this instruction sequence.
FPGA: the circuit simply wires each bit of the original x value to its bit-reversed position, so it requires only 1 cycle, a speedup of 32x to 128x at the same clock.
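For reference, here is the slide's swap sequence collected into a small self-contained C program and cross-checked against a naive one-bit-at-a-time reversal. The test harness and function names are ours, and the two masks the slide leaves blank are assumed to be the standard 0x33333333 and 0x55555555:

    #include <stdint.h>
    #include <stdio.h>

    /* The slide's mask-and-shift bit reversal, collected into one routine. */
    static uint32_t reverse32(uint32_t x)
    {
        x = (x >> 16) | (x << 16);
        x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
        x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
        x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
        x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
        return x;
    }

    /* Naive reference: move bit i to bit 31-i, one bit per loop iteration. */
    static uint32_t reverse32_naive(uint32_t x)
    {
        uint32_t r = 0;
        for (int i = 0; i < 32; i++)
            r |= ((x >> i) & 1u) << (31 - i);
        return r;
    }

    int main(void)
    {
        uint32_t v = 0x12345678u;
        printf("%08x -> %08x (naive: %08x)\n", v, reverse32(v), reverse32_naive(v));
        return 0;
    }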
Types of Parallelism: Arithmetic-level Parallelism (i.e., "wide" parallelism)

C code:

    for (i=0; i < 128; i++)
        y += c[i] * x[i];

Processor: 1000's of instructions, several thousand cycles.
FPGA circuit: 128 multipliers feeding a tree of adders, so the whole loop takes ~7 cycles (assuming one level of operations per cycle), a speedup of more than 100x at the same clock.
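The ~7-cycle figure comes from the depth of the adder tree: 128 products can be summed in log2(128) = 7 levels. A software sketch of that datapath (our own illustration, not from the slides):

    #include <stdint.h>

    /* Sketch of the wide datapath: 128 parallel multiplies followed by a
       binary adder tree.  Each pass of the reduction loop corresponds to one
       tree level, and 128 inputs need log2(128) = 7 levels, hence ~7 cycles. */
    int32_t dot128(const int16_t c[128], const int16_t x[128])
    {
        int32_t p[128];
        int i, width;

        for (i = 0; i < 128; i++)               /* 128 multipliers, all parallel in HW */
            p[i] = (int32_t)c[i] * x[i];

        for (width = 128; width > 1; width /= 2)    /* 7 levels of the adder tree */
            for (i = 0; i < width / 2; i++)
                p[i] = p[2*i] + p[2*i + 1];

        return p[0];
    }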
Types of Parallelism: Pipeline Parallelism (i.e., "deep" parallelism)

    for (i=0; i < ...; i++) {
        y[i] += c[i]*x[i] + c[i+1]*x[i+1] + ... + c[i+11]*x[i+11];
    }

Problem: replicating the 12 multipliers for many iterations at once would require huge area.
Solution: use only the resources required by one iteration, separate them with pipeline registers, and start a new iteration every cycle.
After the pipeline fills, the datapath performs 12 multiplies and 12 adds every cycle.
Performance can be increased further by "unrolling" the loop, i.e., replicating the datapath to perform multiple iterations every cycle.
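A C view of the loop being pipelined (a sketch: the slide leaves the trip count blank, so the bound N below is an assumption, and c[] and x[] are assumed to hold at least N+11 elements):

    #include <stdint.h>

    #define N 1024   /* assumed trip count; the slide does not specify it */

    void fir12(int32_t y[N], const int16_t c[], const int16_t x[])
    {
        for (int i = 0; i < N; i++) {
            int32_t acc = 0;
            for (int j = 0; j < 12; j++)     /* 12 multiplies + adds per iteration */
                acc += (int32_t)c[i + j] * x[i + j];
            y[i] += acc;                     /* in HW, one iteration starts every cycle */
        }
    }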
Types of Parallelism: Task-level Parallelism

e.g., MPEG-2: each block of its block diagram (not shown here) is a task, all tasks execute in parallel, and each task may itself contain bit-level, wide, and deep parallelism.
How to exploit parallelism?
General idea:
1) Identify tasks.
2) Create a circuit for each task that exploits deep and wide parallelism.
3) Add buffers between tasks to enable communication (a software sketch of this buffering structure appears below).

How do we create the circuit for each task?
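Before turning to that question, here is a minimal software sketch of step 3, the task-plus-buffer structure. This is our own illustration, not from the slides: two "tasks" connected by a small FIFO, which in hardware would be two circuits and a buffer.

    #include <stdio.h>

    #define FIFO_DEPTH 8

    typedef struct { int data[FIFO_DEPTH]; int head, tail, count; } fifo_t;

    static int fifo_push(fifo_t *f, int v) {
        if (f->count == FIFO_DEPTH) return 0;            /* full: producer stalls  */
        f->data[f->tail] = v;
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count++;
        return 1;
    }

    static int fifo_pop(fifo_t *f, int *v) {
        if (f->count == 0) return 0;                     /* empty: consumer stalls */
        *v = f->data[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
        return 1;
    }

    int main(void)
    {
        fifo_t q = { {0}, 0, 0, 0 };
        int produced = 0, consumed = 0, v;

        while (consumed < 16) {                          /* both tasks "run" each step */
            if (produced < 16 && fifo_push(&q, produced * produced))   /* task 1 */
                produced++;
            if (fifo_pop(&q, &v)) {                                     /* task 2 */
                printf("%d ", v);
                consumed++;
            }
        }
        printf("\n");
        return 0;
    }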
Pipelining Example: Create the DFG (data flow graph) for the body of the loop.

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

DFG: inputs b[i], b[i+1], b[i+2]; a first adder computes b[i] + b[i+1], a second adder adds b[i+2], and the result is the output a[i].
Pipelining Example: Add pipeline stages to each level of the DFG.
(Figure: the same DFG with registers after the inputs, after the first adder along with a delay register for b[i+2], and after the second adder.)
Pipelining Example, Cycle 1:

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

The input registers load b[0], b[1], b[2].
Pipelining Example, Cycle 2: The input registers load b[1], b[2], b[3]; the second stage now holds b[0]+b[1] and the delayed b[2].
Pipelining Example, Cycle 3: The input registers load b[2], b[3], b[4]; the second stage holds b[1]+b[2] and the delayed b[3]; the third stage holds b[0]+b[1]+b[2].
Pipelining Example, Cycle 4: The input registers load b[3], b[4], b[5]; the second stage holds b[2]+b[3] and the delayed b[4]; the third stage holds b[1]+b[2]+b[3]; and the first output, a[0], appears. It takes 4 cycles to fill the pipeline.
Pipelining Example, Cycle 5: The second stage holds b[3]+b[4] and the delayed b[5]; the third stage holds b[2]+b[3]+b[4]; and a[1] is output. From this point on the circuit produces one output per cycle, with 99 more outputs until completion.
Total cycles = 4 (to fill the pipeline) + 99 = 103.
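To make the register behavior concrete, here is a small cycle-accurate software model of this datapath. It is our own sketch, not from the slides: stage 1 registers the three inputs, stage 2 computes b[i]+b[i+1] and delays b[i+2], and stage 3 produces the sum. The first result appears on cycle 4 and the run finishes in 103 cycles.

    #include <stdio.h>

    #define N 100   /* loop trip count from the slides */

    int main(void)
    {
        int b[N + 2], a[N];
        int r0 = 0, r1 = 0, r2 = 0;       /* stage 1: input registers               */
        int s_add = 0, s_delay = 0;       /* stage 2: first sum + delayed b[i+2]    */
        int s_out = 0;                    /* stage 3: final sum                     */
        int i, cycle;

        for (i = 0; i < N + 2; i++)
            b[i] = i;                     /* arbitrary test data                    */

        for (cycle = 1; cycle <= N + 3; cycle++) {
            /* read every register before overwriting it, so all registers appear
               to update simultaneously at the clock edge                           */
            if (cycle >= 4)
                a[cycle - 4] = s_out;     /* first output a[0] appears on cycle 4   */
            s_out   = s_add + s_delay;    /* stage 3 <- stage 2                     */
            s_add   = r0 + r1;            /* stage 2 <- stage 1                     */
            s_delay = r2;                 /* delay register for b[i+2]              */
            if (cycle <= N) {             /* stage 1 <- memory                      */
                r0 = b[cycle - 1];
                r1 = b[cycle];
                r2 = b[cycle + 1];
            }
        }
        printf("a[0] = %d, a[99] = %d, total cycles = %d\n", a[0], a[N - 1], N + 3);
        return 0;
    }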
uP Performance Comparison

Assumptions: 10 instructions for the loop body, CPI (cycles per instruction) = 1.5, and a microprocessor clock 2x faster than the circuit's clock.
Total SW cycles: 100 * 10 * 1.5 = 1,500 cycles.
Circuit speedup: (1500/103) * (1/2) = ~7.3x.
uP Performance Comparison

Assumptions: 10 instructions for the loop body, CPI = 1.5, and a microprocessor clock 10x faster than the circuit's clock (e.g., an FPGA running at 200 MHz).
Total SW cycles: 100 * 10 * 1.5 = 1,500 cycles.
Circuit speedup: (1500/103) * (1/10) = ~1.46x.
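The arithmetic on these two slides reduces to (software cycles / circuit cycles) scaled down by the clock ratio. A quick sketch that reproduces both numbers (the function and variable names are ours):

    #include <stdio.h>

    /* speedup = (software cycles / circuit cycles) / (uP clock / circuit clock) */
    static double speedup(double sw_cycles, double hw_cycles, double clock_ratio)
    {
        return (sw_cycles / hw_cycles) / clock_ratio;
    }

    int main(void)
    {
        double sw = 100 * 10 * 1.5;   /* 100 iterations x 10 instructions x CPI 1.5 = 1500 */
        printf("uP clock 2x faster:  %.2fx\n", speedup(sw, 103, 2.0));   /* ~7.3x  */
        printf("uP clock 10x faster: %.2fx\n", speedup(sw, 103, 10.0));  /* ~1.46x */
        return 0;
    }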
Entire Circuit

RAM delivers "streams" of data to the pipelined datapath through an input address generator, buffer, and controller; a separate RAM receives the "streams" of data written by the datapath through an output buffer and address generator.
(Figure: RAM -> input address generator + buffer + controller -> pipelined datapath -> output buffer + address generator -> RAM.)
Pipelining, Cont.: Possible Improvement

Why not execute multiple iterations at the same time (i.e., loop unrolling)? This is only possible when there are no dependencies between iterations.

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Unrolled DFG: two copies of the DFG side by side, sharing the overlapping inputs, with inputs b[i], b[i+1], b[i+2], b[i+3] and outputs a[i] and a[i+1].
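A software view of what the 2x-unrolled datapath computes each cycle (a sketch; it assumes the trip count of 100 stays even so no cleanup iteration is needed):

    for (i = 0; i < 100; i += 2) {
        a[i]   = b[i]   + b[i+1] + b[i+2];   /* original iteration               */
        a[i+1] = b[i+1] + b[i+2] + b[i+3];   /* next iteration, same cycle in HW */
    }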
Pipelining, Cont.: How Much to Unroll?

Unrolling is limited by memory bandwidth and area:
- Must be able to read all inputs once per cycle.
- Must be able to write all outputs once per cycle.
- Must have sufficient area for all operations in the unrolled DFG.
Unrolling Example: Original Circuit

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

The first iteration requires 3 inputs: b[0], b[1], and b[2], producing a[0].
Unrolling Example: Unrolled Circuit

Each additional unrolled iteration requires only one more input: the two-iteration circuit reads b[0], b[1], b[2], b[3] and produces a[0] and a[1].
Unrolling Example: Unrolled Circuit (Steady State)

Each cycle brings in 4 inputs instead of 6, because adjacent iterations share inputs: the input registers hold b[1], b[2], b[3], b[4] while the second stage holds b[0]+b[1] with its delayed b[2], and b[1]+b[2] with its delayed b[3].
Performance After Unrolling

How much unrolling? Assume b[] elements are 8 bits.
- The first iteration requires 3 elements = 24 bits.
- Each additional unrolled iteration requires only 1 more element = 8 bits, due to the overlapping inputs.
- With a memory bandwidth of 64 bits/cycle, 6 iterations can be performed in parallel (24 + 5*8 = 64 bits).

New performance (assume a 3 GHz uP and a 200 MHz FPGA):
The unrolled pipeline requires 4 cycles to fill, plus (100-6)/6 iterations, roughly 20 cycles total.
With unrolling, the FPGA is (1500/20) * (1/15) = 5x faster than the 3 GHz microprocessor.
Importance of Memory Bandwidth

Performance with wider memories: with a 128-bit bus, 14 iterations can run in parallel (64 extra bits / 8 bits per iteration = 8 additional iterations, plus the 6 original unrolled iterations, for 14 total).
Total cycles = 4 to fill the pipeline + (100-14)/14, roughly 10 cycles.
Speedup = (1500/10) * (1/15) = 10x. Doubling the memory width increased the speedup from 5x to 10x.

Important point: the performance of pipelined hardware is often limited by memory bandwidth. More bandwidth => more unrolling => more parallelism => BIG SPEEDUP.
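The bandwidth arithmetic from this slide and the previous one in one small sketch (our own; the constants mirror the slides' assumptions of 8-bit elements, a 4-cycle fill, 1,500 software cycles, and a 15x clock difference, so the printed speedups land at roughly the slides' rounded 5x and 10x):

    #include <stdio.h>

    int main(void)
    {
        int buses[] = { 64, 128 };                 /* memory bus widths in bits */
        int w;

        for (w = 0; w < 2; w++) {
            /* the first iteration needs 3 x 8 = 24 bits; each extra unrolled
               iteration adds one 8-bit element because the windows overlap    */
            int unroll = 1 + (buses[w] - 24) / 8;
            double cycles = 4 + (100.0 - unroll) / unroll;   /* fill + steady state */
            double s = (1500.0 / cycles) / 15.0;             /* 15x slower clock    */
            printf("%3d-bit bus: unroll = %2d, cycles ~ %.0f, speedup ~ %.1fx\n",
                   buses[w], unroll, cycles, s);
        }
        return 0;
    }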
Delay Registers: A Common Mistake

A common mistake is forgetting to add registers for values that are not used during a given cycle; such values must be "delayed" (passed along through registers) until they are needed.
(Figure: the correct datapath passes b[i+2] through a delay register alongside the first adder; the incorrect one omits that register.)
Delay Registers: Illustration of Incorrect Delays, Cycle 1

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

The input registers load b[0], b[1], b[2], exactly as in the correct design.
Delay Registers: Illustration of Incorrect Delays, Cycle 2

The input registers advance to b[1], b[2], b[3], and the second stage holds b[0]+b[1]. Because b[2] was never captured in a delay register, the second adder's other input is now wrong (?????): the value it needs has already been replaced in the input register.
Delay Registers: Illustration of Incorrect Delays, Cycle 3

The input registers advance to b[2], b[3], b[4], and the second stage holds b[1]+b[2]. The third stage now holds b[0] + b[1] + b[3] instead of the intended b[0] + b[1] + b[2] (?????): without the delay register, the second adder combined values from two different iterations.
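A tiny way to see the bug in software (an illustrative sketch, not from the slides): by the time the second add happens, the input register already holds the next iteration's b[i+2].

    #include <stdio.h>

    int main(void)
    {
        int b[6] = { 10, 20, 30, 40, 50, 60 };
        int i = 0;
        /* correct datapath: the delay register preserved iteration i's b[i+2]   */
        int correct = (b[i] + b[i+1]) + b[i+2];
        /* incorrect datapath: the input register has already advanced to
           iteration i+1, so the second adder sees b[(i+1)+2] = b[i+3] instead   */
        int wrong = (b[i] + b[i+1]) + b[(i+1) + 2];
        printf("correct a[0] = %d, without delay register a[0] = %d\n",
               correct, wrong);
        return 0;
    }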
Another Example: Your Turn

Steps:
1) Build the DFG for the body of the loop.
2) Add pipeline stages.
3) Map operations to hardware resources (assume divide takes one cycle).
4) Determine the maximum amount of unrolling (memory bandwidth = 128 bits/cycle).
5) Determine the performance compared to the uP (assume 15 instructions per iteration, CPI = 1.5, and a clock 15x faster than the RC).

    short b[1004], a[1000];
    for (i=0; i < 1000; i++)
        a[i] = (b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4]) / 5;