Published by Job Garrison. Modified over 9 years ago.
A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops
João M. P. Cardoso
ITIV, University of Karlsruhe, July 2, 2007
Portugal
2 Motivation
- Many applications have sequences of tasks, e.g., in image and video processing algorithms
- Contemporary FPGAs have plenty of room to accommodate highly specialized, complex architectures
- It is time to creatively "use available resources" rather than to simply "save resources"
3 Motivation: Computing Stages Sequentially
[Figure: Task A, Task B, and Task C placed one after another along the time axis]
4 Motivation: Computing Stages Concurrently
[Figure: Task A, Task B, and Task C overlapping along the time axis]
5 Outline
- Objective
- Loop Pipelining
- Producer/Consumer Computing Stages
- Pipelining Sequences of Loops
- Inter-Stage Communication
- Experimental Setup and Results
- Related Work
- Conclusions
- Future Work
6 Objectives
- To speed up applications with multiple, data-dependent stages, where each stage is seen as a set of nested loops
- How?
  - Pipelining those sequences of data-dependent stages using fine-grain synchronization schemes
  - Taking advantage of field-custom computing structures (FPGAs)
7 Loop Pipelining
- Attempt to overlap loop iterations
- Significant speedups are achieved
- But how to pipeline sequences of loops?
[Figure: loop iterations I1, I2, I3, I4 along the time axis]
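As a rough illustration of why overlapping iterations pays off, the sketch below compares cycle counts for sequential and pipelined execution of a single loop. The iteration latency and the initiation interval `ii` are illustrative parameters, not figures from the talk.

```python
def sequential_cycles(iterations, latency):
    """Cycles when each iteration starts only after the previous one ends."""
    return iterations * latency


def pipelined_cycles(iterations, latency, ii):
    """Cycles when a new iteration starts every `ii` cycles (initiation interval)."""
    if iterations == 0:
        return 0
    # The first iteration pays the full latency; each later one adds `ii` cycles.
    return latency + (iterations - 1) * ii
```

For the four iterations I1 to I4, with an assumed latency of 4 cycles and `ii = 1`, this gives 16 cycles sequentially and 7 cycles pipelined.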
8 Computing Stages Sequentially
Producer: ... A[2] A[1] A[0]
Consumer: A[0] A[1] A[2] ...
9 Computing Stages Concurrently
- Ordered producer/consumer pairs
- Send/receive
Producer: ... A[2] A[1] A[0]
Consumer: A[0] A[1] A[2] ...
FIFO with N stages: A[3] ... A[2] A[1] A[0]
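The ordered send/receive case can be sketched in software as a bounded FIFO shared by a producer and a consumer. This is a minimal cycle-by-cycle simulation, not the hardware scheme from the talk; the function name and the one-transfer-per-cycle timing are assumptions.

```python
from collections import deque


def run_fifo_pipeline(data, depth):
    """Pass `data` from an in-order producer to an in-order consumer via a FIFO."""
    fifo = deque()
    produced = 0
    consumed = []
    cycles = 0
    while len(consumed) < len(data):
        # Receive: the consumer takes the oldest element, if any is available.
        if fifo:
            consumed.append(fifo.popleft())
        # Send: the producer stalls whenever all `depth` stages are occupied.
        if produced < len(data) and len(fifo) < depth:
            fifo.append(data[produced])
            produced += 1
        cycles += 1
    return consumed, cycles
```

Because both sides use the same order, the consumer sees exactly the producer's sequence, and the FIFO depth only affects how far the producer may run ahead.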
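10 Computing Stages Concurrently
- Unordered producer/consumer pairs
- Empty/full table
Producer: ... A[3] A[5] A[1]
Consumer: A[3] A[1] A[5] ...
[Figure: data memory with a 1-bit empty/full flag per cell; the cells holding A[1] and A[5] are flagged 1 (full), all other cells are flagged 0 (empty)]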
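When the consumer reads in a different order than the producer writes, a FIFO no longer fits, and a per-cell empty/full flag takes over the synchronization. The sketch below simulates that idea in software; `run_flagged_pipeline` and its one-write-per-step timing are hypothetical, chosen only to mirror the A[1], A[5], A[3] example on the slide.

```python
def run_flagged_pipeline(write_order, read_order, size):
    """Simulate an out-of-order producer/consumer pair with a 1-bit flag table."""
    data = [None] * size
    full = [False] * size  # empty/full table, one bit per data cell
    out, stalls = [], 0
    w = r = 0
    while r < len(read_order):
        # Producer: store one element per step and flag its cell as full.
        if w < len(write_order):
            idx = write_order[w]
            data[idx] = f"A[{idx}]"
            full[idx] = True
            w += 1
        # Consumer: proceed only if the cell it needs is already full.
        idx = read_order[r]
        if full[idx]:
            out.append(data[idx])
            r += 1
        elif w >= len(write_order):
            raise RuntimeError(f"cell {idx} is never written")
        else:
            stalls += 1
    return out, stalls
```

With writes in the order 1, 5, 3 and reads in the order 3, 1, 5, the consumer stalls twice waiting for A[3] and then drains the remaining cells without further waiting.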
11 Main Idea: FDCT
[Figure: Loop 1, Loop 2, and Loop 3 under a global FSM, with data input, intermediate data, and data output. Timeline: execution of Loops 1, 2, then execution of Loop 3. Intermediate data array indices: 0 1 2 3 4 5 6 7 and 8 16 24 32 40 48 56]
12 Main Idea: FDCT
- Out-of-order producer/consumer pairs
- How to overlap computing stages?
[Figure: two copies of the intermediate data array, indices 0 1 2 3 4 5 6 7 and 8 16 24 32 40 48 56]
13 Main Idea: Pipelined FDCT
[Figure: Loop 1, Loop 2, and Loop 3 with FSM 1 and FSM 2; intermediate data held in a dual-port RAM, plus a dual-port 1-bit empty/full table; data input and data output. Timeline: execution of Loops 1, 2 and execution of Loop 3. Intermediate data array indices: 0 1 2 3 4 5 6 7 and 8 16 24 32 40 48 56]
14 Main Idea
[Figure: Task A and Task B connected through a memory]
15 Possible Scenarios
- Single write, single read: accepted without code changes
- Single write, multiple reads: accepted without code changes (by using an N-bit table)
- Multiple writes, single read: needs code transformations
- Multiple writes, multiple reads: needs code transformations
16 Inter-Stage Communication
- Responsible for:
  - Communicating data between pipelined stages
  - Flagging data availability
- Solutions:
  - Perfect associative memory: cost too high
  - Memory for data plus a 1-bit table (each cell holds full/empty information): sized to the data set to communicate
  - Decrease the size using a hash-based solution
[Figure: data memory with a 1-bit empty/full flag per cell; the cells holding A[1] and A[5] are flagged 1 (full), all other cells are flagged 0 (empty)]
17 Inter-Stage Communication: memory plus 1-bit table

Producer (Loops 1 and 2):

    …
    boolean tab[SIZE]={0, 0,…, 0};
    …
    for(i=0; i<num_fdcts; i++){ //Loop 1
      for(j=0; j<N; j++){ //Loop 2
        // loads
        // computations
        // stores
        tmp[48+i_1] = F6 >> 13; tab[48+i_1] = true; // store, then flag the cell as full
        tmp[56+i_1] = F7 >> 13; tab[56+i_1] = true;
        i_1++;
      }
      i_1 += 56;
    }

Consumer (Loop 3):

    i_1 = 0;
    for (i=0; i<N*num_fdcts; i++){ //Loop 3
      L1: f0 = tmp[i_1]; if(!tab[i_1]) goto L1; // repeat the load until the cell is full
      L2: f1 = tmp[1+i_1]; if(!tab[1+i_1]) goto L2;
      // remaining loads
      // computations
      …
      // stores
      i_1 += 8;
    }
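To see how much overlap this index pattern allows, the sketch below replays it in software and reports, for each Loop 3 iteration, how many Loop 2 iterations must have finished before all of its loads find full cells. It assumes N = 8 and that each Loop 2 iteration stores all eight elements of one column (`i_1`, `8+i_1`, ..., `56+i_1`); the slide only shows the last two stores, so both points are assumptions, as are the function names.

```python
N = 8  # assumed block dimension (8x8 blocks)


def producer_writes(num_fdcts):
    """Yield the tmp indices stored by each Loop 2 iteration (one column)."""
    i_1 = 0
    for _ in range(num_fdcts):          # Loop 1
        for _ in range(N):              # Loop 2
            yield [row * 8 + i_1 for row in range(N)]
            i_1 += 1
        i_1 += 56


def consumer_reads(num_fdcts):
    """Yield the tmp indices loaded by each Loop 3 iteration (one row)."""
    i_1 = 0
    for _ in range(N * num_fdcts):      # Loop 3
        yield [i_1 + col for col in range(N)]
        i_1 += 8


def earliest_start(num_fdcts):
    """Producer iterations that must complete before each consumer iteration."""
    ready_at = {}
    for step, indices in enumerate(producer_writes(num_fdcts), start=1):
        for idx in indices:
            ready_at[idx] = step
    return [max(ready_at[i] for i in indices)
            for indices in consumer_reads(num_fdcts)]
```

Under these assumptions, every row of a block depends on all eight column iterations of that block, so the overlap comes from Loop 3 working on one block while Loops 1 and 2 already produce the next.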
18 Inter-Stage Communication: hash-based solution

Producer (Loops 1 and 2):

    …
    boolean tab[SIZE]={0, 0,…, 0};
    …
    for(i=0; i<num_fdcts; i++){ //Loop 1
      for(j=0; j<N; j++){ //Loop 2
        // loads
        // computations
        // stores
        tmp[H(48+i_1)] = F6 >> 13; tab[H(48+i_1)] = true; // hashed address for data and flag
        tmp[H(56+i_1)] = F7 >> 13; tab[H(56+i_1)] = true;
        i_1++;
      }
      i_1 += 56;
    }

Consumer (Loop 3):

    i_1 = 0;
    for (i=0; i<N*num_fdcts; i++){ //Loop 3
      L1: f0 = tmp[H(i_1)]; if(!tab[H(i_1)]) goto L1; // repeat the load until the cell is full
      L2: f1 = tmp[H(1+i_1)]; if(!tab[H(1+i_1)]) goto L2;
      // remaining loads
      // computations
      …
      // stores
      i_1 += 8;
    }
19 Inter-Stage Communication: hash-based solution
- We did not want to add delays to the load/store operations
- Use H(k) = k MOD m
- When m is a power of two (2^N), H(k) can be implemented by just using the log2(m) least significant bits of k to address the cache (this translates to simple interconnections)
[Figure: addresses pass through H to reach the data memory and its empty/full table; the cells holding A[5] and A[1] are flagged 1 (full), all other cells are flagged 0 (empty)]
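The reason H(k) = k MOD m costs no extra delay can be checked in a few lines: assuming m is a power of two, which is what addressing with the low log2(m) bits requires, the modulo equals a bit mask. The function names are illustrative.

```python
def hash_mod(k, m):
    """H(k) = k MOD m, as stated on the slide."""
    return k % m


def hash_low_bits(k, m):
    """Same hash using only the log2(m) least significant bits of k."""
    if m <= 0 or m & (m - 1):
        raise ValueError("m must be a power of two")
    return k & (m - 1)
```

In hardware the mask is just a selection of address wires, so no arithmetic sits in the load/store path.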
20 Inter-Stage Communication: hash-based solution, H(k) = k MOD m
- Single read (L = 1): the flag R is set to 1 and then cleared to 0
[Figure panels: a) write, b) read, c) empty/full update]
21 Inter-Stage Communication: hash-based solution, H(k) = k MOD m
- Multiple reads (L > 1): R = 11...1 (L ones), updated with a right shift (>>=)
[Figure panels: a) write, b) read, c) empty/full update]
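One way to read the R pattern on these two slides is as a small shift register per cell: a write loads L ones, each read shifts one bit out, and the cell counts as empty once R reaches zero. The sketch below implements that reading; it is an interpretation of the slide, and the class and method names are hypothetical.

```python
class EmptyFullCell:
    """One buffer cell that must be read `reads_per_write` times before reuse."""

    def __init__(self, reads_per_write):
        self.reads_per_write = reads_per_write  # L on the slide
        self.r = 0                              # R == 0 means the cell is empty
        self.value = None

    def is_full(self):
        return self.r != 0

    def write(self, value):
        if self.is_full():
            raise RuntimeError("collision: cell still has pending reads")
        self.value = value
        self.r = (1 << self.reads_per_write) - 1  # R = 11...1 (L ones)

    def read(self):
        if not self.is_full():
            return None  # the consumer would have to stall and retry
        self.r >>= 1     # one pending read consumed
        return self.value
```

With `reads_per_write = 1` this reduces to the single-read case, where R is simply 1 after a write and 0 after the read.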
22 Buffer size calculation
- By monitoring the behavior of the communication component
- For each read and write, determine the size of the buffer needed to avoid collisions
- Done during RTL simulation
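The same idea can be sketched offline: replay a recorded sequence of writes and reads and look for the smallest buffer in which no write lands on a cell that is still full. The trace format, the single-read assumption, and the restriction to power-of-two sizes with H(k) = k MOD m are assumptions of this sketch; the talk obtains the sizes from a monitor inside the RTL simulation.

```python
def collides(trace, m):
    """Return True if replaying `trace` with H(k) = k % m overwrites unread data."""
    occupied = {}  # slot -> key currently stored there
    for op, k in trace:
        slot = k % m
        if op == "w":
            if slot in occupied:
                return True
            occupied[slot] = k
        else:  # "r": a single read empties the cell
            if occupied.get(slot) != k:
                raise ValueError(f"read of {k} before it was written")
            del occupied[slot]
    return False


def min_buffer_size(trace, max_m=1 << 20):
    """Smallest power-of-two buffer size that replays `trace` without collisions."""
    m = 1
    while m <= max_m:
        if not collides(trace, m):
            return m
        m *= 2
    raise ValueError("no collision-free size up to max_m")
```

For example, the trace `[("w", 0), ("w", 8), ("r", 0), ("w", 1), ("r", 8), ("r", 1)]` needs 16 cells, because keys 0 and 8 are live at the same time and map to the same slot for every smaller power of two.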
23 Experimental Setup
- Compilation flow
- Uses our previous work on compiling algorithms in a Java subset to FPGAs
24 Experimental Setup: simulation back-end
[Figure: simulation back-end flow. Inputs: datapath.xml, fsm.xml, rtg.xml. XSLTs: to dotty, to hds, to java, to vhdl. Generated files: datapath.hds, fsm.java, rtg.java, fsm.class, rtg.class. Other elements: HADES, library of operators (Java), I/O data (RAMs and stimulus), ANT build file]
25 Experimental Results: benchmarks

| Algorithm | # Stages | # Loops | Description |
|---|---|---|---|
| fdct | 2 {s1,s2} | 3 | Fast DCT (Discrete Cosine Transform) |
| fwt2D | 4 {s1,s2,s3,s4} | 8 | Forward Haar Wavelet |
| RGB2gray + histogram | 2 {s1,s2} | 2 | Transforms an RGB image to a gray image with 256 levels and determines the histogram of the gray image |
| Smooth + sobel, 3 versions: (a), (b), (c) | 2 {s1,s2} | 6 | Smooth image operation based on 3×3 windows, with the resulting image being the input to the sobel edge detector. (a): original code; (b): two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the array with coefficients); (c): the same as (b) plus elimination of redundant array references in the original code of sobel. |
26 Experimental Results: FDCT (speed-up achieved by Pipelining Sequences of Loops, PSL)
27 Experimental Results

| Algorithm | Input data size | Stages | #cc w/o PSL | Speed-up upper bound | #cc w/ PSL | Speed-up |
|---|---|---|---|---|---|---|
| fdct | 800×600 | (s1,s2); (s1); (s2) | 3,930,005; 1,950,003; 1,920,003 | 2.02 | 1,830,215 | 2.02 |
| fwt2D | 512×512 | (s1,s2,s3,s4); (s1,s2); (s3,s4) | 4,724,745; 2,362,373 | 2.00 | 3,664,917 | 1.29 |
| RGB2gray + histogram | 800×600 | (s1,s2); (s1); (s2) | 6,720,025; 2,880,015; 3,840,015 | 1.75 | 3,840,007 | 1.75 |
| Smooth + sobel (a) | 800×600 | (s1,s2); (s1); (s2) | 49,634,009; 32,929,473; 16,606,951 | 1.51 | 32,929,489 | 1.51 |
| Smooth + sobel (b) | 800×600 | (s1,s2); (s1); (s2) | 30,068,645; 13,364,109; 16,606,951 | 1.81 | 16,640,509 | 1.81 |
| Smooth + sobel (c) | 800×600 | (s1,s2); (s1); (s2) | 25,773,809; 13,364,109; 11,862,791 | 1.92 | 13,364,117 | 1.92 |
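As a worked example of how the speed-up columns relate to the cycle counts, the sketch below recomputes the RGB2gray + histogram row. It reads #cc as clock-cycle counts and assumes the upper bound is the total cycle count without PSL divided by the cycle count of the slowest stage; the talk does not spell out that definition, so treat it as an interpretation that happens to match this row.

```python
def speedup(cycles_without_psl, cycles_with_psl):
    """Achieved speed-up: cycles without PSL over cycles with PSL."""
    return cycles_without_psl / cycles_with_psl


def speedup_upper_bound(cycles_without_psl, stage_cycles):
    """Assumed bound: total cycles over the cycles of the slowest stage."""
    return cycles_without_psl / max(stage_cycles)


# RGB2gray + histogram, 800x600, values taken from the table
total = 6_720_025
stages = [2_880_015, 3_840_015]
with_psl = 3_840_007

bound = round(speedup_upper_bound(total, stages), 2)
achieved = round(speedup(total, with_psl), 2)
```

Both values round to 1.75, in line with the table: the second stage dominates, and the pipelined run takes about as long as that stage alone.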
28 Experimental Results
- What happens with buffer sizes?
29 Experimental Results
- Adjust the latency of tasks in order to balance pipeline stages:
  - Slow down tasks with higher latency
  - Optimize slower tasks in order to reduce their latency
- Slowing down producer tasks usually reduces the size of the inter-stage buffers
30 Experimental Results: buffer sizes
[Figure: buffer sizes for the original code, with +1 cycle per iteration of the producer, and with +2 cycles per iteration of the producer; and for the original code, with optimizations in the producer, and with additional optimizations in the consumer]
31 Experimental Results: buffer sizes
32 Experimental Results: resources and frequency (Spartan-3 400)
33 Related Work
- Previous approach (Ziegler et al.)
  - Coarse-grained communication and synchronization scheme
  - FIFOs are used to communicate data between pipelining stages
  - Width of FIFO stages dependent on producer/consumer ordering
  - Less applicable
[Figure: producer and consumer element orderings over time, e.g., producer A[0] A[1] A[2] A[3] ... with consumer A[0] A[1] A[2] A[3] ..., consumer A[1] A[0] A[3] A[2] ..., or consumer A[0] A[3] A[1] A[4] A[2] A[5] ...]
34 Conclusions
- We presented a scheme to accelerate applications by pipelining sequences of loops
  - I.e., before the end of a stage (set of nested loops), a subsequent stage (set of nested loops) can start executing based on data already produced
- A data-driven scheme based on empty/full tables is used
- A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function)
- Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved, as if the stages were executed concurrently and independently
35 Future Work
- Research other hash functions
- Study slowdown effects
- Apply the technique in the context of multi-core systems
[Figure: Processor Core A and Processor Core B connected through a memory, with data_in, address_in, address_out, data_out, hash units H, a table T, a register R, and a hit/miss signal; labels L, N, M and steps (a), (b), (c)]
36 Acknowledgments
- Work partially funded by CHIADO (Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems)
  - Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002
- Based on the work done by Rui Rodrigues
- In collaboration with Pedro C. Diniz
38 Buffer Monitor