A Data-Driven Approach for Pipelining Sequences of Data-Dependent LOOPs João M. P. Cardoso ITIV, University of Karlsruhe, July 2, 2007 Portugal.


2 Motivation
- Many applications have sequences of tasks
  - E.g., in image and video processing algorithms
- Contemporary FPGAs
  - Plenty of room to accommodate highly specialized complex architectures
  - Time to creatively “use available resources” rather than simply “save resources”

3 Motivation
- Computing Stages Sequentially
[Figure: Task A, Task B, and Task C executed one after another along the time axis]

4 Motivation
- Computing Stages Concurrently
[Figure: Task A, Task B, and Task C overlapped along the time axis]

5 Outline
- Objective
- Loop Pipelining
- Producer/Consumer Computing Stages
- Pipelining Sequences of Loops
- Inter-Stage Communication
- Experimental Setup and Results
- Related Work
- Conclusions
- Future Work

6 Objectives
- To speed up applications with multiple data-dependent stages
  - Each stage is seen as a set of nested loops
- How?
  - Pipelining those sequences of data-dependent stages using fine-grain synchronization schemes
  - Taking advantage of field-custom computing structures (FPGAs)

7 Loop Pipelining
- Attempt to overlap loop iterations
- Significant speedups are achieved
- But how to pipeline sequences of loops?
[Figure: iterations I1 to I4 shown along the time axis, executed back to back and overlapped]

8 Computing Stages
- Sequentially
  Producer: ... A[2] A[1] A[0]
  Consumer: A[0] A[1] A[2] ...

9 Computing Stages
- Concurrently
  - Ordered producer/consumer pairs
  - Send/receive
  Producer: ... A[2] A[1] A[0]
  Consumer: A[0] A[1] A[2] ...
  FIFO with N stages: A[3] ... A[2] A[1] A[0]
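The ordered send/receive case can be sketched in software as a fixed-depth ring buffer. This is only an illustrative model, not the hardware FIFO from the slides: the names `fifo_t`, `fifo_send`, `fifo_recv`, and the depth `FIFO_STAGES` are assumptions introduced here.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_STAGES 4  /* hypothetical depth N of the FIFO */

typedef struct {
    int data[FIFO_STAGES];
    size_t head;   /* index of the next element to receive */
    size_t count;  /* number of elements currently stored */
} fifo_t;

/* Producer side: returns false (producer must stall) when the FIFO is full. */
static bool fifo_send(fifo_t *f, int value) {
    if (f->count == FIFO_STAGES) return false;
    f->data[(f->head + f->count) % FIFO_STAGES] = value;
    f->count++;
    return true;
}

/* Consumer side: returns false (consumer must stall) when the FIFO is empty. */
static bool fifo_recv(fifo_t *f, int *value) {
    if (f->count == 0) return false;
    *value = f->data[f->head];
    f->head = (f->head + 1) % FIFO_STAGES;
    f->count--;
    return true;
}
```

A FIFO like this only works when the consumer reads elements in exactly the order the producer wrote them, which is why the unordered case needs a different mechanism.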

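10 Computing Stages
- Concurrently
  - Unordered producer/consumer pairs
  - Empty/full table
  Producer: ... A[3] A[5] A[1]
  Consumer: A[3] A[1] A[5] ...
[Figure: table with an empty/full column and a data column; the cells holding A[1] and A[5] are flagged 1, all other cells are flagged 0]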

11 Main Idea
- FDCT
[Figure: Loop 1, Loop 2, and Loop 3 under a global FSM, with data input, intermediate data, and data output; timeline showing execution of Loops 1, 2 followed by execution of Loop 3; intermediate data array indexed 0 to 7 and 8, 16, ..., 56]

12 Main Idea
- FDCT
  - Out-of-order producer/consumer pairs
  - How to overlap computing stages?
[Figure: two intermediate data arrays, each indexed 0 to 7 and 8, 16, ..., 56]

13 Main Idea
- Pipelined FDCT
[Figure: Loop 1, Loop 2, and Loop 3 with FSM 1 and FSM 2, intermediate data (dual-port RAM), and a dual-port 1-bit table (empty/full), with data input and data output; timeline showing execution of Loops 1, 2 and execution of Loop 3; intermediate data array indexed 0 to 7 and 8, 16, ..., 56]

14 Main Idea
[Figure: Task A, Task B, Memory]

15 Possible Scenarios
- Single write, single read
  - Accepted without code changes
- Single write, multiple reads
  - Accepted without code changes (by using an N-bit table)
- Multiple writes, single read
  - Need code transformations
- Multiple writes, multiple reads
  - Need code transformations

16 Inter-Stage Communication
- Responsible for:
  - Communicating data between pipelined stages
  - Flagging data availability
- Solutions
  - Perfect associative memory
    - Cost too high
  - Memory for data plus 1-bit table (each cell represents full/empty information)
    - Size of the data set to communicate
    - Decrease size using hash-based solution
[Figure: table with an empty/full column and a data column; the cells holding A[1] and A[5] are flagged 1, all other cells are flagged 0]

17 Inter-Stage Communication
- Memory plus 1-bit table

Producer (Loops 1 and 2):

    boolean tab[SIZE]={0, 0,…, 0};
    …
    for(i=0; i<num_fdcts; i++){ //Loop 1
      for(j=0; j<N; j++){ //Loop 2
        // loads
        // computations
        // stores
        tmp[48+i_1] = F6 >> 13;
        tab[48+i_1] = true;
        tmp[56+i_1] = F7 >> 13;
        tab[56+i_1] = true;
        i_1++;
      }
      i_1 += 56;
    }

Consumer (Loop 3):

    i_1 = 0;
    for (i=0; i<N*num_fdcts; i++){ //Loop 3
      L1: f0 = tmp[i_1];
          if(!tab[i_1]) goto L1;
      L2: f1 = tmp[1+i_1];
          if(!tab[1+i_1]) goto L2;
      // remaining loads
      // computations
      …
      // stores
      i_1 += 8;
    }

18 Inter-Stage Communication
- Hash-based solution:

Producer (Loops 1 and 2):

    boolean tab[SIZE]={0, 0,…, 0};
    …
    for(i=0; i<num_fdcts; i++){ //Loop 1
      for(j=0; j<N; j++){ //Loop 2
        // loads
        // computations
        // stores
        tmp[H(48+i_1)] = F6 >> 13;
        tab[H(48+i_1)] = true;
        tmp[H(56+i_1)] = F7 >> 13;
        tab[H(56+i_1)] = true;
        i_1++;
      }
      i_1 += 56;
    }

Consumer (Loop 3):

    i_1 = 0;
    for (i=0; i<N*num_fdcts; i++){ //Loop 3
      L1: f0 = tmp[H(i_1)];
          if(!tab[H(i_1)]) goto L1;
      L2: f1 = tmp[H(1+i_1)];
          if(!tab[H(1+i_1)]) goto L2;
      // remaining loads
      // computations
      …
      // stores
      i_1 += 8;
    }

19 Inter-Stage Communication
- Hash-based solution
  - We did not want to include additional delays in the load/store operations
  - Use H(k) = k MOD m
  - When m is a power of 2, H(k) can be implemented by just using the log2(m) least significant bits of k to address the cache (translates to simple interconnections)
[Figure: hash function H in front of the data cells and their empty/full bits; the cells holding A[5] and A[1] are flagged 1, all other cells are flagged 0]
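The reason the modulo costs nothing in hardware can be checked with a small sketch. It assumes m is a power of two, which is the condition under which k MOD m equals the low-order bits of k; the function names below are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>

/* H(k) = k MOD m, the general form given on the slide. */
static unsigned hash_mod(unsigned k, unsigned m) {
    return k % m;
}

/* Equivalent form when m is a power of two: keep the log2(m) low bits of k. */
static unsigned hash_mask(unsigned k, unsigned m) {
    return k & (m - 1u);
}

/* True when m is a non-zero power of two, i.e. when hash_mask is valid. */
static bool is_power_of_two(unsigned m) {
    return m != 0u && (m & (m - 1u)) == 0u;
}
```

With m = 64, for example, both forms map every index to the same cell, so the address lines can simply be wired to the six low bits of k. With a non-power-of-two size such as m = 48 the two forms disagree, and a real modulo would be needed.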

20 Inter-Stage Communication  Hash-based solution: H(k) = k MOD m  Single read (L=1)  R = 1   = 0  a) write  b) read  c) empty/full update

21 Inter-Stage Communication  Hash-based solution: H(k) = k MOD m  Multiple reads (L>1)  R = 11...1 (L)   >>= R  a) write  b) read  c) empty/full update
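Some symbols on the two slides above did not survive extraction, so the following is only one possible reading of the write, read, and empty/full update steps: a write fills the cell's flag with L one-bits, and each read shifts the flag right by one, so the cell becomes empty again after L reads. The names `ef_buffer_t`, `ef_write`, `ef_read`, and the constants are assumptions, and the full-cell check on writes is an added safety check that the slides do not show.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 64      /* hypothetical number of buffer cells (m) */
#define READS_PER_ITEM 3   /* hypothetical L, reads per written value (L <= 8) */

typedef struct {
    int data[TABLE_SIZE];
    uint8_t flags[TABLE_SIZE];  /* 0 = empty, otherwise one bit per pending read */
} ef_buffer_t;

/* (a) write: store the value and mark the cell with L one-bits.
   Refusing to overwrite a full cell is an assumption made for this sketch. */
static bool ef_write(ef_buffer_t *b, unsigned idx, int value) {
    if (b->flags[idx] != 0) return false;
    b->data[idx] = value;
    b->flags[idx] = (uint8_t)((1u << READS_PER_ITEM) - 1u);
    return true;
}

/* (b) read and (c) empty/full update: fail while the cell is empty,
   otherwise return the value and consume one pending read. */
static bool ef_read(ef_buffer_t *b, unsigned idx, int *value) {
    if (b->flags[idx] == 0) return false;
    *value = b->data[idx];
    b->flags[idx] >>= 1;
    return true;
}
```

Setting `READS_PER_ITEM` to 1 reduces this to the single-read case, where the flag is a single bit that is set on write and cleared on read.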

22 Buffer size calculation
- By monitoring the behavior of the communication component
- For each read and write, determine the size of the buffer needed to avoid collisions
- Done during RTL simulation
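As a rough software analogue of this monitoring step, the sketch below replays a recorded sequence of reads and writes and searches for the smallest power-of-two buffer size that never writes into a cell that is still full. It assumes the single-read case, H(k) = k MOD m, and a trace in which every read follows its write; the trace format, the names, and the search limit `MAX_M` are all assumptions, and the slides perform this analysis during RTL simulation rather than on a stored trace.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_M 1024  /* hypothetical upper limit on the buffer sizes searched */

typedef struct {
    bool is_write;   /* true: producer store, false: consumer load */
    unsigned index;  /* array index k used by the access */
} access_t;

/* Replays the trace with H(k) = k MOD m and a 1-bit empty/full table.
   Returns false as soon as a write lands on a cell that is still full. */
static bool trace_fits(const access_t *trace, size_t n, unsigned m) {
    bool full[MAX_M];
    memset(full, 0, sizeof full);
    for (size_t t = 0; t < n; t++) {
        unsigned cell = trace[t].index % m;
        if (trace[t].is_write) {
            if (full[cell]) return false;  /* collision */
            full[cell] = true;
        } else {
            full[cell] = false;            /* the single read frees the cell */
        }
    }
    return true;
}

/* Smallest power-of-two m (up to MAX_M) without collisions, or 0 if none. */
static unsigned min_buffer_size(const access_t *trace, size_t n) {
    for (unsigned m = 1; m <= MAX_M; m *= 2) {
        if (trace_fits(trace, n, m)) return m;
    }
    return 0;
}
```

A trace in which the consumer keeps up closely with the producer needs only a few cells, while a trace in which many values are written before the first read needs a correspondingly larger buffer.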

23 Experimental Setup
- Compilation flow
  - Uses our previous work on compiling algorithms in a Java subset to FPGAs

24 Experimental Setup
- Simulation back-end
[Figure: flow from datapath.xml, fsm.xml, and rtg.xml through XSLTs (to dotty, to hds, to java, to vhdl) to datapath.hds, fsm.java, and rtg.java, then fsm.class and rtg.class; HADES with the library of operators (Java) and I/O data (RAMs and stimulus); ANT build file]

25 Experimental Results
- Benchmarks

Algorithm | # Stages | # Loops | Description
fdct | 2 {s1,s2} | 3 | Fast DCT (Discrete Cosine Transform)
fwt2D | 4 {s1,s2,s3,s4} | 8 | Forward Haar Wavelet
RGB2gray + histogram | 2 {s1,s2} | 2 | Transforms an RGB image to a gray image with 256 levels and determines the histogram of the gray image
Smooth + sobel, 3 versions: (a) (b) (c) | 2 {s1,s2} | 6 | Smooth image operation based on 3 × 3 windows, with the resulting image being the input to the Sobel edge detector. (a): original code; (b): two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the array with coefficients); (c): the same as (b) plus elimination of redundant array references in the original code of sobel.

26 Experimental Results
- FDCT (speed-up achieved by Pipelining Sequences of Loops)

27 Experimental Results

Algorithm | Input data size | Stages | #cc w/o PSL | Speed-up upper bound | #cc w/ PSL | Speed-up
fdct | 800 × 600 | (s1,s2) / (s1) / (s2) | 3,930,005 / 1,950,003 / 1,920,003 | 2.02 | 1,830,215 | 2.02
Fwt2D | 512 × 512 | (s1,s2,s3,s4) / (s1,s2) / (s3,s4) | 4,724,745 / 2,362,373 | 2.00 | 3,664,917 | 1.29
RGB2gray + histogram | 800 × 600 | (s1,s2) / (s1) / (s2) | 6,720,025 / 2,880,015 / 3,840,015 | 1.75 | 3,840,007 | 1.75
Smooth + sobel (a) | 800 × 600 | (s1,s2) / (s1) / (s2) | 49,634,009 / 32,929,473 / 16,606,951 | 1.51 | 32,929,489 | 1.51
Smooth + sobel (b) | 800 × 600 | (s1,s2) / (s1) / (s2) | 30,068,645 / 13,364,109 / 16,606,951 | 1.81 | 16,640,509 | 1.81
Smooth + sobel (c) | 800 × 600 | (s1,s2) / (s1) / (s2) | 25,773,809 / 13,364,109 / 11,862,791 | 1.92 | 13,364,117 | 1.92
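The upper-bound figures appear to be the total cycle count without PSL divided by the cycle count of the slowest stage. That relationship is inferred from the numbers rather than stated on the slide, so the sketch below should be read as an assumption; the function names are hypothetical.

```c
#include <stddef.h>

/* Assumed upper bound: total cycles divided by the cycles of the slowest
   stage. Returns 0.0 when no stage has a positive cycle count. */
static double speedup_upper_bound(long total_cycles,
                                  const long *stage_cycles, size_t n) {
    long slowest = 0;
    for (size_t i = 0; i < n; i++) {
        if (stage_cycles[i] > slowest) slowest = stage_cycles[i];
    }
    if (slowest == 0) return 0.0;
    return (double)total_cycles / (double)slowest;
}

/* Achieved speed-up: cycles without pipelining over cycles with it. */
static double speedup(long cycles_without_psl, long cycles_with_psl) {
    return (double)cycles_without_psl / (double)cycles_with_psl;
}
```

For the RGB2gray + histogram row, the stage counts 2,880,015 and 3,840,015 with a total of 6,720,025 give a bound of about 1.75, and 6,720,025 over 3,840,007 gives an achieved speed-up of about 1.75 as well.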

28 Experimental Results
- What happens to the buffer sizes?

29 Experimental Results
- Adjust latency of tasks in order to balance pipeline stages:
  - Slow down tasks with higher latency
  - Optimization of slower tasks in order to reduce their latency
- Slowdown of producer tasks usually reduces the size of the inter-stage buffers

30 Experimental Results
- Buffer sizes
[Figure: buffer-size plots; one with series "original", "+1 cycle per iteration of the producer", and "+2 cycles per iteration of the producer"; one with series "original", "Optimizations in the producer", and "+Optimizations in the consumer"]

31 Experimental Results
- Buffer sizes

32 Experimental Results
- Resources and Frequency (Spartan-3 400)

33 Related Work
- Previous approach (Ziegler et al.)
  - Coarse-grained communication and synchronization scheme
  - FIFOs are used to communicate data between pipelining stages
  - Width of FIFO stages dependent on producer/consumer ordering
  - Less applicable
[Figure: producer and consumer access sequences along the time axis for several orderings, including A[0] A[1] A[2] A[3] ..., A[1] A[0] A[3] A[2] ..., and A[0] A[3] A[1] A[4] A[2] A[5] ...]

34 Conclusions
- We presented a scheme to accelerate applications by pipelining sequences of loops
  - I.e., before the end of a stage (set of nested loops), a subsequent stage (set of nested loops) can start executing based on data already produced
- A data-driven scheme based on empty/full tables is used
  - A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function)
- Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved, as if the stages were executed concurrently and independently

35 Future Work
- Research other hash functions
- Study slowdown effects
- Apply the technique in the context of multi-core systems
[Figure: Processor Core A, Processor Core B, and Memory, with a communication component showing data_in, address_in, address_out, data_out, and hit/miss signals, hash blocks H, elements labeled L, N, M, T, and R, and steps (a), (b), (c)]

36 Acknowledgments
- Work partially funded by
  - CHIADO - Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems
  - Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002
- Based on the work done by Rui Rodrigues
- In collaboration with Pedro C. Diniz

37 technology from seed A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops

38 Buffer Monitor

39 Buffer Monitor

40 Buffer Monitor

41 Buffer Monitor

42 Buffer Monitor

43 Buffer Monitor
