A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops. João M. P. Cardoso, Portugal. ITIV, University of Karlsruhe, July 2, 2007.

Presentation transcript:

A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops. João M. P. Cardoso, Portugal. ITIV, University of Karlsruhe, July 2, 2007

2 Motivation
- Many applications consist of sequences of tasks, e.g., image and video processing algorithms
- Contemporary FPGAs have plenty of room to accommodate highly specialized, complex architectures
  - It is time to creatively "use available resources" rather than simply "save resources"

3 Motivation
- Computing stages sequentially: Task A, Task B, and Task C execute one after the other in time

4 Motivation
- Computing stages concurrently: Task A, Task B, and Task C overlap in time

5 Outline
- Objective
- Loop Pipelining
- Producer/Consumer Computing Stages
- Pipelining Sequences of Loops
- Inter-Stage Communication
- Experimental Setup and Results
- Related Work
- Conclusions
- Future Work

6 Objectives
- To speed up applications with multiple, data-dependent stages, each stage seen as a set of nested loops
- How?
  - By pipelining those sequences of data-dependent stages using fine-grain synchronization schemes
  - By taking advantage of field-customizable computing structures (FPGAs)

7 Loop Pipelining
- Attempts to overlap loop iterations
- Significant speedups are achieved (e.g., iterations I1, I2, I3, I4 execute in an overlapped fashion over time rather than back to back)
- But how can sequences of loops be pipelined?

8 Computing Stages
- Sequentially: the producer writes ... A[2] A[1] A[0]; only afterwards does the consumer read A[0] A[1] A[2] ...

9 Computing Stages
- Concurrently, with ordered producer/consumer pairs
  - Producer writes ... A[2] A[1] A[0]; the consumer reads A[0] A[1] A[2] ... in the same order
  - Send/receive through a FIFO with N stages (a sketch of this scheme follows below)
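A minimal C sketch of this ordered send/receive scheme, in which a small software FIFO stands in for the N-stage hardware FIFO; the depth N_FIFO, the function names, and the busy-waiting loops are illustrative assumptions, not taken from the original design:

  #include <stddef.h>

  #define N_FIFO 8                      /* FIFO depth: an assumed value */

  static int    fifo[N_FIFO];
  static size_t head = 0, tail = 0;
  static volatile size_t count = 0;     /* number of elements currently in the FIFO */

  /* Producer side: blocks (busy-waits) while the FIFO is full. */
  static void fifo_send(int value) {
      while (count == N_FIFO) ;         /* wait for a free slot */
      fifo[tail] = value;
      tail = (tail + 1) % N_FIFO;
      count++;                          /* in real concurrent code this update must be atomic */
  }

  /* Consumer side: blocks (busy-waits) while the FIFO is empty. */
  static int fifo_receive(void) {
      while (count == 0) ;              /* wait for the producer */
      int value = fifo[head];
      head = (head + 1) % N_FIFO;
      count--;
      return value;
  }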

10 Computing Stages
- Concurrently, with unordered producer/consumer pairs
  - Producer writes ... A[3] A[5] A[1]; the consumer reads A[3] A[1] A[5] ...
  - Each data cell is paired with an empty/full bit in a table (a sketch of this scheme follows below)
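A minimal C sketch of the empty/full-table idea: a data buffer paired with a 1-bit flag per cell, where the spin loop mimics the data-driven synchronization of the hardware scheme. BUF_SIZE and the function names are illustrative assumptions:

  #define BUF_SIZE 64                   /* assumed size of the shared buffer */

  static int buf[BUF_SIZE];                      /* data cells             */
  static volatile unsigned char full[BUF_SIZE];  /* 1-bit empty/full flags */

  /* Producer: write the data first, then flag it as available. */
  static void ef_write(int idx, int value) {
      buf[idx]  = value;
      full[idx] = 1;
  }

  /* Consumer: spin on the flag, then read; the access order does not matter
     as long as each cell is written before it is read. */
  static int ef_read(int idx) {
      while (!full[idx]) ;              /* data not produced yet */
      return buf[idx];
  }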

11 Main Idea
- FDCT: Loops 1 and 2 produce an intermediate data array that Loop 3 consumes to produce the output data
- With a single global FSM, the execution of Loops 1 and 2 and the execution of Loop 3 occur strictly one after the other in time

12 Main Idea
- FDCT has out-of-order producer/consumer pairs
- How can the computing stages be overlapped?

13 Main Idea
- Pipelined FDCT: Loops 1 and 2 (controlled by FSM 1) and Loop 3 (controlled by FSM 2) execute concurrently
- The intermediate data array is stored in a dual-port RAM, and a dual-port 1-bit (empty/full) table flags data availability
- The execution of Loops 1 and 2 now overlaps in time with the execution of Loop 3

14 Main Idea
- Task A (producer) and Task B (consumer) communicate through a shared memory

15 Possible Scenarios
- Single write, single read: accepted without code changes
- Single write, multiple reads: accepted without code changes (by using an N-bit table)
- Multiple writes, single read: needs code transformations
- Multiple writes, multiple reads: needs code transformations

16 Inter-Stage Communication
- Responsible for:
  - Communicating data between pipelined stages
  - Flagging data availability
- Solutions
  - Perfect associative memory: cost too high
  - Memory for the data plus a 1-bit table (each cell holds empty/full information), sized to the data set to be communicated
  - Decrease the size using a hash-based solution

17 Inter-Stage Communication
- Memory plus 1-bit table

  // Producer (Loops 1 and 2): store each result and set the corresponding full bit
  boolean tab[SIZE] = {0, 0, ..., 0};
  ...
  for (i = 0; i < num_fdcts; i++) {      // Loop 1
    for (j = 0; j < N; j++) {            // Loop 2
      // loads
      // computations
      // stores
      tmp[48+i_1] = F6 >> 13;  tab[48+i_1] = true;
      tmp[56+i_1] = F7 >> 13;  tab[56+i_1] = true;
      i_1++;
    }
    i_1 += 56;
  }

  // Consumer (Loop 3): busy-wait on the full bit of each element before using it
  i_1 = 0;
  for (i = 0; i < N*num_fdcts; i++) {    // Loop 3
    L1: f0 = tmp[i_1];    if (!tab[i_1]) goto L1;
    L2: f1 = tmp[1+i_1];  if (!tab[1+i_1]) goto L2;
    // remaining loads
    // computations
    ...
    // stores
    i_1 += 8;
  }

18 Inter-Stage Communication
- Hash-based solution

  // Producer (Loops 1 and 2): indices are hashed before addressing the buffer and the table
  boolean tab[SIZE] = {0, 0, ..., 0};
  ...
  for (i = 0; i < num_fdcts; i++) {      // Loop 1
    for (j = 0; j < N; j++) {            // Loop 2
      // loads
      // computations
      // stores
      tmp[H(48+i_1)] = F6 >> 13;  tab[H(48+i_1)] = true;
      tmp[H(56+i_1)] = F7 >> 13;  tab[H(56+i_1)] = true;
      i_1++;
    }
    i_1 += 56;
  }

  // Consumer (Loop 3): busy-wait on the hashed full bit before using each element
  i_1 = 0;
  for (i = 0; i < N*num_fdcts; i++) {    // Loop 3
    L1: f0 = tmp[H(i_1)];    if (!tab[H(i_1)]) goto L1;
    L2: f1 = tmp[H(1+i_1)];  if (!tab[H(1+i_1)]) goto L2;
    // remaining loads
    // computations
    ...
    // stores
    i_1 += 8;
  }

19 Inter-Stage Communication
- Hash-based solution
  - We did not want to add extra delays to the load/store operations
  - Use H(k) = k MOD m
  - When m is a multiple of 2*N, H(k) can be implemented by simply taking the least-significant log2(m) bits of k to address the buffer, which translates into simple interconnections (a sketch follows below)
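A small C sketch of this hash, assuming m is a power of two, in which case k MOD m reduces to keeping the low-order bits of k (plain wiring in hardware); the value of M below is an assumption for illustration:

  #define M 64u                         /* buffer size; must be a power of two here */

  /* H(k) = k MOD M, implemented with the least-significant log2(M) bits of k. */
  static inline unsigned H(unsigned k) {
      return k & (M - 1u);              /* equivalent to k % M when M is a power of two */
  }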

20 Inter-Stage Communication
- Hash-based solution, H(k) = k MOD m, single read per element (L = 1, so R = 1): the steps per element are (a) write, (b) read, and (c) empty/full update

21 Inter-Stage Communication
- Hash-based solution, H(k) = k MOD m, multiple reads per element (L > 1, with R derived from L): the steps per element are (a) write, (b) read, and (c) empty/full update, and a cell is only marked empty again after all of its reads have completed (a counter-based sketch follows below)
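One possible way to support the multiple-read case in software is to replace the 1-bit flag by a small per-cell counter of pending reads; L, MR_SIZE, and the function names below are assumptions for illustration, not the paper's exact mechanism:

  #define MR_SIZE 64                    /* assumed buffer size                 */
  #define L 2                           /* assumed reads per produced element  */

  static int mr_buf[MR_SIZE];
  static volatile int mr_pending[MR_SIZE];   /* 0 = empty, >0 = reads still expected */

  /* Producer: write the value, then arm the counter with the expected number of reads. */
  static void mr_write(int idx, int value) {
      mr_buf[idx]     = value;
      mr_pending[idx] = L;
  }

  /* Consumer(s): wait until the cell is full, read it, and decrement the counter;
     the cell becomes empty again after the last expected read. */
  static int mr_read(int idx) {
      while (mr_pending[idx] == 0) ;    /* not produced yet (or already fully consumed) */
      int value = mr_buf[idx];
      mr_pending[idx]--;                /* in real concurrent code this must be atomic */
      return value;
  }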

22 Buffer Size Calculation
- By monitoring the behavior of the communication component
- For each read and write, determine the size of the buffer needed to avoid collisions (a monitoring sketch follows below)
- Done during RTL simulation
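A C sketch of the monitoring idea, assuming the monitor simply counts how many produced values are in flight (written but not yet consumed) and records the maximum; the names below are illustrative, and in the actual flow this bookkeeping runs alongside the RTL simulation:

  static int in_flight     = 0;         /* values written but not yet read */
  static int max_in_flight = 0;         /* worst case observed so far      */

  /* Called by the monitor whenever the producer writes one element. */
  static void monitor_write(void) {
      if (++in_flight > max_in_flight)
          max_in_flight = in_flight;
  }

  /* Called by the monitor whenever the consumer finishes reading one element. */
  static void monitor_read(void) {
      in_flight--;
  }

  /* Smallest buffer (number of cells) that would have avoided collisions. */
  static int required_buffer_size(void) {
      return max_in_flight;
  }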

23 Experimental Setup
- Compilation flow: uses our previous work on compiling algorithms written in a Java subset to FPGAs

24 Experimental Setup
- Simulation back-end: the compiler outputs fsm.xml, datapath.xml, and rtg.xml; XSLT transformations driven by an ANT build file translate these to dotty, to HDS (datapath.hds), to Java (fsm.java, rtg.java), and to VHDL; the compiled fsm.class and rtg.class are then simulated in HADES together with a library of operators (Java) and the I/O data (RAMs and stimulus)

25 Experimental Results
- Benchmarks
  - fdct: 2 stages {s1, s2}, 3 loops. Fast DCT (Discrete Cosine Transform).
  - fwt2D: 4 stages {s1, s2, s3, s4}, 8 loops. Forward Haar wavelet.
  - RGB2gray + histogram: 2 stages {s1, s2}, 2 loops. Transforms an RGB image into a gray image with 256 levels and determines the histogram of the gray image.
  - Smooth + sobel, 3 versions (a), (b), (c): 2 stages {s1, s2}, 6 loops. Smooth image operation based on 3×3 windows, with the resulting image being the input to the Sobel edge detector. (a) original code; (b) the two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the coefficient array); (c) same as (b) plus elimination of redundant array references in the original Sobel code.

26 Experimental Results
- FDCT: speed-up achieved by pipelining sequences of loops (chart)

27 Experimental Results
- Clock cycles (#cc) without and with pipelining sequences of loops (PSL), per algorithm and per stage
  - fdct, 800×600, stages (s1,s2)/(s1)/(s2): #cc w/o PSL 3,930,005 / 1,950,003 / 1,920,…; #cc w/ PSL …,830,…
  - fwt2D, 512×512, stages (s1,s2,s3,s4)/(s1,s2)/(s3,s4): #cc w/o PSL 4,724,745 / 2,362,… / …; #cc w/ PSL …,664,…
  - RGB2gray + histogram, 800×600, stages (s1,s2)/(s1)/(s2): #cc w/o PSL 6,720,025 / 2,880,015 / 3,840,…; #cc w/ PSL …,840,…
  - Smooth + sobel (a), 800×600, stages (s1,s2)/(s1)/(s2): #cc w/o PSL 49,634,009 / 32,929,473 / 16,606,…; #cc w/ PSL …,929,…
  - Smooth + sobel (b), 800×600, stages (s1,s2)/(s1)/(s2): #cc w/o PSL 30,068,645 / 13,364,109 / 16,606,…; #cc w/ PSL …,640,…
  - Smooth + sobel (c), 800×600, stages (s1,s2)/(s1)/(s2): #cc w/o PSL 25,773,809 / 13,364,109 / 11,862,…; #cc w/ PSL …,364,…

28 Experimental Results
- What happens to the buffer sizes?

29 Experimental Results
- Adjust the latency of tasks in order to balance the pipeline stages:
  - Slow down the faster tasks
  - Optimize the slower tasks in order to reduce their latency
- Slowing down producer tasks usually reduces the size of the inter-stage buffers

30 Experimental Results
- Buffer sizes (charts): the original producer vs. versions slowed down by +1 and +2 cycles per producer iteration, and the original code vs. optimizations in the producer plus additional optimizations in the consumer

31 Experimental Results
- Buffer sizes

32 Experimental Results
- Resources and frequency (Spartan-3 400)

33 Related Work
- Previous approach (Ziegler et al.)
  - Coarse-grained communication and synchronization scheme
  - FIFOs are used to communicate data between pipelining stages
  - The width of the FIFO stages depends on the producer/consumer ordering
  - Less widely applicable

34 Conclusions
- We presented a scheme to accelerate applications by pipelining sequences of loops
  - I.e., before one stage (a set of nested loops) ends, a subsequent stage (another set of nested loops) can start executing based on the data already produced
- A data-driven scheme based on empty/full tables is used
  - Including a scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function)
- Depending on the consumer/producer ordering, speed-ups close to the theoretical ones are achieved, i.e., as if the stages executed concurrently and independently

35 Future Work
- Research other hash functions
- Study slowdown effects
- Apply the technique in the context of multi-core systems (e.g., processor cores communicating through a shared memory via the hash-based empty/full scheme)

36 Acknowledgments
- Work partially funded by CHIADO (Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems), Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002
- Based on the work done by Rui Rodrigues
- In collaboration with Pedro C. Diniz

37 A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops

38 Buffer Monitor

39 Buffer Monitor

40 Buffer Monitor

41 Buffer Monitor

42 Buffer Monitor

43 Buffer Monitor