A Flexible Interconnection Structure for Reconfigurable FPGA Dataflow Applications
Gianluca Durelli, Alessandro A. Nacci, Riccardo Cattaneo, Christian Pilato, Donatella Sciuto and Marco Domenico Santambrogio
Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria, Milano, IT
[durelli, nacci, rcattaneo, pilato,
20th Reconfigurable Architectures Workshop, May 20-21, 2013, Boston, USA

Rationale
Strive for performance in compute-intensive applications
Reconfigurable HW is well suited for certain classes of applications
–Multimedia, computational biology, physical simulation
FPGAs are used in HPC systems
High maintenance costs
–Need to share resources among users
Need to dynamically share and reuse components on the FPGA among different users

Outline
Goals
State of the Art
Proposed Solution
Design and Evaluation
Case Study
Conclusions and Future work

Goals
Design an interconnection able to:
–Create different pipelines by reusing the components available on the FPGA
–Share resources between different applications
–Avoid inserting any stall in the pipeline
Target: FPGAs in HPC scenarios

State of the Art
Bus interconnection
–Congestion problem
–Does not scale
Network on Chip
–Possible congestion problem
–Good scalability
Both can introduce unexpected delays in computation
–Cannot guarantee performance when the device is shared between different users

Proposed Solution
Switch-based interconnection
–Core inputs connected to interconnection outputs
–Core outputs connected to interconnection inputs
–Fully pipelined point-to-point communication
Data is read/written only when all the inputs are available
Can be configured by setting, for each input and output channel:
–Switching configuration: multiplexer configuration to route information
–The clock cycle from which the channel is active
–How much data has to be read/written through that channel

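The three per-channel settings listed above can be illustrated with a small software model. This is a minimal sketch, not the authors' implementation: the names `ChannelConfig` and `simulate_switch` are hypothetical, input streams are modelled as plain lists, and the "all inputs available" condition is assumed to hold at every cycle.

```python
from dataclasses import dataclass


@dataclass
class ChannelConfig:
    source: int       # multiplexer select: interconnection input routed to this output
    start_cycle: int  # first clock cycle in which the channel is active
    count: int        # number of data items transferred through the channel


def simulate_switch(input_streams, configs, num_cycles):
    """Return, per output channel, the (cycle, value) pairs it forwards.

    Each active channel moves exactly one item per cycle, so no stall
    is ever inserted once a channel has started.
    """
    outputs = [[] for _ in configs]
    for cycle in range(num_cycles):
        for out_idx, cfg in enumerate(configs):
            item_idx = cycle - cfg.start_cycle
            if 0 <= item_idx < cfg.count:
                value = input_streams[cfg.source][item_idx]
                outputs[out_idx].append((cycle, value))
    return outputs


# Two input streams, two output channels with different routes and start cycles.
streams = [[1, 2, 3], [10, 20, 30]]
configs = [
    ChannelConfig(source=1, start_cycle=0, count=3),
    ChannelConfig(source=0, start_cycle=2, count=2),
]
trace = simulate_switch(streams, configs, num_cycles=5)
print(trace[0])
print(trace[1])
```

In this model, output 0 forwards input 1 from cycle 0, while output 1 stays idle until cycle 2 and then forwards two items from input 0.
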
Proposed Solution
Suited for dataflow/pipelined applications
Parameters can be extracted from a high-level description of the application and pipeline structure
–Possibility to automate the parameter extraction and interconnection design

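One way such an extraction could work is sketched below. It is an illustration under stated assumptions rather than the tool used in this work: `extract_parameters`, the port numbering (port 0 is the host stream, port p is attached to core p on both sides of the switch) and the per-core latencies are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class ChannelConfig(NamedTuple):
    source: int       # interconnection input routed to this output port
    start_cycle: int  # first cycle in which the channel is active
    count: int        # number of items to transfer


HOST = 0  # assumed port index of the host stream on both sides of the switch


def extract_parameters(
    pipeline: List[str],
    core_port: Dict[str, int],
    latency: Dict[str, int],
    num_items: int,
) -> Dict[int, ChannelConfig]:
    """Derive one configuration per output port from an ordered stage list."""
    configs = {}
    source, start = HOST, 0
    for stage in pipeline:
        port = core_port[stage]
        # Feed this core from the previous producer, once its data starts to arrive.
        configs[port] = ChannelConfig(source, start, num_items)
        source = port
        start += latency[stage]
    # The last core in the chain streams its results back to the host.
    configs[HOST] = ChannelConfig(source, start, num_items)
    return configs


# Hypothetical port assignment and latencies (in cycles) for three filter cores.
ports = {"GS": 1, "GB": 2, "ED": 3}
latencies = {"GS": 1, "GB": 4, "ED": 3}
params = extract_parameters(["GS", "ED"], ports, latencies, num_items=100)
for port, cfg in sorted(params.items()):
    print(port, cfg)
```

The start cycle of each channel is the accumulated latency of the stages before it, which is what lets downstream channels open exactly when their data becomes available.
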
Implementation
Solution implemented with HLS:
–HLS is well suited for dataflow/stencil loop synthesis
–Simplifies HW development
–Generates compatible interfaces
Maxeler Technologies:
–HPC dataflow computing exploiting FPGAs
–Proprietary HLS starting from a Java-like description: the proposed interconnection is easily described in Java
MaxWorkstation 3A:
–Intel i7 quad-core
–Xilinx Virtex-6 XC6VSX475T
–PCIe communication: maximum 8 channels/streams

Evaluation: Area Occupation
Area increment (10-30%) due to the increase in switching logic
The interconnection consumes up to 6% of the FPGA:
–Ample space remains for user cores

Evaluation: Frequency
Tested with pass-through cores to evaluate the maximum working frequency of the interconnection (300 MHz)
In real-life applications (Brain network with cores working at 200 MHz) the interconnection does not affect the critical path

Case Study
Application:
–Image processing pipeline (up to 4 stages): gray scale (GS), Gaussian blur (GB) and edge detection (ED) filters, and their combinations
Tested architectures:
–[Figure: architectures (A), (B), (C), (D)]
Experiments:
–Single execution of an N-stage pipeline
–Batch execution of a workload of 100 random applications

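A batch workload of this kind could be generated as in the sketch below. The slides do not say how the random applications were drawn, so the choices here are assumptions: each application is a pipeline of 1 to 4 stages, each stage is picked uniformly from the three filters with repetition allowed, and `random_workload` is a hypothetical helper name.

```python
import random

FILTERS = ("GS", "GB", "ED")  # gray scale, Gaussian blur, edge detection
MAX_STAGES = 4


def random_workload(num_apps=100, seed=0):
    """Return a list of pipelines, each a list of 1 to MAX_STAGES filter names."""
    rng = random.Random(seed)  # seeded so the same workload can be replayed
    workload = []
    for _ in range(num_apps):
        num_stages = rng.randint(1, MAX_STAGES)
        workload.append([rng.choice(FILTERS) for _ in range(num_stages)])
    return workload


workload = random_workload()
print(len(workload), workload[:3])
```

Fixing the seed lets the same workload be replayed on each tested architecture, which keeps the execution times comparable.
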
Case Study: Single execution
[Figures: single-execution results for architectures (A), (B), (C), (D)]

Case Study: Batch execution
Proposed solution (D) does not introduce overhead in the overall execution time w.r.t. the other two architectures
Low system load:
–Up to 30% reduction in the overall workload execution time

Case Study: Batch execution
Low system load (1-2 applications):
–Proposed solution (D) does not introduce delays in the execution of a single application of the workload
Higher system loads (more than 2 applications):
–10%-30% reduction in single application execution time

Conclusions and Future work
Conclusions:
–Design of an interconnection to support HW resource sharing in a multi-application scenario
–Solution suited for dataflow/pipelined systems
–Possibility to realize different pipeline configurations at run-time
Future work:
–Design of a mapping/reconfiguration strategy to allocate user cores and configure new core instances at run-time