1
Darmstadt, Germany - 11/07/2013
A Framework for Effective Exploitation of Partial Reconfiguration in Dataflow Computing
Riccardo Cattaneo*, Xinyu Niu†, Christian Pilato*, Tobias Becker†, Wayne Luk†, Marco D. Santambrogio*
* Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
† Department of Computing, Imperial College of London
ReCoSoC'13: International Workshop on Reconfigurable Communication-centric Systems-on-Chip
2
Motivations
- The design of heterogeneous, reconfigurable systems is a complex task
  - Adequate computer-aided design (CAD) tools are required
- One of the foreseen predominant platforms of the future is the MPSoC
  - Many heterogeneous cores on a single chip
- Typically, we want to accelerate an application, or a class of applications, on the MPSoC
  - The starting point should be the application, not the architecture alone
- Decisions in the front-end phase may strongly affect the back-end implementation
  - Iterative exploration is a practical requirement
- This is an ongoing project at Politecnico di Milano to assist in the design of such complex systems
3
Contents
- Framework Overview
- Preliminary Results: Test Case
- Conclusions and Future Work
4
Framework Overview
- Inputs (single XML file):
  - Information about the target device
  - Application source files (.c) plus custom pragmas for additional information (e.g., task-level parallelism/kernels)
  - Architectural template to use
- Application Analysis
  - Task graph generation
  - Dataflow graph (DFG) generation (per function)
- High Level Analysis: estimates of resource consumption for each node (DFG-based)
- Mapping and Scheduling
  - Mapping, scheduling
  - Refinement of the architectural template
- Output: project files ready for synthesis with back-end tools
5
XML Exchange Format
- The entire project is contained in an XML file
  - Architecture: components' characteristics (e.g., reconfigurable regions), ...
  - Applications: source code files and profiling information
  - Library: task implementations with their characterization (time, resources, ...)
  - Partitions: task graph, mapping and scheduling, ...
- It allows a modular organization of the framework and the sharing of information among the different phases
- Specific details of the target platform are taken into account only in the final phase (interaction with back-end tools)
6
Task Graph Generation
- Application source code files can be analyzed to extract the task graphs
- Profiling information can drive the generation of such solutions
- The task graph is then specified in the XML file as processing nodes connected by data transfers

Example of an annotated task:

    #define BPP 3  /* bytes per output pixel; assumed, not defined on the slide */

    /* o1: grayscale input, r: output image (BPP bytes per pixel),
       t: threshold, p: image width followed by the window bounds */
    #pragma omp task
    void threshold(unsigned char *o1, unsigned char *r, unsigned char t, int *p) {
        int DIMH = p[0];
        int minH1 = p[1];
        int maxH1 = p[2];
        int minV1 = p[3];
        int maxV1 = p[4];
        int v, h;
        for (v = minV1; v < maxV1; v++)
            for (h = minH1; h < maxH1; h++) {
                if (o1[v*DIMH + h] > t) {
                    /* above the threshold: white pixel */
                    r[v*DIMH*BPP + h*BPP] = 255;
                    r[v*DIMH*BPP + h*BPP + 1] = 255;
                    r[v*DIMH*BPP + h*BPP + 2] = 255;
                } else {
                    /* at or below the threshold: black pixel */
                    r[v*DIMH*BPP + h*BPP] = 0;
                    r[v*DIMH*BPP + h*BPP + 1] = 0;
                    r[v*DIMH*BPP + h*BPP + 2] = 0;
                }
            }
    }
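A task graph stored as processing nodes connected by data transfers can be pictured with a small C sketch. The `Transfer` type, the `topo_order` function and the `MAX_TASKS` limit are hypothetical names introduced for illustration; they are not taken from the framework. The sketch only shows how such a graph yields a valid execution order (here with Kahn's algorithm).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16   /* hypothetical upper bound on the number of tasks */

/* A data transfer from task `src` to task `dst` (task indices). */
typedef struct {
    int src;
    int dst;
} Transfer;

/*
 * Kahn's algorithm: writes a topological order of the n tasks into `order`
 * (which must hold n entries). Returns the number of tasks ordered; a value
 * smaller than n means the transfers contain a cycle.
 */
int topo_order(int n, const Transfer *edges, size_t n_edges, int *order)
{
    int indegree[MAX_TASKS] = {0};
    int done = 0;

    assert(n >= 0 && n <= MAX_TASKS);

    for (size_t e = 0; e < n_edges; e++)
        indegree[edges[e].dst]++;

    /* Seed the order with tasks that wait on no transfer. */
    for (int t = 0; t < n; t++)
        if (indegree[t] == 0)
            order[done++] = t;

    /* Each ordered task releases the consumers of its outgoing transfers. */
    for (int i = 0; i < done; i++)
        for (size_t e = 0; e < n_edges; e++)
            if (edges[e].src == order[i] && --indegree[edges[e].dst] == 0)
                order[done++] = edges[e].dst;

    return done;
}
```

For a four-task pipeline with transfers 0→1, 1→2 and 2→3, `topo_order` fills `order` with 0, 1, 2, 3 regardless of the order in which the transfers are listed.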
7
Library Generation: a collection of different implementations
- LLVM-based compiler to extract the dataflow graph of each task
- Estimation of required resources (including bit-width analysis)
- Possibility to interact with HLS tools to obtain more accurate results (trading off design time against estimation accuracy)
- Generated implementations are then stored in the XML file to offer opportunities to the mapper and floorplacer
- Politecnico di Milano/Imperial College of London joint effort to integrate High Level Analysis techniques into the toolchain
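The slides do not describe how the bit-width analysis works. As a generic illustration only, the sketch below shows the usual rules for unsigned values: the width needed for a value range, and how widths grow through the adder and multiplier nodes of a dataflow graph. The function names `bits_for_range`, `add_width` and `mul_width` are hypothetical.

```c
#include <assert.h>

/* Number of bits needed to represent every unsigned value in [0, max_value]. */
unsigned bits_for_range(unsigned long max_value)
{
    unsigned bits = 1;              /* even the constant 0 occupies one bit */
    while (max_value >>= 1)
        bits++;
    return bits;
}

/* Width of the sum of two unsigned operands: one carry bit on top of the wider one. */
unsigned add_width(unsigned wa, unsigned wb)
{
    assert(wa > 0 && wb > 0);
    return (wa > wb ? wa : wb) + 1;
}

/* Width of the product of two unsigned operands: the operand widths add up. */
unsigned mul_width(unsigned wa, unsigned wb)
{
    assert(wa > 0 && wb > 0);
    return wa + wb;
}
```

An 8-bit pixel (`bits_for_range(255)` is 8) added to another 8-bit pixel needs 9 bits, which is narrower than a default 32-bit datapath and is the kind of saving a resource estimate can exploit.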
8
Mapping, Scheduling and Floorplacing
- We generate one or more configurations in which each task of the application is analyzed and assigned (via Mapping, Scheduling and Floorplanning, M/S/FP) to:
  - an available and admissible implementation
  - a component of the architecture (GPP, IP or reconfigurable region)
- This makes it possible to:
  - share implementations across different tasks (hardware sharing)
  - move a task implementation to another processing element at run time (task relocation)
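Choosing an "available and admissible implementation" can be illustrated with a deliberately simple selection rule. This is a sketch under stated assumptions, not the mapper used by the framework: the `Impl` type, the `pick_impl` function, and the criterion (lowest latency among the implementations that fit a component's capacity) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One characterized implementation of a task, as a library entry might record it. */
typedef struct {
    unsigned area;      /* resources required, e.g. in LUTs */
    unsigned latency;   /* execution time, arbitrary units */
} Impl;

/*
 * Returns the index of the lowest-latency implementation whose area fits in
 * `capacity`, or -1 if no implementation is admissible on this component.
 */
int pick_impl(const Impl *impls, size_t n, unsigned capacity)
{
    int best = -1;

    assert(impls != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (impls[i].area > capacity)
            continue;                   /* does not fit: not admissible */
        if (best < 0 || impls[i].latency < impls[best].latency)
            best = (int)i;
    }
    return best;
}
```

With three made-up implementations of areas 100, 200 and 400 and latencies 80, 40 and 20, a component of capacity 250 receives the second one; a real mapper would weigh this choice against scheduling and floorplacing constraints.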
9
Architecture Exploration
- During exploration, the target architecture can be refined by:
  - adding/removing processing elements (reconfigurable regions)
  - modifying their parameters
  - determining the proper interconnection topology
- Refinement can iteratively affect:
  - mapping and scheduling: the computational resources change (especially the number of reconfigurable regions)
  - floorplacing: resources may become scarcer or more plentiful because more or fewer components have to be floorplaced
- It allows a progressive and iterative refinement of the solution and a concurrent customization of both architecture and application
  - E.g., mapping and floorplacing can suggest which resources should be added
10
Supported Platforms
- Virtex-5 XC5VLX110T (embedded)
  - Two XCF32P Platform Flash PROMs (32 Mbyte each)
  - SystemACE™ Compact Flash configuration controller
  - 64-bit wide 256 Mbyte DDR2 small outline DIMM (SODIMM)
- Maxeler MaxWorkstation (HPC system)
  - Intel i7 2600S @ 2.8 GHz, 16 GB RAM, 500 GB HDD
  - Max3 dataflow engine (DFE): Virtex-6 SX475T FPGA, 24 GB memory
  - DFE connected to the CPU via PCI Express
[Figure: block diagrams of the two platforms. XUPV5: reconfigurable area, DDR2 (256 MB), CPU0, CPU1. MaxWorkstation: CPU with DRAM (16 GB); MAX3 DFE with interface FPGA, compute FPGA and DRAM (24 GB).]
11
Backend Toolchains
- Inputs (.c, .xml): source code for CPU, DFGs for HW tasks, mapping configurations
- The code can always be further optimized by hand, e.g., glue code for data transfers
[Figure: back-end flows for the two targets. FPGA-based embedded system: CPU compiler, DFG-C, HLS (C-VHDL), manual VHDL implementations, bitstream generation. MaxWorkstation: DFG-MaxJ, MaxIDE, HLS (MaxJ-VHDL), manual MaxJ implementations, bitstream generation. Outputs: exec, bin, bit.]
12
Helper Graphical User Interface
- A practical GUI supports the designer, limits errors when interacting with the XML, and allows custom design methodologies
13
Preliminary Results: Edge Detection
- Edge detection application: 4 stages of computation
- Description based on C plus custom #pragmas
- We generate 4 implementations with different levels of parallelism and resource consumption for each of the 4 tasks of the application
  - "parallelism X": X pixels processed at once
- Maxeler backend
[Figure: extracted task graph and corresponding DFG of the first stage (Scale, 1x parallelism)]
14
Experimental Results / 1
- Static vs reconfigurable design (both extracted using the framework)
- We limit the available area to 10k LUTs and implement the best-performing design

Reconfigurable (parallelism 8). Regions: R0: S,T; R1: B,E

Task name    Area occupation
S            664
B            64
E            7680
T            7376

Region name  Final area occupation
R0           max(664, 64) = 664
R1           max(7680, 7376) = 7680

Total area consumption: 664 + 7680 = 8344

Static (parallelism 4). IP0: S, IP1: B, IP2: E, IP3: T

Task name    Area occupation
S            332
B            32
E            3840
T            3688

Total area consumption: 332 + 32 + 3840 + 3688 = 7892 (the slide states 7876)
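The area figures on this slide follow two simple rules: a reconfigurable region must be as large as the biggest task it hosts, while a static design pays for every task separately. The sketch below reproduces that arithmetic; the function names `region_area` and `static_area` are introduced here for illustration and do not come from the framework.

```c
#include <assert.h>
#include <stddef.h>

/* Area of a reconfigurable region: the largest of the tasks that share it. */
unsigned region_area(const unsigned *task_areas, size_t n)
{
    unsigned area = 0;

    assert(task_areas != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        if (task_areas[i] > area)
            area = task_areas[i];
    return area;
}

/* Area of a static design: every task occupies its own IP core. */
unsigned static_area(const unsigned *task_areas, size_t n)
{
    unsigned area = 0;

    assert(task_areas != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        area += task_areas[i];
    return area;
}
```

With the slide's numbers, the two regions cost `region_area({664, 64}) + region_area({7680, 7376})`, that is 664 + 7680 = 8344, and `static_area({332, 32, 3840, 3688})` gives 7892 (the slide prints 7876). Both designs stay under the 10k LUT budget, but the reconfigurable one runs every task at parallelism 8 instead of 4.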
15
Experimental Results / 2
- Reconfiguration time is automatically masked (when possible)
- Partial reconfiguration improves application performance via automatic resource multiplexing
  - Automatic because different schedules are explored
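Masking the reconfiguration time means loading the next region while another task is still executing, so that only the part of the reconfiguration that outlasts the running task delays the schedule. The sketch below shows this for two dependent tasks; the function names and the two-task model are assumptions made for illustration, and the slides do not give the actual scheduling algorithm.

```c
#include <assert.h>

/* Part of the reconfiguration that is not hidden behind the running task. */
unsigned exposed_reconf_time(unsigned exec_time, unsigned reconf_time)
{
    return reconf_time > exec_time ? reconf_time - exec_time : 0;
}

/* Task A runs, then its region is reconfigured, then task B runs. */
unsigned serial_makespan(unsigned exec_a, unsigned reconf_b, unsigned exec_b)
{
    return exec_a + reconf_b + exec_b;
}

/* Task B's region is reconfigured while task A runs in another region. */
unsigned masked_makespan(unsigned exec_a, unsigned reconf_b, unsigned exec_b)
{
    unsigned masked = exec_a + exposed_reconf_time(exec_a, reconf_b) + exec_b;

    /* Overlapping the reconfiguration can never lengthen the schedule. */
    assert(masked <= serial_makespan(exec_a, reconf_b, exec_b));
    return masked;
}
```

With made-up times of 10 for task A, 4 for the reconfiguration and 5 for task B, the serial schedule takes 19 time units and the masked one 15: the reconfiguration is hidden completely. If the reconfiguration took 14, only the 4 units that exceed task A's execution would remain visible.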
16
Experimental Results / 3
- HLA estimates are fairly accurate, given that they are extracted in a matter of seconds on a commodity desktop machine
  - Average values over the set of tasks
  - Average accuracy is > 85%
17
Conclusions and Future Work
- We presented a modular framework to design heterogeneous, reconfigurable systems
  - Easy to plug in alternative methods for each phase
  - Possibility to perform progressive refinement of both application and architecture
- Critical part: multi-objective optimization strategy
  - Different experiments with different heuristics or possibly different algorithms
  - Easy to plug in different components
- This is becoming part of a larger project (ASAP: Advanced Synthesis of Applications and Platforms)
  - SystemC TLM backend for (co-)simulation and early validation
  - More architectural templates
  - Closer interaction with actual synthesis (e.g., high-level synthesis)
  - Automated methodologies to accelerate the design
18
Thank you!
Riccardo Cattaneo
rcattaneo@elet.polimi.it
Research partially funded by the European Community's Seventh Framework Programme, FASTER project.