Politecnico di Milano, Italy

Presentation transcript:

SMASH: A Heuristic Methodology for Designing Partially Reconfigurable MPSoCs
Riccardo Cattaneo, Christian Pilato, Gianluca C. Durelli, Marco D. Santambrogio and Donatella Sciuto
Politecnico di Milano, Italy
IEEE International Symposium on Rapid System Prototyping, Montreal, Canada, October 4, 2013

What is an FPGA?
A hardware device that can be customized after fabrication to execute a specific functionality. Distinct hardware blocks intrinsically run in parallel on the device. It is a heterogeneous grid of interconnected components: look-up tables (LUTs), block RAMs (BRAMs), digital signal processors (DSPs), switch matrices, input/output blocks (IOBs), etc. Resources can be reused by reconfiguring part of the logic at run time (partial reconfiguration).

Heterogeneous SoCs with FPGAs
Highly coupled heterogeneous systems. Zynq platform: a dual-core ARM Cortex-A9 tightly coupled with Xilinx Artix-7 FPGA fabric through a high-speed, low-latency reconfigurable interconnect.
Figures: Avnet ZedBoard (Zynq-7000-based development board); coarse-grained overview of the Zynq-7000 All Programmable SoC.

Design Challenges and Motivation
The hardware engineer needs to: partition the application into blocks (partitioning); determine which parts are better executed in hardware (mapping and scheduling); generate the system (architecture refinement). These steps are strictly interdependent!
Partial reconfiguration allows the same logic to be reused across different tasks, so more tasks can be ported to hardware, but it adds a significant overhead that must be taken into account.

SMASH: Proposed Methodology
Design Space Exploration determines the proper mapping and scheduling. Architecture Refinement customizes the architectural template to derive the corresponding platform.

Mapping and Scheduling
Input: the task graph (DAG); the architectural template, which identifies the resource constraints; the implementations, a list of different trade-offs in terms of performance and resources.
Output: an implementation and a component for each task; the order of execution.

Implementation vs. Component
Each task can have multiple alternative implementations on the same component; faster implementations usually require more resources. Some tasks can share an implementation to execute the same functionality multiple times (hardware reuse: no reconfiguration is required). The implementation relates to functionality and resources. The component relates to where the task is actually executed: a processor or a hardware module.

SMASH: Execution Overview
Simultaneous MApping and Scheduling Heuristic.
SMASH iteration: generate trace, schedule trace, evaluate metrics, store solution. Termination? If no, start a new iteration; if yes, return the best solution.
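The iteration flow above can be sketched as a plain loop. This is a minimal illustration rather than the authors' implementation: the function names `smash_loop`, `toy_generate` and `toy_cost` are hypothetical, the termination test is assumed to be a fixed iteration budget, and the cost function is a toy stand-in for scheduling plus metric evaluation.

```python
import random

def smash_loop(generate_trace, schedule_and_evaluate, max_iterations=50):
    """Generate, schedule/evaluate, store; stop after a fixed budget (assumption)."""
    best_trace, best_cost = None, float("inf")
    for _ in range(max_iterations):
        trace = generate_trace()               # generate trace
        cost = schedule_and_evaluate(trace)    # schedule trace + evaluate metrics
        if cost < best_cost:                   # store solution if it improves
            best_trace, best_cost = trace, cost
    return best_trace, best_cost               # return best solution

# Toy usage: a "trace" is just an ordering of four task names.
rng = random.Random(0)
tasks = ["A", "B", "C", "D"]

def toy_generate():
    return rng.sample(tasks, len(tasks))

def toy_cost(trace):
    # Hypothetical metric: position-weighted sum, lower is better.
    return float(sum(i * (ord(t) - ord("A")) for i, t in enumerate(trace)))

best, cost = smash_loop(toy_generate, toy_cost)
```

In the methodology described here, trace generation is driven by Ant Colony Optimization and the evaluation by the SMASH scheduler; both are replaced by toy callables in this sketch.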

Exploring Mapping and Scheduling
The exploration is based on the Serial Generation Scheme (SGS), a constructive approach that handles design constraints better: a decision is not taken if it would lead to a constraint violation. It explores different combinations of mapping and scheduling. Each decision maps a task to an implementation and a processing element. The order of selection gives the priority values used to resolve scheduling conflicts on the resources.
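A constructive scheme of this kind can be sketched as follows. The sketch makes several assumptions that are not in the slides: decisions are chosen uniformly at random, a task becomes eligible once all its predecessors are mapped, and the only constraint is a toy area budget that accumulates per processing element. The names `build_trace`, `preds`, `options` and `capacity` are hypothetical.

```python
import random

def build_trace(preds, options, capacity, rng):
    """preds: task -> set of predecessor tasks.
    options: task -> list of (implementation, processing element, area).
    capacity: processing element -> area budget (toy constraint, assumption)."""
    used = {pe: 0 for pe in capacity}
    done, trace = set(), []
    while len(done) < len(preds):
        # Only decisions that keep every constraint satisfied are admissible.
        admissible = [
            (task, impl, pe, area)
            for task in preds
            if task not in done and preds[task] <= done
            for impl, pe, area in options[task]
            if used[pe] + area <= capacity[pe]
        ]
        if not admissible:
            raise ValueError("no admissible decision left")
        task, impl, pe, area = rng.choice(admissible)
        used[pe] += area
        done.add(task)
        trace.append((task, impl, pe))   # position in the trace = priority
    return trace

preds = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
options = {t: [("sw", "cpu", 0), ("hw", "fpga", 40)] for t in preds}
capacity = {"cpu": 0, "fpga": 100}
trace = build_trace(preds, options, capacity, random.Random(1))
```

Because inadmissible decisions are never offered, every completed trace is feasible by construction; here at most two tasks can end up in hardware.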

Ant Colony Optimization
The proposed approach is based on Ant Colony Optimization (ACO) to limit infeasible solutions. The ants cooperate while searching. At each step an ant has several possible decisions and chooses one stochastically, composing a trace. Stochastic choice guarantees exploration: a probability is generated for each admissible decision at each step. Feedback guarantees exploitation of the good parts of the solutions.
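The feedback step can be illustrated with the textbook ACO pheromone update (evaporation followed by reinforcement of the best trace). The slides do not give the update rule, so the evaporation rate `rho`, the `deposit` amount and the function name `update_pheromone` are assumptions made for this sketch.

```python
def update_pheromone(tau, best_trace, rho=0.1, deposit=1.0):
    """Textbook ACO update (assumption): evaporate all, then reinforce the best trace."""
    for decision in tau:
        tau[decision] *= (1.0 - rho)          # evaporation
    for decision in best_trace:
        tau[decision] = tau.get(decision, 0.0) + deposit   # reinforcement
    return tau

# Decisions are (task, implementation, component) triples.
a0 = ("A", "impl_0", "p1")
a1 = ("A", "impl_1", "p2")
tau = {a0: 1.0, a1: 1.0}
update_pheromone(tau, best_trace=[a0])
```

After the update, the decision that appears in the best trace carries more global information than its alternative, which biases later ants toward it.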

Algorithm Overview
Pseudo-code of the proposed ACO-based exploration, annotated with three phases: exploration (generating the trace), mapping decision, and exploitation (updating the global information).

Stochastic Selection Process
At each decision point d, the probability of assigning a candidate j (task or communication) to an implementation point i (implementation plus processing element) combines two terms. The global information G is the feedback: the probability that the decision leads to a good solution. The local heuristic L is a problem-specific hint, which is "adjusted" by the global information if it is wrong. A roulette wheel then extracts a combination (i, j). A probability is generated iff the resources required by the resulting PEs can be satisfied by the architecture. There is always the possibility of adding a new PE or reusing an existing one (platform customization).
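The formula on this slide did not survive extraction. As an illustration only, the sketch below uses the textbook ACO rule, in which the weight of a candidate is the product of the global information G and the local heuristic L raised to exponents `alpha` and `beta`; this product form, the exponents and the function names are assumptions, not necessarily what SMASH uses.

```python
import random

def selection_probabilities(candidates, G, L, alpha=1.0, beta=1.0):
    """Textbook ACO weighting (assumption): p is proportional to G^alpha * L^beta."""
    weights = {c: (G[c] ** alpha) * (L[c] ** beta) for c in candidates}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def roulette_wheel(probs, rng):
    """Draw one candidate with probability equal to its share of the wheel.
    Assumes probs is non-empty."""
    r = rng.random()
    acc = 0.0
    for candidate, p in probs.items():
        acc += p
        if r < acc:
            return candidate
    return candidate   # guard against floating-point rounding

# Only admissible (i, j) combinations would be listed as candidates.
G = {"x": 1.0, "y": 3.0}
L = {"x": 1.0, "y": 1.0}
probs = selection_probabilities(["x", "y"], G, L)
choice = roulette_wheel(probs, random.Random(0))
```

With equal local heuristics, the candidate holding three times the global information is selected three times as often on average.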

More about SMASH
Simultaneous MApping and Scheduling Heuristic.
SMASH iteration: generate trace, schedule trace, evaluate metrics, store solution. Termination? If no, start a new iteration; if yes, return the best solution.

Trace Generation and Evaluation
Evaluation is performed only on the complete trace, using an updated version of the original task graph (TG) augmented with communications and reconfigurations. Reconfiguration is therefore taken into account from the early stages of the design process. Different evaluation methods can be plugged in (analytical estimations vs. TLM simulations), which makes it possible to evaluate and compare alternative approaches. The decisions composing the best solution are reinforced, so over time the best trace is identified.
Notes: [trace generation] static analysis, TLM simulation (see SoC), possibility of modifying the TG to integrate more detailed information (different communications, reconfigurability).

Scheduling Definition
Input: the task graph (DAG); the trace, an ordered list of mapping decisions (task-component-implementation).
Output: start/end time estimations for each task.
Goal: reduce the total execution time.

Task  Component  Implementation
A     p1         impl_0
B     p2         impl_1
C                impl_2
D     p3         impl_3
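One simple way to turn a trace into start/end estimates is a list scheduler that visits decisions in trace order. This is a sketch under stated assumptions, not the SMASH scheduler itself: the edges (A to B, A to C, B to D, C to D) and the durations are invented, task C is assumed to run on p2 because its component is missing from the table, and communications and reconfigurations are ignored.

```python
def schedule(trace, preds, duration):
    """trace: list of (task, component, implementation) in priority order.
    Returns task -> (start, end). Assumes the trace respects precedence."""
    free_at = {}   # component -> time at which it becomes idle
    times = {}
    for task, comp, impl in trace:
        ready = max((times[p][1] for p in preds[task]), default=0)
        start = max(ready, free_at.get(comp, 0))
        end = start + duration[(task, impl)]
        times[task] = (start, end)
        free_at[comp] = end
    return times

trace = [("A", "p1", "impl_0"), ("B", "p2", "impl_1"),
         ("C", "p2", "impl_2"),   # p2 is an assumption for task C
         ("D", "p3", "impl_3")]
preds = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}   # hypothetical edges
duration = {("A", "impl_0"): 2, ("B", "impl_1"): 3,
            ("C", "impl_2"): 4, ("D", "impl_3"): 1}            # hypothetical times
times = schedule(trace, preds, duration)
makespan = max(end for _, end in times.values())
```

In this toy case B and C compete for p2, so C waits for B even though both are ready once A ends; the trace order is what resolves the conflict.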

Scheduling: Methodology Overview
SMASH scheduler: from the task graph and the trace, create the extended task graph; perform the actual scheduling (assign times) on the extended task graph; evaluate the metrics.

Extended TG: Communications
Explicit communication tasks are added based on the communication topology.

Extended TG: Reconfigurations
A reconfiguration task is introduced iff two processing tasks are mapped on the same component and their implementations are different, i.e., the module cannot be reused.
Insertion of a reconfiguration task: new edges are introduced from all WRITEs exiting the source processing task to the reconfiguration, and from the reconfiguration to all the READs entering the target processing task.
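The insertion rule can be sketched as a scan over the trace. The sketch assumes that only consecutive tasks on the same hardware component are compared, and it returns the (source, target, component) triples that need a reconfiguration instead of rewiring the WRITE and READ edges. The function name `reconfigurations`, the extra task E and the set `hw_components` are hypothetical.

```python
def reconfigurations(trace, hw_components):
    """Return (source task, target task, component) for every reconfiguration needed."""
    last = {}      # component -> (last task, its implementation)
    needed = []
    for task, comp, impl in trace:
        if comp in hw_components and comp in last:
            prev_task, prev_impl = last[comp]
            if prev_impl != impl:          # module cannot be reused
                needed.append((prev_task, task, comp))
        last[comp] = (task, impl)
    return needed

trace = [("A", "p1", "impl_0"), ("B", "p2", "impl_1"),
         ("C", "p2", "impl_2"), ("D", "p3", "impl_3"),
         ("E", "p2", "impl_2")]   # E is a hypothetical task reusing impl_2
needed = reconfigurations(trace, hw_components={"p2", "p3"})
```

Here B and C share p2 with different implementations, so one reconfiguration is inserted between them, while E reuses the module already loaded for C.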

Extended TG: Reconfigurations

Task  Component  Implementation
A     p1         impl_0
B     p2         impl_1
C                impl_2
D     p3         impl_3

Trace Evaluation
Different policies can be integrated to generate the corresponding schedule.

Architecture Refinement
The actual platform instance is derived from the resulting decisions. Hardware modules with only one task assigned are converted into static IP blocks. Hardware modules with more than one task assigned are represented as reconfigurable regions. The flow is to be integrated with the generation of the run-time manager that handles reconfigurations. This step is still work in progress and is performed manually.
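The classification rule stated above is simple enough to sketch directly. The function name `refine`, the example trace and the set `hw_components` are hypothetical; since the actual refinement step is described as manual, this only illustrates the decision rule.

```python
from collections import defaultdict

def refine(trace, hw_components):
    """Classify each hardware module by the number of tasks assigned to it."""
    tasks_on = defaultdict(list)
    for task, comp, impl in trace:
        if comp in hw_components:
            tasks_on[comp].append(task)
    return {
        comp: "static IP" if len(tasks) == 1 else "reconfigurable region"
        for comp, tasks in tasks_on.items()
    }

trace = [("A", "p1", "impl_0"), ("B", "p2", "impl_1"),
         ("C", "p2", "impl_2"), ("D", "p3", "impl_3")]
platform = refine(trace, hw_components={"p2", "p3"})
```

Processors such as p1 in this example are left out of the result because the rule only concerns hardware modules.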

Experimental Evaluation
Synthetic benchmarks (TGFF): the focus is on the scalability of the approach, with the possibility to evaluate different task graph patterns. The resulting systems (platform instance and extended task graph with mapping/scheduling decisions) are converted into virtual platforms. The different solutions are validated assuming correctness of the execution. Simulations are performed with Synopsys Platform Architect. VPU performance annotations are extracted from the tasks' implementations.

Experimental Setup
Three different classes of experiments:
Static: the FPGA area is divided into a set of up to K_S static IP cores (no partial reconfiguration).
Mixed: both IP cores and reconfigurable regions can be used, with an upper bound of K_M IPs and R_M reconfigurable regions.
Reconfigurable: architectures with no more than K_R regions.
Reconfigurable regions can also be deployed as static cores in the final architecture if only one task is assigned to them.

Experimental Results
Small task graphs cannot benefit from reconfiguration. Large task graphs are affected by communication overhead.
Table column groups: static, mixed, reconfigurable. Columns: #Task, IPs, RRs, HW tasks, #Reconf.
Table values as extracted: 12 7 6 20 18 1 17 19 31 30 4 16 41 8 40 14 52 9 51 25 26 60 15 10 53 28 27 70 55 58 33 83 11 80 54 81 56 90 23 3 5 39 100 46

Conclusions and Future Work
SMASH is an automated methodology to design reconfigurable systems. It determines the mapping and scheduling of the different tasks and allows customizing the architectural template.
Future work: integration of floorplanning procedures to compute and validate the physical constraints of the blocks; automatic generation of the platform specification.

End… http://www.fp7-faster.eu/