Mapping of Applications to Multi-Processor Systems

Slides:

Advertisements

Similar presentations

Fakultät für informatik informatik 12 technische universität dortmund Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund.

Advertisements

Embedded Systems & Parallel Programming P. Marwedel, Univ. Dortmund/Informatik 12 + ICD/ES, 2007 Universität Dortmund A view on embedded systems.

Embedded System, A Brief Introduction

fakultät für informatik informatik 12 technische universität dortmund Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund.

November 23, 2005 Egor Bondarev, Michel Chaudron, Peter de With Scenario-based PA Method for Dynamic Component-Based Systems Egor Bondarev, Michel Chaudron,

© 2004 Wayne Wolf Topics Task-level partitioning. Hardware/software partitioning.  Bus-based systems.

Hardware/ Software Partitioning 2011 年 12 月 09 日 Peter Marwedel TU Dortmund, Informatik 12 Germany Graphics: © Alexandra Nolte, Gesine Marwedel, 2003 These.

ECE-777 System Level Design and Automation Hardware/Software Co-design

1 Swiss Federal Institute of Technology Computer Engineering and Networks Laboratory Design Space Exploration of Embedded Systems © Lothar Thiele ETH Zurich.

System-level Trade-off of Networks-on-Chip Architecture Choices Network-on-Chip System-on-Chip Group, CSE-IMM, DTU.

Mapping of Applications to Platforms Peter Marwedel TU Dortmund, Informatik 12 Germany Graphics: © Alexandra Nolte, Gesine Marwedel, 2003 These slides.

- 1 -  P. Marwedel, Univ. Dortmund, Informatik 12, 05/06 Universität Dortmund Hardware/Software Codesign.

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

1 HW/SW Partitioning Embedded Systems Design. 2 Hardware/Software Codesign “Exploration of the system design space formed by combinations of hardware.

Source Code Optimization and Profiling of Energy Consumption in Embedded System Simunic, T.; Benini, L.; De Micheli, G.; Hans, M.; Proceedings on The 13th.

ARTIST2 Network of Excellence on Embedded Systems Design cluster meeting –Bologna, May 22 nd, 2006 System Modelling Infrastructure Activity leader : Jan.

Trend towards Embedded Multiprocessors Popular Examples –Network processors (Intel, Motorola, etc.) –Graphics (NVIDIA) –Gaming (IBM, Sony, and Toshiba)

HW/SW Co-Synthesis of Dynamically Reconfigurable Embedded Systems HW/SW Partitioning and Scheduling Algorithms.

Torino (Italy) – June 25th, 2013 Ant Colony Optimization for Mapping, Scheduling and Placing in Reconfigurable Systems Christian Pilato Fabrizio Ferrandi,

1 Embedded Computer System Laboratory RTOS Modeling in Electronic System Level Design.

Universität Dortmund  P. Marwedel, Univ. Dortmund, Informatik 12, 2003 Hardware/software partitioning  Functionality to be implemented in software.

1 Presenter: Ming-Shiun Yang Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, DATE ‘09. A Generic.

- 1 - EE898-HW/SW co-design Hardware/Software Codesign “Finding right combination of HW/SW resulting in the most efficient product meeting the specification”

Course Outline DayContents Day 1 Introduction Motivation, definitions, properties of embedded systems, outline of the current course How to specify embedded.

Computer Science 12 Embedded Systems Group © H. Falk | Dortmund, 08-Jul-08 Overview about Computer Science 12 at Dortmund University of Technology Heiko.

ECE-777 System Level Design and Automation Introduction 1 Cristinel Ababei Electrical and Computer Department, North Dakota State University Spring 2012.

Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.

Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.

Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:

Mapping of Applications to Multi-Processor Systems Peter Marwedel TU Dortmund, Informatik 12 Germany 2014 年 01 月 17 日 These slides use Microsoft clip arts.

Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.

Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.

HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.

ASIP Architecture for Future Wireless Systems: Flexibility and Customization Joseph Cavallaro and Predrag Radosavljevic Rice University Center for Multimedia.

Static Translation of Stream Programs S. M. Farhad School of Information Technology The University of Sydney.

IEEE ICECS 2010 SysPy: Using Python for processor-centric SoC design Evangelos Logaras Elias S. Manolakos {evlog, Department of Informatics.

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

Page 1 Reconfigurable Communications Processor Principal Investigator: Chris Papachristou Task Number: NAG Electrical Engineering & Computer Science.

Hardware/Software Co-design Design of Hardware/Software Systems A Class Presentation for VLSI Course by : Akbar Sharifi Based on the work presented in.

Evaluation and Validation Peter Marwedel TU Dortmund, Informatik 12 Germany 2013 年 12 月 02 日 These slides use Microsoft clip arts. Microsoft copyright.

- 1 - EE898_HW/SW Partitioning Hardware/software partitioning  Functionality to be implemented in software or in hardware? No need to consider special.

DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE Novel, Emerging Computing System Technologies Smart Technologies for Effective Reconfiguration: The FASTER approach.

6. A PPLICATION MAPPING 6.3 HW/SW partitioning 6.4 Mapping to heterogeneous multi-processors 1 6. Application mapping (part 2)

Teaching The Principles Of System Design, Platform Development and Hardware Acceleration Tim Kranich

CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.

1 of 14 Lab 2: Formal verification with UPPAAL. 2 of 14 2 The gossiping persons There are n persons. All have one secret to tell, which is not known to.

1 of 14 Lab 2: Design-Space Exploration with MPARM.

Static Translation of Stream Program to a Parallel System S. M. Farhad The University of Sydney.

Mapping of Applications to Multi-Processor Systems Peter Marwedel TU Dortmund, Informatik 12 Germany 2012 年 07 月 09 日 Graphics: © Alexandra Nolte, Gesine.

CoDeveloper Overview Updated February 19, Introducing CoDeveloper™  Targeting hardware/software programmable platforms  Target platforms feature.

Sridhar Rajagopal Bryan A. Jones and Joseph R. Cavallaro

System-on-Chip Design

Combinatorial Optimization for Embedded System Design

Dynamo: A Runtime Codesign Environment

ECE354 Embedded Systems Introduction C Andras Moritz.

Ph.D. in Computer Science

Evaluating Register File Size

Enabling machine learning in embedded systems

Texas Instruments TDA2x and Vision SDK

Standard Optimization Techniques

Digital Processing Platform

Digital Processing Platform

Evaluation and Validation

A High Performance SoC: PkunityTM

Introduction to Embedded Systems

Final Project presentation

Department of Electrical Engineering Joint work with Jiong Luo

Maria Méndez Real, Vincent Migliore, Vianney Lapotre, Guy Gogniat

Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs

Martin Croome VP Business Development GreenWaves Technologies.

Presentation transcript:

Mapping of Applications to Multi-Processor Systems Iwan Sonjaya © Springer, 2010 These slides use Microsoft clip arts. Microsoft copyright restrictions apply.

Structure of this course 2: Specification & Modeling Design repository Design 3: ES-hardware 8: Test Application Knowledge 6: Application mapping 5: Evaluation & validation (energy, cost, performance, …) 7: Optimization 4: system software (RTOS, middleware, …) Numbers denote sequence of chapters

The need to support heterogeneous architectures Energy efficiency a key constraint, e.g. for mobile systems Unconventional architectures close to IPE Efficiency required for smart assistants “inherent power efficiency (IPE)“  How to map to these architectures? © Renesas, MPSoC‘07 © Hugo De Man/Philips, 2007

Practical problem in automotive design Which processor should run the software?

A Simple Classification Architecture fixed/ Auto-parallelizing Fixed Architecture Architecture to be designed Starting from given task graph Map to CELL, Hopes, Qiang XU (HK) Simunic (UCSD) COOL codesign tool; EXPO/SPEA2 SystemCodesigner Auto-parallelizing Mnemee (Dortmund) Franke (Edinburgh) Daedalus MAPS

Example: System Synthesis © L. Thiele, ETHZ

Basic Model – Problem Graph © L. Thiele, ETHZ

Basic Model: Specification Graph © L. Thiele, ETHZ

Design Space Which architecture is better suited for our application? Communication Templates Computation Templates DSP mE Cipher SDRAM RISC FPGA LookUp Scheduling/Arbitration proportional share WFQ static dynamic fixed priority EDF TDMA FCFS Which architecture is better suited for our application? Architecture # 1 Architecture # 2 LookUp RISC EDF mE mE mE TDMA static Priority mE mE mE WFQ Cipher DSP © L. Thiele, ETHZ

Evolutionary Algorithms for Design Space Exploration (DSE) © L. Thiele, ETHZ

Challenges © L. Thiele, ETHZ

EXPO – Tool architecture (1) MOSES system architecture performance values EXPO SPEA 2 task graph, scenario graph, flows & resources Exploration Cycle selection of “good” architectures © L. Thiele, ETHZ

EXPO – Tool architecture (2) Tool available online: http://www.tik.ee.ethz.ch/expo/expo.html © L. Thiele, ETHZ

EXPO – Tool (3) © L. Thiele, ETHZ

Application Model Example of a simple stream processing task structure: © L. Thiele, ETHZ

Exploration – Case Study (1) © L. Thiele, ETHZ

Exploration – Case Study (2) © L. Thiele, ETHZ

Exploration – Case Study (3) © L. Thiele, ETHZ

More Results Performance for encryption/decryption Performance for RT voice processing © L. Thiele, ETHZ

Extension: considering memory characteristics with MAMOT Example

Gains resulting from the use of memory information Olivera Jovanovic, Peter Marwedel, Iuliana Bacivarov, Lothar Thiele: MAMOT: Memory-Aware Mapping Tool for MPSoC, 15th Euromicro Conference on Digital System Design (DSD 2012) 2012

Design Space Exploration with SystemCoDesigner (Teich et al Design Space Exploration with SystemCoDesigner (Teich et al., Erlangen) System Synthesis comprises: Resource allocation Actor binding Channel mapping Transaction modeling Idea: Formulate synthesis problem as 0-1 ILP Use Pseudo-Boolean (PB) solver to find feasible solution Use multi-objective Evolutionary algorithm (MOEA) to optimize Decision Strategy of the PB solver © J. Teich, U. Erlangen-Nürnberg

A 3rd approach based on evolutionary algorithms: SYMTA/S: [R. Ernst et al.: A framework for modular analysis and exploration of heteterogenous embedded systems, Real-time Systems, 2006, p. 124]

A Simple Classification Architecture fixed/ Auto-parallelizing Fixed Architecture Architecture to be designed Starting from given task graph Map to CELL, Hopes Qiang XU (HK) Simunic (UCSD) COOL codesign tool; EXPO/SPEA2 SystemCodesigner Auto-parallelizing Mnemee (Dortmund) Franke (Edinburgh) Daedalus MAPS

A fixed architecture approach: Map  CELL Martino Ruggiero, Luca Benini: Mapping task graphs to the CELL BE processor, 1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008

Partitioning into Allocation and Scheduling © Ruggiero, Benini, 2008

A Simple Classification Architecture fixed/ Auto-parallelizing Fixed Architecture Architecture to be designed Starting from given task graph Map to CELL, Hopes, Qiang XU (HK) Simunic (UCSD) COOL codesign tool; EXPO/SPEA2 SystemCodesigner Auto-parallelizing Mnemee (Dortmund) Franke (Edinburgh) Daedalus MAPS

Xilinx Platform Studio (XPS) Daedalus Design-flow Sequential application System-level design space exploration Explore, modify, select instances Library of IP cores Library of IP cores High-level Models Sesame KPNgen Automatic Parallelization System-Level Specification Common XML Interface Platform specification Mapping specification Parallel application specification Kahn Process Network RTL-level Models System-level synthesis ESPAM RTL-Level Specification Synthesizable VHDL C/C++ code for processors MP-SoC Ed Deprettere et al.: Toward Composable Multimedia MP-SoC Design,1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008 Xilinx Platform Studio (XPS) Multi-processor System on Chip (Synthesizable VHDL and C/C++ code for processors) © E. Deprettere, U. Leiden

JPEG/JPEG2000 case study Example architecture instances for a single-tile JPEG encoder: 16KB 32KB 32KB 4KB 2KB Vin,DCT Q,VLE,Vout Vin,Q,VLE,Vout DCT 2 MicroBlaze processors (50KB) 1 MicroBlaze, 1HW DCT (36KB) DCT, Q 2KB 4x2KB DCT Q 2KB 2KB 8KB 32KB 8KB DCT, Q Vin VLE, Vout Vin 8KB 32KB VLE, Vout DCT, Q 8KB 4x2KB 2KB 2KB 2KB DCT Q DCT, Q 4x16KB 6 MicroBlaze processors (120KB) 4 MicroBlaze, 2HW DCT (68KB) © E. Deprettere, U. Leiden

Sesame DSE results: Single JPEG encoder DSE © E. Deprettere, U. Leiden

A Simple Classification Architecture fixed/ Auto-parallelizing Fixed Architecture Architecture to be designed Starting from given task graph Map to CELL, Hopes, Qiang XU (HK) Simunic (UCSD) COOL codesign tool; EXPO/SPEA2 SystemCodesigner Auto-parallelizing Mnemee (Dortmund) Franke (Edinburgh) Daedalus MAPS

Auto-Parallelizing Compilers Discipline “High Performance Computing”: Research on vectorizing compilers for more than 25 years. Traditionally: Fortran compilers. Such vectorizing compilers usually inappropriate for Multi-DSPs, since assumptions on memory model unrealistic: Communication between processors via shared memory Memory has only one single common address space De Facto no auto-parallelizing compiler for Multi-DSPs! Work of Franke, O’Boyle (Edinburgh) Work of Daniel Cordes, Olaf Neugebauer (TU Dortmund)

Example: Edge detection benchmark D. Cordes: Automatic Parallelization for Embedded Multi-Core Systems using High-Level Cost Models, PhD thesis, TU Dortmund, 2013, https://eldorado.tu-dortmund.de/ bitstream/2003/31796/1/Dissertation.pdf

Task Parallelism D. Cordes: Automatic Parallelization for Embedded Multi-Core Systems using High-Level Cost Models, PhD thesis, TU Dortmund, 2013, https://eldorado.tu-dortmund.de/ bitstream/2003/31796/1/Dissertation.pdf

MaCC Application: Multi-Objective Task-Level Parallelization Restrictions of embedded architectures Low computational power, small memories, battery-driven, … Good trade-off between different objectives must be found As fast as necessary instead of as fast as possible Speedup vs Energy Consumption for Matrix Multiplication on MPARM with MEMSIM © D. Cordes, 2013

Multi-Objective Task-Level Approach (2) Communication overhead Energy Speedup of Execution Time Edge detect benchmark from UTDSP benchmark suite Target architecture: MPARM with MEMSIM energy model (1-4 cores) © D. Cordes, 2013

Pipeline Parallelization 1. Extract pipeline stages (horizontal splits) ILP formulation:

Heterogeneous Pipeline Parallelization Heterogeneous MPSoCs can be more efficient than homogeneous Cores behave differently on same parts of application Efficient balancing of tasks very difficult Use ILP as clear mathematical model integrating cost models Combine mapping with task extraction big.LITTLE © D. Cordes, 2013

Heterogeneous pipeline parallelization results Cycle accurate simulator: CoMET (Vast) 100, 250, 500, 500 MHz ARM1176 cores Baseline: Sequential on 100 MHz core Average speedup: 8.9x Max.Speedup: 11.9x © D. Cordes, 2013

MAPS-TCT Framework Rainer Leupers, Weihua Sheng: MAPS: An Integrated Framework for MPSoC Application Parallelization, 1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008 © Leupers, Sheng, 2008

Summary Evolutionary algorithms currently the best choice Clear trend toward multi-processor systems for embedded systems, there exists a large design space Using architecture crucially depends on mapping tools Mapping applications onto heterogeneous MP systems needs allocation (if hardware is not fixed), binding of tasks to resources, scheduling Two criteria for classification Fixed / flexible architecture Auto parallelizing / non-parallelizing Introduction to proposed Mnemee tool chain Evolutionary algorithms currently the best choice