Application-to-Architecture Mapping


Application-to-Architecture Mapping
4541.633A SoC Design Automation
School of EECS, Seoul National University

Introduction: System design methodology
- Traditional method
  - Mostly bottom-up
  - Given the application and constraints
  - First assemble HW components
  - Then develop SW
  - What if it fails to meet the specification? --> reassemble
- HW-SW codesign
  - Mostly top-down
  - Given the application, constraints, and a simple architectural assumption
  - Partition the application into HW and SW
  - Synthesize from the partitions

HW-SW Codesign: Typical HW-SW codesign flow
[Figure: codesign flow diagram. Blocks: System Specification; Analysis; Internal Rep.; System Simulation; HW-SW Partitioning (SW part, Interface, HW part); SW Generation; Interface Synthesis; HW Synthesis; SW compilation; System Integration; System Implementation]

HW-SW Codesign: Polis
F. Balarin, et al., Hardware-Software Co-Design of Embedded Systems: The Polis Approach, Kluwer Academic Publishers, 1997.
- A design environment for control-dominated embedded systems
- MoC: CFSM (Co-design Finite State Machine)
  - Globally asynchronous/locally synchronous
- Formal verification or simulation for the analysis of a system at the behavioral level
- It can generate C code and HDL code
- Weak points
  - Only CFSM: control-dominated applications
  - Does not support estimation techniques for complex processor models
  - Does not support multiple hardware and software partitioning

HW-SW Codesign: Polis overall flow
[Figure: Polis flow diagram. Blocks: formal languages (Esterel); translators; CFSMs; simulation; formal verification; intermediate format translator; partitioning; partitioned CFSMs; HW synthesis; SW synthesis; interface synthesis; OS synthesis; S-graph; scheduler template + timing constraints; BLIF; logic synthesis; optimized hardware; C code; HW interface; integration]

HW-SW Partitioning
- Partitioning system functionality into
  - Application-specific hardware and
  - Software executing on one (or more) processor(s)
- Partitioning problem
  - Find the minimum-cost HW-SW combination satisfying the constraints
  - Cost = f(HW area, HW delay, SW size, SW time, interface size, interface delay, power, ...)
- Need efficient and accurate performance, cost, and power estimation models
- Need efficient partitioning algorithms
  - Greedy method
  - Simulated annealing
  - Kernighan-Lin
  - Integer linear programming
  - Global criticality/local phase
  - Manual
  - ...
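To make the partitioning problem concrete, here is a minimal brute-force sketch. It is not one of the algorithms listed above: it simply enumerates every HW/SW assignment for a handful of tasks. The task names, the numbers, the additive (sequential) timing model, and the choice of HW area as the only cost term are all assumptions made for illustration.

```python
from itertools import product

def best_partition(tasks, deadline):
    """Return (hw_area, hw_task_names) for the cheapest feasible partition.

    tasks maps a task name to (sw_time, hw_time, hw_area).
    Assumption: tasks run one after another, so times simply add up,
    and the cost function f(...) is reduced to total HW area.
    Returns None when no assignment meets the deadline.
    """
    names = sorted(tasks)
    best = None
    # Enumerate all 2^n assignments (True = hardware, False = software).
    for choice in product((False, True), repeat=len(names)):
        time = sum(tasks[n][1] if hw else tasks[n][0]
                   for n, hw in zip(names, choice))
        area = sum(tasks[n][2] for n, hw in zip(names, choice) if hw)
        if time <= deadline and (best is None or area < best[0]):
            best = (area, {n for n, hw in zip(names, choice) if hw})
    return best

# Hypothetical task set: (sw_time, hw_time, hw_area)
TASKS = {"fir": (10, 2, 4), "fft": (6, 3, 3), "ctl": (4, 3, 5)}
```

Exhaustive enumeration grows as 2^n, which is why the heuristics listed on the slide are needed for realistic task counts.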

HW-SW Partitioning: ILP-based approach
R. Niemann and P. Marwedel, "Hardware/software partitioning using integer programming," Proc. ED&TC, Mar. 1996.
- Concurrent partitioning, scheduling, and sharing
- Integer linear programming
  - Minimize design cost with performance and resource constraints
[Figure: flow diagram. Blocks: VHDL; C code; VHDL code; retargetable compilation (SW costs); high-level synthesis (HW costs); partitioning (solve ILP); cluster SW nodes; retargetable compilation (SW costs)]

HW-SW Partitioning: Global criticality/local phase
A. Kalavade and E. A. Lee, "A global criticality/local phase driven algorithm for the constrained hardware/software partitioning problem," Proc. Codes/CASHE, Sept. 1994, pp. 42-48.
- Global Criticality/Local Phase (GCLP)
  - GC
    - Global time-criticality (feasibility)
    - Node-invariant
  - LP
    - Classify each node into three phases: extremity, repeller, normal
- Determine mapping and start time for each node
- Quadratic complexity
- Task/process level of granularity

HW-SW Partitioning: Objective function
- Not hardwired
- Selected at each step according to GC & LP

HW-SW Partitioning: Global criticality
- Probability that an unscheduled node (in U) should be implemented in HW to meet the latency constraint
- Algorithm
  1. Estimate the set H of nodes to move to HW according to priority (more performance, less area --> higher priority) so that the remaining SW nodes can be executed within T_remaining
  2. Compute the actual finish time
  3. If not feasible, go to 1.
  4. Compute GC = (size of H)/(size of U), where size is the number of elementary operations
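The steps above can be sketched as follows. This is a simplified illustration, not the published algorithm: the `Node` fields, the priority formula (time saved per unit of area), and the additive finish-time estimate are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    sw_time: float   # execution time if mapped to software
    hw_time: float   # execution time if mapped to hardware
    hw_area: float   # area if mapped to hardware
    size: int        # number of elementary operations

def global_criticality(unscheduled, t_remaining):
    """Estimate GC = size(H) / size(U) for the unscheduled node set U."""
    # Assumed priority: more time saved per unit of area comes first.
    ranked = sorted(unscheduled,
                    key=lambda n: (n.sw_time - n.hw_time) / n.hw_area,
                    reverse=True)
    in_hw = []
    # Assumed finish-time model: all remaining nodes run sequentially.
    finish = sum(n.sw_time for n in unscheduled)
    for node in ranked:
        if finish <= t_remaining:
            break                      # feasible: stop moving nodes to HW
        in_hw.append(node)
        finish -= node.sw_time - node.hw_time
    total = sum(n.size for n in unscheduled)
    return sum(n.size for n in in_hw) / total if total else 0.0

# Hypothetical unscheduled set U
U = [Node("A", 10, 2, 4, 8), Node("B", 6, 3, 3, 4), Node("C", 4, 3, 5, 4)]
```

With a loose time budget nothing has to move to hardware and GC is 0; with a very tight budget every node is moved and GC approaches 1.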

HW-SW Partitioning: Local phase
- Local phase 1: extremity
  - Determine extremity sets EXs and EXh
- Local phase 2: repellers
  - Software repeller property
    - Bit-level instruction mix, precision level
  - Hardware repeller property
    - Memory-intensive instruction mix, table-lookup instruction mix

HW-SW Partitioning: Compute D
- If i ∈ (EXs ∪ EXh), -0.5 < D < 0.5 depending on the level of extremity (more negative if HW is preferred)
- Else if repeller, -0.5 < D < 0.5 depending on the repeller value (more negative if HW is preferred)
- For a normal node, D = 0
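The slides do not spell out how GC and D are combined. One common reading of GCLP, assumed here and not confirmed by the slide text, is that D shifts a threshold of 0.5 and the objective is switched by comparing GC against it. The function name and the objective labels below are hypothetical.

```python
def select_objective(gc, d):
    """Pick the objective for the current node from GC and the local-phase value D.

    Assumption: threshold = 0.5 + D. A more negative D (HW preferred)
    lowers the threshold, so the time-driven objective is chosen more easily.
    """
    threshold = 0.5 + d
    if gc >= threshold:
        return "minimize_finish_time"   # time is critical: favors HW mapping
    return "minimize_area"              # time is not critical: favors SW mapping
```

For a normal node (D = 0) the choice depends on GC alone; for an extremity or repeller node the same GC can lead to a different objective.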

HW-SW Partitioning: Experimental results
- ILP: several hours
- GCLP: order of seconds
- Good solution: low HW area and high DSP utilization
- HA: hardware area, SA: software area, Util: DSP utilization
  - HA is the total hardware area as a fraction of the capacity constraint.
  - SA is the total software size as a fraction of the memory capacity constraint.

HW-SW Partitioning: Implementation-bin selection
A. Kalavade and E. A. Lee, "The extended partitioning problem: hardware/software mapping and implementation-bin selection," Proc. of the 6th International Workshop on Rapid System Prototyping, 1995.
- Mapping and implementation-bin selection (MIBS)

HW-SW Partitioning: Algorithm
- Perform GCLP-based HW-SW partitioning
  - Use median values for the HW cost/time
- Implementation-bin selection is applied to HW only, but it is also applicable to SW
- Bin Fraction Curve (BFC)
  - Fraction of free nodes that need to be mapped to their L bins
- Bin Sensitivity Curve (BSC)
  - Slopes of the BFC

HW-SW Partitioning: Algorithm, computation of BFC
- next( ) selects a node from U by using different ranking functions such as thiH or ahiL

HW-SW Partitioning: Algorithm, weighted bin sensitivity curve

HW-SW Partitioning: Results
[Figure: results comparing nodes mapped to L bins with nodes mapped to median implementation bins]

Platform-Based Design: Trend in System-on-Chip (SoC) design
- Larger design space
  - Exponentially growing transistor counts (Moore's law)
- Ever-increasing complexity of applications
  - Multi-functional and multi-standard
  - More flexibility, higher performance, lower energy, ...
- Shorter time-to-market
- Need a more efficient design methodology
  - Complexity: 58%/yr growth rate
  - Productivity: 21%/yr growth rate

Platform-Based Design: Reuse
- Reuse of
  - Cell (standard cell)
  - IP
  - Architecture (platform) --> platform-based design
  - IC (reconfigurability)
[Figure: single-chip videophone (H.263). Blocks: Memory (Video RAM); I/O (Host interface); DSP core 1 (D950): Modem; DSP core 2: Sound; ASIP 1: Master Control; ASIP 2: Controller; ASIP 3: Bit Manipulation; ASIP 4 (VLIW DSP): programmable video operations, standard extensions; S interface; Glue logic; A/D & D/A; High-speed HW: video operations for DCT, IDCT, motion estimation]

Platform-Based Design: Platform and derivative design
[Figure: platform and derivative design. Labels: Soft IP; Hard IP; Others; EDA Tools; EDA Integrator; Application-specific integration platform; Derivative]

Platform-Based Design: Design-space exploration
[Figure: two panels, Conventional Design and Platform-Based Design. Labels: Specification; Application Space; Application Instance; Architectural Space; Platform Instance; System; Large Design-Space Exploration (conventional); Platform Design-Space Exploration (platform-based)]

Platform-Based Design: Taxonomy of SoC platforms
- Full-application platforms
  - Philips Nexperia
  - TI OMAP (Open Multimedia Application Platform)
  - ARM PrimeXsys
  - Intel XScale Architecture
- Processor-centric platforms
  - Improv Jazz
  - Tensilica Xtensa
- Communication-centric platforms
  - ARM AMBA bus architecture
  - Sonics µNetwork
  - IBM CoreConnect
- Fully programmable platforms
  - Altera Excalibur
  - Xilinx Virtex-II Pro

Platform-Based Design: Full-application platform
- Concentrates on the full application
- Delivers a comprehensive set of libraries (hardware and software)
- Delivers several mapping and application examples
- Texas Instruments OMAP
  - Application domain: 2.5G/3G wireless mobile devices
- Philips Nexperia
  - Application domain: digital video, digital audio, mobile communications

Platform-Based Design: Texas Instruments OMAP1610
- Dual processor core
  - ARM926, TI DSP
  - Up to 200 MHz
- Multimedia cores
  - 2D graphics accelerator
  - LCD controller
  - MMC interface
  - USB interface
- Wireless support
  - Bluetooth
  - 3G

Platform-Based Design: Nexperia platform
- MIPS: general-purpose scalable RISC processor
  - 50 to 300+ MHz
  - 32-bit or 64-bit
- TriMedia: scalable VLIW media processor
  - 100 to 300+ MHz
  - 32-bit or 64-bit
- Library of device IP blocks
  - Image coprocessors
  - DSPs
  - UART
  - 1394
  - USB
  - ...and more
- Nexperia system buses
  - 32-128 bit
[Figure: DVP system silicon. Blocks: MIPS CPU (PRxxxx) with I$ and D$; TriMedia CPU (TM-xxxx) with I$ and D$; MMI; SDRAM; DVP memory bus; two PI buses; device IP blocks]

Platform-Based Design: Nexperia software architecture
- Scalable from low-end to high-end
  - Consistent API (on MIPS or TriMedia)
  - Single streaming architecture for MIPS and TriMedia
- Aligned to the Nexperia DVP (Digital Video Platform) HW architecture and IP blocks
- Operating-system-independent software layers
  - OS abstraction library
  - Supports Linux, pSOS, Windows CE
- Re-use of software components on any instance of the platform

Platform-Based Design: Processor-centric platform
- Application-specific instruction-set processor
- Configure processor pipeline
- Generate complete software development environment
- Tensilica Xtensa
[Figure: XPRES flow. Original C/C++ code (fragment shown: int main( ) { int i; short c[100]; for (i=0;i<N/2;i++)); Run XPRES Compiler, which evaluates millions of possible extensions (SIMD operations, operator fusion, parallel execution); designer selects "best" configuration; option: manually refine configuration; Xtensa Processor Generator; outputs: Tuned Software Tools and Processor Hardware (ALU, DSP, OCD, Timer, FPU, Register File, Cache)]

Platform-Based Design: Configuration of Xtensa
[Figure: Xtensa processor block diagram. Blocks include: Processor Controls (Trace/TRACE port, JTAG/JTAG TAP control, On-Chip Debug, Exception Support, Exception Handling Registers, Data and Instruction Address Watch Registers, Interrupt Control, Timers); Instruction Fetch / PC (Instruction RAM, Instruction ROM, Instruction MMU, Instruction Cache); Align, Decode, Dispatch (including extended instructions); Base Register File and User Defined Register Files; Base ALU, MAC 16 DSP, MUL 16/32, Floating Point, Vectra DSP; User Defined Execution Units and Interfaces; User Defined Queues and Wires; Data Load/Store Units (including user defined), Data MMU, Data Cache, Data ROMs, Data RAMs; Write Buffer; External Interface: Xtensa Processor Interface (PIF) Control and Xtensa Local Memory Interface. Legend: Base ISA Feature; Configurable Function; Optional Function; Optional & Configurable; User Defined Features (TIE)]

Platform-Based Design: Communication-centric platform
- Concentrates on the communication backbone (or on-chip interconnection)
- Delivers a communication framework (plus generic peripherals)
- Sonics SiliconBackplane, PALMCHIP CoreFrame

Platform-Based Design: Fully programmable platform
- Concentrates on reconfigurability
- Delivers a processor plus programmable logic
- Xilinx Virtex-II Pro (Platform FPGA)
- Altera Excalibur (Platform FPGA)

Platform-Based Design: Xilinx Virtex-II Pro
- PowerPC uP (400 MHz)
- FPGA logic
- Internal RAM
- Serial transceiver
- XtremeDSP functions
- Digitally controlled impedance

Platform-Based Design: Altera Excalibur
[Figure: Altera Excalibur block diagram. Blocks: ARM922T with Cache and MMU; Flash; ROM; SRAM; Interrupt Controller; Watchdog Timer; SDRAM Controller; AHB1 (1/2 PLL1); AHB1-AHB2 Bridge; AHB2 (1/4 PLL1); Dual Port SRAM0 and SRAM1; Single Port SRAM0 and SRAM1; EBI; UART; Timer; Configuration Logic Master; (Configuration) Register; Stripe-to-PLD Bridge; PLD-to-Stripe Bridge; PLD masters and slaves]

Platform-Based Design: System design flow
[Figure: design flow. Inputs: Application, Architecture, Constraints; Mapping, with estimation of performance, area, and power in HW and SW; Mapping results; SW synthesis, IF synthesis, HW synthesis; outputs: SW and HW]

Application-to-Architecture Mapping
Application in C:

    for(i = 0; i < 18; i++) {
        s = (mpfloat)0.0f;
        k = 0;
        do {
            s += X[k] * v[k];
            s += X[k+1] * v[k+1];
            s += X[k+2] * v[k+2];
            s += X[k+3] * v[k+3];
            s += X[k+4] * v[k+4];
            s += X[k+5] * v[k+5];
            k += 6;
        } while(k < 18);
        v += 18;
        ISCALE(s);
        t[i] = s;
    }
    /* correct the transform into the 18x36 IMDCT we need */
    /* 36 muls */
    for(i = 0; i < 9; i++) {
        x[i] = t[i+9] * Granule_imdct_win[gr->block_type][i];
        ISCALE(x[i]);
        x[i+9] = t[17-i] * Granule_imdct_win[gr->block_type][i+9];
        ISCALE(x[i+9]);
        x[i+18] = t[8-i] * Granule_imdct_win[gr->block_type][i+18];
        ISCALE(x[i+18]);
        x[i+27] = t[i] * Granule_imdct_win[gr->block_type][i+27];
        ISCALE(x[i+27]);
    }

[Figure: the application in C shown next to the platform architecture it is to be mapped onto]

Application-to-Architecture Mapping: Y-chart approach
B. Kienhuis, E. Deprettere, K. Vissers, P. van der Wolf, "An approach for quantitative analysis of application-specific dataflow architectures," Proc. ASAP'97, 1997.
[Figure: Y-chart. Application and Architecture feed Mapping; Mapping feeds Performance analysis, which yields Performance numbers]

Application-to-Architecture Mapping: Abstraction pyramid
A. Kienhuis, Design Space Exploration of Stream-based Dataflow Architectures, Ph.D. Thesis, Delft University of Technology, 1999.

Application-to-Architecture Mapping: Design trajectory
- Golden point design (low-level ad hoc design)
- Design approach using a Y-chart environment

Application-to-Architecture Mapping: Stack of Y-charts
- Use different models at different levels of abstraction

Application-to-Architecture Mapping: Mapping
- A crucial step in design-space exploration (DSE) to evaluate the performance of different application-architecture combinations
- For smooth mapping
  - Need a good match in data and operation types between the corresponding model of architecture and model of computation
[Figure: Application (model of computation) mapped onto Architecture (model of architecture), with a match in data/operation types]

Application-to-Architecture Mapping: Model of computation (MoC)
- A formal representation of the operational semantics of networks of functional blocks describing computations
- Well-known MoCs
  - Discrete Events (DE)
  - Finite State Machines (FSM)
  - Process Networks (PN)
  - Synchronous Data Flow (SDF)
  - Synchronous/Reactive (SR)
- Many different MoCs for various application domains
- May need multiple MoCs for modeling an application

Application-to-Architecture Mapping: Model of architecture (MoA)
- A formal representation of the operational semantics of networks of functional blocks describing architectures
- It is for modeling an architecture instance of the architecture template
- Architecture template
  - A specification of a class of architectures in a parameterized form
  - Parameters are the number of functional units, buffer size, bus type, latency, etc.
- Architecture instance
  - The result of assigning values to the parameters of the architecture template
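The template/instance distinction can be illustrated with a small sketch. The parameter names follow the slide, but the value ranges, the class name, and the dictionary-based template encoding are assumptions for illustration only.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ArchitectureInstance:
    """One point in the design space: every template parameter has a value."""
    num_functional_units: int
    buffer_size: int
    bus_type: str
    latency: int

# Hypothetical template: each parameter with the values it may take.
TEMPLATE = {
    "num_functional_units": [1, 2, 4],
    "buffer_size": [16, 64],
    "bus_type": ["shared", "crossbar"],
    "latency": [1, 2],
}

def instances(template):
    """Yield every architecture instance the template describes."""
    names = list(template)
    for values in product(*(template[n] for n in names)):
        yield ArchitectureInstance(**dict(zip(names, values)))
```

Design-space exploration then amounts to evaluating some or all of these instances against the application.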

Application-to-Architecture Mapping: YAPI
E. de Kock, G. Essink, P. van der Wolf, J.-Y. Brunel, W. Kruijtzer, P. Lieverse, and K. Vissers, "YAPI: Application Modeling for Signal Processing Systems," Proc. DAC, 2000.
- YAPI: Y-chart API
  - Application modeling for signal processing systems
  - For the reuse of signal processing applications
  - For the mapping of signal processing applications onto heterogeneous systems
- Kahn process network (KPN)
  - Often used for modeling signal processing applications
  - Concurrent processes communicate through unidirectional first-in-first-out channels
  - Blocking read
  - Non-blocking write
  - Deterministic
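The KPN semantics listed above can be mimicked with threads and unbounded FIFO queues. This is a minimal sketch and not YAPI itself (which is a C++ library): the process names, the `None` end-of-stream marker, and the two-stage network are assumptions for illustration.

```python
import queue
import threading

def producer(out, values):
    """Source process: writes never block on an unbounded FIFO."""
    for v in values:
        out.put(v)
    out.put(None)  # assumed end-of-stream marker

def scale(inp, out, factor):
    """Worker process: blocking read, compute, non-blocking write."""
    while (v := inp.get()) is not None:  # get() blocks until data arrives
        out.put(v * factor)
    out.put(None)

def run_network(values, factor):
    """Run producer -> scale -> collector and return the collected tokens."""
    a, b = queue.Queue(), queue.Queue()  # unidirectional FIFO channels
    threads = [
        threading.Thread(target=producer, args=(a, values)),
        threading.Thread(target=scale, args=(a, b, factor)),
    ]
    for t in threads:
        t.start()
    result = []
    while (v := b.get()) is not None:
        result.append(v)
    for t in threads:
        t.join()
    return result
```

Because every read blocks and each channel has a single writer and a single reader, the output does not depend on how the threads are scheduled, which is the determinism property noted on the slide.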

Application-to-Architecture Mapping: A limitation of KPN
- Cannot model reactiveness such as user interaction, that is, non-deterministic events
- Control flow models such as finite state machines are a solution, but they are less suited to the implementation of computationally intensive applications.
- To extend KPN with non-deterministic events
  - Introduce a communication primitive (channel selection primitive)
- YAPI separates the concerns of the application programmer and the system designer.
- Implementation of YAPI
  - In the form of a C++ run-time library
  - read(), write(), execute(), and select()
  - The implementation of these functions is a concern of the system designer (they may be implemented in different ways).

Application-to-Architecture Mapping: Architecture evaluation in YAPI
- VIDEOTOP application
  - The top-level process network model
[Figure: top-level process network of VIDEOTOP. Inputs: MPEG2 stream; channel selection to be decoded. Legend: ts: transport stream; pid: packet id; pes: packetized elementary stream; es: elementary stream]

Application-to-Architecture Mapping: Simulation to measure the workload
- Communication requirement
  - The amount of data that is transferred between processes
- Computation requirement
  - The amount of computation of processes
- From the result
  - We know that the required communication bandwidth is 150 MB/s
  - We select an initial architecture as input for a more detailed mapping and performance analysis

Application-to-Architecture Mapping: Trace-driven approach
P. Lieverse, P. van der Wolf, E. Deprettere, K. Vissers, "A methodology for architecture exploration of heterogeneous signal processing systems," Proc. SIPS, 1999.
- SPADE (System level Performance Analysis and Design space Exploration)
  - For architecture exploration of heterogeneous signal processing systems
  - Supports an explicit mapping step
  - Cosimulation of application models and architecture models using a trace-driven simulation technique
  - The architecture model does not need to model the functional behavior, yet it still handles data-dependent behavior correctly

Application-to-Architecture Mapping: SPADE
- In SPADE, applications and architectures are modeled separately.
- An application imposes a workload on the resources provided by an architecture
  - Workload
    - Computation and communication workload
  - Resources
    - Processing resources: programmable cores or dedicated hardware
    - Communication resources: bus structures and memory resources such as RAMs or FIFO buffers

Application-to-Architecture Mapping: Trace-driven simulation
- Application model
  - A network of concurrent communicating processes
- Each process of the application model
  - Produces a so-called trace, which contains information on the communication and computation operations
- The traces are interfaced to an architecture model
  - They drive the computation and communication activities in the architecture

Application-to-Architecture Mapping: Application modeling and architecture modeling
- Application modeling
  - Kahn Process Network model
  - Modeled with a YAPI-based API
    - read(), write(), and execute()
    - They generate trace entries
    - The execute() function takes a symbolic instruction as an argument
- Architecture modeling
  - The architecture model does not model the functional behavior
  - It is constructed from generic building blocks
    - Trace-driven execution unit (TDEU): interprets trace entries and has a configurable number of I/O ports
    - Interfaces: translate the generic protocol (FIFO) into a communication-resource-specific protocol (e.g. bus)

    void Tidct(void) {
        ...
        while(1) {
            In->read(mb_in);
            mb_out = Idct(mb_in);
            execute(IDCT_MB);
            Out->write(mb_out);
        }
    }

Application-to-Architecture Mapping: Architecture modeling (Cont'd)
- All blocks are parameterized
  - TDEU: a list of symbolic instructions and latencies
  - Interface block: buffer size, bus width, setup delay, and transfer delay
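A trace-driven execution unit can be sketched as a loop that charges time for each trace entry without computing any data. This is an illustration under simplifying assumptions, not the SPADE implementation: the trace format, the single fixed `transfer_delay` per read or write, the latency value, and the function names are hypothetical, and bus contention and buffering are ignored.

```python
def tidct_trace(num_macroblocks):
    """Trace that a process like Tidct would emit: read, execute, write per block."""
    trace = []
    for _ in range(num_macroblocks):
        trace.append(("read", "In"))
        trace.append(("execute", "IDCT_MB"))   # symbolic instruction
        trace.append(("write", "Out"))
    return trace

def simulate_tdeu(trace, latencies, transfer_delay):
    """Return the total time a TDEU spends interpreting the trace.

    latencies maps each symbolic instruction to its latency (a TDEU parameter).
    Assumption: every read/write costs one fixed transfer_delay.
    """
    time = 0
    for kind, arg in trace:
        if kind == "execute":
            time += latencies[arg]
        elif kind in ("read", "write"):
            time += transfer_delay
        else:
            raise ValueError(f"unknown trace entry: {kind}")
    return time
```

Changing the latency table or the transfer delay models a different architecture instance while the application trace stays the same, which is the point of separating the two models.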

Application-to-Architecture Mapping: Mapping and simulation
- Mapping
  - Each process is mapped onto a TDEU
    - Can be many-to-one
    - Processes then need to be scheduled by the TDEU (round robin)
  - Each process port is mapped one-to-one onto an I/O port
- Simulation
  - Concurrent simulation of the application model and the architecture model
  - Architecture simulation
    - TSS (Tool for System Simulation): Philips in-house architecture modeling and simulation framework

Application-to-Architecture Mapping: Heterogeneous multiprocessor scheduling
H. Oh and S. Ha, "A hardware-software cosynthesis technique based on heterogeneous multiprocessor scheduling," Proc. CODES, May 1999.
- Perform list scheduling with the allocated PEs
[Figure: cosynthesis loop. Blocks: task-PE time table; task-PE allocation controller; heterogeneous multiprocessor scheduler; performance evaluation (Fail returns to the allocation controller, Good yields the cosynthesis result)]

Application-to-Architecture Mapping: Task-PE allocation controller
- Allocate additional PEs until the given time constraint is satisfied
  - Lock: initially lock all PEs except the lowest-cost ones
  - Unlock: select the PE giving the largest perf_gain/cost_increase
  - Re-lock: in reverse order if the time constraint is met
- Task-PE profile table, entries given as exec time(cost). Columns: P0(HW) with implementations B0, B1, B2; P1(1); P2(5), where the number in parentheses after a processor is its cost. Rows as extracted (empty cells were lost, so the column alignment is uncertain):
    A: 3(4) 2(6) 1(10) 7 2
    B: 4(5) 2(8) 10 3
    C: 2(3) 1(5) 5
    D: 5(10) 3(15) 15
[Figure: example schedules of tasks A, B, C, D on P0 and P1, and the resulting solution table (P0, P1(1), P2(5), B0; 7, 10, 2(3), 15)]
- EMD(i,j,k) = execution_time(i,j*,k*) - execution_time(i,j,k)
- ECI(i,j,k) = cost(i,j,k) - cost(i,j*,k*)
- Slack = schedule length - time constraint
- (i: node, j: processor, k: implementation; j*, k*: the minimum-cost processing element or implementation)
- ECI(i,j,k) and EMD(i,j,k) are the expected cost increment and the expected time decrement, respectively, when the task-PE pair (i,j,k) is unlocked or allocated.
- After computing the EMDs and ECIs of all PEs, choose the entry with the largest min(EMD, Slack) / ECI value among the locked pairs.
- Memory cost and bus contention are not considered.
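The unlock rule, pick the locked pair with the largest min(EMD, Slack) / ECI, can be sketched as below. The data layout, the task and PE names, and the numbers are hypothetical, and the handling of entries with no time gain or no cost increase is an assumption the slide does not cover.

```python
def pick_unlock(locked, base, slack):
    """Choose which locked (task, pe, impl) entry to unlock next.

    locked: maps (task, pe, impl) -> (exec_time, cost)
    base:   maps task -> (exec_time, cost) on its minimum-cost PE (j*, k*)
    slack:  schedule length minus the time constraint
    """
    best, best_score = None, 0.0
    for key, (time, cost) in locked.items():
        task = key[0]
        emd = base[task][0] - time      # expected time decrement
        eci = cost - base[task][1]      # expected cost increment
        if emd <= 0:
            continue                    # assumed: no speed-up, not worth unlocking
        # Assumed: a free speed-up (no cost increase) always wins.
        score = float("inf") if eci <= 0 else min(emd, slack) / eci
        if score > best_score:
            best, best_score = key, score
    return best

# Hypothetical profile data
BASE = {"A": (7, 1), "B": (10, 1)}
LOCKED = {
    ("A", "P2", None): (2, 5),
    ("A", "P0", "B0"): (3, 4),
    ("B", "P2", None): (3, 5),
}
```

Capping EMD by the slack means a large speed-up earns no extra credit once the constraint would already be met, so cheaper entries win when the schedule is only slightly too long.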

Application-to-Architecture Mapping: Scheduler
- List scheduling is used
- The priority for the list scheduling is given by BIM
  - E(i,j): execution time of node i on processor j
  - C(i,d): IPC overhead between i and d (a child node of i)
  - T(i,j): PE j is available after T(i,j)
  - BIL(i,j) = E(i,j) + max_d [ min( BIL(d,j), min_k( BIL(d,k) + C(i,d) ) ) ]
    - BIL(i,j) is the critical path length from node i to the sink.
  - BIM(i,j) = T(i,j) + BIL(i,j)
- BIM: best imaginary makespan
- BIL: best imaginary level
[Figure: node i on processor j, starting at T(i,j) and running for E(i,j), with child nodes d1 and d2 placed on processors j, k1, or k2 after IPC overheads C(i,d1) and C(i,d2), down to the sink; BIL(i,j) spans from node i to the sink]
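The BIL recurrence can be evaluated directly with memoized recursion. This is a sketch under stated assumptions: the graph encoding, the function names, and the example numbers are hypothetical, and a node without children is taken to have a BIL equal to its own execution time.

```python
from functools import lru_cache

def make_bil(exec_time, comm, children, n_procs):
    """Build BIL(i, j) for a task graph.

    exec_time[i][j]: E(i,j), execution time of node i on processor j
    comm[(i, d)]:    C(i,d), IPC overhead between i and its child d
    children[i]:     child nodes of i (missing key means a sink)
    """
    @lru_cache(maxsize=None)
    def bil(i, j):
        tails = []
        for d in children.get(i, ()):
            same_proc = bil(d, j)       # child stays on processor j, no IPC
            other_proc = min(bil(d, k) + comm[(i, d)] for k in range(n_procs))
            tails.append(min(same_proc, other_proc))
        return exec_time[i][j] + (max(tails) if tails else 0)
    return bil

def bim(bil, available, i, j):
    """BIM(i,j) = T(i,j) + BIL(i,j), with available[j] standing in for T(i,j)."""
    return available[j] + bil(i, j)

# Hypothetical graph: a -> b, a -> c, on two processors
EXEC = {"a": [2, 4], "b": [3, 1], "c": [5, 2]}
COMM = {("a", "b"): 2, ("a", "c"): 1}
CHILDREN = {"a": ("b", "c")}
BIL = make_bil(EXEC, COMM, CHILDREN, 2)
```

A list scheduler would compute BIM for each ready node on each processor and use it as the priority.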

Application-to-Architecture Mapping: Results

Application-to-Architecture Mapping: Pipelined heterogeneous multiprocessor system
Seng Lin Shee and Sri Parameswaran, "Design methodology for pipelined heterogeneous multiprocessor system," Proc. DAC, June 2007.
- Pipelining with ASIPs as processing entities

Application-to-Architecture Mapping: ASIPs
- Tensilica Xtensa LX processors are used for the ASIPs
  - Queue interface
  - Xtensa PRocessor Extension Synthesis (XPRES)

Application-to-Architecture Mapping: Design flow

Application-to-Architecture Mapping: Exhaustive search for the optimal configuration
- Complexity = O(n^p), where
  - n: number of possible processor configurations
  - p: number of processors

Application-to-Architecture Mapping: Heuristic
- Find the critical node (the processor with the worst minimum core iteration runtime)
- Find the minimum-cost configuration for the critical node
- For every other node vj,
  - Keep only the configurations that are faster than the critical node
  - Find the minimum-cost configuration for vj among them
[Figure: pipeline of nodes v1, v2, v3, v4 with runtimes r1 to r4 and costs c1 to c4]

Application-to-Architecture Mapping: Heuristic complexity
- Complexity = O(n × p), with n and p as defined for the exhaustive search
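The difference between the exhaustive search and the heuristic can be sketched as follows. The objective used here (minimize the slowest stage first, then total cost), the tie rule for the critical node (cheapest among its fastest configurations), and the configuration data are assumptions for illustration; the paper's exact formulation may differ.

```python
from itertools import product

def exhaustive(configs):
    """Try all n^p combinations; configs[v] is a list of (runtime, cost) pairs."""
    best = min(product(*configs),
               key=lambda sel: (max(r for r, _ in sel), sum(c for _, c in sel)))
    return list(best)

def heuristic(configs):
    """Visit each processor's configurations once: O(n x p)."""
    # Critical node: the processor whose fastest configuration is slowest.
    target = max(min(r for r, _ in cfgs) for cfgs in configs)
    # For each node, keep configurations no slower than the critical runtime
    # and take the cheapest of them.
    return [min((rc for rc in cfgs if rc[0] <= target), key=lambda rc: rc[1])
            for cfgs in configs]

# Hypothetical (runtime, cost) options for three pipelined processors
CONFIGS = [
    [(10, 1), (6, 3), (4, 7)],
    [(12, 1), (8, 4)],
    [(9, 2), (7, 3), (3, 9)],
]
```

Under this objective both functions return the same selection for the example data, but the heuristic never enumerates combinations across processors.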

Application-to-Architecture Mapping: Results