Application-to-Architecture Mapping

A SoC Design Automation School of EECS Seoul National University

Introduction: System design methodology
- Traditional method
  - Mostly bottom-up
  - Given application and constraints
  - First assemble HW components
  - Then develop SW
  - What if it fails to meet the specification? --> reassemble
- HW-SW codesign
  - Mostly top-down
  - Given application, constraints, and simple architectural assumption
  - Partition the application into HW and SW
  - Synthesize from the partitions

HW-SW Codesign
Typical HW-SW codesign flow
[Figure: flow chart. Boxes: System Specification, Analysis, Internal Rep., System Simulation, HW-SW Partitioning (SW part, Interface, HW part), SW Generation, Interface Synthesis, HW Synthesis, SW compilation, System Integration, System Implementation]

HW-SW Codesign: Polis
F. Balarin, et al., Hardware-Software Co-Design of Embedded Systems: The Polis Approach, Kluwer Academic Publishers, 1997.
- A design environment for control-dominated embedded systems
- MoC: CFSM (Co-design Finite State Machine)
  - Globally asynchronous/locally synchronous
- Formal verification or simulation for the analysis of a system at the behavioral level
- It can generate C code and HDL code
- Weak points
  - Only CFSM: control-dominated applications
  - Does not support estimation techniques for complex processor models
  - Does not support multiple hardware and software partitioning

HW-SW Codesign: Polis overall flow
[Figure: Polis flow. Labels: formal languages (Esterel), translators, CFSMs, partitioning, partitioned CFSMs, HW synthesis, SW synthesis, interface synthesis, BLIF, optimized hardware, C code, OS synthesis, HW interface, logic synthesis, integration, S-graph, scheduler template + timing constraints, simulation, formal verification, intermediate format, translator]

HW-SW Partitioning
- Partitioning system functionality into
  - Application-specific hardware and
  - Software executing on one (or more) processor(s)
- Partitioning problem
  - Find the minimum-cost HW-SW combination satisfying the constraints
  - Cost = f(HW area, HW delay, SW size, SW time, interface size, interface delay, power, ...)
- Need efficient and accurate performance, cost, and power estimation models
- Need efficient partitioning algorithms
  - Greedy method
  - Simulated annealing
  - Kernighan-Lin
  - Integer linear programming
  - Global criticality/local phase
  - Manual
  - ...
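The greedy method in the list above can be illustrated with a minimal sketch. It is not taken from the slides: the task table, the serial execution model (total time is the sum of task times), and the ranking by time saving per unit of HW area are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    sw_time: int  # execution time when run in software (hypothetical units)
    hw_time: int  # execution time when implemented in hardware
    hw_area: int  # area cost of the hardware implementation (> 0)


def greedy_partition(tasks, deadline):
    """Start all-software, then move tasks to HW until the deadline holds.

    Tasks are tried in order of time saving per unit of HW area (assumed
    ranking). Total time is modelled as a plain sum of task times.
    """
    hw = set()
    total = sum(t.sw_time for t in tasks)
    ranked = sorted(tasks,
                    key=lambda t: (t.sw_time - t.hw_time) / t.hw_area,
                    reverse=True)
    for t in ranked:
        if total <= deadline:
            break
        if t.sw_time > t.hw_time:  # only move tasks that actually get faster
            hw.add(t.name)
            total -= t.sw_time - t.hw_time
    area = sum(t.hw_area for t in tasks if t.name in hw)
    return hw, total, area, total <= deadline


tasks = [Task("fir", 40, 10, 6), Task("fft", 60, 15, 15), Task("ctrl", 20, 18, 4)]
hw, total, area, feasible = greedy_partition(tasks, deadline=80)
```

A greedy pass like this is fast but can miss the minimum-cost solution, which is why the slide also lists simulated annealing, Kernighan-Lin, and ILP.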

HW-SW Partitioning: ILP-based approach
R. Niemann and P. Marwedel, “Hardware/software partitioning using integer programming,” Proc. ED&TC, Mar
- Concurrent partitioning, scheduling, and sharing
- Integer linear programming
- Minimize design cost with performance & resource constraints
[Figure: flow. Labels: VHDL, C code, VHDL code, retargetable compilation, high-level synthesis, SW costs, HW costs, partitioning (solve ILP), cluster SW nodes, retargetable compilation, SW costs]

HW-SW Partitioning: Global criticality/local phase
A. Kalavade and E. A. Lee, “A global criticality/local phase driven algorithm for the constrained hardware/software partitioning problem," Proc. Codes/CASHE, Sept. 1994, pp
- Global Criticality/Local Phase (GCLP)
- GC
  - Global time-criticality (feasibility)
  - Node-invariant
- LP
  - Classify each node into three phases: extremity, repeller, normal
- Determine mapping and start time for each node
- Quadratic complexity
- Task/process level of granularity

HW-SW Partitioning: Objective function
- Not hardwired
- Selected at each step according to GC & LP

HW-SW Partitioning: Global criticality
- Probability that an unscheduled node (in U) should be implemented in HW to meet the latency constraint
- Algorithm
  1. Estimate the set H of nodes to move to HW according to priority (more performance, less area --> higher priority) so that the remaining SW nodes can be executed within T_remaining
  2. Compute the actual finish time
  3. If not feasible, go to 1.
  4. Compute GC = (size of H)/(size of U), where size is the number of elementary operations
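The GC estimate can be sketched as follows. This is a simplified reading of the algorithm above, not the authors' implementation: the node fields, the ranking by time saving per unit area, and the serial sum used for the remaining software time are assumptions, and the "compute actual finish time / go to 1" refinement loop is left out.

```python
def global_criticality(unscheduled, t_remaining):
    """Estimate GC = size(H) / size(U) for a list of unscheduled nodes.

    Each node is a dict with hypothetical keys:
      size (number of elementary operations), sw_time, hw_time, hw_area.
    Nodes are moved to the HW set H, best time saving per unit area first,
    until the software time of the nodes left in SW fits in t_remaining.
    """
    ranked = sorted(unscheduled,
                    key=lambda n: (n["sw_time"] - n["hw_time"]) / n["hw_area"],
                    reverse=True)
    sw_time_left = sum(n["sw_time"] for n in unscheduled)
    moved = []
    for node in ranked:
        if sw_time_left <= t_remaining:
            break
        moved.append(node)
        sw_time_left -= node["sw_time"]
    size_h = sum(n["size"] for n in moved)
    size_u = sum(n["size"] for n in unscheduled)
    return size_h / size_u if size_u else 0.0


nodes = [
    {"size": 10, "sw_time": 8, "hw_time": 2, "hw_area": 3},
    {"size": 30, "sw_time": 12, "hw_time": 4, "hw_area": 8},
    {"size": 20, "sw_time": 5, "hw_time": 4, "hw_area": 5},
]
gc_tight = global_criticality(nodes, t_remaining=15)
gc_loose = global_criticality(nodes, t_remaining=25)
```

With the tight budget two nodes must move to hardware, so GC is high; with the loose budget nothing moves and GC is 0.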

HW-SW Partitioning: Local phase
- Local phase 1: extremity
  - Determine extremity sets EXs and EXh
- Local phase 2: repellers
  - Software repeller property: bit-level instruction mix, precision level
  - Hardware repeller property: memory-intensive instruction mix, table-lookup instruction mix

HW-SW Partitioning: Compute D
- If i ∈ (EXs ∪ EXh): -0.5 < D < 0.5, depending on the level of extremity (more negative if HW is preferred)
- Else if i is a repeller: -0.5 < D < 0.5, depending on the repeller value (more negative if HW is preferred)
- For a normal node: D = 0
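How GC and D combine into an objective is not spelled out on these slides. The sketch below assumes the threshold rule described in the GCLP paper: compare GC against 0.5 + D and minimize finish time when GC reaches the threshold, otherwise minimize area. The function name and the objective labels are hypothetical.

```python
def select_objective(gc, d):
    """Pick the objective for the current node from GC and the local phase D.

    Assumed rule: threshold = 0.5 + D. A negative D (HW preferred) lowers
    the threshold, so the time-minimizing objective is chosen more readily.
    """
    if not -0.5 < d < 0.5:
        raise ValueError("D is expected to lie strictly between -0.5 and 0.5")
    threshold = 0.5 + d
    return "minimize_finish_time" if gc >= threshold else "minimize_area"


normal_critical = select_objective(0.6, 0.0)   # normal node, time is critical
normal_relaxed = select_objective(0.4, 0.0)    # normal node, time is not critical
hw_extremity = select_objective(0.4, -0.3)     # HW-preferred node lowers the bar
sw_extremity = select_objective(0.6, 0.3)      # SW-preferred node raises the bar
```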

HW-SW Partitioning: Experimental results
- ILP: several hours
- GCLP: order of seconds
- Good solution: low HW area and high DSP utilization
- HA: hardware area, SA: software area, Util: DSP utilization
  - HA is the total hardware area as a fraction of the capacity constraint.
  - SA is the total software size as a fraction of the memory capacity constraint.

HW-SW Partitioning: Implementation-bin selection
A. Kalavade and E. A. Lee, "The extended partitioning problem: hardware/software mapping and implementation-bin selection," Proc. of the 6th International Workshop on Rapid System Prototyping, 1995.
- Mapping and implementation-bin selection (MIBS)

Algorithm
- Perform GCLP-based HW-SW partitioning
  - Use median values for the HW cost/time
- Implementation-bin selection is applied to HW only, but it is also applicable to SW
- Bin Fraction Curve (BFC)
  - Fraction of free nodes that need to be mapped to their L bins
- Bin Sensitivity Curve (BSC)
  - Slopes of the BFC

HW-SW Partitioning: Algorithm, computation of BFC
- next( ) selects a node from U by using different ranking functions such as thiH or ahiL

HW-SW Partitioning: Algorithm, weighted bin sensitivity curve

HW-SW Partitioning: Results
[Figure: results for nodes mapped to L bins versus nodes mapped to median implementation bins]

Platform-Based Design
- Trend in System-on-Chip (SoC) design
  - Larger design space
    - Exponentially growing transistor counts (Moore's law)
    - Ever-increasing complexity of applications
    - Multi-functional and multi-standard
    - More flexibility, higher performance, lower energy, ...
  - Shorter time-to-market
- Need a more efficient design methodology
  - Complexity: 58%/yr growth rate
  - Productivity: 21%/yr growth rate

Reuse of
- Cell (standard cell)
- IP
- Architecture (platform) --> platform-based design
- IC (reconfigurability)
[Figure: single-chip videophone (H.263). Blocks: Memory (Video RAM), I/O (Host interface), DSP core 1 (D950, Modem), DSP core 2 (Sound), ASIP 1 (Master Control), ASIP 2 (Controller), ASIP 3 (Bit Manipulation), ASIP 4 (VLIW DSP: programmable video operations, standard extensions), S interface, Glue logic, A/D & D/A, High-speed HW (video operations for DCT, IDCT, motion estimation)]

Platform and derivative design
[Figure: Soft IP, Hard IP, and other components, together with EDA tools, are combined by the integrator into an application-specific integration platform, from which derivatives are built]

Design-space exploration
[Figure: conventional design versus platform-based design. In conventional design, an application instance from the application space is mapped through a large design-space exploration of the architectural space. In platform-based design, the platform specification narrows the platform design-space exploration between the application instance and the platform instance.]

Taxonomy of SoC platforms
- Full-application platforms
  - Philips Nexperia
  - TI OMAP (Open Multimedia Application Platform)
  - ARM PrimeXsys
  - Intel XScale Architecture
- Processor-centric platforms
  - Improv Jazz
  - Tensilica Xtensa
- Communication-centric platforms
  - ARM AMBA bus architecture
  - Sonics μNetwork
  - IBM CoreConnect
- Fully programmable platforms
  - Altera Excalibur
  - Xilinx Virtex-II Pro

Full-application platform
- Concentrates on the full application
- Delivers a comprehensive set of hardware and software libraries
- Delivers several mapping and application examples
- Texas Instruments OMAP
  - Application domain: 2.5G/3G wireless mobile devices
- Philips Nexperia
  - Application domain: digital video, digital audio, mobile communications

Texas Instruments OMAP1610
- Dual processor core
  - ARM926, TI DSP
  - Up to 200MHz
- Multimedia cores
  - 2D graphics accelerator
  - LCD controller
  - MMC interface
  - USB interface
- Wireless support
  - Bluetooth
  - 3G

Nexperia platform
- MIPS™: general-purpose scalable RISC processor, 50 to 300+ MHz, 32-bit or 64-bit
- TriMedia™: scalable VLIW media processor, 100 to 300+ MHz, 32-bit or 64-bit
- Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB, ...and more
- Nexperia™ system buses
[Figure: DVP system silicon. A MIPS CPU (PRxxxx) and a TriMedia CPU (TM-xxxx), each with I$ and D$, share SDRAM through the MMI and the DVP memory bus; device IP blocks attach to the two PI buses]

Nexperia software architecture
- Scalable from low-end to high-end
- Consistent API (on MIPS or TriMedia)
- Single streaming architecture for MIPS and TriMedia
- Aligned to the Nexperia™ DVP (Digital Video Platform) HW architecture and IP blocks
- Operating-system-independent software layers
  - OS abstraction library
  - Supports Linux, pSOS, Windows CE
- Re-use of software components on any instance of the platform

Processor-centric platform
- Application Specific Instruction Set Processor
  - Configure the processor pipeline
  - Generate a complete software development environment
- Tensilica Xtensa
[Figure: Xtensa design flow. Original C/C++ code (fragment shown: int main( ) { int i; short c[100]; for (i=0;i<N/2;i++) ...) --> Run XPRES Compiler, which evaluates millions of possible extensions (SIMD operations, operator fusion, parallel execution) --> Designer selects the “best” configuration (option: manually refine the configuration) --> Xtensa Processor Generator --> Tuned software tools and processor hardware (ALU, DSP, OCD, Timer, FPU, Register File, Cache)]

Configuration of Xtensa
[Figure: Xtensa processor block diagram. Legend: Base ISA Feature, Configurable Function, Optional Function, Optional & Configurable, User Defined Features (TIE). Blocks: Processor Controls, Instruction Fetch / PC, Instruction RAM, Instruction ROM, Instruction MMU, Instruction Cache, Trace and TRACE Port, JTAG and JTAG Tap Control, On Chip Debug, Extended Instruction Align, Decode, Dispatch, Instruction Decode/Dispatch, Exception Support, Exception Handling Registers, Data Address Watch Registers, Instruction Address Watch Registers, Base Register File, User Defined Register Files, Base ALU, MAC 16 DSP, MUL 16/32, Floating Point, Vectra DSP, User Defined Execution Units and Interfaces, User Defined Queues and Wires, Interrupt Control, Interrupts, Timers, Write Buffer, External Interface, Xtensa Processor Interface Control, PIF, Data MMU, Data Cache, Data Load/Store Unit, User Defined Data Load/Store Units, Data ROMs, Data RAMs, Xtensa Local Memory Interface]

Communication-centric platform
- Concentrates on the communication backbone (or on-chip interconnection)
- Delivers a communication framework (plus generic peripherals)
- Sonics SiliconBackplane, PALMCHIP CoreFrame

Fully programmable platform
- Concentrates on reconfigurability
- Delivers a processor plus programmable logic
- Xilinx Virtex-II Pro (Platform FPGA)
- Altera Excalibur (Platform FPGA)

Xilinx Virtex-II Pro
- PowerPC uP (400MHz)
- FPGA logic
- Internal RAM
- Serial transceiver
- XtremeDSP functions
- Digitally controlled impedance

Altera Excalibur
[Figure: Excalibur block diagram. Stripe: ARM922T with Cache and MMU, Interrupt Controller, Watchdog Timer, Timer, SDRAM Controller, Flash/ROM/SRAM through the EBI, UART, Dual Port SRAM0/1, Single Port SRAM0/1, Configuration Logic, Register; buses AHB1 (1/2 PLL1) and AHB2 (1/4 PLL1) joined by the AHB1-AHB2 Bridge; Stripe-to-PLD Bridge and PLD-to-Stripe Bridge connect the stripe to the PLD]

System design flow
[Figure: Application, Architecture, and Constraints feed the Mapping step, which uses estimation of performance, area, and power in HW and SW. The mapping results drive SW synthesis, IF synthesis, and HW synthesis, producing SW and HW.]

Application in C (mapped onto the platform architecture):

    for(i = 0; i < 18; i++) {
        s = (mpfloat)0.0f;
        k = 0;
        do {
            s += X[k] * v[k];
            s += X[k+1] * v[k+1];
            s += X[k+2] * v[k+2];
            s += X[k+3] * v[k+3];
            s += X[k+4] * v[k+4];
            s += X[k+5] * v[k+5];
            k += 6;
        } while(k < 18);
        v += 18;
        ISCALE(s);
        t[i] = s;
    }

    /* correct the transform into the 18x36 IMDCT we need */
    /* 36 muls */
    for(i = 0; i < 9; i++) {
        x[i] = t[i+9] * Granule_imdct_win[gr->block_type][i];
        ISCALE(x[i]);
        x[i+9] = t[17-i] * Granule_imdct_win[gr->block_type][i+9];
        ISCALE(x[i+9]);
        x[i+18] = t[8-i] * Granule_imdct_win[gr->block_type][i+18];
        ISCALE(x[i+18]);
        x[i+27] = t[i] * Granule_imdct_win[gr->block_type][i+27];
        ISCALE(x[i+27]);
    }

Y-chart approach
B. Kienhuis, E. Deprettere, K. Vissers, P. van der Wolf, "An approach for quantitative analysis of application-specific dataflow architectures," Proc. ASAP'97, 1997.
[Figure: Y-chart. Application and Architecture meet in Mapping, followed by Performance analysis, which yields Performance numbers]

Abstraction pyramid
A. Kienhuis, Design Space Exploration of Stream-based Dataflow Architectures, Ph.D. Thesis, Delft University of Technology, 1999.

Design trajectory Golden point design (low-level ad hoc design) Design approach using Y-chart environment

Stack of Y-chart Use different models at different levels of abstraction

Mapping
- A crucial step in DSE: evaluating the performance of different application-architecture combinations
- For smooth mapping, the model of architecture and the corresponding model of computation need a good match in data and operation types
[Figure: Application (model of computation) and Architecture (model of architecture) joined by Mapping, with a match in data/operation types]

Model of computation (MoC)
- A formal representation of the operational semantics of networks of functional blocks describing computations
- Well-known MoCs
  - Discrete Events (DE)
  - Finite State Machines (FSM)
  - Process Networks (PN)
  - Synchronous Data Flow (SDF)
  - Synchronous/Reactive (SR)
- Many different MoCs for various application domains
- May need multiple MoCs for modeling an application

Model of architecture (MoA)
- A formal representation of the operational semantics of networks of functional blocks describing architectures
- It is for modeling an architecture instance of the architecture template
- Architecture template
  - A specification of a class of architectures in a parameterized form
  - Parameters are the number of functional units, buffer size, bus type, latency, etc.
- Architecture instance
  - The result of assigning values to the parameters of the architecture template
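The template/instance distinction can be made concrete with a small sketch. The parameter names follow the slide; the value ranges, the class name, and the helper are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ArchitectureInstance:
    """One architecture instance: every template parameter has a value."""
    num_functional_units: int
    buffer_size: int
    bus_type: str
    latency: int


# Architecture template: each parameter with its allowed values
# (hypothetical ranges chosen only for this example).
TEMPLATE = {
    "num_functional_units": [1, 2, 4],
    "buffer_size": [16, 64],
    "bus_type": ["shared", "crossbar"],
    "latency": [1, 2],
}


def instances(template):
    """Enumerate all instances by assigning a value to every parameter."""
    names = list(template)
    for values in product(*(template[name] for name in names)):
        yield ArchitectureInstance(**dict(zip(names, values)))


all_instances = list(instances(TEMPLATE))
```

Each element of `all_instances` is a candidate for design-space exploration; the template here spans 3 x 2 x 2 x 2 = 24 instances.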

YAPI
E. de Kock, G. Essink, P. van der Wolf, J.-Y. Brunel, W. Kruijtzer, P. Lieverse, and K. Vissers, "YAPI: Application Modeling for Signal Processing Systems," Proc. DAC, 2000.
- YAPI: Y-chart API
  - Application modeling for signal processing systems
  - For the reuse of signal processing applications
  - For the mapping of signal processing applications onto heterogeneous systems
- Kahn process network (KPN)
  - Often used for modeling signal processing applications
  - Concurrent processes communicate through unidirectional first-in-first-out channels
  - Blocking read
  - Non-blocking write
  - Deterministic
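The KPN properties listed above can be demonstrated with a minimal sketch built on Python threads and unbounded queues. It is not YAPI (which is a C++ library); the process names and the end-of-stream marker are assumptions introduced for the example.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption; not part of KPN itself)


def producer(out):
    for x in range(5):
        out.put(x)        # non-blocking write: an unbounded FIFO never fills
    out.put(DONE)


def scaler(inp, out):
    while True:
        x = inp.get()     # blocking read: waits until a token is available
        if x is DONE:
            out.put(DONE)
            return
        out.put(2 * x)


def run_network():
    a, b = queue.Queue(), queue.Queue()  # unidirectional FIFO channels
    threads = [
        threading.Thread(target=producer, args=(a,)),
        threading.Thread(target=scaler, args=(a, b)),
    ]
    for t in threads:
        t.start()
    result = []
    while True:
        x = b.get()
        if x is DONE:
            break
        result.append(x)
    for t in threads:
        t.join()
    return result


first_run = run_network()
second_run = run_network()
```

Because each process can only block on one channel at a time, the output stream is the same on every run regardless of how the threads are interleaved, which is the determinism property of KPN.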

A limitation of KPN
- Cannot model reactiveness such as user interaction, that is, non-deterministic events
- Control flow models such as finite state machines are a solution, but they are less suited to implementing computationally intensive applications.
- To extend KPN with non-deterministic events
  - Introduce a communication primitive (channel selection primitive)
- YAPI separates the concerns of the application programmer and the system designer.
- Implementation of YAPI
  - In the form of a C++ run-time library
  - read(), write(), execute(), and select()
  - The implementation of these functions is a concern of the system designer (they may be implemented in different ways).

Architecture evaluation in YAPI
- VIDEOTOP application
  - The top-level process network model
  - Channel selection picks the MPEG2 stream to be decoded
- Abbreviations: ts: transport stream, pid: packet id, pes: packetized elementary stream, es: elementary stream

Simulation to measure the workload
- Communication requirement: the amount of data that is transferred between processes
- Computation requirement: the amount of computation of the processes
- From the result
  - We know that the required communication bandwidth is 150MB/s
  - We select an initial architecture as input for a more detailed mapping and performance analysis

Trace-driven approach
P. Lieverse, P. van der Wolf, E. Deprettere, K. Vissers, "A methodology for architecture exploration of heterogeneous signal processing systems," Proc. SIPS, 1999.
- SPADE (System level Performance Analysis and Design space Exploration)
  - For architecture exploration of heterogeneous signal processing systems
  - Supports an explicit mapping step
  - Cosimulation of application models and architecture models using a trace-driven simulation technique
  - Architecture models do not need to model the functional behavior, yet data-dependent behavior is still handled correctly

In SPADE, applications and architectures are modeled separately.
- An application imposes a workload on the resources provided by an architecture
- Workload
  - Computation and communication workload
- Resources
  - Processing resources: programmable cores or dedicated hardware
  - Communication resources: bus structures and memory resources such as RAMs or FIFO buffers

Trace-driven simulation
- Application model
  - A network of concurrent communicating processes
- Each process of the application model
  - Produces a so-called trace, which contains information on the communication and computation operations
- The traces are interfaced to an architecture model
  - They drive computation and communication activities in the architecture

Application modeling
- Kahn Process Network model
- Modeled with a YAPI-based API
  - read(), write(), and execute()
  - They generate trace entries
  - The execute() function takes a symbolic instruction as an argument
- Example process:

    void Tidct(void) {
        ...
        while(1) {
            In->read(mb_in);
            mb_out = Idct(mb_in);
            execute(IDCT_MB);
            Out->write(mb_out);
        }
    }

Architecture modeling
- The architecture model does not model the functional behavior
- It is constructed from generic building blocks
  - Trace driven execution unit (TDEU): interprets trace entries and has a configurable number of I/O ports
  - Interfaces: translate the generic protocol (FIFO) into a communication-resource-specific protocol (e.g. bus)

Architecture modeling (Cont’d)
- All blocks are parameterized
  - TDEU: a list of symbolic instructions and latencies
  - Interface block: buffer size, bus width, setup delay, and transfer delay
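A minimal sketch of the trace-driven idea follows. It is not SPADE or TSS code: the class and function names, the latency values, and the single fixed I/O delay that stands in for the interface block are all assumptions. The application side emits trace entries without knowing the architecture; the TDEU side only accumulates time and never computes an IDCT.

```python
def idct_process(num_macroblocks):
    """Application side: emit the trace of read/execute/write operations."""
    for _ in range(num_macroblocks):
        yield ("read", "in")
        yield ("execute", "IDCT_MB")   # symbolic instruction, no real computation
        yield ("write", "out")


class TDEU:
    """Trace driven execution unit parameterized by instruction latencies."""

    def __init__(self, latencies, io_delay):
        self.latencies = latencies  # symbolic instruction -> cycles
        self.io_delay = io_delay    # cycles per read/write (stand-in for the interface block)
        self.cycles = 0

    def run(self, trace):
        for kind, arg in trace:
            if kind == "execute":
                self.cycles += self.latencies[arg]
            else:  # "read" or "write"
                self.cycles += self.io_delay
        return self.cycles


slow = TDEU({"IDCT_MB": 300}, io_delay=10).run(idct_process(4))
fast = TDEU({"IDCT_MB": 120}, io_delay=10).run(idct_process(4))
```

Evaluating a different architecture instance only requires new parameters; the application model and its trace stay unchanged.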

Mapping
- Each process is mapped onto a TDEU
  - Can be many-to-one
  - Processes then need to be scheduled by the TDEU (round robin)
- Each process port is mapped one-to-one onto an I/O port
Simulation
- Concurrent simulation of the application model and the architecture model
- Architecture simulation
  - TSS (Tool for System Simulation): Philips in-house architecture modeling and simulation framework

Heterogeneous multiprocessor scheduling
H. Oh and S. Ha, "A hardware-software cosynthesis technique based on heterogeneous multiprocessor scheduling," Proc. CODES, May 1999.
- Perform list scheduling with the allocated PEs
[Figure: cosynthesis loop. The task-PE time table feeds the heterogeneous multiprocessor scheduler; performance evaluation either fails, returning to the task-PE allocation controller, or is good, yielding the cosynthesis result]

Task-PE allocation controller
- Allocate additional PEs until the given time constraint is satisfied
  - Lock: initially lock all PEs except the lowest-cost ones
  - Unlock: select the PE giving the largest perf_gain/cost_increase
  - Re-lock: in reverse order if the time constraint is met
- Task-PE profile table, exec time(cost). Columns: P0(HW) with implementations B0, B1, B2; P1(1); P2(5). Row values in source order:
    A: 3(4) 2(6) 1(10) 7 2
    B: 4(5) 2(8) 10 3
    C: 2(3) 1(5) 5
    D: 5(10) 3(15) 15
- Definitions
    EMD(i,j,k) = execution_time(i,j*,k*) - execution_time(i,j,k)
    ECI(i,j,k) = cost(i,j,k) - cost(i,j*,k*)
    Slack = the schedule length - the time constraint
    (i: node, j: processor, k: implementation; j*, k*: the minimum-cost processing element or implementation)
- ECI(i,j,k) and EMD(i,j,k) are the expected cost increment and the expected time decrement, respectively, when the (i,j,k) task-PE pair is unlocked or allocated.
- After computing the EMDs and ECIs of all PEs, we choose the entry with the largest min(EMD, Slack) / ECI value among the locked pairs
- Memory cost and bus contention are not considered
[Figure: example schedule of tasks A, B, C, D on P0 and P1, with the solution and its processor cost]
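The unlock rule can be sketched directly from the EMD, ECI, and Slack definitions. The data layout, the function name, and the numbers below are hypothetical (the profile table on the slide did not survive extraction well enough to reuse); entries that do not increase cost are skipped to avoid dividing by zero, which is an added assumption.

```python
def pick_unlock(locked, baseline, slack):
    """Choose the locked (task, pe, impl) entry with the largest
    min(EMD, Slack) / ECI.

    locked:   {(task, pe, impl): (exec_time, cost)} for locked pairs
    baseline: {task: (exec_time, cost)} of the minimum-cost choice (j*, k*)
    slack:    schedule length minus the time constraint
    """
    best, best_score = None, None
    for (task, pe, impl), (exec_time, cost) in locked.items():
        base_time, base_cost = baseline[task]
        emd = base_time - exec_time   # expected time decrement
        eci = cost - base_cost        # expected cost increment
        if eci <= 0:
            continue                  # assumption: ignore entries with no cost increase
        score = min(emd, slack) / eci
        if best_score is None or score > best_score:
            best, best_score = (task, pe, impl), score
    return best, best_score


baseline = {"A": (7, 1), "B": (10, 1)}   # both tasks on the cheapest PE
locked = {
    ("A", "P2", None): (2, 5),
    ("B", "P2", None): (3, 5),
    ("A", "P0", "B0"): (3, 4),
}
choice, score = pick_unlock(locked, baseline, slack=6)
```

Capping the gain at Slack keeps the controller from paying for speed-up beyond what the time constraint needs: here B on P2 saves 7 time units, but only 6 of them count.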

Scheduler
- List scheduling is used
- The priority for list scheduling is given by BIM
    E(i,j): execution time of node i on processor j
    C(i,d): IPC overhead between i and d (child node of i)
    T(i,j): PE j is available after T(i,j)
    BIL(i,j) = E(i,j) + max_d [ min( BIL(d,j), min_k( BIL(d,k) + C(i,d) ) ) ]
    BIM(i,j) = T(i,j) + BIL(i,j)
- BIL(i,j) is the critical path length from node i to the sink.
- BIM: best imaginary makespan; BIL: best imaginary level
[Figure: node i on processor j with children d1 and d2 placed on processor j or on processors k1 and k2, showing T(i,j), E(i,j), C(i,d1), C(i,d2), and BIL(i,j) down to the sink]
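The BIL recurrence can be evaluated with a short memoized sketch. The graph, the execution times, and the helper names are hypothetical; a node without children is treated as the sink, so its BIL is just its execution time.

```python
from functools import lru_cache


def make_bil(exec_time, children, comm):
    """Return bil(i, j) for the recurrence
    BIL(i,j) = E(i,j) + max_d[min(BIL(d,j), min_k(BIL(d,k) + C(i,d)))].

    exec_time: {node: [E(node, 0), E(node, 1), ...]}
    children:  {node: [child, ...]} (empty list for the sink)
    comm:      {(node, child): C(node, child)}
    """
    processors = range(len(next(iter(exec_time.values()))))

    @lru_cache(maxsize=None)
    def bil(i, j):
        tails = []
        for d in children[i]:
            same_pe = bil(d, j)  # child stays on processor j: no IPC cost
            other_pe = min(bil(d, k) + comm[(i, d)] for k in processors)
            tails.append(min(same_pe, other_pe))
        return exec_time[i][j] + (max(tails) if tails else 0)

    return bil


def bim(bil, available_at, i, j):
    """BIM(i,j) = T(i,j) + BIL(i,j); available_at[j] plays the role of T(i,j)."""
    return available_at[j] + bil(i, j)


exec_time = {"a": [2, 3], "b": [4, 1], "c": [2, 2]}
children = {"a": ["b", "c"], "b": [], "c": []}
comm = {("a", "b"): 2, ("a", "c"): 1}
bil = make_bil(exec_time, children, comm)
bim_a = [bim(bil, [0, 1], "a", j) for j in (0, 1)]
```

In this example both processors give node a the same BIL, so the processor that becomes available first wins on BIM.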

Results

Pipelined heterogeneous multiprocessor system Seng Lin Shee and Sri Parameswaran, "Design methodology for pipelined heterogeneous multiprocessor system," Proc. DAC, June 2007. Pipelining with ASIPs as processing entities

Tensilica Xtensa LX processors are used for the ASIPs Queue interface Xtensa PRocessor Extension Synthesis (XPRES)

Design flow

Exhaustive search for optimal configuration
- Complexity = O(n^p), where
  - n: number of possible processor configurations
  - p: number of processors
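The n^p growth is easy to see in a sketch that enumerates every combination. The objective used here (minimum total cost subject to a bound on the slowest pipeline stage), the (runtime, cost) pairs, and the function name are assumptions; the slide only states the complexity.

```python
from itertools import product


def exhaustive_search(configs, max_period):
    """Try every combination of one configuration per processor.

    configs[p] is a list of (runtime, cost) pairs for processor p.
    The pipeline period is taken as the slowest stage's runtime.
    Returns ((cost, combination) or None, number of combinations evaluated).
    """
    best = None
    evaluated = 0
    for combo in product(*configs):
        evaluated += 1
        period = max(runtime for runtime, _ in combo)
        cost = sum(c for _, c in combo)
        if period <= max_period and (best is None or cost < best[0]):
            best = (cost, combo)
    return best, evaluated


configs = [
    [(10, 1), (6, 3), (4, 6)],   # processor 1: (runtime, cost) options
    [(8, 1), (5, 2), (3, 5)],    # processor 2
]
best, evaluated = exhaustive_search(configs, max_period=6)
```

With n = 3 configurations and p = 2 processors the loop runs 3^2 = 9 times; each added processor multiplies the count by n.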

Heuristic
- Find the critical node (the processor with the worst minimum core iteration runtime)
- Find the minimum-cost configuration for the critical node
- For every other node vj
  - Filter out configurations that are faster than the critical node
  - Find the minimum-cost configuration for vj
[Figure: pipeline of nodes v1, v2, v3, v4 with runtimes r1..r4 and costs c1..c4]

Heuristic
- Complexity = O(n × p), where n is the number of possible processor configurations and p is the number of processors
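One possible reading of the heuristic is sketched below, and it rests on several assumptions: the critical node is the processor whose fastest configuration is the slowest, its runtime becomes the target, and every processor then takes its cheapest configuration that is at least as fast as that target. In particular, the "filter" step is read here as keeping only configurations that do not exceed the critical runtime. The (runtime, cost) data and the function name are hypothetical.

```python
def heuristic_select(configs):
    """Pick one (runtime, cost) configuration per processor.

    configs[p] is a list of (runtime, cost) pairs for processor p.
    Each list is scanned a constant number of times, so the work is
    proportional to n * p rather than n ** p.
    """
    fastest = [min(runtime for runtime, _ in cfgs) for cfgs in configs]
    critical = max(range(len(configs)), key=lambda p: fastest[p])
    target = fastest[critical]  # no pipeline stage needs to beat this runtime
    chosen = []
    for cfgs in configs:
        candidates = [(r, c) for r, c in cfgs if r <= target]
        chosen.append(min(candidates, key=lambda rc: rc[1]))
    return critical, target, chosen


configs = [
    [(10, 1), (6, 3), (4, 6)],   # processor 0: fastest option takes 4
    [(8, 1), (4, 2), (3, 5)],    # processor 1: fastest option takes 3
]
critical, target, chosen = heuristic_select(configs)
```

Processor 1 does not need its fastest (and most expensive) configuration, because the pipeline rate is already limited by processor 0.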

Results
