Platform Design, TU/e 5kk70, Henk Corporaal and Bart Mesman (6/25/2015). ASIP: Application Specific Instruction-set Processor.

Presentation transcript:


Application domain specific processors (ADSP or ASIP)
[Figure: flexibility versus efficiency. Labels: DSP, programmable CPU, programmable DSP, application domain specific processor, application specific processor.]

Application domain specific processors (ADSP or ASIP)
An ADSP:
- takes a well-defined application domain as a starting point
- exploits characteristics of the domain (computation kernels)
- is still programmable within the domain
Examples: MPEG2 coding uses an 8*8 DCT transform; DECT, GSM, etc.
[Figure: application domain implemented on a GP processor versus on an ADSP. GP implementation: performance from clock speed + ILP. ADSP implementation: performance from ILP + tuning to the domain. Further labels: flexible development (new apps.), cost effective (high volume).]
Problems:
- specification
- design time and effort
Manual design is a large effort => synthesized cores.


Outline
- design process
- retargetable code generation (problem statement)
- ADSP/VLIW architectures (Mistral 2 / A|RT designer)
- low power aspects (Mistral 2 / A|RT designer)
- discussion
- conclusion

Design process
3 phases:
1. exploration
2. hw design (layout) + processing
3. design appl. sw
[Figure: exploration loop. The application(s) and a processor model (parameters, instance; e.g. VLIW with shared RFs) feed SW (code generation) and HW design. Code generation gives estimations of cycles/alg and occupation; HW design gives estimations of nsec/cycle, area and power/instr. OK? no: iterate. yes: more appl.? yes: iterate. no: go to phase 2.]
Fast, accurate and early feedback.

Problem statement
A compiler is retargetable if it can generate code for a 'new' processor architecture specified in a machine description file.
A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP.
    a := b + c | instr = xxxx0101
GRTPs contain all inter-RT-conflict information.
Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
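The statement that GRTPs contain all inter-RT-conflict information can be illustrated with a minimal sketch. It assumes that two register transfers can share one instruction word exactly when their control-bit patterns never demand different values for the same bit ('x' meaning don't care). The class name `GRTP`, the function `conflicts`, and the second and third patterns are hypothetical; only `a := b + c | instr = xxxx0101` comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRTP:
    """A register transfer pattern plus the instruction bits that select it."""
    transfer: str  # e.g. "a := b + c"
    guard: str     # one character per instruction bit: '0', '1' or 'x'


def conflicts(p: GRTP, q: GRTP) -> bool:
    """True if some instruction bit is required to be 0 by one pattern and 1 by the other."""
    if len(p.guard) != len(q.guard):
        raise ValueError("guards must cover the same instruction word")
    return any(a != "x" and b != "x" and a != b
               for a, b in zip(p.guard, q.guard))


add = GRTP("a := b + c", "xxxx0101")    # pattern from the slide
sub = GRTP("a := b - c", "xxxx0110")    # hypothetical: uses the same bit field
load = GRTP("d := mem[e]", "01xxxxxx")  # hypothetical: uses a different bit field

add_vs_sub = conflicts(add, sub)    # both constrain the low four bits differently
add_vs_load = conflicts(add, load)  # disjoint bit fields, so no conflict
```

Under this assumption a code generator only needs the guards, not the netlist, to decide which transfers may be packed into one instruction.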

Problem statement
[Figure: the algorithm spec passes through a front end (FE) to a CDFG; the processor spec (instance) passes through ISE to GRTPs; both feed code generation, which produces machine code. Annotation: in ch 4 this is part of the code generator.]

Example: Simple processor [Leupers]
[Figure: datapath with PC, +1 incrementer, instruction memory IM, RAM, REG, input Inp and output outp. The instruction word I.(20:0) is split into the fields I.(20:13), I.(12:5), I.(4), I.(3:2) and I.(1:0).]


ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
Differences with VLIW processors of ch [...]:
1. Parallel (//) FUs
   - ASUs = complex application-specific FUs (beyond subword parallelism), e.g. biquad, median, DCT, etc.
   - larger grain size, more heterogeneous, more pipelines
2. Rfiles
   - many Rfiles (>5 vs. 1 or 2)
   - limited # ports (3 vs. 15)
   - limited size (<16 vs. 128)
3. Issue slots
   - all in parallel vs. 5

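[Figure: VLIW template with functional units FU1 to FU4, register files RF1 to RF8 (drawn as RF1/RF2 around FU1, RF3/RF4 around FU2, RF5/RF6 around FU3, RF7/RF8 around FU4), instruction registers IR1 to IR4, instruction memory, and control with flags.]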

ASIP/VLIW architectures
Additional characteristics of the A|RT designer template:
- interconnect network: busses + input multiplexers
  - mux control is part of the instruction
  - control can change every clock cycle
  - network can be incomplete
  - busses can be merged
- memories are modeled as FUs
  - separate data in and data out
  - 2 inputs (data in and address) and 1 output
- each FU can generate one or more flags
Instruction format (per issue slot): read address RF 1 | write address RF 1 | read address RF 2 | write address RF 2 | mux 1 | mux 2 | control FU | output drivers
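To make the per-issue-slot instruction format more concrete, here is a rough sketch that packs the listed fields into one control word. The field names follow the slide; the number of choices per field (16-entry register files, 2-input multiplexers, 8 FU operations, 2 output-driver settings) is an assumption chosen only for illustration, and `SLOT_FIELDS`, `bits_for`, `pack` and `unpack` are hypothetical names.

```python
# Field names as on the slide; the choice counts are illustrative assumptions.
SLOT_FIELDS = [
    ("read address RF 1", 16),
    ("write address RF 1", 16),
    ("read address RF 2", 16),
    ("write address RF 2", 16),
    ("mux 1", 2),
    ("mux 2", 2),
    ("control FU", 8),
    ("output drivers", 2),
]


def bits_for(n_choices: int) -> int:
    """Bits needed to encode n_choices distinct values (at least one bit)."""
    return max(1, (n_choices - 1).bit_length())


def slot_width(fields=SLOT_FIELDS) -> int:
    """Total number of control bits in one issue slot."""
    return sum(bits_for(n) for _, n in fields)


def pack(values: dict, fields=SLOT_FIELDS) -> int:
    """Concatenate the field values, first field in the most significant bits."""
    word = 0
    for name, n in fields:
        v = values.get(name, 0)  # unspecified fields default to 0
        if not 0 <= v < n:
            raise ValueError(f"{name}: value {v} out of range")
        word = (word << bits_for(n)) | v
    return word


def unpack(word: int, fields=SLOT_FIELDS) -> dict:
    """Inverse of pack: split a control word back into named fields."""
    values = {}
    for name, n in reversed(fields):
        width = bits_for(n)
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


width = slot_width()
example = {"read address RF 1": 5, "mux 2": 1, "control FU": 3}
decoded = unpack(pack(example))
```

With these assumed sizes one issue slot needs 22 control bits, most of them register addresses, which is why small register files keep the instruction word narrow.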

ASIP/VLIW architectures: example
[Figure: an ALU and a MAC connected through bus1 and bus2, with register files RF1 to RF4. ALU issue slot fields: mux 2, read RF1, write RF1, read RF2, write RF2, ALU instr. MAC issue slot fields: mux 3, read RF4, write RF4, read RF3, write RF3, MAC instr.]


ASIP/VLIW architectures: design flow
[Figure: the algorithm spec is translated into RTs, followed by datapath synthesis and controller synthesis, which give estimations of area, power and timing. OK? no: change pragmas and repeat. yes: done.]
Example RT:
    RF1 : x = RF2 : y, RF3 : z | ALU = ADD Inmux = bus2
Assignments shown on the slide:
    assign ( a+b, ALU, fu_alu1)
    assign ( a+_, ALU, fu_alu2)
    assign ( _+_, ALU, fu_alu3)
VLIW makes relatively simple code selection possible.

ASIP/VLIW architectures: list scheduling
[Figure: list scheduling of a graph of multiplications (*) and additions (+) onto the resources IPB, OPB, ALU and MULT. Per step: candidate list, conflict and priority computation, scheduled operation.]
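The three steps named on the slide (candidate list, conflict and priority computation, scheduled operation) can be sketched as a generic textbook list scheduler. This is not the A|RT designer implementation: it assumes unit latency for every operation, and the operation names, dependences and priorities below are hypothetical. Only the resource names ALU and MULT are taken from the slide.

```python
def list_schedule(ops, deps, capacity, priority):
    """Greedy list scheduling with unit-latency operations.

    ops:      dict operation -> resource type it needs
    deps:     dict operation -> set of operations that must finish first
    capacity: dict resource type -> number of units available per cycle
    priority: dict operation -> number, higher is scheduled first
    Returns a list with the operations issued in each cycle.
    """
    done_at = {}   # operation -> cycle in which it was issued
    schedule = []
    cycle = 0
    while len(done_at) < len(ops):
        # Candidate list: unscheduled ops whose predecessors ran in an earlier cycle.
        candidates = [o for o in ops if o not in done_at and
                      all(p in done_at and done_at[p] < cycle
                          for p in deps.get(o, ()))]
        # Priority: highest first, ties broken by name for determinism.
        candidates.sort(key=lambda o: (-priority[o], o))
        # Conflict check: issue an op only while a unit of its type is free.
        free = dict(capacity)
        issued = []
        for o in candidates:
            if free.get(ops[o], 0) > 0:
                free[ops[o]] -= 1
                issued.append(o)
        if not issued:
            raise ValueError("no progress: cyclic dependence or missing resource")
        for o in issued:
            done_at[o] = cycle
        schedule.append(issued)
        cycle += 1
    return schedule


# Hypothetical example: (m1 * m2 result) + m3 result, on one MULT and one ALU.
ops = {"m1": "MULT", "m2": "MULT", "m3": "MULT", "a1": "ALU", "a2": "ALU"}
deps = {"a1": {"m1", "m2"}, "a2": {"a1", "m3"}}
capacity = {"MULT": 1, "ALU": 1}
# Priority = length of the longest path to the end of the graph.
priority = {"m1": 3, "m2": 3, "m3": 2, "a1": 2, "a2": 1}

schedule = list_schedule(ops, deps, capacity, priority)
```

The single multiplier serialises m1 and m2; in the third cycle the ALU and the multiplier work in parallel, so the five operations finish in four cycles.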

ASIP/VLIW architectures: feedback


Low power aspects
[Figure labels: Implementation Independent Design Database; Architecture; Mistral2; Estimation Database; Estimation of area, speed and power.]

GSM Viterbi decoder: default solution

EXU         ACTIV   AREA POWER (digits fused in source)
alu_1       96%
romctrl_1   48%     39259
acu_1       26%
ipb_1       5%
opb_1       23%
ctrl
total

- controller responsible for 70% of power consumption
  - maximum resource-sharing
  - heavy decision-making: "main" loop with 16 metrics-computations per iteration
- EXU numbers include registers for local storage

GSM Viterbi decoder: no loop-folding

EXU         ACTIV   AREA POWER (digits fused in source)
alu_1       92%
romctrl_1   45%     39255
acu_1       25%
ipb_1       5%      10786
opb_1       22%
ctrl
total

- area down by 33%
- power down by 35%
- next step: reduce # of program-steps with second ALU

GSM Viterbi decoder: 2 ALUs

EXU         ACTIV   AREA POWER (digits fused in source)
alu_1       69%
alu_2       65%
romctrl_1   67%     39255
acu_1       37%
ipb_1       8%
opb_1       33%
ctrl
total

Unlabeled value from the slide figure: 9739

- cycle count down 30%
- area up 42%
- power down by 5%
- next step: introduce ASU to reduce ALU-load

GSM Viterbi decoder: 1 x ACS-ASU

EXU         ACTIV   AREA POWER (digits fused in source)
alu_1       20%
acs_asu_1   83%
or_asu_1    10%
romctrl_1   16%     6521
acu_1       36%
ipb_1       20%     10743
opb_1       11%     16335
ctrl
total

The ACS ASU:

    func ACS ( M1, M2, d ) MS, MS8 =
    begin
      MS  = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
      MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
    end;

Unlabeled value from the slide figure: = 1930

- cycle count down 5X
- power down 20X!
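For readers unfamiliar with the notation, the ACS (add-compare-select) function on the slide can be transcribed into Python as follows. This is a plain software rendering of the two guarded expressions, reading `if c -> x || y fi` as "x if c else y"; the function name `acs` and the sample values are illustrative.

```python
def acs(m1: int, m2: int, d: int) -> tuple:
    """Add-compare-select: return (MS, MS8) as defined on the slide.

    MS  picks the larger of M1+d and M2-d.
    MS8 picks the larger of M1-d and M2+d.
    """
    ms = m1 + d if m1 + d > m2 - d else m2 - d
    ms8 = m1 - d if m1 - d > m2 + d else m2 + d
    return ms, ms8


# Illustrative path metrics m1, m2 and branch metric d.
first = acs(10, 7, 2)   # candidates: (12 vs 5) and (8 vs 9)
second = acs(3, 9, 1)   # candidates: (4 vs 8) and (2 vs 10)
```

On a plain ALU each call costs four additions, two comparisons and two selections; folding them into one ASU operation is what the slide credits for the drop in cycle count.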

GSM Viterbi decoder: 4 x ACS-ASU

EXU          ACTIV   AREA POWER (digits fused in source)
alu_1        94%     24397
acs_asu_1    95%
acs_asu_2    95%
acs_asu_3    95%
acs_asu_4    95%
split_asu_1  47%     9018
or_asu_1     47%
romctrl_1    28%     486
acu_1        98%     21285
ipb_1        23%     606
opb_1        50%     36980
ctrl
total

Unlabeled value from the slide figure: 425

- cycle count down another 5X
- area up 23%
- power down another 3X!

GSM Viterbi example: summary
[Figure labels: Implementation Independent Design Database; Mistral2; 72x!]

Discussion: phase 3
[Figure: two loops. Exploration phase: application(s) and processor model feed SW (code generation) and HW design; OK? no: iterate. yes: more appl.? yes: iterate. no: freeze processor model. Application software development (constraint driven compilation): application(s) go through SW (code generation); OK? no: iterate. yes: done.]

Discussion: problems with VLIWs
Code size and instruction bandwidth:
- code compaction = reduce code size after scheduling
  - possible compaction ratio? e.g. $p_0 = 0.9$ and $p_1 = 0.1$
  - information content (entropy) $= -\sum_i p_i \log_2 p_i = 0.47$
  - maximum compression factor $\approx 2$
- control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
- architecture: reduce number of control bits for operand addresses
  - e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only => use stacks and fifos
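The two numbers on this slide can be checked with a few lines of arithmetic. The sketch assumes that instruction bits are independent and are 0 with probability 0.9, so the entropy per bit bounds the achievable compression. The split of the 28 address bits into 7-bit fields follows from 128 registers, but which four operands those fields address is not stated on the slide. `entropy` is a hypothetical helper name.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits per symbol: -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Code compaction: each instruction bit carries about 0.47 bits of information.
bits_of_information = entropy([0.9, 0.1])
max_compression = 1 / bits_of_information  # a little above 2

# Operand addressing: 128 registers need 7 address bits each,
# so 28 address bits per issue slot hold four register addresses.
address_bits = (128 - 1).bit_length()
address_fields = 28 // address_bits
```

The entropy bound explains why compaction alone yields only about a factor of two, which motivates the architectural option of stacks and FIFOs that need no explicit register address.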

[Figure: VLIW template with functional units FU1 to FU4, register files RF1 to RF4, instruction registers IR1 to IR4, instruction memory, and control with flags.]

Conclusions
- ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
- The methodology is interesting for IP creation.
- The key problem is retargetable compilation.
- A (distributed) VLIW model is a good compromise between HW and SW.
- Although an automatic process can generate a default solution, the process is usually interactive and iterative for efficiency reasons. The key is fast and accurate feedback.