Embedded Computer Architecture

Presentation transcript:

Embedded Computer Architecture ASIP Application Specific Instruction-set Processor 5KK73 Bart Mesman and Henk Corporaal

Application domain specific processors (ADSP or ASIP)
Design spectrum, from flexibility to efficiency: programmable CPU, programmable DSP, application domain specific processor, application specific processor
4/27/2017 Embedded Computer Architecture H.Corporaal and B. Mesman

Application domain specific processors (ADSP or ASIP)
- takes a well defined application domain as a starting point
- exploits characteristics of the domain (computation kernels)
- still programmable within the domain
- e.g. MPEG2 coding uses 8*8 DCT transform, DECT, GSM etc ...
GP implementation of an application domain: performance from clock speed + ILP; flexible dev. (new apps.)
ADSP implementation of an application domain: performance from ILP, DLP, tuning to domain; cost effective (high volume)
Problems: specification, manual design, design time and effort
Large effort => synthesized cores

www.adelantetech.com

Design process
3 phases:
1. exploration
2. hw design (layout) + processing
3. design appl. sw
[Figure: exploration loop. Inputs: application(s) and a processor model (e.g. VLIW with shared RFs) with instance parameters. SW (code generation) gives estimations of cycles/alg and occupation; HW design gives estimations of nsec/cycle, area, power/instr. Decision "OK?": if no, iterate; if yes, "more appl.?": if yes, repeat; if no, go to phase 2.]
Fast, accurate and early feedback

ASIP/VLIW architectures: list scheduling
[Figure: list-scheduling example for operations 1 to 10 (multiplications and additions) on a datapath with IPB, MULT, ALU and OPB units. Columns: candidate list, priority computation, conflict check and scheduled operation.]
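The candidate list, priority and conflict columns of the figure correspond to the three steps of a basic list scheduler. The sketch below is a minimal illustration under simplifying assumptions (every operation takes one cycle, priorities are given up front); the operation names, unit counts and the function `list_schedule` are hypothetical and not taken from the slides.

```python
def list_schedule(ops, deps, unit_of, units, priority):
    """Greedy list scheduling with single-cycle operations.

    ops:      list of operation names
    deps:     {op: set of operations that must finish first}
    unit_of:  {op: functional-unit type, e.g. "MULT" or "ALU"}
    units:    {unit type: number of units available per cycle}
    priority: {op: number, higher is scheduled first}
    Returns a list with one list of issued operations per cycle.
    """
    done_cycle = {}          # op -> cycle in which it was issued
    remaining = set(ops)
    schedule = []
    cycle = 0
    while remaining:
        # 1. Candidate list: all predecessors issued in an earlier cycle.
        candidates = [
            op for op in remaining
            if all(done_cycle.get(p, cycle) < cycle for p in deps.get(op, ()))
        ]
        # 2. Priority: highest first, name as a deterministic tie-breaker.
        candidates.sort(key=lambda op: (-priority[op], op))
        # 3. Conflict check: issue while a unit of the right type is free.
        free = dict(units)
        issued = []
        for op in candidates:
            unit = unit_of[op]
            if free.get(unit, 0) > 0:
                free[unit] -= 1
                issued.append(op)
        if not issued:
            raise ValueError("no progress: cyclic dependencies or missing unit")
        for op in issued:
            done_cycle[op] = cycle
            remaining.remove(op)
        schedule.append(issued)
        cycle += 1
    return schedule


# Hypothetical example: three multiplications and one addition.
example = list_schedule(
    ops=["m1", "a2", "m3", "m4"],
    deps={"a2": {"m1"}, "m4": {"m3"}},
    unit_of={"m1": "MULT", "m3": "MULT", "m4": "MULT", "a2": "ALU"},
    units={"MULT": 1, "ALU": 1},
    priority={"m1": 3, "m3": 2, "a2": 1, "m4": 1},
)
```

With one multiplier, `m3` loses the conflict check in the first cycle and is issued in the second, alongside `a2` on the ALU.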

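Application examples (1)
[Figure: 5-tap FIR filter. Inputs x0..x4 along a delay line (Z^-1), coefficients c0..c4, multipliers (*) and adders (+) producing output y.]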

Application examples (1)
19 instructions per tap!!
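For reference, the filter in the figure computes one multiply-accumulate per tap. The sketch below is a plain illustration of that computation, assuming a 5-entry delay line as in the figure; the function name `fir_step` is hypothetical. In a software implementation, the delay-line shift, the operand fetches and the loop control around each multiply-accumulate all cost instructions, which is presumably where a count like the one on the slide comes from.

```python
def fir_step(coeffs, delay_line, new_sample):
    """Advance an FIR filter by one sample.

    coeffs:     [c0, c1, ...] filter coefficients
    delay_line: [x0, x1, ...] most recent sample first
    Returns (y, updated delay line).
    """
    if len(coeffs) != len(delay_line):
        raise ValueError("coeffs and delay_line must have equal length")
    # Shift the delay line (the Z^-1 elements) and insert the new sample.
    delay_line = [new_sample] + delay_line[:-1]
    # One multiply-accumulate per tap.
    y = 0
    for c, x in zip(coeffs, delay_line):
        y += c * x
    return y, delay_line


# Impulse response of a 5-tap filter reproduces the coefficients.
coeffs = [1, 2, 3, 4, 5]
state = [0, 0, 0, 0, 0]
y0, state = fir_step(coeffs, state, 1)
y1, state = fir_step(coeffs, state, 0)
```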

Application examples (2)
Bit level operations: finite field arithmetic
10 instructions!!
Very simple in hardware

Application examples (2)
Bit level operations : DES example
    srl  $13, $2, 20
    andi $25, $13, 1
    srl  $14, $2, 21
    andi $24, $14, 6
    or   $15, $25, $24
    srl  $13, $2, 22
    andi $14, $13, 56
    or   $25, $15, $14
    sll  $24, $25, 2
[Figure: bits 20, 22, 23, 25, 26, 27 of the source register ($2) are gathered into bits 2, 3, 4, 5, 6, 7 of the destination register ($24).]
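The instruction sequence on this slide can be followed step by step in a higher-level sketch. The code below mirrors the shifts and masks and compares them with a direct bit-by-bit gather based on the bit positions in the figure; the function names are hypothetical and a 32-bit register is assumed.

```python
def des_gather_shift_mask(src):
    """Mirror of the srl/andi/or/sll sequence (32-bit register assumed)."""
    t = (src >> 20) & 1       # source bit 20        -> bit 0
    t |= (src >> 21) & 6      # source bits 22, 23   -> bits 1, 2
    t |= (src >> 22) & 56     # source bits 25..27   -> bits 3..5
    return (t << 2) & 0xFFFFFFFF  # move everything up to bits 2..7


def des_gather_reference(src):
    """Direct gather using the bit positions shown in the figure."""
    source_bits = [20, 22, 23, 25, 26, 27]
    dest_bits = [2, 3, 4, 5, 6, 7]
    result = 0
    for s, d in zip(source_bits, dest_bits):
        result |= ((src >> s) & 1) << d
    return result
```

In hardware the same gather is only wiring, which is the point the slides make about bit-level operations.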

Application examples (2)
Bit level operations : A5 example (GSM encryption)
    srl  $24, $5, 18
    srl  $25, $5, 17
    xor  $8, $24, $25
    srl  $9, $5, 16
    xor  $10, $8, $9
    srl  $11, $5, 13
    xor  $12, $10, $11
    andi $13, $12, 1
[Figure: bits 18, 17, 16 and 13 of $5 are XORed into bit 0 of $13; the remaining bits of $13 are 0.]
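The figure on this slide combines bits 18, 17, 16 and 13 of $5 into a single result bit. A minimal sketch of that computation, with a hypothetical function name, is:

```python
def a5_feedback_bit(reg):
    """XOR of bits 18, 17, 16 and 13 of reg, returned in bit 0."""
    return ((reg >> 18) ^ (reg >> 17) ^ (reg >> 16) ^ (reg >> 13)) & 1
```

Eight instructions on a general-purpose processor reduce to a three-gate XOR tree in hardware.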

ASIP/VLIW architectures: feedback

Low power aspects
[Figure: tool flow with the blocks Implementation Independent Design Database, Mistral2, Architecture and Estimation Database; estimation outputs: area + speed, power.]

GSM viterbi decoder : default solution
EXU         ACTIV   AREA    POWER
alu_1       96%     3469    46196
romctrl_1   48%     39      259
acu_1       26%     327     1209
ipb_1       5%      131     105
opb_1       23%     1804    5801
ctrl                9821    135035
total               15591   188605
cycle count: 13750
- controller responsible for 70% of power consumption
- maximum resource-sharing
- heavy decision-making: "main" loop with 16 metrics-computations per iteration
- EXU numbers include registers for local storage

GSM viterbi decoder : no loop-folding
EXU         ACTIV   AREA    POWER
alu_1       92%     3411    45073
romctrl_1   45%     39      255
acu_1       25%     294     1087
ipb_1       5%      107     86
opb_1       22%     1661    5340
ctrl                4919    70087
total               10431   121928
cycle count: 14247
- area down by 33%
- power down by 35%
- next step: reduce # of program-steps with second ALU

GSM viterbi decoder : 2 ALUs
EXU         ACTIV   AREA    POWER
alu_1       69%     1797    12248
alu_2       65%     1393    8916
romctrl_1   67%     39      255
acu_1       37%     294     1087
ipb_1       8%      149     119
opb_1       33%     2136    6871
ctrl                8957    87235
total               14766   116731
cycle count: 9739
- cycle count down 30%
- area up 42%
- power down by 5%
- next step: introduce ASU to reduce ALU-load

GSM viterbi decoder : 1 x ACS-ASU
    func ACS ( M1, M2, d ) MS, MS8 =
    begin
      MS  = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
      MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
    end;
EXU         ACTIV   AREA    POWER
alu_1       20%     261     105
acs_asu_1   83%     2382    3816
or_asu_1    10%     611     122
romctrl_1   16%     65      21
acu_1       36%     294     205
ipb_1       20%     107     43
opb_1       11%     163     35
ctrl                1864    3597
total               5747    7944
cycle count: 1930
- cycle count down 5X
- power down 20X !
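The ACS (add-compare-select) function defined on this slide selects the larger of two candidate path metrics, twice. A minimal Python rendering of the same definition, with the hypothetical name `acs`, is:

```python
def acs(m1, m2, d):
    """Add-compare-select as defined on the slide.

    m1, m2: incoming path metrics, d: branch metric.
    Returns (MS, MS8), each the larger of its two candidates.
    """
    ms = m1 + d if (m1 + d) > (m2 - d) else m2 - d
    ms8 = m1 - d if (m1 - d) > (m2 + d) else m2 + d
    return ms, ms8
```

Four additions, two comparisons and two selections per call explain why a dedicated ACS unit takes so much load off the ALU.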

GSM viterbi decoder : 4 x ACS-ASU
EXU          ACTIV   AREA    POWER
alu_1        94%     243     97
acs_asu_1    95%     1041    420
acs_asu_2    95%     1041    420
acs_asu_3    95%     1041    420
acs_asu_4    95%     1041    420
split_asu_1  47%     90      18
or_asu_1     47%     592     118
romctrl_1    28%     48      6
acu_1        98%     212     85
ipb_1        23%     60      6
opb_1        50%     369     80
ctrl                 1306    555
total                7084    2645
cycle count: 425
- cycle count down another 5X
- area up 23%
- power down another 3X !

GSM viterbi example : summary
[Figure: Implementation Independent Design Database, Mistral2]
72x !
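The overall gains can be recomputed from the totals reported on the five decoder slides. The sketch below does so; the area and power totals are copied from the slides, while treating the unlabelled last number on each slide as the cycle count is an assumption (it matches the "cycle count down" remarks). The slide does not say which quantity the 72x refers to; the total-power ratio comes out at about 71.

```python
# (name, total area, total power, cycle count) per design point.
design_points = [
    ("default solution", 15591, 188605, 13750),
    ("no loop-folding", 10431, 121928, 14247),
    ("2 ALUs", 14766, 116731, 9739),
    ("1 x ACS-ASU", 5747, 7944, 1930),
    ("4 x ACS-ASU", 7084, 2645, 425),
]

first, last = design_points[0], design_points[-1]
area_ratio = first[1] / last[1]     # area reduction, default vs. 4 x ACS-ASU
power_ratio = first[2] / last[2]    # power reduction
cycle_ratio = first[3] / last[3]    # cycle-count reduction

for name, area, power, cycles in design_points:
    print(f"{name:18s} area={area:6d} power={power:7d} cycles={cycles:6d}")
print(f"area {area_ratio:.1f}x, power {power_ratio:.1f}x, cycles {cycle_ratio:.1f}x")
```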

Discussion: phase 3
[Figure: two flow charts. Exploration phase: processor model and application(s) feed SW (code generation) and HW design; decision "OK?"; decision "more appl.?"; then freeze processor model. Application software development: application(s) feed SW (code generation); decision "OK?".]
Application software development: constraint driven compilation

[Figure: distributed VLIW datapath. Register files RF1..RF4, functional units FU1..FU4, flags, instruction registers IR1..IR4, instruction memory and control.]

Discussion: problems with VLIWs
code size and instruction bandwidth
- code compaction = reduce code size after scheduling
  possible compaction ratio? e.g. p0 = 0.9 and p1 = 0.1
  information content (entropy) $H = -\sum_i p_i \log_2 p_i = 0.47$
  maximum compression factor ≈ 2
- control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
- architecture: reduce number of control bits for operand addresses
  e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only
  => use stacks and fifos
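The entropy figure on this slide can be checked numerically. The sketch below evaluates the entropy for p0 = 0.9 and p1 = 0.1 and the corresponding bound on the compression factor, assuming one bit per symbol before compression; the function name `entropy_bits` is hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits per symbol; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


h = entropy_bits([0.9, 0.1])     # about 0.47 bits per symbol
max_compression = 1.0 / h        # about 2.1 for 1-bit symbols
print(f"entropy = {h:.2f} bits, maximum compression factor = {max_compression:.1f}")
```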

GPU basics
- Synthetic objects are represented with a bunch of triangles (3D), plus texture, in a language/library like OpenGL or DirectX
- Triangles are represented with 3 vertices
- A vertex is represented with 4 coordinates with floating-point precision
- Objects are transformed between coordinate representations
- Transformations are matrix-vector multiplications
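As an illustration of the last two points, the sketch below applies a 4x4 transformation matrix to a vertex with 4 coordinates. The translation matrix and the function names are illustrative assumptions, not taken from the slides.

```python
def mat_vec4(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vertex."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def translation(tx, ty, tz):
    """4x4 matrix that translates a homogeneous vertex by (tx, ty, tz)."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


# A triangle is three vertices; each vertex is (x, y, z, w).
triangle = [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
moved = [mat_vec4(translation(1.0, 2.0, 3.0), v) for v in triangle]
```

Each transformed vertex costs 16 multiplications and 12 additions, all independent across vertices, which is why this workload maps well onto parallel hardware.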

GPU DirectX 10 pipeline

NVIDIA GeForce 6800 3D Pipeline

GeForce 8800 GPU
330 Gflops, 128 processors with 4-way SIMD

GPU: Why more general-purpose programmable?
- All transformations are shading
- Shading is all matrix-vector multiplications
- Computational load varies heavily between different sorts of shading
- Programmable shaders allow dynamic resource allocation between shaders
Result: modern GPUs are serious competitors for general-purpose processors!

Velocity encoding
[Figure: instruction encodings for operations A..H in three schedules: fully serial, mixed serial/parallel and fully parallel. Classical encoding: fetching many nops (n). Velocity encoding: the operations A..H are stored without nops, together with marker bits.]

Conclusions
- ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
- The methodology is interesting for IP creation.
- The key problem is retargetable compilation.
- A (distributed) VLIW model is a good compromise between HW and SW.
- Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback.
- GPUs are ASIPs