1. DAC 2006 CAD Challenges for Leading-Edge Multimedia Designs.

Presentation transcript:

1

DAC 2006 CAD Challenges for Leading-Edge Multimedia Designs

NOMADIK: “The challenge of low power, high performance and scalable multimedia acceleration”
Alain Artieri, Patrick Blouet
STMicroelectronics
July 26, 2006

4 Multimedia Computing Landscape

5 The convergence paradigm

[Diagram: Personal Computer, Mobile Phone and Consumer Electronics converging on a New Mobile Multimedia Computing Architecture]

6 Consumer versus Computer

Consumer Products
- High quality of service
- Designed for worst case
- Highly parallel architecture
- Hardware accelerators

Personal Computer
- Monolithic processor architecture
- High MHz for performance
- High power consumption
- Open OS
- Flexibility
- Rich set of standard interfaces for storage and connectivity

New computing architecture must combine the best of both worlds
- Open platform, multi OS
- Flexibility
- Rich set of standard interfaces for storage and connectivity

7 Cell Phones: a Key Driver

[Chart: cell phone volume (M units) against features: Voice, Voice & Data, Multimedia, Global Convergence]

8 Competing Technical Constraints

- Scalability
- Low Power
- Multimedia Performance

9 Multimedia Performance

Requirements:
- Multiple video standards, encode and decode (MPEG4, H264, WMV, …), up to HDTV format
- High resolution: VGA screen and above in a small form factor, output to HDTV with large screen
- Multi-megapixel camera, DSC-class image reconstruction chain and picture improvement
- Sophisticated audio use cases: combination of multiple codecs, sound effects, speech codecs, …
- Advanced 3D graphics acceleration for gaming
- Consume & produce high-bandwidth multimedia content

10 Low Power

A key system technology driver
- Of course a product feature:
  - Battery lifetime
- But helps product manufacturability:
  - Stacking in a power budget
- And product cost:
  - Low cost packaging
  - No heat sink

11 Nomadik Architecture Overview

12 Application Processor Content

[Diagram: Host Processor, Peripherals, Embedded Memory and Multimedia Accelerator blocks]
- Host processor & peripherals: no differentiation
- Multimedia acceleration: differentiating factor
- The architecture & design challenge is in Multimedia Acceleration (Audio, Video, Imaging, Graphics)
- This is where innovation is required and competitive advantage is built

13 Nomadik Multimedia Acceleration Model

[Diagram: three DSPs, each with a DMA engine (data mover) and tightly coupled HW acceleration, joined by an Interconnect]
- Multiple DSP based sub-systems
- Symmetrical DSPs (a generic S/W component can run anywhere)
- Attached HW resources (dependence resolved at component manager level)
- …
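
The placement rule on this slide (generic components run anywhere, HW-dependent ones are resolved by a component manager) can be sketched in Python. This is only an illustration under stated assumptions, not the Nomadik component manager: the `Dsp` record, the `place` function, the least-loaded policy and the resource names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Dsp:
    name: str
    attached_hw: frozenset = frozenset()  # accelerators tightly coupled to this DSP
    load: int = 0                         # components already placed here

def place(needs, dsps):
    """Pick a DSP for a component that needs the given set of HW resources.

    A generic component (empty needs) can run on any DSP, so the least-loaded
    one wins. A component with HW dependences is restricted to the DSPs that
    have every required resource attached.
    """
    needs = frozenset(needs)
    candidates = [d for d in dsps if needs <= d.attached_hw]
    if not candidates:
        raise LookupError(f"no DSP has {sorted(needs)} attached")
    chosen = min(candidates, key=lambda d: d.load)
    chosen.load += 1
    return chosen.name

# Hypothetical three-DSP system with two attached accelerators.
dsps = [Dsp("dsp0", frozenset({"video_codec"})),
        Dsp("dsp1", frozenset({"imaging"})),
        Dsp("dsp2")]
print(place({"video_codec"}, dsps))  # constrained to the DSP owning the codec
print(place(set(), dsps))            # generic: goes to a least-loaded DSP
```

Keeping the dependence check in one place is what lets the software components themselves stay location-independent.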

14 Multiple DSP approach benefits

- High computing performance:
  - Multiple non-interfering domains of intense activity, each having its own processor, DMA services and hardware accelerators for data-intensive functions
  - Hardware acceleration embedding standard functions (e.g. video codec, image reconstruction & improvement)
  - Highest & predictable performance through a careful bus and memory hierarchy design
- Low Power (target: 100s of mW):
  - Intrinsically low-power sub-systems
  - Fine-grain power management at sub-system level
  - Leakage management by switching sub-systems on & off

15 Power management

- Combination of multiple techniques:
  - Dynamic power reduction:
    - Clock gating
    - Voltage scaling (DVFS)
    - Pulse-Width Modulation (PWM)
  - Static power reduction:
    - Biasing
    - Power On/Off switching (power gating)
- A global system issue, from power management inside the OS down to the silicon process (e.g. gate leakage)

16 DVFS Principle

[Diagram: Operating System, Load Monitor (SW), Voltage/Frequency Tables, CPU performance requirements, CPU]

Process requirements:
- Large voltage excursion
- Low leakage

[Chart: CPU voltage levels 1.3V, 1.2V, 1.1V; performance levels 100%, 85%, 62%; annotations: 28% energy saving, 55% energy saving]
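
The savings quoted on the DVFS slide can be approximated with a small calculation. The sketch below assumes two things the slide does not state: that switching energy over a fixed time window scales with clock activity times supply voltage squared (the usual first-order CMOS model, leakage ignored), and that the three voltages pair with the three performance levels in the order listed. The function names are illustrative.

```python
def relative_energy(activity, voltage, nominal_voltage):
    """Dynamic energy over a fixed time window, relative to full speed at nominal voltage.

    Assumes energy scales with (clock activity) x (supply voltage)^2; leakage is ignored.
    """
    return activity * (voltage / nominal_voltage) ** 2

# Assumed pairing of the slide's figures: (performance level, CPU voltage in volts).
OPERATING_POINTS = [(1.00, 1.3), (0.85, 1.2), (0.62, 1.1)]

def savings(points=OPERATING_POINTS):
    """Fractional saving at each operating point relative to the first (nominal) one."""
    nominal = points[0][1]
    return [1.0 - relative_energy(perf, volt, nominal) for perf, volt in points]

for (perf, volt), s in zip(OPERATING_POINTS, savings()):
    print(f"{perf:.0%} performance at {volt} V: {s:.1%} saving")
```

Under these assumptions the model gives roughly 28% and 56% for the two reduced operating points, close to the 28% and 55% on the slide.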

17 PWM Principle

[Diagram: Operating System, Load Monitor (SW), Active clock ratio table, CPU performance requirements, CPU]

Process requirements:
- Clock as fast as possible
- Source bias or switch off when the clock is stopped

[Chart: CPU voltage fixed at 1.0V; active clock ratios 100%, 85%, 62%; annotations: 15% energy saving, 38% energy saving]
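
A load monitor driving an active-clock-ratio table might look like the following sketch. The table values come from the slide; the lookup policy (smallest ratio that still meets the load) and the function names are assumptions made for illustration.

```python
import bisect

# Allowed active-clock ratios in ascending order; values taken from the slide.
RATIO_TABLE = [0.62, 0.85, 1.00]

def pick_ratio(required, table=RATIO_TABLE):
    """Return the smallest active-clock ratio that still meets the required load (0..1)."""
    if not 0.0 <= required <= 1.0:
        raise ValueError("required load must be between 0 and 1")
    return table[bisect.bisect_left(table, required)]

def saving_at_fixed_voltage(ratio):
    """With the supply voltage fixed, dynamic energy tracks the active-clock ratio."""
    return 1.0 - ratio

for load in (0.50, 0.70, 0.95):
    ratio = pick_ratio(load)
    print(f"load {load:.0%}: run clock {ratio:.0%} of the time, "
          f"saving {saving_at_fixed_voltage(ratio):.0%}")
```

Because the voltage stays at 1.0V, the saving is simply the fraction of time the clock is stopped, which matches the 15% and 38% figures for the 85% and 62% ratios.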

18 Multi-step PWM

- Power management state machine under SW control
- Source bias for short clock-stop periods
- Power off with context save/restore for long periods

[Diagram: short stop (source bias, reduced leakage); long stop (power off, zero leakage) bracketed by context save and restore]
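
The two-level stop policy can be written as a small software state machine. This is a sketch under assumptions: the `PowerManager` class, the single break-even threshold and the dictionary standing in for CPU context are hypothetical, and real hardware sequencing is not modelled.

```python
from enum import Enum

class State(Enum):
    RUN = "run"
    SHORT_STOP = "source bias"  # clock stopped, reduced leakage, context retained
    LONG_STOP = "power off"     # zero leakage, context must be saved first

class PowerManager:
    def __init__(self, long_stop_threshold_us):
        # Hypothetical break-even point: below it, save/restore costs more than it saves.
        self.threshold = long_stop_threshold_us
        self.state = State.RUN
        self.saved_context = None

    def stop(self, expected_idle_us, context):
        """Stop the clock, choosing the stop depth from the expected idle time."""
        if self.state is not State.RUN:
            raise RuntimeError("already stopped")
        if expected_idle_us >= self.threshold:
            self.saved_context = dict(context)  # save before cutting power
            self.state = State.LONG_STOP
        else:
            self.state = State.SHORT_STOP
        return self.state

    def resume(self, live_context):
        """Return to RUN and give back the context the CPU continues with."""
        if self.state is State.LONG_STOP:
            live_context = dict(self.saved_context)  # restore after power-up
            self.saved_context = None
        self.state = State.RUN
        return live_context

pm = PowerManager(long_stop_threshold_us=1000)
print(pm.stop(50, {"pc": 0x100}))    # short idle period: source bias
pm.resume({"pc": 0x100})
print(pm.stop(5000, {"pc": 0x200}))  # long idle period: power off
print(pm.resume({}))                 # registers were lost; saved copy is restored
```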

19 Power management

- Power mode changes are managed by software:
  - Constraints and impact must be known by the software developer.
  - Information initially needed only at design level is now flowing into the software space.
- Power awareness in the software world is coming from the design world, through a better link between design tools and software development tools.
- Need for a power view of the application accessible to software developers.

20 Software Architecture for Multimedia Acceleration

21 Complex Multimedia Software Stack

[Diagram: software stack with blocks labelled Hardware; Codecs, Sensors, Presentation; Execution Infrastructure; Media Network Server; Multimedia API; Multimedia Framework; Operating System; User Interface. Annotations: SoC design perimeter; upward pervasion of design constraints]

22 Objectives

- A unified programming model for distributed computing
  - One S/W component can run anywhere possible
  - Dynamically configurable
  - Run complex algorithms that require more than one DSP
- Enforce software architecture
  - Modularity
  - Component programming model
  - Multimedia framework
- Comprehensive debug
  - System level monitoring
  - Components observable by construction (auto code instrumentation)

23 Complex use case illustration

- 16 QCIF decode
- 1 Grab & Viewfinder
- Graphics & control on Host CPU
- SVGA display
- 100mW

24 Architecture evolution

25 SoC evolution across technology nodes

- Constant SoC die size
- Slow evolution of peripherals (area decrease)
- General purpose CPU sub-system complexity doubles at each node (constant area)
- Embedded memory capacity doubles at each node (constant area)
- Loosely coupled DSP sub-system complexity increases by 30% at each node (30% area decrease)

[Chart: Technology Node (nm) on the axis; series labelled Loosely coupled Sub-Systems, General Purpose CPU (Single, Multiple) and Hardware Accelerator (Hardwired, Reconfigurable)]
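
The per-node scaling rules above compound quickly, which a short projection makes visible. The sketch assumes values normalised to 1.0 at the first node and uses a hypothetical node sequence of 90, 65 and 45 nm (only 45nm appears elsewhere in the deck); the `project` function is illustrative.

```python
def project(nodes, cpu=1.0, memory=1.0, dsp_complexity=1.0, dsp_area=1.0):
    """Apply the slide's per-node scaling rules, starting from normalised values.

    CPU complexity and memory capacity double per node at constant area; each
    loosely coupled DSP sub-system grows 30% in complexity and shrinks 30% in area.
    """
    rows = []
    for node in nodes:
        rows.append((node, cpu, memory, dsp_complexity, dsp_area))
        cpu *= 2
        memory *= 2
        dsp_complexity *= 1.3
        dsp_area *= 0.7
    return rows

# Hypothetical node sequence, used only to label the rows.
for node, cpu, mem, cx, area in project([90, 65, 45]):
    print(f"{node} nm: CPU x{cpu:.0f}, memory x{mem:.0f}, "
          f"DSP sub-system complexity x{cx:.2f} in x{area:.2f} area")
```

Since each DSP sub-system shrinks while growing more capable, complexity per unit area rises by 1.3 / 0.7, about 1.86 times per node, so a constant die can host more sub-systems at each step.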

26 Main trends

- Host CPU evolving toward a multi-core architecture to meet increasing performance requirements
- HW acceleration mapped on reconfigurable arrays
  - Performance close to dedicated HW in many areas
  - Good fit with the regular design constraints imposed by the 45nm process and beyond
  - Excellent structure for optimized power management
  - And … FLEXIBILITY …

27 Reconfigurable Hardware (DSP fabric)

- Targets signal processing and arithmetic-intensive applications
- Reconfigurable array of simple DSP cores (CNode)
- Low power architecture
  - Hierarchical clock gating
  - Distributed leakage control (fine-grain power gating)
- Programmable DMA engine
- Reconfigurable at run time, multi-task

28 Mapping Flow

- ALUs execute a cyclic micro-sequence
- Data exchanges go through a hierarchical clustered interconnect
- The configuration step is sequence loading and interconnect programming

[Diagram labels: Behavioral code; ILP + software pipelining; DFG; Partitioning / static scheduling; Coarse grained configuration; Data in / Data out; clustered interconnect with Level 0 clusters and Level 1 and Level 2 muxes (ports N0_i, N0_o, N1_i, N1_o, N2_i, N2_o)]

Behavioral code shown on the slide:
  Procedure(In,Out,inout)
  Constant A,b,c,…;
  Begin
  X=a-in[0];
  ……..
  End;

29 Mapping Flow

- 3D optimization problem (place/route/schedule)
- Traditional scheduling techniques for VLIW or clustered VLIW do not apply
  - These solutions do not take into account the spatial dimension of the problem
- Traditional P&R as used for FPGAs does not apply either, because it does not consider the time dimension
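
To see why placement and scheduling cannot be separated, consider a toy mapper in which the cycle an operation can start depends on where its operands were placed. The sketch below is a simple greedy heuristic written for illustration, not the mapping algorithm used for the Nomadik DSP fabric. It assumes one-cycle operations, a rectangular array, and a crude routing cost of one cycle per Manhattan hop; `map_dfg` and the example graph are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from itertools import product

def map_dfg(deps, rows, cols):
    """Greedy joint placement and scheduling of a data-flow graph.

    deps maps each operation to the operations whose results it consumes.
    Returns {op: (node, start_cycle)} where node is a (row, col) position.
    """
    nodes = list(product(range(rows), range(cols)))
    busy = set()   # (node, cycle) slots already taken
    placed = {}
    for op in TopologicalSorter(deps).static_order():
        best = None
        for node in nodes:
            # Earliest cycle at which every operand has reached this node.
            ready = 0
            for src in deps.get(op, ()):
                src_node, src_start = placed[src]
                hops = abs(node[0] - src_node[0]) + abs(node[1] - src_node[1])
                ready = max(ready, src_start + 1 + hops)
            cycle = ready
            while (node, cycle) in busy:  # node occupied: wait for a free slot
                cycle += 1
            if best is None or cycle < best[1]:
                best = (node, cycle)
        busy.add(best)
        placed[op] = best
    return placed

# s = x + y, t = f(s): x and y can run in parallel only on different nodes,
# but then one operand of s has to travel one hop.
graph = {"x": (), "y": (), "s": ("x", "y"), "t": ("s",)}
for op, (node, cycle) in map_dfg(graph, rows=1, cols=2).items():
    print(op, "on node", node, "at cycle", cycle)
```

Even in this tiny case the spatial choice changes the timing: spreading `x` and `y` over two nodes saves a cycle of computation but adds a cycle of transport.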

30 What can fit in 45mm² in 45nm

[Floorplan diagram, blocks as labelled:]
- Host Core 1 (L1), Host Core 2 (L1), L2
- 4MB Multi-port Embedded Memory
- Interconnect
- Peripherals & analog
- 6 × DSP sub-system (L1, DSP, HW, DMA)
- Programmable Multimedia Accelerator
- 192 CNode (40 GOPS)
- Imaging H/W
- Video H/W

31 CAD Challenges

32 Main areas of CAD challenges

- Low power design
  - Global optimization of static & dynamic power
  - Power control is becoming very fine grained and must be tightly linked with the software environment.
  - Power control goes beyond the pure SoC. A system-level power view is needed.
- Software design
  - Efficient software design on a hierarchical multiprocessor engine
  - Capability to architect & design software as efficiently as HW
  - Capture tools, simulation, verification, automated code generation

33 Main areas of CAD challenges

- Synthesis on reconfigurable hardware
  - Configuring the hardware network
  - 3D place & route of massively parallel code on arrays of DSPs
  - Design constraints going up into the software
    - Reconfiguration latency
    - Expected performance
- Reconfigurable hardware managed at software level
  - The software development environment has to be aware of reconfigurable hardware.
    - Profiling to extract hot spots and the benefit of moving them to hardware
    - Code generation as well as the reconfiguration sequence for hardware
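
The profiling step named above amounts to weighing the time saved in hardware against the reconfiguration latency. The following sketch shows one way a tool could rank hot spots. It assumes the array is configured once and reused for every call; the function names, profile format and figures are hypothetical.

```python
def offload_gain(calls, sw_time, hw_time, reconfig_latency):
    """Net time saved by moving a hot spot to reconfigurable hardware.

    All times share one unit. The array is assumed to be configured once and
    reused for every call, so the reconfiguration latency is paid a single time.
    """
    return calls * (sw_time - hw_time) - reconfig_latency

def worth_offloading(profile, reconfig_latency):
    """profile maps name -> (calls, sw_time, hw_time).

    Returns the names with a positive net gain, largest gain first.
    """
    gains = {name: offload_gain(calls, sw, hw, reconfig_latency)
             for name, (calls, sw, hw) in profile.items()}
    return sorted((n for n, g in gains.items() if g > 0), key=lambda n: -gains[n])

# Made-up profile: (calls, software time per call, hardware time per call).
profile = {"idct": (1000, 20, 2), "parse": (10, 5, 4), "filter": (200, 10, 5)}
print(worth_offloading(profile, reconfig_latency=500))
```

A rarely called function such as `parse` in this made-up profile never recovers the reconfiguration cost, which is why the latency has to be visible to the software tools.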

34 Conclusion

- For multimedia processors, the complexity is moving to software design
  - Hardware complexity is resolved through regular design (multicore host, multi-DSP, coarse-grained DSP fabric)
- The CAD challenge lies essentially in S/W design tools
  - Multimedia software execution infrastructure, simulation, debug
  - Programmable hardware acceleration