1
DAC 2006 CAD Challenges for Leading-Edge Multimedia Designs
NOMADIK: “The challenge of low power, high performance and scalable multimedia acceleration”. Alain Artieri, Patrick Blouet, STMicroelectronics, July 26, 2006
4 Multimedia Computing Landscape
5 The convergence paradigm. Personal Computer, Mobile Phone and Consumer Electronics converge into a New Mobile Multimedia Computing Architecture.
6 Consumer versus Computer. Consumer products: high quality of service; designed for worst case; highly parallel architecture; hardware accelerators. Personal computer: monolithic processor architecture; high MHz for performance; high power consumption; open OS; flexibility; rich set of standard interfaces for storage and connectivity. New computing architecture must combine the best of both worlds: open platform, multi-OS; flexibility; rich set of standard interfaces for storage and connectivity.
7 Cell Phones: a Key Driver. [Chart: unit volume (M units) across feature generations: Voice, Voice & Data, Multimedia, Global Convergence.]
8 Competing Technical Constraints: Scalability, Low Power, Multimedia Performance.
9 Multimedia Performance. Requirements: multiple video standards, encode and decode (MPEG4, H264, WMV, …), up to HDTV format; high resolution (VGA screen and above in a small form factor, output to HDTV with large screen); multi-megapixel camera, DSC-class image reconstruction chain and picture improvement; sophisticated audio use cases (combination of multiple codecs, sound effects, speech codecs, …); advanced 3D graphics acceleration for gaming; consume and produce high-bandwidth multimedia content.
10 Low Power: a key system technology driver. It is of course a product feature (battery lifetime), but it also helps product manufacturability (stacking within a power budget) and product cost (low-cost packaging, no heat sink).
11 Nomadik Architecture Overview
12 Application Processor Content. Host processor & peripherals: no differentiation. Multimedia acceleration: differentiating factor. [Diagram blocks: Host Processor, Peripherals, Multimedia Accelerator, Embedded Memory.] The architecture and design challenge is in multimedia acceleration (audio, video, imaging, graphics). This is where innovation is required and competitive advantage is built.
13 Nomadik Multimedia Acceleration Model. [Diagram: several identical sub-systems joined by an interconnect, each a DSP with a DMA engine (data mover) and tightly coupled HW acceleration.] Multiple DSP-based sub-systems; symmetrical DSPs (a generic S/W component can run anywhere); attached HW resources (dependence resolved at component manager level); …
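The placement rule on this slide (a generic component can run on any DSP, while a component that depends on attached HW is resolved by the component manager) could be sketched as below. This is only an illustrative Python sketch: `Subsystem`, `place`, the least-loaded policy and the HW block names are all hypothetical and are not taken from the Nomadik software.

```python
from dataclasses import dataclass, field

@dataclass
class Subsystem:
    """One DSP sub-system: a DSP plus its attached (tightly coupled) HW blocks."""
    name: str
    attached_hw: frozenset
    components: list = field(default_factory=list)

def place(component, needs_hw, subsystems):
    """Pick a sub-system for a component.

    needs_hw is the set of HW blocks the component depends on (empty for a
    generic component). Among the sub-systems that own every required block,
    the least loaded one is chosen (a hypothetical policy).
    """
    candidates = [s for s in subsystems if set(needs_hw) <= s.attached_hw]
    if not candidates:
        raise LookupError(f"no sub-system provides {sorted(needs_hw)}")
    target = min(candidates, key=lambda s: len(s.components))
    target.components.append(component)
    return target.name

# Hypothetical platform: three symmetrical DSPs with different attached HW.
subsystems = [
    Subsystem("video", frozenset({"video_codec_hw"})),
    Subsystem("imaging", frozenset({"image_pipe_hw"})),
    Subsystem("audio", frozenset()),
]
placements = {
    "h264_decoder": place("h264_decoder", {"video_codec_hw"}, subsystems),
    "mixer": place("mixer", set(), subsystems),
    "resampler": place("resampler", set(), subsystems),
}
```

The decoder is pinned to the only sub-system that owns the block it needs, while the two generic components are free to land on whichever DSP is least loaded.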
14 Multiple DSP approach benefits. High computing performance: multiple non-interfering domains of intense activity, each having its own processor, DMA services and hardware accelerators for data-intensive functions; hardware acceleration embedding standard functions (e.g. video codec, image reconstruction & improvement); highest and predictable performance through careful bus and memory hierarchy design. Low power (target: 100's of mW): intrinsically low-power sub-systems; fine-grain power management at sub-system level; leakage management by switching sub-systems on and off.
15 Power management. Combination of multiple techniques. Dynamic power reduction: clock gating; voltage scaling (DVFS); pulse-width modulation (PWM). Static power reduction: biasing; power on/off switching (power gating). A global system issue, from power management inside the OS down to the silicon process (e.g. gate leakage).
16 DVFS Principle. An operating system load monitor (SW) translates CPU performance requirements into an operating point through voltage/frequency tables. Process requirements: large voltage excursion; low leakage. [Chart: CPU voltage steps 1.3V, 1.2V, 1.1V; performance levels 100%, 85%, 62%; energy savings 28% and 55%.]
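The figures on this slide are roughly what the usual dynamic-power model gives, with power proportional to V² · f. The Python sketch below rests on two assumptions that the slide does not state: that the savings follow this model, and that 85% performance pairs with 1.2V and 62% with 1.1V. Note that the model yields a power ratio, whereas the slide labels the result an energy saving.

```python
def dvfs_power_saving(v, f_ratio, v_nominal=1.3):
    """Fractional saving when dynamic power scales as V^2 * f (assumed model)."""
    return 1.0 - (v / v_nominal) ** 2 * f_ratio

# (voltage, relative performance) pairs read off the slide; the pairing is assumed.
operating_points = [(1.3, 1.00), (1.2, 0.85), (1.1, 0.62)]
savings = [dvfs_power_saving(v, f) for v, f in operating_points]

for (v, f), s in zip(operating_points, savings):
    print(f"{v:.1f} V at {f:.0%} performance: {s:.1%} saving")
```

Under these assumptions the two reduced operating points come out at about 27.6% and 55.6%, close to the 28% and 55% shown on the slide.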
17 PWM Principle. An operating system load monitor (SW) translates CPU performance requirements into a clock setting through an active clock ratio table. Process requirements: clock as fast as possible; source bias or switch off when the clock is stopped. [Chart: CPU voltage fixed at 1.0V; performance levels 100%, 85%, 62%; energy savings 15% and 38%.]
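With the voltage fixed, the savings on this slide match a simple duty-cycle model: if the stopped clock costs nothing, the saving equals one minus the active clock ratio. A small Python sketch follows; the `idle_power_fraction` parameter is a hypothetical addition to show the effect of residual leakage while the clock is stopped.

```python
def pwm_saving(active_ratio, idle_power_fraction=0.0):
    """Fractional saving when the clock runs at full speed for active_ratio of the time.

    idle_power_fraction is the residual power while the clock is stopped,
    relative to active power (hypothetical parameter; 0.0 means an ideal stop).
    """
    average_power = active_ratio + (1.0 - active_ratio) * idle_power_fraction
    return 1.0 - average_power

ideal = [pwm_saving(r) for r in (1.00, 0.85, 0.62)]
leaky = pwm_saving(0.85, idle_power_fraction=0.1)  # stop state still draws 10%
```

The ideal case gives 15% and 38% for the 85% and 62% performance levels. The residual term shows why the slide asks for source bias or switch-off while the clock is stopped: any power left in the stopped state eats directly into the saving.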
18 Multi-step PWM. Power management state machine under SW control: source bias for a short clock stop period; power off with context save/restore for a long period. [Diagram: short stop (source bias: reduced leakage); long stop (power off: zero leakage), bracketed by save and restore.]
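One way software could choose between the two stop modes is an energy break-even test: source bias keeps paying a reduced leakage for the whole stop, while power-off pays a fixed cost for context save and restore. The Python sketch below is an assumption about how such a policy might look, not the policy used on Nomadik. The numeric values are made up for illustration, and wake-up latency is ignored.

```python
from enum import Enum

class StopMode(Enum):
    SOURCE_BIAS = "short stop (source bias, reduced leakage)"
    POWER_OFF = "long stop (power off, zero leakage)"

def break_even_s(biased_leak_w, save_restore_j):
    """Idle duration above which powering off costs less than source bias."""
    return save_restore_j / biased_leak_w

def choose_stop_mode(idle_s, biased_leak_w, save_restore_j):
    """Return the cheaper stop mode for an expected idle period.

    Source bias costs biased_leak_w * idle_s joules of leakage.
    Power off costs a fixed save_restore_j joules and nothing while off.
    """
    bias_energy = biased_leak_w * idle_s
    if save_restore_j < bias_energy:
        return StopMode.POWER_OFF
    return StopMode.SOURCE_BIAS

# Hypothetical figures: 2 mW biased leakage, 100 uJ per save/restore pair.
LEAK_W, SAVE_RESTORE_J = 2e-3, 1e-4
short_stop = choose_stop_mode(0.001, LEAK_W, SAVE_RESTORE_J)
long_stop = choose_stop_mode(1.0, LEAK_W, SAVE_RESTORE_J)
threshold = break_even_s(LEAK_W, SAVE_RESTORE_J)
```

With these made-up numbers the break-even point is 50 ms: a 1 ms stop stays in source bias and a 1 s stop powers off.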
19 Power management. Power mode changes are managed by software: constraints and impact must be known by the software developer. Information initially needed only at design level is now flowing into the software space. Power awareness in the software world is coming from the design world through a better link between design tools and software development tools. There is a need for a power view of the application that is accessible to software developers.
20 Software Architecture for Multimedia Acceleration
21 Complex Multimedia Software Stack. [Diagram labels: Hardware; Codecs, Sensors, Presentation; Execution Infrastructure; Media, Network, Server; Multimedia API; Multimedia Framework; Operating System; User Interface; SoC design perimeter.] Upward pervasion of design constraints.
22 Objectives. A unified programming model for distributed computing: one S/W component can run anywhere possible; dynamically configurable; run complex algorithms that require more than one DSP. Enforce software architecture: modularity; component programming model; multimedia framework. Comprehensive debug: system-level monitoring; components observable by construction (auto code instrumentation).
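"Observable by construction" suggests that instrumentation is added automatically when a component is built, not written by hand. A minimal Python sketch of that idea follows, using a class decorator. The `observable` decorator, the `TRACE` list and the `Mixer` component are hypothetical illustrations, not part of the Nomadik framework.

```python
import functools
import time

TRACE = []  # hypothetical system-level monitor: (component, method, seconds)

def _instrument(component, method, fn):
    """Wrap fn so that each call appends a timing record to TRACE."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            TRACE.append((component, method, time.perf_counter() - start))
    return wrapper

def observable(cls):
    """Class decorator that instruments every public method of a component."""
    for name, attr in list(vars(cls).items()):
        if callable(attr) and not name.startswith("_"):
            setattr(cls, name, _instrument(cls.__name__, name, attr))
    return cls

@observable
class Mixer:
    """Toy component: the author writes no monitoring code."""
    def process(self, a, b):
        return [x + y for x, y in zip(a, b)]

out = Mixer().process([1, 2], [3, 4])
```

After the call, `TRACE` holds one record for `Mixer.process`, so a monitor can observe the component without the author having touched its code.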
23 Complex use case illustration: 16 QCIF decode; 1 grab & viewfinder; graphics & control on host CPU; SVGA display; 100mW.
24 Architecture evolution
25 SoC evolution across technology nodes. Constant SoC die size. Slow evolution of peripherals (area decrease). General-purpose CPU sub-system complexity doubles at each node (constant area). Embedded memory capacity doubles at each node (constant area). Loosely coupled DSP sub-system complexity increases by 30% at each node (30% area decrease). [Chart over technology node (nm): loosely coupled sub-systems; general-purpose CPU, single to multiple; hardware accelerator, hardwired to reconfigurable.]
26 Main trends. Host CPU evolving toward a multi-core architecture to meet performance requirements. HW acceleration mapped on reconfigurable arrays: performance close to dedicated HW in many areas; good fit with the regular design constraints imposed by 45nm processes and beyond; excellent structure for optimized power management; and … FLEXIBILITY …
27 Reconfigurable Hardware (DSP fabric). Targets signal-processing and arithmetic-intensive applications. Reconfigurable array of simple DSP cores (CNode). Low-power architecture: hierarchical clock gating; distributed leakage control (fine-grain power gating). Programmable DMA engine. Reconfigurable at run time, multi-task.
28 Mapping Flow. ALUs execute a cyclic micro-sequence. Data exchanges go through a hierarchical clustered interconnect. The configuration step is sequence loading and interconnect programming. [Flow diagram stages: behavioral code (Procedure(In,Out,inout) Constant A,b,c,…; Begin X=a-in[0]; …….. End;); ILP + software pipelining; DFG; partitioning / static scheduling; coarse-grained configuration; each stage with data in and data out.] [Interconnect diagram: clusters joined by level 0, level 1 and level 2 muxes, with ports N0_i/N0_o, N1_i/N1_o, N2_i/N2_o.]
29 Mapping Flow. A 3D optimization problem (place/route/schedule). Traditional scheduling techniques for VLIW or clustered VLIW don't apply: they don't take into account the spatial dimension of the problem. Traditional P&R as used for FPGAs doesn't apply either, because it doesn't consider the time dimension.
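To illustrate why placement and scheduling cannot be separated here, the toy Python sketch below maps a small data-flow graph onto a grid of nodes with a greedy heuristic. It is not the authors' algorithm. The cost model is an assumption: one operation per node per cycle, one cycle per operation, and one extra cycle per Manhattan hop, with no routing congestion.

```python
from itertools import product

def map_dfg(ops, deps, rows, cols):
    """Greedy space-time mapping of a data-flow graph onto a rows x cols array.

    ops: operation names in topological order.
    deps: dict mapping an op to the list of ops that produce its operands.
    Returns a dict op -> (node, cycle), where node is (row, col).
    """
    nodes = list(product(range(rows), range(cols)))
    busy = set()    # (node, cycle) slots already taken
    placed = {}
    for op in ops:
        best = None
        for node in nodes:
            # Earliest cycle at which every operand has arrived at this node.
            ready = 0
            for producer in deps.get(op, []):
                p_node, p_cycle = placed[producer]
                hops = abs(p_node[0] - node[0]) + abs(p_node[1] - node[1])
                ready = max(ready, p_cycle + 1 + hops)
            cycle = ready
            while (node, cycle) in busy:
                cycle += 1
            if best is None or cycle < best[1]:
                best = (node, cycle)
        busy.add(best)
        placed[op] = best
    return placed

# c = f(a, b); d = g(c, a), mapped onto a 1 x 2 array.
deps = {"c": ["a", "b"], "d": ["c", "a"]}
mapping = map_dfg(["a", "b", "c", "d"], deps, rows=1, cols=2)
```

In this example `a` and `b` run in parallel on different nodes at cycle 0, but `c` cannot start before cycle 2 because one operand must travel one hop. A purely temporal scheduler would have placed it at cycle 1, which is the spatial effect the slide refers to.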
30 What can fit in 45mm² in 45nm. [Floorplan: Host Core 1 and Host Core 2 with L1 and L2 caches; 4MB multi-port embedded memory; interconnect; peripherals & analog; six DSP sub-systems (each with L1, DSP, HW, DMA); Video H/W; Imaging H/W; programmable multimedia accelerator with 192 CNode (40 GOPS).]
31 CAD Challenges
32 Main areas of CAD challenges. Low-power design: global optimization of static and dynamic power. Power control is becoming very fine grain and must be tightly linked with the software environment. Power control goes beyond the pure SoC: a system-level power view is needed. Software design: efficient software design on a hierarchical multiprocessor engine; capability to architect and design software as efficiently as HW; capture tools, simulation, verification, automated code generation.
33 Main areas of CAD challenges. Synthesis on reconfigurable hardware: configuring the hardware network; 3D place & route of massively parallel code on arrays of DSPs. Design constraints going up into the software: reconfiguration latency; expected performance. Reconfigurable hardware is managed at software level, so the software development environment has to be aware of it: profiling to extract hot spots and the benefit of moving them to hardware; code generation as well as the reconfiguration sequence for the hardware.
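The profiling question raised here (is a hot spot worth moving to hardware once reconfiguration latency is counted?) can be framed as an Amdahl-style estimate. The Python sketch below shows one possible form of that estimate. The function name, its parameters and the example numbers are hypothetical.

```python
def offload_gain(total, hot, hw_speedup, reconfig, reconfigurations=1):
    """Estimate run time and speed-up when a hot spot moves to reconfigurable HW.

    total: profiled total run time; hot: time spent in the hot spot;
    hw_speedup: assumed acceleration of the hot spot on the fabric;
    reconfig: latency of one reconfiguration (same time unit as total);
    reconfigurations: how many times the fabric must be reloaded.
    """
    new_total = (total - hot) + hot / hw_speedup + reconfig * reconfigurations
    return new_total, total / new_total

# Hypothetical profile: 100 ms total, 60 ms in the hot spot, 10x faster in HW,
# 5 ms per reconfiguration.
once_time, once_speedup = offload_gain(100.0, 60.0, 10.0, 5.0)
many_time, many_speedup = offload_gain(100.0, 60.0, 10.0, 5.0, reconfigurations=12)
```

With a single reconfiguration the estimated run time drops from 100 ms to 51 ms. With twelve reconfigurations it rises to 106 ms, so the offload loses. This is why reconfiguration latency has to be visible to the software tools.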
34 Conclusion. For multimedia processors, the complexity is moving to software design. Hardware complexity is resolved through regular design (multicore host, multi-DSP, coarse-grained DSP fabric). The CAD challenge lies essentially in S/W design tools: multimedia software execution infrastructure, simulation, debug; programmable hardware acceleration.