Download presentation
Presentation is loading. Please wait.
Published byHugh Andrews Modified over 9 years ago
1
11 1 AnySP: Anytime Anywhere Anyway Signal Processing Mark Woh 1, Sangwon Seo 1, Scott Mahlke 1,Trevor Mudge 1, Chaitali Chakrabarti 2, Krisztian Flautner 3 University of Michigan – ACAL 1 Arizona State University 2 ARM, Ltd. 3
2
22 2 University of Michigan - ACAL The Old Mobile PhoneThe Modern Mobile Phone 2 Video Recording Video Editing Higher Data Rates Photos From - http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/ http://www.apple.com/iphone 3D Rendering Advanced Image Processing Future phones are becoming more complex Richer applications require much more requirements How do phones handle this now?
3
33 3 University of Michigan - ACAL Inside Today’s Smart Phones 3 Inside the OMAP3430 Application Processor Inside the X-Gold 608 (Representation of QCOM) Modern phones are looking like Frankenchips! Some cores unused and functionality duplicated
4
44 4 University of Michigan - ACAL Cost for Multi-System Support Supporting multiple systems is reserved for the most expensive phones Cost is in supporting all the systems that may or may not be used at once 4 Programmable Unified Architectures Provide Lower Cost Faster Time to Market Support for Multiple Applications (Current and Future) Bug Fixes After Manufacturing So where do we start? Data gathered from - Ramacher, U. 2007. Software-Defined Radio Prospects for Multistandard Mobile Phones. Computer 40, 10 (Oct. 2007) - Finchelstein, D.F.; Sze, V.; Sinangil, M.E.; Koken, Y.; Chandrakasan, A.P., "A low-power 0.7-V H.264 720p video decoder," Solid-State Circuits Conference, 2008. A-SSCC '08.
5
55 5 University of Michigan - ACAL Power/Performance Requirements for Multiple Systems 5 Different applications have different power/performance characteristics! We need to design keeping each application in mind! (Not GPP but Domain Specific Processor) 1 10 100 1000 10000 0.1110100 B e t t e r P o w e r E f f i c i e n c y 1 M o p s / m W 1 0 M o p s / m W 1 0 0 M o p s / m W 1 0 0 0 M o p s / m W SODA (65nm) SODA (90nm) TI C6X Imagine VIRAMPentium M IBM Cell P e r f o r m a n c e ( G o p s ) Power(Watts) 3G Wireless 4 Mobile HD Video
6
66 6 The Applications Is there anything we can learn from the applications themselves? 6
7
77 7 University of Michigan - ACAL H.264 Basics 7 T.-A. Liu, T.-M. Lin, S. -Z. Wang, et al. “A low-power dual-mode video decoder for mobile applications,” IEEE Communications Magazine, volume 44, issue 8, pp.119-126, Aug. 2006.
8
88 8 University of Michigan - ACAL 4G Wireless Basics Three kernels make up the majority of the work FFT – Extract Data from Signals STBC – Combine Data into More Reliable Stream LDPC – Error Correction on Data Stream 8
9
99 9 University of Michigan - ACAL Mobile Signal Processing Algorithm Characteristics 9 Algorithms have different SIMD widths From very large to very small Though SIMD width varies all algorithms can exploit it Large percentage of work can be SIMDized Larger SIMD width tend to have less TLP Algorithm SIMDScalarOverheadSIMD WidthAmount Workload (%) (Elements)of TLP 4G FFT755201024Low STBC815144High LDPC49183396Low H.264 Deblocking Filter7213158Medium Intra-Prediction8551016Medium Inverse Transform805158High Motion Compensation755108High SIMD comes at a cost! Register File Power Data Movement/Alignment Cost SIMD architectures have to deal with this!
10
10 University of Michigan - ACAL Traditional SIMD Power Breakdown Register File Power consumes a lot of power in traditional 32-wide SIMD architecture SODA WCDMA MEM 3% REG 37% ALU+MULT 11% CONTROL 38% INTERCONNECT 2% SCALAR PIPE 9% MEM REG ALU+MULT CONTROL INTERCONNECT SCALAR PIPE
11
11 University of Michigan - ACAL Register File Access 11 Many of the register file access do not have to go back to the main register file Lots of power wasted on unneeded register file access!
12
12 University of Michigan - ACAL Instruction Pair Frequency 12 Like the Multiply-Accumulate (MAC) instruction there is opportunity to fuse other instructions A few instruction pairs (3-5) make up the majority of all instruction pairs! a)Intra-prediction and Deblocking Filter Combined Instruction PairFrequency 1multiply-add26.71% 2add-add13.74% 3shuffle-add8.54% 4shift right-add6.90% 5subtract-add6.94% 6add-shift right5.76% 7multiply-subtract4.00% 8shift right-subtract3.75% 9add-subtract3.07% 10Others20.45% b)LDPC Instruction PairFrequency 1shuffle-move32.07% 2abs-subtract8.54% 3move-subtract8.54% 4shuffle-subtract3.54% 5add-shuffle3.54% 6Others43.77% c)FFT Instruction PairFrequency 1shuffle-shuffle16.67% 2add-multiply16.67% 3multiply-subtract16.67% 4multiply-add16.67% 5subtract-mult16.67% 6shuffle-add16.67%
13
13 University of Michigan - ACAL Data Alignment Problem! H.264 Intra-prediction has 9 different prediction modes Each prediction mode requires a specific permutation 13 Traditional SIMD machines take too long or cost too much to do this Good news – small fixed number patterns per kernel Intra-Prediction
14
14 University of Michigan - ACAL Summary Conclusion about 4G and H.264 Lots of different sized parallelism From 4 wide to 96 wide to 1024 wide SIMD Which means many different SIMD widths need to be supported Very short lived values Lots of potential for instruction fusings Limited set of shuffle patterns required for each kernel
15
15 AnySP Design 15
16
16 University of Michigan - ACAL Traditional SIMD Architectures 16 32-Wide SIMD with Simple Shuffle Network
17
17 University of Michigan - ACAL AnySP Architecture – High Level 17 8 Groups of 8-Wide Flexible Function Units 128x128 16bit Swizzle Network 16 Banked Memory with SRAM-based Crossbar Multiple Output Adder Tree Temporary Buffer and Bypass Network Datapath AGU and Scalar Pipeline
18
18 University of Michigan - ACAL Multi-Width Support 18 Normal 64-Wide SIMD mode – all lanes share one AGU Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets
19
19 University of Michigan - ACAL AnySP FFU Datapath 19 Flexible Functional Unit allows us to 1.Exploit Pipeline-parallelism by joining two lanes together 2.Handle register bypass and the temporary buffer 3.Join multiple pipelines to process deeper subgraphs 4.Fuse Instruction Pairs
20
20 AnySP Results 20
21
21 University of Michigan - ACAL Simulation Environment Traditional SIMD architecture comparison SODA at 90nm technology AnySP Synthesized at 90nm TSMC Power, timing, area numbers were extracted Performance and Power for each kernel was generated using synthesized data on in-house simulator 4G – based on a NTT DoCoMo 4G test setup H.264 – 4CIF@30fps 21
22
22 University of Michigan - ACAL AnySP Speedup vs SIMD-based Architecture For all benchmarks we perform more than 2x better than a SIMD-based architecture 22
23
23 University of Michigan - ACAL AnySP Energy-Delay vs SIMD-based Architecture More importantly energy efficiency is much better! 23
24
24 University of Michigan - ACAL PE System Total ComponentsUnits SIMD Data Mem(32KB) SIMD Register File(16x1024bit) SIMD ALUs,Multipliers,and SSN SIMD Pipeline+Clock+Routing Intra-processor Interconnect Scalar/AGU Pipeline&Misc. ARM(Cortex-M3) Global Scratchpad Memory(128KB) Inter-processor Bus with DMA 90nm(1V@300MHz) 4 4 4 4 4 4 1 1 1 Area mm 2 Area % 9.7638.78% 3.1712.59% 4.5017.88% 1.184.69% 0.943.73% 1.224.85% 0.62.38% 1.87.15% 1.03.97% 25.17100% Est. 65nm(0.9V@300MHz)13.14 45nm(0.8V@300MHz)6.86 4G+H.264Decoder Power mW Power % 102.887.24% 299.0021.05% 448.5131.58% 233.6016.45% 93.446.58% 134.329.46% 2.5<1% 10<1% 1.5<1% 1347.03100% 1091.09 862.09 SIMD Buffer(128B) SIMD Adder Tree 4 4 0.823.25% 0.18<1% 84.095.92% 10.43<1% AnySP Power Breakdown We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm 24
25
25 University of Michigan - ACAL Conclusion & Future Work Conclusion We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform Under the power budget and meeting the performance at 45nm Future and Ongoing Work Application-specific language Larger class of algorithms for AnySP Better utilization of resources for non-parallel kernels Speedup sequential parts 25
26
26 The End 26
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.