Presentation is loading. Please wait.

Presentation is loading. Please wait.

11 1 AnySP: Anytime Anywhere Anyway Signal Processing Mark Woh 1, Sangwon Seo 1, Scott Mahlke 1,Trevor Mudge 1, Chaitali Chakrabarti 2, Krisztian Flautner.

Similar presentations


Presentation on theme: "11 1 AnySP: Anytime Anywhere Anyway Signal Processing Mark Woh 1, Sangwon Seo 1, Scott Mahlke 1,Trevor Mudge 1, Chaitali Chakrabarti 2, Krisztian Flautner."— Presentation transcript:

1 11 1 AnySP: Anytime Anywhere Anyway Signal Processing Mark Woh 1, Sangwon Seo 1, Scott Mahlke 1,Trevor Mudge 1, Chaitali Chakrabarti 2, Krisztian Flautner 3 University of Michigan – ACAL 1 Arizona State University 2 ARM, Ltd. 3

2 22 2 University of Michigan - ACAL The Old Mobile PhoneThe Modern Mobile Phone 2 Video Recording Video Editing Higher Data Rates Photos From - http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/ http://www.apple.com/iphone 3D Rendering Advanced Image Processing Future phones are becoming more complex Richer applications require much more requirements How do phones handle this now?

3 33 3 University of Michigan - ACAL Inside Today’s Smart Phones 3 Inside the OMAP3430 Application Processor Inside the X-Gold 608 (Representation of QCOM) Modern phones are looking like Frankenchips! Some cores unused and functionality duplicated

4 44 4 University of Michigan - ACAL Cost for Multi-System Support  Supporting multiple systems is reserved for the most expensive phones  Cost is in supporting all the systems that may or may not be used at once 4 Programmable Unified Architectures Provide  Lower Cost  Faster Time to Market  Support for Multiple Applications (Current and Future)  Bug Fixes After Manufacturing So where do we start? Data gathered from - Ramacher, U. 2007. Software-Defined Radio Prospects for Multistandard Mobile Phones. Computer 40, 10 (Oct. 2007) - Finchelstein, D.F.; Sze, V.; Sinangil, M.E.; Koken, Y.; Chandrakasan, A.P., "A low-power 0.7-V H.264 720p video decoder," Solid-State Circuits Conference, 2008. A-SSCC '08.

5 55 5 University of Michigan - ACAL Power/Performance Requirements for Multiple Systems 5 Different applications have different power/performance characteristics! We need to design keeping each application in mind! (Not GPP but Domain Specific Processor) 1 10 100 1000 10000 0.1110100 B e t t e r P o w e r E f f i c i e n c y 1 M o p s / m W 1 0 M o p s / m W 1 0 0 M o p s / m W 1 0 0 0 M o p s / m W SODA (65nm) SODA (90nm) TI C6X Imagine VIRAMPentium M IBM Cell P e r f o r m a n c e ( G o p s ) Power(Watts) 3G Wireless 4 Mobile HD Video

6 66 6 The Applications Is there anything we can learn from the applications themselves? 6

7 77 7 University of Michigan - ACAL H.264 Basics 7 T.-A. Liu, T.-M. Lin, S. -Z. Wang, et al. “A low-power dual-mode video decoder for mobile applications,” IEEE Communications Magazine, volume 44, issue 8, pp.119-126, Aug. 2006.

8 88 8 University of Michigan - ACAL 4G Wireless Basics  Three kernels make up the majority of the work  FFT – Extract Data from Signals  STBC – Combine Data into More Reliable Stream  LDPC – Error Correction on Data Stream 8

9 99 9 University of Michigan - ACAL Mobile Signal Processing Algorithm Characteristics 9  Algorithms have different SIMD widths  From very large to very small  Though SIMD width varies all algorithms can exploit it  Large percentage of work can be SIMDized  Larger SIMD width tend to have less TLP Algorithm SIMDScalarOverheadSIMD WidthAmount Workload (%) (Elements)of TLP 4G FFT755201024Low STBC815144High LDPC49183396Low H.264 Deblocking Filter7213158Medium Intra-Prediction8551016Medium Inverse Transform805158High Motion Compensation755108High SIMD comes at a cost! Register File Power Data Movement/Alignment Cost SIMD architectures have to deal with this!

10 10 University of Michigan - ACAL Traditional SIMD Power Breakdown  Register File Power consumes a lot of power in traditional 32-wide SIMD architecture SODA WCDMA MEM 3% REG 37% ALU+MULT 11% CONTROL 38% INTERCONNECT 2% SCALAR PIPE 9% MEM REG ALU+MULT CONTROL INTERCONNECT SCALAR PIPE

11 11 University of Michigan - ACAL Register File Access 11  Many of the register file access do not have to go back to the main register file Lots of power wasted on unneeded register file access!

12 12 University of Michigan - ACAL Instruction Pair Frequency 12 Like the Multiply-Accumulate (MAC) instruction there is opportunity to fuse other instructions A few instruction pairs (3-5) make up the majority of all instruction pairs! a)Intra-prediction and Deblocking Filter Combined Instruction PairFrequency 1multiply-add26.71% 2add-add13.74% 3shuffle-add8.54% 4shift right-add6.90% 5subtract-add6.94% 6add-shift right5.76% 7multiply-subtract4.00% 8shift right-subtract3.75% 9add-subtract3.07% 10Others20.45% b)LDPC Instruction PairFrequency 1shuffle-move32.07% 2abs-subtract8.54% 3move-subtract8.54% 4shuffle-subtract3.54% 5add-shuffle3.54% 6Others43.77% c)FFT Instruction PairFrequency 1shuffle-shuffle16.67% 2add-multiply16.67% 3multiply-subtract16.67% 4multiply-add16.67% 5subtract-mult16.67% 6shuffle-add16.67%

13 13 University of Michigan - ACAL Data Alignment Problem!  H.264 Intra-prediction has 9 different prediction modes  Each prediction mode requires a specific permutation 13 Traditional SIMD machines take too long or cost too much to do this Good news – small fixed number patterns per kernel Intra-Prediction

14 14 University of Michigan - ACAL Summary  Conclusion about 4G and H.264  Lots of different sized parallelism  From 4 wide to 96 wide to 1024 wide SIMD  Which means many different SIMD widths need to be supported  Very short lived values  Lots of potential for instruction fusings  Limited set of shuffle patterns required for each kernel

15 15 AnySP Design 15

16 16 University of Michigan - ACAL Traditional SIMD Architectures 16 32-Wide SIMD with Simple Shuffle Network

17 17 University of Michigan - ACAL AnySP Architecture – High Level 17 8 Groups of 8-Wide Flexible Function Units 128x128 16bit Swizzle Network 16 Banked Memory with SRAM-based Crossbar Multiple Output Adder Tree Temporary Buffer and Bypass Network Datapath AGU and Scalar Pipeline

18 18 University of Michigan - ACAL Multi-Width Support 18 Normal 64-Wide SIMD mode – all lanes share one AGU Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets

19 19 University of Michigan - ACAL AnySP FFU Datapath 19 Flexible Functional Unit allows us to 1.Exploit Pipeline-parallelism by joining two lanes together 2.Handle register bypass and the temporary buffer 3.Join multiple pipelines to process deeper subgraphs 4.Fuse Instruction Pairs

20 20 AnySP Results 20

21 21 University of Michigan - ACAL Simulation Environment  Traditional SIMD architecture comparison  SODA at 90nm technology  AnySP  Synthesized at 90nm TSMC  Power, timing, area numbers were extracted  Performance and Power for each kernel was generated using synthesized data on in-house simulator  4G – based on a NTT DoCoMo 4G test setup  H.264 – 4CIF@30fps 21

22 22 University of Michigan - ACAL AnySP Speedup vs SIMD-based Architecture  For all benchmarks we perform more than 2x better than a SIMD-based architecture 22

23 23 University of Michigan - ACAL AnySP Energy-Delay vs SIMD-based Architecture  More importantly energy efficiency is much better! 23

24 24 University of Michigan - ACAL PE System Total ComponentsUnits SIMD Data Mem(32KB) SIMD Register File(16x1024bit) SIMD ALUs,Multipliers,and SSN SIMD Pipeline+Clock+Routing Intra-processor Interconnect Scalar/AGU Pipeline&Misc. ARM(Cortex-M3) Global Scratchpad Memory(128KB) Inter-processor Bus with DMA 90nm(1V@300MHz) 4 4 4 4 4 4 1 1 1 Area mm 2 Area % 9.7638.78% 3.1712.59% 4.5017.88% 1.184.69% 0.943.73% 1.224.85% 0.62.38% 1.87.15% 1.03.97% 25.17100% Est. 65nm(0.9V@300MHz)13.14 45nm(0.8V@300MHz)6.86 4G+H.264Decoder Power mW Power % 102.887.24% 299.0021.05% 448.5131.58% 233.6016.45% 93.446.58% 134.329.46% 2.5<1% 10<1% 1.5<1% 1347.03100% 1091.09 862.09 SIMD Buffer(128B) SIMD Adder Tree 4 4 0.823.25% 0.18<1% 84.095.92% 10.43<1% AnySP Power Breakdown  We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm 24

25 25 University of Michigan - ACAL Conclusion & Future Work  Conclusion  We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform  Under the power budget and meeting the performance at 45nm  Future and Ongoing Work  Application-specific language  Larger class of algorithms for AnySP  Better utilization of resources for non-parallel kernels  Speedup sequential parts 25

26 26 The End 26


Download ppt "11 1 AnySP: Anytime Anywhere Anyway Signal Processing Mark Woh 1, Sangwon Seo 1, Scott Mahlke 1,Trevor Mudge 1, Chaitali Chakrabarti 2, Krisztian Flautner."

Similar presentations


Ads by Google