Download presentation
Presentation is loading. Please wait.
1
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science High Performance Mobile Computing Using Flexible Wide SIMD Processors Scott Mahlke in collaboration with Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali Chakrabarti (ASU), Krisztian Flautner (ARM Ltd.) Advanced Computer Architecture Laboratory University of Michigan
2
Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science The Old Mobile Phone The Modern Mobile Phone 2 Video Recording Video Editing Higher Data Rates 3D Rendering Advanced Image Processing Future phones are becoming more complex Richer applications require both more performance and more flexibility Modern phones look like Franken-chips
3
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Power/Performance Requirements for Multiple Systems 3 Different applications have different power/performance characteristics! We need to design keeping each application in mind! (Not GPP but Domain Specific Processor) 1 10 100 1000 10000 0.1110100 B e t t e r P o w e r E f f i c i e n c y 1 M o p s / m W 1 0 M o p s / m W 1 0 0 M o p s / m W 1 0 0 0 M o p s / m W SODA (65nm) SODA (90nm) TI C6X Imagine VIRAMPentium M IBM Cell P e r f o r m a n c e ( G o p s ) Power(Watts) 3G Wireless 4 Mobile HD Video
4
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 4G Wireless Basics Three kernels make up the majority of the work ► FFT – Extract Data from Signals ► STBC – Combine Data into More Reliable Stream ► LDPC – Error Correction on Data Stream 4 NTT DoCoMo 4G test setup
5
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science High Definition Video (H.264) Basics 5 4CIF@30fps
6
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Mobile Signal Processing Algorithm Characteristics 6 Algorithms have different SIMD widths ► From very large to very small Though SIMD width varies all algorithms can exploit it ► Large percentage of work can be SIMDized Larger SIMD width tend to have less TLP Algorithm SIMDScalarOverheadSIMD WidthAmount Workload (%) (Elements)of TLP 4G FFT755201024Low STBC815144High LDPC49183396Low H.264 Deblocking Filter7213158Medium Intra-Prediction8551016Medium Inverse Transform805158High Motion Compensation755108High Problems with traditional SIMD High register file power Large data movement/alignment cost Inconsistent lane utilization SIMD implies single thread
7
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science So, What’s the Right Solution? Alternatives ► More processors, less lanes? ► Configurable: Hardware can be SIMD or MIMD? ► Franken chip? SIMD is the answer! It provides high performance and power efficiency ► Low control cost ► More area-efficient scaling ► Single thread context ► Simpler memory system design – no cache coherence
8
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science A Closer Look at SIMD: Power Breakdown Register file power disproportionately high in a traditional SIMD architecture SODA WCDMA MEM 3% REG 37% ALU+MULT 11% CONTROL 38% INTERCONNECT 2% SCALAR PIPE 9% MEM REG ALU+MULT CONTROL INTERCONNECT SCALAR PIPE
9
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Register File Accesses 9 Many of the register file access do not have to go back to the main register file Lots of power wasted on unneeded register file access!
10
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 10 LDPC – Scaling Performance with SIMD Width Increasing SIMD width reduces Memory and Shuffle Operations thus reduces power Extra hardware power consumption outweighs reduced operations SIMD loses effectiveness when lanes cannot be put to productive use SIMD on distributed data (SIMdD) Efficient data rearrangement critical to success of SIMD
11
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Data Alignment Issues H.264 Intra-prediction has 9 different prediction modes ► Each prediction mode requires a specific permutation 11 Traditional SIMD machines take too long or cost too much to do this Good news – small fixed number patterns per kernel Intra-Prediction
12
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 4G/H.264 Summary Lots of different sized parallelism ► From 4 wide to 96 wide to 1024 wide SIMD Which means many different SIMD widths need to be supported TLP (disjoint SIMD) often available Very short-lived values Lots of potential for instruction fusings (beyond pairwise) Limited set of shuffle patterns required for each kernel
13
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science AnySP: Push SIMD But, Increase the Inherent Flexibility and Efficiency 13
14
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science AnySP Architecture – High Level 14 8 Groups of 8-Wide Flexible Function Units 128x128 16bit Swizzle Network 16 Banked Memory with SRAM-based Crossbar Multiple Output Adder Tree Temporary Buffer and Bypass Network Datapath AGU and Scalar Pipeline
15
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Multi-Width SIMD Support 15 Normal 64-Wide SIMD mode – all lanes share one AGU Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets
16
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Using SIMD Lanes for Deeper Subgraphs 16 Flexible Functional Unit allows us to 1.Exploit Pipeline-parallelism by joining two lanes together 2.Handle register bypass and the temporary buffer 3.Join multiple pipelines to process deeper subgraphs 4.Fuse Instruction Pairs
17
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 17 SRAM-based Crossbar Multiple SRAM cells replace MUX of traditonal crossbar Each cell stores configuration information The controller selects the specific configuration based on the instruction parameter Each cell can store up to 6 different configurations Power reduced by 50% for 128x128 crossbar
18
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science AnySP vs SIMD-based Architecture SIMD width doubled But that only provides half the performance gain, other half due to flexibility features 18
19
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science AnySP Energy-Delay vs SIMD-based Architecture Comparison based on 90nm synthesis results Flexibility increases utilization of datapath and hence its efficiency 19
20
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science PE System Total ComponentsUnits SIMD Data Mem(32KB) SIMD Register File(16x1024bit) SIMD ALUs,Multipliers,and SSN SIMD Pipeline+Clock+Routing Intra-processor Interconnect Scalar/AGU Pipeline&Misc. ARM(Cortex-M3) Global Scratchpad Memory(128KB) Inter-processor Bus with DMA 90nm(1V@300MHz) 4 4 4 4 4 4 1 1 1 Area mm 2 Area % 9.7638.78% 3.1712.59% 4.5017.88% 1.184.69% 0.943.73% 1.224.85% 0.62.38% 1.87.15% 1.03.97% 25.17100% Est. 65nm(0.9V@300MHz)13.14 45nm(0.8V@300MHz)6.86 4G+H.264Decoder Power mW Power % 102.887.24% 299.0021.05% 448.5131.58% 233.6016.45% 93.446.58% 134.329.46% 2.5<1% 10<1% 1.5 <1% 1347.03100% 1091.09 862.09 SIMD Buffer(128B) SIMD Adder Tree 4 4 0.823.25% 0.18<1% 84.095.92% 10.43<1% AnySP Power Breakdown We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm 20
21
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Conclusions Scaling traditional SIMD for mobile applications ► Wide-SIMD hardware under-utilized ► Large fraction of power on non-computation AnySP design ► Can possibly meet the requirements of 100Mbps 4G and HD video on the same platform @45nm Flexibility/Efficiency improvements ► Increase SIMD utilization (FFUs, multiple short vectors) ► Reduce register file power (bypass buffer) ► More efficient data shuffling (SRAM-based crossbar) 21
22
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science Questions For more information ► http://cccp.eecs.umich.edu
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.