High Performance Mobile Computing Using Flexible Wide SIMD Processors

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
High Performance Mobile Computing Using Flexible Wide SIMD Processors
Scott Mahlke, in collaboration with Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali Chakrabarti (ASU), Krisztian Flautner (ARM Ltd.)
Advanced Computer Architecture Laboratory, University of Michigan

The Old Mobile Phone vs. The Modern Mobile Phone
[Figure labels: Video Recording, Video Editing, Higher Data Rates, 3D Rendering, Advanced Image Processing]
Future phones are becoming more complex
Richer applications require both more performance and more flexibility
Modern phones look like Franken-chips

Power/Performance Requirements for Multiple Systems
Different applications have different power/performance characteristics!
We need to design keeping each application in mind! (Not GPP but Domain Specific Processor)
[Figure: Performance (Gops) versus Power (Watts), with lines of constant power efficiency (1 Mops/mW, 10 Mops/mW, and two further lines whose values did not survive). Plotted designs: SODA (65nm), SODA (90nm), TI C6X, Imagine, VIRAM, Pentium M, IBM Cell. Application regions: 3G Wireless, 4 [label truncated], Mobile HD Video]

4G Wireless Basics
Three kernels make up the majority of the work
► FFT – Extract Data from Signals
► STBC – Combine Data into More Reliable Stream
► LDPC – Error Correction on Data Stream
[Figure: NTT DoCoMo 4G test setup]

High Definition Video (H.264) Basics

Mobile Signal Processing Algorithm Characteristics
Algorithms have different SIMD widths
► From very large to very small
Though SIMD width varies, all algorithms can exploit it
► Large percentage of work can be SIMDized
Algorithms with larger SIMD widths tend to have less TLP

Algorithm | SIMD Workload (%) | Scalar Workload (%) | Overhead Workload (%) | SIMD Width (Elements) | Amount of TLP
4G FFT | ... | ... | ... | ... | Low
4G STBC | 81 | 5 | 14 | 4 | High
4G LDPC | ... | ... | ... | ... | Low
H.264 Deblocking Filter | ... | ... | ... | ... | Medium
H.264 Intra-Prediction | ... | ... | ... | ... | Medium
H.264 Inverse Transform | 80 | 5 | 15 | 8 | High
H.264 Motion Compensation | 75 | 5 | 10 | 8 | High

Problems with traditional SIMD
► High register file power
► Large data movement/alignment cost
► Inconsistent lane utilization
► SIMD implies single thread
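The "inconsistent lane utilization" problem can be made concrete with a little arithmetic. The sketch below is an illustration only: it assumes a hypothetical fixed-width machine that runs one vector operation at a time and leaves unused lanes idole-free padding aside, simply counting idle lane slots in any partially filled pass. The function name `lane_utilization`, the constant `HW_LANES = 64`, and the sample widths are example values chosen here, not measurements from the slides.

```python
import math

def lane_utilization(vector_width: int, hw_lanes: int) -> float:
    """Fraction of lane slots doing useful work when a vector of
    `vector_width` elements is issued on `hw_lanes` lanes, one
    vector operation at a time (no threads sharing the lanes)."""
    passes = math.ceil(vector_width / hw_lanes)   # partially filled passes still cost a full pass
    return vector_width / (passes * hw_lanes)

HW_LANES = 64  # assumed hardware width, for illustration only

for width in (4, 8, 96, 1024):
    util = lane_utilization(width, HW_LANES)
    print(f"{width:5d}-wide algorithm on {HW_LANES} lanes: utilization {util}")
```

Under these assumptions, narrow kernels leave most lanes idle unless several independent short vectors can share the datapath.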

So, What’s the Right Solution?
Alternatives
► More processors, fewer lanes?
► Configurable: hardware can be SIMD or MIMD?
► Franken-chip?
SIMD is the answer! It provides high performance and power efficiency
► Low control cost
► More area-efficient scaling
► Single thread context
► Simpler memory system design – no cache coherence

A Closer Look at SIMD: Power Breakdown
Register file power is disproportionately high in a traditional SIMD architecture
[Figure: SODA power breakdown for WCDMA: MEM 3%, REG 37%, ALU+MULT 11%, CONTROL 38%, INTERCONNECT 2%, SCALAR PIPE 9%]

Register File Accesses
Many of the register file accesses do not have to go back to the main register file
Lots of power is wasted on unneeded register file accesses!

LDPC – Scaling Performance with SIMD Width
Increasing SIMD width reduces memory and shuffle operations and thus reduces power
Extra hardware power consumption outweighs reduced operations
SIMD loses effectiveness when lanes cannot be put to productive use
SIMD on distributed data (SIMdD)
Efficient data rearrangement is critical to the success of SIMD

Data Alignment Issues
H.264 Intra-prediction has 9 different prediction modes
► Each prediction mode requires a specific permutation
Traditional SIMD machines take too long or cost too much to do this
Good news – a small, fixed number of patterns per kernel
[Figure: Intra-Prediction]
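The idea of a small, fixed set of permutations per kernel can be sketched as a lookup table of shuffle patterns applied by a gather. This is a minimal sketch under stated assumptions: the 8-element width, the pattern names, and the patterns themselves are hypothetical and chosen for readability; they are not the actual H.264 intra-prediction permutations.

```python
from typing import Dict, List, Sequence

# Hypothetical 8-element shuffle patterns, for illustration only.
# They are NOT the real H.264 intra-prediction permutations.
# pattern[i] is the source index that feeds output lane i.
PATTERNS: Dict[str, List[int]] = {
    "identity":        [0, 1, 2, 3, 4, 5, 6, 7],
    "reverse":         [7, 6, 5, 4, 3, 2, 1, 0],
    "rotate_left_1":   [1, 2, 3, 4, 5, 6, 7, 0],
    "broadcast_lane0": [0, 0, 0, 0, 0, 0, 0, 0],
}

def shuffle(vec: Sequence[int], pattern: Sequence[int]) -> List[int]:
    """Gather: output lane i takes its value from vec[pattern[i]]."""
    if len(vec) != len(pattern):
        raise ValueError("vector and pattern must have the same width")
    return [vec[src] for src in pattern]

pixels = [10, 11, 12, 13, 14, 15, 16, 17]
for name, pattern in PATTERNS.items():
    print(f"{name:16s} {shuffle(pixels, pattern)}")
```

Because the set of patterns is known ahead of time, a kernel only needs to name a pattern rather than compute lane indices at run time.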

4G/H.264 Summary
Parallelism comes in many different sizes
► From 4 wide to 96 wide to 1024 wide SIMD
► Which means many different SIMD widths need to be supported
TLP (disjoint SIMD) often available
Very short-lived values
Lots of potential for instruction fusing (beyond pairwise)
Limited set of shuffle patterns required for each kernel

AnySP: Push SIMD, But Increase the Inherent Flexibility and Efficiency

AnySP Architecture – High Level
► 8 Groups of 8-Wide Flexible Function Units
► 128x128 16-bit Swizzle Network
► 16 Banked Memory with SRAM-based Crossbar
► Multiple Output Adder Tree
► Temporary Buffer and Bypass Network
► Datapath
► AGU and Scalar Pipeline

Multi-Width SIMD Support
Normal 64-wide SIMD mode – all lanes share one AGU
Each 8-wide SIMD group works on different memory locations of the same 8-wide code – AGU offsets
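The two modes can be modeled as two ways of generating per-lane addresses. The sketch below is a simplification, not the AnySP hardware: it assumes unit-stride, element-indexed addressing, and the names `lane_addresses` and `group_offsets` are invented for this illustration.

```python
from typing import List, Optional, Sequence

NUM_GROUPS = 8    # 8 SIMD groups ...
GROUP_WIDTH = 8   # ... of 8 lanes each, 64 lanes in total

def lane_addresses(base: int,
                   group_offsets: Optional[Sequence[int]] = None) -> List[int]:
    """Return the element address each of the 64 lanes accesses.

    group_offsets=None models the normal 64-wide mode: a single
    address stream, with lanes reading consecutive elements.
    Otherwise each 8-wide group g starts at base + group_offsets[g],
    so the groups run the same 8-wide code on different data.
    Unit-stride, element-indexed addressing is an assumption here.
    """
    if group_offsets is None:
        return [base + lane for lane in range(NUM_GROUPS * GROUP_WIDTH)]
    if len(group_offsets) != NUM_GROUPS:
        raise ValueError("need one offset per SIMD group")
    return [base + group_offsets[g] + lane
            for g in range(NUM_GROUPS)
            for lane in range(GROUP_WIDTH)]

wide = lane_addresses(base=0)
split = lane_addresses(base=0, group_offsets=[100 * g for g in range(NUM_GROUPS)])
print("64-wide mode, first 10 lanes:  ", wide[:10])
print("8x8-wide mode, first 10 lanes: ", split[:10])
```

In the second call, lanes 0 to 7 read elements 0 to 7 while lanes 8 to 15 read elements 100 to 107, which is how eight short vectors can keep all 64 lanes busy.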

Using SIMD Lanes for Deeper Subgraphs
Flexible Functional Unit allows us to
1. Exploit pipeline parallelism by joining two lanes together
2. Handle register bypass and the temporary buffer
3. Join multiple pipelines to process deeper subgraphs
4. Fuse instruction pairs
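The benefit of fusing an instruction pair can be illustrated by counting register file accesses. This is a rough sketch, not the AnySP implementation: it assumes a hypothetical cost model in which each source operand costs one register file read and each result costs one write, and in which a fused pair forwards its intermediate value directly from the first operation to the second. The type alias `Op` and both function names are invented for this example.

```python
from typing import List, Tuple

# An operation is (destination register, source registers).
Op = Tuple[str, Tuple[str, ...]]

def rf_accesses(ops: List[Op]) -> int:
    """Accesses when every op reads all sources from, and writes its
    result to, the main register file."""
    return sum(len(srcs) + 1 for _dst, srcs in ops)

def rf_accesses_fused_pair(first: Op, second: Op) -> int:
    """Accesses when `second` consumes `first`'s result directly.

    The intermediate value is never written to or read from the
    register file; this assumes no other instruction needs it.
    """
    dst1, srcs1 = first
    _dst2, srcs2 = second
    if dst1 not in srcs2:
        raise ValueError("ops are not a producer/consumer pair")
    external_reads = len(srcs1) + sum(1 for s in srcs2 if s != dst1)
    return external_reads + 1  # plus one write for the final result

mul = ("t", ("a", "b"))  # t = a * b
add = ("r", ("t", "c"))  # r = t + c
print("unfused accesses:", rf_accesses([mul, add]))
print("fused accesses:  ", rf_accesses_fused_pair(mul, add))
```

Under this model the multiply-add pair drops from six register file accesses to four, because the short-lived value `t` never touches the register file.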

SRAM-based Crossbar
Multiple SRAM cells replace the MUX of a traditional crossbar
Each cell stores configuration information
The controller selects the specific configuration based on the instruction parameter
Each cell can store up to 6 different configurations
Power reduced by 50% for a 128x128 crossbar
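A software analogy may help: routing patterns are stored once, and each instruction then selects one by index. The toy model below is only a behavioral sketch. The class name `SramCrossbar`, its methods, and the 4-wide example are assumptions made for illustration; only the limit of 6 stored configurations comes from the slide.

```python
from typing import List, Sequence

class SramCrossbar:
    """Toy behavioral model of a crossbar with preloaded routing patterns."""

    MAX_CONFIGS = 6  # from the slide: up to 6 configurations per cell

    def __init__(self, width: int) -> None:
        self.width = width
        self.configs: List[List[int]] = []

    def store(self, source_for_output: Sequence[int]) -> int:
        """Store one routing pattern and return its configuration index.

        source_for_output[i] names the input port that drives output i.
        """
        if len(self.configs) >= self.MAX_CONFIGS:
            raise RuntimeError("all configuration slots are in use")
        if len(source_for_output) != self.width:
            raise ValueError("pattern must name one source per output")
        if any(not 0 <= src < self.width for src in source_for_output):
            raise ValueError("source index out of range")
        self.configs.append(list(source_for_output))
        return len(self.configs) - 1

    def route(self, inputs: Sequence, config_id: int) -> List:
        """Apply the stored pattern selected by the instruction parameter."""
        if len(inputs) != self.width:
            raise ValueError("input vector has the wrong width")
        pattern = self.configs[config_id]
        return [inputs[src] for src in pattern]

xbar = SramCrossbar(width=4)
reverse_id = xbar.store([3, 2, 1, 0])
broadcast_id = xbar.store([0, 0, 0, 0])
print(xbar.route(["a", "b", "c", "d"], reverse_id))
print(xbar.route(["a", "b", "c", "d"], broadcast_id))
```

Selecting a pattern is a single index lookup, which mirrors how a kernel with a limited set of shuffle patterns can preload them and switch between them per instruction.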

AnySP vs SIMD-based Architecture
SIMD width is doubled, but that provides only half of the performance gain; the other half is due to the flexibility features

AnySP Energy-Delay vs SIMD-based Architecture
Comparison based on 90nm synthesis results
Flexibility increases utilization of the datapath and hence its efficiency

AnySP Power Breakdown
[Table: Units, Area (mm² and %), and estimated 4G + H.264 decoder power (mW and %) per component; the numeric values did not survive.
PE components: SIMD Data Mem (32KB), SIMD Register File (16x1024bit), SIMD ALUs, Multipliers, and SSN, SIMD Pipeline+Clock+Routing, SIMD Buffer (128B), SIMD Adder Tree, Intra-processor Interconnect, Scalar/AGU Pipeline & Misc.
System components: ARM (Cortex-M3), Global Scratchpad Memory (128KB), Inter-processor Bus with DMA.
Final row: Total]
We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm

Conclusions
Scaling traditional SIMD for mobile applications
► Wide-SIMD hardware under-utilized
► Large fraction of power on non-computation
AnySP design
► Can possibly meet the requirements of 100Mbps 4G and HD video on the same
Flexibility/Efficiency improvements
► Increase SIMD utilization (FFUs, multiple short vectors)
► Reduce register file power (bypass buffer)
► More efficient data shuffling (SRAM-based crossbar)

Questions
For more information ►