Customizing Wide-SIMD Architectures for H.264. Sangwon Seo 1, Mark Woh 1, Scott Mahlke 1, Trevor Mudge 1, Vijay Sundaram 2, Chaitali Chakrabarti 2

Presentation transcript:

Customizing Wide-SIMD Architectures for H.264
Sangwon Seo 1, Mark Woh 1, Scott Mahlke 1, Trevor Mudge 1, Vijay Sundaram 2, Chaitali Chakrabarti 2
1 University of Michigan, 2 Arizona State University

Outline
- Motivation
- H.264 Analysis
- Proposed Architecture
- H.264 Kernel Mappings
- Results
- Conclusion

Motivation – Smart Phone
Reference Images :

Motivation – Inside Smart Phone
Reference Images :

H.264 Design
H.264 encoder/decoder reference design
Reference Images : I. Richardson, “H.264 and MPEG-4 video compression,” Wiley, 2003

H.264 – Analysis
- H.264 Kernel Algorithms
- Heavy SIMD workload
- Different natural SIMD widths
- High & Medium Thread Level Parallelism
- Need to support multiple SIMD widths to maximize the SIMD utilization
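The utilization argument above can be made concrete with a small calculation. This is only an illustrative sketch: the 64-lane datapath, the kernel widths, the 16-lane group size, and the function names are assumptions, not figures taken from the slides.

```python
def lane_utilization(natural_width, simd_width):
    """Fraction of lanes doing useful work when a single kernel instance
    with the given natural width occupies the whole datapath."""
    return min(natural_width, simd_width) / simd_width


def partitioned_utilization(natural_width, simd_width, group_width):
    """Utilization when the datapath is split into equal groups of
    group_width lanes, each group running its own thread of the kernel."""
    groups = simd_width // group_width
    used_lanes = groups * min(natural_width, group_width)
    return used_lanes / simd_width


SIMD_WIDTH = 64   # assumed datapath width
GROUP_WIDTH = 16  # assumed partition size

for width in (4, 8, 16):  # assumed natural widths of different kernels
    whole = lane_utilization(width, SIMD_WIDTH)
    split = partitioned_utilization(width, SIMD_WIDTH, GROUP_WIDTH)
    print(f"natural width {width:2d}: unpartitioned {whole:.2f}, partitioned {split:.2f}")
```

Under these assumptions, a kernel whose natural width matches the group size fills every lane once the datapath is partitioned, while narrower kernels would need a finer partition to do the same.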

H.264 – Analysis
- Example – Deblocking Filter
- Multimedia algorithms operate on two-dimensional data.
- Row-order or column-order memory access works well for one set of edges, but not for the other.
- A diagonal memory bank system makes blocks accessible along either a row or a column.
[Figure panels: Horizontal Filtering, Vertical Filtering]
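One common way to realize a diagonal bank layout is a skewed mapping in which the bank index depends on both coordinates. The sketch below assumes the mapping bank = (row + col) mod N with four banks; the slides do not state the actual mapping or bank count, so both are assumptions, as are the function names.

```python
N_BANKS = 4  # assumed number of memory banks


def row_major_bank(row, col, n_banks=N_BANKS):
    """Baseline layout: the bank depends only on the column index."""
    return col % n_banks


def diagonal_bank(row, col, n_banks=N_BANKS):
    """Skewed (diagonal) layout: each row is rotated by one bank."""
    return (row + col) % n_banks


def conflict_free(banks):
    """True if every access in the group hits a different bank."""
    return len(set(banks)) == len(banks)


def row_access(bank_fn, row):
    return [bank_fn(row, c) for c in range(N_BANKS)]


def column_access(bank_fn, col):
    return [bank_fn(r, col) for r in range(N_BANKS)]


# Row-major: rows are conflict free, columns all land in one bank.
print(conflict_free(row_access(row_major_bank, 0)),
      conflict_free(column_access(row_major_bank, 0)))
# Diagonal: both rows and columns spread across all banks.
print(conflict_free(row_access(diagonal_bank, 0)),
      conflict_free(column_access(diagonal_bank, 0)))
```

With the skewed mapping, data read from a row or column arrives rotated by a position-dependent amount, so some realignment step is needed afterwards.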

H.264 – Analysis
- Subgraphs for inner loops of two kernel algorithms
- Large amount of data locality
- Large RF power consumption (Read/Write)
- Bypass and temporary buffer support

H.264 – Analysis
- Instruction Pairs
- Heavy usage of shuffle and arithmetic operations
- Add-Shift : round operation
- Sub-Abs : SAD operation
- Need to fuse the frequently used instruction pairs
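The two instruction pairs named above can be illustrated in scalar form. This is a minimal sketch, not the fused hardware operations themselves: the function names are hypothetical, and the rounding constant of half the divisor is an assumption based on the usual round-to-nearest idiom in video codecs.

```python
def sub_abs(a, b):
    """Subtract followed by absolute value: one term of a SAD."""
    return abs(a - b)


def add_shift(x, addend, shift):
    """Add followed by arithmetic right shift."""
    return (x + addend) >> shift


def sad(block_a, block_b):
    """Sum of absolute differences over two equally sized pixel lists."""
    return sum(sub_abs(a, b) for a, b in zip(block_a, block_b))


def round_shift(x, shift):
    """Round-to-nearest division by 2**shift, built from add_shift.
    The rounding constant is half the divisor (assumed idiom)."""
    return add_shift(x, 1 << (shift - 1), shift)


current = [12, 40, 7, 200]
reference = [10, 44, 7, 190]
print(sad(current, reference))  # 2 + 4 + 0 + 10
print(round_shift(47, 5), round_shift(48, 5))
```

Each helper pairs two dependent operations whose intermediate result is used only once, which is the kind of pattern that makes fusing them attractive.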

H.264 – Analysis
- Permutation Patterns for Intraprediction
- Fixed set of shuffle patterns
- Need for programmable shuffle network
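As an illustration of what a fixed shuffle pattern looks like, the sketch below expresses the vertical and horizontal 4x4 intra prediction modes as index patterns applied to a vector of neighboring pixels. The row-major block layout, the pattern tables, and the `shuffle` helper are assumptions made for this example; the slides do not list the actual patterns used.

```python
def shuffle(vec, pattern):
    """Generic permutation: output lane i takes input lane pattern[i]."""
    return [vec[i] for i in pattern]


# Patterns for a 4x4 block stored row-major in 16 lanes (assumed layout).
VERTICAL_4X4 = [0, 1, 2, 3] * 4                            # every row copies the top neighbors
HORIZONTAL_4X4 = [y for y in range(4) for _ in range(4)]   # every row repeats its left neighbor

top_neighbors = [10, 20, 30, 40]
left_neighbors = [1, 2, 3, 4]

pred_vertical = shuffle(top_neighbors, VERTICAL_4X4)
pred_horizontal = shuffle(left_neighbors, HORIZONTAL_4X4)

for row in range(4):
    print(pred_vertical[4 * row:4 * row + 4], pred_horizontal[4 * row:4 * row + 4])
```

Because each prediction mode reduces to a table lookup like this, a shuffle network that can be loaded with a small set of patterns covers all of them.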

Modified SIMD architecture

Modified SIMD architecture
- Multiple SIMD widths
- Thread-Level Parallelism

Modified SIMD architecture
- Diagonal Memory Organization
- Memory Bank System + Shuffle Network

Modified SIMD architecture
- Short-lived values stored in temporary buffers

Modified SIMD architecture
- Short-lived values
- Fused Operation

Modified SIMD architecture
- Shuffle networks are placed at several points in the datapath to align data

Mapping of H.264 Kernels
- Intra Prediction

Results
- System Breakdown
- H.264 CIF video at 30 fps
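For a sense of scale, the CIF workload can be expressed as a macroblock rate. The calculation below uses the standard CIF luma resolution (352x288) and the 16x16 H.264 macroblock size; the variable names are illustrative and the figures are derived here rather than quoted from the slides.

```python
CIF_WIDTH, CIF_HEIGHT = 352, 288  # CIF luma resolution in pixels
MB_SIZE = 16                      # H.264 macroblock is 16x16 luma samples
FPS = 30                          # frame rate stated on the slide

mbs_per_row = CIF_WIDTH // MB_SIZE
mbs_per_column = CIF_HEIGHT // MB_SIZE
mbs_per_frame = mbs_per_row * mbs_per_column
mbs_per_second = mbs_per_frame * FPS

print(f"{mbs_per_row} x {mbs_per_column} = {mbs_per_frame} macroblocks per frame")
print(f"{mbs_per_second} macroblocks per second at {FPS} fps")
```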

Results
- Speedup Breakdown
- 2.13x performance increase on average

Results
- Energy-Delay product comparison
- 29% energy-delay improvement on average

Results
- Comparison with latest H.264 encoders
[17] T. C. Chen et al., “2.8 to 62.7 mW low-power and power-aware H.264 encoder for mobile applications,” 2007 IEEE Symposium on VLSI Circuits, pp. 222–223, June
[18] M. Bhatnagar, “TMS320DM6446/3 Power Consumption Summary,” Texas Instruments Application Reports, Feb

Conclusion
- Key architectural enhancements
  - SIMD partitioning
  - Diagonal memory bank system
  - Bypass and temporary buffer support
  - Fused operation support
  - Programmable crossbar
- Future work
  - Image processing algorithms on SIMD architecture

Backup Slides

H.264 – Analysis
- Diagonal Memory Organization
- Multimedia algorithms operate on two-dimensional data.
- Blocks along a row or a column need to be accessed easily.

Mapping of H.264 Kernels
- Deblocking Filter

Mapping of H.264 Kernels
- Motion Compensation

Mapping of H.264 Kernels
- Motion Estimation