Multi-core SOC for Future Media Processing

Presentation transcript:

Multi-core SOC for Future Media Processing Qin Xing, Yan Xiaolang The Institute of VLSI Design, Zhejiang University

Outline
- Opportunities & challenges from media processing
- Multimedia algorithm characteristics & mapping
- Multi-core SOC architecture & technology
- Benchmarking results
- Project status
- Future work
The Institute of VLSI Design, Zhejiang Univ. 2018/9/17

Opportunities
- Video conference
- IP phone
- Smart terminal
- PDA
- Video camera
- HDTV
- Set-top box
- …

Challenges: multiple standards
[Figure: bit rate in Mbit/s (axis from 1 to 6) versus year (1994 to 2005). Labels: MPEG-2; 1st through 5th generation encoders; MPEG-4; H.263; H.26L; H.264 / MPEG-4 Part 10; WMV; VP3; AVS.]

Challenges: demanding hardware
- Very high computational complexity: H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS
- Multiple standards coexist
- Demands flexibility and programmability
- Low power
- Low cost
Best choice: application-specific instruction-set processor (ASIP)
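To put the 30 GOPS figure in perspective, the sketch below divides a compute budget by the pixel rate of a video stream. The function name `ops_per_pixel` and the assumption that the budget is spread evenly over every pixel are illustrative only and do not come from the slides.

```c
/* Operations available per pixel for a given compute budget.
   Assumes (for illustration) that the budget is spread evenly
   over every pixel of every frame. */
static double ops_per_pixel(double ops_per_second, int width, int height, int fps)
{
    double pixels_per_second = (double)width * height * fps;
    return ops_per_second / pixels_per_second;
}
```

For the figures on the slide, `ops_per_pixel(30e9, 720, 576, 30)` comes to roughly 2,400 operations per pixel.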

Multimedia algorithm characteristics
Outer loop and inner loop
Outer loop:
- Interface (GUI)
- OS (Linux)
- Bit-stream parsing (pack/unpack, VLC, CABAC)
- Data transfer
Inner loop:
- Regular algorithms (prediction, FIR, DCT, motion estimation)

Multimedia algorithm mapping
Programmable, heterogeneous processors are the preferred choice for implementation:
- General MCU (RISC core): outer loop
- Enhanced DSP (EDSP, adds bit-wise operations): outer loop
- Vector processor (VP, VLIW + SIMD): inner loop
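As an illustration of why inner-loop kernels suit a VLIW + SIMD vector processor, here is a minimal sketch of a sum of absolute differences (SAD) over a 4x4 block, a common block-matching cost in motion estimation. The choice of SAD, the function name `sad_4x4`, and the row-major layout with a `stride` parameter are assumptions made for this example, not details from the slides.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two 4x4 pixel blocks.
   `stride` is the distance in bytes between consecutive rows
   (an assumed row-major layout). */
static unsigned sad_4x4(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            /* Every iteration does the same work on independent data. */
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
        }
    }
    return sad;
}
```

The loop has fixed bounds, no data-dependent branches, and independent per-pixel work, which is the kind of regularity that SIMD lanes can exploit. Bit-stream parsing, by contrast, is serial and branch-heavy.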

Multi-core SOC architecture
[Figure: top level; media processing kernel]

Inside the media processing kernel
[Figure: block diagram. Labels: GAG1 to GAG4; GDM; GTM; V-DM1 to V-DM4; EDSP control path; vector control path; DMA and off-chip memories; 2D crossbar connection network; E-DP; V-DP1 to V-DP4.]

Technologies: specialized instruction set
Adapt programmable processors to specific algorithms by introducing specialized instructions for frequently occurring operations of higher complexity.
Example: integer IDCT in H.264

C code:

    for (j = 0; j < BLOCK_SIZE; j++) {
        for (i = 0; i < BLOCK_SIZE; i++) {
            m5[i] = img->cof[i0][j0][i][j];
        }
        m6[0] = (m5[0] + m5[2]);
        m6[1] = (m5[0] - m5[2]);
        m6[2] = (m5[1] >> 1) - m5[3];
        m6[3] = m5[1] + (m5[3] >> 1);
    }

Intel MMX (13 cycles):

    __asm {
        mov      edx, mptr
        movdqu   xmm1, [edx]
        packssdw xmm1, xmm1      // read m5[0] from memory to xmm1
    }
    __asm {
        movdqu   xmm4, [edx + 48]
        packssdw xmm4, xmm4      // read m5[3] from memory
    }
    __asm {
        movq     xmm5, xmm1
        psubw    xmm1, xmm3      // m6[1] = (m5[0] - m5[2]);
        paddw    xmm3, xmm5      // m6[0] = (m5[0] + m5[2]);
        movq     xmm5, xmm2
        psraw    xmm2, 1
        psubw    xmm2, xmm4      // m6[2] = (m5[1] >> 1) - m5[3]
        psraw    xmm4, 1
        paddw    xmm4, xmm5      // m6[3] = m5[1] + (m5[3] >> 1)
    }

Our instruction set: 6 cycles
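The four m6[] expressions on this slide are the first butterfly stage of the 4-point inverse integer transform in H.264. As a self-contained sketch, the function below runs that stage and then the second stage (sums and differences of the intermediate values) as defined in the H.264 standard. The second stage and the name `idct4_1d` are not shown on the slide. The code also assumes that `>>` on a negative `int` is an arithmetic shift, which C leaves implementation-defined.

```c
/* One-dimensional 4-point inverse integer transform (H.264 style).
   Stage 1 matches the m6[] expressions on the slide; stage 2 follows
   the H.264 standard and is not shown on the slide. */
static void idct4_1d(const int in[4], int out[4])
{
    /* Stage 1: even and odd butterflies. */
    int e0 = in[0] + in[2];
    int e1 = in[0] - in[2];
    int e2 = (in[1] >> 1) - in[3];
    int e3 = in[1] + (in[3] >> 1);

    /* Stage 2: combine even and odd parts. */
    out[0] = e0 + e3;
    out[1] = e1 + e2;
    out[2] = e1 - e2;
    out[3] = e0 - e3;
}
```

A full 4x4 inverse transform would apply this routine to each row and then to each column, followed by rounding. Only additions, subtractions, and shifts are needed, which is what makes the operation a natural candidate for a dedicated instruction.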

Technologies: instruction merging
Example: 6-tap sub-pixel interpolation

    result = 0;
    pres_y = dy == 1 ? y_pos : y_pos + 1;
    pres_y = max(0, min(maxold_y, pres_y));            // load
    for (x = -2; x < 4; x++)                           // control
    {
        pres_x = max(0, min(maxold_x, x_pos + x));     // load
        result += imY[pres_y][pres_x] * COEF[x + 2];   // computation, permutation and load
    }
    result1 = max(0, min(255, (result + 16) / 32));    // computation

Instruction mix before merging: load/store 30%, permutation 25%, computation 35%, control 10%.
After merging: load/store and permutation are merged; computation and control remain.
Result: execution time is reduced by half.
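The interpolation fragment on this slide depends on surrounding variables that are not shown. Below is a self-contained sketch of the same horizontal 6-tap filter, assuming the standard H.264 half-pel coefficients (1, -5, 20, 20, -5, 1) for `COEF` and a row-major 8-bit image. The names `half_pel_h` and `clamp_int` and the `width`/`height` parameters are hypothetical stand-ins for the slide's `maxold_x`/`maxold_y` bounds.

```c
#include <stdint.h>

/* Clamp v to the inclusive range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Assumed coefficients: the standard H.264 6-tap half-pel filter. */
static const int COEF[6] = {1, -5, 20, 20, -5, 1};

/* Horizontal half-pel sample between x_pos and x_pos + 1 in row y_pos.
   Reads outside the picture are clamped to the nearest edge pixel. */
static uint8_t half_pel_h(const uint8_t *img, int width, int height,
                          int x_pos, int y_pos)
{
    int y = clamp_int(y_pos, 0, height - 1);
    int sum = 0;
    for (int x = -2; x < 4; x++) {
        int px = clamp_int(x_pos + x, 0, width - 1);   /* address + load */
        sum += img[y * width + px] * COEF[x + 2];      /* multiply-accumulate */
    }
    return (uint8_t)clamp_int((sum + 16) / 32, 0, 255); /* round and saturate */
}
```

Each tap needs an address clamp, a load, and a multiply-accumulate, so most of the loop body is data movement. That matches the slide's observation that load/store and permutation together account for more than half of the instruction mix.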

Benchmarking results for CPU core CK520

Simulation results for DSP performance: enhanced DSP

CAVLC (context-adaptive variable-length coding)
Sequence (CIF)   MIPS/frame (max)   MIPS/frame (average)
Foreman          0.147,832          0.029,898
Mobile           0.541,943          0.134,240

OGG (new audio standard)
Function         MIPS/frame
MDCT             6
De_VQ            2.5
Floor/Coupling   3.5

Simulation results for DSP performance: vector processor
H.264 baseline decoder, 298-frame sequences, MIPS @ 30 frames/s

Format   Sequence   Max     Average
QCIF     Foreman    28.1    12.7
QCIF     Akiyo      19.8    5.3
CIF                 116.3   52.3
CIF                 92.9    22.8

Project status
- Finished 2 versions of the CPU core
- Released the DSP instruction set
- Writing and verifying the RTL of the enhanced DSP
- Benchmarking the vector processor
- Developing software tools

Future work
- Scheduling for task-level parallelism (TLP) between heterogeneous processors
- Simulation and debugging tools for heterogeneous processors
- Methodologies for design space exploration

Thank you!