Progress on media processor design Xiaolang Yan Xing Qin Jian Yang.

Presentation transcript:

Progress on media processor design
Xiaolang Yan, Xing Qin, Jian Yang, Xiaohua Luo, Peiyong Zhang, Dake Liu
Embedded DSP Research & Develop Group
Presented by Chunyue Liu

Outline
Overview of media processor
Progress on Spock
Progress on Schubert
  - Overview
  - Key features
  - Performance
Conclusions & Problems

Background and Challenges
Media applications have very high computational complexity
  - H.264 encoding of 720 x frames /s up to 30 GOPS
Media processors are in demand
  - Some state-of-the-art media processors (e.g. Nomadik, DaVinci)
Multiple standards coexist
  - Flexible & programmable
Constrained by our current IC design level
ASIP is the best choice
Our proposal at IC-DFN'05

Overview of media processor
Programmable and heterogeneous processors on a SoC platform
  - General MCU (CK510, a 32-bit RISC core): interface (GUI), OS (Linux)
  - Enhanced DSP (Spock): audio processing, bitstream parsing, data transfer
  - Vector processor (Schubert): video processing

Progress on Spock
Developed tool chain
  - Assembler, simulator and debugger
FPGA prototype: real-time decoding
  - 128 kb/s, 40 MHz
Dual-core SoC platform developed to test Spock
  - Integrated with CK510
  - Inter-processor communication uses mailbox and shared memory
  - 0.18 um, less than 500 mW, 166 MHz
  - CK510 core area: 2 x 2 mm^2
  - Spock core area: 1.5 x 1.5 mm^2

Overview of Spock
Optimization for control
  - Branch optimization: conditional execution
  - 2-level hardware loop, repeat
Optimization for signal processing
  - Multiple addressing modes: post-increment/decrement (++/--), reverse/modulo addressing
  - MAC with parallel load
  - VLX instruction set extension: putbits, showbits, getbits, etc.
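The bitstream instructions named above (putbits, showbits, getbits) can be illustrated with a small software model. This is only a sketch under assumptions: the slides do not give the operand formats or bit order of the Spock instructions, so the MSB-first order, the bitreader/bitwriter structures and the function signatures below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a bitstream reader (not the Spock ISA). */
typedef struct {
    const uint8_t *buf; /* bitstream bytes, assumed MSB-first */
    size_t len;         /* buffer length in bytes */
    size_t pos;         /* current read position in bits */
} bitreader;

/* Hypothetical model of a bitstream writer. */
typedef struct {
    uint8_t *buf;       /* output bytes, MSB-first */
    size_t len;         /* buffer length in bytes */
    size_t pos;         /* current write position in bits */
} bitwriter;

/* showbits: return the next n bits (1..32) without consuming them.
   Bits beyond the end of the buffer read as zero. */
static uint32_t showbits(const bitreader *br, unsigned n)
{
    assert(n >= 1 && n <= 32);
    uint32_t v = 0;
    for (unsigned i = 0; i < n; i++) {
        size_t bit = br->pos + i;
        size_t byte = bit >> 3;
        uint32_t b = 0;
        if (byte < br->len)
            b = (br->buf[byte] >> (7 - (bit & 7))) & 1u;
        v = (v << 1) | b;
    }
    return v;
}

/* getbits: return the next n bits and advance the read position. */
static uint32_t getbits(bitreader *br, unsigned n)
{
    uint32_t v = showbits(br, n);
    br->pos += n;
    return v;
}

/* putbits: append the low n bits of value, most significant bit first. */
static void putbits(bitwriter *bw, uint32_t value, unsigned n)
{
    assert(n >= 1 && n <= 32);
    for (unsigned i = 0; i < n; i++) {
        size_t bit = bw->pos + i;
        size_t byte = bit >> 3;
        assert(byte < bw->len);
        uint8_t mask = (uint8_t)(0x80u >> (bit & 7));
        if ((value >> (n - 1 - i)) & 1u)
            bw->buf[byte] |= mask;
        else
            bw->buf[byte] &= (uint8_t)~mask;
    }
    bw->pos += n;
}
```

In software each of these calls costs a loop over bits (or several shifts and masks in an optimized version); a dedicated instruction would presumably collapse that into a single operation, which is the usual motivation for such extensions in bitstream parsing.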

Progress on Schubert
Design methodology (flow):
  Application coverage to function coverage
  -> SW-HW partition: 10%-90% locality
  -> Assembly instruction set specification
  -> Design of assembler and simulator
  -> Build golden model
  -> Benchmark instruction set
  -> Behavior function verification
  -> Micro-architecture design
  -> RTL coding
  -> Backend design
  -> Design for test
  -> RTL code verification
  -> Test chip fabrication & test board prototype
  Decision point: Good performance?
Released 316 novel instructions
  - SIMD and RISC
Developed tool chain
  - Assembler
  - Cycle-accurate simulator
Mapped kernels
  - H.264/AVC: IT/IIT, intra/inter-prediction, de-blocking, motion estimation
  - MPEG2: DCT, motion compensation
Micro-architecture is designed
  - Estimated area: 3.5 x 3.5 mm with a 70 KB SRAM

Key features of Schubert
Dual clusters and dual coupling pipelines
  - SIMD combined with VLIW architecture
Explicit Data Organization SIMD (EDO-SIMD)
2-dimensional and byte-aligned addressing storage
Cycle-accurate instruction set simulator

Dual clusters and dual coupling pipelines
Two clusters:
  - Cluster0: computation (+/-, *, &, >/<, etc.)
  - Cluster1: data conversion & LD/ST
  - Based on Decoupled Access & Execution (DAE)
Two pipelines:
  - Each cluster has its own execution-stage pipeline
  - Both share the IF & ID pipeline stages
Advantages
  - Computation operations run in parallel with non-computation operations
  - Performs well on cycle count

Explicit Data Organization SIMD ISA
Bottleneck of conventional SIMD ISA
  - SIMD is inefficient if sub-word data are not aligned with each other
  - SIMD is less flexible than VLIW

Cycle percent of conventional SIMD ISA
  SIMD class     VIS       MMX/SSE   AltiVec
  Ld/St          11.70%    21.00%    17.90%
  Organize        9.70%    12.60%    17%
  Integer ALU    13.60%    18.80%    11.80%
  Float ALU      --         9.30%     6.90%

Annotations on the table:
  - This overhead is reduced by Dual-Cluster
  - How to reduce this overhead?
Related works
  - Complex streamed instruction, Delft TU
  - Stream buffer, Stream processor, Stanford University
  - Indirect register addressing, Elite project, IBM

Explicit Data Organization SIMD ISA
Proposed EDO-SIMD ISA
  - Explicit data organization information (e.g. 3x8|3:4:7:0:1:2:6:5)
  - Indicates operand relations (align, merge, extract, broadcast, cross)
  - Append a permutation network onto the RF pipeline of Cluster0
  - Add a permutation pipeline in Cluster1 in parallel with AD0
Advantages
  - Merges organization with computation to reduce overhead
  - As flexible as VLIW
  - Simplified implementation
Figure labels: interpolate, DCT, intra predict, IIT
Example instruction: vOADD vR2, vR1, vR0
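The idea of merging data organization with computation can be sketched in software. The slides do not define the encoding of the organization field or the exact semantics of vOADD, so everything below is an assumption made for illustration: eight 16-bit lanes, a plain index list used as the permutation, and a hypothetical voadd_model function that reorders one operand and adds it in a single step.

```c
#include <stdint.h>

#define LANES 8 /* assumed lane count; not stated in the slides */

/* Hypothetical vector register with eight 16-bit sub-words. */
typedef struct {
    int16_t lane[LANES];
} vreg;

/* Organize step: result lane i takes source lane perm[i].
   Indices are reduced modulo LANES to stay in range. */
static vreg organize(vreg src, const uint8_t perm[LANES])
{
    vreg r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = src.lane[perm[i] % LANES];
    return r;
}

/* Conventional SIMD needs two steps: organize, then add.
   Saturation is not modeled; values are assumed small enough to fit. */
static vreg vadd(vreg a, vreg b)
{
    vreg r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = (int16_t)(a.lane[i] + b.lane[i]);
    return r;
}

/* Hypothetical fused form: the permutation is carried by the
   instruction itself, so operand b is reordered on its way to the adder. */
static vreg voadd_model(vreg a, vreg b, const uint8_t perm[LANES])
{
    vreg r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = (int16_t)(a.lane[i] + b.lane[perm[i] % LANES]);
    return r;
}
```

Under these assumptions, voadd_model(a, b, perm) gives the same result as vadd(a, organize(b, perm)) but as one operation, which is the kind of saving the "Organize" row of the cycle breakdown points at.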

2-D stream storage and addressing
Multimedia temporal data behavior
  - 2-D, block by block
  - Row and column access
  - Byte alignment
  - Flexible block jumping
Conventional 1-D addressing imposes burdens on computation elements for address generation and address alignment tasks
Related works
  - Linear addressing with circular buffer, Blackfin
  - Special transpose unit, Trimedia

2-D stream storage and addressing
Proposed storage and addressing mode
  - 2-D stream storage (base, 2-D stride, 2-D offset)
  - Row and interleaved data arrangement (row access & column access)
  - Base update for block jump (UPDATE B0, OX0, OY0, B0)
  - C-like programming model is friendly to the programmer
    asm: vLDOBR B0, 4, 2, vR0;
    C:   for (i = 0; i < 8; i++) r[i] = b[2][4+i];
Advantages
  - Reduces addressing and aligning overhead (avoids transpose)
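The addressing mode above can be modeled in a few lines of C. This is a sketch under assumptions, not the Schubert definition: the stream2d structure, the function names, and the reading of "vLDOBR B0, 4, 2, vR0" as "x offset 4, y offset 2" are inferred only from the C equivalent given on the slide, and the block-jump update is assumed to add to the stream's 2-D offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for a 2-D stream: base, row stride, 2-D offset. */
typedef struct {
    const uint8_t *base; /* start of the 2-D buffer */
    size_t stride;       /* bytes per row */
    size_t ox, oy;       /* current block offset (columns, rows) */
} stream2d;

/* Byte address of element (x, y) relative to the current block offset. */
static const uint8_t *addr2d(const stream2d *s, size_t x, size_t y)
{
    return s->base + (s->oy + y) * s->stride + (s->ox + x);
}

/* Row access: n consecutive bytes starting at (x, y).
   Models r[i] = b[y][x + i]; no alignment requirement on x. */
static void load_row(const stream2d *s, size_t x, size_t y,
                     size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = *addr2d(s, x + i, y);
}

/* Column access: n bytes going down from (x, y).
   Models r[i] = b[y + i][x], so no separate transpose is needed. */
static void load_col(const stream2d *s, size_t x, size_t y,
                     size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = *addr2d(s, x, y + i);
}

/* Block jump: move the block offset by (dx, dy), an assumed reading
   of the UPDATE operation on the slide. */
static void update_block(stream2d *s, size_t dx, size_t dy)
{
    s->ox += dx;
    s->oy += dy;
}
```

With a 16-byte stride, load_row(&s, 4, 2, 8, r) reproduces the slide's loop r[i] = b[2][4+i]; the address arithmetic that a conventional 1-D scheme would spend ALU cycles on is confined to addr2d here.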

Cycle-accurate instruction set simulator
Useful for benchmarking and ISA design space exploration during the early stage
  - Input is an assembly text program, not binary code
  - Focus on function, not micro-architecture
Consists of
  - Resource modeling
  - ISA function modeling at each pipeline
  - Behavior and timing modeling
  - Debug and profiling support
Effort: 3 people for 2 months, about 60,000 lines of C++ code

Benchmarking and performance
Mapped benchmarks:
  - Full H.264 baseline decoder kernels such as integer transform, intra prediction, interpolation and de-blocking
  - H.264 fast motion estimation
  - MPEG2 motion compensation and DCT/IDCT
The cycle-accurate and functionally correct programs help to:
  - Make the assembler and simulator more robust
  - Demonstrate the performance of the ISA
  - Explore and refine the ISA (more than 900 instructions were refined to 316 in the end)
Performance
  - 4-CIF (704x576) H.264 baseline real-time, 200 MHz
  - 16 kB code size for H.264 baseline decoder
Table: cycles for 8x8 IDCT with IEEE-compliant precision, comparing RISC-Media [10], MMX, TMS320C6x, NEC V830, VIRAM and the proposed design (cycle counts not preserved in this transcript)
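To put the 4-CIF figure in perspective, the cycle budget per macroblock can be worked out as below. The slide says only "real-time" and does not state a frame rate, so the 25 and 30 frames/s used in the note after the code are assumptions, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 16x16 macroblocks in a frame, rounding partial blocks up. */
static uint32_t macroblocks_per_frame(uint32_t width, uint32_t height)
{
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Average cycles available per macroblock at a given clock and frame rate
   (integer division, so the result is rounded down). */
static uint64_t cycles_per_macroblock(uint64_t clock_hz, uint32_t width,
                                      uint32_t height, uint32_t fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    uint64_t mb_per_second =
        (uint64_t)macroblocks_per_frame(width, height) * fps;
    return clock_hz / mb_per_second;
}
```

A 704x576 frame holds 44 x 36 = 1584 macroblocks. At 200 MHz, cycles_per_macroblock(200000000, 704, 576, 25) gives 5050 cycles per macroblock under the assumed 25 frames/s, and 4208 under an assumed 30 frames/s; the whole decode path, including parsing handled elsewhere on the SoC, has to fit a budget of that order.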

Conclusions
Integrating a general MCU with heterogeneous ASIPs on a SoC platform is a good choice for media processing in China
  - A good trade-off between performance and flexibility
  - Works within the limits of our IC design level
Progress on our media processor
  - CK510 and Spock are finished
  - A dual-core SoC of CK510 and Spock has been taped out
  - Novel features of Schubert are verified and the RTL implementation is ongoing

Problems
Design flow:
  Application coverage to function coverage
  -> SW-HW partition: 10%-90% locality
  -> Assembly instruction set specification
  -> Design of assembler and simulator
  -> Build golden model
  -> Benchmark instruction set
  -> Behavior function verification
  -> Micro-architecture design
  -> RTL coding
  -> Backend design
  -> Design for test
  -> RTL code verification
  -> Test chip fabrication & test board prototype
  Decision point: Good performance?
Behavior synthesis tool
  - The behavior synthesis stage of our ASIP design depends on human experience rather than tools, which takes too much effort.
  - It would be very valuable to research and develop CAD tools for design space exploration of the ASIP ISA and ASIP SoC communication during the early stage.

Thank you!!!