HW-Accelerated HD video playback under Linux. Zou Nan hai, Open Source Technology Center.

Presentation transcript:

HW-Accelerated HD video playback under Linux Zou Nan hai Open Source Technology Center

2 Media Engine
[Block diagram. Labels: Command Streamer, 3D, Media (Video Front End), Thread Spawner, Thread Dispatcher, EU, Kernel, URB, Sampler, Data port, Video memory, Indirect data, Thread payload.]

3 Mode of operation
[Pipeline diagram. Stages: VLD, IS, IQ, IDCT, MC, taking coded data to output pixels. Labels: VFE or host, EU kernels.]

4 Current XvMC implementation
[Pipeline diagram. Stages: VLD, IS, IQ, IDCT, MC, taking coded data to output pixels. Labels: host software, per-slice data, per-macroblock data, EU kernels.]

5 XvMC
[Stack diagram. Components: Media Application, XvMC lib, DRI interface, X Server, Graphics Hardware. Labels: render, sync, resource management; MPEG stream; decode; slice of macroblocks; media commands, video memory management.]

6 Video Memory Layout
[Diagram. Labels: command stream, VFE state, interface descriptors, media surface, EU kernel instructions, media object command, selected interface, media pointer command, surface state, binding tables, flush command.]

7 Execution Unit (EU) introduction
• SIMD code (variable execute size up to 16) with predication and a control mask.
• Float and integer data types.
• Region-based direct and indirect register addressing.
• Supports scalar and immediate source operands.

8 EU Registers
• GRF (General Register File)
  – 256 bits per register (g0, g1, g2, gxx)
• MRF (Message Register File)
  – 256 bits per register (m0, m1, m2, mx), write only
  – Used to pass the payload from a thread to a shared function unit.
• ARF (Architecture Register File)
  – e.g. the null, ip, and flag registers
• Immediate
  – encoded in the instruction

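9 Register Region
[Diagram: a region laid out over 256-bit registers starting at g0. Region parameters: Width=8, VertStride=16, HorzStride=2, Type=w. Operands are written as Regnum.Subregnum followed by Type, e.g. g5.2 w (origin regnum=5, subregnum=2) and g15.3 UB.]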
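The region parameters on this slide can be read as a small addressing formula. The sketch below is an illustration under stated assumptions, not the hardware definition: it assumes that channels are laid out in rows of Width elements, that HorzStride steps between elements in a row and VertStride between rows (both counted in elements of the operand type), and that Subregnum is also counted in elements. The function name region_offsets and the type-size table are hypothetical.

```python
# Sketch: byte offsets into the register file touched by a region.
# Assumptions (not confirmed by the slides): strides and subregnum are
# counted in elements of the operand type; registers are 32 bytes.
GRF_BYTES = 32  # 256 bits per register, as stated on the EU Registers slide
TYPE_SIZE = {"ub": 1, "b": 1, "uw": 2, "w": 2, "ud": 4, "d": 4, "f": 4}


def region_offsets(regnum, subregnum, type_, exec_size,
                   width, vert_stride, horz_stride):
    """Return one byte offset per SIMD channel."""
    size = TYPE_SIZE[type_]
    origin = regnum * GRF_BYTES + subregnum * size
    offsets = []
    for channel in range(exec_size):
        row, col = divmod(channel, width)
        element = row * vert_stride + col * horz_stride
        offsets.append(origin + element * size)
    return offsets


# Values from the slide: g5.2, Type=w, Width=8, VertStride=16, HorzStride=2.
offsets = region_offsets(5, 2, "w", 16, 8, 16, 2)
print(offsets)
```

With HorzStride=2 and a word type, each row picks every other word, and the second row of 8 channels starts 16 words after the origin.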

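10 Data operation
[Diagram: two layouts of the X, Y, Z, W vector components across registers 0 to 3.]
• Array of structures (vertex shader): each register holds one whole vector, W Z Y X.
• Structure of arrays (pixel shader and media code): each register holds one component of several vectors, X X X X, Y Y Y Y, Z Z Z Z, W W W W.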
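The difference between the two layouts can be shown with a small conversion. This is a minimal sketch in plain Python lists, not EU code; the function name aos_to_soa and the dictionary keys are illustrative choices.

```python
# Sketch: convert an array of structures (one x, y, z, w tuple per vertex)
# into a structure of arrays (one list per component).
def aos_to_soa(vertices):
    """vertices: list of (x, y, z, w) tuples. Returns a dict of lists."""
    return {
        name: [vertex[index] for vertex in vertices]
        for index, name in enumerate("xyzw")
    }


aos = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]
soa = aos_to_soa(aos)
print(soa["x"])  # all x components, ready for one SIMD operation
```

In the structure-of-arrays form, a single SIMD instruction can process the same component of many elements at once, which is why it suits pixel and media kernels.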

11 Instruction sample
(f0) add.sat(16) g28.0 ub g3.0 f g10.0 w {align1}
[Annotated fields: predication register, execute size, register number, subregister number, type, VertStride, HorizStride, Width, access mode.]

12 Instruction set
• Normal SIMD instructions
  – add, mul, avg, mov, etc.
  – dp3, dp4, etc.
• Branch control instructions
  – if, else, do, while, jmpi, etc.
  – Branching is needed in media code.
• Send instructions
  – Communicate with shared function units.
  – Media kernels use them to control the thread life cycle and to read and write surfaces.

13 Instruction example
add.sat(16) g28.0 UB g3.0 f g10.0 W {align1}
[Diagram: per-channel addition over 16 channels, with operands labeled X, Y (two groups of 8) and Z across registers g28, g3, g4 and g10.]
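The effect of a predicated, saturating SIMD add with an unsigned-byte destination can be sketched as follows. This is a simplified model, not a description of the hardware: the sources are treated as integers (conversion from the float source type is not modeled), the predicate is a plain list of booleans, and the function name add_sat_ub is hypothetical.

```python
# Sketch: per-channel add with saturation to the unsigned byte range.
# Channels whose predicate bit is False keep the old destination value.
def add_sat_ub(src0, src1, mask=None, dst=None):
    if len(src0) != len(src1):
        raise ValueError("sources must have the same number of channels")
    channels = len(src0)
    mask = [True] * channels if mask is None else mask
    out = [0] * channels if dst is None else list(dst)
    for ch in range(channels):
        if mask[ch]:
            total = src0[ch] + src1[ch]
            out[ch] = max(0, min(255, total))  # .sat clamps to 0..255 for ub
    return out


print(add_sat_ub([250, 10, -20, 100], [10, 5, 5, 100]))
```

Saturation matters in media code because a residual added to a prediction can leave the valid pixel range in either direction.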

14 An example
[Register layout diagram for a media kernel. Labels:
• input and output payload
• registers passed from inline data: x, y, mv, field flags, etc.
• indirect data payload: input Y0-Y3, input U, input V
• media read from reference surface: reference Y, reference U, reference V
• tmp registers
• constant data
• result registers, organized in YUV420 format
• media write to destination surface]
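The register layout above corresponds to a motion compensation step: fetch a block from the reference surface at the position given by the motion vector, add the decoded residual, and write the clamped result. The sketch below shows that step for one plane under loud assumptions: full-pel motion vectors only (half-pel interpolation and field modes are omitted), surfaces as lists of rows, and a hypothetical function name motion_compensate.

```python
# Sketch: reconstruct one block as reference prediction + residual.
# Assumptions: full-pel motion vector, square residual block, the
# displaced block lies entirely inside the reference surface.
def motion_compensate(reference, residual, x, y, mv_x, mv_y):
    size = len(residual)
    block = []
    for row in range(size):
        out_row = []
        for col in range(size):
            pred = reference[y + mv_y + row][x + mv_x + col]
            out_row.append(max(0, min(255, pred + residual[row][col])))
        block.append(out_row)
    return block


reference = [[r * 4 + c for c in range(4)] for r in range(4)]
residual = [[1, -100], [300, 0]]
print(motion_compensate(reference, residual, 0, 0, 1, 1))
```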

15 Planar data vs Packed data
• Easy to handle by the media kernel
• Hard to apply some filters
• Cannot be used directly as a sampler source in a hardware implementation
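For reference, a planar YUV 4:2:0 frame stores three separate planes, with each chroma plane at half the luma resolution in both directions. The sketch below computes plane offsets for a tightly packed buffer. It assumes even dimensions, no row padding, and a Y, U, V plane order; none of these are stated in the slides, and the function name yuv420_planar_layout is hypothetical.

```python
# Sketch: byte offsets of the Y, U and V planes in a tightly packed
# planar YUV 4:2:0 buffer (assumed order Y, then U, then V; no padding).
def yuv420_planar_layout(width, height):
    if width % 2 or height % 2:
        raise ValueError("this sketch assumes even dimensions")
    y_size = width * height
    chroma_size = (width // 2) * (height // 2)
    return {
        "y_offset": 0,
        "u_offset": y_size,
        "v_offset": y_size + chroma_size,
        "total_bytes": y_size + 2 * chroma_size,
    }


print(yuv420_planar_layout(1920, 1080))
```

A packed format would instead interleave luma and chroma samples in one plane, so a kernel cannot address each component as a contiguous block.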

16 Work flow
[Diagram. Labels: I, P and B frames; DCT data; kernel; forward reference frame; backward reference frame; indirect data; inline data; media read message; media write message; destination surface; slice of macroblocks.]

17 About XvMC API
• Post-processing is missing from the XvMC API design
• Video output mixer

18 High Level Language
• Why is a high-level language preferred for media kernels?
  – Easy to debug
  – Easy to reuse code
  – Hides platform details; easy to understand and maintain
• Possible choices
  – GLSL is not suitable
  – A simple C extension?

19 H.264
• Kernels become much more complex because of the different MC and DCT size combinations.
• Not suitable for a slice-level API, because of intra prediction.
• Media threads need scheduling and dependency control, because of intra prediction.
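One common way to handle this kind of dependency is wavefront scheduling, sketched below. It is not taken from the slides. It assumes that each macroblock depends on its left, top-left, top and top-right neighbors, in which case macroblock (x, y) can run in wave x + 2*y and all macroblocks in the same wave are independent. The function name wavefront_schedule is hypothetical.

```python
# Sketch: group macroblocks into waves that can run in parallel,
# assuming each macroblock depends on its left, top-left, top and
# top-right neighbors. Every dependency of (x, y) has a smaller wave index.
def wavefront_schedule(mb_width, mb_height):
    waves = {}
    for y in range(mb_height):
        for x in range(mb_width):
            waves.setdefault(x + 2 * y, []).append((x, y))
    return [waves[index] for index in sorted(waves)]


for index, wave in enumerate(wavefront_schedule(3, 2)):
    print(index, wave)
```

A thread dispatcher with dependency control could launch one wave at a time, or track the neighbor dependencies per macroblock and start each thread as soon as they are satisfied.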

20 VAAPI
• Picture-level API
• Covers MPEG-2, H.264, and VC-1 from different entry points
• Post-processing and a video output mixer are missing

21 TODO
• Optimize the IDCT code
• MPEG-2 XvMC VLD extension
• VAAPI for MPEG-2
• VAAPI for AVC
• Video post-processing and mixer

22 Q&A Thank You!