MPEG2 Video Encoding on Imagine November 16, 2000 Scott Rixner.



Programming Imagine
• Architecture features
  – Data bandwidth management
  – Data-parallel clusters
  – Parallel-subword operations
• Stream programming model
  – Natural data streams of application
  – Computation kernels perform “functions”
• Challenge is to think in terms of streams instead of traditional C-style sequential code

Application Development (1)
• Compose stream and kernel diagram
  – Identify natural streams in the application
  – Understand data-parallelism and how to map it to the clusters
  – Stream-oriented algorithmic choices
• Write kernel code
  – C-like syntax
  – idebug enables quick non-performance, functional debugging
  – iscd/schedviz enables C-level performance tuning

Application Development (2)
• Write stream code
  – First cut: simple mapping of stream/kernel diagram
  – idebug enables quick functional testing
  – Second cut: convert to macrocode (soon to be obsolete)
  – isim yields cycle-accurate simulation
• Performance tuning
  – schedviz allows quick kernel tuning
  – appviz shows where application run-time is going

MPEG2 Encoding
• Color Conversion (RGB → YCbCr)
• Motion Estimation
• Discrete Cosine Transform
• Quantization
• Run-level Encoding
• Variable-length Coding
• IDCTQ/Correlation for Reference Frame
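The run-level encoding stage listed above can be illustrated with a minimal Python sketch. It assumes the input is one block of quantized coefficients already in scan order, and it simply drops trailing zeros where a real encoder would emit an end-of-block code. The function name run_level_encode is hypothetical and is not taken from the Imagine code.

```python
def run_level_encode(coeffs):
    """Encode quantized coefficients as (run, level) pairs.

    run   = number of zeros preceding a nonzero coefficient
    level = the nonzero coefficient itself
    Trailing zeros produce no pair (assumption: an end-of-block
    marker would be added by a later stage).
    """
    pairs = []
    run = 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs


if __name__ == "__main__":
    print(run_level_encode([5, 0, 0, -2, 1, 0, 0, 0]))
```

The (run, level) pairs are what the variable-length coding stage would then map to bit patterns.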

Streams and Kernels

Imagine Programming Environment

    StereoDepthExtraction(…) {
        // Load Input Images
        ...
        // Run Kernels
        convolve7x7(RawImage, ConvImage);
        convolve3x3(ConvImage, Conv2Image);
        ...
        // Store Output
    }

    convolve7x7(…) {
        ...
        while (!In.empty()) {
            ...
            p0 = k0 * in10;
            p12 = k21 * in32;
            p34 = k43 * in54;
            p56 = k65 * in76;
            sum = (p0 + p12) + (p34 + p56);
            ...
        }
    }

Imagine Programming Tools

KernelC

    loop_stream(datain) pipeline(1) {
        datain >> color1 >> color2 >> color3 >> color4;

        // c = 0.299R || 0.114B
        c1 = hi(mulrnd(RB_SCALE, shift(a1, 1)));
        c2 = hi(mulrnd(RB_SCALE, shift(a2, 1)));
        c3 = hi(mulrnd(RB_SCALE, shift(a3, 1)));
        c4 = hi(mulrnd(RB_SCALE, shift(a4, 1)));
        …
        Yout << hi(mulrnd(Ymadj, shift(temp0, 1))) + Yaadj;
        Yout << hi(mulrnd(Ymadj, shift(temp1, 1))) + Yaadj;

        first = hi(mulrnd((a1a3 - (z1 + z3)), C_SCALE)) + one_two_eight;
        second = hi(mulrnd((a2a4 - (z2 + z4)), C_SCALE)) + one_two_eight;
        first = commucperm(perm_a, first);
        second = commucperm(perm_b, second);
        CrCbout << select(low, first, second);
    }
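For readers unfamiliar with the arithmetic behind the color conversion kernel, here is a scalar Python sketch of an RGB to YCbCr conversion using the standard ITU-R BT.601 luma weights (0.299, 0.587, 0.114) and a chroma offset of 128, matching the 0.299R and 0.114B terms and the one_two_eight constant visible in the KernelC. It is an assumption that the kernel uses exactly these chroma scale factors, and the sketch ignores the packed-subword fixed-point arithmetic and the range adjustment (Ymadj, Yaadj) of the real kernel. The function names are hypothetical.

```python
def clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, round(value)))


def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion of one 8-bit RGB pixel (illustrative only)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y) + 128  # blue-difference chroma, centered on 128
    cr = 0.713 * (r - y) + 128  # red-difference chroma, centered on 128
    return clamp8(y), clamp8(cb), clamp8(cr)


if __name__ == "__main__":
    for pixel in [(0, 0, 0), (255, 255, 255), (255, 0, 0)]:
        print(pixel, "->", rgb_to_ycbcr(*pixel))
```

Gray pixels map to Cb = Cr = 128, which is why the kernel adds 128 after scaling the color differences.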

7x7 Convolution Kernel
[Figure: kernel schedule with columns for ALUs, Comm/SP, and Streams, divided into Pipeline Stage 0, Pipeline Stage 1, and Pipeline Stage 2]

StreamC

    for (row = 0; row < NROWS; row++) {
        // update quantization factor for rate control
        quantizerScale = newQuantizerScale;

        // setup streams for this row
        ...

        // Perform I-Frame encoding
        convert(InputRow, &YRow, &CbCrRow);
        dct(YRow, dctIconstants, quantizerScale, &DCTYRow);
        dct(CbCrRow, dctIconstants, quantizerScale, &DCTCbCrRow);
        rle(DCTYRow, DCTCbCrRow, rleConstants, &RunLevelsRow);
        vlc(RunLevelsRow, &bitStream, &newQuantizerScale);

        // Store generated bit stream
        ...

        // Generate reference image for subsequent P or B frames
        idct(DCTYRow, idctIconstants, quantizerScale, &RefYRow);
        idct(DCTCbCrRow, idctIconstants, quantizerScale, &RefCbCrRow);

        // Store reference rows
        ...
    }
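The body of the dct kernel called in the StreamC loop is not shown in the slides. As a rough reference for what such a kernel computes, here is a minimal Python sketch of an orthonormal 2-D DCT-II built from separable 1-D transforms. The names dct_1d and dct_2d are hypothetical, and the floating-point arithmetic is an assumption: the StreamC call also passes quantizerScale, so the real kernel evidently folds in quantization, which this sketch leaves out.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            values[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * total)
    return out


def dct_2d(block):
    """2-D DCT of a square block: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols[j] holds transformed column j; transpose back to row-major order.
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    flat = [[10] * 8 for _ in range(8)]
    coeffs = dct_2d(flat)
    print(round(coeffs[0][0], 6))  # a flat block has only a DC coefficient
```

A flat block concentrates all of its energy in the DC coefficient, which is what makes the subsequent run-level encoding effective on smooth image regions.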

Macrocode

    for (int row = 0; row < mb_height; row++) {
        for (int col = 0; col < mb_width; col += iNumBlocks) {
            // Generate indices and load the input macroblocks
            rts.write_ucr(1, image_size_param);
            rts.write_ucr(2, idxparams);
            rts.vect_op(idxgen, 0, 1, iframe.colorIndices);
            rts.vect_load(false, iframe.imageBuffer[even], iframe.colorIndices,
                          memInputFrame, msg);

            // Color conversion
            rts.vect_op(icolor, 1, 2, "icolor conversion",
                        iframe.imageBuffer[odd], iframe.blkY1dct, iframe.blkCrCb1dct);

            // DCT and quantization
            rts.write_ucr(1, quantizer_scale);
            rts.vect_op(dct, 2, 1, "Y dct",
                        iframe.blkY1dct, dctIntraConsts, iframe.blkY2rle);
            rts.write_ucr(1, quantizer_scale);
            rts.vect_op(dct, 2, 1, "CrCb dct",
                        iframe.blkCrCb1dct, dctIntraConsts, iframe.blkCrCb2rle);

            // Run-level encoding and store
            rts.write_ucr(1, 0);
            rts.write_ucr(2, quant_scale);
            rts.vect_op(rle, 4, 1, "RLE",
                        iframe.blkY2rle, iframe.blkCrCb2rle, rle_consts, zeroLength,
                        UP(iframe.blkRunLevels[odd]));
            rts.vect_store(false, iframe.blkRunLevels[odd], memOutputFrame, msg);

            // Inverse DCT for the reference frame
            rts.write_ucr(1, iquantizer_scale);
            rts.vect_op(idct, 2, 1, "Y idct",
                        iframe.blkY2rle, idctIntraConsts, iframe.blkY3);
            rts.write_ucr(1, iquantizer_scale);
            rts.vect_op(idct, 2, 1, "CrCb idct",
                        iframe.blkCrCb2rle, idctIntraConsts, iframe.blkCrCb3);

            // Correlate and store the new reference blocks
            rts.write_ucr(1, 0);
            rts.vect_op(correlate, 4, 2, "correlate",
                        iframe.blkY3, iframe.blkCrCb3,
                        iframe.dummy_blkYMVref, iframe.dummy_blkCrCbMVref,
                        iframe.blkYref[odd], iframe.blkCrCbref[odd]);
            rts.vect_store(false, iframe.blkYref[odd], memNewRefY, msg);
            rts.vect_store(false, iframe.blkCrCbref[odd], memNewRefCrCb, msg);
        }
    }

Stereo Depth Extractor
• Convolutions
  – Load original packed row
  – Unpack (8-bit → 16-bit)
  – 7x7 Convolve
  – 3x3 Convolve
  – Store convolved row
• Disparity Search
  – Load convolved rows
  – Calculate BlockSADs at different disparities
  – Store best disparity values

Tools
• idebug (functional simulator)
  – Built on top of Visual Studio (any C++ compiler)
• iscd (kernel scheduler)
  – Generates optimized VLIW assembly from C-like code
• isim (cycle-accurate simulator)
  – Simulates current Imagine architecture (configurable)
• schedviz (schedule/application visualizer)
  – Interactive visualization of resource utilization
• stream scheduler (run-time stream manager)

idebug
• Macros and libraries
• Enable Imagine StreamC/KernelC to be directly compiled by a C++ compiler
• Enables the use of any C++ debugger to debug Imagine code
• Can add arbitrary C++ code into the StreamC/KernelC for debugging
  – Function stubs
  – printf’s, etc.

Imagine Debugging

IDebug

iscd
• Optimizing VLIW scheduler
• Compiles KernelC
• Currently supports
  – copy propagation & dead code elimination
  – software pipelining
  – loop unrolling
  – schedule randomization
  – inline functions (no function calls)
• Configurable target architecture

isim
• Similar application performance to RTL
• ~4M cycles per hour (>1000 cycles per second)
• Configurable
  – Machine description file (same file as for iscd)
  – # clusters, ALU mix/connection, memory system, etc.
• Interactive command prompt
  – Debugging
  – Performance monitoring/reporting
  – Memory/file comparison

schedviz
• Interactive schedule visualizer
• Visual Basic
• Shows resource utilization
  – Operation scheduling
  – Communication scheduling
• Enables source-level performance optimization
  – Never look at assembly code!
• Also view application execution
  – Cluster, memory, network utilization

Stream Scheduler (1)
• Converts StreamC functions into Imagine operations
• Allocates:
  – operation issue slots
  – stream-level registers
  – stream register file (SRF)
  – memory
• Determines dependencies between operations

Stream Scheduler (2)
• SRF allocation is critical
  – requires usage information
  – requires foreknowledge
  – too costly to perform at run time
• Stream scheduler is profile based
  – run once with simple allocation
  – collect usage information
  – perform good allocation
  – run repeatedly with good allocation

Handling Large Streams
• Strip mining
• Double buffering
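The slide names the two techniques without detail. As a sketch under stated assumptions, the Python below shows the general idea: strip mining cuts a long stream into strips small enough for on-chip storage, and double buffering fetches the next strip into a second buffer while the current one is processed. The names strip_mine and process_double_buffered are hypothetical, and the sequential code only models the buffer hand-off. On real hardware the load and the kernel would run concurrently.

```python
def strip_mine(stream, strip_len):
    """Yield consecutive strips of at most strip_len elements."""
    for start in range(0, len(stream), strip_len):
        yield stream[start:start + strip_len]


def process_double_buffered(stream, strip_len, kernel):
    """Run kernel over a long stream one strip at a time using two buffers."""
    strips = strip_mine(stream, strip_len)
    buffers = [next(strips, None), None]  # prefetch the first strip
    current = 0
    out = []
    while buffers[current] is not None:
        # "Load" the next strip into the idle buffer (None when exhausted).
        buffers[1 - current] = next(strips, None)
        # Process the strip that is already resident.
        out.extend(kernel(buffers[current]))
        current = 1 - current  # swap the roles of the two buffers
    return out


if __name__ == "__main__":
    doubled = process_double_buffered(list(range(10)), 4, lambda s: [2 * x for x in s])
    print(doubled)
```

The strip length would be chosen so that two strips plus the kernel's outputs fit in the stream register file at once.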

Stream Algorithms: Blocksearch
[Figure: the blocksearch kernel takes a row from the current image and reference rows 0, 1, and 2 of the reference image (the search region for the current row) and produces motion vectors]
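The slides do not show the blocksearch kernel itself. The following Python sketch illustrates a generic exhaustive block-matching search, assuming a sum-of-absolute-differences (SAD) cost as used in the stereo depth example. The function names, the square search window, and the tie-breaking rule (first minimum wins) are all assumptions made for illustration, not details of the Imagine implementation.

```python
def sad(cur_block, ref, rx, ry):
    """Sum of absolute differences between cur_block and ref at (rx, ry)."""
    n = len(cur_block)
    return sum(
        abs(cur_block[y][x] - ref[ry + y][rx + x])
        for y in range(n)
        for x in range(n)
    )


def block_search(cur_block, ref, x0, y0, search):
    """Return (dx, dy, cost) of the best match within +/- search pixels.

    (x0, y0) is the block's position in the current image; candidate
    positions that fall outside the reference image are skipped.
    """
    n = len(cur_block)
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x0 + dx, y0 + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur_block, ref, rx, ry)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


if __name__ == "__main__":
    ref = [[3 * x + 17 * y for x in range(8)] for y in range(8)]
    cur = [row[3:5] for row in ref[2:4]]  # 2x2 block copied from (3, 2)
    print(block_search(cur, ref, 2, 2, 2))
```

The inner SAD loop consists almost entirely of subtractions, absolute values, and additions on pixel data, which is consistent with the high share of 8-bit ADD/SUB operations reported for the encoder.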

MPEG2 Characteristics
• Operations
  – 56% 8-bit ADD/SUB
• Little locality
  – 1.47 accesses per word of global data
• Computationally intense
  – 155 operations per global data reference

Performance & Power
• Raw Performance
  – 360x288, 24-bit: 350 fps
  – 720x486, 24-bit: 104 fps
• Clusters provide high arithmetic bandwidth
  – 27.6 GOPS on blocksearch kernel
  – 17.9 GOPS overall
• SRF provides necessary data locality, bandwidth
  – Only temporary data in off-chip memory are reference frames
  – 2.4 GB/s required, 32 GB/s available
• Power Efficiency: 10.7 GOPS/W
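A quick arithmetic check on the two frame rates above shows that they correspond to nearly the same pixel throughput. The short Python sketch below only multiplies the figures quoted on the slide. The helper name pixel_rate is hypothetical.

```python
def pixel_rate(width, height, fps):
    """Pixels encoded per second at the given frame size and frame rate."""
    return width * height * fps


small = pixel_rate(360, 288, 350)  # 360x288 at 350 fps
large = pixel_rate(720, 486, 104)  # 720x486 at 104 fps

if __name__ == "__main__":
    print(small)  # 36288000
    print(large)  # 36391680
```

Both cases come to roughly 36 million pixels per second, which suggests that the encoder's throughput is approximately independent of frame size.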

Bandwidth Hierarchy
[Figure: SDRAM ↔ Stream Register File at 2 GB/s; Stream Register File ↔ ALU clusters at 32 GB/s; 544 GB/s within the ALU clusters]
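The three bandwidth figures on this slide can be put in proportion with a few lines of Python. The assignment of each figure to a level (2 GB/s for SDRAM, 32 GB/s for the stream register file, 544 GB/s inside the ALU clusters) is a reading of the flattened figure and should be treated as an assumption. The names below are hypothetical.

```python
# Peak bandwidth per level in GB/s, as read from the slide (assumed mapping).
BANDWIDTH_GB_PER_S = {"SDRAM": 2, "SRF": 32, "ALU clusters": 544}


def relative_to(levels, base):
    """Express every level's bandwidth as a multiple of the base level."""
    return {name: value / levels[base] for name, value in levels.items()}


if __name__ == "__main__":
    for name, ratio in relative_to(BANDWIDTH_GB_PER_S, "SDRAM").items():
        print(f"{name}: {ratio:g}x")
```

Under that reading, each step up the hierarchy offers roughly 16 to 17 times more bandwidth than the level below it, so an application must reuse data heavily in the upper levels to keep the ALUs busy.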

Stream Recirculation

MPEG Bandwidth

MPEG Execution

Challenges
• VLC (Huffman Coding)
  – Difficult and inefficient to implement on clusters (SIMD on 32-bit data)
  – Instead, send RLE data over network to FPGA
  – Could add special-purpose Huffman coding stream unit
• Rate Control
  – Difficult because multiple macroblocks encoded in parallel
  – Must perform on a coarser granularity (impact on picture quality?)
  – For smaller image sizes, can simply re-encode a group of macroblocks at a higher quantization level if necessary in real time

Imagine Programming
• Think in terms of streams
• Range of software tools
  – Compilers
  – Visualizers
  – Simulators
• Achieve new levels of performance
  – Less programming effort
  – Greater power efficiency

If-Statement Example

    if (case) { f(x); } else { g(x); }

[Figure: PE0 to PE3 under shared control, each holding its own case value. Should PEs execute f( ) or g( )?]

    if (case) { strA << x; } else { strB << x; }

[Figure: PE0 to PE3 with SRF0 to SRF3 under shared control, each PE holding its own case value]

Conditional Streams
  – Data streams that are accessed conditionally based on a local case value
  – Results in an arbitrary expansion or compression of stream in space and time
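The effect of a conditional output stream can be modeled in a few lines of Python. This is a functional sketch only: it assumes each element carries its own case value and shows how elements are routed to strA or strB, so that f and g can later run over dense streams with every processing element doing useful work. The function name route_conditionally is hypothetical, and nothing here models the hardware mechanism.

```python
def route_conditionally(values, cases):
    """Append each value to str_a when its case is true, otherwise to str_b."""
    str_a, str_b = [], []
    for x, case in zip(values, cases):
        if case:
            str_a.append(x)  # corresponds to: strA << x
        else:
            str_b.append(x)  # corresponds to: strB << x
    return str_a, str_b


if __name__ == "__main__":
    a, b = route_conditionally([10, 11, 12, 13], [True, False, False, True])
    print(a, b)
```

Each output stream is shorter than the input and its length depends on the data, which is the compression of the stream that the slide refers to.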