MPEG2 Video Encoding on Imagine November 16, 2000 Scott Rixner
Imagine Architecture

Programming Imagine
Architecture features
  – Data bandwidth management
  – Data-parallel clusters
  – Parallel-subword operations
Stream programming model
  – Natural data streams of application
  – Computation kernels perform “functions”
The challenge is to think in terms of streams instead of traditional C-style sequential code

Application Development (1)
Compose stream and kernel diagram
  – Identify natural streams in the application
  – Understand data parallelism and how to map it to the clusters
  – Stream-oriented algorithmic choices
Write kernel code
  – C-like syntax
  – idebug enables quick functional (non-performance) debugging
  – iscd/schedviz enables C-level performance tuning

Application Development (2)
Write stream code
  – First cut: simple mapping of stream/kernel diagram
  – idebug enables quick functional testing
  – Second cut: convert to macrocode (soon to be obsolete)
  – isim yields cycle-accurate simulation
Performance tuning
  – schedviz allows quick kernel tuning
  – appviz shows where application run time is going

MPEG2 Encoding
Color Conversion (RGB → YCbCr)
Motion Estimation
Discrete Cosine Transform
Quantization
Run-level Encoding
Variable-length Coding
IDCTQ/Correlation for Reference Frame

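To make the run-level encoding stage concrete, here is a minimal sketch in plain C++ (not KernelC). It assumes the quantized coefficients of one 8x8 block are already in scan order; the names `RunLevel` and `runLevelEncode` are hypothetical and do not come from the Imagine code. Each nonzero coefficient is emitted as a (run, level) pair, where run counts the zeros that precede it.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// One (run, level) pair: `first` zeros followed by the nonzero value `second`.
using RunLevel = std::pair<int, std::int16_t>;

// Run-level encode one block of quantized coefficients (already in scan order).
std::vector<RunLevel> runLevelEncode(const std::vector<std::int16_t>& coeffs) {
    assert(coeffs.size() <= 64);  // an 8x8 block has at most 64 coefficients
    std::vector<RunLevel> pairs;
    int run = 0;
    for (std::int16_t c : coeffs) {
        if (c == 0) {
            ++run;                        // extend the current run of zeros
        } else {
            pairs.emplace_back(run, c);   // emit (zeros skipped, nonzero level)
            run = 0;
        }
    }
    // Trailing zeros produce no pair; a real encoder signals them with an
    // end-of-block code in the variable-length coding stage.
    return pairs;
}
```

For the input `{12, 0, 0, -3, 1, 0, 0, 0}` this yields the pairs (0, 12), (2, -3), (0, 1), which is why heavily quantized blocks compress well: long zero runs collapse into a few pairs.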
Streams and Kernels

Imagine Programming Environment
Stream level:
    StereoDepthExtraction(…) {
      // Load Input Images
      ...
      // Run Kernels
      convolve7x7 (RawImage, ConvImage);
      convolve3x3 (ConvImage, Conv2Image);
      ...
      // Store Output
    }
Kernel level:
    Convolve7x7(…) {
      ...
      while(!In.empty()) {
        ...
        p0 = k0 * in10;
        p12 = k21 * in32;
        p34 = k43 * in54;
        p56 = k65 * in76;
        sum = (p0 + p12) + (p34 + p56);
        ...
      }
    }

Imagine Programming Tools

KernelC
    loop_stream(datain) pipeline(1) {
      datain >> color1 >> color2 >> color3 >> color4;
      // c = 0.299R || 0.114B
      c1 = hi(mulrnd(RB_SCALE, shift(a1, 1)));
      c2 = hi(mulrnd(RB_SCALE, shift(a2, 1)));
      c3 = hi(mulrnd(RB_SCALE, shift(a3, 1)));
      c4 = hi(mulrnd(RB_SCALE, shift(a4, 1)));
      …
      Yout << hi(mulrnd(Ymadj, shift(temp0, 1))) + Yaadj;
      Yout << hi(mulrnd(Ymadj, shift(temp1, 1))) + Yaadj;
      first = hi(mulrnd((a1a3 - (z1 + z3)), C_SCALE)) + one_two_eight;
      second = hi(mulrnd((a2a4 - (z2 + z4)), C_SCALE)) + one_two_eight;
      first = commucperm(perm_a, first);
      second = commucperm(perm_b, second);
      CrCbout << select(low, first, second);
    }

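For readers unfamiliar with the arithmetic behind the color conversion kernel, the following is a scalar reference sketch in plain C++, not the packed fixed-point KernelC above. It assumes the standard ITU-R BT.601 weights: the slide shows only 0.299R and 0.114B, and the 0.587 weight for G follows from the three weights summing to one. The names `YCbCr`, `clampToByte`, and `rgbToYCbCr` are hypothetical, and the `Ymadj`/`Yaadj` range adjustment in the kernel is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct YCbCr { std::uint8_t y, cb, cr; };

// Round to nearest and clamp into the 8-bit range.
static std::uint8_t clampToByte(double v) {
    assert(std::isfinite(v));  // lround is not meaningful for NaN or infinity
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, std::lround(v))));
}

// Full-range RGB to YCbCr using BT.601 luma weights (an assumption here).
YCbCr rgbToYCbCr(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const double y  = 0.299 * r + 0.587 * g + 0.114 * b;
    // Chroma is the scaled blue/red difference from luma, centered at 128.
    const double cb = (b - y) / 1.772 + 128.0;  // 1.772 = 2 * (1 - 0.114)
    const double cr = (r - y) / 1.402 + 128.0;  // 1.402 = 2 * (1 - 0.299)
    return { clampToByte(y), clampToByte(cb), clampToByte(cr) };
}
```

Any gray input (R = G = B) maps to Cb = Cr = 128, which corresponds to the `one_two_eight` offset added to both chroma values in the kernel.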
7x7 Convolution Kernel
[Figure: kernel schedule with columns for ALUs, Comm/SP, and Streams, divided into Pipeline Stage 0, Pipeline Stage 1, and Pipeline Stage 2]

StreamC
    for (row=0; row<NROWS; row++) {
      // update quantization factor for rate control
      quantizerScale = newQuantizerScale;
      // setup streams for this row
      ...
      // Perform I-Frame encoding
      convert(InputRow, &YRow, &CbCrRow);
      dct(YRow, dctIconstants, quantizerScale, &DCTYRow);
      dct(CbCrRow, dctIconstants, quantizerScale, &DCTCbCrRow);
      rle(DCTYRow, DCTCbCrRow, rleConstants, &RunLevelsRow);
      vlc(RunLevelsRow, &bitStream, &newQuantizerScale);
      // Store generated bit stream
      ...
      // Generate reference image for subsequent P or B frames
      idct(DCTYRow, idctIconstants, quantizerScale, &RefYRow);
      idct(DCTCbCrRow, idctIconstants, quantizerScale, &RefCbCrRow);
      // Store reference rows
      ...
    }

Macrocode
    for (int row = 0; row < mb_height; row++) {
      for (int col = 0; col < mb_width; col += iNumBlocks) {
        rts.write_ucr(1, image_size_param);
        rts.write_ucr(2, idxparams);
        rts.vect_op(idxgen, 0, 1, iframe.colorIndices);
        rts.vect_load(false, iframe.imageBuffer[even], iframe.colorIndices,
                      memInputFrame, msg);
        rts.vect_op(icolor, 1, 2, "icolor conversion", iframe.imageBuffer[odd],
                    iframe.blkY1dct, iframe.blkCrCb1dct);
        rts.write_ucr(1, quantizer_scale);
        rts.vect_op(dct, 2, 1, "Y dct", iframe.blkY1dct, dctIntraConsts,
                    iframe.blkY2rle);
        rts.write_ucr(1, quantizer_scale);
        rts.vect_op(dct, 2, 1, "CrCb dct", iframe.blkCrCb1dct, dctIntraConsts,
                    iframe.blkCrCb2rle);
        rts.write_ucr(1, 0);
        rts.write_ucr(2, quant_scale);
        rts.vect_op(rle, 4, 1, "RLE", iframe.blkY2rle, iframe.blkCrCb2rle,
                    rle_consts, zeroLength, UP(iframe.blkRunLevels[odd]));
        rts.vect_store(false, iframe.blkRunLevels[odd], memOutputFrame, msg);
        rts.write_ucr(1, iquantizer_scale);
        rts.vect_op(idct, 2, 1, "Y idct", iframe.blkY2rle, idctIntraConsts,
                    iframe.blkY3);
        rts.write_ucr(1, iquantizer_scale);
        rts.vect_op(idct, 2, 1, "CrCb idct", iframe.blkCrCb2rle, idctIntraConsts,
                    iframe.blkCrCb3);
        rts.write_ucr(1, 0);
        rts.vect_op(correlate, 4, 2, "correlate", iframe.blkY3, iframe.blkCrCb3,
                    iframe.dummy_blkYMVref, iframe.dummy_blkCrCbMVref,
                    iframe.blkYref[odd], iframe.blkCrCbref[odd]);
        rts.vect_store(false, iframe.blkYref[odd], memNewRefY, msg);
        rts.vect_store(false, iframe.blkCrCbref[odd], memNewRefCrCb, msg);
      }
    }

Stereo Depth Extractor
Convolutions
  – Load original packed row
  – Unpack (8-bit → 16-bit)
  – 7x7 Convolve
  – 3x3 Convolve
  – Store convolved row
Disparity Search
  – Load convolved rows
  – Calculate block SADs at different disparities
  – Store best disparity values

Tools
idebug (functional simulator)
  – Built on top of Visual Studio (any C++ compiler)
iscd (kernel scheduler)
  – Generates optimized VLIW assembly from C-like code
isim (cycle-accurate simulator)
  – Simulates current Imagine architecture (configurable)
schedviz (schedule/application visualizer)
  – Interactive visualization of resource utilization
stream scheduler (run-time stream manager)

idebug
Macros and libraries
Enables Imagine StreamC/KernelC to be compiled directly by a C++ compiler
Enables the use of any C++ debugger to debug Imagine code
Arbitrary C++ code can be added to the StreamC/KernelC for debugging
  – Function stubs
  – printfs, etc.

Imagine Debugging

IDebug

iscd
Optimizing VLIW scheduler
Compiles KernelC
Currently supports
  – copy propagation & dead code elimination
  – software pipelining
  – loop unrolling
  – schedule randomization
  – inline functions (no function calls)
Configurable target architecture

isim
Similar application performance to RTL
~4M cycles per hour (>1000 cycles per second)
Configurable
  – Machine description file (same file as for iscd)
  – Number of clusters, ALU mix/connection, memory system, etc.
Interactive command prompt
  – Debugging
  – Performance monitoring/reporting
  – Memory/file comparison

schedviz
Interactive schedule visualizer
Visual Basic
Shows resource utilization
  – Operation scheduling
  – Communication scheduling
Enables source-level performance optimization
  – Never look at assembly code!
Also view application execution
  – Cluster, memory, network utilization

Stream Scheduler (1)
Converts StreamC functions into Imagine operations
Allocates:
  – operation issue slots
  – stream-level registers
  – stream register file (SRF)
  – memory
Determines dependencies between operations

Stream Scheduler (2)
SRF allocation is critical
  – requires usage information
  – requires foreknowledge
  – too costly to perform at run time
Stream scheduler is profile based
  – run once with simple allocation
  – collect usage information
  – perform good allocation
  – run repeatedly with good allocation

Handling Large Streams
Strip mining
Double buffering

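The two techniques named on this slide can be sketched together in plain C++. This is an illustration under stated assumptions, not Imagine stream code: `memory` stands in for off-chip memory, the two `buffers` stand in for SRF regions, and `processStrip` is a hypothetical kernel that just sums its input. Strip mining splits a stream that is too large for on-chip storage into fixed-size strips; double buffering alternates between two buffers so that the load of the next strip can be issued before the kernel consumes the current one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical kernel: reduces one strip to a partial sum.
long processStrip(const std::vector<int>& strip) {
    return std::accumulate(strip.begin(), strip.end(), 0L);
}

// Process a long stream in strips of at most `stripLen` elements,
// alternating between two buffers (double buffering).
long stripMine(const std::vector<int>& memory, std::size_t stripLen) {
    assert(stripLen > 0);
    std::vector<int> buffers[2];

    // Copy strip number `s` from "memory" into the given buffer.
    auto load = [&](std::size_t s, std::vector<int>& buf) {
        const std::size_t begin = s * stripLen;
        const std::size_t end = std::min(begin + stripLen, memory.size());
        buf.assign(memory.begin() + static_cast<std::ptrdiff_t>(begin),
                   memory.begin() + static_cast<std::ptrdiff_t>(end));
    };

    const std::size_t numStrips = (memory.size() + stripLen - 1) / stripLen;
    long total = 0;
    if (numStrips > 0) load(0, buffers[0]);          // prime the first buffer
    for (std::size_t s = 0; s < numStrips; ++s) {
        // Issue the next load into the idle buffer. In this sequential sketch
        // it simply runs first; on hardware it could overlap with the kernel.
        if (s + 1 < numStrips) load(s + 1, buffers[(s + 1) % 2]);
        total += processStrip(buffers[s % 2]);       // kernel on current strip
    }
    return total;
}
```

The result is independent of the strip length, which is the property that lets strip size be chosen purely from the available on-chip storage.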
Stream Algorithms: Blocksearch
[Figure: the blocksearch kernel takes a row from the current image (Row 0, Row 1, Row 2) and the reference image rows (Reference row 0, Reference row 1, Reference row 2) and produces motion vectors; the current row is matched against a search region in the reference rows]

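As a scalar reference for what a block search computes, here is a minimal exhaustive search in plain C++ that minimizes the sum of absolute differences (SAD). It is a sketch under assumptions: the slide does not state the matching criterion or search pattern used by the Imagine kernel, and the types `Image` and `MotionVector` and the functions `blockSad` and `blockSearch` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

struct Image {
    int width, height;
    std::vector<std::uint8_t> pixels;  // row-major, width * height entries
    std::uint8_t at(int x, int y) const {
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

struct MotionVector { int dx, dy; long sad; };

// SAD between the n x n block at (bx, by) in `cur` and the block displaced
// by (dx, dy) in `ref`.
long blockSad(const Image& cur, const Image& ref,
              int bx, int by, int dx, int dy, int n) {
    long sad = 0;
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
            sad += std::abs(int(cur.at(bx + x, by + y)) -
                            int(ref.at(bx + x + dx, by + y + dy)));
    return sad;
}

// Try every displacement within +/- range and keep the lowest SAD.
MotionVector blockSearch(const Image& cur, const Image& ref,
                         int bx, int by, int n, int range) {
    assert(cur.width == ref.width && cur.height == ref.height);
    MotionVector best{0, 0, std::numeric_limits<long>::max()};
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            // Skip candidates that fall outside the reference image.
            if (bx + dx < 0 || by + dy < 0 ||
                bx + dx + n > ref.width || by + dy + n > ref.height) continue;
            const long sad = blockSad(cur, ref, bx, by, dx, dy, n);
            if (sad < best.sad) best = MotionVector{dx, dy, sad};
        }
    }
    return best;
}
```

Every candidate displacement is independent of the others, and the inner loops are dominated by small-integer subtract and add operations, which is the kind of work that maps well onto data-parallel clusters with subword arithmetic.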
MPEG2 Characteristics
Operations
  – 56% 8-bit ADD/SUB
Little locality
  – 1.47 accesses per word of global data
Computationally intense
  – 155 operations per global data reference

Performance & Power
Raw performance
  – 360x288, 24-bit: 350 fps
  – 720x486, 24-bit: 104 fps
Clusters provide high arithmetic bandwidth
  – 27.6 GOPS on blocksearch kernel
  – 17.9 GOPS overall
SRF provides necessary data locality, bandwidth
  – Only temporary data in off-chip memory are reference frames
  – 2.4 GB/s required, 32 GB/s available
Power efficiency: 10.7 GOPS/W

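The two raw frame rates can be compared by converting them to pixel throughput. The short C++ sketch below does that arithmetic; the function names are hypothetical, and the only inputs are the resolutions, frame rates, and 24-bit pixel size given on the slide.

```cpp
#include <cassert>

// Pixels encoded per second at a given resolution and frame rate.
long long pixelsPerSecond(int width, int height, int framesPerSecond) {
    assert(width > 0 && height > 0 && framesPerSecond > 0);
    return 1LL * width * height * framesPerSecond;
}

// Raw input bytes per second for 24-bit (3-byte) pixels.
long long inputBytesPerSecond(int width, int height, int framesPerSecond) {
    return pixelsPerSecond(width, height, framesPerSecond) * 3;
}

// pixelsPerSecond(360, 288, 350) == 36288000   (about 36.3 Mpixel/s)
// pixelsPerSecond(720, 486, 104) == 36391680   (about 36.4 Mpixel/s)
// inputBytesPerSecond(360, 288, 350) == 108864000  (about 109 MB/s)
```

Both configurations come out at roughly 36 million pixels per second, so the reported frame rates scale almost exactly with frame size.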
Bandwidth Hierarchy
[Figure: SDRAM (2 GB/s) → Stream Register File (32 GB/s) → ALU Cluster (544 GB/s)]

Stream Recirculation

MPEG Bandwidth

MPEG Execution

Challenges
VLC (Huffman Coding)
  – Difficult and inefficient to implement on clusters (SIMD on 32-bit data)
  – Instead, send RLE data over network to FPGA
  – Could add special-purpose Huffman coding stream unit
Rate Control
  – Difficult because multiple macroblocks are encoded in parallel
  – Must be performed at a coarser granularity (impact on picture quality?)
  – For smaller image sizes, can simply re-encode a group of macroblocks at a higher quantization level if necessary in real time

Imagine Programming
Think in terms of streams
Range of software tools
  – Compilers
  – Visualizers
  – Simulators
Achieve new levels of performance
  – Less programming effort
  – Greater power efficiency

If-Statement Example
    if (case) { f(x); } else { g(x); }
    if (case) { strA << x; } else { strB << x; }
[Figure: PE0, PE1, PE2, PE3 under shared control, each with its own case value. Should the PEs execute f( ) or g( )? A second panel shows PE0, PE1, PE2, PE3 with SRF0, SRF1, SRF2, SRF3, shared control, and case values]

Conditional Streams
  – Data streams that are accessed conditionally based on a local case value
  – Results in an arbitrary expansion or compression of stream in space and time
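The effect of a conditional output stream can be modeled in a few lines of plain C++. This is a behavioral sketch only: it says nothing about how Imagine implements the routing in hardware, and `Routed` and `routeConditionally` are hypothetical names. Each element carries a local case value; elements whose case is true are appended to `strA` and the rest to `strB`, so each output stream is a compacted subset of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Routed {
    std::vector<int> strA;  // elements whose case value was true
    std::vector<int> strB;  // elements whose case value was false
};

// Model of:  if (case) { strA << x; } else { strB << x; }
// applied to every element of a stream.
Routed routeConditionally(const std::vector<int>& data,
                          const std::vector<bool>& caseValues) {
    assert(data.size() == caseValues.size());
    Routed out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Append to exactly one output; order within each output is preserved.
        (caseValues[i] ? out.strA : out.strB).push_back(data[i]);
    }
    return out;
}
```

With data `{10, 11, 12, 13, 14, 15, 16, 17}` and case values `{1, 0, 0, 1, 1, 1, 0, 1}`, `strA` becomes `{10, 13, 14, 15, 17}` and `strB` becomes `{11, 12, 16}`. A later kernel can then apply f( ) to all of `strA` and g( ) to all of `strB` without any element sitting idle on the wrong branch.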