
1 2009/04/07 Yun-Yang Ma

2  • Overview
 • What is CUDA
   ◦ Architecture
   ◦ Programming Model
   ◦ Memory Model
 • H.264 Motion Estimation on CUDA
   ◦ Method
   ◦ Experimental Results
   ◦ Conclusion

3  • In the past few years, the processing capability of the Graphics Processing Unit (GPU) has grown rapidly

4  • General-purpose Computation on GPUs (GPGPU)
   ◦ Not only for accelerating the graphics display but also for speeding up non-graphics applications
     ▪ Linear algebra computation
     ▪ Scientific simulation

5  • Compute Unified Device Architecture
   ◦ http://www.nvidia.com.tw/object/cuda_home_tw.html# (NVIDIA CUDA Zone)
   ◦ Single program multiple data (SPMD) computing device
   (Figure: application examples shown on the slide: fast object detection, leukocyte tracking, real-time 3D modeling)

6  • Architecture

7  • Programming Model
   ◦ The program executes in two parts
     ▪ Host: CPU
     ▪ Device: GPU
   (Figure: the main program runs on the host; it launches a kernel that the device executes in parallel, and control returns to the host after the kernel finishes)

8  • Thread Batching
   ◦ CUDA creates many threads on the device, and each thread executes the kernel program on different data
   ◦ Threads in the same thread block can cooperate with each other through the shared memory
   ◦ The number of threads in a thread block is limited
     ▪ Thread blocks of the same dimensions can be organized into a grid to do thread batching
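The slides do not include code here, but a minimal sketch of how a batched thread finds its own work item from its block and thread indices might look as follows (the kernel name, array names and dimensions are illustrative assumptions, not taken from the presentation):

    // Illustrative sketch: each thread of a 2-D grid of 2-D thread blocks
    // computes its (x, y) position in the frame and processes one element.
    __global__ void scaleFrame(const float *in, float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in the frame
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row in the frame
        if (x < width && y < height)                     // guard for partially filled blocks
            out[y * width + x] = 0.5f * in[y * width + x];
    }

    // Host-side thread batching: a grid of equally sized thread blocks covers the frame.
    //   dim3 block(16, 16);                                  // 256 threads per block
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);    // enough blocks for the whole frame
    //   scaleFrame<<<grid, block>>>(d_in, d_out, width, height);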

9  • Thread Batching
   (Figure: the host launches Kernel 1 on Grid 1, a 3x2 arrangement of thread blocks, and Kernel 2 on Grid 2, a 3x4 arrangement of thread blocks; one block, Block (1,0), is expanded to show its 5x3 arrangement of threads)

10  • Memory Model
   ◦ DRAM: local memory and global memory
   ◦ On-chip memory: registers and shared memory
   (Figure: each thread has its own registers and local memory, the threads of a block share one shared memory, and all blocks of the grid access the global memory)
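A rough sketch of where data lives in this model (kernel and variable names are assumptions made for illustration):

    // Illustrative sketch of the CUDA memory spaces.
    __global__ void reverseBlock(const float *g_in, float *g_out)   // g_in/g_out: global memory (DRAM)
    {
        __shared__ float tile[256];     // shared memory: on-chip, visible to the whole thread block
        float v = g_in[blockIdx.x * blockDim.x + threadIdx.x];      // v is held in a register (per thread)

        tile[threadIdx.x] = v;          // stage the value in shared memory
        __syncthreads();                // make it visible to the other threads of the block

        // Write the block back to global memory in reversed order.
        g_out[blockIdx.x * blockDim.x + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
    }

    // Host side: global memory is allocated with cudaMalloc, e.g.
    //   float *d_in, *d_out;
    //   cudaMalloc(&d_in,  n * sizeof(float));
    //   cudaMalloc(&d_out, n * sizeof(float));
    //   reverseBlock<<<n / 256, 256>>>(d_in, d_out);   // assumes n is a multiple of 256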

11  • Example: Vector addition
   (The slide shows the kernel program listing as an image, which is not reproduced in the transcript)
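Since the listing itself is missing, the following is a minimal vector-addition sketch in the same spirit, showing the host/device split described on slide 7 (function and variable names are assumptions):

    #include <cuda_runtime.h>

    // Kernel program: one thread adds one pair of elements.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Host side: copy the inputs to the device, launch the kernel, copy the result back.
    void vectorAddOnDevice(const float *h_a, const float *h_b, float *h_c, int n)
    {
        size_t bytes = n * sizeof(float);
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes);  cudaMalloc(&d_b, bytes);  cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_a);  cudaFree(d_b);  cudaFree(d_c);
    }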

12  • In [1], an efficient block-level parallel algorithm is used for the variable block size motion estimation in H.264/AVC
   ◦ MB modes: P_16x16, P_16x8, P_8x16 and P_8x8, where each 8x8 partition can be further split into 8x8, 8x4, 4x8 and 4x4 sub-blocks

[1] "H.264/AVC Motion Estimation Implementation on Compute Unified Device Architecture (CUDA)", IEEE International Conference on Multimedia & Expo, 2008

13  • Two steps decide the final coding mode
   ◦ Step 1: Find the best motion vectors of each MB mode
   ◦ Step 2: Evaluate the R-D performance and choose the best mode
 • The H.264 ME algorithm is extremely complex and time consuming
   ◦ Fast motion estimation methods (TSS, DS, etc.) involve too many branch instructions
   ◦ In [1], they therefore focus on Full Search ME

14  • First stage: Calculate integer-pixel MVs
   ◦ Compute all SAD values between each block and all reference candidates in parallel
   ◦ Merge the 4x4 SADs to form all block sizes
   ◦ Find the minimal SAD and determine the integer-pixel MV
   (Figure: the 16x16 MB and its variable block size partitions)

15  • Second stage: Calculate fractional-pixel MVs
   ◦ The reference frame is interpolated using the six-tap filter and the bilinear filter defined in H.264/AVC
   ◦ Calculate the SADs at the 24 fractional-pixel positions that are adjacent to the best integer MV
   (Figure: the 24 half-pixel and quarter-pixel positions, numbered 1 to 24, surrounding the best integer-pixel position)
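The presentation does not show the interpolation code; the sketch below only illustrates the standard H.264 six-tap half-pel filter and the bilinear quarter-pel step named above, not the actual kernel organization used in [1] (function names are assumptions):

    // H.264 luma half-pel interpolation: 6-tap filter (1, -5, 20, 20, -5, 1),
    // normalized by 32 with rounding. One horizontal tap is shown; the full
    // interpolation also filters vertically before the quarter-pel step.
    __device__ int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    __device__ int halfPelH(const unsigned char *row, int x)
    {
        int sum = row[x - 2] - 5 * row[x - 1] + 20 * row[x]
                + 20 * row[x + 1] - 5 * row[x + 2] + row[x + 3];
        return clip255((sum + 16) >> 5);    // half-sample between x and x + 1
    }

    // Quarter-pel samples are then obtained by bilinear averaging of two
    // neighbouring integer/half-pel samples:
    __device__ int quarterPel(int a, int b) { return (a + b + 1) >> 1; }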

16  • 4x4 Block-Size SAD Calculation
   ◦ Sequence resolution: 4CIF (704x576)
   ◦ Search range: 32x32 (leads to 1024 candidates)
   ◦ Each candidate SAD is computed by a thread
   ◦ 256 threads are executed in a thread block, so every 256 candidates of one 4x4 block's SAD calculation are assigned to one thread block
   ◦ Number of thread blocks = (4x4 blocks per frame) x (ME search candidates) / (256 threads per block)
                             = 704/4 x 576/4 x 32^2 x 1/256 = 101,376
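A rough sketch of this thread mapping (this is not the authors' code; the candidate-to-block assignment, the memory layout and the border clamping are assumptions made here for illustration):

    #define SEARCH_W     32    // 32x32 search window -> 1024 candidates per block
    #define CAND_PER_TB  256   // candidates handled by one thread block

    // One thread computes the SAD of one 4x4 block against one search candidate.
    __global__ void sad4x4Kernel(const unsigned char *cur, const unsigned char *ref,
                                 int width, int height,     // 704 x 576 for 4CIF
                                 int *sadOut)                // one SAD per candidate
    {
        int groupsPerBlk = SEARCH_W * SEARCH_W / CAND_PER_TB;               // 4 thread blocks per 4x4 block
        int blk  = blockIdx.x / groupsPerBlk;                               // which 4x4 block
        int cand = (blockIdx.x % groupsPerBlk) * CAND_PER_TB + threadIdx.x; // 0..1023

        int blocksPerRow = width / 4;
        int bx = (blk % blocksPerRow) * 4;        // top-left corner of the 4x4 block
        int by = (blk / blocksPerRow) * 4;
        int mvx = cand % SEARCH_W - SEARCH_W / 2; // candidate displacement, -16..15
        int mvy = cand / SEARCH_W - SEARCH_W / 2;

        int sad = 0;
        for (int y = 0; y < 4; ++y)
            for (int x = 0; x < 4; ++x) {
                int rx = min(max(bx + x + mvx, 0), width - 1);   // clamp at the frame border
                int ry = min(max(by + y + mvy, 0), height - 1);
                sad += abs((int)cur[(by + y) * width + bx + x] - (int)ref[ry * width + rx]);
            }
        sadOut[blk * SEARCH_W * SEARCH_W + cand] = sad;
    }

    // Launch with the 101,376 thread blocks of 256 threads computed above
    // (on old GPUs with a 65,535 grid-dimension limit this would use a 2-D grid):
    //   sad4x4Kernel<<<101376, 256>>>(d_cur, d_ref, 704, 576, d_sad);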

17  • Block diagram of 4x4 block SAD calculation
   (Figure: the kernel launches thread blocks B1 to B101376, each containing threads T1 to T256; every thread block covers 256 of the 1024 candidates of one 4x4 block and writes its 256 SADs to DRAM)

18  • Variable Block-Size SAD Generation
   ◦ Merge the 4x4 SADs obtained in the previous step
   ◦ Each thread fetches sixteen 4x4 SADs of one MB at a candidate position and combines them to form the other block sizes
   ◦ Number of thread blocks = 704/16 x 576/16 x 32^2 x 1/256 = 6,336
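A sketch of the merging step described above (again not the authors' code; the SAD array layouts are assumptions, and the 4x8 and 8x4 sizes are omitted for brevity but are formed from pairs of 4x4 SADs in the same way):

    // One thread merges the sixteen 4x4 SADs of one MB at one candidate position.
    __global__ void mergeSadKernel(const int *sad4x4,   // 16 4x4 SADs per MB-candidate, raster order
                                   int *sad8x8, int *sad8x16, int *sad16x8, int *sad16x16,
                                   int numMB, int numCand)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;  // one MB-candidate pair per thread
        if (id >= numMB * numCand) return;

        int s[16];
        for (int i = 0; i < 16; ++i) s[i] = sad4x4[id * 16 + i];

        int s8[4];                                       // each 8x8 SAD = sum of four 4x4 SADs
        for (int j = 0; j < 2; ++j)
            for (int i = 0; i < 2; ++i)
                s8[j * 2 + i] = s[2*j*4 + 2*i] + s[2*j*4 + 2*i + 1]
                              + s[(2*j + 1)*4 + 2*i] + s[(2*j + 1)*4 + 2*i + 1];
        for (int k = 0; k < 4; ++k) sad8x8[id * 4 + k] = s8[k];

        sad8x16[id * 2 + 0] = s8[0] + s8[2];             // left and right 8x16 halves
        sad8x16[id * 2 + 1] = s8[1] + s8[3];
        sad16x8[id * 2 + 0] = s8[0] + s8[1];             // top and bottom 16x8 halves
        sad16x8[id * 2 + 1] = s8[2] + s8[3];
        sad16x16[id]        = s8[0] + s8[1] + s8[2] + s8[3];
    }

    // Launch with the 6,336 thread blocks of 256 threads computed above:
    //   mergeSadKernel<<<6336, 256>>>(d_sad4x4, d_sad8x8, d_sad8x16, d_sad16x8, d_sad16x16,
    //                                 44 * 36, 1024);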

19  • Block diagram of variable block size SAD calculation
   (Figure: thread blocks B1 to B6336, each containing threads T1 to T256, read sixteen 4x4 SADs per thread from DRAM and write back the merged SADs for each MB candidate: eight 4x8, eight 8x4, four 8x8, two 8x16, two 16x8 and one 16x16 SAD)

20  • Integer Pixel SAD Comparison
   ◦ All 1024 SADs of one block are compared, and the candidate with the least SAD is chosen as the integer-pixel MV
   ◦ Each block size (16x16 down to 4x4) has its own kernel for SAD comparison
   ◦ Seven kernels are implemented and executed sequentially

21  • Block diagram of integer pixel SAD comparison
   (Figure: each thread block B1 loads 1024 SADs from DRAM and each of its 256 threads first reduces 4 SADs; the partial results are then reduced in shared memory, with threads T1 to T128/2^(n-1) comparing 256/2^(n-1) SADs in iteration n, until the integer-pel MV is obtained)

22  • During the thread reduction process, a problem may occur
   ◦ Shared memory bank conflicts
 • A sequential addressing with non-divergent branching strategy is therefore adopted

23  • SAD comparison using sequential addressing with non-divergent branching
   (Figure: eight (SAD value, index) pairs in shared memory; in each step, thread i compares element i of the first half with element i of the second half and keeps the smaller SAD together with its index, halving the number of active elements until only the minimum remains)
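A sketch of such a reduction for the 1024 candidate SADs of one block (illustrative only; the authors implement a separate comparison kernel per block size, and the data layout and kernel name here are assumptions):

    // 256 threads cooperatively find the minimum SAD of one block and its candidate index.
    __global__ void minSadKernel(const int *sad, int numCand, int *bestCand)
    {
        __shared__ int sVal[256];   // SAD values
        __shared__ int sIdx[256];   // candidate index that produced each value

        int tid  = threadIdx.x;
        int base = blockIdx.x * numCand;

        // Each thread first serially reduces its share of the candidates (4 of 1024).
        int best = 0x7FFFFFFF, bestI = 0;
        for (int c = tid; c < numCand; c += blockDim.x)
            if (sad[base + c] < best) { best = sad[base + c]; bestI = c; }
        sVal[tid] = best;  sIdx[tid] = bestI;
        __syncthreads();

        // Sequential addressing: the active threads always read a contiguous range of
        // shared memory, avoiding bank conflicts, and whole warps take the same branch
        // until the final iterations.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride && sVal[tid + stride] < sVal[tid]) {
                sVal[tid] = sVal[tid + stride];
                sIdx[tid] = sIdx[tid + stride];
            }
            __syncthreads();
        }
        if (tid == 0) bestCand[blockIdx.x] = sIdx[0];    // winning candidate gives the integer-pel MV
    }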

24  • Fractional Pixel MV Refinement
   ◦ Find the best fractional-pixel motion vector around the integer motion vector of every block
   (Figure: the kernel reads the encoding frame, the interpolated reference frame and the integer-pel MV from DRAM; each thread block uses threads T1 to T24 to compute the SADs of the 24 half- and quarter-pixel positions in shared memory, then reduces them with threads T1 to T12/2^(n-1) and 24/2^(n-1) SADs in iteration n to obtain the fractional-pel MV)

25  • Environment
   ◦ AMD Athlon 64 X2 Dual Core 2.1 GHz with 2 GB memory
   ◦ NVIDIA GeForce 8800 GTX with 768 MB DRAM
   ◦ CUDA Toolkit and SDK 1.1
 • Parameters
   ◦ ME algorithm: Full Search
   ◦ Search range: 32x32

26  • The average execution time in ms for processing one frame using the proposed algorithm

   Step                                           ms       Percentage (%)
   Step 1. 4x4 Block Size SADs Calculation        33.98    31.24
   Step 2. Variable Block Size SADs Generation    30.64    28.16
   Step 3. Integer Pixel SAD Comparison            9.69     8.90
   Step 4. Fractional Pixel Interpolation          7.10     6.52
   Step 5. Fractional Pixel ME Refinement          7.10     9.99
   Others                                         16.49    15.16
   Total                                         108.77   100

27  • The ME performance comparison between using the CPU only and using the GPU

   Sequence        Frame rate (fps), AMD CPU   Frame rate (fps), GPU   Speed-up
   Stefan (CIF)    3.04                        31.54                   10.38
   City (4CIF)     0.78                         9.19                   11.78

28  • In this paper, the authors present an efficient block-level parallelized algorithm for variable block size motion estimation using a CUDA GPU.
 • The GPU, acting as a coprocessor, can effectively accelerate massive data computation.

