Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms
2013/8/5
Advisor: 周哲民  Student: 陳佑銓
CAD Group, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.

Abstract
Fractal compression is an efficient technique for image and video encoding that uses the concept of self-referential codes. Although it offers compression quality that matches or exceeds traditional techniques, with a simpler and faster decoding process, fractal techniques have not gained widespread acceptance due to the computationally intensive nature of the encoding algorithm. In this paper, we present a real-time implementation of a fractal compression algorithm in OpenCL [1]. We show how the algorithm can be efficiently implemented in OpenCL and optimized for multi-core CPUs, GPUs, and FPGAs.

Introduction (1/3)
Encoding techniques aim to reduce redundancy within image and video data so that the amount of data sent over a channel can be minimized. They are ideal applications for hardware acceleration, since these computationally intensive algorithms typically perform the same instructions on large quantities of data. OpenCL is a platform-independent standard in which data parallelism is explicitly specified. This programming model targets applications in which a host processor offloads computationally intensive portions of code onto an external accelerator.

Introduction (2/3)

Introduction (3/3)
The application is composed of two parts: the host program and the kernel. The host program is the serial portion of the application and is responsible for managing data and the control flow of the algorithm. The kernel program is the highly parallel part of the application to be accelerated on a device such as a multi-core CPU, GPU, or FPGA. The advantage of using OpenCL is that the same code can be easily targeted to different platforms for performance comparisons. Although the implementation still needs to be tuned and adapted to each platform for optimal performance, the evaluation process is much simpler.

Background (1/4)
To facilitate algorithm acceleration, these applications are often partitioned into two hardware components: one component performs the control-flow-heavy portions of the design, while the data-parallel portion is performed by a second component. This application model is often referred to as the "host + accelerator" programming model. In this model, the host is responsible for the control flow, while the accelerator performs the data-parallel tasks.

Background (2/4)
To encode image or video data, the algorithm partitions an image into regions and seeks to find one linear equation that best describes each region. The computationally intensive portion of the application is the encoding; thus, we focus on accelerating the encoding portion of the application. To encode an image, we first create a codebook that contains reference data generated from different regions of the image. For each region to encode, we look through all codebook entries and find the entry that, after applying a scale factor and an offset, best represents the intended region. Compression is thus achieved, since each region can be represented by three values: the best-matching codebook entry ID, the scale factor, and the offset.

Background (3/4)
The first platform evaluated is the multi-core CPU. Its cores can be multi-threaded, further increasing the available parallelism, and multi-core CPUs may have sizeable on-chip caches that are shared by their cores. The second platform under evaluation is the GPU. Fractal encoding seems to be a good fit for the GPU: GPUs naturally work well on image data, and their memory bandwidth is unmatched by the other platforms. Because the algorithm works on small independent regions, the GPU can easily launch thousands of threads with very low demand for shared memory or local registers.

Background (4/4)
The third platform evaluated is the FPGA. In traditional FPGA design, the designer needs to keep track of data at every cycle and know the device architecture well in order to take full advantage of the available circuitry. To alleviate these issues, Altera has introduced an OpenCL-to-FPGA compilation tool. The most important concept behind the OpenCL-to-FPGA compiler is the notion of pipeline parallelism. Unlike multi-core CPUs and GPUs, where many custom hardware units are already available, the FPGA must be more efficient in using its available resources.

Fractal Compression (1/8)
Fractal compression is based on the theory of iterated function systems (IFS) described in [4, 5]. Consider a 2D image I which is N pixels high and N pixels wide. It can be represented as a single-dimensional vector P of N² elements. Fractal theory tells us that such an image can be expressed as a recurrence relationship; in the single-pixel case, this takes the form x = s·x + o, where s is a scale factor and o is a constant offset (Equation 2).

Fractal Compression (2/8)

Fractal Compression (3/8)
The process of decoding the image from the fractal code applies the recurrence relationship repeatedly, starting from any random initial value, as illustrated in Figure 5. There, the initial value of x is chosen to be 10; at each step we apply Equation 2, and the value of x converges to 4. It can be shown that, regardless of the initial value, the system converges to the same value.

Fractal Compression (4/8)
Fractal codes for general images can be expressed in a similar manner to the single-pixel case:

P = A·P + C

In this equation, P is a vector containing the N² pixel values, A is a transformation matrix of N² × N² elements, and C is an offset vector containing N² values. This is a generalization of the scale factor and constant offset discussed in the single-pixel case. The decoding process uses an iterative technique starting from a random image P0:

P(k+1) = A·P(k) + C

For matrices A which are contractive [14], it can be shown that this process converges to (I − A)⁻¹·C.

Fractal Compression (5/8)
Fig. 6: Generation of the first codebook entry.

Fractal Compression (6/8)
For each 8x8 region in the original image, we produce a 4x4 codebook image as shown in Figure 6. This effectively compresses a region of 64 pixel values into a smaller image of 16 pixel values. The reduction is accomplished by averaging the 4 pixel values in each 2x2 sub-region. This conversion of 8x8 regions into 4x4 codebook entries is repeated for each 8x8 region in the original image and forms a library of codebook images. The number of codebook images in this library is (N/8)². The codebook library lets us express an image as a recurrence relationship through the transformation matrix A and offset vector C.
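The downsampling step described above can be sketched directly (a minimal version of the Figure 6 operation; the function name is illustrative):

```c
/* Downsample one 8x8 image region into a 4x4 codebook entry by
 * averaging each 2x2 block of pixels, as described for Figure 6. */
void make_codebook_entry(unsigned char region[8][8], unsigned char entry[4][4]) {
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++) {
            int sum = region[2*y][2*x]     + region[2*y][2*x + 1]
                    + region[2*y + 1][2*x] + region[2*y + 1][2*x + 1];
            entry[y][x] = (unsigned char)(sum / 4);   /* mean of the 2x2 block */
        }
}
```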

Fractal Compression (7/8)

Fractal Compression (8/8)
Consider the original image shown in Figure 7. We can divide this image into (N/4)² 4x4 regions. For a particular region Ri, fractal compression attempts to find the codebook entry Dj that best matches Ri. Rather than simply matching the codebook entry itself, a scale factor si and an offset oi are introduced to find an even closer match. More formally, we seek the j, si, and oi that minimize

Σ |si·Dj + oi − Ri|

over the 16 pixel locations. This computation is essentially a sum of absolute differences (SAD) between Ri and the scaled and offset version of Dj, evaluated for all Dj in the codebook.
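The matching cost above can be sketched as a small helper (names are illustrative; the paper's exact kernel code is not reproduced in this transcript):

```c
#include <math.h>

/* SAD between a 4x4 region R and a codebook entry D after applying
 * scale factor s and offset o: sum over 16 pixels of |s*D[k] + o - R[k]|. */
float sad_scaled(const unsigned char R[16], const unsigned char D[16],
                 float s, float o) {
    float total = 0.0f;
    for (int k = 0; k < 16; k++)
        total += fabsf(s * (float)D[k] + o - (float)R[k]);
    return total;
}
```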

Fractal Encoding Implementation
Our OpenCL implementation of fractal encoding performs the following steps:
1. Read image or video data from a file and process it into frame buffers.
2. Process a single image frame:
   a. Create a codebook.
   b. Run a SAD-based search to find the best matching entries and generate the fractal codes.

Fractal Image Compression
Profiling our first C++ implementation of the encoder showed that 99.8% of the compression time is spent computing the best matching codebook entry using SAD. Therefore, we offload this computation onto an accelerator. The codebook is still computed on the host x86 processor, but it is sent to the accelerator along with the image data. We create a highly data-parallel kernel that returns the best matching entry Dj, scale factor si, and offset oi. To accelerate the codebook search, we launch one work-item (thread) per 4x4 region, each first computing the sum of absolute differences over its 16 pixel locations. The advantage of launching one thread per 4x4 region is that each thread is independent; no shared memory is necessary.
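The per-region work of the kernel can be sketched serially as below. All names are illustrative, and the offset is chosen here by matching the means of the region and the scaled entry; the paper does not specify how si and oi are selected, so that choice is an assumption of this sketch.

```c
#include <math.h>

/* One OpenCL work-item's job, written serially: scan every codebook
 * entry and candidate scale factor for the minimum-SAD match of one
 * 4x4 region, returning the fractal code (entry ID, scale, offset). */
typedef struct { int entry; float scale; float offset; float sad; } FractalCode;

FractalCode match_region(const float *region,    /* 16 pixels */
                         const float *codebook,  /* n_entries * 16 pixels */
                         int n_entries,
                         const float *scales, int n_scales) {
    FractalCode best = { -1, 0.0f, 0.0f, 1e30f };
    for (int j = 0; j < n_entries; j++) {
        const float *D = &codebook[j * 16];
        for (int si = 0; si < n_scales; si++) {
            float s = scales[si];
            /* Assumed heuristic: pick the offset that matches the means. */
            float mr = 0.0f, md = 0.0f;
            for (int k = 0; k < 16; k++) { mr += region[k]; md += D[k]; }
            float o = (mr - s * md) / 16.0f;
            float sad = 0.0f;
            for (int k = 0; k < 16; k++)
                sad += fabsf(s * D[k] + o - region[k]);
            if (sad < best.sad) {
                best.entry = j; best.scale = s; best.offset = o; best.sad = sad;
            }
        }
    }
    return best;
}
```

In the real implementation, one such computation runs per work-item, so thousands of regions are matched in parallel with no inter-thread communication.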

Fractal Video Compression
Our general strategy for video compression is to encode new frames using the codebook of the original frame. To illustrate this concept, we performed a cross-coding experiment using standard image-processing benchmarks, in which the codebook of one image is used to encode a different image. In Table I, the image listed across the top is encoded with the codebook of the image listed in the left column. From this experiment, we see that the average loss in PSNR is just 1 dB.

Fractal Video Compression

Codebook Optimizations (1/2)
To achieve real-time compression speeds, we note that kernel runtime is directly influenced by the codebook size. Inspecting the application, it is immediately clear that the codebook transferred to the kernel can be reduced: instead of sending a pre-scaled copy of the codebook for each of the 7 scale factors, one simple optimization is to retain a single unscaled codebook entry. On the accelerator, the local thread applies the 7 scale factors to each entry before the SAD computation. By doing this, we speed up the transfer of the codebook by 7x. The kernel is correspondingly changed with an additional loop that iterates 7 times, once per scale factor; the overall kernel runtime remains unchanged.

Codebook Optimizations (2/2)

Conclusion
In this paper, we demonstrated how fractal encoding can be described in OpenCL and detailed the optimizations needed to map it efficiently to each platform. We showed that the FPGA's core kernel computation is 3x faster than a high-end GPU while consuming 12% of its power, and 114x faster than a multi-core CPU while consuming 19% of its power.

Thanks for your attention.