Generalized and Hybrid Fast-ICA Implementation using GPU


Generalized and Hybrid Fast-ICA Implementation using GPU Presenter: Titus Nanda Kumara

Blind Source Separation (BSS) A computer has no knowledge of the original signals or of how they were mixed, yet we need to recover each original signal separately. This problem is called Blind Source Separation, and its solution is given by Independent Component Analysis (ICA). (Image source: http://music.cs.northwestern.edu)

ICA in one picture Assumptions: We have two recordings to separate two sources. All signals arrive at the same time (no delay differences between them). The amplitudes of the original signals can change, but the mixing factors remain the same (neither the singer nor the saxophone moves). The mixing factors are unknown: Left ear (X1) = 0.8 times saxophone music + 0.5 times voice Right ear (X2) = 0.9 times saxophone music + 0.4 times voice

Independent Component Analysis (ICA) Problem: how do we unmix a mixed signal (x) when we know neither the original sources (s) nor the mixing factors (A)? Solution: assume the mixture is linear and the sources are statistically independent. The problem can then be written as x = As. If we have an estimate of A⁻¹, then A⁻¹x = A⁻¹As, so s = A⁻¹x.
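The identity s = A⁻¹x can be illustrated with the mixing coefficients from the previous slide. Note this is only the unmixing identity, not an ICA estimator: here A is assumed known, whereas real ICA must estimate it from the data.

```python
# Toy illustration of s = A^-1 x using the slide's mixing coefficients.
# ICA itself must estimate A; here A is assumed known for illustration.

def invert_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

A = [[0.8, 0.5],   # left ear:  0.8 * saxophone + 0.5 * voice
     [0.9, 0.4]]   # right ear: 0.9 * saxophone + 0.4 * voice
s = [1.0, -2.0]    # original sources at one time instant (hypothetical values)
x = matvec(A, s)                   # observed mixture
s_hat = matvec(invert_2x2(A), x)   # recovered sources, equal to s
```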

ICA is used in: Separating EEG signals for brain-computer interfaces and other medical or research purposes. Separating magnetoencephalography (MEG) data. Improving the quality of music or sound signals by eliminating cross-talk or noise. Finding hidden or fundamental factors in financial data, such as background currency exchanges or stock market data. ICA is a highly compute-intensive algorithm: when the data size is large, it takes a considerable amount of time to run.

Fast-ICA Proposed by Aapo Hyvärinen at Helsinki University of Technology in the late 1990s. Comparatively fast, accurate, and highly parallelizable. Matrix operations dominate most of the algorithm, which makes it a good starting point for improving performance with a GPU.
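The core of Fast-ICA is a fixed-point iteration. As a sketch (not the presenters' implementation), the one-unit update with the tanh contrast function is w ← E{z·g(wᵀz)} − E{g′(wᵀz)}·w, followed by normalization. The example below sidesteps the whitening step by mixing unit-variance independent sources with an orthogonal rotation, so the mixture is already white; all data values are synthetic.

```python
# Minimal one-unit FastICA fixed-point iteration (a sketch, not the
# presenters' code). Mixing unit-variance independent sources with an
# orthogonal matrix keeps the mixture white, so whitening is skipped.
import math, random

random.seed(0)
n = 8000
theta = 0.7
R = [[math.cos(theta), -math.sin(theta)],   # orthogonal mixing matrix
     [math.sin(theta),  math.cos(theta)]]

# Independent, zero-mean, unit-variance uniform sources.
s = [(random.uniform(-math.sqrt(3), math.sqrt(3)),
      random.uniform(-math.sqrt(3), math.sqrt(3))) for _ in range(n)]
z = [(R[0][0]*a + R[0][1]*b, R[1][0]*a + R[1][1]*b) for a, b in s]

# Fixed-point update: w <- E{z g(w.z)} - E{g'(w.z)} w, then normalize.
w = (1.0, 0.0)
for _ in range(30):
    ez0 = ez1 = eg = 0.0
    for z0, z1 in z:
        y = w[0]*z0 + w[1]*z1
        g = math.tanh(y)          # contrast function
        ez0 += z0 * g
        ez1 += z1 * g
        eg += 1.0 - g * g         # derivative of tanh
    w_new = (ez0/n - (eg/n)*w[0], ez1/n - (eg/n)*w[1])
    norm = math.hypot(*w_new)
    w = (w_new[0]/norm, w_new[1]/norm)

# w now aligns (up to sign) with a column of R, i.e. it extracts one source.
y = [w[0]*z0 + w[1]*z1 for z0, z1 in z]
```

Each iteration is dominated by large matrix-vector products over all p samples, which is exactly the structure the GPU implementation exploits.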

GPUs for General-Purpose Applications (GPGPU Computing) GPGPU frameworks let programmers use the GPU for whatever computation they desire. What is so important about the GPU? CPU: several cores running at around 4 GHz. GPU: thousands of cores running at around 1 GHz. If a task is completely parallel, it can be hundreds or thousands of times faster on the GPU!

Improving the performance of Fast-ICA Divide the algorithm into five sections: Input reading, Pre-processing, Fast-ICA loop, Post-processing, Output writing. Profiling the execution time for matrix sizes of 6 x 8192, 6 x 262144, 100 x 8192, and 100 x 262144 shows that the Fast-ICA loop dominates at 98%~99% of the run time, with pre-processing at 0.5%~1.6% and post-processing at 0.2%~0.3%.

Amdahl's law Following Amdahl's law, we focused on the Fast-ICA loop, which dominates the run time. W matrices are of size n x n (n is the number of sources); Z matrices are of size n x p with p >> n (p is the number of samples).
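Amdahl's law quantifies why the Fast-ICA loop is the right target: if a fraction p of the run time is accelerated by a factor s, the overall speedup is 1 / ((1 − p) + p/s). The sketch below plugs in the profiling fractions from the previous slide; the speedup factors are illustrative assumptions.

```python
# Amdahl's law: overall speedup when a fraction p of the run time
# is accelerated by a factor s.

def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Accelerating the Fast-ICA loop (98% of run time) 50x gives a large win:
loop_gain = amdahl_speedup(0.98, 50.0)       # about 25x overall
# Accelerating a 1.6% section, even infinitely, barely helps:
prep_gain = amdahl_speedup(0.016, 1e12)      # about 1.016x overall
```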

Inside the Fast-ICA loop

Improving the contrast function A custom kernel was written to apply a non-linear function to each element of the matrix. This is a completely parallelizable task.

The contrast function alone is not enough Data must be transferred between RAM and GPU memory over the PCI Express bus, which introduces a delay. This communication delay hides the speed gain.

The contrast function alone is not enough To hide the data-transfer delay and gain performance, a large amount of computation needs to happen on the GPU between transfers.
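A toy cost model makes this concrete (the timing numbers below are illustrative assumptions, not measurements from the slides): offloading wins only when the accelerated compute time saved exceeds the PCIe transfer cost.

```python
# Toy offloading cost model: the GPU wins only if transfer time plus
# accelerated compute time beats the plain CPU compute time.

def offload_wins(cpu_compute_s, gpu_speedup, transfer_s):
    gpu_total = transfer_s + cpu_compute_s / gpu_speedup
    return gpu_total < cpu_compute_s

# Offloading a single 2 ms contrast function over a 5 ms round trip loses:
small_op = offload_wins(0.002, 20.0, 0.005)   # False
# Offloading 100 ms of fused GPU work over the same transfer wins:
fused_ops = offload_wins(0.100, 20.0, 0.005)  # True
```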

Inside the Fast-ICA loop

Improving matrix operations with cuBLAS cuBLAS is the CUDA implementation of the BLAS library. It is highly optimized; in most cases, custom kernels for matrix operations give lower performance than the cuBLAS routines.
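For reference, the core operation behind cublasDgemm is the general matrix-matrix product C ← αAB + βC, sketched below in plain Python (cuBLAS additionally handles transposition flags and column-major storage, which this sketch omits).

```python
# Plain-Python sketch of the GEMM operation that cublasDgemm performs:
# C <- alpha * A @ B + beta * C (transposes and column-major layout omitted).

def gemm(alpha, A, B, beta, C):
    m, k, n = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][t] * B[t][j] for t in range(k)) + beta * C[i][j]
             for j in range(n)] for i in range(m)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 0.0], [0.0, 1.0]]
result = gemm(1.0, A, B, 1.0, C)   # A @ B plus the identity
```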

Acceleration of the complete algorithm Pre-processing: centering and whitening to remove the correlation among the rows of the input (culaDeviceDgesvd and custom kernels). Fast-ICA loop: matrix multiplications and transformations (cublasDgemm and cublasDgeam); contrast function (custom kernels); eigen decomposition (culaDeviceDgeev). Post-processing: matrix multiplication (cublasDgemm).

Running the full algorithm on the GPU Running the full algorithm on the GPU is not always a good idea.

Switching between GPU and CPU When CPU execution is faster, we can switch to the CPU, but the switching points must be chosen carefully because of the memory-copy delay. The right choice depends heavily on the size of the input data.

Data size vs. performance We tested numbers of sources from 2 to 128 and numbers of samples from 1024 to 524288; each section was tested for all combinations. (Chart: regions where the CPU or the GPU is better, for the pre-processing stage.)

Data size vs. performance (Chart: regions where the CPU or the GPU is better, for the ICA main loop.)

Data size vs. performance (Chart: regions where the CPU or the GPU is better.)

Switching between GPU and CPU The switching points depend on the hardware, the data size, and the data-transfer delay. Option 1: profile the program on the target hardware for all data sizes and define fixed boundaries. Option 2: let the program decide at run time, based on previous iterations of the Fast-ICA loop.
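Option 2 can be sketched as a simple runtime probe (an illustrative sketch, not the presenters' code): time one iteration on each backend, then run the rest on whichever was faster.

```python
# Sketch of Option 2: probe each backend once, then commit to the faster one.
# step_cpu and step_gpu stand in for one Fast-ICA iteration on each device.
import time

def run_adaptive(step_cpu, step_gpu, n_iters):
    timings = {}
    for name, step in (("cpu", step_cpu), ("gpu", step_gpu)):
        t0 = time.perf_counter()
        step()                                   # probe iteration
        timings[name] = time.perf_counter() - t0
    best = min(timings, key=timings.get)         # faster backend wins
    step = step_cpu if best == "cpu" else step_gpu
    for _ in range(n_iters - 2):
        step()                                   # remaining iterations
    return best
```

A real implementation would also fold in the memory-copy cost of switching, as the slide notes, and could re-probe periodically if the workload changes between iterations.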

Conclusions Fast-ICA can be executed efficiently on a GPU, but not in all cases. A static program cannot handle every case, because the relative performance of the CPU and the GPU depends on the data size. To gain the maximum performance in all scenarios, the program should intelligently switch between the GPU and the CPU at appropriate points.

Thank you