Generalized and Hybrid Fast-ICA Implementation using GPU Presenter: [Titus Nanda Kumara]
Blind Source Separation (BSS)
A computer has no idea about the original signals or how they were mixed, yet we need the original signals separately. This is called Blind Source Separation.
Image source: http://music.cs.northwestern.edu
The solution is given by Independent Component Analysis (ICA)
ICA in one picture
Assumptions:
We have two recordings to separate two sources
All signals arrive at the same time (no delay difference between them)
The amplitude of the original signals can change, but the mixing factors remain the same (the singer and the saxophone do not move)
Mixing factors (unknown): 0.8, 0.5, 0.9, 0.4
Left ear (X1) = 0.8 times saxophone music + 0.5 times voice
Right ear (X2) = 0.9 times saxophone music + 0.4 times voice
Independent Component Analysis (ICA)
Problem: how to unmix a mixed signal (x) when we know neither the original sources (s) nor the mixing factors (A)
Solution: assume the mixture is a linear mixture and the sources are independent
The problem can then be written as x = As
If we have an estimate of A^-1:
A^-1 x = A^-1 A s, so s = A^-1 x
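The linear mixing model above can be sketched in a few lines of NumPy. This is an illustration only (the source names and mixing values are taken from the two-ear example on the previous slide); in practice A is unknown and ICA must estimate its inverse.

```python
import numpy as np

# Hypothetical two-source, two-mixture example: x = A s.
rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))   # unknown independent sources
A = np.array([[0.8, 0.5],            # unknown mixing matrix
              [0.9, 0.4]])           # (the "ear" factors from the slide)
x = A @ s                            # observed mixtures

# If we had an estimate of A^-1, unmixing is a single matrix product:
s_hat = np.linalg.inv(A) @ x
assert np.allclose(s_hat, s)
```

ICA's job is exactly to find a good estimate of A^-1 without ever seeing A or s.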
ICA is used in
Separating EEG signals for Brain-Computer Interfaces and other medical or research purposes
Separation of Magnetoencephalography (MEG) data
Improving the quality of music or sound signals by eliminating cross-talk or noise
Finding hidden or fundamental factors in financial data, such as background currency exchanges or stock market data
ICA is a highly compute-intensive algorithm: when the data size is large, it takes a considerable amount of time to run
Fast-ICA
Suggested by Aapo Hyvärinen at Helsinki University of Technology in the late 1990s
Comparatively fast, accurate, and highly parallelizable
Matrix operations are used in most places, so it is a good starting point for improving performance using a GPU
GPUs for General-Purpose Applications (GPGPU Computing)
GPGPU frameworks let programmers use the GPU however they desire
What is so important about the GPU?
CPU: several cores running at around 4 GHz
GPU: thousands of cores running at around 1 GHz
If a task is completely parallel, it can be hundreds or thousands of times faster on the GPU!
Improving the performance of Fast-ICA
Divide the algorithm into five sections:
Input reading
Pre-processing
Fast-ICA loop
Post-processing
Output writing
Execution time for matrix sizes from 6 x 8192 to 100 x 262144:
Pre-processing 0.5%~1.6%, Fast-ICA loop 98%~99%, Post-processing 0.2%~0.3%
Amdahl's law
To improve performance, we focused on the Fast-ICA loop
W matrices: size n x n (n is the number of sources)
Z matrices: size n x p, with p >> n (p is the number of samples)
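Amdahl's law makes the choice of target concrete: since the Fast-ICA loop is 98%~99% of the runtime, accelerating it bounds the achievable overall speedup. A quick calculation (the function below is just Amdahl's formula, not part of the implementation):

```python
def amdahl_speedup(parallel_fraction, n):
    """Overall speedup when the parallel fraction is accelerated n times."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Even with an effectively unlimited GPU speedup of the Fast-ICA loop,
# the overall speedup is capped by the serial 1%~2% of the program:
print(amdahl_speedup(0.98, 1e9))   # ceiling of roughly 50x
print(amdahl_speedup(0.99, 1e9))   # ceiling of roughly 100x
```

This is why accelerating only the loop (and not input/output or the small serial parts) is the right first step.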
Inside the Fast-ICA loop
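The slide's diagram of the loop is not reproduced here. For reference, the standard symmetric FastICA fixed-point iteration (Hyvärinen's method) on whitened data can be sketched in NumPy as follows; this is a generic illustration, not the presenter's GPU code, and it assumes the tanh contrast function:

```python
import numpy as np

def fastica_loop(Z, n_iter=200):
    """Symmetric FastICA on whitened data Z (n x p), tanh contrast."""
    n, p = Z.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((n, n))
    for _ in range(n_iter):
        # Fixed-point update: W <- E[g(WZ) Z^T] - diag(E[g'(WZ)]) W
        G = np.tanh(W @ Z)
        W_new = (G @ Z.T) / p - np.diag((1.0 - G**2).mean(axis=1)) @ W
        # Symmetric decorrelation: W <- (W W^T)^(-1/2) W
        d, E = np.linalg.eigh(W_new @ W_new.T)
        W = E @ np.diag(d ** -0.5) @ E.T @ W_new
    return W
```

Every step is a matrix product, an element-wise nonlinearity, or an eigendecomposition, which is what makes the loop map so well onto GPU libraries.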
Improving the contrast function
A custom kernel was written to apply a nonlinear function to each element of the matrix
This is a completely parallelizable task
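The reason this kernel parallelizes perfectly is that each output element depends on exactly one input element. The commonly used FastICA contrast functions are shown below in NumPy form as an illustration (the slide does not say which nonlinearity the implementation uses):

```python
import numpy as np

# Common Fast-ICA contrast functions, applied element-wise to Y = W @ Z.
# Each output element depends only on one input element, so on a GPU
# every element can be handled by an independent thread.
def g_tanh(y):  return np.tanh(y)
def dg_tanh(y): return 1.0 - np.tanh(y)**2   # derivative of tanh

def g_cube(y):  return y**3
def dg_cube(y): return 3.0 * y**2            # derivative of y^3
```

In the GPU version, each of these would be a one-line device function inside an element-wise CUDA kernel.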
Only the contrast function is not enough
The data must be transferred between RAM and GPU memory over the PCI Express bus, which introduces a delay
This communication delay hides the speed gain
Only the contrast function is not enough
To hide the data-transfer delay and gain performance, we need a large number of computations to happen on the GPU
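A back-of-envelope estimate shows why a single offloaded operation cannot pay for the copy. The numbers below are illustrative assumptions (roughly PCIe 3.0 x16 bandwidth and an assumed GPU GEMM throughput), not measurements from the presented system:

```python
# Rough cost model: moving an n x p double matrix over PCIe
# versus one (n x n) @ (n x p) matrix multiply on the GPU.
def transfer_time_s(n, p, bandwidth_gbs=16.0):   # assumed ~PCIe 3.0 x16
    return n * p * 8 / (bandwidth_gbs * 1e9)     # 8 bytes per double

def gemm_time_s(n, p, gflops=1000.0):            # assumed GPU throughput
    return 2.0 * n * n * p / (gflops * 1e9)      # 2*n*n*p flops

n, p = 100, 262144
print(transfer_time_s(n, p), gemm_time_s(n, p))
```

Under these assumptions the copy costs more than the single multiply, so many operations must be chained on the GPU to amortize each transfer.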
Inside the Fast-ICA loop
Improve matrix operations using cuBLAS
cuBLAS is the CUDA implementation of the BLAS library
It is highly optimized: in most cases, custom kernels for matrix operations give lower performance than cuBLAS routines
Acceleration of the complete algorithm
Pre-processing: centering and whitening to remove the correlation among the rows of the input (culaDeviceDgesvd and custom kernels)
Fast-ICA loop: matrix multiplications and transformations (cublasDgemm and cublasDgeam), contrast function (custom kernels), eigendecomposition (culaDeviceDgeev)
Post-processing: matrix multiplication with cublasDgemm
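The centering and whitening step mentioned above can be sketched in NumPy (the GPU version uses culaDeviceDgesvd and custom kernels; this eigendecomposition-based variant is an equivalent illustration):

```python
import numpy as np

def center_and_whiten(X):
    """Center the rows of X and whiten so the covariance becomes identity."""
    Xc = X - X.mean(axis=1, keepdims=True)     # centering: zero-mean rows
    cov = Xc @ Xc.T / Xc.shape[1]              # n x n covariance
    d, E = np.linalg.eigh(cov)                 # eigendecomposition
    Z = E @ np.diag(d ** -0.5) @ E.T @ Xc      # now Z @ Z.T / p = I
    return Z

X = np.random.default_rng(0).standard_normal((6, 8192))
Z = center_and_whiten(X)
```

After whitening, the rows of Z are uncorrelated with unit variance, which is the precondition the Fast-ICA loop relies on.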
Running the full algorithm in GPU
Running the full algorithm on the GPU is not always a good idea
Switching between GPU and CPU
When CPU execution is faster, we can switch to the CPU
But we should be careful about the switching points because of the memory-copy delay
The best choice depends heavily on the size of the input data
Data size vs. performance
We tested for: number of sources 2 - 128, number of samples 1024 - 524288
Each section was tested for all the combinations
(Chart: regions where the CPU or the GPU is faster - Pre-processing)
Data size vs. performance
(Chart: regions where the CPU or the GPU is faster - ICA main loop)
Data size vs. performance
(Chart: regions where the CPU or the GPU is faster)
Switching between GPU and CPU
The switching points depend on: hardware, data size, data-transfer delay
Option 1: profile the program on the target hardware for all data sizes and define the boundaries in advance
Option 2: let the program decide at run time, based on previous iterations of the Fast-ICA loop
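Option 2 can be sketched as a small run-time probe: time both backends during a few early iterations, then commit to the faster one. The `cpu_step` and `gpu_step` callables below are hypothetical placeholders for one iteration of the Fast-ICA loop on each device (the GPU timing must include any host-device copies):

```python
import time

def pick_backend(cpu_step, gpu_step, data, warmup_iters=3):
    """Time each backend on a few iterations, return the faster one's name."""
    timings = {}
    for name, step in (("cpu", cpu_step), ("gpu", gpu_step)):
        t0 = time.perf_counter()
        for _ in range(warmup_iters):
            step(data)                    # one loop iteration on this backend
        timings[name] = time.perf_counter() - t0
    return min(timings, key=timings.get)
```

In a real implementation the choice would be re-evaluated whenever the matrix dimensions change, since the CPU/GPU crossover point moves with the data size.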
Conclusions
Fast-ICA can be executed efficiently on a GPU, but not in all cases
We cannot write a static program to handle all cases, because the relative performance of the CPU and GPU depends on the data size
The program should intelligently switch between GPU and CPU at the appropriate points to gain maximum performance in all scenarios
Thank you