Independent Component Analysis (ICA) A parallel approach.

Slides:

Advertisements

Similar presentations

A NOVEL APPROACH TO SOLVING LARGE-SCALE LINEAR SYSTEMS Ken Habgood, Itamar Arel Department of Electrical Engineering & Computer Science GABRIEL CRAMER.

Advertisements

Thoughts on Shared Caches Jeff Odom University of Maryland.

Starting Parallel Algorithm Design David Monismith Based on notes from Introduction to Parallel Programming 2 nd Edition by Grama, Gupta, Karypis, and.

SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.

Nequalities Takagi Factorization on a GPU using CUDA Gagandeep S. Sachdev, Vishay Vanjani & Mary W. Hall School of Computing, University of Utah What is.

REAL-TIME INDEPENDENT COMPONENT ANALYSIS IMPLEMENTATION AND APPLICATIONS By MARCOS DE AZAMBUJA TURQUETI FERMILAB May RTC 2010.

DCABES 2009 China University Of Geosciences 1 The Parallel Models of Coronal Polarization Brightness Calculation Jiang Wenqian.

VLDB Revisiting Pipelined Parallelism in Multi-Join Query Processing Bin Liu and Elke A. Rundensteiner Worcester Polytechnic Institute

Independent Component Analysis (ICA) and Factor Analysis (FA)

Complexity 19-1 Parallel Computation Complexity Andrei Bulatov.

Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.

1 Exercise 1 Submission Monday 6 Dec, 2010 Delayed Submission: 4 points every week How would you calculate efficiently the PCA of data where the dimensionality.

Exercise problems for students taking the Programming Parallel Computers course. Janusz Kowalik Piotr Arlukowicz Tadeusz Puzniakowski Informatics Institute.

Lecture 29 Fall 2006 Lecture 29: Parallel Programming Overview.

Performance Evaluation of Hybrid MPI/OpenMP Implementation of a Lattice Boltzmann Application on Multicore Systems Department of Computer Science and Engineering,

GmImgProc Alexandra Olteanu SCPD Alexandru Ştefănescu SCPD.

Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Background: MapReduce and FREERIDE Co-clustering on FREERIDE Experimental.

A COMPARISON MPI vs POSIX Threads. Overview MPI allows you to run multiple processes on 1 host  How would running MPI on 1 host compare with POSIX thread.

CS492: Special Topics on Distributed Algorithms and Systems Fall 2008 Lab 3: Final Term Project.

Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Summary of Contributions Background: MapReduce and FREERIDE Wavelet.

Spiros Papadimitriou Jimeng Sun IBM T.J. Watson Research Center Hawthorne, NY, USA Reporter: Nai-Hui, Ku.

Performance Issues in Parallelizing Data-Intensive applications on a Multi-core Cluster Vignesh Ravi and Gagan Agrawal

Heart Sound Background Noise Removal Haim Appleboim Biomedical Seminar February 2007.

Parallel ICA Algorithm and Modeling Hongtao Du March 25, 2004.

Implementing a Speech Recognition System on a GPU using CUDA

CCGrid 2014 Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression Tekin Bicer, Jian Yin and Gagan Agrawal Ohio State.

Performance Measurement. A Quantitative Basis for Design n Parallel programming is an optimization problem. n Must take into account several factors:

Frank Casilio Computer Engineering May 15, 1997 Multithreaded Processors.

Example: Sorting on Distributed Computing Environment Apr 20,

Integrating and Optimizing Transactional Memory in a Data Mining Middleware Vignesh Ravi and Gagan Agrawal Department of ComputerScience and Engg. The.

April 26, CSE8380 Parallel and Distributed Processing Presentation Hong Yue Department of Computer Science & Engineering Southern Methodist University.

Tao Lin Chris Chu TPL-Aware Displacement- driven Detailed Placement Refinement with Coloring Constraints ISPD ‘15.

CCGrid 2014 Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression Tekin Bicer, Jian Yin and Gagan Agrawal Ohio State.

ICAL GPU 架構中所提供分散式運算之功能與限制. 11/17/09ICAL2 Outline Parallel computing with GPU NVIDIA CUDA SVD matrix computation Conclusion.

Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.

QCAdesigner – CUDA HPPS project

Lecture 3 : Performance of Parallel Programs Courtesy : MIT Prof. Amarasinghe and Dr. Rabbah’s course note.

 The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application.

Efficiency of small size tasks calculation in grid clusters using parallel processing.. Olgerts Belmanis Jānis Kūliņš RTU ETF Riga Technical University.

Big data Usman Roshan CS 675. Big data Typically refers to datasets with very large number of instances (rows) as opposed to attributes (columns). Data.

CDP Tutorial 3 Basics of Parallel Algorithm Design uses some of the slides for chapters 3 and 5 accompanying “Introduction to Parallel Computing”, Addison.

Parallelization Strategies Laxmikant Kale. Overview OpenMP Strategies Need for adaptive strategies –Object migration based dynamic load balancing –Minimal.

Principal Component Analysis (PCA)

CS 732: Advance Machine Learning

Uses some of the slides for chapters 3 and 5 accompanying “Introduction to Parallel Computing”, Addison Wesley, 2003.

Lab Activities 1, 2. Some of the Lab Server Specifications CPU: 2 Quad(4) Core Intel Xeon 5400 processors CPU Speed: 2.5 GHz Cache : Each 2 cores share.

Learning A Better Compiler Predicting Unroll Factors using Supervised Classification And Integrating CPU and L2 Cache Voltage Scaling using Machine Learning.

SCHOOL OF ENGINEERING AND ADVANCED TECHNOLOGY Engineering Project Routing in Small-World Networks.

An Out-of-core Implementation of Block Cholesky Decomposition on A Multi-GPU System Lin Cheng, Hyunsu Cho, Peter Yoon, Jiajia Zhao Trinity College, Hartford,

Analyzing Memory Access Intensity in Parallel Programs on Multicore Lixia Liu, Zhiyuan Li, Ahmed Sameh Department of Computer Science, Purdue University,

Accelerating K-Means Clustering with Parallel Implementations and GPU Computing Janki Bhimani Miriam Leeser Ningfang Mi

Generalized and Hybrid Fast-ICA Implementation using GPU

Introduction to Parallel Computing: MPI, OpenMP and Hybrid Programming

Recognition of biological cells – development

Imageodesy for co-seismic shift study

Parallel ODETLAP for Terrain Compression and Reconstruction

Parallel Density-based Hybrid Clustering

CPU Efficiency Issues.

Genomic Data Clustering on FPGAs for Compression

Multi-Processing in High Performance Computer Architecture:

Methodology & Current Results

Department of Computer Science University of York

CSE8380 Parallel and Distributed Processing Presentation

Hybrid Programming with OpenMP and MPI

Parallelization of Sparse Coding & Dictionary Learning

Multithreading Why & How.

Big Data Analytics: Exploring Graphs with Optimized SQL Queries

Parallel ODETLAP for Terrain Compression and Reconstruction

Memory System Performance Chapter 3

Question 1 How are you going to provide language and/or library (or other?) support in Fortran, C/C++, or another language for massively parallel programming.

Presentation transcript:

Independent Component Analysis (ICA) A parallel approach

Motivation Suppose a person wants to run Image from new.homeschoolmarketplace.com But he doesn’t have a leg We need a computerized mechanism to control this bionic leg

Right side of brain controls left leg, left side controls right leg A characteristic of the brain

Problem we should solve We need to identify the original source signals We need to identify it fast

Problem one - Identifying signal Independent Component Analysis(ICA) Blind Source Separation x = As A -1 x = A -1 As A -1 x = s s = Wx Where W = A -1 X- what we observe A - Mixing matrix(Depends on the source position) S – Unknown original sources x = As We don’t know both (A) and (s) s=Wxs

FastICA algorithm Fast and accurate Problem two - Identifying it fast

FastICA is still slow Dataset sources and 15,000 samples taken within 15 seconds Result - FastICA took about 4,700 seconds to solve this This is about One and Half hours !

We improved FastICA Used parallelism to improve the performance Implemented in threading, parallel processing and hybrid version of threads and processes

FastICA algorithm Pre processing Main loop Post Calculations Centering data Singular Value Decomposition Initialize W matrix Dot products Symmetric Decorrelation Apply non-linear function(g) to the input matrix (x) ex:- Cube(x), tanh(x), log cosh(x) Wx = s

Amdahl's law If we need to improve an algorithm, we need to improve the most time consuming serial part ICA main loop consumed about 90% of total time for 118 source, sample set

Paralleling the FastICA - Threading 118 sources sample data set 118 * 3000 Main loop 118 * * 118 product g(x) 118 * * 118 product g(x) 118 * * 118 product g(x) 118 * * 118 product g(x) Thread * * 118 product g(x) Thread 2 Thread 3 Thread 4Thread 5 Controller collect the matrixes and calculate W until it converges (Apply non-linear functions to the input matrix (x) parallelly)

Paralleling the FastICA - Processes 118 sources sample data set 118 * 3000 Main loop 118 * * 118 product g(x) Process * * 118 product g(x) 118 * * 118 product g(x) 118 * * 118 product g(x) 118 * * 118 product g(x) Process 2 Process 3 Process 4Process 5 Controller collect the matrixes and calculate W until it converges MPI used to distribute data within the cluster

Paralleling the FastICA - Hybrid 118 sources sample data set Main loop Process Controller collect the matrixes and calculate W until it converges Process * 5000 Process * * 118 product g(x ) Thread 1 g(x) 118 * * 118 product g(x ) Thread 2 Thread controller 118 * 5000 Process * * 118 product g(x ) Thread 1 g(x) 118 * * 118 product g(x ) Thread 2 Thread controller 118 * 5000 Process * * 118 product g(x ) Thread 1 g(x) 118 * * 118 product g(x ) Thread 2 Thread controller

To Achieve High Levels of Parallelism Input data decomposition (columns wise distribution) What kind of granularity is matter ? Critical Path The longest directed path between any pair of start and finish nodes Degree of Concurrency Maximum degree of concurrency Average degree of concurrency avg degree of concurrency = total amount of work critical path length

To Achieve High Levels of Parallelism Maximize data locality Minimize volume of data exchange between threads or processes Management of access of shared data

We used following machines to test our solution. Single Node (S) - 4 cores 8 threads High performance computer (HPC/H) - 16 cores and 32 threads MPI Cluster (M) - four single node machines Single Node Computer Intel(R) Core(TM) i GHz cpu MHz : Cache size : 6MB Memory Total: : 4 GB High Performance Computer Intel(R) Xeon(R) CPU E cpu MHz : Cache size : 20MB Memory Total: : 256GB Test data set 118 sources and samples taken from BCI Competition III ( Experimental Setup

Result ImplementationExecution time(Seconds) Maximum Parallelism Single Threaded (S) threads Eigen Thread Enabled(S) threads MPI(C) processes Multithreaded(H) threads Hybrid(C) processes + 8 threads The graph is in log scale

Conclusion We need to identify the original source signals FastICA gives the correct result We need to identify it fast For 300 seconds sample the calculation only took 81 seconds in high performance computer But this is not a feasible solution We need to find a cheap solution : Nvidia-CUDA Future !

Thank you

Basic FastICA Algorithm Input data Preprocessing X-Whiting matrix W-Sphering matrix Feed data into FastICA main loop FastICA main loop If FastICA is converged then this will returnsource matrix Is FastICA converged True False

Multi-threaded FastICA algorithm

MPI and Multithreaded Algorithm

Results

Basic single threaded C++ solution Improved single threaded C++ solution Eigen thread enabled version Threaded ICA algorithm Parallel processing in ICA algorithm Feed the data set for separate processes ICA loop runs paralleled Feed the data set for separate threads ICA loop runs paralleled Threading in parallel processes Feed the data set for separate processes In each process, feed the data set for separate threads We used Eigen library for matrix operations At compile time, we can enable the threading in Eigen matrix calculations Used the compiler optimizations Reused allocated memory locations Passed matrices by reference Changed I/O functions S C S S H C

Technologies/Tools and Libraries