ACCELERATING SPARSE CANONICAL CORRELATION ANALYSIS FOR LARGE BRAIN IMAGING GENETICS DATA Jingwen Yan, Hui Zhang, Lei Du, Eric Wernert, Andrew J. Saykin, Li Shen

Presentation transcript:

ACCELERATING SPARSE CANONICAL CORRELATION ANALYSIS FOR LARGE BRAIN IMAGING GENETICS DATA Jingwen Yan, Hui Zhang, Lei Du, Eric Wernert, Andrew J. Saykin, Li Shen

OUTLINE
- Imaging Genetics
- Sparse Canonical Correlation Analysis (SCCA)
- Computational Challenges and Methods
- Data Simulation
- Experimental Results

IMAGING GENETICS
Figure labels: Genes, Cells, Systems, Behavior (disorders, complex interactions, phenomena, diseases). Credit: UCI, S. Potkin et al.

IMAGING GENETICS
Underlying biological pathway and mechanism

IMAGING GENETICS
Figure: prior studies arranged by imaging scope (single ROI, circuit, whole brain) and genetic scope (candidate gene/SNP, biological pathway, genome-wide). Studies shown:
- Risacher et al 2010; Sloan et al 2010
- Potkin et al 2009; Saykin et al 2010
- Risacher et al 2013, AV45 ROIs & APOE
- Swaminathan et al 2012, PiB ROIs & amyloid pathway
- Potkin et al 2009 Mol Psych, schizophrenia study
- Ho et al 2010, FTO; Reiman et al PNAS 2009
- Chiang et al 2012, SNP/gene networks & WM integrity
- Shen et al 2010, ROIs; Stein et al 2010, voxels

SCCA
Figure: schematics relating variables X1, X2, X3, ..., Xn to Y1, Y2, Y3, ..., Yn under three approaches:
- Massive Univariate Analysis
- Multivariate Multiple Regression (W'X)
- Canonical Correlation Analysis (Xu, Yv)
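For reference, one common way to write the sparse CCA problem behind the Xu and Yv notation is given here. This is an assumption: the slides do not state the exact objective or penalties used. X is taken to be the n-by-p genotype matrix and Y the n-by-q imaging matrix, both column-centered; u and v are the canonical weight vectors; c_1 and c_2 are hypothetical sparsity budgets.

```latex
\max_{u,\,v}\; u^{\top} X^{\top} Y v
\quad \text{subject to} \quad
\|Xu\|_2^2 \le 1,\;\;
\|Yv\|_2^2 \le 1,\;\;
\|u\|_1 \le c_1,\;\;
\|v\|_1 \le c_2
```

Without the two L1 constraints this reduces to ordinary CCA; the L1 constraints are what push many entries of u and v to exactly zero, so that only a subset of SNPs and voxels is selected.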

COMPUTATIONAL CHALLENGES
Example SCCA run at a small scale
- Participants: 1,000
- Genotype: 3,200 SNPs
- Phenotype: 10,000 voxels
- Permutation: 10,000 permutation tests
- Running time: more than 12,000 hours
Scale up
- Genotype (array): 6M SNPs
- Genotype (NGS): 40M variants
- Phenotype: 200K voxels; imaging, cognitive, and biomarker measures
- Permutation: 10M permutations to reach p = 10^-7
Parameter tuning via cross-validation
- 10-fold cross-validation coupled with an 11-by-11 grid search
- SCCA runs: 10 × 11 × 11 = 1,210
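The counts on this slide can be checked with a few lines of arithmetic. The sketch below assumes the usual "+1" form of the empirical permutation p-value, which the slides do not specify; the function names are illustrative only.

```python
import numpy as np

def empirical_p_value(observed, permuted):
    """Permutation p-value with the common +1 correction (an assumption here)."""
    permuted = np.asarray(permuted)
    return (1 + np.sum(permuted >= observed)) / (1 + permuted.size)

def smallest_attainable_p(n_permutations):
    """Finest p-value resolvable when no permuted statistic beats the observed one."""
    return 1.0 / (1 + n_permutations)

# Tuning workload from the slide: 10 folds times an 11-by-11 parameter grid.
folds, grid_rows, grid_cols = 10, 11, 11
scca_runs = folds * grid_rows * grid_cols
print("SCCA runs for tuning:", scca_runs)

# Resolution of the permutation test at the two scales mentioned on the slide.
for n_perm in (10_000, 10_000_000):
    print(n_perm, "permutations -> smallest p:", smallest_attainable_p(n_perm))
```

With 10,000 permutations the smallest reportable p-value is on the order of 10^-4, which is why reaching 10^-7 needs roughly 10M permutations.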

ACCELERATION WITH MKL
Intel Math Kernel Library (MKL)
- Accelerates application performance and reduces development time
- Highly vectorized and threaded linear algebra, fast Fourier transform (FFT), vector math, and statistics functions
MKL has been optimized to utilize
- Multiple processing cores
- Wider vector units
- The more varied architectures available in a high-end system
MKL can provide parallelism transparently and speed up programs that use supported math routines, without code changes.
Compiling R with MKL

ACCELERATION WITH OFFLOAD MODEL
Xeon Phi SE10P Coprocessor
- 61 cores with 8GB GDDR5
- Intel x86 instruction set
- Supports familiar programming models, software, and tools
Pros
- The host system can offload part of its computing workload to the Xeon Phi
- The Xeon Phi can independently run a compatible program

COMPUTATIONAL PLATFORM
Texas Advanced Computing Center Stampede cluster
MKL + offload
Each computing node
- Two Intel Xeon E processors, each with eight cores
- 32GB DDR3 memory
- Xeon Phi SE10P Coprocessor with 61 cores and 8GB GDDR5
- NVIDIA K20 GPUs with 5GB of on-board GDDR5
Software
- CentOS 6.3
- Stock R 3.01 package compiled with the Intel compilers (v.13) and built with MKL v.11

SYNTHETIC DATA (GENETICS)
FREGENE genome simulator
- Simulates sequence-like data over large genomic regions in large diploid populations
Simulated data
- N = 1,000 diploid individuals over 20,000 generations
- 10 Mb genome with an average mutation rate of 2.5e-8 per site per generation
- 3,274 SNPs with minor allele frequency (MAF) greater than 0.05 included
- Four SNP data sets (g500, g1000, g2000, and g3274), formed by taking the first 500, 1,000, 2,000, and 3,274 SNPs from the entire data, respectively
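The MAF filter and the nested SNP subsets can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: genotypes are assumed to be coded as 0/1/2 allele counts, the toy matrix is randomly generated rather than produced by FREGENE, and the function names are hypothetical.

```python
import numpy as np

def minor_allele_frequency(genotypes):
    """MAF per SNP for an (individuals x SNPs) matrix of 0/1/2 allele counts."""
    freq = genotypes.mean(axis=0) / 2.0      # frequency of the counted allele
    return np.minimum(freq, 1.0 - freq)      # fold to the minor allele

def build_snp_subsets(genotypes, sizes, maf_threshold=0.05):
    """Keep SNPs with MAF above the threshold, then take the first k of them."""
    maf = minor_allele_frequency(genotypes)
    common = genotypes[:, maf > maf_threshold]
    # Sizes larger than the number of retained SNPs are skipped.
    return {f"g{k}": common[:, :k] for k in sizes if k <= common.shape[1]}

rng = np.random.default_rng(0)
# Hypothetical toy data: 1,000 individuals, 4,000 candidate SNPs.
allele_freq = rng.uniform(0.0, 0.5, size=4000)
genotypes = rng.binomial(2, allele_freq, size=(1000, 4000))

subsets = build_snp_subsets(genotypes, sizes=(500, 1000, 2000))
for name, data in subsets.items():
    print(name, data.shape)
```

Because each subset takes the first k retained SNPs, the smaller sets are nested inside the larger ones, which keeps timing comparisons across problem sizes on overlapping data.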

SYNTHETIC DATA (IMAGING)

RESULTS
R snowfall package (sfLapply) with MKL and offload model
Figure: baseline vs. parallel (MKL + offload)
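The sfLapply pattern is a parallel map over independent tasks. A rough Python analogue is sketched below; it is not the authors' R code. It uses a thread pool to stay self-contained, whereas snowfall distributes work to separate worker processes, and `run_one_setting` is a hypothetical placeholder that stands in for one SCCA fit.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

import numpy as np

def run_one_setting(setting):
    """Placeholder for one SCCA fit; returns the task and a dummy score."""
    fold, c1, c2 = setting
    rng = np.random.default_rng(fold)
    x = rng.standard_normal((50, 20))
    y = rng.standard_normal((50, 30))
    # The matrix product is the kind of call an optimized BLAS would handle.
    score = float(np.linalg.norm(x.T @ y)) * c1 * c2
    return setting, score

# 10 folds times an 11-by-11 grid of (hypothetical) sparsity parameters.
grid = np.linspace(0.1, 1.0, 11)
tasks = list(product(range(10), grid, grid))

# pool.map returns results in task order, like lapply/sfLapply.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_one_setting, tasks))

print(len(results), "tasks completed")
```

Each task is independent, so the same structure applies whether the workers are threads, processes, or cluster nodes; only the pool changes.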

RESULTS
Correlation coefficient between the first pair of canonical components
- The accelerated SCCA implementations yielded the same results as the baseline
- These correlation coefficients are close to the ground-truth value of 1
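To make the reported quantity concrete, the sketch below fits one pair of sparse canonical weights and computes the correlation between the resulting components Xu and Yv. It uses alternating soft-thresholding on the cross-covariance matrix, a widely used SCCA heuristic; the slides do not say which SCCA algorithm was run, so the algorithm, the toy data, and the threshold values are all assumptions.

```python
import numpy as np

def standardize(M):
    """Center each column and scale it to unit variance."""
    return (M - M.mean(axis=0)) / M.std(axis=0)

def soft_threshold(a, lam):
    """Shrink entries toward zero; entries with |a| <= lam become exactly 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def unit_norm(w):
    """Scale w to unit L2 norm (left unchanged if it is all zeros)."""
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

def sparse_cca(X, Y, lam_u, lam_v, n_iter=50):
    """First pair of sparse canonical weights via alternating soft-thresholding."""
    C = X.T @ Y / X.shape[0]                          # cross-covariance, p x q
    v = np.linalg.svd(C, full_matrices=False)[2][0]   # leading right singular vector
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = unit_norm(soft_threshold(C @ v, lam_u))
        v = unit_norm(soft_threshold(C.T @ u, lam_v))
    return u, v

# Hypothetical toy data: the first k columns of X and Y share one latent signal.
rng = np.random.default_rng(0)
n, p, q, k = 200, 50, 60, 5
z = rng.standard_normal(n)
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
X[:, :k] = z[:, None] + 0.1 * rng.standard_normal((n, k))
Y[:, :k] = z[:, None] + 0.1 * rng.standard_normal((n, k))
X, Y = standardize(X), standardize(Y)

u, v = sparse_cca(X, Y, lam_u=0.8, lam_v=0.8)
rho = np.corrcoef(X @ u, Y @ v)[0, 1]
print("nonzero weights:", np.count_nonzero(u), np.count_nonzero(v))
print("correlation of first canonical pair:", rho)
```

On this toy data the shared signal is strong, so the correlation should come out close to 1 and most weights should be exactly zero. Comparing `rho` between a baseline and an accelerated run is one simple way to check that both produce the same result.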

RESULTS

CONCLUSION
- Initial steps to accelerate the SCCA implementation for brain imaging genetics applications.
- Parallelism was achieved at the system implementation level, accelerating linear algebra computation with the Math Kernel Library (MKL) and partially offloading the computing workload.
- The 2-fold speedup, although encouraging, is still insufficient to handle extremely large-scale neuroimaging genetics data with millions of image voxels and millions of SNPs.
Future work
- Big data analytic strategies at the parallel computing model level
- Parallelization of multiplicative algorithms using MapReduce and CUDA
- Application to accelerate enhanced SCCA models as well as other bi-multivariate statistical models for analyzing brain imaging genetics data

ACKNOWLEDGEMENT
This research was supported by NIH R01 LM, NIH U01 AG, NIH RC2 AG, NIH R01 AG19771, NIH P30 AG10133, and NSF IIS.

Thank you