Looking at Use Cases 19, 20: Genomics. 1st JTC 1 SGBD Meeting, SDSC San Diego, March 19, 2014. Judy Qiu, Shantenu Jha (Rutgers), Geoffrey Fox

Presentation transcript:

Looking at Use Cases 19, 20: Genomics. 1st JTC 1 SGBD Meeting, SDSC San Diego, March 19, 2014. Judy Qiu, Shantenu Jha (Rutgers), Geoffrey Fox. School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

We are working, to varying degrees, on several Use Cases with HPC-ABDS:
Use Case 10 Internet of Things: Yarn, Storm, ActiveMQ
Use Cases 19, 20 Genomics: Hadoop, Iterative MapReduce, MPI; much better analytics than Mahout
Use Case 26 Deep Learning: high-performance distributed GPU (optimized collectives) with Python front end (planned)
Variant of Use Cases 26, 27 Image classification using Kmeans: Iterative MapReduce
Use Case 28 Twitter: optimized index for HBase, Hadoop and Iterative MapReduce
Use Case 30 Network Science: MPI and Giraph for network structure and dynamics (planned)
Use Case 39 Particle Physics: Iterative MapReduce (wrote proposal)
Use Case 43 Radar Image Analysis: Hadoop for multiple individual images, moving to Iterative MapReduce for global integration over "all" images
Use Case 44 Radar Images: running on Amazon

DACIDR for Gene Analysis (Use Cases 19, 20)
Deterministic Annealing Clustering and Interpolative Dimension Reduction method (DACIDR). Use Hadoop for pleasingly parallel applications, and Twister (being replaced by Yarn) for iterative MapReduce applications.
Simplified flow chart of DACIDR: All-Pair Sequence Alignment, then Streaming Pairwise Clustering and Multidimensional Scaling, then Visualization; Sequences reduce to Cluster Centers; add existing data and find the Phylogenetic Tree.
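The flow above can be sketched as a short pipeline. The stage implementations below are toy stand-ins added for illustration only (integer "sequences", absolute-difference distances, threshold clustering); they are not the actual DACIDR code, and the MDS and visualization stages are omitted:

```python
def all_pair_alignment(seqs):
    # Toy stand-in for Smith-Waterman distances; the real stage is
    # pleasingly parallel and runs under Hadoop in DACIDR.
    return {(i, j): abs(seqs[i] - seqs[j])
            for i in range(len(seqs)) for j in range(len(seqs))}

def pairwise_clustering(seqs, dists, cut):
    # Toy stand-in for Deterministic Annealing pairwise clustering:
    # greedy single-link grouping by a distance threshold.
    labels, next_label = {}, 0
    for i in range(len(seqs)):
        for j in range(i):
            if dists[(i, j)] <= cut:
                labels[i] = labels[j]
                break
        else:
            labels[i] = next_label
            next_label += 1
    return labels

def dacidr_sketch(seqs, cut=2):
    """Simplified DACIDR flow: all-pair alignment (pleasingly parallel,
    Hadoop) feeding pairwise clustering (iterative, Twister/Yarn)."""
    dists = all_pair_alignment(seqs)
    return pairwise_clustering(seqs, dists, cut)

print(dacidr_sketch([1, 2, 10, 11]))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

The point of the sketch is the data flow, not the algorithms: the expensive all-pair stage is embarrassingly parallel, while the analysis stages are iterative, which is why the deck pairs Hadoop with an iterative MapReduce runtime.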

Started project ~3.5 years ago. Note we found two subtle errors (in retrospect, an incorrect parameter/heuristic used at a stage of the analysis), so the computationally intense part of the project, with its many steps, ran through two or three times. Our biology collaborators had never done this type of analysis, and this required many discussions at various times. We are about to start a new analysis of new data with a factor of 5 more data; we are looking for a large enough computer! Note that many steps were taken piecemeal because we lacked computer resources.

Comments on Method
Use a mix of O(N²) and O(N) methods to save initial time: 100K² < 446K²
Divide the full sample into distinct regions for the second phase, again to reduce computational intensity
Always visualize results, so there is a lot of "manual workflow"; visualization finds errors(!) and improves heuristic steps such as hierarchical clustering and center determination
Jobs ran on up to 768 cores with all analytics parallelized
Lots of metadata for intermediate results
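The savings from the two-phase design can be checked with simple arithmetic: running the O(N²) stages on the 100K sample instead of all 446K sequences, and then per mega-region in phase two. The particular region sizes below are hypothetical, chosen only to sum to 446K within the 10K-125K range quoted for the mega-regions:

```python
full, sample = 446_000, 100_000

# Phase 1: O(N^2) alignment/MDS on the 100K sample instead of all 446K.
savings = (full * full) / (sample * sample)
print(f"phase-1 O(N^2) work reduced ~{savings:.0f}x")  # ~20x

# Phase 2: per-region O(n^2) passes instead of one global O(N^2) pass.
# Hypothetical mega-region sizes within the quoted 10K-125K range:
regions = [125_000, 100_000, 80_000, 60_000, 45_000, 36_000]
region_work = sum(n * n for n in regions)
print(f"phase-2 work is {region_work / (full * full):.2f} of a full pass")
```

So even before parallelization, sampling cuts the first phase by roughly 20x, and regionalization cuts the second phase to about a fifth of a single global all-pairs pass.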

100K with Megaregions

Full 446K Clustered

Megaregion 0 126K Sequences

Process (on web page) Part 1
1) Pick a 100K random sample (allreads_uniques_gt200_440641_random_100k_0.txt) from the unique sequences with lengths greater than 200
2) The remaining sequences are called out-sample sequences
3) Run pairwise local sequence alignment (Smith-Waterman) on 1.
4) Run DA (Deterministic Annealing)-SMACOF on 3. O(N²)
5) Run Deterministic Annealing pairwise clustering on 3.
6) Produce plot from 4. and 5.
7) Refine 6. for spatially compact Mega-regions
8) Assign out-sample sequences in 2. to the Mega-regions of 7. based on a nearest-neighbour approach O(N)
9) Extract sequences for each such Mega-region from the full unique sequence set

Process (on web page) Part 2
10) For each Mega-region sequence set (mega-region sizes range from 10K to 125K):
  10.1) Run pairwise local sequence alignment (Smith-Waterman)
  10.2) Run DA (Deterministic Annealing)-SMACOF on 10.1
  10.3) Run MDSasChisq in SMACOF mode on 10.1 with the sample points of the region fixed to their locations from 4.
  10.4) Run Deterministic Annealing pairwise clustering on 10.1
  10.5) Produce plot from 10.2 and 10.4
  10.6) While refinements are necessary for 10.5:
    10.6.1) Extract distances for the necessary sub-clusters
    10.6.2) Run Deterministic Annealing pairwise clustering on 10.6.1
    10.6.3) Merge results of 10.6.2 with 10.4
    10.6.4) Produce plot from 10.2 and 10.6.3
    10.6.5) Go to 10.6
  10.7) Produce final region-specific plot from 10.2 and 10.6
  10.8) Produce final region-specific fixed-run plot from 10.3 and 10.6
  10.9) Classify clusters into three groups (see cluster status) for clusters in 10.7
  10.10) Find cluster centers for each clean cluster in the MDSs of 10.7 and 10.8

Last Step
Take the 126 centers of clusters. Combine with existing data and compare using MDS on ~2500 sequences. Biologists choose a subset of the existing data and refine sequences using Multiple Sequence Alignment (which is viable on ~1000 sequences). The final result is a Phylogenetic Tree comparing the different methods.

Technologies used in analysis
Twister Iterative MapReduce
MPI.NET
Dimension Reduction with Deterministic Annealing SMACOF
Dimension Reduction by χ² (later replaced by a method below)
New WDA-SMACOF, outperforming previous methods
Dimension Reduction by Interpolation
Clustering by Deterministic Annealing Pairwise Techniques, including center determination
Smith-Waterman-Gotoh Distance Computation
.NET Bio (formerly Microsoft Biology Foundation)
Phylogram determination: RAxML, the standard package to find phylogenetic trees
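The distance computation named above is Smith-Waterman-Gotoh. A minimal textbook Smith-Waterman with a linear gap penalty (the Gotoh variant adds affine gap handling on top of the same recurrence) illustrates the local-alignment scoring involved:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    # Classic local-alignment score with a linear gap penalty. H[i][j] is
    # the best score of an alignment ending at a[i-1], b[j-1]; the 0 in
    # the max lets an alignment restart anywhere (the "local" part).
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGTACGT", "ACGT"))  # 8: exact 4-base local match
```

Note this O(len(a)·len(b)) recurrence per pair is what makes the all-pair stage O(N²) in the number of sequences and the dominant cost the pipeline works to contain.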

Summarize a Million Fungi Sequences: Spherical Phylogram Visualization
RAxML result visualized in FigTree; Spherical Phylogram from the new MDS method visualized in PlotViz.

Technology Innovations
Better Clustering
Better Center Determination
New 3D Phylograms
Better MDS