Yang Ruan, PhD Candidate, SALSA HPC Group, Community Grids Lab, Indiana University


Overview
- Goal: solve the taxonomy-independent analysis problem for millions of sequences with a high-accuracy solution.
- Introduction to DACIDR
- All-pair sequence alignment
- Pairwise clustering and multidimensional scaling
- Interpolation
- Visualization
- SSP-Tree
- Experiments

DACIDR Flow Chart
16S rRNA Data -> All-Pair Sequence Alignment -> Dissimilarity Matrix
Sample Set -> Pairwise Clustering -> Sample Clustering Result
Sample Set -> Multidimensional Scaling -> Target Dimension Result
Out-sample Set -> Heuristic Interpolation -> Target Dimension Result
Target Dimension Result -> Visualization -> Further Analysis

All-Pair Sequence Analysis
- Input: FASTA file; Output: dissimilarity matrix.
- Uses Smith-Waterman alignment to perform local sequence alignment, determining similar regions between two nucleotide or protein sequences.
- Uses percentage identity as the similarity measure.
Example alignment:
ACATCCTTAACAA--ATTGC-ATC-AGT-CTA
ACATCCTTAGC--GAATT--TATGAT-CACCA-
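As a sketch of this step, here is a minimal pure-Python Smith-Waterman with percent identity computed over the local traceback. The match/mismatch/gap scores and the identity definition (matches over aligned columns) are illustrative assumptions, not the deck's actual parameters:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local-alignment score matrix via dynamic programming (0-clamped)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = (0, 0, 0)  # (score, i, j) of the best cell
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return H, best

def percent_identity(a, b):
    """Trace back from the best cell; identity = matches / aligned columns.
    The hardcoded 2/-1/-2 must match smith_waterman's default scores."""
    H, (score, i, j) = smith_waterman(a, b)
    matches = cols = 0
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + (2 if a[i - 1] == b[j - 1] else -1):
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - 2:
            i -= 1
        else:
            j -= 1
        cols += 1
    return matches / cols if cols else 0.0
```

A dissimilarity entry for the matrix would then be derived from the identity (e.g. 1 − identity), though the exact conversion used by DACIDR is not stated on this slide.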


DA Pairwise Clustering
- Input: dissimilarity matrix; Output: clustering result.
- Deterministic annealing (DA) clustering is a robust clustering method.
- Temperature corresponds to a pairwise-distance scale; one starts at high temperature with all sequences in the same cluster.
- As the temperature is lowered, finer distance scales are examined and additional clusters are detected automatically.
- No heuristic input is needed.
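The DA-PWC used here works directly on the dissimilarity matrix; purely to illustrate the annealing idea, the following is a minimal vector-space sketch in which Gibbs responsibilities sharpen as the temperature drops and clusters split out on their own. The cooling schedule and initial temperature are illustrative assumptions:

```python
import numpy as np

def da_cluster(X, k, T0=None, cooling=0.95, T_min=1e-3, inner=20, seed=0):
    """Deterministic-annealing soft clustering (vector-space sketch).

    At high temperature every point is softly shared among nearly coincident
    centroids; lowering T sharpens the Gibbs weights so centroids drift apart
    and finer cluster structure emerges without heuristic seeding.
    """
    rng = np.random.default_rng(seed)
    # Start all centroids near the global mean, slightly perturbed.
    C = X.mean(0) + 1e-3 * rng.standard_normal((k, X.shape[1]))
    T = X.var() if T0 is None else T0
    while T > T_min:
        for _ in range(inner):
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            P = np.exp(-(d2 - d2.min(1, keepdims=True)) / T)  # Gibbs weights
            P /= P.sum(1, keepdims=True)
            C = (P.T @ X) / (P.sum(0)[:, None] + 1e-12)       # centroid update
        T *= cooling
    return d2.argmin(1), C
```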

Multidimensional Scaling
- Input: dissimilarity matrix; Output: visualization result (in 3D).
- MDS is a set of techniques used in dimension reduction.
- Scaling by MAjorizing a COmplicated Function (SMACOF) is a fast iterative majorization (EM-like) method suited to distributed computing.
- DA introduces a temperature into SMACOF, which helps avoid the local-optima problem.
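A minimal SMACOF sketch (uniform weights, no DA temperature) shows the core Guttman-transform iteration; iteration count and initialization are illustrative assumptions:

```python
import numpy as np

def smacof(D, dim=3, iters=200, eps=1e-8, seed=0):
    """SMACOF majorization for metric MDS (uniform weights, sketch).

    D: (n, n) symmetric dissimilarity matrix. Returns an (n, dim) embedding
    whose pairwise distances approximate D; stress is non-increasing.
    """
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dim))
    for _ in range(iters):
        dX = np.linalg.norm(X[:, None] - X[None], axis=-1)  # embedded distances
        np.fill_diagonal(dX, 1.0)                           # avoid divide-by-zero
        B = -D / np.where(dX > eps, dX, eps)
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(1))                      # rows of B sum to zero
        X = B @ X / n                                       # Guttman transform
        X -= X.mean(0)                                      # remove translation drift
    return X
```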


Interpolation
- Input: sample-set result and out-sample-set FASTA file; Output: full-set result.
- SMACOF uses O(N^2) memory, which is a limitation for million-scale datasets.
- MI-MDS finds the k nearest neighbors (k-NN) in the sample set for each sequence in the out-sample set, then uses those k points to determine the out-sample point's dimension-reduction result.
- MI-MDS does not need O(N^2) memory and can be pleasingly parallelized.
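MI-MDS proper uses a majorization update; as a stand-in, the sketch below minimizes the same local stress (distances to the k-NN anchors) with plain gradient descent. The learning rate, iteration count, and function name are illustrative assumptions:

```python
import numpy as np

def interpolate_point(sample_X, dist_to_sample, k=3, iters=500, lr=0.05):
    """Place one out-of-sample point given its distances to the sample set.

    sample_X: (n, d) embedded sample coordinates (e.g. from SMACOF).
    dist_to_sample: (n,) original dissimilarities to each sample point.
    Only the k nearest samples are used, so no O(N^2) matrix is needed.
    """
    nn = np.argsort(dist_to_sample)[:k]            # k-NN in the original space
    P, d = sample_X[nn], dist_to_sample[nn]
    x = P.mean(0)                                  # start at the k-NN centroid
    for _ in range(iters):                         # descend local stress
        diff = x - P                               # (k, d)
        dx = np.linalg.norm(diff, axis=1)
        dx = np.where(dx > 1e-12, dx, 1e-12)
        grad = ((dx - d) / dx)[:, None] * diff     # grad of sum (dx_i - d_i)^2 / 2
        x -= lr * grad.sum(0)
    return x
```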


Visualization
- PlotViz3 is used to visualize the 3D plot generated in the previous step.
- It can show sequence names, highlight interesting points, and even connect remotely to an HPC cluster to run dimension reduction and stream back the result.
(Figure: zoom-in and rotation views.)

SSP-Tree
- The Sample Sequence Partition (SSP) Tree is inspired by the Barnes-Hut tree used in astrophysics simulations.
- The SSP-Tree is an octree that partitions the sample sequence space in 3D.
(Figure: an example SSP-Tree in 2D with 8 points A-H, internal nodes i0-i2, and leaf cells e0-e7.)
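The SSP-Tree is an octree in 3D; mirroring the slide's 2D figure, the same space-partitioning idea can be sketched as a quadtree whose cells split when they overflow. The node capacity is an illustrative assumption:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    cx: float                       # cell center x
    cy: float                       # cell center y
    half: float                     # cell half-width (square cells)
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)

def insert(node, p, capacity=1):
    """Insert a 2D point; split a cell into 4 quadrants when it overflows."""
    if node.children:                           # internal node: route downward
        q = node.children[(p[0] >= node.cx) * 2 + (p[1] >= node.cy)]
        return insert(q, p, capacity)
    node.points.append(p)
    if len(node.points) > capacity:             # overflow: split and re-insert
        h = node.half / 2
        node.children = [Node(node.cx + dx * h, node.cy + dy * h, h)
                         for dx in (-1, 1) for dy in (-1, 1)]
        pts, node.points = node.points, []
        for q in pts:
            insert(node, q, capacity)
```

In 3D the split produces 8 octants instead of 4 quadrants, but the insertion logic is the same.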

Heuristic Interpolation
- MI-MDS has to compare every out-sample point with every sample point to find its k-NN.
- HI-MDS compares against the center point of each tree node, searching for the k-NN from top to bottom.
- HE-MDS searches directly for the nearest terminal node and finds the k-NN within that node or its nearest neighbor nodes.
- Computational complexity: MI-MDS O(NM); HI-MDS O(N log M); HE-MDS O(N(N_T + M_T)).
(Figure: 3D plot after HE-MI.)
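The HE-MDS idea (search only the nearest terminal node and its neighbors, not all M sample points) can be sketched with a uniform grid standing in for the octree's terminal nodes. The cell size and brute-force fallback are illustrative assumptions:

```python
import numpy as np

def build_cells(sample_X, cell=0.25):
    """Bucket sample points into uniform grid cells, standing in for the
    SSP-Tree's terminal nodes."""
    cells = {}
    for i, p in enumerate(sample_X):
        cells.setdefault(tuple((p // cell).astype(int)), []).append(i)
    return cells

def knn_local(sample_X, cells, x, k=3, cell=0.25):
    """Find k-NN by searching only the query's cell and its 8 neighbors,
    instead of scanning all sample points as MI-MDS would."""
    key = tuple((np.asarray(x) // cell).astype(int))
    cand = [i for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            for i in cells.get((key[0] + dx, key[1] + dy), [])]
    if len(cand) < k:                 # sparse region: fall back to brute force
        cand = list(range(len(sample_X)))
    cand = np.array(cand)
    d = np.linalg.norm(sample_X[cand] - x, axis=1)
    return cand[np.argsort(d)[:k]]
```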

Region Refinement
- Terminal nodes are divided into three types: V (inter-galactic void), U (undecided node), and G (decided node).
- A heuristic H(t) determines whether a terminal node t should be assigned to V.
- A fraction function F(t) determines whether a terminal node t should be assigned to G.
- Center points of each terminal node t are updated at the end of each iteration.
(Figures: before and after refinement.)

Recursive Clustering
- DACIDR creates an initial clustering result W = {w_1, w_2, w_3, ..., w_r}.
- Interesting structure may still exist inside each mega-region, so each region is clustered again:
  w_1 -> W_1' = {w1_1', w1_2', w1_3', ..., w1_r1'}
  w_2 -> W_2' = {w2_1', w2_2', w2_3', ..., w2_r2'}
  w_3 -> W_3' = {w3_1', w3_2', w3_3', ..., w3_r3'}
  ...
  w_r -> W_r' = {wr_1', wr_2', wr_3', ..., wr_rr'}
(Figure: mega-region 1 before and after recursive clustering.)
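The recursion above can be sketched generically; here `cluster_fn` is any stand-in that maps a list of items to labeled regions (the actual inner step is DA-PWC), and the size threshold and depth limit are illustrative assumptions:

```python
def recursive_cluster(items, cluster_fn, min_size=10, max_depth=3, depth=0):
    """Cluster, then re-cluster inside each region that is still large,
    labeling sub-regions hierarchically (w1 -> w1.1, w1.2, ...)."""
    regions = cluster_fn(items)                  # {label: [items]}
    if depth + 1 >= max_depth:
        return regions
    out = {}
    for label, members in regions.items():
        if len(members) <= min_size:
            out[label] = members                 # small enough: keep as-is
        else:                                    # mega-region: recurse inside
            sub = recursive_cluster(members, cluster_fn, min_size,
                                    max_depth, depth + 1)
            for s, m in sub.items():
                out[f"{label}.{s}"] = m
    return out
```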

Experiments
- Task: clustering and visualization.
- Input data: 1.1 million 16S rRNA sequences.
- Environment: 32 nodes (768 cores) of Tempest and 100 nodes (800 cores) of PolarGrid.
- Comparisons: SW vs. NW; PWC vs. UCLUST/CD-HIT; HE-MI vs. MI-MDS/HI-MI.

SW vs. NW
- Smith-Waterman performs local alignment, while Needleman-Wunsch performs global alignment.
- Global alignment suffers from the widely varying sequence lengths in this dataset.
(Figures: SW and NW embeddings.)

PWC vs. UCLUST/CD-HIT
- PWC gives a much more robust result than UCLUST and CD-HIT.
- PWC correlates highly with the visualized clusters, while UCLUST/CD-HIT correlate worse.
(Table: PWC vs. UCLUST vs. CD-HIT at hard-cutoff thresholds, with singleton-cluster counts in parentheses. Number of A-clusters: (10) 288(77) 618(208) 134(16) 375(95) 619(206). Number of A-clusters in one V-cluster: (10) 279(77) 614(208) 131(16) 373(95) 618(206). Rows for clusters uniquely identified and shared A-clusters: values not recoverable.)
(Figures: PWC and UCLUST cluster visualizations.)

HE-MI vs. MI-MDS/HI-MI
- Input set: a 100k subset of the 1.1 million sequences.
- Environment: 32 nodes (256 cores) of PolarGrid.
- Compared running time and normalized stress value. (Figure: time and stress comparison.)

Conclusion
- Using the SSP-Tree inside DACIDR, we successfully clustered and visualized 1.1 million 16S rRNA sequences.
- The distance measure matters for clustering and visualization: SW and NW give quite different answers when sequence lengths vary.
- DA-PWC performs much better than UCLUST/CD-HIT.
- HE-MI has slightly higher stress than MI-MDS but is almost 100 times faster, which makes it suitable for massive-scale datasets.

Future Work
- Experiment with different techniques for choosing a reference set from the existing clustering result.
- To study the taxonomy-independent clustering result further, generate phylogenetic trees from sequences selected from known samples.
- Visualize the phylogenetic tree in 3D to help biologists better understand it.

Questions?