SCALABLE AND ROBUST DIMENSION REDUCTION AND CLUSTERING

Slides:



Advertisements
Similar presentations
Clustering Overview Algorithm Begin with all sequences in one cluster While splitting some cluster improves the objective function: { Split each cluster.
Advertisements

Parallel BioInformatics Sathish Vadhiyar. Parallel Bioinformatics  Many large scale applications in bioinformatics – sequence search, alignment, construction.
Scalable High Performance Dimension Reduction
SALSA HPC Group School of Informatics and Computing Indiana University.
Analyzing large-scale cheminformatics and chemogenomics datasets through dimension reduction David J. Wild Assistant Professor & Director, Cheminformatics.
Proteomics Metagenomics and Deterministic Annealing Indiana University October Geoffrey Fox
Hybrid MapReduce Workflow Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US.
Collective Collaborative Tagging System Jong Y. Choi, Joshua Rosen, Siddharth Maini, Marlon E. Pierce, and Geoffrey C. Fox Community Grids Laboratory Indiana.
High Performance Dimension Reduction and Visualization for Large High-dimensional Data Analysis Jong Youl Choi, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox.
Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets April 27, 2011.
DNA Microarray Bioinformatics - #27611 Program Normalization exercise (from last week) Dimension reduction theory (PCA/Clustering) Dimension reduction.
Dimension reduction : PCA and Clustering Agnieszka S. Juncker Slides: Christopher Workman and Agnieszka S. Juncker Center for Biological Sequence Analysis.
Dimension reduction : PCA and Clustering by Agnieszka S. Juncker
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal and MULTICLUSTAL Arunesh Mishra CMSC 838 Presentation Authors : Dmitri Mikhailov,
Protein Sequence Classification Using Neighbor-Joining Method
Dimension reduction : PCA and Clustering by Agnieszka S. Juncker Part of the slides is adapted from Chris Workman.
Bioinformatics tools for phylogeny and visualization
Parallel Data Analysis from Multicore to Cloudy Grids Indiana University Geoffrey Fox, Xiaohong Qiu, Scott Beason, Seung-Hee.
Requirements Model Inputs (Test Sequences) Expected outputs Implementation Verdict Author Generate Feedback.
Gantt and PERT charts. Representing and Scheduling Project Plans Gantt Charts Useful for depicting simple projects or parts of large projects Show start.
Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation Seung-Hee Bae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox School.
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
Study of Biological Sequence Structure: Clustering and Visualization & Survey on High Productivity Computing Systems (HPCS) Languages SALIYA EKANAYAKE.
Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
Metagenomic Analysis Using MEGAN4
Generative Topographic Mapping in Life Science Jong Youl Choi School of Informatics and Computing Pervasive Technology Institute Indiana University
Remarks on Big Data Clustering (and its visualization) Big Data and Extreme-scale Computing (BDEC) Charleston SC May Geoffrey Fox
Presenter: Yang Ruan Indiana University Bloomington
Yang Ruan PhD Candidate Computer Science Department Indiana University.
Microbial diversity and virulence probing of five different body sites Anu Rebbapragada, Pub. Health Ontario Central Lab. Canada Wei-Jen Lin, Cal State.
Parallel Applications And Tools For Cloud Computing Environments Azure MapReduce Large-scale PageRank with Twister Twister BLAST Thilina Gunarathne, Stephen.
S CALABLE H IGH P ERFORMANCE D IMENSION R EDUCTION Seung-Hee Bae.
SALSA HPC Group School of Informatics and Computing Indiana University.
Printing: This poster is 48” wide by 36” high. It’s designed to be printed on a large-format printer. Customizing the Content: The placeholders in this.
Community Grids Lab. Indiana University, Bloomington Seung-Hee Bae.
Metagenomic Analysis Using MEGAN4 Peter R. Hoyt Director, OSU Bioinformatics Graduate Certificate Program Matthew Vaughn iPlant, University of Texas Super.
Multidimensional Scaling by Deterministic Annealing with Iterative Majorization Algorithm Seung-Hee Bae, Judy Qiu, and Geoffrey Fox SALSA group in Pervasive.
Service Aggregated Linked Sequential Activities: GOALS: Increasing number of cores accompanied by continued data deluge Develop scalable parallel data.
Deterministic Annealing Dimension Reduction and Biology Indiana University Environmental Genomics April Geoffrey.
Phylogeny and visualization: MEGA and iTOL Yanbin Yin Spring
Parallel Applications And Tools For Cloud Computing Environments SC 10 New Orleans, USA Nov 17, 2010.
SALSA HPC Group School of Informatics and Computing Indiana University.
Looking at Use Case 19, 20 Genomics 1st JTC 1 SGBD Meeting SDSC San Diego March Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox
Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics,
SALSA Group Research Activities April 27, Research Overview  MapReduce Runtime  Twister  Azure MapReduce  Dryad and Parallel Applications 
Parallel Applications And Tools For Cloud Computing Environments CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
SALSASALSASALSASALSA Data Intensive Biomedical Computing Systems Statewide IT Conference October 1, 2009, Indianapolis Judy Qiu
Deterministic Annealing and Robust Scalable Data mining for the Data Deluge Petascale Data Analytics: Challenges, and Opportunities (PDAC-11) Workshop.
Yang Ruan PhD Candidate Salsahpc Group Community Grid Lab Indiana University.
Out of sample extension of PCA, Kernel PCA, and MDS WILSON A. FLORERO-SALINAS DAN LI MATH 285, FALL
SPIDAL Java High Performance Data Analytics with Java on Large Multicore HPC Clusters
Bioinformatics Overview
Volume 76, Issue 4, Pages (April 2011)
Dimension reduction : PCA and Clustering by Agnieszka S. Juncker
Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions  Introduction.
Our Objectives Explore the applicability of Microsoft technologies to real world scientific domains with a focus on data intensive applications Expect.
Biology MDS and Clustering Results
DACIDR for Gene Analysis
Overview Identify similarities present in biological sequences and present them in a comprehensible manner to the biologists Objective Capturing Similarity.
SPIDAL and Deterministic Annealing
Adaptive Interpolation of Multidimensional Scaling
Effective Use of Visual Aids
Parallel Applications And Tools For Cloud Computing Environments
Towards High Performance Data Analytics with Java
Dimension reduction : PCA and Clustering
(A) Tiled view of an ESOM map constructed using all 51 metagenome bins assembled from the samples collected in this study, with the white square encompassing.
Figure Overview.
Visualization of lineage radius increase.
MDS and Visualization September Geoffrey Fox
Presentation transcript:

SCALABLE AND ROBUST DIMENSION REDUCTION AND CLUSTERING Yang Ruan Advised by Geoffrey Fox

Motivation DACIDR Bioinformatics Data Deluge Large Scale Data Clustering Large Scale Date Visualization Enable Faster Observation and Verification >SRR042318.5 GAGTTTAGCCTTGCG… >SRR042318.32 … >SRR042318.70 GAGTTTTAGCCTTGCGG… >SRR042318.81 GTTTAGCCTTGC… <- id <- Sequence DACIDR

Overview of DACIDR Deterministic Annealing Clustering and Interpolative Dimension Reduction Method (DACIDR) Split input set into in-samples and out-of-samples Apply full pairwise clustering and multidimensional scaling on in-samples Use in-sample result to interpolate out-of-samples. Pairwise Clustering All-Pair Sequence Alignment Interpolation Visualization Multidimensional Scaling Simplified Flow Chart of DACIDR

Clustering Visualization Use PlotViz3 to visualize the result in 3D Different identified cluster on in different color DACIDR is parallelized using Twister and MPI Metagenomics hmp16SrRNA COG Protein

Phylogenetic Tree Visualization Reorder the slides to new sections Spherical Phylogram visualized using the phylogenetic tree generated by RaXml using the representative sequences and reference sequences, the color scheme is same as in left figure. RaXml result visualized as Rectangular Phylogram shown in 2D

Flowchart of the Process to Generate Spherical Phylogram