VISUALIZING THE PROTEIN SEQUENCE UNIVERSE E.KOLKER 1 VISUALIZING THE PROTEIN SEQUENCE UNIVERSE L.STANBERRY 1, R.HIGDON 1, W.HAYNES 1, N.KOLKER 1, W.BROOMALL.

Slides:

Advertisements

Similar presentations

Scalable High Performance Dimension Reduction

Advertisements

Presentation at WebEx Meeting June 15,  Context  Challenge  Anticipated Outcomes  Framework  Timeline & Guidance  Comment and Questions.

SALSA HPC Group School of Informatics and Computing Indiana University.

Health Informatics April 8, Geoffrey Fox Associate Dean for Research and Graduate Studies, School of Informatics.

Modeling Functional Genomics Datasets CVM Lesson 3 13 June 2007Fiona McCarthy.

Proteomics Metagenomics and Deterministic Annealing Indiana University October Geoffrey Fox

1 Cyberinfrastructure Framework for 21st Century Science & Engineering (CF21) IRNC Kick-Off Workshop July 13,

SCALABLE PARALLEL COMPUTING ON CLOUDS : EFFICIENT AND SCALABLE ARCHITECTURES TO PERFORM PLEASINGLY PARALLEL, MAPREDUCE AND ITERATIVE DATA INTENSIVE COMPUTATIONS.

Workshop on HPC in India Grid Middleware for High Performance Computing Sathish Vadhiyar Grid Applications Research Lab (GARL) Supercomputer Education.

High Performance Dimension Reduction and Visualization for Large High-dimensional Data Analysis Jong Youl Choi, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox.

Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.

Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.

Major Application: Finding Homologies (C) Mark Gerstein, Yale University bioinfo.mbb.yale.edu/mbb452a.

SCRI, Kolker Lab1 XLDB October 19, 2011 Functional Annotation of the Protein Sequence Universe Eugene Kolker

Parallel Data Analysis from Multicore to Cloudy Grids Indiana University Geoffrey Fox, Xiaohong Qiu, Scott Beason, Seung-Hee.

Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation Seung-Hee Bae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox School.

SALSASALSASALSASALSA Digital Science Center June 25, 2010, IIT Geoffrey Fox Judy Qiu School.

Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,

1 Building National Cyberinfrastructure Alan Blatecky Office of Cyberinfrastructure EPSCoR Meeting May 21,

Panel Session The Challenges at the Interface of Life Sciences and Cyberinfrastructure and how should we tackle them? Chris Johnson, Geoffrey Fox, Shantenu.

Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.

Title: GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes By Peter F. Hallin, Hans-Henrik Stærfeldt, Eva Rotenberg, Tim T. Binnewies,

Network Aware Resource Allocation in Distributed Clouds.

CYBERINFRASTRUCTURE FOR THE GEOSCIENCES High Performance Computing applications in GEON: From Design to Production Dogan Seber.

Remarks on Big Data Clustering (and its visualization) Big Data and Extreme-scale Computing (BDEC) Charleston SC May Geoffrey Fox

IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.

BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.

Presenter: Yang Ruan Indiana University Bloomington

FutureGrid Dynamic Provisioning Experiments including Hadoop Fugang Wang, Archit Kulshrestha, Gregory G. Pike, Gregor von Laszewski, Geoffrey C. Fox.

IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.

The Future of the iPlant Cyberinfrastructure: Coming Attractions.

Genome databases and webtools for genome analysis Become familiar with microbial genome databases Use some of the tools useful for analyzing genome Visit.

Parallel Applications And Tools For Cloud Computing Environments Azure MapReduce Large-scale PageRank with Twister Twister BLAST Thilina Gunarathne, Stephen.

S CALABLE H IGH P ERFORMANCE D IMENSION R EDUCTION Seung-Hee Bae.

SALSA HPC Group School of Informatics and Computing Indiana University.

Community Grids Lab. Indiana University, Bloomington Seung-Hee Bae.

1 Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University.

Multidimensional Scaling by Deterministic Annealing with Iterative Majorization Algorithm Seung-Hee Bae, Judy Qiu, and Geoffrey Fox SALSA group in Pervasive.

Using SWARM service to run a Grid based EST Sequence Assembly Karthik Narayan Primary Advisor : Dr. Geoffrey Fox 1.

Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.

GEOSCIENCE NEEDS & CHALLENGES Dogan Seber San Diego Supercomputer Center University of California, San Diego, USA.

Looking at Use Case 19, 20 Genomics 1st JTC 1 SGBD Meeting SDSC San Diego March Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox

Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics,

Big Data Bioinformatics By: Khalifeh Al-Jadda. Is there any thing useful?!

CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.

SALSASALSASALSASALSA Digital Science Center February 12, 2010, Bloomington Geoffrey Fox Judy Qiu

Parallel Applications And Tools For Cloud Computing Environments CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.

Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

VISUALIZING THE PROTEIN SEQUENCE UNIVERSE E.KOLKER 1 VISUALIZING THE PROTEIN SEQUENCE UNIVERSE L.STANBERRY 1, R.HIGDON 1, W.HAYNES 1, N.KOLKER 1, W.BROOMALL.

Deterministic Annealing and Robust Scalable Data mining for the Data Deluge Petascale Data Analytics: Challenges, and Opportunities (PDAC-11) Workshop.

High Risk 1. Ensure productive use of GRID computing through participation of biologists to shape the development of the GRID. 2. Develop user-friendly.

Yang Ruan PhD Candidate Salsahpc Group Community Grid Lab Indiana University.

High throughput biology data management and data intensive computing drivers George Michaels.

PARALLEL AND DISTRIBUTED PROGRAMMING MODELS U. Jhashuva 1 Asst. Prof Dept. of CSE om.

Distributed Process Discovery From Large Event Logs Sergio Hernández de Mesa {

Smart Grid Big Data: Automating Analysis of Distribution Systems Steve Pascoe Manager Business Development E&O - NISC.

Federal Land Manager Environmental Database (FED) Overview and Update June 6, 2011 Shawn McClure.

INTRODUCTION TO HIGH PERFORMANCE COMPUTING AND TERMINOLOGY.

HPC In The Cloud Case Study: Proteomics Workflow

Joslynn Lee – Data Science Educator

Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions Introduction.

Federal Land Manager Environmental Database (FED)

Applying Twister to Scientific Applications

Biology MDS and Clustering Results

DACIDR for Gene Analysis

Overview Identify similarities present in biological sequences and present them in a comprehensible manner to the biologists Objective Capturing Similarity.

Scalable Parallel Interoperable Data Analytics Library

Adaptive Interpolation of Multidimensional Scaling

VISUALIZING THE PROTEIN SEQUENCE UNIVERSE L. STANBERRY1, R. HIGDON1, W

Towards High Performance Data Analytics with Java

Presentation transcript:

VISUALIZING THE PROTEIN SEQUENCE UNIVERSE E.KOLKER 1 VISUALIZING THE PROTEIN SEQUENCE UNIVERSE L.STANBERRY 1, R.HIGDON 1, W.HAYNES 1, N.KOLKER 1, W.BROOMALL 1, S.EKANAYAKE 2, A.HUGHES 2, Y.RUAN 2, J.QIU 2, E.KOLKER 1, G.FOX 2 1 SEATTLE CHILDREN’S, 2 INDIANA UNIVERSITY ECMLS 2012, 18 June 2012

Outline Visualizing PSU 2  A 4th paradigm problem in biology  Assigning functions (annotating) proteins  Challenge  Our goal  PSU: methods & initial results  Conclusions ECMLS 2012

Grand Challenge of Functional Genomics Visualizing PSU 3  New technologies produce peta- and exabytes of data  Protein Sequence Universe (PSU), the protein sequence space, expand exponentially  EMP, i5K, iPlant, NEON  30% of existing sequenced proteins unannotated  Existing resources overwhelmed, many unsupported: COG, Systers, ClusTr, eggNOG. ECMLS 2012

Ultimate Goal: Annotate All Proteins Visualizing PSU 4 Our approach:  Revitalize, expand & enhance protein annotation resources.  Develop sustainable software framework.  Use HPC and most powerful CI – grids & clouds.  Provide rigorous and reliable tools to annotate protein sequences. ECMLS 2012

COG: Clusters of Orthologous Groups Visualizing PSU 5  COG database was developed by NCBI.  Proteins classified into groups with common function encoded in complete genomes.  Prokaryotes (COG): 66 genomes, 200K proteins, 5K clusters.  Eukaryotes (KOG): 7 genomes, 113K proteins, 5K clusters.  Valuable scientific resource: 5K citations.  Last updated: ECMLS 2012

Clustering 10 million UniRef100 Visualizing PSU 6  UniRef100: 10M proteins including 5.3M bacterial & archaeal  BLAST - common sequence alignment approach  All vs. All alignment on Azure  475 eight-core virtual machines produced 3+ billion filtered records in 6 days ECMLS 2012

Clustering 10 million UniRef100 Visualizing PSU 7  Use prokaryotic COG as a starting point.  Expand COGs ~20 fold (3.5 million proteins).  Cluster 2M proteins into 500K functional groups  Single linkage clustering with MapReduce framework on Hadoop ECMLS 2012

Promise and Challenge of Annotation Visualizing PSU 8  Clustering facilitates mass annotation BUT  Takes considerable efforts and expertise  Multiple cloud systems and compute solutions ECMLS 2012

Public Resources ECMLS 2012Visualizing PSU 9  Struggle to cope with the influx of data  Provide limited interactive and analytic capabilities  Many no longer supported (SYSTERS, CluSTr, COG)  Biological community needs scalable, sustainable and efficient approach to visualize, explore and annotate new data.

Protein Sequence Universe Visualizing PSU 10  PSU Goal: Enhance annotation resources with analytic and visualization (browser) tools.  Project sequence data into 3D using multidimensional scaling (MDS).  MDS interpolation allows expanding the universe without time consuming all vs all O(N 2 )  3D map allows much faster interpolation ECMLS 2012

Multi-Dimensional Scaling (MDS) Visualizing PSU 11  Sammon‘s objective function  is dissimilarity measure between sequences i and j  d is Euclidean distance between projections x i and x j  Denominator: larger contribution from smaller dissimilarities  f is monotone transformation of dissimilarity measure chosen “artistically” ECMLS 2012

Typical Metagenomic MDS ECMLS 2012Visualizing PSU 12

MDS Details ECMLS 2012Visualizing PSU 13  f chosen heuristically to increase the ratio of standard deviation to mean for and to increase the range of dissimilarity measures.  O(n 2 ) complexity to map n sequences into 3D.  MDS can be solved using EM (SMACOF – fastest but limited) or directly by Newton's method (it’s just  2 )  Used robust implementation of nonlinear  2 minimization with Levenberg-Marquardt  3D projections visualized in PlotViz

MDS Details ECMLS 2012Visualizing PSU 14  Input Data: 100K sequences from well-characterized prokaryotic COGs.  Proximity measure: sequence alignment scores  Scores calculated using Needleman-Wunsch  Scores “ sqrt 4D” transformed and fed into MDS  Analytic form for transformation to 4D   ij n decreases dimension n > 1; increases n < 1  “sqrt 4D” reduced dimension of distance data from 244 for  ij to14 for f(  ij )  Hence more uniform coverage of Euclidean space

3D View of 100K COG Sequences Visualizing PSU 15 ECMLS 2012

Implementation Visualizing PSU 16  NW computed in parallel on 100 node 8-core system.  Used Twister (IU) in the Reduce phase of MapReduce  MDS Calculations performed on 768 core MS HPC cluster (32 nodes)  Scaling, parallel MPI with threading intranode  Parallel efficiency of the code approximately 70%  Lost efficiency due memory bandwidth saturation  NW required 1 day, MDS job - 3 days. ECMLS 2012

Cluster Annotation Visualizing PSU 17 COGAnnotationUniref100 COG1131ABC-type multidrug transport system, ATPase component14406 COG1136 ABC-type antimicrobial peptide transport system, ATPase component7306 COG1126ABC-type polar amino acid transport system, ATPase component4061 COG3839ABC-type sugar transport systems, ATPase component4121 COG0444 ABC-type dipeptide/oligopeptide/nickel transport system ATPase comp3520 COG4608ABC-type oligopeptide transport system, ATPase component3074 COG3842ABC-type spermidine/putrescine transport systems, ATPase comp3665 COG0333Ribosomal protein L COG0454Histone acetyltransferase HPA2 and Related acetyltransferases14085 COG0477Permeases of the major facilitator superfamily48590 COG1028Dehydrogenases with different specificities37461 ECMLS 2012

Heatmap of NW vs Euclidean Distances Visualizing PSU 18 ECMLS 2012

Dendrogram of Cluster Centroids Visualizing PSU 19 ECMLS 2012

Selected Clusters Visualizing PSU 20 ECMLS 2012

Heatmap for Selected Clusters Visualizing PSU 21 ECMLS 2012

Future Steps  Comparison Needleman-Wunsch v. Blast v. PSIBlast  NW easier as complete; Blast has missing distances  Different Transformations distance  monotonic function(distance) to reduce formal starting dimension (increase sigma/mean)  Automate cluster consensus finding as sequence that minimizes maximum distance to other sequences  Improve O(N 2 ) to O(N) complexity by interpolating new sequences to original set and only doing small regions with O(N 2 )  Successful in metagenomics  Can use Oct-tree from 3D mapping or set of consensus vectors  Some clusters diffuse? ECMLS 2012Visualizing PSU 22

ECMLS 2012Visualizing PSU 23 Blast 6

ECMLS 2012Visualizing PSU 24 Full Data Blast 6 Original run has 0.96 cut

ECMLS 2012Visualizing PSU 25 Cluster Data Blast 6 Original run has 0.96 cut

26 Use Barnes Hut OctTree originally developed to make O(N 2 ) astrophysics O(NlogN)

27 OctTree for 100K sample of Fungi We use OctTree for logarithmic interpolation

440K Interpolated 28

Conclusions Visualizing PSU 29  Data Knowledge: protein annotation  Overwhelming influx of new sequences  Annotation is an immense challenge.  HPC and advanced analytics needed.  PSU as tool to facilitate annotation:  Interactive visualization and exploration  Integrates info on function, pathways, structure, and environment  MDS preserves grouping structure of protein space  MDS can use different proximities and biological data  Parallel MDS handles large-scale data  MDS interpolation quickly maps new sequences into existing space ECMLS 2012 →

DELSA: Data → Knowledge → Action Visualizing PSU 30 Data-Enabled Life Sciences Alliance International  Collective innovation to tackle modern biological challenges through best computational practices and advanced cyberinfrastructure.  Harness expertise and resources across disciplines  Promote accurate, sustainable, scalable approaches  Facilitate translation of data influx into tangible innovations and groundbreaking discoveries ECMLS 2012

References and Resources Visualizing PSU 31  COG data is available at the NCBI site ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COG0303/  MDS results are available at  All software used to analyze and visualize the data is an open source.  DELSA:  Protein Global Atlas and Data Accessibility Projects ECMLS 2012

Acknowledgements ECMLS 2012Visualizing PSU 32 Grant support  NSF: under DBI: (EK) and (GF)  NIH: 5 RC2 HG (GF); NIGMS grant R01 GM (EK); NIDDK grants U01-DK and U01-DK (EK)

Visualizing PSU 25 Thank you for your attention ECMLS 2012