An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data
Changjun Wu, Ananth Kalyanaraman
School of Electrical Engineering and Computer Science, Washington State University

Outline
• Problem Introduction
• Related Work
• Our Parallel Approach for Protein Family Identification
• Experimental Results
• Conclusions & Future Work
• Acknowledgments
(11/19/2008, SC08, Austin, TX)

Metagenomics
• Application of genomics techniques to the study of microbial communities in their natural environments.
• Works without isolation and lab cultivation of individual species.

Protein Family Identification Problem
• Motivation
  • Family identification
  • Functional annotation
  • Diversity of protein family universe
[Figure: known proteins grouped into family 1, family 2, ..., family i; new metagenomic proteins; functional annotation; new protein family]

What is a Protein Family?
• A protein family is a group of evolutionarily (thus functionally) related proteins.
[Figure labels: sequence similarity, domain similarity, structure similarity]

Related Work
• General approach
  • Perform all-against-all sequence comparison (BLAST)
  • Group proteins based on pairwise similarity
• Related work (annotated on the slide as sequential approaches)
  • Kriventseva et al. (2001)
  • Enright et al. (2002)
  • Pipenbacher et al. (2002)
  • Kelil et al. (2007)
  • Yooseph et al. (2007)
  • …

GOS Approach
• Yooseph et al. (2007)
[Figure: redundancy removal, graph generation, dense subgraph detection; annotated with Θ(n^2) space and Ω(n^2) time]

Limitations of Current Approaches
• Constructing large graphs can be time-consuming
  • ~10^6 CPU hours for ~28.6 million proteins (GOS approach)
• Quadratic space requirement
• Brute-force parallel approach

Main Ideas of Our Approach
• Idea #1: A dense subgraph cannot span two connected components
[Figure: dense subgraphs (DS) contained within connected components (CC)]
Use divide and conquer to drastically reduce problem size!
Challenge: find connected components without generating the whole graph

Main Ideas of Our Approach
• Idea #2: Exact-match based filtering technique
[Figure: … bp region at 98% sequence similarity; exact match >= 33 bp]
Eliminate unnecessary all-against-all comparisons!
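
The filtering idea can be motivated by a pigeonhole argument: an alignment with few mismatches must contain a long run of exact matches, so pairs that share no such run need not be aligned at all. The sketch below is illustrative only. It assumes a gap-free alignment, the function name is hypothetical, and the alignment length of 100 in the usage note is an assumption picked for illustration, not a value taken from the slide.

```python
def min_longest_exact_match(aligned_len: int, identity_pct: int) -> int:
    """Lower bound on the longest exact-match run in a gap-free alignment.

    With k mismatches, the matching positions are split into at most k + 1
    runs, so the longest run has at least ceil(matches / (k + 1)) positions.
    """
    # Largest number of mismatches allowed at the given identity threshold.
    mismatches = aligned_len * (100 - identity_pct) // 100
    matches = aligned_len - mismatches
    # Integer ceiling division.
    return -(-matches // (mismatches + 1))
```

Under these assumptions, `min_longest_exact_match(100, 98)` evaluates to 33: two mismatches split 98 matching positions into at most three runs, so one run must be at least 33 long.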

Main Ideas of Our Approach
• Idea #3: High overlap of outlinks => dense subgraph
[Figure: vertices u and v with overlapping outlinks; analogy to a web community]
Use outlink comparison to group vertices into dense subgraphs!

Our Parallel Approach for Protein Family Identification
[Figure: pipeline. Input protein sequences -> redundancy removal -> connected component detection (pairwise sequence homology) -> connected components -> bipartite graph generation -> dense subgraph detection -> dense subgraphs]

Redundancy Removal
• Criteria
  • Similarity of the match is >= 98%
  • >= 95% of the shorter sequence is covered by the match
[Figure: sequences p1 ... p5 indexed in a generalized suffix tree (GST) with a cut-off; a match with >= 98% similarity covering >= 95% of the shorter sequence; uses idea #2]
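
The two criteria can be written as a single predicate. This is a minimal sketch, not the authors' implementation: the function name and arguments are hypothetical, and it assumes the aligner reports how many positions of the shorter sequence the match covers (`match_len`) and how many of those are identical (`identical`).

```python
def is_redundant(match_len: int, identical: int, len_a: int, len_b: int,
                 min_identity: float = 0.98, min_coverage: float = 0.95) -> bool:
    """Return True when a match satisfies both redundancy criteria."""
    shorter = min(len_a, len_b)
    if match_len == 0 or shorter == 0:
        return False
    identity = identical / match_len   # similarity of the match
    coverage = match_len / shorter     # fraction of the shorter sequence covered
    return identity >= min_identity and coverage >= min_coverage
```

For example, a 98-position match with 97 identities between sequences of length 100 and 250 passes both thresholds, while a perfect 90-position match against a 100-residue sequence fails the coverage test.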

Connected Component Detection
[Figure: master-worker layout with GST 1, GST 2, ..., GST p; workers send pairs and alignment results, the master sends work]
Master node (M):
1) manage CC using union-find data structure
2) distribute work in a load-balancing way
Worker node (W):
1) generate pairs
2) sequence alignment
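
The master's bookkeeping can be sketched serially with a union-find structure. The code below is a single-process illustration under stated assumptions, not the parallel master-worker implementation: `candidate_pairs` stands in for the pairs the workers generate, `is_homologous` stands in for the alignment test, and skipping pairs that already share a component is an assumption about how union-find can save alignment work.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def connected_components(n, candidate_pairs, is_homologous):
    """Group n sequences into components; return (components, alignments run)."""
    uf = UnionFind(n)
    alignments = 0
    for a, b in candidate_pairs:
        if uf.find(a) == uf.find(b):
            continue  # already in the same component, so no alignment is needed
        alignments += 1
        if is_homologous(a, b):
            uf.union(a, b)
    groups = {}
    for v in range(n):
        groups.setdefault(uf.find(v), []).append(v)
    return sorted(groups.values()), alignments
```

With six sequences and candidate pairs (0,1), (1,2), (0,2), (3,4) that all pass the homology test, the third pair is skipped because 0 and 2 are already connected, leaving components [0, 1, 2], [3, 4], and [5] after three alignments.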

Bipartite Graph Generation
[Figure: each connected component G(V, E) is converted into a bipartite graph B(V, V, E)]
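
One plausible reading of B(V, V, E) is that every vertex appears once on each side and each undirected edge {u, v} of the component becomes the links u -> v and v -> u. The sketch below encodes that reading as a map from each left-side vertex to its set of right-side outlinks; the function name and the representation are assumptions made for illustration.

```python
def bipartite_from_component(edges):
    """Map each left-side vertex to the right-side vertices it links to."""
    outlinks = {}
    for u, v in edges:
        outlinks.setdefault(u, set()).add(v)  # left copy of u -> right copy of v
        outlinks.setdefault(v, set()).add(u)  # left copy of v -> right copy of u
    return outlinks
```

For the path 1 - 2 - 3 this gives outlinks {1: {2}, 2: {1, 3}, 3: {2}}, which is the form the outlink comparison in the next step operates on.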

Dense Subgraph Detection
• Shingle algorithm
[Figure: for vertices u and v, outlinks(u) and outlinks(v) are permuted and s elements are taken as a shingle; this is repeated c times and the shingles are compared]
s, c: parameters
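
The shingling step can be sketched as min-wise sampling: each of c pseudo-random orderings ranks a vertex's outlinks, and the s lowest-ranked elements form one shingle. This is a generic sketch, not the authors' code. Using seeded BLAKE2 hashes in place of explicit permutations, and the helper names, are assumptions.

```python
import hashlib


def _rank(seed, item):
    """Pseudo-random rank of an item under the ordering identified by seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def shingles(outlinks, s, c):
    """Return c shingles, each the s lowest-ranked outlinks under one ordering."""
    if len(outlinks) < s:
        return set()
    result = set()
    for seed in range(c):
        ranked = sorted(outlinks, key=lambda x: _rank(seed, x))
        result.add((seed, tuple(ranked[:s])))
    return result


def group_by_shingle(graph, s, c):
    """Index vertices by the shingles they produce."""
    index = {}
    for vertex, links in graph.items():
        for sh in shingles(links, s, c):
            index.setdefault(sh, set()).add(vertex)
    return index
```

Vertices with identical outlink sets produce identical shingles, and vertices with heavily overlapping sets are likely to share at least one, so the buckets of `group_by_shingle` serve as candidate groups. Larger s makes a match stricter, and larger c gives more chances to match.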

Dense Subgraph Detection
[Figure: two shingling passes (1st pass, 2nd pass) over B(V, V, E) yield dense subgraphs with vertex sets A and B; A~B]

Qualitative Validation with GOS Data
• 160k data set
• Our results vs. GOS results
Table columns: #input seq, #NR, #CC, #DS, mean degree, mean density, size of largest DS
Table values as extracted (several cells incomplete): 160,000 138,633 1, %13,263 22,18621, %6,828
Precision Rate (PR) = 95.75%
Sensitivity (SE) = 56.89%
Overlap Quality (OQ) = 55.49%

Drastic Work Reduction
• 40k input data
Number of sequence alignments performed:
• All-against-all BLAST: ~800 million
• Our parallel approach: ~8 million
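
The all-against-all figure follows from counting unordered pairs, n(n - 1)/2. The short check below takes 40,000 as the input size and uses the slide's approximate count of ~8 million for the parallel approach; the variable names are illustrative.

```python
from math import comb

n = 40_000
all_pairs = comb(n, 2)            # unordered sequence pairs: n * (n - 1) / 2
reported_alignments = 8_000_000   # approximate figure quoted on the slide
reduction = all_pairs / reported_alignments

print(f"{all_pairs:,} pairs, about {reduction:.0f}x fewer alignments")
```

Here `all_pairs` is 799,980,000, which matches the ~800 million quoted for all-against-all comparison and implies roughly a 100-fold reduction in alignment work.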

Run Time as a Function of Input Size
[Figure: plot not captured in the transcript]

Performance Evaluation
[Figure: plot not captured in the transcript]

Conclusions & Future Work
• Presented a parallel approach for protein family identification
• Quality testing: better “benchmark”
• Parallelization of Shingle algorithm: potential memory problem
• Large-scale application: 28.6 million proteins

Acknowledgments
• Prof. Srinivas Aluru at Iowa State University for BlueGene/L access
• Anonymous reviewers
• Funding: Washington State University Foundation and the Office of Research

Thanks! Questions?