An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data
Changjun Wu, Ananth Kalyanaraman
School of Electrical Engineering and Computer Science, Washington State University

Outline
- Problem Introduction
- Related Work
- Our Parallel Approach for Protein Family Identification
- Experimental Results
- Conclusions & Future Work
- Acknowledgments

SC08, Austin, TX, 11/19/2008

Metagenomics
- Application of genomics techniques to the study of microbial communities in their natural environments.
- Without isolation and lab cultivation of individual species.

Protein Family Identification Problem
- Motivation
  - Family identification
  - Functional annotation
  - Diversity of protein family universe

[Figure: known proteins in family 1, family 2, ..., family i; new metagenomic proteins; new protein family; functional annotation]

What is a Protein Family?
- A protein family is a group of evolutionarily (thus functionally) related proteins.

[Figure labels: sequence similarity, domain similarity, structure similarity]

Related Work
- General approach
  - Perform all-against-all sequence comparison (BLAST)
  - Group proteins based on pairwise similarity
- Related work
  - Kriventseva et al. (2001)
  - Enright et al. (2002)
  - Pipenbacher et al. (2002)
  - Kelil et al. (2007)
  - Yooseph et al. (2007)
  - …

[Slide label: sequential approach]

GOS Approach
- Yooseph et al. (2007)

[Figure: (1) redundancy removal, (2) graph generation, (3) dense subgraph detection; annotated with Θ(n^2) space and Ω(n^2) time]

Limitations of Current Approaches
- Constructing large graphs can be time-consuming
  - ~10^6 CPU hours for ~28.6 million proteins (GOS approach)
- Quadratic space requirement
- Brute-force parallel approach

Main Ideas of Our Approach
- Idea #1: A dense subgraph (DS) cannot span two connected components (CC)
- Use divide and conquer to drastically reduce problem size!
- Challenge: find connected components without generating the whole graph

[Figure: dense subgraphs (DS) inside connected components (CC)]

Main Ideas of Our Approach
- Idea #2: Exact-match based filtering technique
- Example: two 100 bp sequences with 98% sequence similarity share an exact match of >= 33 bp
- Eliminate unnecessary all-against-all comparisons!
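
The numbers on this slide (100 bp, 98% similarity, >= 33 bp) are consistent with a pigeonhole argument: at 98% similarity over 100 positions there are at most 2 mismatches, which split the 98 matching positions into at most 3 runs, so at least one run has 33 or more characters. The sketch below is an illustration under the assumption of an ungapped alignment in which every non-matching position counts as one mismatch; the function name is hypothetical and does not come from the talk.

```python
def min_longest_exact_match(length: int, similarity_pct: int) -> int:
    """Lower bound on the longest exact match in an ungapped alignment.

    Assumption: 'similarity' is the percentage of identical positions.
    """
    # Largest number of mismatching positions the similarity allows.
    mismatches = length * (100 - similarity_pct) // 100
    matches = length - mismatches
    # k mismatches cut the matches into at most k + 1 runs;
    # the longest run is at least ceil(matches / (k + 1)).
    runs = mismatches + 1
    return -(-matches // runs)  # integer ceiling division


print(min_longest_exact_match(100, 98))
```

Any pair that shares no exact match of at least this length cannot meet the similarity threshold, so it can be skipped without running an alignment.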

Main Ideas of Our Approach
- Idea #3: High overlap of outlinks implies a dense subgraph
- Use outlink comparison to group vertices into dense subgraphs!

[Figure: vertices u and v and their outlinks, by analogy with a web community]

Our Parallel Approach for Protein Family Identification

[Figure: pipeline from input protein sequences to dense subgraphs: (1) redundancy removal, (2) connected component detection from pairwise sequence homology, (3) bipartite graph generation for the connected components, (4) dense subgraph detection]

Redundancy Removal
- Criteria
  - similarity of the match is >= 98%
  - >= 95% of the shorter sequence is covered by the match

[Figure: sequences p1 to p5 in a generalized suffix tree (GST); a match with >= 95% coverage and a >= 98% similarity cut-off; uses idea #2]
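
The two criteria above can be sketched as a single predicate. This is a minimal illustration, not the authors' implementation: it assumes that "similarity" means the fraction of identical positions within the match and that coverage is the match length divided by the length of the shorter sequence. The function and parameter names are hypothetical.

```python
def is_redundant(len_a: int, len_b: int, match_len: int, identical: int) -> bool:
    """Check the two redundancy criteria for one pairwise match.

    len_a, len_b: lengths of the two sequences
    match_len:    number of aligned positions in the match (assumed ungapped)
    identical:    number of aligned positions that agree
    """
    if match_len <= 0:
        return False
    shorter = min(len_a, len_b)
    # Integer comparisons avoid floating-point edge cases at the thresholds.
    similar_enough = identical * 100 >= 98 * match_len   # similarity >= 98%
    covered_enough = match_len * 100 >= 95 * shorter     # coverage >= 95%
    return similar_enough and covered_enough


print(is_redundant(100, 200, 96, 95))
```

In this sketch a pair that passes both tests would let the shorter sequence be dropped as redundant; which of the two sequences is kept is not stated on the slide.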

Connected Component Detection
- Master node (M)
  1) manage CC using union-find data structure
  2) distribute work in a load-balancing way
- Worker nodes (W), with GST 1, GST 2, ..., GST p
  1) generate pairs
  2) sequence alignment
- Messages between M and W: pairs, work, and alignment results
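
The union-find structure named on this slide can be sketched as follows. This is a single-process illustration only: the master/worker distribution is not modeled, and the input pairs are assumed to have already passed sequence alignment. The class and function names are hypothetical.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        """Merge the sets of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def connected_components(n, homologous_pairs):
    """Group sequence ids 0..n-1 given pairs judged homologous."""
    uf = UnionFind(n)
    for a, b in homologous_pairs:
        uf.union(a, b)
    groups = {}
    for v in range(n):
        groups.setdefault(uf.find(v), []).append(v)
    return sorted(groups.values())


print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
```

Because the components are maintained incrementally from pairs, the full similarity graph never has to be stored, which matches the challenge stated under Idea #1.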

Bipartite Graph Generation
- Each connected component G(V, E) is converted into a bipartite graph B(V, V, E)
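
One way to read B(V, V, E) is that every vertex appears once on the left side and once on the right side, and each undirected edge {u, v} of G becomes two links, left u to right v and left v to right u. The sketch below follows that reading; whether a vertex is also linked to its own right-side copy is not stated on the slide, and this illustration leaves such self-links out. The function name is hypothetical.

```python
def to_bipartite(vertices, edges):
    """Map each left-side vertex to its set of right-side outlinks.

    vertices: iterable of hashable vertex ids
    edges:    iterable of undirected edges (u, v)
    """
    outlinks = {v: set() for v in vertices}
    for u, v in edges:
        outlinks[u].add(v)  # left u -> right v
        outlinks[v].add(u)  # left v -> right u
    return outlinks


print(to_bipartite([0, 1, 2], [(0, 1), (1, 2)]))
```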

Dense Subgraph Detection
- Shingle algorithm
- s, c: parameters

[Figure: the outlinks of u and the outlinks of v are permuted and s elements are taken as a shingle; this is done c times and the shingles of u and v are compared]
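
A shingle computation with parameters s and c can be sketched as below. This is an illustration under stated assumptions, not the authors' code: each random permutation is approximated by a linear hash modulo a prime, vertex ids are non-negative integers smaller than that prime, and the same seed is used for every vertex so that all vertices see the same c orderings. The names shingles, PRIME, and seed are hypothetical.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, used for the stand-in permutations


def shingles(outlinks, s: int, c: int, seed: int = 0):
    """Return up to c shingles, each a tuple of s outlinks.

    Under each of the c orderings, the s smallest-ranked outlinks
    form one shingle. Vertices with fewer than s outlinks get none.
    """
    if len(outlinks) < s:
        return set()
    rng = random.Random(seed)  # same seed => same orderings for all vertices
    result = set()
    for _ in range(c):
        a = rng.randrange(1, PRIME)
        b = rng.randrange(0, PRIME)
        ranked = sorted(outlinks, key=lambda x: (a * x + b) % PRIME)
        result.add(tuple(sorted(ranked[:s])))
    return result


u_links = {1, 2, 3, 4, 5, 6}
v_links = {1, 2, 3, 4, 5, 7}
shared = shingles(u_links, 2, 8) & shingles(v_links, 2, 8)
print(len(shared))
```

Two vertices whose outlink sets overlap heavily tend to produce some identical shingles, while vertices with disjoint outlinks share none, which is what makes shingle comparison a cheap proxy for outlink overlap.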

Dense Subgraph Detection

[Figure: starting from B(V, V, E), a 1st pass and a 2nd pass of shingling, steps numbered 1 to 3, relate vertex sets A and B (A ~ B) and yield dense subgraphs]

Qualitative Validation with GOS Data
- 160k data set
- Our results vs. GOS results

#input seq  #NR      #CC    #DS  mean degree  mean density  size of largest DS
160,000     138,633  1,861  850  26           76%           13,263
22,186      21,348   [113420]                 78%           6,828

[113420]: the #CC, #DS, and mean degree values of this row are run together in the source.

Precision Rate (PR) = 95.75%
Sensitivity (SE) = 56.89%
Overlap Quality (OQ) = 55.49%

Drastic Work Reduction
- 40k input data
- #(sequence alignment work)
  - all-against-all BLAST: ~800 million
  - our parallel approach: ~8 million
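
The ~800 million figure is consistent with counting each unordered pair of the 40k input sequences once; that counting convention is an assumption, since the slide does not state it. A short check of the arithmetic, using the slide's rounded ~8 million for the parallel approach:

```python
def all_pairs(n: int) -> int:
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


n = 40_000
brute_force = all_pairs(n)      # all-against-all comparisons
reported_ours = 8_000_000       # "~8 million", rounded value from the slide
reduction = brute_force / reported_ours

print(brute_force, round(reduction))
```

Under these assumptions the filtering removes roughly 99% of the alignment work on this data set.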

Run Time as Function of Input Size

Performance Evaluation

Conclusions & Future Work
- Presented a parallel approach for protein family identification
- Quality testing: better "benchmark"
- Parallelization of Shingle algorithm: potential memory problem
- Large-scale application: 28.6 million proteins

Acknowledgments
- Prof. Srinivas Aluru at Iowa State University for BlueGene/L access
- Anonymous reviewers
- Funding: Washington State University Foundation and the Office of Research

Thanks! Questions?
