1
An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data. Changjun Wu, Ananth Kalyanaraman. School of Electrical Engineering and Computer Science, Washington State University.
2
Outline: Problem Introduction; Related Work; Our Parallel Approach for Protein Family Identification; Experimental Results; Conclusions & Future Work; Acknowledgments. 11/19/2008, SC08, Austin, TX.
4
Metagenomics: the application of genomics techniques to the study of microbial communities in their natural environments, without isolation and lab cultivation of individual species.
5
Protein Family Identification Problem. Motivation: family identification; functional annotation; diversity of the protein family universe. [Figure labels: known proteins; new metagenomic proteins; family 1, family 2, …, family i; new protein family; functional annotation]
6
What is a Protein Family? A protein family is a group of evolutionarily (and thus functionally) related proteins. [Figure labels: sequence similarity; domain similarity; structure similarity]
8
Related Work. General approach: (1) perform all-against-all sequence comparison (BLAST); (2) group proteins based on pairwise similarity. Related work: Kriventseva et al. (2001); Enright et al. (2002); Pipenbacher et al. (2002); Kelil et al. (2007); Yooseph et al. (2007); … [Figure label: sequential approach]
9
GOS Approach (Yooseph et al., 2007). Steps: (1) redundancy removal; (2) graph generation; (3) dense subgraph detection. Cost: Θ(n^2) space, Ω(n^2) time.
10
Limitations of Current Approaches. Constructing large graphs can be time-consuming: ~10^6 CPU hours for ~28.6 million proteins (GOS approach). Quadratic space requirement. Brute-force parallel approach.
12
Main Ideas of Our Approach. Idea #1: a dense subgraph (DS) cannot span two connected components (CC), so use divide and conquer to drastically reduce the problem size. Challenge: find connected components without generating the whole graph.
13
Main Ideas of Our Approach. Idea #2: exact-match based filtering technique. Example: a 100 bp alignment with 98% sequence similarity contains an exact match of >= 33 bp. Use this to eliminate unnecessary all-against-all comparisons.
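The >= 33 bp figure is consistent with a pigeonhole argument, sketched below. The sketch assumes an ungapped alignment whose only differences are mismatches; the function name and its parameters are illustrative and do not come from the presentation.

```python
def min_longest_exact_match(aligned_len: int, mismatches: int) -> int:
    """Lower bound on the longest exact-match run in an ungapped alignment.

    `mismatches` mismatching positions split the matching positions into at
    most `mismatches + 1` runs, so by the pigeonhole principle the longest
    run has at least ceil(matches / runs) positions.
    """
    if aligned_len <= 0 or not 0 <= mismatches <= aligned_len:
        raise ValueError("need aligned_len > 0 and 0 <= mismatches <= aligned_len")
    matches = aligned_len - mismatches
    runs = mismatches + 1
    return -(-matches // runs)  # integer ceiling division


# 100 aligned positions at 98% similarity means 2 mismatches.
bound = min_longest_exact_match(100, 2)  # 98 matches in at most 3 runs -> 33
```

Any pair that shares no exact match of at least this length cannot meet the similarity threshold under these assumptions, so it can be skipped without alignment.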
14
Main Ideas of Our Approach. Idea #3: a high overlap of outlinks indicates a dense subgraph, as in web community detection. Compare the outlinks of vertices u and v to group vertices into dense subgraphs.
15
Our Parallel Approach for Protein Family Identification. Steps: (1) redundancy removal; (2) connected component detection; (3) bipartite graph generation; (4) dense subgraph detection. [Figure labels: input protein sequences; protein sequence; pairwise sequence homology; connected components; dense subgraph]
16
Redundancy Removal. Criteria: the similarity of the match is >= 98%, and >= 95% of the shorter sequence is covered by the match. [Figure labels: generalized suffix tree (GST); sequences p1, p2, p3, p4, p5; cut off; >= 98%; >= 95%; idea #2]
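The two criteria can be read as a simple predicate on an alignment result. This is a minimal sketch: the function name, the argument names, and the assumption that the aligner reports an identity fraction and a match length are all illustrative.

```python
def is_redundant(identity: float, match_len: int, len_a: int, len_b: int,
                 min_identity: float = 0.98, min_coverage: float = 0.95) -> bool:
    """Return True if a match satisfies both redundancy criteria.

    identity  -- fraction of identical positions within the match (0.0 to 1.0)
    match_len -- number of positions of the shorter sequence covered by the match
    len_a/b   -- full lengths of the two sequences
    """
    shorter = min(len_a, len_b)
    if shorter <= 0:
        raise ValueError("sequence lengths must be positive")
    coverage = match_len / shorter
    return identity >= min_identity and coverage >= min_coverage


# A 100-residue sequence matched over 96 positions at 99% identity.
print(is_redundant(0.99, 96, 100, 250))
```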
17
Connected Component Detection. Master node (M): (1) manage CCs using a union-find data structure; (2) distribute work in a load-balancing way. Worker nodes (W), with GST 1, GST 2, …, GST p: (1) generate pairs; (2) perform sequence alignment. [Figure labels: pairs; work; alignment results]
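A union-find structure lets connected components be maintained without storing the graph's edges. The single-process sketch below stands in for the master's bookkeeping only. Skipping the alignment of pairs that are already in the same component is an assumption about how the structure could save work, and `detect_components`, `candidate_pairs`, and `is_homologous` are hypothetical names.

```python
class UnionFind:
    """Disjoint sets over vertices 0..n-1 with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def detect_components(n, candidate_pairs, is_homologous):
    """Return (components, number of alignments actually performed)."""
    uf = UnionFind(n)
    alignments = 0
    for u, v in candidate_pairs:
        if uf.find(u) == uf.find(v):
            continue  # already in one component: alignment cannot change the result
        alignments += 1
        if is_homologous(u, v):  # stands in for the sequence alignment step
            uf.union(u, v)
    groups = {}
    for x in range(n):
        groups.setdefault(uf.find(x), []).append(x)
    return sorted(groups.values()), alignments


pairs = [(0, 1), (1, 2), (0, 2), (3, 4)]
components, aligned = detect_components(5, pairs, lambda u, v: True)
```

In this toy run the pair (0, 2) is never aligned, because 0 and 2 are already connected through 1.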
18
Bipartite Graph Generation. Each connected component G(V, E) is converted to a bipartite graph B(V, V, E).
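One way to read B(V, V, E) is that every vertex appears once on the left and once on the right, and each undirected edge {u, v} of G links left-u to right-v and left-v to right-u. The sketch below builds that structure as an outlink map. This reading, and the name `to_bipartite`, are assumptions rather than details given on the slide.

```python
from collections import defaultdict


def to_bipartite(edges):
    """Map each left-side vertex to the set of right-side vertices it links to.

    `edges` is an iterable of undirected edges (u, v) of one connected component.
    """
    outlinks = defaultdict(set)
    for u, v in edges:
        outlinks[u].add(v)  # left copy of u -> right copy of v
        outlinks[v].add(u)  # left copy of v -> right copy of u
    return dict(outlinks)


b = to_bipartite([(1, 2), (2, 3)])
```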
19
Dense Subgraph Detection: Shingle algorithm. For a vertex u, permute outlinks(u) and take s elements as a shingle; repeat c times. Vertices u and v are compared through the shingles of outlinks(u) and outlinks(v). s, c: parameters.
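The shingling step can be sketched as follows. This is a simplified illustration and not the presented implementation: each random permutation is approximated by a random linear hash modulo a prime, vertex IDs are assumed to be non-negative integers smaller than that prime, and all function names are hypothetical.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # 2**31 - 1


def make_hashes(c, seed=0):
    """Draw c (a, b) pairs; x -> (a*x + b) % PRIME stands in for a permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(c)]


def shingles(outlinks, s, hashes):
    """Return the set of shingles of a vertex: for each hash, the s outlinks
    that come first in the induced order (fewer than s outlinks -> no shingles)."""
    result = set()
    if len(outlinks) < s:
        return result
    for a, b in hashes:
        ranked = sorted(outlinks, key=lambda x: ((a * x + b) % PRIME, x))
        result.add(tuple(sorted(ranked[:s])))
    return result


def group_by_shingle(graph, s, c, seed=0):
    """Group vertices that share at least one shingle.

    `graph` maps each vertex to its set of outlinks.
    """
    hashes = make_hashes(c, seed)
    groups = defaultdict(set)
    for u, links in graph.items():
        for sh in shingles(links, s, hashes):
            groups[sh].add(u)
    return {sh: vs for sh, vs in groups.items() if len(vs) > 1}


graph = {1: {10, 11, 12}, 2: {10, 11, 12}, 3: {20, 21, 22}}
groups = group_by_shingle(graph, s=2, c=3)
```

Vertices 1 and 2 have identical outlinks, so they produce identical shingles and end up grouped together; vertex 3 shares no outlinks with them and is not grouped with either.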
20
Dense Subgraph Detection. [Figure labels: B(V, V, E); shingle; 1st pass; 2nd pass; A ~ B; dense subgraph; steps 1, 2, 3]
22
Qualitative Validation with GOS Data. 160k data set. Our results vs. GOS results.
Columns: #input seq | #NR | #CC | #DS | mean degree | mean density | size of largest DS
Row 1: 160,000 | 138,633 | 1,861 | 8502676% (#DS, mean degree, and mean density fused in the source) | 13,263
Row 2: 22,186 | 21,348 | 11342078% (#CC, #DS, mean degree, and mean density fused in the source) | 6,828
Precision Rate (PR) = 95.75%; Sensitivity (SE) = 56.89%; Overlap Quality (OQ) = 55.49%.
23
Drastic Work Reduction. 40k input data. Number of sequence alignments: ~800 million for all-against-all BLAST vs. ~8 million for our parallel approach.
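The ~800 million figure matches the number of unordered pairs among 40,000 sequences. The short check below assumes that "40k" means exactly 40,000 and takes the reported ~8 million as a round number, so the resulting factor is only approximate.

```python
def all_pairs(n: int) -> int:
    """Number of unordered pairs among n sequences: n choose 2."""
    return n * (n - 1) // 2


brute_force = all_pairs(40_000)       # 799,980,000, i.e. ~800 million
reported_reduced = 8_000_000          # "~8 million" as stated on the slide
reduction_factor = brute_force / reported_reduced  # roughly 100x fewer alignments
```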
24
Run Time as a Function of Input Size
25
Performance Evaluation
26
Conclusions & Future Work. Presented a parallel approach for protein family identification. Quality testing: better “benchmark”. Parallelization of the Shingle algorithm: potential memory problem. Large-scale application: 28.6 million.
27
Acknowledgments. Prof. Srinivas Aluru at Iowa State University for BlueGene/L access. Anonymous reviewers. Funding: Washington State University Foundation and the Office of Research.
28
Thanks! Questions?