Family of HMMs Nam Nguyen University of Texas at Austin.

Presentation on theme: "Family of HMMs Nam Nguyen University of Texas at Austin."— Presentation transcript:

1 Family of HMMs Nam Nguyen University of Texas at Austin

2 Outline of Talk: Background. Family of HMMs (model, alignment algorithm). Applications of fHMM: SEPP (Mirarab, Nguyen, and Warnow, PSB 2012) and TIPP (Nguyen et al., under review). Conclusions and future work.

3 Phylogenetics: The study of evolutionary relationships between different species. Applications in many fields, such as drug discovery, agriculture, and biotechnology. Tools for alignment and phylogeny estimation are critical. (Image courtesy of the Tree of Life Project.)

4 Gather Sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

5 Align Sequences. Unaligned: S1 = AGGCTATCACCTGACCTCCA; S2 = TAGCTATCACGACCGC; S3 = TAGCTGACCGC; S4 = TCACGACCGACA. Aligned: S1 = -AGGCTATCACCTGACCTCCA; S2 = TAG-CTATCAC--GACCGC--; S3 = TAG-CT-------GACCGC--; S4 = -------TCAC--GACCGACA

6 Estimate Tree S1 S4 S2 S3 S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

7 Multiple Sequence Alignment: A fundamental step in bioinformatics pipelines. Used in phylogeny estimation, prediction of 2D/3D protein structure, and detection of conserved regions. Can be formulated as an NP-hard optimization problem. Popular heuristics include progressive alignment methods and iterative methods. These heuristics do not scale linearly with the number of sequences, and they are less accurate on large or evolutionarily divergent datasets.

8 Profile Hidden Markov Model (HMM): A statistical model for representing an MSA. Uses include inserting sequences into an alignment, taxonomic identification, homology detection, and functional annotation.

11 Metagenomics: The study of genetic material sequenced directly from the environment. Applications in biofuel production, agriculture, and human health. Sequencing technology produces millions of short reads from unknown species. A fundamental step in the analysis is identifying the taxon of each read. (Image courtesy of Wikipedia.)

12 Phylogenetic Placement. Fragmentary reads (60-200 bp long): ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG. ACCT. Known full-length sequences (500-10,000 bp long), with an alignment and a tree: AGG...GCAT (species1), TAGC...CCA (species2), TAGA...CTT (species3), AGC...ACA (species4), ACT..TAGA..A (species5).

13 Input: a (backbone) alignment and tree on full-length sequences, and a query sequence (short read). Output: placement of the query sequence on the backbone tree. The placement is used to infer the relationship between the query sequence and the full-length sequences in the backbone tree. Applications in metagenomic analysis: millions of reads; reads from different genomes mixed together; placement is used to identify each read.

14 Phylogenetic Placement: (1) Align each query sequence to the backbone alignment to produce an extended alignment. (2) Place each query sequence into the backbone tree using the extended alignment.

15 Align Sequence S1 S4 S2 S3 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC

16 Align Sequence S1 S4 S2 S3 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

17 Place Sequence S1 S4 S2 S3 Q1 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

18 Place Sequence S1 S4 S2 S3 Q1 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- Q1 Q2 Q3 Query sequences are aligned and placed independently

19 Phylogenetic Placement: (1) Align each query sequence to the backbone alignment: HMMALIGN (Eddy, Bioinformatics 1998) or PaPaRa (Berger and Stamatakis, Bioinformatics 2011). (2) Place each query sequence into the backbone tree, using the extended alignment: pplacer (Matsen et al., BMC Bioinformatics 2010) or EPA (Berger et al., Systematic Biology 2011).

21 HMMER and PaPaRa results. [Figure: results plotted against increasing rate of evolution.] Backbone size: 500; 5000 fragments; 20 replicates.

22 Reducing Evolutionary Distance

27 Family of HMMs (fHMM): Represents the MSA with multiple HMMs. Input: backbone alignment and tree on full-length sequences S, and maximum decomposition size N. Two steps: (1) Decompose the tree into subtrees of closely related sequences, with at most N leaves in each subtree. (2) Build HMMs on the subalignments induced by the subtrees.
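The decomposition step can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the slide does not say how the tree is cut, so the sketch assumes a centroid-style rule (repeatedly cut the edge that splits the leaves most evenly) and an assumed nested-tuple tree representation with unique leaf names. The function names are hypothetical.

```python
# Minimal sketch: the tree is a nested tuple of leaf names (assumed
# representation), and centroid-style splitting is an assumed cutting rule.

def leaves(tree):
    """Return the leaf names below a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

def clades(tree):
    """Yield every proper subtree; each one corresponds to one edge."""
    if isinstance(tree, tuple):
        for child in tree:
            yield child
            yield from clades(child)

def prune(tree, target):
    """Return a copy of tree without the subtree `target` (matched by identity)."""
    if tree is target:
        return None
    if isinstance(tree, str):
        return tree
    kept = [p for p in (prune(child, target) for child in tree) if p is not None]
    # Collapse a node that is left with a single child.
    return kept[0] if len(kept) == 1 else tuple(kept)

def decompose(tree, max_size):
    """Split the leaf set into subsets of at most max_size leaves."""
    n = len(leaves(tree))
    if n <= max_size:
        return [sorted(leaves(tree))]
    # Cut the edge whose removal splits the leaves most evenly.
    cut = min(clades(tree), key=lambda c: abs(2 * len(leaves(c)) - n))
    rest = prune(tree, cut)
    return decompose(cut, max_size) + decompose(rest, max_size)

backbone_tree = ((("S1", "S2"), ("S3", "S4")), (("S5", "S6"), ("S7", "S8")))
subsets = decompose(backbone_tree, max_size=4)
print(subsets)
```

Each returned subset would then induce a subalignment on which one HMM is built, for example with a profile-HMM builder such as HMMER.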

30 Alignment using fHMM: Score the query sequence against every HMM and select the HMM that yields the best bit score. Insert the query sequence into that HMM's subalignment and, by transitivity, align the query sequence to the backbone alignment.
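The selection and transitivity steps can be illustrated with a toy example. It is a sketch under stated assumptions: the bit scores are made-up numbers standing in for the output of an HMM scoring tool, the subset names `hmm_A` and `hmm_B` are hypothetical, and the query row is assumed to be already aligned to the chosen subalignment's columns. Insertions relative to the HMM are ignored for brevity.

```python
# Backbone alignment taken from the "Align Sequence" slides.
backbone = {
    "S1": "-AGGCTATCACCTGACCTCCA-AA",
    "S2": "TAG-CTATCAC--GACCGC--GCA",
    "S3": "TAG-CT-------GACCGC--GCT",
    "S4": "TAC----TCAC--GACCGACAGCT",
}

# Hypothetical decomposition and made-up bit scores for one query.
hmm_subsets = {"hmm_A": ["S1", "S2"], "hmm_B": ["S3", "S4"]}
bit_scores = {"hmm_A": 3.1, "hmm_B": 7.4}

def sub_to_backbone_columns(names):
    """Backbone columns that survive in the subalignment (not all-gap)."""
    rows = [backbone[n] for n in names]
    return [j for j in range(len(rows[0])) if any(r[j] != "-" for r in rows)]

def insert_query(query_in_sub, names):
    """Map a query row aligned to the subalignment onto backbone columns."""
    cols = sub_to_backbone_columns(names)
    assert len(query_in_sub) == len(cols)
    row = ["-"] * len(backbone[names[0]])
    for sub_col, ch in enumerate(query_in_sub):
        row[cols[sub_col]] = ch
    return "".join(row)

# Step 1: pick the HMM with the best bit score.
best = max(bit_scores, key=bit_scores.get)

# Step 2: the query as aligned to the chosen subalignment (assumed given).
query_in_sub = "-----T-A-AAAC-------"
extended = insert_query(query_in_sub, hmm_subsets[best])
print(best, extended)
```

Because every subalignment column is a backbone column, the query's position in the subalignment fixes its position in the full backbone alignment.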

33 SEPP = SATé-Enabled Phylogenetic Placement. Developers: Mirarab, Nguyen, and Warnow. Published at the Pacific Symposium on Biocomputing 2012. Two stages of decomposition: placement decomposition and alignment decomposition. Parameterized by N and M, with N ≤ M. N: maximum size of alignment subsets. M: maximum size of placement subsets.

34 Stage 1: Placement decomposition (N=4, M=8). Decompose the tree into placement sets of size ≤ 8. Decompose each placement set into alignment sets of size ≤ 4. [Figure: backbone tree on S1-S15.]

35 SEPP 4/8: Decompose Tree (N=4, M=8). Decompose the tree into placement sets of size ≤ 8. Decompose each placement set into alignment sets of size ≤ 4. [Figure: backbone tree on S1-S15 split into subsets.]

36 Stage 2: Alignment decomposition (N=4, M=8). Decompose the tree into placement sets of size ≤ 8. Decompose each placement set into alignment sets of size ≤ 4. [Figure: backbone tree on S1-S15 split into subsets.]

37 Align and Place Fragment: Align to the best HMM. Place within the placement subtree containing that HMM. [Figure: decomposed backbone tree on S1-S15.]

38 Align and Place Fragment: Align to the best HMM. Place within the placement subtree containing that HMM. [Figure: decomposed backbone tree on S1-S15, with query Q.]

39 Align and Place Fragment: Align to the best HMM. Place within the placement subtree containing that HMM. [Figure: decomposed backbone tree on S1-S15, with Q and Q’.]

40-47 SEPP Parameters: Simulated (M2 model condition, 500 true backbone). Increasing placement size: 10, 50, 250, 500. Increases: accuracy, memory, running time.

48-55 SEPP Parameters: Simulated (M2 model condition, 500 true backbone). Decreasing alignment size: 250, 50, 10. Increases: running time. Accuracy?

56-61 SEPP Parameters: Biological (16S.B.ALL dataset, 13k curated backbone). Decreasing alignment size: 1000, 250, 100, 50. Increases: running time, accuracy.

62 SEPP (10% rule) Simulated Results. [Figure: results plotted against increasing rate of evolution.] Backbone size: 500; 5000 fragments; 20 replicates.

63 SEPP Biological Results: 16S.B.ALL dataset, curated alignment/tree, 13k backbone, 13k total fragments. For 1 million fragments: PaPaRa+pplacer ~133 days; HMMER+pplacer ~30 days; SEPP 1000/1000 ~6 days.

64 SEPP summary: Two stages of decomposition: placement decomposition to form placement sets, and alignment decomposition to form the fHMM. Results in 40% lower placement error than HMMER+pplacer on divergent datasets. 1/5 the running time of HMMER+pplacer on large backbones. Local placement uses less than 2 GB peak memory, compared to 60-70 GB peak memory for global placement.

65 Outline of Talk: Background. Family of HMMs (model, alignment algorithm). Applications of fHMM: SEPP (Mirarab, Nguyen, and Warnow, PSB 2012) and TIPP (Nguyen et al., under review). Conclusions and future work.

66 Taxonomic Identification and Profiling. Taxonomic identification: Objective: given a query sequence, identify the taxon (species, genus, family, etc.) of the sequence. A classification problem. Methods include MEGAN, PhymmBL, MetaPhyler, and MetaPhlAn. Taxonomic profiling: Objective: given a set of query sequences collected from a sample, estimate the population profile of the sample. An estimation problem. Can be solved via taxonomic identification.

67 Using SEPP. Fragmentary unknown reads (60-200 bp long): ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG. ACCT. Known full-length sequences (500-10,000 bp long), with an alignment and a tree: AGG...GCAT (species1), TAGC...CCA (species2), TAGA...CTT (species3), AGC...ACA (species4), ACT..TAGA..A (species5). ML placement: 40%.

68 Taxonomic Identification using Phylogenetic Placement: adding uncertainty. Same reads and reference sequences as on slide 67. ML placement: 40%. 2nd highest likelihood placement: 38%.

69 TIPP: Taxonomic identification and phylogenetic profiling. Developers: Nguyen, Mirarab, Pop, and Warnow (under review). SEPP takes the best extended alignment and finds the ML placement. TIPP modifies SEPP to use uncertainty: (1) Find many extended alignments of each fragment to each reference alignment, until the alignment support threshold is reached. (2) Find many placements of the fragment for each extended alignment, until the placement support threshold is reached. (3) Classify each fragment at the Lowest Common Ancestor of all placements obtained for the fragment. Takes alignment and placement support values as parameters.
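The placement-support and Lowest Common Ancestor steps can be sketched as follows. This is a simplified illustration and not TIPP's code: it handles only the placement side (not the alignment support threshold), and the lineages, support values, and the function name `classify` are made up for the example. The 40% and 38% figures echo the preceding slide.

```python
def classify(placements, threshold):
    """Classify a read at the LCA of its best-supported placements.

    placements: list of (lineage, support) pairs, where lineage is a tuple of
    taxon names from the root down and support is a probability-like value.
    Placements are taken in decreasing order of support until their
    cumulative support reaches the threshold.
    """
    ranked = sorted(placements, key=lambda p: p[1], reverse=True)
    chosen, total = [], 0.0
    for lineage, support in ranked:
        chosen.append(lineage)
        total += support
        if total >= threshold:
            break
    # The LCA is the longest common prefix of the chosen lineages.
    lca = []
    for ranks in zip(*chosen):
        if len(set(ranks)) != 1:
            break
        lca.append(ranks[0])
    return lca

# Hypothetical placements for one read.
placements = [
    (("Bacteria", "FamilyX", "GenusA", "species1"), 0.40),
    (("Bacteria", "FamilyX", "GenusA", "species2"), 0.38),
    (("Bacteria", "FamilyX", "GenusB", "species3"), 0.22),
]

for threshold in (0.30, 0.75, 0.95):
    print(threshold, classify(placements, threshold))
```

Raising the threshold pulls in more placements, so the classification moves to a higher (less specific) rank: species at 0.30, genus at 0.75, and family at 0.95 in this toy case. This is the precision-versus-sensitivity trade-off the support thresholds control.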

70 Experimental Design (5/14/14). Taxonomic identification: Used leave-one-out experiments to examine accuracy when classifying novel taxa. Used non-leave-one-out experiments, with fragments simulated under Illumina-like and 454-like error models, to examine robustness to sequencing error. Taxonomic profiling: Collected simulated datasets from various studies. Estimated profiles on the simulated samples. Computed the Root Mean Squared Error (RMSE) for each profile.

71 Leave-one-out comparison

72 Robustness to sequencing error. The 454_3 error model has reads with an average length of 285 bp, with 60 indels per read.

73 Taxonomic profiling: Selected 9 different simulated metagenomic model conditions. Divided the datasets into two groups: short fragments (<= 100 bp) and long fragments (>= 100 bp). Report RMSE relative to TIPP’s RMSE.
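For concreteness, the profile error and its normalization relative to TIPP can be computed as below. The exact normalization used in the study is not given on the slides, so this sketch assumes RMSE is taken over the union of taxa appearing in either profile; the profiles and method names are made-up numbers for illustration.

```python
import math

def rmse(true_profile, estimated_profile):
    """Root mean squared error between two abundance profiles.

    Profiles map taxon name to relative abundance; a taxon missing from a
    profile counts as abundance 0. Averaging over the union of taxa is an
    assumption of this sketch.
    """
    taxa = set(true_profile) | set(estimated_profile)
    squared = sum(
        (true_profile.get(t, 0.0) - estimated_profile.get(t, 0.0)) ** 2
        for t in taxa
    )
    return math.sqrt(squared / len(taxa))

# Hypothetical true profile and two estimated profiles.
truth = {"genusA": 0.50, "genusB": 0.30, "genusC": 0.20}
estimates = {
    "TIPP": {"genusA": 0.45, "genusB": 0.35, "genusC": 0.20},
    "other_method": {"genusA": 0.30, "genusB": 0.30, "genusD": 0.40},
}

errors = {name: rmse(truth, est) for name, est in estimates.items()}
# Values above 1 mean the method's profile is less accurate than TIPP's.
relative = {name: err / errors["TIPP"] for name, err in errors.items()}
print(relative)
```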

74 Profiling: Short Fragments

75 Note: PhymmBL does not report species level classification

76 Profiling: Long Fragments Note: PhymmBL does not report species level classification

77 TIPP Summary: Combines SEPP with a statistical support threshold to increase precision with a minor reduction in sensitivity. Better sensitivity for classifying novel reads compared to MetaPhyler. Very robust to sequencing errors. Results in overall more accurate profiles (lowest average error in 10 of 12 conditions). Can be parameterized for precision or sensitivity.

78 Summary: fHMM as a statistical model for an MSA. Algorithm for alignment using fHMM: computes HMMs on closely related subsets, then aligns the query sequence to the fHMM. fHMM improves the alignment of sequences to an existing alignment.

79 Future work: Use fHMM as a replacement for profile HMM in other domains – Homology detection – Functional annotation. Use different alignment methods within the fHMM framework. TIPP – Statistical models for combining profiles on different markers – Expand marker sets to include more genes.

80 Acknowledgements Siavash Mirarab Tandy Warnow Mihai Pop Supported by NSF DEB 0733029 University of Alberta Bo Liu

81 1KP P450 transcriptome dataset: Full-length P450 gene ~500 AA. Total sequences before filtering: ~225K.

82 Ultra-large sequence alignment: Most MSA techniques do not scale linearly with the number of sequences. Alignments are needed on very large datasets: Pfam contains families with more than 100,000 sequences, and the Greengenes database contains more than 1 million 16S sequences. Datasets can contain both fragmentary and full-length sequences.

83 HMMs for MSA: Given a seed alignment (e.g., in Pfam) and a collection of sequences for the protein family: (1) Represent the seed alignment using an HMM. (2) Align each additional sequence to the HMM. (3) Use transitivity to obtain an MSA. Can we do something like this without a seed alignment?

84 UPP: Ultra-large alignment using SEPP. Developers: Nguyen, Mirarab, and Warnow (in preparation). Input: set of sequences S, backbone size B, and alignment subset size A. Output: MSA on S. Algorithm: (1) Select B random full-length sequences (the backbone set) from S. (2) Estimate a backbone alignment and backbone tree on the backbone set. (3) Align the remaining sequences to the backbone alignment. Uses a nested hierarchical fHMM.
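The first step of the algorithm, choosing the backbone set, can be sketched as below. The slide does not define "full-length", so this sketch assumes a sequence counts as full-length when its length is within a tolerance of the median length; the tolerance value, the seed, and the function name `select_backbone` are illustrative assumptions rather than UPP's actual settings.

```python
import random
import statistics

def select_backbone(sequences, backbone_size, tolerance=0.25, seed=0):
    """Pick a random backbone set from the sequences judged full-length.

    sequences: dict mapping name to unaligned sequence.
    A sequence is treated as full-length if its length is within
    `tolerance` (as a fraction) of the median length (an assumption).
    Returns (backbone names, remaining names).
    """
    median = statistics.median(len(s) for s in sequences.values())
    full_length = sorted(
        name for name, s in sequences.items()
        if abs(len(s) - median) <= tolerance * median
    )
    rng = random.Random(seed)
    backbone = rng.sample(full_length, min(backbone_size, len(full_length)))
    chosen = set(backbone)
    remaining = [name for name in sequences if name not in chosen]
    return backbone, remaining

# Toy input: four full-length sequences and two fragments.
seqs = {f"full{i}": "ACGT" * 25 for i in range(4)}
seqs.update({"frag1": "ACGT" * 7, "frag2": "ACG" * 10})

backbone, remaining = select_backbone(seqs, backbone_size=3)
print(backbone, remaining)
```

The backbone alignment and tree would then be estimated on `backbone` with an external alignment and tree-estimation method, and every name in `remaining` (fragments included) would be aligned to the backbone alignment through the fHMM.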

85 Disjoint HMMs: HMM 1, HMM 2, HMM 3, HMM 4

86 Nested HMMs: HMM 1

87 Nested HMMs: HMM 1, HMM 2, HMM 3

88 Nested HMMs: HMM 1, HMM 2, HMM 3, HMM 4, HMM 5, HMM 6, HMM 7

89 Experimental Design: Examined both simulated and biological DNA, RNA, and AA datasets. Generated fragmentary datasets from the full-length datasets. Compared Clustal-Omega, MAFFT, MUSCLE, and UPP. ML trees were estimated on the alignments using FastTree. Scored alignment and tree error. Tree error measured as FN rate or Delta FN rate.

90 Tree error on simulated RNA datasets. UPP(Fast): backbone size = 100, alignment size = 10. Average full-length sequence size: 1500 bp. Only UPP completes on all datasets within 24 hours on a 12-core machine with 24 GB.

91 Running time on simulated RNA datasets. UPP has close to linear scaling.

92 Tree error on fragmentary RNASim 10K dataset. UPP(Default): backbone size = 1000, alignment size = 10. Average fragment length: 500 bp. Average full-length sequence size: 1500 bp. Delta FN error: ML(Estimated) - ML(True).

93 One Million Sequences: Tree Error. UPP(100,100): 1.6 days using 12 processors. UPP(100,10): 7 days using 12 processors. Note: UPP decomposition improves accuracy.

94 UPP summary: Uses a nested hierarchical fHMM for sequence alignment. Overall, results in the most accurate alignments (not shown) and trees on full-length simulated datasets, with larger differences on highly divergent datasets. Results in comparable or more accurate alignments and trees on biological datasets (not shown). Yields the most accurate trees on both full-length and mixed datasets. The only method that can complete within 24 hours on datasets with up to 200K sequences; 1M sequences in less than 2 days.

95 Future work: Apply fHMM as a replacement for profile HMM in other domains – Homology detection – Functional annotation. Use different alignment methods within the fHMM framework. TIPP – Statistical models for combining profiles on different markers – Expand marker sets to include more genes. UPP – Iteration to improve taxonomic sampling.

96 Smarter profile estimation. Current profile estimation: – Pool all reads binned to any marker gene into one set – Compute abundance profile from pooled set, ignoring gene source. Problems: – Ignores that each gene provides a profile – Ignores that profiles are not independent.

97 Tree Error. [Figure: true tree and estimated tree on leaves u, v, w, x, y, with an FN edge marked.] False Negative (FN): an edge in the true tree that is missing from the estimated tree. Delta Error: the difference between the FN of the backbone tree plus placement and the FN of the backbone tree.

98 Tree Error. [Figure: the same true and estimated trees with an added leaf z.] False Negative (FN): an edge in the true tree that is missing from the estimated tree. Delta Error: the difference between the FN of the backbone tree plus placement and the FN of the backbone tree.

100 Profile HMM Q:

101 Profile HMM Q:

102 Profile HMM Q:A

103 Profile HMM Q:

104 Profile HMM Q:A

105 Profile HMM Q:At

106 Profile HMM Q:Atc

107 Profile HMM Q:

108 Profile HMM Q:A

109 Profile HMM Q:A-

110 Profile HMM Q:A-tcTCA-tATG
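The final string on this slide can be read as a path through the profile HMM. The slide graphics did not survive, so the decoding below assumes the usual display convention for profile-HMM alignments: an uppercase letter is emitted by a match state, a lowercase letter by an insert state, and `-` marks a delete state. The function name is hypothetical.

```python
def path_states(aligned_query):
    """Translate an aligned query string into a string of HMM state labels.

    Assumed convention: uppercase = match (M), lowercase = insert (I),
    '-' = delete (D).
    """
    labels = []
    for ch in aligned_query:
        if ch == "-":
            labels.append("D")
        elif ch.islower():
            labels.append("I")
        else:
            labels.append("M")
    return "".join(labels)

aligned = "A-tcTCA-tATG"
states = path_states(aligned)
# The raw query is recovered by dropping deletions and ignoring case.
raw_query = aligned.replace("-", "").upper()
print(states, raw_query)
```

Under this reading the query ATCTCATATG visits two delete states and emits three residues from insert states; the remaining seven residues fall in match columns.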

111 Alignment error on biological datasets Add text box with explanation of settings/experiment

112 Alignment error on biological and simulated datasets

113 Alignment error on Pfam seed

114 Profile Hidden Markov Model (HMM)

115 Alignment using HMM: Each path through the HMM represents an alignment. The most probable path yields the alignment with the highest likelihood. Each sequence can be aligned independently, so the method scales linearly with the number of sequences.
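Finding the most probable path is the job of the Viterbi algorithm. The sketch below runs Viterbi on a generic two-state HMM with made-up parameters; it is not a full profile HMM with match, insert, and delete states, and the state names `X` and `Y` are placeholders. Log probabilities are used to avoid underflow on long sequences.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (most probable state path, its log probability)."""
    # best[s] = log probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][symbol]
            pointers[s] = p
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=best.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], best[last]

def logs(table):
    """Convert a dict of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Toy parameters (assumptions): X prefers 'A', Y prefers 'T', states are sticky.
states = ["X", "Y"]
log_start = logs({"X": 0.5, "Y": 0.5})
log_trans = {"X": logs({"X": 0.8, "Y": 0.2}), "Y": logs({"X": 0.2, "Y": 0.8})}
log_emit = {"X": logs({"A": 0.9, "T": 0.1}), "Y": logs({"A": 0.1, "T": 0.9})}

path, score = viterbi("AATT", states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

Each query is decoded on its own, with no dependence on other queries, which is why total work grows linearly with the number of sequences.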

116 Limitations of Current Methods: Current methods have excellent accuracy but were only tested on small backbone trees with low rates of evolution. Alignment quality and tree placement accuracy degrade as sequences become more divergent. Current methods are too computationally intensive for large backbone trees. Goal: more efficient methods with improved accuracy.

117 Decomposing backbone tree

118 Building HMMs

119 Alignment using fHMM

120 Applying SEPP. Fragmentary unknown reads (60-200 bp long): ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG. ACCT. Known full-length sequences (500-10,000 bp long), with an alignment and a tree: AGG...GCAT (species1), TAGC...CCA (species2), TAGA...CTT (species3), AGC...ACA (species4), ACT..TAGA..A (species5).

121 Good sensitivity but poor precision

122 Leave-one-out comparison

123 Metagenomic data analysis: NGS data produce fragmentary sequence data. Metagenomic analyses include unknown species. Taxon identification: given short sequences, identify the species for each fragment. Applications: Human Microbiome and other metagenomic projects. Issues: accuracy and speed.

124 Profiling: Short Fragments

125 Profiling: Long Fragments

126 Contributions: MRL and SuperFine+MRL: new supertree methods. Nguyen, Mirarab, and Warnow. AMB 2012. SEPP: SATé-Enabled phylogenetic placement. Mirarab, Nguyen, and Warnow. PSB 2012. TIPP: Taxonomic identification and phylogenetic profiling. Nguyen, Mirarab, Pop, and Warnow. Under review. UPP: Ultra-large alignment using SEPP. Nguyen, Mirarab, Kumar, Guo, Wang, and Warnow. In preparation. Comparison of different methods for masking alignments. Nguyen, Linder, and Warnow. In preparation.

128 Tree Error. [Figure: true tree and estimated tree on leaves u, v, w, x, y, with an FN edge marked.] False Negative (FN): an edge in the true tree that is missing from the estimated tree.

129 Placement Error. [Figure: true tree and estimated tree on leaves u, v, w, x, y, with query Q1.] Delta Error: the difference between the FN of the extended tree and the FN of the backbone tree.

131 Placement Error. [Figure: true tree and estimated tree on leaves u, v, w, x, y, with query Q1.] Delta Error: the difference between the FN of the extended tree and the FN of the backbone tree. Delta Error = 2 - 1 = 1.
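The FN count and the delta error can be computed by comparing the bipartitions (edges) of the two trees. The tree shapes below are hypothetical, because the slide figures did not survive extraction; they were chosen so that the numbers match the slide's arithmetic of 2 - 1 = 1. Trees are written as nested tuples, an assumed representation.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree, one per internal edge.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor leaf, so the same split always gets the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            splits.add(other if anchor in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def false_negatives(true_tree, estimated_tree):
    """Count edges of the true tree that are missing from the estimate."""
    return len(bipartitions(true_tree) - bipartitions(estimated_tree))

# Hypothetical backbone trees on u, v, w, x, y.
true_backbone = (("u", "v"), "w", ("x", "y"))
est_backbone = (("u", "w"), "v", ("x", "y"))

# Hypothetical extended trees after adding the query Q1.
true_extended = (("u", "v"), ("w", "Q1"), ("x", "y"))
est_extended = (("u", "w"), ("v", "Q1"), ("x", "y"))

fn_backbone = false_negatives(true_backbone, est_backbone)
fn_extended = false_negatives(true_extended, est_extended)
delta_error = fn_extended - fn_backbone
print(fn_extended, fn_backbone, delta_error)
```

Subtracting the backbone's own FN isolates the error introduced by placing the query, so a placement is not penalized for mistakes already present in the backbone tree.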

132 Alignment using fHMM: Align the query to the best-scoring HMM. Insert the query sequence into the backbone alignment using transitivity.

