Presentation is loading. Please wait.

Presentation is loading. Please wait.

Alignment of large genomic sequences Fragment-based alignment approach (DIALIGN) useful for alignment of genomic sequences. Possible applications: Detection.

Similar presentations


Presentation on theme: "Alignment of large genomic sequences Fragment-based alignment approach (DIALIGN) useful for alignment of genomic sequences. Possible applications: Detection."— Presentation transcript:

1 Alignment of large genomic sequences Fragment-based alignment approach (DIALIGN) useful for alignment of genomic sequences. Possible applications: Detection of regulatory elements Identification of pathogenic microorganisms Gene prediction

2 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

3 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

4 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

5 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

6 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

7 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

8 The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

9 The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacccctgaattgaataa

10 The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacc----------cctgaattgaataa

11 The DIALIGN approach atc------taatagttaaactcccccgtgc-ttag cagtgcgtgtattactaac----------gg-ttcaatcgcg caaa--gagtatcacc----------cctgaattgaataa

12 The DIALIGN approach atc------taatagttaaactcccccgtgc-ttag cagtgcgtgtattactaac----------gg-ttcaatcgcg caaa--gagtatcacc----------cctgaattgaataa Consistency!

13 The DIALIGN approach atc------TAATAGTTAaactccccCGTGC-TTag cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg caaa--GAGTATCAcc----------CCTGaaTTGAATaa

14 First step in sequence comparison: alignment S1S1 S2S2 S3S3

15 For genomic sequences: Neither local nor global methods appropriate S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3

16 First step in sequence comparison: alignment Local method finds single best local similarity S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3

17 First step in sequence comparison: alignment Multiple application of local methods possible S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3

18 First step in sequence comparison: alignment S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3 Multiple application of local methods possible

19 First step in sequence comparison: alignment Multiple application of local methods possible S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3

20 First step in sequence comparison: alignment Multiple application of local methods possible S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3

21 First step in sequence comparison: alignment Threshold has to be applied to filter alignments: reduced sensitivity! S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3

22 First step in sequence comparison: alignment Alternative approach: During evolution few large-scale re- arrangements -> relative order homologies conserved Search for chain of local homologies

23 First step in sequence comparison: alignment Genomic alignment: chain of homologies S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3

24 First step in sequence comparison: alignment Genomic alignment: chain of homologies S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3

25 First step in sequence comparison: alignment Genomic alignment: chain of homologies S1S1 S2S2 S1’S1’ S2’S2’ S3’S3’ S3S3

26 First step in sequence comparison: alignment Genomic alignment: chain of homologies S1S1 S2S2 S1’S1’S2’S2’S3’S3’ S3S3

27 First step in sequence comparison: alignment Novel approaches for genomic alignment: WABA PipMaker MGA TBA Lagan Avid DIALIGN

28 Alignment of large genomic sequences Gene-regulatory sites identified by mulitple sequence alignment (phylogenetic footprinting)

29 Alignment of large genomic sequences

30 Objective function for DIALIGN: Weight score for every possible fragment f based on P-value: P(f) = probability of finding a fragment “like f” by chance in random sequences with same length as input sequences w(f) = -log P(f) (“weight score” of f) ”like f” means: at least same # matches (DNA, RNA) or sum of similarity values (proteins)

31 Objective function for DIALIGN: Score of alignment: sum of weight scores of fragments – no gap penalty!

32 Optimization problem for DIALIGN: Find consistent collection of fragments with maximum total weight score!

33 Alternative fragment weight scores for genomic sequences: Calculate fragment scores at nucleotide level and at peptide level.

34 catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg

35 catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg

36 catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg Standard score: Consider length, # matches, compute probability of random occurrence

37 Translation option: catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg

38 Translation option: L S Y V catcatatc tta tct tac gtt aactcccccgt cagtgcgtg ata gcc cat atc cgg I A H I DNA segments translated to peptide segments; fragment score based on peptide similarity: Calculate probability of finding a fragment of the same length with (at least) the same sum of BLOSUM values

39 P-fragment (in both orientations) L S Y V catcatatc tta tct tac gtt aactcccccgt cagtgcgtg ata gcc cat atc cgg I A H I N-fragment catcatatc ttatcttacgtt aactcccccgtgct || | | | cagtgcgtg atagcccatatc cg For each fragment f three probability values calculated; Score of f based on smallest P value.

40 Alternative fragment weight scores for genomic sequences: Calculate fragment scores at nucleotide level and at peptide level.

41 DIALIGN alignment of human and murine genomic sequences

42 DIALIGN alignment of tomato and Thaliana genomic sequences

43 Alignment of large genomic sequences Evaluation of signal detection methods: Apply method to data with known signals (correct answer is known!). E.g. experimentally verified genes for gene finding TP = true positves = # signals correctly predicted (i.e. signal present) FP = false positives = # signals predicted but wrong (i.e no signal present) TN = true negative = # no signal predicted, no signal present FN = false negative = # no signal predicted, signal present!

44 Alignment of large genomic sequences Sn = Sensitivity = correctly predicted signals / present signals = TP / (TP + FN) Sp = Specificity = correctly predicted signals / predicted signals = TP / (TP + FP)

45 Alignment of large genomic sequences Comprehensive evaluation of signal prediction method: Method assigns score to predictions Apply threshold parameter High threshold -> high specificity (Sp), low sensitivity (Sn) Low threshold -> high sensitivity, low specificity ROC curve („receiver-operator curve“) Vary threshold parameter, plot Sn against Sp

46 Performance of long-range alignment programs for exon discovery (human - mouse comparison)

47 DIALIGN alignment of tomato and Thaliana genomic sequences

48 AGenDA: Alignment-based Gene Detection Algorithm Bridge small gaps between DIALIGN fragments -> cluster of fragments

49 AGenDA: Alignment-based Gene Detection Algorithm Bridge small gaps between DIALIGN fragments -> cluster of fragments Search conserved splice sites and start/stop codons at cluster boundaries to Identify candidate exons

50 AGenDA: Alignment-based Gene Detection Algorithm Bridge small gaps between DIALIGN fragments -> cluster of fragments Search conserved splice sites and start/stop codons at cluster boundaries to Identify candidate exons Recursive algorithm finds biologically consistent chain of potential exons

51 Identification of candidate exons Fragments in DIALIGN alignment

52 Identification of candidate exons Build cluster of fragments

53 Identification of candidate exons Identify conserved splice sites

54 Identification of candidate exons Candidate exons bounded by conserved splice sites

55 Construct gene models using candidate exons Score of candidate exon (E) based on DIALIGN scores for fragments, score of splice junctions and penalty for shortening / extending Find biologically consistent chain of candidate exons (starting with start codon, ending with stop codon, no internal stop codons …) with maximal total score

56 Find optimal consistent chain of candidate exons

57

58

59

60 atggtaggtagtgaatgtga

61 Find optimal consistent chain of candidate exons atggtaggtagtgaatgtga G1G2

62 Find optimal consistent chain of candidate exons Recursive algorithm calculates optimal chain of candidate exons in O(N log N) time

63 Find optimal consistent chain of candidate exons atggtaggtagtgaatgtga G1G2

64 Find optimal consistent chain of candidate exons Recursive algorithm calculates optimal chain of candidate exons in O(N log N) time

65 DIALIGN fragments

66 Candidate exons

67 Gene model

68 Result: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

69

70 AGenDA GenScan 64 % 12 % 17 % Result: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

71 Alignment of large genomic sequences DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)

72 Alignment of large genomic sequences DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004) Alignment of Hox gene cluster:

73 Alignment of large genomic sequences DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004) Alignment of Hox gene cluster: DIALIGN able to identify small regulatory elements, but

74 Alignment of large genomic sequences DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004) Alignment of Hox gene cluster: DIALIGN able to identify small regulatory elements, but Entire genes totally mis-aligned

75 Alignment of large genomic sequences DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004) Alignment of Hox gene cluster: DIALIGN able to identify small regulatory elements, but Entire genes totally mis-aligned Reason for mis-alignment: duplications !

76 Alignment of large genomic sequences The Hox gene cluster: 4 Hox gene clusters in pufferfish. 14 genes, different genes in different clusters!

77 Alignment of large genomic sequences The Hox gene cluster: Complete mis-alignment of entire genes!

78 Alignment of sequence duplications S1S1 S2S2

79 S1S1 S2S2 Conserved motivs; no similarity outside motifs

80 Alignment of sequence duplications S1S1 S2S2 Duplication in two sequences

81 Alignment of sequence duplications S1S1 S2S2 Duplication in two sequences

82 Alignment of sequence duplications S1S1 S2S2 Duplication in two sequences

83 Alignment of sequence duplications S1S1 S2S2 Mis-alignment would have lower score!

84 Alignment of sequence duplications S1S1 S2S2 Duplication in one sequence

85 Alignment of sequence duplications S1S1 S2S2 Duplication in one sequence

86 Alignment of sequence duplications S1S1 S2S2 Duplication in one sequence Possible mis-alignment

87 Alignment of sequence duplications S1S1 S2S2 Duplication in one sequence S3S3

88 Alignment of sequence duplications S1S1 S2S2 Duplication in one sequence S3S3

89 Alignment of sequence duplications S1S1 S2S2 Duplication in one sequence S3S3

90 Alignment of sequence duplications S1S1 S2S2 Duplication in one sequence S3S3

91 Alignment of sequence duplications S1S1 S2S2 Consistency problem S3S3

92 Alignment of sequence duplications S1S1 S2S2 More plausible alignment – and higher score : S3S3

93 Alignment of sequence duplications S1S1 S2S2 Consistency problem S3S3

94 Alignment of sequence duplications S1S1 S2S2 Alternative alignment; probably biologically wrong; lower numerical score! S3S3

95 Anchored sequence alignment Biologically meaningful alignment often not possible by automated approaches.

96 Anchored sequence alignment Biologically meaningful alignment not possible by automated approaches. Idea: use expert knowledge to guide alignment procedure

97 Anchored sequence alignment Biologically meaningful alignment not possible by automated approaches. Idea: use expert knowledge to guide alignment procedure User defines a set anchor points that are to be „respected“ by the alignment procedure

98 Anchored sequence alignment NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

99 Anchored sequence alignment NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

100 Anchored sequence alignment NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT Use known homology as anchor point

101 Anchored sequence alignment NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT Use known homology as anchor point

102 Anchored sequence alignment NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT Use known homology as anchor point Anchor point = anchored fragment (gap-free pair of segments)

103 Anchored sequence alignment NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT Use known homology as anchor point Anchor point = anchored fragment (gap-free pair of segments) Remainder of sequences aligned automatically

104 Anchored sequence alignment -------NLF VALYDFVASG DNTLSITKGE klrvlgynhn iihredkGVI YALWDYEPQN DDELPMKEGD cmt------- Anchored alignment

105 Anchored sequence alignment NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS Anchor points in multiple alignment

106 Anchored sequence alignment NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQND DELPMKEGDCMT GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS Anchor points in multiple alignment

107 Anchored sequence alignment -------NLF V-ALYDFVAS GD-------- NTLSITKGEk lrvLGYNhn iihredkGVI Y-ALWDYEPQ ND-------- DELPMKEGDC MT------- -------GYQ YrALYDYKKE REedidlhlg DILTVNKGSL VA-LGFS-- Anchored multiple alignment

108 Algorithmic questions Goal: Find optimal alignment (=consistent set of fragments) under costraints given by user- specified anchor points!

109 Additional input file with anchor points: 1 3 215 231 5 4.5 2 3 34 78 23 1.23 1 4 317 402 8 8.5 Algorithmic questions

110 NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

111 Additional input file with anchor points: 1 3 215 231 5 4.5 2 3 34 78 23 1.23 1 4 317 402 8 8.5 Algorithmic questions

112 Additional input file with anchor points: 1 3 215 231 5 4.5 2 3 34 78 23 1.23 1 4 317 402 8 8.5 Sequences Algorithmic questions

113 Additional input file with anchor points: 1 3 215 231 5 4.5 2 3 34 78 23 1.23 1 4 317 402 8 8.5 Sequences start positions Algorithmic questions

114 Additional input file with anchor points: 1 3 215 231 5 4.5 2 3 34 78 23 1.23 1 4 317 402 8 8.5 Sequences start positions length Algorithmic questions

115 Additional input file with anchor points: 1 3 215 231 5 4.5 2 3 34 78 23 1.23 1 4 317 402 8 8.5 Sequences start positions length score Algorithmic questions

116 Requirements: Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points

117 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

118 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

119 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Inconsistent anchor points!

120 Algorithmic questions atctaat---agttaaactcccccgtgcttag Cagtgcgtgtattac-taacggttcaatcgcg caaagagtatcacccctgaattgaataa Inconsistent anchor points!

121 Algorithmic questions Requirements: Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points

122 Algorithmic questions Requirements: Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points Find alignment under constraints given by anchor points!

123 Algorithmic questions Use data structures from multiple alignment

124 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

125 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Greedy procedure for multiple alignment

126 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Greedy procedure for multiple alignment

127 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Question: which positions are still alignable ?

128 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x For each position x and each sequence S i exist an upper bound ub(x,i) and a lower bound lb(x,i) for residues y in S i that are alignable with x

129 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x For each position x and each sequence S i exist an upper bound ub(x,i) and a lower bound lb(x,i) for residues y in S i that are alignable with x

130 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x ub(x,i) and lb(x,i) updated during greedy procedure

131 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x Initial values of lb(x,i), ub(x,i)

132 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x ub(x,i) and lb(x,i) updated during greedy procedure

133 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x ub(x,i) and lb(x,i) updated during greedy procedure

134 Algorithmic questions Anchor points treated like fragments in greedy algorithm:

135 Algorithmic questions Anchor points treated like fragments in greedy algorithm: Sorted according to user-defined scores

136 Algorithmic questions Anchor points treated like fragments in greedy algorithm: Sorted according to user-defined scores Accepted if consistent with previously accepted anchors

137 Algorithmic questions Anchor points treated like fragments in greedy algorithm: Sorted according to user-defined scores Accepted if consistent with previously accepted anchors ub(x,i) and lb(x,i) updated during greedy procedure

138 Algorithmic questions Anchor points treated like fragments in greedy algorithm: Sorted according to user-defined scores Accepted if consistent with previously accepted anchors ub(x,i) and lb(x,i) updated during greedy procedure Resulting values of ub(x,i) and lb(x,i) used as initial values for alignment procedure

139 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x Initial values of lb(x,i), ub(x,i)

140 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x Initial values of lb(x,i), ub(x,i) calculated using anchor points

141 Algorithmic questions Ranking of anchor points to prioritize anchor points, e.g. anchor points from verified homologies -- higher priority automatically created anchor points (using CHAOS, BLAST, … ) -- lower priority

142 Application: Hox gene cluster

143 Use gene boundaries as anchor points

144 Application: Hox gene cluster Use gene boundaries as anchor points + CHAOS / BLAST hits

145 Application: Hox gene cluster no anchoring anchoring Ali. Columns 2 seq 2958 3674 3 seq 668 1091 4 seq 244 195 Score 1166 1007 CPU time 4:22 0:19

146 Application: Hox gene cluster Example: Teleost Hox gene cluster:

147 Application: Hox gene cluster Example: Teleost Hox gene cluster: Score of anchored alignment 15 % higher than score of non-anchored alignment !

148 Application: Hox gene cluster Example: Teleost Hox gene cluster: Score of anchored alignment 15 % higher than score of non-anchored alignment ! Conclusion: Greedy optimization algorithm does a bad job!

149 Application: Improvement of Alignment programs Two possible reasons for mis-alignments:

150 Application: Improvement of Alignment programs Two possible reasons for mis-alignments: Wrong objective function: Biologically correct alignment gets bad numerical score

151 Application: Improvement of Alignment programs Two possible reasons for mis-alignments: Wrong objective function: Biologically correct alignment gets bad numerical score Bad optimization algorithms: Biologically correct alignment gets best numerical score, but algorithm fails to find this alignment

152 Application: Improvement of Alignment programs Two possible reasons for mis-alignments: Anchored alignments can help to decide

153 Application: RNA alignment

154 aa----CCCC AGC---GUAa gucgcuaucc a cacucuCCCA AGC---GGAG Aac------- - ccg----CCA AaagauGGCG Acuuga---- - non-anchored alignment

155 Application: RNA alignment aa----CCCC AGC---GUAa gucgcuaucc a cacucuCCCA AGC---GGAG Aac------- - ccg----CCA AaagauGGCG Acuuga---- - structural motif mis-aligned

156 Application: RNA alignment aaCCCCAGCG UAAGUCGCUA UCca-- --CACUCUCC CAAGCGGAGA AC---- ----CCGCCA AAAGAUGGCG ACuuga 3 conserved nucleotides as anchor points

157 WWW interface at GOBICS (Göttingen Bioinformatics Compute Server)

158


Download ppt "Alignment of large genomic sequences Fragment-based alignment approach (DIALIGN) useful for alignment of genomic sequences. Possible applications: Detection."

Similar presentations


Ads by Google