CS 5263 Bioinformatics Lecture 6: Sequence Alignment Statistics.


1 CS 5263 Bioinformatics Lecture 6: Sequence Alignment Statistics

2 Roadmap
Last lecture review
– Affine gap penalty (more today)
– Local sequence alignment
– Statistics of substitution matrices
Statistics of alignment scores
Sequence alignment and FSA
– Affine gap penalty
– More complex models

3 Seq Alignment Algorithms
Global alignment
– Basic: Needleman-Wunsch
– Variants (LCS, overlapping, …)
– Bounded DP (pruning search space)
– Linear space (divide-and-conquer)
– Affine gap penalty
Local alignment
– Basic: Smith-Waterman
– All tricks in global alignment applicable: bounded DP, linear space, affine gap

4 The local alignment problem
Given two strings $X = x_1 \dots x_M$ and $Y = y_1 \dots y_N$, find substrings x’, y’ whose similarity (optimal global alignment value) is maximum.
e.g. X = abcxdex, X’ = cxde; Y = xxxcde, Y’ = c-de

5 The Smith-Waterman algorithm
Initialization: $F(0, j) = F(i, 0) = 0$
Iteration:
$$F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$

6 The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment: $F_{OPT} = \max_{i,j} F(i, j)$
2. If we want all local alignments scoring > t: for all i, j with F(i, j) > t, trace back
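
The initialization, iteration, and termination steps can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `smith_waterman_score` and the default scores (match = 1, mismatch = -1, linear gap penalty d = 1, subtracted as in the recurrence) are assumptions chosen for the example, and only the best score is returned (no traceback).

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, d=1):
    """Return F_OPT = max over i, j of F(i, j) for local alignment.

    d is a positive gap penalty that is subtracted, as in the recurrence.
    The scoring values are illustrative assumptions.
    """
    M, N = len(x), len(y)
    # Initialization: F(0, j) = F(i, 0) = 0
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                        # start a new local alignment
                F[i - 1][j] - d,          # x_i aligned to a gap
                F[i][j - 1] - d,          # y_j aligned to a gap
                F[i - 1][j - 1] + sigma,  # x_i aligned to y_j
            )
            best = max(best, F[i][j])     # termination: track max F(i, j)
    return best
```

With these assumed scores, the strings from the local alignment example (X = abcxdex, Y = xxxcde) score 2, which is what the alignment cxde / c-de gives: three matches and one gap.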

7 Analysis Time: –O(MN) for finding the best alignment –Depending on the number of sub-opt alignments Memory: –O(MN) –O(M+N) possible

8 The statistics of alignment Where does $\sigma(x_i, y_j)$ come from? Are two aligned sequences actually related?

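9 Protein substitution matrix
$\sigma(s, t)$: score to align amino acid s against t
$p_s$, $p_t$: frequencies of s and t in the database
$q_{st}$: the frequency with which s is aligned to t in real homologous sequences
$$\sigma(s, t) = \frac{1}{\lambda} \log \frac{q_{st}}{p_s p_t}$$
where $\log \frac{q_{st}}{p_s p_t}$ is the log odds ratio and $\lambda$ is the scaling factor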

10 BLOSUM matrices
$p_s$, $p_t$, $q_{st}$ estimated from trusted alignments in the BLOCKS database
Eliminate near-identical sequences
– BLOSUM-N: constructed from sequences where identity between any pair of sequences is less than N%
– BLOSUM-62: good for most purposes
Scale: BLOSUM-45 (weak homology), BLOSUM-62, BLOSUM-90 (strong homology)

11 DNA substitution matrix Given the percent identity you would like to detect and some assumptions You can get the substitution matrix by some calculation

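12 Example
Assume $p_A = p_C = p_T = p_G = 0.25$
We want 88% identity
$q_{AA} = q_{CC} = q_{TT} = q_{GG} = 0.22$; the rest = 0.12/12 = 0.01
Resulting matrix: 5 on the diagonal (match), -7 off the diagonal (mismatch)
     A   C   G   T
A    5  -7  -7  -7
C   -7   5  -7  -7
G   -7  -7   5  -7
T   -7  -7  -7   5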
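
The "some calculation" behind this example can be sketched as follows. The helper names `dna_log_odds` and `scale_to` are hypothetical, and the choice to rescale so that a match scores exactly 5 is an assumption made to reproduce the matrix on the slide; the log odds formula itself is the one from the protein substitution matrix slide.

```python
import math

def dna_log_odds(identity, background=0.25):
    """Unscaled log odds scores (natural log) for a match and a mismatch.

    Assumes uniform base frequencies and that all 12 mismatch pairs
    are equally likely, as in the example.
    """
    q_match = identity / 4             # q_AA = q_CC = q_GG = q_TT
    q_mismatch = (1 - identity) / 12   # remaining mass over 12 pairs
    p2 = background * background       # p_s * p_t
    return math.log(q_match / p2), math.log(q_mismatch / p2)

def scale_to(match_raw, mismatch_raw, match_target=5):
    """Rescale so a match scores match_target; round the mismatch score."""
    factor = match_target / match_raw
    return match_target, round(mismatch_raw * factor)

match_raw, mismatch_raw = dna_log_odds(0.88)
print(scale_to(match_raw, mismatch_raw))  # (5, -7)
```

The unscaled scores are about 1.26 and -1.83, so the 5 / -7 matrix corresponds to a scaling of roughly 4.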

13 Arbitrary substitution matrix
Even an arbitrary substitution matrix has a meaning, so you had better know what you are doing
Solve a polynomial equation to obtain the scaling factor $\lambda$
Calculate the target frequencies $q_{st}$
Calculate the target percent identity

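14 Example (rows and columns A, C, G, T; uniform base frequencies)
Matrix 1: match = 1, mismatch = -2
– $\lambda = 1.33$
– $q_{st} = 0.24$ for s = t, and 0.004 for s ≠ t
– Translate: 95% identity
Matrix 2: match = 5, mismatch = -4
– $e^{\lambda} = 1.21$, i.e. $\lambda \approx 0.19$
– $q_{st} = 0.16$ for s = t, and 0.03 for s ≠ t
– Translate: 65% identity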
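
A small numerical sketch shows how these numbers can be recovered from a match/mismatch matrix. It assumes uniform base frequencies (0.25 each) and solves $\sum_{s,t} p_s p_t e^{\lambda \sigma(s,t)} = 1$ for the positive root by bisection; the function names are hypothetical, and bisection is just one convenient way to solve the equation.

```python
import math

def solve_lambda(match, mismatch, hi=10.0, tol=1e-12):
    """Positive root of 0.25*exp(lam*match) + 0.75*exp(lam*mismatch) = 1.

    Assumes uniform base frequencies and a negative expected score,
    so that a unique positive root exists below `hi`.
    """
    def f(lam):
        return 0.25 * math.exp(lam * match) + 0.75 * math.exp(lam * mismatch) - 1.0

    lo = 1e-6                  # f is slightly negative just above 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def target_identity(match, mismatch):
    """Target percent identity: 4 * q_match, with q_st = p_s p_t exp(lam * sigma)."""
    lam = solve_lambda(match, mismatch)
    return 4 * 0.0625 * math.exp(lam * match)

print(round(solve_lambda(1, -2), 2), round(target_identity(1, -2), 2))  # 1.33 0.95
print(round(solve_lambda(5, -4), 2), round(target_identity(5, -4), 2))  # 0.19 0.65
```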

15 Today Significance of alignment score Sequence alignment and FSA

16 Statistics of Alignment Scores Q: How do we assess whether an alignment provides good evidence for homology? –Is a score 82 good? What about 180? A: determine how likely it is that such an alignment score would result from chance

17 Most of the theory applies to local alignment For global alignment, your best bet is to do Monte-Carlo simulation –Randomly shuffle your sequences before alignment –What’s the chance you can get a score as high as the real alignment?

18 Procedure to estimate the significance of a global alignment
– Given sequences X, Y
– Global alignment score = S
– Randomly shuffle sequence X (or Y) N times to obtain $X_1, X_2, \dots, X_N$
– Align each $X_i$ with Y; let the score be $S_i$
– Plot the distribution of the $S_i$ and see where the real S falls
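
The shuffling procedure can be sketched as below. The scoring scheme (match = 1, mismatch = -1, linear gap penalty d = 1), the function names, and the fixed random seed are assumptions for illustration; the lecture's HEXA examples use real protein sequences and a different scoring system.

```python
import random

def global_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch global alignment score with a linear gap penalty d."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-d * i]
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sigma, prev[j] - d, curr[j - 1] - d))
        prev = curr
    return prev[-1]

def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Estimate the p-value of the global score of x vs y by shuffling x.

    Returns (real score, number of shuffles scoring >= real, estimate).
    When no shuffle reaches the real score, the estimate is 0.0 and all
    we can conclude is that the p-value is below 1 / n_shuffles.
    """
    rng = random.Random(seed)
    real = global_score(x, y)
    letters = list(x)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)              # shuffle in place
        if global_score("".join(letters), y) >= real:
            hits += 1
    return real, hits, hits / n_shuffles
```

Shuffling preserves the composition of X, so the random scores reflect what sequences with the same letters but no shared ancestry would give.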

19 [Figure: alignment of mouse HEXA and human HEXA; score = 732]

20 [Figure: distribution of the alignment scores between mouse HEXA and 200 randomly shuffled human HEXA sequences; the real score, 732, is marked]

21 [Figure: alignment of human HEXA and fly HEXO1; score = -74]

22 [Figure: distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences; the real score, -74, is marked]

23 P-value of alignment
p-value
– The probability that an alignment score at least this high can be obtained from aligning random sequences
– A small p-value means the score is unlikely to happen by chance
A p-value of 0.05 means that random sequences would score this high only 5% of the time; this is often loosely described as being 95% sure that the result is significant.

24 What p-value is significant? The most common thresholds are 0.01 and 0.05. Is 95% enough? It depends on the cost associated with making a mistake. Examples of costs: –Doing expensive wet lab validation. –Making clinical treatment decisions. –Misleading the scientific community. Most sequence analysis uses more stringent thresholds because the p-values are not very accurate.

25 There are 88 random sequences with alignment score >= -74. Therefore p-value = 88 / 200 = 0.44 => alignment is not significant

26 There are no random sequences with alignment score >= 732. Therefore the p-value is less than 1 / 200 = 0.005 => significant. Even though the true p-value may be much smaller than 0.005, we cannot say anything more precise unless we generate more random sequences

27 Drawbacks
Monte-Carlo may take a long time
Cannot accurately estimate the p-value if p is small
To get a $10^{-5}$ p-value, we have to align $10^5$ random sequences
– Unless we can fit a distribution
Such a distribution may not be generalizable
No theory exists for the global alignment score distribution

28 Statistics for local alignment Theory much more elegant Score for ungapped local alignment follows extreme value distribution (Gumbel distribution) This distribution is characterized by a larger tail on the right.

29 [Figure: two panels, normal distribution and extreme value distribution]
Intuitive interpretation for the extreme value distribution: randomly sample 100 numbers from a normal distribution and compute the max. Repeat 100 times. The max values will follow an extreme value distribution

30 Computing a p-value The probability of observing a score >4 is the area under the curve to the right of 4. For score S, this probability is calculated as

31 Computing a p-value

32 Statistics for local alignment
How does this apply to sequence alignment?
Given two unrelated sequences of lengths M, N, the expected number of local alignments with score >= S can be calculated by
– $E(S) = K M N \exp(-\lambda S)$
– Known as the E-value
– $\lambda$: scaling factor as computed in last lecture
– K: empirical parameter ~ 0.1; depends on sequence composition and substitution matrix

33 P-value for alignment score
P-value for a local alignment score S: $P(S) = 1 - \exp(-E(S)) \approx E(S)$ when P is small.

34 Example You are aligning two sequences, each has 1000 bases m = 1, s = -1, d = -inf (ungapped alignment) You obtain a score 20 Is this score significant?

35 $\lambda = \ln 3 = 1.1$
$E(S) = K M N \exp(-\lambda S)$
$E(20) = 0.1 \times 1000 \times 1000 \times 3^{-20} = 3 \times 10^{-5}$
P-value = $3 \times 10^{-5}$ << 0.05
The alignment is significant

36 [Figure: distribution of scores for 1000 random sequence pairs; the score 20 is marked]

37 Multiple-testing problem
What if you are searching a 1000-base sequence against a database of $10^6$ sequences (average length 1000 bases)? How significant is a score of 20 now?
You are essentially comparing 1000 bases with $1000 \times 10^6 = 10^9$ bases (ignoring edge effects)
$E(20) = 0.1 \times 1000 \times 10^9 \times 3^{-20} = 30$
By chance we would expect to see 30 matches
P-value = $1 - e^{-30}$ = 0.9999999999
Not significant at all
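
The two worked examples (one sequence pair, then a database search) can be reproduced with a short sketch. The function names are hypothetical; K = 0.1 and $\lambda = \ln 3$ are the values used in the examples, and the p-value uses $1 - e^{-E}$ as in the database calculation.

```python
import math

def e_value(score, m, n, lam, K=0.1):
    """Expected number of local alignments with score >= `score`: K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """P-value from an E-value: 1 - exp(-e), which is close to e when e is small."""
    return -math.expm1(-e)

lam = math.log(3)                        # match = 1, mismatch = -1, uniform bases

pair = e_value(20, 1000, 1000, lam)      # two 1000-base sequences
print(pair, p_value(pair))               # both about 2.9e-5

database = e_value(20, 1000, 1e9, lam)   # 1000 bases against 10^9 bases
print(database, p_value(database))       # about 29 expected hits, p-value close to 1
```

The slide rounds the exact values (about 2.9 x 10^-5 and 29) to 3 x 10^-5 and 30.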

38 Statistics for gapped local alignment
Theory not well developed
Extreme value distribution works well empirically
Need to estimate K and $\lambda$ empirically
– Given the database and substitution matrix, generate some random sequence pairs
– Do local alignment
– Fit an extreme value distribution to obtain K and $\lambda$

39 More on sequence alignment and FSA

40 Gap penalty models
Linear model
– $\gamma(n) = n \times d$
– Needleman-Wunsch
– O(MN) time
– O(M+N) memory
General gap penalty function
– $O(N^2 M)$ time
– O(MN) memory
[Figure: plots of $\gamma(n)$ against n for the two models]

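41 Affine gap penalty
$\gamma(n) = d + (n - 1) \times e$, where d is the gap open penalty and e is the gap extension penalty
O(MN) time
O(M+N) memory
[Figure: plot of $\gamma(n)$ against n, with d and e marked]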

42 Finite State Automaton
States: Aligned, Gap on x, Gap on y
Transitions (input / output):
– $(x_i, y_j)$ / $\sigma$
– $(x_i, -)$ / d to open a gap, $(x_i, -)$ / e to extend it
– $(-, y_j)$ / d to open a gap, $(-, y_j)$ / e to extend it

43 Finite State Automaton
States: F, Ix, Iy
Transitions (input / output): $(x_i, y_j)$ / $\sigma$; $(x_i, -)$ / d; $(x_i, -)$ / e; $(-, y_j)$ / d; $(-, y_j)$ / e
Mealy machine: output associated with transitions
Moore machine: output associated with states
A Mealy machine generally uses fewer states. The two are mutually convertible.

44 Mealy machine
A Mealy machine is a 6-tuple $(S, S_0, \Sigma, \Lambda, T, G)$, consisting of the following:
– a finite set of states (S)
– a start state (also called initial state) $S_0$, which is an element of S
– a finite set called the input alphabet ($\Sigma$)
– a finite set called the output alphabet ($\Lambda$)
– a transition function ($T : S \times \Sigma \to S$)
– an output function ($G : S \times \Sigma \to \Lambda$)

45 FSA with states F (start state), Ix, Iy; transitions are labeled input / output
Current state   Input          Output     Next state
F               $(x_i, y_j)$   $\sigma$   F
F               $(-, y_j)$     d          Ix
F               $(x_i, -)$     d          Iy
Ix              $(-, y_j)$     e          Ix
…               …              …          …

46 Finite State Automaton (states F, Ix, Iy; transitions as on the previous slides)
Given a pair of sequences, find a path in the state diagram that reproduces the sequences using this machine such that the score is the highest

47 Example: generating alignments of AAC and ACT (F is the start state; symbols are generated during transitions)
– Path F-F-F-F generates
  AAC
  ACT
– Path F-Iy-F-F-Ix generates
  AAC-
  -ACT
– Path F-F-Iy-F-Ix generates
  AAC-
  A-CT
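
To make the "output on transitions" idea concrete, the sketch below walks a given gapped alignment through the three-state machine and sums the outputs. The function name `fsa_score` is hypothetical, and the default scores (match = 2, mismatch = -2, d = -5, e = -1, all added) are borrowed from the exercise later in the lecture. Following the transition table, the input (-, y_j) leads to Ix and (x_i, -) leads to Iy; a direct move between Ix and Iy is rejected because the diagram shows no such transition.

```python
def fsa_score(ax, ay, match=2, mismatch=-2, d=-5, e=-1):
    """Score a gapped alignment (two equal-length rows) by walking the FSA.

    Returns (total score, state path). d and e are negative scores that
    are added on gap-open and gap-extension transitions.
    """
    if len(ax) != len(ay):
        raise ValueError("alignment rows must have equal length")
    state, path, total = "F", ["F"], 0        # F is the start state
    for a, b in zip(ax, ay):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a != "-" and b != "-":             # input (x_i, y_j): output sigma
            total += match if a == b else mismatch
            nxt = "F"
        else:
            nxt = "Ix" if a == "-" else "Iy"  # (-, y_j) -> Ix, (x_i, -) -> Iy
            if state == "F":
                total += d                    # gap open
            elif state == nxt:
                total += e                    # gap extension
            else:
                raise ValueError("no transition between Ix and Iy in this machine")
        state = nxt
        path.append(state)
    return total, "-".join(path)

print(fsa_score("AAC-", "-ACT"))  # (-6, 'F-Iy-F-F-Ix')
```

Finding the best alignment then amounts to searching over all paths rather than scoring one, which is what the dynamic programming recurrences do.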

48 Recurrence for state F (reached on input $(x_i, y_j)$ / $\sigma$):
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ I_x(i-1, j-1) + \sigma(x_i, y_j) \\ I_y(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$

49 Recurrence for state Ix:
$$I_x(i, j) = \max \begin{cases} F(i, j-1) + d \\ I_x(i, j-1) + e \end{cases}$$

50 Recurrence for state Iy:
$$I_y(i, j) = \max \begin{cases} F(i-1, j) + d \\ I_y(i-1, j) + e \end{cases}$$

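51 Summary of the recurrences. Here d and e are written as positive penalties and subtracted; slides 48-50 add d and e instead, which matches the negative values used in the exercise.
$$F(i, j) = \sigma(x_i, y_j) + \max \begin{cases} F(i-1, j-1) & \text{continuing alignment} \\ I_x(i-1, j-1) & \text{closing gaps in x} \\ I_y(i-1, j-1) & \text{closing gaps in y} \end{cases}$$
$$I_x(i, j) = \max \begin{cases} F(i, j-1) - d & \text{opening a gap in x} \\ I_x(i, j-1) - e & \text{gap extension in x} \end{cases}$$
$$I_y(i, j) = \max \begin{cases} F(i-1, j) - d & \text{opening a gap in y} \\ I_y(i-1, j) - e & \text{gap extension in y} \end{cases}$$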

52 Exercise x = GCAC y = GCC m = 2 s = -2 d = -5 e = -1
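
A compact sketch of the three-table dynamic program can be used to check the exercise. The function name `affine_global_score` is hypothetical; d and e are taken as negative scores and added, matching the exercise values. The boundary conditions (F(0, 0) = 0, gap tables starting at d and decreasing by e along the edges, everything else undefined) are an assumption consistent with the first table of the worked solution.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=2, mismatch=-2, d=-5, e=-1):
    """Global alignment score with affine gaps, using tables F, Ix, Iy.

    d (gap open) and e (gap extension) are negative scores that are added.
    """
    M, N = len(x), len(y)
    F = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Ix = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Iy = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for j in range(1, N + 1):          # leading gap in x: y_1..y_j unaligned
        Ix[0][j] = d + (j - 1) * e
    for i in range(1, M + 1):          # leading gap in y: x_1..x_i unaligned
        Iy[i][0] = d + (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = sigma + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)
    return max(F[M][N], Ix[M][N], Iy[M][N])

print(affine_global_score("GCAC", "GCC"))  # 1
```

The score 1 corresponds to the alignment GCAC / GC-C: three matches (2 + 2 + 2) and one gap of length 1 (-5).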

53 0 -- -- -- -- -- -- -- -- -- -- -- -5 -6 -7 -8 -- -5-6-7 -- -- -- -- FIy: Insertion on y F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Ix: Insertion on x

54 0 -- -- -- -- 2 -- -- -- -- -- -- -- -5 -6 -7 -8 -- -5-6-7 -- -- -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

55 0 -- -- -- -- 2-7 -- -- -- -- -- -- -- -5 -6 -7 -8 -- -5-6-7 -- -- -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

56 0 -- -- -- -- 2-7-8 -- -- -- -- -- -- -- -5 -6 -7 -8 -- -5-6-7 -- -- -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

57 0 -- -- -- -- 2-7-8 -- -- -- -- -- -- -5 -6 -7 -8 -5-6-7 -- -- -3 -- -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

58 0 -- -- -- -- 2-7-8 -- -- -- -- -- -- -5 -6 -7 -8 -5-6-7 -- -- -3-4 -- -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

59 0 -- -- -- -- 2-7-8 -- -- -- -- -- -- -5 -- -- -- -6 -7 -8 -5-6-7 -- -- -3-4 -- -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

60 0 -- -- -- -- 2-7-8 -- -7 -- -- -- -- -- -5 -- -- -- -6 -7 -8 -5-6-7 -- -- -3-4 -- -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

61 0 -- -- -- -- 2-7-8 -- -74 -- -- -- -- -- -5 -- -- -- -6 -7 -8 -5-6-7 -- -- -3-4 -- -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

62 0 -- -- -- -- 2-7-8 -- -74-5 -- -- -- -- -- -- -- -- -6 -7 -8 -5-6-7 -- -- -3-4 -- -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

63 0 -- -- -- -- 2-7-8 -- -74-5 -- -- -- -- -- -- -- -- -6 -7 -8 -5-6-7 -- -- -3-4 -- -- -12 -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

64 0 -- -- -- -- 2-7-8 -- -74-5 -- -- -- -- -- -- -- -- -6-3 -7 -8 -5-6-7 -- -- -3-4 -- -- -12 -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

65 0 -- -- -- -- 2-7-8 -- -74-5 -- -- -- -- -- -- -- -- -6-3-12-13 -7 -8 -5-6-7 -- -- -3-4 -- -- -12 -- -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

66 0 -- -- -- -- 2-7-8 -- -74-5 -- -8-52 -- -- -- -- -- -- -- -6-3-12-13 -7 -8 -5-6-7 -- -- -3-4 -- -- -12 -- -- -13-10 -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

67 0 -- -- -- -- 2-7-8 -- -74-5 -- -8-52 -- -- -- -- -- -- -- -6-3-12-13 -7-8 -8 -5-6-7 -- -- -3-4 -- -- -12 -- -- -13-10 -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

68 0 -- -- -- -- 2-7-8 -- -74-5 -- -8-52 -- -- -- -- -- -- -- -6-3-12-13 -7-8-10 -8 -5-6-7 -- -- -3-4 -- -- -12 -- -- -13-10 -- FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

69 0 -- -- -- -- 2-7-8 -- -74-5 -- -8-52 -- -9-61 -- -- -- -5 -- -- -- -6-3-12-13 -7-8-10 -8-13-2-3 -5-6-7 -- -- -3-4 -- -- -12 -- -- -13-10 -- -- -14-11 FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

70 0 -- -- -- -- 2-7-8 -- -74-5 -- -8-52 -- -9-61 -- -- -- -5 -- -- -- -6-3-12-13 -7-8-10 -8-13-2-3 -5-6-7 -- -- -3-4 -- -- -12 -- -- -13-10 -- -- -14-11 FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC

71 0 -- -- -- -- 2-7-8 -- -74-5 -- -8-52 -- -9-61 -- -- -- -5 -- -- -- -6-3-12-13 -7-8-10 -8-13-2-3 -5-6-7 -- -- -3-4 -- -- -12 -- -- -13-10 -- -- -14-11 FIy Ix F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC GCAC || | GC-C

72 Exercising FSA How do you make an FSA for the Needleman-Wunsch algorithm?

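73 Exercising FSA
How do you make an FSA for the Needleman-Wunsch algorithm?
States: F, Ix, Iy
Transitions (input / output): $(x_i, y_j)$ / $\sigma$; $(x_i, -)$ / d, whether opening or extending; $(-, y_j)$ / d, whether opening or extending

74 Simplify
States: F, I
Transitions (input / output): $(x_i, y_j)$ / $\sigma$; $(x_i, -)$ / d; $(-, y_j)$ / d

75 Simplify more
State: F
Transitions (input / output): $(x_i, y_j)$ / $\sigma$; $(-, y_j)$ / d; $(x_i, -)$ / d
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) + d \\ F(i, j-1) + d \end{cases}$$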

76 A more difficult alignment problem (A gene finder indeed!) X is a genomic sequence (DNA) –X encodes a gene –May contain introns Y is an ORF from another species –Contains only exons We want to compare X against Y –Conservation is on the level of amino acids

77 [Figure: gene structure, showing DNA, pre-mRNA, and mature mRNA after splicing, with 5’ UTR, 3’ UTR, exons, introns, start codon, stop codon, and the open reading frame (ORF)]

78 We have a predicted gene
We know the positions of the start codon and stop codon
But we don’t know where the splicing sites are
– Not even the number of introns
[Figure: predicted gene with start codon, exons, introns, and stop codon]

79 [Figure: mouse putative gene compared with human ORF; an intron spans GT…………AG]
1. Most splicing sites start at GT and end at AG
2. But there are lots of GT and AG in the sequence
3. Aligning to an orthologous gene with a known ORF may help us determine the splicing sites
Orthologous genes: two genes evolved from the same ancestor
Coding regions are likely conserved at the amino acid level
– UUA, UUG encode the same amino acid
– So do UCA, UCU, UCG, UCC

80 The Genetic Code [Figure: codon table]

81 Easy if we know where the exons are
Mouse putative gene: remove introns to get the mouse putative ORF, then translate
Global alignment against the human ORF

82 Or directly align triplets
Mouse putative gene: remove introns to get the mouse putative ORF
Global alignment against the human ORF

83 Codon substitution scores (64 x 64 substitution matrix; only the entries visible on the slide are listed)
       AAA  AAG  AAU  AAC  …  UCU  UCC
AAA     4    3
AAG     3    4
AAU               4    3       1    1
AAC               3    4       1    1
…
UCU               1    1       4    3
UCC               1    1       3    4

84 FSA for aligning genomic DNA to ORF (considering only exons)
States: A, B
Transitions (input / output):
– $(x_{i-2} x_{i-1} x_i,\ y_{j-2} y_{j-1} y_j)$ / $\sigma$
– $(x_{i-2} x_{i-1} x_i,\ -)$ or $(-,\ y_{j-2} y_{j-1} y_j)$ / d
– $(x_{i-2} x_{i-1} x_i,\ -)$ or $(-,\ y_{j-2} y_{j-1} y_j)$ / e

85 [Figure: mouse putative gene compared with human ORF]
1. We don’t know exactly where the splicing sites are
2. The length of an intron may not be a multiple of 3
– If we convert the whole sequence into triplets, the ORF may shift (e.g. an intron of 17 bases)

86 Model introns
[Figure: mouse putative gene compared with human ORF; intron GT…………AG; 120 nt = 40 aa, 126 nt = 42 aa]
1. Most splicing sites start at GT and end at AG
2. For simplicity, assume the length of each exon is a multiple of 3
– Not true in reality
– Only a little more work without this assumption

87 Aligning genomic DNA to ORF Fixed cost to have an intron Alignment with Affine gap penalty

88 FSA for aligning genomic DNA to ORF (considering only exons)
States: A, B
Transitions (input / output):
– $(x_{i-2} x_{i-1} x_i,\ y_{j-2} y_{j-1} y_j)$ / $\sigma$
– $(x_{i-2} x_{i-1} x_i,\ -)$ or $(-,\ y_{j-2} y_{j-1} y_j)$ / d
– $(x_{i-2} x_{i-1} x_i,\ -)$ or $(-,\ y_{j-2} y_{j-1} y_j)$ / e

89 FSA for aligning genomic DNA to ORF
States: A, B, C
Transitions as on the previous slide, plus:
– (-, GT) / s: start an intron

90 FSA for aligning genomic DNA to ORF
States: A, B, C
Transitions as on the previous slide, plus:
– $(-, y_i)$ / 0: continue in intron

91 FSA for aligning genomic DNA to ORF
States: A, B, C
Transitions as on the previous slide, plus:
– (-, AG) / s: close an intron

92 Recurrence for state A (FSA with states A, B, C as on the previous slides):
$$A(i, j) = \max \begin{cases} A(i-3, j-3) + \sigma(x_{i-2} x_{i-1} x_i,\ y_{j-2} y_{j-1} y_j) \\ B(i-3, j-3) + \sigma(x_{i-2} x_{i-1} x_i,\ y_{j-2} y_{j-1} y_j) \\ C(i, j-2) + s, & \text{if } y_{j-1} y_j = \text{AG} \end{cases}$$

93 Recurrence for state B:
$$B(i, j) = \max \begin{cases} A(i, j-3) + d \\ A(i-3, j) + d \\ B(i, j-3) + e \\ B(i-3, j) + e \end{cases}$$

94 Recurrence for state C:
$$C(i, j) = \max \begin{cases} B(i, j-2) + s, & \text{if } y_{j-1} y_j = \text{GT} \\ C(i, j-1) \end{cases}$$

95 ACGGATGCGATCAGTTGTACTACGAGCTGACGGTCCTCAGACTTGATTA

96 There is a close relationship between dynamic programming, FSA, regular expression, and regular grammar Using FSA, you can design more complex alignment algorithms If you can draw the state diagram for a problem, it can be easily formulated into a DP problem –In particular, Hidden Markov Models –Will discuss more in a few weeks

