Welcome to CS262!


1 Welcome to CS262!

2 Goals of this course
Introduction to Computational Biology
- Basic biology for computer scientists
- Breadth: mention many topics & applications
In-depth coverage of Computational Genomics
- Algorithms for sequence analysis
- Current applications, trends, and open problems
Coverage of useful algorithms
- Hidden Markov models
- Dynamic Programming
- String algorithms
- Applications of AI techniques

3 Topics in CS262
Part 1: In-depth coverage of basic computational methods for analysis of biological sequences
- Sequence Alignment & Dynamic Programming
- Hidden Markov models
These methods are used heavily in most genomics applications:
- DNA sequencing
- Comparison of DNA and proteins across organisms
- Discovery of genes, promoters, regulatory sites

4 Topics in CS262
Part 2: Topics in computational genomics, more algorithms, and areas of active research
- DNA sequencing & assembly: reading a complete genome such as the human DNA
- Gene finding: marking genes on the DNA sequence
- Large-scale comparative genomics: comparing whole genomes from multiple organisms
- Microarrays & regulation: understanding the regulatory code, and potential disease-causing genes
- RNA structure: predicting the folding of RNA
- Phylogeny and evolution: quantifying the evolution of biological sequences

5 Course responsibilities
Homeworks [80%]
- 4 challenging problem sets, 4-5 problems per problem set
- Collaboration allowed; please give credit
Final [20%]
- Take-home, 1 day
- Collaboration not allowed
- Basic questions, much easier than the homeworks
Scribing [optional]
- Due one week after the lecture, except by special permission
- Scribing grade replaces the 2 lowest problems from the homeworks

6 Reading material
Books
- “Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison: Chapters 1-4, 6, (7-8), (9-10)
- “Algorithms on strings, trees, and sequences” by Gusfield: Chapters (5-7), 11-12, (13), 14, (17)
Papers
Lecture notes

7 Topic 1. Sequence Alignment

8 Biology in One Slide: 2 Paradigms
- Molecular Paradigm
- Evolution Paradigm

9 Complete DNA Sequences
Nearly 200 complete genomes have been sequenced.

10 Evolution

11 Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
Sequence edits:
- Mutation
- Deletion
Rearrangements:
- Inversion
- Translocation
- Duplication

12 Evolutionary Rates
[Figure labels: OK, X, X, Still OK?, next generation]

13 Sequence conservation implies function
Alignment is the key to
- Finding important regions
- Determining function
- Uncovering the evolutionary forces

14 Sequence Alignment
Unaligned sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
One alignment of them:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings $x = x_1 x_2 \dots x_M$ and $y = y_1 y_2 \dots y_N$, an alignment is an assignment of gaps to positions $0, \dots, M$ in $x$ and $0, \dots, N$ in $y$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

15 Scoring Function
Sequence edits, starting from AGGCCTC:
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGG.CTC
Scoring function:
- Match: $+m$
- Mismatch: $-s$
- Gap: $-d$
Score: $F = (\#\text{matches}) \times m - (\#\text{mismatches}) \times s - (\#\text{gaps}) \times d$
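The scoring function above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score`, the use of `-` as the gap character, and the default values `m = s = d = 1` are assumptions of this sketch, not part of the slides.

```python
def alignment_score(row_x, row_y, m=1, s=1, d=1):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap).

    Implements F = (#matches)*m - (#mismatches)*s - (#gaps)*d.
    The defaults m = s = d = 1 are illustrative assumptions.
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two gaps")
        if a == "-" or b == "-":
            gaps += 1          # one letter lined up with a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


# The mutation example from the slide: six matches and one mismatch.
print(alignment_score("AGGCCTC", "AGGACTC"))
```

Each gap column is charged separately here, which corresponds to the linear gap model used in the first part of the lecture.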

16 How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: $O(2^{M+N})$

17 Alignment is additive
Observation: the score of aligning $x_1 \dots x_M$ with $y_1 \dots y_N$ is additive.
Say that $x_1 \dots x_i$ aligns to $y_1 \dots y_j$, and $x_{i+1} \dots x_M$ aligns to $y_{j+1} \dots y_N$.
The two scores add up:
$$F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])$$

18 Dynamic Programming
We will now describe a dynamic programming algorithm.
Suppose we wish to align $x_1 \dots x_M$ with $y_1 \dots y_N$.
Let $F(i, j)$ = optimal score of aligning $x_1 \dots x_i$ with $y_1 \dots y_j$.

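19 Dynamic Programming (cont’d)
Notice three possible cases:
1. $x_i$ aligns to $y_j$: the last column pairs $x_i$ with $y_j$, after aligning $x_1 \dots x_{i-1}$ with $y_1 \dots y_{j-1}$.
   $F(i, j) = F(i-1, j-1) + m$ if $x_i = y_j$, and $F(i, j) = F(i-1, j-1) - s$ if not
2. $x_i$ aligns to a gap: the last column pairs $x_i$ with a gap, after aligning $x_1 \dots x_{i-1}$ with $y_1 \dots y_j$.
   $F(i, j) = F(i-1, j) - d$
3. $y_j$ aligns to a gap: the last column pairs a gap with $y_j$, after aligning $x_1 \dots x_i$ with $y_1 \dots y_{j-1}$.
   $F(i, j) = F(i, j-1) - d$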

20 Dynamic Programming (cont’d)
How do we know which case is correct?
Inductive assumption: $F(i, j-1)$, $F(i-1, j)$, and $F(i-1, j-1)$ are optimal.
Then,
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not.
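The recurrence above translates almost literally into a memoized recursion. The sketch below is a hedged illustration: the name `optimal_score`, the defaults `m = s = d = 1`, and the base cases $F(i, 0) = -i \times d$ and $F(0, j) = -j \times d$ (which this slide does not state) are assumptions.

```python
from functools import lru_cache


def optimal_score(x, y, m=1, s=1, d=1):
    """Optimal global alignment score via the three-case recurrence.

    Assumed base cases: aligning a prefix against the empty string
    costs one gap penalty d per letter.
    """

    @lru_cache(maxsize=None)
    def F(i, j):
        if i == 0:
            return -j * d
        if j == 0:
            return -i * d
        sub = m if x[i - 1] == y[j - 1] else -s   # s(x_i, y_j)
        return max(
            F(i - 1, j - 1) + sub,   # case 1: x_i aligns to y_j
            F(i - 1, j) - d,         # case 2: x_i aligns to a gap
            F(i, j - 1) - d,         # case 3: y_j aligns to a gap
        )

    return F(len(x), len(y))


print(optimal_score("AGTA", "ATA"))
```

Memoization ensures each $(i, j)$ pair is evaluated once, so the work is proportional to $M \times N$. For long sequences an explicit table is preferable, since the recursion depth grows with $M + N$.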

21 Example
x = AGTA, y = ATA; match score m = 1, mismatch penalty s = 1, gap penalty d = 1

F(i,j)   i =  0    1    2    3    4
                   A    G    T    A
j = 0         0   -1   -2   -3   -4
j = 1   A    -1    1    0   -1   -2
j = 2   T    -2    0    0    1    0
j = 3   A    -3   -1   -1    0    2

Optimal alignment: F(4,3) = 2
AGTA
A-TA

22 The Needleman-Wunsch Matrix
[Figure: matrix with $x_1 \dots x_M$ along one axis and $y_1 \dots y_N$ along the other]
Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences.
An optimal alignment is composed of optimal subalignments.

23 The Needleman-Wunsch Algorithm
1. Initialization.
   a. $F(0, 0) = 0$
   b. $F(0, j) = -j \times d$
   c. $F(i, 0) = -i \times d$
2. Main iteration. Filling in partial alignments.
   a. For each $i = 1 \dots M$
        For each $j = 1 \dots N$
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) & \text{[case 1]} \\ F(i-1, j) - d & \text{[case 2]} \\ F(i, j-1) - d & \text{[case 3]} \end{cases}$$
$$Ptr(i, j) = \begin{cases} \text{DIAG}, & \text{if [case 1]} \\ \text{LEFT}, & \text{if [case 2]} \\ \text{UP}, & \text{if [case 3]} \end{cases}$$
3. Termination. $F(M, N)$ is the optimal score, and from $Ptr(M, N)$ we can trace back the optimal alignment.
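The three steps above (initialization, main iteration, termination with traceback) can be sketched as follows. The function name `needleman_wunsch`, the defaults `m = s = d = 1`, the pointers along the borders, and the tie-breaking order (DIAG, then LEFT, then UP) are assumptions of this sketch; the slide does not specify them.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment with linear gap penalty; returns (score, row_x, row_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: a prefix aligned against nothing is all gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"

    # 2. Main iteration: fill in partial alignments.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # case 1
                (F[i - 1][j] - d, "LEFT"),        # case 2: x_i against a gap
                (F[i][j - 1] - d, "UP"),          # case 3: y_j against a gap
            ]
            # max() keeps the first candidate on ties (assumed tie-break).
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # 3. Termination: trace pointers back from (M, N) to (0, 0).
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "LEFT":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


score, ax, ay = needleman_wunsch("AGTA", "ATA")
print(score)
print(ax)
print(ay)
```

Run on the earlier example (x = AGTA, y = ATA), this reproduces the score 2 and the alignment AGTA over A-TA.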

24 Performance
Time: $O(NM)$
Space: $O(NM)$
Later we will cover more efficient methods.

25 A variant of the basic algorithm
Maybe it is OK to have an unlimited number of gaps at the beginning and end:
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
Then, we don’t want to penalize gaps at the ends.

26 Different types of overlaps

27 The Overlap Detection variant
Changes:
1. Initialization: for all $i, j$: $F(i, 0) = 0$, $F(0, j) = 0$
2. Termination:
$$F_{OPT} = \max \begin{cases} \max_i F(i, N) \\ \max_j F(M, j) \end{cases}$$
[Figure: matrix with $x_1 \dots x_M$ along one axis and $y_1 \dots y_N$ along the other]
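A minimal sketch of the two changes above, computing only the score. The name `overlap_score`, the defaults `m = s = d = 1`, and the test strings are illustrative assumptions.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at the two ends are free."""
    M, N = len(x), len(y)
    # Change 1: first row and first column stay 0 (leading gaps are free).
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] - d, F[i][j - 1] - d)
    # Change 2: the alignment may end anywhere on the last column or last row
    # (trailing gaps are free).
    best_last_column = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_column, best_last_row)


# The suffix TAG of the first string overlaps the prefix TAG of the second.
print(overlap_score("CCCTAG", "TAGGGG"))
```

The iteration itself is unchanged from Needleman-Wunsch; only the borders and the place where the optimum is read off differ.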

28 The local alignment problem
Given two strings $x = x_1 \dots x_M$, $y = y_1 \dots y_N$,
find substrings $x'$, $y'$ whose similarity (optimal global alignment value) is maximum.
e.g.
x = aaaacccccgggg
y = cccgggaaccaacc

29 Why local alignment
- Genes are shuffled between genomes
- Portions of proteins (domains) are often conserved

30 Cross-species genome similarity
98% of genes are conserved between any two mammals
>70% average similarity in protein sequence

hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174

hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174

hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174

“atoh” enhancer in human, mouse, rat, fugu fish

31 The Smith-Waterman algorithm
Idea: ignore badly aligning regions.
Modifications to Needleman-Wunsch:
Initialization: $F(0, j) = F(i, 0) = 0$
Iteration:
$$F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + s(x_i, y_j) \end{cases}$$

32 The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment:
   $F_{OPT} = \max_{i,j} F(i, j)$
2. If we want all local alignments scoring $> t$:
   For all $i, j$ find $F(i, j) > t$, and trace back.
   This is complicated by overlapping local alignments.
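A hedged sketch of the best-local-alignment case (termination option 1). The name `smith_waterman`, the defaults `m = s = d = 1`, and the choice to recompute the predecessor during traceback instead of storing pointers are assumptions of this sketch.

```python
def smith_waterman(x, y, m=1, s=1, d=1):
    """Best local alignment; returns (score, aligned part of x, aligned part of y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # F(0, j) = F(i, 0) = 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            # The extra 0 lets an alignment restart instead of carrying a bad prefix.
            F[i][j] = max(0, F[i - 1][j - 1] + sub, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_i, best_j = F[i][j], i, j

    # Trace back from the best cell until a zero score is reached.
    row_x, row_y = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and F[i][j] > 0:
        sub = m if x[i - 1] == y[j - 1] else -s
        if F[i][j] == F[i - 1][j - 1] + sub:
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


# The example strings from the local alignment problem slide.
print(smith_waterman("aaaacccccgggg", "cccgggaaccaacc"))
```

With unit scores, the best local alignment of the example strings is the shared substring cccggg, with score 6.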

33 Scoring the gaps more accurately
Current model: a gap of length $n$ incurs penalty $n \times d$.
However, gaps usually occur in bunches.
Convex gap penalty function $\gamma(n)$: for all $n$, $\gamma(n+1) - \gamma(n) \le \gamma(n) - \gamma(n-1)$
[Figure: plot of $\gamma(n)$]

34 Convex gap dynamic programming
Initialization: same
Iteration:
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ \max_{k=0 \dots i-1} F(k, j) - \gamma(i-k) \\ \max_{k=0 \dots j-1} F(i, k) - \gamma(j-k) \end{cases}$$
Termination: same
Running time: $O(N^2 M)$ (assume $N > M$)
Space: $O(NM)$
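The iteration above can be sketched directly, with the gap function passed in as a callable. The name `align_general_gap`, the parameter `gamma`, the defaults `m = s = 1`, the border values $F(i, 0) = -\gamma(i)$ and $F(0, j) = -\gamma(j)$ (one reading of "Initialization: same"), and the logarithmic example penalty are all assumptions of this sketch.

```python
import math


def align_general_gap(x, y, gamma, m=1, s=1):
    """Global alignment score where a gap of length n costs gamma(n)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)      # assumed border: one gap of length i
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            best = F[i - 1][j - 1] + sub
            # A gap covering x_{k+1}..x_i, for every possible start k.
            for k in range(i):
                best = max(best, F[k][j] - gamma(i - k))
            # A gap covering y_{k+1}..y_j, for every possible start k.
            for k in range(j):
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]


# A penalty that grows slowly with gap length, so one long gap is cheap.
log_gap = lambda n: 1 + math.log(n)
print(align_general_gap("AAAGGGTTT", "AAATTT", log_gap))
```

The two inner loops over `k` are what raise the running time from $O(NM)$ to $O(N^2 M)$: each cell now inspects a whole row and column prefix rather than three neighbors.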

35 Compromise: affine gaps
$\gamma(n) = d + (n-1) \times e$, where $d$ is the gap-open penalty and $e$ is the gap-extend penalty
[Figure: plot of $\gamma(n)$ with initial height $d$ and slope $e$]
To compute the optimal alignment, at position $i, j$ we need to “remember” the best score if a gap is open and the best score if a gap is not open:
- $F(i, j)$: score of alignment $x_1 \dots x_i$ to $y_1 \dots y_j$ if $x_i$ aligns to $y_j$
- $G(i, j)$: score if $x_i$ aligns to a gap after $y_j$
- $H(i, j)$: score if $y_j$ aligns to a gap after $x_i$
- $V(i, j)$: best score of alignment $x_1 \dots x_i$ to $y_1 \dots y_j$

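36 Needleman-Wunsch with affine gaps
Why do we need two matrices?
1. $x_i$ aligns to $y_j$, and then $x_{i+1}$ aligns to a gap: $x_1 \dots x_{i-1}\, x_i\, x_{i+1}$ over $y_1 \dots y_{j-1}\, y_j\, -$. A new gap is opened: add $-d$.
2. $x_i$ aligns to a gap, and then $x_{i+1}$ aligns to a gap: $x_1 \dots x_{i-1}\, x_i\, x_{i+1}$ over $y_1 \dots y_j\, \dots\, -\, -$. An open gap is extended: add $-e$.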

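37 Needleman-Wunsch with affine gaps
Initialization:
$V(i, 0) = -d - (i-1) \times e$
$V(0, j) = -d - (j-1) \times e$
Iteration:
$$V(i, j) = \max \{ F(i, j),\; G(i, j),\; H(i, j) \}$$
$$F(i, j) = V(i-1, j-1) + s(x_i, y_j)$$
$$G(i, j) = \max \begin{cases} V(i-1, j) - d \\ G(i-1, j) - e \end{cases}$$
$$H(i, j) = \max \begin{cases} V(i, j-1) - d \\ H(i, j-1) - e \end{cases}$$
Here $G$ ends with $x_i$ aligned to a gap and $H$ ends with $y_j$ aligned to a gap.
Termination: similar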
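A minimal score-only sketch of affine-gap alignment. It assumes the convention that `G` holds alignments ending with $x_i$ against a gap and `H` those ending with $y_j$ against a gap, and that border scores are negative (a leading gap of length $n$ costs $d + (n-1) \times e$). The name `affine_gap_score` and the defaults `m = 1, s = 1, d = 3, e = 1` are illustrative assumptions.

```python
NEG_INF = float("-inf")


def affine_gap_score(x, y, m=1, s=1, d=3, e=1):
    """Global alignment score with gap penalty gamma(n) = d + (n - 1) * e."""
    M, N = len(x), len(y)
    new = lambda: [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    F, G, H, V = new(), new(), new(), new()
    V[0][0] = 0
    for i in range(1, M + 1):          # x_1..x_i against one leading gap
        G[i][0] = V[i][0] = -(d + (i - 1) * e)
    for j in range(1, N + 1):          # y_1..y_j against one leading gap
        H[0][j] = V[0][j] = -(d + (j - 1) * e)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = V[i - 1][j - 1] + sub
            # x_i against a gap: open a new gap (-d) or extend one (-e).
            G[i][j] = max(V[i - 1][j] - d, G[i - 1][j] - e)
            # y_j against a gap: open a new gap (-d) or extend one (-e).
            H[i][j] = max(V[i][j - 1] - d, H[i][j - 1] - e)
            V[i][j] = max(F[i][j], G[i][j], H[i][j])
    return V[M][N]


# Six matches and one gap of length 3: 6 - (3 + 2 * 1) = 1.
print(affine_gap_score("AAAGGGTTT", "AAATTT"))
```

Setting `e` equal to `d` recovers the linear gap model, which gives a quick sanity check against plain Needleman-Wunsch.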

38 To generalize a little…
… think of how you would compute the optimal alignment with this gap function $\gamma(n)$ in time $O(MN)$.
[Figure: plot of the gap function $\gamma(n)$]

39 Bounded Dynamic Programming
Assume we know that $x$ and $y$ are very similar.
Assumption: # gaps$(x, y) < k(N)$ (say $N > M$)
Then, $x_i$ aligned to $y_j$ implies $|i - j| < k(N)$.
We can align $x$ and $y$ more efficiently:
Time, Space: $O(N \times k(N)) \ll O(N^2)$

40 Bounded Dynamic Programming
Initialization: $F(i, 0)$, $F(0, j)$ undefined for $i, j > k$
Iteration:
For $i = 1 \dots M$
  For $j = \max(1, i-k) \dots \min(N, i+k)$
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i, j-1) - d, & \text{if } j > i - k(N) \\ F(i-1, j) - d, & \text{if } j < i + k(N) \end{cases}$$
Termination: same
Easy to extend to the affine gap case.
[Figure: band of width $k(N)$ around the diagonal of the matrix with $x_1 \dots x_M$ along one axis and $y_1 \dots y_N$ along the other]
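The banded iteration above can be sketched as follows. This version keeps only the previous row, which is enough for the score; storing every banded row (needed for traceback) would use the $O(N \times k)$ space quoted earlier. The name `banded_score`, the constant band half-width `k`, the defaults `m = s = d = 1`, and the treatment of out-of-band cells as minus infinity are assumptions of this sketch.

```python
NEG_INF = float("-inf")


def banded_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to cells with |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow to reach cell (M, N)")
    # Row i = 0: F(0, j) is defined only for j <= k.
    prev = {j: -j * d for j in range(min(N, k) + 1)}
    for i in range(1, M + 1):
        cur = {}
        for j in range(max(0, i - k), min(N, i + k) + 1):
            if j == 0:
                cur[j] = -i * d            # F(i, 0), defined only for i <= k
                continue
            sub = m if x[i - 1] == y[j - 1] else -s
            best = prev.get(j - 1, NEG_INF) + sub
            # F(i-1, j) lies inside the band only if j < i + k.
            best = max(best, prev.get(j, NEG_INF) - d)
            # F(i, j-1) lies inside the band only if j > i - k.
            best = max(best, cur.get(j - 1, NEG_INF) - d)
            cur[j] = best
        prev = cur
    return prev[N]


print(banded_score("AGTA", "ATA", k=1))
```

If the true optimal path leaves the band, the result is only a lower bound on the optimal score, so `k` has to be chosen with the expected number of gaps in mind.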

