
1 Finding Motifs in DNA References: 1. Bioinformatics Algorithms, Jones and Pevzner, Chapter 4. 2. Algorithms on Strings, Gusfield, Section 7.11. 3. Beginning Perl for Bioinformatics, Tisdall, Chapter 9. 4. Wikipedia

2 Summary Introduce the Motif Finding Problem Explain its significance in bioinformatics Develop a simple model of the problem Design algorithmic solutions: –Brute Force –Branch and Bound –Greedy Compare results of each method.

3 News: October 6, 2009
–IBM Developing Chip to Sequence DNA
–3 Scientists Share Nobel Chemistry Prize for DNA Work
–DNA on bloody clothes matches missing US diplomat
–Gene Discovery May Advance Head and Neck Cancer Therapy
–Updated map of human genome to help fight against disease
–Need a New Heart? Grow Your Own
–S1P Gene Regulating Lipid May Help Develop New Drugs against Cancer

4

5

6 The Motif Finding Problem motif noun 1. a recurring subject, theme, idea, etc., esp. in a literary, artistic, or musical work. 2. a distinctive and recurring form, shape, figure, etc., in a design, as in a painting or on wallpaper. 3. a dominant idea or feature: the profit motif of free enterprise.

7 Example: Fruit Fly Set of immunity genes. DNA pattern: TCGGGGATTTCC Consistently appears upstream of this set of genes. Regulates timing/magnitude of gene expression. “Regulatory Motif” Finding such patterns can be difficult.

8 Construct an Example: 7 DNA Samples
cacgtgaagcgactagctgtactattctgcat
cgtccgatctcaggattgtctggggcgacgat
gggggcggtgcgggagccagcgctcggcgttt
gcaaggcgtcaaattgggaggcgcattctgaa
ccacaagcgagcgttcctcgggattggtcacg
aggtataatgcgaacagctaaaactccggaaa
cccccgcaatttaactagggggcgcttagcgt
Pattern: acctggcc

9 Insert the Pattern at random locations:
cacgtgaacctggccagcgactagctgtactattctgcat
cgtccgatctcaggattgtctacctggccggggcgacgat
gacctggccggggcggtgcgggagccagcgctcggcgttt
gcaaggacctggcccgtcaaattgggaggcgcattctgaa
ccacaagcgagcgttcctcgggattggacctggcctcacg
aggtataatgcgaaacctggcccagctaaaactccggaaa
cccccgcaaacctggcctttaactagggggcgcttagcgt

10 Add Mutations:
cacgtgaacGtggccagcgactagctgtactattctgcat
cgtccgatctcaggattgtctacctgAccggggcgacgat
gGcctggccggggcggtgcgggagccagcgctcggcgttt
gcaaggacctggTccgtcaaattgggaggcgcattctgaa
ccacaagcgagcgttcctcgggattggaActggcctcacg
aggtataatgcgaaacctTgcccagctaaaactccggaaa
cccccgcaaacTtggcctttaactagggggcgcttagcgt

11 Finally, find the hidden pattern:
cacgtgaacgtggccagcgactagctgtactattctgcat
cgtccgatctcaggattgtctacctgaccggggcgacgat
ggcctggccggggcggtgcgggagccagcgctcggcgttt
gcaaggacctggtccgtcaaattgggaggcgcattctgaa
ccacaagcgagcgttcctcgggattggaactggcctcacg
aggtataatgcgaaaccttgcccagctaaaactccggaaa
cccccgcaaacttggcctttaactagggggcgcttagcgt

12 The same samples, with the mutated pattern hidden somewhere in each:
cacgtgaacgtggccagcgactagctgtactattctgcat
cgtccgatctcaggattgtctacctgaccggggcgacgat
ggcctggccggggcggtgcgggagccagcgctcggcgttt
gcaaggacctggtccgtcaaattgggaggcgcattctgaa
ccacaagcgagcgttcctcgggattggaactggcctcacg
aggtataatgcgaaaccttgcccagctaaaactccggaaa
cccccgcaaacttggcctttaactagggggcgcttagcgt

13 Three Approaches Brute Force: –check every possible pattern. Branch and Bound: –prune away some of the search space. Greedy: –commit to “nearby” options, never look back.

14 Brute Force Given a pattern length L, generate all DNA patterns of length L (called “L-mers”). Match each one to the DNA samples. Keep the L-mer with the best match, where “best” is based on a scoring function.
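(Short Python sketches of each step follow the relevant slides below; all of them are illustrative, not the original Perl code.)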

15 Scoring: Hamming Distance Given an L-mer (L = 8, e.g. gtgtaggt) and a DNA sequence, try all starting positions and find the position with the fewest mismatches. In the slide’s example the same 8-mer scores 4, 2, and 8 mismatches at three different starting positions; the best score for that sequence is the minimum, 2.
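A minimal Python sketch of this step (the function names are illustrative; any strings over {a, c, g, t} work):

```python
def hamming(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(p, q) if a != b)

def min_distance(lmer, dna):
    """Slide the L-mer across the sequence; return the fewest mismatches
    found and the starting position where that minimum occurs."""
    L = len(lmer)
    starts = range(len(dna) - L + 1)
    best = min(starts, key=lambda s: hamming(lmer, dna[s:s + L]))
    return hamming(lmer, dna[best:best + L]), best

# e.g. min_distance("gtgtaggt", some_dna_sequence) -> (fewest mismatches, position)
```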

16 Scoring Given t = 8 DNA samples, try each possible L-mer. Its score is the sum, over all samples, of the mismatch count at that sample’s best (fewest-mismatch) location. The L-mer with the lowest such score is the optimal answer. In the slide’s example the per-sample minimum distances are 3, 2, 1, 0, 3, 2, 0, 1, giving a total distance of 12.
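Continuing the sketch above, the total distance of an L-mer is just the sum of its best per-sample scores; the optimal motif minimizes it:

```python
def total_distance(lmer, samples):
    """Sum, over all t DNA samples, of the fewest mismatches the L-mer
    achieves anywhere in that sample (uses min_distance from above)."""
    return sum(min_distance(lmer, dna)[0] for dna in samples)
```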

17 Generating all L-mers Systematic enumeration of all DNA strings of length L. DNA has an “alphabet” of 4 letters: { a, c, g, t }. Proteins have an alphabet of 20 letters: –one for each of the 20 standard amino acids. –{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} Solve the problem for any alphabet size (k) and any L-mer length (L).
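For reference, the whole k^L search space can also be enumerated directly; a hypothetical helper using Python’s itertools:

```python
from itertools import product

def all_lmers(alphabet, L):
    """Yield every length-L string over the alphabet (k**L strings in all)."""
    for letters in product(alphabet, repeat=L):
        yield "".join(letters)

# 4**3 = 64 DNA 3-mers; 20**3 = 8000 protein 3-mers
dna_3mers = list(all_lmers("acgt", 3))
```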

18 Definitions k = size of alphabet L = length of strings to be generated a = vector containing a partial or complete L-mer. i = number of entries in a already filled in. Example: k = 4, L = 5, i = 2, a = (2, 4, *, *, * )

19 Example Alphabet = {1, 2}, k = 2, L = 4. i = depth of the tree. The tree enumerates all 2^4 = 16 strings, from (1111) at the leftmost leaf to (2222) at the rightmost leaf.

20 NEXTVERTEX
NEXTVERTEX(a, i, L, k)
  if i < L
    a(i+1) = 1
    return (a, i+1)
  else
    for j = L down to 1
      if a(j) < k
        a(j) = a(j) + 1
        return (a, j)
  return (a, 0)
If i < L, descend one level (fill the next position with 1). Otherwise backtrack from the leaf: increment the rightmost entry that is still below k and return its depth; return (a, 0) when the traversal is complete.
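A direct Python transcription of the pseudocode above (a 0-based list stands in for the 1-based vector a; the returned i is still the number of filled positions):

```python
def next_vertex(a, i, L, k):
    """Advance a depth-first traversal of the k-ary tree of L-mers.
    a is a length-L list with values in 1..k; i counts the filled entries."""
    if i < L:                        # descend: fill the next position with 1
        a[i] = 1
        return a, i + 1
    for j in range(L - 1, -1, -1):   # backtrack: j = L down to 1 (1-based)
        if a[j] < k:
            a[j] += 1
            return a, j + 1
    return a, 0                      # traversal complete
```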

21 Example: L = 6, k = 3, alphabet = {1, 2, 3}. Successive calls to NEXTVERTEX starting from the leaf (2,3,2,1,2,3): backtrack to (2,3,2,1,3) at i = 5; descend to the leaf (2,3,2,1,3,1), then (2,3,2,1,3,2) and (2,3,2,1,3,3); backtrack to (2,3,2,2) at i = 4; descend through (2,3,2,2,1) at i = 5 to the leaves (2,3,2,2,1,1), (2,3,2,2,1,2), and so on. Whenever i = L, the current vertex is a leaf (a complete L-mer).

22 Brute Force Use NEXTVERTEX to generate the nodes of the tree. Translate each numeric vector into the corresponding L-mer –(e.g. 1 = a, 2 = c, 3 = g, 4 = t). Score each L-mer (total Hamming distance). Keep the best L-mer (and where it matched in each DNA sample).
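Putting the pieces together, a brute-force search might look like the sketch below. It reuses min_distance, total_distance, and all_lmers from the earlier sketches; for brevity it enumerates L-mers with itertools rather than NEXTVERTEX:

```python
def brute_force_motif(samples, L):
    """Score every DNA L-mer; keep the one with the smallest total distance,
    together with its best-matching position in each sample."""
    best_word, best_dist = None, float("inf")
    for word in all_lmers("acgt", L):
        d = total_distance(word, samples)
        if d < best_dist:
            best_word, best_dist = word, d
    positions = [min_distance(best_word, dna)[1] for dna in samples]
    return best_word, best_dist, positions
```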

23 Branch and Bound Uses the same tree structure as the Brute Force method, but looks for ways to reduce the computation: prune branches of the tree that cannot produce anything better than what we have so far.

24 BYPASS
BYPASS(a, i, L, k)
  for j = i down to 1
    if a(j) < k
      a(j) = a(j) + 1
      return (a, j)
  return (a, 0)
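The same routine in Python, mirroring next_vertex above:

```python
def bypass(a, i, L, k):
    """Skip the entire subtree below the current prefix: jump to the next
    vertex at depth i or shallower, never descending further."""
    for j in range(i - 1, -1, -1):   # j = i down to 1 (1-based)
        if a[j] < k:
            a[j] += 1
            return a, j + 1
    return a, 0
```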

25 BRANCHANDBOUND
a = (1, 1, ..., 1)
bestDistance = infinity
i = L
while i > 0
  if i < L
    prefix = translate(a(1), ..., a(i))
    optimisticDistance = TotalDistance(prefix, DNA)
    if optimisticDistance > bestDistance
      (a, i) = BYPASS(a, i, L, k)
    else
      (a, i) = NEXTVERTEX(a, i, L, k)
  else
    word = translate(a(1), ..., a(L))
    if TotalDistance(word, DNA) < bestDistance
      bestDistance = TotalDistance(word, DNA)
      bestWord = word
    (a, i) = NEXTVERTEX(a, i, L, k)
return bestWord
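A runnable sketch of the whole branch-and-bound search, built on the helpers above (NUM_TO_BASE and translate are illustrative names). The pruning test works because the best score of a prefix is a lower bound on the best score of any L-mer that extends it:

```python
NUM_TO_BASE = {1: "a", 2: "c", 3: "g", 4: "t"}

def translate(a, i):
    """Turn the first i numeric entries of a into a DNA string."""
    return "".join(NUM_TO_BASE[v] for v in a[:i])

def branch_and_bound_motif(samples, L, k=4):
    a, i = [1] * L, L
    best_dist, best_word = float("inf"), None
    while i > 0:
        if i < L:
            prefix = translate(a, i)
            optimistic = total_distance(prefix, samples)   # lower bound
            if optimistic > best_dist:
                a, i = bypass(a, i, L, k)        # prune this subtree
            else:
                a, i = next_vertex(a, i, L, k)
        else:
            word = translate(a, L)
            d = total_distance(word, samples)
            if d < best_dist:
                best_dist, best_word = d, word
            a, i = next_vertex(a, i, L, k)
    return best_word, best_dist
```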

26 Greedy Method Picks a “good” solution. Avoids backtracking. Can give good results. Generally, not the best possible solution. But: FAST.

27 Greedy Method Given t DNA samples (each of length n). Find the optimal motif for the first two samples and lock that choice in place. For the remainder of the samples: –for each DNA sample in turn, find the L-mer that best fits with the prior choices. Never backtrack.

28 t = 8 DNA samples. Step 1: Take the first two samples and find their optimal alignment (consider all starting points s1 and s2, and keep the pair with the largest score). Step 2: Go through each remaining sample, successively finding the starting positions (s3, s4, ..., st) that give the best consensus score for all the choices made so far.
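A minimal sketch of this greedy procedure (consensus_score is a simple column-majority scorer; the function names are illustrative):

```python
def consensus_score(lmers):
    """Sum over columns of the count of the most common base (higher is better)."""
    return sum(max(col.count(b) for b in "acgt") for col in zip(*lmers))

def greedy_motif(samples, L):
    """Step 1: try every (s1, s2) pair for the first two samples.
    Step 2: add one sample at a time, choosing the start that best fits
    the L-mers already chosen; never backtrack."""
    starts, best = None, -1
    for s1 in range(len(samples[0]) - L + 1):
        for s2 in range(len(samples[1]) - L + 1):
            sc = consensus_score([samples[0][s1:s1 + L], samples[1][s2:s2 + L]])
            if sc > best:
                best, starts = sc, [s1, s2]
    for idx in range(2, len(samples)):
        dna = samples[idx]
        chosen = [samples[j][starts[j]:starts[j] + L] for j in range(idx)]
        s_best = max(range(len(dna) - L + 1),
                     key=lambda s: consensus_score(chosen + [dna[s:s + L]]))
        starts.append(s_best)
    return starts   # starting positions s1 ... st
```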

29 Alignment, Profile, Consensus, and Scoring
Alignment (five 8-mers):
  atccagct
  gggcaact
  atggatct
  aagcaacc
  ttggaact
Profile (count of each base per column):
  a: 3 1 0 0 5 3 0 0
  t: 1 3 0 0 0 1 0 4
  g: 1 1 4 2 0 1 0 0
  c: 0 0 1 3 0 0 5 1
Column scores (largest count in each column): 3 3 4 3 5 3 5 4
Consensus string (most common base in each column): atgcaact
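The profile and consensus on this slide can be computed as in this sketch:

```python
def profile_and_consensus(lmers):
    """Profile = per-column base counts; consensus = the most common base in
    each column; the column score is that base's count."""
    cols = list(zip(*lmers))
    profile = {b: [col.count(b) for col in cols] for b in "atgc"}
    consensus = "".join(max("atgc", key=col.count) for col in cols)
    col_scores = [max(col.count(b) for b in "atgc") for col in cols]
    return profile, consensus, col_scores

# For the five 8-mers above, the column scores come out to 3 3 4 3 5 3 5 4.
```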

30 Motif Finding Example: n = 32, t = 16, L = 5
atgtgaaaaggcccaggctttgttgttctgat
aatcagtttgtggctctctactatgtgcgctg
catggcgtaagagcaggtgtacaccgatgctg
taaatacacagattccttccgactttctgcat
caagccttagctttagatctttgtctcccttt
gagccatggactgtccgccagtatcttcctag
cgccaactgcccgtttcgcagtgccatgttga
agttcccagtcccgatcataggaatttgagca
tagggatcgaatgagttgtcctagtcaatcct
gtagctcctcaagggatacccacctatcgacg
agccgcagcgacaacttgctcgctatctaact
ccactccctaagcgctgaacaccggagttctg
gaagtcttcttgctgacacattacttgctcgc
gaatcgtcgtatgttttcgaccttggtggcat
tctcaacatgccttcccctccccaggctatgc
tgtgtctatcatcccgttagctacctaaatcg

31 Results on Example 1 (the slide marks the chosen 5-mer positions in each sample; figure not reproduced)
Branch and Bound: consensus_string = ctccc, consensus_count = 12 13 12 13 13, final percent score = 78.75
Greedy: consensus_string = atgtg, consensus_count = 14 10 11 12 10, final percent score = 71.25

32 Chosen 5-mers from each sample
Branch and Bound: ggccc ctctc caccg cttcc ctccc cttcc ctgcc ttccc gtcct ctcct ctcgc ctccc ctcgc cgacc ctccc atccc
  consensus_string = ctccc, count = 12 13 12 13 13, final percent score = 78.75
Greedy: atgtg aggtg ttctg atctt atgga atgtt atttg atgag aaggg acttg aagcg aagtc atgtt acatg gtgtc
  consensus_string = atgtg, count = 14 10 11 12 10, final percent score = 71.25

33 Example 2: n = 64, t = 16, L = 8
gattacttctcgcccccccgctaagtgtatttctctcgctacctactccgctatgcctacaaca
tctaccggcattatctatcggcaatgggagcggtggtgatgcacctagcctactcctttgacta
tggtccttactggcatcacgcaccgttcttggcggcctgtgcaatatcttgtccctaaataaat
aactacggtcattagtgcgtaatcagcacagccgagccggataagcgacttgtaaccatcttcg
gagcaagcatgcagtaggtaacgccaagagcggggctttagggagccgcaatcgggacagatct
aaaggttctctggatctatagctcacaaatttgcaggggtacgacagagttatagagtgtacca
ggcgctttcctcccgagcagagggaacgaacgaccataatgtaagagaatctttatgtccaagc
cgtcctgtccatacgtatgttttcaaaactgcgtctagattagtgaggaacagatttaagattc
atccagcaacttgtgcattcgtagggagcggacacaaaggacatgatcagacgaaacctatttt
cctcaattgaggcccccccccagttgtccgaccgcacgaaccgcttcgcaaaagtgttgcccgc
aaccacaccaagtattgctaatgcaccattcttatgtttttgagcagcaaagcgactacgctgt
atataggaaaaatcttagtgcaccaagatttaacctgcactttgctttgaaatacaactgtcgg
ctttcaataaatgttaattgcgttccctcacttgctcggtcgagtcgtatcgtattcgatcagg
tagcgggcacgctcgctcgacgttcatccactcgatagagccggtcatttttcggaactagtaa
ggaggaatgagtctacgtcgcgttaagacgaactttacgtgtgtgcaggcttattttcgtccac
cctccgggggacgtagactgttcttccacagttctaggcggcgcggtcttggcttgaacaatga

34 Results on Example 2 (the slide marks the chosen 8-mer positions in each sample; figure not reproduced)
Branch and Bound: consensus_string = ccatattt, count = 10 11 11 11 13 10 11 14, final percent score = 71.09375
Greedy: consensus_string = cgtactcc, count = 11 10 13 11 10 12 10 8, final percent score = 66.40625

35 Summary Introduce the Motif Finding Problem Explain its significance in bioinformatics Develop a simple model of the problem Design algorithmic solutions: –Brute Force –Branch and Bound –Greedy Compare results of each method.

36 Teaching and Learning

37 Neural Networks for Optimization Bill Wolfe, California State University Channel Islands. Reference: A Fuzzy Hopfield-Tank TSP Model, W. J. Wolfe, INFORMS Journal on Computing, Vol. 11, No. 4, Fall 1999, pp. 329-344.

38

39 Neural Models Simple processing units Lots of them Highly interconnected Exchange excitatory and inhibitory signals Variety of connection architectures/strengths “Learning”: changes in connection strengths “Knowledge”: connection architecture No central processor: distributed processing

40 Simple Neural Model a_i = activation, e_i = external input, w_ij = connection strength. Assume w_ij = w_ji (“symmetric” network), so W = (w_ij) is a symmetric matrix.

41 Net Input net_i = e_i + Σ_j w_ij a_j. Vector format: net = Wa + e.

42 Dynamics Basic idea: da/dt = net, i.e. each activation moves in the direction of its net input.

43 Energy (standard Hopfield form): E = -(1/2) Σ_i Σ_j w_ij a_i a_j - Σ_i e_i a_i, so that -grad(E) = Wa + e = net.

44

45 Lower Energy da/dt = net = -grad(E), so the dynamics seek lower energy.

46 Problem: Divergence

47 A Fix: Saturation

48 Keeps the activation vector inside the hypercube boundaries Encourages convergence to corners
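A small numerical sketch of the dynamics on the last few slides, assuming the usual Hopfield form net = Wa + e, simple Euler steps, and hard clipping to the unit hypercube (the step size and iteration count here are arbitrary choices):

```python
import numpy as np

def run_network(W, e, steps=1000, dt=0.01, seed=0):
    """Euler-integrate da/dt = W a + e, clipping a to [0, 1]^n after every
    step so the state stays in the hypercube and is pushed toward corners."""
    rng = np.random.default_rng(seed)
    a = 0.5 + 0.01 * rng.standard_normal(len(e))   # start near the center
    for _ in range(steps):
        a = np.clip(a + dt * (W @ a + e), 0.0, 1.0)
    return a
```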

49 A Neural Model a_i = activation, e_i = external input, w_ij = connection strength; W = (w_ij), with w_ij = w_ji (symmetric).

50 Example: Inhibitory Networks Completely inhibitory –w_ij = -1 for all i, j –winner take all Inhibitory Grid –neighborhood inhibition –on-center, off-surround

51 Traveling Salesman Problem Classic combinatorial optimization problem Find the shortest “tour” through n cities n!/(2n) distinct tours

52 TSP solution for 15,000 cities in Germany Ref: http://www.math.cornell.edu/~durrett/probrep/probrep.html

53 TSP 50 City Example

54 Random Tour

55 Nearest-City Tour

56 2-OPT Tour

57 Centroid Tour

58 Monotonic Tour

59 Neural Network Approach (figure: a grid of neurons, one neuron per city/tour-position pair)

60 Tours – Permutation Matrices Example tour: CDBA. Permutation matrices correspond to the “feasible” states.

61 Not Allowed

62 Only one city per time stop. Only one time stop per city. So: inhibitory rows and columns.

63 Distance Connections: Inhibit the neighboring cities in proportion to their distances.

64 putting it all together:

65 Energy function:
E = -1/2 Σ_i Σ_x Σ_j Σ_y a_ix a_jy w_ixjy
  = -1/2 [ Σ_i Σ_x Σ_y (-d(x,y)) a_ix (a_(i+1)y + a_(i-1)y)
         + Σ_i Σ_x Σ_j (-1/n) a_ix a_jx
         + Σ_i Σ_x Σ_y (-1/n) a_ix a_iy
         + Σ_i Σ_x Σ_j Σ_y (1/n^2) a_ix a_jy ]
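As a sketch, the weight tensor implied by this energy (with i, j indexing tour positions and x, y indexing cities) can be assembled directly. Treating adjacency modulo n (a closed tour) is an assumption here, and the function names are illustrative:

```python
import numpy as np

def tsp_weights(d):
    """w[i,x,j,y] built from the four energy terms: -d(x,y) between adjacent
    tour positions, -1/n within a city column, -1/n within a position row,
    and +1/n^2 everywhere (global excitation)."""
    n = len(d)
    w = np.full((n, n, n, n), 1.0 / n**2)
    for i in range(n):
        for x in range(n):
            for j in range(n):
                for y in range(n):
                    if j == (i + 1) % n or j == (i - 1) % n:
                        w[i, x, j, y] -= d[x][y]
                    if x == y:
                        w[i, x, j, y] -= 1.0 / n
                    if i == j:
                        w[i, x, j, y] -= 1.0 / n
    return w

def energy(a, w):
    """E = -1/2 * sum over i, x, j, y of a[i,x] a[j,y] w[i,x,j,y]."""
    return -0.5 * np.einsum("ix,jy,ixjy->", a, a, w)
```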

66 Hopfield JJ, Tank DW. Neural computation of decisions in optimization problems. Biological Cybernetics 1985;52:141-52.

67 Typical state of the network before convergence (figure not reproduced). Fuzzy Tour: GAECBFD

68 “Fuzzy Readout”

69

70

71

72

73 Fuzzy Tour Lengths (figure: tour length vs. iteration)

74 Average Results for n = 10 to n = 70 cities (50 random runs per n; figure: results plotted against the number of cities)

75 Conclusions Neurons stimulate intriguing computational models. The models are complex, nonlinear, and difficult to analyze. The interaction of many simple processing units is difficult to visualize. The Neural Model for the TSP mimics some of the properties of the nearest-city heuristic. Much work to be done to understand these models.

76

