COMP3456 – Adapted from textbook slides (www.bioalgorithms.info)
Copyright warning

Randomized Algorithms and Motif Finding

Outline
– Randomized QuickSort
– Randomized Algorithms
– Greedy Profile Motif Search
– Gibbs Sampler
– Random Projections

Outline – CHANGES
Greedy Profile Motif Search
– Animate “scoring a string with a profile”
– Animate “P-most probable l-mer”
Gibbs Sampler
– Describe relative entropies

Randomized Algorithms
Randomized algorithms make (pseudo)random rather than deterministic decisions. One big advantage is that no single input can systematically produce worst-case behaviour, because the algorithm runs differently each time. Such algorithms are commonly used in situations where no exact and fast algorithm is known.

Introduction to QuickSort
QuickSort is a simple and efficient approach to sorting:
– Select an element m from the unsorted array c and divide the array into two subarrays: c_small (elements smaller than m) and c_large (elements larger than m).
– Recursively sort the subarrays and combine them into the sorted array c_sorted.

Example of QuickSort
Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
Step 1: choose the first element as m (our selection):
m = 6, c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }

Example of QuickSort (cont’d)
Step 2: split the array into c_small and c_large:
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
c_small = { 3, 2, 4, 5, 1, 0 }
c_large = { 8, 7, 9 }

Example of QuickSort (cont’d)
Step 3: recursively do the same thing to c_small and c_large until each subarray has only one element or is empty:
c_small = { 3, 2, 4, 5, 1, 0 }          c_large = { 8, 7, 9 }
m = 3: { 2, 1, 0 } < { 4, 5 }           m = 8: { 7 } < { 9 }
m = 2: { 1, 0 } < { empty }             m = 4: { empty } < { 5 }
m = 1: { 0 } < { empty }

Example of QuickSort (cont’d)
Step 4: combine the subarrays with m as we work back out of the recursion, building the sorted array:
{ 0 } 1 { empty }
{ 0, 1 } 2 { empty }          { empty } 4 { 5 }
{ 0, 1, 2 } 3 { 4, 5 }        { 7 } 8 { 9 }
c_small = { 0, 1, 2, 3, 4, 5 }    c_large = { 7, 8, 9 }

Example of QuickSort (cont’d)
Finally we can assemble c_small and c_large with our original choice of m, creating the sorted array:
c_small = { 0, 1, 2, 3, 4, 5 }   m = 6   c_large = { 7, 8, 9 }
c_sorted = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }

The QuickSort Algorithm
1. QuickSort(c)
2.   if c consists of a single element
3.     return c
4.   m ← c_1
5.   determine the set c_small of elements smaller than m
6.   determine the set c_large of elements larger than m
7.   QuickSort(c_small)
8.   QuickSort(c_large)
9.   combine c_small, m, and c_large into a single array, c_sorted
10.  return c_sorted

QuickSort Analysis: Optimistic Outlook
Runtime depends on our selection of m. A good selection splits c evenly, |c_small| = |c_large|, and then the runtime is O(n log n). For a good selection, the recurrence relation is:
T(n) = 2T(n/2) + const·n
where 2T(n/2) is the time it takes to sort the two smaller arrays of size n/2, and const·n (const a positive constant) is the time it takes to split the array into two parts.

QuickSort Analysis: Pessimistic Outlook
However, a poor selection splits c unevenly; in the worst case, all elements are greater (or less) than m, so that one subarray is full and the other is empty. In this case the runtime is O(n^2). For a poor selection, the recurrence relation is:
T(n) = T(n−1) + const·n
where T(n−1) is the time it takes to sort one array containing n−1 elements, and const·n (const a positive constant) is the time it takes to split the array into two parts.

QuickSort Analysis (cont’d)
QuickSort seems like an inefficient MergeSort. To improve QuickSort, we need to choose m to be a good ‘splitter.’ It can be proved that to achieve O(n log n) running time we don’t need a perfect split, just a reasonably good one: if both subarrays of each array of length n' are at least of size n'/4, then the running time is O(n log n). This implies that half of the choices of m make good splitters.

A Randomized Approach
To improve QuickSort, select m at random. Since half of the elements are good splitters, choosing m at random gives a 50% chance that m is a good choice. This ensures that, no matter what input is received, the expected running time is small.

The RandomizedQuickSort Algorithm
1. RandomizedQuickSort(c)
2.   if c consists of a single element
3.     return c
4.   choose element m uniformly at random from c
5.   determine the set c_small of elements smaller than m
6.   determine the set c_large of elements larger than m
7.   RandomizedQuickSort(c_small)
8.   RandomizedQuickSort(c_large)
9.   combine c_small, m, and c_large into a single array, c_sorted
10.  return c_sorted
*Lines highlighted in red in the original slides (here, line 4) indicate the difference between QuickSort and RandomizedQuickSort.
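As a concrete illustration, here is a minimal Python sketch of RandomizedQuickSort (our own rendering, not from the slides; it returns a new sorted list rather than sorting in place, and handles duplicates of m explicitly):

    import random

    def randomized_quicksort(c):
        # Base case: zero or one element is already sorted.
        if len(c) <= 1:
            return list(c)
        m = random.choice(c)                    # choose the splitter uniformly at random
        c_small = [x for x in c if x < m]       # elements smaller than m
        c_equal = [x for x in c if x == m]      # copies of m itself
        c_large = [x for x in c if x > m]       # elements larger than m
        return randomized_quicksort(c_small) + c_equal + randomized_quicksort(c_large)

    print(randomized_quicksort([6, 3, 2, 8, 4, 5, 1, 7, 0, 9]))
    # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]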

RandomizedQuickSort Analysis
Worst-case runtime: O(n^2). Expected runtime: O(n log n). Expected runtime is a good measure of the performance of randomized algorithms, often more informative than the worst-case runtime. RandomizedQuickSort will always return the correct answer, a property that gives us a way to classify randomized algorithms.

Two Types of Randomized Algorithms
– Las Vegas algorithms always produce the correct solution (e.g., RandomizedQuickSort).
– Monte Carlo algorithms do not always return the correct solution.
Las Vegas algorithms are always preferred, but they are often hard to come by.

The Motif Finding Problem revisited
Motif Finding Problem: given a list of t sequences, each of length n, find the “best” pattern of length l that appears in each of the t sequences.

A New Motif Finding Approach
Motif Finding Problem: given a list of t sequences, each of length n, find the “best” pattern of length l that appears in each of the t sequences.
Previously: we solved the Motif Finding Problem using a Branch-and-Bound or a Greedy technique.
Now: randomly select possible locations and find a way to greedily change those locations until we have converged to the hidden motif.

Profiles Revisited
Let s = (s_1, ..., s_t) be the set of starting positions for l-mers in our t sequences. The substrings corresponding to these starting positions form:
– a t × l alignment matrix, and
– a 4 × l profile matrix* P.
*We make a special note that the profile matrix is defined in terms of the relative frequency of letters, not the count of letters.

Scoring Strings with a Profile
Prob(a|P) is defined as the probability that an l-mer a was created by the profile P:
Prob(a|P) = Π_{i=1..l} p(a_i, i)
where p(a_i, i) is the profile entry for nucleotide a_i at position i. If a is very similar to the consensus string of P, then Prob(a|P) will be high; if a is very different, then Prob(a|P) will be low.

Scoring Strings with a Profile (cont’d)
Given a profile P (columns are positions 1–6):
     1    2    3    4    5    6
A   1/2  7/8  3/8   0   1/8   0
C   1/8   0   1/2  5/8  3/8   0
T   1/8  1/8   0    0   1/4  7/8
G   1/4   0   1/8  3/8  1/4  1/8
The probability of the consensus string: Prob(aaacct|P) = ?

Scoring Strings with a Profile (cont’d)
The probability of the consensus string:
Prob(aaacct|P) = 1/2 × 7/8 × 3/8 × 5/8 × 3/8 × 7/8 = 0.0336

Scoring Strings with a Profile (cont’d)
The probability of the consensus string:
Prob(aaacct|P) = 1/2 × 7/8 × 3/8 × 5/8 × 3/8 × 7/8 = 0.0336
The probability of a different string:
Prob(atacag|P) = 1/2 × 1/8 × 3/8 × 5/8 × 1/8 × 1/8 = 0.0002
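The scoring computation is easy to express in code. A small Python sketch (our illustration; the profile is stored as a dict mapping each nucleotide to its list of per-position frequencies):

    # The profile P from the slide above.
    P = {
        'a': [1/2, 7/8, 3/8, 0,   1/8, 0  ],
        'c': [1/8, 0,   1/2, 5/8, 3/8, 0  ],
        't': [1/8, 1/8, 0,   0,   1/4, 7/8],
        'g': [1/4, 0,   1/8, 3/8, 1/4, 1/8],
    }

    def prob(a, P):
        # Prob(a|P): multiply the profile entry for each letter of a at its position.
        result = 1.0
        for i, letter in enumerate(a):
            result *= P[letter][i]
        return result

    print(round(prob('aaacct', P), 4))   # 0.0336
    print(round(prob('atacag', P), 4))   # 0.0002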

P-Most Probable l-mer
Define the P-most probable l-mer in a sequence as the l-mer in that sequence which has the highest probability of being created by the profile P. Given the profile P above and the sequence ctataaaccttacatc, find the P-most probable l-mer (l = 6).

P-Most Probable l-mer (cont’d)
Find Prob(a|P) of every possible 6-mer:
First try:  positions 1–6  (ctataa)
Second try: positions 2–7  (tataaa)
Third try:  positions 3–8  (ataaac)
Continue this process to evaluate every possible 6-mer.

P-Most Probable l-mer (cont’d)
Compute prob(a|P) for every possible 6-mer of ctataaaccttacat:
6-mer (start)   Calculation                              prob(a|P)
ctataa (1)      1/8 × 1/8 × 3/8 × 0 × 1/8 × 0            0
tataaa (2)      1/8 × 7/8 × 0 × 0 × 1/8 × 0              0
ataaac (3)      1/2 × 1/8 × 3/8 × 0 × 1/8 × 0            0
taaacc (4)      1/8 × 7/8 × 3/8 × 0 × 3/8 × 0            0
aaacct (5)      1/2 × 7/8 × 3/8 × 5/8 × 3/8 × 7/8        .0336
aacctt (6)      1/2 × 7/8 × 1/2 × 5/8 × 1/4 × 7/8        .0299
acctta (7)      1/2 × 0 × 1/2 × 0 × 1/4 × 0              0
ccttac (8)      1/8 × 0 × 0 × 0 × 1/8 × 0                0
cttaca (9)      1/8 × 1/8 × 0 × 0 × 3/8 × 0              0
ttacat (10)     1/8 × 1/8 × 3/8 × 5/8 × 1/8 × 7/8        .0004

P-Most Probable l-mer (cont’d)
The P-most probable 6-mer in the sequence is aaacct, with prob(aaacct|P) = .0336, the highest value in the table above.

P-Most Probable l-mer (cont’d)
aaacct is the P-most probable 6-mer in ctataaaccttacatc because Prob(aaacct|P) = .0336 is greater than Prob(a|P) for any other 6-mer a in the sequence.
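Finding the P-most probable l-mer is then a single scan over all windows of the sequence (sketch, reusing prob and P from above):

    def most_probable_lmer(seq, P, l):
        # Slide a window of length l across seq; keep the highest-probability l-mer.
        best, best_p = None, -1.0
        for i in range(len(seq) - l + 1):
            window = seq[i:i + l]
            p = prob(window, P)
            if p > best_p:
                best, best_p = window, p
        return best, best_p

    print(most_probable_lmer('ctataaaccttacatc', P, 6))   # ('aaacct', 0.0336...)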

Dealing with Zeroes
In our toy example, prob(a|P) = 0 in many cases. In practice, there will be enough sequences that the number of profile entries with a frequency of zero is small. To avoid many entries with prob(a|P) = 0, there exist techniques that replace these zeroes with a very small number, so that a single zero does not make the entire probability of a string zero (we will not address these techniques in detail here).
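One standard technique of this kind (shown only as an illustration; the slides do not prescribe a specific method) is Laplace smoothing: add a small pseudocount to every letter count when building the profile, so that no entry is exactly zero:

    def profile_with_pseudocounts(lmers, pseudo=1.0):
        # Build a profile from aligned l-mers, adding `pseudo` to every count
        # (Laplace smoothing) so that no frequency is exactly zero.
        l = len(lmers[0])
        letters = 'acgt'
        profile = {x: [] for x in letters}
        for i in range(l):
            column = [a[i] for a in lmers]
            total = len(column) + pseudo * len(letters)
            for x in letters:
                profile[x].append((column.count(x) + pseudo) / total)
        return profile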

P-Most Probable l-mers in Many Sequences
Find the most probable l-mer w.r.t. P (the profile above) in each of the sequences:
ctataaacgttacatc
atagcgattcgactg
cagcccagaaccct
cggtataccttacatc
tgcattcaatagctta
tatcctttccactcac
ctccaaatcctttaca
ggtcatcctttatcct

P-Most Probable l-mers in Many Sequences (cont’d)
The P-most probable l-mers form a new profile:
1  aaacgt
2  atagcg
3  aaccct
4  gaacct
5  atagct
6  gacctg
7  atcctt
8  tacctt
     1    2    3    4    5    6
A   5/8  5/8  4/8   0    0    0
C    0    0   4/8  6/8  4/8   0
T   1/8  3/8   0    0   3/8  6/8
G   2/8   0    0   2/8  1/8  2/8

Comparing New and Old Profiles
(In the original slides, red marks frequencies that increased and blue marks frequencies that decreased.)
Old profile:
     1    2    3    4    5    6
A   1/2  7/8  3/8   0   1/8   0
C   1/8   0   1/2  5/8  3/8   0
T   1/8  1/8   0    0   1/4  7/8
G   1/4   0   1/8  3/8  1/4  1/8
New profile:
     1    2    3    4    5    6
A   5/8  5/8  4/8   0    0    0
C    0    0   4/8  6/8  4/8   0
T   1/8  3/8   0    0   3/8  6/8
G   2/8   0    0   2/8  1/8  2/8

Greedy Profile Motif Search
Use P-most probable l-mers to adjust starting positions until we reach a “best” profile; this is the motif.
1) Select random starting positions.
2) Create a profile P from the substrings at these starting positions.
3) Find the P-most probable l-mer a in each sequence and change that sequence’s starting position to the starting position of a.
4) Compute a new profile based on the new starting positions after each iteration and proceed until we cannot increase the score anymore.

GreedyProfileMotifSearch Algorithm
1. GreedyProfileMotifSearch(DNA, t, n, l)
2.   randomly select starting positions s = (s_1, ..., s_t) from DNA
3.   bestScore ← 0
4.   while Score(s, DNA) > bestScore
5.     form profile P from s
6.     bestScore ← Score(s, DNA)
7.     for i ← 1 to t
8.       find the P-most probable l-mer a in the i-th sequence
9.       s_i ← starting position of a
10.  return bestScore
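A compact Python rendering of the whole loop (our sketch, reusing the helpers above; as a stand-in for Score(s, DNA) it sums the probabilities of the chosen l-mers under the current profile, one reasonable choice among several):

    import random

    def greedy_profile_motif_search(dna, l):
        # dna: list of t sequences (lowercase a/c/g/t); l: motif length.
        s = [random.randrange(len(seq) - l + 1) for seq in dna]
        best_score = 0.0
        while True:
            lmers = [seq[p:p + l] for seq, p in zip(dna, s)]
            P = profile_with_pseudocounts(lmers)        # profile from current positions
            score = sum(prob(a, P) for a in lmers)      # stand-in for Score(s, DNA)
            if score <= best_score:                     # no improvement: stop
                return s, best_score
            best_score = score
            for i, seq in enumerate(dna):               # move each start to its
                a, _ = most_probable_lmer(seq, P, l)    # P-most probable l-mer
                s[i] = seq.find(a)                      # first occurrence of that l-mer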

GreedyProfileMotifSearch Analysis
Since we choose starting positions randomly, there is little chance that our initial guess is close to an optimal motif, meaning it will take a very long time to find the optimal motif; indeed, it is unlikely that the random starting positions will lead us to the correct solution at all. In practice, this algorithm is run many times, with the hope that some set of random starting positions will be close to the optimum solution simply by chance.

Gibbs Sampling
GreedyProfileMotifSearch is probably not the best way to find motifs. However, we can improve the algorithm by introducing Gibbs Sampling, an iterative procedure that discards one l-mer after each iteration and replaces it with a new one. Gibbs Sampling proceeds more slowly and chooses new l-mers at random, increasing the odds that it will converge to the correct solution.

Gibbs Sampling
Gibbs Sampling generates a sequence of samples from the joint distribution of two random variables:
– this can be achieved despite not knowing the actual distribution
– very handy for estimating a distribution or computing an integral

How Gibbs Sampling Works
1) Randomly choose starting positions s = (s_1, ..., s_t) and form the set of l-mers associated with these starting positions.
2) Randomly choose one of the t sequences.
3) Create a profile P from the other t − 1 sequences.
4) For each position in the removed sequence, calculate the probability that the l-mer starting at that position was generated by P.
5) Choose a new starting position for the removed sequence at random, based on the probabilities calculated in step 4.
6) Repeat steps 2–5 until there is no improvement.
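The six steps translate almost line for line into Python (our sketch, reusing the earlier helpers; sequences are assumed lowercase, and the stopping rule is simplified to a fixed number of iterations):

    import random

    def gibbs_sampler(dna, l, iterations=1000):
        t = len(dna)
        s = [random.randrange(len(seq) - l + 1) for seq in dna]   # step 1
        for _ in range(iterations):
            i = random.randrange(t)                               # step 2: pick one sequence
            others = [dna[j][s[j]:s[j] + l] for j in range(t) if j != i]
            P = profile_with_pseudocounts(others)                 # step 3: profile from the rest
            seq = dna[i]
            weights = [prob(seq[p:p + l], P)                      # step 4: score every window
                       for p in range(len(seq) - l + 1)]
            s[i] = random.choices(range(len(weights)),            # step 5: sample new start
                                  weights=weights)[0]
        return s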

Gibbs Sampling: an Example
Input: t = 5 sequences, motif length l = 8
1. GTAAACAATATTTATAGC
2. AAAATTTACCTCGCAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA

Gibbs Sampling: an Example
1) Randomly choose starting positions s = (s_1, s_2, s_3, s_4, s_5) in the 5 sequences:
s_1 = 7   GTAAACAATATTTATAGC
s_2 = 11  AAAATTTACCTTAGAAGG
s_3 = 9   CCGTACTGTCAAGCGTGG
s_4 = 4   TGAGTAAACGACGTCCCA
s_5 = 1   TACTTAACACCCTGTCAA

Gibbs Sampling: an Example
2) Choose one of the sequences at random, e.g. sequence 2: AAAATTTACCTTAGAAGG
s_1 = 7   GTAAACAATATTTATAGC
s_2 = 11  AAAATTTACCTTAGAAGG
s_3 = 9   CCGTACTGTCAAGCGTGG
s_4 = 4   TGAGTAAACGACGTCCCA
s_5 = 1   TACTTAACACCCTGTCAA

Gibbs Sampling: an Example
2) ...and remove it from the set:
s_1 = 7   GTAAACAATATTTATAGC
s_3 = 9   CCGTACTGTCAAGCGTGG
s_4 = 4   TGAGTAAACGACGTCCCA
s_5 = 1   TACTTAACACCCTGTCAA

Gibbs Sampling: an Example
3) Create profile P from the l-mers in the remaining 4 sequences:
1  AATATTTA
3  TCAAGCGT
4  GTAAACGA
5  TACTTAAC
     1    2    3    4    5    6    7    8
A   1/4  2/4  2/4  3/4  1/4  1/4  1/4  2/4
C    0   1/4  1/4   0    0   2/4   0   1/4
T   2/4  1/4  1/4  1/4  2/4  1/4  1/4  1/4
G   1/4   0    0    0   1/4   0   2/4   0
Consensus string: TAAATCGA

Gibbs Sampling: an Example
4) Calculate prob(a|P) for every possible 8-mer in the removed sequence AAAATTTACCTTAGAAGG:
8-mer (start)     prob(a|P)
AAAATTTA (1)      .000732
AAATTTAC (2)      .000122
ACCTTAGA (8)      .000183
All other 8-mers in the sequence have prob(a|P) = 0.

Gibbs Sampling: an Example
5) Create a distribution over the probabilities prob(a|P), and randomly select a new starting position based on this distribution.
a) To create this distribution, divide each probability prob(a|P) by the lowest probability:
Starting position 1: prob(AAAATTTA | P) = .000732 / .000122 = 6
Starting position 2: prob(AAATTTAC | P) = .000122 / .000122 = 1
Starting position 8: prob(ACCTTAGA | P) = .000183 / .000122 = 1.5
Ratio = 6 : 1 : 1.5

Turning Ratios into Probabilities
b) Define probabilities of starting positions according to the computed ratios:
Probability (selecting starting position 1): 6 / (6 + 1 + 1.5) = 0.706
Probability (selecting starting position 2): 1 / (6 + 1 + 1.5) = 0.118
Probability (selecting starting position 8): 1.5 / (6 + 1 + 1.5) = 0.176

Gibbs Sampling: an Example
c) Select the new start position at random according to the computed distribution:
P(selecting starting position 1): .706
P(selecting starting position 2): .118
P(selecting starting position 8): .176
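In code, the normalise-and-sample step is a few lines (sketch; the three probabilities are the ones computed above, keyed by starting position):

    import random

    probs = {1: 0.000732, 2: 0.000122, 8: 0.000183}            # position -> prob(a|P)
    lowest = min(probs.values())
    ratios = {pos: p / lowest for pos, p in probs.items()}     # 6 : 1 : 1.5
    total = sum(ratios.values())                               # 8.5
    dist = {pos: r / total for pos, r in ratios.items()}       # {1: .706, 2: .118, 8: .176}
    new_start = random.choices(list(dist), weights=list(dist.values()))[0]

Note that dividing by the lowest probability and then renormalising yields the same distribution as normalising the raw probabilities directly.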

Gibbs Sampling: an Example
Assume that, by chance this time, we select the substring with the highest probability; then we are left with the following new substrings and starting positions:
s_1 = 7   GTAAACAATATTTATAGC
s_2 = 1   AAAATTTACCTCGCAAGG
s_3 = 9   CCGTACTGTCAAGCGTGG
s_4 = 5   TGAGTAATCGACGTCCCA
s_5 = 1   TACTTCACACCCTGTCAA

Gibbs Sampling: an Example
6) We iterate the procedure again with the above starting positions until we cannot improve the score any more.

Gibbs Sampler in Practice
Gibbs sampling needs to be modified when applied to samples with unequal distributions of nucleotides (the relative entropy approach). Gibbs sampling often converges to locally optimal motifs rather than globally optimal ones, so it needs to be run with many randomly chosen seeds to achieve good results.

Another Randomized Approach
The Random Projection Algorithm is a different way to solve the Motif Finding Problem. Guiding principle: some instances of a motif agree on a subset of positions. However, it is unclear how to find these “non-mutated” positions. To bypass the effect of mutations within a motif, we randomly select a subset of positions in the pattern, creating a projection of the pattern, and search for that projection in the hope that the selected positions are not affected by mutations in most instances of the motif.

Projections
Choose k positions in a string of length l. Concatenate the nucleotides at the chosen k positions to form a k-tuple. This can be viewed as a projection of the l-dimensional space onto a k-dimensional subspace.
l = 15, k = 7, Projection = (2, 4, 5, 7, 11, 12, 13):
ATGGCATTCAGATTC → TGCTGAT

Random Projections Algorithm
– Select k out of l positions uniformly at random.
– For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions.
– Recover the motif from enriched buckets that contain many l-tuples.
Example: in the input sequence …TCAATGCACCTAT..., the l-tuple TGCACCT hashes into bucket TGCT.

hash? bucket? enriched? ¿Qué?
We assign each l-mer to a bucket corresponding to its subsequence of length k. There can be several l-mers with the same length-k subsequence. A bucket is called "enriched" if it contains more than some threshold number of l-mers.
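A sketch of the hashing step in Python (our illustration; `positions` holds the k chosen indices, 0-based):

    from collections import defaultdict

    def hash_lmers_into_buckets(sequences, l, positions):
        buckets = defaultdict(list)
        for seq in sequences:
            for i in range(len(seq) - l + 1):
                lmer = seq[i:i + l]
                key = ''.join(lmer[p] for p in positions)   # h(x): letters at the k positions
                buckets[key].append(lmer)
        return buckets

    def enriched_buckets(buckets, s):
        # A bucket is "enriched" if it holds more than s l-mers.
        return {k: v for k, v in buckets.items() if len(v) > s}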

Random Projections Algorithm (cont’d)
Some projections will fail to detect motifs, but if we try many of them, the probability that one of the buckets fills up increases. In the example below, the bucket **GC*AC is “bad” while the bucket AT**G*C is “good”.
(7,2) motif: ATGCGTC
...ccATCCGACca...
...ttATGAGGCtc...
...ctATAAGTCgc...
...tcATGTGACac...

Example
l = 7 (motif size), k = 4 (projection size). Choose projection (1, 2, 5, 7).
Input: ...TAGACATCCGACTTGCCTTACTAC...
ATCCGAC hashes into bucket ATGC; GCCTTAC hashes into bucket GCTC.

Hashing and Buckets
The hash function h(x) is obtained from the k positions of the projection. Buckets are labeled by values of h(x). Enriched buckets contain more than s l-tuples, for some parameter s.

Motif Refinement
How do we recover the motif from the sequences in the enriched buckets? k nucleotides are known from the hash value of the bucket; use the information in the other l − k positions as the starting point for a local refinement scheme, e.g. the Gibbs sampler.
Example: the l-mers ATCCGAC, ATGAGGC, ATAAGTC, ... in bucket ATGC are fed to a local refinement algorithm, which produces the candidate motif ATGCGAC.

Synergy between Random Projection and Gibbs Sampler
Random Projection is a procedure for finding good starting points: every enriched bucket is a potential starting point. Feeding these starting points into existing algorithms (like the Gibbs sampler) provides good local search in the vicinity of every starting point. These algorithms work particularly well for “good” starting points.

Building Profiles from Buckets
The l-mers in an enriched bucket, e.g. ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC in bucket ATGC, are stacked to form a profile P (rows A, C, G, T); the Gibbs sampler then turns P into a refined profile P*.

Motif Refinement
For each bucket h containing more than s sequences, form profile P(h). Use the Gibbs sampler algorithm with starting point P(h) to obtain a refined profile P*.

Random Projection Algorithm: A Single Iteration
– Choose a random k-projection.
– Hash each l-mer x in the input sequences into the bucket labeled by h(x).
– From each enriched bucket (a bucket with more than s sequences), form a profile P and perform Gibbs-sampler motif refinement.
– The candidate motif is the best motif among the refinements of all enriched buckets.

Choosing Projection Size
Projection size k:
– choose k small enough that several motif instances hash to the same bucket;
– choose k large enough to avoid contamination by spurious l-mers:
4^k >> t · (n − l + 1)
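As a worked illustration with made-up sizes (t = 20 sequences of length n = 600, motif length l = 15), the smallest k satisfying the rule of thumb can be found directly; here ">>" is read, arbitrarily, as two orders of magnitude:

    t, n, l = 20, 600, 15           # hypothetical problem sizes
    lmers = t * (n - l + 1)         # 11720 l-mers in the input
    k = 1
    while 4 ** k < 100 * lmers:     # demand 4^k >> t(n - l + 1)
        k += 1
    print(k)                        # k = 11 (and k < l, as required)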

How Many Iterations Should We Have?
We define the planted bucket as the bucket with hash value h(M), where M is the motif. We choose m, the number of iterations, such that
Pr(planted bucket contains at least s sequences in at least one of the m iterations) = 0.95.
This probability is readily computable, since the iterations form a sequence of independent Bernoulli trials.
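Since each iteration succeeds independently with some probability p (the probability that the planted bucket receives at least s motif instances in a single trial), the required m follows directly (sketch; computing p itself is a separate exercise):

    import math

    def iterations_needed(p, confidence=0.95):
        # Smallest m with 1 - (1 - p)^m >= confidence.
        return math.ceil(math.log(1 - confidence) / math.log(1 - p))

    print(iterations_needed(0.1))   # 29 iterations when p = 0.1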

Expectation Maximization (EM)
S = { x(1), ..., x(t) }: the set of input sequences.
Given: a probabilistic motif model W(Θ) depending on unknown parameters Θ, and a background probability distribution P.
Find the value Θ_max that maximizes the likelihood ratio:
Pr(S | W(Θ_max), P) / Pr(S | P)
EM is a local optimization scheme and requires a starting value Θ_0.

EM Motif Refinement (cont’d)
For each input sequence x(i), return the l-tuple y(i) which maximizes the likelihood ratio:
Pr(y(i) | W(Θ_h*)) / Pr(y(i) | P)
– T = { y(1), y(2), ..., y(t) }
– C(T) = consensus string

Heuristic Searching
Local search is another kind of randomized algorithm, useful for solving hard problems where no global solution is known. Essentially all local search methods work in the same way, differing only in the variations that distinguish them. For this part we will discuss combinatorial optimization problems: there are no continuous variables to determine.

Why no guarantees?
It is not possible to guarantee that local search strategies won't get stuck in local optima – places where, while all the neighbours look worse, there are other, more distant solutions that are better. The many different local search methods are designed to get around this problem in different ways.

Given a combinatorial optimisation problem
Suppose we are given a solution space S = { s_i }, an objective function z, and a way of converting a solution s_i into some other s_j – we'll call these perturbations P. These three things define a graph that we must search.

The landscape
The graph just mentioned can be thought of as a landscape, whose characteristics directly affect how easy it is to solve the (heuristic search) problem. A poor choice of z or P can seriously mess things up!

Local Search
1.  begin with a solution s_i
2.  bestScore ← z(s_i)
3.  while (!bored)
4.    choose a perturbation p from P
5.    s_j ← p(s_i)
6.    if (z(s_j) is acceptable)
7.      s_i ← s_j
8.    if (too long without improvement)
9.      bored ← true
10. return s_i

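A generic version of this loop in Python (our sketch; the objective z, the perturbation set, and the patience threshold are all parameters):

    import random

    def local_search(s, z, perturbations, patience=100):
        # s: initial solution; z: objective to maximise;
        # perturbations: functions mapping a solution to a neighbouring solution.
        best = z(s)
        since_improvement = 0
        while since_improvement < patience:       # "bored" after too long without improvement
            p = random.choice(perturbations)
            candidate = p(s)
            if z(candidate) > best:               # acceptance rule: strict improvement
                s, best = candidate, z(candidate)
                since_improvement = 0
            else:
                since_improvement += 1
        return s

Swapping out the acceptance rule on the marked line gives the variants that follow (hill-climbing, simulated annealing, the great deluge).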

Example: S = set of phylogenetic trees
1.  begin with a solution s_i (an initial tree)
2.  bestScore ← z(s_i)
3.  while (!bored)
4.    choose a perturbation p from P
5.    s_j ← p(s_i)
6.    if (z(s_j) is more than z(s_i))
7.      s_i ← s_j
8.    if (no more perturbations available)
9.      bored ← true
10. return s_i

Hill-climbing
Only accept a new solution if its score is better than the score of the current solution. Related is steepest [a/de]scent: choose the neighbouring solution that provides the best improvement.

Simulated Annealing
Accept a new solution if its score is better than the score of the current solution; or (supposing we're maximising), if z(s_j) < z(s_i), accept it with a probability proportional to
exp( −(z(s_i) − z(s_j)) / T )
where T represents a "temperature" that lowers through the search.
S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi, Optimization by Simulated Annealing. Science 220(4598):671-680, 1983.
V. Cerny, A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41-51, 1985.
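The acceptance rule in code (sketch, for maximisation; the cooling schedule in the comment is one common choice):

    import math
    import random

    def sa_accept(z_current, z_candidate, T):
        # Always accept improvements; accept a worse candidate with
        # probability exp(-(z_current - z_candidate) / T).
        if z_candidate >= z_current:
            return True
        return random.random() < math.exp(-(z_current - z_candidate) / T)

    # T is lowered through the search, e.g. geometrically: T *= 0.99 per step.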

The Great Deluge
Accept a new solution if its score is better than a "water level" w. If we accept a solution, raise w by a small fraction of the difference between the new score z(s_j) and the current w.
Gunter Dueck: "New Optimization Heuristics: The Great Deluge Algorithm and the Record-to-Record Travel". Journal of Computational Physics, 104(1):86-92, 1993.

Tabu Search
Another method used to escape local optima: keep a list of just-visited solutions, which are "tabu" (/ "taboo"), and never go back to them. The optimal length of this list is sensitive to the problem.

Parallel searching
Another add-on to these "single-line" heuristic search methods is to parallelise them. This is not just running multiple single-line searches on parallel machines. Two popular such methods are:
– genetic algorithms (GA)
– ant colony optimization (ACO)

Genetic Algorithms
These are inspired by the idea that recombination of genetic material leads to evolution toward fitter species. A "population" of solutions is perturbed in some way, and then the "fittest" ones are selected stochastically. These are recombined, with possible modification (more perturbation / "mutation"). This is repeated ad nauseam...

Ant Colony Optimization
1. while (!bored)
2.   generate a new set of solutions (ants) stochastically, depending on local pheromone values
3.   evaluate the solutions
4.   mark a "pheromone trail" for good solutions (this decays exponentially)
5.   update the pheromone values
M. Dorigo, Optimization, Learning and Natural Algorithms. PhD thesis, Politecnico di Milano, Italy, 1992.

Hitch-hiking
A (much!) lesser-known parallel algorithm, invented in Thredbo. Based on the idea that good solutions have common features, but we may not know what they are. Use a population of solutions, some designated drivers and some hitchers, each hitcher assigned to one of the drivers; do most of the work on the driver solutions.
Michael A. Charleston. Hitch-Hiking: A Parallel Heuristic Search Strategy, Applied to the Phylogeny Problem. Journal of Computational Biology 8(1):79-91, 2001. Cited more than once!

Hitch-hiking pseudocode
1.  begin with a population T of solutions
2.  assign drivers D and hitchers H, T = D ∪ H
3.  while (!bored)
4.    for each d_i ∈ D
5.      find an acceptable perturbation instance p_i
6.      for each hitcher h_j assigned to d_i
7.        try this perturbation on h_j
8.    reassign D, H and the assignments of hitchers to drivers
9.    if (too long without improvement)
10.     bored ← true
11. return T

Summary
Randomized algorithms are often the only available approach for producing some solution that is "good enough".
– Some (Las Vegas) vary in their execution time but guarantee correct answers (these are rare).
– Some (Monte Carlo) vary in their output: they cannot guarantee optimality (these are more common).
Randomized algorithms have a simple basic structure that can be extended with different:
– perturbations
– acceptance rules
– stopping criteria
– tabu lists
– parallelization