The Motif Problem
Paul Tamashiro
School of Mathematics
Georgia Institute of Technology
April 16, 2008


P. Tamashiro, GA Tech

Outline
Biology
–Background
–Purpose
–What is a motif?
–What is the motif problem?
Mathematics
–The two types of algorithms
–An example algorithm
Problems/Review/Questions

Biology Section

Background
Hamilton Smith discovered the first DNA signal in 1970
He worked with the Hind II restriction enzyme of Haemophilus influenzae, a bacterium that affects the upper respiratory system
The primary capability of the Hind II enzyme was to cut DNA sequences into specific subsequences
Restriction sites, the sequences that restriction enzymes recognize, are the easiest signals to locate
(Pevzner 2001)

Purpose of the Motif Problem
Related to the connection between drugs and their specific targets in the human body
–Drug - any chemical substance used to treat or investigate a disease
–Target - a molecule within the human body that undergoes a reaction with a drug
Solving it could affect the activity of certain proteins or enzymes found in nature through their regulatory sites, and could dramatically increase the potential benefits of drug target identification
(Imming, Sinning, and Meyer 2006)

Purpose of the Motif Problem
(Imming, Sinning, and Meyer 2006)

What is a Motif?
Definition – a section of a DNA sequence (contiguous or sometimes non-contiguous) used for gene sequencing or drug target purposes (Mendes 2008)
The term is used in two ways:
–The existing substring found in the input sequence
–The pattern produced by the algorithm itself
(Mendes 2008)

What is a Motif?
Very elementary definition of motif and algorithm

What is the Motif Problem?
Given a set of input sequences $S = \{S_1, S_2, \ldots, S_t\}$, does a common subsequence of length $l$, with $l_{\min} \le l \le l_{\max}$, exist among $q \le t$ of the sequences with no more than $e$ mismatches?
If so, how does one break down the sequences in order to easily distinguish the relevant signals from randomly recurring patterns?
(Mendes 2008)
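The decision question above can be illustrated for a single candidate pattern. This is a minimal sketch, not a motif finder: it only checks whether a given pattern occurs, with at most e mismatches, in at least q of the sequences. The function names `mismatches`, `occurs_within`, and `satisfies_quorum` are hypothetical and chosen for illustration.

```python
def mismatches(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def occurs_within(pattern: str, sequence: str, e: int) -> bool:
    """True if some l-mer of `sequence` is within e mismatches of `pattern`."""
    l = len(pattern)
    return any(
        mismatches(pattern, sequence[i:i + l]) <= e
        for i in range(len(sequence) - l + 1)
    )


def satisfies_quorum(pattern: str, sequences: list[str], q: int, e: int) -> bool:
    """True if `pattern` occurs (up to e mismatches) in at least q sequences."""
    hits = sum(occurs_within(pattern, s, e) for s in sequences)
    return hits >= q


if __name__ == "__main__":
    S = ["ACGTACGT", "TTACGAAA", "GGGGGGGG"]
    print(satisfies_quorum("ACGT", S, q=2, e=1))
```

A real motif finder must also search over candidate patterns and lengths, which is where the difficulty of the problem lies.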

Mathematics Section

The Two Types of Algorithms
Combinatorics
–Graph theory, counting
–Enumeration
Probability/Statistics
–Expectation maximization, probabilistic optimization
–Maximum likelihood estimators (a statistical method for finding the model that best fits a given set of data)
(Mendes 2008)

Notation
$S = \{S_1, S_2, \ldots, S_t\}$ - the set of input sequences
l-mer - a subsequence of length $l$
$m_{ij}$ - the l-mer of the sequence $S_j$ that starts at position $i$
$S_j[i]$ - the $i$-th symbol in the $j$-th sequence
$n_j$ - the length of the $j$-th sequence
(Mendes 2008)
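The notation maps directly onto string slicing. The sketch below assumes Python's 0-based indexing, so position i = 0 is the first symbol, whereas the slides do not say which convention they use. The helper names `lmers` and `m` are hypothetical.

```python
def lmers(sequence: str, l: int) -> list[str]:
    """All l-mers of a sequence, in order of their start position.

    A sequence of length n has n - l + 1 l-mers (none if l > n).
    """
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]


def m(S: list[str], i: int, j: int, l: int) -> str:
    """The l-mer m_ij: the l-mer of sequence S[j] starting at position i."""
    if i < 0 or i + l > len(S[j]):
        raise IndexError("l-mer does not fit inside the sequence")
    return S[j][i:i + l]


if __name__ == "__main__":
    S = ["ACGTAC", "TTGCA"]
    print(lmers(S[0], 3))   # the l-mers of the first sequence
    print(m(S, 1, 0, 3))    # the l-mer of S[0] starting at position 1
    print(S[1][2])          # S_j[i] with j = 1, i = 2
    print(len(S[1]))        # n_j with j = 1
```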

The WINNOWER Algorithm
Summary: finds all motifs of length l that occur in the input sequences with no more than e mismatches
–1. Constructs a graph whose vertices are the l-mers of the input sequences, with edges between vertices of similar l-mers
–2. Begins eliminating unwanted edges
–3. The remaining graph may contain cliques that represent a motif

The WINNOWER Algorithm
Figure 3: the WINNOWER algorithm (Mendes 2008)

The WINNOWER Algorithm
Vertices represent all of the l-mers in the set of sequences $S = \{S_1, S_2, \ldots, S_t\}$
An edge joins two vertices if their l-mers come from different sequences and the Hamming distance between them is at most 2e (two occurrences that each differ from the motif in at most e positions differ from each other in at most 2e positions)
The Hamming distance between two strings of equal length is the number of positions at which their corresponding characters differ
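The graph construction described above can be sketched as follows. This is an illustrative, quadratic-time construction rather than the published implementation. It assumes that a vertex is identified by a hypothetical pair (j, i), meaning the l-mer of sequence j starting at 0-based position i, and that edges are only drawn between l-mers from different sequences.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_graph(sequences: list[str], l: int, e: int) -> dict:
    """Return an adjacency map {vertex: set of neighboring vertices}.

    A vertex (j, i) stands for the l-mer of sequences[j] starting at i.
    Two vertices from different sequences are joined when the Hamming
    distance between their l-mers is at most 2 * e.
    """
    vertices = [
        (j, i)
        for j, s in enumerate(sequences)
        for i in range(len(s) - l + 1)
    ]
    adj = {v: set() for v in vertices}
    for u, v in combinations(vertices, 2):
        if u[0] == v[0]:
            continue  # same input sequence: no edge
        a = sequences[u[0]][u[1]:u[1] + l]
        b = sequences[v[0]][v[1]:v[1] + l]
        if hamming(a, b) <= 2 * e:
            adj[u].add(v)
            adj[v].add(u)
    return adj


if __name__ == "__main__":
    graph = build_graph(["AAAT", "AATC"], l=3, e=0)
    for vertex, neighbors in graph.items():
        print(vertex, sorted(neighbors))
```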

The WINNOWER Algorithm
Graph G = (V, E) is a t-partite graph in which each part consists of the vertices derived from one input sequence
The algorithm systematically reduces the number of edges by finding extendable cliques.

The WINNOWER Algorithm
Clique - a subgraph in which every two vertices are connected by an edge
A clique is extendable if it has at least one neighbor in each of the other parts.
–Suppose there exists a clique C with vertices $\{V_1, \ldots, V_k\}$. A neighbor of the clique C is a vertex u such that $\{V_1, \ldots, V_k, u\}$ is also a clique.
The algorithm reduces the graph G by deleting spurious edges (edges that do not belong to any extendable clique of size k).
(Mendes 2008)
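The definitions of clique, neighbor, and extendable clique can be written out as small predicates. This sketch assumes the graph is stored as an adjacency map of sets and that each vertex is a hypothetical tuple whose first element is the index of its part (its input sequence). The function names are illustrative.

```python
from itertools import combinations


def is_clique(adj: dict, vertices: list) -> bool:
    """True if every two of the given vertices are joined by an edge."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def clique_neighbors(adj: dict, clique: list) -> set:
    """Vertices u such that clique + [u] is also a clique.

    `clique` must be non-empty; its neighbors are the vertices
    adjacent to every one of its members.
    """
    common = set.intersection(*(adj[v] for v in clique))
    return common - set(clique)


def is_extendable(adj: dict, clique: list, t: int) -> bool:
    """True if the clique has a neighbor in each part it does not occupy."""
    occupied = {v[0] for v in clique}
    covered = {u[0] for u in clique_neighbors(adj, clique)}
    return all(p in covered for p in range(t) if p not in occupied)


if __name__ == "__main__":
    a, b, c, d = (0, 0), (1, 0), (2, 0), (2, 1)
    adj = {a: {b, c, d}, b: {a, c}, c: {a, b}, d: {a}}
    print(is_clique(adj, [a, b, c]))
    print(clique_neighbors(adj, [a, b]))
    print(is_extendable(adj, [a, d], t=3))
```

In this toy graph the edge a-d is spurious in the sense of the slide: the 2-clique {a, d} has no neighbor in part 1, so it cannot be extended to a 3-clique.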

The WINNOWER Algorithm
The value of k is increased after each iteration until only extendable cliques remain and G cannot be altered any further
–If k = 1, the algorithm deletes all vertices that have fewer than t - 1 neighbors.
–If k = 2, the algorithm deletes all edges that have fewer than t - 2 neighbors (a neighbor of an edge is a vertex that forms a triangle with it), et cetera.
(Mendes 2008)
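The k = 1 step can be sketched as a repeated vertex-deletion loop. This is an illustration only and covers k = 1, not larger k. It assumes an adjacency map of sets with hypothetical (part, position) vertex tuples, and it checks that a vertex has a neighbor in each of the other t - 1 parts, which is what extendability requires and which implies having at least t - 1 neighbors.

```python
def prune_k1(adj: dict, t: int) -> dict:
    """Delete vertices that are not extendable 1-cliques, until stable.

    A vertex survives only if its neighbors cover all t - 1 other parts.
    Deleting one vertex can make others fail the test, so the scan is
    repeated until no deletion occurs. The input map is not modified.
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            parts_covered = {u[0] for u in adj[v]}
            if len(parts_covered) < t - 1:
                for u in adj[v]:
                    adj[u].discard(v)  # remove edges into v
                del adj[v]
                changed = True
    return adj


if __name__ == "__main__":
    a, b, c, d = (0, 0), (1, 0), (2, 0), (2, 1)
    graph = {a: {b, c, d}, b: {a, c}, c: {a, b}, d: {a}}
    pruned = prune_k1(graph, t=3)
    print(sorted(pruned))
```

Here vertex d has a neighbor only in part 0, so it is removed, and the triangle a, b, c (a 3-clique for t = 3) is all that remains.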

The WINNOWER Algorithm
The algorithm does not give the motif directly.
It produces a graph G in which only t-cliques remain. The existence of a t-clique does not ensure that a motif exists in the set of sequences.
One must examine all remaining cliques and determine which of them, if any, correspond to motifs.

The WINNOWER Algorithm
An empty graph implies that no motif exists.
A small graph with large cliques implies that some motifs may exist.
A very large graph means that many spurious edges remain, so the algorithm was not efficient in finding a t-clique.
(Mendes 2008)

The WINNOWER Algorithm
Drawbacks
–It consumes large amounts of time and space
–It is not guaranteed to find motifs accurately
–Even if a t-clique is found, it does not mean that a motif exists

Problems/Review/Questions

General Problems
Reliability - proficiency in exposing a motif
Complexity - what it costs to find a motif
Questions that arise:
–How should we formally define the reliability of a motif finder?
–Should we be content with worst-case time scenarios?
(Mendes 2008)

General Problems
One of the most recognized ways of testing the reliability of an algorithm is to run it on input sequences for which the expected motifs are known.
In other words, test the new algorithm on such input sequences and check whether it reaches the same conclusions as pre-existing algorithms.
(Mendes 2008)
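One way to obtain input sequences with known motifs is to plant them in random background sequences. The sketch below is an assumption about how such a benchmark could be set up, not a procedure taken from the slides. The names `plant_motif` and `recovery_rate`, the uniform ACGT background, and the exact-position scoring rule are all hypothetical choices.

```python
import random


def plant_motif(motif: str, t: int, n: int, e: int,
                seed: int = 0, alphabet: str = "ACGT"):
    """Build t random sequences of length n, each hiding one motif copy.

    Each copy has at most e positions re-drawn at random, so it differs
    from `motif` in at most e places. Requires e <= len(motif) <= n.
    Returns the sequences and the 0-based start of each planted copy.
    """
    rng = random.Random(seed)
    l = len(motif)
    sequences, positions = [], []
    for _ in range(t):
        background = [rng.choice(alphabet) for _ in range(n)]
        instance = list(motif)
        for p in rng.sample(range(l), rng.randint(0, e)):
            instance[p] = rng.choice(alphabet)  # may redraw the same symbol
        start = rng.randrange(n - l + 1)
        background[start:start + l] = instance
        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions


def recovery_rate(predicted: list[int], planted: list[int]) -> float:
    """Fraction of sequences in which the predicted start is the planted one."""
    return sum(p == q for p, q in zip(predicted, planted)) / len(planted)


if __name__ == "__main__":
    seqs, starts = plant_motif("ACGTACGT", t=5, n=30, e=2)
    for s, p in zip(seqs, starts):
        print(p, s[p:p + 8])
```

A motif finder under test would be run on the generated sequences, and its reported positions compared with the planted ones using `recovery_rate` or a more forgiving overlap measure.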

General Problems
Question:
–Once the algorithm finishes, how do scientists distinguish the biologically significant patterns from all of the other patterns that the algorithm kept after its iterations were completed?
Answer:
–The solution requires a better understanding of how DNA and RNA sequences interact with their targets. This will clarify what these algorithms need to do, and it will give biologists better direction on how to interpret the algorithms and data that mathematicians and computer scientists explore.

Review
We discussed the problem of finding non-trivial motifs in a given set of input sequences, allowing some number of mismatches.
We also explored its general applications in the biological world.
We saw an example algorithm.
We discussed problems with current algorithms and asked questions about them.

References
1. D'Haeseleer, Patrick. "How does DNA sequence motif discovery work?" Nature Biotechnology 24 (2006): 959-961.
2. Eskin, Eleazar, and Pavel A. Pevzner. "Finding Composite Regulatory Patterns in DNA Sequences." Bioinformatics 18 (2002): s354-s363.
3. Imming, Peter, Christian Sinning, and Achim Meyer. "Drugs, Their Targets and the Nature and Number of Drug Targets." Nature Reviews Drug Discovery 5 (2006): 821-834.
4. Mendes, Nuno D. "Finding Common Motifs in DNA Sequences: A Survey." Instituto Superior Técnico. 1 Apr. 2008.
5. Pevzner, Pavel A., and Sing-Hoi Sze. "Combinatorial Approaches to Finding Subtle Signals in DNA Sequences." International Conference on Intelligent Systems for Molecular Biology 8 (2000): 269-278. 5 Feb. 2008 http://www.ncbi.nlm.nih.gov/pubmed/10977088
6. Pevzner, Pavel A. Computational Molecular Biology. Cambridge, Massachusetts: The MIT Press, 2001. 133-151.
