Combinatorial Pattern Matching An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info.

Genomic Repeats
- Example of repeats: ATGGTCTAGGTCCTAGTGGTC (GGTC occurs three times)
- Motivation to find them:
  - Genomic rearrangements are often associated with repeats
  - Trace evolutionary secrets
  - Many tumors are characterized by an explosion of repeats
  - ~50% of the human genome is composed of repeats

Genomic Repeats
- The problem is often more difficult, because the copies of a repeat are similar rather than identical: ATGGTCTAGGACCTAGTGTTC

l-mer Repeats
- Long repeats are difficult to find
- Short repeats are easy to find (e.g., by hashing)
- Simple approach to finding long repeats:
  - Find exact repeats of short l-mers. l is usually 10 to 13; for a sequence of length n, the table needs 4^l bins with about n/4^l entries per bin on average (the sequence contains n - l + 1 l-mers)
  - Use l-mer repeats as seeds that can potentially be extended into longer, maximal repeats

l-mer Repeats (cont’d)
- There are typically many locations where an l-mer is repeated:
  GCTTACAGATTCAGTCTTACAGATGGT
- The 4-mer TTAC starts at locations 3 and 17

Extending l-mer Repeats
- Start from the 4-mer matches of TTAC at locations 3 and 17:
  GCTTACAGATTCAGTCTTACAGATGGT
- Extend these 4-mer matches in both directions for as long as the two copies agree
- Extending to the right gives TTACAGAT; the C immediately to the left is also shared, so the maximal repeat is CTTACAGAT (locations 2 and 16)

Maximal Repeats
- To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
- Hashing lets us find repeats quickly in this manner

Hashing: What is it?
- What does hashing do?
  - Maps each data item to an integer (ideally a different integer for different items)
  - Stores the item in an array at the index given by that integer
- Hashing is a very efficient way to store and retrieve data

Hashing: Definitions
- Hash table: the array used in hashing
- Records: the data stored in a hash table
- Key: identifies a set of records
- Hash function: uses a key to compute the index at which to insert into the hash table
- Collision: occurs when more than one record is mapped to the same index in the hash table

Hashing DNA sequences
- Each l-mer can be translated into a binary string (A, C, G, T can be represented as 00, 01, 10, 11)
- After assigning a unique integer to each l-mer, it is easy to collect all start locations of each l-mer in a genome
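The 2-bit encoding described on this slide can be sketched as follows. This is a minimal illustration, not code from the book; the names `CODE`, `encode_lmer`, and `decode_lmer` are hypothetical, and the input is assumed to contain only the letters A, C, G, T.

```python
# Sketch: turn an l-mer into an integer using 2 bits per base
# (A=00, C=01, G=10, T=11, as on the slide). Names are illustrative.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_lmer(lmer):
    """Return the integer whose 2-bit digits spell out the l-mer."""
    value = 0
    for base in lmer:
        value = (value << 2) | CODE[base]  # append two bits per base
    return value


def decode_lmer(value, l):
    """Inverse of encode_lmer for an l-mer of known length l."""
    bases = []
    for _ in range(l):
        bases.append("ACGT"[value & 3])  # read the lowest two bits
        value >>= 2
    return "".join(reversed(bases))


print(encode_lmer("TTAC"))    # 0b11110001
print(decode_lmer(241, 4))
```

Because every l-mer maps to a distinct integer in the range 0 to 4^l - 1, the integer can be used directly as an index into a table with 4^l bins.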

Hashing: Maximal Repeats
- To find repeats in a genome:
  - For all l-mers in the genome, note the start position and the sequence
  - Generate a hash table index for each unique l-mer sequence
  - At each index of the hash table, store all genome start locations of the l-mer that generated that index
  - Extend l-mer repeats to maximal repeats
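The steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: a Python `dict` stands in for the hash table (it handles collisions internally), positions are 1-based as on the slides, and the function names `lmer_index` and `extend_pair` are hypothetical.

```python
from collections import defaultdict


def lmer_index(genome, l):
    """Map every l-mer to the list of its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - l + 1):
        index[genome[i:i + l]].append(i + 1)
    return index


def extend_pair(genome, i, j, l):
    """Extend two copies of an l-mer (1-based starts i < j) to the left
    and right while they agree. Returns (start_i, start_j, length)."""
    a, b = i - 1, j - 1  # switch to 0-based indices
    length = l
    while a > 0 and genome[a - 1] == genome[b - 1]:  # extend left
        a -= 1
        b -= 1
        length += 1
    while b + length < len(genome) and genome[a + length] == genome[b + length]:
        length += 1  # extend right
    return a + 1, b + 1, length


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
index = lmer_index(genome, 4)
start_i, start_j, length = extend_pair(genome, *index["TTAC"], 4)
print(index["TTAC"], genome[start_i - 1:start_i - 1 + length])
```

On the example sequence, the two copies of TTAC at locations 3 and 17 extend four bases to the right and one base to the left (the shared C), giving CTTACAGAT at locations 2 and 16.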

Hashing: Collisions
- Dealing with collisions: “chain” all start locations of the l-mers that share an index (linked list)

Pattern Matching
- What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?
- This leads us to a different problem, the Pattern Matching Problem

Pattern Matching Problem
- Goal: Find all occurrences of a pattern in a text
- Input: Pattern p = p_1…p_n and text t = t_1…t_m
- Output: All positions 1 ≤ i ≤ m – n + 1 such that the n-letter substring of t starting at i matches p
- Motivation: Searching a database for a known pattern

Exact Pattern Matching: A Brute-Force Algorithm
PatternMatching(p, t)
1  n ← length of pattern p
2  m ← length of text t
3  for i ← 1 to (m – n + 1)
4      if t_i…t_(i+n-1) = p
5          output i
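A direct Python rendering of the PatternMatching pseudocode might look like this. It is a minimal sketch; the function name is illustrative and positions are reported 1-based to match the slides.

```python
def pattern_matching(p, t):
    """Return all 1-based positions where pattern p occurs in text t."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):      # every possible start in t
        if t[i:i + n] == p:         # compare the n-letter window with p
            positions.append(i + 1)
    return positions


print(pattern_matching("GCAT", "CGCATC"))
```

The slice comparison `t[i:i + n] == p` hides the inner loop: it can take up to n character comparisons, which is where the O(nm) worst case comes from.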

Exact Pattern Matching: An Example
- PatternMatching algorithm for pattern GCAT and text CGCATC
- The pattern is slid along the text one position at a time:
  i = 1:  CGCATC
          GCAT      (no match)
  i = 2:  CGCATC
           GCAT     (match, output 2)
  i = 3:  CGCATC
            GCAT    (no match)

Exact Pattern Matching: Running Time
- PatternMatching runtime: O(nm) in the worst case
- On average, it is closer to O(m)
  - Rarely will there be close to n comparisons in line 4 (a worst case is looking for AAAAT in a string consisting of 10^6 A’s)
  - Note: with x character types, the probability that the first character matches is 1/x, and the probability that it does not is (x-1)/x. The probability that the first matches and the second does not is (x-1)/x^2. The probability that the first j characters match is 1/x^j, so the expected number of comparisons per position is at most 1 + 1/x + 1/x^2 + … = x/(x-1), i.e., 4/3 for DNA
- Better solution: suffix trees
  - Can solve the problem in O(m) time
  - Conceptually related to keyword trees

Keyword Trees: Example
- Keyword tree built by adding one keyword at a time: Apple, Apropos, Banana, Bandana, Orange

Keyword Trees: Properties
- Stores a set of keywords in a rooted labeled tree
- Each edge is labeled with a letter from an alphabet
- Any two edges coming out of the same vertex have distinct labels
- Every keyword stored can be spelled on a path from the root to some leaf
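One simple way to realize these properties is a trie of nested dictionaries. The sketch below is illustrative only: the function names are hypothetical, and storing the keyword under the key `None` as an end-of-keyword marker is an assumption of this sketch (it also copes with one keyword being a prefix of another, a case the slide does not discuss).

```python
def build_keyword_tree(keywords):
    """Build a keyword tree as nested dicts: one dict per vertex,
    one key per outgoing edge label."""
    root = {}
    for word in keywords:
        node = root
        for letter in word:
            # edges out of one vertex have distinct labels: reuse or create
            node = node.setdefault(letter, {})
        node[None] = word  # end-of-keyword marker (assumption of this sketch)
    return root


def contains(tree, word):
    """Spell the word from the root; succeed only at an end marker."""
    node = tree
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return None in node


tree = build_keyword_tree(["apple", "apropos", "banana", "bandana", "orange"])
print(contains(tree, "apple"), contains(tree, "appeal"))
```

Building the tree touches each letter of each keyword once, so it takes time proportional to the total length of the keywords.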

Keyword Trees: Threading (cont’d)
- Search for “appeal”: threading follows a-p-p and then stops, because no edge labeled e leaves that vertex, so “appeal” is not in the tree
- Search for “apple”: threading follows a-p-p-l-e to a leaf, so “apple” is found

Multiple Pattern Matching Problem
- Goal: Given a set of patterns and a text, find all occurrences of any of the patterns in the text
- Input: k patterns p_1,…,p_k, and text t = t_1…t_m
- Output: Positions 1 ≤ i ≤ m where a substring of t starting at i matches p_j for some 1 ≤ j ≤ k
- Motivation: Searching a database for multiple known patterns

Multiple Pattern Matching: Straightforward Approach
- Can be solved as k “Pattern Matching Problems”
- Runtime: O(kmn), using the PatternMatching algorithm k times
  - m: length of the text
  - n: average length of a pattern

Multiple Pattern Matching: Keyword Tree Approach
- Or, we could use keyword trees:
  - Build the keyword tree in O(N) time; N is the total length of all patterns (i.e., k*n)
  - With naive threading: O(N + nm)
  - Aho-Corasick algorithm: O(N + m)
- Example: suppose m = 10^6, k = 100, and n = 1000
  - O(kmn): 10^2 * 10^6 * 10^3 = 10^11
  - O(N + nm): 10^2 * 10^3 + 10^3 * 10^6 = 10^5 + 10^9 ≈ 10^9
  - O(N + m): 10^2 * 10^3 + 10^6 = 10^5 + 10^6 ≈ 10^6

Keyword Trees: Threading
- To match patterns in a text using a keyword tree:
  - Build the keyword tree of the patterns
  - “Thread” the text through the keyword tree

Keyword Trees: Threading (cont’d)
- Threading is “complete” when we reach a leaf in the keyword tree
- When threading is “complete,” we’ve found a pattern in the text
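Naive threading, restarted from every position of the text, can be sketched as below. This is the O(N + nm) variant, not Aho-Corasick. The nested-dict tree, the `None` end-of-pattern marker, and the function names are assumptions of this sketch; unlike the slide, it keeps walking after a pattern ends so that longer patterns sharing the same prefix are also reported.

```python
def build_keyword_tree(patterns):
    """Keyword tree as nested dicts; key None marks the end of a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            node = node.setdefault(letter, {})
        node[None] = pattern
    return root


def thread_text(tree, text):
    """Thread the text through the tree from every start position.
    Returns (1-based position, pattern) pairs."""
    hits = []
    for start in range(len(text)):
        node = tree
        for letter in text[start:]:
            node = node.get(letter)
            if node is None:       # no edge with this label: threading fails
                break
            if None in node:       # threading is complete for some pattern
                hits.append((start + 1, node[None]))
    return hits


tree = build_keyword_tree(["GCAT", "CAT", "AT"])
print(thread_text(tree, "CGCATC"))
```

Each start position walks at most as deep as the longest pattern, which gives the n*m term; Aho-Corasick avoids the restarts with failure links, which this sketch does not implement.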

Suffix Trees = Collapsed Keyword Trees
- Similar to keyword trees, except that edges forming non-branching paths are collapsed
- Each edge is labeled with a substring of the text
- All internal vertices have at least two outgoing edges
- Leaves are labeled by the index (start position) of the suffix they spell

Suffix Tree of a Text
- The suffix tree of a text is constructed from all its suffixes, e.g., for ATCATG: ATCATG, TCATG, CATG, ATG, TG, G
- [Figure: keyword tree of the suffixes next to the corresponding suffix tree]
- How much time does it take? Building the keyword tree takes time linear in the total size of all suffixes, i.e., quadratic in the length of the text

Suffix Trees: Advantages
- The suffix tree of a text is constructed from all its suffixes (for ATCATG: ATCATG, TCATG, CATG, ATG, TG, G)
- A suffix tree can be built faster than the keyword tree of all suffixes: quadratic time for the keyword tree, linear time for the suffix tree (Weiner suffix tree algorithm)

Use of Suffix Trees
- Suffix trees hold all suffixes of a text
  - e.g., ATCGC: ATCGC, TCGC, CGC, GC, C
  - Built in O(m) time for a text of length m
- To find any pattern of length n in a text:
  - Build the suffix tree for the text
  - Thread the pattern through the suffix tree
  - Can find the pattern in the text in O(n) time!
- O(n + m) time for the “Pattern Matching Problem”
  - Build the suffix tree and look up the pattern

Pattern Matching with Suffix Trees
SuffixTreePatternMatching(p, t)
Build suffix tree for text t
Thread pattern p through suffix tree
if threading is complete
    output positions of all p-matching leaves in the tree
else
    output “Pattern does not appear in text”
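The lookup logic of SuffixTreePatternMatching can be illustrated with an uncollapsed suffix trie, i.e., the keyword tree of all suffixes. Note the swap: this sketch builds in quadratic time and space, not with a linear-time suffix tree algorithm such as Weiner's. The function names, the `$` terminator (assumed absent from the text), and the `None` key holding each leaf label are assumptions of this sketch.

```python
def build_suffix_trie(text):
    """Keyword tree of all suffixes of text (quadratic size).
    Each leaf stores the 1-based start of its suffix under key None."""
    terminated = text + "$"  # terminator makes every suffix end at a leaf
    root = {}
    for start in range(len(text)):
        node = root
        for letter in terminated[start:]:
            node = node.setdefault(letter, {})
        node[None] = start + 1
    return root


def find_pattern(trie, p):
    """Thread p through the trie, then collect all leaves below."""
    node = trie
    for letter in p:
        if letter not in node:
            return []  # pattern does not appear in text
        node = node[letter]
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                positions.append(child)  # leaf label: a start position
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("ATCATG")
print(find_pattern(trie, "AT"))
```

Threading the pattern costs one step per pattern letter, independent of the text length; the remaining work is proportional to the size of the subtree that holds the matching leaves.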

Suffix Trees: Example
[Figure: example suffix tree; not recoverable from the extracted text]

Multiple Pattern Matching: Summary
- Keyword and suffix trees are used to find patterns in a text
- Keyword trees: build a keyword tree of the patterns, and thread the text through it
- Suffix trees: build a suffix tree of the text, and thread the patterns through it

Approximate vs. Exact Pattern Matching
- So far we have seen only exact pattern matching algorithms
- Usually, because of mutations, it makes much more biological sense to find approximate pattern matches
- Biologists often use fast heuristic approaches (rather than local alignment) to find approximate matches

Heuristic Similarity Searches
- Genomes are huge: quadratic alignment algorithms such as Smith-Waterman are too slow
- An alignment of two sequences usually contains short identical or highly similar fragments
- Many heuristic methods (e.g., FASTA) are based on the same idea of filtration
  - Find short exact matches, and use them as seeds for potential match extension
  - “Filter” out positions with no extendable matches

Dot Matrices
- Dot matrices show similarities between two sequences
- FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)

Dot Matrices (cont’d)
- Identify diagonals above a threshold length
- Diagonals in the dot matrix indicate exact substring matching

Diagonals in Dot Matrices
- Extend diagonals and try to link them together, allowing for minimal mismatches/indels
- Linking diagonals reveals approximate matches over longer substrings

Approximate Pattern Matching Problem
- Goal: Find all approximate occurrences of a pattern in a text
- Input: A pattern p = p_1…p_n, text t = t_1…t_m, and k, the maximum number of mismatches
- Output: All positions 1 ≤ i ≤ m – n + 1 such that t_i…t_(i+n-1) and p_1…p_n have at most k mismatches (i.e., the Hamming distance between t_i…t_(i+n-1) and p is ≤ k)

Approximate Pattern Matching: A Brute-Force Algorithm
ApproximatePatternMatching(p, t, k)
n ← length of pattern p
m ← length of text t
for i ← 1 to m – n + 1
    dist ← 0
    for j ← 1 to n
        if t_(i+j-1) ≠ p_j
            dist ← dist + 1
    if dist ≤ k
        output i
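The brute-force algorithm translates almost line for line into Python. This is an illustrative sketch with a hypothetical function name; the early exit once more than k mismatches have been seen is an addition of the sketch and does not change the output.

```python
def approximate_pattern_matching(p, t, k):
    """Return all 1-based positions i where the n-letter window of t
    starting at i differs from p in at most k positions."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):
        dist = 0
        for j in range(n):
            if t[i + j] != p[j]:
                dist += 1
                if dist > k:   # already too many mismatches: stop early
                    break
        if dist <= k:
            positions.append(i + 1)
    return positions


print(approximate_pattern_matching("ATG", "ATCATG", 1))
```

With k = 0 the function reduces to exact pattern matching.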

Approximate Pattern Matching: Running Time
- The brute-force algorithm runs in O(nm)
- Landau-Vishkin algorithm: O(km), for a text of length m
- We can generalize the “Approximate Pattern Matching Problem” into a “Query Matching Problem”:
  - We want to match substrings in a query to substrings in a text with at most k mismatches
  - Motivation: we want to see similarities to some gene, but we may not know which parts of the gene to look for

Query Matching Problem
- Goal: Find all substrings of the query that approximately match the text
- Input: Query q = q_1…q_r, text t = t_1…t_m, n (length of matching substrings), k (maximum number of mismatches)
- Output: All pairs of positions (i, j) such that the n-letter substring of q starting at i approximately matches the n-letter substring of t starting at j, with at most k mismatches

Approximate Pattern Matching vs Query Matching
[Figure: comparison of the two problems; not recoverable from the extracted text]

Query Matching: Main Idea
- Approximately matching strings share some perfectly matching substrings
- Instead of searching for approximately matching strings (difficult), search for perfectly matching substrings (easy)

Filtration in Query Matching
- We want all n-matches between a query and a text with up to k mismatches
- “Filter” out positions we know do not match between text and query
- Potential match detection: find all matches of l-tuples in query and text for some small l
- Potential match verification: verify each potential match by extending it to the left and right, until (k + 1) mismatches are found

Filtration: Match Detection
- Theorem: If x_1…x_n and y_1…y_n match with at most k mismatches, then they share an l-mer that is perfectly matched, with l = ⌊n/(k + 1)⌋. That is, x_(i+1)…x_(i+l) = y_(i+1)…y_(i+l) for some 0 ≤ i ≤ n - l.
- Proof: Partition the set of positions from 1 to n into k + 1 groups of consecutive positions, with at least ⌊n/(k + 1)⌋ positions in each group. Observe that k mismatches can affect at most k of these k + 1 parts, so at least one of these k + 1 parts is perfectly matched.

Filtration: Match Detection (cont’d)
- Suppose k = 3. We would then have l = n/(k+1) = n/4, and the positions split into k + 1 = 4 parts:
  part 1: 1…l    part 2: l+1…2l    part 3 (= k): 2l+1…3l    part 4 (= k + 1): 3l+1…n
- There are at most k mismatches among the n positions, so at least one of the k + 1 l-tuples contains no mismatch
- Note: this gives us a way to select a value for l (or at least an upper bound)

l-mer Filtration Algorithm
- Potential match detection: find all matches of l-mers in both query and text for l = ⌊n/(k+1)⌋ (e.g., use either hashing or a suffix tree)
- Potential match verification: verify each potential match by extending it to the left and to the right until the first k + 1 mismatches are found (or the beginning or end of the query or the text is reached)
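The two phases can be sketched as follows. This is a simplified illustration rather than the book's procedure: detection uses a dictionary of text l-mers, and verification checks every n-letter window pair that contains the seed on its diagonal by Hamming distance, instead of extending until k + 1 mismatches. All function names are hypothetical and positions are 1-based.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def query_matching(q, t, n, k):
    """Return sorted (i, j) pairs (1-based) such that the n-letter
    substrings q[i..] and t[j..] differ in at most k positions."""
    l = n // (k + 1)
    if l == 0:
        raise ValueError("n must be at least k + 1 for filtration")
    # Detection: index all l-mers of the text.
    index = defaultdict(list)
    for j in range(len(t) - l + 1):
        index[t[j:j + l]].append(j)
    matches = set()
    for i in range(len(q) - l + 1):
        for j in index.get(q[i:i + l], ()):
            # Verification: try every n-window containing this seed.
            for shift in range(n - l + 1):
                qi, tj = i - shift, j - shift
                if qi < 0 or tj < 0 or qi + n > len(q) or tj + n > len(t):
                    continue
                if hamming(q[qi:qi + n], t[tj:tj + n]) <= k:
                    matches.add((qi + 1, tj + 1))
    return sorted(matches)


print(query_matching("GCAT", "CGCATC", 4, 0))
```

By the theorem on match detection, every window pair with at most k mismatches contains a perfectly matching l-mer at the same offset in both windows, so no true match is filtered out; the speedup comes from never examining window pairs that share no seed.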

Filtration: Match Verification
- For each l-match we find, try to extend the match further to see if it is substantial
- Extend the perfect match of length l between query and text until we find an approximate match of length n with at most k mismatches

Filtration: Example
                  k = 0   k = 1   k = 2   k = 3   k = 4   k = 5
l-tuple length      n      n/2     n/3     n/4     n/5     n/6
- As k grows, shorter perfect matches are required and performance decreases

Local alignment is too slow…
- Quadratic local alignment is too slow when looking for similarities between long strings (e.g., the entire GenBank database)
- But it is guaranteed to find the optimal local alignment
- It sets the standard for sensitivity

Local alignment is too slow…
- Basic Local Alignment Search Tool (BLAST)
  - Altschul, S., Gish, W., Miller, W., Myers, E. & Lipman, D.J., Journal of Mol. Biol., 1990
  - Searches sequence databases for local alignments to a query

BLAST
- Great improvement in speed, with a modest decrease in sensitivity
- Minimizes the search space instead of exploring the entire search space between two sequences
- Finds short exact matches (“seeds”), and only explores locally around these “hits”

What Similarity Reveals
- BLASTing a new gene
  - Evolutionary relationship
  - Similarity between protein function
- BLASTing a genome
  - Potential genes

BLAST algorithm
- Keyword search of all words of length w from the query of length n in a database of length m, keeping matches with score above a threshold
  - w = 11 for DNA queries, w = 3 for proteins
- Local alignment extension for each keyword found
  - Extend the result until the longest match above threshold is achieved
- Running time O(nm)

BLAST algorithm (cont’d)
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword from the query: GVK
Neighborhood words and their scores (neighborhood score threshold T = 13):
  GVK 18
  GAK 16
  GIK 16
  GGK 14
  GLK 13
  ------ T = 13
  GNK 12
  GRK 11
  GEK 11
  GDK 11
High-scoring Pair (HSP) extension:
  Query:  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
             +++DN +G +   IR L    G+K I+ L+ E+ RG++K
  Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 264

Original BLAST
- Dictionary: all words of length w
- Alignment: ungapped extensions until the score falls below some statistical threshold
- Output: all local alignments with score > threshold

Original BLAST: Example
- Sequences: ACGAAGTAAGGTCCAGT and AGCGTTAGGTCCTAGTC (the two axes of the dot matrix)
- w = 4
- Exact keyword match of GGTC
- Extend diagonals with mismatches until score is under 50%
- Output result:
  GTAAGGTCC
  GTTAGGTCC
- cf. Serafim Batzoglou lectures (Stanford)
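The seed-and-extend idea of this example can be sketched as follows. This is a toy illustration, not BLAST itself: the scoring (+1 for a match, -1 for a mismatch) and the stopping rule (give up once the running score falls more than `x_drop` below the best score seen) are assumptions of this sketch that stand in for the slide's "under 50%" rule, and all function names are hypothetical. Indices inside the code are 0-based.

```python
def find_seeds(a, b, w):
    """Return (i, j) pairs where a[i:i+w] == b[j:j+w] (exact w-mer hits)."""
    index = {}
    for j in range(len(b) - w + 1):
        index.setdefault(b[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(a) - w + 1)
            for j in index.get(a[i:i + w], [])]


def extend_one_way(a, b, i, j, step, x_drop):
    """Walk along the diagonal from (i, j) in direction step (+1 or -1).
    Return how many columns the best-scoring extension covers."""
    score = best = 0
    length = best_length = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += 1 if a[i] == b[j] else -1
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > x_drop:  # fell too far below the best: stop
            break
        i += step
        j += step
    return best_length


def ungapped_extend(a, b, i, j, w, x_drop=2):
    """Extend the seed a[i:i+w] == b[j:j+w] in both directions, no gaps."""
    left = extend_one_way(a, b, i - 1, j - 1, -1, x_drop)
    right = extend_one_way(a, b, i + w, j + w, +1, x_drop)
    return a[i - left:i + w + right], b[j - left:j + w + right]


a, b = "ACGAAGTAAGGTCCAGT", "AGCGTTAGGTCCTAGTC"
print(find_seeds(a, b, 4))
print(ungapped_extend(a, b, 9, 7, 4))
```

With these assumed parameters, extending the GGTC seed reproduces the pair GTAAGGTCC / GTTAGGTCC shown on the slide. Real BLAST scores extensions with substitution matrices and statistical thresholds, which this sketch does not model.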

Gapped BLAST: Example
- Sequences: ACGAAGTAAGGTCCAGT and AGCGTTAGGTCCTAGTC (the two axes of the dot matrix)
- Original BLAST exact keyword search, THEN:
  - Extend with gaps around the ends of the exact match until score < threshold
- Output result:
  GTAAGGTCC-AGT
  GTTAGGTCCTAGT
- cf. Serafim Batzoglou lectures (Stanford)

Incarnations of BLAST
- blastn: Nucleotide-nucleotide
- blastp: Protein-protein
- blastx: Translated query vs. protein database
- tblastn: Protein query vs. translated database
- tblastx: Translated query vs. translated database

Incarnations of BLAST (cont'd)
- PSI-BLAST: Find members of a protein family or build a custom position-specific score matrix
- Megablast: Search longer sequences with fewer differences
- WU-BLAST (Wash U BLAST): Optimized, added features

Assessing sequence similarity
- Need to know how strong an alignment can be expected from chance alone
- "Chance" relates to comparison of sequences that are generated randomly based upon a certain sequence model
- Sequence models may take into account:
  - G+C content
  - Poly-A tails
  - "Junk" DNA
  - Codon bias
  - Etc.

Sample BLAST output
Blast of human beta globin protein against zebra fish

Sequences producing significant alignments:                           Score (bits)  E Value
gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757...   171           3e-44
gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer...   170           7e-44
gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D...   170           7e-44
gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio]                  168           3e-43

ALIGNMENTS
>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]
Length = 148
Score = 171 bits (434), Expect = 3e-44
Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

Query: 1   MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60
           MV  T  E++A+  LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK
Sbjct: 1   MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

Query: 61  VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120
           V AHG+ V+G     + ++DN+K T+A LS +H +KLHVDP+NFRLL + +    A  FG
Sbjct: 61  VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120

Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147
           +  F   VQ A+QK +A V +AL  +YH
Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148

Sample BLAST output (cont'd)
Blast of human beta globin DNA against human DNA

Sequences producing significant alignments:                           Score (bits)  E Value
gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1...   289           1e-75
gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end        289           1e-75
gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge...   280           1e-72
gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin     260           1e-66
gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud...   151           7e-34
gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob...   149           3e-33

ALIGNMENTS
>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11
Length = 81706
Score = 149 bits (75), Expect = 3e-33
Identities = 183/219 (83%)
Strand = Plus / Plus

Query: 267   ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326
             || ||| | ||    | || |  |||||| ||||| |||||||||||    ||||||||
Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468

Query: 327   ctgcactgtgacaagctgcatgtggatcctgagaacttc 365
             ||||||||| |||||||||| ||||| ||||||||||||
Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507

Expect (E) value The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance. The Expect value can also be used as a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#expect
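The exponential dependence on score and the linear dependence on database size can be made concrete with a small sketch. It assumes the standard relation between a bit score S' and the E value, E = m * n * 2^(-S'), where m is the query length and n the total database length. The function name and the lengths used below are illustrative assumptions, and real BLAST reports use edge-corrected effective lengths, which this sketch ignores.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only to show the scaling behaviour.
m, n = 147, 1_000_000_000

for s in (30, 40, 50, 60):
    print(f"bit score {s}: E = {expect_value(s, m, n):.3g}")

# Each extra bit halves E; doubling the database doubles E.
print(expect_value(50, m, n) / expect_value(51, m, n))
print(expect_value(50, m, 2 * n) / expect_value(50, m, n))
```

This also shows why short, near-identical alignments can carry high E values: a short alignment cannot accumulate a large bit score, so the 2^(-S') factor stays comparatively large.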

E-values in BLAST results represent the number of alignments with an equal or better score that are expected to occur by chance; they are not probabilities. The E-value is a statistical calculation based on the quality of the alignment (the score) and the size of the database, so the exact same alignment obtained from a database of a different size will have a different E-value. An E-value of 1e-3 says that about 0.001 alignments scoring this well are expected by chance in a database of this size; for such small values the E-value is approximately the probability of seeing at least one chance hit (P = 1 - e^(-E)). An E-value of 0 is actually a rounded-down value (maybe 1e-250 or something), and is simply saying that there is (almost) no chance that the alignment arose by chance. The score is the measure of similarity between two sequences, and is calculated from the alignment using a scoring matrix. http://www.protocol-online.org/biology-forums/posts/5426.html

Hits and Misses Sensitivity: how good is the search at finding relationships (even distant ones)? Selectivity: are the relationships reported actually true relationships? Consider the tradeoffs…

Errors

The error rate of a classification system is one measure of how well the system solves the problem for which it was designed. Other measures are possible, including speed and expense.

Definitions
- The classifier makes a classification error whenever it classifies the input object as class C_i when the true class is class C_j; i ≠ j and C_i ≠ C_r, the reject class.
- The empirical error rate of a classification system is the number of errors made on independent test data divided by the number of classifications attempted.
- The empirical reject rate of a classification system is the number of rejects made on independent test data divided by the number of classifications attempted.
- Independent test data are sample objects with true class known, including objects from the reject class, that were not used in designing the feature extraction and classification algorithms.

Computer Vision (Shapiro and Stockman)

False Alarms and False Dismissals in Two-Class Problems Assume that a system is attempting to identify which items belong to class A: False alarm (false positive): placing an object in class A which does not belong to the class False dismissal (false negative): failing to place an object in class A which does, in fact, belong to the class Note that the cost of errors depends upon the problem. It may be necessary to bias the decision process so as to minimize false dismissals at the cost of increasing the number of false alarms. Computer Vision (Shapiro and Stockman)

Precision Versus Recall

The precision of a retrieval system is the number of relevant items (true C_1) retrieved divided by the total number of items retrieved (true C_1 plus false alarms actually from C_2).

The recall of a retrieval system is the number of relevant items retrieved by the system divided by the total number of relevant items in the database. Equivalently, this is the number of true C_1 items retrieved divided by the total of the true C_1 items retrieved and the false dismissals.

Example: Consider a database with 10,000 items, 400 of which are of item A. A specific user query is intended to retrieve the item A entries. If the system retrieves 300 instances of item A and 200 instances of other unrelated items, the precision of the system for this query is 300/(300+200) = 0.6 (i.e., 60%) and the recall is 300/400 = 0.75 (i.e., 75%).

Note that 100% recall could be obtained by returning all items in the database, but the precision would be abysmal: 400/10000 = 0.04 (i.e., 4%). Similarly, higher precision could be achieved by tweaking the system for a low false alarm rate, but the recall would likely suffer.

Computer Vision (Shapiro and Stockman)
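The arithmetic in the example can be checked with a short sketch. The function names are illustrative; the counts are the ones given in the example (400 relevant items out of 10,000; 300 true hits and 200 false alarms retrieved).

```python
def precision(true_hits, false_alarms):
    """Relevant items retrieved / all items retrieved."""
    return true_hits / (true_hits + false_alarms)


def recall(true_hits, false_dismissals):
    """Relevant items retrieved / all relevant items in the database."""
    return true_hits / (true_hits + false_dismissals)


total_items, relevant = 10_000, 400

# The query from the example: 300 of item A plus 200 unrelated items.
p = precision(true_hits=300, false_alarms=200)
r = recall(true_hits=300, false_dismissals=relevant - 300)
print(p, r)

# Degenerate strategy: return the entire database.
p_all = precision(true_hits=relevant, false_alarms=total_items - relevant)
r_all = recall(true_hits=relevant, false_dismissals=0)
print(p_all, r_all)
```

The second case makes the tradeoff explicit: recall reaches 100% only because nothing is ever dismissed, and precision collapses to the base rate of item A in the database.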

Timeline
- 1970: Needleman-Wunsch global alignment algorithm
- 1981: Smith-Waterman local alignment algorithm
- 1985: FASTA
- 1990: BLAST (basic local alignment search tool)
- 2000s: BLAST has become too slow in "genome vs. genome" comparisons; new faster algorithms evolve!
  - PatternHunter
  - BLAT

PatternHunter: faster and even more sensitive
- BLAST: matches short consecutive sequences (consecutive seed)
  - Length = k
  - Example (k = 11): 11111111111
    Each 1 represents a "match"
- PatternHunter: matches short non-consecutive sequences (spaced seed)
  - Increases sensitivity by locating homologies that would otherwise be missed
  - Example (a spaced seed of length 18 w/ 11 "matches"): 111010010100110111
    Each 0 represents a "don't care", so there can be a match or a mismatch
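A seed of either kind can be treated as a mask over an ungapped alignment. The sketch below is a minimal illustration that assumes the two sequences are already lined up position by position (a single diagonal); the function name and the short toy sequences are assumptions made for the example.

```python
def seed_hits(a, b, seed):
    """Return every start offset s at which all '1' positions of `seed`
    see identical letters in a and b; '0' positions are ignored."""
    care = [k for k, c in enumerate(seed) if c == "1"]
    span = len(seed)
    n = min(len(a), len(b))
    return [s for s in range(n - span + 1)
            if all(a[s + k] == b[s + k] for k in care)]


# The seed from the slide: span 18, weight (number of 1s) 11.
PH_SEED = "111010010100110111"
print(len(PH_SEED), PH_SEED.count("1"))

# Toy example: the two strings differ only at index 2.
a = "ACGTACGT"
b = "ACTTACGT"
print(seed_hits(a, b, "1111"))  # consecutive seed of weight 4
print(seed_hits(a, b, "1101"))  # spaced seed, weight 3, tolerates index 2
```

In the toy case the spaced seed hits at offset 0 because its "don't care" position falls on the mismatch, while the consecutive seed only hits where four letters in a row agree.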

Spaced seeds
Example of a hit using a spaced seed (figure). How does this result in better sensitivity?

Why is PH better?
- BLAST: redundant hits. This results in > 1 hit and creates clusters of redundant hits.
- PatternHunter: This results in very few redundant hits.

Why is PH better?
BLAST may also miss a hit:

GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT
|| ||||||||| |||||| | ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

In this example, despite a clear homology, there is no sequence of continuous matches longer than length 9 (the longest run is 9 matches). BLAST uses a seed of length 11, and because of this, BLAST does not recognize this as a hit! Resolving this would require reducing the seed length to 9, which would have a damaging effect on speed.
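The claim about the longest run of matches can be verified directly. The helper below is a small illustrative function (its name is an assumption); the two sequences are the ones shown on the slide.

```python
def longest_match_run(a, b):
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


s1 = "GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT"
s2 = "GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT"

run = longest_match_run(s1, s2)
matches = sum(x == y for x, y in zip(s1, s2))
print(run, matches, len(s1))
print("consecutive 11-mer seed can hit:", run >= 11)
```

Because a consecutive seed of length k needs k identical positions in a row, comparing the longest run against k is enough to decide whether such a seed can fire on this alignment.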

Advantage of Gapped Seeds
(Figure; surviving labels: 11 positions, 10 positions)

Why is PH better?
- Higher hit probability
- Lower expected number of random hits
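The hit-probability claim can be explored with a small simulation. The sketch below is an illustration under assumptions that are not on the slide: a homologous region is modelled as 64 independent positions that each match with probability 0.7, and a "hit" means the seed fits somewhere in the region with every 1 position on a match. The function name, region length, similarity, and trial count are all illustrative choices.

```python
import random


def hit_probability(seed, region_len=64, similarity=0.7, trials=5000, rng=None):
    """Monte Carlo estimate of the chance that `seed` hits at least once
    in a random ungapped region (bit 1 = match, bit 0 = mismatch)."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    mask = int(seed, 2)
    span = len(seed)
    hits = 0
    for _ in range(trials):
        region = 0
        for _ in range(region_len):
            region = (region << 1) | (rng.random() < similarity)
        # The seed hits at a shift if every masked bit is a match.
        if any(((region >> shift) & mask) == mask
               for shift in range(region_len - span + 1)):
            hits += 1
    return hits / trials


consecutive = hit_probability("11111111111")
spaced = hit_probability("111010010100110111")
print(f"consecutive weight-11 seed: {consecutive:.3f}")
print(f"spaced weight-11 seed:      {spaced:.3f}")
```

Both seeds have weight 11, so under a uniform random-sequence model each position has the same chance (4^-11) of a spurious hit; the difference that the simulation probes is only how often each seed finds a true homology.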

Use of Multiple Seeds
Basic Searching Algorithm
1. Select a group of spaced seed models
2. For each hit of each model, conduct extension to find a homology.

Another method: BLAT
- BLAT (BLAST-Like Alignment Tool)
- Same idea as BLAST: locate short sequence hits and extend

BLAT vs. BLAST: Differences
- BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database
- Index is stored in RAM, which is memory intensive but results in faster searches
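The "index the database, scan the query" arrangement can be sketched with an in-memory dictionary. This is a simplified illustration and not BLAT's actual data structure: the function names, the choice to index every overlapping k-mer, and the tiny example database are assumptions.

```python
from collections import defaultdict


def build_index(database, k):
    """Map each k-mer to the (sequence name, position) pairs where it occurs.
    `database` is a dict of name -> sequence; the index lives in memory."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def scan_query(index, query, k):
    """Walk the query once and look each k-mer up in the database index.
    Returns (query position, sequence name, database position) triples."""
    hits = []
    for q in range(len(query) - k + 1):
        for name, pos in index.get(query[q:q + k], ()):
            hits.append((q, name, pos))
    return hits


database = {"chr": "ACGTACGT"}  # hypothetical one-sequence database
index = build_index(database, k=4)
print(scan_query(index, "GTAC", k=4))
print(scan_query(index, "ACGT", k=4))
```

The index is built once and reused for every query, which is where the speed comes from; the cost is that it must fit in memory alongside the database it describes.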

However...
- BLAT was designed to find sequences of 95% and greater similarity of length >40; it may miss more divergent or shorter sequence alignments

PatternHunter and BLAT vs. BLAST
- PatternHunter is 5-100 times faster than Blastn, depending on data size, at the same sensitivity
- BLAT is several times faster than BLAST, but best results are limited to closely related sequences

Resources
- tandem.bu.edu/classes/2004/papers/pathunter_grp_prsnt.ppt
- http://www.jax.org/courses/archives/2004/gsa04_king_presentation.pdf
- http://www.genomeblat.com/genomeblat/blatRapShow.pps

References
- Simons, Robert W. Advanced Molecular Genetics Course, UCLA (2002). http://www.mimg.ucla.edu/bobs/C159/Presentations/Benzer.pdf
- Batzoglou, S. Computational Genomics Course, Stanford University (2004). http://www.stanford.edu/class/cs262/handouts.html
