Published by Heather Richard. Modified over 9 years ago.
1
Space-Efficient Sequence Alignment
Bioinformatics 202, University of California, San Diego
Lecture Notes No. 7
Dr. Pavel Pevzner (prepared by Iman Famili)
2
Outline
New computational ideas for sequence comparison:
Divide-and-conquer technique
Recursive programs
Hash tables
3
Edit Graphs
An edit graph is used to find similarities between two sequences. Every alignment corresponds to a path from the source to the sink of the edit graph, so finding the best alignment is a longest path problem.
There are 3 types of edges in the edit graph: horizontal (H), diagonal (D), and vertical (V), corresponding to insertion (I), match/mismatch (M), and deletion (D), respectively.
Every edge of the edit graph (i.e. every movement) has a weight corresponding to the penalty or premium for that action. The best path is the path with the maximum total weight (the longest path).
[Figure: edit graph for the sequences TGCATA and ATCTGAT, with source and sink marked and the deletions, mismatches, insertions, and matches of one alignment indicated.]
4
Computational Complexity of Dynamic Programming
Sequence alignment is limited by:
Time:
–Four operations are needed at each vertex.
–The required time is proportional to the number of edges in the edit graph (i.e. O(nm), where n and m are sequence lengths).
Space:
–The required memory is proportional to the number of vertices in the edit graph, O(nm).
5
Computational Complexity of Dynamic Programming
–To compute the score of the alignment, we only need to keep 2 columns in memory at any time. This works because the score of each box in the dynamic programming (DP) matrix depends only on three previously calculated boxes. Therefore only linear memory is required to compute the score.
–To compute the alignment itself (backtracking through the matrix), however, quadratic memory, O(n^2), is needed, since all the scores must be available to trace back the best path.
[Figure: only 2 columns are needed to determine the score of each box (forward calculation); all columns are needed for calculating the best alignment (backtracking).]
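The two-column idea can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `alignment_score` and the scoring values (match = 1, mismatch = -1, indel = -1) are assumptions chosen for the example.

```python
def alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Score of the best global alignment of v and w using two columns.

    The scoring parameters are illustrative assumptions, not values
    taken from the lecture.
    """
    # prev[i] = best score of aligning v[:i] with the prefix of w seen so far
    prev = [i * indel for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        # curr is the column for w[:j]; its first box is j insertions
        curr = [j * indel] + [0] * len(v)
        for i in range(1, len(v) + 1):
            diag = prev[i - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # each box depends only on three previously computed boxes
            curr[i] = max(diag, prev[i] + indel, curr[i - 1] + indel)
        prev = curr  # the older column is no longer needed
    return prev[len(v)]
```

Only `prev` and `curr` are alive at any moment, so memory grows with the length of `v` rather than with the product of the two lengths. The price is that the path itself cannot be recovered from these two columns alone.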
6
Space-Efficient Sequence Alignment
To reduce the space complexity of sequence alignment:
Find the middle vertex between the source and the sink. For every row i, compute s(i, m/2), the score of the longest path from (0,0) to (i, m/2), and s_reverse(i, m/2), the score of the longest path from (i, m/2) to (n,m). The middle vertex is the vertex (i, m/2) that maximizes s(i, m/2) + s_reverse(i, m/2); the longest path from the source to the sink passes through it.
Repeat this process recursively on the two subproblems: source to middle vertex, and middle vertex to sink.
[Figure: the edit graph from source (0,0) to sink (n,m), showing the middle vertex in column m/2 and the successively smaller rectangles searched at each level of the recursion.]
7
Space-Efficient Sequence Alignment
The computing time is proportional to the area of the rectangles. The total time to find the middle vertices is therefore:
area + area/2 + area/4 + … ≤ 2 × area
The space complexity is of order n, O(n).
Pseudocode for this algorithm is:
Path(source, sink)
  If source and sink are in consecutive columns
    output the longest path from the source to the sink
  Else
    middle ← middle vertex between source and sink
    Path(source, middle)
    Path(middle, sink)
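The recursion above can be sketched in Python as follows. This is an illustrative version under stated assumptions: the names `last_column` and `align` and the scoring constants are hypothetical, and string slicing is used for brevity instead of passing index ranges.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1  # illustrative scores (assumption)


def sub(a, b):
    """Weight of a diagonal edge: match or mismatch."""
    return MATCH if a == b else MISMATCH


def last_column(v, w):
    """col[i] = best score of aligning v[:i] with all of w (two columns)."""
    prev = [i * INDEL for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        curr = [j * INDEL] + [0] * len(v)
        for i in range(1, len(v) + 1):
            curr[i] = max(prev[i - 1] + sub(v[i - 1], w[j - 1]),
                          prev[i] + INDEL, curr[i - 1] + INDEL)
        prev = curr
    return prev


def align(v, w):
    """Return an optimal global alignment of v and w as two gapped strings."""
    if not w:
        return v, "-" * len(v)
    if not v:
        return "-" * len(w), w
    if len(w) == 1:
        # Source and sink are in consecutive columns: solve directly.
        best = max(range(len(v)), key=lambda i: sub(v[i], w))
        if sub(v[best], w) >= 2 * INDEL:
            return v, "-" * best + w + "-" * (len(v) - best - 1)
        return v + "-", "-" * len(v) + w
    mid = len(w) // 2
    # Forward scores into the middle column, reverse scores out of it.
    left = last_column(v, w[:mid])
    right = last_column(v[::-1], w[mid:][::-1])
    # Middle vertex: the row that maximizes prefix score + suffix score.
    i = max(range(len(v) + 1), key=lambda r: left[r] + right[len(v) - r])
    a1, b1 = align(v[:i], w[:mid])
    a2, b2 = align(v[i:], w[mid:])
    return a1 + a2, b1 + b2
```

For example, `align("TGCATA", "ATCTGAT")` returns two equal-length strings in which `-` marks a gap. Each recursive call keeps only two columns of scores, so the working memory stays linear in the sequence lengths.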
8
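String Matching: naïve approach
Let’s say we want to compare a sequence of length l = 10 against a database of length, for example, n = 10^9, and we want to find exact occurrences of the l-length sequence in the database. We can:
1. Slide the l-length sequence along the database one base at a time and compare it with the database at each position (this takes a long time).
[Figure: the sequence of length l = 10 sliding along the database of length n = 10^9, which essentially amounts to moving diagonally along the alignments with the database.]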
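A minimal sketch of the naïve scan, with the hypothetical name `naive_find`, might look like this:

```python
def naive_find(pattern, text):
    """Return every start position where pattern occurs exactly in text."""
    l = len(pattern)
    # Try each of the len(text) - l + 1 possible start positions in turn.
    return [i for i in range(len(text) - l + 1) if text[i:i + l] == pattern]
```

Each position costs up to l character comparisons, so the scan does on the order of l × n work for every query.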
9
String Matching: hashing
2. Create a hash table of all l-length strings that occur in the database, and search for your l-length string in the hash table.
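The hashing approach can be sketched with a Python dictionary standing in for the hash table. The names `build_index` and `lookup` are assumptions made for this example.

```python
from collections import defaultdict


def build_index(text, l):
    """Map every l-length substring of text to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - l + 1):
        index[text[i:i + l]].append(i)
    return index


def lookup(index, pattern):
    """Positions of pattern in the indexed text (empty list if absent)."""
    return index.get(pattern, [])
```

The index is built once in a single pass over the database. After that, each query is a single hash lookup instead of a full scan.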
10
Approximate String Matching
Now if instead of l = 10 we have l = 1000, we can apply the same method by dividing the l-length sequence into overlapping strings 10 bases long and combining the resulting matches.
String matching in this fashion may be done using filtration/verification algorithms, which are described next.
11
Filtration/Verification Method
Let’s say we want to find a string in a database with up to 2 mismatches or, in general, compare a query q_1 … q_p against a text t_1 … t_n allowing up to k mismatches.
The query matching problem is to find all pairs of m-substrings, one from the query and one from the text, that match with at most k mismatches. Filtration/verification algorithms are used to perform this task.
Filtration/verification algorithms involve a two-stage process:
First, a set of positions in the text that are potentially similar to the query is preselected.
Second, each potential position is verified: it is accepted if at most k mismatches are found and rejected if more than k mismatches are found.
[Figure: from a potential match, walk in both directions while the number of mismatches is at most k.]
12
Filtration/Verification Method
The filtration algorithm is done in 2 steps:
1. Potential match detection: Find all matches of l-tuples between the query and the text for l = ⌊m/(k+1)⌋. Any two m-substrings that match with at most k mismatches must share at least one such l-tuple, and l-tuple matches are sparse (they happen rarely by chance).
2. Potential match verification: Verify each potential match by extending it to the left and to the right until either (i) the first k+1 mismatches are found or (ii) the beginning or end of the query or the text is reached.
This is the idea behind BLAST and FASTA.
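The two steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the names `hamming` and `query_matches` are hypothetical. Verification here counts mismatches directly in every m-window containing the shared l-tuple, a simpler stand-in for the left/right extension described above.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def query_matches(query, text, m, k):
    """All (i, j) with query[i:i+m] and text[j:j+m] differing in <= k places."""
    l = m // (k + 1)
    if l == 0:
        raise ValueError("need m >= k + 1 so that l-tuples are non-empty")
    # Filtration: index every l-tuple of the text in a hash table.
    index = defaultdict(list)
    for t in range(len(text) - l + 1):
        index[text[t:t + l]].append(t)
    found = set()
    for p in range(len(query) - l + 1):
        for t in index.get(query[p:p + l], []):
            d = t - p  # diagonal shared by the two l-tuples
            # Verification: check each m-window on this diagonal that
            # contains the shared l-tuple and fits inside both strings.
            lo = max(0, p - (m - l), -d)
            hi = min(p, len(query) - m, len(text) - m - d)
            for i in range(lo, hi + 1):
                if (i, i + d) not in found and \
                        hamming(query[i:i + m], text[i + d:i + d + m]) <= k:
                    found.add((i, i + d))
    return sorted(found)
```

Because every pair of m-substrings with at most k mismatches shares an exact l-tuple, the filtration step never discards a true match. The verification step only has to examine the few diagonals that survive.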