Published by Prosper Stone. Modified over 9 years ago.
1
Sequence Local Alignment using Directed Acyclic Word Graph Do Huy Hoang
2
SEQUENCE ALIGNMENT
3
Sequence Similarity Alignment – Arrange DNA/protein sequences to show their similarity – The gap symbol denotes an insertion/deletion event
4
Other variations – Edit distance – Longest common substring – Affine gap scoring – Using a scoring matrix (BLOSUM, PAM)
5
Alignment score computation Needleman–Wunsch – Dynamic programming
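As a rough illustration of the dynamic programming behind Needleman–Wunsch, here is a minimal Python sketch that computes only the global alignment score. The function name and the default scores (match 1, mismatch -1, linear gap -1) are assumptions chosen for the example, not values taken from the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b with a linear gap cost (illustrative defaults)."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]  # Column 0: a[:i] aligned against nothing.
        for j, cb in enumerate(b, 1):
            s = match if ca == cb else mismatch
            cur.append(max(prev[j - 1] + s,   # align ca with cb
                           prev[j] + gap,     # ca against a gap
                           cur[j - 1] + gap)) # cb against a gap
        prev = cur
    return prev[-1]
```

Only two rows are kept at a time, so the sketch uses O(N) memory but cannot recover the alignment itself; a traceback would need the full table.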
6
Other variations
Name | Problem | Worst time | Average time | Memory
Four Russians | Edit distance 1,0 | M*N/log(N) | MN
Ukkonen | Global edit (linear cost) | ND | N+D^2 | D^2
Waterman | Local alignment | MN
Tree tree | Local alignment | M^2 N^2
BWTSW | Meaningful local alignment | MN^2 | MN^0.68
7
Local alignment – Find the best alignment between two substrings, one taken from each sequence
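A minimal sketch of the local alignment recurrence (Smith–Waterman style) may help here. It assumes a linear gap cost for brevity; the function name and default scores are illustrative choices, not part of the slides.

```python
def local_alignment_score(a, b, match=1, mismatch=-3, gap=-5):
    """Best score over all pairs of substrings of a and b (linear gap cost assumed)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            s = match if ca == cb else mismatch
            # Clamping at 0 lets an alignment restart anywhere in the table.
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best
```

The only differences from the global recurrence are the clamp at zero and taking the maximum over the whole table rather than the last cell.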
8
BWTSW
9
– Motivation: scoring requires 75% similarity; most entries of the local alignment table are zero; meaningful alignment – Suffix tree – Meaningful alignment – Meaningful alignment with gaps – How good is it?
10
Meaningful alignment (1) Sequence similarity sometimes implies functional similarity. Biologists are usually NOT interested in sequences with less than 70% similarity. BLAST score – Match = 1 – Mismatch = -3 – Open Gap = -5 – Extending Gap = -2
11
Meaningful alignment (2) BLAST score – Match = 1 – Mismatch = -3 – Open Gap = -5 – Extending Gap = -2 – More than 75% of the aligned positions must match to have a non-zero score
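As a quick check of the identity threshold implied by these scores, consider an ungapped alignment of length $L$ with $m$ matches and $L-m$ mismatches. Ignoring gaps is a simplifying assumption of this sketch.

```latex
\[
\mathrm{Score} = 1\cdot m - 3\,(L-m) = 4m - 3L > 0
\quad\Longleftrightarrow\quad
\frac{m}{L} > \frac{3}{4}.
\]
```

So an ungapped alignment needs more than 75% identity to score above zero, and any gap subtracts further from the score.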
12
Meaningful alignment (3) BLAST score – Match = 1 – Mismatch = -3 – Open Gap = -5 – Extending Gap = -2 How many non-zero entries are there in the local alignment DP table?
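One way to explore this question empirically is to fill the table for two random sequences and count the positive cells. The sketch below assumes that a gap of length k costs 5 + 2k (one common reading of the open/extend scores) and uses hypothetical helper names.

```python
import random

def random_dna(n, seed):
    """Uniform random DNA string; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))

def count_positive_cells(a, b, match=1, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Number of cells with score > 0 in an affine-gap local alignment table."""
    neg = float("-inf")
    m = len(b)
    prev_h = [0] * (m + 1)      # best score ending at each cell of the previous row
    prev_f = [neg] * (m + 1)    # best score ending with a vertical gap
    positive = 0
    for ca in a:
        h = [0] * (m + 1)
        f = [neg] * (m + 1)
        e = neg                 # best score ending with a horizontal gap
        for j in range(1, m + 1):
            e = max(e + gap_extend, h[j - 1] + gap_open + gap_extend)
            f[j] = max(prev_f[j] + gap_extend, prev_h[j] + gap_open + gap_extend)
            s = match if ca == b[j - 1] else mismatch
            h[j] = max(0, prev_h[j - 1] + s, e, f[j])
            if h[j] > 0:
                positive += 1
        prev_h, prev_f = h, f
    return positive
```

For example, `count_positive_cells(random_dna(300, 1), random_dna(300, 2)) / 300**2` gives the fraction of positive cells for one random pair. Every cell where the two characters match is positive, and a mismatch cell is positive only after a sufficiently long run of matches.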
13
How to improve? Idea: – Do not store zero-score entries – Use a suffix tree to prune early
14
BWTSW details – FM index for the suffix tree representation – Prune zero entries – Store the DP vector using a linked list
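The pruning idea can be sketched without an FM index. The toy version below walks the suffix trie of the text implicitly (each node is the list of text positions where its string occurs), keeps only positive DP entries in a dict, and abandons a branch once its row is empty. It is a simplified illustration under assumptions, not the BWTSW implementation: it uses a linear gap cost, a plain dict instead of a linked list, and a hypothetical function name.

```python
def trie_pruned_local_score(text, pattern, match=1, mismatch=-3, gap=-5):
    """Local alignment score via DFS over the implicit suffix trie of text."""
    n, m = len(text), len(pattern)
    best = 0
    # Root: the empty string may start before any pattern position with score 0.
    root_row = {j: 0 for j in range(m + 1)}
    stack = [(list(range(n)), root_row)]  # (positions of the next character, DP row)
    while stack:
        nexts, row = stack.pop()
        by_char = {}
        for e in nexts:
            if e < n:
                by_char.setdefault(text[e], []).append(e + 1)
        for c, child_nexts in by_char.items():
            new = {}  # sparse row: only entries with score > 0 are stored
            for j in range(1, m + 1):
                cands = []
                if j - 1 in row:
                    s = match if c == pattern[j - 1] else mismatch
                    cands.append(row[j - 1] + s)
                if j in row:
                    cands.append(row[j] + gap)
                if j - 1 in new:
                    cands.append(new[j - 1] + gap)
                if cands and max(cands) > 0:
                    new[j] = max(cands)
            if new:  # prune the whole subtree when no entry is positive
                best = max(best, max(new.values()))
                stack.append((child_nexts, new))
    return best
```

The inner loop still scans all pattern positions for clarity; a version that iterates only over the stored entries would be needed to benefit from the sparsity in running time.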
15
Analysis – Text length = N – Pattern length = M – Alphabet size = σ
16
Average running time (1) Let F(L) be the number of pairs of strings of length L for which Score(S1,S2) > 0 – F(L) = Sizeof{(S1,S2) : Len(S1)=Len(S2)=L, Score(S1,S2)>0} – F(L) counts the number of pairs with 75% identity. With $\sigma$ the alphabet size: $F(L) = \sum_{i=0}^{L/4} \binom{L}{i} (\sigma-1)^i$, $F(L) \le k_1 k_2^L$, $F(\log N) \le k_3 N^{0.68}$
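The sum for F(L) is easy to evaluate directly, which gives a feel for how fast it grows. The sketch below follows the formula as written (mismatch counts from 0 to L/4, rounded down); the helper `growth_exponent` and the default alphabet size 4 are assumptions added for illustration.

```python
from math import comb, log

def F(L, sigma=4):
    """Sum over i = 0..floor(L/4) of C(L, i) * (sigma - 1)**i."""
    return sum(comb(L, i) * (sigma - 1) ** i for i in range(L // 4 + 1))

def growth_exponent(L, sigma=4):
    """log_sigma(F(L)) / L, i.e. the x for which F(L) = sigma**(x * L)."""
    return log(F(L, sigma)) / (L * log(sigma))
```

Evaluating `growth_exponent(L)` for increasing L shows empirically how F(L) compares with the full count of $\sigma^L$ strings.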
17
Average running time (2) Given S1, $\Pr(\mathrm{Score}(S1,S2) > 0 \mid S1) = F(L)/\sigma^L$ For M < log(N) – The number of entries is – $O(M \cdot F(M)) < O(\log(N) \cdot F(\log(N)))$ For M > log(N) – $O(M \cdot N \cdot F(M)/\sigma^L)$ On average – Time = $O(M \cdot F(\log(N))) = M \cdot N^{0.68}$
18
DAWG
19
Possible improvement of BWTSW Worst case running time O(N^2 M) – When M=N – O(M N^0.68 + M^3) when M is a substring of N What about ST vs. ST?
20
What BWTSW uses is a suffix trie (not a suffix tree). – #Prove it# – A suffix trie has O(N^2) nodes – A DAWG is a similar structure with O(N) nodes
21
DAWG (1)
22
DAWG (2) DAWG: Directed Acyclic Word Graph. A DAWG is an acyclic automaton that recognizes all the substrings of a given string.
23
DAWG (3) Example: – DAWG of “abcbc”
24
DAWG (4) End-set view
25
Trivial DAWG construction Using End-set class
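A minimal sketch of the end-set idea: give every distinct substring its set of end positions, and merge substrings whose end-sets are equal into one state. The function names are hypothetical, and the construction is deliberately naive (it enumerates all substrings), so it is only suitable for short strings.

```python
def endset_dawg(w):
    """Build the DAWG of w by grouping substrings with identical end-sets."""
    n = len(w)
    end_set = {}
    for i in range(n + 1):
        for j in range(i, n + 1):
            # w[i:j] ends at position j; the empty string ends everywhere.
            end_set.setdefault(w[i:j], set()).add(j)
    state_of = {x: frozenset(e) for x, e in end_set.items()}
    edges = {}
    for x, s in state_of.items():
        for c in set(w):
            if x + c in state_of:
                edges[(s, c)] = state_of[x + c]
    return state_of[""], set(state_of.values()), edges

def accepts(root, edges, x):
    """True if x is a substring of the string the DAWG was built from."""
    state = root
    for c in x:
        if (state, c) not in edges:
            return False
        state = edges[(state, c)]
    return True
```

For w = abcbc this yields 8 states and 9 edges; for instance b has end-set {2, 4}, while c and bc share the end-set {3, 5} and therefore share a state.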
26
DAWG properties For |w|>2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states, and 3|w|-4 edges
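These bounds can be checked on small inputs with the standard online construction (often called the suffix automaton construction). The sketch below is a generic textbook version with hypothetical names, not code from the presentation.

```python
def build_dawg(w):
    """Online DAWG construction; returns (transitions, suffix links, lengths)."""
    nxt, link, length = [{}], [-1], [0]   # state 0 is the root
    last = 0
    for c in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings of q.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length

def dawg_size(w):
    """(number of states, number of edges) of the DAWG of w."""
    nxt, _, _ = build_dawg(w)
    return len(nxt), sum(len(t) for t in nxt)
```

For example, `dawg_size("abcbc")` returns `(8, 9)`, within the bounds 2|w|-1 = 9 and 3|w|-4 = 11.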
27
D(w) and ST(w^R) There is a map between the nodes of the DAWG and the nodes of the implicit ST(w^R) – Example: w=abcbc, w^R=cbcba Store the DAWG using the ST, which uses only o(N) bits
28
D(w) and ST(w^R) (2) List all incoming edges of node q in D(w) using ST(w^R)
29
Local Alignment using DAWG Basis Induction
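One plausible way to set up such a basis/induction scheme is sketched below; it is an assumption-laden illustration, not necessarily the formulation used in the presentation. Each DAWG state u gets a vector whose entry j is the best score of aligning any string reaching u with a substring of the pattern ending at j. The basis is the root (all zeros), and the induction step pushes scores along edges in topological order. A linear gap cost and all function names are assumptions.

```python
def build_dawg(w):
    """Standard online DAWG construction; returns (transitions, lengths)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for c in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, length

def dawg_local_alignment(text, pattern, match=1, mismatch=-3, gap=-5):
    """Local alignment score of text vs pattern via DP over the DAWG of text."""
    nxt, length = build_dawg(text)
    m = len(pattern)
    neg = float("-inf")
    rows = [[neg] * (m + 1) for _ in nxt]
    rows[0] = [0] * (m + 1)  # basis: the empty string scores 0 at every position
    best = 0
    # Every edge goes to a state with a larger length, so this order is topological.
    for u in sorted(range(len(nxt)), key=length.__getitem__):
        row = rows[u]
        if u != 0:
            for j in range(1, m + 1):  # gaps that skip pattern characters
                row[j] = max(row[j], row[j - 1] + gap)
            best = max(best, max(row))
        for c, v in nxt[u].items():    # induction: extend by one text character
            tgt = rows[v]
            tgt[0] = max(tgt[0], row[0] + gap)
            for j in range(1, m + 1):
                s = match if c == pattern[j - 1] else mismatch
                tgt[j] = max(tgt[j], row[j - 1] + s, row[j] + gap)
    return best
```

Because the DAWG has O(N) states and edges, this unpruned version does O(N*M) work, the same as the plain table; the potential gain comes from combining it with pruning of non-positive entries.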
30
Extensions – Meaningful alignment using DAWG – Prune the nodes whose score is less than zero – Shortest-path-style pruning – Cache log(N) nodes – The worst-case running time is M*N*log(N); the average case is the same for M << N.