Sequence Local Alignment using Directed Acyclic Word Graph Do Huy Hoang.


1 Sequence Local Alignment using Directed Acyclic Word Graph Do Huy Hoang

2 SEQUENCE ALIGNMENT

3 Sequence Similarity Alignment – Arrange DNA/protein sequences to show their similarity. A gap symbol (conventionally “-”) denotes an insertion/deletion event

4 Other variations Edit distance Longest common substring Affine gap scoring Using scoring matrix (BLOSUM, PAM)

5 Alignment score computation Needleman–Wunsch – Dynamic programming
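As an illustration of the dynamic programming named on this slide, here is a minimal sketch of Needleman–Wunsch global alignment scoring. The function name and the scoring defaults (match = 1, mismatch = -1, linear gap = -1) are assumptions chosen for the example, not values from the talk.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = cur[j - 1] + gap   # b[j-1] aligned to a gap
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
```

Only two rows of the table are kept, so memory is proportional to the shorter dimension of the loop while time stays proportional to the product of the two lengths.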

6 Other variations
Columns: Name; Problem; Worst time; Average time; Memory (entries listed in slide order; empty cells are omitted)
– Four Russian; Edit distance 1,0; M*N/log(N); MN
– Ukkonen; Global edit (linear cost); ND; N+D^2; D^2
– Waterman; Local alignment; MN
– Tree tree; Local alignment; M^2 N^2
– BWTSW; Meaningful local alignment; MN^2; MN^0.68

7 Local alignment – Find the best alignment between two substrings, one taken from each sequence
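The local alignment problem can be sketched with the classic Smith–Waterman recurrence, which clamps every table entry at zero so that an alignment may start anywhere. The function name and the default scores below are illustrative assumptions, and a linear gap penalty is used for brevity.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any substring of a and any substring of b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets the alignment restart here instead of carrying a negative score.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


print(smith_waterman_score("TTACGTGG", "CCACGTAA"))
```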

8 BWTSW

9 – Motivation: scoring and 75% similarity; most entries of the local alignment table are zero; meaningful alignment – Suffix tree – Meaningful alignment – Meaningful alignment with gap – How good is it?

10 Meaningful alignment (1) Sequence similarity sometimes implies functional similarity. Biologists are usually NOT interested in sequences with less than 70% similarity. BLAST score – Match = 1 – Mismatch = -3 – Open Gap = -5 – Extending gap = -2

11 Meaningful alignment (2) BLAST score – Match = 1 – Mismatch = -3 – Open Gap = -5 – Extending Gap = -2 – ⇒ Without gaps, x matches and y mismatches score x - 3y, so more than 75% of the aligned positions must match to have a nonzero score

12 Meaningful alignment (3) BLAST score – Match = 1 – Mismatch = -3 – Open Gap = -5 – Extending Gap = -2 How many nonzero entries are there in the local alignment DP table?
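One way to explore the question on this slide empirically is to fill the local alignment table with the BLAST scores listed above and count the positive entries. The sketch below uses the standard affine-gap (Gotoh-style) recurrence; the function name is hypothetical, and it assumes that a gap of length k costs open + k × extend, which the slide does not state explicitly.

```python
NEG_INF = float("-inf")


def count_positive_entries(text, pattern, match=1, mismatch=-3,
                           gap_open=-5, gap_extend=-2):
    """Return (number of positive entries, total entries) of the local DP table."""
    n, m = len(text), len(pattern)
    h_prev = [0] * (m + 1)        # best local score ending at (i-1, j)
    f_prev = [NEG_INF] * (m + 1)  # best score ending in a gap that consumes text
    positive = 0
    for i in range(1, n + 1):
        h_cur = [0] * (m + 1)
        f_cur = [NEG_INF] * (m + 1)
        e = NEG_INF               # best score ending in a gap that consumes pattern
        for j in range(1, m + 1):
            e = max(e + gap_extend, h_cur[j - 1] + gap_open + gap_extend)
            f_cur[j] = max(f_prev[j] + gap_extend,
                           h_prev[j] + gap_open + gap_extend)
            sub = match if text[i - 1] == pattern[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            if h_cur[j] > 0:
                positive += 1
        h_prev, f_prev = h_cur, f_cur
    return positive, n * m


print(count_positive_entries("ACGT", "ACGT"))
```

Running it on longer random DNA strings gives a feel for how sparse the table is under these scores.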

13 How to improve? Idea: – Do not store zero-score entries – Use a suffix tree to prune early

14 BWTSW details FM index for suffix tree representation Prune zero entries Store DP vector using linked list
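The pruning idea can be sketched without an FM index. The code below is not BWT-SW itself: it walks a naive suffix trie of the text (each node is represented by the list of text positions that follow an occurrence of its substring), keeps the DP row as a dictionary holding only positive scores in place of the linked list, and abandons a branch as soon as its row is empty. The function name, the linear gap penalty, and the default scores are assumptions made for illustration; both penalties must be negative for the pruning to be valid.

```python
def meaningful_local_score(text, pattern, match=1, mismatch=-3, gap=-5):
    """Best local alignment score found by a pruned walk over the text's suffix trie."""
    n, m = len(text), len(pattern)
    best = 0
    # Root row: an alignment may start before any pattern position with score 0.
    stack = [(list(range(n)), {j: 0 for j in range(m + 1)})]
    while stack:
        next_positions, row = stack.pop()
        children = {}
        for i in next_positions:            # group occurrences by their next character
            children.setdefault(text[i], []).append(i + 1)
        for c, positions in children.items():
            new_row = {}
            for j in range(1, m + 1):
                score = 0
                if j - 1 in row:            # c aligned with pattern[j-1]
                    sub = match if pattern[j - 1] == c else mismatch
                    score = max(score, row[j - 1] + sub)
                if j in row:                # c aligned with a gap
                    score = max(score, row[j] + gap)
                if j - 1 in new_row:        # pattern[j-1] aligned with a gap
                    score = max(score, new_row[j - 1] + gap)
                if score > 0:               # non-positive entries are never stored
                    new_row[j] = score
            if new_row:                     # an empty row prunes the whole subtree
                best = max(best, max(new_row.values()))
                stack.append(([i for i in positions if i < n], new_row))
    return best


print(meaningful_local_score("ACGTACGTAC", "ACGTTCGTAC"))
```

For simplicity the inner loop still scans every pattern position; a version that iterates only over the stored entries would match the sparse storage more closely.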

15 Analysis Text length = N Pattern length = M Alphabet size = σ

16 Average running time (1) Let F(L) be the number of pairs of strings of length L for which Score(S1,S2) > 0 – F(L) = Sizeof{(S1,S2) : Len(S1)=Len(S2)=L, Score(S1,S2)>0} – F(L) counts the number of pairs with 75% identity. F(L) = sum(i=0..L/4, Binomial(L,i) * (σ-1)^i), where σ is the alphabet size. F(L) ≤ k1 * k2^L. F(log(N)) ≤ k3 * N^0.68
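The sum for F(L) is easy to evaluate directly. Read term by term, it counts the strings S2 that differ from one fixed S1 in at most L/4 positions, which is why the next slide divides it by the number of strings of length L. The sketch below assumes L/4 is rounded down and uses a DNA alphabet of size 4; both function names are hypothetical, and the brute-force helper exists only to cross-check the formula on a tiny case.

```python
from itertools import product
from math import comb


def F(L, sigma=4):
    """sum(i=0..L/4, Binomial(L,i) * (sigma-1)^i), with L/4 rounded down (assumption)."""
    return sum(comb(L, i) * (sigma - 1) ** i for i in range(L // 4 + 1))


def brute_force_count(s1, alphabet="ACGT"):
    """Count strings of len(s1) that mismatch s1 in at most len(s1) // 4 positions."""
    limit = len(s1) // 4
    return sum(
        1
        for s2 in product(alphabet, repeat=len(s1))
        if sum(x != y for x, y in zip(s1, s2)) <= limit
    )


for L in (4, 8, 16):
    print(L, F(L), F(L) / 4 ** L)  # the fraction shrinks as L grows
```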

17 Average running time (2) Given S1, Pr(Score(S1,S2) > 0 | S1) = F(L)/σ^L, where σ is the alphabet size. For M < log(N) – The number of entries is O(M * F(M)) < O(log(N)*F(log(N))) For M > log(N) – O(M * N * F(M) / σ^L) On average – Time = O(M*F(log(N))) = O(M * N^0.68)

18 DAWG

19 Possible improvement of BWTSW Worst case running time O(N^2 M) – When M=N – O(M N^0.68 + M^3) when the pattern is a substring of the text What about ST vs. ST?

20 What we used in BWTSW is Suffix Trie (not suffix tree). – #Prove it# Suffix trie has O(N^2) nodes DAWG is a similar structure with O(N) nodes

21 DAWG (1)

22 DAWG (2) DAWG: Directed Acyclic Word Graph. A DAWG is an acyclic automaton that recognizes all the substrings of the given string.

23 DAWG (3) Example: – DAWG of “abcbc”

24 DAWG (4) End-set view

25 Trivial DAWG construction Using End-set class
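The end-set view suggests a direct, if slow, construction: compute for every substring the set of positions where its occurrences end, and merge substrings that share the same set into one state. The sketch below follows that idea literally; the function name is hypothetical, and its cost grows at least quadratically with the string length, so it is only meant for small examples such as "abcbc".

```python
from collections import defaultdict


def build_dawg_by_end_sets(w):
    """Return (states, edges); each state is the frozenset of end positions of a class."""
    n = len(w)
    end_sets = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n + 1):
            end_sets[w[i:j]].add(j)          # substring w[i:j] ends at position j
    end_sets[""] = set(range(n + 1))         # the empty string ends everywhere
    state_of = {s: frozenset(e) for s, e in end_sets.items()}
    states = set(state_of.values())
    edges = set()
    for s, target in state_of.items():
        if s:                                # edge from the class of s[:-1] on the last char
            edges.add((state_of[s[:-1]], s[-1], target))
    return states, edges


states, edges = build_dawg_by_end_sets("abcbc")
print(len(states), len(edges))
```

For "abcbc" the substrings "b" (end-set {2, 4}) and "ab" (end-set {2}) land in different states, while "bc" and "c" share the end-set {3, 5} and are merged.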

26 DAWG properties For |w|>2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states, and 3|w|-4 edges
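These bounds can be checked on small inputs with the standard online suffix-automaton construction, which builds the same states in time linear in |w| for a fixed alphabet. The sketch below is the textbook algorithm with suffix links and state cloning, not code from the talk, and all function names are hypothetical.

```python
def build_suffix_automaton(w):
    """Return (transitions, suffix_links, lengths) of the automaton for w."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings of its class.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length


def size(w):
    """Number of states and edges of the automaton for w."""
    nxt, _, _ = build_suffix_automaton(w)
    return len(nxt), sum(len(t) for t in nxt)


def recognizes(w, s):
    """True if s can be read from the initial state, i.e. s is a substring of w."""
    nxt, _, _ = build_suffix_automaton(w)
    state = 0
    for ch in s:
        if ch not in nxt[state]:
            return False
        state = nxt[state][ch]
    return True


for w in ("abcbc", "abb", "abbc", "aaaa"):
    states, edges = size(w)
    print(w, states, 2 * len(w) - 1, edges, 3 * len(w) - 4)
```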

27 D(w) and ST(w^R) There is a map between nodes in DAWG and implicit ST(w^R) – Example: w=abcbc, w^R=cbcba Store DAWG using ST, which uses only o(N) bits

28 D(w) and ST(w^R) (2) List all incoming edges of node q in D(w) using ST(w^R)

29 Local Alignment using DAWG Basis Induction

30 Extensions Meaningful alignment using DAWG – Prune the nodes whose Score is less than zero Shortest path pruning style Cache log(N) nodes ⇒ the worst case running time is M*N*log(N); the average case is the same for M << N.

