Www.strandls.com Read Alignment Algorithms. www.strandls.com The Problem 2 Given a very long reference sequence of length n and given several short strings.


1 www.strandls.com Read Alignment Algorithms

2 The Problem
• Given a very long reference sequence of length n and several short strings (reads) of length m each, with m << n, find the best matching location for each read in the reference.
• The best location is the one that minimizes the number of mismatches, provided the number of mismatches is at most, say, 5% of m.
• We ignore insertions and deletions for the moment; those will come later.
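As a baseline, the problem statement can be sketched by brute force: slide the read along the reference and count mismatches at every offset. This only illustrates the definition (it costs time proportional to n*m per read, which is what the index is meant to avoid). The names `best_location` and `max_fraction` are assumptions made for this sketch.

```python
def best_location(ref, read, max_fraction=0.05):
    """Return (start, mismatches) of the best placement of read in ref,
    or None if even the best placement exceeds max_fraction * len(read)."""
    m = len(read)
    best = None
    for start in range(len(ref) - m + 1):
        # Count positions where the read and the reference window differ.
        mismatches = sum(a != b for a, b in zip(read, ref[start:start + m]))
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    if best is not None and best[1] <= max_fraction * m:
        return best
    return None
```

For example, `best_location("ACGTACGTTAGC", "ACGTTAGC")` places the read at offset 4 with no mismatches.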

3 Indexing the Reference
• What if we do not allow any mismatches at all?
• Pre-process the reference sequence so that each query (find the best matching location of a read) can be answered in time proportional to m and independent of n.
• The resulting data structure is called an index.
• Suffix trees are one possible index: a (compacted) trie of all suffixes of the reference sequence, with a $ marker at the end.
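A minimal sketch of the idea, assuming a plain (uncompacted) suffix trie built from nested dictionaries. This is not a real suffix tree: it uses space quadratic in n and is only meant to show that walking down the index costs one step per read character. The function names and the `"start"` key are hypothetical.

```python
def build_suffix_trie(ref):
    """Insert every suffix of ref + '$' into a trie of nested dicts."""
    text = ref + "$"
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
        node["start"] = i  # leaf marker: where this suffix begins in ref
    return root

def occurrences(root, read):
    """Walk the read down the trie (m steps), then collect leaf positions."""
    node = root
    for c in read:
        node = node.get(c)
        if node is None:
            return []  # the read does not occur exactly
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "start":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)
```

With the reference CGACG, `occurrences(build_suffix_trie("CGACG"), "CG")` reports positions 0 and 3.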

4 Suffix Trees
[Figure: suffix tree built over the reference (labelled CGACG on the slide), with a query traced down from the root.]

5 Space Required by Suffix Trees
• At most n-1 internal nodes plus n leaves, so at most 2n-1 nodes.
• 2n-2 tree pointers + n pointers into the reference, so ~3n pointers.
• 36GB! Can we make this smaller?

6 Indexing the Reference with Mismatches
• What if we allow mismatches?
• We put the query through the suffix tree but get stuck: we can't proceed further.
• Next, resume by dropping the first character of the query, but without redoing the work already done.
• How?

7 Suffix Links in Suffix Trees
[Figure: the same suffix tree over the reference (labelled CGACG on the slide), now with suffix links, and a query traced through it.]

8 Indexing with Mismatches (Contd)
• Consider an internal node A, where x is the string leading down from the root to A, branching into xa and xb.
• Let x = cy, where c is the first character of x.
• Then there exists a node B with string y leading down from the root to it. Such a node exists because ya and yb both occur in the reference, so y branches as well.
• The suffix link from A leads to this node B.
• So if you get stuck, you follow the suffix link in constant time and continue from where you left off, to find the longest perfect-match substring starting at each position in the read.
• Alternatively, find all substrings of a certain minimum length that match.
• Check explicitly for the number of mismatches at each of these locations.
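The last two steps (find exact matches of a minimum length, then check mismatches explicitly) can be sketched as a seed-and-verify loop. This sketch swaps the suffix tree and its suffix links for Python's built-in `str.find`, so it shows what is computed rather than how the tree computes it efficiently. The names `seed_and_verify`, `seed_len` and `max_mismatches` are assumptions.

```python
def seed_and_verify(ref, read, seed_len, max_mismatches):
    """Return (start, mismatches) for the best verified placement, or None."""
    best = None
    for offset in range(len(read) - seed_len + 1):
        seed = read[offset:offset + seed_len]
        hit = ref.find(seed)  # stand-in for the suffix tree lookup
        while hit != -1:
            start = hit - offset  # where the whole read would begin
            if 0 <= start <= len(ref) - len(read):
                window = ref[start:start + len(read)]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_mismatches and (
                    best is None or mismatches < best[1]
                ):
                    best = (start, mismatches)
            hit = ref.find(seed, hit + 1)
    return best
```

A read with a single mismatch is still placed as long as at least one seed of length `seed_len` matches exactly.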

9 Space Required by Suffix Trees & Links
• At most n-1 internal nodes plus n leaves, so 2n-1 nodes, plus n-1 suffix links (one per internal node).
• 2n-2 tree pointers + n-1 suffix links = 3n-3 pointers, + n pointers into the reference.
• So ~4n pointers.
• 48GB! Can we make this smaller? Can we fit this tree into an array?
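The slides do not state which n or pointer size they have in mind. As a worked check, the 36GB and 48GB figures are consistent with a reference of about 3 billion bases and 4-byte pointers; both values are assumptions of this sketch, as is the function name.

```python
def index_size_gb(n, pointers_per_base, bytes_per_pointer=4):
    """Approximate index size in GB (10**9 bytes)."""
    return n * pointers_per_base * bytes_per_pointer / 1e9

n = 3_000_000_000  # assumed reference length (roughly a human genome)
print(index_size_gb(n, 3))  # suffix tree alone: ~3n pointers
print(index_size_gb(n, 4))  # suffix tree plus suffix links: ~4n pointers
```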

10 A Succinct Data Structure
• The Reference: CGAC$
• All circular shifts, sorted lexicographically (the Burrows-Wheeler Transform; $ sorts last in this table):
    AC$CG
    CGAC$
    C$CGA
    GAC$C
    $CGAC
• Store only the first and last columns and the links back to the reference.
• Used in bzip.

11 A Succinct Data Structure
• The Reference: CGAC$
• Sorted circular shifts, each with its starting index in the reference:
    AC$CG  2
    CGAC$  0
    C$CGA  3
    GAC$C  1
    $CGAC  4
• The reference can be reconstructed from the first and last columns.
• Claim: the ith G in the first column corresponds to the ith G in the last column! Likewise for A, C and T.
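A small sketch of the table and of the reconstruction claim, assuming (as the slide's table suggests) that `$` sorts after A, C, G and T. It sorts the circular shifts naively, which is fine for a toy reference but not for a genome. All function names are hypothetical.

```python
# Assumption: '$' sorts after A, C, G, T, matching the table on the slide.
ORDER = {c: i for i, c in enumerate("ACGT$")}

def bwt_columns(ref):
    """Return (rows, first, last) for a reference ending in a unique '$'.
    rows[r] is the reference index at which sorted shift r starts."""
    n = len(ref)
    rows = sorted(range(n), key=lambda i: [ORDER[c] for c in ref[i:] + ref[:i]])
    first = "".join(ref[i] for i in rows)
    last = "".join(ref[i - 1] for i in rows)  # ref[-1] is '$' when i == 0
    return rows, first, last

def last_to_first(first, last, row):
    """Map the character last[row] to the row where it appears in the first
    column, using the claim: the i-th c in last is the i-th c in first."""
    c = last[row]
    rank = last[:row].count(c)    # how many c's precede it in the last column
    return first.index(c) + rank  # first column is sorted, so c's are contiguous

def reconstruct(first, last):
    """Rebuild the reference by walking backwards from the '$' row."""
    row = first.index("$")
    out = []
    while last[row] != "$":
        out.append(last[row])
        row = last_to_first(first, last, row)
    return "".join(reversed(out)) + "$"
```

For the reference CGAC$, `bwt_columns` yields the first column ACCG$ and the last column G$ACC, and `reconstruct` recovers CGAC$ from those two columns alone.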

12 Proof of Claim
• yG < xG if and only if Gy < Gx; that's it!
• So given a G in the first column, say corresponding to the string Gx:
  – Its rank r is trivial to find because the first column is sorted; just store counts for all 4 characters
  – We need to locate the corresponding G in the last column
  – In other words, the index of the string xG in the table
  – Which is the rth G in the last column [The Select Query]
• So given a G in the last column, say corresponding to the string xG:
  – Find its rank r among G's in the last column [The Rank Query]
  – We need to locate the corresponding G in the first column
  – In other words, the index of the string Gx in the table
  – Which is the rth G in the first column, trivial to find

13 Select and Rank Queries
• Given a binary array:
  – SELECT: given index i, find the ith 1
  – RANK: given index i, find how many 1s precede this location
• Use a separate array for each of the 4 characters.
• RANK is easy: just keep counts at milestones every Δ positions and answer queries by traversing to the nearest milestone in time Δ
  – 4n/Δ bytes of storage, O(Δ) time
• SELECT needs a bit more: keep milestones at every Δth 1 (Δ-rank milestones)
  – Go to the nearest rank milestone and traverse from there
  – May need to traverse quite a bit though
  – So need an extra data structure to get to the next 1, which you store at Δ milestones
  – So 8n/Δ bytes of storage, O(Δ) time
• Of course we need the 4 n-bit binary arrays as well.
• So 4n bits + 48n/Δ bytes and O(Δ) time.
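The milestone idea can be sketched for a single binary array as follows. This is a simplified illustration: the SELECT here scans forward from the nearest rank milestone and omits the extra "next 1" structure mentioned on the slide, so its scan is not bounded by Δ. The class and attribute names are assumptions.

```python
class RankSelect:
    def __init__(self, bits, delta):
        self.bits = bits
        self.delta = delta
        self.counts = [0]  # counts[k] = number of 1s in bits[0 : k*delta]
        self.marks = []    # marks[k-1] = position of the (k*delta)-th 1
        ones = 0
        for pos, bit in enumerate(bits):
            if bit:
                ones += 1
                if ones % delta == 0:
                    self.marks.append(pos)
            if (pos + 1) % delta == 0:
                self.counts.append(ones)

    def rank(self, i):
        """Number of 1s in bits[0:i], i.e. strictly before index i."""
        k = i // self.delta
        return self.counts[k] + sum(self.bits[k * self.delta:i])

    def select(self, j):
        """Position of the j-th 1 (j >= 1 and at most the number of 1s)."""
        k = j // self.delta
        pos, seen = (self.marks[k - 1], k * self.delta) if k else (-1, 0)
        while seen < j:
            pos += 1
            seen += self.bits[pos]
        return pos
```

`rank` touches at most Δ-1 bits beyond the stored count, which is the O(Δ) time quoted on the slide.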

14 String Matching using Rank-Selects
• Given a string Gx.
• Assume inductively we have the band B of indices in the table corresponding to suffixes that begin with x.
• We want the band B' of suffixes that begin with Gx.
• Take the band B, look at its last column, identify the ranks of the first and last G in it, and find their corresponding first-column indices; that's the band
  – All doable using RANK alone
• At the end you have the band containing all suffixes which begin with Gx.
• Unless, of course, there are none, in which case the band will vanish at some point.
• We can use this to find matches for, say, all length-16 substrings of a read.
• So 4n bits + 48n/Δ bytes and O(mΔ) time per read.
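The inductive band update can be sketched as below, processing the query from its last character to its first. The rank query is done with `str.count` on a prefix of the last column, a linear-time stand-in for the milestone-based RANK. The table is built by naive sorting with `$` last, and every name here is an assumption of the sketch.

```python
ORDER = {c: i for i, c in enumerate("ACGT$")}  # assumption: '$' sorts last

def build(ref):
    """Return (rows, last, starts) for a reference ending in a unique '$'."""
    n = len(ref)
    rows = sorted(range(n), key=lambda i: [ORDER[c] for c in ref[i:] + ref[:i]])
    last = "".join(ref[i - 1] for i in rows)
    starts, total = {}, 0
    for c in "ACGT$":
        starts[c] = total      # first row whose first character is c
        total += ref.count(c)
    return rows, last, starts

def band(query, last, starts):
    """Return the half-open row range [lo, hi) of shifts that begin with
    query, or None if the band vanishes."""
    lo, hi = 0, len(last)
    for c in reversed(query):
        lo = starts[c] + last[:lo].count(c)  # RANK of c before row lo
        hi = starts[c] + last[:hi].count(c)  # RANK of c before row hi
        if lo >= hi:
            return None
    return lo, hi
```

For the reference CGAC$, the query C yields the band of rows 1 and 2, and `rows[1:3]` maps them back to reference positions 0 and 3.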

15 Identifying Indices in the Reference
• We still have to go from a band in the table to indices in the reference.
• 4n bytes if we store them explicitly.
• We can use the same trick: store them explicitly only at Δ milestones.
• Then, if the row with string Gx has reference index i, the row with string xG has index i+1, and so on till we get to a milestone.
• 4n/Δ bytes of storage.
• Time per index is O(Δ).
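A sketch of the milestone walk, assuming the milestones are the rows whose reference index is a multiple of Δ. Each step moves from the row for Gx to the row for xG, which advances the reference index by one; the step uses a linear scan as a stand-in for SELECT. Function names and the sampling rule are assumptions.

```python
ORDER = {c: i for i, c in enumerate("ACGT$")}  # assumption: '$' sorts last

def build_sampled_index(ref, delta):
    """Return (first, last, samples); samples keeps only milestone rows."""
    n = len(ref)
    rows = sorted(range(n), key=lambda i: [ORDER[c] for c in ref[i:] + ref[:i]])
    first = "".join(ref[i] for i in rows)
    last = "".join(ref[i - 1] for i in rows)
    samples = {r: i for r, i in enumerate(rows) if i % delta == 0}
    return first, last, samples

def first_to_last(first, last, row):
    """From the row for string Gx, find the row for string xG."""
    c = first[row]
    rank = row - first.index(c)  # rank of this c within the sorted first column
    seen = -1
    for r, ch in enumerate(last):  # stand-in for SELECT on the last column
        if ch == c:
            seen += 1
            if seen == rank:
                return r

def locate(first, last, samples, row):
    """Reference index of the shift in the given row."""
    steps = 0
    while row not in samples:
        row = first_to_last(first, last, row)
        steps += 1
    # Each step advanced the index by one (cyclically), so undo the steps.
    return (samples[row] - steps) % len(first)
```

Because index 0 is always a milestone, the walk ends after fewer than Δ steps plus any wrap-around at the end of the reference.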

16 Sorting Circular Shifts
• It remains to describe the construction of the table in the first place.
• Given a string S = x0 x1 x2 ... $:
  – Consider the string S' = (x0 x1 x2) (x1 x2 x3) (x3 x4 x5) (x4 x5 x6) ...
  – Note (x2 x3 x4) and other triplets starting at 2 mod 3 are missing
  – Rename S' so identical triplets get the same number and distinct triplets get different numbers
  – Recursively sort S'
• How does x0 x1 x2 ... compare to x1 x2 x3 ...?
  – Already available from recursion
• How does x0 x1 x2 ... compare to x2 x3 x4 ...?
  – Compare x0 with x2, and then x1 x2 ... with x3 x4 ...
  – We have info for comparing all pairs of suffixes!
  – Sort the 2 mod 3 suffixes and then merge them in
• Time T(n) = T(2n/3) + O(n), since S' has length 2n/3; this solves to O(n).
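The comparison rule can be illustrated as follows. This is not the linear-time algorithm: the ranks of the 0 mod 3 and 1 mod 3 suffixes are obtained by naive sorting (standing in for the recursion on S'), and the final order uses a comparison sort instead of the merge. It only shows that those ranks, plus at most two character comparisons, decide the order of any pair of suffixes. All names are assumptions, and `$` is assumed to sort last and to occur only at the end.

```python
from functools import cmp_to_key

ORDER = {c: i for i, c in enumerate("ACGT$")}  # assumption: '$' sorts last

def key(s):
    return [ORDER[c] for c in s]

def sample_ranks(s):
    """Ranks of suffixes starting at 0 or 1 mod 3 (naive stand-in for
    the recursive sort of S')."""
    sample = [i for i in range(len(s)) if i % 3 != 2]
    sample.sort(key=lambda i: key(s[i:]))
    return {i: r for r, i in enumerate(sample)}

def compare(s, rank, i, j):
    """Compare suffixes i and j using characters until both are sampled."""
    if i == j:
        return 0
    while not (i in rank and j in rank):
        if s[i] != s[j]:
            return -1 if ORDER[s[i]] < ORDER[s[j]] else 1
        i, j = i + 1, j + 1  # at most two shifts are ever needed
    return (rank[i] > rank[j]) - (rank[i] < rank[j])

def suffix_order(s):
    """Sorted order of all suffixes of s, where s ends in a unique '$'."""
    rank = sample_ranks(s)
    return sorted(range(len(s)), key=cmp_to_key(lambda i, j: compare(s, rank, i, j)))
```

For CGAC$ this gives the same row order as sorting the suffixes directly.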

17 A Generalization: Difference Covers
• Take a set D of indices mod v.
• The string built from the positions in D has size |D|n/v.
• Time taken to create this string is O(n |D|).
• Sorting suffixes of this string gives the sorted order of all suffixes which begin at indices j such that j mod v is in D.

18 A Generalization: Difference Covers
• D is a Difference Cover if the distances between beads in D generate 0, 1, ..., v-1.
• Then, for any 2 indices i and j, (i-j) mod v is the distance between some two beads in D.
• So there is a shift x < v that moves both i and j onto beads of D.
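A minimal check of the definition, with D given as a Python set of residues mod v. The function names are hypothetical. The set {0, 1} mod 3 is the sample used by the triplet construction in the previous slide, and {1, 2, 4} mod 7 is a second small example.

```python
def is_difference_cover(D, v):
    """True if every residue 0..v-1 is a difference of two elements of D."""
    diffs = {(a - b) % v for a in D for b in D}
    return diffs == set(range(v))

def common_shift(i, j, D, v):
    """Smallest x < v such that i+x and j+x both land in D (mod v)."""
    for x in range(v):
        if (i + x) % v in D and (j + x) % v in D:
            return x
    return None  # cannot happen when D is a difference cover
```

Once both shifted indices lie in D, their relative order is already known from sorting the sampled suffixes, so only the first x characters need to be compared directly.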

19 A Generalization: Difference Covers
• There exists a Difference Cover of size 1.5 * sqrt(v)!

20 Thank you

