
Burrows Wheeler Transform and Next-generation sequencing - Mapping short reads.


1 Burrows Wheeler Transform and Next-generation sequencing - Mapping short reads

2 Copyright notice Many of the images in this PowerPoint presentation are from other people. The copyright belongs to the original authors. Thanks!

3 Transform What is the definition of "transform"? –To change the nature, function, or condition of; to convert. –To change markedly the appearance or form of. Lossless and reversible –By the way, transforming is easy; a kid can do it. –Putting things back is the problem. –A three-year-old can transform or disassemble pretty much anything, but reassembly is another matter. For the BWT, there exist efficient reverse algorithms that retrieve the original text from the transformed text.

4 What is BWT? The Burrows-Wheeler transform (BWT) is a block-sorting, lossless, and reversible data transform. The BWT permutes a text into a new sequence that is usually more "compressible". It was introduced in 1994 by Michael Burrows and David Wheeler. The transformed text can be better compressed with fast locally-adaptive algorithms, such as run-length encoding (or move-to-front coding) in combination with Huffman coding (or arithmetic coding).

5 Why BWT? Run-length encoding (RLE) –Replaces a long run of a repeated character with the character and a count of the repetition, squeezing the run down to a flag, a character, and a number: AAAAAAA becomes *A7, where * is the flag. –Ideally, the longer the runs of identical characters, the better. –In reality, however, the input data does not necessarily favor the expectations of the RLE method.
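As a quick illustration (not from the slides), a minimal RLE encoder using the *-flag convention above might look like the following sketch; the run-length threshold of 4 is an arbitrary assumption, chosen so that short runs are not made longer by encoding:

```python
def rle_encode(s, flag='*'):
    """Run-length encode s: runs of length >= 4 become flag + char + count."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                      # extend the current run
        run = j - i
        if run >= 4:                    # only squeeze runs long enough to pay off
            out.append(f"{flag}{s[i]}{run}")
        else:
            out.append(s[i] * run)      # short runs are copied verbatim
        i = j
    return ''.join(out)

print(rle_encode("AAAAAAA"))  # *A7
print(rle_encode("AABBB"))    # AABBB (runs too short to squeeze)
```

This makes the slide's point concrete: RLE only wins when the input actually contains long runs, which is exactly what the BWT tends to create.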

6 Bridging reality and ideality BWT transforms a text into a sequence that is easier to compress –closer to the ideal input that RLE expects. Compressing the transformed text instead of the original improves compression performance.

7 Preliminaries Alphabet Σ –{a,b,c,$} We assume –an order on the alphabet: a < b < c < $ –one character, denoted $, is reserved as the sentinel. A text T is a sequence of characters drawn from the alphabet. Without loss of generality, a text T of length N is denoted x_1 x_2 x_3 ... x_N, where every character x_i is in the alphabet Σ. The last character of the text, x_N, is the sentinel, which is the lexicographically greatest character in the alphabet and occurs exactly once in the text. Appending a sentinel to the original text is not a must, but it simplifies the presentation and makes every rotation of the text distinct. –Example: abcababac$

8 How to transform? Three steps –Form an N×N matrix whose rows are the cyclic (left) rotations of the given text. –Sort the rows of the matrix in lexicographic order. –Extract the last column of the matrix.
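The three steps above can be sketched directly in Python. This is a naive, quadratic-space illustration (not the author's code); the sort key treats the sentinel as the lexicographically greatest character, matching the convention of slide 7:

```python
def bwt(t, sentinel='$'):
    """Naive BWT: form all N cyclic rotations, sort them (sentinel greatest),
    and return the last column plus the primary index."""
    # Key that ranks the sentinel after every other character
    key = lambda row: [(c == sentinel, c) for c in row]
    rotations = sorted((t[i:] + t[:i] for i in range(len(t))), key=key)
    last = ''.join(row[-1] for row in rotations)   # step 3: last column
    return last, rotations.index(t)                # row holding the original text

print(bwt("mississippi$"))  # ('ssmp$pissiii', 4)
```

Running it on the slides' running example reproduces the transformed text and the primary index 4 obtained on slides 11–15.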

9 An example: how the BWT transforms mississippi. T = mississippi$

10 Step 1: form the matrix The N×N matrix OM is constructed from the texts obtained by rotating the text T. –The matrix OM has T as its first row, i.e. OM[1, 1:N] = T. –Each remaining row of OM is constructed by a successive cyclic left-shift: row T_i is obtained by cyclically shifting the previous row T_{i−1} one position to the left. The resulting matrix OM is shown on the next slide.

11 Step 1: form the matrix
First treat the input string as a cyclic string and construct the N×N matrix of its rotations:

m i s s i s s i p p i $
i s s i s s i p p i $ m
s s i s s i p p i $ m i
s i s s i p p i $ m i s
i s s i p p i $ m i s s
s s i p p i $ m i s s i
s i p p i $ m i s s i s
i p p i $ m i s s i s s
p p i $ m i s s i s s i
p i $ m i s s i s s i p
i $ m i s s i s s i p p
$ m i s s i s s i p p i

12 Step 2: transform the matrix
Now we sort all the rows of the matrix OM in ascending order, with the leftmost element of each row being the most significant position. We obtain the transformed matrix M:

i p p i $ m i s s i s s
i s s i p p i $ m i s s
i s s i s s i p p i $ m
i $ m i s s i s s i p p
m i s s i s s i p p i $
p i $ m i s s i s s i p
p p i $ m i s s i s s i
s i p p i $ m i s s i s
s i s s i p p i $ m i s
s s i p p i $ m i s s i
s s i s s i p p i $ m i
$ m i s s i s s i p p i

The rows are completely sorted, from the leftmost column to the rightmost column.

13 Step 2: transform the matrix
(The sorted matrix M from the previous slide is shown alongside.)
Properties of the first column
–Lexicographical order
–Maximally 'clumped'. Why?
–From it, can we create the last column?
Properties of the last column
–Some clumps (in real files)
–From it, can we create the first column? Why?
See row 4: the last character precedes the first in the original text! This is true for all rows, since every row is a cyclic rotation. We can recreate everything: simple in code, but the idea is subtle.

14 Step 3: get the transformed text The Burrows-Wheeler transform is the last column of the sorted matrix, together with the row number where the original string ends up (the primary index).

15 Step 3: get the transformed text
From the above transform, L is easily obtained by taking the transpose of the last column of M, together with the primary index:
–primary index = 4
–L = s s m p $ p i s s i i i
Notice the run of 3 consecutive i's and the two runs of 2 consecutive s's; this makes the text easier to compress than the original string "mississippi$".

16 What is the benefit? The transformed text is more amenable to subsequent compression algorithms.

17 Any problem? It sounds cool, but… is the transformation reversible?

18 BWT is reversible and lossless The remarkable thing about the BWT is not only that it generates a more easily compressible output, but also that it is reversible: it allows the original text to be regenerated from the last-column data and the primary index.

19 BWT is reversible and lossless
BWT: mississippi$ → primary index 4 and L = ssmp$pissiii
Inverse BWT: primary index 4 and L = ssmp$pissiii → mississippi$
How do we achieve this?

20 The intuition Imagine you are standing in a line of 1,000 people. –For some reason, the people are dispersed. –Now we need to restore the line. –What should the people in the line do? What is the strategy? Centralized? –A bookkeeper or ticket numbers; this requires centralized extra bookkeeping space. Distributed? –Every person remembers who stood immediately in front of him; the bookkeeping space is distributed.

21 For the inverse BWT, the order is distributed and hidden in the output itself!

22 The trick Where to start? Who is the first one to ask? –The last one. Finding the immediately preceding character –by finding the row of the matrix that immediately precedes the current row. A loop is needed to recover everything. Each iteration involves two matters: –recover the current character (by index); –determine the next index, to keep the loop running.

23 Two matters –Recover the current character (by index): L[currentindex]. So what is currentindex? –Determine the next index: currentindex = new index. // We need a method for updating currentindex.

24 We want to know where the preceding character of a given character is.
(The sorted matrix M, with primary index 4, is shown again.)
Based on the known primary index, 4, we know that L[4], i.e. $, is the first character to retrieve; we proceed backwards. But which character is the next one to retrieve?

25 We want to know where the preceding character of a given character is.
We know that the next character is going to be 'i'. But L[6] = L[9] = L[10] = L[11] = 'i'. Which index should be chosen? Any of 6, 9, 10, and 11 gives us the right character 'i', but the correct strategy also has to determine which index to use to continue the restoration.

26 The solution The solution turns out to be very simple: –use the LF mapping! –Read on to see what the LF mapping is.

27 Inverse BW-transform Assume we know the complete ordered matrix. Using L and F, construct an LF-mapping LF[1…N] which maps each character in L to its occurrence in F. Then, using the LF-mapping and L, reconstruct T backwards by threading through the LF-mapping and reading the characters off of L.

28 L and F
From the sorted matrix M (primary index 4):
F = i i i i m p p s s s s $ (first column)
L = s s m p $ p i s s i i i (last column)

29 LF mapping
For each row i of the sorted matrix, LF[i] is the row of the first column F in which the last-column character L[i] occurs:
LF = [7, 8, 4, 5, 11, 6, 0, 9, 10, 1, 2, 3]

30 Inverse BW-transform: reconstruction of T
Start with T[] blank. Let u = N, and initialize s = the primary index (4 in our case).
T[u] = L[s]. We know that L[s] is the last character of T, because row M[primary index] ends with $.
For each i = u−1, …, 1:
s = LF[s] (thread backwards)
T[i] = L[s] (read off the next character back)

31 Inverse BW-transform: reconstruction of T
First step: s = 4, T = [ _ _ _ _ _ $ ]
Second step: s = LF[4] = 11, T = [ _ _ _ _ i $ ]
Third step: s = LF[11] = 3, T = [ _ _ _ p i $ ]
Fourth step: s = LF[3] = 5, T = [ _ _ p p i $ ]
And so on…

32 Who can retrieve the rest of the data? Please complete it!

33–37 Why does the LF mapping work?
(Slides 33–37 repeat the sorted matrix and the LF values [7, 8, 4, 5, 11, 6, 0, 9, 10, 1, 2, 3], highlighting one character at a time and asking: among the rows ending with the same character, which one should we jump to, and why this particular row rather than another?)

38 The mathematical explanation Consider two rows T1 = S1 + P and T2 = S2 + P that end with the same string P (here + denotes concatenation, and S1 and S2 have equal length). If T1 < T2, then S1 < S2. Now move P to the front: P + S1 = T1′ and P + S2 = T2′. Since S1 < S2, we know T1′ < T2′.

39 The secret is hidden in the sorting strategy of the forward transform. –The sorting strategy preserves the relative order of equal characters in both the last column and the first column.

40 We had assumed we have the whole matrix, but actually we don't. Observe that we only need two columns. Amazingly, the information contained in the Burrows-Wheeler transform (L) is enough to reconstruct F, hence the LF mapping, and hence the original message!

41 First, we know all of the characters in the original message, even if they're permuted into the wrong order. This enables us to reconstruct the first column.

42 Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column.

43 Inverse BW-transform: construction of C Store in C[c] the number of occurrences in T of the characters that sort strictly before c. In our example: T = mississippi$ → i×4, m×1, p×2, s×4, $×1, so C = [0 4 5 7 11] (for i, m, p, s, $ respectively). Notice that C[c] + m is the position of the m-th occurrence of c in F (if any).

44 Inverse BW-transform: constructing the LF-mapping Why does the LF-mapping work, and how do we build it? –Notice that for every row of M, L[i] directly precedes F[i] in the text (thanks to the cyclic shifts). Let L[i] = c, let r_i be the number of occurrences of c in the prefix L[1, i], and let M[j] be the r_i-th row of M that starts with c. Then the character in the first column F corresponding to L[i] is located at F[j]. How do we use this fact in the LF-mapping?

45 Inverse BW-transform: constructing the LF-mapping So, define LF[1…N] as LF[i] = C[L[i]] + r_i. C[L[i]] gives the offset just before the first occurrence of L[i] in F, and adding r_i gets us to the r_i-th row of M that starts with c.

46 Inverse BW-transform Construct C[1…|Σ|], which stores in C[i] the cumulative number of occurrences in T of the characters smaller than i. Construct an LF-mapping LF[1…N], which maps each character in L to its occurrence in F, using only L and C. Reconstruct T backwards by threading through the LF-mapping and reading the characters off of L.

47 Burrows-Wheeler Transform: reversing
Python example: http://nbviewer.ipython.org/6860491

def rankBwt(bw):
    ''' Given BWT string bw, return a parallel list of B-ranks. Also
        returns tots: a map from character to # times it appears. '''
    tots = dict()
    ranks = []
    for c in bw:
        if c not in tots:
            tots[c] = 0
        ranks.append(tots[c])  # B-rank of this occurrence of c
        tots[c] += 1
    return ranks, tots

def firstCol(tots):
    ''' Return a map from character to the range of rows prefixed by
        the character: a concise representation of the first BWM column. '''
    first = {}
    totc = 0
    for c, count in sorted(tots.items()):  # items(), not Python 2's iteritems()
        first[c] = (totc, totc + count)
        totc += count
    return first

def reverseBwt(bw):
    ''' Make T from BWT(T) '''
    ranks, tots = rankBwt(bw)
    first = firstCol(tots)
    rowi = 0   # start in first row
    t = '$'    # start with rightmost character
    while bw[rowi] != '$':
        c = bw[rowi]
        t = c + t  # prepend to answer
        # jump to row that starts with c of same rank
        rowi = first[c][0] + ranks[rowi]
    return t

48 Pros and cons of BWT Pros: –The transformed text enjoys a compression-favorable property: identical characters tend to be grouped together, so the probability of finding a character close to another instance of the same character increases substantially. –More importantly, there exist efficient and smart algorithms to restore the original string from the transformed result. Cons: –The need to sort all the contexts, up to their full length of N, is the main cause of the super-linear time complexity of naive BWT implementations. –Super-linear-time algorithms are not hardware friendly.

49 Conclusions The BW transform makes the text (string) more amenable to compression. –BWT in itself does not modify the data stream; it just reorders the symbols inside the data blocks. –Evaluation of the compression performance is subject to the information model assumed; that is another topic. The transform is lossless and reversible.

50 BW transform summary A naive implementation of the transform (comparison-sorting N rotations, each comparison taking O(n)) has O(n² log n) time complexity. –The best solutions run in O(n), which is tricky to implement. We can reverse the transform to reconstruct the original text in O(n) time, using O(n) space. Once we obtain L, we can compress L in a provably efficient manner.

51 FM Index
An FM Index is an index combining the BWT with a few small auxiliary data structures. "FM" supposedly stands for "Full-text Minute-space" (but the inventors are named Ferragina and Manzini).
The core of the index consists of F and L from the BWM of T = abaaba$:

F = $ a a a a b b
L = a b b a $ a a

F can be represented very simply (1 integer per alphabet character), and L is compressible: potentially very space-economical!
Paolo Ferragina and Giovanni Manzini. "Opportunistic data structures with applications." Proceedings of the 41st Annual Symposium on Foundations of Computer Science, IEEE, 2000.

52 FM Index: querying
Though the BWM is related to the suffix array, we can't query it the same way:

6: $
5: a$
2: aaba$
3: aba$
0: abaaba$
4: ba$
1: baaba$

We don't have these suffix columns; binary search isn't possible.

53 FM Index: querying It is easy to find all the rows beginning with a, thanks to F's simple structure. We look for the range of rows of BWM(T) with P as a prefix: do this for P's shortest suffix, then extend to successively longer suffixes until the range becomes empty or we have exhausted P.

54 FM Index: querying P = aba We have the rows beginning with a; now we seek the rows beginning with ba. Look at those rows in L: b0 and b1 are the b's occurring just to the left. Use the LF mapping, and let the new range delimit those b's. Now we have the rows with prefix ba.

55 FM Index: querying We have rows beginning with ba, now we seek rows beginning with aba

56 FM Index: querying P = aba Now we have the same range, [3, 5), that we would have got from querying the suffix array.
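The backward search of slides 53–56 can be sketched compactly. This toy version (an assumption-laden illustration, not the FM-index authors' code) substitutes plain string scans of L for the fast rank structure introduced on the next slides, and uses Python's default ordering, in which '$' sorts before the letters, as in the abaaba$ example:

```python
def fm_search(bw, p):
    """Find the range [lo, hi) of BWM rows prefixed by p, extending
    from p's shortest suffix to successively longer suffixes."""
    chars = sorted(set(bw))            # '$' sorts first here
    C, total = {}, 0                   # C[c] = # chars smaller than c
    for c in chars:
        C[c] = total
        total += bw.count(c)
    lo, hi = 0, len(bw)                # start with the range of all rows
    for c in reversed(p):              # shortest suffix first
        lo = C[c] + bw[:lo].count(c)   # LF step for the range's top
        hi = C[c] + bw[:hi].count(c)   # LF step for the range's bottom
        if lo >= hi:
            return None                # p does not occur in T
    return lo, hi

bw_abaaba = 'abba$aa'                  # BWT of 'abaaba$'
print(fm_search(bw_abaaba, 'aba'))     # (3, 5), matching the slide
```

Querying 'aba' against the BWT of abaaba$ yields the range [3, 5) found on slide 56, and a pattern absent from T (e.g. 'bb') empties the range, as slide 57 describes.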

57 FM Index: querying When P does not occur in T, we will eventually fail to find the next character in L:

58 FM Index: querying P = aba If we scan characters in the last column to compute ranks, querying can be very slow: O(m) work per step.

59 FM Index: fast rank calculations

60 FM Index: fast rank calculations
Another idea: pre-calculate the number of a's and b's in L up to some rows, e.g. every 5th row. Call the pre-calculated rows checkpoints, and the pre-calculated counts the tally.
A lookup at a checkpoint row succeeds as usual. If the row is not a checkpoint, there is one nearby: to resolve a lookup for character c in a non-checkpoint row, scan along L until we reach the nearest checkpoint, then use the tally at that checkpoint, adjusted for the number of c's we saw along the way.

61 FM Index: fast rank calculations
Assuming checkpoints are spaced O(1) distance apart, lookups are O(1).
(Figure: L = …abbaaaabbbaabbab…, with checkpoint tallies 482/432 and 488/439 for a/b bracketing the queried rows. One rank lookup resolves by scanning down from a checkpoint: 482 + 2 − 1 = 483; another by scanning up to one: 439 − 2 − 1 = 436.)
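A checkpointed rank structure can be sketched as follows (an illustrative assumption-based helper, not the slides' code; for simplicity it only scans forward from the checkpoint at or below the queried row, whereas the slide scans in either direction, and it uses a step of 4 rather than 5):

```python
def build_checkpoints(bw, step=4):
    """Record cumulative character counts at every step-th row of L."""
    tally, counts = {}, {}
    for i, c in enumerate(bw):
        counts[c] = counts.get(c, 0) + 1
        if (i + 1) % step == 0:
            tally[i + 1] = dict(counts)   # snapshot the tally here
    return tally

def rank(bw, tally, c, i, step=4):
    """# of occurrences of c in bw[:i], scanning at most `step` chars
    from the nearest checkpoint at or below row i."""
    cp = (i // step) * step               # nearest checkpoint <= i
    base = tally.get(cp, {}).get(c, 0) if cp else 0
    return base + bw[cp:i].count(c)       # adjust for chars after checkpoint

bw = 'abba$aa'                            # BWT of 'abaaba$'
tally = build_checkpoints(bw)
print(rank(bw, tally, 'a', 7))            # 4: all the a's in L
```

With checkpoints every O(1) rows, each rank query touches a bounded stretch of L, which is the point of the slide.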

62 FM Index: a few problems
(1) Scanning L for ranks is O(m) work. Solved! With checkpoints it's O(1), at the expense of adding the checkpoints (O(m) integers) to the index.
(2) Ranking takes too much space. In the reversal code,

def reverseBwt(bw):
    """ Make T from BWT(T) """
    ranks, tots = rankBwt(bw)
    first = firstCol(tots)
    rowi = 0
    t = "$"
    while bw[rowi] != '$':
        c = bw[rowi]
        t = c + t
        rowi = first[c][0] + ranks[rowi]
    return t

the ranks list is m integers. With checkpoints we greatly reduce the number of integers needed for ranks, but it's still O(m) space; there is literature on how to improve this space bound.

63 FM Index: a few problems

64 FM Index: resolving offsets Idea: store some, but not all, entries of the suffix array. A lookup for row 4 succeeds: we kept that entry of the SA. A lookup for row 3 fails: we discarded that entry of the SA.

65 FM Index: resolving offsets But the LF mapping tells us that the a at the end of row 3 corresponds to the a at the beginning of row 2. And row 2 has a suffix-array value of 2. So row 3 has suffix-array value 3 = 2 (row 2's SA value) + 1 (# of steps to reach row 2). If saved SA values are O(1) positions apart in T, resolving an offset takes O(1) time.
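This walk can be sketched directly (an illustrative helper under the abaaba$ example's conventions, not the slides' code; the SA sample below keeps only the rows whose text offsets are even):

```python
def lf_mapping(bw):
    """LF[i] = row of F in which the i-th character of L occurs."""
    chars = sorted(set(bw))
    C, total = {}, 0
    for c in chars:
        C[c] = total
        total += bw.count(c)
    seen, lf = {c: 0 for c in chars}, []
    for c in bw:
        lf.append(C[c] + seen[c])
        seen[c] += 1
    return lf

def resolve_offset(bw, sa_sample, row):
    """Walk the LF mapping from `row` until reaching a sampled row,
    then add back the number of steps taken."""
    lf = lf_mapping(bw)
    steps = 0
    while row not in sa_sample:
        row = lf[row]     # each LF step moves one position left in T
        steps += 1
    return sa_sample[row] + steps

# BWT of 'abaaba$'; SA entries kept for rows with even text offsets
sample = {0: 6, 2: 2, 4: 0, 5: 4}
print(resolve_offset('abba$aa', sample, 3))  # 3, as derived on the slide
```

For row 3 the walk takes one LF step to row 2 (sampled, SA value 2) and returns 2 + 1 = 3, reproducing the slide's derivation.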

66 FM Index: problems solved (3) We need a way to find where the occurrences are in T. Solved! At the expense of adding some SA values (O(m) integers) to the index. Call this the "SA sample."

67 FM Index: small memory footprint

68 Short read mapping Input: –A reference genome –A collection of many 25-100bp tags (reads) –User-specified parameters Output: –One or more genomic coordinates for each tag In practice, only 70-75% of tags successfully map to the reference genome. Why?

69 Multiple mapping A single tag may occur more than once in the reference genome. The user may choose to ignore tags that appear more than n times. As n gets large, you get more data, but also more noise in the data.

70 Inexact matching An observed tag may not exactly match any position in the reference genome. Sometimes the tag almost matches one or more positions. Such mismatches may represent a SNP (single-nucleotide polymorphism) or a bad read-out. The user can specify the maximum number of mismatches, or a phred-style quality-score threshold. As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.

71 Read Length is Not As Important For Resequencing Jay Shendure

72 Mapping reads back
Hash table (lookup table) –Fast, but requires perfect matches. [O(mn + N)]
Array scanning –Can handle mismatches, but not gaps. [O(mN)]
Dynamic programming (Smith-Waterman) –Handles indels. –Mathematically optimal solution. –Slow (most programs use hash mapping as a prefilter). [O(mnN)]
Burrows-Wheeler transform (BW transform) –Fast: [O(m + N)] without mismatches/gaps. –Memory efficient. –But for gaps/mismatches, it lacks sensitivity.

73 Spaced seed alignment Tags and tag-sized pieces of reference are cut into small “seeds.” Pairs of spaced seeds are stored in an index. Look up spaced seeds for each tag. For each “hit,” confirm the remaining positions. Report results to the user.

74 Burrows-Wheeler Store entire reference genome. Align tag base by base from the end. When tag is traversed, all active locations are reported. If no match is found, then back up and try a substitution.

75 Why Burrows-Wheeler? BWT very compact: – Approximately ½ byte per base – As large as the original text, plus a few “extras” – Can fit onto a standard computer with 2GB of memory Linear-time search algorithm – proportional to length of query for exact matches

76 Main advantage of BWT over the suffix array BWT needs less memory than a suffix array. For the human genome, m = 3 × 10^9: –Suffix array: m log2(m) bits = 4m bytes = 12 GB –BWT: m/4 bytes plus extras = 1–2 GB (m/4 bytes stores the BWT itself, at 2 bits per character) A full suffix array plus occurrence-counts array would take 5m log2(m) bits = 20m bytes. In practice, the SA and OCC arrays are only partially stored, and most elements are computed on demand (which takes time!). There is a tradeoff between time and space.
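The arithmetic behind these figures can be checked directly (a small worked example; the 32-bit integer width for SA entries follows from ceiling log2 of the genome length):

```python
import math

m = 3 * 10**9  # human genome length, as on the slide

# Suffix array: one ceil(log2 m)-bit integer per position (32 bits = 4 bytes)
sa_bytes = m * math.ceil(math.log2(m)) / 8
print(sa_bytes / 10**9)   # 12.0 -> the slide's 12 GB

# BWT string itself: 2 bits (1/4 byte) per base for A, C, G, T
bwt_bytes = m / 4
print(bwt_bytes / 10**9)  # 0.75 -> under 1 GB before the "extras"
```

The "plus extras" (checkpoints and the SA sample) is what pushes the BWT-based index from 0.75 GB up toward the slide's 1–2 GB.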

77 Comparison
Burrows-Wheeler (Bowtie): requires <2 GB of memory; runs 30-fold faster; is much more complicated to program.
Spaced seeds (MAQ): requires ~50 GB of memory; runs 30-fold slower; is much simpler to program.

78 Short-read mapping software
Software | Technique | Developer | License
Eland | Hashing reads | Illumina | ?
SOAP | Hashing refs | BGI | Academic
Maq | Hashing reads | Sanger (Li, Heng) | GNUPL
Bowtie | BWT | Salzberg/UMD | GNUPL
BWA | BWT | Sanger (Li, Heng) | GNUPL
SOAP2 | BWT & hashing | BGI | Academic
http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

79 References
(Bowtie) Langmead B, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology (2009) 10:R25.
(SOAP) Li R, et al. SOAP: short oligonucleotide alignment program. Bioinformatics (2008) 24:713-714.
(BWA) Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009) 25:1754-1760.
(SOAP2) Li R, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics (2009) 25:1966-1967.
(MAQ) Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. (2008) 18:1851-1858.
Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nature Methods (2009) 6:S6-S12.
http://www.allisons.org/ll/AlgDS/Strings/BWT/

