The Burrows-Wheeler Transform


The Burrows-Wheeler Transform Sen Zhang

Transform What is the definition of “transform”? To change the nature, function, or condition of; to convert. To change markedly the appearance or form of. Lossless and reversible. By the way, transforming is the easy part; putting things back is the problem. A three-year-old can transform, or disassemble, pretty much anything, but … For the BWT, there exist efficient inverse algorithms that can retrieve the original text from the transformed text.

What is BWT? The Burrows-Wheeler transform (BWT) is a block-sorting, lossless, reversible data transform. The BWT permutes a text into a new sequence which is usually more “compressible”. It surfaced not long ago, in 1994, due to Michael Burrows and David Wheeler. The transformed text can be better compressed with fast locally-adaptive algorithms, such as run-length encoding (or move-to-front coding) in combination with Huffman coding (or arithmetic coding).

Outline What does BWT stand for? Why BWT? Steps of BWT Data compression algorithms: RLE, Huffman coding, combining them What is left out? Bringing reality closer to ideality Steps of BWT BWT is reversible and lossless Steps to invert Variants of BWT: ST When was BWT initially proposed? Where are the inventors of the algorithms? Your homework!

Why BWT? Run-length encoding (RLE): replacing a long run of a repeated character with a count of the repetition, squeezing the run down to a flag, a character, and a number. AAAAAAA becomes *A7 (* is the flag). Ideally, the longer the run of the same character, the better. In reality, however, the input data does not necessarily favor what the RLE method expects.
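
As an illustration, a minimal run-length encoder in C++ (the homework language). The *A7-style escape and the run-length threshold of 4 are assumptions made for this sketch, as is the function name:

    #include <iostream>
    #include <string>

    // Minimal run-length encoder matching the slide's "*A7" style:
    // runs of 4 or more identical characters become '*' + char + count.
    // The '*' flag and the threshold of 4 are illustrative choices.
    std::string rle_encode(const std::string& s) {
        std::string out;
        for (size_t i = 0; i < s.size(); ) {
            size_t j = i;
            while (j < s.size() && s[j] == s[i]) ++j;   // the run is [i, j)
            size_t run = j - i;
            if (run >= 4)
                out += '*' + std::string(1, s[i]) + std::to_string(run);
            else
                out.append(run, s[i]);
            i = j;
        }
        return out;
    }

    int main() {
        std::cout << rle_encode("AAAAAAA") << '\n';     // prints *A7
    }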

Bridging reality and ideality BWT transforms a text into a sequence that is easier to compress, i.e. closer to the ideal input that RLE expects. Compressing the transformed text, instead of the original, improves compression performance.

Preliminaries An alphabet Σ with a total order on its characters; here we assume Σ = {a, b, c, $}. One character, denoted $, is reserved for use as the sentinel.

How to transform? Three steps: Form an N×N matrix whose rows are the cyclic (left) rotations of the given text. Sort the rows of the matrix in lexicographic order. Extract the last column of the matrix.

An example: how the BWT transforms “mississippi”. T = mississippi$

Step 1: form the matrix The N×N (symmetric) matrix OM is constructed from the texts obtained by rotating the text T. OM has T as its first row, i.e. OM[1, 1:N] = T. Each of the remaining rows T_i is obtained by cyclically shifting the previous row T_{i-1} one position to the left. The resulting matrix OM is shown on the next slide.

A text T is a sequence of characters drawn from the alphabet. Without loss of generality, a text T of length N is denoted x_1 x_2 x_3 … x_{N-1} $, where every character x_i is in the alphabet Σ for i in [1, N-1]. The last character of the text is the sentinel, which is the lexicographically greatest character in the alphabet and occurs exactly once in the text. Appending a sentinel to the original text is not a must, but it simplifies the presentation and makes all rotations of the text distinct. Example: abcababac$

Step 1: form the matrix First treat the input string as a cyclic string and construct an N×N matrix from it.

Step 1: form the matrix

m i s s i s s i p p i $
i s s i s s i p p i $ m
s s i s s i p p i $ m i
s i s s i p p i $ m i s
i s s i p p i $ m i s s
s s i p p i $ m i s s i
s i p p i $ m i s s i s
i p p i $ m i s s i s s
p p i $ m i s s i s s i
p i $ m i s s i s s i p
i $ m i s s i s s i p p
$ m i s s i s s i p p i

Step 2: transform the matrix Now we sort all the rows of the matrix OM in ascending order, with the leftmost element of each row being the most significant position. Consequently, we obtain the transformed matrix M:

i p p i $ m i s s i s s
i s s i p p i $ m i s s
i s s i s s i p p i $ m
i $ m i s s i s s i p p
m i s s i s s i p p i $
p i $ m i s s i s s i p
p p i $ m i s s i s s i
s i p p i $ m i s s i s
s i s s i p p i $ m i s
s s i p p i $ m i s s i
s s i s s i p p i $ m i
$ m i s s i s s i p p i

Completely sorted from the leftmost column to the rightmost column (recall that $ sorts greatest).

Step 3: get the transformed text The Burrows-Wheeler transform is the last column of the sorted matrix, together with the row number where the original string ends up (the primary index).

Step 3: get the transformed text From the matrix above, L is obtained by reading off the last column of M, together with the primary index:

L = s s m p $ p i s s i i i
primary index = 4 (0-based: the row where mississippi$ ends up)

Notice how there are 3 i's in a row, 2 consecutive s's, and another 2 consecutive s's. This makes the text easier to compress than the original string “mississippi$”.
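
Putting the three steps together, a naive forward transform in C++. This is a sketch for the lecture, not production code, and the function name is mine. Note the comparator: the slides rank the sentinel $ above every other character, whereas plain byte order would rank it first and produce a different L:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Naive forward BWT following the three steps on the slides:
    // build all cyclic rotations, sort them, take the last column.
    // primaryIndex receives the (0-based) row of the original text.
    std::string bwt_forward(const std::string& t, size_t& primaryIndex) {
        const size_t n = t.size();
        // The slides treat '$' as the greatest character, so rank it
        // above all others; plain byte order would put '$' first.
        auto rank = [](char c) { return c == '$' ? 256 : (int)(unsigned char)c; };
        auto less = [&](const std::string& a, const std::string& b) {
            for (size_t i = 0; i < n; ++i)
                if (rank(a[i]) != rank(b[i])) return rank(a[i]) < rank(b[i]);
            return false;
        };
        std::vector<std::string> rows(n);
        for (size_t i = 0; i < n; ++i)
            rows[i] = t.substr(i) + t.substr(0, i);   // cyclic left-shift by i
        std::sort(rows.begin(), rows.end(), less);
        std::string last;
        for (size_t i = 0; i < n; ++i) {
            if (rows[i] == t) primaryIndex = i;       // row of the original text
            last += rows[i].back();
        }
        return last;
    }

    int main() {
        size_t idx = 0;
        std::cout << bwt_forward("mississippi$", idx) << ' ' << idx << '\n';
        // prints: ssmp$pissiii 4
    }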

What is the benefit? The transformed text is more amenable to subsequent compression algorithms.

Any problem? It sounds cool, but … Is the transformation reversible?

BWT is reversible and lossless The remarkable thing about the BWT is not only that it generates a more easily compressible output, but also that it is reversible: it allows the original text to be regenerated from the last-column data and the primary index.

BWT is reversible and lossless mississippi$ BWT Index 4 and ssmp$pissiii ??? How to achieve the goal? Inverse BWT mississippi$

The intuition Imagine you are standing in a line of 1000 people. For some reason, the people are dispersed. Now we need to restore the line. What should the people in line do? What is the strategy? Centralized? A bookkeeper, or ticket numbers: this requires extra, centralized bookkeeping space. Distributed? Every person remembers who stood immediately in front of him: the bookkeeping space is distributed.

For the IBWT The order is distributed and hidden in the output itself!

The trick Where to start? Who is asked first? The last one. Finding the immediately preceding character = finding the immediately preceding row of the current row. A loop is needed to recover everything. Each iteration involves two tasks: recover the current character (by index), and point out the next index, to keep the loop running.

Two tasks Recover the current character: L[currentIndex], so what is currentIndex? Point out the next character: currentIndex = newIndex; // we need a method for updating currentIndex.

We want to know where the preceding character of a given character is. [The sorted matrix M from the earlier slide, with the primary index 4 alongside.] Based on the already known primary index, 4, we know that L[4], i.e. $, is the first character to retrieve, working backwards. But which character do we retrieve next?

[The sorted matrix M again.] We know that the next character is going to be ‘i’. But L[6] = L[9] = L[10] = L[11] = ‘i’. Which index should be chosen? Any of 6, 9, 10 and 11 gives the right character ‘i’, but a correct strategy must also determine which index continues the restoration.


The solution The solution turns out to be very simple: use the LF mapping! Read on to see what the LF mapping is.

Inverse BW-Transform Assume for now that we know the complete sorted matrix. Using L and F, construct an LF-mapping LF[1…N] which maps each character of L to its occurrence in F. Then, using the LF-mapping and L, reconstruct T backwards by threading through the LF-mapping and reading the characters off of L.

L and F

F = i i i i m p p s s s s $   (the first column of M: the characters of T in sorted order)
L = s s m p $ p i s s i i i   (the last column of M)
primary index = 4

LF mapping

index: 0  1  2  3  4  5  6  7  8  9  10 11
L:     s  s  m  p  $  p  i  s  s  i  i  i
LF:    7  8  4  5  11 6  0  9  10 1  2  3

primary index = 4

Inverse BW-Transform: Reconstruction of T Start with T blank, and let N be the length of T. Initialize s = the primary index (4 in our case). Set T[N-1] = L[s]; we know that L[s] is the last character of T, because row M[primary index] ends with $. For each i = N-2, …, 0 do: s = LF[s] (threading backwards) T[i] = L[s] (read off the next letter back)

Inverse BW-Transform: Reconstruction of T
First step: s = 4, T = [ .. _ _ _ _ _ $ ]
Second step: s = LF[4] = 11, T = [ .. _ _ _ _ i $ ]
Third step: s = LF[11] = 3, T = [ .. _ _ _ p i $ ]
Fourth step: s = LF[3] = 5, T = [ .. _ _ p p i $ ]
And so on…

Who can retrieve the data? Please complete it!

Why does LF mapping work? [These slides step through the sorted matrix M, with LF = 7 8 4 5 11 6 0 9 10 1 2 3 and primary index 4 alongside, asking of each candidate row in turn: which one is next? why not this one? why this one?]

The mathematical explanation Write two rows that end with the same character c as T1 = S1 c and T2 = S2 c. If T1 < T2, then S1 < S2. Now rotate c from the back to the front: T1' = c S1 and T2' = c S2. Since S1 < S2, we know T1' < T2'.
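
Spelled out with the conclusion it is driving at (this is just the slide's argument, restated):

Lemma. Fix a character c, and let the rows of M ending with c be S_1 c < S_2 c < … < S_k c. Then S_1 < S_2 < … < S_k, and prepending c preserves the order: c S_1 < c S_2 < … < c S_k. The rows c S_j are exactly the rows of M beginning with c, so the j-th occurrence of c in L (top to bottom) and the j-th occurrence of c in F belong to the same rotation. This correspondence is what the LF mapping exploits.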

The secret is hidden in the sorting strategy of the forward transform: sorting preserves the relative order of equal characters in both the last column and the first column.

So far we had assumed we have the whole matrix, but actually we don't. Observation: we only need two columns. Amazingly, the information contained in the Burrows-Wheeler transform (L) is enough to reconstruct F, and hence the mapping, and hence the original message!

First, we know all of the characters in the original message, even if they're permuted in the wrong order. This enables us to reconstruct the first column.

Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column.
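 
In code this is a one-liner: copy L and sort it, again with $ ranked last to match the slides. A small sketch (the function name is mine):

    #include <algorithm>
    #include <iostream>
    #include <string>

    // F (the first column) is just L (the last column) sorted,
    // with '$' again ranked greatest to match the slides.
    std::string first_column(std::string L) {
        auto rank = [](char c) { return c == '$' ? 256 : (int)(unsigned char)c; };
        std::sort(L.begin(), L.end(),
                  [&](char a, char b) { return rank(a) < rank(b); });
        return L;
    }

    int main() {
        std::cout << first_column("ssmp$pissiii") << '\n';  // iiiimppssss$
    }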

Inverse BW-Transform: Construction of C Store in C[c] the number of occurrences in T of all characters smaller than c. In our example, T = mississippi$: the counts are i:4, m:1, p:2, s:4, $:1, so (in the order i, m, p, s, $, with $ greatest) C = [0 4 5 7 11]. Notice that, counting rows from 0, the occurrences of c occupy rows C[c] through C[c] + count(c) - 1 of F.

Inverse BW-Transform: Constructing the LF-mapping Why does the LF-mapping work, and how do we build it? Notice that for every row of M, L[i] directly precedes F[i] in the text (thanks to the cyclic shifts). Let L[i] = c, let r_i be the number of occurrences of c in the prefix L[0..i-1], and let M[j] be the row of M starting with the (r_i + 1)-th occurrence of c in F. Then the character in the first column corresponding to L[i] is located at F[j]. How do we use this fact in the LF-mapping?

Inverse BW-Transform: Constructing the LF-mapping So, define LF[0…N-1] as LF[i] = C[L[i]] + r_i. C[L[i]] gets us the offset of the first row of M that starts with c, and adding r_i selects the row containing this particular occurrence of c. (Check against the table above: LF = [7 8 4 5 11 6 0 9 10 1 2 3].)

Inverse BW-Transform Construct C[1…|Σ|], where C[c] stores the number of occurrences in T of characters smaller than c. Construct the LF-mapping LF[0…N-1], which maps each character in L to its occurrence in F, using only L and C. Reconstruct T backwards by threading through the LF-mapping and reading the characters off of L.
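
The whole inverse transform in C++, a sketch matching the 0-based example above (again with $ ranked greatest; byte values index C, and the function name is mine):

    #include <iostream>
    #include <string>
    #include <vector>

    // Inverse BWT via the LF-mapping, as in the summary above:
    //   C[c]  = number of characters in T smaller than c
    //   LF[i] = C[L[i]] + (occurrences of L[i] in L[0..i-1])
    // Indices are 0-based; '$' again ranks above every other character.
    std::string bwt_inverse(const std::string& L, size_t primaryIndex) {
        const size_t n = L.size();
        auto rank = [](char c) { return c == '$' ? 256 : (int)(unsigned char)c; };

        std::vector<size_t> count(257, 0), C(257, 0);
        for (char c : L) ++count[rank(c)];
        for (size_t c = 1; c < 257; ++c) C[c] = C[c - 1] + count[c - 1];

        std::vector<size_t> seen(257, 0), LF(n);
        for (size_t i = 0; i < n; ++i) {
            size_t c = rank(L[i]);
            LF[i] = C[c] + seen[c]++;
        }

        std::string T(n, ' ');
        size_t s = primaryIndex;
        for (size_t i = n; i-- > 0; ) {   // fill T from the back
            T[i] = L[s];
            s = LF[s];
        }
        return T;
    }

    int main() {
        std::cout << bwt_inverse("ssmp$pissiii", 4) << '\n';  // mississippi$
    }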

Another example You are given an input string “ababc”. (a) Using Burrows-Wheeler, create all cyclic shifts of the string. (b) Sort them. (c) Output L and the primary index. (d) Given L, determine F and LF (and show how you do it). (e) Decode the original string using the primary index, L, and LF (and show how you do it).

Pros and cons of BWT Pros: The transformed text enjoys a compression-favorable property: it tends to group identical characters together, so the probability of finding a character close to another instance of the same character is increased substantially. More importantly, there exist efficient and clever algorithms to restore the original string from the transformed result. Cons: The need to sort all the contexts up to their full length N is the main cause of the super-linear time complexity of BWT. Super-linear-time algorithms are not hardware friendly.

Blockwise operation In practice the BWT is applied to blocks of a certain typical size (hence “block sorting”).

An improved algorithm: the Schindler Transform (ST) To address the above drawbacks, a slightly different transform, called ST, was proposed. It sorts the rows by only their first k characters (where k can be far less than N), yet still renders itself reversible. The key idea of ST is a two-level priority sorting scheme, easily implemented with radix sort: the lexicographical sorting criterion (the first k characters), and the positional sorting criterion (ties broken by position in OM).

ST transform Let OM be the same matrix as defined for the BWT. Under k-order ST, OM is transformed to M_k by sorting all its rows according to their first k leftmost characters (their k-order contexts) only. In case two k-order contexts are equal, the tie is resolved by the rows' relative positions in the original OM. For k = 2:

i p p i $ m i s s i s s
i s s i s s i p p i $ m
i s s i p p i $ m i s s
i $ m i s s i s s i p p
m i s s i s s i p p i $
p i $ m i s s i s s i p
p p i $ m i s s i s s i
s i s s i p p i $ m i s
s i p p i $ m i s s i s
s s i s s i p p i $ m i
s s i p p i $ m i s s i
$ m i s s i s s i p p i

Only partially sorted, on the leftmost two columns.
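
A sketch of the forward k-order ST in C++ (the inverse, as the next slide notes, is another story). A stable sort over rotation indices gives the positional tie-break for free; the function name and the use of std::stable_sort rather than radix sort are my choices for brevity:

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <string>
    #include <vector>

    // Forward k-order ST: sort rotations by their first k characters only
    // (the k-order contexts); a stable sort over rotation indices resolves
    // ties by position in OM automatically. '$' again ranks greatest.
    std::string st_forward(const std::string& t, size_t k) {
        const size_t n = t.size();
        auto rank = [](char c) { return c == '$' ? 256 : (int)(unsigned char)c; };
        std::vector<size_t> rows(n);
        std::iota(rows.begin(), rows.end(), 0);   // row i = rotation by i
        std::stable_sort(rows.begin(), rows.end(), [&](size_t a, size_t b) {
            for (size_t j = 0; j < k; ++j) {      // compare k-order contexts
                int ca = rank(t[(a + j) % n]), cb = rank(t[(b + j) % n]);
                if (ca != cb) return ca < cb;
            }
            return false;                         // equal contexts: keep OM order
        });
        std::string last;
        for (size_t r : rows) last += t[(r + n - 1) % n];   // last column
        return last;
    }

    int main() {
        std::cout << st_forward("mississippi$", 2) << '\n';
        // prints smsp$pissiii for k = 2 (compare the full BWT, ssmp$pissiii)
    }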

Pros and cons of ST Pros: Faster than BWT. Friendly to hardware implementation. Cons: The currently known approach to inverting ST is based on a hash function. The relationship between inverse ST and inverse BWT is not well studied.

An application scheme in a data communication system

Conclusions The BW transform makes the text (string) more amenable to compression. BWT by itself does not compress the data stream; it just reorders the symbols inside the data blocks. Evaluating the actual performance gain depends on the information model assumed (another topic). The transform is lossless and reversible.

BW Transform Summary A naïve implementation of the transform takes up to O(n^3) time (a simple quadratic sort with O(n)-time row comparisons). The best solutions run in O(n) time, but are tricky to implement. We can invert the transform to reconstruct the original text in O(n) time, using O(n) space. Once we obtain L, we can compress L in a provably efficient manner.
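
Because T ends with a unique sentinel, sorting the cyclic rotations is the same as sorting the suffixes, so L can be read straight off a suffix array. The sketch below (function name mine) still uses a plain comparison sort for clarity; the O(n) suffix-array constructions the summary alludes to, such as SA-IS, are the tricky part:

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <string>
    #include <vector>

    // With a unique sentinel, rotation order equals suffix order, so
    // L[i] = T[SA[i] - 1], wrapping around at 0.
    std::string bwt_via_suffix_array(const std::string& t, size_t& primaryIndex) {
        const size_t n = t.size();
        auto rank = [](char c) { return c == '$' ? 256 : (int)(unsigned char)c; };
        std::vector<size_t> sa(n);
        std::iota(sa.begin(), sa.end(), 0);
        std::sort(sa.begin(), sa.end(), [&](size_t a, size_t b) {
            while (a < n && b < n) {
                if (rank(t[a]) != rank(t[b])) return rank(t[a]) < rank(t[b]);
                ++a; ++b;
            }
            return a == n;   // unreachable when the sentinel is unique
        });
        std::string L;
        for (size_t i = 0; i < n; ++i) {
            if (sa[i] == 0) primaryIndex = i;        // row of the original text
            L += t[(sa[i] + n - 1) % n];
        }
        return L;
    }

    int main() {
        size_t idx = 0;
        std::cout << bwt_via_suffix_array("mississippi$", idx)
                  << ' ' << idx << '\n';             // ssmp$pissiii 4
    }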

Issues left out What if every character of the alphabet appears in the text, so no sentinel can be used? Do you need to compare all N positions? What if the input data is not ASCII text but an image, or a biological sequence (DNA, RNA or protein)? Why the last column, not the first? In BWT, the last column, L, of the sorted matrix contains concentrations of identical characters, which is why L is easy to compress. However, the first column, F, of the same matrix is even easier to compress, since it is fully sorted: it could be stored as little more than the character counts. So why select column L and not column F?

Homework Implement the BWT algorithms: Forward transform Backward transform Either in the Windows environment or the Linux environment.

Examples of running your program on the command line bwt -f text1 text2 (transform text1 into text2) bwt -i text2 text3 (invert text2 into text3)

How to verify the correctness of your algorithms Because the BWT is reversible and lossless, if your implementation is correct, text3 should be identical to text1. You can compare text1 and text3 manually. Alternatively, you can run the “diff” command in Linux to report any differences between the two files.
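
For example, using the command-line convention above (bwt is the program you write; the file names are placeholders):

    bwt -f text1 text2
    bwt -i text2 text3
    diff text1 text3      (no output means the round trip succeeded)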

Requirements Stage 1: use a fixed string, or accept a string from the keyboard, to test the correctness of your algorithms. (80 points) Stage 2: then expand your solution to read the string from a given file. (20 points) Notice that text2 should be a binary file: it stores the primary index first, followed by the ASCII codes of the transformed text.

How to sort the matrix 1. The simplest way: make each row a string, then do string comparisons. With C strings you need to know the string-comparison functions; with C++ strings you need to know how to use the string class. Use whichever you feel most comfortable with. 2. Radix sort. 3. Suffix array.

Knowledge to be practiced in the homework Arrays Dynamic memory allocation String manipulation Sorting File operations Data compression algorithms