CPS 100e, Fall 2008 1 Burrows Wheeler Transform l Michael Burrows and David Wheeler in 1994, BWT l By itself it is NOT a compression scheme  It’s used.

Slides:



Advertisements
Similar presentations
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Advertisements

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Quick Sort, Shell Sort, Counting Sort, Radix Sort AND Bucket Sort
Data Compressor---Huffman Encoding and Decoding. Huffman Encoding Compression Typically, in files and messages, Each character requires 1 byte or 8 bits.
Wavelet Trees Ankur Gupta Butler University. Text Dictionary Problem The input is a text T drawn from an alphabet Σ. We want to support the following.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
CSC1016 Coursework Clarification Derek Mortimer March 2010.
Chapter 4: The Building Blocks: Binary Numbers, Boolean Logic, and Gates Invitation to Computer Science, Java Version, Third Edition.
CPSC 231 Organizing Files for Performance (D.H.) 1 LEARNING OBJECTIVES Data compression. Reclaiming space in files. Compaction. Searching. Sorting, Keysorting.
Lecture 25 Selection sort, reviewed Insertion sort, reviewed Merge sort Running time of merge sort, 2 ways to look at it Quicksort Course evaluations.
1 Homework Turn in HW2 at start of next class. Starting Chapter 2 K&R. Read ahead. HW3 is on line. –Due: class 9, but a lot to do! –You may want to get.
FALL 2004CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
CSCI 3 Chapter 1.8 Data Compression. Chapter 1.8 Data Compression  For the purpose of storing or transferring data, it is often helpful to reduce the.
Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen.
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
CS503: First Lecture, Fall 2008 Michael Barnathan.
A Simpler Analysis of Burrows-Wheeler Based Compression Haim Kaplan Shir Landau Elad Verbin.
CSE 143 Lecture 18 Huffman slides created by Ethan Apter
Optimal Partitions of Strings: A new class of Burrows-Wheeler Compression Algorithms Raffaele Giancarlo Marinella Sciortino
CS 106 Introduction to Computer Science I 10 / 16 / 2006 Instructor: Michael Eckmann.
Spring 2015 Mathematics in Management Science Binary Linear Codes Two Examples.
How our brains work And how it ties in to programming and arrays Storage Sorting Algorithms Etc…
Data Structures Introduction Phil Tayco Slide version 1.0 Jan 26, 2015.
(c) University of Washingtonhashing-1 CSC 143 Java Hashing Set Implementation via Hashing.
Lists in Python.
MA/CSSE 473 Day 31 Student questions Data Compression Minimal Spanning Tree Intro.
Information and Coding Theory Heuristic data compression codes. Lempel- Ziv encoding. Burrows-Wheeler transform. Juris Viksna, 2015.
Week 5 - Monday.  What did we talk about last time?  Linked list implementations  Stacks  Queues.
IT253: Computer Organization Lecture 3: Memory and Bit Operations Tonga Institute of Higher Education.
1 Joe Meehean.  Problem arrange comparable items in list into sorted order  Most sorting algorithms involve comparing item values  We assume items.
ECE 250 Algorithms and Data Structures Douglas Wilhelm Harder, M.Math. LEL Department of Electrical and Computer Engineering University of Waterloo Waterloo,
Huffman Coding. Huffman codes can be used to compress information –Like WinZip – although WinZip doesn’t use the Huffman algorithm –JPEGs do use Huffman.
Chapter 7: Sorting Algorithms Insertion Sort. Sorting Algorithms  Insertion Sort  Shell Sort  Heap Sort  Merge Sort  Quick Sort 2.
CS50 Week 2. RESOURCES Office hours ( Lecture videos, slides, source code, and notes (
Computer Organization and Assembly Language Bitwise Operators.
UTILITIES Group 3 Xin Li Soma Reddy. Data Compression To reduce the size of files stored on disk and to increase the effective rate of transmission by.
Dynamic Programming. Algorithm Design Techniques We will cover in this class: ◦ Greedy Algorithms ◦ Divide and Conquer Algorithms ◦ Dynamic Programming.
Compsci 100, Fall What is a transform? l Multiply two near-zero numbers, what happens?  Add their logarithms: log(a)+log(b) = log(ab), invertible.
Compsci 100, Spring What’s left to talk about? l Transforms  Making Huffman compress more  Understanding what transforms do Conceptual understanding,
Fall 2002CS 150: Intro. to Computing1 Streams and File I/O (That is, Input/Output) OR How you read data from files and write data to files.
Parallel Data Compression Utility Jeff Gilchrist November 18, 2003 COMP 5704 Carleton University.
Matlab tutorial course Lesson 4: Writing your own functions: programming constructs
Main Index Contents 11 Main Index Contents Complete Binary Tree Example Complete Binary Tree Example Maximum and Minimum Heaps Example Maximum and Minimum.
CS 106 Introduction to Computer Science I 03 / 02 / 2007 Instructor: Michael Eckmann.
Searching and Sorting Searching: Sequential, Binary Sorting: Selection, Insertion, Shell.
The Burrows-Wheeler Transform: Theory and Practice Article by: Giovanni Manzini Original Algorithm by: M. Burrows and D. J. Wheeler Lecturer: Eran Vered.
CSE 143 Lecture 22 Huffman slides created by Ethan Apter
Announcements Assignment 2 Out Today Quiz today - so I need to shut up at 4:25 1.
CPS 100, Spring Search, Backtracking,Heuristics l How do you find a needle in a haystack?  How does a computer play chess?  Why would you write.
CompSci From bits to bytes to ints  At some level everything is stored as either a zero or a one  A bit is a binary digit a byte is a binary.
CPS 100, Spring Burrows Wheeler Transform l Michael Burrows and David Wheeler in 1994, BWT l By itself it is NOT a compression scheme  It’s.
Sorting and Runtime Complexity CS255. Sorting Different ways to sort: –Bubble –Exchange –Insertion –Merge –Quick –more…
COMP9319 Web Data Compression and Search
Invitation to Computer Science, C++ Version, Fourth Edition
HUFFMAN CODES.
Information and Coding Theory
Huffman Coding Based on slides by Ethan Apter & Marty Stepp
CSE 311 Foundations of Computing I
CompSci 201 Data Representation & Huffman Coding
13 Text Processing Hongfei Yan June 1, 2016.
Invitation to Computer Science, Java Version, Third Edition
Image Coding and Compression
File Compression Even though disks have gotten bigger, we are still running short on disk space A common technique is to compress files so that they take.
Introduction to Computer Science
CENG 351 Data Management and File Structures
Insertion Sort and Shell Sort
Linear Time Sorting.
CPS 296.3:Algorithms in the Real World
Presentation transcript:

CPS 100e, Fall Burrows Wheeler Transform l Michael Burrows and David Wheeler in 1994, BWT l By itself it is NOT a compression scheme  It’s used to preprocess data, or transform data, to make it more amenable to compression like Huffman Coding  Huff depends on redundancy/repetition, as do many compression schemes l l l Main idea in BWT: transform the data into something more compressible and make the transform fast, though it will be slower than no transform  TANSTAAFL (what does this mean?)

CPS 100e, Fall David Wheeler ( ) l Invented subroutine l “Wheeler was an inspiring teacher who helped to develop computer science teaching at Cambridge from its inception in 1953, when the Diploma in Computer Science was launched as the world's first taught course in computing. ”

CPS 100e, Fall Mike Burrows l He's one of the pioneers of the information age. His invention of Alta Vista helped open up an entire new route for the information highway that is still far from fully explored. His work history, intertwined with the development of the high-tech industry over the past two decades, is distinctly a tale of scientific genius.

CPS 100e, Fall BWT efficiency BWT is a block transform – requires storing n copies of the file and O(n log n) to make the copies (file of length n)  We can’t really do this in practice in terms of storage  Instead of storing n copies of the file, store one copy and an integer index (break file into blocks of size n) But sorting is still O(n log n) and it’s actually worse  Each comparison in the sort looks at the entire file  In normal sort analysis the comparison is O(1), strings are small  Here we have comparison is O(n), so sort is actually…  O(n 2 log n), why?

CPS 100e, Fall BWT at 10,000 ft: big picture l Remember, goal is to exploit/create repetition (redundancy)  Create repetition as follows  Consider original text: duke blue devils.  Create n copies by shifting/rotating by one character 0: duke blue devils. 1: uke blue devils.d 2: ke blue devils.du 3: e blue devils.duk 4: blue devils.duke 5: blue devils.duke 6: lue devils.duke b 7: ue devils.duke bl 8: e devils.duke blu 9: devils.duke blue 10: devils.duke blue 11: evils.duke blue d 12: vils.duke blue de 13: ils.duke blue dev 14: ls.duke blue devi 15: s.duke blue devil 16:.duke blue devils

CPS 100e, Fall BWT at 10,000 ft: big picture l Once we have n copies (but not really n copies!)  Sort the copies  Remember the comparison will be O(n)  We’ll look at the last column, see next slide What’s true about first column? 4: blue devils.duke 9: devils.duke blue 16:.duke blue devils 5: blue devils.duke 10: devils.duke blue 0: duke blue devils. 3: e blue devils.duk 8: e devils.duke blu 11: evils.duke blue d 13: ils.duke blue dev 2: ke blue devils.du 14: ls.duke blue devi 6: lue devils.duke b 15: s.duke blue devil 7: ue devils.duke bl 1: uke blue devils.d 12: vils.duke blue de

CPS 100e, Fall |ees.kudvuibllde| |.bddeeeikllsuuv| 4: blue devils.duke 9: devils.duke blue 16:.duke blue devils 5: blue devils.duke 10: devils.duke blue 0: duke blue devils. 3: e blue devils.duk 8: e devils.duke blu 11: evils.duke blue d 13: ils.duke blue dev 2: ke blue devils.du 14: ls.duke blue devi 6: lue devils.duke b 15: s.duke blue devil 7: ue devils.duke bl 1: uke blue devils.d 12: vils.duke blue de l Properties of first column  Lexicographical order  Maximally ‘clumped’ why?  From it, can we create last? l Properties of last column  Some clumps (real files)  Can we create first? Why? l See row labeled 8:  Last char precedes first in original! True for all rows! l Can recreate everything:  Simple (code) but hard (idea)

CPS 100e, Fall What do we know about last column? l Contains every character of original file  Why is there repetition in the last column?  Is there repetition in the first column? l We keep the last column because we can recreate the first  What’s in every column of the sorted list?  If we have the last column we can create the first Sorting the last column yields first  We can create every column which means if we know what row the original text is in we’re done! Look back at sorted rows, what row has index 0?

CPS 100e, Fall BWT from a 5,000 ft view l How do we avoid storing n copies of the input file?  Store one copy and an index of what the first character is  0 and “duke blue devils.” is the original string  3 and “duke blue devils.” is “e blue devils. du”  What is 7 and “duke blue devils.” l You’ll be given a class Rotatable that can be sorted  Construct object from original text and index  When compared, use the index as a place to start  Rotatable can report the last char of any “row”  Rotatable can report it’s index (stored on construction)

CPS 100e, Fall BWT 2,000 feet l To transform all we need is the last column and the row at which the original string is in the list of sorted strings  We take these two pieces of information and either compress them or transform them further  After the transform we run Huff on the result l We can’t store/sort a huge file, what do we do?  Process big files in chunks/blocks Read block, transform block, Huff block Read block, transform block, Huff block… Block size may impact performance

CPS 100e, Fall Toward BWT from zero feet l First look at code for HuffProcessor.compress  Tree already made, preprocessCompress  How writeHeader works? writeCompressedData ? public int compress(InputStream in, OutputStream out) { BitOutputStream bos = new BitOutputStream(out); BitInputStream bis = new BitInputStream(in); int bitCount = 0; myRoot = makeTree(); makeMapEncodings(myRoot,””); bitCount += writeHeader(bout); bitCount += writeCompressedData(bis,bos); bout.flush(); return bitCount; }

CPS 100e, Fall BWT from zero feet, part I l Read a block of data, transform it, then huff it  To huff we write a magic number, write header/tree, and write compressed bits based on Huffman encodings  We already have huff code, we just need to use it on a transformed bunch of characters rather than on the input file So process input stream by passing it to BW transform which reads a chunk and returns char[], the last column  A char is a 16-bit, unsigned value, we only need 8-bit value, but use char because we can’t use byte In Java byte is signed, -256, What does all that mean?

CPS 100e, Fall Use what we have, need new stream l We want to use existing compression code we wrote before  Read a block of 8-bit/chunks, store in char[] array  Repeat until no more blocks, most blocks full, last not full ?  For each block as char[], treat as stream and feed it to Huff Count characters, make tree, compress l We need an Adapter, something that takes char[] array and turns it into an InputStream which we feed to Huff compressor  Java provides ByteArrayInputStream, turns byte[] to stream  We can store 8-bit chunks as bytes for stream purposes

CPS 100e, Fall ByteArrayInputStream and blocks public int compress(InputStream in, OutputStream out) { BitOutputStream bos = new BitOutputStream(out); BitInputStream bis = new BitInputStream(in); int bitCount = 0; BurrowsWheeler bwt = new BurrowsWheeler(); while (true){ char[] chunk = bw.transform(bis); if (chunk.length < 1) break; chunk = btw.mtf(chunk); byte[] array = new byte[chunk.length]; for(int k=0; k < array.length; k++){ array[k] = (byte) chunk[k]; } ByteArrayInputStream bas = new ByteArrayInputStream(array); preprocessInitialize(bas); myRoot = makeTree(); makeMapEncodings(myRoot,””); BitInputStream blockBis = new BitInputStream(new ByteArrayInputStream(array)); bitCount += writeHeader(bout); bitCount += writeCompressedData(Blockbis,bos); } bos.flush(); return bitCount; }

CPS 100e, Fall How do we untransform? l Untransforming is very slick  Basically sort the last column in O(n) time  Then run an O(n) algorithm to get back original block l We sort the last column in O(n) time using a counting sort, which is sometimes one phase of radix sort  We could just sort, that’s easier to code and a good first step  The counting sort leverages that we’re sorting “characters” --- whatever we read when doing compression which is an 8-bit chunk  How many different 8-bit chunks are there?

CPS 100e, Fall Counting sort l If we have an array of integers all of whose values are between 0 and 255, how can we sort by counting number of occurrences of each integer?  Suppose we have 4 occurrences of one, 1 occurrence of two, 3 occurrences of five and 2 occurrences of seven, what’s the sorted array? (we don’t know the original, just the counts)  What’s the answer? How do we write code to do this? l More than one way, as long as O(n) doesn’t matter really

CPS 100e, Fall Another transform: Move To Front l In practice we can introduce more repetition and redundancy using a Move-to-front transform (MTF)  We’re going to compress a sequence of numbers (the 8- bit chunks we read, might be the last column from BWT)  Instead of just writing the numbers, use MTF to write l Introduce more redundancy/repetition if there are runs of characters. For example: consider “AAADDDFFFF”  As numbers this is  Using MTF, start with index[k] = k 0,1,2,3,4,…,96,97,98,99,…,255  Search for 97, initially it’s at index[97], then MTF 97,0,1,2,3,4,5,…, 96,98,99,…,255

CPS 100e, Fall More on why MTF works l As numbers this is  Using MTF, start with index[k] = k  Search for 97, initially it’s at index[97], then MTF 97,0,1,2,3,4,5,…,96,98,99,100,101,…  Next time we search for 97 where is it? At 0! l So, to write out we actually write , then we write out 100, where is it? Still at 100, why? Then MTF:  100,97,0,1,2,3,…96,98,99,101,102,… l So, to write out we write:  97, 0, 0, 100, 0, 0, 102, 0, 0  Lots of zeros, ones, etc. Thus more Huffable, why?

CPS 100e, Fall Complexity of MTF and UMTF l Given n characters, we have to look through 256 indexes (worst case)  So, 256*n, this is …. O(n)  Average case is much better, the whole point of MTF is to find repeats near the beginning (what about MTF complexity?) l How to untransform, undo MTF, e.g., given  97, 0, 0, 100, 0, 0, 102, 0, 0 l How do we recover AAADDDFFF (97,97,97,100,100,…102)  Initially index[k] = k, so where is 97? O(1) look up, then MTF

CPS 100e, Fall Burrows Wheeler Summary l Transform data: make it more “compressable”  Introduce redundancy  First do BWT, then do MTF (latter provided)  Do this in chunks  For each chunk array (after BWT and MTF) huff it l To uncompress data  Read block of huffed data, uncompress it, untransform  Undo MTF, undo BWT: this code is given to you  Don’t forget magic numbers