James A. Edwards and Uzi Vishkin, University of Maryland

Introduction
- Lossless data compression: a common tool → better use of memory (e.g., disk space) and network bandwidth.
- Burrows-Wheeler (BW) compression, e.g., bzip2: relatively higher compression ratio (pro) but slower (con).
- Snappy (Google): lower compression ratio but fast. Example: for MPI on large machines, speed is critical.
- Our motivation: fast and high compression ratio.
- Unexpected: prior work unknown to us made the empirical follow-up ... stronger.
- Assumption throughout: fixed, constant-size alphabet.

State of the field
- Irregular algorithms: prevalent in the CS curriculum and in daily work (open-ended problems/programs).
- Yet, very limited support on today's parallel hardware; even more limited with strong scaling.
- Low support for irregular parallel code in HW → SW developers limit themselves to regular algorithms → HW benchmarks → HW optimized for regular code ...
- Namely, parallel data compression is of general interest as an undeniable application representing a big, underrepresented "application crowd".

“Truly parallel” BW compression
- Existing parallel approach: break the input into blocks and compress the blocks in parallel.
  - Practical drawback: good compression and speed only with large inputs.
  - Theory drawback: not really parallel.
- Truly parallel: compress the entire input using a parallel algorithm.
  - Works for both large and small inputs.
  - Can be combined with the block-based approach.
- Applications of small inputs:
  - Faster decompression and greater compression → better use of main memory [ISCA05] and cache [ISCA12].
  - Warehouse-scale computers: bandwidth between various pairs of nodes can be extremely different; for MPI and MapReduce, low bandwidth between pairs is debilitating [HP 5th ed.] (i.e., Snappy was a solution).

Attempts at truly parallel BW compression
- A 2011 survey paper [Eirola] stipulates that parallelizing BW could hardly work on a GPGPU, and that decompression would fall behind further:
  - Portions require “very random memory accessing”.
  - “…it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible.”
- The best GPGPU result: even more painful. In 2012, Patel et al. concurrently attempted to develop parallel code for BW compression on GPUs, but their best result was a 2.8x slowdown. (Patel reported separately a 1.2x speedup for decompression; hence, it is not referenced in the SPAA13 version.)

Stages of BW compression and decompression
Compression: S → Block-Sorting Transform (BST) → S_BST → Move-to-Front (MTF) encoding → S_MTF → Huffman encoding → S_BW
Decompression: S_BW → Huffman decoding → S_MTF → MTF decoding → S_BST → Inverse Block-Sorting Transform (IBST) → S

Inverse Block-Sorting Transform (IBST)
Serial algorithm:
1. Sort the characters of S_BST; the sorted order T[i] forms a ring i → T[i].
2. Starting with $, traverse the ring to recover S.
Parallel algorithm:
1. Use parallel integer sorting to find T[i].
2. Use parallel list ranking to traverse the ring.
Both steps require O(log n) time and O(n) work. On current parallel HW, list ranking is where you get stuck, which is why we chose this step.
(Figure: the linked ring i → T[i] for S_BST = annb$aa, from which S = banana$ is recovered.)
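To make the ring traversal concrete, here is a minimal serial sketch in Python (not the authors' code): stably sorting the characters of S_BST yields the ring i → T[i], and following it from the sentinel recovers S. In the parallel version, the sort becomes parallel integer sorting and the traversal becomes list ranking.

```python
def inverse_bst(bst):
    """Invert the Block-Sorting Transform.

    bst is the last column of the sorted rotation matrix, with a
    unique sentinel '$' marking the end of the original string.
    """
    n = len(bst)
    # Stable-sort the indices of bst by character; T[j] tells us, for
    # the j-th character in sorted (first-column) order, which position
    # of bst it came from. Following T is the "linked ring".
    T = sorted(range(n), key=lambda i: bst[i])
    out = []
    i = bst.index('$')  # start the traversal at the sentinel
    for _ in range(n):
        i = T[i]
        out.append(bst[i])
    return ''.join(out)

assert inverse_bst('annb$aa') == 'banana$'
```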

Conclusion and where to go from here?
- Despite being originally described as a serial algorithm, BW compression can be accomplished by a parallel algorithm.
  - Material for a few good exercises on prefix sums and list ranking?
  - For a more detailed description of our algorithm, see reference [4] in our brief announcement.
- This algorithm demonstrates the importance of parallel primitives such as prefix sums and list ranking.
- It requires support for fine-grained, irregular parallelism, and sometimes also strong scaling: issues on all current parallel hardware. Indeed:
  - Recent work from UC Davis (2012) on parallel BW compression on GPUs, which we missed, taxed ~20% of our originality (same Step 2), but it failed to achieve any speedup on compression; instead, a slowdown of 2.8x. For decompression: a 1.2x speedup.
  - On the UMD experimental Explicit Multi-Threading (XMT) architecture, we achieved speedups of 25x for compression and 13x for decompression [5].
  - On balance, the UC Davis paper is a huge gift: 70x vs. GPU for compression and 11x for decompression.

Where to go from here? Remaining options for the community:
- Figure out how to do it on current HW.
- Or, bash PRAM.
- Or, the alternative we pursued: develop a parallel algorithm that will work well on buildable HW designed to support the best-established parallel algorithmic theory.
Final thought, connecting to several other SPAA presentations:
- This is an example where MPI on large systems works in tandem with PRAM-like support on small systems: intra-node (of a large system), use PRAM compression and decompression algorithms for inter-node MPI messages.
- A counter-argument to an often unstated position: that we need the same parallel programming model at very large and small scales.

References
[4] J. A. Edwards and U. Vishkin. Parallel algorithms for Burrows-Wheeler compression and decompression. Technical report, University of Maryland.
[5] J. A. Edwards and U. Vishkin. Empirical speedup study of truly parallel data compression. Technical report, University of Maryland.

Block-Sorting Transform (BST)
Goal: bring occurrences of characters together.
Serial algorithm:
1. Form a list of all rotations of the input string.
2. Sort the list lexicographically.
3. Take the last column of the list as the output.
Equivalent to sorting the suffixes of the input string.
Example (input banana$):
  Rotations: banana$, anana$b, nana$ba, ana$ban, na$bana, a$banan, $banana
  Sorted:    $banana, a$banan, ana$ban, anana$b, banana$, na$bana, nana$ba
  Output (last column): annb$aa
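A minimal Python sketch of the serial transform exactly as described (it materializes all n rotations, so it is for illustration only; the input is assumed to end in a unique sentinel $):

```python
def bst(s):
    """Block-Sorting Transform by explicit rotations.

    s must end with a unique sentinel '$' that sorts before every
    other character. O(n^2 log n); illustrative, not efficient.
    """
    n = len(s)
    rotations = [s[i:] + s[:i] for i in range(n)]  # all n rotations
    rotations.sort()                               # lexicographic sort
    return ''.join(rot[-1] for rot in rotations)   # last column

assert bst('banana$') == 'annb$aa'
```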

Block-Sorting Transform (BST)
Parallel algorithm:
1. Find the suffix tree of S (O(log² n) time, O(n) work).
2. Find the suffix array SA of S by traversing the suffix tree (Euler tour technique: O(log n) time, O(n) work).
3. Permute characters according to SA: output S[SA[i] - 1] (O(1) time, O(n) work).
Example: for S = banana$, SA = [6, 5, 3, 1, 0, 4, 2] and S[SA[i] - 1] = annb$aa.
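The permutation in step 3 is easy to see in code. The sketch below substitutes a plain comparison sort of the suffixes for the parallel suffix-tree and Euler-tour construction of SA, but the last line is exactly the S[SA[i] - 1] permutation of this slide:

```python
def bst_via_suffix_array(s):
    """BST computed from the suffix array.

    Sorting suffix indices here stands in for the parallel
    suffix-tree + Euler-tour construction of SA.
    """
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # suffix array
    # S[SA[i] - 1]: the character preceding each suffix, wrapping around.
    return ''.join(s[(sa[i] - 1) % n] for i in range(n))

assert bst_via_suffix_array('banana$') == 'annb$aa'
```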

Move-to-Front (MTF) encoding
Goal: assign low codes to repeated characters.
Serial algorithm: maintain a list of characters in the order last seen; output each character's position in the list, then move it to the front.
Parallel algorithm: use prefix sums to compute the MTF list for each character (O(log n) time, O(n) work).
  Associative binary operator: X + Y = Y concat (X - Y).
Example: with initial list L0 = [$, a, b, n] and S_BST = annb$aa, the output is S_MTF = 1, 3, 0, 3, 3, 3, 0.
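A minimal serial sketch of the encoder; the parallel version instead combines per-character list updates with the associative operator above via prefix sums, which is omitted here:

```python
def mtf_encode(s, alphabet):
    """Serial Move-to-Front encoding: output each character's current
    position in the list, then move that character to the front."""
    lst = list(alphabet)
    out = []
    for c in s:
        j = lst.index(c)
        out.append(j)
        lst.pop(j)
        lst.insert(0, c)  # move to front
    return out

assert mtf_encode('annb$aa', '$abn') == [1, 3, 0, 3, 3, 3, 0]
```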

Move-to-Front (MTF) decoding
Same algorithm as encoding, with the following changes:
- Serial: the MTF lists are used in reverse (each code indexes into the current list instead of the character being searched for).
- Parallel: instead of combining MTF lists, combine permutation functions.
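The serial decoder is the mirror image of the encoder, using each code as an index into the current list rather than searching the list:

```python
def mtf_decode(codes, alphabet):
    """Serial Move-to-Front decoding: each code indexes into the
    current list (the reverse of the encoder's lookup)."""
    lst = list(alphabet)
    out = []
    for j in codes:
        c = lst[j]
        out.append(c)
        lst.pop(j)
        lst.insert(0, c)  # move to front, as in encoding
    return ''.join(out)

assert mtf_decode([1, 3, 0, 3, 3, 3, 0], '$abn') == 'annb$aa'
```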

Huffman Encoding
Goal: assign shorter bit strings to more-frequent MTF codes. The parallelization of this step is already well known.
Serial algorithm:
1. Count the frequencies of the characters.
2. Build the Huffman table based on the frequencies.
3. Encode the characters using the table.
Parallel algorithm:
1. Use integer sorting to count frequencies (O(log n) time, O(n) work).
2. Build the Huffman table using the (standard, heap-based) serial algorithm (O(1) time and work, since the alphabet has fixed constant size).
3. (a) Compute the prefix sums of the code lengths to determine where in the output to write the code for each character (O(log n) time, O(n) work). (b) Actually write the output (O(1) time, O(n) work).
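A sketch of step 2, the standard heap-based table construction; it returns a code length per symbol, which is all that steps 3(a) and 3(b) need. The symbols and frequencies below are from the running example and are for illustration only:

```python
import heapq

def huffman_code_lengths(freq):
    """Standard heap-based Huffman construction; returns the code
    length of each symbol. With a fixed constant-size alphabet the
    whole routine is O(1)."""
    # Heap entries: (weight, unique tiebreak, {symbol: depth}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Frequencies of the MTF codes 1, 3, 0, 3, 3, 3, 0 from the example:
print(huffman_code_lengths({0: 2, 1: 1, 3: 4}))  # {1: 2, 0: 2, 3: 1}
```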

Huffman Decoding
Serial algorithm: read through the compressed data, decoding one character at a time.
Parallel algorithm: partition the input and apply the serial algorithm to each partition.
- Problem: decoding cannot start in the middle of the codeword for a character.
- Solution: identify a set of valid starting bits using prefix sums (O(log n) time, O(n) work).

Huffman Decoding
How to identify valid starting positions: divide the input string into partitions of length l (the length of the longest Huffman codeword).
1. Assign a processor to each bit in the input. Processor i decodes the compressed input starting at index i and stops when it crosses a partition boundary, recording the index where it stopped (O(1) time, O(n) work). Now each partition has l pointers entering it, all of which originate from the immediately preceding partition.
2. Use prefix sums to merge consecutive pointers (O(log n) time, O(n) work). Now each partition still has l pointers entering it, but they all originate from the first partition.
3. For each bit in the input, mark it as a valid starting position if and only if the pointer that points to that bit originates from the first bit (index 0) of the first partition (O(1) time, O(n) work).
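A serial Python sketch of the idea, collapsing steps 2 and 3 into a single pointer-following pass from bit 0 (the real algorithm runs step 1 on every bit in parallel and composes the per-partition pointers with prefix sums; the codebook below is hypothetical):

```python
def valid_start_positions(bits, codebook, max_len):
    """Mark the bit indices where a codeword can validly begin.

    bits is a string of '0'/'1'; codebook maps codeword bit strings
    to symbols; max_len is the longest codeword length l.
    """
    n = len(bits)
    # Step 1 (parallel in the real algorithm): for every bit i, find
    # the index just past the single codeword starting there. Because
    # Huffman codes are prefix-free, the shortest match is the codeword.
    next_ = [None] * n
    for i in range(n):
        for k in range(1, max_len + 1):
            if bits[i:i + k] in codebook:
                next_[i] = i + k
                break
    # Steps 2-3, collapsed serially: follow the pointers from bit 0;
    # the indices visited are exactly the valid starting positions.
    valid, i = set(), 0
    while i < n and next_[i] is not None:
        valid.add(i)
        i = next_[i]
    return valid

# Hypothetical codebook for the running example: 3 -> '0', 0 -> '10', 1 -> '11'.
book = {'0': 3, '10': 0, '11': 1}
print(valid_start_positions('110100010', book, 2))  # {0, 2, 3, 5, 6, 7}
```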

Lossless data compression on GPGPU architectures (2011) [Eirola]
- Inverse BST: “Problems would possibly arise from poor GPU performance of the very random memory accessing caused by the scattering of characters throughout the string.”
- MTF decoding: “Speeding up decoding on GPGPU platforms might be more challenging since the character lookup is already constant time on serial implementations, and starting decoding from multiple places is difficult since the state of the stack is not known at the other places.”
- Huffman decoding: “Here again, decompression is harder. This is due to the fact that the decoder doesn’t know where one codeword ends and another begins before it has decoded the whole prior input.” “As for the codeword tables for the VLE, it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible.”