XML Compression Aslam Tajwala Kalyan Chakravorty.



Overview
- Motivation for XML compression
- Techniques for achieving XML compression
- XMill
- XMill architecture

Why Compress XML?
- The structured nature of XML makes it understandable to humans
- Downside: XML is verbose
  - Each non-empty element's start tag must be matched by a closing tag, e.g. <tag>data</tag>
  - The ordering of tags is often repeated in a document (e.g. multiple records)

Why Compress XML?: 2
- XML documents are text-based, so well-known compression schemes such as Huffman and LZ can be applied easily
- Compression can yield significant savings because XML is highly structured
- XML is increasingly used in real-time applications (e.g. web service-based e-commerce applications), which increases interest in reducing the overall size of XML documents

Using Huffman/LZ
- An XML document usually contains some degree of repetition (multiple occurrences of tags, attributes, or data values)
- Compression schemes like Huffman and LZ can use this repetition to achieve some degree of compression

Using Huffman/LZ: 2
- Many existing (and efficient) implementations of these algorithms are readily available (e.g. gzip)
- Downside: these techniques cannot fully exploit the structure of XML to achieve greater compression
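As a rough illustration of the point above, the sketch below runs a general-purpose compressor over a repetitive XML string. It assumes Python's standard `zlib` module (the DEFLATE library used by gzip, which combines LZ77 with Huffman coding); the sample document and its tag names are made up for the example.

```python
import zlib

# Hypothetical sample: many records that share the same tag layout.
records = "".join(
    f"<Employee><Name>Employee {i}</Name><Dept>Sales</Dept></Employee>"
    for i in range(200)
)
xml_text = f"<Employees>{records}</Employees>".encode("utf-8")

compressed = zlib.compress(xml_text, 9)  # DEFLATE = LZ77 + Huffman coding
restored = zlib.decompress(compressed)   # lossless: restores the exact bytes

print(len(xml_text), "bytes raw,", len(compressed), "bytes compressed")
```

The compressor treats the document as plain bytes, so it benefits from repeated tags but knows nothing about elements, attributes, or data types.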

Huffman Encoding Example
- Input text: ACDABA
- At one byte per character, these 6 characters occupy 6 bytes, or 48 bits
- A tree is built that replaces the symbols with shorter bit sequences
- In this particular case, the algorithm would use the following substitution table: A=0, B=10, C=110, D=111
- Encoded, ACDABA = 0 110 111 0 10 0, which is 11 bits
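The example can be reproduced with a short sketch of Huffman's algorithm. This is a minimal illustration, not the code behind the slide: the function name `huffman_codes` is an assumption, and because ties between equal weights are broken arbitrarily, the individual codes may differ from the table above while the total length stays the same.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code for the symbols in text using Huffman's algorithm."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("ACDABA")
encoded = "".join(codes[ch] for ch in "ACDABA")
print(codes, len(encoded), "bits")  # total length is 11 bits
```

The most frequent symbol (A, three occurrences) receives the 1-bit code, and the three symbols that occur once share the 2-bit and 3-bit codes.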

LZ77 Example (Dictionary-Based Compressors)
- Lempel-Ziv 77 algorithm
- The dictionary is a portion of the previously encoded sequence
- The encoder examines the input sequence through a sliding window
- The window consists of two parts:
  - a search buffer that contains a portion of the recently encoded sequence
  - a look-ahead buffer that contains the next portion of the sequence to be encoded
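The sliding-window idea can be sketched as follows. This is a simplified, unoptimized illustration: the function names, the buffer sizes, and the (offset, length, next character) triple format are assumptions chosen for readability, not the format used by gzip or XMill.

```python
def lz77_encode(data, search_size=16, lookahead_size=8):
    """Encode data as (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(data):
        start = max(0, i - search_size)          # left edge of the search buffer
        best_off, best_len = 0, 0
        # Reserve one symbol so every triple can carry a literal next_char.
        max_len = min(lookahead_size, len(data) - i - 1)
        for j in range(start, i):
            length = 0
            while length < max_len and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Rebuild the text by copying from already decoded output."""
    buf = []
    for off, length, ch in triples:
        for _ in range(length):
            buf.append(buf[-off])                # copy one symbol from 'off' back
        buf.append(ch)
    return "".join(buf)

sample = "<a>x</a><a>y</a><a>z</a>"
triples = lz77_encode(sample)
print(len(sample), "characters ->", len(triples), "triples")
```

Repeated tags such as `<a>` and `</a>` are found in the search buffer and replaced by back-references, which is why markup-heavy text compresses well under this scheme.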

XMill (Liefke and Suciu, 2000)
- Relies heavily on zlib, the compression library used in gzip
- Also defines a few data-type-specific compressors; user-defined compressors can be added using SCAPI (Semantic Compressor API)
- During compression, each XML tag is examined to see which compression technique(s) should be applied

XML Compression
- View XML as a tree
- Separate the tree structure from what is stored in the leaves
- Save the tree structure so that it can be restored
- The compressed file may or may not remember the tree structure

XMill: Compression Strategy
XMill applies 3 principles during compression:
- Separate structure (element tags and attribute names) from data
- Group related data items in a single container; compress each container separately
- Apply appropriate semantic compressors to each container

XMill – Separating Structure From Content
- Start tags and attribute names are dictionary-encoded (as T1, T2, etc.)
- End tags are replaced with a ‘/’ token
- Data values are replaced with their container number

XMill – Separating Structure From Content 2
Source document (markup not preserved; data values: Homer Simpson, Frank Grimes)
Dictionary
  T1 => Employees
  T2 => Employee
  T3 => (name not preserved)
Structure Container
  T1 T2 T3 C3 / C4 / T2 T3 C3 / C4 / /
Container C3
  1
  2
Container C4
  Homer Simpson
  Frank Grimes
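The separation step can be sketched with Python's standard `xml.sax` module. This is an illustration of the idea only, not XMill's implementation. The input document is an assumed reconstruction of the example (in particular, the attribute name `id` is hypothetical), the class name `Separator` is invented, and container numbers start at 3 purely to mirror the numbering in the example.

```python
import xml.sax

class Separator(xml.sax.ContentHandler):
    """Split a document into a name dictionary, a structure stream, and containers."""

    def __init__(self):
        super().__init__()
        self.dictionary = {}   # tag or attribute name -> "T<n>"
        self.containers = {}   # path -> (container id, list of data values)
        self.structure = []    # token stream: T<n>, C<n>, "/"
        self.path = []
        self._text = []

    def _tag(self, name):
        return self.dictionary.setdefault(name, f"T{len(self.dictionary) + 1}")

    def _store(self, key, value):
        # Assumption: ids start at 3 only to match the example's C3 and C4.
        cid, values = self.containers.setdefault(key, (len(self.containers) + 3, []))
        values.append(value)
        self.structure.append(f"C{cid}")

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self._store("/".join(self.path), text)

    def startElement(self, name, attrs):
        self._flush_text()
        self.path.append(name)
        self.structure.append(self._tag(name))
        for attr_name, value in attrs.items():
            self.structure.append(self._tag("@" + attr_name))
            self._store("/".join(self.path) + "/@" + attr_name, value)
            self.structure.append("/")

    def characters(self, content):
        self._text.append(content)   # the parser may deliver text in pieces

    def endElement(self, name):
        self._flush_text()
        self.structure.append("/")
        self.path.pop()

doc = ('<Employees><Employee id="1">Homer Simpson</Employee>'
       '<Employee id="2">Frank Grimes</Employee></Employees>')
handler = Separator()
xml.sax.parseString(doc.encode("utf-8"), handler)
print(" ".join(handler.structure))   # T1 T2 T3 C3 / C4 / T2 T3 C3 / C4 / /
```

Grouping values by path puts all `id` values in one container and all names in another, so each container holds similar data that can be compressed separately.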

XMill: Container Expressions
- Users can override default settings using the container expression language
  - Specify container membership
  - Specify which semantic compressor(s) are applied to each container
- E.g. to indicate that all ‘Name’ and ‘Location’ tags should be grouped in the same container:
  xmill -p //(Name | Location) employees.xml

XMill: Semantic Compressors
Compressor   Description
t            Default text compressor (gzipped only)
u            Compressor for positive integers (binary encoded using 1-4 bytes)
i            Compressor for integers
u8           Compressor for positive integers < 256
di           Differential compressor for integers
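Two of these ideas are easy to sketch. The code below illustrates differential encoding (the idea behind `di`) and one-byte packing of small integers (the idea behind `u8`). The function names and byte layout are assumptions for illustration; XMill's actual on-disk encodings are not given in the slides.

```python
def delta_encode(values):
    """Store the first value, then each value's difference from its predecessor."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

def pack_u8(values):
    """Pack integers in the range 0..255 into one byte each."""
    return bytes(values)   # raises ValueError if a value does not fit in a byte

ids = [1000, 1001, 1002, 1005, 1006]
print(delta_encode(ids))          # [1000, 1, 1, 3, 1]
print(len(pack_u8([7, 42, 255]))) # 3
```

Slowly increasing sequences such as record identifiers turn into runs of small numbers, which a general-purpose compressor then handles far better than the original digits.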

XMill: Semantic Compressors 2
Compressor   Description
rl           Run-length encoder (stores a single copy of a sequence, its length, and repetition count)
e            Enumeration encoder (dictionary)
“…”          Constant compressor; outputs nothing, and is used to check that the current token is a specified constant value
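A simplified sketch of the run-length and enumeration ideas follows. The run-length version here collapses runs of a single repeated value, which is narrower than the sequence-based encoder described in the table, and all function names are assumptions rather than XMill's API.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]

def run_length_decode(runs):
    return [v for v, count in runs for _ in range(count)]

def enum_encode(values):
    """Replace each distinct value with a small integer index into a table."""
    table = {}
    codes = [table.setdefault(v, len(table)) for v in values]
    return list(table), codes

def enum_decode(symbols, codes):
    return [symbols[c] for c in codes]

depts = ["Sales", "Sales", "Sales", "IT", "IT", "Sales"]
print(run_length_encode(depts))  # [('Sales', 3), ('IT', 2), ('Sales', 1)]
print(enum_encode(depts))        # (['Sales', 'IT'], [0, 0, 0, 1, 1, 0])
```

Enumeration suits containers with few distinct values (for example a department or status field), while run-length encoding pays off when identical values appear back to back.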

XMill: Semantic Compressors 3
- The text compressor is applied to each element by default
- The user can add other instructions via the command line:
  xmill -p //price=>i file.xml
- This applies the integer compressor to each occurrence of the ‘price’ element in file.xml

XMill Architecture (1/3)

XMill Architecture (2/3)
- SAX Parser: sends tokens to the path processor
- Path Processor: determines how to map data values to containers
- Semantic Compressors: compress the input and copy it to the container in the memory window
  - E.g. binary encoding of integers, differential compressors
- When the window is filled, all containers are gzipped and stored on disk, and compression resumes
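The window behaviour in the last step can be sketched as below. This is a hypothetical model, not XMill's code: the class name `ContainerWindow`, the window size, and the use of an in-memory list in place of disk output are all assumptions, and `zlib` stands in for the gzip step.

```python
import zlib

class ContainerWindow:
    """Buffer container data in memory and flush compressed blocks when full."""

    def __init__(self, window_bytes=1024):
        self.window_bytes = window_bytes   # assumed size, chosen for the demo
        self.containers = {}               # container id -> buffered bytes
        self.used = 0
        self.blocks = []                   # stands in for blocks written to disk

    def add(self, container_id, data):
        self.containers.setdefault(container_id, bytearray()).extend(data)
        self.used += len(data)
        if self.used >= self.window_bytes:
            self.flush()

    def flush(self):
        if not self.containers:
            return
        # Each container is compressed on its own, then the window is emptied.
        block = {cid: zlib.compress(bytes(buf)) for cid, buf in self.containers.items()}
        self.blocks.append(block)
        self.containers.clear()
        self.used = 0

window = ContainerWindow(window_bytes=10)
window.add("C3", b"12345")
window.add("C4", b"abcdef")   # 11 bytes buffered, so the window flushes
print(len(window.blocks))     # 1
```

A real compressor would call `flush()` once more at end of input so that a partially filled window is also written out.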

Performance Evaluation (1/2)

Performance Evaluation (2/2)

References
- XMill: An Efficient Compressor for XML Data
- XGrind: A query-friendly compressor
- suciu/COURSES/590DS/19compression.ppt

Questions?