Lecture 9: XML Compression


Semistructured Data / XML
- Semistructured data: loosely structured (no restrictions on tags and nesting relationships); no schema required
- XML falls under the "semistructured" umbrella: self-describing; the standard for information representation and exchange

An XML data file can be modeled as a tree
<Staff>
  <Name>
    <FirstName> Raymond </FirstName>
    <LastName> Wong </LastName>
  </Name>
  <Login> wong </Login>
  <Ext> 5932 </Ext>
</Staff>
Tree: Staff has children Name, Login ("wong") and Ext ("5932"); Name has children FirstName ("Raymond") and LastName ("Wong").
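A minimal sketch of the tree view, using Python's standard xml.etree.ElementTree parser. The helper name outline is an assumption made for this illustration; whitespace around the text values is stripped for readability.

```python
import xml.etree.ElementTree as ET

DOC = """<Staff>
  <Name>
    <FirstName>Raymond</FirstName>
    <LastName>Wong</LastName>
  </Name>
  <Login>wong</Login>
  <Ext>5932</Ext>
</Staff>"""

def outline(elem, depth=0):
    """Return the tree as indented lines: element name, then leaf text if any."""
    text = (elem.text or "").strip()
    line = "  " * depth + elem.tag + (f' "{text}"' if text else "")
    lines = [line]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(DOC))
print("\n".join(tree_lines))
```

Each nesting level of the document becomes one level of indentation, which is the tree the slide draws.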

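XPath evaluation
Document: <a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>
Query: /a/b[c = "12"]
[Figure: document tree. a has two b children; the first b has children c (12) and d (7), the second b has child c (7).]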

Query evaluation: top-down, bottom-up, hybrid

XPath evaluation
Document: <a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>
Query: /a/b[c = "12"]
Result: <b><c>12</c><d>7</d></b>
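The query above can be reproduced with the limited XPath subset of Python's xml.etree.ElementTree. This is only a sketch of the result, not of any particular evaluation strategy (top-down, bottom-up or hybrid).

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>")

# The document root is <a>, so the absolute path /a/b[c = "12"] becomes the
# relative path ./b[c='12'] when evaluated from the root element.
matches = root.findall("./b[c='12']")
result = [ET.tostring(m, encoding="unicode") for m in matches]
print(result)
```

Only the first b element qualifies, because the second one has a c child whose text is 7.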

Path indexing
- For semistructured / XML data, traversing the graph/tree is almost the same as query processing
- Normally, evaluation has to traverse the data from the root and return all nodes X reachable by a path matching the given regular path expression
- Motivation: an index allows the system to answer regular path expressions without traversing the whole graph/tree

Major criteria for indexing
- Speeds up the search (by cutting the search space down)
- Relatively smaller in size than the original data graph/tree
- Easy to maintain (during data loading and during updates)

An Example of DAG Data
[Figure: data graph with a root and nodes o1 to o13; edge labels: member, dept, support, staff, name, phone]

Index graph based on language-equivalence
- A reduced graph that summarizes all paths from the root in the data graph
- Example: the paths from the root to o12 are staff, dept/member and support/member

Language-equivalent nodes
- Let L(x) := {w | ∃ a path from the root to x labeled w}
- The set L(x) may be infinite when there are cycles
- Nodes x, y are language-equivalent (x ≡ y) if L(x) = L(y)
- We construct the index I by taking its nodes to be the equivalence classes of ≡

Language-equivalent
- The paths from the root to o3: staff, dept/member
- The paths to o4 happen to be exactly the same 2 sequences
- Same for o8 and o12
- Hence o3 ≡ o4 ≡ o8 ≡ o12

Equivalence classes
- o3 ≡ o4 ≡ o8 ≡ o12
- o1 ≡ o2 ≡ o7
- o12 ≡ o13
- o5 ≡ o6 ≡ o9
- o10
- o11
[Figure: the data graph (root, o1 to o13; edge labels member, dept, support, staff, name, phone) with the classes marked]
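A minimal sketch of how the classes could be computed by brute force on an acyclic graph: enumerate L(x) for every node and group nodes with identical sets. The graph below is hypothetical (the slide's figure did not survive), the function names are assumptions, and the path enumeration is exponential in the worst case, which is one reason practical systems use bisimulation instead.

```python
from collections import defaultdict

# Hypothetical DAG (not the slide's figure): (source, label, target) edges.
EDGES = [
    ("root", "dept", "d1"), ("root", "support", "s1"),
    ("root", "staff", "p1"), ("root", "staff", "p2"),
    ("d1", "member", "p1"), ("d1", "member", "p2"),
    ("s1", "member", "p3"),
    ("p1", "name", "n1"), ("p2", "name", "n2"), ("p3", "phone", "t1"),
]

def path_languages(edges, root="root"):
    """Map each node x to L(x), the set of label paths from the root to x.
    Terminates only on acyclic graphs, where every L(x) is finite."""
    out = defaultdict(list)
    for src, label, dst in edges:
        out[src].append((label, dst))
    langs = defaultdict(set)
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        langs[node].add(path)
        for label, dst in out[node]:
            stack.append((dst, path + (label,)))
    return langs

def equivalence_classes(edges):
    """Group nodes whose path languages are identical."""
    groups = defaultdict(list)
    for node, lang in path_languages(edges).items():
        groups[frozenset(lang)].append(node)
    return sorted(sorted(g) for g in groups.values())

classes = equivalence_classes(EDGES)
print(classes)
```

Here p1 and p2 are both reachable by staff and by dept/member, so they share a class, while p3 (reachable only by support/member) stays alone.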

The index graph
[Figure: index graph with nodes root; {o1, o2, o7}; {o3, o4, o8, o12}; {o12, o13}; {o5, o6, o9}; edge labels: member, support, staff, dept, name, phone]

Query processing based on the index graph
[Figure: index graph with nodes root; {o1, o2, o7}; {o3, o4, o8, o12}; {o12, o13}; {o5, o6, o9}; o10; o11; edge labels: member, support, staff, dept, name, phone]
dept/member/(name | phone)
-> dept/member/name UNION dept/member/phone
-> {o5, o6, o9} UNION {o10}
-> {o5, o6, o9, o10}
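The evaluation above can be sketched as a walk over the index graph followed by a union of extents. The extents and the final answer come from the slide; the edge layout, the node names I1 to I4 and the restriction to one child per label are assumptions made for this illustration.

```python
# Index graph: node -> {label: child index node}; extents hold data nodes.
# The edge layout is assumed; only the extents and the final answer come
# from the slide.
INDEX_EDGES = {
    "root": {"dept": "I1"},
    "I1": {"member": "I2"},
    "I2": {"name": "I3", "phone": "I4"},
}
EXTENT = {
    "I1": {"o1", "o2", "o7"},
    "I2": {"o3", "o4", "o8", "o12"},
    "I3": {"o5", "o6", "o9"},
    "I4": {"o10"},
}

def eval_path(steps, start="root"):
    """Each step is a set of alternative labels, so (name | phone) is
    {"name", "phone"}. Returns the union of the extents reached."""
    frontier = {start}
    for labels in steps:
        frontier = {
            INDEX_EDGES.get(node, {}).get(label)
            for node in frontier for label in labels
        } - {None}
    result = set()
    for node in frontier:
        result |= EXTENT.get(node, set())
    return result

answer = eval_path([{"dept"}, {"member"}, {"name", "phone"}])
print(sorted(answer))
```

The data graph itself is never touched: the query is answered from the index nodes and their extents alone.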

About this indexing scheme
- The index graph is never larger than the data graph
- In practice, the index graph is small enough to fit in memory
- Constructing the index is, however, a problem: checking whether two nodes are language-equivalent is very expensive (PSPACE-complete)
- An approximation based on bisimulation exists

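A Data Guide
[Figure: data guide rooted at root with edges dept, support and staff, followed by member, name and phone edges; extents shown: {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}, {o10}, {o11}; some extents appear under more than one node]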

About Data Guide
- Labels are unique at each node; hence extents are no longer disjoint
- Query processing proceeds as before
- The size of the index may be >= the data size
- Good for data that is regular and has no cycles

XML-Specific Compressors
- Non-queriable compression (e.g. XMill): full-chunked, data commonalities eliminated; very good compression ratio
- Queriable compression (e.g. XGrind, XPRESS): fine-grained, data commonalities ignored; inadequate compression ratio and time; supports simple path queries with atomic predicates

XMill
- First specialized compressor for XML data
- SAX parser for parsing XML data
- Still uses gzip as its underlying compressor
- Clever grouping of data into containers for compression
- Compresses XML via three basic techniques:
  - Compress the structure separately from the data
  - Group the data values according to their types
  - Apply semantic (specialized) compressors
- Downloadable: www.cs.washington.edu/homes/suciu/XMILL

XMill Architecture:

An Example: Web Server Logs
ASCII file, 15.9 MB (gzipped 1.6 MB):
202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
XML-ized Apache web log inflates to 24.2 MB (gzipped 2.1 MB):
<apache:entry>
  <apache:host> 202.239.238.16 </apache:host>
  <apache:requestLine> GET / HTTP/1.0 </apache:requestLine>
  <apache:contentType> text/html </apache:contentType>
  <apache:statusCode> 200 </apache:statusCode>
  <apache:date> 1997/10/01-00:00:02 </apache:date>
  <apache:byteCount> 4478 </apache:byteCount>
  <apache:referer> http://www.net.jp/ </apache:referer>
  <apache:userAgent> Mozilla/3.1[ja](I) </apache:userAgent>
</apache:entry>

How XMill Works: Three Ideas
Compress the structure separately from the data:
gzip(Structure) + gzip(Data) = 1.75MB
Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
Data: 202.239.238.16 GET / HTTP/1.0 text/html 200 …

How XMill Works: Three Ideas
Group the data values according to their types:
gzip(Structure) + gzip(Data1) + gzip(Data2) + ... = 1.33MB
Structure: <apache:entry> . . . </apache:entry>
Data1: 202.23.23.16 224.42.24.55 …
Data2: GET / HTTP/1.0 GET / HTTP/1.1 …
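The first two ideas can be sketched as follows. This is not XMill's actual container format: zlib stands in for gzip, values are grouped by the tag path they occur under, the "#" placeholder and the function names are assumptions, and the apache: prefix is dropped so the toy document parses without a namespace declaration.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def split_into_containers(xml_text):
    """Return (structure, containers): the tag skeleton as one string and the
    text values grouped by the tag path they appear under."""
    structure = []
    containers = defaultdict(list)

    def walk(elem, path):
        path = path + "/" + elem.tag
        structure.append("<" + elem.tag + ">")
        text = (elem.text or "").strip()
        if text:
            structure.append("#")            # placeholder for a data value
            containers[path].append(text)
        for child in elem:
            walk(child, path)
        structure.append("</>")

    walk(ET.fromstring(xml_text), "")
    return " ".join(structure), dict(containers)

def compress_parts(structure, containers):
    """Compress the structure and every container independently with zlib."""
    parts = {"structure": zlib.compress(structure.encode())}
    for path, values in containers.items():
        parts[path] = zlib.compress("\n".join(values).encode())
    return parts

LOG = """<log>
  <entry><host>202.239.238.16</host><statusCode>200</statusCode></entry>
  <entry><host>224.42.24.55</host><statusCode>404</statusCode></entry>
</log>"""

structure, containers = split_into_containers(LOG)
parts = compress_parts(structure, containers)
print(containers)
```

Because all host values end up next to each other, and likewise all status codes, the general-purpose compressor sees much more homogeneous input than it would in document order.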

How XMill Works: Three Ideas
Apply semantic (specialized) compressors:
gzip(Structure) + gzip(c1(Data1)) + gzip(c2(Data2)) + ... = 0.82MB
Examples:
- 8-, 16-, 32-bit integer encoding (signed/unsigned)
- differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)
- compressing lists and records (e.g. 104.32.23.1 → 4 bytes)
User input is needed to select the semantic compressor
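A small sketch of three such semantic encoders, using only the Python standard library. The function names are assumptions; they illustrate the idea and are not XMill's own compressors.

```python
import socket
import struct

def pack_ipv4(addr):
    """Encode a dotted-quad IPv4 address as 4 bytes."""
    return socket.inet_aton(addr)

def unpack_ipv4(raw):
    """Inverse of pack_ipv4."""
    return socket.inet_ntoa(raw)

def delta_encode(values):
    """Keep the first value, then store each value minus its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out

def pack_int16(values):
    """Fixed-width signed 16-bit encoding, big-endian."""
    return struct.pack(f">{len(values)}h", *values)

years = [1999, 1995, 2001, 2000, 1995]
deltas = delta_encode(years)
packed_ip = pack_ipv4("104.32.23.1")
print(deltas, len(packed_ip), len(pack_int16(deltas)))
```

The differences are small numbers with many repeated byte values, which a back-end compressor such as gzip handles better than the original decimal strings.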

Experiments

XML Compression

Compression Time

Transfer Time (& Decode)

XGRIND (Tolani & Haritsa, 2002)
- Encodes elements and attributes using XMill's approach
- DTD-conscious: enumerated attributes with k possible values are encoded using a log2 k-bit scheme
- Data values are encoded using non-adaptive Huffman coding
  - Requires two passes over the input document
  - Separate statistical model for each element/attribute
- Homomorphic compression: the compressed document retains the original structure
(Source slides: XML Compression Techniques, June 24, 2008)

XGRIND
Original fragment:
<student name="Alice">
  <a1>78</a1>
  <a2>86</a2>
  <midterm>91</midterm>
  <project>87</project>
</student>
Compressed fragment:
T0 A0 nahuff(Alice) T1 nahuff(78) / T2 nahuff(86) / T3 nahuff(91) / T4 nahuff(87) / /
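A minimal sketch of non-adaptive Huffman coding and of matching in the compressed domain. It is not XGrind's implementation: it works on characters of one value set, represents codes as strings of '0'/'1' for clarity, and all names and sample values other than "Alice" are assumptions. Pass 1 gathers frequencies and fixes the code; pass 2 encodes every value with that same code.

```python
import heapq
from collections import Counter

def huffman_code(texts):
    """Pass 1: build one fixed (non-adaptive) Huffman code from the character
    frequencies of all values. Returns {char: bitstring}."""
    freq = Counter("".join(texts))
    # Heap items: (weight, tie_breaker, {char: code_so_far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                      # single distinct symbol
        return {ch: "0" for ch in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, code):
    return "".join(code[ch] for ch in text)

def decode(bits, code):
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

values = ["Alice", "Alan", "Bob", "Alicia"]
code = huffman_code(values)
compressed = [encode(v, code) for v in values]       # pass 2

# Exact-match and prefix-match run on the compressed strings: compress the
# query constant with the same code and compare bit strings directly.
probe = encode("Ali", code)
prefix_hits = [v for v, c in zip(values, compressed) if c.startswith(probe)]
print(prefix_hits)
```

Because the code is fixed and prefix-free, the encoding of a prefix is a prefix of the encoding, which is what makes exact-match and prefix-match possible without decompression. An adaptive code would not have this property.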

XGRIND
- Many queries can be carried out entirely in the compressed domain: exact-match, prefix-match
- Some others require decompression of the relevant values only: range, substring
- Queryability comes at the expense of the achievable compression ratio: typically within 65-75% of that of XMill

ISX Requirements
- Space does matter for many applications
- Generally, reducing space improves cache locality
- Indirection is expensive
- Support fast navigation
- Support fast insertion and deletion
- Support efficient joins
- Separate topology, text and schema

ISX Goal
To find a space-efficient storage scheme for XML data without compromising either query or update performance

Proposed Storage Structure The ISX Structure

Sample DBLP XML Fragment

Balanced Parenthesis Encoding 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1
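A minimal sketch of navigation over the bit string shown on this slide. It assumes 0 stands for an open parenthesis and 1 for a close parenthesis (the string is balanced under that reading), identifies each node by the position of its open bit, and uses plain linear scans; the function names are assumptions.

```python
# Bit string from the slide, read with 0 = open and 1 = close (an assumption
# about the slide's convention).
BITS = [0,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,1,1]

def find_close(bits, i):
    """Position of the close bit matching the open bit at position i."""
    depth = 0
    for j in range(i, len(bits)):
        depth += 1 if bits[j] == 0 else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")

def first_child(bits, i):
    """Open position of the first child of node i, or None for a leaf."""
    return i + 1 if bits[i + 1] == 0 else None

def next_sibling(bits, i):
    """Open position of the next sibling of node i, or None."""
    j = find_close(bits, i) + 1
    return j if j < len(bits) and bits[j] == 0 else None

def subtree_size(bits, i):
    """Number of nodes in the subtree rooted at i (two bits per node)."""
    return (find_close(bits, i) - i + 1) // 2

def children(bits, i):
    """Yield the open positions of all children of node i, left to right."""
    child = first_child(bits, i)
    while child is not None:
        yield child
        child = next_sibling(bits, child)

print(list(children(BITS, 1)), subtree_size(BITS, 0))
```

The 24 bits describe 12 nodes, so the topology costs two bits per node and no pointers.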

Node Navigations

Topology Tiers
Each tier entry stores:
- No. of )
- No. of (
- No. of text nodes
- Min, max of forward excess
- Min, max of backward excess

Primitive operators

Topology Tiers
Each tier entry stores: no. of ), no. of (, no. of text nodes, min and max of forward excess, min and max of backward excess
Example: with excess 2, where is the close tag?
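The "where is the close tag?" question can be sketched with a single tier of block summaries. This is a simplification of the structure listed above: each block keeps only its net excess (derivable from the counts of open and close parentheses) and its minimum forward excess, the block size of 4 bits is artificially small, and 0 = open, 1 = close is assumed as before.

```python
BITS = [0,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,1,1]
BLOCK = 4  # bits per block; real systems use far larger blocks

def build_tier(bits, block=BLOCK):
    """Per block: (net excess change, minimum running excess inside it)."""
    tier = []
    for start in range(0, len(bits), block):
        excess, low = 0, 0
        for b in bits[start:start + block]:
            excess += 1 if b == 0 else -1
            low = min(low, excess)
        tier.append((excess, low))
    return tier

def find_close(bits, tier, i, block=BLOCK):
    """Matching close position for the open bit at i, skipping whole blocks."""
    depth = 0
    # 1. Finish the block containing i with a plain scan.
    end = min(len(bits), (i // block + 1) * block)
    for j in range(i, end):
        depth += 1 if bits[j] == 0 else -1
        if depth == 0:
            return j
    # 2. Skip whole blocks whose minimum excess cannot bring depth to 0.
    k = i // block + 1
    while k < len(tier) and depth + tier[k][1] > 0:
        depth += tier[k][0]
        k += 1
    # 3. Scan the one block that contains the match.
    for j in range(k * block, min(len(bits), (k + 1) * block)):
        depth += 1 if bits[j] == 0 else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")

tier = build_tier(BITS)
print(tier, find_close(BITS, tier, 0))
```

For the root at position 0 the scan leaves the first block with excess 4, skips four blocks whose minimum excess is only -2, and finishes inside the last block. Stacking further tiers over the block summaries would shorten the skipping loop in the same way.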

Tier 2 excess

Efficient Updates

Example 100 MB DBLP document 5 million XML nodes ISX: 1MB topology

Another example
Test machine: Core Duo 1.83GHz, 1GB RAM, 5400 RPM hard drive, MS Vista

100M DBLP            MSXML     ISX
Runtime (loading)    329MB     67MB
Loading time         17.8s     0.67s
Runtime (//www)      333MB     (not given)
//www                1.814s    0.143s

5M DBLP              MSXML     ISX
Runtime (loading)    15MB      4MB
Loading time         0.54s     0.035s
Runtime (//www)      21MB      (not given)
//www                0.096s    0.004s

ISX Features

Experiments Setup
- Memory buffer fixed at 64MB
- XML documents of up to 16 GB (e.g. the 16 GB DBLP document contains > 770 million nodes)
- No index or query optimization has been employed for ISX (except for ISX Stream, where the TurboXPath algorithm has been employed)

Storage Size (ISX vs NoK)

Storage Size (ISX, XMill, XGrind): DBLP

Storage Size (ISX, XMill): TreeBank

Bulk Loading Performance

Queries

Q1: //inproceedings

Q5: //article[.//month/text() = “July”]//title

Other queries

XPath: 13 axes
We can navigate along 13 axes: ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self

Node Navigation

Full document traversal

Update (Insertion) Performance

ISX Summary
- Small storage footprint
- Small runtime footprint
- Fast and consistent performance on navigational access
- Superior query performance (further indexing / query optimization can be added)
- Superior update performance

Compressing and Searching XML Data Via Two Zips Paolo Ferragina et al. Slides modified from P. Ferragina’s

An XML excerpt
It is verbose!
...
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>

A tree interpretation...
- XML document exploration ⇔ tree navigation
- XML document search ⇔ labeled subpath searches (a subset of XPath [W3C])

The Problem
We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
- Navigational operations: parent(u), child(u, i), child(u, i, c)
- Subpath searches: given a sequence P of k labels
- Content searches: subpath + substring search
- Visualization operation: given a node, visualize its descending subtree
Existing approaches and their limits:
- XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...): need the whole decompression
- XML-queriable compressors (like XPress, XGrind, XQzip, ...): poor compression and scan of the whole (compressed) file
- Summary indexes (like Dataguide, 1-index or 2-index): large space and no support for "content" searches
- In theory many solutions exist, starting from [Jacobson, IEEE FOCS '89]: no subpath/content searches, and poor performance on labeled trees
XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage

A transform for "labeled trees" [Ferragina et al, IEEE FOCS '05]
We proposed the XBW-transform, which mimics on trees the nice structural properties of the Burrows-Wheeler Transform on strings
The XBW linearizes the tree T in 2 arrays s.t.:
- the compression of T reduces to applying any compressor (gzip, bzip, ...) to these two arrays
- the indexing of T reduces to implementing simple rank/select query operations over these two arrays

The XBW-Transform
Step 1. Visit the tree in pre-order. For each node, write down its label (Sa) and the labels on its upward path (Sp). The result is a permutation of the tree nodes.
[Figure: example tree with the columns Sa (node labels) and Sp (upward labeled paths) in pre-order]

The XBW-Transform
Step 2. Stably sort according to Sp (the upward labeled paths).
[Figure: the columns Sa and Sp after sorting]

The XBW-Transform
Step 3. Add a binary array Slast marking the rows corresponding to last children.
Key fact: nodes correspond to items in <Slast, Sa>; this pair is the XBW.
[Figure: the columns Slast, Sa and Sp for the example tree]
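The three steps can be sketched in a few lines. The tree below is a hypothetical example (the slide's figure is not recoverable from the transcript), labels are assumed to be single characters so that an upward path can be stored as a string, and the names xbw and TREE are assumptions.

```python
def xbw(tree):
    """tree = (label, [subtrees]). Returns (Slast, Sa, Sp) as parallel lists."""
    rows = []

    def visit(node, path, is_last):
        # Step 1: pre-order visit, recording label and upward path.
        label, kids = node
        rows.append((path, is_last, label))
        for k, kid in enumerate(kids):
            visit(kid, label + path, k == len(kids) - 1)

    visit(tree, "", True)
    # Step 2: stable sort by upward path (list.sort is stable).
    rows.sort(key=lambda row: row[0])
    # Step 3: Slast marks rows that are the last child of their parent.
    s_last = [int(last) for _, last, _ in rows]
    s_a = [label for _, _, label in rows]
    s_p = [path for path, _, _ in rows]
    return s_last, s_a, s_p

# Hypothetical tree: root C with children B, A, B.
TREE = ("C", [
    ("B", [("D", [("c", [])]), ("a", [])]),
    ("A", [("b", [])]),
    ("B", [("D", [("a", [])])]),
])

s_last, s_a, s_p = xbw(TREE)
for row in zip(s_last, s_a, s_p):
    print(row)
```

In the sorted output the children of every node sit in consecutive rows ending with a 1 in Slast, and the child blocks of the two B nodes appear in the same order as the B nodes themselves.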

XBzip – a simple XML compressor
XBW is compressible:
- Sa and Spcdata are locally homogeneous
- Slast has some structure
[Figure: XBzip pipeline; one stream holds tags, attributes and the symbol "=", the other holds Pcdata]

Some structural properties
Two useful properties:
- Children are contiguous and delimited by 1s
- Children reflect the order of their parents
[Figure: example tree with its XBW columns Slast, Sa and Sp, with the child blocks highlighted]

XBW is navigational
- Rank-Select data structures on Slast and Sa
- The array C of |Σ| integers (in the example: A 2, B 5, C 9, D 12)
Get_children example: Rank(B, Sa) = 2, so select in Slast the 2nd 1 counting from the position given by array C; the children are found from there.
[Figure: example tree with its XBW columns Slast, Sa and Sp]

XBW is searchable (count subpaths)
Example: P = B D
- Start from the rows [fr, lr] whose Sp starts with 'B'
- Inductive step: pick the next char P[i+1], i.e. 'D'; search for the first and last 'D' in Sa[fr, lr] → jump to their children
- Their children have upward path = 'D B'
- 2 occurrences of P because of two 1s
XBW is searchable:
- Rank-Select data structures on Slast and Sa
- Array C of |Σ| integers
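A minimal sketch of Get_children and of subpath counting with rank and select. The arrays are the XBW of a small hypothetical tree (root C with children B, A, B), not the slide's figure. Rank and select are naive linear scans here, the dictionary F plays the role of the array of |Σ| integers, and the sketch assumes each label is used either only for internal nodes or only for leaves. All function names are assumptions.

```python
# XBW arrays of a small hypothetical tree (root C; children B, A, B), rows
# sorted by upward path Sp. Sp is kept only to derive F and is not needed
# afterwards.
S_LAST = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1]
S_A    = ["C", "b", "D", "a", "D", "B", "A", "B", "c", "a"]
S_P    = ["", "AC", "BC", "BC", "BC", "C", "C", "C", "DBC", "DBC"]

def rank(sym, seq, i):
    """Occurrences of sym in seq[0..i]; i = -1 gives 0."""
    return seq[:i + 1].count(sym)

def select(sym, seq, k):
    """Index of the k-th occurrence of sym (k >= 1); -1 when k == 0."""
    if k == 0:
        return -1
    seen = 0
    for idx, s in enumerate(seq):
        seen += s == sym
        if seen == k:
            return idx
    raise ValueError("fewer than k occurrences")

# F[c] = first row whose upward path starts with c, i.e. the first child of
# any node labelled c.
F = {}
for row, path in enumerate(S_P):
    if path:
        F.setdefault(path[0], row)

def child_range(c, k_first, k_last):
    """Rows holding the children of the k_first-th .. k_last-th nodes
    labelled c (1-based, counted in S_A order)."""
    z = rank(1, S_LAST, F[c] - 1)           # child blocks before c's blocks
    return select(1, S_LAST, z + k_first - 1) + 1, select(1, S_LAST, z + k_last)

def get_children(i):
    """Row range (first, last) of the children of row i, or None for a leaf."""
    c = S_A[i]
    if c not in F:
        return None
    r = rank(c, S_A, i)
    return child_range(c, r, r)

def count_subpath(labels):
    """Number of nodes reached by the downward label sequence `labels`."""
    if len(labels) == 1:
        return S_A.count(labels[0])
    if labels[0] not in F:
        return 0
    first, last = child_range(labels[0], 1, S_A.count(labels[0]))
    for pos, c in enumerate(labels[1:], start=1):
        k1, k2 = rank(c, S_A, first - 1), rank(c, S_A, last)
        if k1 == k2:
            return 0
        if pos == len(labels) - 1:
            return k2 - k1
        if c not in F:
            return 0
        first, last = child_range(c, k1 + 1, k2)
    return 0

print(get_children(0), count_subpath(["B", "D"]))
```

For P = B D the search starts from rows 2 to 4 (upward path starting with B) and finds two D labels there. Jumping to their children would give rows 8 to 9, whose two 1s in Slast give the same count, as on the slide. The sketch returns the count one step earlier so that a leaf label may end the pattern.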