Lecture 9: XML Compression
Semistructured Data / XML: loosely structured (no restrictions on tags and nesting relationships); no schema required; XML falls under the “semistructured” umbrella; self-describing; the standard for information representation and exchange
An XML data file can be modeled as a tree: <Staff> <Name> <FirstName> Raymond </FirstName> <LastName> Wong </LastName> </Name> <Login> wong </Login> <Ext> 5932 </Ext> </Staff> Tree: Staff has children Name, Login (“wong”) and Ext (“5932”); Name has children FirstName (“Raymond”) and LastName (“Wong”)
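A minimal sketch of the tree view of the Staff document, using Python's standard `xml.etree.ElementTree`. The helper name `to_tree` and the `(tag, children-or-text)` tuple layout are assumptions made for illustration, not part of the lecture.

```python
import xml.etree.ElementTree as ET

DOC = """<Staff>
  <Name> <FirstName> Raymond </FirstName> <LastName> Wong </LastName> </Name>
  <Login> wong </Login>
  <Ext> 5932 </Ext>
</Staff>"""

def to_tree(elem):
    """Return (tag, children) for inner nodes and (tag, text) for leaves."""
    kids = [to_tree(child) for child in elem]
    text = (elem.text or "").strip()
    return (elem.tag, kids if kids else text)

tree = to_tree(ET.fromstring(DOC))
print(tree)
```

Element nodes become inner nodes and text values become leaves, which is the shape drawn on the slide.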
XPath evaluation: document <a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>; query /a/b[c = "12"]; tree: a has two b children, the first with c = 12 and d = 7, the second with c = 7
Query evaluation Top-down Bottom-up Hybrid
XPath evaluation: document <a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>; query /a/b[c = "12"]; result: <b><c>12</c><d>7</d></b>; tree: a has two b children, the first with c = 12 and d = 7, the second with c = 7
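The query on this slide can be tried with the limited XPath support in Python's `xml.etree.ElementTree`. This is only a sketch: because `root` is already the `a` element, the query is written relative to it, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>")

# /a/b[c = "12"]: b children of a that have a c child whose text is "12".
matches = root.findall("b[c='12']")
result = [ET.tostring(m, encoding="unicode") for m in matches]
print(result)
```

Only the first `b` qualifies; the second has `c = 7` and is filtered out by the predicate.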
Path indexing: Traversing the graph/tree is almost the same as query processing for semistructured / XML data. Normally, evaluation has to traverse the data from the root and return all nodes X reachable by a path matching the given regular path expression. Motivation: an index allows the system to answer regular path expressions without traversing the whole graph/tree
Major criteria for indexing: speed up the search (by cutting the search space down); relatively smaller size than the original data graph/tree; easy to maintain (during data loading and during updates)
An Example of DAG Data (figure: a DAG with nodes root and o1 to o13, and edge labels member, dept, support, staff, name, phone)
Index graph based on language-equivalence: a reduced graph that summarizes all paths from the root in the data graph. Example, the paths from root to o12: staff; dept/member; support/member
Language-equivalent nodes: Let L(x) := {w | there is a path from the root to x labeled w}. The set L(x) may be infinite when there are cycles. Nodes x, y are language-equivalent (x ≡ y) if L(x) = L(y). We construct the index I by taking its nodes to be the equivalence classes of ≡
Language-equivalent: the paths from root to o3 are staff and dept/member. The paths to o4 happen to be exactly the same 2 sequences. Same for o8 and o12, so o3, o4, o8, o12 are language-equivalent
Equivalence classes (figure: the data graph with its nodes grouped into classes): {o3, o4, o8, o12}; {o1, o2, o7}; {o12, o13}; {o5, o6, o9}; o10; o11. Edge labels: member, dept, support, staff, name, phone
The index graph (figure): nodes root; {o1, o2, o7}; {o3, o4, o8, o12}; {o12, o13}; {o5, o6, o9}. Edge labels: member, support, staff, dept, name, phone
Query processing based on the index graph (figure: index graph with nodes root; {o1, o2, o7}; {o3, o4, o8, o12}; {o12, o13}; {o5, o6, o9}; o10; o11 and edge labels member, support, staff, dept, name, phone). Example: dept/member/(name | phone) -> dept/member/name UNION dept/member/phone -> {o5, o6, o9} UNION {o10} -> {o5, o6, o9, o10}
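The grouping into language-equivalence classes and the union-style query evaluation can be sketched as follows. The edge list is a small hypothetical DAG, not the slide's figure (which is not recoverable from the text), and the function names are illustrative. The sketch enumerates every root path, so it only works on acyclic data and can be exponential; it matches paths against each class's language instead of walking the index graph edge by edge.

```python
from collections import defaultdict

# Hypothetical DAG, given as (source, edge label, target) triples.
EDGES = [
    ("root", "dept", "o1"), ("root", "dept", "o2"),
    ("root", "staff", "o3"), ("root", "staff", "o4"),
    ("o1", "member", "o3"), ("o2", "member", "o4"),
    ("o3", "name", "o5"), ("o4", "name", "o6"), ("o4", "phone", "o10"),
]

def path_languages(edges, root="root"):
    """L(x): the set of label sequences on paths from the root to x."""
    out = defaultdict(list)
    for src, label, dst in edges:
        out[src].append((label, dst))
    lang = defaultdict(set)

    def walk(node, word):
        lang[node].add(word)
        for label, dst in out[node]:
            walk(dst, word + (label,))

    walk(root, ())
    return lang

def index_extents(lang):
    """Group nodes with identical L(x); each group is one index node."""
    extents = defaultdict(set)
    for node, words in lang.items():
        extents[frozenset(words)].add(node)
    return extents

def evaluate(extents, path):
    """Return the union of the extents whose language contains the path."""
    word = tuple(path.split("/"))
    hits = set()
    for words, extent in extents.items():
        if word in words:
            hits |= extent
    return hits

extents = index_extents(path_languages(EDGES))
classes = sorted(sorted(e) for e in extents.values())
# dept/member/(name | phone) rewritten as a union of two simple paths
answer = evaluate(extents, "dept/member/name") | evaluate(extents, "dept/member/phone")
print(classes, answer)
```

In this toy graph o3 and o4 share the language {staff, dept/member}, so they fall into one class, mirroring the o3/o4 argument on the earlier slide.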
About this indexing scheme: The index graph is never larger than the data. In practice, the index graph is small enough to fit in memory. Constructing the index is however a problem: checking whether two nodes are language-equivalent is very expensive (PSPACE). An approximation based on bisimulation exists
A Data Guide root dept support staff o11 o1, o2, o7 o3, o4, o8, o12 member phone member name o12, o13 o3, o4, o8, o12 o5, o6, o9 o10 phone name o5, o6, o9 o10
About Data Guide: unique labels at each node (at most one outgoing edge per label); hence extents are no longer disjoint; query processing proceeds as before; the index may be as large as, or larger than, the data; good for data that is regular and has no cycles
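One way to picture a data guide is as the determinization of the data graph: each guide node is the set of data nodes reached by one label path. The sketch below assumes that reading; the edge list is hypothetical and the names `data_guide` and `lookup` are illustrative.

```python
from collections import defaultdict

# Hypothetical data graph: o3 is reachable via both staff and dept/member.
EDGES = [
    ("root", "dept", "o1"),
    ("root", "staff", "o3"), ("root", "staff", "o4"),
    ("o1", "member", "o3"),
    ("o3", "name", "o5"), ("o4", "name", "o6"),
]

def data_guide(edges, root="root"):
    """Subset construction: map each extent to {label: target extent}."""
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)
    guide = {}
    todo = [frozenset([root])]
    while todo:
        extent = todo.pop()
        if extent in guide:
            continue
        targets = defaultdict(set)
        for node in extent:
            for label, dsts in out[node].items():
                targets[label] |= dsts
        guide[extent] = {label: frozenset(t) for label, t in targets.items()}
        todo.extend(guide[extent].values())
    return guide

def lookup(guide, path, root="root"):
    """Follow one label per step; each guide node has unique outgoing labels."""
    extent = frozenset([root])
    for label in path.split("/"):
        extent = guide[extent].get(label, frozenset())
        if not extent:
            break
    return set(extent)

guide = data_guide(EDGES)
print(len(guide), lookup(guide, "staff/name"), lookup(guide, "dept/member/name"))
```

Here o3 appears in two extents, {o3, o4} and {o3}, which illustrates why extents are no longer disjoint, and the guide ends up with as many nodes as the data graph.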
XML-Specific Compressors. Non-queryable compression (e.g. XMill): full-chunked, data commonalities eliminated; very good compression ratio. Queryable compression (e.g. XGrind, XPRESS): fine-grained, data commonalities ignored; inadequate compression ratio and time; supports simple path queries with atomic predicates
XMill: first specialized compressor for XML data; uses a SAX parser for parsing XML data; still uses gzip as its underlying compressor; clever grouping of data into containers for compression. Compresses XML via three basic techniques: (1) compress the structure separately from the data; (2) group the data values according to their types; (3) apply semantic (specialized) compressors. Downloadable: www.cs.washington.edu/homes/suciu/XMILL
XMill Architecture:
An Example: Web Server Logs. ASCII file, 15.9 MB (gzipped 1.6 MB): 202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I) The XML-ized Apache web log inflates to 24.2 MB (gzipped 2.1 MB): <apache:entry> <apache:host> 202.239.238.16 </apache:host> <apache:requestLine> GET / HTTP/1.0 </apache:requestLine> <apache:contentType> text/html </apache:contentType> <apache:statusCode> 200</apache:statusCode> <apache:date> 1997/10/01-00:00:02</apache:date> <apache:byteCount> 4478</apache:byteCount> <apache:referer> http://www.net.jp/ </apache:referer> <apache:userAgent> Mozilla/3.1[ja](I)</apache:userAgent> </apache:entry>
How XMill Works: Three Ideas. (1) Compress the structure separately from the data: gzip(Structure) + gzip(Data) = 1.75MB, where Structure is <apache:entry> <apache:host> </apache:host> . . . </apache:entry> and Data is 202.239.238.16 GET / HTTP/1.0 text/html 200 …
How XMill Works: Three Ideas. (2) Group the data values according to their types: gzip(Structure) + gzip(Data1) + gzip(Data2) + ... = 1.33MB, where Structure is <apache:entry> . . . </apache:entry>, Data1 is 202.23.23.16 224.42.24.55 … and Data2 is GET / HTTP/1.0 GET / HTTP/1.1 …
How XMill Works: Three Ideas. (3) Apply semantic (specialized) compressors: gzip(Structure) + gzip(c1(Data1)) + gzip(c2(Data2)) + ... = 0.82MB. Examples: 8, 16, 32-bit integer encoding (signed/unsigned); differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...); compressing lists and records (e.g. 104.32.23.1 -> 4 bytes). User input is needed to select the semantic compressor
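The three ideas can be sketched in a few lines. This is not XMill's actual container format: the tiny log document, the structure tokens (`<tag`, `#`, `>`), grouping by tag name, and the helper names are all assumptions, and `zlib` (the same deflate algorithm gzip uses) stands in for gzip.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, unprefixed version of the web-log entries.
LOG = """<log>
<entry><host>202.239.238.16</host><status>200</status><bytes>4478</bytes></entry>
<entry><host>202.239.238.17</host><status>200</status><bytes>4490</bytes></entry>
<entry><host>202.239.238.16</host><status>404</status><bytes>512</bytes></entry>
</log>"""

def split_structure_and_data(xml_text):
    """Ideas 1 and 2: one structure stream plus one data container per tag."""
    structure = []
    containers = defaultdict(list)

    def walk(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            structure.append("#")            # placeholder for a data value
            containers[elem.tag].append(text)
        for child in elem:
            walk(child)
        structure.append(">")

    walk(ET.fromstring(xml_text))
    return structure, dict(containers)

def delta_encode(values):
    """Idea 3, one semantic compressor: keep differences (non-empty input)."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def compressed_size(text):
    return len(zlib.compress(text.encode("utf-8")))

structure, containers = split_structure_and_data(LOG)
sizes = {"structure": compressed_size(" ".join(structure))}
for tag, values in containers.items():
    sizes[tag] = compressed_size("\n".join(values))
print(sizes)
print(delta_encode([1999, 1995, 2001, 2000, 1995]))
```

On a three-entry toy document the compressed sizes are dominated by header overhead; the savings quoted on the slides only show up on large, repetitive inputs.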
Experiments
XML Compression
Compression Time
Transfer Time (& Decode)
XGRIND (Tolani & Haritsa, 2002) Encodes elements and attributes using XMill’s approach DTD-conscious: enumerated attributes with k possible values are encoded using a log2 k-bit scheme Data values are encoded using non-adaptive Huffman coding Requires two passes over the input document Separate statistical model for each element/attribute Homomorphic compression: compressed document retains original structure June 24, 2008 XML Compression Techniques 31
XGRIND. Original fragment: <student name="Alice"> <a1>78</a1> <a2>86</a2> <midterm>91</midterm> <project>87</project> </student> Compressed fragment: T0 A0 nahuff(Alice) T1 nahuff(78) / T2 nahuff(86) / T3 nahuff(91) / T4 nahuff(87) / /
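The homomorphic layout of the compressed fragment can be reproduced with a short sketch. It only emits the token sequence: `nahuff(...)` stays a textual placeholder for the non-adaptive Huffman code of a value, and assigning tag and attribute codes in order of first appearance is an assumption, as is the function name.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<student name="Alice">
  <a1>78</a1> <a2>86</a2> <midterm>91</midterm> <project>87</project>
</student>"""

def xgrind_tokens(xml_text):
    """Emit tag codes, attribute codes, value placeholders and '/' end markers."""
    tags, attrs, out = {}, {}, []

    def code(table, prefix, name):
        if name not in table:
            table[name] = f"{prefix}{len(table)}"
        return table[name]

    def walk(elem):
        out.append(code(tags, "T", elem.tag))
        for name, value in elem.attrib.items():
            out.append(code(attrs, "A", name))
            out.append(f"nahuff({value})")   # placeholder, not a real Huffman code
        text = (elem.text or "").strip()
        if text:
            out.append(f"nahuff({text})")
        for child in elem:
            walk(child)
        out.append("/")                      # end-of-element marker

    walk(ET.fromstring(xml_text))
    return " ".join(out)

tokens = xgrind_tokens(FRAGMENT)
print(tokens)
```

Because every start tag, value and end tag keeps its position, a path query can be matched against the token stream without decompressing the values.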
XGRIND. Many queries can be carried out entirely in the compressed domain: exact-match, prefix-match. Some others require only decompression of the relevant values: range, substring. Queryability comes at the expense of achievable compression ratio: typically within 65-75% of that of XMill
ISX Requirements Space does matter for many applications Generally reducing space improves cache locality Indirection is expensive Support fast navigations Support fast insertion and deletion Support efficient joins Separate topology, text and schema
ISX Goal: to find a space-efficient storage scheme for XML data without compromising either query or update performance
Proposed Storage Structure The ISX Structure
Sample DBLP XML Fragment
Balanced Parenthesis Encoding 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1
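A minimal sketch of the encoding, assuming 0 stands for an open parenthesis (start tag) and 1 for a close parenthesis (end tag). The tree shape below is one that is consistent with the bit string on the slide; it is not claimed to be the DBLP fragment itself, and the nested-list representation is an illustrative choice.

```python
# Bit string from the slide.
SLIDE_BITS = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
              0, 0, 1, 1, 0, 0, 1, 1, 1, 1]

def encode(node):
    """A node is the list of its children: emit 0, the children, then 1."""
    bits = [0]
    for child in node:
        bits += encode(child)
    bits.append(1)
    return bits

def decode(bits):
    """Rebuild the nested-list tree from a balanced bit string."""
    stack = [[]]                 # sentinel that will hold the root
    for bit in bits:
        if bit == 0:
            node = []
            stack[-1].append(node)
            stack.append(node)
        else:
            stack.pop()
    return stack[0][0]

# Root with one child, which has five children, each with a single leaf.
shape = [[[[]] for _ in range(5)]]
bits = encode(shape)
print(bits)
```

Each node costs exactly two bits, so the 12 nodes of this shape take 24 bits, independent of labels and text, which are stored separately.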
Node Navigations
Topology Tiers No. of ) No. of ( No. of text nodes Min, max of forward excess Min, max of backward excess
Primitive operators
Topology Tiers No. of ) No. of ( No. of text nodes Min, max of forward excess Min, max of backward excess Excess 2 Where is the close tag?
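The question "where is the close tag?" can be answered by tracking the excess (opens minus closes) and using per-block summaries to skip blocks that cannot contain the match. The sketch below keeps only two of the listed fields per block (the net excess and the minimum forward excess); the block size of 8, the 0 = open / 1 = close convention, and the function names are assumptions.

```python
BITS = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
        0, 0, 1, 1, 0, 0, 1, 1, 1, 1]

def step(bit):
    return 1 if bit == 0 else -1          # open raises excess, close lowers it

def build_tier(bits, block=8):
    """Per block: (net excess change, minimum forward excess inside it)."""
    tier = []
    for start in range(0, len(bits), block):
        excess, lowest = 0, 0
        for bit in bits[start:start + block]:
            excess += step(bit)
            lowest = min(lowest, excess)
        tier.append((excess, lowest))
    return tier

def find_close(bits, tier, i, block=8):
    """Index of the close parenthesis matching the open one at position i."""
    if bits[i] != 0:
        raise ValueError("position i must hold an open parenthesis")
    excess = 0
    # 1. Scan the remainder of the block containing i.
    for j in range(i, min(len(bits), (i // block + 1) * block)):
        excess += step(bits[j])
        if excess == 0:
            return j
    # 2. Skip whole blocks in which the excess cannot drop to zero.
    k = i // block + 1
    while k < len(tier) and excess + tier[k][1] > 0:
        excess += tier[k][0]
        k += 1
    # 3. Scan the block that must contain the match.
    for j in range(k * block, min(len(bits), (k + 1) * block)):
        excess += step(bits[j])
        if excess == 0:
            return j
    raise ValueError("unbalanced sequence")

tier = build_tier(BITS)
print(tier, find_close(BITS, tier, 0), find_close(BITS, tier, 6))
```

With larger blocks and a second tier built over the first, the same skipping idea applies one level up, which is the role the tier hierarchy plays on these slides.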
Tier 2 excess
Efficient Updates
Example 100 MB DBLP document 5 million XML nodes ISX: 1MB topology
Another example (Core Duo 1.83GHz, 1GB RAM, 5400 RPM hard drive, MS Vista)
100M DBLP            MSXML     ISX
Runtime (loading)    329MB     67MB
Loading time         17.8s     0.67s
Runtime (//www)      333MB
//www                1.814s    0.143s
5M DBLP              MSXML     ISX
Runtime (loading)    15MB      4MB
Loading time         0.54s     0.035s
Runtime (//www)      21MB
//www                0.096s    0.004s
ISX Features
Experiments Setup Fixed at 64MB memory buffer Up to 16 GB XML document E.g. 16 GB DBLP contains > 770 million nodes NO index or query optimization has been employed for ISX (except for ISX Stream where TurboXPath algorithm has been employed)
Storage Size (ISX vs NoK)
Storage Size (ISX, XMill, XGrind): DBLP
Storage Size (ISX, XMill): TreeBank
Bulk Loading Performance
Queries
Q1: //inproceedings
Q5: //article[.//month/text() = "July"]//title
Other queries
XPath 13 axes. We can navigate along 13 axes: ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self
Node Navigation
Full document traversal
Update (Insertion) Performance
ISX Summary Small storage footprint Small runtime footprint Fast and consistent performance on navigational access Superior query performance (further indexing / query optimization can be added) Superior update performance
Compressing and Searching XML Data Via Two Zips Paolo Ferragina et al. Slides modified from P. Ferragina’s
An XML excerpt (it is verbose!): ... <dblp> <book> <author> Donald E. Knuth </author> <title> The TeXbook </title> <publisher> Addison-Wesley </publisher> <year> 1986 </year> </book> <article> <author> Ronald W. Moore </author> <title> An Analysis of Alpha-Beta Pruning </title> <pages> 293-326 </pages> <year> 1975 </year> <volume> 6 </volume> <journal> Artificial Intelligence </journal> </article> ... </dblp>
A tree interpretation... XML document exploration Tree navigation XML document search Labeled subpath searches Subset of XPath [W3C]
The Problem: We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations. Navigational operations: parent(u), child(u, i), child(u, i, c). Subpath searches: given a sequence P of k labels. Content searches: subpath + substring search. Visualization operation: given a node, visualize its descending subtree. XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage. Existing approaches: XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...) need to decompress the whole document. XML-queryable compressors (like XPress, XGrind, XQzip, ...) have poor compression and scan the whole (compressed) file. Summary indexes (like Dataguide, 1-index or 2-index) take large space and do not support “content” searches. In theory many solutions exist, starting from [Jacobson, IEEE Focs ’89], but they offer no subpath/content searches and perform poorly on labeled trees
A transform for “labeled trees” [Ferragina et al, IEEE Focs ’05]: We proposed the XBW-transform, which mimics on trees the nice structural properties of the Burrows-Wheeler Transform on strings. The XBW linearizes the tree T in 2 arrays s.t.: the compression of T reduces to using any compressor (gzip, bzip, ...) over these two arrays; the indexing of T reduces to implementing simple rank/select query operations over these two arrays
The XBW-Transform, Step 1: Visit the tree in pre-order. For each node, write down its label (array Sa) and the labels on its upward path (array Sp). The result is a permutation of the tree nodes paired with their upward labeled paths. (Figure: example tree with its Sa and Sp columns.)
The XBW-Transform, Step 2: Stably sort the rows according to Sp, the upward labeled paths. (Figure: example tree with its Sa and Sp columns after sorting.)
The XBW-Transform, Step 3: Add a binary array Slast marking the rows corresponding to last children. Key fact: nodes correspond to items in <Slast, Sa>. (Figure: example tree with its Slast, Sa and Sp columns.)
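The three steps can be sketched directly. The 15-node tree below is hypothetical, written in the style of the slide's example (uppercase labels for internal nodes, lowercase for leaves); the function names and the choice to mark the root as a last child are assumptions.

```python
def leaf(label):
    return (label, [])

# Hypothetical tree: (label, list of children).
TREE = ("C", [
    ("B", [("D", [leaf("c"), leaf("a")]), leaf("c")]),
    ("A", [leaf("b"), leaf("a"), ("D", [leaf("c")])]),
    ("B", [("D", [leaf("b")]), leaf("a")]),
])

def xbw(tree):
    """Return (Slast, Sa, Sp) for a tree of (label, children) tuples."""
    rows = []

    def visit(node, upward, is_last):
        label, children = node
        # Step 1: pre-order visit, recording the label and its upward path.
        rows.append((upward, is_last, label))
        for i, child in enumerate(children):
            visit(child, label + upward, i == len(children) - 1)

    visit(tree, "", True)
    # Step 2: stable sort by upward path (Python's sort is stable).
    rows.sort(key=lambda row: row[0])
    # Step 3: Slast marks rows that are the last child of their parent.
    s_last = [int(row[1]) for row in rows]
    s_alpha = "".join(row[2] for row in rows)
    s_pi = [row[0] for row in rows]
    return s_last, s_alpha, s_pi

s_last, s_alpha, s_pi = xbw(TREE)
for last, label, path in zip(s_last, s_alpha, s_pi):
    print(last, label, path)
```

In the output, the children of any node sit in consecutive rows and the last of them carries a 1, which is the property the later slides rely on. Sp is kept here only for display; the key fact above says the pair of Slast and Sa is enough to identify nodes.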
XBzip, a simple XML compressor: the XBW is compressible. Sa (tags, attributes and the symbol =) and Spcdata (Pcdata) are locally homogeneous; Slast has some structure
Some structural properties (figure: example tree and its XBW arrays Slast, Sa, Sp). Two useful properties: children are contiguous and delimited by 1s; children reflect the order of their parents
XBW is navigational: Rank-Select data structures on Slast and Sa, plus the array C of |Σ| integers (in the example: A 2, B 5, C 9, D 12). Get_children example: Rank(B, Sa) = 2 for the chosen node, so starting from row 5 (the entry of C for B), select the 2nd 1 in Slast; the block of rows ending there holds the node's children
XBW is searchable (count subpaths): Rank-Select data structures on Slast and Sa, plus the array C of |Σ| integers. Example: P = B D. Start with the rows [fr, lr] whose Sp starts with ‘B’. Inductive step: pick the next char P[i+1], i.e. ‘D’; search for the first and last ‘D’ in Sa[fr, lr]; jump to their children, which have upward path = ‘D B’. There are 2 occurrences of P because the final range contains two 1s in Slast
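The counting procedure can be sketched with naive rank and select (plain scans in place of succinct data structures). The two arrays are the XBW of a hypothetical 15-node tree: root C with children B, A, B, where the D nodes sit under B, A and B. The sketch assumes that uppercase labels mark internal nodes, that the root's row has Slast = 1, and that P consists of internal labels only; all names are illustrative.

```python
S_ALPHA = "CbaDDcDaBABccab"
S_LAST = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

def count_subpath(s_last, s_alpha, path):
    """Count nodes whose downward label path from some ancestor spells `path`."""
    internal = sorted({c for c in s_alpha if c.isupper()})
    ones = [i for i, bit in enumerate(s_last) if bit == 1]   # naive select
    # Block 0 is the root row; the children blocks of all nodes labeled c
    # follow, grouped by label in sorted order (this plays the role of C).
    first_block, nxt = {}, 1
    for c in internal:
        first_block[c] = nxt
        nxt += s_alpha.count(c)

    def block_range(j):
        return ones[j - 1] + 1, ones[j]

    def rank(c, i):
        return s_alpha[:i].count(c)          # occurrences of c in rows [0, i)

    c = path[0]
    if c not in first_block:
        return 0
    fr = block_range(first_block[c])[0]
    lr = block_range(first_block[c] + s_alpha.count(c) - 1)[1]
    for c in path[1:]:
        if c not in first_block:
            return 0
        k1, k2 = rank(c, fr), rank(c, lr + 1)
        if k1 == k2:
            return 0                          # no c among rows fr..lr
        # Jump to the children of the first and last c found in Sa[fr..lr].
        fr = block_range(first_block[c] + k1)[0]
        lr = block_range(first_block[c] + k2 - 1)[1]
    return sum(s_last[fr:lr + 1])             # one 1 per matching node

print(count_subpath(S_LAST, S_ALPHA, "BD"))
```

For P = B D the search narrows to the rows holding the children of the two D nodes under B, and the two 1s in that range give the count, matching the argument on the slide.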