COMP9319 Web Data Compression and Search


1 COMP9319 Web Data Compression and Search
Revision

2 Course Aims As the amount of Web data increases, it is becoming vital not only to search and retrieve this information quickly, but also to store it compactly. This is especially important for mobile devices, which are increasingly popular. Without loss of generality, within this course we assume Web data (excluding media content) is in XML and its like (e.g., XHTML). This course introduces the concepts, theories, and algorithmic issues important to Web data compression and search. It also covers recent developments in Web data optimization, common practice, and applications.

3 Assumed knowledge The official prerequisite of this course is COMP2911/9024. At the start of this course, students should be able to: understand fundamental data structures; produce correct programs in C/C++ (compilation, running, testing, debugging, etc.); produce readable code with clear documentation; appreciate the use of abstraction in computing.

4 Learning outcomes have a good understanding of the fundamentals of text compression be introduced to advanced data compression techniques such as those based on Burrows Wheeler Transform have programming experience in Web data compression and optimization have a deep understanding of XML and selected XML processing and optimization techniques understand the advantages and disadvantages of data compression for Web search have a basic understanding of XML distributed query processing appreciate the past, present and future of data compression and Web data optimization

5 Assessment 3 programming assignments (20, 30, 50 pts)
Penalty for late programming assignment submission: 10% reduction for the first day and 30% reduction for each subsequent day. An advanced project in lieu of the assignments is possible; pre-arrangement with the lecturer before Week 3 is needed.

6 Assessment Final written exam (100 pts)
If you are ill on the day of the exam, do not attend the exam. I will not accept any medical special consideration claims from people who have already attempted the exam. Final mark (harmonic mean): 2 * (Assgt Total) * FinalExam / (Assgt Total + FinalExam)
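The harmonic-mean formula above can be sketched as a tiny function (the function name and the illustrative marks are my own):

```python
def final_mark(assgt_total: float, final_exam: float) -> float:
    """Harmonic-mean final mark: 2 * A * E / (A + E)."""
    if assgt_total + final_exam == 0:
        return 0.0
    return 2 * assgt_total * final_exam / (assgt_total + final_exam)

# The harmonic mean penalises imbalance: a strong exam alone cannot
# fully compensate for weak assignment marks, and vice versa.
print(final_mark(80, 80))  # balanced marks: 80.0
print(final_mark(40, 90))  # imbalanced: about 55.4, below the arithmetic mean 65
```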

7 Assignments Programming assignments are relatively challenging
In addition to correctness, specified performance parameters are required

8 Final exam 3-hour written exam; UNSW-approved calculator
Remember to get the sticker before the exam. A single-sided A4 sheet (typed or handwritten) is allowed.

9 Poor assignment performance
Should you take the exam if you've failed the assignments? Yes (for your WAM and for a chance to pass the subject). Read the next slide for details…

10 Special consideration
Special consideration (scale up) may be granted if: you have achieved >= 75% in the final exam, and you got poor assignment scores but your code shows that you have a good understanding of the corresponding methodologies and algorithms.

11 Exam preparation Make sure you understand the slides (if needed, refer to the original papers) Work on the exercises (before you check the answers) Check and study the answers Assignment related knowledge may be re-tested in the exam

12 Exam consultation Special consultation before the exam:
In case you have questions before the exam, questions about the exercises, or questions about your Assignment 3 score. 11:00–13:00 Wed 15th June, Rm 213, K17

13 Yet to be done Follow up cases of potential plagiarism
Some are resolved, but more remain to be tested

14 Assignment 3 results There are a few extension cases with medical reasons. Test cases are extensive (38 tests), hence marking will not be finalized until June 8. I'll try my best to announce the results by the end of that week. Do not share/disclose your solution until you receive your marks.

15 Revision

16 Topics
Week 1: Course overview; basic information theory
Week 2: Arithmetic coding, adaptive coding, dictionary coding
Week 3: Adaptive Huffman, LZW; overview of BWT
Week 4: Pattern matching and regular expressions
Week 5: FM-index, backward search, suffix array
Week 6: Suffix array in O(n), compressed BWT
Week 7: XML overview & indexing; XPath queries & processing
Week 8: XML compression: XMill, XGRIND, ISX, XBW
Week 9: XPath containment; distributed Web query processing
Later lectures: Cloud data optimization; Web graph; Latent semantic indexing

17 Entropy What is the minimum number of bits per symbol?
Answer: Shannon's result – the theoretical minimum average number of bits per code word is known as entropy (H)
(Slide credit: ORCHID Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign)
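A minimal sketch of computing the empirical entropy of a string (treating it as a memoryless source):

```python
from math import log2
from collections import Counter

def entropy(s: str) -> float:
    """Shannon entropy H = -sum p(x) * log2 p(x): the theoretical
    minimum average number of bits per symbol for a memoryless source."""
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(entropy("aabb"))  # two equiprobable symbols: 1 bit/symbol
print(entropy("aaaa"))  # a certain symbol carries no information: 0 bits
print(entropy("abcd"))  # four equiprobable symbols: 2 bits/symbol
```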

18 Huffman coding
S | Freq | Huffman
a | 30   | 00
b | 30   | 01
c | 20   | 10
d | 10   | 110
e | 10   | 111
(Figure: the Huffman tree, with internal node weights 100, 60, 40, 20.)
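The code lengths for these frequencies (a = 30, b = 30, c = 20, d = 10, e = 10, as I read them from the slide) can be reproduced with a small heap-based Huffman construction; a sketch:

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs: dict) -> dict:
    """Build a Huffman tree with a min-heap and return each symbol's code length."""
    tick = count()  # unique tie-breaker so heap tuples always compare
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))  # ...merge

    lengths = {}
    def walk(node, depth):
        if isinstance(node, tuple):          # internal node
            walk(node[0], depth + 1)
            walk(node[1], depth + 1)
        else:                                # leaf symbol
            lengths[node] = depth
    walk(heap[0][2], 0)
    return lengths

# Lengths match the slide's codes 00, 01, 10, 110, 111;
# average length = 0.3*2 + 0.3*2 + 0.2*2 + 0.1*3 + 0.1*3 = 2.2 bits/symbol.
print(huffman_code_lengths({"a": 30, "b": 30, "c": 20, "d": 10, "e": 10}))
```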

19 Run-length coding Run-length coding (encoding) is a very widely used and simple compression technique. It does not assume a memoryless source: it replaces runs of symbols (possibly of length one) with (symbol, run-length) pairs.

20 Adaptive Huffman
(Figure: the adaptive Huffman tree evolving while encoding "abbbbba"; modified from Wikipedia.)

21 Arithmetic coding (encode)

22 Arithmetic coding (decode)

23 LZW Compression
w = NIL;
while ( read a character k ) {
    if wk exists in the dictionary
        w = wk;
    else {
        output the code for w;
        add wk to the dictionary;
        w = k;
    }
}
output the code for w;   // flush the final phrase
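A runnable sketch of this loop, assuming an initial dictionary of single-byte codes 0..255:

```python
def lzw_compress(text: str):
    """LZW: grow phrase w while w+k is in the dictionary; on a miss,
    emit the code for w, add w+k as a new entry, and restart from k."""
    dictionary = {chr(i): i for i in range(256)}  # initial single-char codes
    next_code = 256
    w, out = "", []
    for k in text:
        wk = w + k
        if wk in dictionary:
            w = wk
        else:
            out.append(dictionary[w])
            dictionary[wk] = next_code
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])  # flush the final phrase
    return out

print(lzw_compress("ABABABA"))  # [65, 66, 256, 258]
```

Codes 256 and 258 are the dictionary entries "AB" and "ABA" created during the scan.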

24 LZW Decompression
read a character k;
output k;
w = k;
while ( read a character/code k ) {
    entry = dictionary entry for k;
    output entry;
    add w + entry[0] to the dictionary;
    w = entry;
}
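A runnable sketch of the decoder. Note the extra branch for the KwKwK case (a code referenced one step before the decoder finishes defining it), which the pseudocode above glosses over:

```python
def lzw_decompress(codes):
    """Rebuild the LZW dictionary while decoding the code stream."""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    it = iter(codes)
    w = dictionary[next(it)]
    out = [w]
    for k in it:
        if k in dictionary:
            entry = dictionary[k]
        else:                       # KwKwK: code defined by this very step
            entry = w + w[0]
        out.append(entry)
        dictionary[next_code] = w + entry[0]
        next_code += 1
        w = entry
    return "".join(out)

# Round-trips the compressor's output for "ABABABA":
assert lzw_decompress([65, 66, 256, 258]) == "ABABABA"
```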

25 BWT(S)
function BWT (string s)
    create a table, rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)
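The rotation-table definition, sketched directly in Python (sentinel '#' as in the mississippi example used later):

```python
def bwt(s: str) -> str:
    """Naive BWT: sort all rotations of s, return the last column.
    s is assumed to already end with a unique sentinel such as '#'."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(bwt("mississippi#"))  # ipssm#pissii
```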

26 InverseBWT(S)
function inverseBWT (string s)
    create empty table
    repeat length(s) times
        insert s as a column of table before first column of the table // first insert creates first column
        sort rows of the table alphabetically
    return (row that ends with the 'EOF' character)
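The table-based inversion above, sketched literally (quadratic in time and space, but faithful to the pseudocode; '#' plays the role of the EOF character):

```python
def inverse_bwt(s: str, sentinel: str = "#") -> str:
    """Repeatedly prepend s as a column and re-sort the rows;
    the row ending with the sentinel is the original string T."""
    table = [""] * len(s)
    for _ in range(len(s)):
        table = sorted(s[i] + table[i] for i in range(len(s)))
    return next(row for row in table if row.endswith(sentinel))

print(inverse_bwt("ipssm#pissii"))  # mississippi#
```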

27 Other ways to reverse BWT
Let L = BWT(S) be composed of the symbols V0 … VN-1. The transformed string may be parsed to obtain:
The number of symbols in the substring V0 … Vi-1 that are identical to Vi (i.e., Occ[ ]).
For each unique symbol Vi in L, the number of symbols that are lexicographically less than that symbol (i.e., C[ ]).
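Occ[ ] and C[ ] give a table-free inversion via the LF-mapping LF(i) = C[L[i]] + Occ[i]; a sketch, checked on the mississippi example:

```python
from collections import Counter

def inverse_bwt_lf(L: str, sentinel: str = "#") -> str:
    """Reverse the BWT with C[] and Occ[] instead of the O(n^2) table."""
    counts, C, total = Counter(L), {}, 0
    for c in sorted(counts):          # C[c]: symbols in L smaller than c
        C[c], total = total, total + counts[c]
    occ, seen = [], Counter()         # occ[i]: occurrences of L[i] in L[0..i-1]
    for ch in L:
        occ.append(seen[ch])
        seen[ch] += 1
    # Row 0 of the sorted rotations starts with the (smallest) sentinel, so
    # its last symbol L[0] is the character of T just before the sentinel.
    # Each LF step jumps to the row starting with the symbol just emitted,
    # so T is recovered right to left.
    i, out = 0, []
    for _ in range(len(L) - 1):
        out.append(L[i])
        i = C[L[i]] + occ[i]          # LF-mapping step
    return "".join(reversed(out)) + sentinel

print(inverse_bwt_lf("ipssm#pissii"))  # mississippi#
```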

28 Move to Front (MTF) Reduce entropy based on local frequency correlation. Usually applied to BWT output before an entropy-coding step. Original paper available under cs9319/Papers.
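A minimal MTF encoder/decoder sketch (the alphabet is passed explicitly; the example string is my own):

```python
def mtf_encode(s: str, alphabet: str):
    """Emit each symbol's current position in the table, then move it to
    the front: recently seen symbols get small codes, lowering entropy
    on locally repetitive input such as BWT output."""
    table, out = list(alphabet), []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))   # move to front
    return out

def mtf_decode(codes, alphabet: str) -> str:
    table, out = list(alphabet), []
    for i in codes:
        ch = table[i]
        out.append(ch)
        table.insert(0, table.pop(i))
    return "".join(out)

codes = mtf_encode("bbbaab", "abc")
print(codes)  # [1, 0, 0, 1, 0, 1] -- runs collapse to zeros
assert mtf_decode(codes, "abc") == "bbbaab"
```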

29 BWT compressor vs ZIP
ZIP (i.e., LZ based) vs BWT+RLE+MTF+AC

30 Pattern Matching Brute Force Boyer Moore KMP

31 Regular expressions
L(001) = {001}
L(0*10*) = {1, 01, 10, 010, 0010, …}, i.e. {w | w has exactly a single 1}
L(((0+1)(0+1))*) = {w | w is a string of even length}
L((0(0+1))*) = {ε, 00, 01, 0000, 0001, 0100, 0101, …}
L((0+ε)(1+ε)) = {ε, 0, 1, 01}
L(1Ø) = Ø; concatenating the empty set to any set yields the empty set.
Rε = R
R+Ø = R
Note that R+ε may or may not equal R (we are adding ε to the language).
Note that RØ will only equal R if R itself is the empty set.

32 Theory of DFAs and REs RE. Concise way to describe a set of strings.
DFA. Machine to recognize whether a given string is in a given set. Duality: for any DFA, there exists a regular expression to describe the same set of strings; for any regular expression, there exists a DFA that recognizes the same set.
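To make the duality concrete, here is a hand-built DFA equivalent to the earlier regular expression 0*10* (strings with exactly one 1); the three-state numbering is my own:

```python
# States: 0 = no '1' seen yet, 1 = exactly one '1' seen (accepting),
#         2 = dead state (two or more '1's).
DELTA = {
    (0, "0"): 0, (0, "1"): 1,
    (1, "0"): 1, (1, "1"): 2,
    (2, "0"): 2, (2, "1"): 2,
}

def accepts(w: str) -> bool:
    """Run the DFA on w; accept iff we end in state 1."""
    state = 0
    for ch in w:
        state = DELTA[(state, ch)]
    return state == 1

assert accepts("1") and accepts("010") and accepts("0010")
assert not accepts("") and not accepts("00") and not accepts("0110")
```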

33 DFA to RE: State Elimination
Eliminates states of the automaton and replaces the edges with regular expressions that includes the behavior of the eliminated states. Eventually we get down to the situation with just a start and final node, and this is easy to express as a RE

34 Burrows-Wheeler Transform
Reminder: recovering T from L (F = #iiiimppssss, L = ipssm#pissii)
Find F by sorting L.
First char of T? m. Find m in L.
L[i] precedes F[i] in T; therefore we get "mi".
How do we choose the correct i in L? The i's are in the same order in L and F, as are the rest of the characters.
i is followed by s: "mis". And so on…

35 Backward-search example
L = BWT("mississippi#"), positions 1..12; C[ ] = { i: 1, m: 5, p: 6, s: 8 }.
P = pssi, processed backward (last character first); at each step:
First = C[c] + Occ(c, First - 1) + 1; Last = C[c] + Occ(c, Last); count = Last - First + 1.
Step c = 'i': First = C['i'] + 1 = 2; Last = C['i' + 1] = C['m'] = 5.
Step c = 's': First = C['s'] + Occ('s', 1) + 1 = 9; Last = C['s'] + Occ('s', 5) = 8 + 2 = 10.
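The steps above can be sketched as a generic backward search over L = BWT(T), with 1-based [First, Last] ranges; the naive O(n) Occ is for illustration only:

```python
from collections import Counter

def backward_search(L: str, P: str) -> int:
    """FM-index backward search: process P from last character to first,
    narrowing the range [First, Last] of sorted rotations prefixed by the
    current pattern suffix.  Returns the number of occurrences in T."""
    counts, C, total = Counter(L), {}, 0
    for c in sorted(counts):            # C[c]: symbols in L smaller than c
        C[c], total = total, total + counts[c]

    def occ(c, i):                      # occurrences of c in L[1..i] (1-based)
        return L[:i].count(c)           # O(n) scan; a real index uses tables

    first, last = 1, len(L)
    for c in reversed(P):
        if c not in C:
            return 0
        first = C[c] + occ(c, first - 1) + 1
        last = C[c] + occ(c, last)
        if first > last:                # pattern suffix absent
            return 0
    return last - first + 1

L = "ipssm#pissii"                      # BWT of "mississippi#"
print(backward_search(L, "ssi"))        # 2
print(backward_search(L, "sip"))        # 1
print(backward_search(L, "pss"))        # 0
```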

36 Compression for inverted index
First, we consider space for the dictionary. Main motivation for dictionary compression: make it small enough to keep in main memory.
Then the postings file. Motivation: reduce the disk space needed and decrease the time needed to read from disk. Note: large search engines keep a significant part of the postings in memory.
We will devise various compression schemes for dictionary and postings: VB code, Gamma code.
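Minimal sketches of the two postings codes named above (VB here follows the convention where the terminating byte of each number has its high bit set; Elias gamma is for integers n >= 1):

```python
def vb_encode(n: int) -> bytes:
    """Variable-byte code: 7 payload bits per byte; the high bit is set
    only on the last byte of each number, so decoders know where gaps end."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                      # mark the terminating byte
    return bytes(out)

def gamma_encode(n: int) -> str:
    """Elias gamma code: unary length of the offset, then the offset
    (binary representation of n with its leading 1 stripped)."""
    offset = bin(n)[3:]                 # drop the '0b1' prefix
    return "1" * len(offset) + "0" + offset

print(vb_encode(824).hex())     # two bytes: 00000110 10111000
print(gamma_encode(9))          # 1110001
print(gamma_encode(1))          # 0
```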

37 Signature files
Definition: a word-oriented index structure based on hashing. Uses linear search; suitable for not very large texts.
Structure: based on a hash function that maps words to bit masks. The text is divided into blocks; the bit mask of a block is obtained by bitwise ORing the signatures of all the words in that block. A word is reported as not found if some 1 bit in the query mask has no match in the block mask.
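A toy sketch of the scheme above; the hash construction and the parameters (16-bit masks, 3 bits per word) are my own choices for illustration:

```python
import hashlib

def word_signature(word: str, bits: int = 16, k: int = 3) -> int:
    """Hash a word to a mask with up to k bits set."""
    sig = 0
    for i in range(k):
        h = int(hashlib.md5(f"{i}:{word}".encode()).hexdigest(), 16)
        sig |= 1 << (h % bits)
    return sig

def block_signature(words) -> int:
    """Bitwise-OR of the word signatures in a text block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(block_sig: int, word: str) -> bool:
    """All 1 bits of the query mask must appear in the block mask.
    False means definitely absent; True may be a false positive."""
    q = word_signature(word)
    return block_sig & q == q

block = block_signature(["web", "data", "compression"])
assert maybe_contains(block, "data")   # a present word always matches
# an absent word is usually rejected, but false positives are possible
```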

38 Suffix tree Given a string s, a suffix tree of s is a compressed trie of all suffixes of s. To make these suffixes prefix-free we add a special character, say $, at the end of s.

39 Generalized suffix tree (Example)
Let s1 = abab and s2 = aab; here is a generalized suffix tree for s1$ and s2#.
Suffixes: { $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }
(Figure: the tree, with leaves numbered by suffix start positions.)

40 Longest common substring (of two strings)
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa.
Find such a node with the largest "string depth".

41 Suffix array We lose some of the functionality of suffix trees, but we save space.
Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 3 1 4 2
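A sketch by plain sorting, returning 1-based positions as on the slide (this is O(n^2 log n) in the worst case; the linear-time construction covered later gives the same array):

```python
def suffix_array(s: str):
    """Suffix array: start positions of the suffixes in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

# suffixes of "abab" sorted: ab (3), abab (1), b (4), bab (2)
print(suffix_array("abab"))  # [3, 1, 4, 2]
```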

42 Linear time suffix array construction
Consider the popular example string S: bananainpajamas$ Construct the suffix array of S using the linear time algorithm Then compute the BWT(S) What’s the relationship between the suffix array and BWT ? (e.g., SA -> BWT vs BWT -> SA )
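On the relationship asked about above: with a unique smallest sentinel, BWT[i] = S[SA[i] - 1] (the sentinel's predecessor wraps around), so SA -> BWT is a single scan. A sketch, checked on the mississippi example rather than the exercise string:

```python
def bwt_from_sa(s: str) -> str:
    """Compute BWT from the suffix array: the BWT character for the
    suffix starting at SA[i] is the text character just before it."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # 0-based suffix array
    return "".join(s[i - 1] for i in sa)              # s[-1] wraps to the end

print(bwt_from_sa("mississippi#"))  # ipssm#pissii
```

Going the other way (BWT -> SA) needs the LF-mapping and is the basis of compressed suffix arrays.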

43 Compressed SA: Run-Length FM-index...
(Figure: run-length FM-index. Bit vector B marks the run heads in L, S stores one symbol per run, and B' marks the corresponding positions in F.)

44 Changes to formulas Recall that we need to compute C_T[c] + rank_c(L, i) in the backward search. Theorem: C_T[c] + rank_c(L, i) is equivalent to select_1(B', C_S[c] + 1 + rank_c(S, rank_1(B, i))) - 1 when L[i] ≠ c, and otherwise to select_1(B', C_S[c] + rank_c(S, rank_1(B, i))) + i - select_1(B, rank_1(B, i)).

45 XML
<Staff>
  <Name>
    <FirstName> Raymond </FirstName>
    <LastName> Wong </LastName>
  </Name>
  <Login> wong </Login>
  <Ext> 5932 </Ext>
</Staff>
(Tree: Staff has children Name, Login ("wong"), and Ext ("5932"); Name has children FirstName ("Raymond") and LastName ("Wong").)

46 XML Parsers There are several different ways to categorise parsers:
Validating versus non-validating parsers Parsers that support the Document Object Model (DOM) Parsers that support the Simple API for XML (SAX) Parsers written in a particular language (Java, C, C++, Perl, etc.)

47 XPath: More Qualifiers
…[… < "60"]
…[… < "25"]
/bib/book[author/text()]

48 Query evaluation Top-down Bottom-up Hybrid

49 XPath evaluation
Document: <a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>
Query: /a/b[c = "12"]
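This evaluation can be checked with the limited XPath subset supported by Python's xml.etree.ElementTree (predicate syntax [tag='text']):

```python
import xml.etree.ElementTree as ET

# The slide's document; the query b[c='12'] is evaluated relative to
# the root <a>, i.e. it corresponds to /a/b[c = "12"].
root = ET.fromstring("<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>")
matches = root.findall("b[c='12']")

assert len(matches) == 1               # only the first <b> qualifies
assert matches[0].find("d").text == "7"
```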

50 Path indexing Traversing the graph/tree is almost equivalent to query processing for semistructured / XML data. Normally it requires traversing the data from the root and returning all nodes X reachable by a path matching the given regular path expression. Motivation: allow the system to answer regular path expressions without traversing the whole graph/tree.

51 Two techniques One based on the idea of language equivalence; the other, DataGuides

52 XPath containment
(Figure: two tree patterns over person and name nodes, with * wildcards and 0/1 annotations; question: does one pattern contain the other?)

53 XMill

54 XML Compression Techniques
XGRIND
Original fragment:
<student name="Alice">
  <a1>78</a1>
  <a2>86</a2>
  <midterm>91</midterm>
  <project>87</project>
</student>
Compressed fragment:
T0 A0 nahuff(Alice) T1 nahuff(78) / T2 nahuff(86) / T3 nahuff(91) / T4 nahuff(87) / /

55 ISX: Balanced Parenthesis Encoding

56 XBW is navigational
XBW of the example tree is the arrays Slast, Sα, and Sπ, together with the counter array C (here A → 2, B → 5, C → 9, D → 12).
Get_children: e.g. Rank(B, Sα) = 2, then select in Slast the 2nd 1 from there.
XBW is navigational: Rank-Select data structures on Slast and Sα; the array C of |Σ| integers.

57 XBW is searchable (count subpaths)
Example pattern over the XBW-index (Slast, Sα, Sπ): P = B D.
fr, lr: the range of rows whose Sπ starts with 'B'; their children have upward path 'D B'.
Inductive step: pick the next char P[i+1], i.e. 'D'; search for the first and last 'D' in Sα[fr, lr]; jump to their children.
2 occurrences of P because of two 1s.
XBW is searchable: Rank-Select data structures on Slast and Sα; the array C of |Σ| integers.

58 Example query Assume data are distributed over 3 sites
Assume the RPE: a.b*.c. The query starts from site 1.
(Figure: sites s1, s2, s3 connected by edges labelled a, b, c.)

59 The database
(Figure: the example database spread over the sites; nodes x1–x4, y1–y3, z1–z4 linked by edges labelled a, b, c, d.)

60 Query Processing Given a query, we compute its automaton
Send it to each site. Start an identical process at each site. Compute two sets Stop(n, s) and Result(n, s). Transmit the relations to a central location and take their union.

61 Stop and Result at site 2
Note: here s2 is State 2, not Site 2.
Stop: (y1, s2), (z2, s2)
Result: (y1, s2) → y3; (y1, s3) → y1; (y3, s3) → y3

62 Content optimization

63 Web Graph Compression The compression techniques are specialized for Web graphs. The average link size decreases as the graph grows, while the average link access time increases. The ζ-codes seem to have the best trade-off between average bit size and access time.

64 Finally Course Evaluation:
Much appreciated if you can provide some feedback on the course!

65 The End kuo$cdogl

