1
Tunable Compression of Word-level Index for Versioned Corpora
Klaus Berberich, Srikanta Bedathur, Gerhard Weikum
Max-Planck Institute for Informatics
Saarbruecken, Germany
2
EIIR 2008, Glasgow 2/19
Introduction
Most document collections are not static
– Intranet documents, mail folders, blogs, source code, and contents of the World Wide Web
– Contents are being archived, possibly time-stamped and/or versioned
– Wikis, document repositories (SVN, CVS, …), desktop, Web archives!
Search over evolving collections
– Ability to query the collection “as of” a given time
Time-travel Search [BBNW’07]
3
Outline
Time-travel Search
Our Time-machine: FluxCapacitor/TTIX
– Phrase Queries in TTIX
FUSION and Controlled FUSION
Experimental Evaluation
4
Historical Information Needs
1. News articles discussing the cola-drinks cancer controversy during 2005-2006
2. Contemporary articles about “Harry Potter and the Philosopher’s Stone”
3. Angela Merkel’s interview during 2002
5
Time-Travel Search
Example: Angela Merkel Interview @ 2002
– “Angela Merkel Interview” is the keyword query
– “2002” is the time-context for evaluation & ranking
Keyword search extended with a time-context for evaluation
– Q = q @ t_s
– Evaluate q using the collection that existed at time t_s
Key Challenges
– Dealing with the MASSIVE size
– Adapting the scoring models (typically defined for static collections)
– Efficient query processing
Opportunities
– Redundancy in content
– Sufficiency of good approximations
– Append-only data growth
7
FluxCapacitor/TTIX
Adapt the inverted index structure to include the validity time-interval of each document version
[Figure: version history of documents D1, D2, D3, which are observed to have changed at different times on a timeline from t_0 to t_13 (“now”), including a “deletion” of D3; each vocabulary entry points to a time-stamped inverted index list]
Example list in the time-stamped inverted index (doc. id, payload value, validity time-interval):
D3 2.2 [t_0, t_3)
D1 2.0 [t_0, t_2)
D3 1.9 [t_3, t_7)
D2 1.87 [t_0, t_1)
D1 1.6 [t_2, t_4)
…
[Berberich, Bedathur, Neumann, Weikum: SIGIR 2007, VLDB 2007]
– Index compaction via approximate temporal coalescing
– A sublist materialization framework for trading off space and performance
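The lookup that a time-stamped inverted index supports can be illustrated with a minimal Python sketch. It is not the FluxCapacitor/TTIX implementation: the names Posting, EXAMPLE_INDEX and time_travel_lookup, the term "interview", and the use of integers 0..7 for t_0..t_7 are all assumptions made for illustration. The posting values are the ones shown on the slide.

```python
from typing import Dict, List, NamedTuple


class Posting(NamedTuple):
    """One entry of a time-stamped list (hypothetical layout)."""
    doc_id: str
    value: float   # payload value shown on the slide
    begin: int     # start of validity interval, inclusive
    end: int       # end of validity interval, exclusive


# Hypothetical single-term index; integers stand in for t_0, t_1, ...
EXAMPLE_INDEX: Dict[str, List[Posting]] = {
    "interview": [
        Posting("D3", 2.2, 0, 3),
        Posting("D1", 2.0, 0, 2),
        Posting("D3", 1.9, 3, 7),
        Posting("D2", 1.87, 0, 1),
        Posting("D1", 1.6, 2, 4),
    ]
}


def time_travel_lookup(index: Dict[str, List[Posting]],
                       term: str, t: int) -> List[Posting]:
    """Return the postings of `term` whose validity interval [begin, end) contains t."""
    return [p for p in index.get(term, []) if p.begin <= t < p.end]


if __name__ == "__main__":
    for p in time_travel_lookup(EXAMPLE_INDEX, "interview", 3):
        print(p.doc_id, p.value)
```

The sketch scans the whole list and filters on the interval, which is the naive evaluation of Q = q @ t_s; it says nothing about how the real system organizes or prunes lists.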
8
Phrase Queries
Significantly improve effectiveness
Essential for quickly locating
– entities, e.g., “Coca Cola”, “Where Eagles Dare”, …
– concepts, e.g., “Water filtering”
– …
Indexing for phrase queries
– For each word, need to store positional information for every occurrence
– Index-size blowup
– Size reduction via gap encoding + space-efficient coding on positions [Scholer et al. 2002]
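As a rough illustration of gap encoding followed by a space-efficient code, the Python sketch below gap-encodes a sorted position list and writes each gap in Elias gamma code. Bit strings are used instead of packed bits for readability, positions are assumed to be 1-based and strictly increasing, and the function names are hypothetical.

```python
from typing import List


def to_gaps(positions: List[int]) -> List[int]:
    """Replace each position by its distance to the previous one (first gap is from 0)."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def elias_gamma(n: int) -> str:
    """Elias gamma code of n >= 1: (len-1) zeros followed by n in binary."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_positions(positions: List[int]) -> str:
    """Gap-encode the positions, then concatenate the gamma codes of the gaps."""
    return "".join(elias_gamma(g) for g in to_gaps(positions))


def decode_positions(bits: str) -> List[int]:
    """Inverse of encode_positions for a well-formed bit string."""
    positions, i, prev = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # count the unary length prefix
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        prev += gap                    # undo the gap encoding
        positions.append(prev)
    return positions


if __name__ == "__main__":
    code = encode_positions([3, 17, 42, 58])
    print(len(code), decode_positions(code))
```

Small gaps get short codes, which is why positions that sit close together compress well.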
9
Phrase Queries in FluxCapacitor
Baseline: for each document version d^{t_b}, a posting of the following structure
– Validity Time-interval (= 64 bits)
– Document Identifier (= 64 bits)
– List of Word-Positions
Word-positions compressed using standard techniques
– (Gap + Elias-/Golomb-)encodings
Can this be improved?
11
Word-Positions across Versions
High level of redundancy between versions
– Append-only changes leave most parts unchanged
– Example: word b between d^{t_1} and d^{t_2}
Numerical closeness of positions
– Small shifts in positions
– Example: word c between d^{t_2} and d^{t_3}
[Figure: positions of words b and c in successive versions]
12
FUSION
Idea:
– Merge (or fuse) multiple consecutive document versions, and exploit redundancy and positional proximity => better compressibility
A fused posting stores:
– Positions: all word-positions in any of the versions
– Timestamps: all intermediate version timestamps
– Signatures: for each version, a bit-signature of positions
[Figure: fused postings for words b and c]
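The three components of a fused posting can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' data layout: the FusedPosting class, the use of a Python integer as the bit-signature, and the example positions are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FusedPosting:
    positions: List[int]    # union of word-positions over all fused versions
    timestamps: List[int]   # begin timestamp of each fused version
    signatures: List[int]   # per version: bit k set if positions[k] occurs in it


def fuse(versions: List[Tuple[int, List[int]]]) -> FusedPosting:
    """Fuse consecutive versions given as (begin timestamp, word positions)."""
    all_positions = sorted(set().union(*(set(pos) for _, pos in versions)))
    slot = {p: k for k, p in enumerate(all_positions)}
    signatures = []
    for _, pos in versions:
        bits = 0
        for p in pos:
            bits |= 1 << slot[p]       # mark this position as present
        signatures.append(bits)
    return FusedPosting(all_positions, [t for t, _ in versions], signatures)


def positions_of(fused: FusedPosting, version: int) -> List[int]:
    """Recover one version's positions from the shared list and its signature."""
    sig = fused.signatures[version]
    return [p for k, p in enumerate(fused.positions) if (sig >> k) & 1]


if __name__ == "__main__":
    fused = fuse([(1, [3, 17, 42]), (2, [3, 17, 42, 58]), (3, [4, 18, 43, 59])])
    print(fused.positions)
    print(positions_of(fused, 1))
```

In the example, the first two versions share three positions, so the fused list stores them once; the third version's shifted positions sit numerically close to the others, which keeps the gaps in the fused list small.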
13
Query Processing: win some, lose some
Save on overall space
– Naïve organization + processing => reads the whole list, computes ranking
– FUSION maintains a smaller list, so (naïve) query processing is faster
Who is naïve!?
– Skip pointers to jump ahead during query processing
– In the worst case, FUSION ends up reading and processing all the versions, instead of just one version!
Baseline: good performance, bad storage
FUSION: bad (worst-case) performance, good storage
14
Controlled FUSION
Compute a set of fusions over contiguous versions such that
– it takes minimal storage for word positions
– for any version, the maximum worst-case query processing overhead is within η
Can be set up as an optimization problem
Optimal solution computable in O(n^3) time and O(n) space
– Assumption: storage cost is monotonic
– In practice, we found it close to O(n^2)
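One way to picture the optimization problem is a dynamic program over contiguous groups of versions. The Python sketch below is not the authors' algorithm and uses a deliberately simplified cost model, both of which are assumptions: the storage cost of a group is the number of distinct positions in it, and the overhead for a version is the size of the fused position list divided by the number of positions in that version.

```python
from typing import List, Set, Tuple


def controlled_fusion(versions: List[Set[int]],
                      eta: float) -> Tuple[int, List[Tuple[int, int]]]:
    """Partition versions into contiguous groups minimizing total stored positions.

    Assumed constraint: for every version v in a group,
    |union of group| <= eta * |v|. Returns (total cost, groups as index ranges).
    """
    if eta < 1:
        raise ValueError("eta must be >= 1")
    n = len(versions)
    inf = float("inf")
    best = [0] + [inf] * n     # best[j]: minimal cost for the first j versions
    cut = [0] * (n + 1)        # cut[j]: start index of the last group in best[j]
    for j in range(1, n + 1):
        union: Set[int] = set()
        smallest = inf
        for i in range(j, 0, -1):              # candidate group: versions[i-1 .. j-1]
            union |= versions[i - 1]
            smallest = min(smallest, len(versions[i - 1]))
            if len(union) > eta * smallest:
                break                          # extending further only grows the ratio
            cost = best[i - 1] + len(union)
            if cost < best[j]:
                best[j], cut[j] = cost, i - 1
    groups, j = [], n
    while j > 0:                               # walk the cuts back to the start
        groups.append((cut[j], j - 1))
        j = cut[j]
    groups.reverse()
    return best[n], groups


if __name__ == "__main__":
    history = [{1, 5, 9}, {1, 5, 9, 12}, {1, 5, 9, 12}, {40, 41, 42}]
    print(controlled_fusion(history, eta=1.5))
    print(controlled_fusion(history, eta=1.0))
```

The early break relies on the cost being monotonic in the sense that growing a group never shrinks its union, which mirrors the monotonicity assumption on the slide. With η = 1.0 only identical versions may be fused; larger η allows more fusion and less storage at the price of more worst-case work per version.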
16
Experimental Evaluation
English Wikipedia
– Revision history (2004 – 2005)
– 10% sample (~35,000 docs, ~900,000 versions)
Baseline:
– Elias- code: 97.51 GBytes
– Elias- code: 97.77 GBytes
FUSION:
– η between 1.1 and 10
– Elias- & Elias- for compressing word-positions in each fused posting
17
Experimental Results
[Figure: two result plots, each annotated at η = 1.5]
– η = 1.5: 35% of the baseline
– η = 1.5: 44% of the baseline
18
Conclusions
Time-travel Search
– Key to archive search & analysis
– An interesting and important problem!
Our Time-machine: FluxCapacitor/TTIX
– Builds on inverted index framework
– Tunable index-size reduction
FUSION
– Adds phrase-querying to FluxCapacitor/TTIX
– More than 50% space reduction over baseline, with 50% worst-case overhead in query processing
19
Thank You! Questions?