Tunable Compression of Word-level Index for Versioned Corpora Klaus Berberich, Srikanta Bedathur, Gerhard Weikum Max-Planck Institute for Informatics Saarbruecken, Germany


Tunable Compression of Word-level Index for Versioned Corpora Klaus Berberich, Srikanta Bedathur, Gerhard Weikum Max-Planck Institute for Informatics Saarbruecken, Germany

EIIR 2008, Glasgow
Introduction
Most document collections are not static
– Intranet documents, mail folders, blogs, source code, and the contents of the World Wide Web
– Contents are being archived, possibly time-stamped and/or versioned: wikis, document repositories (SVN, CVS, …), desktops, Web archives!
Search over evolving collections
– Ability to query the collection "as of" a given time: time-travel search [BBNW'07]

Outline
Time-travel Search
Our Time-machine: FluxCapacitor/TTIX
– Phrase Queries in TTIX
FUSION and Controlled FUSION
Experimental Evaluation

Historical Information Needs
1. News articles discussing the cola-drinks cancer controversy during […]
2. Contemporary articles about "Harry Potter and the Philosopher's Stone"
3. Angela Merkel's interview during 2002

Time-Travel Search
Keyword search extended with a time-context for evaluation
– Example: "Angela Merkel" (keyword query) with 2002 (time-context for evaluation and ranking)
– Q = (q, t_s): evaluate q using the collection that existed at time t_s
Key challenges
– Dealing with the massive size
– Adapting the scoring models (typically defined for static collections)
– Efficient query processing
Opportunities
– Redundancy in content
– Sufficiency of good approximations
– Append-only data growth

FluxCapacitor/TTIX
Adapt the inverted index structure to include the validity time-interval of each document version
– Documents D1, D2, D3 are observed to have changed at different times
[Figure: version history of documents D1, D2, D3 on a time axis from t_0 to t_13 ("now"), including a "deletion" of D3; a vocabulary pointing into a time-stamped inverted index whose postings carry a document id, a score, and a validity interval, e.g. D3 2.2 [t_0,t_3), D1 2.0 [t_0,t_2), D3 1.9 [t_3,t_7), D1 1.6 [t_2,t_4), …]
[Berberich, Bedathur, Neumann, Weikum: SIGIR 2007, VLDB 2007]
– Index compaction via approximate temporal coalescing
– A sublist materialization framework for trading off space and performance
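The time-stamped postings on this slide can be illustrated with a small sketch. It is not the FluxCapacitor/TTIX implementation: the `Posting` class, the `time_travel_lookup` function, the term "merkel", and the use of integers for the timestamps t_0, t_1, … are all assumptions made for illustration. Only the posting values are taken from the slide's figure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Posting:
    """One posting of a time-stamped inverted list (hypothetical layout)."""
    doc_id: str
    score: float
    begin: int  # validity interval is half-open: [begin, end)
    end: int


def time_travel_lookup(index: Dict[str, List[Posting]], term: str, t: int) -> List[Posting]:
    """Return the postings of `term` whose validity interval contains time t."""
    return [p for p in index.get(term, []) if p.begin <= t < p.end]


# Postings taken from the slide's example; timestamps t_i are modelled as the integer i.
index = {
    "merkel": [
        Posting("D3", 2.2, 0, 3),
        Posting("D1", 2.0, 0, 2),
        Posting("D3", 1.9, 3, 7),
        Posting("D1", 1.6, 2, 4),
    ]
}

hits = time_travel_lookup(index, "merkel", 3)
```

At time t_3 only the versions valid in [t_3,t_7) and [t_2,t_4) qualify, so `hits` contains the D3 posting with score 1.9 and the D1 posting with score 1.6. A linear scan is used here for brevity; it reads the whole list regardless of t.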

Phrase Queries
Significantly improve effectiveness
Essential for quickly locating
– entities, e.g., "Coca Cola", "Where Eagles Dare", …
– concepts, e.g., "Water filtering"
– …
Indexing for phrase queries
– For each word, need to store positional information for every occurrence
– Index-size blowup
– Size reduction via gap encoding + space-efficient coding on positions [Scholer et al. 2002]

Phrase Queries in FluxCapacitor
Baseline: for each document version d_{t_b}, a posting of the following structure
– Validity time-interval (64 bits)
– Document identifier (64 bits)
– List of word-positions
Word-positions are compressed using standard techniques
– (Gap + Elias-/Golomb-) encodings
Can this be improved?
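As a reminder of what "gap + Elias encoding" of word-positions means, here is a minimal sketch using the Elias-gamma code. The function names are hypothetical, positions are assumed to be 1-based and strictly increasing, and the bit stream is kept as a string of "0"/"1" characters for readability rather than packed into bytes.

```python
from typing import List


def to_gaps(positions: List[int]) -> List[int]:
    """Replace each position by its distance to the previous one (first gap is from 0)."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def elias_gamma(n: int) -> str:
    """Elias-gamma code of n >= 1: (len - 1) zeros followed by n in binary."""
    if n < 1:
        raise ValueError("Elias-gamma is defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_positions(positions: List[int]) -> str:
    """Gap-encode the positions, then concatenate the gamma codes of the gaps."""
    return "".join(elias_gamma(g) for g in to_gaps(positions))


def decode_positions(bits: str) -> List[int]:
    """Inverse of encode_positions."""
    positions, i, prev = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        prev += gap
        positions.append(prev)
    return positions


encoded = encode_positions([3, 7, 8, 20])
```

The positions 3, 7, 8, 20 become the gaps 3, 4, 1, 12, which take 16 bits in total. Small gaps get short codes, which is why numerically close positions compress well.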

Word-Positions across Versions
High level of redundancy between versions
– Append-only changes leave most parts unchanged
– Example: word b between d_{t1} and d_{t2}
Numerical closeness of positions
– Small shifts in positions
– Example: word c between d_{t2} and d_{t3}
[Figure: positions of words b and c across the document versions]

FUSION
Idea:
– Merge (or fuse) multiple consecutive document versions, and exploit redundancy and positional proximity => better compressibility
A fused posting stores
– Positions: all word-positions in any of the versions
– Timestamps: all intermediate version timestamps
– Signatures: for each version, a bit-signature of positions
[Figure: fused positions for words b and c]
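The three components of a fused posting can be sketched as follows. This is one plausible reading of the slide, not the authors' data layout: the names `fuse` and `positions_at`, the tuple representation, and the example positions are assumptions.

```python
from typing import List, Tuple

Version = Tuple[int, List[int]]           # (timestamp, word-positions of one term)
Fused = Tuple[List[int], List[int], List[List[int]]]


def fuse(versions: List[Version]) -> Fused:
    """Fuse consecutive versions into (positions, timestamps, signatures)."""
    # Positions: every word-position that occurs in any of the versions.
    union = sorted(set().union(*(pos for _, pos in versions)))
    # Timestamps: one per fused version, in version order.
    timestamps = [t for t, _ in versions]
    # Signatures: per version, one bit per fused position (1 = present in that version).
    signatures = []
    for _, pos in versions:
        present = set(pos)
        signatures.append([1 if p in present else 0 for p in union])
    return union, timestamps, signatures


def positions_at(fused: Fused, version_index: int) -> List[int]:
    """Recover the positions of a single version from the fused posting."""
    union, _, signatures = fused
    return [p for p, bit in zip(union, signatures[version_index]) if bit]


# An append-only change (15 added), then a small shift of all positions by one.
versions = [(1, [2, 9]), (2, [2, 9, 15]), (3, [3, 10, 16])]
fused = fuse(versions)
```

The fused position list holds 6 entries instead of the 8 stored by three separate postings, and the shifted positions sit next to the originals, so their gaps are small. The cost is that reading any one version means reading the fused list for all three.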

Query Processing – win some, lose some
Save on overall space
– Naïve organization + processing => reads the whole list, computes ranking
– FUSION maintains a smaller list, so (naïve) query processing is faster
Who is naïve!?
– Skip pointers to jump ahead during query processing
– In the worst case, FUSION ends up reading and processing all the versions, instead of just one version!
Baseline: good performance, bad storage
FUSION: bad (worst-case) performance, good storage

Controlled FUSION
Compute a set of fusions over contiguous versions such that
– It takes minimal storage for word positions
– For any version, the maximum worst-case query processing overhead is within η
Can be set up as an optimization problem
Optimal solution computable in O(n^3) time and O(n) space
– Assumption: storage cost is monotone
– In practice, we found it close to O(n^2)
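To make the optimization problem concrete, here is a small dynamic-programming sketch. It is not the algorithm from the talk, whose cost model and procedure are not given on the slide. It rests on two loudly stated assumptions: the storage cost of a fused group is the number of distinct positions in it, and the overhead of a version is the fused position count divided by that version's own position count. The name `controlled_fusion` is hypothetical.

```python
from typing import List, Tuple


def controlled_fusion(versions: List[List[int]], eta: float) -> Tuple[float, List[Tuple[int, int]]]:
    """Partition versions into contiguous groups of minimal total cost.

    Assumed cost model (not from the source): a group costs |union of positions|,
    and a group is feasible if |union| <= eta * |positions of v| for each member v.
    Returns (total cost, list of half-open index ranges [start, end)).
    """
    if eta < 1:
        raise ValueError("eta must be at least 1 so that single versions are feasible")
    n = len(versions)
    best = [0.0] + [float("inf")] * n   # best[j]: minimal cost for the first j versions
    cut = [0] * (n + 1)                 # cut[j]: start index of the last group
    for j in range(1, n + 1):
        union, smallest = set(), float("inf")
        for i in range(j - 1, -1, -1):  # grow the last group leftwards: versions[i:j]
            union |= set(versions[i])
            smallest = min(smallest, len(versions[i]))
            if len(union) > eta * smallest:
                break                   # overhead only grows as the group grows
            if best[i] + len(union) < best[j]:
                best[j], cut[j] = best[i] + len(union), i
    groups, j = [], n
    while j > 0:                        # walk the cut points back to the start
        groups.append((cut[j], j))
        j = cut[j]
    groups.reverse()
    return best[n], groups


versions = [[2, 9], [2, 9, 15], [3, 10, 16], [3, 10, 16, 20]]
tuned = controlled_fusion(versions, eta=1.5)
strict = controlled_fusion(versions, eta=1.0)
```

With η = 1.5 the sketch fuses the first two and the last two versions for a cost of 7 positions; with η = 1.0 no fusion is allowed and the cost is the 12 positions of the unfused baseline. Raising η trades worst-case reading overhead for storage, which is the tunable knob the slide describes.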

Experimental Evaluation
English Wikipedia
– Revision history (2004–2005)
– 10% sample (~35,000 docs, ~900,000 versions)
Baseline:
– Elias-γ code: […] GBytes
– Elias-δ code: […] GBytes
FUSION:
– η between 1.1 and 10
– Elias-γ and Elias-δ for compressing word-positions in each fused posting

Experimental Results
– η = 1.5: 35% of the baseline
– η = 1.5: 44% of the baseline

Conclusions
Time-travel search
– Key to archive search and analysis
– An interesting and important problem!
Our time-machine: FluxCapacitor/TTIX
– Builds on the inverted index framework
– Tunable index-size reduction
FUSION
– Adds phrase querying to FluxCapacitor/TTIX
– More than 50% space reduction over the baseline, with 50% worst-case overhead in query processing

Thank you! Questions?