Efficient Indexing of Shared Content in IR Systems. Andrei Broder, Nadav Eiron, Marcus Fontoura, Michael Herscovici, Ronny Lempel, John McPherson, Eugene Shekita, Runping Qi.

Presentation transcript:

Efficient Indexing of Shared Content in IR Systems. Andrei Broder, Nadav Eiron, Marcus Fontoura, Michael Herscovici, Ronny Lempel, John McPherson, Eugene Shekita, Runping Qi

Motivation: IR systems typically use inverted indices to facilitate efficient retrieval. Web, email, news, and other data contain a significant amount of duplicated or shared content. Indexing duplicate content is expensive.

Scope of Work: We assume duplicate or common content is already identified in the corpus. We concern ourselves only with the efficient indexing of such content.

Types of Shared Content: Web duplicates are very common, on the order of 40% of all pages. Email/news threads: whole messages are often quoted, attachments are duplicated, and identical messages appear in multiple mailboxes.

Some Statistics: The IBM intranet has about 40% duplicate content, and Internet crawls reveal similar statistics. In the Enron dataset, 61% of messages are in threads, and 31% quote other messages verbatim.

Naïve Solution 1: Index Everything. Pros: simple to implement; semantics are preserved. Cons: index size blows up; performance penalty (big index + post-filtering).

Naïve Solution 2: Index Just One Copy. Pros: best performance; not too difficult to implement. Cons: only applies to the duplicates scenario; semantics are changed, and relevant results may not be returned for a query.

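The Web Duplicate Case: Metadata vs. Content. Removal of web duplicates changes the semantics of the query. Example: two duplicate pages with the content "text", one at almaden.ibm.com/... and one at watson.ibm.com/... Query: text url:watson. If only the almaden copy is kept in the index, this query no longer returns the watson page.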

Our Solution: Content is split into shared and private parts. Shared content is indexed only once. Private content (such as metadata in the Web duplicates case) is indexed for each document. The index provides virtual cursors that simulate having all content indexed.

Advantages: Index size, build time, and query efficiency. Precise semantics. No need for post-filtering.

Inverted Indices: The index is sorted by term. For each term, a sorted list of the documents in which it appears is maintained (the postings list). Each occurrence (posting) carries an additional payload. T1: …; T2: …

Document Sharing Model: Each document is partitioned into private and shared content. The two types are differentiated by the posting payload. Documents exist in a tree: shared content is shared with all descendants. Document IDs (and hence index order) are dictated by a DFS traversal of the document trees.
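The DFS numbering can be illustrated with a minimal sketch. The function name `assign_docids`, the dictionary-of-children representation, and the document names `a` to `e` are assumptions made for illustration; the slides do not say how the trees are stored. The sketch assumes a pre-order traversal, so every document gets a smaller docid than any of its descendants.

```python
def assign_docids(children, roots):
    """Number documents 1, 2, 3, ... in DFS pre-order over a forest.

    children: dict mapping a document name to the list of its child documents.
    roots:    list of root documents, in the order the trees should be indexed.
    """
    docid = {}
    next_id = 1
    stack = list(reversed(roots))  # reversed so the first root is popped first
    while stack:
        node = stack.pop()
        docid[node] = next_id
        next_id += 1
        # Push children in reverse so the leftmost child is visited next.
        stack.extend(reversed(children.get(node, [])))
    return docid


# Hypothetical forest: "a" has children "b" and "d", "b" has child "c";
# "e" is a separate single-document tree.
forest = {"a": ["b", "d"], "b": ["c"]}
ids = assign_docids(forest, ["a", "e"])
```

With this numbering, the documents of one subtree occupy a contiguous docid range, which is what allows a subtree to be skipped with a single forward move of a cursor.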

The Document Tree: Content is shared from ancestor to descendants. [Figure: document tree]

Example:
Documents:
    docid = 1: From: andrei  To: ronny, marcus  "did you read it?"
    docid = 2: From: ronny  To: marcus  "did you, marcus?"
    docid = 3: From: marcus  To: ronny  "not yet!"
Inverted index posting lists:
    andrei: …
    did: …, …
    it: …
    marcus: …, …, …, …
    not: …
    read: …
    ronny: …, …, …
    yet: …
    you: …, …
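As a rough illustration of how such posting lists could be built, the sketch below indexes the three example messages. Several things here are assumptions rather than statements from the slides: the names in the From/To headers are treated as private content (payload `"p"`), the message body is treated as shared content (payload `"s"`), the field names "From" and "To" are not indexed, and each posting is represented as a `(docid, payload)` pair. Under these assumptions, the number of postings per term agrees with the number of entries listed per term in the example.

```python
import re
from collections import defaultdict

# (docid, private text, shared text). The private/shared split is an assumption.
docs = [
    (1, "andrei ronny marcus", "did you read it?"),
    (2, "ronny marcus", "did you, marcus?"),
    (3, "marcus ronny", "not yet!"),
]


def build_index(docs):
    """Return {term: [(docid, payload), ...]} with postings sorted by docid."""
    index = defaultdict(list)
    for docid, private, shared in docs:  # docs are visited in docid order
        for payload, text in (("p", private), ("s", shared)):
            # One posting per distinct term per content type.
            for term in sorted(set(re.findall(r"[a-z]+", text.lower()))):
                index[term].append((docid, payload))
    return dict(sorted(index.items()))


index = build_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Note that `marcus` receives two postings for document 2, one private (the To header) and one shared (the body), which is why the payload has to be stored per posting rather than per document.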

Querying Inverted Indexes: Queries contain mandatory terms, forbidden terms, and optional terms (such as +term1 -term2). Typically a zigzag algorithm is used. It uses cursors on the postings lists. Cursors support two operations:
    next(): moves to the next posting
    fwdBeyond(d): moves to the first posting for a document with id >= d
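The two cursor operations and a zigzag join can be sketched as follows. This is a simplified illustration, not the system's implementation: it handles only mandatory terms (no forbidden or optional terms, no scoring), postings are plain sorted docid lists without payloads, and the names `Cursor`, `fwd_beyond`, `zigzag`, and the `END` sentinel are introduced here for the example.

```python
import bisect

END = float("inf")  # sentinel docid meaning "cursor exhausted"


class Cursor:
    """Cursor over one term's sorted list of docids."""

    def __init__(self, docids):
        self.docids = docids
        self.pos = 0

    @property
    def docid(self):
        return self.docids[self.pos] if self.pos < len(self.docids) else END

    def next(self):
        """Move to the next posting."""
        self.pos += 1

    def fwd_beyond(self, d):
        """Move to the first posting with docid >= d (never moves backwards)."""
        self.pos = max(self.pos, bisect.bisect_left(self.docids, d))


def zigzag(cursors):
    """Return the docids that appear in every cursor's postings list."""
    results = []
    while True:
        # The largest current docid is the smallest possible next match.
        target = max(c.docid for c in cursors)
        if target == END:
            return results
        for c in cursors:
            c.fwd_beyond(target)
        if all(c.docid == target for c in cursors):
            results.append(target)
            cursors[0].next()  # step past the match to look for the next one


matches = zigzag([Cursor([1, 3, 5, 8]), Cursor([2, 3, 8, 9])])
```

Because `fwd_beyond` jumps directly to the target docid, long runs of non-matching postings are skipped rather than scanned one by one.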

Top Level Query Algorithm:
    while (more results required) {
        invoke zigzag algorithm
        forward optional term cursors
        score document
        advance required/forbidden cursors
    }
In our solution, this algorithm uses virtual cursors.

Additional Information In The Index: Tree information is encoded by two attributes for each document:
    root(d): the docid of the document at the root of the tree containing d
    lastDescendant(d): the highest-numbered document that is a descendant of d
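One way these two attributes could be derived is sketched below. It assumes that docids were assigned in DFS pre-order (so a parent always has a smaller docid than its children), that a leaf's last descendant is taken to be the leaf itself, and that the tree is given as a hypothetical `parent` map; the slides do not describe how the attributes are computed.

```python
def tree_attributes(parent):
    """Compute root(d) and lastDescendant(d) for every docid.

    parent: dict mapping docid -> parent docid, or None for a root.
            Docids are assumed to follow DFS pre-order.
    """
    root, last = {}, {}
    # Ascending order: a parent's root is known before its children are visited.
    for d in sorted(parent):
        p = parent[d]
        root[d] = d if p is None else root[p]
        last[d] = d  # a document with no descendants is its own last descendant
    # Descending order: a child's value is final before it is pushed to its parent.
    for d in sorted(parent, reverse=True):
        p = parent[d]
        if p is not None:
            last[p] = max(last[p], last[d])
    return root, last


# Two flat trees: master 1 with duplicates 2, 3, 4 and master 5 with duplicate 6.
parent = {1: None, 2: 1, 3: 1, 4: 1, 5: None, 6: 5}
root, last = tree_attributes(parent)
```

For these flat trees the result is `root = 1` for documents 1 to 4 and `root = 5` for documents 5 and 6, with `lastDescendant(1) = 4` and `lastDescendant(5) = 6`; every leaf is its own last descendant.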

Physical Cursor Addition:
    physicalCursor::fwdShare(d)
        while (this.docid <= d and this.docid does not share content with d) {
            r = root(d);
            l = lastDescendant(this.docid);
            if (this.docid < r) {
                this.fwdBeyond(r);
            } else if (l < d) {
                this.fwdBeyond(l + 1);
            } else this.next();
        }

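fwdShare(d) example: [Figure: a postings list T with three private (p) and two shared (s) postings. The call fwdShare(10) is annotated with the moves fwdBeyond(root(10)), next(), and fwdBeyond(lastDescendant(6)+1).]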

Virtual Cursors: There are two types of cursors. Regular (positive) virtual cursors behave as if all shared content were indexed for all documents that contain it. Negated virtual cursors represent the complement of the postings list (used for forbidden terms). Both are implemented on top of a physical cursor.

Virtual Cursor Methods:
    VirtualCursor::next()
        l = lastDescendant(Cp.docid)
        if (Cp.payload == shared and this.docid < l)
            this.docid++;
        else {
            Cp.next();
            this.docid = Cp.docid;
        }

    VirtualCursor::fwdBeyond(d)
        if (this.docid >= d)
            return;
        Cp.fwdShare(d);
        this.docid = max(Cp.docid, d);
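The following is a runnable sketch of a positive virtual cursor layered on a physical cursor with fwdShare. It rests on several assumptions that are not stated in the slides: postings are `(docid, payload)` pairs with payload `"p"` or `"s"`; a posting is taken to "share content with d" if it belongs to d itself, or if it is shared and d lies in its subtree; a docid outside the corpus is treated as its own root; and `next()` is implemented simply as `fwd_beyond(docid + 1)`, which is a simplification chosen for this sketch rather than the `next()` shown above. The example tree is assumed to be the chain 1 → 2 → 3 (each message replying to the previous one).

```python
END = float("inf")  # sentinel docid meaning "cursor exhausted"


class PhysicalCursor:
    """Cursor over one term's stored postings: a sorted list of (docid, payload)."""

    def __init__(self, postings, root, last):
        self.postings, self.root, self.last = postings, root, last
        self.pos = 0

    @property
    def docid(self):
        return self.postings[self.pos][0] if self.pos < len(self.postings) else END

    @property
    def payload(self):
        return self.postings[self.pos][1] if self.pos < len(self.postings) else None

    def next(self):
        self.pos += 1

    def fwd_beyond(self, d):
        while self.docid < d:
            self.pos += 1

    def shares_with(self, d):
        # Assumed interpretation: own posting, or shared posting of an ancestor.
        if self.docid == d:
            return True
        return self.payload == "s" and self.docid < d <= self.last[self.docid]

    def fwd_share(self, d):
        while self.docid <= d and not self.shares_with(d):
            r = self.root.get(d, d)    # assumption: unknown docid is its own root
            l = self.last[self.docid]
            if self.docid < r:
                self.fwd_beyond(r)     # skip trees that precede d's tree
            elif l < d:
                self.fwd_beyond(l + 1) # skip a subtree that does not contain d
            else:
                self.next()            # private posting of an ancestor of d


class VirtualCursor:
    """Behaves as if shared content were indexed for every descendant."""

    def __init__(self, physical):
        self.cp = physical
        self.docid = physical.docid

    def fwd_beyond(self, d):
        if self.docid >= d:
            return
        self.cp.fwd_share(d)
        self.docid = max(self.cp.docid, d)

    def next(self):
        self.fwd_beyond(self.docid + 1)  # simplification, see lead-in


def matching_docs(postings, root, last):
    cur, out = VirtualCursor(PhysicalCursor(postings, root, last)), []
    while cur.docid != END:
        out.append(cur.docid)
        cur.next()
    return out


root = {1: 1, 2: 1, 3: 1}   # assumed chain 1 -> 2 -> 3
last = {1: 3, 2: 3, 3: 3}
read_docs = matching_docs([(1, "s")], root, last)
andrei_docs = matching_docs([(1, "p")], root, last)
```

In this example, a term stored once as a shared posting on document 1 is reported for documents 1, 2, and 3, while a private posting on document 1 is reported for document 1 only.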

Virtual Positive Cursors: Maintain both a physical and a logical position. Support next() and fwdBeyond(d). [Figure: a postings list with private (p) and shared (s) postings, illustrating next() and fwdBeyond(10).]

Virtual Negative Cursors: Support next() and fwdBeyond(d). The physical cursor is ahead of the logical cursor. [Figure: a postings list with private (p) and shared (s) postings, illustrating next() and fwdBeyond(7).]

Web Duplicates Application: Trees are flat, with the masters at the root. Leaves only have private content.
    Tree 1 (shared content S1; private content P1, P2, P3, P4):
        docid = 1: root = 1, lastDescendant = 4
        docid = 2: root = 1, lastDescendant = 2
        docid = 3: root = 1, lastDescendant = 3
        docid = 4: root = 1, lastDescendant = 4
    Tree 2 (shared content S5; private content P5, P6):
        docid = 6: root = 5, lastDescendant = 6

Build Performance Evaluation: Subsets of the IBM Intranet (36-44% dups).
    Columns: # docs | IS1 (GB) | IS2 (GB) | Space saved (%) | IT1 (s) | IT2 (s) | Speedup (%)
    Rows: 500K, 1000K, 1500K, 2000K, 2500K

Runtime Performance: Single-Term Queries

Runtime Performance: Two-Term Queries