Inverted Indexing for Text Retrieval Chapter 4 Lin and Dyer.



Introduction
Web search is a quintessential large-data problem. So are any number of problems in genomics.
– Google and Amazon (AWS) are both involved in research and discovery in this area.
Web search, or full-text search, depends on a data structure called the inverted index.
The web search problem breaks down into three major components:
– Gathering the web content (crawling) (project 1)
– Constructing the inverted index (indexing) (project 2)
– Ranking the documents given a query (retrieval) (exam 2)

Issues with these components
Crawling and indexing have similar characteristics: resource consumption is high.
Retrieval is very different: load is spiky and highly variable, quick response is a requirement, and there are many concurrent users.
There are many requirements for a web crawler or, more generally, a data aggregator:
– Etiquette, bandwidth resources, multilingual content, duplicate content, frequency of changes…

Inverted Indexes
Regular index: document → terms
Inverted index: term → documents
Example:
    term1 → {d1, p}, {d2, p}, {d23, p}
    term2 → {d2, p}, {d34, p}
    term3 → {d6, p}, {d56, p}, {d345, p}
where d is the doc id and p is the payload (for example, the term frequency; the payload can also be empty).
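The term → documents mapping above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the MapReduce version discussed later. The function name `build_inverted_index`, the toy documents, whitespace tokenization, and the choice of term frequency as the payload are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, term_frequency) postings, sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Payload p is the term frequency of the term in this document.
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return dict(index)

# Hypothetical toy collection: doc id -> text
docs = {
    1: "one fish two fish",
    2: "red fish blue fish",
    3: "one red bird",
}
index = build_inverted_index(docs)
print(index["fish"])  # [(1, 2), (2, 2)]
print(index["red"])   # [(2, 1), (3, 1)]
```

Because documents are visited in increasing doc id order, each postings list comes out already sorted by doc id.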

Retrieval
Once the inverted index is built, retrieval for an incoming query involves fetching the appropriate docs.
The docs are ranked and the top k docs are listed.
It is good to hold the inverted index in memory. If it is not in memory, some queries may involve random disk accesses to decode postings.
Solution: organize the disk accesses so that random seeks are minimized.
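A rough sketch of the fetch, rank, and top-k steps over an in-memory index. The scoring rule here (sum of term frequencies over the query terms) is only an assumption chosen to keep the example short; the slides do not specify a ranking function. `INDEX` and `retrieve` are hypothetical names.

```python
import heapq
from collections import defaultdict

# Toy in-memory index: term -> [(doc_id, term_frequency), ...]
INDEX = {
    "fish": [(1, 2), (2, 2)],
    "red": [(2, 1), (3, 1)],
    "bird": [(3, 1)],
}

def retrieve(query, index, k):
    """Return the k best (doc_id, score) pairs for a whitespace-separated query."""
    scores = defaultdict(int)
    for term in query.lower().split():
        # Fetch the postings list for each query term (empty if the term is unknown).
        for doc_id, tf in index.get(term, []):
            scores[doc_id] += tf
    # Highest score first; ties are broken by the smaller doc_id.
    return heapq.nsmallest(k, scores.items(), key=lambda item: (-item[1], item[0]))

top = retrieve("red fish", INDEX, k=2)
print(top)  # [(2, 3), (1, 2)]
```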

Pseudo Code
Pseudo code → baseline implementation → value-to-key conversion pattern implementation → …

Baseline implementation
procedure map(docid n, doc d)
    H ← new AssociativeArray
    for all terms t in doc d
        H{t} ← H{t} + 1
    for all terms t in H
        emit(term t, posting ⟨n, H{t}⟩)

Reducer for baseline implementation
procedure reduce(term t, postings [⟨n1, f1⟩, …])
    P ← new List
    for all postings ⟨n, f⟩ in postings
        Append(P, ⟨n, f⟩)
    Sort(P)    // sorted by docid
    Emit(term t, postings P)
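The baseline mapper and reducer can be simulated on one machine as follows. This is a sketch under stated assumptions, not a Hadoop job: `run_baseline` is a hypothetical driver that stands in for the framework's shuffle and sort, tokenization is a plain whitespace split, and documents are deliberately fed in reverse order to show why the reducer has to sort.

```python
from collections import Counter, defaultdict

def map_doc(doc_id, text):
    """Baseline mapper: emit (term, (doc_id, tf)) for each distinct term in the doc."""
    counts = Counter(text.lower().split())
    for term, tf in counts.items():
        yield term, (doc_id, tf)

def reduce_term(term, postings):
    """Baseline reducer: buffer every posting for the term, then sort by doc_id."""
    buffered = list(postings)
    buffered.sort()  # tuples compare on doc_id first
    return term, buffered

def run_baseline(docs):
    # Stand-in for shuffle and sort: group values by key. The framework does not
    # promise any order among the values of a key, so docs are visited in reverse.
    grouped = defaultdict(list)
    for doc_id in sorted(docs, reverse=True):
        for term, posting in map_doc(doc_id, docs[doc_id]):
            grouped[term].append(posting)
    return dict(reduce_term(term, grouped[term]) for term in sorted(grouped))

docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"}
baseline_index = run_baseline(docs)
print(baseline_index["fish"])  # [(1, 2), (2, 2)]
```

Note that `reduce_term` holds the whole postings list of a term in memory before sorting it.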

Shuffle and sort phase
The shuffle and sort phase is a very large group-by of the postings, keyed by term.
Let's look at a toy example: Fig. 4.3 (some items in the figure are incorrect).

Revised Implementation
Issue: MapReduce guarantees sort order only for keys, not for the values of a key.
So the sort in the reducer is an expensive operation, especially if the postings for a term cannot be held in memory.
Let's check a revised solution: change the emitted pair from (term t, posting ⟨n, f⟩) to (tuple ⟨t, n⟩, tf f).

Modified mapper
map(docid n, doc d)
    H ← new AssociativeArray
    for all terms t in doc d
        H{t} ← H{t} + 1
    for all terms t in H
        emit(tuple ⟨t, n⟩, tf H{t})

Modified Reducer
method initialize
    tprev ← ∅
    P ← new PostingsList
method reduce(tuple ⟨t, n⟩, tf [f1, …])
    if t ≠ tprev ∧ tprev ≠ ∅
        emit(term tprev, postings P)
        reset P
    P.add(⟨n, f1⟩)
    tprev ← t
method close
    emit(term tprev, postings P)
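A single-machine sketch of the revised design, assuming the framework delivers the composite (term, docid) keys to the reducer in sorted order. Here a plain `sort` stands in for that guarantee. `PostingsReducer` and `run_revised` are hypothetical names, and the input dict is written out of order on purpose.

```python
from collections import Counter

def map_doc(doc_id, text):
    """Modified mapper: the doc_id moves into the key, the value is just the tf."""
    counts = Counter(text.lower().split())
    for term, tf in counts.items():
        yield (term, doc_id), tf

class PostingsReducer:
    """Modified reducer: keeps state across reduce() calls, never sorts postings."""

    def __init__(self):
        self.prev_term = None
        self.postings = []
        self.output = []

    def reduce(self, key, tfs):
        term, doc_id = key
        # A new term means the previous term's postings list is complete.
        if self.prev_term is not None and term != self.prev_term:
            self.output.append((self.prev_term, self.postings))
            self.postings = []
        for tf in tfs:
            self.postings.append((doc_id, tf))
        self.prev_term = term

    def close(self):
        # Flush the postings of the last term seen.
        if self.prev_term is not None:
            self.output.append((self.prev_term, self.postings))

def run_revised(docs):
    pairs = [kv for doc_id, text in docs.items() for kv in map_doc(doc_id, text)]
    pairs.sort(key=lambda kv: kv[0])  # stand-in for the framework's sort on keys
    reducer = PostingsReducer()
    for key, tf in pairs:
        reducer.reduce(key, [tf])
    reducer.close()
    return dict(reducer.output)

docs = {3: "one red bird", 2: "red fish blue fish", 1: "one fish two fish"}
revised_index = run_revised(docs)
print(revised_index["fish"])  # [(1, 2), (2, 2)]
```

The reducer only ever appends to the current postings list, so in a real job the list could be written out incrementally instead of being buffered and sorted.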

Other modifications
The partitioner and shuffle have to deliver all tuples for the same term to the same reducer.
Use a custom partitioner so that all tuples with term t go to the same reducer.
Let's go through a numerical example.
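The idea behind the custom partitioner can be sketched as a function that looks only at the term part of the composite key. `term_partition` is a hypothetical helper, not a framework API, and the choice of `zlib.crc32` as the hash is an assumption (it is used here because it gives the same value on every run, unlike Python's built-in string `hash`).

```python
import zlib

def term_partition(key, num_reducers):
    """Route a composite (term, doc_id) key using its term only."""
    term, _doc_id = key
    # The doc_id is ignored on purpose: every tuple of a term maps to one reducer.
    return zlib.crc32(term.encode("utf-8")) % num_reducers

keys = [("fish", 1), ("fish", 2), ("red", 2), ("red", 3)]
assignments = {key: term_partition(key, 4) for key in keys}
print(assignments)
```

Partitioning on the full (term, docid) tuple instead would scatter the postings of one term across reducers, and no single reducer could emit the complete postings list.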

From the Yahoo site