Large Scale Search: Inverted Index, etc.

Slides:



Advertisements
Similar presentations
Advanced topics in Computer Science Jiaheng Lu Department of Computer Science Renmin University of China
Advertisements

Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 1: Boolean Retrieval 1.
Adapted from Information Retrieval and Web Search
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Lecture 2: Boolean Retrieval Model.
Boolean Retrieval Lecture 2: Boolean Retrieval Web Search and Mining.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Information Retrieval
CS276 Information Retrieval and Web Search Lecture 1: Boolean retrieval.
Information Retrieval using the Boolean Model. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all.
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introducing Information Retrieval and Web Search
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
LIS618 lecture 2 the Boolean model Thomas Krichel
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Modern Information Retrieval Lecture 3: Boolean Retrieval.
Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
Introduction to Information Retrieval Introduction to Information Retrieval COMP4201 Information Retrieval and Search Engines Lecture 1 Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval.
IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.
ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
CES 514 – Data Mining Lec 2, Feb 10 Spring 2010 Sonoma State University.
Introduction to Information Retrieval CSE 538 MRS BOOK – CHAPTER I Boolean Model 1.
Information Retrieval Lecture 1. Query Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all of Shakespeare’s.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar.
1. L01: Corpuses, Terms and Search Basic terminology The need for unstructured text search Boolean Retrieval Model Algorithms for compressing data Algorithms.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Related to Chapter 4:
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 1: Boolean retrieval.
Introduction to Information Retrieval Boolean Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 1: Boolean retrieval.
Information Retrieval and Web Search Boolean retrieval Instructor: Rada Mihalcea (Note: some of the slides in this set have been adapted from a course.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Related to Chapter 4:
Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Module 2: Boolean retrieval. Introduction to Information Retrieval Information Retrieval  Information Retrieval (IR) is finding material (usually documents)
CS315 Introduction to Information Retrieval Boolean Search 1.
Web-based Information Architecture 01: Boolean Retrieval Hongfei Yan School of EECS, Peking University 2/27/2013.
Why indexing? For efficient searching of a document
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
COMP9319: Web Data Compression and Search
Take-away Administrativa
Information Retrieval : Intro
CS122B: Projects in Databases and Web Applications Winter 2017
Lecture 1: Introduction and the Boolean Model Information Retrieval
COIS 442 Foundations on IR Information Retrieval and Web Search
Slides from Book: Christopher D
Modified by Dongwon Lee from slides by
Modified from Stanford CS276 slides Lecture 4: Index Construction
정보 검색 특론 Information Retrieval and Web Search
Lecture 7: Index Construction
Boolean Retrieval.
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
CSCE 561 Information Retrieval System Models
Index Construction: sorting
Lecture 7: Index Construction
Boolean Retrieval.
Information Retrieval and Web Search Lecture 1: Boolean retrieval
Information Retrieval
Discussion 5 Sara Javanmardi.
Boolean Retrieval.
Introducing Information Retrieval and Web Search
CS276 Information Retrieval and Web Search
Introducing Information Retrieval and Web Search
Inverted Index Construction
Introducing Information Retrieval and Web Search
Presentation transcript:

Large Scale Search: Inverted Index, etc. Comp: 441/552 Large Scale Machine Learning

Classical Search Problem Given: A (giant) set of text documents Assume it is a static collection for the moment Task: Retrieve documents that is relevant to the user’s query

The Measures: Precision and Recall Precision : Fraction of retrieved docs that are relevant to the user’s information need 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛= 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝐷𝑜𝑐𝑠 ∩ 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 𝑑𝑜𝑐𝑠 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 𝑑𝑜𝑐𝑠 Recall : Fraction of relevant docs in collection that are retrieved 𝑅𝑒𝑐𝑎𝑙𝑙= 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝐷𝑜𝑐𝑠 ∩ 𝑟𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 𝑑𝑜𝑐𝑠 𝑟𝑒𝑙𝑒𝑣𝑎𝑛𝑡 𝑑𝑜𝑐𝑠 There is usually a tradeoff between these two. For example if we report everything, recall is 1. 𝐹1𝑆𝑐𝑜𝑟𝑒=2 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛× 𝑟𝑒𝑐𝑎𝑙𝑙 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛+𝑟𝑒𝑐𝑎𝑙𝑙

Example Search for Documents containing the words Brutus AND Caesar but NOT Calpurnia Naïve Solution: grep Brutus and Caesar, then strip out lines containing Calpurnia Problem: Won’t work for the web. Just too slow!

Picture Taken from Internet Sec. 1.1 Term-document Matrix 1 if Document contains word, 0 otherwise Picture Taken from Internet

Picture Taken from Internet An Efficient Solution 0/1 can be treated as bit-vector. To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) BITWISE AND 110111 AND 110100 AND 101111 = 100100 Picture Taken from Internet

Issue: Cant build the Matrix Even for 1 million documents, with say 500K distinct terms. We get around half a trillion 0s and 1s. Storage will blow up. Every document is sparse and usually contains around 1000 terms on average, so the number of 1s is around a billion. How to exploit sparsity ?

Picture Taken from Internet Sec. 1.2 Solutions: Inverted index or Postings We need variable-size postings lists On disk, a continuous run of postings is normal and best Generally a linked list (good for insertion). Sorted (Why?) Brutus 1 2 4 11 31 45 173 174 Dictionary Caesar 1 2 4 5 6 16 57 132 Calpurnia 2 31 54 101 Picture Taken from Internet 8

Manipulating Sorted list is linear time. Picture Taken from Internet

Some Benefits All Boolean Operation is O(x+y). BITWISE AND is now Intersection of Postings. OR is MERGE Set DIFF can also be done similarly. Primary commercial retrieval tool for 3 decades. Order of operation is important. X AND Y AND Z. X AND Y first or Y AND Z first ? A whole Topic in itself: Query Optimization

Indexer steps: Token sequence Sec. 1.2 Indexer steps: Token sequence Sequence of (Modified token, Document ID) pairs. Doc 1 Doc 2 I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Picture Taken from Internet

Picture Taken from Internet Sec. 1.2 Indexer steps: Sort Sort by terms And then docID Picture Taken from Internet

Indexer steps: Dictionary & Postings Sec. 1.2 Indexer steps: Dictionary & Postings Multiple term entries in a single document are merged. Split into Dictionary and Postings Doc. frequency information is added for query optimization. Storage ? Picture Taken from Internet

Picture Taken from Internet Phrase Queries ? How about queries like “Rice University” Solution 1: Use Bi-grams. 2-contiguous words as tokens. “This is Rice University” gets 3 tokens “This is”, “is Rice”, and “Rice University” Problems: Longer phrases will blow up the dimentionality. If the vocabulary is 500k, then bi-grams are (500k)*(500k). Trigrams etc. ? Picture Taken from Internet

Another Popular Solution: Positional Index Sec. 2.4.2 Another Popular Solution: Positional Index <Rice: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1,2,4,5 could contain “Rice University”? For phrase queries, we use a merge algorithm recursively Can now do all sorts of phrase queries, slower than bi-grams but more expressive. (An ideal solution is usually a combination) Still needs significant memory. (2-4x more for positions) Picture Taken from Internet

Processing a phrase query Sec. 2.4.2 Processing a phrase query Extract inverted index entries for each distinct term: Rice, University. Merge their doc:position lists to enumerate all positions with “Rice University”. Rice: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ... University: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ... Same general method for proximity searches

Picture from https://moz.com/blog/illustrating-the-long-tail Power Law in Real World Catching results is a very powerful idea. Identifying Heavy-Hitters on data streams is a topic in itself. Picture from https://moz.com/blog/illustrating-the-long-tail