Inverted Index Hongning Wang CS@UVa

Abstraction of search engine architecture [architecture diagram: on the indexing side, a Crawler feeds the Doc Analyzer, which produces the Doc Representation consumed by the Indexer to build the Index over the indexed corpus; on the query side, the User's query becomes a Query Rep that the Ranker matches against the Index to produce results, with the ranking procedure tuned via evaluation and feedback]

What we have now: documents have been crawled from the Web, tokenized/normalized, and represented as bag-of-words vectors. Let's do search! Query: "information retrieval". [word-document matrix over the vocabulary information, retrieval, retrieved, is, helpful, for, you, everyone, with one row per document (Doc1, Doc2)]

Complexity analysis (space): the word-document matrix takes O(D * V) space, where D is the total number of documents and V is the vocabulary size. By Zipf's law, each document contains only about 10% of the vocabulary, so roughly 90% of the space is wasted. Space efficiency can be greatly improved by storing, for each document, only the words that actually occur in it, e.g., as a linked list per document.
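A minimal sketch of this sparse per-document storage idea (a dict per document instead of a dense D x V matrix; the toy corpus and names are illustrative, not from the lecture):

docs = {
    "Doc1": "information retrieval is helpful for everyone",
    "Doc2": "helpful information is retrieved for you",
}

# Sparse forward index: docID -> {term: count}; terms absent from a document
# simply do not appear, so no space is spent on the zeros of the dense matrix.
forward_index = {
    doc_id: {t: text.split().count(t) for t in set(text.split())}
    for doc_id, text in docs.items()
}

print(forward_index["Doc1"])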

Complexity analysis (time): answering a query by scanning every document takes O(|q| * D * |D|) time, where |q| is the length of the query, D is the number of documents, and |D| is the length of a document.

def naive_search(q, D):
    # q: list of query words; D: dict mapping docID -> list of words
    doclist = []
    for wi in q:                    # for each query word
        for d in D:                 # scan every document
            for wj in D[d]:         # scan every word in the document -- the bottleneck,
                if wi == wj:        # since most comparisons will not match!
                    doclist.append(d)
                    break
    return doclist

Solution: inverted index. Build a look-up table for each word in the vocabulary, mapping from word to documents (dictionary -> postings). Query time becomes O(|q| * |L|), where |L| is the average length of a posting list; by Zipf's law, |L| << D. [diagram: dictionary entries information, retrieval, retrieved, is, helpful, for, you, everyone, each pointing to its posting list over Doc1/Doc2; the query "information retrieval" only touches the postings of its two terms]
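A minimal inverted-index sketch for the toy two-document corpus (term -> sorted docIDs; illustrative only):

from collections import defaultdict

docs = {
    1: "information retrieval is helpful for everyone",
    2: "helpful information is retrieved for you",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)
postings = {term: sorted(ids) for term, ids in index.items()}

# Query "information retrieval": only the two relevant posting lists are read.
print(postings["information"], postings["retrieval"])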

Structures for the inverted index. Dictionary: modest size, needs fast random access, stays in memory (hash table, B-tree, trie, ...). Postings: huge, accessed sequentially, stay on disk; they contain docID, term frequency, term positions, etc., and need compression. "Key data structure underlying modern IR" - Christopher D. Manning.

Sorting-based inverted index construction. Each term occurrence is parsed into a tuple <termID, docID, count>. Parse & count emits the tuples sorted by docID: <1,1,3> <2,1,2> <3,1,1> ... <1,2,2> <3,2,3> <4,2,5> ... <1,300,3> <3,300,1>. A "local" sort reorders each in-memory run by termID: <1,1,3> <1,2,2> <2,1,2> <2,4,3> ... <1,5,3> <1,6,2> ... <1,299,3> <1,300,1>. A final merge sort produces the globally sorted stream <1,1,3> <1,2,2> <1,5,2> <1,6,3> ... <1,300,3> <2,1,2> ... <5000,299,1> <5000,300,1>, in which all information about term 1 is contiguous. A term lexicon (1: the, 2: cold, 3: days, 4: a, ...) and a docID lexicon (1: doc1, 2: doc2, 3: doc3, ...) map the integer IDs back to strings.
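A compact sketch of the sort-based construction idea (in memory for brevity; a real implementation writes the locally sorted runs to disk and merge-sorts them from there; the runs below are toy data):

import heapq

# Two runs of (termID, docID, count) tuples, each already locally sorted.
run1 = [(1, 1, 3), (2, 1, 2), (3, 2, 3)]
run2 = [(1, 2, 2), (1, 5, 3), (4, 2, 5)]

# Global merge sort: stream the runs together so all tuples of a term are
# contiguous, then group them into posting lists.
postings = {}
for term_id, doc_id, count in heapq.merge(run1, run2):
    postings.setdefault(term_id, []).append((doc_id, count))

print(postings[1])   # [(1, 3), (2, 2), (5, 3)]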

Sorting-based inverted index. Challenge: the collection size exceeds the memory limit. Key steps: a local sort by termID (within each in-memory batch, to enable the later global merge sort) and a global merge sort that preserves docID order within each term (needed for the later posting-list joins). This can index a large corpus on a single machine, and the same scheme is also suitable for MapReduce.

A close look at the inverted index. Issues to address, annotated on the dictionary-postings diagram: approximate search (e.g., misspelled queries, wildcard queries), proximity search (e.g., phrase queries), dynamic index update, and index compression.

Dynamic index update. Option 1: periodically rebuild the index; acceptable if the collection changes slowly and the penalty of missing new documents is negligible. Option 2: auxiliary index; keep the index for new documents in memory and merge it into the main index when its size exceeds a threshold. Merging increases I/O; the remedy is to keep multiple auxiliary indices on disk and apply logarithmic merging.
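A rough in-memory sketch of the auxiliary-index idea with logarithmic merging (index "generations" grow geometrically, so each posting is merged O(log N) times; the class and parameter names are illustrative):

class LogMergeIndex:
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.aux = {}        # in-memory index for new documents: term -> set(docID)
        self.aux_docs = 0
        self.levels = []     # (size, index) pairs standing in for on-disk indices

    def add_document(self, doc_id, text):
        for term in set(text.split()):
            self.aux.setdefault(term, set()).add(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.threshold:
            self._flush()

    def _flush(self):
        size, index = self.aux_docs, self.aux
        self.aux, self.aux_docs = {}, 0
        # Like binary carries: merge with existing levels of no larger size.
        while self.levels and self.levels[-1][0] <= size:
            other_size, other = self.levels.pop()
            for term, ids in other.items():
                index.setdefault(term, set()).update(ids)
            size += other_size
        self.levels.append((size, index))

    def lookup(self, term):
        result = set(self.aux.get(term, set()))
        for _, index in self.levels:
            result |= index.get(term, set())
        return sorted(result)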

Recap: full text indexing. Bag-of-words representation. Doc1: "Information retrieval is helpful for everyone." Doc2: "Helpful information is retrieved for you." [word-document adjacency matrix over the vocabulary information, retrieval, retrieved, is, helpful, for, you, everyone, with one row per document]

Recap: automatic text indexing. Remove non-informative words: the all-1 columns of the matrix (words appearing in every document) and the rare words (the mostly-0 columns).

Recap: normalization. Convert different forms of a word to a normalized form in the vocabulary, e.g., U.S.A -> USA, St. Louis -> Saint Louis. Solutions: rule-based (delete periods and hyphens, put everything in lower case) and dictionary-based (construct equivalence classes, e.g., car -> "automobile, vehicle", mobile phone -> "cellphone").
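A tiny sketch of the rule-based normalization described above (illustrative; real systems use much richer rules and dictionaries):

import re

def normalize(token):
    # Lowercase and delete periods and hyphens, e.g., "U.S.A" -> "usa".
    return re.sub(r"[.\-]", "", token.lower())

print(normalize("U.S.A"), normalize("Saint-Louis"))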

Recap: stemming. Reduce inflected or derived words to their root form (plurals, adverbs, inflected word forms), e.g., ladies -> lady, referring -> refer, forgotten -> forget. This bridges the vocabulary gap, at the risk of losing the precise meaning of a word, e.g., lay -> lie (a false statement, or to be in a horizontal position?). Solutions for English: the Porter stemmer (patterns of vowel-consonant sequences) and the Krovetz stemmer (morphological rules).

Recap: inverted index. Build a look-up table for each word in the vocabulary, mapping from word to documents (dictionary -> postings). Query time is O(|q| * |L|), where |L| is the average length of a posting list; by Zipf's law, |L| << D. [same dictionary-postings diagram for the query "information retrieval" as before]

Recap: sorting-based inverted index construction. Parse & count emits <termID, docID, count> tuples sorted by docID, each in-memory run is locally sorted by termID, and a global merge sort brings all tuples of a term together; the term lexicon and docID lexicon map the integer IDs back to strings. [same tuple-stream diagram as before]

Index compression. Benefits: save storage space, increase cache efficiency, and improve the disk-to-memory transfer rate. Target: the postings file.

Basics in coding theory. Expected code length: E[L] = Σ_i p(x_i) · l_i, where l_i is the number of bits used to encode event x_i.

Fixed-length code, skewed distribution (E[L] = 2):
  Event  P(X)   Code
  a      0.75   00
  b      0.10   01
  c      0.10   10
  d      0.05   11

Fixed-length code, uniform distribution (E[L] = 2):
  Event  P(X)   Code
  a      0.25   00
  b      0.25   01
  c      0.25   10
  d      0.25   11

Variable-length code, skewed distribution (E[L] = 1.4):
  Event  P(X)   Code
  a      0.75   0
  b      0.10   10
  c      0.10   111
  d      0.05   110
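A quick sanity check of the expected code lengths above (the probabilities and codes are exactly those in the tables):

def expected_length(table):
    # E[L] = sum_i p(x_i) * l_i
    return sum(p * len(code) for p, code in table.values())

fixed  = {"a": (0.75, "00"), "b": (0.10, "01"), "c": (0.10, "10"), "d": (0.05, "11")}
varlen = {"a": (0.75, "0"),  "b": (0.10, "10"), "c": (0.10, "111"), "d": (0.05, "110")}

print(round(expected_length(fixed), 2))    # 2.0
print(round(expected_length(varlen), 2))   # 1.4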

Index compression. Observation on the postings file: instead of storing docIDs in a posting list, store the gaps between consecutive docIDs, since the list is ordered. Zipf's law again: the more frequent a word is, the smaller its gaps are; the less frequent a word is, the shorter its posting list is. This heavily biased distribution gives a great opportunity for compression; in information-theoretic terms, entropy measures how hard the data is to compress.
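A minimal sketch of docID gap encoding (the toy posting list is illustrative):

def to_gaps(doc_ids):
    # Keep the first docID, then store each difference to the previous docID.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

postings = [3, 7, 8, 15, 16, 100]
print(to_gaps(postings))                            # [3, 4, 1, 7, 1, 84]
print(from_gaps(to_gaps(postings)) == postings)     # True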

Index compression. Solution: spend fewer bits on small (high-frequency) integers, i.e., variable-length coding. Unary code: x >= 1 is coded as x-1 one-bits followed by a zero, e.g., 3 => 110, 5 => 11110. γ-code: x is coded as the unary code of 1 + floor(log2 x), followed by x - 2^floor(log2 x) written in floor(log2 x) bits, e.g., 3 => 101, 5 => 11001. δ-code: same as the γ-code, but the unary prefix is replaced by its γ-code, e.g., 3 => 1001, 5 => 10101.
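A minimal sketch of the unary, γ- and δ-codes defined above (returns bit strings; illustrative only):

def unary(x):
    # x >= 1: (x - 1) one-bits followed by a zero.
    return "1" * (x - 1) + "0"

def gamma(x):
    # unary(1 + floor(log2 x)) followed by x - 2^floor(log2 x) in floor(log2 x) bits.
    k = x.bit_length() - 1                                     # floor(log2 x)
    offset = format(x - (1 << k), "b").zfill(k) if k > 0 else ""
    return unary(k + 1) + offset

def delta(x):
    # Like gamma, but the length prefix is itself gamma-coded.
    k = x.bit_length() - 1
    offset = format(x - (1 << k), "b").zfill(k) if k > 0 else ""
    return gamma(k + 1) + offset

print(unary(3), unary(5))   # 110 11110
print(gamma(3), gamma(5))   # 101 11001
print(delta(3), delta(5))   # 1001 10101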

Index compression example. Table 1: index and dictionary compression for Reuters-RCV1 (Manning et al., Introduction to Information Retrieval):
  Data structure             Size (MB)
  Text collection            960.0
  Dictionary                  11.2
  Postings, uncompressed     400.0
  Postings, γ-coded          101.0
Compression rate: (101 + 11.2) / 960 ≈ 11.7% of the original collection size.

Search within the inverted index. Query processing: parse the query syntax (e.g., Barack AND Obama, orange OR apple) and apply the same processing to the query as was applied to the documents: tokenization -> normalization -> stemming -> stop-word removal.

Search within the inverted index. Procedure: look up each query term in the dictionary and retrieve its posting list, then combine the lists according to the operator: AND intersects the posting lists, OR unions them, NOT takes their difference.

Search within the inverted index. Example: AND operation, scanning the two posting lists in parallel. Term1 postings: 2 4 8 16 32 64 128; Term2 postings: 1 2 3 5 8 13 21 34. Time complexity: O(|L1| + |L2|). Trick for speed-up: when performing a multi-way join, start from the lowest-frequency term and proceed toward the highest-frequency one.
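A minimal sketch of the linear-time AND (intersection) of two sorted posting lists, using the toy lists from the slide:

def intersect(l1, l2):
    # Walk both sorted lists with two pointers: O(|L1| + |L2|).
    i, j, result = 0, 0, []
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            result.append(l1[i])
            i += 1
            j += 1
        elif l1[i] < l2[j]:
            i += 1
        else:
            j += 1
    return result

term1 = [2, 4, 8, 16, 32, 64, 128]
term2 = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(term1, term2))   # [2, 8]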

Phrase query: "computer science". "He uses his computer to study science problems" is not a match! We need the phrase to be matched exactly in documents. Indexing N-grams generally does not work for this: the dictionary becomes too large, and how would we break a long phrase into N-grams? Instead, we need term positions in documents, and we can store them in the inverted index.

Phrase query via generalized postings matching: check the equality condition on docIDs together with a required position pattern between the two query terms, e.g., T2.pos - T1.pos = 1 (T1 must appear immediately before T2 in any matched document). A proximity query relaxes this to |T2.pos - T1.pos| <= k. [same two posting lists as before, scanned in parallel]
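A sketch of positional postings matching for the phrase case (the positional index layout and the toy data are illustrative):

# Positional postings: term -> {docID: sorted list of positions}.
positions = {
    "computer": {1: [4], 2: [2, 9]},
    "science":  {1: [8], 2: [3, 17]},
}

def phrase_match(t1, t2, index):
    # A document matches "t1 t2" if some position of t2 is exactly one greater
    # than a position of t1 in that document (T2.pos - T1.pos = 1).
    hits = []
    for doc in set(index[t1]) & set(index[t2]):
        pos2 = set(index[t2][doc])
        if any(p + 1 in pos2 for p in index[t1][doc]):
            hits.append(doc)
    return sorted(hits)

print(phrase_match("computer", "science", positions))   # [2]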

More and more things are put into the index: document structure (title, abstract, body, bullets, anchor text) and entity annotations (whether a term is part of a person's name or a location's name).

Spelling correction: tolerate misspelled queries, e.g., "barck obama" -> "barack obama". Principles: among the alternative correct spellings of a misspelled query, choose the nearest one; among those alternatives, prefer the most common one.

Spelling correction: proximity between query terms. Edit distance: the minimum number of edit operations (insert, delete, replace) required to transform one string into another. Tricks for speed-up: fix the prefix length (errors rarely happen on the first letter), build a character-level inverted index (e.g., over character 3-grams), and consider the keyboard layout (e.g., 'u' is more likely to be mistyped as 'y' than as 'z').
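A minimal dynamic-programming sketch of edit distance (Levenshtein), using the insert/delete/replace operations listed above:

def edit_distance(s, t):
    # dp[i][j] = minimum edits to turn s[:i] into t[:j].
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                  # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                                  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # match or replace
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete
                           dp[i][j - 1] + 1,          # insert
                           dp[i - 1][j - 1] + cost)   # replace / match
    return dp[m][n]

print(edit_distance("barck", "barack"))   # 1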

Spelling correction: beyond proximity between terms, consider the query context, e.g., "flew form IAD" -> "flew from IAD" (every word is spelled correctly, but the context reveals the error). Solution: enumerate alternatives for all the query terms; heuristics must be applied to reduce the search space.

Spelling correction: phonetic similarity, e.g., "herman" -> "Hermann". Solution: phonetic hashing, so that similar-sounding terms hash to the same value.
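The slide does not name a specific scheme; a classic choice for phonetic hashing is Soundex. A compact sketch (the exact rules vary slightly between variants):

def soundex(term):
    # Map similar-sounding terms to the same 4-character code.
    term = term.upper()
    digit = {}
    for chars, d in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")]:
        for c in chars:
            digit[c] = d
    codes = [digit.get(c, "0") for c in term]        # vowels, H, W, Y -> "0"
    collapsed = [codes[0]]
    for d in codes[1:]:                              # collapse runs of equal digits
        if d != collapsed[-1]:
            collapsed.append(d)
    tail = [d for d in collapsed[1:] if d != "0"]    # drop the zeros
    return (term[0] + "".join(tail) + "000")[:4]     # keep first letter + 3 digits

print(soundex("herman"), soundex("Hermann"))   # H655 H655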

What you should know: the inverted index for modern information retrieval, sorting-based index construction, index compression, search within the inverted index, phrase queries, and query spelling correction.

Today's reading: Introduction to Information Retrieval. Chapter 2, The term vocabulary and postings lists: Section 2.3, Faster postings list intersection via skip pointers; Section 2.4, Positional postings and phrase queries. Chapter 4, Index construction. Chapter 5, Index compression: Section 5.2, Dictionary compression; Section 5.3, Postings file compression.