Published by Randolph Nelson. Modified over 9 years ago.
1
The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
2
An index is a data structure that is designed to make search (or finding things) fast and efficient
Text search often requires an inverted index
▪ Represents a class of similar data structures
▪ Inverted because we associate documents with words (rather than identifying words within or as part of documents)
3
Each index term is associated with an inverted list that may contain:
▪ A list of documents
▪ A list of word occurrences in documents
▪ Word counts
▪ Positional information regarding each word
▪ Metadata identifying fields (title, author, etc.)
▪ etc.
4
Each entry in an inverted list is called a posting
The part of the posting that refers to a specific document or location is called a pointer
Each document in the collection is given a unique number
Lists are usually document-ordered
▪ Sorted by document number
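The terms above (posting, pointer, document-ordered list) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: tokens are split on whitespace, and the names `Posting`, `build_index`, `docs`, and `index` are hypothetical, not taken from the textbook.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    doc_id: int                                     # pointer to a specific document
    positions: list = field(default_factory=list)   # word offsets inside that document

    @property
    def count(self) -> int:
        # The word count is simply the number of recorded positions.
        return len(self.positions)


def build_index(docs):
    """docs maps document number -> text; returns term -> document-ordered postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):          # ascending ids keep every list document-ordered
        for pos, term in enumerate(docs[doc_id].lower().split()):
            postings = index[term]
            if not postings or postings[-1].doc_id != doc_id:
                postings.append(Posting(doc_id))
            postings[-1].positions.append(pos)
    return dict(index)


# Hypothetical two-document collection, for illustration only.
docs = {1: "tropical fish live in tropical water", 2: "fish tanks hold water"}
index = build_index(docs)
```

Because documents are visited in ascending number order, each inverted list comes out sorted by document number without a separate sort step.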
5
Inverted index with counts for documents S1, S2, S3, and S4
What does this data structure tell us?
6
How is the index built? What are the limitations of scale? How can we parallelize this?
7
To handle larger indexes:
▪ Build the inverted list structure until we run out of memory
▪ Write the partial index to disk; repeat
▪ At the end of this process, we have many partial indexes, which must be merged
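A rough sketch of the build-and-flush loop follows. It assumes a posting budget (`max_postings`) as a stand-in for "running out of memory" and writes each partial index as a JSON-lines file; the function name, file naming, and file format are illustrative choices, not something the slides prescribe.

```python
import json
import os
import tempfile
from collections import defaultdict


def build_partial_indexes(docs, max_postings, out_dir):
    """Index docs (doc_id -> text), flushing to disk whenever the in-memory
    index holds at least max_postings postings. Returns the partial index paths."""
    paths = []
    index = defaultdict(list)    # term -> [[doc_id, count], ...]
    n_postings = 0

    def flush():
        nonlocal index, n_postings
        if not index:
            return
        path = os.path.join(out_dir, f"partial_{len(paths)}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for term in sorted(index):       # alphabetical order, ready for merging
                f.write(json.dumps([term, index[term]]) + "\n")
        paths.append(path)
        index = defaultdict(list)            # start a fresh in-memory index
        n_postings = 0

    for doc_id in sorted(docs):
        counts = defaultdict(int)
        for term in docs[doc_id].lower().split():
            counts[term] += 1
        for term, count in counts.items():
            index[term].append([doc_id, count])
            n_postings += 1
        if n_postings >= max_postings:       # assumed stand-in for "out of memory"
            flush()
    flush()                                  # write whatever is left at the end
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = {1: "a b", 2: "b c", 3: "c d"}
        print(len(build_partial_indexes(sample, max_postings=2, out_dir=tmp)))
```

A real indexer would track bytes of memory rather than a posting count, but the control flow (fill, flush, repeat, flush the remainder) is the same.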
8
Partial indexes must be designed so they can be merged in small pieces
▪ Store tokens/words in alphabetical order
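Alphabetical order is what makes a piecewise merge possible: the merger only needs the next entry of each partial index in memory at any time. A minimal sketch, assuming each partial index is an iterable of `(term, postings)` pairs already sorted by term (the function name and the in-memory representation are hypothetical):

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_partial_indexes(partials):
    """Each partial is an iterable of (term, postings) pairs sorted by term,
    where postings is a list of (doc_id, count) pairs sorted by doc_id.
    Yields merged (term, postings) pairs in alphabetical order."""
    # heapq.merge streams its inputs, holding only one pending entry per partial.
    merged = heapq.merge(*partials, key=itemgetter(0))
    for term, group in groupby(merged, key=itemgetter(0)):
        postings = []
        for _, plist in group:
            postings.extend(plist)
        postings.sort()          # restore document order across partials
        yield term, postings


# Hypothetical partial indexes, for illustration only.
p1 = [("fish", [(1, 2)]), ("water", [(1, 1)])]
p2 = [("aquarium", [(3, 1)]), ("fish", [(2, 1), (3, 1)])]
merged_index = list(merge_partial_indexes([p1, p2]))
```

In practice the partials would be file readers rather than lists, but any sorted iterable works with the same function.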
9
Use the merging strategy to parallelize:
▪ Multiple machines build partial indexes
▪ A single machine collects and merges all partial indexes to produce a final index
Parallelization and distributed computing are required due to the scale of information
▪ Not just for search
▪ Also for analytics and data mining
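The same pattern can be imitated on one machine. In the sketch below, threads stand in for the separate machines and a final loop plays the role of the single collecting machine; this is an assumption made for illustration, and `index_shard` and `build_parallel` are hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_shard(shard):
    """Worker: build a partial index (term -> [(doc_id, count)]) for one shard."""
    partial = defaultdict(list)
    for doc_id in sorted(shard):
        counts = defaultdict(int)
        for term in shard[doc_id].lower().split():
            counts[term] += 1
        for term, count in counts.items():
            partial[term].append((doc_id, count))
    return dict(partial)


def build_parallel(shards, workers=2):
    """Index the shards concurrently, then merge on a single collector."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(index_shard, shards))
    final = defaultdict(list)
    for partial in partials:                 # the single merging step
        for term, postings in partial.items():
            final[term].extend(postings)
    # Sort terms alphabetically and each list by document number.
    return {term: sorted(final[term]) for term in sorted(final)}


# Hypothetical shards: each dict is the slice of the collection one worker sees.
shards = [{1: "fish water", 3: "fish"}, {2: "fish fish tank"}]
parallel_index = build_parallel(shards)
```

The merge remains a single-machine step here, which mirrors the slide; it is also the part that becomes the bottleneck as the number of partial indexes grows.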
10
First normalize the user query using the same normalization rules applied during text transformation
▪ Convert to lowercase (downcase)
▪ Remove extraneous characters
▪ Perform stemming
▪ etc.
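A small sketch of such a normalization function is given below. The suffix-stripping `toy_stem` is a deliberately crude stand-in for a real stemmer, and the character filter assumes plain ASCII text; both are assumptions for illustration. The point is that the one function would be called both when indexing documents and when processing queries.

```python
import re


def toy_stem(token):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, then stem."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [toy_stem(tok) for tok in text.split()]


query_terms = normalize("Tropical FISH, tanks!")
```

If the query were normalized differently from the documents (for example, stemmed when the index was not), query terms would fail to match index terms even when the underlying words are the same.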
11
Document-at-a-time query processing:
▪ Calculate complete scores for documents by processing all relevant term lists, one document at a time
Term-at-a-time query processing:
▪ Accumulate scores for documents by processing term lists in their entirety, one term list at a time
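The two strategies visit the same postings in a different order and arrive at the same scores. The sketch below assumes the simplest possible scoring function (the sum of term counts) and an index that maps each term to a document-ordered list of `(doc_id, count)` pairs; a real engine would use a proper ranking function.

```python
from collections import defaultdict


def term_at_a_time(index, query_terms):
    """Process each inverted list in full, adding into per-document accumulators."""
    accumulators = defaultdict(int)
    for term in query_terms:
        for doc_id, count in index.get(term, []):
            accumulators[doc_id] += count     # partial score until all lists are read
    return dict(accumulators)


def document_at_a_time(index, query_terms):
    """Advance all lists together, finishing one document's score before the next."""
    lists = [index.get(term, []) for term in query_terms]
    cursors = [0] * len(lists)
    scores = {}
    while True:
        # The next document to score is the smallest doc id under any cursor.
        current = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not current:
            break
        doc_id = min(current)
        score = 0
        for i, lst in enumerate(lists):
            if cursors[i] < len(lst) and lst[cursors[i]][0] == doc_id:
                score += lst[cursors[i]][1]
                cursors[i] += 1
        scores[doc_id] = score                # complete score, no accumulator table
    return scores


# Hypothetical index with counts, for illustration only.
count_index = {"tropical": [(1, 2), (3, 1)], "fish": [(1, 1), (2, 3), (3, 1)]}
query = ["tropical", "fish"]
```

Document-at-a-time relies on the lists being document-ordered, while term-at-a-time needs memory for one accumulator per candidate document.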
14
Read less data from the inverted lists
▪ A multi-keyword search requires that all query terms appear in the results
▪ Use skipping and skip pointers to speed up multi-keyword searches
[Figure: a term's inverted list with skip pointers]
▪ GOAL: skip those documents that do not contain the other query term(s)
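One way to picture skipping is an intersection of two sorted document-number lists in which some positions carry a pointer a fixed distance ahead. The sketch below places implicit skip pointers every `sqrt(n)` entries, a common rule of thumb that is assumed here rather than taken from the slides; the function name is hypothetical.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted lists of unique doc ids, using implicit skip
    pointers placed every ~sqrt(len) entries (the spacing is an assumption)."""
    skip_a = max(1, int(math.sqrt(len(a))))
    skip_b = max(1, int(math.sqrt(len(b))))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


# Hypothetical lists: a common term and a rare term.
common = [1, 2, 3, 5, 8, 13, 21, 34, 55]
rare = [8, 55, 89]
matches = intersect_with_skips(common, rare)
```

A skip is safe because the lists are sorted: every entry jumped over is smaller than the skip target, which in turn is no larger than the document being looked for.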
15
Calculate scores for fewer documents
▪ Apply conjunctive processing, in which every document must contain all query terms
▪ Works best when one query term occurs much less frequently than the others
▪ Modify document-at-a-time and term-at-a-time algorithms to remove documents that do not contain all query terms
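A conjunctive variant of term-at-a-time processing might look like the sketch below. It assumes the index maps each term to a document-ordered list of `(doc_id, count)` pairs and that the score is the sum of counts; starting from the shortest list is what makes a rare term so effective at limiting the candidates.

```python
def conjunctive_term_at_a_time(index, query_terms):
    """Score only documents that contain every query term.
    index maps term -> document-ordered [(doc_id, count)]; score = sum of counts."""
    lists = [index.get(term, []) for term in query_terms]
    if not lists or any(not lst for lst in lists):
        return {}                          # a missing term means no document qualifies
    lists.sort(key=len)                    # shortest list (rarest term) first
    accumulators = dict(lists[0])          # candidates come from the rarest term only
    for lst in lists[1:]:
        survivors = {}
        j = 0
        # Candidates are in ascending doc order, so one forward pass suffices.
        for doc_id, score in accumulators.items():
            while j < len(lst) and lst[j][0] < doc_id:
                j += 1
            if j < len(lst) and lst[j][0] == doc_id:
                survivors[doc_id] = score + lst[j][1]
        accumulators = survivors           # drop documents missing this term
        if not accumulators:
            break
    return accumulators


# Hypothetical index: one rare term and one common term.
conj_index = {"galago": [(3, 1)], "animal": [(1, 2), (2, 1), (3, 4), (4, 1)]}
conj_scores = conjunctive_term_at_a_time(conj_index, ["animal", "galago"])
```

Only one document is ever scored in this example, even though the common term's list has four postings.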
18
Read and study Chapter 5 (skim §5.4)
Do Exercises 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, and 5.8