Why indexing? For efficient searching of a document


1 Why indexing? For efficient searching of a document
Sequential text search:
• Small documents
• Text volatile
Data structures:
• Large, semi-stable document collection
• Efficient search

2 Representation of Inverted Files
• Index (word list, vocabulary) file: Stores the list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries (lexicographic index). Often held in memory.
• Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially.
• Document file: Stores the documents. Important for user interface design.

3 Organization of Inverted Files
[Figure: index file (terms ant, bee, cat, dog, elk, fox, gnu, hog, each with a pointer to its postings), postings file (inverted lists), and documents file.]

4 Decisions in Building Inverted Files: What is a Term?
• Underlying character set, e.g., printable ASCII, Unicode, UTF-8.
• Is there a controlled vocabulary? If so, what words are included? Stemming?
• List of stopwords.
• Rules to decide the beginning and end of words, e.g., spaces or punctuation.
• Character sequences not to be indexed, e.g., sequences of numbers.
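The decisions above can be made concrete with a small tokenizer. This is only a sketch under assumed rules (ASCII letters and digits form words, a tiny hypothetical stopword list, no stemming, pure numbers skipped); the function name `tokenize` and its parameters are illustrative and do not come from the slides.

```python
import re

# Hypothetical stopword list; a real system would use a much longer one.
STOPWORDS = {"a", "and", "the", "of", "to", "in", "is", "it", "for", "was"}

def tokenize(text, stopwords=STOPWORDS, index_numbers=False):
    """Split text into lowercase terms under simple, assumed rules."""
    terms = []
    # A word is a maximal run of ASCII letters or digits; spaces and
    # punctuation mark the beginning and end of words.
    for word in re.findall(r"[A-Za-z0-9]+", text):
        word = word.lower()
        if word in stopwords:
            continue  # stopwords are not indexed
        if not index_numbers and word.isdigit():
            continue  # sequences of numbers are not indexed
        terms.append(word)
    return terms

print(tokenize("It was a dark and stormy night in 1830."))
# ['dark', 'stormy', 'night']
```

Changing any one rule (for example, passing `index_numbers=True`) changes what counts as a term, which is why these decisions have to be fixed before the index is built.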

5 Efficiency Criteria
Storage: Inverted files are big, typically 10% to 100% of the size of the collection of documents.
Update performance: It must be possible, with a reasonable amount of computation, to:
(a) Add a large batch of documents
(b) Add a single document
Retrieval performance: Retrieval must be fast enough to satisfy users and not use excessive resources.

6 Data Structure Indexing Methods
• Inverted index
• Suffix trees and arrays
• Signature files: word-oriented index structures based on hashing (usually not used for large texts)

7 Inverted Index
This is the primary data structure for text indexes.
Basically two elements: (Vocabulary, Occurrences)
Main idea: Invert documents into a big index.
Basic steps:
• Make a “dictionary” of all the tokens in the collection.
• For each token, list all the docs it occurs in, possibly with the location in each document.
• Compress to reduce redundancy in the data structure. This also reduces the I/O and storage required.

8 Inverted File Types
• Index file: Actual posting list for each distinct term in the collection
• Document file: Information about each document: ID, name, when published, etc.
• Weight file: Similarity between document and query

9 Inverted Indexes
We have seen “vector files”. An inverted file is a vector file “inverted” so that rows become columns and columns become rows.
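The row/column swap can be illustrated with a small sketch. It assumes a vector file is held as one sparse row per document (document ID mapped to term weights); the layout and the name `invert` are illustrative assumptions, not something the slides specify.

```python
# Assumed layout of a "vector file": one sparse row per document.
vectors = {
    1: {"country": 1, "time": 1},
    2: {"country": 1, "dark": 1, "time": 1},
}

def invert(vectors):
    """Turn document rows into term rows (rows become columns)."""
    inverted = {}
    for did in sorted(vectors):
        for term, weight in vectors[did].items():
            # Each term collects the documents it occurs in.
            inverted.setdefault(term, {})[did] = weight
    return inverted

inverted = invert(vectors)
print(inverted["dark"])  # {2: 1}
print(inverted["time"])  # {1: 1, 2: 1}
```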

10 How Are Inverted Files Created
Documents are parsed one document at a time to extract tokens. These are saved with the document ID (DID) as pairs <token, DID>.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

11 How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically by token and, for equal tokens, in document order.

12 How Inverted Files are Created
Multiple term entries for a single document are merged. Within-document term frequency (tf) information is compiled.
Result: <token,DID,tf>, e.g., <the,1,2>

13 How Inverted Files are Created
Then the file can be split into:
• A dictionary file: file of unique tokens.
• A postings file: file of what document each token is in and how often, and sometimes where the token is in the document.
Worst case O(n), where n is the size of the database.
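The four steps (parse, sort, merge, split) can be sketched end to end on the two example documents. This is a minimal in-memory illustration, not the slides' implementation: the function names, the punctuation stripping, and the dictionary layout (token mapped to a document count and an offset into a flat postings list) are all assumptions.

```python
from itertools import groupby

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

def parse(docs):
    """Step 1: emit one <token, DID> pair per token occurrence."""
    pairs = []
    for did, text in docs.items():
        for raw in text.split():
            token = raw.strip(".,;:!?").lower()
            if token:
                pairs.append((token, did))
    return pairs

def build(docs):
    # Step 2: sort alphabetically by token, then by document ID.
    pairs = sorted(parse(docs))
    # Step 3: merge repeated pairs into <token, DID, tf> triples.
    triples = [(tok, did, len(list(g))) for (tok, did), g in groupby(pairs)]
    # Step 4: split into a dictionary file and a postings file.
    dictionary = {}  # token -> (number of docs, offset into postings)
    postings = []    # flat list of (DID, tf) entries
    for tok, group in groupby(triples, key=lambda t: t[0]):
        entries = [(did, tf) for _, did, tf in group]
        dictionary[tok] = (len(entries), len(postings))
        postings.extend(entries)
    return dictionary, postings

dictionary, postings = build(docs)
n_docs, start = dictionary["the"]
print(postings[start:start + n_docs])  # [(1, 2), (2, 2)]
```

A real system would write both parts to disk and sort externally when the pairs do not fit in memory; the sketch keeps everything in Python lists for clarity.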

14 Dictionary and Posting Files
[Figure: dictionary file and postings file]

15 Inverted indexes
Permit fast search for individual terms.
For each term, you get a list consisting of:
• document ID
• frequency of term in doc (optional)
• position of term in doc (optional)
<token,DID,tf,position>
<token,(DIDi,tf,positionij),…>
These lists can be used to solve Boolean queries:
country -> d1, d2
manor -> d2
country AND manor -> d2

16 How Inverted Files are Used
Query on “time” AND “dark”:
• 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
• 1 doc with “dark” in dictionary -> ID 2 from posting file
Therefore, only doc 2 satisfies the query.
[Figure: dictionary file and postings file]
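A Boolean AND query like this one reduces to intersecting sorted postings lists. The sketch below assumes each term maps to a sorted list of document IDs; the names `intersect` and `query_and` and the small hand-built `index` are illustrative only.

```python
def intersect(p1, p2):
    """Merge two sorted lists of document IDs, keeping the common IDs."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Hand-built postings for a few terms of the two example documents.
index = {"time": [1, 2], "dark": [2], "country": [1, 2], "manor": [2]}

def query_and(index, *terms):
    """Return the IDs of documents that contain every query term."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0] if lists else []
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result

print(query_and(index, "time", "dark"))      # [2]
print(query_and(index, "country", "manor"))  # [2]
```

Because both lists are sorted by document ID, the merge touches each posting at most once, which is the "rapid merging of lists" that the postings file is designed for.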

17 Inverted index
Associates a posting list with each term.
POSTING LIST example:
a (d1, 1)
the (d1,2) (d2,2)
• Replace frequency with tfidf
• Compress index and put hash links
• Match query to index and rank

18 Position in inverted file posting
POSTING LIST example:
now (d1;1,1)
time (d1;1,10) (d2;1,126)
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2 (from position 69): It was a dark and stormy night in the country manor. The time was past midnight
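A positional index can be sketched as follows. Note the assumption: this sketch records word positions counted from 1 within each document, which is a different convention from the offsets shown in the posting list example, and the names `build_positional` and `phrase` are illustrative.

```python
from collections import defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

def build_positional(docs):
    """Map each token to {DID: [word positions]} (positions start at 1)."""
    index = defaultdict(dict)
    for did, text in docs.items():
        for pos, raw in enumerate(text.split(), start=1):
            token = raw.strip(".,;:!?").lower()
            if token:
                index[token].setdefault(did, []).append(pos)
    return index

def phrase(index, first, second):
    """Documents in which `second` immediately follows `first`."""
    hits = []
    for did, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(did, []))
        if any(p + 1 in following for p in positions):
            hits.append(did)
    return sorted(hits)

index = build_positional(docs)
print(index["time"])                      # {1: [4], 2: [13]}
print(phrase(index, "country", "manor"))  # [2]
```

Storing positions makes the postings larger, but it allows phrase and proximity queries that document IDs alone cannot answer.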

19 Change weight
Multiple term entries for a single document are merged. Within-document term frequency information is compiled. Replace term frequency by tfidf.
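Replacing tf by tfidf can be sketched as below. The slides do not give a formula, so this assumes one common variant, weight = tf × ln(N / df), where N is the number of documents and df is the number of documents containing the term; the function name and the sample postings are illustrative.

```python
import math

# Postings with raw term frequencies: term -> [(DID, tf), ...]
postings = {
    "the": [(1, 2), (2, 2)],
    "time": [(1, 1), (2, 1)],
    "dark": [(2, 1)],
}

def tfidf_postings(postings, n_docs):
    """Replace each tf by tf * ln(N / df), one assumed tf-idf variant."""
    weighted = {}
    for term, plist in postings.items():
        df = len(plist)  # number of documents containing the term
        idf = math.log(n_docs / df)
        weighted[term] = [(did, tf * idf) for did, tf in plist]
    return weighted

weighted = tfidf_postings(postings, n_docs=2)
print(weighted["the"])   # [(1, 0.0), (2, 0.0)]
print(weighted["dark"])  # weight ln(2), about 0.693, for doc 2
```

With only two documents, any term that appears in both gets weight 0 under this variant, which shows how idf discounts terms that do not discriminate between documents.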

20 Documents File for Web Search System
For Web search systems:
• A document is a Web page.
• The documents file is the Web.
• The document ID is the URL of the document.
Indexes are built using a Web crawler, which retrieves each page on the Web (or a subset). After indexing, each page is discarded, unless stored in a cache.
(In addition to the usual index file and postings file, the indexing system stores special information.)

21 Index Files
On disk: If an index is held on disk, search time is dominated by the number of disk accesses.
In memory: Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics, and a pointer to the inverted list, averaging 100 characters. The size of the index is then 100 megabytes, which can easily be held in the memory of a dedicated computer.

22 Index File Structures: Linear Index
Advantages:
• Can be searched quickly, e.g., by binary search, O(log n)
• Good for sequential processing, e.g., comp*
• Convenient for batch updating
• Economical use of storage
Disadvantages:
• Index must be rebuilt if an extra term is added
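A linear index can be sketched as a sorted list searched with Python's standard `bisect` module. The term list and helper names are illustrative; the prefix search additionally assumes that no term contains the character U+FFFF or any higher code point, which is used here as an upper sentinel.

```python
import bisect

# A linear index: the terms kept in one sorted array.
terms = sorted(["ant", "bee", "cat", "compile", "computer", "compute",
                "dog", "elk"])

def lookup(terms, term):
    """Binary search, O(log n): index of term, or -1 if absent."""
    i = bisect.bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else -1

def prefix_range(terms, prefix):
    """All terms starting with prefix, e.g. the query comp*."""
    lo = bisect.bisect_left(terms, prefix)
    # Assumed sentinel: U+FFFF sorts after every character in the terms.
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

print(lookup(terms, "cat"))         # 2
print(prefix_range(terms, "comp"))  # ['compile', 'compute', 'computer']
```

The matching terms for comp* sit next to each other, which is what makes the structure good for sequential processing. Adding a term with `bisect.insort(terms, "cow")` keeps the list sorted but shifts every later entry, which reflects the update cost listed under the disadvantages.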

