Searching and Indexing

Presentation on theme: "Searching and Indexing"— Presentation transcript:

1 Searching and Indexing
Thanks to Bill Arms, Marti Hearst

2 Motivation
Effective search: why indexing, and why automated indexing is crucial
Search results: similarity, ranking, importance

3 Searching and Browsing: The Human in the Loop
Search the index to return hits; browse the repository to return objects

4 Definitions
Information retrieval: Subfield of computer and information science that deals with automated retrieval of documents (especially text) based on their content and context.
Searching: Seeking specific information within a body of information. The result of a search is a set of hits.
Browsing: Unstructured exploration of a body of information.
Linking: Moving from one item to another following links, such as citations, references, etc.

5 Definitions (continued)
Query: A string of text, describing the information that the user is seeking. Each word of the query is called a search term. A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols.
Full text searching: Methods that compare the query with every word in the text, without distinguishing the function of the various words.
Fielded searching: Methods that search on specific bibliographic or structural fields, such as author or heading.

6 Web Search: Browsing
Users give queries of 2 to 4 words
Most users click only on the first few results; few go beyond the fold on the first page
80% of users use a search engine to find sites: they search to find the site, then browse to find the information
Amit Singhal, Google, 2004

7 Browsing in Information Space
[Diagram: a browsing path from a starting point through items in information space]
Effectiveness depends on: (a) starting point, (b) effective feedback, (c) convenience

8 Designing the Search Page: Making Decisions
Overall organization: spacious or cramped
Division of functionality across different pages
Positioning of components in the interface
Emphasis of parts of the interface
Query input: insert a text string or fill in text boxes
Interactivity of search results
Performance requirements

9 Sorting and Ranking Hits
When a user submits a query to a search system, the system returns a set of hits. With a large collection of documents, the set of hits may be very large. The value to the user depends on the order in which the hits are presented.
Three main methods:
• Sorting the hits, e.g., by date
• Ranking the hits by similarity between query and document
• Ranking the hits by the importance of the documents

10 Indexes
Search systems rarely search document collections directly. Instead, an index is built of the documents in the collection and the user searches the index.
Documents can be digital (e.g., web pages) or physical (e.g., books).

11 Automatic indexing
The aim of automatic indexing is to build indexes and retrieve information without human intervention. When the information that is being searched is text, methods of automatic indexing can be very effective.
History: Much of the fundamental research in automatic indexing was carried out by Gerard Salton, Professor at Cornell, who introduced the vector model.

12 Information Retrieval from Collections of Textual Documents
Major Categories of Methods Ranking by similarity to query (vector space model) Exact matching (Boolean) Ranking of matches by importance of documents (PageRank) Combination methods

13 Text Based Information Retrieval
Most ranking methods are based on the vector space model. Most matching methods are based on Boolean operators. Web search methods combine vector space model with ranking based on importance of documents. Many practical systems combine features of several approaches. In the basic form, all approaches treat words as separate tokens with minimal attempt to interpret them linguistically.

14 Documents A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation. Take out the sequence and one has a “bag of words” The individual words and other symbols are known as tokens or terms. A textual document can be: • Free text, also known as unstructured text, which is a continuous sequence of tokens. • Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.

15 Word Frequency Observation: Some words are more common than others.
Statistics: Most large collections of text documents have similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of data structures used to index documents
• are relied on by many retrieval models
Treat documents as a bag of words

16 Word Frequency Example The following example is taken from:
Jamie Callan, Characteristics of Text, 1997 Sample of 19 million words The next slide shows the 50 commonest words in rank order (r), with their frequency (f).

17 The 50 commonest words in rank order, with frequency f (… marks digits lost in transcription):
the   1,130,021    from      96,900    or      54,958
of      547,…      he        94,…      about   53,713
to      516,…      million   93,…      market  52,110
a       464,…      year      90,…      they    51,359
in      390,…      its       86,…      this    50,933
and     387,…      be        85,…      would   50,828
that    204,…      was       83,…      you     49,281
for     199,…      company   83,070    which   48,273
is      152,…      an        76,…      bank    47,940
said    148,…      has       74,…      stock   47,401
it      134,…      are       74,…      trade   47,310
on      121,…      have      73,…      his     47,116
by      118,…      but       71,…      more    46,244
as      109,…      will      71,…      who     42,142
at      101,…      say       66,…      one     41,635
mr      101,…      new       64,…      their   40,910
with    101,…      share     63,925

18 Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
f is the frequency that w appears
r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Plot: frequency f against rank r]

19 Rank Frequency Example
Normalize the words in Callan's data by total number of word occurrences in the corpus. Define: r is the rank of word w in the sample. f is the frequency of word w in the sample. n is the total number of word occurrences in the sample.

20 Values of rf × 1000/n (… marks values lost in transcription):
the   59    from     92    or      101
of    …     he       …     about   102
to    …     million  …     market  101
a     …     year     …     they    103
in    …     its      …     this    105
and   …     be       …     would   107
that  …     was      …     you     106
for   …     company  …     which   107
is    72    an       …     bank    109
said  …     has      …     stock   110
it    …     are      …     trade   112
on    …     have     …     his     114
by    …     but      …     more    114
as    …     will     …     who     106
at    …     say      …     one     107
mr    …     new      …     their   108
with  …     share    114
Apart from the very commonest words, rf × 1000/n clusters around 100, i.e., rf/n is roughly 0.1.

21 Zipf's Law If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation: r * f = c Different collections have different constants c. In English text, c tends to be about n / 10, where n is the number of word occurrences in the collection. For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see: Zipf, G. K., Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949
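Zipf's estimate is easy to code as a sanity check; the sketch below is illustrative (the function name `expected_frequency` is ours, not from the lecture).

```python
# Zipf's law: r * f = c, with c ≈ n/10 for English text, so the
# rank-r word is expected to occur about n / (10 * r) times.
def expected_frequency(n, r):
    """Estimated frequency of the rank-r word among n word occurrences."""
    return n / (10 * r)

# For a 19-million-word sample, the rough estimate for the rank-1 word:
print(expected_frequency(19_000_000, 1))  # 1900000.0
```

The observed count for "the" in the sample (1,130,021) is the same order of magnitude as the estimate, which is all the "roughly fit" in Zipf's law promises.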

22 Methods that Build on Zipf's Law
Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.
Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off). Rarely used.
Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.

23 Definitions
Corpus: A collection of documents that are indexed and searched together.
Word list: The set of all terms that are used in the index for a given corpus (also known as a vocabulary file). With full text searching, the word list is all the terms in the corpus, with stop words removed. Related terms may be combined by stemming.
Controlled vocabulary: A method of indexing where the word list is fixed. Terms from it are selected to describe each document.
Keywords: A name for the terms in the word list, particularly with a controlled vocabulary.

24 Why the interest in Queries?
Queries are ways we interact with IR systems Nonquery methods? Types of queries?

25 Types of Query Structures
Query Models (languages) – most common Boolean Queries Extended-Boolean Queries Natural Language Queries Vector queries Others?

26 Simple query language: Boolean
Earliest query model: terms + connectors (or operators)
Terms: words, normalized (stemmed) words, phrases, thesaurus terms
Connectors: AND, OR, NOT
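These connectors map directly onto set operations over an inverted index. A minimal sketch, using a toy index (the terms and document IDs below are illustrative):

```python
# Boolean retrieval: each term maps to the set of documents containing it.
index = {
    "ant": {1, 2},
    "bee": {1, 2},
    "dog": {2, 3},
    "cat": {3},
}
all_docs = {1, 2, 3}

def AND(a, b):
    return index.get(a, set()) & index.get(b, set())

def OR(a, b):
    return index.get(a, set()) | index.get(b, set())

def NOT(a):
    return all_docs - index.get(a, set())

print(AND("ant", "dog"))  # {2}
```

Note that NOT needs the set of all documents, which is why pure NOT queries are expensive and most systems only allow NOT combined with a positive term.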

27 Similarity Ranking Methods
Methods that look for matches (e.g., Boolean) assume that a document is either relevant to a query or not relevant.
Similarity ranking methods measure the degree of similarity between a query and a document: how similar is the document to the request?

28 Similarity Ranking Methods
The query and the documents go to the index database, with a mechanism for determining the similarity of the query to each document; the result is a set of documents ranked by how similar they are to the query.

29 Term Similarity: Example
Problem: Given two text documents, how similar are they? [Methods that measure similarity do not assume exact matches.]
A document can be any length from one word to thousands. A query is a special type of document.
Example: Here are three documents. How similar are they?
d1: ant ant bee
d2: dog bee dog hog dog ant dog
d3: cat gnu dog eel fox

30 Term Similarity: Basic Concept
Two documents are similar if they contain some of the same terms. Possible measures of similarity might take into consideration: (a) The number of terms that are shared (b) Whether the terms are common or unusual (c) How many times each term appears (d) The lengths of the documents

31 Term Vector Space
Term vector space: an n-dimensional space, where n is the number of different terms used to index a set of documents (i.e., the size of the word list).
Vector: Document i is represented by a vector. Its magnitude in dimension j is tij, where:
tij > 0 if term j occurs in document i
tij = 0 otherwise
tij is the weight of term j in document i.

32 A Document Represented in a 3-Dimensional Term Vector Space

33 Basic Method: Incidence Matrix (Binary Weighting)
document   text                            terms
d1         ant ant bee                     ant bee
d2         dog bee dog hog dog ant dog     ant bee dog hog
d3         cat gnu dog eel fox             cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
d1     1    1    0    0    0    0    0    0
d2     1    1    0    1    0    0    0    1
d3     0    0    1    1    1    1    1    0

3 vectors in 8-dimensional term vector space
Weights: tij = 1 if document i contains term j, and zero otherwise
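The incidence matrix can be built mechanically from the slide's three documents; a minimal sketch:

```python
# Build a binary incidence matrix: rows are documents, columns are the
# sorted terms of the word list, entries are 1 if the term occurs.
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}
terms = sorted({t for text in docs.values() for t in text.split()})

matrix = {
    name: [1 if t in text.split() else 0 for t in terms]
    for name, text in docs.items()
}
print(terms)         # ['ant', 'bee', 'cat', 'dog', 'eel', 'fox', 'gnu', 'hog']
print(matrix["d2"])  # [1, 1, 0, 1, 0, 0, 0, 1]
```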

34 Basic Vector Space Methods: Similarity
The similarity between two documents is a function of the angle between their vectors in the term vector space.

35 Two Documents Represented in 3-Dimensional Term Vector Space

36 Vector Space Revision
x = (x1, x2, x3, ..., xn) is a vector in an n-dimensional vector space.
Length of x is given by (extension of Pythagoras's theorem): |x|² = x1² + x2² + ... + xn²
If x1 and x2 are vectors, their inner product (or dot product) is given by: x1 · x2 = x11x21 + x12x22 + ... + x1nx2n
Cosine of the angle θ between the vectors x1 and x2: cos(θ) = (x1 · x2) / (|x1| |x2|)
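The same formulas in code, as a self-contained sketch:

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors:
    (x . y) / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

print(cosine([1, 0, 0], [1, 0, 0]))  # 1.0 (same direction)
print(cosine([1, 0, 0], [0, 1, 0]))  # 0.0 (orthogonal)
```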

37 Example: Comparing Documents (Binary Weighting)
      ant  bee  cat  dog  eel  fox  gnu  hog   length
d1     1    1    0    0    0    0    0    0     √2
d2     1    1    0    1    0    0    0    1     √4
d3     0    0    1    1    1    1    1    0     √5

38 Example: Comparing Documents
Similarity of documents in the example:
        d1      d2      d3
d1      1      0.71      0
d2     0.71     1       0.22
d3      0      0.22      1

39 Simple Uses of Vector Similarity in Information Retrieval
Threshold limit on retrieval For query q, retrieve all documents with similarity above a threshold, e.g., similarity > 0.50. Ranking For query q, return the n most similar documents ranked in order of similarity. [This is the standard practice.]

40 Similarity of Query to Documents (No Weighting)
q: ant dog
document   text                            terms
d1         ant ant bee                     ant bee
d2         dog bee dog hog dog ant dog     ant bee dog hog
d3         cat gnu dog eel fox             cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
q      1    0    0    1    0    0    0    0
d1     1    1    0    0    0    0    0    0
d2     1    1    0    1    0    0    0    1
d3     0    0    1    1    1    1    1    0

41 Calculate Ranking
Similarity of query to documents in the example:
       d1     d2     d3
q     1/2   1/√2   1/√10
If the query q is searched against this document set, the ranked results are: d2, d1, d3
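This ranking can be checked by computing the binary-weighted cosine similarities directly; a sketch that represents each document as its set of terms:

```python
import math

# Binary weights: each document is just its set of terms.
docs = {
    "d1": {"ant", "bee"},
    "d2": {"ant", "bee", "dog", "hog"},
    "d3": {"cat", "dog", "eel", "fox", "gnu"},
}
q = {"ant", "dog"}

def sim(a, b):
    """Cosine similarity for binary vectors: |a & b| / (sqrt|a| * sqrt|b|)."""
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))

ranked = sorted(docs, key=lambda d: sim(q, docs[d]), reverse=True)
print(ranked)  # ['d2', 'd1', 'd3']
```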

42 Weighting: Unnormalized Form of Term Frequency (tf)
document   text                            terms
d1         ant ant bee                     ant bee
d2         dog bee dog hog dog ant dog     ant bee dog hog
d3         cat gnu dog eel fox             cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog   length
d1     2    1    0    0    0    0    0    0     √5
d2     1    1    0    4    0    0    0    1     √19
d3     0    0    1    1    1    1    1    0     √5

Weights: tij = frequency that term j occurs in document i

43 Example (continued)
Similarity of documents in the example:
        d1      d2      d3
d1      1      0.31      0
d2     0.31     1       0.41
d3      0      0.41      1
Similarity depends upon the weights given to the terms. [Note the differences from the previous results for binary weighting.]

44 Similarity: Weighting by Unnormalized Form of Term Frequency
query q: ant dog
document   text                            terms
d1         ant ant bee                     ant bee
d2         dog bee dog hog dog ant dog     ant bee dog hog
d3         cat gnu dog eel fox             cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog   length
q      1    0    0    1    0    0    0    0     √2
d1     2    1    0    0    0    0    0    0     √5
d2     1    1    0    4    0    0    0    1     √19
d3     0    0    1    1    1    1    1    0     √5

45 Calculate Ranking Similarity of query to documents in example:
        d1      d2      d3
q     2/√10   5/√38   1/√10
If the query q is searched against this document set, the ranked results are: d2, d1, d3

46 Choice of Weights
query q: ant dog
document   text                            terms
d1         ant ant bee                     ant bee
d2         dog bee dog hog dog ant dog     ant bee dog hog
d3         cat gnu dog eel fox             cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
q      ?                   ?
d1     ?    ?
d2     ?    ?              ?                ?
d3               ?    ?    ?    ?    ?

What weights lead to the best information retrieval?

47 Evaluation Before we can decide whether one system of weights is better than another, we need systematic and repeatable methods to evaluate methods of information retrieval.

48 Methods for Selecting Weights
Empirical Test a large number of possible weighting schemes with actual data. (Based on work of Salton, et al.) Model based Develop a mathematical model of word distribution and derive weighting scheme theoretically. (Probabilistic model of information retrieval.)

49 Weighting Term Frequency (tf)
Suppose term j appears fij times in document i. What weighting should be given to a term j? Term Frequency: Concept A term that appears many times within a document is likely to be more important than a term that appears only once.

50 Normalized Form of Term Frequency: Free-text Document
The unnormalized method is to use fij, the frequency of term j in document i, as the term frequency.
...but in free-text documents, terms are likely to appear more often in long documents. Therefore fij should be scaled by some variable related to the length of the document.

51 Term Frequency: Free-text Document
A standard method for free-text documents: scale fij relative to the frequency of other terms in the document. This partially corrects for variations in the length of the documents.
Let mi = maxj (fij), i.e., mi is the maximum frequency of any term in document i.
Term frequency (tf): tfij = fij / mi
Note: There is no special justification for taking this form of term frequency except that it works well in practice and is easy to calculate.
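This normalization takes only a few lines; a sketch (the function name is ours):

```python
from collections import Counter

def term_frequencies(text):
    """tfij = fij / mi, where mi is the largest raw count in the document."""
    counts = Counter(text.split())
    m = max(counts.values())
    return {term: f / m for term, f in counts.items()}

# For d2 from the running example, dog occurs 4 times (the maximum):
print(term_frequencies("dog bee dog hog dog ant dog"))
# {'dog': 1.0, 'bee': 0.25, 'hog': 0.25, 'ant': 0.25}
```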

52 Weighting Inverse Document Frequency (idf)
Suppose term j appears fij times in document i. What weighting should be given to term j?
Inverse Document Frequency: Concept
By Zipf's Law we know that some terms appear much more often than others across the documents of a corpus. A term that occurs in only a few documents is likely to be a better discriminator than a term that appears in most or all documents.

53 Inverse Document Frequency
A standard method: the simple measure n/nj over-emphasizes small differences, so use a logarithm.
Inverse document frequency (idf): idfj = log2 (n/nj), for nj > 0
Note: There is no special justification for taking this form of inverse document frequency except that it works well in practice and is easy to calculate.

54 Example of Inverse Document Frequency
n = 1,000 documents
term j    nj      n/nj    idfj
A          …       …       …
B          …       …       …
C          …       …       …
D        1,000     1       0
(… marks values lost in transcription; a term such as D that occurs in every document has idfj = log2(1) = 0, so it has no discriminating power)
From: Salton and McGill

55 Full Weighting: A Standard Form of tf.idf
Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances: (weight of term j in document i) = (term frequency) * (inverse document frequency) A standard tf.idf weighting scheme, for free text documents, is: tij = tfij * idfj = (fij / mi) * (log2 (n/nj) + 1) when nj > 0
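A sketch of this standard weighting over a small corpus (the function name is ours, and the corpus is the running example):

```python
import math
from collections import Counter

def tfidf_weights(corpus):
    """tij = (fij / mi) * (log2(n / nj) + 1) for each term in each document."""
    n = len(corpus)
    tokenized = [doc.split() for doc in corpus]
    n_j = Counter(t for tokens in tokenized for t in set(tokens))  # doc counts
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        m_i = max(counts.values())
        weights.append({t: (f / m_i) * (math.log2(n / n_j[t]) + 1)
                        for t, f in counts.items()})
    return weights

w = tfidf_weights(["ant ant bee",
                   "dog bee dog hog dog ant dog",
                   "cat gnu dog eel fox"])
# "ant" is in 2 of 3 documents and has tf = 1 in d1, so its weight there
# is log2(3/2) + 1 ≈ 1.585
```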

56 Vector Similarity Computation with Weights
Documents in a collection are assigned terms from a set of n terms.
The term vector space W is defined as:
if term k does not occur in document di, wik = 0
if term k occurs in document di, wik is greater than zero (wik is called the weight of term k in document di)
Similarity between di and dj, where di and dj are the corresponding weighted term vectors:
cos(di, dj) = ( Σ (k = 1 to n) wik wjk ) / ( |di| |dj| )

57 Discussion of Similarity
The cosine similarity measure is widely used and works well on a wide range of documents, but it has no theoretical basis.
There are several possible measures other than the angle between vectors.
There is a choice of possible definitions of tf and idf.
With fielded searching, there are various ways to adjust the weight given to each field.

58 Organization of Files for Full Text Searching
Word list: each term (e.g., ant, bee, cat, dog, elk, fox, gnu, hog) with a pointer to its postings
Postings: the inverted lists
Document store: the documents themselves

59 Representation of Inverted Files
Document store: Stores the documents. Important for user interface design. Word list (vocabulary file): Stores list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries, (lexicographic index). Often held in memory. Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially.

60 Document Store
The Document Store holds the corpus that is being indexed. The corpus may be:
• primary documents, e.g., electronic journal articles or Web pages
• surrogates, e.g., catalog records or abstracts, which refer to the primary documents

61 Document Store The storage of the document store may be:
Central (monolithic) - all documents stored together on a single server (e.g., library catalog) Distributed database - all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog) Highly distributed - documents stored on independently managed servers (e.g., the Web) Each requires: a document ID, which is a unique identifier that can be used by the search system to refer to the document, and a location counter, which can be used to specify location of words or characters within a document.

62 Documents Store for Web Search Systems
• A document is a Web page.
• The document store is the Web.
• The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page on the Web for indexing. After indexing, the local copy of each page is discarded, unless stored in a cache.
(In addition to the usual word list and postings file, the indexing system stores contextual information, which will be discussed in a later lecture.)

63 Inverted File Inverted file:
An inverted file is a list of search terms that are organized for associative look-up, i.e., to answer the questions:
• In which documents does a specified search term appear?
• Where within each document does each term appear? (There may be several occurrences.)
The word list and the postings file together provide an inverted file system for free text searching. In addition, they contain the data needed to calculate weights and information that is used to display results.

64 Inverted File -- Basic Concept
Word      Documents
abacus    19, 22
actor     2, 29
aspen     5
atoll     11, 34
Stop words are removed before building the index.

65 Inverted Indexes We have seen “Vector files”. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

66 How Are Inverted Files Created
Documents are parsed one document at a time to extract tokens/terms. These are saved with the document ID (DID): <term, DID>
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

67 How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically and in document order.

68 How Inverted Files are Created
Multiple term/token entries for a single document are merged. Within-document term frequency information is compiled. Result <term,DID,tf> <the,1,2>

69 How Inverted Files are Created
Then the inverted file can be split into:
A dictionary file: a file of the unique terms
A postings file: a file recording which documents each term/token occurs in and how often (and sometimes where in the document)
Worst case O(n), where n is the number of tokens.
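The whole pipeline (parse, sort, merge, split into dictionary and postings) can be sketched in a few lines; the function name and two-document corpus are illustrative:

```python
from collections import Counter, defaultdict

def build_inverted_file(corpus):
    """Return (dictionary, postings): the dictionary maps each term to nj
    (its document count), and postings maps each term to (DID, tf) pairs
    in document order."""
    postings = defaultdict(list)
    for did, text in enumerate(corpus, start=1):
        # Counting per document merges multiple entries for one DID,
        # giving the within-document term frequency directly.
        for term, tf in sorted(Counter(text.lower().split()).items()):
            postings[term].append((did, tf))
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)

corpus = ["Now is the time for all good men",
          "It was a dark and stormy night"]
dictionary, postings = build_inverted_file(corpus)
print(postings["time"])   # [(1, 1)]
print(dictionary["the"])  # 1
```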

70 Dictionary and Posting Files
[Diagram: the dictionary file pointing into the postings file]

71 Inverted Indexes
Inverted indexes permit fast search for individual terms.
For each term/token, you get a list consisting of:
document ID (DID)
frequency of term in doc (optional)
position of term in doc (optional)
<term, DID, tf, position> or <term, (DIDi, tf, positionij), …>
These lists can be used to solve Boolean queries:
country -> d1, d2
manor -> d2
country AND manor -> d2

72 How Inverted Files are Used
Query: "time" AND "dark"
2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file
1 doc with "dark" in the dictionary -> ID 2 from the postings file
Therefore, only doc 2 satisfies the query.
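An AND query is a merge of two sorted postings lists; a sketch of the standard two-pointer intersection:

```python
def intersect(p1, p2):
    """Intersect two postings lists of sorted document IDs (Boolean AND)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# The slide's query: "time" is in docs 1 and 2, "dark" only in doc 2.
print(intersect([1, 2], [2]))  # [2]
```

Because both lists are scanned once, the merge is linear in the combined list length, which is why inverted lists are kept sorted in the same sequence.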

73 Inverted Index
An inverted index associates a posting list with each term, e.g.:
a    (d1, 1)
the  (d1, 2) (d2, 2)
Replace term frequency (tf) with tf.idf
Compress the index and add hash links
Match the query to the index and rank

74 Position in inverted file posting
Posting list example (term, then document; occurrences, position):
now   (d1; 1, 1)
time  (d1; 1, 12) (d2; 1, 126)
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

75 Change weight Multiple term entries for a single document are merged.
Within-document term frequency information is compiled.
Replace term frequency (tf) by tf.idf.

76 Inverted List -- Definitions
Inverted list: A list of all the entries in an inverted file that apply to a specific word, e.g., abacus 19 22
Posting: An entry in an inverted list that applies to a single instance of a term within a document; e.g., there are three postings for "abacus".

77 Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and dj a document, then q and dj have no terms in common iff q · dj = 0.
To calculate all the non-zero similarities:
1. Find R, the set of all the documents dj that contain at least one term in the query, by merging the inverted lists for each term ti in the query with a logical or.
2. For each dj in R, calculate Similarity(q, dj), using appropriate weights.
3. Return the elements of R in ranked order.
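A sketch of this term-at-a-time scheme: the postings below carry the tf weights from the earlier worked example, query weights are taken as 1, and division by |q| is omitted since a constant factor does not change the ranking (names and data layout are ours).

```python
import math
from collections import defaultdict

def rank(query_terms, postings, doc_norms):
    """Score only documents that share at least one term with the query
    (the set R), accumulating dot products from the inverted lists."""
    scores = defaultdict(float)
    for term in query_terms:
        for did, weight in postings.get(term, []):
            scores[did] += weight  # query weight assumed to be 1
    return sorted(((score / doc_norms[did], did)
                   for did, score in scores.items()), reverse=True)

# tf-weighted postings and document lengths from the running example.
postings = {"ant": [(1, 2), (2, 1)], "dog": [(2, 4), (3, 1)]}
doc_norms = {1: math.sqrt(5), 2: math.sqrt(19), 3: math.sqrt(5)}
print([did for _, did in rank(["ant", "dog"], postings, doc_norms)])
# [2, 1, 3] -- the same ranking as the tf-weighted example
```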

78 Enhancements to Inverted Files -- Concept
Location: Each posting holds information about the location of each term within the document. Uses:
• user interface design -- highlight the location of each search term
• adjacency and near operators (in Boolean searching)
Frequency: Each inverted list includes the number of postings for each term. Uses:
• term weighting
• query processing optimization

79 Inverted File -- Concept (Enhanced)
Word      Postings (document, location)
abacus    22, 56
actor     …
aspen     …
atoll     11, 70
[Diagram: inverted list for the term actor]

80 Postings File: A Linked List for Each Term
[Diagram: terms abacus, actor, aspen, atoll, each heading a linked list of postings]
A linked list for each term is convenient to process sequentially, but slow to update when the lists are long.

81 Length of Postings File
For a common term there may be very large numbers of postings.
Example: 1,000,000,000 documents with 1,000,000 distinct words and an average length of 1,000 words per document give 10^12 postings.
By Zipf's law, with c ≈ 10^12/10 = 10^11, the 10th-ranking word occurs approximately 10^11/10 = 10^10 times.

82 Postings File Merging inverted lists is the most computationally intensive task in many information retrieval systems. Since inverted lists may be long, it is important to match postings efficiently. Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially. For efficient matching, the inverted lists should all be sorted in the same sequence. Inverted lists are commonly cached to minimize disk accesses.

83 Data for Calculating Weights
The calculation of weights requires extra data to be held in the inverted file system:
For each term tj and document di: fij, the number of occurrences of tj in di
For each term tj: nj, the number of documents containing tj
For each document di: mi, the maximum frequency of any term in di
For the entire document file: n, the total number of documents

84 Word List: Individual Records for Each Term
The record for term j in the word list contains: term j pointer to inverted (postings) list for term j number of documents in which term j occurs (nj)

85 Efficiency Criteria
Storage: Inverted files are big, typically 10% to 100% of the size of the collection of documents.
Update performance: It must be possible, with a reasonable amount of computation, to (a) add a large batch of documents and (b) add a single document.
Retrieval performance: Retrieval must be fast enough to satisfy users and not use excessive resources.

86 Word List
On disk: If a word list is held on disk, search time is dominated by the number of disk accesses.
In memory: Suppose that a word list has 1,000,000 distinct terms, and each index entry consists of the term, some basic statistics, and a pointer to the inverted list, averaging 100 characters. The size of the index is 100 megabytes, which can easily be held in the memory of a dedicated computer.

87 File Structures for Inverted Files: Linear Index
Advantages Can be searched quickly, e.g., by binary search, O(log n) Good for lexicographic processing, e.g., comp* Convenient for batch updating Economical use of storage Disadvantages Index must be rebuilt if an extra term is added

88 Indexing Subsystem
documents -> assign document IDs -> text (with document numbers and *field numbers) -> break into tokens -> tokens -> stop list* -> non-stoplist tokens -> stemming* -> stemmed terms -> term weighting* -> terms with weights -> Index database
*Indicates optional operation.

89 Search Subsystem
query -> parse query -> query tokens -> stop list* -> non-stoplist tokens -> stemming* -> stemmed terms -> Boolean operations* and ranking* against the Index database -> retrieved document set -> relevant / ranked document set
*Indicates optional operation.

90 Notation
Docs: file of documents (physical objects or digital)
Catalog: file of catalog records
Index: searchable index file
User: human action; UI: user interface service; arrows: automatic process

91 Single Homogeneous Collection: Full Text Indexing
• Documents and indexes are held on a single computer system (may be several computers).
• Information retrieval uses a full text index, which may be tuned to the specific corpus.
Examples: SMART, Lucene

92 Single Homogeneous Collection: Use of Catalog Records
• Documents may be digital or physical objects, e.g., books.
• Documents are described by catalog records generated manually (or sometimes automatically).
• Information retrieval uses an index of catalog records.
Example: library catalog

93 Several Similar Collections: One Computer System
• Several more or less similar collections are held on a single computer system.
• Each collection is indexed separately using the same software, procedures, algorithms, etc. (but tuned for each collection, e.g., different stoplists).
Example: PubMed

94 Distributed Architecture: Meta-search (Broadcast Search)
• A user interface service broadcasts a query to several indexes and merges the results.
• Can be used with full text or catalogs.
Examples: Dogpile, Clusty

95 Web Searching: Architecture
• Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.)
• The central index is implemented as a single system on a very large number of computers.
Examples: Google, Yahoo!

96 Use of Web Search Service
Batch indexing: Each Web page is brought to the central location and indexed.
Real-time searching: The user (a) searches the central index, and (b) retrieves documents (Web pages) from their original locations.

97 Web Searching: Building the Index
Documents are Web pages Each document is: • identified by Web Crawling • copied to a central location • indexed and added to the central index After indexing the documents are usually discarded, but a cached copy may be retained.

98 Web Crawling Advantages of Web crawling
• Entirely automatic, low cost. Highly efficient at gathering very large amounts of material. but ... • Can only gather openly accessible materials. • Cannot gather material in databases unless explicit URLs are known. • Cannot easily make use of metadata provided by collections.

99 Standardization: Function Versus Cost of Acceptance
[Chart: function against cost of acceptance, from few adopters to many adopters]

100 Example: Textual Mark-up
[Chart: function versus cost of acceptance for ASCII, HTML, XML, SGML]

101 Importance of automated indexing
Crucial to modern information retrieval and web search Vector model of documents and queries Inverted index Scalability issues crucial for large systems

