Chapter 21: Information Retrieval
What is Information Retrieval (IR)?
“Information retrieval deals with the representation, storage, and access to (unstructured) documents or representatives of documents (document surrogates).” (Gerard Salton)

“The location and presentation to a user of information relevant to an information need expressed as a query.” (Robert R. Korfhage)
Primary goal of an IR system:
“Retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant documents as possible.”
Goals of Information Retrieval (IR)
Retrieve information that is relevant to the user's need
Represent and organize information for easy access
Translate the user's information need into queries that can be processed by IR systems
Information Retrieval vs. Data Retrieval:
Information retrieval deals with natural-language text, which is unstructured and semantically ambiguous
Data retrieval deals with well-defined, structured data
IR systems rank the documents in a collection according to their degree of relevance to the user's query (need)
Information Retrieval vs. Data Retrieval
Design Issue            Data Retrieval      Information Retrieval
Matching                Exact match         Partial (best) match
Model                   Deterministic       Probabilistic
Classification          Monothetic          Polythetic
Query language          Artificial          Natural
Query specification     Complete            Incomplete
Items wanted            Matching            Relevant
Error response          Sensitive           Insensitive
Data representation     Schema              (Mostly) index terms
Information Retrieval
Topics to be addressed (in this chapter):
Relevance Ranking Using Terms/Keywords
Relevance Ranking Using Hyperlinks
Synonyms, Homonyms, and Ontologies
Indexing of Documents
Measuring Retrieval Effectiveness
Web Search Engines
Information Retrieval and Structured Data
Directories
Information Retrieval Systems
Information retrieval (IR) systems use a simpler data model than database systems:
Information is organized as a collection of documents
Documents are unstructured, with no schema
Information retrieval locates relevant documents on the basis of user input such as keywords or example documents
e.g., find the documents containing the words “database systems”
It can be used even on textual descriptions provided with non-textual data, such as images
Web search engines are the most familiar example of IR systems
Information Retrieval Systems (Cont.)
Differences from database systems:
IR systems don't deal with transactional updates, concurrency control, data recovery, etc.
IR systems deal with some querying issues not generally addressed by database systems:
Approximate searching by keywords
Ranking of retrieved answers by estimated degree of relevance
Database systems deal with structured data, with schemas that define the data organization; this does not apply to the unstructured documents of IR
Keyword Search
In full-text retrieval, all the words (i.e., terms) in each document are considered to be keywords
IR systems typically allow query expressions formed using keywords and the logical connectives and (implicit), or, and not
Ranking of documents on the basis of estimated relevance to a query is critical
Relevance ranking is based on factors such as:
Term frequency (TF): the frequency of occurrence of a query keyword in a document
Inverse document frequency (IDF): based on how many documents the query keyword occurs in; occurring in fewer documents gives more importance to the keyword
Hyperlinks to documents: the more links to a document, the more important the document
Relevance Ranking Using Terms
TF-IDF (Term Frequency-Inverse Document Frequency) ranking:
Let n(d) = the number of terms in document d, and n(d, t) = the number of occurrences of term t in d.
The relevance of a document d to a term t is determined by

$$TF(d, t) = \log_2\!\left(1 + \frac{n(d, t)}{n(d)}\right) \quad\text{or}\quad TF(d, t) = \frac{n(d, t)}{\max_l n(d, l)},$$

where l ranges over the terms of document d. The log factor avoids giving excessive weight to frequent terms ($\log_2 x > 0$ if $x > 1$).
The relevance of document d to a query Q is

$$r(d, Q) = \sum_{t \in Q} \frac{TF(d, t)}{n(t)},$$

where n(t) = the number of documents containing term t in the collection, or equivalently

$$r(d, Q) = \sum_{t \in Q} TF(d, t) \cdot IDF(t), \qquad IDF(t) = \log_2\frac{N}{n(t)},$$

where N is the total number of documents in the collection.
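As a quick illustration, here is a minimal Python sketch of this scoring using the log TF variant and the log IDF above; the toy corpus and tokenization are hypothetical:

```python
import math

def tf(doc_tokens, term):
    """TF(d, t) = log2(1 + n(d, t) / n(d))."""
    n_d = len(doc_tokens)              # n(d): number of terms in d
    n_dt = doc_tokens.count(term)      # n(d, t): occurrences of t in d
    return math.log2(1 + n_dt / n_d)

def idf(docs, term):
    """IDF(t) = log2(N / n(t)); n(t) = number of docs containing t."""
    n_t = sum(1 for d in docs if term in d)
    return math.log2(len(docs) / n_t) if n_t else 0.0

def relevance(doc, query_terms, docs):
    """r(d, Q) = sum over t in Q of TF(d, t) * IDF(t)."""
    return sum(tf(doc, t) * idf(docs, t) for t in query_terms)

# Hypothetical toy collection of pre-tokenized documents
docs = [["database", "systems", "database"],
        ["information", "retrieval", "systems"],
        ["keyword", "search"]]
print(relevance(docs[0], ["database", "systems"], docs))  # ~1.41
```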
Relevance Ranking Using Terms (Cont.)
Most systems add to the above model:
Words that occur in the title, author list, section headings, etc., are given greater importance
Words whose first occurrence is late in the document are given lower importance
Very common words, called stop words, such as “a”, “an”, “the”, “it”, etc., are eliminated (a small sketch follows this list)
Proximity: if the query keywords occur close together in the document, the document ranks higher than if they occur far apart
Documents are returned in decreasing order of relevance score; usually only the top few documents are returned, not all
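A minimal sketch of the stop-word elimination step (the word list here is purely illustrative):

```python
STOP_WORDS = {"a", "an", "the", "it", "of", "to", "and"}  # illustrative list

def remove_stop_words(tokens):
    """Drop very common words before indexing or ranking."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the repair of a motorcycle".split()))
# ['repair', 'motorcycle']
```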
Similarity Based Retrieval
Similarity-based retrieval: retrieve documents similar to a given document
Similarity may be defined on the basis of common words; e.g., find the k terms in document d with the highest TF(d, t)/n(t) and use these terms to find the relevance of other documents (see the sketch after this list)
Relevance feedback: similarity can be used to refine the answer set of a keyword query
The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other similar documents
Vector space model: define an n-dimensional space, where n is the number of words in the document set
The vector for document d goes from the origin to the point whose ith coordinate is TF(d, ti)/n(ti)
The cosine of the angle between the vectors of two documents (or of a query and a document) is used as a measure of their similarity
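A rough Python sketch of similarity-based retrieval under the TF(d, t)/n(t) weighting above; the toy corpus, the raw-count TF, and the choice k = 2 are assumptions for illustration:

```python
from collections import Counter

def doc_freq(docs):
    """n(t): the number of documents containing each term."""
    df = Counter()
    for d in docs:
        df.update(set(d))
    return df

def top_k_terms(doc, df, k):
    """The k terms of doc with the highest TF(d, t) / n(t)."""
    tf = Counter(doc)
    return sorted(tf, key=lambda t: tf[t] / df[t], reverse=True)[:k]

def similar_docs(doc, docs, k=2):
    """Rank the other docs by how many of doc's top-k terms they contain."""
    terms = top_k_terms(doc, doc_freq(docs), k)
    scored = [(sum(t in d for t in terms), d) for d in docs if d is not doc]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

docs = [["motorcycle", "repair", "repair"],
        ["motorcycle", "maintenance"],
        ["database", "systems"]]
print(similar_docs(docs[0], docs))  # second doc shares "motorcycle", third shares nothing
```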
The Vector (Space) Model
$$r(d_j, Q) = Sim(q, d_j) = \cos\theta = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\;\sqrt{\sum_{i=1}^{t} w_{iq}^2}},$$

where $\cdot$ is the inner-product operator and $|\vec{q}|$ ($|\vec{d_j}|$, respectively) is the length of $\vec{q}$ ($\vec{d_j}$, respectively). The weights are

$$w_{ij} = TF(d_j, i) \cdot IDF(i), \qquad w_{iq} = \left(0.5 + 0.5\,\frac{TF(q, i)}{\max_l TF(q, l)}\right)\log_2\frac{N}{n(i)},$$

where N is the total number of documents and n(i) is the number of documents containing keyword i. Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, we have $0 \le Sim(q, d_j) \le 1$.
The Vector (Space) Model (Cont.)
Example. Let q be the query and d1, ..., d7 the documents, over keywords k1, k2, k3. [Figure: documents d1–d7 placed over the keywords k1, k2, k3]

$$Sim(q, d_j) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|}$$

Doc    q · dj    |q|     |dj|    |q| |dj|    Sim(q, dj)
d1     5         √14     √5      √70         0.60
d2     1         √14     1       √14         0.27
d3     11        √14     √10     √140        0.93
d4     2         √14     2       2√14        0.27
d5     17        √14     √21     √294        0.99
d6     5         √14     √5      √70         0.60
d7     10        √14     5       5√14        0.54
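The table can be checked numerically. The per-keyword weights below are hypothetical reconstructions chosen to be consistent with the table (e.g., q = (1, 2, 3) gives |q| = √14); the computation itself is the Sim formula above:

```python
import math

def sim(q, d):
    """Cosine similarity: (d . q) / (|d| |q|)."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(w * w for w in q)) * math.sqrt(sum(w * w for w in d))
    return dot / norm

q = (1, 2, 3)                                  # |q| = sqrt(14), as in the table
docs = {"d1": (1, 2, 0), "d2": (1, 0, 0),      # hypothetical keyword weights
        "d3": (0, 1, 3), "d4": (2, 0, 0),
        "d5": (1, 2, 4), "d6": (2, 0, 1),
        "d7": (0, 5, 0)}
for name, d in docs.items():
    print(name, round(sim(q, d), 2))
# ~0.60, 0.27, 0.93, 0.27, 0.99, 0.60, 0.53 (the slide rounds the last value to 0.54)
```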
Measuring Retrieval Effectiveness
IR systems save space by using index structures that support only approximate retrieval. This may result in:
False negatives (false drops): relevant documents that are not retrieved
False positives: non-relevant documents that are retrieved
For many applications a good index should not permit any false drops, though it may permit a few false positives; neither is desirable in an IR system
Relevant performance metrics:
Precision: what percentage of the retrieved documents are relevant to the query, i.e., #(retrieved and relevant) / #retrieved
Recall: what percentage of the documents relevant to the query were retrieved, i.e., #(retrieved and relevant) / #relevant
Precision versus Recall
Let R be the set of documents relevant to a given query, with |R| the number of documents in R. Let A be the document answer set, with |A| the number of documents in A. Let Ra = R ∩ A, with |Ra| the number of documents in Ra.

Recall = |Ra| / |R|, the fraction of relevant documents retrieved
Precision = |Ra| / |A|, the fraction of retrieved documents that are relevant
False positives = |A| − |Ra|
False negatives = |R| − |Ra|

[Figure: the collection, with the relevant documents R and the answer set A overlapping in Ra]
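A minimal sketch of these measures over document-id sets (the sets R and A are hypothetical):

```python
def precision_recall(relevant, answer):
    """Return (precision, recall) for relevant set R and answer set A."""
    ra = relevant & answer                     # Ra = R intersect A
    return len(ra) / len(answer), len(ra) / len(relevant)

R = {"d1", "d3", "d5", "d9"}                   # hypothetical relevant set
A = {"d1", "d2", "d3", "d7"}                   # hypothetical answer set
p, r = precision_recall(R, A)
print(p, r)  # 0.5, 0.5; false positives: d2, d7; false negatives: d5, d9
```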
Measuring Retrieval Effectiveness (Cont.)
Recall vs. precision tradeoff:
Recall can be increased by retrieving many documents (down to a low level of relevance ranking), but many non-relevant documents would then be fetched, reducing precision
Measures of retrieval effectiveness, i.e., precision vs. recall:
Recall as a function of the number of documents fetched, or
Precision as a function of recall (and hence of the number of documents fetched)
Problem: deciding which documents are actually relevant and which are not; researchers have created collections of (manually tagged) documents and queries for such measurements

[Figure: precision as a function of recall for a ranked answer list with |R| = 10: precision 100%, 66%, 50%, 40%, and 33% at recall levels 10% through 50%; reproduced in the sketch below]
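The plotted precision values can be reproduced from a ranked answer list. The relevance flags below are assumed for illustration (relevant documents at ranks 1, 3, 6, 10, and 15, with |R| = 10), which yields the figures shown:

```python
def precision_at_recall_levels(ranked_relevant_flags, total_relevant):
    """Record precision each time recall increases by one relevant document."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_relevant_flags, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Assumed ranking: relevant docs at ranks 1, 3, 6, 10, 15; |R| = 10
flags = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
for recall, precision in precision_at_recall_levels(flags, 10):
    print(f"recall {recall:.0%}: precision {precision:.0%}")
# recall 10%: 100%, 20%: 67%, 30%: 50%, 40%: 40%, 50%: 33%
# (the slide truncates 2/3 to 66%)
```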
Relevance Using Hyperlinks
The number of documents relevant to a query can be enormous if only term frequencies are taken into account (in a large collection of documents, e.g., the Web)
Using term frequencies alone also makes “spamming” easy
e.g., a travel agency can add many occurrences of the word “travel” to its page to make its rank very high
Most of the time, people are looking for Web pages from popular sites
Idea: use the popularity of a Web site (e.g., how many people visit it, or how many Web pages point to it) to rank the site's pages that match the given keywords
Problem: it is hard to find the actual popularity of a site
Relevance Using Hyperlinks (Cont.)
Solution: use the number of hyperlinks to a site/page as a measure of its popularity or prestige
Refinements:
Combine TF-IDF-based measures with the popularity of a page P to compute the overall degree of relevance of P
In computing prestige based on links to a site/page, give more weight to links from sites/pages that themselves have higher prestige (a transfer-of-prestige approach)
The definition is circular: the popularity of a page is defined by the popularity of other pages, and there may be cycles of links
The circularity can be resolved by solving a system of simultaneous linear equations
This is the basis of the Google PageRank ranking mechanism: the popularity of page P is based on the popularity of the pages that link to P
Relevance Using Hyperlinks (Cont.)
Google PageRank ranking mechanism, described as a random walk (traversal) model:
Step 1. Start at a random Web page.
Step 2 and onward. (i) Jump to a randomly chosen Web page with probability δ, or (ii) follow a randomly chosen outgoing link from the current page with probability (1 − δ).
Pages pointed to by many Web pages (or by pages with high PageRank) are more likely to be visited, and therefore have higher PageRank
The PageRank P[j] of each Web page j is defined as

$$P[j] = \frac{\delta}{N} + (1 - \delta)\sum_{i=1}^{N} T[i, j] \cdot P[i],$$

where the first term models step 2(i) and the sum models step 2(ii); T[i, j] = 1/N_i if page i links to page j and 0 otherwise (N_i = the number of links out of page i); 0 ≤ δ ≤ 1; N = the number of pages; and P[i] = 1/N to begin with.
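A minimal sketch of the iteration defined above; the three-page link graph is hypothetical, and δ = 0.15 is a common choice rather than anything mandated by the formula (the sketch also assumes every page has at least one outgoing link):

```python
def pagerank(links, delta=0.15, iterations=50):
    """links[i] = list of pages that page i points to; returns P[0..N-1]."""
    n = len(links)
    p = [1.0 / n] * n                          # P[i] = 1/N to begin with
    for _ in range(iterations):
        new_p = [delta / n] * n                # the delta/N term for every page
        for i, outgoing in enumerate(links):
            if outgoing:
                share = (1 - delta) * p[i] / len(outgoing)
                for j in outgoing:
                    new_p[j] += share          # T[i, j] * P[i] with T[i, j] = 1/Ni
        p = new_p
    return p

# Hypothetical 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
print(pagerank([[1], [2], [0, 1]]))
```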
Relevance Using Hyperlinks (Cont.)
Connections to social-networking theories that rank the prestige of people
e.g., the president of the U.S.A. has high prestige, since many people know him
Someone known by many prestigious people has high prestige
Hub and authority based ranking:
A hub is a page that stores links to many related pages (on a topic)
An authority is a page that contains actual information on a topic
A page gets high hub prestige if it points to many pages with high authority prestige
A page gets high authority prestige if it is pointed to by many pages with high hub prestige
Again, the prestige definitions are cyclic, and the scores can be obtained by solving linear equations (a sketch of the iterative computation follows)
Use authority prestige when ranking the answers to a query
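A sketch of the mutually recursive hub/authority computation (in the style of HITS); the tiny link graph is hypothetical, and the normalization assumes the graph has at least one link:

```python
def hubs_and_authorities(links, iterations=20):
    """links[i] = pages that page i points to; returns (hub, authority) scores."""
    n = len(links)
    hub, auth = [1.0] * n, [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of hub scores of the pages pointing to j
        auth = [sum(hub[i] for i in range(n) if j in links[i]) for j in range(n)]
        # Hub of i: sum of authority scores of the pages i points to
        hub = [sum(auth[j] for j in links[i]) for i in range(n)]
        norm_a = sum(a * a for a in auth) ** 0.5
        norm_h = sum(h * h for h in hub) ** 0.5
        auth = [a / norm_a for a in auth]      # normalize so scores converge
        hub = [h / norm_h for h in hub]
    return hub, auth

# Hypothetical links: page 0 is a hub pointing at authorities 1 and 2
print(hubs_and_authorities([[1, 2], [2], []]))
```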
Synonyms and Homonyms
Synonyms:
e.g., document: “motorcycle repair”; query: “motorcycle maintenance”
The system needs to realize that “maintenance” and “repair” are synonyms
The system can extend the query to “motorcycle and (repair or maintenance)” (a small sketch follows)
Homonyms:
e.g., “object” has different meanings as a noun and as a verb
Meanings can be disambiguated (to some extent) from the context
Extending queries automatically using synonyms can be problematic:
Synonyms may have other meanings as well
The intended meaning must be understood in order to infer synonyms
Or the synonyms (e.g., for “program”) can be verified with the user
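A toy sketch of this kind of query expansion (the thesaurus is illustrative):

```python
SYNONYMS = {"repair": {"maintenance"}, "maintenance": {"repair"}}  # toy thesaurus

def expand(query_terms):
    """Each term t becomes the disjunction (t or syn1 or ...)."""
    return [{t} | SYNONYMS.get(t, set()) for t in query_terms]

print(expand(["motorcycle", "repair"]))
# [{'motorcycle'}, {'repair', 'maintenance'}]
# i.e., motorcycle and (repair or maintenance)
```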
Indexing of Documents
An inverted index maps each keyword Ki to the set Si of documents that contain the keyword
Documents are identified by identifiers
The inverted index may record:
Keyword locations within the document, to allow proximity-based ranking
Counts of the number of occurrences of the keyword, to compute TF
and operation: finds the documents that contain all of K1, K2, ..., Kn using intersection, i.e., S1 ∩ S2 ∩ ... ∩ Sn
or operation: finds the documents that contain at least one of K1, K2, ..., Kn using union, i.e., S1 ∪ S2 ∪ ... ∪ Sn
Each Si is kept sorted to allow efficient intersection/union by merging (a sketch follows); not can also be implemented by merging of sorted lists
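A minimal sketch of the and operation as a merge of two sorted posting lists (the index contents are hypothetical):

```python
def intersect(s1, s2):
    """Merge two sorted document-id lists (the and operation)."""
    out, i, j = [], 0, 0
    while i < len(s1) and j < len(s2):
        if s1[i] == s2[j]:
            out.append(s1[i])
            i += 1
            j += 1
        elif s1[i] < s2[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical inverted index: keyword -> sorted list of document ids
index = {"database": [1, 3, 5, 8], "systems": [2, 3, 5, 9]}
print(intersect(index["database"], index["systems"]))  # [3, 5]
```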
Concept-Based Querying
Approach:
For each word, determine the concept it represents from its context (e.g., disambiguate each word by looking at the other surrounding words on the same page)
Use one or more ontologies: hierarchical structures showing the relationships between concepts
e.g., the ISA relationship that we saw in the E-R model
This approach can be used to standardize terminology in a specific field, but it comes with very high overhead when handling billions of documents; it is not well suited to search engines
Ontologies can link multiple languages
Ontologies are the foundation of the Semantic Web (not covered here)