Modern Information Retrieval
Lecture 2: Key concepts in IR
Marjan Ghazvininejad, Sharif University, Spring 2012

Lecture Overview
Why is IR so hard?
How do we evaluate an IR system?
High-level introduction to IR techniques:
– Overview of retrieval strategies
– Overview of utilities
Discussion
References

Definitions
A database is a collection of documents.
A document is a sequence of terms expressing ideas about some topic in a natural language.
A term is a semantic unit: a word, a phrase, or potentially the root of a word.
A query is a request for documents pertaining to some topic.

Definitions …
An Information Retrieval (IR) system attempts to find the documents relevant to a user's request.
The real problem boils down to matching the language of the query to the language of the documents.

Hard Parts of IR
Simply matching on words is a very brittle approach; one word can have many different meanings. Consider "take":
– "take a place at the table"
– "take money to the bank"
– "take a picture"
– "take a lot of time"
– "take drugs"

More Problems with IR
Often you cannot even tell what part of speech a word has: "I saw her duck."
A query that searches for "pictures of a duck" will find documents that contain "I saw her duck away from the ball falling from the sky."

More Problems with IR
Proper nouns are often built from ordinary nouns. Consider a document containing "a man named Abraham owned a Lincoln": a word-matching query for "Abraham Lincoln" may well find this document.

What Is Different about IR from the Rest of Computer Science?
Most algorithms in computer science have a "right" answer. Consider these two problems:
– Sort the following ten integers
– Find the highest integer
Now consider:
– Find the document most relevant to "hippos in the zoo"

Lecture Overview
Why is IR so hard?
How do we evaluate an IR system?
High-level introduction to IR techniques:
– Overview of retrieval strategies
– Overview of utilities
Discussion
References

Measuring Effectiveness
An algorithm is deemed incorrect if it does not produce the "right" answer. A heuristic tries to guess something close to the right answer, and heuristics are measured by how close they come to it. IR techniques are essentially heuristics, because we do not know the right answer in advance; all we can do is measure how close to the right answer a system comes.

Precision and Recall
Out of the entire document collection, let y be the number of documents retrieved, z the number of relevant documents, and x the number of relevant documents retrieved. Then:
– Precision = x / y
– Recall = x / z
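A minimal Python sketch of these two measures (the document IDs and relevance judgments below are invented for illustration):

def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.
    retrieved: list of document IDs returned by the system (y)
    relevant:  set of document IDs judged relevant collection-wide (z)
    """
    hits = sum(1 for doc in retrieved if doc in relevant)     # x: relevant retrieved
    precision = hits / len(retrieved) if retrieved else 0.0   # x / y
    recall = hits / len(relevant) if relevant else 0.0        # x / z
    return precision, recall

# Invented example: ten retrieved documents, D2 and D5 the only relevant ones.
retrieved = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9", "D10"]
relevant = {"D2", "D5"}
print(precision_recall(retrieved, relevant))   # (0.2, 1.0)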

Precision / Recall Example
Consider a query that retrieves 10 documents. Let's say the result set is: D1 D2 D3 D4 D5 D6 D7 D8 D9 D10.
– If all ten were relevant, we would have 100 percent precision.
– If those were the only ten relevant documents in the whole collection, we would also have 100 percent recall.

Example
Now let's say that only documents two and five are relevant. Consider the same results: D1 D2 D3 D4 D5 D6 D7 D8 D9 D10.
– We have retrieved ten documents and gotten two of them right, so precision is 20 percent.
– Recall is 2 divided by the total number of relevant documents in the entire collection.

Levels of Recall
If we keep retrieving documents, we will ultimately retrieve all documents and achieve 100 percent recall. That means we can keep retrieving documents until we reach any given level (x%) of recall.

Levels of Recall …
Retrieve the top 2,000 documents, and let's say there are five relevant documents in total (DocIDs A through E). For each relevant document found, the table gives the rank at which it was retrieved and the recall and precision at that rank; for example, document A is found at rank 100, so recall there is 1/5 = 0.2 and precision is 1/100 = 0.01.

Document  DocID  Recall  Precision
100       A      0.2     0.01
…         B      0.4     …
…         C      0.6     …
…         D      0.8     …
…         E      1.0     …

Recall / Precision Graph
Compute precision at the 0.1, 0.2, 0.3, …, 1.0 levels of recall. The optimal graph would be a straight line, with precision always at 1 for every level of recall. Typically, as recall increases, precision drops.
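As a rough illustration, here is a small Python sketch that walks down a ranked result list, records a (recall, precision) point after each document, and then reads off the best precision at or above each standard recall level (one common interpolation rule, not the only one; the ranking and relevance judgments are invented):

def recall_precision_points(ranking, relevant):
    """Record a (recall, precision) point after each retrieved document."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points

def interpolated_precision(points, levels):
    """Best precision achieved at each recall level or beyond."""
    return {level: max((p for r, p in points if r >= level), default=0.0)
            for level in levels}

# Invented toy data: ten ranked documents, three of which are relevant.
ranking = ["D3", "D7", "D1", "D9", "D2", "D8", "D4", "D10", "D5", "D6"]
relevant = {"D3", "D9", "D5"}
levels = [i / 10 for i in range(1, 11)]
print(interpolated_precision(recall_precision_points(ranking, relevant), levels))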

Evaluating IR
Recall is the fraction of the collection-wide set of relevant documents that are retrieved. Precision is the fraction of the retrieved documents that are relevant. An IR system ranks documents by a similarity coefficient (SC), allowing the user to trade off between precision and recall.

Lecture Overview
Why is IR so hard?
How do we evaluate an IR system?
High-level introduction to IR techniques:
– Overview of retrieval strategies
– Overview of utilities
Discussion
References

Strategy vs. Utility
An IR strategy is a technique by which a relevance assessment is obtained between a query and a document.
An IR utility is a technique that may be used to improve the assessment given by a strategy. A utility may plug into any strategy.

Strategies
Manual
– Boolean
Automatic
– Probabilistic: OKAPI (Robertson/Sparck Jones), Kwok, inference networks
– Vector Space Model
– Latent Semantic Indexing (LSI)
Adaptive Models
– Genetic Algorithms
– Neural Networks

Boolean Queries
Query: (cost OR price) AND paper
D1: Paper cost increase of 5%. (relevant)
D2: Price of jellybeans up 7%. (not relevant)
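A minimal sketch of Boolean matching over a toy inverted index (the documents and the index-building code are invented for illustration):

# Build term -> set-of-document-IDs postings, then evaluate
# (cost OR price) AND paper with ordinary set operations.
docs = {
    "D1": "Paper cost increase of 5%",
    "D2": "Price of jellybeans up 7%",
}

index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    return index.get(term.lower(), set())

result = (postings("cost") | postings("price")) & postings("paper")
print(result)   # {'D1'}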

Automatic Strategy
Query: cost of paper
D1: Paper cost increase of 5%.
D2: Cost of copper up 8%. Cost of aluminum down 2%.
D3: Miracles of modern medicine.

Vector Space Model

Vector Space Model …
– The document weight D_ij and query weight Q_j are tf × idf: tf_ij is the frequency of term j in document i, and idf_j is the inverse document frequency of term j.
– The weights are usually scaled logarithmically: D_ij = log(tf_ij + 1) × log(d / (df_j + 1)), where d is the number of documents in the collection and df_j is the number of documents containing term j.
– Documents are ranked by the cosine of the angle between D_i and the query Q: SC = (D_i · Q) / (|D_i| |Q|).
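A minimal sketch of the weighting and ranking described above (the toy collection and query are invented; real systems add many refinements):

import math
from collections import Counter

docs = {
    "D1": "paper cost increase",
    "D2": "cost of copper cost of aluminum",
    "D3": "miracles of modern medicine",
}
query = "cost of paper"

def tokenize(text):
    return text.lower().split()

d = len(docs)
df = Counter(term for text in docs.values() for term in set(tokenize(text)))

def weight(tf, term):
    # Log-scaled tf-idf, following D_ij = log(tf_ij + 1) * log(d / (df_j + 1)) above.
    return math.log(tf + 1) * math.log(d / (df[term] + 1))

def vector(text):
    tf = Counter(tokenize(text))
    return {term: weight(count, term) for term, count in tf.items()}

def cosine(v, w):
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    norm = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

q_vec = vector(query)
for doc_id, text in docs.items():
    print(doc_id, round(cosine(vector(text), q_vec), 3))   # similarity coefficient per document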

Latent Semantic Indexing

Adaptive Strategies
– Probabilistic: learning based on direct probability estimates.
– Neural networks: learning based on a model of the brain.
– Genetic algorithms: learning based on a model of evolution.

Utilities
– Variant forms of terms: stemming, N-grams
– Synonyms: thesauri, semantic nets, relevance feedback, clustering, latent semantic indexing
– Term proximity: passage-based retrieval, parsing

Utilities …
Query: biological weapons
D1: Iraqi biologists in weapon program.
D2: Iraq implicated in germ warfare probe.
D3: Scientists use biological techniques as latest weapons against cancer.

Stemming
Common prefixes and suffixes are removed so that variant forms map to a common stem:
– biology, biologist, biologists
– Uses language-dependent rules
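A toy illustration of language-dependent suffix stripping (the rules below are invented for illustration and are far cruder than a real stemmer such as Porter's):

# Suffix-stripping rules, tried longest-first; each maps a suffix to a replacement.
SUFFIX_RULES = [("ists", "y"), ("ist", "y"), ("ies", "y"), ("s", "")]

def stem(word):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

print([stem(w) for w in ["Biology", "biologist", "biologists"]])
# -> ['biology', 'biology', 'biology']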

N-grams
– Matching fixed-length strings of N characters
– Language independent
– Tolerates misspellings and errors
– Accuracy not as good as using words
– Typically, a two-pass matching algorithm is used
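A small sketch of character-level N-gram matching; the Dice overlap used here is one common choice, not necessarily the two-pass algorithm the slide refers to:

def ngrams(text, n=3):
    """Set of character n-grams of a string (padded with spaces at both ends)."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice(a, b, n=3):
    """Dice overlap between the n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# Misspellings still overlap heavily at the n-gram level; unrelated words do not.
print(round(dice("biological", "bioligical"), 2))
print(round(dice("biological", "warfare"), 2))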

Synonyms (Manual)
A thesaurus lists related terms:
– weapon = arms, gun, warfare
A semantic net describes relationships between terms:
– biologist IS-A scientist
– weapon USED-IN war

Synonyms (Automatic)
Premise: related words are often found in the same documents.
– Relevance feedback: terms from the top-ranked documents are used to construct a new query.
– Clustering: documents with common terms are grouped.
– Latent semantic indexing: uses a term-document matrix.
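A minimal numpy sketch of the latent semantic indexing idea: factor the term-document matrix with an SVD, keep only the top k dimensions, and compare queries and documents in that reduced space (the matrix, the query, and k = 2 are all invented for illustration):

import numpy as np

terms = ["biological", "weapons", "germ", "warfare"]
#              D1  D2  D3
A = np.array([[1,  0,  1],    # biological
              [1,  0,  0],    # weapons
              [0,  1,  1],    # germ
              [0,  1,  0]],   # warfare
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]

# Map documents and queries into the latent space with the same projection U_k^T.
docs_latent = A.T @ Uk                      # one row per document

def fold_in(query_terms):
    q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    return q @ Uk

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

q = fold_in({"germ", "warfare"})
for i, doc_vec in enumerate(docs_latent, start=1):
    print(f"D{i}", round(cosine(q, doc_vec), 3))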

Term Proximity
Premise: documents are not just bags of words; query terms are more significant if they occur close together.
– Passage-based retrieval: the document is divided into sections (paragraphs, or overlapping fixed-length windows) that are ranked individually.
– Phrases: pairs of words (or longer sequences) are treated as single terms.
– Parsing: parts of speech (noun phrases, etc.) are identified and treated as terms.
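A rough sketch of passage-based scoring along these lines (the window size, step, scoring by query-term counts, and the sample text are all invented for illustration):

def passage_score(text, query_terms, window=20, step=10):
    """Score a document by its best fixed-length passage: slide an overlapping
    window over the tokens and count query-term occurrences in each."""
    tokens = text.lower().split()
    query_terms = {t.lower() for t in query_terms}
    best = 0
    for start in range(0, max(len(tokens) - window, 0) + 1, step):
        passage = tokens[start:start + window]
        best = max(best, sum(1 for t in passage if t in query_terms))
    return best

doc = ("the treaty bans biological weapons entirely " +
       "unrelated filler text " * 20 +
       "weapons inspections continued for years")
print(passage_score(doc, {"biological", "weapons"}))   # both terms fall in one passage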

Lecture Overview
Why is IR so hard?
How do we evaluate an IR system?
High-level introduction to IR techniques:
– Overview of retrieval strategies
– Overview of utilities
Discussion
References

Next Time
Boolean Retrieval
Readings:
– Chapter ?????? in IR text (?????????????)
– Joyce & Needham, "The Thesaurus Approach to Information Retrieval" (in Readings book)
– Luhn, "The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Texts" (in Readings)
– Doyle, "Indexing and Abstracting by Association, Pt I" (in Readings)

Lecture Overview
Why is IR so hard?
How do we evaluate an IR system?
High-level introduction to IR techniques:
– Overview of retrieval strategies
– Overview of utilities
Discussion
References

References