CS4485: Information Retrieval


CS4485: Information Retrieval
Who I am: Dr. Lusheng WANG, Dept. of Computer Science
Office: Y6429
Phone: 2788 9820
E-mail: lwang@cs.cityu.edu.hk
Web site: http://www.cs.cityu.edu.hk/~lwang/
2019/4/22 CS4485 Information Retrieval /WANG Lusheng

Textbook
R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999. More material will be added in the handouts.
References
W.B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand Reinhold, New York, 1994.
Michael Lesk, Practical Digital Libraries: Books, Bytes, and Bucks, Morgan Kaufmann, 1997.

Information Retrieval
User task: translate the information need into a query in some language, typically by providing some words.
Information retrieval vs. browsing
Information retrieval: finding useful information.
Browsing: the objectives are not clearly defined and may change during the browsing process.
Most systems combine the two.

Logical view of the documents
Classic view: a set of index terms or keywords.
Full-text logical view: keep the full text (feasible with computers). The text still needs some special treatment (Chapter 7):
  Elimination of stopwords (words that appear in almost all documents and are useless for retrieval)
  Use of stemming (reduces distinct words to their common grammatical root)
  Identification of noun groups (eliminates adjectives, adverbs, and verbs)
  Compression techniques
Structure can also be used: structured text retrieval models (chapters, sections, subsections).

What we will cover (Syllabus: http://www.cs.cityu.edu.hk//content/courses/index.html)
Retrieval models for text (documents)
Retrieval models for hypertext (searching the web)
Retrieval evaluation
Query languages
Query operations
Text operations
Chinese language text operations
Indexing and searching (algorithmic issues)
Brief introduction to multimedia IR

Evaluation
50% coursework, 50% examination
Coursework:
  1 assignment: 20%
  A midterm examination: 20%
  A project (done in pairs): 60%

Definitions
A database is a collection of documents.
A document is a sequence of terms, expressing ideas about some topic in a natural language.
A term is a semantic unit: a word, a phrase, or potentially the root of a word.
A query is a request for documents pertaining to some topic.

Definitions (Cont.)
An Information Retrieval (IR) system attempts to find relevant documents to respond to a user's request.
The real problem boils down to matching the language of the query to the language of the document.

Hard Parts of IR
Simply matching on words is a very brittle approach.
One word can have many different meanings. Consider "take":
  "take a place at the table"
  "take money to the bank"
  "take a picture"
  "take a lot of time"
  "take drugs"

More Problems with IR
You can't even tell what part of speech a word has: "I saw her duck."
A query that searches for "pictures of a duck" will find documents that contain "I saw her duck away from the ball falling from the sky".

More Problems with IR
Proper nouns are often built from ordinary nouns.
Consider a document containing "a man named Abraham owned a Lincoln".
A word-matching query for "Abraham Lincoln" may well find this document.
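
The word-matching pitfalls described on these slides can be reproduced in a few lines. The following is a minimal sketch, assuming a naive matcher that lowercases both texts and reports the words they share; the function name `shared_words` is hypothetical and used only for illustration.

```python
import re

def shared_words(query, document):
    """Return the set of lowercased words that occur in both query and document."""
    def tokenize(text):
        return set(re.findall(r"[a-z]+", text.lower()))
    return tokenize(query) & tokenize(document)

# The document is about a car owner, yet both query words match.
lincoln_doc = "a man named Abraham owned a Lincoln"
print(sorted(shared_words("Abraham Lincoln", lincoln_doc)))

# "duck" is a verb in the document, yet it matches the noun in the query.
duck_doc = "I saw her duck away from the ball falling from the sky"
print(sorted(shared_words("pictures of a duck", duck_doc)))
```

In both cases the matcher reports an overlap even though the document is not about the topic of the query, which is why pure word matching is brittle.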

What Is Different about IR from the Rest of Computer Science
Most algorithms in computer science have a "right" answer. Consider these two problems:
  Sort the following ten integers.
  Find the highest integer.
Now consider:
  Find the document most relevant to "hippos in the zoo".

Measuring Effectiveness
An algorithm is deemed incorrect if it does not produce the "right" answer.
A heuristic tries to guess something close to the right answer.
Heuristics are measured on "how close" they come to a right answer.
IR techniques are essentially heuristics because we do not know the right answer.
So we have to measure how close to the right answer we can come.

Precision / Recall Example
Consider a query that retrieves 10 documents. Let's say the result set is:
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
If all ten were relevant, we would have 100 percent precision.
If those ten were also the only relevant documents in the whole collection, we would have 100 percent recall.

Example (continued)
Now let's say that only documents two and five are relevant. Consider these results:
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
Since we have retrieved ten documents and gotten two of them right, precision is 20 percent.
Recall is 2 / (total number of relevant documents in the entire collection).

Levels of Recall
If we keep retrieving documents, we will ultimately retrieve all documents and achieve 100 percent recall.
That means that we can keep retrieving documents until we reach x% recall.

Levels of Recall (example)
Retrieve the top 2000 documents. Let's say there are five relevant documents in total.
Rank is the position in the result list at which each relevant document is retrieved.
Rank   DocID   Recall   Precision
 100   A       0.20     0.01
 200   B       0.40     0.01
 500   C       0.60     0.006
1000   D       0.80     0.004
1500   E       1.00     0.003
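
The recall and precision values in this example follow directly from the ranks at which the five relevant documents appear. The sketch below recomputes them, assuming the first number in each row is the rank of the relevant document in the result list; the function name `recall_precision_at_ranks` is hypothetical.

```python
def recall_precision_at_ranks(relevant_ranks, total_relevant):
    """For each rank at which a relevant document is found, return
    (rank, recall, precision) measured at that point in the result list."""
    rows = []
    for found, rank in enumerate(sorted(relevant_ranks), start=1):
        recall = found / total_relevant   # relevant found so far / all relevant
        precision = found / rank          # relevant found so far / documents retrieved so far
        rows.append((rank, recall, precision))
    return rows

rows = recall_precision_at_ranks([100, 200, 500, 1000, 1500], total_relevant=5)
for rank, recall, precision in rows:
    print(f"rank {rank:>4}: recall={recall:.2f} precision={precision:.3f}")
```

Each additional relevant document raises recall by 0.20, while precision falls because many non-relevant documents are retrieved in between.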

How to evaluate the quality of the retrieval system
Let R be the set of all relevant documents.
Let A be the set of all documents reported as relevant by the system.
Let Ra = A ∩ R be the set of relevant documents reported.
Recall = |Ra|/|R|. Recall = 10% means that 10% of the relevant documents in R are found.
Precision = |Ra|/|A|. Precision = 90% means that 90% of the reported documents are relevant (ideally, 100% would be relevant).
Recall = 100% means every relevant document is found; it does not mean that all reported documents are relevant.
Precision = 100% means every reported document is relevant; it does not mean that the system finds ALL relevant documents.
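
The two measures translate directly into set operations. The following is a minimal sketch, with Ra computed as the intersection of the reported set A and the relevant set R. The collection is an assumption for illustration: it reuses the ten-document result list from the earlier example and invents six further relevant documents (D11 to D16) that the system did not report.

```python
def precision_recall(reported, relevant):
    """Return (precision, recall) for a set of reported documents A
    and a set of relevant documents R."""
    hits = reported & relevant                      # Ra = A intersect R
    precision = len(hits) / len(reported) if reported else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

A = {f"D{i}" for i in range(1, 11)}                 # D1..D10 reported by the system
R = {"D2", "D5"} | {f"D{i}" for i in range(11, 17)} # hypothetical: 8 relevant documents in total

precision, recall = precision_recall(A, R)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With two hits among ten reported documents, precision is 20 percent; recall depends on the assumed total of eight relevant documents and comes out at 25 percent.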

Evaluating IR
Recall is the fraction of all relevant documents in the collection that are retrieved.
Precision is the fraction of retrieved documents that are relevant.
An IR system ranks documents by a similarity coefficient (SC), allowing the user to trade off between precision and recall.

Precision/Recall Tradeoff
[Figure: precision plotted against recall, both on a scale up to 100%, with points marked for the top 10, top 100, and top 1000 retrieved documents.]

Strategy vs Utility
An IR strategy is a technique by which a relevance assessment is obtained between a query and a document.
An IR utility is a technique that may be used to improve the assessment given by a strategy. A utility may plug into any strategy.

Strategies
Manual
  Boolean
Automatic
  Probabilistic
  Inference Networks
  Vector Space Model
  Latent Semantic Indexing (LSI)
Adaptive Models
  Genetic Algorithms
  Neural Networks

Retrieval: Ad hoc and Filtering
Ad hoc retrieval: the documents in the collection remain relatively static while new queries are submitted to the system (example: a library).
Filtering: the queries remain relatively static while new documents enter and leave the system (example: the stock market).

A Formal Characterization of IR Models
Definition: An information retrieval model is a quadruple $[D, Q, F, R(q_i, d_j)]$, where

Continued:
(1) $D$ is a set composed of logical views (or representations) for the documents in the collection.
(2) $Q$ is a set composed of logical views (or representations) for the user information needs. Such representations are called queries.
(3) $F$ is a framework for modeling document representations, queries, and their relationships.
(4) $R(q_i, d_j)$ is a ranking function which associates a real number with a query $q_i \in Q$ and a document representation $d_j \in D$. Such a ranking defines an ordering among the documents with regard to the query $q_i$.

Index terms
A document is represented by a set of keywords, called index terms.
How to select keywords is an important issue and will be discussed in Chapter 7.
Some terms are more important than others, e.g., a term that appears in only five documents is more important than a term that appears in most of the documents.
The word "the" is not useful, while the word "cityU" is important for retrieving information related to our university.

Boolean Model
Each document $d_j$ is represented by a vector $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$, where $w_{i,j} = 0$ if term $k_i$ does not appear in $d_j$ and $w_{i,j} = 1$ if term $k_i$ is in $d_j$.
A query is a Boolean function represented in disjunctive normal form, for example $(1,1,1) \lor (1,1,0) \lor (1,0,0)$.

An example of the Boolean retrieval model
Documents:
d1 = (1, 0, 1, 1, 1, 1, 1, 1)
d2 = (0, 1, 0, 0, 1, 1, 1, 1)
d3 = (0, 0, 0, 1, 1, 1, 1, 1)
d4 = (1, 1, 0, 0, 1, 1, 0, 0)
Query: $(1, 1, 1, 1, 1, 1, 1, 1) \lor (1, 1, 0, 0, 1, 1, 0, 0)$
Result: only d4 is selected.
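
The selection of d4 can be checked mechanically. The sketch below assumes the matching rule implied by the example: a document is retrieved when its binary vector equals one of the conjunctive components of the query in disjunctive normal form. The function name `boolean_retrieve` is hypothetical.

```python
# Binary document vectors from the example above.
docs = {
    "d1": (1, 0, 1, 1, 1, 1, 1, 1),
    "d2": (0, 1, 0, 0, 1, 1, 1, 1),
    "d3": (0, 0, 0, 1, 1, 1, 1, 1),
    "d4": (1, 1, 0, 0, 1, 1, 0, 0),
}

# The query as a list of conjunctive components (the disjuncts of the DNF).
query_dnf = [
    (1, 1, 1, 1, 1, 1, 1, 1),
    (1, 1, 0, 0, 1, 1, 0, 0),
]

def boolean_retrieve(docs, query_dnf):
    """Return the names of documents whose vector equals some component of the query."""
    return [name for name, vector in docs.items() if vector in query_dnf]

print(boolean_retrieve(docs, query_dnf))
```

The decision is binary: d1 differs from the first component in a single position and is still rejected, which illustrates why the Boolean model cannot express partial matches.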

Representation of documents: Boolean model
d1: Computer science department, computer study, computer algorithms
d2: computer study, programming skills
d3: department stores, notebook
Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept., 6. algorithms, 7. programming, 8. skills, 9. notebook
d1 = (1,1,1,0,1,1,0,0,0)
d2 = (1,0,1,0,0,0,1,1,0)
d3 = (0,0,0,1,1,0,0,0,1)
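
The binary vectors can be derived from the document texts with a small amount of normalization. This is a sketch under stated assumptions: the `NORMALIZE` table that maps "department" to "dept" and "stores" to "store" is a hand-written stand-in for real text operations, and the function names are hypothetical.

```python
import re

KEYWORDS = ["computer", "science", "study", "store", "dept",
            "algorithms", "programming", "skills", "notebook"]

# Hand-written normalization (an assumption; a real system would use stemming rules).
NORMALIZE = {"department": "dept", "stores": "store"}

def tokens(text):
    """Lowercase the text, split it into words, and map variants to keyword forms."""
    return [NORMALIZE.get(word, word) for word in re.findall(r"[a-z]+", text.lower())]

def binary_vector(text):
    """Return a 0/1 tuple with one entry per keyword: 1 if the keyword occurs in the text."""
    present = set(tokens(text))
    return tuple(1 if keyword in present else 0 for keyword in KEYWORDS)

d1 = binary_vector("Computer science department, computer study, computer algorithms")
d2 = binary_vector("computer study, programming skills")
d3 = binary_vector("department stores, notebook")
print(d1, d2, d3, sep="\n")
```

Note that the three occurrences of "computer" in d1 collapse to a single 1, since the Boolean model records only presence or absence.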

Advantages
  Simple and easy for users to understand
  Precise semantics
  Neat formulation
  Received great attention in the past
Disadvantages
  Binary decision criterion (relevant or non-relevant)
  Hard to write the Boolean formula for the required information

Vector Space Model
Each document $d_j$ is represented by a vector $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$, where $w_{i,j} \ge 0$.
Each query $q$ is also represented by a vector $q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$.
The similarity between the document and the query is defined as
$$\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\left(\sum_{i=1}^{n} w_{i,j}^{2}\right)^{0.5} \left(\sum_{i=1}^{n} w_{i,q}^{2}\right)^{0.5}}$$

Example 1: $d_j = (2, 3, 1, 0)$ and $q = (2, 3, 1, 0)$.
$\mathrm{sim}(d_j, q) = (4+9+1+0) / \left((4+9+1+0)^{0.5} (4+9+1+0)^{0.5}\right) = 1$.
Example 2: $d_j = (0, 0, 0, 5)$ and $q = (2, 3, 1, 0)$.
$\mathrm{sim}(d_j, q) = 0 / \left((25)^{0.5} (4+9+1+0)^{0.5}\right) = 0$.
Example 3: $d_j = (1, 3, 1, 1)$ and $q = (2, 3, 1, 0)$.
$\mathrm{sim}(d_j, q) = (2+9+1+0) / \left((12)^{0.5} (14)^{0.5}\right) = 0.857^{0.5} \approx 0.926$.
Example 4: $d_j = (1, 3, 1, 0)$ and $q = (2, 3, 1, 0)$.
$\mathrm{sim}(d_j, q) = (2+9+1+0) / \left((11)^{0.5} (14)^{0.5}\right) \approx 0.967 > 0.926$.
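
The worked examples can be verified with a short function. The following is a minimal sketch of the cosine similarity between a document vector and a query vector; the name `cosine_sim` and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between vectors d and q (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

q = (2, 3, 1, 0)
for d in [(2, 3, 1, 0), (0, 0, 0, 5), (1, 3, 1, 1), (1, 3, 1, 0)]:
    print(d, round(cosine_sim(d, q), 3))
```

The last two documents differ only in the fourth term, which the query does not use: dropping that term shortens the document vector and raises the similarity from about 0.926 to about 0.967.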

Note that since $w_{i,j} \ge 0$ and $w_{i,q} \ge 0$, $\mathrm{sim}(q, d_j)$ lies in $[0, 1]$.
The documents are ranked according to their similarity to the query.
Even if the match is only partial, the document might be retrieved.

How to determine the weights $w_{i,j}$ on terms?
Definition: Let $N$ be the total number of documents in the system and $n_i$ be the number of documents in which the index term $k_i$ appears. Let $\mathrm{freq}_{i,j}$ be the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times the term $k_i$ is mentioned in the text of the document $d_j$). Then the normalized frequency $f_{i,j}$ of term $k_i$ in the document $d_j$ is given by
$$f_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_{l} \mathrm{freq}_{l,j}}$$
where the maximum is computed over all terms mentioned in the text of the document $d_j$. If the term $k_i$ does not appear in the document $d_j$, then $f_{i,j} = 0$.

Continued: Further, let $\mathrm{idf}_i$, the inverse document frequency for $k_i$, be given by
$$\mathrm{idf}_i = \log \frac{N}{n_i}$$
The best known term-weighting schemes use weights given by
$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$$
Such term-weighting strategies are called tf-idf schemes.

[Figure: plot of $\mathrm{idf}_i = \ln(1000/n_i)$ against $n_i$.]

[Figure: plot of $\mathrm{idf}_i = \log(1000/n_i)$ against $n_i$.]

Continued: Several variations of the above expression for the weight $w_{i,j}$ are described in an interesting paper by Salton and Buckley which appeared in 1988. For the query term weights, Salton and Buckley suggest
$$w_{i,q} = \left(0.5 + \frac{0.5\, \mathrm{freq}_{i,q}}{\max_{l} \mathrm{freq}_{l,q}}\right) \times \log \frac{N}{n_i}$$
where $\mathrm{freq}_{i,q}$ is the raw frequency of the term $k_i$ in the text of the query $q$.

Example 1:
d1: Its term-weighting scheme improves retrieval performance;
d2: Its partial matching strategy allows retrieval of documents that approximate the query conditions;
d3: Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
In this example, $N = 3$. For the term $k_i$ = "retrieval": $n_i = 2$, $\mathrm{idf}_i = \log(3/2) = 0.176$ (base-10 logarithm), $\mathrm{freq}_{i,1} = 1$, $f_{i,1} = 1$, $w_{i,1} = 0.176$.
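
The weight of "retrieval" in d1 can be recomputed from the three document texts. This is a minimal sketch, assuming words are obtained by lowercasing and splitting on non-letters, no stopword removal or stemming is applied, and the logarithm is base 10 (as the value 0.176 implies). The function name `term_weight` is hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    "Its term-weighting scheme improves retrieval performance",
    "Its partial matching strategy allows retrieval of documents "
    "that approximate the query conditions",
    "Its cosine ranking formula sorts the documents according to "
    "their degree of similarity to the query",
]

def term_weight(term, doc_index, docs):
    """tf-idf weight of `term` in docs[doc_index]: normalized frequency times log10(N / n_i)."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    counts = Counter(tokenized[doc_index])
    n_i = sum(1 for words in tokenized if term in words)   # documents containing the term
    if n_i == 0 or counts[term] == 0:
        return 0.0
    f = counts[term] / max(counts.values())                 # normalized frequency
    idf = math.log10(len(docs) / n_i)                       # inverse document frequency
    return f * idf

print(round(term_weight("retrieval", 0, docs), 3))
```

The term appears in two of the three documents, so its idf is log10(3/2); a term that appeared in all three would receive weight 0 regardless of its frequency.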

Example 2:
d1: Computer science department, computer study, computer algorithms
d2: computer study, programming skills
d3: department stores, notebook
Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept., 6. algorithms, 7. programming, 8. skills, 9. notebook
With raw term frequencies as entries ("computer" occurs three times in d1):
d1 = (3,1,1,0,1,1,0,0,0)
d2 = (1,0,1,0,0,0,1,1,0)
d3 = (0,0,0,1,1,0,0,0,1)

Raw frequencies freq_{i,j} (rows: documents d_j; columns: terms k_i; remaining cells are left blank on the slide)
      k1    k2    k3    k4    k5    k6    k7    k8    k9
d1    3     1
d2
d3

Normalized frequencies f_{i,j} (rows: documents d_j; columns: terms k_i; remaining cells are left blank on the slide)
      k1    k2    k3    k4    k5    k6    k7    k8    k9
d1    1     0.33
d2
d3

Document counts n_i and inverse document frequencies idf_i (remaining cells are left blank on the slide)
i       1     2     3     4     5     6     7     8     9
n_i
idf_i   0.18  0.48

Weights w_{i,j} (rows: documents d_j; columns: terms k_i; remaining cells are left blank on the slide)
      k1    k2    k3    k4    k5    k6    k7    k8    k9
d1    0.18  0.16  0.06
d2    (one value shown: 0.48)
d3
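
The partially filled tables for this example can be completed by computation. The sketch below assumes the raw frequency of "computer" in d1 is 3, as in the frequency table, and uses base-10 logarithms, which reproduce the values 0.18 and 0.48. The function name `tf_idf_table` is hypothetical.

```python
import math

# Raw term frequencies: one row per document (d1..d3), one column per keyword (k1..k9).
freq = [
    [3, 1, 1, 0, 1, 1, 0, 0, 0],  # d1
    [1, 0, 1, 0, 0, 0, 1, 1, 0],  # d2
    [0, 0, 0, 1, 1, 0, 0, 0, 1],  # d3
]

def tf_idf_table(freq):
    """Return (n, idf, f, w) for a document-by-term matrix of raw frequencies."""
    n_docs = len(freq)
    n_terms = len(freq[0])
    # n_i: number of documents that contain term k_i
    n = [sum(1 for row in freq if row[i] > 0) for i in range(n_terms)]
    # idf_i = log10(N / n_i); a term that occurs nowhere gets idf 0
    idf = [math.log10(n_docs / n_i) if n_i else 0.0 for n_i in n]
    # f_{i,j} = freq_{i,j} divided by the largest raw frequency in document d_j
    f = [[x / max(row) if max(row) else 0.0 for x in row] for row in freq]
    # w_{i,j} = f_{i,j} * idf_i
    w = [[f_ij * idf_i for f_ij, idf_i in zip(row, idf)] for row in f]
    return n, idf, f, w

n, idf, f, w = tf_idf_table(freq)
print("n_i: ", n)
print("idf_i:", [round(x, 2) for x in idf])
for name, row in zip(["d1", "d2", "d3"], w):
    print(name, [round(x, 2) for x in row])
```

The first three weights of d1 come out as 0.18, 0.16, and 0.06, matching the values shown on the slide, and the value 0.48 appears in the d2 row for the terms that occur only in d2.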

Example 3:
d1: Computer science dept. Algorithms improve retrieval performance;
d2: computer study, algorithm, programming skills, query conditions;
d3: computer stores, notebook, printers
d4: computer store sales CD's and software
Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept., 6. algorithms, 7. improve, 8. retrieval, 9. performance, 10. programming, 11. skills, 12. query, 13. conditions, 14. notebook, 15. printers, 16. sales, 17. CD's, 18. software, 19. algorithm
Questions:
19 keywords or 18 keywords? "algorithms" and "algorithm" differ only in number (a language-processing issue).
Every document may contain "the" many times; do we need it?
"Table" and "desk": are they the same? Related?
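
The question of 19 versus 18 keywords comes down to whether word variants are merged before indexing. The following sketch uses a deliberately crude normalizer, an assumption for illustration and not a real stemming algorithm: it drops a possessive "'s" and a trailing plural "s". The name `crude_stem` is hypothetical.

```python
KEYWORDS = [
    "computer", "science", "study", "store", "dept", "algorithms", "improve",
    "retrieval", "performance", "programming", "skills", "query", "conditions",
    "notebook", "printers", "sales", "CD's", "software", "algorithm",
]

def crude_stem(word):
    """Toy normalizer: lowercase, drop a trailing "'s", or a plural "s" on longer words."""
    word = word.lower()
    if word.endswith("'s"):
        return word[:-2]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

stems = {crude_stem(word) for word in KEYWORDS}
print(len(KEYWORDS), "keywords before normalization")
print(len(stems), "distinct keywords after normalization")
```

With this rule "algorithms" and "algorithm" map to the same entry, so the list shrinks from 19 to 18. A rule this simple would mishandle many English words, which is why text operations are treated as a topic of their own.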

Summary
Information retrieval models
  Boolean model
  Vector space model

Course Arrangement
No lecture or tutorial in week 2. A make-up class will be scheduled in week 3.