CS4485: Information Retrieval
Who I am: Dr. Lusheng WANG, Dept. of Computer Science. Office: Y6429. Phone: Web site:
2019/4/22 CS4485 Information Retrieval /WANG Lusheng
Text Book: Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999. We will add more material in the handout.
References:
W.B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand Reinhold, New York, 1994.
Michael Lesk, Practical Digital Libraries: Books, Bytes, and Bucks, Morgan Kaufmann, 1997.
Information Retrieval
User task: translate the information need into a query in some language, e.g., by providing some words.
Information Retrieval vs. Browsing:
Information retrieval: finding useful information for a clearly defined need.
Browsing: the objectives are not clearly defined and may change during the browsing process.
Most systems combine the two types.
Logical View of the Documents
Classic view: a set of index terms or keywords.
Full-text logical view: keep the full text (feasible with modern computers). The text still needs some special treatment (Chapter 7):
Elimination of stopwords (words so common that they appear in nearly every document and carry little content)
Stemming (reduces distinct words to their common grammatical root)
Identification of noun groups (eliminates adjectives, adverbs, and verbs)
Compression techniques
Document structures (chapters, sections, subsections) are exploited by structured text retrieval models.
What we will cover (syllabus):
Retrieval models for text (documents)
Retrieval models for hypertext (searching the web)
Retrieval evaluation
Query languages
Query operations
Text operations
Chinese language text operations
Indexing and searching (algorithmic issues)
Brief introduction to multimedia IR
Evaluation: 50% coursework, 50% examination.
Coursework breakdown:
One assignment: 20%
A midterm examination: 20%
A project (done in pairs): 60%
Definitions
A database is a collection of documents.
A document is a sequence of terms, expressing ideas about some topic in a natural language.
A term is a semantic unit: a word, a phrase, or potentially the root of a word.
A query is a request for documents pertaining to some topic.
Definitions (Cont.)
An Information Retrieval (IR) System attempts to find documents relevant to a user's request. The real problem boils down to matching the language of the query to the language of the document.
Hard Parts of IR
Simply matching on words is a very brittle approach. One word can have many different semantic meanings. Consider “take”:
“take a place at the table”
“take money to the bank”
“take a picture”
“take a lot of time”
“take drugs”
More problems with IR
You can't even tell what part of speech a word has: “I saw her duck.” A query that searches for “pictures of a duck” will find documents that contain “I saw her duck away from the ball falling from the sky.”
More Problems with IR
Proper nouns often reuse ordinary nouns. Consider a document containing “a man named Abraham owned a Lincoln”. A word-matching query for “Abraham Lincoln” may well find the above document.
What is Different about IR from the rest of Computer Science
Most algorithms in computer science have a “right” answer. Consider the two problems:
Sort the following ten integers.
Find the highest integer.
Now consider:
Find the document most relevant to “hippos in the zoo”.
Measuring Effectiveness
An algorithm is deemed incorrect if it does not produce the “right” answer. A heuristic tries to guess something close to the right answer, and heuristics are measured by how close they come to it. IR techniques are essentially heuristics because we do not know the right answer, so we have to measure how close to the right answer we can come.
Precision / Recall Example
Consider a query that retrieves 10 documents. Let's say the result set is: D1 D2 D3 D4 D5 D6 D7 D8 D9 D10. If all ten were relevant, we would have 100 percent precision. If these were the only ten relevant documents in the whole collection, we would also have 100 percent recall.
Example (continued)
Now let's say that only documents two and five are relevant. Consider these results: D1 D2 D3 D4 D5 D6 D7 D8 D9 D10. Since we have retrieved ten documents and gotten two of them right, precision is 20 percent. Recall is 2 / (total relevant documents in the entire collection).
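This computation can be sketched in a few lines of Python (the function name and document IDs are illustrative, not from the course):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    Recall    = |retrieved ∩ relevant| / |relevant|."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# The example above: ten retrieved documents, of which only D2 and D5
# are relevant, and no other relevant documents exist in the collection.
retrieved = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9", "D10"]
relevant = ["D2", "D5"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.2 1.0  (20% precision, 100% recall)
```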
Levels of Recall
If we keep retrieving documents, we will ultimately retrieve all documents and achieve 100 percent recall. That means we can keep retrieving documents until we reach x% recall.
Levels of Recall (example)
Retrieve the top 2000 documents. Let's say there are five relevant documents in total.
(Table: Document, DocID, Recall, and Precision listed for each of the five relevant documents A, B, C, D, E.)
How to evaluate the quality of a retrieval system
Let R be the set of all relevant documents.
A: the set of all documents reported as relevant by the system.
Ra = A ∩ R: the set of relevant documents actually reported.
Recall = |Ra|/|R|. Recall = 10% means 10% of the relevant documents in R are found.
Precision = |Ra|/|A|. Precision = 90% means 90% of the reported documents are relevant.
Recall = 100% does not mean all reported documents are relevant (precision may still be low).
Precision = 100% does not mean the system finds ALL relevant documents (recall may still be low).
Evaluating IR
Recall is the fraction of relevant documents retrieved from the set of total relevant documents collection-wide.
Precision is the fraction of relevant documents retrieved from the total number retrieved.
An IR system ranks documents by a similarity coefficient (SC), allowing the user to trade off between precision and recall.
Precision/Recall Tradeoff
(Figure: precision plotted against recall, both axes from 0 to 100%, with points marked for the top 10, top 100, and top 1000 retrieved documents.)
Strategy vs Utility
An IR strategy is a technique by which a relevance assessment is obtained between a query and a document.
An IR utility is a technique that may be used to improve the assessment given by a strategy. A utility may plug into any strategy.
Strategies:
Manual: Boolean
Automatic: Probabilistic, Inference Networks, Vector Space Model, Latent Semantic Indexing (LSI)
Adaptive Models: Genetic Algorithms, Neural Networks
Retrieval: Ad hoc and Filtering
Ad hoc retrieval: the documents in the collection remain relatively static while new queries are submitted to the system (e.g., a library).
Filtering: the queries remain relatively the same while new documents come into and leave the system (e.g., stock market news).
A formal Characterization of IR models
Definition: An information retrieval model is a quadruple [D, Q, F, R(qi,dj)] where
Continue:
(1) D is a set composed of logical views (or representations) for the documents in the collection.
(2) Q is a set composed of logical views (or representations) for the user information needs. Such representations are called queries.
(3) F is a framework for modeling document representations, queries, and their relationships.
(4) R(qi,dj) is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D. Such ranking defines an ordering among the documents with regard to the query qi.
Index terms
A document is represented by a set of keywords, called index terms. How to select keywords is an important issue and will be discussed in Chapter 7. Some terms are more important than others; e.g., a term that appears in only five documents is more discriminating than a term that appears in most of the documents. The word “the” is not useful, while the word “cityU” is important for retrieving information related to our university.
Boolean Model: Each document dj is represented by a vector dj=(w1,j, w2,j, …, wn,j), where wi,j = 0 if term ki does not appear in dj and wi,j = 1 if term ki is in dj. A query is a Boolean expression, represented in disjunctive normal form, e.g., q = (1,1,1) ∨ (1,1,0) ∨ (1,0,0).
An example of Boolean retrieval model
Documents: d1=(1,0,1,1,1,1,1,1), d2=(0,1,0,0,1,1,1,1), d3=(0,0,0,1,1,1,1,1), d4=(1,1,0,0,1,1,0,0)
Query: (1,1,1,1,1,1,1,1) ∨ (1,1,0,0,1,1,0,0)
Result: only d4 is selected (it matches the second conjunctive component exactly).
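A minimal Python sketch of this exact-match Boolean retrieval; the function and variable names are my own, not from the text:

```python
# Boolean retrieval on the example above: a document is retrieved iff its
# binary term vector equals one of the conjunctive components of the query
# expressed in disjunctive normal form (no ranking, relevance is all-or-nothing).
docs = {
    "d1": (1, 0, 1, 1, 1, 1, 1, 1),
    "d2": (0, 1, 0, 0, 1, 1, 1, 1),
    "d3": (0, 0, 0, 1, 1, 1, 1, 1),
    "d4": (1, 1, 0, 0, 1, 1, 0, 0),
}
# DNF query with two conjunctive components
query_dnf = [(1, 1, 1, 1, 1, 1, 1, 1), (1, 1, 0, 0, 1, 1, 0, 0)]

def retrieve(docs, query_dnf):
    # a document is selected if it matches some conjunctive component exactly
    return [name for name, vec in docs.items() if vec in query_dnf]

print(retrieve(docs, query_dnf))  # ['d4']
```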
Representation of documents: Boolean model:
d1: computer science department, computer study, computer algorithms
d2: computer study, programming skills
d3: department stores, notebook
Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept, 6. algorithms, 7. programming, 8. skills, 9. notebook
d1=(1,1,1,0,1,1,0,0,0); d2=(1,0,1,0,0,0,1,1,0); d3=(0,0,0,1,1,0,0,0,1)
Advantages:
simple, easy for users to understand
precise semantics
neat formulation
received great attention in the past
Disadvantages:
binary decision criterion (relevant or non-relevant, with no ranking)
hard to express the required information as a Boolean formula
Vector Space Model
Each document dj is represented by a vector dj=(w1,j, w2,j, …, wn,j), where wi,j ≥ 0.
Each query q is also represented by a vector q=(w1,q, w2,q, …, wn,q).
The similarity between the document and the query is defined as
sim(dj, q) = (Σi=1..n wi,j · wi,q) / (√(Σi=1..n wi,j²) · √(Σi=1..n wi,q²))
Example 1: dj=(2,3,1,0) and q=(2,3,1,0). sim(dj,q) = (4+9+1+0) / ((14)^0.5 (14)^0.5) = 14/14 = 1.
Example 2: dj=(0,0,0,5) and q=(2,3,1,0). sim(dj,q) = 0 / ((25)^0.5 (14)^0.5) = 0.
Example 3: dj=(1,3,1,1) and q=(2,3,1,0). sim(dj,q) = (2+9+1+0) / ((12)^0.5 (14)^0.5) = 12/12.96 ≈ 0.926.
Example 4: dj=(1,3,1,0) and q=(2,3,1,0). sim(dj,q) = (2+9+1+0) / ((11)^0.5 (14)^0.5) ≈ 0.967 > 0.926.
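The cosine similarity used in these examples can be sketched in Python (`cosine_sim` is an illustrative name, not from the text):

```python
from math import sqrt

def cosine_sim(d, q):
    # sim(d, q) = Σ d_i·q_i / (sqrt(Σ d_i²) · sqrt(Σ q_i²))
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = sqrt(sum(di * di for di in d)) * sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0

q = (2, 3, 1, 0)
print(round(cosine_sim((2, 3, 1, 0), q), 3))  # 1.0   (identical vectors)
print(round(cosine_sim((0, 0, 0, 5), q), 3))  # 0.0   (no common terms)
print(round(cosine_sim((1, 3, 1, 1), q), 3))  # 0.926
print(round(cosine_sim((1, 3, 1, 0), q), 3))  # 0.967 (closer to q)
```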
Note that since wi,j ≥ 0 and wi,q ≥ 0, sim(dj, q) is in [0,1]. The documents are ranked according to the similarity. Even if the match is only partial, the document might still be retrieved.
How do we determine the weights wi,j on terms?
Definition: Let N be the total number of documents in the system and ni be the number of documents in which the index term ki appears. Let freqi,j be the raw frequency of term ki in the document dj (i.e., the number of times the term ki is mentioned in the text of the document dj). Then, the normalized frequency fi,j of term ki in the document dj is given by
fi,j = freqi,j / maxl freql,j
where the maximum is computed over all terms which are mentioned in the text of the document dj. If the term ki does not appear in the document dj then fi,j = 0.
Continue: Further, let idfi, the inverse document frequency for ki, be given by
idfi = log(N / ni)
The best-known term-weighting schemes use weights which are given by
wi,j = fi,j × log(N / ni)
Such term-weighting strategies are called tf-idf schemes.
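A minimal sketch of this tf-idf weighting, assuming base-10 logarithms as in the worked examples later in the deck (the function name is illustrative):

```python
from math import log10

def tf_idf_weights(doc_freqs, N, n):
    # w[i] = f[i] * log10(N / n[i]), with f[i] = freq[i] / max_l freq[l]
    max_freq = max(doc_freqs)
    return [(freq / max_freq) * log10(N / n_i) if freq else 0.0
            for freq, n_i in zip(doc_freqs, n)]

# Term "retrieval" from the later Example 1: N = 3 documents, the term
# appears in n_i = 2 of them, once in d1 where the maximum raw frequency is 1.
print(round(tf_idf_weights([1], 3, [2])[0], 3))  # 0.176
```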
(Figure: idfi = ln(1000/ni) plotted as a function of ni, for N = 1000.)
(Figure: idfi = log(1000/ni) plotted as a function of ni, for N = 1000.)
Continue: Several variations of the above expression for the weight wi,j are described in an interesting paper by Salton and Buckley which appeared in 1988. For the query term weights, Salton and Buckley suggest
wi,q = (0.5 + 0.5 · freqi,q / maxl freql,q) × log(N / ni)
Example 1:
d1: Its term-weighting scheme improves retrieval performance;
d2: Its partial matching strategy allows retrieval of documents that approximate the query conditions;
d3: Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
In this example, N=3. For the term ki = “retrieval”: ni=2, idfi = log(3/2) = 0.176, freqi,1 = 1, fi,1 = 1, wi,1 = 0.176.
Example 2:
d1: computer science department, computer study, computer algorithms
d2: computer study, programming skills
d3: department stores, notebook
Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept, 6. algorithms, 7. programming, 8. skills, 9. notebook
Raw frequencies: d1=(3,1,1,0,1,1,0,0,0); d2=(1,0,1,0,0,0,1,1,0); d3=(0,0,0,1,1,0,0,0,1)
freqi,j:
      k1  k2  k3  k4  k5  k6  k7  k8  k9
d1:    3   1   1   0   1   1   0   0   0
d2:    1   0   1   0   0   0   1   1   0
d3:    0   0   0   1   1   0   0   0   1
fi,j (each raw frequency divided by the maximum raw frequency in that document):
      k1    k2    k3    k4    k5    k6    k7    k8    k9
d1:    1   0.33  0.33   0    0.33  0.33   0     0     0
d2:    1    0     1     0     0     0     1     1     0
d3:    0    0     0     1     1     0     0     0     1
ni and idfi = log(3/ni) for each keyword:
       k1    k2    k3    k4    k5    k6    k7    k8    k9
ni:     2     1     2     1     2     1     1     1     1
idfi:  0.18  0.48  0.18  0.48  0.18  0.48  0.48  0.48  0.48
wi,j = fi,j × idfi:
      k1    k2    k3    k4    k5    k6    k7    k8    k9
d1:  0.18  0.16  0.06   0    0.06  0.16   0     0     0
d2:  0.18   0    0.18   0     0     0    0.48  0.48   0
d3:   0     0     0    0.48  0.18   0     0     0    0.48
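The tables for this example can be reproduced with a short script, assuming base-10 logarithms and rounding to two decimals (the variable names are mine):

```python
from math import log10

N = 3
# raw term frequencies over keywords k1..k9
# (computer, science, study, store, dept, algorithms, programming, skills, notebook)
freq = {
    "d1": [3, 1, 1, 0, 1, 1, 0, 0, 0],
    "d2": [1, 0, 1, 0, 0, 0, 1, 1, 0],
    "d3": [0, 0, 0, 1, 1, 0, 0, 0, 1],
}
# n_i = number of documents containing keyword k_i; idf_i = log10(N / n_i)
n = [sum(1 for f in freq.values() if f[i] > 0) for i in range(9)]
idf = [round(log10(N / ni), 2) for ni in n]

# w_{i,j} = f_{i,j} * idf_i, with f_{i,j} = freq_{i,j} / max_l freq_{l,j}
w = {}
for name, f in freq.items():
    m = max(f)
    w[name] = [round((fi / m) * log10(N / n[i]), 2) for i, fi in enumerate(f)]

print(idf)      # [0.18, 0.48, 0.18, 0.48, 0.18, 0.48, 0.48, 0.48, 0.48]
print(w["d1"])  # [0.18, 0.16, 0.06, 0.0, 0.06, 0.16, 0.0, 0.0, 0.0]
```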
Example 3:
d1: Computer science dept. Algorithms improve retrieval performance
d2: computer study, algorithm, programming skills, query conditions
d3: computer stores, notebook, printers
d4: computer store sales CD’s and software
Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept, 6. algorithms, 7. improve, 8. retrieval, 9. performance, 10. programming, 11. skills, 12. query, 13. conditions, 14. notebook, 15. printers, 16. sales, 17. CD’s, 18. software, 19. algorithm
Questions:
19 keywords or 18? Should “algorithm” and “algorithms” count as one term? (language processing / stemming)
Every document contains many occurrences of “the”; do we need it?
“Table” and “desk”: are they the same? related?
Summary: Information Retrieval models
Boolean model
Vector space model
Course Arrangement: there will be no lecture or tutorial in week 2. A make-up class will be scheduled in week 3.