Introduction to Digital Libraries Searching

Technical View: Retrieval as Matching Documents to Queries
[Figure: an algorithm matches document surrogates (terms, sample vectors drawn from the document space) against query forms drawn from the query space; there is no user and no feedback in the loop.]
Retrieval is algorithmic. Evaluation is typically a binary decision for each pairwise match, plus one or more aggregate values for a set of matches (e.g., recall and precision).

Human View: Information-Seeking Process
[Figure: the user moves among a problem, perceived needs, queries, actions, results, data, and indexes through a physical interface.]
Information seeking is an active, iterative process controlled by a human who changes throughout the process. Evaluation is relative to human needs.

IR Models
User task:
  Retrieval: ad hoc, filtering
  Browsing
Model taxonomy:
  Classic models: Boolean, vector, probabilistic
  Set-theoretic: fuzzy, extended Boolean
  Algebraic: generalized vector, latent semantic indexing, neural networks
  Probabilistic: inference network, belief network
  Structured models: non-overlapping lists, proximal nodes
  Browsing: flat, structure guided, hypertext

“Classic” Retrieval Models
Boolean: documents and queries are sets of index terms
Vector: documents and queries are vectors in N-dimensional space
Probabilistic: based on probability theory

Boolean Searching
Exactly what you would expect: and, or, not operations are defined.
Requires an exact match.
Based on an inverted file.
Example: (computer AND science) AND (NOT animals) would prevent a document containing "use of computers in animal science research" from being retrieved.
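
As a concrete illustration (not taken from the original slides), here is a minimal sketch of Boolean evaluation over an inverted file; the three documents and the tokenization are made up for the example.

```python
# Minimal sketch of Boolean retrieval over an inverted file (illustrative only).

docs = {
    1: "use of computer models in animal science research",
    2: "computer science curriculum design",
    3: "information retrieval systems",
}

# Inverted file: term -> set of ids of documents containing that term.
inverted = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted.setdefault(term, set()).add(doc_id)

def postings(term):
    return inverted.get(term, set())

all_ids = set(docs)

# (computer AND science) AND (NOT animal):
# set intersection for AND, union for OR, complement for NOT.
result = (postings("computer") & postings("science")) & (all_ids - postings("animal"))
print(result)  # {2}: document 1 is excluded because it mentions animals
```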

Boolean ‘AND’: Information AND Retrieval
[Venn diagram: two overlapping circles labelled "Information" and "Retrieval"; the query matches only their intersection.]

Example
Draw a Venn diagram for each of the following queries.
What is the meaning of: care AND feeding AND (cats OR dogs)?
What is the meaning of: information AND retrieval AND performance OR evaluation?

Exercise
D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
Q1 = “information ∧ retrieval”
Q2 = “information ∧ ¬computer”

Boolean-based Matching
Exact-match systems separate the documents containing a given term from those that do not.
[Term–document incidence matrix of 0/1 entries omitted; the terms include Mediterranean, adventure, agriculture, cathedrals, horticulture, scholarships, disasters, bridge, leprosy, recipes, flags, tennis, and Venus.]
Example queries:
flags AND tennis
leprosy AND tennis
Venus OR (tennis AND flags)
(bridge OR flags) AND tennis

Exercise
[Venn diagram over the author sets Swift, Shakespeare, Milton, and Chaucer, with its regions numbered 1–15.]
Which regions satisfy:
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

Boolean features
Order dependency of operators: ( ), NOT, AND, OR (as in DIALOG); precedence may differ between systems.
Nesting of search terms: nutrition AND (fast OR junk) AND food

Boolean Limitations
Searches can become complex for the average user.
Too much ANDing can clobber recall.
Tricky syntax: "research AND NOT computer science", "research AND NOT (computer science)" (implicit OR), and "research AND NOT (computer AND science)" are all different (frequently seen in NTRS logs).

Vector Model
Calculate the degree of similarity between document and query.
Output is ranked by sorting the similarity values.
Also called the 'vector space model'.
Imagine your documents as N-dimensional vectors (where N = the number of distinct words).
The "closeness" of two documents can be expressed as the cosine of the angle between the two vectors.

Vector Space Model
Documents and queries are points in N-dimensional space (where N is the number of unique index terms in the collection).
[Figure: a query point Q and a document point D in this space.]

Vector Space Model with Term Weights
Assume document terms have different values for retrieval.
Therefore assign a weight to each term in each document.
Example: proportional to the frequency of the term in the document, and inversely proportional to the frequency of the term in the collection.
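
A minimal sketch of such a weighting scheme (the familiar tf-idf family), assuming tf normalized by the largest in-document count and idf = log10(N/df); the original course does not fix the exact formula, so this is only one common variant.

```python
import math

def tf_idf_weights(doc_term_counts, collection_doc_freq, n_docs):
    """One common tf-idf variant: tf normalized by the max count in the
    document, idf = log10(N / df). Illustrative, not the only choice."""
    max_count = max(doc_term_counts.values())
    weights = {}
    for term, count in doc_term_counts.items():
        tf = count / max_count
        idf = math.log10(n_docs / collection_doc_freq[term])
        weights[term] = tf * idf
    return weights

# Toy usage: 4 documents in the collection, one document's counts shown.
counts = {"computer": 3, "information": 1, "retrieval": 2}
df = {"computer": 3, "information": 2, "retrieval": 1}
print(tf_idf_weights(counts, df, n_docs=4))
```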

Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q  = 0T1 + 0T2 + 2T3
[Figure: the three vectors plotted against axes T1, T2, T3.]
Is D1 or D2 more similar to Q?
How do we measure the degree of similarity? Distance? Angle? Projection?
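
One way to answer the question above (not worked out on the original slide) is the cosine measure; a quick sketch:

```python
import math

def cosine(v, w):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

print(round(cosine(D1, Q), 3))  # ~0.811
print(round(cosine(D2, Q), 3))  # ~0.130 -> D1 is closer in angle to Q
```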

Document and Query Vectors
Documents and queries are vectors of terms.
Vectors can use binary keyword weights or real-valued weights in [0, 1] (e.g., term frequencies).
Example terms: "dog", "cat", "house", "sink", "road", "car"
Binary: (1, 1, 0, 0, 0, 0), (0, 0, 1, 1, 0, 0)
Weighted: (0.01, 0.01, 0.002, 0.0, 0.0, 0.0)

Document Collection Representation
A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix is the "weight" of a term in a document; zero means the term has no significance in the document or simply does not occur in it.

     T1    T2   ...   Tt
D1   w11   w21  ...   wt1
D2   w12   w22  ...   wt2
:    :     :          :
Dn   w1n   w2n  ...   wtn

Inner Product: Example 1
[Worked example over index terms k1, k2, k3 omitted.]

Vector Space Example
Indexed words: factors, information, help, human, operation, retrieval, systems
Query: "human factors in information retrieval systems" -> vector (1 1 0 1 0 1 1)
Record 1 contains human, factors, information, retrieval -> vector (1 1 0 1 0 1 0)
Record 2 contains human, factors, help, systems -> vector (1 0 1 1 0 0 1)
Record 3 contains factors, operation, systems -> vector (1 0 0 0 1 0 1)

Simple match (binary vectors):
Query (1 1 0 1 0 1 1)
Rec1  (1 1 0 1 0 1 0) -> common terms (1 1 0 1 0 1 0) -> score 4
Rec2  (1 0 1 1 0 0 1) -> common terms (1 0 0 1 0 0 1) -> score 3
Rec3  (1 0 0 0 1 0 1) -> common terms (1 0 0 0 0 0 1) -> score 2

Weighted match (term weights in the records):
Query (1 1 0 1 0 1 1)
Rec1  (2 3 0 5 0 3 0) -> matched weights (2 3 0 5 0 3 0) -> score 13
Rec2  (2 0 4 5 0 0 1) -> matched weights (2 0 0 5 0 0 1) -> score 8
Rec3  (2 0 0 0 2 0 1) -> matched weights (2 0 0 0 0 0 1) -> score 3
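
The same scores can be reproduced as inner products between the query vector and each record vector; a small sketch:

```python
def inner_product(q, d):
    """Score a record vector d against a binary query vector q."""
    return sum(qi * di for qi, di in zip(q, d))

query = [1, 1, 0, 1, 0, 1, 1]

binary_records = {
    "Rec1": [1, 1, 0, 1, 0, 1, 0],
    "Rec2": [1, 0, 1, 1, 0, 0, 1],
    "Rec3": [1, 0, 0, 0, 1, 0, 1],
}
weighted_records = {
    "Rec1": [2, 3, 0, 5, 0, 3, 0],
    "Rec2": [2, 0, 4, 5, 0, 0, 1],
    "Rec3": [2, 0, 0, 0, 2, 0, 1],
}

print({r: inner_product(query, v) for r, v in binary_records.items()})    # {'Rec1': 4, 'Rec2': 3, 'Rec3': 2}
print({r: inner_product(query, v) for r, v in weighted_records.items()})  # {'Rec1': 13, 'Rec2': 8, 'Rec3': 3}
```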

Term Weights: Term Frequency
More frequent terms in a document are more important, i.e., more indicative of its topic.
f_ij = frequency of term i in document j
We may want to normalize term frequency (tf), typically by the largest frequency of any term in the same document:
tf_ij = f_ij / max_i{f_ij}

Some Formulas for Sim
Common similarity measures between a document D and a query Q, viewed as term-weight vectors: dot product, cosine, Dice coefficient, Jaccard coefficient.
[Figure: D and Q plotted against term axes t1 and t2.]
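
The formulas themselves did not survive the transcript; the sketch below uses the standard weighted definitions of these four measures, which may differ in detail from the original slide.

```python
import math

def dot(d, q):
    # Dot product of two term-weight vectors.
    return sum(di * qi for di, qi in zip(d, q))

def cosine(d, q):
    # Dot product normalized by the vector lengths.
    return dot(d, q) / (math.sqrt(dot(d, d)) * math.sqrt(dot(q, q)))

def dice(d, q):
    # Dice coefficient (weighted form).
    return 2 * dot(d, q) / (dot(d, d) + dot(q, q))

def jaccard(d, q):
    # Jaccard coefficient (weighted form).
    return dot(d, q) / (dot(d, d) + dot(q, q) - dot(d, q))

D, Q = [2, 3, 5], [0, 0, 2]
print(dot(D, Q), round(cosine(D, Q), 3), round(dice(D, Q), 3), round(jaccard(D, Q), 3))
```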

Example
Documents: Austen's Sense and Sensibility (SAS) and Pride and Prejudice (PAP); Bronte's Wuthering Heights (WH)
cos(SAS, PAP) = .996 x .993 + .087 x .120 + .017 x 0.0 ≈ 0.999
cos(SAS, WH) = .996 x .847 + .087 x .466 + .017 x .254 ≈ 0.888

Extended Boolean Model
The Boolean model is simple and elegant, but it provides no ranking.
As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership.
Extend the Boolean model with the notions of partial matching and term weighting.
Combine characteristics of the vector model with properties of Boolean algebra.

The Idea
q_or = k_x ∨ k_y, with document weights w_xj = x and w_yj = y.
[Figure: document d_j plotted in the (k_x, k_y) plane, with corners (0,0) and (1,1).]
For an OR query we want a document to be as far as possible from (0,0):
sim(q_or, d_j) = sqrt((x^2 + y^2) / 2)
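
A small sketch of the p = 2 ("Euclidean") case of the extended Boolean model. Only the OR formula appears above; the AND counterpart shown here is the standard one and is included as an assumption.

```python
import math

def sim_or(x, y):
    """Extended Boolean OR, p = 2: distance from (0,0), normalized to [0,1]."""
    return math.sqrt((x**2 + y**2) / 2)

def sim_and(x, y):
    """Extended Boolean AND, p = 2: one minus the distance from (1,1)."""
    return 1 - math.sqrt(((1 - x)**2 + (1 - y)**2) / 2)

# A document weighted 0.8 on k_x and 0.3 on k_y:
print(round(sim_or(0.8, 0.3), 3), round(sim_and(0.8, 0.3), 3))
```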

Fuzzy Set Model
Queries and documents are represented by sets of index terms, so matching is approximate from the start.
This vagueness can be modeled using a fuzzy framework, as follows:
with each index term a fuzzy set is associated, and
each document has a degree of membership in this fuzzy set.
This interpretation provides the foundation for many IR models based on fuzzy theory.

Probabilistic Model
Views retrieval as an attempt to answer a basic question: "What is the probability that this document is relevant to this query?"
Expressed as P(REL|D), i.e., the probability of relevance given a particular document D.

Probabilistic Model
An initial set of documents is retrieved somehow.
The user inspects these documents looking for the relevant ones (in practice, only the top 10-20 need to be inspected).
The system uses this information to refine its description of the ideal answer set.
By repeating this process, the description of the ideal answer set is expected to improve.
Keep in mind that the description of the ideal answer set must be guessed at the very beginning.
The description of the ideal answer set is modeled in probabilistic terms.
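
The slides do not give the refinement formula. As one concrete (and standard) instantiation, the sketch below uses the Robertson/Sparck Jones relevance weight to re-estimate a term's weight from the relevant documents identified so far; treat this as an assumption about the intent, not the course's prescribed formula.

```python
import math

def rsj_weight(r, R, n, N):
    """Robertson/Sparck Jones term weight with 0.5 smoothing.
    r: relevant docs containing the term, R: relevant docs seen so far,
    n: docs containing the term, N: docs in the collection."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# One feedback iteration: a term occurring in 8 of 10 inspected relevant
# documents but in only 50 of 1000 documents overall gets a high weight.
print(round(rsj_weight(r=8, R=10, n=50, N=1000), 2))
```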

Recombination after dimensionality reduction
[Figure omitted.]

Classic IR Models
Vector vs. probabilistic: "Numerous experiments demonstrate that probabilistic retrieval procedures yield good results. However, the results have not been sufficiently better than those obtained using Boolean or vector techniques to convince system developers to move heavily in this direction."

Example
Build the inverted file for the following documents:
F1 = {Written Quiz for Algorithms and Techniques of Information Retrieval}
F2 = {Program Quiz for Algorithms and Techniques of Web Search}
F3 = {Search on the Web for Information on Algorithms}
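
A minimal sketch of building that inverted file; lowercasing everything and keeping stop words are assumptions, since the exercise does not say how to normalize.

```python
from collections import defaultdict

docs = {
    "F1": "Written Quiz for Algorithms and Techniques of Information Retrieval",
    "F2": "Program Quiz for Algorithms and Techniques of Web Search",
    "F3": "Search on the Web for Information on Algorithms",
}

# Inverted file: term -> set of documents containing the term.
inverted_file = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_file[term].add(doc_id)

for term in sorted(inverted_file):
    print(term, sorted(inverted_file[term]))
# e.g. algorithms ['F1', 'F2', 'F3'], quiz ['F1', 'F2'], web ['F2', 'F3'], ...
```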

Exercise
You have a collection of documents that contain the following index terms:
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.
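
A sketch of one way to carry out this exercise, assuming tf x idf weights (idf = log10(N/df)) and cosine similarity between documents; the exercise does not fix either choice, so treat both as assumptions.

```python
import math
from collections import Counter

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

counts = {d: Counter(text.split()) for d, text in docs.items()}
terms = sorted({t for c in counts.values() for t in c})
df = {t: sum(1 for c in counts.values() if t in c) for t in terms}
n = len(docs)

# Weight: term frequency times inverse document frequency.
weights = {d: [counts[d][t] * math.log10(n / df[t]) for t in terms] for d in docs}

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

# Print the document-document similarity matrix, one row per document.
for d1 in docs:
    print(d1, [round(cosine(weights[d1], weights[d2]), 2) for d2 in docs])
```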

Term-document matrix:

Terms      c1  c2  c3  c4  c5  m1  m2  m3  m4
human       1   0   0   1   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0
user        0   1   1   0   1   0   0   0   0
system      0   1   1   2   0   0   0   0   0
response    0   1   0   0   1   0   0   0   0
time        0   1   0   0   1   0   0   0   0
EPS         0   0   1   1   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1
trees       0   0   0   0   0   1   1   1   0
graph       0   0   0   0   0   0   1   1   1
minors      0   0   0   0   0   0   0   1   1

Give the scores of the 9 documents for the query "trees, minors" using Boolean search.
Give the scores of the 9 documents for the query "trees, minors" using the vector model.
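
A sketch of how both sets of scores might be computed. It assumes the Boolean query means "trees AND minors" (the OR reading is also shown) and that the vector score is the raw inner product of a binary query vector with each document column; other conventions, such as cosine, are equally defensible.

```python
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

# Columns of the term-document matrix above (one list per document).
matrix = {
    "c1": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "c2": [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0],
    "c3": [0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0],
    "c4": [1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0],
    "c5": [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    "m1": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    "m2": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    "m3": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    "m4": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
}

q = [1 if t in ("trees", "minors") else 0 for t in terms]
i_trees, i_minors = terms.index("trees"), terms.index("minors")

# Boolean: exact match on the query terms.
bool_and = {d: int(v[i_trees] > 0 and v[i_minors] > 0) for d, v in matrix.items()}
bool_or = {d: int(v[i_trees] > 0 or v[i_minors] > 0) for d, v in matrix.items()}

# Vector model: inner product of the binary query with each document column.
vector = {d: sum(qi * vi for qi, vi in zip(q, v)) for d, v in matrix.items()}

print(bool_and)  # only m3 matches trees AND minors
print(bool_or)   # m1, m2, m3, m4 match trees OR minors
print(vector)    # m3 scores 2; m1, m2, m4 score 1; the c documents score 0
```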