Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling Alexander Gelbukh www.Gelbukh.com.

Presentation transcript:

Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling Alexander Gelbukh

2 Previous chapter
User Information Need
  o Vague
  o Semantic, not formal
Document Relevance
  o Order, not retrieve
Huge amount of information
  o Efficiency concerns
  o Tradeoffs
Art more than science

3 Modeling
Still science: computation is formal
No good methods to work with (vague) semantics
Thus, simplify to get a (formal) model
Develop (precise) math over this (simple) model
Why math if the model is not precise (simplified)?
  o With math: phenomenon ≈ model = step 1 = step 2 = ... = result
  o Without math: phenomenon ≈ model ≈ step 1 ≈ step 2 ≈ ... ?!

4 Modeling
Substitute a complex real phenomenon with a simple model, which you can measure and manipulate formally
Keep only important properties (for this application)
Do this with text:

5 Modeling in IR: idea
Tag documents with fields
  o As in a (relational) DB: customer = {name, age, address}
  o Unlike DB, very many fields: individual words!
  o E.g., bag of words: {word 1, word 2, ...}: {3, 5, 0, 0, 2, ...}
Define a similarity measure between query and such a record
  o (Unlike DB) Rank (order), not retrieve (yes/no)
  o Justify your model (optional, but nice)
Develop math and algorithms for fast access
  o as relational algebra in DB
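The bag-of-words idea on this slide can be sketched in a few lines of Python. This is only an illustration: the function name bag_of_words, the whitespace tokenization, and the tiny vocabulary are assumptions made for the example, not part of the slides.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary: each word plays the role of one "field" of the record.
vocabulary = ["cat", "dog", "mouse", "cheese"]
vector = bag_of_words("Cat chases mouse and mouse eats cheese", vocabulary)
print(vector)
```

Words outside the vocabulary are simply ignored, and word order is lost, which is exactly the simplification the model accepts.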

Taxonomy of IR systems

7 Aspects of an IR system
IR model
  o Boolean, Vector, Probabilistic
Logical view of documents
  o Full text, bag of words, ...
User task
  o retrieval, browsing
Independent, though some are more compatible

Appropriate models

9 Characterization of an IR model
D = {d_j}, collection of formal representations of docs
  o e.g., keyword vectors
Q = {q_i}, possible formal representations of user information need (queries)
F, framework for modeling these two: the rationale for the ranking function
R(q_i, d_j): Q × D → ℝ, ranking function
  o defines ordering

Specific IR models

11 IR models
Classical
  o Boolean
  o Vector
  o Probabilistic
(clear ideas, but some disadvantages)
Refined
  o Each one with refinements
  o Solve many of the problems of the basic models
  o Give good examples of possible developments in the area
  o Not investigated well
We can work on this

12 Basic notions
Document: set of index terms
  o Mainly nouns
  o Maybe all, then full text logical view
Term weights
  o some terms are better than others
  o terms less frequent in this doc and more frequent in other docs are less useful
Document's index term vector {w_1j, w_2j, ..., w_tj}
  o weights of terms in the doc
  o t is the number of terms in all docs
  o weights of different terms are independent (simplification)

13 Boolean model
Weights ∈ {0, 1}
  o Doc: set of words
Query: Boolean expression
  o R(q_i, d_j) ∈ {0, 1}
Good:
  o clear semantics, neat formalism, simple
Bad:
  o no ranking (behaves like data retrieval), retrieves too many or too few
  o difficult to translate User Information Need into query
No term weighting
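A minimal sketch of Boolean retrieval, assuming documents are stored as Python sets of words and the query is passed as a predicate over such a set. The helper boolean_retrieve and the sample documents are hypothetical and only serve the illustration.

```python
def boolean_retrieve(docs, predicate):
    """Return the names of all documents whose term set satisfies the query."""
    # R(q, d) is 1 when the predicate holds and 0 otherwise: no ranking.
    return {name for name, terms in docs.items() if predicate(terms)}

docs = {
    "d1": {"mouse", "cat"},
    "d2": {"mouse", "keyboard"},
    "d3": {"cat", "dog"},
}

# Query: mouse AND NOT cat
hits = boolean_retrieve(docs, lambda terms: "mouse" in terms and "cat" not in terms)
print(hits)
```

The result is an unordered set, which shows the main weakness listed above: every hit is as good as any other, and a slightly stricter or looser query can return nothing or nearly everything.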

14 Vector model
Weights (non-binary)
Ranking, much better results (for User Info Need)
R(q_i, d_j) = correlation between query vector and doc vector
E.g., cosine measure: (there is a typo in the book)
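The formula itself did not survive in this transcript. The standard cosine measure divides the dot product of the two weight vectors by the product of their lengths: sim(d_j, q) = (d_j · q) / (|d_j| |q|). A small sketch, with made-up weight vectors and a hypothetical function name cosine:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Made-up weights over a three-term vocabulary.
query = [1, 0, 1]
docs = {"d1": [2, 0, 2], "d2": [0, 3, 0], "d3": [1, 1, 0]}

ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranking)
```

Because the measure depends only on the angle, d1 matches the query perfectly even though its weights are twice as large.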

Projection

16 Weights
How are the weights w_ij obtained? Many variants. One way: TF-IDF balance
TF: Term frequency
  o How well is the term related to the doc?
  o If it appears many times, it is important
  o Proportional to the number of times it appears
IDF: Inverse document frequency
  o How important is the term for distinguishing documents?
  o If it appears in many docs, it is not important
  o Inversely proportional to the number of docs where it appears
Contradictory. How to balance?

17 TF-IDF ranking
TF: Term frequency
IDF: Inverse document frequency
Balance: TF × IDF
  o Other formulas exist. Art.
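The slide's formulas are missing from the transcript, so the sketch below assumes one common variant: TF is the term count divided by the largest count in the same document, and IDF is log(N / n_i), where N is the number of documents and n_i the number of documents containing the term. The function name tf_idf and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map each document name to a {term: weight} dict (docs must be non-empty lists)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        max_count = max(counts.values())
        weights[name] = {
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

docs = {
    "d1": ["mouse", "mouse", "cat"],
    "d2": ["mouse", "keyboard"],
    "d3": ["cat", "dog", "dog"],
}
w = tf_idf(docs)
print(w["d2"])
```

In d2 both terms occur once, yet "keyboard" gets the larger weight because it appears in fewer documents. A term present in every document would get weight 0 under this variant.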

18 Advantages of vector model
One of the best known strategies
Improves quality (term weighting)
Allows approximate matching (partial matching)
Gives ranking by similarity (cosine formula)
Simple, fast
But:
Does not consider term dependencies
  o considering them in a bad way hurts quality
  o no known good way
No logical expressions (e.g., negation: mouse & NOT cat)

19 Probabilistic model
Assumptions:
  o set of relevant docs,
  o probabilities of docs to be relevant
  o After Bayes calculation: probabilities of terms to be important for defining relevant docs
Initial idea: interact with the user.
  o Generate an initial set
  o Ask the user to mark some of them as relevant or not
  o Estimate the probabilities of keywords. Repeat
Can be done without user
  o Just re-calculate the probabilities assuming the user's acceptance is the same as the predicted ranking
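The loop without user interaction can be sketched as follows. The slides give no formulas, so this example assumes the usual binary independence formulation: each query term present in a document adds log(p / (1 - p)) + log((1 - q) / q), where p and q are the estimated probabilities of the term occurring in relevant and non-relevant documents. The starting values (p = 0.5, q = n_i / N) and the 0.5 smoothing in the re-estimation step are also assumptions, and all function names are hypothetical.

```python
import math

def score(doc, query, p_rel, p_non):
    """Sum of log-odds weights of the query terms present in the document."""
    return sum(
        math.log(p_rel[t] / (1 - p_rel[t])) + math.log((1 - p_non[t]) / p_non[t])
        for t in query if t in doc
    )

def rank(docs, query, p_rel, p_non):
    """Document names ordered by decreasing score (ties keep insertion order)."""
    return sorted(docs, key=lambda name: score(docs[name], query, p_rel, p_non),
                  reverse=True)

def initial_estimates(docs, query):
    """Assumed starting guess: p = 0.5, q = fraction of docs containing the term."""
    n = len(docs)
    p_rel = {t: 0.5 for t in query}
    p_non = {t: sum(t in d for d in docs.values()) / n for t in query}
    return p_rel, p_non

def reestimate(docs, query, assumed_relevant):
    """Treat the top-ranked docs as relevant and recompute p and q with smoothing."""
    n, v = len(docs), len(assumed_relevant)
    p_rel, p_non = {}, {}
    for t in query:
        n_t = sum(t in d for d in docs.values())
        v_t = sum(t in docs[name] for name in assumed_relevant)
        p_rel[t] = (v_t + 0.5) / (v + 1)
        p_non[t] = (n_t - v_t + 0.5) / (n - v + 1)
    return p_rel, p_non

docs = {
    "d1": {"mouse", "cat", "cheese"},
    "d2": {"mouse", "keyboard"},
    "d3": {"cat", "dog"},
    "d4": {"dog", "bone"},
}
query = {"mouse", "cheese"}

p_rel, p_non = initial_estimates(docs, query)
first = rank(docs, query, p_rel, p_non)
p_rel, p_non = reestimate(docs, query, first[:2])   # no user: trust the top 2
second = rank(docs, query, p_rel, p_non)
print(first, second)
```

In the first pass "mouse" carries no weight because it occurs in half of the collection. After the re-estimation step it becomes a strong indicator, so d2 is clearly separated from the documents that match no query term.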

20 (Dis)advantages of Probabilistic model
Advantage:
  o Theoretical adequacy: ranks by probabilities
Disadvantages:
  o Need to guess the initial ranking
  o Binary weights, ignores frequencies
  o Independence assumption (not clear if bad)
  o Does not perform well (?)

21 Alternative Set Theoretic models: Fuzzy set model
Takes into account term relationships (thesaurus)
  o Bible is related to Church
Fuzzy belonging of a term to a document
  o Document containing Bible also contains a little bit of Church, but not entirely
Fuzzy set logic applied to such fuzzy belonging
  o logical expressions with AND, OR, and NOT
Provides ranking, not just yes/no
Not investigated well.
  o Why not investigate it?
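The "little bit of Church" idea can be made concrete with a small sketch. The slide does not say how the term relationships or the membership degree are computed. The code below assumes one common choice: a co-occurrence correlation c(a, b) = n_ab / (n_a + n_b - n_ab) in place of a thesaurus, and a membership degree of 1 minus the product of (1 - c(term, k)) over the terms k of the document. Function names and the sample collection are hypothetical.

```python
def term_correlation(docs, a, b):
    """Co-occurrence correlation between two terms, in [0, 1]."""
    n_a = sum(a in d for d in docs.values())
    n_b = sum(b in d for d in docs.values())
    n_ab = sum(a in d and b in d for d in docs.values())
    denom = n_a + n_b - n_ab
    return n_ab / denom if denom else 0.0

def membership(docs, term, doc_name):
    """Fuzzy degree to which doc_name belongs to the set associated with term."""
    product = 1.0
    for other in docs[doc_name]:
        product *= 1.0 - term_correlation(docs, term, other)
    return 1.0 - product

docs = {
    "d1": {"bible", "church"},
    "d2": {"bible", "church", "priest"},
    "d3": {"bible", "history"},
    "d4": {"football"},
}

for name in docs:
    print(name, round(membership(docs, "church", name), 3))
```

Document d3 never mentions "church", yet it receives a membership of about 0.667 because it contains "bible", which often co-occurs with "church". Document d4 gets 0, and documents that contain the term itself get 1.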

22 Alternative Set Theoretic models: Extended Boolean model
Combination of Boolean and Vector
In comparison with Boolean model, adds distance from query
  o some documents satisfy the query better than others
In comparison with Vector model, adds the distinction between AND and OR combinations
A parameter (the degree of the norm) adjusts the behavior between Boolean-like and Vector-like
  o It can even differ within one query
Not investigated well. Why not investigate it?
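The role of the norm parameter can be sketched with the usual p-norm formulation, which the slide does not spell out and which is therefore an assumption here: an OR query scores ((x_1^p + ... + x_m^p) / m)^(1/p), and an AND query scores 1 - (((1 - x_1)^p + ... + (1 - x_m)^p) / m)^(1/p), where x_i are the document's term weights in [0, 1]. The function names are hypothetical.

```python
def or_similarity(weights, p):
    """p-norm score of a document for the query t1 OR t2 OR ... (weights in [0, 1])."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def and_similarity(weights, p):
    """p-norm score of a document for the query t1 AND t2 AND ..."""
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

# A document that contains the first query term fully and the second not at all.
weights = [1.0, 0.0]

for p in (1, 2, 50):
    print(p, round(or_similarity(weights, p), 3), round(and_similarity(weights, p), 3))
```

With p = 1 the AND and OR scores coincide at 0.5, which is the vector-like end. As p grows, the OR score approaches 1 and the AND score approaches 0, which is the strict Boolean behavior for a document matching only one of two terms.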

23 Alternative Algebraic models: Generalized Vector Space model
Classical independence assumptions:
  o All combinations of terms are possible, none are equivalent (= basis in the vector space)
  o Pair-wise orthogonal: cos({k_i}, {k_j}) = 0
This model relaxes the pair-wise orthogonality: cos({k_i}, {k_j}) ≠ 0
Operates by combinations (co-occurrences) of index terms, not individual terms
More complex, more expensive, not clear if better
Not investigated well. Why not investigate it?

24 Alternative Algebraic models: Latent Semantic Indexing model
Index by larger units, concepts: sets of terms used together
Retrieve a document that shares concepts with a relevant one (even if it does not contain query terms)
Group index terms together (map into lower dimensional space). So some terms are equivalent.
  o Not exactly, but this is the idea
  o Eliminates unimportant details
  o Depends on a parameter (what details are unimportant?)
Not investigated well. Why not investigate it?

25 Alternative Algebraic models: Neural Network model
NNs are good at matching
Iteratively uses the found documents as auxiliary queries
  o Spreading activation.
  o Terms → docs → terms → docs → terms → docs ...
Like a built-in thesaurus
First round gives same result as Vector model
No evidence if it is good
Not investigated well. Why not investigate it?

26 Models for browsing
Flat browsing: String
  o Just as a list of paper
  o No context cues provided
Structure guided: Tree
  o Hierarchy
  o Like directory tree in the computer
Hypertext (Internet!): Directed graph
  o No limitations of sequential writing
  o Modeled by a directed graph: links from unit A to unit B
  o units: docs, chapters, etc.
  o A map (with traversed path) can be helpful

27 Research issues
How do people judge relevance?
  o ranking strategies
How to combine different sources of evidence?
What interfaces can help users to understand and formulate their Information Need?
  o user interfaces: an open issue
Meta-search engines: combine results from different Web search engines
  o Their results almost do not intersect
  o How to combine ranking?

28 Conclusions
Modeling is needed for formal operations
Boolean model is the simplest
Vector model is the best combination of quality and simplicity
  o TF-IDF term weighting
  o This (or similar) weighting is used in all further models
Many interesting and not well-investigated variations
  o possible future work

29 Thank you! Till March 22, 6 pm