Special Topics in Computer Science: The Art of Information Retrieval. Chapter 2: Modeling. Alexander Gelbukh, www.Gelbukh.com

Previous chapter
- User Information Need: vague; semantic, not formal
- Document Relevance: order, not retrieve
- Huge amount of information: efficiency concerns, tradeoffs
- Art more than science

Modeling
- Still science: computation is formal
- No good methods to work with (vague) semantics
- Thus, simplify to get a (formal) model
- Develop (precise) math over this (simple) model
- Why math if the model is not precise (simplified)?
  With math: phenomenon ≈ model = step 1 = step 2 = ... = result
  Without math: phenomenon ≈ model ≈ step 1 ≈ step 2 ≈ ... ≈ ?!

Modeling in IR: idea
- Tag documents with fields, as in a (relational) DB: customer = {name, age, address}
- Unlike a DB, very many fields: individual words! E.g., bag of words {word1, word2, ...}: {3, 5, 0, 0, 2, ...}
- Define a similarity measure between the query and such a record
- Unlike a DB, order the documents, do not just retrieve (yes/no)
- Justify your model (optional, but nice)
- Develop math and algorithms for fast access, as relational algebra does for DBs
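The bag-of-words idea above can be sketched in a few lines; this is a minimal illustration (the helper names `build_vocabulary` and `bag_of_words` are mine, not from the book):

```python
# Minimal bag-of-words sketch: each document becomes a vector of raw
# term counts over a shared, fixed-order vocabulary.
from collections import Counter

def build_vocabulary(docs):
    """Collect all index terms, in a fixed (sorted) order."""
    return sorted({term for doc in docs for term in doc.split()})

def bag_of_words(doc, vocabulary):
    """Map a document to its term-count vector, e.g. {3, 5, 0, 0, 2, ...}."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

docs = ["cat chases cat", "dog chases cat"]
vocab = build_vocabulary(docs)                    # ['cat', 'chases', 'dog']
vectors = [bag_of_words(d, vocab) for d in docs]  # [[2, 1, 0], [1, 1, 1]]
```

Unlike a DB record, the "fields" here are individual words, and the vectors are compared by similarity rather than matched exactly.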

Taxonomy of IR systems

Aspects of an IR system
- IR model: Boolean, Vector, Probabilistic
- Logical view of documents: full text, bag of words, ...
- User task: retrieval, browsing
- The aspects are independent, though some combinations are more compatible than others

Taxonomy of IR models
- Boolean (set theoretic): fuzzy, extended
- Vector (algebraic): generalized vector, latent semantic indexing, neural network
- Probabilistic: inference network, belief network

Taxonomy of other aspects
- Text structure: non-overlapping lists, proximal nodes model
- Browsing: flat, structure guided, hypertext

Appropriate models

Retrieval operation mode
- Ad-hoc: static document collection; interactive; results are ordered
- Filtering (like ad-hoc on new docs): changing document collection; notification, not interactive; yes/no decisions; machine learning techniques can be used

Characterization of an IR model
- D = {dj}: collection of formal representations of docs, e.g., keyword vectors
- Q = {qi}: possible formal representations of user information need (queries)
- F: framework for modeling these two; the reason for the next component
- R(qi, dj): Q × D → R, ranking function; defines the ordering

Specific IR models

IR models
- Classical: Boolean, Vector, Probabilistic (clear ideas, but some disadvantages)
- Refined: each classical model has refinements; they solve many of the problems of the "basic" models and give good examples of possible developments in the area; not investigated well, so we can work on this

Basic notions
- Document: set of index terms; mainly nouns; maybe all words, giving the full-text logical view
- Term weights: some terms are better than others; terms less frequent in this doc and more frequent in other docs are less useful
- Document → index term vector {w1j, w2j, ..., wtj}: weights of the terms in the doc; t is the number of terms in all docs; weights of different terms are assumed independent (a simplification)

Boolean model
- Weights ∈ {0, 1}: a doc is a set of words
- Query: Boolean expression
- R(qi, dj) ∈ {0, 1}
- Good: clear semantics, neat formalism, simple
- Bad: no ranking (this is data retrieval, not information retrieval); retrieves too many or too few documents; difficult to translate a User Information Need into a query; no term weighting
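The Boolean model above can be sketched as a recursive match of a query expression against a document-as-set; this is only an illustration (the tuple query encoding and the name `boolean_match` are mine):

```python
# Minimal Boolean-model sketch: a doc is a set of words, and R(q, d) is
# 0 or 1 -- the model gives no ranking at all.
def boolean_match(doc_terms, query):
    """query: a term, or a tuple ('AND', a, b), ('OR', a, b), ('NOT', a)."""
    if isinstance(query, str):
        return query in doc_terms
    op = query[0]
    if op == 'AND':
        return boolean_match(doc_terms, query[1]) and boolean_match(doc_terms, query[2])
    if op == 'OR':
        return boolean_match(doc_terms, query[1]) or boolean_match(doc_terms, query[2])
    if op == 'NOT':
        return not boolean_match(doc_terms, query[1])
    raise ValueError(op)

doc = {'mouse', 'keyboard'}
q = ('AND', 'mouse', ('NOT', 'cat'))   # "mouse AND NOT cat"
matched = boolean_match(doc, q)        # True -- but not *how* relevant
```

The yes/no result is exactly the "no ranking" disadvantage: two matching docs are indistinguishable.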

Vector model
- Weights (non-binary)
- Ranking, much better results (for the User Info Need)
- R(qi, dj) = correlation between the query vector and the doc vector
- E.g., the cosine measure: sim(dj, q) = (dj · q) / (|dj| × |q|) (there is a typo in the book)
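The cosine measure can be read off directly from its definition; a minimal pure-Python sketch (the function name `cosine` is mine):

```python
# Cosine measure between a doc vector and a query vector:
# sim(d_j, q) = (d_j . q) / (|d_j| * |q|)
from math import sqrt

def cosine(doc_vec, query_vec):
    dot = sum(w_d * w_q for w_d, w_q in zip(doc_vec, query_vec))
    norm_d = sqrt(sum(w * w for w in doc_vec))
    norm_q = sqrt(sum(w * w for w in query_vec))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
assert abs(cosine([1, 2, 0], [2, 4, 0]) - 1.0) < 1e-9
assert cosine([1, 0], [0, 1]) == 0.0
```

Because the measure depends only on the angle between the vectors, documents of different lengths are compared fairly, and the scores induce the ranking the Boolean model lacks.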

Projection

Weights
- How are the weights wij obtained? Many variants. One way: the TF-IDF balance
- TF: term frequency. How well is the term related to the doc? If it appears many times, it is important. Proportional to the number of times it appears.
- IDF: inverse document frequency. How important is the term for distinguishing documents? If it appears in many docs, it is not important. Inversely proportional to the number of docs where it appears.
- The two are contradictory. How to balance them?

TF-IDF ranking
- TF: term frequency: tf(i, j) = freq(i, j) / max_l freq(l, j)
- IDF: inverse document frequency: idf(i) = log(N / ni)
- Balance: w(i, j) = tf(i, j) × idf(i)
- Other formulas exist. Art.
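The balance above can be sketched directly from the slide's formulas; a small illustration (the function name `tf_idf_weights` and the toy corpus are mine):

```python
# TF-IDF sketch following the formulas above:
# tf = freq / max freq in the doc, idf = log(N / n_i), w = tf * idf.
from collections import Counter
from math import log

def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    N = len(docs)
    doc_freq = Counter()              # n_i: number of docs containing term i
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_freq = max(counts.values())
        weights.append({t: (f / max_freq) * log(N / doc_freq[t])
                        for t, f in counts.items()})
    return weights

docs = [["cat", "cat", "dog"], ["dog", "bird"]]
w = tf_idf_weights(docs)
# "dog" occurs in every doc, so idf = log(2/2) = 0 and its weight vanishes;
# "cat" and "bird" each get a positive weight.
```

Note how the IDF factor silently discards terms that appear everywhere: exactly the "balance" the slide asks for.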

Advantages of vector model
- One of the best known strategies
- Improves quality (term weighting)
- Allows approximate matching (partial matching)
- Gives ranking by similarity (cosine formula)
- Simple, fast
But:
- Does not consider term dependencies; considering them in a bad way hurts quality, and no good way is known
- No logical expressions (e.g., negation: "mouse & NOT cat")

Probabilistic model
- Assumptions: there is a set of "relevant" docs; probabilities of docs being relevant; after a Bayes calculation: probabilities of terms being important for defining the relevant docs
- Initial idea: interact with the user. Generate an initial set; ask the user to mark some of its docs as relevant or not; estimate the probabilities of the keywords; repeat
- Can be done without the user: just re-calculate the probabilities assuming the user's acceptance is the same as the predicted ranking
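The first, user-free ranking can be sketched with the classic initial estimates P(ki|R) = 0.5 and P(ki|not R) ≈ ni/N; the 0.5 smoothing constant and all names below are my own illustrative choices, not the book's notation:

```python
# Sketch of the first ranking in the classic probabilistic model, using
# the usual initial guesses P(k_i|R) = 0.5 and P(k_i|not R) = n_i / N.
from math import log

def initial_term_weights(docs, query_terms):
    """docs: list of term sets. Returns {term: weight} for the first pass."""
    N = len(docs)
    weights = {}
    for t in query_terms:
        n_i = sum(1 for d in docs if t in d)
        p_rel = 0.5
        p_nonrel = (n_i + 0.5) / (N + 1)      # smoothed to avoid log(0)
        weights[t] = (log(p_rel / (1 - p_rel))
                      + log((1 - p_nonrel) / p_nonrel))
    return weights

def rank(docs, query_terms):
    w = initial_term_weights(docs, query_terms)
    return [sum(w[t] for t in query_terms if t in d) for d in docs]

docs = [{"cat", "dog"}, {"dog"}, {"bird"}]
scores = rank(docs, {"cat"})
# The doc containing "cat" scores highest; the others score 0.
```

In the full model these probabilities would then be re-estimated from the top-ranked docs (with or without user feedback) and the ranking repeated.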

(Dis)advantages of the Probabilistic model
- Advantage: theoretical adequacy: ranks by probabilities
- Disadvantages: need to guess the initial ranking; binary weights, ignores frequencies; independence assumption (not clear if bad); does not perform well (?)

Alternative Set Theoretic models: Fuzzy set model
- Takes into account term relationships (thesaurus): Bible is related to Church
- Fuzzy belonging of a term to a document: a document containing Bible also contains "a little bit of" Church, but not entirely
- Fuzzy set logic applied to such fuzzy belonging: logical expressions with AND, OR, and NOT
- Provides ranking, not just yes/no
- Not investigated well. Why not investigate it?
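The idea can be sketched with the standard min/max fuzzy connectives (the book's fuzzy IR model uses related algebraic sum/product formulas over thesaurus correlations; the 0.7 membership value below is invented for illustration):

```python
# Fuzzy-set sketch: membership degrees lie in [0, 1], and the common
# fuzzy connectives are AND = min, OR = max, NOT x = 1 - x.
def fuzzy_and(a, b): return min(a, b)
def fuzzy_or(a, b):  return max(a, b)
def fuzzy_not(a):    return 1.0 - a

# A doc that mentions "Bible" belongs to "Church" only partially, via a
# thesaurus-derived correlation (0.7 is an invented value).
membership = {"bible": 1.0, "church": 0.7}

# "church AND NOT cat": a graded, rankable value instead of yes/no.
score = fuzzy_and(membership["church"],
                  fuzzy_not(membership.get("cat", 0.0)))   # 0.7
```

A doc never mentioning Church still matches a Church query to degree 0.7, which is the ranking behavior the plain Boolean model cannot give.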

Alternative Set Theoretic models: Extended Boolean model
- Combination of Boolean and Vector
- Compared with the Boolean model, adds "distance from the query": some documents satisfy the query better than others
- Compared with the Vector model, adds the distinction between AND and OR combinations
- A parameter (the degree p of the norm) adjusts the behavior between Boolean-like and Vector-like; it can even differ within one query
- Not investigated well. Why not investigate it?
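For two terms with weights x, y in [0, 1], the p-norm similarities of the extended Boolean model can be sketched as follows (two-term case only; the function names are mine):

```python
# Extended Boolean (p-norm) sketch for a two-term query:
#   sim(q_or, d)  = ((x^p + y^p) / 2) ** (1/p)
#   sim(q_and, d) = 1 - (((1-x)^p + (1-y)^p) / 2) ** (1/p)
# p = 1 behaves like the vector model; p -> infinity approaches strict Boolean.
def p_norm_or(x, y, p):
    return ((x ** p + y ** p) / 2) ** (1 / p)

def p_norm_and(x, y, p):
    return 1 - (((1 - x) ** p + (1 - y) ** p) / 2) ** (1 / p)

# With p = 1, AND and OR coincide: pure vector behavior.
assert abs(p_norm_or(0.4, 0.8, 1) - p_norm_and(0.4, 0.8, 1)) < 1e-12
# With large p, OR approaches max(x, y) and AND approaches min(x, y):
assert abs(p_norm_or(0.4, 0.8, 100) - 0.8) < 0.01
assert abs(p_norm_and(0.4, 0.8, 100) - 0.4) < 0.01
```

This makes the slide's point concrete: one parameter p slides the same formula between Vector-like averaging and strict Boolean min/max behavior.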

Alternative Algebraic models: Generalized Vector Space model
- Classical independence assumptions: all combinations of terms are possible, none are equivalent (they form a basis of the vector space); pair-wise orthogonal: cos({ki}, {kj}) = 0
- This model relaxes the pair-wise orthogonality: cos({ki}, {kj}) ≠ 0
- Operates on combinations (co-occurrences) of index terms, not individual terms
- More complex, more expensive, not clear if better
- Not investigated well. Why not investigate it?

Alternative Algebraic models: Latent Semantic Indexing model
- Index by larger units, "concepts": sets of terms used together
- Retrieve a document that shares concepts with a relevant one (even if it does not contain the query terms)
- Groups index terms together (maps them into a lower-dimensional space), so some terms become equivalent; not exactly, but this is the idea
- Eliminates unimportant details
- Depends on a parameter (which details are unimportant?)
- Not investigated well. Why not investigate it?

Alternative Algebraic models: Neural Network model
- NNs are good at matching
- Iteratively uses the found documents as auxiliary queries: spreading activation: terms → docs → terms → docs → ...
- Like a built-in thesaurus
- The first round gives the same result as the Vector model
- No evidence whether it is good
- Not investigated well. Why not investigate it?

Alternative Probabilistic models: Bayesian Inference Network model
- (One of the authors of the book worked on this; in fact not so important)
- Probability as belief (not as frequency): belief in the importance of terms; query terms have belief 1.0
- Similar to a Neural Net: documents found increase the importance of their terms and thus act as new queries, but with different propagation formulas
- Flexible in combining sources of evidence
- Can be applied to different ranking strategies (Boolean or TF-IDF)
- Good quality of results (Warning! The authors work on this)

Alternative Probabilistic models: Belief Network model
- (Introduced by one of the authors of the book)
- Better network topology: separation of the document and term spaces
- More general than the Inference Network model
Bayesian network models in general:
- Do not include cycles and thus have linear complexity, unlike Neural Nets
- Combine distinct evidence sources (also user feedback)
- Are a neat formalism; a better alternative to combinations of Boolean and Vector

Models for structured text
- Example queries: Cat in the 3rd chapter; Cat in the same paragraph as Dog; sections containing Cat
- Non-overlapping lists: chapters, sections, paragraphs as regions (ranges of positions), technically treated much like terms
- Proximal nodes model (suggested by the authors): chapters, sections, paragraphs as objects (nodes)

Models for browsing
- Flat browsing: just a list of papers; no context cues provided
- Structure guided: hierarchy, like a directory tree on a computer
- Hypertext (Internet!): no limitations of sequential writing; modeled by a directed graph with links from unit A to unit B (units: docs, chapters, etc.); a map (with the traversed path) can be helpful

The Web
- The Internet is not hypertext: the authors reserve "hypertext" for well-organized hypertext
- The Internet is not a depository but a heap of information

Research issues
- How do people judge relevance? → ranking strategies
- How to combine different sources of evidence?
- What interfaces can help users understand and formulate their Information Need? User interfaces are an open issue
- Meta-search engines combine results from different Web search engines; those results almost do not intersect; how to combine their rankings?

Conclusions
- Modeling is needed for formal operations
- The Boolean model is the simplest
- The Vector model is the best combination of quality and simplicity: TF-IDF term weighting; this (or similar) weighting is used in all further models
- Many interesting and not well-investigated variations: possible future work

Thank you! Till October 2