Automatic Indexing (Term Selection) Automatic Text Processing by G. Salton, Chap 9, Addison-Wesley, 1989.

Similar presentations
Text Categorization.
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Chapter 5: Introduction to Information Retrieval
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
Introduction to Information Retrieval (Part 2) By Evren Ermis.
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Learning for Text Categorization
IR Models: Overview, Boolean, and Vector
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
ISP 433/533 Week 2 IR Models.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Modern Information Retrieval Chapter 2 Modeling. Probabilistic model the appearance or absent of an index term in a document is interpreted either as.
Probabilistic IR Models Based on probability theory Basic idea : Given a document d and a query q, Estimate the likelihood of d being relevant for the.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Ch 4: Information Retrieval and Text Mining
Probabilistic Information Retrieval Part II: In Depth Alexander Dekhtyar Department of Computer Science University of Maryland.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Modeling Modern Information Retrieval
1 Query Language Baeza-Yates and Navarro Modern Information Retrieval, 1999 Chapter 4.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
The Vector Space Model …and applications in Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Modern Information Retrieval Chapter 7: Text Operations Ricardo Baeza-Yates Berthier Ribeiro-Neto.
1 Automatic Indexing Automatic Text Processing by G. Salton, Addison-Wesley, 1989.
Query Operations: Automatic Global Analysis. Motivation Methods of local analysis extract information from local set of documents retrieved to expand.
CS246 Basic Information Retrieval. Today’s Topic  Basic Information Retrieval (IR)  Bag of words assumption  Boolean Model  Inverted index  Vector-space.
1 Automatic Indexing The vector model Methods for calculating term weights in the vector model : –Simple term weights –Inverse document frequency –Signal.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
Salton2-1 Automatic Indexing Hsin-Hsi Chen. Salton2-2 Indexing indexing: assign identifiers to text items. assign: manual vs. automatic indexing identifiers:
1 Vector Space Model Rong Jin. 2 Basic Issues in A Retrieval Model How to represent text objects What similarity function should be used? How to refine.
Text Classification, Active/Interactive learning.
5 June 2006Polettini Nicola1 Term Weighting in Information Retrieval Polettini Nicola Monday, June 5, 2006 Web Information Retrieval.
Query Operations J. H. Wang Mar. 26, The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Weighting and Matching against Indices. Zipf’s Law In any corpus, such as the AIT, we can count how often each word occurs in the corpus as a whole =
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
1 Computing Relevance, Similarity: The Vector Space Model.
CSE3201/CSE4500 Term Weighting.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
CPSC 404 Laks V.S. Lakshmanan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides at UC-Berkeley.
University of Malta CSA3080: Lecture 6 © Chris Staff 1 of 20 CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1.
Latent Semantic Indexing and Probabilistic (Bayesian) Information Retrieval.
Web- and Multimedia-based Information Systems Lecture 2.
Vector Space Models.
C.Watterscsci64031 Probabilistic Retrieval Model.
Text Operations J. H. Wang Feb. 21, The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
Term Weighting approaches in automatic text retrieval. Presented by Ehsan.
Natural Language Processing Topics in Information Retrieval August, 2002.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture Probabilistic Information Retrieval.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea.
Introduction to Information Retrieval Probabilistic Information Retrieval Chapter 11 1.
IR 6 Scoring, term weighting and the vector space model.
Automated Information Retrieval
Plan for Today’s Lecture(s)
Information Retrieval and Web Search
Basic Information Retrieval
Representation of documents and queries
CS 430: Information Discovery
Boolean and Vector Space Retrieval Models
CS 430: Information Discovery
Presentation transcript:

Automatic Indexing (Term Selection) Automatic Text Processing by G. Salton, Chap 9, Addison-Wesley, 1989.

Automatic Indexing
Indexing: assign identifiers (index terms) to text documents.
Identifiers:
- single-term vs. term phrase
- controlled vs. uncontrolled vocabularies (e.g., instruction manuals, terminological schedules, …)
- objective vs. nonobjective text identifiers (objective identifiers, such as author names, publisher names, and dates of publication, are defined by cataloging rules)

Two Issues
Issue 1: indexing exhaustivity
- exhaustive: assign a large number of terms
- nonexhaustive: assign only a small number of terms
Issue 2: term specificity
- broad terms (generic) cannot distinguish relevant from nonrelevant documents
- narrow terms (specific) retrieve relatively fewer documents, but most of them are relevant

Term-Frequency Consideration
Function words:
- for example, "and", "or", "of", "but", …
- the frequencies of these words are high in all texts
Content words:
- words that actually relate to document content
- varying frequencies in the different texts of a collection
- indicate term importance for content

A Frequency-Based Indexing Method
1. Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words.
2. Compute the term frequency tf_ij for all remaining terms T_j in each document D_i, specifying the number of occurrences of T_j in D_i.
3. Choose a threshold frequency T, and assign to each document D_i all terms T_j for which tf_ij > T.
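A minimal sketch of these three steps in Python; the stop list and the threshold value are illustrative choices, not taken from the text.

```python
# Frequency-based indexing: stop-list filtering plus a tf threshold.
from collections import Counter

STOP_LIST = {"and", "or", "of", "but", "the", "a", "an", "in", "on", "to", "is"}

def index_terms(document: str, T: int = 1) -> dict:
    """Assign to the document all terms T_j with tf_ij > T."""
    # Step 1: eliminate common function words via the stop list.
    tokens = [w for w in document.lower().split() if w not in STOP_LIST]
    # Step 2: compute the term frequency tf_ij for the remaining terms.
    tf = Counter(tokens)
    # Step 3: keep only terms whose frequency exceeds the threshold T.
    return {term: freq for term, freq in tf.items() if freq > T}

print(index_terms("the cat sat on the mat and the cat slept"))
# -> {'cat': 2}  ('sat', 'mat', 'slept' occur once and fall below the threshold)
```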

How to compute w_ij?
Weights are based on global properties of terms in a document collection:
- inverse document frequency idf_j: w_ij = tf_ij × idf_j (TFxIDF)
- term discrimination value dv_j: w_ij = tf_ij × dv_j
- probabilistic term weighting tr_j: w_ij = tf_ij × tr_j

Inverse Document Frequency
Inverse document frequency (IDF) for term T_j:
  idf_j = log(N / df_j)
where df_j (the document frequency of term T_j) is the number of documents in which T_j occurs, and N is the number of documents in the collection. Terms with high tf_ij × idf_j fulfil both the recall and the precision goals: they occur frequently in individual documents but rarely in the remainder of the collection.

TFxIDF
The weight w_ij of a term T_j in a document D_i:
  w_ij = tf_ij × idf_j = tf_ij × log(N / df_j)
Indexing then consists of:
1. Eliminating common function words
2. Computing the value of w_ij for each term T_j in each document D_i
3. Assigning to the documents of a collection all terms with sufficiently high (tf × idf) factors
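A minimal sketch of TFxIDF weighting; the toy corpus and the whitespace tokenization are illustrative assumptions.

```python
# TFxIDF over a tiny collection: w_ij = tf_ij * log(N / df_j).
import math
from collections import Counter

docs = [
    "automatic indexing of text documents",
    "probabilistic retrieval of documents",
    "automatic text processing",
]

N = len(docs)
tokenized = [d.split() for d in docs]
df = Counter(term for doc in tokenized for term in set(doc))   # df_j

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)                                   # tf_ij
    return {t: freq * math.log(N / df[t]) for t, freq in tf.items()}

for i, doc in enumerate(tokenized):
    print(f"D_{i}:", {t: round(w, 2) for t, w in tfidf(doc).items()})
# Collection-wide words such as "of" score lower than the rarer content words.
```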

Term-Discrimination Value
Useful index terms distinguish the documents of a collection from each other.
Document space: when two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together. When a high-frequency term without discriminating power is assigned, it increases the document space density.

[Figure: a virtual document space shown in its original state, after assignment of a good discriminator, and after assignment of a poor discriminator.]

Good Term Assignment When a term is assigned to only a few documents of a collection, those documents are distinguished from the rest of the collection. This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.

Poor Term Assignment A high-frequency term is assigned that does not discriminate between the objects of a collection. Its assignment renders the documents more similar, which is reflected in an increase in document space density.

Term Discrimination Value
Definition: dv_j = Q - Q_j, where Q and Q_j are the space densities before and after the assignment of term T_j.
If dv_j > 0, T_j is a good term; if dv_j < 0, T_j is a poor term.

[Figure: variation of the term-discrimination value with document frequency (up to N). Low-frequency terms have dv_j ≈ 0, medium-frequency terms have dv_j > 0, and high-frequency terms have dv_j < 0.]

tf_ij × dv_j
w_ij = tf_ij × dv_j. Compared with idf_j:
- idf_j decreases steadily with increasing document frequency
- dv_j increases from zero to positive values as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger

Document Centroid
Issue: computing the space density from all N(N-1) pairwise similarities is inefficient.
Document centroid: C = (c_1, c_2, c_3, ..., c_t), where each component
  c_j = (1/N) × Σ_i w_ij
and w_ij is the weight of the j-th term in document i.
Space density: Q = (1/N) × Σ_i sim(C, D_i), the average similarity between the documents and their centroid.
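A minimal sketch of the centroid-based density and of dv_j = Q - Q_j; the toy weight matrix and the choice of cosine similarity are illustrative assumptions.

```python
# Centroid-based space density Q and discrimination value dv_j.
import numpy as np

W = np.array([[2.0, 0.0, 1.0],   # w_ij: rows are documents D_i, columns terms T_j
              [1.0, 1.0, 1.0],
              [0.0, 2.0, 1.0]])

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def density(M):
    """Q: average similarity of the documents to their centroid C."""
    C = M.mean(axis=0)                           # c_j = (1/N) * sum_i w_ij
    return sum(cosine(C, d) for d in M) / len(M)

Q_after = density(W)                             # density with every term assigned
for j in range(W.shape[1]):
    Q_before = density(np.delete(W, j, axis=1))  # density without term T_j
    print(f"T_{j}: dv_j = {Q_before - Q_after:+.3f}")
# T_2 occurs in every document, so assigning it raises density: dv_2 < 0 (poor term).
```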

Probabilistic Term Weighting
Goal: make explicit distinctions between occurrences of terms in the relevant and nonrelevant documents of a collection.
Definition: given a user query q and the ideal answer set of the relevant documents, decision theory gives the best ranking criterion for a document D as the ratio
  Pr(D|rel) × Pr(rel) / ( Pr(D|nonrel) × Pr(nonrel) )

Probabilistic Term Weighting
- Pr(rel), Pr(nonrel): a document's a priori probabilities of relevance and nonrelevance
- Pr(D|rel), Pr(D|nonrel): occurrence probabilities of document D in the relevant and nonrelevant document sets

Assumptions Terms occur independently in documents

Derivation Process

Given a document D = (d_1, d_2, …, d_t), assume each d_i is either 0 (term absent) or 1 (term present).
  Pr(d_i = 1 | rel) = p_i    Pr(d_i = 0 | rel) = 1 - p_i
  Pr(d_i = 1 | nonrel) = q_i    Pr(d_i = 0 | nonrel) = 1 - q_i
For a specific document D, term independence gives
  Pr(D|rel) = Π_i p_i^(d_i) (1 - p_i)^(1 - d_i)
  Pr(D|nonrel) = Π_i q_i^(d_i) (1 - q_i)^(1 - d_i)
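A sketch of the log-odds ranking function that follows from these assumptions (the standard binary-independence derivation; C collects the document-independent terms):

```latex
g(D) = \log\frac{\Pr(D \mid \mathrm{rel})\,\Pr(\mathrm{rel})}{\Pr(D \mid \mathrm{nonrel})\,\Pr(\mathrm{nonrel})}
     = \sum_{i=1}^{t} d_i \log\frac{p_i(1-q_i)}{q_i(1-p_i)} + C,
\qquad
C = \sum_{i=1}^{t} \log\frac{1-p_i}{1-q_i} + \log\frac{\Pr(\mathrm{rel})}{\Pr(\mathrm{nonrel})}
```

Since C is the same for every document, the ranking depends only on the sum, and the coefficient of d_i is the term relevance weight defined on the next slide.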

Term Relevance Weight
  tr_j = log [ p_j (1 - q_j) / ( q_j (1 - p_j) ) ]

Issue
How to compute p_j and q_j?
  p_j = r_j / R
  q_j = (df_j - r_j) / (N - R)
where r_j is the number of relevant documents containing term T_j, R is the total number of relevant documents, and N is the total number of documents.

Estimation of Term-Relevance
The occurrence probability of a term in the nonrelevant documents, q_j, is approximated by the occurrence probability of the term in the entire document collection: q_j = df_j / N. The occurrence probabilities of the terms in the small number of relevant documents are assumed equal, using a constant value p_j = 0.5 for all j.
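A minimal sketch of this estimate in Python; N and the df_j values are illustrative. With p_j = 0.5 the weight reduces to log((N - df_j) / df_j).

```python
# Term-relevance weight with the estimates p_j = 0.5 and q_j = df_j / N.
import math

def term_relevance(df_j, N, p_j=0.5):
    """tr_j = log( p_j (1 - q_j) / (q_j (1 - p_j)) ), with q_j = df_j / N."""
    q_j = df_j / N
    return math.log(p_j * (1 - q_j) / (q_j * (1 - p_j)))

N = 10_000
for df_j in (10, 100, 1_000, 5_000):
    tr = term_relevance(df_j, N)
    idf = math.log(N / df_j)                 # idf_j, for comparison
    print(f"df_j={df_j:>5}: tr_j={tr:6.3f}  idf_j={idf:6.3f}")
# For rare terms (df_j << N), tr_j is close to idf_j, as the next slide shows.
```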

Comparison
With p_j = 0.5, tr_j = log((N - df_j) / df_j). When N is sufficiently large, N - df_j ≈ N, so
  tr_j ≈ log(N / df_j) = idf_j