Automatic Indexing (Term Selection) Automatic Text Processing by G. Salton, Chap 9, Addison-Wesley, 1989.
Automatic Indexing
Indexing: assign identifiers (index terms) to text documents.
Identifiers:
  single-term vs. term phrase
  controlled vs. uncontrolled vocabularies
    (instruction manuals, terminological schedules, …)
  objective vs. nonobjective text identifiers
    (cataloging rules define, e.g., author names, publisher names, dates of publication, …)
Two Issues
Issue 1: indexing exhaustivity
  exhaustive: assign a large number of terms
  nonexhaustive
Issue 2: term specificity
  broad terms (generic): cannot distinguish relevant from nonrelevant documents
  narrow terms (specific): retrieve relatively fewer documents, but most of them are relevant
Term-Frequency Consideration
Function words
  for example, "and", "or", "of", "but", …
  the frequencies of these words are high in all texts
Content words
  words that actually relate to document content
  varying frequencies in the different texts of a collection
  indicate term importance for content
A Frequency-Based Indexing Method
1. Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words.
2. Compute the term frequency $tf_{ij}$ for all remaining terms $T_j$ in each document $D_i$, specifying the number of occurrences of $T_j$ in $D_i$.
3. Choose a threshold frequency $T$, and assign to each document $D_i$ all terms $T_j$ for which $tf_{ij} > T$.
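The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the stop list `STOP_LIST`, the whitespace tokenizer, and the function names are all assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical, tiny stop list; a real system would consult a much larger dictionary.
STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "to", "is"}


def term_frequencies(text):
    """Step 1 and 2: drop stop-list words, then count occurrences (tf) of the rest."""
    tokens = [w for w in text.lower().split() if w not in STOP_LIST]
    return Counter(tokens)


def index_terms(text, threshold):
    """Step 3: assign every term whose frequency is strictly greater than the threshold T."""
    tf = term_frequencies(text)
    return {term for term, count in tf.items() if count > threshold}


doc = "the retrieval of text and the indexing of text in a retrieval system"
print(sorted(index_terms(doc, threshold=1)))  # ['retrieval', 'text']
```

With a threshold of 1, only the terms that occur at least twice in the sample sentence are kept as index terms.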
How to compute $w_{ij}$?
Inverse document frequency, $idf_j$: $w_{ij} = tf_{ij} \times idf_j$ (TFxIDF)
Term discrimination value, $dv_j$: $w_{ij} = tf_{ij} \times dv_j$
Probabilistic term weighting, $tr_j$: $w_{ij} = tf_{ij} \times tr_j$
$idf_j$, $dv_j$, and $tr_j$ are global properties of terms in a document collection.
Inverse Document Frequency
Inverse Document Frequency (IDF) for term $T_j$:
  $idf_j = \log \frac{N}{df_j}$
where $df_j$ (document frequency of term $T_j$) is the number of documents in which $T_j$ occurs and $N$ is the total number of documents.
Terms that fulfil both the recall and the precision requirements occur frequently in individual documents but rarely in the remainder of the collection.
TFxIDF
Weight $w_{ij}$ of a term $T_j$ in a document $D_i$:
  $w_{ij} = tf_{ij} \times \log \frac{N}{df_j}$
1. Eliminate common function words.
2. Compute the value of $w_{ij}$ for each term $T_j$ in each document $D_i$.
3. Assign to the documents of a collection all terms with sufficiently high ($tf \times idf$) factors.
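A small sketch of the weight computation follows. It assumes the common form $idf_j = \log(N / df_j)$ with a natural logarithm, and that stop words have already been removed from the token lists; the function name `tf_idf_weights` is made up for the example.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """docs: list of token lists (stop words assumed already removed).

    Returns one dict per document mapping term -> tf_ij * log(N / df_j).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts once per term
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["index", "term", "index"], ["term", "weight"], ["term", "query"]]
w = tf_idf_weights(docs)
print(w[0])
```

In this toy collection "term" occurs in every document, so its weight is zero everywhere, while "index" is frequent in one document and absent from the rest, so it receives the highest weight.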
Term-Discrimination Value
Useful index terms distinguish the documents of a collection from each other.
Document space
  When two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together.
  When a high-frequency term without discriminating power is assigned, it increases the document space density.
[Figure: A Virtual Document Space. Panels: original state; after assignment of a good discriminator; after assignment of a poor discriminator.]
Good Term Assignment
When a term is assigned to the documents of a collection, the few objects to which the term is assigned will be distinguished from the rest of the collection. This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.
Poor Term Assignment
A high-frequency term is assigned that does not discriminate between the objects of a collection. Its assignment will render the documents more similar. This is reflected in an increase in document space density.
Term Discrimination Value
Definition: $dv_j = Q - Q_j$, where $Q$ and $Q_j$ are the space densities before and after the assignment of term $T_j$.
$dv_j > 0$: $T_j$ is a good term; $dv_j < 0$: $T_j$ is a poor term.
[Figure: Variations of Term-Discrimination Value with Document Frequency. Axis: document frequency, up to N. Low frequency: $dv_j = 0$; medium frequency: $dv_j > 0$; high frequency: $dv_j < 0$.]
TF x DV
$w_{ij} = tf_{ij} \times dv_j$, compared with $w_{ij} = tf_{ij} \times idf_j$:
  $idf_j$: decreases steadily with increasing document frequency.
  $dv_j$: increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger.
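Document Centroid
Issue: efficiency problem; computing the space density directly requires N(N-1) pairwise similarities.
Document centroid: $C = (c_1, c_2, c_3, \ldots, c_t)$, where each $c_j$ is computed from the weights $w_{ij}$ of the j-th term in document i.
Space density: measured from the similarity of each document to the centroid instead of from all pairwise similarities.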
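To make the centroid shortcut concrete, here is a hedged sketch of how $dv_j = Q - Q_j$ could be computed. It assumes three things that are not stated in the text: the centroid is the average of the document vectors, similarity is the cosine measure, and the space density is the average similarity between each document and the centroid. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(docs):
    """Assumed definition: c_j is the average of w_ij over all documents."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]


def space_density(docs):
    """Assumed definition: average similarity of each document to the centroid."""
    c = centroid(docs)
    return sum(cosine(c, d) for d in docs) / len(docs)


def discrimination_value(docs, j):
    """dv_j = Q - Q_j: density before (term j removed) minus density after (term j assigned)."""
    without = [[0.0 if k == j else w for k, w in enumerate(d)] for d in docs]
    return space_density(without) - space_density(docs)


# Term 0 occurs in every document; terms 1..3 each occur in exactly one.
docs = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]
print(discrimination_value(docs, 0))  # negative: poor discriminator
print(discrimination_value(docs, 1))  # positive: good discriminator
```

Only N similarities to the centroid are needed per density computation, which is the efficiency gain over the N(N-1) pairwise comparisons.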
Probabilistic Term Weighting
Goal: make explicit distinctions between occurrences of terms in relevant and nonrelevant documents of a collection.
Definition: given a user query q and the ideal answer set of the relevant documents, decision theory gives the best ranking algorithm for a document D:
  $g(D) = \log \frac{Pr(D|rel)}{Pr(D|nonrel)} + \log \frac{Pr(rel)}{Pr(nonrel)}$
Probabilistic Term Weighting
$Pr(rel)$, $Pr(nonrel)$: the document's a priori probabilities of relevance and nonrelevance.
$Pr(D|rel)$, $Pr(D|nonrel)$: occurrence probabilities of document D in the relevant and nonrelevant document sets.
Assumptions
Terms occur independently in documents.
Derivation Process
Given a document $D = (d_1, d_2, \ldots, d_t)$, assume each $d_i$ is either 0 (absent) or 1 (present), and let $x_i$ denote the value of $d_i$.
  $Pr(x_i = 1 | rel) = p_i$
  $Pr(x_i = 0 | rel) = 1 - p_i$
  $Pr(x_i = 1 | nonrel) = q_i$
  $Pr(x_i = 0 | nonrel) = 1 - q_i$
For a specific document D
Term Relevance Weight
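Under the independence assumption and the binary term probabilities $p_i$ and $q_i$, the derivation can be sketched as follows. This is the standard binary-independence argument, assuming the log-odds ranking function $g(D)$ in its usual form; it is not necessarily the exact layout of the original derivation.

```latex
\begin{align*}
g(D) &= \log \frac{Pr(D \mid rel)}{Pr(D \mid nonrel)} + \log \frac{Pr(rel)}{Pr(nonrel)} \\
Pr(D \mid rel) &= \prod_{i=1}^{t} p_i^{x_i} (1 - p_i)^{1 - x_i} \\
Pr(D \mid nonrel) &= \prod_{i=1}^{t} q_i^{x_i} (1 - q_i)^{1 - x_i} \\
g(D) &= \sum_{i=1}^{t} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
      + \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i}
      + \log \frac{Pr(rel)}{Pr(nonrel)} \\
tr_j &= \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}
\end{align*}
```

Only the first sum depends on which terms are present in D; the other two terms are constant for a given query, so each present term $T_j$ contributes its term-relevance weight $tr_j$ to the ranking.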
Issue
How to compute $p_j$ and $q_j$?
  $p_j = r_j / R$
  $q_j = (df_j - r_j) / (N - R)$
$r_j$: the number of relevant documents in which $T_j$ occurs
$R$: the total number of relevant documents
$N$: the total number of documents
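As a worked illustration, the estimates can be plugged into the term-relevance weight, assumed here in its usual form $tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}$. The function name and the sample counts are invented for the example, and the sketch does no smoothing, so it fails when $p_j$ or $q_j$ is exactly 0 or 1.

```python
import math


def term_relevance(r_j, R, df_j, N):
    """tr_j from relevance counts (assumed form; no smoothing for 0 or 1 probabilities)."""
    p_j = r_j / R                  # share of relevant documents containing the term
    q_j = (df_j - r_j) / (N - R)   # share of nonrelevant documents containing the term
    return math.log((p_j * (1 - q_j)) / (q_j * (1 - p_j)))


# Made-up counts: 10 relevant documents out of 1010; the term occurs in 108
# documents, 8 of which are relevant, so p_j = 0.8 and q_j = 0.1.
print(term_relevance(r_j=8, R=10, df_j=108, N=1010))
```

A term that is much more likely to occur in relevant than in nonrelevant documents gets a large positive weight; a term with $p_j = q_j$ gets weight zero.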
Estimation of Term-Relevance
The occurrence probability of a term in the nonrelevant documents, $q_j$, is approximated by the occurrence probability of the term in the entire document collection: $q_j = df_j / N$.
The occurrence probabilities of the terms in the small number of relevant documents are assumed to be equal, using a constant value $p_j = 0.5$ for all j.
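Comparison
Substituting $p_j = 0.5$ and $q_j = df_j / N$ into the term-relevance weight $tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}$ gives
  $tr_j = \log \frac{N - df_j}{df_j}$
When N is sufficiently large, $N - df_j \approx N$, so
  $tr_j \approx \log \frac{N}{df_j} = idf_j$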
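A quick numerical check of this approximation, assuming natural logarithms, $p_j = 0.5$, $q_j = df_j / N$, and the usual form $tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}$; the collection size and document frequencies are arbitrary illustrative values.

```python
import math


def tr_estimate(df_j, N):
    """Term-relevance weight with the constant p_j = 0.5 and q_j = df_j / N."""
    p_j, q_j = 0.5, df_j / N
    return math.log((p_j * (1 - q_j)) / (q_j * (1 - p_j)))


def idf(df_j, N):
    """Inverse document frequency, log(N / df_j)."""
    return math.log(N / df_j)


N = 1_000_000  # arbitrary collection size for illustration
for df_j in (10, 1_000, 100_000):
    print(df_j, round(tr_estimate(df_j, N), 4), round(idf(df_j, N), 4))
```

The two weights nearly coincide for rare terms and drift apart as $df_j$ becomes a noticeable fraction of N, which is where dropping $df_j$ from $N - df_j$ stops being harmless.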