1 Automatic Indexing Automatic Text Processing by G. Salton, Addison-Wesley, 1989.

2 Indexing l indexing: assign identifiers to text documents. l assign: manual vs. automatic indexing l identifiers: »objective vs. nonobjective text identifiers (objective identifiers such as author names, publisher names, and publication dates are defined by cataloging rules) »controlled vs. uncontrolled vocabularies (instruction manuals, terminological schedules, …) »single terms vs. term phrases

3 Two Issues l Issue 1: indexing exhaustivity »exhaustive: assign a large number of terms »nonexhaustive l Issue 2: term specificity »broad terms (generic) cannot distinguish relevant from nonrelevant documents »narrow terms (specific) retrieve relatively fewer documents, but most of them are relevant

4 Parameters of retrieval effectiveness l Recall: the proportion of relevant items that are retrieved l Precision: the proportion of retrieved items that are relevant l Goal: high recall and high precision

5 Retrieved vs. relevant items (contingency table):
                 Relevant   Nonrelevant
  Retrieved          a           b
  Not retrieved      c           d
Recall = a / (a + c); Precision = a / (a + b)

6 A Joint Measure l F-score: F_β = (1 + β²) · P · R / (β² · P + R) »β is a parameter that encodes the relative importance of recall and precision »β = 1: equal weight »β < 1: precision is more important »β > 1: recall is more important
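A minimal Python sketch (not from the original slides) of how recall, precision, and the F-score follow from the a, b, c, d counts above; the example numbers are illustrative:

    def recall(a, c):
        # a: relevant items retrieved, c: relevant items not retrieved
        return a / (a + c)

    def precision(a, b):
        # a: relevant items retrieved, b: nonrelevant items retrieved
        return a / (a + b)

    def f_score(p, r, beta=1.0):
        # beta < 1 favours precision, beta > 1 favours recall
        return (1 + beta**2) * p * r / (beta**2 * p + r)

    # illustrative counts: 40 relevant retrieved, 10 nonrelevant retrieved, 60 relevant missed
    p, r = precision(40, 10), recall(40, 60)
    print(p, r, f_score(p, r))   # 0.8 0.4 0.533...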

7 Choices of Recall and Precision l Both recall and precision vary from 0 to 1. l In principle, the average user wants to achieve both high recall and high precision. l In practice, a compromise must be reached because simultaneously optimizing recall and precision is not normally achievable.

8 Choices of Recall and Precision ( Continued ) l Particular choices of indexing and search policies have produced variations in performance ranging from 0.8 precision and 0.2 recall to 0.1 precision and 0.8 recall. l In many circumstances, recall and precision values between 0.5 and 0.6 are more satisfactory for the average user.

9 Term-Frequency Consideration l Function words »for example, "and", "or", "of", "but", … »the frequencies of these words are high in all texts l Content words »words that actually relate to document content »varying frequencies in the different texts of a collection »their frequencies indicate term importance for content

10 A Frequency-Based Indexing Method l Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words. l Compute the term frequency tf ij for all remaining terms T j in each document D i, specifying the number of occurrences of T j in D i. l Choose a threshold frequency T, and assign to each document D i all terms T j for which tf ij > T.
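A minimal Python sketch of this frequency-based indexing method; the stop list and the threshold value are illustrative assumptions, not values from the slides:

    from collections import Counter

    STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "for", "are"}  # assumed stop list

    def index_document(text, T=1):
        # 1. eliminate common function words via the stop list
        terms = [w for w in text.lower().split() if w not in STOP_LIST]
        # 2. compute the term frequency tf_ij of each remaining term
        tf = Counter(terms)
        # 3. assign to the document all terms whose frequency exceeds the threshold T
        return {term: freq for term, freq in tf.items() if freq > T}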

11 Inverse Document Frequency l Inverse Document Frequency (IDF) for term T j : idf j = log ( N / df j ), where df j (the document frequency of term T j ) is the number of documents in which T j occurs and N is the total number of documents in the collection. »terms that occur frequently in individual documents but rarely in the remainder of the collection fulfil both the recall and the precision requirements

12 New Term Importance Indicator l weight w ij of a term T j in a document D i : w ij = tf ij x idf j l Eliminating common function words l Computing the value of w ij for each term T j in each document D i l Assigning to the documents of a collection all terms with sufficiently high (tf x idf) factors
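A short sketch of the (tf x idf) weighting, assuming each document has already been reduced to a term-frequency dictionary as in the previous sketch:

    import math

    def idf(df_j, N):
        # inverse document frequency of term T_j; df_j = number of documents containing T_j
        return math.log(N / df_j)

    def tfidf_weights(doc_tf, df, N):
        # w_ij = tf_ij * idf_j for every term T_j in document D_i
        return {t: tf_ij * idf(df[t], N) for t, tf_ij in doc_tf.items()}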

13 Term-Discrimination Value l Useful index terms »distinguish the documents of a collection from each other l Document Space »when two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together »when a high-frequency term without discriminating power is assigned, it will increase the document space density

14 A Virtual Document Space (figure): original state; after assignment of a good discriminator; after assignment of a poor discriminator

15 Good Term Assignment l When a term is assigned to the documents of a collection, the few objects to which the term is assigned will be distinguished from the rest of the collection. l This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.

16 Poor Term Assignment l A high-frequency term is assigned that does not discriminate between the objects of a collection. l Its assignment will render the documents more similar to one another. l This is reflected in an increase in document space density.

17 Term Discrimination Value l definition dv j = Q - Q j, where Q and Q j are the space densities before and after the assignment of term T j. l dv j > 0: T j is a good term; dv j < 0: T j is a poor term.

18 Variations of Term-Discrimination Value with Document Frequency (figure): low document frequency, dv j = 0; medium frequency, dv j > 0; high frequency, dv j < 0

19 Another Term Weighting l w ij = tf ij x dv j l compared with idf j : »idf j decreases steadily with increasing document frequency »dv j increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger

20 Document Centroid l Issue: efficiency problem, N(N-1) pairwise similarities l document centroid C = (c 1, c 2, c 3,..., c t ), with c j = (1/N) · Σ_i w ij, where w ij is the weight of the j-th term in document i l space density Q = (1/N) · Σ_i sim(C, D i ), the average similarity between the documents and their centroid
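A sketch of the centroid-based discrimination value dv_j = Q - Q_j; using cosine similarity to the centroid as the density measure is an assumption consistent with, but not spelled out on, the slide:

    import math

    def cosine(u, v):
        dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in set(u) | set(v))
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def centroid(docs):
        # c_j = average weight of term T_j over all documents
        c = {}
        for d in docs:
            for t, w in d.items():
                c[t] = c.get(t, 0.0) + w / len(docs)
        return c

    def density(docs):
        # Q = average similarity between the documents and their centroid
        c = centroid(docs)
        return sum(cosine(c, d) for d in docs) / len(docs)

    def discrimination_value(docs, term):
        # dv_j = Q - Q_j: density without the term minus density with the term assigned
        docs_without = [{t: w for t, w in d.items() if t != term} for d in docs]
        return density(docs_without) - density(docs)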

21 Discussions l dv j and idf j are global properties of terms in a document collection l the ideal terms are those whose occurrence characteristics differ between the relevant and nonrelevant documents of a collection

22 Probabilistic Term Weighting l Goal: make explicit distinctions between occurrences of terms in the relevant and nonrelevant documents of a collection l Definition: given a user query q, consider the ideal answer set of relevant documents l From decision theory, the best ranking algorithm ranks each document x by g(x) = log [ Pr(x|rel) / Pr(x|nonrel) ] + log [ Pr(rel) / Pr(nonrel) ]

23 Probabilistic Term Weighting l Pr(x|rel), Pr(x|nonrel): occurrence probabilities of document x in the relevant and nonrelevant document sets l Pr(rel), Pr(nonrel): a document’s a priori probabilities of relevance and nonrelevance l Further assumption: terms occur independently in relevant documents, and terms occur independently in nonrelevant documents.

25 Derivation Process

Given a document D = (d 1, d 2, …, d t ), the retrieval value of D is g(D) = Σ_i log [ Pr(x i = d i | rel) / Pr(x i = d i | nonrel) ] + log [ Pr(rel) / Pr(nonrel) ], where d i is the weight of term x i in D.

Assume d i is either 0 or 1. 0: i-th term is absent from D. 1: i-th term is present in D. p i =Pr(x i =1|rel) 1-p i =Pr(x i =0|rel) q i =Pr(x i =1|nonrel) 1-q i =Pr(x i =0|nonrel)

The retrieval value contributed by each term T j present in a document (i.e., d j = 1) is the term relevance weight tr j = log [ p j (1 - q j ) / ( q j (1 - p j ) ) ].

30 New Term Weighting l term-relevance weight of term T j : tr j l indexing value of term T j in document D i : w ij = tf ij * tr j l Issue: it is necessary to characterize both the relevant and the nonrelevant documents of a collection. »How can a representative document sample be found? »Can feedback information from retrieved documents be used?

31 Estimation of Term-Relevance l Little is known about the relevance properties of terms. »The occurrence probability of a term in the nonrelevant documents, q j, is approximated by the occurrence probability of the term in the entire document collection: q j = df j / N »The occurrence probability of a term in the (small number of) relevant documents is approximated by a constant value: p j = 0.5 for all j

32 With p j = 0.5 and q j = df j / N, tr j = log [ 0.5 (1 - df j / N) / ( (df j / N) · 0.5 ) ] = log ( (N - df j ) / df j ). When N is sufficiently large, N - df j ≈ N, so tr j ≈ log ( N / df j ) = idf j.

33 Estimation of Term-Relevance l Estimate the number of relevant documents r j in the collection that contain term T j as a function of the known document frequency df j of the term T j : p j = r j / R, q j = (df j - r j ) / (N - R), where R is an estimate of the total number of relevant documents in the collection.
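A sketch of the term-relevance weight under the two estimation strategies above; there is no smoothing here, so zero counts would need special handling in practice:

    import math

    def term_relevance(p_j, q_j):
        # tr_j = log[ p_j (1 - q_j) / ( q_j (1 - p_j) ) ]
        return math.log(p_j * (1 - q_j) / (q_j * (1 - p_j)))

    def tr_without_relevance_info(df_j, N):
        # p_j = 0.5, q_j = df_j / N  ->  tr_j = log((N - df_j) / df_j), close to idf_j for large N
        return term_relevance(0.5, df_j / N)

    def tr_with_relevance_info(r_j, R, df_j, N):
        # r_j relevant documents contain T_j, out of R relevant documents in the collection
        return term_relevance(r_j / R, (df_j - r_j) / (N - R))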

34 Term Relationships in Indexing l Single-term indexing »Single terms are often ambiguous. »Many single terms are either too specific or too broad to be useful. l Complex text identifiers »subject experts and trained indexers »linguistic analysis algorithms, e.g., NP chunker »term-grouping or term clustering methods

35 Tree-Dependence Model l Only certain dependent term pairs are actually included, the other term pairs and all higher-order term combinations being disregarded. l Example: sample term-dependence tree

36 Term pairs included in the tree: (children, school), (school, girls), (school, boys). Term pairs not included: (children, girls), (children, boys). Higher-order combinations not included: (school, girls, boys), (children, achievement, ability).

37 Term Classification (Clustering): consider the term-document matrix, whose rows represent documents and whose columns represent terms (figure)

38 Term Classification (Clustering) l Column part: group terms whose corresponding column representations reveal similar assignments to the documents of the collection. l Row part: group documents that exhibit sufficiently similar term assignments.

39 Linguistic Methodologies l Indexing phrases: nominal constructions including adjectives and nouns »Assign syntactic class indicators (i.e., part of speech) to the words occurring in document texts. »Construct word phrases from sequences of words exhibiting certain allowed syntactic markers (noun-noun and adjective-noun sequences).

40 Term-Phrase Formation l Term phrase: a sequence of related text words that carries a more specific meaning than the single terms, e.g., “computer science” vs. “computer”. (Figure: along the document-frequency axis, thesaurus transformation applies to low-frequency terms and phrase transformation to high-frequency terms.)

41 Simple Phrase-Formation Process l the principal phrase component (phrase head): a term with a document frequency exceeding a stated threshold, or exhibiting a negative discriminator value l the other components of the phrase: medium- or low-frequency terms with stated co-occurrence relationships with the phrase head l common function words: not used in the phrase-formation process

42 An Example l Effective retrieval systems are essential for people in need of information. »“are”, “for”, “in” and “of”: common function words »“system”, “people”, and “information”: phrase heads
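A rough Python sketch of the simple phrase-formation process applied to this sentence; the document frequencies, the head threshold, and the co-occurrence window are illustrative assumptions (the slides also allow negative-discriminator terms as heads, which is omitted here):

    STOP_WORDS = {"are", "for", "in", "of", "the", "and"}

    def form_phrases(words, df, head_threshold, window=2):
        # phrase heads: terms whose document frequency exceeds the threshold
        # phrase components: nearby lower-frequency content words
        content = [w for w in words if w not in STOP_WORDS]
        phrases = []
        for i, w in enumerate(content):
            if df.get(w, 0) > head_threshold:                       # w is a phrase head
                lo, hi = max(0, i - window), min(len(content), i + window + 1)
                for j in range(lo, hi):
                    if j != i and df.get(content[j], 0) <= head_threshold:
                        phrases.append((content[j], w) if j < i else (w, content[j]))
        return phrases

    sentence = "effective retrieval systems are essential for people in need of information"
    df = {"systems": 900, "people": 850, "information": 950,        # assumed frequencies
          "effective": 40, "retrieval": 60, "essential": 30, "need": 120}
    print(form_phrases(sentence.split(), df, head_threshold=500))
    # [('effective', 'systems'), ('retrieval', 'systems'), ('systems', 'essential'),
    #  ('essential', 'people'), ('people', 'need'), ('need', 'information')]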

43 The Formatted Term-Phrases (table): candidate phrases built from the content words effective, retrieval, systems, essential, people, need, information; *: phrases assumed to be useful for content identification

44 The Problems l A phrase-formation process controlled only by word co-occurrences and the document frequencies of certain words is not likely to generate a large number of high-quality phrases. l Additional syntactic criteria for phrase heads and phrase components may provide further control in phrase formation.

45 Additional Term-Phrase Formation Steps l Syntactic class indicators are assigned to the terms, and phrase formation is limited to sequences of specified syntactic markers, such as adjective-noun and noun-noun sequences (adverb-adjective and adverb-noun sequences are excluded). l The phrase elements are all chosen from within the same syntactic unit, such as subject phrase, term phrase, and verb phrase.

46 Consider Syntactic Unit l effective retrieval systems are essential for people in need of information l subject phrase »effective retrieval systems l verb phrase »are essential l term phrase »people in need of information

47 Phrases within Syntactic Components [ subj effective retrieval systems ] [ vp are essential ] for [ obj people need information ] l Adjacent phrase heads and components within syntactic components »retrieval systems* »people need »need information* (2 of the 3 adjacent phrases are judged useful) l Phrase heads and components co-occurring within syntactic components »effective systems

48 Problems l More stringent phrase formation criteria produce fewer phrases, both good and bad, than less stringent methodologies. l Prepositional phrase attachment, e.g., The man saw the girl with the telescope. l Anaphora resolution He dropped the plate on his foot and broke it.

49 Problems ( Continued ) l Any phrase matching system must be able to deal with the problems of »synonym recognition »differing word orders »intervening extraneous words l Example »retrieval of information vs. information retrieval

50 Equivalent Phrase Formulation l Base form: text analysis system l Variants: »system analyzes the text »text is analyzed by the system »system carries out text analysis »text is subjected to system analysis l Related term substitution »text: documents, information documents »analysis: processing, transformation, manipulation »system: program, process

51 Thesaurus-Group Generation l Thesaurus transformation »broadens index terms whose scope is too narrow to be useful in retrieval »a thesaurus must assemble groups of related specific terms under more general, higher-level class indicators (Figure, as before: along the document-frequency axis, thesaurus transformation targets low-frequency terms with dv j = 0, while phrase transformation targets high-frequency terms with dv j < 0.)

52 Sample Classes of Roget’s Thesaurus

53 Methods of Thesaurus Construction l d ij : value of term T j in document D i l sim(T j, T k ): similarity measure between T j and T k l possible measures compare the column vectors of T j and T k, e.g., the inner product Σ_i d ij · d ik or its cosine- or Dice-normalized variants
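Two such measures, sketched in Python over the columns of terms T j and T k in the term-document matrix; the exact formulas on the slide did not survive extraction, so the cosine and Dice forms here are assumptions:

    def cosine_sim(col_j, col_k):
        # col_j[i] = d_ij, the weight of term T_j in document D_i
        dot = sum(a * b for a, b in zip(col_j, col_k))
        nj = sum(a * a for a in col_j) ** 0.5
        nk = sum(b * b for b in col_k) ** 0.5
        return dot / (nj * nk) if nj and nk else 0.0

    def dice_sim(col_j, col_k):
        dot = sum(a * b for a, b in zip(col_j, col_k))
        denom = sum(a * a for a in col_j) + sum(b * b for b in col_k)
        return 2 * dot / denom if denom else 0.0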

54 Term-classification strategies l single-link: each term must have a similarity exceeding a stated threshold value with at least one other term in the same class. Produces fewer, much larger term classes. l complete-link: each term has a similarity to all other terms in the same class that exceeds the threshold value. Produces a large number of small classes.
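A sketch of single-link term classification with a similarity threshold; a complete-link variant would require every cross-pair, not just one, to exceed the threshold:

    def single_link_classes(terms, sim, threshold):
        # start with singleton classes; merge whenever any cross-pair similarity exceeds the threshold
        classes = [{t} for t in terms]
        merged = True
        while merged:
            merged = False
            for a in range(len(classes)):
                for b in range(a + 1, len(classes)):
                    if any(sim(x, y) > threshold for x in classes[a] for y in classes[b]):
                        classes[a] |= classes.pop(b)
                        merged = True
                        break
                if merged:
                    break
        return classes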

55 The Indexing Prescription (1) l Identify the individual words in the document collection. l Use a stop list to delete the function words from the texts. l Use a suffix-stripping routine to reduce each remaining word to word-stem form. l For each remaining word stem T j in document D i, compute w ij. l Represent each document D i by D i = (T 1, w i1 ; T 2, w i2 ; …; T t, w it )
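Putting the prescription together as a sketch; the stop list, the crude suffix stripper, and the tf x idf choice for w ij are illustrative stand-ins for the components described on the slides:

    import math
    from collections import Counter

    STOP_LIST = {"the", "a", "an", "and", "or", "of", "in", "for", "are", "is", "to"}

    def stem(word):
        # crude placeholder suffix stripping, not a full stemming algorithm
        for suffix in ("ing", "ness", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def build_index(documents):
        # documents: list of raw text strings, one per D_i
        stems_per_doc = []
        for text in documents:
            words = [w for w in text.lower().split() if w not in STOP_LIST]
            stems_per_doc.append(Counter(stem(w) for w in words))
        N = len(documents)
        df = Counter(t for tf in stems_per_doc for t in tf)   # each stem counted once per document
        # D_i represented as {T_j: w_ij} with w_ij = tf_ij * idf_j
        return [{t: tf_ij * math.log(N / df[t]) for t, tf_ij in tf.items()}
                for tf in stems_per_doc]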

56 Word Stemming l effectiveness --> effective --> effect l picnicking --> picnic l king -/-> k (i.e., “king” must not be reduced to “k”)

57 Some Morphological Rules l Restore a silent e after suffix removal from certain words, to produce “hope” from “hoping” rather than “hop”. l Delete certain doubled consonants after suffix removal, so as to generate “hop” from “hopping” rather than “hopp”. l Use a final y for an i in forms such as “easier”, so as to generate “easy” instead of “easi”.
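A sketch of these three repair rules applied after naive suffix stripping; the character tests are illustrative and far from a complete stemmer:

    VOWELS = set("aeiou")

    def repair_stem(stem):
        # 1. delete certain doubled consonants: "hopping" -> "hopp" -> "hop"
        if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
            return stem[:-1]
        # 2. use a final y for an i: "easier" -> "easi" -> "easy"
        if stem.endswith("i"):
            return stem[:-1] + "y"
        # 3. restore a silent e on short consonant-vowel-consonant stems: "hoping" -> "hop" -> "hope"
        if (len(stem) >= 3 and stem[-1] not in VOWELS
                and stem[-2] in VOWELS and stem[-3] not in VOWELS):
            return stem + "e"
        return stem

    print(repair_stem("hopp"), repair_stem("easi"), repair_stem("hop"))   # hop easy hope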

58 The Indexing Prescription (2) l Identify individual text words. l Use stop list to delete common function words. l Use automatic suffix stripping to produce word stems. l Compute term-discrimination value for all word stems. l Use thesaurus class replacement for all low-frequency terms with discrimination values near zero. l Use phrase-formation process for all high-frequency terms with negative discrimination values. l Compute weighting factors for complex indexing units. l Assign to each document single term weights, term phrases, and thesaurus classes with weights.

59 Query vs. Document l Differences »Query texts are short. »Fewer terms are assigned to queries. »The occurrence of query terms rarely exceeds 1. l Q = (w q1, w q2, …, w qt ), where w qj is an inverse-document-frequency weight l D i = (d i1, d i2, …, d it ), where d ij is a term frequency × inverse document frequency weight

60 Query vs. Document l When non-normalized document vectors are used, the longer documents with more assigned terms have a greater chance of matching particular query terms than do the shorter document vectors. l A length-normalized similarity measure, such as the cosine measure sketched below, compensates for this.
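A minimal sketch of a length-normalized (cosine) similarity between a query and a document vector, both represented as term-to-weight dictionaries; the normalization choice is an assumption, since the slide's formulas did not survive extraction:

    import math

    def similarity(query, doc):
        # cosine-normalized inner product between Q = (w_q1, ..., w_qt) and D_i = (d_i1, ..., d_it)
        dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
        qn = math.sqrt(sum(w * w for w in query.values()))
        dn = math.sqrt(sum(w * w for w in doc.values()))
        return dot / (qn * dn) if qn and dn else 0.0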

61 Relevance Feedback l Terms present in previously retrieved documents that have been identified as relevant to the user’s query are added to the original formulations. l The weights of the original query terms are altered by replacing the inverse-document-frequency portion of the weights with term-relevance weights, obtained from the occurrence characteristics of the terms in the previously retrieved relevant and nonrelevant documents of the collection.

62 Relevance Feedback l Q = (w q1, w q2,..., w qt ) l D i = (d i1, d i2,..., d it ) l The new query may take the form Q’ = α (w q1, w q2,..., w qt ) + β (w’ q,t+1, w’ q,t+2,..., w’ q,t+m ) l The weights of the newly added terms T t+1 to T t+m may consist of a combined term-frequency and term-relevance weight.
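A sketch of this query reformulation; the mixing weights alpha and beta and the use of tf x term-relevance weights for the added terms are illustrative assumptions consistent with the slide:

    def reformulate(query, relevant_docs, term_relevance, alpha=1.0, beta=0.5):
        # Q' = alpha * original query weights + beta * weights of terms added from relevant documents
        new_query = {t: alpha * w for t, w in query.items()}
        for doc in relevant_docs:
            for t, tf_ij in doc.items():
                if t not in query:
                    # combined term-frequency and term-relevance weight for a newly added term
                    new_query[t] = new_query.get(t, 0.0) + beta * tf_ij * term_relevance.get(t, 0.0)
        return new_query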

63 Final Indexing l Identify individual text words. l Use a stop list to delete common words. l Use suffix stripping to produce word stems. l Replace low-frequency terms with thesaurus classes. l Replace high-frequency terms with phrases. l Compute term weights for all single terms, phrases, and thesaurus classes. l Compare query statements with document vectors. l Identify some retrieved documents as relevant and some as nonrelevant to the query.

64 Final Indexing l Compute term-relevance factors based on available relevance assessments. l Construct new queries with added terms from relevant documents and term weights based on combined frequency and term-relevance weight. l Return to step 7 (compare query statements with document vectors).

65 Summary of expected effectiveness of automatic indexing l Basic single-term automatic indexing: baseline l Use of a thesaurus to group related terms in the given topic area: +10% to +20% l Use of automatically derived term associations obtained from joint term assignments found in sample document collections: 0% to -10% l Use of automatically derived term phrases obtained by using co-occurring terms found in the texts of sample collections: +5% to +10% l Use of one iteration of relevance feedback to add new query terms extracted from previously retrieved relevant documents: +30% to +60%