Published by Godwin Little. Modified over 9 years ago.
1
Term Weighting approaches in automatic text retrieval. Presented by Ehsan
2
References
Modern Information Retrieval (textbook)
Slides on the Vectorial Model by Dr. Rada
The paper itself
3
The main idea
A text indexing system based on weighted single terms performs better than one based on more complex text representations.
Of crucial importance: effective term weighting.
4
Basic IR
Attach content identifiers to both stored texts and user queries.
A content identifier (term) is a word or a group of words extracted from the documents or queries.
Underlying assumption: the semantics of the documents and queries can be expressed by these terms.
5
Two things to consider
What is an appropriate content identifier?
Are all identifiers of the same importance? If not, how can we discriminate one term from the others?
6
Choosing content identifiers
Use single terms (words) as individual identifiers.
Use more complex text representations as identifiers.
An example: “Industry is the mother of good luck.” Mother said, “Good luck.”
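One reading of the example above is that the two sentences mean different things yet share most of their content words once they are reduced to single terms. The sketch below illustrates this under simple assumptions: the `single_terms` helper is hypothetical, and lowercasing plus splitting on letters stands in for a real indexing step.

```python
import re


def single_terms(text):
    """Extract single-word identifiers: lowercase runs of ASCII letters."""
    return set(re.findall(r"[a-z]+", text.lower()))


proverb = single_terms("Industry is the mother of good luck")
remark = single_terms('Mother said, "Good luck".')

# Terms that both sentences would be indexed under.
shared = proverb & remark
print(sorted(shared))  # ['good', 'luck', 'mother']
```

With single terms alone, both sentences match a query such as "mother good luck"; a more complex representation would be needed to tell them apart.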
7
Complex text representation
1. Sets of related terms based on statistical co-occurrence.
2. Term phrases consisting of one or more governing terms (the head of the phrase) together with the corresponding dependent terms.
3. Grouping words under a common heading, as in a thesaurus.
4. Constructing a knowledge base to represent the content of the subject area.
8
What is better: single or complex terms?
Constructing complex text representations is inherently difficult and needs sophisticated syntactic or statistical analysis programs.
An example, using term phrases: a 20% increase in some cases; in other cases the results are quite discouraging.
Knowledge bases: effective vocabulary tools covering subject areas of reasonable scope are still under development.
Conclusion: using single terms as content identifiers is preferable.
9
The second issue
How to discriminate terms? Term weights, of course!
Effectiveness of an IR system: documents with relevant items must be retrieved; documents with irrelevant or extraneous items must be rejected.
10
Precision and Recall
Recall: the number of relevant documents retrieved divided by the total number of relevant documents.
Precision: the number of relevant documents retrieved divided by the total number of documents retrieved.
Our goal: high recall, to retrieve as many relevant documents as possible, and high precision, to reject extraneous documents.
Basically, it is a trade-off.
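The two measures can be computed directly from sets of document identifiers. This is a minimal sketch; the function name and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 relevant overall, 3 in common.
p, r = precision_recall(["d1", "d2", "d3", "d7"],
                        ["d1", "d2", "d3", "d4", "d5"])
print(p, r)  # 0.75 0.6
```

Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is the trade-off the slide refers to.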
11
Weighting mechanism
To get high recall: term frequency, tf. But when high-frequency terms are prevalent in the whole document collection, weighting by tf alone causes nearly every document to be retrieved.
To get high precision: inverse document frequency, idf. It varies inversely with the number of documents n in which the term appears, and is given by log2(N/n), where N is the total number of documents.
To discriminate terms: we use tf × idf.
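A small sketch of the tf × idf weight with idf = log2(N/n), as defined above. The four-document collection and the helper names are invented for illustration only.

```python
import math
from collections import Counter

# Toy collection (hypothetical), already split into single terms.
docs = {
    "d1": "term weighting improves text retrieval".split(),
    "d2": "text retrieval uses text indexing".split(),
    "d3": "weighting of single terms".split(),
    "d4": "vector space model".split(),
}

N = len(docs)  # total number of documents
# n[t]: number of documents in which term t appears at least once
n = Counter(term for words in docs.values() for term in set(words))


def idf(term):
    """Inverse document frequency, log2(N / n)."""
    return math.log2(N / n[term])


def tf_idf(doc_id):
    """Map each term of a document to tf * idf."""
    tf = Counter(docs[doc_id])
    return {term: freq * idf(term) for term, freq in tf.items()}


weights = tf_idf("d2")
print(weights)
```

In `d2`, "text" occurs twice but appears in two of the four documents (idf = 1), while "uses" occurs once and appears in only one document (idf = 2), so both end up with the same weight of 2.0.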
12
Two more things to consider
The current “tf × idf” mechanism favors longer documents: introduce a normalizing factor in the weight to equalize the length of the documents.
Probabilistic model: the term weight is the proportion of relevant documents in which a term occurs divided by the proportion of irrelevant items in which the term occurs. It is given by log((N-n)/n).
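Both points can be sketched in a few lines. The slide does not say which normalizing factor is meant; cosine normalization (dividing by the vector's Euclidean length) is assumed here. The function names and the sample weights are hypothetical.

```python
import math


def cosine_normalize(weights):
    """Scale a term-weight vector to unit Euclidean length (assumed normalizer)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


def probabilistic_weight(N, n):
    """Probabilistic term weight log((N - n) / n).

    N: total number of documents; n: documents containing the term.
    Only defined for 0 < n < N; negative when n > N / 2.
    """
    if not 0 < n < N:
        raise ValueError("need 0 < n < N")
    return math.log((N - n) / n)


# A short and a long document with the same term proportions (made-up weights).
short_doc = cosine_normalize({"text": 1.0, "retrieval": 1.0})
long_doc = cosine_normalize({"text": 5.0, "retrieval": 5.0})
print(short_doc == long_doc)  # True: length no longer matters

print(probabilistic_weight(100, 50))  # 0.0: term occurs in half the collection
```

After normalization the long document no longer gains an advantage simply from repeating its terms more often.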
13
Term weighting components
Term frequency components: b, t, n
Collection frequency components: x, f, p
Normalization components: x, c
What would be the weighting system given by tfc.nfx?
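A possible answer to the closing question, as a sketch. The slide lists only the letter codes; the definitions used here are recalled from the paper's component table and should be checked against it: b = binary (1 if the term is present), t = raw term frequency, n = 0.5 + 0.5 * tf / max tf; x = no collection factor (1.0), f = log(N/n), p = log((N-n)/n); x = no normalization, c = cosine normalization. The function names, the natural-log base, and the toy counts are assumptions for illustration. Under this reading, tfc.nfx weights documents by tf × idf with cosine normalization and weights queries by augmented tf × idf without normalization.

```python
import math


def term_frequency(code, tf, max_tf):
    """Term frequency component: b, t or n (definitions assumed, see lead-in)."""
    if code == "b":
        return 1.0 if tf > 0 else 0.0
    if code == "t":
        return float(tf)
    if code == "n":
        return 0.5 + 0.5 * tf / max_tf
    raise ValueError(f"unknown term frequency code: {code}")


def collection_frequency(code, N, n):
    """Collection frequency component: x, f or p."""
    if code == "x":
        return 1.0
    if code == "f":
        return math.log(N / n)
    if code == "p":
        return math.log((N - n) / n)
    raise ValueError(f"unknown collection frequency code: {code}")


def weight_vector(scheme, tf_counts, doc_freq, N):
    """Weight a document or query given a three-letter scheme such as 'tfc'."""
    tf_code, cf_code, norm_code = scheme
    if not tf_counts:
        return {}
    max_tf = max(tf_counts.values())
    weights = {
        term: term_frequency(tf_code, tf, max_tf)
        * collection_frequency(cf_code, N, doc_freq[term])
        for term, tf in tf_counts.items()
    }
    if norm_code == "c":  # cosine normalization
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm:
            weights = {term: w / norm for term, w in weights.items()}
    elif norm_code != "x":
        raise ValueError(f"unknown normalization code: {norm_code}")
    return weights


# Hypothetical collection statistics and term counts.
N = 1000
doc_freq = {"retrieval": 100, "weighting": 10}

doc = weight_vector("tfc", {"retrieval": 3, "weighting": 1}, doc_freq, N)
query = weight_vector("nfx", {"retrieval": 1, "weighting": 1}, doc_freq, N)

# Query-document similarity as the inner product of the two weight vectors.
score = sum(w * doc.get(term, 0.0) for term, w in query.items())
print(doc, query, score)
```

The part before the dot describes the document weighting and the part after it describes the query weighting, so the same `weight_vector` function serves both sides.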
14
Experimental evidence: query vectors
For tf: for short queries, use n; for long queries, use t.
For idf: use f.
For normalization: use x.
15
Experimental evidence: document vectors
For tf: with technical vocabulary, use n; with more varied vocabulary, use t.
For idf: use f in general; for documents from different domains, use x.
For normalization: for documents of heterogeneous length, use c; for homogeneous documents, use x.
16
Conclusion
Best document weighting: tfc, nfc (or tpc, npc).
Best query weighting: nfx, tfx, bfx (or npx, tpx, bpx).
Questions?