1
A Comparison of Document, Sentence, and Term Event Spaces
Catherine Blake
School of Information and Library Science
University of North Carolina at Chapel Hill
Chapel Hill, NC 27599-3360
cablake@email.unc.edu
2
Classic Information Retrieval
[Diagram: a Document is converted into a Representation; an Information Need is converted into a Query; the two representations are matched.]
Matching:
– Exact match = Boolean model
– Weighted match = Vector model
3
Term Weighting
Goal: favor discriminating terms.
Commonly used: TF x IDF, where
IDF(t_i) = log2(N) - log2(n_i) + 1
– N = total number of documents in the corpus
– t_i = a term (typically a stemmed word)
– n_i = number of documents that contain at least one occurrence of the term t_i
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21.
Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523.
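As a quick illustration of the IDF formula above, here is a minimal sketch in Python (the function names are my own, not from the paper):

```python
import math

def idf(total_docs, docs_with_term):
    # IDF(t_i) = log2(N) - log2(n_i) + 1
    return math.log2(total_docs) - math.log2(docs_with_term) + 1

def tf_idf(term_freq, total_docs, docs_with_term):
    # TF x IDF: raw term frequency weighted by document-space rarity.
    return term_freq * idf(total_docs, docs_with_term)

# A term appearing in 10 of 1,000 documents:
print(round(idf(1000, 10), 2))  # log2(1000) - log2(10) + 1 = 7.64
```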
4
Practical Motivations
Systems are moving toward sub-document retrieval:
– Document summarization – why not use Inverse Sentence Frequency (ISF)?
– Question answering – why not use Inverse Term Frequency (ITF)?
Calculating IDF is problematic:
– How many documents are needed for stable IDF estimates?
Corpora have changed since the initial experiments:
– # documents
– # terms per document
– Vocabulary size
5
Theoretical Motivations
TF x IDF combines two different event spaces:
– TF – number of terms
– IDF – number of documents
– Are the limits of these spaces really the same?
Foundational theories use the term space:
– Zipf's Law (Zipf, 1949)
– Shannon's Theory (Shannon, 1948)
6
Goal: Compare and Contrast
1. Raw term comparison
2. Zipf's Law comparison
3. Direct IDF, ISF, and ITF comparison
4. Abstract versus full-text comparison
5. IDF sensitivity
7
Corpora
Full-text scientific articles in chemistry.
Initial corpus:
– 103,262 articles
– Published in 27 journals over the last 4 years
– Two journals excluded due to formatting inconsistencies
These experiments:
– 100,830 articles
– 16,538,655 sentences
– 526,025,066 total unstemmed terms
– 2,001,730 distinct unstemmed terms
– 1,391,763 distinct stemmed terms (Porter algorithm)
Table 1. Corpus summary.
8
Journal  # Docs  % Corpus  Avg Length  Terms (M)  %
ACHRE4      548       0.5        4923        2.7   1
ANCHAM     4012       4.0        4860       19.5   4
BICHAW     8799       8.7        6674       58.7  11
BIPRET     1067       1.1        4552        4.9   1
BOMAF6     1068       1.1        4847        5.2   1
CGDEFU      566       0.5        3741        2.1  <1
CMATEX     3598       3.6        4807       17.3   3
ESTHAG     4120       4.1        5248       21.6   4
IECRED     3975       3.9        5329       21.2   4
INOCAJ     5422       5.4        6292       34.1   6
JACSAT    14400      14.3        4349       62.6  12
JAFCAU     5884       5.8        4185       24.6   5
JCCHFF      500       0.5        5526        2.8   1
JCISD8     1092       1.1        4931        5.4   1
JMCMAR     3202       3.2        8809       28.2   5
JNPRDF     2291       2.2        4144        9.5   2
JOCEAH     7307       7.2        6605       48.3   9
JPCAFH     7654       7.6        6181       47.3   9
JPCBFK     9990       9.9        5750       57.4  11
JPROBS      268       0.3        4917        1.3  <1
MAMOBX     6887       6.8        5283       36.4   7
MPOHBP       58       0.1        4868        0.3  <1
NALEFD     1272       1.3        2609        3.3   1
OPRDFK      858       0.8        3616        3.1   1
ORLEF7     5992       5.9        1477        8.8   2
9
Example IDF, ISF, ITF values (Abstract / Non-Abstract / All):

Term          Document (IDF)       Sentence (ISF)       Term (ITF)
              Abs   Non-Abs  All   Abs   Non-Abs  All   Abs   Non-Abs  All
the           1.0                  1.3   1.4            4.6   9.4      5.2
chemist       11.1  6.0      5.7   13.6  12.8     12.6  22.8  17.6
synthesis     14.3  11.2     10.8  17.1  18.0     17.6  26.4  22.6     22.5
electrochem   17.5  15.3     15.0  20.3  22.6     22.4  29.6  27.0     27.5

IDF(t_i) = log2(N) - log2(n_i) + 1
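The table above applies one formula in three event spaces; only the definitions of N and n_i change. A sketch, with all counts hypothetical (they are not the paper's figures):

```python
import math

def inverse_frequency(space_size, containing_count):
    # Same formula in every event space: log2(N) - log2(n_i) + 1.
    return math.log2(space_size) - math.log2(containing_count) + 1

# Hypothetical counts for one term (illustrative only):
idf = inverse_frequency(100_000, 50)         # N = documents,  n_i = documents containing t
isf = inverse_frequency(16_000_000, 900)     # N = sentences,  n_i = sentences containing t
itf = inverse_frequency(500_000_000, 1_000)  # N = total term occurrences, n_i = occurrences of t
```

Because the sentence and term spaces are much larger than the document space, the same term typically receives a larger weight there, as in the example rows above.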
10
1) Raw term comparison
Document vs. Sentence Frequency (log scales)
11
1) Raw term comparison
Document vs. Term Frequency (log scales)
12
[Figure: Luhn's curve of term significance vs. term frequency. Image source: van Rijsbergen, 1979]
13
1) Raw term comparison
Sentence vs. Term Frequency (log scales)
14
2) Zipf's Law comparison
Zipf's Law: the frequency of terms in a corpus follows a power-law distribution, K/j^θ, where j is the frequency rank and θ is close to 1 (Zipf, 1949).
Term distributions followed a power law, but θ differed between the event spaces:
– Average θ in the document space = -1.65
– Average θ in the sentence space = -1.73
– Average θ in the term space = -1.73
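The θ estimates above can be obtained by a least-squares fit on a log-log scale (slope of log frequency against log rank). A minimal pure-Python sketch, with my own function name:

```python
import math

def zipf_slope(frequencies):
    """Fit the exponent in f(j) ~ K / j**theta by least squares
    of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Exact power-law data with theta = 1 recovers a slope of -1:
data = [1000 / j for j in range(1, 101)]
print(round(zipf_slope(data), 3))  # -1.0
```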
15
2) Example Document Distribution
16
2) θ Comparison of all journals
17
3) Direct IDF vs ISF comparison
18
3) Direct IDF vs ITF comparison
19
3) Direct ISF vs ITF comparison
20
4) Abstract versus full-text
21
5) IDF Sensitivity
23
Conclusions
Raw document frequencies differ from sentence and term frequencies:
– particularly around the areas of important terms
– it is difficult to perform a linear transformation from the document space to a sub-document space
Raw term frequencies correlate well with sentence frequencies.
IDF, ISF, and ITF are highly correlated.
24
Conclusions
IDF values are surprisingly stable:
– with respect to random samples of just 10% of the total corpus
– average IDF values based on only a 20% random stratified sample correlated almost perfectly with the global IDF
Journal-based IDF samples did not correlate well with the global IDF.
The language used in abstracts is systematically different from the language used in the body of a full-text scientific document.
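The sampling-stability experiment can be sketched as follows; the toy corpus and every parameter here are invented for illustration and are not the paper's data:

```python
import math
import random

def idf_table(docs):
    """Map each term to log2(N) - log2(n_i) + 1 over a list of token lists."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log2(n) - math.log2(c) + 1 for t, c in df.items()}

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)
vocab = [f"t{i}" for i in range(200)]
# Toy corpus with a skewed term distribution (low-index terms are common).
docs = [[random.choice(vocab[: random.randint(5, 200)]) for _ in range(50)]
        for _ in range(2000)]

full = idf_table(docs)
sample = idf_table(random.sample(docs, len(docs) // 10))  # 10% random sample
common = list(sample)
r = pearson([full[t] for t in common], [sample[t] for t in common])
```

On data like this, `r` comes out close to 1, mirroring the reported stability of IDF under small random samples.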