1
Correlation of Term Count and Document Frequency for Google N-Grams
Martin Klein and Michael L. Nelson
Old Dominion University
ECIR 2009, Toulouse, France
04/08/2009
2
Background & Motivation
Term frequency (TF) – inverse document frequency (IDF) is a well-known term weighting concept
Used (among others) to generate lexical signatures (LSs)
TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus
When the entire web is the corpus, IDF can only be estimated!
Most text corpora provide term count (TC) values
D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”
TC >= DF, but is there a correlation? Can we use TC to estimate DF?

Term  All  Buy  Can’t  Is  Love  Me  Need  Please  You  Long
TC    1    1    1      1   2     2   1     2       1    3
DF    1    1    1      1   2     2   1     1       1    1
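The difference between TC and DF on the four example titles can be checked with a few lines of Python. This is a minimal sketch, not the authors' tooling: the `tokenize` helper and its rule (lowercase, letters and apostrophes only) are assumptions made for illustration.

```python
import re
from collections import Counter

# The four example "documents" from the slide.
docs = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}

def tokenize(text):
    # Assumption: case-insensitive tokens, apostrophes kept inside words.
    return re.findall(r"[a-z']+", text.lower())

tc = Counter()  # term count: total occurrences across the corpus
df = Counter()  # document frequency: number of documents containing the term
for text in docs.values():
    tokens = tokenize(text)
    tc.update(tokens)        # every occurrence counts
    df.update(set(tokens))   # each document counts at most once per term

for term in sorted(tc):
    print(f"{term:8s} TC={tc[term]} DF={df[term]}")
```

A term repeated inside one document ("Please", "Long") raises TC but not DF, which is why TC >= DF always holds.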
3
Experimental Setup & Results
Investigate correlation between TC and DF within “Web as Corpus” (WaC)
Rank similarity of all terms
4
Experimental Setup & Results
Investigate correlation between TC and DF within “Web as Corpus” (WaC)
Spearman’s ρ and Kendall τ
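Both rank correlation measures can be sketched in plain Python. The version below is an illustration under stated assumptions, not the code used for the experiments: Spearman's ρ is computed as the Pearson correlation of average ranks, Kendall's τ is the tie-aware τ-b variant, and the sample values are the TC and DF counts of the four example titles from the motivation slide.

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    # Rank 1 goes to the largest value; tied values share the mean rank.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    # Undefined (division by zero) if either input is constant.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue          # tied in both lists: ignored by tau-b
        elif dx == 0:
            ties_x += 1       # tied only in x
        elif dy == 0:
            ties_y += 1       # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom

# TC and DF per term for the four example titles (alphabetical term order).
tc_values = [1, 1, 1, 1, 3, 2, 2, 1, 2, 1]
df_values = [1, 1, 1, 1, 1, 2, 2, 1, 1, 1]
print("Spearman rho:", spearman_rho(tc_values, df_values))
print("Kendall tau-b:", kendall_tau_b(tc_values, df_values))
```

With real data, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` provide the same measures along with significance values.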
5
Experimental Setup & Results
Show similarity between WaC-based TC and Google N-Gram-based TC
TC frequencies
6
Experimental Setup & Results
Top 10 terms in decreasing order of their TF/IDF values
∪ = 14, ∩ = 6
Strong indicator that TC can be used to estimate DF for web pages!
Rank WaC-DF WaC-TC Google N-Grams
1 IR
2 RETRIEVAL IRSG
3
4 BCS IRIT CONFERENCE
5 EUROPEAN
6 2009 GRANT
7 GOOGLE FILTERING
8
9 ACM
10 ARIA PAPERS
Google: screen scraping DF (?) values from the Google web interface
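The ∪ and ∩ figures count the distinct terms across the top-10 lists and the terms shared by all of them. A minimal sketch of that comparison follows; the `overlap` helper and the three short term lists are hypothetical placeholders, not the lists from the slide.

```python
def overlap(*ranked_lists):
    """Return (union size, intersection size) for any number of term lists."""
    sets = [set(terms) for terms in ranked_lists]
    union = set.union(*sets)                # terms appearing in at least one list
    intersection = set.intersection(*sets)  # terms appearing in every list
    return len(union), len(intersection)

# Hypothetical top-3 lists, for illustration only.
list_a = ["ir", "retrieval", "conference"]
list_b = ["ir", "retrieval", "papers"]
list_c = ["ir", "conference", "grant"]

union_size, intersection_size = overlap(list_a, list_b, list_c)
print(f"union = {union_size}, intersection = {intersection_size}")
```

A small union and a large intersection mean the lists mostly agree on which terms are most significant, regardless of the exact rank order.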
7
Thank You & Come See My Poster!!!
Correlation of Term Count and Document Frequency for Google N-Grams
Questions
Martin Klein and Michael L. Nelson
Old Dominion University