Correlation of Term Count and Document Frequency for Google N-Grams
Martin Klein and Michael L. Nelson
Old Dominion University
{mklein,mln}@cs.odu.edu
ECIR 2009, Toulouse, France
04/08/2009
Background & Motivation
Term frequency (TF) – inverse document frequency (IDF) is a well-known term weighting concept
Used (among other things) to generate lexical signatures (LSs)
TF is easy to compute; IDF is not, since it depends on global knowledge about the corpus
When the entire web is the corpus, IDF can only be estimated!
Most text corpora provide term count (TC) values
Example corpus:
  D1 = “Please, Please Me”
  D2 = “Can’t Buy Me Love”
  D3 = “All You Need Is Love”
  D4 = “Long, Long, Long”
TC >= DF, but is there a correlation? Can we use TC to estimate DF?

Term  All  Buy  Can’t  Is  Love  Me  Need  Please  You  Long
TC    1    1    1      1   2     2   1     2       1    3
DF    1    1    1      1   2     2   1     1       1    1
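The difference between TC and DF on the four example titles can be checked with a short script. This is a minimal sketch, not the authors' tooling: the `tokenize` helper and its rules (lowercasing, keeping apostrophes, dropping other punctuation) are assumptions made for illustration.

```python
import re
from collections import Counter

# The four example "documents" from the slide.
docs = [
    "Please, Please Me",
    "Can't Buy Me Love",
    "All You Need Is Love",
    "Long, Long, Long",
]

def tokenize(text):
    # Assumed tokenization: lowercase, keep apostrophes, drop other punctuation.
    return re.findall(r"[a-z']+", text.lower())

term_count = Counter()  # TC: total occurrences of a term across the corpus
doc_freq = Counter()    # DF: number of documents that contain the term

for doc in docs:
    tokens = tokenize(doc)
    term_count.update(tokens)     # every occurrence counts
    doc_freq.update(set(tokens))  # each document counts at most once per term

for term in sorted(term_count):
    print(f"{term:8s} TC={term_count[term]} DF={doc_freq[term]}")
```

TC and DF differ only for terms that repeat within a single document ("please" and "long" here), which is why TC >= DF always holds.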
Experimental Setup & Results
Investigate correlation between TC and DF within “Web as Corpus” (WaC)
[Figure: rank similarity of all terms]
Experimental Setup & Results
Investigate correlation between TC and DF within “Web as Corpus” (WaC)
[Figure: Spearman’s ρ and Kendall’s τ]
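Both rank correlation measures can be sketched in plain Python. The code below is an illustration only, not the setup used for the experiments: Spearman's ρ is computed as the Pearson correlation of tie-averaged ranks, Kendall's τ uses the tau-b variant (an assumption, since the slide does not say which variant was used), and the `tc` and `df` values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    """Rank values in descending order (largest gets rank 1); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: counted in neither tie term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + only_x_tied) * (untied + only_y_tied)
    )

# Hypothetical TC and DF values for six terms (not taken from the experiments).
tc = [120, 45, 45, 300, 8, 60]
df = [80, 30, 40, 150, 8, 20]
print("Spearman rho:", spearman_rho(tc, df))
print("Kendall tau-b:", kendall_tau_b(tc, df))
```

Both functions divide by zero if one input is constant, so a real analysis would need to guard against that case.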
Experimental Setup & Results
Show similarity between WaC-based TC and Google N-Gram-based TC
[Figure: TC frequencies]
Experimental Setup & Results
Top 10 terms in decreasing order of their TF/IDF values
[Table: ranks 1–10 for the columns WaC-DF, WaC-TC, Google, and N-Grams. Terms appearing in the table: IR, RETRIEVAL, IRSG, BCS, IRIT, CONFERENCE, EUROPEAN, 2009, GRANT, GOOGLE, FILTERING, ACM, ARIA, PAPERS]
∪ = 14, ∩ = 6
Strong indicator that TC can be used to estimate DF for web pages!
Google: screen scraping DF (?) values from the Google web interface
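The union and intersection figures summarize how much the top-ranked term lists agree. A minimal sketch of that bookkeeping follows; the `overlap` helper and the three short term lists are hypothetical placeholders, not the lists from the table.

```python
def overlap(*term_lists):
    """Return (size of union, size of intersection) over several top-k term lists."""
    sets = [set(terms) for terms in term_lists]
    union = set.union(*sets)
    intersection = set.intersection(*sets)
    return len(union), len(intersection)

# Hypothetical top-3 lists produced with three different DF sources.
list_a = ["alpha", "beta", "gamma"]
list_b = ["beta", "gamma", "delta"]
list_c = ["gamma", "delta", "epsilon"]

union_size, intersection_size = overlap(list_a, list_b, list_c)
print("union:", union_size, "intersection:", intersection_size)
```

A small union relative to the combined list length, together with a large intersection, means the different DF sources select largely the same terms.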
Thank You & Come See My Poster!!!
Questions