Correlation of Term Count and Document Frequency for Google N-Grams

Martin Klein and Michael L. Nelson, Old Dominion University
ECIR 2009, Toulouse, France, 04/08/2009

Background & Motivation
Term frequency (TF) – inverse document frequency (IDF) is a well-known term weighting concept, used (among other things) to generate lexical signatures (LSs).
TF is not hard to compute, but IDF is, since it depends on global knowledge about the corpus. When the entire web is the corpus, IDF can only be estimated!
Most text corpora provide term count (TC) values.

D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”

TC >= DF, but is there a correlation? Can we use TC to estimate DF?

Term  All  Buy  Can’t  Is  Love  Me  Need  Please  You  Long
TC    1    1    1      1   2     2   1     2       1    3
DF    1    1    1      1   2     2   1     1       1    1
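The difference between TC and DF on the four example documents can be checked with a short script. This is only a minimal sketch: the tokenizer (lowercasing, splitting on non-letters) and the function names are assumptions made for illustration, not something the slides specify.

```python
import re
from collections import Counter

# The four example documents from the slide.
docs = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}

def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))

def term_count_and_doc_freq(documents):
    tc = Counter()  # TC: total occurrences of a term across the corpus
    df = Counter()  # DF: number of documents that contain the term
    for text in documents.values():
        tokens = tokenize(text)
        tc.update(tokens)
        df.update(set(tokens))  # each document counts at most once per term
    return tc, df

tc, df = term_count_and_doc_freq(docs)
for term in sorted(tc):
    print(f"{term:8s} TC={tc[term]} DF={df[term]}")
```

Because a document contributes at most 1 to a term's DF but every occurrence to its TC, TC >= DF holds for every term by construction.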

Experimental Setup & Results
Investigate the correlation between TC and DF within the “Web as Corpus” (WaC): rank similarity of all terms.

Investigate the correlation between TC and DF within the “Web as Corpus” (WaC): Spearman’s ρ and Kendall’s τ.
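Both rank correlation measures can be sketched in plain Python. The slides do not say how ties were handled or which τ variant was used, so the choices here (average ranks for ties, Kendall's τ-b) are assumptions, and the input values are the toy TC/DF counts from the four-document example rather than WaC data.

```python
from itertools import combinations

def average_ranks(values):
    # Rank 1 goes to the largest value; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    # Undefined (division by zero) if either input is constant.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_rho(x, y):
    # Spearman's rho is the Pearson correlation of the two rank vectors.
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / (
        (untied + tied_x_only) * (untied + tied_y_only)
    ) ** 0.5

# Toy counts in the order All, Buy, Can't, Is, Love, Me, Need, Please, You, Long.
toy_tc = [1, 1, 1, 1, 2, 2, 1, 2, 1, 3]
toy_df = [1, 1, 1, 1, 2, 2, 1, 1, 1, 1]
print("Spearman rho:", spearman_rho(toy_tc, toy_df))
print("Kendall tau-b:", kendall_tau_b(toy_tc, toy_df))
```

Both measures compare only the ordering of terms, not the raw counts, which is why they suit a question like "does ranking by TC resemble ranking by DF?".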

Show the similarity between WaC-based TC and Google N-Gram-based TC: TC frequencies.

Top 10 terms in decreasing order of their TF/IDF values.
∪ = 14, ∩ = 6
Strong indicator that TC can be used to estimate DF for web pages!
Google: screen scraping DF (?) values from the Google web interface.

Table (only partially preserved; the assignment of cells to columns is lost):
Rank WaC-DF WaC-TC Google N-Grams
1   IR
2   RETRIEVAL IRSG
3
4   BCS IRIT CONFERENCE
5   EUROPEAN
6   2009 GRANT
7   GOOGLE FILTERING
8
9   ACM
10  ARIA PAPERS
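The top-10 comparison can be sketched as follows: rank a page's terms by TF-IDF once with DF and once with TC as the frequency source, then count the union and intersection of the resulting lists. Everything concrete here is hypothetical: the function names, the scoring formula tf * log(N / freq), the corpus size, and the sample counts are illustrative assumptions and not the data or method behind the slide.

```python
import math

def top_k_by_tfidf(tf, freq, corpus_size, k=10):
    # Score = tf * log(corpus_size / freq); `freq` may hold DF values or TC values
    # used as a DF estimate. Terms without a positive frequency are skipped.
    scores = {
        t: tf[t] * math.log(corpus_size / freq[t])
        for t in tf
        if freq.get(t, 0) > 0
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def overlap_summary(*ranked_lists):
    # Returns (size of union, size of intersection) over all given lists.
    sets = [set(lst) for lst in ranked_lists]
    return len(set().union(*sets)), len(set.intersection(*sets))

# Hypothetical counts for one page and one corpus (made up for illustration).
page_tf = {"retrieval": 5, "conference": 3, "papers": 2, "the": 20, "grant": 1}
corpus_df = {"retrieval": 40, "conference": 300, "papers": 500, "the": 9000, "grant": 200}
corpus_tc = {"retrieval": 90, "conference": 700, "papers": 1500, "the": 9900, "grant": 350}
N = 10000  # hypothetical normalising constant

by_df = top_k_by_tfidf(page_tf, corpus_df, N, k=3)
by_tc = top_k_by_tfidf(page_tf, corpus_tc, N, k=3)
print(by_df, by_tc, overlap_summary(by_df, by_tc))
```

A small union and a large intersection across the lists mean the frequency sources select nearly the same terms.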

Thank You & Come See My Poster!!!
Questions
