Scalable Methods for Estimating Document Frequencies of Collocations in Databases Tan Yee Fan 2006 December 15 WING Group Meeting
Motivation Input: A list of items, L Output: For each a, b in L, the cooccurrence value between a and b Often done by querying some database for document frequencies e.g. f(a), f(b) and f(a b) Many cooccurrence measures need f(a b)
Problem Statement Input A list of items, L Output For each a, b in L, the document frequency f(a b) in database Naïve pairwise algorithm need O(n 2 ) queries Not scalable (e.g. n ~ 1000) Bandwidth issues and server overload
Related Work C-PANKOW (Cimiano et al., 2005) Matching named entities to concepts POLYPHONET (Matsuo et al., 2006) Building a social network Avoid pairwise queries as far as possible Both C-PANKOW and POLYPHONET perform “document processing” to achieve this goal Is document processing really necessary?
Related Work QProber (Ipeirotis, 2002) Obtain a sample of documents from database Select some words to query and fit a power law curve Estimate document frequencies of the rest Figure from Ipeirotis (2002)
This Project Extend QProber algorithm to collocations Algorithm Obtain a sample of documents from database Select some collocations to query and fit a power law curve Estimate document frequencies of the rest
Query Selection Strategy Query selection strategy For each word w, order collocations in sampled documents containing w by rank Uniformly select q collocations to query Use O(qn) queries, with q << n decreasing rank
Experiment Database of 2000 newsgroup articles Evaluated on a lexicon of 100 words Vary sample size s and number of queries q
Conclusion Possible to estimate document frequencies of collocations reliably using O(n) queries Next step Can the methods be applied to disambiguating author names, publication venue titles, etc.?
Additional Slides
Estimating Actual Document Frequencies Alternative method For each word w, fit a power law curve using the collocations containing w Estimation for unknown collocation w 1 w 2 : Average the values estimated from the curve of w 1 and the curve of w 2 Problem Quality of each curve is not as good as lesser training examples used
Query Selection Strategy Alternative strategy Uniform selection of collocations to query without regard to frequencies Problem Together with alternative method, can produce large errors due to selection of collocations at the tail of the power law curve to query