Scalable Methods for Estimating Document Frequencies of Collocations in Databases
Tan Yee Fan
2006 December 15
WING Group Meeting

Motivation
- Input: a list of items, L
- Output: for each pair a, b in L, the cooccurrence value between a and b
- Often done by querying some database for document frequencies
  - e.g. f(a), f(b) and f(a ∧ b)
- Many cooccurrence measures need f(a ∧ b)

Problem Statement
- Input: a list of items, L
- Output: for each pair a, b in L, the document frequency f(a ∧ b) in the database
- The naïve pairwise algorithm needs O(n^2) queries, where n is the number of items in L
  - Not scalable (e.g. n ~ 1000)
  - Bandwidth issues and server overload
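
To make the scaling concrete, the sketch below counts the queries issued by the naïve approach. It assumes one query per unordered pair of distinct items; the slide only states the O(n^2) bound, and the function name and item labels are illustrative.

```python
from itertools import combinations


def naive_pairwise_query_count(n: int) -> int:
    """Number of unordered pairs {a, b} with a != b among n items."""
    return n * (n - 1) // 2


# Cross-check the closed form against an explicit enumeration of pairs.
items = ["w1", "w2", "w3", "w4"]  # hypothetical item labels
pairs = list(combinations(items, 2))
assert len(pairs) == naive_pairwise_query_count(len(items))

# For the n ~ 1000 case mentioned on the slide:
print(naive_pairwise_query_count(1000))  # 499500
```

Under this assumption, a list of about 1000 items already needs roughly half a million queries.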

Related Work
- C-PANKOW (Cimiano et al., 2005)
  - Matching named entities to concepts
- POLYPHONET (Matsuo et al., 2006)
  - Building a social network
- Avoid pairwise queries as far as possible
  - Both C-PANKOW and POLYPHONET perform "document processing" to achieve this goal
  - Is document processing really necessary?

Related Work
- QProber (Ipeirotis, 2002)
  - Obtain a sample of documents from the database
  - Select some words to query and fit a power law curve
  - Estimate the document frequencies of the rest
[Figure from Ipeirotis (2002)]

This Project
- Extend the QProber algorithm to collocations
- Algorithm
  - Obtain a sample of documents from the database
  - Select some collocations to query and fit a power law curve
  - Estimate the document frequencies of the rest
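
The curve-fitting and estimation steps can be sketched as follows. This is a minimal illustration, not the method from the slides: it assumes the simplest power law form f = c * r^(-b), fitted by ordinary least squares in log-log space, and all function names and data values are hypothetical. QProber's exact curve form and fitting procedure may differ.

```python
import math


def fit_power_law(ranks, freqs):
    """Least-squares fit of log f = log c - b * log r; returns (c, b).

    Needs at least two distinct ranks and strictly positive frequencies.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def estimate_frequency(rank, c, b):
    """Document frequency predicted by the fitted curve at a given rank."""
    return c * rank ** (-b)


# Hypothetical data: sample ranks of the queried collocations and the
# document frequencies the database returned for them.
queried_ranks = [1, 10, 100]
queried_freqs = [1000.0, 100.0, 10.0]

c, b = fit_power_law(queried_ranks, queried_freqs)

# Estimate an unqueried collocation that has rank 50 in the sample.
print(estimate_frequency(50, c, b))
```

Only the queried collocations cost a database query; every other collocation is estimated from its rank in the sample.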

Query Selection Strategy
- Query selection strategy
  - For each word w, order collocations in sampled documents containing w by rank
  - Uniformly select q collocations to query
- Uses O(qn) queries, with q << n
[Figure: collocations ordered by decreasing rank]
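
One way to read the selection step is sketched below. It assumes that "uniformly select" means picking collocations at evenly spaced positions in the rank-ordered list; uniform random sampling over the list would be another reading. The function name and collocation labels are hypothetical.

```python
def select_queries(ranked_collocations, q):
    """Pick up to q items at evenly spaced positions of a rank-ordered list."""
    n = len(ranked_collocations)
    if q <= 0:
        return []
    if q >= n:
        return list(ranked_collocations)
    if q == 1:
        return [ranked_collocations[0]]
    # Integer arithmetic keeps the indices distinct and covers both ends,
    # from index 0 (top rank) to index n - 1 (bottom rank).
    indices = [i * (n - 1) // (q - 1) for i in range(q)]
    return [ranked_collocations[i] for i in indices]


# Hypothetical collocations containing some word w, already ordered by rank.
ranked = [f"c{i}" for i in range(1, 11)]
selected = select_queries(ranked, 4)
print(selected)  # ['c1', 'c4', 'c7', 'c10']
```

Running this once per word gives at most q queries per word, hence at most q * n in total. With illustrative values n = 1000 and q = 10, that is at most 10,000 queries.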

Experiment
- Database of 2000 newsgroup articles
- Evaluated on a lexicon of 100 words
- Vary sample size s and number of queries q

Conclusion
- Possible to estimate document frequencies of collocations reliably using O(n) queries
- Next step
  - Can the methods be applied to disambiguating author names, publication venue titles, etc.?

Additional Slides

Estimating Actual Document Frequencies
- Alternative method
  - For each word w, fit a power law curve using the collocations containing w
  - Estimation for an unknown collocation w1 ∧ w2: average the values estimated from the curve of w1 and the curve of w2
- Problem
  - Each curve is of lower quality, because fewer training examples are used to fit it
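
The averaging step of this alternative method might look like the sketch below. It assumes each per-word curve is stored as a pair (c, b) for f = c * r^(-b), that the collocation's rank among each word's collocations is known, and that "average" means the arithmetic mean. All names and numbers are hypothetical.

```python
def estimate_from_curve(rank, c, b):
    """Frequency predicted by one word's power law curve, f = c * r^(-b)."""
    return c * rank ** (-b)


def estimate_collocation(rank_in_w1, curve_w1, rank_in_w2, curve_w2):
    """Arithmetic mean of the estimates from the two per-word curves."""
    est1 = estimate_from_curve(rank_in_w1, *curve_w1)
    est2 = estimate_from_curve(rank_in_w2, *curve_w2)
    return (est1 + est2) / 2


# Hypothetical fitted curves (c, b) for the two words of the collocation.
curve_w1 = (100.0, 1.0)
curve_w2 = (60.0, 1.0)

# The collocation has rank 4 among w1's collocations and rank 2 among w2's.
estimate = estimate_collocation(4, curve_w1, 2, curve_w2)
print(estimate)
```

Because each curve is fitted only on collocations containing one word, it is fitted on fewer points than a single shared curve would be, which is the weakness the slide notes.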

Query Selection Strategy
- Alternative strategy
  - Uniform selection of collocations to query without regard to frequencies
- Problem
  - Together with the alternative method, can produce large errors due to selection of collocations at the tail of the power law curve to query
