Scalable Methods for Estimating Document Frequencies of Collocations in Databases Tan Yee Fan 2006 December 15 WING Group Meeting.

Presentation transcript:

Scalable Methods for Estimating Document Frequencies of Collocations in Databases Tan Yee Fan 2006 December 15 WING Group Meeting

Motivation
- Input:
  - A list of items, L
- Output:
  - For each a, b in L, the cooccurrence value between a and b
- Often done by querying some database for document frequencies
  - e.g. f(a), f(b) and f(a ∧ b)
- Many cooccurrence measures need f(a ∧ b)
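As an illustration of why f(a ∧ b) is needed, here is a minimal sketch of two common cooccurrence measures computed from document frequencies. The slides do not say which measures are used; the Jaccard coefficient and pointwise mutual information are chosen here only as examples, and the function names and the `n_docs` parameter (total number of documents in the database) are assumptions.

```python
import math


def jaccard(f_a: int, f_b: int, f_ab: int) -> float:
    """Jaccard coefficient: documents with both items over documents with either."""
    union = f_a + f_b - f_ab
    return f_ab / union if union > 0 else 0.0


def pmi(f_a: int, f_b: int, f_ab: int, n_docs: int) -> float:
    """Pointwise mutual information, with probabilities estimated as f / n_docs."""
    if f_ab == 0:
        return float("-inf")  # the items never cooccur
    return math.log((f_ab * n_docs) / (f_a * f_b))
```

Both measures take the joint frequency as an argument, so each pair of items costs one conjunctive query unless that frequency can be estimated.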

Problem Statement
- Input
  - A list of items, L
- Output
  - For each a, b in L, the document frequency f(a ∧ b) in the database
- The naïve pairwise algorithm needs O(n^2) queries, where n is the number of items in L
  - Not scalable (e.g. n ~ 1000)
  - Bandwidth issues and server overload
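To make the O(n^2) cost concrete, the sketch below counts the conjunctive queries the naïve pairwise algorithm would issue, assuming one query per unordered pair. The function names are illustrative and do not come from the slides.

```python
from itertools import combinations


def pairwise_query_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def pairwise_queries(items):
    """Enumerate every unordered pair (a, b) that would be sent to the database."""
    return list(combinations(items, 2))
```

For n = 1000 this comes to 499,500 queries, which is the scale at which bandwidth and server load become a concern.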

Related Work
- C-PANKOW (Cimiano et al., 2005)
  - Matching named entities to concepts
- POLYPHONET (Matsuo et al., 2006)
  - Building a social network
- Avoid pairwise queries as far as possible
  - Both C-PANKOW and POLYPHONET perform “document processing” to achieve this goal
  - Is document processing really necessary?

Related Work
- QProber (Ipeirotis, 2002)
  - Obtain a sample of documents from database
  - Select some words to query and fit a power law curve
  - Estimate document frequencies of the rest
[Figure from Ipeirotis (2002)]
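The curve-fitting step can be sketched as follows. This is a simplified illustration rather than QProber's exact procedure: it assumes a plain power law f = c * r^(-alpha) relating the actual document frequency f to the rank r in the sample, fitted by least squares in log-log space. The function names are hypothetical.

```python
import math


def fit_power_law(ranks, freqs):
    """Fit log f = log c - alpha * log r by least squares; return (c, alpha).

    Needs at least two distinct ranks and strictly positive frequencies.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def estimate_frequency(rank, c, alpha):
    """Predict the document frequency of an unqueried item from its rank."""
    return c * rank ** (-alpha)
```

The queried items supply the (rank, frequency) training pairs; every other item is then estimated from its rank alone.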

This Project
- Extend QProber algorithm to collocations
- Algorithm
  - Obtain a sample of documents from database
  - Select some collocations to query and fit a power law curve
  - Estimate document frequencies of the rest
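The first step, turning a document sample into ranked collocations, might look like the sketch below. It makes several assumptions that the slides do not state: a collocation is an unordered pair of lexicon words appearing in the same document, documents are tokenized by lowercasing and splitting on whitespace, and ties in sample frequency are broken alphabetically.

```python
from collections import Counter
from itertools import combinations


def sample_collocation_frequencies(sample_docs, lexicon):
    """Count, for each pair of lexicon words, the sampled documents containing both."""
    lexicon = set(lexicon)
    counts = Counter()
    for doc in sample_docs:
        present = sorted(lexicon & set(doc.lower().split()))
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


def rank_collocations(counts):
    """Return (pair, rank) tuples, rank 1 being the most frequent in the sample."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(pair, rank) for rank, (pair, _) in enumerate(ordered, start=1)]
```

Only the sample is processed locally here; the database itself is queried later, for the selected collocations only.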

Query Selection Strategy
- Query selection strategy
  - For each word w, order the collocations containing w in the sampled documents by rank
  - Uniformly select q collocations to query
- Uses O(qn) queries, with q << n
[Figure: collocations ordered by decreasing rank]
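One way to read "uniformly select" is to pick collocations at evenly spaced positions in the rank-ordered list, so that both the head and the tail of the curve are represented. That reading is an assumption, as are the function names; the slides do not give the exact procedure.

```python
def select_evenly_by_rank(ranked, q):
    """Pick up to q items at evenly spaced positions in a rank-ordered list."""
    m = len(ranked)
    if q <= 0 or m == 0:
        return []
    if q >= m:
        return list(ranked)
    if q == 1:
        return [ranked[0]]
    # Integer arithmetic keeps the first and last positions and avoids duplicates.
    indices = [(i * (m - 1)) // (q - 1) for i in range(q)]
    return [ranked[i] for i in indices]


def select_queries(ranked_by_word, q):
    """Apply the selection per word: at most q queries per word, O(qn) in total."""
    return {w: select_evenly_by_rank(r, q) for w, r in ranked_by_word.items()}
```

With n words and at most q selections per word, the total number of queries is bounded by q * n.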

Experiment
- Database of 2000 newsgroup articles
- Evaluated on a lexicon of 100 words
- Vary sample size s and number of queries q

Conclusion
- Possible to estimate document frequencies of collocations reliably using O(n) queries
- Next step
  - Can the methods be applied to disambiguating author names, publication venue titles, etc.?

Additional Slides

Estimating Actual Document Frequencies
- Alternative method
  - For each word w, fit a power law curve using the collocations containing w
  - Estimation for an unknown collocation w1 ∧ w2: average the values estimated from the curve of w1 and the curve of w2
- Problem
  - The quality of each curve is not as good, because fewer training examples are used to fit it
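The averaging step of the alternative method can be sketched as below. It assumes each per-word curve is stored as a (c, alpha) pair for the power law f = c * r^(-alpha), that the collocation has a separate rank within each word's list, and that "average" means the arithmetic mean. None of these details are confirmed by the slides.

```python
def estimate_pair(rank_in_w1, curve_w1, rank_in_w2, curve_w2):
    """Average the two per-word power law estimates for one collocation."""
    c1, alpha1 = curve_w1
    c2, alpha2 = curve_w2
    estimate_1 = c1 * rank_in_w1 ** (-alpha1)  # estimate from the curve of w1
    estimate_2 = c2 * rank_in_w2 ** (-alpha2)  # estimate from the curve of w2
    return (estimate_1 + estimate_2) / 2
```

For example, `estimate_pair(2, (100, 1.0), 4, (80, 0.5))` averages the per-curve estimates 50 and 40.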

Query Selection Strategy
- Alternative strategy
  - Uniform selection of collocations to query without regard to frequencies
- Problem
  - Together with the alternative method, this can produce large errors, because collocations at the tail of the power law curve may be selected for querying