Case Study: BibFinder


Case Study: BibFinder

BibFinder: A popular CS bibliographic mediator
–Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect, Network Bibliography, CSB, CiteSeer
–More than real user queries collected

Mediated schema relation in BibFinder: paper(title, author, conference/journal, year)
–Primary key: title+author+year

Focus on selection queries:
Q(title, author, year) :- paper(title, author, conference/journal, year), conference=SIGMOD

Selecting top-K sources for a given query

Given a query Q and sources S1…Sn, we need the coverage and overlap statistics of the sources Si w.r.t. Q:
–P(S|Q) is the coverage of S: the probability that a random tuple belonging to Q is exported by source S.
–P({S1..Sj}|Q) is the overlap among S1..Sj w.r.t. Q: the probability that a random tuple belonging to Q is exported by all of the sources S1..Sj.
–Given the coverage and overlap statistics, we can pick the top-K sources that yield the maximal number of tuples for Q.

Computing Effective Coverage provided by a set of sources

Suppose we call 3 sources S1, S2, S3 to answer a query Q. The effective coverage we get is P(S1 ∪ S2 ∪ S3 | Q). To compute this union, we need the intersection (overlap) statistics in addition to the coverage statistics. Given these, we can pick the optimal 3 sources for answering Q by considering all 3-sized subsets of the source set S1…Sn and picking the subset with the highest effective coverage.
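The inclusion-exclusion computation described above can be sketched as follows; the source names and all probability values are illustrative, not BibFinder's actual statistics.

```python
from itertools import combinations

# Illustrative coverage/overlap statistics for a fixed query Q:
# each frozenset of sources maps to the probability that a random
# tuple of Q is exported by ALL sources in the set. Singleton keys
# are the coverage values P(S|Q); the numbers are made up.
overlap = {
    frozenset({"S1"}): 0.5,
    frozenset({"S2"}): 0.4,
    frozenset({"S3"}): 0.3,
    frozenset({"S1", "S2"}): 0.2,
    frozenset({"S1", "S3"}): 0.1,
    frozenset({"S2", "S3"}): 0.1,
    frozenset({"S1", "S2", "S3"}): 0.05,
}

def union_coverage(sources, overlap):
    """P(S1 u ... u Sk | Q) by inclusion-exclusion: add single-source
    coverages, subtract pairwise overlaps, add back triple overlaps,
    and so on."""
    total = 0.0
    for k in range(1, len(sources) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(sources, k):
            total += sign * overlap.get(frozenset(subset), 0.0)
    return total

# Effective coverage of calling S1, S2 and S3 together:
# 0.5 + 0.4 + 0.3 - 0.2 - 0.1 - 0.1 + 0.05 = 0.85
print(union_coverage(["S1", "S2", "S3"], overlap))
```

Picking the optimal 3-source plan then amounts to evaluating this union coverage for every 3-sized subset of the sources and keeping the maximum.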

Selecting top-K sources: the greedy way

Selecting the optimal K sources is hard in general. One way to reduce cost is to select sources greedily, one after another. For example, to select 3 sources, we pick the first source Si as the one with the highest P(Si|Q) value. To pick the j-th source, we compute the residual coverage of each remaining source given the j−1 sources already picked (the residual coverage computation requires overlap statistics). For example, picking a third source in the context of sources S1 and S2 requires us to calculate:

P(S3 ∧ ¬S1 ∧ ¬S2 | Q) = P(S3|Q) − P({S1,S3}|Q) − P({S2,S3}|Q) + P({S1,S2,S3}|Q)

What good is a high-coverage source that is off-line?

Sources vary significantly in terms of their response times.
–The response time depends both on the source itself and on the query that is asked of it; specifically, which fields are bound in the selection query can make a difference.
It is hard enough to get a high-coverage or a low-response-time plan; now we have to combine them…
Question: How do we define an optimal plan in the context of both coverage/overlap and response-time requirements?

Response time can depend on the query type

[Charts: response times for range queries on year, and the effect of binding the author field]
–Response times can also depend on the time of the day and the day of the week.

Multi-objective Query Optimization

We need to optimize queries jointly for both high coverage and low response time.
–Staged optimization won't quite work.
An idea: make the source selection depend on both (residual) coverage and response time.
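One simple way to make source selection sensitive to both objectives is a weighted per-source score; the linear functional form, the normalization, and the weight below are illustrative assumptions, not the formula BibFinder actually uses.

```python
def source_utility(residual_coverage, response_time, max_time, alpha=0.7):
    """Score a candidate source on both objectives at once.
    residual_coverage is the coverage gain over already-picked sources;
    response_time is normalized by max_time so both terms lie in [0, 1].
    alpha trades off the two objectives; its value here is arbitrary."""
    return alpha * residual_coverage - (1.0 - alpha) * (response_time / max_time)

# A slow source needs proportionally more residual coverage to win:
fast = source_utility(residual_coverage=0.30, response_time=1.0, max_time=10.0)
slow = source_utility(residual_coverage=0.35, response_time=9.0, max_time=10.0)
print(fast > slow)
```

Plugging such a score into the greedy loop in place of raw residual coverage yields a single-pass selection that accounts for both objectives, which is why staged (coverage-first, time-second) optimization can be avoided.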

Results on BibFinder

Challenges

–Sources are incomplete and partially overlapping.
–Calling every possible source is inefficient and impolite.
–We need coverage and overlap statistics to figure out which sources are most relevant for every possible query!
–We introduce a frequency-based approach for mining these statistics.

Outline  Motivation BibFinder/StatMiner Architecture StatMiner Approach –Automatically learning AV Hierarchies –Discovering frequent query classes –Learning coverage and overlap Statistics Using Coverage and Overlap Statistics StatMiner evaluation with BibFinder Related Work Conclusion

Motivation

We introduce StatMiner:
–A threshold-based hierarchical mining approach
–Stores statistics w.r.t. query classes
–Keeps more accurate statistics for more frequently asked queries
–Handles the efficiency/accuracy tradeoff by adjusting the thresholds

Challenges of gathering coverage and overlap statistics:
–It is impractical to assume that the sources will export such statistics, because the sources are autonomous.
–It is impractical to learn and store all the statistics for every query: this would necessitate on the order of N · 2^M different statistics, where N is the number of possible queries and M is the number of sources.
–It is impractical to assume knowledge of the entire query population a priori.

BibFinder/StatMiner

Query List

AV Hierarchies and Query Classes

StatMiner

Using Coverage and Overlap Statistics to Rank Sources

Outline
Motivation
BibFinder/StatMiner Architecture
StatMiner Approach
–Automatically learning AV Hierarchies
–Discovering frequent query classes
–Learning coverage and overlap statistics
Using Coverage and Overlap Statistics
 StatMiner Evaluation with BibFinder
Related Work
Conclusion

BibFinder/StatMiner Evaluation

Experimental setup with BibFinder:
–Mediator relation: paper(title, author, conference/journal, year)
–real user queries are used; among them, 4500 queries are randomly chosen as test queries.
–AV hierarchies for all four attributes are learned automatically: distinct values in author, 1200 frequently asked keyword itemsets in title, 600 distinct values in conference/journal, and 95 distinct values in year.

Learned Conference Hierarchy

Space Consumption for Different minfreq and minoverlap

–We use a threshold on the support of a class, called minfreq, to identify frequent classes.
–We use a minimum support threshold, minoverlap, to prune overlap statistics for uncorrelated source sets.
–As we increase either of these two thresholds, the memory consumption drops, especially in the beginning.
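The effect of the two thresholds can be sketched as a filter over the learned statistics; the data layout, the choice to keep singleton (coverage) entries regardless of minoverlap, and all names and numbers are assumptions for illustration.

```python
# class_freq: query class -> fraction of user queries mapped to it.
# stats: (query class, frozenset of sources) -> P(source set | class).
class_freq = {"DB-papers": 0.40, "AI-2001": 0.05}
stats = {
    ("DB-papers", frozenset({"DBLP"})): 0.60,
    ("DB-papers", frozenset({"DBLP", "CSB"})): 0.04,
    ("AI-2001", frozenset({"DBLP"})): 0.50,
}

def prune(class_freq, stats, minfreq, minoverlap):
    """Keep statistics only for frequent classes, and drop overlap
    entries for weakly correlated source sets. Coverage entries
    (singleton source sets) survive for every frequent class."""
    frequent = {c for c, f in class_freq.items() if f >= minfreq}
    return {
        (c, srcs): p
        for (c, srcs), p in stats.items()
        if c in frequent and (len(srcs) == 1 or p >= minoverlap)
    }

# Raising either threshold shrinks the stored statistics:
kept = prune(class_freq, stats, minfreq=0.10, minoverlap=0.10)
print(sorted(c for c, _ in kept))
```

Here the infrequent AI-2001 class and the weakly correlated {DBLP, CSB} overlap entry are both dropped, which is exactly why memory consumption falls as the thresholds rise.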

Accuracy of the Learned Statistics

[Chart: absolute error of the estimated statistics for different thresholds]
–No dramatic increases in error as the thresholds grow.
–Keeping very detailed overlap statistics would not necessarily increase the accuracy, while requiring much more space. For example: minfreq=0.13 and minoverlap=0.1 versus minfreq=0.33 and minoverlap=0.

Plan Precision

Here we observe the average precision of the top-2 source plans. The plans using our learned statistics have high precision compared to random source selection, and the precision decreases very slowly as we change the minfreq and minoverlap thresholds.

Plan Precision on Controlled Sources

We observe the plan precision of top-5 source plans (25 simulated sources in total). Using greedy selection does produce better plans. See Section 3.8 and Section 3.9 for detailed information.

Number of Distinct Results

Here we observe the average number of distinct results of the top-2 source plans. Our method gets on average 50 distinct answers, while random selection gets only about 30.

Applications

Path Selection in Bioinformatics [LNRV03]
–More and more bioinformatics sources are available on the Internet.
–Thousands of paths exist for answering users' queries.
–Path coverage and overlap statistics are needed.

Text Database Selection in Information Retrieval
–StatMiner can provide a better way of learning and storing representatives of the databases.
–Main ideas:
  Maintain a query list and discover frequently asked keyword sets.
  Learn a keyword-set hierarchy based on the statistics distance.
  Learn and store coverage (document frequency) for frequently asked keyword-set classes.
  Map a new query to a set of close classes and use their statistics to estimate statistics for the query.
–Advantages: multiple-word terms & scalability.
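The "map a new query to close classes" step can be sketched as follows; the keyword-set classes, the use of Jaccard distance as the closeness measure, and the coverage numbers are all illustrative assumptions.

```python
# Stored coverage statistics for frequent keyword-set classes
# (document-frequency-style numbers, made up for illustration).
stored = {
    frozenset({"database"}): 0.60,
    frozenset({"database", "query"}): 0.40,
    frozenset({"retrieval"}): 0.50,
}

def jaccard_distance(a, b):
    """0 when the keyword sets are identical, 1 when disjoint."""
    return 1.0 - len(a & b) / len(a | b)

def estimate_coverage(query_keywords, stored, n_close=2):
    """Map a new query to its n_close nearest classes and average
    their stored coverage as the estimate for the query."""
    q = frozenset(query_keywords)
    close = sorted(stored, key=lambda c: jaccard_distance(q, c))[:n_close]
    return sum(stored[c] for c in close) / len(close)

# {"database", "query"} is closest to the {database, query} class
# (distance 0) and the {database} class (distance 0.5);
# estimate = (0.40 + 0.60) / 2
print(estimate_coverage({"database", "query"}, stored))
```

Because only statistics for frequent classes need to be stored, this keeps the statistics table small while still giving an estimate for queries never seen before.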