VLDB ‘07 Query Processing over Incomplete Autonomous Databases Garrett Wolf (Arizona State University) Hemal Khatri (MSN Live Search) Bhaumik Chokshi (Arizona.

Slides:



Advertisements
Similar presentations
Uncertainty in Data Integration Ai Jing
Advertisements

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Google News Personalization: Scalable Online Collaborative Filtering
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Temporal Query Log Profiling to Improve Web Search Ranking Alexander Kotov (UIUC) Pranam Kolari, Yi Chang (Yahoo!) Lei Duan (Microsoft)
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Book Recommender System Guided By: Prof. Ellis Horowitz Kaijian Xu Group 3 Ameet Nanda Bhaskar Upadhyay Bhavana Parekh.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Cleaning Uncertain Data with Quality Guarantees Reynold Cheng, Jinchuan Chen, Xike Xie 2008 VLDB Presented by SHAO Yufeng.
Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Efficient Query Evaluation on Probabilistic Databases
Query Processing over Incomplete Autonomous Web Databases MS Thesis Defense by Hemal Khatri Committee Members: Prof. Subbarao Kambhampati (chair) Prof.
Relevance Feedback Content-Based Image Retrieval Using Query Distribution Estimation Based on Maximum Entropy Principle Irwin King and Zhong Jin Nov
Comparing Offline and Online Statistics Estimation for Text Retrieval from Overlapped Collections MS Thesis Defense Bhaumik Chokshi Committee Members:
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis Luis Gravano Computer Science Department Columbia.
Supporting Queries with Imprecise Constraints Ullas Nambiar Dept. of Computer Science University of California, Davis Subbarao Kambhampati Dept. of Computer.
Adaptive Query Processing for Data Aggregation: Mining, Using and Maintaining Source Statistics M.S Thesis Defense by Jianchun Fan Committee Members: Dr.
Shared Ontology for Knowledge Management Atanas Kiryakov, Borislav Popov, Ilian Kitchukov, and Krasimir Angelov Meher Shaikh.
Presented by Zeehasham Rasheed
Query Biased Snippet Generation in XML Search Yi Chen Yu Huang, Ziyang Liu, Yi Chen Arizona State University.
Quality-driven Integration of Heterogeneous Information System by Felix Naumann, et al. (VLDB1999) 17 Feb 2006 Presented by Heasoo Hwang.
Answering Imprecise Queries over Autonomous Web Databases Ullas Nambiar Dept. of Computer Science University of California, Davis Subbarao Kambhampati.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Understanding and Predicting Graded Search Satisfaction Tang Yuk Yu 1.
Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
1 Evaluating top-k Queries over Web-Accessible Databases Paper By: Amelie Marian, Nicolas Bruno, Luis Gravano Presented By Bhushan Chaudhari University.
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Presented by Chen Yi-Ting.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Lesley Charles November 23, 2009.
OLAP : Blitzkreig Introduction 3 characteristics of OLAP cubes: Large data sets ~ Gb, Tb Expected Query : Aggregation Infrequent updates Star Schema :
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
Interoperable Visualization Framework towards enhancing mapping and integration of official statistics Haitham Zeidan Palestinian Central.
Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
1 How to make sense out of unstructured data? Yi Chen Dept. of Computer Science and Engineering Arizona State University.
OLAP Recap 3 characteristics of OLAP cubes: Large data sets ~ Gb, Tb Expected Query : Aggregation Infrequent updates Star Schema : Hierarchical Dimensions.
Uncertainty Management in Rule-based Expert Systems
Query Processing over Incomplete Autonomous Databases Presented By Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, Yi Chen, Subbarao Kambhampati.
Decision Trees Binary output – easily extendible to multiple output classes. Takes a set of attributes for a given situation or object and outputs a yes/no.
Effective Keyword-Based Selection of Relational Databases By Bei Yu, Guoliang Li, Karen Sollins & Anthony K. H. Tung Presented by Deborah Kallina.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES Swaroop Acharya,Philip B Gibbons, VishwanathPoosala By Agasthya Padisala Anusha Reddy.
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Relaxing Queries Presented by Ashwin Joshi Kapil Patil Sapan Shah.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,
A paper on Join Synopses for Approximate Query Answering
Subbarao Kambhampati (Arizona State University)
Compact Query Term Selection Using Topically Related Text
Machine Learning for Online Query Relaxation
Data Integration for Relational Web
Subbarao Kambhampati (Arizona State University)
iSRD Spam Review Detection with Imbalanced Data Distributions
Stratified Sampling for Data Mining on the Deep Web
Michal Rosen-Zvi University of California, Irvine
Panagiotis G. Ipeirotis Luis Gravano
Probabilistic Databases
Probabilistic Ranking of Database Query Results
Anthony Okorodudu CSE Answering Imprecise Queries over Autonomous Web Databases By Ullas Nambiar and Subbarao Kambhampati Anthony Okorodudu.
Presentation transcript:

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Garrett Wolf (Arizona State University) Hemal Khatri (MSN Live Search) Bhaumik Chokshi (Arizona State University) Jianchun Fan (Amazon) Yi Chen (Arizona State University) Subbarao Kambhampati (Arizona State University)

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Introduction  More and more data is becoming accessible via web servers which are supported by backend databases – E.g. Cars.com, Realtor.com, Google Base, Etc.

VLDB ‘07 Query Processing over Incomplete Autonomous Databases  Inaccurate Extraction / Recognition Incompleteness in Web Databases  Heterogeneous Schemas  User-defined Schemas  Incomplete Entry Website# of AttributesTotal TuplesIncomplete %Body Style %Engine % AutoTrader.com %3.6%8.1% CarsDirect.com %55.7%55.8% Google Base %83.36%91.98%

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Problem  Current autonomous database systems only return certain answers, namely those which exactly satisfy all the user query constraints. Want a ‘Honda Accord’ with a ‘sedan’ body style for under ‘$12,000’ High Precision Low Recall How to retrieve relevant uncertain results in a ranked fashion? Many entities corresponding to tuples with missing values might be relevant to the user query

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Low Precision, Infeasible Possible Naïve Approaches Query Q: (Body Style = Convt) Low Recall Costly, Infeasible 1. C ERTAIN O NLY : Return certain answers only as in traditional databases, namely those having Body Style = Convt 2. A LL R ETURNED : Null matches any concrete value, hence return all answers having Body Style = Convt along with answers having body style as null 3. A LL R ANKED : Return all answers having Body Style = Convt. Additionally, rank all answers having body style as null by predicting the missing values and return them to the user

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Outline  Core Techniques  Peripheral Techniques  Implementation & Evaluation  Conclusion & Future Work

VLDB ‘07 Query Processing over Incomplete Autonomous Databases The QPIAD Solution Select Top K Rewritten Queries Q 1 ’: Model=A4 Q 2 ’: Model=Z4 Q 3 ’: Model=Boxster Given a query Q:( Body=Convt ) retrieve all relevant tuples IdMakeModelYearBody 1AudiA42001Convt 2BMWZ42002Convt 3PorscheBoxster2005Convt 4BMWZ42003NULL 5HondaCivic2004NULL 6ToyotaCamry2002Sedan 7AudiA42006NULL IdMakeModelYearBody 1AudiA42001Convt 2BMWZ42002Convt 3PorscheBoxster2005Convt Base Result Set AFD: Model~> Body style IdMakeModelYearBodyConfidence 4BMWZ42003NULL0.7 7AudiA42006NULL0.3 Ranked Relevant Uncertain Answers Re-order queries based on Estimated Precision LEARN REWRITE RANK EXPLAIN

VLDB ‘07 Query Processing over Incomplete Autonomous Databases LEARN REWRITE RANK EXPLAIN

VLDB ‘07 Query Processing over Incomplete Autonomous Databases  What is hard? – Learning correlations useful for rewriting – Efficiently assessing the probability distribution – Cannot modify the underlying autonomous sources Learning Statistics to Support Ranking & Rewriting  Value Distributions - Naïve Bayes Classifiers (NBC)  Attribute Correlations - Approximate Functional Dependencies (AFDs) & Approximate Keys (AKeys) EstPrec(Q|R) = (A m =v m |dtrSet(A m )) {A i,…,A k } ~> A m 0<conf<=1 LEARN – REWRITE – RANK – EXPLAIN Make, Body ~> Model P(Model=Accord | Make=Honda, Body=Coupe) MakeModelYearBody HondaNULL2001Coupe

VLDB ‘07 Query Processing over Incomplete Autonomous Databases  Generate rewritten queries for each distinct Model: – Q1’: Model=A4 – Q2’: Model=Z4 – Q3’: Model=Boxster Rewriting to Retrieve Relevant Uncertain Results MakeModelYearBody AudiA42001Convt BMWZ42002Convt PorscheBoxster2005Convt BMWZ42003Convt Base Set for Q:(Body=Convt) AFD: Model~> Body  Given an AFD and Base Set, it is likely that tuples with 1)Model of A4, Z4 or Boxster 2)Body of NULL are actually convertible. LEARN – REWRITE – RANK – EXPLAIN  What is hard? – Retrieving relevant uncertain tuples with missing values – Rewriting queries given the limited query access patterns

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Selecting/Ordering Top-K Rewritten Queries P – Estimated Precision R – Estimated Recall  Select top-k queries based on F-Measure  Reorder queries based on Estimated precision – No need to re-rank tuples after retrieving them!  Retrieves tuples in order of their final ranks All tuples returned for a single query are ranked equally LEARN – REWRITE – RANK – EXPLAIN  What is hard? – Retrieving precise, non-empty result sets – Working under source-imposed resource limitations

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Explaining Results to the Users Explanations based on AFDs. Certain Answers Provide to the user: Relevant Uncertain Answers Explanations LEARN – REWRITE – RANK – EXPLAIN  What is hard? – Gaining the user’s trust – Generating meaningful explanations Make, Body ~> Model yields This car is 83% likely to have Model=Accord given that its Make=Honda and Body=Sedan

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Outline  Core Techniques  Peripheral Techniques  Implementation & Evaluation  Conclusion & Future Work

VLDB ‘07 Query Processing over Incomplete Autonomous Databases LS(Body, Make, Model, Year, Price, Mileage) Leveraging Correlation between Data Sources Q:(Body=Coupe) Mediator: GS(Body, Make, Model, Year, Price, Mileage) LS(Make, Model, Year, Price, Mileage) Q:(Body=Coupe) Q:(Model=Civic) Q:(Model=Corvette) … UNION Base Set + AFD (Model  Body) Two main uses: Source doesn’t support all the attributes in GS Sample/statistics aren’t available AFDs learned from Cars.com

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Handling Aggregate and Join Queries  Aggregate Queries  Join Queries MakeModelYearMileage HondaAccordNULL60,400 ToyotaCamry200421,150 NULLCivic200253,275 AudiA ,650 MakeYearPart Honda2001Brakes NULL2002Windows Toyota2005Air Bags MakeModelYearMileagePart HondaAccord NULL/ ,400Brakes NULL/ NULL Civic200253,275Windows Honda/ NULL Accord NULL/ ,400Windows Make=Honda IdMakeModelBody t1AudiA4Convt t2BMWZ4NULL P(Convt)=.9, P(Coupe)=.1 t3PorscheBoxsterConvt t4BMW325iNULL P(Convt)=.4, P(Coupe)=.6 t5HondaCivicCoupe t1 + t3 + t2 = 3 t1 + t3 +.9(t2) +.4(t4) =3.3 Include a portion of each tuple relative to the probability its missing value matches the query constraint Count(*) = 3.3 Q:(Count(*) Where Body=Convt) Only include tuples whose most likely missing value matches the query constraint Count(*) = 3

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Outline  Core Techniques  Peripheral Techniques  Implementation & Evaluation  Conclusion & Future Work

VLDB ‘07 Query Processing over Incomplete Autonomous Databases QPIAD Web Interface

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Empirical Evaluation  Datasets: – Cars Cars.com 7 attributes, 55,000 tuples – Complaints NHSTA Office of Defect Investigation 11 attributes, 200,000 tuples – Census US Census dataset, UCI repository 12 attributes, 45,000 tuples  Sample Size: – 3-15% of full database  Incompleteness: – 10% of tuples contain missing values – Artificially introduced null values in order to compare with the ground truth  Evaluation: – Ranking and Rewriting Methods (e.g. quality, efficiency, etc.) – General Queries (e.g. correlated sources, aggregates, joins, etc.) – Learning Methods (e.g. accuracy, sample size, etc.)

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Experimental Results – Ranking & Rewriting  QPIAD vs. A LL R ETURNED - Quality A LL R ETURNED – all certain answers + all answers with nulls on constrained attributes

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Experimental Results – Ranking & Rewriting  QPIAD vs. A LL R ANKED - Efficiency A LL R ANKED – all certain answers + all answers with predicted missing value probability above a threshold

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Experimental Results – General Queries  Aggregates  Joins

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Experimental Results – Learning Methods  Accuracy of Classifiers

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Experimental Summary  Rewriting / Ranking – Quality – QPIAD achieves higher precision than A LL R ETURNED by only retrieving the relevant tuples – Efficiency – QPIAD requires fewer tuples to be retrieved to obtain the same level of recall as A LL R ANKED  Learning Methods – AFDs for feature selection improved accuracy  General Queries – Aggregate queries achieve higher accuracy when missing value prediction is used – QPIAD achieves higher levels of recall for join queries while trading off only a small bit of precision  Additional Experiments – Robustness of learning methods w.r.t. sample size – Effect of alpha value on F-measure – Effectiveness of using correlation between sources

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Outline  Core Techniques  Peripheral Techniques  Implementation & Evaluation  Conclusion & Future Work

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Related Work  Querying Incomplete Databases – Possible World Approaches – tracks the completions of incomplete tuples (Codd Tables, V-Tables, Conditional Tables) – Probabilistic Approaches – quantify distribution over completions to distinguish between likelihood of various possible answers  Probabilistic Databases – Tuples are associated with an attribute describing the probability of its existence – However, in our work, the mediator does not have the capability to modify the underlying autonomous databases  Query Reformulation / Relaxation – Aims to return similar or approximate answers to the user after returning or in the absence of exact answers – Our focus is on retrieving tuples with missing values on constrained attributes  Learning Missing Values – Common imputation approaches replace missing values by substituting the mean, most common value, default value, or using kNN, association rules, etc. – Our work requires schema level dependencies between attributes as well as distribution information over missing values Our work fits here All citations found in paper

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Contributions  Efficiently retrieve relevant uncertain answers from autonomous sources given only limited query access patterns – Query Rewriting  Retrieves answers with missing values on constrained attributes without modifying the underlying databases – AFD-Enhanced Classifiers  Rewriting & ranking considers the natural tension between precision and recall – F-Measure based ranking  AFDs play a major role in: – Query Rewriting – Feature Selection – Explanations

VLDB ‘07 Query Processing over Incomplete Autonomous Databases Current Directions – QUIC (CIDR `07 Demo) Imprecise Queries User’s needs are not clearly defined:  Queries may be too general  Queries may be too specific General Solution: “Expected Relevance Ranking” Challenge: Automated & Non-intrusive assessment of Relevance and Density functions Incomplete Data Databases are often populated by:  Lay users entering data  Automated extraction Estimating Relevance (R): Learn relevance for user population as a whole in terms of value similarity  Sum of weighted similarity for each constrained attribute – Content Based Similarity – Co-click Based Similarity – Co-occurrence Based Similarity Density Function Relevance Function