
QUIC: Handling Query Imprecision & Data Incompleteness in Autonomous Databases

Subbarao Kambhampati (Arizona State University)
Garrett Wolf (Arizona State University)
Yi Chen (Arizona State University)
Hemal Khatri (Arizona State University, currently at Microsoft)
Bhaumik Chokshi (Arizona State University)
Jianchun Fan (Arizona State University)
Ullas Nambiar (IBM Research, India)

Challenges in Querying Autonomous Databases

Imprecise Queries -- the user's needs are not clearly defined, hence:
- Queries may be too general
- Queries may be too specific

Incomplete Data -- databases are often populated by:
- Lay users entering data
- Automated extraction

General solution: "Expected Relevance Ranking", built on a Relevance function and a Density function (a plausible formalization is sketched after this slide's notes).
Challenge: Automated and non-intrusive assessment of the Relevance and Density functions
Challenge: Rewriting a user's query to retrieve highly relevant similar/incomplete tuples
Challenge: Providing explanations for the uncertain answers in order to gain the user's trust

On the web there exist many autonomous databases that users access via form-based interfaces. When querying these databases, users often do not clearly define what they are looking for. They may specify queries that are overly general (e.g., a user may ask Q:(Model=Civic) when what they really want is a "Civic" with low mileage). Similarly, they may specify queries that are overly specific (e.g., a user may ask Q:(Model=Civic) when what they really want is a reliable Japanese car, in which case an "Accord" or "Corolla" may suit their needs). Therefore, in addition to returning the tuples that exactly satisfy the user's query constraints, we would also like to return tuples whose values are similar to the original query constraints.

Beyond posing imprecise queries, another concern is that the data provided by autonomous databases may be incomplete due to the methods used to populate them. Many autonomous databases are populated by lay web users entering data through forms (e.g., a user trying to sell their car may enter the Model as "Civic" but omit the Make, assuming it is obvious). Others are populated by automated extraction techniques, which often cannot extract all the desired information, especially when dealing with free-form text as on Craigslist.com. Therefore, in addition to the tuples that exactly satisfy the query constraints, we would also like to return tuples that have null/missing values on the constrained attributes but are highly likely to be relevant to the user.

Thus, we would like to rewrite the user's original query in order to retrieve such similar and incomplete tuples. Rather than sending the rewritten queries to the autonomous database at random, we issue them intelligently, so that the tuples they return are likely to be highly relevant to the user while keeping network and processing costs manageable. A general solution to this problem is a model we call Expected Relevance Ranking (ERR), which ranks tuples in order of their expected relevance to the user. The ERR model is defined in terms of Relevance and Density functions.

*** CHALLENGE 1 *** This model brings forth our first challenge: how do we automatically and non-intrusively assess the relevance and density functions?

Once the ranking model has been established, we must go back and consider how the query rewriting should work in the first place. *** CHALLENGE 2 *** How can we rewrite the user's original query to bring back both similar and incomplete tuples -- that is, how can we retrieve similar/incomplete tuples in the first place?

After we figure out how to rewrite the user's query and rank the queries/tuples in order of their expected relevance, a final challenge remains. *** CHALLENGE 3 *** Given that we are showing the user tuples that do not exactly satisfy the constraints of their query, how can we explain the results so as to gain the user's trust? Once the similar/incomplete tuples have been retrieved, why should users believe them?

As you are about to leave this slide, note that handling autonomous databases naturally brings together challenges that cross the traditional IR and DB boundaries.
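Before moving on: one plausible way to write the ERR model down (the transcript never states the formula explicitly, so this is a hedged formalization, assuming expected relevance marginalizes over the possible completions of a tuple's missing values):

```latex
% Expected relevance of tuple t for query Q, marginalizing over the
% possible completions \hat{t} of t's missing values:
%   R(\hat{t} \mid Q) -- relevance function: how well \hat{t} matches Q
%   P(\hat{t} \mid t) -- density function: how likely completion \hat{t} is
\[
  \mathrm{ERR}(t \mid Q)
    \;=\; \sum_{\hat{t}\,\in\,\mathrm{completions}(t)}
          R(\hat{t} \mid Q)\; P(\hat{t} \mid t)
\]
% For a complete tuple this reduces to R(t | Q); for an incomplete tuple it
% weights the relevance of each possible completion by its estimated density.
```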


Expected Relevance Ranking Model

Problem: How do we automatically and non-intrusively assess the Relevance and Density functions?

Estimating Relevance (R): learn relevance for the user population as a whole, in terms of value similarity -- a sum of weighted similarities over the constrained attributes, using:
- Content-based similarity (mined from a probed sample using SuperTuples)
- Co-click-based similarity (Yahoo Autos recommendations)
- Co-occurrence-based similarity (GoogleSets)

Estimating Density (P): learn the density of each attribute independently of the other attributes:
- AFDs used for feature selection
- AFD-enhanced Naive Bayes Classifiers (NBC)

Right after the yellow problem statement, note that there is a lot of work in the AI and machine learning communities directed at density and relevance estimation; our challenge is to adapt the right techniques for autonomous databases.

To assess the relevance and density functions automatically and non-intrusively, we start off by obtaining a sample of the database and mining attribute correlations from it. Given the autonomous database, we first issue probing queries whose results are used to build a sample. Next, we feed the sample (in our case, 10% of the original database size) to the TANE algorithm, which discovers Approximate Functional Dependencies (AFDs). For example, we may learn the AFD {Make, Body Style ~~> Model}: given a Honda that is a coupe, the Model is likely to be "Civic" with some confidence. These AFDs play an important role in the QUIC system. They are used to compute attribute importance, which in turn feeds the relevance calculation; they serve as a feature-selection tool when building classifiers for the density calculation; and they drive the query rewriting we will see later on.

Estimating Relevance: QUIC learns relevance for the entire user population rather than for each individual user. Moreover, while relevance can be defined in terms of many metrics, we take relevance to equal the similarity between attribute values. We experimented with three similarity metrics. The first is content-based similarity, which forms SuperTuples (Ullas's prior work) and computes the Jaccard similarity between the SuperTuples using bag semantics. The second is co-click-based similarity, for which we mined the collaborative recommendations from a car website (Yahoo Autos): given a web page for a car, the page often contains 1-3 other recommended cars that were highly viewed by users who also viewed the current page. From these recommendations we constructed an undirected graph whose nodes represent attribute values; multiple links between two nodes are aggregated into a single link whose weight equals the total number of links between the nodes prior to aggregation. We then used a modified version of Dijkstra's shortest-path algorithm in which the length of a path is the product of the link weights along it; the link weights are normalized to lie between 0 and 1, so the product decreases as more links are traversed (a sketch of this step appears below). Third, we similarly mined co-occurrence statistics using GoogleSets, built an analogous graph between attribute values, and used the same modified shortest-path algorithm to compute the similarity between two attribute values.
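To make the graph-similarity step concrete, here is a minimal sketch. Assumptions: edge weights are already aggregated and normalized into (0, 1], and the car-model graph and its weights below are invented for illustration. Maximizing a product of weights in (0, 1] is the same as running ordinary Dijkstra on costs of -log(weight):

```python
import heapq
import math

def coclick_similarity(edges, source, target):
    """Best-path similarity between two attribute values in a co-click graph.

    edges maps node -> {neighbor: weight} with weights in (0, 1]; the
    similarity of a path is the product of its edge weights, so Dijkstra
    is run on summed -log(weight) costs.
    """
    best = {source: 0.0}                    # lowest -log cost found so far
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return math.exp(-cost)          # convert cost back to a product
        if cost > best.get(node, math.inf):
            continue                        # stale heap entry, skip
        for neighbor, weight in edges.get(node, {}).items():
            next_cost = cost - math.log(weight)
            if next_cost < best.get(neighbor, math.inf):
                best[neighbor] = next_cost
                heapq.heappush(heap, (next_cost, neighbor))
    return 0.0                              # target unreachable

# Invented example: normalized co-click weights between car models.
graph = {"Civic":   {"Accord": 0.8, "Corolla": 0.6},
         "Accord":  {"Civic": 0.8, "Camry": 0.7},
         "Corolla": {"Civic": 0.6, "Camry": 0.5},
         "Camry":   {"Accord": 0.7, "Corolla": 0.5}}
print(coclick_similarity(graph, "Civic", "Camry"))  # 0.8 * 0.7 ≈ 0.56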
Estimating Density: QUIC makes the assumption that attribute values are missing independently of the other attributes. Using this assumption, we learn an NBC classifier for each attribute and use these classifiers to find the missing-value distributions. When constructing the classifiers, we use the AFD's determining-set attributes for feature selection. For example, given the AFD {Make, Body Style ~~> Model}, to find the distribution on Model we restrict the classifier's features to just Make and Body Style, as opposed to the entire set of attributes. A table on the slide shows an example of what the resulting relevance and density measurements might look like; a sketch of the classifier appears below.
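A minimal sketch of the AFD-enhanced classifier (assumptions: categorical attributes and add-one smoothing; the sample rows below are invented -- the real system learns from its probed ~10% sample of the autonomous database):

```python
from collections import Counter, defaultdict

def missing_value_distribution(sample, target, determining_set, incomplete):
    """Naive Bayes estimate of P(target = v | AFD determining-set attributes).

    Only the determining-set attributes serve as features (e.g. for the AFD
    {Make, Body Style ~~> Model}, just Make and Body Style). Add-one
    smoothing keeps unseen feature/class pairs from zeroing out a candidate.
    """
    class_counts = Counter(row[target] for row in sample)
    cond = {attr: defaultdict(Counter) for attr in determining_set}
    for row in sample:
        for attr in determining_set:
            cond[attr][row[target]][row[attr]] += 1

    scores = {}
    for value, n_class in class_counts.items():
        score = n_class / len(sample)                    # prior P(value)
        for attr in determining_set:                     # likelihood terms
            vocab = len({row[attr] for row in sample})   # distinct attr values
            seen = cond[attr][value][incomplete[attr]]
            score *= (seen + 1) / (n_class + vocab)
        scores[value] = score

    total = sum(scores.values())                         # normalize to a pmf
    return {v: s / total for v, s in scores.items()}

# Invented probed sample of a used-car database.
sample = [
    {"Make": "Honda",  "Body Style": "coupe", "Model": "Civic"},
    {"Make": "Honda",  "Body Style": "coupe", "Model": "Civic"},
    {"Make": "Honda",  "Body Style": "sedan", "Model": "Accord"},
    {"Make": "Toyota", "Body Style": "sedan", "Model": "Corolla"},
]
print(missing_value_distribution(
    sample, "Model", ["Make", "Body Style"],
    {"Make": "Honda", "Body Style": "coupe"}))   # Civic dominates
```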

Retrieving Relevant Answers via Query Rewriting

Problem: How do we rewrite a query to retrieve answers that are highly relevant to the user?

Given a query Q:(Model ≈ Civic), retrieve all the relevant tuples:
- First retrieve the certain answers, namely tuples t1 and t6 (the base result set).
- Given an AFD, rewrite the query using the determining-set attribute values of the base result tuples in order to retrieve possible answers:
  Q1': Make=Honda Λ Body Style=coupe
  Q2': Make=Honda Λ Body Style=sedan

Thus we retrieve certain answers, incomplete answers, and similar answers (a sketch of the rewriting step follows).
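A minimal sketch of this rewriting step (assuming the base result set has already been fetched and the AFD's determining set is known; attribute values mirror the slide's example):

```python
def rewrite_query(base_result_set, determining_set):
    """Generate rewritten queries from the determining-set values of the
    certain (base) answers, following the AFD-driven rewriting above.
    Duplicate value combinations are emitted once; ordering the rewritten
    queries by expected relevance happens in a separate step.
    """
    seen, rewritten = set(), []
    for t in base_result_set:
        key = tuple((attr, t[attr]) for attr in determining_set)
        if key not in seen:
            seen.add(key)
            rewritten.append(dict(key))
    return rewritten

# Base result set for Q:(Model = Civic); values mirror the slide's example.
base = [
    {"Make": "Honda", "Body Style": "coupe", "Model": "Civic"},  # t1
    {"Make": "Honda", "Body Style": "sedan", "Model": "Civic"},  # t6
]
for q in rewrite_query(base, ["Make", "Body Style"]):
    print(q)
# {'Make': 'Honda', 'Body Style': 'coupe'}   -> Q1'
# {'Make': 'Honda', 'Body Style': 'sedan'}   -> Q2'
```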

Explaining Results to Users

Problem: How do we gain users' trust when showing them similar/incomplete tuples?

In traditional databases, users are shown only the certain tuples, so they have no reason to doubt the results presented to them. Once we consider imprecise queries and incomplete data, however, users may hesitate to fully trust the answers they are shown. To gain the user's trust, QUIC must therefore provide explanations outlining the reasoning behind each answer it provides. The slide shows a snapshot of some of the explanations QUIC produces; they are derived from the AFDs, relevance scores, and density calculations. For incomplete tuples, the values of the determining-set attributes are used to justify the prediction, along with the probability that the missing value is in fact the value the user was looking for. For similar tuples, the user is given an explanation stating how similar the tuple's value is to the query's constrained value. In addition, when co-click/co-occurrence statistics are available, the user is shown how many people who viewed "Car A" also viewed "Car B".
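As an illustration only (the wording and the 0.83 probability below are invented; QUIC's actual explanation templates may differ), an incomplete-tuple explanation could be rendered like this:

```python
def explain_incomplete(tuple_values, target, predicted, determining_set, prob):
    """Render a trust-building explanation for an incomplete answer: the
    determining-set values justify the prediction, alongside its probability.
    """
    evidence = ", ".join(f"{a} = {tuple_values[a]}" for a in determining_set)
    return (f"{target} is missing, but cars with {evidence} have "
            f"{target} = '{predicted}' with probability {prob:.2f}.")

car = {"Make": "Honda", "Body Style": "coupe"}
# The 0.83 is a made-up density estimate for illustration.
print(explain_incomplete(car, "Model", "Civic", ["Make", "Body Style"], 0.83))
# -> Model is missing, but cars with Make = Honda, Body Style = coupe have
#    Model = 'Civic' with probability 0.83.
```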

Conclusion

QUIC is able to handle both imprecise queries and incomplete data over autonomous databases:
- Through an automatic and non-intrusive assessment of relevance and density functions, QUIC ranks tuples in order of their expected relevance to the user.
- By rewriting the original user query, QUIC efficiently retrieves both similar and incomplete answers.
- By explaining why users are shown answers that do not exactly match the query constraints, QUIC gains the user's trust.