Machine Learning for Online Query Relaxation


Machine Learning for Online Query Relaxation. Ion Muslea, SRI International, 333 Ravenswood, Menlo Park, CA 94025. Published in SIGKDD 2004.

Ion Muslea Director of Research & Product Development, Scalability and Big Data at SDL

What is the paper about? The failing-query problem: given a query that returns an empty answer, how can one relax the query's constraints so that it returns a non-empty set of tuples?

Motivation

Why do queries fail? Too many constraints; the database does not have enough tuples; and users often want every constraint to be satisfied.

Intuition: relax failing queries by relaxing their constraints. Discover implicit relationships among the various domain attributes, and use this knowledge to relax the constraints.

Intuition explained: the example laptop query fails! The reason lies in implicit domain relationships: laptops that have large screens (i.e., Display ≥ 1700) weigh more than three pounds, and fast laptops with large hard disks (CPU ≥ 2.5 GHz and HDD ≥ 60 GB) cost more than $2,000.

Intuition formalized. Step 1: extract domain knowledge in the form of rules. Step 2: find the “most useful” rule. Step 3: relax the failing query.

Intuition explained, Step 1: extracting domain knowledge. Take a randomly-chosen subset D' of the database. For each constraint in Q0 (e.g., CPU ≥ 2.5 GHz), use D' to find patterns that predict whether this constraint is satisfied (see the sketch below).
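
As an illustration of Step 1, the sketch below labels a toy sample D' by whether each tuple satisfies one constraint of Q0 and trains a small decision tree to predict that label from the remaining attributes. The paper uses C4.5 to learn rules; scikit-learn's DecisionTreeClassifier, the attribute names, and the values here are stand-ins chosen for illustration, not the paper's implementation.

```python
# Sketch of Step 1 (illustrative only; the paper learns rules with C4.5).
# For one constraint of the failing query Q0 (e.g., CPU >= 2.5), label each
# tuple in the random sample D' by whether it satisfies that constraint and
# learn to predict the label from the OTHER attributes.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical laptop sample D' with the five attributes used in the paper.
d_prime = pd.DataFrame({
    "price":  [1899, 2499, 1299, 3100, 999],
    "cpu":    [2.6, 2.8, 1.7, 3.0, 1.5],
    "ram":    [512, 1024, 256, 1024, 256],
    "hdd":    [60, 80, 40, 100, 30],
    "weight": [3.4, 3.8, 2.4, 4.1, 2.2],
})

constraint_attr, threshold = "cpu", 2.5          # constraint from Q0: CPU >= 2.5
labels = d_prime[constraint_attr] >= threshold   # does each tuple satisfy it?
features = d_prime.drop(columns=[constraint_attr])

tree = DecisionTreeClassifier(max_depth=2).fit(features, labels)
print(export_text(tree, feature_names=list(features.columns)))
# The resulting tree paths play the role of the learned rules, e.g.
# "price > 2000 -> the constraint CPU >= 2.5 is satisfied".
```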

Intuition explained, Step 2: finding the “most useful” rule. The learned rules are converted into existential statements.

Intuition explained, Step 2: finding the “most useful” rule. Find the statement Q1 that is the most similar to Q0. How? Using nearest-neighbor techniques (see the sketch below).
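
The sketch below shows one way such a nearest-neighbor selection could look: each candidate statement and the failing query Q0 are encoded as attribute-to-value maps, and the statement with the smallest normalized distance to Q0 is chosen. The distance formula and all values are assumptions for illustration, not the paper's exact similarity measure.

```python
# Sketch of Step 2 (illustrative): pick the existential statement closest to
# the failing query Q0 under a normalized per-attribute distance.

def distance(stmt, q0, ranges):
    """Average normalized difference over the attributes both queries share."""
    common = set(stmt) & set(q0)
    return sum(abs(stmt[a] - q0[a]) / ranges[a] for a in common) / len(common)

# Hypothetical attribute ranges observed in the sample D'.
ranges = {"price": 3000, "cpu": 2.0, "hdd": 80, "weight": 3.0}

q0 = {"price": 2000, "cpu": 2.5, "hdd": 60, "weight": 3.0}   # failing query Q0
candidates = [                                               # existential statements
    {"price": 2400, "cpu": 2.6, "hdd": 80, "weight": 3.6},
    {"price": 2100, "cpu": 2.5, "hdd": 40, "weight": 3.2},
]

q1 = min(candidates, key=lambda s: distance(s, q0, ranges))
print("most similar statement Q1:", q1)
```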

Intuition explained, Step 3: relaxing the failing query. How do we get Qr from Q1 and Q0? By dropping the original constraint on the hard disk, keeping the constraint on CPU unchanged, and setting the values in the constraints on Price, Display, and Weight to the least constraining ones (a sketch follows).
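
The sketch below mimics that combination step: constraints of a hypothetical Q0 are dropped, kept, or loosened to the least constraining of the two values, following the example on this slide. The query encoding and the hard-coded drop/keep sets are illustrative assumptions; the actual algorithm derives these decisions from the learned rules.

```python
# Sketch of Step 3 (illustrative): combine the failing query Q0 with the
# chosen statement Q1 to obtain the relaxed query Qr.

q0 = {"price": ("<=", 2000), "cpu": (">=", 2.5), "hdd": (">=", 60),
      "display": (">=", 1700), "weight": ("<=", 3.0)}
q1 = {"price": ("<=", 2400), "cpu": (">=", 2.5),
      "display": (">=", 1700), "weight": ("<=", 3.6)}

drop, keep = {"hdd"}, {"cpu"}     # hard-coded here to mirror the slide's example
qr = {}
for attr, (op, v0) in q0.items():
    if attr in drop:
        continue                  # drop the original constraint entirely
    if attr in keep or attr not in q1:
        qr[attr] = (op, v0)       # keep the constraint unchanged
    else:
        _, v1 = q1[attr]
        # least constraining value: larger bound for "<=", smaller for ">="
        qr[attr] = (op, max(v0, v1) if op == "<=" else min(v0, v1))

print(qr)   # e.g. price <= 2400, cpu >= 2.5, display >= 1700, weight <= 3.6
```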

Intuition explained, Step 3: relaxing the failing query. Price and HDD are dropped out.

LOQR: Learning for Online Query Relaxation.

Step 1: Extracting domain knowledge

Step 2: Finding the “refiner statement”

Step 3: Refining the failing conjunction

Experiments. Five algorithms are evaluated: loqr, loqr-50 (a variant of loqr), loqr-90 (another variant of loqr), s-nn (baseline 1), and r-nn (baseline 2).

s-nn: find the example Ex ∈ D that is the most similar to Ck, use Ex to create a conjunction Ck', and use Ck' as the relaxed query. It does not learn rules at all.

r-nn: apply s-nn to obtain Ck', use Ck' to relax Ck, and run the resulting relaxed query. Like s-nn, it does not learn rules at all.

loqr-50 vs. loqr-90. loqr-90 generates over-relaxed queries that are highly unlikely to fail but return a (relatively) large number of tuples: it allows 90% of the possible tuples. loqr-50 creates under-relaxed queries that return fewer tuples but are more likely to fail: it allows 50% of the possible tuples.

Datasets. Six different datasets are used. Laptops: 1257 laptop configurations extracted from yahoo.com, with five numeric attributes (price, CPU speed, RAM, HDD space, and weight). The other five come from the UC Irvine repository: breast cancer Wisconsin (bcw), low resolution spectrometer (lrs), Pima Indians diabetes (pima), water treatment plant (water), and waveform data generator (wave).

The setup. Given a failing query Q and a dataset D, each algorithm uses D to generate a relaxed query QR. QR is then evaluated on a test set consisting of all examples in the target database except the ones in D. The size of D varies over 50, 100, 150, ..., 350 examples, with 100 arbitrary instances of D for each size.

The setup. There are 7 failing queries. For each query, each query-relaxation algorithm is run 100 times, and the results reported here are the average of these 700 runs.

Performance measures. Robustness: what percentage of the failing queries are successfully relaxed (i.e., they no longer fail)? Coverage: what percentage of the examples in the test set satisfy the relaxed query? (A sketch of both measures follows.)
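
A minimal sketch of how these two measures could be computed, assuming relaxed queries are encoded as attribute-to-(operator, value) maps; the helper names and toy values are assumptions, not the paper's notation.

```python
# Sketch of the two evaluation measures (illustrative).

def satisfies(example, query):
    """query maps attribute -> (op, value), with op in {'<=', '>='}."""
    return all(example[a] <= v if op == "<=" else example[a] >= v
               for a, (op, v) in query.items())

def coverage(query, test_set):
    """Fraction of test-set examples that satisfy the relaxed query."""
    return sum(satisfies(t, query) for t in test_set) / len(test_set)

def robustness(relaxed_queries, test_set):
    """Fraction of relaxed queries that return at least one example."""
    return sum(coverage(q, test_set) > 0 for q in relaxed_queries) / len(relaxed_queries)

# Toy usage with made-up values.
test_set = [{"price": 2300, "cpu": 2.6}, {"price": 2600, "cpu": 1.8}]
qr = {"price": ("<=", 2400), "cpu": (">=", 2.5)}
print(coverage(qr, test_set), robustness([qr], test_set))   # 0.5 1.0
```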

Results (Robustness)

Results (Robustness). loqr obtains by far the best results; s-nn and r-nn display extremely poor robustness; loqr-50 and loqr-90 are better than the baselines.

Results (Coverage)

Results (Coverage). Are low-coverage results preferred? Yes, but only if the algorithm is also robust: a low-coverage, non-robust algorithm is of little practical importance. loqr's coverage is not so spectacular, while loqr-90 is excellent: the authors claim robustness levels between 69% and 98% with coverage under 5%.

Time complexity. loqr is extremely fast. Its running time depends on the size of the dataset D and on the number of attributes in the query: loqr creates a new dataset for each attribute in the query, so the more attributes the query has, the longer it takes to process.

Online vs. offline learning. OFF-k, an offline variant, performs the learning step only once, independently of the query's constraints. For discrete attributes, it learns to predict each discrete value from the values of the other attributes; for continuous attributes, it first discretizes the attribute's range of values in D (a sketch follows).
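
The sketch below shows one way such a discretization could look, assuming k equal-width intervals over the attribute's observed range; the binning scheme and the toy values are assumptions for illustration, not the paper's exact procedure for OFF-k.

```python
# Sketch of discretizing a continuous attribute into k intervals (illustrative;
# equal-width binning is an assumption, not necessarily what OFF-k uses).
import pandas as pd

k = 3
prices = pd.Series([999, 1299, 1899, 2499, 3100])          # toy attribute values from D
bins = pd.cut(prices, bins=k, labels=[f"bin{i}" for i in range(k)])
print(bins.tolist())   # ['bin0', 'bin0', 'bin1', 'bin2', 'bin2']
# The offline learner would then predict these interval labels instead of a
# query-specific "constraint satisfied / not satisfied" label.
```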

Online vs. offline learning. Two offline versions (i.e., k = 2 and k = 3) are evaluated for both loqr and loqr-90.

Online vs. offline learning. Both loqr and loqr-90 clearly outperform their offline variants.

Query-driven learning: four main scenarios. (1) No constraints: offline learning. (2) Class-attribute constraints: LOQR. (3) A set of hard constraints: a subset of the constraints must be satisfied. (4) All constraints simultaneously: replacing the original values of all the attributes.

Some issues. Sampling: should it be entirely random? Should separate datasets be created for each attribute? C4.5 is a greedy algorithm; is that a problem? Only the closest rule is used to relax the query; why not use more?

Conclusion. A novel, data-driven approach to query relaxation. loqr is a fast algorithm that successfully relaxes the vast majority of the failing queries.