Discovering the Skyline of Web Databases

Presentation transcript:

Discovering the Skyline of Web Databases. Abolfazl Asudeh, Saravanan Thirumuruganathan, Nan Zhang, Gautam Das. University of Texas at Arlington, George Washington University. © 2016 VLDB Endowment 2150-8097/16/03

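Some Terms. Hidden (web) database: a database reachable only through a limited query interface that returns a limited number of (top-k) results, chosen by its own ranking function. It has m attributes (Aj) and n tuples (ti); ti[Aj] denotes the value of tuple ti on attribute Aj.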

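Some Terms. Domination: a tuple dominates another if it is no worse on every attribute and strictly better on at least one. Example (smaller values preferred): a = [1.7, 0.9, 0.5], b = [1.7, 1.1, 0.5], so a ≻ b. Skyline: the set of tuples that are not dominated by any other tuple.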
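The two terms on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `dominates` and `skyline` are made up here, and the code assumes that smaller values are preferred on every attribute, as in the slide's example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a dominates b (assumption: smaller values are preferred)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive quadratic skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


a = (1.7, 0.9, 0.5)
b = (1.7, 1.1, 0.5)
print(dominates(a, b))  # True: a ties b on two attributes and wins on the second
print(skyline([a, b]))  # [(1.7, 0.9, 0.5)]
```

This brute-force version needs the whole table, which is exactly what a hidden database does not expose; it is useful mainly as a reference when checking the query-based algorithms.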

Why this problem? What if the user has a different ranking function in mind, for example minimizing cost per mileage? The skyline contains the top-1 of any monotonic function, that is, any function that does not prefer a dominated tuple over the dominating one. The k-skyband contains the top-k (extension details in the paper). Other applications: multi-criteria decision making, …
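The containment claim can be checked on a toy table. The sketch below assumes smaller values are preferred, takes the k-skyband to be the set of tuples dominated by fewer than k others, and uses weighted sums with strictly positive weights as example monotonic ranking functions. The data and the names `skyband` and `top_k` are illustrative only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a dominates b (assumption: smaller values are preferred)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyband(points: List[Point], k: int) -> List[Point]:
    """Tuples dominated by fewer than k other tuples; k=1 gives the skyline."""
    return [p for p in points if sum(dominates(q, p) for q in points) < k]


def top_k(points: List[Point], weights: Sequence[float], k: int) -> List[Point]:
    """Top-k under a weighted sum, one example of a monotonic ranking function."""
    return sorted(points, key=lambda p: sum(w * x for w, x in zip(weights, p)))[:k]


points = [(5, 1), (1, 4), (7, 0), (6, 6), (3, 3), (2, 5)]
for weights in [(1, 1), (3, 1), (1, 5)]:
    for k in (1, 2, 3):
        # Every top-k answer must already be inside the k-skyband.
        assert set(top_k(points, weights, k)) <= set(skyband(points, k))
print(skyband(points, 1))
```

The reason the check passes is that a tuple dominated by k or more others has at least k tuples with a strictly better score, so it can never appear among the top k.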

Problem Statement. Given: a hidden database D, with no knowledge of its ranking function except that it is domination-consistent (monotonic). Find: all skyline tuples, while minimizing the number of queries issued through the interface. Why minimize queries? Almost all such databases limit the number of queries per IP address; for example, 50 free queries per user per day in Google Flights.

Categories of Search Interfaces. Single-ended range query predicate (SQ): the user can specify only the upper bound. Range query predicate (RQ): the user is free to specify both lower and upper bounds. Point query predicate (PQ): predicates can only be equalities. Mixed query predicate (MQ): the interface contains a mixture of range and point predicates.

SQ Skyline Discovery (SQ-DB-SKY): 2D example
select *
select * where x<t1[x]
select * where y<t1[y]
select * where x<t2[x]
select * where x<t1[x] and y<t2[y]
select * where y<t1[y] and x<t3[x]
select * where y<t3[y]
Two queries per skyline tuple → O(S), where S is the skyline size.
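The query pattern in this example can be simulated offline. The sketch below is a simplified reading of the slide, not the authors' implementation. The `HiddenDB` class is a hypothetical stand-in for the web interface: it returns only the top-1 tuple (k = 1), ranks by attribute sum as an assumed monotonic ranking function, and accepts only upper-bound predicates. Each returned tuple spawns one child query per attribute, as in the list above.

```python
from typing import Dict, List, Optional, Set, Tuple

Point = Tuple[float, ...]


class HiddenDB:
    """Hypothetical simulator: top-1 interface with upper-bound predicates only."""

    def __init__(self, tuples: List[Point]) -> None:
        # Assumed ranking function: attribute sum (monotonic, smaller is better).
        self.tuples = sorted(tuples, key=sum)
        self.queries = 0

    def top1(self, upper: Dict[int, float]) -> Optional[Point]:
        """Answer `select * where A_j < upper[j] ...` with the best-ranked match."""
        self.queries += 1
        for t in self.tuples:
            if all(t[j] < bound for j, bound in upper.items()):
                return t
        return None


def sq_db_sky(db: HiddenDB, m: int) -> Set[Point]:
    """Explore the query tree: each answer t adds the predicate A_j < t[A_j]."""
    found: Set[Point] = set()
    pending: List[Dict[int, float]] = [{}]  # the root query is `select *`
    while pending:
        upper = pending.pop()
        t = db.top1(upper)
        if t is None:
            continue  # empty answer: this branch is finished
        found.add(t)
        for j in range(m):
            child = dict(upper)
            child[j] = t[j]  # t matched the parent, so this tightens the bound
            pending.append(child)
    return found


db = HiddenDB([(5, 1), (1, 4), (7, 0), (6, 6), (3, 3)])
print(sorted(sq_db_sky(db, 2)), "queries issued:", db.queries)
```

With a top-1 interface and upper bounds only, every returned tuple is a skyline tuple, because any tuple dominating it would match the same query and rank higher. With k > 1 the answers would need an extra domination check.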

SQ-DB-SKY: HD example, its problem. [Figure: query tree over a four-tuple table (t1 to t4) with attributes A1, A2, A3; only t1 = (5, 1, 9) is fully recoverable from the table. The root query q1 (select *) returns t3. Its child queries add one single-ended predicate each: q2 (where A1<1) returns null, q3 (where A2<3) returns t4, q4 (where A3<7) returns t4. Deeper in the tree, q6 (where A2<2) returns t1; q5 and q7 to q13 return null.]

SQ-DB-SKY: HD example, its problem. The algorithm may discover a skyline tuple many times (in the example tree, t4 is returned by both q3 and q4) → worst-case cost O(m · S^(m+1)). Reason: the intersection between branches is not empty. This cannot be resolved, due to the interface limitation. There exist cases in which no algorithm can do better than O(S^m)!

RQ Skyline Discovery (RQ-DB-SKY): High-level idea. Here we have the freedom to specify the lower (as well as the upper) bound, so we can partition the search space into mutually exclusive sub-spaces → discover each tuple at most once!
Example:
q1: select *
q2: select * where A1<t1[A1]
q3: select * where A1≥t1[A1] and A2<t1[A2]
q4: select * where A1≥t1[A1] and A2≥t1[A2] and A3<t1[A3]
…
Problem: not every returned tuple is a skyline tuple! → Pure partitioning can be as bad as crawling all the tuples. Resolution: combine it with SQ-DB-SKY; if a query matches one of the previously discovered skyline tuples, switch to partitioning mode.
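The partitioning step alone can be sketched as follows. The helper names `partition_queries` and `matches` are hypothetical, and the code covers only how the mutually exclusive range queries are built around one discovered tuple t, not the full RQ-DB-SKY algorithm.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# A range query is a pair (lower, upper): A_j >= lower[j] and A_j < upper[j].
RangeQuery = Tuple[Dict[int, float], Dict[int, float]]


def partition_queries(t: Sequence[float]) -> List[RangeQuery]:
    """Query i asks for A_j >= t[A_j] on all earlier attributes and A_i < t[A_i]."""
    queries: List[RangeQuery] = []
    for i in range(len(t)):
        lower = {j: t[j] for j in range(i)}
        upper = {i: t[i]}
        queries.append((lower, upper))
    return queries


def matches(p: Sequence[float], query: RangeQuery) -> bool:
    """True if point p satisfies every lower and upper bound of the query."""
    lower, upper = query
    return (all(p[j] >= v for j, v in lower.items())
            and all(p[j] < v for j, v in upper.items()))


t = (3, 2, 3)
queries = partition_queries(t)
for p in product(range(6), repeat=3):
    hits = sum(matches(p, q) for q in queries)
    # A point that beats t on some attribute lands in exactly one partition;
    # a point that is no better than t anywhere lands in none.
    assert hits == (1 if any(x < y for x, y in zip(p, t)) else 0)
print(queries)
```

The first attribute on which a point beats t decides its partition, which is why no tuple can be returned by two sibling queries.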

RQ-DB-SKY: example. [Figure: query tree over the same four-tuple table (t1 to t4, attributes A1, A2, A3). q1 (select *) returns t3; q2 (where A1<1) returns null; q3 (where A2<3) returns t4; q4 (where A3<7) returns t4 again, and the range query R(q4) (where A3<7 and A2≥3) returns null, so that branch is crossed out (×). q6 (where A2<2) returns t1; q5 and q7 to q10 return null.]

PQ 2D Skyline Discovery (PQ-2D-SKY): example
select * → t1[5,1]
select * where x=0 → null
select * where x=1 → t2[1,4]
select * where y=2 → null
select * where y=3 → null
select * where y=0 → t3[7,0]
Proved to be instance optimal.

PQ Skyline Discovery (PQ-DB-SKY): HD. For m>2, the problem changes drastically: unlike in the 2D case, instance optimality becomes provably unachievable! Even for a greedy solution over all 2D subspaces, PQ-2D-SKY is not directly applicable → PQ-2DSUB-SKY. High-level greedy heuristic: prune the search space based on the first discovered tuple; then, while the search space is not fully explored, pick the 2D subspace with the largest domain sizes and apply PQ-2DSUB-SKY to identify its skyline tuples.

MQ Skyline Discovery (MQ-DB-SKY): a combination of the previously discussed algorithms. High-level idea: apply RQ-DB-SKY (or SQ-DB-SKY if the interface is single-ended) on the range predicates. Find the regions that are dominated on the range attributes according to the current skyline. For each point-predicate value that can lead to a new skyline tuple in the dominated regions, check whether the query on that value and region contains more than k tuples (while updating the skyline). If so, crawl the tuples in its 2D subspaces and update the skyline.

Experiments setup. Offline experiments: the hidden DB is simulated on top of an offline dataset. US Department of Transportation (DOT): 457,013 tuples and over 28 attributes. Online experiments: Blue Nile (BN) diamonds: largest online retailer of diamonds; contained 209,666 tuples (diamonds) over 6 attributes. Google Flights (GF): one of the largest flight search services; 4 ordinal attributes. Yahoo! Autos (YA): offers a popular search service for used cars; contained 125,149 cars within 30 miles of New York City; 3 ordinal attributes.

Offline Experiment Results. [Figure panels: RQ, impact of k; RQ, impact of n; RQ, impact of m]

Offline Experiment Results. [Figure panels: PQ, impact of n and m; MQ, impact of n; MQ, impact of m]

Online Experiment Results. [Figure panels: BN, anytime property; GF, anytime property; YA, anytime property]

Questions?