ReDrive: Result-Driven Database Exploration through Recommendations Marina Drosou, Evaggelia Pitoura Computer Science Department University of Ioannina.

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science.
Raghavendra Madala. Introduction Icicles Icicle Maintenance Icicle-Based Estimators Quality Guarantee Performance Evaluation Conclusion 2 ICICLES: Self-tuning.
Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
CS4432: Database Systems II
Representing and Querying Correlated Tuples in Probabilistic Databases
A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad.
Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
Frequent Closed Pattern Search By Row and Feature Enumeration
Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.
Fast Algorithms For Hierarchical Range Histogram Constructions
Case Study: BibFinder BibFinder: A popular CS bibliographic mediator –Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect,
PRIVACY AND SECURITY ISSUES IN DATA MINING P.h.D. Candidate: Anna Monreale Supervisors Prof. Dino Pedreschi Dott.ssa Fosca Giannotti University of Pisa.
Sampling Large Databases for Association Rules ( Toivenon’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007.
Top-k Query Evaluation with Probabilistic Guarantees By Martin Theobald, Gerald Weikum, Ralf Schenkel.
Ming Hua, Jian Pei Simon Fraser UniversityPresented By: Mahashweta Das Wenjie Zhang, Xuemin LinUniversity of Texas at Arlington The University of New South.
Preferential Publish/Subscribe Marina Drosou Department of Computer Science University of Ioannina, Greece Joint work with Evaggelia Pitoura and Kostas.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
A Generic Framework for Handling Uncertain Data with Local Correlations Xiang Lian and Lei Chen Department of Computer Science and Engineering The Hong.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Perk: Personalized Keyword Search in Relational Databases through Preferences Marina Drosou Computer Science Department University of Ioannina, Greece.
ACM GIS An Interactive Framework for Raster Data Spatial Joins Wan Bae (Computer Science, University of Denver) Petr Vojtěchovský (Mathematics,
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
“You May Also Like” Results in Relational Databases Kostas Stefanidis Department of Computer Science University of Ioannina, Greece Joint work with Marina.
Vassilios V. Dimakopoulos and Evaggelia Pitoura Distributed Data Management Lab Dept. of Computer Science, Univ. of Ioannina, Greece
Fast Algorithms for Association Rule Mining
1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.
Marina Drosou Department of Computer Science University of Ioannina, Greece Thesis Advisor: Evaggelia Pitoura
PRESENTED BY- HARSH SINGH A Random Walk Approach to Sampling Hidden Databases By Arjun Dasgupta, Dr. Gautam Das and Heikki Mannila.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Mining Association Rules of Simple Conjunctive Queries Bart Goethals Wim Le Page Heikki Mannila SIAM /8/261.
Multiple testing correction
Discovering Interesting Subsets Using Statistical Analysis Maitreya Natu and Girish K. Palshikar Tata Research Development and Design Centre (TRDDC) Pune,
Mining High Utility Itemsets without Candidate Generation Date: 2013/05/13 Author: Mengchi Liu, Junfeng Qu Source: CIKM "12 Advisor: Jia-ling Koh Speaker:
1 On Querying Historical Evolving Graph Sequences Chenghui Ren $, Eric Lo *, Ben Kao $, Xinjie Zhu $, Reynold Cheng $ $ The University of Hong Kong $ {chren,
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Query Routing in Peer-to-Peer Web Search Engine Speaker: Pavel Serdyukov Supervisors: Gerhard Weikum Christian Zimmer Matthias Bender International Max.
Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )
1 A Theoretical Framework for Association Mining based on the Boolean Retrieval Model on the Boolean Retrieval Model Peter Bollmann-Sdorra.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Searching for Extremes Among Distributed Data Sources with Optimal Probing Zhenyu (Victor) Liu Computer Science Department, UCLA.
MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Reverse Top-k Queries Akrivi Vlachou *, Christos Doulkeridis *, Yannis Kotidis #, Kjetil Nørvåg * *Norwegian University of Science and Technology (NTNU),
K-Hit Query: Top-k Query Processing with Probabilistic Utility Function SIGMOD2015 Peng Peng, Raymond C.-W. Wong CSE, HKUST 1.
Dynamic Diversification of Continuous Data Marina Drosou and Evaggelia Pitoura Computer Science Department University of Ioannina, Greece
Instructor : Prof. Marina Gavrilova. Goal Goal of this presentation is to discuss in detail how data mining methods are used in market analysis.
Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin.
Outline Introduction – Frequent patterns and the Rare Item Problem – Multiple Minimum Support Framework – Issues with Multiple Minimum Support Framework.
Efficient Processing of Top-k Spatial Preference Queries
What is Data Mining? process of finding correlations or patterns among dozens of fields in large relational databases process of finding correlations or.
Marina Drosou, Evaggelia Pitoura Computer Science Department
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
1 AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hong.
Date: 2012/07/02 Source: Marina Drosou, Evaggelia Pitoura (CIKM’11) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.
Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.
Bug Localization with Association Rule Mining Wujie Zheng
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Computer Science and Engineering Jianye Yang 1, Ying Zhang 2, Wenjie Zhang 1, Xuemin Lin 1 Influence based Cost Optimization on User Preference 1 The University.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Intelligent Exploration for Genetic Algorithms Using Self-Organizing.
CACTUS-Clustering Categorical Data Using Summaries
CARPENTER Find Closed Patterns in Long Biological Datasets
Feifei Li, Ching Chang, George Kollios, Azer Bestavros
DATABASE HISTOGRAMS E0 261 Jayant Haritsa
Efficient Processing of Top-k Spatial Preference Queries
Presentation transcript:

ReDrive: Result-Driven Database Exploration through Recommendations Marina Drosou, Evaggelia Pitoura Computer Science Department University of Ioannina

ReDrive: Result-Driven Database Exploration through Recommendations Motivation Users interact with databases by formulating queries Exploratory queries ▫No clear understanding of information needs ▫Not knowing the exact content of the database 2 DMOD Lab, University of Ioannina Assisting users in database exploration be recommending to them additional items that are highly related with the items in the result of their original user query Goal:

ReDrive: Result-Driven Database Exploration through Recommendations Example DMOD Lab, University of Ioannina 3 Show me movies directed by F.F. Coppola DirectorTitleYearGenre F.F. CoppolaTetro2009Drama F.F. CoppolaYouth Without Youth2007Fantasy F.F. CoppolaThe Godfather1972Drama F.F. CoppolaRumble Fish1983Drama F.F. CoppolaThe Conversation1974Thriller F.F. CoppolaThe Outsiders1983Drama F.F. CoppolaSupernova2000Thriller F.F. CoppolaApocalypse Now1979Drama Query Q Query Result Res(Q)

ReDrive: Result-Driven Database Exploration through Recommendations Outline The ReDRIVE framework ▫faSets, interesting faSets, recommendations Top-k faSets computation ▫Statistics maintenance, Two-Phase algorithm Experimental Results DMOD Lab, University of Ioannina 4

ReDrive: Result-Driven Database Exploration through Recommendations Outline The ReDRIVE framework ▫faSets, interesting faSets, recommendations Top-k faSets computation ▫Statistics maintenance, Two-Phase algorithm Experimental Results DMOD Lab, University of Ioannina 5

ReDrive: Result-Driven Database Exploration through Recommendations FaSets Facet condition: A condition A i = a i on some attribute of Res(Q) m-FaSet: A set of m facet conditions on m different attributes of Res(Q) DMOD Lab, University of Ioannina 6 DirectorTitleYearGenre F.F. CoppolaTetro2009Drama F.F. CoppolaYouth Without Youth2007Fantasy F.F. CoppolaThe Godfather1972Drama F.F. CoppolaRumble Fish1983Drama F.F. CoppolaThe Conversation1974Thriller F.F. CoppolaThe Outsiders1983Drama F.F. CoppolaSupernova2000Thriller F.F. CoppolaApocalypse Now1979Drama 1-faSet 2-faSet

ReDrive: Result-Driven Database Exploration through Recommendations Interestingness score of a FaSet The relevance of a faSet f to a user information need u Q is defined as: DMOD Lab, University of Ioannina 7 : The set of tuples relevant to u Q Same for all faSets Probability that f is satisfied by a tuple in the result We assume that |Res(Q)| ≪ |D| and therefore: Support of f in Res(Q) Support of f in the database

ReDrive: Result-Driven Database Exploration through Recommendations Extended Interestingness score of a FaSet We extend our definition to include attributes in relations that do not appear in Q but may disclose interesting information DMOD Lab, University of Ioannina 8 DirectorTitleYearGenreCountry F.F. CoppolaTetro2009DramaItaly F.F. CoppolaYouth Without Youth2007FantasyItaly F.F. CoppolaThe Godfather1972DramaUSA F.F. CoppolaRumble Fish1983DramaUSA F.F. CoppolaThe Conversation1974ThrillerUSA F.F. CoppolaThe Outsiders1983DramaUSA F.F. CoppolaSupernova2000ThrillerUSA F.F. CoppolaApocalypse Now1979DramaUSA

ReDrive: Result-Driven Database Exploration through Recommendations Exploratory Queries For each interesting faSet f, we construct an exploratory query used to recommend results to the user DMOD Lab, University of Ioannina 9 SELECT title, year, genre FROM movies, directors, genres WHERE director = ‘F.F. Coppola’ AND join(Q) DirectorTitleYearGenre F.F. CoppolaTetro2009Drama F.F. Coppola Youth Without Youth 2007Fantasy F.F. CoppolaThe Godfather1972Drama F.F. CoppolaRumble Fish1983Drama F.F. CoppolaThe Conversation1974Thriller F.F. CoppolaThe Outsiders1983Drama F.F. CoppolaSupernova2000Thriller F.F. CoppolaApocalypse Now1979Drama f = ‘1983, Drama’ SELECT director FROM movies, directors, genres WHERE year = 1983 AND genre = ‘Drama’ AND join(Q) 1 User query Q 2 Res(Q) 3 Interesting faSet 4 Exploratory query

ReDrive: Result-Driven Database Exploration through Recommendations Outline The ReDRIVE framework ▫faSets, interesting faSets, recommendations Top-k faSets computation ▫Statistics maintenance, Two-Phase algorithm Experimental Results DMOD Lab, University of Ioannina 10

ReDrive: Result-Driven Database Exploration through Recommendations Estimating p(f |D) To compute the interestingness of a faSet we need: ▫p(f |Res(Q)) ▫p(f |D) p(f |Res(Q)) is computed on-line p(f |D) is too expensive ⇒ must be estimated DMOD Lab, University of Ioannina 11 Compute off-line and store statistics that will allow us to estimate p(f |D) for any faSet f. Our approach:

ReDrive: Result-Driven Database Exploration through Recommendations Statistics We maintain statistics in the form of -Tolerance Closed Rare FaSets (-CRFs): Let f be a faSet and C(f ) be its closest -CRF. Then, it holds that: DMOD Lab, University of Ioannina 12 A faSet f is an -CRF for a set of tuples S if and only if: 1.it is rare for S and 2.it has no proper rare subset f’, |f’ |=|f |-1, such that: count(f’,S) < (1+ )count(f,S), ≥ 0 where = (1+ ) |f|-|C(f)|

ReDrive: Result-Driven Database Exploration through Recommendations Lack of Monotonicity score(f,Q) is neither an upwards nor a downwards closed measure No monotonic properties to exploit for pruning the search space DMOD Lab, University of Ioannina 13

ReDrive: Result-Driven Database Exploration through Recommendations The Two-Phase Algorithm Maintain all -CRFs, where rare is defined by minsupp r First Phase: ▫X = {all 1-faSets in Res(Q)} ▫Y = {-CRFs that consist only of 1-faSets in X} ▫Z = {faSets in Res(Q) that are supersets of some faSet in Y} ▫Compute scores for faSets in Z Second Phase: ▫s = k th highest score in Z ▫Run a priori with threshold minsupp f = s * minsupp r DMOD Lab, University of Ioannina 14

ReDrive: Result-Driven Database Exploration through Recommendations Outline The ReDRIVE framework ▫faSets, interesting faSets, recommendations Top-k faSets computation ▫Statistics maintenance, Two-Phase algorithm Experimental Results DMOD Lab, University of Ioannina 15

ReDrive: Result-Driven Database Exploration through Recommendations Datasets We experimented using real datasets: ▫AUTOS: single-relation, tuples, 41 attributes ▫MOVIES: 13 relations, tuples, 2-5 attributes And synthetic ones: ▫UNIFORM: single relation, 10 3 tuples, 5 attributes ▫ZIPF: single relation, 10 3 tuples, 5 attributes DMOD Lab, University of Ioannina 16

ReDrive: Result-Driven Database Exploration through Recommendations Statistics Generation Locating rare faSets is the most expensive computation ▫Used a Random Walks technique Estimation was very close to Actual value for large ε DMOD Lab, University of Ioannina 17

ReDrive: Result-Driven Database Exploration through Recommendations Top-k faSets discovery Is it worth the trouble? ▫Baseline: Consider only frequent faSets in Res(Q) ▫TPA: Two-Phase Algorithm DMOD Lab, University of Ioannina 18

ReDrive: Result-Driven Database Exploration through Recommendations Exploring the real datasets Top-faSets: ▫Mercedes-Benz G55 AMG, Land Rover Discovery II HSE7 When expanding towards State, new faSets include: ▫Land Rover Range Rover in VA, Cadilac in DE DMOD Lab, University of Ioannina 19 SELECT make, name FROM AUTOS WHERE navigation_system = 'Yes'

ReDrive: Result-Driven Database Exploration through Recommendations Summary We introduced ReDRIVE, a novel database exploration framework for recommending to users items which may be of interest to them although not part of the results of their original query We defined faSets and their interestingness We proposed a frequency estimation method based on - CRFs We proposed a Two-Phase Algorithm for locating the top-k most interesting faSets DMOD Lab, University of Ioannina 20

ReDrive: Result-Driven Database Exploration through Recommendations DMOD Lab, University of Ioannina 21 Thank you!

ReDrive: Result-Driven Database Exploration through Recommendations Queries Queries are Select-Project-Join SQL queries: DMOD Lab, University of Ioannina 22 SELECT proj(Q) FROM rel(Q) WHERE sel(Q) AND join(Q) rel(Q): A set of relations sel(Q): A conjunction of selection conditions join(Q): A set of join conditions among rel(Q) proj(Q) : A set of projected attributes

ReDrive: Result-Driven Database Exploration through Recommendations Definitions In correspondence to data mining, we define: ▫Rare faSet (RF) : A faSet with frequency under a threshold ▫Closed Rare faSet (CRF) : A rare faSet with no proper subset with the same frequency ▫Minimal Rare faSet (MRF) : A rare faSet with no rare subset It holds: |MRFs| ≤ |CRFs| ≤ |RFs| MRFs can tell us if f is rare but not its frequency CRFs can tell us its frequency but are still too many DMOD Lab, University of Ioannina 23

ReDrive: Result-Driven Database Exploration through Recommendations The Two-Phase Algorithm (2/2) Second Phase: ▫s = k th highest score in Z ▫Run a priori with threshold minsupp f = s * minsupp r Let f be a faSet examined in the second phase. This means that p(f |D) > minsupp r Therefore, for score(f,Q) > s to hold, it must be that: p(f |Res(Q)) > s * p(f |D) ⇒ p(f |Res(Q)) > s * minsupp r ⇒ p(f |Res(Q)) > minsupp f DMOD Lab, University of Ioannina 24