Kyriaki Dimitriadou, Brandeis University

Slides:



Advertisements
Similar presentations
Recommender System A Brief Survey.
Advertisements

Using Large-Scale Web Data to Facilitate Textual Query Based Retrieval of Consumer Photos.
Ranking Multimedia Databases via Relevance Feedback with History and Foresight Support / 12 I9 CHAIR OF COMPUTER SCIENCE 9 DATA MANAGEMENT AND EXPLORATION.
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
The Big Picture Scientific disciplines have developed a computational branch Models without closed form solutions solved numerically This has lead to.
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Lazy vs. Eager Learning Lazy vs. eager learning
Workshop on HPC in India Grid Middleware for High Performance Computing Sathish Vadhiyar Grid Applications Research Lab (GARL) Supercomputer Education.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Relevance Feedback Content-Based Image Retrieval Using Query Distribution Estimation Based on Maximum Entropy Principle Irwin King and Zhong Jin Nov
Information Retrieval Review
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis Luis Gravano Computer Science Department Columbia.
FACT: A Learning Based Web Query Processing System Hongjun Lu, Yanlei Diao Hong Kong U. of Science & Technology Songting Chen, Zengping Tian Fudan University.
Exploiting Correlated Attributes in Acquisitional Query Processing Amol Deshpande University of Maryland Joint work with Carlos Sam
1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman
Presented by Zeehasham Rasheed
Using Relevance Feedback in Multimedia Databases
Basic IR Concepts & Techniques ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Data Mining – Intro.
1 Efficiently Learning the Accuracy of Labeling Sources for Selective Sampling by Pinar Donmez, Jaime Carbonell, Jeff Schneider School of Computer Science,
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Local Cost Estimation for Global Query Optimization in a Multidatabase System Dr. Qiang Zhu The University of Michigan Dearborn, MI, USA.
Interactive Data Exploration using Constraints Alexander Kalinin Ugur Cetintemel, Stan Zdonik.
Thesis Proposal PrActive Learning: Practical Active Learning, Generalizing Active Learning for Real-World Deployments.
Understanding and Predicting Graded Search Satisfaction Tang Yuk Yu 1.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
Exploratory Visualization of Infectious Disease Propagation Ben Houston, Neuralsoft Zack Jacobson, Health Canada NX-Workshop on Social Network Analysis.
DB Seminar Series: Semi- supervised Projected Clustering By: Kevin Yip (4 th May 2004)
Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction / 16 I9 CHAIR OF COMPUTER SCIENCE 9 DATA MANAGEMENT.
Challenges in Mining Large Image Datasets Jelena Tešić, B.S. Manjunath University of California, Santa Barbara
Active Sampling for Accelerated Learning of Performance Models Piyush Shivam, Shivnath Babu, Jeff Chase Duke University.
Ranking objects based on relationships Computing Top-K over Aggregation Sigmod 2006 Kaushik Chakrabarti et al.
Effective Automatic Image Annotation Via A Coherent Language Model and Active Learning Rong Jin, Joyce Y. Chai Michigan State University Luo Si Carnegie.
Presented by: Sandeep Chittal Minimum-Effort Driven Dynamic Faceted Search in Structured Databases Authors: Senjuti Basu Roy, Haidong Wang, Gautam Das,
Interactive Data Exploration Using Semantic Windows Alexander Kalinin Ugur Cetintemel, Stan Zdonik.
Data Mining and Decision Support
BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES Andrew Carson and Charles Schafer.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
On Using SIFT Descriptors for Image Parameter Evaluation Authors: Patrick M. McInerney 1, Juan M. Banda 1, and Rafal A. Angryk 2 1 Montana State University,
1 Database Systems Group Research Overview OLAP Statistical Tests Goal: Isolate factors that cause significant changes in a measured value – Ex:
Automatic Categorization of Query Results Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang Sushruth Puttaswamy.
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Closing the Query Processing Loop in Oracle 11g Allison Lee, Mohamed Zait.
E XPLORE - BY -E XAMPLE : A N A UTOMATIC Q UERY S TEERING F RAMEWORK FOR I NTERACTIVE D ATA E XPLORATION By Kyriaki Dimitriadou, Olga Papaemmanouil and.
Introduction to Machine Learning, its potential usage in network area,
Presented by Archana Kumari ( ) | Supervised By Mr Vikram Singh
Data Mining – Intro.
What Is Cluster Analysis?
Applying Deep Neural Network to Enhance EMPI Searching
CSCI5570 Large Scale Data Processing Systems
Machine Learning overview Chapter 18, 21
A Black-Box Approach to Query Cardinality Estimation
Augmented Sketch: Faster and More Accurate Stream Processing
Dr. Sudha Ram Huimin Zhao Department of MIS University of Arizona
On Spatial Joins in MapReduce
Data Warehousing and Data Mining
Ying Dai Faculty of software and information science,
iSRD Spam Review Detection with Imbalanced Data Distributions
Course Introduction CSC 576: Data Mining.
GATES: A Grid-Based Middleware for Processing Distributed Data Streams
Panagiotis G. Ipeirotis Luis Gravano
Faceted Filter Jidong Jiang
Automatic and Efficient Data Virtualization System on Scientific Datasets Li Weng.
Bug Localization with Combination of Deep Learning and Information Retrieval A. N. Lam et al. International Conference on Program Comprehension 2017.
Interactive Information Retrieval
Presentation transcript:

Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration Kyriaki Dimitriadou, Brandeis University Olga Papaemmanouil, Brandeis University Yanlei Diao, UMass Amherst

Interactive Data Exploration Human-in-the-loop application Explore big datasets to discover interesting data

Current state-of-the-art [SDSS DR7] Example: astrophysical data (e.g., SDSS) Discovery of interesting sky objects “Manual” iterative exploration: Query formulation Query processing Result reviewing (and back to step 1) Challenges: Ad-hoc queries: “correct” predicates are unknown a priori Labor intensive: thousands of objects to review Resource intensive: execution of long query sequences on big data

AIDE: Automatic Interactive Data Exploration AIDE’s exploration model Relies on user’s relevance feedback on data samples Eliminates query formulation step Navigates the user through the data space Reduces result reviewing overhead AIDE’s performance goals Effectiveness Captures user interests with high accuracy Efficiency Minimizes reviewing effort Offers interactive experience

AIDE Framework SQL Data Classification Space Exploration Relevance Feedback Data Classification Relevant Samples Irrelevant Samples User Model User Model Samples Query Formulation Space Exploration SQL Sampling queries Data Extraction Query

Research Challenges Which data samples to show to the user? User interests are unknown a-priori Labeled samples are collected in real time Active learning: Domain specific (e.g. image retrieval, document ranking) Examine all samples in the database Data samples provided by a different party How to offer interactive exploration times? Sample extraction cost is significant in big data set Accuracy vs efficiency trade-off AIDE couples model learning and sample acquisition

User Interests Assumption: user interests form N relevant areas Target Queries: disjunctive or conjunctive (range) queries Relevant Area Red Wavelength Relevant Area Relevant Area Green Wavelength

Space Exploration Steps Relevant Object Discovery Discover relevant objects Misclassified Sample Exploitation Identify relevant areas Boundary Exploitation Refine relevant areas √

Relevant Object Discovery Red wavelength Green Wavelength x x x √ √ x x x x x x x x √ x √ x x x x

Skewed Data Distributions * * * Red wavelength Green Wavelength * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Classification & Query Formulation Sample Red Green Relevant Object A 13.67 12.34 Yes Object B 15.32 14.50 No .. ... Object X 14.21 13.57 red red<=14.82 red>14.82 Irrelevant green red<13.55 red>=13.55 green<=13.74 Relevant green>13.74 Decision Tree Classifier SELECT * FROM galaxy WHERE red<= 14.82 AND red>= 13.5 AND green<=13.74

Misclassified Samples False negative Red wavelength Green Wavelength False negative x x x √ √ x x x x x Predicted Area False positive x x x √ x √ x x x x

Misclassified Sample Exploitation Sampling Areas Red wavelength Green Wavelength x x x √ √ x √ √ √ √ √ √ √ x x x x x x x x √ x √ x x x x 13

Clustering-based Sampling Red wavelength Green Wavelength √ x √ x √ x √ √ √ √ √ x √ √ √ √ x x √ √ √ x Clusters-Sampling Areas x √ √ 14 14

Boundary Exploitation Eliminate irrelevant attributes Refine the areas Red wavelength Sampling Areas Green Wavelength 15 15 15

Performance Experiments SDSS dataset (10 GB-100GB) Our target queries: Based on SDSS query workload Effectiveness:

Effectiveness vs User Effort SMALL NUMBER OF SAMPLES TO PREDICT COMMON CONJUCTIVE QUERIES

Efficiency (User Wait Time) USER WAIT TIME IS <6sec FOR CONJUCTIVE QUERIES AND <8 secs FOR COMPLEX DISJUNCTIVE QUERIES.

Scalability (DB Size) SCALLING TO LARGER DATA SETS HAS NO SIGNIFICANT IMPACT ON THE EFFECTIVENESS.

Scalability (Exploration Space) F-measure: >70% AIDE IDENTIFIES IRRELEVANT ATTRIBUTES IN N-DIMENSIONAL SPACES WITH SMALL OVERHEAD

User Study User study with 7 real users: Manual Exploration: AuctionMark dataset Exploration for good deals Manual Exploration: Query formulation Query processing Review results & repeat

AIDE outperformed manual exploration: 1. select i_initial_price, i_current_price from ITEM order by i_initial_price; 2. select i_initial_price, i_current_price from ITEM where i_initial_price < i_current_price order by i_initial_price; 3. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 2 order by i_initial_price; 4. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 2 and i_current_price > 1000 order by i_initial_price; 5. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 3 order by i_initial_price; 6. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 3 and i_current_price > 1000 order by i_initial_price; 7. select i_initial_price, i_current_price from ITEM where i_current_price > i_initial_price * 4 order by i_initial_price; 8. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 2 and i_num_bids > 10 order by i_initial_price; 9. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 2 and i_num_bids > 50 order by i_initial_price; 10. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 2 and i_num_bids > 70 order by i_initial_price; 11. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 2 and i_num_bids > 90 order by i_initial_price; 12. select i_initial_price, i_current_price, i_num_bids from ITEM where i_current_price > i_initial_price * 3 and i_num_bids > 90 order by i_initial_price; 13. select i_initial_price, i_current_price, i_days_to_close from ITEM where i_current_price > i_initial_price * 2 order by i_initial_price; 14. select i_initial_price, i_current_price, i_days_to_close from ITEM where i_current_price > i_initial_price * 2 and i_days_to_close > 0 order by i_initial_price; 15. select i_initial_price, i_current_price, i_days_to_close from ITEM where i_current_price > i_initial_price * 3 and i_days_to_close > 0 order by i_initial_price; 16. select i_initial_price, i_current_price, i_days_to_close from ITEM where i_current_price > i_initial_price * 3 and i_days_to_close > 5 order by i_initial_price; 17. select i_initial_price, i_current_price, i_num_comments from ITEM where i_current_price > i_initial_price * 3 order by i_initial_price; … AIDE outperformed manual exploration: 66% average reduction on user effort 42% average reduction in exploration time

Related Work N. Kamat et al. Distributed Interactive Cube Exploration. ICDE 2014 Leilani Battle et al. Dynamic Reduction of Query Result Sets for Interactive Visualization. BigDataVis Workshop 2013 A. Kalinin et al. Interactive Data Exploration using Semantic Windows. SIGMOD 2014 M. Drosou et al. YMALDB: exploring relational databases via result-driven recommendations. VLDB Journal 2013 Neophytou et al. AstroShelf: Understanding the Universe through Scalable Navigation of a Galaxy of Annotations. SIGMOD 2012 A. Albarrak et al. SAQR: An Efficient Scheme for Similarity- Aware Query Refinement. DASFAA 2014

Conclusions AIDE (Automatic Interactive Data Exploration): Assists user in discovering interesting data objects Eliminates ad-hoc exploratory queries Highly efficient and effective exploration: Captures user interests with high accuracy Requires low reviewing effort from the user Offers interactive experience

Questions?

BACK UP SLIDES

Effectiveness vs User Effort SMALL NUMBER OF SAMPLES TO ACCURATELY PREDICT COMPLEX DISJUNCTIVE QUERIES

Efficiency for Skewed Datasets

Efficiency (User Wait Time) RUNNING OUR EXPLORATION ON A SAMPLE DATABASE LED TO 85-98% IMPROVEMENT ON TIME. TIME PER ITERATION: 2-7 SEC

Phases

AIDE vs Random Predicting areas (accuracy >70%): Both random solutions fail for small areas (>5,450 samples) For medium and large areas Random needs > 1180 samples and Random-Grid > 1275 samples Predicting multiple areas (accuracy >60%): AIDE needs <500 samples in all cases. Random and Random-Grid >1000 samples in all cases.