The Ohio State University Efficient and Effective Sampling Methods for Aggregation Queries on the Hidden Web Fan Wang Gagan Agrawal Presented By: Venu.

Slides:



Advertisements
Similar presentations
Data Mining and the Web Susan Dumais Microsoft Research KDD97 Panel - Aug 17, 1997.
Advertisements

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Lindsey Bleimes Charlie Garrod Adam Meyerson
Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research.
Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
Random Forest Predrag Radenković 3237/10
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
Query Optimization CS634 Lecture 12, Mar 12, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.
Fast Algorithms For Hierarchical Range Histogram Constructions
Determining How to Select a Sample
A Sampling Distribution
PHP-based Image Recognition and Retrieval of Late 18th Century Artwork Ben Goodwin Handouts are available for students writing summaries for class assignments.
Discussion Sampling Methods
1 Editing Administrative Data and Combined Data Sources Introduction.
Adaptive Query Processing for Data Aggregation: Mining, Using and Maintaining Source Statistics M.S Thesis Defense by Jianchun Fan Committee Members: Dr.
LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002.
INFO 624 Week 3 Retrieval System Evaluation
Retrieval Evaluation. Brief Review Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.
LinkSelector: Select Hyperlinks for Web Portals Prof. Olivia Sheng Xiao Fang School of Accounting and Information Systems University of Utah.
Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006.
Quality-driven Integration of Heterogeneous Information System by Felix Naumann, et al. (VLDB1999) 17 Feb 2006 Presented by Heasoo Hwang.
SAMPLING Chapter 7. DESIGNING A SAMPLING STRATEGY The major interest in sampling has to do with the generalizability of a research study’s findings Sampling.
A Search-based Method for Forecasting Ad Impression in Contextual Advertising Defense.
CBLOCK: An Automatic Blocking Mechanism for Large-Scale Deduplication Tasks Ashwin Machanavajjhala Duke University with Anish Das Sarma, Ankur Jain, Philip.
Minimal Probing: Supporting Expensive Predicates for Top-k Queries Kevin C. Chang Seung-won Hwang Univ. of Illinois at Urbana-Champaign.
Query Planning for Searching Inter- Dependent Deep-Web Databases Fan Wang 1, Gagan Agrawal 1, Ruoming Jin 2 1 Department of Computer.
PRESENTED BY- HARSH SINGH A Random Walk Approach to Sampling Hidden Databases By Arjun Dasgupta, Dr. Gautam Das and Heikki Mannila.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
1 Data Mining over the Deep Web Tantan Liu, Gagan Agrawal Ohio State University April 12, 2011.
A Simple Method to Extract Fuzzy Rules by Measure of Fuzziness Jieh-Ren Chang Nai-Jian Wang.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
ITIS 1210 Introduction to Web-Based Information Systems Internet Research Two How Search Engines Rank Pages & Constructing Complex Searches.
Stochastic Algorithms Some of the fastest known algorithms for certain tasks rely on chance Stochastic/Randomized Algorithms Two common variations – Monte.
1 A Bayesian Method for Guessing the Extreme Values in a Data Set Mingxi Wu, Chris Jermaine University of Florida September 2007.
9 th Workshop on Labour Force Survey Methodology – Rome, May 2014 The Italian LFS sampling design: recent and future developments 9 th Workshop on.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Data Structures & Algorithms and The Internet: A different way of thinking.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Searching for Extremes Among Distributed Data Sources with Optimal Probing Zhenyu (Victor) Liu Computer Science Department, UCLA.
Treatment Learning: Implementation and Application Ying Hu Electrical & Computer Engineering University of British Columbia.
Implicit User Feedback Hongning Wang Explicit relevance feedback 2 Updated query Feedback Judgments: d 1 + d 2 - d 3 + … d k -... Query User judgment.
SEEDEEP: A System for Exploring and Querying Deep Web Data Sources PhD Defense The Ohio State University Summer 2010 SEEDEEP: A System for Exploring and.
Lecture 4. Sampling is the process of selecting a small number of elements from a larger defined target group of elements such that the information gathered.
Data Mining over Hidden Data Sources Tantan Liu Depart. Computer Science & Engineering Ohio State University July 23, 2012.
Stratified K-means Clustering Over A Deep Web Data Source Tantan Liu, Gagan Agrawal Dept. of Computer Science & Engineering Ohio State University Aug.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Externally growing self-organizing maps and its application to database visualization and exploration.
1 Selectivity Estimation for Exclusive Query Translation in Deep Web Data Integration Fangjiao Jiang Renmin University of China Joint work with Weiyi Meng.
Implicit User Feedback Hongning Wang Explicit relevance feedback 2 Updated query Feedback Judgments: d 1 + d 2 - d 3 + … d k -... Query User judgment.
Kevin C. Chang. About the collaboration -- Cazoodle 2 Coming next week: Vacation Rental Search.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
IPDET Module 9: Choosing the Sampling Strategy. IPDET © Introduction Introduction to Sampling Types of Samples: Random and Nonrandom Determining.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
Differential Analysis on Deep Web Data Sources Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal December.
The KDD Process for Extracting Useful Knowledge from Volumes of Data Fayyad, Piatetsky-Shapiro, and Smyth Ian Kim SWHIG Seminar.
Query Reranking As A Service
ICICLES: Self-tuning Samples for Approximate Query Answering
Approximate Frequency Counts over Data Streams
Stratified Sampling for Data Mining on the Deep Web
Example: Academic Search
Panagiotis G. Ipeirotis Luis Gravano
Answering Cross-Source Keyword Queries Over Biological Data Sources
Presentation transcript:

The Ohio State University Efficient and Effective Sampling Methods for Aggregation Queries on the Hidden Web Fan Wang Gagan Agrawal Presented By: Venu Satuluri

The Ohio State University 2 The Deep Web The definition of “the deep web” from Wikipedia The deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines. Some Examples: Expedia, Priceline

The Ohio State University 3 The Deep Web is Huge and Informative 500 times larger than the surface web 7500 terabytes of information (19 terabytes in the surface web) 550 billion documents (1 billion in the surface web) More than 200,000 deep web sites Relevant to every domain: scientific, e-commerce, market 95 percent of the deep web is publicly accessible (with access limitations)

The Ohio State University 4 How to Access Deep Web Data 1. A user issues query through input interfaces of deep web data sources 2. Query is translated into SQL style query 3. Trigger search on backend database 4. Answers returned through network Select price From Expedia Where depart=CMH and arrive=SEA and dedate=“7/13/10” and redate=“7/16/10”

The Ohio State University 5 High-level Goal Develop a deep web search tool which could support online (real time) structured, and high level queries (semi)automatically

The Ohio State University 6 Challenges Challenges for Integration –Self-maintained and created –Heterogeneous, hidden and dynamically updated metadata Challenges for Searching –Limited data access pattern –Data redundancy and data quality –Data source dependency Challenges for Performance –Network latency –Fault tolerance issue –…….

The Ohio State University 7 Specific Problem: Online Aggregation Aggregation queries requiring data enumeration I want to know the average airfare from US to Europe across all major US airline flights in the next week Select AVG(airfare) From AirTable AT Where AT.depart=any US city and AT.arrive=any European city Need Enumeration!!! Relational Database Deep Web Data Source NYC, London Boston, Paris LA, Rome AA: 500 UA: 550 USAir: 450 Delta: 400 UA: 600 AA: 650 Where do you get these names? How long can you wait? What if the data is updated dynamically?

The Ohio State University 8 Initial Thoughts Sampling: Approximate answers Simple random sampling (SRS) –Every data record has the same probability to be selected Drawbacks of SRS –Bad performance on skewed data –High sampling cost to perform SRS on deep web (Dasgupta et al, HDSampler)

The Ohio State University 9 We Want To Achieve Handle data with (probably high) skew –Top 20 IT companies account for 80% of the sales among all top 100 IT companies in 2005 –Hidden data (hard to gather statistical information) –Has skew or not? –Unknown data distribution –Pilot sample, how much can you trust your pilot sample? Lower sampling cost for sampling deep web data

The Ohio State University 10 Our Contributions Two Sampling Algorithms –ANS (Adaptive Neighborhood Sampling): handling skewed data, sample skew causing data easier –TPS (Two Phase adaptive Sampling): lower sampling cost Performance –Accurate estimates without prior knowledge –ANS and TPS outperform HDSampler by a factor of 4 on skewed data –TPS has one-third of the sampling cost of HDSampler

The Ohio State University 11 Background Knowledge A survey on a type of rare monkey, which only lives in a small but dangerous area in southern China Associated Samples

The Ohio State University 12 Why this is good and Can we use it? Sample more rare but significant data records –Good for handling with skewed data Associated samples have relatively low sampling cost –Cheaper than SRS with the same sample size Yes, we can use it! ~~ with modification –Much real world data has skew (IT company, household income) –Rare data often form clusters –Deep web data sources often return multiple records w.r.t. one input sample

The Ohio State University 13 Drawbacks Performance depends on the initial sample Initial sample is simple random sample No cost limit explicitly considered –What is the size of the initial sample? –How many associated samples should be added?

The Ohio State University 14 Select a random sample –Stop random sampling if any of the two termination rules applies We have sampled k number of units of interest We have reached the cost limit –Take the sampled data record, add it to our sample –If this data record is a unit of interest Obtain its neighbors (neighborhood sampling) For each data records obtained from neighborhood sampling –Add it to our sample –Perform recursive neighborhood sampling if necessary –If neighborhoods are too large Increase unit of interest threshold –If neighborhoods are too small Decrease unit of interest threshold The ANS Sampling Algorithm Aggressively sample skew causing data Control sampling cost

The Ohio State University 15 ANS Example Estimate the total sale of IT companies in 2005 Each point represents a company’s sale record Color shows the scale of the sale value, the darker, the higher Neighborhood of data records is defined according to some rules

The Ohio State University 16 1.Select initial random samples sequentially until we have k units of interest (k=3) Unit of interest: sales larger than a threshold 2. Explore the neighborhood recursively for all units of interest until the total number of samples reach a limit 3. If too many neighbors are Included, we increase the threshold of unit of interest

The Ohio State University 17 Estimators and Analysis for ANS Estimator for AVG, fixed unit of interest threshold Lemma 1: The above estimator is a biased estimator, when k is small, the bias is very small We also proposed estimator for variable unit of interest threshold using post-stratification is the estimated average value from the h th stratum (all samples corresponding to one specific unit of interest threshold value) β is the percentage of units of Interest w.r.t. entire data set β=(k-1)/(n-1)

The Ohio State University 18 Drawbacks of ANS Initial samples are simple random samples SRS: one input search only gets one sample from the output page High cost

The Ohio State University 19 The TPS Sampling Algorithm Partition data set D into M sub-spaces –According to combinations of input attribute values Random select m sub-spaces Select a sample of size n1 from each of the m selected sub-spaces –First phase sampling –For one selected sub-space, if any such selected data record is a unit of interest, proceed Select a sample of size n2 from the corresponding sub-spaces –Second phase sampling –Sub-spaces contains units of interest in the first sampling phase may give us more units of interests in the second sampling phase Aggressively draw skew Causing data Low sampling cost

The Ohio State University 20 Unit of interest: sales larger than a threshold 1.Random select the sub spaces (input search) 2. Random select N1 samples In each selected subspace N1=4 3. If any sample selected in a sub space is a unit of interest, select N2 samples from the sub space again, N2=3

The Ohio State University 21 Evaluation Data sets –Synthetic data sets: generated using MINITAB, varying data skew from 1 to 9 –US Census data: 2002 US economic census data on wholesale trade product lines listed by the kind of business (skew=8) –Yahoo! Auto: prices of used Ford cars from 2000 to 2009 located within 50 miles of a zipcode (skew=0.7) Metrics –AER: Absolute Error Rate –Sampling cost: number of input samples needed Methods –ANS –TPS –SRS

The Ohio State University 22 ANS Performance w.r.t. Data Skew 1. AER increases moderately with the increase of data skew 2. When k=8, AER is consistently smaller than 19% 3. Larger k doesn’t help much to improve

The Ohio State University 23 TPS Performance w.r.t. Data Skew 1. With the increase of data skew, AER increases moderately 2. For sub-space sample size of 30%, AER always smaller than 17%

The Ohio State University 24 AER Comparison on Synthetic Data 1. All methods work well on small skew data 2. HDSampler has bad performance on data with skew>2 3. Our two methods outperform HDSampler by a factor of 5

The Ohio State University 25 AER Comparison on US Census Data

The Ohio State University 26 AER Comparison on Yahoo! Data 1. For AVG, three methods are comparable in terms of accuracy 2. For MAX, our methods are better (by a factor of 4)

The Ohio State University 27 Sampling Cost Comparison on Yahoo Data 1. To achieve a low AER, TPS has one-third of the sampling cost of HDSampler 2. The total number of samples TPS obtained (with same cost in time) is twice the number of samples HDSampler obtained More Results

The Ohio State University Summary Effective methods for sampling skewed data on hidden data Outperform other methods Questions: – the authors!! 28