Published by Hester Clarke. Modified over 6 years ago.
1
Stratified Sampling for Data Mining on the Deep Web
Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa Dec. 16, 2010
2
Outline
Introduction
Background Knowledge
  Association Rule Mining
  Differential Rule Mining
Basic Formulation
Main Technical Approach
  A Greedy Stratification Method
Experiment Result
Conclusion
3
Introduction
Deep Web
  Query interface vs. backend database
  Input attribute vs. output attribute
Data mining on the deep web
  High-level summary of the data
Challenge
  Databases cannot be accessed directly, so sampling is needed
  Deep web querying is time-consuming, so an efficient sampling method is needed
4
Background Knowledge-Association Rule Mining
Aim: co-occurrence patterns for items
Frequent itemset: support of the itemset is larger than a threshold
Rule: X ⇒ Y
  X ∪ Y is a frequent itemset
  Confidence is larger than a threshold
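The two measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration of the standard textbook definitions of support and confidence, not code from the presentation; the function names and the toy transactions are assumptions made for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(lhs and rhs together) / support(lhs); 0.0 if lhs never occurs."""
    s_lhs = support(transactions, lhs)
    if s_lhs == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / s_lhs


# Toy data (hypothetical): five transactions over the items a, b, c.
transactions = [
    frozenset(t)
    for t in ({"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"})
]

print(support(transactions, {"a", "b"}))       # {a, b} appears in 3 of 5 transactions
print(confidence(transactions, {"a"}, {"b"}))  # 3 of the 4 transactions with a also contain b
```

A rule would be reported only when both values exceed their thresholds.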
5
Background Knowledge-Differential Rule Mining
Aim: differences between two deep web data sources
  E.g. price of the same hotels on two web sites
Identical attributes vs. differential attributes
  Same vs. different values
Rule:
  X: frequent itemset composed of identical attributes
  t: differential or target attribute
  D1, D2: data sources
6
Basic Formulation-Problem Formulation
Two-step sampling procedure
A pilot sample
  Randomly drawn from the deep web
  Interesting rules are identified
Additional sample
  Verify identified rules (association rules and differential rules)
  Sample more data records satisfying X
  X only contains input attributes: easy
  X contains output attributes: random sampling is not efficient, so how?
7
Basic Formulation-Problem Formulation in Detail
Considering rules with a single output attribute A on the left-hand side
Association rule
  Estimate or,
Differential rule
  Estimate the mean of t given A = a
Goal of sampling
  High estimation accuracy
  Low sampling cost
8
Basic Formulation-Stratified Sampling
Sampling separately from strata
  Heterogeneous across strata and homogeneous within each stratum
Estimating the mean value of a variable from the size and sampled mean value of each stratum
Association rule mining
  The variable indicates whether an itemset is contained in a transaction
Differential rule mining
  The variable is the value of the target attribute
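The estimate described above can be illustrated with the standard stratified estimator, which weights each stratum's sampled mean by that stratum's share of the population. The sketch below is a generic illustration; the function name, the `(size, sample)` input layout and the toy numbers are assumptions, not the authors' notation.

```python
def stratified_mean(strata):
    """Stratified estimate of a population mean.

    `strata` is a list of (stratum_size, sampled_values) pairs. Each
    stratum's sample mean is weighted by stratum_size / total_size.
    """
    total_size = sum(size for size, _ in strata)
    estimate = 0.0
    for size, values in strata:
        sample_mean = sum(values) / len(values)
        estimate += (size / total_size) * sample_mean
    return estimate


# Hypothetical example for association rule mining: each sampled value
# is 1 if the transaction contains the itemset and 0 otherwise.
strata = [
    (100, [1, 1, 0, 1]),  # small stratum, itemset is common
    (300, [0, 0, 1, 0]),  # large stratum, itemset is rare
]
print(stratified_mean(strata))  # → 0.375
```

For differential rule mining the same function applies unchanged, with the sampled values being the target attribute instead of 0/1 indicators.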
9
Background-Neyman Allocation
Sample allocation
  Determining the sample size for each stratum
  Fixed sum of sample sizes
Neyman allocation
  Minimizing the variance of the stratified sampling
Problem of application in the deep web
  The probability of A = a in each stratum is not considered
  Possibly large sampling cost
  Sampling cost: number of queries submitted to the deep web
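For reference, the textbook form of Neyman allocation gives each stratum a share of the fixed total sample size proportional to its size times its standard deviation. The sketch below shows that rule only; the function name, the fallback for zero variation and the example numbers are assumptions, and the result is left as fractional sizes that would still need rounding.

```python
def neyman_allocation(sizes, stds, total_n):
    """Split `total_n` samples across strata proportionally to size * std.

    `sizes` and `stds` hold the size and the (estimated) standard
    deviation of each stratum. Returns fractional sample sizes.
    """
    weights = [size * std for size, std in zip(sizes, stds)]
    total_weight = sum(weights)
    if total_weight == 0:
        # No variation in any stratum: fall back to proportional allocation
        # (an assumption made for this sketch).
        return [total_n * size / sum(sizes) for size in sizes]
    return [total_n * w / total_weight for w in weights]


# Hypothetical example: the smaller stratum is more variable,
# so it receives more samples than its size alone would suggest.
alloc = neyman_allocation(sizes=[100, 300], stds=[2.0, 1.0], total_n=50)
print(alloc)  # → [20.0, 30.0]
```

Note that nothing in this rule looks at how likely a record in a stratum is to satisfy A = a, which is the limitation the slide points out.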
10
Sampling Cost
Sampling cost on the deep web
  Aim: obtain data records with A = a
  Sampling cost: determined by the number of data records with A = a and the probability of finding a data record with A = a
Integrated cost
  Combining sampling cost and estimation variance
  Two adjustable weights
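One way to read the two quantities above is sketched below. Both formulas are assumptions made for illustration, since the slide's own expressions did not survive extraction: the sampling cost is taken as the expected number of retrieved records needed to obtain a given number of matches when each record matches independently with a fixed probability, and the integrated cost is taken as a plain weighted sum. All function and parameter names are hypothetical.

```python
def expected_sampling_cost(n_needed, p_match):
    """Expected number of records to retrieve to obtain `n_needed` matches.

    Assumes each retrieved record matches independently with probability
    `p_match`, so the expectation is n_needed / p_match.
    """
    if not 0 < p_match <= 1:
        raise ValueError("p_match must be in (0, 1]")
    return n_needed / p_match


def integrated_cost(variance, cost, w_var, w_cost):
    """Hypothetical integrated cost: weighted sum of variance and cost.

    In practice the two terms would need to be put on comparable scales
    before weighting; that normalization is omitted here.
    """
    return w_var * variance + w_cost * cost


# Hypothetical comparison: the same 50 matching records are far more
# expensive to collect where matches are rare.
cost_common = expected_sampling_cost(50, 0.5)   # 100.0
cost_rare = expected_sampling_cost(50, 0.05)    # about 1000
print(cost_common, cost_rare)
```

Shifting weight from `w_var` to `w_cost` makes the cheaper stratum look better even when its variance is higher, which is the trade-off the two adjustable weights control.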
11
Main Technical Approach – Stratification Process
Stratification by a tree on the query space
  A top-down construction manner
  Best split to create child nodes
    Input attribute with the smallest integrated cost
  The splitting process stops
    Integrated cost at each leaf node is small
  Leaf nodes: final strata for sampling
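The top-down construction described above can be sketched as a small recursive function over a pilot sample. This is an illustration under stated assumptions, not the authors' algorithm: the per-node cost used here is a stand-in (size-weighted variance of the target only, with no sampling-cost term), and the names `node_cost`, `split_by` and `build_strata`, the record layout and the threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def node_cost(records, target):
    """Stand-in for the integrated cost: size-weighted variance of the target."""
    if len(records) < 2:
        return 0.0
    return len(records) * pvariance([r[target] for r in records])


def split_by(records, attr):
    """Group records by their value of one input attribute."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r)
    return groups


def build_strata(records, attrs, target, threshold, path=()):
    """Greedy top-down stratification; returns the leaves as (path, records)."""
    # Stop when the node is already cheap or nothing is left to split on.
    if not attrs or node_cost(records, target) <= threshold:
        return [(path, records)]
    # Greedy choice: the input attribute whose children have the smallest total cost.
    best = min(
        attrs,
        key=lambda a: sum(node_cost(g, target) for g in split_by(records, a).values()),
    )
    remaining = [a for a in attrs if a != best]
    leaves = []
    for value, group in split_by(records, best).items():
        leaves.extend(
            build_strata(group, remaining, target, threshold, path + ((best, value),))
        )
    return leaves


# Hypothetical pilot sample: "region" separates the target perfectly, "type" does not.
pilot = [
    {"region": "east", "type": "x", "y": 1},
    {"region": "east", "type": "y", "y": 1},
    {"region": "west", "type": "x", "y": 0},
    {"region": "west", "type": "y", "y": 0},
]
strata = build_strata(pilot, ["region", "type"], "y", threshold=0.0)
for path, members in strata:
    print(path, len(members))
```

On this toy data the tree splits once on `region` and stops, because both children are already homogeneous; each leaf path is a conjunction of input-attribute values that could be submitted as a query.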
12
Experiment Result
Data set: US census
  The income of US households from the 2008 US Census
  40,000 data records
  7 categorical and 2 numerical attributes
Two metrics
  Variance of estimation
  Sampling cost
13
Experiment Result-Settings
Five sampling procedures
  Four different weights for variance and sampling cost: Full_Var, Var7, Var5, Var3
  Rand: simple random sampling
14
Experiment Result – Variance of Estimation
Association rule mining
  Variance of estimation increases as the weight on variance decreases
  Random sampling has higher variance of estimation
15
Experiment Result – Sampling Cost
Association rule mining
  Sampling cost decreases as the weight on variance decreases
  Random sampling has higher sampling cost
16
Conclusion
Stratified sampling for data mining on the deep web
  Considering estimation accuracy and sampling cost
  A tree model for the relation between input attributes and output attributes
  A greedy stratification to maximally reduce an integrated cost metric
Our experiments show that
  Higher sampling accuracy and lower sampling cost compared with simple random sampling
  Sampling costs can be reduced by trading off a fraction of estimation error
17
Questions & Comments?