Stratified Sampling for Data Mining on the Deep Web
Tantan Liu, Fan Wang, Gagan Agrawal
{liut, wangfa, agrawal}@cse.ohio-state.edu
Dec. 16, 2010
Outline
Introduction
Background Knowledge: Association Rule Mining, Differential Rule Mining
Basic Formulation
Main Technical Approach: A Greedy Stratification Method
Experiment Results
Conclusion
Introduction
Deep web: the query interface vs. the backend database; input attributes vs. output attributes.
Data mining on the deep web: a high-level summary of the data.
Challenges: the databases cannot be accessed directly, so sampling is required; deep web querying is time consuming, so an efficient sampling method is required.
Background Knowledge: Association Rule Mining
Aim: find co-occurrence patterns among items.
Frequent itemset: an itemset whose support is larger than a threshold.
Rule X → Y: X ∪ Y is a frequent itemset, and the confidence of the rule is larger than a threshold.
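The support and confidence tests above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `lhs`/`rhs` parameters, and the toy transactions are assumptions introduced here.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(lhs and rhs together) / support(lhs); 0.0 if lhs never occurs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    s_lhs = support(transactions, lhs)
    if s_lhs == 0:
        return 0.0
    return support(transactions, lhs | rhs) / s_lhs


# Toy data (hypothetical): four transactions over the items a, b, c.
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]

s_ab = support(transactions, {"a", "b"})          # 2 of 4 transactions
c_a_b = confidence(transactions, {"a"}, {"b"})    # 2 of the 3 containing "a"
```

A rule would be kept only if both values exceed their respective thresholds.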
Background Knowledge: Differential Rule Mining
Aim: find differences between two deep web data sources, e.g. the price of the same hotels on two web sites.
Identical attributes vs. differential attributes: attributes with the same vs. different values in the two sources.
A rule is stated in terms of:
X: a frequent itemset composed of identical attributes
t: the differential (target) attribute
D1, D2: the two data sources
Basic Formulation: Problem Formulation
Two-step sampling procedure:
Pilot sample: drawn randomly from the deep web; interesting rules are identified.
Additional sample: verifies the identified rules (association rules and differential rules) by sampling more data records that satisfy X.
If X contains only input attributes, this is easy.
If X contains output attributes, random sampling is not efficient. How should such records be sampled?
Basic Formulation: Problem Formulation in Detail
Consider rules with a single output attribute A on the left-hand side.
Association rule: estimate the support or the confidence of the rule.
Differential rule: estimate the mean of the target attribute t given A = a.
Goal of sampling: high estimation accuracy at low sampling cost.
Basic Formulation: Stratified Sampling
Sample separately from each stratum; the data should be heterogeneous across strata and homogeneous within each stratum.
The mean value of a variable is estimated from the size and the sampled mean value of each stratum.
Association rule mining: the variable indicates whether an itemset is contained in a transaction.
Differential rule mining: the variable is the value of the target attribute.
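As an illustration of the estimate described above, the sketch below uses the textbook stratified estimator, which weights each stratum's sampled mean by the stratum's share of the population. The slide's own formula did not survive extraction, so this form, the function name, and the toy numbers are assumptions.

```python
def stratified_mean(strata):
    """Estimate a population mean from per-stratum samples.

    `strata` is a list of (stratum_size, sampled_values) pairs.
    Each sampled mean is weighted by stratum_size / total_size.
    """
    total_size = sum(size for size, _ in strata)
    estimate = 0.0
    for size, sample in strata:
        sample_mean = sum(sample) / len(sample)
        estimate += (size / total_size) * sample_mean
    return estimate


# Hypothetical association-rule case: values are 0/1 indicators of whether
# a sampled transaction contains the itemset.
strata = [
    (600, [1, 0, 1, 1]),  # stratum of 600 records, sampled mean 0.75
    (400, [0, 0, 1, 0]),  # stratum of 400 records, sampled mean 0.25
]
estimate = stratified_mean(strata)  # roughly 0.6 * 0.75 + 0.4 * 0.25
```

For differential rule mining the same function applies with the sampled values of the target attribute in place of the indicators.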
Background: Neyman Allocation
Sample allocation: determine the sample size for each stratum, given a fixed total sample size.
Neyman allocation: minimizes the variance of the stratified estimate.
Problem when applied to the deep web: the probability of A = a in each stratum is not considered, so the sampling cost (the number of queries submitted to the deep web) can be large.
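For reference, standard Neyman allocation assigns each stratum a share of the total sample proportional to its size times its standard deviation. The sketch below follows that textbook rule; the function name, the proportional fallback for the all-zero case, and the example numbers are assumptions made here.

```python
def neyman_allocation(sizes, stds, total_sample):
    """Split `total_sample` across strata in proportion to size * std.

    Returns fractional sample sizes; rounding to integers is left to
    the caller.
    """
    weights = [size * std for size, std in zip(sizes, stds)]
    total_weight = sum(weights)
    if total_weight == 0:
        # Every stratum is perfectly homogeneous: fall back to
        # allocation proportional to stratum size (an assumption).
        total_size = sum(sizes)
        return [total_sample * size / total_size for size in sizes]
    return [total_sample * w / total_weight for w in weights]


# Hypothetical strata: sizes 600 and 400, standard deviations 0.5 and 0.25.
allocation = neyman_allocation([600, 400], [0.5, 0.25], 100)
```

Note that nothing in this rule depends on how likely a query is to return a record with A = a, which is the limitation raised above.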
Sampling Cost
Sampling cost on the deep web: the aim is to obtain data records with A = a. The cost depends on the number of data records with A = a that are needed and on the probability of finding a data record with A = a.
Integrated cost: combines sampling cost and estimation variance, using two adjustable weights.
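The exact cost formulas are not recoverable from the slide, so the following is only a sketch under stated assumptions: (1) each query independently returns a record with A = a with probability `p_match`, so the expected number of queries needed to collect `n_needed` such records is `n_needed / p_match`; (2) the integrated cost is a weighted sum of variance and sampling cost. All names are hypothetical.

```python
def expected_sampling_cost(n_needed, p_match):
    """Expected number of queries to obtain `n_needed` matching records,
    assuming each query matches independently with probability `p_match`."""
    if not 0 < p_match <= 1:
        raise ValueError("p_match must be in (0, 1]")
    return n_needed / p_match


def integrated_cost(variance, sampling_cost, w_var, w_cost):
    """Assumed linear combination of the two criteria. The caller is
    expected to put both terms on comparable scales first."""
    return w_var * variance + w_cost * sampling_cost


# Hypothetical stratum: 50 matching records needed, 25% of records match.
queries = expected_sampling_cost(50, 0.25)
```

Under these assumptions, a stratum where A = a is rare is expensive to sample even if its variance is small, which is why the two terms are weighed against each other.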
Main Technical Approach: Stratification Process
Stratification by a tree built on the query space.
Top-down construction: at each node, the best split creates the child nodes; the best split uses the input attribute with the smallest integrated cost.
The splitting process stops when the integrated cost at each leaf node is small.
Leaf nodes are the final strata for sampling.
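The top-down process above can be sketched as a recursive greedy split over pilot-sample records. This is an illustration, not the authors' algorithm: the record layout, the `threshold` stopping rule, and the stand-in cost (within-node sum of squared deviations of a hypothetical field `y`, in place of the integrated cost) are all assumptions.

```python
from collections import defaultdict
from statistics import pvariance


def split_by(records, attr):
    """Group records by their value of the input attribute `attr`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attr]].append(record)
    return dict(groups)


def build_strata(records, attrs, node_cost, threshold, path=()):
    """Return the leaves of the tree as (path, records) pairs.

    `path` lists the (attribute, value) choices leading to a leaf.
    """
    # Stop when the node is already cheap or nothing is left to split on.
    if not attrs or node_cost(records) <= threshold:
        return [(path, records)]

    # Greedy step: pick the attribute whose children have the smallest
    # total cost.
    def split_cost(attr):
        return sum(node_cost(g) for g in split_by(records, attr).values())

    best = min(attrs, key=split_cost)
    groups = split_by(records, best)
    if len(groups) == 1:  # the split separates nothing
        return [(path, records)]

    remaining = [a for a in attrs if a != best]
    leaves = []
    for value, group in groups.items():
        child_path = path + ((best, value),)
        leaves.extend(
            build_strata(group, remaining, node_cost, threshold, child_path)
        )
    return leaves


def sse(records):
    """Stand-in cost: sum of squared deviations of `y` within the node."""
    return len(records) * pvariance([r["y"] for r in records])


# Hypothetical pilot sample: `y` varies with region, not with type.
pilot = [
    {"region": "N", "type": "x", "y": 10},
    {"region": "N", "type": "y", "y": 11},
    {"region": "S", "type": "x", "y": 50},
    {"region": "S", "type": "y", "y": 51},
]
leaves = build_strata(pilot, ["type", "region"], sse, threshold=1.0)
```

On this toy input the split on `region` is chosen over `type` because it leaves far less variation inside each child, and both children then fall under the threshold and become the final strata.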
Experiment Result
Data set: US census, the income of US households from the 2008 US Census; 40,000 data records; 7 categorical and 2 numerical attributes.
Two metrics: variance of estimation and sampling cost.
Experiment Result: Settings
Five sampling procedures.
Four use different weights for variance and sampling cost: Full_Var, Var7, Var5, Var3.
Rand: simple random sampling.
Experiment Result: Variance of Estimation
Association rule mining:
The variance of estimation increases as the weight on variance decreases.
Random sampling has a higher variance of estimation.
Experiment Result: Sampling Cost
Association rule mining:
The sampling cost decreases as the weight on variance decreases.
Random sampling has a higher sampling cost.
Conclusion
Stratified sampling for data mining on the deep web, considering both estimation accuracy and sampling cost.
A tree model captures the relation between input attributes and output attributes.
A greedy stratification method maximally reduces an integrated cost metric.
Our experiments show higher sampling accuracy and lower sampling cost compared with simple random sampling, and that sampling cost can be reduced by trading off a fraction of estimation accuracy.
Questions & Comments?