Published by Mercy Stafford. Modified over 9 years ago.
1 Data Mining over the Deep Web
Tantan Liu, Gagan Agrawal
{liut,agrawal}@cse.ohio-state.edu
Ohio State University
April 12, 2011
2 Outline
Introduction
–Deep Web
–Data Mining on the deep web
Frequent itemset mining over the deep web
–Bayesian network
–Active learning based sampling method
Experiment Result
Conclusion
3 Deep Web
Data sources hidden from the Internet
–Online query interface vs. database
–Database accessible through online interface
–Input attribute vs. output attribute
An example of the Deep Web
4 Data Mining over the Deep Web
High level summary of data
–Scenario 1: A student wants to find a job as a software engineer
  –Will a master's degree help?
  –Which language to learn: Java, C, or C#?
  –Try MSN careers: too much information!
Frequent itemset mining!
5 Challenges
Databases cannot be accessed directly
–Sampling method for deep web mining
Obtaining data is time consuming
–Efficient sampling method
–High accuracy with low sampling cost
6 Roadmap
Introduction
–Deep Web
–Data Mining
Frequent itemset mining over the deep web
–Bayesian network
–Active learning based sampling method
Experiment Result
Conclusion
7 Frequent Itemset Mining
Itemset: a set of attributes with instantiations, e.g. I={Brand=Benz, Age>5}
Support(Brand=Benz, Age>5)=2/8=0.25
Frequent itemset: support is larger than a threshold
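The support definition above can be sketched in a few lines of Python. The eight records below are hypothetical (the slide's table is not available); they are chosen only so that 2 of 8 match, as in the slide's 2/8=0.25. The function name `support` and the 0.2 threshold are likewise assumptions for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Condition = Callable[[Record], bool]


def support(records: List[Record], itemset: List[Condition]) -> float:
    """Fraction of records that satisfy every condition in the itemset."""
    if not records:
        return 0.0
    matches = sum(1 for r in records if all(cond(r) for cond in itemset))
    return matches / len(records)


# Hypothetical 8-record table (NOT the slide's data); 2 records match.
cars = [
    {"Brand": "Benz", "Age": 7},
    {"Brand": "Benz", "Age": 6},
    {"Brand": "Benz", "Age": 3},
    {"Brand": "Benz", "Age": 1},
    {"Brand": "Honda", "Age": 8},
    {"Brand": "Honda", "Age": 9},
    {"Brand": "Honda", "Age": 2},
    {"Brand": "Honda", "Age": 4},
]

# Itemset I = {Brand=Benz, Age>5}, written as one predicate per item.
itemset = [lambda r: r["Brand"] == "Benz", lambda r: r["Age"] > 5]

s = support(cars, itemset)
threshold = 0.2  # hypothetical minimum support
is_frequent = s > threshold
print(s, is_frequent)
```

On the deep web this direct computation is not possible for output attributes, because the full table is never visible; it only serves to pin down what is being estimated.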
8 Frequent Itemset Mining on Deep Web
Challenges
–Support of itemsets is unavailable
–The size of itemsets could be huge
Considering 1-itemsets
–Simple random sampling is inefficient
Support of itemsets of input attributes is known
–The number of data records satisfying the query is provided
9 Main Idea
Task: estimating the support of itemsets of output attributes
Questions
–Can we use information about input attributes?
Bayesian network
–Relation between input attributes and output attributes
–Compute support for itemsets of output attributes
–How to quickly build the model?
Active learning based sampling method
10 Bayesian Network
Relation between input and output attributes
Graphical model
–Random variables: input and output attributes
–Conditional dependencies: output attributes depend on input attributes
11 Active Learning
In machine learning
–Passive learning: data are randomly chosen
–Active learning: certain data are selected to help build a better model
Active learning
–Useful when obtaining data is costly and/or time-consuming
–This is the case for frequent itemset mining on the deep web
12 An Example of Bayesian Network
Nodes: Brand, Age, Mileage, Price
Conditional table of Price given (Brand, Age), with rows (H, <=5), (H, >5), (B, <=5), (B, >5)
Support of itemsets depends on parameters in the Bayesian network
Parameters are estimated based on a sample
‒ Parameter: p(price<=5000|H,<=5)
  2 data records satisfying brand=H, Age<=5
  1 data record satisfying brand=H, Age<=5, price<=5000
‒ p(price<=5000|H,<=5)=1/2=0.5
The probabilities of the (Brand, Age) combinations are known; the conditional probabilities are estimated
Support(Price<=5000) = 0.125 + 0.125 + 0.0 + 0.0 = 0.25
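A minimal sketch of the two steps in this example: estimating a conditional parameter from sample counts, then combining the estimates with known priors to get a support value. The sample rows and the uniform priors are hypothetical; they are picked so that p(price<=5000|H,<=5)=1/2 and the total comes out to 0.25 as on the slide, but the slide's actual priors and table are not recoverable. The helper name `estimate_conditionals` is also an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple

Query = Tuple[str, str]  # (brand, age bucket)

# Hypothetical sample rows: (brand, age bucket, price <= 5000?)
sample: List[Tuple[str, str, bool]] = [
    ("H", "<=5", True), ("H", "<=5", False),
    ("H", ">5", True), ("H", ">5", False),
    ("B", "<=5", False),
    ("B", ">5", False),
]

# Hypothetical known priors p(brand, age); on the deep web these would come
# from the record counts that the query interface reports.
prior: Dict[Query, float] = {
    ("H", "<=5"): 0.25, ("H", ">5"): 0.25,
    ("B", "<=5"): 0.25, ("B", ">5"): 0.25,
}


def estimate_conditionals(rows: List[Tuple[str, str, bool]]) -> Dict[Query, float]:
    """Estimate p(price<=5000 | brand, age) as hits / total per query space."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for brand, age, low_price in rows:
        totals[(brand, age)] += 1
        if low_price:
            hits[(brand, age)] += 1
    return {q: hits[q] / totals[q] for q in totals}


cond = estimate_conditionals(sample)

# Support(price<=5000) = sum over query spaces of prior * conditional.
terms = [prior[q] * cond.get(q, 0.0) for q in prior]
support_low_price = sum(terms)
print(cond[("H", "<=5")], terms, support_low_price)
```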
13 Example of Active Learning on Deep Web
Queries Qi, i=1,…,4, on the deep web data source
–Q1: B=H & Age<=5
–Q2: B=H & Age>5
–Q3: B=B & Age<=5
–Q4: B=B & Age>5
Each query Qi returns sampled data
Price parameters per query: (p11, p12), (p21, p22), (p31, p32), (p41, p42)
14 An Example of Active Learning Based Sampling
Underlying idea
–Sampling heavily on query spaces with high impurity
Price distribution per query on the deep web data source
–Q1 (B=H): 0.01, 0.99
–Q2 (B=B): 0.5, 0.5
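To make "high impurity" concrete, the sketch below scores each query space with the Gini impurity and splits a sampling budget in proportion to it. Both choices are assumptions made for illustration: the slides do not say which impurity measure or allocation rule is used, and in practice the distributions would themselves be estimates. The values 0.01/0.99 and 0.5/0.5 are read from the slide.

```python
from typing import Dict, List


def gini_impurity(probs: List[float]) -> float:
    """Gini impurity: 0 for a pure distribution, largest when uniform."""
    return 1.0 - sum(p * p for p in probs)


def sampling_shares(distributions: Dict[str, List[float]]) -> Dict[str, float]:
    """Share of the sampling budget per query, proportional to impurity."""
    impurity = {q: gini_impurity(p) for q, p in distributions.items()}
    total = sum(impurity.values())
    if total == 0.0:
        # Every query space is pure: fall back to an even split.
        return {q: 1.0 / len(impurity) for q in impurity}
    return {q: value / total for q, value in impurity.items()}


# Price distributions per query space, as read from the slide.
distributions = {"Q1": [0.01, 0.99], "Q2": [0.5, 0.5]}
shares = sampling_shares(distributions)
print(shares)
```

Q1 is almost pure, so a handful of records already pins down its price distribution; nearly all of the budget goes to Q2.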
15 Detailed Formulation
Support for output attributes: $\mathrm{Support}(o) = \sum_{q} p(q)\, p(o \mid q)$
– $q$: an instantiation of input attributes, or a query
– $p(q)$: prior probability (known)
– $p(o \mid q)$: conditional probability; parameters in the conditional table (unknown, need to estimate)
16 Parameters in the Bayesian network are estimated based on a sample
–There is a difference between estimated values and true values
–The parameters are therefore considered as random variables
Conjugate distribution
–After observing data D, the posterior distribution of the parameters is in the same family as the prior
–The family is described by hyperparameters
–Expectation: computed from the hyperparameters
Estimation for support of output attributes
–Expectation over the distribution of the parameters
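As an illustration of a conjugate update, the sketch below assumes a Dirichlet prior over one row of a conditional table (the standard conjugate family for categorical data; the slides do not name the family, so this is an assumption). The uniform prior [1, 1] is hypothetical; the counts reuse the earlier example of 2 records with brand=H, Age<=5, of which 1 has price<=5000.

```python
from typing import List


def posterior_hyperparameters(alpha: List[float], counts: List[int]) -> List[float]:
    """Dirichlet update: add the observed count of each value to its hyperparameter."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]


def expectation(alpha: List[float]) -> List[float]:
    """Mean of a Dirichlet distribution: alpha_k / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


alpha_prior = [1.0, 1.0]   # hypothetical uniform prior over (price<=5000, price>5000)
counts = [1, 1]            # 1 of the 2 sampled records has price<=5000

alpha_post = posterior_hyperparameters(alpha_prior, counts)
estimate = expectation(alpha_post)
print(alpha_post, estimate)
```

Because the posterior stays in the same family, every new batch of sampled records is absorbed by the same addition, which keeps the step-by-step sampling loop cheap.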
17 Active Learning on Deep Web
Risk Function
–Risk of the estimation for 1-itemsets composed of output attributes
–Based on the hyperparameters in the Bayesian network
Data Selection
–Data are obtained by queries: query selection
–Data records are selected step by step
–Choose the query with the largest reduction in the risk function
Updating Model
–For the selected query and its sample, the hyperparameters are updated with counts, where each count denotes the number of data records containing the corresponding value
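The select-then-update loop can be sketched as follows. The slides do not give the risk function, so this sketch substitutes a simple one: the variance of the support estimate for a single binary output attribute under independent Beta posteriors, i.e. the sum over queries of p(q)^2 times the Beta variance. The priors, the true hit rates, and the `draw` function that stands in for a deep web query are all hypothetical.

```python
import random
from typing import Callable, Dict, List


def beta_variance(a: float, b: float) -> float:
    """Variance of a Beta(a, b) distribution."""
    s = a + b
    return a * b / (s * s * (s + 1.0))


def risk(prior: Dict[str, float], alpha: Dict[str, List[float]]) -> float:
    """Assumed risk: variance of sum_q p(q) * theta_q with independent Beta posteriors."""
    return sum(prior[q] ** 2 * beta_variance(*alpha[q]) for q in prior)


def expected_reduction(p_q: float, a: float, b: float) -> float:
    """Expected drop in risk from one more record: p(q)^2 * Var / (a + b + 1)."""
    return p_q ** 2 * beta_variance(a, b) / (a + b + 1.0)


def active_sample(prior: Dict[str, float],
                  draw: Callable[[str, random.Random], bool],
                  budget: int, seed: int = 0) -> Dict[str, List[float]]:
    """Greedy loop: query the space with the largest expected risk reduction."""
    rng = random.Random(seed)
    alpha = {q: [1.0, 1.0] for q in prior}  # hypothetical uniform Beta priors
    for _ in range(budget):
        q = max(prior, key=lambda k: expected_reduction(prior[k], *alpha[k]))
        hit = draw(q, rng)                  # one record returned for query q
        alpha[q][0 if hit else 1] += 1.0    # conjugate update with the new count
    return alpha


# Hypothetical simulation of the data source: Q1 is nearly pure, Q2 is not.
true_p = {"Q1": 0.01, "Q2": 0.5}
prior = {"Q1": 0.5, "Q2": 0.5}


def draw(query: str, rng: random.Random) -> bool:
    """Stand-in for issuing a query and checking one returned record."""
    return rng.random() < true_p[query]


alpha = active_sample(prior, draw, budget=200)
samples_used = {q: sum(alpha[q]) - 2.0 for q in alpha}
print(samples_used, risk(prior, alpha))
```

The expected-reduction formula follows from the law of total variance for a Beta posterior and one Bernoulli observation. Under this assumed risk, the loop spends most of the budget on the impure query space Q2.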
18 Support for n-itemsets (n>1)
Estimation based on the Bayesian network
–Uses the support value of the itemset in each query space
19 Roadmap
Introduction
–Deep Web
–Data Mining
Frequent itemset mining over the deep web
–Bayesian network
–Active learning based sampling method
Experiment Result
Conclusion
20 Experiment Result
Data set: US census
–2008 US Census on the income of US households
–40,000 data records
Three Methods
–Dir: random sampling, direct computation
–Bay: random sampling, computation based on Bayesian network
–Act (our proposed method): active learning based sampling, computation based on Bayesian network
21 US census
[Figures: Square Error Rate and Absolute Error Rate (AER)]
22 Conclusion
Data mining on the deep web is challenging
Frequent itemset mining over the deep web
A Bayesian network is used to model the deep web
An active learning based sampling method
The experiment results show the efficiency of our work
23 Questions?