Published by Myles Webb. Modified over 9 years ago.
1
Data Mining: Potentials and Challenges
Rakesh Agrawal
IBM Almaden Research Center
2
Thesis
Data mining has started to live up to its promise in the commercial world, particularly in applications involving structured data.
Promising data mining applications in non-conventional domains are beginning to emerge, involving combinations of structured and unstructured data.
Investment in data mining research can have a large payoff.
3
Outline
Examples of some promising non-conventional data mining applications and technologies
Some hurdles we need to cross
4
Identifying Social Links Using Association Rules
Input: Crawl of about 1 million pages
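The slide gives only the input (a page crawl), not the mining procedure. As a minimal sketch, assuming each page has already been reduced to the list of person names it mentions (the extraction step, the function name `mine_pair_rules`, and the thresholds are all hypothetical), pairwise association rules can be derived from co-occurrence counts:

```python
from collections import Counter
from itertools import combinations

def mine_pair_rules(pages, min_support, min_confidence):
    """Return rules (lhs, rhs, support, confidence) read as 'lhs appears => rhs appears'.

    pages: list of lists of names found on each page (an assumed preprocessing step).
    """
    n = len(pages)
    item_counts = Counter()
    pair_counts = Counter()
    for names in pages:
        unique = sorted(set(names))          # count each name once per page
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # fraction of pages containing both names
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]   # P(rhs on page | lhs on page)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Toy stand-in for a crawl: four pages and the names seen on each.
pages = [
    ["alice", "bob"],
    ["alice", "bob", "carol"],
    ["alice", "carol"],
    ["bob", "dave"],
]
rules = mine_pair_rules(pages, min_support=0.5, min_confidence=0.6)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} => {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

A high-confidence rule between two names is then read as a candidate social link. A real crawl would need a proper frequent-itemset miner rather than exhaustive pair counting.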
5
Website Profiling Using Classification
Input: Example pages for each category during training
6
Discovering Trends Using Sequential Patterns & Shape Queries
Input: i) patent database ii) shape of interest
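To illustrate what a "shape of interest" might look like in practice, here is a rough sketch. It assumes each phrase already has a yearly usage count derived from the patent database, and it encodes a shape as a regular expression over up/down/flat transitions. The encoding and the names `encode` and `matches_shape` are illustrative assumptions, not the query language used in the original system.

```python
import re

def encode(series, tolerance=0.0):
    """Turn a numeric history into a string of U (up), D (down), F (flat) steps."""
    symbols = []
    for prev, cur in zip(series, series[1:]):
        if cur - prev > tolerance:
            symbols.append("U")
        elif prev - cur > tolerance:
            symbols.append("D")
        else:
            symbols.append("F")
    return "".join(symbols)

def matches_shape(series, shape_regex, tolerance=0.0):
    """True if the whole history follows the shape, e.g. 'U+' for a steady rise."""
    return re.fullmatch(shape_regex, encode(series, tolerance)) is not None

# Hypothetical per-period counts for three phrases.
histories = {
    "phrase_a": [1, 2, 4, 7, 9],
    "phrase_b": [9, 7, 4, 4, 1],
    "phrase_c": [5, 2, 1, 3, 8],
}
rising = [name for name, h in histories.items() if matches_shape(h, "U+")]
rebounding = [name for name, h in histories.items() if matches_shape(h, "D+U+")]
print(rising, rebounding)
```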
7
Discovering Micro-communities
Frequently co-cited pages are related.
Pages with large bibliographic overlap are related.
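Both relatedness signals can be computed directly from a link graph. The sketch below assumes the graph is available as a mapping from each page to the set of pages it cites, and it measures bibliographic overlap with the Jaccard coefficient. That measure is one reasonable choice made here for illustration, since the slide does not name one.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(links):
    """links maps each page to the set of pages it cites.

    Returns, for each pair of cited pages, how many pages cite both of them.
    """
    counts = Counter()
    for cited in links.values():
        counts.update(combinations(sorted(cited), 2))
    return counts

def bibliographic_overlap(links, a, b):
    """Jaccard overlap between the outgoing citations of pages a and b."""
    refs_a = links.get(a, set())
    refs_b = links.get(b, set())
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0

# Toy link graph: p1..p3 are citing pages, x..z are cited pages.
links = {
    "p1": {"x", "y"},
    "p2": {"x", "y", "z"},
    "p3": {"x", "z"},
}
print(cocitation_counts(links).most_common())
print(bibliographic_overlap(links, "p1", "p2"))
```

Pairs with a high co-citation count, or pairs of pages with high overlap, become candidates for membership in the same micro-community.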
8
Technical Chasms
Privacy Concerns?
– Privacy-preserving data mining
Data for data mining?
– Data mining over compartmentalized databases
9
Inducing Classifiers over Privacy Preserved Numeric Data
[Figure: records such as 30 | 25K | … and 50 | 40K | … (labels: Alice’s age, Alice’s salary, John’s age) each pass through a Randomizer, giving 65 | 50K | … and 35 | 60K | …; for example, 30 becomes 65 (30+35). The age distribution and the salary distribution are then reconstructed and passed to a decision tree algorithm, which produces the model.]
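The randomizer step can be sketched as additive noise: each client adds a random value to its true value before disclosing it, as in the slide's example where 30 becomes 65 (30+35). The uniform noise distribution, the `spread` parameter, and the function name `randomize` are assumptions made for illustration only.

```python
import random

def randomize(value, spread, rng):
    """Disclose value + r, where r is drawn uniformly from [-spread, +spread]."""
    return value + rng.uniform(-spread, spread)

rng = random.Random(42)          # seeded only so the sketch is repeatable
ages = [30, 50, 41, 27, 63]      # hypothetical true values, never sent to the server
disclosed = [randomize(a, spread=50, rng=rng) for a in ages]

for true_age, shown in zip(ages, disclosed):
    print(f"true {true_age:>2}  disclosed {shown:6.1f}")
```

Individual disclosed values say little about any one person. The miner relies on knowing the noise distribution to recover the distribution of the original values in aggregate.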
10
Reconstruction Algorithm
f_X^0 := Uniform distribution
j := 0
repeat
    f_X^{j+1}(a) := Bayes’ Rule
    j := j+1
until (stopping criterion met)
Converges to maximum likelihood estimate. – D. Agrawal & C.C. Aggarwal, PODS 2001.
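The loop above can be made concrete on a discretized domain. The sketch below assumes the Bayes' Rule step takes the usual posterior-averaging form: for each disclosed value w_i = x_i + y_i, compute the posterior over candidate originals a using the known noise density f_Y, then average those posteriors over all i. The grid, the noise density, the stopping rule, and the name `reconstruct` are illustrative assumptions.

```python
import random

def reconstruct(disclosed, grid, noise_pdf, max_iters=200, tol=1e-6):
    """Estimate P(X = a) for each a in grid from disclosed values w = x + y."""
    f = [1.0 / len(grid)] * len(grid)            # f_X^0 := uniform distribution
    for _ in range(max_iters):
        new_f = [0.0] * len(grid)
        for w in disclosed:
            # Bayes' rule: posterior over grid points given the disclosed value w.
            post = [noise_pdf(w - a) * fa for a, fa in zip(grid, f)]
            total = sum(post)
            if total == 0.0:
                continue                          # w is unreachable from the grid; skip it
            for k, p in enumerate(post):
                new_f[k] += p / total
        norm = sum(new_f)
        if norm == 0.0:
            raise ValueError("no disclosed value is consistent with the grid and noise")
        new_f = [v / norm for v in new_f]         # average of the per-record posteriors
        change = max(abs(u - v) for u, v in zip(new_f, f))
        f = new_f
        if change < tol:                          # stopping criterion (an assumption)
            break
    return f

# Demo: originals sit on a few grid points; noise is uniform on [-20, 20].
rng = random.Random(0)
grid = list(range(20, 81, 5))
originals = [rng.choice([30, 35, 60]) for _ in range(300)]
disclosed = [x + rng.uniform(-20, 20) for x in originals]
uniform_noise = lambda y: 1 / 40 if -20 <= y <= 20 else 0.0

estimate = reconstruct(disclosed, grid, uniform_noise)
for a, p in zip(grid, estimate):
    print(f"{a:>3}  {p:.3f}")
```

The server needs only the disclosed values and the noise density, never the originals, which is what makes the scheme usable for building the classifier.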
11
Works Well
12
Accuracy vs. Randomization
13
Discovering Frequent Itemsets
Soccer: s_min = 0.2%
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 266 | 254 | 12 | 31
2 | 217 | 195 | 22 | 45
3 | 48 | 43 | 5 | 26
Mailorder: s_min = 0.2%
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 65 | | 0 | 0
2 | 228 | 212 | 16 | 28
3 | 22 | 18 | 4 | 5
Breach level = 50%.
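The column names can be read as set comparisons between the itemsets that are truly frequent in the original data and those reported as frequent after mining the randomized data. A minimal sketch of that bookkeeping follows. The function name `compare_itemsets` and the toy itemsets are hypothetical.

```python
def compare_itemsets(true_itemsets, mined_itemsets):
    """Compare the truly frequent itemsets with those found on randomized data."""
    true_set = set(true_itemsets)
    mined = set(mined_itemsets)
    return {
        "true_itemsets": len(true_set),
        "true_positives": len(true_set & mined),    # frequent and found
        "false_drops": len(true_set - mined),       # frequent but missed
        "false_positives": len(mined - true_set),   # reported but not frequent
    }

# Toy example; frozensets let itemsets be members of a set.
true_itemsets = {frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})}
mined_itemsets = {frozenset({"a"}), frozenset({"a", "b"}), frozenset({"c"})}
report = compare_itemsets(true_itemsets, mined_itemsets)
print(report)
```

By construction, true positives plus false drops equals the number of true itemsets, which is a quick consistency check for each row of such a table.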
14
Computation over Compartmentalized Databases
15
Some Hard Problems
Past may be a poor predictor of future
– Abrupt changes
– Wrong training examples
Reliability and quality of data
Actionable patterns (principled use of domain knowledge?)
Over-fitting vs. not missing the rare nuggets
Richer patterns
Simultaneous mining over multiple data types
When to use which algorithm?
Automatic, data-dependent selection of algorithm parameters
16
Summary
Data mining has shown promise, but we need further research to realize its full potential.
"We stand on the brink of great new answers, but even more, of great new questions" -- Matt Ridley