Data Mining: Potentials and Challenges. Rakesh Agrawal, IBM Almaden Research Center
Thesis: Data mining has started to live up to its promise in the commercial world, particularly in applications involving structured data. Promising data mining applications in non-conventional domains are beginning to emerge, involving combinations of structured and unstructured data. Investment in data mining research can have a large payoff.
Outline: Examples of some promising non-conventional data mining applications and technologies. Some hurdles we need to cross.
Identifying Social Links Using Association Rules. Input: crawl of about 1 million pages.
Website Profiling Using Classification. Input: example pages for each category during training.
Discovering Trends Using Sequential Patterns & Shape Queries. Input: i) patent database, ii) shape of interest.
Discovering Micro-communities: Frequently co-cited pages are related. Pages with large bibliographic overlap are related.
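The two relatedness signals on this slide can be illustrated with a minimal Python sketch. It assumes a hypothetical toy link graph (`links`, with made-up page names), counts how many pages cite each pair of targets (co-citation), and uses Jaccard overlap of outgoing links as one possible measure of bibliographic overlap. The slide does not say which measure was actually used.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy link graph: page -> set of pages it cites (links to).
links = {
    "p1": {"a", "b", "c"},
    "p2": {"a", "b"},
    "p3": {"b", "c"},
    "p4": {"a", "b", "d"},
}

def cocitation_counts(links):
    """For each pair of cited pages, count how many pages cite both."""
    counts = defaultdict(int)
    for targets in links.values():
        for u, v in combinations(sorted(targets), 2):
            counts[(u, v)] += 1
    return dict(counts)

def bibliographic_overlap(links, p, q):
    """Jaccard overlap of the outgoing reference sets of pages p and q
    (an assumed measure, chosen for illustration)."""
    union = links[p] | links[q]
    return len(links[p] & links[q]) / len(union) if union else 0.0

cocited = cocitation_counts(links)
overlap_p1_p2 = bibliographic_overlap(links, "p1", "p2")

print(cocited[("a", "b")])   # pages a and b are cited together by p1, p2, p4
print(overlap_p1_p2)         # p1 and p2 share 2 of 3 distinct references
```

Pairs with a high co-citation count, or page pairs with high overlap, would be the candidates for grouping into a micro-community.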
Technical Chasms: Privacy concerns? – Privacy-preserving data mining. Data for data mining? – Data mining over compartmentalized databases.
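Inducing Classifiers over Privacy Preserved Numeric Data. [Figure: records with age and salary fields (30 | 25K | …, 50 | 40K | …) each pass through a Randomizer, producing randomized records (65 | 50K | …, 35 | 60K | …). The age distribution and the salary distribution are reconstructed from the randomized values and fed to a decision tree algorithm, which produces the model. Example label: 30 becomes 65 (30+35). Field labels: Alice's age, Alice's salary, John's age.]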
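The randomization step in the figure (30 becomes 65 by adding 35) can be sketched as additive noise. The Python sketch below is illustrative only: the uniform noise range, the `randomize` helper, and the synthetic ages are assumptions, not details from the slide.

```python
import random

def randomize(values, spread, rng):
    """Add independent noise drawn uniformly from [-spread, +spread]
    to each value (the noise distribution is an assumption)."""
    return [v + rng.uniform(-spread, spread) for v in values]

rng = random.Random(42)

# Synthetic "true" ages, made up for this example.
true_ages = [rng.randint(20, 60) for _ in range(10000)]
noisy_ages = randomize(true_ages, spread=35, rng=rng)

true_mean = sum(true_ages) / len(true_ages)
noisy_mean = sum(noisy_ages) / len(noisy_ages)

# Individual values are heavily distorted, but aggregates survive,
# because the noise has zero mean.
print(true_ages[:3], [round(a, 1) for a in noisy_ages[:3]])
print(round(true_mean, 2), round(noisy_mean, 2))
```

The point of the sketch is that a single randomized value reveals little about the original, while properties of the whole collection remain estimable.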
Reconstruction Algorithm
$f_X^{0}$ := uniform distribution
$j$ := 0
repeat
    $f_X^{j+1}(a)$ := Bayes' Rule
    $j$ := $j+1$
until (stopping criterion met)
Converges to maximum likelihood estimate. – D. Agrawal & C.C. Aggarwal, PODS 2001.
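The loop above can be sketched in Python over a discretized domain. The update used here is an assumption about what the "Bayes' Rule" step expands to: each randomized observation w_i spreads one unit of weight across bins in proportion to f_Y(w_i - a) * f_X^j(a), and the weights are averaged over all observations. The bin midpoints, the uniform noise density, the stopping tolerance, and the synthetic data are likewise illustrative choices.

```python
import random

def uniform_noise_density(spread):
    """Density of noise drawn uniformly from [-spread, +spread]."""
    def f_y(d):
        return 1.0 / (2 * spread) if -spread <= d <= spread else 0.0
    return f_y

def reconstruct(perturbed, midpoints, f_y, max_iters=100, tol=1e-4):
    """Estimate the probability mass of the original values on each bin."""
    k = len(midpoints)
    # Noise likelihood of each observation under each bin, computed once.
    lik = [[f_y(w - a) for a in midpoints] for w in perturbed]
    f_x = [1.0 / k] * k                      # f_X^0 := uniform distribution
    for _ in range(max_iters):
        new = [0.0] * k
        for row in lik:
            # Posterior over bins for this observation (Bayes' rule).
            post = [lk * p for lk, p in zip(row, f_x)]
            total = sum(post)
            if total > 0.0:                  # skip observations no bin explains
                for i in range(k):
                    new[i] += post[i] / total
        norm = sum(new)
        new = [v / norm for v in new]
        change = sum(abs(a - b) for a, b in zip(new, f_x))
        f_x = new
        if change < tol:                     # assumed stopping criterion
            break
    return f_x

# Synthetic example: half the originals are 25, half are 65.
rng = random.Random(0)
spread = 20.0
originals = [25.0] * 500 + [65.0] * 500
perturbed = [x + rng.uniform(-spread, spread) for x in originals]
midpoints = [5.0 + 10.0 * i for i in range(9)]   # 5, 15, ..., 85
estimate = reconstruct(perturbed, midpoints, uniform_noise_density(spread))

for a, p in zip(midpoints, estimate):
    print(f"{a:5.0f}  {p:.3f}")
```

Only the randomized values and the noise density enter `reconstruct`; the original values are used here solely to generate test data.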
Works Well
Accuracy vs. Randomization
Discovering frequent itemsets. [Two tables, one per dataset: Soccer (s_min = 0.2%) and Mailorder (s_min = 0.2%). Columns: Itemset Size, True Itemsets, True Positives, False Drops, False Positives. Breach level = 50%.]
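The table columns can be read as a comparison between the itemsets that are frequent in the original data and those reported after mining the randomized data. The sketch below assumes these definitions: a true positive is a true itemset that is also reported, a false drop is a true itemset that is missed, and a false positive is a reported itemset that is not truly frequent. The itemsets are made up for illustration and are not values from the slide.

```python
def compare_itemsets(true_sets, found_sets):
    """Count true positives, false drops, and false positives
    (definitions assumed from the column names)."""
    true_sets = {frozenset(s) for s in true_sets}
    found_sets = {frozenset(s) for s in found_sets}
    return {
        "true_itemsets": len(true_sets),
        "true_positives": len(true_sets & found_sets),
        "false_drops": len(true_sets - found_sets),
        "false_positives": len(found_sets - true_sets),
    }

# Hypothetical itemsets of size 2.
truly_frequent = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"}]
reported = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"a", "d"}]

summary = compare_itemsets(truly_frequent, reported)
print(summary)
```

Under these definitions, true positives and false drops always sum to the number of true itemsets, which is a quick consistency check on any row of such a table.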
Computation over Compartmentalized Databases
Some Hard Problems: Past may be a poor predictor of the future – abrupt changes, wrong training examples. Reliability and quality of data. Actionable patterns (principled use of domain knowledge?). Over-fitting vs. not missing the rare nuggets. Richer patterns. Simultaneous mining over multiple data types. When to use which algorithm? Automatic, data-dependent selection of algorithm parameters.
Summary: Data mining has shown promise, but we need further research to realize its full potential. "We stand on the brink of great new answers, but even more, of great new questions" -- Matt Ridley