Data Mining: Potentials and Challenges Rakesh Agrawal IBM Almaden Research Center.

Presentation transcript:

Data Mining: Potentials and Challenges Rakesh Agrawal IBM Almaden Research Center

Thesis
Data mining has started to live up to its promise in the commercial world, particularly in applications involving structured data.
Promising data mining applications in non-conventional domains are beginning to emerge, involving a combination of structured and unstructured data.
Investment in data mining research can have a large payoff.

Outline
Examples of some promising non-conventional data mining applications and technologies
Some hurdles we need to cross

Identifying Social Links Using Association Rules
Input: Crawl of about 1 million pages
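The slide does not say how pages were turned into transactions, so the following is only a minimal sketch under one assumption: each crawled page is treated as a transaction containing the names that appear on it, and a rule "a implies b" is kept when its support and confidence clear chosen thresholds. The function name `pair_rules`, the thresholds, and the toy pages are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(pages, min_support=0.5, min_confidence=0.6):
    """Return sorted rules (lhs, rhs, support, confidence) over name pairs.

    pages: list of lists of names found on each page (hypothetical input).
    """
    n = len(pages)
    item_counts = Counter()
    pair_counts = Counter()
    for names in pages:
        unique = sorted(set(names))          # ignore repeats within a page
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # fraction of pages with both names
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Toy data standing in for a crawl; the names are made up.
pages = [
    ["alice", "bob"],
    ["alice", "bob", "carol"],
    ["alice", "carol"],
    ["bob", "dave"],
]
rules = pair_rules(pages)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Restricting the sketch to pairs keeps it short; a crawl of the size mentioned on the slide would need a level-wise algorithm such as Apriori rather than exhaustive pair counting in memory.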

Website Profiling Using Classification
Input: Example pages for each category during training

Discovering Trends Using Sequential Patterns & Shape Queries
Input: i) patent database, ii) shape of interest

Discovering Micro-communities
Frequently co-cited pages are related.
Pages with large bibliographic overlap are related.
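The two relatedness signals on this slide can be sketched on a toy link graph. This is an illustration only: the slide gives no formulas, so measuring bibliographic overlap with the Jaccard coefficient is an assumption, and the names `co_citation_counts`, `bibliographic_overlap`, and the page identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_citation_counts(links):
    """Count, for each pair of cited pages, how many pages cite both.

    links: dict mapping a page to the set of pages it links to.
    """
    counts = Counter()
    for cited in links.values():
        counts.update(combinations(sorted(cited), 2))
    return counts


def bibliographic_overlap(links, a, b):
    """Jaccard overlap of the outgoing-link sets of pages a and b (assumed measure)."""
    refs_a = links.get(a, set())
    refs_b = links.get(b, set())
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0


# Toy link graph: p1..p3 are citing pages, x..z are cited pages.
links = {
    "p1": {"x", "y"},
    "p2": {"x", "y", "z"},
    "p3": {"y", "z"},
}
cocited = co_citation_counts(links)
overlap = bibliographic_overlap(links, "p1", "p2")
print(cocited[("x", "y")], round(overlap, 3))
```

Pairs with high co-citation counts or high overlap would then be candidates for grouping into a community; the thresholds for "frequently" and "large" are left open, as they are on the slide.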

Technical Chasms
Privacy concerns?
– Privacy-preserving data mining
Data for data mining?
– Data mining over compartmentalized databases

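Inducing Classifiers over Privacy Preserved Numeric Data
[Figure: original records (e.g. 30 | 25K | … and 50 | 40K | …, labeled Alice's age, Alice's salary, John's age) each pass through a randomizer, yielding randomized records (e.g. 65 | 50K | … and 35 | 60K | …). The age and salary distributions are reconstructed from the randomized values and fed to a decision tree algorithm, which produces the model. Example: 30 becomes 65 (30+35).]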

Reconstruction Algorithm
f_X^0 := Uniform distribution
j := 0
repeat
    f_X^{j+1}(a) := Bayes' Rule
    j := j+1
until (stopping criterion met)
Converges to maximum likelihood estimate. – D. Agrawal & C.C. Aggarwal, PODS 2001.
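The transcript does not reproduce the Bayes' rule update itself, so the sketch below makes assumptions that should be read as such: values are randomized by adding independent noise with a known density, the original distribution is estimated on a discrete grid, and each iteration averages the per-record posteriors under the current estimate. The uniform noise model, the grid, the fixed iteration count used as stopping criterion, and all function names are hypothetical.

```python
import random


def uniform_noise_density(y, half_width=20.0):
    """Density of the assumed noise Y ~ Uniform(-half_width, half_width)."""
    return 1.0 / (2 * half_width) if -half_width <= y <= half_width else 0.0


def reconstruct(randomized, grid, noise_density, iterations=30):
    """Estimate the distribution of X on grid from values w_i = x_i + y_i."""
    m = len(grid)
    estimate = [1.0 / m] * m                 # f_X^0 := uniform distribution
    for _ in range(iterations):              # stopping criterion: fixed count
        updated = [0.0] * m
        for w in randomized:
            # Bayes' rule: posterior over grid points given this randomized value
            weights = [noise_density(w - a) * p for a, p in zip(grid, estimate)]
            total = sum(weights)
            if total == 0.0:
                continue                     # value incompatible with the grid
            for k in range(m):
                updated[k] += weights[k] / total
        norm = sum(updated)
        if norm == 0.0:
            break
        estimate = [u / norm for u in updated]
    return estimate


# Toy data: true ages are 25 or 55; only the randomized values are "seen".
rng = random.Random(0)
ages = [25] * 200 + [55] * 200
randomized = [a + rng.uniform(-20.0, 20.0) for a in ages]
grid = list(range(0, 85, 5))
estimate = reconstruct(randomized, grid, uniform_noise_density)
for a, p in zip(grid, estimate):
    print(a, round(p, 3))
```

Only the randomized values and the noise density enter `reconstruct`; the individual ages are never recovered, which is the point of the approach. A convergence test on successive estimates would be a natural replacement for the fixed iteration count.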

Works Well

Accuracy vs. Randomization

Discovering Frequent Itemsets
[Two tables, one per dataset (Soccer: s_min = 0.2%; Mailorder: s_min = 0.2%), each with columns: Itemset Size, True Itemsets, True Positives, False Drops, False Positives. Breach level = 50%.]
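The column names can be made concrete with a small sketch. The definitions used here are an assumed reading of the headers, not something the transcript states: a true positive is a truly frequent itemset that is also found in the randomized data, a false drop is a truly frequent itemset that is missed, and a false positive is a reported itemset that is not truly frequent. The function name and the toy itemsets are hypothetical.

```python
def compare_itemsets(true_itemsets, mined_itemsets):
    """Tabulate agreement between true and mined frequent itemsets by size."""
    true_set = {frozenset(s) for s in true_itemsets}
    mined_set = {frozenset(s) for s in mined_itemsets}
    by_size = {}
    for size in sorted({len(s) for s in true_set | mined_set}):
        t = {s for s in true_set if len(s) == size}
        m = {s for s in mined_set if len(s) == size}
        by_size[size] = {
            "true_itemsets": len(t),
            "true_positives": len(t & m),    # frequent and found
            "false_drops": len(t - m),       # frequent but missed
            "false_positives": len(m - t),   # reported but not frequent
        }
    return by_size


# Toy itemsets; these are not the Soccer or Mailorder results.
true_itemsets = [("a",), ("b",), ("a", "b")]
mined_itemsets = [("a",), ("c",), ("a", "b")]
table = compare_itemsets(true_itemsets, mined_itemsets)
for size, row in table.items():
    print(size, row)
```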

Computation over Compartmentalized Databases

Some Hard Problems
Past may be a poor predictor of future
– Abrupt changes
– Wrong training examples
Reliability and quality of data
Actionable patterns (principled use of domain knowledge?)
Over-fitting vs. not missing the rare nuggets
Richer patterns
Simultaneous mining over multiple data types
When to use which algorithm?
Automatic, data-dependent selection of algorithm parameters

Summary
Data mining has shown promise, but we need further research to realize its full potential.
"We stand on the brink of great new answers, but even more, of great new questions." – Matt Ridley