From Data Mining to Knowledge Discovery: An Introduction Gregory Piatetsky-Shapiro KDnuggets.

Presentation transcript:

From Data Mining to Knowledge Discovery: An Introduction Gregory Piatetsky-Shapiro KDnuggets

2 Outline  Introduction  Data Mining Tasks  Application Examples

3 Trends leading to Data Flood  More data is generated:  Bank, telecom, other business transactions...  Scientific Data: astronomy, biology, etc  Web, text, and e-commerce  More data is captured:  Storage technology faster and cheaper  DBMS capable of handling bigger DB

4 Examples  Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session  storage and analysis are a big problem  Walmart is reported to have a 24-terabyte DB  AT&T handles billions of calls per day  data cannot be stored, so analysis is done on the fly

5 Growth Trends  Moore’s law  Computer Speed doubles every 18 months  Storage law  total storage doubles every 9 months  Consequence  very little data will ever be looked at by a human  Knowledge Discovery is NEEDED to make sense and use of data.
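
The two doubling periods on this slide can be compared with a short calculation. The sketch below takes the slide's figures at face value (speed doubles every 18 months, storage every 9 months); the 36-month horizon and the function name `growth_factor` are illustrative choices, not part of the original deck.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months` for a quantity that doubles every period."""
    return 2.0 ** (months / doubling_period_months)


months = 36  # three years; an arbitrary horizon chosen for illustration
speed = growth_factor(months, 18)   # computer speed: doubles every 18 months
storage = growth_factor(months, 9)  # total storage: doubles every 9 months

print(f"speed grows {speed:.0f}x, storage grows {storage:.0f}x")
print(f"stored data per unit of compute grows {storage / speed:.0f}x")
```

Under these assumptions storage outpaces speed by a factor that itself doubles every 18 months, which is the gap behind the slide's conclusion that very little data will ever be looked at by a human.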

6 Knowledge Discovery Definition  Knowledge Discovery in Data is the non-trivial process of identifying  valid  novel  potentially useful  and ultimately understandable patterns in data.  From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996

7 Related Fields  [Figure: Data Mining and Knowledge Discovery shown together with its related fields: Statistics, Machine Learning, Databases, Visualization]

8 Knowledge Discovery Process  [Figure: process diagram. Data stages: Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Knowledge. Steps: Integration, Selection & Cleaning, Transformation, Data Mining, Interpretation & Evaluation, Understanding]

9 Outline  Introduction  Data Mining Tasks  Application Examples

10 Data Mining Tasks: Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks,...

11 Classification: Linear Regression  Linear Regression: $w_0 + w_1 x + w_2 y \ge 0$  Regression computes $w_i$ from data to minimize squared error to ‘fit’ the data  Not flexible enough
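
As a rough illustration of this slide, the sketch below fits the weights w0, w1, w2 by least squares and classifies a point as "blue" when w0 + w1 x + w2 y >= 0. The toy data, the +1/-1 target encoding, the class names, and the helper names `solve`, `fit_least_squares`, and `predict` are all assumptions made for this example; the deck does not give an implementation.

```python
def solve(a, b):
    """Solve the linear system a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(points, targets):
    """Return [w0, w1, w2] minimizing sum((w0 + w1*x + w2*y - t)**2)."""
    rows = [(1.0, x, y) for x, y in points]
    # Normal equations: (A^T A) w = A^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve(ata, atb)


def predict(w, x, y):
    """Classify by the sign of the fitted linear function."""
    return "blue" if w[0] + w[1] * x + w[2] * y >= 0 else "green"


# Toy data (assumed): three "green" points encoded as -1, three "blue" points as +1.
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
targets = [-1, -1, -1, 1, 1, 1]

w = fit_least_squares(points, targets)
predictions = [predict(w, x, y) for x, y in points]
print(predictions)
```

This toy data is linearly separable, so a single line is enough. The slide's point that the approach is "not flexible enough" applies when the classes cannot be split by one straight boundary.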

12 Classification: Decision Trees  if X > 5 then blue  else if Y > 3 then blue  else if X > 2 then green  else blue  [Figure: X-Y plot with splits at X = 2, X = 5, and Y = 3]
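
The rule on this slide translates directly into code. The function name `classify` and the sample points are illustrative; the thresholds and class labels are the ones given on the slide.

```python
def classify(x: float, y: float) -> str:
    """Apply the slide's decision tree: each test splits on a single attribute."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"


for point in [(6, 0), (3, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

Only the rectangle with 2 < X <= 5 and Y <= 3 is classified as green, a region that a single linear boundary cannot carve out.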

13 Classification: Neural Nets  Can select more complex regions  Can be more accurate  Also can overfit the data – find patterns in random noise

14 Data Mining Central Quest Find true patterns and avoid overfitting (false patterns due to randomness)
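
A small experiment can make the overfitting risk concrete. In the sketch below the labels are coin flips, so there is no true pattern; a 1-nearest-neighbor classifier (an assumed stand-in for any highly flexible model, not a method named in the deck) still scores perfectly on the data it memorized and only around chance level on fresh data.

```python
import random


def nearest_neighbor_predict(train, query):
    """Return the label of the training point closest to `query` (1-nearest-neighbor)."""
    nearest = min(train, key=lambda item: abs(item[0] - query))
    return nearest[1]


def accuracy(train, data):
    """Fraction of points in `data` whose label matches the 1-NN prediction."""
    hits = sum(nearest_neighbor_predict(train, x) == label for x, label in data)
    return hits / len(data)


def make_random_data(rng, n):
    """Random feature values with labels drawn by coin flip: no real pattern exists."""
    return [(rng.random(), rng.choice("AB")) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_random_data(rng, 200)
test = make_random_data(rng, 200)

print("training accuracy:", accuracy(train, train))
print("test accuracy:", accuracy(train, test))
```

The gap between the two numbers is the "false pattern due to randomness" the slide warns about, which is why models are judged on data they were not trained on.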

15 Data Mining Tasks: Clustering Find “natural” grouping of instances given un-labeled data
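
One common way to find such groupings is k-means, which the slide does not name; it is used here only as an example. The sketch below runs Lloyd's algorithm on one-dimensional values, with the data, the starting centers, and the function name `kmeans_1d` all chosen for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on scalars: assign each value to its nearest center, then recompute centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else old for c, old in zip(clusters, centers)]
    return centers, clusters


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]  # unlabeled data with two visible groups
centers, clusters = kmeans_1d(values, centers=[0.0, 6.0])

print("centers:", centers)
print("clusters:", clusters)
```

No labels are supplied; the two groups emerge from the distances alone, which is what distinguishes clustering from classification.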

16 Major Data Mining Tasks  Classification: predicting an item class  Clustering: finding clusters in data  Associations: e.g. A & B & C occur frequently  Visualization: to facilitate human discovery  Estimation: predicting a continuous value  Deviation Detection: finding changes  Link Analysis: finding relationships  …
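
For the Associations task, "A & B & C occur frequently" can be checked by counting how many transactions contain each itemset. The sketch below is a brute-force count rather than an optimized algorithm such as Apriori; the transactions, the support threshold, and the function name `frequent_itemsets` are made up for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, size, min_support):
    """Count itemsets of a given size; keep those found in at least `min_support` transactions."""
    counts = Counter()
    for basket in transactions:
        for itemset in combinations(sorted(set(basket)), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


transactions = [
    ["A", "B", "C"],
    ["A", "B", "C", "D"],
    ["A", "C"],
    ["B", "C", "A"],
    ["D"],
]

frequent = frequent_itemsets(transactions, size=3, min_support=3)
print(frequent)
```

Enumerating every combination grows quickly with basket size, so this approach only suits small examples.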

17 Data Mining Software Guide

18 Outline  Introduction  Data Mining Tasks  Application Examples

19 Major Application Areas for Data Mining Solutions  Advertising  Bioinformatics  Customer Relationship Management (CRM)  Database Marketing  Fraud Detection  eCommerce  Health Care  Investment/Securities  Manufacturing, Process Control  Sports and Entertainment  Telecommunications  Web

20 Case Study: Search Engines  Early search engines relied mainly on the keywords on a page and were subject to manipulation  Google's success is due to its algorithm, which relies mainly on links to the page  Google founders Sergey Brin and Larry Page were students at Stanford in 1998, doing research in databases and data mining that led to Google
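
The slide does not name or describe the link-based algorithm. As an illustration only, the sketch below scores pages with a simplified PageRank-style power iteration; the tiny link graph, the damping value of 0.85, and the function name `link_rank` are assumptions for the example and are not taken from the deck.

```python
def link_rank(links, damping=0.85, iterations=100):
    """Power iteration for a PageRank-style score; `links` maps each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages  # a page with no links spreads its score evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links back to "a".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

rank = link_rank(links)
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

A page's score here depends on who links to it rather than on its own text, which is why link-based ranking is harder to manipulate than keyword counting.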

21 Case Study: Direct Marketing and CRM  Most major direct marketing companies are using modeling and data mining  Most financial companies are using customer modeling  Modeling is easier than changing customer behaviour  Some successes  Verizon Wireless reduced churn rate from 2% to 1.5%

22 Biology: Molecular Diagnostics  Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)  72 samples, about 7,000 genes  Results: 33 correct (97% accuracy), 1 error (sample suspected mislabelled)  Outcome predictions?

23 AF1q: New Marker for Medulloblastoma?  AF1Q ALL1-fused gene from chromosome 1q  transmembrane protein  Related to leukemia (3 PUBMED entries) but not to Medulloblastoma

24 Case Study: Security and Fraud Detection  Credit Card Fraud Detection  Money laundering  FAIS (US Treasury)  Securities Fraud  NASDAQ Sonar system  Phone fraud  AT&T, Bell Atlantic, British Telecom/MCI  Bio-terrorism detection at Salt Lake Olympics 2002

25 Data Mining and Terrorism: Controversy in the News  TIA: Terrorism (formerly Total) Information Awareness Program –  DARPA program closed by Congress  some functions transferred to intelligence agencies  CAPPS II – screen all airline passengers  controversial  …  Invasion of Privacy or Defensive Shield?

26 Criticism of analytic approach to Threat Detection: Data Mining will  invade privacy  generate millions of false positives But can it be effective?

27 Can Data Mining and Statistics be Effective for Threat Detection?  Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives  Reality: Analytical models correlate many items of information to reduce false positives.  Example: Identify one biased coin from 1,000.  After one throw of each coin, we cannot identify it  After 30 throws, one biased coin will stand out with high probability.  Can identify 19 biased coins out of 100 million with sufficient number of throws
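
The coin example can be worked through numerically. The slide does not say how biased the coin is, so the sketch below assumes the extreme case of a coin that always lands heads, while the other 999 coins are fair. It computes the expected number of fair coins that would look just as suspicious (all heads) after a given number of throws. All function names are illustrative.

```python
def false_positive_count(population: int, error_rate: float) -> float:
    """The criticism's arithmetic: a flat error rate applied to everyone."""
    return population * error_rate


def prob_fair_coin_all_heads(throws: int) -> float:
    """Probability that a fair coin shows heads on every one of `throws` throws."""
    return 0.5 ** throws


def expected_fair_lookalikes(fair_coins: int, throws: int) -> float:
    """Expected number of fair coins that match an always-heads coin (assumed bias)."""
    return fair_coins * prob_fair_coin_all_heads(throws)


print("naive false positives:", false_positive_count(100_000_000, 0.05))
for throws in (1, 10, 30):
    print(throws, "throws ->", expected_fair_lookalikes(999, throws), "fair lookalikes")
```

Under this assumption about half of the fair coins are indistinguishable from the biased one after a single throw, but after 30 throws the expected number of lookalikes falls below one in a million. A weaker bias would need more throws, which matches the slide's "sufficient number of throws" qualifier.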

28 Another Approach: Link Analysis Can Find Unusual Patterns in the Network Structure

29 Analytic technology can be effective  Combining multiple models and link analysis can reduce false positives  Today there are millions of false positives with manual analysis  Data Mining is just one additional tool to help analysts  Analytic Technology has the potential to reduce the current high rate of false positives

30 Data Mining with Privacy  Data Mining looks for patterns, not people!  Technical solutions can limit privacy invasion  Replacing sensitive personal data with anonymous IDs  Giving randomized outputs  Multi-party computation – distributed data  …  Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
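
One way to replace personal data with an anonymous ID is a keyed hash, sketched below with Python's standard `hmac` and `hashlib` modules. The deck does not prescribe this technique; the key, the sample records, the 16-character truncation, and the function name `pseudonymize` are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Map a personal identifier to a stable anonymous ID using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability; an illustrative choice


key = b"example-key-keep-this-secret"  # hypothetical key; a real one must be stored securely
records = [
    ("alice@example.com", 120.0),
    ("bob@example.com", 75.5),
    ("alice@example.com", 30.0),
]

# The same person always maps to the same ID, so patterns per customer remain minable.
anonymized = [(pseudonymize(email, key), amount) for email, amount in records]
for anon_id, amount in anonymized:
    print(anon_id, amount)
```

Because the mapping is consistent, per-customer patterns survive while the analyst never sees the email address. Substituting IDs does not by itself guarantee anonymity, since the remaining attributes may still identify a person.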

31 The Hype Curve for Data Mining and Knowledge Discovery  [Figure: hype curve with phases: rising expectations, over-inflated expectations, disappointment, growing acceptance and mainstreaming]

32 Summary That’s all folks! – the website for Data Mining and Knowledge Discovery Contact: Gregory Piatetsky-Shapiro