1 DATA MINING. 2 Introduction Outline Define data mining Data mining vs. databases Basic data mining tasks Data mining development Data mining issues.

Presentation transcript:

1 DATA MINING

2 Introduction Outline Define data mining Data mining vs. databases Basic data mining tasks Data mining development Data mining issues Goal: Provide an overview of data mining.

3 Introduction Data is growing at a phenomenal rate Users expect more sophisticated information How? UNCOVER HIDDEN INFORMATION DATA MINING

4 Data Mining Definition Finding hidden information in a database Fit data to a model Similar terms –Exploratory data analysis –Data driven discovery –Deductive learning

5 Database Processing vs. Data Mining Processing
Database Processing: Query –Well defined –SQL; Data –Operational data; Output –Precise –Subset of database
Data Mining Processing: Query –Poorly defined –No precise query language; Data –Not operational data; Output –Fuzzy –Not a subset of database

6 Query Examples
Database: –Find all credit applicants with last name of Smith. –Identify customers who have purchased more than $10,000 in the last month. –Find all customers who have purchased milk.
Data Mining: –Find all credit applicants who are poor credit risks. (classification) –Identify customers with similar buying habits. (clustering) –Find all items which are frequently purchased with milk. (association rules)
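The milk examples on this slide can be sketched in a few lines of Python. This is only an illustration: the baskets, the function names, and the `min_confidence` threshold are assumptions made for the sketch, not part of the slides. The database-style query returns an exact subset of the stored data, while the mining-style query returns a derived pattern that is not itself stored anywhere.

```python
from collections import Counter

# Hypothetical market baskets (illustrative data, not from the slides).
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "eggs"},
]

def database_query(baskets, item):
    """Precise query: positions of all baskets that contain `item`."""
    return [i for i, basket in enumerate(baskets) if item in basket]

def frequently_bought_with(baskets, item, min_confidence=0.5):
    """Mining-style query: items that appear in at least `min_confidence`
    of the baskets containing `item`, with that fraction."""
    with_item = [basket for basket in baskets if item in basket]
    if not with_item:
        return {}
    counts = Counter(other for basket in with_item
                     for other in basket if other != item)
    return {other: n / len(with_item)
            for other, n in counts.items()
            if n / len(with_item) >= min_confidence}

print(database_query(baskets, "milk"))
print(frequently_bought_with(baskets, "milk"))
```

The threshold plays the role of the "frequently" in the slide's wording; a real association-rule miner would also check how often the whole item set occurs across all baskets.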

7 Data Mining Models and Tasks

8 Basic Data Mining Tasks Classification maps data into predefined groups or classes –Supervised learning –Pattern recognition –Prediction Regression is used to map a data item to a real valued prediction variable. Clustering groups similar data together into clusters. –Unsupervised learning –Segmentation –Partitioning
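As a small illustration of regression mapping a data item to a real-valued prediction variable, the sketch below fits a straight line by ordinary least squares. The choice of a linear model, the sample points, and the names `fit_line` and `predict` are assumptions for this example; the slides do not prescribe a particular regression method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Map a data item x to a real-valued prediction."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical observations lying exactly on y = 2x + 1.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(model, predict(model, 5))
```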

9 Basic Data Mining Tasks (cont’d) Summarization maps data into subsets with associated simple descriptions. –Characterization –Generalization Link Analysis uncovers relationships among data. –Affinity Analysis –Association Rules –Sequential Analysis determines sequential patterns.

10 Ex: Time Series Analysis Example: Stock Market Predict future values Determine similar patterns over time Classify behavior

11 Data Mining vs. KDD Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.

12 KDD Process Selection: Obtain data from various sources. Preprocessing: Cleanse data. Transformation: Convert to common format. Transform to new format. Data Mining: Obtain desired results. Interpretation/Evaluation: Present results to user in meaningful manner. Modified from [FPSS96C]

13 KDD Process Ex: Web Log Selection: –Select log data (dates and locations) to use Preprocessing: – Remove identifying URLs – Remove error logs Transformation: –Sessionize (sort and group) Data Mining: –Identify and count patterns –Construct data structure Interpretation/Evaluation: –Identify and display frequently accessed sequences. Potential User Applications: –Cache prediction –Personalization
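The web log steps above can be sketched as follows. The log records, the 30-minute session timeout, and all function names are assumptions chosen for illustration; the slide does not specify them. The sketch covers removing error entries, sessionizing (sort and group), and counting patterns, and leaves out the removal of identifying URLs.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, timestamp in seconds, url, HTTP status).
log = [
    ("u1", 0, "/home", 200),
    ("u1", 10, "/products", 200),
    ("u2", 5, "/home", 200),
    ("u1", 20, "/missing", 404),
    ("u2", 15, "/products", 200),
    ("u1", 5000, "/home", 200),
    ("u1", 5010, "/cart", 200),
]

def preprocess(records):
    """Preprocessing: drop error entries (status 400 and above)."""
    return [r for r in records if r[3] < 400]

def sessionize(records, timeout=1800):
    """Transformation: sort by user and time, then split each user's
    visits into sessions wherever the gap exceeds `timeout` seconds."""
    by_user = defaultdict(list)
    for user, ts, url, _ in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions

def count_pairs(sessions):
    """Data mining: count consecutive page pairs within sessions."""
    return Counter(pair for s in sessions for pair in zip(s, s[1:]))

patterns = count_pairs(sessionize(preprocess(log)))
print(patterns.most_common(1))
```

The most frequent pairs are the kind of "frequently accessed sequences" that could feed cache prediction or personalization.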

14 Data Mining Development [Figure: areas contributing to data mining] Similarity Measures; Hierarchical Clustering; IR Systems; Imprecise Queries; Textual Data; Web Search Engines; Bayes Theorem; Regression Analysis; EM Algorithm; K-Means Clustering; Time Series Analysis; Neural Networks; Decision Tree Algorithms; Algorithm Design Techniques; Algorithm Analysis; Data Structures; Relational Data Model; SQL; Association Rule Algorithms; Data Warehousing; Scalability Techniques

15 Social Implications of DM Privacy Profiling Unauthorized use

16 Data Mining Metrics Usefulness Return on Investment (ROI) Accuracy Space/Time

17 Database Perspective on Data Mining Scalability Real World Data Updates Ease of Use

18 Classification vs. Prediction (June 25, 2015, Data Mining: Concepts and Techniques) Classification –predicts categorical class labels (discrete or nominal) –classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data Prediction –models continuous-valued functions, i.e., predicts unknown or missing values Typical applications –Credit approval –Target marketing –Medical diagnosis –Fraud detection

19 Classification: A Two-Step Process Model construction: describing a set of predetermined classes –Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute –The set of tuples used for model construction is the training set –The model is represented as classification rules, decision trees, or mathematical formulae Model usage: for classifying future or unknown objects –Estimate accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test set samples that are correctly classified by the model; the test set must be independent of the training set, otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected –If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

20 Process (1): Model Construction Training Data; Classification Algorithms; Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

21 Process (2): Using the Model in Prediction Classifier; Testing Data; Unseen Data: (Jeff, Professor, 4) Tenured?
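The two steps can be sketched in Python using the rule from the model-construction slide (IF rank = 'professor' OR years > 6 THEN tenured = 'yes'). The labeled test tuples below are hypothetical values made up for this sketch, and the function names are likewise assumptions; only the rule and the unseen tuple (Jeff, Professor, 4) come from the slides.

```python
def classify(rank, years):
    """The classifier (model): rank = 'professor' OR years > 6 means tenured."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Hypothetical labeled test set, kept separate from any training data:
# (name, rank, years, known tenured label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(classify(rank, years) == label
                  for _, rank, years, label in rows)
    return correct / len(rows)

print(accuracy(test_set))          # step 2a: estimate accuracy on test data
print(classify("Professor", 4))    # step 2b: classify the unseen tuple for Jeff
```

On this made-up test set the rule misclassifies one of four tuples, which shows why accuracy is checked before the model is applied to unseen data.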

22 Ex2: Illustrating Classification Task

23 Supervised vs. Unsupervised Learning Supervised learning (classification) –Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations –New data is classified based on the training set Unsupervised learning (clustering) –The class labels of the training data are unknown –Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

24 Issues: Data Preparation Data cleaning –Preprocess data in order to reduce noise and handle missing values Relevance analysis (feature selection) –Remove the irrelevant or redundant attributes Data transformation –Generalize and/or normalize data
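Two of these preparation steps are easy to illustrate. The sketch below fills missing values with the mean of the observed values and then applies min-max normalization. Both techniques are assumptions picked for the example (the slide names the goals, not the methods), and the income figures are made up.

```python
def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute with one missing entry.
incomes = [20, None, 40, 60]
cleaned = fill_missing(incomes)
print(cleaned)
print(min_max_normalize(cleaned))
```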

25 Issues: Evaluating Classification Methods Accuracy –classifier accuracy: predicting class label –predictor accuracy: guessing value of predicted attributes Speed –time to construct the model (training time) –time to use the model (classification/prediction time) Robustness: handling noise and missing values Scalability: efficiency in disk-resident databases Interpretability –understanding and insight provided by the model Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

26 Related Concepts Outline Database/OLTP Systems Fuzzy Sets and Logic Information Retrieval(Web Search Engines) Dimensional Modeling Data Warehousing OLAP/DSS Statistics Machine Learning Pattern Matching Goal: Examine some areas which are related to data mining.

27 Information Retrieval Information Retrieval (IR): retrieving desired information from textual data. Library Science Digital Libraries Web Search Engines Traditionally keyword based Sample query: Find all documents about “data mining”. DM: Similarity measures; Mine text/Web data.
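A keyword-based query and a simple similarity measure can be sketched as follows. The documents, the whitespace tokenizer, and the choice of Jaccard similarity are assumptions for illustration only; the slide does not name a specific similarity measure.

```python
def tokenize(text):
    """Lowercase a text and split it into a set of keywords."""
    return set(text.lower().split())

# Hypothetical document collection.
docs = {
    "d1": "Data mining finds hidden patterns",
    "d2": "Mining for gold in California",
    "d3": "Introduction to data warehousing and data mining",
}

def keyword_search(docs, query):
    """Traditional keyword query: documents containing every query term."""
    terms = tokenize(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if terms <= tokenize(text))

def jaccard(a, b):
    """Similarity measure: shared keywords divided by all keywords."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(keyword_search(docs, "data mining"))
print(jaccard("data mining", docs["d1"]))
```

The keyword query gives a yes/no answer per document, while the similarity score allows ranking, which is closer to how imprecise queries are handled.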

28 IR Query Result Measures and Classification [Figure: panels "IR" and "Classification"]

29 Dimensional Modeling View data in a hierarchical manner, closer to how business executives might view it. Useful in decision support systems and mining. Dimension: collection of logically related attributes; axis for modeling data. Facts: data stored. Ex: Dimensions – products, locations, date; Facts – quantity, unit price. DM: May view data as dimensional.

30 Relational View of Data

31 Dimensional Modeling Queries Roll Up: more general dimension Drill Down: more specific dimension Dimension (Aggregation) Hierarchy SQL uses aggregation Decision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems.
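Roll up and drill down amount to aggregating the same facts at different levels of a dimension hierarchy. The sketch below uses a hypothetical location hierarchy (city within state) and made-up sales quantities; the data and the `aggregate` helper are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical facts: (product, city, state, quantity).
sales = [
    ("milk", "Dallas", "TX", 10),
    ("milk", "Austin", "TX", 5),
    ("bread", "Dallas", "TX", 7),
    ("milk", "Tulsa", "OK", 3),
]

def aggregate(rows, key):
    """Sum the quantity fact for each value of the grouping key."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill down: the more specific level of the location dimension.
by_city = aggregate(sales, lambda r: (r[2], r[1]))
# Roll up: the more general level of the same dimension.
by_state = aggregate(sales, lambda r: r[2])

print(by_city)
print(by_state)
```

In SQL the same effect would come from grouping on the city or on the state column with a SUM aggregate.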

32 Cube view of Data

33 Aggregation Hierarchies

34 Data Warehousing “Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon) Operational Data: Data used in day-to-day needs of company. Informational Data: Supports other functions such as planning and forecasting. Data mining tools often access data warehouses rather than operational data. DM: May access data in warehouse.

35 Operational vs. Informational
Attribute: Operational Data | Data Warehouse
–Application: OLTP | OLAP
–Use: Precise Queries | Ad Hoc
–Temporal: Snapshot | Historical
–Modification: Dynamic | Static
–Orientation: Application | Business
–Data: Operational Values | Integrated
–Size: Gigabits | Terabits
–Level: Detailed | Summarized
–Access: Often | Less Often
–Response: Few Seconds | Minutes
–Data Schema: Relational | Star/Snowflake

36 OLAP Online Analytic Processing (OLAP): provides more complex queries than OLTP. Online Transaction Processing (OLTP): traditional database/transaction processing. Dimensional data; cube view Visualization of operations: –Slice: examine sub-cube. –Dice: rotate cube to look at another dimension. –Roll Up/Drill Down DM: May use OLAP queries.
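The cube view and the slice operation can be sketched with a dictionary keyed by one value per dimension. The dimension names follow the earlier dimensional-modeling example (products, locations, date); the cell values and the `slice_cube` helper are hypothetical and only meant to show what "examine sub-cube" looks like in code.

```python
# Dimensions of the hypothetical cube, in key order.
DIMENSIONS = ("product", "location", "date")

# Hypothetical cube: one quantity per (product, location, date) cell.
cube = {
    ("milk", "Dallas", "2015-06"): 10,
    ("milk", "Austin", "2015-06"): 5,
    ("bread", "Dallas", "2015-06"): 7,
    ("milk", "Dallas", "2015-07"): 4,
}

def slice_cube(cube, dimension, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    axis = DIMENSIONS.index(dimension)
    return {cell: qty for cell, qty in cube.items() if cell[axis] == value}

milk_slice = slice_cube(cube, "product", "milk")
print(milk_slice)
print(sum(milk_slice.values()))
```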

37 OLAP Operations [Figure: Single Cell; Multiple Cells; Slice; Dice; Roll Up; Drill Down]