Chapter 1 Introduction to Data Mining


1 Chapter 1 Introduction to Data Mining
Chen, Chun-Hsien. Department of Information Management, Chang Gung University. Thursday, August 23, 2018. Introduction to Data Mining (Introduction to DM & DW)

2 Outline
- Motivation for data mining
- What is data mining?
- Applications of data mining
- Data mining process
- Main data mining techniques
- Classification of data mining systems

3 Motivation
Phenomenon: data explosion, driven by automated data collection tools and mature database technology
- A tremendous number of Web pages
- 40+ billion photos on Facebook
- 1 million new transactions per hour added to the Walmart database
- Data from wearable devices for healthcare
- Big data in the cloud
Problem: we are drowning in data but starving for knowledge
Solution: data mining, one of the 10 emerging technologies that will change the world in the near future

4 What Is Data Mining?
Formal definition of data mining: automatic extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge (rules, regularities, patterns, trends, affinities) from large amounts of data
Alternative names: business intelligence (BI), knowledge discovery in databases (KDD), data/pattern analysis, knowledge extraction, data dredging, information harvesting, data archeology, etc.

5 Example : Mining a Concept Hierarchy
[Figure: concept hierarchy, one level per line]
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office: L. Chan, ..., M. Wind

6 Part of International Sales, Shipping Data
(Chiayi City ranked first on Taobao in 2012 by per-capita payment amount: 320,000 yuan)

7 Confluence of Multiple Disciplines
[Figure: data mining at the center of its contributing disciplines; figure label: KDD process]
Data mining draws on artificial intelligence, statistics, machine learning, visualization, information science, and database technology

8 Evolution of Database Technology
- 1960s: data collection, database creation, network DBMS
- 1970s: relational data model, relational DBMS
- 1980s: advanced data models (extended-relational, OO, spatial, temporal databases, etc.)
- 1990s onward: data mining, data warehousing, multimedia databases, and the Web

9 Applications of Data Mining
Decision support
- Business decision support
  - Consumer understanding and service improvement
  - Market trend analysis and management
  - Risk analysis and management
  - Fraud detection and management
- Medical decision support
Other applications
- Text mining
- Web analysis
- Bioinformatics

10 Applications of Data Mining (Market Analysis and Management)
(1/2)
Data sources for analysis
- Transactions from credit cards, the retail industry, etc.
- Public lifestyle studies
- Customer complaint calls
Market basket analysis and cross-selling
- Associations/correlations between product sales
- Prediction based on the association information

11 Applications of Data Mining (Market Analysis and Management)
(2/2)
Customer profiling
- Find clusters of “model” customers who share the same characteristics: spending habits, income level, interests, etc.
- Data mining can tell you what types of customers buy what products (by clustering or classification techniques)
Identifying customer requirements
- Identifying potential product sales for eC (e-commerce) customers
- Use prediction to find what factors will attract new customers

12 Applications of Data Mining (Risk Analysis and Management)
Finance planning and asset evaluation
- Cash flow analysis and prediction
- Asset evaluation
- Time series analysis (trend analysis)
Competitive analysis and market segmentation
- Monitoring competitors and market directions
- Setting pricing strategy in a highly competitive market
- Grouping customers / a class-based pricing procedure

13 Applications of Data Mining (Fraud Detection and Management)
Applications: health care, insurance, credit card services
Approach: use historical data to build models of fraudulent behavior, and use data mining techniques to help identify similar instances
Examples
- Detection of money laundering: detect suspicious money transaction patterns in banks
- Fraud detection in medical insurance: detect cheating rings of patients and doctors

14 Applications of Data Mining (Other Applications)
Text mining
- News classification: find related articles
- Detection of spam: analyze content
- Medical informatics: automatic classification of cancer reports
Web mining: mining Web access logs
- Discovering customer preferences and behavior
- Analyzing the effectiveness of Web marketing
- Improving Web site organization
Biomedical informatics
- Finding genes related to genetic diseases
- Drug discovery

15 Steps in KDD Process (Technically)
[Figure: KDD pipeline] Databases → Data Preprocessing → Relevant Data → Data Mining → Pattern → Evaluation/Presentation → Knowledge
Data mining: the core step of the KDD process

16 Main Steps of a KDD Process (Fully)
- Domain knowledge acquisition
  - Learning relevant prior knowledge and the goals of the application
- Data collection and preprocessing (may take 60% of the effort!)
  - Data selection and integration: creating a target data set
  - Data cleaning, data transformation, and data reduction
- Data mining
  - Choosing the data mining functions: association, classification, clustering, regression, summarization
  - Choosing the mining algorithm(s)
  - Searching for patterns of interest
- Pattern evaluation and knowledge presentation
  - Removing redundant patterns, transformation, visualization, etc.
- Use of discovered knowledge

17 Mining On What Kind of Data?
- Relational databases
- Transactional databases
- Data warehouses
- Advanced databases and information repositories
  - Web pages
  - Temporal data (time-series data)
  - Spatial databases
  - Text databases and multimedia databases
  - Object-oriented databases
  - Heterogeneous and legacy databases

18 Steps in KDD Process
[Figure: KDD pipeline, first stage] Databases → Data Preprocessing → Relevant Data

19 Why Data Preprocessing?
Data in the real world is dirty (e.g., Facebook)
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- Noisy: containing errors or outliers
- Inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results! Quality decisions must be based on quality data

20 Major Tasks in Data Preprocessing
- Data cleaning
- Data integration
- Data transformation
- Data reduction
- Data discretization
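As a rough illustration of three of these tasks (cleaning, transformation, discretization), here is a minimal Python sketch. The age values, the choice of mean imputation, min-max scaling, and the bin edges are all assumptions made for the example; the slides do not prescribe any particular method.

```python
from statistics import mean

# Hypothetical raw attribute values; None marks a missing value.
raw_ages = [23, None, 35, 41, None, 58]

def clean(values):
    """Data cleaning: fill missing values with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def min_max(values):
    """Data transformation: rescale linearly to [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, edges, labels):
    """Data discretization: map each value to the label of its interval."""
    out = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                out.append(label)
                break
        else:
            # Larger than every edge: use the last (open-ended) label.
            out.append(labels[-1])
    return out

ages = clean(raw_ages)
scaled = min_max(ages)
groups = discretize(ages, edges=[30, 40], labels=["<=30", "31..40", ">40"])
print(ages, scaled, groups, sep="\n")
```

Data integration and data reduction are left out because they depend on having several sources or many attributes, which a toy list cannot show.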

21 Steps in KDD Process
[Figure: KDD pipeline, up to the mining stage] Databases → Data Preprocessing → Relevant Data → Data Mining → Pattern

22 Main Data Mining Techniques
- Association rule mining (descriptive analysis)
- Classification and prediction (predictive analysis)
- Cluster analysis (exploratory analysis)
- Regression analysis
- Outlier analysis
- Trend analysis

23 Main Data Mining Techniques
(1/4) Association Rule Mining (association rule: correlation and causality)
Form of association rules
- buy(T, “Beer”) → buy(T, “Diaper”) [support = 2%, confidence = 70%] (Walmart story)
- sales(T, “computer”) → sales(T, “software”) [support = 1%, confidence = 75%] (C retail stores)
- age(X, “21..25”) ^ income(X, “30..39K”) → buys(X, “PC”) [support = 2%, confidence = 60%] (IBM story)
- age(X, “31..35”) ^ income(X, “40..49K”) → buys(X, “iPad”) [support = 1%, confidence = 70%] (Acer story)

24 Association Rule Mining (Support and Confidence)
Given a transaction database, find all the rules X → Y with minimum support and confidence
- Support, S: probability that a transaction contains {X & Y}
- Confidence, C: conditional probability that a transaction containing X also contains Y
[Figure: Venn diagram of all transactions, showing the transactions that buy X, the transactions that buy Y, and the transactions that buy both]
Association rules with support >= 50%
- A → C (50%, 66.6%)
- C → A (50%, 100%)
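The two measures can be computed directly from their definitions. The sketch below is a minimal Python illustration; the four-transaction database is an assumption, chosen only because it reproduces the figures quoted on the slide (the slide's own transaction table did not survive), and the function names are hypothetical.

```python
# Toy transaction database (an assumption chosen to match the slide's figures).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

s_ac = support({"A", "C"}, transactions)
c_a_to_c = confidence({"A"}, {"C"}, transactions)
c_c_to_a = confidence({"C"}, {"A"}, transactions)

print(f"A -> C: support {s_ac:.1%}, confidence {c_a_to_c:.1%}")
print(f"C -> A: support {s_ac:.1%}, confidence {c_c_to_a:.1%}")
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because A appears in three transactions and C in only two.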

25 Main Data Mining Techniques: Supervised Learning
(2/4) Use a training set to construct a model that forecasts the outcome of future events. Two main types:
Classification
- Finding a model that distinguishes classes for future events
- E.g., loan approval, customer classification, fingerprint recognition
- Model representation: decision trees, artificial neural networks
Prediction
- Finding a model that predicts numerical values for future events
- E.g., stock price prediction
- Model representation: regression, artificial neural networks

26 Classification vs. Prediction
Use a training set to construct a model that forecasts the outcome of future events
Classification
- Predicts categorical class labels
- Constructs a classification model to classify new data
Prediction
- Predicts numerical values
- Constructs a continuous-valued function to predict unknown or missing values
Typical applications
- Credit card approval
- Medical diagnosis and treatment
- Pattern recognition

27 Process of Classification & Prediction (A Two-Step Process)
Model construction: training data (I, O) → learning algorithm → model f, where y = f(x) (x ∈ I, y ∈ O)
Model usage: input features x’ → model f → output y’ (a class label or a value)

28 An Example of Training Dataset (Data of Consumers' Buying Behavior)
[Table: training dataset] Input features (I): customer characteristics. Class label (O).
This follows an example from Quinlan’s ID3

29 A Decision Tree Model for Predicting buy_PC
Model: buy_PC = f(age, student, credit rating), with x = <age, student, credit rating> and y = buy_PC
[Figure: decision tree. Legend: test (input) attribute; attribute value; class label for buy_PC. Test attributes: age? (branches <= 30, 30..40, > 40), student? (branches no, yes), credit rating? (branches fair, excellent)]
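A decision tree of this shape can be read as nested if/else tests. The Python sketch below is only illustrative: the transcript keeps the test attributes and branch values but not which leaf each branch reaches, so the leaf labels here are an assumption that follows the commonly reproduced textbook version of the buy_PC tree. The function name and the string encoding of attribute values are also hypothetical.

```python
def buy_pc(age, student, credit_rating):
    """Classify one customer with a hand-written decision tree.

    Leaf labels are an assumption (the slide's figure did not survive):
    they follow the commonly used textbook version of this tree.
    """
    if age <= 30:
        # Young customers: the decision depends on student status.
        return "yes" if student == "yes" else "no"
    if age <= 40:
        # Middle age band: a single leaf.
        return "yes"
    # Older customers: the decision depends on credit rating.
    return "yes" if credit_rating == "fair" else "no"

# Model usage: apply f to new input features x' to obtain y'.
print(buy_pc(25, "yes", "fair"))       # student under 30
print(buy_pc(50, "no", "excellent"))   # over 40 with excellent credit
```

Each path from the root to a leaf corresponds to one if-then rule, which is why decision trees are often considered easy to interpret.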

30 A Decision Tree for CAD Screening (Constructed from ~500 Records)

31 Main Data Mining Techniques: Cluster Analysis
(3/4) Cluster analysis (unsupervised learning)
- Class labels are unknown: group the data to form new classes
- Application example: customer profiling for product recommendation (online bookstores)
- Typical clustering principle: maximize the intra-class similarity and minimize the inter-class similarity
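One common way to pursue this principle is k-means, which the slides do not name; it is used here only as a stand-in clustering algorithm. The minimal Python sketch below assumes made-up 2D points, fixed starting centroids, and Euclidean distance as the (dis)similarity measure.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical data: two visually separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
print(centroids)
```

Points within a cluster end up close to their own centroid (high intra-class similarity) and far from the other centroid (low inter-class similarity). Results on real data depend on the number of clusters and on the starting centroids.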

32 Example of 2D Cluster Analysis
[Figure: 2D scatter plot] 3 clusters (A, B, C), with points X, Y, and Z as outliers
Difficulty: the distribution of high-dimensional data cannot be inspected visually

33 Clustering Example in High Dimension (Gene Expression Analysis by Clustering)
[Figure: data matrix for visualization with a clustering dendrogram, used for finding differentially regulated genes]
Notes: Many people have heard of microarrays, or gene chips: a way to look at the fingerprint of gene expression for hundreds or thousands of genes in a cell. All proteins are involved in interconnected pathways, so one looks for relationships: groups of genes turned up or down, on or off. Genes whose expression does not change are unlikely to be directly involved in the cause or course of the disease.

34 Profile of Stroke Patients (Diagnosed by Indices of Chinese Medicine)

35 Main Data Mining Techniques
(4/4) Example of Linear Regression
[Figure: scatter plot with the fitted line y = a x + b; axes x and y; the value Y1 at x = X1 is marked as unknown]
- Predict the value of y at X1 using linear regression: y = f(x), what is f?
- Explore the meaning of a and b
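For a single input variable, a and b can be estimated by ordinary least squares. The Python sketch below is a minimal illustration under that assumption; the data points and the query point x1 are made up, and the slides do not state which fitting method they intend.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

# Hypothetical training data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)

x1 = 5.0
y1 = a * x1 + b  # predicted value of y at x1
print(a, b, y1)
```

Here a is the change in y per unit change in x (the slope) and b is the value of y at x = 0 (the intercept).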

36 Other Data Mining Techniques
(4/4)
Outlier analysis
- Outlier: a data object that does not comply with the general behavior of the data
- It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis
Trend analysis
- Trend and deviation: regression analysis
- Sequential pattern mining, periodicity analysis
Other pattern-directed or statistical analyses
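One simple statistical way to flag objects that deviate from the general behavior of the data is a z-score test. The Python sketch below is an illustration only: the transaction amounts, the threshold of 2 standard deviations, and the function name are assumptions, and the slides do not specify a detection method.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Hypothetical transaction amounts with one unusually large payment.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

In a fraud-detection setting the flagged value would be a candidate for review rather than noise to discard. Note that a single extreme value inflates both the mean and the standard deviation, so on small samples a lower threshold or a median-based rule may work better.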

37 Are All the “Discovered” Patterns Interesting?
A data mining system or query may generate thousands of patterns, and not all of them are interesting. Pattern screening becomes a problem.
Interestingness: a measure for automatic pattern screening
- A pattern is interesting if it is easily understood, potentially useful, novel, valid on new or test data with some degree of certainty, or if it validates a hypothesis that the user seeks to confirm
Objective vs. subjective interestingness measures for pattern screening
- Objective: based on statistics and structures of data patterns, e.g., support, confidence, etc.
- Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

38 Can We Find All and Only Interesting Patterns?
Completeness vs. optimization
- Completeness: find all the interesting patterns. Can a data mining system find all the interesting patterns?
- Optimization: find only the interesting patterns. Can a data mining system find only the interesting patterns?
Approaches
- First generate all the patterns, then filter out the uninteresting ones
- Generate only the interesting patterns: mining query optimization

39 Classification Scheme of DM Techniques
General functionality
- Descriptive/exploratory data mining
- Predictive data mining
Different views, different classifications
- Kinds of databases to be mined
- Kinds of knowledge to be discovered
- Kinds of techniques utilized
- Kinds of applications adapted

40 A Multi-Dimensional View of DM Technique Classification
Databases to be mined
- Relational, transactional, Web, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, etc.
Knowledge to be mined
- Association, classification, clustering, trend, characterization, deviation and outlier analysis, etc.
- Multiple/integrated functions and mining at multiple levels
Techniques utilized
- Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
- Retail, telecommunication, banking, fraud analysis, stock market analysis, Web mining, biomedical informatics, etc.

41 Summary for Data Mining
- Data mining: automatic discovery of interesting knowledge from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data preprocessing, data mining, pattern evaluation, and knowledge presentation
- Main data mining functions: association, classification, clustering, outlier and trend analysis, characterization, etc.

42 Introduction to Data Mining
Thanks! Have a nice day!

