Published by Moris Welch. Modified over 8 years ago.
1
Data Mining
Tarek Soukieh
11/18/2010
2
Agenda
1. The Evolution of Database Technology
2. Introduction
3. Data Preprocessing
4. OLAP vs. Data Mining
5. Data Mining Algorithms
   1. Association
   2. Classification & Prediction
   3. Cluster Analysis
6. Data Mining Example
7. Major Issues in Data Mining
8. Data Mining Applications
9. Trends in Data Mining
3
The Evolution of Database Technology
4
Introduction
Data mining refers to extracting, or “mining”, knowledge from large amounts of data.
It is also widely known by the term KDD, “Knowledge Discovery from Data”.
Today data is abundant, but much of it sits in rarely visited archives, so-called “data tombs”: we are “data rich but information poor”.
5
Introduction (Cont.)
Data Mining Cycle:
– Identify the business problem
– Validate, explore, and clean the data
– Prepare the model
– Check performance of the model
– Act on the results
(Training - Testing - Scoring)
Data Mining Assumptions:
– The past is a good predictor of the future
Data Mining Categorization:
– Directed vs. Undirected
– Descriptive vs. Predictive
6
Introduction (Cont.)
7
Data Preprocessing
Data Cleaning
– Measuring dispersion of data
– Principal Component Analysis
– Correlation Analysis
– Regression
– Clustering
– Sampling
Data Transformation
– Smoothing
– Aggregation
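Two of the listed steps, measuring dispersion and smoothing, can be sketched in a few lines. This is a minimal illustration, not the presenter's method: the function names, the choice of equal-frequency bins, and the sample values are all assumptions made for the example.

```python
from statistics import mean, pstdev


def dispersion(values):
    """Summarize spread with the range and the population standard deviation."""
    return {"range": max(values) - min(values), "std_dev": pstdev(values)}


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into equal-frequency bins of `bin_size`,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Illustrative values only (hypothetical data).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
spread = dispersion(prices)
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(spread)
print(smoothed)
```

Smoothing by bin means is only one option; bin medians or bin boundaries would follow the same structure.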
9
OLAP vs. Data Mining
OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
Data mining applies sophisticated pattern recognition algorithms to the data, while OLAP reports aggregated data from data warehouses.
OLAP lets the user drill, pivot, slice, and dice, while data mining covers a much broader spectrum: association, classification, prediction, clustering, and other algorithms.
10
OLAP vs. Data Mining (Cont.)
OLAP targets business problems, while data mining can also have socioeconomic applications.
Data mining is not confined to the analysis of data stored in data warehouses.
Data mining is more versatile.
11
Association
A frequent itemset is a set of items that frequently appear together in a transactional data set, such as milk and bread.
A frequent sequential pattern is a frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card.
Market basket analysis is a typical example of frequent itemset mining.
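A brute-force sketch of frequent itemset counting for a market basket follows. It enumerates every itemset up to a small size instead of pruning candidates the way Apriori does, so it is only suitable for tiny examples. The baskets, the `min_support` threshold, and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to `max_size` items
    whose support (fraction of transactions) is at least `min_support`."""
    n = len(transactions)
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # sorted so each itemset has one key
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Hypothetical market baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
]
frequent = frequent_itemsets(baskets, min_support=0.6)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

With these baskets, milk and bread appear together in three of five transactions, so the pair passes the 0.6 threshold while milk and eggs do not.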
12
Association (Cont.)
Let $I = \{I_1, I_2, I_3, \ldots\}$ be a set of items.
Let $A$ and $B$ be sets of items.
The association rule $A \Rightarrow B$ holds where:
– $A \subset I$
– $B \subset I$
– $A \cap B = \emptyset$
13
Association (Cont.)
Support is the percentage of transactions that contain $A \cup B$, that is, every item in both $A$ and $B$. It is taken to be the probability $P(A \cup B)$.
Confidence is the percentage of transactions containing $A$ that also contain $B$. It is taken to be the conditional probability $P(B \mid A)$.
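The two measures can be computed directly from transaction counts. The sketch below assumes transactions are given as Python sets; the helper names and the baskets are hypothetical.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing both A and B."""
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of the transactions containing A that also contain B.
    Raises ZeroDivisionError if no transaction contains A."""
    both = set(antecedent) | set(consequent)
    return (count_containing(transactions, both)
            / count_containing(transactions, antecedent))


# Hypothetical market baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
]
rule_support = support(baskets, {"milk"}, {"bread"})
rule_confidence = confidence(baskets, {"milk"}, {"bread"})
print(rule_support, rule_confidence)
```

Here the rule milk ⇒ bread is supported by three of five baskets, and three of the four baskets containing milk also contain bread.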
14
Association (Cont.)
Frequent pattern mining classification:
– Different levels of abstraction
– Number of dimensions
15
Association (Cont.)
Strong association rules are not necessarily interesting.
Correlation analysis can supplement the support-confidence framework to flag misleading rules.
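The slide's formula did not survive extraction. One widely used correlation measure is lift, defined as $P(A \cup B) / (P(A)\,P(B))$; using it here is an assumption, not a confirmed reconstruction of the slide. A lift below 1 means A and B occur together less often than independence would predict, even when support and confidence look strong. The counts are illustrative.

```python
def lift(n_total, n_a, n_b, n_both):
    """Lift of A => B from raw counts: P(A and B) / (P(A) * P(B))."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_both = n_both / n_total
    return p_both / (p_a * p_b)


# Illustrative counts: 10,000 transactions, 6,000 contain A,
# 7,500 contain B, and 4,000 contain both.
rule_confidence = 4000 / 6000            # looks strong in isolation
rule_lift = lift(10_000, 6_000, 7_500, 4_000)
print(round(rule_confidence, 3), round(rule_lift, 3))
```

In this example the rule has 40% support and about 67% confidence, yet B alone occurs in 75% of transactions, so buying A actually lowers the chance of buying B.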
17
Classification & Prediction
Classification predicts categorical variables: a classifier is constructed to predict labels such as “safe” or “risky” for loan application data.
Prediction models continuous-valued functions; regression analysis is the most often used methodology.
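As an illustration of prediction by regression, here is a minimal least-squares fit of a straight line to one predictor. The data points and names are hypothetical, and real problems usually involve several predictors and a library routine.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict a continuous value for a new x."""
    return slope * x + intercept


# Hypothetical training data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(slope, intercept, 5))
```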
18
Classification
Learning step: a classification algorithm builds the classifier by learning from a training set.
Classification step: test data are used to estimate the accuracy of the classification rules.
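The two steps can be sketched with a deliberately simple classifier: a single income threshold learned from the training set and then scored on held-out test data. The classifier, the attribute, and all values are assumptions chosen to keep the example short.

```python
def learn_threshold(training):
    """Learning step: choose the income threshold that classifies the most
    training examples correctly (income >= threshold means 'safe')."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({income for income, _ in training}):
        correct = sum(
            classify(threshold, income) == label for income, label in training
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def classify(threshold, income):
    """Apply the learned rule to one applicant."""
    return "safe" if income >= threshold else "risky"


def accuracy(threshold, test):
    """Classification step: fraction of test examples labeled correctly."""
    return sum(classify(threshold, x) == y for x, y in test) / len(test)


# Hypothetical loan applicants: (income in thousands, label).
training = [(20, "risky"), (25, "risky"), (40, "safe"), (60, "safe"), (30, "risky")]
test = [(22, "risky"), (35, "safe"), (50, "safe"), (28, "risky")]

threshold = learn_threshold(training)
test_accuracy = accuracy(threshold, test)
print(threshold, test_accuracy)
```

The rule fits the training set perfectly but misclassifies one test applicant, which is why accuracy is estimated on data the learner has not seen.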
19
Classification by Decision Tree
A decision tree is a flowchart-like tree structure in which each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
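The structure described above maps naturally onto nested dictionaries. In this sketch the tree is written by hand rather than learned, and the attribute names and values are hypothetical.

```python
# Internal nodes test an attribute; branches are keyed by the test outcome;
# leaf nodes hold a class label.
tree = {
    "attribute": "income",
    "branches": {
        "high": {"label": "safe"},
        "low": {
            "attribute": "has_collateral",
            "branches": {
                "yes": {"label": "safe"},
                "no": {"label": "risky"},
            },
        },
    },
}


def classify(node, record):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while "label" not in node:
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node["label"]


applicant = {"income": "low", "has_collateral": "no"}
print(classify(tree, applicant))
```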
20
Classification by Decision Tree (Cont.)
An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a given data partition.
Ideally each partition should be pure, meaning all of the tuples that fall into a given partition belong to the same class.
Popular attribute selection measures are “information gain”, “gain ratio”, and “gini index”.
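Two of these measures can be sketched as follows: entropy-based information gain and the Gini index of a partition. The rows and attribute names are hypothetical, and gain ratio is omitted for brevity.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 0 for a pure partition, larger for mixed ones."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    after = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy([row[target] for row in rows]) - after


# Hypothetical loan data: income separates the classes, owns_home does not.
rows = [
    {"income": "high", "owns_home": "yes", "risk": "safe"},
    {"income": "high", "owns_home": "no", "risk": "safe"},
    {"income": "low", "owns_home": "yes", "risk": "risky"},
    {"income": "low", "owns_home": "no", "risk": "risky"},
]
gain_income = information_gain(rows, "income", "risk")
gain_home = information_gain(rows, "owns_home", "risk")
print(gain_income, gain_home, gini([r["risk"] for r in rows]))
```

Splitting on income yields two pure partitions, so it has the higher gain and would be chosen as the splitting attribute.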
21
Classification by Decision Tree (Cont.)
Decision trees are appropriate for exploratory knowledge discovery.
Decision trees can handle high-dimensional data.
Their representation of acquired knowledge in tree form is intuitive and quickly assimilated by humans.
22
Clustering
Clustering is the process of grouping data into classes or clusters, so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.
In classification, the class label of each object is known.
Clustering is an example of “unsupervised learning” or “learning by observation”: it does not rely on predefined classes.
Clustering is also called data segmentation, and it can be used for outlier detection.
Categorization: partitioning methods, hierarchical methods, density-based methods.
23
Clustering (Cont.)
Partitioning methods
– Each group must contain at least one object
– Each object must belong to exactly one group
– It creates an initial partitioning, then uses an “iterative relocation technique”
– K-means algorithm
– K-Medoids algorithm
– Density-based method
24
Clustering (Cont.) Partitioning K-Means
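The slide's figure is not preserved, so here is a minimal k-means sketch showing the iterative relocation idea: assign each point to its nearest center, move each center to the mean of its cluster, and repeat until nothing changes. The points and the fixed initial centers are hypothetical; practical implementations pick initial centers randomly or with a seeding heuristic.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, centers, max_iter=100):
    """Return (centers, clusters) after iterative relocation."""
    clusters = []
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update: each center moves to its cluster mean (kept if empty).
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
print(centers)
print(clusters)
```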
25
Clustering (Cont.) Partitioning K-Medoids
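The slide's figure is not preserved. The sketch below uses a simple alternating scheme (assign points to the nearest medoid, then pick the cluster member with the smallest total distance to the others) rather than the full PAM swap procedure, and it assumes Manhattan distance. The points, including the outlier, are hypothetical.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def kmedoids(points, medoids, max_iter=100):
    """Return (medoids, clusters); every medoid is an actual data point."""
    clusters = []
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its nearest medoid.
        clusters = [[] for _ in medoids]
        for p in points:
            nearest = min(range(len(medoids)),
                          key=lambda i: manhattan(p, medoids[i]))
            clusters[nearest].append(p)
        # Update: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(c, key=lambda cand: sum(manhattan(cand, q) for q in c))
            if c else medoids[i]
            for i, c in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters


# Hypothetical points; (50, 50) is an outlier.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9), (9, 9), (50, 50)]
medoids, clusters = kmedoids(points, medoids=[(1, 1), (8, 8)])
print(medoids)
```

Because a medoid must be a real object, the outlier shifts the second cluster's representative only from (8, 8) to (9, 9), whereas the mean of that cluster would be pulled out to (16.8, 16.8).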
26
Clustering (Cont.)
Hierarchical methods
– Agglomerative (bottom-up) or divisive (top-down)
– Once a step is done, it can never be undone
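A minimal agglomerative (bottom-up) sketch follows. It assumes one-dimensional points and single-linkage distance (the gap between the closest members of two clusters), both chosen only to keep the example small. Each merge is final, which illustrates why a step can never be undone.

```python
def single_linkage(points, target_clusters):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until `target_clusters` remain."""
    clusters = [[p] for p in points]
    merges = []  # history of (left cluster, right cluster, distance)
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]  # the merge is never revisited
        del clusters[j]
    return clusters, merges


# Hypothetical 1-D values.
values = [1, 2, 6, 7, 15]
clusters, merges = single_linkage(values, target_clusters=2)
print(clusters)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```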
27
Clustering (Cont.)
Density-based methods
– A cluster keeps growing as long as the number of data points in the neighborhood exceeds some threshold
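The neighborhood test at the heart of density-based methods can be sketched as below. The parameter names `eps` and `min_pts` follow the DBSCAN convention, which is an assumption here since the slide names no algorithm; the sketch counts a point as its own neighbor and uses "at least `min_pts`" as the threshold. It only identifies dense ("core") points and does not grow full clusters.

```python
from math import dist


def core_points(points, eps, min_pts):
    """Return the points whose eps-neighborhood (including the point
    itself) contains at least `min_pts` points."""
    cores = []
    for p in points:
        neighbors = sum(1 for q in points if dist(p, q) <= eps)
        if neighbors >= min_pts:
            cores.append(p)
    return cores


# Hypothetical points: a dense square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
cores = core_points(points, eps=1.5, min_pts=3)
print(cores)
```

The isolated point has only itself in its neighborhood, so it is not dense enough to seed or extend a cluster and would be treated as noise.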
29
Data Mining Example
Vermont Country Store
– Created a score for each customer based on RFM (Recency, Frequency, Monetary)
– Created a model for mailing catalogs, then used the model against older mailings and found significant impact
– Created a catalog for each of the customer segments produced by data mining
– Found association rules showing that certain car owners are frequent buyers of certain products. The company purchased a list of all new car owners of that specific type and increased its sales substantially
– Data mining ROI was calculated as the ratio of the extra revenue brought in due to the models to the money invested in data mining. It was 1,182 percent!
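The ROI ratio described in the last point is simple arithmetic. The dollar amounts below are hypothetical, chosen only so that the result matches the reported percentage; the source gives no revenue or cost figures.

```python
def roi_percent(extra_revenue, investment):
    """ROI as defined on the slide: extra revenue attributable to the
    models divided by the money invested, expressed as a percentage."""
    if investment <= 0:
        raise ValueError("investment must be positive")
    return 100 * extra_revenue / investment


# Hypothetical figures, not taken from the case study.
print(roi_percent(extra_revenue=118_200, investment=10_000))
```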
30
Major Issues in Data Mining
– Massive datasets and high dimensionality
– User interaction and prior knowledge
– Overfitting and assessing statistical significance
– Missing data
– Understandability of patterns
– Managing changing data and knowledge
– Integration
– Multimedia and object-oriented data
31
Data Mining Applications
Financial Data Analysis:
– Loan payment prediction
– Clustering customers for targeted marketing
– Detection of financial crimes
Retail Industry:
– Effectiveness of sales campaigns
– Customer retention
– Product recommendation
Telecommunication Industry:
– Identification of unusual patterns
– Multidimensional association analysis
– Mobile telecommunication services
32
Data Mining Trends
– Data preprocessing and integration
– Increasing usability
– Spatial data mining, social media mining, multimedia mining, visual data mining, graph mining, mobile data mining
– Privacy protection
34
Resources
– “Data Mining: Concepts and Techniques”, Jiawei Han and Micheline Kamber
– “Mastering Data Mining: The Art and Science of Customer Relationship Management”, Michael Berry and Gordon Linoff
– “Statistical Analysis and Data Mining Applications”, Robert Nisbet, John Elder, and Gary Miner