
Data Mining Tarek Soukieh 11/18/2010

Agenda
1. The Evolution of Database Technology
2. Introduction
3. Data Preprocessing
4. OLAP vs. Data Mining
5. Data Mining Algorithms
   1. Association
   2. Classification & Prediction
   3. Cluster Analysis
6. Data Mining Example
7. Major Issues in Data Mining
8. Data Mining Applications
9. Trends in Data Mining

The Evolution of Database Technology

Introduction
Data mining refers to extracting, or “mining”, knowledge from large amounts of data.
It is also known by the acronym KDD, “Knowledge Discovery from Data”.
Data is abundant today, but much of it sits unanalyzed in archives that become “data tombs”: we are “data rich but information poor”.

Introduction (Cont.)
Data Mining Cycle:
– Identify the business problem
– Validate, explore, and clean the data
– Prepare the model
– Check the performance of the model
– Act on the results (Training - Testing - Scoring)
Data Mining Assumptions:
– The past is a good predictor of the future
Data Mining Categorization:
– Directed vs. Undirected
– Descriptive vs. Predictive

Introduction (Cont.)

Data Preprocessing
Data Cleaning
– Measuring dispersion of data
– Principal Component Analysis
– Correlation Analysis
– Regression
– Clustering
– Sampling
Data Transformation
– Smoothing
– Aggregation
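
As a rough illustration of the two transformation steps listed above, the sketch below smooths values by bin means and aggregates amounts per key. The function names, the equal-frequency binning scheme, and the sample values are assumptions chosen for illustration; they are not taken from the slides.

```python
from collections import defaultdict
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of `bin_size`, and replace
    every value with the mean of its bin (one simple smoothing scheme)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        smoothed.extend([mean(current_bin)] * len(current_bin))
    return smoothed


def aggregate(records):
    """Sum the amounts per key, e.g. daily sales rolled up per quarter."""
    totals = defaultdict(float)
    for key, amount in records:
        totals[key] += amount
    return dict(totals)


# Toy values, invented for illustration only.
PRICES = [4, 8, 15, 21, 21, 24, 25, 28, 34]
SALES = [("Q1", 100.0), ("Q1", 50.0), ("Q2", 70.0)]

SMOOTHED = smooth_by_bin_means(PRICES, bin_size=3)
TOTALS = aggregate(SALES)
```

Smoothing trades detail for noise reduction, so the bin size is a tuning choice rather than a fixed rule.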

OLAP vs. Data Mining
OLAP is a data summarization and aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
Data mining applies sophisticated pattern recognition algorithms to the data, while OLAP reports aggregated data from data warehouses.
OLAP lets the user drill, pivot, slice, and dice, while data mining covers a much broader spectrum of techniques, such as association, classification, prediction, and clustering.

OLAP vs. Data Mining (Cont.)
OLAP targets business problems, while data mining can also have socioeconomic applications.
Data mining is not confined to the analysis of data stored in data warehouses.
Data mining is more versatile than OLAP.

Association
A frequent itemset is a set of items that frequently appear together in a transactional data set, such as milk and bread.
A frequent sequential pattern is a frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, then a digital camera, and then a memory card.
Market basket analysis is a typical example of frequent itemset mining.

Association (Cont.)
Let $I = \{I_1, I_2, I_3, \ldots\}$ be a set of items.
Let $A$ and $B$ each be a set of items.
The association rule $A \Rightarrow B$ holds where:
– $A \subset I$
– $B \subset I$
– $A \cap B = \emptyset$

Association (Cont.)
Support is the percentage of transactions that contain $A \cup B$, that is, every item in both A and B. It estimates the probability $P(A \cup B)$.
Confidence is the percentage of transactions containing A that also contain B. It estimates the conditional probability $P(B \mid A)$.
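
The two measures above can be sketched in a few lines of Python. The transactions and function names below are invented for illustration; only the definitions of support and confidence come from the slide.

```python
# Toy transactions, invented for illustration only.
TRANSACTIONS = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread"}),
    frozenset({"milk", "eggs"}),
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, a, b):
    """Fraction of the transactions containing A that also contain B."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)


# Rule {milk} => {bread}
RULE_SUPPORT = support(TRANSACTIONS, {"milk", "bread"})          # 2 of 4 transactions
RULE_CONFIDENCE = confidence(TRANSACTIONS, {"milk"}, {"bread"})  # 2 of the 3 milk transactions
```

A rule is usually called strong when both values clear user-chosen minimum thresholds.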

Association (Cont.)
Frequent pattern mining can be classified by:
– Levels of abstraction
– Number of dimensions

Association (Cont.)
Strong association rules are not necessarily interesting.
Correlation analysis can supplement support and confidence when judging whether a rule is interesting.
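
The formula from this slide did not survive in the transcript. One widely used correlation measure is lift, $P(A \cup B) / (P(A)\,P(B))$, and the sketch below assumes that measure. The transactions and function names are invented for illustration.

```python
# Toy transactions, invented for illustration only.
TRANSACTIONS = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread"}),
    frozenset({"milk", "eggs"}),
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def lift(transactions, a, b):
    """Lift of the rule A => B.

    A value of 1 suggests independence, above 1 a positive correlation,
    and below 1 a negative correlation.
    """
    joint = support(transactions, set(a) | set(b))
    return joint / (support(transactions, a) * support(transactions, b))


# {milk} => {bread} has a confidence of about 67 percent in this toy data,
# yet its lift is below 1: buying milk makes bread slightly less likely.
MILK_BREAD_LIFT = lift(TRANSACTIONS, {"milk"}, {"bread"})
```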

Classification & Prediction
Classification predicts categorical variables: a classifier is constructed to predict labels such as “safe” or “risky” for loan application data.
Prediction models continuous-valued functions; regression analysis is the most commonly used methodology.

Classification
Learning step: a classification algorithm builds the classifier by learning from a training set.
Classification step: test data are used to estimate the accuracy of the classification rules.
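
A minimal sketch of the two steps, assuming a deliberately simple classifier: a single income cut-off that separates “safe” from “risky” loan applicants. The data, the cut-off rule, and all names are hypothetical; a real classifier would normally use many attributes.

```python
# Toy loan data, invented for illustration: (income in thousands, class label).
TRAIN = [(20, "risky"), (25, "risky"), (40, "safe"),
         (55, "safe"), (30, "risky"), (60, "safe")]
TEST = [(22, "risky"), (50, "safe"), (35, "safe"), (28, "risky")]


def classify(threshold, income):
    """Apply the learned rule: incomes at or above the cut-off are 'safe'."""
    return "safe" if income >= threshold else "risky"


def accuracy(threshold, data):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for income, label in data
                  if classify(threshold, income) == label)
    return correct / len(data)


def learn_threshold(train):
    """Learning step: try every observed income as the cut-off and keep
    the one with the highest accuracy on the training set."""
    candidates = sorted({income for income, _ in train})
    return max(candidates, key=lambda th: accuracy(th, train))


THRESHOLD = learn_threshold(TRAIN)            # learning step
TEST_ACCURACY = accuracy(THRESHOLD, TEST)     # classification step
```

The rule fits the training set perfectly but misses one test example, which is why accuracy is estimated on data the classifier has not seen.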

Classification by Decision Tree
A decision tree is a flowchart-like tree structure in which each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Classification by Decision Tree (Cont.)
An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a given data partition.
Ideally, each resulting partition is pure: all of the tuples that fall into it belong to the same class.
Well-known attribute selection measures are “information gain”, “gain ratio”, and “Gini index”.
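
To make two of these measures concrete, the sketch below computes entropy-based information gain and the Gini index for a tiny table. The rows, attribute names, and function names are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 0 for a pure partition, larger for mixed ones."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the whole partition minus the weighted entropy of the
    sub-partitions produced by splitting on `attribute`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Toy loan applications, invented for illustration only.
ROWS = [
    {"income": "high", "student": "no", "label": "safe"},
    {"income": "high", "student": "yes", "label": "safe"},
    {"income": "low", "student": "no", "label": "risky"},
    {"income": "low", "student": "yes", "label": "risky"},
]

GAIN_INCOME = information_gain(ROWS, "income", "label")    # splits into pure partitions
GAIN_STUDENT = information_gain(ROWS, "student", "label")  # leaves both partitions mixed
```

Here a tree builder would split on `income`, because it yields the larger gain.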

Classification by Decision Tree (Cont.)
Decision trees are appropriate for exploratory knowledge discovery.
Decision trees can handle high-dimensional data.
Their representation of acquired knowledge in tree form is intuitive and is assimilated quickly and easily by humans.

Clustering
Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.
In classification, the class label of each object is known; in clustering it is not.
Clustering is an example of “unsupervised learning”, or “learning by observation”: it does not rely on predefined classes.
Clustering is also called data segmentation, and it can be used for outlier detection.
Categorization: partitioning methods, hierarchical methods, density-based methods.

Clustering (Cont.)
Partitioning methods
– Each group must contain at least one object
– Each object must belong to exactly one group
– The method creates an initial partitioning, then improves it with an “iterative relocation technique”
– K-Means algorithm
– K-Medoids algorithm
– Density-based method

Clustering (Cont.) Partitioning K-Means
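
The figure for this slide is not part of the transcript. As a stand-in, here is a minimal K-Means sketch. Using the first k points as initial centroids is an assumption made to keep the example deterministic (practical implementations usually pick them at random), and the points are invented. It needs Python 3.8 or later for `math.dist`.

```python
from math import dist


def k_means(points, k, max_iter=100):
    """Iterative relocation: assign each point to its nearest centroid,
    recompute each centroid as the mean of its members, and repeat until
    the assignment stops changing."""
    centroids = [tuple(p) for p in points[:k]]  # naive initial partitioning (assumption)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


# Toy 2-D points, invented for illustration only.
POINTS = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
CENTROIDS, ASSIGNMENT = k_means(POINTS, k=2)
```

Even though both initial centroids start in the same group, the relocation steps separate the two groups within a few iterations.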

Clustering (Cont.) Partitioning K-Medoids
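
The figure for this slide is not part of the transcript. The sketch below shows a simple alternating K-Medoids variant: assign points to the nearest medoid, then pick as the new medoid the member with the smallest total distance to the rest of its cluster. This is a simplification and not the full PAM swap search; the initialization from the first k points and the data are assumptions. It needs Python 3.8 or later for `math.dist`.

```python
from math import dist


def k_medoids(points, k, max_iter=100):
    """Like K-Means, but each cluster is represented by one of its own
    points (the medoid) instead of the mean of its members."""
    medoids = list(points[:k])  # naive initial choice (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, medoids[c]))
            clusters[nearest].append(p)
        new_medoids = [
            min(members, key=lambda m: sum(dist(m, q) for q in members))
            if members else medoids[c]
            for c, members in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters


# Toy 2-D points with one outlier at (30, 10), invented for illustration only.
POINTS = [(1, 1), (2, 1), (3, 1), (10, 10), (11, 10), (30, 10)]
MEDOIDS, CLUSTERS = k_medoids(POINTS, k=2)
```

The mean of the second cluster would be (17, 10), pulled toward the outlier, while its medoid stays on an actual data point. That robustness to outliers is the usual motivation for medoids.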

Clustering (Cont.)
Hierarchical methods
– Agglomerative (bottom-up) or divisive (top-down)
– Once a merge or split step is done, it can never be undone
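
A minimal agglomerative sketch on one-dimensional values, assuming single-linkage distance (the gap between the two closest members of two clusters). The linkage choice, the function name, and the data are assumptions for illustration.

```python
def agglomerative(values, k):
    """Bottom-up clustering: start with one cluster per value and keep
    merging the two closest clusters until only k remain. A merge is
    never revisited, which is the 'can never be undone' property."""
    clusters = [[v] for v in values]
    while len(clusters) > k:
        best = None  # (distance, index of first cluster, index of second cluster)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy values, invented for illustration only.
VALUES = [1, 2, 9, 10, 25]
THREE_CLUSTERS = agglomerative(VALUES, k=3)
TWO_CLUSTERS = agglomerative(VALUES, k=2)
```

Stopping at different values of k corresponds to cutting the same merge tree at different heights.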

Clustering (Cont.)
Density-based methods
– A cluster keeps growing as long as the number of data points in the neighborhood exceeds some threshold
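
The neighborhood test at the heart of density-based methods can be sketched as follows. The parameter names `eps` and `min_pts`, the choice to count a point as its own neighbor, and the use of "at least" rather than "strictly more than" follow the common DBSCAN convention and are assumptions here, as are the sample points. It needs Python 3.8 or later for `math.dist`.

```python
from math import dist


def neighbors(points, index, eps):
    """Indices of all points within distance `eps` of points[index],
    including the point itself."""
    return [j for j, q in enumerate(points) if dist(points[index], q) <= eps]


def core_points(points, eps, min_pts):
    """Indices of points whose neighborhood is dense enough to grow a cluster."""
    return [i for i in range(len(points))
            if len(neighbors(points, i, eps)) >= min_pts]


# Toy 2-D points: a tight square plus one isolated point, invented for illustration.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
CORE = core_points(POINTS, eps=1.5, min_pts=3)
```

The isolated point never reaches the threshold, so a density-based method would treat it as noise instead of forcing it into a cluster.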

Data Mining Example
Vermont Country Store
– Created a score for each customer based on RFM (Recency, Frequency, Monetary)
– Created a model for mailing catalogs, then ran the model against older mailings and found a significant impact
– Created a catalog for each of the customer segments produced by data mining
– Found association rules showing that owners of certain cars are frequent buyers of certain products. The company purchased a list of all new owners of that type of car and increased its sales substantially
– Data mining ROI was calculated as the ratio of the extra revenue brought in by the models to the money invested in data mining. It was 1,182 percent!
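
The slides do not say how the RFM score was computed, so the sketch below uses a hypothetical rank-based scheme: each customer gets 1 to 3 points per dimension, and the points are summed. The ROI function follows the ratio described above. All customer figures and the revenue and investment amounts are invented; they are not the company's data.

```python
def rank_scores(values, higher_is_better=True, levels=3):
    """Score each value from 1 (worst) to `levels` (best) by its rank.
    This scoring scheme is a hypothetical choice for illustration."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = 1 + rank * levels // len(values)
    return scores


def rfm_scores(customers):
    """Sum of the recency, frequency, and monetary scores per customer."""
    recency = rank_scores([c["days_since_purchase"] for c in customers],
                          higher_is_better=False)  # recent buyers score higher
    frequency = rank_scores([c["orders"] for c in customers])
    monetary = rank_scores([c["spend"] for c in customers])
    return {c["name"]: r + f + m
            for c, r, f, m in zip(customers, recency, frequency, monetary)}


def roi_percent(extra_revenue, investment):
    """Extra revenue attributed to the models, as a percentage of the
    money invested in data mining."""
    return 100.0 * extra_revenue / investment


# Invented customers and amounts, for illustration only.
CUSTOMERS = [
    {"name": "A", "days_since_purchase": 5, "orders": 12, "spend": 900},
    {"name": "B", "days_since_purchase": 200, "orders": 1, "spend": 40},
    {"name": "C", "days_since_purchase": 40, "orders": 4, "spend": 300},
]
SCORES = rfm_scores(CUSTOMERS)
EXAMPLE_ROI = roi_percent(extra_revenue=118_200, investment=10_000)
```

An ROI of 1,182 percent means the extra revenue was 11.82 times the investment.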

Major Issues in Data Mining
– Massive datasets and high dimensionality
– User interaction and prior knowledge
– Overfitting and assessing statistical significance
– Missing data
– Understandability of patterns
– Managing changing data and knowledge
– Integration
– Multimedia and object-oriented data

Data Mining Applications
Financial Data Analysis:
– Loan payment prediction
– Clustering customers for targeted marketing
– Detection of financial crimes
Retail Industry:
– Effectiveness of sales campaigns
– Customer retention
– Product recommendation
Telecommunication Industry:
– Identification of unusual patterns
– Multidimensional association analysis
– Mobile telecommunication services

Data Mining Trends
– Data preprocessing and integration
– Increasing usability
– Spatial data mining, social media mining, multimedia mining, visual data mining, graph mining, mobile data mining
– Privacy protection

Resources
“Data Mining: Concepts and Techniques” – Jiawei Han and Micheline Kamber
“Mastering Data Mining: The Art and Science of Customer Relationship Management” – Michael Berry and Gordon Linoff
“Statistical Analysis and Data Mining Applications” – Robert Nisbet, John Elder, and Gary Miner
