Data Warehousing and Data Mining Prepared By: Amit Patel http://ajpatelit.hpage.com
Decision Tree
A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
Decision Tree
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
Attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or a splitting subset.
Output: A decision tree.
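The recursive structure of this algorithm can be sketched compactly in code. Below is a minimal Python sketch, assuming discrete-valued attributes and a caller-supplied Attribute_selection_method; the helper majority_class and the dict-based tree representation are illustrative choices, not part of the textbook pseudocode.

```python
# A minimal sketch of Generate_decision_tree, assuming discrete-valued
# attributes. majority_class and the dict-based tree are illustrative.
from collections import Counter

def majority_class(D):
    # D is a list of (attribute_dict, class_label) pairs
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    labels = {label for _, label in D}
    if len(labels) == 1:                  # all tuples in one class: leaf
        return labels.pop()
    if not attribute_list:                # no attributes left: majority leaf
        return majority_class(D)
    A = attribute_selection_method(D, attribute_list)  # splitting attribute
    subtree = {}
    remaining = [a for a in attribute_list if a != A]
    for v in {t[A] for t, _ in D}:        # one branch per known value of A
        Dv = [(t, c) for t, c in D if t[A] == v]
        subtree[v] = generate_decision_tree(Dv, remaining,
                                            attribute_selection_method)
    return {A: subtree}
```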
Let A be the splitting attribute.
(a) If A is discrete-valued, then one branch is grown for each known value of A.
(b) If A is continuous-valued, then two branches are grown, corresponding to A ≤ split_point and A > split_point.
(c) If A is discrete-valued and a binary tree must be produced, then the test is of the form A ∈ S_A, where S_A is the splitting subset for A.
Naïve Bayesian Classification
Procedure: Assume that there is a set of m samples S = {s1, s2, …, sm} (the training data set), where every sample si is represented as an n-dimensional vector (x1, x2, …, xn) whose values xi correspond to attributes A1, A2, …, An, respectively. Also, there are k classes C1, C2, …, Ck, and every sample belongs to one of these classes. The posterior probabilities are computed using Bayes' theorem:
P(Ci|X) = [P(X|Ci) · P(Ci)] / P(X)
As P(X) is constant for all classes, only the product P(X|Ci) · P(Ci) needs to be maximized.
Naïve Bayesian Classification
The data tuples are described by the attributes age, income, student, and credit_rating. The class label attribute, buys_computer, has two distinct values, namely {yes, no}. Let the classes be:
C1: buys_computer = "yes"
C2: buys_computer = "no"
The data sample we want to classify is:
X = (age = youth, income = medium, student = yes, credit_rating = fair)
Naïve Bayesian Classification
We need to maximize P(X|Ci) · P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be computed based on the training tuples:
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities for X = (age = youth, income = medium, student = yes, credit_rating = fair):
P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400
Naïve Bayesian Classification
Using the above probabilities, we obtain
P(X|buys_computer = yes) = P(age = youth | buys_computer = yes) × P(income = medium | buys_computer = yes) × P(student = yes | buys_computer = yes) × P(credit_rating = fair | buys_computer = yes)
= 0.222 × 0.444 × 0.667 × 0.667 = 0.044
Naïve Bayesian Classification
Similarly, P(X|buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
To find the class, Ci, that maximizes P(X|Ci) · P(Ci), we compute
P(X|buys_computer = yes) × P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X|buys_computer = no) × P(buys_computer = no) = 0.019 × 0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
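The arithmetic in this worked example is easy to verify in code. The short Python check below hard-codes the probabilities from the preceding slides; nothing else is assumed.

```python
# Verify the naive Bayes computation for X = (youth, medium, yes, fair).
# All probabilities are taken directly from the worked example above.
p_yes, p_no = 9/14, 5/14                      # class priors

# P(attribute value | class) for each attribute of X
cond_yes = [2/9, 4/9, 6/9, 6/9]               # age, income, student, credit
cond_no  = [3/5, 2/5, 1/5, 2/5]

score_yes = p_yes
for p in cond_yes:
    score_yes *= p                            # P(X|yes) * P(yes)
score_no = p_no
for p in cond_no:
    score_no *= p                             # P(X|no) * P(no)

print(f"P(X|yes)P(yes) = {score_yes:.3f}")    # ~0.028
print(f"P(X|no)P(no)   = {score_no:.3f}")     # ~0.007
print("prediction:", "yes" if score_yes > score_no else "no")
```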
Rule-Based Classification
Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form IF condition THEN conclusion. An example is rule R1:
R1: IF age = youth AND student = yes THEN buys_computer = yes
Rule-Based Classification
The "IF" part (or left-hand side) of a rule is known as the rule antecedent or precondition. The "THEN" part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth and student = yes) that are logically ANDed. The rule's consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer). R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
Rule-Based Classification
If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple. A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set, D, let n_covers be the number of tuples covered by R, n_correct be the number of tuples correctly classified by R, and |D| be the number of tuples in D.
Rule-Based Classification
We can define the coverage and accuracy of R as:
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
Find the coverage and accuracy for the rule R1: IF age = youth AND student = yes THEN buys_computer = yes.
Rule-Based Classification
(Table: the class-labeled training tuples of the AllElectronics customer database.)
Rule-Based Classification
These are class-labeled tuples from the AllElectronics customer database. Our task is to predict whether a customer will buy a computer. Consider rule R1 above, which covers 2 of the 14 tuples and correctly classifies both of them. Therefore,
coverage(R1) = 2/14 = 14.28%
accuracy(R1) = 2/2 = 100%
Rule-Based Classification
R2: IF age = youth AND student = no THEN buys_computer = no
coverage(R2) = 3/14 = 21.43%, accuracy(R2) = 3/3 = 100%
R3: IF age = senior AND credit_rating = excellent THEN buys_computer = yes
coverage(R3) = 2/14 = 14.28%, accuracy(R3) = 0/2 = 0%
R4: IF age = middle_aged THEN buys_computer = yes
coverage(R4) = 4/14 = 28.57%, accuracy(R4) = 4/4 = 100%
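A small sketch of how coverage and accuracy can be computed, assuming each tuple is a dict of attribute values plus a class label. The three-tuple data set D below is only an illustrative stand-in, not the full 14-tuple AllElectronics table.

```python
# Minimal coverage/accuracy helpers for IF-THEN rules over dict tuples.
def covers(rule_conditions, tup):
    # the rule covers the tuple if every attribute test holds
    return all(tup.get(attr) == val for attr, val in rule_conditions.items())

def coverage_and_accuracy(rule_conditions, predicted_class, D, label="class"):
    covered = [t for t in D if covers(rule_conditions, t)]
    correct = [t for t in covered if t[label] == predicted_class]
    coverage = len(covered) / len(D)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

D = [  # illustrative tuples only, not the full AllElectronics table
    {"age": "youth",  "student": "yes", "class": "yes"},
    {"age": "youth",  "student": "no",  "class": "no"},
    {"age": "senior", "student": "yes", "class": "yes"},
]
r1 = {"age": "youth", "student": "yes"}        # antecedent of R1
print(coverage_and_accuracy(r1, "yes", D))     # (0.333..., 1.0)
```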
Basic Sequential Covering Algorithm
IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm, sketched below.
Input: D, a data set of class-labeled tuples; Att_vals, the set of all attributes and their possible values.
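Here is a minimal Python sketch of the sequential covering loop: rules are learned one at a time for each class, and the tuples covered by a new rule are removed before the next rule is learned. The function learn_one_rule is assumed to exist (it would greedily grow the best rule for the current class from Att_vals) and is not implemented here.

```python
# A minimal sketch of sequential covering. learn_one_rule is assumed to
# return (rule, covered_tuples) for the current class; its greedy rule-
# growing implementation is omitted.
def sequential_covering(D, att_vals, classes, learn_one_rule, label="class"):
    rule_set = []
    for c in classes:
        remaining = list(D)
        # keep learning rules while tuples of class c remain uncovered
        while any(t[label] == c for t in remaining):
            rule, covered = learn_one_rule(remaining, att_vals, c)
            if rule is None:          # no acceptable rule could be found
                break
            rule_set.append((rule, c))
            # remove the tuples covered by the new rule and repeat
            remaining = [t for t in remaining if t not in covered]
    return rule_set
```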
CART
Classification and Regression Trees
A University of California study of patients after admission for a heart attack: 19 variables were collected during the first 24 hours for 215 patients (those who survived the first 24 hours). Question: can the high-risk patients (those who will not survive 30 days) be identified?
CART
(Figure: the resulting classification tree.) Is the minimum systolic blood pressure over the first 24 hours > 91? If not, classify as high risk (H). Otherwise: is age > 62.5? If not, low risk (L). Otherwise: is sinus tachycardia present? If yes, high risk (H); if no, low risk (L).
CART
Plan for construction of a CART:
Selection of the splits
Deciding when a node is a terminal node (i.e., not to split it any further)
Assigning a class to each terminal node
CART
Where do splits need to be evaluated? (Figure: the same data sorted by Age and sorted by Blood Pressure, with candidate split points marked X.)
CART
Features of CART:
Binary splits
Splits based on only one variable
CART
Advantages of CART:
Can cope with any data structure or type
Classification has a simple form
Uses conditional information effectively
Invariant under transformations of the variables
Robust with respect to outliers
Gives an estimate of the misclassification rate
CART
Disadvantages of CART:
CART does not use combinations of variables.
The tree can be deceptive: if a variable is not included, it may be because it was "masked" by another.
Tree structures may be unstable: a change in the sample may give different trees.
The tree is optimal at each split, but it may not be globally optimal.
Neural Networks
Neural networks are useful for data mining and decision-support applications. People are good at generalizing from experience; computers excel at following explicit instructions over and over. Neural networks bridge this gap by modeling, on a computer, the neural behavior of the human brain.
Neural Networks
Neural networks map a set of input nodes to a set of output nodes. The number of inputs and outputs is variable. The network itself is composed of an arbitrary number of nodes with an arbitrary topology.
Neural Networks
A neuron is a many-input/one-output unit. Its output can be excited or not excited. Incoming signals from other neurons determine whether the neuron shall excite ("fire"). The output is subject to attenuation in the synapses, which are the junction parts of the neuron.
Neural Networks
(Figure: a multilayer feed-forward neural network.)
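To make the feed-forward idea concrete, here is a minimal forward pass in plain Python; the layer sizes, weights, and sigmoid activation are illustrative choices, not taken from the slides.

```python
import math

def sigmoid(x):
    # squashing activation: maps any input to (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # one fully connected layer: each output unit sees all inputs
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# A tiny 3-input / 2-hidden / 1-output network with made-up weights.
x = [0.5, 0.1, 0.9]
hidden = layer(x, weights=[[0.2, -0.4, 0.7], [0.5, 0.3, -0.6]],
               biases=[0.1, -0.2])
output = layer(hidden, weights=[[0.8, -0.3]], biases=[0.05])
print(output)   # activation of the single output node
```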
Prediction
(Numerical) prediction is similar to classification: first construct a model, then use the model to predict a continuous or ordered value for a given input. Prediction differs from classification: classification predicts categorical class labels, while prediction models continuous-valued functions.
Prediction
The major method for prediction is regression: model the relationship between one or more independent (predictor) variables and a dependent (response) variable. Regression analysis includes linear and multiple regression and non-linear regression; other regression methods include the generalized linear model, Poisson regression, log-linear models, and regression trees.
Linear Regression
Linear regression involves a response variable y and a single predictor variable x:
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are the regression coefficients.
Method of least squares: estimates the best-fitting straight line (see the sketch below).
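The least-squares estimates have a closed form: w1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and w0 = ȳ − w1·x̄. A direct Python translation follows; the sample data are illustrative.

```python
# Simple linear regression by the method of least squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # illustrative predictor values
ys = [1.9, 4.1, 6.0, 8.2, 9.9]          # illustrative response values

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))   # slope
w0 = y_bar - w1 * x_bar                       # y-intercept
print(f"y = {w0:.3f} + {w1:.3f} x")           # fitted line
```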
Linear Regression
Multiple linear regression involves more than one predictor variable. The training data are of the form (X1, y1), (X2, y2), …, (X|D|, y|D|). For example, for 2-D data we may have:
y = w0 + w1 x1 + w2 x2
Many nonlinear functions can be transformed into the above.
Nonlinear Regression
Some nonlinear models can be modeled by a polynomial function, and a polynomial regression model can be transformed into a linear regression model. For example,
y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear form with the new variables x2 = x², x3 = x³ (see the sketch below). Other functions, such as the power function, can also be transformed to a linear model. Some models are intractably nonlinear (e.g., a sum of exponential terms); for these, it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae.
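The transformation amounts to building a design matrix whose columns are the new variables and fitting an ordinary linear model to it. A small sketch using numpy (the slides name no library, so this is an assumed choice):

```python
import numpy as np

# Fit y = w0 + w1*x + w2*x^2 + w3*x^3 as a *linear* model in (x, x^2, x^3).
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = 1.0 + 2.0 * x - 0.5 * x**3          # illustrative data, no noise

# Design matrix with columns [1, x, x^2, x^3] -- the "new variables".
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # approximately [1.0, 2.0, 0.0, -0.5]
```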
Information Gain
ID3 uses information gain as its attribute selection measure. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions.
Information Gain (cont…)
Info_A(D) = Σⱼ₌₁..ᵥ (|Dⱼ|/|D|) × Info(Dⱼ) is the expected information required to classify a tuple from D based on the partitioning by A, where Info(D) = −Σᵢ pᵢ log₂(pᵢ) is the entropy of D. Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A):
Gain(A) = Info(D) − Info_A(D)
Information Gain (cont…)
Example: the overall information for a database. For the AllElectronics training set, with 9 tuples of class buys_computer = yes and 5 tuples of class buys_computer = no:
Info(D) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940 bits
Partitioning on age (youth: 2 yes / 3 no; middle_aged: 4 yes / 0 no; senior: 3 yes / 2 no) gives
Info_age(D) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694 bits
Information Gain (cont…)
Hence, the gain in information from such a partitioning would be:
Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits
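These numbers are easy to check in code. The sketch below uses the class counts from the AllElectronics example (9 yes / 5 no overall; age partitions youth 2/3, middle_aged 4/0, senior 3/2).

```python
import math

def info(counts):
    """Expected information (entropy), in bits, for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# AllElectronics: 9 yes / 5 no overall; age partitions as (yes, no) counts.
partitions = {"youth": [2, 3], "middle_aged": [4, 0], "senior": [3, 2]}
n = 14
info_D = info([9, 5])                                        # 0.940 bits
info_age = sum(sum(c) / n * info(c) for c in partitions.values())  # 0.694
print(f"Gain(age) = {info_D - info_age:.3f} bits")           # 0.246
```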
Gain Ratio
Consider an attribute that acts as a unique identifier, such as product_ID. A split on product_ID would result in a large number of partitions (as many as there are values), each one containing just one tuple. Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_product_ID(D) = 0. Therefore, the information gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless for classification.
Gain Ratio (cont…)
SplitInfo_A(D) = −Σⱼ₌₁..ᵥ (|Dⱼ|/|D|) × log₂(|Dⱼ|/|D|)
This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as:
GainRatio(A) = Gain(A) / SplitInfo_A(D)
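Continuing the information-gain sketch above, the gain ratio needs only the partition sizes for SplitInfo. The age partition sizes 5, 4, 5 and Gain(age) = 0.246 come from the running example.

```python
import math

def split_info(sizes):
    """SplitInfo_A(D): potential information of the partition sizes alone."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

# Age splits D into partitions of sizes 5, 4, 5; Gain(age) = 0.246 (above).
print(f"GainRatio(age) = {0.246 / split_info([5, 4, 5]):.3f}")   # ~0.156
```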
Cluster Analysis
Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
Unsupervised learning: no predefined classes.
Typical applications: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.
Clustering: Rich Applications
Pattern recognition
Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters, or use clustering for other spatial mining tasks
Image processing
Economic science (especially market research)
WWW: document classification; clustering weblog data to discover groups of similar access patterns
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
End of Unit 5