Chapter 11 Statistical Techniques

Data Warehouse and Data Mining, Chapter 11

Chapter Objectives
 Understand when linear regression is an appropriate data mining technique.
 Know how to perform linear regression with Microsoft Excel’s LINEST function.
 Know that logistic regression can be used to build supervised learner models for datasets having a binary outcome.
 Understand how Bayes classifier is able to build supervised models for datasets having categorical data, numeric data, or a combination of both data types.

Chapter Objectives
 Know how agglomerative clustering is applied to partition data instances into disjoint clusters.
 Understand that conceptual clustering is an unsupervised data mining technique that builds a concept hierarchy to partition data instances.
 Know that the EM algorithm uses a statistical parameter-adjustment technique to cluster data instances.
 Understand the basic features that differentiate statistical and machine learning data mining methods.

Linear Regression Analysis

Logistic Regression

Bayes Classifier

Clustering Algorithms

Heuristics or Statistics?
Here is one way to categorize inductive problem-solving methods:
 Query and visualization techniques
 Machine learning techniques
 Statistical techniques
Query and visualization techniques generally fall into one of three groups:
 Query tools
 OLAP tools
 Visualization tools

Chapter Summary
Data mining techniques come in many shapes and forms. A favorite statistical technique for estimation and prediction problems is linear regression. Linear regression attempts to model the variation in a dependent variable as a linear combination of one or more independent variables. Linear regression is an appropriate data mining strategy when the relationship between the dependent and independent variables is nearly linear. Microsoft Excel’s LINEST function provides an easy mechanism for performing multiple linear regression.
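The kind of least-squares fit that LINEST performs can be sketched outside of Excel as well. Below is a minimal illustration with a made-up two-variable dataset, using NumPy's least-squares routine in place of LINEST; the column of ones gives the model its intercept term, and the coefficient of determination (R²) is computed from the residuals:

```python
import numpy as np

# Hypothetical dataset: two independent variables, one dependent variable.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.0])

# Append a column of ones so the fit includes an intercept term.
X1 = np.column_stack([X, np.ones(len(X))])

# Least-squares fit: minimizes ||X1 @ w - y||^2, the same criterion
# LINEST uses for multiple linear regression.
w, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ w                         # estimated dependent-variable values
ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r_squared = 1.0 - ss_res / ss_tot      # coefficient of determination

print("coefficients:", w, "R^2:", r_squared)
```

An R² close to 1 indicates the nearly linear relationship for which the chapter recommends this technique.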

Chapter Summary
Linear regression is a poor choice when the outcome is binary. The problem lies in the fact that the value restriction placed on the dependent variable is not observed by the regression equation. That is, because linear regression produces a straight-line function, values of the dependent variable are unbounded in both the positive and negative directions. For the two-outcome case, logistic regression is a better choice. Logistic regression is a nonlinear regression technique that associates a conditional probability value with each data instance.
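The bounding behavior described above comes from the logistic (sigmoid) function. The sketch below fits a one-feature logistic model to a hypothetical binary dataset by plain gradient descent (one of several possible fitting procedures; the textbook does not prescribe this one). Whatever the inputs, every output is a probability strictly between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into (0, 1), so the output can be read
    # as a conditional probability of class membership.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-feature dataset with a binary outcome.
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Fit the weight and bias by simple gradient descent on the log-loss.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    p = sigmoid(w * x + b)
    w -= lr * np.mean((p - y) * x)
    b -= lr * np.mean(p - y)

p = sigmoid(w * x + b)   # every value lies strictly between 0 and 1
print(p.round(3))
```

Compare this with a straight-line fit to the same data, whose predictions would drift below 0 and above 1 at the extremes.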

Chapter Summary
Bayes classifier offers a simple yet powerful supervised classification technique. The model assumes all input attributes to be of equal importance and independent of one another. Even though these assumptions are likely to be false, Bayes classifier still works quite well in practice. Bayes classifier can be applied to datasets containing both categorical and numeric data. Also, unlike many statistical classifiers, Bayes classifier can be applied to datasets containing a wealth of missing items.
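The independence assumption described above makes the classifier easy to sketch for categorical data. The toy weather-style dataset and attribute names below are invented for illustration, and the sketch omits smoothing for brevity (a real implementation would add a Laplace correction so unseen attribute values do not zero out a class score):

```python
from collections import Counter

# Hypothetical categorical training data: (attribute values, class label).
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"),
    (("rain", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
]

class_counts = Counter(label for _, label in train)
# attr_counts[(i, value, label)]: times attribute i took `value` within class `label`.
attr_counts = Counter()
for attrs, label in train:
    for i, v in enumerate(attrs):
        attr_counts[(i, v, label)] += 1

def classify(attrs):
    # Bayes theorem plus the independence assumption:
    # score(c) is proportional to P(c) * product over i of P(attr_i | c).
    best_label, best_score = None, -1.0
    n = len(train)
    for label, count in class_counts.items():
        score = count / n
        for i, v in enumerate(attrs):
            score *= attr_counts[(i, v, label)] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify(("rain", "mild")))
```

Missing attribute values are naturally handled by simply skipping the corresponding factor in the product, which is one reason the classifier tolerates incomplete data well.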

Chapter Summary
Agglomerative clustering is a favorite unsupervised clustering technique. Agglomerative clustering begins by assuming each data instance represents its own cluster. Each iteration of the algorithm merges the most similar pair of clusters. Several options for computing instance and cluster similarity scores and cluster merging procedures exist. Also, when the data to be clustered is real-valued, defining a measure of instance similarity can be a challenge. One common approach is to use simple Euclidean distance. A widespread application of agglomerative clustering is its use as a prelude to other clustering techniques.
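The merge loop described above can be sketched directly. The points are hypothetical, and centroid linkage with Euclidean distance is just one of the similarity/merging choices the summary mentions; stopping at k clusters (rather than merging all the way to one) stands in for the final step of choosing a best clustering:

```python
import math

# Hypothetical two-dimensional instances to cluster.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0)]

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(2))

# Each instance starts out as its own cluster.
clusters = [[p] for p in points]

# Repeatedly merge the pair of clusters with the closest centroids
# (Euclidean distance) until k clusters remain.
k = 3
while len(clusters) > k:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(clusters)
```

With these points the two tight pairs merge first, leaving the outlier (9.0, 1.0) as its own cluster.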

Chapter Summary
Conceptual clustering is an unsupervised technique that incorporates incremental learning to form a hierarchy of concepts. The concept hierarchy takes the form of a tree structure where the root node represents the highest level of concept generalization. Conceptual clustering systems are particularly appealing because the trees they form have been shown to consistently determine psychologically preferred levels in human classification hierarchies. Also, conceptual clustering systems lend themselves well to explaining their behavior. A major problem with conceptual clustering systems is that instance ordering can have a marked impact on the results of the clustering. A nonrepresentative ordering of data instances can lead to a less than optimal clustering.

Chapter Summary
The EM (expectation-maximization) algorithm is a statistical technique that makes use of the finite Gaussian mixtures model. The mixtures model assigns each individual data instance a probability that it would have a certain set of attribute values given it was a member of a specified cluster. The model assumes all attributes to be independent random variables. The EM algorithm is similar to the K-Means procedure in that a set of parameters is recomputed until a desired convergence value is achieved. A lack of explanation about what has been discovered is a problem with EM, as it is with many clustering systems. Applying a supervised model to analyze the results of an unsupervised clustering is one technique to help explain the results of an EM clustering.
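The recompute-until-convergence loop can be sketched for the simplest case: one-dimensional data and a two-component Gaussian mixture (the data values, initial guesses, and fixed iteration count are all illustrative choices, not part of the textbook's presentation). The E-step computes each instance's expected cluster memberships; the M-step re-estimates the means, variances, and mixing weights from them:

```python
import math

# Hypothetical 1-D data that visibly falls into two groups.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]

def gauss(x, mu, var):
    # Gaussian density with mean mu and variance var.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Initial guesses for a two-component mixture.
mu, var, pi = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]

for _ in range(50):
    # E-step: expected cluster memberships (responsibilities) per instance.
    resp = []
    for x in data:
        p = [pi[k] * gauss(x, mu[k], var[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: re-estimate the parameters from the responsibilities.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk, 1e-6)
        pi[k] = nk / len(data)

print(mu)   # component means settle near the two group centers
```

The soft responsibilities are what distinguish this from K-Means, which would assign each instance wholly to its nearest cluster.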

Key Terms
A priori probability. The probability a hypothesis is true lacking evidence to support or reject the hypothesis.
Agglomerative clustering. An unsupervised technique where each data instance initially represents its own cluster. Successive iterations of the algorithm merge pairs of highly similar clusters until all instances become members of a single cluster. In the last step, a decision is made about which clustering is a best final result.
Basic-level nodes. The nodes in a concept hierarchy that represent concepts easily identified by humans.

Key Terms
Bayes classifier. A supervised learning approach that classifies new instances by using Bayes theorem.
Bayes theorem. The probability of a hypothesis given some evidence is equal to the probability of the evidence given the hypothesis, times the probability of the hypothesis, divided by the probability of the evidence.
Bayesian Information Criterion (BIC). The BIC gives the posterior odds for one data mining model against another model, assuming neither model is favored initially.
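The Bayes theorem entry above, written symbolically for hypothesis H and evidence E:

```latex
P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)}
```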

Key Terms
Category utility. An unsupervised evaluation function that measures the gain in the “expected number” of correct attribute-value predictions for a specific object if it were placed within a given category or cluster.
Coefficient of determination. For a regression analysis, the correlation between actual and estimated values for the dependent variable.
Concept hierarchy. A tree structure where each node of the tree represents a concept at some level of abstraction. Nodes toward the top of the tree are the most general. Leaf nodes represent individual data instances.
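One common formulation of category utility for a single cluster C over nominal attributes A_i with values v_ij (following Gluck and Corter, and used by conceptual clustering systems such as COBWEB; the exact form in this textbook may differ):

```latex
CU(C) \;=\; P(C)\sum_{i}\sum_{j}
\Bigl[\,P(A_i = v_{ij} \mid C)^{2} \;-\; P(A_i = v_{ij})^{2}\Bigr]
```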

Key Terms
Conceptual clustering. An incremental unsupervised clustering method that creates a concept hierarchy from a set of input instances.
Conditional probability. The conditional probability of evidence E given hypothesis H, denoted by P(E | H), is the probability E is true given H is true.
Incremental learning. A form of learning that is supported in an unsupervised environment where instances are presented sequentially. As each new instance is seen, the learning model is modified to reflect the addition of the new instance.

Key Terms
Linear regression. A statistical technique that models the variation in a numeric dependent variable as a linear combination of one or several independent variables.
Logistic regression. A nonlinear regression technique for problems having a binary outcome. A created regression equation limits the values of the output attribute to values between 0 and 1. This allows output values to represent a probability of class membership.

Key Terms
Logit. The natural logarithm of the odds ratio p(y = 1 | x)/[1 − p(y = 1 | x)], where p(y = 1 | x) is the conditional probability that the outcome is 1 given feature vector x.
Mixture. A set of n probability distributions where each distribution represents a cluster.
Model tree. A decision tree where each leaf node contains a linear regression equation.
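Writing the logit entry out with p = p(y = 1 | x), and (as an illustrative assumption) letting w and b denote the fitted regression coefficients, inverting the logit recovers the bounded logistic form:

```latex
\operatorname{logit}(p) \;=\; \ln\frac{p}{1-p} \;=\; w \cdot x + b
\quad\Longleftrightarrow\quad
p \;=\; \frac{1}{1 + e^{-(w \cdot x + b)}}
```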

Key Terms
Regression. The process of developing an expression that predicts a numeric output value.
Regression tree. A decision tree where leaf nodes contain averaged numeric values.
Simple linear regression. A regression equation with a single independent variable.
Slope-intercept form. A linear equation of the form y = ax + b, where a is the slope of the line and b is the y-intercept.
