
1 Classification and Prediction

2 The Course. Data flows DS → DP → DW, which feeds OLAP and DM (Association, Classification, Clustering). DS = Data source, DP = Staging database, DW = Data warehouse, DM = Data mining.

3 Chapter Objectives Learn basic techniques for data classification and prediction. Understand the difference between: –supervised classification –prediction –unsupervised classification

4 Chapter Outline What is classification and prediction of data? How do we classify data by decision tree induction? What are neural networks and how can they classify? What is Bayesian classification? Are there other classification techniques? How do we predict continuous values?

5 What is Classification? The goal of data classification is to organize and categorize data in distinct classes. –A model is first created based on the data distribution. –The model is then used to classify new data. –Given the model, a class can be predicted for new data. Classification = prediction for discrete and nominal values

6 What is Prediction? The goal of prediction is to forecast or deduce the value of an attribute based on values of other attributes. –A model is first created based on the data distribution. –The model is then used to predict future or unknown values. In Data Mining: –If forecasting a discrete value → Classification –If forecasting a continuous value → Prediction

7 Supervised and Unsupervised Supervised Classification = Classification –We know the class labels and the number of classes Unsupervised Classification = Clustering –We do not know the class labels and may not know the number of classes

8 Preparing Data Before Classification Data transformation: –Discretization of continuous data –Normalization to [-1..1] or [0..1] Data Cleaning: –Smoothing to reduce noise Relevance Analysis: –Feature selection to eliminate irrelevant attributes

9 Applications Credit approval Target marketing Medical diagnosis Defective parts identification in manufacturing Crime zoning Treatment effectiveness analysis Etc.

10 Supervised learning process: 3 steps 1. Learning: a classification method builds a classification model from the training data (tuples with known class labels). 2. Accuracy: the model is evaluated on test data. 3. Classification: the model is applied to new data.

11 Classification is a 3-step process 1. Model construction (Learning): Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label. The set of all tuples used for construction of the model is called the training set. –The model is represented in one of the following forms: Classification rules (IF-THEN statements) Decision trees Mathematical formulae

12 1. Classification Process (Learning). The training data below, whose class label is the credit rating (good/bad), is fed to a classification method, which produces a classification model: an IF-THEN rule such as IF Income = 'High' OR Age > 30 THEN Class = 'Good', a decision tree, or a mathematical formula.

Name  | Income | Age
------|--------|---------
Samir | Low    | <30
Ahmed | Medium | [30..40]
Salah | High   | <30
Ali   | Medium | >40
Sami  | Low    | [30..40]
Emad  | Medium | <30

13 Classification is a 3-step process 2. Model Evaluation (Accuracy): –Estimate the accuracy rate of the model based on a test set. –The known label of each test sample is compared with the classified result from the model. –Accuracy rate is the percentage of test set samples that are correctly classified by the model. –The test set is independent of the training set, otherwise over-fitting will occur.

14 2. Classification Process (Accuracy Evaluation). The test data below, with known credit ratings, is classified by the model and the predicted labels are compared with the known ones. Here 3 of the 4 predictions agree, so the estimated accuracy is 75%.

Name  | Income | Age
------|--------|---------
Naser | Low    | <30
Lutfi | Medium | <30
Adel  | High   | >40
Fahd  | Medium | [30..40]

15 Classification is a three-step process 3. Model Use (Classification): –The model is used to classify unseen objects: Give a class label to a new tuple Predict the actual value of an attribute

16 3. Classification Process (Use). A new tuple, e.g. (Name = Adham, Income = Low, Age < 30), is given to the classification model, which predicts its credit rating.

17 Classification Methods Decision Tree Induction Neural Networks Bayesian Classification Association-Based Classification K-Nearest Neighbour Case-Based Reasoning Genetic Algorithms Rough Set Theory Fuzzy Sets Etc.

18 Evaluating Classification Methods Predictive accuracy –Ability of the model to correctly predict the class label Speed –Time to construct the model –Time to use the model Robustness –Handling noise and missing values Scalability –Efficiency in large databases (not memory-resident data) Interpretability –The level of understanding and insight provided by the model

19 Chapter Outline What is classification and prediction of data? How do we classify data by decision tree induction? What are neural networks and how can they classify? What is Bayesian classification? Are there other classification techniques? How do we predict continuous values?

20 Decision Tree

21 What is a Decision Tree? A decision tree is a flow-chart-like tree structure. –Internal node denotes a test on an attribute –Branch represents an outcome of the test All tuples in branch have the same value for the tested attribute. Leaf node represents class label or class label distribution

22 Sample Decision Tree. A scatter of customers by Income (2000–10000) and Age (20–80) can be separated by a single test: Income < 6K → No (fair customers); Income >= 6K → YES (excellent customers).

23 Sample Decision Tree. On the same data a deeper tree splits first on Income (<6k → NO), then on Age within Income >= 6k (<50 → Yes, >=50 → NO).

24 Sample Decision Tree. The weather data:

Outlook  | Temp | Humidity | Windy | Play?
---------|------|----------|-------|------
sunny    | hot  | high     | FALSE | No
sunny    | hot  | high     | TRUE  | No
overcast | hot  | high     | FALSE | Yes
rainy    | mild | high     | FALSE | Yes
rainy    | cool | normal   | FALSE | Yes
rainy    | cool | normal   | TRUE  | No
overcast | cool | normal   | TRUE  | Yes
sunny    | mild | high     | FALSE | No
sunny    | cool | normal   | FALSE | Yes
rainy    | mild | normal   | FALSE | Yes
sunny    | mild | normal   | TRUE  | Yes
overcast | mild | high     | TRUE  | Yes
overcast | hot  | normal   | FALSE | Yes
rainy    | mild | high     | TRUE  | No

http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/

25 Decision-Tree Classification Methods The basic top-down decision tree generation approach usually consists of two phases: 1. Tree construction At the start, all the training examples are at the root. Examples are partitioned recursively based on selected attributes. 2. Tree pruning Aims at removing tree branches that may reflect noise in the training data and lead to errors when classifying test data → improves classification accuracy

26 How to Specify Test Condition? Depends on attribute types –Nominal –Ordinal –Continuous Depends on number of ways to split –2-way split –Multi-way split

27 Splitting Based on Nominal Attributes Multi-way split: use as many partitions as distinct values, e.g. CarType → {Family | Sports | Luxury}. Binary split: divide the values into two subsets and find the optimal partitioning, e.g. CarType → {Sports, Luxury} vs {Family}, or CarType → {Family, Luxury} vs {Sports}.

28 Splitting Based on Ordinal Attributes Multi-way split: use as many partitions as distinct values, e.g. Size → {Small | Medium | Large}. Binary split: divide the values into two subsets that preserve the order, e.g. Size → {Small, Medium} vs {Large}, or Size → {Medium, Large} vs {Small}. What about the split Size → {Small, Large} vs {Medium}? It violates the ordering.

29 Splitting Based on Continuous Attributes Different ways of handling –Discretization to form an ordinal categorical attribute Static – discretize once at the beginning Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering –Binary decision: (A < v) or (A >= v) considers all possible splits and finds the best cut can be more compute-intensive

30 Splitting Based on Continuous Attributes

31 Tree Induction Greedy strategy –Split the records based on an attribute test that optimizes a certain criterion Issues –Determine how to split the records How to specify the attribute test condition? How to determine the best split? –Determine when to stop splitting

32 How to determine the Best Split. Candidate tests such as Income (>=10k vs <10k) or Age (young vs old) partition the customers differently; we prefer the split that best separates good customers from fair customers.

33 How to determine the Best Split Greedy approach: –Nodes with homogeneous class distribution are preferred Need a measure of node impurity: –50% red / 50% green: high degree of impurity –75% red / 25% green: lower degree of impurity –100% red / 0% green: pure

34 Measures of Node Impurity Information gain –Uses Entropy Gain Ratio –Uses Information Gain and Splitinfo Gini Index –Used only for binary splits

35 Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) –Tree is constructed in a top-down recursive divide-and-conquer manner –At start, all the training examples are at the root –Attributes are categorical (if continuous-valued, they are discretized in advance) –Examples are partitioned recursively based on selected attributes –Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning –All samples for a given node belong to the same class –There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf –There are no samples left

36 Classification Algorithms ID3 –Uses information gain C4.5 –Uses Gain Ratio CART –Uses Gini

37 Entropy: Used by ID3. Entropy measures the impurity of a set of examples S, where p is the proportion of positive examples and q is the proportion of negative examples: Entropy(S) = -p log2(p) - q log2(q)

38 ID3 For the weather data, p_yes = 9/14 and p_no = 5/14, so Impurity = -p_yes log2(p_yes) - p_no log2(p_no) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94 bits

39 ID3 At the root the impurity is 0.94 bits (the amount of information required to specify the class of an example, given that it reaches the node). Each candidate attribute is scored by the weighted entropy of its branches: –outlook (sunny/overcast/rainy): 0.97 x 5/14 + 0.0 x 4/14 + 0.97 x 5/14 = 0.69 bits → gain: 0.25 bits (maximal information gain) –temperature (hot/mild/cool): 1.0 x 4/14 + 0.92 x 6/14 + 0.81 x 4/14 = 0.91 bits → gain: 0.03 bits –humidity (high/normal): 0.98 x 7/14 + 0.59 x 7/14 = 0.79 bits → gain: 0.15 bits –windy (false/true): 0.81 x 8/14 + 1.0 x 6/14 = 0.89 bits → gain: 0.05 bits
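
The arithmetic on slides 38-39 is easy to check in code. Below is a minimal Python sketch (helper names are mine, not from the slides) that reproduces the 0.94-bit root entropy and the 0.25-bit gain for outlook:

    from collections import Counter
    from math import log2

    def entropy(labels):
        # Impurity of a set of class labels, in bits (slide 37)
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(values, labels):
        # Entropy reduction from splitting on one attribute's values
        n = len(labels)
        branches = {}
        for v, y in zip(values, labels):
            branches.setdefault(v, []).append(y)
        remainder = sum(len(b) / n * entropy(b) for b in branches.values())
        return entropy(labels) - remainder

    # outlook and play columns of the weather data (slide 24)
    outlook = ("sunny sunny overcast rainy rainy rainy overcast "
               "sunny sunny rainy sunny overcast overcast rainy").split()
    play = "no no yes yes yes no yes no yes yes yes yes yes no".split()

    print(round(entropy(play), 2))              # 0.94 bits
    print(round(info_gain(outlook, play), 2))   # 0.25 bits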

40 ID3 In the outlook = sunny branch (5 examples, 0.97 bits), the remaining attributes score: –humidity (high/normal): 0.0 x 3/5 + 0.0 x 2/5 = 0.0 bits → gain: 0.97 bits (maximal information gain) –temperature (hot/mild/cool): 0.0 x 2/5 + 1.0 x 2/5 + 0.0 x 1/5 = 0.40 bits → gain: 0.57 bits –windy (false/true): 0.92 x 3/5 + 1.0 x 2/5 = 0.95 bits → gain: 0.02 bits

41 ID3 In the outlook = rainy branch (5 examples, 0.97 bits): –humidity (high/normal): 1.0 x 2/5 + 0.92 x 3/5 = 0.95 bits → gain: 0.02 bits –temperature (mild/cool): 0.92 x 3/5 + 1.0 x 2/5 = 0.95 bits → gain: 0.02 bits –windy (false/true): 0.0 x 3/5 + 0.0 x 2/5 = 0.0 bits → gain: 0.97 bits (maximal information gain)

42 ID3 The resulting tree: outlook = sunny → humidity (high → No, normal → Yes); outlook = overcast → Yes; outlook = rainy → windy (false → Yes, true → No).

43 C4.5 The information gain measure is biased towards attributes with a large number of values C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (normalization of information gain) –SplitInfo_A(D) = -sum_j (|D_j|/|D|) log2(|D_j|/|D|) –GainRatio(A) = Gain(A)/SplitInfo(A) Ex. –gain_ratio(income) = 0.029/0.926 = 0.031 The attribute with the maximum gain ratio is selected as the splitting attribute

44 CART If a data set D contains examples from n classes, the gini index gini(D) is defined as gini(D) = 1 - sum_j (p_j)^2, where p_j is the relative frequency of class j in D If D is split on attribute A into two subsets D1 and D2, the gini index of the split is defined as gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2) Reduction in impurity: Δgini(A) = gini(D) - gini_A(D) The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

45 CART Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no", so gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}, giving gini_income in {low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2); but gini for the {medium, high} split is 0.30 and is thus the best since it is the lowest All attributes are assumed continuous-valued May need other tools, e.g., clustering, to get the possible split values Can be modified for categorical attributes
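
A small Python sketch of these formulas (function names are illustrative; the D1/D2 class counts in the second call are hypothetical, not from the slide):

    def gini(counts):
        # gini(D) = 1 - sum_j p_j^2, computed from the class counts of D
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(partitions):
        # Weighted gini of a split; each partition is a list of class counts
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * gini(p) for p in partitions)

    print(round(gini([9, 5]), 3))                  # 0.459, gini(D) above
    print(round(gini_split([[7, 3], [2, 2]]), 3))  # split gini for hypothetical counts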

46 Comparing Attribute Selection Measures The three measures, in general, return good results but –Information gain: biased towards multivalued attributes –Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others –Gini index: biased to multivalued attributes has difficulty when # of classes is large tends to favor tests that result in equal-sized partitions and purity in both partitions

47 Other Attribute Selection Measures CHAID: a popular decision tree algorithm; measure based on the χ² test for independence C-SEP: performs better than information gain and gini index in certain cases G-statistic: has a close approximation to the χ² distribution MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): –The best tree is the one that requires the fewest number of bits to both (1) encode the tree, and (2) encode the exceptions to the tree Multivariate splits (partition based on multiple variable combinations) –CART: finds multivariate splits based on a linear combination of attributes Which attribute selection measure is the best? –Most give good results; none is significantly superior to the others

48 Underfitting and Overfitting Underfitting: when the model is too simple, both training and test errors are large. Overfitting: when the model is too complex, training error keeps shrinking while test error grows.

49 Overfitting due to Noise Decision boundary is distorted by noise point

50 Underfitting due to Insufficient Examples Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region –An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

51 Two approaches to avoid Overfitting Prepruning : –Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold –Difficult to choose an appropriate threshold Postpruning : –Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees –Use a set of data different from the training data to decide which is the “best pruned tree”

52 Scalable Decision Tree Induction Methods ID3, C4.5, and CART are not efficient when the training set doesn't fit the available memory. Instead the following algorithms are used: –SLIQ Builds an index for each attribute; only the class list and the current attribute list reside in memory –SPRINT Constructs an attribute list data structure –RainForest Builds an AVC-list (attribute, value, class label) –BOAT Uses bootstrapping to create several small samples

53 BOAT BOAT (Bootstrapped Optimistic Algorithm for Tree Construction) –Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory –Each subset is used to create a tree, resulting in several trees –These trees are examined and used to construct a new tree T' It turns out that T' is very close to the tree that would be generated using the whole data set together –Adv: requires only two scans of the DB; an incremental algorithm

54 Why decision tree induction in data mining? Relatively faster learning speed (than other classification methods) Convertible to simple and easy to understand classification rules Comparable classification accuracy with other methods

55 Converting Tree to Rules. For the tree outlook = sunny → humidity; overcast → Yes; rain → wind:
R1: IF (Outlook=Sunny) AND (Humidity=High) THEN Play=No
R2: IF (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
R3: IF (Outlook=Overcast) THEN Play=Yes
R4: IF (Outlook=Rain) AND (Wind=Strong) THEN Play=No
R5: IF (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes

56 Decision trees: The Weka tool

@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

http://www.cs.waikato.ac.nz/ml/weka/

57 Bayesian Classifier Thomas Bayes (1702-1761)

58 Basic Statistics. Assume D = all students (|D| = 100), X = ICS students (|X| = 10), C = SWE students (|C| = 20), and |X ∩ C| = 4 (so 6 students are in X only, 16 in C only, and 74 in neither). Then P(X) = 10/100, P(C) = 20/100, P(X,C) = 4/100. Since P(X,C) = P(C|X)·P(X) = P(X|C)·P(C), we get P(X|C) = P(X,C)/P(C) = 4/20 and P(C|X) = P(X,C)/P(X) = 4/10.

59 Bayesian Classifier – Basic Equation. From P(X,C) = P(C|X)·P(X) = P(X|C)·P(C) follows P(C|X) = P(X|C)·P(C) / P(X), where P(C|X) is the class posterior probability, P(C) the class prior probability, P(X|C) the descriptor posterior probability, and P(X) the descriptor prior probability.

60 Naive Bayesian Classifier. Independence assumption about descriptors: P(X|C) = P(x_1|C) x P(x_2|C) x ... x P(x_n|C), so P(C|X) is proportional to P(C) x prod_i P(x_i|C).

61 Training Data. The weather data again; the class label Play? has 9 yes and 5 no, giving priors P(yes) = 9/14 and P(no) = 5/14.

Outlook  | Temp | Humidity | Windy | Play?
---------|------|----------|-------|------
sunny    | hot  | high     | FALSE | No
sunny    | hot  | high     | TRUE  | No
overcast | hot  | high     | FALSE | Yes
rainy    | mild | high     | FALSE | Yes
rainy    | cool | normal   | FALSE | Yes
rainy    | cool | normal   | TRUE  | No
overcast | cool | normal   | TRUE  | Yes
sunny    | mild | high     | FALSE | No
sunny    | cool | normal   | FALSE | Yes
rainy    | mild | normal   | FALSE | Yes
sunny    | mild | normal   | TRUE  | Yes
overcast | mild | high     | TRUE  | Yes
overcast | hot  | normal   | FALSE | Yes
rainy    | mild | high     | TRUE  | No

62 Bayesian Classifier – Probabilities for the weather data. Frequency tables (counts) and likelihood tables (relative frequencies):

Outlook  | No | Yes |  | No  | Yes
Sunny    | 3  | 2   |  | 3/5 | 2/9
Overcast | 0  | 4   |  | 0/5 | 4/9
Rainy    | 2  | 3   |  | 2/5 | 3/9

Temp. | No | Yes |  | No  | Yes
Hot   | 2  | 2   |  | 2/5 | 2/9
Mild  | 2  | 4   |  | 2/5 | 4/9
Cool  | 1  | 3   |  | 1/5 | 3/9

Humidity | No | Yes |  | No  | Yes
High     | 4  | 3   |  | 4/5 | 3/9
Normal   | 1  | 6   |  | 1/5 | 6/9

Windy | No | Yes |  | No  | Yes
False | 2  | 6   |  | 2/5 | 6/9
True  | 3  | 3   |  | 3/5 | 3/9

63 Bayesian Classifier – Predicting a new day X: Outlook = sunny, Temp. = cool, Humidity = high, Windy = true, Play = ? P(yes|X) = p(sunny|yes) x p(cool|yes) x p(high|yes) x p(true|yes) x p(yes) = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053 => 0.0053/(0.0053+0.0206) = 0.205 P(no|X) = p(sunny|no) x p(cool|no) x p(high|no) x p(true|no) x p(no) = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206 => 0.0206/(0.0053+0.0206) = 0.795 Since P(no|X) > P(yes|X), the prediction is Play = no.
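
A compact Python sketch (helper names are mine, not from the slides) that reproduces these numbers from the weather data:

    from collections import Counter, defaultdict

    # Weather data from slide 61: (outlook, temp, humidity, windy, play)
    data = [
        ("sunny","hot","high","FALSE","no"), ("sunny","hot","high","TRUE","no"),
        ("overcast","hot","high","FALSE","yes"), ("rainy","mild","high","FALSE","yes"),
        ("rainy","cool","normal","FALSE","yes"), ("rainy","cool","normal","TRUE","no"),
        ("overcast","cool","normal","TRUE","yes"), ("sunny","mild","high","FALSE","no"),
        ("sunny","cool","normal","FALSE","yes"), ("rainy","mild","normal","FALSE","yes"),
        ("sunny","mild","normal","TRUE","yes"), ("overcast","mild","high","TRUE","yes"),
        ("overcast","hot","normal","FALSE","yes"), ("rainy","mild","high","TRUE","no"),
    ]

    priors = Counter(row[-1] for row in data)        # yes: 9, no: 5
    counts = defaultdict(Counter)                    # (attribute index, class) -> value counts
    for row in data:
        for i, v in enumerate(row[:-1]):
            counts[(i, row[-1])][v] += 1

    def score(x, c):
        # Unnormalized P(c|x) = P(c) * prod_i P(x_i|c)
        p = priors[c] / len(data)
        for i, v in enumerate(x):
            p *= counts[(i, c)][v] / priors[c]
        return p

    x = ("sunny", "cool", "high", "TRUE")
    s = {c: score(x, c) for c in priors}
    print({c: round(v, 4) for c, v in s.items()})                    # yes: 0.0053, no: 0.0206
    print({c: round(v / sum(s.values()), 3) for c, v in s.items()})  # yes: 0.205, no: 0.795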

64 Bayesian Classifier – zero-frequency problem. What if a descriptor value doesn't occur with every class value, e.g. P(outlook=overcast|No) = 0? Remedy: add 1 to the count for every descriptor-class combination (Laplace estimator):

Outlook  | No  | Yes
Sunny    | 3+1 | 2+1
Overcast | 0+1 | 4+1
Rainy    | 2+1 | 3+1

Temp. | No  | Yes
Hot   | 2+1 | 2+1
Mild  | 2+1 | 4+1
Cool  | 1+1 | 3+1

Humidity | No  | Yes
High     | 4+1 | 3+1
Normal   | 1+1 | 6+1

Windy | No  | Yes
False | 2+1 | 6+1
True  | 3+1 | 3+1

65 Bayesian Classifier – General Equation. For a categorical descriptor, the likelihood P(x_i|C) is estimated from frequency tables as above. For a continuous variable, the likelihood is usually modeled with a Gaussian: P(x_i|C) = (1/sqrt(2 pi sigma^2)) exp(-(x_i - mu)^2 / (2 sigma^2)), with mu and sigma estimated per class.

66 Bayesian Classifier – Dealing with numeric attributes
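
The slide's figure did not survive the transcript; below is a minimal Python sketch of the Gaussian treatment from slide 65, with hypothetical numeric temperature values standing in for the hot/mild/cool column:

    from math import exp, pi, sqrt

    def gaussian_likelihood(x, mu, sigma):
        # P(x | class) under a normal model with per-class mean mu and std sigma
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    # Hypothetical temperatures of the 9 "yes" days, measured in degrees
    temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
    mu = sum(temps_yes) / len(temps_yes)
    sigma = sqrt(sum((t - mu) ** 2 for t in temps_yes) / (len(temps_yes) - 1))
    print(round(gaussian_likelihood(66, mu, sigma), 4))  # P(temperature = 66 | yes)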


68 Naïve Bayesian Classifier: Comments Advantages –Easy to implement –Good results obtained in most of the cases Disadvantages –Assumption: class conditional independence, therefore loss of accuracy –Practically, dependencies exist among variables E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.) Dependencies among these cannot be modeled by a Naïve Bayesian Classifier How to deal with these dependencies? –Bayesian Belief Networks

69 Bayesian Belief Networks A Bayesian belief network allows a subset of the variables to be conditionally independent A graphical model of causal relationships –Represents dependency among the variables –Gives a specification of the joint probability distribution Nodes are random variables and links express dependency; e.g., if X and Y are the parents of Z, and Y is the parent of P, there is no direct dependency between Z and P. The graph has no loops or cycles.

70 Bayesian Belief Network: An Example. The network relates Family History (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea; Family History and Smoker are the parents of LungCancer. The conditional probability table (CPT) for LungCancer shows the conditional probability for each possible combination of values of its parents:

    | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
LC  | 0.8     | 0.5      | 0.7      | 0.1
~LC | 0.2     | 0.5      | 0.3      | 0.9

Derivation of the probability of a particular combination of values x_1, ..., x_n from the CPT: P(x_1, ..., x_n) = prod_i P(x_i | Parents(X_i)).

71 Training Bayesian Networks Several scenarios: –Given both the network structure and all variables observable: learn only the CPTs –Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning –Network structure unknown, all variables observable: search through the model space to reconstruct network topology –Unknown structure, all hidden variables: No good algorithms known for this purpose.

72 Support Vector Machines

73 SABIC contact: Mohammed S. Al-Shahrani – shahranims@sabic.com

74 Support Vector Machines Find a linear hyperplane (decision boundary) that will separate the data

75 Support Vector Machines One Possible Solution

76 Support Vector Machines Another possible solution

77 Support Vector Machines Other possible solutions

78 Support Vector Machines Which one is better? B1 or B2? How do you define better?

79 Support Vector Machines Find the hyperplane that maximizes the margin => B1 is better than B2

80 Support Vectors

81 Support Vector Machines Support Vectors

82 Support Vector Machines

83 Finding the Decision Boundary Let {x_1, ..., x_n} be our data set and let y_i in {1, -1} be the class label of x_i The decision boundary w · x + b = 0 should classify all points correctly: y_i (w · x_i + b) >= 1 for all i The decision boundary can be found by solving the following constrained optimization problem: minimize ||w||^2 / 2 subject to y_i (w · x_i + b) >= 1 This is a constrained optimization problem. Solving it is beyond our course

84 Support Vector Machines We want to maximize the margin 2/||w||: –Which is equivalent to minimizing ||w||^2 / 2 –But subject to the constraints y_i (w · x_i + b) >= 1 This is a constrained optimization problem –Numerical approaches to solve it (e.g., quadratic programming)

85 Classifying new Tuples The decision boundary is determined only by the support vectors Let t_j (j = 1, ..., s) be the indices of the s support vectors. For testing with a new data point z: –Compute f(z) = sum_j a_{t_j} y_{t_j} (x_{t_j} · z) + b (where the a are the multipliers found during training) and classify z as class 1 if the sum is positive, and class 2 otherwise

86 Support Vector Machines Support Vectors

87 Support Vector Machines What if the training set is not linearly separable? Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.

88 Support Vector Machines What if the problem is not linearly separable? –Introduce slack variables ξ_i >= 0 Need to minimize: ||w||^2 / 2 + C sum_i ξ_i Subject to: y_i (w · x_i + b) >= 1 - ξ_i

89 Nonlinear Support Vector Machines What if decision boundary is not linear?

90 Non-linear SVMs Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space, e.g. from x to (x, x^2)?

91 Non-linear SVMs: Feature spaces General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
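
As a hedged illustration (assuming scikit-learn and NumPy are available; the one-dimensional toy data is invented), an RBF-kernel SVC performs such a mapping Φ implicitly:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Toy data that is not linearly separable in 1-D: the class depends on |x|
    X = rng.uniform(-3, 3, size=(200, 1))
    y = (np.abs(X[:, 0]) > 1.5).astype(int)

    # A linear boundary cannot separate these classes; the RBF kernel
    # corresponds to an implicit mapping phi(x) into a higher-dimensional
    # feature space where they become separable.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
    print(clf.score(X, y))            # close to 1.0
    print(len(clf.support_vectors_))  # the boundary is defined by the support vectors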

92 Prediction: Linear Regression

93 What Is Prediction? (Numerical) prediction is similar to classification –construct a model –use model to predict continuous or ordered value for a given input Prediction is different from classification –Classification refers to predict categorical class label –Prediction models continuous-valued functions Major method for prediction: regression –model the relationship between one or more predictor variables and a response variable

94 Prediction. The training data relates a predictor attribute (X) to a response attribute (Y).

95 Types of Correlation: positive correlation, negative correlation, no correlation.

96 Regression Analysis Simple linear regression Multiple regression Non-linear regression Other regression methods: –generalized linear model –Poisson regression –log-linear models –regression trees

97 Simple Linear Regression describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis.

98–100 Simple Linear Regression (figures only: a line fitted to the (X, Y) scatter; ε marks a point's vertical deviation from the line).

101 Simple Linear Regression Fitting data to a linear model: y_i = b_0 + b_1 x_i + ε_i, with intercept b_0, slope b_1, and residuals ε_i.

102 Simple Linear Regression How to fit data to a linear model? The least squares method.

103 Least Squares Regression Model line: ŷ = b_0 + b_1 x Residual: ε_i = y_i - ŷ_i Sum of squares of residuals: SSE = sum_i (y_i - ŷ_i)^2 We must find values of b_0 and b_1 that minimise SSE.

104 Linear Regression A model line y = w_0 + w_1 x acquired by using the method of least squares to estimate the best-fitting straight line has: w_1 = sum_i (x_i - x̄)(y_i - ȳ) / sum_i (x_i - x̄)^2 and w_0 = ȳ - w_1 x̄
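
A tiny Python sketch of these estimators, on invented data lying near y = 2 + 3x:

    def least_squares(xs, ys):
        # Closed-form w0, w1 for the line y = w0 + w1*x (slide 104)
        n = len(xs)
        x_bar, y_bar = sum(xs) / n, sum(ys) / n
        w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
              / sum((x - x_bar) ** 2 for x in xs))
        w0 = y_bar - w1 * x_bar
        return w0, w1

    xs = [1, 2, 3, 4, 5]
    ys = [5.1, 7.9, 11.2, 13.8, 17.0]
    w0, w1 = least_squares(xs, ys)
    print(round(w0, 2), round(w1, 2))  # roughly 2 and 3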

105 Multiple Linear Regression Multiple linear regression involves more than one predictor variable The linear model with a single predictor variable X can easily be extended to two or more predictor variables –Solvable by an extension of the least squares method or using software such as SAS or S-Plus

106 Nonlinear Regression Some nonlinear models can be modeled by a polynomial function A polynomial regression model can be transformed into a linear regression model. For example, y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 is convertible to linear form with the new variables x_2 = x^2, x_3 = x^3: y = w_0 + w_1 x + w_2 x_2 + w_3 x_3 Other functions, such as the power function, can also be transformed to a linear model Some models are intractably nonlinear –it is still possible to obtain least squares estimates through extensive calculation on more complex formulae

107 Artificial Neural Networks (ANN)

108 What is an ANN? An ANN is a data structure that supposedly simulates the behavior of neurons in a biological brain. An ANN is composed of layers of interconnected units. Messages are passed along the connections from one unit to another. Messages can change based on the weight of the connection and the value in the node.

109 General Structure of ANN. A single unit: inputs x_0, x_1, ..., x_n with weights w_0, w_1, ..., w_n are combined in a weighted sum, which is passed through an activation function f to produce the output.

110 ANN Output Y is 1 if at least two of the three inputs are equal to 1.

111 ANN

112 Artificial Neural Networks The model is an assembly of inter-connected nodes and weighted links The output node sums up its input values according to the weights of its links The sum is compared against some threshold t Perceptron model: Y = 1 if sum_i w_i x_i > t, else Y = 0 (equivalently, Y = sign(sum_i w_i x_i - t))
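
A minimal Python sketch of this model; the weights and threshold below are my own illustrative choice (not given in the transcript) and realize slide 110's rule that Y is 1 when at least two of the three inputs are 1:

    def perceptron(x, w, t):
        # Perceptron model from slide 112: Y = 1 if sum_i w_i * x_i > t, else 0
        return int(sum(wi * xi for wi, xi in zip(w, x)) > t)

    # Illustrative weights/threshold realizing "at least two of three inputs are 1"
    w, t = (0.3, 0.3, 0.3), 0.4
    for x in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
        print(x, perceptron(x, w, t))  # 0, 0, 1, 1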

113 Neural Networks Advantages –prediction accuracy is generally high. –robust, works when training examples contain errors. –output may be discrete, real-valued, or a vector of several discrete or real-valued attributes. –fast evaluation of the learned target function. Criticism –long training time. –difficult to understand the learned function (weights). –not easy to incorporate domain knowledge.

114 Learning Algorithms Back propagation for classification Kohonen feature maps for clustering Recurrent back propagation for classification Radial basis function for classification Adaptive resonance theory Probabilistic neural networks

115 Major Steps for Back Propagation Network Constructing a network –input data representation –selection of number of layers, number of nodes in each layer. Training the network using training data Pruning the network Interpret the results

116 A Multi-Layer Feed-Forward Neural Network (figure: layers of units connected by weights w_ij).

117 How Does a Multi-Layer Neural Network Work? The inputs to the network correspond to the attributes measured for each training tuple. Inputs are fed simultaneously into the units making up the input layer. They are then weighted and fed simultaneously to a hidden layer. The number of hidden layers is arbitrary, although usually only one is used. The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction. The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer. From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.

118 Defining a Network Topology First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer Normalize the input values for each attribute measured in the training tuples to [0.0–1.0] One input unit per domain value For classification with more than two classes, one output unit per class is used If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights

119 Backpropagation Iteratively process a set of training tuples & compare the network's prediction with the actual known target value For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the first hidden layer, hence “backpropagation” Steps –Initialize weights (to small random #s) and biases in the network –Propagate the inputs forward (by applying activation function) –Backpropagate the error (by updating weights and biases) –Terminating condition (when error is very small, etc.)
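
The four steps above can be seen in a compact sketch for one hidden layer, using NumPy (assumed available); layer sizes, learning rate, and data are illustrative, not from the slides:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.random((8, 3))                        # 8 training tuples, 3 attributes in [0, 1]
    y = (X.sum(axis=1) > 1.5).astype(float).reshape(-1, 1)

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Initialize weights to small random numbers (biases omitted for brevity)
    W1 = rng.normal(0, 0.5, (3, 4))               # input -> hidden
    W2 = rng.normal(0, 0.5, (4, 1))               # hidden -> output
    lr = 0.5

    for epoch in range(2000):
        # Propagate the inputs forward (apply the activation function)
        h = sigmoid(X @ W1)
        out = sigmoid(h @ W2)
        # Backpropagate the error (update weights, output layer first)
        err_out = (y - out) * out * (1 - out)     # delta at the output units
        err_h = (err_out @ W2.T) * h * (1 - h)    # delta at the hidden units
        W2 += lr * h.T @ err_out
        W1 += lr * X.T @ err_h

    print(np.round(out.ravel(), 2))  # should approach the targets in y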

120 Backpropagation (figure: the network's generated value is compared with the correct value, and the error is propagated backwards through the layers).

121 Network Pruning A fully connected network will be hard to articulate: n input nodes, h hidden nodes, and m output nodes lead to h(m+n) links (weights) Pruning: remove some of the links without affecting the classification accuracy of the network.

122 Other Classification Methods Associative classification: association rule based (condSet → class) Genetic algorithms: an initial population of encoded rules is changed by mutation and cross-over, based on survival of the most accurate rules (survival of the fittest) K-nearest neighbor classifier: learning by analogy Case-based reasoning: similarity with other cases Rough set theory: approximation to equivalence classes Fuzzy sets: based on fuzzy logic (truth values between 0..1)

123 Lazy Learners

124 Lazy vs. Eager Learning Lazy vs. eager learning –Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple –Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify Lazy: less time in training but more time in predicting

125 Lazy Learner: Instance-Based Methods Instance-based learning: –Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified Typical approaches –k-nearest neighbor approach Instances represented as points in a Euclidean space –Case-based reasoning Uses symbolic representations and knowledge-based inference

126 Nearest Neighbor Classifiers Basic idea: –If it walks like a duck and quacks like a duck, then it's probably a duck To classify a test record: compute its distance to the training records and choose the k "nearest" ones

127 Instance-Based Classifiers Store the training records Use training records to predict the class label of unseen cases

128 Definition of Nearest Neighbor The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

129 The k-Nearest Neighbor Algorithm All instances correspond to points in the n-D space The nearest neighbors are defined in terms of Euclidean distance, dist(X_1, X_2) The target function could be discrete- or real-valued For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to x_q Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

130 Nearest-Neighbor Classifiers Requires three things: –The set of stored records –A distance metric to compute the distance between records –The value of k, the number of nearest neighbors to retrieve To classify an unknown record: –Compute its distance to the training records –Identify the k nearest neighbors –Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

131 Nearest Neighbor Classification Compute the distance between two points: –Euclidean distance: d(p, q) = sqrt(sum_i (p_i - q_i)^2) Determine the class from the nearest neighbor list –take the majority vote of class labels among the k-nearest neighbors –Weigh the vote according to distance, e.g. weight factor w = 1/d^2
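
A short, self-contained Python sketch of this procedure (toy 2-D points invented for illustration):

    from collections import Counter
    from math import dist  # Euclidean distance, Python 3.8+

    def knn_predict(train, query, k=3):
        # Majority vote among the k training points nearest to `query`;
        # `train` is a list of ((x, y), label) pairs
        neighbors = sorted(train, key=lambda p: dist(p[0], query))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    train = [((1, 1), "duck"), ((1, 2), "duck"), ((2, 1), "duck"),
             ((8, 8), "goose"), ((8, 9), "goose"), ((9, 8), "goose")]
    print(knn_predict(train, (2, 2), k=3))  # duck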

132 Nearest Neighbor Classification… Scaling issues –Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes –Example: height of a person may vary from 1.5m to 1.8m weight of a person may vary from 90lb to 300lb income of a person may vary from $10K to $1M

133 Nearest Neighbor Classification… Choosing the value of k: –If k is too small, sensitive to noise points –If k is too large, neighborhood may include points from other classes

134 Metrics for Performance Evaluation Focus on the predictive capability of a model –Rather than how fast it classifies or builds models, scalability, etc. Confusion matrix:

               | PREDICTED Class=Yes | PREDICTED Class=No
ACTUAL Class=Yes | a (TP)            | b (FN)
ACTUAL Class=No  | c (FP)            | d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

135 Metrics for Performance Evaluation… Most widely-used metric: Accuracy = (a + d)/(a + b + c + d) = (TP + TN)/(TP + TN + FP + FN) Error rate = 1 - Accuracy

136 Limitation of Accuracy Consider a 2-class problem –Number of Class 0 examples = 9990 –Number of Class 1 examples = 10 If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % –Accuracy is misleading because model does not detect any class 1 example

137 Alternative Classifier Accuracy Measures accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg) –sensitivity = tp/pos /* true positive recognition rate */ –specificity = tn/neg /* true negative recognition rate */ precision = tp/(tp + fp)
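
These measures are one-liners from the confusion-matrix cells of slide 134; a sketch with hypothetical counts echoing slide 136's class imbalance:

    def metrics(tp, fn, fp, tn):
        # Accuracy, sensitivity, specificity, precision from confusion-matrix counts
        pos, neg = tp + fn, fp + tn
        return {
            "accuracy": (tp + tn) / (pos + neg),
            "sensitivity": tp / pos,      # true positive recognition rate
            "specificity": tn / neg,      # true negative recognition rate
            "precision": tp / (tp + fp),
        }

    # Hypothetical counts: 10 positives and 9990 negatives
    print(metrics(tp=6, fn=4, fp=10, tn=9980))  # accuracy ~0.999, sensitivity only 0.6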

138 Predictor Error Measures Test error (generalization error): the average loss over the test set –Mean absolute error: sum_i |y_i - ŷ_i| / d –Mean squared error: sum_i (y_i - ŷ_i)^2 / d –Relative absolute error: sum_i |y_i - ŷ_i| / sum_i |y_i - ȳ| –Relative squared error: sum_i (y_i - ŷ_i)^2 / sum_i (y_i - ȳ)^2 The mean squared error exaggerates the presence of outliers Popularly used: the (square) root mean squared error and, similarly, the root relative squared error

139 Evaluating Accuracy Holdout method –Given data is randomly partitioned into two independent sets: Training set (e.g., 2/3) for model construction Test set (e.g., 1/3) for accuracy estimation –Random sampling: a variation of holdout Repeat holdout k times; accuracy = avg. of the accuracies obtained Cross-validation (k-fold, where k = 10 is most popular) –Randomly partition the data into k mutually exclusive subsets D_1, ..., D_k, each of approximately equal size –At the i-th iteration, use D_i as the test set and the others as the training set

140 Evaluating Accuracy Bootstrap –Works well with small data sets –Samples the given training tuples uniformly with replacement There are several bootstrap methods; a common one is the .632 bootstrap –Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^-1 = 0.368) –Repeat the sampling procedure k times; the overall accuracy of the model is Acc(M) = (1/k) sum_i (0.632 x Acc(M_i)_test + 0.368 x Acc(M_i)_train)

141 Ensemble Methods Construct a set of classifiers from the training data Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers –Use a combination of models to increase accuracy –Combine a series of k learned models, M_1, M_2, ..., M_k, with the aim of creating an improved model M* Popular ensemble methods –Bagging: averaging the prediction over a collection of classifiers –Boosting: weighted vote with a collection of classifiers

142 General Idea

143 Bagging: Bootstrap Aggregation Analogy: diagnosis based on multiple doctors' majority vote Training –Given a set D of d tuples, at each iteration i, a training set D_i of d tuples is sampled with replacement from D (i.e., a bootstrap sample) –A classifier model M_i is learned for each training set D_i Classification: classify an unknown sample X –Each classifier M_i returns its class prediction –The bagged classifier M* counts the votes and assigns the class with the most votes to X Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
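
A minimal Python sketch of this train-and-vote loop, with 1-nearest-neighbor as an arbitrary base learner and invented toy data:

    import random
    from collections import Counter
    from math import dist

    def one_nn(sample):
        # Base learner: a 1-nearest-neighbor classifier trained on one bootstrap sample
        def predict(x):
            return min(sample, key=lambda p: dist(p[0], x))[1]
        return predict

    def bagging(D, k=25, seed=0):
        # Learn a model M_i from each bootstrap sample D_i, then majority-vote
        rnd = random.Random(seed)
        models = [one_nn([rnd.choice(D) for _ in range(len(D))]) for _ in range(k)]
        def classify(x):
            votes = Counter(m(x) for m in models)
            return votes.most_common(1)[0][0]
        return classify

    D = [((1, 1), "yes"), ((1, 2), "yes"), ((2, 1), "yes"),
         ((8, 8), "no"), ((8, 9), "no"), ((9, 8), "no")]
    print(bagging(D)((2, 2)))  # almost surely "yes"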

144 Bagging: Bootstrap Aggregation Accuracy –Often significantly better than a single classifier derived from D –For noisy data: not considerably worse, and more robust –Proven improved accuracy in prediction

145 Boosting Analogy: consult several doctors, based on a combination of weighted diagnoses, with weights assigned based on previous diagnosis accuracy How does boosting work? –Weights are assigned to each training tuple –A series of k classifiers is iteratively learned –After a classifier M_i is learned, the weights are updated to allow the subsequent classifier, M_i+1, to pay more attention to the training tuples that were misclassified by M_i –The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy

146 Boosting The boosting algorithm can be extended for the prediction of continuous values Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data

147 Boosting: AdaBoost Given a set of d class-labeled tuples (X_1, y_1), ..., (X_d, y_d) Initially, all tuple weights are set the same (1/d) Generate k classifiers in k rounds. At round i: –Tuples from D are sampled (with replacement) to form a training set D_i of the same size –Each tuple's chance of being selected is based on its weight –A classification model M_i is derived from D_i –Its error rate is calculated using D_i as a test set –If a tuple is misclassified, its weight is increased; otherwise it is decreased Error rate: err(X_j) is the misclassification error of tuple X_j. Classifier M_i's error rate is the sum of the weights of the misclassified tuples: error(M_i) = sum_j w_j x err(X_j) The weight of classifier M_i's vote is log((1 - error(M_i)) / error(M_i))

148 Summary Classification vs. prediction Eager learners –Decision trees –Bayesian classification –Support Vector Machines (SVM) –Neural networks –Linear regression Lazy learners –K-Nearest Neighbor (KNN) Performance (accuracy) evaluation –Holdout –Cross-validation –Bootstrap Ensemble methods –Bagging –Boosting

149 END

