Chapter 8 Discriminant Analysis
8.1 Introduction
Classification is an important problem in multivariate analysis and data mining. Classification constructs a model from a training set whose records carry known class labels (the values of a classifying attribute), and then uses that model to classify new data, i.e., to predict unknown or missing class labels.
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, decision trees, or mathematical formulae.
Prediction: classifying future or unknown objects
- Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction, and the accuracy rate is the percentage of test-set samples correctly classified by the model.
- The test set must be independent of the training set, otherwise over-fitting will occur.
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
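As an illustration (not from the original slides), here is a minimal Python sketch of the two-step process, treating the slides' hypothetical tenure rule as an already-constructed model and estimating its accuracy on made-up test records:

```python
# A minimal sketch of the two-step classification process, using the
# hypothetical "tenured" rule from the slides as the learned model.

# Step 1 (model construction): in practice a rule like this would be
# learned from a training set; here it is written out directly.
def classify(rank: str, years: int) -> str:
    """Model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "professor" or years > 6 else "no"

# Step 2 (prediction): estimate accuracy on a test set that is
# independent of the training set (illustrative records, not real data).
test_set = [
    ("professor", 3, "yes"),
    ("assistant", 7, "yes"),
    ("assistant", 2, "no"),
    ("associate", 8, "yes"),
]

correct = sum(classify(rank, years) == label for rank, years, label in test_set)
print(f"accuracy rate: {correct / len(test_set):.0%}")
```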
Classification Process: Model Construction
[Diagram: training data fed to a classification algorithm, producing a classifier (model); example rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes']
Classification Process: Use the Model in Prediction
[Diagram: the classifier is applied to testing data and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?]
Supervised vs. Unsupervised Learning
Supervised learning (classification)
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
- New data are classified based on the training set.
Unsupervised learning (clustering)
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Discrimination: Introduction
Discrimination is a technique concerned with allocating new observations to previously defined groups. There are $k$ samples from $k$ distinct populations $G_1, \dots, G_k$. One wants to find a so-called discriminant function and a related rule to classify new observations.
Example 11.3 Bivariate case
Discriminant function and rule
A discriminant function assigns a score $y = W(x)$ to each new observation $x$, and the associated rule allocates $x$ to one of the predefined groups according to that score.
Example 11.1: Riding mowers
Consider two groups in a city: riding-mower owners and non-owners. In order to identify the best sales prospects for an intensive sales campaign, a riding-mower manufacturer is interested in classifying families as prospective owners or non-owners on the basis of income and lot size.
Example 11.1: Riding mowers
[Figure: income and lot size for the two groups]
8.2 Discriminant by Distance
Assume $k = 2$ for simplicity: two populations $G_1$ and $G_2$ with mean vectors $\mu_1$ and $\mu_2$ and common covariance matrix $\Sigma$.
8.2 Discriminant by Distance
Consider the Mahalanobis distance
$$d^2(x, G_i) = (x - \mu_i)' \Sigma^{-1} (x - \mu_i), \quad i = 1, 2,$$
and allocate $x$ to the population to which it is closest: classify $x$ into $G_1$ if $d^2(x, G_1) \le d^2(x, G_2)$, and into $G_2$ otherwise.
8.2 Discriminant by Distance
Let
$$W(x) = d^2(x, G_2) - d^2(x, G_1) = 2\,(\mu_1 - \mu_2)' \Sigma^{-1}\Bigl(x - \frac{\mu_1 + \mu_2}{2}\Bigr).$$
The rule becomes: classify $x$ into $G_1$ if $W(x) \ge 0$, and into $G_2$ if $W(x) < 0$.
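As a small illustration (not part of the original slides), the rule can be computed directly once $\mu_1$, $\mu_2$, and $\Sigma$ are given; all values below are hypothetical:

```python
import numpy as np

# Sketch of the distance discriminant for k = 2 (population parameters
# assumed known; in practice they are replaced by sample estimates).
mu1 = np.array([1.0, 2.0])          # hypothetical mean of G1
mu2 = np.array([3.0, 1.0])          # hypothetical mean of G2
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])      # hypothetical common covariance matrix
sigma_inv = np.linalg.inv(sigma)

def W(x: np.ndarray) -> float:
    """W(x) = d^2(x, G2) - d^2(x, G1); classify into G1 when W(x) >= 0."""
    return 2.0 * (mu1 - mu2) @ sigma_inv @ (x - (mu1 + mu2) / 2.0)

x_new = np.array([2.0, 1.5])
print("G1" if W(x_new) >= 0 else "G2")
```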
Example: Univariate Case with Equal Variance
[Figure: two univariate normal densities with equal variance and means $\mu_1 < \mu_2$; the cutting point $a^*$ lies at the midpoint $(\mu_1 + \mu_2)/2$ and separates the two classification regions]
8.3 Fisher's Discriminant Function
Idea: projection and ANOVA. Project the observations onto a direction $a$ and choose the direction for which the projected group means are best separated relative to the within-group variation (the one-way ANOVA F-ratio).
8.3 Fisher's Discriminant Function
Training samples: $x^{(i)}_j$, $j = 1, \dots, n_i$, drawn from $G_i$, $i = 1, \dots, k$; write $n = n_1 + \cdots + n_k$, with group means $\bar{x}^{(i)}$ and grand mean $\bar{x}$.
8.3 Fisher's Discriminant Function
Projecting the data onto a direction $a$, the F-statistic of the one-way ANOVA on the projected values $y = a'x$ is
$$F(a) = \frac{a' B a / (k - 1)}{a' W a / (n - k)},$$
where
$$B = \sum_{i=1}^{k} n_i (\bar{x}^{(i)} - \bar{x})(\bar{x}^{(i)} - \bar{x})', \qquad W = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x^{(i)}_j - \bar{x}^{(i)})(x^{(i)}_j - \bar{x}^{(i)})'$$
are the between-group and within-group sums of squares and cross-products matrices.
8.3 Fisher's Discriminant Function
To find $a$ such that $F(a)$ is maximized: since maximizing $F(a)$ is equivalent to maximizing $a'Ba / a'Wa$, the solution is the eigenvector associated with the largest eigenvalue of $W^{-1}B$. Discriminant function: $y = a'x$.
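The following Python sketch (not from the slides; data and seed are made up) builds $B$ and $W$ from labeled samples and takes the leading eigenvector of $W^{-1}B$ as Fisher's discriminant direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training samples from k = 2 groups (any k works the same way).
groups = [rng.normal([0, 0], 1.0, size=(30, 2)),
          rng.normal([2, 1], 1.0, size=(25, 2))]

grand_mean = np.vstack(groups).mean(axis=0)
p = grand_mean.size
B = np.zeros((p, p))   # between-group SSCP matrix
W = np.zeros((p, p))   # within-group SSCP matrix
for g in groups:
    d = g.mean(axis=0) - grand_mean
    B += len(g) * np.outer(d, d)
    c = g - g.mean(axis=0)
    W += c.T @ c

# Fisher's direction: eigenvector of W^{-1} B with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
a = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print("discriminant direction a =", a)
```

Note that $a$ is only determined up to a scalar multiple; rescaling it does not change the induced classification rule.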
(B) Two Populations
Note that for $k = 2$ we have
$$B = \frac{n_1 n_2}{n}\,(\bar{x}^{(1)} - \bar{x}^{(2)})(\bar{x}^{(1)} - \bar{x}^{(2)})'$$
and $\operatorname{rank}(B) = 1$. There is only one non-zero eigenvalue of $W^{-1}B$, as
$$\lambda = \operatorname{tr}(W^{-1}B) = \frac{n_1 n_2}{n}\,(\bar{x}^{(1)} - \bar{x}^{(2)})' W^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)}).$$
(B) Two Populations
The associated eigenvector is $a = c\,W^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$, where $c \ne 0$ is an arbitrary scaling constant.
(B) Two Populations
When $W$ is replaced by the pooled sample covariance matrix $S = W/(n_1 + n_2 - 2)$, the eigenvector becomes $a = S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$, which differs only by a constant factor and yields the same discriminant rule (see the sketch below).
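A sketch of this two-population shortcut, using made-up data of the same shape as in the previous sketch; only a linear solve is needed, no eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-group training data.
g1 = rng.normal([0, 0], 1.0, size=(30, 2))
g2 = rng.normal([2, 1], 1.0, size=(25, 2))

n1, n2 = len(g1), len(g2)
xbar1, xbar2 = g1.mean(axis=0), g2.mean(axis=0)

# Pooled sample covariance S = W / (n1 + n2 - 2).
W = (g1 - xbar1).T @ (g1 - xbar1) + (g2 - xbar2).T @ (g2 - xbar2)
S = W / (n1 + n2 - 2)

# For k = 2 the discriminant direction is proportional to
# S^{-1}(xbar1 - xbar2); rescaling does not change the rule.
a = np.linalg.solve(S, xbar1 - xbar2)
print("discriminant direction a =", a)
```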
Example: Insect Classification
Note: variables $x_1$ and $x_2$ are characteristics of insects (Hoel, 1947); n.g. means the natural group (species), c.g. the classified group, and $y$ the value of the discriminant function.
Example: Insect Classification
The largest eigenvalue of $W^{-1}B$ is 1.9187, and the associated eigenvector gives the discriminant direction $a$.
Example: Insect Classification
The discriminant function is $y = a'x$, and the value of $y$ for each observation is given in the table. The cutting point is the midpoint $(\bar{y}^{(1)} + \bar{y}^{(2)})/2$ of the two group means of $y$; an observation is classified into $G_1$ or $G_2$ according to which side of the cutting point its $y$ value falls. If we use $S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$ instead of $W^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$, we obtain the same classification.
8.4 Bayes' Discriminant Analysis
A. Idea
There are $k$ populations $G_1, \dots, G_k$ in $R^p$. A partition of $R^p$ into regions $R_1, \dots, R_k$ is determined based on a training sample.
Rule: classify $x$ into $G_i$ if $x$ falls into $R_i$.
Loss: $c(j \mid i)$ is the cost incurred when an observation from $G_i$ falls into $R_j$.
The probability of this misclassification is
$$P(j \mid i) = \int_{R_j} f_i(x)\,dx,$$
where $f_i$ is the density of $G_i$.
8.4 Bayes' Discriminant Analysis
The expected cost of misclassification is
$$\mathrm{ECM}(R_1, \dots, R_k) = \sum_{i=1}^{k} q_i \sum_{j \ne i} c(j \mid i)\, P(j \mid i),$$
where $q_1, \dots, q_k$ are prior probabilities. We want to minimize $\mathrm{ECM}(R_1, \dots, R_k)$ with respect to $R_1, \dots, R_k$.
B. Method
Theorem 6.4.1. Let
$$h_t(x) = \sum_{i \ne t} q_i\, c(t \mid i)\, f_i(x), \quad t = 1, \dots, k.$$
Then the optimal $R_t$'s are
$$R_t = \{\, x : h_t(x) = \min_{1 \le j \le k} h_j(x) \,\}.$$
Corollary 1
Take $c(j \mid i) = 1$ if $j \ne i$ and $0$ if $j = i$. Then
$$R_t = \{\, x : q_t f_t(x) = \max_{1 \le j \le k} q_j f_j(x) \,\},$$
i.e., $x$ is allocated to the population with the largest value of $q_j f_j(x)$.
Proof: with these costs, $h_t(x) = \sum_{i \ne t} q_i f_i(x) = \sum_{i} q_i f_i(x) - q_t f_t(x)$, so minimizing $h_t(x)$ is equivalent to maximizing $q_t f_t(x)$.
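A minimal sketch of this maximum-of-$q_t f_t(x)$ rule with hypothetical Gaussian densities and priors (the slides fix no particular densities; scipy is assumed available):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Corollary 1 (equal misclassification costs): allocate x to the
# population maximizing q_t * f_t(x). All parameters are hypothetical.
priors = [0.5, 0.3, 0.2]
densities = [multivariate_normal(mean=[0, 0], cov=np.eye(2)),
             multivariate_normal(mean=[2, 1], cov=np.eye(2)),
             multivariate_normal(mean=[0, 3], cov=np.eye(2))]

def classify(x):
    scores = [q * f.pdf(x) for q, f in zip(priors, densities)]
    return int(np.argmax(scores)) + 1   # group index t with largest q_t f_t(x)

print("x = (1, 1) -> G", classify([1.0, 1.0]), sep="")
```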
Corollary 2
In the case of $k = 2$ we have
$$R_1 = \Bigl\{\, x : \frac{f_1(x)}{f_2(x)} \ge \frac{q_2\, c(1 \mid 2)}{q_1\, c(2 \mid 1)} \,\Bigr\}, \qquad R_2 = R_1^c.$$
Corollary 3
In the case of $k = 2$ and normal populations $G_i = N_p(\mu_i, \Sigma)$ with common covariance matrix $\Sigma$, the rule becomes linear in $x$: classify $x$ into $G_1$ if
$$(\mu_1 - \mu_2)' \Sigma^{-1} \Bigl(x - \frac{\mu_1 + \mu_2}{2}\Bigr) \ge \ln \frac{q_2\, c(1 \mid 2)}{q_1\, c(2 \mid 1)},$$
and into $G_2$ otherwise.
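A minimal sketch of this linear rule with hypothetical parameter values:

```python
import numpy as np

# Sketch of the k = 2 normal-theory Bayes rule; all parameters hypothetical.
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # common covariance
q1, q2 = 0.6, 0.4                            # prior probabilities
c12, c21 = 1.0, 2.0                          # costs c(1|2) and c(2|1)

a = np.linalg.solve(sigma, mu1 - mu2)
threshold = np.log((q2 * c12) / (q1 * c21))

def classify(x):
    # Classify into G1 when a'(x - (mu1 + mu2)/2) >= ln(q2 c(1|2) / (q1 c(2|1))).
    return "G1" if a @ (np.asarray(x) - (mu1 + mu2) / 2) >= threshold else "G2"

print(classify([1.0, 0.5]))
```

With equal priors and equal costs the threshold is $\ln 1 = 0$, and the rule reduces to the distance discriminant of Section 8.2.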
C. Example 11.3: Detection of hemophilia A carriers
To construct a procedure for detecting potential hemophilia A carriers, blood samples were assayed for two groups of women, and measurements were taken on two variables. The first group of 30 women was selected from a population of women who did not carry the hemophilia gene; this group was called the normal group. The second group of 22 women was selected from known hemophilia A carriers; this group was called the obligatory carriers.
C. Example 11.3: Detection of hemophilia A carriers
Variables: $\log_{10}$(AHF activity) and $\log_{10}$(AHF-like antigen).
Populations: women who did not carry the hemophilia gene ($n_1 = 30$), and women who are known hemophilia A carriers ($n_2 = 45$).
C. Example 11.3: Detection of hemophilia A carriers
[Scatter plot of the two variables for the two groups]
C. Example 11.3: Detection of hemophilia A carriers
Data set: $\log_{10}$(AHF activity) and $\log_{10}$(AHF-like antigen) values for the normal group and the obligatory-carrier group. [Numeric table flattened in extraction; row/column structure not recoverable]
C. Example 11.3: Detection of hemophilia A carriers
[SAS output for the discriminant analysis]