1
Let's talk some more about features.
2
Western Pipistrelle (Parastrellus hesperus). Photo by Michael Durham.
3
Western pipistrelle calls: we can easily measure two features of bat calls, their characteristic frequency and their call duration.
7
Quick Review: We have seen the simple linear classifier. One way to generalize this algorithm is to consider other polynomials…
8
Quick Review: Another way to generalize this algorithm is to consider piecewise linear decision boundaries.
9
Quick Review: There really are datasets for which the more expressive models are better… (Figure: the "Left Bar" vs. "Right Bar" pigeon problem.)
11
Overfitting. How do we choose the right model? It is tempting to say: test all models, using cross validation, and pick the best one. However, this has a problem. If we do this, we will find that a more complex model will almost certainly do better on our training set, but will do worse when we deploy it. This is overfitting, a major headache in data mining.
12
Imagine the following problem: there are two features, the Y-axis is irrelevant to the task (but we do not know that), and scoring above 5 on the X-axis means you are in the red class; otherwise you are in the blue class. Again, we do not know this as we prepare to build a classifier.
13
Suppose we had a billion exemplars; what would we see? In this case, we would expect to learn a decision boundary that is almost exactly correct.
14
With less data, our decision boundary makes some errors. In the green area, it claims that instances are red, when they should be blue. In the pink area, it claims that instances are blue, when they should be red. However, overall it is doing a pretty good job.
15
If we allow a more complex model, we will end up doing worse when we deploy the model, even though it performs well now. In the green area, it claims that instances are red, when they should be blue. In the pink area, it claims that instances are blue, when they should be red.
17
(Figure: a plot of model fit against the complexity of the model, for the training data.)
18
(Figure: the same plot against the complexity of the model, now showing validation data alongside the training data.)
19
Rule of Thumb: when doing machine learning, prefer simpler models. This is called Occam's Razor.
20
We can speed up the nearest neighbor algorithm by "throwing away" some data. This is called data editing. Note that this can sometimes improve accuracy! One possible approach: delete all instances that are surrounded by members of their own class. We can also speed up classification with indexing.
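A minimal Python sketch of this editing rule, assuming the data is held as a NumPy array of feature vectors with a parallel label array (the choice of k and the function name are illustrative, not part of the slide):

import numpy as np

def edit_dataset(X, y, k=3):
    """Drop instances whose k nearest neighbors all share their own label.
    A rough sketch of one data-editing heuristic, not the only possibility."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)   # distances from instance i to all others
        d[i] = np.inf                          # ignore the point itself
        neighbors = np.argsort(d)[:k]          # indices of the k nearest neighbors
        if not np.all(y[neighbors] == y[i]):
            keep.append(i)                     # keep only "boundary" points
    return X[keep], y[keep]

Points deep inside a region of their own class are discarded; the instances that actually shape the decision boundary are kept.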
21
Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case. Alternatives include the Manhattan distance (p=1), the Max distance (p=inf), the Mahalanobis distance, and the weighted Euclidean distance.
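For concreteness, these alternative distances might be sketched as follows (a and b are assumed to be NumPy arrays; the weight vector w and covariance matrix cov would have to be estimated from the data):

import numpy as np

def manhattan(a, b):                 # Minkowski p = 1
    return np.sum(np.abs(a - b))

def euclidean(a, b):                 # Minkowski p = 2
    return np.sqrt(np.sum((a - b) ** 2))

def chebyshev(a, b):                 # Minkowski p = infinity ("Max")
    return np.max(np.abs(a - b))

def weighted_euclidean(a, b, w):     # w holds one non-negative weight per feature
    return np.sqrt(np.sum(w * (a - b) ** 2))

def mahalanobis(a, b, cov):          # cov is the covariance matrix of the data
    diff = a - b
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)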
22
So far we have only seen features that are real numbers. But features could be: Boolean (Has wings?), categorical (Green, Brown, Gray), etc. How do we handle such features? The good news is that we can always define some measure of "nearest" for nearest neighbor for basically any kind of feature. Such measures are called distance measures (or sometimes, similarity measures).
23
Let us consider an example that uses Boolean features. Features: Has wings? Has spur on front legs? Has cone-shaped head? length(antenna) > 1.5 * length(abdomen). Under this representation, every insect is just a Boolean vector: Insect 17 = {true, true, false, false}, or Insect 17 = {1,1,0,0}. Instead of the Euclidean distance, we can use the Hamming distance (or one of many other measures). The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Which insect is the nearest neighbor of Insect 17 = {1,1,0,0}? Insect 1 = {1,1,0,1}, Insect 2 = {0,0,0,0}, Insect 3 = {0,1,1,1}. Here we would say Insect 17 is in the blue class.
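A few lines of Python make the nearest-neighbor search under Hamming distance concrete (the insect vectors are copied from the example above):

def hamming(a, b):
    """Number of positions where two equal-length Boolean vectors differ."""
    return sum(x != y for x, y in zip(a, b))

insect17 = [1, 1, 0, 0]
neighbors = {"Insect 1": [1, 1, 0, 1],
             "Insect 2": [0, 0, 0, 0],
             "Insect 3": [0, 1, 1, 1]}

nearest = min(neighbors, key=lambda name: hamming(insect17, neighbors[name]))
print(nearest)   # Insect 1, at Hamming distance 1 (Insect 2 is at 2, Insect 3 at 3)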
24
We can use the nearest neighbor algorithm with any distance/similarity function. For example, is "Faloutsos" Greek or Irish? We could compare the name "Faloutsos" to a database of names using string edit distance:

ID  Name          Class
1   Gunopulos     Greek
2   Papadopoulos  Greek
3   Kollios       Greek
4   Dardanos      Greek
5   Keogh         Irish
6   Gough         Irish
7   Greenhaugh    Irish
8   Hadleigh      Irish

edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6

Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name. Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints, etc.
25
Edit Distance Example. How similar are the names "Peter" and "Piotr"? It is possible to transform any string Q into string C using only substitution, insertion, and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.) Assume the following cost function: Substitution 1 unit, Insertion 1 unit, Deletion 1 unit. Then D(Peter, Piotr) is 3: Peter -> Piter (substitution, i for e) -> Pioter (insertion, o) -> Piotr (deletion, e). (Figure: variants of the name, Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter.)
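The cheapest transformation can be found with the standard dynamic-programming recurrence for edit distance. A sketch using the unit costs assumed above:

def edit_distance(q, c):
    """Levenshtein distance with unit-cost substitution, insertion, deletion."""
    m, n = len(q), len(c)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all of q[:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all of c[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or match)
    return D[m][n]

print(edit_distance("Peter", "Piotr"))   # 3, as in the worked example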
26
Decision Tree Classifier (Ross Quinlan). (Figure: antenna length vs. abdomen length for the insect data, with the corresponding tree.) The tree asks: Abdomen length > 7.1? If yes, Katydid. If no, ask: Antenna length > 6.0? If yes, Katydid; if no, Grasshopper.
27
Decision Tree Classifier. Here is a different tree (exercise: draw out the full tree). In general, if we have n Boolean features, the number of possible trees is enormous (there are already 2^(2^n) distinct Boolean functions of n features, and even more trees). This is both good and bad news.
28
Decision trees predate computers. (Figure: an old dichotomous insect key using questions such as "Antennae shorter than body?", "3 Tarsi?", and "Foretiba has ears?" to separate Grasshoppers, Crickets, Katydids, and Camel Crickets.)
29
Decision Tree Classification
A decision tree is a flow-chart-like tree structure:
– An internal node denotes a test on an attribute
– A branch represents an outcome of the test
– Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases:
– Tree construction: at the start, all the training examples are at the root; examples are partitioned recursively based on selected attributes
– Tree pruning: identify and remove branches that reflect noise or outliers
Use of a decision tree: to classify an unknown sample, test its attribute values against the decision tree.
30
How do we construct the decision tree?
Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they can be discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
– There are no samples left
A sketch of this recursion is shown below.
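For concreteness, here is a rough Python sketch of the greedy, top-down construction for categorical attributes. The data representation, with each example as a dict mapping attribute name to value and a parallel list of labels, is an assumption made for illustration, not something specified on the slide:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Expected reduction in entropy from partitioning on one categorical attribute."""
    total, n = entropy(labels), len(rows)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        total -= len(subset) / n * entropy(subset)
    return total

def build_tree(rows, labels, attrs):
    """Greedy top-down induction: stop when the node is pure,
    or when no attributes remain (then use majority voting)."""
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[(best, value)] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attrs if a != best])
    return tree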
31
Information Gain as a Splitting Criterion. Select the attribute with the highest information gain (information gain is the expected reduction in entropy). Assume there are two classes, P and N. Let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as
I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)),
where 0 log(0) is defined as 0.
33
Information Gain in Decision Tree Induction. Assume that using attribute A, the current set S will be partitioned into child sets S1, S2, …, Sv, where child Si contains pi elements of P and ni elements of N. The information gained by branching on A is the entropy of S minus the weighted average entropy of the children:
Gain(A) = I(p, n) - sum over i of ((pi + ni)/(p + n)) I(pi, ni).
Note: entropy is at its minimum if the collection of objects is completely uniform (all of one class).
34
Person   Hair Length  Weight  Age  Class
Homer    0"           250     36   M
Marge    10"          150     34   F
Bart     2"           90      10   M
Lisa     6"           78      8    F
Maggie   4"           20      1    F
Abe      1"           170     70   M
Selma    8"           160     41   F
Otto     10"          180     38   M
Krusty   6"           200     45   M
Comic    8"           290     38   ?
35
Let us try splitting on Hair Length. Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911. Hair Length <= 5? Yes branch: Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113. No branch: Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710. Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911.
36
Let us try splitting on Weight. Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911. Weight <= 160? Yes branch: Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219. No branch: Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0. Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900.
37
Let us try splitting on Age. Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911. Age <= 40? Yes branch: Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1. No branch: Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183. Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183.
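As a quick check, the three gains above can be reproduced with a few lines of Python (a verification sketch only; the (females, males) counts are taken from the table and the splits above):

import math

def entropy(p, n):
    """Entropy of a node with p instances of one class and n of the other."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

def gain(parent, splits):
    """parent and each split are (count_F, count_M) pairs."""
    total = sum(p + n for p, n in splits)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(*parent) - remainder

print(f"{gain((4, 5), [(1, 3), (3, 2)]):.4f}")  # Hair Length <= 5 : 0.0911
print(f"{gain((4, 5), [(4, 1), (0, 4)]):.4f}")  # Weight <= 160    : 0.5900
print(f"{gain((4, 5), [(3, 3), (1, 2)]):.4f}")  # Age <= 40        : 0.0183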
38
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified… So we simply recurse on them! This time we find that we can split on Hair Length (Hair Length <= 2?), and we are done.
39
We don't need to keep the data around, just the test conditions: Weight <= 160? If no, Male. If yes, Hair Length <= 2? If yes, Male; if no, Female. How would these people be classified?
40
It is trivial to convert decision trees to rules. Rules to classify Males/Females: If Weight is greater than 160, classify as Male; else if Hair Length is less than or equal to 2, classify as Male; else classify as Female.
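Written as code, the learned rules are just a nested conditional. A sketch (the function and argument names are made up for illustration):

def classify(weight, hair_length):
    """Transcription of the learned tree: Weight <= 160?, then Hair Length <= 2?."""
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(weight=290, hair_length=8))   # the unlabeled "Comic" row -> Male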
41
Once we have learned the decision tree, we don't even need a computer! (Figure: a decision tree for a typical shared-care setting, applying the system to the diagnosis of prostatic obstructions.) This decision tree is attached to a medical machine, and is designed to help nurses make decisions about what type of doctor to call.
42
Classification Problem: Fourth Amendment Cases before the Supreme Court I. The Fourth Amendment (Amendment IV) to the United States Constitution is the part of the Bill of Rights that prohibits unreasonable searches and seizures and requires any warrant to be judicially sanctioned and supported by probable cause. Suppose we have a 4th Amendment case, Keogh vs. State of California. Keogh argues that the search that found evidence of him taking bribes was an Unreasonable search (U), and the state argues that the search was Reasonable (R). The case is appealed to the Supreme Court; will the court decide U or R? We can use machine learning to try to predict their decision. What should the features be? For example:
– If the search was conducted in a home
– If the search was conducted in a business
– If the search was conducted on one's person
– If the search was conducted in a car
– If the search was a full search, as opposed to a less extensive intrusion
– If the search was conducted incident to arrest
– If the search was conducted after a lawful arrest
– If an exception to the warrant requirement existed (beyond that of search incident to a lawful arrest)
Keogh vs. State of California = {0,1,1,0,0,0,1,0}
(Source: Jonathan P. Kastellec, "The Statistical Analysis of Judicial Decisions and Legal Rules with Classification Trees".)
43
Classification Problem: Fourth Amendment Cases before the Supreme Court II. (Figure: the Supreme Court's search and seizure decisions, 1962–1984 terms; U = Unreasonable, R = Reasonable.) Keogh vs. State of California = {0,1,1,0,0,0,1,0}
44
Decision Tree for Supreme Court Justice Sandra Day O'Connor We can also learn decision trees for individual Supreme Court Members. Using similar decision trees for the other eight justices, these models correctly predicted the majority opinion in 75 percent of the cases, substantially outperforming the experts' 59 percent.
45
The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets. For example, the rule "Wears green?" perfectly classifies the data; so does "Mother's name is Jacqueline?"; so does "Has blue shoes?"… (Figure: a split on "Wears green?" separating Male from Female.)
46
Avoid Overfitting in Classification
The generated tree may overfit the training data:
– Too many branches, some of which may reflect anomalies due to noise or outliers
– The result is poor accuracy for unseen samples
Two approaches to avoid overfitting:
– Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
– Postpruning: remove branches from a "fully grown" tree, producing a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".
47
Which of the "Pigeon Problems" can be solved by a Decision Tree? 1) Deep bushy tree. 2) Useless. 3) Deep bushy tree? The decision tree has a hard time with correlated attributes.
48
Advantages/Disadvantages of Decision Trees
Advantages:
– Easy to understand (doctors love them!)
– Easy to generate rules
Disadvantages:
– May suffer from overfitting
– Classifies by rectangular partitioning (so does not handle correlated features very well)
– Can be quite large; pruning is necessary
– Does not handle streaming data easily
49
Naïve Bayes Classifier We will start off with a visual intuition, before looking at the math… Thomas Bayes 1702 - 1761
50
Remember this example (antenna length vs. abdomen length, grasshoppers vs. katydids)? Let's get lots more data…
51
With a lot of data, we can build a histogram. Let us just build one for "Antenna Length" for now… (Figure: histograms of antenna length for katydids and grasshoppers.)
52
We can leave the histograms as they are, or we can summarize them with two normal distributions. Let us use two normal distributions, for ease of visualization in the following slides…
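Summarizing a histogram with a normal distribution just means recording a mean and a standard deviation per class, and then evaluating the Gaussian density instead of the raw counts. A minimal Python sketch, using made-up antenna-length samples since the real data is only shown graphically:

import math

def fit_gaussian(values):
    """Summarize one class's feature values by a mean and standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma

def gaussian_pdf(x, mu, sigma):
    """Density of the fitted normal distribution at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical antenna lengths for each class (for illustration only)
grasshoppers = [2.1, 2.8, 3.0, 3.5, 4.2, 4.8, 5.1]
katydids     = [5.5, 6.0, 6.8, 7.2, 7.9, 8.3, 9.0]

mu_g, sd_g = fit_gaussian(grasshoppers)
mu_k, sd_k = fit_gaussian(katydids)
print(gaussian_pdf(3, mu_g, sd_g), gaussian_pdf(3, mu_k, sd_k))  # density of each class at length 3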
53
We want to classify an insect we have found. Its antennae are 3 units long. How can we classify it? We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid? There is a formal way to discuss the most probable classification: p(cj | d) = the probability of class cj, given that we have observed d.
54
Antennae length is 3. P(Grasshopper | 3) = 10 / (10 + 2) = 0.833. P(Katydid | 3) = 2 / (10 + 2) = 0.167. (The 10 and 2 are the heights of the two distributions at an antennae length of 3.)
55
Antennae length is 7. P(Grasshopper | 7) = 3 / (3 + 9) = 0.250. P(Katydid | 7) = 9 / (3 + 9) = 0.750.
56
Antennae length is 5. P(Grasshopper | 5) = 6 / (6 + 6) = 0.500. P(Katydid | 5) = 6 / (6 + 6) = 0.500.
57
Bayes Classifiers. That was a visual intuition for a simple case of the Bayes classifier, also called: Idiot Bayes, Naïve Bayes, Simple Bayes. We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea: find out the probability of a previously unseen instance belonging to each class, then simply pick the most probable class.
58
Bayes Classifiers. Bayesian classifiers use Bayes theorem, which says p(cj | d) = p(d | cj) p(cj) / p(d), where: p(cj | d) = the probability of instance d being in class cj (this is what we are trying to compute); p(d | cj) = the probability of generating instance d given class cj (we can imagine that being in class cj causes you to have feature d with some probability); p(cj) = the probability of occurrence of class cj (this is just how frequent class cj is in our database); p(d) = the probability of instance d occurring (this can actually be ignored, since it is the same for all classes).
59
Assume that we have two classes: c1 = male and c2 = female. We have a person whose sex we do not know, say "drew" or d. (Note: "Drew" can be a male or a female name; think Drew Carey and Drew Barrymore.) Classifying drew as male or female is equivalent to asking which is greater: p(male | drew) or p(female | drew)? By Bayes theorem, p(male | drew) = p(drew | male) p(male) / p(drew), where p(drew | male) is the probability of being called "drew" given that you are a male, p(male) is the probability of being a male, and p(drew) is the probability of being named "drew" (actually irrelevant, since it is the same for all classes).
60
This is Officer Drew (who arrested me in 1997). Is Officer Drew a Male or a Female? Luckily, we have a small database with names and sexes. We can use it to apply Bayes rule, p(cj | d) = p(d | cj) p(cj) / p(d):

Name     Sex
Drew     Male
Claudia  Female
Drew     Female
Drew     Female
Alberto  Male
Karin    Female
Nina     Female
Sergio   Male
61
p(male | drew) = (1/3 * 3/8) / p(drew) = 0.125 / p(drew). p(female | drew) = (2/5 * 5/8) / p(drew) = 0.250 / p(drew). Since p(drew) = 3/8 is the same for both classes, we can ignore it: Officer Drew is more likely to be a Female.
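The same arithmetic can be done directly from the name/sex table (a sketch; the common denominator p(drew) is omitted because it cancels across the two classes):

names = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"), ("Drew", "Female"),
         ("Alberto", "Male"), ("Karin", "Female"), ("Nina", "Female"), ("Sergio", "Male")]

def score(name, sex):
    """Unnormalized posterior: p(name | sex) * p(sex); p(name) cancels across classes."""
    in_class = [n for n, s in names if s == sex]
    p_name_given_sex = sum(n == name for n in in_class) / len(in_class)
    p_sex = len(in_class) / len(names)
    return p_name_given_sex * p_sex

print(score("Drew", "Male"))     # 1/3 * 3/8 = 0.125
print(score("Drew", "Female"))   # 2/5 * 5/8 = 0.25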
62
And indeed, Officer Drew IS a female! (Recall: p(male | drew) came out to 0.125 and p(female | drew) to 0.250, before dividing by the common p(drew).)
63
So far we have only considered Bayes classification when we have one attribute (the "antennae length", or the "name"). But we may have many features. How do we use all the features?

Name     Over 170 cm  Eye    Hair length  Sex
Drew     No           Blue   Short        Male
Claudia  Yes          Brown  Long         Female
Drew     No           Blue   Long         Female
Drew     No           Blue   Long         Female
Alberto  Yes          Brown  Short        Male
Karin    No           Blue   Long         Female
Nina     Yes          Brown  Short        Female
Sergio   Yes          Blue   Long         Male

p(cj | d) = p(d | cj) p(cj) / p(d)
64
To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj). In words: the probability of class cj generating instance d equals the probability of class cj generating the observed value for feature 1, multiplied by the probability of class cj generating the observed value for feature 2, multiplied by…
65
Applying this assumption, p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj), to Officer Drew, who is blue-eyed, over 170 cm tall, and has long hair: p(officer drew | cj) = p(over_170cm = yes | cj) * p(eye = blue | cj) * …. So p(officer drew | Female) = 2/5 * 3/5 * …, while p(officer drew | Male) = 2/3 * 2/3 * ….
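Continuing the Officer Drew example with the full table, here is a small Python sketch that estimates each term by counting, exactly as above, and also multiplies in the prior p(cj). The full products, including the hair-length term, extend the partial products shown on the slide:

people = [  # (name, over_170cm, eye, hair, sex), copied from the table above
    ("Drew", "No", "Blue", "Short", "Male"),
    ("Claudia", "Yes", "Brown", "Long", "Female"),
    ("Drew", "No", "Blue", "Long", "Female"),
    ("Drew", "No", "Blue", "Long", "Female"),
    ("Alberto", "Yes", "Brown", "Short", "Male"),
    ("Karin", "No", "Blue", "Long", "Female"),
    ("Nina", "Yes", "Brown", "Short", "Female"),
    ("Sergio", "Yes", "Blue", "Long", "Male"),
]

def naive_bayes_score(observed, sex):
    """p(d1|c) * p(d2|c) * ... * p(dn|c) * p(c), assuming independent features."""
    rows = [p for p in people if p[-1] == sex]
    score = len(rows) / len(people)              # the prior p(c)
    for i, value in observed.items():            # i = column index of the feature
        score *= sum(r[i] == value for r in rows) / len(rows)
    return score

officer_drew = {1: "Yes", 2: "Blue", 3: "Long"}  # over 170 cm, blue eyes, long hair
print(naive_bayes_score(officer_drew, "Female")) # 5/8 * 2/5 * 3/5 * 4/5 = 0.12
print(naive_bayes_score(officer_drew, "Male"))   # 3/8 * 2/3 * 2/3 * 1/3 ~= 0.056

Female scores higher, agreeing with the single-feature calculation.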
66
The Naïve Bayes classifier is often represented as this type of graph: a class node cj with arrows pointing to feature nodes p(d1 | cj), p(d2 | cj), …, p(dn | cj). Note the direction of the arrows, which state that each class causes certain features, with a certain probability.
67
Naïve Bayes is fast and space-efficient. We can look up all the probabilities with a single scan of the database and store them in a (small) table per feature, for example:

p(Long Hair | Sex):    Male: Yes 0.05, No 0.95;   Female: Yes 0.70, No 0.30
p(Over 190 cm | Sex):  Male: Yes 0.15, No 0.85;   Female: Yes 0.01, No 0.99
68
Naïve Bayes is NOT sensitive to irrelevant features. Suppose we are trying to classify a person's sex based on several features, including eye color. (Of course, eye color is completely irrelevant to a person's sex.) p(Jessica | cj) = p(eye = brown | cj) * p(wears_dress = yes | cj) * …. p(Jessica | Female) = 9,000/10,000 * 9,975/10,000 * …, while p(Jessica | Male) = 9,001/10,000 * 2/10,000 * …. The eye-color terms (9,000/10,000 and 9,001/10,000) are almost the same for both classes, so they have little effect. However, this assumes that we have good enough estimates of the probabilities, so the more data the better.
69
An obvious point: I have used a simple two-class problem, and two possible values for each feature, in my previous examples. However, we can have an arbitrary number of classes, or of feature values. For example:

p(Mass > 10 kg | Animal):  Cat: Yes 0.15, No 0.85;  Dog: Yes 0.91, No 0.09;  Pig: Yes 0.99, No 0.01
p(Color | Animal):  Cat: Black 0.33, White 0.23, Brown 0.44;  Dog: Black 0.97, White 0.03, Brown 0.90;  Pig: Black 0.04, White 0.01, Brown 0.95
70
Naïve Bayesian Classifier: Problem! Naïve Bayes assumes independence of features, but height and weight, for example, are clearly not independent:

p(Over 6 foot | Sex):       Male: Yes 0.15, No 0.85;   Female: Yes 0.01, No 0.99
p(Over 200 pounds | Sex):   Male: Yes 0.11, No 0.80;   Female: Yes 0.05, No 0.95
71
Naïve Bayesian Classifier: Solution. Consider the relationships between attributes, for example by conditioning weight on height:

p(Over 6 foot | Sex):  Male: Yes 0.15, No 0.85;  Female: Yes 0.01, No 0.99
p(Over 200 pounds | Sex, Over 6 foot):
  Male: Yes and Over 6 foot 0.11;  No and Over 6 foot 0.59;  Yes and NOT Over 6 foot 0.05;  No and NOT Over 6 foot 0.35
  Female: Yes and Over 6 foot 0.01; …
72
Naïve Bayesian Classifier: Solution. Consider the relationships between attributes… But how do we find the set of connecting arcs?
73
The Naïve Bayesian Classifier has a quadratic decision boundary.
74
Dear SIR, I am Mr. John Coleman and my sister is Miss Rose Colemen, we are the children of late Chief Paul Colemen from Sierra Leone. I am writing you in absolute confidence primarily to seek your assistance to transfer our cash of twenty one Million Dollars ($21,000.000.00) now in the custody of a private Security trust firm in Europe the money is in trunk boxes deposited and declared as family valuables by my late father as a matter of fact the company does not know the content as money, although my father made them to under stand that the boxes belongs to his foreign partner. …
75
This mail is probably spam. The original message has been attached along with this report, so you can recognize or block similar unwanted mail in future. See http://spamassassin.org/tag/ for more details.
Content analysis details: (12.20 points, 5 required)
NIGERIAN_SUBJECT2 (1.4 points) Subject is indicative of a Nigerian spam
FROM_ENDS_IN_NUMS (0.7 points) From: ends in numbers
MIME_BOUND_MANY_HEX (2.9 points) Spam tool pattern in MIME boundary
URGENT_BIZ (2.7 points) BODY: Contains urgent matter
US_DOLLARS_3 (1.5 points) BODY: Nigerian scam key phrase ($NN,NNN,NNN.NN)
DEAR_SOMETHING (1.8 points) BODY: Contains 'Dear (something)'
BAYES_30 (1.6 points) BODY: Bayesian classifier says spam probability is 30 to 40% [score: 0.3728]
76
Advantages/Disadvantages of Naïve Bayes
Advantages:
– Fast to train (single scan); fast to classify
– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
Disadvantages:
– Assumes independence of features
77
Summary of Classification. We have seen 4 major classification techniques: the simple linear classifier, nearest neighbor, decision trees, and naïve Bayes. There are other techniques: neural networks, support vector machines, genetic algorithms… In general, there is no one best classifier for all problems. You have to consider what you hope to achieve, and the data itself. Let us now move on to the other classic problem of data mining and machine learning: clustering…