Seminar of Interest
Friday, September 15, at 11:00 am, EMS W220. Dr. Hien Nguyen of the University of Wisconsin-Whitewater. "Hybrid User Model for Information Retrieval: Framework and Evaluation".
Overview of Today’s Lecture
Last Time: representing examples (feature selection), HW0, intro to supervised learning
HW0 due on Tuesday
Today: K-NN wrapup, Naïve Bayes
Reading Assignment: Sections 2.1 and 2.2, Chapter 5
Nearest-Neighbor Algorithms
(aka exemplar models, instance-based learning (IBL), case-based learning)
Learning ≈ memorize training examples
Problem solving = find the most similar example in memory; output its category
[Figure: Venn-style diagram of '+' and '-' training examples with a query point '?'; see "Voronoi Diagrams" (pg 233)]
Sample Experimental Results
Test-set correctness:
Testbed            IBL   D-Trees   Neural Nets
Wisconsin Cancer   98%   95%       96%
Heart Disease      78%   76%       ?
Tumor              37%   38%       ?
Appendicitis       83%   85%       86%
Simple algorithm works quite well!
Simple Example: 1-NN
(1-NN ≡ one nearest neighbor)
Training Set:
1. a=0, b=0, c=1  +
2. a=0, b=1, c=0  -
3. a=1, b=1, c=1  -
Test Example:
a=0, b=1, c=0  ?
“Hamming distance” to each training example:
Ex 1 = 2
Ex 2 = 0 (an exact match)
Ex 3 = 2
So output -
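As a minimal sketch (in Python, with illustrative names), the same 1-NN computation on the toy data above looks like this:

```python
# 1-NN with Hamming distance on the toy training set above.
train = [
    ({"a": 0, "b": 0, "c": 1}, "+"),   # Ex 1
    ({"a": 0, "b": 1, "c": 0}, "-"),   # Ex 2
    ({"a": 1, "b": 1, "c": 1}, "-"),   # Ex 3
]
test = {"a": 0, "b": 1, "c": 0}

def hamming(x, y):
    """Number of features on which two examples disagree."""
    return sum(x[f] != y[f] for f in x)

print([hamming(ex, test) for ex, _ in train])   # [2, 0, 2]

# Output the category of the single nearest neighbor.
_, label = min(train, key=lambda pair: hamming(pair[0], test))
print(label)                                     # '-'
```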
K-NN Algorithm
Collect the K nearest neighbors, select the majority classification (or somehow combine their classes)
What should K be?
Problem dependent
Can use tuning sets (later) to select a good setting for K
[Plot: tuning-set error rate as a function of K = 1, 2, 3, 4, 5, ...]
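A sketch of the K-NN prediction step, assuming the training data is a list of (features, label) pairs and dist is any distance function (for example, the Hamming distance above); the names are illustrative:

```python
from collections import Counter

def knn_predict(train, test_ex, k, dist):
    """Collect the k nearest neighbors of test_ex and return the majority class."""
    neighbors = sorted(train, key=lambda pair: dist(pair[0], test_ex))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# e.g. knn_predict(train, test, k=3, dist=hamming) -> '-' on the toy data above
```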
What is the “distance” between two examples?
One possibility: sum the distances between features:
dist(e_1, e_2) = Σ_i w_i · dist_i(e_1, e_2)
where dist(e_1, e_2) is the distance between examples 1 and 2, w_i is a (numeric) feature-specific weight, and dist_i is the distance for feature i only.
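A sketch of that weighted sum in code; the per-feature weights and distance functions here are placeholders:

```python
def weighted_distance(e1, e2, weights, feature_dists):
    """dist(e1, e2) = sum over features f of w_f * dist_f(e1, e2)."""
    return sum(w * feature_dists[f](e1[f], e2[f]) for f, w in weights.items())

# Possible per-feature distances: absolute difference for numeric features,
# 0/1 mismatch for symbolic ones.
numeric_dist = lambda a, b: abs(a - b)
mismatch_dist = lambda a, b: 0.0 if a == b else 1.0
```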
Using K neighbors to classify an example
Given: nearest neighbors e_1, ..., e_k with output categories O_1, ..., O_k
The output for example e_t is
O_t = argmax_c Σ_{i=1..k} K(e_i, e_t) · δ(O_i, c)
where K(e_i, e_t) is the kernel and δ is the “delta” function (= 1 if O_i = c, else = 0)
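A sketch of this weighted vote; kernel can be any of the kernel functions discussed on the next slides, and the names are illustrative:

```python
def weighted_vote(neighbors, test_ex, kernel):
    """O_t = argmax over c of sum_i kernel(e_i, e_t) * delta(O_i, c)."""
    scores = {}
    for e_i, o_i in neighbors:              # neighbors: list of (example, category) pairs
        scores[o_i] = scores.get(o_i, 0.0) + kernel(e_i, test_ex)
    return max(scores, key=scores.get)      # category with the largest weighted vote
```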
Kernel Functions
Term “kernel” comes from statistics
Major topic for support vector machines (later)
Weights interaction between pairs of examples
Can involve a similarity measure
Kernel Function K(e_i, e_t): Examples
K(e_i, e_t) = 1: simple majority vote (? classified as -)
K(e_i, e_t) = 1 / dist(e_i, e_t): inverse-distance weight (? could be classified as +)
[Figure: the example '?' has three neighbors, two of which are '-' and one of which is '+']
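A quick numeric check of that picture, with made-up distances (the two '-' neighbors far from '?', the single '+' neighbor close):

```python
# (distance from '?', category) for the three neighbors; the distances are invented.
neighbors = [(5.0, "-"), (6.0, "-"), (1.0, "+")]

def vote(weight):
    scores = {}
    for d, c in neighbors:
        scores[c] = scores.get(c, 0.0) + weight(d)
    return max(scores, key=scores.get)

print(vote(lambda d: 1.0))      # K = 1: simple majority vote            -> '-'
print(vote(lambda d: 1.0 / d))  # K = 1/dist: 1/1 > 1/5 + 1/6            -> '+'
```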
Gaussian Kernel: popular in SVMs
K(e_i, e_t) = exp(-dist(e_i, e_t)² / (2σ²))
where e is Euler’s constant, dist(e_i, e_t) is the distance between the two examples, and σ acts as a “standard deviation”
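A sketch of the Gaussian kernel as a function of the distance; σ is a free width parameter, and the factor of 2 in the denominator follows the usual SVM convention (it may differ from the original slide):

```python
import math

def gaussian_kernel(dist, sigma):
    """exp(-dist^2 / (2 * sigma^2)): weight near 1 for nearby examples, near 0 for distant ones."""
    return math.exp(-dist ** 2 / (2 * sigma ** 2))
```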
[Plots comparing weighting functions: y = 1/x, y = 1/exp(x²), y = 1/x²]
Instance-Based Learning (IBL) and Efficiency
IBL algorithms postpone work from training to testing
Pure NN/IBL just memorizes the training data
Computationally intensive
Match all features of all training examples
Instance-Based Learning (IBL) and Efficiency
Possible speed-ups:
Use a subset of the training examples (Aha)
Use clever data structures (A. Moore)
  KD trees, hash tables, Voronoi diagrams
Use a subset of the features
  Feature selection
Feature Selection as Search Problem
State = set of features
Start state: no features (forward selection) or all features (backward selection)
Operators = add/subtract features
Scoring function = accuracy on the tuning set
Forward and Backward Selection of Features
Hill-climbing (“greedy”) search; each state is a set of features to use, scored by accuracy on the tuning set (our heuristic function)
[Search diagram:
Forward: {} 50%  --add F_1-->  {F_1} 62%  ...  --add F_N-->  {F_N} 71%  ...
Backward: {F_1, F_2, ..., F_N} 73%  --subtract F_1-->  {F_2, ..., F_N} 79%  --subtract F_2--> ...]
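A sketch of the forward (hill-climbing) branch of this search, assuming some accuracy_on_tuning_set(feature_subset) scoring function is available (the name is a placeholder):

```python
def forward_selection(all_features, accuracy_on_tuning_set):
    """Greedy search: start with no features and repeatedly add the single
    feature that most improves tuning-set accuracy; stop when nothing helps."""
    selected = set()
    best_score = accuracy_on_tuning_set(selected)
    remaining = set(all_features)
    while remaining:
        score, feature = max((accuracy_on_tuning_set(selected | {f}), f)
                             for f in remaining)
        if score <= best_score:     # no single addition improves the heuristic
            break
        selected.add(feature)
        remaining.remove(feature)
        best_score = score
    return selected
```

Backward selection is symmetric: start from the full feature set and greedily remove the feature whose removal most improves tuning-set accuracy.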
Forward vs. Backward Feature Selection
Forward:
  Faster in early steps because fewer features to test
  Fast for choosing a small subset of the features
  Misses useful features whose usefulness requires other features (feature synergy)
Backward:
  Fast for choosing all but a small subset of the features
  Preserves useful features whose usefulness requires other features
  Example: area important, features = length, width
Feature Selection and Machine Learning
Filtering-based feature selection: all features → FS algorithm → subset of features → ML algorithm → model
Wrapper-based feature selection: all features → FS algorithm ⇄ ML algorithm → model
(The FS algorithm calls the ML algorithm many times and uses it to help select features)
Number of Features and Performance
Too many features can hurt test-set performance
Too many irrelevant features mean many spurious-correlation possibilities for an ML algorithm to detect
“Vanilla” K-NN Report Card
Learning Efficiency: A+
Classification Efficiency: F
Stability: C
Robustness (to noise): D
Empirical Performance: C
Domain Insight: F
Implementation Ease: A
Incremental Ease: A
But is a good baseline
K-NN Summary
K-NN can be an effective ML algorithm
Especially if few irrelevant features
Good baseline for experiments
A Different Approach to Classification: Probabilistic Models
Indicate confidence in classification
Given feature vector: F = (f_1 = v_1, …, f_n = v_n)
Output probability: P(class = + | F)
(the probability the class is positive, given the feature vector)
Probabilistic K-NN
Output a probability using the k neighbors
Possible algorithm:
P(class = + | F) = (number of "+" neighbors) / k
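A sketch of this estimate, reusing the neighbor-finding idea from the K-NN sketch earlier (names are illustrative):

```python
def knn_prob_positive(train, test_ex, k, dist):
    """Estimate P(class = '+' | F) as the fraction of '+' labels among the k nearest neighbors."""
    neighbors = sorted(train, key=lambda pair: dist(pair[0], test_ex))[:k]
    return sum(1 for _, label in neighbors if label == "+") / k
```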
Bayes’ Rule
Definitions:
P(A ∧ B) ≡ P(B) · P(A|B)
P(A ∧ B) ≡ P(A) · P(B|A)
So (assuming P(B) > 0):
P(B) · P(A|B) = P(A) · P(B|A)
P(A|B) = P(A) · P(B|A) / P(B)   ← Bayes’ rule
[Venn diagram of events A and B]
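A quick numeric check of the derivation, with made-up probabilities:

```python
# Made-up numbers: P(A ^ B) = 0.03, P(A) = 0.05, P(B) = 0.30
p_A_and_B, p_A, p_B = 0.03, 0.05, 0.30

p_A_given_B = p_A_and_B / p_B       # 0.10
p_B_given_A = p_A_and_B / p_A       # 0.60

# Bayes' rule recovers P(A|B) from P(B|A), P(A), and P(B):
print(p_A * p_B_given_A / p_B)      # 0.10, matching p_A_given_B
```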
Conditional Probabilities
Note the difference:
P(A|B) is small
P(B|A) is large
(e.g., when event A lies almost entirely inside a much larger event B: P(A|B) ≈ P(A)/P(B) is small, while P(B|A) ≈ 1)
Bayes’ Rule Applied to ML
P(class | F) = P(F | class) · P(class) / P(F)
(here P(class | F) is shorthand for P(class = c | f_1 = v_1, …, f_n = v_n))
Why do we care about Bayes’ rule? Because while P(class | F) is typically difficult to directly measure, the values on the RHS are often easy to estimate (especially if we make simplifying assumptions)
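A sketch of how the right-hand-side quantities combine for a two-class problem; the probability estimates below are made up purely for illustration:

```python
# Assumed (made-up) estimates for one particular feature vector F.
p_F_given_class = {"+": 0.020, "-": 0.001}   # P(F | class)
p_class = {"+": 0.30, "-": 0.70}             # P(class), the class priors

# P(F) is the same for every class, so it simply normalizes the scores.
unnormalized = {c: p_F_given_class[c] * p_class[c] for c in p_class}
p_F = sum(unnormalized.values())
posterior = {c: unnormalized[c] / p_F for c in unnormalized}

print(posterior)   # P(class | F); here '+' wins even though it is the rarer class
```

With real data these estimates would come from training-set counts rather than being hard-coded.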