CMPS 142/242 Review Section, Fall 2011 (adapted from lecture slides)
Introduction
- Why and what of machine learning: learn from past experience to optimize a performance criterion
- Difference between supervised and unsupervised learning
Supervised Learning Terminology
- Domain: the set of all possible x vectors
- Hypothesis/Concept: a Boolean function on the domain
- Target: the concept to be learned
- Hypothesis class/space: the set of hypotheses (concepts) that can be output by a given learning algorithm
- Version space: all concepts in the hypothesis space consistent with the training set
- Noisy data: labels that may not be consistent with any single concept
Supervised Learning Terminology (cont.)
- Inductive bias
- Overfitting/Underfitting
- Feature selection
Bayesian Learning
- Maximum Likelihood
- Maximum a Posteriori
- Mean a Posteriori
- Generative/Discriminative models
- Bayes optimal prediction minimizes risk: the risk of predicting t on x is ∑_{t'} L(t, t') P(t'|x) (see the sketch below)
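A minimal sketch of Bayes optimal prediction under an arbitrary loss: pick the label whose expected loss under the posterior is smallest. The loss matrix, posterior values, and function name are illustrative, not from the slides.

```python
import numpy as np

def bayes_optimal_prediction(loss, posterior):
    """Pick the label t minimizing expected loss sum_{t'} L(t, t') P(t' | x).

    loss[t, t_true] is the loss of predicting t when the truth is t_true;
    posterior[t_true] is P(t_true | x).  Both are illustrative inputs.
    """
    expected_risk = loss @ posterior          # risk of each candidate prediction t
    return int(np.argmin(expected_risk))

# Asymmetric loss can make the optimal prediction differ from argmax of P(t | x).
loss = np.array([[0.0, 10.0],   # predicting 0 when truth is 1 costs 10
                 [1.0,  0.0]])  # predicting 1 when truth is 0 costs 1
posterior = np.array([0.7, 0.3])
print(bayes_optimal_prediction(loss, posterior))  # -> 1, even though P(t=0|x) is larger
```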
Instance Based Learning
- K-nearest neighbor (a minimal sketch follows this slide)
- Edited NN: reduce memory and computation by storing only "important" points
- Instance-based density estimation: histogram method
- Smoothing models: regressogram, running mean smoother, kernel smoother
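A minimal k-nearest-neighbor sketch, assuming Euclidean distance and integer class labels; the array names and toy data are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored instance
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = np.bincount(y_train[nearest])         # count labels among the neighbors
    return int(np.argmax(votes))

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```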
Decision Trees
- Impurity measures (compared in the sketch below):
  - Gini index: 2p(1-p)
  - Entropy: -p lg p - (1-p) lg(1-p)
  - Error rate: 1 - max(p, 1-p)
  - Generalized entropy for multiple classes
- Avoiding overfitting: pre-pruning, post-pruning
- Random Forests
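A short sketch of the three binary impurity measures from the slide, written as plain functions of the positive-class fraction p; the comparison loop at the end is illustrative.

```python
import numpy as np

def gini(p):
    """Gini index 2p(1-p) for a binary node with positive-class fraction p."""
    return 2.0 * p * (1.0 - p)

def entropy(p):
    """Binary entropy -p lg p - (1-p) lg(1-p), with the convention 0 lg 0 = 0."""
    terms = [q * np.log2(q) for q in (p, 1.0 - p) if q > 0.0]
    return -sum(terms)

def error_rate(p):
    """Misclassification impurity 1 - max(p, 1-p)."""
    return 1.0 - max(p, 1.0 - p)

for p in (0.0, 0.1, 0.5):
    print(p, gini(p), entropy(p), error_rate(p))
# All three are 0 for a pure node (p = 0) and maximal at p = 0.5.
```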
Naïve Bayes
- Naïve independence assumption: P(x | t) = ∏_j P(x_j | t)
- Predict the label t maximizing P(t) ∏_j P(x_j | t) (sketch below)
- Numeric features: use a Gaussian or other density
- Attributes for text classification? Bag-of-words model
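A minimal sketch of the prediction rule with Gaussian densities for numeric features, assuming per-class means and variances have already been estimated; parameter names and the toy numbers are illustrative. Log-probabilities are used so the product of many small factors does not underflow.

```python
import numpy as np

def gaussian_nb_predict(x, priors, means, variances):
    """Predict the class t maximizing log P(t) + sum_j log N(x_j; mu_{t,j}, var_{t,j})."""
    log_scores = []
    for t in range(len(priors)):
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances[t])
                                + (x - means[t]) ** 2 / variances[t])
        log_scores.append(np.log(priors[t]) + log_lik)
    return int(np.argmax(log_scores))

# Two classes, two numeric features
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [2.0, 2.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
print(gaussian_nb_predict(np.array([1.8, 2.1]), priors, means, variances))  # -> 1
```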
Linear Regression
- Basis functions
- From maximum likelihood to least squares (eq. 3.11-3.12)
- Maximum likelihood weight vector (eq. 3.15); see the sketch below
- Sequential learning / stochastic gradient descent
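A minimal sketch of the maximum-likelihood weight vector, which is the least-squares solution w_ML = (Φᵀ Φ)⁻¹ Φᵀ t, computed here via the pseudo-inverse for numerical stability. The polynomial basis functions and the toy data are illustrative choices, not from the slides.

```python
import numpy as np

def design_matrix(x, degree=3):
    """Design matrix with polynomial basis functions phi_j(x) = x**j (illustrative)."""
    return np.vstack([x ** j for j in range(degree + 1)]).T

def ml_weights(Phi, t):
    """Maximum-likelihood weights w_ML = pinv(Phi) @ t,
    i.e. the least-squares solution (Phi^T Phi)^-1 Phi^T t."""
    return np.linalg.pinv(Phi) @ t

# Fit a cubic to noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
print(ml_weights(design_matrix(x), t))  # coefficients of the fitted polynomial
```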
Linear Regression (cont.)
- Regularized least squares (sketch below)
- Multiple outputs: use the same basis functions for all components of the target vector
- The bias-variance decomposition
- Predictive distribution
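Regularized least squares only changes the normal equations by adding a λI term; a minimal sketch under that standard formulation, with the regularization strength and the data being illustrative.

```python
import numpy as np

def ridge_weights(Phi, t, lam=0.1):
    """Regularized least-squares weights w = (lam*I + Phi^T Phi)^-1 Phi^T t.

    lam is an illustrative regularization strength; lam = 0 recovers the
    plain maximum-likelihood (least-squares) solution.
    """
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

# Shrinks the cubic-fit weights relative to the unregularized solution
x = np.linspace(0, 1, 20)
Phi = np.vander(x, 4, increasing=True)   # columns 1, x, x^2, x^3
t = np.sin(2 * np.pi * x)
print(ridge_weights(Phi, t, lam=1.0))
```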
Linear Classification
- Linear threshold: w·x = w_1 x_1 + w_2 x_2 + … + w_n x_n ≥ w_0
- Multi-class: learn a linear function y_k for each class k, y_k(x) = w_k^T x + w_{k,0}, and predict class k if y_k(x) > y_j(x) for all other j
- Perceptron: if (w·x_i) t_i ≤ 0 then a mistake was made, so update w ← w + η t_i x_i (sketch below)
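A minimal perceptron training loop using the mistake-driven update above, for labels t_i in {-1, +1}; the learning rate, epoch cap, and toy data are illustrative, and the threshold w_0 is absorbed as a constant feature.

```python
import numpy as np

def perceptron(X, t, eta=1.0, epochs=100):
    """Whenever (w.x_i) t_i <= 0, update w <- w + eta * t_i * x_i."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # absorb the threshold w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, t_i in zip(X, t):
            if (w @ x_i) * t_i <= 0:               # misclassified (or on the boundary)
                w += eta * t_i * x_i
                mistakes += 1
        if mistakes == 0:                          # converged on separable data
            break
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])
print(perceptron(X, t))
```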
Linear Discriminant Analysis
- For each class y, estimate P(x | y) with a Gaussian (same covariance matrix for all classes)
- Estimate the prior P(y) as the fraction of training data with class y
- Predict the y maximizing P(y) P(x | y), or equivalently log P(y) + log P(x | y)
Logistic Regression
- Gives a distribution on labels: p(t=1 | x, w) = 1 / (1 + e^{-w·x})
- Use gradient descent to learn w (sketch below)
- w·x equals the log odds: log( p(t=1 | w, x) / p(t=0 | w, x) )
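A minimal batch gradient-descent sketch for logistic regression with labels in {0, 1}; the gradient of the negative log-likelihood is Xᵀ(σ(Xw) - t). The step size, iteration count, lack of a bias term, and toy data are illustrative simplifications.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, t, eta=0.1, epochs=1000):
    """Batch gradient descent on the negative log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - t)   # gradient of the NLL
        w -= eta * grad / len(t)
    return w

X = np.array([[0.5, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
t = np.array([1, 1, 0, 0])
w = logistic_regression(X, t)
print(sigmoid(X @ w))   # predicted p(t=1 | x) for each training point
```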
Artificial Neural Networks
- Activation: a_j = ∑_i w_{j,i} z_i (forward-pass sketch below)
- Node j outputs z_j = f_j(a_j); common choices for f are tanh and the logistic sigmoid
- Backpropagation is used to learn the weights
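A minimal forward-pass sketch for one hidden layer, showing the activation and output computations only (backpropagation, which differentiates this same function, is omitted). Shapes, weight names, and the random initialization are illustrative.

```python
import numpy as np

def forward(x, W1, W2):
    """Forward pass: a_j = sum_i w_{j,i} z_i, hidden outputs z_j = tanh(a_j)."""
    a_hidden = W1 @ x                      # activations of the hidden nodes
    z_hidden = np.tanh(a_hidden)           # hidden outputs z_j = f(a_j)
    a_out = W2 @ z_hidden                  # output-layer activations
    return 1.0 / (1.0 + np.exp(-a_out))    # logistic sigmoid output

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
W1 = rng.standard_normal((3, 2))           # 2 inputs -> 3 hidden units
W2 = rng.standard_normal((1, 3))           # 3 hidden units -> 1 output
print(forward(x, W1, W2))
```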
Support Vector Machines
- Pick the linear threshold hypothesis with the biggest margin
- Find w, b such that: w·x_i + b ≥ γ when y_i = +1, and w·x_i + b ≤ -γ when y_i = -1, with the margin γ as big as possible
- Scaling issue: fix by setting γ = 1 and finding the shortest w: min_{w,b} ||w||² subject to y_i(w·x_i + b) ≥ 1 for all examples (x_i, y_i)
Kernel Functions
- Predictions (and finding the a_i's) depend only on dot products
- Can use any dot-product-like function K(x, x')
- K(x, z) is a kernel function if K(x, z) = φ(x)·φ(z) for some feature map φ (example below)
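A small numeric check of the definition: the degree-2 polynomial kernel on 2-D inputs equals an ordinary dot product in an explicit feature space. The specific kernel and feature map are illustrative examples.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x.z)^2 on 2-D inputs."""
    return (x @ z) ** 2

def phi(x):
    """Explicit feature map with poly_kernel(x, z) == phi(x).phi(z):
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), for the 2-D case only."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z), phi(x) @ phi(z))  # both 1.0: the kernel is a dot product in feature space
```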
Variants of SVM
- Multiclass
- Regression: f(x) = w^T x + w_0
- One-class
Clustering
- Hierarchical: creates a tree of clusterings
  - Agglomerative (bottom up: merge the "closest" clusters)
  - Divisive (top down, less common)
- Partitional: one set of clusters created, usually with the number of clusters supplied by the user
- Clusters can be overlapping (soft) or non-overlapping (hard)
Partitional Algorithms
- K-means (sketch below)
- Gaussian mixtures (EM)
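A minimal K-means (Lloyd's algorithm) sketch: alternately assign points to the nearest center and recompute each center as the mean of its assigned points. The random initialization, empty-cluster handling, and toy data are illustrative choices.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm with randomly chosen initial centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)          # nearest center for each point
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):      # converged
            break
        centers = new_centers
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(3.0, 0.1, (10, 2))])
centers, assign = kmeans(X, k=2)
print(centers)   # the two centers should end up near (0, 0) and (3, 3)
```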
Hidden Markov Models
Three basic problems of HMMs:
- Evaluation: given parameters and outputs, calculate P(outputs | parameters). Solved with dynamic programming (see the forward-algorithm sketch below).
- State sequence: given parameters and outputs, find the state sequence Q* maximizing the probability of generating the outputs. Solved with dynamic programming.
- Learning: given a set of observation sequences, find parameters maximizing the likelihood of the observation sequences. Solved with the EM algorithm.
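For the evaluation problem, the usual dynamic program is the forward algorithm; a minimal sketch for a discrete-output HMM, without the rescaling normally added for long sequences. The parameter names (pi, A, B) and the toy model are assumed notation, not from the slides.

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """Forward algorithm: P(obs | parameters) for a discrete-output HMM.

    pi[i]   = P(first state = i)
    A[i, j] = P(next state = j | current state = i)
    B[i, o] = P(output o | state = i)
    alpha[i] accumulates P(observations so far, current state = i).
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_likelihood([0, 1, 0], pi, A, B))
```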