CMPS 142/242 Review Section, Fall 2011 (adapted from lecture slides)
Introduction
- Why and what of machine learning: learn from past experience to optimize a performance criterion
- Difference between supervised and unsupervised learning
Supervised Learning Terminology
- Domain: the set of all possible x vectors
- Hypothesis/Concept: a Boolean function on the domain
- Target: the concept to be learned
- Hypothesis class/space: the set of hypotheses (concepts) that can be output by a given learning algorithm
- Version space: all concepts in the hypothesis space consistent with the training set
- Noisy data: labels that may not be consistent with any single concept
Supervised Learning Terminology (cont.)
- Inductive bias
- Overfitting/Underfitting
- Feature selection
Bayesian Learning
- Maximum Likelihood
- Maximum a Posteriori
- Mean a Posteriori
- Generative/Discriminative models
- Bayes optimal prediction minimizes risk: the risk of predicting t on x is ∑_{t'} L(t, t') P(t'|x) (see the sketch below)
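A minimal sketch of Bayes optimal prediction under an arbitrary loss: pick the label whose expected loss under the posterior is smallest. The loss matrix, posterior values, and function name are illustrative, not from the slides.

```python
import numpy as np

def bayes_optimal_prediction(loss, posterior):
    """Pick the label t minimizing expected loss sum_{t'} L(t, t') P(t' | x).

    loss[t, t_true] is the loss of predicting t when the truth is t_true;
    posterior[t_true] is P(t_true | x).  Both are illustrative inputs.
    """
    expected_risk = loss @ posterior          # risk of each candidate prediction t
    return int(np.argmin(expected_risk))

# Asymmetric loss can make the optimal prediction differ from argmax of P(t | x).
loss = np.array([[0.0, 10.0],   # predicting 0 when truth is 1 costs 10
                 [1.0,  0.0]])  # predicting 1 when truth is 0 costs 1
posterior = np.array([0.7, 0.3])
print(bayes_optimal_prediction(loss, posterior))  # -> 1, even though P(t=0|x) is larger
```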
Instance Based Learning
- K-nearest neighbor (a minimal sketch follows this slide)
- Edited NN: reduce memory and computation by storing only "important" points
- Instance-based density estimation: histogram method
- Smoothing models: regressogram, running mean smoother, kernel smoother
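A minimal k-nearest-neighbor sketch, assuming Euclidean distance and integer class labels; the array names and toy data are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored instance
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = np.bincount(y_train[nearest])         # count labels among the neighbors
    return int(np.argmax(votes))

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```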
Decision Trees
- Impurity measures (compared in the sketch below):
  - Gini index: 2p(1-p)
  - Entropy: -p lg p - (1-p) lg(1-p)
  - Error rate: 1 - max(p, 1-p)
  - Generalized entropy for multiple classes
- Avoiding overfitting: pre-pruning, post-pruning
- Random Forests
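A short sketch of the three binary impurity measures from the slide, written as plain functions of the positive-class fraction p; the comparison loop at the end is illustrative.

```python
import numpy as np

def gini(p):
    """Gini index 2p(1-p) for a binary node with positive-class fraction p."""
    return 2.0 * p * (1.0 - p)

def entropy(p):
    """Binary entropy -p lg p - (1-p) lg(1-p), with the convention 0 lg 0 = 0."""
    terms = [q * np.log2(q) for q in (p, 1.0 - p) if q > 0.0]
    return -sum(terms)

def error_rate(p):
    """Misclassification impurity 1 - max(p, 1-p)."""
    return 1.0 - max(p, 1.0 - p)

for p in (0.0, 0.1, 0.5):
    print(p, gini(p), entropy(p), error_rate(p))
# All three are 0 for a pure node (p = 0) and maximal at p = 0.5.
```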
Naïve Bayes
- Naïve independence assumption: P(x | t) = ∏_j P(x_j | t)
- Predict the label t maximizing P(t) ∏_j P(x_j | t) (sketch below)
- Numeric features: use a Gaussian or other density
- Attributes for text classification? Bag-of-words model
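A minimal sketch of the prediction rule with Gaussian densities for numeric features, assuming per-class means and variances have already been estimated; parameter names and the toy numbers are illustrative. Log-probabilities are used so the product of many small factors does not underflow.

```python
import numpy as np

def gaussian_nb_predict(x, priors, means, variances):
    """Predict the class t maximizing log P(t) + sum_j log N(x_j; mu_{t,j}, var_{t,j})."""
    log_scores = []
    for t in range(len(priors)):
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances[t])
                                + (x - means[t]) ** 2 / variances[t])
        log_scores.append(np.log(priors[t]) + log_lik)
    return int(np.argmax(log_scores))

# Two classes, two numeric features
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [2.0, 2.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
print(gaussian_nb_predict(np.array([1.8, 2.1]), priors, means, variances))  # -> 1
```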
Linear Regression
- Basis functions
- From maximum likelihood to least squares (eq. 3.11-3.12)
- Maximum likelihood weight vector (eq. 3.15); see the sketch below
- Sequential learning / stochastic gradient descent
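A minimal sketch of the maximum-likelihood weight vector, which is the least-squares solution w_ML = (Φᵀ Φ)⁻¹ Φᵀ t, computed here via the pseudo-inverse for numerical stability. The polynomial basis functions and the toy data are illustrative choices, not from the slides.

```python
import numpy as np

def design_matrix(x, degree=3):
    """Design matrix with polynomial basis functions phi_j(x) = x**j (illustrative)."""
    return np.vstack([x ** j for j in range(degree + 1)]).T

def ml_weights(Phi, t):
    """Maximum-likelihood weights w_ML = pinv(Phi) @ t,
    i.e. the least-squares solution (Phi^T Phi)^-1 Phi^T t."""
    return np.linalg.pinv(Phi) @ t

# Fit a cubic to noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
print(ml_weights(design_matrix(x), t))  # coefficients of the fitted polynomial
```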
Linear Regression (cont.)
- Regularized least squares (sketch below)
- Multiple outputs: use the same basis functions for all components of the target vector
- The bias-variance decomposition
- Predictive distribution
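Regularized least squares only changes the normal equations by adding a λI term; a minimal sketch under that standard formulation, with the regularization strength and the data being illustrative.

```python
import numpy as np

def ridge_weights(Phi, t, lam=0.1):
    """Regularized least-squares weights w = (lam*I + Phi^T Phi)^-1 Phi^T t.

    lam is an illustrative regularization strength; lam = 0 recovers the
    plain maximum-likelihood (least-squares) solution.
    """
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

# Shrinks the cubic-fit weights relative to the unregularized solution
x = np.linspace(0, 1, 20)
Phi = np.vander(x, 4, increasing=True)   # columns 1, x, x^2, x^3
t = np.sin(2 * np.pi * x)
print(ridge_weights(Phi, t, lam=1.0))
```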
Linear Classification
- Linear threshold: w·x = w_1 x_1 + w_2 x_2 + … + w_n x_n ≥ w_0
- Multi-class: learn a linear function y_k for each class k, y_k(x) = w_k^T x + w_{k,0}, and predict class k if y_k(x) > y_j(x) for all other j
- Perceptron: if (w·x_i) t_i ≤ 0 then a mistake was made, so update w ← w + η t_i x_i (sketch below)
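A minimal perceptron training loop using the mistake-driven update above, for labels t_i in {-1, +1}; the learning rate, epoch cap, and toy data are illustrative, and the threshold w_0 is absorbed as a constant feature.

```python
import numpy as np

def perceptron(X, t, eta=1.0, epochs=100):
    """Whenever (w.x_i) t_i <= 0, update w <- w + eta * t_i * x_i."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # absorb the threshold w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, t_i in zip(X, t):
            if (w @ x_i) * t_i <= 0:               # misclassified (or on the boundary)
                w += eta * t_i * x_i
                mistakes += 1
        if mistakes == 0:                          # converged on separable data
            break
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])
print(perceptron(X, t))
```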
Linear Discriminant Analysis
- For each class y, estimate P(x | y) with a Gaussian (same covariance matrix for all classes)
- Estimate the prior P(y) as the fraction of training data with class y
- Predict the y maximizing P(y) P(x | y), or equivalently log P(y) + log P(x | y)
Logistic Regression
- Gives a distribution on labels: p(t=1 | x, w) = 1 / (1 + e^{-w·x})
- Use gradient descent to learn w (sketch below)
- w·x equals the log odds: log( p(t=1 | w, x) / p(t=0 | w, x) )
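A minimal batch gradient-descent sketch for logistic regression with labels in {0, 1}; the gradient of the negative log-likelihood is Xᵀ(σ(Xw) - t). The step size, iteration count, lack of a bias term, and toy data are illustrative simplifications.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, t, eta=0.1, epochs=1000):
    """Batch gradient descent on the negative log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - t)   # gradient of the NLL
        w -= eta * grad / len(t)
    return w

X = np.array([[0.5, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
t = np.array([1, 1, 0, 0])
w = logistic_regression(X, t)
print(sigmoid(X @ w))   # predicted p(t=1 | x) for each training point
```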
Artificial Neural Networks
- Activation: a_j = ∑_i w_{j,i} z_i (forward-pass sketch below)
- Node j outputs z_j = f_j(a_j); common choices for f are tanh and the logistic sigmoid
- Backpropagation is used to learn the weights
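A minimal forward-pass sketch for one hidden layer, showing the activation and output computations only (backpropagation, which differentiates this same function, is omitted). Shapes, weight names, and the random initialization are illustrative.

```python
import numpy as np

def forward(x, W1, W2):
    """Forward pass: a_j = sum_i w_{j,i} z_i, hidden outputs z_j = tanh(a_j)."""
    a_hidden = W1 @ x                      # activations of the hidden nodes
    z_hidden = np.tanh(a_hidden)           # hidden outputs z_j = f(a_j)
    a_out = W2 @ z_hidden                  # output-layer activations
    return 1.0 / (1.0 + np.exp(-a_out))    # logistic sigmoid output

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
W1 = rng.standard_normal((3, 2))           # 2 inputs -> 3 hidden units
W2 = rng.standard_normal((1, 3))           # 3 hidden units -> 1 output
print(forward(x, W1, W2))
```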
Support Vector Machines
- Pick the linear threshold hypothesis with the biggest margin
- Find w, b such that: w·x_i + b ≥ γ when y_i = +1, and w·x_i + b ≤ -γ when y_i = -1, with the margin γ as big as possible
- Scaling issue: fix by setting γ = 1 and finding the shortest w: min_{w,b} ||w||² subject to y_i(w·x_i + b) ≥ 1 for all examples (x_i, y_i)
Kernel Functions
- Predictions (and finding the a_i's) depend only on dot products
- Can use any dot-product-like function K(x, x')
- K(x, z) is a kernel function if K(x, z) = φ(x)·φ(z) for some feature map φ (example below)
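A small numeric check of the definition: the degree-2 polynomial kernel on 2-D inputs equals an ordinary dot product in an explicit feature space. The specific kernel and feature map are illustrative examples.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x.z)^2 on 2-D inputs."""
    return (x @ z) ** 2

def phi(x):
    """Explicit feature map with poly_kernel(x, z) == phi(x).phi(z):
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), for the 2-D case only."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z), phi(x) @ phi(z))  # both 1.0: the kernel is a dot product in feature space
```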
Variants of SVM
- Multiclass
- Regression: f(x) = w^T x + w_0
- One-class
Clustering
- Hierarchical: creates a tree of clusterings
  - Agglomerative (bottom up: merge the "closest" clusters)
  - Divisive (top down, less common)
- Partitional: one set of clusters created, usually with the number of clusters supplied by the user
- Clusters can be overlapping (soft) or non-overlapping (hard)
Partitional Algorithms
- K-means (sketch below)
- Gaussian mixtures (EM)
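A minimal K-means (Lloyd's algorithm) sketch: alternately assign points to the nearest center and recompute each center as the mean of its assigned points. The random initialization, empty-cluster handling, and toy data are illustrative choices.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm with randomly chosen initial centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)          # nearest center for each point
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):      # converged
            break
        centers = new_centers
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(3.0, 0.1, (10, 2))])
centers, assign = kmeans(X, k=2)
print(centers)   # the two centers should end up near (0, 0) and (3, 3)
```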
Hidden Markov Models
Three basic problems of HMMs:
- Evaluation: given parameters and outputs, calculate P(outputs | parameters). Solved with dynamic programming (see the forward-algorithm sketch below).
- State sequence: given parameters and outputs, find the state sequence Q* maximizing the probability of generating the outputs. Solved with dynamic programming.
- Learning: given a set of observation sequences, find parameters maximizing the likelihood of the observation sequences. Solved with the EM algorithm.
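For the evaluation problem, the usual dynamic program is the forward algorithm; a minimal sketch for a discrete-output HMM, without the rescaling normally added for long sequences. The parameter names (pi, A, B) and the toy model are assumed notation, not from the slides.

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """Forward algorithm: P(obs | parameters) for a discrete-output HMM.

    pi[i]   = P(first state = i)
    A[i, j] = P(next state = j | current state = i)
    B[i, o] = P(output o | state = i)
    alpha[i] accumulates P(observations so far, current state = i).
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_likelihood([0, 1, 0], pi, A, B))
```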