Machine Learning Basics 周岚
Supervised Learning Algorithms
Learn to associate some input with some output, given a training set of examples of inputs x and outputs y.
Probabilistic Supervised Learning
Estimate a probability distribution p(y | x).
Use maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).
Examples: linear regression and logistic regression.
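A minimal sketch of maximum likelihood estimation for logistic regression, using plain gradient ascent on the average log-likelihood (the toy dataset, step size, and iteration count below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_steps=1000):
    """Maximum likelihood estimate of theta for p(y=1 | x; theta) = sigmoid(theta^T x)."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant feature for the bias
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = sigmoid(X @ theta)             # predicted p(y=1 | x) for every example
        grad = X.T @ (y - p) / len(y)      # gradient of the mean log-likelihood
        theta += lr * grad                 # gradient ascent step
    return theta

# Toy usage (hypothetical data): two overlapping Gaussian blobs with labels 0/1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.hstack([np.zeros(50), np.ones(50)])
theta = fit_logistic_regression(X, y)
```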
Support Vector Machines
One of the most influential approaches to supervised learning.
Similar to logistic regression in that it is driven by a linear function wᵀx + b.
Kernel trick: rewrite the linear function as b + Σi αi xᵀx(i), where x(i) is a training example and α is a vector of coefficients.
Replace x by the output of a given feature function φ(x), and replace the dot product with a kernel function k(x, x(i)) = φ(x)·φ(x(i)).
Support Vector Machines
Make predictions using the function f(x) = b + Σi αi k(x, x(i)).
The kernel trick is powerful: it allows us to learn models that are nonlinear in x, and kernel functions often admit an implementation that is significantly more computationally efficient than explicitly constructing φ(x).
A common choice is the Gaussian kernel k(u, v) = exp(−||u − v||² / (2σ²)).
Support Vector Machines
A drawback of kernel machines: the cost of evaluating the decision function is linear in the number of training examples.
SVMs mitigate this by learning an α vector that contains mostly zeros; classifying a new example then requires evaluating the kernel only for the training examples that have nonzero αi (the support vectors).
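A minimal sketch of evaluating the kernel-machine decision function f(x) = b + Σi αi k(x, x(i)) with a Gaussian kernel; the support vectors, coefficients α, bias b, and bandwidth σ below are hypothetical placeholders (in practice they come from the SVM training procedure):

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel k(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((u - v) ** 2, axis=-1) / (2.0 * sigma ** 2))

def svm_decision_function(x, support_vectors, alpha, b, sigma=1.0):
    """f(x) = b + sum_i alpha_i k(x, x_i), summing only over the support vectors."""
    k = gaussian_kernel(support_vectors, x, sigma)  # kernel between x and each support vector
    return b + np.dot(alpha, k)

# Hypothetical trained parameters: two support vectors, their coefficients, and a bias.
support_vectors = np.array([[0.0, 1.0], [2.0, -1.0]])
alpha = np.array([0.7, -0.4])
b = 0.1
x_new = np.array([1.0, 0.0])
label = 1 if svm_decision_function(x_new, support_vectors, alpha, b) > 0 else -1
```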
Other Simple Supervised Learning Algorithms
k-nearest neighbors: very high capacity, so it can obtain high accuracy given a large training set.
Weaknesses: high computational cost at prediction time, it can generalize very badly given a small, finite training set, and it cannot learn that one feature is more discriminative than another.
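A minimal brute-force sketch of k-nearest-neighbors classification by majority vote (for illustration only; the toy data is made up and no effort is made to be computationally efficient):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training examples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest examples
    votes = y_train[nearest].astype(int)
    return np.bincount(votes).argmax()                 # most common label among the neighbors

# Toy usage with made-up data:
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```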
Other Simple Supervised Learning Algorithms
Decision trees: break the input space into regions, with a separate output in each region.
They are non-parametric if allowed to grow to arbitrary size, but in practice are typically regularized with size constraints.
Unsupervised Learning Algorithms
A classic unsupervised learning task: find the “best” representation of the data, one that preserves as much information about x as possible while keeping the representation simpler or more accessible than x itself.
Three common ways of defining a simpler representation: lower-dimensional representations, sparse representations, and independent representations.
Principal Components Analysis
PCA learns a representation that has lower dimensionality than the original input and whose elements have no linear correlation with each other.
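A minimal NumPy sketch of PCA via the singular value decomposition of the centered design matrix (using SVD rather than an eigendecomposition of the covariance is just one common implementation choice; the toy data is made up):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its directions of greatest variance; the new features are linearly uncorrelated."""
    X_centered = X - X.mean(axis=0)                  # PCA assumes zero-mean data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:n_components].T                          # top principal directions as columns
    return X_centered @ W                            # lower-dimensional, decorrelated representation

# Toy usage: strongly correlated 2-D data reduced to 1 dimension.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = np.hstack([z, 2 * z + 0.1 * rng.normal(size=(200, 1))])
X_reduced = pca(X, n_components=1)
```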
k-means Clustering
Divides the training set into k different clusters of examples that are near each other.
The algorithm initializes k different centroids {μ(1), . . . , μ(k)} to different values, then alternates between two steps until convergence: each training example is assigned to cluster i, where i is the index of the nearest centroid μ(i); then each centroid μ(i) is updated to the mean of all training examples x(j) assigned to cluster i.
One difficulty is that the clustering problem is inherently ill-posed.
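A minimal NumPy sketch of the two alternating steps (initializing centroids from random training examples and running a fixed number of iterations are simplifying assumptions; a fuller implementation would test for convergence):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate assignment and centroid-update steps on data X of shape (m, n)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialize centroids from the data
    for _ in range(n_iters):
        # Assignment step: each example goes to the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the examples assigned to it.
        for i in range(k):
            if np.any(assign == i):
                centroids[i] = X[assign == i].mean(axis=0)
    return centroids, assign

# Toy usage: two well-separated blobs, k = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, assign = kmeans(X, k=2)
```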
Stochastic Gradient Descent
An extension of the gradient descent algorithm.
Gradient descent requires computing the full gradient ∇θ J(θ) = (1/m) Σi ∇θ L(x(i), y(i), θ) over all m training examples.
Computational cost per step: O(m), which grows prohibitively as the training set size m grows.
Stochastic Gradient Descent
On each step, sample a minibatch of examples B = {x(1), . . . , x(m′)} drawn uniformly from the training set, where the minibatch size m′ is typically small.
The estimate of the gradient is formed as g = (1/m′) Σi ∇θ L(x(i), y(i), θ), summing over the minibatch.
The stochastic gradient descent algorithm then follows the estimated gradient downhill: θ ← θ − εg, where ε is the learning rate.
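A minimal sketch of the update loop; the per-example gradient function grad_loss and the squared-error example below are illustrative choices, not part of the slides:

```python
import numpy as np

def sgd(grad_loss, theta, X, y, lr=0.01, batch_size=32, n_steps=1000, seed=0):
    """Stochastic gradient descent: follow a minibatch estimate of the gradient downhill."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)           # sample a minibatch B
        g = np.mean([grad_loss(theta, X[i], y[i]) for i in idx], axis=0)   # gradient estimate g
        theta = theta - lr * g                                             # theta <- theta - epsilon * g
    return theta

# Example per-example loss: squared error L = (theta^T x - y)^2 for linear regression.
def grad_loss(theta, x, y):
    return 2.0 * (theta @ x - y) * x

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta = sgd(grad_loss, np.zeros(3), X, y)   # should approach true_theta
```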
Building a Machine Learning Algorithm
Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure, and a model.
Example, the linear regression algorithm: a dataset consisting of X and y; the cost function J(w, b) = −E(x,y)∼p̂data log p_model(y | x); the model specification p_model(y | x) = N(y; xᵀw + b, 1); and an optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations.
Recognizing that most machine learning algorithms can be described using this recipe helps to see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.
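As an illustration of the optimization ingredient, a minimal sketch of linear regression solved in closed form with the normal equations, w = (XᵀX)⁻¹Xᵀy (appending a column of ones to handle the bias b is an assumption of this sketch; the data is made up):

```python
import numpy as np

def linear_regression_normal_equations(X, y):
    """Solve for where the gradient of the squared-error cost is zero: w = (X^T X)^(-1) X^T y."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # column of ones so the last weight acts as the bias b
    w = np.linalg.solve(X.T @ X, X.T @ y)          # more stable than explicitly forming the inverse
    return w[:-1], w[-1]                           # weights and bias

# Toy usage with made-up data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + 0.5
w, b = linear_regression_normal_equations(X, y)    # w ~ [3, -1], b ~ 0.5
```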
Challenges Motivating Deep Learning
The simple machine learning algorithms described so far work very well on a wide variety of important problems, but they have not succeeded in solving the central problems in AI, such as recognizing speech or recognizing objects.
The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.
The Curse of Dimensionality
Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality.
The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases: with v distinguishable values per variable and d variables there are v^d configurations, quickly far exceeding the number of available training examples.
Local Constancy and Smoothness Regularization
In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn.
One of the most widely used implicit “priors” is the smoothness prior: the function we learn should not change very much within a small region.
Local Constancy and Smoothness Regularization
Smoothness-based methods work extremely well provided there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned.
If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples.
The core idea in deep learning is to assume that the data was generated by the composition of factors or features, potentially at multiple levels in a hierarchy.
Manifold Learning
A manifold is a connected region: a set of points, associated with a neighborhood around each point, that locally appears to be a Euclidean space.
Manifold Learning
Many machine learning problems seem hopeless if we expect the algorithm to learn functions with interesting variations across all of R^n.
Manifold learning algorithms surmount this obstacle by assuming that most of R^n consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds.
Evidence in favor of this assumption: the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated, and we can also imagine such neighborhoods and transformations, at least informally.
Manifold Learning
When the data lies on a low-dimensional manifold, it is most natural to represent the data in terms of coordinates on the manifold rather than in terms of coordinates in R^n.
Example: roads as 1-D manifolds embedded in 3-D space.
Extracting these manifold coordinates is challenging, but holds the promise of improving many machine learning algorithms.
Thank you!