Machine Learning Basics


Machine Learning Basics 周岚

Supervised Learning Algorithms
- Learn to associate some input with some output, given a training set of examples of inputs x and outputs y.

Probabilistic Supervised Learning
- Estimate a probability distribution p(y | x).
- Use maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).
- Examples: linear regression, logistic regression.
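A minimal sketch of this idea for logistic regression: the parameters are fit by maximizing the likelihood of p(y | x; w, b), i.e. minimizing the negative log-likelihood by gradient descent. The synthetic data, learning rate, and iteration count below are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Minimal sketch: logistic regression fit by maximum likelihood
# (equivalently, minimizing the mean negative log-likelihood) with gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # synthetic inputs
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)    # synthetic binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)           # model's estimate of p(y = 1 | x; w, b)
    grad_w = X.T @ (p - y) / len(y)  # gradient of the mean negative log-likelihood
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print("learned weights:", w, "bias:", b)
```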

Support Vector Machines
- One of the most influential approaches to supervised learning.
- Similar to logistic regression in that it is driven by a linear function w^T x + b.
- Kernel trick: rewrite the linear function as b + Σ_i α_i x^T x^(i), where x^(i) is a training example and α is a vector of coefficients.
- Replace x by the output of a given feature function φ(x), and replace the dot product with a kernel function k(x, x^(i)) = φ(x) · φ(x^(i)).

Support Vector Machines
- Make predictions using the function f(x) = b + Σ_i α_i k(x, x^(i)).
- The kernel trick is powerful: it allows us to learn models that are nonlinear in x, and it often admits an implementation that is significantly more computationally efficient.
- A commonly used kernel is the Gaussian kernel k(u, v) = N(u − v; 0, σ²I).

Support Vector Machines
- Drawback of kernel machines: the cost of evaluating the decision function is linear in the number of training examples.
- Mitigate this by learning an α vector that contains mostly zeros.
- Classifying a new example then requires evaluating the kernel only for the training examples with nonzero α_i (the support vectors).
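A minimal sketch of a Gaussian-kernel SVM, using scikit-learn's SVC (a tooling choice assumed here, not named in the slides); the toy data and the kernel hyperparameters are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Minimal sketch: a kernel SVM with a Gaussian (RBF) kernel on a nonlinear problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # circular decision boundary

clf = SVC(kernel="rbf", gamma=1.0, C=1.0)
clf.fit(X, y)

# Only the support vectors (examples with nonzero alpha) are kept, so the cost of
# evaluating the decision function scales with their number, not the full training set.
print("number of support vectors:", clf.support_vectors_.shape[0])
print("prediction for a new point:", clf.predict([[0.2, 0.1]]))
```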

Other Simple Supervised Learning Algorithms
k-nearest neighbors
- Very high capacity: can obtain high accuracy given a large training set.
- Weaknesses: high computational cost; generalizes very badly given a small, finite training set; cannot learn that one feature is more discriminative than another.
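A minimal sketch of k-nearest neighbors with scikit-learn (an assumed tool); the synthetic data and k = 5 are illustrative. Note that "training" only memorizes the data, and every prediction must search the stored examples, which is the computational cost the slide mentions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Minimal sketch: k-NN has very high capacity but stores and searches the
# whole training set at prediction time.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 2))
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # "fitting" just memorizes the training examples

print(knn.predict([[0.5, -0.2]]))    # prediction searches the stored examples
print(knn.score(X_train, y_train))   # near-perfect accuracy on the training set itself
```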

Other Simple Supervised Learning Algorithms
Decision trees
- Non-parametric if allowed to grow arbitrarily large.
- In practice regularized with size constraints, such as a maximum depth or a minimum number of examples per leaf.
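A minimal sketch of a size-constrained decision tree using scikit-learn (an assumed tool); the Iris dataset and max_depth=3 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Minimal sketch: a decision tree regularized with a size constraint (max_depth).
X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # size constraint as regularizer
tree.fit(X, y)

print("tree depth:", tree.get_depth())
print("training accuracy:", round(tree.score(X, y), 3))
```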

Unsupervised Learning Algorithms
- A classic unsupervised learning task: find the "best" representation of the data, one that preserves as much information about x as possible while keeping the representation simpler or more accessible than x itself.
- Three ways of defining a simpler representation: lower-dimensional representations, sparse representations, independent representations.

Principal Components Analysis
- PCA learns a representation that has lower dimensionality than the original input and whose elements have no linear correlation with each other.

Principal Components Analysis
[Figure slide: illustration of PCA learning a lower-dimensional, linearly decorrelated representation of the data.]
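A minimal sketch of PCA via the SVD of the centered data matrix; the synthetic data and target dimensionality k = 2 are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: PCA computed from the SVD of the centered data matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated synthetic data

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2                   # target dimensionality
W = Vt[:k].T            # top-k principal directions
Z = X_centered @ W      # lower-dimensional, linearly decorrelated representation

# Sanity check: the covariance of Z is (numerically) diagonal,
# i.e. its elements have no linear correlation with each other.
print(np.round(np.cov(Z, rowvar=False), 3))
```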

k-means Clustering
- Divides the training set into k different clusters of examples that are near each other.
- Initialize k different centroids {μ^(1), . . . , μ^(k)} to different values.
- Alternate between two steps until convergence:
  - each training example is assigned to cluster i, where i is the index of the nearest centroid μ^(i);
  - each centroid μ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
- One difficulty is that the clustering problem is inherently ill-posed.
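A minimal sketch of the two alternating steps; the synthetic data, k = 3, and the fixed iteration count are illustrative assumptions (convergence checks and empty-cluster handling are omitted for brevity).

```python
import numpy as np

# Minimal sketch of k-means: alternate assignment and centroid-update steps.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (-3, 0, 3)])

k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialize to random examples

for _ in range(20):
    # Assignment step: each example goes to its nearest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned examples.
    # (Empty clusters are not handled here, for brevity.)
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

print("centroids:\n", centroids)
```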

Stochastic Gradient Descent
- An extension of the gradient descent algorithm.
- Gradient descent requires computing the full-batch gradient ∇_θ J(θ) = (1/m) Σ_{i=1..m} ∇_θ L(x^(i), y^(i), θ).
- Computational cost per step: O(m).

Stochastic Gradient Descent
- Sample a minibatch of examples B = {x^(1), . . . , x^(m′)}.
- The estimate of the gradient is formed as g = (1/m′) ∇_θ Σ_{i=1..m′} L(x^(i), y^(i), θ).
- The stochastic gradient descent algorithm then follows the estimated gradient downhill: θ ← θ − ε g, where ε is the learning rate.
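A minimal sketch of minibatch SGD for linear regression with squared loss; the synthetic data, minibatch size, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: minibatch SGD for linear regression (squared loss).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eps = 0.05          # learning rate (epsilon in the slides)
batch_size = 32     # minibatch size m'

for step in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient estimate on the minibatch
    w -= eps * grad                                  # follow the estimate downhill

print("estimated weights:", np.round(w, 3))
```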

Building a Machine Learning Algorithm
- Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure, and a model.
- Linear regression, for example, combines:
  - a dataset consisting of X and y;
  - a cost function (the negative log-likelihood, which for this model is equivalent to mean squared error);
  - the model specification (a linear function of the input);
  - an optimization algorithm defined by solving for where the gradient of the cost is zero, using the normal equations.
- Recognizing that most machine learning algorithms can be described using this recipe helps us see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.
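A minimal sketch of this recipe for linear regression, with the "optimizer" being a closed-form solve of the normal equations; the synthetic dataset is an illustrative assumption.

```python
import numpy as np

# Minimal sketch of the recipe: dataset + squared-error cost + linear model
# + an optimizer that sets the gradient of the cost to zero (normal equations).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.3]) + 0.1 * rng.normal(size=200)

# Solving X^T X w = X^T y; lstsq is the numerically stable way to do this.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

cost = np.mean((X @ w - y) ** 2)  # mean squared error of the fitted model
print("weights:", np.round(w, 3), "MSE:", round(float(cost), 4))
```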

Challenges Motivating Deep Learning
- The simple machine learning algorithms described above work very well on a wide variety of important problems, but they have not succeeded in solving the central problems in AI.
- The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.

The Curse of Dimensionality
- Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high; this is known as the curse of dimensionality.
- The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.

The Curse of Dimensionality
[Figure slide: illustration of how the number of distinct regions to cover grows exponentially as the dimensionality increases.]
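A minimal sketch of the counting argument: with a fixed budget of samples, the fraction of grid cells that contain any data collapses as the number of dimensions grows, because the number of cells is bins_per_dim^n. The bin count and sample size below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: with a fixed number of samples, the fraction of occupied
# grid cells shrinks rapidly as the number of dimensions grows.
rng = np.random.default_rng(0)
bins_per_dim = 10
n_samples = 1000

for n_dims in (1, 2, 3, 5):
    X = rng.random(size=(n_samples, n_dims))
    cells = np.floor(X * bins_per_dim).astype(int)      # which cell each sample falls in
    occupied = len({tuple(row) for row in cells})
    total = bins_per_dim ** n_dims                       # exponential in n_dims
    print(f"{n_dims} dims: {occupied}/{total} cells occupied ({occupied / total:.4f})")
```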

Local Constancy and Smoothness Regularization
- In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn.
- A widely used implicit "prior" is the smoothness (local constancy) prior: the function we learn should not change very much within a small region.

Local Constancy and Smoothness Regularization
[Figure slide: illustration of the smoothness prior, with the learned function approximately constant within small regions around the training examples.]

Local Constancy and Smoothness Regularization
- These priors work extremely well provided there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned.
- If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples.
- The core idea in deep learning: assume the data was generated by the composition of factors or features, potentially at multiple levels in a hierarchy.

Manifold Learning
- A manifold is a connected region: a set of points, each associated with a neighborhood, that locally appears to be a Euclidean space.

Manifold Learning
- Many machine learning problems seem hopeless if we expect the learned function to have interesting variations across all of R^n.
- Manifold learning algorithms surmount this obstacle by assuming that most of R^n consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds.
- Evidence in favor of this assumption: the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated; we can also imagine neighborhoods of such examples and transformations between them, at least informally.

Manifold Learning
- When the data lies on a low-dimensional manifold, represent the data in terms of coordinates on the manifold rather than in terms of coordinates in R^n.
- Example: roads as 1-D manifolds embedded in 3-D space.
- Extracting these manifold coordinates is challenging, but holds the promise to improve many machine learning algorithms.
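A minimal sketch of extracting manifold coordinates, using Isomap from scikit-learn on a synthetic "swiss roll" (a 2-D surface embedded in 3-D); Isomap is one of several manifold learning algorithms, and the dataset and parameters are illustrative assumptions.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Minimal sketch: recover 2-D manifold coordinates for points lying on a
# 2-D swiss-roll surface embedded in 3-D space.
X, _ = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

embedding = Isomap(n_neighbors=10, n_components=2)
Z = embedding.fit_transform(X)  # coordinates on the manifold, not in R^3

print("ambient shape:", X.shape, "-> manifold coordinates shape:", Z.shape)
```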

Thank you!