Machine Learning Basics


Machine Learning Basics 周岚

Supervised Learning Algorithms
- Learn to associate some input with some output, given a training set of examples of inputs x and outputs y.

Probabilistic Supervised Learning
- Estimate a probability distribution p(y | x).
- Use maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).
- Examples: linear regression, logistic regression.
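A minimal sketch of this idea for logistic regression: the parameters are fit by maximizing the likelihood of p(y | x; w, b), i.e. minimizing the negative log-likelihood by gradient descent. The synthetic data, learning rate, and iteration count below are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Minimal sketch: logistic regression fit by maximum likelihood
# (equivalently, minimizing the mean negative log-likelihood) with gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # synthetic inputs
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)    # synthetic binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)           # model's estimate of p(y = 1 | x; w, b)
    grad_w = X.T @ (p - y) / len(y)  # gradient of the mean negative log-likelihood
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print("learned weights:", w, "bias:", b)
```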

Support Vector Machines
- One of the most influential approaches to supervised learning.
- Similar to logistic regression in that it is driven by a linear function w^T x + b.
- Kernel trick: rewrite the linear function as b + Σ_i α_i x^T x^(i), where x^(i) is a training example and α is a vector of coefficients.
- Replace x by the output of a given feature function φ(x), and replace the dot product with a kernel function k(x, x^(i)) = φ(x) · φ(x^(i)).

Support Vector Machines
- Make predictions using the function f(x) = b + Σ_i α_i k(x, x^(i)).
- The kernel trick is powerful: it allows us to learn models that are nonlinear in x, and it often admits an implementation that is significantly more computationally efficient.
- A commonly used kernel is the Gaussian kernel k(u, v) = N(u − v; 0, σ²I).

Support Vector Machines
- Drawback of kernel machines: the cost of evaluating the decision function is linear in the number of training examples.
- Mitigate this by learning an α vector that contains mostly zeros.
- Classifying a new example then requires evaluating the kernel only for the training examples with nonzero α_i (the support vectors).
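A minimal sketch of a Gaussian-kernel SVM, using scikit-learn's SVC (a tooling choice assumed here, not named in the slides); the toy data and the kernel hyperparameters are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Minimal sketch: a kernel SVM with a Gaussian (RBF) kernel on a nonlinear problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # circular decision boundary

clf = SVC(kernel="rbf", gamma=1.0, C=1.0)
clf.fit(X, y)

# Only the support vectors (examples with nonzero alpha) are kept, so the cost of
# evaluating the decision function scales with their number, not the full training set.
print("number of support vectors:", clf.support_vectors_.shape[0])
print("prediction for a new point:", clf.predict([[0.2, 0.1]]))
```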

Other Simple Supervised Learning Algorithms
k-nearest neighbors
- Very high capacity: can obtain high accuracy given a large training set.
- Weaknesses: high computational cost; generalizes very badly given a small, finite training set; cannot learn that one feature is more discriminative than another.
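A minimal sketch of k-nearest neighbors with scikit-learn (an assumed tool); the synthetic data and k = 5 are illustrative. Note that "training" only memorizes the data, and every prediction must search the stored examples, which is the computational cost the slide mentions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Minimal sketch: k-NN has very high capacity but stores and searches the
# whole training set at prediction time.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 2))
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # "fitting" just memorizes the training examples

print(knn.predict([[0.5, -0.2]]))    # prediction searches the stored examples
print(knn.score(X_train, y_train))   # near-perfect accuracy on the training set itself
```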

Other Simple Supervised Learning Algorithms
Decision trees
- Non-parametric if allowed to grow arbitrarily large.
- In practice regularized with size constraints, such as a maximum depth or a minimum number of examples per leaf.
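A minimal sketch of a size-constrained decision tree using scikit-learn (an assumed tool); the Iris dataset and max_depth=3 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Minimal sketch: a decision tree regularized with a size constraint (max_depth).
X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # size constraint as regularizer
tree.fit(X, y)

print("tree depth:", tree.get_depth())
print("training accuracy:", round(tree.score(X, y), 3))
```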

Unsupervised Learning Algorithms
- A classic unsupervised learning task: find the "best" representation of the data, one that preserves as much information about x as possible while keeping the representation simpler or more accessible than x itself.
- Three ways of defining a simpler representation: lower-dimensional representations, sparse representations, independent representations.

Principal Components Analysis
- PCA learns a representation that has lower dimensionality than the original input and whose elements have no linear correlation with each other.

Principal Components Analysis
[Figure slide: illustration of PCA learning a lower-dimensional, linearly decorrelated representation of the data.]
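A minimal sketch of PCA via the SVD of the centered data matrix; the synthetic data and target dimensionality k = 2 are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: PCA computed from the SVD of the centered data matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated synthetic data

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2                   # target dimensionality
W = Vt[:k].T            # top-k principal directions
Z = X_centered @ W      # lower-dimensional, linearly decorrelated representation

# Sanity check: the covariance of Z is (numerically) diagonal,
# i.e. its elements have no linear correlation with each other.
print(np.round(np.cov(Z, rowvar=False), 3))
```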

k-means Clustering
- Divides the training set into k different clusters of examples that are near each other.
- Initialize k different centroids {μ^(1), . . . , μ^(k)} to different values.
- Alternate between two steps until convergence:
  - each training example is assigned to cluster i, where i is the index of the nearest centroid μ^(i);
  - each centroid μ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
- One difficulty is that the clustering problem is inherently ill-posed.
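A minimal sketch of the two alternating steps; the synthetic data, k = 3, and the fixed iteration count are illustrative assumptions (convergence checks and empty-cluster handling are omitted for brevity).

```python
import numpy as np

# Minimal sketch of k-means: alternate assignment and centroid-update steps.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (-3, 0, 3)])

k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialize to random examples

for _ in range(20):
    # Assignment step: each example goes to its nearest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned examples.
    # (Empty clusters are not handled here, for brevity.)
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

print("centroids:\n", centroids)
```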

Stochastic Gradient Descent
- An extension of the gradient descent algorithm.
- Gradient descent requires computing the full-batch gradient ∇_θ J(θ) = (1/m) Σ_{i=1..m} ∇_θ L(x^(i), y^(i), θ).
- Computational cost per step: O(m).

Stochastic Gradient Descent
- Sample a minibatch of examples B = {x^(1), . . . , x^(m′)}.
- The estimate of the gradient is formed as g = (1/m′) ∇_θ Σ_{i=1..m′} L(x^(i), y^(i), θ).
- The stochastic gradient descent algorithm then follows the estimated gradient downhill: θ ← θ − ε g, where ε is the learning rate.
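A minimal sketch of minibatch SGD for linear regression with squared loss; the synthetic data, minibatch size, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: minibatch SGD for linear regression (squared loss).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eps = 0.05          # learning rate (epsilon in the slides)
batch_size = 32     # minibatch size m'

for step in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient estimate on the minibatch
    w -= eps * grad                                  # follow the estimate downhill

print("estimated weights:", np.round(w, 3))
```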

Building a Machine Learning Algorithm
- Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure, and a model.
- Linear regression, for example, combines:
  - a dataset consisting of X and y;
  - a cost function (the negative log-likelihood, which for this model is equivalent to mean squared error);
  - the model specification (a linear function of the input);
  - an optimization algorithm defined by solving for where the gradient of the cost is zero, using the normal equations.
- Recognizing that most machine learning algorithms can be described using this recipe helps us see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.
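A minimal sketch of this recipe for linear regression, with the "optimizer" being a closed-form solve of the normal equations; the synthetic dataset is an illustrative assumption.

```python
import numpy as np

# Minimal sketch of the recipe: dataset + squared-error cost + linear model
# + an optimizer that sets the gradient of the cost to zero (normal equations).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.3]) + 0.1 * rng.normal(size=200)

# Solving X^T X w = X^T y; lstsq is the numerically stable way to do this.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

cost = np.mean((X @ w - y) ** 2)  # mean squared error of the fitted model
print("weights:", np.round(w, 3), "MSE:", round(float(cost), 4))
```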

Challenges Motivating Deep Learning
- The simple machine learning algorithms described above work very well on a wide variety of important problems, but they have not succeeded in solving the central problems in AI.
- The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.

The Curse of Dimensionality
- Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high; this is known as the curse of dimensionality.
- The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.

The Curse of Dimensionality
[Figure slide: illustration of how the number of distinct regions to cover grows exponentially as the dimensionality increases.]
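A minimal sketch of the counting argument: with a fixed budget of samples, the fraction of grid cells that contain any data collapses as the number of dimensions grows, because the number of cells is bins_per_dim^n. The bin count and sample size below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: with a fixed number of samples, the fraction of occupied
# grid cells shrinks rapidly as the number of dimensions grows.
rng = np.random.default_rng(0)
bins_per_dim = 10
n_samples = 1000

for n_dims in (1, 2, 3, 5):
    X = rng.random(size=(n_samples, n_dims))
    cells = np.floor(X * bins_per_dim).astype(int)      # which cell each sample falls in
    occupied = len({tuple(row) for row in cells})
    total = bins_per_dim ** n_dims                       # exponential in n_dims
    print(f"{n_dims} dims: {occupied}/{total} cells occupied ({occupied / total:.4f})")
```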

Local Constancy and Smoothness Regularization
- In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn.
- A widely used implicit "prior" is the smoothness (local constancy) prior: the function we learn should not change very much within a small region.

Local Constancy and Smoothness Regularization
[Figure slide: illustration of the smoothness prior, with the learned function approximately constant within small regions around the training examples.]

Local Constancy and Smoothness Regularization
- These priors work extremely well provided there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned.
- If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples.
- The core idea in deep learning: assume the data was generated by the composition of factors or features, potentially at multiple levels in a hierarchy.

Manifold Learning
- A manifold is a connected region: a set of points, each associated with a neighborhood, that locally appears to be a Euclidean space.

Manifold Learning
- Many machine learning problems seem hopeless if we expect the learned function to have interesting variations across all of R^n.
- Manifold learning algorithms surmount this obstacle by assuming that most of R^n consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds.
- Evidence in favor of this assumption: the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated; we can also imagine neighborhoods of such examples and transformations between them, at least informally.

Manifold Learning
- When the data lies on a low-dimensional manifold, represent the data in terms of coordinates on the manifold rather than in terms of coordinates in R^n.
- Example: roads as 1-D manifolds embedded in 3-D space.
- Extracting these manifold coordinates is challenging, but holds the promise to improve many machine learning algorithms.
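A minimal sketch of extracting manifold coordinates, using Isomap from scikit-learn on a synthetic "swiss roll" (a 2-D surface embedded in 3-D); Isomap is one of several manifold learning algorithms, and the dataset and parameters are illustrative assumptions.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Minimal sketch: recover 2-D manifold coordinates for points lying on a
# 2-D swiss-roll surface embedded in 3-D space.
X, _ = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

embedding = Isomap(n_neighbors=10, n_components=2)
Z = embedding.fit_transform(X)  # coordinates on the manifold, not in R^3

print("ambient shape:", X.shape, "-> manifold coordinates shape:", Z.shape)
```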

Thank you!