Machine learning overview

Machine learning overview A computational method that improves performance on a task by using training data. (The slide's figure shows a neural network, but other ML methods can be substituted.)

Example: Linear regression Task: predict y from x, using ŷ = w^T x or ŷ = w^T x + b. Other forms are possible, such as ŷ = w^T φ(x) with nonlinear features φ(x) (e.g., polynomial terms); this is still linear in the parameters w. Loss: mean squared error (MSE) between predictions and targets, MSE = (1/N) Σ_i (ŷ_i − y_i)².
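
A minimal NumPy sketch of this setup (the synthetic data and variable names are my own, not from the slides): fit the weights by least squares and report the MSE.

```python
import numpy as np

# Synthetic 1-D data: y = 2x + 1 plus noise (illustrative values only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(100)

# Design matrix with a bias column, so the model is y_hat = w0 + w1 * x.
X = np.column_stack([np.ones_like(x), x])

# The least-squares solution minimizes the MSE in the parameters w.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w
mse = np.mean((y_hat - y) ** 2)
print("w =", w, "MSE =", mse)
```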

Capacity and data fitting Capacity: a measure of a model's ability to fit complex data. Increased capacity means we can make the training error small. Overfitting: like memorizing the training inputs; capacity is large enough to reproduce the training data, but the model does poorly on test data (too much capacity for the available data). Underfitting: like ignoring details; not enough capacity for the detail in the data.

Capacity and data fitting

Capacity and generalization error

Regularization Sometimes minimizing the performance or loss function directly promotes overfitting, e.g., the Runge phenomenon when interpolating by a polynomial through evenly spaced points. (Figure: red = target function, blue = degree-5 fit, green = degree-9 fit.) The output is linear in the coefficients.
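
A small sketch illustrating the Runge phenomenon mentioned above, assuming the classic target 1/(1 + 25x²) (the slide does not name the exact target function): fit degree-5 and degree-9 polynomials through evenly spaced samples and compare how far they stray from the target between the sample points.

```python
import numpy as np

# Classic Runge function (an assumption; the slide does not name the exact target).
def runge(x):
    return 1.0 / (1.0 + 25.0 * x ** 2)

# Evenly spaced sample points on [-1, 1].
x_train = np.linspace(-1, 1, 10)
y_train = runge(x_train)

# Dense grid for measuring how the fits behave between the samples.
x_test = np.linspace(-1, 1, 400)

for degree in (5, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    y_fit = np.polyval(coeffs, x_test)
    max_err = np.max(np.abs(y_fit - runge(x_test)))
    print(f"degree {degree}: max error on dense grid = {max_err:.3f}")
```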

Regularization Can get a better fit by using a penalty on the coefficients, e.g., minimizing MSE + λ‖w‖² (an L2 penalty) rather than the MSE alone.
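
A minimal sketch of such a penalty, here L2 (ridge) regression via the normal equations; the λ value and feature setup are illustrative choices of mine, not from the slides.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 via the normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Degree-9 polynomial features for the Runge-style example above.
x_train = np.linspace(-1, 1, 10)
y_train = 1.0 / (1.0 + 25.0 * x_train ** 2)
X = np.vander(x_train, N=10, increasing=True)   # columns 1, x, x^2, ..., x^9

w_unreg = ridge_fit(X, y_train, lam=0.0)    # ordinary least squares
w_ridge = ridge_fit(X, y_train, lam=1e-3)   # small L2 penalty shrinks the coefficients
print("max |w| without penalty:", np.max(np.abs(w_unreg)))
print("max |w| with penalty:   ", np.max(np.abs(w_ridge)))
```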

Example: Classification Task: predict one of several classes for a given input. E.g., decide whether a movie review is positive or negative, or identify one of several possible topics for a news piece. Output: a probability distribution over the possible outcomes. Loss: cross-entropy (a way to compare distributions).
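
A minimal sketch of this output/loss pairing (the scores, class count, and function names are my own): turn raw scores into a probability distribution with a softmax and compare it to a one-hot target with cross-entropy.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores (logits) into a probability distribution."""
    z = scores - np.max(scores)          # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

# Hypothetical model scores for 3 classes and a one-hot target (class 1 is correct).
scores = np.array([2.0, 0.5, -1.0])
target = np.array([0.0, 1.0, 0.0])

q = softmax(scores)                      # predicted distribution
loss = -np.sum(target * np.log(q))       # cross-entropy against the one-hot target
print("predicted distribution:", q)
print("cross-entropy loss:", loss)
```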

Information For a probability distribution p(X) of a random variable X, define the information of outcome x to be (log = natural log) I(x) = −log p(x). This is 0 if p(x) = 1 (no information from a certain outcome) and is large if p(x) is near 0 (lots of information if the event is unlikely). Additivity: if X and Y are independent, then information is additive: I(x, y) = −log p(x, y) = −log [p(x) p(y)] = −log p(x) − log p(y) = I(x) + I(y).

Entropy Entropy is the expected information of a random variable: H(X) = E[I(X)] = −Σ_x p(x) log p(x). Note that 0 log 0 is taken to be 0. Entropy is a measure of the unpredictability of a random variable. For a given set of states, equal probabilities give maximum entropy.

Cross-entropy Compares one distribution to another. Suppose we have distributions p and q on the same set Ω. Then H(p, q) = −E_p[log q]. In the discrete case, H(p, q) = −Σ_x p(x) log q(x).
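
A short numerical sketch of these definitions (the example distributions are my own): compute H(p) and H(p, q) and observe that H(p, q) ≥ H(p), with equality when q = p.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p log p, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum p log q (natural log, discrete case)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

p = np.array([0.5, 0.3, 0.2])   # example distributions (my own choice)
q = np.array([0.4, 0.4, 0.2])

print("H(p)    =", entropy(p))
print("H(p, q) =", cross_entropy(p, q))   # always >= H(p), equal only when q = p
print("H(p, p) =", cross_entropy(p, p))
```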

Cross entropy as loss function Question: given p(x), what q(x) minimizes the cross-entropy (in the discrete case)? Constrained optimization: minimize −Σ_i p_i log q_i subject to Σ_i q_i = 1 and q_i ≥ 0.

Constrained optimization More general constrained optimization: minimize f(x) subject to g_i(x) = 0 and h_j(x) ≤ 0. Here f is the objective function (loss), the g_i are the equality constraints, and the h_j are the inequality constraints. With no constraints we would look for a point where the gradient of f vanishes, but we need to include the constraints.

Intuition Given the constraint g = 0, move along it looking for points where f stops changing, since these might be minima. Two possibilities: we could be following a contour line of f (f does not change along its contour lines), so the contour lines of f and g are parallel there; or we are at a local minimum of f itself (the gradient of f is 0). https://en.wikipedia.org/wiki/Lagrange_multiplier

Intuition If the contours of f and g are parallel, then the gradients of f and g are parallel. Thus we want points (x, y) where g(x, y) = 0 and ∇f(x, y) = λ ∇g(x, y) for some λ. This is the idea of Lagrange multipliers. https://en.wikipedia.org/wiki/Lagrange_multiplier
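
A quick numerical check of this idea on a made-up problem (the objective, constraint, and starting point are my own choices): solve a small equality-constrained problem with SciPy and verify that ∇f is a multiple of ∇g at the solution.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up example: minimize f(x, y) = (x - 1)^2 + (y - 2)^2 subject to g(x, y) = x + y - 1 = 0.
def f(v):
    return (v[0] - 1.0) ** 2 + (v[1] - 2.0) ** 2

def grad_f(v):
    return np.array([2.0 * (v[0] - 1.0), 2.0 * (v[1] - 2.0)])

def g(v):
    return v[0] + v[1] - 1.0

grad_g = np.array([1.0, 1.0])   # gradient of the linear constraint is constant

res = minimize(f, x0=[0.0, 0.0], method="SLSQP",
               constraints=[{"type": "eq", "fun": g}])
v = res.x
print("solution:", v)                          # expected (0, 1)

# At the solution, grad f should be a multiple of grad g: grad f = lambda * grad g.
lam = grad_f(v)[0] / grad_g[0]
print("lambda estimate:", lam)                 # expected -2
print("grad f - lambda * grad g:", grad_f(v) - lam * grad_g)   # ~ zero vector
```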

KKT conditions Start with: minimize f(x) subject to g_i(x) = 0 and h_j(x) ≤ 0. Make the Lagrangian function L(x, λ, μ) = f(x) + Σ_i λ_i g_i(x) + Σ_j μ_j h_j(x). Take the gradient and set it to 0 – but other conditions apply as well.

KKT conditions Make the Lagrangian function L(x, λ, μ) = f(x) + Σ_i λ_i g_i(x) + Σ_j μ_j h_j(x). Necessary conditions for a minimum are: stationarity, ∇_x L = 0; primal feasibility, g_i(x) = 0 and h_j(x) ≤ 0; dual feasibility, μ_j ≥ 0; and complementary slackness, μ_j h_j(x) = 0.

Cross entropy Exercise for the reader: use the KKT conditions to show that if the p_i are fixed, positive, and sum to 1, then the q_i that solve minimize −Σ_i p_i log q_i subject to Σ_i q_i = 1, q_i ≥ 0 are q_i = p_i. That is, the cross-entropy H(p, q) is minimized when q = p, where it equals the entropy H(p).
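
This is not a proof, but a quick numerical check of the exercise's claim for one assumed p (the distribution, bounds, and starting point below are my own choices):

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([0.5, 0.3, 0.2])                  # fixed target distribution (my own example)

def cross_entropy(q):
    return -np.sum(p * np.log(q))

# Minimize over q on the probability simplex: sum q = 1, q >= 0.
constraints = [{"type": "eq", "fun": lambda q: np.sum(q) - 1.0}]
bounds = [(1e-9, 1.0)] * len(p)                # keep each q_i > 0 so log(q) is defined
q0 = np.full(len(p), 1.0 / len(p))             # start from the uniform distribution

res = minimize(cross_entropy, q0, method="SLSQP", bounds=bounds, constraints=constraints)
print("optimal q:", res.x)                     # should be close to p
print("minimum cross-entropy:", res.fun)       # should be close to the entropy H(p)
print("H(p):", -np.sum(p * np.log(p)))
```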

Regularization and constraints Regularization is something like a weak constraint. E.g., for an L2 penalty, instead of requiring the weights to be small with a hard constraint like ‖w‖² ≤ c, we just prefer them to be small by adding λ‖w‖² to the objective function.