Introduction to Machine Learning


Introduction to Machine Learning. Prof. Eduardo Bezerra (CEFET/RJ), ebezerra@cefet-rj.br

Linear Regression

Overview: Introduction; Univariate Linear Regression (model representation, model evaluation, model optimization); Multivariate Linear Regression; Practical issues (feature scaling and feature engineering); Polynomial Regression.

Introduction

Linear regression problem (2D)

Notation: m = number of training examples; x = vector of input features; y = target value (a scalar).

Components. In order to apply any ML algorithm (including linear regression), we need to define three components: model representation, model evaluation, and model optimization.

Model representation

Representation (univariate case) A hypothesis is a function that maps from x's to y's. How can a hypothesis be represented in the univariate linear regression setting?
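
In the univariate setting, the hypothesis is an affine function of the single input feature, with an intercept parameter \theta_0 and a slope parameter \theta_1:

    h_\theta(x) = \theta_0 + \theta_1 x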

Model Evaluation (with a cost function)

Model parameters. Once we have the training set in hand and have defined the form of the hypothesis, how do we determine the parameters of the model? Idea: choose the combination of parameters such that the hypothesis produces values close to the target values y in the training set.

Mean Squared Error (MSE) measure. In linear regression, hypotheses are evaluated through the MSE measure.
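
Written out, the MSE-based cost function averages the squared errors over the m training examples (the factor 1/2 is the usual convention that simplifies the derivative):

    J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2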

Level curves for J (two parameters)

Model Optimization (parameter learning)

Gradient Descent algorithm. Given a cost function J, we want to determine the combination of parameter values that minimizes J. Optimization procedure: (1) initialize the parameter vector; (2) iterate: update the parameter vector so as to move toward the minimum value of J.

Gradient Descent algorithm. Updating must be simultaneous! Each update combines the partial derivative of J with the learning rate: α is a small positive constant, the learning rate (more on it later).
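
In symbols, each parameter is pushed against the gradient of J, and both parameters must be updated using the old values (the simultaneous update the slide insists on):

    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{for } j = 0, 1 \text{ (updated simultaneously)}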

Gradient Descent - intuition. We compute the derivative at the point corresponding to the current value of theta_1. The sign of the derivative tells us whether to move right or left (that is, increase or decrease theta_1): if the derivative at the point is positive, we should move left (decrease theta_1); if the derivative is negative, we should move right (increase theta_1).

Learning rate. α (the learning rate) is a hyperparameter that needs to be carefully chosen. How? Try multiple values and pick the best one, i.e., model selection (more later). Digression: AutoML.

GD for Linear Regression (univariate case). We plug the linear regression model into the gradient descent procedure. Batch Gradient Descent: in each iteration of the algorithm, the entire training set is used.

GD for Linear Regression (univariate case). Perform the updates simultaneously.
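
A minimal NumPy sketch of batch gradient descent for the univariate case (the function and variable names here are illustrative, not from the slides):

    import numpy as np

    def gradient_descent(x, y, alpha=0.01, n_iters=1000):
        # Batch gradient descent for h(x) = theta0 + theta1 * x.
        m = len(y)
        theta0, theta1 = 0.0, 0.0
        for _ in range(n_iters):
            h = theta0 + theta1 * x                   # predictions on the whole training set
            grad0 = (1.0 / m) * np.sum(h - y)         # dJ/dtheta0
            grad1 = (1.0 / m) * np.sum((h - y) * x)   # dJ/dtheta1
            # Both gradients are computed before either parameter changes:
            # this is the simultaneous update required above.
            theta0 = theta0 - alpha * grad0
            theta1 = theta1 - alpha * grad1
        return theta0, theta1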

Multivariate Linear Regression

Multiple features (n=2) Source: https://gerardnico.com

Multiple features (n=4)

Notation: x^{(i)} = the i-th training example; x_j^{(i)} = the value of the j-th feature in the i-th example; n = the number of features.

Model representation. Univariate case (n = 1) vs. multivariate case (n > 1). We can rewrite the multivariate linear regression hypothesis as a dot product (scalar product). Defining x_0^{(i)} = 1 is just a notational convenience that lets us write the hypothesis as a dot product of two (n+1)-dimensional vectors.
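
With the x_0 = 1 convention, the multivariate hypothesis reads:

    h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x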

Gradient Descent (n = 1)

Gradient Descent (n ≥ 1) Reminder:
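
The formula this slide reminds us of does not survive in the transcript; the standard update rule for the multivariate case is, for every j = 0, ..., n simultaneously:

    \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}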

Practical issue: feature scaling. Here we study the effect of the input features having very different scales.

Feature scaling
"Since the range of values of raw data varies widely, [...], objective functions will not work properly without normalization. [...] If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance. Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it." --Wikipedia, https://en.wikipedia.org/wiki/Feature_scaling
Scaled features make ML algorithms converge better and faster.

Why feature scaling? https://stackoverflow.com/questions/26225344/why-feature-scaling

Some feature scaling techniques: min-max scaling; z-score standardization; scaling to unit length. https://en.wikipedia.org/wiki/Feature_scaling
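
A minimal NumPy sketch of the three techniques on a toy feature column (the data values are made up for illustration):

    import numpy as np

    x = np.array([1.0, 4.0, 5.0, 10.0])              # a toy feature column

    x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max scaling: maps values to [0, 1]
    x_zscore = (x - x.mean()) / x.std()              # z-score: zero mean, unit variance
    x_unit   = x / np.linalg.norm(x)                 # scaling to unit length (L2 norm)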

Scaling techniques - example Source: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

Practical issue: feature engineering Here we study techniques to create new features.

Feature engineering In ML, we are not limited to using only the original features for training a model. Depending on the knowledge we have of a particular dataset, we can combine the original features to create new ones. This can lead to a better predictive model.

Feature engineering
"The algorithms we used are very standard for Kagglers. [...] We spent most of our efforts in feature engineering. [...] We were also very careful to discard features likely to expose us to the risk of overfitting our model." — Xavier Conort, "Q&A with Xavier Conort"
"...some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, "A Few Useful Things to Know about Machine Learning"
"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, "Machine Learning and AI via Brain simulations"
Quotes taken from https://en.wikipedia.org/wiki/Feature_engineering

Feature engineering - example
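
The example itself is not recoverable from the transcript; a common illustration in this style of course is combining the frontage and depth of a house lot into a single engineered feature, its area (all names and values below are hypothetical):

    import numpy as np

    # Hypothetical illustration: combine two raw features of a house lot
    # (frontage and depth) into a single engineered feature (area).
    frontage = np.array([50.0, 40.0, 70.0])    # lot width
    depth    = np.array([100.0, 80.0, 120.0])  # lot depth
    area     = frontage * depth                # new feature, used in place of the raw pair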

Polynomial regression Here we study how to get approximations for non-linear functions.

Polynomial regression. A method to find a hypothesis that corresponds to a polynomial (quadratic, cubic, ...). Related to the idea of feature engineering. It allows us to use the multivariate linear regression (MLR) machinery to find hypotheses for more complicated functions.
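
A minimal NumPy sketch of the idea, assuming a cubic hypothesis: treat x, x^2, and x^3 as three features of a multivariate linear regression and fit them by least squares (the data here is synthetic, for illustration):

    import numpy as np

    x = np.linspace(0.0, 2.0, 20)
    y = 1.0 + 2.0 * x - 0.5 * x**3             # synthetic target values

    X = np.column_stack([np.ones_like(x), x, x**2, x**3])  # design matrix, with x_0 = 1
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)          # least-squares parameter fit
    y_hat = X @ theta                                      # fitted values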

Polynomial regression - example

Polynomial regression - example (cont.)

Polynomial regression - example (cont.) The definition of adequate features involves both insight and knowledge of the problem domain. This example illustrates that we can transform features quite flexibly; it presents a reasonable choice of hypothesis, using the square root of the house size.

Polynomial regression vs. feature scaling. Scaling features is even more important in the context of polynomial regression, since powers of a feature span very different ranges.