Machine Learning Week 2 Lecture 1

Quiz and Hand-In Data. Test what you know so I can adapt! We need data for the hand-in.

Quiz: any problems? Any questions?

Recap: Supervised Learning. An unknown target f, a data set, a hypothesis set, and a learning algorithm that outputs a hypothesis h with h(x) ≈ f(x). Classification example: handwritten digit recognition (10 classes). Regression example: target is house price; inputs are size, rooms, age, garage, ...; data is historical house sales.

Linear Models. Example: target is house price; inputs are size, rooms, age, garage, ...; data is historical house sales. Weight each input dimension so that it affects the target function in a useful way, e.g. house price = 1234·1 + 88·size + 42·rooms − 666·age + 0.01·garage. In general h(x) = θ0·x0 + θ1·x1 + θ2·x2 + θ3·x3 + θ4·x4 = θᵀx (a matrix product, with x0 = 1 for the bias). The model is linear in θ, so the inputs may first be passed through a nonlinear transform.
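
As a quick illustration, here is a minimal NumPy sketch of such a linear model with an optional nonlinear transform. The weights reuse the slide's made-up example; the sample house and the transform are assumptions for illustration.

```python
import numpy as np

# Weights from the slide's made-up example: [bias, size, rooms, age, garage]
theta = np.array([1234.0, 88.0, 42.0, -666.0, 0.01])

def predict(x, theta):
    """Linear model h(x) = theta^T x, with x0 = 1 prepended for the bias term."""
    x = np.concatenate(([1.0], x))
    return theta @ x

# A hypothetical house: size=120, rooms=4, age=30, garage=1
print(predict(np.array([120.0, 4.0, 30.0, 1.0]), theta))

def phi(x):
    """Example nonlinear feature transform; the model stays linear in theta."""
    return np.concatenate((x, [x[0] ** 2]))  # e.g. add size^2 as an extra feature
```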

Three Models. Classification (perceptron): classify with the sign of wᵀx. Linear regression: predict wᵀx directly. Logistic regression (estimating probabilities): classify y = 1 if the estimated probability is at least 1/2, which is equivalent to wᵀx ≥ 0.
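
Written out (the formulas were images on the slide; this is the standard form, assumed here):

```latex
h(x) = \sigma(w^{\top}x) = \frac{1}{1 + e^{-w^{\top}x}},
\qquad
\text{classify } y = 1 \iff h(x) \ge \tfrac{1}{2} \iff w^{\top}x \ge 0.
```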

Maximum Likelihood. Assumption: independent data, so the likelihood is a product over the data points. Use the logarithm to turn the product into a sum, then optimize. For logistic regression this gives the cross-entropy error:
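
The formula itself was an image on the slide; a standard reconstruction, assuming independent data and labels y_n in {-1, +1}, is:

```latex
L(w) = \prod_{n=1}^{N} P(y_n \mid x_n, w)
\quad\Rightarrow\quad
E_{\text{in}}(w) = -\frac{1}{N}\ln L(w) = \frac{1}{N}\sum_{n=1}^{N}\ln\!\left(1 + e^{-y_n w^{\top} x_n}\right).
```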

Convex Optimization. (Figures: a convex and a non-convex function, with the chord between (x, f(x)) and (y, f(y)) and the tangent line f(x) + f'(x)(y − x).) In a convex optimization problem the objective f and the inequality constraints g are convex and the equality constraints h are affine. For convex problems, local minima are global minima.
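
The two properties illustrated in the figures, written out (standard definitions, not taken from the slide itself):

```latex
f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y), \quad \lambda \in [0,1]
\qquad\text{and}\qquad
f(y) \ge f(x) + \nabla f(x)^{\top}(y - x).
```

The first says the chord lies above the graph; the second (for differentiable f) says the tangent at x lies below it, which is why a point with zero gradient must be a global minimum.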

Descent Methods (for f twice continuously differentiable). Iteratively move toward a better solution: pick a start point x; repeat until a stopping criterion is satisfied: compute a descent direction v, compute a step size t by line search, and update x = x + t·v.
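
A minimal Python sketch of this loop with a backtracking line search; the objective, constants, and stopping rule are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def descent(f, grad, x, tol=1e-8, max_iter=1000):
    """Generic descent loop: negative-gradient direction, backtracking line search."""
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:               # stopping criterion
            break
        v = -g                                    # descent direction
        t = 1.0
        while f(x + t * v) > f(x) + 0.5 * t * g @ v:   # backtrack until sufficient decrease
            t *= 0.5
        x = x + t * v                             # update
    return x

# Illustrative objective: f(x, y) = x^2 + 10 y^2
f = lambda z: z[0] ** 2 + 10 * z[1] ** 2
grad = lambda z: np.array([2.0 * z[0], 20.0 * z[1]])
print(descent(f, grad, np.array([10.0, 1.0])))    # approaches the minimum at (0, 0)
```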

Simple Gradient Descent. Pick a start point x and a learning rate LR = 0.1. Repeat for 50 rounds: set the descent direction v = −∇f(x) and update x = x + LR·v. Here the descent direction is the negative gradient and the step size is the fixed learning rate.
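
The same loop with the slide's fixed learning rate and round count, as a sketch (the objective is again an assumption):

```python
import numpy as np

def simple_gradient_descent(grad, x, lr=0.1, rounds=50):
    """Fixed-step gradient descent: lr = 0.1, 50 rounds by default."""
    for _ in range(rounds):
        v = -grad(x)        # descent direction: negative gradient
        x = x + lr * v      # fixed step size instead of a line search
    return x

grad = lambda z: np.array([2.0 * z[0], 4.0 * z[1]])        # gradient of x^2 + 2 y^2 (assumed objective)
print(simple_gradient_descent(grad, np.array([10.0, 1.0])))  # converges toward (0, 0)
```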

Learning Rate. (Figures: gradient descent runs with different learning rates.) Do you want code for generating these figures?

Gradient Descent Can Jump Around. (Figure: exact line search starting from (10, 1).) With exact line search, consecutive descent directions are orthogonal, so the iterates zigzag toward the minimum.

Gradient Checking. If you use gradient descent, make sure you compute the gradient correctly. Choose a small h and compute the two-sided estimate (f(x + h) − f(x − h)) / (2h); the two-sided formula reduces the estimation error significantly compared to the one-sided one. For an n-dimensional gradient, apply the formula to each variable separately and compare against your analytic gradient. Usually works well.
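
A small sketch of such a check in NumPy (the test function and expected error size are assumptions):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Two-sided finite-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)   # two-sided formula, one variable at a time
    return grad

f = lambda z: z[0] ** 2 + 10 * z[1] ** 2
analytic_grad = lambda z: np.array([2.0 * z[0], 20.0 * z[1]])
x = np.array([3.0, -1.5])
print(np.max(np.abs(numerical_gradient(f, x) - analytic_grad(x))))  # should be tiny (~1e-8 or less)
```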

Hand-in 1 (supervised learning). It comes online after class today. It includes Matlab examples but not a long intro; Google is your friend, and questions are always welcome. Get busy. (Figure: sample handwritten digit images with their labels.)

Today: learning feasibility, the probabilistic approach, and learning formalized.

Learning Diagram. (Diagram: an unknown target f generates the data set (x1, y1), ..., (xn, yn); the learning algorithm searches the hypothesis set and outputs a hypothesis h with h(x) ≈ f(x).)

Impossibility of Learning! (Table: a truth table over binary inputs x1, x2, x3, with f(x) given on the observed rows and "?" on the rest.) What is f? With three binary inputs there are 2^8 = 256 potential functions; since three of the eight input points are unobserved, 2^3 = 8 of those functions have in-sample error 0, and the data cannot distinguish between them on the unseen points. Assumptions are needed.
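
A quick enumeration to verify those counts; since the slide's table did not survive, the five observed points and their labels below are an illustrative assumption.

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))             # the 8 possible binary inputs
observed = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1,
            (0, 1, 1): 0, (1, 0, 0): 1}              # hypothetical 5 observed points

functions = list(product([0, 1], repeat=8))          # all 2^8 = 256 Boolean functions on 3 bits
consistent = [f for f in functions
              if all(f[inputs.index(x)] == y for x, y in observed.items())]
print(len(functions), len(consistent))               # 256 8
```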

No Free Lunch. "All models are wrong, but some models are useful." (George Box.) Machine learning has many different models and algorithms, and there is no single model that works best for all problems (the No Free Lunch theorem). Assumptions that work well in one domain may fail in another.

Probabilistic Games

Probabilistic Approach. Flip a coin with unknown bias μ, repeated N times independently, e.g. the sample h,h,h,t,t,h,t,t,h. The sample mean is ν = #heads/N. What does the sample mean say about μ? With certainty? Nothing, really. Probabilistically? Yes: the sample mean is likely close to the bias.

Hoeffding's Inequality (binary variables). The sample mean ν is probably close to the coin bias μ, and the probability increases with the number of samples N: P(|ν − μ| > ε) ≤ 2e^(−2ε²N). The bound is independent of the sample mean and of the actual probability distribution P(x). The sample mean is probably approximately correct (PAC).
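
A small simulation comparing the empirical deviation frequency with the bound; the bias, N, and epsilon below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.6, 100, 0.1, 100_000

nu = rng.binomial(N, mu, size=trials) / N       # sample mean of each of `trials` experiments
empirical = np.mean(np.abs(nu - mu) > eps)      # how often the deviation exceeds eps
bound = 2 * np.exp(-2 * eps ** 2 * N)           # Hoeffding: 2 exp(-2 eps^2 N)
print(empirical, bound)                         # the empirical frequency should sit below the bound
```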

Classification Connection: testing a hypothesis. Take an unknown target f, a fixed hypothesis h, and a probability distribution over x. Then μ is the probability of picking an x such that f(x) ≠ h(x), and 1 − μ is the probability of picking an x such that f(x) = h(x); μ is simply the sum of the probabilities of all points x where the hypothesis is wrong. In this analogy the sample mean is the in-sample error and μ is the out-of-sample error.
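
In the textbook's notation (assumed here), the correspondence is:

```latex
E_{\text{out}}(h) = \Pr_{x \sim P}\big[h(x) \ne f(x)\big] = \mu,
\qquad
E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N}\mathbb{1}\big[h(x_n) \ne f(x_n)\big] = \nu.
```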

Learning Diagram (updated). (Diagram: an unknown target f together with an unknown input probability distribution P(x) generate the data set (x1, y1), ..., (xn, yn); the learning algorithm searches the hypothesis set and outputs a hypothesis h with h(x) ≈ f(x).)

Coins to Hypotheses. A sample of size N, e.g. h,h,h,t,t,h,t,t,h, has a known sample mean ν but an unknown bias μ; each hypothesis plays the role of such a coin.

Not Learning Yet. The hypothesis was fixed before seeing the data, so we are only verifying a fixed hypothesis. Every hypothesis has its own error (a different coin for each hypothesis). In learning, the training algorithm picks the "best" hypothesis from the set after seeing the data, so Hoeffding no longer applies directly; Hoeffding has left the building again.

Coin Analogy (Exercise 1.10 in the book). Flip a fair coin 10 times; what is the probability of 10 heads? Now repeat with 1000 coins; what is the probability that some coin shows 10 heads? Approximately 63%.
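
The arithmetic behind those numbers:

```latex
P(\text{10 heads}) = \left(\tfrac{1}{2}\right)^{10} = \tfrac{1}{1024} \approx 0.001,
\qquad
P(\text{some coin shows 10 heads}) = 1 - \left(1 - \tfrac{1}{1024}\right)^{1000} \approx 0.623.
```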

Crude Approach. The probability that the chosen hypothesis generalizes badly is at most the probability that it is true for some hypothesis in the set. Apply the union bound, and then Hoeffding to each hypothesis.
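
Spelled out for a finite hypothesis set {h_1, ..., h_M} (a standard derivation, assumed here):

```latex
\Pr\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]
\;\le\; \sum_{m=1}^{M} \Pr\big[\,|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\,\big]
\;\le\; 2M e^{-2\epsilon^2 N}.
```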

Result. For a classification problem (error when f(x) ≠ h(x)), a finite hypothesis set with M hypotheses, and a data set with N points, the bound is P(|Ein − Eout| > ε) ≤ 2M e^(−2ε²N). It captures the idea of what we are looking for: model complexity seems to be a factor. But our "simple" linear models have hypothesis sets of infinite size...

New Learning Diagram. (Diagram: an unknown target f and an input probability distribution P(x) generate the data set (x1, y1), ..., (xn, yn); the learning algorithm searches the hypothesis set, here taken to be finite, and outputs a hypothesis h with h(x) ≈ f(x).)

Learning Feasibility. Deterministically, with no assumptions: not so much. Probabilistically: yes. Generalization means the out-of-sample error is close to the in-sample error; learning also means making the in-sample error small. If the target function is complex, learning should be harder, yet the bound does not seem to care. But complex targets need complex hypothesis sets, which should increase the bound; M is a very simple and crude measure of that complexity.

Error Functions. User specified and heavily problem dependent. Example: an identity system based on fingerprints (is the person who he says he is?). The possible outcomes form a confusion matrix:

h(x) \ f(x)      Lying             True
Estimate lying   true negative     false negative
Estimate true    false positive    true positive

The costs assigned to these outcomes depend on the application. Walmart (a discount for a given person): a false negative, turning away a legitimate customer, is the expensive mistake. CIA access (the Friday bar stock): a false positive, letting an impostor in, is the expensive mistake. On the slide, each application's error matrix weights its expensive mistake 1000 and the cheap one 1.

Error Functions If Not Given. If no one tells you, base the choice on making the problem "solvable": making it smooth and convex seems like a good idea, and least-squares linear regression was very nice indeed. Or base it on assumptions about the target and the noise: a logistic model gives cross entropy; assuming a linear target with Gaussian noise gives least squares.
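
The Gaussian-noise claim, written out as a standard maximum-likelihood argument (assumed, not the slide's own derivation):

```latex
y = w^{\top}x + \varepsilon,\;\; \varepsilon \sim \mathcal{N}(0, \sigma^2)
\;\Rightarrow\;
-\ln L(w) = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\big(y_n - w^{\top}x_n\big)^2 + \text{const},
```

so maximizing the likelihood is the same as minimizing the sum of squared errors.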

Formalize Everything. (Diagram: an unknown target and an unknown probability distribution P(x) generate the data set (x1, y1), ..., (xn, yn); the learning algorithm searches the hypothesis set and outputs a hypothesis h with h(x) ≈ f(x).)

Final Diagram. (Diagram: an unknown target distribution P(y | x) and an unknown input distribution P(x) generate the data set; the learning algorithm uses the hypothesis set and an error measure e to produce the final hypothesis.) P(x) determines importance: if x has very low probability, then it is not really going to count.

A few words on the out-of-sample error. Imagine X and Y are finite sets; then the out-of-sample error is just a weighted sum over all input-output pairs, weighted by their probabilities.
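
Under that assumption the definition can be written as a finite weighted sum (notation assumed):

```latex
E_{\text{out}}(h) = \mathbb{E}_{x,y}\big[e(h(x), y)\big]
= \sum_{x \in X}\sum_{y \in Y} P(x)\,P(y \mid x)\,e\big(h(x), y\big).
```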

Quick Summary. Learning without assumptions is impossible; probabilistically, learning is possible (Hoeffding bound), though work is needed for infinite hypothesis spaces. The error function depends on the problem. The formalized learning approach: ensure the out-of-sample error is close to the in-sample error, and minimize the in-sample error. The complexity of the hypothesis set (its size M, for now) matters, and more data helps.