Learning From Data
Chichang Jou, Tamkang University

Chapter Objectives
- Analyze the general model of inductive learning
- Explain how to select an approximating function
- Introduce risk functionals for regression and classification problems
- Identify concepts in statistical learning theory
- Discuss the differences between the inductive principles of empirical risk minimization and structural risk minimization
- Discuss practical aspects of the VC dimension
- Compare inductive-learning tasks using graphics
- Introduce validation methods for inductive-learning results

Background
- Biological systems learn to cope with an unknown, statistical environment in a data-driven fashion.
- The predictive-learning process has two phases:
  – Learning or estimating unknown dependencies (induction: progressing from particular cases to a model)
  – Using the estimated dependencies to predict (deduction: progressing from a model and given inputs to particular cases)

Induction, Deduction, Transduction
[Figure: induction, deduction, and transduction. Transduction is local estimation, like association rules.]

4.1 Learning Machine
- Machine-learning algorithms vary in their goals, in the available training data sets, and in their learning strategies and representations of data.
- Inductive machine learning: a generalized model is obtained from a set of samples.

Observational Setting of a Learning Machine
- The system's outputs follow the conditional probability p(Y|X).
- Real-world systems often have unmeasured inputs.
[Figure: the observational setting of a learning machine.]

Inductive Learning Machine
- Tries to form generalizations from particular true facts (the training data set).
  – Formalized as a set of functions that approximate the system's behavior: given an input X, the machine implements a set of functions f(X, w), where w is a parameter of the function.
  – Its solution requires a priori knowledge.

Inductive Learning Machine
- The task of inductive inference: given samples (x_i, f(x_i)), return a function h(x), called the hypothesis, that approximates f(x).
[Figure: linear vs. non-linear hypotheses fit to the same samples.]
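For concreteness, a minimal sketch of the kind of contrast the figure drew, with hypothetical parameters w: a linear and a non-linear (quadratic) hypothesis over a single input variable.

```latex
% Two candidate hypotheses h(x) for the same samples (x_i, f(x_i)):
h_{\mathrm{lin}}(x)  = w_0 + w_1 x                 % linear
h_{\mathrm{quad}}(x) = w_0 + w_1 x + w_2 x^2       % non-linear (quadratic)
```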

Inductive Learning Machine
- Statistical dependency vs. causality:
  – Inductive-learning processes build models of dependencies, but these should not automatically be interpreted as causal relations.
  – Example: people in Florida are on average older than people in other states, and married men live longer than single men; neither dependency is causal.

Loss Function and Risk Function
- The loss function L(y, f(X, w)) measures the difference between the system output y and the prediction f(X, w).
- Inductive learning is the process of estimating f(X, w_opt), the function that minimizes the risk R(w), defined below.
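In the standard formulation of statistical learning theory, the risk R(w) referred to above is the expected loss over the unknown joint distribution p(X, y):

```latex
R(w) = \int L\bigl(y, f(X, w)\bigr)\, p(X, y)\, dX\, dy
```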

Common Loss Functions
[Figure: table of common loss functions.]
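The table itself did not survive the transcript; the loss functions most commonly used for the two core learning tasks are the squared error (regression) and the 0/1 loss (classification):

```latex
L\bigl(y, f(X, w)\bigr) = \bigl(y - f(X, w)\bigr)^{2}
\quad \text{(squared error, regression)}

L\bigl(y, f(X, w)\bigr) =
\begin{cases}
0 & \text{if } y = f(X, w)\\
1 & \text{if } y \neq f(X, w)
\end{cases}
\quad \text{(0/1 loss, classification)}
```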

Inductive Principle
- An inductive principle is a general prescription (what to do with the data) for obtaining an estimate f(X, w*_opt).
- Human intervention in the learning algorithm takes the form of:
  – Selection of input and output variables
  – Data encoding and representation
  – Incorporating a priori knowledge
  – Influence over the generator, i.e., the sampling rate or distribution

Statistical Learning Theory
- A formalized theory for finite-sample inductive learning, developed mainly for classification and pattern recognition.
  – Provides a quantitative description of the trade-off between model complexity and the available information.
  – Also called VC (Vapnik-Chervonenkis) theory.
- Other approaches are more engineering-oriented, without proofs and formalizations.

Empirical Risk Minimization (ERM)
- Typically used when the model is given or approximated first, and its parameters are then estimated from the data by minimizing the empirical risk shown below.
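ERM replaces the true risk R(w), which depends on the unknown distribution p(X, y), with its average over the n training samples and minimizes that average instead:

```latex
R_{\mathrm{emp}}(w) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i, w)\bigr)
```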

Empirical Risk Minimization (ERM)
- The consistency property: minimizing one risk (empirical or expected) for a given data set will also minimize the other.
- Nontrivial consistency: the consistency requirement must hold for all approximating functions.

Behavior of the Growth Function G(n)
- Sets of approximating functions whose growth function G(n) is bounded (as sketched below) have the consistency property.
[Figure: the two possible behaviors of the growth function G(n).]
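From VC theory, the growth function has only two possible behaviors, which is presumably what the missing figure contrasted: either G(n) grows linearly in n (the VC dimension h is infinite and consistency fails), or for n > h it is bounded logarithmically and consistency holds:

```latex
G(n) = n \ln 2
\qquad \text{or} \qquad
G(n) \le h \left( 1 + \ln \frac{n}{h} \right)
```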

Structural Risk Minimization (SRM)
- ERM works well when the ratio n/h (sample size to VC dimension) is large.
- When n/h < 20, use SRM instead:
  1. Select the element of a structure that has optimal complexity (see the sketch of the structure below).
  2. Estimate the model using the set of approximating functions defined in the selected element of the structure.
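In the standard SRM formulation, the "structure" in step 1 is a nested sequence of sets of approximating functions with non-decreasing VC dimensions; SRM selects the element that best balances empirical risk against complexity:

```latex
S_1 \subset S_2 \subset \cdots \subset S_k \subset \cdots,
\qquad h_1 \le h_2 \le \cdots \le h_k \le \cdots
```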

SRM in Practice
[Figure: applying SRM in practice.]

SRM
- Applying SRM to non-linear approximations is difficult, and in many cases impossible.
  – Heuristics are used instead, such as early-stopping rules and weight initialization.
- Three optimization approaches (a sketch of the first follows below):
  – Stochastic approximation (gradient descent)
  – Iterative methods
  – Greedy optimization
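As an illustration of the first approach, a minimal gradient-descent sketch in Python; the data set, learning rate, and iteration count are hypothetical choices for illustration, not values from the slides.

```python
import numpy as np

# Hypothetical training set: samples of y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2 * X + 1 + rng.normal(scale=0.1, size=100)

w = np.zeros(2)    # parameters [w0, w1] of the model f(x, w) = w0 + w1 * x
lr = 0.1           # learning rate (step size)

for _ in range(500):
    err = (w[0] + w[1] * X) - y
    # Gradient of the empirical risk R_emp(w) = mean((f(x_i, w) - y_i)^2)
    grad = np.array([2 * err.mean(), 2 * (err * X).mean()])
    w -= lr * grad  # steepest-descent update

print(w)           # converges near [1.0, 2.0]
```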

SRM
- Problems with these optimization approaches:
  – Too sensitive to initial conditions
  – Too sensitive to stopping rules
  – Too sensitive to the many local minima
- Two useful guidelines:
  – Do not attempt to solve a problem by indirectly solving a harder, more general problem.
  – Occam's razor: the best performance is provided by a model of optimal complexity.

Requirements of Any Inductive-Learning Process
[Figure: requirements of an inductive-learning process.]

Types of Learning Methods
- Supervised learning. Examples: logistic regression, multilayer perceptron, decision rules, decision trees, etc.
- Unsupervised learning, with its emphasis on a task-independent measure of the quality of representation. Examples: cluster analysis, artificial neural networks, association rules.

Common Learning Tasks
- Classification
- Regression
- Clustering
- Summarization (formalized description)
- Dependency modeling
- Deviation detection (outliers, changes in time)

Data-Mining and Knowledge-Discovery Techniques
- Statistical methods
- Cluster analysis
- Decision trees and decision rules
- Association rules
- Artificial neural networks
- Genetic algorithms
- Fuzzy inference systems
- N-dimensional visualization methods

Model Estimation

Testing

Objective of Testing

How to Split Samples

Common Resampling Methods
- Resubstitution method
- Holdout method
- Leave-one-out method
- Rotation method
- Bootstrap method
A sketch contrasting the holdout and leave-one-out splits follows below.
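A minimal sketch in plain NumPy (with a hypothetical data set) contrasting the holdout and leave-one-out splits:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = rng.normal(size=n)        # hypothetical inputs
y = (X > 0).astype(int)       # hypothetical binary labels

# Holdout method: reserve a fixed fraction (here 30%) for testing.
idx = rng.permutation(n)
split = int(0.7 * n)
train_idx, test_idx = idx[:split], idx[split:]
print("holdout train/test sizes:", len(train_idx), len(test_idx))

# Leave-one-out method: n rounds, each holding out exactly one sample.
for i in range(n):
    train_i = [j for j in range(n) if j != i]
    test_i = [i]
    # ...train on train_i, evaluate on the single held-out sample...
print("leave-one-out rounds:", n)
```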

Error Rate, Accuracy
- Error rate: R = E / S, where E is the number of errors and S is the number of samples in the testing set.
- Accuracy: A = 1 - R = (S - E) / S. (A worked example follows below.)
- Two classes:
  – False negative: false reject rate (FRR)
  – False positive: false acceptance rate (FAR)
- More than two classes: use a confusion matrix.
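A worked example with hypothetical counts: for a testing set of S = 200 samples on which the classifier makes E = 30 errors,

```latex
R = \frac{E}{S} = \frac{30}{200} = 0.15,
\qquad
A = 1 - R = \frac{200 - 30}{200} = 0.85
```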

Confusion Matrix for Three Classes
[Figure: a 3 x 3 confusion matrix; rows are true classes, columns are predicted classes.]
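An illustrative confusion matrix of the kind the slide showed, with counts invented for illustration: diagonal entries are correct predictions, and off-diagonal entries count misclassifications (e.g., 6 samples of true class C2 were predicted as C3).

```
                Predicted
            C1     C2     C3
True  C1    45      3      2
      C2     4     40      6
      C3     1      5     44
```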

Receiver Operating Characteristic (ROC) Curve
- Evaluates FAR and FRR at the same time.
- The ROC curve plots sensitivity (1 - FRR, the true-positive rate) against 1 - specificity (FAR, the false-positive rate).
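A minimal sketch in plain NumPy (with hypothetical scores and labels) of how ROC points are traced by sweeping the decision threshold:

```python
import numpy as np

# Hypothetical classifier scores and true binary labels.
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.20, 0.90, 0.65])
labels = np.array([   0,    0,    1,    1,    1,    0,    1,    0])

for t in np.unique(scores):                 # sweep decision thresholds
    pred = (scores >= t).astype(int)
    tpr = (pred[labels == 1] == 1).mean()   # sensitivity = 1 - FRR
    fpr = (pred[labels == 0] == 1).mean()   # 1 - specificity = FAR
    print(f"threshold={t:.2f}  FAR={fpr:.2f}  sensitivity={tpr:.2f}")
```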