The goal of machine learning

Machine learning methods – Introduction
The main properties of learning algorithms

The goal of machine learning
- Goal: to construct programs that are able to improve their performance using the experience collected during their operation
- Learning algorithm: an algorithm that is able to deduce regularities and relationships from a set of training examples
- Note 1: the main aim is not to memorize the actual training examples, but to generalize correctly to samples not seen during training (also known as inductive learning)
  - Assumption: the examples faithfully represent the relationship that we try to learn
- Note 2: we can never be 100% sure that the relationship we found will generalize to unseen data
  - Because of this, we call the found relationship a "hypothesis"
  - After receiving further examples, the algorithm may refine the hypothesis

The main types of learning tasks
- Supervised learning: the correct answer is given along with the training examples
  - The most common task: classification
  - Example: character recognition: 16x16 pixels → letter
    - 16x16 pixels: input features
    - Letter: class label
  - In practice, we have to learn a function from examples
  - This will be the dominant topic of this semester
- Unsupervised learning: no helping information is given
  - The most common task: clustering
  - Mapping data points into automatically found classes, based on some kind of similarity measure
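A minimal sketch of the two settings, assuming scikit-learn is available; the digits dataset uses 8x8-pixel images rather than the 16x16 pixels mentioned above, and logistic regression / k-means are just illustrative stand-ins for a classifier and a clustering method.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_digits(return_X_y=True)          # X: pixel features, y: class labels (digits 0-9)

# Supervised: the class labels are given with the training examples
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels, the points are grouped by similarity only
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print("first ten cluster assignments:", clusters[:10])
```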

The main types of learning tasks 2
- Modelling processes along time
  - In the classic function learning task we assume that consecutive samples are independent, or at least come in a random order
  - In contrast, when modelling time series we assume that the order carries crucial information that must be modelled
  - Examples: speech recognition, text analysis, modelling stock exchange data
- Reinforcement learning
  - Example: artificial living "creatures" -- autonomous agents
  - Interaction with the environment, collection of experiences
  - The experiences have no labels in themselves; only a long-term goal is defined
  - A special sub-field within machine learning
- Other special learning tasks

Supervised learning of functions
- The input of the function: a vector of measurement data (feature vector, attribute vector)
- The output of the function: a class label or a real number
- The input of the learning algorithm: a set of training examples
- The output: a hypothesis (model) about the function
  - It can return the (hypothesized) output value for any input vector
- Set of training examples: a set of pairs of a feature vector and the corresponding class label
- Example: does the patient have influenza?
  - Feature vector: Fever, Joint pain, Cough; class label: Influenza (Y/N)
  - [Table: five training instances with fever values 38.2, 36.7, 41.2, 38.5, 37.2; joint pain (Yes/No); cough (Dry/Wet); influenza label (Y/N)]
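A minimal sketch (not from the slides) of how such a training set could be represented in code: each row of X is a feature vector, y holds the class labels. The joint-pain, cough and influenza values below are illustrative, since the original table is only partially readable.

```python
import numpy as np

feature_names = ["fever", "joint_pain", "cough"]
X = np.array([
    [38.2, 1, 0],   # fever 38.2, joint pain: yes, cough: no (illustrative)
    [36.7, 0, 0],
    [41.2, 1, 1],
    [38.5, 0, 1],
    [37.2, 0, 0],
])
y = np.array([1, 0, 1, 1, 0])   # class label: influenza yes/no (illustrative)

print(X.shape, y.shape)          # (5, 3) (5,)
```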

The main properties of a learning system
We have to think about these aspects when designing a new learning method, or when looking for a suitable method for a given task:
- The type of input/output of the function to be learned
- The representation method of the learned function (hypothesis)
  - Hypothesis space: the set of functions that the method selects from
- Which hypothesis it prefers when there are several hypotheses that fit the data
- What algorithm it uses to find a/the best hypothesis

The output of the function to be learned
- Classification: the output value is from a finite, discrete set
  - Example: character recognition. We have to tell which letter is shown in images of 16x16 pixels. Range of output values = letters of the alphabet
  - The classification task is the typical machine learning task
- Concept learning: the function has a binary range
  - Example: we want to teach a robot the notion of "chair". Each object in its environment either belongs to this notion or not.
- Regression: the range of the function is continuous
  - Example: assessing the value of used cars based on features like brand, age, engine capacity, …

The input of the function to be learned
- Binary features
- Discrete features
  - Also called nominal, symbolic or categorical features
- Continuous features
- Binary → discrete → continuous conversion is trivial
- Discrete → binary:
  - Class labels: learning N class labels can always be solved as N concept learning tasks ("one against the rest")
  - Features: N different values can be represented by ⌈log2 N⌉ binary values
- Continuous → discrete:
  - Can be solved by quantization (with some error), e.g. (fever) 39.7 → high
  - Quantization is usually applied only to features, less often to training targets
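A minimal sketch (not from the slides) of the conversions listed above: one-against-the-rest label binarization, binary coding of a discrete feature, and quantization of a continuous feature. The thresholds and values are illustrative.

```python
import numpy as np

labels = np.array([0, 2, 1, 2, 0])            # N = 3 class labels

# N-class problem as N concept-learning ("one against the rest") tasks
one_vs_rest = [(labels == c).astype(int) for c in range(3)]

# A discrete feature with N values needs ceil(log2 N) binary values
n_values = 5
n_bits = int(np.ceil(np.log2(n_values)))       # 3 bits are enough for 5 values
codes = [list(map(int, format(v, f"0{n_bits}b"))) for v in range(n_values)]

# Continuous -> discrete by quantization, e.g. fever 39.7 -> "high"
def quantize_fever(t):
    return "normal" if t < 37.0 else "elevated" if t < 38.5 else "high"

print(one_vs_rest)
print(codes)
print(quantize_fever(39.7))
```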

Why does the type of input/output matter?
- Different types of input/output require different inner representations
- Some algorithms work only with a certain type of features/targets
  - Or they might work with other types of features, but not optimally
- Examples:
  - Concept learning with binary features: we have to learn a Boolean function
    - In the 60-70s, logic formulas were thought to be the best representation of human thinking
    - A lot of research effort was put into the learning of logic formulas; these algorithms do not work on other types of data
  - The classic SVM algorithm is defined for two classes
    - Several extensions exist for multi-class tasks

Input/output examples 2
- The classic decision tree algorithm was defined for discrete features
  - There are several extensions for continuous features, but these are not really efficient
- The Gaussian mixture model of statistical pattern recognition assumes continuous features
  - There is not much sense in fitting Gaussian distributions on discrete features; in many cases the algorithm would even crash in practice
- Classification in general, when we have continuous features
  - The characteristic function of each class is a discontinuous function that is hard to represent
  - There are two general approaches to representing it using continuous functions:
    - Geometric approach
    - Decision-theoretic approach

The feature space and the decision boundary
- When we have a feature vector of N components, our training examples can be displayed as points in an N-dimensional space
- Example: 2 features → 2 axes (x1, x2)
  - Class label: shown by colors
- Goal: to find the decision boundary between the classes
- Generally: give an estimate of the (x1, x2) → c function based on the training examples
  - This is the same as specifying the (x1, x2) → {0,1} characteristic function (or indicator function) of each class ci
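A minimal sketch (not from the slides) of this picture: two-dimensional feature vectors shown as points colored by class label, together with a simple linear decision boundary. It assumes numpy, matplotlib and scikit-learn; the two Gaussian point clouds are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(50, 2)),   # class 0
               rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))])  # class 1
y = np.repeat([0, 1], 50)

clf = LogisticRegression().fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

plt.scatter(X[:, 0], X[:, 1], c=y)                       # points colored by class label
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.plot(xs, -(w[0] * xs + b) / w[1], "k--")             # decision boundary w.x + b = 0
plt.xlabel("x1"); plt.ylabel("x2")
plt.show()
```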

Representing the decision boundary
- Direct (geometric) approach:
  - We directly represent the decision surface
  - Using some simple, continuous function such as lines (planes)
- Indirect (decision-theoretic) approach:
  1. We assign a function to each class that tells, for any point of the space, the probability that the point belongs to the given class
  2. A given point gets the class label for which the discriminant function takes the largest value
  - The boundary between the classes is defined indirectly by the intersection of the discriminant functions
  - This way, the classification task is solved indirectly, by learning the discriminant functions
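A minimal sketch (not from the slides) of the decision-theoretic approach: one discriminant function per class, and classification by taking the argmax. The Gaussian class models below are purely illustrative; scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import multivariate_normal

# One "discriminant function" g_i(x) per class: here, class-conditional densities
class_models = [
    multivariate_normal(mean=[0.0, 0.0], cov=1.0),   # class 0
    multivariate_normal(mean=[3.0, 3.0], cov=1.0),   # class 1
]

def classify(x):
    scores = [g.pdf(x) for g in class_models]   # evaluate each discriminant function
    return int(np.argmax(scores))               # pick the class with the largest value

print(classify([0.5, 0.2]))   # -> 0
print(classify([2.8, 3.1]))   # -> 1
```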

Further remarks (input/output)
- It is important whether the examples may have missing feature values
  - There exist methods to estimate the missing values
  - But most algorithms cannot handle them by default
  - This happens in several practical tasks (e.g. medical diagnostics)
- It is important whether the algorithm can handle contradicting examples (the same feature vector with different class labels)
  - There are solutions to this
  - But some algorithms cannot handle it
  - It is very frequent in practice
  - Due to labelling mistakes, e.g. ambiguous diagnosis
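A minimal sketch (not from the slides) of estimating missing feature values before training, assuming scikit-learn's SimpleImputer; the values are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [38.2, 1.0],
    [np.nan, 0.0],    # missing fever value
    [41.2, np.nan],   # missing joint-pain value
])

# Replace each missing value with the mean of its column
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
print(X_filled)
```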

Representation of the function to be learned
- Symbolic representation vs. numeric representation
  - This is an ancient debate in AI
  - 60-70s: symbolic representation was preferred
    - E.g. logic formulas, if-then rules
  - Currently: numeric representation is preferred
    - E.g. neural networks → the representation consists of a bunch of real numbers
- For certain tasks symbolic representation seems more suitable
  - E.g. automatic proving of mathematical theorems
- For other tasks it makes no sense
  - E.g. image recognition
- The most important aspect: does the model have to be well-structured and interpretable for human inspection?
  - Sometimes it does not matter, e.g. speech recognition
  - Sometimes human understanding is the goal, e.g. medical data mining

What hypothesis space is used
- Hypothesis space: the set of functions from which the algorithm selects the best fitting one
- Example: parametric methods
  - In the case of a continuous feature space, most methods use some parametric curve to represent the function to be learned
- Example: regression with 1 variable
  - We fit a polynomial on the training points
  - Restricting the hypothesis space: we specify the degree of the polynomial
    - This restricts the set of possible functions
    - The parameters that influence the size of the hypothesis space are called meta-parameters
  - Training = finding the optimal parameters of the polynomial
    - In the example these are the coefficients of the polynomial
    - These are called the parameters of the model
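A minimal sketch (not from the slides) of this polynomial regression example: the degree is the meta-parameter that fixes the hypothesis space, and the fitted coefficients are the parameters of the model. The data is synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.shape)   # noisy training targets

degree = 3                                   # meta-parameter: size of the hypothesis space
coeffs = np.polyfit(x, y, deg=degree)        # parameters of the model (the coefficients)
y_hat = np.polyval(coeffs, x)

print("coefficients:", coeffs)
print("training MSE:", np.mean((y - y_hat) ** 2))
```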

What hypothesis space is used 2
- Hypothesis space: the set of functions from which the algorithm selects its hypothesis
- Restricting the hypothesis space is technically necessary
  - Continuous feature space: it is impossible to represent all possible functions
  - Discrete space: the number of possible functions is finite, so theoretically we could represent all of them, but in practice there are usually too many combinations
- It is also necessary for efficient (meaning well-generalizing) learning
  - Generalization requires that the system can give a reply for previously unseen examples
  - During training, we fit a model (function) from the hypothesis space on the data
  - The shape of this function plays a critical role in how the system replies to previously unseen data ("inductive bias")
  - Usually we work with mathematically simple function families
- The optimal hypothesis space depends on the actual task!
  - Too restricted a hypothesis space → the model won't be able to learn even the training examples
  - Too wide a hypothesis space → it memorizes the training examples, but cannot generalize
  - Similar to human learning (though there we adjust the task to the child, not the other way round)
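A minimal sketch (not from the slides) of the effect of the hypothesis space size, reusing the polynomial regression setup above: a too-low degree cannot fit even the training points, while a too-high degree fits them closely but generalizes poorly to a held-out set. Data and degrees are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n)

x_train, y_train = make_data(15)
x_test, y_test = make_data(100)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```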

Which one it selects from among the possible hypotheses
- Consistent hypothesis: gives a correct return value for the training examples
- If there is more than one consistent hypothesis, we have to choose among them
  - The training examples cannot help in this!
  - We need some heuristic for this
- The principle of Occam's razor: "when there are several possible explanations, usually the simplest one turns out to be right"
  - Of course, we have to mathematically define the notion of "simplest"
  - E.g.: minimum description length

What algorithm is used to find the best hypothesis
- In the previous step we defined the criterion of the optimal hypothesis
  - In practice we will frequently define it as a target function
- Defining it is not enough; we also have to find it somehow
- In the case of numerical models, optimizing the target function usually leads to a multivariate global optimization problem
  - Theoretically, we may use general-purpose global optimization algorithms for this
  - In most cases, however, we will have a training algorithm specially adjusted to the needs of the actual machine learning model
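A minimal sketch (not from the slides) of treating training as optimization of a target function: here the mean squared error of a one-variable linear model, minimized by plain gradient descent. The data, step size and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 50)    # the "true" relationship plus noise

w, b = 0.0, 0.0                               # parameters of the model
lr = 0.5                                      # step size (meta-parameter)

for _ in range(500):
    y_hat = w * x + b
    # Gradients of the target function J(w, b) = mean((y_hat - y)^2)
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned parameters: w={w:.3f}, b={b:.3f}")   # close to 2 and 1
```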