1
Machine learning methods – Introduction: The main properties of learning algorithms
2
The goal of machine learning
- Goal: to construct programs that are able to improve their performance using the experience collected during their operation
- Learning algorithm: an algorithm that is able to deduce regularities and relationships from a set of training examples
- Note 1: the main aim is not to memorize the actual training examples, but to generalize correctly to other samples not seen during training (also known as inductive learning)
  - Assumption: the examples faithfully represent the relationship that we try to learn
- Note 2: we can never be 100% sure that the relationship we found will generalize to unseen data
  - Because of this, we call the found relationship a "hypothesis"
  - After receiving further examples the algorithm may refine the hypothesis
3
The main types of learning tasks
- Supervised learning: the correct answer is also given with the training examples
  - The most common task: classification
  - Example: character recognition: 16x16 pixels → letter
    - 16x16 pixels: input features
    - Letter: class label
  - Practically, we have to learn a function from examples (a minimal sketch follows below)
  - This will be the dominant topic of this semester
- Unsupervised learning: no helping information is given
  - Most common task: clustering
  - Mapping data points into automatically found classes, based on some kind of similarity measure
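To make the supervised classification setting concrete, here is a minimal sketch of the character-recognition example: each 16x16 image is flattened into a 256-dimensional feature vector, and a nearest-neighbour rule (chosen here only for brevity, not the method of the course) returns a class label. The data is random stand-in data, not a real character set.

```python
import numpy as np

# Hypothetical toy data: each training image is a 16x16 pixel array,
# flattened into a 256-dimensional feature vector; the label is the letter shown.
rng = np.random.default_rng(0)
X_train = rng.random((100, 16 * 16))          # 100 example images (random stand-ins)
y_train = rng.choice(list("ABC"), size=100)   # their class labels

def predict_1nn(x_new):
    """Classify a new 16x16 image by the label of its nearest training example."""
    x_new = x_new.reshape(-1)                         # flatten to a 256-dim feature vector
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every training example
    return y_train[np.argmin(dists)]

print(predict_1nn(rng.random((16, 16))))
```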
4
The main types of learning tasks 2
- Modelling processes along time
  - In the classic function learning task we assume that consecutive samples are independent, or at least come in a random order
  - In contrast, when modelling time series we assume that the order carries crucial information that must be modelled
  - Examples: speech recognition, text analysis, modelling stock exchange data
- Reinforcement learning
  - Example: artificial living "creatures", i.e. autonomous agents
  - Interaction with the environment, collection of experiences
  - The experiences have no labels in themselves, only a long-term goal is defined
  - A special sub-field within machine learning
- Other special learning tasks
5
Supervised learning of functions
- The input of the function: a vector of measurement data (feature vector, attribute vector)
- The output of the function: a class label or a real number
- The input of the learning algorithm: a set of training examples
- Its output: a hypothesis (model) of the function
  - It can return the (hypothesized) output value for any input vector
- Set of training examples: a set of (feature vector, class label) pairs, as sketched below
- Example: does the patient have influenza?
  [Table of training instances: feature vector = (Fever, Joint pain, Cough), class label = Influenza (Y/N); the example rows include fever readings 38.2, 36.7, 41.2, 38.5, 37.2, joint pain Yes/No, and cough Dry/Wet]
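A minimal sketch of such a training set as (feature vector, class label) pairs. Since the slide's table cannot be fully recovered, the concrete values below are illustrative placeholders, not the original rows.

```python
# Training examples as (feature vector, class label) pairs (placeholder values).
training_examples = [
    # (fever in °C, joint pain, cough) -> influenza?
    ((38.2, "yes", "dry"), "Y"),
    ((36.7, "no",  "no"),  "N"),
    ((41.2, "yes", "wet"), "Y"),
    ((37.2, "no",  "dry"), "N"),
]

for features, label in training_examples:
    print(features, "->", label)
```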
6
The main properties of a learning system
We have to think about these properties when designing a new learning method, or when we look for a suitable method for a given task:
- The type of input/output of the function to be learned
- The representation method of the learned function (hypothesis)
- Hypothesis space: the set of functions that the method selects from
- Which hypothesis it prefers when there are several hypotheses that fit the data
- What algorithm it uses to find a/the best hypothesis
7
The output of the function to be learned
- Classification: the output value is from a finite, discrete set
  - Example: character recognition. We have to tell which letter is shown in images of 16x16 pixels. Range of output values = the letters of the alphabet
  - The classification task is the typical machine learning task
- Concept learning: the function has a binary range
  - Example: we want to teach a robot the notion of "chair". Each object in its environment either belongs to this notion or not
- Regression: the range of the function is continuous
  - Example: assessing the value of used cars based on features like brand, age, engine capacity, …
(The sketch below illustrates the three output types.)
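A minimal sketch of the three output types; the decision rules inside the functions are made up purely for illustration.

```python
def classify_letter(image_features):          # classification: finite, discrete range
    return "A" if sum(image_features) > 100 else "B"

def is_chair(object_features):                # concept learning: binary range
    return object_features["has_seat"] and object_features["legs"] >= 3

def estimate_car_price(age_years, engine_cc): # regression: continuous range
    return 20000.0 - 1500.0 * age_years + 2.0 * engine_cc
```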
8
The input of the function to be learned
- Binary features
- Discrete features
  - Also called nominal, symbolic or categorical features
- Continuous features
- Binary → discrete → continuous conversion is trivial
- Discrete → binary:
  - Class labels: learning N class labels can always be reduced to N concept learning tasks ("one against the rest")
  - Features: N different values can be represented by log2 N binary values
- Continuous → discrete: can be solved by quantization (with some error), e.g. fever 39.7 → "high"
  - Quantization is used for features; it is less usual for training targets
(See the conversion sketch below.)
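A minimal sketch of the two conversions mentioned above: encoding a discrete value with log2 N binary features, and quantizing a continuous value into discrete categories. The category boundaries are assumptions chosen only for the example.

```python
import math

def discrete_to_binary(value, all_values):
    """Encode one of N discrete values with ceil(log2 N) binary features."""
    n_bits = math.ceil(math.log2(len(all_values)))
    index = all_values.index(value)
    return [(index >> b) & 1 for b in range(n_bits)]

def quantize_fever(fever_celsius):
    """Continuous -> discrete with some error: map the temperature to a category."""
    if fever_celsius < 37.0:
        return "normal"
    elif fever_celsius < 38.5:
        return "elevated"
    else:
        return "high"

print(discrete_to_binary("wet", ["no", "dry", "wet"]))   # [0, 1]
print(quantize_fever(39.7))                              # "high"
```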
9
Why does the type of input/output matter?
- Different types of input/output require different types of inner representation
- Some algorithms work only with a certain type of features/targets
  - Or they might work with other types of features, but not optimally
- Examples:
  - Concept learning with binary features: we have to learn a Boolean function
    - In the 60s and 70s logic formulas were thought to be the best representation of human thinking
    - A lot of research effort was put into the learning of logic formulas; these algorithms do not work on other types of data
  - The classic SVM algorithm is defined for two classes
    - Several extensions exist for multi-class tasks (one of them is sketched below)
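A minimal sketch of one such extension: the "one against the rest" scheme, which trains one binary SVM per class. It assumes scikit-learn is available, and the data is random and purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 4))                   # 60 examples, 4 continuous features
y = rng.choice(["a", "b", "c"], size=60)  # 3 class labels

model = OneVsRestClassifier(SVC(kernel="linear"))  # one binary SVM per class
model.fit(X, y)
print(model.predict(X[:5]))
```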
10
Input/output examples 2
- The classic decision tree algorithm was defined for discrete features
  - There are several extensions for continuous features, but these are not really efficient
- The Gaussian mixture model of statistical pattern recognition assumes continuous features
  - There is not much sense in fitting Gaussian distributions on discrete features; in many cases the algorithm would even crash in practice
- Classification in general, when we have continuous features:
  - The characteristic function of each class is a discontinuous function that is hard to represent
  - There are two general solutions to represent it using continuous functions:
    - the geometric approach
    - the decision-theoretic approach
11
The feature space and the decision boundary
- When we have a feature vector of N components, our training examples can be displayed as points in an N-dimensional space
  - Example: 2 features → 2 axes (x1, x2); class label: shown by colors
- Goal: to find the decision boundary between the classes
- Generally: give an estimate of the (x1, x2) → c function based on the training examples
  - This is the same as specifying the (x1, x2) → {0, 1} characteristic function (or indicator function) of each class ci (see the sketch below)
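A minimal sketch of this view: a handful of 2-dimensional training points with class labels, and the characteristic (indicator) function of one class implied by a made-up linear decision boundary.

```python
training_points = [                    # (x1, x2) -> class label
    ((1.0, 2.0), "red"),
    ((2.5, 0.5), "blue"),
    ((0.5, 3.0), "red"),
    ((3.0, 1.0), "blue"),
]

def indicator_red(x1, x2):
    """Characteristic function of the 'red' class for a hypothetical boundary x2 = x1."""
    return 1 if x2 > x1 else 0

for (x1, x2), label in training_points:
    print((x1, x2), label, "-> indicator_red =", indicator_red(x1, x2))
```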
12
Representing the decision boundary
- Direct (geometric) approach: we directly represent the decision surface
  - Using some simple, continuous function, e.g. lines (planes)
- Indirect (decision-theoretic) approach:
  1. We assign a function to each class that can tell, for any point of the space, the probability that the point belongs to the given class
  2. A given point is assigned the class label for which the discriminant function takes the largest value
  - The boundary between the classes is defined indirectly, by the intersection of the discriminant functions
  - This way the classification task is solved indirectly, by learning the discriminant functions (sketched below)
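A minimal sketch of the decision-theoretic approach: one made-up discriminant ("score") function per class, and classification by taking the class with the largest score. The Gaussian-like bumps are assumptions for the example only.

```python
import math

def g_red(x1, x2):    # hypothetical discriminant function for class "red"
    return math.exp(-((x1 - 1.0) ** 2 + (x2 - 2.5) ** 2))

def g_blue(x1, x2):   # hypothetical discriminant function for class "blue"
    return math.exp(-((x1 - 3.0) ** 2 + (x2 - 0.5) ** 2))

def classify(x1, x2):
    scores = {"red": g_red(x1, x2), "blue": g_blue(x1, x2)}
    return max(scores, key=scores.get)   # the boundary is where the scores are equal

print(classify(1.2, 2.0))   # near the "red" centre -> "red"
print(classify(2.8, 0.7))   # near the "blue" centre -> "blue"
```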
13
Further remarks (Input/output)
- It is important whether the examples may have missing feature values
  - There exist methods to estimate the missing values
  - But most algorithms cannot handle these by default
  - This can happen in several practical tasks (e.g. medical diagnostics)
- It is important whether the algorithm can handle contradicting examples (same feature vector with different class labels)
  - There are solutions to this
  - But some algorithms cannot handle it
  - It is very frequent in practice, due to labelling mistakes, e.g. an ambiguous diagnosis
(Two simple workarounds are sketched below.)
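A minimal sketch of two common workarounds (certainly not the only possible ones): filling a missing feature value with the feature's mean, and resolving contradicting examples by majority vote. The data is made up.

```python
from collections import Counter

examples = [
    ((38.2, 1.0), "Y"),
    ((None, 0.0), "N"),   # missing fever value
    ((38.2, 1.0), "N"),   # contradicts the first example
    ((36.8, 0.0), "N"),
]

# Mean imputation of the first feature
known = [f[0] for f, _ in examples if f[0] is not None]
mean_fever = sum(known) / len(known)
imputed = [((f[0] if f[0] is not None else mean_fever, f[1]), y) for f, y in examples]

# Majority vote among identical feature vectors
votes = {}
for f, y in imputed:
    votes.setdefault(f, Counter())[y] += 1
cleaned = [(f, counts.most_common(1)[0][0]) for f, counts in votes.items()]
print(cleaned)
```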
14
Representation of the function to be learned
- Symbolic representation vs. numeric representation
  - This is an ancient debate in AI
  - 60s-70s: symbolic representation was preferred, e.g. logic formulas, if-then rules
  - Currently: numeric representation is preferred, e.g. in neural networks the representation consists of a bunch of real numbers
- For certain tasks symbolic representation seems to be more suitable
  - E.g. automatic proving of mathematical theorems
- For other tasks it makes no sense
  - E.g. image recognition
- The most important aspect: does the model have to be well-structured and interpretable for human inspection?
  - Sometimes it does not matter, e.g. speech recognition
  - Sometimes human understanding is the goal, e.g. medical data mining
15
What hypothesis space is used
- Hypothesis space: the set of functions from which the algorithm selects the best fitting one
- Example: parametric methods
  - In the case of a continuous feature space most methods use some parametric curve to represent the function to be learned
  - Example: regression with 1 variable: we fit a polynomial on the training points (see the sketch below)
  - Restricting the hypothesis space: we specify the degree of the polynomial
    - This restricts the set of possible functions
    - The parameters that influence the size of the hypothesis space are called meta-parameters
  - Training = finding the optimal parameters of the polynomial
    - In the example these are the coefficients of the polynomial
    - These are called the parameters of the model
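A minimal sketch of this example with synthetic data: the polynomial degree is the meta-parameter that fixes the hypothesis space, and the fitted coefficients are the parameters of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)  # noisy training targets

degree = 3                            # meta-parameter: restricts the hypothesis space
coeffs = np.polyfit(x, y, degree)     # training: find the optimal coefficients
hypothesis = np.poly1d(coeffs)        # the learned function (the hypothesis)

print(coeffs)                         # the parameters of the model
print(hypothesis(0.25))               # prediction for a previously unseen input
```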
16
What hypothesis space is used 2
- Hypothesis space: the set of functions from which the algorithm selects its hypothesis
- Restricting the hypothesis space is technically necessary
  - Continuous feature space: it is impossible to represent all possible functions
  - Discrete space: the number of possible functions is finite, so theoretically we could represent all of them, but in practice there are usually too many combinations
- It is also necessary for efficient (i.e. well-generalizing) learning
  - Generalization requires that the system can give a reply for previously unseen examples
  - During training, we fit a model (function) from the hypothesis space on the data
  - The shape of this function plays a critical role in how the system replies to previously unseen data ("inductive bias")
  - Usually we work with mathematically simple function families
- The optimal hypothesis space depends on the actual task!
  - Too restricted a hypothesis space: the model won't be able to learn even the training examples
  - Too wide a hypothesis space: the model simply memorizes the training examples, but cannot generalize (illustrated below)
  - Similar to human learning (though there we adjust the task to the child, and not the other way round)
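A minimal sketch of this trade-off on synthetic data, reusing the polynomial example: a too-low degree underfits even the training set, a too-high degree fits the training points almost exactly but tends to do worse on held-out points.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 10)
x_test = np.linspace(0.05, 0.95, 10)
f = lambda x: np.sin(2 * np.pi * x)
y_train = f(x_train) + 0.1 * rng.standard_normal(x_train.size)
y_test = f(x_test) + 0.1 * rng.standard_normal(x_test.size)

for degree in (1, 3, 9):
    h = np.poly1d(np.polyfit(x_train, y_train, degree))
    train_err = np.mean((h(x_train) - y_train) ** 2)   # error on seen examples
    test_err = np.mean((h(x_test) - y_test) ** 2)      # error on unseen examples
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```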
17
Which one it selects from among the possible hypotheses
- Consistent hypothesis: one that gives the correct return value for the training examples
- If there is more than one consistent hypothesis, then we have to choose among them
  - The training examples cannot help in this!
  - We need some heuristics for this
- The principle of Occam's razor: "when there are several possible explanations, then usually the simplest one turns out to be right"
  - Of course, we have to mathematically define the notion of "simplest"
  - E.g.: minimum description length
(A small sketch of this preference follows below.)
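A minimal sketch of an Occam's-razor style preference, again with polynomials on synthetic data: among the hypotheses that are (nearly) consistent with the training examples, pick the one with the lowest degree, i.e. the "simplest" one. The consistency tolerance is an assumption of the example.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + 0.5                  # the data really is linear

tolerance = 1e-6
for degree in range(0, 8):
    h = np.poly1d(np.polyfit(x, y, degree))
    if np.max(np.abs(h(x) - y)) < tolerance:   # consistent with all training examples
        print("simplest consistent hypothesis: degree", degree)
        break
```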
18
What algorithm is used to find the best hypothesis
- In the previous step we defined the criterion of the optimal hypothesis
  - In practice we will frequently define it as a target function
- Defining it is not enough, we also have to find it somehow
- In the case of numerical models, optimizing the target function usually leads to a multivariate global optimization problem
- Theoretically, we may use general-purpose global optimization algorithms for this
- In most cases, however, we will have a training algorithm specially adjusted to the needs of the actual machine learning model (a generic sketch is given below)
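A minimal sketch of finding the hypothesis by optimizing a target function: plain gradient descent on the mean squared error of a linear model y ≈ w*x + b. This is a generic illustration on synthetic data, not the specialized trainer of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(50)
y = 3.0 * x + 1.0 + 0.05 * rng.standard_normal(50)   # synthetic training data

w, b = 0.0, 0.0
learning_rate = 0.1
for step in range(2000):
    error = (w * x + b) - y
    grad_w = 2 * np.mean(error * x)    # partial derivatives of the target function
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)   # should end up close to the true 3.0 and 1.0
```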