1 Chapter 10 Introduction to Machine Learning
2 Chapter 10 Contents (1)
• Training
• Rote Learning
• Concept Learning
• Hypotheses
• General to Specific Ordering
• Version Spaces
• Candidate Elimination
3 Chapter 10 Contents (2)
• Inductive Bias
• Decision Tree Induction
• Overfitting
• The Nearest Neighbor Algorithm
• Neural Networks
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
4 Training
• Learning problems usually involve assigning each input to one of a set of classifications.
• Learning is only possible if there is a relationship between the data and the classifications.
• Training involves providing the system with data that has been manually classified.
• Learning systems use the training data to learn to classify unseen data.
5 Rote Learning
• A very simple learning method.
• Simply involves memorizing the classifications of the training data.
• Can only classify previously seen data: unseen data cannot be classified by a rote learner.
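As a minimal sketch of the idea above, a rote learner can be written as a lookup table. The class name `RoteLearner` and the tuple-based inputs are illustrative assumptions, not something taken from the slides.

```python
class RoteLearner:
    """Memorizes training examples; it cannot generalize to unseen inputs."""

    def __init__(self):
        self.memory = {}

    def train(self, examples):
        # examples: iterable of (input, classification) pairs
        for x, label in examples:
            self.memory[x] = label

    def classify(self, x):
        # Returns None for any input that was not in the training data.
        return self.memory.get(x)


# Hypothetical training data, for illustration only.
learner = RoteLearner()
learner.train([(("slow", "sun"), True), (("fast", "rain"), False)])
seen = learner.classify(("slow", "sun"))     # memorized during training
unseen = learner.classify(("fast", "sun"))   # never seen, so no answer
```

The second query returns no classification at all, which is exactly the limitation the slide describes.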
6 Concept Learning
• Concept learning involves determining a mapping from a set of input variables to a Boolean value.
• Such methods are known as inductive learning methods.
• If a function can be found that maps the training data to the correct classifications, the hope is that it will also work well for unseen data. This is not guaranteed.
• This process is known as generalization.
7 Hypotheses
• A hypothesis is a vector of variables:
• In concept learning, a training example is either positive or negative (true or false).
• A ? is used to indicate that any value will be suitable.
• A Ø is used to indicate that no value will be suitable.
8 Hypotheses - Example
• Each hypothesis represents a set of driving conditions.
• If a hypothesis is positive, then it represents a safe scenario.
• For example:
• This represents the hypothesis that it is safe to drive fast in rain 10 ft behind the next car having drunk 2 units of alcohol.
• This would be a negative training example, as clearly it is not safe!
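The meaning of ? and Ø can be illustrated with a small matching function. This is a sketch only: the four-attribute layout (speed, weather, distance, units of alcohol) is an assumption inferred from the example sentence on this slide, and the names `ANY`, `NONE` and `matches` are hypothetical.

```python
ANY = "?"    # any value is suitable for this attribute
NONE = "Ø"   # no value is suitable for this attribute


def matches(hypothesis, example):
    """True if every attribute value of the example satisfies the hypothesis."""
    return all(
        h == ANY or (h != NONE and h == e)
        for h, e in zip(hypothesis, example)
    )


# Assumed attribute order: (speed, weather, distance, units of alcohol).
safe_if_slow_and_sober = ("slow", ANY, ANY, 0)
scenario = ("fast", "rain", "10ft", 2)   # the unsafe scenario described above

covered = matches(safe_if_slow_and_sober, scenario)
```

Here `covered` is false, so this hypothesis correctly treats the scenario as a negative example.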
9 General to Specific Ordering
• This hypothesis is the most general hypothesis. It represents the idea that it is safe to drive in any conditions: $h_g = \langle ?, ?, \ldots, ? \rangle$
• The following hypothesis is the most specific hypothesis. It says it is not safe to drive in any conditions: $h_s = \langle \emptyset, \emptyset, \ldots, \emptyset \rangle$
• We can define a partial order over the set of hypotheses: $h_1 >_g h_2$
• This states that $h_1$ is more general than $h_2$.
• One learning method is to determine the most specific hypothesis that matches all the training data.
10 Partial Order (sort)
• H1 =
• H2 =
• These cannot be ordered: neither hypothesis is more general than the other.
11 More General Hypothesis l ≥
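The ≥ relation can be checked attribute by attribute. The sketch below assumes hypotheses are tuples using "?" and "Ø" as defined earlier; the function name `more_general_or_equal` and the sample hypotheses are illustrative assumptions.

```python
ANY = "?"
NONE = "Ø"


def more_general_or_equal(h1, h2):
    """True if every example accepted by h2 is also accepted by h1."""
    if NONE in h2:
        return True  # h2 accepts nothing, so any h1 is at least as general
    # Each attribute of h1 must be "?" or agree with the same attribute of h2.
    return all(a == ANY or a == b for a, b in zip(h1, h2))


# Hypothetical two-attribute hypotheses: (speed, weather).
general = ("slow", ANY)
specific = ("slow", "rain")
other = (ANY, "rain")

ordered = more_general_or_equal(general, specific)
incomparable = (not more_general_or_equal(general, other)
                and not more_general_or_equal(other, general))
```

The pair `general` and `other` shows why the order is only partial: each accepts some example that the other rejects.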
12 Learning Algorithm
• Start with the most specific hypothesis, relaxing it until a match is found.
• Positive Training has n
• Choose
• Match with yields
• Match with yields
• Resulting hypothesis: it is safe to drive if one drives slowly and doesn’t drink.
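The procedure on this slide, starting from the most specific hypothesis and relaxing it only as far as each positive example requires, is commonly called Find-S in the literature (the slides do not name it). The following is a sketch under assumptions: the attribute layout (speed, weather, distance, units of alcohol) and the training data are invented for illustration.

```python
ANY = "?"
NONE = "Ø"


def find_s(examples, n_attributes):
    """Return the most specific conjunctive hypothesis covering all positives."""
    h = [NONE] * n_attributes          # most specific hypothesis
    for attributes, is_positive in examples:
        if not is_positive:
            continue                   # this method ignores negative examples
        for i, value in enumerate(attributes):
            if h[i] == NONE:
                h[i] = value           # first positive example: copy its value
            elif h[i] != value:
                h[i] = ANY             # conflicting values: relax to "?"
    return tuple(h)


# Hypothetical training data: (speed, weather, distance, alcohol units).
training = [
    (("slow", "rain", "30ft", 0), True),
    (("slow", "sun", "10ft", 0), True),
    (("fast", "rain", "10ft", 2), False),
]
learned = find_s(training, 4)   # ("slow", "?", "?", 0)
```

With this made-up data the result reads as "it is safe to drive if one drives slowly and has not been drinking", in line with the conclusion on the slide.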
13 Version Spaces
• A version space is the set of hypotheses that correctly map all the training data to their categories.
• A simplistic learning method would be to start from a version space of all hypotheses and to systematically remove all the ones that do not match the training data.
• Clearly this would not be an efficient learning method!
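The simplistic method described above can be sketched by brute-force enumeration, which also makes its inefficiency visible: the number of candidates grows exponentially with the number of attributes. The attribute domains, the data and the function names are assumptions for illustration, and the all-Ø hypothesis is left out for brevity.

```python
from itertools import product

ANY = "?"


def matches(h, x):
    return all(a == ANY or a == b for a, b in zip(h, x))


def version_space(domains, examples):
    """Keep every conjunctive hypothesis consistent with all the examples."""
    # Each attribute may be one of its concrete values or "?".
    candidates = product(*[list(d) + [ANY] for d in domains])
    return [
        h for h in candidates
        if all(matches(h, x) == label for x, label in examples)
    ]


# Hypothetical domains: speed and weather.
domains = [("slow", "fast"), ("sun", "rain")]
examples = [(("slow", "sun"), True), (("fast", "sun"), False)]

remaining = version_space(domains, examples)
# remaining == [("slow", "sun"), ("slow", "?")]
```

Of the nine candidate hypotheses, only two survive these two training examples.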
14 Candidate Elimination
• Candidate elimination aims to derive one hypothesis that matches all the training data.
• We start with a set of the most general hypotheses ($h_g$) and a set of the most specific hypotheses ($h_s$).
• As each item of training data is examined, the two sets are modified so that all hypotheses in $h_s$ and $h_g$ match the training data seen so far.
• When training is finished, the remaining hypothesis should match unseen data.
15 Inductive Bias
• All learning methods have an inductive bias.
• The inductive bias of a learning method is the set of restrictions on the learning method.
• Without inductive bias, a learning method could not learn to generalize.
• Occam’s razor is an example of an inductive bias: the best hypothesis to select is the simplest one that fits the data.
16 Decision Tree Induction (1)
• A decision tree takes an input and gives a Boolean output.
• Decision trees can represent more complex scenarios than version spaces.
17 Decision Tree Induction (2)
• Decision tree induction involves creating a decision tree from a set of training data that can be used to correctly classify the training data.
• ID3 is an example of a decision tree learning algorithm.
• ID3 builds the decision tree from the top down, selecting the features from the training data that provide the most information at each stage.
18 Decision Tree Induction (3)
• ID3 selects attributes based on information gain.
• Information gain is the reduction in entropy caused by a decision.
• Entropy is defined as: $H(S) = -p_1 \log_2 p_1 - p_0 \log_2 p_0$
• $p_1$ is the proportion of the training data which are positive examples.
• $p_0$ is the proportion which are negative examples.
• The entropy of S is zero when all the examples are positive, or when all the examples are negative.
• The entropy reaches its maximum value of 1 when exactly half of the examples are positive, and half are negative.
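The entropy formula and the notion of information gain can be sketched in a few lines. The data set below (genre, big star) is made up for illustration and is not the example from the book; the function names are likewise assumptions.

```python
from math import log2


def entropy(p1):
    """H(S) for a set whose proportion of positive examples is p1."""
    p0 = 1.0 - p1
    # Terms with p = 0 are skipped, following the convention 0 * log2(0) = 0.
    return -sum(p * log2(p) for p in (p0, p1) if p > 0)


def subset_entropy(examples):
    if not examples:
        return 0.0
    p1 = sum(1 for _, positive in examples if positive) / len(examples)
    return entropy(p1)


def information_gain(examples, index):
    """Reduction in entropy from splitting the examples on one attribute."""
    remainder = 0.0
    for value in {attrs[index] for attrs, _ in examples}:
        subset = [(a, pos) for a, pos in examples if a[index] == value]
        remainder += len(subset) / len(examples) * subset_entropy(subset)
    return subset_entropy(examples) - remainder


# Hypothetical data: (genre, big star) -> was the film a success?
films = [
    (("comedy", "yes"), True),
    (("comedy", "no"), True),
    (("drama", "yes"), False),
    (("drama", "no"), False),
]
gain_genre = information_gain(films, 0)   # genre separates the classes fully
gain_star = information_gain(films, 1)    # big star tells us nothing here
```

In this toy data ID3 would place genre at the root, since splitting on it removes all of the entropy while splitting on big star removes none.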
19 Attributes with the greatest gain are placed near the top of the tree
• Considering the example on page 279:
  ▪ Country of origin
  ▪ Big star: 0.01
  ▪ Genre: 0.17
20 The Problem of Overfitting
Black dots represent positive examples, white dots negative. The two lines represent two different hypotheses. In the first diagram there are just a few items of training data, all correctly classified by the hypothesis represented by the darker line. The second and third diagrams show the complete set of data: the simpler hypothesis, which matched the training data less well, matches the rest of the data better than the more complex hypothesis, which overfits.
21 The Nearest Neighbor Algorithm (1)
• This is an example of instance-based learning.
• Instance-based learning involves storing training data and using it to attempt to classify new data as it arrives.
• The nearest neighbor algorithm works with data that consists of vectors of numeric attributes.
• Each vector represents a point in n-dimensional space.
22 The Nearest Neighbor Algorithm (2)
• When an unseen data item is to be classified, the Euclidean distance is calculated between this item and every item of training data.
• The distance between $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ is: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• The classification for the unseen data is usually selected as the one that is most common amongst the few nearest neighbors.
• Shepard’s method involves allowing all training data to contribute to the classification, with their contribution being inversely proportional to their distance from the data item to be classified.
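Both variants can be sketched briefly: a majority vote among the k nearest neighbors, and a weighting of every training point by the inverse of its distance in the spirit of Shepard's method. The training points, the labels "A" and "B", and the default `k=3` are assumptions chosen for illustration; `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist   # Euclidean distance between two points


def knn_classify(training, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def shepard_classify(training, query):
    """Every training point votes with weight 1 / distance."""
    scores = Counter()
    for point, label in training:
        d = dist(point, query)
        if d == 0:
            return label          # exact match with a stored instance
        scores[label] += 1.0 / d
    return scores.most_common(1)[0][0]


# Hypothetical two-dimensional training data.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"), ((6.0, 5.5), "B"),
]
near_a = knn_classify(training, (1.1, 1.0))
near_b = knn_classify(training, (5.5, 5.0))
weighted_b = shepard_classify(training, (5.5, 5.0))
```

The second query has only two "B" points nearby, so its third neighbor is an "A" point, yet the vote still goes to "B".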
23 Neural Networks (1)
• A neural network is a network of artificial neurons, loosely modeled on the operation of the human brain.
• Neural networks usually have their nodes arranged in layers.
• One layer is the input layer, and another is the output layer.
• There are one or more hidden layers between these two.
24 Neural Networks (2)
• The connections between nodes have weights associated with them, which determine the behavior of the network.
• Input data is applied to the input layer.
• Neurons fire if their inputs are above a certain level.
• If one neuron is connected to another, the firing of one may cause the firing of the next.
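The firing behavior described above can be sketched with a simple threshold unit: a neuron sums its weighted inputs and fires when the sum exceeds a threshold. The weights, the threshold of 0.5 and the two-layer layout are arbitrary assumptions for illustration, not values from the slides.

```python
def neuron_fires(inputs, weights, threshold):
    """A neuron fires when its weighted input sum exceeds the threshold."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return activation > threshold


def layer_output(inputs, weight_rows, threshold):
    """Outputs (1 = fired, 0 = silent) of one layer, one weight row per neuron."""
    return [1 if neuron_fires(inputs, row, threshold) else 0
            for row in weight_rows]


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[0.6, 0.6], [1.0, -1.0]]
output_weights = [[0.4, 0.4]]

hidden = layer_output([1, 0], hidden_weights, threshold=0.5)
output = layer_output(hidden, output_weights, threshold=0.5)
silent = layer_output(layer_output([0, 0], hidden_weights, 0.5),
                      output_weights, 0.5)
```

With input `[1, 0]` both hidden neurons fire, which in turn pushes the output neuron over its threshold. With input `[0, 0]` nothing fires.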
25 Supervised Learning
• Many neural networks use supervised learning.
• Pre-classified training data is provided to the network before it is presented with unseen data.
• The training data causes the weights in the network to be set to levels such that unseen data can be classified correctly.
• Neural networks are able to learn extremely complex classification functions.
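The slides do not say how the weights are set. As one simple illustration, the sketch below uses the classic perceptron learning rule on a single neuron, which is a swapped-in technique rather than the method of any particular network discussed here. The AND data set, the learning rate and the epoch count are assumptions.

```python
def predict(weights, bias, inputs):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0


def train_perceptron(data, epochs=50, rate=1.0):
    """Adjust the weights whenever the neuron misclassifies a training item."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in data:
            error = target - predict(weights, bias, inputs)   # -1, 0 or +1
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Pre-classified training data for logical AND (linearly separable).
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_data)
predictions = [predict(weights, bias, x) for x, _ in and_data]
```

After training, the predictions agree with the target labels for all four inputs. A single neuron of this kind can only learn linearly separable functions, which is one reason practical networks use hidden layers.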
26 Unsupervised Learning
• Unsupervised learning networks learn without requiring human intervention.
• No pre-classified training data is required.
• The system learns to cluster input data into a set of classifications that are not previously defined.
• Example: Kohonen Maps.
27 Reinforcement Learning
• Systems that learn using reinforcement learning are given positive feedback when they classify data correctly, and negative feedback when they classify data incorrectly.
• Credit assignment is needed to reward nodes in a network correctly.