1
Data Mining, Neural Network and Genetic Programming
COMP422 Week 5 Artificial Neural Network Yi Mei and Mengjie Zhang
2
Outline Why ANN? Origin Perceptron Multi-Layer Perceptron
Neural Network BP Algorithm Issues in BP
3
Why ANN? Killer applications in a lot of areas
Computer Vision/Image processing Game playing (AlphaGo, Watson, …) Big data
4
Origin
5
Origin Facts about the human brain: about $10^{11}$ neurons, massively connected
Each neuron is connected to roughly $10^4$ other neurons, giving on the order of $10^{15}$ connections in total. Brain message passing is about a million times slower than modern electronic circuits, but the brain is very efficient for complex decision making: usually fewer than 100 serial stages (the "100-step rule").
6
Origin Human brain shows amazing capability in
Learning Perception Adaptability … ANN models simulate the human brain to achieve the above capabilities
7
Artificial Neuron Output $z_j = f\left(\sum_i w_{ji}\, x_i + b_j\right)$: a weighted sum of the inputs plus a bias $b_j$, passed through an activation function $f$
8
Activation Functions Threshold Sigmoid
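To make these two activation functions concrete, here is a minimal Python/NumPy sketch (the function names and example inputs are illustrative, not from the slides):

```python
import numpy as np

def threshold(z, theta=0.0):
    """Hard threshold: output 1 if the weighted sum z reaches theta, else 0."""
    return np.where(z >= theta, 1.0, 0.0)

def sigmoid(z):
    """Sigmoid: squashes z smoothly into (0, 1), so it is differentiable."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(threshold(z))  # [0. 1. 1.]
print(sigmoid(z))    # roughly [0.119, 0.5, 0.881]
```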
9
Perceptron A special type of artificial neuron Real-valued inputs
Binary output Threshold activation function
10
Perceptron To perform linear classification Can do online learning
Update $w_{ji}$ and $b_j$ as new examples arrive
11
Learning Perceptron How to get the optimal weights and threshold?
Only consider accuracy: optimal if 100% accuracy on the training set (there can be many optimal solutions) To simplify notation, transform the threshold into a weight: $w_{j0} = b_j$, with $x_0 = 1$ always
12
Learning Perceptron Idea
Initialise weights and threshold randomly (or all zeros) Given a new example $(x_1, x_2, \ldots, x_m, d)$: input feature vector $(x_1, x_2, \ldots, x_m)$, output (class label) $d$, predicted output $y$ If $y = 0$ and $d = 1$, increase $b = w_0$ and increase $w_i$ for each positive $x_i$ If $y = 1$ and $d = 0$, decrease $b = w_0$ and decrease $w_i$ for each positive $x_i$ Repeat the process for each new example until the desired behaviour is achieved
13
Learning Perceptron Implementation
Initialise weights and threshold randomly (or all zeros) Given a new example $(x_1, x_2, \ldots, x_m, d)$ with input feature vector $(x_1, x_2, \ldots, x_m)$, class label $d$, and predicted output $y$, update each weight (taking $x_0 = 1$): $w_i \leftarrow w_i + \eta\,(d - y)\,x_i$, where $\eta \in [0, 1]$ is called the learning rate Repeat the process for each new example until the desired behaviour is achieved
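A minimal sketch of this online learning rule in Python/NumPy, assuming the update $w_i \leftarrow w_i + \eta(d - y)x_i$ with the threshold folded into $w_0$ (the function name and the AND example are illustrative):

```python
import numpy as np

def train_perceptron(examples, eta=0.1, epochs=100):
    """Online perceptron learning: update the weights after every example.

    examples: list of (x, d) pairs, x a feature vector and d a 0/1 label.
    The threshold is folded into w[0] by prepending x_0 = 1 to every input.
    """
    m = len(examples[0][0])
    w = np.zeros(m + 1)                      # w[0] plays the role of b
    for _ in range(epochs):
        errors = 0
        for x, d in examples:
            x = np.insert(np.asarray(x, dtype=float), 0, 1.0)  # x_0 = 1
            y = 1.0 if w @ x >= 0 else 0.0   # threshold activation
            if y != d:
                w += eta * (d - y) * x       # raise/lower weights as described above
                errors += 1
        if errors == 0:                      # 100% accuracy on the training set
            break
    return w

# Example: learning AND, which is linearly separable.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(data))
```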
14
Learning Perceptron Online learning: update weights after each new example Offline learning: update weights after processing all the training examples Batch learning: update weights after each batch (a subset of the training examples)
15
Problem with Perceptron
What can the perceptron learn?
16
Problem with Perceptron
What can the perceptron learn? Perceptron convergence theorem: the perceptron learning algorithm will converge if and only if the training set is linearly separable. It cannot learn XOR (Minsky and Papert, 1969)
17
Multi-Layer Perceptron
Add one hidden node between the inputs and output
18
Neural Network A more general form of perceptron Hidden layer(s)
Output layer (may have more than one node)
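As a rough illustration, a feedforward pass through a network with one hidden layer might look like the following sketch (sigmoid activations assumed everywhere; the layer sizes and weight ranges are arbitrary examples):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, W_out, b_out):
    """One feedforward pass: inputs -> hidden layer -> output layer."""
    h = sigmoid(W_hidden @ x + b_hidden)   # hidden node outputs
    return sigmoid(W_out @ h + b_out)      # output node outputs

rng = np.random.default_rng(0)
x = np.array([0.2, 0.7])                        # 2 input nodes
W_hidden = rng.uniform(-0.5, 0.5, size=(3, 2))  # 3 hidden nodes
W_out = rng.uniform(-0.5, 0.5, size=(2, 3))     # 2 output nodes
print(forward(x, W_hidden, np.zeros(3), W_out, np.zeros(2)))
```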
19
Autoencoder A type of ANN for learning efficient codings
Unsupervised learning Typically dimensionality reduction Recently widely used for learning generative models of data
20
Neural Network Design questions?
21
Neural Network Design questions? Architecture Parameters
How many hidden layers? How many hidden nodes? How are the layers/nodes connected? With or without cycles? Parameters: Activation function? Learning rate? Learning algorithm?
22
Learning ANN Weights A complex optimisation problem
Usually non-convex (many local optima) Extremely high dimensional Hardly possible to solve using exact methods
23
Learning ANN Weights Approximate methods Hill climbing (local search)
(Stochastic) gradient descent search Simulated annealing Tabu search Evolutionary computation …
24
Back Propagation (BP) Algorithm
Gradient descent Initialise the weights Feedforward For each example, calculate the predicted outputs $o_z$ using the current weights Calculate the error $E = \sum_z (d_z - o_z)^2$ Back propagation Estimate the contribution of each weight to the error, i.e. how much the error will be reduced when increasing/decreasing the weight (the gradient) Change each weight (simultaneously) in proportion to its contribution, to reduce the error as much as possible Calculate the gradients backwards (from the last hidden layer to the first hidden layer)
25
Back Propagation (BP) Algorithm
How to calculate the gradient of $w_{i \to j}$? Idea Proportional to the output $o_i$ Proportional to the slope of the activation function Proportional to the contribution of node $j$, back propagated from the next layer: $\beta_j$
26
Back Propagation (BP) Algorithm
Assume a neural network with Activation function: sigmoid Minimise the total sum squared error Output node: $\beta_z = d_z - o_z$ Hidden node: $\beta_j = \sum_k w_{j \to k}\, o_k (1 - o_k)\, \beta_k$
27
BP Algorithm Implementation
Let $\eta$ be the learning rate Set all weights to small random values Until the total error is small enough, repeat For each input example Feed forward pass to get the predicted outputs Compute $\beta_z = d_z - o_z$ for each output node Compute $\beta_j = \sum_k w_{j \to k}\, o_k (1 - o_k)\, \beta_k$ for each hidden node Compute the weight changes $\Delta w_{i \to j} = \eta\, o_i\, o_j (1 - o_j)\, \beta_j$ Add up the weight changes for all input examples Change the weights
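A minimal sketch of one batch epoch of these updates for a network with a single hidden layer, written in Python/NumPy (biases are omitted for brevity; the function name and shapes are illustrative, following the $\beta$ and $\Delta w$ formulas on this slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_epoch(X, D, W1, W2, eta=0.2):
    """One batch epoch of the BP updates above for a 1-hidden-layer network.

    X: (n, m) inputs, D: (n, p) desired outputs,
    W1: (h, m) input->hidden weights, W2: (p, h) hidden->output weights.
    """
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    for x, d in zip(X, D):
        # feedforward pass
        o_h = sigmoid(W1 @ x)                        # hidden outputs o_j
        o_z = sigmoid(W2 @ o_h)                      # network outputs o_z
        # back propagation
        beta_z = d - o_z                             # output nodes: beta_z = d_z - o_z
        beta_h = W2.T @ (o_z * (1 - o_z) * beta_z)   # hidden: sum_k w_jk o_k (1-o_k) beta_k
        # accumulate Delta w_{i->j} = eta * o_i * o_j * (1 - o_j) * beta_j
        dW2 += eta * np.outer(o_z * (1 - o_z) * beta_z, o_h)
        dW1 += eta * np.outer(o_h * (1 - o_h) * beta_h, x)
    return W1 + dW1, W2 + dW2                        # change the weights
```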
28
BP Algorithm Example Calculate one pass of the BP algorithm (feedforward + back propagation) for a given example, with inputs $I_1$, $I_2$ and desired outputs $d_5$, $d_6$
29
Notes on BP Algorithm Epoch: one pass over all input examples (the entire training set, a batch, …) With sigmoid outputs, a target of exactly 0 or 1 can never be reached; usually a value > 0.9 (or > 0.8) is interpreted as 1 Training may require thousands of epochs; a convergence curve will help to decide when to stop
30
Issues in BP Algorithm What problems can BP algorithm have?
31
Issues in BP Algorithm What problems can BP algorithm have?
Improper learning rate: 0.2 is a good starting point in practice
32
Issues in BP Algorithm What problems can BP algorithm have?
Overfitting Training for too long Too many weights to train Too few examples
33
Issues in BP Algorithm What problems can BP algorithm have?
Local minima Too many weights Non-convex optimisation
34
When to Stop Training Which ways can you think of?
35
When to Stop Training Which ways can you think of?
When a certain number of epochs/cycles is reached When the error (e.g. mean/total squared error) on the training set is smaller than a threshold Proportion of correctly classified training instances (i.e. accuracy) is larger than a threshold Early stopping strategy Validation control
36
Validation Control Break the training set into 2 parts
Use part 1 to compute the weight changes Every m (e.g. 10, 50, 100) epochs, apply the partially trained NN to part 2 (the validation set) to calculate the validation error Stop when the error on the validation set reaches its minimum
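A minimal sketch of this validation-control loop; `update_fn` (one training pass on part 1) and `error_fn` (error on part 2) are placeholders supplied by the caller, not functions defined in the slides:

```python
import numpy as np

def train_with_validation(weights, update_fn, error_fn, train_part, valid_part,
                          m=10, max_epochs=1000):
    """Keep the weights that gave the smallest validation error so far."""
    best_err, best_weights = np.inf, weights.copy()
    for epoch in range(1, max_epochs + 1):
        weights = update_fn(weights, train_part)     # part 1: compute weight changes
        if epoch % m == 0:                           # every m epochs...
            err = error_fn(weights, valid_part)      # ...check part 2 (validation set)
            if err < best_err:
                best_err, best_weights = err, weights.copy()
    return best_weights, best_err
```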
37
Local Minima How can you tell if a local minimum is reached?
What to do with a local minimum?
38
Local Minima How can you tell if a local minimum is reached?
No improvement for a certain number of epochs Runs with different starting points end with different errors What to do with a local minimum? Nothing, if the training is "good enough" Increase the learning rate However, a learning rate that is too large can cause oscillation Start with a large learning rate, then decrease it as training proceeds
39
ANN Architecture How many input and output nodes?
How many hidden layers/nodes?
40
ANN Architecture How many input and output nodes?
Usually determined by the problem How many hidden layers/nodes? Theorem: one hidden layer (with enough hidden nodes) is enough for any problem But training may be faster with several layers Best to have as few hidden layers/nodes as possible: better generalisation, fewer weights to optimise (easier to solve) Make the best guess you can If training is unsuccessful, try more hidden nodes If training is successful, try fewer hidden nodes Observe the weights after training: nodes with small weights can probably be eliminated
41
Representing Variables
Variables in the database can have different types
42
Representing Variables
Variables in the database can have different types Numeric (continuous, integer, ordinal, …) E.g. age, weight, temperature Nominal (symbolic, class, categorical) E.g. gender = male/female, colour = red/blue/green How to represent them?
43
Nominal Input Variables
Use a binary representation Each nominal variable has $k$ possible values Create $k$ input nodes for it (one for each possible value) 1 if the variable takes that value, and 0 otherwise Male = (1, 0) Female = (0, 1) Red = (1, 0, 0) Blue = (0, 1, 0) Green = (0, 0, 1)
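A minimal sketch of this binary (one-hot) representation in Python (the helper name is illustrative):

```python
def one_hot(value, categories):
    """Encode a nominal value as k binary inputs, one per possible value."""
    return [1.0 if value == c else 0.0 for c in categories]

print(one_hot("female", ["male", "female"]))      # [0.0, 1.0]
print(one_hot("blue", ["red", "blue", "green"]))  # [0.0, 1.0, 0.0]
```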
44
Nominal Input Variables
Why not a single node, with male = 1 and female = 2? Why not train a separate network for each possible value (category)? Not enough training data, and the network may not generalise between different categories Consider merging similar values of nominal variables E.g. occupation = (plumber, electrician): for predicting "creditworthy", these two occupations probably follow the same model, so combine them Feature selection (may remove irrelevant nominal variables)
45
Numeric Input Variables
Different numeric types Continuous/Integer (age, income, …) Periodic (time of day, direction/angle, …) Ordinal (first, second, …) Trivial? Give one input node, and set the value directly?
46
Numeric Input Variables
Scaling: rescale the input values so they have a similar range (and hence a similar contribution) What is the scaled range? What if we scale to a different range such as [-1, 1] or [0.1, 0.9]? Standardisation: assume a normal distribution and transform to N(0, 1)
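The two options might be sketched as follows (the helper names and the age values are illustrative):

```python
import numpy as np

def min_max_scale(x, low=0.0, high=1.0):
    """Rescale the values linearly into [low, high], e.g. [0, 1] or [0.1, 0.9]."""
    x = np.asarray(x, dtype=float)
    return low + (x - x.min()) * (high - low) / (x.max() - x.min())

def standardise(x):
    """Shift and scale to zero mean and unit variance (roughly N(0, 1))."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

ages = [18, 25, 40, 65]
print(min_max_scale(ages, 0.1, 0.9))
print(standardise(ages))
```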
47
Periodic Input Variables
Smoothness on the boundary Sunday -> Monday, December -> January, 11:59pm -> 0:00am Interpolation representation The value of each node = the area covered E.g. midnight = 1 for node 1, and 0 for all other nodes E.g. 9am = 0.5 for nodes 2 and 3, and 0 for nodes 1 and 4 E.g. 9pm = 0.5 for nodes 4 and 1, and 0 for nodes 2 and 3 (Diagram: four nodes covering the 24-hour cycle from midnight through 6am, noon, and 6pm, wrapping back to midnight.)
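One way to reproduce the examples above is to interpolate between node centres, assuming the four nodes are centred at midnight, 6am, noon, and 6pm and evenly spaced (a sketch only; the exact scheme on the original slide may differ):

```python
import numpy as np

def encode_time_of_day(hour, n_nodes=4):
    """Periodic interpolation: a time activates the two nodes whose centres
    surround it, in proportion to its closeness to each, wrapping at midnight.
    Centres assumed at midnight, 6am, noon, 6pm for nodes 1-4."""
    values = np.zeros(n_nodes)
    spacing = 24.0 / n_nodes
    pos = (hour % 24.0) / spacing            # position in units of node spacing
    left = int(np.floor(pos)) % n_nodes
    right = (left + 1) % n_nodes
    frac = pos - np.floor(pos)
    values[left], values[right] = 1.0 - frac, frac
    return values

print(encode_time_of_day(0))    # midnight -> [1, 0, 0, 0]
print(encode_time_of_day(9))    # 9am      -> [0, 0.5, 0.5, 0]
print(encode_time_of_day(21))   # 9pm      -> [0.5, 0, 0, 0.5]
```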
48
Output Variables A single output node?
1 for one class, and 0 for the other class Or 1 for class 1, 2 for class 2, 3 for class 3, … One output node for each possible class label 2 for binary classification, $k$ for $k$-class classification 1 if the example belongs to that class label, and 0 otherwise Which one do you prefer?
49
Output Variables In practice, the outputs cannot reach exactly 1
What if the 2-class output vector is (0.99, 0.01)? What if (0.6, 0.4)? (0.5, 0.5)? Activation function for output nodes?
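One common way to read such an output vector is to take the largest output as the predicted class and to treat near-ties as uncertain; a small sketch (the 0.1 margin is an arbitrary illustrative choice, not from the slides):

```python
import numpy as np

def predict_class(outputs, margin=0.1):
    """Return the index of the largest output and whether the win is clear."""
    outputs = np.asarray(outputs, dtype=float)
    best = int(np.argmax(outputs))
    second = np.max(np.delete(outputs, best))
    return best, bool(outputs[best] - second >= margin)

print(predict_class([0.99, 0.01]))  # (0, True)  - clearly class 0
print(predict_class([0.6, 0.4]))    # (0, True)  - class 0, smaller margin
print(predict_class([0.5, 0.5]))    # (0, False) - ambiguous
```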
50
Weight Initialisation
Random Uniform from [-100, 100]? [0, 1]? [-1, 1]? [-0.01, 0.01]? Normal distribution? Not random? All zeros?
51
Weight Initialisation
Random Uniform from [-100, 100]? [0, 1]? [-1, 1]? [-0.01, 0.01]? Normal distribution? Not random? All zeros? Near-zero initial values: large gradient for the activation function Break the symmetry: different weights (what will happen if all weights are initially equal?) Fan-in factor: uniform from $[-\tfrac{1}{d}, \tfrac{1}{d}]$, where $d$ is the number of input nodes for this node
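A minimal sketch of the fan-in initialisation described above (the function name is illustrative):

```python
import numpy as np

def fan_in_init(n_out, n_in, rng=None):
    """Draw each weight uniformly from [-1/d, 1/d], where d = n_in is the
    number of input nodes feeding the node: near-zero values keep the weighted
    sums small (large sigmoid gradients), and randomness breaks the symmetry."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.uniform(-1.0 / n_in, 1.0 / n_in, size=(n_out, n_in))

W1 = fan_in_init(3, 2)   # 3 hidden nodes, each with d = 2 inputs
W2 = fan_in_init(2, 3)   # 2 output nodes, each with d = 3 inputs
print(W1, W2, sep="\n")
```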
52
Fan-in Factor What is the variance of the weighted sum
$w_1 x_1 + w_2 x_2 + w_3 x_3$? What if there are 100 input nodes? Hint: if $X$ and $Y$ are independent and have zero means, then $\mathrm{Var}(XY) = \mathrm{Var}(X)\,\mathrm{Var}(Y)$
53
Fan-in Factor The variance of the weighted sum (the input of the activation function) increases with the number of input nodes A large weighted sum would lead to a small gradient of the activation function
54
Fan-in Factor What is the variance of the weighted sum
$w_1 x_1 + w_2 x_2 + w_3 x_3$? If $X$ and $Y$ are independent and have zero means, then $\mathrm{Var}(XY) = \mathrm{Var}(X)\,\mathrm{Var}(Y)$
55
Fan-in Factor What is the variance of the weighted sum
$w_1 x_1 + w_2 x_2 + w_3 x_3$? Always 1/3, regardless of the number of input nodes (using the fact that if $X$ and $Y$ are independent and have zero means, $\mathrm{Var}(XY) = \mathrm{Var}(X)\,\mathrm{Var}(Y)$)
56
Speeding up BP It is normal for huge ANNs to take days/weeks/months to train, so speeding up training is important Momentum is a widely used approach: reuse the weight change from the last step(s) Does momentum always work? Have you used/seen momentum before? How to choose $\eta$ and $\alpha$?
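In the common formulation, the new weight change adds a fraction $\alpha$ of the previous change to the usual gradient step: $\Delta w(t) = -\eta\,\partial E/\partial w + \alpha\,\Delta w(t-1)$. A minimal sketch (the values of $\eta$ and $\alpha$ are typical illustrative choices, not prescribed by the slides):

```python
import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.2, alpha=0.9):
    """One weight update with momentum: gradient step plus a fraction of the
    previous step, which smooths oscillation and speeds up steady descent."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

w = np.array([0.5, -0.3])
prev = np.zeros_like(w)
grad = np.array([0.1, -0.2])     # dE/dw at the current weights (illustrative)
w, prev = momentum_step(w, grad, prev)
print(w, prev)
```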
57
Summary ANN to simulate human brain Perceptron (a simple neuron)
Neural network Design questions: architecture, activation, objective function, … Learning ANN weights BP algorithm Issues in BP Variable representation Overfitting Weight initialisation