Last lecture summary Naïve Bayes Classifier
Bayes Rule: P(Y|X) = P(X|Y) P(Y) / P(X)
posterior = likelihood × prior / normalization constant
Prior and likelihood must be learnt (i.e. estimated from the data).
Learning the prior – A hundred independently drawn training examples will usually suffice to obtain a reasonable estimate of P(Y).
Learning the likelihood – The Naïve Bayes Assumption: assume that all features are independent given the class label Y.
Example – Play Tennis
Example – Learning Phase

Outlook       Play=Yes   Play=No
Sunny         2/9        3/5
Overcast      4/9        0/5
Rain          3/9        2/5

Temperature   Play=Yes   Play=No
Hot           2/9        2/5
Mild          4/9        2/5
Cool          3/9        1/5

Humidity      Play=Yes   Play=No
High          3/9        4/5
Normal        6/9        1/5

Wind          Play=Yes   Play=No
Strong        3/9        3/5
Weak          6/9        2/5

P(Play=Yes) = 9/14, P(Play=No) = 5/14
e.g. P(Outlook=Sunny|Play=Yes) = 2/9
Example – Prediction
x' = (Outlook=Sunny, Temp=Cool, Hum=High, Wind=Strong)
Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9, P(Temp=Cool|Play=Yes) = 3/9, P(Hum=High|Play=Yes) = 3/9, P(Wind=Strong|Play=Yes) = 3/9, P(Play=Yes) = 9/14
P(Outlook=Sunny|Play=No) = 3/5, P(Temp=Cool|Play=No) = 1/5, P(Hum=High|Play=No) = 4/5, P(Wind=Strong|Play=No) = 3/5, P(Play=No) = 5/14
P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = (2/9)(3/9)(3/9)(3/9)(9/14) ≈ 0.0053
P(No|x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = (3/5)(1/5)(4/5)(3/5)(5/14) ≈ 0.0206
Since P(Yes|x') < P(No|x'), we label x' as "No".
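A minimal Python sketch of this table-lookup prediction (the dictionary and variable names are illustrative; the probabilities are those from the learning-phase tables above):

```python
# Likelihoods and priors copied from the learning-phase tables above.
likelihood = {
    "Yes": {"Outlook=Sunny": 2/9, "Temp=Cool": 3/9, "Hum=High": 3/9, "Wind=Strong": 3/9},
    "No":  {"Outlook=Sunny": 3/5, "Temp=Cool": 1/5, "Hum=High": 4/5, "Wind=Strong": 3/5},
}
prior = {"Yes": 9/14, "No": 5/14}

x_new = ["Outlook=Sunny", "Temp=Cool", "Hum=High", "Wind=Strong"]

score = {}
for label in prior:
    p = prior[label]
    for feature in x_new:
        p *= likelihood[label][feature]   # naive (conditional independence) assumption
    score[label] = p

print(score)                          # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(score, key=score.get))      # 'No'
```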
Last lecture summary Binary classifier performance
TP, TN, FP, FN
– Precision, Positive Predictive Value (PPV): TP / (TP + FP)
– Recall, Sensitivity, True Positive Rate (TPR), Hit rate: TP / P = TP / (TP + FN)
– False Positive Rate (FPR), Fall-out: FP / N = FP / (FP + TN)
– Specificity, True Negative Rate (TNR): TN / (TN + FP) = 1 − FPR
– Accuracy: (TP + TN) / (TP + TN + FP + FN)
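A small sketch computing these metrics from raw confusion-matrix counts (the counts below are made up purely for illustration):

```python
# Illustrative counts from a hypothetical binary classifier.
TP, TN, FP, FN = 40, 45, 5, 10

precision   = TP / (TP + FP)                  # PPV
recall      = TP / (TP + FN)                  # sensitivity, TPR, hit rate
fpr         = FP / (FP + TN)                  # fall-out
specificity = TN / (TN + FP)                  # TNR = 1 - FPR
accuracy    = (TP + TN) / (TP + TN + FP + FN)

print(precision, recall, fpr, specificity, accuracy)
```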
Neural networks (new stuff)
Biological motivation The human brain has been estimated to contain ~10^11 brain cells (neurons). A neuron is an electrically excitable cell that processes and transmits information by electrochemical signaling. Each neuron is connected to other neurons through connections called synapses. A typical neuron possesses a cell body (often called the soma), dendrites (many, on the order of millimetres long), and an axon (one, 10 cm – 1 m long).
A synapse permits a neuron to pass an electrical or chemical signal to another cell. A synapse can be either excitatory or inhibitory. Synapses are of different strengths (the stronger the synapse, the more important it is). The effects of synapses accumulate inside the neuron. When the cumulative effect of the synapses reaches a certain threshold, the neuron is activated and a signal is sent down the axon, through which the neuron is connected to other neuron(s).
Neural networks for applied science and engineering, Samarasinghe
Warren McCulloch, Walter Pitts – the threshold neuron
1st mathematical model of a neuron – the McCulloch & Pitts binary (threshold) neuron – only binary inputs and output – the weights are pre-set, no learning. (Truth table with binary inputs x1, x2 and target output t.)
Heaviside (threshold) activation function: the neuron outputs 1 if the weighted input sum reaches the threshold, and 0 otherwise.
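A minimal sketch of such a threshold neuron with a Heaviside activation, using hand-set (pre-set, not learned) weights; the weight and threshold values realizing logical AND are illustrative choices:

```python
def heaviside(x):
    # Heaviside step: 1 for x >= 0, 0 otherwise
    return 1 if x >= 0 else 0

def threshold_neuron(inputs, weights, threshold):
    # fires (outputs 1) when the weighted sum reaches the threshold
    s = sum(w * x for w, x in zip(weights, inputs))
    return heaviside(s - threshold)

# Binary AND: both inputs weighted 1, threshold 2
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, threshold_neuron([x1, x2], weights=[1, 1], threshold=2))
```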
Perceptron (1957) – Frank Rosenblatt. He developed the learning algorithm and used his neuron (pattern recognizer = perceptron) for classification of letters.
Multiple output perceptron – for multicategory (i.e. more than 2 classes) classification
– one output neuron for each class
– input layer, output layer
– terminology: single layer (one-layered) vs. double layer (two-layered)
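A rough sketch of the forward pass of such a single-layer, multiple-output perceptron; the weights here are random placeholders (in practice they would be learned), and the sizes are arbitrary:

```python
import numpy as np

n_features, n_classes = 4, 3
W = np.random.randn(n_classes, n_features)   # one weight vector per output neuron
b = np.zeros(n_classes)                      # one bias per output neuron

x = np.array([0.5, -1.0, 2.0, 0.1])          # one input pattern
activations = W @ x + b                      # one activation per class
predicted_class = int(np.argmax(activations))  # class whose neuron responds most strongly
print(activations, predicted_class)
```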
Learning
Requirements for the minimum – the gradient. grad E is a vector pointing in the direction of the greatest rate of increase of the function. Since we want to descend (decrease the error), we move in the direction of −grad E.
Delta rule
Error of a single linear neuron: E = ½ (t − y)², where t is the target and y = Σ_j w_j x_j is the neuron output.
To find the gradient, differentiate the error E with respect to w1: ∂E/∂w1 = −(t − y) x1.
According to the delta rule, the weight change is proportional to the negative of the error gradient: Δw1 = −β ∂E/∂w1 = β (t − y) x1.
New weight: w1_new = w1_old + Δw1 = w1_old + β (t − y) x1.
β is called the learning rate. It determines how far along the gradient we move in each step.
The new weight after the i-th iteration: w1(i+1) = w1(i) + β (t − y) x1, computed on the input pattern presented at iteration i.
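A minimal sketch of this delta-rule update, applied example by example to a single linear neuron; the toy data, targets and learning rate are illustrative only:

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
t = np.array([1.0, 1.0, 2.0, 0.0])           # targets (here simply t = x1 + x2)
w = np.zeros(2)
beta = 0.1                                   # learning rate

for epoch in range(50):
    for x_i, t_i in zip(X, t):
        y = w @ x_i                          # linear neuron output
        grad = -(t_i - y) * x_i              # dE/dw for E = 0.5 * (t - y)^2
        w = w - beta * grad                  # weight change = -beta * gradient

print(w)                                     # approaches [1., 1.]
```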
This is an iterative algorithm; one pass through the training set is not enough. One pass over the whole training data set is called an epoch. Adjusting the weights after each input pattern presentation (iteration) is called example-by-example (online) learning.
– For some problems this can cause the weights to oscillate – the adjustment required by one pattern may be cancelled by the next pattern.
– The next method is more popular.
Batch learning
– wait until all input patterns (i.e. the whole epoch) have been processed and only then adjust the weights, in the average sense.
– gives a more stable solution.
– Obtain the error gradient for each input pattern, average the gradients at the end of the epoch, and use this average value to adjust the weights using the delta rule (see the sketch below).
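A minimal sketch of this batch variant on the same toy problem as above (illustrative data; one averaged weight update per epoch):

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
t = np.array([1.0, 1.0, 2.0, 0.0])
w = np.zeros(2)
beta = 0.1

for epoch in range(200):
    # per-pattern error gradients, collected over the whole epoch
    grads = [-(t_i - w @ x_i) * x_i for x_i, t_i in zip(X, t)]
    # single delta-rule update using the average gradient
    w = w - beta * np.mean(grads, axis=0)

print(w)                                     # also approaches [1., 1.]
```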