Supervised Learning Approaches: Bayesian Learning, Neural Networks, Support Vector Machines, Ensemble Methods. Adapted from the lecture notes of V. Kumar and E. Alpaydın.
Credit Scoring Example
Given inputs income and savings, label each customer in a loan application as low-risk vs. high-risk.
Input: x = [x1, x2]^T; output: C ∈ {0, 1}.
Prediction: choose C = 1 if P(C = 1 | x1, x2) > 0.5, and C = 0 otherwise.
Prediction amounts to finding the C that maximizes the conditional probability P(C | x).
Bayes' Rule
P(C | x) = P(x | C) P(C) / P(x)
posterior = likelihood × prior / evidence
Example of Bayes' Rule
Given: a doctor knows that meningitis causes a stiff neck 50% of the time; the prior probability of any patient having meningitis is 1/50,000; the prior probability of any patient having a stiff neck is 1/20.
If a patient has a stiff neck, what is the probability that he/she has meningitis? (diagnostic inference)
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
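The diagnostic inference above can be checked with a short sketch:

```python
# Diagnostic inference with Bayes' rule: P(M|S) = P(S|M) P(M) / P(S)
p_s_given_m = 0.5       # meningitis causes stiff neck 50% of the time
p_m = 1 / 50000         # prior probability of meningitis
p_s = 1 / 20            # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)      # ≈ 0.0002 — still very unlikely despite the symptom
```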
Naïve Bayes Classifier
Assume independence among attributes Ai when the class is given:
P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
Estimate P(Ai | Cj) for all Ai and Cj from the training data.
A new point is classified as Cj if P(Cj) ∏i P(Ai | Cj) is maximal.
How to Estimate Probabilities from Data?
Class prior: P(C) = Nc / N, e.g., P(No) = 7/10, P(Yes) = 3/10.
For discrete attributes: P(Ai | Ck) = |Aik| / Nc, where |Aik| is the number of instances that have attribute value Ai and belong to class Ck.
Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0.
How to Estimate Probabilities from Data?
For continuous attributes:
Discretize the range into bins, one ordinal attribute per bin (this violates the independence assumption).
Two-way split (A < v or A ≥ v): choose only one of the two splits as the new attribute.
Probability density estimation: assume the attribute follows a normal distribution and use the data to estimate the parameters of the distribution (e.g., mean and standard deviation); this is parametric estimation, as in regression. Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c).
How to Estimate Probabilities from Data?
Normal distribution: P(Ai | cj) = (1 / √(2πσij²)) exp(−(Ai − μij)² / (2σij²)), with one (μij, σij) pair for each attribute–class combination (Ai, cj).
For (Income, Class=No): sample mean = 110, sample variance = 2975.
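A small sketch of the parametric estimate, using the slide's sample mean 110 and sample variance 2975 for (Income, Class=No):

```python
import math

def normal_pdf(x, mean, var):
    """Gaussian density N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Slide's estimates for (Income, Class=No): mean 110, variance 2975
p = normal_pdf(120, 110, 2975)
print(round(p, 4))  # 0.0072 — the value used in the worked example that follows
```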
Example of Naïve Bayes Classifier
Given a test record X = (Refund=No, Status=Married, Income=120K):
P(X | Class=No) = P(Refund=No | No) × P(Married | No) × P(Income=120K | No) = 4/7 × 4/7 × 0.0072 = 0.0024
P(X | Class=Yes) = P(Refund=No | Yes) × P(Married | Yes) × P(Income=120K | Yes) = 1 × 0 × 1.2×10⁻⁹ = 0
Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X), so the predicted class is No.
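The decision can be reproduced directly from the estimates quoted on the slide:

```python
# Naive Bayes decision for the test record X = (Refund=No, Married, Income=120K)
p_x_no  = (4/7) * (4/7) * 0.0072   # P(X|No) from the slide's estimates
p_x_yes = 1 * 0 * 1.2e-9           # P(X|Yes) — zero because P(Married|Yes) = 0
p_no, p_yes = 7/10, 3/10           # class priors P(No), P(Yes)

score_no  = p_x_no * p_no
score_yes = p_x_yes * p_yes
print(round(p_x_no, 4))                         # 0.0024
print("No" if score_no > score_yes else "Yes")  # No
```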
Example of Naïve Bayes Classifier
A: attributes; M: mammals; N: non-mammals.
If P(A | M) P(M) > P(A | N) P(N), classify as mammal.
Bayes' Rule: K > 2 Classes
P(Ci | x) = P(x | Ci) P(Ci) / P(x) = P(x | Ci) P(Ci) / Σk P(x | Ck) P(Ck)
Choose the class Ci with the highest posterior P(Ci | x).
Bayesian Networks
Also known as graphical models or probabilistic networks; they represent the interaction between variables visually.
Nodes are hypotheses (random variables), and the probability attached to a node corresponds to our belief in the truth of the hypothesis.
Arcs are direct influences between hypotheses.
The structure is represented as a directed acyclic graph (DAG).
The parameters are the conditional probabilities on the arcs.
Causes and Bayes' Rule
Diagnostic inference: knowing that the grass is wet (W), what is the probability that rain (R) is the cause?
P(R | W) = P(W | R) P(R) / P(W)
The causal direction gives P(W | R); Bayes' rule inverts it into the diagnostic direction P(R | W).
Causal vs. Diagnostic Inference
Causal inference: if the sprinkler (S) is on, what is the probability that the grass (W) is wet?
P(W | S) = P(W | R, S) P(R | S) + P(W | ~R, S) P(~R | S) = P(W | R, S) P(R) + P(W | ~R, S) P(~R) = 0.95 × 0.4 + 0.90 × 0.6 = 0.92
(since R and S are independent, P(R | S) = P(R)).
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?
P(S | W) = 0.35 > 0.2 = P(S), but P(S | R, W) = 0.21.
Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
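Both directions of inference can be computed from the network's conditional probability table. The slide gives P(R) = 0.4, P(S) = 0.2, P(W|R,S) = 0.95, and P(W|~R,S) = 0.90; the remaining two entries, P(W|R,~S) = 0.90 and P(W|~R,~S) = 0.10, are assumed here (they reproduce the slide's numbers):

```python
# Sprinkler/rain network. The last two CPT entries are assumed values.
p_r, p_s = 0.4, 0.2
p_w = {(1, 1): 0.95, (0, 1): 0.90, (1, 0): 0.90, (0, 0): 0.10}  # P(W=1 | R, S)

# Causal inference: P(W|S) = P(W|R,S) P(R) + P(W|~R,S) P(~R)  (R, S independent)
p_w_given_s = p_w[1, 1] * p_r + p_w[0, 1] * (1 - p_r)
p_w_given_not_s = p_w[1, 0] * p_r + p_w[0, 0] * (1 - p_r)

# Diagnostic inference: P(S|W) = P(W|S) P(S) / P(W)
p_w_marg = p_w_given_s * p_s + p_w_given_not_s * (1 - p_s)
p_s_given_w = p_w_given_s * p_s / p_w_marg

# Explaining away: P(S|R,W) = P(W|R,S) P(S) / P(W|R)
p_w_given_r = p_w[1, 1] * p_s + p_w[1, 0] * (1 - p_s)
p_s_given_r_w = p_w[1, 1] * p_s / p_w_given_r

print(round(p_w_given_s, 2))    # 0.92
print(round(p_s_given_w, 2))    # 0.35
print(round(p_s_given_r_w, 2))  # 0.21 — rain "explains away" the sprinkler
```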
Bayesian Networks: Causes
Causal inference:
P(W | C) = P(W | R, S) P(R, S | C) + P(W | ~R, S) P(~R, S | C) + P(W | R, ~S) P(R, ~S | C) + P(W | ~R, ~S) P(~R, ~S | C),
using the fact that P(R, S | C) = P(R | C) P(S | C).
Diagnostic inference: P(C | W) = ?
Naïve Bayes (Summary)
Robust to isolated noise points.
Handles missing values by ignoring the instance during probability estimation.
Robust to irrelevant attributes.
The independence assumption may not hold for some attributes; in that case, use other techniques such as Bayesian belief networks (BBN).
Artificial Neural Network (ANN)
An interconnected group of artificial neurons that uses a mathematical or computational model for information processing, based on a connectionist approach to computation.
Biological neural network: real biological neurons that are connected or functionally related in the nervous system.
Connectionism: mental phenomena can be described by interconnected networks of simple units.
A data-modeling tool to capture complex relationships between inputs and outputs or to find patterns in data; non-linear, statistical, and non-parametric.
Neural Network Example
The output Y is 1 if at least two of the three inputs are equal to 1.
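This "at least two of three" function can be realized by a single perceptron. The figure is not reproduced here, so the particular parameter values (weight 0.3 on each input, threshold 0.4, as in the common textbook version of this example) are an assumption:

```python
def perceptron_output(x1, x2, x3, w=0.3, t=0.4):
    """Fires (returns 1) when the weighted input sum exceeds the threshold t.
    Weights 0.3 and threshold 0.4 are assumed example values."""
    s = w * x1 + w * x2 + w * x3
    return 1 if s > t else 0

# Y = 1 exactly when at least two of the three binary inputs are 1
for x in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    print(x, perceptron_output(*x))
```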
19
ANN: Neuron
ANN
The model is an assembly of interconnected nodes and weighted links.
An output node sums its input values, weighted by its links: o = Σi wi Xi − t, where t is a threshold.
The perceptron model outputs Y = 1 if o > 0 and Y = 0 otherwise; alternatively, the output can be a sigmoid function: Y = sigmoid(o) = 1 / (1 + exp(−o)).
General Structure of ANN
Training an ANN means learning the weights of the neurons.
Algorithm for Learning an ANN
Initialize the weights (w0, w1, …, wk).
Adjust the weights so that the outputs of the ANN are consistent with the class labels of the training examples (offline or online).
Objective function: the squared error E = Σ (r − Y)² between the desired outputs r and the network outputs Y. Find the weights wi that minimize this objective function, e.g., with the backpropagation algorithm.
Online update rule: Δwi = λ (r − Y) Xi, where λ is the learning factor, gradually decreased over time for convergence.
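A minimal sketch of the online update rule Δwi = λ(r − Y)Xi, trained on the "at least two of three inputs" function from the earlier example (a fixed learning factor is used here for simplicity, rather than one decreased over time):

```python
def train_perceptron(data, epochs=100, lam=0.1):
    """Online perceptron learning: w_i <- w_i + lam * (r - y) * x_i."""
    w = [0.0, 0.0, 0.0]   # weights w1..w3
    b = 0.0               # bias (threshold) term, learned like a weight
    for _ in range(epochs):
        for x, r in data:                     # r is the desired output
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            for i in range(3):
                w[i] += lam * (r - y) * x[i]  # the update rule from the slide
            b += lam * (r - y)
    return w, b

# Target: y = 1 iff at least two of the three binary inputs are 1
data = [((x1, x2, x3), int(x1 + x2 + x3 >= 2))
        for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]
w, b = train_perceptron(data)
print(all((1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == r
          for x, r in data))   # True — the function is linearly separable
```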
Support Vector Machines
Find a linear hyperplane (decision boundary) that separates the data.
One possible solution; another possible solution; many other possible solutions — all separate the training data.
Which one is better, B1 or B2? How do you define "better"?
Find the hyperplane that maximizes the margin; by this criterion, B1 is better than B2.
Support Vector Machines What if the problem is not linearly separable?
Nonlinear Support Vector Machines
What if the decision boundary is not linear? Transform the data into a higher-dimensional space in which a linear separator exists.
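A toy illustration of the transformation idea (not a full SVM; the data points and labels are invented for this sketch): 1-D points labeled by their magnitude cannot be split by any single threshold on x, but the map φ(x) = (x, x²) makes them linearly separable:

```python
# 1-D data: class +1 for small |x|, class -1 otherwise — no single threshold
# on x separates the classes, so no linear boundary exists in the original space.
points = [(-3, -1), (-2, -1), (-1, +1), (1, +1), (2, -1), (3, -1)]

def phi(x):
    """Nonlinear map into a higher-dimensional space: phi(x) = (x, x^2)."""
    return (x, x * x)

def classify(x):
    # In the transformed space, the linear boundary x^2 = 2.5 separates the
    # classes; it corresponds to the nonlinear region -sqrt(2.5) < x < sqrt(2.5).
    _, x2 = phi(x)
    return +1 if x2 < 2.5 else -1

print(all(classify(x) == label for x, label in points))  # True
```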
Ensemble Methods
Construct a set of classifiers from the training data.
Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers.
Why Ensemble Classifiers?
Suppose there are 25 base classifiers, each with error rate ε = 0.35, and assume the classifiers are independent.
The majority-vote ensemble makes a wrong prediction only when at least 13 of the 25 base classifiers are wrong:
P(ensemble error) = Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^{25−i} ≈ 0.06, far lower than 0.35.
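The binomial tail above can be evaluated directly:

```python
from math import comb

eps, n = 0.35, 25
# Majority vote errs when 13 or more of the 25 independent base classifiers err:
# P = sum_{i=13}^{25} C(25, i) * eps^i * (1 - eps)^(25 - i)
p_ensemble = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                 for i in range(13, n + 1))
print(round(p_ensemble, 2))  # 0.06 — much better than the base rate 0.35
```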
Summary
Machine learning approaches: supervised, unsupervised, reinforcement.
Supervised learning for classification: Bayes' rule (diagnostic and causal inference), artificial neural networks, support vector machines, ensemble methods.
Assessing and Comparing Classifiers
Questions: assessing the expected error of a learning algorithm, and comparing the expected errors of two algorithms (is algorithm 1 more accurate than algorithm 2?).
Use training/validation/test sets.
Criteria (application-dependent): misclassification error or risk (loss functions), training time/space complexity, testing time/space complexity, interpretability, ease of programming, cost-sensitive learning.
K-Fold Cross-Validation
Multiple training/validation sets are needed: {Ti, Vi}i are the training/validation sets of fold i.
K-fold cross-validation: divide X into K equal parts Xi, i = 1, …, K; in fold i, use Xi as the validation set Vi and the union of the remaining K − 1 parts as the training set Ti.
Note that any two training sets Ti and Tj share K − 2 of the K parts.
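A minimal sketch of the fold construction (the partition scheme — taking every K-th element — is just one convenient choice):

```python
def k_fold_splits(X, K):
    """Partition X into K parts; fold i uses part i for validation, rest for training."""
    parts = [X[i::K] for i in range(K)]   # K roughly equal parts
    for i in range(K):
        V = parts[i]                      # validation set of fold i
        T = [x for j, p in enumerate(parts) if j != i for x in p]  # training set
        yield T, V

X = list(range(10))
for T, V in k_fold_splits(X, 5):
    print(V, len(T))  # each fold: 2 validation instances, 8 training instances
```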
Measuring Error
Error rate = # of errors / # of instances = (FN + FP) / N
Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate
Precision = # of found positives / # found = TP / (TP + FP)
Specificity = TN / (TN + FP)
False alarm rate = FP / (FP + TN) = 1 − specificity
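The definitions above, applied to a hypothetical confusion matrix (the counts are invented for illustration):

```python
def metrics(tp, fn, fp, tn):
    """Error rate, recall (sensitivity), precision, specificity, false-alarm rate."""
    n = tp + fn + fp + tn
    return {
        "error rate":  (fn + fp) / n,
        "recall":      tp / (tp + fn),
        "precision":   tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "false alarm": fp / (fp + tn),   # = 1 - specificity
    }

# Hypothetical confusion matrix: TP=40, FN=10, FP=5, TN=45
m = metrics(40, 10, 5, 45)
print(m["error rate"], m["recall"], m["false alarm"])  # 0.15 0.8 0.1
```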
Receiver Operating Characteristics (ROC) Curve
Plots the hit rate (TP rate) against the false alarm rate (FP rate) as the decision threshold varies.
Interval Estimation
X = {x^t}t where x^t ~ N(μ, σ²); the sample average m ~ N(μ, σ²/N).
Define a unit normal statistic Z = √N (m − μ) / σ ~ N(0, 1).
Then P(−z_{α/2} < Z < z_{α/2}) = 1 − α gives the 100(1 − α)% confidence interval m ± z_{α/2} σ / √N.
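A sketch of the 95% interval (the sample is invented and σ is assumed known, as in the derivation above; z_{0.025} ≈ 1.96):

```python
import math

# 95% confidence interval for the mean: m +/- z_{alpha/2} * sigma / sqrt(N)
x = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 5.1]  # hypothetical sample
sigma = 0.1                                    # assumed known
N = len(x)
m = sum(x) / N                # sample average, distributed N(mu, sigma^2 / N)
z = 1.96                      # z_{alpha/2} for alpha = 0.05
half = z * sigma / math.sqrt(N)
print(round(m - half, 3), round(m + half, 3))  # the interval around m
```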
McNemar's Test for Comparison
Given a single training/validation set, build a contingency table where e01 is the number of examples misclassified by algorithm 1 but not by algorithm 2, and e10 is the number misclassified by algorithm 2 but not by algorithm 1.
Under the hypothesis that the two algorithms have the same error rate, we expect e01 = e10 = (e01 + e10) / 2.
Chi-square statistic (with continuity correction): (|e01 − e10| − 1)² / (e01 + e10) ~ X²₁.
McNemar's test accepts the hypothesis at significance level α if the statistic is less than X²_{α,1} (for example, X²_{0.05,1} = 3.84).
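The statistic for hypothetical disagreement counts (e01 = 20, e10 = 10 are invented for illustration):

```python
# McNemar's test sketch: e01 = # misclassified by algorithm 1 but not by 2,
# e10 = # misclassified by algorithm 2 but not by 1 (hypothetical counts).
e01, e10 = 20, 10

# Chi-square statistic with continuity correction, ~ X^2 with 1 d.o.f.
chi2 = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
print(round(chi2, 2))  # 2.7
# 2.7 < 3.84 = X^2_{0.05,1}: accept the hypothesis that the error rates are equal
```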