Slide 1: Learning. CMPT 310, Simon Fraser University. Oliver Schulte.
Slide 2: The Big Picture: AI for Model-Based Agents
[Diagram from Artificial Intelligence: A Modern Approach relating Action, Learning, Knowledge, Logic, Probability, Heuristics, Inference, Planning, Decision Theory, Game Theory, Reinforcement Learning, Machine Learning, and Statistics.]
Slide 3: Motivation
Building a knowledge base is a significant investment of time and resources; it is prone to error and needs debugging. Alternative approach: learn rules from examples. Grand vision: start with "seed rules" from an expert, then use examples to expand and refine them.
Slide 4: Overview
Many learning models exist. We will consider two representative ones that are widely used in AI: learning Bayesian network parameters, and learning a decision tree classifier.
Slide 5: Examples
Programming by Example: Excel Flash Fill. Kaggle data science competitions.
Slide 6: Learning Bayesian Networks
Slide 7: Structure Learning Example: Sleep Disorder Network
Source: Fouron, Anne Gisèle (2006). Development of Bayesian Network models for obstructive sleep apnea syndrome assessment. M.Sc. Thesis, SFU.
Slide 8: Parameter Learning
Common approach: an expert specifies the Bayesian network structure (nodes and links), and the program fills in the parameters (conditional probabilities).
Slide 9: Parameter Learning Scenarios
Complete data (today). Later: missing data (EM).

Child node \ Parent node | Discrete parent                      | Continuous parent
Discrete child           | Maximum Likelihood; Decision Trees   | logit distribution (logistic regression)
Continuous child         | conditional Gaussian (not discussed) | linear Gaussian (linear regression)
Slide 10: The Parameter Learning Problem
Input: a data table X (N x D), with one column per node (random variable) and one row per instance. How do we fill in the Bayes net parameters?
[Diagram: Bayes net with nodes Humidity and PlayTennis.]

Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis
1   | sunny    | hot         | high     | weak   | no
2   | sunny    | hot         | high     | strong | no
3   | overcast | hot         | high     | weak   | yes
4   | rain     | mild        | high     | weak   | yes
5   | rain     | cool        | normal   | weak   | yes
6   | rain     | cool        | normal   | strong | no
7   | overcast | cool        | normal   | strong | yes
8   | sunny    | mild        | high     | weak   | no
9   | sunny    | cool        | normal   | weak   | yes
10  | rain     | mild        | normal   | weak   | yes
11  | sunny    | mild        | normal   | strong | yes
12  | overcast | mild        | high     | strong | yes
13  | overcast | hot         | normal   | weak   | yes
14  | rain     | mild        | high     | strong | no
Slide 11: Start Small: Single Node
What would you choose for P(Humidity = high) = θ? How about P(Humidity = high) = 50%?

Day | Humidity
1   | high
2   | high
3   | high
4   | high
5   | normal
6   | normal
7   | normal
8   | high
9   | normal
10  | normal
11  | normal
12  | high
13  | normal
14  | high
Slide 12: Parameters for Two Nodes
Network: Humidity → PlayTennis. Parameters: P(Humidity = high) = θ; P(PlayTennis = yes | Humidity = high) = θ1; P(PlayTennis = yes | Humidity = normal) = θ2.
Is θ the same as in the single-node model? How about θ1 = 3/7? How about θ2 = 6/7?

Day | Humidity | PlayTennis
1   | high     | no
2   | high     | no
3   | high     | yes
4   | high     | yes
5   | normal   | yes
6   | normal   | no
7   | normal   | yes
8   | high     | no
9   | normal   | yes
10  | normal   | yes
11  | normal   | yes
12  | high     | yes
13  | normal   | yes
14  | high     | no
Slide 13: Maximum Likelihood Estimation
Slide 14: MLE
An important general principle: choose parameter values that maximize the likelihood of the data. Intuition: explain the data as well as possible. Recall from Bayes' theorem that the likelihood is P(data | parameters) = P(D | θ).
Slide 15: Finding the Maximum Likelihood Solution: Single Node
P(Humidity = high) = θ.
1. Write down P(D | θ) for independent, identically distributed (i.i.d.) data: each "high" row contributes a factor θ, each "normal" row a factor 1 - θ.
2. In the example, P(D | θ) = θ^7 (1 - θ)^7.
3. Maximize this function of θ.
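The maximization in step 3 can be checked numerically. A minimal sketch in plain Python (helper name `likelihood` is my own) that evaluates P(D | θ) = θ^7 (1 - θ)^7 on a grid and confirms the maximum sits at the observed frequency 7/14 = 0.5:

```python
# Likelihood of 7 "high" and 7 "normal" observations as a function of theta.
def likelihood(theta, n_high=7, n_normal=7):
    return theta ** n_high * (1 - theta) ** n_normal

# Evaluate on a fine grid and pick the argmax.
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=likelihood)
print(best)  # 0.5: the MLE equals the observed frequency 7/14
```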
Slide 16: Solving the Equation
1. It is often convenient to apply logarithms to products: ln P(D | θ) = 7 ln(θ) + 7 ln(1 - θ).
2. Take the derivative and set it to 0: 7/θ - 7/(1 - θ) = 0, which gives θ = 1/2.
3. Exercise: try finding the minima of L given above.
Slide 17: Finding the Maximum Likelihood Solution: Two Nodes
Network: Humidity → PlayTennis. Parameters: P(Humidity = high) = θ; P(PlayTennis = yes | high) = θ1; P(PlayTennis = yes | normal) = θ2.

Humidity | PlayTennis | P(H, P | θ, θ1, θ2)
high     | no         | θ (1 - θ1)
high     | no         | θ (1 - θ1)
high     | yes        | θ θ1
high     | yes        | θ θ1
normal   | yes        | (1 - θ) θ2
normal   | no         | (1 - θ) (1 - θ2)
normal   | yes        | (1 - θ) θ2
high     | no         | θ (1 - θ1)
normal   | yes        | (1 - θ) θ2
normal   | yes        | (1 - θ) θ2
normal   | yes        | (1 - θ) θ2
high     | yes        | θ θ1
normal   | yes        | (1 - θ) θ2
high     | no         | θ (1 - θ1)
Slide 18: Finding the Maximum Likelihood Solution: Two Nodes
In a Bayes net, each parameter can be maximized separately: fixing a parent condition reduces the task to a single-node problem.
1. In the example, P(D | θ, θ1, θ2) = θ^7 (1 - θ)^7 (θ1)^3 (1 - θ1)^4 (θ2)^6 (1 - θ2).
2. Take logs, differentiate, and set the derivatives to 0.
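Because the likelihood factorizes, each parameter is estimated by counting within its parent condition. A sketch, with the 14 (Humidity, PlayTennis) pairs hard-coded from the table on slide 12:

```python
# (Humidity, PlayTennis) pairs from the 14-day table.
data = [("high", "no"), ("high", "no"), ("high", "yes"), ("high", "yes"),
        ("normal", "yes"), ("normal", "no"), ("normal", "yes"), ("high", "no"),
        ("normal", "yes"), ("normal", "yes"), ("normal", "yes"), ("high", "yes"),
        ("normal", "yes"), ("high", "no")]

theta = sum(h == "high" for h, _ in data) / len(data)  # P(Humidity = high)

# Fix a parent condition, then count as in the single-node problem.
high = [p for h, p in data if h == "high"]
normal = [p for h, p in data if h == "normal"]
theta1 = high.count("yes") / len(high)      # P(yes | high)   = 3/7
theta2 = normal.count("yes") / len(normal)  # P(yes | normal) = 6/7
print(theta, theta1, theta2)
```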
Slide 19: Finding the Maximum Likelihood Solution: Single Node, More Than Two Values
Parameters: P(Outlook = sunny) = θ1, P(Outlook = overcast) = θ2, P(Outlook = rain) = θ3.
1. In the example, P(D | θ1, θ2, θ3) = (θ1)^5 (θ2)^4 (θ3)^5.
2. The MLE solution for two possible values can be generalized, but needs more advanced math.
3. General solution: MLE = observed frequencies.

Day | Outlook
1   | sunny
2   | sunny
3   | overcast
4   | rain
5   | rain
6   | rain
7   | overcast
8   | sunny
9   | sunny
10  | rain
11  | sunny
12  | overcast
13  | overcast
14  | rain
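The "MLE = observed frequencies" rule can be sketched with a counter over the Outlook column:

```python
from collections import Counter

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]

# Maximum likelihood estimate for each value is just its observed frequency.
counts = Counter(outlook)
mle = {v: c / len(outlook) for v, c in counts.items()}
print(mle)  # sunny and rain each 5/14, overcast 4/14
```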
Slide 20: Decision Tree Classifiers
Slide 21: Multiple Choice Question
A decision tree:
1. helps an agent make decisions.
2. uses a Bayesian network to compute probabilities.
3. contains nodes with attribute values.
4. assigns a class label to a list of attribute values.
Slide 22: Classification
Predict a single target or class label for an object, given a vector of features; that is, model the conditional probability P(label | features). Example: predict PlayTennis given the 4 other features.

Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis
1   | sunny    | hot         | high     | weak   | no
2   | sunny    | hot         | high     | strong | no
3   | overcast | hot         | high     | weak   | yes
4   | rain     | mild        | high     | weak   | yes
5   | rain     | cool        | normal   | weak   | yes
6   | rain     | cool        | normal   | strong | no
7   | overcast | cool        | normal   | strong | yes
8   | sunny    | mild        | high     | weak   | no
9   | sunny    | cool        | normal   | weak   | yes
10  | rain     | mild        | normal   | weak   | yes
11  | sunny    | mild        | normal   | strong | yes
12  | overcast | mild        | high     | strong | yes
13  | overcast | hot         | normal   | weak   | yes
14  | rain     | mild        | high     | strong | no
Slide 23: Decision Tree
A popular type of classifier. Easy to visualize, especially for discrete attribute values, but usable for continuous ones as well. Learning is based on information theory.
Slide 24: Decision Tree Example
Slide 25: Exercise
Find decision trees to represent:
A OR B
A AND B
(A AND B) OR (C AND (NOT D) AND E)
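A decision tree for a Boolean function is just a nest of attribute tests. One way to sketch the first two trees is as functions whose `if` structure mirrors the tree, checked against the truth tables (the tree shapes below are one valid answer, not the only one):

```python
def tree_or(a, b):
    # Root tests A; only the A = false branch needs to test B.
    if a:
        return True
    return b

def tree_and(a, b):
    # Root tests A; the B-subtree is reached only when A is true.
    if a:
        return b
    return False

# Verify against the full truth tables.
for a in (False, True):
    for b in (False, True):
        assert tree_or(a, b) == (a or b)
        assert tree_and(a, b) == (a and b)
```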
Slide 26: Example: Rate of Reboot Failure
Slide 27: Big Decision Tree for NHL Goal Scoring
Slide 28: Decision Tree Learning
Basic loop:
1. A := the "best" decision attribute for the next node.
2. For each value of A, create a new descendant of the node.
3. Assign training examples to the leaf nodes.
4. If the training examples are perfectly classified, STOP; else iterate over the new leaf nodes.
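The loop above can be sketched as a small recursive learner. This is a minimal ID3-style implementation (helper names are my own, and tie-breaking and pruning are ignored), where "best" means highest information gain:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    # Expected entropy reduction from splitting on attr.
    n = len(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        sub = [l for r, l in zip(rows, labels) if r[attr] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                       # perfectly classified: stop
        return labels[0]
    if not attrs:                                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    branches = {}
    for v in set(r[best] for r in rows):            # one descendant per value of best
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        branches[v] = id3(sub_rows, sub_labels, [a for a in attrs if a != best])
    return (best, branches)
```

Each row is a dict from attribute name to value; a learned tree is either a class label or a pair (attribute, {value: subtree}).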
Slide 29: Entropy
Slide 30: Multiple Choice Question
Entropy:
1. measures the amount of uncertainty in a probability distribution.
2. is a concept from relativity theory in physics.
3. refers to the flexibility of an intelligent agent.
4. is maximized by the ID3 algorithm.
Slide 31: Uncertainty and Probability
The more "balanced" a probability distribution, the less information it conveys (e.g., about the class label). How do we quantify this? Information theory: entropy measures balance. For a sample S with proportion p+ of positive examples and p- of negative examples:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
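The two-class formula can be sketched directly, with the usual convention that 0 log 0 = 0:

```python
from math import log2

def entropy2(p_pos):
    """Entropy (in bits) of a two-class sample with positive proportion p_pos."""
    return sum(-p * log2(p) for p in (p_pos, 1 - p_pos) if p > 0)

print(entropy2(0.5))  # 1.0 bit: a balanced sample is maximally uncertain
print(entropy2(1.0))  # 0.0 bits: a pure sample carries no uncertainty
```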
Slide 32: Entropy: General Definition
For a discrete random variable X: H(X) = -Σ_x p(x) log2 p(x).
An important quantity in coding theory, statistical physics, and machine learning.
Slide 33: Intuition
Slide 34: Entropy
Slide 35: Coding Theory
Coding theory: X is discrete with 8 possible states ("messages"); how many bits are needed to transmit the state of X? Shannon's source coding theorem: the optimal code assigns length -log2 p(x) to each message X = x. If all states are equally likely, each message needs -log2(1/8) = 3 bits.
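The optimal code lengths can be sketched directly: with 8 equally likely states each message costs 3 bits, and the expected code length equals the entropy of X:

```python
from math import log2

p = [1 / 8] * 8                      # 8 equally likely messages
lengths = [-log2(q) for q in p]      # optimal code length per message
expected = sum(q * l for q, l in zip(p, lengths))
print(lengths[0], expected)          # 3.0 3.0: expected length = entropy of X
```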
Slide 36: Zipf's Law
General principle: frequent messages get shorter codes, e.g., abbreviations. This is the basis of information compression.
Slide 37: Another Coding Example
Slide 38: The Kullback-Leibler Divergence
Measures an information-theoretic "distance" between two distributions p and q; also known as relative entropy (closely related to cross-entropy). The distributions can be discrete or continuous.
KL(p || q) = Σ_x p(x) [log2(1/q(x)) - log2(1/p(x))]: the expected extra code length from coding x with the wrong distribution q instead of the true distribution p.
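The discrete case can be sketched in a few lines (assuming q(x) > 0 wherever p(x) > 0):

```python
from math import log2

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.75, 0.25]
print(kl(p, p))  # 0.0: no extra bits when coding with the true distribution
print(kl(p, q))  # positive, and asymmetric: kl(p, q) != kl(q, p) in general
```

Note that despite the word "distance", KL divergence is not symmetric, which the example above demonstrates.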
Slide 39: Information Gain: ID3 Decision Tree Learning
Slide 40: Splitting Criterion
Splitting on an attribute changes the entropy. We want to split on the attribute that gives the greatest reduction in entropy, averaged over its attribute values.
Gain(S, A) = expected reduction in entropy due to splitting on A:
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v), where S_v is the subset of S with A = v.
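Gain(S, A) can be sketched straight from the definition: entropy of S minus the size-weighted entropy of the subsets S_v produced by the split (helper names are my own):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    """Expected entropy reduction from splitting labels on the attribute values."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

# A perfectly predictive attribute recovers all the entropy; a useless one, none.
print(gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # 1.0
print(gain(["a", "b", "a", "b"], ["yes", "yes", "no", "no"]))  # 0.0
```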
Slide 41: Example
Slide 42: PlayTennis Example
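The root split for the 14-day table can be checked numerically. A sketch that computes the information gain of each attribute at the root; Outlook comes out highest (about 0.247 bits), which is why the standard PlayTennis tree tests Outlook first:

```python
from math import log2
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for the 14 days.
rows = [
    ("sunny", "hot", "high", "weak", "no"), ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"), ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"), ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"), ("rain", "mild", "high", "strong", "no"),
]
labels = [r[-1] for r in rows]

def entropy(ls):
    n = len(ls)
    return sum(-c / n * log2(c / n) for c in Counter(ls).values())

def gain(i):
    n = len(rows)
    rem = sum(len(sub) / n * entropy(sub)
              for v in set(r[i] for r in rows)
              for sub in [[r[-1] for r in rows if r[i] == v]])
    return entropy(labels) - rem

gains = {name: gain(i) for i, name in
         enumerate(["Outlook", "Temperature", "Humidity", "Wind"])}
print(max(gains, key=gains.get))  # Outlook, with gain about 0.247 bits
```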