Final Exam: May 10 Thursday
Bayesian reasoning
If event E occurs, then the probability that event H will occur is p(H|E):
IF E (evidence) is true THEN H (hypothesis) is true with probability p
Bayesian reasoning Example: Cancer and Test
P(C) = 0.01, P(¬C) = 0.99
P(+|C) = 0.9, P(−|C) = 0.1
P(+|¬C) = 0.2, P(−|¬C) = 0.8
P(C|+) = ?
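Applying Bayes' rule to these numbers, as a short sketch:

```python
# Bayes' rule applied to the cancer/test numbers above.
p_c, p_not_c = 0.01, 0.99      # priors P(C), P(not C)
p_pos_c = 0.9                  # P(+ | C)
p_pos_not_c = 0.2              # P(+ | not C)

# P(C|+) = P(+|C) P(C) / [ P(+|C) P(C) + P(+|not C) P(not C) ]
p_pos = p_pos_c * p_c + p_pos_not_c * p_not_c
p_c_given_pos = p_pos_c * p_c / p_pos
print(round(p_c_given_pos, 3))  # 0.043
```

Despite the 90% test sensitivity, the low prior means a positive result still leaves P(C|+) under 5%.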
Bayesian reasoning with multiple hypotheses and evidences
Expand Bayes' rule to work with multiple hypotheses (H1 ... Hm) and evidences (E1 ... En), assuming conditional independence among the evidences E1 ... En:
p(Hi | E1 ... En) = [ p(E1|Hi) × ... × p(En|Hi) × p(Hi) ] / Σk [ p(E1|Hk) × ... × p(En|Hk) × p(Hk) ]
Bayesian reasoning Example
Expert data: prior probabilities p(Hi) for each hypothesis and conditional probabilities p(Ej|Hi) for each evidence.
The user observes evidences E3, E1 and E2 in turn; after each observation, the expert system computes the posterior probabilities of the hypotheses given the evidence observed so far.
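The computation can be sketched as follows. Note that the priors and likelihoods below are illustrative placeholders, not the slide's original expert data (which did not survive extraction):

```python
import math

# Posterior over hypotheses H1..H3 given conditionally independent
# evidences, using the multi-evidence Bayes rule above.
# NOTE: all numbers here are illustrative placeholders -- the slide's
# original expert-data table is not reproduced.
priors = {"H1": 0.40, "H2": 0.35, "H3": 0.25}
likelihoods = {  # p(E | H) for each hypothesis
    "H1": {"E1": 0.3, "E2": 0.9, "E3": 0.6},
    "H2": {"E1": 0.8, "E2": 0.0, "E3": 0.7},
    "H3": {"E1": 0.5, "E2": 0.7, "E3": 0.9},
}
observed = ["E3", "E1", "E2"]  # evidences as the user reports them

unnormalised = {
    h: priors[h] * math.prod(likelihoods[h][e] for e in observed)
    for h in priors
}
total = sum(unnormalised.values())
posteriors = {h: round(v / total, 3) for h, v in unnormalised.items()}
print(posteriors)
```

With these placeholder numbers H3 ends up most probable, and H2 is ruled out entirely because p(E2|H2) = 0.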
Propagation of CFs
For a single antecedent rule: cf(H, E) = cf(E) × cf(R)
where cf(E) is the certainty factor of the evidence and cf(R) is the certainty factor of the rule.
Single antecedent rule example
IF patient has toothache THEN problem is cavity {cf 0.3}
Patient has toothache {cf 0.9}
What is cf(cavity, toothache)?
Propagation of CFs (multiple antecedents)
For conjunctive rules: IF E1 AND ... AND En THEN H {cf}
For two evidences E1 and E2: cf(E1 AND E2) = min(cf(E1), cf(E2))
Propagation of CFs (multiple antecedents)
For disjunctive rules: IF E1 OR ... OR En THEN H {cf}
For two evidences E1 and E2: cf(E1 OR E2) = max(cf(E1), cf(E2))
Exercise
IF (P1 AND P2) OR P3 THEN C1 (0.7) AND C2 (0.3)
Assume cf(P1) = 0.6, cf(P2) = 0.4, cf(P3) = 0.2
What is cf(C1), cf(C2)?
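A minimal sketch of the propagation for this exercise, assuming the standard rules (AND → min, OR → max, single-rule propagation cf(H, E) = cf(E) × cf(rule)):

```python
# CF propagation for the exercise:
# IF (P1 AND P2) OR P3 THEN C1 {cf 0.7} AND C2 {cf 0.3}
cf_p1, cf_p2, cf_p3 = 0.6, 0.4, 0.2

cf_and = min(cf_p1, cf_p2)           # cf(P1 AND P2) = 0.4
cf_antecedent = max(cf_and, cf_p3)   # cf((P1 AND P2) OR P3) = 0.4

cf_c1 = cf_antecedent * 0.7          # propagate into each conclusion
cf_c2 = cf_antecedent * 0.3
print(round(cf_c1, 2), round(cf_c2, 2))  # 0.28 0.12
```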
Defining fuzzy sets with fit-vectors
A fuzzy set A can be defined as a fit-vector of membership-degree/element pairs: A = {a1/x1, a2/x2, ..., an/xn}
So, for example:
Tall men = (0/180, 1/190)
Short men = (1/160, 0/170)
Average men = (0/165, 1/175, 0/185)
Qualifiers & Hedges
What about linguistic values with qualifiers? e.g. very tall, extremely short, etc.
Hedges are qualifying terms that modify the shape of fuzzy sets, e.g. very, somewhat, quite, slightly, extremely, etc.
Representing Hedges
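A common convention represents hedges as membership transforms: "very" concentrates a set (square the membership degree), "somewhat" dilates it (square root). A minimal sketch under that assumption, using the tall-men fit-vector from the containment example:

```python
# Hedges as membership transforms: "very" -> square (concentration),
# "somewhat" -> square root (dilation). This is the common convention;
# the containment slide uses the square for "very tall men".
tall = {180: 0.0, 182: 0.25, 185: 0.5, 187: 0.75, 190: 1.0}

very_tall = {x: round(mu ** 2, 2) for x, mu in tall.items()}
somewhat_tall = {x: round(mu ** 0.5, 2) for x, mu in tall.items()}

print(very_tall)  # {180: 0.0, 182: 0.06, 185: 0.25, 187: 0.56, 190: 1.0}
```

Squaring every degree reproduces exactly the very tall men fit-vector given on the containment slide.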
Crisp Set Operations
Fuzzy Set Operations
Complement: ¬A(x) = 1 − A(x)
To what degree do elements not belong to this set?
tall men = {0/180, 0.25/182, 0.5/185, 0.75/187, 1/190};
not tall men = {1/180, 0.75/182, 0.5/185, 0.25/187, 0/190};
Fuzzy Set Operations
Containment: which sets belong to other sets? Each element of the fuzzy subset has a smaller membership than in the containing set.
tall men = {0/180, 0.25/182, 0.5/185, 0.75/187, 1/190};
very tall men = {0/180, 0.06/182, 0.25/185, 0.56/187, 1/190};
Fuzzy Set Operations
Intersection: to what degree is the element in both sets?
A ∩ B(x) = min[A(x), B(x)]
tall men = {0/165, 0/175, 0/180, 0.25/182, 0.5/185, 1/190};
average men = {0/165, 1/175, 0.5/180, 0.25/182, 0/185, 0/190};
tall men ∩ average men = {0/165, 0/175, 0/180, 0.25/182, 0/185, 0/190};
or tall men ∩ average men = {0/180, 0.25/182, 0/185};
Fuzzy Set Operations
Union: to what degree is the element in either or both sets?
A ∪ B(x) = max[A(x), B(x)]
tall men = {0/165, 0/175, 0/180, 0.25/182, 0.5/185, 1/190};
average men = {0/165, 1/175, 0.5/180, 0.25/182, 0/185, 0/190};
tall men ∪ average men = {0/165, 1/175, 0.5/180, 0.25/182, 0.5/185, 1/190};
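These element-wise operations on the fit-vectors above can be checked directly:

```python
# Fuzzy complement, intersection, and union on the fit-vectors above.
tall = {165: 0.0, 175: 0.0, 180: 0.0, 182: 0.25, 185: 0.5, 190: 1.0}
average = {165: 0.0, 175: 1.0, 180: 0.5, 182: 0.25, 185: 0.0, 190: 0.0}

not_tall = {x: 1 - mu for x, mu in tall.items()}      # complement: 1 - A(x)
both = {x: min(tall[x], average[x]) for x in tall}    # intersection: min
either = {x: max(tall[x], average[x]) for x in tall}  # union: max

print(both)    # {165: 0.0, 175: 0.0, 180: 0.0, 182: 0.25, 185: 0.0, 190: 0.0}
print(either)  # {165: 0.0, 175: 1.0, 180: 0.5, 182: 0.25, 185: 0.5, 190: 1.0}
```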
Choosing the Best Attribute: Binary Classification
We want a formal measure that returns a maximum value when an attribute makes a perfect split and a minimum when it makes no distinction.
Information theory (Shannon and Weaver, 1949)
Entropy: a measure of the uncertainty of a random variable
A coin that always comes up heads → 0 bits
A flip of a fair coin (heads or tails) → 1 bit
The roll of a fair four-sided die → 2 bits
Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute
Formula for Entropy
H(p1, ..., pk) = − Σi pi log2(pi)
Examples:
Suppose we have a collection of 10 examples, 5 positive and 5 negative:
H(1/2, 1/2) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit
Suppose we have a collection of 100 examples, 1 positive and 99 negative:
H(1/100, 99/100) = −0.01 log2(0.01) − 0.99 log2(0.99) ≈ 0.08 bits
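A small helper reproduces these entropy values:

```python
import math

def entropy(*probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy(0.5, 0.5))                # 1.0 (fair coin)
print(round(entropy(0.01, 0.99), 2))    # 0.08 (1 positive in 100)
print(entropy(0.25, 0.25, 0.25, 0.25))  # 2.0 (fair four-sided die)
```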
Information gain
Information gain (from an attribute test) = the difference between the original information requirement and the new requirement after the split:
Gain(A) = H(p/(p+n), n/(p+n)) − Remainder(A)
Remainder(A) = Σi (pi + ni)/(p + n) × H(pi/(pi+ni), ni/(pi+ni))
Choose the attribute with the largest IG.
Information gain
For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit.
Consider the attributes Patrons and Type (and others too): Patrons has the highest IG of all attributes, and so is chosen by the DTL algorithm as the root.
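The gains can be reproduced numerically. The per-value (positive, negative) counts below are assumed from the standard restaurant example in Russell & Norvig, since the slide's own table did not survive extraction:

```python
import math

def H(p, n):
    """Entropy (bits) of a node with p positive and n negative examples."""
    return -sum(q * math.log2(q) for q in (p / (p + n), n / (p + n)) if q > 0)

def gain(splits, p=6, n=6):
    """Information gain: H(parent) minus the size-weighted child entropies."""
    remainder = sum((pi + ni) / (p + n) * H(pi, ni) for pi, ni in splits)
    return H(p, n) - remainder

# (positive, negative) counts per attribute value, assumed from the
# standard restaurant example: Patrons = None/Some/Full, Type = 4 values.
patrons = [(0, 2), (4, 0), (2, 4)]
types = [(1, 1), (1, 1), (2, 2), (2, 2)]

print(round(gain(patrons), 3))  # ~0.541 bits
print(round(gain(types), 3))    # ~0 bits: Type is uninformative
```

Patrons wins because two of its three values produce pure child nodes, while every Type value keeps the 50/50 split of the parent.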
Example contd.
Decision tree learned from the 12 examples: substantially simpler than the “true” tree.
Perceptrons
X = x1·w1 + x2·w2
Y = Ystep(X): Y = 1 if X ≥ Θ, otherwise Y = 0
Perceptrons
How does a perceptron learn?
A perceptron has initial (often random) weights, typically in the range [−0.5, 0.5]
Apply an established training dataset
Calculate the error as expected output minus actual output: error e = Yexpected − Yactual
Adjust the weights to reduce the error
Perceptrons
How do we adjust a perceptron's weights to produce Yexpected?
If e is positive, we need to increase Yactual (and vice versa)
Use this formula: wi ← wi + Δwi, where Δwi = α × xi × e
α is the learning rate (between 0 and 1); e is the calculated error
Perceptron Example – AND
Train a perceptron to recognize logical AND
Use threshold Θ = 0.2 and learning rate α = 0.1
Perceptron Example – AND
Repeat until convergence, i.e. until the final weights do not change and there is no error
Use threshold Θ = 0.2 and learning rate α = 0.1
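The whole training procedure can be sketched as follows. The initial weights here are example values (the slides assume random weights in [−0.5, 0.5]), and updates are rounded to two decimals purely to avoid floating-point drift at the threshold comparison:

```python
# Perceptron training for logical AND with theta = 0.2, alpha = 0.1,
# using the learning rule w_i <- w_i + alpha * x_i * e.
def step(x, theta=0.2):
    return 1 if x >= theta else 0

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND truth table
w = [0.3, -0.1]  # example initial weights
alpha = 0.1

converged = False
while not converged:
    converged = True  # assume done; any weight update flips this
    for (x1, x2), y_expected in data:
        y_actual = step(x1 * w[0] + x2 * w[1])
        e = y_expected - y_actual
        if e != 0:
            converged = False
            w[0] = round(w[0] + alpha * x1 * e, 2)  # delta w = alpha * x * e
            w[1] = round(w[1] + alpha * x2 * e, 2)

print(w)  # [0.1, 0.1] -- weights that realise AND with theta = 0.2
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop terminates.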