Today’s Topics
  Some Exam-Review Notes
    – Midterm is Thurs, 5:30-7:30pm HERE
    – One 8.5x11 inch page of notes (both sides), simple calculator (logs and arithmetic)
    – Don’t Discuss Actual Midterm with Others until Nov 3
  Planning to Attend TA’s Review Tomorrow?
  Bayes’ Rule
  Naïve Bayes (NB)
  NB as a BN
  Prob Reasoning Wrapup
  Next: BN’s for Playing Nannon (HW3)
10/13/15, CS Fall 2015 (Shavlik©), Lecture 16, Week 6
Topics Covered So Far
  Some AI History and Philosophy (more in the final class)
  Learning from Labeled Data (more ahead)
  Reasoning from Specific Cases (k-NN)
  Searching for Solutions (many variants, common core)
  Projecting Possible Futures (eg, game-playing)
  Simulating ‘Problem Solving’ Done by the Biophysical World (SA, GA, and [next] neural nets)
  Reasoning Probabilistically (just Ch 13 & Lec 14)
  If you don’t recognize this …
Detailed List of Course Topics
  Learning from labeled data
    – Experimental methodologies for choosing parameter settings and estimating future accuracy
    – Decision trees and random forests
    – Probabilistic models, nearest-neighbor methods
    – Genetic algorithms
    – Neural networks
    – Support vector machines
    – Reinforcement learning (if time permits)
  Searching for solutions
    – Heuristically finding shortest paths
    – Algorithms for playing games like chess
    – Simulated annealing
    – Genetic algorithms
  Reasoning probabilistically
    – Probabilistic inference (just the basics so far)
    – Bayes' rule
    – Bayesian networks
  Reasoning from concrete cases
    – Case-based reasoning
    – Nearest-neighbor algorithm
  Reasoning logically
    – First-order predicate calculus
    – Representing domain knowledge using mathematical logic
    – Logical inference
  Problem-solving methods based on the biophysical world
    – Genetic algorithms
    – Simulated annealing
    – Neural networks
  Philosophical aspects
    – Turing test
    – Searle's Chinese Room thought experiment
    – The coming singularity
    – Strong vs. weak AI
    – Societal impact of AI
Some Key Ideas
  ML: Easy to fit training examples, hard to generalize to future examples (never use TESTSET to choose model!)
  SEARCH: OPEN holds partial solutions; how to choose which partial sol’n to extend? (CLOSED prevents infinite loops)
  PROB: Fill JOINT Prob table (explicitly or implicitly) simply by COUNTING data, then can answer all kinds of questions
Exam Advice
  Mix of ‘straightforward’ concrete problem solving and brief discussion of important AI issues and techniques
    – Problem solving graded ‘precisely’
    – Discussion graded ‘leniently’
  Previous exams are great training and tune sets (hence soln’s are not posted for old exams, ie, so they can be used as TUNE sets)
Exam Advice (cont.)
  Think before you write
  Briefly discuss important points
  Don’t do a ‘core dump’
  Some questions are open-ended, so budget your time wisely
  Always say SOMETHING
Bayes’ Rule
  Recall
    $P(A \wedge B) = P(A \mid B) \times P(B) = P(B \mid A) \times P(A)$
  Equating the two RHS (right-hand sides) we get
    $P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$
  This is Bayes’ Rule!
Common Usage - Diagnosing CAUSE Given EFFECTS
    $P(\text{disease} \mid \text{symptoms}) = \frac{P(\text{symptoms} \mid \text{disease}) \times P(\text{disease})}{P(\text{symptoms})}$
  ‘symptoms’ is usually a big AND of several random variables, so a JOINT probability
  In HW3, you’ll compute prob(this move leads to a WIN | NANNON board configuration)
Simple Example (only ONE symptom variable)
  Assume we have estimated from data
    P(headache | condition=haveFlu)    = 0.90
    P(headache | condition=haveStress) = 0.40
    P(headache | condition=healthy)    = 0.01
    P(haveFlu)    = 0.01  // Dropping ‘condition=’ for clarity
    P(haveStress) = 0.20  // Because it’s midterms time!
    P(healthy)    = 0.79  // We assume the 3 ‘diseases’ disjoint
  Patient comes in with headache; what is the most likely diagnosis?
Solution
    P(flu | headache)     = 0.90 x 0.01 / P(headache) = 0.0090 / P(headache)
    P(stress | headache)  = 0.40 x 0.20 / P(headache) = 0.0800 / P(headache)
    P(healthy | headache) = 0.01 x 0.79 / P(headache) = 0.0079 / P(headache)
  STRESS most likely (by nearly a factor of 9)
  Note: we never need to compute the denominator to find the most likely diagnosis!
    $P(\text{disease} \mid \text{symptoms}) = \frac{P(\text{symptoms} \mid \text{disease}) \times P(\text{disease})}{P(\text{symptoms})}$
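
The headache example can be checked with a few lines of Python. This is only a sketch using the numbers from the slide; the dictionary names are illustrative choices, not part of the lecture. It also computes the denominator P(headache), which is possible here because the three conditions are assumed disjoint and exhaustive.

```python
# P(headache | condition) and P(condition), copied from the slide.
likelihood = {"flu": 0.90, "stress": 0.40, "healthy": 0.01}
prior = {"flu": 0.01, "stress": 0.20, "healthy": 0.79}

# Numerators of Bayes' rule: P(headache | c) x P(c).
unnormalized = {c: likelihood[c] * prior[c] for c in prior}

# The most likely diagnosis needs only the numerators.
best = max(unnormalized, key=unnormalized.get)

# Because the conditions are disjoint and exhaustive,
# the numerators sum to P(headache).
p_headache = sum(unnormalized.values())
posterior = {c: v / p_headache for c, v in unnormalized.items()}

print("most likely:", best)
for c, p in posterior.items():
    print(f"P({c} | headache) = {p:.4f}")
```

Normalizing is optional for picking the winner, but it shows how lopsided the answer is: stress takes most of the posterior mass even though flu has the largest likelihood.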
Base-Rate Fallacy
  Assume Disease A is rare (one in 1 million, say; so picture not to scale)
  Assume population is 10B = $10^{10}$, so $10^4$ people have it
  Assume testForA is 99.99% accurate
  You test positive. What is the prob you have Disease A?
  Someone (not in cs540) might naively think prob = 0.9999
  People for whom testForA = true:
    – 9999 people that actually have Disease A (99.99% of the $10^4$ who have it)
    – about $10^6$ people that do NOT have Disease A (0.01% of the roughly $10^{10}$ who do not)
  So Prob(A | testForA) $\approx 9999 / (9999 + 10^6) \approx 0.01$
  This same issue arises when we have many more neg than pos ex’s: false pos overwhelm true pos
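
The base-rate arithmetic can be reproduced with integer counts. This sketch assumes that "99.99% accurate" means the test is right 99.99% of the time for both sick and healthy people; that reading matches the slide's numbers but is an assumption here, as are the variable names.

```python
population = 10**10                  # 10 billion people
sick = population // 10**6           # one in a million has Disease A
healthy = population - sick

# Assumed: the test is wrong for 1 person in 10,000 in both groups.
true_pos = sick * 9999 // 10**4      # sick people who test positive
false_pos = healthy // 10**4         # healthy people who test positive

p_sick_given_pos = true_pos / (true_pos + false_pos)
print(true_pos, false_pos, round(p_sick_given_pos, 4))
```

The false positives outnumber the true positives by about 100 to 1, which is why a positive test still leaves only about a 1% chance of having the disease.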
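Recall: What if Symptoms NOT Disjoint?
  Assume we have symptoms A, B, and C, and they are not disjoint
  Convert to eight disjoint events A’ through H’, one for each combination of A, B, and C being true or negated (which label goes with which combination is arbitrary), for example
    $A' = A \wedge B \wedge C$
    $B' = A \wedge B \wedge \neg C$
    $C' = A \wedge \neg B \wedge C$
    $D' = A \wedge \neg B \wedge \neg C$
    $E' = \neg A \wedge B \wedge C$
    $F' = \neg A \wedge B \wedge \neg C$
    $G' = \neg A \wedge \neg B \wedge C$
    $H' = \neg A \wedge \neg B \wedge \neg C$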
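
A small sketch of the conversion: enumerate every true/negated combination of the symptoms, so that exactly one combined event holds in any world state. The ordering produced by `itertools.product` is just one convenient labeling and is not taken from the slide.

```python
from itertools import product

names = ["A", "B", "C"]

# One conjunction per truth assignment; exactly one holds in any world state.
events = []
for values in product([True, False], repeat=len(names)):
    literals = [n if v else "¬" + n for n, v in zip(names, values)]
    events.append(" ∧ ".join(literals))

for event in events:
    print(event)
```

With n overlapping symptoms this produces 2^n disjoint events, which is why the trick only scales to a handful of symptoms.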
Dealing with Many Boolean-Valued Symptoms (D = Disease, $S_i$ = Symptom i)
    $P(D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)$   // Bayes’ Rule
      $= \frac{P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n \mid D) \times P(D)}{P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)}$
  If n small, could use a full joint table
  If not, could design/learn a Bayes Net
  We’ll consider ‘conditional independence’ of S’s
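Assuming Conditional Independence
  Repeatedly using $P(A \wedge B \mid C) \equiv P(A \mid C) \times P(B \mid C)$
  we get $P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n \mid D) = \prod_i P(S_i \mid D)$
  Assuming D has three possible disjoint values
    $P(D_1 \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = [\prod_i P(S_i \mid D_1)] \times P(D_1) \, / \, P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)$
    $P(D_2 \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = [\prod_i P(S_i \mid D_2)] \times P(D_2) \, / \, P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)$
    $P(D_3 \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = [\prod_i P(S_i \mid D_3)] \times P(D_3) \, / \, P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)$
  We know $\sum_j P(D_j \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = 1$, so if we want, we could solve for $P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)$ and, hence, need not compute/approx it!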
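
The normalization trick can be sketched as follows. The function name `nb_posterior` and all numbers in the usage example are made up for illustration; only the formula comes from the slide. Each numerator is a product of per-symptom probabilities times the prior, and dividing by the sum of the numerators replaces the explicit denominator.

```python
from math import prod

def nb_posterior(priors, p_true_given_d, observed):
    """Posterior over disease values under the Naive Bayes assumption.

    priors:          {d: P(d)}
    p_true_given_d:  {d: [P(S_i = true | d) for each symptom i]}
    observed:        list of booleans, the observed value of each S_i
    """
    scores = {}
    for d, prior in priors.items():
        # Use P(S_i = true | d) or its complement, depending on the observation.
        factors = [p if s else 1.0 - p
                   for p, s in zip(p_true_given_d[d], observed)]
        scores[d] = prod(factors) * prior
    # The posteriors must sum to 1, so the sum of the numerators
    # plays the role of P(S_1 ^ ... ^ S_n).
    total = sum(scores.values())
    return {d: v / total for d, v in scores.items()}

# Hypothetical numbers, two symptoms, three disease values.
priors = {"D1": 0.5, "D2": 0.3, "D3": 0.2}
p_true = {"D1": [0.9, 0.2], "D2": [0.5, 0.5], "D3": [0.1, 0.8]}
post = nb_posterior(priors, p_true, [True, False])
print(post)
```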
Full Joint vs. Naïve Bayes
  Completely assuming conditional independence is called Naïve Bayes (NB)
    – We need to estimate (eg, from data)
        $P(S_i \mid D_j)$   // For each disease j, prob symptom i appears
        $P(D_j)$            // Prob of each disease j
  If we have N binary-valued symptoms and a ternary-valued disease, size of full joint is $(3 \times 2^N) - 1$
  NB needs only about $3 \times N$ values
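
To make the size gap concrete, here is a sketch that counts independent parameters for both representations. The function names are illustrative. Counting the priors as (number of disease values - 1) extra parameters is this sketch's own bookkeeping, added on top of the slide's 3 x N conditional probabilities.

```python
def full_joint_size(n_symptoms, n_disease_values=3):
    # One cell per complete world state, minus one because the cells sum to 1.
    return n_disease_values * 2**n_symptoms - 1

def naive_bayes_size(n_symptoms, n_disease_values=3):
    # One P(S_i = true | D_j) per (symptom, disease value) pair,
    # plus the independent priors P(D_j).
    return n_disease_values * n_symptoms + (n_disease_values - 1)

for n in (5, 10, 20):
    print(n, full_joint_size(n), naive_bayes_size(n))
```

The full joint grows exponentially in the number of symptoms while Naïve Bayes grows linearly, which is what makes NB practical when data are limited.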
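Log Odds
  $\text{Odds}(x) \equiv \text{prob}(x) \, / \, (1 - \text{prob}(x))$   (Odds > 1 iff prob > 0.5)
  Recall (and now assuming D only has TWO values)
    1) $P(D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = [\prod_i P(S_i \mid D)] \times P(D) \, / \, P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)$
    2) $P(\neg D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = [\prod_i P(S_i \mid \neg D)] \times P(\neg D) \, / \, P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)$
  Dividing (1) by (2), denominators cancel out!
    $\frac{P(D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)}{P(\neg D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)} = \frac{[\prod_i P(S_i \mid D)] \times P(D)}{[\prod_i P(S_i \mid \neg D)] \times P(\neg D)}$
  Since $P(\neg D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = 1 - P(D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)$
    $\text{odds}(D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = [\prod_i \{ P(S_i \mid D) \, / \, P(S_i \mid \neg D) \}] \times [P(D) \, / \, P(\neg D)]$
  Notice we removed one $\prod$ via algebra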
The Missing Algebra
  The implicit algebra from the previous slide:
    $\frac{a_1 \times a_2 \times a_3 \times \dots \times a_n}{b_1 \times b_2 \times b_3 \times \dots \times b_n} = (a_1 / b_1) \times (a_2 / b_2) \times (a_3 / b_3) \times \dots \times (a_n / b_n)$
Log Odds (continued)
  $\text{Odds}(x) \equiv \text{prob}(x) \, / \, (1 - \text{prob}(x))$
  We ended two slides ago with
    $\text{odds}(D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = [\prod_i \{ P(S_i \mid D) \, / \, P(S_i \mid \neg D) \}] \times [P(D) \, / \, P(\neg D)]$
  Recall $\log(A \times B) = \log(A) + \log(B)$, so we have
    $\log[\text{odds}(D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)] = \{ \sum_i \log[P(S_i \mid D) \, / \, P(S_i \mid \neg D)] \} + \log[P(D) \, / \, P(\neg D)]$
  If log-odds > 0, D is more likely than ¬D, since log(x) > 0 iff x > 1
  If log-odds < 0, D is less likely than ¬D, since log(x) < 0 iff x < 1
Log Odds (concluded)
  We ended last slide with
    $\log[\text{odds}(D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)] = \{ \sum_i \log[P(S_i \mid D) \, / \, P(S_i \mid \neg D)] \} + \log[P(D) \, / \, P(\neg D)]$
  Consider $\log[P(D) \, / \, P(\neg D)]$
    – if D is more likely than ¬D, we start the sum with a positive value
  Consider each $\log[P(S_i \mid D) \, / \, P(S_i \mid \neg D)]$
    – if $S_i$ is more likely given D than given ¬D, we add a positive value to the sum
    – if less likely, we add a negative value
    – if $S_i$ is independent of D, we add zero
  At the end we see if the sum is POS (D more likely), ZERO (tie), or NEG (¬D more likely)
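
The running-sum view can be sketched in Python. The function name `log_odds` and the example numbers are hypothetical; they are chosen so that one symptom votes for D, one is uninformative, and one votes against D.

```python
from math import log

def log_odds(p_d, cond_pairs):
    """Log-odds of D given the observed symptoms.

    p_d:         prior P(D)
    cond_pairs:  list of (P(S_i | D), P(S_i | not D)) for the observed
                 value of each symptom; all values must be in (0, 1).
    """
    total = log(p_d / (1.0 - p_d))           # start with the prior term
    for p_given_d, p_given_not_d in cond_pairs:
        total += log(p_given_d / p_given_not_d)
    return total

# Hypothetical example: even prior, one vote for D, one neutral, one against.
pairs = [(0.8, 0.2), (0.3, 0.3), (0.1, 0.4)]
value = log_odds(0.5, pairs)
verdict = "D" if value > 0 else "not D" if value < 0 else "tie"
print(value, verdict)
```

Summing logs rather than multiplying ratios also avoids numeric underflow when there are many symptoms.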
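Viewing NB as a PERCEPTRON, the Simplest Neural Network
  [Figure: input nodes $S_1$, $\neg S_1$, …, $S_n$, $\neg S_n$ and a constant ‘1’ node, each linked by a weight to a single output node ‘out’ whose value is the log-odds]
  Weights on the links:
    node ‘1’:        $\log[P(D) \, / \, P(\neg D)]$
    node $S_1$:      $\log[P(S_1 \mid D) \, / \, P(S_1 \mid \neg D)]$
    node $\neg S_1$: $\log[P(\neg S_1 \mid D) \, / \, P(\neg S_1 \mid \neg D)]$
    …
    node $S_n$:      $\log[P(S_n \mid D) \, / \, P(S_n \mid \neg D)]$
    node $\neg S_n$: $\log[P(\neg S_n \mid D) \, / \, P(\neg S_n \mid \neg D)]$
  If $S_i$ = true, then NODE $S_i$ = 1 and NODE $\neg S_i$ = 0
  If $S_i$ = false, then NODE $S_i$ = 0 and NODE $\neg S_i$ = 1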
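
A minimal sketch of the perceptron view, assuming two input nodes per symptom and a bias weight for the constant ‘1’ node as in the figure. The function names are illustrative, and the probabilities in the usage example are the estimates from this lecture's seven-example dataset.

```python
from math import log, exp

def nb_perceptron_weights(p_d, p_s_given_d, p_s_given_not_d):
    """Return (bias, weights); weights are ordered S_1, not-S_1, ..., S_n, not-S_n.

    All probabilities must lie strictly between 0 and 1.
    """
    bias = log(p_d / (1.0 - p_d))
    weights = []
    for a, b in zip(p_s_given_d, p_s_given_not_d):
        weights.append(log(a / b))                   # node S_i
        weights.append(log((1.0 - a) / (1.0 - b)))   # node not-S_i
    return bias, weights

def encode(symptoms):
    """Turn booleans into node activations: true -> (1, 0), false -> (0, 1)."""
    nodes = []
    for s in symptoms:
        nodes.extend([1, 0] if s else [0, 1])
    return nodes

def output(bias, weights, nodes):
    """Weighted sum of the inputs; this is the log-odds of D."""
    return bias + sum(w * x for w, x in zip(weights, nodes))

bias, weights = nb_perceptron_weights(
    4 / 7, [2 / 4, 3 / 4, 3 / 4], [2 / 3, 1 / 3, 2 / 3])
out = output(bias, weights, encode([True, True, True]))
print(out, exp(out))   # exp(out) recovers the odds
```

Thresholding `out` at zero gives the NB decision, which is exactly a perceptron with fixed weights rather than learned ones.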
Naïve Bayes Example (for simplicity, ignore m-estimates here)
  Dataset
    S1  S2  S3  D
    T   F   T   T
    F   T   T   F
    F   T   T   T
    T   T   F   T
    T   F   T   F
    F   T   T   T
    T   F   F   F
  To estimate:
    P(D=true) = ?
    P(D=false) = ?
    P(S1=true | D=true) = ?
    P(S1=true | D=false) = ?
    P(S2=true | D=true) = ?
    P(S2=true | D=false) = ?
    P(S3=true | D=true) = ?
    P(S3=true | D=false) = ?
  ‘Law of Excluded Middle’: P(S3=true | D=false) + P(S3=false | D=false) = 1, so no need for the P(Si=false | D=?) estimates
Naïve Bayes Example (for simplicity, ignore m-estimates)
  Estimates from the same seven-example dataset as on the previous slide:
    P(D=true)  = 4 / 7
    P(D=false) = 3 / 7
    P(S1=true | D=true)  = 2 / 4
    P(S1=true | D=false) = 2 / 3
    P(S2=true | D=true)  = 3 / 4
    P(S2=true | D=false) = 1 / 3
    P(S3=true | D=true)  = 3 / 4
    P(S3=true | D=false) = 2 / 3
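
These estimates are plain counts, which a short sketch can reproduce. The dataset rows are the seven examples from the slide; the function name `estimate` is illustrative, and, as on the slide, no m-estimates are used.

```python
from fractions import Fraction

# Rows are (S1, S2, S3, D), copied from the slide's dataset.
data = [
    (True,  False, True,  True),
    (False, True,  True,  False),
    (False, True,  True,  True),
    (True,  True,  False, True),
    (True,  False, True,  False),
    (False, True,  True,  True),
    (True,  False, False, False),
]

def estimate(rows):
    """Return (prior, cond): prior[d] = P(D=d), cond[d][i] = P(S_i=true | D=d)."""
    n_features = len(rows[0]) - 1
    prior, cond = {}, {}
    for d in (True, False):
        matching = [r for r in rows if r[-1] == d]
        prior[d] = Fraction(len(matching), len(rows))
        cond[d] = [Fraction(sum(r[i] for r in matching), len(matching))
                   for i in range(n_features)]
    return prior, cond

prior, cond = estimate(data)
print(prior[True], cond[True], cond[False])
```

Using `Fraction` keeps the results exact, so they can be compared directly against the hand-computed values (note that 2/4 prints as 1/2).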
Processing a ‘Test’ Example
  Prob(D = true | S1 = true ∧ S2 = true ∧ S3 = true) ?
  Here, vars = true unless NOT sign present
  Recall $\text{Odds}(x) \equiv \text{Prob}(x) \, / \, (1 - \text{Prob}(x))$
    $\text{Odds}(D \mid S_1 \wedge S_2 \wedge S_3) = \frac{P(S_1 \mid D)}{P(S_1 \mid \neg D)} \times \frac{P(S_2 \mid D)}{P(S_2 \mid \neg D)} \times \frac{P(S_3 \mid D)}{P(S_3 \mid \neg D)} \times \frac{P(D)}{P(\neg D)}$
      = (3 / 4) x (9 / 4) x (9 / 8) x (4 / 3) = 81 / 32 = 2.53
  Use Prob(x) = Odds(x) / (1 + Odds(x)) to get prob = 81 / 113 ≈ 0.72
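
The odds calculation for this test example can be verified with exact fractions. The probabilities are the estimates from the seven-example dataset; the variable names are illustrative.

```python
from fractions import Fraction as F

p_d = F(4, 7)
p_s_given_d = [F(2, 4), F(3, 4), F(3, 4)]       # P(S_i = true | D)
p_s_given_not_d = [F(2, 3), F(1, 3), F(2, 3)]   # P(S_i = true | not D)

# Start with the prior odds, then multiply in one ratio per symptom.
odds = p_d / (1 - p_d)
for a, b in zip(p_s_given_d, p_s_given_not_d):
    odds *= a / b

prob = odds / (1 + odds)
print(odds, float(odds), prob, float(prob))
```

Because the odds exceed 1, the NB prediction for this example is D = true.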
NB as a BN
    $P(D \mid S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n) = [\prod_i P(S_i \mid D)] \times P(D) \, / \, P(S_1 \wedge S_2 \wedge S_3 \wedge \dots \wedge S_n)$
  We only need to compute the numerator if we use the ‘odds’ method from prev slides
  [Figure: Bayes net with node D as the only parent of each of $S_1$, $S_2$, $S_3$, …, $S_n$]
  1 CPT of size $2^{n+1}$
  (N+1) CPTs of size 1
Recap: Naïve Bayes Parameter Learning
  Use training data to estimate (for Naïve Bayes) in one pass through data
    P(f_i = v_j | category = POS) for each i, j
    P(f_i = v_j | category = NEG) for each i, j
    P(category = POS)
    P(category = NEG)
    // Note: Some of above unnecessary since some combo’s of probs sum to 1
  Apply Bayes’ rule to find odds(category = POS | test example’s features)
  Incremental/Online Learning is Easy: simply increment counters (true for BN’s in general, if no ‘structure learning’)
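
Online learning by incrementing counters might look like the sketch below. The class name `OnlineNaiveBayes` and its methods are hypothetical. Following the earlier example, it uses raw counts with no m-estimates, so it assumes every count it divides by is nonzero.

```python
from collections import defaultdict

class OnlineNaiveBayes:
    """Two-category Naive Bayes trained one example at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)     # category -> count
        self.feature_counts = defaultdict(int)   # (category, i, value) -> count

    def update(self, features, category):
        """Process one labeled example by incrementing counters."""
        self.class_counts[category] += 1
        for i, value in enumerate(features):
            self.feature_counts[(category, i, value)] += 1

    def odds_pos(self, features):
        """odds(category = POS | features); assumes no needed count is zero."""
        pos = self.class_counts["POS"]
        neg = self.class_counts["NEG"]
        odds = pos / neg
        for i, value in enumerate(features):
            p_pos = self.feature_counts[("POS", i, value)] / pos
            p_neg = self.feature_counts[("NEG", i, value)] / neg
            odds *= p_pos / p_neg
        return odds

nb = OnlineNaiveBayes()
nb.update((True, False), "POS")
nb.update((True, True), "POS")
nb.update((True, True), "NEG")
print(nb.odds_pos((True, True)))
```

Nothing is ever recomputed from scratch: a new example touches one class counter and one counter per feature, so training cost is linear in the data.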
Is NB Naïve?
  Surprisingly, the assumption of independence, while most likely violated, is not too harmful!
  Naïve Bayes works quite well
    – Very successful in text categorization (‘bag-o-words’ rep)
    – Used in printer diagnosis in Windows, spam filtering, etc
  Prob’s not accurate (‘uncalibrated’) due to double counting, but good at seeing if prob > 0.5 or prob < 0.5
  Resurgence of research activity in Naïve Bayes
    – Many ‘dead’ ML algo’s resuscitated by availability of large datasets (KISS Principle)
A Major Weakness of BN’s
  If many ‘hidden’ random vars (N binary vars, say), then the marginalization formula leads to many calls to a BN ($2^N$ in our example; for N = 20, $2^N$ = 1,048,576)
  Using uniform-random sampling to estimate the result is too inaccurate since most of the probability might be concentrated in only a few ‘complete world states’
  Hence, much research (beyond cs540’s scope) on scaling up inference in BNs and other graphical models, eg via more sophisticated sampling (eg, MCMC)
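
The exponential cost of marginalization can be seen in a toy sketch. The `toy_joint` function below is a made-up stand-in for a real Bayes net (hidden variables are independent fair coins, and the query is likely only when all of them are true); the point is the number of calls made by `marginal`, not the model.

```python
from itertools import product

def marginal(joint_prob, query, n_hidden):
    """Sum joint_prob(query, hidden) over all 2**n_hidden hidden assignments."""
    total, calls = 0.0, 0
    for hidden in product([False, True], repeat=n_hidden):
        total += joint_prob(query, hidden)
        calls += 1
    return total, calls

def toy_joint(query, hidden):
    """Hypothetical model: P(hidden) is uniform; the query is true with
    probability 0.9 if every hidden variable is true, else 0.1."""
    p_hidden = 0.5 ** len(hidden)
    p_query_true = 0.9 if all(hidden) else 0.1
    return p_hidden * (p_query_true if query else 1.0 - p_query_true)

p, calls = marginal(toy_joint, True, 10)
print(p, calls)
```

Each extra hidden variable doubles the number of calls, so exact enumeration stops being practical long before N reaches the sizes seen in real models.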
Bayesian Networks Wrapup
  BNs are one Type of ‘Graphical Model’
  Lots of Applications (though the current focus is on ‘deep [neural] networks’)
  Bayes’ Rule is an Appealing Way to Go from EFFECTS to CAUSES (ie, diagnosis)
  Full Joint Prob Tables and Naïve Bayes are Interesting ‘Limit Cases’ of BNs
  With ‘Big Data,’ Counting Goes a Long Way!