Homework 3: Naive Bayes Classification

Slides:



Advertisements
Similar presentations
CPSC 422, Lecture 11Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 11 Jan, 29, 2014.
Advertisements

Probabilistic models Jouni Tuomisto THL. Outline Deterministic models with probabilistic parameters Hierarchical Bayesian models Bayesian belief nets.
BAYESIAN NETWORKS. Bayesian Network Motivation  We want a representation and reasoning system that is based on conditional independence  Compact yet.
Probability: Review The state of the world is described using random variables Probabilities are defined over events –Sets of world states characterized.
Dynamic Bayesian Networks (DBNs)
For Monday Finish chapter 14 Homework: –Chapter 13, exercises 8, 15.
Hidden Markov Models Reading: Russell and Norvig, Chapter 15, Sections
Introduction of Probabilistic Reasoning and Bayesian Networks
Probabilistic Reasoning (2)
Speech Recognition. What makes speech recognition hard?
Probabilistic reasoning over time So far, we’ve mostly dealt with episodic environments –One exception: games with multiple moves In particular, the Bayesian.
Bayesian network inference
Bayesian Networks Chapter 2 (Duda et al.) – Section 2.11
Bayesian Networks. Motivation The conditional independence assumption made by naïve Bayes classifiers may seem to rigid, especially for classification.
. PGM: Tirgul 8 Markov Chains. Stochastic Sampling  In previous class, we examined methods that use independent samples to estimate P(X = x |e ) Problem:
Bayesian Networks. Graphical Models Bayesian networks Conditional random fields etc.
Probabilistic Robotics Introduction Probabilities Bayes rule Bayes filters.
5/25/2005EE562 EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS Lecture 16, 6/1/2005 University of Washington, Department of Electrical Engineering Spring 2005.
CS 188: Artificial Intelligence Spring 2007 Lecture 14: Bayes Nets III 3/1/2007 Srini Narayanan – ICSI and UC Berkeley.
CS 188: Artificial Intelligence Fall 2006 Lecture 17: Bayes Nets III 10/26/2006 Dan Klein – UC Berkeley.
Announcements Homework 8 is out Final Contest (Optional)
Bayes Nets. Bayes Nets Quick Intro Topic of much current research Models dependence/independence in probability distributions Graph based - aka “graphical.
1 Bayesian Networks Chapter ; 14.4 CS 63 Adapted from slides by Tim Finin and Marie desJardins. Some material borrowed from Lise Getoor.
Bayesian Networks Textbook: Probabilistic Reasoning, Sections 1-2, pp
Approximate Inference 2: Monte Carlo Markov Chain
Quiz 4: Mean: 7.0/8.0 (= 88%) Median: 7.5/8.0 (= 94%)
Bayesian networks Chapter 14. Outline Syntax Semantics.
A Brief Introduction to Graphical Models
Topics on Final Perceptrons SVMs Precision/Recall/ROC Decision Trees Naive Bayes Bayesian networks Adaboost Genetic algorithms Q learning Not on the final:
Soft Computing Lecture 17 Introduction to probabilistic reasoning. Bayesian nets. Markov models.
Bayesian Networks What is the likelihood of X given evidence E? i.e. P(X|E) = ?
INC 551 Artificial Intelligence Lecture 8 Models of Uncertainty.
Bayesian networks. Motivation We saw that the full joint probability can be used to answer any question about the domain, but can become intractable as.
Bayesian Networks for Data Mining David Heckerman Microsoft Research (Data Mining and Knowledge Discovery 1, (1997))
Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)
Bayes’ Nets: Sampling [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available.
1 CMSC 671 Fall 2001 Class #21 – Tuesday, November 13.
The famous “sprinkler” example (J. Pearl, Probabilistic Reasoning in Intelligent Systems, 1988)
Marginalization & Conditioning Marginalization (summing out): for any sets of variables Y and Z: Conditioning(variant of marginalization):
Review: Probability Random variables, events Axioms of probability Atomic events Joint and marginal probability distributions Conditional probability distributions.
CPSC 422, Lecture 11Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 11 Oct, 2, 2015.
Probabilistic models Jouni Tuomisto THL. Outline Deterministic models with probabilistic parameters Hierarchical Bayesian models Bayesian belief nets.
Bayesian networks and their application in circuit reliability estimation Erin Taylor.
CS 188: Artificial Intelligence Bayes Nets: Approximate Inference Instructor: Stuart Russell--- University of California, Berkeley.
Inference Algorithms for Bayes Networks
1 Bayesian Networks: A Tutorial. 2 Introduction Suppose you are trying to determine if a patient has tuberculosis. You observe the following symptoms:
Quiz 3: Mean: 9.2 Median: 9.75 Go over problem 1.
1 CMSC 671 Fall 2001 Class #20 – Thursday, November 8.
CS 416 Artificial Intelligence Lecture 15 Uncertainty Chapter 14 Lecture 15 Uncertainty Chapter 14.
Probabilistic Robotics Introduction Probabilities Bayes rule Bayes filters.
Bayes network inference  A general scenario:  Query variables: X  Evidence (observed) variables and their values: E = e  Unobserved variables: Y 
Conditional Independence As with absolute independence, the equivalent forms of X and Y being conditionally independent given Z can also be used: P(X|Y,
Probabilistic Reasoning Inference and Relational Bayesian Networks.
Web-Mining Agents Data Mining Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Karsten Martiny (Übungen)
CS 541: Artificial Intelligence Lecture VII: Inference in Bayesian Networks.
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 12
Inference in Bayesian Networks
Reasoning Under Uncertainty: Conditioning, Bayes Rule & Chain Rule
Markov ó Kalman Filter Localization
CS 4/527: Artificial Intelligence
CAP 5636 – Advanced Artificial Intelligence
Uncertainty in AI.
Inference Inference: calculating some useful quantity from a joint probability distribution Examples: Posterior probability: Most likely explanation: B.
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 12
Instructors: Fei Fang (This Lecture) and Dave Touretzky
CS 188: Artificial Intelligence
Class #19 – Tuesday, November 3
CS 188: Artificial Intelligence Fall 2008
Class #16 – Tuesday, October 26
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 12
Presentation transcript:

Homework 3: Naive Bayes Classification

Bayesian Networks Reading assignment: S Bayesian Networks Reading assignment: S. Wooldridge, Bayesian Belief Networks (linked from course webpage)

A patient comes into a doctor’s office with a fever and a bad cough. Hypothesis space H: h1: patient has flu h2: patient does not have flu Data D: coughing = true, fever = true, smokes = true

Naive Bayes flu Cause fever cough Effects

What if attributes are not independent? flu fever cough

What if more than one possible cause? flu smokes fever cough

Full joint probability distribution smokes cough  cough Sum of all boxes is 1. Fever  Fever flu p1 p2 p3 p4 flu p5 p6 p7 p8 In principle, the full joint distribution can be used to answer any question about probabilities of these combined parameters. However, size of full joint distribution scales exponentially with number of parameters so is expensive to store and to compute with.  smokes cough  cough fever  fever flu p9 p10 p11 p12 flu p13 p14 p15 p16

Full joint probability distribution smokes cough  cough Fever  Fever flu p1 p2 p3 p4 flu p5 p6 p7 p8 For example, what if we had another attribute, “allergies”? How many probabilities would we need to specify?  smokes cough  cough fever  fever flu p9 p10 p11 p12 flu p13 p14 p15 p16

cough  cough cough  cough cough  cough cough  cough fever  fever Allergy Allergy smokes smokes cough  cough cough  cough Fever Fever  Fever flu p1 p2 p3 p4 flu p5 p6 p7 p8 Fever Fever  Fever flu p17 p18 p19 p20 flu p21 p22 p23 p24 Allergy Allergy  smokes  smokes cough  cough cough  cough fever  fever flu p9 p10 p11 p12 flu p13 p14 p15 p16 fever  fever flu p25 p26 p27 p28 flu p29 p30 p31 p32

Allergy Allergy smokes smokes cough  cough cough  cough Fever Fever  Fever flu p1 p2 p3 p4 flu p5 p6 p7 p8 Fever Fever  Fever flu p17 p18 p19 p20 flu p21 p22 p23 p24 Allergy Allergy  smokes  smokes cough  cough cough  cough fever  fever flu p9 p10 p11 p12 flu p13 p14 p15 p16 fever  fever flu p25 p26 p27 p28 flu p29 p30 p31 p32 But can reduce this if we know which variables are conditionally independent

Bayesian networks “Graphical Models” Idea is to represent dependencies (or causal relations) for all the variables so that space and computation-time requirements are minimized. smokes flu cough fever Allergies “Graphical Models”

Bayesian Networks = Bayesian Belief Networks = Bayes Nets Bayesian Network: Alternative representation for complete joint probability distribution “Useful for making probabilistic inference about models domains characterized by inherent complexity and uncertainty” Uncertainty can come from: incomplete knowledge of domain inherent randomness in behavior in domain

cough flu smoke smoke flu flu Example: true false True 0.95 0.05 False 0.8 0.2 0.6 0.4 cough Conditional probability tables for each node flu smoke true 0.2 false 0.8 smoke true 0.01 false 0.99 flu smoke cough fever flu true false 0.9 0.1 0.2 0.8 fever flu

Inference in Bayesian networks If network is correct, can calculate full joint probability distribution from network. where parents(Xi) denotes specific values of parents of Xi.

Naive Bayes Example flu fever cough

Example Calculate

Example Calculate

In general... If network is correct, can calculate full joint probability distribution from network. where parents(Xi) denotes specific values of parents of Xi. But need efficient algorithms to do this (e.g., “belief propagation”, “Markov Chain Monte Carlo”).

Example from the reading:

What is the probability that Student A is late? What is the probability that Student B is late?

What is the probability that Student A is late? What is the probability that Student B is late? Unconditional (“marginal”) probability. We don’t know if there is a train strike.

What is the probability that Student A is late? What is the probability that Student B is late? Unconditional (“marginal”) probability. We don’t know if there is a train strike.

Now, suppose we know that there is a train strike Now, suppose we know that there is a train strike. How does this revise the probability that the students are late?

Now, suppose we know that there is a train strike Now, suppose we know that there is a train strike. How does this revise the probability that the students are late? Evidence: There is a train strike.

Now, suppose we know that Student A is late. How does this revise the probability that there is a train strike? How does this revise the probability that Student B is late? Notion of “belief propagation”. Evidence: Student A is late.

Now, suppose we know that Student A is late. How does this revise the probability that there is a train strike? How does this revise the probability that Student B is late? Notion of “belief propagation”. Evidence: Student A is late.

Now, suppose we know that Student A is late. How does this revise the probability that there is a train strike? How does this revise the probability that Student B is late? Notion of “belief propagation”. Evidence: Student A is late.

Another example from the reading: pneumonia smoking temperature cough

What is P(cough)? In-class exercises

Three types of inference Diagnostic: Use evidence of an effect to infer probability of a cause. E.g., Evidence: cough=true. What is P(pneumonia | cough)? Causal inference: Use evidence of a cause to infer probability of an effect E.g., Evidence: pneumonia=true. What is P(cough | pneumonia)? Inter-causal inference: “Explain away” potentially competing causes of a shared effect. E.g., Evidence: smoking=true. What is P(pneumonia | cough and smoking)?

Diagnostic: Evidence: cough=true. What is P(pneumonia | cough)? smoking temperature cough

Diagnostic: Evidence: cough=true. What is P(pneumonia | cough)? smoking temperature cough

Causal: Evidence: pneumonia=true. What is P(cough | pneumonia)? smoking temperature cough

Causal: Evidence: pneumonia=true. What is P(cough | pneumonia)? smoking temperature cough

Inter-causal: Evidence: smoking=true Inter-causal: Evidence: smoking=true. What is P(pneumonia | cough and smoking)? pneumonia smoking temperature cough

“Explaining away”

Math we used: Definition of conditional probability: Bayes Theorem Unconditional (marginal) probability Probability inference in Bayesian networks:

Complexity of Bayesian Networks For n random Boolean variables: Full joint probability distribution: 2n entries Bayesian network with at most k parents per node: Each conditional probability table: at most 2k entries Entire network: n 2k entries

What are the advantages of Bayesian networks? Intuitive, concise representation of joint probability distribution (i.e., conditional dependencies) of a set of random variables. Represents “beliefs and knowledge” about a particular class of situations. Efficient (approximate) inference algorithms Efficient, effective learning algorithms

Issues in Bayesian Networks Building / learning network topology Assigning / learning conditional probability tables Approximate inference via sampling

Real-World Example: The Lumière Project at Microsoft Research Bayesian network approach to answering user queries about Microsoft Office. “At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently.” “As an example, users working with the Excel spreadsheet might have required assistance with formatting “a graph”. Unfortunately, Excel has no knowledge about the common term, “graph,” and only considered in its keyword indexing the term “chart”.

Networks were developed by experts from user modeling studies.

Offspring of project was Office Assistant in Office 97, otherwise known as “clippie”. http://www.youtube.com/watch?v=bt-JXQS0zYc

The famous “sprinkler” example (J The famous “sprinkler” example (J. Pearl, Probabilistic Reasoning in Intelligent Systems, 1988)

Recall rule for inference in Bayesian networks: Example: What is

In-Class Exercise

More Exercises What is P(Cloudy| Sprinkler)? What is P(Cloudy | Rain)? What is P(Cloudy | Wet Grass)?

More Exercises What is P(Cloudy| Sprinkler)?

More Exercises What is P(Cloudy| Rain)?

More Exercises What is P(Cloudy| Wet Grass)?

Exact Inference in Bayesian Networks General question: What is P(X|e)? Notation convention: upper-case letters refer to random variables; lower-case letters refer to specific values of those variables

More Exercises 1. Suppose you observe it is cloudy and raining. What is the probability that the grass is wet? 2. Suppose you observe the sprinkler to be on and the grass is wet. What is the probability that it is raining? 3. Suppose you observe that the grass is wet and it is raining. What is the probability that it is cloudy?

General question: Given query variable X and observed evidence variable values e, what is P(X | e)?

Example: What is P(C |W, R)?

Worst-case complexity is exponential in n (number of nodes) Problem is having to enumerate all possibilities for many variables.

Can reduce computation by computing terms only once and storing for future use. E.g., “variable elimination algorithm”. (We won’t cover this.)

In general, however, exact inference in Bayesian networks is too expensive.

Approximate inference in Bayesian networks Instead of enumerating all possibilities, sample to estimate probabilities. X1 X2 X3 Xn ...

Direct Sampling Suppose we have no evidence, but we want to determine P(C,S,R,W) for all C,S,R,W. Direct sampling: Sample each variable in topological order, conditioned on values of parents. I.e., always sample from P(Xi | parents(Xi))

Example Sample from P(Cloudy). Suppose returns true. Sample from P(Sprinkler | Cloudy = true). Suppose returns false. Sample from P(Rain | Cloudy = true). Suppose returns true. Sample from P(WetGrass | Sprinkler = false, Rain = true). Suppose returns true. Here is the sampled event: [true, false, true, true]

Suppose there are N total samples, and let NS (x1, Suppose there are N total samples, and let NS (x1, ..., xn) be the observed frequency of the specific event x1, ..., xn. Suppose N samples, n nodes. Complexity O(Nn). Problem 1: Need lots of samples to get good probability estimates. Problem 2: Many samples are not realistic; low likelihood.

Markov Chain Monte Carlo Sampling One of most common methods used in real applications. Uses idea of Markov blanket of a variable Xi: parents, children, children’s other parents Fact: By construction of Bayesian network, a node is conditionally independent of its non-descendants, given its parents.

What is the Markov Blanket of Rain? What is the Markov blanket of Wet Grass? Proposition: A node Xi is conditionally independent of all other nodes in the network, given its Markov blanket.

Markov Chain Monte Carlo (MCMC) Sampling Algorithm Start with random sample from variables, with evidence variables fixed: (x1, ..., xn). This is the current “state” of the algorithm. Next state: Randomly sample value for one non-evidence variable Xi , conditioned on current values in “Markov Blanket” of Xi.

Example Query: What is P(Rain | Sprinkler = true, WetGrass = true)? MCMC: Random sample, with evidence variables fixed: [Cloudy, Sprinkler, Rain, WetGrass] = [true, true, false, true] Repeat: Sample Cloudy, given current values of its Markov blanket: Sprinkler = true, Rain = false. Suppose result is false. New state: [false, true, false, true] Sample Rain, given current values of its Markov blanket: Cloudy = false, Sprinkler = true, WetGrass = true. Suppose result is true. New state: [false, true, true, true].

Each sample contributes to estimate for query P(Rain | Sprinkler = true, WetGrass = true) Suppose we perform 100 such samples, 20 with Rain = true and 80 with Rain = false. Then answer to the query is Normalize (20,80) = .20,.80 Claim: “The sampling process settles into a dynamic equilibrium in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability, given the evidence.” That is: for all variables Xi, the probability of the value xi of Xi appearing in a sample is equal to P(xi | e). Proof of claim: Reference on request

Issues in Bayesian Networks Building / learning network topology Assigning / learning conditional probability tables Approximate inference via sampling Incorporating temporal aspects (e.g., evidence changes from one time step to the next).

Learning network topology Many different approaches, including: Heuristic search, with evaluation based on information theory measures Genetic algorithms Using “meta” Bayesian networks!

Learning conditional probabilities In general, random variables are not binary, but real-valued Conditional probability tables conditional probability distributions Estimate parameters of these distributions from data If data is missing on one or more variables, use “expectation maximization” algorithm

Speech Recognition Task: Identify sequence of words uttered by speaker, given acoustic signal. Uncertainty introduced by noise, speaker error, variation in pronunciation, homonyms, etc. Thus speech recognition is viewed as problem of probabilistic inference.

Speech Recognition So far, we’ve looked at probabilistic reasoning in static environments. Speech: Time sequence of “static environments”. Let X be the “state variables” (i.e., set of non-evidence variables) describing the environment (e.g., Words said during time step t) Let E be the set of evidence variables (e.g., features of acoustic signal).

The E values and X joint probability distribution changes over time. t1: X1, e1 t2: X2 , e2 etc.

At each t, we want to compute P(Words | S). We know from Bayes rule: P(S | Words), for all words, is a previously learned “acoustic model”. E.g. For each word, probability distribution over phones, and for each phone, probability distribution over acoustic signals (which can vary in pitch, speed, volume). P(Words), for all words, is the “language model”, which specifies prior probability of each utterance. E.g. “bigram model”: probability of each word following each other word.

Speech recognition typically makes three assumptions: Process underlying change is itself “stationary” i.e., state transition probabilities don’t change Current state X depends on only a finite history of previous states (“Markov assumption”). Markov process of order n: Current state depends only on n previous states. Values et of evidence variables depend only on current state Xt. (“Sensor model”)

From http://www.cs.berkeley.edu/~russell/slides/

From http://www.cs.berkeley.edu/~russell/slides/

Hidden Markov Models Markov model: Given state Xt, what is probability of transitioning to next state Xt+1 ? E.g., word bigram probabilities give P (wordt+1 | wordt ) Hidden Markov model: There are observable states (e.g., signal S) and “hidden” states (e.g., Words). HMM represents probabilities of hidden states given observable states.

From http://www.cs.berkeley.edu/~russell/slides/

From http://www.cs.berkeley.edu/~russell/slides/

Example: “I’m firsty, um, can I have something to dwink?” From http://www.cs.berkeley.edu/~russell/slides/

From http://www.cs.berkeley.edu/~russell/slides/