Presentation on theme: "Quiz 3: Mean: 9.2 Median: 9.75 Go over problem 1." — Presentation transcript:

1 Quiz 3: Mean: 9.2 Median: 9.75 Go over problem 1

2 Go over Adaboost examples

3 Fix to C4.5 data formatting problem?

4 Quiz 4

5 Alternative simple (but effective) discretization method (Yang & Webb, 2001). Let n = the number of training examples. For each attribute A_i, create roughly √n bins: sort the values of A_i in ascending order, and put roughly √n of them in each bin. Add-one smoothing of the probabilities is not needed. This gives a good balance between discretization bias and variance.

6 Example: applying this method to Humidity. Sorted values: 25, 38, 50, 80, 93, 98, 98, 99. With these eight values, √8 ≈ 3, so each bin holds about three consecutive sorted values.
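A minimal sketch of this equal-frequency binning, assuming both the number of bins and the number of values per bin are set to roughly √n (the function name and exact rounding are illustrative, not from the slides):

```python
import math

def sqrt_n_bins(values):
    """Equal-frequency discretization in the spirit of Yang & Webb (2001):
    sort the values and place about sqrt(n) of them in each bin."""
    n = len(values)
    bin_size = max(1, round(math.sqrt(n)))
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, n, bin_size)]

humidity = [25, 38, 50, 80, 93, 98, 98, 99]
print(sqrt_n_bins(humidity))
# [[25, 38, 50], [80, 93, 98], [98, 99]]  -- sqrt(8) rounds to 3 values per bin
```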

9 Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani). The naive Bayes classifier is called "naive" because it assumes the attributes are independent of one another given the class.

10 This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?

11 Experiments: compare five classification methods on 30 data sets from the UCI ML database. SBC = Simple Bayesian Classifier; Default = choose the class with the most representatives in the data; C4.5 = Quinlan's decision tree induction system; PEBLS = an instance-based learning system; CN2 = a rule-induction system.

12 For SBC, numeric values were discretized into ten equal-length intervals.

13 [Results table not captured in the transcript.]

14 Rows of the summary table: (1) number of domains in which SBC was more accurate versus less accurate than the corresponding classifier; (2) the same as the first row, but counting only differences significant at 95% confidence; (3) average rank over all domains (1 is best in each domain).

15 Measuring Attribute Dependence. They used a simple, pairwise mutual information measure: for attributes A_m and A_n, dependence is defined as D(A_m, A_n | C) = H(A_m | C) + H(A_n | C) − H(A_m A_n | C), where A_m A_n is a "derived attribute" whose values consist of the possible combinations of values of A_m and A_n, and H(· | C) denotes class-conditional entropy. Note: if A_m and A_n are independent given C, then D(A_m, A_n | C) = 0.
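A small sketch of this measure, assuming D is the class-conditional mutual information H(A_m|C) + H(A_n|C) − H(A_m A_n|C) estimated from counts (the helper names are illustrative):

```python
from collections import Counter
from math import log2

def cond_entropy(xs, cs):
    """Class-conditional entropy H(X | C) estimated from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, cs))          # counts of (x, c) pairs
    per_class = Counter(cs)               # counts of each class value
    return -sum((cnt / n) * log2(cnt / per_class[c]) for (x, c), cnt in joint.items())

def dependence(a_m, a_n, cs):
    """D(A_m, A_n | C); zero when A_m and A_n are independent given the class."""
    derived = list(zip(a_m, a_n))         # the "derived attribute" A_m A_n
    return cond_entropy(a_m, cs) + cond_entropy(a_n, cs) - cond_entropy(derived, cs)
```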

16 Results: (1) SBC is more successful than more complex methods, even when there is substantial dependence among attributes. (2) No correlation between degree of attribute dependence and SBC’s rank. But why????

17 An Example. Let the classes be {+, −} and the attributes be A, B, and C (here C is an attribute, distinct from the class label). Let P(+) = P(−) = 1/2. Suppose A and C are completely independent, and A and B are completely dependent (e.g., A = B). Optimal classification procedure:

18 This leads to the following conditions. Optimal classifier: if P(A|+) P(C|+) > P(A|−) P(C|−), then class = +; else class = −. SBC (which counts the evidence from A twice, because B = A): if P(A|+)² P(C|+) > P(A|−)² P(C|−), then class = +; else class = −.

19 In the paper, the authors use Bayes' theorem to rewrite these conditions in terms of p = P(+ | A) and q = P(+ | C), and plot the decision boundaries of the optimal classifier and of the SBC over the (p, q) plane, labeling the regions classified as + and −. [Plot not captured in the transcript.]

20 Even though A and B are completely dependent, and the SBC assumes they are completely independent, the SBC gives the optimal classification in a very large part of the problem space! But why?
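A quick numerical check of this claim for the example above (equal priors, B = A), comparing the two decision rules over a grid of p = P(+|A) and q = P(+|C). This is an illustrative sketch, not a reproduction of the paper's figure:

```python
import numpy as np

p, q = np.meshgrid(np.linspace(0.001, 0.999, 500), np.linspace(0.001, 0.999, 500))

optimal_plus = p * q > (1 - p) * (1 - q)         # optimal rule: + iff p + q > 1
sbc_plus     = p**2 * q > (1 - p)**2 * (1 - q)   # SBC rule: A counted twice since B = A

agreement = np.mean(optimal_plus == sbc_plus)
print(f"SBC agrees with the optimal classifier on about {agreement:.0%} of the (p, q) square")
```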

21 Explanation: Suppose C = {+, −} is the set of possible classes. Let x be a new example with attribute values a_1, …, a_n. The naive Bayes classifier calculates the two probabilities P(+ | x) and P(− | x) (under the independence assumption) and returns the class with the maximum probability given x.

22 The probability calculations are correct only if the independence assumption is correct. However, the classification is correct in all cases in which the relative ranking of the two probabilities, as calculated by the SBC, is correct! The latter covers a lot more cases than the former. Thus, the SBC is effective in many cases in which the independence assumption does not hold.
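A minimal sketch of the SBC decision rule described above; the data structures (a prior per class and a table of per-attribute conditional probabilities) are illustrative assumptions:

```python
from math import prod

def sbc_predict(x, priors, cond):
    """Return the class c maximizing P(c) * prod_i P(x_i | c).
    priors[c] is P(c); cond[c][i][v] is the estimate of P(attribute_i = v | c)."""
    def score(c):
        return priors[c] * prod(cond[c][i][v] for i, v in enumerate(x))
    return max(priors, key=score)
```

Only the ranking of the class scores matters for the prediction, which is why the SBC can classify correctly even when the estimated probabilities themselves are wrong.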

23 More on Bias and Variance

24 Bias. From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

25 Variance. From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

26 Noise. From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

27 Sources of Bias and Variance. Bias arises when the classifier cannot represent the true function; that is, the classifier underfits the data. Variance arises when the classifier overfits the data. There is often a tradeoff between bias and variance. From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

28 Bias-Variance Tradeoff. As a general rule, the more biased a learning machine, the less variance it has, and the more variance it has, the less biased it is. From knight.cis.temple.edu/~yates/cis8538/.../intro-text-classification.ppt

29 From: http://www.ire.pw.edu.pl/~rsulej/NetMaker/index.php?pg=e06

30 Bias-Variance Tradeoff (repeated). As a general rule, the more biased a learning machine, the less variance it has, and the more variance it has, the less biased it is. Why? From knight.cis.temple.edu/~yates/cis8538/.../intro-text-classification.ppt

31 SVM Bias and Variance. The bias-variance tradeoff is controlled by the kernel parameter. A biased classifier (a linear SVM) gives better results than a classifier that can represent the true decision boundary! From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

32 Effect of Boosting. In the early iterations, boosting is primarily a bias-reducing method. In later iterations, it appears to be primarily a variance-reducing method. From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt

33 Bayesian Networks. Reading: S. Wooldridge, Bayesian belief networks (linked from the class website).

34 A patient comes into a doctor's office with a fever and a bad cough. Hypothesis space H: h_1 = patient has flu; h_2 = patient does not have flu. Data D: coughing = true, fever = true, smokes = true.

35 Naive Bayes structure: flu is the cause (the class node); smokes, cough, and fever are its effects (the attribute nodes).

36 Full joint probability distribution. In principle, the full joint distribution can be used to answer any question about the probabilities of these variables. However, its size scales exponentially with the number of variables, so it is expensive to store and to compute with. For flu, smokes, cough, and fever, the table has 2^4 = 16 entries p_1, …, p_16 (one for each combination of truth values), and the sum of all entries is 1.

37 Bayesian networks ("graphical models"). The idea is to represent the dependencies (or causal relations) among all the variables (smokes, flu, cough, fever) so that space and computation-time requirements are minimized.

38 Conditional probability tables for each node:
P(flu): true 0.01, false 0.99
P(smoke): true 0.2, false 0.8
P(fever | flu): flu = true → fever true 0.9, false 0.1; flu = false → fever true 0.2, false 0.8
P(cough | smoke, flu): smoke = true, flu = true → cough true 0.95, false 0.05; smoke = true, flu = false → 0.8, 0.2; smoke = false, flu = true → 0.6, 0.4; smoke = false, flu = false → 0.05, 0.95

39 Semantics of Bayesian networks. If the network is correct, the full joint probability distribution can be calculated from it: P(x_1, …, x_n) = ∏_i P(x_i | parents(X_i)), where parents(X_i) denotes the specific values of the parents of X_i.

40 Example: calculate a joint probability from the network. [The specific query on the slide was not captured in the transcript.]
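A minimal sketch of such a calculation using the CPTs reconstructed on slide 38. The specific query shown, P(flu, ¬smoke, cough, fever), is an illustrative assumption, since the slide's own query was not captured:

```python
# CPTs from slide 38: flu and smoke are root nodes; cough depends on (smoke, flu);
# fever depends on flu.
p_flu   = {True: 0.01, False: 0.99}
p_smoke = {True: 0.20, False: 0.80}
p_cough = {(True, True): 0.95, (True, False): 0.80,    # P(cough=True | smoke, flu)
           (False, True): 0.60, (False, False): 0.05}
p_fever = {True: 0.90, False: 0.20}                    # P(fever=True | flu)

def joint(flu, smoke, cough, fever):
    """P(flu, smoke, cough, fever) = P(flu) P(smoke) P(cough|smoke,flu) P(fever|flu)."""
    pc = p_cough[(smoke, flu)] if cough else 1 - p_cough[(smoke, flu)]
    pf = p_fever[flu] if fever else 1 - p_fever[flu]
    return p_flu[flu] * p_smoke[smoke] * pc * pf

print(joint(flu=True, smoke=False, cough=True, fever=True))
# 0.01 * 0.8 * 0.6 * 0.9 = 0.00432
```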

41 Another (famous, though weird) example: Rain → Wet grass. Question: If you observe that the grass is wet, what is the probability that it rained?
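The slide's numbers are not preserved in the transcript, but for this two-node network the diagnostic query follows directly from Bayes' rule:

$$
P(R \mid W) \;=\; \frac{P(W \mid R)\,P(R)}{P(W \mid R)\,P(R) + P(W \mid \neg R)\,P(\neg R)}
$$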

42 [Calculation not captured in the transcript.]

43 Now add a second parent: Sprinkler → Wet grass ← Rain. Question: If you observe that the sprinkler is on, what is the probability that the grass is wet? (Predictive inference.)
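In this network Sprinkler and Rain are independent root nodes, so the predictive query marginalizes over Rain (this shows only the form of the calculation; the slide's numeric CPTs are not preserved):

$$
P(W \mid S) \;=\; \sum_{r \in \{T,\,F\}} P(W \mid S, R = r)\, P(R = r)
$$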

44 [Calculation not captured in the transcript.]

45 Question: If you observe that the grass is wet, what is the probability that the sprinkler is on? (Diagnostic inference.) Note that the prior P(S) = 0.2, so knowing that the grass is wet increases the probability that the sprinkler is on.
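The form of the diagnostic calculation (again, the slide's numbers are not preserved):

$$
P(S \mid W) \;=\; \frac{P(W \mid S)\,P(S)}{P(W)} \;=\; \frac{\sum_{r} P(W \mid S, r)\,P(r)\,P(S)}{\sum_{s,\,r} P(W \mid s, r)\,P(s)\,P(r)}
$$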

46 Now assume the grass is wet and it rained. What is the probability that the sprinkler was on? Knowing that it rained decreases the probability that the sprinkler was on, given that the grass is wet.
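This is the classic "explaining away" effect. Since Sprinkler and Rain are independent a priori, conditioning on both the wet grass and the rain gives

$$
P(S \mid W, R) \;=\; \frac{P(W \mid S, R)\,P(S)}{P(W \mid S, R)\,P(S) + P(W \mid \neg S, R)\,P(\neg S)}
$$

and with the CPTs used on the slide (not preserved here) this value is smaller than P(S | W): observing rain explains away the wet grass.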

47 Now add a common cause: Cloudy is a parent of both Sprinkler and Rain, which are in turn parents of Wet grass. Question: Given that it is cloudy, what is the probability that the grass is wet?
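With Cloudy as a common parent of Sprinkler and Rain, the query marginalizes over both (form only; the numeric CPTs are not preserved):

$$
P(W \mid C) \;=\; \sum_{s}\sum_{r} P(W \mid s, r)\, P(s \mid C)\, P(r \mid C)
$$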

48 [Calculation not captured in the transcript.]

49 In general... If the network is correct, the full joint probability distribution can be calculated from it: P(x_1, …, x_n) = ∏_i P(x_i | parents(X_i)), where parents(X_i) denotes the specific values of the parents of X_i. But efficient algorithms are needed to do this (e.g., "belief propagation", "Markov chain Monte Carlo").

50 Complexity of Bayesian Networks. For n random Boolean variables: the full joint probability distribution has 2^n entries. For a Bayesian network with at most k parents per node: each conditional probability table has at most 2^k entries, and the entire network has at most n · 2^k entries.
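A concrete instance of these counts (the numbers are illustrative, not from the slides):

$$
n = 30,\ k = 3:\qquad 2^{n} = 2^{30} \approx 1.1 \times 10^{9}\ \text{entries for the full joint}
\quad\text{vs.}\quad n \cdot 2^{k} = 30 \cdot 8 = 240\ \text{entries for the network.}
$$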

51 What are the advantages of Bayesian networks? Intuitive, concise representation of joint probability distribution (i.e., conditional dependencies) of a set of random variables. Represents “beliefs and knowledge” about a particular class of situations. Efficient (?) (approximate) inference algorithms Efficient, effective learning algorithms

52 Issues in Bayesian Networks: building / learning the network topology; assigning / learning the conditional probability tables; approximate inference via sampling.
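A minimal sketch of the last item above, approximate inference by sampling, using rejection sampling on the flu network from slide 38 (the network and the query P(flu | fever) are reused purely for illustration):

```python
import random

p_flu   = {True: 0.01, False: 0.99}
p_smoke = {True: 0.20, False: 0.80}
p_cough = {(True, True): 0.95, (True, False): 0.80,   # P(cough=True | smoke, flu)
           (False, True): 0.60, (False, False): 0.05}
p_fever = {True: 0.90, False: 0.20}                   # P(fever=True | flu)

def sample():
    """Draw one joint sample by sampling each node given its parents (ancestral sampling)."""
    flu = random.random() < p_flu[True]
    smoke = random.random() < p_smoke[True]
    cough = random.random() < p_cough[(smoke, flu)]
    fever = random.random() < p_fever[flu]
    return flu, smoke, cough, fever

def estimate_p_flu_given_fever(n=200_000):
    """Rejection sampling: keep only samples consistent with the evidence (fever = True)."""
    kept = [flu for flu, _, _, fever in (sample() for _ in range(n)) if fever]
    return sum(kept) / len(kept)

print(estimate_p_flu_given_fever())
# exact value: 0.9*0.01 / (0.9*0.01 + 0.2*0.99) ≈ 0.043
```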

53 Real-World Example: The Lumière Project at Microsoft Research. A Bayesian network approach to answering user queries about Microsoft Office. "At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently." "As an example, users working with the Excel spreadsheet might have required assistance with formatting 'a graph'. Unfortunately, Excel has no knowledge about the common term, 'graph,' and only considered in its keyword indexing the term 'chart'."

54 [Figure not captured in the transcript.]

55 Networks were developed by experts from user modeling studies.

56 [Figure not captured in the transcript.]

57 An offspring of the project was the Office Assistant in Office 97, otherwise known as "Clippy". http://www.youtube.com/watch?v=bt-JXQS0zYc

