
Slide 1: Bayesian Networks; Midterm Review 1 of 2 (Lecture 23 of 42)
Kansas State University, Department of Computing and Information Sciences
CIS 798: Intelligent Systems and Machine Learning
Thursday, 08 March 2007
William H. Hsu, Department of Computing and Information Sciences, KSU
http://www.kddresearch.org/Courses/Spring-2007/CIS732
Readings:
– Chapters 1-7, Mitchell
– Chapters 14-15 and 18, Russell and Norvig

Slide 2: Case Study: BOC and Gibbs Classifier for ANNs [1]
[Figure: feedforward ANN with input layer x1-x3, hidden layer h1-h4 (input-to-hidden weights u_ij, e.g., u_11), and output layer o1-o2 (hidden-to-output weights v_jk, e.g., v_42)]
Methodology
– θ (model parameters): a_j, u_ij, b_k, v_jk
– Hyperparameters: prior scale parameters for the a, u, b, and v groups
Computing Feedforward ANN Output (see the sketch below)
– Output layer activation: f_k(x) = b_k + Σ_j v_jk h_j(x)
– Hidden layer activation: h_j(x) = tanh(a_j + Σ_i u_ij x_i)
Classifier Output: Prediction
– Given new input from "inference space"
– Want: Bayesian optimal test output
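A minimal NumPy sketch of the feedforward computation defined above; the array shapes and the function name feedforward are illustrative assumptions, not from the original slides.

```python
import numpy as np

def feedforward(x, a, U, b, V):
    """Feedforward pass matching the slide's equations.
    x: inputs (n_in,);  a: hidden biases (n_hid,);  U: (n_in, n_hid);
    b: output biases (n_out,);  V: (n_hid, n_out)."""
    h = np.tanh(a + x @ U)   # h_j(x) = tanh(a_j + sum_i u_ij x_i)
    return b + h @ V         # f_k(x) = b_k + sum_j v_jk h_j(x)
```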

Slide 3: Case Study: BOC and Gibbs Classifier for ANNs [2]
Problem
– True parameter space is infinite (real-valued weights and thresholds)
– Worse yet, we know nothing about the target distribution P(h | D)
Solution: Markov Chain Monte Carlo (MCMC) Methods
– Sample from a conditional density for h = (parameters, hyperparameters)
– Integrate over these samples numerically (as opposed to analytically)
MCMC Estimation: An Application of Gibbs Sampling (see the sketch below)
– Want: the posterior expectation of a function v(θ), e.g., v(θ) = f_k(x^(n+1), θ)
– Target: E[v | D] = ∫ v(θ) P(θ | D) dθ
– MCMC estimate: (1/T) Σ_{t=1..T} v(θ^(t)), where θ^(1), …, θ^(T) are samples drawn from P(θ | D)
– FAQ [MacKay, 1999]: http://wol.ra.phy.cam.ac.uk/mackay/Bayes_FAQ.html
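A small sketch of the MCMC estimate above, assuming posterior samples θ^(t) are already available (e.g., from the Gibbs procedure on the next slide); the function names are illustrative.

```python
import numpy as np

def mcmc_estimate(v, samples):
    """Approximate E[v(theta) | D] by (1/T) * sum_t v(theta_t)
    over samples theta_t drawn from the posterior P(theta | D)."""
    return np.mean([v(theta) for theta in samples], axis=0)
```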

Slide 4: BOC and Gibbs Sampling
Gibbs Sampling: Approximating the BOC
– Collect many Gibbs samples
– Interleave the updates of parameters and hyperparameters
  e.g., train ANN weights using Gibbs sampling (see the sketch below)
  Accept a candidate Δw if it improves error, or with probability given by the current threshold (rand() < current threshold)
  After every few thousand such transitions, sample hyperparameters
– Convergence: lower the current threshold slowly
– Hypothesis: return model (e.g., network weights)
– Intuitive idea: sample models (e.g., ANN snapshots) according to likelihood
How Close to Bayes Optimality Can Gibbs Sampling Get?
– Depends on how many samples are taken (how slowly the current threshold is lowered)
– Simulated annealing terminology: annealing schedule
– More on this when we get to genetic algorithms
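A schematic sketch of the annealed acceptance rule described above. The error function, proposal scale, and schedule constants are placeholders, not values from the course.

```python
import numpy as np

def anneal_weights(error, w, n_steps=10_000, threshold=1.0, decay=0.9995, seed=0):
    """Search over weight vectors: accept a candidate step if it lowers error,
    or otherwise with probability equal to a slowly lowered threshold."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        candidate = w + rng.normal(scale=0.05, size=w.shape)  # small random perturbation
        if error(candidate) < error(w) or rng.random() < threshold:
            w = candidate                                     # accept improving or "lucky" moves
        threshold *= decay                                    # annealing schedule: lower slowly
    return w
```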

Slide 5: Graphical Models of Probability Distributions
Idea
– Want: a model that can be used to perform inference
– Desired properties
  Ability to represent functional, logical, stochastic relationships
  Express uncertainty
  Observe the laws of probability
  Tractable inference when possible
  Can be learned from data
Additional Desiderata
– Ability to incorporate knowledge
  Knowledge acquisition and elicitation: in a format familiar to domain experts
  Language of subjective probabilities and relative probabilities
– Support decision making
  Represent utilities (cost or value of information, state)
  Probability theory + utility theory = decision theory
– Ability to reason over time (temporal models)

Slide 6: Using Graphical Models
[Figure: Naïve Bayes as a graphical model: class node y with directed edges to x1, x2, x3, …, xn, each edge labeled with P(xi | y)]
A Graphical View of Simple (Naïve) Bayes
– x_i ∈ {0, 1} for each i ∈ {1, 2, …, n}; y ∈ {0, 1}
– Given: P(x_i | y) for each i ∈ {1, 2, …, n}, and P(y)
– Assume conditional independence
  For all i ∈ {1, 2, …, n}: P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, x_{i+2}, …, x_n, y) = P(x_i | y)
  NB: this assumption entails the Naïve Bayes assumption. Why?
– Can compute P(y | x) given this information (see the sketch below)
– Can also compute the joint pdf over all n + 1 variables
Inference Problem for a (Simple) Bayesian Network
– Use the above model to compute the probability of any conditional event
– Exercise: P(x_1, x_2, y | x_3, x_4)
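A minimal sketch of computing P(y | x) from the quantities given above (P(x_i | y) and P(y)) for binary features; the array layout is an assumption for illustration.

```python
import numpy as np

def naive_bayes_posterior(x, p_y, p_x_given_y):
    """P(y | x) for binary naive Bayes.
    x: observed bits, shape (n,); p_y: P(y = 1);
    p_x_given_y: P(x_i = 1 | y), shape (2, n), row index = y."""
    x = np.asarray(x)
    likelihood = np.where(x == 1, p_x_given_y, 1.0 - p_x_given_y).prod(axis=1)  # prod_i P(x_i | y)
    joint = likelihood * np.array([1.0 - p_y, p_y])                             # P(x, y) for y = 0, 1
    return joint / joint.sum()                                                  # normalize over y
```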

Slide 7: In-Class Exercise: Probabilistic Inference
Inference Problem for a (Simple) Bayesian Network
– Model: Naïve Bayes
– Objective: compute the probability of any conditional event
Exercise
– Given: P(x_i | y) for i ∈ {1, 2, 3, 4}, and P(y)
– Want: P(x_1, x_2, y | x_3, x_4) (one way to work it is shown below)
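One way to work the exercise, using only the definition of conditional probability and the conditional-independence assumption from the previous slide:

\[
P(x_1, x_2, y \mid x_3, x_4)
  = \frac{P(x_1, x_2, x_3, x_4, y)}{P(x_3, x_4)}
  = \frac{P(y)\,\prod_{i=1}^{4} P(x_i \mid y)}
         {\sum_{y'} P(y')\,P(x_3 \mid y')\,P(x_4 \mid y')}
\]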

Slide 8: Unsupervised Learning and Conditional Independence
Given: (n + 1)-tuples (x_1, x_2, …, x_n, x_{n+1})
– No notion of instance variable or label
– After seeing some examples, want to know something about the domain
  Correlations among variables
  Probability of certain events
  Other properties
Want to Learn: the Most Likely Model that Generates the Observed Data
– In general, a very hard problem
– Under certain assumptions, we have shown that we can do it
Assumption: Causal Markovity
– Conditional independence among "effects", given "cause"
– When is the assumption appropriate?
– Can it be relaxed?
Structure Learning
– Can we learn more general probability distributions?
– Examples: automatic speech recognition (ASR), natural language, etc.
[Figure: Naïve Bayes graphical model (y with edges to x1, …, xn labeled P(xi | y)), as on Slide 6]

Slide 9: Bayesian Belief Networks (BBNs): Definition
[Figure: "Sprinkler" BBN. X1 = Season (Spring, Summer, Fall, Winter); X2 = Sprinkler (On, Off); X3 = Rain (None, Drizzle, Steady, Downpour); X4 = Ground (Wet, Dry); X5 = Ground (Slippery, Not-Slippery)]
Conditional Independence
– X is conditionally independent (CI) of Y given Z (sometimes written X ⊥ Y | Z) iff P(X | Y, Z) = P(X | Z) for all values of X, Y, and Z
– Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning), i.e., Thunder ⊥ Rain | Lightning
Bayesian Network
– Directed graph model of conditional dependence assertions (or CI assumptions)
– Vertices (nodes): denote events (each a random variable)
– Edges (arcs, links): denote conditional dependencies
General Product (Chain) Rule for BBNs: P(X_1, X_2, …, X_n) = Π_i P(X_i | Parents(X_i))
Example ("Sprinkler" BBN; see the sketch below)
– P(Summer, Off, Drizzle, Wet, Not-Slippery) = P(S) · P(O | S) · P(D | S) · P(W | O, D) · P(N | W)
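A small sketch of the chain-rule factorization on the Sprinkler example. Every CPT value below is invented for illustration; only the factorization itself comes from the slide.

```python
# Invented CPT entries, just enough to evaluate the one joint probability.
p_season = {"Summer": 0.25}
p_sprinkler = {("Off", "Summer"): 0.4}         # P(Sprinkler | Season)
p_rain = {("Drizzle", "Summer"): 0.1}          # P(Rain | Season)
p_wet = {("Wet", "Off", "Drizzle"): 0.8}       # P(Ground-Wet | Sprinkler, Rain)
p_slip = {("Not-Slippery", "Wet"): 0.3}        # P(Ground-Slippery | Ground-Wet)

joint = (p_season["Summer"]
         * p_sprinkler[("Off", "Summer")]
         * p_rain[("Drizzle", "Summer")]
         * p_wet[("Wet", "Off", "Drizzle")]
         * p_slip[("Not-Slippery", "Wet")])
print(joint)  # P(Summer, Off, Drizzle, Wet, Not-Slippery) under the invented CPTs
```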

Slide 10: Bayesian Belief Networks: Properties
Conditional Independence
– Variable (node): conditionally independent of non-descendants given parents
– Example (see the Cancer BBN in the figure)
– Result: chain rule for probabilistic inference
Bayesian Network: Probabilistic Semantics
– Node: variable
– Edge: one axis of a conditional probability table (CPT)
[Figure: Cancer BBN with nodes X1 = Age, X2 = Gender, X3 = Exposure-To-Toxics, X4 = Smoking, X5 = Cancer, X6 = Serum Calcium, X7 = Lung Tumor]

Slide 11: Topic 0: A Brief Overview of Machine Learning
Overview: Topics, Applications, Motivation
Learning = Improving with Experience at Some Task
– Improve over task T,
– with respect to performance measure P,
– based on experience E.
Brief Tour of Machine Learning
– A case study
– A taxonomy of learning
– Intelligent systems engineering: specification of learning problems
Issues in Machine Learning
– Design choices
– The performance element: intelligent systems
Some Applications of Learning
– Database mining, reasoning (inference/decision support), acting
– Industrial usage of intelligent systems

Slide 12: Topic 1: Concept Learning and Version Spaces
Concept Learning as Search through H
– Hypothesis space H as a state space
– Learning: finding the correct hypothesis
General-to-Specific Ordering over H
– Partially-ordered set: Less-Specific-Than (More-General-Than) relation
– Upper and lower bounds in H
Version Space Candidate Elimination Algorithm
– S and G boundaries characterize the learner's uncertainty
– Version space can be used to make predictions over unseen cases
Learner Can Generate Useful Queries
Next Lecture: When and Why Are Inductive Leaps Possible?

Slide 13: Topic 2: Inductive Bias and PAC Learning
Inductive Leaps Possible Only if Learner Is Biased
– Futility of learning without bias
– Strength of inductive bias: proportional to restrictions on hypotheses
Modeling Inductive Learners with Equivalent Deductive Systems
– Representing inductive learning as theorem proving
– Equivalent learning and inference problems
Syntactic Restrictions
– Example: m-of-n concept
Views of Learning and Strategies
– Removing uncertainty ("data compression")
– Role of knowledge
Introduction to Computational Learning Theory (COLT)
– Things COLT attempts to measure
– Probably Approximately Correct (PAC) learning framework
Next: Occam's Razor, VC Dimension, and Error Bounds

Slide 14: Topic 3: PAC, VC-Dimension, and Mistake Bounds
COLT: Framework for Analyzing Learning Environments
– Sample complexity of C (what is m? see the bound below)
– Computational complexity of L
– Required expressive power of H
– Error and confidence bounds (PAC: 0 < ε < 1/2, 0 < δ < 1/2)
What PAC Prescribes
– Whether to try to learn C with a known H
– Whether to try to reformulate H (apply a change of representation)
Vapnik-Chervonenkis (VC) Dimension
– A formal measure of the complexity of H (besides |H|)
– Based on X and a worst-case labeling game
Mistake Bounds
– How many mistakes could L incur?
– Another way to measure the cost of learning
Next: Decision Trees
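As a reminder of what "sample complexity" refers to here, the standard PAC bound for a consistent learner over a finite hypothesis space (Mitchell, Chapter 7): the version space is ε-exhausted with probability at least 1 − δ whenever

\[
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).
\]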

Slide 15: Topic 4: Decision Trees
Decision Trees (DTs)
– Can be boolean (c(x) ∈ {+, -}) or range over multiple classes
– When to use DT-based models
Generic Algorithm Build-DT: Top-Down Induction
– Calculating the best attribute upon which to split
– Recursive partitioning
Entropy and Information Gain
– Goal: measure the uncertainty removed by splitting on a candidate attribute A
  Calculating information gain (change in entropy); see the sketch below
  Using information gain in the construction of the tree
– ID3 ≡ Build-DT using Gain(·)
ID3 as Hypothesis Space Search (in the State Space of Decision Trees)
Heuristic Search and Inductive Bias
Data Mining using MLC++ (Machine Learning Library in C++)
Next: More Biases (Occam's Razor); Managing DT Induction
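A short sketch of the entropy and information-gain calculations that Build-DT/ID3 use to pick the split attribute; the data representation (list of dicts plus a parallel label list) is an assumption for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c) over the class labels in S."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = H(S) - sum_v (|S_v| / |S|) * H(S_v) for a split on attribute A."""
    n = len(labels)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [y for e, y in zip(examples, labels) if e[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder
```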

Slide 16: Topic 5: DTs, Occam's Razor, and Overfitting
Occam's Razor and Decision Trees
– Preference biases versus language biases
– Two issues regarding Occam algorithms
  Why prefer smaller trees? (less chance of "coincidence")
  Is Occam's Razor well defined? (yes, under certain assumptions)
– MDL principle and Occam's Razor: more to come
Overfitting
– Problem: fitting training data too closely
  General definition of overfitting
  Why it happens
– Overfitting prevention, avoidance, and recovery techniques
Other Ways to Make Decision Tree Induction More Robust
Next: Perceptrons, Neural Nets (Multi-Layer Perceptrons), Winnow

Slide 17: Topic 6: Perceptrons and Winnow
Neural Networks: Parallel, Distributed Processing Systems
– Biological and artificial (ANN) types
– Perceptron (LTU, LTG): model neuron
Single-Layer Networks
– Variety of update rules (the additive perceptron rule is sketched below)
  Multiplicative (Hebbian, Winnow), additive (gradient: Perceptron, Delta Rule)
  Batch versus incremental mode
– Various convergence and efficiency conditions
– Other ways to learn linear functions
  Linear programming (general-purpose)
  Probabilistic classifiers (some assumptions)
Advantages and Disadvantages
– "Disadvantage" (tradeoff): simple and restrictive
– "Advantage": perform well on many realistic problems (e.g., some text learning)
Next: Multi-Layer Perceptrons, Backpropagation, ANN Applications
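A minimal sketch of the additive (perceptron) update rule mentioned above; Winnow differs in using multiplicative weight updates. Labels in {-1, +1} and a bias column folded into X are assumptions for illustration.

```python
import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    """Incremental perceptron rule: on each mistake, w <- w + lr * y_i * x_i.
    X: (m, n) inputs with a bias column appended; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # misclassified (or on the decision boundary)
                w += lr * y_i * x_i    # additive update
    return w
```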

Slide 18: Topic 7: MLPs and Backpropagation
Multi-Layer ANNs
– Focused on feedforward MLPs
– Backpropagation of error: distributes the penalty (loss) function throughout the network
– Gradient learning: takes the derivative of the error surface with respect to the weights
  Error is based on the difference between desired output (t) and actual output (o)
  Actual output (o) is based on the activation function
  Must take the partial derivative of the activation function, so choose one that is easy to differentiate
  Two common choices: sigmoid (aka logistic) and hyperbolic tangent (tanh); their derivatives are given below
Overfitting in ANNs
– Prevention: attribute subset selection
– Avoidance: cross-validation, weight decay
ANN Applications: Face Recognition, Text-to-Speech
Open Problems
Recurrent ANNs: Can Express Temporal Depth (Non-Markovity)
Next: Statistical Foundations and Evaluation, Bayesian Learning Intro
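The "easy to differentiate" property referred to above: both activation functions have derivatives expressible in terms of their own outputs, which keeps the backpropagation updates cheap to compute.

\[
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr), \qquad
\tanh'(x) = 1 - \tanh^2(x).
\]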

Slide 19: Topic 8: Statistical Evaluation of Hypotheses
Statistical Evaluation Methods for Learning: Three Questions
– Generalization quality
  How well does observed accuracy estimate generalization accuracy?
  Estimation bias and variance
  Confidence intervals (see the interval below)
– Comparing generalization quality
  How certain are we that h_1 is better than h_2?
  Confidence intervals for paired tests
– Learning and statistical evaluation
  What is the best way to make the most of limited data?
  k-fold CV
Tradeoffs: Bias versus Variance
Next: Sections 6.1-6.5, Mitchell (Bayes's Theorem; ML; MAP)
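The basic confidence interval behind the first question (Mitchell, Chapter 5): for a hypothesis h evaluated on n independently drawn test examples, the true error is estimated by the sample error with an approximate N% interval

\[
error_{\mathcal{D}}(h) \;\approx\; error_S(h) \;\pm\; z_N \sqrt{\frac{error_S(h)\,\bigl(1 - error_S(h)\bigr)}{n}},
\]

where, for example, z_N = 1.96 for a 95% interval.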

Slide 20: Topic 9: Bayes's Theorem, MAP, MLE
Introduction to Bayesian Learning
– Framework: using probabilistic criteria to search H
– Probability foundations
  Definitions: subjectivist, objectivist; Bayesian, frequentist, logicist
  Kolmogorov axioms
Bayes's Theorem
– Definition of conditional (posterior) probability
– Product rule
Maximum A Posteriori (MAP) and Maximum Likelihood (ML) Hypotheses (see the definitions below)
– Bayes's Rule and MAP
– Uniform priors: allow use of MLE to generate MAP hypotheses
– Relation to version spaces, candidate elimination
Next: Sections 6.6-6.10, Mitchell; Chapters 14-15, Russell and Norvig; Roth
– More Bayesian learning: MDL, BOC, Gibbs, Simple (Naïve) Bayes
– Learning over text
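For reference, the three definitions this topic turns on (standard forms, as in Mitchell, Chapter 6):

\[
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}, \qquad
h_{MAP} = \arg\max_{h \in H} P(D \mid h)\,P(h), \qquad
h_{ML} = \arg\max_{h \in H} P(D \mid h).
\]

With a uniform prior P(h), the MAP hypothesis coincides with the ML hypothesis, which is the "uniform priors" bullet above.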

Slide 21: Topic 10: Bayesian Classifiers: MDL, BOC, and Gibbs
Minimum Description Length (MDL) Revisited
– Bayesian Information Criterion (BIC): justification for Occam's Razor
Bayes Optimal Classifier (BOC)
– Using BOC as a "gold standard" (see the definition below)
Gibbs Classifier
– Ratio bound
Simple (Naïve) Bayes
– Rationale for the assumption; pitfalls
Practical Inference using MDL, BOC, Gibbs, Naïve Bayes
– MCMC methods (Gibbs sampling)
– Glossary: http://www.media.mit.edu/~tpminka/statlearn/glossary/glossary.html
– To learn more: http://bulky.aecom.yu.edu/users/kknuth/bse.html
Next: Sections 6.9-6.10, Mitchell
– More on simple (naïve) Bayes
– Application to learning over text
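The BOC definition and the "ratio bound" referenced above (standard results, as in Mitchell, Chapter 6): the Bayes optimal classification of a new instance is

\[
v_{BOC} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D),
\]

and the Gibbs classifier, which samples a single hypothesis according to P(h | D) and classifies with it, has expected error at most twice that of the Bayes optimal classifier (when target concepts are drawn according to the learner's prior).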

Slide 22: Meta-Summary
Machine Learning Formalisms
– Theory of computation: PAC, mistake bounds
– Statistical, probabilistic: PAC, confidence intervals
Machine Learning Techniques
– Models: version space, decision tree, perceptron, winnow, ANN, BBN
– Algorithms: candidate elimination, ID3, backprop, MLE, Naïve Bayes, K2, EM
Midterm Study Guide
– Know
  Definitions (terminology)
  How to solve problems from Homework 1 (problem set)
  How the algorithms in Homework 2 (machine problem) work
– Practice
  Sample exam problems (handout)
  Example runs of the algorithms in Mitchell and the lecture notes
– Don't panic!

Slide 23: Learning Distributions: Objectives
Learning the Target Distribution
– What is the target distribution?
– Can't use "the" target distribution
  Case in point: suppose the target distribution were P_1 (collected over 20 examples)
  Using Naïve Bayes would not produce an h close to the MAP/ML estimate
– Relaxing CI assumptions: expensive
  MLE becomes intractable; the BOC approximation, highly intractable
  Instead, should make judicious CI assumptions
– As before, the goal is generalization (see the toy example below)
  Given D (e.g., {1011, 1001, 0100})
  Would like to know P(1111) or P(11**) ≡ P(x_1 = 1, x_2 = 1)
Several Variants
– Known or unknown structure
– Training examples may have missing values
– Known structure and no missing values: as easy as training Naïve Bayes
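A toy illustration of the P(11**) query on the example D above, under the stronger assumption that every bit is independent (chosen only to keep the sketch short; the slide does not commit to this model):

```python
# D and the query P(x1 = 1, x2 = 1) are from the slide; the fully factored
# (all-bits-independent) model is an assumption made here for illustration.
D = ["1011", "1001", "0100"]
p1 = sum(d[0] == "1" for d in D) / len(D)  # MLE of P(x1 = 1) = 2/3
p2 = sum(d[1] == "1" for d in D) / len(D)  # MLE of P(x2 = 1) = 1/3
print(p1 * p2)                             # P(11**) under full independence ~= 0.222
```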

Slide 24: Summary Points
Graphical Models of Probability
– Bayesian networks: introduction
  Definition and basic principles
  Conditional independence (causal Markovity) assumptions, tradeoffs
– Inference and learning using Bayesian networks
  Acquiring and applying CPTs
  Searching the space of trees: maximum likelihood
  Examples: Sprinkler, Cancer, Forest-Fire, generic tree learning
CPT Learning: Gradient Algorithm Train-BN
Structure Learning in Trees: MWST Algorithm Learn-Tree-Structure
Reasoning under Uncertainty: Applications and Augmented Models
Some Material From: http://robotics.Stanford.EDU/~koller
Next Lecture: Read Heckerman Tutorial

