Lecture 5: Graphical Models Machine Learning CUNY Graduate Center

Today: Logistic Regression (the maximum entropy formulation), Decision Trees redux (now using information theory), and Graphical Models (representing conditional dependence graphically).

Logistic Regression Optimization: take the gradient of the error function with respect to w.

Optimization: we know the gradient of the error function, but how do we find its optimum? Setting the gradient to zero is nontrivial (there is no closed-form solution), so we use numerical approximation.
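Below is a minimal sketch of one such numerical approach, plain batch gradient descent on the cross-entropy error; the data, learning rate, and iteration count are made up for illustration and are not from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, lr=0.1, n_iters=5000):
    """Batch gradient descent on the logistic-regression (cross-entropy) error."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y = sigmoid(X @ w)      # current predictions
        grad = X.T @ (y - t)    # gradient of the error with respect to w
        w -= lr * grad          # step against the gradient
    return w

# Made-up toy data: one feature plus a bias column, binary labels.
X = np.array([[0.5, 1.0], [1.5, 1.0], [3.0, 1.0], [4.0, 1.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_logistic(X, t))
```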

Entropy: a measure of uncertainty, or a measure of "information". High uncertainty equals high entropy, and rare events are more "informative" than common events.

Examples of Entropy: uniform distributions have higher entropy than peaked ones.
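A quick hypothetical check of that claim, using the usual definition H(p) = -sum over p of p log2 p (the example distributions are invented):

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits (the maximum)
print(entropy([0.70, 0.10, 0.10, 0.10]))  # peaked distribution: about 1.36 bits
```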

Maximum Entropy: Logistic Regression is also known as Maximum Entropy. Entropy is concave, so the constrained maximization is convex and converges. The constraints are expectation constraints that enforce good classification: increase the likelihood of the data while keeping the distribution as even as possible, and include as many useful features as possible.

Maximum Entropy with Constraints (from the Klein and Manning tutorial).

Optimization formulation: if we let the weights represent the likelihood of each feature value, then for each feature i we add one constraint.

Solving the MaxEnt formulation: a convex optimization problem with a concave objective function and linear constraints. Using Lagrange multipliers gives a dual representation that is exactly maximum likelihood estimation for Logistic Regression, with one multiplier per feature i.

Decision Trees: nested 'if'-statements for classification. Each decision tree node contains a feature and a split point. Challenges: determine which feature and split point to use, and determine which branches are worth including at all (pruning).
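As an illustration of the "nested if" view, here is a toy two-level tree; the features and split points are invented for this sketch, not the ones used in the lecture figure.

```python
def classify(height, weight):
    """A tiny decision tree: every node tests one feature against one split point."""
    if height < 66:
        if weight < 140:
            return "f"
        return "m"
    else:
        if weight < 170:
            return "f"
        return "m"

print(classify(64, 130))   # -> "f"
print(classify(70, 185))   # -> "m"
```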

Decision Trees. [figure: a decision tree that first splits on color (blue / brown / green), then on height and weight thresholds such as <66 and <150, with male/female labels at the leaves]

Ranking Branches: last time, we used classification accuracy to measure the value of a branch. Example: a node with 6M / 6F is split on height < 68 into branches with 1M / 5F and 5M / 1F. Accuracy before the branch: 50%; accuracy after the branch: 83.3%; improvement: 33.3%.

Ranking Branches: instead, measure the decrease in entropy of the class distribution following the split. Same example: 6M / 6F split on height < 68 into 1M / 5F and 5M / 1F. Entropy before the branch: H = 1 bit; after the branch each child has H of roughly 0.65 bits, so the split removes about 0.35 bits of uncertainty.

InfoGain Criterion: calculate the decrease in entropy across a split point. This represents the amount of information contained in the split. It is relatively indifferent to the position in the decision tree, and it is more applicable to N-way classification. Accuracy represents the mode of the distribution; entropy can be reduced while leaving the mode unaffected.
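A small sketch of the calculation on the 6M / 6F example from the previous slides (height < 68 splitting the node into 1M / 5F and 5M / 1F); the helper below is an assumed illustration, not code from the course.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    """Entropy decrease from splitting the parent node into the given children."""
    n = sum(parent)
    remainder = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - remainder

# 6M / 6F split on height < 68 into (1M, 5F) and (5M, 1F).
print(info_gain([6, 6], [[1, 5], [5, 1]]))   # roughly 0.35 bits
```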

Graphical Models and Conditional Independence: graphical models are about probabilities more generally, but they are used in classification and clustering. Both Linear Regression and Logistic Regression use probabilistic models. Graphical models allow us to structure and visualize probabilistic models and the relationships between variables.

(Joint) Probability Tables: represent multinomial joint probabilities over D variables as D-dimensional tables. Assuming D binary variables, how big is this table? What if we had multinomials with M entries each?
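For a concrete sense of scale (a trivial computation, with arbitrary example sizes): the full joint table has M^D cells, of which M^D - 1 are free parameters because the entries must sum to one.

```python
def joint_table_cells(D, M=2):
    """Number of cells in a full joint probability table over D variables with M values each."""
    return M ** D

print(joint_table_cells(10))        # 1024 cells for 10 binary variables
print(joint_table_cells(10, M=5))   # 9765625 cells for 10 five-valued variables
```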

Probability Models: what if the variables are independent? If x and y are independent, then p(x, y) = p(x) p(y) and the original distribution can be factored. How big is this table if each variable is binary?

Conditional Independence Independence assumptions are convenient (Naïve Bayes), but rarely true. More often some groups of variables are dependent, but others are independent. Still others are conditionally independent.

Conditional Independence: two variables x and z are conditionally independent given y when p(x, z | y) = p(x | y) p(z | y). E.g. y = flu?, x = achiness?, z = headache?

Factorization of a joint: assume the conditional independence above. How do you factorize p(x, y, z)? It factors as p(x, y, z) = p(x | y) p(z | y) p(y).

Factorization of a joint: what if there is no conditional independence? How do you factorize p(x, y, z)? Only the chain rule applies: p(x, y, z) = p(x) p(y | x) p(z | x, y).

Structure of Graphical Models: graphical models allow us to represent dependence relationships between variables visually. Here, graphical models are directed acyclic graphs (DAGs). Nodes are random variables; edges are dependence relationships; no edge means independent variables. The direction of an edge indicates a parent-child relationship: the parent is the source (trigger), the child is the destination (response).

Example Graphical Models. [figure: two small example graphs over x and y] The parents of a node i are denoted πi. Factorization of the joint in a graphical model: p(x1, ..., xn) = ∏i p(xi | xπi).
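A minimal sketch of that factorization in code: each node stores a conditional table indexed by its parents' values, and the joint is the product of those conditionals. The chain graph and all numbers below are made up for illustration.

```python
# Hypothetical chain x -> y -> z, with each node's parents listed explicitly.
parents = {'x': [], 'y': ['x'], 'z': ['y']}

# Conditional probability tables: cpt[node][parent values][node value].
cpt = {
    'x': {(): {0: 0.6, 1: 0.4}},
    'y': {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.3, 1: 0.7}},
    'z': {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.1, 1: 0.9}},
}

def joint(assignment):
    """p(assignment) = product over nodes i of p(x_i | parents of x_i)."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_values][assignment[node]]
    return p

print(joint({'x': 1, 'y': 1, 'z': 1}))   # 0.4 * 0.7 * 0.9 = 0.252
```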

Basic Graphical Models: independent variables and observations. When we observe a variable (fix its value from data) we color the node grey. Observing a variable allows us to condition on it, e.g. p(x, z | y). Given an observation we can generate pdfs for the other variables. [figure: two diagrams over nodes x, y, z]

Example Graphical Models: X = cloudy? Y = raining? Z = wet ground? A Markov chain: x -> y -> z.

Example Graphical Models: the Markov chain x -> y -> z. Are x and z conditionally independent given y?

Example Graphical Models: Markov chain. Yes: given y, the chain factorization gives p(x, z | y) = p(x | y) p(z | y), so x and z are conditionally independent given y.

One Trigger, Two Responses: X = achiness? Y = flu? Z = fever? Graph: x <- y -> z.

Example Graphical Models: x <- y -> z. Are x and z conditionally independent given y?

Example Graphical Models: x <- y -> z. Yes: given the common cause y, x and z are conditionally independent.

Two Triggers, One Response: X = rain? Y = wet sidewalk? Z = spilled coffee? Graph: x -> y <- z.

Example Graphical Models: x -> y <- z. Are x and z conditionally independent given y?

Example Graphical Models: x -> y <- z. No: observing the shared response y makes the two triggers dependent (the "explaining away" effect), even though x and z are marginally independent.
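One way to convince yourself of that answer is to check it numerically on an invented distribution in which the two triggers are independent causes of the response; all probabilities below are made up for this sketch.

```python
from itertools import product

# x = rain?, z = spilled coffee? (independent causes); y = wet sidewalk?
p_x = {0: 0.8, 1: 0.2}
p_z = {0: 0.9, 1: 0.1}
p_y1 = {(0, 0): 0.05, (0, 1): 0.90, (1, 0): 0.90, (1, 1): 0.99}   # p(y = 1 | x, z)

def joint(x, y, z):
    return p_x[x] * p_z[z] * (p_y1[(x, z)] if y == 1 else 1 - p_y1[(x, z)])

# Compare p(x=1, z=1 | y=1) with p(x=1 | y=1) * p(z=1 | y=1).
p_y = sum(joint(x, 1, z) for x, z in product([0, 1], repeat=2))
p_xz_given_y = joint(1, 1, 1) / p_y
p_x_given_y = sum(joint(1, 1, z) for z in [0, 1]) / p_y
p_z_given_y = sum(joint(x, 1, 1) for x in [0, 1]) / p_y
print(p_xz_given_y, p_x_given_y * p_z_given_y)   # the two differ: not independent given y
```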

Factorization. [figure: an example directed graph over nodes x0 through x5]

How Large are the probability tables?
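As a rough, assumed rule of thumb: a node whose variables all take M values and that has k parents needs M^k rows of M - 1 free entries each, so a factored model can be far smaller than the full joint table.

```python
def cpt_free_parameters(num_parents, M=2):
    """Free parameters in one node's conditional probability table."""
    return (M ** num_parents) * (M - 1)

# Hypothetical binary network: a root, a node with one parent, a node with two parents.
print(cpt_free_parameters(0), cpt_free_parameters(1), cpt_free_parameters(2))   # 1 2 4
```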

Model Parameters as Nodes: treating model parameters as random variables, we can include them in the graphical model. Multivariate Bernoulli: [figure: each observation xi has its own parameter node µi].

Model Parameters as Nodes: treating model parameters as random variables, we can include them in the graphical model. Multinomial: [figure: a single parameter node µ shared by the observations x0, x1, x2].

Naïve Bayes Classification: [figure: class y with children x0, x1, x2] the observed variables xi are independent given the class variable y. The distribution can be optimized using maximum likelihood on each variable separately, and it can easily combine various types of distributions.
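A minimal, assumed sketch of that recipe with binary features, maximum-likelihood counts, and no smoothing; the tiny data set is invented.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate p(class) and each p(feature_i | class) separately from counts."""
    class_counts = Counter(y)
    feature_counts = defaultdict(Counter)            # key: (feature index, class)
    for features, label in zip(X, y):
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    p_class = {c: n / len(y) for c, n in class_counts.items()}
    p_feature = {key: {v: n / sum(counts.values()) for v, n in counts.items()}
                 for key, counts in feature_counts.items()}
    return p_class, p_feature

def predict(features, p_class, p_feature):
    """Pick the class maximizing p(class) * product over i of p(feature_i | class)."""
    scores = {}
    for c, prior in p_class.items():
        score = prior
        for i, value in enumerate(features):
            score *= p_feature[(i, c)].get(value, 0.0)
        scores[c] = score
    return max(scores, key=scores.get)

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = ['flu', 'flu', 'cold', 'cold']
p_class, p_feature = train_naive_bayes(X, y)
print(predict([1, 1, 0], p_class, p_feature))   # -> 'flu'
```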

Graphical Models: a graphical representation of dependency relationships. Directed acyclic graphs, with nodes as random variables and edges defining dependency relations. What can we do with graphical models? Learn parameters to fit data, understand independence relationships between variables, perform inference (marginals and conditionals), and compute likelihoods for classification.

Plate Notation: to indicate a repeated variable, draw a plate around it. [figure: y with children x0, x1, ..., xn, and the equivalent plate notation with xi repeated n times]

Completely Observed Graphical Model: observations for every node. The simplest (least general) graph assumes each variable is independent.

Completely Observed Graphical Model: observations for every node. The second simplest graph assumes complete dependence.

Maximum Likelihood: each node has a conditional probability table, θ. Given the tables, we can construct the pdf. Use maximum likelihood to find the best settings of θ.

Maximum likelihood: maximize the likelihood of the fully observed data under the factored model, the product over data points and nodes of p(xi | xπi, θ).

Count functions Count the number of times something appears in the data

Maximum Likelihood: define the likelihood as a function of the table entries θ, with the constraint that each conditional distribution sums to one.

Maximum Likelihood: use Lagrange multipliers to enforce those constraints; the resulting estimates are just normalized counts.
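In other words, θ(x | parents) = count(x, parents) / count(parents). A hedged sketch for a single node, on an invented fully observed data set:

```python
from collections import Counter

def mle_cpt(data, node, parents):
    """Maximum-likelihood table p(node | parents) from fully observed records."""
    pair_counts, parent_counts = Counter(), Counter()
    for record in data:                                   # each record is a dict of values
        pa = tuple(record[p] for p in parents)
        pair_counts[(pa, record[node])] += 1
        parent_counts[pa] += 1
    return {key: n / parent_counts[key[0]] for key, n in pair_counts.items()}

data = [{'flu': 1, 'fever': 1}, {'flu': 1, 'fever': 0},
        {'flu': 0, 'fever': 0}, {'flu': 0, 'fever': 0}]
print(mle_cpt(data, 'fever', ['flu']))   # p(fever=1 | flu=1) = 0.5, p(fever=0 | flu=0) = 1.0, ...
```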

Maximum A Posteriori Training: Bayesians would never do that; the thetas need a prior.
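One common way to do that is a Dirichlet prior over each table row, which amounts to adding pseudocounts before normalizing (add-one smoothing is the simplest case). A hypothetical helper, not code from the course:

```python
def smoothed_estimate(counts, num_values, alpha=1.0):
    """Estimate a distribution from counts with alpha pseudocounts per value (Dirichlet-style prior)."""
    total = sum(counts.values()) + alpha * num_values
    return {v: (counts.get(v, 0) + alpha) / total for v in range(num_values)}

# Two observations of value 1 and none of value 0: the raw MLE would be {0: 0.0, 1: 1.0}.
print(smoothed_estimate({1: 2}, num_values=2))   # {0: 0.25, 1: 0.75}
```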

Conditional Dependence Test: we can check conditional independence in a graphical model. "Is achiness (x3) independent of the flu (x0) given fever (x1)?" "Is achiness (x3) independent of sinus infections (x2) given fever (x1)?"

D-Separation and Bayes Ball. Intuition: nodes are separated, or blocked, by sets of nodes. E.g. nodes x1 and x2 "block" the path from x0 to x5, so x0 is conditionally independent of x5 given x1 and x2.

Bayes Ball Algorithm: shade the nodes in xc, place a "ball" at each node in xa, and bounce the balls around the graph according to the rules. If no ball reaches xb, then xa and xb are conditionally independent given xc.

Ten rules of Bayes Ball Theorem

Bayes Ball Example

Undirected Graphs: what if we allow undirected graphs? What do they correspond to? Not cause/effect or trigger/response, but general dependence. Example: image pixels, where each pixel is a Bernoulli variable, P(x11, …, x1M, …, xM1, …, xMM), and bright pixels have bright neighbors. No parents, just probabilities. Grid models like this are called Markov Random Fields.

Undirected Graphs: [figure: a small undirected graph over nodes A, B, C, D] undirected separability is easy. To check conditional independence of A and B given C, check graph reachability of A and B without going through nodes in C.
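A short sketch of that reachability check, with the graph given as an adjacency list; the example graph below is invented, since the slide's figure is not reproduced here.

```python
from collections import deque

def separated(adj, a, b, blocked):
    """True if every path from a to b in the undirected graph passes through a blocked node."""
    seen, frontier = {a}, deque([a])
    while frontier:
        node = frontier.popleft()
        for neighbor in adj[node]:
            if neighbor in blocked or neighbor in seen:
                continue
            if neighbor == b:
                return False
            seen.add(neighbor)
            frontier.append(neighbor)
    return True

adj = {'A': ['C'], 'B': ['C', 'D'], 'C': ['A', 'B'], 'D': ['B']}
print(separated(adj, 'A', 'B', {'C'}))   # True: the only A-B path goes through C
print(separated(adj, 'A', 'B', set()))   # False: A and B are connected when nothing is blocked
```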

Next Time: more fun with Graphical Models. Read Chapter 8.1 and 8.2.