Recitation 4 for Big Data: MapReduce. Jay Gu, Feb 7, 2013.


Homework 1 Review: Logistic Regression – Linearly separable case: how many solutions? Suppose wx = 0 is the decision boundary; then (a * w)x = 0, for a > 1, has the same boundary but a more compact level set. (Figure: level sets of wx = 0 vs. 2wx = 0.)

Homework 1 Review (cont.) – When y = 1 the likelihood of a point is P(y = 1 | x) = 1 / (1 + exp(-wx)); when y = 0 it is 1 - P(y = 1 | x). If the sign of wx agrees with y, scaling w up increases that point's likelihood exponentially fast toward 1; if it disagrees, scaling w up decreases the likelihood exponentially fast toward 0. When the data are linearly separable, every point is classified correctly, so scaling w up always increases the total likelihood. Therefore the supremum is attained only as ||w|| goes to infinity. (Figure: dense vs. sparse level sets for wx = 0 and 2wx = 0.)
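To make this concrete, here is a small numerical sketch (not from the slides; the toy data and scaling factors are made up for illustration) showing that on linearly separable data the logistic log-likelihood keeps increasing as w is scaled up, so the maximum is never attained at any finite w.

```python
import numpy as np

# Toy linearly separable 1-D data: y = 1 iff x > 0 (labels in {0, 1}).
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(w, X, y):
    """Logistic log-likelihood with P(y = 1 | x) = sigmoid(w * x)."""
    p = 1.0 / (1.0 + np.exp(-w * X))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Scale a correctly-classifying w = 1 by larger and larger factors a:
for a in [1, 2, 5, 10, 50]:
    print(a, log_likelihood(a * 1.0, X, y))
# The log-likelihood increases monotonically toward 0 (likelihood -> 1),
# so the supremum is only approached as ||w|| -> infinity.
```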

Outline – Hadoop Word Count Example – High-level picture of EM, Sampling, and Variational Methods

Hadoop Demo
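The demo itself is not reproduced in the transcript. Below is a minimal word count sketch in the style of Hadoop Streaming, assuming Python is available on the cluster nodes; the file names (mapper.py, reducer.py) and paths are illustrative, not taken from the recitation.

```python
#!/usr/bin/env python
# mapper.py: emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py: the shuffle phase delivers input sorted by key, so counts
# for the same word are contiguous and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

A job like this could be launched with the Hadoop Streaming jar, e.g. `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <input dir> -output <output dir>` (the jar name and location depend on the installation).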

Fully Observed Models vs. Latent Variable Models. Fully observed model: only the parameter θ is unknown; the frequentist (maximum likelihood) estimate is easy to compute, and with a conjugate prior the Bayesian posterior is easy too. Latent variable model: both the parameter θ and the latent variable Z are unknown, and the likelihood is not convex and hard to optimize. The strategy is "divide and conquer": first attack the uncertainty at Z, next attack the uncertainty at θ, and repeat...
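The slide only names "conjugate prior"; as one concrete illustration (my example, not from the slides) of why the fully observed Bayesian case is easy, the Beta-Bernoulli pair gives a closed-form posterior update:

```latex
% Beta-Bernoulli conjugacy: prior Beta(a, b), observations x_1,...,x_n in {0, 1}
p(\theta \mid x_{1:n})
  \;\propto\; \Big[\prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i}\Big]\,\theta^{a-1}(1-\theta)^{b-1}
  \;\Longrightarrow\;
  \theta \mid x_{1:n} \sim \mathrm{Beta}\Big(a + \sum_{i=1}^{n} x_i,\; b + n - \sum_{i=1}^{n} x_i\Big)
```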

EM: algorithm. Goal: maximize the data likelihood by repeatedly constructing lower bounds on it. E-step: close the gap between the lower bound and the likelihood at the current θ. M-step: move θ to the maximizer of the lower bound. Repeat.
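The slides give no code; as a hedged illustration of the E-step / M-step alternation, here is a minimal EM sketch for a two-component 1-D Gaussian mixture in numpy (the component count, initialization, and iteration count are arbitrary choices for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data drawn from two Gaussians (for illustration only).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Initial guesses for the mixture weight, means, and variances.
pi1, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: close the gap at the current parameters by setting q(z_i)
    # to the posterior responsibilities p(z_i | x_i, theta).
    r1 = pi1 * normal_pdf(x, mu[1], var[1])
    r0 = (1 - pi1) * normal_pdf(x, mu[0], var[0])
    resp = r1 / (r0 + r1)          # responsibility of component 1

    # M-step: move theta to maximize the expected complete log-likelihood.
    n1 = resp.sum()
    n0 = len(x) - n1
    pi1 = n1 / len(x)
    mu = np.array([np.sum((1 - resp) * x) / n0, np.sum(resp * x) / n1])
    var = np.array([np.sum((1 - resp) * (x - mu[0]) ** 2) / n0,
                    np.sum(resp * (x - mu[1]) ** 2) / n1])

print(pi1, mu, var)   # should end up near 0.5, [-2, 3], [1, 1]
```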

EM treats Z as a hidden variable (Bayesian flavor) but treats θ as a parameter (frequentist flavor). Z carries more uncertainty, because each z_i is inferred from only one data point; θ carries less uncertainty, because it is inferred from all the data. What about k-means? It treats both as point estimates – too simple, not enough fun. Let's go full Bayesian!

Full Bayesian: treat both Z and θ as hidden variables, making them equally uncertain. Goal: learn the posterior p(Z, θ | x). Challenge: the posterior is hard to compute exactly. Two routes: Sampling – approximate the posterior by drawing samples from it. Variational methods – use a nice family of distributions Q to approximate it; find the distribution q in the family that minimizes KL(q || p).
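As a hedged sketch of the sampling route (my example; the recitation does not specify a sampler or model), here is a tiny random-walk Metropolis sampler that approximates a posterior known only up to a normalizing constant. The Normal likelihood, Normal prior, proposal width, and chain length are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_unnormalized_posterior(theta, x):
    """log p(x | theta) + log p(theta), up to an additive constant.
    Example model: x_i ~ Normal(theta, 1), theta ~ Normal(0, 10^2)."""
    return -0.5 * np.sum((x - theta) ** 2) - 0.5 * theta ** 2 / 100.0

x = rng.normal(2.0, 1.0, size=50)    # toy observed data
theta, samples = 0.0, []
for _ in range(20000):
    proposal = theta + rng.normal(0.0, 0.5)           # random-walk proposal
    log_accept = (log_unnormalized_posterior(proposal, x)
                  - log_unnormalized_posterior(theta, x))
    if np.log(rng.uniform()) < log_accept:            # Metropolis accept/reject
        theta = proposal
    samples.append(theta)

# Discard burn-in; the remaining draws approximate p(theta | x).
print(np.mean(samples[5000:]), np.std(samples[5000:]))
```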

Comparison of EM, Sampling, and Variational methods:
– Goal: EM infers the parameter θ; sampling and variational methods approximate the full posterior.
– Objective: EM maximizes a lower bound on the likelihood; sampling has no explicit objective (NA); variational methods minimize KL(q || p), equivalently maximize a lower bound.
– Algorithm complexity: EM low; sampling very high; variational high.
– Issues: EM – the E-step may not be tractable, depending on how you distinguish the latent variables from the parameters. Sampling – slow mixing rate; hard to validate. Variational – the quality of the approximation depends on the family Q, and the updates are complicated to derive.

E-step and Variational Methods

Same framework, but different goal and different challenge. In the E-step, we want to tighten the lower bound at a given parameter θ. Because θ is given, and the posterior p(Z | x, θ) is easy to compute, we can directly set q(Z) = p(Z | x, θ) to exactly close the gap. In the variational method, being fully Bayesian, we want the posterior over both Z and θ; however, since that posterior is intractable, all the effort is spent on minimizing the gap KL(q || p). In both cases, L(q) is a lower bound of the data log-likelihood log p(x).
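For reference, the standard decomposition behind both views (written here in EM notation with θ as a parameter; it is not spelled out on the slide) holds for any distribution q over Z:

```latex
\log p(x \mid \theta)
  = \underbrace{\mathbb{E}_{q(Z)}\!\left[\log \frac{p(x, Z \mid \theta)}{q(Z)}\right]}_{L(q,\,\theta)}
  + \underbrace{\mathrm{KL}\big(q(Z) \,\|\, p(Z \mid x, \theta)\big)}_{\geq\, 0}
```

Since the KL term is nonnegative, L(q, θ) ≤ log p(x | θ). The E-step sets q(Z) = p(Z | x, θ), making the KL term zero and closing the gap exactly; variational inference instead keeps the left-hand side fixed and shrinks the KL term as much as the chosen family Q allows.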