Recitation 4 for Big Data: MapReduce. Jay Gu, Feb 7, 2013.


Homework 1 Review: Logistic Regression – Linearly separable case: how many solutions? Suppose wx = 0 is the decision boundary; then (a * w)x = 0, for a > 1, has the same boundary but a more compact level set. (Figure: level sets of wx = 0 vs. 2wx = 0.)

Homework 1 Review (cont.) – When y = 1 the likelihood of a point is P(y = 1 | x) = 1 / (1 + exp(-wx)); when y = 0 it is 1 - P(y = 1 | x). If the sign of wx agrees with y, scaling w up increases that point's likelihood exponentially fast toward 1; if it disagrees, scaling w up decreases the likelihood exponentially fast toward 0. When the data are linearly separable, every point is classified correctly, so scaling w up always increases the total likelihood. Therefore the supremum is attained only as ||w|| goes to infinity. (Figure: dense vs. sparse level sets for wx = 0 and 2wx = 0.)
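To make this concrete, here is a small numerical sketch (not from the slides; the toy data and scaling factors are made up for illustration) showing that on linearly separable data the logistic log-likelihood keeps increasing as w is scaled up, so the maximum is never attained at any finite w.

```python
import numpy as np

# Toy linearly separable 1-D data: y = 1 iff x > 0 (labels in {0, 1}).
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(w, X, y):
    """Logistic log-likelihood with P(y = 1 | x) = sigmoid(w * x)."""
    p = 1.0 / (1.0 + np.exp(-w * X))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Scale a correctly-classifying w = 1 by larger and larger factors a:
for a in [1, 2, 5, 10, 50]:
    print(a, log_likelihood(a * 1.0, X, y))
# The log-likelihood increases monotonically toward 0 (likelihood -> 1),
# so the supremum is only approached as ||w|| -> infinity.
```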

Outline – Hadoop Word Count Example – High-level picture of EM, Sampling, and Variational Methods

Hadoop Demo
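The demo itself is not reproduced in the transcript. Below is a minimal word count sketch in the style of Hadoop Streaming, assuming Python is available on the cluster nodes; the file names (mapper.py, reducer.py) and paths are illustrative, not taken from the recitation.

```python
#!/usr/bin/env python
# mapper.py: emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py: the shuffle phase delivers input sorted by key, so counts
# for the same word are contiguous and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

A job like this could be launched with the Hadoop Streaming jar, e.g. `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <input dir> -output <output dir>` (the jar name and location depend on the installation).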

Fully Observed Models vs. Latent Variable Models. Fully observed model: only the parameter θ is unknown; the frequentist (maximum likelihood) estimate is easy to compute, and with a conjugate prior the Bayesian posterior is easy too. Latent variable model: both the parameter θ and the latent variable Z are unknown, and the likelihood is not convex and hard to optimize. The strategy is "divide and conquer": first attack the uncertainty at Z, next attack the uncertainty at θ, and repeat...
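The slide only names "conjugate prior"; as one concrete illustration (my example, not from the slides) of why the fully observed Bayesian case is easy, the Beta-Bernoulli pair gives a closed-form posterior update:

```latex
% Beta-Bernoulli conjugacy: prior Beta(a, b), observations x_1,...,x_n in {0, 1}
p(\theta \mid x_{1:n})
  \;\propto\; \Big[\prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i}\Big]\,\theta^{a-1}(1-\theta)^{b-1}
  \;\Longrightarrow\;
  \theta \mid x_{1:n} \sim \mathrm{Beta}\Big(a + \sum_{i=1}^{n} x_i,\; b + n - \sum_{i=1}^{n} x_i\Big)
```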

EM: algorithm. Goal: maximize the data likelihood by repeatedly constructing lower bounds on it. E-step: close the gap between the lower bound and the likelihood at the current θ. M-step: move θ to the maximizer of the lower bound. Repeat.
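The slides give no code; as a hedged illustration of the E-step / M-step alternation, here is a minimal EM sketch for a two-component 1-D Gaussian mixture in numpy (the component count, initialization, and iteration count are arbitrary choices for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data drawn from two Gaussians (for illustration only).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Initial guesses for the mixture weight, means, and variances.
pi1, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: close the gap at the current parameters by setting q(z_i)
    # to the posterior responsibilities p(z_i | x_i, theta).
    r1 = pi1 * normal_pdf(x, mu[1], var[1])
    r0 = (1 - pi1) * normal_pdf(x, mu[0], var[0])
    resp = r1 / (r0 + r1)          # responsibility of component 1

    # M-step: move theta to maximize the expected complete log-likelihood.
    n1 = resp.sum()
    n0 = len(x) - n1
    pi1 = n1 / len(x)
    mu = np.array([np.sum((1 - resp) * x) / n0, np.sum(resp * x) / n1])
    var = np.array([np.sum((1 - resp) * (x - mu[0]) ** 2) / n0,
                    np.sum(resp * (x - mu[1]) ** 2) / n1])

print(pi1, mu, var)   # should end up near 0.5, [-2, 3], [1, 1]
```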

EM treats Z as a hidden variable (Bayesian flavor) but treats θ as a parameter (frequentist flavor). Z carries more uncertainty, because each z_i is inferred from only one data point; θ carries less uncertainty, because it is inferred from all the data. What about k-means? It treats both as point estimates – too simple, not enough fun. Let's go full Bayesian!

Full Bayesian: treat both Z and θ as hidden variables, making them equally uncertain. Goal: learn the posterior p(Z, θ | x). Challenge: the posterior is hard to compute exactly. Two routes: Sampling – approximate the posterior by drawing samples from it. Variational methods – use a nice family of distributions Q to approximate it; find the distribution q in the family that minimizes KL(q || p).
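As a hedged sketch of the sampling route (my example; the recitation does not specify a sampler or model), here is a tiny random-walk Metropolis sampler that approximates a posterior known only up to a normalizing constant. The Normal likelihood, Normal prior, proposal width, and chain length are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_unnormalized_posterior(theta, x):
    """log p(x | theta) + log p(theta), up to an additive constant.
    Example model: x_i ~ Normal(theta, 1), theta ~ Normal(0, 10^2)."""
    return -0.5 * np.sum((x - theta) ** 2) - 0.5 * theta ** 2 / 100.0

x = rng.normal(2.0, 1.0, size=50)    # toy observed data
theta, samples = 0.0, []
for _ in range(20000):
    proposal = theta + rng.normal(0.0, 0.5)           # random-walk proposal
    log_accept = (log_unnormalized_posterior(proposal, x)
                  - log_unnormalized_posterior(theta, x))
    if np.log(rng.uniform()) < log_accept:            # Metropolis accept/reject
        theta = proposal
    samples.append(theta)

# Discard burn-in; the remaining draws approximate p(theta | x).
print(np.mean(samples[5000:]), np.std(samples[5000:]))
```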

Comparison of EM, Sampling, and Variational methods:
– Goal: EM infers the parameter θ; sampling and variational methods approximate the full posterior.
– Objective: EM maximizes a lower bound on the likelihood; sampling has no explicit objective (NA); variational methods minimize KL(q || p), equivalently maximize a lower bound.
– Algorithm complexity: EM low; sampling very high; variational high.
– Issues: EM – the E-step may not be tractable, depending on how you distinguish the latent variables from the parameters. Sampling – slow mixing rate; hard to validate. Variational – the quality of the approximation depends on the family Q, and the updates are complicated to derive.

E-step and Variational Methods

Same framework, but different goal and different challenge. In the E-step, we want to tighten the lower bound at a given parameter θ. Because θ is given, and the posterior p(Z | x, θ) is easy to compute, we can directly set q(Z) = p(Z | x, θ) to exactly close the gap. In the variational method, being fully Bayesian, we want the posterior over both Z and θ; however, since that posterior is intractable, all the effort is spent on minimizing the gap KL(q || p). In both cases, L(q) is a lower bound of the data log-likelihood log p(x).
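For reference, the standard decomposition behind both views (written here in EM notation with θ as a parameter; it is not spelled out on the slide) holds for any distribution q over Z:

```latex
\log p(x \mid \theta)
  = \underbrace{\mathbb{E}_{q(Z)}\!\left[\log \frac{p(x, Z \mid \theta)}{q(Z)}\right]}_{L(q,\,\theta)}
  + \underbrace{\mathrm{KL}\big(q(Z) \,\|\, p(Z \mid x, \theta)\big)}_{\geq\, 0}
```

Since the KL term is nonnegative, L(q, θ) ≤ log p(x | θ). The E-step sets q(Z) = p(Z | x, θ), making the KL term zero and closing the gap exactly; variational inference instead keeps the left-hand side fixed and shrinks the KL term as much as the chosen family Q allows.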