What is Statistical Modeling

Introduction to Statistical Modeling and Machine Learning Lecture 8 Spoken Language Processing Prof. Andrew Rosenberg

What is Statistical Modeling? Statistical modeling is the process of using data to construct a mathematical or algorithmic device that measures the probability of some observation. Training: using a set of observations to learn the parameters of a model, or to construct the decision-making process. Evaluation: determining the probability of a new observation.

What is a Statistical Model? Mathematically, it’s a function that maps observations to probabilities. Observations can be in one dimension: one number (numeric) or one category (nominal). Or they can be in many dimensions: two numbers (height and weight), or a number and a category (height and gender). Each dimension is called a feature.

What is Machine Learning? Automatically identifying patterns in data. Automatically making decisions based on data. Hypothesis: (Data + Learning Algorithm → Behavior) ≥ (Data + Programmer or Expert → Behavior).

Basics of Probabilities. Probabilities fall in the range [0, 1]. Mutually exclusive events are events that cannot occur simultaneously. The probabilities of a set of mutually exclusive events that covers all outcomes must sum to 1.

Joint Probability. We can represent the probability of more than one event occurring at the same time. If two events are independent, the joint probability factorizes: p(X, Y) = p(X) p(Y).

Joint Probability Table. A joint probability function defines the likelihood of two (or more) events occurring together. Let nij be the number of times event i and event j simultaneously occur.

              Orange   Green   Total
Blue box         1        3       4
Red box          6        2       8
Total            7        5      12
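As a minimal illustration (not from the lecture), the joint probabilities can be computed directly from these counts; the dictionary layout below is just one hypothetical way to store them.

```python
# Counts n_ij of observing each (box, ball color) pair, from the table above.
counts = {
    ("blue box", "orange"): 1, ("blue box", "green"): 3,
    ("red box", "orange"): 6,  ("red box", "green"): 2,
}
N = sum(counts.values())  # 12 total draws

# Joint probability p(box, color) = n_ij / N
joint = {event: n / N for event, n in counts.items()}
print(joint[("red box", "orange")])  # 0.5
```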

Marginalization. Consider the probability of X irrespective of Y. The number of instances in column j is the sum of the instances in each cell of that column. Therefore, we can marginalize, or “sum over,” Y: p(X = xj) = Σi p(X = xj, Y = yi).

Conditional Probability. Consider only the instances where X = xj. The fraction of these instances where Y = yi is the conditional probability p(Y = yi | X = xj): “the probability of y given x.”
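Continuing the same illustrative table, here is a short sketch of marginalizing over the box and then conditioning on the ball color (variable names are hypothetical):

```python
# Joint probabilities p(box, color) from the table above (counts divided by 12).
joint = {
    ("blue box", "orange"): 1 / 12, ("blue box", "green"): 3 / 12,
    ("red box", "orange"): 6 / 12,  ("red box", "green"): 2 / 12,
}

# Marginalization: p(color = orange) = sum over boxes of p(box, orange)
p_orange = sum(p for (box, color), p in joint.items() if color == "orange")

# Conditional probability: p(box = blue | color = orange) = p(blue, orange) / p(orange)
p_blue_given_orange = joint[("blue box", "orange")] / p_orange

print(p_orange)             # 7/12 ≈ 0.583
print(p_blue_given_orange)  # 1/7 ≈ 0.143
```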

Relating the Joint, Conditional, and Marginal: p(X, Y) = p(Y | X) p(X) = p(X | Y) p(Y).

Sum and Product Rules. In general, we’ll refer to a distribution over a random variable as p(X) and a distribution evaluated at a particular value as p(x). Sum Rule: p(X) = ΣY p(X, Y). Product Rule: p(X, Y) = p(Y | X) p(X).

Bayes Rule: p(Y | X) = p(X | Y) p(Y) / p(X)

Interpretation of Bayes Rule: posterior ∝ likelihood × prior. Prior: the information we have before the observation. Posterior: the distribution of Y after observing X. Likelihood: the likelihood of observing X given Y.
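A hedged sketch of this reading of Bayes’ rule, applied to the boxes-and-balls setup; the prior over boxes (0.4 red, 0.6 blue) is an assumed, illustrative number, not something stated on the slides:

```python
# Assumed prior over which box is picked (illustrative numbers, not from the slides).
prior = {"red": 0.4, "blue": 0.6}

# Likelihood p(orange | box), from the ball counts in each box:
# the red box holds 6 orange out of 8 balls, the blue box 1 orange out of 4.
likelihood = {"red": 6 / 8, "blue": 1 / 4}

# Evidence p(orange) = sum over boxes of p(orange | box) p(box)
evidence = sum(likelihood[b] * prior[b] for b in prior)

# Posterior p(box | orange) = p(orange | box) p(box) / p(orange)
posterior = {b: likelihood[b] * prior[b] / evidence for b in prior}
print(posterior)  # {'red': 0.666..., 'blue': 0.333...}
```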

Expected Values. The expected value of a random variable is a weighted average: E[x] = Σx p(x) x. Expected values are used to determine what is likely to happen in a random setting. Expectation: the expected value of a function serves as a hypothesis about its typical value. Variance: the variance measures the confidence in that hypothesis.
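A small sketch of expectation and variance for a discrete variable; the distribution below is made up purely for illustration:

```python
# A discrete distribution p(x) over a few values (illustrative numbers).
p = {1: 0.2, 2: 0.5, 3: 0.3}

# Expectation: E[x] = sum over x of p(x) * x, a weighted average.
mean = sum(prob * x for x, prob in p.items())

# Variance: Var[x] = E[(x - E[x])^2], the spread around that expectation.
variance = sum(prob * (x - mean) ** 2 for x, prob in p.items())

print(mean, variance)  # 2.1 0.49
```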

What is a Probability? Frequentists: A probability is the likelihood that an event will happen. It is approximated by the ratio of the number of observed events to the total number of events. Assessment is vital to selecting a model. Point estimates are absolutely fine.

What is a Probability? Bayesians: A probability is a degree of believability of a proposition. Bayesians require that probabilities be prior beliefs conditioned on data. The Bayesian approach “is optimal,” given a good model, a good prior, and a good loss function. Don’t worry so much about assessment. If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence, given some prior.

Boxes and Balls. 2 boxes, one red and one blue. Each contains colored balls.

Boxes and Balls. Given some information about the box (B) and the ball (L), we want to ask questions about the likelihood of different events. What is the probability of selecting a green ball? If I chose an orange ball, what is the probability that I chose from the blue box?

Naïve Bayes Classification. This is a simple case of a simple classification approach: the box is the class, and the colored ball is a feature, or the observation. We can extend this Bayesian classification approach to incorporate more independent features.

Naïve Bayes Classification: choose the class C that maximizes the posterior p(C | x1, …, xn) ∝ p(x1, …, xn | C) p(C).

Naïve Bayes Classification. Assuming independence between the features given the class simplifies the math: p(C | x1, …, xn) ∝ p(C) Πi p(xi | C).
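A minimal Naïve Bayes sketch in this spirit, estimating p(class) and p(feature value | class) by counting and assuming feature independence given the class; the feature names, class names, and toy data are hypothetical:

```python
from collections import Counter, defaultdict

# Toy training data: (features, class) pairs; all values are hypothetical.
data = [
    ({"color": "orange", "size": "small"}, "red box"),
    ({"color": "orange", "size": "large"}, "red box"),
    ({"color": "green",  "size": "small"}, "blue box"),
    ({"color": "green",  "size": "large"}, "blue box"),
    ({"color": "orange", "size": "small"}, "blue box"),
]

# Estimate p(class) and p(feature value | class) by counting.
class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)
for features, label in data:
    for name, value in features.items():
        feature_counts[(label, name)][value] += 1

def predict(features):
    scores = {}
    for label, n_class in class_counts.items():
        # p(class) times the product over features of p(value | class),
        # assuming the features are independent given the class.
        score = n_class / len(data)
        for name, value in features.items():
            score *= feature_counts[(label, name)][value] / n_class
        scores[label] = score
    return max(scores, key=scores.get)  # argmax over classes

print(predict({"color": "orange", "size": "large"}))  # red box
```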

Argmax. Identify the argument (e.g., the parameter value) that maximizes a function. When training a model, the goal is to maximize the likelihood of the data under the model parameters. Since the log function is monotonic, optimizing a log transform of the likelihood is equivalent.
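A sketch of the argmax idea: pick the Bernoulli parameter that maximizes the log-likelihood of some observed coin flips over a grid of candidates. The flips and the grid are illustrative, and a real implementation would use the closed-form estimate rather than a grid search:

```python
import math

# Observed coin flips (illustrative): 7 heads (1) and 3 tails (0).
flips = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]

def log_likelihood(b):
    # log p(flips | b) = sum over flips of log p(flip | b) for a Bernoulli parameter b.
    return sum(math.log(b if x == 1 else 1 - b) for x in flips)

# Argmax over a grid of candidate parameters; because log is monotonic,
# maximizing the log-likelihood also maximizes the likelihood itself.
grid = [i / 100 for i in range(1, 100)]
b_hat = max(grid, key=log_likelihood)
print(b_hat)  # 0.7, the empirical fraction of heads
```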

Bernoulli Distribution. Also known as a binary distribution. Represented by a single parameter b: p(x = 1) = b and p(x = 0) = 1 − b (the slide’s chart shows, for example, b = 0.72 and 1 − b = 0.28). A constrained version of the more general multinomial distribution.

Multinomial Distribution. If a variable x can take 1-of-K states, we represent the distribution of this variable as a multinomial distribution. The probability of x being in state k is μk (the slide’s chart shows example values such as 0.1, 0.5, and 0.2).

Gaussian Distribution. One dimension: N(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)). D dimensions: N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)).
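A one-dimensional version written out directly as a sketch (the evaluation point, mean, and variance are arbitrary illustrative values):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    # N(x | mu, sigma^2) = (1 / sqrt(2 * pi * sigma^2)) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(gaussian_pdf(0.0, mu=0.0, sigma2=1.0))  # ≈ 0.3989, the standard normal at its mean
```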

Gaussian Distribution

Gaussian Distributions We use Gaussian Distributions all over the place.

Supervised vs. Unsupervised Learning. In supervised learning, the desired, target, or class value is known. In unsupervised learning, there are no observations of the target variable. Major tasks: Regression: predict a numerical value from features, i.e., “other information.” Classification: predict a categorical value. Clustering: identify groups of similar entities.

Graphical Example of Regression ?

Graphical Example of Regression

Graphical Example of Regression Different styles of regression learn different functions.

Graphical Example of Classification

Graphical Example of Classification ?

Decision Boundaries

Graphical Example of Clustering

Counting parameters. The “size” of a statistical model is measured by the number of parameters that need to be trained. Bernoulli distribution: one parameter. Multinomial distribution: N − 1 parameters. 1-dimensional Gaussian: 2 parameters, the mean and the variance. N-dimensional Gaussian: an N-dimensional mean vector and an N×N covariance matrix.
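A quick sketch of these counts as a function; note that the slide counts the covariance as N×N entries, while a symmetric covariance matrix has only N(N+1)/2 free parameters:

```python
def n_params(model, n=None):
    # Number of trainable parameters for each model named on the slide.
    if model == "bernoulli":
        return 1              # a single success probability
    if model == "multinomial":
        return n - 1          # n state probabilities that must sum to 1
    if model == "gaussian_1d":
        return 2              # mean and variance
    if model == "gaussian_nd":
        return n + n * n      # n-dimensional mean vector plus an n x n covariance matrix
    raise ValueError(model)

print(n_params("gaussian_nd", n=10))  # 110
```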

Curse of Dimensionality. An increased number of features increases data needs exponentially: if 1 feature can be approximated with 10 observations, 2 features require 10 × 10 = 100. Models should be “small” – few parameters/features – relative to the amount of available data.

Overfitting. Models with more parameters are more general, i.e., they can represent more relationships between variables. More parameters can allow a statistical model to fit the training data too well. “Too well” means the model fails to generalize to unseen data.

Overfitting

Evaluation of Statistical Models Model Likelihood. Calculate p(x; Θ) of new data x based on trained parameters Θ. The model parameters (almost always) maximize the likelihood of the training data. Evaluate the likelihood of unseen – evaluation or testing – data.
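A sketch of this evaluation recipe: fit a 1-dimensional Gaussian’s parameters Θ = (μ, σ²) to training data by maximum likelihood, then score held-out data by its log-likelihood under those parameters. The data values are made up:

```python
import math

# Made-up 1-dimensional data.
train = [4.8, 5.1, 5.0, 4.9, 5.2]
test = [5.0, 4.7, 5.3]

# Train: fit Theta = (mu, sigma^2) by maximum likelihood on the training data.
mu = sum(train) / len(train)
sigma2 = sum((x - mu) ** 2 for x in train) / len(train)

def log_density(x):
    # log N(x | mu, sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)

# Evaluate: the log-likelihood of the unseen (test) data under the trained parameters.
test_log_likelihood = sum(log_density(x) for x in test)
print(mu, sigma2, test_log_likelihood)
```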

Evaluation of Statistical Models. Evaluating classifiers: accuracy is the most common and most intuitive measure of the performance of a classifier.

Contingency Table. Reports the confusion between true and hypothesized classes.

                           True Values
                      Positive          Negative
Hyp     Positive   True Positive     False Positive
Values  Negative   False Negative    True Negative
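A small sketch computing accuracy (plus precision and recall, two other common measures) from the four cells of the table; the counts are illustrative:

```python
# Illustrative counts for the four cells of the contingency table.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)  # fraction of correct hypotheses
precision = tp / (tp + fp)                  # how often a positive hypothesis is right
recall = tp / (tp + fn)                     # how many true positives are recovered

print(accuracy, precision, recall)  # 0.85 0.8 0.888...
```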

Cross Validation. Cross validation is a technique to estimate the generalization performance of a classifier. Identify n “folds” of the available data. Train on n − 1 folds and test on the remaining fold. In the extreme (n = N) this is known as “leave-one-out” cross validation. n-fold cross validation (xval) gives n samples of the performance of the classifier.
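A sketch of n-fold cross-validation over generic train and evaluate callables; both are hypothetical placeholders supplied by the caller, and the dummy usage at the end just treats the training-set mean as the “model”:

```python
def cross_validation(data, n, train, evaluate):
    """Split data into n folds; train on n - 1 folds and test on the held-out fold.

    Returns the n per-fold performance samples. With n equal to the number of
    examples this is leave-one-out cross validation.
    """
    folds = [data[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        held_out = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)
        scores.append(evaluate(model, held_out))
    return scores

# Dummy usage: the "model" is just the training-set mean, scored by squared error.
def fit_mean(xs):
    return sum(xs) / len(xs)

def mean_squared_error(model, fold):
    return sum((x - model) ** 2 for x in fold) / len(fold)

print(cross_validation([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n=3,
                       train=fit_mean, evaluate=mean_squared_error))
```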

Caveats – Black Swans. In the 17th century, all known swans were white; based on the available evidence, it seemed impossible for a swan to be anything other than white. In the 18th century, black swans were discovered in Western Australia. Black swans are rare, sometimes unpredictable events that have extreme impact. Almost all statistical models underestimate the likelihood of unseen events.

Caveats – The Long Tail. Many events follow an exponential distribution. These distributions have a very long “tail,” i.e., a large region with significant total probability mass but low likelihood at any particular point. Often, interesting events occur in the long tail, but it is difficult to accurately model behavior in this region.

Next Class Gaussian Mixture Models Reading: J&M 9.3