N-Gram Model Formulas

Presentation transcript:

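N-Gram Model Formulas
Word sequences: $w_1^n = w_1 \ldots w_n$
Chain rule of probability: $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
Bigram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
N-gram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$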

Estimating Probabilities
N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
Bigram: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
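The relative-frequency bigram estimate can be sketched in a few lines of Python. This is only an illustration: the function name `train_bigram_mle`, the whitespace tokenization, and the toy corpus are assumptions, not part of the slides.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Return prob(word, prev): count(prev word) / count(prev), unsmoothed."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Pad with the start and end symbols and treat them as ordinary words.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])            # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0                                 # context never seen in training
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob

# Hypothetical toy corpus, for illustration only.
corpus = ["i am sam", "sam i am", "i do not like green eggs and ham"]
p = train_bigram_mle(corpus)
print(p("i", "<s>"))   # share of sentences that start with "i"
print(p("am", "i"))    # share of occurrences of "i" that are followed by "am"
```

Because every sentence ends in </s>, the estimates for any seen context sum to one over the possible next words.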

Perplexity
A measure of how well a model “fits” the test data.
Uses the probability that the model assigns to the test corpus.
Normalizes for the number of words in the test corpus and takes the inverse.
Measures the weighted average branching factor in predicting the next word (lower is better).
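Read literally, the description corresponds to the usual definition: the probability of the test corpus raised to the power of minus one over the number of test words. A minimal Python sketch, assuming the per-token probabilities assigned by the model are already available in a list (the function name and input format are illustrative):

```python
import math

def perplexity(token_probs):
    """Perplexity of a test corpus from the probability the model gave each token."""
    n = len(token_probs)
    # Sum logs instead of multiplying probabilities to avoid numeric underflow.
    log_prob = sum(math.log(p) for p in token_probs)
    # Inverse probability, normalized by the number of tokens.
    return math.exp(-log_prob / n)

# A model that always spreads its mass evenly over 10 choices has a
# perplexity of about 10, matching the "branching factor" reading.
print(perplexity([0.1] * 20))
```

A single zero probability makes `math.log` fail, which is one reason unsmoothed models cannot be evaluated this way on text containing unseen N-grams.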

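Laplace (Add-One) Smoothing
“Hallucinate” additional training data in which each possible N-gram occurs exactly once, and adjust the estimates accordingly.
Bigram: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
where V is the vocabulary size, i.e. the number of distinct words that can follow a given context.
This tends to reassign too much mass to unseen events, so it can be adjusted to add $0 < \delta < 1$ instead (normalized by $\delta V$ instead of $V$).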
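A small Python sketch of add-δ smoothing for bigrams, where δ = 1 gives add-one smoothing. The function name is hypothetical, and the sketch assumes the vocabulary is simply every token seen in training, including the <s> and </s> symbols.

```python
from collections import Counter

def train_bigram_add_delta(sentences, delta=1.0):
    """Return (prob, V): a smoothed bigram estimator and the vocabulary size."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    V = len(vocab)  # assumption: vocabulary = all training tokens plus <s> and </s>

    def prob(word, prev):
        # Add delta to every bigram count; the denominator grows by delta * V
        # so the estimates for a context still sum to one over the vocabulary.
        return (bigram_counts[(prev, word)] + delta) / (context_counts[prev] + delta * V)

    return prob, V

p_add_one, V = train_bigram_add_delta(["a b", "a a"], delta=1.0)
print(p_add_one("b", "a"))      # seen bigram
print(p_add_one("<s>", "a"))    # unseen bigram still gets non-zero mass
```

Choosing a smaller `delta` moves less probability mass from seen to unseen bigrams.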

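Interpolation
Linearly combine estimates of N-gram models of increasing order.
Interpolated Trigram Model: $\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
Where: $\sum_i \lambda_i = 1$
Learn proper values for $\lambda_i$ by training to (approximately) maximize the likelihood of an independent development (a.k.a. tuning) corpus.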
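The following Python sketch mixes trigram, bigram, and unigram estimates with weights that sum to one, and picks the weights by a coarse grid search over development-set log-likelihood. The grid search is a stand-in chosen for brevity, not necessarily the training procedure the slides have in mind, and all names are illustrative.

```python
import math
from itertools import product

def interpolate(p_tri, p_bi, p_uni, lambdas):
    """Linear mixture of trigram, bigram, and unigram estimates."""
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

def pick_lambdas(dev_estimates, step=0.1):
    """dev_estimates: one (p_tri, p_bi, p_uni) tuple per development-set token."""
    steps = int(round(1 / step))
    best, best_ll = None, -math.inf
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        lambdas = (i * step, j * step, (steps - i - j) * step)  # weights sum to 1
        probs = [interpolate(*est, lambdas) for est in dev_estimates]
        if min(probs) <= 0.0:
            continue                                  # log-likelihood undefined
        ll = sum(math.log(p) for p in probs)
        if ll > best_ll:
            best, best_ll = lambdas, ll
    return best

# Hypothetical development data: the trigram estimate is zero for the first token.
dev = [(0.0, 0.5, 0.1), (0.8, 0.1, 0.1)]
print(pick_lambdas(dev))
```

Putting all weight on the trigram model is rejected here because it would assign zero probability to the first development token.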

Formal Definition of an HMM
A set of N+2 states $S = \{s_0, s_1, s_2, \ldots, s_N, s_F\}$
  Distinguished start state: $s_0$
  Distinguished final state: $s_F$
A set of M possible observations $V = \{v_1, v_2, \ldots, v_M\}$
A state transition probability distribution $A = \{a_{ij}\}$
An observation probability distribution for each state j: $B = \{b_j(k)\}$
Total parameter set $\lambda = \{A, B\}$

Forward Probabilities Let t(j) be the probability of being in state j after seeing the first t observations (by summing over all initial paths leading to j).

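Computing the Forward Probabilities
Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1)$, for $1 \le j \le N$
Recursion: $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$, for $1 \le j \le N$, $1 < t \le T$
Termination: $P(O \mid \lambda) = \alpha_{T+1}(s_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$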
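The three steps can be sketched in Python as follows. The data layout is an assumption made for this example: states 1..N are list indices, `start[j]` is the transition probability from the start state into state j, `trans[i][j]` is the transition from i to j, `final[i]` is the transition from i into the final state, and `emit[j][o]` is the probability that state j emits observation o.

```python
def forward(obs, start, trans, final, emit):
    """Total probability of an observation sequence, summed over all state paths."""
    n = len(start)
    # Initialization: leave the start state and emit the first observation.
    alpha = [start[j] * emit[j][obs[0]] for j in range(n)]
    # Recursion: sum over every predecessor state, then emit the next observation.
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
    # Termination: move from each state into the final state.
    return sum(alpha[i] * final[i] for i in range(n))

# Hypothetical two-state model; each row of trans plus its final entry sums to 1.
start = [0.6, 0.4]
trans = [[0.6, 0.3], [0.3, 0.6]]
final = [0.1, 0.1]
emit = [{"x": 0.5, "y": 0.5}, {"x": 0.1, "y": 0.9}]
print(forward(["x", "y"], start, trans, final, emit))
```

Only the previous column of forward probabilities is kept, so the cost is proportional to T times N squared rather than to the number of state paths.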

Viterbi Scores
Recursively compute $v_t(j)$, the probability of the most likely subsequence of states that accounts for the first t observations and ends in state $s_j$.
Also record “backpointers” that subsequently allow backtracing the most probable state sequence.
$bt_t(j)$ stores the state at time t-1 that maximizes the probability that the system was in state $s_j$ at time t (given the observed sequence).

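Computing the Viterbi Scores
Initialization: $v_1(j) = a_{0j}\, b_j(o_1)$, for $1 \le j \le N$
Recursion: $v_t(j) = \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$, for $1 \le j \le N$, $1 < t \le T$
Termination: $P^* = v_{T+1}(s_F) = \max_{1 \le i \le N} v_T(i)\, a_{iF}$
Analogous to the Forward algorithm, except that it takes the max instead of the sum.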

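Computing the Viterbi Backpointers
Initialization: $bt_1(j) = s_0$, for $1 \le j \le N$
Recursion: $bt_t(j) = \arg\max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$, for $1 \le j \le N$, $1 < t \le T$
Termination: $q_T^* = bt_{T+1}(s_F) = \arg\max_{1 \le i \le N} v_T(i)\, a_{iF}$, the final state in the most probable state sequence.
Follow the backpointers to the initial state to construct the full sequence.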
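Scores and backpointers can be computed together, as in this Python sketch. It assumes an illustrative layout that is not taken from the slides: states are list indices, `start[j]` is the transition from the start state, `trans[i][j]` the transition from i to j, `final[i]` the transition into the final state, and `emit[j][o]` the emission probability.

```python
def viterbi(obs, start, trans, final, emit):
    """Return (most probable state path, its probability)."""
    n = len(start)
    # Initialization: every first state is reached directly from the start state.
    v = [start[j] * emit[j][obs[0]] for j in range(n)]
    backpointers = []
    # Recursion: keep only the best predecessor of each state (max, not sum).
    for o in obs[1:]:
        prev = v
        bt = [max(range(n), key=lambda i: prev[i] * trans[i][j]) for j in range(n)]
        v = [prev[bt[j]] * trans[bt[j]][j] * emit[j][o] for j in range(n)]
        backpointers.append(bt)
    # Termination: best last state, including the move into the final state.
    last = max(range(n), key=lambda i: v[i] * final[i])
    best_prob = v[last] * final[last]
    # Follow the backpointers from the last state to the first.
    path = [last]
    for bt in reversed(backpointers):
        path.append(bt[path[-1]])
    path.reverse()
    return path, best_prob

# Hypothetical two-state model, for illustration only.
start = [0.6, 0.4]
trans = [[0.6, 0.3], [0.3, 0.6]]
final = [0.1, 0.1]
emit = [{"x": 0.5, "y": 0.5}, {"x": 0.1, "y": 0.9}]
print(viterbi(["y", "y", "y"], start, trans, final, emit))
```

The emission factor is the same for every predecessor of a given state, so it is left out of the arg max and applied once when the score is stored.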

Supervised Parameter Estimation
Estimate state transition probabilities based on tag bigram and unigram statistics in the labeled data.
Estimate the observation probabilities based on tag/word co-occurrence statistics in the labeled data.
Use appropriate smoothing if the training data is sparse.
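A minimal Python sketch of these count-based estimates, without smoothing. It assumes the labeled data is a list of sentences, each a list of (word, tag) pairs, and uses "<s>" and "</s>" as stand-ins for the start and final states. The function name and the tiny data set are hypothetical.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency transition and observation estimates from tagged text."""
    transition_counts = Counter()   # tag bigrams
    context_counts = Counter()      # tags counted as the first element of a bigram
    emission_counts = Counter()     # tag/word co-occurrences
    tag_counts = Counter()          # tag unigrams
    for sentence in tagged_sentences:
        tags = ["<s>"] + [tag for _, tag in sentence] + ["</s>"]
        context_counts.update(tags[:-1])
        transition_counts.update(zip(tags, tags[1:]))
        for word, tag in sentence:
            tag_counts[tag] += 1
            emission_counts[(tag, word)] += 1
    a = {(i, j): c / context_counts[i] for (i, j), c in transition_counts.items()}
    b = {(j, w): c / tag_counts[j] for (j, w), c in emission_counts.items()}
    return a, b

# Hypothetical labeled data, for illustration only.
data = [
    [("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("cat", "NN")],
    [("dogs", "NN")],
]
a, b = estimate_hmm(data)
print(a[("<s>", "DT")], b[("NN", "dog")])
```

Any transition or tag/word pair missing from the dictionaries has an estimate of zero, which is where smoothing becomes necessary on sparse data.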

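Simple Artificial Neuron Model (Linear Threshold Unit)
Model the network as a graph with cells as nodes and synaptic connections as weighted edges from node i to node j, $w_{ji}$.
Model the net input to a cell as: $net_j = \sum_i w_{ji}\, o_i$
Cell output is: $o_j = 0$ if $net_j < T_j$, and $o_j = 1$ if $net_j \ge T_j$ ($T_j$ is the threshold for unit j).
[Figure: example network in which unit 1 is connected to units 2 through 6 by edges weighted $w_{12}, \ldots, w_{16}$; plot of the output $o_j$ against $net_j$, stepping up to 1 at the threshold $T_j$.]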

Perceptron Learning Rule
Update weights by: $w_{ji} = w_{ji} + \eta\,(t_j - o_j)\, o_i$
where η is the “learning rate” and $t_j$ is the teacher-specified output for unit j.
Equivalent to the rules:
  If the output is correct, do nothing.
  If the output is high, lower the weights on active inputs.
  If the output is low, increase the weights on active inputs.
Also adjust the threshold to compensate: $T_j = T_j - \eta\,(t_j - o_j)$

Perceptron Learning Algorithm
Iteratively update weights until convergence.
  Initialize weights to random values
  Until outputs of all training examples are correct
    For each training pair, E, do:
      Compute current output $o_j$ for E given its inputs
      Compare current output to target value, $t_j$, for E
      Update synaptic weights and threshold using learning rule
Each execution of the outer loop is typically called an epoch.
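The loop can be sketched in Python for a single linear threshold unit. Two choices here are assumptions rather than part of the slides: weights start at zero instead of random values so the run is reproducible, and a `max_epochs` cap is added because the loop would never end on data that is not linearly separable. The update used is the standard perceptron rule: move each weight by the learning rate times the output error times the input, and move the threshold the opposite way.

```python
def predict(weights, threshold, inputs):
    """Linear threshold unit: output 1 when the net input reaches the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= threshold else 0

def train_perceptron(examples, n_inputs, eta=0.1, max_epochs=100):
    """examples: list of (inputs, target) pairs with targets in {0, 1}."""
    weights = [0.0] * n_inputs   # assumption: zeros instead of random initial values
    threshold = 0.0
    for _ in range(max_epochs):  # each pass over the data is one epoch
        errors = 0
        for inputs, target in examples:
            output = predict(weights, threshold, inputs)
            if output != target:
                errors += 1
                delta = eta * (target - output)
                # Raise weights on active inputs if the output was too low,
                # lower them if it was too high.
                weights = [w + delta * x for w, x in zip(weights, inputs)]
                # Shift the threshold in the opposite direction to compensate.
                threshold -= delta
        if errors == 0:          # every training example is classified correctly
            break
    return weights, threshold

# Logical AND is linearly separable, so training converges.
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, threshold = train_perceptron(and_examples, n_inputs=2, eta=1.0)
print(weights, threshold)
```

With inputs restricted to 0 and 1, multiplying by the input means only active inputs have their weights changed.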

Context Free Grammars (CFG)
N: a set of non-terminal symbols (or variables)
Σ: a set of terminal symbols (disjoint from N)
R: a set of productions or rules of the form A→β, where A is a non-terminal and β is a string of symbols from (Σ ∪ N)*
S: a designated non-terminal called the start symbol

Estimating Production Probabilities
The set of production rules can be taken directly from the set of rewrites in the treebank.
Parameters can be directly estimated from frequency counts in the treebank.
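A Python sketch of the frequency-count estimate, where the probability of a rule is its count divided by the count of all rewrites of the same left-hand side. The tree encoding is an assumption made for this example: a tree is a tuple `(label, child, ...)` and a leaf is a plain string. The two-tree treebank is made up.

```python
from collections import Counter

def count_productions(tree, rule_counts, lhs_counts):
    """Walk one parse tree and count every rewrite it contains."""
    label, children = tree[0], tree[1:]
    # The right-hand side is the sequence of child labels (or the word for a leaf).
    rhs = tuple(child if isinstance(child, str) else child[0] for child in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for child in children:
        if not isinstance(child, str):
            count_productions(child, rule_counts, lhs_counts)

def estimate_pcfg(treebank):
    """Return {(lhs, rhs): count(lhs -> rhs) / count(lhs)}."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        count_productions(tree, rule_counts, lhs_counts)
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

# Hypothetical treebank, for illustration only.
treebank = [
    ("S", ("NP", "dogs"), ("VP", ("V", "bark"))),
    ("S", ("NP", "cats"), ("VP", ("V", "sleep"))),
]
rule_probs = estimate_pcfg(treebank)
print(rule_probs[("NP", ("dogs",))])
```

The probabilities of all rules that share a left-hand side sum to one by construction.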