Conditional Random Fields model

Presentation transcript:

Conditional Random Fields model QingSong.Guo

Recent work: XML keyword query refinement. Two ways:
- Focus on the XML tree structure
- Focus on the keywords

XML tree: In a keyword query, many nodes in the XML tree may match the keywords. Try to find semantically related keywords so as to avoid returning irrelevant XML nodes to users. Common notions: LCA (lowest common ancestor) and SLCA (smallest lowest common ancestor).

Keyword ambiguity: what does the query "Mary Author Title Year" mean?
1. Find the title and year of publications of which Mary is an author.
2. Find additional authors of publications of which Mary is an author.
3. Find the year and authors of publications with titles similar to Mary's publications.

Keywords: refinement operations and examples
- Spelling error correction: machin -> machine
- Word splitting: universtyof ruc -> university of ruc
- Word merging: on line -> online
- Phrase segmentation: mark a word's position within a phrase
- Word stemming: do -> doing
- Acronym expansion: RUC -> Renmin University of China
Question: how can this be done automatically?

Labeling Sequence Data
X: random variables over data sequences
Y: random variables over label sequences
A: the set of possible part-of-speech tags
Problem: how do we obtain the label sequence y from the data sequence x?
Example: the data sequence X = (x1, x2, x3) = ("Thinking", "is", "being") is paired with a label sequence Y = (y1, y2, y3), e.g. y1 = noun, y2 = verb, ...

Hidden Markov models (HMMs)
Assign a joint probability to paired observation and label sequences.
The parameters are typically trained to maximize the joint likelihood of the training examples.

Markov model
A Markov model has a state space S and a sequence of random variables taking values in S.
The Markov property means that, given the present state, future states are independent of the past states (see the formula below).
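A standard statement of the Markov property (the slide's formula was an image; this is the usual textbook form):

$$P(X_{t+1} = s \mid X_1, X_2, \dots, X_t) = P(X_{t+1} = s \mid X_t)$$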

HMM
In an HMM the state is not directly visible, but variables influenced by the state are visible; the hidden states provide the labeling of the data sequence.

Example of HMM: Suppose you have a friend who lives far away and calls you every day to tell you what he did that day. Your friend is only interested in three activities: walking in the park, shopping, and cleaning his room. What he chooses to do depends solely on the weather. You do not know the weather where he lives, but you know the general trends. Based on what he tells you he did each day, you want to guess the weather there. You assume the weather behaves like a Markov chain with two states, "rain" and "sunny", but you cannot observe them directly, i.e. they are hidden from you. Each day, your friend performs one of the activities "walk", "shop", or "clean" with a certain probability. Because your friend tells you his activities, these activities are your observations. This whole system is a hidden Markov model (HMM).

HMM: three classical problems
1. Evaluation: given the model λ = (A, B, π), how do we compute p(O | λ) for an observation sequence O?
2. Decoding: how do we select the most likely state sequence for a given observation sequence?
3. Learning: how do we estimate the parameters to maximize p(O | λ)?

HMM workflow (diagram): get data, create model, training / parameter estimation, model establishment, application.

HMM Definition: a quintuple (S, K, A, B, π)
S = {q1, ..., qn}: set of states
K = {v1, ..., vm}: set of observation symbols
A = {aij}, aij = p(Xt+1 = qj | Xt = qi): state transition probabilities
B = {bik}, bik = p(Ot = vk | Xt = qi): output (emission) probabilities
π = {πi}, πi = p(X1 = qi): initial state probabilities
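A minimal sketch of this quintuple in Python, using the weather/activity example above; the transition, emission, and initial probabilities are illustrative assumptions, not values from the slides:

```python
import numpy as np

states = ["rain", "sunny"]                 # S
symbols = ["walk", "shop", "clean"]        # K

A = np.array([[0.7, 0.3],                  # a_ij = p(X_{t+1} = q_j | X_t = q_i)
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],             # b_ik = p(O_t = v_k | X_t = q_i)
              [0.6, 0.3, 0.1]])
pi = np.array([0.6, 0.4])                  # pi_i = p(X_1 = q_i)

def forward(obs):
    """Forward algorithm: compute p(O | lambda) for a sequence of observation indices."""
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(o_{t+1})
    return alpha.sum()

print(forward([0, 1, 2]))                  # p(walk, shop, clean | lambda)
```

The forward function answers the first of the three classical problems (evaluation) for this toy model.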

HMM

Generative models: difficulties and disadvantages
- Need to enumerate all possible observation sequences
- Not practical to represent multiple interacting features or long-range dependencies of the observations
- Very strict independence assumptions on the observations

Discriminative models, used in machine learning, model the dependence of an unobserved variable y on an observed variable x: they model the conditional probability distribution P(y | x), which can be used to predict y from x.

Maximum Entropy Markov Models (MEMMs)
A conditional model that represents the probability of reaching a state given an observation and the previous state.
Given a training set X with label sequences Y: train parameters θ that maximize P(Y | X, θ).
For a new data sequence x, the predicted label sequence y maximizes P(y | x, θ).
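The per-state distribution of an MEMM has the standard exponential form (this formula comes from the MEMM literature, not from the slide itself):

$$P(y_t \mid y_{t-1}, x_t) = \frac{1}{Z(x_t, y_{t-1})} \exp\Big( \sum_k \lambda_k f_k(x_t, y_t) \Big)$$

Each previous state thus has its own locally normalized distribution over next states, which is the root of the label bias problem discussed next.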

MEMMs
- Have all the advantages of conditional models
- Subject to the label bias problem: a bias toward states with fewer outgoing transitions

Label Bias Problem
P(1,2 | ro) = P(2 | 1, ro) P(1 | ro) = P(2 | 1, o) P(1 | r)
P(1,2 | ri) = P(2 | 1, ri) P(1 | ri) = P(2 | 1, i) P(1 | r)
In the training data, label value 2 is the only label value observed after label value 1, therefore P(2 | 1) = 1 and so P(2 | 1, x) = 1 for all x.
Hence P(1,2 | ro) = P(1,2 | ri).
However, we expect P(1,2 | ri) to be greater than P(1,2 | ro); per-state normalization does not allow the model to express this preference.

Random Field

Conditional Random Fields (CRFs)
- Have all the advantages of MEMMs without the label bias problem
- An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
- A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
- Undirected acyclic graph
- Allow some transitions to "vote" more strongly than others, depending on the corresponding observations

Definition of CRFs
X: random variable over data sequences to be labeled
Y: random variable over corresponding label sequences
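The formal definition on this slide was an image; the standard definition from Lafferty et al. (2001) is: let G = (V, E) be a graph such that Y = (Yv), v in V. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Yv obey the Markov property with respect to the graph:

$$p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$$

where w ~ v means that w and v are neighbors in G.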

Example of CRFs: here we suppose the graph G is a chain.

Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is given by the fundamental theorem of random fields (see the formula below), where:
- v is a vertex from the vertex set V (the set of label random variables)
- e is an edge from the edge set E over V
- k indexes the features
- λk and μk are the parameters to be estimated
- y|e is the set of components of y defined by edge e
- y|v is the set of components of y defined by vertex v
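The formula itself was an image on the slide; the standard form from Lafferty et al. (2001) reads:

$$p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\,k} \mu_k g_k(v, y|_v, x) \Big)$$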

Conditional Distribution
CRFs use an observation-dependent normalization Z(x) for the conditional distributions, where Z(x) is a normalization factor that depends on the data sequence x (see the chain-structured form below).
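For the chain-structured case, the normalized distribution (again an image on the slide; this is the standard chain-CRF form) is:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i}\sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i) + \sum_{i}\sum_{k} \mu_k g_k(y_i, x, i) \Big)$$

where Z(x) is the sum of the same exponential over all possible label sequences y.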

Feature functions
Transition feature function: f(yi-1, yi, x, i) = 1 if yi-1 = IN and yi = NNP, and 0 otherwise.
State feature function: g(yi, x, i) = 1 if xi is the word "september", and 0 otherwise.
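A small Python sketch of these two indicator features (hypothetical function names; IN and NNP are the Penn Treebank tags for preposition and proper noun):

```python
def transition_feature(y_prev, y_cur, x, i):
    """1 if the previous tag is IN and the current tag is NNP, else 0."""
    return 1.0 if (y_prev == "IN" and y_cur == "NNP") else 0.0

def state_feature(y_cur, x, i):
    """1 if the current word is 'september', else 0."""
    return 1.0 if x[i].lower() == "september" else 0.0
```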

Maximum Entropy Principle
The form of the CRF given above is heavily motivated by the principle of maximum entropy (cf. Shannon, "A Mathematical Theory of Communication").
The only probability distribution that can justifiably be constructed from finite training data is the one that has maximum entropy, subject to a set of constraints representing the information available.

Maximum Entropy Principle
If the information within the training data is represented using the set of feature functions described previously, the maximum entropy distribution is the one that is as uniform as possible while ensuring that the expectation of each feature function with respect to the empirical distribution of the training data equals the expected value of that feature function with respect to the model distribution.

Learning for CRFs
Assumption: the features fk and gk are given and fixed.
The learning problem: determine the parameters λ = (λ1, λ2, ...; μ1, μ2, ...) that maximize the log-likelihood of the training data D = {(x(k), y(k))} with empirical distribution p~(x, y).
We simplify the notation by collecting features across positions (see below); this allows the probability of a label sequence y given an observation sequence x to be written compactly.
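The simplified notation on the slide was an image; following Wallach's CRF introduction, a common way to write it is to fold the state features into the transition features and sum over positions:

$$F_j(y, x) = \sum_{i} f_j(y_{i-1}, y_i, x, i), \qquad p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_{j} \lambda_j F_j(y, x) \Big)$$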

CRF Parameter Estimation
For a CRF, the log-likelihood of the training data is the sum of the per-example log conditional probabilities; differentiating the log-likelihood with respect to the parameters gives the gradient shown below.
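The two formulas were images on the slide; in the notation above they are the standard results:

$$\mathcal{L}(\lambda) = \sum_{k} \log p\big(y^{(k)} \mid x^{(k)}, \lambda\big), \qquad \frac{\partial \mathcal{L}}{\partial \lambda_j} = \sum_{k} \Big( F_j\big(y^{(k)}, x^{(k)}\big) - E_{p(y \mid x^{(k)}, \lambda)}\big[ F_j\big(y, x^{(k)}\big) \big] \Big)$$

i.e. the gradient is the empirical feature count minus the feature expectation under the model.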

CRF Parameter Estimation
There is no analytical solution for the parameters that maximize the log-likelihood: setting the derivative to zero and solving for λ does not yield a closed-form solution.
Iterative techniques are therefore adopted: iterative scaling or gradient descent.
The core of these techniques lies in computing the expectation of each feature function with respect to the CRF model distribution.

CRF Probability as Matrix Computations
Augment the label sequence with a start and an end state. For each position we define a matrix over the augmented label set, giving n+1 matrices in total (see below).
The probability of a label sequence y given an observation sequence x can then be written as the product of the appropriate entries of these n+1 matrices.
The normalization factor can be computed from the same matrix product (the slide attributes this to results from graph theory).
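The matrix definitions were images on the slide; following Lafferty et al. (2001), for each position i = 1, ..., n+1 one defines a matrix over the augmented label set (with y0 = start and yn+1 = stop):

$$M_i(y', y \mid x) = \exp\Big( \sum_k \lambda_k f_k(e_i, y', y, x) + \sum_k \mu_k g_k(v_i, y, x) \Big)$$

$$p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x), \qquad Z_\theta(x) = \Big[ M_1(x)\, M_2(x) \cdots M_{n+1}(x) \Big]_{\text{start},\,\text{stop}}$$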

Dynamic Programming
The expectation of each feature function with respect to the CRF model distribution, for every observation sequence x(k) in the training data, is a sum over all label sequences; rewriting the right-hand side of that sum in terms of pairwise marginals makes it computable by dynamic programming (see below).
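The formulas were images on the slide; the standard rewriting (as in Wallach's introduction) is:

$$E_{p(y \mid x^{(k)}, \lambda)}\big[ F_j(y, x^{(k)}) \big] = \sum_{y} p\big(y \mid x^{(k)}, \lambda\big)\, F_j\big(y, x^{(k)}\big) = \sum_{i} \sum_{y', y} p\big(Y_{i-1} = y', Y_i = y \mid x^{(k)}\big)\, f_j\big(y', y, x^{(k)}, i\big)$$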

Dynamic Programming
Define forward and backward vectors (see below); the probability of Yi-1 and Yi taking on labels y' and y given the observation sequence x(k) can then be computed from them.
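In terms of the matrices Mi, the forward and backward vectors and the pairwise marginal (all images on the slide; these are the standard definitions) are:

$$\alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \text{start} \\ 0 & \text{otherwise} \end{cases}, \qquad \alpha_i(x) = \alpha_{i-1}(x)\, M_i(x), \qquad \beta_i(x)^{\top} = M_{i+1}(x)\, \beta_{i+1}(x)^{\top}$$

$$p\big(Y_{i-1} = y', Y_i = y \mid x^{(k)}\big) = \frac{\alpha_{i-1}(y' \mid x^{(k)})\, M_i(y', y \mid x^{(k)})\, \beta_i(y \mid x^{(k)})}{Z(x^{(k)})}$$

with the base case that the backward vector at position n+1 puts all its mass on the stop label.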

Making Predictions
Once a CRF model has been trained, there are (at least) two ways to do inference for a test sequence:
- Predict the entire sequence y with the highest probability, using the Viterbi algorithm (MAP decoding)
- Make predictions for each individual yt, using the forward-backward algorithm (maximum posterior marginals, MPM)
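A minimal Viterbi (MAP) decoding sketch in Python; the per-position label scores and the transition score matrix stand in for the log potentials of a chain CRF, and the numbers are purely illustrative assumptions:

```python
import numpy as np

def viterbi(unary, trans):
    """unary: (n, L) per-position label scores; trans: (L, L) transition scores.
    Returns the highest-scoring label sequence as a list of label indices."""
    n, L = unary.shape
    score = unary[0].copy()                     # best score of paths ending in each label at position 0
    backptr = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        # cand[prev, cur] = score of extending the best path ending in prev with label cur
        cand = score[:, None] + trans + unary[i][None, :]
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # trace back the best path from the last position
    y = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(backptr[i, y[-1]]))
    return list(reversed(y))

# tiny illustrative example: 3 positions, 2 labels
unary = np.array([[2.0, 0.5], [0.1, 1.5], [1.0, 1.2]])
trans = np.array([[0.3, -0.2], [0.0, 0.4]])
print(viterbi(unary, trans))
```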

POS tagging Experiments

POS tagging Experiments
Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging: each word in a given input sentence must be labeled with one of 45 syntactic tags.
A small set of orthographic features was added: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
oov = out-of-vocabulary (not observed in the training set)

CRF for XML Trees
XML documents are represented by a DOM tree. Only element nodes, attribute nodes, and text nodes are considered. Attribute nodes are unordered; element nodes and text nodes are ordered.

CRF for XML Trees
With every set of nodes, associate a random field X of observables Xn and a random field Y of output variables Yn, where n denotes a node position. The Xn are the symbols of the input tree, and the Yn are the labels of its labeling.
Triangle feature function (see below).
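The triangle feature function formula was an image on the slide; in CRFs for XML trees (e.g. the XCRF model of Jousse et al.), a feature typically couples the label of a node with the labels of two consecutive children, so a plausible reconstruction, not necessarily the slide's exact notation, is:

$$f_k\big(y_n,\; y_{n \cdot i},\; y_{n \cdot (i+1)},\; x,\; n\big)$$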

CRF for XML Trees (diagram): a DOM tree with element nodes such as table, tr, td and @class, and text content such as account, client, product, id, name, address, price, and number; each node position n is paired with an observation variable Xn and an output variable Yn (Y0, Y1, Y2, Y1.1, Y1.2, Y2.1, Y2.2, Y2.3, Y2.4).

CRF-Query Refinement
Introduce refinement operations and incorporate them into the CRF model. Let o denote a sequence of refinement operations, o = o1, o2, ..., on. The conditional model P(y, o | x) is called the CRF-QR model.

Operations
Task: Operation
Spelling correction: deletion / insertion / substitution / transposition
Word splitting: splitting
Word merging: merging
Phrase segmentation: begin / middle / end / out
Word stemming: +s / -s / +ed / -ed / +ing / -ing
Acronym expansion: expansion

Graphical representation

CRF-Query Refinement

Next work: continue along the two directions outlined earlier (XML tree structure and keyword refinement). Thanks!