1
Conditional Random Fields: Probabilistic Models
Pusan National University AILAB, Kim, Minho
2
Labeling Sequence Data Problem
X is a random variable over data (observation) sequences; Y is a random variable over label sequences; each Y_i is assumed to range over a finite label alphabet A.
The problem: learn how to assign labels from the finite label set to a data sequence X.
Example: X = (x1, x2, x3) = (Birds, like, flowers) with labels Y = (y1, y2, y3) = (noun, verb, noun).
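To make the setup concrete, a minimal sketch of one training pair for this labeling problem (Python; the representation is an illustrative choice, not prescribed by the slides):

```python
# One training example for sequence labeling: a sentence paired with its tags.
x = ["Birds", "like", "flowers"]   # observation sequence X = (x1, x2, x3)
y = ["noun", "verb", "noun"]       # label sequence Y = (y1, y2, y3)

# A training set D is simply a list of such (x, y) pairs.
D = [(x, y)]
```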
3
Generative Probabilistic Models
Learning problem: choose Θ to maximize the joint likelihood of the training examples: L(Θ) = Σ_i log p_Θ(y_i, x_i).
Decoding: y = argmax_y p*(y|x) = argmax_y p*(y, x) / p(x).
Modeling the joint distribution requires enumerating all possible observation sequences.
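The step of dropping p(x) from the decoding rule is worth spelling out; a short derivation (the denominator is constant with respect to y):

```latex
% p(x) does not depend on y, so it can be dropped from the argmax.
\hat{y} \;=\; \arg\max_{y} p(y \mid x)
        \;=\; \arg\max_{y} \frac{p(y, x)}{p(x)}
        \;=\; \arg\max_{y} p(y, x)
```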
4
Hidden Markov Model
In a Hidden Markov Model (HMM) we do not observe the sequence of states that the model passes through (X), but only some probabilistic function of it (Y). It is thus a Markov model with the addition of emission probabilities: B_{ik} = P(Y_t = k | X_t = i).
5
POS Tagging in HMM
Optimal sequence: Y* = argmax_Y P(Y|X) ≈ argmax_Y Π_i P(y_i | y_{i-1}) P(x_i | y_i).
Contextual probability: P(y_i | y_{i-1}), the probability of a tag given the previous tag.
Lexical probability: P(x_i | y_i), the probability of a word given its tag.
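A minimal sketch of Viterbi decoding for this factorization (Python; the toy transition and emission tables are illustrative assumptions, not values from the slides):

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Find argmax_Y prod_i P(y_i | y_{i-1}) * P(x_i | y_i) by dynamic programming."""
    # best[t] = (log-probability, tag sequence) of the best path ending in tag t
    best = {t: (math.log(trans.get((start, t), 1e-12)) +
                math.log(emit.get((t, words[0]), 1e-12)), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                (prev_score + math.log(trans.get((prev_t, t), 1e-12)) +
                 math.log(emit.get((t, w), 1e-12)), prev_path)
                for prev_t, (prev_score, prev_path) in best.items()
            )
            new_best[t] = (score, path + [t])
        best = new_best
    return max(best.values())[1]

# Toy tables for the running example "Birds like flowers".
tags = ["noun", "verb"]
trans = {("<s>", "noun"): 0.8, ("<s>", "verb"): 0.2,
         ("noun", "verb"): 0.6, ("noun", "noun"): 0.4,
         ("verb", "noun"): 0.7, ("verb", "verb"): 0.3}
emit = {("noun", "Birds"): 0.3, ("noun", "flowers"): 0.3,
        ("verb", "like"): 0.4, ("noun", "like"): 0.05}
print(viterbi(["Birds", "like", "flowers"], tags, trans, emit))  # ['noun', 'verb', 'noun']
```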
6
POS Tagging in HMM: Learning (Maximum Likelihood Estimation)
With fully observed training data the parameters are relative frequencies: P(y_i | y_{i-1}) = count(y_{i-1}, y_i) / count(y_{i-1}) and P(x_i | y_i) = count(y_i, x_i) / count(y_i).
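A sketch of this closed-form relative-frequency estimation from a tagged corpus (Python; the corpus format is an illustrative assumption):

```python
from collections import Counter

def mle_estimate(corpus, start="<s>"):
    """Relative-frequency estimates of transition and emission probabilities."""
    tag_bigrams, tag_word, tag_counts = Counter(), Counter(), Counter()
    for words, tags in corpus:
        prev = start
        tag_counts[start] += 1
        for w, t in zip(words, tags):
            tag_bigrams[(prev, t)] += 1
            tag_word[(t, w)] += 1
            tag_counts[t] += 1
            prev = t
    trans = {(p, t): c / tag_counts[p] for (p, t), c in tag_bigrams.items()}
    emit = {(t, w): c / tag_counts[t] for (t, w), c in tag_word.items()}
    return trans, emit

corpus = [(["Birds", "like", "flowers"], ["noun", "verb", "noun"])]
trans, emit = mle_estimate(corpus)
print(trans[("<s>", "noun")], emit[("noun", "Birds")])  # 1.0 0.5
```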
7
HMM – why not?
Advantages:
– Estimation is very easy; there is a closed-form solution.
– The parameters can be estimated with relatively high confidence from small samples.
But:
– The model represents all possible (x, y) sequences and defines a joint probability over all possible observation and label sequences, which is needless effort when the observations are given.
8
Discriminative Probabilistic Models
“Solve the problem you need to solve”: the traditional approach inappropriately uses a generative joint model to solve a conditional problem in which the observations are given.
To classify we need p(y|x); there is no need to implicitly approximate p(x).
(Diagram: generative vs. discriminative model structures.)
9
Discriminative Models - Estimation
Choose Θ_y to maximize the conditional likelihood: L(Θ_y) = Σ_i log p_{Θ_y}(y_i | x_i).
The estimate usually has no closed-form solution.
Example: the MinMI discriminative approach (2nd week lecture).
10
Maximum Entropy Markov Model
MEMM: a conditional model that represents the probability of reaching a state given an observation and the previous state.
These conditional probabilities are specified by exponential models based on arbitrary observation features.
11
POS Tagging in MEMM
Optimal sequence: Y* = argmax_Y P(Y|X).
The sequence probability factorizes into locally normalized terms, P(Y|X) = Π_i P(y_i | y_{i-1}, x_i), where each local conditional is an exponential (maximum entropy) model over observation features.
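A minimal sketch of a single locally normalized MEMM factor P(y_i | y_{i-1}, x_i) as an exponential model (Python; the feature functions and weights are illustrative assumptions):

```python
import math

def memm_local(prev_tag, word, tags, feats, weights):
    """P(y | prev_tag, word) = exp(sum_k w_k * f_k(y, prev_tag, word)) / Z(prev_tag, word)."""
    scores = {y: math.exp(sum(weights[k] * f(y, prev_tag, word)
                              for k, f in feats.items())) for y in tags}
    z = sum(scores.values())                     # per-state normalization
    return {y: s / z for y, s in scores.items()}

# Illustrative binary features and weights.
feats = {
    "cap_noun":        lambda y, prev, w: 1.0 if w[0].isupper() and y == "noun" else 0.0,
    "after_noun_verb": lambda y, prev, w: 1.0 if prev == "noun" and y == "verb" else 0.0,
}
weights = {"cap_noun": 1.5, "after_noun_verb": 0.8}
print(memm_local("noun", "like", ["noun", "verb"], feats, weights))
```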
12
MEMM: the Label Bias Problem
Because each state's outgoing transition scores are normalized separately, states with few outgoing transitions (low entropy) effectively ignore the observation: all incoming probability mass must be passed on to their successors, which biases decoding toward such states regardless of the evidence.
13
The Label Bias Problem: Solutions
– Determinization of the finite state machine: not always possible, and may lead to combinatorial explosion.
– Start with a fully connected model and let the training procedure find a good structure: however, prior structural knowledge has proven to be valuable in information extraction tasks.
14
Random Field Model: Definition
Let G = (V, E) be a finite graph and let A be a finite alphabet. The configuration space Ω is the set of all labelings of the vertices in V by letters in A. If C ⊆ V and ω ∈ Ω is a configuration, then ω_C denotes the configuration restricted to C. A random field on G is a probability distribution on Ω.
15
Random Field Model: The Problem
Assume that a finite number of features can define a class. The features f_i(ω) are given and fixed. The goal: estimate the weights λ to maximize the likelihood of the training examples.
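For reference, the exponential (Gibbs) form such a random field takes once the features f_i and weights λ_i are fixed; a sketch in standard notation, not copied from the slide:

```latex
% Random field with fixed features f_i and weights \lambda_i (standard Gibbs form).
p_\lambda(\omega) \;=\; \frac{1}{Z_\lambda}\,
    \exp\Big(\sum_i \lambda_i f_i(\omega)\Big),
\qquad
Z_\lambda \;=\; \sum_{\omega' \in \Omega} \exp\Big(\sum_i \lambda_i f_i(\omega')\Big)
```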
16
Conditional Random Field: Definition
X is a random variable over data sequences; Y is a random variable over label sequences; each Y_i is assumed to range over a finite label alphabet A.
Discriminative approach: we construct a conditional model p(y|x) and do not explicitly model the marginal p(x).
17
CRF - Definition
Let G = (V, E) be a finite graph and let A be a finite alphabet. Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if the random variables Y_v, conditioned on X, obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G.
18
CRF on Simple Chain Graph
We will handle the case when G is a simple chain: G = (V = {1, …, m}, E = {(i, i+1) : i = 1, …, m-1}).
(Diagram: graphical structures of the HMM (generative) versus the MEMM and CRF (discriminative).)
19
Fundamental Theorem of Random Fields (Hammersley & Clifford)
Assumption: the structure of G is a tree, of which a simple chain is a special case.
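The conditional distribution this yields on such graphs, in the standard form from Lafferty, McCallum & Pereira (2001), with edge features f_k and vertex features g_k (the features the next slide refers to):

```latex
% Conditional distribution of a CRF on a tree-structured graph G = (V, E).
p_\theta(y \mid x) \;\propto\;
    \exp\!\Big(
        \sum_{e \in E,\, k} \lambda_k\, f_k(e,\, y|_e,\, x)
      + \sum_{v \in V,\, k} \mu_k\, g_k(v,\, y|_v,\, x)
    \Big)
```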
20
CRF – the Learning Problem
Assumption: the features f_k and g_k are given and fixed. For example, a boolean feature g_k is TRUE if the word X_i is upper case and the label Y_i is “noun”.
The learning problem: determine the parameters Θ = (λ_1, λ_2, …; μ_1, μ_2, …) from training data D = {(x^(i), y^(i))} with empirical distribution p̃(x, y).
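A minimal sketch of computing p_Θ(y | x) for a linear chain under these assumptions (Python; the feature functions and weights are illustrative, and the partition function is computed by brute force to keep the sketch short, where a real implementation would use the forward algorithm):

```python
import math
from itertools import product

def log_score(x, y, feats, theta):
    """Unnormalized log-score: sum of weighted features over the chain positions."""
    s = 0.0
    for i, (word, tag) in enumerate(zip(x, y)):
        prev = y[i - 1] if i > 0 else "<s>"
        s += sum(theta[k] * f(prev, tag, word) for k, f in feats.items())
    return s

def log_partition(x, tags, feats, theta):
    """log Z(x) by brute force over all label sequences (fine only for tiny examples)."""
    return math.log(sum(math.exp(log_score(x, y, feats, theta))
                        for y in product(tags, repeat=len(x))))

def log_p(x, y, tags, feats, theta):
    """Conditional log-probability log p_theta(y | x) of a linear-chain CRF."""
    return log_score(x, y, feats, theta) - log_partition(x, tags, feats, theta)

# Illustrative features over the previous tag, the current tag and the current word.
feats = {
    "cap->noun":  lambda prev, tag, w: 1.0 if w[0].isupper() and tag == "noun" else 0.0,
    "noun->verb": lambda prev, tag, w: 1.0 if prev == "noun" and tag == "verb" else 0.0,
}
theta = {"cap->noun": 2.0, "noun->verb": 1.0}
print(log_p(["Birds", "like", "flowers"], ("noun", "verb", "noun"),
            ["noun", "verb"], feats, theta))
```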
21
Maximum Entropy Model
Among the probability distributions that satisfy all of the constraints we have identified, choose the one with maximum entropy.
Reflect the information we do know, but for anything uncertain keep the uncertainty as high as possible, producing as uniform a distribution as possible.
22
Maximum Entropy Principle
Among the probability distributions that satisfy the constraints, build the model so that entropy is maximized.
Strictly honor the information that is known or that we choose to use; for cases not considered or unknown, give equal weight, so the resulting distribution is not skewed toward any particular part. Ref. [1]
23
Maximum Entropy Example
Event space: the POS tags NN, NNS, NNP, NNPS, VBZ, VBD.
Empirical data (counts): NN = 3, NNS = 5, NNP = 11, NNPS = 13, VBZ = 3, VBD = 1 (36 observations in total).
The probability distribution maximizing entropy under the single constraint E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1 (the six tag probabilities sum to 1) is the uniform distribution, 6/36 for each tag. Ref. [3]
24
Maximum Entropy Example (continued)
N* (nouns) occur more frequently than V* (verbs); adding this as a feature function (the four noun tags together have expectation 32/36) gives NN = NNS = NNP = NNPS = 8/36 and VBZ = VBD = 2/36.
Proper nouns occur more frequently than common nouns; adding this constraint as well (NNP and NNPS together have expectation 24/36) gives NN = NNS = 4/36, NNP = NNPS = 12/36, VBZ = VBD = 2/36.
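A numerical check of this example (Python with NumPy/SciPy; the constraint values come from the counts above, everything else is an illustrative sketch):

```python
import numpy as np
from scipy.optimize import minimize

tags = ["NN", "NNS", "NNP", "NNPS", "VBZ", "VBD"]

def neg_entropy(p):
    # Negative entropy; minimizing it maximizes entropy.
    return float(np.sum(p * np.log(p + 1e-12)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},            # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[:4].sum() - 32.0 / 36},  # nouns: 32/36
    {"type": "eq", "fun": lambda p: p[2:4].sum() - 24.0 / 36}, # proper nouns: 24/36
]
p0 = np.full(6, 1.0 / 6)
res = minimize(neg_entropy, p0, bounds=[(0, 1)] * 6,
               constraints=constraints, method="SLSQP")
print(dict(zip(tags, np.round(res.x * 36, 2))))  # expect NN=NNS=4, NNP=NNPS=12, VBZ=VBD=2
```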
25
Components of a Maximum Entropy Model
Feature functions: check whether the specified conditions are satisfied; usually defined as binary (indicator) functions.
Constraints: the information used to compute expectations is restricted to the training corpus.
Parameter estimation algorithm: a method for computing the weights of the feature functions, e.g. GIS or IIS.
26
Computing Probabilities with a Maximum Entropy Model
Define the feature functions; define the constraints; compute the feature weights with the chosen algorithm; compute each candidate probability using those weights; select the largest of the probabilities as the final answer.
27
Feature Functions
Trigger-style functions that indicate whether a specified constraint is satisfied; they decide whether the information we want to use is applicable to the context under consideration. Ref. [1]
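Putting the last few slides together, a minimal sketch of binary feature functions and the resulting exponential-model probabilities p(y | x) = exp(Σ_k λ_k f_k(x, y)) / Z(x) (Python; the features and weights are illustrative assumptions):

```python
import math

# Binary ("trigger") feature functions over a context x (a word) and a candidate tag y.
features = {
    "is_capitalized_and_NNP": lambda x, y: 1.0 if x[0].isupper() and y == "NNP" else 0.0,
    "ends_in_s_and_NNS":      lambda x, y: 1.0 if x.endswith("s") and y == "NNS" else 0.0,
}
weights = {"is_capitalized_and_NNP": 1.2, "ends_in_s_and_NNS": 0.9}

def maxent_prob(x, tags):
    """p(y | x) for every candidate tag y, then pick the most probable one."""
    scores = {y: math.exp(sum(weights[k] * f(x, y) for k, f in features.items()))
              for y in tags}
    z = sum(scores.values())
    probs = {y: s / z for y, s in scores.items()}
    return probs, max(probs, key=probs.get)

print(maxent_prob("Birds", ["NN", "NNS", "NNP", "VBZ"]))
```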
28
Constraints
For each feature, the model's expectation must match the expectation observed in the training corpus: E_p[f_i] = E_p̃[f_i]. Ref. [1]
29
Parameter Estimation
Find p*, the distribution that best reflects the probability information obtained by applying the chosen feature functions to the training corpus, using Maximum Likelihood Estimation. Ref. [1]
30
IIS (Improved Iterative Scaling). Ref. [1]
31
GIS (Generalized Iterative Scaling). Ref. [2]
32
GIS (Generalized Iterative Scaling), continued. Ref. [2]
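A compact sketch of the GIS update λ_k ← λ_k + (1/C) · log(E_p̃[f_k] / E_p[f_k]) for a conditional maximum entropy model (Python; the toy data and features are illustrative, and taking C as the maximum active-feature count without an explicit slack feature is a common simplification of the original algorithm):

```python
import math
from collections import defaultdict

def gis(data, tags, features, iters=100):
    """Generalized Iterative Scaling for a conditional maxent model p(y | x)."""
    # C must bound the total active-feature count of any (x, y) pair
    # (simplification: use the observed maximum instead of adding a slack feature).
    C = max(sum(f(x, y) for f in features.values()) for x, _ in data for y in tags) or 1.0
    lam = defaultdict(float)

    def probs(x):
        scores = {y: math.exp(sum(lam[k] * f(x, y) for k, f in features.items()))
                  for y in tags}
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}

    # Empirical feature expectations over the training data.
    emp = {k: sum(f(x, y) for x, y in data) / len(data) for k, f in features.items()}

    for _ in range(iters):
        # Model feature expectations under the current parameters.
        mod = {k: sum(p * f(x, y) for x, _ in data for y, p in probs(x).items())
                  / len(data) for k, f in features.items()}
        for k in features:
            if emp[k] > 0 and mod[k] > 0:
                lam[k] += math.log(emp[k] / mod[k]) / C
    return dict(lam)

# Toy data: capitalized words tend to be tagged NNP.
data = [("Birds", "NNP"), ("flowers", "NN"), ("Paris", "NNP"), ("like", "NN")]
features = {"cap_NNP": lambda x, y: 1.0 if x[0].isupper() and y == "NNP" else 0.0,
            "low_NN":  lambda x, y: 1.0 if x[0].islower() and y == "NN" else 0.0}
print(gis(data, ["NN", "NNP"], features))
```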
33
Conclusions
Conditional random fields offer a unique combination of properties:
– discriminatively trained models for sequence segmentation and labeling
– combination of arbitrary, overlapping observation features from both the past and the future
– efficient training and decoding based on dynamic programming for a simple chain graph
– parameter estimation guaranteed to find the global optimum
CRFs' main current limitation is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient.