Published by Christian Morris. Modified over 9 years ago.
1
First-Order Probabilistic Models for Coreference Resolution Aron Culotta Computer Science Department University of Massachusetts Amherst Joint work with Andrew McCallum [advisor], Michael Wick, Robert Hall
2
Probabilistic First-Order Logic for Coreference Resolution Aron Culotta Computer Science Department University of Massachusetts, Amherst Joint work with Andrew McCallum [advisor], Michael Wick, Robert Hall
3
Previous work: Conditional Random Fields for Coreference
4
A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]
[Figure: the mentions "... Mr Powell ...", "... Powell ..." and "... she ..." are nodes x1, x2, x3; each pair of nodes is joined by a coreference variable y.]
Coreferent(x2, x3)?
5
A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]
[Figure: the mentions "... Mr Powell ...", "... Powell ..." and "... she ..." are nodes x1, x2, x3; each pair of nodes is joined by a coreference variable y.]
6
A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]
[Figure: the mentions "... Mr Powell ...", "... Powell ..." and "... she ..." are nodes x1, x2, x3; the edges y between them carry the scores 45, 30 and 11.]
Pairwise compatibility score learned from training data
7
A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]
[Figure: the mentions "... Mr Powell ...", "... Powell ..." and "... she ..." are nodes x1, x2, x3; the edges y between them carry the scores 45, 30 and 11.]
Pairwise compatibility score learned from training data
Hard transitivity constraints enforced by prediction algorithm
8
Prediction in PW-CRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
[Figure: the mentions "... Mr Powell ...", "... Powell ..." and "... she ..." are nodes x1, x2, x3 with edge scores 45, 30 and 11; the partition shown is annotated "= 64".]
Often approximated with agglomerative clustering
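The agglomerative approximation mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the average-linkage rule, the stopping threshold of 0, and the signed toy scores are all assumptions (the slide shows the edge scores 45, 30 and 11 without signs).

```python
from itertools import combinations

def greedy_agglomerative(mentions, score, threshold=0.0):
    """Repeatedly merge the best-scoring pair of clusters until no merge beats threshold."""
    clusters = [[m] for m in mentions]  # start from singleton clusters
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average pairwise score between two clusters (an assumed linkage rule).
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            s = sum(score(a, b) for a, b in pairs) / len(pairs)
            if best is None or s > best[0]:
                best = (s, i, j)
        if best[0] <= threshold:
            break  # no remaining merge looks compatible
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

# Hypothetical signed compatibility scores for the three mentions on the slide.
toy_scores = {
    frozenset(["Mr Powell", "Powell"]): 45.0,
    frozenset(["Powell", "she"]): -30.0,
    frozenset(["Mr Powell", "she"]): 11.0,
}

def toy_score(a, b):
    return toy_scores[frozenset([a, b])]

result = greedy_agglomerative(["Mr Powell", "Powell", "she"], toy_score)
print(result)
```

Because clusters are only ever merged, the output is always a valid partition, which is one way a prediction algorithm can enforce transitivity by construction.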
9
Parameter Estimation in PW-CRFs
Given labeled documents, generate all pairs of mentions
–Optionally prune distant mention pairs [Soon, Ng, Lim 2001]
Learn binary classifier to predict coreference
Edge weights proportional to classifier output
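The example-generation step above can be sketched in a few lines. The mention representation (position, text, entity id) and the `max_distance` pruning parameter are hypothetical choices made for illustration; the slides do not specify them.

```python
from itertools import combinations

def pairwise_examples(mentions, max_distance=None):
    """Build (mention pair, is_coreferent) training examples from a labeled document.

    mentions: list of (position, text, entity_id) tuples in document order.
    max_distance: if given, skip pairs whose positions are further apart (pruning).
    """
    examples = []
    for (pos_a, text_a, ent_a), (pos_b, text_b, ent_b) in combinations(mentions, 2):
        if max_distance is not None and abs(pos_b - pos_a) > max_distance:
            continue  # prune distant mention pairs
        examples.append(((text_a, text_b), ent_a == ent_b))
    return examples

# Toy labeled document: two mentions of entity e1, one of entity e2.
doc = [(0, "Mr Powell", "e1"), (1, "Powell", "e1"), (5, "she", "e2")]
all_pairs = pairwise_examples(doc)
pruned = pairwise_examples(doc, max_distance=3)
print(all_pairs)
print(pruned)
```

The resulting labeled pairs could be fed to any binary classifier; its output would then serve as the edge score between the two mentions.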
10
Sometimes pairwise comparisons are insufficient
Entities have multiple attributes (name, email, institution, location); need to measure “compatibility” among them.
Having 2 “given names” is common, but not 4.
–e.g. Howard M. Dean / Martin, Dean / Howard Martin
Need to measure size of the clusters of mentions.
Is there a pair of name strings whose edit distance exceeds 0.5?
What is the maximum distance between mentions in the document?
Does an entity contain only pronoun mentions?
We need measures on hypothesized “entities”
We need first-order logic
11
First-Order Logic CRFs for Coreference
12
First-Order Logic CRFs for Coreference (FOL-CRF)
[Figure: the mentions "... Mr Powell ...", "... Powell ..." and "... she ..." are nodes x1, x2, x3; a single variable y spans all three, with score 56.]
Coreferent(x1, x2, x3)?
13
First-Order Logic CRFs for Coreference (FOL-CRF)
[Figure: the mentions "... Mr Powell ...", "... Powell ..." and "... she ..." are nodes x1, x2, x3; a single variable y spans all three, with score 56.]
Coreferent(x1, x2, x3)?
Clusterwise compatibility score learned from training data
Features are arbitrary FOL predicates over a set of mentions
14
First-Order Logic CRFs for Coreference (FOL-CRF)
[Figure: the mentions "... Mr Powell ...", "... Powell ..." and "... she ..." are nodes x1, x2, x3; a single variable y spans all three, with score 56.]
Coreferent(x1, x2, x3)?
As in PW-CRF, prediction can be approximated with agglomerative clustering
15
Learning Parameters of FOL-CRFs
Generate classification examples where input is a set of mentions
Unlike Pairwise CRF, cannot generate all possible examples in training data
16
Learning Parameters of FOL-CRFs
[Figure: the mentions "He", "Powell", "Rice", "She", "he", "Secretary"]
Coreferent(x1, x2) …
Coreferent(x1, x2, x3) …
Coreferent(x1, x2, x3, x4) …
Coreferent(x1, x2, x3, x4, x5) …
Coreferent(x1, x2, x3, x4, x5, x6) …
Combinatorial Explosion!
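A quick count shows why the candidate sets explode. The sketch below compares the number of mention pairs with the number of mention subsets of size two or more; the function names are illustrative only.

```python
from math import comb

def num_pairs(n):
    """Number of pairwise examples over n mentions."""
    return comb(n, 2)

def num_mention_sets(n):
    """Number of subsets of n mentions with at least two members (2**n - n - 1)."""
    return sum(comb(n, k) for k in range(2, n + 1))

for n in (6, 20):
    print(n, num_pairs(n), num_mention_sets(n))
```

For the six mentions on this slide there are 15 pairs but 57 candidate sets; at 20 mentions the gap is 190 versus more than a million.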
17
This space complexity is common in probabilistic first-order logic
[Gaifman 1964; Halpern 1990; Paskin 2002; Poole 2003; Richardson & Domingos 2006]
18
Training in Probabilistic FOL
Parameter estimation; weight learning
Input
–First-order formulae: ∀x S(x) -> T(x); ∀x,y Coreferent(x,y) -> Pronoun(x)
–Labeled data: constants a, b, c with S(a), T(a), S(b), T(b), S(c)
Output
–Weights for each formula: ∀x S(x) -> T(x) [0.67]; ∀x,y Coreferent(x,y) -> Pronoun(x) [-2.3]
19
Training in Probabilistic FOL: Previous Work
Maximum likelihood
–Requires intractable normalization constant
Pseudo-likelihood [Richardson, Domingos 2006]
–Ignores uncertainty of relational information
E-M [Kersting, De Raedt 2001; Koller, Pfeffer 1997]
Sampling [Paskin 2002]
Perceptron [Singla, Domingos 2005]
–Can be inefficient when prediction is expensive
Piecewise training [Sutton, McCallum 2005]
–Train “pieces” of world in isolation
–Performance sensitive to which pieces are chosen
20
Training in Probabilistic FOL
Parameter estimation; weight learning
Most methods require “unrolling” [grounding]
Unrolling has exponential space complexity
–E.g., ∀x,y,z S(x,y,z) -> T(x,y,z)
–For constants [a b c d e f g h] must examine all triples
Sampling can be inefficient due to large sample space.
Proposal: Let prediction errors guide sampling
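To make the cost of unrolling concrete, the sketch below enumerates every grounding of a three-variable formula over the eight constants named on this slide. It only counts the groundings; what a learner does with each one is left out.

```python
from itertools import product

constants = list("abcdefgh")  # the eight constants from the slide

# Every assignment of constants to the variables (x, y, z) is one grounding.
groundings = list(product(constants, repeat=3))
print(len(groundings))  # 8 ** 3 ordered triples
```

The count grows as the number of constants raised to the number of variables in the formula, so adding either constants or variables quickly makes full unrolling impractical.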
21
Error-driven Training
Input
–Observed data O // Input mentions
–True labeling P // True clustering
–Prediction algorithm A // Clustering algorithm
–Initial weights W, prediction Q // Initial clustering
Iterate until convergence
–Q’ ← A(Q, W, O) // Merge clusters
–If Q’ introduces an error: UpdateWeights(Q, Q’, P, O, W)
–Else Q ← Q’
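The loop above can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the two gender-agreement features, the choice of the first correct merge as the better alternative, stopping once the clustering matches the truth, and a simple perceptron-style update standing in for the large-margin update used in the talk.

```python
from itertools import combinations

def features(cluster, gender):
    """Two hypothetical cluster-level features: all genders agree / some disagree."""
    agree = len({gender[m] for m in cluster}) == 1
    return [1.0, 0.0] if agree else [0.0, 1.0]

def score(cluster, W, gender):
    return sum(w * f for w, f in zip(W, features(cluster, gender)))

def propose(Q, W, gender):
    """Prediction algorithm A: the best-scoring merge of two clusters in Q."""
    return max((a | b for a, b in combinations(Q, 2)),
               key=lambda c: score(c, W, gender))

def is_error(merged, P):
    """A merge is an error if no single true cluster contains it."""
    return not any(merged <= true for true in P)

def update_weights(Q, bad, P, W, gender):
    """Perceptron-style stand-in: move W toward a correct merge, away from the bad one."""
    good = next(a | b for a, b in combinations(Q, 2) if not is_error(a | b, P))
    return [w + g - b for w, g, b in
            zip(W, features(good, gender), features(bad, gender))]

def train(mentions, P, gender, max_steps=100):
    Q = [frozenset([m]) for m in mentions]  # initial clustering: singletons
    W = [0.0, 0.0]                          # initial weights
    for _ in range(max_steps):
        if set(Q) == set(P):
            break                           # prediction matches the truth
        merged = propose(Q, W, gender)
        if is_error(merged, P):
            W = update_weights(Q, merged, P, W, gender)
        else:
            Q = [c for c in Q if not c <= merged] + [merged]
    return Q, W

gender = {"Powell": "m", "she": "f", "he": "m"}
P = [frozenset(["Powell", "he"]), frozenset(["she"])]
Q, W = train(["Powell", "she", "he"], P, gender)
print(Q, W)
```

On this toy input the first proposed merge (Powell with she) is an error, which triggers a single weight update; the next proposal is correct and the loop stops. Only the merges the prediction algorithm actually proposes are ever scored.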
22
UpdateWeights(Q, Q’, P, O, W)
Learning to Rank Pairs of Predictions
Using truth P, generate a new Q’’ that is a better modification of Q than Q’.
Update W s.t. Q’’ ← A(Q, W, O)
Update parameters so Q’’ is ranked higher than Q’
23
Ranking vs Classification Training
Instead of training
 [Powell, Mr. Powell, he] --> YES
 [Powell, Mr. Powell, she] --> NO
...Rather...
 [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
In general, higher-ranked example may contain errors
 [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
24
Ranking Parameter Update
In our experiments, we use a large-margin update based on MIRA [Crammer, Singer 2003]
$W_{t+1} = \arg\min_{W} \lVert W_t - W \rVert \quad \text{s.t.} \quad \mathrm{Score}(Q'', W) - \mathrm{Score}(Q', W) \geq 1$
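With a single margin constraint, this minimization has a closed-form solution if the score is linear in the weights. The sketch below assumes Score(Q, W) = W · f(Q) for some feature vector f(Q); that linearity, the function name and the toy feature vectors are assumptions, not details given in the slides.

```python
def mira_update(W, f_better, f_worse):
    """Smallest change to W such that W . (f_better - f_worse) >= 1.

    Assumes a linear score, so the constraint is a half-space and the
    solution is the Euclidean projection of W onto it.
    """
    delta = [b - w for b, w in zip(f_better, f_worse)]
    margin = sum(wi * d for wi, d in zip(W, delta))
    norm_sq = sum(d * d for d in delta)
    if margin >= 1.0 or norm_sq == 0.0:
        return list(W)  # already satisfied, or identical features (no update possible)
    tau = (1.0 - margin) / norm_sq  # step size that makes the margin exactly 1
    return [wi + tau * d for wi, d in zip(W, delta)]

# Toy feature vectors for the better prediction Q'' and the worse prediction Q'.
W_new = mira_update([0.0, 0.0], f_better=[1.0, 0.0], f_worse=[0.0, 1.0])
print(W_new)
```

After the update the better prediction outscores the worse one by exactly the required margin of 1, and weights that already satisfy the constraint are left untouched.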
25
Advantages
Never need to unroll entire network
–Only explore partial solutions the prediction algorithm is likely to produce
Weights tuned for prediction algorithm
Adaptable to different prediction algorithms
–beam search, simulated annealing, etc.
Adaptable to different loss functions
Related:
–Incremental Perceptron [Collins, Roark 2004]
–LaSO [Daume, Marcu 2005]
Extended here for FOL, ranking, max-margin loss.
Rank partial, possibly mistaken predictions.
26
Disadvantages
Difficult to analyze exactly what global objective function is being optimized
Convergence issues
–Average weight updates
27
Experiments
ACE 2004 coreference
–443 newswire documents
Standard feature set [Soon, Ng, Lim 2001; Ng & Cardie 2002]
–Text match, gender, number, context, WordNet
Additional first-order features
–Min/Max/Average/Majority of pairwise features
 E.g., Average string edit distance, Max document distance
–Existential/Universal quantifications of pairwise features
 E.g., There exists gender disagreement
Prediction: Greedy agglomerative clustering
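The first-order features listed above are aggregations of pairwise features over a whole cluster. A minimal sketch follows; the mention tuples, the two pairwise functions and the feature names are hypothetical, chosen only to mirror the examples on the slide.

```python
from itertools import combinations
from statistics import mean

def cluster_features(cluster, pairwise_value, pairwise_test):
    """Aggregate pairwise features over every pair in a cluster of two or more mentions."""
    pairs = list(combinations(cluster, 2))
    values = [pairwise_value(a, b) for a, b in pairs]  # real-valued pairwise feature
    flags = [pairwise_test(a, b) for a, b in pairs]    # boolean pairwise feature
    return {
        "min": min(values),
        "max": max(values),
        "avg": mean(values),
        "majority": sum(flags) > len(flags) / 2,
        "exists": any(flags),   # existential quantification
        "forall": all(flags),   # universal quantification
    }

# Toy mentions: (text, position in document, gender).
cluster = [("Powell", 0, "m"), ("he", 4, "m"), ("she", 10, "f")]

def doc_distance(a, b):
    return abs(a[1] - b[1])

def gender_disagrees(a, b):
    return a[2] != b[2]

feats = cluster_features(cluster, doc_distance, gender_disagrees)
print(feats)
```

Here "max" plays the role of max document distance and "exists" the role of "there exists gender disagreement"; neither can be expressed by a single pairwise comparison.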
28
Experiments
B-Cubed F1 Score on ACE 2004 Noun Coreference
          Sampling + Classification   Error-driven + Ranking
FOL-CRF   69.2                        79.3
PW-CRF    62.4                        72.5
[to our knowledge, best previously reported results ~ 69% (Ng, 2005)]
Better Representation: FOL-CRF outscores PW-CRF in both columns
Better Training: Error-driven + Ranking outscores Sampling + Classification in both rows
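For readers unfamiliar with the metric, the sketch below computes B-Cubed F1 under its usual per-mention definition: precision and recall are computed for each mention from the overlap between its predicted and true clusters, then averaged. It assumes both clusterings cover the same mentions and may differ in detail from the scorer used for the numbers above.

```python
def b_cubed_f1(predicted, truth):
    """B-Cubed F1 for two clusterings (lists of sets) over the same mentions."""
    pred_of = {m: c for c in predicted for m in c}  # mention -> its predicted cluster
    true_of = {m: c for c in truth for m in c}      # mention -> its true cluster
    mentions = list(true_of)
    precision = sum(len(pred_of[m] & true_of[m]) / len(pred_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(pred_of[m] & true_of[m]) / len(true_of[m])
                 for m in mentions) / len(mentions)
    return 2 * precision * recall / (precision + recall)

truth = [{"Mr Powell", "Powell"}, {"she"}]
over_merged = [{"Mr Powell", "Powell", "she"}]
f1 = b_cubed_f1(over_merged, truth)
print(round(f1, 3))
```

Merging everything into one cluster keeps recall perfect but lowers precision, so the over-merged prediction scores below 1 while an exact match scores 1.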
29
Conclusions
Combining logical and probabilistic approaches to AI can improve the state of the art in NLP.
Simple approximations can make these approaches practical for real-world problems.
30
Future Work
Fancier features
–Over entire clusterings
Less greedy inference
–Metropolis-Hastings sampling
Analysis of training
–Which positive/negative examples to select when updating
–Loss function sensitive to local minima of prediction
Analyze theoretical/empirical convergence
31
Thank you