
1 CRFs and Joint Inference in NLP Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta, Xuerui Wang, Ben Wellner, Fuchun Peng, Michael Hay.

2 From Text to Actionable Knowledge (pipeline figure): Spider → Document collection → IE (Segment, Classify, Associate, Cluster, Filter) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support).

3 The same pipeline (Spider → Document collection → IE → Database → Data Mining → Actionable knowledge), now with "Uncertainty Info" passed forward from IE and "Emerging Patterns" fed back from Data Mining: Joint Inference.

4 An HLT Pipeline (stacked stages, bottom to top: ASR, MT, Parsing, NER, Relations, Coreference, TDT / Summarization, SNA / KDD / Events). Errors cascade & accumulate.

5 The same HLT pipeline (ASR, MT, Parsing, NER, Relations, Coreference, TDT / Summarization, SNA / KDD), handled instead with unified, joint inference.

6 The pipeline again (Spider → Document collection → IE → Database → Data Mining → Actionable knowledge), with "Uncertainty Info" and "Emerging Patterns" exchanged between IE and Data Mining: Joint Inference.

7 The same pipeline, with the Database replaced by a unified Probabilistic Model between IE and Data Mining. Solution: Conditional Random Fields [Lafferty, McCallum, Pereira]; Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]: discriminatively-trained undirected graphical models. Complex inference and learning: just what we researchers like to sink our teeth into! Unified Model.

8 (Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]. An undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence:
p(y|x) = (1/Z(x)) ∏_t exp( Σ_k λ_k f_k(y_{t-1}, y_t, x, t) )
where Z(x) sums the same product over all output sequences. (Figure: finite-state and graphical-model views, with FSM states / output sequence y_t over observations / input sequence x_t; example input "said Jones a Microsoft VP" labeled OTHER PERSON OTHER ORG TITLE.) Wide-spread interest, positive experimental results in many applications: noun phrase and named entity tagging [HLT'03], [CoNLL'03]; protein structure prediction [ICML'04]; IE from bioinformatics text [Bioinformatics '04]; Asian word segmentation [COLING'04], [ACL'04]; IE from research papers [HLT'04]; object classification in images [CVPR '04].
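To make the definition concrete, here is a minimal sketch, not the authors' implementation, of how a linear-chain CRF evaluates log p(y|x) once the feature-weighted log-potentials have been computed; the array names and shapes are assumptions for illustration.

```python
# Minimal linear-chain CRF scoring sketch (illustrative; not the original implementation).
# Assumes the log-potentials have already been computed from features and weights:
#   unary[t, y]     = sum_k lambda_k * f_k(y, x, t)   (per-position scores)
#   pairwise[a, b]  = score for transitioning from label a to label b
import numpy as np

def sequence_score(unary, pairwise, labels):
    """Unnormalized log-score of one output sequence y given the input x."""
    score = unary[0, labels[0]]
    for t in range(1, len(labels)):
        score += pairwise[labels[t - 1], labels[t]] + unary[t, labels[t]]
    return score

def log_partition(unary, pairwise):
    """log Z(x), computed with the forward algorithm in log space."""
    alpha = unary[0].copy()                                   # log alpha at t = 0
    for t in range(1, unary.shape[0]):
        scores = alpha[:, None] + pairwise + unary[t][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))    # logsumexp over previous label
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def log_prob(unary, pairwise, labels):
    """log p(y | x) = score(y, x) - log Z(x)."""
    return sequence_score(unary, pairwise, labels) - log_partition(unary, pairwise)
```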

9 Outline
– Motivating joint inference for NLP
– Brief introduction to Conditional Random Fields
– Joint inference: motivation and examples
  – Joint labeling of cascaded sequences (belief propagation)
  – Joint labeling of distant entities (BP by tree reparameterization)
  – Joint co-reference resolution (graph partitioning)
  – Joint segmentation and co-ref (sparse BP)
  – Joint extraction and data mining (iterative)
– Topical N-gram models

10 Jointly labeling cascaded sequences: Factorial CRFs (layers: part-of-speech, noun-phrase boundaries, named-entity tag, over English words) [Sutton, Rohanimanesh, McCallum, ICML 2004]

11 Jointly labeling cascaded sequences: Factorial CRFs (layers: part-of-speech, noun-phrase boundaries, named-entity tag, over English words) [Sutton, Rohanimanesh, McCallum, ICML 2004]

12 Jointly labeling cascaded sequences: Factorial CRFs (part-of-speech, noun-phrase boundaries, named-entity tag over English words) [Sutton, Rohanimanesh, McCallum, ICML 2004]. But errors cascade: every stage must be near perfect for the pipeline to do well.

13 Jointly labeling cascaded sequences: Factorial CRFs (part-of-speech, noun-phrase boundaries, named-entity tag over English words) [Sutton, Rohanimanesh, McCallum, ICML 2004]. Joint prediction of part-of-speech and noun-phrase boundaries in newswire matches the cascaded accuracy with only 50% of the training data. Inference: loopy belief propagation.
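A small structural sketch of what "factorial" means here: two label chains (part-of-speech and noun-phrase) over the same tokens, tied by co-temporal factors at every position, plus the usual within-chain transitions. Variable and factor names are illustrative, not from the paper; the resulting graph has cycles, which is why inference falls back to loopy belief propagation.

```python
# Factor-graph structure of a two-layer factorial CRF (illustrative sketch).
def factorial_crf_factors(n_tokens):
    factors = []
    for t in range(n_tokens):
        factors.append(("cotemporal", (f"pos_{t}", f"np_{t}")))   # ties the two layers at t
        factors.append(("pos_obs", (f"pos_{t}",)))                 # label-observation factors
        factors.append(("np_obs", (f"np_{t}",)))
        if t > 0:
            factors.append(("pos_transition", (f"pos_{t-1}", f"pos_{t}")))
            factors.append(("np_transition", (f"np_{t-1}", f"np_{t}")))
    return factors

# Cycles such as pos_0 - np_0 - np_1 - pos_1 - pos_0 rule out exact chain inference,
# so the joint model is decoded with loopy belief propagation.
print(len(factorial_crf_factors(5)))   # 23 factors for a 5-token sentence
```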

14 2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]. "Senator Joe Green said today …. Green ran for … …" In a plain linear chain, the dependency among similar, distant mentions (the two occurrences of "Green") is ignored.

15 2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]. "Senator Joe Green said today …. Green ran for … …" 14% reduction in error on the most-repeated field in email seminar announcements. Inference: tree-reparameterization BP [Wainwright et al., 2002]. See also [Finkel et al., 2005].
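A sketch of the model structure, assuming (as the slide's example suggests) that skip edges connect repeated capitalized tokens; the heuristic and helper names here are illustrative only.

```python
# Edge set of a skip-chain CRF (illustrative sketch).
def skip_chain_edges(tokens):
    edges = [("linear", t - 1, t) for t in range(1, len(tokens))]
    for i, tok in enumerate(tokens):
        if not tok[:1].isupper():
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j] == tok:
                edges.append(("skip", i, j))   # distant identical mentions share information
    return edges

tokens = "Senator Joe Green said today that Green ran for office".split()
print([e for e in skip_chain_edges(tokens) if e[0] == "skip"])   # [('skip', 2, 6)]
```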

16 3. Joint co-reference among all pairs: Affinity Matrix CRF [McCallum, Wellner, IJCAI WS 2003, NIPS 2004]. (Figure: pairwise affinity scores among mentions "... Mr Powell ...", "... Powell ...", "... she ...", with Y/N coreference decisions.) ~25% reduction in error on co-reference of proper nouns in newswire. Inference: correlational clustering / graph partitioning [Bansal, Blum, Chawla, 2002]. Also called "entity resolution" or "object correspondence".
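A greedy sketch of partitioning the affinity graph: repeatedly merge the pair of clusters whose summed inter-cluster affinity is most positive. The real system uses a correlational-clustering / graph-partitioning formulation [Bansal, Blum, Chawla, 2002]; this simple agglomerative stand-in, with made-up affinity values, only shows the shape of the computation.

```python
# Greedy agglomerative partitioning of a pairwise affinity matrix (illustrative stand-in).
# affinity[i][j] > 0 suggests mentions i and j corefer; < 0 suggests they do not.
def greedy_partition(n_mentions, affinity):
    clusters = [{i} for i in range(n_mentions)]
    while True:
        best_gain, best_pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gain = sum(affinity[i][j] for i in clusters[a] for j in clusters[b])
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            return clusters
        a, b = best_pair
        clusters[a] |= clusters[b]
        del clusters[b]

# Toy mentions: 0 = "Mr Powell", 1 = "Powell", 2 = "she"; affinity values are made up.
affinity = {0: {1: 45.0, 2: -90.0}, 1: {0: 45.0, 2: 10.0}, 2: {0: -90.0, 1: 10.0}}
print(greedy_partition(3, affinity))   # [{0, 1}, {2}]
```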

17 Transfer Learning with Factorial CRFs [Sutton, McCallum, 2005]. Target task: emailed seminar announcement entities over email English words. Too little labeled training data: 60k words of training. Example email: From: Terri Stankus To: seminars@cs.cmu.edu Date: 26 Feb 1992 GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

18 Transfer Learning with Factorial CRFs [Sutton, McCallum, 2005]. Train on a "related" task with more data: newswire named entities over newswire English words, 200k words of training. Example: CRICKET - MILLNS SIGNS FOR BOLAND CAPE TOWN (1996-08-22) South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

19 Transfer Learning with Factorial CRFs [Sutton, McCallum, 2005]. At test time, label email with newswire NEs (newswire named-entity layer over email English words)...

20 Transfer Learning with Factorial CRFs [Sutton, McCallum, 2005]. …then use these labels as features for the final task (newswire named entities and emailed seminar announcement entities over email English words).

21 Transfer Learning with Factorial CRFs [Sutton, McCallum, 2005]. Use joint inference at test time (layers: newswire named entities, seminar announcement entities, English words). An alternative to hierarchical Bayes: needn't know anything about the parameterization of the subtask. Accuracy: no transfer < cascaded transfer < joint-inference transfer, an 11% reduction in error.

22 4. Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]. Extraction from and matching of research paper citations, e.g.:
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
(Figure: graphical model coupling segmentation variables s and observations o, citation attributes c, co-reference decisions y, database field values p, and world knowledge.)
Inference: Sparse Generalized Belief Propagation [Pal, Sutton, McCallum, 2005]. 35% reduction in co-reference error by using segmentation uncertainty; 6-14% reduction in segmentation error by using co-reference. See also [Marthi, Milch, Russell, 2003].

23 4. Joint segmentation and co-reference: Joint IE and Coreference from Research Paper Citations. Textual citation mentions (noisy, with duplicates) → paper database with fields, clean, duplicates collapsed:
AUTHORS: Cowell, Dawid…      TITLE: Probab…     VENUE: Springer
AUTHORS: Montemerlo, Thrun…  TITLE: FastSLAM…   VENUE: AAAI…
AUTHORS: Kjaerulff            TITLE: Approxi…    VENUE: Technic…

24 Citation Segmentation and Coreference. Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990.

25 Citation Segmentation and Coreference. Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields.

26 Citation Segmentation and Coreference. Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields. 2) Resolve coreferent citations (coreference decision: Y or N?).

27 Citation Segmentation and Coreference. Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields. 2) Resolve coreferent citations (Y or N?).
Segmentation Quality     Citation Co-reference (F1)
No Segmentation          78%
CRF Segmentation         91%
True Segmentation        93%

28 Citation Segmentation and Coreference. Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields. 2) Resolve coreferent citations (Y or N?). 3) Form canonical database record, resolving conflicts:
AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

29 Citation Segmentation and Coreference. Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields. 2) Resolve coreferent citations (Y or N?). 3) Form canonical database record:
AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990
Perform all three steps jointly.

30 IE + Coreference Model. CRF segmentation of an observed citation: x = observed citation "J Besag 1986 On the…", s = segmentation labels (AUT AUT YR TITL TITL).

31 IE + Coreference Model. Citation mention attributes c extracted from the CRF segmentation of the observed citation: AUTHOR = "J Besag", YEAR = "1986", TITLE = "On the…".

32 IE + Coreference Model. The same structure (x, s, c) is repeated for each citation mention: "J Besag 1986 On the…", "Smyth. 2001 Data Mining…", "Smyth, P Data mining…".

33 IE + Coreference Model. Binary coreference variables for each pair of mentions ("J Besag 1986 On the…", "Smyth. 2001 Data Mining…", "Smyth, P Data mining…").

34 IE + Coreference Model. Binary coreference variables for each pair of mentions, here with assignments y / n / n (the two Smyth citations corefer; neither matches the Besag citation).

35 IE + Coreference Model. Research paper entity attribute nodes, e.g. AUTHOR = "P Smyth", YEAR = "2001", TITLE = "Data Mining…" (coreference assignments y / n / n over the three mentions).

36 IE + Coreference Model. A single research paper entity attribute node shared by all mentions when every pairwise decision is y / y / y.

37 IE + Coreference Model. The full model over the three example mentions: observed citations x, segmentations s, citation attributes c, pairwise coreference variables (y / n / n), and entity attribute nodes.

38 Inference by Sparse "Generalized BP" [Pal, Sutton, McCallum 2005]. Exact inference on the linear-chain regions ("J Besag 1986 On the…", "Smyth. 2001 Data Mining…", "Smyth, P Data mining…"); from each chain, pass an N-best list into coreference.

39 Inference by Sparse "Generalized BP" [Pal, Sutton, McCallum 2005]. Approximate inference by graph partitioning, integrating out the uncertainty in the extraction samples. Scales to 1M citations with canopies [McCallum, Nigam, Ungar 2000].
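The canopy trick [McCallum, Nigam, Ungar 2000] keeps the number of pairwise comparisons manageable: a cheap similarity with a loose and a tight threshold groups citations into overlapping canopies, and expensive coreference factors are only instantiated for pairs that share a canopy. The sketch below uses token overlap as the cheap similarity and made-up thresholds.

```python
# Canopy construction sketch (cheap similarity = token overlap; thresholds are illustrative).
def token_overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, min(len(ta), len(tb)))

def make_canopies(citations, loose=0.3, tight=0.7):
    remaining = set(range(len(citations)))
    canopies = []
    while remaining:
        center = remaining.pop()                 # an arbitrary remaining citation seeds a canopy
        canopy = {center}
        for i in list(remaining):
            sim = token_overlap(citations[center], citations[i])
            if sim >= loose:
                canopy.add(i)                    # may also appear in other canopies
            if sim >= tight:
                remaining.discard(i)             # close enough that it never seeds its own canopy
        canopies.append(canopy)
    return canopies

# Pairwise coreference factors are then created only for pairs inside a common canopy.
```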

40 Inference: Sample = N-best list from CRF segmentation. Alternative segmentations of the same citation (field boundaries differ across the N-best list):
Name | Title
Laurel, B | Interface Agents: Metaphors with Character The …
Laurel, B. | Interface Agents: Metaphors with Character …
Laurel, B. Interface Agents | Metaphors with Character …
Name | Title | Book Title | Year
Laurel, B. Interface | Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
Laurel, B. | Interface Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
Laurel, B. Interface | Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
When calculating similarity with another citation, there is more opportunity to find correct, matching fields.
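Read as code, the idea is that each citation carries an N-best list of segmentations with probabilities, and pairwise similarity is an expectation over both lists, so a correct field split in any high-probability hypothesis can still produce a match. The field-match score below is a deliberately crude stand-in (an assumption), not the model's learned factor.

```python
# Expected pairwise similarity over two citations' N-best segmentations (sketch).
def field_match_score(seg_a, seg_b):
    """Crude stand-in for the learned pairwise factor: count agreeing fields."""
    shared = set(seg_a) & set(seg_b)
    return sum(1.0 for f in shared if seg_a[f].lower() == seg_b[f].lower())

def expected_affinity(nbest_a, nbest_b):
    """nbest_* = [(probability, {field: value, ...}), ...] from the segmentation CRF."""
    return sum(p_a * p_b * field_match_score(seg_a, seg_b)
               for p_a, seg_a in nbest_a
               for p_b, seg_b in nbest_b)

nbest_1 = [(0.6, {"Name": "Laurel, B.", "Title": "Interface Agents: Metaphors with Character"}),
           (0.4, {"Name": "Laurel, B. Interface", "Title": "Agents: Metaphors with Character"})]
nbest_2 = [(1.0, {"Name": "Laurel, B.", "Title": "Interface Agents: Metaphors with Character"})]
print(expected_affinity(nbest_1, nbest_2))   # 1.2: both fields match under the likelier split
```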

41 Inference by Sparse "Generalized BP" [Pal, Sutton, McCallum 2005]. Exact (exhaustive) inference over the entity attributes, given the pairwise coreference assignments (y / n / n).

42 Inference by Sparse "Generalized BP" [Pal, Sutton, McCallum 2005]. Revisit exact inference on the IE linear chains, now conditioned on the entity attributes.

43 Parameter Estimation: Piecewise Training [Sutton & McCallum 2005]. Divide-and-conquer parameter estimation:
– IE linear chain: exact MAP
– Coref graph edge weights: MAP on individual edges
– Entity attribute potentials: MAP, pseudo-likelihood
In all cases: climb the MAP gradient with a quasi-Newton method.
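A sketch of the divide-and-conquer recipe: each factor template gets its own local objective, and each local objective is climbed with a quasi-Newton optimizer. The coref-edge piece is shown concretely as a regularized logistic objective over individual edges; the use of scipy's L-BFGS-B is an illustrative choice, not the authors' code.

```python
# Piecewise training sketch: fit each factor template on its own local objective.
import numpy as np
from scipy.optimize import minimize

def train_piece(neg_log_posterior, n_params):
    """Climb one piece's MAP gradient with a quasi-Newton method (L-BFGS here)."""
    result = minimize(neg_log_posterior, np.zeros(n_params), method="L-BFGS-B", jac=True)
    return result.x

def coref_edge_objective(w, features, labels, l2=1.0):
    """Local MAP objective for coreference edge weights: regularized logistic
    loss over individual edges (one feature vector and label per mention pair)."""
    probs = 1.0 / (1.0 + np.exp(-(features @ w)))
    nll = -np.sum(labels * np.log(probs + 1e-12) + (1 - labels) * np.log(1 - probs + 1e-12))
    nll += 0.5 * l2 * (w @ w)
    grad = features.T @ (probs - labels) + l2 * w
    return nll, grad

# Toy edge features and labels (illustrative); the IE chain and entity-attribute
# pieces would be trained the same way, each with its own local objective.
X = np.array([[1.0, 0.9], [1.0, 0.1], [1.0, 0.8]])
y = np.array([1.0, 0.0, 1.0])
w = train_piece(lambda w: coref_edge_objective(w, X, y), n_params=2)
```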

44 4. Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]. Extraction from and matching of research paper citations, e.g.:
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
(Figure: segmentation variables s and observations o, citation attributes c, co-reference decisions y, database field values p, and world knowledge.)
Inference: a variant of Iterated Conditional Modes [Besag, 1986]. 35% reduction in co-reference error by using segmentation uncertainty; 6-14% reduction in segmentation error by using co-reference.

45 Outline
– Motivating joint inference for NLP
– Brief introduction to Conditional Random Fields
– Joint inference: motivation and examples
  – Joint labeling of cascaded sequences (belief propagation)
  – Joint labeling of distant entities (BP by tree reparameterization)
  – Joint co-reference resolution (graph partitioning)
  – Joint segmentation and co-ref (sparse BP)
  – Joint extraction and data mining (iterative)
– Topical N-gram models

46 “George W. Bush’s father is George H. W. Bush (son of Prescott Bush).”

47

48

49 ?

50 Relation Extraction as Sequence Labeling. "George W. Bush … George H. W. Bush (son of Prescott Bush) …" with relation labels Father (on "George H. W. Bush") and Grandfather (on "Prescott Bush").

51 Learning Relational Database Features. "George W. Bush … George H. W. Bush (son of Prescott Bush) …" labeled Father / Grandfather. Database (Name → Son): Prescott Bush → George H. W. Bush → George W. Bush. Search the DB for "relational paths" between the subject and the token, e.g. Subject_Is_SonOf_SonOf_Token = 1.0.
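A sketch of how such a feature can be computed: breadth-first search over the database's relation triples from the sentence's subject to the candidate token, concatenating relation names along the path. The triple representation and helper names are assumptions for illustration.

```python
from collections import deque

# Toy relational DB as (head, relation, tail) triples; the representation is an assumption.
DB = [("George W. Bush", "SonOf", "George H. W. Bush"),
      ("George H. W. Bush", "SonOf", "Prescott Bush")]

def relational_path_feature(subject, token, db=DB, max_hops=3):
    """BFS from subject to token over DB relations; return a path feature name or None."""
    frontier = deque([(subject, [])])
    seen = {subject}
    while frontier:
        node, path = frontier.popleft()
        if node == token and path:
            return "Subject_Is_" + "_".join(path) + "_Token"
        if len(path) >= max_hops:
            continue
        for head, rel, tail in db:
            if head == node and tail not in seen:
                seen.add(tail)
                frontier.append((tail, path + [rel]))
    return None

print(relational_path_feature("George W. Bush", "Prescott Bush"))
# -> Subject_Is_SonOf_SonOf_Token, the feature fired on the "Prescott Bush" token
```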

52 Highly weighted relational paths. Many family equivalences:
– Sibling = Parent_Offspring
– Cousin = Parent_Sibling_Offspring
– College = Parent_College
– Religion = Parent_Religion
– Ally = Opponent_Opponent
– Friend = Person_Same_School
Preliminary results: nice performance boost using relational features (~8% absolute F1).

53 Testing on Unknown Entities. "John F. Kennedy … son of Joseph P. Kennedy, Sr. and Rose Fitzgerald" (relation labels: Father, Mother). Fill the DB with a "first-pass" CRF:
Name | Son
Joseph P. Kennedy | John F. Kennedy
Rose Fitzgerald | John F. Kennedy
Then use relational features with a "second-pass" CRF.

54 Next Steps
– Feature induction to discover complex rules
– Measure relational features' sensitivity to noise in the DB
– Collective inference among related relations

55 Outline
– Motivating joint inference for NLP
– Brief introduction to Conditional Random Fields
– Joint inference: motivation and examples
  – Joint labeling of cascaded sequences (belief propagation)
  – Joint labeling of distant entities (BP by tree reparameterization)
  – Joint co-reference resolution (graph partitioning)
  – Joint segmentation and co-ref (sparse BP)
  – Joint extraction and data mining (iterative)
– Topical N-gram models

56 Topical N-gram Model - our first attempt [Wang & McCallum]. (Plate diagram: per-token topic variables z1…z4, words w1…w4, and bigram-switch variables y1…y4, with plates over T topics, W words, and D documents; annotation: {0, 1, 1:2, 2:2, 1:3, 2:3, 3:3}.)

57 Beyond bag-of-words [Wallach]. (Plate diagram of a bigram topic model: topic variables z1…z4 over words w1…w4, with plates over T topics, W words, and D documents.)

58 LDA-COL (Collocation) Model [Griffiths & Steyvers]. (Plate diagram: topic variables z1…z4, words w1…w4, and collocation-switch variables y1…y4, with plates over T topics, W words, and D documents.)

59 Topical N-gram Model [Wang & McCallum]. (Plate diagram: topic variables z1…z4, words w1…w4, and bigram-switch variables y1…y4, with plates over T topics, W words, and D documents.)

60 Topical N-gram Model [Wang & McCallum]. (Plate diagram, as on the previous slide: topic variables z1…z4, words w1…w4, and bigram-switch variables y1…y4, with plates over T topics, W words, and D documents.)
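A hedged generative sketch of the idea behind the plate diagram: each token draws a topic from the document's topic distribution and a binary switch that decides whether the word comes from the topic's unigram distribution or from a bigram distribution conditioned on the previous word. The exact conditioning of the switch and of the bigram distributions differs across the variants on these slides; the simplification below is only illustrative.

```python
import random

def generate_document(theta, phi_unigram, phi_bigram, sigma, length, vocab):
    """Simplified topical n-gram generative process (illustrative).
    theta        document's topic distribution, theta[z]
    phi_unigram  per-topic unigram distributions, phi_unigram[z][word]
    phi_bigram   per-topic bigram distributions, phi_bigram[z][prev_word][word]
    sigma        probability that a token continues a phrase (the bigram switch)"""
    words, prev = [], None
    for _ in range(length):
        z = random.choices(range(len(theta)), weights=theta)[0]
        use_bigram = prev is not None and random.random() < sigma
        dist = phi_bigram[z].get(prev, phi_unigram[z]) if use_bigram else phi_unigram[z]
        w = random.choices(vocab, weights=[dist.get(v, 1e-9) for v in vocab])[0]
        words.append(w)
        prev = w
    return words

# Tiny made-up parameters with one topic; phrases like "reinforcement learning" arise
# because the bigram distribution strongly prefers "learning" after "reinforcement".
vocab = ["reinforcement", "learning", "policy"]
phi_u = [{"reinforcement": 0.4, "policy": 0.4, "learning": 0.2}]
phi_b = [{"reinforcement": {"learning": 1.0}}]
print(generate_document([1.0], phi_u, phi_b, sigma=0.8, length=6, vocab=vocab))
```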

61 Topic Comparison (reinforcement learning topic).
LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning RL, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

62 Topic Comparison (visual motion topic).
LDA: motion, visual, field, position, figure, direction, fields, eye, location, retina, receptive, velocity, vision, moving, system, flow, edge, center, light, local
Topical N-grams (2+): receptive field, spatial frequency, temporal frequency, visual motion, motion energy, tuning curves, horizontal cells, motion detection, preferred direction, visual processing, area mt, visual cortex, light intensity, directional selectivity, high contrast, motion detectors, spatial phase, moving stimuli, decision strategy, visual stimuli
Topical N-grams (1): motion, response, direction, cells, stimulus, figure, contrast, velocity, model, responses, stimuli, moving, cell, intensity, population, image, center, tuning, complex, directions

63 Topic Comparison (speech recognition topic).
LDA: word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models
Topical N-grams (2+): speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent
Topical N-grams (1): speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid

64 Summary
– Joint inference can avoid accumulating errors in a pipeline from extraction to data mining.
– Examples:
  – Factorial finite state models
  – Jointly labeling distant entities
  – Coreference analysis
  – Segmentation uncertainty aiding coreference & vice versa
  – Joint extraction and data mining
– Many examples of sequential topic models.

