Learning Relational Dependency Networks for Relation Extraction

Presentation transcript:

Learning Relational Dependency Networks for Relation Extraction Ameet Soni, Dileep Viswanathan, Jude Shavlik, and Sriraam Natarajan

Knowledge Extraction
Given: a large corpus of unstructured text and a list of target relations.
Goal: compute the probability that a relation exists between pairs of entities.

Motivating Task - TAC KBP
Knowledge Base Population: automatic construction of a knowledge base from unstructured text.
Sentence: "President Barack Obama, 55, attended the event with wife Michelle Obama"
Knowledge: spouse(Barack Obama, Michelle Obama), age(Barack Obama, 55), title(Barack Obama, President)

TAC-KBP Challenges
- Large scale: over 300 GB of data
- No labeled data
- High degree of uncertainty
- Extracting structure from unstructured text

System Overview (typical: CRF / MRF / Deep / Deeper / Deepest)
Features are extracted from the train corpus and, together with training labels, used to train the model; the learned models are then applied to features from the test corpus to produce p(y|x).

Contributions
- RDN Boost algorithm on a benchmark relation extraction task
- Joint learning of target relations
- Weak supervision through knowledge rules
- word2vec relational features
- Incorporating human advice

System Overview (this work)
The train corpus is converted to features (Stanford NLP facts and word2vec); the boosted RDN learner combines these features with advice and training labels (weakly supervised plus gold annotations) to produce learned models, which are applied to test-corpus features and knowledge rules to output p(y|x).

Dependency Network
Approximate the joint distribution over the variables as a product of conditional distributions: P(A,P,C,D) ≈ P(D|C,A) × P(C|A,P) × P(A|D) × P(P|C,D). Even with the approximation there is no closed-form solution, so Gibbs sampling is typically used for inference. (Diagram: a cyclic network over Difficulty, Course rating, Prof rating, and Average grade.) D. Heckerman et al., JMLR 2001
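
As a concrete illustration of that inference step, here is a minimal Gibbs sampler over a toy dependency network on the four variables from the diagram; the conditional probabilities below are made-up placeholders, not learned values.

    import random

    # Toy conditionals P(X=1 | parents) for a dependency network over
    # D (difficulty), C (course rating), P (prof rating), A (average grade).
    # The numbers are arbitrary placeholders, not learned CPDs.
    cond = {
        "D": lambda s: 0.7 if s["C"] and s["A"] else 0.3,     # P(D | C, A)
        "C": lambda s: 0.8 if s["A"] or s["P"] else 0.2,      # P(C | A, P)
        "A": lambda s: 0.6 if not s["D"] else 0.4,            # P(A | D)
        "P": lambda s: 0.9 if s["C"] and not s["D"] else 0.5, # P(P | C, D)
    }

    def gibbs(n_samples=10000, burn_in=1000):
        state = {v: random.random() < 0.5 for v in cond}      # random start
        counts = {v: 0 for v in cond}
        for i in range(n_samples + burn_in):
            for v in cond:                                    # resample each variable given the rest
                state[v] = random.random() < cond[v](state)
            if i >= burn_in:
                for v in cond:
                    counts[v] += state[v]
        return {v: counts[v] / n_samples for v in counts}

    print(gibbs())   # approximate marginals P(X=1) under the dependency network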

Relational Dependency Network (RDN) [Neville & Jensen '07]
- Extends dependency networks to relational domains: random variables are objects and relations
- Aggregators (e.g., count) handle the multiple-instance problem (see the count sketch below)
- Learning an RDN corresponds to learning the CPDs; Relational Probability Trees (RPTs) represent the CPDs, and each RPT is learned independently
- Inference: Gibbs sampling
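
A minimal sketch (with assumed predicate names, not taken from the paper) of how a count aggregator collapses a one-to-many relation such as takes(S,C) into a single value that an RPT node can test:

    # Ground facts for the relation takes(Student, Course)
    takes = {("alice", "cs101"), ("alice", "cs202"), ("bob", "cs101")}

    def count_courses(student):
        # Aggregate the multi-instance relation into one value per student,
        # so a tree node can split on e.g. count_courses(S) > 1.
        return sum(1 for s, c in takes if s == student)

    print(count_courses("alice"))   # 2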

RDN Example
A relational graph over Student(S), Professor(P), and Course(C) objects, with predicates IQ(S,I), satisfaction(S,B), takes(S,C), grade(S,C,G), Level(P,L), taughtBy(P,C), ratings(P,C,R), Difficulty(C,D), and aggregated values avgSGrade(S,G) and avgCGrade(C,G).

Gradient (Tree) Boosting [Friedman, Annals of Statistics 29(5):1189-1232, 2001]
The model is a weighted combination of a large number of small trees. Intuition: build an additive model by sequentially fitting small regression trees to the pseudo-residuals at each iteration; the final model is the initial model plus the sum of all the fitted trees.
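
A minimal, propositional sketch of that loop for squared loss, using scikit-learn regression stumps as the small trees; the data is synthetic and this only shows the residual-fitting idea, not the relational version used by RDN Boost.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # synthetic regression target

    F = np.full_like(y, y.mean())     # initial model: constant prediction
    trees, lr = [], 0.1
    for _ in range(50):
        residuals = y - F                               # pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        trees.append(tree)
        F += lr * tree.predict(X)                       # additive update

    def predict(X_new):
        return y.mean() + lr * sum(t.predict(X_new) for t in trees)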

Feature Extraction Pipeline
Corpus → first-order logic facts generator → RDN Boost. Example facts: entityType(a,"PER"), prevLemma("Obama","age"), nextWord("Barack","Obama"), ...

NLP Features
- Part of speech: noun, verb, adjective, etc.
- Word lemma: sitting::sit, ate::eat
- Adjacent words: next and previous words
- Named entity tag: Tom::Person, LA::City
- Dependency path: root word, child of root, etc.
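
A sketch of extracting these lexical features, with spaCy standing in for the Stanford pipeline used in the paper; the predicate names mirror the pipeline slide above and are illustrative only (assumes the en_core_web_sm model is installed).

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("President Barack Obama, 55, attended the event with wife Michelle Obama")

    facts = []
    for tok in doc:
        facts.append(f'pos("{tok.text}","{tok.pos_}")')            # part of speech
        facts.append(f'lemma("{tok.text}","{tok.lemma_}")')        # word lemma
        if tok.i + 1 < len(doc):
            facts.append(f'nextWord("{tok.text}","{doc[tok.i + 1].text}")')   # adjacent word
        if tok.ent_type_:
            facts.append(f'entityType("{tok.text}","{tok.ent_type_}")')       # named entity tag
        facts.append(f'dep("{tok.text}","{tok.dep_}","{tok.head.text}")')     # dependency arc

    print("\n".join(facts))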

word2vec Features
Goal: detect similarities in the linguistic context of words. Each word is represented as a vector-space embedding, and words with similar contexts have high cosine similarity. Approach: introduce word2vec features in a relational context and learn a basket of words whose contexts are similar to the target relation (e.g., father, dad, mother, mom), via a predicate isCosineSimilar(target_word, cur_word, threshold):
entity(A) ^ entity(B) ^ wordInBetween(C) ^ isCosineSimilar("father", C, 0.75) → parent(A,B)
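
A sketch of the isCosineSimilar predicate on top of pretrained gensim word vectors; the vector file path and the 0.75 threshold are placeholders, not the paper's settings.

    from gensim.models import KeyedVectors

    # Placeholder path: any word2vec-format embedding file will do.
    vecs = KeyedVectors.load_word2vec_format("GoogleNews-vectors.bin", binary=True)

    def is_cosine_similar(target_word, cur_word, threshold=0.75):
        if target_word not in vecs or cur_word not in vecs:
            return False
        return vecs.similarity(target_word, cur_word) >= threshold

    # A word between two person entities that is close to "father" would
    # then support the parent(A, B) relation, as in the rule above.
    print(is_cosine_similar("father", "dad"))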

Joint Learning
RDNs can learn multiple target relations jointly. TAC KBP contains multiple targets with strong correlations, e.g., spouse(A,B) and spouse(B,A), or parent(A,B) and child(B,A). Joint inference can detect relations despite weak evidence. Example: "Barack and Michelle's daughter, Malia Obama, is taking a gap year before beginning college." Individual inference: positive for child and for parent only. Joint inference: also positive for spouse.
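
A toy illustration of why correlated targets help each other; the actual system does this through Gibbs sampling over the jointly learned RDN trees rather than hand-written rules.

    # Hand-written rule: two people who share a child are likely spouses.
    facts = {("parent", "Barack", "Malia"), ("parent", "Michelle", "Malia")}

    def infer_spouse(facts):
        children = {}
        for rel, a, b in facts:
            if rel == "parent":
                children.setdefault(b, set()).add(a)   # child -> set of parents
        spouses = set()
        for kid, parents in children.items():
            for p1 in parents:
                for p2 in parents:
                    if p1 != p2:
                        spouses.add(("spouse", p1, p2))
        return spouses

    print(infer_spouse(facts))   # spouse(Barack, Michelle) and spouse(Michelle, Barack)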

Incorporating Human Advice
Human advice can benefit model construction beyond merely labeling examples. Odom et al. (2015) demonstrated a successful approach to incorporating human advice into the RDN Boost algorithm: advice influences the calculation of the gradients, with a learned trade-off between data and advice. Example advice: Entity1 is "also known as" Entity2 probably indicates an alternate name.
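
A rough sketch, in the spirit of Odom et al. (2015), of how advice could enter the gradient for one example; the blending formula and the alpha weight here are my assumptions, not the published implementation.

    def advice_gradient(label, prob, n_advice_pos, n_advice_neg, alpha=0.5):
        # Pseudo-residual = data gradient (I - P) blended with the net vote of
        # matching advice rules; alpha trades off data versus advice (assumed form).
        data_grad = label - prob
        advice_grad = n_advice_pos - n_advice_neg
        return alpha * data_grad + (1 - alpha) * advice_grad

    # Example matched by one positive advice rule ("also known as" -> alternate name)
    print(advice_gradient(label=1, prob=0.3, n_advice_pos=1, n_advice_neg=0))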

Weak Supervision
Manually labeling target relations is prohibitively expensive. Weak supervision automatically identifies training examples, with some noise ("silver-standard" examples). Knowledge-based weak supervision: encode the intuition of human labelers as first-order logic rules, with weights indicating confidence; construct an MLN to perform inference on the unlabeled corpus; the inferred positives (with their probabilities) become training labels. Shown to be superior to distant supervision [ILP 2013]. Example rule:
entityType(a,"PER") ^ commaAfter(a) ^ nextWord(a,b) ^ entityType(b,"NUM") → age(a,b)
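
A sketch of applying that rule to extracted facts to produce silver-standard labels; this is a plain rule matcher standing in for the MLN inference the system actually performs, and the fact encoding is assumed.

    def weak_label_age(facts):
        # facts: tuples such as ("entityType", "Obama", "PER"), ("commaAfter", "Obama"),
        # ("nextWord", "Obama", "55"). Returns silver-standard age(a, b) labels.
        per   = {x for (p, x, *rest) in facts if p == "entityType" and rest == ["PER"]}
        num   = {x for (p, x, *rest) in facts if p == "entityType" and rest == ["NUM"]}
        comma = {x for (p, x, *rest) in facts if p == "commaAfter"}
        silver = set()
        for (p, a, *rest) in facts:
            if p == "nextWord" and a in per and a in comma and rest and rest[0] in num:
                silver.add(("age", a, rest[0]))
        return silver

    facts = {("entityType", "Obama", "PER"), ("commaAfter", "Obama"),
             ("nextWord", "Obama", "55"), ("entityType", "55", "NUM")}
    print(weak_label_age(facts))   # {("age", "Obama", "55")}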

System Overview (recap)
Stanford NLP facts and word2vec features from the train corpus, together with advice and training labels (weakly supervised and gold annotations), feed RDN Boost; the learned models are applied to test-corpus features and knowledge rules to output p(y|x).

Experimental Methodology
Data: TAC KBP corpus (train on 2014, test on 2015) with 14 target relations (10 person relations, 4 organization relations).
Experiments: RDN Boost vs. Relation Factory [Roth et al., 2013 KBP champion]; joint learning vs. individual learning; RDN + word2vec; RDN + advice; RDN + weak supervision.

Weak supervision plus 20 gold-standard examples can mostly replace the full gold-standard training set

Joint learning improves performance when target relations are connected

word2vec does not improve performance

Advice improves performance (particularly recall)

RDN (the full system) outperforms a state-of-the-art system on KBP

Conclusions
- Weak supervision helps when supplemented with a few gold examples
- Joint learning performs well for connected relations
- Results do not favor word2vec features over standard features
- Human advice improves performance over learning from data alone
- RDN outperforms a state-of-the-art relation extraction system

Future Work
- Expand our framework for word2vec features in the relational domain
- Compare RDN to other relational algorithms, e.g., word embeddings using ProbLog
- Expand to all KBP target relations