1
RELATION EXTRACTION, SYMBOLIC SEMANTICS, DISTRIBUTIONAL SEMANTICS. Heng Ji (jih@rpi.edu), Oct 13, 2015. Acknowledgement: distributional semantics slides from Omer Levy, Yoav Goldberg and Ido Dagan
2
Outline: Task Definition; Symbolic Semantics (Basic Features, World Knowledge, Learning Models); Distributional Semantics
3
Relation Extraction: Task. A relation is a semantic relationship between two entities.
ACE relation type | example
Agent-Artifact | Rubin Military Design, the makers of the Kursk
Discourse | each of whom
Employment/Membership | Mr. Smith, a senior programmer at Microsoft
Place-Affiliation | Salzburg Red Cross officials
Person-Social | relatives of the dead
Physical | a town some 50 miles south of Salzburg
Other-Affiliation | Republican senators
4
A Simple Baseline with K-Nearest-Neighbor (KNN) [figure: a test sample classified by its K=3 nearest training samples]
5
Relation Extraction with KNN. Distance between a test sample and a training sample:
1. If the heads of the mentions don't match: +8
2. If the entity types of the heads of the mentions don't match: +20
3. If the intervening words don't match: +10
[figure: test sample "the president of the United States" compared against Employment and Physical training samples ("the previous president of the United States", "the secretary of NIST", "US forces in Bahrain", "Connecticut's governor", "his ranch in Texas") with the resulting distances]
6
Typical Relation Extraction Features
Lexical: heads of the mentions and their context words, POS tags
Entity: entity and mention type of the heads of the mentions; entity positional structure; entity context
Syntactic: chunking (premodifier, possessive, preposition, formulaic); the sequence of the heads of the constituents/chunks between the two mentions; the syntactic relation path between the two mentions; dependent words of the mentions
Semantic: gazetteers; synonyms in WordNet; name gazetteers; personal relative trigger word list; Wikipedia (if the head extent of a mention is found, via simple string matching, in the predicted Wikipedia article of another mention)
References: Kambhatla, 2004; Zhou et al., 2005; Jiang and Zhai, 2007; Chan and Roth, 2010, 2011
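The feature families above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the helper name `extract_features` and the span encoding are invented, and a real system would add POS tags, chunk paths, WordNet synonyms, and gazetteer lookups.

```python
# Sketch of a few typical relation-extraction features for one mention pair.
def extract_features(tokens, m1, m2):
    """m1, m2: (start, end, entity_type) mention spans over `tokens`."""
    feats = {}
    h1, h2 = tokens[m1[1] - 1], tokens[m2[1] - 1]  # head = last token of span
    feats["head_pair"] = f"{h1}|{h2}"              # lexical: heads of the mentions
    feats["etype_pair"] = f"{m1[2]}|{m2[2]}"       # entity types of the mentions
    between = tokens[m1[1]:m2[0]]                  # intervening words
    feats["between_bow"] = "_".join(between)
    feats["n_between"] = len(between)              # positional structure
    return feats

feats = extract_features(
    ["Mr.", "Smith", ",", "a", "senior", "programmer", "at", "Microsoft"],
    (0, 2, "PER"), (7, 8, "ORG"))
```

The example reuses the "Mr. Smith, a senior programmer at Microsoft" sentence from the Employment/Membership row above.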
7
Using Background Knowledge (Chan and Roth, 2010). Features employed are usually restricted to the various representations of the target sentences, yet humans rely on background knowledge to recognize relations. Overall aim of this work: propose methods of using knowledge or resources that exist beyond the sentence (Wikipedia, word clusters, hierarchy of relations, entity type constraints, coreference), either as additional features or under the Constraint Conditional Model (CCM) framework with Integer Linear Programming (ILP).
8
8 8 David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team Using Background Knowledge
12
David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team. Using Background Knowledge: excerpts from the Wikipedia article on David Cone:
"David Brian Cone (born January 2, 1963) is a former Major League Baseball pitcher. He compiled an 8–3 postseason record over 21 postseason starts and was a part of five World Series championship teams (1992 with the Toronto Blue Jays and 1996, 1998, 1999 & 2000 with the New York Yankees). He had a career postseason ERA of 3.80. He is the subject of the book A Pitcher's Story: Innings With David Cone by Roger Angell. Fans of David are known as 'Cone-Heads.' Cone lives in Stamford, Connecticut, and is formerly a color commentator for the Yankees on the YES Network."
"Partly because of the resulting lack of leadership, after the 1994 season the Royals decided to reduce payroll by trading pitcher David Cone and outfielder Brian McRae, then continued their salary dump in the 1995 season. In fact, the team payroll, which was always among the league's highest, was sliced in half from $40.5 million in 1994 (fourth-highest in the major leagues) to $18.5 million in 1996 (second-lowest in the major leagues)."
13
Using Background Knowledge. David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team. Fine-grained predictions:
Employment:Staff 0.20
Employment:Executive 0.15
Personal:Family 0.10
Personal:Business 0.10
Affiliation:Citizen 0.20
Affiliation:Based-in 0.25
14
Using Background Knowledge. David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team. Fine-grained and coarse-grained predictions:
Employment:Staff 0.20, Employment:Executive 0.15 -> Employment 0.35
Personal:Family 0.10, Personal:Business 0.10 -> Personal 0.40
Affiliation:Citizen 0.20, Affiliation:Based-in 0.25 -> Affiliation 0.25
17
Knowledge 1: Wikipedia, feature 1 (as additional feature). We use a Wikifier system (Ratinov et al., 2010) which performs context-sensitive mapping of mentions to Wikipedia pages. Introduce a new feature based on whether one mention appears in the predicted Wikipedia article of the other [definition shown as a formula in the original slide], then a second feature combining it with the coarse-grained entity types of m_i and m_j.
18
Knowledge 1: Wikipedia, feature 2 (as additional feature). Given m_i and m_j, we use a Parent-Child system (Do and Roth, 2010) to predict whether they stand in a parent-child relation; introduce a new feature combining this prediction with the coarse-grained entity types of m_i and m_j.
19
Knowledge 2: Word Class Information (as additional feature). Supervised systems face an issue of data sparseness (of lexical features). Use class information of words to support generalization, instantiated in our work as word clusters, automatically generated from unlabeled texts using the algorithm of Brown et al. (1992). [figure: binary Brown-cluster tree over words such as apple, pear, Apple, IBM, bought, run, of, in, with 0/1 branch labels]
22
Knowledge 2: Word Class Information. All lexical features consisting of single words are duplicated with their corresponding bit-string representations. [figure: the cluster tree with bit strings such as 00, 01, 10, 11 at the leaves]
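The duplication step can be sketched directly. This is an illustrative sketch, not the paper's code: the cluster bit strings below are invented, and real Brown paths are longer and come from the clustering output.

```python
# Sketch: duplicating single-word lexical features with Brown-cluster bit strings.
brown_path = {"apple": "00", "pear": "00", "Apple": "01", "IBM": "01",
              "bought": "10", "run": "10", "of": "11", "in": "11"}

def add_cluster_features(feats):
    """For every feature whose value is a single word with a known cluster,
    add a duplicate feature keyed on the cluster bit string."""
    extra = {}
    for name, word in feats.items():
        if word in brown_path:
            extra[name + "_cluster"] = brown_path[word]
    feats.update(extra)
    return feats

f = add_cluster_features({"head_m1": "IBM", "between_w": "bought"})
```

A feature fired on the rare word "pear" now also fires on cluster "00", shared with "apple", which is how the clusters fight sparseness.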
24
Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008). Score an assignment y with a weight vector for "local" models (a collection of classifiers), minus a penalty for violating each constraint, scaled by how far y is from a "legal" assignment: argmax_y w . f(x, y) - sum_k rho_k d(y, 1_{C_k}).
25
Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008). Knowledge sources plugged into the framework: Wikipedia, word clusters, hierarchy of relations, entity type constraints, coreference.
26
Constraint Conditional Models (CCMs). David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team. Fine-grained and coarse-grained predictions:
Employment:Staff 0.20, Employment:Executive 0.15 -> Employment 0.35
Personal:Family 0.10, Personal:Business 0.10 -> Personal 0.40
Affiliation:Citizen 0.20, Affiliation:Based-in 0.25 -> Affiliation 0.25
27
Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008). Key steps: 1) write down a linear objective function; 2) write down constraints as linear inequalities; 3) solve using integer linear programming (ILP) packages.
28
Knowledge 3: Relations between our target relations. [figure: relation hierarchy with coarse-grained labels such as "personal" and "employment" and fine-grained children "family", "biz" and "executive", "staff"]
29
Knowledge 3: Hierarchy of Relations. A coarse-grained classifier predicts labels such as "personal" and "employment"; a fine-grained classifier predicts their children ("family", "biz"; "executive", "staff").
37
Knowledge 3: Hierarchy of Relations. Write down a linear objective function over the coarse-grained and fine-grained prediction probabilities, with one indicator variable per candidate coarse-grained and fine-grained label; setting an indicator variable to 1 corresponds to a relation assignment.
38
Knowledge 3: Hierarchy of Relations. Write down constraints: if a relation R is assigned a coarse-grained label rc, then we must also assign to R a fine-grained label rf which is a child of rc. Capturing the inverse relationship: if we assign rf to R, then we must also assign to R the parent of rf, the corresponding coarse-grained label.
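For a single relation the constrained inference can be sketched by brute-force enumeration instead of an ILP solver (the ILP matters when many relations and constraints interact). This is an illustrative sketch, not the paper's system; the probabilities are taken from the David Cone example above, and the child-of test via the "Coarse:Fine" naming is an assumption of this sketch.

```python
# Sketch of CCM inference for one relation: enumerate joint (coarse, fine)
# assignments, score with the linear objective (sum of classifier
# probabilities), and enforce the hierarchy constraint.
coarse_p = {"Employment": 0.35, "Personal": 0.40, "Affiliation": 0.25}
fine_p = {"Employment:Staff": 0.20, "Employment:Executive": 0.15,
          "Personal:Family": 0.10, "Personal:Business": 0.10,
          "Affiliation:Citizen": 0.20, "Affiliation:Based-in": 0.25}

def ccm_infer(coarse_p, fine_p):
    best, best_score = None, float("-inf")
    for rc in coarse_p:
        for rf in fine_p:
            if not rf.startswith(rc + ":"):    # hierarchy constraint
                continue
            score = coarse_p[rc] + fine_p[rf]  # linear objective
            if score > best_score:
                best, best_score = (rc, rf), score
    return best, best_score

assignment, score = ccm_infer(coarse_p, fine_p)
```

Note how the joint decision differs from greedy decoding: the coarse classifier alone prefers Personal (0.40), but the best consistent pair is Employment with Employment:Staff (0.35 + 0.20 = 0.55).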
40
Knowledge 4: Entity Type Constraints (Roth and Yih, 2004, 2007). Entity types (per, org, gpe) are useful for constraining the possible labels that a relation R over (m_i, m_j) can assume: Employment:Staff, Employment:Executive, Personal:Family, Personal:Business, Affiliation:Citizen, Affiliation:Based-in.
41
We gather information on entity type constraints from the ACE-2004 documentation and impose them on the coarse-grained relations. By improving the coarse-grained predictions and combining them with the hierarchical constraints defined earlier, the improvements propagate to the fine-grained predictions.
43
Knowledge 5: Coreference. If m_i and m_j corefer, no relation should hold between them (the null label). In this work we assume that we are given the coreference information, which is available from the ACE annotation.
44
Experiment Results. F1 improvement from using each knowledge source:
system | All nwire | 10% of nwire
BasicRE | 50.5% | 31.0%
45
Most Successful Learning Methods: Kernel-based.
Consider different levels of syntactic information: deep processing of text produces structural but less reliable results; simple surface information is less structural but more reliable.
Generalization of feature-based solutions: a kernel (kernel function) defines a similarity metric Ψ(x, y) on objects; no need for enumeration of features; efficient extension of normal features into high-order spaces; possible to solve linearly non-separable problems in a higher-order space.
Nice combination properties: closed under linear combination, under polynomial extension, and under direct sum/product on different domains.
References: Zelenko et al., 2002, 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhao and Grishman, 2005; Che et al., 2005; Zhang et al., 2006; Qian et al., 2007; Zhou et al., 2007; Khayyamian et al., 2009; Reichartz et al., 2009
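The closure properties can be made concrete with a toy kernel. This is an illustrative sketch under invented names (`token_kernel`, `combine`, `product`), not any kernel from the cited papers; a real relation kernel would compare parse or dependency structures, not bags of tokens.

```python
# Sketch: kernels as similarity functions, and their closure under
# linear combination and product.
def token_kernel(x, y):
    """Toy kernel: number of tokens shared by two sequences."""
    return len(set(x) & set(y))

def combine(k1, k2, a=1.0, b=1.0):
    """A linear combination of kernels (a, b >= 0) is again a kernel."""
    return lambda x, y: a * k1(x, y) + b * k2(x, y)

def product(k1, k2):
    """A product of kernels is again a kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)

k = combine(token_kernel, product(token_kernel, token_kernel))
s = k(["president", "of", "US"], ["president", "of", "France"])
```

Composite kernels for relation extraction are built exactly this way: sum or multiply kernels defined on different views (arguments, dependency paths) of the same mention pair.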
46
, where 1) Argument 2) Local dependency, where Kernel Examples for Relation Extraction K T is a token kernel defined as: (Zhao and Grishman, 2005) 3) Path, where Composite Kernels:
47
Bootstrapping for Relation Extraction. Loop: initial seed tuples -> occurrences of seed tuples -> generate extraction patterns -> generate new seed tuples -> augment table. Occurrences of seed tuples:
"Computer servers at Microsoft's headquarters in Redmond…"
"In mid-afternoon trading, shares of Redmond-based Microsoft fell…"
"The Armonk-based IBM introduced a new line…"
"The combined company will operate from Boeing's headquarters in Seattle."
"Intel, Santa Clara, cut prices of its Pentium processor."
48
Bootstrapping for Relation Extraction (cont'd). Learned patterns, generalized from the occurrences above: "X's headquarters in Y", "Y-based X", "X, Y,".
49
Bootstrapping for Relation Extraction (cont'd). Generate new seed tuples by matching the learned patterns against the corpus; augment the table and start a new iteration.
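One iteration of this loop can be sketched with string patterns. This is a minimal illustration in the spirit of DIPRE/Snowball-style bootstrapping, not a specific system: the helper names and the "capitalized word" heuristic for argument boundaries are assumptions of the sketch.

```python
import re

seeds = {("Microsoft", "Redmond")}
corpus = [
    "Computer servers at Microsoft's headquarters in Redmond crashed.",
    "The combined company will operate from Boeing's headquarters in Seattle.",
]

def middle_pattern(sent, arg1, arg2):
    """Generalize a seed occurrence to the text between its two arguments."""
    i, j = sent.find(arg1), sent.find(arg2)
    if i < 0 or j < 0 or i >= j:
        return None
    return sent[i + len(arg1):j]

def match_new_tuples(pattern, corpus):
    """Propose new (X, Y) tuples where capitalized words flank the pattern."""
    rx = re.compile(r"([A-Z]\w+)" + re.escape(pattern) + r"([A-Z]\w+)")
    return {m.groups() for sent in corpus for m in rx.finditer(sent)}

pat = middle_pattern(corpus[0], "Microsoft", "Redmond")  # "'s headquarters in "
new = match_new_tuples(pat, corpus)
```

The pattern learned from the (Microsoft, Redmond) seed fires on the Boeing sentence and proposes (Boeing, Seattle), which would be added to the table for the next iteration.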
50
Outline: Task Definition; Symbolic Semantics (Basic Features, World Knowledge, Learning Models); Distributional Semantics
51
Word Similarity & Relatedness How similar is pizza to pasta? How related is pizza to Italy? Representing words as vectors allows easy computation of similarity 51
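The "easy computation" is typically cosine similarity between embeddings. A minimal sketch, with toy 3-dimensional vectors invented for illustration (real embeddings have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Invented vectors: pizza and pasta point in similar "food" directions,
# italy points elsewhere.
pizza, pasta, italy = [1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.2, 0.3, 1.0]
sim_food = cosine(pizza, pasta)
sim_geo = cosine(pizza, italy)
```

With these toy vectors, pizza is more similar to pasta than to Italy, mirroring the similarity/relatedness distinction above.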
52
Approaches for Representing Words. Distributional semantics (count): used since the 90's; sparse word-context PMI/PPMI matrix; decomposed with SVD. Word embeddings (predict): inspired by deep learning; word2vec (Mikolov et al., 2013); GloVe (Pennington et al., 2014). Underlying theory: the Distributional Hypothesis (Harris, '54; Firth, '57): "Similar words occur in similar contexts."
53
Approaches for Representing Words Both approaches: Rely on the same linguistic theory Use the same data Are mathematically related “Neural Word Embedding as Implicit Matrix Factorization” (NIPS 2014) How come word embeddings are so much better? “Don’t Count, Predict!” (Baroni et al., ACL 2014) More than meets the eye… 53
54
What's really improving performance? The contributions of word embeddings. Novel algorithms (objective + training method): skip-grams + negative sampling; CBOW + hierarchical softmax; noise contrastive estimation; GloVe. New hyperparameters (preprocessing, smoothing, etc.): subsampling; dynamic context windows; context distribution smoothing; adding context vectors.
59
Our Contributions 1)Identifying the existence of new hyperparameters Not always mentioned in papers 2)Adapting the hyperparameters across algorithms Must understand the mathematical relation between algorithms 3)Comparing algorithms across all hyperparameter settings Over 5,000 experiments 59
60
Background 60
62
What is word2vec ? How is it related to PMI? 62
64
What is word2vec? word2vec is not a single algorithm. It is a software package for representing words as vectors, containing: two distinct models (CBoW and Skip-Gram (SG)); various training methods (Negative Sampling (NS) and Hierarchical Softmax); and a rich preprocessing pipeline (dynamic context windows, subsampling, deleting rare words).
65
Demo http://rare-technologies.com/word2vec-tutorial/#app 65
66
Skip-Grams with Negative Sampling (SGNS) Marco saw a furry little wampimuk hiding in the tree. “word2vec Explained…” Goldberg & Levy, arXiv 2014 66
68
Skip-Grams with Negative Sampling (SGNS). Marco saw a furry little wampimuk hiding in the tree. ("word2vec Explained…", Goldberg & Levy, arXiv 2014)
words | contexts
wampimuk | furry
wampimuk | little
wampimuk | hiding
wampimuk | in
…
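The (word, context) training pairs in the table come from sliding a window over the sentence. A minimal sketch with a fixed window of 2 (the helper name `pairs` is invented; word2vec itself uses a dynamic window, discussed later):

```python
# Extract (word, context) pairs with a symmetric window around each token.
def pairs(tokens, window=2):
    out = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                out.append((w, tokens[j]))
    return out

sent = "Marco saw a furry little wampimuk hiding in the tree".split()
wamp = [c for w, c in pairs(sent) if w == "wampimuk"]
```

For "wampimuk" this reproduces exactly the contexts listed in the table above.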
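The SGNS training objective sketched in these slides (the formulas were images in the original deck) is, per Goldberg & Levy's "word2vec Explained" note, the per-pair local objective:

```latex
\ell(w, c) \;=\; \log \sigma(\vec{w} \cdot \vec{c})
\;+\; k \cdot \mathbb{E}_{c_N \sim P_D}\!\left[\log \sigma(-\vec{w} \cdot \vec{c}_N)\right],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

Each observed (word, context) pair is pushed toward high $\vec{w} \cdot \vec{c}$, while $k$ "negative" contexts sampled from the empirical context distribution $P_D$ are pushed toward low dot products.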
74
What is SGNS learning? 74
81
What is SGNS learning? SGNS is doing something very similar to the older approaches SGNS is factorizing the traditional word-context PMI matrix So does SVD! GloVe factorizes a similar word-context matrix 81
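The matrix SGNS implicitly factorizes can be computed explicitly from co-occurrence counts: per Levy & Goldberg (2014), the optimal value of w . c is PMI(w, c) - log k, where k is the number of negative samples. A sketch with toy counts (the counts and helper name are invented for illustration):

```python
import math
from collections import Counter

# Toy (word, context) co-occurrence counts.
pair_counts = Counter({("pizza", "eat"): 4, ("pizza", "italy"): 2,
                       ("pasta", "eat"): 3, ("pasta", "italy"): 1})

def shifted_pmi(pair_counts, k=5):
    """M[w, c] = PMI(w, c) - log k, the matrix SGNS implicitly factorizes."""
    total = sum(pair_counts.values())
    wc, cc = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        wc[w] += n
        cc[c] += n
    return {(w, c): math.log(n * total / (wc[w] * cc[c])) - math.log(k)
            for (w, c), n in pair_counts.items()}

M = shifted_pmi(pair_counts, k=1)   # k = 1 recovers plain PMI
```

SVD applied to (a positive-thresholded version of) this same matrix is the classic count-based pipeline, which is why the two families are so closely related.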
82
But embeddings are still better, right? Plenty of evidence that embeddings outperform traditional methods “Don’t Count, Predict!” (Baroni et al., ACL 2014) GloVe (Pennington et al., EMNLP 2014) How does this fit with our story? 82
84
The Big Impact of “Small” Hyperparameters word2vec & GloVe are more than just algorithms… Introduce new hyperparameters May seem minor, but make a big difference in practice 84
85
Identifying New Hyperparameters 85
86
New Hyperparameters. Preprocessing (word2vec): dynamic context windows; subsampling; deleting rare words. Postprocessing (GloVe): adding context vectors. Association metric (SGNS): shifted PMI; context distribution smoothing.
90
Dynamic Context Windows Marco saw a furry little wampimuk hiding in the tree. 90
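The weighting schemes behind dynamic context windows can be written out. Per Levy, Goldberg & Dagan's analysis, word2vec samples the actual window size uniformly from 1..L, so a context at distance d is included with probability (L - d + 1)/L, while GloVe down-weights by 1/d; the function names below are invented for this sketch.

```python
# Expected weight of a context token at distance d from the target word.
def word2vec_weight(d, L):
    """Dynamic window: uniform sample of window size 1..L includes
    distance d with probability (L - d + 1) / L."""
    return (L - d + 1) / L

def glove_weight(d):
    """GloVe's harmonic weighting of distant contexts."""
    return 1 / d

w2v = [word2vec_weight(d, 4) for d in range(1, 5)]
gv = [glove_weight(d) for d in range(1, 5)]
```

Both schemes give nearby contexts more influence; they just decay at different rates, which is one of the "small" hyperparameters that turns out to matter.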
93
Adding Context Vectors 93
95
Adapting Hyperparameters across Algorithms 95
96
Context Distribution Smoothing 96
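Context distribution smoothing (CDS), borrowed from word2vec's negative-sampling distribution, raises context counts to the 0.75 power before computing the context probability used in PMI. A sketch with invented counts and helper name:

```python
# CDS: smooth the context distribution with alpha = 0.75 before PMI.
def smoothed_context_prob(context_counts, c, alpha=0.75):
    num = context_counts[c] ** alpha
    den = sum(n ** alpha for n in context_counts.values())
    return num / den

counts = {"the": 1000, "wampimuk": 1}
plain = counts["wampimuk"] / sum(counts.values())
smooth = smoothed_context_prob(counts, "wampimuk")
```

Smoothing inflates the probability of rare contexts relative to raw counts, which shrinks their PMI values and damps PMI's well-known bias toward rare events.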
99
Comparing Algorithms 99
101
Controlled Experiments Prior art was unaware of these hyperparameters Essentially, comparing “apples to oranges” We allow every algorithm to use every hyperparameter* * If transferable 101
102
Systematic Experiments. 9 hyperparameters (6 new); 4 word representation algorithms: PPMI (sparse & explicit), SVD(PPMI), SGNS, GloVe; 8 benchmarks: 6 word similarity tasks, 2 analogy tasks; 5,632 experiments.
105
Hyperparameter Settings. Classic vanilla setting (commonly used for distributional baselines): vanilla preprocessing, postprocessing, and association metric (plain PMI/PPMI). Recommended word2vec setting (tuned for SGNS): preprocessing with dynamic context windows and subsampling; association metric: shifted PMI/PPMI with context distribution smoothing.
106
Experiments 106
107
Experiments: Prior Art; Experiments: "Apples to Apples"; Experiments: "Oranges to Oranges". [results shown as charts in the original slides]
108
Experiments: Hyperparameter Tuning. [results for different settings shown as charts in the original slides]
109
Overall Results Hyperparameters often have stronger effects than algorithms Hyperparameters often have stronger effects than more data Prior superiority claims were not accurate 109
110
Re-evaluating Prior Claims 110
112
Don't Count, Predict! (Baroni et al., 2014) claimed that word2vec is better than count-based methods. Hyperparameter settings account for most of the reported gaps; embeddings do not really outperform count-based methods* (*except for one task…).
113
GloVe (Pennington et al., 2014) claimed that GloVe is better than word2vec. Hyperparameter settings account for most of the reported gaps: adding context vectors was applied only to GloVe, and the preprocessing differed. We observed the opposite: SGNS outperformed GloVe on every task. Our largest corpus: 10 billion tokens; perhaps larger corpora behave differently?
115
Linguistic Regularities in Sparse and Explicit Word Representations (Levy and Goldberg, 2014) claimed that PPMI vectors perform on par with SGNS on analogy tasks. This holds for semantic analogies but not for syntactic analogies (MSR dataset). Hyperparameter settings (a different context type for the PPMI vectors) account for most of the reported gaps; on syntactic analogies there is a real gap in favor of SGNS.
116
Conclusions 116
118
Conclusions: Distributional Similarity. The contributions of word embeddings: novel algorithms and new hyperparameters. What's really improving performance? Hyperparameters (mostly); still, the algorithms are an improvement, and SGNS is robust & efficient.
119
Conclusions: Methodology. Look for hyperparameters; adapt hyperparameters across different algorithms. For good results: tune hyperparameters. For good science: tune the baselines' hyperparameters too. Thank you :)